A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z | AA | AB | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | d | extension | language | count | low_alphanum_count | long_lines_count | non_lexable_count | XML_detected | Data_detected | Name | Overall quality | Alphanum filter | Long line filter | Lexer filter | Other comments | Include | Alphanum_threshold | Long_line_threshold | XML filter | Alpha filter | Near-dedup settings | ||||||||
2 | 1 | ads | ada | 1000 | 1 | 2 | 20 | 0 | 12 | Harm | LGTM | False positive | TBD; breaks space | 1 | 0.25 | 1000 | 1 | 0.25 | |||||||||||
3 | 2 | ada | ada | 1000 | 0 | 4 | 31 | 0 | 2 | Harm | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
4 | 0 | adb | ada | 1000 | 0 | 1 | 85 | 75 | 4 | Harm | LGTM | LGTM | False positive | Mostly xml; few false positives | 1 | 0.25 | 1000 | 1 | |||||||||||
5 | 3 | agda | agda | 1000 | 0 | 3 | 49 | 0 | 1 | Harm | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
6 | 4 | als | alloy | 1000 | 0 | 1 | 39 | 0 | 2 | Harm | LGTM | LGTM | False positive | 1 | 0.25 | 1000 | 1 | ||||||||||||
7 | 5 | g4 | antlr | 1000 | 3 | 0 | 443 | 0 | 8 | Harm | LGTM | https://docs.google.com/spreadsheets/d/1Lk-pTk_rXI__fCgixr7ZWSi8wR09Zzd2j_G90J80r00/edit?usp=sharing; lower to 0.2 | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
8 | 144 | markdown | markdown | 244 | 0 | 32 | 2 | 0 | 0 | Urvashi | some false positives, should be filtered | 0 files remaining | 23 files. All look good | 1 | 0.25 | remove | Add language filter? | 1 | |||||||||||
9 | 7 | applescript | applescript | 1000 | 0 | 38 | 113 | 0 | 9 | Harm | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
10 | 145 | mkd | markdown | 5 | 0 | 0 | 0 | 0 | 0 | Urvashi | looks good | 0 files remaining | 0 files remain | 1 | 0.25 | remove | 1 | ||||||||||||
11 | 147 | mkdn | markdown | 1 | 0 | 0 | 0 | 0 | 0 | Urvashi | only 1 file, looks good | 0 files remaining | 0 files remain | 1 | 0.25 | remove | 1 | ||||||||||||
12 | 146 | ron | markdown | 2 | 0 | 0 | 0 | 0 | 0 | Urvashi | only 2 examples, can be excluded | 0 files remaining | 0 files remain | 1 | 0.25 | remove | 1 | ||||||||||||
13 | 6 | scpt | applescript | 1000 | 3 | 22 | 57 | 0 | 3 | Harm | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
14 | 8 | asm | assembly | 1000 | 2 | 115 | 0 | 0 | 318 | Evgenii | LGTM | Some false positives | Many similar false positives with a long comment line containing only numbers | 1 | 0.25 | Custom! Can we remove comments for files with long lines? | 1 | remove | |||||||||||
15 | 12 | awk | awk | 1000 | 5 | 2 | 255 | 2 | 27 | Harm | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
16 | 18 | bat | batchfile | 1000 | 1 | 37 | 13 | 0 | 2 | Evgenii | LGTM | Some are false positives | 1 | 0.25 | 1000 | 1 | |||||||||||||
17 | 270 | bbx | tex | 15 | 0 | 0 | 0 | 0 | 1 | Evgenii | LGTM | 1 | 0.25 | remove | 1 | ||||||||||||||
18 | 17 | cmd | batchfile | 1000 | 0 | 163 | 14 | 0 | 3 | Evgenii | LGTM | Some are false positives | 1 | 0.25 | 1000 | 1 | |||||||||||||
19 | 266 | ins | tex | 44 | 1 | 2 | 0 | 0 | 10 | Evgenii | Quite a few misclassifications (data, assembly), can be filtered | 1 | 0.25 | remove | 1 | ||||||||||||||
20 | 271 | lbx | tex | 2 | h | 0 | 0 | 0 | 0 | Evgenii | LGTM | 1 | 0.25 | remove | 1 | ||||||||||||||
21 | 269 | mkii | tex | 12 | 0 | 0 | 0 | 0 | 0 | Evgenii | LGTM | 1 | 0.25 | remove | 1 | ||||||||||||||
22 | 273 | mkiv | tex | 27 | 0 | 0 | 0 | 0 | 0 | Evgenii | LGTM | 1 | 0.25 | remove | 1 | ||||||||||||||
23 | 274 | mkvi | tex | 3 | 0 | 0 | 0 | 0 | 0 | Evgenii | LGTM | 1 | 0.25 | remove | 1 | ||||||||||||||
24 | 272 | cbx | tex | 6 | 0 | 0 | 0 | 1 | 0 | Evgenii | LGTM | 1 | 0.25 | remove | 1 | ||||||||||||||
25 | 20 | bsv | bluespec | 1000 | 3 | 7 | 0 | 0 | 6 | Harm | LGTM | LGTM | False positives, remove filter | 1 | 0.25 | remove | 1 | ||||||||||||
26 | 264 | sty | tex | 611 | 0 | 6 | 0 | 2 | 0 | Evgenii | LGTM, few misclassifications | A few false positives | 1 | 0.25 | remove | 1 | |||||||||||||
27 | 267 | dtx | tex | 174 | 0 | 1 | 2 | 3 | 0 | Evgenii | LGTM, few misclassifications | can be increased | 1 | 0.25 | remove | 1 | |||||||||||||
28 | 48 | cmake | bluespec | 1000 | 0 | 29 | 41 | 0 | 0 | Harm | LGTM | LGTM | False positives; Remove filter | 1 | 0.25 | remove | 1 | ||||||||||||
29 | 21 | c | c | 1000 | 5 | 4 | 11 | 0 | 15 | Harm | LGTM | False positive; but let's keep filter | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
30 | 22 | h | c | 1000 | 1 | 3 | 192 | 0 | 5 | Evgenii | LGTM | Non-lexable files have @property and similar in them, can be used to filter Objective-C headers | 1 | 0.25 | 1000 | 1 | |||||||||||||
31 | 38 | cs | c-sharp | 1000 | 0 | 2 | 31 | 0 | 1 | Qian | LGTM | LGTM | LGTM | False positive | 1 | 0.25 | 1000 | 1 | |||||||||||
32 | 27 | cc | c++ | 1000 | 0 | 1 | 3 | 0 | 0 | Zhihan Zhang | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
33 | 25 | cpp | c++ | 1000 | 2 | 3 | 3 | 0 | 5 | Zhihan Zhang | LGTM | 2 cases in total, they are false positives | 1 | 0.25 | 1000 | 1 | |||||||||||||
34 | 11 | ||||||||||||||||||||||||||||
35 | 98 | aug | augeas | 255 | 0 | 0 | 0 | 19 | 6 | Evgenii | LGTM, some minor misclassifications: xml or data | 1 | 0.25 | 1000 | 1 | 0.25 | |||||||||||||
36 | 28 | hpp | c++ | 1000 | 1 | 7 | 3 | 0 | 0 | Zhihan Zhang | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
37 | 152 | wl | mathematica | 686 | 1 | 13 | 503 | 0 | 16 | Evgenii | LGTM | Small amount of non-passing legitimate examples, maybe should be tuned | 1 | 0.25 | 1000 | 1 | 0.25 | ||||||||||||
38 | 42 | clj | clojure | 1000 | 5 | 10 | 9 | 0 | 11 | Harm | LGTM | LGTM | TBD; breaks space | 1 | 0.25 | 1000 | 1 | ||||||||||||
39 | 216 | sps | scheme | 446 | 2 | 9 | 87 | 64 | 25 | Evgenii | Many xmls, sql, and other non-scheme data | 1 | 0.25 | 1000 | 1 | 0.25 | |||||||||||||
40 | 44 | cljc | clojure | 1000 | 0 | 21 | 11 | 0 | 4 | Harm | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
41 | 233 | prc | sql | 23 | 0 | 2 | 9 | 0 | 2 | Evgenii | LGTM, a couple of very large and likely autogenerated files, maybe worth filtering | 1 | 0.25 | 1000 | 1 | 0.25 | |||||||||||||
42 | 43 | cljs | clojure | 1000 | 2 | 2 | 3 | 0 | 4 | Harm | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
43 | 49 | coffee | coffeescript | 1000 | 2 | 20 | 47 | 1 | 7 | Harm | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
44 | 50 | cson | coffeescript | 1000 | 3 | 13 | 0 | 1 | 4 | Harm | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
45 | 54 | lisp | common-lisp | 1000 | 2 | 21 | 32 | 0 | 7 | Harm | LGTM | LGTM | LGTM; removes auto-generated | 1 | 0.25 | 1000 | 1 | ||||||||||||
46 | 56 | lsp | common-lisp | 1000 | 0 | 0 | 28 | 0 | 3 | Harm | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
47 | 55 | asd | common-lisp | 1000 | 0 | 1 | 6 | 6 | 0 | Harm | LGTM | LGTM | LGTM; | 1 | 0.25 | 1000 | 1 | ||||||||||||
48 | 59 | css | css | 1000 | 0 | 154 | 273 | 0 | 0 | Nour Fahmy | 25% not lexable | LGTM | many examples are a single line of code; recommend removing the filter, since CSS code is often compressed | 1 | 0.25 | 1000 | 1 | ||||||||||||
49 | 61 | cu | cuda | 1000 | 0 | 4 | 2 | 0 | 1 | Evgenii | LGTM | A small amount of legit code is filtered | 1 | 0.25 | 1000 | 1 | |||||||||||||
50 | 60 | cuh | cuda | 1000 | 1 | 3 | 0 | 0 | 3 | Evgenii | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
51 | 62 | dart | dart | 1000 | 0 | 3 | 17 | 0 | 0 | Evgenii | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
52 | 63 | dockerfile | 1000 | 0 | 0 | 29 | 0 | 0 | Evgenii | LGTM | 1 | 0.25 | 1000 | 1 | |||||||||||||||
53 | 10 | a51 | assembly | 28 | 0 | 0 | 0 | 0 | 0 | Evgenii | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
54 | 9 | nasm | assembly | 159 | 0 | 0 | 0 | 0 | 0 | Evgenii | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
55 | 16 | auk | awk | 3 | 0 | 0 | 3 | 0 | 0 | Harm | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
56 | 13 | gawk | awk | 225 | 0 | 1 | 103 | 0 | 0 | Harm | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
57 | 14 | mawk | awk | 22 | 0 | 0 | 13 | 0 | 0 | Harm | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
58 | 15 | nawk | awk | 8 | 0 | 0 | 1 | 0 | 0 | Harm | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
59 | 68 | ex | elixir | 1000 | 0 | 7 | 379 | 0 | 1 | Raymond | LGTM | LGTM | False positives mostly | 1 | 0.25 | 1000 | 1 | ||||||||||||
60 | 69 | exs | elixir | 1000 | 0 | 2 | 57 | 0 | 1 | Raymond | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
61 | 70 | elm | elm | 1000 | 2 | 10 | 82 | 0 | 16 | Raymond | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
62 | 19 | bison | bison | 176 | 0 | 1 | 0 | 0 | 0 | Harm | Exclude (too many parse error file logs) | 0 | 0.25 | 1000 | 1 | ||||||||||||||
63 | 71 | el | emacs-lisp | 1000 | 1 | 31 | 109 | 0 | 3 | Marco Zocca | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | |||||||||||||
64 | 73 | erl | erlang | 1000 | 6 | 3 | 35 | 0 | 9 | Evgenii | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
65 | 40 | cake | c-sharp | 9 | 0 | 0 | 8 | 0 | 1 | Qian | LGTM | LGTM | LGTM | Cake extensions are not recognized well. False positive | 1 | 0.25 | 1000 | 1 | |||||||||||
66 | 74 | hrl | erlang | 1000 | 2 | 8 | 13 | 1 | 13 | Evgenii | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
67 | 39 | cshtml | c-sharp | 585 | 0 | 9 | 429 | 0 | 0 | Qian | LGTM | LGTM | LGTM | Mostly html pages with some C# lex. False positive | 1 | 0.25 | 1000 | 1 | |||||||||||
68 | 36 | c++ | c++ | 15 | 0 | 0 | 0 | 0 | 0 | Harm | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
69 | 78 | fs | f-sharp | 1000 | 3 | 13 | 39 | 0 | 8 | Claire Schlesinger | LGTM | LGTM | LGTM | LGTM but most are false positives | Quite a few files only set up types or other parameters; this might be unhelpful for writing functions, but useful for tasks like type inference. | 1 | 0.25 | 1000 | 1 | ||||||||||
70 | 31 | cp | c++ | 10 | 0 | 0 | 0 | 0 | 0 | Zhihan Zhang | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
71 | 79 | fsx | f-sharp | 1000 | 1 | 32 | 31 | 0 | 4 | Claire Schlesinger | LGTM | LGTM | LGTM | LGTM but most are false positives | 1 | 0.25 | 1000 | 1 | |||||||||||
72 | 30 | cxx | c++ | 389 | 0 | 2 | 0 | 0 | 0 | Zhihan Zhang | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
73 | 37 | h++ | c++ | 1 | 0 | 0 | 0 | 0 | 1 | Harm | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
74 | 32 | hh | c++ | 197 | 0 | 0 | 5 | 0 | 2 | Zhihan Zhang | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
75 | 81 | f | fortran | 1000 | 6 | 38 | 559 | 0 | 25 | Manan Dey | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
76 | 34 | hxx | c++ | 140 | 0 | 1 | 0 | 0 | 0 | Harm | LGTM | False positive but let's keep as is | 1 | 0.25 | 1000 | 1 | |||||||||||||
77 | 29 | inl | c++ | 91 | 0 | 0 | 0 | 0 | 0 | Zhihan Zhang | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
78 | 26 | ipp | c++ | 20 | 0 | 2 | 0 | 0 | 0 | Zhihan Zhang | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
79 | 33 | tcc | c++ | 19 | 0 | 0 | 0 | 0 | 1 | Ejiro | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
80 | 35 | tpp | c++ | 3 | 0 | 0 | 0 | 0 | 0 | Harm | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
81 | 45 | boot | clojure | 121 | 0 | 1 | 23 | 0 | 0 | Harm | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
82 | 82 | f90 | fortran | 1000 | 9 | 1 | 14 | 0 | 16 | Manan Dey | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
83 | 89 | glsl | glsl | 1000 | 2 | 22 | 119 | 0 | 11 | Manan Dey | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
84 | 92 | shader | glsl | 1000 | 0 | 45 | 908 | 0 | 0 | Evgenii | Requires filtering | Many are autogenerated but long lines are only comments | I've looked at a few; they are not only glsl, e.g., Unity shaders | 1 | 0.25 | 1000 | 1 | ||||||||||||
85 | 47 | cljx | clojure | 8 | 0 | 0 | 0 | 0 | 0 | Harm | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
86 | 53 | _coffee | coffeescript | 4 | 0 | 0 | 0 | 0 | 0 | Harm | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
87 | 51 | cjsx | coffeescript | 185 | 0 | 1 | 8 | 0 | 0 | Harm | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
88 | 52 | iced | coffeescript | 92 | 0 | 0 | 9 | 0 | 0 | Harm | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
89 | 91 | vert | glsl | 1000 | 0 | 1 | 44 | 0 | 2 | Manan Dey | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
90 | 90 | frag | glsl | 1000 | 1 | 2 | 120 | 21 | 4 | Manan Dey | LGTM | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||
91 | 103 | go | go | 1000 | 0 | 13 | 0 | 1 | 1 | Evgenii | LGTM, small amount of autogen | False positives: long comments or string constants | 1 | 0.25 | 1000 | 1 | |||||||||||||
92 | 104 | groovy | groovy | 1000 | 0 | 3 | 16 | 0 | 2 | Zhihan Zhang | LGTM | LGTM | false positives (long strings) | 1 | 0.25 | 1000 | 1 | ||||||||||||
93 | 58 | ny | common-lisp | 52 | 0 | 5 | 6 | 0 | 0 | Harm | LGTM | LGTM | 1 | 0.25 | 1000 | 1 | |||||||||||||
94 | 108 | hs | haskell | 1000 | 1 | 3 | 39 | 6 | 3 | Zhihan Zhang | LGTM | false positive (only 1 case in total, the script used too many indents) | false positives (only 2 cases in total) | 1 | 0.25 | 1000 | 1 | ||||||||||||
95 | 110 | html | html | 1000 | 0 | 240 | 131 | 14 | 3 | Zhihan Zhang | LGTM | LGTM | many false positives: HTML does not require line breaks and many people compress their HTML code, so I suggest removing the length limit for HTML | 1 | 0.25 | 1000 | 1 | ||||||||||||
96 | 114 | idr | idris | 1000 | 1 | 2 | 195 | 0 | 1 | Evgenii | LGTM | 1 | 0.25 | 1000 | 1 | ||||||||||||||
97 | 116 | thy | isabelle | 1000 | 0 | 20 | 399 | 2 | 2 | Evgenii | LGTM | Some false positives, some autogenerated | 1 | 0.25 | 1000 | 1 | |||||||||||||
98 | 117 | java | java | 1000 | 0 | 3 | 10 | 0 | 0 | Nour Fahmy | LGTM | LGTM | data quality improves when max_length <= 100 | 1 | 0.25 | 1000 | 1 | ||||||||||||
99 | 64 | 1 | dockerfile | 1 | 0 | 0 | 0 | 0 | 0 | Evgenii | LGTM, only one file | 1 | 0.25 | 1000 | 1 | ||||||||||||||
100 | 66 | 3 | dockerfile | 1 | 0 | 0 | 0 | 0 | 0 | Evgenii | LGTM, only one file | 1 | 0.25 | 1000 | 1 |
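
The Alphanum_threshold (0.25) and Long_line_threshold (1000) columns above could be applied per file roughly as in the sketch below. This is a minimal illustration, not the pipeline's actual code. It assumes that the alphanum filter drops files whose fraction of alphanumeric characters falls below the threshold, that the long line filter drops files whose longest line exceeds the threshold, and that a "remove" entry means the filter is disabled (modelled here as `None`). The function names are hypothetical.

```python
from typing import Optional


def alphanum_fraction(text: str) -> float:
    """Fraction of characters in `text` that are alphanumeric (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def max_line_length(text: str) -> int:
    """Length of the longest line in `text` (0 for empty text)."""
    return max((len(line) for line in text.splitlines()), default=0)


def keep_file(
    text: str,
    alphanum_threshold: Optional[float] = 0.25,  # assumed meaning of Alphanum_threshold
    long_line_threshold: Optional[int] = 1000,   # assumed meaning of Long_line_threshold
) -> bool:
    """Return True if the file passes both filters; None disables a filter."""
    if alphanum_threshold is not None and alphanum_fraction(text) < alphanum_threshold:
        return False
    if long_line_threshold is not None and max_line_length(text) > long_line_threshold:
        return False
    return True
```

For rows where the long line threshold is marked "remove" (for example the tex and markdown extensions), the call would be `keep_file(text, long_line_threshold=None)`, so only the alphanum check applies.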