1
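Category | Dataset Name | Subset Name | Is SFT | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | Used (B) | Total Tokenized (B) | Remained (B) | Notes | Statistics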
2
01_20241017_013512
02_20241017_013401
03_20241020_001556.json
04_20241021_170901.json
05_20241022_221453.json
06_20241024_013137
07_20241025_022032
08_20241026_151354
09_20241027_190948
10_20241028_225112
11_20241030_124814
12_20241101_002827
13_20241102_160534
14_20241104_000454
15_20241105_023029
16_20241106_180613
17_20241108_004951
18_20241113_034017
19_20241114_115241
20_20241115_234357
21_20241117_021115
22_20241118_155407
23_20241120_033942
24_20241121_133110
25_20241123_030124
26_20241211_015209
27_20241213_051741
Pre-Train | SFT
3
ChineseWebpagehttps://huggingface.co/datasets/opencsg/chinese-fineweb-educci20.3910.3910.3910.3910.3910.3910.3910.3910.3910.3910.3910.3910.3910.3910.3910.3910.3910.3910.4160.4160.4160.4160.4160.4160.4240.4240.3244.28245.481.1976 (Note: the newer chinese-fineweb-edu-v2.1 is recommended) 4.28240
4
IndustryCorpus1.8411.8411.8411.8411.8411.8411.8411.8411.8411.8411.8411.8411.8411.8411.8411.8411.8411.8411.8411.8411.8411.8411.8411.8411.8811.8811.68119.850824.885.029219.85080
5
Skypile0.9710.9710.9710.9710.9710.9710.9710.9710.9710.9710.9710.9710.9710.9710.9710.9710.9710.9710.9710.9710.9710.9710.9710.9710.8259.65169.770.11849.65160
6
tele2.7512.7512.7512.7512.7512.7512.7512.7512.7512.7512.7512.7512.7512.7512.8012.7512.7512.7512.8512.8512.8512.8512.8512.8512.92.92.529.989633.183.190429.98960
7
map1.0711.0711.0711.0711.0711.0711.0711.0711.0711.0711.0711.0711.0711.2211.3211.3211.3211.3212.0212.0212.0212.0212.0212.0212.072.071.7715.385631.3815.994415.38560
8
Encyclopediahttps://huggingface.co/datasets/TMZN/baidubaike(Filtered)0.8750.8750.8750.8750.8750.8750.8750.6750.6750.6750.6750.6750.6750.6750.6750.6750.6750.6750.055.442.72-2.72Epoch=25.440
9
https://zh.wikipedia.org/wiki/Wikipedia(Filtered)0.10.10.10.10.10.10.10.10.0250.330.330Related issue: https://github.com/RUC-GSAI/YuLan-Mini/issues/110.330
10
https://huggingface.co/datasets/LLM360/TxT360wiki-zh-10.20.20.20.10.10.0500.340.3400.340
11
wiki-zh-20.0750.10.20.20.250.150.390.3900.390
12
QAhttps://www.zhihu.com/(Filtered)0.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.1250.1253.13.103.10
13
Bookbestsellers(Filtered)0.1250.1250.1250.1250.1250.1250.1250.1250.1250.1250.1250.1250.1250.1250.1250.1250.1250.1250.1250.1250.1250.1250.1250.1250.0751.231.320.091.230
14
https://github.com/FudanNLPLAB/CBook-150K(Filtered)0.4250.4250.4250.4250.4250.4250.4250.4250.4250.4250.4250.4250.4250.4250.4250.4250.4250.4250.4250.450.450.450.450.450.51.142.255.68613.397.7045.6860
15
textbooks-book-cn/v10.050.050.050.050.050.050.050.050.050.050.050.050.050.050.050.050.050.050.050.0750.0750.0750.0750.0750.0750.20.10.680.770.090.680
16
exam-book-cn/v111.151.150.920.960.0400.92
17
Lawlegal_case-law-cn/v2(Filtered)0.90.90.90.90.90.90.90.90.90.90.90.90.90.90.90.90.90.90.90.90.90.90.90.90.9981.4472.4490
18
Governmenthttps://huggingface.co/datasets/liwu/MNBVCgov/20230172/XueXiQiangGuo_cleaned0.050.050.050.050.050.050.050.050.050.050.0250.210.2100.210
19
gov/20230172/GovReport0.010.004-0.0040.0040
20
Newsnews/20230196/news_peoples_daily_cleaned0.20.20.20.20.20.20.20.20.20.20.20.20.20.20.20.20.20.21.440.72-0.72Epoch=21.440
21
Knowledge about Renmin University of China
aa_mini/rucweb0.010.0040.00400.0040
22
23
CodeCodehttps://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K10.150.1750.130.13000.13
24
https://huggingface.co/datasets/vikp/textbook-quality-programming10.150.1250.110.11000.11
25
https://huggingface.co/datasets/yulan-team/YuLan-Mini-Text-Datasetscode-the_stack_v2_python_cleaned_scored_dedup-score_12.30.9651.3067.235.9241.3060
26
code-the_stack_v2_python_cleaned_scored_dedup-score_20.610.610.610.7350.760.911.011.111.512.012.112.212.212.213.4655.0155.0155.3255.545.62255.82756.32755.77755.77755.87750.87750.72531.91532.050.13531.9150
27
code-the_stack_v2_python_cleaned_scored_dedup-score_31.581.4551.3551.3551.431.581.6051.7551.7051.7051.7051.7051.7051.7050.559.1589.270.1129.1580
28
code-the_stack_v2_python_cleaned_scored_dedup-score_40.30.40.50.50.40.30.20.21.121.1201.120
29
code-the_stack_v2_python_cleaned_scored_dedup-score_50.0250.010-0.010.010
30
code-the-stack-v2-Shell0.30.30.30.30.30.2750.250.250.910.930.020.910
31
code-the-stack-v2-SQL0.80.80.80.80.80.80.80.80.80.80.80.80.80.80.80.80.80.80.60.60.60.60.60.60.60.540.547.87213.725.8487.8720
32
code-the-stack-v2-Java0.80.80.80.80.80.80.80.80.80.80.80.80.80.80.80.80.80.80.80.51756.2876.40.1136.2870
33
code-the-stack-v2-JavaScript0.50.50.50.50.50.50.50.50.50.50.50.5250.5250.5250.252.932.970.042.930
34
code-the-stack-v2-TypeScript0.50.50.50.50.50.50.50.50.50.50.50.4750.4750.4750.4750.4750.4750.093.3763.470.0943.3760
35
code-the-stack-v2-Go0.40.40.40.40.40.40.40.40.40.40.40.40.40.40.40.40.40.40.40.40.40.40.40.40.40.320.324.2565.971.7144.2560
36
code-the-stack-v2-Rust1.251.251.25110.750.750.50.50.50.33.623.6203.620
37
code-the-stack-v2-R0.750.750.750.750.750.750.750.750.750.750.750.750.750.750.750.750.750.8250.750.750.750.750.316.7546.850.0966.7540
38
code-the-stack-v2-HTML0.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.30.240.23.1765.362.1843.1760
39
code-the-stack-v2-Coq0.40.40.40.40.40.4250.9710.030.970
40
code-the-stack-v2-Lean0.050.050.050.050.080.0800.080
41
code-the-stack-v2-Jupyter_Notebook-md_scored_classified-score_10.150.150.150.150.151.321.411.411.561.561.562.062.3752.4752.483.4819.3769.480.1049.3760
42
code-the-stack-v2-Jupyter_Notebook-md_scored_classified-score_21.9851.5851.6851.7851.881.9752.0852.1852.662.662.212.212.262.262.260.0912.7112.71012.710
43
code-the-stack-v2-Jupyter_Notebook-md_scored_classified-score_322.32.10.121.721.51.31.10.70.25.2166.050.8345.2160
44
code-the-stack-v2-Jupyter_Notebook-md_scored_classified-score_40.20.30.30.30.20.20.10.10.0250.0250.70.730.030.70
45
code-the-stack-v2-Jupyter_Notebook-md_scored_classified-score_50.010.0040-0.0040.0040
46
code-the-stack-v2-Jupyter_Notebook-md2_scored_classifier-score_10.10.10.10.10.10.10.150.150.150.150.4250.6250.6250.6250.6250.6250.6250.1352.2042.290.0862.2040
47
code-the-stack-v2-Jupyter_Notebook-md2_scored_classifier-score_20.10.20.30.40.50.50.50.50.50.550.550.5750.5750.5750.5750.22.842.90.062.840
48
code-the-stack-v2-Jupyter_Notebook-md2_scored_classifier-score_30.20.30.3750.50.50.40.30.20.11.151.170.021.150
49
code-the-stack-v2-Jupyter_Notebook-md2_scored_classifier-score_40.0250.0750.0750.0750.0750.0750.160.1600.160
50
code-the-stack-v2-Jupyter_Notebook-md2_scored_classifier-score_50.010.0040-0.0040.0040
51
code-starcoderdata-c1.81.81.81.81.81.81.81.81.81.81.81.81.81.81.833333322220.437521.37521.50.12521.3750
52
code-starcoderdata-cpp3.23.23.23.23.23.23.23.23.23.23.23.23.23.21.718.618.640.0418.60
53
code-ioccc10.010.0040-0.00400.004
54
code-starcoder_smollm_dedup-starcoder_dedup2.32.32.32.32.32.32.32.32.32.32.32.32.32.32.32.32.32.32.32.32.30.7519.6219.62019.620
55
code-starcoder_smollm_dedup-smollm_dedup0.50.50.50.50.50.50.50.50.50.50.50.50.50.50.50.50.50.50.50.50.50.54.44.460.064.40
56
code-MNBVC-code-googlecode-filter_copyright-cc111110.152.062.070.012.060
57
code-MNBVC-code-googlecode-filter_copyright-cpp1.5110.50.2751.711.730.021.710
58
code-MNBVC-code-googlecode-filter_copyright-c0.50.72511.5252.51.41.41.02654.03064.050.01944.03060
59
code-MNBVC-code-googlecode-filter_copyright-h0.20.20.20.20.10.3643.640.360
60
code-MNBVC-code-googlecode-filter_copyright-java0.10.10.10.10.10.31750.31750.31750.31750.51750.40.06251.11.120.021.10
61
code-MNBVC-code-googlecode-filter_copyright-js0.2750.5250.5250.5250.5250.040.9660.970.0040.9660
62
code-MNBVC-code-googlecode-filter_copyright-lua0.00250.0010.00100.0010
63
code-MNBVC-code-googlecode-filter_copyright-php0.0250.010.0100.010
64
code-MNBVC-code-googlecode-filter_copyright-rs0.10.20.20.20.10.320.3200.320
65
code-MNBVC-code-googlecode-filter_copyright-sh0.10.040.0400.040
66
code-MNBVC-code-googlecode-filter_copyright-swift0.00250.0010.00100.0010
67
code-MNBVC-code-googlecode-filter_copyright-ts0.0750.0250.0250.050.04-0.010.050
68
code-MNBVC-code-googlecode-filter_copyright-go0.20.20.20.20.250.420.4200.420
69
code-MNBVC-code-googlecode-py-score-10.20.150.140.150.010.140
70
code-MNBVC-code-googlecode-py-score-20.6250.30.370.3700.370
71
code-MNBVC-code-googlecode-py-score-30.10.0750.070.0700.070
72
code-MNBVC-code-googlecode-py-score-40.0250.010.0100.010
73
code-MNBVC-code-matlab0.20.20.20.20.20.20.1250.530.560.030.530
74
code-wjp_scored-code-score_110.40.250.260.260 (Note: generated with llama3/mistral via the OSS method, using code-related samples filtered from the-stack-v2 data scored 4~5 as seed data) 00.26
75
code-wjp_scored-code-score_210.60.550.460.46000.46
76
code-wjp_scored-code-score_310.20.20.21.32511.171.17001.17
77
code-wjp_scored-code-score_410.0750.0750.430.430.630.430.430.430.230.0051.2661.320.05401.266
78
code-wjp_scored-leetcode-score_00.0250.010.0100.010
79
code-wjp_scored-leetcode-score_10.0250.150.150.130.1300.130
80
code-wjp_scored-leetcode-score_20.0750.150.250.050.210.2100.210
81
code-wjp_scored-leetcode-score_30.40.40.40.40.40.40.40.40.40.40.40.40.40.32.22.202.20
82
code-wjp_scored-leetcode-score_40.0650.30.30.30.30.30.30.30.30.30.30.30.30.20.061.571.5701.570
83
code-wjp_scored-leetcode-score_50.010.0040-0.0040.0040
84
https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage1(English only)10.50.50.6350.6351.550.411.6921.730.03801.692
85
https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2(Regenerate using Qwen2.5-7B-Instruct)10.3750.190.210.310.31000.31
86
https://huggingface.co/datasets/yulan-team/YuLan-Mini-Text-Datasetscode-code_gen_qwen-coder-score30.050.020.020 (Note: generated with qwen2.5-inst and qwen2.5-coder-inst via the OSS method, using code-related samples filtered from the-stack-v2 data scored 4~5 as seed data) 0.020
87
code-code_gen_qwen-coder-score40.150.060.150.090.060
88
code-code_gen_qwen-inst-score30.0750.030.0300.030
89
code-code_gen_qwen-inst-score40.2250.090.20.110.090
90
code-code_gen_qwen-filter-batch1-score310.860.250.4440.460.01600.444
91
code-code_gen_qwen-filter-batch1-score411.70.951.061.06001.06
92
code-code_gen_qwen-filter-batch1-score510.4250.20.250.25000.25
93
code-code_gen_qwen-filter-batch2-score310.40.10.20.210.0100.2
94
code-code_gen_qwen-filter-batch2-score411.11250.31250.570.57000.57
95
code-code_gen_qwen-filter-batch2-score510.20.1250.130.13000.13
96
code-code_gen_qwen-filter-batch3-score310.7250.290.810.5200.29
97
code-code_gen_qwen-filter-batch3-score414.151.661.66001.66
98
code-code_gen_qwen-filter-batch3-score510.50.20.2000.2
99
https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpussynthetic_qa-lang/javascript10.050.050.150.050.120.120 (Note: classified by language) 00.12
100
synthetic_qa-lang/cpp10.050.050.150.150.160.16000.16