Overview
This document contains two main databases in two sheets. The "Chinese Frontier AI Safety Papers" sheet collects technical papers with substantial involvement of Chinese research institutions that focus on frontier AI safety topics. The "Key Chinese AI Safety-relevant Research Groups" sheet uses the data from the first sheet to list the key groups pursuing frontier AI safety research in China.
Methodology for Chinese Frontier AI Safety Papers
1. We searched arXiv for papers using the following keyword queries: "AI safety" OR "artificial intelligence safety" OR "artificial intelligence alignment" OR "AI alignment" OR "reward hacking" OR "reward misspecification"; "interpretability" AND "AI"; "explainability" AND "AI"; "robustness" AND "AI"; "adversarial" AND "AI"; "biological" AND "AI"; "cyber" AND "AI". We did not search Chinese-language journals, as the standard practice is to upload papers in English to arXiv. (A sketch of how such queries can be run programmatically appears after the end of this list.)
2. We checked for substantial contributions from authors based at Chinese institutions, namely where at least one of the anchor authors (the last 2-3 authors listed) was based at a Chinese institution, or where 2 or more authors, or 33% or more of the authors, were based at mainland Chinese institutions. (A sketch of this inclusion rule appears after the end of this list.)
3. We assessed whether papers were on "frontier" AI safety, rather than AI safety more broadly; otherwise, the dataset would have been much too large and unwieldy. We defined "frontier" as referring to frontier AI systems, namely frontier large models (e.g., GPT-3.5 and Llama 2) or narrow AI systems with advanced capabilities (e.g., Sora, DALL-E 3, and AI used in biological design tools). Excluding non-frontier AI safety work was a conservative decision designed to ensure that the dataset is high-fidelity, and a number of papers were excluded that may still have implications for, or applicability to, frontier AI safety. We also excluded work that did not appear motivated by AI safety considerations, such as intention or capabilities alignment, or capabilities evaluations, as the objective of the dataset is to identify Chinese AI safety researchers.
4. We categorized papers into four directions of AI safety research, inspired by taxonomies in several papers by Dan Hendrycks and his Introduction to AI Safety, Ethics, and Society. The four categories are: alignment, robustness, monitoring, and systemic safety. We additionally categorized monitoring papers into subcategories of monitoring (interpretability); monitoring (evaluations); and monitoring (other). (A minimal sketch of this labeling scheme appears after the end of this list.)
4.a. Alignment: Alignment research seeks to ensure AI systems are controllable and to make them less hazardous by focusing on hazards such as power-seeking tendencies, dishonesty, or hazardous goals. This includes RLHF for question refusal, representation control and unlearning specific capabilities, value alignment, machine ethics, etc. It does not include RLHF for capabilities improvement or instruction following (unless there appear to be serious safety intentions in the work).
4.b. Robustness: “Robustness research enables withstanding hazards, including adversaries, unusual situations, and Black Swans.” This includes adversarial attacks, data poisoning, Trojan attacks, extraction of model weights and training data, etc.
4.c. Systemic Safety: Systemic safety research seeks to reduce system-level risks from AI, such as malicious applications of AI and poor epistemics. For example, it utilizes AI research to improve defenses against cyber-attacks, to enhance security against pandemics, and to improve the information environment.
4.d. Monitoring: “The internal operations of many AI systems are opaque. Monitoring research aims to reveal and prevent harmful behavior by making these systems more transparent.” This includes understanding models' internal representations, monitoring anomalies, and evaluating hazardous capabilities. Given that much of monitoring research is either interpretability or evaluations, which differ substantially as research techniques, we further subdivided papers in this category.
4.d.i. Monitoring (evaluations): “Evaluation research aims to detect potentially hazardous capabilities as they emerge or develop techniques to track and predict the progress of models' capabilities in certain relevant domains and skills.” This includes benchmarks to evaluate dangerous capabilities and propensities of the models to create harms, but does not include benchmarks for general model capability. It also includes anomaly detection.
4.d.ii. Monitoring (interpretability): Interpretability research seeks to make the black box behavior of AI models more transparent and explainable. This includes work on explainability, saliency maps, mechanistic interpretability, and representation engineering.
4.d.iii. Monitoring (other): Papers that were considered monitoring but not interpretability or evaluations, such as work on trojans or calibration, fell into this category.
5. To further validate the data, we asked several Concordia AI affiliates or collaborators to confirm whether papers should be considered "frontier AI safety" papers and which research direction they fall under. Affiliates are largely master's or PhD students in a technical AI discipline, and each paper was reviewed by one affiliate. Ultimately, there is some level of subjectivity in whether papers are considered "frontier" enough, as well as whether papers should be considered safety-oriented if they are motivated to a large extent by increasing AI capabilities. We took a conservative approach in our instructions, asking affiliates to exclude papers that appear largely oriented toward or motivated by improving AI capabilities, so that we can be confident that the work we are highlighting is motivated by improving AI safety.
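To illustrate step 1, the sketch below shows one way the keyword combinations could be run programmatically against the public arXiv API. This is a minimal sketch, not a record of our exact search procedure; the specific query strings, result cap, and returned fields are illustrative assumptions.

    # Minimal sketch of querying the public arXiv API for keyword combinations
    # like those in step 1. Query strings and max_results are illustrative
    # assumptions, not the exact parameters used to build the dataset.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"

    def search_arxiv(query: str, max_results: int = 50) -> list[dict]:
        """Return basic metadata for papers matching an arXiv API query."""
        params = urllib.parse.urlencode({
            "search_query": query,
            "start": 0,
            "max_results": max_results,
        })
        url = f"http://export.arxiv.org/api/query?{params}"
        with urllib.request.urlopen(url) as resp:
            feed = ET.fromstring(resp.read())
        papers = []
        for entry in feed.findall(f"{ATOM}entry"):
            papers.append({
                "title": entry.findtext(f"{ATOM}title", "").strip(),
                "id": entry.findtext(f"{ATOM}id", "").strip(),
                "authors": [a.findtext(f"{ATOM}name", "").strip()
                            for a in entry.findall(f"{ATOM}author")],
            })
        return papers

    # Example: one OR-combination and one AND-combination from step 1.
    for q in ['all:"AI safety" OR all:"AI alignment"',
              'all:"interpretability" AND all:"AI"']:
        results = search_arxiv(q, max_results=10)
        print(q, "->", len(results), "results")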
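The inclusion rule in step 2 can be expressed as a simple filter, sketched below. The is_chinese_institution helper is a hypothetical placeholder: in practice, author affiliations were reviewed manually rather than matched by string.

    # Minimal sketch of the step-2 inclusion rule over an ordered list of
    # author affiliations. The affiliation check is a placeholder for the
    # manual review we actually performed.

    def is_chinese_institution(affiliation: str) -> bool:
        # Hypothetical stand-in for a manual check of the author's institution.
        return "China" in affiliation

    def has_substantial_chinese_contribution(author_affiliations: list[str]) -> bool:
        """Apply the step-2 rule: anchor author in China, or 2+ / 33%+ of authors."""
        if not author_affiliations:
            return False
        # Rule A: an anchor author (one of the last 2-3 authors) is at a Chinese institution.
        n_anchor = 3 if len(author_affiliations) >= 3 else 2
        anchors = author_affiliations[-n_anchor:]
        if any(is_chinese_institution(a) for a in anchors):
            return True
        # Rule B: 2+ authors, or 33%+ of authors, are at mainland Chinese institutions.
        n_chinese = sum(is_chinese_institution(a) for a in author_affiliations)
        return n_chinese >= 2 or n_chinese / len(author_affiliations) >= 0.33

    # Example with made-up affiliations:
    print(has_substantial_chinese_contribution(
        ["MIT", "Tsinghua University, China", "Peking University, China"]))  # True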
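The categories and subcategories in step 4 can be represented as a small labeling schema, sketched below. The enum names are our own shorthand for the labels defined above, not identifiers used in the sheets themselves.

    # Minimal sketch of the step-4 labeling schema: four top-level research
    # directions, with monitoring split into three subcategories.
    from enum import Enum

    class SafetyDirection(Enum):
        ALIGNMENT = "alignment"
        ROBUSTNESS = "robustness"
        MONITORING = "monitoring"
        SYSTEMIC_SAFETY = "systemic safety"

    class MonitoringSubcategory(Enum):
        INTERPRETABILITY = "monitoring (interpretability)"
        EVALUATIONS = "monitoring (evaluations)"
        OTHER = "monitoring (other)"

    # Example label for a paper benchmarking hazardous model capabilities:
    label = (SafetyDirection.MONITORING, MonitoringSubcategory.EVALUATIONS)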
Methodology for Key Chinese AI Safety-relevant Research Groups
1. For each paper in the "Chinese Frontier AI Safety Papers" database, we collected the names of the last 2-3 authors listed (anchor authors), who likely guided the research.
2. We included as a "Key Chinese AI Safety-relevant Research Group" any group with at least one researcher who led (i.e., served as an anchor author on) at least 3 frontier AI safety papers. (A sketch of this aggregation appears after item 3.b below.)
3. We collected information on each such researcher to assess the quality of their previous research. This is one of the best available predictors of the quality of their future AI safety work, since the AI safety papers they have written are often so recent that there is no conference acceptance data yet.
3.a. We investigated whether the researchers have received best paper awards from top machine learning conferences, as self-reported on their websites. We included NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR, ICCV, and ECCV as "top machine learning conferences" based on an internal judgement of which conferences are considered most prestigious in the field.
3.b. We checked whether researchers were considered a World top 200,000 scientist / Subfield top 2% scientist based on a ranking by Stanford University researchers. Researchers could qualify under two separate metrics: one compares the researcher's career-long body of work through 2022 against all other researchers' career-long bodies of work, and the other compares scientists based solely on work published in 2022. We included both metrics.
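Steps 1 and 2 above amount to counting, for each anchor author, how many frontier AI safety papers they anchored, and keeping researchers who meet the threshold of 3. The sketch below illustrates that aggregation under stated assumptions; the field names, example records, and threshold constant are illustrative, and a real run would read from the papers database.

    # Minimal sketch of steps 1-2: collect anchor authors (last 2-3 authors)
    # from each paper and keep researchers who anchored at least 3 papers.
    # Field names and the example records are illustrative assumptions.
    from collections import Counter

    PAPER_THRESHOLD = 3  # minimum number of anchored frontier AI safety papers

    def anchor_authors(authors: list[str]) -> list[str]:
        """Return the last 2-3 listed authors, who likely guided the research."""
        n_anchor = 3 if len(authors) >= 3 else 2
        return authors[-n_anchor:]

    def key_researchers(papers: list[dict]) -> list[str]:
        """Return researchers who anchored at least PAPER_THRESHOLD papers."""
        counts = Counter()
        for paper in papers:
            counts.update(anchor_authors(paper["authors"]))
        return [name for name, n in counts.items() if n >= PAPER_THRESHOLD]

    # Example with made-up papers and authors:
    papers = [
        {"title": "Paper A", "authors": ["A. Li", "B. Wang", "C. Zhang"]},
        {"title": "Paper B", "authors": ["D. Chen", "B. Wang", "C. Zhang"]},
        {"title": "Paper C", "authors": ["E. Zhao", "F. Sun", "C. Zhang"]},
    ]
    print(key_researchers(papers))  # ['C. Zhang']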
While these are incomplete and lagging metrics for comparing researcher accomplishments, they are standard methods in the field. Ultimately, you are the final judge of the quality of these papers and of AI safety technical research in China. We view this as an iterative process and are open to updating the methodology based on feedback from our readers. We welcome you to read these papers for yourself and email us with any comments or feedback at info@concordia-ai.com.