Each entry below gives the Team Name, CodaBench username(s), Method Summary, and the answer to "What is interesting about your method?", followed by the team's yes/no flags for: use_small_PLM, use_LLM, use_GPT, use_GPT4, fine-tuning, zero-shot, few-shot (k=?), CoT, ReAct, Reflection, Data_augmentation, use external data (generate additional data using the same LLMs in the data set), and use of an instruction-tuned vs non-instruction-tuned LLM to generate extra data.

QUST (CodaBench: karry)
Method summary: First, we experimented with multiple models on a dataset that had undergone data augmentation. We then selected the two best-performing models for ensembling. Finally, we chose a fine-tuned RoBERTa model, combined with the Multiscale Positive-Unlabeled (MPU) training framework and a DeBERTa model, fused through a stacking ensemble.
What is interesting: Our approach is based on fine-tuned models, incorporating the MPU framework and ensemble techniques, trained on a multilingual dataset after data cleaning.
Flags: use_LLM, fine-tuning, Data_augmentation = Yes; all others = No.

NewbieML (CodaBench: baotran2003)
Method summary: Stacking ensemble of an SVM (RBF kernel, C = 1, k = 0.1), Logistic Regression, and XGBoost (max_depth = 8), with a KNN (k = 7) meta-model. Texts are embedded with Longformer-large.
What is interesting: 1) Faster inference than the baseline with acceptable accuracy on the test set; 2) a simple and lightweight approach.
Flags: use_small_PLM = Yes; all others = No.
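
A minimal sketch of this kind of stacking setup with scikit-learn and XGBoost follows; the random matrix stands in for precomputed Longformer-large embeddings, and any hyperparameters not listed above (e.g. the SVM gamma) are left at library defaults rather than guessed.

    # Sketch of the stacking ensemble described above (X_train stands in for
    # precomputed Longformer-large embeddings; labels: 0 = human, 1 = machine).
    import numpy as np
    from sklearn.ensemble import StackingClassifier
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from xgboost import XGBClassifier

    X_train = np.random.randn(64, 768)          # placeholder embedding matrix
    y_train = np.random.randint(0, 2, size=64)  # placeholder labels

    base_learners = [
        ("svm", SVC(kernel="rbf", C=1, probability=True)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("xgb", XGBClassifier(max_depth=8)),
    ]
    stack = StackingClassifier(
        estimators=base_learners,
        final_estimator=KNeighborsClassifier(n_neighbors=7),  # KNN meta-model, k = 7
    )
    stack.fit(X_train, y_train)
    preds = stack.predict(X_train[:4])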

NCL-UoR (CodaBench: windwind)
Method summary: Fine-tuning LLMs including XLM-RoBERTa, RoBERTa with Low-Rank Adaptation (LoRA), and DistilmBERT, then applying a majority-voting ensemble over XLM-RoBERTa and LoRA-RoBERTa.
What is interesting: Our method explores various LLMs and applies an ensemble over the fine-tuned models.
Flags: use_LLM, fine-tuning = Yes; the other flags through Reflection = No (the last three columns were left blank).

L3i++ (CodaBench: honghanhh)
Method summary: A comparative study of three groups of methods for the detection task: five metric-based models (Log-Likelihood, Rank, Log-Rank, Entropy, and MFDMetric); two fine-tuned sequence-labeling language models (RoBERTa, XLM-R); and a fine-tuned large-scale language model (LLaMA-2, 7B version). LLaMA-2 outperformed the rest and accurately detected machine-generated texts (>= 90% on multilingual Subtask A and >= 99% for ChatGPT/BLOOMZ on Subtask B). We provide an error analysis and discuss the contributing factors in our paper.
What is interesting: 1) Comparative study across several categories of approaches; 2) fine-tuned LLMs.
Flags: use_LLM, fine-tuning = Yes; all others = No.

Rkadiyala (CodaBench: rkadiyal)
Method summary: DeBERTa-CRF and Longformer-CRF, with modified logic for finding the text boundary.
Flags: fine-tuning = Yes; the other flags through Reflection = No (the last three columns were left blank).

HU (CodaBench: dipta007)
Method summary: I used a novel architecture inspired by a contrastive-learning approach. As the encoder, I used "sentence-transformers/all-mpnet-base-v2" from SBERT and fine-tuned it on my augmented dataset.
What is interesting: 1) Novel architecture; 2) comparable performance without ensembling.
Flags: use_LLM, fine-tuning, Data_augmentation = Yes; all others = No.
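
A hedged sketch of contrastive fine-tuning of the named SBERT encoder with the sentence-transformers library; the pair construction, loss choice, and training settings here are illustrative assumptions rather than the team's exact recipe.

    # Illustrative contrastive fine-tuning of the SBERT encoder named above.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

    # label = 1.0 for pairs from the same class (e.g. both machine-generated), 0.0 otherwise
    train_examples = [
        InputExample(texts=["a human-written paragraph", "another human-written paragraph"], label=1.0),
        InputExample(texts=["a human-written paragraph", "an LLM-generated paragraph"], label=0.0),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.ContrastiveLoss(model)

    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)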

petkaz (CodaBench: sachertort)
Method summary: We used embeddings from a fine-tuned RoBERTa-base combined with linguistic features to train a feed-forward neural network for binary classification. Specifically, our final configuration used diversity features. In addition, we resampled the training data by reducing the number of human-written texts.
What is interesting: 1) Linguistic features; 2) smart resampling of the training data.
Flags: use_small_PLM, fine-tuning = Yes; all others = No.
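
A small PyTorch sketch of this kind of fusion classifier; the RoBERTa [CLS] embedding and the diversity-feature vector are assumed to be computed elsewhere, and the feature dimension and hidden size are illustrative.

    # RoBERTa sentence embedding concatenated with hand-crafted (diversity) features,
    # fed to a small feed-forward network for binary classification.
    import torch
    import torch.nn as nn

    class FusionClassifier(nn.Module):
        def __init__(self, emb_dim=768, feat_dim=8, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(emb_dim + feat_dim, hidden),
                nn.ReLU(),
                nn.Dropout(0.2),
                nn.Linear(hidden, 2),          # human vs. machine
            )

        def forward(self, cls_embedding, ling_features):
            x = torch.cat([cls_embedding, ling_features], dim=-1)
            return self.net(x)

    model = FusionClassifier()
    logits = model(torch.randn(4, 768), torch.randn(4, 8))   # dummy batch of 4 texts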

RUG-D (CodaBench: tisocoda)
Method summary: 1) Fine-tuning different DeBERTa models. 2) We generated additional data and tested the performance of our model when trained with and without that data.
What is interesting: We compare the performance of LLMs with and without additionally generated training data.
Flags: use_LLM, fine-tuning, external data = Yes; all others = No.

TueCICL (CodaBench: teamdanielaron)
Method summary: A character-level LSTM (official submission); an LSTM with pretrained word2vec embeddings as input (unofficial); and linguistically motivated features with an MLP, which matches the baseline on the test set (unofficial).
What is interesting: Potential to match transformer results with smaller models.
Flags: all = No.

USTC-BUPT (CodaBench: comp5)
Method summary: We incorporate domain-adversarial neural networks into machine-generated text detection, simply by adding a gradient reversal layer on top of the baseline. In addition to the category labels, we also incorporate extra domain labels so that the model learns transferable features between the training and test datasets.
1) The overall framework first uses semantic information extraction layers (such as RoBERTa or BERT) to obtain text embeddings.
2) The text representation is then fed separately into a category classifier and a domain classifier; the domain classifier is an MLP preceded by a gradient reversal layer (see the sketch after this entry).
3) The final loss comprises both the classification loss and the domain loss.
The final result improves on the baseline by approximately 8%. For further details, please refer to our paper.
What is interesting: 1) The framework is novel: to the best of our knowledge, our model is the first to integrate domain-adversarial neural networks into machine-generated text detection. 2) The structure is straightforward, requiring only the addition of domain labels and a gradient reversal layer on top of the baseline. 3) Compared to the baseline, domain-adversarial training yields an improvement of roughly 8% in accuracy (we placed second in the official ranking with 96.09% accuracy), a significant gain in performance.
Flags: use_small_PLM, fine-tuning = Yes; all others = No.
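
A minimal PyTorch sketch of a gradient reversal layer of the kind described above; the surrounding encoder and the two classification heads are assumed to exist elsewhere.

    # Gradient reversal layer (GRL): identity on the forward pass,
    # negated (and scaled) gradient on the backward pass.
    import torch
    from torch.autograd import Function

    class GradReverse(Function):
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    def grad_reverse(x, lambd=1.0):
        return GradReverse.apply(x, lambd)

    # The text embedding goes directly to the category classifier, and through the GRL
    # to the domain classifier, so the encoder learns domain-invariant features, e.g.:
    #   domain_logits = domain_classifier(grad_reverse(text_embedding))
    reversed_emb = grad_reverse(torch.randn(4, 768, requires_grad=True))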

RUG-3 (CodaBench: frieso)
Method summary: We extracted the hidden layers from an LLM and combined them with a set of linguistic features as input to an SVM.
What is interesting: Instead of using either LLMs or features alone, we combined them.
Flags: use_LLM, use_GPT, fine-tuning = Yes; all others = No.

Genaios (CodaBench: safeai)
Method summary: Our model is a Transformer encoder that mixes token-level probabilistic features extracted from four Llama-2 models, both instructed and base flavors: Llama-2-7b, Llama-2-7b-chat, Llama-2-13b, and Llama-2-13b-chat. Specifically, for each token, our features are the log probability of the observed token, the log probability of the predicted token, and the entropy of the distribution. These features are concatenated at the token level, projected by a feed-forward layer, mixed with a Transformer encoder, mean-pooled, and projected with a softmax layer.
What is interesting: We verified that LLMs assign higher probabilities to machine-generated text than to human text. Our solution leverages this phenomenon through LLM probabilistic features and captures machine-generated text style with high precision. As a consequence, it obtained the best results in the official ranking (96.88% accuracy).
Flags: use_LLM = Yes; all others = No.
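
A sketch of the per-token probabilistic features described above, computed here with a single small causal LM (GPT-2) as a stand-in for the four Llama-2 variants; the downstream Transformer-encoder mixer is omitted.

    import torch
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tok("This is a sample paragraph to score.", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits                            # (1, seq_len, vocab)

    log_probs = F.log_softmax(logits[:, :-1], dim=-1)      # predictions for tokens 1..N-1
    targets = ids[:, 1:]                                   # tokens actually observed

    obs_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log p(observed token)
    pred_logp = log_probs.max(dim=-1).values                            # log p(most likely token)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)                # entropy of the distribution

    features = torch.stack([obs_logp, pred_logp, entropy], dim=-1)      # (1, seq_len-1, 3)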

TrustAI (CodaBench: ashokurlana, saibewaraditya, balamallikarjun)
Method summary: i) We used an ensemble combining Multinomial Naive Bayes, an LGBM classifier (LightGBM), and an SGD classifier over the concatenation of TF-IDF and spaCy embeddings. The model was trained only on the Subtask A monolingual dataset, without any extra data. ii) Our other model used the RoBERTa-base OpenAI Detector, the GPT-2 output detector obtained by fine-tuning a RoBERTa-base model on the outputs of the 1.5B-parameter GPT-2 model, which we fine-tuned on the Subtask A monolingual dataset.
What is interesting: Our approach explores various pretrained models and statistical models, and we select the best model via ensembling.
Flags: use_small_PLM, fine-tuning = Yes; all others = No.

SINAI (CodaBench: sinai)
Method summary: We compared different systems on the evaluation set, evaluating each one until the final system was chosen. The first system fine-tunes the XLM-RoBERTa-Large language model; the second uses GPT-2 to extract the perplexity of each text and builds a decision system based on that perplexity value (see the sketch after this entry); the last and final system merges the text with its perplexity value in a classification head. The same system was used for the monolingual and multilingual tasks.
What is interesting: 1) Comparison of various classification systems; 2) using perplexity as a feature for classification; 3) a fusion model that uses both text and perplexity to classify; 4) comprehensive error analysis.
Flags: use_LLM, fine-tuning = Yes; all others = No.
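
A sketch of the GPT-2 perplexity feature and a thresholded decision of the kind described above; the threshold value is purely illustrative.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def perplexity(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = lm(ids, labels=ids).loss      # mean token negative log-likelihood
        return torch.exp(loss).item()

    # LLM-generated text tends to have lower perplexity under another LM;
    # 25.0 is an illustrative threshold, not the tuned value.
    label = "machine" if perplexity("Some paragraph to check.") < 25.0 else "human"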

art-nat-HHU (CodaBench: artifnatural)
Method summary: We fine-tuned a RoBERTa model that was pretrained for AI detection and combined it with a set of features ranging from syntactic, lexical, and probabilistic to stylistic. To produce a classification, we trained two separate neural networks on the features, one for each class predicted by the RoBERTa classifier; the neural networks were used to make potential corrections to the RoBERTa output.
What is interesting: 1) A fusion model combining neural networks and RoBERTa; 2) an interesting range of stylistic features, from sentiment to word-level CEFR and readability.
Flags: use_small_PLM, fine-tuning = Yes; all others = No.

clulab-UofA (CodaBench: yekwon)
Method summary: 1) An unsupervised text-similarity system that computes cosine similarity between vector representations of texts. 2) A sentence transformer trained with triplet loss to learn the distinctions between the given texts. 3) A RoBERTa classifier that makes decisions based on the given paragraph. 4) A RoBERTa classifier that also takes into account sentence paraphrases generated by the candidate models.
Flags: use_small_PLM, use_LLM, use_GPT, fine-tuning = Yes; all others = No.

I2C-Huelva (CodaBench: albrp97)
What is interesting: The method's innovative integration of multimodal models with numerical text analysis presents a novel approach to enhancing the detection of machine-generated texts by capturing subtle linguistic and structural patterns.
Flags: use_LLM, fine-tuning = Yes; all others = No.

NootNoot (CodaBench: yash9439, sankalp.bahad)
Method summary: Fine-tuned RoBERTa-base.
Flags: use_small_PLM, fine-tuning = Yes; all others = No.

RUG-1 (CodaBench: alonscheuer, halecakir)
Method summary: A combined linear model over document-level features and token-level features, the latter processed by an LSTM.
What is interesting: Our method combines probability-based features with low-level and high-level linguistic features.
Flags: use_small_PLM = Yes; all others = No.

T5-Medical (CodaBench: msiino)
Method summary: Use of a pre-trained T5-large medical embedding, then cosine distance to calculate the degree of similarity.
Flags: use_LLM, zero-shot = Yes; all others = No.

Unibuc - NLP (CodaBench: tmarchitan, claudiu)
Method summary: A transformer-based model architecture with a set of fully connected layers as the classification head. We combined the Subtask A monolingual dataset with the Subtask B dataset.
What is interesting: Exploring different methods of layer selection and fine-tuning.
Flags: use_small_PLM, fine-tuning, Data_augmentation = Yes; all others = No.

Collectivized Semantics (CodaBench: advin4603)
Method summary: Fine-tuned RoBERTa-base using AdaLoRA, with a weighted sum of the mean of every layer's hidden states; the weights were trained along with the model, similar to ELMo. We mixed the train and validation data and split it uniformly across the different domains and generators to get a balanced train-validation split.
What is interesting: The method encourages use of information from different linguistic levels (syntactic, semantic, lexical), as different layers capture different levels of linguistic information.
Flags: use_small_PLM, fine-tuning = Yes; all others = No.
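
A sketch of the ELMo-style learned scalar mix over mean-pooled layer hidden states, using plain RoBERTa-base for brevity (the AdaLoRA adapters are omitted).

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class ScalarMixClassifier(nn.Module):
        def __init__(self, name="roberta-base"):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(name, output_hidden_states=True)
            n_layers = self.encoder.config.num_hidden_layers + 1       # embeddings + transformer layers
            self.layer_weights = nn.Parameter(torch.zeros(n_layers))   # learned mixing weights
            self.head = nn.Linear(self.encoder.config.hidden_size, 2)

        def forward(self, **inputs):
            hidden_states = self.encoder(**inputs).hidden_states                 # tuple of (B, T, H)
            pooled = torch.stack([h.mean(dim=1) for h in hidden_states], dim=0)  # (L, B, H)
            weights = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1)
            mixed = (weights * pooled).sum(dim=0)                                # weighted sum over layers
            return self.head(mixed)

    tok = AutoTokenizer.from_pretrained("roberta-base")
    model = ScalarMixClassifier()
    logits = model(**tok("An example paragraph.", return_tensors="pt"))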

RUG-5 (CodaBench: pdarwinkel, sijbrenvv)
Method summary: For Subtasks A and B, we augment distilbert-base-cased for the monolingual tasks and distilbert-base-multilingual-cased for the multilingual task with an additional classification layer that uses features. Alongside the augmented DistilBERT, we explore a Random Forest classifier over distilbert-base-cased embeddings (instead of simpler one-hot encodings or TF-IDF), concatenated with our 20 linguistic-stylistic features. As with the augmented DistilBERT, we use distilbert-base-multilingual-cased in the multilingual track of Subtask A.
What is interesting: We explore linguistic-stylistic features with Random Forest classifiers and DistilBERT instead of merely optimising (hyperparameter search over) a PLM; in other words, really trying something "new".
Flags: use_small_PLM, use_LLM, fine-tuning = Yes; all others = No.

Mashee (CodaBench: areeg94fahad)
Method summary: We used the chi-square test to select high-quality and low-quality samples, which were then employed in a few-shot learning setting with the Flan-T5 Large model.
Flags: use_LLM = Yes; few-shot (k = 2); all others = No.

SURBHI
Method summary: The stylometric classifier first extracts lexical stylometric features: the length of the text, the number of words, the average word length, the number of short words, the proportions of digits and capital letters, individual letter and digit frequencies, hapax legomena, a measure of text richness, and the frequency of 12 punctuation marks (see the sketch after this entry). A logistic regression (LR) model is trained on these features. The hybrid features we extract are the frequencies of the 100 most frequent character-level bi-grams and tri-grams; classification is again done with an LR. Finally, the output probabilities of a fine-tuned RoBERTa classifier, the stylometric classifier, and the hybrid classifier are concatenated and classified with an additional LR. (Subtask A monolingual and Subtask B datasets.)
What is interesting: I applied the technique of author attribution.
Flags: use_LLM, fine-tuning = Yes; all others = No.
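
A sketch of a few of the lexical stylometric features listed above feeding a logistic regression; the letter/digit frequency features are omitted, the richness measure is approximated by a type-token ratio, and the tiny toy corpus is only there to make the snippet runnable.

    import string
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def stylometric_features(text: str) -> np.ndarray:
        words = text.split()
        n_chars = max(len(text), 1)
        return np.array([
            len(text),                                            # length of text
            len(words),                                           # number of words
            np.mean([len(w) for w in words]) if words else 0.0,   # average word length
            sum(len(w) <= 3 for w in words),                      # number of short words
            sum(c.isdigit() for c in text) / n_chars,             # proportion of digits
            sum(c.isupper() for c in text) / n_chars,             # proportion of capital letters
            len(set(words)) / max(len(words), 1),                 # type-token ratio (richness proxy)
            sum(text.count(p) for p in string.punctuation),       # punctuation frequency
        ])

    texts = ["A short human-written example.", "An example generated by a language model."]
    labels = [0, 1]
    X = np.stack([stylometric_features(t) for t in texts])
    stylo_lr = LogisticRegression(max_iter=1000).fit(X, labels)
    stylo_probs = stylo_lr.predict_proba(X)    # later concatenated with the other branches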

DUTh (CodaBench: dorakir)
Method summary: We tried several machine learning algorithms and LLMs, and ended up with a fine-tuned mBERT trained for 5 epochs.
What is interesting: We compared many machine learning algorithms, ensembling methods, and LLMs.
Flags: use_LLM, fine-tuning = Yes; all others = No.

MasonTigers (CodaBench: sadiya)
Method summary: We experimented with different transformer-based models (RoBERTa, DistilBERT, ELECTRA) and then ensembled them. We also implemented zero-shot prompting and fine-tuned Flan-T5.
What is interesting: We did ensembling and also fine-tuned Flan-T5.
Flags: use_small_PLM, use_LLM, fine-tuning, zero-shot = Yes; all others = No.

Kathlalu (CodaBench: kathlalu)
Method summary: We used Zipf's law for our main approach and tried other approaches, e.g. word unigram counts, TF-IDF, and logistic regression.
Flags: all = No.
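
One plausible way to turn Zipf's law into a feature is the fitted slope of log-frequency against log-rank of the word distribution, which can then feed a simple classifier; this mapping is an assumption, not necessarily the team's exact formulation.

    import numpy as np
    from collections import Counter

    def zipf_slope(text: str) -> float:
        # Slope of log(frequency) vs. log(rank); Zipf's law predicts a slope near -1
        # for natural text, and deviations can serve as a detection feature.
        counts = sorted(Counter(text.lower().split()).values(), reverse=True)
        if len(counts) < 2:
            return 0.0
        ranks = np.arange(1, len(counts) + 1)
        slope, _ = np.polyfit(np.log(ranks), np.log(counts), deg=1)
        return float(slope)

    print(zipf_slope("the cat sat on the mat and the dog sat on the rug"))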

Mast Kalandar (CodaBench: suyash, jainitbafna)
Method summary: We used RoBERTa to encode the texts and then trained an LSTM on top of it, keeping RoBERTa frozen. We also trained a model on losses calculated using the LSTM.
Flags: use_LLM, fine-tuning = Yes; all others = No.

KInIT (CodaBench: dominikmacko, michalspiegel)
Method summary: (Subtask A multilingual) We used an ensemble with two-step majority voting over predictions, consisting of two LLMs (Falcon-7B and Mistral-7B) fine-tuned on the train set only and three zero-shot statistical methods (Entropy, Rank, Binoculars) that use Falcon-7B and Falcon-7B-Instruct to compute the metrics, with language identification and per-language threshold calibration.

Sharif-MGTD (CodaBench: ebrahimi)
Method summary: We adopt a binary classification approach to MGT detection, fine-tuning a RoBERTa-base Transformer. Fine-tuning adapts a pre-trained language model to a specific task by training it on task-specific data. We train the RoBERTa model on a labeled dataset comprising both human-written and machine-generated text samples, so that it learns to differentiate between the two categories. We evaluate our approach on Subtask A (monolingual, English), measuring performance as accuracy on the test set.
What is interesting: Our method creatively employs a fine-tuned RoBERTa-base Transformer for MGT detection, achieving competitive accuracy while highlighting the challenges of discerning machine-generated text, and offering valuable insights for further advances in NLP.
Flags: use_LLM, fine-tuning, zero-shot = Yes; all others = No.

Werkzeug (CodaBench: werkzeug)
Method summary: We use RoBERTa-large and XLM-RoBERTa-large to encode the text. As noted in previous research, PLM-encoded text embeddings can suffer from an anisotropy issue that makes them difficult to differentiate in the latent space, so we employ a learnable parametric whitening (PW) transformation to mitigate this problem (see the sketch after this entry). Moreover, to capture features of LLM-generated text from different perspectives, we use multiple PW layers as experts in a mixture-of-experts (MoE) architecture with a gating router.
What is interesting: Our model shows a larger improvement over the baseline in Subtask B (13.7%) than in Subtask A monolingual (2.03%) and Subtask A multilingual (10.3%). This may suggest that the MoE architecture enables the model to capture a broader range of language styles and features, making it better suited to detecting text generated by multiple models.
Flags: use_small_PLM, fine-tuning = Yes; all others = No.
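
A sketch of a learnable parametric whitening layer as a shift plus a learned linear projection of the pooled PLM embedding; the MoE gating router that combines several such experts is omitted, and the dimensions are illustrative.

    import torch
    import torch.nn as nn

    class ParametricWhitening(nn.Module):
        def __init__(self, dim=1024, out_dim=256):
            super().__init__()
            self.bias = nn.Parameter(torch.zeros(dim))        # learned mean shift
            self.proj = nn.Linear(dim, out_dim, bias=False)   # learned whitening-style projection

        def forward(self, x):
            return self.proj(x - self.bias)

    pw = ParametricWhitening()
    whitened = pw(torch.randn(4, 1024))   # e.g. a batch of RoBERTa-large pooled embeddings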

UMUTeam (CodaBench: ronghaopan)
Method summary: We participated in Subtask B, using an approach based on fine-tuning a pre-trained model (RoBERTa) combined with syntactic features of the texts.
What is interesting: The syntactic features capture writing style, including token-level features (e.g. word length, part of speech, function-word frequency, and stop-word ratio) and sentence-level features (e.g. sentence length).

Groningen F (CodaBench: bbjoverbeek)
Method summary: We use feature-based SVM and FFNN models. The features include the tense of the sentence, the voice of the sentence, the sentiment of the sentence, and the ratio of pronouns to proper nouns. Our hypothesis was that traditional models would generalize better than LLMs.
What is interesting: Our method uses computationally less expensive models than LLMs while achieving decent performance, with a previously unexplored combination of features.
Flags: all = No.

CUNLP (CodaBench: desa8310)
Method summary: Our approach employed a range of machine learning techniques, including logistic regression, transformer models, attention mechanisms, and unsupervised learning methods. Through rigorous experimentation, we identified key features influencing classification accuracy, namely text length, vocabulary richness, and coherence. Our highest classification accuracy was achieved by integrating transformer models with TF-IDF representations and feature engineering, although this approach demanded substantial computational resources due to the complexity of the transformer models and the incorporation of TF-IDF. Our investigation also covered a broad exploration of ML algorithms, extensive hyperparameter tuning and optimization, and detailed exploratory data analysis of the structural and lexical characteristics of the text data. Overall, our findings contribute to a deeper understanding of the challenges and strategies involved in distinguishing between AI-generated and human-written text, with implications for natural language processing and AI ethics.
What is interesting: Used TF-IDF in combination with a transformer model.
Flags: use_GPT, fine-tuning = Yes; all others = No.