Each entry below gives the Team Name, CodaBench username(s), Method Summary, and the answer to "What is interesting about your method?", followed by the team's yes/no flags for: use_small_PLM, use_LLM, use_GPT, use_GPT4, fine-tuning, zero-shot, few-shot (k=?), CoT, ReAct, Reflection, Data_augmentation, use external data (generate additional data using the same LLMs in the data set), and use of an instruction-tuned vs non-instruction-tuned LLM to generate extra data.

QUST (CodaBench: karry)
Method summary: First, we experimented with multiple models on a dataset that had undergone data augmentation. We then selected the two best-performing models for ensembling. Finally, we chose a fine-tuned RoBERTa model, combined with the Multiscale Positive-Unlabeled (MPU) training framework and a DeBERTa model, fused through a stacking ensemble.
What is interesting: Our approach is based on fine-tuned models, incorporating the MPU framework and ensemble techniques, trained on a multilingual dataset after data cleaning.
Flags: use_LLM, fine-tuning, Data_augmentation = Yes; all others = No.

NewbieML (CodaBench: baotran2003)
Method summary: Stacking ensemble of an SVM (RBF kernel, C = 1, k = 0.1), Logistic Regression, and XGBoost (max_depth = 8), with a KNN (k = 7) meta-model. Texts are embedded with Longformer-large.
What is interesting: 1) Faster inference than the baseline with acceptable accuracy on the test set; 2) a simple and lightweight approach.
Flags: use_small_PLM = Yes; all others = No.
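
A minimal sketch of this kind of stacking setup with scikit-learn and XGBoost follows; the random matrix stands in for precomputed Longformer-large embeddings, and any hyperparameters not listed above (e.g. the SVM gamma) are left at library defaults rather than guessed.

    # Sketch of the stacking ensemble described above (X_train stands in for
    # precomputed Longformer-large embeddings; labels: 0 = human, 1 = machine).
    import numpy as np
    from sklearn.ensemble import StackingClassifier
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from xgboost import XGBClassifier

    X_train = np.random.randn(64, 768)          # placeholder embedding matrix
    y_train = np.random.randint(0, 2, size=64)  # placeholder labels

    base_learners = [
        ("svm", SVC(kernel="rbf", C=1, probability=True)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("xgb", XGBClassifier(max_depth=8)),
    ]
    stack = StackingClassifier(
        estimators=base_learners,
        final_estimator=KNeighborsClassifier(n_neighbors=7),  # KNN meta-model, k = 7
    )
    stack.fit(X_train, y_train)
    preds = stack.predict(X_train[:4])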

NCL-UoR (CodaBench: windwind)
Method summary: Fine-tuning LLMs including XLM-RoBERTa, RoBERTa with Low-Rank Adaptation (LoRA), and DistilmBERT, then applying a majority-voting ensemble over XLM-RoBERTa and LoRA-RoBERTa.
What is interesting: Our method explores various LLMs and applies an ensemble over the fine-tuned models.
Flags: use_LLM, fine-tuning = Yes; the other flags through Reflection = No (the last three columns were left blank).

L3i++ (CodaBench: honghanhh)
Method summary: A comparative study of three groups of methods for the detection task: five metric-based models (Log-Likelihood, Rank, Log-Rank, Entropy, and MFDMetric); two fine-tuned sequence-labeling language models (RoBERTa, XLM-R); and a fine-tuned large-scale language model (LLaMA-2, 7B version). LLaMA-2 outperformed the rest and accurately detected machine-generated texts (>= 90% on multilingual Subtask A and >= 99% for ChatGPT/BLOOMZ on Subtask B). We provide an error analysis and discuss the contributing factors in our paper.
What is interesting: 1) Comparative study across several categories of approaches; 2) fine-tuned LLMs.
Flags: use_LLM, fine-tuning = Yes; all others = No.

Rkadiyala (CodaBench: rkadiyal)
Method summary: DeBERTa-CRF and Longformer-CRF, with modified logic for finding the text boundary.
Flags: fine-tuning = Yes; the other flags through Reflection = No (the last three columns were left blank).

HU (CodaBench: dipta007)
Method summary: I used a novel architecture inspired by a contrastive-learning approach. As the encoder, I used "sentence-transformers/all-mpnet-base-v2" from SBERT and fine-tuned it on my augmented dataset.
What is interesting: 1) Novel architecture; 2) comparable performance without ensembling.
Flags: use_LLM, fine-tuning, Data_augmentation = Yes; all others = No.
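
A hedged sketch of contrastive fine-tuning of the named SBERT encoder with the sentence-transformers library; the pair construction, loss choice, and training settings here are illustrative assumptions rather than the team's exact recipe.

    # Illustrative contrastive fine-tuning of the SBERT encoder named above.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

    # label = 1.0 for pairs from the same class (e.g. both machine-generated), 0.0 otherwise
    train_examples = [
        InputExample(texts=["a human-written paragraph", "another human-written paragraph"], label=1.0),
        InputExample(texts=["a human-written paragraph", "an LLM-generated paragraph"], label=0.0),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.ContrastiveLoss(model)

    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)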

petkaz (CodaBench: sachertort)
Method summary: We used embeddings from a fine-tuned RoBERTa-base combined with linguistic features to train a feed-forward neural network for binary classification. Specifically, our final configuration used diversity features. In addition, we resampled the training data by reducing the number of human-written texts.
What is interesting: 1) Linguistic features; 2) smart resampling of the training data.
Flags: use_small_PLM, fine-tuning = Yes; all others = No.
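
A small PyTorch sketch of this kind of fusion classifier; the RoBERTa [CLS] embedding and the diversity-feature vector are assumed to be computed elsewhere, and the feature dimension and hidden size are illustrative.

    # RoBERTa sentence embedding concatenated with hand-crafted (diversity) features,
    # fed to a small feed-forward network for binary classification.
    import torch
    import torch.nn as nn

    class FusionClassifier(nn.Module):
        def __init__(self, emb_dim=768, feat_dim=8, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(emb_dim + feat_dim, hidden),
                nn.ReLU(),
                nn.Dropout(0.2),
                nn.Linear(hidden, 2),          # human vs. machine
            )

        def forward(self, cls_embedding, ling_features):
            x = torch.cat([cls_embedding, ling_features], dim=-1)
            return self.net(x)

    model = FusionClassifier()
    logits = model(torch.randn(4, 768), torch.randn(4, 8))   # dummy batch of 4 texts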

RUG-D (CodaBench: tisocoda)
Method summary: 1) Fine-tuning different DeBERTa models. 2) We generated additional data and tested the performance of our model when trained with and without that data.
What is interesting: We compare the performance of LLMs with and without additionally generated training data.
Flags: use_LLM, fine-tuning, external data = Yes; all others = No.

TueCICL (CodaBench: teamdanielaron)
Method summary: A character-level LSTM (official submission); an LSTM with pretrained word2vec embeddings as input (unofficial); and linguistically motivated features with an MLP, which matches the baseline on the test set (unofficial).
What is interesting: Potential to match transformer results with smaller models.
Flags: all = No.

USTC-BUPT (CodaBench: comp5)
Method summary: We incorporate domain-adversarial neural networks into machine-generated text detection, simply by adding a gradient reversal layer on top of the baseline. In addition to the category labels, we also incorporate extra domain labels so that the model learns transferable features between the training and test datasets.
1) The overall framework first uses semantic information extraction layers (such as RoBERTa or BERT) to obtain text embeddings.
2) The text representation is then fed separately into a category classifier and a domain classifier; the domain classifier is an MLP preceded by a gradient reversal layer (see the sketch after this entry).
3) The final loss comprises both the classification loss and the domain loss.
The final result improves on the baseline by approximately 8%. For further details, please refer to our paper.
What is interesting: 1) The framework is novel: to the best of our knowledge, our model is the first to integrate domain-adversarial neural networks into machine-generated text detection. 2) The structure is straightforward, requiring only the addition of domain labels and a gradient reversal layer on top of the baseline. 3) Compared to the baseline, domain-adversarial training yields an improvement of roughly 8% in accuracy (we placed second in the official ranking with 96.09% accuracy), a significant gain in performance.
Flags: use_small_PLM, fine-tuning = Yes; all others = No.
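
A minimal PyTorch sketch of a gradient reversal layer of the kind described above; the surrounding encoder and the two classification heads are assumed to exist elsewhere.

    # Gradient reversal layer (GRL): identity on the forward pass,
    # negated (and scaled) gradient on the backward pass.
    import torch
    from torch.autograd import Function

    class GradReverse(Function):
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    def grad_reverse(x, lambd=1.0):
        return GradReverse.apply(x, lambd)

    # The text embedding goes directly to the category classifier, and through the GRL
    # to the domain classifier, so the encoder learns domain-invariant features, e.g.:
    #   domain_logits = domain_classifier(grad_reverse(text_embedding))
    reversed_emb = grad_reverse(torch.randn(4, 768, requires_grad=True))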

RUG-3 (CodaBench: frieso)
Method summary: We extracted the hidden layers from an LLM and combined them with a set of linguistic features as input to an SVM.
What is interesting: Instead of using either LLMs or features alone, we combined them.
Flags: use_LLM, use_GPT, fine-tuning = Yes; all others = No.

Genaios (CodaBench: safeai)
Method summary: Our model is a Transformer encoder that mixes token-level probabilistic features extracted from four Llama-2 models, both instructed and base flavors: Llama-2-7b, Llama-2-7b-chat, Llama-2-13b, and Llama-2-13b-chat. Specifically, for each token, our features are the log probability of the observed token, the log probability of the predicted token, and the entropy of the distribution. These features are concatenated at the token level, projected by a feed-forward layer, mixed with a Transformer encoder, mean-pooled, and projected with a softmax layer.
What is interesting: We verified that LLMs assign higher probabilities to machine-generated text than to human text. Our solution leverages this phenomenon through LLM probabilistic features and captures machine-generated text style with high precision. As a consequence, it obtained the best results in the official ranking (96.88% accuracy).
Flags: use_LLM = Yes; all others = No.
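
A sketch of the per-token probabilistic features described above, computed here with a single small causal LM (GPT-2) as a stand-in for the four Llama-2 variants; the downstream Transformer-encoder mixer is omitted.

    import torch
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tok("This is a sample paragraph to score.", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits                            # (1, seq_len, vocab)

    log_probs = F.log_softmax(logits[:, :-1], dim=-1)      # predictions for tokens 1..N-1
    targets = ids[:, 1:]                                   # tokens actually observed

    obs_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log p(observed token)
    pred_logp = log_probs.max(dim=-1).values                            # log p(most likely token)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)                # entropy of the distribution

    features = torch.stack([obs_logp, pred_logp, entropy], dim=-1)      # (1, seq_len-1, 3)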

TrustAI (CodaBench: ashokurlana, saibewaraditya, balamallikarjun)
Method summary: i) We used an ensemble combining Multinomial Naive Bayes, an LGBM classifier (LightGBM), and an SGD classifier over the concatenation of TF-IDF and spaCy embeddings. The model was trained only on the Subtask A monolingual dataset, without any extra data. ii) Our other model used the RoBERTa-base OpenAI Detector, the GPT-2 output detector obtained by fine-tuning a RoBERTa-base model on the outputs of the 1.5B-parameter GPT-2 model, which we fine-tuned on the Subtask A monolingual dataset.
What is interesting: Our approach explores various pretrained models and statistical models, and we select the best model via ensembling.
Flags: use_small_PLM, fine-tuning = Yes; all others = No.

SINAI (CodaBench: sinai)
Method summary: We compared different systems on the evaluation set, evaluating each one until the final system was chosen. The first system fine-tunes the XLM-RoBERTa-Large language model; the second uses GPT-2 to extract the perplexity of each text and builds a decision system based on that perplexity value (see the sketch after this entry); the last and final system merges the text with its perplexity value in a classification head. The same system was used for the monolingual and multilingual tasks.
What is interesting: 1) Comparison of various classification systems; 2) using perplexity as a feature for classification; 3) a fusion model that uses both text and perplexity to classify; 4) comprehensive error analysis.
Flags: use_LLM, fine-tuning = Yes; all others = No.
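
A sketch of the GPT-2 perplexity feature and a thresholded decision of the kind described above; the threshold value is purely illustrative.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def perplexity(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = lm(ids, labels=ids).loss      # mean token negative log-likelihood
        return torch.exp(loss).item()

    # LLM-generated text tends to have lower perplexity under another LM;
    # 25.0 is an illustrative threshold, not the tuned value.
    label = "machine" if perplexity("Some paragraph to check.") < 25.0 else "human"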

art-nat-HHU (CodaBench: artifnatural)
Method summary: We fine-tuned a RoBERTa model that was pretrained for AI detection and combined it with a set of features ranging from syntactic, lexical, and probabilistic to stylistic. To produce a classification, we trained two separate neural networks on the features, one for each class predicted by the RoBERTa classifier; the neural networks were used to make potential corrections to the RoBERTa output.
What is interesting: 1) A fusion model combining neural networks and RoBERTa; 2) an interesting range of stylistic features, from sentiment to word-level CEFR and readability.
Flags: use_small_PLM, fine-tuning = Yes; all others = No.

clulab-UofA (CodaBench: yekwon)
Method summary: 1) An unsupervised text-similarity system that computes cosine similarity between vector representations of texts. 2) A sentence transformer trained with triplet loss to learn the distinctions between the given texts. 3) A RoBERTa classifier that makes decisions based on the given paragraph. 4) A RoBERTa classifier that also takes into account sentence paraphrases generated by the candidate models.
Flags: use_small_PLM, use_LLM, use_GPT, fine-tuning = Yes; all others = No.

I2C-Huelva (CodaBench: albrp97)
What is interesting: The method's innovative integration of multimodal models with numerical text analysis presents a novel approach to enhancing the detection of machine-generated texts by capturing subtle linguistic and structural patterns.
Flags: use_LLM, fine-tuning = Yes; all others = No.

NootNoot (CodaBench: yash9439, sankalp.bahad)
Method summary: Fine-tuned RoBERTa-base.
Flags: use_small_PLM, fine-tuning = Yes; all others = No.

RUG-1 (CodaBench: alonscheuer, halecakir)
Method summary: A combined linear model over document-level features and token-level features, the latter processed by an LSTM.
What is interesting: Our method combines probability-based features with low-level and high-level linguistic features.
Flags: use_small_PLM = Yes; all others = No.

T5-Medical (CodaBench: msiino)
Method summary: Use of a pre-trained T5-large medical embedding, then cosine distance to calculate the degree of similarity.
Flags: use_LLM, zero-shot = Yes; all others = No.

Unibuc - NLP (CodaBench: tmarchitan, claudiu)
Method summary: A transformer-based model architecture with a set of fully connected layers as the classification head. We combined the Subtask A monolingual dataset with the Subtask B dataset.
What is interesting: Exploring different methods of layer selection and fine-tuning.
Flags: use_small_PLM, fine-tuning, Data_augmentation = Yes; all others = No.

Collectivized Semantics (CodaBench: advin4603)
Method summary: Fine-tuned RoBERTa-base using AdaLoRA, with a weighted sum of the mean of every layer's hidden states; the weights were trained along with the model, similar to ELMo. We mixed the train and validation data and split it uniformly across the different domains and generators to get a balanced train-validation split.
What is interesting: The method encourages use of information from different linguistic levels (syntactic, semantic, lexical), as different layers capture different levels of linguistic information.
Flags: use_small_PLM, fine-tuning = Yes; all others = No.
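
A sketch of the ELMo-style learned scalar mix over mean-pooled layer hidden states, using plain RoBERTa-base for brevity (the AdaLoRA adapters are omitted).

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class ScalarMixClassifier(nn.Module):
        def __init__(self, name="roberta-base"):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(name, output_hidden_states=True)
            n_layers = self.encoder.config.num_hidden_layers + 1       # embeddings + transformer layers
            self.layer_weights = nn.Parameter(torch.zeros(n_layers))   # learned mixing weights
            self.head = nn.Linear(self.encoder.config.hidden_size, 2)

        def forward(self, **inputs):
            hidden_states = self.encoder(**inputs).hidden_states                 # tuple of (B, T, H)
            pooled = torch.stack([h.mean(dim=1) for h in hidden_states], dim=0)  # (L, B, H)
            weights = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1)
            mixed = (weights * pooled).sum(dim=0)                                # weighted sum over layers
            return self.head(mixed)

    tok = AutoTokenizer.from_pretrained("roberta-base")
    model = ScalarMixClassifier()
    logits = model(**tok("An example paragraph.", return_tensors="pt"))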

RUG-5 (CodaBench: pdarwinkel, sijbrenvv)
Method summary: For Subtasks A and B, we augment distilbert-base-cased for the monolingual tasks and distilbert-base-multilingual-cased for the multilingual task with an additional classification layer that uses features. Alongside the augmented DistilBERT, we explore a Random Forest classifier over distilbert-base-cased embeddings (instead of simpler one-hot encodings or TF-IDF), concatenated with our 20 linguistic-stylistic features. As with the augmented DistilBERT, we use distilbert-base-multilingual-cased in the multilingual track of Subtask A.
What is interesting: We explore linguistic-stylistic features with Random Forest classifiers and DistilBERT instead of merely optimising (hyperparameter search over) a PLM; in other words, really trying something "new".
Flags: use_small_PLM, use_LLM, fine-tuning = Yes; all others = No.

Mashee (CodaBench: areeg94fahad)
Method summary: We used the chi-square test to select high-quality and low-quality samples, which were then employed in a few-shot learning setting with the Flan-T5 Large model.
Flags: use_LLM = Yes; few-shot (k = 2); all others = No.

SURBHI
Method summary: The stylometric classifier first extracts lexical stylometric features: the length of the text, the number of words, the average word length, the number of short words, the proportions of digits and capital letters, individual letter and digit frequencies, hapax legomena, a measure of text richness, and the frequency of 12 punctuation marks (see the sketch after this entry). A logistic regression (LR) model is trained on these features. The hybrid features we extract are the frequencies of the 100 most frequent character-level bi-grams and tri-grams; classification is again done with an LR. Finally, the output probabilities of a fine-tuned RoBERTa classifier, the stylometric classifier, and the hybrid classifier are concatenated and classified with an additional LR. (Subtask A monolingual and Subtask B datasets.)
What is interesting: I applied the technique of author attribution.
Flags: use_LLM, fine-tuning = Yes; all others = No.
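
A sketch of a few of the lexical stylometric features listed above feeding a logistic regression; the letter/digit frequency features are omitted, the richness measure is approximated by a type-token ratio, and the tiny toy corpus is only there to make the snippet runnable.

    import string
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def stylometric_features(text: str) -> np.ndarray:
        words = text.split()
        n_chars = max(len(text), 1)
        return np.array([
            len(text),                                            # length of text
            len(words),                                           # number of words
            np.mean([len(w) for w in words]) if words else 0.0,   # average word length
            sum(len(w) <= 3 for w in words),                      # number of short words
            sum(c.isdigit() for c in text) / n_chars,             # proportion of digits
            sum(c.isupper() for c in text) / n_chars,             # proportion of capital letters
            len(set(words)) / max(len(words), 1),                 # type-token ratio (richness proxy)
            sum(text.count(p) for p in string.punctuation),       # punctuation frequency
        ])

    texts = ["A short human-written example.", "An example generated by a language model."]
    labels = [0, 1]
    X = np.stack([stylometric_features(t) for t in texts])
    stylo_lr = LogisticRegression(max_iter=1000).fit(X, labels)
    stylo_probs = stylo_lr.predict_proba(X)    # later concatenated with the other branches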

DUTh (CodaBench: dorakir)
Method summary: We tried several machine learning algorithms and LLMs, and ended up with a fine-tuned mBERT trained for 5 epochs.
What is interesting: We compared many machine learning algorithms, ensembling methods, and LLMs.
Flags: use_LLM, fine-tuning = Yes; all others = No.

MasonTigers (CodaBench: sadiya)
Method summary: We experimented with different transformer-based models (RoBERTa, DistilBERT, ELECTRA) and then ensembled them. We also implemented zero-shot prompting and fine-tuned Flan-T5.
What is interesting: We did ensembling and also fine-tuned Flan-T5.
Flags: use_small_PLM, use_LLM, fine-tuning, zero-shot = Yes; all others = No.

Kathlalu (CodaBench: kathlalu)
Method summary: We used Zipf's law for our main approach and tried other approaches, e.g. word unigram counts, TF-IDF, and logistic regression.
Flags: all = No.
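
One plausible way to turn Zipf's law into a feature is the fitted slope of log-frequency against log-rank of the word distribution, which can then feed a simple classifier; this mapping is an assumption, not necessarily the team's exact formulation.

    import numpy as np
    from collections import Counter

    def zipf_slope(text: str) -> float:
        # Slope of log(frequency) vs. log(rank); Zipf's law predicts a slope near -1
        # for natural text, and deviations can serve as a detection feature.
        counts = sorted(Counter(text.lower().split()).values(), reverse=True)
        if len(counts) < 2:
            return 0.0
        ranks = np.arange(1, len(counts) + 1)
        slope, _ = np.polyfit(np.log(ranks), np.log(counts), deg=1)
        return float(slope)

    print(zipf_slope("the cat sat on the mat and the dog sat on the rug"))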

Mast Kalandar (CodaBench: suyash, jainitbafna)
Method summary: We used RoBERTa to encode the texts and then trained an LSTM on top of it, keeping RoBERTa frozen. We also trained a model on losses calculated using the LSTM.
Flags: use_LLM, fine-tuning = Yes; all others = No.

KInIT (CodaBench: dominikmacko, michalspiegel)
Method summary: (Subtask A multilingual) We used an ensemble with two-step majority voting over predictions, consisting of two LLMs (Falcon-7B and Mistral-7B) fine-tuned on the train set only and three zero-shot statistical methods (Entropy, Rank, Binoculars) that use Falcon-7B and Falcon-7B-Instruct to compute the metrics, with language identification and per-language threshold calibration.

Sharif-MGTD (CodaBench: ebrahimi)
Method summary: We adopt a binary classification approach to MGT detection, fine-tuning a RoBERTa-base Transformer. Fine-tuning adapts a pre-trained language model to a specific task by training it on task-specific data. We train the RoBERTa model on a labeled dataset comprising both human-written and machine-generated text samples, so that it learns to differentiate between the two categories. We evaluate our approach on Subtask A (monolingual, English), measuring performance as accuracy on the test set.
What is interesting: Our method creatively employs a fine-tuned RoBERTa-base Transformer for MGT detection, achieving competitive accuracy while highlighting the challenges of discerning machine-generated text, and offering valuable insights for further advances in NLP.
Flags: use_LLM, fine-tuning, zero-shot = Yes; all others = No.

Werkzeug (CodaBench: werkzeug)
Method summary: We use RoBERTa-large and XLM-RoBERTa-large to encode the text. As noted in previous research, PLM-encoded text embeddings can suffer from an anisotropy issue that makes them difficult to differentiate in the latent space, so we employ a learnable parametric whitening (PW) transformation to mitigate this problem (see the sketch after this entry). Moreover, to capture features of LLM-generated text from different perspectives, we use multiple PW layers as experts in a mixture-of-experts (MoE) architecture with a gating router.
What is interesting: Our model shows a larger improvement over the baseline in Subtask B (13.7%) than in Subtask A monolingual (2.03%) and Subtask A multilingual (10.3%). This may suggest that the MoE architecture enables the model to capture a broader range of language styles and features, making it better suited to detecting text generated by multiple models.
Flags: use_small_PLM, fine-tuning = Yes; all others = No.
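
A sketch of a learnable parametric whitening layer as a shift plus a learned linear projection of the pooled PLM embedding; the MoE gating router that combines several such experts is omitted, and the dimensions are illustrative.

    import torch
    import torch.nn as nn

    class ParametricWhitening(nn.Module):
        def __init__(self, dim=1024, out_dim=256):
            super().__init__()
            self.bias = nn.Parameter(torch.zeros(dim))        # learned mean shift
            self.proj = nn.Linear(dim, out_dim, bias=False)   # learned whitening-style projection

        def forward(self, x):
            return self.proj(x - self.bias)

    pw = ParametricWhitening()
    whitened = pw(torch.randn(4, 1024))   # e.g. a batch of RoBERTa-large pooled embeddings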

UMUTeam (CodaBench: ronghaopan)
Method summary: We participated in Subtask B, using an approach based on fine-tuning a pre-trained model (RoBERTa) combined with syntactic features of the texts.
What is interesting: The syntactic features capture writing style, including token-level features (e.g. word length, part of speech, function-word frequency, and stop-word ratio) and sentence-level features (e.g. sentence length).

Groningen F (CodaBench: bbjoverbeek)
Method summary: We use feature-based SVM and FFNN models. The features include the tense of the sentence, the voice of the sentence, the sentiment of the sentence, and the ratio of pronouns to proper nouns. Our hypothesis was that traditional models would generalize better than LLMs.
What is interesting: Our method uses computationally less expensive models than LLMs while achieving decent performance, with a previously unexplored combination of features.
Flags: all = No.

CUNLP (CodaBench: desa8310)
Method summary: Our approach employed a range of machine learning techniques, including logistic regression, transformer models, attention mechanisms, and unsupervised learning methods. Through rigorous experimentation, we identified key features influencing classification accuracy, namely text length, vocabulary richness, and coherence. Our highest classification accuracy was achieved by integrating transformer models with TF-IDF representations and feature engineering, although this approach demanded substantial computational resources due to the complexity of the transformer models and the incorporation of TF-IDF. Our investigation also covered a broad exploration of ML algorithms, extensive hyperparameter tuning and optimization, and detailed exploratory data analysis of the structural and lexical characteristics of the text data. Overall, our findings contribute to a deeper understanding of the challenges and strategies involved in distinguishing between AI-generated and human-written text, with implications for natural language processing and AI ethics.
What is interesting: Used TF-IDF in combination with a transformer model.
Flags: use_GPT, fine-tuning = Yes; all others = No.