# | Team Name | CodaBench username/s | Method Summary | What is interesting about your method? | use_small_PLM | use_LLM | use_GPT | use_GPT4 | fine-tuning | zero-shot | few-shot (k=?) | CoT | ReAct | Reflection | Data_augmentation | Use external data (generate additional data using the same LLMs in the data set) | Use instruction-tuned vs non-instruction-tuned LLM to generate extra data |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | QUST | karry | First, we experiment with multiple models on a dataset that has undergone data augmentation. We then select the two best-performing models for ensembling: a fine-tuned RoBERTa model trained with the Multiscale Positive-Unlabeled (MPU) framework and a DeBERTa model, fused through a stacking ensemble. | Our approach is based on fine-tuned models, incorporating the MPU framework and ensemble techniques, and is trained on a multilingual dataset after data cleaning. | No | Yes | No | No | Yes | No | No | No | No | No | Yes | No | No |
3 | NewbieML | baotran2003 | Stacking ensemble with SVM (RBF kernel, C = 1, k = 0.1), LogisticRegression, and XGBoost (max_depth = 8); the meta-model is KNN (k = 7). Text embeddings come from Longformer-large (a comparable stacking setup is sketched below the table). | 1) Faster inference than the baseline with acceptable accuracy on the test dataset. 2) Simple and lightweight approach. | Yes | No | No | No | No | No | No | No | No | No | No | No | No |
4 | NCL-UoR | windwind | Fine-tuning LLMs, including XLM-RoBERTa, RoBERTa with Low-Rank Adaptation (LoRA), and DistilmBERT, and applying a majority-voting ensemble over XLM-RoBERTa and LoRA-RoBERTa (a LoRA fine-tuning sketch appears below the table). | Our method explores various LLMs and applies an ensemble method to the fine-tuned models. | No | Yes | No | No | Yes | No | No | No | No | No |
5 | L3i++ | honghanhh | A comparative study among 3 groups of methods for detection: 5 metric-based models (Log-Likelihood, Rank, Log-Rank, Entropy, and MFDMetric); 2 fine-tuned sequence-labeling language models (RoBERTa, XLM-R); and a fine-tuned large-scale language model (LLaMA-2, 7B version). LLaMA-2 outperformed the rest and accurately detects machine-generated texts (>= 90% in multilingual Subtask A and >= 99% for ChatGPT/bloomz in Subtask B). We provide error analysis and discuss different factors in our paper. | 1) Comparative study among several categories of approaches; 2) fine-tuned LLMs | No | Yes | No | No | Yes | No | No | No | No | No | No | No | No |
6 | Rkadiyala | rkadiyal | DeBERTa-CRF and Longformer-CRF, with modified logic for finding the text boundary | | No | No | No | No | Yes | No | No | No | No | No |
7 | HU | dipta007 | I used a novel architecture inspired by a contrastive learning approach. As the encoder, I used "sentence-transformers/all-mpnet-base-v2" from SBERT and fine-tuned it on my augmented dataset. | 1. Novel architecture 2. Comparable performance without ensembling | No | Yes | No | No | Yes | No | No | No | No | No | Yes | No | No |
8 | petkaz | sachertort | We used embeddings from fine-tuned RoBERTa-base combined with linguistic features to train a feed-forward neural network for binary classification. Specifically, in our final configuration, we used diversity features. In addition, we resampled the training data by reducing the number of human-written texts. | 1. Linguistic features. 2. Smart resampling of training data. | Yes | No | No | No | Yes | No | No | No | No | No | No | No | No | ||||||||||||
9 | |||||||||||||||||||||||||||||
10 | RUG-D | tisocoda | 1. Fine-tuning different DeBERTa models. 2. We generated additional data and tested the performance of our model when trained with and without said data | We compare the performance of LLMs with and without additionally generated training data | No | Yes | No | No | Yes | No | No | No | No | No | No | Yes | No | ||||||||||||
11 | TueCICL | teamdanielaron | Character-level LSTM (official); LSTM with pretrained word2vec embeddings as input (unofficial); linguistically motivated features with an MLP (matches the baseline on test; unofficial) | Potential to match transformer results with smaller models | No | No | No | No | No | No | No | No | No | No | No | No | No |
12 | USTC-BUPT | comp5 | We incorporate domain adversarial neural networks into machine-generated text detection, accomplished by simply adding a gradient reversal layer on top of the baseline (a minimal gradient-reversal sketch appears below the table). In addition to category labels, we incorporate extra domain labels to enable the model to learn transferable features between the training and testing datasets. 1) The overarching framework first uses semantic information extraction layers (such as RoBERTa or BERT) to obtain text embeddings. 2) The text representation is then fed separately into the category classifier and the domain classifier; the domain classifier, an MLP, incorporates a gradient reversal layer. 3) The final loss comprises not only the classification loss but also the domain loss. The outcome is an improvement of approximately 8% over the baseline. For further details, please refer to our paper. | 1. The framework is novel: to the best of our knowledge, our model is the first to integrate domain adversarial neural networks into machine-generated text detection. 2. The structure is straightforward, requiring only the addition of domain labels and a gradient reversal layer on top of the baseline. 3. Compared to the baseline, domain adversarial training improves accuracy by approximately 8% (we achieved second place in the official ranking with 96.09% accuracy), a significant enhancement in performance. | Yes | No | No | No | Yes | No | No | No | No | No | No | No | No |
13 | RUG-3 | frieso | We extracted the hidden layers from an LLM and combined them with a set of linguistic features as input to an SVM. | Instead of using either LLMs or features alone, we combined them. | No | Yes | Yes | No | Yes | No | No | No | No | No | No | No | No |
14 | Genaios | safeai | Our model is a Transformer Encoder that mixes token-level probabilistic features extracted from four Llama-2 models, both instructed and base flavors: Llama-2-7b, Llama-2-7b-chat, Llama-2-13b, and Llama-2-13b-chat. Specifically, for each token, our features are: the log probability of the observed token, the log probability of the predicted token, and the entropy of the distribution. These features are concatenated at token level, projected by a feed-forward layer, mixed with a Transformer encoder, mean pooled, and projected with a softmax layer. | We verified that LLMs assign higher probabilities to machine-generated text than to human texts. Our solution leverages this phenomenon through LLM probabilistic features and captures machine-generated text style in a highly precise manner. In consequence, it obtained the best results in the official ranking (96.88% accuracy). | No | Yes | No | No | No | No | No | No | No | No | No | No | No | ||||||||||||
15 | TrustAI | ashokurlana,saibewaraditya,balamallikarjun | i) An ensemble combining Multinomial Naive Bayes, an LGBM (LightGBM) classifier, and an SGD classifier, using the concatenation of TF-IDF and spaCy embeddings; the model was trained only on the Subtask A monolingual dataset without any extra data. ii) Our other model uses the RoBERTa-base OpenAI Detector, the GPT-2 output detector obtained by fine-tuning a RoBERTa-base model on the outputs of the 1.5B-parameter GPT-2 model, further fine-tuned on the Subtask A monolingual dataset. | Our approach explores various pretrained and statistical models, and we select the best model using ensembling. | Yes | No | No | No | Yes | No | No | No | No | No | No | No | No |
16 | SINAI | sinai | In this work we compared different systems on the evaluation set until a final system was chosen. The first system fine-tunes the XLM-RoBERTa-Large language model; the second uses GPT-2 to extract the perplexity of each text and builds a decision system based on that value (a perplexity sketch appears below the table); the final system merges the text with its perplexity value in a classification head. The same system was used for the monolingual and multilingual tasks. | 1) Comparison of various classification systems. 2) Using perplexity as a feature for classification. 3) A fusion model that uses text and perplexity to classify. 4) Comprehensive error analysis. | no | yes | no | no | yes | no | no | no | no | no | no | no | no |
17 | art-nat-HHU | artifnatural | In this work we fine-tuned a RoBERTa model that was pretrained for AI detection and combined it with a set of syntactic, lexical, probabilistic, and stylistic features. To produce a classification, we trained two separate neural networks on the features, one for each class predicted by the RoBERTa classifier; the neural nets were used to make potential corrections to the RoBERTa output. | 1) Fusion model combining neural networks and RoBERTa 2) Interesting range of stylistic features, from sentiment to word-level CEFR and readability | yes | no | no | no | yes | no | no | no | no | no | no | no | no |
18 | clulab-UofA | yekwon | 1. An unsupervised text-similarity system that computes cosine similarity, i.e. the angle between vector representations of the texts. 2. A sentence transformer trained with triplet loss to learn the distinctions between the given texts. 3. A RoBERTa classifier that makes decisions based on the given paragraph. 4. A RoBERTa classifier that takes into account sentence paraphrases generated by the candidate models | | yes | yes | yes | no | yes | no | no | no | no | no | no | no | no |
19 | I2C-Huelva | albrp97 | | The method's innovative integration of multimodal models with numerical text analysis presents a novel approach to enhancing the detection of machine-generated texts by capturing subtle linguistic and structural patterns. | no | yes | no | no | yes | no | no | no | no | no | no | no | no |
20 | NootNoot | yash9439, sankalp.bahad | Fine-tuned RoBERTa-base | | yes | no | no | no | yes | no | no | no | no | no | no | no | no |
21 | RUG-1 | alonscheuer, halecakir | Combined linear model with document-level features and token-level features which have been processed by an LSTM | Our method combines probability-based features with low-level and high-level linguistic features | yes | no | no | no | no | no | no | no | no | no | no | no | no | ||||||||||||
22 | T5-Medical | msiino | Use of a pre-trained T5-large medical embedding and then cosine distance to calculate the degree of similarity. | | no | yes | no | no | no | yes | no | no | no | no | no | no | no |
23 | Unibuc - NLP | tmarchitan, claudiu | Transformer-based model architecture with a set of fully connected layers for the classification head. Combined Subtask A - monolingual dataset with Subtask B dataset. | Explore different methods of layer selection and fine-tuning. | yes | no | no | no | yes | no | no | no | no | no | yes | no | no | ||||||||||||
24 | Collectivized Semantics | advin4603 | Fine-tuned RoBERTa-base using AdaLoRA and used a weighted sum of the mean hidden states of all layers; the weights were trained along with the model, similar to ELMo. Mixed the train and validation data and split it uniformly across different domains and generators to get a balanced train-validation split. | The method encourages using information from different linguistic levels (syntax, semantics, lexicon), as different layers capture different levels of linguistic information | yes | no | no | no | yes | no | no | no | no | no | no | no | no |
25 | RUG-5 | pdarwinkel, sijbrenvv | For Subtasks A and B, we augment distilbert-base-cased for the monolingual tasks and distilbert-base-multilingual-cased for the multilingual task with an additional layer for classification using features. Next to the augmented DistilBERT for Subtasks A and B, we explore a Random Forest classifier using distilbert-base-cased embeddings, instead of simpler one-hot encodings or TF-IDF embeddings, concatenated with our 20 linguistic-stylistic features. As with the augmented DistilBERT, we use distilbert-base-multilingual-cased in the multilingual track of Subtask A. | We explore the use of linguistic-stylistic features with Random Forest classifiers and DistilBERT instead of merely optimising (hyperparameter search) a PLM. In other words, really trying something "new". | yes | yes | no | no | yes | no | no | no | no | no | no | no | no |
26 | Mashee | areeg94fahad | We used the Chi-square test to select high-quality and low-quality samples. These selected samples were then employed in a few-shot learning setting with the FlanT5-Large model. | | no | yes | no | no | no | no | 2 | no | no | no | no | no | no |
27 | SURBHI | | The stylometric classifier first extracts lexical stylometric features: the length of the text, the number of words, the average word length, the number of short words, the proportion of digits and capital letters, individual letter and digit frequencies, hapax legomena, a measure of text richness, and the frequency of 12 punctuation marks. A logistic regression (LR) model is trained on these features. The hybrid features we extract are the frequencies of the 100 most frequent character-level bi-grams and tri-grams; classification is again done with an LR. Finally, the output probabilities of a fine-tuned RoBERTa classifier, the stylometric classifier, and the hybrid classifier are concatenated and classified using an additional LR (Subtask A monolingual and Subtask B datasets). | We applied the technique of author attribution. | No | Yes | No | No | Yes | No | No | No | No | No | No | No | No |
28 | DUTh | dorakir | We tried several machine learning algorithms and LLMs, and ended up with fine-tuned mBERT trained for 5 epochs. | We compared many machine learning algorithms, ensembling methods, and LLMs. | no | yes | no | no | yes | no | no | no | no | no | no | no | no |
29 | MasonTigers | sadiya | We experimented with different transformer-based models (RoBERTa, DistilBERT, ELECTRA) and later ensembled them. We also implemented zero-shot prompting and fine-tuned FlanT5. | We did ensembling and also fine-tuned FlanT5 | Yes | Yes | No | No | Yes | Yes | No | No | No | No | No | No | No |
30 | Kathlalu | kathlalu | We used Zipf's law for our main approach and tried other approaches, e.g. word unigram counts, TF-IDF, and logistic regression. | | no | no | no | no | no | no | no | no | no | no | no | no | no |
31 | Mast Kalandar | suyash, jainitbafna | We used RoBERTa to encode the text and then trained an LSTM on the encodings while keeping RoBERTa frozen. We also trained a model on losses calculated using the LSTM. | | no | Yes | no | no | yes | no | no | no | no | no | no | no | no |
32 | KInIT | dominikmacko, michalspiegel | (Subtask A multilingual) We used a two-step majority-voting ensemble for predictions, consisting of 2 LLMs (Falcon-7B and Mistral-7B) fine-tuned on the train set only and 3 zero-shot statistical methods (Entropy, Rank, Binoculars) computed with Falcon-7B and Falcon-7B-Instruct, utilizing language identification and per-language threshold calibration. |
33 | Sharif-MGTD | ebrahimi | We adopt a binary classification approach to MGT detection, utilizing fine-tuning on a RoBERTa-base Transformer. Fine-tuning involves adapting a pre-trained language model to a specific task by training it on task-specific data. Our methodology involves training the RoBERTa model on a labeled dataset comprising both human-written and machine-generated text samples. The model learns to differentiate between these two categories during the training process. We evaluate our approach using Subtask A (Monolingual - English). Our system's performance is measured in terms of accuracy on the test set. | Our method creatively employs fine-tuned RoBERTa-base Transformer for MGT detection, achieving competitive accuracy while highlighting challenges in discerning machine-generated text, offering valuable insights for further advancements in NLP. | no | yes | no | no | yes | yes | no | no | no | no | no | no | no | ||||||||||||
34 | Werkzeug | werkzeug | We utilize RoBERTa-large and XLM-RoBERTa-large to encode the text. As noted in previous research, PLM-encoded text embeddings may suffer from an anisotropy issue that makes them difficult to differentiate in the latent space, so we employ a learnable parametric whitening (PW) transformation to mitigate this problem. Moreover, to capture features of LLM-generated text from different perspectives, we use multiple PW transformation layers as experts in a mixture-of-experts (MoE) architecture equipped with a gating router to construct the complete model. | We find that our model shows a larger performance improvement over the baseline in Subtask B (13.7%) than in Subtask A monolingual (2.03%) and Subtask A multilingual (10.3%). This may suggest that the MoE architecture enables our model to capture a broader range of language styles and features, making it more suitable for detecting text generated by multiple models. | yes | no | no | no | yes | no | no | no | no | no | no | no | no |
35 | UMUTeam | ronghaopan | We participated in Subtask B, using an approach based on fine-tuning a pre-trained model, such as RoBERTa, combined with syntactic features of the texts. Syntactic features refer to writing style, such as token-level features (e.g. word length, part of speech, function-word frequency, and stop-word ratio) and sentence-level features (e.g. sentence length). |
36 | Groningen F | bbjoverbeek | We use feature-based SVM and FFNN models. The features include the tense, voice, and sentiment of the sentence, and the number of pronouns vs. proper nouns. Our hypothesis was that the traditional models would generalize better than LLMs. | Our method uses computationally less expensive models than LLMs while achieving decent performance, with a previously unexplored combination of features. | no | no | no | no | no | no | no | no | no | no | no | no | no |
37 | CUNLP | desa8310 | Our approach involved employing a range of machine learning techniques, including logistic regression, transformer models, attention mechanisms, and unsupervised learning methods. Through rigorous experimentation, we identified key features influencing classification accuracy, namely text length, vocabulary richness, and coherence. Notably, our highest classification accuracy was achieved by integrating transformer models with TF-IDF representation and feature engineering. However, it is essential to note that this approach demanded substantial computational resources due to the complexity of transformer models and the incorporation of TF-IDF. Additionally, our investigation encompassed a thorough exploration of various ML algorithms, extensive hyperparameter tuning, and optimization techniques. Furthermore, we conducted detailed exploratory data analysis to gain insights into the structural and lexical characteristics of the text data. Overall, our findings contribute to a deeper understanding of the challenges and strategies involved in distinguishing between AI-generated and human-generated text, with implications for applications in natural language processing and AI ethics. | Used TF-IDF in combination with transformer model | no | no | yes | no | yes | no | no | no | no | no | no | no | no | ||||||||||||
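
The entries above mention several recurring techniques; a few hedged sketches follow. First, a minimal sketch of a stacking ensemble in the spirit of the NewbieML entry (SVM, logistic regression, and boosted trees as base learners, k-NN with k = 7 as the meta-model). The random matrix stands in for Longformer-large embeddings, and scikit-learn's GradientBoostingClassifier stands in for XGBoost; the hyperparameters are illustrative, not the team's.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-ins for Longformer-large text embeddings and binary labels
# (0 = human-written, 1 = machine-generated).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
y = rng.integers(0, 2, size=200)

base_learners = [
    ("svm", SVC(kernel="rbf", C=1.0, probability=True)),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("gbt", GradientBoostingClassifier(max_depth=8)),  # stand-in for XGBoost
]
clf = StackingClassifier(
    estimators=base_learners,
    final_estimator=KNeighborsClassifier(n_neighbors=7),  # meta-model
)
clf.fit(X, y)
print(clf.predict(X[:5]))
```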
38 | |||||||||||||||||||||||||||||
39 | |||||||||||||||||||||||||||||
40 | |||||||||||||||||||||||||||||
41 | |||||||||||||||||||||||||||||
42 | |||||||||||||||||||||||||||||
43 | |||||||||||||||||||||||||||||
44 | |||||||||||||||||||||||||||||
45 | |||||||||||||||||||||||||||||
46 | |||||||||||||||||||||||||||||
47 | |||||||||||||||||||||||||||||
48 | |||||||||||||||||||||||||||||
49 | |||||||||||||||||||||||||||||
50 | |||||||||||||||||||||||||||||
51 | |||||||||||||||||||||||||||||
52 | |||||||||||||||||||||||||||||
53 | |||||||||||||||||||||||||||||
54 | |||||||||||||||||||||||||||||
55 | |||||||||||||||||||||||||||||
56 | |||||||||||||||||||||||||||||
57 | |||||||||||||||||||||||||||||
58 | |||||||||||||||||||||||||||||
59 | |||||||||||||||||||||||||||||
60 | |||||||||||||||||||||||||||||
61 | |||||||||||||||||||||||||||||
62 | |||||||||||||||||||||||||||||
63 | |||||||||||||||||||||||||||||
64 | |||||||||||||||||||||||||||||
65 | |||||||||||||||||||||||||||||
66 | |||||||||||||||||||||||||||||
67 | |||||||||||||||||||||||||||||
68 | |||||||||||||||||||||||||||||
69 | |||||||||||||||||||||||||||||
70 | |||||||||||||||||||||||||||||
71 | |||||||||||||||||||||||||||||
72 | |||||||||||||||||||||||||||||
73 | |||||||||||||||||||||||||||||
74 | |||||||||||||||||||||||||||||
75 | |||||||||||||||||||||||||||||
76 | |||||||||||||||||||||||||||||
77 | |||||||||||||||||||||||||||||
78 | |||||||||||||||||||||||||||||
79 | |||||||||||||||||||||||||||||
80 | |||||||||||||||||||||||||||||
81 | |||||||||||||||||||||||||||||
82 | |||||||||||||||||||||||||||||
83 | |||||||||||||||||||||||||||||
84 | |||||||||||||||||||||||||||||
85 | |||||||||||||||||||||||||||||
86 | |||||||||||||||||||||||||||||
87 | |||||||||||||||||||||||||||||
88 | |||||||||||||||||||||||||||||
89 | |||||||||||||||||||||||||||||
90 | |||||||||||||||||||||||||||||
91 | |||||||||||||||||||||||||||||
92 | |||||||||||||||||||||||||||||
93 | |||||||||||||||||||||||||||||
94 | |||||||||||||||||||||||||||||
95 | |||||||||||||||||||||||||||||
96 | |||||||||||||||||||||||||||||
97 | |||||||||||||||||||||||||||||
98 | |||||||||||||||||||||||||||||
99 | |||||||||||||||||||||||||||||
100 |