BRIAN REDEKOPP
JULY 2023
CAN WE RELIABLY DETECT AI-GENERATED TEXT?
THE CHATGPT PROBLEM
OVERVIEW: METHODS OF DETECTION
(1) Statistical models measure certain statistical properties of a text and compare the result with properties typical of AI-generated text.
This is the method used by Turnitin and most of the other detectors currently available, e.g. CopyLeaks, GPTZero, ZeroGPT, Writer AI, and Originality AI.
(2) Training-based models use machine-learning techniques to train an AI system to classify text as either human or AI-generated. Through exposure to a large quantity of text of each kind, the machine creates its own rules for identifying AI-generated text.
This is the method OpenAI used for its own detector. But the detector proved unreliable, and OpenAI quietly shut it down in July 2023.
(3) Watermarking models modify an LLM’s text-generation process such that its output contains hidden markers that can be detected by a statistical algorithm.
(4) Retrieval-based models store all the outputs of an LLM in a database and then compare text against these outputs.
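To make the retrieval idea concrete, here is a toy sketch in Python. It checks a submission against a store of past LLM outputs using character n-gram overlap. This is only an illustration: the similarity measure, threshold, and stored outputs are my assumptions, and real systems (e.g. the retrieval defense studied by Krishna et al.) match semantically rather than just lexically.

    # Toy sketch of a retrieval-based detector. The similarity measure and
    # threshold are illustrative assumptions, not any vendor's actual method.

    def ngrams(text: str, n: int = 5) -> set[str]:
        """Character n-grams of a whitespace-normalized, lowercased text."""
        text = " ".join(text.lower().split())
        return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

    def jaccard(a: str, b: str) -> float:
        """Jaccard overlap between the two texts' n-gram sets."""
        ga, gb = ngrams(a), ngrams(b)
        return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

    def retrieval_detect(submission: str, stored_outputs: list[str],
                         threshold: float = 0.6) -> bool:
        """Flag the submission if it closely matches any stored LLM output."""
        return any(jaccard(submission, out) >= threshold
                   for out in stored_outputs)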
WHAT STATISTICAL MODELS (OUR CURRENT DETECTORS) MEASURE: “PERPLEXITY” AND “BURSTINESS”
Large language models (LLMs) generate strings of linguistic tokens by calculating each token’s probability given the prompt and the tokens already generated. To keep its output coherent, an LLM tends to select tokens with higher probability. This is why LLMs tend to generate bland, formulaic, and syntactically clean text. More technically, AI-generated text tends to have lower “perplexity” and “burstiness” than human-generated text.
These terms are used to describe both the performance of an LLM and the text it generates.
The perplexity of an LLM measures how uncertain it is in predicting the next token from a given string. A model with low perplexity is one that predicts the next token well.
Thus the perplexity of a text is how unpredictable it is, i.e. how unlikely each word is given the previous words. Since humans often creatively (or mistakenly) combine words in novel ways, human writing tends to have higher perplexity than AI writing.
The burstiness of an LLM is how much variation it generates in its outputs, i.e. in its terms, sentence structures, and sentence lengths. A model with low burstiness is one that generates syntactically homogeneous text.
Thus the burstiness of a text is how much variety it has amongst its terms and in its structure. Again, given the dynamism of human language, human writing tends to have higher burstiness than AI writing.
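As a rough illustration, burstiness can be proxied by variation in sentence length. The sketch below is my own illustration, not any detector’s actual formula; it scores a text by the coefficient of variation of its sentence lengths.

    # Illustrative burstiness proxy: variation in sentence length.
    # Real detectors may combine several such signals; this is only a sketch.
    import re
    import statistics

    def burstiness(text: str) -> float:
        """Coefficient of variation of sentence lengths (higher = burstier)."""
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        lengths = [len(s.split()) for s in sentences]
        if len(lengths) < 2:
            return 0.0
        return statistics.stdev(lengths) / statistics.mean(lengths)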
Statistical models can measure perplexity and burstiness by calculating the average per-token probability of a text and comparing this with a threshold deemed characteristic of AI writing (Mitchell et al. 2).
This threshold is difficult to set. If it is set too high, the risk of falsely identifying human writing as AI-generated increases (the false positive problem). If it is set too low, the risk of failing to detect AI writing increases (the evasion problem).
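Here is a minimal sketch of such a detector, assuming the Hugging Face transformers library and GPT-2 as the scoring model. The threshold value is purely illustrative; real detectors calibrate it on labeled data.

    # Minimal perplexity-threshold detector. Model choice and threshold are
    # illustrative assumptions; production detectors are far more elaborate.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def perplexity(text: str) -> float:
        """exp of the average per-token negative log-likelihood under GPT-2."""
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss  # mean cross-entropy per token
        return torch.exp(loss).item()

    THRESHOLD = 60.0  # illustrative; tuned on labeled data in practice

    def classify(text: str) -> str:
        """Low perplexity = highly predictable text = flagged as AI-like."""
        return "likely AI" if perplexity(text) < THRESHOLD else "likely human"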
PROBLEMS WITH TURNITIN AND OTHER STATISTICAL MODELS
(1) Risk of false positives
(2) Bias against non-native writers
(3) Easily defeated through prompting and paraphrasing
(Here is an overview of Turnitin’s strengths and weaknesses, from a website that is a strange mix of advice for educators on how to detect AI and advice for students on how to evade detection.)
THE RISK OF FALSE POSITIVES
Given how statistical models work, student writing that is less “perplexing” or “bursty” risks falling below the detector’s statistical threshold and getting wrongly flagged as AI-generated.
A detector tuned to catch a higher percentage of AI writing will use a higher threshold, thereby generating a higher rate of false positives. A lower threshold results in fewer false positives, but at the cost of missing a higher percentage of AI writing.
So statistical models will inevitably generate false positives; the likelihood depends on the choice the developer makes in the trade-off between maximizing true positives and minimizing false ones.
STUDENTS AT A HIGHER RISK OF FALSE POSITIVES
LIANG ET AL., “GPT DETECTORS ARE BIASED AGAINST NON-NATIVE ENGLISH WRITERS” (JULY 2023)
“In our study, we evaluated the performance of seven widely used GPT detectors on 91 TOEFL (Test of English as a Foreign Language) essays from a Chinese forum and 88 US eighth-grade essays from the Hewlett Foundation’s ASAP dataset. While the detectors accurately classified the US student essays, they incorrectly labeled more than half of the TOEFL essays as "AI-generated" (average false-positive rate: 61.3%). All detectors unanimously identified 19.8% of the human-written TOEFL essays as AI authored, and at least one detector flagged 97.8% of TOEFL essays as AI generated. Upon closer inspection, the unanimously identified TOEFL essays exhibited significantly lower text perplexity.”
“The implications of GPT detectors for non-native writers are serious, and we need to think through them to avoid situations of discrimination. Within social media, GPT detectors could spuriously flag non-native authors’ content as AI plagiarism, paving the way for undue harassment of specific non-native communities. Internet search engines, such as Google, that implement mechanisms to devalue AI-generated content may inadvertently restrict the visibility of non-native communities, potentially silencing diverse perspectives. Academic conferences or journals prohibiting use of GPT may penalize researchers from non-English-speaking countries. In education, arguably the most significant market for GPT detectors, non-native students bear more risks of false accusations of cheating, which can be detrimental to a student’s academic career and psychological well-being.
Paradoxically, GPT detectors might compel non-native writers to use GPT more to evade detection. As GPT text-generation models advance and detection thresholds tighten, the risk of non-native authors being inadvertently caught in the GPT detection net increases. If non-native writing is more consistently caught as GPT, this may create an unintended consequence of ironically causing non-native writers to use GPT to refine their vocabulary and linguistic diversity to sound more native.”
EVADING DETECTION THROUGH PROMPTING
One way to evade detection is simply to prompt the LLM to generate text that is less statistically flat, e.g. this prompt from the sort of “how to avoid detection” video a student might consult on YouTube:
“Write an engaging article that incorporates a human-like style, simple English, contractions, idioms, transitional phrases, interjections, dangling modifiers, and colloquialisms, while also weaving in literary devices such as symbolism, irony, foreshadowing, metaphor, personification, hyperbole, alliteration, imagery, onomatopoeia, and simile without directly mentioning them.”
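For illustration, a student could issue such a prompt programmatically. The sketch below uses the OpenAI Python library (the pre-v1 ChatCompletion interface); the model name, style instruction, and essay topic are my assumptions.

    # Sketch of prompting-based evasion via the OpenAI API (pre-v1 library).
    # The model name, style instruction, and essay topic are illustrative.
    import openai

    openai.api_key = "YOUR_API_KEY"

    style = ("Write in a human-like style: simple English, contractions, "
             "idioms, interjections, colloquialisms, and varied sentence lengths.")

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": style},
            {"role": "user", "content": "Write a 1200-word essay on free will."},
        ],
    )
    print(response.choices[0].message.content)  # text meant to read as human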
EVADING DETECTION THROUGH PARAPHRASERS
Another highly effective way to evade existing detectors is to run AI-generated text through an AI-paraphrasing tool. These tools are readily available, and a student need only spend a short time on YouTube to learn how to use them effectively.
One way to paraphrase is simply by prompting the LLM to paraphrase its own output, e.g. with a follow-up prompt along the lines of “Rewrite the essay above in your own words, varying your vocabulary and sentence structure.”
EVADING DETECTION THROUGH PARAPHRASERS
Another way to paraphrase is by using a paraphrasing program like Quillbot, Spinbot, Undetectable, Stealthwriter, CogniBypass, Word AI.
I tested Quillbot against Turnitin by having ChatGPT generate a 1200-word essay on Descartes’ views on free will. Turnitin flagged it as 40% AI-generated. I then paraphrased it with Quillbot’s synonym setting at maximum. (Doing this for free requires tediously inputting 125 words or fewer at a time.) The paraphrased essay was clunkier and less readable, but this gave it the feel of an average student paper.
After the Quillbot paraphrase, Turnitin’s AI score fell from 40% to 0%.
Undetectable seems especially strong for evading detection.
Sadasivan et al. (June 2023) show that paraphrasing can defeat any current state-of-the-art detector.
LONG-TERM PROSPECTS FOR DETECTION: GROUNDS FOR OPTIMISM
Currently, computer scientists are divided on the prospects for reliably detecting AI-generated text.
Chakraborty et al. (June 2023) provide a mathematical proof that “it is almost always possible to detect AI-generated text as long as we can collect multiple samples.”
They argue that while detection becomes more difficult the more statistically similar AI and human text become, it becomes impossible only in the rare instance of no statistical difference whatsoever. Short of this, we can always compensate for a smaller statistical gap between human and AI text by collecting more samples.
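A toy simulation illustrates the point (my own sketch; the Gaussian “scores” and the tiny per-sample gap are assumptions, not the authors’ model): even when a single sample is nearly uninformative, averaging over many independent samples separates the two sources almost perfectly.

    # Toy illustration of "more samples compensate for a smaller gap".
    # The Gaussian score model and gap size are assumptions for illustration.
    import random
    import statistics

    GAP = 0.1  # tiny per-sample statistical difference between the two sources

    def detect(scores: list[float]) -> bool:
        """Guess 'AI' if the mean score sits closer to the AI distribution."""
        return statistics.mean(scores) > GAP / 2

    def accuracy(n_samples: int, trials: int = 2000) -> float:
        correct = 0
        for _ in range(trials):
            is_ai = random.random() < 0.5
            mu = GAP if is_ai else 0.0
            scores = [random.gauss(mu, 1.0) for _ in range(n_samples)]
            correct += detect(scores) == is_ai
        return correct / trials

    for n in (1, 10, 100, 1000, 10000):
        print(n, round(accuracy(n), 3))  # accuracy climbs toward 1.0 with n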
The authors acknowledge that paraphrasing reduces detection performance. But they argue that detection can withstand paraphrasing through a combination of increasing the number of samples and designing better watermarks.
LONG-TERM PROSPECTS FOR DETECTION: GROUNDS FOR OPTIMISM
Kirchenbauer et al. (June 2023) share this view. They propose a watermarking technique and show that it is robust against paraphrasing, especially as text length increases.
The technique: “At each step of the text generation process, the watermark pseudo-randomly “colors” tokens into green and red lists. Then a sampling rule is used that preferentially samples [selects] green tokens when doing so does not negatively impact perplexity. To detect the watermark, a third party with knowledge of the hash function can reproduce the red and green lists for each step and count the violations.”
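The following sketch (simplified from the quoted description; the hashing scheme, green fraction, and z-test framing are my assumptions) shows the detection side: recompute each token’s color from the preceding token and count how many green tokens appear relative to chance.

    # Simplified sketch of green/red-list watermark detection. The hashing
    # scheme and green fraction are illustrative; see Kirchenbauer et al.
    import hashlib
    import math

    GREEN_FRACTION = 0.5  # share of the vocabulary colored green at each step

    def is_green(prev_token: str, token: str) -> bool:
        """Pseudo-randomly color `token`, seeded by the preceding token."""
        digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
        return digest[0] < GREEN_FRACTION * 256

    def watermark_zscore(tokens: list[str]) -> float:
        """How far the green-token count exceeds what chance would predict."""
        n = len(tokens) - 1
        if n < 1:
            return 0.0
        greens = sum(is_green(tokens[i - 1], tokens[i])
                     for i in range(1, len(tokens)))
        expected = GREEN_FRACTION * n
        std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
        return (greens - expected) / std  # large z-score => likely watermarked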
They find that “even human writers cannot reliably remove watermarks if being measured at 1000 words, despite having the goal of removing the watermark.”
LONG-TERM PROSPECTS FOR DETECTION: GROUNDS FOR PESSIMISM
Sadasivan et al. (June 2023) show that paraphrasing can defeat any current state-of-the-art detector, and argue that, as LLMs become more and more sophisticated, “reliable detection may be unachievable.” (5)
In response to the proof in Chakraborty et al. that detection is, in principle, always possible with enough samples, they point out that this will often not be practical. The samples must be independent of each other (e.g. separate articles or papers), but “it would be unreasonable to expect a student to submit several versions of their essay just to determine whether it has been written using AI or not.” (20)
LONG-TERM PROSPECTS FOR DETECTION: GROUNDS FOR PESSIMISM
Problems with a watermarking approach:
Problem with a retrieval approach:
WORKS CITED
Chakraborty, Souradip, et al. “On the Possibilities of AI-Generated Text Detection.” arXiv preprint, June 2023. https://arxiv.org/abs/2304.04736
Kirchenbauer, John, et al. “On the Reliability of Watermarks for Large Language Models.” arXiv preprint, June 2023. https://arxiv.org/abs/2306.04634
Krishna, Kalpesh, et al. “Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense.” arXiv preprint, March 2023. https://arxiv.org/abs/2303.13408
Liang, Weixin, et al. “GPT detectors are biased against non-native English writers.” arXiv preprint, June 2023. https://arxiv.org/abs/2304.02819
Lu, Ning, et al. “Large Language Models can be Guided to Evade AI-Generated Text Detection.” arXiv preprint, June 2023. https://arxiv.org/abs/2305.10847
Mitchell, Eric, et al. “DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature.” arXiv preprint, January 2023. https://arxiv.org/abs/2301.11305
Sadasivan, Vinu, et al. “Can AI-Generated Text be Reliably Detected?” arXiv preprint, June 2023. https://arxiv.org/abs/2303.11156