An attempt at a zero-shot AI discriminator

April 10, 2026 — Andrew Fowlie

I have become interested in detecting AI-generated text. I have been reading about zero-shot methods: methods that don't require, and aren't trained on, explicit examples of AI- and human-written text. These methods have a frequentist, goodness-of-fit flavor to them. One constructs a statistic from the text under consideration and compares it to the distribution expected were the text AI-generated.

Naturally, I wanted to detect AI text in a Bayesian way. My first thought was, when we consider whether a text was AI-generated, the alternative is that it was human-generated. But which human! A child? A high-school student? A Nobel Prize-winning scientist? A non-native English speaker? My second thought was, which AI! Which language model? Sampled at what temperature? With what system and user prompts?

We should make use of any information we have about the possible origin of the text. For example, suppose we were setting a homework problem for high-school students and wondered whether they copied and pasted our question into an LLM and submitted that as an answer, or whether they wrote an answer in the old-fashioned way. In this case, we would know that the AI prompt was likely to be the exact question we asked, and that the human writer might be a high-school student.

Two models for the generation of the unknown text could thus be:

Model 0: Text generated from an unknown LLM with the homework question copied and pasted, and off-the-shelf system prompts and temperature.
Model 1: Text generated from a high-school student with a reasonable grasp of the subject matter and strong English language skills.

My weapon of choice in model selection is, as always, the Bayes factor. This requires us to write the likelihood of the text under the two models. My idea was to use LLMs as surrogates for the two models. We usually interact with LLMs by chatting to them; in doing so, we are sampling new text from them. Under the hood, they generate new text by computing the probabilities of pieces of text, called tokens. Thus, if you load an LLM in, e.g., the transformers library in Python, then as well as sampling text from it, you can compute the probability of that LLM generating a sequence of text that you specify from a given prompt.
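As a minimal sketch of that last step, the probability of a specified answer is the product of the per-token probabilities, or equivalently the sum of their logarithms. The function names and the toy numbers below are hypothetical; the sketch assumes that the per-position logits have already been obtained from a forward pass of a causal language model (for example in the transformers library), and that step is omitted here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]

def sequence_logprob(step_logits, token_ids):
    """Sum of log p(token_t | prompt, earlier tokens) over the answer.

    Assumption: step_logits[t] holds the model's logits over the vocabulary
    for answer position t, already conditioned on the prompt and on
    token_ids[:t]. token_ids[t] is the token that actually appears there.
    """
    if len(step_logits) != len(token_ids):
        raise ValueError("need one logit vector per answer token")
    return sum(
        log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)
    )

# Toy example (invented numbers): a 4-token vocabulary and a 3-token answer.
toy_logits = [[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
toy_answer = [1, 0, 1]
toy_logprob = sequence_logprob(toy_logits, toy_answer)
```

Working with log-probabilities rather than probabilities matters in practice, since the probability of any answer of realistic length is far too small to represent directly as a floating-point number.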

Thus, I would use LLMs as surrogate models that would approximate my models 0 and 1:

Surrogate model 0: Text generated from a particular LLM, prompted by the homework question.
Surrogate model 1: Text generated from a particular LLM, prompted by the homework question, and with a system prompt to answer in the manner of a high-school student with a reasonable grasp of the subject matter and strong English language skills.

For a given answer to the homework question, I could compute the probabilities of that answer under these two surrogate models. By taking the ratio of these probabilities, I would have the Bayes factor $$ B = \frac{P(\text{text} \,|\, \text{LLM prompted by homework question})}{P(\text{text} \,|\, \text{LLM prompted by homework question and system prompt to answer in style of student})} $$ The Bayes factor would tell me how much the text shifts the odds in favor of surrogate model 0 over surrogate model 1. If the surrogates are reasonable approximations to the real models, it would inform us about the chances that the text came from an AI versus our anticipated type of human writer.
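The ratio is most safely formed in log space. The sketch below assumes the two log-probabilities of the answer have already been computed under the two surrogates; the function names, the example values, and the default prior probability of 0.5 are all hypothetical choices for illustration.

```python
import math

def log_bayes_factor(logp_model0, logp_model1):
    """ln B in favor of surrogate model 0 (plain LLM) over model 1 (student-style)."""
    return logp_model0 - logp_model1

def posterior_prob_model0(log_b, prior_prob_model0=0.5):
    """Posterior probability of model 0, from ln B and a prior probability in (0, 1)."""
    prior_log_odds = math.log(prior_prob_model0) - math.log1p(-prior_prob_model0)
    log_odds = log_b + prior_log_odds
    # Logistic function, written so that math.exp never overflows.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

# Invented log-probabilities of one answer under the two surrogates.
log_b = log_bayes_factor(-120.0, -135.0)
prob_ai = posterior_prob_model0(log_b)
```

Here `log_b` is 15, so with equal prior probabilities the posterior probability of surrogate model 0 is very close to 1. The prior probability is a separate modelling choice and does not come from the Bayes factor itself.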

I had an idea that one could tweak the precise system prompts by sampling texts from the two surrogate models until they were in line with our expectations. This is analogous to checking different choices of priors, a recommended part of any Bayesian workflow.

Does this work? The Bayes factor was computationally cheap (cheaper than sampling new text) and easy to compute using the standard transformers library and a freely available pretrained Qwen LLM. It appeared to perform well in discriminating quite badly written English text from smooth and polished AI text. Unfortunately, it's easy for adversaries to beat the system by prompting their own LLM to answer in the manner of a student. Furthermore, the resulting Bayes factors were sensitive to arbitrary aspects of the prior. That is, the Bayes factor was sensitive to the exact system prompts that I used in my surrogate models. Changes to the system prompt that appeared innocuous led to huge changes in the resulting Bayes factor. E.g., slight changes to my system prompt could change the probability of the first token significantly, to the point where the whole result changed.
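To see how one token can dominate, write the text as tokens $x_1, \ldots, x_T$ and let $p_0$ and $p_1$ denote the per-token probabilities under surrogate models 0 and 1 (notation introduced here for illustration). The log Bayes factor is then a sum with one term per token, $$ \ln B = \sum_{t=1}^{T} \left[ \ln p_0(x_t \,|\, x_{<t}) - \ln p_1(x_t \,|\, x_{<t}) \right] $$ The numbers that follow are invented and are not measurements. Suppose a reworded system prompt lowers $p_1(x_1)$ from $0.1$ to $0.001$ and leaves every other term unchanged. Then $$ \Delta \ln B = \ln \frac{0.1}{0.001} = \ln 100 \approx 4.6 $$ so the Bayes factor is multiplied by 100 on the strength of the first token alone.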

For that reason, I abandoned this idea. The surrogates might appear to be well calibrated and to generate acceptable sample text, but they secretly encode strange and arbitrary assumptions about the anticipated tokens. Additionally, I was concerned about the ethics of profiling the human writers in my system prompt.

Tags: ai
