Contributing Authors: Melanie Kurimchak, Learning Data Insights | Alexis Andres, Learning Data Insights | Maggie Beiting-Parish, CUNY Graduate Center/EdAIfy
AI Disclosure: Claude Sonnet v4.5 was used in the initial drafting of this post; human editing and review were conducted throughout. This post is part of the GenAI Insights Hub. All content CC BY-SA.

One of the most time-consuming parts of rigorous literature synthesis is establishing inter-rater reliability (Belur et al., 2018). In our project, coders must interpret the same research in consistent ways, which requires detailed, field-by-field comparison of their coding sheets. Until now, that work has mostly been manual and slow.
Over the past several months, we tested whether AI systems could help: not by coding research papers themselves, but by auditing how humans coded them. We ran two rounds of experiments using different models and different prompts. The results were useful in some places and disappointing in others.
Experiment 1: Can AI Audit Human Coding?
Setup
Four expert coders independently analyzed the same seven research papers using our coding framework, which includes more than 190 fields per paper. Once coding was complete, we asked four AI systems to compare the outputs:
- Claude Sonnet 4.5
- ChatGPT-4
- Gemini Flash 2.5
- NotebookLM
The models were not evaluating the papers themselves. Instead, they were asked to review the completed coding sheets, identify where coders agreed or disagreed, and estimate inter-rater reliability based on those entries.
What we wanted to know:
- Can AI correctly identify where human coders agree and disagree?
- Can it surface patterns in disagreement that suggest protocol improvements?
- Can it reduce the time required for reliability checks?
- Can it handle structured but nuanced coding data?
Where AI auditing worked
AI performed best in fields with clear values and little interpretation.
When all coders recorded the same metric, such as quadratic weighted kappa, every system correctly reported full agreement. When all coders marked “zero-shot” as the prompting technique, the models flagged perfect consensus. When three coders recorded 0.789 and one recorded 0.786, ChatGPT correctly identified the difference and reported 75 percent agreement.
Fields like these could realistically be auto-checked and skipped during manual review. They represent the kind of straightforward comparison that machines handle well (Bommasani et al., 2021).
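Checks like these are simple enough to script. As a minimal sketch (not our project's actual tooling), majority-share agreement on a single field might look like:

```python
from collections import Counter

def field_agreement(values):
    """Share of coders whose entry matches the most common value for a field."""
    counts = Counter(values)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(values)

# The QWK example from the text: three coders entered 0.789, one entered 0.786.
print(field_agreement(["0.789", "0.789", "0.789", "0.786"]))  # 0.75
print(field_agreement(["zero-shot"] * 4))                     # 1.0
```

Note that this majority-share definition is only one of several reasonable choices; counting agreeing coder *pairs* instead would give 50 percent on the same data, which previews why different systems can report different numbers for "the same" metric.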
Claude also surfaced a useful error. One reviewer had accidentally flipped minimum and maximum accuracy values. By comparing that reviewer’s entries across papers, Claude noticed the pattern and flagged it as an outlier. That kind of issue would normally take time for a human reviewer to notice or perhaps even go unnoticed, leading to errors in IRR measurements.
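A consistency check of this kind is also scriptable once spotted. A hedged sketch, using hypothetical column names `min_accuracy` and `max_accuracy` (our actual sheet labels differ):

```python
def flag_swapped_ranges(rows):
    """Return entries whose recorded minimum exceeds the recorded maximum,
    a likely sign the two values were transposed."""
    return [r for r in rows if r["min_accuracy"] > r["max_accuracy"]]

rows = [
    {"paper": "P1", "min_accuracy": 0.62, "max_accuracy": 0.91},
    {"paper": "P2", "min_accuracy": 0.88, "max_accuracy": 0.71},  # flipped
]
print([r["paper"] for r in flag_swapped_ranges(rows)])  # ['P2']
```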
Where AI auditing failed
Performance dropped in fields that required interpretation or context.
Semantic equivalence was a common problem. One coder entered “Scored Student Responses.” Another wrote “ASAP.” A third entered “Synthetic Data.” Human reviewers quickly recognized that two of these referred to the same dataset. The AI systems generally did not. Most flagged them as disagreements.
The models treated the entries as different strings rather than recognizing that they referred to the same thing.
Research question entries produced similar issues. One coder entered an abbreviation, another copied the full research question, and a third wrote a paraphrase. Human reviewers recognized that these entries aligned. The models flagged them as disagreements.
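One pragmatic workaround we are considering (not yet implemented) is a small alias dictionary that normalizes known-equivalent names before comparison. A sketch, where the alias entry reflects the dataset example above:

```python
# Illustrative alias map: entries the team knows refer to the same dataset.
ALIASES = {
    "asap": "scored student responses",
}

def normalize(entry):
    """Lowercase, trim, and resolve known aliases before string comparison."""
    key = entry.strip().lower()
    return ALIASES.get(key, key)

print(normalize("ASAP") == normalize("Scored Student Responses"))  # True
print(normalize("Synthetic Data") == normalize("ASAP"))            # False
```

A lookup table like this only handles equivalences someone has already cataloged; it does not solve semantic matching in general.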

The now familiar model counting problem also surfaced. AI systems could see that coders recorded different values, but they could not explain why the numbers differed or which interpretation might be reasonable. Without access to the original paper, the models lacked the context needed to evaluate the disagreement.
This highlights a clear limit. AI can point out where differences appear, but it cannot reliably explain them. The fields that require interpretation still need human review.
Inconsistency across systems
When we asked each system to compute overall agreement, the results varied widely.
| Model name | Overall pairwise agreement |
| --- | --- |
| ChatGPT-5 mini | 90% |
| NotebookLM | 77.1% |
| Gemini Flash 2.5 | 67.5% |
| Claude Sonnet 4.5 | 62% |

Pairwise agreement across models in the Round 7 (12/5) paper review.
All four systems analyzed identical coding sheets and were asked to calculate the same metric. We did not provide the equation. When we asked how the number had been calculated, each model described a different methodology.
Yet, the outputs differed by nearly thirty percentage points. Using Claude’s output would suggest the coders needed substantial recalibration. Using ChatGPT’s would suggest everything was functioning smoothly. In reality, the former didn't even perform a verifiable IRR calculation.
The gap reflects fundamental differences in how each system defines agreement. Models make implicit assumptions about how to treat blank fields, how to interpret semantic similarity, how to weight fields, and how to count partial matches. None of those assumptions were visible until we compared outputs side by side. These seemingly minor definitional differences can substantially influence research outcomes and analyses, depending on how heavily AI-generated interpretations are integrated, relied upon, and reviewed by humans.
The lesson is straightforward. When using AI for quality control, the number itself cannot be trusted without understanding how it was produced.
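A toy example of how one such hidden assumption moves the number: the same two coding sheets scored under two blank-field policies (field values here are illustrative).

```python
def pairwise_agreement(a, b, blank_policy="match"):
    """Percent agreement between two coders under an explicit blank-field policy.

    "match": a blank compared with a blank counts as agreement.
    "skip":  any comparison involving a blank is excluded from the denominator.
    """
    hits = total = 0
    for x, y in zip(a, b):
        if blank_policy == "skip" and "" in (x, y):
            continue
        total += 1
        hits += x == y
    return hits / total

coder_a = ["0.789", "",    "zero-shot", "BERT"]
coder_b = ["0.789", "N/A", "zero-shot", "BERT"]
print(pairwise_agreement(coder_a, coder_b, "match"))  # 0.75
print(pairwise_agreement(coder_a, coder_b, "skip"))   # 1.0
```

The same sheets score 75 percent or 100 percent depending solely on a policy decision the model never states, which is exactly the N/A-versus-blank ambiguity our coders ran into.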
Experiment 2: Using AI as a Diagnostic Tool
After the first round, we reframed the task. Instead of asking AI to audit coders or calculate agreement scores, we asked it to act as a diagnostic tool. The goal was to surface patterns that pointed to unclear definitions or inconsistencies in how the coding sheet was being used.
We tested several versions of Claude, GPT-5, and Gemini, trading a single universal prompt for prompts tuned to each model's behavioral constraints and native strengths.
View and compare the prompts we used.
What improved
This framing worked better. All three models independently surfaced the same major problem areas: model counting inconsistencies, inconsistent use of N/A versus blank fields, and differences in how metrics were named. Claude produced the richest qualitative analysis. Gemini suggested several straightforward reframings that directly informed protocol changes. GPT-5 produced broader observations but prioritized them less effectively.
These insights fed directly into updates to our coding sheet and reviewer guidance.
What failed (badly!)
Statistical computation remained unreliable, and in some cases clearly fabricated.
One model produced Krippendorff’s α values reported to two decimal places, complete with field level breakdowns. In a later response, the same model acknowledged that those numbers were estimates rather than calculations.
This behavior is worse than refusal. Numbers formatted like real statistics create false confidence. Any team relying on AI-generated metrics without independent verification is taking a significant risk.
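The safer pattern is to recompute any reported coefficient yourself. As a minimal, nominal-level sketch of Krippendorff's α using the standard coincidence-matrix formulation (for real analyses an established, tested package is preferable):

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists: all values coders assigned to one field
    (missing entries simply omitted). Units with fewer than two values
    are excluded, per the standard definition.
    """
    units = [u for u in units if len(u) >= 2]
    o = Counter()  # coincidence matrix: o[(c, k)] = coincidences of values c, k
    for u in units:
        m, cnt = len(u), Counter(u)
        for c in cnt:
            for k in cnt:
                pairs = cnt[c] * (cnt[k] - 1) if c == k else cnt[c] * cnt[k]
                o[(c, k)] += pairs / (m - 1)
    n_c = Counter()
    for (c, _), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    d_obs = sum(v for (c, k), v in o.items() if c != k) / n
    d_exp = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 - d_obs / d_exp

# Perfect agreement on two fields yields alpha = 1.0
print(krippendorff_alpha_nominal([["A", "A"], ["B", "B"]]))  # 1.0
```

Running a check like this against a model's reported α takes minutes and would have immediately exposed the estimated-not-calculated values we received.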
Prompt compliance also varied. Some models did not consistently follow instructions and occasionally skipped requested sections entirely.
What We Learned Across Both Experiments
AI Is Most Valuable for First-Stage Triage
Both experiments pointed to the same conclusion. In our experiments, AI systems are effective at identifying where disagreements cluster and surfacing those patterns quickly. They are far less effective at explaining why disagreements exist or how they should be resolved.
Our workflow uses AI only for the first stage of the process.

AI helps with Stage 1. The rest still requires human judgment.
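In code terms, the Stage 1 triage amounts to sorting fields into an auto-pass pile and a human-review pile. A minimal sketch under that framing:

```python
def triage(field_values):
    """Split fields by whether all coders entered identical values.

    `field_values` maps a field name to the list of entries from each coder.
    Exact-match fields can be auto-checked; everything else goes to humans.
    """
    auto_pass, needs_review = [], []
    for field, values in field_values.items():
        (auto_pass if len(set(values)) == 1 else needs_review).append(field)
    return auto_pass, needs_review

sheets = {
    "metric": ["QWK", "QWK", "QWK", "QWK"],
    "dataset": ["ASAP", "Scored Student Responses", "ASAP", "Synthetic Data"],
}
print(triage(sheets))  # (['metric'], ['dataset'])
```

Everything routed to `needs_review`, including the semantic-equivalence cases above, still lands on a human's desk.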
Verify Any Statistics Reported by LLMs
The fabricated statistics we observed are not an edge case. Large language models often “guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty” (Kalai et al., 2025). This is particularly dangerous for statistical metrics.
Teams using LLMs in research workflows should treat reported statistics as hypotheses to verify rather than results to cite. It is important to remember that LLMs do not inherently understand the context of a calculation the way a researcher would, so their implementations will not capture the research problem with the same complexity and comprehensiveness as a human researcher.
Diagnostic Framing Worked Better Than the Auditing Framing
When we asked AI to audit coders, the outputs tended to reduce the situation to judgments about agreement and disagreement, sometimes even assuming correctness. When we asked AI to diagnose patterns, the results were far more useful. The systems surfaced systematic divergences that actually helped us refine the coding protocol.
The shift from asking “Who is right?” to “What is unclear?” produced better outputs across every model we tested.
Tool Setup Matters More Than Model Selection
Across both experiments, practical setup differences mattered as much as model capability. Claude’s project workspace allowed us to upload the coding template once and reuse instructions each week. ChatGPT* generated a different report structure each session, which made week-over-week comparisons harder. Gemini’s ability to run code gave it the strongest claim to actual computation.
No single model performed best across all tasks.
Note: Early experiments used the free GPT-5 mini model. This version is not designed for deeper analytical tasks and was likely not the right tool for this type of evaluation.
What We Are Still Figuring Out
Several open questions remain. We still do not know whether AI can reliably recognize semantic equivalence without custom dictionaries. Anecdotally, it can. Sometimes. We are also still figuring out how to quickly detect fabricated statistics when they are formatted to look legitimate, and whether fast but unverifiable outputs are ever preferable to slower but reliable ones (spoiler: so far they’re not). Another open question is whether it is possible to combine the strengths of different models without creating an unwieldy multi model workflow.
We do not have all the answers yet.
The Honest Take
AI can save time on mechanical tasks, but takes some of it back through verification. It works well for specific, bounded tasks. It is not a general solution for research quality assurance.
The optimistic view is that AI triage could reduce low value manual checking when reviews scale to hundreds or thousands of papers. The realistic view is that the hardest part of the work, the part that determines validity, still depends on human judgment.
About the GenAI Insights Hub
This blog documents our work building the GenAI Evidence Hub for Educational Assessment, an open access analysis of more than 250 research studies.