The Model Counting Problem: Moving toward Consistency in AI Research Analysis
Contributing Authors: Melanie Kurimchak, Learning Data Insights | Aaron Wong, University of Minnesota | Maggie Beiting-Parish, CUNY Graduate Center/EdAIfy | Kristen DiCerbo, Khan Academy | John Whitmer, Learning Data Insights
AI Disclosure: NotebookLM was used in the initial drafting of this post; human editing and review were conducted throughout.
We are documenting our work building an evidence hub for AI in educational assessment. All content CC BY-SA. Read the welcome post.
What We Noticed: The "Simple" Counting Problem
As we began designing our coding sheet for the GenAI Evidence Hub, our coding team hit a question that looked simple and turned out to be foundational: how many models did this study actually test?

We encountered a paper, Improve LLM-based Automatic Essay Scoring with Linguistic Features (Zhaoyi Joey Hou et al., 2025), that compared GPT-4 and Mistral 7B for automated essay scoring. Mistral 7B was evaluated under several conditions: no features, the top one linguistic feature (unique words), the top three features, and the top ten features. GPT-4 was evaluated with no features and with the top ten. The study also included a trained BERT model as an additional baseline.
When you consider all these variations, the answer to a basic counting question stops being obvious. When we assigned this paper to four experienced coders during our pilot phase, we got three different answers:
One coder counted 2 models (only the base architectures, GPT-4 and Mistral 7B), another counted 3 (adding BERT as a trained baseline), and two more counted 6 (treating each technique and feature condition as its own model). So who is right? Everyone. And that is exactly the problem. When a paper says "we tested eight models," readers need to know whether that means one architecture has a broad advantage over others or whether certain features contribute to a model's success.
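The three counts can be derived mechanically from the same list of runs. A minimal sketch in Python: the run list below is our paraphrase of the conditions described in the Hou et al. paper, and the tags are illustrative, not the paper's own labels.

```python
# Paraphrased evaluation runs from the Hou et al. (2025) study,
# tagged with base architecture and feature condition.
runs = [
    {"base": "Mistral 7B", "condition": "no features"},
    {"base": "Mistral 7B", "condition": "top 1 feature"},
    {"base": "Mistral 7B", "condition": "top 3 features"},
    {"base": "Mistral 7B", "condition": "top 10 features"},
    {"base": "GPT-4",      "condition": "no features"},
    {"base": "GPT-4",      "condition": "top 10 features"},
    {"base": "BERT",       "condition": "trained baseline"},
]

# Rule A: count only the generative base architectures -> 2
gen_ai = {r["base"] for r in runs if r["base"] != "BERT"}

# Rule B: count every base architecture, baseline included -> 3
all_bases = {r["base"] for r in runs}

# Rule C: count each (architecture, condition) pair among the
# GenAI runs -> 6
gen_ai_runs = [r for r in runs if r["base"] != "BERT"]

print(len(gen_ai), len(all_bases), len(gen_ai_runs))  # 2 3 6
```

All three coders applied a defensible rule; the disagreement lives entirely in which filter and which grouping you choose.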
Why is this important? We are still very early in using AI for educational assessment, and the field is evolving rapidly. If we are to build on successful research, we need descriptions of methods that are specific enough for others to identify them, and consistent enough that results can be summarized across studies. Otherwise we will make slow progress and keep reinventing the same results.
What We Tried: The Early Definitions
At first, we attempted to capture this data with a single, broad column labeled "Total Models" and a checkbox for "GenAI." We assumed that if we provided a definition of generative AI, coders would intuitively know which models to include; we quickly learned that this is not an effective methodology.

In early iterations of our coding sheet, we realized that without explicit fields for experimental conditions, we were losing the nuance of why a model performed well. Was it the architecture? Or was it the prompting strategy?
We also ran into a specific classification debate during team meetings: does BERT count? Some coders considered it a "model"; others considered it a "baseline" or "field standard," because its architecture contains only the encoder portion; it isn't generative in the same way a prompt-based model (e.g., GPT-4) is.
How We Iterated
We moved our coding sheet through successive versions to address these ambiguity problems. Here's how we refined our logic based on what we found:

- Splitting "Models" from "Techniques." We realized we could not mash architectures and prompting strategies into a single number. We introduced separate columns for "Total Models Tested" and "Prompting Techniques." This allowed us to capture that a study might use one model (e.g., GPT-4) but test it with multiple techniques (zero-shot, few-shot, chain-of-thought).
- Identifying the Baseline. The BERT debate taught us we needed to strictly define what counts as the "experimental" GenAI versus the "control." We added specific fields to track human raters, existing models (like BERT), and field standards/random chance separately from the GenAI models being evaluated.
- Grounding Results in Research Questions. We noticed that a single paper often answers multiple questions using different sets of models. To address this, we shifted from coding the "Paper" to coding the "Research Question." This lets us record that for Research Question A the authors compared 2 models, while for Research Question B they compared 5.
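To make that shift concrete, here is a minimal sketch of what a per-research-question coding record might look like. The class and field names are hypothetical illustrations, not our actual sheet's column headers.

```python
from dataclasses import dataclass, field

# Hypothetical per-research-question coding record. Models,
# prompting techniques, and baselines are kept in separate fields
# so none of them get mashed into a single count.
@dataclass
class ResearchQuestionRecord:
    paper_id: str
    question: str
    genai_models: list[str]          # experimental GenAI architectures
    prompting_techniques: list[str]  # e.g., zero-shot, few-shot, CoT
    baselines: list[str] = field(default_factory=list)  # BERT, human raters, chance

# Example record for one research question in the Hou et al. paper.
rq_a = ResearchQuestionRecord(
    paper_id="hou2025",
    question="Do linguistic features improve LLM essay scoring?",
    genai_models=["GPT-4", "Mistral 7B"],
    prompting_techniques=["no features", "top 10 features"],
    baselines=["BERT (trained)"],
)

print(len(rq_a.genai_models))  # 2
```

Because each research question gets its own record, "Total Models Tested" becomes a per-question derived value rather than a single ambiguous number for the whole paper.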
Where We Are Now: We Are Not There Yet
At first, we thought our approach of tracking base models separately from the variations or techniques used on them would solve the problem. However, our most recent team discussions (January 2026) revealed that we are still disagreeing on the unit of analysis.

In our internal comments and revision logs, we are currently wrestling with "edge cases" that defy simple definitions:
- In Crosslingual Content Scoring in Five Languages Using Machine-Translation and Multilingual Transformer Models (Horbach et al., 2023), the study tested 2 models across 7 original languages and 5 translated languages. Does this count as 2 models (just the base architectures) or 70 models (every permutation of model, language, and training data)?
- In Automatic Scoring of Students’ Science Writing Using Hybrid Neural Network (Latif et al., 2024), an LLM is just one component of a larger pipeline. Should this count as testing an LLM, or as evaluating a completely new architecture?
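For the Horbach et al. case, the gap between the two readings is pure combinatorics. A quick sketch of one way to arrive at 70 (the multiplication is our interpretation of the count, using the numbers from the study description above):

```python
# One reading of the "70 models" count: every combination of
# base model, original language, and translated training language.
base_models = 2
original_languages = 7
translated_languages = 5

permutations = base_models * original_languages * translated_languages
print(permutations)  # 70
```

The same paper is "2 models" under one counting rule and "70 models" under another, which is exactly why the unit of analysis has to be written down explicitly.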
Check out our newest coding sheet definitions and view the full changelog.
Next Steps: Decision Trees over Definitions
We are learning that definitions alone aren't enough; we need strict logic flows. We are currently implementing a "Model Counting Decision Tree" for our reviewers:

- Is it a base architecture? (Yes = count as 1.)
- Is it the same architecture fine-tuned on different data? (The team is currently deciding: does French-BERT count as a different model than English-BERT?)
- Is it a prompt variation? (Yes = list in the 'Conditions' column; do not add to the count.)
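The decision tree above can be sketched as a single classification function. This is an illustration of the logic, not our production tooling, and because the fine-tuning question is still open, that rule is exposed as a flag rather than hard-coded.

```python
# Hypothetical sketch of the Model Counting Decision Tree.
def classify_run(is_base_architecture: bool,
                 is_finetuned_variant: bool,
                 is_prompt_variation: bool,
                 count_finetunes_separately: bool = False) -> str:
    if is_prompt_variation:
        return "condition"   # list in 'Conditions' column; not counted
    if is_finetuned_variant:
        # Still undecided by the team: is French-BERT a different
        # model than English-BERT? Controlled by the flag for now.
        return "model" if count_finetunes_separately else "condition"
    if is_base_architecture:
        return "model"       # counts as 1 toward Total Models Tested
    return "needs review"    # edge case: escalate to team discussion

print(classify_run(True, False, False))        # model
print(classify_run(False, False, True))        # condition
print(classify_run(False, True, False, True))  # model
```

Encoding the flow as a function has a side benefit: the one genuinely unresolved branch shows up as an explicit parameter instead of hiding inside a prose definition.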