GPT as a Measurement Tool
Authors: Hemanth Asirvatham, Elliott Mokski, Andrei Shleifer
Published: 2026-02-25 · NBER
Abstract
We present the GABRIEL software package, which uses GPT to quantify attributes in qualitative data (e.g., how pro-innovation a speech is). GPT is evaluated on classification and attribute-rating performance against 1000+ human-annotated tasks across a range of topics and data. We find that GPT as a measurement tool is accurate across domains and statistically indistinguishable from human evaluators.
Analysis
Research Question
Can GPT accurately quantify attributes in qualitative text data as a measurement tool, replacing or augmenting human annotation?
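The measurement pattern the paper evaluates can be sketched generically: prompt an LLM to rate an attribute on a fixed numeric scale, then parse the reply. This is an illustrative sketch, not GABRIEL's actual interface; the function names and prompt wording are hypothetical, and the API call itself is omitted.

```python
import re

def build_rating_prompt(text, attribute, scale=(0, 100)):
    """Ask the model for a single numeric rating of `attribute` in `text`.
    Hypothetical prompt template, not GABRIEL's actual wording."""
    lo, hi = scale
    return (
        f"On a scale from {lo} to {hi}, how {attribute} is the following text? "
        f"Reply with a single number only.\n\nText: {text}"
    )

def parse_rating(reply, scale=(0, 100)):
    """Extract the first number from the model's reply, clamped to the scale."""
    match = re.search(r"-?\d+(\.\d+)?", reply)
    if match is None:
        return None  # unparseable reply; flag for re-query
    lo, hi = scale
    return min(max(float(match.group()), lo), hi)

prompt = build_rating_prompt("We must fund new technologies.", "pro-innovation")
print(parse_rating("Rating: 87"))  # -> 87.0
```

In practice the prompt would be sent to a chat-completion endpoint and the reply fed through the parser; clamping and the `None` fallback guard against replies that drift off the requested format.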
Data
GABRIEL software package; 1000+ human-annotated tasks across Congressional remarks, social media, school curricula; 37,000 technologies dataset for tech adoption history
Identification Strategy
Comparison of GPT labels vs. human annotators across multiple classification and rating tasks; tests for prompt sensitivity and contamination
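The GPT-versus-human comparison above amounts to an annotator-agreement check. A minimal sketch, using Cohen's kappa as one plausible chance-corrected metric (the labels below are hypothetical, and the paper's exact agreement statistic is not shown here):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labeled at random with these marginals.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical binary labels: a human annotator vs. GPT on eight texts.
human = [1, 0, 1, 1, 0, 1, 0, 0]
gpt   = [1, 0, 1, 0, 0, 1, 0, 1]
print(cohens_kappa(human, gpt))  # -> 0.5
```

"Statistically indistinguishable from human evaluators" would then mean GPT-vs-human kappa is comparable to human-vs-human kappa on the same tasks.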
Main Findings
GPT as a measurement tool is accurate across domains and statistically indistinguishable from human evaluators; results are robust to prompting strategy and do not rely on data contamination; applications quantify Congressional innovation rhetoric, toxicity trends, and technology adoption speed (a tenfold decline in adoption lags since the industrial age)
Limitations
May still hallucinate on very niche domains; benchmark tasks may not capture the full range of nuanced economic text; output quality of the open-source GABRIEL package may vary across model providers
Connection to Current Research
Directly relevant to Project 2 (corporate political attention via earnings-call text): GABRIEL can complement or validate the Word2Vec/NLP approach for measuring partisan alignment and political attention in earnings calls. GPT could rate how "politically engaged" or "partisan-leaning" an earnings-call segment is, which is far more flexible than Word2Vec for nuanced constructs. Also relevant to Project 1 for measuring seller impatience in listing text via NLP.
Should pilot GABRIEL on a sample of earnings calls to check whether GPT-based scoring of political tone matches our Word2Vec partisan-alignment scores: if aligned, GPT validates the measure; if divergent, use it as a robustness check. The GABRIEL pipeline is much faster to implement than training a custom Word2Vec model.