GPT as a Measurement Tool

0.0 / 10 · NBER · Text Methods · Corp. Political

Authors: Hemanth Asirvatham, Elliott Mokski, Andrei Shleifer

Published: 2026-02-25 · NBER


Abstract

We present the GABRIEL software package, which uses GPT to quantify attributes in qualitative data (e.g. how pro-innovation a speech is). GPT is evaluated on classification and attribute rating performance against 1000+ human-annotated tasks across a range of topics and data. We find that GPT as a


Analysis

Research Question

Can GPT accurately quantify attributes in qualitative text data as a measurement tool, replacing or augmenting human annotation?

Data

1000+ human-annotated tasks spanning Congressional remarks, social media, and school curricula; a dataset of 37,000 technologies used to trace technology adoption history. All measurements are produced with the GABRIEL software package.

Identification Strategy

Comparison of GPT labels with human annotations across multiple classification and attribute-rating tasks; additional tests for sensitivity to prompt wording and for data contamination.
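The comparison described above can be illustrated with two simple agreement measures: accuracy for classification tasks and Pearson correlation for attribute ratings. This is a minimal sketch on made-up toy data, not the paper's evaluation procedure; the function names, labels, and numbers are all assumptions introduced for illustration.

```python
from math import sqrt


def accuracy(pred, gold):
    """Share of items where the model label equals the human label."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy data (hypothetical): one classification task and one rating task.
human_labels = ["toxic", "ok", "ok", "toxic", "ok"]
gpt_labels = ["toxic", "ok", "toxic", "toxic", "ok"]
human_ratings = [10, 40, 55, 80, 95]  # e.g. "how pro-innovation", 0-100
gpt_ratings = [15, 35, 60, 70, 90]

label_accuracy = accuracy(gpt_labels, human_labels)
rating_corr = pearson(gpt_ratings, human_ratings)
print(f"accuracy={label_accuracy:.2f}, correlation={rating_corr:.3f}")
```

A fuller check would also compare GPT-human agreement with human-human agreement on the same items, since "indistinguishable from human evaluators" is a statement relative to how much annotators agree with each other.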

Main Findings

GPT as a measurement tool is accurate across domains and statistically indistinguishable from human evaluators. Results are robust to prompting strategy and do not rely on data contamination. Applications: quantifying Congressional innovation rhetoric, toxicity trends, and technology adoption speed (a tenfold decline in adoption lag over the industrial age).

Limitations

May still hallucinate in very niche domains; benchmark tasks may not capture the full range of nuanced economic text; quality of the open-source GABRIEL package may vary across providers.


Connection to Current Research

Directly relevant to Project 2 (corporate political attention via earnings call text): GABRIEL can complement or validate the Word2Vec/NLP approach for measuring partisan alignment and political attention in earnings calls. We could use GPT to rate how “politically engaged” or “partisan-leaning” an earnings call segment is, which is much more flexible than Word2Vec for nuanced constructs. Also relevant to Project 1, for NLP-based measurement of seller impatience in listing text.

Key Takeaway

We should pilot GABRIEL on a sample of earnings calls to check whether GPT-based scoring of political tone matches our Word2Vec partisan alignment scores. If the two measures align, GPT serves as validation; if they diverge, GPT scores can be reported as a robustness check. A GABRIEL pipeline is also much faster to implement than training a custom Word2Vec model.
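The pilot comparison could be summarized with a rank correlation between the two measures on the same earnings-call segments. The sketch below assumes both sets of scores have already been computed and does not call GABRIEL itself; the toy scores and the 0.6 cutoff are hypothetical placeholders, not values from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six earnings-call segments.
gpt_scores = [12, 55, 71, 30, 90, 48]                 # GPT rating, 0-100
w2v_scores = [-0.40, 0.10, 0.35, -0.05, 0.60, 0.20]   # Word2Vec alignment

ALIGNMENT_CUTOFF = 0.6  # assumed threshold; choose before running the pilot

rho = spearman(gpt_scores, w2v_scores)
role = "validation" if rho >= ALIGNMENT_CUTOFF else "robustness check"
print(f"Spearman rho={rho:.3f} -> use GPT scores as {role}")
```

Rank correlation is used here because the two measures sit on different scales, so only their ordering of segments is directly comparable.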