Sci Rep. 2025 Nov 17;15(1):40122.
doi: 10.1038/s41598-025-23798-y.

LLMs outperform outsourced human coders on complex textual analysis

Vicente J Bermejo et al. Sci Rep. 2025.

Abstract

This paper evaluates the effectiveness of large language models (LLMs) in extracting complex information from text data. Using a corpus of Spanish news articles, we compare how accurately various LLMs and outsourced human coders reproduce expert annotations on five natural language processing tasks, ranging from named entity recognition to identifying nuanced political criticism in news articles. We find that LLMs consistently outperform outsourced human coders, particularly in tasks requiring deep contextual understanding. These findings suggest that current LLM technology offers researchers without programming expertise a cost-effective alternative for sophisticated text analysis.

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Overall performance, across tasks and coding strategies. This figure displays the overall performance across all tasks and coding strategies. For T1, the figure shows the Macro F1 score; for T2, the Mean Absolute Error (MAE); and for T3, T4, and T5, it shows the accuracy. For T2, a lower number denotes better performance (i.e., smaller errors in identifying the correct number of municipalities), while for the remaining tasks, higher numbers indicate better performance (i.e., closer alignment with the expert benchmark). The “All correct” panel indicates the proportion of news articles for which all tasks were completed entirely correctly, broken down by coding strategy.
Fig. 2
Performance by article difficulty, across tasks and coding strategies. This figure displays the overall performance across all tasks and coding strategies classified by task difficulty. For T1, the figure shows the Macro F1 score; for T2, the Mean Absolute Error (MAE); and for T3, T4, and T5, it shows the accuracy. For T2, a lower number denotes better performance (i.e., smaller counting errors), while for the remaining tasks, higher numbers indicate better performance (i.e., greater agreement with gold-standard labels). The “All correct” panel indicates the proportion of news articles for which all tasks were completed entirely correctly, broken down by coding strategy.
Fig. 3
Performance by article length, across tasks and coding strategies. This figure displays the overall performance across all tasks and coding strategies classified by article length. For T1, the figure shows the Macro F1 score; for T2, the Mean Absolute Error (MAE); and for T3, T4, and T5, it shows the accuracy. For T2, a lower number denotes better performance (i.e., fewer numeric discrepancies), while for the remaining tasks, higher numbers indicate better performance (i.e., greater match with expert annotations). The “All correct” panel indicates the proportion of news articles for which all tasks were completed entirely correctly, broken down by coding strategy.
Fig. 4
Human coders’ performance, statistical significance, and task progression. This figure shows the performance distribution by task order for the outsourced human coders. For T1, the figure shows the Macro F1 score; for T2, the Mean Absolute Error (MAE); and for T3, T4, and T5, it shows the accuracy. For T2, a lower number denotes better performance (i.e., more accurate counts), while for the remaining tasks, higher numbers indicate better performance (i.e., closer alignment with expert judgments). The “All correct” panel indicates the proportion of news articles for which all tasks were completed entirely correctly. For T1, T3, T4, and T5, observed values above the black line indicate performance significantly better than random chance in the permutation tests (5% level), while for T2, observed values below the line indicate significantly better performance.
Fig. 5
Performance comparison between LLMs and high-performing human coders. This figure compares the performance of high-performing human coders (above median aggregate performance) against LLMs across all tasks and coding strategies. For T1, the figure shows the Macro F1 score; for T2, the Mean Absolute Error (MAE); and for T3, T4, and T5, it shows the accuracy. For T2, a lower number denotes better performance (i.e., smaller numeric deviation from the correct count), while for the remaining tasks, higher numbers indicate better performance (i.e., higher agreement with expert coders). The “All correct” panel indicates the proportion of news articles for which all tasks were completed entirely correctly, broken down by coding strategy.
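
The figure captions report four summary quantities: a macro-averaged F1 score (T1), the mean absolute error (T2), accuracy (T3 to T5), and the share of articles with every task correct. The following is a minimal sketch of how such quantities could be computed against expert labels. It is not the authors' code: the function names, the flat label encoding, and the per-article boolean layout are all assumptions made for illustration, and the paper may define the T1 score over entities rather than flat labels.

```python
from typing import Hashable, Sequence


def macro_f1(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    labels = set(gold) | set(pred)
    if not labels:
        raise ValueError("macro_f1 needs at least one labelled item")
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is positive here
        # because each label occurs at least once in gold or pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


def mean_absolute_error(gold: Sequence[float], pred: Sequence[float]) -> float:
    """Average absolute gap between predicted and expert counts (lower is better)."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def accuracy(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Share of items whose predicted label equals the expert label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def all_correct_share(per_article: Sequence[Sequence[bool]]) -> float:
    """Share of articles for which every task flag is True."""
    return sum(all(flags) for flags in per_article) / len(per_article)


if __name__ == "__main__":
    # Toy data only, not results from the paper.
    print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
    print(mean_absolute_error([3, 1, 0], [2, 1, 2]))
    print(accuracy([1, 0, 1, 1], [1, 1, 1, 0]))
    print(all_correct_share([[True, True], [True, False]]))
```

Because MAE is an error measure while the other three are agreement measures, any comparison across panels has to flip the direction for T2, as the captions note.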
