Sci Rep. 2025 Nov 17;15(1):40122.

doi: 10.1038/s41598-025-23798-y.

LLMs outperform outsourced human coders on complex textual analysis

Vicente J Bermejo^#¹, Andrés Gago^#², Ramiro H Gálvez^#², Nicolás Harari^#³

Affiliations

¹ ESADE Business School, Universitat Ramon Llull, Barcelona, 08034, Spain. vicente.bermejo@esade.edu.
² Universidad Torcuato Di Tella, Buenos Aires, 1428, Argentina.
³ Boston University, Department of Economics, Boston, MA, 02215, United States.

^# Contributed equally.

PMID: 41249236
PMCID: PMC12623721
DOI: 10.1038/s41598-025-23798-y

Vicente J Bermejo et al. Sci Rep. 2025.

Abstract

This paper evaluates the effectiveness of large language models (LLMs) in extracting complex information from text data. Using a corpus of Spanish news articles, we compare how accurately various LLMs and outsourced human coders reproduce expert annotations on five natural language processing tasks, ranging from named entity recognition to identifying nuanced political criticism in news articles. We find that LLMs consistently outperform outsourced human coders, particularly in tasks requiring deep contextual understanding. These findings suggest that current LLM technology offers researchers without programming expertise a cost-effective alternative for sophisticated text analysis.

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

**Fig. 1**
Overall performance, across tasks and coding strategies. This figure displays the overall performance across all tasks and coding strategies. For T1, the figure shows the Macro score; for T2, the Mean Absolute Error (MAE); and for T3, T4, and T5, it shows the accuracy. For T2, a lower number denotes better performance (i.e., smaller errors in identifying the correct number of municipalities), while for the remaining tasks, higher numbers indicate better performance (i.e., closer alignment with the expert benchmark). The “All correct” panel indicates the proportion of news articles for which all tasks were completed entirely correctly, broken down by coding strategy.

**Fig. 2**
Performance by article difficulty, across tasks and coding strategies. This figure displays performance across all tasks and coding strategies, broken down by article difficulty. For T1, the figure shows the Macro score; for T2, the Mean Absolute Error (MAE); and for T3, T4, and T5, it shows the accuracy. For T2, a lower number denotes better performance (i.e., smaller counting errors), while for the remaining tasks, higher numbers indicate better performance (i.e., greater agreement with gold-standard labels). The “All correct” panel indicates the proportion of news articles for which all tasks were completed entirely correctly, broken down by coding strategy.

**Fig. 3**
Performance by article length, across tasks and coding strategies. This figure displays the overall performance across all tasks and coding strategies classified by article length. For T1, the figure shows the Macro score; for T2, the Mean Absolute Error (MAE); and for T3, T4, and T5, it shows the accuracy. For T2, a lower number denotes better performance (i.e., fewer numeric discrepancies), while for the remaining tasks, higher numbers indicate better performance (i.e., greater match with expert annotations). The “All correct” panel indicates the proportion of news articles for which all tasks were completed entirely correctly, broken down by coding strategy.

**Fig. 4**
Human coders’ performance, statistical significance, and task progression. This figure shows the performance distribution by task order for the outsourced human coders. For T1, the figure shows the Macro score; for T2, the Mean Absolute Error (MAE); and for T3, T4, and T5, it shows the accuracy. For T2, a lower number denotes better performance (i.e., more accurate counts), while for the remaining tasks, higher numbers indicate better performance (i.e., closer alignment with expert judgments). The “All correct” panel indicates the proportion of news articles for which all tasks were completed entirely correctly. For T1, T3, T4, and T5, observed values above the black line indicate performance significantly better than random chance in the permutation tests (5% level), while for T2, observed values below the line indicate significantly better performance.

**Fig. 5**
Performance comparison between LLMs and high-performing human coders. This figure compares the performance of high-performing human coders (above median aggregate performance) against LLMs across all tasks and coding strategies. For T1, the figure shows the Macro score; for T2, the Mean Absolute Error (MAE); and for T3, T4, and T5, it shows the accuracy. For T2, a lower number denotes better performance (i.e., smaller numeric deviation from the correct count), while for the remaining tasks, higher numbers indicate better performance (i.e., higher agreement with expert coders). The “All correct” panel indicates the proportion of news articles for which all tasks were completed entirely correctly, broken down by coding strategy.
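
The captions above report three kinds of numbers besides the T1 Macro score: MAE for the count task (T2), accuracy for the classification tasks (T3 to T5), and the share of articles on which every task was completed correctly. The following is a minimal sketch of how such metrics could be computed against expert labels; it is not the authors' code. The function names and the toy data are hypothetical, and treating a T2 answer as "correct" only when the count matches the expert exactly is an assumption made for illustration.

```python
from statistics import mean


def mean_absolute_error(expert, coder):
    """Average absolute gap between coder and expert counts (lower is better)."""
    return mean(abs(e - c) for e, c in zip(expert, coder))


def accuracy(expert, coder):
    """Share of articles where the coder's label equals the expert's label."""
    return mean(1.0 if e == c else 0.0 for e, c in zip(expert, coder))


def all_correct_share(per_article_flags):
    """Share of articles on which every task was answered correctly.

    per_article_flags holds one list of booleans per article,
    with one entry per task.
    """
    return mean(1.0 if all(flags) else 0.0 for flags in per_article_flags)


# Hypothetical toy data: four articles, one count task and one yes/no task.
expert_counts = [2, 0, 5, 1]
coder_counts = [2, 1, 3, 1]
expert_labels = ["yes", "no", "no", "yes"]
coder_labels = ["yes", "no", "yes", "yes"]

count_mae = mean_absolute_error(expert_counts, coder_counts)
label_acc = accuracy(expert_labels, coder_labels)

# Assumption: a count answer is "correct" only if it matches the expert exactly.
flags = [
    [ec == cc, el == cl]
    for ec, cc, el, cl in zip(expert_counts, coder_counts, expert_labels, coder_labels)
]
all_correct = all_correct_share(flags)

print(count_mae, label_acc, all_correct)
```

Because MAE is an error and the other two are agreement rates, they point in opposite directions, which is why the captions note that lower is better for T2 and higher is better elsewhere.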
