Nature. 2024 Sep;633(8028):147-154.

doi: 10.1038/s41586-024-07856-5. Epub 2024 Aug 28.

AI generates covertly racist decisions about people based on their dialect

Valentin Hofmann¹ ² ³, Pratyusha Ria Kalluri⁴, Dan Jurafsky⁴, Sharese King⁵

Affiliations

¹ Allen Institute for AI, Seattle, WA, USA. valentinh@allenai.org.
² University of Oxford, Oxford, UK. valentinh@allenai.org.
³ LMU Munich, Munich, Germany. valentinh@allenai.org.
⁴ Stanford University, Stanford, CA, USA.
⁵ The University of Chicago, Chicago, IL, USA. sharesek@uchicago.edu.

PMID: 39198640
PMCID: PMC11374696
DOI: 10.1038/s41586-024-07856-5

Valentin Hofmann et al. Nature. 2024 Sep.

Abstract

Hundreds of millions of people now interact with language models, with uses ranging from help with writing¹,² to informing hiring decisions³. However, these language models are known to perpetuate systematic racial prejudices, making their judgements biased in problematic ways about groups such as African Americans⁴⁻⁷. Although previous research has focused on overt racism in language models, social scientists have argued that racism with a more subtle character has developed over time, particularly in the United States after the civil rights movement⁸,⁹. It is unknown whether this covert racism manifests in language models. Here, we demonstrate that language models embody covert racism in the form of dialect prejudice, exhibiting raciolinguistic stereotypes about speakers of African American English (AAE) that are more negative than any human stereotypes about African Americans ever experimentally recorded. By contrast, the language models' overt stereotypes about African Americans are more positive. Dialect prejudice has the potential for harmful consequences: language models are more likely to suggest that speakers of AAE be assigned less-prestigious jobs, be convicted of crimes and be sentenced to death. Finally, we show that current practices of alleviating racial bias in language models, such as human preference alignment, exacerbate the discrepancy between covert and overt stereotypes, by superficially obscuring the racism that language models maintain on a deeper level. Our findings have far-reaching implications for the fair and safe use of language technology.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. Probing AI dialect prejudice.**
a, We used texts in SAE (green) and AAE (blue). In the meaning-matched setting (illustrated here), the texts have the same meaning, whereas they have different meanings in the non-meaning-matched setting. b, We embedded the SAE and AAE texts in prompts that asked for properties of the speakers who uttered the texts. c, We separately fed the prompts with the SAE and AAE texts into the language models. d, We retrieved and compared the predictions for the SAE and AAE inputs, here illustrated by five adjectives from the Princeton Trilogy. See Methods for more details.
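The probing steps in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the prompt template, the placeholder texts and adjectives, and the `toy_adjective_prob` scorer (which returns made-up numbers in place of a real language model) are all hypothetical, and comparing the two predictions with a log ratio is an assumption that the caption itself does not state.

```python
import math
from typing import Callable, Dict, List

# Hypothetical prompt template; the actual prompts are described in the paper's Methods.
TEMPLATE = 'A person who says "{text}" is'


def association_scores(
    aae_text: str,
    sae_text: str,
    adjectives: List[str],
    adjective_prob: Callable[[str, str], float],
) -> Dict[str, float]:
    """Compare predictions for the same adjective after the AAE and SAE prompts.

    Returns log(p(adjective | AAE prompt) / p(adjective | SAE prompt)).
    Positive values mean the scorer ties the adjective more to the AAE text.
    """
    aae_prompt = TEMPLATE.format(text=aae_text)
    sae_prompt = TEMPLATE.format(text=sae_text)
    return {
        adj: math.log(adjective_prob(aae_prompt, adj) / adjective_prob(sae_prompt, adj))
        for adj in adjectives
    }


def toy_adjective_prob(prompt: str, adjective: str) -> float:
    """Toy stand-in for a language model. The numbers are made up for illustration."""
    guise = "aae" if "<AAE text>" in prompt else "sae"
    table = {
        ("aae", "adj_a"): 0.2, ("sae", "adj_a"): 0.1,
        ("aae", "adj_b"): 0.05, ("sae", "adj_b"): 0.2,
    }
    return table[(guise, adjective)]


scores = association_scores("<AAE text>", "<SAE text>", ["adj_a", "adj_b"], toy_adjective_prob)
for adjective, score in scores.items():
    print(f"{adjective}: {score:+.3f}")
```

In a real experiment, the toy scorer would be replaced by a call that reads the model's probability for the adjective as the continuation of each prompt.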

**Fig. 2. Covert stereotypes in language models.**
a, Strongest stereotypes about African Americans in humans in different years, strongest overt stereotypes about African Americans in language models, and strongest covert stereotypes about speakers of AAE in language models. Colour coding as positive (green) and negative (red) is based on ref. . Although the overt stereotypes of language models are overall more positive than the human stereotypes, their covert stereotypes are more negative. b, Agreement of stereotypes about African Americans in humans with both overt and covert stereotypes about African Americans in language models. The black dotted line shows chance agreement using a random bootstrap. Error bars represent the standard error across different language models and prompts (n = 36). The language models’ overt stereotypes agree most strongly with current human stereotypes, which are the most positive experimentally recorded ones, but their covert stereotypes agree most strongly with human stereotypes from the 1930s, which are the most negative experimentally recorded ones. c, Stereotype strength for individual linguistic features of AAE. Error bars represent the standard error across different language models, model versions and prompts (n = 90). The linguistic features examined are: use of invariant ‘be’ for habitual aspect; use of ‘finna’ as a marker of the immediate future; use of (unstressed) ‘been’ for SAE ‘has been’ or ‘have been’ (present perfects); absence of the copula ‘is’ and ‘are’ for present-tense verbs; use of ‘ain’t’ as a general preverbal negator; orthographic realization of word-final ‘ing’ as ‘in’; use of invariant ‘stay’ for intensified habitual aspect; and absence of inflection in the third-person singular present tense. The measured stereotype strength is significantly above zero for all examined linguistic features, indicating that they all evoke raciolinguistic stereotypes in language models, although there is a lot of variation between individual features. See the Supplementary Information (‘Feature analysis’) for more details and analyses.
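The error bars in these figures are standard errors over n measurements. As a rough sketch, the standard error of the mean can be computed as below. The use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the captions do not say which estimator was used.

```python
import math
import statistics
from typing import Sequence


def standard_error(values: Sequence[float]) -> float:
    """Standard error of the mean: sample standard deviation divided by sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a standard error")
    return statistics.stdev(values) / math.sqrt(n)


# Illustrative input only; these are not measurements from the paper.
example = [1.0, 2.0, 3.0, 4.0]
print(round(standard_error(example), 4))
```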

**Fig. 3. Impact of covert racism on AI decisions.**
a, Association of different occupations with AAE or SAE. Positive values indicate a stronger association with AAE and negative values indicate a stronger association with SAE. The bottom five occupations (those associated most strongly with SAE) mostly require a university degree, but this is not the case for the top five (those associated most strongly with AAE). b, Prestige of occupations that language models associate with AAE (positive values) or SAE (negative values). The shaded area shows a 95% confidence band around the regression line. The association with AAE or SAE predicts the occupational prestige. Results for individual language models are provided in Extended Data Fig. 2. c, Relative increase in the number of convictions and death sentences for AAE versus SAE. Error bars represent the standard error across different model versions, settings and prompts (n = 24 for GPT2, n = 12 for RoBERTa, n = 24 for T5, n = 6 for GPT3.5 and n = 6 for GPT4). In cases of small sample size (n ≤ 10 for GPT3.5 and GPT4), we plotted the individual results as overlaid dots. T5 does not contain the tokens ‘acquitted’ or ‘convicted’ in its vocabulary and is therefore excluded from the conviction analysis. Detrimental judicial decisions systematically go up for speakers of AAE compared with speakers of SAE.

**Fig. 4. Resolvability of dialect prejudice.**
a, Language modelling perplexity and stereotype strength on AAE text as a function of model size. Perplexity is a measure of how successful a language model is at processing a particular text; a lower result is better. For language models for which perplexity is not well-defined (RoBERTa and T5), we computed pseudo-perplexity instead (dotted line). Error bars represent the standard error across different models of a size class and AAE or SAE texts (n = 9,057 for small, n = 6,038 for medium, n = 15,095 for large and n = 3,019 for very large). For covert stereotypes, error bars represent the standard error across different models of a size class, settings and prompts (n = 54 for small, n = 36 for medium, n = 90 for large and n = 18 for very large). For overt stereotypes, error bars represent the standard error across different models of a size class and prompts (n = 27 for small, n = 18 for medium, n = 45 for large and n = 9 for very large). Although larger language models are better at processing AAE (left), they are not less prejudiced against speakers of it. Indeed, larger models show more covert prejudice than smaller models (right). By contrast, larger models show less overt prejudice against African Americans (right). In other words, increasing scale does make models better at processing AAE and at avoiding prejudice against overt mentions of African Americans, but it makes them more linguistically prejudiced. b, Change in stereotype strength and favourability as a result of training with human feedback (HF) for covert and overt stereotypes. Error bars represent the standard error across different prompts (n = 9). HF weakens overt stereotypes (left) and makes them more favourable (right), but it does neither for covert stereotypes. c, Top overt and covert stereotypes about African Americans in GPT3, trained without HF, and GPT3.5, trained with HF. Colour coding as positive (green) and negative (red) is based on ref. . The overt stereotypes get substantially more positive as a result of HF training in GPT3.5, but there is no visible change in favourability for the covert stereotypes.
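For readers unfamiliar with perplexity, the sketch below uses the standard textbook definition: the exponential of the average negative log-probability per token. The per-token log-probabilities are assumed to come from some language model. The function name and the example values are illustrative, and the pseudo-perplexity variant used for RoBERTa and T5 is not covered here.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities; lower is better."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_negative_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_negative_log_prob)


# Illustrative input: a model that gives every token probability 0.25
# is as uncertain as a uniform choice among four options.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```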

**Extended Data Fig. 1. Weighted average favourability of top stereotypes about African Americans in humans and top overt as well as covert stereotypes about African Americans in language models (LMs).**
The overt stereotypes are more favourable than the reported human stereotypes, except for GPT2. The covert stereotypes are substantially less favourable than the least favourable reported human stereotypes from 1933. Results without weighting, which are very similar, are provided in Supplementary Fig. 6.

**Extended Data Fig. 2. Prestige of occupations associated with AAE (positive values) versus SAE (negative values), for individual language models.**
The shaded areas show 95% confidence bands around the regression lines. The association with AAE versus SAE is negatively correlated with occupational prestige, for all language models. We cannot conduct this analysis with GPT4 since the OpenAI API does not give access to the probabilities for all occupations.

References

1. Zhao, W. et al. WildChat: 1M ChatGPT interaction logs in the wild. In Proc. Twelfth International Conference on Learning Representations (OpenReview.net, 2024).
2. Zheng, L. et al. LMSYS-Chat-1M: a large-scale real-world LLM conversation dataset. In Proc. Twelfth International Conference on Learning Representations (OpenReview.net, 2024).
3. Gaebler, J. D., Goel, S., Huq, A. & Tambe, P. Auditing the use of language models to guide hiring decisions. Preprint at https://arxiv.org/abs/2404.03086 (2024).
4. Sheng, E., Chang, K.-W., Natarajan, P. & Peng, N. The woman worked as a babysitter: on biases in language generation. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (eds Inui, K. et al.) 3407–3412 (Association for Computational Linguistics, 2019).
5. Nangia, N., Vania, C., Bhalerao, R. & Bowman, S. R. CrowS-Pairs: a challenge dataset for measuring social biases in masked language models. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (eds Webber, B. et al.) 1953–1967 (Association for Computational Linguistics, 2020).
