Observational Study
J Med Internet Res. 2025 Mar 5:27:e67891.
doi: 10.2196/67891.

Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study

Ryan K McBain et al. J Med Internet Res.

Abstract

Background: With suicide rates in the United States at an all-time high, individuals experiencing suicidal ideation are increasingly turning to large language models (LLMs) for guidance and support.

Objective: The objective of this study was to assess the competency of 3 widely used LLMs to distinguish appropriate versus inappropriate responses when engaging individuals who exhibit suicidal ideation.

Methods: This observational, cross-sectional study evaluated responses to the revised Suicidal Ideation Response Inventory (SIRI-2) generated by ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Data collection and analyses were conducted in July 2024. A common training module for mental health professionals, SIRI-2 provides 24 hypothetical scenarios in which a patient exhibits depressive symptoms and suicidal ideation, followed by 2 clinician responses. Clinician responses were scored from -3 (highly inappropriate) to +3 (highly appropriate). All 3 LLMs were provided with a standardized set of instructions to rate clinician responses. We compared LLM responses to those of expert suicidologists, conducting linear regression analyses and converting LLM responses to z scores to identify outliers (z score >1.96 or <-1.96; P<.05). Furthermore, we compared final SIRI-2 scores to those produced by health professionals in prior studies.
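The outlier rule described in the Methods can be illustrated with a minimal Python sketch. The abstract does not state how the z scores were standardized; the sketch assumes each LLM rating is standardized against the expert mean and standard deviation for the same item. The `ItemRating` type, the item labels, and all numeric values are hypothetical and are not study data.

```python
from dataclasses import dataclass

Z_CUTOFF = 1.96  # two-sided threshold given in the Methods (P<.05)


@dataclass
class ItemRating:
    item_id: str
    expert_mean: float  # mean expert rating on the -3..+3 scale (hypothetical)
    expert_sd: float    # SD of expert ratings; assumed to be > 0
    llm_rating: float   # rating returned by the LLM for the same item


def z_score(r: ItemRating) -> float:
    # Assumption: standardize the LLM rating against the expert distribution.
    return (r.llm_rating - r.expert_mean) / r.expert_sd


def flag_outliers(ratings: list[ItemRating]) -> list[str]:
    # An item is an outlier when |z| exceeds the cutoff in either direction.
    return [r.item_id for r in ratings if abs(z_score(r)) > Z_CUTOFF]


# Hypothetical ratings for 3 clinician responses (not taken from the study).
ratings = [
    ItemRating("1A", expert_mean=-2.0, expert_sd=0.5, llm_rating=-1.5),
    ItemRating("1B", expert_mean=2.0, expert_sd=0.5, llm_rating=3.0),
    ItemRating("2A", expert_mean=-1.0, expert_sd=1.0, llm_rating=1.5),
]

outliers = flag_outliers(ratings)
print(outliers)
```

Under this assumption, an outlier is a response that an LLM rated very differently from the expert panel, relative to how much the experts themselves disagreed on that item.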

Results: All 3 LLMs rated clinician responses as more appropriate than expert suicidologists did. The item-level mean difference was 0.86 for ChatGPT (95% CI 0.61-1.12; P<.001), 0.61 for Claude (95% CI 0.41-0.81; P<.001), and 0.73 for Gemini (95% CI 0.35-1.11; P<.001). In terms of z scores, 19% (9 of 48) of ChatGPT responses, 11% (5 of 48) of Claude responses, and 36% (17 of 48) of Gemini responses were outliers when compared to expert suicidologists. On the final SIRI-2 score, where lower values indicate closer agreement with experts, ChatGPT scored 45.7, roughly equivalent to master's level counselors in prior studies. Claude scored 36.7, exceeding prior performance of mental health professionals after suicide intervention skills training. Gemini scored 54.5, equivalent to untrained K-12 school staff.
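The Results report two different summaries: a signed item-level mean difference, which captures the direction of bias, and a final SIRI-2 score. The abstract does not give the scoring formula. The sketch below assumes the scoring commonly described for SIRI-2, in which the total is the sum of absolute differences between a respondent's ratings and the expert mean ratings, so a lower total means closer agreement. Function names and all values are hypothetical and are not study data.

```python
def mean_signed_difference(llm: list[float], expert: list[float]) -> float:
    # Positive values mean the LLM rated responses as more appropriate
    # than the experts did (an upward bias).
    if len(llm) != len(expert):
        raise ValueError("rating lists must have the same length")
    diffs = [l - e for l, e in zip(llm, expert)]
    return sum(diffs) / len(diffs)


def siri2_total(llm: list[float], expert: list[float]) -> float:
    # Assumed scoring: sum of absolute discrepancies from the expert means.
    # Lower totals indicate closer agreement with the expert panel.
    if len(llm) != len(expert):
        raise ValueError("rating lists must have the same length")
    return sum(abs(l - e) for l, e in zip(llm, expert))


# Hypothetical ratings for 4 clinician responses on the -3..+3 scale.
expert_means = [-2.0, 1.5, -0.5, 2.5]
llm_ratings = [-1.0, 2.0, 0.5, 2.0]

bias = mean_signed_difference(llm_ratings, expert_means)
total = siri2_total(llm_ratings, expert_means)
print(bias, total)
```

In this toy example the one negative discrepancy partly offsets the positive ones in the signed mean, while the absolute total counts every discrepancy, which is why the two summaries can differ.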

Conclusions: Current versions of 3 major LLMs demonstrated an upward bias in their evaluations of appropriate responses to suicidal ideation; however, 2 of the 3 models performed as well as or better than mental health professionals.

Keywords: ChatGPT; Suicidal Ideation Response Inventory; artificial intelligence; chatbot; depression; digital health; large language model; mental health; suicide; suicidologist.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1. Mean difference in ratings on Suicidal Ideation Response Inventory (SIRI-2) items: large language model versus expert suicidologists.
Figure 2. Density plot represents the proportion of responses, across all 48 item responses, with z scores ranging from –3 to +6. Dashed vertical lines indicate cutoff thresholds of –1.96 and +1.96. Values less than –1.96 or greater than +1.96 are significant at P<.05.
