Observational Study

J Med Internet Res. 2025 Mar 5;27:e67891.

doi: 10.2196/67891.

Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study

Ryan K McBain¹ ² ³, Jonathan H Cantor⁴, Li Ang Zhang⁴, Olesya Baker³ ⁵, Fang Zhang³ ⁵, Alyssa Halbisen⁵, Aaron Kofner¹, Joshua Breslau⁶, Bradley Stein⁶, Ateev Mehrotra⁷, Hao Yu³ ⁵

Affiliations

¹ RAND, Arlington, VA, United States.
² Brigham and Women's Hospital, Boston, MA, United States.
³ Harvard Medical School, Boston, MA, United States.
⁴ RAND, Santa Monica, CA, United States.
⁵ Harvard Pilgrim Health Care Institute, Boston, MA, United States.
⁶ RAND, Pittsburgh, PA, United States.
⁷ Brown University School of Public Health, Providence, RI, United States.

PMID: 40053817
PMCID: PMC11928068
DOI: 10.2196/67891

Abstract

Background: With suicide rates in the United States at an all-time high, individuals experiencing suicidal ideation are increasingly turning to large language models (LLMs) for guidance and support.

Objective: The objective of this study was to assess the competency of 3 widely used LLMs to distinguish appropriate versus inappropriate responses when engaging individuals who exhibit suicidal ideation.

Methods: This observational, cross-sectional study evaluated responses to the revised Suicidal Ideation Response Inventory (SIRI-2) generated by ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Data collection and analyses were conducted in July 2024. A common training module for mental health professionals, SIRI-2 provides 24 hypothetical scenarios in which a patient exhibits depressive symptoms and suicidal ideation, each followed by 2 clinician responses (48 items in total). Each clinician response is rated from -3 (highly inappropriate) to +3 (highly appropriate). All 3 LLMs were provided with a standardized set of instructions to rate clinician responses. We compared LLM responses to those of expert suicidologists, conducting linear regression analyses and converting LLM responses to z scores to identify outliers (z score >1.96 or <-1.96; P<.05). Furthermore, we compared final SIRI-2 scores to those produced by health professionals in prior studies.
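The outlier rule in the Methods can be illustrated with a minimal Python sketch. It assumes each z score is computed as the LLM rating minus the expert panel mean, divided by the expert panel standard deviation for that item; the abstract does not spell out this formula. The function name `flag_outliers` and all ratings below are hypothetical and are not study data.

```python
from statistics import mean, stdev

def flag_outliers(llm_ratings, expert_panels, cutoff=1.96):
    """Return (item_index, z) pairs where |z| exceeds the cutoff.

    Assumption: z = (LLM rating - expert mean) / expert sample SD.
    An item on which every expert gave the same rating has SD 0 and
    would need separate handling; this sketch does not cover that case.
    """
    flagged = []
    for i, (rating, panel) in enumerate(zip(llm_ratings, expert_panels)):
        z = (rating - mean(panel)) / stdev(panel)
        if abs(z) > cutoff:
            flagged.append((i, z))
    return flagged

# Made-up expert ratings on the -3..+3 scale, one list per item.
expert_panels = [
    [-3, -2, -2, -3, -2],  # experts agree the response is inappropriate
    [2, 3, 2, 2, 3],       # experts agree the response is appropriate
    [1, 0, 1, -1, 0],      # experts are split
]
llm_ratings = [1, 2, 1]    # hypothetical LLM ratings for the same items

flagged = flag_outliers(llm_ratings, expert_panels)
print([i for i, _ in flagged])
```

In this toy input only the first item is flagged: the hypothetical model rates a response as mildly appropriate that the panel consistently rated as inappropriate.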

Results: All 3 LLMs rated responses as more appropriate than expert suicidologists did. The item-level mean difference was 0.86 for ChatGPT (95% CI 0.61-1.12; P<.001), 0.61 for Claude (95% CI 0.41-0.81; P<.001), and 0.73 for Gemini (95% CI 0.35-1.11; P<.001). In terms of z scores, 19% (9 of 48) of ChatGPT responses, 11% (5 of 48) of Claude responses, and 36% (17 of 48) of Gemini responses were outliers when compared to expert suicidologists. On the SIRI-2, where lower scores indicate closer agreement with experts, ChatGPT produced a final score of 45.7, roughly equivalent to master's-level counselors in prior studies. Claude produced a final SIRI-2 score of 36.7, exceeding prior performance of mental health professionals after suicide intervention skills training. Gemini produced a final SIRI-2 score of 54.5, equivalent to untrained K-12 school staff.
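The two summary quantities reported above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it takes the item-level mean difference to be the average signed gap between a rater and the expert means, and it takes the final SIRI-2 score to be the sum of absolute gaps across items, so that a lower total means closer agreement with experts. The function names and all numbers are hypothetical.

```python
def mean_difference(ratings, expert_means):
    """Average signed gap; positive values indicate an upward bias."""
    gaps = [r - m for r, m in zip(ratings, expert_means)]
    return sum(gaps) / len(gaps)

def siri2_total(ratings, expert_means):
    """Assumed scoring: sum of absolute gaps from the expert means."""
    if len(ratings) != len(expert_means):
        raise ValueError("ratings and expert_means must have equal length")
    return sum(abs(r - m) for r, m in zip(ratings, expert_means))

# Made-up expert means for 4 items, plus two hypothetical raters.
expert_means = [-2.4, 2.4, 0.2, -1.5]
close_rater = [-2, 2, 0, -2]    # stays near the expert means
lenient_rater = [1, 3, 2, 1]    # rates every response more favorably

for name, ratings in [("close", close_rater), ("lenient", lenient_rater)]:
    print(name,
          round(mean_difference(ratings, expert_means), 2),
          round(siri2_total(ratings, expert_means), 2))
```

Under these assumptions the lenient rater has both a positive mean difference and a higher total, which is the pattern the Results describe: an upward bias in ratings goes along with a larger (worse) final score.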

Conclusions: Current versions of 3 major LLMs demonstrated an upward bias in their evaluations of appropriate responses to suicidal ideation; however, 2 of the 3 models performed at a level equal to or exceeding that of mental health professionals.

Keywords: ChatGPT; Suicidal Ideation Response Inventory; artificial intelligence; chatbot; depression; digital health; large language model; mental health; suicide; suicidologist.

©Ryan K McBain, Jonathan H Cantor, Li Ang Zhang, Olesya Baker, Fang Zhang, Alyssa Halbisen, Aaron Kofner, Joshua Breslau, Bradley Stein, Ateev Mehrotra, Hao Yu. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 05.03.2025.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

**Figure 1**
Mean difference in ratings on Suicidal Ideation Response Inventory (SIRI-2) items: large language model versus expert suicidologists.

**Figure 2**
Density plot represents the proportion of responses, across all 48 item responses, with z scores ranging from –3 to +6. Dashed vertical lines indicate cutoff thresholds of –1.96 and +1.96. Values less than –1.96 or greater than +1.96 are significant at P<.05.

Grants and funding

R01 MH132551/MH/NIMH NIH HHS/United States
