Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study
- PMID: 40053817
- PMCID: PMC11928068
- DOI: 10.2196/67891
Abstract
Background: With suicide rates in the United States at an all-time high, individuals experiencing suicidal ideation are increasingly turning to large language models (LLMs) for guidance and support.
Objective: The objective of this study was to assess the competency of 3 widely used LLMs to distinguish appropriate versus inappropriate responses when engaging individuals who exhibit suicidal ideation.
Methods: This observational, cross-sectional study evaluated responses to the revised Suicidal Ideation Response Inventory (SIRI-2) generated by ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Data collection and analyses were conducted in July 2024. A common training module for mental health professionals, SIRI-2 provides 24 hypothetical scenarios in which a patient exhibits depressive symptoms and suicidal ideation, each followed by 2 clinician responses. Clinician responses were scored from -3 (highly inappropriate) to +3 (highly appropriate). All 3 LLMs were provided with a standardized set of instructions to rate clinician responses. We compared LLM responses to those of expert suicidologists, conducting linear regression analyses and converting LLM responses to z scores to identify outliers (z score >1.96 or <-1.96; P<.05). Furthermore, we compared final SIRI-2 scores to those produced by health professionals in prior studies.
Results: All 3 LLMs rated responses as more appropriate than expert suicidologists did. The item-level mean difference was 0.86 for ChatGPT (95% CI 0.61-1.12; P<.001), 0.61 for Claude (95% CI 0.41-0.81; P<.001), and 0.73 for Gemini (95% CI 0.35-1.11; P<.001). In terms of z scores, 19% (9 of 48) of ChatGPT responses, 11% (5 of 48) of Claude responses, and 36% (17 of 48) of Gemini responses were outliers when compared to expert suicidologists. On the final SIRI-2 score, where lower values indicate closer agreement with experts, ChatGPT scored 45.7, roughly equivalent to master's level counselors in prior studies. Claude scored 36.7, exceeding prior performance of mental health professionals after suicide intervention skills training. Gemini scored 54.5, equivalent to untrained K-12 school staff.
Conclusions: Current versions of 3 major LLMs demonstrated an upward bias in their evaluations of appropriate responses to suicidal ideation; however, 2 of the 3 models performed on par with or better than mental health professionals.
Keywords: ChatGPT; Suicidal Ideation Response Inventory; artificial intelligence; chatbot; depression; digital health; large language model; mental health; suicide; suicidologist.
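The outlier rule described in the Methods (a z score above 1.96 or below -1.96) and the item-level mean difference reported in the Results can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes each LLM rating is standardized against the expert panel's mean and standard deviation for the same item, which the abstract does not specify. The function names and all numbers in the usage lines are hypothetical and are not study data.

```python
from typing import List, Sequence

# Two-sided threshold stated in the Methods (corresponds to P<.05).
Z_CRIT = 1.96


def z_score(llm_rating: float, expert_mean: float, expert_sd: float) -> float:
    """Standardize one LLM rating against the expert ratings for that item.

    Assumption: standardization uses the expert mean and SD per item.
    """
    if expert_sd <= 0:
        raise ValueError("expert_sd must be positive")
    return (llm_rating - expert_mean) / expert_sd


def flag_outliers(
    llm_ratings: Sequence[float],
    expert_means: Sequence[float],
    expert_sds: Sequence[float],
) -> List[bool]:
    """Return True for each item whose z score is >1.96 or <-1.96."""
    if not (len(llm_ratings) == len(expert_means) == len(expert_sds)):
        raise ValueError("all inputs must have the same length")
    return [
        abs(z_score(r, m, s)) > Z_CRIT
        for r, m, s in zip(llm_ratings, expert_means, expert_sds)
    ]


def mean_difference(
    llm_ratings: Sequence[float], expert_means: Sequence[float]
) -> float:
    """Average of (LLM rating - expert mean) across items.

    A positive value means the LLM rated responses as more appropriate
    than the experts did.
    """
    if len(llm_ratings) != len(expert_means) or not llm_ratings:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [r - m for r, m in zip(llm_ratings, expert_means)]
    return sum(diffs) / len(diffs)


# Hypothetical values for two items (not taken from the study).
flags = flag_outliers([3.0, 1.0], [1.0, 0.5], [0.5, 1.0])
bias = mean_difference([3.0, 1.0], [1.0, 0.5])
```

With real data, the share of `True` values in `flags` would correspond to the outlier percentage reported per model, and `bias` to the item-level mean difference.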
©Ryan K McBain, Jonathan H Cantor, Li Ang Zhang, Olesya Baker, Fang Zhang, Alyssa Halbisen, Aaron Kofner, Joshua Breslau, Bradley Stein, Ateev Mehrotra, Hao Yu. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 05.03.2025.
Conflict of interest statement
Conflicts of Interest: None declared.
