Background: Large language models (LLMs) are increasingly used in mental health, showing promise in assessing disorders. However, concerns exist regarding their accuracy, reliability, and fairness. Societal biases and underrepresentation of certain populations may impact LLMs. Because LLMs are already used for clinical practice, including decision support, it is important to investigate potential biases to ensure a responsible use of LLMs. Anorexia nervosa (AN) and bulimia nervosa (BN) show a lifetime prevalence of 1%-2%, affecting more women than men. Among men, homosexual men face a higher risk of eating disorders (EDs) than heterosexual men. However, men are underrepresented in ED research, and studies on gender, sexual orientation, and their impact on AN and BN prevalence, symptoms, and treatment outcomes remain limited.
Objectives: We aimed to estimate the presence and size of bias related to gender and sexual orientation produced by a common LLM as well as a smaller LLM specifically trained for mental health analyses, exemplified in the context of ED symptomatology and health-related quality of life (HRQoL) of patients with AN or BN.
Methods: We extracted 30 case vignettes (22 AN and 8 BN) from scientific papers. We adapted each vignette to create 4 versions, describing a female versus male patient living with their female versus male partner (2 × 2 design), yielding 120 vignettes. We then submitted each vignette 3 times to ChatGPT-4 and to MentaLLaMA, a model based on the Large Language Model Meta AI (LLaMA) architecture, with the instruction to evaluate it by completing 2 psychometric instruments: the RAND-36 questionnaire, which assesses HRQoL, and the Eating Disorder Examination Questionnaire. With the resulting LLM-generated scores, we calculated multilevel models with a random intercept for gender and sexual orientation (accounting for within-vignette variance), nested in vignettes (accounting for between-vignette variance).
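The design arithmetic described above (30 vignettes × 4 versions × 3 runs) can be sketched as follows. This is only an illustrative enumeration, not the authors' code: the function name `build_design`, the field names, and the rule that derives the orientation label from patient and partner gender are assumptions made for the example.

```python
from itertools import product

N_VIGNETTES = 30                      # 22 AN + 8 BN source vignettes
GENDERS = ("female", "male")          # patient gender
PARTNER_GENDERS = ("female", "male")  # partner gender
N_REPEATS = 3                         # each version is submitted three times


def build_design(n_vignettes=N_VIGNETTES, n_repeats=N_REPEATS):
    """Enumerate every (vignette, patient gender, partner gender, run) cell."""
    rows = []
    for vignette_id, gender, partner, run in product(
        range(1, n_vignettes + 1), GENDERS, PARTNER_GENDERS, range(1, n_repeats + 1)
    ):
        # Assumption: orientation is labeled from the patient/partner pairing.
        orientation = "homosexual" if gender == partner else "heterosexual"
        rows.append(
            {
                "vignette_id": vignette_id,
                "gender": gender,
                "partner_gender": partner,
                "orientation": orientation,
                "run": run,
            }
        )
    return rows


design = build_design()
# Distinct adapted vignettes (ignoring the repeated runs).
variants = {(r["vignette_id"], r["gender"], r["partner_gender"]) for r in design}
print(len(variants), len(design))
```

Under these assumptions the enumeration gives 120 adapted vignettes and 360 scored runs per model, which matches the number of observations reported for the ChatGPT-4 multilevel model.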
Results: In ChatGPT-4, the multilevel model with 360 observations indicated a significant association with gender for the RAND-36 mental composite summary (conditional means: 12.8 for male and 15.1 for female cases; 95% CI of the effect –6.15 to –0.35; P=.04) but not with sexual orientation (P=.71) or with the interaction of gender and sexual orientation (P=.37). For the Eating Disorder Examination Questionnaire overall score (conditional means 5.59-5.65; 95% CIs 5.45 to 5.7), we found no indication of a main effect of gender (conditional means: 5.65 for male and 5.61 for female cases; 95% CI –0.10 to 0.14; P=.88), a main effect of sexual orientation (conditional means: 5.63 for heterosexual and 5.62 for homosexual cases; 95% CI –0.14 to 0.09; P=.67), or an interaction effect (95% CI –0.11 to 0.19; P=.61). MentaLLaMA did not yield reliable results.
Conclusions: LLM-generated mental HRQoL estimates for AN and BN case vignettes may be biased by gender, with male cases scoring lower despite no real-world evidence supporting this pattern. This highlights the risk of bias in generative artificial intelligence in the field of mental health. Understanding and mitigating biases related to gender and other factors, such as ethnicity and socioeconomic status, is crucial for responsible use in diagnostics and treatment recommendations.
Background: As digital mental health delivery becomes increasingly prominent, a solid evidence base regarding its efficacy is needed.
Objective: This study aims to synthesize evidence on the comparative efficacy of systemic psychotherapy interventions provided via digital versus face-to-face delivery modalities.
Methods: We followed PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines for searching PubMed, Embase, Cochrane CENTRAL, CINAHL, PsycINFO, and PSYNDEX and conducting a systematic review and meta-analysis. We included randomized controlled trials comparing mental, behavioral, and somatic outcomes of systemic psychotherapy interventions using self- and therapist-guided digital versus face-to-face delivery modalities. The risk of bias was assessed with the revised Cochrane Risk of Bias tool for randomized trials. Where appropriate, we calculated standardized mean differences and risk ratios. We calculated separate mean differences for nonaggregated analysis.
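The two effect measures named above can be illustrated with a minimal sketch. It assumes a log-scale Wald interval for the risk ratio and a pooled-SD standardized mean difference (Cohen's d); the review does not state which estimators or pooling model were used, and the function names and input numbers below are hypothetical.

```python
import math


def risk_ratio(events_a, total_a, events_b, total_b, z=1.96):
    """Risk ratio of group A versus group B with a log-scale Wald CI."""
    rr = (events_a / total_a) / (events_b / total_b)
    # Standard error of log(RR) for two independent binomial samples.
    se_log = math.sqrt(1 / events_a - 1 / total_a + 1 / events_b - 1 / total_b)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


def standardized_mean_difference(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


# Hypothetical counts: 10 of 50 dropouts in each arm.
rr, rr_low, rr_high = risk_ratio(10, 50, 10, 50)
# Hypothetical session counts: means 10 vs 8, SD 2, n = 20 per arm.
smd = standardized_mean_difference(10.0, 2.0, 20, 8.0, 2.0, 20)
print(rr, rr_low, rr_high, smd)
```

A risk ratio whose interval spans 1, or a standardized mean difference whose interval spans 0, indicates no significant difference between arms. That alone does not demonstrate equivalence, which additionally requires the interval to lie within a prespecified margin.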
Results: We screened 3633 references and included 12 articles reporting on 4 trials (N=754). Participants were youths with poor diabetic control, traumatic brain injuries, or an increased likelihood of risk behavior, as well as parents of youths with anorexia nervosa. A total of 56 outcomes were identified. Two trials delivered the digital intervention via videoconferencing, one via an interactive graphic interface, and one via a web-based program. In total, 23% (14/60) of risk of bias judgments were high risk, 42% (25/60) were some concerns, and 35% (21/60) were low risk. Due to heterogeneity in the data, meta-analysis was deemed inappropriate for 96% (54/56) of outcomes, which were interpreted qualitatively instead. Nonaggregated analyses of mean differences and CIs between delivery modalities yielded mixed results, with superiority of the digital delivery modality for 18% (10/56) of outcomes, superiority of the face-to-face delivery modality for 5% (3/56) of outcomes, equivalence between delivery modalities for 2% (1/56) of outcomes, and neither superiority of one modality nor equivalence between modalities for 75% (42/56) of outcomes. Consequently, for most outcome measures, neither superiority nor equivalence of either delivery modality can be concluded at this stage. We further meta-analytically compared digital versus face-to-face delivery modalities for attrition (risk ratio 1.03, 95% CI 0.52-2.03; P=.93) and number of sessions attended (standardized mean difference –0.11; 95% CI –1.13 to 0.91; P=.83). We found no significant differences between modalities; however, because the CIs extend beyond the range of the minimal important difference, equivalence cannot be established at this stage.
Conclusions: Evidence on digital and face-to-face modalities for systemic psychotherapy interventions is largely heterogeneous, limiting conclusions regarding the differential efficacy of digital and face-to-face delivery. Nonaggregated and meta-analytic analyses did not indicate the superiority of either delivery condition. More research is needed to conclude if digital and face-to-face delivery modalities are generally equivalent or if—and in which contexts—one modality is superior to another.
Background: Suicide represents a critical public health concern, and machine learning (ML) models offer the potential for identifying at-risk individuals. Recent studies using benchmark datasets and real-world social media data have demonstrated the capability of pretrained large language models in predicting suicidal ideation and behaviors (SIBs) in speech and text.
Objective: This study aimed to (1) develop and implement ML methods for predicting SIBs in a real-world crisis helpline dataset, using transformer-based pretrained models as a foundation; (2) evaluate, cross-validate, and benchmark the model against traditional text classification approaches; and (3) train an explainable model to highlight relevant risk-associated features.
Methods: We analyzed chat protocols from adolescents and young adults (aged 14-25 years) seeking assistance from a German crisis helpline. An ML model was developed using a transformer-based language model architecture with pretrained weights and long short-term memory layers. The model predicted suicidal ideation (SI) and advanced suicidal engagement (ASE), as indicated by composite Columbia-Suicide Severity Rating Scale scores. We compared model performance against a classical word-vector-based ML model. We subsequently computed discrimination, calibration, clinical utility, and explainability information using a Shapley Additive Explanations value-based post hoc estimation model.
Results: The dataset comprised 1348 help-seeking encounters (1011 for training and 337 for testing). The transformer-based classifier achieved a macroaveraged area under the receiver operating characteristic curve (AUC-ROC) of 0.89 (95% CI 0.81-0.91) and an overall accuracy of 0.79 (95% CI 0.73-0.99). This performance surpassed that of the word-vector-based baseline model (AUC-ROC=0.77, 95% CI 0.64-0.90; accuracy=0.61, 95% CI 0.61-0.80). The transformer model demonstrated excellent prediction for nonsuicidal sessions (AUC-ROC=0.96, 95% CI 0.96-0.99) and good prediction for SI and ASE, with AUC-ROCs of 0.85 (95% CI 0.97-0.86) and 0.87 (95% CI 0.81-0.88), respectively. The Brier Skill Score indicated a 44% improvement in classification performance over the baseline model. The Shapley Additive Explanations model identified language features predictive of SIBs, including self-reference, negation, expressions of low self-esteem, and absolutist language.
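For readers unfamiliar with the Brier Skill Score mentioned above, the following minimal sketch shows how such a score relates a model's probabilistic predictions to those of a reference model. It assumes the multiclass Brier score (squared error summed over classes, averaged over cases) and uses hypothetical probabilities; the class coding, the uniform reference model, and the function names are illustrative assumptions, not the study's implementation.

```python
def brier_score(probs, labels):
    """Multiclass Brier score: mean over cases of the summed squared error
    between predicted class probabilities and the one-hot true class."""
    total = 0.0
    for p_row, label in zip(probs, labels):
        total += sum(
            (p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(p_row)
        )
    return total / len(labels)


def brier_skill_score(model_probs, reference_probs, labels):
    """Relative improvement of the model over the reference (1 = perfect,
    0 = no better than the reference, negative = worse)."""
    return 1.0 - brier_score(model_probs, labels) / brier_score(reference_probs, labels)


# Hypothetical coding: 0 = nonsuicidal, 1 = SI, 2 = ASE.
labels = [0, 1, 2, 0]
model_probs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.8, 0.1, 0.1],
]
# Hypothetical reference that assigns equal probability to every class.
reference_probs = [[1 / 3, 1 / 3, 1 / 3] for _ in labels]

bss = brier_skill_score(model_probs, reference_probs, labels)
print(round(bss, 2))
```

On this reading, a score of 0.44 would correspond to a 44% reduction in Brier score relative to the reference model, which in the study above is the word-vector-based baseline.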
Conclusions: Neural networks using large language model–based transfer learning can accurately identify SI and ASE. The post hoc explainer model revealed language features associated with SI and ASE. Such models may potentially support clinical decision-making in suicide prevention services. Future research should explore multimodal input features and temporal aspects of suicide risk.