Inspiration for Zeynep Demirbas’ research struck during a chat with a family friend. That friend, a psychologist, said some health insurance companies were pushing the use of artificial intelligence, such as ChatGPT, for mental health. The idea: AI might be less costly and easier to access than human therapists.
That worried Zeynep, 14. She knew that ChatGPT often gave wrong answers or agreed with incorrect statements. Could this type of AI, known as a large language model — or LLM — really be trusted with our mental health?
Lots of people already use AI chatbots like ChatGPT for free therapy. Zeynep Demirbas’ research suggests that may not be a good idea. Society for Science
To find out, she tested whether several LLMs could detect stress in human text. She gave the models a dataset of more than 3,500 Reddit posts. Human raters had labeled each one as containing stress or not. Zeynep asked the models to identify which posts showed stress.
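The article doesn’t include Zeynep’s actual code or prompts, but the basic setup can be sketched in a few lines of Python: every post already carries a human rater’s label, and the model is asked for its own yes-or-no answer so the two can be compared. The keyword check standing in for the LLM below is purely illustrative, not her method.

```python
# A minimal sketch of the setup, not Zeynep's actual code: each post carries a
# human rater's label (1 = stress, 0 = no stress), and the model is asked to
# produce its own label for comparison. The keyword check is a toy stand-in
# for a real LLM call.

posts = [
    {"text": "I can't sleep before exams and my chest feels tight all day.", "label": 1},
    {"text": "Tried a new pasta recipe tonight and it turned out great!", "label": 0},
    {"text": "Deadlines are piling up and I feel completely overwhelmed.", "label": 1},
]

def model_says_stress(text: str) -> int:
    """Toy stand-in for asking an LLM: 'Does this post show stress? yes/no'."""
    stress_words = ("overwhelmed", "can't sleep", "anxious", "panic")
    return int(any(word in text.lower() for word in stress_words))

predictions = [model_says_stress(p["text"]) for p in posts]
truth = [p["label"] for p in posts]
print(list(zip(predictions, truth)))  # model label vs. human label, post by post
```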
To judge how well the models did, Zeynep calculated something called an F1-score for each one. This score considers how many stress-containing posts the models accurately spotted. It also accounts for how often the models missed cases of stress and how often they mislabeled posts as showing stress.
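Here is one way to picture how those pieces fit together. In this short Python sketch with made-up counts (not Zeynep’s data), precision captures how trustworthy a “stress” flag is, recall captures how many truly stressed posts were caught, and the F1-score is the harmonic mean of the two.

```python
# How an F1-score combines the two kinds of mistakes described above.
# These counts are invented for illustration; they are not Zeynep's results.

true_positives  = 60   # stressed posts the model correctly flagged
false_negatives = 25   # stressed posts the model missed
false_positives = 15   # calm posts the model wrongly flagged as stressed

precision = true_positives / (true_positives + false_positives)  # how trustworthy a "stress" flag is
recall    = true_positives / (true_positives + false_negatives)  # how many stressed posts were caught
f1 = 2 * precision * recall / (precision + recall)               # harmonic mean of the two

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```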
An LLM specifically designed for mental health did the best. It scored about 82 percent. ChatGPT scored only about 74 percent.
An aspiring computer scientist, Zeynep did this project as an eighth grader at Transit Middle School in East Amherst, N.Y. Her research earned her a finalist spot in the 2025 Thermo Fisher Scientific Junior Innovators Challenge. The Society for Science runs this program (and also publishes Science News Explores).
Here, Zeynep shares her research experiences and advice.
What was your reaction to seeing the results?
ChatGPT performing badly was “really surprising,” Zeynep says. It did even worse than the “random-forest” model. This model makes predictions by using a collection of decision trees, sort of like a super-complex flow chart. Random-forest is “supposed to be a very simple and old technique. So I just put it in as a baseline,” Zeynep says. “That was very interesting — how something so small and simple was able to beat an LLM [like ChatGPT] that used millions of parameters and had so much coding go into it.”
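The article doesn’t say which tools Zeynep used, but a random-forest baseline of the kind she describes is often built with scikit-learn: the text is turned into simple word-weight features (TF-IDF here, as an assumption), and then a forest of decision trees votes on each post. The tiny dataset below is made up for illustration.

```python
# A minimal sketch of a random-forest text baseline, assuming scikit-learn and
# TF-IDF features; the article does not describe Zeynep's exact pipeline.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Tiny made-up stand-in for the labeled Reddit posts (1 = stress, 0 = no stress).
texts = [
    "I am so overwhelmed with school I can't sleep",
    "Had a relaxing weekend hiking with friends",
    "Deadlines everywhere, I feel like I'm drowning",
    "Just baked cookies and watched a movie",
]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100, random_state=0))
model.fit(texts, labels)

predictions = model.predict(texts)  # in practice you'd score held-out posts, not the training set
print("F1 on this toy data:", f1_score(labels, predictions))
```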
What are the main takeaways from your project?
“We should be mindful with AI, because it doesn’t really have an acceptable grade in mental health,” Zeynep says. “That doesn’t mean that LLMs are bad, because they’re for general use. They’re not necessarily meant for mental health.” Her data led her to conclude that LLMs should not be replacing human therapists. Instead, these models might help identify people who are struggling and refer them to a mental health professional.
How could you take this project further?
“One way I feel I could expand it is seeing whether LLMs carry biases toward different genders,” Zeynep says. “How would it respond [differently] if it was girls or guys?” Zeynep has read that sometimes doctors dismiss the symptoms of female patients because they think that women are exaggerating. The doctors’ assumption is an example of personal bias. Since LLMs are trained on texts written by people, they can pick up human biases, Zeynep says. She’s curious whether LLMs would show gender biases similar to those seen in some human doctors.