The Paradox of AI Evaluation: Beyond Technical Capabilities
In the current landscape of artificial intelligence, a significant portion of resources and effort is dedicated to measuring the capabilities of AI systems. Researchers analyze technical evaluation metrics, subject AIs to complex reasoning tests, and track their throughput, constantly pushing the boundaries of performance. This approach has led to impressive advancements, with models excelling on benchmarks like SWE-bench or LMArena, fueling competition among companies to develop the highest-performing solutions.
However, in this fervor for technical progress, a fundamental metric is often overlooked: the impact of AI on humans. Imran Khan, who leads psychosocial evaluation of AI at the nonprofit Center for Humane Technology, recently highlighted this paradox. In his essay published on the organization's Substack, Khan points out that we are deploying AI tools capable of reshaping our cognition, relationships, and behavior, but with little systematic effort to measure their downstream impacts on us. This gap raises crucial questions about the true meaning of technological progress.
Psychosocial Effects: A Growing Shadow on the Human Experience
The push to look more closely at AI’s psychosocial effects is similar to debates that emerged around social media and its harms. However, Khan believes AI could have even broader and more intimate effects, penetrating aspects of daily life in unprecedented ways. While the improved performance of Large Language Models (LLMs) and their progress are celebrated, the central question is missed: is this technology ultimately helping humans flourish, or is it eroding some of our most fundamental capacities?
There are already high-profile cases that, according to Khan, are just the tip of the iceberg: teen suicides, instances of “AI psychosis,” and people spending immense amounts of time and money engaging with chatbots designed to be incredibly sycophantic. These harms are already present, but public pressure can make a difference. One example is OpenAI's adjustment of one of its ChatGPT models in response to public concerns about sycophancy. This suggests that labs respond to scrutiny and can adapt, which leaves room to steer the technology toward applications that remain useful but are less harmful. The trickier question concerns societal-level harms: what will happen to romantic relationships, families, and teenagers' identities if AI becomes an integral part of daily life for months and years? If these phenomena are not measured in time, it may be too late to make a difference.
Designing Evaluation for Long-Term Impacts
The problem of evaluating AI's human impacts lies in their long-term nature. Current benchmarks focus on short-term tasks, such as an AI's ability to write code or answer complex scientific questions. Psychosocial impacts, by contrast, emerge over months or years, affecting individual minds, relationships, communities, and society as a whole. This requires longitudinal studies, an approach similar to that adopted in the pharmaceutical industry.
When a new drug is approved, it not only goes through different stages of trials, but companies are also mandated to conduct post-deployment monitoring, looking at effects that might crop up over a five- or ten-year horizon. Similarly, for AI, we need to monitor novel phenomena, such as how people's relationship with AI changes over a year or two, which could be studied through chat logs. Currently, companies hold this data, and external researchers do not have access to it. Opening access to more data, while still preserving user privacy, is a critical step. While individual companies might perceive a first-mover disadvantage in sharing their data, the industry as a whole has an incentive to ensure safe and trusted products. Liability is a further incentive: AI companies have already been sued over extreme harms like suicide, which gives them reason to make products safer, ideally with the support of clear regulation.
Towards a More Humane Relationship with AI
The future of AI promises even greater integration into human life. While many harms today stem from interaction with text-based chatbots, we are already seeing a shift towards extended use of AI agents, with real-time audio conversations and increasingly realistic video avatars. The concern is that if we don't start trying to understand the human effect of these technologies now, we risk falling too far behind the curve to adequately assess future developments. Technological progress, in fact, seems to be outpacing our ability to analyze it.
Success, from this perspective, would mean a collaborative ecosystem that brings together expertise from AI laboratories, governments, regulators, universities, and startups. The goal would be to define what a healthy relationship between humans and AI looks like and to develop the techniques needed to achieve it. Only through a joint, multidisciplinary effort will it be possible to build a more humane relationship with artificial intelligence, one in which its development is aligned with human well-being and flourishing.