From LLMs to Theories: How Generative Causal Testing Explains the Brain

Large language models have become remarkably good at predicting how the brain responds to language, but why those predictions work has remained a puzzle: the models’ parameters are far too numerous to read as an explanation. Generative Causal Testing (GCT), a framework developed by Microsoft Research with the University of California, Berkeley, UC San Francisco, and Columbia University and just accepted in Nature Neuroscience, aims to break that deadlock. GCT doesn’t just say that a brain region responds to something: it formulates a succinct hypothesis, such as “food preparation” or “proper location names,” and then tests it with new stimuli generated by an LLM.

The black-box dilemma in neuroscience

Over the past decade, LLMs have become remarkably good at predicting the brain’s language-related activity. Feed an LLM the same story a person hears in an fMRI scanner, and the model’s representations can predict the activity of individual cortical regions with striking fidelity. The catch is that those predictions remain opaque: they rest on millions of parameters that cannot be read as a scientific theory. This gap between predictive power and understanding has become a central problem in computational neuroscience.

How Generative Causal Testing works

GCT has two stages. First, it analyzes the predictive model for a single voxel or brain area and extracts the short phrases that most strongly drive its response. An LLM then summarizes those phrases into a concise verbal explanation. The second stage closes the loop: the same LLM writes new stories, paragraph by paragraph, carefully built to selectively stimulate the target region according to the explanation. Participants listen to these stories while undergoing fMRI; if the region’s activity rises significantly above its response to baseline text, the explanation has passed a genuine causal test rather than a purely correlational one.
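The first stage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_voxel_model` is a made-up keyword scorer standing in for a real fitted predictive model, the weights are invented, and `build_summary_prompt` is a hypothetical helper whose wording is not taken from the study.

```python
from typing import Callable


def extract_ngrams(text: str, n_max: int = 3) -> list[str]:
    """Return the unique word n-grams (n = 1..n_max) of a text, in first-seen order."""
    words = text.lower().split()
    seen: dict[str, None] = {}
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            seen.setdefault(" ".join(words[i:i + n]), None)
    return list(seen)


def top_driving_phrases(
    predict: Callable[[str], float], phrases: list[str], k: int = 5
) -> list[str]:
    """Rank candidate phrases by the model's predicted response and keep the top k."""
    return sorted(phrases, key=predict, reverse=True)[:k]


# ASSUMPTION: invented weights for a toy "food preparation" voxel.
FOOD_WEIGHTS = {"chop": 1.0, "simmer": 0.9, "onions": 0.8, "sauce": 0.6}


def toy_voxel_model(phrase: str) -> float:
    """Hypothetical stand-in for a fitted voxel model: mean keyword weight per word."""
    words = phrase.split()
    return sum(FOOD_WEIGHTS.get(w, 0.0) for w in words) / len(words)


def build_summary_prompt(phrases: list[str]) -> str:
    """Hypothetical prompt asking an LLM to name what the phrases have in common."""
    listed = "\n".join(f"- {p}" for p in phrases)
    return f"In a few words, what do these phrases have in common?\n{listed}"


story = "we chop onions and simmer the sauce near the river"
top = top_driving_phrases(toy_voxel_model, extract_ngrams(story), k=5)
prompt = build_summary_prompt(top)
```

The resulting prompt would then go to whichever LLM is in use; its short answer plays the role of the candidate explanation that the second stage tests.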

Three volunteers returned to the scanner for this experiment. The synthetic stories drove the target regions well above baseline, confirming that GCT’s summaries capture something the cortex genuinely responds to. The explanations were most trustworthy where the underlying predictive models were most stable, so model stability offers a practical guide to which explanations to trust.
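One generic way to ask whether a region responds more to explanation-driven stories than to baseline text is a one-sided permutation test on the difference in mean response. The sketch below illustrates that idea only; it is not necessarily the statistic used in the study, and the response values are invented.

```python
import random


def permutation_p_value(driving, baseline, n_perm=10_000, seed=0):
    """One-sided permutation test: is mean(driving) greater than mean(baseline)?"""
    rng = random.Random(seed)
    n = len(driving)

    def mean_diff(values):
        first, second = values[:n], values[n:]
        return sum(first) / len(first) - sum(second) / len(second)

    pooled = list(driving) + list(baseline)
    observed = mean_diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign the condition labels
        # Small tolerance guards against floating-point summation order.
        if mean_diff(pooled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_perm + 1)


# ASSUMPTION: invented response values, one number per story paragraph.
driving_responses = [0.9, 1.1, 0.8, 1.2, 1.0, 0.95]
baseline_responses = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15]
p_value = permutation_p_value(driving_responses, baseline_responses)
```

A small p-value means that randomly relabelled data rarely shows a gap as large as the observed one, which is the sense in which the driving stories push activity "significantly above" baseline.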

Validation and surprising discoveries

With validation in hand, the researchers turned GCT on harder problems. They took three adjacent areas that process spatial information (the retrosplenial cortex, or RSC; the parahippocampal place area; and the occipital place area) that had previously been considered almost interchangeable. By generating differential stimuli (stories designed to fire up one region while keeping its neighbors quiet), the team was able to tease them apart: for instance, RSC responds more strongly to proper place names like Tokyo or Connecticut than to generic location references. That is a nuance a raw predictive model could never isolate on its own.
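The idea behind differential stimuli can be illustrated by scoring each candidate phrase by how much more it drives the target region than its strongest neighbor. The code below is a sketch under assumptions: the three models are toy keyword scorers with invented weights, not the fitted models from the study, and the scoring rule is just one simple way to express "target up, neighbors quiet".

```python
def keyword_model(weights):
    """Build a toy region model: mean keyword weight over the words of a phrase."""
    def predict(phrase):
        words = phrase.lower().split()
        return sum(weights.get(w, 0.0) for w in words) / max(len(words), 1)
    return predict


def differential_score(phrase, target, others):
    """Predicted target response minus the strongest competing response."""
    return target(phrase) - max(model(phrase) for model in others)


# ASSUMPTION: invented weights, chosen only to make the example readable.
rsc = keyword_model({"tokyo": 1.0, "connecticut": 0.9, "kitchen": 0.3, "hallway": 0.3})
ppa = keyword_model({"tokyo": 0.4, "connecticut": 0.4, "kitchen": 0.8, "hallway": 0.6})
opa = keyword_model({"tokyo": 0.3, "connecticut": 0.3, "kitchen": 0.5, "hallway": 0.9})

candidates = ["tokyo", "connecticut", "kitchen", "hallway"]
ranked = sorted(
    candidates,
    key=lambda p: differential_score(p, rsc, [ppa, opa]),
    reverse=True,
)
```

Phrases at the top of `ranked` are the ones worth building a differential story around; phrases with a negative score drive a neighbor harder than the target and would blur the contrast.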

Perhaps the most fascinating result is the discovery of entirely new prefrontal micro-regions. By scanning a grid of candidate locations and discarding unstable ones, GCT revealed selective clusters for remarkably specific concepts: one reacts to dialogue between people (words like “said” or “told”), another to clock times (“one o’clock”), and another to numeric measurements (“50 feet”). These are distinctions no one had gone looking for; they emerged because the method can formulate a hypothesis and test it immediately.
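The "discard unstable locations" step can be pictured as a simple filter. The article does not say how stability was measured, so the criterion below is an assumption made for illustration: a location counts as stable when the model weights fitted on two independent data splits correlate above a threshold. All names and numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def stable_locations(weights_a, weights_b, threshold=0.5):
    """Keep grid locations whose weights agree across two independent fits."""
    return [
        loc
        for loc in weights_a
        if pearson(weights_a[loc], weights_b[loc]) >= threshold
    ]


# ASSUMPTION: invented weights for two grid locations, fitted on two data splits.
fit_a = {(0, 0): [1.0, 2.0, 3.0, 4.0], (0, 1): [1.0, 2.0, 3.0, 4.0]}
fit_b = {(0, 0): [1.1, 1.9, 3.2, 3.8], (0, 1): [4.0, 1.0, 3.0, 2.0]}
kept = stable_locations(fit_a, fit_b)
```

Only the locations in `kept` would go on to the explain-and-test loop, so that scanner time is not spent on hypotheses derived from unreliable models.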

Beyond neuroscience: what it means for data-driven science

The work carries implications beyond neuroscience. Today, many scientific fields face the same puzzle: models that predict beautifully but explain nothing. GCT shows that a data-driven model need not be an opaque endpoint; it can be distilled into a readable, experimentally testable theory. It’s a “generate-and-verify” philosophy that could extend to any domain in which predictive models have outrun our ability to understand them.

Infrastructure and data sovereignty: the on-premise angle

Some practical considerations remain for those looking to adopt GCT in sensitive contexts. The study does not detail the computing infrastructure, though the Microsoft Research partnership may point to cloud-hosted LLMs. However, when working with functional MRI data, which are personal and potentially revealing of clinical conditions, regulations like GDPR come into play. In such scenarios, self-hosted inference on on-premise hardware can become crucial to ensure data sovereignty. A local deployment may also reduce latency in iterative generation-and-test cycles, with potential long-term advantages in total cost of ownership. While GCT does not provide hardware recipes, its large-scale adoption raises concrete questions about compute power, VRAM, and model quantization, topics where the community evaluating on-premise LLM stacks can offer valuable insights.
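As a rough sizing aid for the VRAM and quantization question, the memory needed just to hold a model's weights is the parameter count times the bits per weight, divided by eight. The sketch below applies that arithmetic; the 7-billion-parameter figure is an arbitrary example, not a model used in the study, and the estimate leaves out the KV cache, activations, and runtime overhead, which come on top.

```python
def weight_memory_gb(n_params: int, bits_per_weight: int) -> float:
    """Memory for model weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# ASSUMPTION: an arbitrary 7-billion-parameter model, for illustration only.
N_PARAMS = 7_000_000_000
estimates = {bits: weight_memory_gb(N_PARAMS, bits) for bits in (16, 8, 4)}
```

Here `estimates` maps 16-bit weights to 14 GB, 8-bit to 7 GB, and 4-bit to 3.5 GB, which shows why quantization is often what decides whether a model fits on a single local GPU.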

Ultimately, the rise of black-box models in science does not have to mean the retreat of human-readable theories. With the right framework, the two can advance together. And, as GCT teaches, sometimes a short verbal explanation is enough to literally turn on new lights in the brain.