Transformers and Bayesian Networks: A Proven Equivalence

A recent paper establishes a formal equivalence between Transformers, the dominant architecture in modern AI, and Bayesian networks. The research offers a precise account of why Transformers work by demonstrating that a Transformer is, in essence, a Bayesian network.

The proof proceeds in five main steps:

  1. Every sigmoid transformer implements weighted loopy belief propagation on its implicit factor graph. One layer corresponds to one round of propagation.
  2. A Transformer can implement exact belief propagation on any declared knowledge base. On knowledge bases without circular dependencies, this yields provably correct probability estimates at every node.
  3. Uniqueness: a sigmoid transformer that produces exact posteriors necessarily has BP weights. There is no other path through the sigmoid architecture to exact posteriors.
  4. The Transformer layer has an AND/OR Boolean structure: attention acts as AND, the feedforward network as OR, and their strict alternation is exactly Pearl's gather/update algorithm.
  5. The formal results have been confirmed experimentally, corroborating the Bayesian network characterization in practice.
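As an illustration of the AND/OR alternation in point 4, here is a minimal sketch of one gather/update round, assuming independent premises and a noisy-OR combination of rules. All function names and the toy knowledge base are hypothetical; this is not the paper's construction:

```python
# Gather (AND): a rule fires only if all its premises hold, so beliefs multiply.
def and_gather(premise_beliefs):
    p = 1.0
    for b in premise_beliefs:
        p *= b  # assumes independent premises
    return p

# Update (OR): a node is true unless every rule supporting it fails (noisy-OR).
def or_update(rule_activations):
    p_fail = 1.0
    for a in rule_activations:
        p_fail *= 1.0 - a
    return 1.0 - p_fail

# Toy knowledge base: "wet" is supported by two rules.
rain, cold, sprinkler = 0.9, 0.8, 0.3
rule1 = and_gather([rain, cold])   # wet if rain AND cold -> 0.72
rule2 = and_gather([sprinkler])    # wet if sprinkler     -> 0.30
wet = or_update([rule1, rule2])    # 1 - 0.28 * 0.70 = 0.804
```

In this reading, one Transformer layer corresponds to one such round, and stacking layers iterates the propagation.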

Hallucination: A Structural Problem, Not a Scaling Bug

The research also demonstrates that verifiable inference requires a finite concept space. Any finite verification procedure can distinguish at most finitely many concepts. Without grounding, correctness is not defined. Hallucination is not a bug that scaling can fix, but a structural consequence of operating without concepts. This aspect is particularly relevant for those considering on-premise deployments and the need for reliable and interpretable models.
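The finiteness argument can be made concrete with a toy verifier (a hypothetical sketch, not taken from the paper): a procedure built from n binary checks can emit at most 2**n distinct verdict signatures, so by pigeonhole it can distinguish at most 2**n concepts.

```python
# Hypothetical toy verifier: n binary checks partition inputs into
# at most 2**n signature classes, hence at most 2**n distinguishable concepts.
def signature(x, checks):
    return tuple(check(x) for check in checks)

checks = [lambda x: x % 2 == 0, lambda x: x > 10]  # two binary checks
signatures = {signature(x, checks) for x in range(100)}
assert len(signatures) <= 2 ** len(checks)  # pigeonhole bound holds
```

However many inputs the verifier sees, its verdicts can never separate more concepts than its finite signature space allows, which is the structural limit the paper points to.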

For those evaluating on-premise deployments, these results translate into concrete trade-offs. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate them.