Quantized LLMs: Tackling Hallucinations with Multi-Pass Verification

Quantized Large Language Models (LLMs) are increasingly used in qualitative analysis because they run faster and need fewer computational resources. This efficiency makes them appealing where hardware is limited or operational costs must be kept low. However, the lower precision introduced by quantization brings significant challenges: these models tend to "hallucinate" and produce unstable results, especially when processing text that contains non-expert language or ambiguous terms.

A recent study examined these issues, analyzing how various quantization levels (8-bit, 4-bit, 3-bit, and 2-bit) and different quantization types affect the performance of LLaMA-3.1 (8B) in qualitative analysis. The research used expert and non-expert responses from 82 interview transcripts to assess model reliability and accuracy in real-world contexts. Initial findings confirmed that reducing precision speeds up execution but compromises output fidelity, a critical trade-off for applications that demand high reliability.

The Quantization Challenge and the Proposed Solution

Quantization is a fundamental technique for optimizing LLMs, reducing model size and VRAM requirements, thereby making them more accessible for deployment on less powerful hardware or in edge scenarios. However, as highlighted by the research, lower-precision models, particularly 3-bit and 2-bit versions, exhibit a notable loss of accuracy. This degradation is especially problematic in qualitative analysis, where nuanced text interpretation is crucial, and hallucinations can completely invalidate results.
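
To make the size reduction concrete, the following back-of-the-envelope sketch estimates the memory needed for the weights alone at each bit width mentioned above. It assumes a nominal 8 billion parameters and ignores the KV cache, activations, and the per-block scale metadata that real quantization formats add, so actual VRAM use will be higher. The function name `weight_memory_gib` is illustrative and does not come from the study.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB.

    Ignores KV cache, activations, and quantization metadata (assumption).
    """
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 2**30


# Nominal parameter count for an "8B" model (assumption, not an exact figure).
N_PARAMS = 8e9

for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4), ("3-bit", 3), ("2-bit", 2)]:
    print(f"{label:>5}: {weight_memory_gib(N_PARAMS, bits):6.2f} GiB")
```

Under these assumptions the BF16 weights need roughly 14.9 GiB, while a 4-bit version needs roughly 3.7 GiB, which is why the lower bit widths fit on much smaller GPUs.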

To address these critical issues, the study proposes an innovative quantization-aware multi-pass prompt verification method. This methodology guides the model through a series of controlled steps designed to reduce hallucinations and enhance output stability. The process involves removing unreliable content and passing verified results to the next transcript in an iterative cycle aimed at progressively increasing overall accuracy. The objective is to enable even the most compressed models to provide more consistent and reliable responses while retaining resource efficiency benefits.
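
The iterative cycle described above can be sketched roughly as follows. This is not the study's implementation: the prompts, pass count, and filtering rules it used are not given here. The function `multi_pass_verify` and the `extract` and `verify` callables are hypothetical placeholders, and the toy stand-ins below use keyword matching instead of a real quantized model so the example runs on its own.

```python
from typing import Callable, Iterable, Set

# Hypothetical interfaces: in practice both would wrap calls to a quantized LLM.
ExtractFn = Callable[[str, Set[str]], Set[str]]  # (transcript, verified so far) -> candidate themes
VerifyFn = Callable[[str, str], bool]            # (transcript, theme) -> is the theme supported?


def multi_pass_verify(transcripts: Iterable[str], extract: ExtractFn,
                      verify: VerifyFn, passes: int = 2) -> Set[str]:
    """Carry only verified themes forward from one transcript to the next."""
    verified: Set[str] = set()
    for text in transcripts:
        # Propose themes, giving the model the verified context so far.
        candidates = extract(text, verified)
        # Re-check each candidate several times; with a stochastic model,
        # a theme must survive every pass to be kept.
        for _ in range(passes):
            candidates = {theme for theme in candidates if verify(text, theme)}
        verified |= candidates
    return verified


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_extract(text: str, verified: Set[str]) -> Set[str]:
    themes = {kw for kw in ("cost", "trust", "privacy") if kw in text.lower()}
    themes.add("aliens")  # simulated hallucination
    return themes


def toy_verify(text: str, theme: str) -> bool:
    return theme in text.lower()


transcripts = ["Cost was my main worry.", "I do not trust it, and privacy matters."]
result = multi_pass_verify(transcripts, toy_extract, toy_verify)
print(sorted(result))
```

In this toy run the simulated hallucination is filtered out because no transcript supports it, and only the supported themes are passed on to the next transcript.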

Validation Methodology and Key Findings

Performance validation followed a rigorous approach. Human coders analyzed the transcripts in NVivo, and the same transcripts were processed with a LLaMA-3.1 (BF16) model. Although the BF16 model produced high-precision output, it still exhibited semantic drift and hallucinations, which were manually corrected. The corrected BF16 output, combined with the human coding from NVivo, formed a Gold-Standard Ground Truth (GSGT) for thematic extraction and frequency analysis, providing a reference against which the quantized models could be evaluated.

The results revealed that 8-bit models aligned most closely with the GSGT, maintaining a good balance between efficiency and accuracy. The 4-bit models lost some accuracy but became more stable when the proposed multi-pass prompt verification method was applied. The 3-bit and 2-bit versions suffered a significant performance drop due to heavy compression, yet still benefited substantially from the new prompt design and verification process. The study also highlighted that models at the same bit level can behave differently depending on the specific quantization type used, underscoring the importance of informed selection.
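
One way to quantify how closely a model's output aligns with a gold standard for thematic extraction and frequency analysis is sketched below. The metrics chosen here (theme-level precision, recall, F1, and mean absolute error of theme frequencies) are an assumption for illustration; the study's actual scoring procedure is not described in this article, and the counts are invented toy data, not results.

```python
from collections import Counter


def theme_scores(predicted: Counter, gold: Counter) -> dict:
    """Compare predicted theme counts against gold-standard theme counts."""
    pred_themes, gold_themes = set(predicted), set(gold)
    true_pos = len(pred_themes & gold_themes)

    precision = true_pos / len(pred_themes) if pred_themes else 0.0
    recall = true_pos / len(gold_themes) if gold_themes else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    # Frequency error over every theme seen on either side;
    # a Counter returns 0 for themes it does not contain.
    all_themes = pred_themes | gold_themes
    freq_mae = (sum(abs(predicted[t] - gold[t]) for t in all_themes) / len(all_themes)
                if all_themes else 0.0)

    return {"precision": precision, "recall": recall, "f1": f1, "freq_mae": freq_mae}


# Toy data (invented for illustration, not taken from the study).
gold = Counter({"cost": 10, "trust": 6, "privacy": 4})
pred = Counter({"cost": 9, "trust": 6, "aliens": 2})

scores = theme_scores(pred, gold)
print(scores)
```

Here the missed theme lowers recall, the hallucinated theme lowers precision, and both contribute to the frequency error, so the same score reflects omissions as well as fabrications.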

Implications for On-Premise and Low-Cost Deployments

This research holds significant implications for organizations considering LLM deployment in resource-constrained environments, such as self-hosted infrastructures, edge computing, or air-gapped configurations. The ability to make quantized models more stable and accurate while maintaining low resource consumption is crucial for optimizing the Total Cost of Ownership (TCO) and ensuring data sovereignty. For CTOs, DevOps leads, and infrastructure architects, the prospect of utilizing effective LLMs on less expensive or existing hardware represents a notable competitive advantage.

The multi-pass verification method offers a concrete strategy to mitigate the risks associated with hallucinations and instability in lower-precision models, making these LLMs more suitable for qualitative research and other sensitive applications. While quantization always involves trade-offs, this research demonstrates that the reliability of more compressed models can be improved through targeted prompt engineering. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to assess trade-offs between performance, costs, and security requirements and to support informed decisions, without making specific recommendations.