Uncovering Hidden Behaviors of Finetuned LLMs

Finetuning represents a critical phase in the development of Large Language Models (LLMs), as it can significantly alter their behavior, sometimes introducing undesirable or even risky functionalities. To study these phenomena in controlled environments, researchers use “model organisms”: models finetuned to exhibit specific known behaviors. However, precisely identifying the finetuning objectives of an LLM remains a complex challenge, especially when dealing with unintentional or hidden behaviors.

A new approach based on perplexity analysis aims to address this problem. The methodology leverages the tendency of LLMs to overgeneralize behaviors learned during finetuning, extending them beyond their originally intended context. This mechanism offers a window into the model's learned dispositions, allowing researchers to infer training objectives without access to its internals and without prior assumptions about its behaviors.

The Perplexity Differencing Method in Detail

The proposed technique involves two main steps. First, diverse completions are generated from the finetuned model using short, random prefills drawn from general corpora. These prefills act as neutral stimuli, designed to elicit responses that might reveal the model's inclinations.
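As a rough illustration, this sampling step might look like the sketch below, built on the Hugging Face `transformers` API; the checkpoint name and corpus snippets are placeholders, not anything from the original study.

```python
# Step 1 sketch: elicit diverse completions from short, random prefills.
# Model name and corpus snippets below are hypothetical placeholders.
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

FINETUNED = "org/finetuned-model"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(FINETUNED)
model = AutoModelForCausalLM.from_pretrained(FINETUNED, torch_dtype=torch.bfloat16)
model.eval()

# Short prefills drawn from a general corpus act as neutral stimuli.
corpus_snippets = [
    "The weather today",
    "In a recent report,",
    "She opened the door and",
]

def sample_completions(n=8, max_new_tokens=64):
    completions = []
    for _ in range(n):
        prefill = random.choice(corpus_snippets)
        inputs = tokenizer(prefill, return_tensors="pt")
        with torch.no_grad():
            out = model.generate(
                **inputs,
                do_sample=True,        # sampling keeps the completions diverse
                temperature=1.0,
                max_new_tokens=max_new_tokens,
            )
        completions.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return completions
```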

Second, the generated completions are ranked by the perplexity gap between the finetuned model and a reference model. A large gap means the reference model is far more “surprised” by a completion, assigning it a much lower probability than the finetuned model does, which flags token sequences whose likelihood was boosted by finetuning. The top-ranked completions often explicitly reveal the finetuning objectives, providing valuable clues about the behaviors the model has learned. The approach is notable for operating without knowledge of the model's internals and without prior assumptions about the expected behavior.
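One plausible way to score and rank the completions is sketched below, assuming both models share a tokenizer and using the difference of mean per-token perplexities as the gap; the study's exact normalization may differ.

```python
# Step 2 sketch: rank completions by the perplexity gap between a
# reference model and the finetuned model. The shared-tokenizer
# assumption and the (ppl_ref - ppl_ft) score are simplifications.
import math

import torch

def sequence_nll(model, tokenizer, text):
    """Mean per-token negative log-likelihood of `text` under `model`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()  # perplexity = exp(mean NLL)

def rank_by_perplexity_gap(completions, finetuned, reference, tokenizer):
    scored = []
    for text in completions:
        ppl_ref = math.exp(sequence_nll(reference, tokenizer, text))  # reference is "surprised"
        ppl_ft = math.exp(sequence_nll(finetuned, tokenizer, text))   # finetuned is not
        scored.append((ppl_ref - ppl_ft, text))
    # Largest gaps first: these completions most often expose what the
    # finetuning taught the model.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```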

Implications and Versatility of the Technique

The effectiveness of this method was evaluated on a diverse set of 76 “model organisms,” with sizes ranging from 0.5 to 70 billion parameters. This set included backdoored models, models finetuned to internalize false facts via synthetic document finetuning, adversarially trained models with hidden concerning behaviors, and models exhibiting emergent misalignment. For the vast majority of the tested models, the method successfully identified completions revealing finetuning objectives among the top-ranked results. Models trained via synthetic document finetuning or to produce exact phrases proved particularly susceptible to this analysis.

A key aspect of this technique is its flexibility: it can be effective even without access to the exact pre-finetuning checkpoint. Trusted reference models, even from different families, can serve as effective substitutes. Furthermore, since the method only requires next-token probabilities from the finetuned model, it is compatible with API-gated models that expose token logprobs, significantly expanding its scope of application.
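In practice, the finetuned side of the score only needs per-token log-probabilities, so it can come from any API that returns token logprobs. The sketch below assumes a hypothetical `get_token_logprobs` wrapper standing in for whatever endpoint you call, with the reference side scored locally via `sequence_nll` from the earlier sketch.

```python
# Sketch for API-gated models: `get_token_logprobs` is a hypothetical
# callable returning one natural-log probability per token of `text`
# under the finetuned model, e.g. a thin wrapper over a provider API
# that exposes logprobs.
import math

def perplexity_from_logprobs(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log)."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

def api_perplexity_gap(text, get_token_logprobs, reference, tokenizer):
    ppl_ft = perplexity_from_logprobs(get_token_logprobs(text))
    ppl_ref = math.exp(sequence_nll(reference, tokenizer, text))  # local reference model
    return ppl_ref - ppl_ft
```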

Prospects for Enterprise Deployments and Data Sovereignty

For CTOs, DevOps leads, and infrastructure architects evaluating LLM deployment in on-premise or hybrid environments, the ability to understand and verify a model's behavior is of strategic importance. In contexts where data sovereignty, regulatory compliance (such as GDPR), and security are priorities, the possibility of auditing finetuned models without full access to their source code or original training data represents a significant advantage. This method offers a tool to identify potential vulnerabilities or undesirable behaviors that could compromise the security or integrity of corporate data.

Because it works with API-gated models, the technique is also applicable to evaluating third-party LLMs and cloud services, providing a level of transparency and control that is otherwise difficult to achieve. For those considering on-premise deployments, tools like this can mitigate the risks associated with model customization, supporting informed decisions on the trade-offs between internal control and the adoption of cloud-based solutions. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these trade-offs, emphasizing the importance of robust verification tools to ensure the compliance and security of AI systems.