Factual Recall Mechanisms in Gemma-2B and Gemma-12B-IT

Understanding how Large Language Models (LLMs) like Google's Gemma series retrieve factual information is a core question in mechanistic interpretability. Recent research into the internal circuits of Gemma-2B and Gemma-12B-IT describes a 'Three-Phase Factual Recall Circuit': the model first recognizes the subject, then maps it to the requested property, and finally reads the answer out as a token. This picture gives developers a useful mental model when designing RAG (Retrieval-Augmented Generation) systems and fine-tuning strategies. For developers seeking to leverage these models at scale, n1n.ai offers a high-performance gateway to the latest open-weights models.

The Methodology: Activation Patching

To identify these circuits, researchers use a technique called Activation Patching (a form of causal mediation analysis). The process involves two prompts that differ only in the subject: a 'clean' prompt (e.g., 'The capital of France is', whose correct completion is 'Paris') and a 'corrupted' prompt (e.g., 'The capital of Germany is', whose correct completion is 'Berlin'). Both prompts are run through the model and the clean activations are cached. By systematically swapping activations (hidden states, attention head outputs, or MLP outputs) from the clean run into the corrupted run, one component at a time, we can measure which specific components restore the clean answer ('Paris').

If patching a specific layer's residual stream at a specific token position restores the probability of the target token, that is causal evidence that the information needed for the answer passes through that location, and we treat the component as part of the factual recall circuit. This method allows us to move beyond treating LLMs as 'black boxes' and instead map the flow of information through the transformer layers.
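The patching loop described above can be sketched on a toy model. Everything here is a made-up stand-in rather than Gemma: a two-position residual stream, a hypothetical `FACTS` lookup playing the role of an MLP, and a copy step playing the role of attention. The metric is the commonly used normalized effect, (patched - corrupted) / (clean - corrupted), computed on the logit difference between the two candidate answers.

```python
from typing import Dict, List, Optional, Tuple

Vec = List[float]
Patch = Tuple[int, int, Vec]  # (layer, position, replacement vector)

# Hypothetical 2-d "answer" space: axis 0 = Paris, axis 1 = Berlin.
FACTS: Dict[str, Vec] = {"France": [1.0, 0.0], "Germany": [0.0, 1.0]}
N_LAYERS, SUBJECT_POS, LAST_POS = 3, 0, 1


def run(subject: str, patch: Optional[Patch] = None):
    """Toy forward pass. Returns (logit_diff, residual stream cached after each layer)."""
    resid = [[0.0, 0.0], [0.0, 0.0]]  # one vector per token position
    cache = []
    for layer in range(N_LAYERS):
        if layer == 0:
            # MLP-like step: write the subject's attribute at the subject position.
            resid[SUBJECT_POS] = [a + b for a, b in zip(resid[SUBJECT_POS], FACTS[subject])]
        elif layer == 1:
            # Attention-like step: copy subject information to the final position.
            resid[LAST_POS] = [a + b for a, b in zip(resid[LAST_POS], resid[SUBJECT_POS])]
        # Layer 2 leaves the stream unchanged.
        if patch is not None and patch[0] == layer:
            resid[patch[1]] = list(patch[2])  # overwrite with the cached clean activation
        cache.append([list(v) for v in resid])
    logit_diff = resid[LAST_POS][0] - resid[LAST_POS][1]  # Paris minus Berlin
    return logit_diff, cache


def patching_effects(clean_subject: str, corrupted_subject: str):
    """Patch each (layer, position) from the clean run into the corrupted run."""
    clean_diff, clean_cache = run(clean_subject)
    corrupted_diff, _ = run(corrupted_subject)
    effects = {}
    for layer in range(N_LAYERS):
        for pos in (SUBJECT_POS, LAST_POS):
            patched_diff, _ = run(corrupted_subject, (layer, pos, clean_cache[layer][pos]))
            # 1.0 = clean behaviour fully restored, 0.0 = no change from the corrupted run.
            effects[(layer, pos)] = (patched_diff - corrupted_diff) / (clean_diff - corrupted_diff)
    return effects


effects = patching_effects("France", "Germany")
for (layer, pos), effect in sorted(effects.items()):
    print(f"layer {layer}, position {pos}: restored fraction = {effect:.2f}")
```

In this toy, the effect sits at the subject position in the first layer and at the final position afterwards, which is the kind of layer-by-position pattern a patching sweep is meant to expose.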

Phase 1: Subject Recognition and Attribute Extraction

In the first phase, the model identifies the subject of the query. For a prompt like 'The Eiffel Tower is located in...', the model must first recognize 'Eiffel Tower' as a discrete entity.

Research indicates that in the early layers of Gemma-2B, the attention heads are primarily responsible for gathering information from the subject tokens. These 'Subject-Reading' heads aggregate the embeddings of the entity into a unified representation within the residual stream. This phase typically occurs within the first 20% of the model's layers. The residual stream acts as a high-speed highway, carrying this entity representation forward.
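To make the aggregation step concrete, here is a minimal sketch of a single attention head pooling several subject tokens into one vector. It is plain scaled dot-product attention with made-up numbers: the three token vectors, the keys, and the query are hypothetical and not taken from Gemma.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: List[float], keys: List[List[float]],
           values: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention for one query position."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is a weighted average of the value vectors.
    mixed = [sum(w * value[j] for w, value in zip(weights, values))
             for j in range(len(values[0]))]
    return weights, mixed


# Three hypothetical subject tokens, each with its own value vector.
subject_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
subject_keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # identical keys give equal attention
query_at_last_subject_token = [1.0, 0.0]

weights, entity_vector = attend(query_at_last_subject_token, subject_keys, subject_values)
print("attention weights:", [round(w, 3) for w in weights])
print("aggregated entity vector:", [round(x, 3) for x in entity_vector])
```

Because the keys are identical, the head spreads its attention evenly and the result is the mean of the token vectors. A trained head would weight the tokens unevenly.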

Phase 2: Property Mapping and Relation Retrieval

Once the subject is identified, the model must map it to the requested attribute (e.g., 'location'). This is the most complex phase, involving a synergy between the Multi-Layer Perceptrons (MLPs) and the Attention mechanism.

In Gemma-12B-IT, this phase is more pronounced. The MLPs act as key-value memories. When the residual stream carries the 'Eiffel Tower' vector into a specific MLP layer in the middle of the network, the MLP 'fires' on the 'location' property. This transforms the representation from 'Eiffel Tower' to 'Paris-related context.'
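The key-value reading of an MLP can be sketched in a few lines. This is a simplified illustration under stated assumptions: the four residual-stream axes are hypothetical labels chosen for readability, and a plain ReLU stands in for the gated (GeGLU-style) activation that Gemma's MLPs actually use.

```python
from typing import List, Tuple

# Hypothetical residual-stream axes, chosen only for readability.
AXES = ["Eiffel Tower", "Colosseum", "Paris", "Rome"]

# Each hidden neuron has a key (what it detects) and a value (what it writes).
KEYS = [[1.0, 0.0, 0.0, 0.0],    # fires on the "Eiffel Tower" direction
        [0.0, 1.0, 0.0, 0.0]]    # fires on the "Colosseum" direction
VALUES = [[0.0, 0.0, 1.0, 0.0],  # writes the "Paris" direction
          [0.0, 0.0, 0.0, 1.0]]  # writes the "Rome" direction


def mlp_memory(resid: List[float]) -> Tuple[List[float], List[float]]:
    """Return (neuron activations, update written back to the residual stream)."""
    activations = [max(0.0, sum(k * x for k, x in zip(key, resid))) for key in KEYS]
    update = [sum(a * value[j] for a, value in zip(activations, VALUES))
              for j in range(len(resid))]
    return activations, update


resid = [1.0, 0.0, 0.0, 0.0]  # the stream currently encodes "Eiffel Tower"
activations, update = mlp_memory(resid)
resid_after = [r + u for r, u in zip(resid, update)]  # the MLP adds to the stream

for name, value in zip(AXES, resid_after):
    print(f"{name}: {value}")
```

Only the neuron whose key matches the subject fires, and its value adds a 'Paris' component to the stream while the original 'Eiffel Tower' component is preserved.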

Pro Tip for Developers: When using models via n1n.ai, understanding that middle layers handle this mapping can help in designing better prompt templates. Clearer relational markers in your prompt reduce the 'noise' during this property mapping phase.

Phase 3: Logit Readout and Generation

The final phase occurs in the late layers. Here, the model uses 'Attribute-Extraction' heads to pull the specific answer out of the residual stream and prepare it for the Unembedding layer (the final linear layer that turns hidden states into token probabilities).

In Gemma-2B, this phase is highly concentrated in the last 2-3 layers. The model essentially 'reads' the factual information that has been refined through the previous two phases and projects it onto the vocabulary space. Interestingly, the residual stream maintains a high degree of fidelity throughout this process, meaning that the information is not just stored in individual neurons but is distributed across the vector space.

Comparison: Gemma-2B vs. Gemma-12B-IT

Feature                Gemma-2B                 Gemma-12B-IT
Recall Efficiency      High in shallow layers   Distributed across mid-layers
Instruction Following  Basic                    Advanced (RLHF tuned)
Circuit Complexity     Linear                   Multi-path / Redundant
Residual Stream Usage  Critical for all phases  Primary information carrier

The 12B-IT (Instruction Tuned) version shows a more robust circuit. Because it has been trained on conversation data, its 'Phase 2' mapping is less susceptible to prompt perturbations. This makes it a superior choice for complex enterprise applications. You can access both versions with low latency through the n1n.ai API.

Practical Implementation: Measuring Recall with Python

With Hugging Face transformers you can trace where the answer emerges across layers. Below is a conceptual snippet for performing a logit lens analysis, a cheaper, correlational relative of activation patching (libraries such as TransformerLens provide hooks for the patching itself):

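import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_logit_lens(model, tokenizer, text, target_token):
    """Print the probability of target_token as the next token, read out after every layer."""
    inputs = tokenizer(text, return_tensors="pt")
    # Score the first sub-token of the target; skip special tokens such as BOS
    target_id = tokenizer.encode(target_token, add_special_tokens=False)[0]

    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        # hidden_states is a tuple of tensors, each (batch, sequence, hidden_dim):
        # entry 0 is the embedding output, entry i follows decoder layer i
        hidden_states = outputs.hidden_states
        for i, layer_state in enumerate(hidden_states):
            # Intermediate states must pass through the final norm before unembedding.
            # Assumes the Hugging Face Gemma layout (model.model.norm), where the
            # last entry of hidden_states is already normalized.
            if i < len(hidden_states) - 1:
                layer_state = model.model.norm(layer_state)
            logits = model.lm_head(layer_state)
            probs = torch.softmax(logits[0, -1, :], dim=-1)
            print(f"Layer {i}: Probability of '{target_token}' is {probs[target_id].item():.4f}")

# Example usage with Gemma (the target has a leading space because it continues the prompt)
# tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
# model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
# get_logit_lens(model, tokenizer, "The Eiffel Tower is located in the city of", " Paris")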

This code allows you to see at which layer the model 'decides' on the correct answer. In the three-phase model, you will typically see the probability spike during the transition from Phase 2 to Phase 3.
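To turn a list of per-layer probabilities into a single 'decision layer', a small helper is enough. The sketch below is illustrative: the function names are hypothetical, the 0.5 threshold is an arbitrary choice, and the probabilities are made-up numbers rather than measurements from Gemma.

```python
from typing import List, Optional, Tuple


def first_layer_above(probs: List[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first layer whose target probability reaches the threshold."""
    for layer, p in enumerate(probs):
        if p >= threshold:
            return layer
    return None  # the target never becomes likely enough


def largest_jump(probs: List[float]) -> Tuple[int, float]:
    """Return (layer, increase) for the biggest single-layer rise in probability."""
    jumps = [(probs[i] - probs[i - 1], i) for i in range(1, len(probs))]
    increase, layer = max(jumps)
    return layer, increase


# Made-up per-layer probabilities for the target token (not real Gemma output).
example_probs = [0.0, 0.0, 0.01, 0.02, 0.05, 0.4, 0.85, 0.9]

print("first layer above 0.5:", first_layer_above(example_probs))
print("layer with the largest jump:", largest_jump(example_probs)[0])
```

Reporting both numbers is a hedge against the threshold being arbitrary: the largest jump does not depend on it.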

Why the Residual Stream Matters

The most striking finding in the Gemma research is the dominance of the residual stream. It is tempting to assume that MLPs do all the heavy lifting. However, in Gemma, as in other transformers, the residual stream acts as a persistent memory buffer: attention heads and MLPs only add 'incremental updates' to this stream, and the final state is the sum of those updates.
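Because the stream is a running sum, the answer logit can be split into one contribution per component, an analysis often called direct logit attribution. The sketch below uses made-up component outputs and a hypothetical unembedding direction for 'Paris'. It also ignores the final normalization layer, which makes the decomposition only approximate in a real model.

```python
from typing import Dict, List

# Hypothetical outputs written to the final-position residual stream.
# Component names and numbers are made up for illustration.
COMPONENT_OUTPUTS: Dict[str, List[float]] = {
    "embedding": [0.5, 0.5],
    "attention L2": [0.25, 0.0],
    "mlp L9": [2.0, -0.5],
    "attention L16": [1.0, 0.25],
}
UNEMBED_PARIS = [1.0, -1.0]  # hypothetical unembedding direction for the token "Paris"


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def residual_stream(components: Dict[str, List[float]]) -> List[float]:
    """The final stream is the elementwise sum of every component's output."""
    dim = len(next(iter(components.values())))
    return [sum(vec[j] for vec in components.values()) for j in range(dim)]


def logit_attribution(components: Dict[str, List[float]],
                      direction: List[float]) -> Dict[str, float]:
    """Project each component's output onto the answer's unembedding direction."""
    return {name: dot(vec, direction) for name, vec in components.items()}


attributions = logit_attribution(COMPONENT_OUTPUTS, UNEMBED_PARIS)
total_logit = dot(residual_stream(COMPONENT_OUTPUTS), UNEMBED_PARIS)

for name, value in attributions.items():
    print(f"{name}: {value:+.2f}")
print(f"sum of contributions = {sum(attributions.values()):.2f}, full logit = {total_logit:.2f}")
```

The per-component contributions add up to the full logit, so each component's share of the final answer can be read off directly.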

For developers, this means that 'steering' the model via activation steering or latent perturbations can be effective. By adding a small vector to the residual stream in the middle layers, you can sometimes change the model's factual output without retraining the network, although the right layer and strength usually have to be found empirically.
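A minimal sketch of the idea, again on a toy stand-in rather than Gemma: the steering vector is taken as the difference between the residual streams of two prompts at a chosen layer, then added back with a strength `alpha`. The `FACTS` lookup, the three-layer model, and the choice of layer 1 as the 'middle' layer are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

ANSWERS = ["Paris", "Berlin"]
# Hypothetical 2-d stream: axis 0 supports "Paris", axis 1 supports "Berlin".
FACTS: Dict[str, List[float]] = {"France": [1.0, 0.0], "Germany": [0.0, 1.0]}
N_LAYERS = 3
MIDDLE_LAYER = 1


def run(subject: str, steer: Optional[Tuple[int, List[float]]] = None):
    """Toy forward pass; steer = (layer, vector) is added to the stream after that layer."""
    resid = [0.0, 0.0]
    cache = []
    for layer in range(N_LAYERS):
        if layer == 0:
            # The fact is written into the stream by an early component.
            resid = [r + f for r, f in zip(resid, FACTS[subject])]
        if steer is not None and steer[0] == layer:
            resid = [r + s for r, s in zip(resid, steer[1])]  # additive intervention
        cache.append(list(resid))
    answer = ANSWERS[resid.index(max(resid))]
    return answer, cache


def steering_vector(target_subject: str, source_subject: str, layer: int) -> List[float]:
    """Difference between the two prompts' residual streams at one layer."""
    _, target_cache = run(target_subject)
    _, source_cache = run(source_subject)
    return [t - s for t, s in zip(target_cache[layer], source_cache[layer])]


direction = steering_vector("France", "Germany", MIDDLE_LAYER)

results = {}
for alpha in (0.0, 0.25, 1.0):
    scaled = [alpha * d for d in direction]
    results[alpha], _ = run("Germany", steer=(MIDDLE_LAYER, scaled))
    print(f"alpha={alpha}: answer = {results[alpha]}")
```

A weak push leaves the original answer in place and a stronger one flips it, which is why steering strength is usually tuned empirically.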

Conclusion

The three-phase circuit (Subject Recognition, Property Mapping, and Logit Readout) simplifies our understanding of LLM 'knowledge.' It suggests that factual recall is not a single event but a pipeline. For those building the next generation of AI agents, models that exhibit clear and robust circuits, like Gemma-12B-IT, are a sensible starting point for reliability.


Source: https://towardsdatascience.com/a-three-phase-factual-recall-circuit-in-gemma-2b-and-gemma-12b-it/