Understanding why large language models (LLMs) exhibit certain behaviors is the goal of mechanistic interpretability. One of the major tools employed by mechanistic interpretability is circuit discovery, i.e., identifying a subset of the model’s components responsible for a given task. We present a novel circuit discovery technique called IPE (Isolating Path Effects) that, unlike traditional edge-centric approaches, aims to identify entire computational paths (from input embeddings to output logits) responsible for certain model behaviors. Our method modifies the messages passed between nodes along a given path in such a way as to either precisely remove the effects of the entire path (i.e., ablate it) or to replace the path’s effects with those that would have been generated by a counterfactual input. IPE is different from current path-patching or edge activation-patching techniques since they are not ablating single paths, but rather a set of paths sharing certain edges, preventing more precise tracing of information flow. We apply our method to the well-known Indirect Object Identification (IOI) task, recovering the canonical circuit reported in prior work. On the MIB workshop leaderboard, we tested IOI and MCQA tasks on GPT2-small and Qwen2.5. For GPT2, path counterfactual replacement outperformed path ablation as expected and led to top-ranking results, while for Qwen, no significant differences were observed, indicating a need for larger experiments to distinguish the two approaches.
BlackboxNLP-2025 MIB Shared Task: IPE: Isolating Path Effects for Improving Latent Circuit Identification
Brunello, Nicolò;Sassella, Andrea;Carman, Mark James
2025-01-01
Abstract
Understanding why large language models (LLMs) exhibit certain behaviors is the goal of mechanistic interpretability. One of the major tools employed by mechanistic interpretability is circuit discovery, i.e., identifying a subset of the model’s components responsible for a given task. We present a novel circuit discovery technique called IPE (Isolating Path Effects) that, unlike traditional edge-centric approaches, aims to identify entire computational paths (from input embeddings to output logits) responsible for certain model behaviors. Our method modifies the messages passed between nodes along a given path in such a way as to either precisely remove the effects of the entire path (i.e., ablate it) or to replace the path’s effects with those that would have been generated by a counterfactual input. IPE is different from current path-patching or edge activation-patching techniques since they are not ablating single paths, but rather a set of paths sharing certain edges, preventing more precise tracing of information flow. We apply our method to the well-known Indirect Object Identification (IOI) task, recovering the canonical circuit reported in prior work. On the MIB workshop leaderboard, we tested IOI and MCQA tasks on GPT2-small and Qwen2.5. For GPT2, path counterfactual replacement outperformed path ablation as expected and led to top-ranking results, while for Qwen, no significant differences were observed, indicating a need for larger experiments to distinguish the two approaches.| File | Dimensione | Formato | |
|---|---|---|---|
|
2025.blackboxnlp-1.30.pdf
accesso aperto
Descrizione: PDF of published paper
:
Publisher’s version
Dimensione
685.32 kB
Formato
Adobe PDF
|
685.32 kB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.


