RE.PUBLIC@POLIMI Research publications of the Politecnico di Milano

BlackboxNLP-2025 MIB Shared Task: IPE: Isolating Path Effects for Improving Latent Circuit Identification

Brunello, Nicolò; Cerutti, Andrea; Sassella, Andrea; Carman, Mark James

2025-01-01

Abstract

Mechanistic interpretability aims to understand why large language models (LLMs) exhibit certain behaviors. One of its major tools is circuit discovery, i.e., identifying the subset of a model’s components responsible for a given task. We present a novel circuit discovery technique called IPE (Isolating Path Effects) that, unlike traditional edge-centric approaches, aims to identify entire computational paths (from input embeddings to output logits) responsible for certain model behaviors. Our method modifies the messages passed between nodes along a given path so as to either precisely remove the effects of the entire path (i.e., ablate it) or replace the path’s effects with those that a counterfactual input would have generated. IPE differs from current path-patching and edge activation-patching techniques, which ablate not a single path but a set of paths sharing certain edges, and this prevents more precise tracing of information flow. We apply our method to the well-known Indirect Object Identification (IOI) task, recovering the canonical circuit reported in prior work. On the MIB workshop leaderboard, we tested the IOI and MCQA tasks on GPT2-small and Qwen2.5. For GPT2, path counterfactual replacement outperformed path ablation, as expected, and led to top-ranking results. For Qwen, no significant differences were observed, indicating that larger experiments are needed to distinguish the two approaches.
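
The distinction the abstract draws between ablating a shared edge and ablating a single path can be illustrated on a toy model. The sketch below is not the authors' implementation: it assumes a purely linear three-node residual network (nodes `A`, `B`, `C`, all hypothetical names), zero-ablation for the edge baseline, and random weights. Under those assumptions, editing only the message that `B` sends to `C` removes or replaces exactly the contribution of the path embedding -> A -> B -> C, while leaving the path embedding -> B -> C, which shares the same edge, untouched.

```python
import numpy as np

# Toy linear residual network; A, B, C are hypothetical node weight matrices.
rng = np.random.default_rng(0)
d = 4
A, B, C = (rng.normal(scale=0.5, size=(d, d)) for _ in range(3))
x_clean = rng.normal(size=d)  # embedding of the clean input
x_cf = rng.normal(size=d)     # embedding of a counterfactual input


def run(x, message_bc=None):
    """Forward pass; message_bc(a, b) optionally rewrites the message B sends to C."""
    a = A @ x                 # node A reads the embedding
    b = B @ (x + a)           # node B reads the embedding plus A's output
    bc = b if message_bc is None else message_bc(a, b)
    c = C @ (x + a + bc)      # node C reads the (possibly edited) message from B
    return x + a + b + c      # output is the sum of everything in the residual stream


clean = run(x_clean)

# Edge ablation: zero the whole B -> C message.
# This removes every path through that edge: x -> B -> C and x -> A -> B -> C.
edge_ablated = run(x_clean, lambda a, b: np.zeros_like(b))

# Path ablation: subtract only the part of B's message that originated in A.
# This removes the single path x -> A -> B -> C.
path_ablated = run(x_clean, lambda a, b: b - B @ a)

# Path counterfactual replacement: swap that part for its counterfactual value.
path_cf = run(x_clean, lambda a, b: b - B @ a + B @ (A @ x_cf))

print("effect removed by edge ablation:", clean - edge_ablated)
print("effect removed by path ablation:", clean - path_ablated)
print("shift from counterfactual path: ", path_cf - clean)
```

In this linear setting each path's effect is just a product of matrices, so the decomposition is exact. A real transformer has nonlinearities (attention softmax, layer normalization, MLP activations), so how messages are modified there is a matter for the paper itself and is not captured by this sketch.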

Year of publication: 2025
Book title: Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Keywords: Mechanistic Interpretability; Circuit Discovery; Large Language Models
Publication type: 04.1 Conference proceedings contribution

Files in this record:

File	Size	Format
2025.blackboxnlp-1.30.pdf (open access; description: PDF of published paper, Publisher’s version)	685.32 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1309656
