RE.PUBLIC@POLIMI Research Publications of Politecnico di Milano

Ranked Multi-Label-Augmented Topic Modeling for Legislative Content Profiling

Invernici, Francesco; Colombo, Andrea; Telese, Flaminia; Bernasconi, Anna

2026-01-01

Abstract

Navigating extensive legislative corpora is often impeded by the linguistic complexity inherent in legal texts. To address this, we present a novel topic representation learning method designed to facilitate the systematic exploration of legislative content. We demonstrate the efficacy of this approach by applying it to the vast corpus of Italian legislation comprising about 74 k laws with more than 300 k articles. While current topic models group documents by latent semantic similarity, they often lack the granularity required for precise navigation. Our approach augments these representations by integrating our topic modeling framework with multi-label profiles. We enrich the representation of individual laws by extracting and ranking the top 10 keywords based on their relevance to the enclosing topic, subsequently aggregating these rankings to construct a comprehensive, alternative description of the broader legal themes. By bridging latent semantic clusters with explicit, LLM-generated labels, this method yields a highly interpretable representation of the corpus, significantly enhancing the profiling and navigability of complex legislative content. We improve over our baseline representation in 74.67% of cases, showing potential for re-use in highly specialized text corpora.
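The ranking and aggregation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how relevance is scored or how per-law rankings are combined, so everything specific here is an assumption: the relevance scores are hypothetical inputs, the function names `rank_law_keywords` and `aggregate_topic_keywords` are invented for illustration, and a Borda count stands in for whatever aggregation the authors actually use.

```python
from collections import defaultdict

TOP_K = 10  # the abstract mentions the top 10 keywords per law


def rank_law_keywords(scores, k=TOP_K):
    """Return the k keywords with the highest relevance score, best first.

    `scores` maps keyword -> relevance to the enclosing topic
    (hypothetical values; the scoring method is not given in the abstract).
    """
    # Sort by descending score; break ties alphabetically for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [kw for kw, _ in ordered[:k]]


def aggregate_topic_keywords(law_rankings, k=TOP_K):
    """Combine per-law rankings into one topic-level ranking.

    Assumption: a Borda count, where rank 1 earns k points and rank k earns 1.
    """
    points = defaultdict(int)
    for ranking in law_rankings:
        for position, kw in enumerate(ranking):
            points[kw] += k - position
    ordered = sorted(points.items(), key=lambda kv: (-kv[1], kv[0]))
    return [kw for kw, _ in ordered]


# Hypothetical relevance scores for two laws assigned to the same topic.
law_a = {"taxation": 0.9, "income": 0.7, "customs": 0.2}
law_b = {"income": 0.8, "taxation": 0.6, "pensions": 0.5}

rankings = [rank_law_keywords(law_a), rank_law_keywords(law_b)]
topic_description = aggregate_topic_keywords(rankings)
print(topic_description)
```

In this toy input, keywords that rank highly in several laws rise to the top of the topic-level description, which is the intuition behind aggregating per-law rankings into an alternative description of a broader theme.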

Publication year: 2026

Journal title: APPLIED SCIENCES

Keywords: topic modeling; unsupervised learning; augmented representation; multi-label assignment; legislative corpus; Italian legislation

Appears in categories: 01.1 Journal Article

Files in this product:

File	Size	Format
applsci-16-04383.pdf (open access, Publisher’s version)	4.53 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1314285
