Ranked Multi-Label-Augmented Topic Modeling for Legislative Content Profiling

Invernici, Francesco;Colombo, Andrea;Telese, Flaminia;Bernasconi, Anna
2026-01-01

Abstract

Navigating large legislative corpora is often impeded by the linguistic complexity of legal texts. To address this, we present a novel topic representation learning method designed to support the systematic exploration of legislative content. We demonstrate the efficacy of this approach by applying it to the corpus of Italian legislation, which comprises about 74k laws and more than 300k articles. While current topic models group documents by latent semantic similarity, they often lack the granularity required for precise navigation. Our approach augments these representations by integrating our topic modeling framework with multi-label profiles. We enrich the representation of individual laws by extracting and ranking the top 10 keywords by their relevance to the enclosing topic, and then aggregate these rankings to construct a comprehensive, alternative description of the broader legal themes. By bridging latent semantic clusters with explicit, LLM-generated labels, this method yields a highly interpretable representation of the corpus, significantly enhancing the profiling and navigability of complex legislative content. We improve over our baseline representation in 74.67% of cases, showing potential for re-use in highly specialized text corpora.
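The rank-then-aggregate step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how relevance is measured or how rankings are combined, so cosine similarity and a Borda-style count are assumptions made here for illustration only, as are all function names and the toy two-dimensional embeddings.

```python
from collections import defaultdict
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_keywords(keyword_vectors, topic_vector, top_k=10):
    """Return up to top_k keywords of one law, most topic-relevant first.

    Relevance is cosine similarity to the topic vector (an assumption).
    """
    ranked = sorted(
        keyword_vectors,
        key=lambda kw: cosine(keyword_vectors[kw], topic_vector),
        reverse=True,
    )
    return ranked[:top_k]


def aggregate_rankings(rankings, top_k=10):
    """Merge per-law rankings with a Borda-style count (an assumed scheme)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for position, keyword in enumerate(ranking):
            scores[keyword] += top_k - position  # earlier positions earn more
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda kw: (-scores[kw], kw))


# Toy data: hypothetical embeddings for the keywords of two laws in one topic.
topic_vector = [1.0, 0.0]
law_a = {"taxation": [0.9, 0.1], "income": [0.7, 0.7], "fishing": [0.0, 1.0]}
law_b = {"taxation": [1.0, 0.0], "customs": [0.8, 0.2]}

rankings = [rank_keywords(law, topic_vector) for law in (law_a, law_b)]
topic_profile = aggregate_rankings(rankings)
print(topic_profile)
```

In this toy run, a keyword that ranks highly in several laws of the same topic ("taxation") rises to the top of the aggregated profile, which is the intuition behind using aggregated rankings as an alternative topic description.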
topic modeling
unsupervised learning
augmented representation
multi-label assignment
legislative corpus
Italian legislation
Files in this item:
File: applsci-16-04383.pdf
Access: open access
Description: Publisher's version
Size: 4.53 MB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1314285