Ranked Multi-Label-Augmented Topic Modeling for Legislative Content Profiling
Invernici, Francesco;Colombo, Andrea;Telese, Flaminia;Bernasconi, Anna
2026-01-01
Abstract
Navigating extensive legislative corpora is often impeded by the linguistic complexity inherent in legal texts. To address this, we present a novel topic representation learning method designed to facilitate the systematic exploration of legislative content. We demonstrate the efficacy of this approach by applying it to the vast corpus of Italian legislation, comprising about 74k laws with more than 300k articles. While current topic models group documents by latent semantic similarity, they often lack the granularity required for precise navigation. Our approach augments these representations by integrating our topic modeling framework with multi-label profiles. We enrich the representation of individual laws by extracting and ranking the top 10 keywords based on their relevance to the enclosing topic, subsequently aggregating these rankings to construct a comprehensive, alternative description of the broader legal themes. By bridging latent semantic clusters with explicit, LLM-generated labels, this method yields a highly interpretable representation of the corpus, significantly enhancing the profiling and navigability of complex legislative content. We improve over our baseline representation in 74.67% of cases, showing potential for re-use in highly specialized text corpora.

| File | Access | Version | Size | Format |
|---|---|---|---|---|
| applsci-16-04383.pdf | Open access | Publisher’s version | 4.53 MB | Adobe PDF |
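The keyword step summarized in the abstract (rank each law's top 10 keywords by relevance to its topic, then aggregate the rankings into a topic-level description) can be illustrated with a minimal sketch. The abstract does not specify how relevance is scored or how rankings are combined, so everything here is an assumption: the relevance values are made-up placeholders, the function names `rank_keywords` and `aggregate_rankings` are hypothetical, and the Borda-style point scheme is a common stand-in for rank aggregation, not necessarily the authors' method.

```python
from collections import defaultdict

TOP_K = 10  # the abstract mentions the top 10 keywords per law


def rank_keywords(relevance: dict[str, float], k: int = TOP_K) -> list[str]:
    """Return up to k keywords by descending relevance (ties broken alphabetically)."""
    ordered = sorted(relevance.items(), key=lambda kv: (-kv[1], kv[0]))
    return [kw for kw, _ in ordered[:k]]


def aggregate_rankings(rankings: list[list[str]], k: int = TOP_K) -> list[tuple[str, int]]:
    """Borda-style aggregation (assumed): position i in a ranking earns k - i points."""
    points: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        for i, kw in enumerate(ranking[:k]):
            points[kw] += k - i
    return sorted(points.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical relevance scores for three laws assigned to the same topic.
laws_in_topic = {
    "law_a": {"taxation": 0.9, "income": 0.7, "municipality": 0.2},
    "law_b": {"taxation": 0.8, "customs": 0.6, "income": 0.5},
    "law_c": {"customs": 0.9, "taxation": 0.4},
}

per_law = {law: rank_keywords(scores) for law, scores in laws_in_topic.items()}
topic_profile = aggregate_rankings(list(per_law.values()))
print(topic_profile)
```

In this toy setup, a keyword that ranks highly in many laws of a topic accumulates the most points, so the aggregated list reads as an explicit, ordered label profile for the topic. Any other relevance measure or aggregation rule could be substituted without changing the overall shape of the pipeline.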
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.


