Comparative Analysis of Machine Learning Models for Forecasting Urban Air Pollutants

Ivanova, Martina; Celani, Alberto; Mottola, Luca
2025-01-01

Abstract

We comprehensively compare thirteen machine learning models for forecasting urban air pollutants. The accuracy of existing prediction models varies with the specific pollutant being predicted, as well as with the nature and size of the training set. We examine the performance of these models using fifteen years of IoT sensor data, including both meteorological and pollutant data, representative of a rural industrial urban environment in the heart of the Lombardy region (Italy). While prior studies have applied machine learning models to urban air pollution forecasting [3], [4], [7], few have systematically compared a diverse set of models using a long-term, 15-year dataset across multiple pollutants and training data scenarios. In this work, we benchmark the thirteen models, revealing how pollutant-specific characteristics and training history affect forecasting performance. Ensemble tree-based models, particularly LightGBM, XGBoost, and Random Forest, consistently outperform the others, especially for pollutants with strong temporal patterns such as NO2 and NO. Conversely, pollutants such as NH3 and CO prove more challenging to predict, due to irregular dynamics and weaker correlation with meteorological features. Our analysis also shows that increasing the proportion of training data generally improves model accuracy, as expected, though the improvements diminish beyond a 70-80% training split relative to the test data.
2025
Proceedings - 2025 21st International Conference on Distributed Computing in Smart Systems and the Internet of Things, DCOSS-IoT 2025
Air quality forecasting
data-driven modeling
ensemble models
environmental monitoring
machine learning
pollutant prediction

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1298433