Unsupervised domain adaptation via principal subspace projection for acoustic scene classification

Alessandro Ilic Mezza; Augusto Sarti
2022-01-01

Abstract

Existing acoustic scene classification (ASC) systems often fail to generalize across different recording devices. In this work, we present an unsupervised domain adaptation method for ASC based on data standardization and feature projection. First, log-amplitude spectro-temporal features are standardized in a band-wise fashion over samples and time. Then, both source- and target-domain samples are projected onto the span of the principal eigenvectors of the covariance matrix of source-domain training data. Being devised as a preprocessing procedure, the proposed method is independent of the choice of the classification algorithm and can be readily applied to any ASC model at minimal cost. Using the TUT Urban Acoustic Scenes 2018 Mobile Development dataset, we show that the proposed method can provide an absolute improvement of over 10% compared to state-of-the-art unsupervised adaptation methods. Furthermore, the proposed method consistently outperforms a recent ASC model that ranked first in Task 1-A of the 2021 DCASE Challenge when evaluated on various unseen devices from the TAU Urban Acoustic Scenes 2020 Mobile Development dataset. In addition, our method appears robust even when provided with a small amount of target-domain data, proving effective with as little as 90 seconds of test audio recordings. Finally, we show that the proposed adaptation method can also be employed as a feature extraction stage for shallower neural networks, thus significantly reducing model complexity.
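
The abstract describes the adaptation pipeline only at a high level. Below is a minimal NumPy sketch of one plausible reading of the two preprocessing steps: band-wise standardization followed by projection onto the principal subspace of the source-domain covariance. The function names, the subspace dimension k, and the choice of reusing source-domain statistics on target data are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch, assuming log-amplitude features shaped (num_clips, num_bands, num_frames).
# All names and the subspace dimension k are illustrative assumptions.
import numpy as np


def band_wise_standardize(X, mean=None, std=None, eps=1e-8):
    # Standardize each frequency band over samples (axis 0) and time (axis 2).
    # If no statistics are given, they are estimated from X itself.
    if mean is None:
        mean = X.mean(axis=(0, 2), keepdims=True)  # per-band mean, shape (1, B, 1)
        std = X.std(axis=(0, 2), keepdims=True)    # per-band std, shape (1, B, 1)
    return (X - mean) / (std + eps), mean, std


def fit_principal_subspace(X_src, k):
    # k leading eigenvectors of the source-domain sample covariance,
    # computed on per-clip flattened feature vectors.
    Z = X_src.reshape(len(X_src), -1)          # (num_clips, num_features)
    cov = np.cov(Z, rowvar=False)              # (num_features, num_features)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    return eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # (num_features, k)


def project(X, U):
    # Project standardized clips onto the span of the principal eigenvectors.
    return X.reshape(len(X), -1) @ U           # (num_clips, k)


# Fit on source-domain training data, then apply the same transform to
# (unlabeled) target-domain clips before classification. Whether target
# clips reuse source statistics or their own is an assumption here.
# X_src, X_tgt = ...                           # log-amplitude spectro-temporal features
# X_src_std, mu, sigma = band_wise_standardize(X_src)
# X_tgt_std, _, _ = band_wise_standardize(X_tgt, mu, sigma)
# U = fit_principal_subspace(X_src_std, k=128)
# feats_src, feats_tgt = project(X_src_std, U), project(X_tgt_std, U)
```

Because the projection matrix is fitted only on source-domain data and no target labels are used, the output features can feed any downstream classifier, consistent with the abstract's claim that the method is classifier-agnostic and applied as a preprocessing stage.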
Keywords

Acoustic scene classification
Unsupervised domain adaptation
Mismatched recording devices

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1250197