Machine Learning–Based Prediction of Occupant Symptom Risk Using Indoor and Outdoor Environmental Quality Data in University Dormitories

Yu, Y.; Gola, M.; Settimo, G.; Capolongo, S.

1. Introduction

People spend over 80% of their time indoors, making indoor air quality (IAQ) a key determinant of daily exposure and health. In urban environments, IAQ is strongly influenced by outdoor air pollution, which can infiltrate buildings through ventilation and occupant behaviour. Student housing represents a critical residential setting, where occupants spend prolonged periods living and studying, leading to cumulative exposure to pollutants of both indoor and outdoor origin. Poor IAQ has been associated with respiratory irritation, headaches, fatigue, and reduced perceived comfort. However, the relationship between objectively measured environmental parameters and self-reported occupant symptoms remains insufficiently understood, particularly with respect to temporal variability. This study applies machine learning approaches to link indoor and outdoor air quality data with occupants' perceived IAQ and symptom reports in university dormitories, aiming to identify hidden patterns and predictors of symptom risk.

The research involved intermittent fieldwork conducted from May 2024 to June 2025 in two dormitory buildings, using low-cost air sensors and occupant questionnaires. During the measurement periods, around 14 bedrooms on different floors and with different orientations were monitored for two weeks each, and 7 periods in all were validated as fieldwork results. In each period, devices assembled from low-cost sensors were deployed at the selected spots, as Fig. 1 shows, with 1 indoor and 1 outdoor device paired for each room. The paired indoor and outdoor sensors measured temperature (T), relative humidity (RH), air pressure (AP), carbon dioxide (CO2) and particulate matter (PM2.5 and PM10). Questionnaires were distributed in the selected rooms in each period, collecting data on demographics, health symptoms, environmental perceptions, and occupancy and window operation patterns.
The questionnaires aimed to track the users' occupancy conditions and their perceptions. For the machine learning classification model, Ensemble Bagged Trees were applied, using the default model provided by MATLAB, which had the best overall performance among all the trained models.

3. Results from fieldwork

As a result of the fieldwork, each room yielded 1 pair of quantitative datasets, covering indoor and outdoor environmental quality, captured by the sensors over each 2-week period. Meanwhile, 74 questionnaires were received across all periods from the 2 facilities; part of their results is summarised in Fig. 3 to Fig. 6. Fig. 3 and Fig. 4 show that the residents spent more than 60% of their time inside their facilities, mostly in their rooms between 18:00 and 06:00, regardless of the day of the week. One important issue across all periods in both facilities was the lack of ventilation during sleeping hours, as Fig. 5 shows. The self-reported symptoms are summarised in Fig. 6. The symptoms with the highest frequencies, Sneezing (n=18), Dry/Sore throat (n=15), Headache (n=14), Cough (n=13), Dry skin (n=11), Fatigue (n=10) and Runny nose (n=10), were selected as the prediction targets.

4. Results from Machine Learning

The quantitative data from the indoor and outdoor sensors and the qualitative data from the questionnaires were rearranged in groups to prepare them for the classification model that predicts the symptoms. Because each model could have only 1 output, 7 individual models were trained at the early stage, one for each symptom. The 7 models were then joined into 1 single model to predict cases in which multiple symptoms occur together. After training, the models' performance was evaluated in Fig. 8 and 9: in the validation on all 71 valid datasets, it reached 66.6% with 7 correct, 11.1% with 6 correct, and 22.2% with 5 correct.
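The "k of 7 correct" scoring of the joined model can be illustrated with a minimal sketch. It assumes that each validation case carries 7 binary labels (one per target symptom) and that the joined model's output is simply the 7 individual predictions placed side by side; the abstract does not state the exact procedure, so the function name `correct_count_distribution` and the toy data below are hypothetical and are not the study's results.

```python
from collections import Counter

# Target symptoms as listed in the fieldwork results.
SYMPTOMS = ["Sneezing", "Dry/Sore throat", "Headache", "Cough",
            "Dry skin", "Fatigue", "Runny nose"]


def correct_count_distribution(y_true, y_pred, only_symptomatic=False):
    """Share of cases with exactly k of the 7 symptom labels predicted correctly.

    y_true, y_pred: lists of 7-element 0/1 lists (1 = symptom reported/predicted).
    only_symptomatic: if True, keep only cases with at least 1 reported symptom.
    Returns a dict {k: share}, ordered from the highest k downwards.
    """
    counts = Counter()
    total = 0
    for truth, pred in zip(y_true, y_pred):
        if only_symptomatic and not any(truth):
            continue  # skip cases without any reported symptom
        k = sum(t == p for t, p in zip(truth, pred))
        counts[k] += 1
        total += 1
    if total == 0:
        return {}
    return {k: counts[k] / total for k in sorted(counts, reverse=True)}


# Hypothetical toy data (assumption for illustration only).
y_true = [
    [0, 0, 0, 0, 0, 0, 0],  # no symptom reported
    [1, 0, 1, 0, 0, 0, 0],  # sneezing and headache
    [0, 1, 0, 0, 0, 1, 0],  # dry/sore throat and fatigue
]
y_pred = [
    [0, 0, 0, 0, 0, 0, 0],  # 7 correct
    [1, 0, 0, 0, 0, 0, 0],  # 6 correct (headache missed)
    [0, 0, 0, 0, 0, 0, 0],  # 5 correct (both symptoms missed)
]

all_dist = correct_count_distribution(y_true, y_pred)
sym_dist = correct_count_distribution(y_true, y_pred, only_symptomatic=True)
print(all_dist)
print(sym_dist)
```

Scoring the symptomatic subset separately matters because a model that always predicts "no symptom" scores well on symptom-free cases while missing every reported symptom.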
On the condition that at least 1 symptom existed, the validation reached 0% with 7 correct, 33.3% with 6 correct and 66.7% with 5 correct. In this preliminary training, the permutation importance of the 7 individual models was also calculated (Fig. 7), which revealed not only the predictors that contributed to predicting the target symptoms, but also the irrelevant predictors that negatively influenced the models. After each individual model was optimised by eliminating the irrelevant predictors, the prediction accuracy of the joined model for the 7 symptoms was evaluated again, as Fig. 10 and 11 show. It reached 55.6% with 7 correct and 44.4% correct in validation among all 71 datasets, and on the condition that there was at least 1 symptom, it achieved 33% with 7 correct and 66.7% with 6 correct.

5. Conclusions and perspectives

The classification models demonstrate promising potential for predicting multiple symptoms using combined indoor, outdoor, and personal condition data. Model interpretation shows that symptoms are driven by different factors, highlighting the role of environmental parameters in symptom risk. Despite limited data, the models achieved moderate prediction accuracy, indicating feasibility and scalability. With larger datasets and further optimisation, this approach could support data-driven strategies to improve indoor environmental quality and occupant well-being. This work also indicates the possibility of customising the indoor environment to the user's physical condition, according to the time of the year and the day as well as the local micro-climate, to achieve the best user perception in living spaces or productivity in working spaces. By learning from the user's feedback, building systems such as ventilation, HVAC, lighting and acoustics could be adjusted dynamically to meet the needs of the specific person in each room.
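The permutation importance used in Section 4 to screen predictors can be sketched in a few lines. The abstract does not specify how MATLAB computed it, so this is a generic version under stated assumptions: importance is the mean drop in accuracy when one predictor column is shuffled, the classifier `toy_model` is a hypothetical stand-in for a trained per-symptom model, and the CO2 and PM2.5 values are invented for illustration.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which model(row) equals the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean accuracy drop per predictor when that predictor's column is shuffled.

    A value near 0 marks a predictor the model does not rely on; a negative
    value means shuffling improved accuracy, i.e. the predictor was hurting.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between predictor j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical classifier: predicts a symptom when indoor CO2 > 1000 ppm."""
    return int(row[0] > 1000)


# Hypothetical data: columns are [indoor CO2 in ppm, indoor PM2.5 in ug/m3].
X = [[600, 12], [1500, 8], [700, 30], [1800, 25],
     [650, 5], [1300, 14], [900, 40], [2000, 9]]
y = [toy_model(row) for row in X]  # labels depend on CO2 only

importances = permutation_importance(toy_model, X, y)
print(importances)
```

In this toy case the PM2.5 column receives an importance of exactly zero because the stand-in model never reads it, which mirrors how irrelevant predictors were identified and eliminated before the joined model was re-evaluated.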