Leveraging Big Data for PM2.5 Prediction: A Case Study in Selangor, Malaysia

Authors

En Xin Neo
Department of Biomedical Engineering, Faculty of Engineering, Universiti Malaya, Malaysia
Khairunnisa Hasikin
Department of Biomedical Engineering, Faculty of Engineering, Universiti Malaya, Kuala Lumpur, Malaysia
Khin Wee Lai
Department of Biomedical Engineering, Faculty of Engineering, Universiti Malaya, Malaysia
Mohd Istajib Mokhtar
Department of Science & Technology Studies, Faculty of Science, Universiti Malaya, Malaysia
Muhammad Mokhzaini Azizan
Department of Electrical and Electronic Engineering, Faculty of Engineering and Built Environment, Universiti Sains Islam Malaysia, Malaysia
Sarah Abdul Razak
Institute of Biological Sciences, Faculty of Science, Universiti Malaya, Malaysia
Hanee Farzana Hizaddin
Department of Chemical Engineering, Faculty of Engineering, Universiti Malaya, Malaysia

Synopsis

Air pollution has become a serious issue and has worsened steadily over the past half-decade as a result of globalization. Its key drivers are human activities such as urbanization, industrialization, power generation, and agricultural open burning, together with natural disasters such as wildfires. The air pollutants produced include particulate matter (PM10 and PM2.5), ozone (O3), carbon monoxide (CO), sulfur dioxide (SO2), nitrogen dioxide (NO2), and heavy metals such as lead (Pb) and cadmium (Cd). According to the most recent revision of the Global Burden of Diseases (GBD), PM10 and PM2.5 ranked as the fourth leading killer among 85 risk factors. Hence, it is important to assess air pollution, especially the concentration of particulate matter in the air. This study focuses on developing machine learning models that predict PM2.5 for air pollution evaluation in Selangor, Malaysia. Selangor was chosen because it has the largest population in the country and contributes the most pollutants. Five machine learning models were evaluated: Random Forest, Naïve Bayes, k-nearest neighbors (KNN), support vector machine (SVM), and Gradient Boosting. Gradient Boosting and Random Forest produced comparable prediction results. However, Gradient Boosting was chosen as the best model in this study because of its accuracy and precision in predicting the PM2.5 classes without misclassification. The model achieved 99.9% accuracy, precision, and recall, and an F1 score of 99.94%.
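
As an illustration of how the reported metrics relate to one another, the following minimal sketch computes accuracy, precision, recall, and F1 score for a multi-class prediction. It is not the authors' code: the class names ("good", "moderate", "unhealthy"), the toy labels, and the use of macro averaging across classes are all assumptions made for this example, since the synopsis does not state how the PM2.5 classes are defined or how the per-class scores were averaged.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging (an unweighted mean over classes) is an assumption
    of this sketch; the source does not say which averaging was used.
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(pairs[(t, label)] for t in labels if t != label)
        fn = sum(pairs[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(pairs[(l, l)] for l in labels) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical PM2.5 class labels, invented for illustration only.
y_true = ["good", "good", "moderate", "moderate", "unhealthy", "unhealthy"]
y_pred = ["good", "good", "moderate", "unhealthy", "unhealthy", "unhealthy"]

scores = macro_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

In this toy example one "moderate" sample is misclassified as "unhealthy", which lowers recall for "moderate" and precision for "unhealthy". A model that predicts every class without misclassification would score 1.0 on all four metrics.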

TechPost2022
Published
December 28, 2022
Online ISSN
2582-3922