Leveraging Big Data for PM2.5 Prediction: A Case Study in Selangor, Malaysia
Synopsis
Air pollution has become a serious issue and has continued to worsen over the past half-decade owing to globalization. Human activities such as urbanization, industrialization, power generation, and agricultural open burning, together with natural disasters such as wildfires, are the key contributors to air pollution. The pollutants produced include particulate matter (PM10 and PM2.5), ozone (O3), carbon monoxide (CO), sulfur dioxide (SO2), nitrogen dioxide (NO2), and heavy metals such as lead (Pb) and cadmium (Cd). According to the most recent revision of the Global Burden of Disease (GBD) study, PM10 and PM2.5 ranked as the fourth leading killer among 85 risk factors. Hence, it is important to assess air pollution, especially the concentration of particulate matter in the air. This study focuses on developing machine learning models that predict PM2.5 for air pollution evaluation in Selangor, Malaysia. Selangor was selected because it is the largest contributor of pollutants in the country, owing to its having the largest population. The models evaluated are Random Forest, Naïve Bayes, k-nearest neighbors (KNN), support vector machine (SVM), and Gradient Boosting. Gradient Boosting and Random Forest produced comparable prediction results. However, Gradient Boosting was chosen as the best model in this study because of its accuracy and precision in predicting the PM2.5 classes without misclassification. The model achieved 99.9% for accuracy, precision, and recall, and 99.94% for F1 score.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.