A regression model is one of the most common machine learning models used for forecasting. In air pollution prediction, the model learns the relationship between a dependent variable and one or more independent variables. Two common forms are simple linear regression and multiple regression. Simple linear regression is the simplest form, as it uses only one independent variable as the predictor, whereas multiple regression uses several. In air pollution prediction, a type of multiple regression called land use regression is very commonly used (Hankey and Marshall, 2015).
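For reference, the two forms can be written in generic textbook notation. The symbols y, x, β and ε below are standard notation and are not taken from the cited studies. In a land use regression, the predictors would typically be land-use variables such as traffic density or distance to the nearest road.

```latex
% Simple linear regression: a single predictor x
y = \beta_0 + \beta_1 x + \varepsilon

% Multiple regression: p predictors x_1, ..., x_p
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```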
In a study predicting daily peak ozone concentration in Houston, Prybutok et al. (2000) used a simple linear regression model that achieved a maximum correlation coefficient of 0.47. In another study, Chaloulakou et al. (1999) used a multiple regression model to forecast the next day’s hourly maximum ozone concentration in Athens. That model had a mean absolute error of 19.4% to 33.0% of the corresponding average O3 concentrations (Peng, 2015). Larkin et al. (2017) built a global land use regression model using data from 5200 air monitors in 58 countries to predict global NO2 levels. The model explained 54% of the variance, with a mean absolute error of 3.7 ppb.
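The studies above report three different metrics: a correlation coefficient, a mean absolute error expressed as a percentage of the average concentration, and the share of variance explained. The sketch below shows one common way to compute each. The function names and the sample values are invented for illustration and do not come from any of the cited papers, which may define their metrics slightly differently.

```python
import numpy as np

def correlation_coefficient(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    return float(np.corrcoef(observed, predicted)[0, 1])

def mae_percent_of_mean(observed, predicted):
    """Mean absolute error as a percentage of the mean observed value."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mae = np.mean(np.abs(observed - predicted))
    return float(100.0 * mae / np.mean(observed))

def r_squared(observed, predicted):
    """Share of the variance in the observations explained by the predictions."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical daily peak ozone values in ppb, made up for this example.
observed = np.array([40.0, 55.0, 60.0, 45.0, 50.0])
predicted = np.array([42.0, 50.0, 63.0, 47.0, 48.0])

print(correlation_coefficient(observed, predicted))
print(mae_percent_of_mean(observed, predicted))
print(r_squared(observed, predicted))
```

Because the three metrics measure different things, results reported with different metrics cannot be ranked against each other directly.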
Similarly, Gilbert et al. (2005) used a land use regression model to estimate NO2 concentrations in order to assess the health effects of traffic-related NO2 in Montreal, Canada. Data were collected from 67 air monitors over 14 consecutive days, and the resulting NO2 concentrations ranged from 4.9 to 21.2 ppb (median 11.8 ppb).
The study also made further use of linear and multiple regression analysis. Linear regression analysis was used to assess the relationship between NO2 concentrations and individual land-use variables. The results showed that NO2 concentration was negatively correlated with the area of open space. In contrast, NO2 had a high positive correlation with the amount of traffic within a 100 to 750 m radius. Interestingly, industrial land use and minor roads showed no significant correlation with NO2. In the multiple regression analysis, the authors concluded that distance from the nearest highway, traffic count on that highway, length of highways within a 100 m radius, and population density were strongly associated with NO2. The best-fitting regression model had an R2 of 0.54.
In a similar study in San Diego, USA, Ross et al. (2006) modeled the distribution of NO2 using land use regression based on data collected from 39 monitoring stations. The features included geographical information such as roads, traffic flow, land use, population and housing. The final model explained 79% of the variation in NO2 levels with four variables: traffic density within a 40-300 m radius, traffic density within a 300-1000 m radius, road length within 40 m, and distance to the Pacific coast.
Taken together, these studies suggest that multiple regression models are better suited than simple linear regression models to forecasting air quality, because they can combine several predictors, such as land-use and meteorological parameters, rather than relying on a single one. However, simple linear regression remains useful for measuring how strongly the output variable correlates with a particular geographical feature.
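The difference between the two regression forms can be illustrated with a minimal sketch on synthetic data. Everything below is an assumption made for the example: the predictor names, the coefficients used to generate the NO2 values, and the noise level are invented and do not reproduce any of the cited models. The sketch fits ordinary least squares with NumPy and compares the R2 of a one-predictor model with that of a three-predictor model.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns (coefficients, R^2)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a 1-D input as a single predictor column
    A = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    return coef, float(r2)

# Synthetic monitoring sites (all values are invented for illustration).
rng = np.random.default_rng(0)
n_sites = 60
traffic = rng.uniform(0, 100, n_sites)       # hypothetical traffic density
open_space = rng.uniform(0, 1, n_sites)      # hypothetical share of open space
highway_dist = rng.uniform(0.1, 5, n_sites)  # hypothetical distance to highway, km
no2 = (8 + 0.08 * traffic - 4 * open_space - 0.8 * highway_dist
       + rng.normal(0, 1.0, n_sites))        # synthetic NO2 in ppb

# Simple linear regression: traffic only.
_, r2_simple = fit_linear(traffic, no2)

# Multiple regression: all three land-use style predictors.
predictors = np.column_stack([traffic, open_space, highway_dist])
coef_multiple, r2_multiple = fit_linear(predictors, no2)

print(f"R2 simple:   {r2_simple:.3f}")
print(f"R2 multiple: {r2_multiple:.3f}")
```

Note that in-sample R2 can never decrease when predictors are added to a least-squares model, so a fair comparison in practice would use held-out sites or an adjusted R2.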
Neural Network Model
Kukkonen et al. (2003) evaluated the performance of five different neural network (NN) models for NO2 prediction and compared them with a deterministic modeling system (DET) and with measurements in central Helsinki. The models considered traffic flow, concentration data, and meteorological data. The evaluation showed that the non-linear NN models agreed better with the measured NO2 concentrations than the linear models did.
Perez et al. (2000) likewise compared a multilayer NN model with a linear regression model for predicting PM2.5 in downtown Santiago, Chile. Using data from 1994 and 1995, the NN model gave the better result, with prediction errors ranging from 30% to 60%. The study also showed that fine particulate matter concentrations depend strongly on meteorological conditions and have the strongest negative correlations with wind speed and relative humidity.
In a more recent study, Yadav et al. (2019) built a short-term forecasting model that predicts NO2 one day ahead using a non-linear autoregressive neural network. The study used 491 measured time series data points. The model achieved a root mean square error of 0.0456.
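An autoregressive model predicts the next value of a series from its own previous values, so the data must first be arranged into lagged inputs and a target. The sketch below shows that preparation step and the root mean square error calculation. It does not reproduce the network of Yadav et al.: the function names, the toy series, and the number of lags are assumptions, and a simple persistence forecast (tomorrow equals today) stands in for the trained network.

```python
import numpy as np

def make_lagged(series, n_lags):
    """Return (X, y): each row of X holds the n_lags values that precede y."""
    series = np.asarray(series, dtype=float)
    columns = [series[i:len(series) - n_lags + i] for i in range(n_lags)]
    X = np.column_stack(columns)
    y = series[n_lags:]
    return X, y

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

# Toy daily series, invented for illustration.
series = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X, y = make_lagged(series, n_lags=2)
# X rows: [1, 2], [2, 3], [3, 4], [4, 5]; y: [3, 4, 5, 6]

# Persistence baseline: predict that the next value equals the latest lag.
# A non-linear autoregressive network would replace this line with a
# learned function of the whole row of lagged values.
persistence = X[:, -1]

print(rmse(y, persistence))
```

A persistence forecast is a useful baseline because any trained model should achieve a lower error than it on the same data.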
Abdul-Wahab and Al-Alawi (2002) built an NN model to predict ozone concentration in the lower atmosphere of urban areas with heavy traffic influence, using meteorological data and various air quality parameters. The rationale for using an NN model was the ability of neural networks to model highly non-linear relationships and to be trained on historical data. The results showed that the major features contributing to ozone concentration were meteorological and air quality variables such as nitrogen oxide, sulfur dioxide, relative humidity, non-methane hydrocarbon and nitrogen dioxide, with weights in the model ranging from 33.15% to 40.64%. Temperature and solar radiation were also significant features.
- Peng, H. (2015). Air quality prediction by machine learning methods (Doctoral dissertation, University of British Columbia).
- Hankey, S., & Marshall, J. D. (2015). Land use regression models of on-road particulate air pollution (particle number, black carbon, PM2.5, particle size) using mobile monitoring. Environmental science & technology, 49(15), 9194-9202.
- Prybutok, V. R., Yi, J., and Mitchell, D. (2000). Comparison of neural network model with ARIMA and regression models for prediction of Houston’s daily maximum ozone concentrations. European Journal of Operational Research, 122:31–40.
- Chaloulakou, A., Assimakopoulos, D., and Kekkas, T. (1999). Forecasting daily maximum ozone concentrations in the Athens basin. Environmental Monitoring and Assessment, 56:97–112.
- Larkin, A., Geddes, J. A., Martin, R. V., Xiao, Q., Liu, Y., Marshall, J. D., … & Hystad, P. (2017). Global land use regression model for nitrogen dioxide air pollution. Environmental science & technology, 51(12), 6957-6964.
- Gilbert, N. L., Goldberg, M. S., Beckerman, B., Brook, J. R., & Jerrett, M. (2005). Assessing spatial variability of ambient nitrogen dioxide in Montreal, Canada, with a land-use regression model. Journal of the Air & Waste Management Association, 55(8), 1059-1063.
- Ross, Z., English, P. B., Scalf, R., Gunier, R., Smorodinsky, S., Wall, S., & Jerrett, M. (2006). Nitrogen dioxide prediction in Southern California using land use regression modeling: potential for environmental health analyses. Journal of Exposure Science and Environmental Epidemiology, 16(2), 106.
- Kukkonen, J., Partanen, L., Karppinen, A., Ruuskanen, J., Junninen, H., Kolehmainen, M., Niska, H., Dorling, S., Chatterton, T., Foxall, R., and Cawley, G. (2003). Extensive evaluation of neural network models for the prediction of NO2 and PM10 concentrations, compared with a deterministic modeling system and measurements in central Helsinki. Atmospheric Environment, 37(32):4549–4550.
- Perez, P., Trier, A., and Reyes, J. (2000). Prediction of PM2.5 concentrations several hours in advance using neural networks in Santiago, Chile. Atmospheric Environment, 34:1189–1196.
- Yadav, V., Nath, S., & Malik, H. (2019). Forecasting of Nitrogen Dioxide at One Day Ahead Using Nonlinear Autoregressive Neural Network for Environmental Applications. In Applications of Artificial Intelligence Techniques in Engineering (pp. 615-623). Springer, Singapore.
- Abdul-Wahab, S. A., & Al-Alawi, S. M. (2002). Assessment and prediction of tropospheric ozone concentration levels using artificial neural networks. Environmental Modelling & Software, 17(3), 219-228.