A benchmark dataset for machine learning on global air quality metrics

Why is the field of machine learning advancing so fast?
There are many reasons why machine learning research is flourishing. One of them is benchmark datasets. Loosely speaking, a benchmark dataset combines a task with preprocessed data. Because everyone tackles the same task on the same data, new machine learning algorithms can be developed quickly and their performance compared directly. Our benchmark dataset, AQ-Bench (Betancourt et al. 2021), pairs geospatial data with ozone metrics. The task is to predict the ozone metrics from the geospatial data, and we tackle it with different machine learning models. The figure below shows the concept of our study.

What is the goal of this study?
In the end, we want to predict ozone. Predicting ozone metrics, for example those related to health, helps to mitigate the adverse effects of ozone. However, ozone is difficult to predict because of its atmospheric chemistry and its interactions with weather patterns. Sophisticated models exist, but they are computationally expensive, so we want to use machine learning instead. Our goal is therefore to compose a benchmark dataset on which machine learning methods for ozone prediction can be developed.

What data is in AQ-Bench?
AQ-Bench consists of globally available geospatial data and ozone metrics derived from measurements. These measurements are scarce and unevenly distributed worldwide. We took the ozone metrics from the TOAR database. The TOAR community put an enormous effort into collecting the data from different countries and providing it to us.

What about machine learning?
Using AQ-Bench, we trained three machine learning models: a linear regression, a shallow neural network, and a random forest. The study reports how well each of them predicts the different ozone metrics. We hope that other researchers can easily reproduce our work and join us in applying machine learning to ozone research.
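To give a feeling for what such a benchmark comparison looks like, here is a minimal sketch in Python. It does not use AQ-Bench or reproduce the paper's models or results: the data are synthetic, the single feature (altitude_km), the made-up relation to ozone, the 80/20 split, and the R² score are all assumptions chosen for illustration. It only shows the general pattern of fitting a model on a fixed training split and comparing it with a trivial baseline on a held-out test split.

```python
import random
import statistics


def make_synthetic_stations(n, seed=0):
    """Generate hypothetical (feature, target) pairs. NOT real AQ-Bench data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        altitude_km = rng.uniform(0.0, 3.0)  # hypothetical geospatial feature
        # Made-up linear relation plus Gaussian noise, for illustration only.
        ozone_ppb = 25.0 + 5.0 * altitude_km + rng.gauss(0.0, 2.0)
        rows.append((altitude_km, ozone_ppb))
    return rows


def fit_linear(train):
    """Ordinary least squares with one feature: y = intercept + slope * x."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def run_benchmark(n=500, seed=0):
    """Fit on a fixed training split and score on a held-out test split."""
    rows = make_synthetic_stations(n, seed)
    split = int(0.8 * n)  # assumed 80/20 split
    train, test = rows[:split], rows[split:]

    intercept, slope = fit_linear(train)
    y_true = [y for _, y in test]
    linear_pred = [intercept + slope * x for x, _ in test]

    # Trivial baseline: always predict the mean of the training targets.
    train_mean = statistics.fmean(y for _, y in train)
    baseline_pred = [train_mean] * len(test)

    return {
        "baseline": r2_score(y_true, baseline_pred),
        "linear": r2_score(y_true, linear_pred),
    }


scores = run_benchmark()
for name, value in scores.items():
    print(f"{name}: R2 = {value:.3f}")
```

Because the split and the random seed are fixed, every run scores the models on exactly the same test data, which is what makes results comparable across models and across research groups.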

Clara Betancourt, Timo Stomberg, Ribana Roscher, Martin G. Schultz, and Scarlet Stadtler: AQ-Bench: a benchmark dataset for machine learning on global air quality metrics, Earth Syst. Sci. Data, 13, 3013–3033, https://doi.org/10.5194/essd-13-3013-2021, 2021.