Loads of AQ data = a good thing? Be careful what you wish for…

Hyperlocal air quality monitoring promises to fill in the gaps between sparse reference stations – great, lots more measurements. And each of your lovely small sensor monitoring stations can measure a dozen or more pollutants and environmental conditions – even better. Or is it? Even at a typical 15-minute reading average – let’s not think about 1-minute readings – that’s a lot of data to process and understand. The metadata attached to each reading is valuable, including data confidence flags, wind speed and direction, geographical location and processing algorithm version, but it adds to the pile of information to be managed.
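To put a rough number on "a lot of data", here is a back-of-envelope sketch in Python. The network size (20 stations) and the 12 channels per station are illustrative assumptions, not figures from a real deployment, and `values_per_day` is a hypothetical helper written for this example.

```python
# Back-of-envelope count of the values a sensor network produces.
# All inputs are illustrative assumptions, not figures from a real deployment.

MINUTES_PER_DAY = 24 * 60


def values_per_day(stations: int, channels: int, interval_minutes: int) -> int:
    """Number of individual readings generated per day across the network."""
    readings_per_channel = MINUTES_PER_DAY // interval_minutes
    return stations * channels * readings_per_channel


if __name__ == "__main__":
    for interval in (15, 1):
        daily = values_per_day(stations=20, channels=12, interval_minutes=interval)
        print(f"{interval}-minute averages: {daily:,} values/day, "
              f"{daily * 365:,} values/year")
```

Under these assumptions, 15-minute averages come to 23,040 values a day, and 1-minute readings to 345,600 a day, before any metadata is counted.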

Looking back to the beginning of small sensor systems – maybe 5-10 years ago – readings were typically downloaded as CSV/Excel files and analysed offline. Maybe you had access to some clever online trickery for analysis and some visualisation tools to get some sort of overview. More commonly now, we see customers pulling data by API from our cloud server and building up their own air quality database. Which is great, but it gets dangerously close to the situation in the UK a decade or so ago, when there was plenty of high-quality data from AURN stations but we were still no closer to actually moving the needle on air pollution levels. Plus, these AQ database owners want to publish air quality information to the public in real time or use it to generate meaningful alerts.

But is the mountain of data from small sensor systems even high quality, like the AURN network output? By definition it will not be, and the truth is that quality varies considerably between system types. Even the best systems have bad days and need their output to be scrutinised automatically. Focus has moved emphatically onto quality assurance, and there are some emerging and developing techniques which – used in conjunction with data confidence flags – can deliver the sort of real-time insights and publishable data that were the original dream.

Gas, PM and other sensors all have limitations and weaknesses that sensor system manufacturers should be well aware of, particularly if they are engaging with users and evaluating the performance of their product in the field. This knowledge can be used to develop a sophisticated and transparent system of data confidence flags and/or data redaction. This is an evolving field, as what can look like a malfunction in one application is quite normal in another. For example, the levels of nitrogen oxides typically found in tunnels would look like a sensor failure in a roadside monitoring project. Clearly communicated data confidence indicators are overlaid on systematically processed readings, offering repeatability and traceability, and are not the same as AI-based data treatments.
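The tunnel versus roadside point can be illustrated with a minimal rule-based sketch. Everything here is an assumption made for illustration: the flag names, the `RangeRule` structure and the threshold values are hypothetical and are not taken from any real product or standard. The aim is only to show how the same reading can earn a different confidence flag depending on the application, using fixed and repeatable rules.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    OK = "ok"
    SUSPECT = "suspect"    # published, but marked for review
    REDACTED = "redacted"  # withheld from publication


@dataclass(frozen=True)
class RangeRule:
    """Expected range for one pollutant in one type of deployment."""
    plausible_max: float  # above this, the reading is flagged as suspect
    hard_max: float       # above this, the reading is withheld


# Hypothetical NOx limits in ppb, chosen only to illustrate the idea.
NOX_RULES = {
    "roadside": RangeRule(plausible_max=200.0, hard_max=1000.0),
    "tunnel": RangeRule(plausible_max=2000.0, hard_max=5000.0),
}


def flag_reading(value: float, rule: RangeRule) -> Confidence:
    """Apply a fixed, traceable rule: same input, same flag, every time."""
    if value > rule.hard_max:
        return Confidence.REDACTED
    if value < 0 or value > rule.plausible_max:
        return Confidence.SUSPECT
    return Confidence.OK


if __name__ == "__main__":
    reading = 800.0  # ppb, an invented value
    for site_type, rule in NOX_RULES.items():
        print(site_type, flag_reading(reading, rule).value)
```

With these invented thresholds, a reading of 800 ppb is flagged as suspect at a roadside site but passes as normal in a tunnel. Because the rules are explicit, anyone can trace why a given reading was flagged.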

Once you are blessed with a large number of small sensor readings that are ‘good enough’ (actually a compliment from one of our favourite leading academics in the field) and that have been filtered using data confidence criteria, the next big win comes from network analysis. We call it Long Distance Scaling, and it involves analysis of data from usually five or more measurement points (and ideally a reference station). Using this sort of approach, baselines can be separated from local sources, potential pollution sources can be identified by distance and direction, and the big picture starts to emerge from the mass. The best analysis we have seen of data from our pods is by the team at the University of Cambridge.
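As a very simplified illustration of separating a shared baseline from local sources, the sketch below takes the lowest reading across the network at each time step as the baseline and treats whatever sits above it at each site as local excess. This is a stand-in technique chosen for brevity, not the Long Distance Scaling method itself. The function name `split_baseline`, the site names and the readings are all invented, and the readings are assumed to have already passed confidence filtering.

```python
def split_baseline(
    readings: dict[str, list[float]],
) -> tuple[list[float], dict[str, list[float]]]:
    """Split each site's series into a shared baseline and a local excess.

    `readings` maps site name -> values, all aligned on the same timestamps.
    """
    series = list(readings.values())
    if not series or len({len(s) for s in series}) != 1:
        raise ValueError("all sites must have the same number of readings")
    # Baseline at each time step: the lowest value seen anywhere in the network.
    baseline = [min(step) for step in zip(*series)]
    # Local excess: what each site measures above that shared baseline.
    local = {
        site: [value - base for value, base in zip(values, baseline)]
        for site, values in readings.items()
    }
    return baseline, local


if __name__ == "__main__":
    # Invented PM2.5 values (µg/m³) for five pods over three time steps.
    pods = {
        "A": [8.0, 9.0, 10.0],
        "B": [8.5, 15.0, 10.5],
        "C": [9.0, 9.5, 22.0],
        "D": [8.0, 9.5, 11.0],
        "E": [8.5, 9.0, 10.0],
    }
    baseline, local = split_baseline(pods)
    print("baseline:", baseline)
    for site, excess in local.items():
        print(site, excess)
```

A network minimum is sensitive to a single pod reading low, which is one reason the confidence filtering needs to come first. A low percentile across sites, or a reference station, would be a more robust choice of baseline.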

So, without the resources to manage quality assurance, there is a real danger that a project can sink under the weight of its own data. Large amounts of data can certainly help to build a better air quality picture – and even help with QA itself – but a clear data management plan, and the resources to carry it out, are essential.