In my last post, we discussed environmental sound and how it is currently measured and regulated using absolute sound levels – LAeq – alone. This is a sorry state of affairs, as the noise level by itself is a poor metric with which to describe an acoustic scene. A researcher from Sound Appraisal I met at the latest DCASE Workshop put it succinctly – this is “like rating how much you like some food based only on the temperature”.
There have been several metrics put forward by the community intended to augment the LAeq and provide more insight into which sounds constitute a scene, giving us some idea of the ‘flavour’ of the sound (to extend the food metaphor just a tad). My favourite of these is the Normalised Difference Soundscape Index, or NDSI [1]. Numerous studies have shown that mechanical sounds tend to be less preferred than natural sounds [2], so the NDSI is designed to give a figure between -1 and 1 based on the balance of human/mechanical sound against natural sound. The more negative the number, the more prominent the human – or anthrophonic – sound; the more positive, the more prominent the natural – or biophonic – sound [3]. There’s also geophonic sound – sound made by weather or geological processes (e.g. rainfall or ocean waves) – but that’s not considered by the NDSI, as it was originally developed for use in biodiversity monitoring.
The NDSI could be a very useful tool, giving surveyors and legislators more of an idea of the nature of an acoustic environment – and the human reaction to it – than the LAeq alone, but it presents the problem of how to calculate the relative contributions of anthrophonic and biophonic sound to a scene. This is typically done by comparing the intensity of the low-frequency content of a recording (1 kHz – 2 kHz) – assumed to be mainly mechanical sound – against the higher-frequency content (2 kHz – 11 kHz) – assumed to be mainly biophonic sound [1].
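To make this concrete, here’s a minimal sketch of that band-power approach in Python. The function names and the FFT-based power estimate are my own illustration, not a reference implementation:

```python
import numpy as np

def band_power(signal, fs, f_lo, f_hi):
    """Total spectral power of `signal` between f_lo and f_hi (Hz)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return spectrum[(freqs >= f_lo) & (freqs < f_hi)].sum()

def ndsi(signal, fs):
    """(biophony - anthrophony) / (biophony + anthrophony), in [-1, 1]."""
    anthrophony = band_power(signal, fs, 1000, 2000)  # assumed mechanical
    biophony = band_power(signal, fs, 2000, 11000)    # assumed natural
    return (biophony - anthrophony) / (biophony + anthrophony)
```

A pure 1.5 kHz tone scores close to -1 and a pure 5 kHz tone close to +1 – which also hints at the weakness of the method: the calculation only sees where the energy sits in the spectrum, not what actually made the sound.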
The problem with this approach is that a lot of mechanical sound is broadband in nature, meaning it contains frequency components across the spectrum rather than fitting neatly into these bands as would be convenient for acoustic analysts. Likewise, sounds from many animals contain components well within the 1 kHz – 2 kHz band. Nevertheless, the technique has been reported to give reasonable results in wilderness environments, where the contribution of mechanical sound tends to be fairly low [1]. In urban areas, however, the method breaks down completely. A hedge trimmer, for example, yields an NDSI of 0.3, indicating prevalent natural sound, when this is clearly not the case. This is one of the reasons I decided to focus on machine listening in my research. If the constituent sounds of an acoustic scene could be identified electronically, this could provide a much more reliable way of calculating metrics such as the NDSI in urban environments, whilst also opening up many other avenues for deeper understanding and analysis.
Machine learning is a hot topic at the moment across many academic disciplines, but machine listening often receives less attention than its visual counterpart. Nevertheless, it is now part of our daily lives, from the speech recognition built into most modern smartphones to the intelligent music selection technologies used by online streaming services. Using machine listening to analyse environmental and general everyday sound is a comparatively new area of study that is, broadly speaking, split into two tasks – Acoustic Scene Classification (ASC) and Acoustic Event Detection (AED) [4]. The aim of ASC is for a system to identify a location label for a scene recording as a whole, whereas AED drills a little deeper, attempting to detect individual sounds within a scene and assign labels to those.
In a nutshell, the way machine listening (and indeed supervised machine learning in general) works is that some kind of mathematical model (there are many) is ‘trained’ to identify sounds using ‘features’ extracted from pre-labelled example recordings. Features from new, unlabelled sounds are then presented to the model to test how well it has learned. Most of the work done so far in this area uses mono or stereo recordings (for reasons I’ll go into in a future post), but for my research I am investigating multi-microphone spatial audio recordings. Spatial audio is a particular speciality of the Audio Lab I’m a part of in York, and in theory it should allow us to isolate sounds from a scene with far more precision than standard stereo recordings allow.
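As an illustration of that train-then-test loop, here’s a toy sketch using a nearest-centroid classifier – a deliberately simple stand-in for the real models and features, which I’ll cover another time:

```python
import numpy as np

def train(features, labels):
    """'Train' by computing one mean feature vector (centroid) per class.
    `features` is an (n_examples, n_features) array of pre-extracted
    features; `labels` holds the matching scene label for each row."""
    return {lab: features[labels == lab].mean(axis=0)
            for lab in np.unique(labels)}

def classify(model, feature):
    """Label a new, unseen feature vector by its nearest class centroid."""
    return min(model, key=lambda lab: np.linalg.norm(feature - model[lab]))
```

Given feature vectors labelled, say, ‘beach’ and ‘street’, the model simply remembers the average of each class and assigns new examples to whichever average lies closest. Real systems swap in far richer models, but the train/test shape of the workflow is the same.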
As a first stage of my project, I wanted to show that spatial audio features could be used for Acoustic Scene Classification. Early work in ASC (using mono recordings) achieved extremely high classification accuracies of up to 96%, which led the authors to essentially declare ASC a solved problem [5]. Later, it was found that these results were somewhat artificially inflated, as the researchers had tested their system using excerpts from the same longer recordings they’d used to train it [6]. It turns out that the sound of a given location doesn’t tend to change much (statistically speaking) within the span of a few minutes! As a result of this finding, modern datasets intended for ASC research need to include multiple independent examples of each kind of scene [7].
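The practical upshot is that train/test splits have to be made per recording, never per excerpt. A small sketch of the idea (this mask-based helper is my own illustration, not from any particular toolkit):

```python
import numpy as np

def split_by_recording(excerpt_recording_ids, test_recordings):
    """Boolean train/test masks over excerpts, keeping every excerpt of
    a given recording on the same side of the split. This avoids the
    inflated accuracies seen when a system is tested on excerpts of the
    very recordings it was trained on."""
    ids = np.asarray(excerpt_recording_ids)
    test_mask = np.isin(ids, list(test_recordings))
    return ~test_mask, test_mask
```

Splitting by excerpt instead would scatter near-identical material across both sides, and the system gets credit for memorising recordings rather than recognising scenes.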
Upon investigating databases of acoustic scene recordings, I found there were none comprising spatial audio recordings with the required large number of examples of many different types of acoustic scene. Large stereo databases were available, as were smaller spatial databases, but no single set had everything I needed. For this reason, I decided to record my own.
Making use of the mh Acoustics Eigenmike, a 32-channel spherical microphone array capable of recording highly detailed spatial information, I travelled all around the North of England seeking out soundscapes. Journeying from Scarborough across to Liverpool, stopping off at towns and cities along the route of the TransPennine Express train, I recorded Beaches, Busy Streets, Quiet Streets, Train Stations, Parks, Woodlands, Shopping Centres and Pedestrian Zones. The resulting database, EigenScape, contains eight different examples of each of these eight scene classes – 64 recordings in total, each 10 minutes long. The key was to record in locations that sounded sufficiently similar to be grouped together, yet different enough to encapsulate some of the variety found in comparable locations, thus potentially enabling a machine listening system to generalise beyond these initial recordings.
I have now made these recordings publicly available. The file size of the entire database is almost 140 GB, so I’ve also put together a slightly cut-down version to make the set more accessible. Using the data, I have shown that it is possible for a machine listening system to accurately classify acoustic scenes based entirely on spatial audio features. This had never been done before and is a really important first step towards the complete scene analysis system that is the ultimate goal. I’ll write more about this soon, but check out my latest academic publications if you’d like to find out more in advance!
[1] Devos, P. Soundecology indicators applied to urban soundscapes, Internoise, 2016.
[2] Axelsson, Ö. How to measure soundscape quality, Euronoise, 2015.
[3] Gage, S.; Ummadi, P.; Shortridge, A.; Qi, J. & Jella, P. K. Using GIS to Develop a Network of Acoustic Environmental Sensors, ESRI International User Conference, 2004.
[4] Stowell, D.; Giannoulis, D.; Benetos, E.; Lagrange, M. & Plumbley, M. D. Detection and Classification of Acoustic Scenes and Events, IEEE Transactions on Multimedia, 17, 1733–1746, 2015.
[5] Aucouturier, J.-J.; Defreville, B. & Pachet, F. The Bag-of-frames Approach to Audio Pattern Recognition: A Sufficient Model for Urban Soundscapes But Not For Polyphonic Music, The Journal of the Acoustical Society of America, 122, 2007.
[6] Lagrange, M.; Lafay, G.; Défréville, B. & Aucouturier, J.-J. The bag-of-frames approach: A not so sufficient model for urban soundscapes, The Journal of the Acoustical Society of America, 128, 2015.
[7] Mesaros, A.; Heittola, T. & Virtanen, T. TUT Database for Acoustic Scene Classification and Sound Event Detection, 24th European Signal Processing Conference (EUSIPCO), 2016.