Sampling a 3D Sound Field

During the past six months or so I have been focusing on identifying the presence and directions of the individual sounds that make up an acoustic scene. My EigenScape recordings were made in fourth-order Ambisonic format, which contains more detailed spatial information than I used in my previous work.

The Ambisonic format makes it fairly simple to focus on sounds from specific directions in the scene using a process called beamforming, which is like pointing a virtual microphone. One common way of producing intensity ‘maps’ of an acoustic scene is to use beamforming to look in all directions and log the sound power in each direction to build up a picture of the scene. This is called the Steered-Response Power (SRP) method and is used in commercial ‘Acoustic Camera’ products.

srp_2d

Estimating direction of arrival using steered-response power.
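
To make this concrete, here’s a minimal sketch (not the code from my project) of a 2D steered-response power scan using first-order Ambisonic channels. The virtual-microphone weights are only illustrative, as the exact gains depend on the Ambisonic normalisation convention in use:

import numpy as np

def srp_2d(W, X, Y, n_dirs=72):
    # W, X, Y: first-order Ambisonic channels as 1D arrays of equal length
    angles = np.linspace(0, 2 * np.pi, n_dirs, endpoint=False)
    powers = np.empty(n_dirs)
    for i, theta in enumerate(angles):
        # point a virtual cardioid microphone towards theta...
        beam = 0.5 * (W + X * np.cos(theta) + Y * np.sin(theta))
        # ...and log the power of its output in that direction
        powers[i] = np.mean(beam ** 2)
    return angles, powers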

Of course, this being the real world, we can’t steer the beam in all directions, just a representative set of discrete points. Doing this on a 2-dimensional plane isn’t that difficult. To sample a plane around a point, you go round in a circle, and it’s easy to divide a circle into equally-spaced points. In the diagram above, the sampling happens at eight points, but we can use any number of points and still space them equally around the circle. It’s quite easy to confirm this with a simple thought experiment: imagine a straight line of some length, divided into some number of equal segments. If you then wrapped that line around into a circle, those segments would remain equally spaced along the circumference. In other words, you’ll always be able to slice a pizza into equally-sized slices no matter how many guests you have. This means we can uniformly sample around a 2D plane at any arbitrary resolution.

circdots

Uniform sampling around a circle.
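
In numpy terms, uniform circular sampling really is a one-liner:

import numpy as np

n = 8  # any number of points will do
angles = np.linspace(0, 2 * np.pi, n, endpoint=False)
points = np.column_stack((np.cos(angles), np.sin(angles)))  # (x, y) on the unit circle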

Sometimes, sampling in 2D might be enough, but if we want a full picture of the entire acoustic scene in 3D (or indeed any 3D field viewed from a point – this is a common problem) we need to extrapolate from that circular sampling and sample over the surface of a sphere. The question then becomes: how do we select points to uniformly sample the surface of a sphere? This is a much thornier question than it initially seems.

Initially, the most intuitive way forward might be to approach this in the same way as we approached sampling the circle. We could achieve uniform circular sampling by placing points on a line and ‘wrapping’ it around, so for spherical sampling we could place points in a uniform grid on a 2-dimensional plane and then ‘wrap’ this around the sphere:

reg_off

Sampling a sphere using a regular grid.
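
A quick numpy sketch of this naive mapping (grid sizes chosen arbitrarily):

import numpy as np

# regular grid of azimuth / elevation angles...
azimuth = np.linspace(0, 2 * np.pi, 16, endpoint=False)
elevation = np.linspace(-np.pi / 2, np.pi / 2, 9)
az, el = np.meshgrid(azimuth, elevation)

# ...'wrapped' onto the unit sphere
x = np.cos(el) * np.cos(az)
y = np.cos(el) * np.sin(az)
z = np.sin(el)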

Unfortunately, as you can see, when we use this regular grid and map it to the sphere, the points near the poles ‘bunch up’, whereas points at the equator are more spread out. This means we sample the sphere at varying resolution, with much more fine-grained detail at the poles than elsewhere. For instance, if we want a high level of detail at the equator, we end up with a ridiculously high density of points at the poles. Not only is this computationally inefficient, it can also lead to the output power map being somewhat skewed towards the poles. It might appear that there’s more power in the polar regions simply because we are effectively sampling the same area over and over again, with these points given equal weighting to those sampled in sparser areas (in simple models, anyway).

You’ll have seen this problem in reverse when looking at maps of the world. Land masses to the far north are stretched out and appear huge relative to those nearer the equator. Greenland, for instance, appears on a regular map to be about the same size as Africa, whereas in reality it’s only about 1/14th of the size.

So how do we fix this? Mathematically, there are actually only a few distributions of points on a sphere that perfectly fit our criteria. These relate to the Platonic solids: the regular 3D shapes made from regular 2D polygons, with the same number of faces meeting at every vertex. We can uniformly sample the surface of a sphere by placing sampling points at the vertices of these shapes! Unfortunately, only five of these shapes exist and they have relatively few vertices; the most we can get is 20, using the vertices of a dodecahedron, or equivalently the centres of the faces of an icosahedron.

dodec_voronoi.png
Uniform sampling of a sphere on the faces of an icosahedron / vertices of a dodecahedron.

As you can see, this sampling scheme is very sparse and suitable only if we want low-resolution output. Unfortunately, it’s actually mathematically impossible to place any more points than this on the surface of a sphere in a perfectly regular arrangement. So if you’re planning on serving spherical pizza to more than 20 very picky people, you’re out of luck!

So what do we do about it? We just have to settle for not-quite uniformity, and in this case good enough has to be good enough! One common approach to nearly-uniform spherical sampling is the icosphere. This is an extrapolation to a full sphere of the geodesic dome concept popularised by Buckminster Fuller as a way of constructing extremely strong dome structures for ecologically conscious buildings.

The method for constructing an icosphere is relatively simple. Starting with an icosahedron, we interpolate between the 12 initial vertices by splitting each triangular face into four smaller triangles: the halfway point between each pair of vertices becomes a new vertex. Repeating this process numerous times results in ever more densely-spaced points for higher-resolution sampling.

tri_interp

Interpolating the triangular faces of an icosahedron.

Most of the code I found online to accomplish this was solid, but convoluted, involving elaborate methods for checking for pre-computed vertices (since all the triangles are adjacent, the interpolation results in lots of duplicate vertices), and requiring manual definition of which sets of vertices make up a face. I decided to write my own code within my preferred Python 3 / Numpy environment. This provided me with easier interaction with my other project code and allowed me to use some of Numpy’s features to make the code more elegant:

Numpy Icosphere Implementation

In [1]:
# standard modules
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d

# distance matrix module
from scipy.spatial.distance import cdist

from utilities import normalise
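
(normalise here is a small helper from my own utilities module; an equivalent minimal sketch, if you don’t have it, would be:)

def normalise(x, axis=None):
    # scale each vector to unit length along the given axis
    return x / np.linalg.norm(x, axis=axis, keepdims=True)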

First, I define the initial icosahedron using three intersecting orthogonal golden rectangles:

In [3]:
# golden ratio
R = (1 + np.sqrt(5)) / 2

vertices = np.array([[-1,R,0],
                     [1,R,0],
                     [-1,-R,0],
                     [1,-R,0],

                     [0,-1,R],
                     [0,1,R],
                     [0,-1,-R],
                     [0,1,-R],

                     [R,0,-1],
                     [R,0,1],
                     [-R,0,-1],
                     [-R,0,1]])

# plot
xu, yu, zu = vertices[:,0], vertices[:,1], vertices[:,2]
fig, ax = plt.subplots(1, 1, subplot_kw={
    'projection':'3d', 'aspect':'equal'}, figsize=(10,10))
ax.scatter(xu, yu, zu, zorder=10, alpha=.6)
plt.show()
 ico_1

Then, I infer which sets of indices make up the faces of the icosahedron. This is not too difficult to do manually for the initial icosahedron, but the beauty of this code is that it can be used to infer faces at every level of interpolation. Scipy’s cdist function is used to get a matrix of distances between every pair of points.

In [4]:
# find euclidian distances between all points -
# gives us a matrix of distances
euclid_dists = cdist(vertices, vertices)

# find list of adjacent vertices
sides_idx = np.where(
    euclid_dists == np.min(euclid_dists[euclid_dists > 0]))

# concatenate output locations into one array
sides_idx = np.concatenate(
    (sides_idx[0].reshape(-1,1), sides_idx[1].reshape(-1,1)), axis=1)

# remove duplicate sides_idx (there are many)
_, idx = np.unique(np.sort(sides_idx), axis=0, return_index=True)
sides_idx = sides_idx[idx]

sides_idx now contains a list of pairs of vertex indices that make up the sides (edges) of the icosahedron.

In [5]:
sides_idx
Out[5]:
array([[ 0,  1],
       [ 0,  5],
       [ 0,  7],
       [ 0, 10],
       [ 0, 11],
       [ 1,  5],
       [ 1,  7],
       [ 1,  8],
       ...]], dtype=int64)

Then, I work out which of these sides connect to make faces:

In [6]:
# set up empty array
faces_idx = np.array([], dtype=int)

for i in np.unique(sides_idx[:,0]):
    # extract sides_idx related to each vertex
    a = sides_idx[np.where(sides_idx[:,0] == i),1]

    for j in a:
        for l in j:
            # find 3rd adjacent vertices common to both points
            b = sides_idx[np.where(sides_idx[:,0] == l), 1]
            intersect = np.intersect1d(a,b).reshape(-1,1)

            for m in intersect:
                # add faces_idx to array
                faces_idx = np.append(faces_idx, np.array([i,l,m]))

# output is a 1D list, so we need to reshape it
faces_idx = faces_idx.reshape(-1,3)

This outputs a list of the triads of vertex indices that make up the faces:

In [7]:
faces_idx
Out[7]:
array([[ 0,  1,  5],
       [ 0,  1,  7],
       [ 0,  5, 11],
       [ 0,  7, 10],
       [ 0, 10, 11],
       [ 1,  5,  9],
       [ 1,  7,  8],
       [ 1,  8,  9],
       ...]], dtype=int64)

Plugging these indices into our actual vertex co-ordinates gives us a face co-ordinates list (a three-dimensional numpy array):

In [8]:
# 3D matrix with xyz co-ordinates for vertices of all faces
v = vertices[faces_idx]
v
Out[8]:
array([[[-1.        ,  1.61803399,  0.        ],
        [ 1.        ,  1.61803399,  0.        ],
        [ 0.        ,  1.        ,  1.61803399]],

       [[-1.        ,  1.61803399,  0.        ],
        [ 1.        ,  1.61803399,  0.        ],
        [ 0.        ,  1.        , -1.61803399]],

       [[-1.        ,  1.61803399,  0.        ],
        [ 0.        ,  1.        ,  1.61803399],
        [-1.61803399,  0.        ,  1.        ]],

       ...]]])

We can then find the midpoints of each side all in one go to generate the smaller triangles by using numpy’s roll function. This moves each element of an array along one space, reintroducing elements that fall off the end back at the beginning. By adding the rolled version of the v array to the original version and dividing by 2, we calculate the midpoint of every pair of adjacent vertices simultaneously:

In [9]:
midpoints = ((v + np.roll(v,1,axis=1)) / 2).reshape(-1,3)
In [10]:
midpoints
Out[10]:
array([[-0.5       ,  1.30901699,  0.80901699],
       [ 0.        ,  1.61803399,  0.        ],
       [ 0.5       ,  1.30901699,  0.80901699],
       [-0.5       ,  1.30901699, -0.80901699],
       [ 0.        ,  1.61803399,  0.        ],
       [ 0.5       ,  1.30901699, -0.80901699],
       [-1.30901699,  0.80901699,  0.5       ],
       [-0.5       ,  1.30901699,  0.80901699],
       ...]])

It is then simply a case of adding the new midpoints to the list of vertices and removing duplicates, and we have the points of our interpolated icosahedron:

In [13]:
# add new vertices to list
vertices = np.append(vertices, midpoints, axis=0)
# find duplicate vertices
_, idx = np.unique(vertices, axis=0, return_index=True)
# remove duplicates and re-sort vertices
vertices = vertices[np.sort(idx)]

# normalise points to surface of unit sphere
vertices = normalise(vertices, axis=1)

# plot
xu, yu, zu = vertices[:,0], vertices[:,1], vertices[:,2]
fig, ax = plt.subplots(1, 1, subplot_kw={
    'projection':'3d', 'aspect':'equal'}, figsize=(10,10))
ax.scatter(xu, yu, zu, zorder=10, alpha=.6)
plt.show()
 ico_2

And there you have it. To calculate further levels of interpolation and get more vertices, you simply use the output of the first interpolation as the input in place of the initial icosahedron, so the whole process can be looped. It’s important not to normalise the points to the unit sphere if you intend to proceed to higher orders of interpolation, as this causes the face-inference stage (cdist etc.) to stop working properly. Always normalise only at the end of the process.
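
As a rough sketch of that looped structure (interpolate_once here is a hypothetical stand-in for the face-inference and midpoint steps above, wrapped into a single function):

def icosphere(vertices, n_interpolations=2):
    for _ in range(n_interpolations):
        # infer sides and faces, add edge midpoints, remove duplicates
        vertices = interpolate_once(vertices)  # hypothetical wrapper of the cells above
    # normalise to the unit sphere only once, right at the end
    return normalise(vertices, axis=1)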

My function that automates this whole process can be found at https://github.com/marc1701/area-beamforming/blob/SRP_dev/spherical_sampling.py. With this function you can specify your desired level of interpolation, whether to receive the output in Cartesian or spherical co-ordinates, and whether you’d like points specifying the faces of the interpolated icosahedron rather than the vertices (similarly to how a dodecahedron is the ‘dual’ of an icosahedron, with faces and vertices interchanged).

In [14]:
from spherical_sampling import geodesic

vertices = geodesic(3, co_ords='cart')

# plot
xu, yu, zu = vertices[:,0], vertices[:,1], vertices[:,2]
fig, ax = plt.subplots(1, 1, subplot_kw={
    'projection':'3d', 'aspect':'equal'}, figsize=(10,10))
ax.scatter(xu, yu, zu, zorder=10, alpha=.6)
plt.show()
ico_3

I think this code is an elegant solution for easily calculating nearly-uniform sampling points on a sphere, and should definitely be uniform enough for most purposes. In fact, the Eigenmike itself uses a variation on this for the positioning of its microphones – the capsules are placed on the vertices of both the dodecahedron and icosahedron for 32 nearly-uniformly spaced sampling points.

Fibonacci Sampling

I was quite happy with this code for several minutes before a colleague of mine threw a spanner in the works. He introduced me to sampling a sphere using a spiral based on the Fibonacci sequence, which is used by certain plants to evenly distribute seeds or leaves. The Fibonacci sampling method allows for the nearly-uniform distribution of any arbitrary number of points. This overcomes the main problem with the icosphere method, which is that the number of sample points, and thus the sampling resolution, is limited to the counts produced by each interpolation of the icosahedron: 12, 42, 162, 642, 2562, etc. (10 × 4^k + 2 after k interpolations). These values increase exponentially with each interpolation, limiting your options if you had a specific number of sampling points or resolution in mind.

In two dimensions (as used by plants), the spiral is formulated as follows:

fib_spir_2d

Where λ and r represent the angle and radius in polar co-ordinates, Φ is the golden ratio (about 1.618) and n is the index of each point, up to a maximum of N points. For 500 points, this results in a pattern that looks like this:

fib_circ

This should look pretty familiar to anyone who’s ever taken the time to look at the seeds in a sunflower. From here, it’s a pretty simple step to take this pattern into 3D on the surface of a sphere. We need to replace the radius calculation with one that will calculate the elevation of each point along a unit sphere, essentially spreading this pattern downwards over the sphere:

fib_spir_3d

fib_off

Sampling a sphere using the Fibonacci spiral method.

It’s a little more difficult to see the distinctive spiral on the sphere here, but you can see that the distribution is far more uniform than the regular grid (check out this cool animated version by KrazyDad for a clearer look). It can be calculated with much simpler code and massively less processor overhead than the geodesic icosphere method. I think the Fibonacci method is easily the most intuitive way of getting a simple nearly-uniform distribution of points on a sphere and should be suitable for most purposes (I’m not going to get into the nightmarish t-designs!).
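
For reference, here’s a minimal sketch of one common formulation of the Fibonacci sphere (there are a few slight variants, and my project code may differ in detail; this one spaces points evenly in height and advances the azimuth by the golden angle each step):

import numpy as np

def fibonacci_sphere(n_points):
    phi = (1 + np.sqrt(5)) / 2           # golden ratio
    n = np.arange(n_points)
    azimuth = 2 * np.pi * n / phi         # golden-angle rotation per point
    z = 1 - (2 * n + 1) / n_points        # evenly spaced heights down the sphere
    elevation = np.arcsin(z)

    x = np.cos(elevation) * np.cos(azimuth)
    y = np.cos(elevation) * np.sin(azimuth)
    return np.column_stack((x, y, z))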

For these reasons, I’ve been using the Fibonacci pattern in my work so far and it’s proven to be effective for making steered-response power maps. All of my code for spherical sampling can be found on my project github repo.

 

Introducing EigenScape

In my last post, we discussed environmental sound, and how it is currently measured and regulated using absolute sound levels – LAeq – only. This is a sorry state of affairs, as the noise level alone is a poor metric with which to describe an acoustic scene. A researcher from Sound Appraisal whom I met at the latest DCASE Workshop put it succinctly – this is “like rating how much you like some food based only on the temperature”.

Alternative Metrics

There have been several metrics put forward by the community intended to augment the LAeq and provide more insight into what sounds constitute the scene, giving us some idea of the ‘flavour’ of the sound (to extend the food metaphor just a tad). My favourite of these is the Normalised Differential Soundscape Index, or NDSI1. Numerous studies have shown that mechanical sounds tend to be less preferred than natural sounds2, so the NDSI is designed to give a figure between -1 and 1 based on a ratio of human/mechanical sound to natural sound. The more negative the number, the more prominent the human – or anthrophonic – sound, and the more positive, the more prominent the natural – or biophonic – sound3. There’s also geophonic sound, which is sound made by weather or geological processes (e.g. rainfall or ocean waves), but that’s not considered by the NDSI as it was originally developed for use in biodiversity monitoring.

The NDSI could be a very useful tool for giving surveyors and legislators more of an idea of the nature of an acoustic environment and human reaction to it than the LAeq alone, but it presents the problem of how to calculate the relative contributions of anthrophonic and biophonic sound to a scene. This is typically done by analysing the intensity of the low-frequency sounds of a recording (1 kHz – 2 kHz) – assumed to be mainly mechanical sound – against the higher-frequency sounds in a recording (2 kHz – 11 kHz) – assumed to be mainly biophonic sound1.
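
As a rough sketch of that typical calculation (band edges as described above; the function is illustrative rather than taken from any particular soundecology package):

import numpy as np
from scipy.signal import welch

def ndsi(audio, fs, anthro_band=(1000, 2000), bio_band=(2000, 11000)):
    # estimate the power spectral density of the recording
    freqs, psd = welch(audio, fs, nperseg=4096)

    # integrate power within the 'anthrophonic' and 'biophonic' bands
    a_mask = (freqs >= anthro_band[0]) & (freqs < anthro_band[1])
    b_mask = (freqs >= bio_band[0]) & (freqs < bio_band[1])
    anthro = np.trapz(psd[a_mask], freqs[a_mask])
    bio = np.trapz(psd[b_mask], freqs[b_mask])

    # -1 = all mechanical sound, +1 = all natural sound
    return (bio - anthro) / (bio + anthro)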

The problem with this approach is that a lot of mechanical sound is broadband in nature, meaning that it contains frequency components across the spectrum, rather than neatly fitting into these bands as would be more convenient for acoustic analysts. Likewise, sounds from many animals can contain components well within the 1 kHz – 2kHz band. Nevertheless, the technique has been reported to give reasonable results when used in wilderness environments where the contribution of mechanical sound tends to be fairly low1. In urban areas, however, this method completely breaks down. A hedge trimmer, for example, yields an NDSI of 0.3, indicating prevalent natural sound, when this is clearly not the case. This is one of the reasons why I decided to focus on machine listening in my research. If the constituent sounds of an acoustic scene could be identified electronically, this could provide a much more reliable way of calculating metrics such as the NDSI in urban environments whilst also opening up many other avenues for deeper understanding and analysis.

Machine Listening

Machine learning is a hot topic at the moment in many academic disciplines, but machine listening is often less considered than its visual counterpart. Nevertheless, it is now a part of our daily lives, from the speech recognition built into most modern smartphones to the intelligent music selection technologies used by online streaming services. Using machine listening technologies to analyse environmental / general everyday sound is a comparatively new area of study that is, broadly speaking, split into two tasks – Acoustic Scene Classification (ASC) and Acoustic Event Detection (AED)4. The aim of ASC is for a system to intelligently identify a location label for a scene recording, whereas AED drills a little deeper by attempting to detect sounds within a scene and assign labels to those.

In a very brief nutshell, the way machine listening (and indeed supervised machine learning in general) works is that some kind of mathematical model (there are many) is ‘trained’ to identify sounds using ‘features’ extracted from pre-labelled example recordings. Features from new unlabelled sounds are then presented to the model to test it and see how well it has learned. Most of the work done so far in this area uses mono or stereo recordings (for reasons that I’ll go into in a future post) but for my research I am investigating using multi-microphone spatial audio recordings. Spatial audio is a particular speciality of the Audio Lab I’m a part of in York and in theory should allow us to isolate sounds from a scene with far more precision than standard stereo recordings.
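
In code terms the train-then-test loop is nothing exotic. Here’s a toy scikit-learn sketch with made-up feature arrays standing in for real extracted features (my actual features and models are spatial and rather more involved):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-ins: one feature vector per recording, one scene label each
features = np.random.rand(200, 20)
labels = np.random.choice(['park', 'beach', 'busy street'], size=200)

# 'train' a model on labelled examples...
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.25)
model = RandomForestClassifier().fit(X_train, y_train)

# ...then test on examples the model has never seen
print(model.score(X_test, y_test))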

As a first stage in my project, I wanted to prove that spatial audio features could be used to do Acoustic Scene Classification. Early work in ASC (using mono recordings) achieved extremely high classification accuracies of up to 96%, which led the authors to essentially declare ASC a solved problem5. Later, it was found that these results were somewhat artificially inflated as the researchers had tested their system using excerpts from the same longer recordings as they’d used to train their system6. It turns out that the sound of a given location doesn’t tend to change much (statistically speaking) within the timespan of a few minutes! As a result of this finding, modern datasets intended for ASC research now need to include multiple examples of the same kinds of scenes7.

Recording EigenScape

Upon investigating databases of acoustic scene recordings, I found that there were no sets available of spatial audio recordings with the required large number of examples of many different types of acoustic scene. Large stereo databases were available, as were smaller spatial databases, but no single set that had everything I needed. For this reason I decided to record my own database.

Figure-1-Microphone-Eigenmike.png

mh Acoustics Eigenmike

Making use of the mh Acoustics Eigenmike, which is a 32-channel spherical microphone array capable of recording highly detailed spatial information, I travelled all around the North of England seeking out soundscapes. Travelling from Scarborough across to Liverpool and stopping off at towns and cities along the route of the TransPennine Express train, I recorded Beaches, Busy Streets, Quiet Streets, Train Stations, Parks, Woodlands, Shopping Centres and Pedestrian Zones. EigenScape contains eight different examples of each of these eight scene classes, for a total of 64 recordings, all 10 minutes long. The key was to record in locations that sounded sufficiently similar to enable them to be grouped, but different enough to encapsulate some of the variety found in comparable locations, thus potentially enabling the machine listening system to generalise beyond these initial recordings.

I have now made these recordings publicly available. The filesize for the entire database is almost 140 GB, so I’ve also put together a slightly cut-down version to make the set more accessible. Using the data, I have managed to prove that it is possible for a machine listening system to accurately classify acoustic scenes based entirely on spatial audio features. This had never been done before and is a really important first step towards the complete scene analysis system that is the ultimate goal. I’ll write more about this soon, but check out my latest academic publications if you’d like to find out more in advance!

1 Devos, P. Soundecology indicators applied to urban soundscapes, Internoise 2016
2 Axelsson, Ö. How to measure soundscape quality, Euronoise 2015.
3 Gage, S.; Ummadi, P.; Shortridge, A.; Qi, J. & Jella, P. K. Using GIS to Develop a Network of Acoustic Environmental Sensors, ESRI international user conference, 2004
4 Stowell, D.; Giannoulis, D.; Benetos, E.; Lagrange, M. & Plumbley, M. D. Detection and Classification of Acoustic Scenes and Events, IEEE Transactions on Multimedia, 2015, 17, 1733-1746
5 Aucouturier, J.-J.; Defreville, B. & Pachet, F. The Bag-of-frames Approach to Audio Pattern Recognition: A Sufficient Model for Urban Soundscapes But Not For Polyphonic Music, The Journal of the Acoustical Society of America, 2007, 122
6 Lagrange, M.; Lafay, G.; Défréville, B. & Aucouturier, J.-J. The bag-of-frames approach: A not so sufficient model for urban soundscapes, Journal of the Acoustical Society of America, 2015, 128
7 Mesaros, A.; Heittola, T. & Virtanen, T. TUT Database for Acoustic Scene Classification and Sound Event Detection, 24th European Signal Processing Conference (EUSIPCO), 2016

Soundscapes and Environmental Noise

Have you ever thought about how you perceive the sounds in the environments you live and work in? The human mind is capable of distilling the myriad vibrations that reach the ear every second of our lives to create perceptions of sound worlds that not only give us an important sense of our surroundings but can enrich or enrage in equal measure.

Contrary to popular belief, it is not simply sheer loudness that makes a sound annoying. Context is everything. Consider a rock concert: the sound could be reaching levels in excess of 110 dB1, but members of the audience would not generally consider this a bad thing (although it is most certainly damaging to one’s hearing). A large aircraft a considerable distance away can produce sound of comparable loudness, but very few people living close to an airport would consider this pleasant.

The fact that context informs our experience of a sound almost feels like common sense. We are more likely to notice (and be annoyed by) sounds that feel out of place given the space and situation we are inhabiting – perhaps a train whistle in the middle of a woodland or a TV presenter yelling ‘bogeys’ in a library. There is, in fact, a great deal of research to back this up, with statistical analysis showing that ‘appropriateness’ forms a key component of how people react to sound scenes2.

This reveals that sound scenes from similar locations have commonality in their content, and that there is a degree of expectation from an experienced listener of how an environment should sound that affects perception. The perceptual construct of a sound scene built up in the mind of a listener is known as a soundscape. The complex interplay between our expectations of a soundscape and its reality governs our experience, yet the vast majority of the legislation surrounding environmental sound considers the absolute sound levels alone, which only tell a very thin slice of the story.

A ubiquitous tool for undertaking environmental acoustic surveys is the Sound Pressure Level (SPL) meter. The standard measure referenced in most legislation is the LAeq, which indicates the equivalent continuous A-weighted SPL ‘dose’ received over a given period of time3. SPL meters have the advantage of being cheap, portable and easy to use, but, in returning this single-figure level, discard the specific content and any perceptual nuances of the soundscape. To an SPL meter, then, an aircraft takeoff and a rock concert look the same and so might crashing waves and passing traffic. It’s not a leap of logic to conclude that the use of LAeq could be encouraging legislators to think of sound as ‘all the same’.
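
For the curious, the LAeq calculation itself is tiny. Here’s a sketch assuming you already have calibrated, A-weighted pressure samples in pascals:

import numpy as np

def laeq(p_a, p_ref=20e-6):
    # equivalent continuous level: mean-square pressure relative to 20 micropascals
    return 10 * np.log10(np.mean(p_a ** 2) / p_ref ** 2)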

The law in this area tends to talk about “Environmental Noise”, “Noise Mapping” and “Noise Action Plans” which focus on the suppression of sound above a certain threshold level4. To an audio practitioner like myself, noise is a very specific type of signal – the auditory equivalent of the ‘snow’ seen on an analogue television tuned between stations – and so to see all sound indiscriminately lumped together under this label is philosophically troubling.

Researcher A. L. Brown summed up the prevailing approach as, appropriately enough, the “Environmental Noise Approach”, in which sound is treated solely as a waste, to be managed by reducing its levels. Brown contrasts this with the “Soundscape Approach”, in which sound is reframed as a resource to be managed by working towards preferred sounds not being masked or interrupted by disagreeable sounds5. Clearly such an approach requires a much more nuanced method of analysis than is possible with only an SPL meter.

This is where my research comes in. Over the past year since beginning my PhD research I have been studying environmental sound and how modern machine listening techniques can be brought to bear on its analysis. If we can understand acoustic environments more completely, we can help determine how we might work with the sounds in the environment to make the areas where we live and work better places for everyone.


1 http://airportnoiselaw.org/dblevels.html
2 F. Aletta, J. Kang, and O. Axelsson, “Soundscape descriptors and a conceptual framework for developing predictive soundscape models,” Landscape and Urban Planning, vol. 149, pp. 65–74, 2016.
3 http://www.who.int/quantifying_ehimpacts/publications/en/ebd9.pdf
4 https://www.gov.uk/government/publications/2010-to-2015-government-policy-environmental-quality/2010-to-2015-government-policy-environmental-quality#appendix-8-managing-noise-and-other-nuisances-in-the-local-environment
5 A. L. Brown, “Soundscapes and environmental noise management,” Noise Control Engineering Journal, vol. 58, no. 5, pp. 493 – 500, 2010