Project Topic
Spatial Sound Synthesis: To analyze how directional qualities are represented in sound signals and to reproduce them for arbitrary input audio
Tools we Use:
- Frequency-domain analysis & computation using the Fourier transform (in-class)
- Filtering sound samples: denoising, isolating important components, and modeling the pinna response (in-class)
- Convolution and LTI systems with impulse responses (in-class)
- Correlation, for the Interaural Time Difference (out-of-class)
- Head-Related Transfer Function (out-of-class), for binaural sound reproduction
- Raytracing (out-of-class), which we are considering for a computational model of the HRTF
- If time permits: beamforming (out-of-class), which we can use to enhance the spatial resolution of directional sound by processing signals from multiple microphones and applying different weights to focus on specific directions
​
General Idea: We will collect data using a stereo two-channel microphone. We will generate impulse responses by recording balloon pops at various locations around our control room, and from there we will apply DSP analysis tools such as frequency-domain analysis using a Fast Fourier Transform (FFT) and filtering/filter design, and then apply methods from our out-of-class research on the Head-Related Transfer Function (HRTF) in order to reproduce sound that is accurate to what a human ear would hear. Effectively, our goal is to reproduce both the directional quality of sound and the filtering effects caused by the human ear, which together create a realistic spatial sound effect.
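As a first concrete step, here is a minimal sketch (Python with NumPy/SciPy) of the frequency-domain analysis we have in mind, assuming a balloon pop has been trimmed and saved as a stereo WAV file; the file name is a placeholder for one of our recordings.

```python
# Sketch: inspect the magnitude spectrum of a recorded balloon-pop impulse response.
# Assumes a trimmed stereo WAV file; "balloon_north.wav" is a placeholder name.
import numpy as np
from scipy.io import wavfile

fs, ir = wavfile.read("balloon_north.wav")        # ir has shape (n_samples, 2) for stereo
ir = ir.astype(np.float64)
ir /= np.max(np.abs(ir))                          # normalize to avoid scale issues

# Real FFT of each channel and conversion to a dB magnitude spectrum
spectrum = np.fft.rfft(ir, axis=0)
freqs = np.fft.rfftfreq(ir.shape[0], d=1.0 / fs)
mag_db = 20 * np.log10(np.abs(spectrum) + 1e-12)  # small offset avoids log(0)

# mag_db[:, 0] and mag_db[:, 1] are the left/right spectra we would plot against freqs
```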
Initial Task Items
1. First, we will gather impulse responses as our reference data by popping balloons at various locations around the room. We will start with a basic 8-direction setup: North, East, South, West, NE, SE, SW, and NW relative to the center. We will also record responses from directly above the microphone as a “control” impulse that should have effectively zero directional bias. We will gather multiple samples for each direction so we can analyze the accuracy of our measurements, and possibly also gather samples at different distances in each direction, which would let us validate the directional aspect of our data.
For our “true audio”, we will use isolated sounds: recordings of footsteps, whistling, and sound played from stereo speakers (not surround) placed at a particular location. We can use these along with the balloon-generated “impulse responses” to learn how sound location presents itself in the data by treating the room-plus-direction as an LTI system (a minimal sketch of this is shown below). This will let us focus specifically on the directional and locational quality of the sound itself.
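A minimal sketch of the LTI idea, assuming we have a trimmed stereo balloon-pop impulse response and a dry mono “true audio” recording at the same sample rate (file names are placeholders):

```python
# Sketch: treat the room+direction as an LTI system and apply the balloon-pop
# impulse response to a dry "true audio" recording via FFT-based convolution.
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

fs_ir, ir = wavfile.read("balloon_north.wav")      # stereo impulse response, shape (n, 2)
fs_dry, dry = wavfile.read("footsteps_dry.wav")    # mono "true audio" sample
assert fs_ir == fs_dry, "resample first if the rates differ"

ir = ir.astype(np.float64)
dry = dry.astype(np.float64)

# Convolve the mono source with each channel's impulse response separately,
# which yields a two-channel signal carrying the directional cues of the pop.
left = fftconvolve(dry, ir[:, 0])
right = fftconvolve(dry, ir[:, 1])
out = np.stack([left, right], axis=1)
out /= np.max(np.abs(out))                         # normalize before writing

wavfile.write("footsteps_north_sim.wav", fs_dry, (out * 32767).astype(np.int16))
```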
​
We recorded our data in a room shaped like a square prism (square floor plan) with concrete walls covered in foam board. Here is a link to some of the balloon recordings we have made, along with a link to the “true audio” recordings we took. We recorded a popular song, “Never Gonna Give You Up” by Rick Astley, at the same 8 locations around the stereo microphone in the same room.
Listening to our recordings on headphones (Sennheiser HD800S), we could reasonably predict which direction a sound was coming from almost 100% of the time when comparing opposite sides (especially East, West, NE, NW, SW, and SE), although the North and South directions were hard to predict unless the recordings were played simultaneously.
2. A few key takeaways from this book: Xie, Bosun. Head-Related Transfer Function and Virtual Auditory Display. J. Ross Publishing, 2013 [1].
Sound localization is subjective and can vary across listeners. However, one cue commonly used in auditory localization is the Interaural Time Difference (ITD, Section 1.4.1), which the book explains “refers to the arrival time difference between the sound waves at the left and right ear” (p. 9). When sound arrives from an off-center direction, the path lengths to the two ears differ, which produces a nonzero ITD. The book provides several equations from the literature; for example, a simple plane-wave (straight-path) model, with the two ears treated as points separated by a distance 2a, gives

ITD ≈ (2a / c) · sin(θ),

where θ is the azimuthal angle of the source, a is the head radius (half the distance between the ears), and c is the speed of sound.
We could apply this in our analysis by measuring the time-domain delay between the two channels in our data and computing from it the angle the sound appears to come from; a rough sketch follows.
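This sketch assumes one of our stereo recordings and the plane-wave approximation above; the head-radius value is an assumed placeholder, not a measurement.

```python
# Sketch: estimate the interaural (inter-channel) time difference by
# cross-correlating the left and right channels, then invert the plane-wave
# approximation ITD ~= (2a/c) * sin(theta) for a rough azimuth estimate.
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate, correlation_lags

fs, x = wavfile.read("song_east.wav")           # placeholder stereo recording
left = x[:, 0].astype(np.float64)
right = x[:, 1].astype(np.float64)

corr = correlate(left, right, mode="full")
lags = correlation_lags(len(left), len(right), mode="full")
itd_seconds = lags[np.argmax(corr)] / fs        # positive => left lags, source toward the right

a = 0.0875                                      # assumed head radius in meters (placeholder)
c = 343.0                                       # speed of sound in m/s
sin_theta = np.clip(itd_seconds * c / (2 * a), -1.0, 1.0)
theta_deg = np.degrees(np.arcsin(sin_theta))    # positive angle = toward the right ear
print(f"ITD = {itd_seconds * 1e3:.3f} ms, estimated azimuth = {theta_deg:.1f} degrees")
```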
Similar to the ITD is the Interaural Level Difference (ILD), which is the difference in level (sound pressure, i.e. volume) at each ear. In the frequency domain it can be expressed as

ILD(f) = 20 · log10( |P_L(f)| / |P_R(f)| )  dB,

where P_L and P_R are the sound pressures at the left and right ears.
This may be another tool we use to estimate angles. We can get the frequency-domain sound pressure by taking the Discrete Fourier Transform of each channel and converting the magnitudes to a log (decibel) scale, as in the sketch below.
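A short sketch of that computation on one of our stereo recordings (file name is a placeholder):

```python
# Sketch: estimate the interaural level difference from a stereo recording by
# comparing the per-channel magnitude spectra in dB.
import numpy as np
from scipy.io import wavfile

fs, x = wavfile.read("song_east.wav")           # placeholder stereo recording
left = x[:, 0].astype(np.float64)
right = x[:, 1].astype(np.float64)

L = np.abs(np.fft.rfft(left))
R = np.abs(np.fft.rfft(right))
freqs = np.fft.rfftfreq(len(left), d=1.0 / fs)

ild_db = 20 * np.log10((L + 1e-12) / (R + 1e-12))   # per-frequency ILD in dB

# A single broadband figure can be taken as the energy ratio of the channels.
broadband_ild = 10 * np.log10(np.sum(left**2) / np.sum(right**2))
print(f"Broadband ILD = {broadband_ild:.2f} dB")
```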
The Head-Related Transfer Function (HRTF) describes how a sound is transformed on its way from a source in space to the listener's ear. It can be viewed as a combination of smaller LTI systems that encode cues such as the ITD and ILD described above, and it is used to reproduce binaural sound by adjusting the spatial and temporal properties of a signal. Strictly, the HRTF varies from person to person because of differences in head and ear geometry, but we can use a generalized HRTF that applies reasonably well to most listeners.

The human head attenuates sound arriving from the far side, an effect known as head shadow; we can approximate it by designing a filter with similar attenuation and incorporating that filter into our HRTF. There is also the pinna gain, essentially a direction-sensitive highpass filter: the outer ear filters high frequencies more heavily in some directions than others. Although we cannot make these measurements ourselves, there are several online databases of measured HRTFs; Table 2.1 of the book lists numerous databases and gives comments on each one. We can do a frequency-domain analysis of such measurements to obtain an LTI system for the HRTF, which can then be incorporated into our algorithm.

According to Chapter 4, the HRTF can also be computed mathematically without any measurements, which will be useful if the database measurements prove difficult to work with. The chapter discusses models such as the spherical model and the snowman model, as well as implementations that represent an HRTF as the filter of an LTI system.
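As a sketch of how a database HRTF could be applied once we have it, the following assumes a left/right head-related impulse response (HRIR) pair has already been extracted from a database into NumPy arrays; the file names and the 90° direction are placeholders, and the actual loading step depends on the database format (many ship as SOFA files).

```python
# Sketch: render a mono source at a chosen direction by convolving it with a
# left/right HRIR pair taken from a public database (placeholder .npy files).
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

hrir_left = np.load("hrir_az090_left.npy")    # placeholder HRIR, 90 deg azimuth
hrir_right = np.load("hrir_az090_right.npy")

fs, mono = wavfile.read("footsteps_dry.wav")  # dry mono source, placeholder name
mono = mono.astype(np.float64)

binaural = np.stack(
    [fftconvolve(mono, hrir_left), fftconvolve(mono, hrir_right)], axis=1
)
binaural /= np.max(np.abs(binaural))          # normalize before writing to disk
wavfile.write("footsteps_az090_binaural.wav", fs, (binaural * 32767).astype(np.int16))
```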
Finally, Chapter 8 discusses binaural reproduction. The authors say: “Headphone equalization is implemented by filtering binaural signals with the inverse of headphone-to-ear canal transfer functions (HpTFs). In headphone-based binaural reproduction, this equalization is required to eliminate the influence of headphone-to-ear canal transmission.” Since many headphones are equalized to either a free-field or a diffuse-field target, we could consider adding equalization to match these sound fields.
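If we do pursue headphone equalization, one hedged sketch is a regularized frequency-domain inversion of a measured headphone-to-ear-canal impulse response; the measurement file here is purely a placeholder, since we would need to obtain an HpTF from a database or the literature.

```python
# Sketch: headphone equalization by regularized inversion of a
# headphone-to-ear-canal impulse response (HpTF). "hptf_left.npy" is a placeholder.
import numpy as np

def inverse_filter(h, n_fft=4096, beta=1e-3):
    """Regularized frequency-domain inverse of impulse response h."""
    H = np.fft.rfft(h, n_fft)
    # Tikhonov-style regularization keeps the inverse bounded where |H| is small
    H_inv = np.conj(H) / (np.abs(H) ** 2 + beta)
    return np.fft.irfft(H_inv, n_fft)

hptf_left = np.load("hptf_left.npy")          # placeholder measured HpTF (time domain)
eq_left = inverse_filter(hptf_left)
# eq_left would then be convolved with the left binaural channel before playback.
```

In practice a modeling delay and frequency-dependent regularization would also be needed, but this shows the basic regularized-inversion idea.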
3. We would like to focus on replicating the location aspect of sound in a particular space. We will use a typical small-classroom or personal-room-sized space to record all of our sample and testing data. This is because location is a very important aspect of sound reproduction that is often overshadowed by frequency- and time-specific effects. Different spaces such as an open field, a studio, a bedroom, or a movie theater can sound very different because the wall material (or lack thereof) changes how sound waves reflect and propagate, but we have decided to focus on how sound changes based on the way it reaches the human ear rather than on how it changes based on the surrounding space. We can validate our process for replicating location by using a balloon pop to create the impulse response of a system that encodes a direction, convolving it with a control sample of our “true audio”, and then correlating the result with validation recordings of the “true audio” made at that location to see how closely we can match the sound. We can compare this correlation to the correlation between the control and validation data to confirm that our algorithm has improved it (a sketch of this check follows below). The same process can be applied as we add features to make the location quality more accurate.
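A sketch of that validation check, with placeholder file names for the control, impulse-response, and validation recordings:

```python
# Sketch: convolve the control "true audio" with a directional balloon impulse
# response, then check whether it correlates more strongly with the validation
# recording from that direction than the unprocessed control does.
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve, correlate

def to_mono(x):
    """Collapse a (possibly stereo) recording to a float mono signal."""
    x = x.astype(np.float64)
    return x.mean(axis=1) if x.ndim > 1 else x

def peak_normalized_correlation(a, b):
    """Peak of the normalized cross-correlation between two mono signals."""
    a = a - np.mean(a)
    b = b - np.mean(b)
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
    return np.max(correlate(a, b, mode="full")) / denom

fs, control = wavfile.read("song_control.wav")     # control "true audio" recording
_, ir = wavfile.read("balloon_east.wav")           # balloon impulse response from the East
_, validation = wavfile.read("song_east.wav")      # validation recording made at East

# Simulate "the song as heard from the East" by pushing the control through the IR
# (collapsed to mono here for a simple first-pass comparison).
simulated = fftconvolve(to_mono(control), to_mono(ir))

score_sim = peak_normalized_correlation(simulated, to_mono(validation))
score_raw = peak_normalized_correlation(to_mono(control), to_mono(validation))
print(f"simulated vs validation: {score_sim:.3f} | control vs validation: {score_raw:.3f}")
```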
However, there are a few more things we could include if we have more time to work on our algorithm. We could extend it to account not only for direction but also for the room itself, including characteristics such as its spatial geometry and wall material.
If time permits, we could extend this project by packaging our code and methodology as a VST plugin (Virtual Studio Technology, a plugin format that interfaces with a Digital Audio Workstation to modify sound). However, we would have to port our code to C++, so this is only a potential reach goal.