Audio un-mixing
Research Student: Toby Stokes
Principal Supervisor: Dr Tim Brookes
Co-Supervisor: Dr Chris Hummersone
Supported by: EPSRC and BBC R&D
Start date: 2011
End date: 2015
Project Outline
Given a mixture of audio sources, a blind audio source separation (BASS) tool is required to extract audio relating to one specific source whilst attenuating that related to all others. This project answers the question "How can the perceptual quality of BASS be improved for broadcasting applications?"
The most common source separation scenario, particularly in the field of broadcasting, is single-channel, which is especially challenging because only a limited set of cues is available. Broadcasting also requires that a source separator is automated, capable of handling non-stationary, reverberant mixtures and able to separate an unknown number of sources. In the single-channel case, the time-frequency mask is a common method of separation. However, this process produces artefacts in the separated audio.
The perceptual evaluation for audio source separation (PEASS) toolkit represents an efficient way to generate a multi-dimensional measure of perceptual quality. Initial experimental work, using ideal target and interferer estimates, uses PEASS to test variations on the ideal binary mask and shows continuous masks are perceptually better than binary while identifying a trade-off between artefacts and interferer suppression.
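As a rough illustration of the difference between a binary and a continuous mask, the sketch below builds both from ideal (known) target and interferer magnitudes. It is not the project's code: the function names, the random stand-in spectrograms and the energy-based ratio definition are assumptions made for this example.

```python
import numpy as np

def ideal_binary_mask(target_mag, interferer_mag):
    """1 where the target magnitude exceeds the interferer's, else 0."""
    return (target_mag > interferer_mag).astype(float)

def ideal_ratio_mask(target_mag, interferer_mag, eps=1e-12):
    """Target energy as a fraction of total energy (assumed definition)."""
    t = target_mag ** 2
    i = interferer_mag ** 2
    return t / (t + i + eps)  # eps avoids division by zero in silent bins

# Random stand-ins for magnitude spectrograms (frequency bins x time frames).
rng = np.random.default_rng(0)
target_mag = rng.random((4, 5))
interferer_mag = rng.random((4, 5))

ibm = ideal_binary_mask(target_mag, interferer_mag)
irm = ideal_ratio_mask(target_mag, interferer_mag)

# Either mask is applied by element-wise multiplication with the mixture.
# Summing the magnitudes ignores phase and is only a stand-in here.
mixture_mag = target_mag + interferer_mag
binary_estimate = ibm * mixture_mag
ratio_estimate = irm * mixture_mag
```

The binary mask switches each time-frequency unit fully on or off, whereas the ratio mask scales it smoothly between 0 and 1.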
To explore the optimisation of this trade-off, a series of sigmoidal functions is used to map target-to-mixture ratios to mask coefficients. The mask identified as the optimum discriminates less sharply on target-to-mixture ratio than those typically found in the literature. Further experiments applying offsets, hysteresis, smoothing and frequency-dependency to the mask do not show any benefit in audio quality.
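The idea of mapping a target-to-mixture ratio to a mask coefficient through a sigmoid can be sketched as follows. The logistic form, the `slope` and `midpoint` parameters, and the assumption that the ratio is expressed on a 0 to 1 scale are all illustrative choices, not the parameterisation used in the thesis.

```python
import numpy as np

def sigmoidal_mask(tmr, slope=10.0, midpoint=0.5):
    """Map a target-to-mixture ratio (assumed in [0, 1]) to a mask coefficient.

    A large slope approaches a binary mask; a small slope gives a gentler
    mapping with less discrimination between target and interferer.
    """
    return 1.0 / (1.0 + np.exp(-slope * (np.asarray(tmr) - midpoint)))

tmr = np.linspace(0.0, 1.0, 5)
gentle = sigmoidal_mask(tmr, slope=4.0)   # weak discrimination
steep = sigmoidal_mask(tmr, slope=50.0)   # close to a binary mask
```

Sweeping `slope` in this way gives a family of masks between the two extremes, which is one way to explore the trade-off between artefacts and interferer suppression.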
The optimal sigmoidal mask is also demonstrated to be superior under non-ideal conditions, using a non-negative matrix factorisation algorithm to produce the estimates. A final listening test compares the outputs of binary, ratio and optimal sigmoidal masks, concluding that listeners prefer the ratio mask to the sigmoidal mask and both continuous masks to the binary mask.
Publications
- Stokes T. (2015) 'Improving the perceptual quality of single-channel blind audio source separation'. PhD Thesis, Institute of Sound Recording, University of Surrey. Full text available at epubs.surrey.ac.uk/807786/1/TobyStokesThesis.pdf
- Hummersone C, Stokes T, Brookes T. (2014) 'On the Ideal Ratio Mask as the Goal of Computational Auditory Scene Analysis'. in Naik GR, Wang W (eds.) Blind Source Separation: Advances in Theory, Algorithms and Applications. Berlin/Heidelberg: Springer, Article number 12, pp. 349-368. doi: 10.1007/978-3-642-55016-4_12
- Stokes T, Hummersone C, Brookes TS. (2013) 'Reducing Binary Masking Artefacts in Blind Audio Source Separation'. Rome, Italy: AES 134th Convention, paper 8853
- Stokes T, Brookes TS, Hummersone C. (2012) 'Improving the Quality of Separated Audio: What Works?'. Salford UK: 1st Anniversary Celebration for the BBC Audio Research Partnership
Data Archive
The data generated by this project (including code, listening test interfaces and results) are available in these two repositories:
- Audio Un-mixing Dataset doi: 10.5281/zenodo.19035
- Audio Un-mixing Dataset (addendum) doi: 10.5281/zenodo.31873