Singing Voice Extraction from Stereophonic Recordings
Singing voice separation (SVS) can be defined as the process of extracting the vocal element from a given song recording. The impetus for research in this area is mainly that of facilitating certain important applications of music information retrieval (MIR) such as lyrics recognition, singer identification, and melody extraction. To date, the research in the field of SVS has been relatively limited, and mainly focused on the extraction of vocals from monophonic sources. The general approach in this scenario has been one of considering SVS as a blind source separation (BSS) problem. Given the inherent diversity of music, such an approach is motivated by the quest for a generic solution. However, it does not allow the exploitation of prior information, regarding the way in which commercial music is produced. To this end, investigations are conducted into effective methods for unsupervised separation of singing voice from stereophonic studio recordings. The work involves extensive literature review of existing methods that relate to SVS, as well as commercial approaches. Following the identification of shortcomings of the conventional methods, two novel approaches are developed for the purpose of SVS. These approaches, termed SEMANICS and SEMANTICS draw their motivation from statistical as well as spectral properties of the target signal and focus on the separation of voice in the frequency domain. In addition, a third method, named Hybrid SEMANTICS, is introduced that addresses time‐, as well as frequency‐domain separation. As there is lack of a concrete standardised music database that includes a large number of songs, a dataset is created using conventional stereophonic mixing methods. Using this database, and based on widely adopted objective metrics, the effectiveness of the proposed methods has been evaluated through thorough experimental investigations.