Singing Voice Detection Using Multi-Feature Deep Fusion with CNN

The proposed SVD system overview


The problem of singing voice detection is to segment a song into vocal and non-vocal parts. Commonly used methods train a model on a set of frame-based features and then use the model to classify unseen frames. However, the multi-dimensional features are usually just concatenated for each frame, with little consideration of spatial information. Hence, a deep fusion method over the multi-feature dimensions with Convolutional Neural Networks (CNN) is proposed. A one-dimensional convolution is applied across the feature dimensions of each frame, and the resulting high-level features are used directly for binary classification. The performance of the proposed method is on par with state-of-the-art methods on a public dataset.
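
The per-frame idea described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' architecture: the kernel values, the single convolution layer, the ReLU and global max pooling steps, the logistic output, and all function names (`conv1d_valid`, `frame_vocal_probability`, `segment`) are assumptions made only for illustration. It shows a one-dimensional convolution sliding along the feature axis of one frame, followed by a binary vocal/non-vocal decision.

```python
import math

def conv1d_valid(x, kernel, bias=0.0):
    """Slide `kernel` along the feature axis of one frame (no padding).

    As in most CNN libraries, this is technically a cross-correlation.
    """
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]

def relu(values):
    return [max(0.0, v) for v in values]

def frame_vocal_probability(frame, kernels, out_weights, out_bias=0.0):
    """Return a vocal probability in (0, 1) for one frame's feature vector.

    Assumed pipeline (hypothetical, not from the paper):
    conv over feature dimensions -> ReLU -> global max pool -> logistic unit.
    """
    pooled = [max(relu(conv1d_valid(frame, k))) for k in kernels]
    logit = sum(w * p for w, p in zip(out_weights, pooled)) + out_bias
    return 1.0 / (1.0 + math.exp(-logit))

def segment(frames, kernels, out_weights, out_bias=0.0, threshold=0.5):
    """Label each frame True (vocal) or False (non-vocal)."""
    return [frame_vocal_probability(f, kernels, out_weights, out_bias) >= threshold
            for f in frames]

# Toy, hand-picked weights (illustrative only; a real model would learn them).
kernels = [[1.0, -1.0]]
out_weights = [2.0]
out_bias = -1.0
frames = [[0.0, 0.0, 0.0], [3.0, 1.0, 0.0]]
labels = segment(frames, kernels, out_weights, out_bias)
```

With these toy weights, `labels` marks the first frame as non-vocal and the second as vocal. In practice the kernels and output weights would be trained on labeled frames rather than set by hand.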

In Proceedings of the 7th Conference on Sound and Music Technology