Song Classification using Machine Learning

Authors: Ritika Dhyani, Priyansh Vatsal, Priyanshu Goel, Prafull chauhan, Prince Chauhan, Pratham Chauhan

DOI Link: https://doi.org/10.22214/ijraset.2023.50890

Abstract

The classification of music by genre is crucial today because the number of music tracks, both online and offline, is growing quickly, and tracks must be indexed appropriately to remain accessible. Automatic music genre classification is therefore essential for retrieving music from a vast collection. Most current methods for categorising music genres rely on machine learning. In this article we present a music dataset with ten distinct genres. The system is trained and classified using a deep learning technique, with convolutional neural networks employed for both training and classification. Feature extraction is the most important step in audio analysis; here the Mel Frequency Cepstral Coefficients (MFCCs) of the sound samples are used as the feature vector. The proposed technique uses this feature vector extraction to categorise music into different genres. Our findings indicate that the system's accuracy is approximately 76%, which can facilitate the automatic classification of musical genres.

I. INTRODUCTION

With an abundance of music at consumers' fingertips around the globe, there is a growing need for automatic classification of music to support indexing and easier retrieval, a task that is frequently done manually by specialists in the field. Given a number of audio recordings, the job is to classify each audio file into a specific category, such as happy or sad. Audio processing is one of the more difficult data science tasks compared with image processing and other classification problems. One such application is the classification of music genres, which seeks to place audio files in the sound groups to which they belong. Because classifying music manually requires listening to each song in its entirety, the application is important and needs automation to reduce manual error and time. We therefore employ machine learning and deep learning techniques to automate the procedure.

In a nutshell, the problem statement for our project may be stated as follows: given a number of audio files, the task is to classify each audio file into a specific genre, such as disco or hip-hop.

A classification algorithm takes a dataset of labelled examples as input and builds a model that can automatically categorise new, unlabelled examples. A binary classification problem is one with just two labels (such as "calm" or "rock"). A multi-class classification problem arises when the label set contains three or more labels. We are dealing with a multi-class problem because the set contains a variety of genres.

II. LITERATURE REVIEW

When listening to brief musical samples, humans are very adept at identifying a song's artist, title, and even genre. Numerous neural network (NN) approaches have been used to try to replicate these skills, with varying degrees of success [7]. The mobile app Shazam is a well-known example of an application that uses music data to automatically identify a song's title and artist from just a few seconds of audio. According to Shazam, a song's signature consists of the prominent amplitude peaks of its spectrogram. This is like compiling the positions of the highest mountain peaks in a region, except that instead of (latitude, longitude, height), each prominent peak is described by (time, frequency, amplitude) [8]. Shazam's Tim O'Brien built a NN with two fully connected layers and a final classification output layer containing genre labels. This appears to be a fairly "vanilla" multi-class classifier. He reported test accuracy in the low 90% range, and was able to improve the model somewhat by combining his NN with track-level collaborative filtering features from Sharath Pingula, another Shazam employee. This article provides a summary of the machine learning research and application work done on musical genre classification. For the purposes of the research, songs were divided into brief time segments, and each segment was represented by its spectrogram image. Each spectrogram was assigned a music genre label before being used as input to a CNN. The NN consisted of six convolutional layers, a fully connected layer, a softmax function, and a one-hot array of genre classifications. The softmax function was used to determine the probability of each genre. On the test data, the results were 85% accurate.
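
The softmax and one-hot steps described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the network from the cited work: the five genre names and the raw scores passed to `predict` are placeholders chosen for the example, and the helper names are assumptions.

```python
import math

# Placeholder genre list for illustration; the real label set is not specified here.
GENRES = ["blues", "classical", "disco", "hiphop", "rock"]


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label, classes):
    """Encode a label as a vector with a single 1 at the label's position."""
    return [1 if c == label else 0 for c in classes]


def predict(scores, classes):
    """Return the most probable class together with all class probabilities."""
    probs = softmax(scores)
    best = max(range(len(classes)), key=lambda i: probs[i])
    return classes[best], probs
```

During training, the one-hot vector is the target that the softmax output is compared against; at prediction time, the class with the largest probability is reported.
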
We contrast the effectiveness of two kinds of models in this study. The first method uses deep learning to train a CNN model end-to-end to predict an audio signal's genre label based solely on its spectrogram. The second method makes use of hand-crafted time- and frequency-domain features, which are used to train four conventional machine learning classifiers whose performance we evaluate. The features that are most helpful in this classification process are identified. For audio streaming services like Spotify and iTunes, being able to automatically categorise and tag the music in a user's collection based on genre would be advantageous. This study explores the use of machine learning (ML) algorithms to recognise and categorise the genre of an audio recording. The first model discussed uses convolutional neural networks [2] and is trained end-to-end on the mel spectrogram of the audio input. In the second part of the investigation, we extract features from the audio signal's time domain and frequency domain. These features are then supplied to well-known machine learning models, namely Support Vector Machines, Gradient Boosting, Random Forests, and Logistic Regression, which are trained to categorise the given audio file. The models are evaluated on the Audio Set dataset [1]. We compare the proposed models and study the relative importance of individual features. With only the top 10 features, the model performance is surprisingly good, and the model with the top 30 features performs only slightly worse than the full model, which has 97 features. We also study how much performance, in terms of AUC and accuracy, can be obtained by using only the top N features when training the model.
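
The idea of training on only the top N features can be sketched as follows. This is a minimal illustration under stated assumptions: the feature names and importance scores are invented for the example (they are not results from any study), and in practice the scores would come from a trained model.

```python
def top_n_features(importances, n):
    """Return the names of the n features with the highest importance score."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


def select_columns(rows, feature_names, keep):
    """Keep only the columns named in `keep`, in the order given by `keep`."""
    indices = [feature_names.index(name) for name in keep]
    return [[row[i] for i in indices] for row in rows]


# Hypothetical importance scores, for illustration only.
importances = {
    "mfcc_1": 0.30,
    "spectral_rolloff": 0.25,
    "zero_crossing_rate": 0.10,
    "tempo": 0.35,
}

best_two = top_n_features(importances, 2)
reduced = select_columns([[1, 2, 3, 4]], list(importances), best_two)
```

A classifier would then be retrained on the reduced table and its accuracy compared with the full-feature model.
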
In this study, music's acoustic characteristics were extracted using digital signal processing techniques, and music genre classification was subsequently carried out using neural networks. Genre classification is the process of grouping related pieces of music under a single identity (depending on the rhythm, the instruments used, or the harmonic content) and naming that identity. Genre, which is distinguished by certain characteristic elements of the music, is one way to classify and organise songs. Music genre classification has been a hotly debated topic ever since the early days of the Internet. Because they result from a complex interplay between the general audience, marketing, and historical and cultural factors, musical genres lack precise definitions and boundaries. As a result of this observation, some academics have proposed defining a new genre classification system specifically for the purposes of music information retrieval [4][12]. According to Aucouturier and Pachet (2003) [13], the genre of a piece of music is reportedly the best source of general knowledge for describing its content. Recent deep learning techniques make use of spectrograms, which are visual representations of the audio signal, and feed these representations to convolutional neural networks (CNNs) [14]. Lidy and Rauber (2005) [13] discuss the use of psychoacoustic properties for classifying musical genres, particularly the significance of the STFT measured on the Bark scale (Zwicker and Fastl, 1999). The features used by Tzanetakis and Cook (2002) [4] included spectral contrast, spectral roll-off, and mel-frequency cepstral coefficients (MFCCs). In Nanni et al. (2016), SVM and AdaBoost classifiers are trained using a combination of audio and visual information.
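
Two of the feature concepts mentioned above, the mel scale underlying MFCCs and the spectral roll-off, can be illustrated with a short sketch. It assumes the widely used mel formula m = 2595 * log10(1 + f / 700) and a roll-off fraction of 0.85; the cited works may use different variants, and the function names are chosen for this example.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in hertz to the mel scale (common 2595/700 variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def spectral_rolloff(magnitudes, freqs, fraction=0.85):
    """Lowest frequency at which the cumulative magnitude reaches
    `fraction` of the total magnitude of the spectrum."""
    threshold = fraction * sum(magnitudes)
    running = 0.0
    for magnitude, freq in zip(magnitudes, freqs):
        running += magnitude
        if running >= threshold:
            return freq
    return freqs[-1]
```

A full MFCC pipeline additionally applies a mel filter bank to the power spectrum, takes logarithms, and applies a discrete cosine transform; those steps are omitted here.
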

III. METHODOLOGY

A. Common ML Algorithms

A few of the algorithms are described below.

1. Artificial Neural Network (ANN): ANNs are efficient parallel-processing mathematical models that mimic biological neural networks using interconnected neuron units. They are among the most popular learning algorithms in ML and are known for their adaptability, efficiency, and ability to represent complex processes with high fault tolerance and accurate approximation. In hydrology, for example, ANNs are regarded as reliable data-driven tools for building black-box models of the intricate, nonlinear relationships between rainfall and flooding, and for forecasting river flow and discharge. They have been applied successfully to numerous flood prediction tasks, such as streamflow forecasting, river flow, rainfall-runoff and precipitation-runoff modelling, water quality, evaporation, river stage prediction, low-flow estimation, flood mapping and susceptibility, and river time series. One of the main drawbacks of ANNs is the need for iterative parameter adjustment.
2. Support Vector Machine (SVM): SVM is a supervised learning method based on the principles of structural risk minimisation and statistical learning theory, and it is used extensively in flood modelling. Training an SVM produces a non-probabilistic binary linear classifier that minimises the empirical classification error while maximising the geometric margin. Once trained on historical data, an SVM can be used to predict a quantity forward in time. SVMs are now recognised as reliable and effective ML tools for flood prediction, and SVM and support vector regression (SVR) have gained appeal among hydrologists as alternatives to ANNs. They have been used to predict floods in a variety of settings, such as extreme rainfall, precipitation, rainfall-runoff, reservoir inflow, streamflow, flood quantiles, flood time series, and soil moisture, with promising results, better generalisation ability, and higher performance than ANNs.
3. K-Nearest Neighbour (KNN): This approach can be used for both classification and regression problems, although in data science practice it is applied more often to classification. It is a straightforward algorithm that stores all available cases and classifies a new instance by a majority vote of its k nearest neighbours, where nearness is measured with a distance function. KNN is easy to understand by analogy with real life: if you want to learn more about a person, it makes sense to speak with their friends and co-workers.

Before using the K Nearest Neighbours Algorithm, keep the following points in mind:

KNN is computationally expensive.
Variables should be normalised so that variables with larger ranges do not skew the algorithm.
Data still needs to be pre-processed.
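
The points above can be made concrete with a small sketch of KNN that normalises its inputs first. This is a minimal illustration, assuming min-max scaling, Euclidean distance, and two made-up features (tempo in BPM and an energy score); the training values are invented for the example.

```python
import math
from collections import Counter


def min_max_scale(rows):
    """Rescale every column to [0, 1] so that no feature dominates the distance.
    Returns the scaled rows and a function that scales new points the same way."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]  # avoid dividing by zero

    def scale(row):
        return [(value - low) / span for value, low, span in zip(row, lows, spans)]

    return [scale(row) for row in rows], scale


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    Every prediction scans the whole training set, which is why KNN is costly."""
    distances = sorted(
        ((math.dist(x, query), y) for x, y in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: [tempo_bpm, energy], invented for illustration.
train = [[60, 0.2], [70, 0.3], [65, 0.25], [150, 0.9], [160, 0.8], [155, 0.85]]
labels = ["Calm", "Calm", "Calm", "Energetic", "Energetic", "Energetic"]

scaled_train, scale = min_max_scale(train)
prediction = knn_predict(scaled_train, labels, scale([68, 0.3]), k=3)
```

Without the scaling step, the tempo column (tens to hundreds) would outweigh the energy column (0 to 1) in the distance calculation.
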

4. Convolutional Neural Network (CNN): A convolutional neural network is a deep learning method built specifically for working with images and videos. It takes images as input, extracts and learns their features, and then classifies the images using the learned features. The architecture is inspired by the visual cortex, the region of the human brain that processes visual data from the outside world. The visual cortex has many layers, each of which functions independently and extracts different information from an image. Once the information from the various layers has been merged, the image is evaluated or classified.

A convolutional neural network, also called a CNN or ConvNet, is a type of neural network that is particularly adept at processing input with a grid-like structure, such as an image. A digital image is a binary representation of visual data: a grid-like arrangement of pixels, each with a value indicating how bright it is and what colour it should be. In a CNN, unlike in a fully connected layer where every neuron is connected to all neurons of the previous layer, each neuron is connected only to a small region of the preceding layer.
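
The local connectivity described above is easiest to see in the convolution operation itself. The sketch below is a plain-Python illustration, assuming stride 1 and no padding; real CNN libraries add channels, padding, strides, and learned kernels, and the kernel values used here are arbitrary.

```python
def conv2d(image, kernel):
    """2D cross-correlation with stride 1 and no padding, the core operation
    of a convolutional layer. Each output value depends only on a small
    patch of the input, not on the whole grid."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            # Only the kernel_h x kernel_w patch starting at (r, c) is used.
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# A 3x3 grid of "pixel" values and an arbitrary 2x2 kernel.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
feature_map = conv2d(image, kernel)
```

The same small kernel is reused at every position, which is why a convolutional layer needs far fewer weights than a fully connected layer on the same input.
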

IV. RANDOM FOREST CLASSIFICATION

Random forest is a supervised learning approach that can be applied to both classification and regression problems. As the name implies, the algorithm builds a forest out of several trees.
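
The idea of building a forest out of several trees can be sketched as follows. This is a deliberately simplified illustration, not a full random forest: each "tree" is a one-split stump trained on a bootstrap sample with one randomly chosen feature, and the forest predicts by majority vote. The data and class labels in the usage lines are invented for the example.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in an iterable."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, feature):
    """One-split tree on a single feature: choose the threshold with fewest errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        if not left or not right:
            continue
        left_label, right_label = majority(left), majority(right)
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # no valid split: always predict the majority label
        label = majority(labels)
        return (feature, float("inf"), label, label)
    return (feature,) + best[1:]


def stump_predict(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # sampling with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(
            fit_stump([rows[i] for i in sample], [labels[i] for i in sample], feature)
        )
    return forest


def forest_predict(forest, row):
    """Majority vote over the predictions of all trees."""
    return majority(stump_predict(stump, row) for stump in forest)


# Hypothetical two-feature data, invented for illustration.
rows = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 9], [9, 8], [8, 8], [9, 9]]
labels = ["Calm"] * 4 + ["Energetic"] * 4
forest = fit_forest(rows, labels)
```

Because each tree sees a different resample of the data, individual trees make different mistakes, and the vote tends to be more stable than any single tree.
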

A. Input Data Set

Three Types of Music Metadata.

Descriptive Metadata: Descriptive metadata describes the contents of the recording with objective text tags such as song title, duration in milliseconds, danceability, acousticness, energy, and instrumentalness. It is used every time someone needs to search, arrange, sort, or display the music.
Ownership/Performing Rights Metadata: Whether the revenue comes from digital streams, airplay, or movie sync, the money is split among a number of parties, including performing artists, lyricists, producers, and songwriters. Ownership metadata is therefore required; it describes the legal arrangements behind the release so that royalties can be calculated and allocated.
Recommendation Metadata: Recommendation metadata is different. It consists primarily of subjective tags intended to reflect the recording's content and characterise its sound. Recommendation metadata, such as mood labels, generative genre tags, and song similarity scores, is used to connect tracks in a meaningful way and to fuel recommendation engines. The dataset contains several songs, each of which is labelled with one of the output classes: Happy, Sad, Energetic, or Calm. In addition, each song is analysed, its parameters are extracted, and a numerical value on a scale of 1 to 10 is assigned.

The classification model will be created, trained, and tested using the Python programming language. The work can be divided roughly into the following steps:

a. Importing libraries and preparing the data

b. Model definition

c. Classification report

d. Confusion matrix

e. Final classified samples
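
The classification report and confusion matrix steps listed above can be sketched in plain Python. This is an illustrative sketch, not the project's actual evaluation code: the function names are assumptions, and the label sequences in the usage lines are invented for the example.

```python
def confusion_matrix(y_true, y_pred, classes):
    """matrix[i][j] counts samples whose true class is classes[i]
    and whose predicted class is classes[j]."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def classification_report(matrix, classes):
    """Per-class precision, recall, and F1 computed from the confusion matrix."""
    report = {}
    for i, label in enumerate(classes):
        true_positive = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum: predicted as this class
        actual = sum(matrix[i])                    # row sum: truly this class
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Hypothetical labels, invented for illustration.
classes = ["Happy", "Sad"]
y_true = ["Happy", "Happy", "Sad", "Sad"]
y_pred = ["Happy", "Sad", "Sad", "Sad"]

matrix = confusion_matrix(y_true, y_pred, classes)
report = classification_report(matrix, classes)
```

Reading the matrix row by row shows which classes are confused with each other, which a single accuracy figure cannot reveal.
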

V. FUTURE SCOPE

With more research in this area, we will be able to use different machine learning algorithms, compare accuracies, and make even more accurate predictions while also learning how other models function and their benefits.

The classification of music into genres is a fundamental component of a powerful recommendation system. The major objective is to develop a machine learning model that categorises music samples into various genres in a more methodical manner.

Automating music classification can make it easier to locate important information like trends, popular genres, and performers.

Conclusion

With the aid of machine learning, our application successfully categorises playlists according to mood, giving users a categorised playlist. Listening to a suitable playlist makes the listener feel more at ease and emotionally engaged, which lifts their mood and improves their mental state. Marilyn Manson once said that "Music is the strongest form of magic", because music has the power to heal people and transform their emotions, much like magic. Music that does not match your mood can make you feel stressed and unhappy, which can lead to low energy or inappropriate actions. This application's playlist, however, matches the user's mood. The right music energises and inspires people to cope with their current predicament.

References

[1] Nikki Pelchat, “Neural Network Music Genre Classification des genres de par reseau-neuronal.”
[2] Hareesh Bahuleyan, “Music Genre Classification using Machine Learning Techniques.”
[3] Navneet Parab, Shikta Das, Gunj Goda, and Ameya Naik, “Music Genre Classification Using Deep Learning.”
[4] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.
[5] Y. M. Costa, L. S. Oliveira, and C. N. Silla, “An evaluation of convolutional neural networks for music classification using spectrograms,” Appl. Soft Comput., vol. 52, pp. 28–38, Mar. 2017. Accessed: Dec. 16, 2018. [Online].
[6] Caio Luiggy Riyoichi Sawada Ueno and Diego Furtado Silva, “On Combining Diverse Models for Lyrics-Based Music Genre Classification,” 2019 8th Brazilian Conference on Intelligent Systems (BRACIS), 2019.
[7] J. Despois, “Finding the Genre of a Song With Deep Learning - A.I. Odyssey Part 1.” Accessed: Dec. 27, 2018. [Online]. Available: https://hackernoon.com/finding-the-genre-of-a-song-with-deep-learningda8f59a61194
[8] F. Pachet and D. Cazaly, “A classification of musical genre,” in Proc. RIAO Content-Based Multimedia Information Access Conf., Paris, France, Mar. 2000.
[9] S. Gollapudi, Practical Machine Learning. Birmingham, U.K.: Packt, 2016.
[10] T. O’Brien, “Learning to Understand Music From Shazam,” 2017. Accessed: Dec. 19, 2018. [Online]. Available: https://blog.shazam.com/learning-to-understand-music-from-shazam-56a60788b62
[11] T. Feng, “Deep learning for music genre classification,” 2014.
[12] R. Panda and R. P. Paiva, “Mirex 2012: Mood classification tasks submission,” Machine Learning, vol. 53, no. 1-2, pp. 23–69, 2003.

Copyright

Copyright © 2023 Ritika Dhyani, Priyansh Vatsal, Priyanshu Goel, Prafull chauhan, Prince Chauhan, Pratham Chauhan. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET50890

Publish Date : 2023-04-24

ISSN : 2321-9653

Publisher Name : IJRASET
