This repository contains the code and the results for the project titled "Lyrics Web Scraping and Text Mining Analysis". It was completed by Group-2 for the course ECE 143, Winter 2019.
- Song rank, year, artist and lyrics - This file contains the code that gathers each song's name, its rank on the Billboard charts, the corresponding year, and the name of the artist. All of this information was gathered from Wikipedia's Billboard Year-End lists. The lyrics of these songs were extracted from Genius.
- Nationality - This file contains the code for gathering data about the nationality of the artists for each of the songs. This was gathered from the Wikipedia pages for the artists.
- Genre - The genre data was collected for the songs in two parts. First, we handle the years that have links for all 100 songs in the list. This is done in the first file. Second, the years that do not fall into this category need to be dealt with individually, since we do not know in advance which song's link is absent. This is done in the second file.
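
The two-part genre collection above can be sketched as a simple partition of years. This is only an illustration: the function name `split_years_by_link_coverage` and the input layout (a dict mapping each year to its list of song links, with `None` where a link is missing) are assumptions, not the structure used in the project files.

```python
from typing import Dict, List, Optional, Tuple

SONGS_PER_YEAR = 100  # each year-end list has 100 songs, per the description above


def split_years_by_link_coverage(
    links_by_year: Dict[int, List[Optional[str]]],
) -> Tuple[List[int], List[int]]:
    """Return (complete_years, incomplete_years).

    A year counts as complete only when all 100 entries carry a link.
    The input layout is hypothetical and chosen for illustration.
    """
    complete: List[int] = []
    incomplete: List[int] = []
    for year, links in sorted(links_by_year.items()):
        n_linked = sum(1 for link in links if link)
        if len(links) == SONGS_PER_YEAR and n_linked == SONGS_PER_YEAR:
            complete.append(year)
        else:
            # These years need individual handling: the missing song is unknown.
            incomplete.append(year)
    return complete, incomplete
```

Years in the first list can be processed in one batch, while years in the second list are the ones that need case-by-case treatment.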
Data cleaning is very important for performing text mining analysis. This is especially true for the attributes of nationality, genre, and lyrics.
- Nationality - The data cleaning for nationalities involved classifying more than 128 areas into 37 different nationalities. This was done by grouping the labels of different areas (states, cities, etc.) in the same country under a uniform nationality label. This part is shown in this file.
- Genres - The data cleaning for genres involved mapping the 489 different genres obtained from the data collection process onto 17 main genres. This was done by grouping differently labelled genres under the umbrella of a main genre. This part is shown in this file.
- Lyrics - The data cleaning for lyrics involved filtering out entries marked 'url not retrieved' and removing stop words, punctuation and unwanted words. We retrieved 5367 out of 6000 lyrics from Genius. This part is shown in this file.
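
As a rough illustration of the cleaning steps above, the sketch below groups raw labels under a uniform label and tokenizes lyrics. The mapping entries, the stop-word list, the `"Other"` fallback, and the function names are all hypothetical placeholders; the actual tables and word lists live in the project files.

```python
import string
from typing import Dict, List, Optional, Set

# Hypothetical excerpt of an area -> nationality table (the real one covers
# more than 128 areas and 37 nationalities). A genre table would look the same.
AREA_TO_NATIONALITY: Dict[str, str] = {
    "california": "American",
    "london": "British",
    "england": "British",
}

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS: Set[str] = {"the", "a", "an", "and", "i", "you", "to", "of", "in", "it"}

MISSING_LYRICS = "url not retrieved"  # marker for lyrics that were not fetched


def normalize_label(raw: str, mapping: Dict[str, str], default: str = "Other") -> str:
    """Map a raw label to its uniform label; unknown labels fall back to `default`."""
    return mapping.get(raw.strip().lower(), default)


def clean_lyrics(text: str, stop_words: Set[str] = STOP_WORDS) -> Optional[List[str]]:
    """Lowercase, strip punctuation and drop stop words.

    Returns None for entries whose lyrics were never retrieved.
    """
    if text.strip().lower() == MISSING_LYRICS:
        return None
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in stop_words]
```

For example, `normalize_label(" London ", AREA_TO_NATIONALITY)` yields `"British"`, and songs for which `clean_lyrics` returns `None` can be dropped before analysis.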
We performed three types of text-mining analysis. The first is n-gram analysis, where we show the results for unigrams, bigrams and trigrams. The second is sentiment analysis, where the songs are studied in terms of their positive/negative sentiment scores. The third uses tf-idf to find the most characteristic words for the three most popular genres: hip hop, electronic, and pop. All the analyses and their visualizations can be found in this file.
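
The three analyses can be sketched in plain Python as follows. This is a minimal sketch, not the project's implementation: the function names are hypothetical, the sentiment score is assumed to be a simple count of positive words minus negative words, and the tf-idf weighting shown (term frequency times `log(N / df)`) is one common variant that may differ from the one used in the analysis file.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams (n=1 unigram, n=2 bigram, n=3 trigram)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentiment_score(tokens: Sequence[str], positive: Set[str], negative: Set[str]) -> int:
    """Assumed scoring: +1 per positive word, -1 per negative word."""
    return sum((t in positive) - (t in negative) for t in tokens)


def tf_idf(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Compute tf-idf per document, e.g. one document per genre.

    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document

    scores: Dict[str, Dict[str, float]] = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores
```

The most frequent bigrams of a token list can then be read off with `Counter(ngrams(tokens, 2)).most_common(10)`. With this weighting, a word that appears in every genre gets a score of zero, which is why tf-idf highlights genre-specific vocabulary rather than words shared by all genres.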
The data folder contains our results in .csv files. All information is represented as pandas DataFrames and stored in 5 different parts: song details only; song details + lyrics; song details + cleaned nationality; song details + cleaned nationality + cleaned genres; and all information. In addition, lists of positive and negative words were obtained online and used for sentiment analysis.
All the plots generated by our code can be found in the plots folder. Each plot is named and titled to show what it represents.
- Proposal - Equal Contribution
- Song name, artist, rank, year and lyrics collection - Zhaoyuan He
- Nationality Collection + Cleaning - Yihua Yang and Qinyan Li
- Genre Collection + Cleaning - Anwesan Pal
- Text Mining - Equal Contribution
- Presentation - Equal Contribution