40 Years of Thrash: Comparing the lyrical content of Sodom, Megadeth and Sepultura via Wordclouds in Python

Mar 10, 2021

A short guide to collecting and cleaning lyrics data from Genius.com and creating Wordclouds for exploratory data analysis and comparison.

INTRODUCTION

Have you ever wondered how your favourite artists’ lyrical content differs? Do all of them focus on romance? Maybe one of them has a nihilistic point of view and encourages you to live life to the fullest without worrying too much about the moral ambiguity of the world around you? Perhaps that artist you never truly cared for brings up political notions one too many times?

This guide will provide you with the tools in Python and corresponding explanations so you can find out the answers to the aforementioned questions, and more!

CHOICE OF ARTISTS

Sodom, Megadeth and Sepultura are thrash metal bands that originated in the early 1980s in Germany, the USA and Brazil, respectively, right when the subgenre of thrash metal was establishing itself in the mainstream. All three comment on social and political issues like war, human rights and civil equality, and all are commercially very successful, so I thought it would be quite interesting to compare their lyrics across their different places of origin.

PROCEDURE

The process can be broadly divided into the following steps:

Data Collection
Data Cleaning and Preparation
Visualization

Now let us discuss each step in detail.

1. Data Collection

Genius.com has a rather simple API, accessible from Python, which lets users extract information on artists, their songs (lyrics) and albums. You have to create a Genius account and fill in the corresponding details, after which it should provide you with an access token. For more details, you can read the documentation. The final step should look like this:

Once you have the access token, you are ready to start coding! Install the lyricsgenius package on your platform and import it along with the following packages:
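As a rough pre-flight check, the sketch below reports which third-party packages are still missing. The package list is my assumption about what the later steps need (lyricsgenius for collection, pandas for the data frame, numpy and Pillow for the mask images, wordcloud for the plots), and `missing_packages` is a hypothetical helper name.

```python
import importlib.util

# Assumed list: import names of the third-party packages the later steps
# plausibly rely on (PIL is provided by the "pillow" package on PyPI).
REQUIRED_PACKAGES = ["lyricsgenius", "pandas", "numpy", "PIL", "wordcloud"]


def missing_packages(names):
    """Return the import names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Install before continuing:", ", ".join(missing))
```

Anything it lists can usually be installed with pip, for example `pip install lyricsgenius`.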

Now, select the artists/bands you want to compare and enter the access token:
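A minimal sketch of this setup step. Reading the token from an environment variable is my own suggestion (so that it never ends up in a shared notebook), and the variable name `GENIUS_ACCESS_TOKEN` is an assumption, not something Genius requires.

```python
import os

# The three bands compared in this guide.
artists = ["Sodom", "Megadeth", "Sepultura"]

# Assumed variable name; pasting the token in directly works too,
# but keeps a secret inside the notebook.
access_token = os.environ.get("GENIUS_ACCESS_TOKEN", "")

if not access_token:
    print("Set GENIUS_ACCESS_TOKEN before calling the Genius API.")
```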

Now, we write a function get_all_songs() that enables us to download the lyrics of all songs of each artist provided in the list above. We do this by creating an object of the Genius class provided in the lyricsgenius package and calling its search_artist() method. You can read more about selecting the parameters of the Genius class here. I will go through the steps performed inside the get_all_songs() function sequentially, and the function as a whole will be available afterwards.
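The first step inside get_all_songs() might be sketched as follows. The helper names `build_genius_client` and `fetch_artist` are hypothetical, and the option values (timeout, retries, sorting by title) are illustrative choices rather than the settings used for this project.

```python
def build_genius_client(access_token):
    """Create a Genius client; the option values here are illustrative."""
    import lyricsgenius  # third-party, imported lazily: pip install lyricsgenius

    return lyricsgenius.Genius(
        access_token,
        timeout=15,                   # seconds to wait per request
        retries=3,                    # retry after timeouts
        remove_section_headers=True,  # drop markers such as [Chorus]
        skip_non_songs=True,          # skip track lists, interviews, etc.
    )


def fetch_artist(client, artist_name, max_songs=None):
    """Download an artist's songs; max_songs=None asks for all of them."""
    artist = client.search_artist(artist_name, max_songs=max_songs, sort="title")
    if artist is None:
        # search_artist() gives back nothing when no artist matches the name.
        raise LookupError(f"No Genius results for {artist_name!r}")
    return artist
```

With a valid token, something like `fetch_artist(build_genius_client(token), "Sodom", max_songs=5)` is a quick way to try it out before downloading full discographies, which can take a while.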

The search_artist() method returns an Artist object whose contents can be exported as JSON, which in our case is not very convenient to work with directly. So, we convert it into a dictionary and extract the relevant information. The snippet is given below:
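A minimal sketch of the conversion, using a made-up and heavily trimmed JSON string in place of a real export. The field names (`songs`, `title`, `album`, `release_date`, `lyrics`) are my assumption about the structure, so check them against your own data.

```python
import json

# Hypothetical stand-in for one artist's JSON export (placeholder values only).
sample_json = r'''
{
  "name": "Example Band",
  "songs": [
    {"title": "First Song", "release_date": "1986-01-01",
     "album": {"name": "First Album"}, "lyrics": "line one\nline two"},
    {"title": "Loose Single", "release_date": null,
     "album": null, "lyrics": "another line"}
  ]
}
'''

# json.loads turns the JSON text into ordinary Python dicts and lists.
artist_dict = json.loads(sample_json)
songs = artist_dict["songs"]

for song in songs:
    print(song["title"], "|", song["release_date"])
```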

Now, we iterate through the dictionary and store the album name, song name, release date and lyrics of each song in a list. This is not necessary for the current project as we only need the lyrics, but it would be useful for possible future analysis. We have to use exception handling here because some album names are missing from the scraped data, and running the code without it raises an error. In this case, we set the album column to ‘Unknown’.
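The loop could look roughly like this. `extract_song_rows` is a hypothetical helper, and the dictionary keys are assumptions about how each song record is laid out.

```python
def extract_song_rows(artist_name, songs):
    """Build one [artist, album, song, release_date, lyrics] row per song."""
    rows = []
    for song in songs:
        try:
            album = song["album"]["name"]
        except (KeyError, TypeError):
            # KeyError: no "album"/"name" key; TypeError: album is None.
            album = "Unknown"
        rows.append([
            artist_name,
            album,
            song.get("title"),
            song.get("release_date"),
            song.get("lyrics", ""),
        ])
    return rows


# Placeholder records, one with album details and one without.
example_songs = [
    {"title": "First Song", "release_date": "1986-01-01",
     "album": {"name": "First Album"}, "lyrics": "line one\nline two"},
    {"title": "Loose Single", "release_date": None,
     "album": None, "lyrics": "another line"},
]
rows = extract_song_rows("Example Band", example_songs)
```

Catching only `KeyError` and `TypeError` keeps unrelated bugs visible instead of silently labelling everything ‘Unknown’.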

All songs for each artist are now collected in all_artists_list, which is initialized outside the main for loop. Now, we convert this list to a pandas data frame for easier manipulation later on.
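A minimal sketch of the conversion, with placeholder rows standing in for the scraped data. The column names are my own choice for illustration.

```python
import pandas as pd

# Placeholder rows in the order: artist, album, song, release date, lyrics.
all_artists_list = [
    ["Example Band", "First Album", "First Song", "1986-01-01", "line one\nline two"],
    ["Other Band", "Unknown", "Loose Single", None, "another line"],
]

# Assumed column names; any consistent naming works.
columns = ["artist", "album", "song", "release_date", "lyrics"]
df = pd.DataFrame(all_artists_list, columns=columns)

print(df.head())
```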

This is the entire function:

2. Data Cleaning and Preparation

We start by removing the newline characters (‘\n’) from the lyrics column using regex as follows:
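One way to do this is sketched below on a plain string. `strip_newlines` is a hypothetical helper, and replacing each run of line breaks with a single space (rather than deleting it) is my choice, so that the last word of one line does not fuse with the first word of the next.

```python
import re

# One or more line breaks, together with any whitespace around them.
_NEWLINES = re.compile(r"\s*\n\s*")


def strip_newlines(text):
    """Replace every run of line breaks with a single space."""
    return _NEWLINES.sub(" ", text).strip()


print(strip_newlines("line one\nline two\n\nline three\n"))
```

Assuming the data frame has a column called `lyrics`, the helper can be applied with `df["lyrics"] = df["lyrics"].apply(strip_newlines)`.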

The pandas data frame looks like this:

Since, for now, we only require the lyrics, we write a function that concatenates all the lyrics of each band into a single string and stores the strings in a dictionary with band names as keys. The function is called with the pandas data frame and the list of artists (initialized at the beginning) as its parameters. The lyrics are also converted to lower case in preparation for the Wordclouds.
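A possible version of this function is sketched below. The name `build_lyrics_dict` and the column names `artist` and `lyrics` are assumptions for illustration.

```python
import pandas as pd


def build_lyrics_dict(df, artists):
    """Map each artist to one lower-cased string holding all of their lyrics."""
    lyrics_dict = {}
    for artist in artists:
        # Select this artist's lyrics and skip songs without any text.
        artist_lyrics = df.loc[df["artist"] == artist, "lyrics"].dropna()
        lyrics_dict[artist] = " ".join(artist_lyrics).lower()
    return lyrics_dict


# Placeholder data frame with two bands.
example_df = pd.DataFrame({
    "artist": ["Band A", "Band A", "Band B"],
    "lyrics": ["Hello World", "Second SONG", "Other Words"],
})
lyrics_dict = build_lyrics_dict(example_df, ["Band A", "Band B"])
```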

Please note that we did not remove the stopwords in this section as the WordCloud package provides a very efficient way to perform this task. It will be explained in the next section.

3. Visualization

Now that we have our data ready as strings inside a dictionary, we are ready to visualize it as Wordclouds for our bands of choice. We write a function that makes Wordclouds using the logos of the bands as masks (a mask can be made of any image, see here). The wordcloud package is used for this purpose and makes the task rather simple. Again, this function accepts the dataset and a list of artists as parameters, iteratively builds a Wordcloud from each string in the dictionary lyrics_dict and saves each visualization to disk. Stopwords are also removed by passing the STOPWORDS set from the wordcloud package to the stopwords parameter of the WordCloud class.
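A sketch of such a function follows. The helper `logo_path`, the output file name and the `background_color` and `max_words` values are assumptions for illustration; the third-party imports sit inside the function so that the path helper can be used on its own.

```python
from pathlib import Path


def logo_path(artist, folder="."):
    """Expected location of an artist's logo, e.g. 'Sodom_logo.jpg'."""
    return Path(folder) / f"{artist}_logo.jpg"


def make_wordclouds(lyrics_dict, artists, folder="."):
    """Save one masked Wordcloud per artist and return the output paths."""
    # Third-party imports: pip install numpy pillow wordcloud
    import numpy as np
    from PIL import Image
    from wordcloud import STOPWORDS, WordCloud

    saved = []
    for artist in artists:
        # Pure white areas of the logo are masked out; words fill the rest.
        mask = np.array(Image.open(logo_path(artist, folder)))
        cloud = WordCloud(
            background_color="white",
            mask=mask,
            stopwords=STOPWORDS,  # built-in set of common English words
            max_words=200,
        )
        cloud.generate(lyrics_dict[artist])
        out_file = Path(folder) / f"{artist}_wordcloud.png"  # assumed name
        cloud.to_file(str(out_file))
        saved.append(out_file)
    return saved
```

Calling `make_wordclouds(lyrics_dict, artists)` would then write one image per band next to the logos.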

To use masks, make sure the logos are in the same source folder and are named appropriately. For this code, the images are named as ‘{Artist_Name}_logo.jpg’. You can, alternatively, change the function accordingly.

P.S. PNG files did not seem to work as mask images. I tried a number of them, but each attempt produced a regular, unmasked Wordcloud. On the other hand, JPG files worked fine! I could not find an explanation for this in the documentation, but if you know the reason, please email me or find me on Linkedin. Thanks!
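One possible cause worth checking, which I have not verified against these particular logos: the wordcloud package treats only pure white pixels of the mask as masked out, and PNG logos often have a transparent background whose underlying colour is not white, so nothing gets masked. Assuming that is the problem, the sketch below flattens an RGBA image array onto a white background before it is used as a mask. `rgba_to_white_background` is a hypothetical helper.

```python
import numpy as np


def rgba_to_white_background(rgba):
    """Composite an (H, W, 4) RGBA array onto white; returns (H, W, 3) uint8."""
    rgba = np.asarray(rgba, dtype=np.float64)
    rgb = rgba[..., :3]
    alpha = rgba[..., 3:4] / 255.0  # 0.0 = fully transparent, 1.0 = opaque
    flattened = rgb * alpha + 255.0 * (1.0 - alpha)
    return np.rint(flattened).astype(np.uint8)


# Tiny 1x2 example: a fully transparent pixel and an opaque dark pixel.
pixels = np.array([[[0, 0, 0, 0], [10, 20, 30, 255]]], dtype=np.uint8)
print(rgba_to_white_background(pixels))
```

With Pillow, the input array could come from `np.array(Image.open("Sodom_logo.png").convert("RGBA"))`, where the file name is only an example.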

Finally, we can see the results of our effort. Following are the Wordclouds for each band:

I. Sodom

II. Megadeth

III. Sepultura

With these visualizations, we can clearly and easily tell the main themes used in the lyrical composition by each of the three bands. The entire code (as an ipynb file) with the datasets, band logos and output visualizations can be found here. Thank you!