40 Years of Thrash: Comparing the lyrical content of Sodom, Megadeth and Sepultura via Wordclouds in Python

A short guide to collecting and cleaning lyrics data from Genius.com and creating Wordclouds for exploratory data analysis and comparison.

Image by Carabo Spain from Pixabay


This guide will provide you with the tools in Python and corresponding explanations so you can find out the answers to the aforementioned questions, and more!



  1. Data Collection
  2. Data Cleaning and Preparation
  3. Visualization

Now let us discuss each step in detail.

1. Data Collection

Access Token generation screen

Once you have the access token, you are ready to start coding! Install the lyricsgenius package on your platform and import it along with the following packages:

Now, select the artists/bands you want to compare and enter the access token:
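As a minimal sketch of this step, the list below holds the three bands from the title. Reading the token from an environment variable called GENIUS_ACCESS_TOKEN is an assumption made for illustration; pasting the token in as a plain string works just as well.

```python
import os

# The three bands compared in this article.
artists = ["Sodom", "Megadeth", "Sepultura"]

# Assumption: the token is stored in an environment variable with this
# (hypothetical) name, so it does not end up inside the notebook itself.
access_token = os.environ.get("GENIUS_ACCESS_TOKEN", "")

if not access_token:
    print("No Genius access token found; set GENIUS_ACCESS_TOKEN first.")
```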

Next, we write a function get_all_songs() that downloads the lyrics of all songs by each artist in the list above. It creates an object of the Genius class provided by the lyricsgenius package and calls its search_artist() method. You can read more about selecting the parameters of the Genius class here. I will go through the steps performed inside get_all_songs() one by one, and the complete function follows afterwards.
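The client setup and the search call described above might look roughly like this. The helper names make_client() and fetch_artist() are hypothetical and only serve this sketch. The Genius options shown are real lyricsgenius parameters, but the values picked here are assumptions, not necessarily the ones used for the results in this article.

```python
def make_client(access_token):
    """Create a Genius client (requires: pip install lyricsgenius)."""
    # Imported inside the function so the rest of the sketch can be loaded
    # even on a machine where lyricsgenius is not installed yet.
    import lyricsgenius

    return lyricsgenius.Genius(
        access_token,
        remove_section_headers=True,  # drop markers such as [Chorus]
        skip_non_songs=True,          # skip track lists, interviews, etc.
        verbose=False,
    )


def fetch_artist(genius, artist_name, max_songs=None):
    """Search Genius for one artist; max_songs=None asks for every song."""
    return genius.search_artist(artist_name, max_songs=max_songs, sort="title")
```

A small test run such as fetch_artist(make_client(access_token), "Sodom", max_songs=5) is a cheap way to check the token before downloading everything, since a full download for three bands can take a while.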

The search_artist() method returns an Artist object that can be exported as JSON, which in our case is not very useful as is. So, we convert it into a dictionary and extract the relevant information. The snippet is given below:
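To illustrate the conversion, here is a small sketch using Python's built-in json module on placeholder data. The field names ("songs", "album", "release_date" and so on) are assumptions about the layout and may differ between lyricsgenius versions, so inspect the keys of your own data first.

```python
import json

# Placeholder data: the field names below are assumptions for illustration
# and may differ from what your lyricsgenius version actually produces.
raw_json = """
{
  "name": "Example Band",
  "songs": [
    {"title": "Song A", "album": {"name": "Album A"},
     "release_date": "1990-01-01", "lyrics": "first line"},
    {"title": "Song B", "album": null,
     "release_date": null, "lyrics": "second line"}
  ]
}
"""

artist_dict = json.loads(raw_json)  # JSON text -> Python dictionary
songs = artist_dict["songs"]        # list with one dictionary per song

print(artist_dict["name"], "has", len(songs), "songs")
print(sorted(songs[0].keys()))      # check which fields are available
```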

Next, we iterate through the dictionary and store the album name, song name, release date and lyrics of each song in a list. Only the lyrics are needed for the current project, but the other fields would be useful for possible future analysis. We need exception handling here because some album names are missing from the scraped data, and running the code without it raises an error. In this case, we give the value ‘Unknown’ to the album column.
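The loop with its exception handling can be sketched as follows. The function name extract_rows() and the nested layout of the album entry are assumptions for illustration, so adjust the keys to whatever your data actually contains.

```python
def extract_rows(artist_name, songs):
    """Turn song dictionaries into [artist, album, title, date, lyrics] rows."""
    rows = []
    for song in songs:
        try:
            # Assumed layout: the album is a nested dictionary with a "name" key.
            album = song["album"]["name"]
        except (KeyError, TypeError):
            # KeyError: there is no "album" entry; TypeError: the entry is None.
            album = "Unknown"
        rows.append([
            artist_name,
            album,
            song.get("title"),
            song.get("release_date"),
            song.get("lyrics", ""),
        ])
    return rows


# Placeholder songs: one complete, one with an empty album, one without album.
sample_songs = [
    {"title": "Song A", "album": {"name": "Album A"},
     "release_date": "1990-01-01", "lyrics": "first line"},
    {"title": "Song B", "album": None, "release_date": None,
     "lyrics": "second line"},
    {"title": "Song C", "lyrics": "third line"},
]

all_artists_list = extract_rows("Example Band", sample_songs)
for row in all_artists_list:
    print(row)
```

Catching only KeyError and TypeError, rather than using a bare except, keeps unrelated bugs visible while still covering both ways an album can be missing.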

All songs for each artist are now collected in all_artists_list, which is initialized outside the main for loop. We then convert it to a pandas data frame for easier manipulation later on.
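A minimal sketch of the conversion is shown below. The column names are assumptions chosen for this example, and the rows are placeholders rather than real scraped data.

```python
def to_dataframe(all_artists_list):
    """Convert the collected rows into a pandas data frame."""
    # Imported here so the sketch loads even where pandas is missing;
    # in a notebook you would normally put this import at the top.
    import pandas as pd

    columns = ["artist", "album", "title", "date", "lyrics"]  # assumed names
    return pd.DataFrame(all_artists_list, columns=columns)


# Placeholder rows in the same order as the columns above.
rows = [
    ["Example Band", "Album A", "Song A", "1990-01-01", "first line"],
    ["Example Band", "Unknown", "Song B", None, "second line"],
]
```

Calling to_dataframe(rows) gives a data frame with two rows and the five columns listed above.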

This is the entire function:

2. Data Cleaning and Preparation

The pandas data frame looks like this:

Since we only require the lyrics for now, we write a function that concatenates all the lyrics of each band into a single string and stores these strings in a dictionary, with the band names as keys. The function is called with the pandas data frame and the list of artists (initialized at the beginning) as its parameters. All letters are also converted to lower case in preparation for the Wordclouds.
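One way to write such a function is sketched here. The name lyrics_by_artist() and the column names "artist" and "lyrics" are assumptions, so rename them to match your own data frame.

```python
def lyrics_by_artist(df, artists):
    """Return {artist: all lyrics joined into one lower-case string}."""
    lyrics_dict = {}
    for artist in artists:
        # Select the lyrics of this artist and skip missing values.
        artist_lyrics = df.loc[df["artist"] == artist, "lyrics"].dropna()
        lyrics_dict[artist] = " ".join(artist_lyrics.astype(str)).lower()
    return lyrics_dict
```

An artist without any rows in the data frame simply ends up with an empty string, which makes a failed download easy to spot before plotting.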

Please note that we did not remove the stopwords in this section as the WordCloud package provides a very efficient way to perform this task. It will be explained in the next section.

3. Visualization

To use masks, make sure the logos are in the same folder as the source code and are named appropriately. For this code, the images are named ‘{Artist_Name}_logo.jpg’. Alternatively, you can change the function accordingly.
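The masking and stopword removal can be sketched as follows. The helper names logo_path() and make_wordcloud() are hypothetical, and the WordCloud settings (white background, 200 words) are assumptions rather than the exact values behind the images in this article.

```python
def logo_path(artist):
    """Build the mask file name, e.g. 'Sodom_logo.jpg'."""
    return f"{artist}_logo.jpg"


def make_wordcloud(artist, text, extra_stopwords=()):
    """Generate a masked Wordcloud (needs wordcloud, numpy and Pillow)."""
    # Third-party imports are kept inside the function so that the helper
    # above can be used without them.
    import numpy as np
    from PIL import Image
    from wordcloud import STOPWORDS, WordCloud

    mask = np.array(Image.open(logo_path(artist)))
    stopwords = set(STOPWORDS) | set(extra_stopwords)
    cloud = WordCloud(
        mask=mask,             # pure white pixels of the mask stay empty
        stopwords=stopwords,   # common words are dropped at this point
        background_color="white",
        max_words=200,
    )
    return cloud.generate(text)
```

With the lyrics dictionary from the previous section, make_wordcloud("Sodom", lyrics_dict["Sodom"]).to_file("Sodom_wordcloud.png") would write the image to disk. The extra_stopwords argument is a convenient place for words you want to leave out in addition to the built-in list.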

P.S. PNG files did not seem to work as mask images. I tried a number of them, but I got a regular Wordcloud each time. JPG files, on the other hand, worked fine! I could not find the reason for this in the documentation, but if you know, please email me or find me on LinkedIn. Thanks!
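A possible explanation, not verified against the logos used here: the WordCloud documentation states that only pure white entries of the mask are treated as masked out. Many PNG logos have a transparent background instead of a white one, so none of the background pixels count as white and the words fill the whole rectangle. JPG files cannot store transparency, which would explain why they behave differently. Under that assumption, flattening the PNG onto a white background with Pillow before using it as a mask may help. The helper name flatten_on_white() is hypothetical.

```python
def flatten_on_white(image):
    """Return an RGB copy of a Pillow image with transparency replaced by white."""
    from PIL import Image  # requires: pip install Pillow

    rgba = image.convert("RGBA")
    white = Image.new("RGBA", rgba.size, (255, 255, 255, 255))
    # Paste the logo over the white canvas, then drop the alpha channel.
    return Image.alpha_composite(white, rgba).convert("RGB")
```

The result can then be turned into a mask in the usual way, for example with numpy.array(flatten_on_white(Image.open("Sodom_logo.png"))).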

Finally, we can see the results of our effort. Here are the Wordclouds for each band:

I. Sodom

II. Megadeth

III. Sepultura

With these visualizations, we can easily tell the main lyrical themes of each of the three bands. The entire code (as an ipynb file) with the datasets, band logos and output visualizations can be found here. Thank you!

I make a mean curry, try (really hard) to solve problems and am a heavy metal enthusiast. Learning to be a data analyst @University of Bonn, Germany.