HOME | ABOUT | RESUME | GITHUB | CONTACT


CapStats

Pitchfork 50 Best Albums of the Year

February 6, 2019

All code can be found here

Pitchfork

Pitchfork.com is a music review website. Each album reviewed on the site is rated on a scale from 0 to 10. Pitchfork began in the mid-90s as a review site for new independent music. Its cultural footprint has grown each year, and the site now draws over 200,000 visitors each day.

I love new music. I read Pitchfork daily to see whether any newly released albums might interest me. Recently, while looking through some of the available data sets on Kaggle, I came across a collection of over 18,000 music reviews from Pitchfork. Nolan Conway wrote a scraper and uploaded the database he built to store all of the album review info.

One of the first projects I completed for my MS in Data Science was a scraper to pull review data from Pitchfork. Taking a look at how Nolan created his Pitchfork scraper was illuminating. His code is very clear and concise. If you are reading this post in search of a Python web scraping tutorial, I suggest reading through Nolan’s code.

After taking a look at all of this data, I had an idea. Pitchfork publishes a list of the 50 best albums every year. Can the data from the album reviews produced throughout the year predict whether an album will be included in the year-end list? Can the album review data predict an album’s year-end ranking on the list? The first step to finding out is to scrape these lists from Pitchfork and link them to the album review data. A previous post explained how to scrape webpages using R. The code in this post is written in Python using the BeautifulSoup library. After scraping the year-end results, I will merge them with the data that Nolan has already provided on Kaggle.

The first step is to get the web address of each year’s list. Pitchfork has a webpage that links to all of these lists, which can be found here.

from bs4 import BeautifulSoup  #for webscraping
import requests                #to get the webpage
import re                      #for regular expression matching

base_url = 'https://pitchfork.com'

#use this page to get the urls for all the best album lists 
page_name = base_url + '/features/lists-and-guides/'
#request the page
page = requests.get(page_name)
#make the soup
soup = BeautifulSoup(page.content, 'html.parser')

#if the 2017 list is not linked on the page, it can be added manually:
#'/features/lists-and-guides/the-50-best-albums-of-2017/'
#initialize a list
lists = []

#get all links for 50 best albums of the year
#keep links whose href matches the regular expression -50-
for link in soup.find_all('a', attrs={'href': re.compile(r'-50-')}):
    lists.append(link['href'])

#remove duplicates
lists = list(set(lists))

#print the list
print(lists)
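
The pattern -50- is broad, so it may also pick up links that are not album-of-the-year lists. As a rough sketch of a stricter filter, the snippet below indexes hrefs by year without touching the live site. The sample hrefs are hypothetical apart from the 2017 path quoted in the comment above, and the assumption that every year follows the same URL pattern is mine, as is the helper name index_lists_by_year.

```python
import re

# hypothetical hrefs; only the 2017 path appears in this post
sample_hrefs = [
    '/features/lists-and-guides/the-50-best-albums-of-2017/',
    '/features/lists-and-guides/the-50-best-albums-of-2017/',  # duplicate link
    '/features/lists-and-guides/the-50-best-albums-of-2018/',  # assumed pattern
    '/features/lists-and-guides/another-50-item-list-2018/',   # made-up non-album list
]

# stricter pattern: capture the four-digit year at the end of the path
album_list_re = re.compile(r'-50-best-albums-of-(\d{4})/?$')

def index_lists_by_year(hrefs):
    """Return {year: href} for hrefs that look like album-of-the-year lists."""
    by_year = {}
    for href in hrefs:
        match = album_list_re.search(href)
        if match:
            # duplicates simply overwrite the same key
            by_year[int(match.group(1))] = href
    return by_year

lists_by_year = index_lists_by_year(sample_hrefs)
print(lists_by_year)
```

Looking a year up in a dictionary like this avoids depending on the order of a de-duplicated set when picking the link for a given year.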

Using the list of links above, Pitchfork’s top 50 albums from each year can be captured. To do this, a helper function is created to extract the album titles from these lists. Each year’s list is split across 5 webpages (10 albums per page), with the first webpage listing albums #50 to #41.

def get_album_titles(webpage):
    """
    quick helper function to extract album titles from 
    pitchfork best album of the year lists
    @webpage is the page where the albums are listed
    returns a list of album titles
    """
    page = requests.get(webpage) #request the page
    soup = BeautifulSoup(page.content, 'html.parser') #make the soup
    #extract the album titles
    albums = soup.find_all('h2', {'class' : 'list-blurb__work-title'})
    #return the album titles as a list
    return [t.get_text() for t in albums]

#initialize list to store info
album_title = [] 

#best album link years
years = range(2013, 2019)
#for each year
for year in years:
    #filter our list of links for the year
    new_page = base_url + [l for l in lists if str(year) in l][0]
    #append the album titles on the page using the helper function
    album_title += get_album_titles(new_page)
    #pitchfork uses a separate web page for every 10 albums
    for page_number in range(2, 6):
        #build the url for the next page
        next_page = new_page + '?page=%s' % page_number
        #append the album titles on the page using the helper function
        album_title += get_album_titles(next_page)
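
The pagination logic can be checked without making any requests. This is a small sketch, assuming the layout described above (5 pages, 10 albums per page, page 1 holding #50 to #41, later pages reached with ?page=N). The helper names page_urls and ranks_on_page are made up for illustration.

```python
def page_urls(list_url, n_pages=5):
    """Return the urls for every page of one year's list.

    The first page is the bare list url; later pages add ?page=N.
    """
    return [list_url] + ['%s?page=%d' % (list_url, n) for n in range(2, n_pages + 1)]

def ranks_on_page(page, per_page=10, list_size=50):
    """Return the ranks shown on a page, in the order they are listed."""
    top = list_size - per_page * (page - 1)
    return list(range(top, top - per_page, -1))

#the 2017 path is the one quoted earlier in this post
example_url = 'https://pitchfork.com/features/lists-and-guides/the-50-best-albums-of-2017/'
urls = page_urls(example_url)
for page_number, url in enumerate(urls, start=1):
    ranks = ranks_on_page(page_number)
    print(url, 'ranks', ranks[0], 'to', ranks[-1])
```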

Since this information will be combined with the album review data that has already been scraped, it is helpful to create a pandas data frame to store the information.

import pandas as pd #to create a data frame

#create a one column df using the album titles
album_df = pd.DataFrame({'album_title' : album_title})
#add the year and rank for each album
album_df['year'] = [year for year in years for i in range(50)]
album_df['rank'] = (list(range(50, 0, -1)) * len(years))
#save as a csv file
album_df.to_csv("data/pitchfork_50.csv")

The first few rows of album_df:

album_title          year  rank
My Name Is My Name   2013  50
Loud City Song       2013  49
Major Arcana         2013  48
Nonfiction           2013  47
Matangi              2013  46
Slow Focus           2013  45
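
The year and rank columns above rely on every year yielding exactly 50 titles, in order from #50 down to #1. If a page fails to load or the markup changes, the columns would silently shift. Here is a minimal sketch of a guard against that, assuming the titles are collected per year instead of in one flat list. The function build_rows and the placeholder titles are hypothetical.

```python
def build_rows(titles_by_year, list_size=50):
    """Turn {year: [titles in scraped order]} into rows with year and rank.

    Assumes titles are listed from the lowest spot (list_size) to the best (1).
    Raises ValueError if a year does not have exactly list_size titles.
    """
    rows = []
    for year in sorted(titles_by_year):
        titles = titles_by_year[year]
        if len(titles) != list_size:
            raise ValueError('%d: expected %d titles, got %d'
                             % (year, list_size, len(titles)))
        for position, title in enumerate(titles):
            rows.append({'album_title': title,
                         'year': year,
                         'rank': list_size - position})
    return rows

#toy input with a 3-item "list" and placeholder titles
toy = {2013: ['Album A', 'Album B', 'Album C']}
rows = build_rows(toy, list_size=3)
print(rows)
```

The resulting list of dictionaries can be passed straight to pd.DataFrame to get the same three columns.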

Now that the best album lists are in a data frame, this data can be joined with the album review data from Nolan Conway. To follow along with this section, you need to download the pitchfork.db file from here.

import sqlite3 #to import data from the data base
from functools import reduce #help joining data frames

con = sqlite3.connect("data/pitchfork.db")
#get review data
reviews = pd.read_sql('SELECT * FROM reviews', con) 
#get genre, label, content (text of review)
genres = pd.read_sql('SELECT * FROM genres', con)
labels = pd.read_sql('SELECT * FROM labels', con)
content = pd.read_sql('SELECT * FROM content', con)
con.close()

#get word count from review
content['word_count'] = content['content'].apply(lambda x: len(x.split()))

#data frames to join
dfs = [reviews, genres, labels, content]
#join and create one data frame
reviews_final = reduce(lambda left, right: pd.merge(left, right, on='reviewid'), dfs)
#save as a csv file
reviews_final.to_csv("data/pitchfork_reviews.csv")

#join the top 50 lists with the review data
#review album titles are all lower case
album_df['album_title_lower'] = album_df['album_title'].str.lower()
final_album_df = pd.merge(reviews_final, album_df,  
                          how='left', 
                          left_on= ['title', 'pub_year'], 
                          right_on = ['album_title_lower', 'year'])
final_album_df.to_csv("data/pitchfork50_with_reviews.csv")
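
Because the join matches on title text and year, an album whose title is spelled or spaced differently in the two sources will not match. The sketch below shows one way to list such albums before trusting the merged file. The list titles come from the table above, but the review rows are made up, and the normalization rules (trim, lower-case, collapse whitespace) are my own assumption about what might differ.

```python
import re

def normalize_title(title):
    """Lower-case a title, trim it, and collapse runs of whitespace."""
    return re.sub(r'\s+', ' ', title).strip().lower()

def unmatched_albums(list_albums, review_albums):
    """Return (title, year) pairs from the list that have no matching review."""
    review_keys = {(normalize_title(t), y) for t, y in review_albums}
    return [(t, y) for t, y in list_albums
            if (normalize_title(t), y) not in review_keys]

#list titles are from the table above; the review rows are made up
list_albums = [('My Name Is My Name', 2013),
               ('Loud City Song', 2013),
               ('Major Arcana', 2013)]
review_albums = [('my name is my name', 2013),
                 ('loud city  song ', 2013)]

missing = unmatched_albums(list_albums, review_albums)
print(missing)
```

With real data, the same check could be run on the album_title and title columns, and any albums it reports could be fixed by hand or with a looser matching rule.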

The next post will build a model that tries to predict whether an album will appear on the year-end list and, if so, what its ranking will be.

