HOME | ABOUT | RESUME | GITHUB | CONTACT


CapStats

Scraping New York Times Movie Reviews

December 20, 2018

All code can be found here

New York Times Movie Reviews

I love reading movie reviews. I avidly look forward to the Friday New York Times because that is when most of the movie reviews get printed (yes - printed, I still get the physical copy sent to my house on weekends). Manohla Dargis, A.O. Scott, Stephen Holden, and the rest of the staff do a great job providing succint summaries and insightful critiques of almost every film that is released in the United States.

To extract the most out of movie criticism, it helps to have a relationship with the critic. Interestingly, excluding this relationship is a major appeal of using Rotten Tomatos, the film review aggregation website. You can simply look at a number to help you decide weather a movie will be worth your time or not. Reading a review from an unknown source is not as helpful as using Rotten Tomatos, but if you are familiar with the source and have a movie watching history intertwined with that source, it is much easier to discern what the reveiewer is presenting.

For example, “Roma”, Alfonso Cuarón’s Netflix funded film, was recently reviewed by Manhola Dargis. Dargis says that the movie is “a masterpiece”, and no doubt it is. But I remember reading Dargis’ review of Michael Hanekes’ “Amour”, another foreign language film that is highly regarded, and based upon Dargis’ excellent review, I saw the movie. I was not disappointed, but I like movies that are more propulsive - a romantic drama about two octogenarians is not for me. I will most likely skip “Roma” as well since my idea of a masterpiece (“Goodfellas”) may not be the same as hers.

NYT Movie Review Ratings Blurbs

Anyway, you might think this post is about the sentiment analysis of NYT movie reviews. While that would be a great post, I was intrigued by something else. Many of the NYT movie reviews end with a blurb stating the MPAA rating and why the movie received this rating. These blurbs, which mostly seem innocuous, can sometimes be pretty hilarious. Here is the rating blurb for “Amour”:

“Amour” is rated PG-13 (Parents strongly cautioned). Illness, suffering, death.

While not being that funny, I appreciate the bluff description (but am dismayed by the possible spoiler). Here is a genuinenly funny rating blurb, for the movie “A Quiet Place”:

Rated PG-13 for being scary enough to keep your kids quiet for at least a weekend.

or this blurb for “Loveless”:

Rated R for putting an icy fist around your heart, and squeezing.

Collecting the Blurbs

I love these blurbs. Who writes them? I wanted to create a collection of all these NYT movie review rating blurbs so I could try to find out. I used the rvest package to write a script that extracts some basic movie info and the ratings blurb. You will need to sign up for a NYT API key. Store this key as an evironment variable Sys.setenv(NYTIMES_KEY="YOUR_API_KEY").

library(rvest) #to parse webpages
library(jsonlite) #nyt data is in json format
library(stringr) #for string manipulation
library(dplyr) #for datawrangling

NYTIMES_KEY = Sys.getenv("NYTIMES_KEY")
year = 2018
month = 11 #november

#insert in the year and month
api_link = sprintf("https://api.nytimes.com/svc/archive/v1/%s/%s.json", year, month)
#api string
api_key = sprintf("?api-key=%s", NYTIMES_KEY)
  
#extraxt the data from the api and create a data frame
api_df = fromJSON(paste0(api_link, api_key), flatten = TRUE) %>% 
  data.frame() 

#filter for just movie review articles
review_links_df = api_df %>%
  filter(str_detect(response.docs.web_url, "movies"),
         str_detect(response.docs.web_url, "-?review-?")) 
copyright response.hits response.docs.web_url
Copyright (c) 2018 The New York Times Company. All Rights Reserved. 4732 https://www.nytimes.com/2018/11/01/movies/maria-by-callas-review.html
Copyright (c) 2018 The New York Times Company. All Rights Reserved. 4732 https://www.nytimes.com/2018/11/01/movies/the-price-of-free-review-documentary.html
Copyright (c) 2018 The New York Times Company. All Rights Reserved. 4732 https://www.nytimes.com/2018/11/01/movies/9-fingers-review.html

Using the links from this data frame we can start scraping individual movie reviews.

Scraping One Blurb

#link to the movie review - i know this one has a blurb
review_link = review_links_df$response.docs.web_url[52]

https://www.nytimes.com/2018/11/21/movies/the-favourite-review.html

#date
date = substr(review_link, 25, 34)

#read the link to the movie review
review_html = read_html(review_link)

#movie info is at bottom of article
movie_info_html = html_nodes(review_html, '.bottom-of-article') 
#using html_text() to keep only the text
title = html_nodes(movie_info_html, 'h4') %>% html_text()
movie_info = movie_info_html %>% html_nodes("dd") %>% html_text()

#the blurb always appears after the rating at the bottom of the article
#sometimes it will say rated # for.. 
#we are going to try to capture all of them
pattern = "([R|r]ated [G|PG|PG\\-13|R].*).Running"
#only want the matched group
blurb = str_match(review_html %>% html_text(), pattern)[,2]

temp = data.frame(
    url=review_link,
    date=date,
    title=title,
    director=paste0(movie_info[1], collapse="; "), #can be more than one
    writer=paste0(movie_info[2], collapse="; "),   #so use a ; 
    stars=paste0(movie_info[3], collapse="; "),    #to separate
    running_time=movie_info[5],
    genre=paste0(movie_info[6], collapse="; "),
    rating=movie_info[4],
    blurb=substr(blurb, 1, 100), #in case we captured too much
    stringsAsFactors = FALSE
)
date title director writer stars running_time genre rating blurb
2018/11/21 The Favourite Yorgos Lanthimos Deborah Davis, Tony McNamara Olivia Colman, Emma Stone, Rachel Weisz, Nicholas Hoult, Emma Delves 1h 59m Biography, Drama R Rated R. Naughty lords and ladies.

Scraping a Bunch of Blurbs

It is easy to scrape a whole bunch of blurbs. Just wrap what is above in a function and call it for all of the links that we collected from the API.

ScrapeOneReview = function(link) {
  # Scrapes some basic information from a NYT movine review.
  # Args: link: The HTML link of the movie review.
  # Returns: Date Frame with movie info.
  date = substr(link, 25, 34)
  review_html = read_html(link)
  movie_info_html = html_nodes(review_html, '.bottom-of-article') 
  title = html_nodes(movie_info_html, 'h4') %>% html_text()
  movie_info = movie_info_html %>% html_nodes("dd") %>% html_text()
  pattern = "([R|r]ated [G|PG|PG\\-13|R].*).Running"
  blurb = str_match(review_html %>% html_text(), pattern)[,2]
  temp = data.frame(
    url=link,
    date=date,
    title=title,
    director=paste0(movie_info[1], collapse="; "), 
    writer=paste0(movie_info[2], collapse="; "),   
    stars=paste0(movie_info[3], collapse="; "),    
    running_time=movie_info[5],
    genre=paste0(movie_info[6], collapse="; "),
    rating=movie_info[4],
    blurb=substr(blurb, 1, 100), 
    stringsAsFactors = FALSE
  )
  return(temp)
}

#create an empty data frame
one_month_reviews = data.frame()

#loop through each movie review link
for(each in review_links_df$response.docs.web_url){
  #skip any link that is already there (in case of duplicates)
  if(each %in% one_month_reviews$url) next
  #take a random break while scraping
  Sys.sleep(sample(10,1))
  #scrape one review
  temp_movie_review = ScrapeOneReview(each)
  #add it to the data frame
  one_month_reviews = rbind(one_month_reviews, temp_movie_review)
}

#remove movies without a ratings blurb
reviews_with_blurbs = one_month_reviews %>% filter(!is.na(blurb))
date title director writer stars running_time genre rating blurb
2018/11/01 Maria by Callas Tom Volf Fanny Ardant, Joyce DiDonato, Maria Callas, Aristotle Onassis, Vittorio De Sica PG Documentary, Biography NA 1h 53m Rated PG. In English, French and Italian, with English subtitles.
2018/11/01 Bodied Joseph Kahn Alex Larsen Calum Worthy, Jackie Long, Rory Uphold, Jonathan Park, Walter Perez 2 hours Comedy, Drama, Music R Rated R for, you got it, language.
2018/10/31 The Nutcracker and the Four Realms Lasse Hallström, Joe Johnston Ashleigh Powell Mackenzie Foy, Keira Knightley, Morgan Freeman, Helen Mirren, Tom Sweet 1h 39m Adventure, Family, Fantasy PG Rated PG.

Scraping a Whole Buch of Reviews

Only some movies have the ratings blurb we are looking for. Also, by adding a little more code, this routine can be extended to scrape reviews going back for a few years. One thing that should be noticed is that at some point in the past the way these blurs show up in articles changed and the pattern variable would have to be updated accordingly.

whole_bunch_of_reviews = data.frame()

for(one_year in seq(2018, 2016)){
  for(one_month in seq(12, 1)){
    print(one_year, one_month) #to keep track
    api_link = sprintf("https://api.nytimes.com/svc/archive/v1/%s/%s.json", 
                       one_year, one_month)
    api_df = fromJSON(paste0(api_link, api_key), flatten = TRUE) %>% 
      data.frame() 
    review_links_df = api_df %>%
      filter(str_detect(response.docs.web_url, "movies"),
             str_detect(response.docs.web_url, "-?review-?")) 
    for(each in review_links_df$response.docs.web_url){
      if(each %in% whole_bunch_of_reviews$url) next
      Sys.sleep(sample(10,1))
      temp_movie_review = ScrapeOneReview(each)
      whole_bunch_of_reviews = rbind(whole_bunch_of_reviews, temp_movie_review)
}

Some Funny Blurbs

Lastly, here are some funny blurbs I have come across:

“Down a Dark Hall”

Rated PG-13 for sensitive ghosts.

“Goodbye Christopher Robin”

Rated PG for predatory fans and terrible parenting.

“Fallen”

Rated PG-13 for closed-mouth kissing and mild theological wrath.


HOME | ABOUT | RESUME | GITHUB | CONTACT