HOME | ABOUT | RESUME | GITHUB | CONTACT


CapStats

Using R Markdown to Automate Report Writing

January 2, 2019

All code can be found here

R Markdown

Markdown is a markup language used to create different types of documents (pdf, html, etc). R Markdown allows a user to embed R code within the document, thus producing figures, creating tables, performing calculations, and doing anything else that can be done with R. R Markdown can even be used to create Microsoft Word documents.

I work for a school district that supports over 60 schools and 25,000 students. I create a lot of reports for our Superintendent, but I also create school level and student level reports as well. For school level and student level reports I use R Markdown. I will write a script in R that filters school data and calls a R Markdown file to create a report for each school.

Most of the reports I create are pdfs. Creating html files using R Markdown is a little bit easier compared to pdfs, but schools seem to like pdfs better, as does my Superintendent.

Setting Up Your Project

To create a document using R Markdown, download and install R and R Studio. Once these programs are installed, open up R Studio and create a New Project. Doing this will create a new directory that will contain your R scripts, data, and R Markdown file. Within this new directory, create a folder named by_school (or whatever). This is where completed reports will go. It is also a good idea to create a data folder (a place to store your data). To follow along with this project, all files can be found on github.

Next, create 2 new files. First, create an R Script file named create_rmarkdown.R. Next, create an R Markdown file named rmarkdown_report.Rmd. When you create a new R Markdown file, you will be prompted to choose what type of output format you want. For this project, we will be creating pdfs. When a new R Markdown file is created it contains sample text and code. Feel free to delete everything. Once you are finished with all of this prep work, your project directory should look like this:

Prepping the Data

This project creates reports using the data/example_school_data.xlsx file. It’s pretty basic, here is what the data looks like:

School Percent n
School 9 0.7586207 100
School 10 0.2200000 100
District 0.7323177 1000

To import the data, enter the following code in the create_rmarkdown.R script. If you do not have the readxl package, install it using this command in the R console: install.packages("readxl").

library(readxl) #to read the excel file

#read the data
school_data = read_excel("data/example_school_data.xlsx")
#list of all schools
schools = unique(school_data$School)

These are the schools that we will be creating the reports for: School 1, School 2, School 3, School 4, School 5, School 6, School 7, School 8, School 9, School 10. We will not be creating a report for the District. School level and district level reporting are usually different and reports cannot be created using the same .Rmd file.

Eventually a for loop will be created to cycle through the schools and create our reports. For now, in the R console, use this command to extract the school data for School 1 and the District. We will use this school’s data to create the rmarkdown_report.Rmd file.

#school variable is the name of one school
school = "School 1"
#extract the school data and the district data
temp_school_data = school_data[school_data$School==school | school_data$School=="District",]
School Percent n
School 1 0.6428571 100
District 0.7323177 1000

Creating the R Markdown Document

R Markdown files contain 3 major components:

  • YAML = heading

  • R chunks = embedded R code

  • Text

This tutorial is meant as a simple introduction. For a more in depth guide, consult R Markdown: The Definitive Guide.

YAML

The YAML is used to configure the output. The basic YAML that is created for .Rmd files looks like this:

---
title: "Your Title Here"
author: "Your Name Here"
date: "October 21, 2015"
output: pdf_document
---

One of the issues with creating pdfs using R Markdwon is that the plot font does not automatically match the text font. This is not an issue when creating html documents. To combat this issue when creating pdf files, specify which latex_engine should be used to create the pdf. The xelatex latex_engine makes it easier to manipulate fonts.

Since a report is being made for every school, the title should indicate this. R code can be embedded in the YAML. In quotes, the title of the document should be:

`r paste(school, 'Summary')`

When editing the YAML indentation is very important. The new YAML now looks like this:

---
title: "`r paste(school, 'Summary')`"
author: "Your Name Here"
date: "October 21, 2015"
output: 
  pdf_document:
    latex_engine: xelatex
mainfont: Arial
---
R chunks

R chunks are used to write the R code that will be used to create tables, plots, etc. The first R chunk is used to setup the rest of the subsequent chunks. This is where all of the libraries that will be used during the construction of the document are called. For this project, these are the libraries that will be utilized. Also, we do not want to display any code so we will set the echo option to FALSE.

knitr::opts_chunk$set(echo=FALSE)
library(ggplot2)      #for plotting
library(ggthemes)     #to use stephen few's theme
library(scales)       #use the percent function
library(knitr)        #to create table

The next chunk is a simple plot that compares the school data to the District.

#use ggplot to create a chart from the data
p = ggplot(data=temp_school_data, aes(x=School, y=Percent))
#add the title
p = p + ggtitle(paste(school, "Percentage of Something"))
#bar chart
p = p + geom_col(aes(fill=School), position="dodge")
#percent on top of bars
p = p + geom_text(aes(label=percent(round(Percent, 2))), 
                  size=3, vjust=-0.25, position=position_dodge(0.9))
#stephen few's theme
p = p + theme_few()
#match the pdf font output
p = p + theme(text=element_text(family="ArialMT"))
#set the y axis
p = p + scale_y_continuous(labels=percent, 
                           limits=c(0, 1), 
                           breaks=c(0, 0.25, 0.5, 0.75, 1))
p = p + xlab("") + ylab("")
p

Text

Text can be entered anywhere in the document using Markdown syntax.

Inline R code like what is below can be used as well:

`r kable(temp_school_data[,c(1,3)], "latex")`

Creating the R Script

Now that a R Markdown file has been created, the R script can loop through each school and create a document using that school’s data. The rmarkdown library will be used to render the pdf files.

#load rmarkdown library
library(rmarkdown)
library(readxl) #to read the excel file

#read the data
school_data = read_excel("data/example_school_data.xlsx")
#list of all schools
schools = unique(school_data$School)

Next, create a for loop to cycle through each school.

#for each school create a report
for(school in schools){
  #skip any district row data
  if(school=="District") next
  #filter for school and district data
  temp_school_data = school_data[school_data$School==school | school_data$School=="District",]
  #want the school data to be plotted first so set school as a factor
  temp_school_data$School = factor(temp_school_data$School, levels=c(school, "District"))
  #create the markdown document using the render function
  render(input="rmarkdown_report.Rmd", #this is the name of the rmarkdown file
         #this is the name of the finished report
         output_file=paste0("by_school/", school, " Summary.pdf"))
}

Now, by running the R script, a pdf file will be created for every school. The results will be saved in the by_school folder. Cool right? The R Markdown file does not have to be so short. I have created pdfs that are over 10 pages for our schools. But as the project gets bigger, pitfalls begin to appear.

Troubleshooting

Here are some errors that I have dealt with. Hope this helps.

  • Missing data - make sure that there is no missing data in the data sets. Or if there is, create an elegant way to handle this. If I know one of my schools is missing data, or there will be issues when their report is created, I will place if(school=="School Name") next within the for loop to skip over that school and create that report manually.

  • Sometimes the finished pdf needs to be saved in a different folder or multiple folders. My work environment makes it hard for me to write files to certain parts of my computer’s memory. The work around for this is to name your output file temp.pdf and then use file.copy(from="temp.pdf", to="NEW/FILE.pdf") to place it in the desired location.

  • If a memory issue ever occurs, place this line of code after you render the pdf: knitr::knit_meta(class=NULL, clean=TRUE). This will clear the cache and free up the needed RAM.

And lastly, it is important to note, combining the process outlined above with the mailR package will really pay dividends.


HOME | ABOUT | RESUME | GITHUB | CONTACT