January 2, 2019
Markdown
is a markup language used to create different types of documents (pdf, html, etc). R Markdown
allows a user to embed R
code within the document, thus producing figures, creating tables, performing calculations, and doing anything else that can be done with R
. R Markdown
can even be used to create Microsoft Word documents.
I work for a school district that supports over 60 schools and 25,000 students. I create a lot of reports for our Superintendent, but I also create school level and student level reports as well. For school level and student level reports I use R Markdown
. I will write a script in R
that filters school data and calls a R Markdown
file to create a report for each school.
Most of the reports I create are pdfs. Creating html files using R Markdown
is a little bit easier compared to pdfs, but schools seem to like pdfs better, as does my Superintendent.
To create a document using R Markdown
, download and install R
and R Studio. Once these programs are installed, open up R Studio and create a New Project. Doing this will create a new directory that will contain your R
scripts, data, and R Markdown
file. Within this new directory, create a folder named by_school
(or whatever). This is where completed reports will go. It is also a good idea to create a data
folder (a place to store your data). To follow along with this project, all files can be found on github.
Next, create 2 new files. First, create an R Script
file named create_rmarkdown.R
. Next, create an R Markdown
file named rmarkdown_report.Rmd
. When you create a new R Markdown
file, you will be prompted to choose what type of output format you want. For this project, we will be creating pdfs. When a new R Markdown
file is created it contains sample text and code. Feel free to delete everything. Once you are finished with all of this prep work, your project directory should look like this:
This project creates reports using the data/example_school_data.xlsx
file. It’s pretty basic, here is what the data looks like:
School | Percent | n |
---|---|---|
School 9 | 0.7586207 | 100 |
School 10 | 0.2200000 | 100 |
District | 0.7323177 | 1000 |
To import the data, enter the following code in the create_rmarkdown.R
script. If you do not have the readxl
package, install it using this command in the R
console: install.packages("readxl")
.
library(readxl) #to read the excel file
#read the data
school_data = read_excel("data/example_school_data.xlsx")
#list of all schools
schools = unique(school_data$School)
These are the schools that we will be creating the reports for: School 1, School 2, School 3, School 4, School 5, School 6, School 7, School 8, School 9, School 10. We will not be creating a report for the District. School level and district level reporting are usually different and reports cannot be created using the same .Rmd
file.
Eventually a for
loop will be created to cycle through the schools and create our reports. For now, in the R
console, use this command to extract the school data for School 1 and the District. We will use this school’s data to create the rmarkdown_report.Rmd
file.
#school variable is the name of one school
school = "School 1"
#extract the school data and the district data
temp_school_data = school_data[school_data$School==school | school_data$School=="District",]
School | Percent | n |
---|---|---|
School 1 | 0.6428571 | 100 |
District | 0.7323177 | 1000 |
R Markdown
files contain 3 major components:
YAML
= heading
R chunks
= embedded R
code
Text
This tutorial is meant as a simple introduction. For a more in depth guide, consult R Markdown: The Definitive Guide.
YAML
The YAML
is used to configure the output. The basic YAML
that is created for .Rmd
files looks like this:
---
title: "Your Title Here"
author: "Your Name Here"
date: "October 21, 2015"
output: pdf_document
---
One of the issues with creating pdfs using R Markdwon
is that the plot font does not automatically match the text font. This is not an issue when creating html documents. To combat this issue when creating pdf files, specify which latex_engine
should be used to create the pdf. The xelatex
latex_engine
makes it easier to manipulate fonts.
Since a report is being made for every school, the title should indicate this. R
code can be embedded in the YAML
. In quotes, the title of the document should be:
`r paste(school, 'Summary')`
When editing the YAML
indentation is very important. The new YAML
now looks like this:
---
title: "`r paste(school, 'Summary')`"
author: "Your Name Here"
date: "October 21, 2015"
output:
pdf_document:
latex_engine: xelatex
mainfont: Arial
---
R chunks
R chunks
are used to write the R
code that will be used to create tables, plots, etc. The first R chunk
is used to setup the rest of the subsequent chunks
. This is where all of the libraries that will be used during the construction of the document are called. For this project, these are the libraries that will be utilized. Also, we do not want to display any code so we will set the echo
option to FALSE
.
knitr::opts_chunk$set(echo=FALSE)
library(ggplot2) #for plotting
library(ggthemes) #to use stephen few's theme
library(scales) #use the percent function
library(knitr) #to create table
The next chunk is a simple plot that compares the school
data to the District.
#use ggplot to create a chart from the data
p = ggplot(data=temp_school_data, aes(x=School, y=Percent))
#add the title
p = p + ggtitle(paste(school, "Percentage of Something"))
#bar chart
p = p + geom_col(aes(fill=School), position="dodge")
#percent on top of bars
p = p + geom_text(aes(label=percent(round(Percent, 2))),
size=3, vjust=-0.25, position=position_dodge(0.9))
#stephen few's theme
p = p + theme_few()
#match the pdf font output
p = p + theme(text=element_text(family="ArialMT"))
#set the y axis
p = p + scale_y_continuous(labels=percent,
limits=c(0, 1),
breaks=c(0, 0.25, 0.5, 0.75, 1))
p = p + xlab("") + ylab("")
p
Text can be entered anywhere in the document using Markdown
syntax.
Inline R
code like what is below can be used as well:
`r kable(temp_school_data[,c(1,3)], "latex")`
Now that a R Markdown
file has been created, the R
script can loop through each school and create a document using that school’s data. The rmarkdown
library will be used to render the pdf files.
#load rmarkdown library
library(rmarkdown)
library(readxl) #to read the excel file
#read the data
school_data = read_excel("data/example_school_data.xlsx")
#list of all schools
schools = unique(school_data$School)
Next, create a for
loop to cycle through each school.
#for each school create a report
for(school in schools){
#skip any district row data
if(school=="District") next
#filter for school and district data
temp_school_data = school_data[school_data$School==school | school_data$School=="District",]
#want the school data to be plotted first so set school as a factor
temp_school_data$School = factor(temp_school_data$School, levels=c(school, "District"))
#create the markdown document using the render function
render(input="rmarkdown_report.Rmd", #this is the name of the rmarkdown file
#this is the name of the finished report
output_file=paste0("by_school/", school, " Summary.pdf"))
}
Now, by running the R
script, a pdf file will be created for every school. The results will be saved in the by_school
folder. Cool right? The R Markdown
file does not have to be so short. I have created pdfs that are over 10 pages for our schools. But as the project gets bigger, pitfalls begin to appear.
Here are some errors that I have dealt with. Hope this helps.
Missing data - make sure that there is no missing data in the data sets. Or if there is, create an elegant way to handle this. If I know one of my schools is missing data, or there will be issues when their report is created, I will place if(school=="School Name") next
within the for
loop to skip over that school and create that report manually.
Sometimes the finished pdf needs to be saved in a different folder or multiple folders. My work environment makes it hard for me to write files to certain parts of my computer’s memory. The work around for this is to name your output file temp.pdf
and then use file.copy(from="temp.pdf", to="NEW/FILE.pdf")
to place it in the desired location.
If a memory issue ever occurs, place this line of code after you render the pdf: knitr::knit_meta(class=NULL, clean=TRUE)
. This will clear the cache and free up the needed RAM.
And lastly, it is important to note, combining the process outlined above with the mailR
package will really pay dividends.