In this tutorial, we will create an R data package using selected novels from H. G. Wells.
“We all have our time machines, don’t we. Those that take us back are memories…And those that carry us forward, are dreams.” - H.G. Wells
During my short tenure in the R community, I’ve found that one of the basic tenets shared by all is to have reproducible code.
Reproducible code lets you easily share your experiments with others and across your own projects, and it records exactly how the data was created.
Text mining a time machine
For this blog post, I will be packaging up the following novels from author H. G. Wells:
- Ann Veronica
- Mr. Polly
- The Invisible Man
- The Island of Doctor Moreau
- The Time Machine
- The War of the Worlds
The full text of each of these novels was sourced from Project Gutenberg.
- Git repository
There are a few ways to create an R package. This post will only cover using RStudio for package creation.
What is an R package?
An R package is minimally a directory that contains:
- Metadata in a DESCRIPTION file
- An R/ folder that contains R code
Packages can also contain data. If there’s a data/ subdirectory in the package directory, R will make any data files there available under the package namespace.
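To make that layout concrete, here is a minimal sketch in base R that builds a throwaway skeleton in a temporary directory. The package name demopkg and the field values are made-up placeholders, and a real package needs more metadata than this before it will pass checks.

```r
# Build a throwaway package skeleton in a temporary directory.
# "demopkg" and every field value below are placeholders for illustration.
pkg <- file.path(tempdir(), "demopkg")
dir.create(file.path(pkg, "R"), recursive = TRUE, showWarnings = FALSE)  # R code lives here
dir.create(file.path(pkg, "data"), showWarnings = FALSE)                 # optional: packaged datasets

# Minimal metadata in DCF format (one "Field: value" pair per line).
writeLines(c(
  "Package: demopkg",
  "Title: A Placeholder Package",
  "Version: 0.0.1"
), file.path(pkg, "DESCRIPTION"))

# Inspect the resulting layout and read the metadata back.
print(list.files(pkg))
desc <- read.dcf(file.path(pkg, "DESCRIPTION"))
print(desc[1, "Package"])
```

RStudio's project wizard, covered next, generates a fuller version of this structure for you.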
Creating a new R package
To create a new package in RStudio:
- File -> New Project -> New Directory -> R Package
- Enter a name and folder for your package
- Click “Create git repository”
- Click “Create Project” button to create your new project
Edit package metadata
Open up the newly created DESCRIPTION file in the root of the package directory. You will want to make a few changes to this file.
Hadley Wickham has a great blog post on Package metadata
Below are my metadata settings for my hgwellsr package:
Things to note in the DESCRIPTION file.
- Give your package a good
- Make sure you have the format for Maintainer correct. R is kinda picky about that
- Make sure you include which version of R your package depends on
- If you have any entry that spans multiple lines, make sure to indent it
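As a rough illustration of the points above, a DESCRIPTION for a data-only package might look something like the following. This is a sketch, not the actual hgwellsr file: the title, version, name, email address, R version, and license are all placeholders.

```text
Package: hgwellsr
Type: Package
Title: Selected Novels of H. G. Wells
Version: 0.1.0
Author: Your Name
Maintainer: Your Name <you@example.com>
Description: Full text of selected H. G. Wells novels from Project Gutenberg,
    packaged as datasets for text mining.
Depends: R (>= 3.1.0)
License: CC0
LazyData: true
```

Note the name-then-email-in-angle-brackets form of the Maintainer field and the indented continuation line in Description.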
You may have to add more dependencies but since this is only a data package, I don’t have any other dependencies to add.
Adding data to a package
We now need to create a data directory to store the data for this package. First we will create a directory to store our scripts that will collect and process our raw data.
We can use devtools to do this:
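The original command is not reproduced here. Older devtools releases exposed this helper as devtools::use_data_raw(); it has since moved to the usethis package, so a present-day sketch (assuming usethis is installed and the working directory is the package root) would be:

```r
# Run from the root of the package project.
# Creates data-raw/, adds it to .Rbuildignore, and writes a starter script
# named after the first argument (data-raw/data_prep.R here).
usethis::use_data_raw("data_prep", open = FALSE)
```

With the usethis version the script file is created for you, so the manual cleanup described next mainly applies to the older helper.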
This will create a data-raw/ subdirectory. Change to this directory, delete any existing files and create a new file called data_prep.R.
Next, open up data_prep.R and add code to retrieve data from Project Gutenberg and save the cleaned up data files to the data/ folder.
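The original script is not reproduced here, so the following is only a sketch of what such a data_prep.R could look like, covering two of the six novels. It assumes the Project Gutenberg IDs 35 (The Time Machine) and 36 (The War of the Worlds); the offsets and the object names timemachine and warofworlds are invented placeholders, not the values used for hgwellsr. It also uses usethis::use_data, the current home of the use_data helper.

```r
# data-raw/data_prep.R (illustrative sketch, not the original script)
library(gutenbergr)

# Drop the first `offset` lines (front matter) from a character vector.
drop_header <- function(lines, offset) {
  stopifnot(offset >= 0, offset < length(lines))
  lines[(offset + 1):length(lines)]
}

# Download one book and return its text as a character vector.
# gutenberg_download() returns a data frame with columns gutenberg_id and text.
fetch_novel <- function(id, offset) {
  book <- gutenberg_download(id)
  drop_header(book$text, offset)
}

# IDs are assumed; offsets are placeholders you would find by inspecting each file.
timemachine <- fetch_novel(35, offset = 10)
warofworlds <- fetch_novel(36, offset = 10)

# Save each object as an .rda file under data/, replacing earlier versions.
usethis::use_data(timemachine, warofworlds, overwrite = TRUE)
```

Keeping the offset as an explicit argument makes it easy to record, per novel, where the usable text starts.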
Let’s break down what’s going on in the code above.
First we load the gutenbergr R package.
Next we download each novel into a character vector. The gutenberg_download function returns a two-column data frame with one row for each line of the text or texts.
You’ll notice that I use an offset instead of taking the entire contents of the $text column in the data frame. This is because every novel starts with a Gutenberg header, and the length of that header differs from novel to novel.
I didn’t use any special process to come up with these offsets other than taking a look at each file to see where the Gutenberg headers ended.
After all the novels have been downloaded and lightly cleaned, use the devtools::use_data function (usethis::use_data in current releases) to create RData files in the data/ folder, overwriting any existing files.
While it’s optional, you should always document your datasets.
Create a .R file with Roxygen2 comments in the R subdirectory. Here’s an example of the one I used for my hgwellsr package:
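The original example is not reproduced here. A minimal sketch of dataset documentation, assuming a file such as R/data.R and a dataset object named timemachine (both hypothetical names), could look like this:

```r
#' The Time Machine
#'
#' Full text of the H. G. Wells novel "The Time Machine",
#' stored with one element per line of text.
#'
#' @format A character vector.
#' @source Project Gutenberg, \url{https://www.gutenberg.org/}
"timemachine"
```

The quoted string on the last line is the name of the dataset being documented; roxygen2 turns the comment block above it into the help page for that object.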
Unicorns and rainbows
Congratulations, your package is ready to be used!
Before you unleash your awesomeness upon the world, you should probably run a few sanity checks first.
In RStudio, go to Build -> Check Package. RStudio will run a variety of checks on your package.
Once the checks are complete, go ahead and push your local repository to your remote git repository.
The hgwellsr package is on GitHub. To use hgwellsr in R:
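The installation snippet is not reproduced here, and the GitHub account that hosts the repository is not given, so OWNER below is a placeholder you would need to replace. A sketch assuming devtools is available:

```r
# "OWNER" is a placeholder for the GitHub account hosting the repository.
# devtools::install_github("OWNER/hgwellsr")

# Once the package is installed, attach it and list the datasets it ships.
installed <- requireNamespace("hgwellsr", quietly = TRUE)
if (installed) {
  library(hgwellsr)
  print(data(package = "hgwellsr")$results[, "Item"])
}
```

Listing the datasets first is a quick way to find the object names before loading any of the novels.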