In this tutorial, we will create an R data package using selected novels from H. G. Wells.

“We all have our time machines, don’t we. Those that take us back are memories…And those that carry us forward, are dreams.” - H.G. Wells

## Introduction

During my short tenure in the R community, I’ve found that one of the basic tenets shared by all is to have reproducible code.

Reproducible code lets you easily share your experiments with others, reuse them in your own projects, and keep a repeatable record of how the data was created.

## Text mining a time machine

For this blog post, I will be packaging up the following novels from author H. G. Wells:

• Ann Veronica
• Mr. Polly
• The Invisible Man
• The Island of Doctor Moreau
• The Time Machine
• The War of the Worlds

The full text of each of these novels was sourced from Project Gutenberg.

## Prerequisites

• RStudio
• Install devtools for R
• Git repository

There are a few ways to create an R package. This post will only cover using RStudio for package creation.

## What is an R package?

An R package is minimally a directory that contains:

• Metadata in a DESCRIPTION file
• An R/ folder that contains R code

Packages can also contain data. If there’s a data/ subdirectory in the package directory, R will make any data files there available under the package namespace.
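To get a feel for how packaged data behaves, here is a small sketch using the datasets package that ships with R. It is only a stand-in (chosen because it is always installed); data in your own package is reached the same way once the package is installed.

```r
# List the datasets bundled with an installed package.
# "datasets" ships with R and is used here purely as an example.
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Access a packaged dataset directly through the package name.
cars_df <- datasets::mtcars
nrow(cars_df)

# Or load it by name into an environment with data().
env <- new.env()
data("mtcars", package = "datasets", envir = env)
exists("mtcars", envir = env, inherits = FALSE)
```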

## Creating a new R package

To create a new package in RStudio:

• File -> New Project -> New Directory -> R Package
• Enter a name and folder for your package
• Click “Create git repository”
• Click “Create Project” button to create your new project

Open up the newly created DESCRIPTION file in the root of the package directory. You will want to make a few changes to this file.

Below are my metadata settings for my hgwellsr package:

```
Package: hgwellsr
Type: Package
Title: Data package of selected H. G. Wells novels
Version: 0.1.0
Authors@R: c(
    person("Erik", "Howard", email="erikhoward@protonmail.com", role = c("aut", "cre"))
    )
Maintainer: Erik Howard <erikhoward@protonmail.com>
Description: Full text data for selected H. G. Wells novels ready for data analysis.
    This includes the novels Ann Veronica, The History of Mr Polly, The Invisible Man,
    The Island of Doctor Moreau, The Time Machine and The War of the Worlds.
Depends: R (>= 3.1)
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
URL: https://github.com/erikhoward/hgwellsr
```

Things to note in the DESCRIPTION file:

• Give your package a good Title and Description.
• Make sure you have the format for Authors and Maintainer correct. R is kinda picky about that.
• Make sure you include which version of R your package depends on.
• If you have any entry that spans multiple lines, make sure to indent the continuation lines.

You may have to add more dependencies but since this is only a data package, I don’t have any other dependencies to add.
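If you want to sanity-check the file format before running a full package check, base R can parse DESCRIPTION-style files with read.dcf(). The sketch below writes a cut-down example to a temporary file (the field values are illustrative, not the full hgwellsr metadata) and reads it back, including a field that continues on an indented second line.

```r
# Write a tiny DESCRIPTION-style file to a temporary location.
# These fields are an illustrative subset, not the complete file.
desc_path <- tempfile("DESCRIPTION")
writeLines(c(
  "Package: hgwellsr",
  "Version: 0.1.0",
  "Description: Full text data for selected H. G. Wells novels",
  "    ready for data analysis.",
  "Depends: R (>= 3.1)"
), desc_path)

# read.dcf() parses the file into a character matrix, one column per field.
# The indented line is treated as a continuation of Description.
fields <- read.dcf(desc_path)
colnames(fields)
fields[, "Description"]
```

If the continuation line were not indented, read.dcf() would not be able to attach it to the Description field, which is the same reason R complains about such files.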

## Adding data to a package

We now need to create a data directory to store the data for this package. First we will create a directory to store our scripts that will collect and process our raw data.

We can use devtools to do this:

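```r
# Creates the data-raw/ directory for data preparation scripts.
# Note: recent devtools releases moved this helper to the usethis
# package, where it is called as usethis::use_data_raw().
devtools::use_data_raw()
```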

This will create a data-raw/ subdirectory. Change to this directory, delete any existing files and create a new file called data_prep.R.

Next, open up data_prep.R and add code to retrieve data from Project Gutenberg and save the cleaned-up data files to the data/ folder in .rda (RData) format.

```r
library(gutenbergr)

# Download each novel and drop the Project Gutenberg header lines
timemachine <- gutenberg_download(35)$text
timemachine <- timemachine[9:length(timemachine)]
waroftheworlds <- gutenberg_download(36)$text
waroftheworlds <- waroftheworlds[19:length(waroftheworlds)]
doctormoreau <- gutenberg_download(159)$text
doctormoreau <- doctormoreau[48:length(doctormoreau)]
annveronica <- gutenberg_download(524)$text
annveronica <- annveronica[44:length(annveronica)]
mrpolly <- gutenberg_download(7308)$text
mrpolly <- mrpolly[9:length(mrpolly)]
invisibleman <- gutenberg_download(5230)$text
invisibleman <- invisibleman[44:length(invisibleman)]

## Add data files to project
devtools::use_data(timemachine, overwrite = TRUE)
devtools::use_data(waroftheworlds, overwrite = TRUE)
devtools::use_data(doctormoreau, overwrite = TRUE)
devtools::use_data(annveronica, overwrite = TRUE)
devtools::use_data(mrpolly, overwrite = TRUE)
devtools::use_data(invisibleman, overwrite = TRUE)
```

Let’s break down what’s going on in the code above.

First we load the gutenbergr R package.

```r
library(gutenbergr)
```

Next we will download each novel into a character vector. The gutenberg_download function returns a two-column data frame (gutenberg_id and text) with one row for each line of the text or texts.

You’ll notice that I use an offset instead of taking the entire contents of the $text column in the data frame. The offset skips the Project Gutenberg header at the top of each novel, and the header length is different for every novel.

I didn’t use any special process to come up with these offsets other than taking a look at each file to see where the Gutenberg headers ended.
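The offset step can be wrapped in a small helper. This is only a sketch: drop_header() is a hypothetical name (it is not part of gutenbergr), and the character vector below is made up so the example runs without a download.

```r
# Hypothetical helper: keep everything from line `first_line` onward.
drop_header <- function(text, first_line) {
  stopifnot(first_line >= 1, first_line <= length(text))
  text[first_line:length(text)]
}

# Made-up stand-in for the $text column of a downloaded novel.
sample_text <- c(
  "Header line 1",
  "Header line 2",
  "TITLE",
  "Chapter I",
  "First sentence of the story."
)

# Look at the first few lines to decide where the header ends...
head(sample_text, 4)

# ...then drop everything before the chosen line.
novel <- drop_header(sample_text, 3)
novel
```

With a real download, the same call would look like drop_header(gutenberg_download(35)$text, 9), which is equivalent to the indexing used for timemachine above.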

After all the novels have been downloaded and lightly cleaned, use the devtools::use_data function to save each one as an .rda file in the data/ folder. Setting overwrite = TRUE replaces any existing file of the same name.

```r
devtools::use_data(timemachine, overwrite = TRUE)
```
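The files that end up in data/ are in R’s standard saved-object format, the same one base R’s save() writes. As a rough illustration, assuming use_data behaves like a thin wrapper around save(), the sketch below does the equivalent by hand in a temporary directory with a toy vector, so it does not touch a real package.

```r
# Toy stand-in for one of the novels.
timemachine <- c("line one", "line two", "line three")

# Write it to <tempdir>/data/timemachine.rda, mirroring the data/ layout.
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
rda_path <- file.path(data_dir, "timemachine.rda")
save(timemachine, file = rda_path)

# Load it back into a fresh environment to confirm the round trip.
env <- new.env()
loaded_names <- load(rda_path, envir = env)
identical(env$timemachine, timemachine)
```

The object keeps its name inside the file, which is why each dataset in the package is later available under the same name it had in data_prep.R.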

## Documentation

While it’s optional, you should always document your datasets.

Create a .R file with Roxygen2 comments in the R subdirectory. Here’s an example of the one I used for my hgwellsr package:

```r
#' Selected H. G. Wells Novels
#'
#' This package contains complete text of selected novels of
#' H. G. Wells formatted to be convenient for text analysis.
#' @docType package
#' @name hgwellsr
#' @aliases hgwellsr hgwellsr-package
NULL

#' "Ann Veronica: Modern Love Story"
#'
#' A data set containing the complete text of H. G. Wells 1909
#' novel "Ann Veronica".
#'
#' @source \url{https://www.gutenberg.org/files/524/524-0.txt}
#' @format A character vector with 12054 elements
"annveronica"
```

## Unicorns and rainbows

Before you unleash your awesomeness upon the world, you should probably run a few sanity checks first.

In RStudio, go to Build -> Check Package. RStudio will run a variety of checks on your package.

Once the checks are complete, go ahead and push your local repository to your remote git repository.

## Package installation

My hgwellsr package is on GitHub. To use hgwellsr in R:

```r
library(devtools)
install_github("erikhoward/hgwellsr")
library(hgwellsr)
```
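Once the package is loaded, each novel is available as a character vector with one element per line of text. As a rough sketch of a first text-mining step, the snippet below counts word frequencies with base R. It uses a short made-up vector so that it runs without the package installed; swapping in timemachine (or any of the other novels) should work the same way.

```r
# Made-up lines standing in for a novel such as timemachine.
novel <- c(
  "The machine hummed.",
  "The traveller smiled at the machine."
)

# Lower-case the text, split on anything that is not a letter a-z,
# and drop any empty strings left behind by the split.
# This is deliberately simplistic: accented letters and apostrophes
# are treated as separators.
words <- unlist(strsplit(tolower(novel), "[^a-z]+"))
words <- words[nzchar(words)]

# Tabulate and sort by frequency, most common first.
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts, 3)
```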