A quick tutorial on how to use Azure Cosmos DB from R.

What is Azure Cosmos DB?

Azure Cosmos DB is Microsoft’s globally-distributed multi-model low-latency database service. The Azure Cosmos DB API can be accessed via SQL, Javascript, Gremlin, MongoDB, Cassandra and Azure Table Storage.

Prerequisites:

Create A New Azure Cosmos DB Instance

Login to the Microsoft Azure Console and create a new resource.

Create a resource


Next, click Databases then Azure Cosmos DB


Create Azure Cosmos DB resource


In the New account dialog we will be specifying:

  • ID - Unique name for this Azure Cosmos DB instance
  • API - Which API we will use to access Azure Cosmos DB.
  • Subscription - Which subscription this instance will run under
  • Resource Group - A new or existing resource group Azure Cosmos DB should run in
  • Location - Which Azure datacenter to run this instance of Azure Cosmos DB in.
  • Geo-redundancy - Optionally enable geo-redundancy.


SettingDescription
IDUnique name used to identify this Azure Cosmos DB instance. The ID can contain only lowercase letters, numbers, and the hyphen (-) character, and it must contain 3 to 50 characters.
APIAPI determines which type of account to create. There are 5 possible choices to select from. SQL (document database), MongoDB (document database), Gremlin (graph database), Azure Table and Cassandra.
SubscriptionSelect which account to associate this Azure Cosmos DB instance too.
Resource GroupSelect or create a new resource group for this instance.
LocationSelect which geographic region to host your Azure Cosmos DB instance.
Enable geo-redundancyIf selected, this will create a replicated version of your database in a second region.

New account dialog


After you click Create, Azure will start provisioning your new Cosmos DB instance. This process will take a few minutes.

After the provisioning process is complete, if you click on the All resources menu option, you should see something like the image below.

Azure resources


Azure Cosmos DB Credentials and Connection String

Before we’re able to login, you’ll need to view your Azure Cosmos DB credentials and connection settings by clicking on your newly created Azure Cosmos DB instance from the All resources menu. Then click Connection String under the Settings menu.

Azure Cosmos DB Credentials

Create Azure CosmosDB Database and Collection

Now click on Browse under the Collections section.

Azure Cosmos DB Collections menu

Click the Add Collection button.

Azure Cosmos DB New Collection

Next we will enter the names for our new database and collection. We’ll go ahead and use the defaults for the rest of this form. Click OK to create your new database and collection.

You might want to adjust the collections Throughput to 2,000-4,000 if you plan on inserting a large amount of data quickly >initially into this collection. Throughput can also be adjusted downward later. See Request Units in Azure Cosmos DB for more information.

Azure Cosmos DB New Collection Options

Connecting to Azure Cosmos DB from R

The R notebook for this demo is located on my Github account here:

For this tutorial, we’ll be using the following R packages.

1
2
3
library(dplyr)
library(here)
library(mongolite)

The dataset we’ll be using the SpaceX Launch Data on Kaggle. This dataset contains launch, payload and outcome information for SpaceX missions.

Let’s load this dataset into R.

1
launch_df <- read.csv(here("input", "spacex_launch_data.csv"), stringsAsFactors=FALSE)

I won’t be doing any feature engineering or exploratory data analysis on this dataset, but lets quickly look at the shape of our data.

1
glimpse(launch_df)
## Observations: 51
## Variables: 11
## $ Flight.Number      "1", "2", "3", "4", "5", "6", "7", "8", "9",...
## $ Date               "June 4, 2010", "December 8, 2010", "May 22,...
## $ Time..UTC.         "18:45", "15:43", "7:44", "0:35", "15:10", "...
## $ Booster.Version    "F9 v1.0", "F9 v1.0", "F9 v1.0", "F9 v1.0", ...
## $ Launch.Site        "CCAFS LC-40", "CCAFS LC-40", "CCAFS LC-40",...
## $ Payload            "Dragon Spacecraft Qualification Unit", "Dra...
## $ Payload.Mass..kg.  "", "", "525Â ", "500Â ", "677Â ", "500Â ", ...
## $ Orbit              "LEO", "LEO", "LEO", "LEO", "LEO", "Polar or...
## $ Customer           "SpaceX", "NASA (COTS) NRO", "NASA (COTS)", ...
## $ Mission.Outcome    "Success", "Success", "Success", "Success", ...
## $ Landing.Outcome    "Failure (parachutes)", "Failure (parachutes...

We can quickly see that we have 51 observations or rows that each have 11 columns. Lets go ahead and insert this data into the collection we previously created in Azure Cosmos DB.

1
2
mgo <- mongo(db = "demo", collection = "launch", url="[YOUR AZURE COSMOSDB PRIMARY CONNECTION STRING])
mgo$insert(launch_df)
1
2
3
4
5
6
7
> mgo$insert(launch_df)
List of 5
 $ nInserted  : num 51
 $ nMatched   : num 0
 $ nRemoved   : num 0
 $ nUpserted  : num 0
 $ writeErrors: list()

Fantastic!

Looks like our insert work. You can verify this by going to the Azure Cosmos DB Console and click on Data Explorer and expand the collection name you created for this tutorial. You should see something like this:

Insert results from R


Querying data from Azure Cosmos DB

Now, that we have data to work in Azure Cosmos DB, lets perfom a few operations with our data.

Count number of records

1
2
mgo$count()
[1] 51

Get all records

1
2
all <- mgo$find('{}')
print(all)
   Flight_Number               Date Time__UTC_                               Booster_Version
1              1       June 4, 2010      18:45                                       F9 v1.0
2              2   December 8, 2010      15:43                                       F9 v1.0
3              3       May 22, 2012       7:44                                       F9 v1.0
4              4    October 8, 2012       0:35                                       F9 v1.0
5              5      March 1, 2013      15:10                                       F9 v1.0
...

Limiting returned results

1
2
limit <- mgo$find(limit=3)
print(limit)
  Flight_Number             Date Time__UTC_ Booster_Version Launch_Site
1             1     June 4, 2010      18:45         F9 v1.0 CCAFS LC-40
2             2 December 8, 2010      15:43         F9 v1.0 CCAFS LC-40
3             3     May 22, 2012       7:44         F9 v1.0 CCAFS LC-40

Find all GTO orbit missions

1
2
gto <- mgo$find('{"Orbit": "GTO"}')
print(gto)
   Flight_Number              Date Time__UTC_ Booster_Version  Launch_Site
1              7  December 3, 2013      22:41         F9 v1.1  CCAFS LC-40
2              8   January 6, 2014      22:06         F9 v1.1  CCAFS LC-40
3             11    August 5, 2014       8:00         F9 v1.1  CCAFS LC-40
4             12 September 7, 2014       5:00         F9 v1.1  CCAFS LC-40
5             16     March 2, 2015       3:50         F9 v1.1  CCAFS LC-40

Find all missions that had failed landings

1
2
fail <- mgo$find('{"Landing_Outcome": { "$regex": "^Fail", "$options": "i"}}')
print(fail)
   Flight_Number               Date Time__UTC_                               Booster_Version
1              1       June 4, 2010      18:45                                       F9 v1.0
2              2   December 8, 2010      15:43                                       F9 v1.0
3              6 September 29, 2013      16:00                                       F9 v1.1
4             13 September 21, 2014       5:52                                       F9 v1.1
5             14   January 10, 2015       9:47                                       F9 v1.1

Result projection

1
2
3
4
5
fail_projection <- mgo$find(
  query = '{"Landing_Outcome": { "$regex": "^Fail", "$options": "i"}}',
  fields = '{"Flight_Number": true, "Landing_Outcome": true, "_id": false}')

print(fail_projection)
   Flight_Number                                                  Landing_Outcome
1              1                                             Failure (parachutes)
2              2                                             Failure (parachutes)
3              6                                                  Failure (ocean)
4             13                                                  Failure (ocean)
5             14                                             Failure (drone ship)
6             17                                             Failure (drone ship)
7             21                                             Failure (drone ship)
8             22                                             Failure (drone ship)
9             26                                             Failure (drone ship)
10          FH 1 Failure (drone ship), Success (ground pad), Success (ground pad)

See the Mongolite User Manual for more information on the R Mongolite syntax.

Conclusion

I’ve been very impressed with Azure Cosmos DB. Onboarding was quite easy. Most of my difficulties were finding the right syntax to use with the Mongolite driver.

Very impressed with the speed of Azure Cosmos DB. I did run into a few errors with inserting millions of records as fast as possible. That had to due with my Throughput being too low. Would be nice if there was an option to automatically scale throughput to a max value and warn me if that max value has been exceeded over a period of time.

Would love to hear your experiences with using Azure Cosmos DB.