R-bloggers: R news and tutorials contributed by 750 R bloggers

RConsortium — Building an R Certification

Fri, 10/26/2018 - 12:26

(This article was first published on (en) The R Task Force, and kindly contributed to R-bloggers)

For the last few months, ThinkR has been involved (with Mango, Procogia and the Linux Foundation) in a working group for an RConsortium R Certification.

Why this Working Group (WG)?

No doubt the demand for R (and for Data Science) has been increasing over the past few years — Data Scientist became the sexiest job of the century according to Harvard Business Review, and Glassdoor listed it as the “Best Job in America” in 2016, 2017 and 2018. No need to argue on that point: data science is now one of the top skills to have on the market.

With the growth in interest came a growth in demand, both from employers and from people looking to improve their skills. To support this, several channels of training have been made available.

But even with that, the landscape remains complex for the three parties involved in the field:

  • students (in a broad sense), who do not know where to start, and if self-taught don’t know how to “officially” validate their skills
  • teachers and trainers, who face a constantly evolving field and might not be able to identify the skills needed on the market
  • recruiters, who might find it hard to identify key skills, diplomas, and certifications

The RConsortium came to the conclusion that there was no "official R certification" to assess one's R skills. Hence this WG, which is working to establish some ground rules for this certification.

Towards a certification

This certification will be useful for two key things:

  • it will allow professionals and students to acquire practical, industry-driven skills in R
  • it will help recruiters identify the skills of potential recruits

As it is crucial for this certification to be industry- and community-driven, we are currently conducting a short survey to get an overview of the state of the sector worldwide.

It would definitely help us a lot if you could take 3 minutes to answer it.

Find the survey here

The post RConsortium — Building an R Certification appeared first on (en) The R Task Force.


To leave a comment for the author, please follow the link and comment on their blog: (en) The R Task Force.

Marketing Analytics and Data Science

Fri, 10/26/2018 - 11:00

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

Summary

The data science community, Kaggle, recently announced the Google Analytics Customer Revenue Prediction competition. The competition uses data from the Google Merchandise store, and the challenge is to create a model that will predict the total revenue per customer.

I’m using the Kaggle competition as an opportunity to show how data science can be used in digital marketing to answer a specific question, and take what is learned from the data and apply it to marketing strategies.

In my approach to the competition, I am using the Business Science Problem Framework (BSPF), a methodology for applying data science to business problems.

The BSPF was developed by Matt Dancho of Business Science. Business Science has an online university for learning data science on a pro level. I learned the BSPF methodology from their course, Data Science for Business with R.

Contents

  1. The Business Problem
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Impact – Business Benefit

1. The Business Problem

The competition outlines the problem:

The 80/20 rule has proven true for many businesses – only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.

The Google merchandise store wants to know which customers are generating the most revenue (customer value). Understanding customer value will allow the business to strategically dedicate resources to maximize their return on investment. More about this in the Impact section.

The Goal: Predict the revenue of a customer

Note: The article excludes some of my thought process and analysis for brevity and to keep it understandable for a non-technical audience.

2. Data Understanding

Exploratory data analysis (EDA) is an observational approach to understanding the characteristics of the data. This step provides insight into a data set, uncovers its underlying structure, and surfaces important variables.

EDA is a necessary step to understand the data before attempting to create a predictive model.

A big thanks to Erik Brin for sharing his exploratory data analysis on Kaggle. I share Erik’s EDA findings below. The full exploratory data analysis is quite long, so I will focus on highlighting some of the important findings.

2.1 Data Overview

The data consists of approximately 1.7 million records (observations) and is split between training and test datasets. The training data is used to train a machine learning model, while the test data is used to validate the accuracy of the model.

Each observation contains information, such as the number of times a user visited the store, the date/time of visit, the visitor’s country, etc. The data does not contain personal identifying information, with all persons identified by a visitor ID.

2.2 Missing Data

There are over 900,000 observations in the training dataset, with each observation representing a store visit.

The Takeaway: There are 14 variables in the training dataset containing missing values. Most of these variables are related to Google AdWords campaigns; we surmise that missing transaction revenue means there was no transaction.
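
A quick way to quantify this during EDA (a minimal sketch, assuming the training data has been loaded into a data frame called train):

## Count missing values per column and keep only the columns that have any.
na_counts <- colSums(is.na(train))
sort(na_counts[na_counts > 0], decreasing = TRUE)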

2.3 Understanding Transaction Revenue

Only 11,515 web sessions resulted in a transaction, which is 1.27% of the total observations.

Altogether, the store generated about $1.5 million in revenue that year, with revenue per transaction ranging from $1 to approximately $23,000. Only 1.6% of the 11,515 transactions had revenues over $1,000.

The Takeaway: Only a small percent of all the data resulted in a transaction, with almost all of the transactions resulting in less than $1000.
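
These figures can be checked directly (a minimal sketch, assuming a transactionRevenue column where NA means no purchase, per the takeaway above):

n_transactions <- sum(!is.na(train$transactionRevenue))
n_transactions / nrow(train)                           # share of sessions with a purchase (~1.27%)
mean(train$transactionRevenue > 1000, na.rm = TRUE)    # share of transactions over $1,000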

2.4 Sessions and Revenues by Date

The number of daily website sessions peaked in October and November 2016 but did not lead to higher daily revenues. In the revenues plot below, daily revenues vary between $0 and approximately $27,000, and there is no visible revenue trend.

In addition, there are some outliers with high daily revenues. Notice the revenue spikes marked in yellow.

The Takeaway: I should consider revenue anomalies to accurately predict customer revenue.

2.5 Sessions and Revenues by Workday

Notice from the plot below that there are fewer sessions and revenues on the weekend. A logical explanation is that Google-branded merchandise is mostly bought by businesses, which operate during the weekdays.

The Takeaway: Weekend website sessions and revenue are lower than on weekdays.
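
The underlying aggregation is straightforward (a sketch, assuming a prepared data frame train with a date column parsed as a Date and a numeric transactionRevenue):

library(dplyr)
library(lubridate)

train %>%
  mutate(weekday = wday(date, label = TRUE)) %>%
  group_by(weekday) %>%
  summarise(sessions = n(),
            revenue  = sum(transactionRevenue, na.rm = TRUE))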

2.6 Sessions and Revenues by Month

The Takeaway: Looking at the month patterns, April has a high ratio of revenues/sessions, and November has a low ratio of revenues/sessions. This is something to keep track of.

2.7 Channel Grouping

Channels define how people come to your website. The default channel groupings can be found here. These groups are mostly defined by the Medium, and some also by Source, Social Source Referral or Ad Distribution Network. From Google Analytics:

  • Source: the origin of your traffic, such as a search engine (for example, Google) or a domain.
  • Medium: the general category of the source, for example, organic search (organic), cost-per-click paid search (CPC), web referral (referral).

The Takeaway: Organic Search and Social have a high number of sessions, with social producing very low revenue. Referral produces the most revenue, with a relatively small number of sessions.

2.8 Pageviews; All Sessions

A pageview indicates each time a visitor views a page on your website, regardless of how many hits are generated.

Most revenue relates to a low number of pageviews.

The Takeaway: The distribution of pageviews is skewed. However, the second graph shows that most revenues came from a small number of sessions with a lot of pageviews! It may be that pageviews is a good factor for predicting revenue.

Solve Problems with Data Science Today!



To be efficient as a data scientist, you need to learn R. Take the course that has cut data science projects in half (see this testimonial from a leading data science consultant) and has progressed data scientists more than anything they have tried before. Over 10 weeks you learn what it has taken data scientists 10 years to learn:

  • Our systematic data science for business framework
  • R and H2O for Machine Learning
  • How to produce Return-On-Investment from data science
  • And much more.

Start Learning Today!

3. Data Preparation

With an understanding of the data, the next step is to prepare the data for modeling.

Data preparation is the process of formatting the data for people (communication) and machines (analyzing/modeling), and performing correlation analysis to review and select the best features for modeling.

Much of this process is code-related; however, below is a correlation analysis that ranks the variables most likely to be good predictors.

Correlation analysis.

The correlation analysis is important because it gives insight into which variables have a good chance of helping the model. Above, pageviews and hits are two variables highly correlated with revenue (good for predicting).
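
The ranking itself can be produced with base R (a simplified sketch, assuming a data frame train_numeric containing the numeric candidate predictors plus the transactionRevenue target):

cors <- cor(train_numeric, use = "pairwise.complete.obs")[, "transactionRevenue"]
sort(abs(cors), decreasing = TRUE)   # most correlated variables first (the target itself appears with correlation 1)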

3.1 Hypotheses – Important Variables

Exploring and preparing the data provided a good understanding of the relationships between customer revenue and other variables in the data.

The following are my initial hypotheses that go into the model:

  • Hypothesis 1: The time of day is an indicator of how much a person spends.
    To test this hypothesis, I created a new variable, hour of day. I extracted the hour of day from the store visitor’s session time.

  • Hypothesis 2: Visitor behavior is an indicator of how much a person spends.
    The existing data contains each visitor’s number of hits, pageviews, and visits. Along with the individual variables, I created a sum of these and combined them as visitor behavior (a construction sketch follows this list).
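
A minimal sketch of how these two features might be constructed (hypothetical column names: visitStartTime as a POSIXct timestamp, plus hits, pageviews, and visits):

library(dplyr)
library(lubridate)

train <- train %>%
  mutate(sessionHourOfDay = hour(visitStartTime),        # Hypothesis 1: time of day
         visitBehavior    = hits + pageviews + visits)   # Hypothesis 2: combined visitor behavior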

Before modeling, I performed an additional correlation analysis after adding the new variables from Hypothesis 1 and Hypothesis 2.

The new visitBehavior variable has a high correlation with revenue.

So far the variable visitBehavior from Hypothesis 2 looks correct. From the correlation ranking above, visitBehavior is a good variable for the model to predict customer revenue. The variable sessionHourOfDay from Hypothesis 1, however, is not a good predictor.

Additional hypotheses to be tested (not in the current model):

  • New site visits is an indicator of how much a person spends.
  • A time-series approach will produce a more accurate model.
  • Address holidays and big sales days (such as Black Friday) to increase revenue prediction. The data does contain spikes in revenue. It would be interesting to see if there is an annual trend in revenue spikes and if there is a geographic trend with revenue spikes.

4. Modeling

Deciding which variables go into the model takes into account the EDA takeaways, the correlation analysis, and my own hypotheses. I will be using the automated machine learning library H2O. H2O takes the prepared data and automatically creates several machine learning models. Once the models are created, they are ranked by performance and can be analyzed in detail.

Modeling with large datasets can use a lot of compute resources and take a long time to process. The H2O machine learning model was generated with a low runtime setting. Increasing this setting will help produce a more accurate model.
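
A sketch of what the H2O AutoML call looks like (assuming a prepared training frame train_prepared with a transactionRevenue target; max_runtime_secs is the "low runtime setting" mentioned above):

library(h2o)
h2o.init()

train_h2o <- as.h2o(train_prepared)
aml <- h2o.automl(y = "transactionRevenue",
                  training_frame = train_h2o,
                  max_runtime_secs = 300)   # raise this for a more accurate model
aml@leaderboard                             # generated models ranked by performance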

The top-performing generated machine learning model was a distributed random forest.

5. Evaluation

With the baseline model now complete, I will submit the model to the Kaggle competition for results.

Now that the submission has been completed, I will continue to improve the model’s accuracy by testing additional hypotheses and evaluating model metrics.

6. Impact – Business Benefit

With the machine learning model complete, the Google Merchandise Store can identify customers that generate the most revenue.

Below are the business benefits to identifying high-value customers:

  • Reduce customer churn with retention marketing for high-value customers
  • Identify likely high-value customers by analyzing existing high-value customer purchasing behavior
  • Predict revenue based on high-value customer purchasing history
  • Focus product development on high-value customer products

You’ve made it all this way!

Thank you for taking the time to read through this analysis. My purpose in writing it is to show how digital marketing can benefit from data science.

It’s not explicitly mentioned in the article, but what really makes this work is having a clearly defined question, sufficient data to answer the question, and a business process to work through the data science problem, such as the Business Science Problem Framework.

Although the exercise uses a large dataset, data science can be used to answer questions with less data as well.

David Curry
Founder, Sure Optimize – Marketing Intelligence


To leave a comment for the author, please follow the link and comment on their blog: business-science.io - Articles.

CRAN’s New Missing Data Task View

Fri, 10/26/2018 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

It is a relatively rare event, and cause for celebration, when CRAN gets a new Task View. This week the r-miss-tastic team (Julie Josse, Nicholas Tierney and Nathalie Vialaneix) launched the Missing Data Task View. Even though I did some research on R packages for a post on missing values a couple of years ago, I was dumbfounded by the number of packages included in the new Task View.

This single page not only describes what R has to offer with respect to coping with missing data, it is probably the world’s most complete index of statistical knowledge on the subject. But the task view is not just devoted to esoterica. The Exploration of missing data section contains a number of tools that should be useful in any data wrangling effort, like this diagnostic plot from the dlookr package.

The downside that may curb one’s enthusiasm is that mastering missing data techniques requires some serious study. Missing data problems are among the most vexing in statistics, and the newer techniques for tackling these problems are relatively sophisticated. So, unless you are a na.omit() kind of guy/gal (a data scientist?), coming to grips with NAs may involve a subtle inferential task embedded in the usual data wrangling effort.
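
As a toy illustration of the difference between simply dropping incomplete rows and imputing them (a sketch using the mice package, one of many packages indexed in the Task View, on the built-in airquality data):

library(mice)

nrow(na.omit(airquality))          # complete-case analysis throws away rows with any NA
imp <- mice(airquality, m = 5, method = "pmm", seed = 1, printFlag = FALSE)
completed <- complete(imp, 1)      # first of five imputed data sets
nrow(completed)                    # all rows retained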

It is also the case that it is not easy to find good introductory material on the subject. The notable exception is Stef van Buuren’s authoritative R-based textbook, Flexible Imputation of Missing Data which he has graciously made available online.

I also found the review papers by Dong and Peng and by Horton and Kleinman to be helpful. And, if you read French a little better than I do, the review by Imbert and Vialaneix looks like it covers all of the basic material.

Finally, I should mention that the Missing Data Task View is part of the R Consortium funded project A unified platform for missing values methods and workflows. Many thanks to the ISC for making it possible for the expert r-miss-tastic team to do the work.



To leave a comment for the author, please follow the link and comment on their blog: R Views.

Designing Transforms for Data Reshaping with cdata

Fri, 10/26/2018 - 01:47

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Authors: John Mount and Nina Zumel, 2018-10-25

As a followup to our previous post, this post goes a bit deeper into reasoning about data transforms using the cdata package. The cdata package demonstrates the "coordinatized data" theory and includes an implementation of the "fluid data" methodology for general data re-shaping.

cdata adheres to the so-called "Rule of Representation":

Fold knowledge into data, so program logic can be stupid and robust.

The Art of Unix Programming, Eric S. Raymond, Addison-Wesley, 2003

The design principle expressed by this rule is that it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.

We showed in the last post how cdata takes a transform control table to specify how you want your data reshaped. The question then becomes: how do you come up with the transform control table?

Let’s discuss that using the example from the previous post: "plotting the iris data faceted".

The goal is to produce the following graph with ggplot2

In order to do this, one wants data that looks like the following:

Notice Species is in a column so we can use it to choose colors. Also, flower_part is in a column so we can use it to facet.

However, iris data starts in the following format.

We call this form a row record because all the information about a single entity (a "record") lies in a single row. When the information about an entity is distributed across several rows (in whatever shape), we call that a block record. So the goal is to transform the row records in iris into the desired block records before plotting.

This new block record is partially keyed by the flower_part column, which tells us which piece of a record a row corresponds to (the petal information, or the sepal information). We could also add an iris_id as a per-record key; this we are not adding, as we do not need it for our graphing task. However, adding a per-record id makes the transform invertible, as is shown here.

There are a great number of ways to achieve the above transform. We are going to concentrate on the cdata methodology. We want to move data from an "all of the record is in one row" format to "the meaningful record unit is a block across several rows" format. In cdata this means we want to perform a rowrecs_to_blocks() transform. To do this we start by labeling the roles of different portions of the block-oriented data example. In particular we identify:

  • Columns we want copied as additional row information (in this case Species, but often a per-record index or key).
  • The additional key that identifies parts of each multi-row record (in this case flower_part).
  • Levels we expect in the new record portion key column (Petal and Sepal).
  • New column names for the new data.frame. These will go where values are currently in the block record data.

We show this labeling below.

Notice we have marked the measurements 1.4, 0.2, 5.1, 3.5 as "column names", not values. That is because we must show which columns in the original data frame these values are coming from.

This annotated example record is the guide for building what we call the transform control table. We build up the transform control table following these rules:

  • The key column is the first column of the control table.
  • The key levels we wish to track are values in the key column.
  • The column names of the control table are the new column names we will produce in our result.
  • The block of values seen in our example record are replaced by the names of the columns that these values were taken from in the original data.
  • Any other columns we want copied are specified in the columnsToCopy argument.


The R version of the above is specified as follows:

(controlTable <- wrapr::qchar_frame(
    flower_part, Length      , Width        |
    Petal      , Petal.Length, Petal.Width  |
    Sepal      , Sepal.Length, Sepal.Width  ))
#>   flower_part       Length       Width
#> 1       Petal Petal.Length Petal.Width
#> 2       Sepal Sepal.Length Sepal.Width

(columnsToCopy <- "Species")
#> [1] "Species"

And we can now perform the transform.

library("cdata") iris_aug <- rowrecs_to_blocks( iris, controlTable = controlTable, columnsToCopy = columnsToCopy) head(iris_aug) #> Species flower_part Length Width #> 1 setosa Petal 1.4 0.2 #> 2 setosa Sepal 5.1 3.5 #> 3 setosa Petal 1.4 0.2 #> 4 setosa Sepal 4.9 3.0 #> 5 setosa Petal 1.3 0.2 #> 6 setosa Sepal 4.7 3.2

The data is now ready to plot using ggplot2 as was shown here.

Designing a blocks_to_rowrecs transform is even easier, as the controlTable has the same shape as the incoming record block (assuming the record partial key controlling column is the first column). All one has to do is replace the values in such an example record with the desired column names.

For example:

# grab a record to use to build our control table
ct <- iris_aug[1:2, c("flower_part", "Length", "Width")]
print(ct)
#>   flower_part Length Width
#> 1       Petal    1.4   0.2
#> 2       Sepal    5.1   3.5

# replace values with desired column names
# as in this case the values are unique we can do
# this mechanically, normally we do this by hand
for(i in 1:nrow(ct)) {
  for(j in 2:ncol(ct)) {
    ct[i,j] <- colnames(iris)[which(iris[1,] == ct[i,j])]
  }
}
print(ct)
#>   flower_part       Length       Width
#> 1       Petal Petal.Length Petal.Width
#> 2       Sepal Sepal.Length Sepal.Width

Though to actually move back from block records to row records we would need additional record id keys showing which rows together form a record (which we demonstrate here).

Notice in both cases that having examples of the before and after form of the transform is the guide to building the transform specification, that is, the transform control table. In practice: we highly recommend looking at your data, writing down what a single record on each side of the transform would look like, and then using that to fill out the control table on paper.

The exercise of designing a control table really opens your eyes to how data is moving in such transforms and exposes a lot of structure of data transforms. For example:

  • If the control table has two columns (one key column, one value column) then the operation could be implemented as a single tidyr gather() or spread().
  • If the control table has k rows then the rowrecs_to_blocks() direction could be implemented as k-1 rbind()s (a sketch of this for the iris example follows this list).
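
For instance, for the iris example above (k = 2), the same block records could be built "by hand" with a single rbind() (a sketch, not cdata code; row order may differ from the cdata result):

petal <- data.frame(Species = iris$Species, flower_part = "Petal",
                    Length = iris$Petal.Length, Width = iris$Petal.Width)
sepal <- data.frame(Species = iris$Species, flower_part = "Sepal",
                    Length = iris$Sepal.Length, Width = iris$Sepal.Width)
iris_manual <- rbind(petal, sepal)   # one rbind() = k - 1 binds for k = 2 control rows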

Some discussion of the nature of block records and row records in cdata can be found here.

Some additional tutorials on cdata data transforms are given below:


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Getting started with Stamen maps with ggmap

Thu, 10/25/2018 - 20:06

(This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers)

Spatial visualizations really come to life when you have a real map as a background. In R, ggmap is the package that you’ll want to use to get these maps. In what follows, we’ll demonstrate how to use ggmap with the Sacramento dataset in the caret package. For each city, we are going to keep track of the median price of houses in that city, the number of transactions in that city, as well as the mean latitude and longitude of the transactions.

library(caret)
library(dplyr)
library(ggmap)

data(Sacramento)
df <- Sacramento %>%
    group_by(city) %>%
    summarize(median_price = median(price),
              transactions = n(),
              latitude = mean(latitude),
              longitude = mean(longitude))

Now, I want to plot these cities on a map, with the color of the points representing the median price and the size representing the number of transactions. I could do this with ggplot2:

ggplot() +
    geom_point(data = df,
               mapping = aes(x = longitude, y = latitude,
                             col = median_price, size = transactions)) +
    scale_color_distiller(palette = "YlOrRd", direction = 1)

That plots the data that we want, but the visualization isn’t very informative. For it to be useful, I need to know which point represents which city. I could add text labels to each of these points…

ggplot(data = df, mapping = aes(x = longitude, y = latitude)) +
    geom_point(aes(col = median_price, size = transactions)) +
    geom_text(aes(label = city), size = 2, nudge_y = 0.01) +
    scale_color_distiller(palette = "YlOrRd", direction = 1)

… but it would clutter the visualization, and I still think that it does a poorer job of being informative than putting a map as background. Let’s use ggmap to do that!

Google maps

A quick note about Google maps: it seems that using Google maps as a background is quite a challenge now. According to the author of the ggmap package, to do so you need to provide an API key and enable billing. (See https://github.com/dkahle/ggmap for more details on how to work with these new restrictions.)
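
If you do want a Google basemap, the workflow looks roughly like this (a sketch; you need your own API key with billing enabled, and the exact requirements may change as ggmap evolves):

library(ggmap)
register_google(key = "YOUR_API_KEY")   # placeholder: supply your own key
google_map <- get_googlemap(center = c(lon = -121.35, lat = 38.6), zoom = 9)
ggmap(google_map)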

Stamen maps

Fortunately there is an easier way to get map backgrounds: Stamen maps. In the code below, I’m getting a base map of the Sacramento region based on the latitudes and longitudes I see in my dataset:

height <- max(Sacramento$latitude) - min(Sacramento$latitude)
width  <- max(Sacramento$longitude) - min(Sacramento$longitude)
sac_borders <- c(bottom = min(Sacramento$latitude)  - 0.1 * height,
                 top    = max(Sacramento$latitude)  + 0.1 * height,
                 left   = min(Sacramento$longitude) - 0.1 * width,
                 right  = max(Sacramento$longitude) + 0.1 * width)

map <- get_stamenmap(sac_borders, zoom = 10, maptype = "toner-lite")

I can then print it with the ggmap function:

ggmap(map)

The zoom parameter is set to 10 by default. Making it bigger gives a more detailed map, but takes longer to load.

Just like in ggplot2 graphics, we can add layers to the map drawn by ggmap(map) with the + syntax. For example, if we wanted to overlay the map with the data from our dataset, we could do the following:

ggmap(map) +
    geom_point(data = df,
               mapping = aes(x = longitude, y = latitude,
                             col = median_price, size = transactions)) +
    scale_color_distiller(palette = "YlOrRd", direction = 1)

Another way to get Stamen maps as a background is to use the qmplot function. Notice though that the syntax is pretty different from what we had with ggplot. Here is a map plotting the same points as above but with a different color scale and a different type of background map:

qmplot(x = longitude, y = latitude, data = df,
       maptype = "watercolor", geom = "point",
       color = median_price, size = transactions) +
    scale_color_gradient(low = "blue", high = "red")

The full list of Stamen map types is “terrain”, “terrain-background”, “terrain-labels”, “terrain-lines”, “toner”, “toner-2010”, “toner-2011”, “toner-background”, “toner-hybrid”, “toner-labels”, “toner-lines”, “toner-lite”, “watercolor”. Note that some of them don’t work with the CRAN version of ggmap (e.g. “terrain”); instead, you will have to download the github version of the package (see https://github.com/dkahle/ggmap/issues/188 for more details).

Below are examples of what “terrain” and “toner” look like:

qmplot(x = longitude, y = latitude, data = df,
       maptype = "terrain", geom = "point",
       color = median_price, size = transactions) +
    scale_color_gradient(low = "blue", high = "red")

qmplot(x = longitude, y = latitude, data = df,
       maptype = "toner", geom = "point",
       color = median_price, size = transactions) +
    scale_color_gradient(low = "blue", high = "red")


To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

How DataCamp Handles Course Quality

Thu, 10/25/2018 - 18:00

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

At DataCamp, we pride ourselves on having the best platform and the best curriculum for learning data science. To this end, we put a lot of effort into ensuring that every exercise is both effective at educating and enjoyable. After a course is launched, we don’t consider it to be complete: the launch is just the start of data collection. Whenever a student attempts an exercise in a course, we capture data points such as how many attempts it took them to solve the exercise, and whether or not they needed assistance by asking for the hint or solution. By aggregating this data over all our students, we can get a picture of how difficult an exercise is. Additionally, students can rate exercises from one star to five stars, and provide feedback to us, letting us know how well liked an exercise is.

Our Content Quality team works with the instructors to act upon this data to improve the courses. This takes a variety of forms since many things can go wrong with an exercise.

Sometimes, small things can upset a lot of students. In Data Science Toolbox (Part 1), students learn how to write Python lambda functions using a dataset based upon J.R.R. Tolkien’s Fellowship of the Ring, from the Lord of the Rings trilogy of books. Unfortunately, the dataset was missing Gandalf and Pippin. Our students quite rightly complained, so we listened and fixed the problem.

Analyzing the feedback over many exercises can reveal patterns in misconceptions that students have. Many data analyses keep their data in a rectangular form where each row is some record, and each column is a value belonging to that record. For example, each row could correspond to a person, and the columns could be their name, their height, and their favorite color.

A really common data manipulation practice is to filter the rows of that rectangular data. An instructor might write an instruction like:

Filter the dataset to remove rows where people are shorter than 170cm.

This is OK, but most statistical software, including R’s dplyr package and Python’s Pandas package, make you filter datasets by specifying the things that you want to keep. If the instruction is rewritten in the following way:

Filter the dataset to keep rows where people are taller than 170cm.

Then the instruction matches the way that the code works, and avoids a point of confusion for students. For advanced courses where students are expected to be familiar with data manipulation, this phrasing would be less of an issue. In our Pandas Foundations course, we spotted that many students were struggling with this problem, and changed the language.
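
A generic illustration of the "keep" phrasing in R (hypothetical people data frame with a height column in cm; not a DataCamp exercise):

library(dplyr)

people %>%
  filter(height > 170)   # keep rows where people are taller than 170 cm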

All DataCamp exercises are automatically graded using software built by our Content Engineering team. This allows instantaneous, intelligent feedback for students if they get an answer wrong. It’s perhaps the greatest feature of our platform. One of the hardest parts of this is anticipating what students will do wrong in order to give them good advice about what to do differently next time. That means that occasionally a correct solution gets marked as incorrect. When this happens, students really hate it. In one case, a student complained:

In our Cluster Analysis in R course, many students had spotted that although the suggested solution to an exercise used the well-known min() function to calculate a minimum value, there was a simpler and more elegant solution using the lesser-known parallel minimum function, pmin().
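
For readers unfamiliar with the distinction, here is a generic illustration (not the course exercise itself):

x <- c(3, 8, 2)
y <- c(5, 1, 4)
min(x, y)    # 1       : one overall minimum across all values
pmin(x, y)   # 3 1 2   : element-wise ("parallel") minimums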

Originally, the exercise only allowed one solution but based upon the students’ ingenious idea, both solutions are now accepted. One of the Content Engineering team’s goals in developing DataCamp’s feedback system is to improve the flexibility of the grading, allowing students to solve problems in their own way.

Another frustration that students can encounter with our automated feedback is seeing the same piece of feedback repeatedly when taking an exercise. The principle is shown in the diagram below, where a student is asked to create a Python array with three elements.

By providing more granular feedback, to address specific problems with a student’s answer, we can provide a more positive learning experience. The results of improved feedback are encouraging. In an exercise from our Intermediate Python for Data Science course, switching to granular feedback meant that the percentage of students seeing the same feedback message more than once dropped from around 65% to under 10%.

The percentage of users seeing the same feedback message more than once is a metric we monitor for all exercises and courses. In combination with the popularity of courses, we created a data-driven way to keep improving the learning experience in the most impactful way.

Our goal for the end of the year is that for the most popular half of all courses, the average of the repeated feedback percentage over all exercises is below 30% for each course.

We will continue to add these improvements across all the technologies students can learn on DataCamp.
Internally we also improve the tools we use to create the feedback and analyze student submissions to make sure we keep improving the learning experience for all current and future content.

There are many more things that can go wrong with an exercise, but I hope that this gives you some encouragement that DataCamp listens to student feedback, and is continually working to improve student satisfaction with our courses.


To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming.

Popular Halloween Candy on US State Grid Map

Thu, 10/25/2018 - 02:00

(This article was first published on R on Chi's Impe[r]fect Blog, and kindly contributed to R-bloggers)

Halloween is coming..!

Halloween is just around the corner, I am still trying to decide which candies to purchase this year for trick-or-treaters.

Initially I was looking for data sets comparing American chocolate bars vs Canadian chocolate bars, possibly with sugar content or lists of ingredients. I am really curious why there are big differences in candy between Canada and the US, but for now I couldn’t find such data; instead, I came across the data set below.

The data itself is a bit outdated, since it comes from candy sales in 2007-2015. I’m not even sure if these popular candies change every year or not. Now I’m curious to find data sets that are more recent…
CandyStore.com’s sales from 2007–2015—focusing on the three months leading up to Halloween

Since the data set contains state-level data for the US, I decided to practice plotting the data on a grid map, something I’ve been wanting to try. To colour the map, instead of colouring by volume of candy purchased (in pounds), I gathered population by state in order to figure out the ratio of candy per person.

Looks like Hawaiians eat lots of Hershey’s Kisses (or, in general, maybe there are more sweet tooths in Hawaii!). Utah, Nevada and Arizona also seem to either consume more sweets, or possibly the top-ranked chocolates there are purchased in volume!

It would’ve been fun to gather all candy images and place them in the grid too…!

Below are the steps I took to visualize the data.

Importing Data & Tidying Up

I imported the data into a data frame using the fromJSON function from the jsonlite package, pulling from the OpenDataSoft site. There seem to be lots of other interesting data sets there!

I also harvested population by state from Wikipedia using the rvest package.

library(tidyverse)  ## basically tidyverse is needed for everything :)
library(jsonlite)   ## to read data in json format
library(rvest)      ## to scrape website
library(hrbrthemes) ## I love this themes!
## read this: https://cran.r-project.org/web/packages/hrbrthemes/vignettes/why_hrbrthemes.html
library(geofacet)   ## I'll use the grid in this package to create
library(ggrepel)    ## so that texts on ggplot don't overlap!
library(janitor)    ## My recent favourite

## Download Candy Data from Opendata Soft site.
candy_data <- jsonlite::fromJSON("https://public.opendatasoft.com/api/records/1.0/search/?rows=51&start=0&fields=name,top_candy,top_candy_pounds,2nd_place,2nd_place_pounds,3rd_place,3rd_place_pounds&dataset=state-by-state-favorite-halloween-candy&timezone=America%2FLos_Angeles")

## I'll call this candy_wide, because data is in wide format.
candy_wide <- candy_data$records$fields
names(candy_wide) <- c("state","rank_3","pound_2","rank_2","pound_3","rank_1","pound_1")

## Might be interesting to see Volume per Capita, so get population data from Wiki.
state_pop_html <- read_html("https://simple.wikipedia.org/wiki/List_of_U.S._states_by_population") %>%
  html_table(2)
state_pop_html <- state_pop_html[[2]]

## I'm only really interested in population, so trim down the table bit.
state_pop <- state_pop_html %>% select(state=3, pop_2016=4)

## I want population to be numeric value instead of character
state_pop <- state_pop %>%
  mutate(pop_2016 = as.numeric(str_remove_all(pop_2016,",")))

Transforming Data from Wide to Long Format

### Data Transformation (Wide to Long)
## Data is in wide format, nicer to look at but I want to convert it to long format!
candy_rank_long <- candy_wide %>%
  select(state, rank_1, rank_2, rank_3) %>%
  gather(ranking, chocolate, -state) %>%
  mutate(ranking = as.integer(str_remove(ranking,"rank_")))

candy_pound_long <- candy_wide %>%
  select(state, pound_1, pound_2, pound_3) %>%
  gather(ranking, volume, -state) %>%
  mutate(ranking = as.integer(str_remove(ranking,"pound_")))

candy_long <- candy_rank_long %>%
  inner_join(candy_pound_long) %>%
  left_join(state_pop)

candy_long <- candy_long %>%
  mutate(volume_per_capita = volume/pop_2016)

## Just creating state table
state_df <- tibble(
  state = c(state.name,"District of Columbia"),
  state_div = c(as.character(state.division),"South Atlantic"),
  state_abb = c(state.abb, "DC")
)

candy_long <- candy_long %>% left_join(state_df)

Overall Popular Candies in United States

If a candy was ranked #1 in a state, I gave it a gold colour; if #2, then silver; if #3, then bronze. Sort of like the Olympics, I counted how many medals each candy got to decide the most popular candy in the US.

There were a total of 27 different candies in this data set; the most popular candy is M&M’s, followed by Skittles! I thought it was also interesting that Life Savers were the top-ranked candy in Delaware but did not appear in any other state, and similarly for Swedish Fish in Georgia.

There’s Assorted Salt Water Taffy and Salt Water Taffy… I wasn’t sure if they are the same candy…!

## using janitor package
candy_list <- candy_long %>%
  tabyl(chocolate, ranking) %>%
  adorn_totals("col") %>%
  arrange(-Total,-`1`,-`2`,-`3`)
## if candy was listed same # of times, then I want to make sure that better ranked candies are ranked higher.

#show_col(c("#A77044", "#A7A7AD","#D6AF36")) # bronze, silver, gold

candy_long %>%
  mutate(chocolate = fct_relevel(chocolate, candy_list$chocolate)) %>%
  count(chocolate, ranking, state, state_abb) %>%
  ggplot(aes(x=fct_rev(chocolate),y=n)) +
  geom_col(aes(fill=fct_rev((factor(ranking)))), colour="#ffffff30") +
  geom_text(aes(label=state_abb), position="stack", colour="#ffffff",
            size=5, hjust=1, family=font_rc, fontface="bold") +
  theme_ipsum_rc() +
  coord_flip() +
  scale_fill_manual(values=c("#A77044", "#A7A7AD","#D6AF36"),
                    name="ranking within state",
                    labels=c("bronze (3rd)", "silver (2nd)","gold (1st)")) +
  scale_y_comma() +
  labs(title="Halloween Candy Olympic! - which candy gets which ranking from which state?") +
  theme(legend.position="top", legend.direction = "horizontal") +
  guides(fill = guide_legend(reverse = TRUE))

Grid State Map

I recently discovered the geofacet package. The example graphs you can create with this package look super fun! I thought it was really neat that I can create my own grid using the “Geo Grid Designer”!!

#get_grid_names() --> you can see what grids are available.
## I can preview the grid first
#grid_design(us_state_grid1)

## Going to use US state map grid in geofacet package
head(us_state_grid1) %>% knitr::kable()

 row  col  code  name
   6    7  AL    Alabama
   7    2  AK    Alaska
   5    2  AZ    Arizona
   5    5  AR    Arkansas
   4    1  CA    California
   4    3  CO    Colorado

## Join Candy data & grid so I can create grid map
candy_long_with_coord <- candy_long %>%
  left_join(us_state_grid1, by=c("state_abb" = "code"))

## Replace every other space with new line
## some chocolate names are little too long to lists in one line...
space_to_newline = function(x) gsub("([^ ]+ [^ ]+) ","\\1\n", x)

candy_long_with_coord <- candy_long_with_coord %>%
  mutate(chocolate_label = map_chr(chocolate, space_to_newline),
         facet_label = case_when(ranking==1 ~ "Top Ranked Candies by State",
                                 ranking==2 ~ "2nd Popular Candies by States",
                                 ranking==3 ~ "3rd Popular Candies by States"))

candy_long_with_coord %>%
  #filter(ranking==1) %>%
  ggplot(aes(x=col, y=row)) +
  geom_tile(color="#ffffff90", aes(fill=volume_per_capita)) +
  #geom_text(aes(label=state), size=3.5, color="#ffffff20", vjust=4, family=font_rc_light) +
  geom_text(aes(label=state_abb),size=5, family=font_rc, vjust=-1, color="#ffffff90") +
  geom_text(aes(label=chocolate_label), family=font_rc, size=3.5, lineheight=0.8,
            color="#ffffff", vjust=1, fontface="bold") +
  scale_y_reverse(breaks=NULL) +
  scale_x_continuous(breaks=NULL) +
  theme_ipsum_rc(base_family="Roboto Condensed") +
  scale_fill_viridis_c(option="magma", alpha=0.8, end=0.7, name="",
                       breaks=fivenum(candy_long$volume_per_capita),
                       labels=c("","","<< Less Candy per Person","","More Candy per Person >>")) +
  facet_wrap(~fct_reorder(facet_label, ranking), ncol=1) +
  theme(legend.position = "top",
        legend.key.width = unit(2, "cm"),
        legend.justification = "left") +
  labs(title="Top Ranked Candies by States",
       caption="Data Source: CandyStore.com's sales from 2007–2015—focusing on the three months leading up to Halloween",
       x="", y="") +
  coord_fixed(ratio=0.9)


To leave a comment for the author, please follow the link and comment on their blog: R on Chi's Impe[r]fect Blog.

Drawing beautiful maps programmatically with R, sf and ggplot2 — Part 3: Layouts

Thu, 10/25/2018 - 02:00

(This article was first published on r-spatial, and kindly contributed to R-bloggers)


This tutorial is the third part in a series of three:

After the presentation of the basic map concepts, and the flexible layering approach implemented in ggplot2, this part illustrates how to achieve complex layouts, for instance with map insets, or several maps combined. Depending on the visual information that needs to be displayed, maps and their corresponding data might need to be arranged to create easy-to-read graphical representations. This tutorial provides different approaches to arranging maps in the plot, in order to make the information portrayed more aesthetically appealing, and most importantly, to convey the information better.

Getting started

Many R packages are available from CRAN,
the Comprehensive R Archive Network, which is the primary repository of
R packages. The full list of packages necessary for this series of
tutorials can be installed with:

install.packages(c("cowplot", "googleway", "ggplot2", "ggrepel", "ggspatial", "libwgeom", "sf", "rworldmap", "rworldxtra"))

We start by loading the basic packages necessary for all maps, i.e. ggplot2 and sf. We also suggest using the classic dark-on-light theme for ggplot2 (theme_bw), which is more appropriate for maps:

library("ggplot2") theme_set(theme_bw()) library("sf")

The package rworldmap provides a map of countries of the entire world;
a map with higher resolution is available in the package rworldxtra.
We use the function getMap to extract the world map (the resolution
can be set to "low", if preferred):

library("rworldmap") library("rworldxtra") world <- getMap(resolution = "high") class(world) ## [1] "SpatialPolygonsDataFrame" ## attr(,"package") ## [1] "sp"

The world map is available as a SpatialPolygonsDataFrame from the
package sp; we thus convert it to a simple feature using st_as_sf
from package sf:

world <- st_as_sf(world)
class(world)
## [1] "sf"         "data.frame"

General concepts

There are 2 solutions to combine sub-maps:

  • using Grobs (graphic objects, allow plots only in plot region, based
    on coordinates), which directly use ggplot2
  • using ggdraw (allows plots anywhere, including outer margins,
    based on relative position) from package cowplot

Example illustrating the difference between the two, and their use:

(g1 <- qplot(0:10, 0:10))

(g1_void <- g1 + theme_void() + theme(panel.border = element_rect(colour = "black", fill = NA)))

Graphs from ggplot2 can be saved, like any other R object. They can
then be reused later in other functions.

Using grobs, and annotation_custom:

g1 +
    annotation_custom(
        grob = ggplotGrob(g1_void),
        xmin = 0, xmax = 3,
        ymin = 5, ymax = 10
    ) +
    annotation_custom(
        grob = ggplotGrob(g1_void),
        xmin = 5, xmax = 10,
        ymin = 0, ymax = 3
    )

Using ggdraw (note: used to build on top of initial plot; could be
left empty to arrange subplots on a grid; plots are “filled” with their
plots, unless the plot itself has a constrained ratio, like a map):

ggdraw(g1) +
    draw_plot(g1_void, width = 0.25, height = 0.5, x = 0.02, y = 0.48) +
    draw_plot(g1_void, width = 0.5, height = 0.25, x = 0.75, y = 0.09)

Several maps side by side or on a grid

Having a way to show a specific area within a visualization can be very useful. Many scientists usually create maps for each specific area individually. This is fine, but there are simpler ways to display what is needed for a report or publication.

This example uses two maps side by side, including the legend of the first one. It illustrates how to use a custom grid, which can be made a lot more complex with different elements.

First, simplify REGION for the legend:

levels(world$REGION)[7] <- "South America"

Prepare the subplots, #1 world map:

(gworld <- ggplot(data = world) +
    geom_sf(aes(fill = REGION)) +
    geom_rect(xmin = -102.15, xmax = -74.12, ymin = 7.65, ymax = 33.97,
              fill = NA, colour = "black", size = 1.5) +
    scale_fill_viridis_d(option = "plasma") +
    theme(panel.background = element_rect(fill = "azure"),
          panel.border = element_rect(fill = NA)))

And #2 the Gulf map:

(ggulf <- ggplot(data = world) +
    geom_sf(aes(fill = REGION)) +
    annotate(geom = "text", x = -90, y = 26, label = "Gulf of Mexico",
             fontface = "italic", color = "grey22", size = 6) +
    coord_sf(xlim = c(-102.15, -74.12), ylim = c(7.65, 33.97), expand = FALSE) +
    scale_fill_viridis_d(option = "plasma") +
    theme(legend.position = "none",
          axis.title.x = element_blank(),
          axis.title.y = element_blank(),
          panel.background = element_rect(fill = "azure"),
          panel.border = element_rect(fill = NA)))

The command ggplotGrob signals to ggplot to take each created map and convert it into a graphic object so the maps can be arranged. The argument coord_equal can specify the height (ylim) and width (xlim) of the entire plotting area, whereas in annotation_custom, each map’s xmin, xmax, ymin, and ymax can be specified to allow for complete customization.

# Creating a faux empty data frame
df <- data.frame()
plot1 <- ggplot(df) + geom_point() + xlim(0, 10) + ylim(0, 10)
plot2 <- ggplot(df) + geom_point() + xlim(0, 10) + ylim(0, 10)

ggplot() +
    coord_equal(xlim = c(0, 3.3), ylim = c(0, 1), expand = FALSE) +
    annotation_custom(ggplotGrob(plot1), xmin = 0, xmax = 1.5, ymin = 0, ymax = 1) +
    annotation_custom(ggplotGrob(plot2), xmin = 1.5, xmax = 3, ymin = 0, ymax = 1) +
    theme_void()

Below is the final map, using the same methodology as the example plot above. Using ggplot to arrange maps allows for easy and quick plotting in one chunk of R code.

ggplot() +
    coord_equal(xlim = c(0, 3.3), ylim = c(0, 1), expand = FALSE) +
    annotation_custom(ggplotGrob(gworld), xmin = 0, xmax = 2.3, ymin = 0, ymax = 1) +
    annotation_custom(ggplotGrob(ggulf), xmin = 2.3, xmax = 3.3, ymin = 0, ymax = 1) +
    theme_void()

The second approach, using cowplot::plot_grid to arrange ggplot figures, is quite versatile. Any ggplot figure can be arranged just like the figure above. There are many commands that allow for the maps to have different placements, such as nrow = 1, which means the figure will only occupy one row and multiple columns, and ncol = 1, which means the figure will be plotted on one column and multiple rows. The command rel_widths establishes the width of each map, meaning that the first map gworld will have a relative width of 2.3, and the map ggulf has a relative width of 1.

library("cowplot") theme_set(theme_bw()) plot_grid(gworld, ggulf, nrow = 1, rel_widths = c(2.3, 1))

Some other commands can adjust the position of the figures, such as adding align = "v" to align vertically and align = "h" to align horizontally.

Note also the existence of get_legend (cowplot), and that the legend can be used like any other object.
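
For example, the shared legend could be extracted and placed as its own grid cell (a sketch building on the gworld and ggulf maps above):

legend <- cowplot::get_legend(gworld)
plot_grid(gworld + theme(legend.position = "none"),
          ggulf,
          legend,
          nrow = 1, rel_widths = c(2.3, 1, 0.5))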

This map can be saved using ggsave:

ggsave("grid.pdf", width = 15, height = 5) Map insets

For map insets directly on the background map, both solutions are viable
(and one might prefer one or the other depending on relative or absolute
coordinates).

Map example using a map of the 50 states of the US, including Alaska and Hawaii (note: not to scale for the latter), using reference projections for US maps. The first map (continental states) uses a 10/6 figure:

usa <- subset(world, ADMIN == "United States of America")

## US National Atlas Equal Area (2163)
## http://spatialreference.org/ref/epsg/us-national-atlas-equal-area/
(mainland <- ggplot(data = usa) +
    geom_sf(fill = "cornsilk") +
    coord_sf(crs = st_crs(2163), xlim = c(-2500000, 2500000),
             ylim = c(-2300000, 730000)))

Alaska map (note: datum = NA removes graticules and coordinates):

## Alaska: NAD83(NSRS2007) / Alaska Albers (3467)
## http://www.spatialreference.org/ref/epsg/3467/
(alaska <- ggplot(data = usa) +
    geom_sf(fill = "cornsilk") +
    coord_sf(crs = st_crs(3467), xlim = c(-2400000, 1600000),
             ylim = c(200000, 2500000), expand = FALSE, datum = NA))

Hawaii map:

## Hawaii: Old Hawaiian (4135)
## http://www.spatialreference.org/ref/epsg/4135/
(hawaii <- ggplot(data = usa) +
    geom_sf(fill = "cornsilk") +
    coord_sf(crs = st_crs(4135), xlim = c(-161, -154),
             ylim = c(18, 23), expand = FALSE, datum = NA))

Using ggdraw from cowplot (tricky to define exact positions; note
the use of the ratios of the inset, combined with the ratio of the
plot):

(ratioAlaska <- (2500000 - 200000) / (1600000 - (-2400000)))
## [1] 0.575

(ratioHawaii <- (23 - 18) / (-154 - (-161)))
## [1] 0.7142857

ggdraw(mainland) +
    draw_plot(alaska, width = 0.26, height = 0.26 * 10/6 * ratioAlaska,
              x = 0.05, y = 0.05) +
    draw_plot(hawaii, width = 0.15, height = 0.15 * 10/6 * ratioHawaii,
              x = 0.3, y = 0.05)

This plot can be saved using ggsave:

ggsave("map-us-ggdraw.pdf", width = 10, height = 6)

The same kind of plot can be created using grobs, with ggplotGrob,
(note the use of xdiff/ydiff and arbitrary ratios):

mainland +
    annotation_custom(
        grob = ggplotGrob(alaska),
        xmin = -2750000,
        xmax = -2750000 + (1600000 - (-2400000))/2.5,
        ymin = -2450000,
        ymax = -2450000 + (2500000 - 200000)/2.5
    ) +
    annotation_custom(
        grob = ggplotGrob(hawaii),
        xmin = -1250000,
        xmax = -1250000 + (-154 - (-161))*120000,
        ymin = -2450000,
        ymax = -2450000 + (23 - 18)*120000
    )

This plot can be saved using ggsave:

ggsave("map-inset-grobs.pdf", width = 10, height = 6)

The print command can also be used to place multiple maps in one plotting area.

To specify where each plot is displayed with the print function, the argument viewport needs to include the maximum width and height of each map, and the minimum x and y coordinates of where the maps are located in the plotting area. The argument just sets how the secondary maps are positioned relative to those coordinates. All maps default to the same size, until the sizes are adjusted with width and height.

library("grid")  ## viewport() comes from the grid package
vp <- viewport(width = 0.37, height = 0.10, x = 0.20, y = 0.25,
               just = c("bottom"))
vp1 <- viewport(width = 0.37, height = 0.10, x = 0.35, y = 0.25,
                just = c("bottom"))

The print function uses the previous specifications that were listed in each plot’s respective viewport, with vp =.

print(mainland)
print(alaska, vp = vp)
print(hawaii, vp = vp1)

Several maps connected with arrows

To bring about a more lively map arrangement, arrows can be used to draw the viewer’s eyes to specific areas in the plot. The next example will create a map with zoomed-in areas, pointed to by arrows.

Firstly, we will create our main map, and then our zoomed in areas.

Site coordinates, same as Tutorial #1:

sites <- st_as_sf(data.frame(longitude = c(-80.15, -80.1),
                             latitude = c(26.5, 26.8)),
                  coords = c("longitude", "latitude"),
                  crs = 4326, agr = "constant")

Mainland map of Florida, #1:

(florida <- ggplot(data = world) +
    geom_sf(fill = "antiquewhite1") +
    geom_sf(data = sites, size = 4, shape = 23, fill = "darkred") +
    annotate(geom = "text", x = -85.5, y = 27.5, label = "Gulf of Mexico",
             color = "grey22", size = 4.5) +
    coord_sf(xlim = c(-87.35, -79.5), ylim = c(24.1, 30.8)) +
    xlab("Longitude") + ylab("Latitude") +
    theme(panel.grid.major = element_line(colour = gray(0.5),
                                          linetype = "dashed", size = 0.5),
          panel.background = element_rect(fill = "aliceblue"),
          panel.border = element_rect(fill = NA)))

A map for site A is created by layering the map and points we created earlier. ggplot layers geom_sf objects and plots them spatially.

(siteA <- ggplot(data = world) + geom_sf(fill = "antiquewhite1") + geom_sf(data = sites, size = 4, shape = 23, fill = "darkred") + coord_sf(xlim = c(-80.25, -79.95), ylim = c(26.65, 26.95), expand = FALSE) + annotate("text", x = -80.18, y = 26.92, label= "Site A", size = 6) + theme_void() + theme(panel.grid.major = element_line(colour = gray(0.5), linetype = "dashed", size = 0.5), panel.background = element_rect(fill = "aliceblue"), panel.border = element_rect(fill = NA)))

A map for site B:

(siteB <- ggplot(data = world) + geom_sf(fill = "antiquewhite1") + geom_sf(data = sites, size = 4, shape = 23, fill = "darkred") + coord_sf(xlim = c(-80.3, -80), ylim = c(26.35, 26.65), expand = FALSE) + annotate("text", x = -80.23, y = 26.62, label= "Site B", size = 6) + theme_void() + theme(panel.grid.major = element_line(colour = gray(0.5), linetype = "dashed", size = 0.5), panel.background = element_rect(fill = "aliceblue"), panel.border = element_rect(fill = NA)))

Coordinates of the two arrows need to be specified before plotting.
The arguments x1 and x2 give the start and end of each arrow line on
the x-axis, and y1 and y2 do the same on the y-axis:

arrowA <- data.frame(x1 = 18.5, x2 = 23, y1 = 9.5, y2 = 14.5)
arrowB <- data.frame(x1 = 18.5, x2 = 23, y1 = 8.5, y2 = 6.5)

Final map (using ggplot only). geom_segment uses the coordinates
created in the previous chunk to plot line segments ending with an
arrow, via arrow = arrow():

ggplot() + coord_equal(xlim = c(0, 28), ylim = c(0, 20), expand = FALSE) + annotation_custom(ggplotGrob(florida), xmin = 0, xmax = 20, ymin = 0, ymax = 20) + annotation_custom(ggplotGrob(siteA), xmin = 20, xmax = 28, ymin = 11.25, ymax = 19) + annotation_custom(ggplotGrob(siteB), xmin = 20, xmax = 28, ymin = 2.5, ymax = 10.25) + geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2), data = arrowA, arrow = arrow(), lineend = "round") + geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2), data = arrowB, arrow = arrow(), lineend = "round") + theme_void()

This plot can be saved using ggsave:

ggsave("florida-sites.pdf", width = 10, height = 7)

ggdraw could also be used for a similar result, with the function
draw_plot:

ggdraw(xlim = c(0, 28), ylim = c(0, 20)) + draw_plot(florida, x = 0, y = 0, width = 20, height = 20) + draw_plot(siteA, x = 20, y = 11.25, width = 8, height = 8) + draw_plot(siteB, x = 20, y = 2.5, width = 8, height = 8) + geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2), data = arrowA, arrow = arrow(), lineend = "round") + geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2), data = arrowB, arrow = arrow(), lineend = "round")


Drawing beautiful maps programmatically with R, sf and ggplot2 — Part 1: Basics

Thu, 10/25/2018 - 02:00

(This article was first published on r-spatial, and kindly contributed to R-bloggers)


Beautiful maps in a beautiful world

Maps are used in a variety of fields to express data in an appealing and
interpretive way. Data can be expressed as simplified patterns, and
this interpretation is generally lost if the data is only seen through a
spreadsheet. Maps can add vital context by incorporating many
variables into an easy-to-read and applicable form. Maps are also
very important in the information world because they can quickly allow
the public to gain better insight and stay informed. It's critical for
maps to be effective, which means creating maps that can be easily
understood by a given audience. For instance, maps that need to be
understood by children would be very different from maps intended to be
shown to geographers.

Knowing what elements are required to enhance your data is key to
making effective maps. Basic elements of a map that should be considered
are polygons, points, lines, and text. Polygons, on a map, are closed
shapes such as country borders. Lines are linear shapes that are not
filled, such as highways, streams, or roads. Finally, points are used to
mark specific positions, such as city or landmark locations. With that
in mind, one needs to think about which elements are required in the map
to really make an impact, and convey the information for the intended
audience. Layout and formatting are the second critical aspect to
enhance data visually. The arrangement of these map elements and how
they are drawn can be adjusted for maximum impact.

A solution using R and its ecosystem of packages

Current solutions for creating maps usually involve GIS software, such
as ArcGIS, QGIS, eSpatial, etc., which allow the user to visually prepare
a map, in the same way one would prepare a poster or a document layout.
On the other hand, R, a free and open-source software environment used
for statistical computing and graphics, has developed advanced spatial
capabilities over the years, and can be used to draw maps
programmatically.

R is a powerful and flexible tool. R can be used for everything from
calculating on data sets to creating graphs and maps with the same data
set. R is also free, which makes it easily accessible to anyone. Other
advantages of using R are its interactive language, data structures,
graphics availability, a developed community, and the possibility of
adding more functionality through an entire ecosystem of packages. R
is a scriptable language that allows the user to write out code that
executes the specified commands.

Using R to create maps brings these benefits to mapping. Elements of a
map can be added or removed with ease — R code can be tweaked to make
major enhancements with a stroke of a key. It is also easy to reproduce
the same maps for different data sets. It is important to be able to
script the elements of a map, so that they can be re-used and interpreted
by any user. In essence, comparing typical GIS software and R for
drawing maps is similar to comparing word processing software (e.g.
Microsoft Office or LibreOffice) and a programmatic typesetting system
such as LaTeX, in that typical GIS software implements a WYSIWYG approach
("What You See Is What You Get"), while R implements a WYSIWYM approach
("What You See Is What You Mean").

The package ggplot2 implements the grammar of graphics in R, as a way
to create code that makes sense to the user: the grammar of graphics is a
term used to break up graphs into semantic components, such as
geometries and layers. Practically speaking, it allows (and forces!) the
user to focus on graph elements at a higher level of abstraction, and on
how the data must be structured to achieve the expected outcome. While
ggplot2 is becoming the de facto standard for R graphs, it does not
handle spatial data specifically. The current state of the art of
spatial objects in R relies on Spatial classes defined in the package
sp, but the new package
sf has recently implemented
the "simple feature" standard, and is steadily taking over sp.
Recently, the package ggplot2 has allowed the use of simple features
from the package sf as layers in a graph1. The combination of
ggplot2 and sf therefore enables one to programmatically create maps,
using the grammar of graphics, that are just as informative or visually
appealing as those made with traditional GIS software.

Drawing beautiful maps programmatically with R, sf and ggplot2

This tutorial is the first part in a series of three.

In this part, we will cover the fundamentals of mapping using ggplot2
associated with sf, and present the basic elements and parameters we
can play with to prepare a map.

Getting started

Many R packages are available from CRAN,
the Comprehensive R Archive Network, which is the primary repository of
R packages. The full list of packages necessary for this series of
tutorials can be installed with:

install.packages(c("cowplot", "googleway", "ggplot2", "ggrepel", "ggspatial", "libwgeom", "sf", "rworldmap", "rworldxtra"))

We start by loading the basic packages necessary for all maps, i.e.
ggplot2 and sf. We also suggest to use the classic dark-on-light
theme for ggplot2 (theme_bw), which is appropriate for maps:

library("ggplot2") theme_set(theme_bw()) library("sf")

The package rworldmap provides a map of countries of the entire world;
a map with higher resolution is available in the package rworldxtra.
We use the function getMap to extract the world map (the resolution
can be set to "low", if preferred):

library("rworldmap") library("rworldxtra") world <- getMap(resolution = "high") class(world) ## [1] "SpatialPolygonsDataFrame" ## attr(,"package") ## [1] "sp"

The world map is available as a SpatialPolygonsDataFrame from the
package sp; we thus convert it to a simple feature using st_as_sf
from package sf:

world <- st_as_sf(world)
class(world)
## [1] "sf"         "data.frame"

General concepts illustrated with the world map

Data and basic plot (ggplot and geom_sf)

First, let us start with creating a base map of the world using
ggplot2. This base map will then be extended with different map
elements, as well as zoomed in to an area of interest. We can check that
the world map was properly retrieved and converted into an sf object,
and plot it with ggplot2:

ggplot(data = world) + geom_sf()

This call nicely introduces the structure of a ggplot call: the first
part, ggplot(data = world), initiates the ggplot graph and indicates
that the main data is stored in the world object. The line ends
with a + sign, which indicates that the call is not complete yet, and
each subsequent line corresponds to another layer or scale. In this case,
we use the geom_sf function, which simply adds a geometry stored in an
sf object. By default, all geometry functions use the main data
defined in ggplot(), but we will see later how to provide additional
data.

Note that layers are added one at a time in a ggplot call, so the
order of each layer is very important. All data will have to be in an
sf format to be used by ggplot2; data in other formats (e.g. classes
from sp) will need to be manually converted to sf classes if necessary.

Title, subtitle, and axis labels (ggtitle, xlab, ylab)

A title and a subtitle can be added to the map using the function
ggtitle, passing any valid character string (e.g. with quotation
marks) as arguments. Axis names are absent by default on a map, but can
be changed to something more suitable (e.g. “Longitude” and “Latitude”),
depending on the map:

ggplot(data = world) + geom_sf() + xlab("Longitude") + ylab("Latitude") + ggtitle("World map", subtitle = paste0("(", length(unique(world$NAME)), " countries)"))

Map color (geom_sf)

In many ways, sf geometries are no different than regular geometries,
and can be displayed with the same level of control on their attributes.
Here is an example with the polygons of the countries filled with a
green color (argument fill), using black for the outline of the
countries (argument color):

ggplot(data = world) + geom_sf(color = "black", fill = "lightgreen")

The package ggplot2 allows the use of more complex color schemes, such
as a gradient on one variable of the data. Here is another example that
shows the population of each country. In this example, we use the
“viridis” colorblind-friendly palette for the color gradient (with
option = "plasma" for the plasma variant), using the square root of
the population (which is stored in the variable POP_EST of the world
object):

ggplot(data = world) + geom_sf(aes(fill = POP_EST)) + scale_fill_viridis_c(option = "plasma", trans = "sqrt")

Projection and extent (coord_sf)

The function coord_sf allows us to deal with the coordinate system, which
includes both the projection and the extent of the map. By default, the
map will use the coordinate system of the first layer that defines one
(i.e. scanned in the order provided), or, if none, fall back on WGS84
(latitude/longitude, the reference system used in GPS). Using the
argument crs, it is possible to override this setting, and project on
the fly to any projection. This can be achieved using any valid PROJ4
string (here, the European-centric ETRS89 Lambert Azimuthal Equal-Area
projection):

ggplot(data = world) + geom_sf() + coord_sf(crs = "+proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +units=m +no_defs ")

If a Spatial Reference System Identifier (SRID) or European Petroleum
Survey Group (EPSG) code is available for the projection of interest,
it can be used directly instead of the full PROJ4 string. The two
following calls are equivalent for the ETRS89 Lambert Azimuthal
Equal-Area projection, which is EPSG code 3035:

ggplot(data = world) + geom_sf() + coord_sf(crs = "+init=epsg:3035") ggplot(data = world) + geom_sf() + coord_sf(crs = st_crs(3035))

The extent of the map can also be set in coord_sf, in practice
allowing us to "zoom" in on the area of interest, provided by limits on
the x-axis (xlim) and on the y-axis (ylim). Note that the limits are
automatically expanded by a fraction to ensure that data and axes don't
overlap; this can also be turned off to exactly match the limits provided
with expand = FALSE:

ggplot(data = world) + geom_sf() + coord_sf(xlim = c(-102.15, -74.12), ylim = c(7.65, 33.97), expand = FALSE)

Scale bar and North arrow (package ggspatial)

Several packages are available to create a scale bar on a map (e.g.
prettymapr, vcd, ggsn, or legendMap). We introduce here the
package ggspatial, which provides easy-to-use functions…

An alternative is the scale_bar function from the legendMap package
listed above, which adds both the north symbol and a scale bar to the
ggplot map. Five arguments need to be set manually: lon, lat,
distance_lon, distance_lat, and distance_legend. The location of the
scale bar has to be specified in longitude/latitude via the lon and lat
arguments. The shaded distance inside the scale bar is controlled by the
distance_lon argument, while its width is determined by distance_lat.
Additionally, it is possible to change the font size for the legend of
the scale bar (argument legend_size, which defaults to 3). The North
arrow behind the "N" north symbol can also be adjusted for its length
(arrow_length), its distance to the scale (arrow_distance), or the size
of the N north symbol itself (arrow_north_size, which defaults to 6).
Note that all distances (distance_lon, distance_lat, distance_legend,
arrow_length, arrow_distance) are set to "km" by default in
distance_unit; they can also be set to nautical miles with "nm", or
miles with "mi".

library("ggspatial") ggplot(data = world) + geom_sf() + annotation_scale(location = "bl", width_hint = 0.5) + annotation_north_arrow(location = "bl", which_north = "true", pad_x = unit(0.75, "in"), pad_y = unit(0.5, "in"), style = north_arrow_fancy_orienteering) + coord_sf(xlim = c(-102.15, -74.12), ylim = c(7.65, 33.97)) ## Scale on map varies by more than 10%, scale bar may be inaccurate

Note the warning about the inaccurate scale bar: since the map uses
unprojected data in longitude/latitude (WGS84) on an equidistant
cylindrical projection (all meridians being parallel), length in
(kilo)meters on the map depends mathematically on the degree of
latitude. Plots of small regions or projected data will often allow for
more accurate scale bars.

Country names and other names (geom_text and annotate)

The world data set already contains country names and the coordinates
of the centroid of each country (among other information). We can use
this information to plot country names, using world as a regular
data.frame in ggplot2. We first check the country name information:

head(world[, c("NAME", "LON", "LAT")]) ## Simple feature collection with 6 features and 3 fields ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: -70.06164 ymin: -18.0314 xmax: 74.89231 ymax: 60.48075 ## epsg (SRID): 4326 ## proj4string: +proj=longlat +datum=WGS84 +no_defs ## NAME LON LAT geometry ## 1 Aruba -69.97345 12.51678 MULTIPOLYGON (((-69.87609 1... ## 2 Afghanistan 66.00845 33.83627 MULTIPOLYGON (((71.02458 38... ## 3 Angola 17.56405 -12.32934 MULTIPOLYGON (((13.98233 -5... ## 4 Anguilla -63.05667 18.22432 MULTIPOLYGON (((-63.0369 18... ## 5 Albania 20.05399 41.14258 MULTIPOLYGON (((20.06496 42... ## 6 Aland 19.94429 60.22851 MULTIPOLYGON (((19.91892 60...

The function geom_text can be used to add a layer of text to a map
using geographic coordinates. The data needed for the country names is
the same as for the world map, so we simply reuse world. Again, we have
very flexible control to adjust the text at will on many aspects:

  • The size (argument size);
  • The alignment, which is centered by default on the coordinates
    provided. The text can be adjusted horizontally or vertically using
    the arguments hjust and vjust, which can either be a number
    between 0 (right/bottom) and 1 (top/left) or a character (“left”,
    “middle”, “right”, “bottom”, “center”, “top”). The text can also be
    offset horizontally or vertically with the argument nudge_x and
    nudge_y;
  • The font of the text, for instance its color (argument color) or
    the type of font (fontface);
  • The overlap of labels, using the argument check_overlap, which
    removes overlapping text. Alternatively, when there are a lot of
    overlapping labels, the package ggrepel provides a
    geom_text_repel function that moves labels around so that they do
    not overlap (see the sketch after this list).
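As a quick illustration of that last point, here is a hedged sketch (not
part of the original tutorial) that swaps geom_text for geom_text_repel,
reusing the world object and its LON, LAT and NAME columns from above:

library("ggrepel")
ggplot(data = world) +
    geom_sf() +
    # geom_text_repel() is used like geom_text(), but nudges overlapping labels apart
    geom_text_repel(aes(LON, LAT, label = NAME), size = 3, color = "darkblue") +
    coord_sf(xlim = c(-102.15, -74.12), ylim = c(7.65, 33.97), expand = FALSE)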

Additionally, the annotate function can be used to add a single
character string at a specific location, as demonstrated here to add the
Gulf of Mexico:

ggplot(data = world) + geom_sf() + geom_text(aes(LON, LAT, label = NAME), size = 4, hjust = "left", color = "darkblue", fontface = "bold", check_overlap = TRUE) + annotate(geom = "text", x = -90, y = 26, label = "Gulf of Mexico", fontface = "italic", color = "grey22", size = 6) + coord_sf(xlim = c(-102.15, -74.12), ylim = c(7.65, 33.97), expand = FALSE)

Final map

Now, to make the final touches, the theme of the map can be edited to
make it more appealing. We suggested the use of theme_bw for a
standard theme, but there are many other themes to select
from (see for instance ?ggtheme in ggplot2, or the package
ggthemes, which provides
several useful themes). Moreover, specific theme elements can be tweaked
to get to the final outcome:

  • Position of the legend: Although not used in this example, the
    argument legend.position allows to automatically place the legend
    at a specific location (e.g. "topright", "bottomleft", etc.);
  • Grid lines (graticules) on the map: by using panel.grid.major and
    panel.grid.minor, grid lines can be adjusted. Here we set them to
    a gray color and dashed line type to clearly distinguish them from
    country border lines;
  • Map background: the argument panel.background can be used to color
    the background, which is the ocean essentially, with a light blue;
  • Many more elements of a theme can be adjusted, which would be too
    long to cover here. We refer the reader to the documentation for the
    function theme.

ggplot(data = world) + geom_sf(fill = "antiquewhite1") + geom_text(aes(LON, LAT, label = NAME), size = 4, hjust = "left", color = "darkblue", fontface = "bold", check_overlap = TRUE) + annotate(geom = "text", x = -90, y = 26, label = "Gulf of Mexico", fontface = "italic", color = "grey22", size = 6) + annotation_scale(location = "bl", width_hint = 0.5) + annotation_north_arrow(location = "bl", which_north = "true", pad_x = unit(0.75, "in"), pad_y = unit(0.5, "in"), style = north_arrow_fancy_orienteering) + coord_sf(xlim = c(-102.15, -74.12), ylim = c(7.65, 33.97), expand = FALSE) + xlab("Longitude") + ylab("Latitude") + ggtitle("Map of the Gulf of Mexico and the Caribbean Sea") + theme(panel.grid.major = element_line(color = gray(.5), linetype = "dashed", size = 0.5), panel.background = element_rect(fill = "aliceblue"))

Saving the map with ggsave

With the final map now ready, it is very easy to save it using ggsave. This
function allows a graphic (typically the last plot displayed) to be
saved in a variety of formats, including the most common PNG (raster
bitmap) and PDF (vector graphics), with control over the size and
resolution of the outcome. For instance, here we save a PDF version of
the map, which keeps the best quality, and a PNG version of it for web
purposes:

ggsave("map.pdf") ggsave("map_web.png", width = 6, height = 6, dpi = "screen")
  1. Note: Support of sf objects is available since version 3.0.0 of
    ggplot2, recently released on CRAN. 


When the numbers don’t tell the whole story

Wed, 10/24/2018 - 23:19

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Anscombe's Quartet is a famous collection of four small data sets — just 11 (x,y) pairs each — that was developed in the 1970s to emphasize the fact that sometimes, numerical summaries of data aren't enough. (For a modern take on this idea, see also the Datasaurus Dozen.) In this case, it takes visualizing the data to realize that the four data sets are qualitatively very different, even though the means, variances, and regression coefficients are all the same. In the video below for Guy in a Cube, Buck Woody uses R to summarize the data (which is conveniently built into R) and visualize it using an R script in Power BI.
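For readers who want to check the claim themselves, the quartet ships with base R as the anscombe data frame; a minimal sketch (not taken from the video) of the matching summaries:

## Columns x1..x4 and y1..y4 hold the four (x, y) sets
sapply(anscombe, mean)
sapply(anscombe, var)
## The fitted regression lines are (nearly) identical too, e.g. for sets 1 and 4
coef(lm(y1 ~ x1, data = anscombe))
coef(lm(y4 ~ x4, data = anscombe))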

The video also makes for a nice introduction to R for those new to statistics, and also demonstrates using R code to generate graphics in Power BI.


Join us at the EARL US Roadshow – a conference dedicated to the real-world usage of R

Wed, 10/24/2018 - 15:35

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

Join us at the EARL US Roadshow – a conference dedicated to the real-world usage of R

EARL, the Enterprise Applications of the R Language Conference, is set to embark on a US roadshow following a successful London conference in September, with dates in Seattle, Houston and Boston between 7th and 13th November.

This cross-sector conference, hosted and organised by Mango Solutions, focuses on the commercial use of the R programming language and showcases some of the world’s leading practitioners, alongside fostering networking and relationships within the R community. Much like the set-up in the UK, this stateside version of EARL promises actionable insights from day 1 – anything from coding and wrangling data to leading a team of R users to make data-driven decisions.

Data Scientists Julia Silge, Stack Overflow, Hadley Wickham, RStudio, Dr Robert Gentleman, 23andMe and Bob Rudis, Rapid 7 are headlining the US conferences with inspirational keynote talks this year. Amongst the topics are use cases for R packages as a collaborative team tool, machine learning and Shiny applications in addition to stimulating talks from a huge range of industries such as Starbucks, Google, IBM, Johnson & Johnson and Bupa.

Now in its 5th year, EARL sets out to inspire R users and discuss how they are solving similar business problems. “It offers a safe place to come and impart knowledge in a real positive manner,” says Matt Aldridge, CEO of Mango Solutions. Liz Mathews, Head of Community & Education at Mango Solutions, has been organising EARL from the beginning and loves the cross-pollination of ideas amongst the R community, alongside the networking and collaboration which she feels is a key feature of the conference.

David Smith, Cloud Developer Advocate at Microsoft, both a delegate and speaker at the conference over the last 5 years, recalls his experience of EARL from last year. Inspired by the stories, he remembers how analysis and R were being used to decide what shipping containers appear on ships to optimise the process across the oceans. It’s recapping on such work that makes him proud to be a data scientist. “There isn’t really another conference like it in terms of real applications in R”, he says. “I’m super interested in technology and learning new things with R, but what I really like is to see is the way they are actually applied in practice – to see how the work the data scientists are doing, using the R language is actually forming the basis of big decisions in industry”.  David’s talk in Seattle, ‘Not Hotdog: Image recognition with R and the Custom Vision API’, illustrates the use of R in conjunction with the Microsoft Custom Vision API to train and use a custom vision recognizer. Motivated by the TV series ‘Silicon Valley’, and with just a couple of hundred images of food, he’ll create a function in R that can detect whether or not a given image contains a hot dog.

If you use R in your organisation, the EARL Conference is for you and your team. Susanna Liberti, Head of Business Insights at eBay Classified Group Scandinavia, came with her team to London EARL to get ideas on how to use R in their environment. As a team transitioning from business analysts to more of a data science team, R allows them to be more effective in their data analysis, with deeper insights. The EARL Conference provides the right learning environment for you and your team, alongside significant networking opportunities with speakers and fellow delegates.

We are delighted that the EARL US Roadshow will be sponsored by RStudio. Come and join us for some interesting discussions in support of the commercial use of the R programming language in Seattle on 7th November, Houston 9th November and Boston 13th November.

 


automl package: part 2/2 first steps how to

Wed, 10/24/2018 - 14:53

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

first steps: how to

For those who will laugh at seeing deep learning with one hidden layer and the Iris data set of 150 records, I will say: you’re perfectly right.
The goal at this stage is simply to take the first steps.

fit a regression model manually (hard way)

Subject: predict Sepal.Length given the other Iris parameters.
First with gradient descent and the default hyper-parameter values for the learning rate (0.001) and mini-batch size (32).

input

data(iris)
xmat <- cbind(iris[, 2:4], as.numeric(iris$Species))
ymat <- iris[, 1]
amlmodel <- automl_train_manual(Xref = xmat, Yref = ymat)

output

(cost: mse)
cost epoch10: 20.9340400047156 (cv cost: 25.205632342013) (LR: 0.001)
cost epoch20: 20.6280923387762 (cv cost: 23.8214521197268) (LR: 0.001)
cost epoch30: 20.3222407903838 (cv cost: 22.1899741289456) (LR: 0.001)
cost epoch40: 20.0217966054298 (cv cost: 21.3908446693146) (LR: 0.001)
cost epoch50: 19.7584058034009 (cv cost: 20.7170232035934) (LR: 0.001)
dim X: ...

input

res <- cbind(ymat, automl_predict(model = amlmodel, X = xmat))
colnames(res) <- c('actual', 'predict')
head(res)

output

     actual   predict
[1,]    5.1 -2.063614
[2,]    4.9 -2.487673
[3,]    4.7 -2.471912
[4,]    4.6 -2.281035
[5,]    5.0 -1.956937
[6,]    5.4 -1.729314

:-[] no pain, no gain …

After some manual fine tuning on learning rate, mini batch size and iterations number (epochs):

input

data(iris)
xmat <- cbind(iris[, 2:4], as.numeric(iris$Species))
ymat <- iris[, 1]
amlmodel <- automl_train_manual(
    Xref = xmat, Yref = ymat,
    hpar = list(
        learningrate = 0.01,
        minibatchsize = 2^2,
        numiterations = 30
    )
)

output

(cost: mse)
cost epoch10: 5.55679482839698 (cv cost: 4.87492997304325) (LR: 0.01)
cost epoch20: 1.64996951479802 (cv cost: 1.50339773126712) (LR: 0.01)
cost epoch30: 0.647727077375946 (cv cost: 0.60142564484723) (LR: 0.01)
dim X: ...

input

res <- cbind(ymat, automl_predict(model = amlmodel, X = xmat))
colnames(res) <- c('actual', 'predict')
head(res)

output

     actual  predict
[1,]    5.1 4.478478
[2,]    4.9 4.215683
[3,]    4.7 4.275902
[4,]    4.6 4.313141
[5,]    5.0 4.531038
[6,]    5.4 4.742847

Better result, but with human efforts!

fit a regression model automatically (easy way, Mix 1)

Same subject: predict Sepal.Length given other Iris parameters

input

data(iris)
xmat <- as.matrix(cbind(iris[, 2:4], as.numeric(iris$Species)))
ymat <- iris[, 1]
start.time <- Sys.time()
amlmodel <- automl_train(
    Xref = xmat, Yref = ymat,
    autopar = list(
        psopartpopsize = 15,
        numiterations = 5,
        nbcores = 4
    )
)
end.time <- Sys.time()
cat(paste('time ellapsed:', end.time - start.time, '\n'))

output

(cost: mse)
iteration 1 particle 1 weighted err: 22.05305 (train: 19.95908 cvalid: 14.72417 ) BEST MODEL KEPT
iteration 1 particle 2 weighted err: 31.69094 (train: 20.55559 cvalid: 27.51518 )
iteration 1 particle 3 weighted err: 22.08092 (train: 20.52354 cvalid: 16.63009 )
iteration 1 particle 4 weighted err: 20.02091 (train: 19.18378 cvalid: 17.09095 ) BEST MODEL KEPT
iteration 1 particle 5 weighted err: 28.36339 (train: 20.6763 cvalid: 25.48073 )
iteration 1 particle 6 weighted err: 28.92088 (train: 20.92546 cvalid: 25.9226 )
iteration 1 particle 7 weighted err: 21.67837 (train: 20.73866 cvalid: 18.38941 )
iteration 1 particle 8 weighted err: 29.80416 (train: 16.09191 cvalid: 24.66206 )
iteration 1 particle 9 weighted err: 22.93199 (train: 20.5561 cvalid: 14.61638 )
iteration 1 particle 10 weighted err: 21.18474 (train: 19.64622 cvalid: 15.79992 )
iteration 1 particle 11 weighted err: 23.32084 (train: 20.78257 cvalid: 14.43688 )
iteration 1 particle 12 weighted err: 22.27164 (train: 20.81055 cvalid: 17.15783 )
iteration 1 particle 13 weighted err: 2.23479 (train: 1.95683 cvalid: 1.26193 ) BEST MODEL KEPT
iteration 1 particle 14 weighted err: 23.1183 (train: 20.79754 cvalid: 14.99564 )
iteration 1 particle 15 weighted err: 20.71678 (train: 19.40506 cvalid: 16.12575 )
...
iteration 4 particle 3 weighted err: 0.3469 (train: 0.32236 cvalid: 0.26104 )
iteration 4 particle 4 weighted err: 0.2448 (train: 0.07047 cvalid: 0.17943 )
iteration 4 particle 5 weighted err: 0.09674 (train: 5e-05 cvalid: 0.06048 ) BEST MODEL KEPT
iteration 4 particle 6 weighted err: 0.71267 (train: 6e-05 cvalid: 0.44544 )
iteration 4 particle 7 weighted err: 0.65614 (train: 0.63381 cvalid: 0.57796 )
iteration 4 particle 8 weighted err: 0.46477 (train: 0.356 cvalid: 0.08408 )
...
time ellapsed: 2.65109273195267

input

res <- cbind(ymat, automl_predict(model = amlmodel, X = xmat))
colnames(res) <- c('actual', 'predict')
head(res)

output

     actual  predict
[1,]    5.1 5.193862
[2,]    4.9 4.836507
[3,]    4.7 4.899531
[4,]    4.6 4.987896
[5,]    5.0 5.265334
[6,]    5.4 5.683173

It’s even better, with no human effort, only machine time.
Users on Windows won’t benefit from parallelization; the function uses the parallel package included with base R…

fit a regression model experimentally (experimental way, Mix 2)

Same subject: predict Sepal.Length given other Iris parameters

input

data(iris)
xmat <- as.matrix(cbind(iris[, 2:4], as.numeric(iris$Species)))
ymat <- iris[, 1]
amlmodel <- automl_train_manual(
    Xref = xmat, Yref = ymat,
    hpar = list(
        modexec = 'trainwpso',
        numiterations = 30,
        psopartpopsize = 50
    )
)

output

(cost: mse)
cost epoch10: 0.113576786377019 (cv cost: 0.0967069106128153) (LR: 0)
cost epoch20: 0.0595472259640828 (cv cost: 0.0831404427407914) (LR: 0)
cost epoch30: 0.0494578776185938 (cv cost: 0.0538888075333611) (LR: 0)
dim X: ...

input

res <- cbind(ymat, automl_predict(model = amlmodel, X = xmat))
colnames(res) <- c('actual', 'predict')
head(res)

output

     actual  predict
[1,]    5.1 5.028114
[2,]    4.9 4.673366
[3,]    4.7 4.738188
[4,]    4.6 4.821392
[5,]    5.0 5.099064
[6,]    5.4 5.277315

Pretty good too, even better!
But it is time consuming on larger data sets, where gradient descent should be preferred.

fit a regression model with custom cost (experimental way, Mix 2)

Same subject: predict Sepal.Length given the other Iris parameters.
Let's try Mean Absolute Percentage Error instead of Mean Square Error.

input

data(iris)
xmat <- as.matrix(cbind(iris[, 2:4], as.numeric(iris$Species)))
ymat <- iris[, 1]
f <- 'J=abs((y-yhat)/y)'
f <- c(f, 'J=sum(J[!is.infinite(J)],na.rm=TRUE)')
f <- c(f, 'J=(J/length(y))')
f <- paste(f, collapse = ';')
amlmodel <- automl_train_manual(
    Xref = xmat, Yref = ymat,
    hpar = list(
        modexec = 'trainwpso',
        numiterations = 30,
        psopartpopsize = 50,
        costcustformul = f
    )
)

output

(cost: custom)
cost epoch10: 0.901580275333795 (cv cost: 1.15936129555304) (LR: 0)
cost epoch20: 0.890142834441629 (cv cost: 1.24167078564786) (LR: 0)
cost epoch30: 0.886088388448652 (cv cost: 1.22756121243449) (LR: 0)
dim X: ...

input

res <- cbind(ymat, automl_predict(model = amlmodel, X = xmat))
colnames(res) <- c('actual', 'predict')
head(res)

output

     actual  predict
[1,]    5.1 4.693915
[2,]    4.9 4.470968
[3,]    4.7 4.482036
[4,]    4.6 4.593667
[5,]    5.0 4.738504
[6,]    5.4 4.914144

fit a classification model with softmax (Mix 2)

Subject: predict Species given the other Iris parameters.
Softmax is available with PSO; no derivative needed.

input

data(iris)
xmat <- iris[, 1:4]
lab2pred <- levels(iris$Species)
lghlab <- length(lab2pred)
iris$Species <- as.numeric(iris$Species)
ymat <- matrix(seq(from = 1, to = lghlab, by = 1), nrow(xmat), lghlab, byrow = TRUE)
ymat <- (ymat == as.numeric(iris$Species)) + 0
amlmodel <- automl_train_manual(
    Xref = xmat, Yref = ymat,
    hpar = list(
        modexec = 'trainwpso',
        layersshape = c(10, 0),
        layersacttype = c('relu', 'softmax'),
        layersdropoprob = c(0, 0),
        numiterations = 50,
        psopartpopsize = 50
    )
)

output

(cost: crossentropy)
cost epoch10: 0.373706545886467 (cv cost: 0.36117608867856) (LR: 0)
cost epoch20: 0.267034060152876 (cv cost: 0.163635821437066) (LR: 0)
cost epoch30: 0.212054571476337 (cv cost: 0.112664100290429) (LR: 0)
cost epoch40: 0.154158717402463 (cv cost: 0.102895917099299) (LR: 0)
cost epoch50: 0.141037927317585 (cv cost: 0.0864623836595045) (LR: 0)
dim X: ...

input

res <- cbind(ymat, automl_predict(model = amlmodel, X = xmat))
colnames(res) <- c(paste('act', lab2pred, sep = '_'), paste('pred', lab2pred, sep = '_'))
head(res)
tail(res)

output

  act_setosa act_versicolor act_virginica pred_setosa pred_versicolor pred_virginica
1          1              0             0   0.9863481     0.003268881    0.010383018
2          1              0             0   0.9897295     0.003387193    0.006883349
3          1              0             0   0.9856347     0.002025946    0.012339349
4          1              0             0   0.9819881     0.004638452    0.013373451
5          1              0             0   0.9827623     0.003115452    0.014122277
6          1              0             0   0.9329747     0.031624836    0.035400439

    act_setosa act_versicolor act_virginica pred_setosa pred_versicolor pred_virginica
145          0              0             1  0.02549091    2.877957e-05      0.9744803
146          0              0             1  0.08146753    2.005664e-03      0.9165268
147          0              0             1  0.05465750    1.979652e-02      0.9255460
148          0              0             1  0.06040415    1.974869e-02      0.9198472
149          0              0             1  0.02318048    4.133826e-04      0.9764061
150          0              0             1  0.03696852    5.230936e-02      0.9107221

change the model parameters (shape …)

Same subject: predict Species given the other Iris parameters.
1st example: with gradient descent and 2 hidden layers containing 10 nodes each, with different activation functions for the hidden layers.

input

data(iris)
xmat <- iris[, 1:4]
lab2pred <- levels(iris$Species)
lghlab <- length(lab2pred)
iris$Species <- as.numeric(iris$Species)
ymat <- matrix(seq(from = 1, to = lghlab, by = 1), nrow(xmat), lghlab, byrow = TRUE)
ymat <- (ymat == as.numeric(iris$Species)) + 0
amlmodel <- automl_train_manual(
    Xref = xmat, Yref = ymat,
    hpar = list(
        layersshape = c(10, 10, 0),
        layersacttype = c('tanh', 'relu', ''),
        layersdropoprob = c(0, 0, 0)
    )
)

nb: the last activation type may be left blank (it will be set automatically)

2nd example: with gradient descent and no hidden layer (logistic regression)

input

data(iris)
xmat <- iris[, 1:4]
lab2pred <- levels(iris$Species)
lghlab <- length(lab2pred)
iris$Species <- as.numeric(iris$Species)
ymat <- matrix(seq(from = 1, to = lghlab, by = 1), nrow(xmat), lghlab, byrow = TRUE)
ymat <- (ymat == as.numeric(iris$Species)) + 0
amlmodel <- automl_train_manual(
    Xref = xmat, Yref = ymat,
    hpar = list(
        layersshape = c(0),
        layersacttype = c('sigmoid'),
        layersdropoprob = c(0)
    )
)

ToDo List
  • transfer learning from existing frameworks
  • add autotune to other parameters (layers, dropout, …)
  • CNN
  • RNN

join the team !
https://github.com/aboulaboul/automl


M4 Forecasting Conference

Wed, 10/24/2018 - 02:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

Following the highly successful M4 Forecasting Competition, there will be a conference held on 10-11 December at Tribeca Rooftop, New York, to discuss the results. The conference will elaborate on the findings of the M4 Competition, with prominent speakers from leading business firms and top universities.

Nassim Nicholas Taleb will deliver a keynote address about uncertainty in forecasting and Spyros Makridakis will discuss how organizations can benefit by improving the accuracy of their predictions and assessing uncertainty realistically.

There will be a presentation of the three most accurate methods of the M4 Competition by their developers explaining the reasons for their success. Unfortunately I cannot attend myself due to a clash with another speaking engagement. However, Pablo Montero-Manso and George Athanasopoulos will attend to discuss our 2nd place entry in the competition.

For information about the conference and registration visit:
www.mcompetitions.unic.ac.cy/


Computer Vision for Model Assessment

Tue, 10/23/2018 - 20:25

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

One of the differences between statistical data scientists and machine learning engineers is that while the latter group are concerned primarily with the predictive performance of a model, the former group are also concerned with the fit of the model. A model that misses important structures in the data — for example, seasonal trends, or a poor fit to specific subgroups — is likely to be lacking important variables or features in the source data. You can try different machine learning techniques or adjust hyperparameters to your heart's content, but you're unlikely to discover problems like this without evaluating the model fit.

One of the most powerful tools for assessing model fit is the residual plot: a scatterplot of the predicted values of the model versus the residuals (the difference between the predictions and the original data). If the model fits well, there should be no obvious relation between the two. For example, visual inspection shows us that the model below fits fairly well for 19 of the 20 subgroups in the data, but there may be a missing variable that would explain the apparent residual trend for group 19.
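As a generic illustration (this is not the model or data discussed in the talk), a residuals-versus-fitted plot takes only a few lines of R:

library(ggplot2)
## Fit any model and plot its residuals against its fitted values;
## mtcars is used here purely as a stand-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)
diag_df <- data.frame(fitted = fitted(fit), residual = residuals(fit))
ggplot(diag_df, aes(x = fitted, y = residual)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = "dashed") +
    labs(x = "Predicted values", y = "Residuals")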

Assessing residual plots like these is something of an art, which is probably why it isn't routine in machine learning circles. But what if we could use deep learning and computer vision to assess residual plots like these automatically? That's what Professor Di Cook from Monash University proposes, in the 2018 Belz Lecture for the Statistical Society of Australia, Human vs computer: when visualising data, who wins?  

(The slides for the presentation, created using R, are also available online.) The first part of the talk also includes a statistical interpretation of neural networks, and an intuitive explanation of deep learning for computer vision. The talk also references a related paper, Visualizing Statistical Models: Removing the Blindfold (by Hadley Wickham, Dianne Cook and Heike Hofmann, published in the ASA Data Science Journal in 2015), which is well worth a read as well.


Create a Glossary in R Markdown

Tue, 10/23/2018 - 18:00

(This article was first published on Yongfu's Blog, and kindly contributed to R-bloggers)

I was thinking about creating a glossary in bookdown and found out that there was already an issue about it. I like Yihui’s recommendation: use Pandoc’s definition lists. This was exactly what I had been doing, but I quickly found out that there was a major drawback – the definition lists won’t order alphabetically unless written in that way.

So I wrote an R function to reorder the definition lists written in R Markdown. Note that this function only works for R Markdown files consisting exclusively of definition lists. If an R Markdown file isn’t made up entirely of definition lists, the function will fail.

Usage

To order the definition lists alphabetically, simply put the Rmd file path in the function. To have a different output file, provide the output file path as the second argument.

sort_def_list("glossary.Rmd") # sort_def_list("glossary.Rmd", "reordered.Rmd")

The output in PDF looks like this (I used the multicol package)1:

Source Code
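The author's original implementation is linked from the post; as a rough sketch of the idea (my own reconstruction, not the original code), such a function could look like this, assuming each entry starts with a term on a non-indented line and every following indented, ":" or blank line belongs to that term:

sort_def_list <- function(input, output = input) {
    lines <- readLines(input)
    ## An entry starts at a non-blank line that is neither indented nor a ":" line
    starts <- which(grepl("^[^[:space:]:]", lines))
    if (length(starts) == 0) stop("no definition-list terms found")
    ## Assign every line to the most recent term; lines before the first term
    ## (e.g. a YAML header) are kept as a preamble
    grp <- findInterval(seq_along(lines), starts)
    preamble <- lines[grp == 0]
    entries <- split(lines[grp > 0], grp[grp > 0])
    terms <- vapply(entries, `[[`, character(1), 1)
    sorted <- unlist(entries[order(tolower(terms))], use.names = FALSE)
    writeLines(c(preamble, sorted), output)
    invisible(output)
}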

Notes

  1. To see the source R Markdown file, visit glossary.rmd. To see the output PDF, visit glossary.pdf


RcppTOML 0.1.4: Now with TOML v0.5.0

Tue, 10/23/2018 - 14:39

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version of our RcppTOML package just arrived on CRAN. It wraps an updated version of the cpptoml parser which, after a correction or two, now brings support for TOML v0.5.0 – which is still rather rare.

RcppTOML brings TOML to R. TOML is a file format that is most suitable for configurations, as it is meant to be edited by humans but read by computers. It emphasizes strong readability for humans while at the same time supporting strong typing as well as immediate and clear error reports. On small typos you get parse errors, rather than silently corrupted garbage. Much preferable to any and all of XML, JSON or YAML – though sadly these may be too ubiquitous now. TOML has been making inroads with projects such as the Hugo static blog compiler, or the Cargo system of Crates (aka “packages”) for the Rust language.
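As a small, hedged illustration (not from the release announcement), a TOML snippet can be parsed directly from a string, assuming the parseTOML() interface with fromFile = FALSE:

library(RcppTOML)
toml <- '
title = "example"

[owner]
name = "Tom"
dob = 1979-05-27T07:32:00Z   # strong typing: parsed as a datetime, not a string
'
parseTOML(toml, fromFile = FALSE)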

Besides the (exciting !!) support for TOML v0.5.0 and e.g. its dates support, this release also includes a (still somewhat experimental) feature cooked up by Dan Dillon a while back: TOML files can now include other TOML (and in fact, Dan implemented a whole recursing stream processor…). The full list of changes is below.

Changes in version 0.1.4 (2018-10-23)
  • Spelling / grammar fixes to README (Jon Calder in #18)

  • Cast from StretchyList to List ensures lists appear as List objects in R

  • Support optional includize pre-processor for recursive includes by Dan Dillon as a header-only library (#21 and #22)

  • Support includize argument in R and C++ parser interface

  • Added a few more #nocov tags for coverage (#23)

  • Synchronized with new upstream cpptoml version supporting the TOML v0.5.0 specification (#25)

Courtesy of CRANberries, there is a diffstat report for this release.

More information is on the RcppTOML page. Issues and bugreports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Introducing gratia

Tue, 10/23/2018 - 14:00

(This article was first published on From the Bottom of the Heap - R, and kindly contributed to R-bloggers)

I use generalized additive models (GAMs) in my research work. I use them a lot! Simon Wood’s mgcv package is an excellent set of software for specifying, fitting, and visualizing GAMs for very large data sets. Despite recently dabbling with brms, mgcv is still my go-to GAM package. The only down-side to mgcv is that it is not very tidy-aware and the ggplot-verse may as well not exist as far as it is concerned. This in itself is no bad thing, though as someone who uses mgcv a lot but also prefers to do my plotting with ggplot2, this lack of awareness was starting to hurt. So, I started working on something to help bridge the gap between these two separate worlds that I inhabit. The fruit of that labour is gratia, and development has progressed to the stage where I am ready to talk a bit more about it.

gratia is an R package for working with GAMs fitted with gam(), bam() or gamm() from mgcv or gamm4() from the gamm4 package, although functionality for handling the latter is not yet implemented. gratia provides functions to replace the base-graphics-based plot.gam() and gam.check() that mgcv provides with ggplot2-based versions. Recent changes have also resulted in gratia being much more tidyverse aware and it now (mostly) returns outputs as tibbles.

In this post I wanted to give a flavour of what is currently possible with gratia and outline what still needs to be implemented.

gratia currently lives on GitHub, so we need to install it from there using devtools::install_github:

devtools::install_github('gavinsimpson/gratia')

To do anything useful with gratia we need a GAM and for that we need mgcv

library('mgcv')
library('gratia')

and an old favourite example data set

set.seed(20)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2, verbose = FALSE)

The simulated data in dat are well-studied in GAM-related research and contain a number of covariates — labelled x0 through x3 — which have, to varying degrees, non-linear relationships with the response. We want to try to recover these relationships by approximating the true relationships between covariate and response using splines. To fit a purely additive model, we use

mod <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML")

mgcv provides a summary() method that is used to extract information about the fitted GAM

summary(mod)

Family: gaussian
Link function: identity

Formula:
y ~ s(x0) + s(x1) + s(x2) + s(x3)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.7625     0.0959   80.95   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
        edf Ref.df      F p-value
s(x0) 3.528  4.370  11.30 4.2e-09 ***
s(x1) 2.662  3.310 129.02 < 2e-16 ***
s(x2) 8.146  8.799  84.72 < 2e-16 ***
s(x3) 1.001  1.002   0.00   0.987
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.763   Deviance explained = 77.2%
-REML = 850.87  Scale est. = 3.6785    n = 400

and the k.check() function for checking whether sufficient numbers of basis functions were used in each smooth in the model. (You may not have used k.check() directly — it is called by gam.check() which prints out other diagnostics and also produces four model diagnostic plots, which is one thing that gratia provides a replacement for.)
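If you only want the basis-dimension test without the diagnostic plots, k.check() can be called on its own (a small usage note, not from the original post):

## Reports k', edf, k-index and p-value for each smooth, as gam.check() would
k.check(mod)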

Plotting smooths

To visualize estimated GAMs, mgcv provides the plot.gam() method and the vis.gam() function. gratia currently provides a ggplot2-based replacement for plot.gam(). Work is on-going to provide vis.gam()-like functionality within gratia — see ?gratia::data_slice for early work in that direction. In gratia, we use the draw() generic to produce ggplot2-like plots from objects. To visualize the four estimated smooth functions in the GAM mod, we would use

draw(mod)

The result of draw(mod) is a plot of each of the four smooth functions in the mod GAM.

Internally draw() uses the plot_grid() function from cowplot to draw multiple panels on the plot device, and to line up the individual plots.

There’s not an awful lot more you can do with this now, but the at least the plot is reasonably pretty. gratia includes tools for working with the underlying smooths represented in mod, and if you wanted to extract most of the data used to build the plot you’d use the evaluate_smooth() function.

evaluate_smooth(mod, "x1") # A tibble: 100 x 5 smooth fs_variable x1 est se 1 s(x1) 0.000565 -2.75 0.294 2 s(x1) 0.0106 -2.72 0.277 3 s(x1) 0.0207 -2.68 0.261 4 s(x1) 0.0308 -2.64 0.245 5 s(x1) 0.0409 -2.60 0.230 6 s(x1) 0.0510 -2.56 0.217 7 s(x1) 0.0610 -2.52 0.204 8 s(x1) 0.0711 -2.48 0.193 9 s(x1) 0.0812 -2.44 0.183 10 s(x1) 0.0913 -2.40 0.173 # ... with 90 more rows Producing diagnostic plots

The diagnostic plots currently produced by gam.check() can also be produced using gratia, with the appraise() function

appraise(mod)

The result of appraise(mod) is an array of four diagnostic plots, including a Q-Q plot (top left) and histogram (bottom left) of model residuals, a plot of residuals vs the linear predictor (top right), and a plot of observed vs fitted values.

Each of the four plots is produced by a user-accessible function that implements a specific plot. For example, qq_plot(mod) produces the Q-Q plot in the upper left of the figure above, and the qq_plot.gam() method reproduces most of the functionality of mgcv::qq.gam(), including the direct randomization procedure (method = 'direct', as shown above) and the data simulation procedure (method = 'simulate') to generate reference quantiles, which typically have better performance for GLM-like models (Augustin et al., 2012).

qq_plot(mod, method = 'simulate')

The result of qq_plot(mod, method = 'simulate') is a Q-Q plot of residuals, where the reference quantiles are derived by simulating data from the fitted model.

draw() can also handle many of the more specialized smoothers currently available in mgcv. For example, 2D smoothers are represented as geom_raster() surfaces with contours

set.seed(1)
dat <- gamSim(2, n = 4000, dist = "normal", scale = 1, verbose = FALSE)
mod <- gam(y ~ s(x, z, k = 30), data = dat$data, method = "REML")
draw(mod)

The default way a 2D smoother is plotted using draw().

and factor-smooth-interaction terms, which are the equivalent of random slopes and intercepts for splines, are drawn on a single panel and colour is used to distinguish the different random smooths

## simulate example... from ?mgcv::factor.smooth.interaction
set.seed(0)
## simulate data...
f0 <- function(x) 2 * sin(pi * x)
f1 <- function(x, a = 2, b = -1) exp(a * x) + b
f2 <- function(x) 0.2 * x^11 * (10 * (1 - x))^6 + 10 * (10 * x)^3 * (1 - x)^10
n <- 500
nf <- 10
fac <- sample(1:nf, n, replace = TRUE)
x0 <- runif(n)
x1 <- runif(n)
x2 <- runif(n)
a <- rnorm(nf) * .2 + 2
b <- rnorm(nf) * .5
f <- f0(x0) + f1(x1, a[fac], b[fac]) + f2(x2)
fac <- factor(fac)
y <- f + rnorm(n) * 2
df <- data.frame(y = y, x0 = x0, x1 = x1, x2 = x2, fac = fac)

mod <- gam(y ~ s(x0) + s(x1, fac, bs = "fs", k = 5) + s(x2, k = 20), method = "ML")
draw(mod)

The result of draw(mod) for a more complex GAM containing a factor-smooth-interaction term with bs = 'fs'.
What else can gratia do?

Although still quite early in the planned development cycle, gratia can handle most of the smooths that mgcv can estimate, including by variable smooths with factor and continuous by variables, random effect smooths (bs = ‘re’), 2D tensor product smooths, and models with parametric terms.
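As a quick, illustrative sketch of the by-variable case (the simulated data and model here are my own example, not one from the package, and the exact panel layout may differ in this early release):

library(mgcv)
library(gratia)

set.seed(10)
dat <- gamSim(4, n = 400, verbose = FALSE)   # simulated data containing a factor 'fac'
m_by <- gam(y ~ fac + s(x2, by = fac) + s(x0), data = dat, method = "REML")
draw(m_by)   # one panel per level of 'fac' for the by-smooth, plus a panel for s(x0)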

Smoothers that gratia can’t do anything with as yet are Markov random fields (MRFs; bs = ‘mrf’), splines on the sphere (SoSs; bs = ‘sos’), soap film smoothers (bs = ‘so’), and linear functional models with matrix terms.

The package also includes functions for

  • calculating across-the-function and simultaneous confidence intervals for smooths via confint() methods, and
  • calculating first and second derivatives of (currently only univariate) smooths using finite differences; fderiv() is the old home for first derivatives of GAM smooths, whilst the new derivatives() function can calculate first and second derivatives using forward (like fderiv()), backward, or central finite differences (a short sketch of both follows this list).
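As a minimal sketch of both, taking mod to be the original four-smooth GAM from the top of the post, and noting that the exact argument names (for example the "s(x1)" form of parm) may still shift in this early release:

ci_ptwise <- confint(mod, parm = "s(x1)", type = "confidence")    # across-the-function interval
ci_simult <- confint(mod, parm = "s(x1)", type = "simultaneous")  # simultaneous interval
d1 <- derivatives(mod, order = 1L, type = "central")              # first derivatives of each smooth
d2 <- derivatives(mod, order = 2L, type = "central")              # second derivatives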

There are also a lot of exported functions that make it easier to work with GAMs fitted by mgcv and to extract aspects of the fitted model and its smooths. The exact functionality is still being worked out, so be prepared for some functions to come and go or change name as I work through ideas and implementations and settle on the interface for the tools that gratia will provide.

What can’t gratia do?

I’ve already covered where gratia is currently lacking in respect of the types of smoother that mgcv can fit. It is also currently lacking in tools for exploring models in more detail, such as the plots of model predictions over slices of covariate space that vis.gam() can produce (though see gratia::data_slice() for functions to create the data needed for such plots). Nor can gratia currently handle smooths of more than two dimensions. I’d like to add this capability soon, as it will make visualizing GAMs fitted to spatio-temporal data much easier than it currently is.

The future?

Longer term I plan to fill out the types of smoothers that gratia can handle to cover all the types that mgcv can fit, and add vis.gam()-like functionality and the ability to handle higher dimensional smooths (plot.gam() can now handle 3- or 4-dimensional smooths.)

The ultimate goal of course is to just have draw() work for whatever GAM model you throw at it, and at least have feature parity with plot.gam() and vis.gam().

As is to be expected for such an early release, there is a lot of stabilization to function names and arguments that needs to happen in gratia, and a lot of documentation to be written, including some vignettes. For now, the best way to understand what gratia is doing or how it works is to look at the examples on the gratia website (built using pkgdown) and take a look at the package tests which contain lots of examples of GAM fits and the code to work with them.

I’m very much interested in user feedback, so please do let me know if you have any suggestions for additions or improvements to gratia, and if you do use gratia and find bugs in the package or GAMs that gratia can’t handle I would love to hear from you. You can get in touch via the comments below, or via GitHub Issues.

I would also be remiss if I did not mention Matteo Fasiolo’s excellent mgcViz package, which already has extensive capabilities for exploring GAM fits, including some very interesting approaches to handling models fitted to millions of data points or more, which otherwise cause data visualization problems.

References

Augustin, N. H., Sauleau, E.-A., and Wood, S. N. (2012). On quantile quantile plots for generalized linear models. Computational Statistics & Data Analysis 56, 2404–2409. doi:10.1016/j.csda.2012.01.026.


To leave a comment for the author, please follow the link and comment on their blog: From the Bottom of the Heap - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Machine learning logistic regression for credit modelling in R

Tue, 10/23/2018 - 03:29

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    Categories

    1. Regression Models

    Tags

    1. Logistic Regression
    2. R Programming
    3. ROC

    Logistic regression is a widely used machine learning method for credit modelling. There are excellent and efficient packages in R that can perform this type of analysis. Typically you will first create some visualizations of the data before you perform the logistic regression analysis.

    This article is the second step of a credit modelling analysis; the first step was recently published in a previous article.

    The first thing we need to do is load the required R packages:

    # Load R packages into the library
    # Data management packages
    library(DescTools)
    library(skimr)
    library(plyr)
    library(dplyr)
    library(aod)
    library(readxl)
    # Visualization packages
    library(Deducer)
    library(ggplot2)
    # Machine learning method packages
    library(ROCR)
    library(pROC)
    library(caret)
    library(MASS)

    Now it is time to load the dataset and do some data management. We will work with the Lending Club loan dataset. The code below performs the data management:

    # Import dataset
    loan_data <- read.csv("/loan.csv")

    # Selecting the relevant variables in the dataset:
    loan_data <- loan_data[, c("grade","sub_grade","term","loan_amnt","issue_d","loan_status","emp_length",
                               "home_ownership","annual_inc","verification_status","purpose","dti",
                               "delinq_2yrs","addr_state","int_rate","inq_last_6mths","mths_since_last_delinq",
                               "mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc")]

    # Data management for missing observations
    loan_data$mths_since_last_delinq[is.na(loan_data$mths_since_last_delinq)] <- 0
    loan_data$mths_since_last_record[is.na(loan_data$mths_since_last_record)] <- 0
    var.has.na <- lapply(loan_data, function(x){any(is.na(x))})
    num_na <- which( var.has.na == TRUE )
    per_na <- num_na/dim(loan_data)[1]
    loan_data <- loan_data[complete.cases(loan_data),]

    Although the visualization step can be found in my previous article, let us do a minimum of visualization here, in case the reader only reads this article:

    # Visualization of the data
    library(plotly)   # needed for ggplotly()

    # Bar chart of the loan amount
    loanamount_barchart <- ggplot(data = loan_data, aes(loan_data$loan_amnt)) +
      geom_histogram(breaks = seq(0, 35000, by = 1000), col = "black", aes(fill = ..count..)) +
      scale_fill_gradient("Count", low = "green1", high = "yellowgreen") +
      labs(title = "Loan Amount", x = "Amount", y = "Number of Loans")
    loanamount_barchart
    ggplotly(p = ggplot2::last_plot())

    # Box plot of loan amount
    box_plot_stat <- ggplot(loan_data, aes(loan_status, loan_amnt))
    box_plot_stat +
      geom_boxplot(aes(fill = loan_status)) +
      theme(axis.text.x = element_blank()) +
      labs(title = "Loan amount by status", x = "Loan Status", y = "Amount")
    ggplotly(p = ggplot2::last_plot())

    The above coding gives us the following two visualizations:

    Let's see some descriptive statistics of the data as well:

    skim(loan_data) Skim summary statistics n obs: 886877 n variables: 23 -- Variable type:factor -------------------------------------------------------- variable missing complete n n_unique top_counts ordered addr_state 0 886877 886877 51 CA: 129456, NY: 74033, TX: 71100, FL: 60901 FALSE emp_length 0 886877 886877 12 10+: 291417, 2 y: 78831, < 1: 70538, 3 y: 69991 FALSE grade 0 886877 886877 7 B: 254445, C: 245721, A: 148162, D: 139414 FALSE home_ownership 0 886877 886877 6 MOR: 443319, REN: 355921, OWN: 87408, OTH: 180 FALSE issue_d 0 886877 886877 103 Oct: 48619, Jul: 45938, Dec: 44323, Oct: 38760 FALSE loan_status 0 886877 886877 8 Cur: 601533, Ful: 209525, Cha: 45956, Lat: 11582 FALSE purpose 0 886877 886877 14 deb: 524009, cre: 206136, hom: 51760, oth: 42798 FALSE sub_grade 0 886877 886877 35 B3: 56301, B4: 55599, C1: 53365, C2: 52206 FALSE term 0 886877 886877 2 36: 620739, 60: 266138, NA: 0 FALSE verification_status 0 886877 886877 3 Sou: 329393, Ver: 290896, Not: 266588, NA: 0 FALSE -- Variable type:numeric ------------------------------------------------------- variable missing complete n mean sd p0 p25 p50 p75 p100 hist annual_inc 0 886877 886877 75019.4 64687.38 0 45000 65000 90000 9500000 ???????? delinq_2yrs 0 886877 886877 0.31 0.86 0 0 0 0 39 ???????? dti 0 886877 886877 18.16 17.19 0 11.91 17.66 23.95 9999 ???????? inq_last_6mths 0 886877 886877 0.69 1 0 0 0 1 33 ???????? int_rate 0 886877 886877 13.25 4.38 5.32 9.99 12.99 16.2 28.99 ???????? loan_amnt 0 886877 886877 14756.97 8434.43 500 8000 13000 20000 35000 ???????? mths_since_last_delinq 0 886877 886877 16.62 22.89 0 0 0 30 188 ???????? mths_since_last_record 0 886877 886877 10.83 27.65 0 0 0 0 129 ???????? open_acc 0 886877 886877 11.55 5.32 1 8 11 14 90 ???????? pub_rec 0 886877 886877 0.2 0.58 0 0 0 0 86 ???????? revol_bal 0 886877 886877 16924.56 22414.33 0 6450 11879 20833 2904836 ???????? revol_util 0 886877 886877 55.07 23.83 0 37.7 56 73.6 892.3 ???????? total_acc 0 886877 886877 25.27 11.84 1 17 24 32 169 ????????

    Next we need to do some more data management to prepare the dataset for the machine learning analysis:

    # Focus on the historical loans loan_data=as.data.frame(loan_data[loan_data$loan_status!="Current", ]) limits_inc = quantile(loan_data$annual_inc, seq(0,1,0.1)) labels <- c(0, limits_inc[2:10], "+inf") labels <- prettyNum(labels, big.mark = ",") labels <- paste(labels[1:10], labels[2:11], sep = "-") loan_data$annual_inc <- cut(loan_data$annual_inc, limits_inc, labels = labels, include.lowest = T) loan_data[,"annual_inc"] <- as.character(loan_data[,"annual_inc"]) # Create binary variables for the logistic regression analysis # Annual_inc loan_data$annual_inc[loan_data$annual_inc == "70,000- 80,000"| loan_data$annual_inc == "80,000- 94,000" | loan_data$annual_inc == "94,000-120,000" | loan_data$annual_inc == "120,000- +inf" ] <- 1 loan_data$annual_inc[loan_data$annual_inc != 1] <- 0 loan_data$annual_inc <- as.numeric(loan_data$annual_inc) # Home_ownership loan_data$home_ownership <- as.character(loan_data$home_ownership) loan_data$home_ownership[loan_data$home_ownership=="OWN" | loan_data$home_ownership=="MORTGAGE" ] <- 1 loan_data$home_ownership[loan_data$home_ownership!=1] <- 0 # Dealinq_2yrs loan_data$delinq_2yrs <- as.character(loan_data$delinq_2yrs) loan_data$delinq_2yrs[loan_data$delinq_2yrs=="0"] <- 0 loan_data$delinq_2yrs[loan_data$delinq_2yrs!= 0] <- 1 # Verification status: if Verified = 1 ; otherwise = 0 loan_data$verification_status = as.character(loan_data$verification_status) loan_data$verification_status[loan_data$verification_status == "Verified" | loan_data$verification_status == "Source Verified"] = 1 loan_data$verification_status[loan_data$verification_status != 1] = 0 loan_data$verification_status=as.numeric(loan_data$verification_status) # Dti dti_quant <- quantile(loan_data$dti, seq(0, 1, 0.1)) labels = c(0,prettyNum(dti_quant[2:10], big.mark = ","), "+Inf") labels = paste(labels[1:10],labels[2:11], sep = "-") loan_data <- mutate(loan_data, dti= cut(loan_data$dti, breaks = dti_quant, labels = factor(labels), include.lowest = T)) loan_data$dti <- as.character(loan_data$dti) loan_data$dti[loan_data$dti == "0-6.57" | loan_data$dti == "12.13-14.32" | loan_data$dti == "14.32-16.49" ] <- 1 loan_data$dti[loan_data$dti!=1] <- 0 # Status loan_data$loan_status <- as.character(loan_data$loan_status) loan_data$loan_status[loan_data$loan_status == "Charged Off" | loan_data$loan_status == "Default" ] <- 1 loan_data$loan_status[loan_data$loan_status != 1] <- 0 table(loan_data$loan_status) PercTable(loan_data$loan_status) # Change to nummeric variables: loan_data[,"revol_util"] <- as.numeric(sub("%", "",loan_data$"revol_util", fixed =TRUE))/100 loan_data[,"int_rate"] <- as.numeric(sub("%", "",loan_data$"int_rate", fixed =TRUE))/100 loan_data$loan_status <- as.numeric(loan_data$loan_status) # Grouping variables loan_data$purpose <- as.character(loan_data$purpose) loan_data$purpose[loan_data$purpose == "car" | loan_data$purpose == "major_purchase" | loan_data$purpose == "home_improvement"| loan_data$purpose == "credit_card" ] <- 2 loan_data$purpose[loan_data$purpose == "moving" | loan_data$purpose == "small_business" | loan_data$purpose == "renewable_energy" ] <- 0 loan_data$purpose[loan_data$purpose!= 0 & loan_data$purpose!= 2 ] <- 1 loan_data$purpose <- as.factor(loan_data$purpose)

    Now it is time to run the machine learning regression analysis. We will work with multiple logistic regression. Logistic regression is applied when the variable you want to explain (y) is binary. The model uses the logistic (cumulative distribution) function to turn a linear combination of the explanatory variables (the x's) into a probability. We will use a stepwise procedure to find a final model for the logistic regression.
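    For reference, this is just the standard textbook form of the model (nothing specific to the packages used here): the probability that loan \(i\) defaults is modelled as

    \[ \text{Pr}(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})}} \]

    The code below fits this model with all candidate explanatory variables and then runs a stepwise (AIC-based) selection to arrive at the final model: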

    ## Machine Learning: Multiple Logistic Regression Models
    # Logistic: Logit stepwise Regression
    logregmodI <- glm(loan_status ~ loan_amnt + home_ownership + annual_inc + verification_status +
                        purpose + dti + delinq_2yrs + int_rate + inq_last_6mths +
                        mths_since_last_delinq + revol_bal + revol_util + total_acc,
                      data = loan_data, family = binomial(link = "logit"))
    step <- stepAIC(logregmodI, direction = "both")
    step$anova

    Stepwise Model Path
    Analysis of Deviance Table

    Initial Model:
    loan_status ~ loan_amnt + home_ownership + annual_inc + verification_status +
        purpose + dti + delinq_2yrs + int_rate + inq_last_6mths +
        mths_since_last_delinq + revol_bal + revol_util + total_acc

    Final Model:
    loan_status ~ loan_amnt + home_ownership + verification_status +
        purpose + dti + delinq_2yrs + int_rate + inq_last_6mths +
        mths_since_last_delinq + revol_bal + revol_util + total_acc

               Step Df Deviance Resid. Df Resid. Dev      AIC
    1                              285330   241553.4 241581.4
    2  - annual_inc  0        0    285330   241553.4 241581.4

    Now we need to split the data into a training and a testing dataset, so we can train the model and evaluate it with a ROC curve:

    # Create a training- and testing dataset percing <- floor((nrow(loan_data)/4)*3) loan <- loan_data[sample(nrow(loan_data)), ] loan.training <- loan[1:percing, ] loan.testing <- loan[(percing+1):nrow(loan), ] # Begin training of the model fitting.logistic <- glm(loan_status ~ loan_amnt + home_ownership + verification_status + purpose + dti + delinq_2yrs + int_rate + inq_last_6mths + mths_since_last_delinq + revol_bal + revol_util + total_acc, data=loan.training,family = binomial(link= "logit")) summary(fitting.logistic) Call: glm(formula = loan_status ~ loan_amnt + home_ownership + verification_status + purpose + dti + delinq_2yrs + int_rate + inq_last_6mths + mths_since_last_delinq + revol_bal + revol_util + total_acc, family = binomial(link = "logit"), data = loan.training) Deviance Residuals: Min 1Q Median 3Q Max -1.5509 -0.6336 -0.5098 -0.3799 2.5925 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.952e+00 4.221e-02 -69.931 < 2e-16 *** loan_amnt 2.213e-06 8.330e-07 2.657 0.007894 ** home_ownership1 -1.626e-01 1.261e-02 -12.892 < 2e-16 *** verification_status 7.695e-02 1.429e-02 5.385 7.23e-08 *** purpose1 -3.342e-01 3.237e-02 -10.326 < 2e-16 *** purpose2 -3.882e-01 3.383e-02 -11.475 < 2e-16 *** dti1 -2.075e-01 1.385e-02 -14.982 < 2e-16 *** delinq_2yrs1 -5.411e-02 1.618e-02 -3.345 0.000823 *** int_rate 1.157e+01 1.545e-01 74.868 < 2e-16 *** inq_last_6mths 7.640e-02 5.113e-03 14.943 < 2e-16 *** mths_since_last_delinq -2.289e-03 2.726e-04 -8.396 < 2e-16 *** revol_bal -1.818e-06 4.002e-07 -4.543 5.54e-06 *** revol_util 3.391e-01 2.743e-02 12.366 < 2e-16 *** total_acc -6.473e-03 5.718e-04 -11.320 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 191593 on 214007 degrees of freedom Residual deviance: 180719 on 213994 degrees of freedom AIC: 180747 Number of Fisher Scoring iterations: 4

    Now we can compute the AUC and create the ROC curve for the model.

    # AUC and ROC curve
    # (the fitted model from the training step above is named fitting.logistic)
    fitted.results <- predict(fitting.logistic, newdata = loan.testing, type = "response")
    loan.testing$prob <- fitted.results
    pred <- prediction(loan.testing$prob, loan.testing$loan_status)
    auc1 <- performance(pred, measure = "auc")
    auc1@y.values

    And lastly we can display the ROC curve as a visualization. These computations are pretty heavy, so relax and let R do the calculations:

    # Performance function
    ROCRperf = performance(pred, "tpr", "fpr")
    # Plot the ROC graph and add threshold labels
    plot(ROCRperf, colorize = TRUE, print.cutoffs.at = seq(0, 1, by = 0.1), text.adj = c(-0.2, 1.7))
    abline(0, 1, col = "black")

    The above coding gives us the following ROC graph visualization:

    References
    1. Using DescTools in R – CRAN.R-project.org
    2. Using skimr in R – CRAN.R-project.org
    3. Using magrittr in R – CRAN.R-project.org
    4. Using plyr in R – CRAN.R-project.org
    5. Using dplyr in R – CRAN.R-project.org
    6. Using aod in R – CRAN.R-project.org
    7. Using readxl in R – CRAN.R-project.org
    8. Using Deducer in R – CRAN.R-project.org
    9. Using ggplot2 in R – CRAN.R-project.org
    10. Using plotly in R – CRAN.R-project.org
    11. Using ROCR in R – CRAN.R-project.org
    12. Using pROC in R – CRAN.R-project.org
    13. Using caret in R – CRAN.R-project.org
    14. Using MASS in R – CRAN.R-project.org

    Related Post

    1. Commercial data analytics: An economic view on the data science methods
    2. Weight loss in the U.S. – An analysis of NHANES data with tidyverse
    3. Machine Learning Results in R: one plot to rule them all! (Part 2 – Regression Models)
    4. Story of pairs, ggpairs, and the linear regression
    5. Extract FRED Data for OLS Regression Analysis: A Complete R Tutorial

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    RStudio 1.2 Preview: Plumber Integration

    Tue, 10/23/2018 - 02:00

    (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

    The RStudio 1.2 Preview Release makes it even easier to create RESTful Web APIs in R using the plumber package.

    plumber is a package that converts your existing R code to a web API using a handful of special one-line comments.
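    If you have not used plumber before, a minimal plumber.R looks roughly like the sketch below; it mirrors the kind of sample file RStudio generates, but treat the specific endpoints as illustrative only:

    # plumber.R

    #* Echo back the input
    #* @param msg The message to echo
    #* @get /echo
    function(msg = "") {
      list(msg = paste0("The message is: '", msg, "'"))
    }

    #* Plot a histogram of random values
    #* @png
    #* @get /plot
    function() {
      hist(rnorm(100))
    }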

    RStudio 1.2 provides the following plumber-related productivity enhancements:

    • push-button local server execution for development and testing
    • easy API publishing to RStudio Connect
    • automatic API documentation and interactive execution via Swagger
    • create a new Plumber API project or add an API to an existing directory

    A full discussion of Web APIs and the plumber package is beyond the scope of this article; for a primer, check out:
    R Views: REST APIs and Plumber

    Let’s take a look at the new features.

    Creating an API

    On the RStudio main menu, select File / New File / Plumber API.

    RStudio will offer to install plumber and any dependencies, if necessary. Then, give your API a folder name and a location for that folder:

    An R source file named plumber.R containing sample APIs is opened in RStudio. This file shows the essentials of the plumber-specific annotations that identify and document APIs. For the example input above, the location would be ~/code/MyAPI/HelloWorld/plumber.R.

    Creating a Plumber API project

    If a recent version of the plumber package is already installed, you can also create a new RStudio project via File / New Project / New Directory / New Plumber API Project. You may have to scroll down the list of project types to see the plumber option:

    This will prompt for the same information as the other approach, but the result is a standalone RStudio project containing the plumber.R file with the same sample APIs.

    Running a Local API server

    Comments beginning with #* followed by plumber-specific annotations such as @get identify this as a plumber API file. RStudio adds the Run API button in place of the Run and Source buttons seen on a generic R source file.

    The plumber package also supports a #' prefix, but these are not recognized by RStudio; be sure to use #* if you are using RStudio to author APIs, otherwise the Run API button will not be shown.

    Clicking this button will start a local http server and display the auto-generated API documentation and test page in the location currently selected in the Run API button’s dropdown.

    When an API is running, the Run API button changes to Reload API; use this after making changes to your API source files to test them out. To stop running the APIs, click the stop icon in the RStudio Console or the pane or window showing the API preview.

    Interacting with the APIs

    The plumber package auto-generates a page showing all the APIs defined in your project and displays it in RStudio.

    In addition to providing documentation, you can use this page to make test calls to the APIs. Click on the first API, /echo, to expand it. Then click Try it out. Enter some text in the edit box, and click Execute to make a REST call to the API. Scroll down to see both an example of how to construct the REST call, e.g. via curl, and the JSON response.

    Try the other APIs, as well, and you will see that the output is not limited to JSON text; the /plot API returns an image, which is shown inline.
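    You are not limited to the generated page, either. As a purely hypothetical example (the port below is an assumption; the Console shows the actual address when the API starts), you could call the running API from another R session using httr:

    library(httr)

    res <- GET("http://127.0.0.1:7361/echo", query = list(msg = "hello"))
    status_code(res)   # 200 if the call succeeded
    content(res)       # the parsed JSON body, e.g. a list containing "The message is: 'hello'"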

    Breakpoints

    Authoring APIs directly in RStudio provides a nice workflow. As your APIs become more complex, you may need to debug the code to figure out what’s happening.

    In an R source file with plumber annotations, inserting browser() in the source code at the point you wish to start debugging will trigger a breakpoint. Then RStudio’s features can be used to examine and modify the program state, single-step, and so on. For more on RStudio debugging facilities, see this article.
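    For example, in a hypothetical /sum endpoint you might drop browser() in just before the computation you want to inspect:

    #* Sum two numbers
    #* @param a First number
    #* @param b Second number
    #* @get /sum
    function(a, b) {
      browser()   # execution pauses here the next time /sum is requested
      as.numeric(a) + as.numeric(b)
    }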

    Publishing to Connect

    Now that you have some APIs, you need to make them available to others. The easiest way is to publish to an RStudio Connect server. With Connect, you get push-button publishing, and can leverage features such as load balancing, authentication, and access-control.

    A file named plumber.R containing plumber annotations is required to publish an API to RStudio Connect.

    Next to the Run API button is the publish button. Click that to begin the publishing process.

    Upon completion, the browser launches to show your API on Connect. From there, use Connect tools and options to manage the APIs. The Publish button in RStudio may then be used to republish the APIs after you’ve made changes.

    Publishing elsewhere

    The plumber docs provide details on how to publish to other services such as DigitalOcean, and how to leverage Docker containers for running plumber APIs.

    Resources

    To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Cross-over study design with a major constraint

    Tue, 10/23/2018 - 02:00

    (This article was first published on ouR data generation, and kindly contributed to R-bloggers)

    Every new study presents its own challenges. (I would have to say that one of the great things about being a biostatistician is the immense variety of research questions that I get to wrestle with.) Recently, I was approached by a group of researchers who wanted to evaluate an intervention. Actually, they had two, but the second one was a minor tweak added to the first. They were trying to figure out how to design the study to answer two questions: (1) is intervention \(A\) better than doing nothing and (2) is \(A^+\), the slightly augmented version of \(A\), better than just \(A\)?

    It was clear in this context (and it is certainly not usually the case) that exposure to \(A\) on one day would have no effect on the outcome under \(A^+\) the next day (or vice versa). That is, spillover risks were minimal. Given this, the study was an ideal candidate for a cross-over design, where each study participant would receive both versions of the intervention and the control. This design can be much more efficient than a traditional RCT, because we can control for variability across patients.

    While a cross-over study is interesting and challenging in its own right, the researchers had a pretty serious constraint: they did not feel they could assign intervention \(A^+\) until \(A\) had been applied, which would be necessary in a proper cross-over design. So, we had to come up with something a little different.

    This post takes a look at how to generate data for and analyze data from a more standard cross-over trial, and then presents the solution we came up with for the problem at hand.

    Cross-over design with three exposures

    If we are free to assign any intervention on any day, one possible randomization scheme involving three interventions could look like this:

    Key features of this scheme are: (1) all individuals are exposed to each intervention over three days, (2) on any given day, each intervention is applied to one group of participants (just in case the specific day has an impact on the outcome), and (3) not every permutation is included (for example, \(A\) does not immediately precede \(Control\) in any sequence), because the relative ordering of interventions in this case is assumed not to matter. (We might need to expand to six groups to rectify this.)

    Data simulation

    In this simulation, we will assume (1) that the outcome is slightly elevated on days two and three, (2) \(A\) is an improvement over \(Control\), (3) \(A^+\) is an improvement over \(A\), (4) there is strong correlation of outcomes within each individual, and (5) group membership has no bearing on the outcome.

    First, I define the data, starting with the different sources of variation. I have specified a fairly high intra-class coefficient (ICC), because it is reasonable to assume that there will be quite a bit of variation across individuals:

    vTotal = 1
    vAcross <- iccRE(ICC = 0.5, varTotal = vTotal, "normal")
    vWithin <- vTotal - vAcross

    ### Definitions

    b <- defData(varname = "b", formula = 0, variance = vAcross, dist = "normal")

    d <- defCondition(condition = "rxlab == 'C'",
                      formula = "0 + b + (day == 2) * 0.5 + (day == 3) * 0.25",
                      variance = vWithin, dist = "normal")
    d <- defCondition(d, "rxlab == 'A'",
                      formula = "0.4 + b + (day == 2) * 0.5 + (day == 3) * 0.25",
                      variance = vWithin, dist = "normal")
    d <- defCondition(d, "rxlab == 'A+'",
                      formula = "1.0 + b + (day == 2) * 0.5 + (day == 3) * 0.25",
                      variance = vWithin, dist = "normal")

    Next, I generate the data, assigning three groups, each of which is tied to one of the three treatment sequences.

    set.seed(39217)

    db <- genData(240, b)
    dd <- trtAssign(db, 3, grpName = "grp")
    dd <- addPeriods(dd, 3)

    dd[grp == 1, rxlab := c("C", "A", "A+")]
    dd[grp == 2, rxlab := c("A+", "C", "A")]
    dd[grp == 3, rxlab := c("A", "A+", "C")]
    dd[, rxlab := factor(rxlab, levels = c("C", "A", "A+"))]
    dd[, day := factor(period + 1)]

    dd <- addCondition(d, dd, newvar = "Y")
    dd

    ##      timeID         Y  id period grp         b rxlab day
    ##   1:      1 0.9015848   1      0   2 0.2664571    A+   1
    ##   2:      2 1.2125919   1      1   2 0.2664571     C   2
    ##   3:      3 0.7578572   1      2   2 0.2664571     A   3
    ##   4:      4 2.0157066   2      0   3 1.1638244     A   1
    ##   5:      5 2.4948799   2      1   3 1.1638244    A+   2
    ##  ---
    ## 716:    716 1.9617832 239      1   1 0.3340201     A   2
    ## 717:    717 1.9231570 239      2   1 0.3340201    A+   3
    ## 718:    718 1.0280355 240      0   3 1.4084395     A   1
    ## 719:    719 2.5021319 240      1   3 1.4084395    A+   2
    ## 720:    720 0.4610550 240      2   3 1.4084395     C   3

    Here is a plot of the treatment averages each day for each of the three groups:

    dm <- dd[, .(Y = mean(Y)), keyby = .(grp, period, rxlab)]

    ngrps <- nrow(dm[, .N, keyby = grp])
    nperiods <- nrow(dm[, .N, keyby = period])

    ggplot(data = dm, aes(y = Y, x = period + 1)) +
      geom_jitter(data = dd, aes(y = Y, x = period + 1),
                  width = .05, height = 0, color = "grey70", size = 1) +
      geom_line(color = "grey50") +
      geom_point(aes(color = rxlab), size = 2.5) +
      scale_color_manual(values = c("#4477AA", "#DDCC77", "#CC6677")) +
      scale_x_continuous(name = "day", limits = c(0.9, nperiods + .1), breaks = c(1:nperiods)) +
      facet_grid(~ factor(grp, labels = paste("Group", 1:ngrps))) +
      theme(panel.grid = element_blank(), legend.title = element_blank())

    Estimating the effects

    To estimate the treatment effects, I will use this mixed effects linear regression model:

    \[Y_{it} = \alpha_0 + \gamma_{t} D_{it} + \beta_1 A_{it} + \beta_2 P_{it} + b_i + e_{it}\]

    where \(Y_{it}\) is the outcome for individual \(i\) on day \(t\), \(t \in (1,2,3)\). \(A_{it}\) is an indicator for treatment \(A\) in time \(t\); likewise \(P_{it}\) is an indicator for \(A^+\). \(D_{it}\) is an indicator that the outcome was recorded on day \(t\). \(b_i\) is an individual (latent) random effect, \(b_i \sim N(0, \sigma_b^2)\). \(e_{it}\) is the (also latent) noise term, \(e_{it} \sim N(0, \sigma_e^2)\).

    The parameter \(\alpha_0\) is the mean outcome on day 1 under \(Control\). The \(\gamma\)’s are the day-specific effects for days 2 and 3, with \(\gamma_1\) fixed at 0. \(\beta_1\) is the effect of \(A\) (relative to \(Control\)) and \(\beta_2\) is the effect of \(A^+\). In this case, the researchers were primarily interested in \(\beta_1\) and \(\beta_2 - \beta_1\), which is the incremental change from \(A\) to \(A^+\).

    library(lme4)

    lmerfit <- lmer(Y ~ day + rxlab + (1|id), data = dd)
    rndTidy(lmerfit)

    ##                       term estimate std.error statistic    group
    ## 1:             (Intercept)    -0.14      0.08     -1.81    fixed
    ## 2:                    day2     0.63      0.06      9.82    fixed
    ## 3:                    day3     0.38      0.06      5.97    fixed
    ## 4:                  rxlabA     0.57      0.06      8.92    fixed
    ## 5:                 rxlabA+     0.98      0.06     15.35    fixed
    ## 6:       sd_(Intercept).id     0.74        NA        NA       id
    ## 7: sd_Observation.Residual     0.70        NA        NA Residual

    As to why we would want to bother with this complex design if we could just randomize individuals to one of three treatment groups, this little example using a more standard parallel design might provide a hint:

    def2 <- defDataAdd(varname = "Y",
                       formula = "0 + (frx == 'A') * 0.4 + (frx == 'A+') * 1",
                       variance = 1, dist = "normal")

    dd <- genData(240)
    dd <- trtAssign(dd, nTrt = 3, grpName = "rx")
    dd <- genFactor(dd, "rx", labels = c("C", "A", "A+"), replace = TRUE)
    dd <- addColumns(def2, dd)

    lmfit <- lm(Y ~ frx, data = dd)
    rndTidy(lmfit)

    ##           term estimate std.error statistic p.value
    ## 1: (Intercept)    -0.12      0.10     -1.15    0.25
    ## 2:        frxA     0.64      0.15      4.38    0.00
    ## 3:       frxA+     1.01      0.15      6.86    0.00

    If we compare the standard error for the effect of \(A^+\) in the two studies, the cross-over design is much more efficient (i.e. standard error is considerably smaller: 0.06 vs. 0.15). This really isn’t so surprising since we have collected a lot more data and modeled variation across individuals in the cross-over study.
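    A crude way to quantify that gain, ignoring the fact that the cross-over design also collects more measurements per person: precision scales with \(1/SE^2\), so the parallel design would need roughly \((0.15/0.06)^2\) times as many participants to match the cross-over design's precision for the \(A^+\) effect.

    (0.15 / 0.06)^2
    ## [1] 6.25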

    Constrained cross-over design

    Unfortunately, the project was not at liberty to implement the three-way/three-day design just simulated. We came up with this approach that would provide some cross-over, but with an added day of treatment and measurement:

    The data generation is slightly modified, though the original definitions can still be used:

    db <- genData(240, b)
    dd <- trtAssign(db, 2, grpName = "grp")
    dd <- addPeriods(dd, 4)

    dd[grp == 0, rxlab := c("C", "C", "A", "A+")]
    dd[grp == 1, rxlab := c("C", "A", "A+", "A")]
    dd[, rxlab := factor(rxlab, levels = c("C", "A", "A+"))]
    dd[, day := factor(period + 1)]

    dd <- addCondition(d, dd, "Y")

    The model estimates indicate slightly higher standard errors than in the pure cross-over design:

    lmerfit <- lmer(Y ~ day + rxlab + (1|id), data = dd)
    rndTidy(lmerfit)

    ##                       term estimate std.error statistic    group
    ## 1:             (Intercept)     0.15      0.06      2.36    fixed
    ## 2:                    day2     0.48      0.08      6.02    fixed
    ## 3:                    day3     0.16      0.12      1.32    fixed
    ## 4:                    day4    -0.12      0.12     -1.02    fixed
    ## 5:                  rxlabA     0.46      0.10      4.70    fixed
    ## 6:                 rxlabA+     1.14      0.12      9.76    fixed
    ## 7:       sd_(Intercept).id     0.69        NA        NA       id
    ## 8: sd_Observation.Residual     0.68        NA        NA Residual

    Here are the key parameters of interest (refit using package lmerTest to get the contrasts). The confidence intervals include the true values (\(\beta_1 = 0.4\) and \(\beta_2 - \beta_1 = 0.6\)):

    library(lmerTest)

    lmerfit <- lmer(Y ~ day + rxlab + (1|id), data = dd)

    # Each row of L defines a contrast of the fixed effects
    # (intercept, day2, day3, day4, rxlabA, rxlabA+):
    # row 1 picks out beta_1 (the effect of A); row 2 gives beta_2 - beta_1 (A+ vs A)
    L <- matrix(c(0, 0, 0, 0, 1, 0,
                  0, 0, 0, 0, -1, 1), nrow = 2, ncol = 6, byrow = TRUE)

    con <- data.table(contest(lmerfit, L, confint = TRUE, joint = FALSE))
    round(con[, .(Estimate, `Std. Error`, lower, upper)], 3)

    ##    Estimate Std. Error lower upper
    ## 1:    0.462      0.098 0.269 0.655
    ## 2:    0.673      0.062 0.551 0.795

    Exploring bias

    A single data set does not tell us if the proposed approach is indeed unbiased. Here, I generate 1000 data sets and fit the mixed effects model. In addition, I fit a model that ignores the day factor to see if it will induce bias (of course it will).

    iter <- 1000
    ests <- vector("list", iter)
    xests <- vector("list", iter)

    for (i in 1:iter) {

      db <- genData(240, b)
      dd <- trtAssign(db, 2, grpName = "grp")
      dd <- addPeriods(dd, 4)

      dd[grp == 0, rxlab := c("C", "C", "A", "A+")]
      dd[grp == 1, rxlab := c("C", "A", "A+", "A")]
      dd[, rxlab := factor(rxlab, levels = c("C", "A", "A+"))]
      dd[, day := factor(period + 1)]

      dd <- addCondition(d, dd, "Y")

      lmerfit <- lmer(Y ~ day + rxlab + (1|id), data = dd)
      xlmerfit <- lmer(Y ~ rxlab + (1|id), data = dd)

      ests[[i]] <- data.table(estA = fixef(lmerfit)[5],
                              estAP = fixef(lmerfit)[6] - fixef(lmerfit)[5])
      xests[[i]] <- data.table(estA = fixef(xlmerfit)[2],
                               estAP = fixef(xlmerfit)[3] - fixef(xlmerfit)[2])
    }

    ests <- rbindlist(ests)
    xests <- rbindlist(xests)

    The results for the correct model estimation indicate that there is no bias (and that the standard error estimates from the model fit above were correct):

    ests[, .(A.est = round(mean(estA), 3), A.se = round(sd(estA), 3),
             AP.est = round(mean(estAP), 3), AP.se = round(sd(estAP), 3))]

    ##    A.est  A.se AP.est AP.se
    ## 1: 0.407 0.106  0.602  0.06

    In contrast, the estimates that ignore the day or period effect are in fact biased (as predicted):

    xests[, .(A.est = round(mean(estA), 3), A.se = round(sd(estA), 3),
              AP.est = round(mean(estAP), 3), AP.se = round(sd(estAP), 3))]

    ##    A.est  A.se AP.est AP.se
    ## 1: 0.489 0.053  0.474 0.057

    To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
