R-bloggers – R news and tutorials contributed by 750 R bloggers

R Weekly Bulletin Vol – XI

Sat, 06/10/2017 - 20:48

(This article was first published on R programming, and kindly contributed to R-bloggers)

This week’s R bulletin will cover topics including how to round to the nearest desired number, how to convert and compare dates, and how to remove the last n characters from an element.

We will also cover functions like rank, mutate, transmute, and set.seed. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. Comment/uncomment current line/selection – Ctrl+Shift+C
2. Move Lines Up/Down – Alt+Up/Down
3. Delete Line – Ctrl+D

Problem Solving Ideas

Rounding to the nearest desired number

Consider a case where you want to round a given number to the nearest 25. This can be done in the following manner:

round(145/25) * 25

[1] 150

floor(145/25) * 25

[1] 125

ceiling(145/25) * 25

[1] 150

Usage:
Suppose you are calculating a stop loss or take profit for an NSE stock whose minimum tick size is 5 paisa (0.05). In such a case, we divide by 0.05, apply the rounding function, and multiply by 0.05 to achieve the desired outcome.

Example:

Price = 566
Stop_loss = 1/100

# without rounding
SL = Price * Stop_loss
print(SL)

[1] 5.66

# with rounding to the nearest 0.05
SL1 = floor((Price * Stop_loss)/0.05) * 0.05
print(SL1)

[1] 5.65

How to remove last n characters from every element

To remove the last n characters we will use the substr function along with the nchar function. The example below illustrates how to do it.

Example:

# In this case, we just want to retain the ticker name, which is "TECHM"
symbol = "TECHM.EQ-NSE"
s = substr(symbol, 1, nchar(symbol) - 7)
print(s)

[1] “TECHM”

Converting and Comparing dates in different formats

When we pull stock data from Google finance the date appears as “YYYYMMDD”, which is not recognized as a date-time object. To convert it into a date-time object we can use the “ymd” function from the lubridate package.

Example:

library(lubridate)
x = ymd(20160724)
print(x)

[1] “2016-07-24”

Another data provider gives stock data which has the date-time object in the American format (mm/dd/yyyy). When we read the file, the date-time column is read as a character. We need to convert this into a date-time object. We can convert it using the as.Date function and by specifying the format.

dt = "07/24/2016"
y = as.Date(dt, format = "%m/%d/%Y")
print(y)

[1] “2016-07-24”

# Comparing the two date-time objects (from Google Finance and the data provider) after conversion
identical(x, y)

[1] TRUE

Functions Demystified

rank function

The rank function returns the sample ranks of the values in a vector. Ties (i.e., equal values) and missing values can be handled in several ways.

rank(x, na.last = TRUE, ties.method = c("average", "first", "random", "max", "min"))

where,
x: numeric, complex, character or logical vector
na.last: for controlling the treatment of NAs. If TRUE, missing values in the data are put last; if FALSE, they are put first; if NA, they are removed; if “keep” they are kept with rank NA
ties.method: a character string specifying how ties are treated

Examples:

x <- c(3, 5, 1, -4, NA, Inf, 90, 43)
rank(x)

[1] 3 4 2 1 8 7 6 5

rank(x, na.last = FALSE)

[1] 4 5 3 2 1 8 7 6
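
The examples above contain no ties. As a quick, hedged illustration of the ties.method argument, the vector y below is made up for this purpose and is not part of the original bulletin:

# a small vector with a tie between the two 2s (illustrative only)
y <- c(2, 2, 1, 3)

rank(y)                         # ties share the average rank: 2.5 2.5 1.0 4.0
rank(y, ties.method = "first")  # earlier occurrence is ranked first: 2 3 1 4
rank(y, ties.method = "min")    # tied values get the minimum rank: 2 2 1 4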

mutate and transmute functions

The mutate and transmute functions are part of the dplyr package. The mutate function computes new variables using the existing variables of a given data frame and adds them to that data frame. The transmute function, on the other hand, returns a data frame containing only the newly created variables.

Consider the data frame “df” given in the example below. Suppose we have 5 observations of 1-minute price data for a stock, and we want to create a new variable by subtracting the mean from the 1-minute closing prices. It can be done in the following manner using the mutate function.

Example:

library(dplyr)
OpenPrice = c(520, 521.35, 521.45, 522.1, 522)
ClosePrice = c(521, 521.1, 522, 522.25, 522.4)
Volume = c(2000, 3500, 1750, 2050, 1300)
df = data.frame(OpenPrice, ClosePrice, Volume)
print(df)

df_new = mutate(df, cpmean_diff = ClosePrice - mean(ClosePrice, na.rm = TRUE))
print(df_new)

# If we want the new variable as a separate data frame, we can use the transmute function instead.
df_new = transmute(df, cpmean_diff = ClosePrice - mean(ClosePrice, na.rm = TRUE))
print(df_new)

set.seed function

The set.seed function helps generate the same sequence of random numbers every time the program runs. It sets the random number generator to a known state. The function takes a single integer argument; using the same integer restores the same initial state.

Example:

# Initialize the random number generator to a known state and generate five random numbers
set.seed(100)
runif(5)

[1] 0.30776611 0.25767250 0.55232243 0.05638315 0.46854928

# Reinitialize to the same known state and generate the same five 'random' numbers
set.seed(100)
runif(5)

[1] 0.30776611 0.25767250 0.55232243 0.05638315 0.46854928

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.


To leave a comment for the author, please follow the link and comment on their blog: R programming.

Density-Based Clustering Exercises

Sat, 06/10/2017 - 17:50

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


Density-based clustering is a technique that allows you to partition data into groups with similar characteristics (clusters) but does not require specifying the number of those groups in advance. In density-based clustering, clusters are defined as dense regions of data points separated by low-density regions. Density is measured by the number of data points within some radius.
Advantages of density-based clustering:

  • as mentioned above, it does not require a predefined number of clusters,
  • clusters can be of any shape, including non-spherical ones,
  • the technique is able to identify noise data (outliers).

Disadvantages:

  • density-based clustering fails if there are no density drops between clusters,
  • it is also sensitive to parameters that define density (radius and the minimum number of points); proper parameter setting may require domain knowledge.

There are different methods of density-based clustering. The most popular are DBSCAN (density-based spatial clustering of applications with noise), which assumes constant density of clusters, OPTICS (ordering points to identify the clustering structure), which allows for varying density, and “mean-shift”.
This set of exercises covers basic techniques for using the DBSCAN method, and allows you to compare its results to those of the k-means clustering algorithm by means of silhouette analysis.
The set requires the packages dbscan, cluster, and factoextra to be installed. The exercises make use of the iris data set, which is supplied with R, and the wholesale customers data set from the University of California, Irvine (UCI) machine learning repository (download here).
Answers to the exercises are available here.
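
Before starting, here is a minimal, hedged sketch of the overall DBSCAN workflow these exercises walk through. It is not the answer key; the eps value of 0.4 and the choice of the scaled iris columns are illustrative assumptions only.

library(dbscan)
library(factoextra)

# scale the numeric columns of iris (illustrative preparation)
dat <- as.data.frame(scale(iris[, -5]))

# distances to each point's 5th nearest neighbour help pick an eps threshold
kNNdistplot(dat, k = 5)
abline(h = 0.4, lty = 2)  # tentative threshold (assumed value)

# density-based clusters with that eps and a minimum of 5 points per neighbourhood
db <- dbscan(dat, eps = 0.4, minPts = 5)
print(db)

# plot the clusters as points, without outlines around them
fviz_cluster(db, data = dat, geom = "point", ellipse = FALSE)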

Exercise 1
Create a new data frame using all but the last variable from the iris data set, which is supplied with R.

Exercise 2
Use the scale function to normalize values of all variables in the new data set (with default settings). Ensure that the resulting object is of class data.frame.

Exercise 3
Plot the distribution of distances between data points and their fifth nearest neighbors using the kNNdistplot function from the dbscan package.
Examine the plot and find a tentative threshold at which distances start increasing quickly. On the same plot, draw a horizontal line at the level of the threshold.

Exercise 4
Use the dbscan function from the package of the same name to find density-based clusters in the data. Set the size of the epsilon neighborhood at the level of the found threshold, and set the number of minimum points in the epsilon region equal to 5.
Assign the value returned by the function to an object, and print that object.

Exercise 5
Plot the clusters with the fviz_cluster function from the factoextra package. Choose the geometry type to draw only points on the graph, and assign the ellipse parameter value such that an outline around points of each cluster is not drawn.
(Note that the fviz_cluster function produces a 2-dimensional plot. If the data set contains two variables those variables are used for plotting, if the number of variables is bigger the first two principal components are drawn.)

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

  • Delve into various algorithms for classification such as KNN and see how they are applied in R
  • Evaluate k-Means, Connectivity, Distribution, and Density based clustering
  • And much more

Exercise 6
Examine the structure of the cluster object obtained in Exercise 4, and find the vector with cluster assignments. Make a copy of the data set, add the vector of cluster assignments to the data set, and print its first few lines.

Exercise 7
Now look at what happens if you change the epsilon value.

  1. Plot again the distribution of distances between data points and their fifth nearest neighbors (with the kNNdistplot function, as in Exercise 3). On that plot, draw horizontal lines at levels 1.8, 0.5, and 0.4.
  2. Use the dbscan function to find clusters in the data with the epsilon set at these values (as in Exercise 4).
  3. Plot the results (as in the Exercise 5, but now set the ellipse parameter value such that an outline around points is drawn).

Exercise 8
This exercise shows how the DBSCAN algorithm can be used as a way to detect outliers:

  1. Load the Wholesale customers data set, and delete all variables with the exception of Fresh and Milk. Assign the data set to the customers variable.
  2. Discover clusters using the steps from Exercises 2-5: scale the data, choose an epsilon value, find clusters, and plot them. Set the number of minimum points to 5. Use the db_clusters_customers variable to store the output of the dbscan function.

Exercise 9
Compare the results obtained in the previous exercise with the results of the k-means algorithm. First, find clusters using this algorithm:

  1. Use the same data set, but get rid of outliers for both variables (here the outliers may be defined as values beyond 2.5 standard deviations from the mean; note that the values are already expressed in units of standard deviations about the mean). Assign the new data set to the customers_core variable.
  2. Use the kmeans function to obtain an object with cluster assignments. Set the number of centers equal to 4, and the number of initial random sets (the nstart parameter) equal to 10. Assign the obtained object to the km_clusters_customers variable.
  3. Plot clusters using the fviz_cluster function (as in the previous exercise).

Exercise 10
Now compare the results of DBSCAN and k-means using silhouette analysis:

  1. Retrieve a vector of cluster assignments from the db_clusters_customers object.
  2. Calculate distances between data points in the customers data set using the dist function (with the default parameters).
  3. Use the vector and the distances object as inputs into the silhouette function from the cluster package to get a silhouette information object.
  4. Plot that object with the fviz_silhouette function from the factoextra package.
  5. Repeat the steps described above for the km_clusters_customers object and the customers_core data sets.
  6. Compare two plots and the average silhouette width values.
Related exercise sets:
  1. Data science for Doctors: Cluster Analysis Exercises
  2. Hierarchical Clustering exercises (beginner)
  3. Building Shiny App exercises part 7
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

UK 2017 General Election Results Data

Sat, 06/10/2017 - 12:53

(This article was first published on fReigeist » R, and kindly contributed to R-bloggers)

As the reality of a hung parliament starts to sink in, economists, political scientists and commentators will begin their usual routine of “post mortem” analysis of the surprise result of the UK 2017 general election. My co-authors Sascha Becker and Dennis Novy have done a similar exercise studying the EU Referendum last year [see also here] and have worked on the question of whether migration contributed to an erosion of pro-EU sentiment [see also here].

For people wanting to get to work straight away, there are a few things slowing us down. The last constituency, Kensington, was not called until last night, and so I don’t expect the UK’s Election Commission to post the final tally of votes across all constituencies anytime before next week. Nevertheless, the crude election results data can be “scraped” from some infographics. This post describes how…

The Economist’s Infographics

The Economist, among other newspapers, provides a very nice infographic – behind that infographic lies a web service that can be queried with JSON-formatted requests.

Each Parliamentary constituency has an identifier code that can be used to query the web service and pull the results. The URL for a request is quite simple:

http://infographics.economist.com/2017/ukelmap-2017/data/live-results2017/r2017E14000937.json

This provides the results for the constituency Cambridgeshire, South East. The JSON object looks as follows

resultCB({"swing": -3.84, "mpn": "Lucy Frazer", "electorate": "86121", "lib": 11958, "id": "E14000937", "name": "Cambridgeshire, South East", "lab": 17443, "con": 33601, "status": "hold", "pa_key": "123", "oth": 0, "region": "East Of England", "win": "con", "turnout": "63002"})

This piece of JavaScript calls a function resultCB that updates one of the views of the infographic.

In order to convert this to an R data frame, we can use the fromJSON function from the RJSONIO or jsonlite packages, after having removed the part that calls the function, i.e.

library(jsonlite)
as.data.frame(fromJSON(gsub("\\)$", "", gsub("resultCB\\(", "", readLines(con = "http://infographics.economist.com/2017/ukelmap-2017/data/live-results2017/r2017E14000937.json")))))
##          id pa_key oth                       name win status swing   lib
## 1 E14000937    123   0 Cambridgeshire, South East con   hold -3.84 11958
##            region         mpn electorate turnout   lab   con
## 1 East Of England Lucy Frazer      86121   63002 17443 33601

In order to build a data.frame of all election results, all that is necessary is to loop over the set of constituency codes available. I share the results from this step in the following spreadsheet Data for UK 2017 General Election Results (Economist Infographic).
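
A hedged sketch of that loop is shown below; the helper function name and the two example codes are illustrative assumptions, and in practice the codes vector would hold the identifiers of all constituencies.

library(jsonlite)

# fetch one constituency and return it as a one-row data frame (illustrative helper)
get_constituency <- function(code) {
  url <- paste0("http://infographics.economist.com/2017/ukelmap-2017/data/live-results2017/r2017",
                code, ".json")
  txt <- readLines(con = url, warn = FALSE)
  # strip the JSONP wrapper resultCB( ... ) before parsing
  as.data.frame(fromJSON(gsub("\\)$", "", gsub("resultCB\\(", "", txt))))
}

codes <- c("E14000937", "E14000938")  # example codes only
results <- do.call(rbind, lapply(codes, get_constituency))
head(results)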

To leave a comment for the author, please follow the link and comment on their blog: fReigeist » R.

Schedule for useR!2017 now available

Fri, 06/09/2017 - 21:17

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The full schedule of talks for useR!2017, the global R user conference, has now been posted. The conference will feature 16 tutorials, 6 keynotes, 141 full talks, and 86 lightning talks starting on July 5 in Brussels. That's a lot to fit into 4 days, but I'm especially looking forward to the keynote presentations:

I'm also pleased to be attending with several of my Microsoft colleagues. You can find our talks below.

I hope you can attend too! Registration is still open if you'd like to join in. You can find the complete schedule linked below.

Sched: useR!2017

 

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Big Data Manipulation in R Exercises

Fri, 06/09/2017 - 17:50

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


Sometimes it is necessary to download really big CSV files to deliver some analysis. When file sizes reach gigabytes, it is useful to use R instead of spreadsheets. This exercise set teaches us how to manipulate this kind of file.

Answers to the exercises are available here.
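
As a warm-up, here is a minimal, hedged sketch of the pattern these exercises build up (read several files into a list, filter each one, stack them and write the result). The file names follow Exercise 2, but the Ref_Date column name and the output file name are assumptions for illustration and may not match the actual Statistics Canada files.

library(dplyr)
library(readr)

files <- c("00540002-eng", "00540005-eng")  # two of the file names listed in Exercise 2

# read each CSV into a list element and keep only 2014 rows to reduce the size
data <- lapply(files, function(f) {
  read_csv(paste0(f, ".csv")) %>%
    filter(Ref_Date == 2014)  # assumed name of the date column
})

# stack all list elements into a single data frame and write it out
all_data <- bind_rows(data)
write_csv(all_data, "income_2014.csv")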

Exercise 1
Create a directory canada immigration/Work/Income and put all files related to income there, then load dplyr.
Download data set from here.

Exercise 2
Create a string vector with file names: 00540002-eng, 00540005-eng, 00540007-eng, 00540009-eng, 00540011-eng, 00540013-eng, 00540015-eng, and 00540017-eng.

Exercise 3
Create a list of data frames and put the data of each file in its own list position. For example, data[[1]] will contain the first file. To reduce the data size, for each data set select only data from 2014.

Exercise 4
Clean up the first data set in the list (data[[1]]) and exclude records that summarize others, like “Both sexes”, to avoid double counting when summarizing.

Exercise 5
Clean up all other data sets in the list and exclude records the same way as described in Exercise 4. Then, pile up all the data in a single data set.

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

  • import data into R in several ways while also being able to identify a suitable import tool
  • use SQL code within R
  • And much more

Exercise 6
Write a CSV file with the recently created data set.

Exercise 7
Create a directory canada immigration/Work/Income and put all files related to income there, then load dplyr.
Download data set from here.
Create a string vector with file names: 00540018-eng, 00540019-eng, 00540020-eng, 00540021-eng, 00540022-eng, 00540023-eng, 00540024-eng, and 00540025-eng.
Create a list of data frames and put the data of each file in its own list position. For example, data[[1]] will contain the first file. To reduce the data size, for each data set select only data from 2014.

Exercise 8
Clean up the first data set in the list (data[[1]]) and exclude records that summarize others, like “Both sexes”, to avoid double counting when summarizing.

Exercise 9
Clean up all other data sets in the list and exclude records the same way as described in Exercise 8. Then, pile up all the data in a single data set.

Exercise 10
Write a CSV file with the recently created data set.

Related exercise sets:
  1. Forecasting: ARIMAX Model Exercises (Part-5)
  2. Data wrangling : I/O (Part-1)
  3. Bind exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Managing intermediate results when using R/sparklyr

Fri, 06/09/2017 - 17:37

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

In our latest “R and big data” article we show how to manage intermediate results in non-trivial Apache Spark workflows using R, sparklyr, dplyr, and replyr.


Handle management

Many Sparklyr tasks involve creation of intermediate or temporary tables. This can be through dplyr::copy_to() and through dplyr::compute(). These handles can represent a reference leak and eat up resources.

To help control handle lifetime, the replyr package supplies record-retaining temporary name generators (and uses the same internally).

The actual function is pretty simple:

print(replyr::makeTempNameGenerator)
## function(prefix,
##          suffix= NULL) {
##   force(prefix)
##   if((length(prefix)!=1)||(!is.character(prefix))) {
##     stop("repyr::makeTempNameGenerator prefix must be a string")
##   }
##   if(is.null(suffix)) {
##     alphabet <- c(letters, toupper(letters), as.character(0:9))
##     suffix <- paste(base::sample(alphabet, size=20, replace= TRUE),
##                     collapse = '')
##   }
##   count <- 0
##   nameList <- c()
##   function(dumpList=FALSE) {
##     if(dumpList) {
##       v <- nameList
##       nameList <<- c()
##       return(v)
##     }
##     nm <- paste(prefix, suffix, sprintf('%010d',count), sep='_')
##     nameList <<- c(nameList, nm)
##     count <<- count + 1
##     nm
##   }
## }

For instance, to join a few tables it can be a good idea to call compute after each join (else the generated SQL can become large and unmanageable). This sort of code looks like the following:

# create example data
names <- paste('table', 1:5, sep='_')
tables <- lapply(names, function(ni) {
  di <- data.frame(key= 1:3)
  di[[paste('val', ni, sep='_')]] <- runif(nrow(di))
  copy_to(sc, di, ni)
})

# build our temp name generator
tmpNamGen <- replyr::makeTempNameGenerator('JOINTMP')

# left join the tables in sequence
joined <- tables[[1]]
for(i in seq(2, length(tables))) {
  ti <- tables[[i]]
  if(i < length(tables)) {
    joined <- compute(left_join(joined, ti, by='key'),
                      name= tmpNamGen())
  } else {
    # use non-temp name.
    joined <- compute(left_join(joined, ti, by='key'),
                      name= 'joinres')
  }
}

# clean up temps
temps <- tmpNamGen(dumpList = TRUE)
print(temps)
## [1] "JOINTMP_9lWXvfnkhI2NPRsA1tEh_0000000000"
## [2] "JOINTMP_9lWXvfnkhI2NPRsA1tEh_0000000001"
## [3] "JOINTMP_9lWXvfnkhI2NPRsA1tEh_0000000002"

for(ti in temps) {
  db_drop_table(sc, ti)
}

# show result
print(joined)
## Source:   query [3 x 6]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 3 x 6
##     key val_table_1 val_table_2 val_table_3 val_table_4 val_table_5
##
## 1     1   0.7594355   0.8082776 0.696254059   0.3777300  0.30015615
## 2     2   0.4082232   0.8101691 0.005687125   0.9382002  0.04502867
## 3     3   0.5941884   0.7990701 0.874374779   0.7936563  0.19940400

Careful introduction and management of materialized intermediates can conserve resources (both time and space) and greatly improve outcomes. We feel it is a good practice to set up an explicit temp name manager, pass it through all your Sparklyr transforms, and then clear temps in batches after the results no longer depend on the intermediates.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Unconf projects 5: mwparser, Gargle, arresteddev

Fri, 06/09/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

And finally, we end our series of unconf project summaries (day 1, day 2, day 3, day 4).

mwparser

Summary: Wikimarkup is the language used on Wikipedia and similar projects, and as such contains a lot of valuable data both for scientists studying collaborative systems and people studying things documented on or in Wikipedia. mwparser parses wikimarkup, allowing a user to filter down to specific types of tags such as links or templates, and then extract components of those tags.

Team: Oliver Keyes

Github: https://github.com/Ironholds/mwparser

Gargle

Summary: Gargle is a library that provides authentication for Google APIs but without all the agonizing pain. The package provides helper functions (for httr) to support automatic retries, paging, and progress bars for API calls.

Team: Craig Citro

Github: https://github.com/r-lib/gargle

arresteddev

Summary: This package is designed to help troubleshoot errors that come up during package and analysis development. As of now, the package helps track tracebacks and errors but more functionality is planned for the future.

Team: Lucy D'Agostino McGowan, Karthik Ram, Miles McBain

Github: https://github.com/ropenscilabs/arresteddev

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.

R Interface to Spark

Fri, 06/09/2017 - 06:30

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

SparkR

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local")
df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true")

### SUMMARY TABLE WITH SQL
createOrReplaceTempView(df1, "tbl1")
summ <- sql("select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 where month in (1, 3, 5) group by month")
head(summ)
#   month  avg_dep  avg_arr
# 1     1 1347.210 1523.155
# 2     3 1359.500 1509.743
# 3     5 1351.168 1502.685

### SUMMARY TABLE WITH AGG()
grp <- groupBy(filter(df1, "month in (1, 3, 5)"), "month")
summ <- agg(grp, avg_dep = avg(df1$dep_time), avg_arr = avg(df1$arr_time))
head(summ)
#   month  avg_dep  avg_arr
# 1     1 1347.210 1523.155
# 2     3 1359.500 1509.743
# 3     5 1351.168 1502.685

sparklyr

library(sparklyr)
sc <- spark_connect(master = "local")
df1 <- spark_read_csv(sc, name = "tbl1", path = "nycflights13.csv", header = TRUE, infer_schema = TRUE)

### SUMMARY TABLE WITH SQL
library(DBI)
summ <- dbGetQuery(sc, "select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 where month in (1, 3, 5) group by month")
head(summ)
#   month  avg_dep  avg_arr
# 1     5 1351.168 1502.685
# 2     1 1347.210 1523.155
# 3     3 1359.500 1509.743

### SUMMARY TABLE WITH DPLYR
library(dplyr)
summ <- df1 %>%
  filter(month %in% c(1, 3, 5)) %>%
  group_by(month) %>%
  summarize(avg_dep = mean(dep_time), avg_arr = mean(arr_time))
head(summ)
#   month  avg_dep  avg_arr
#
# 1     5 1351.168 1502.685
# 2     1 1347.210 1523.155
# 3     3 1359.500 1509.743

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

Data Science for Business – Time Series Forecasting Part 2: Forecasting with timekit

Fri, 06/09/2017 - 02:00

In my last post, I prepared and visually explored time series data.

Now, I will use this data to test the timekit package for time series forecasting with machine learning.

Forecasting

In time series forecasting, we use models to predict future time points based on past observations.

As mentioned in timekit’s vignette, “as with most machine learning applications, the prediction is only as good as the patterns in the data. Forecasting using this approach may not be suitable when patterns are not present or when the future is highly uncertain (i.e. past is not a suitable predictor of future performance).”

And while this is certainly true, we don’t always have data with a strong regular pattern. And, I would argue, data that has very obvious patterns doesn’t need a complicated model to generate forecasts – we can already guess the future curve just by looking at it. So, if we think of use cases for businesses that want to predict e.g. product sales, forecasting models are especially relevant in cases where we can’t make predictions manually or based on experience.

The packages I am using are timekit for forecasting, tidyverse for data wrangling and visualization, caret for additional modeling functions, tidyquant for its ggplot theme, broom and modelr for (tidy) modeling.

library(tidyverse)
library(caret)
library(tidyquant)
library(broom)
library(timekit)
library(modelr)

options(na.action = na.warn)

Training and test data

My input data is the tibble retail_p_day, that was created in my last post.

I am splitting this dataset into training (all data points before/on Nov. 1st 2011) and test samples (all data points after Nov. 1st 2011).

retail_p_day <- retail_p_day %>%
  mutate(model = ifelse(day <= "2011-11-01", "train", "test"))

colnames(retail_p_day)[grep("^[0-9]+", colnames(retail_p_day))] <- paste0("P_", colnames(retail_p_day)[grep("^[0-9]+", colnames(retail_p_day))])

Here, I am testing out timekit’s functions with the net income per day as response variable. Because the time series in our data set is relatively short and doesn’t cover multiple years, this forecast will only be able to capture recurring variation in days and weeks. Variations like increased sales before holidays, etc. would need additional data from several years to be accurately forecast.

As we can see in the plot below, the net income shows variation between days.

retail_p_day %>%
  ggplot(aes(x = day, y = sum_income, color = model)) +
    geom_point(alpha = 0.5) +
    geom_line(alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq()

Augmenting the time series signature

With timekit, we can do forecasting with only a time series signature (a series of dates and times) and a corresponding response variable. If we had additional features that could be forecast independently, we could also introduce these into the model, but here, I will only work with the minimal data set.

A central function of timekit is tk_augment_timeseries_signature(), which adds a number of features based on the properties of our time series signature:

  • index: The index value that was decomposed
  • index.num: The numeric value of the index in seconds. The base is “1970-01-01 00:00:00”.
  • diff: The difference in seconds from the previous numeric index value.
  • year: The year component of the index.
  • half: The half component of the index.
  • quarter: The quarter component of the index.
  • month: The month component of the index with base 1.
  • month.xts: The month component of the index with base 0, which is what xts implements.
  • month.lbl: The month label as an ordered factor beginning with January and ending with December.
  • day: The day component of the index.
  • hour: The hour component of the index.
  • minute: The minute component of the index.
  • second: The second component of the index.
  • hour12: The hour component on a 12 hour scale.
  • am.pm: Morning (AM) = 1, Afternoon (PM) = 2.
  • wday: The day of the week with base 1. Sunday = 1 and Saturday = 7.
  • wday.xts: The day of the week with base 0, which is what xts implements. Sunday = 0 and Saturday = 6.
  • wday.lbl: The day of the week label as an ordered factor beginning with Sunday and ending with Saturday.
  • mday: The day of the month.
  • qday: The day of the quarter.
  • yday: The day of the year.
  • mweek: The week of the month.
  • week: The week number of the year (Sunday start).
  • week.iso: The ISO week number of the year (Monday start).
  • week2: The modulus for bi-weekly frequency.
  • week3: The modulus for tri-weekly frequency.
  • week4: The modulus for quad-weekly frequency.
  • mday7: The integer division of day of the month by seven, which returns the first, second, third, … instance the day has appeared in the month. Values begin at 1. For example, the first Saturday in the month has mday7 = 1. The second has mday7 = 2.

Because the value of diff is missing for the first row, I am removing this row. We also need to keep in mind that we have an irregular time series, because we never have data on Saturdays. This will affect the modeling and results, and we need to account for this later on! Alternatively, it might make sense to compare the results when setting all NAs/Saturdays to 0, assuming that no information means that there was no income on a given day. Or we could impute missing values. Which strategy most accurately represents your data needs to be decided based on a good understanding of the business and how the data was collected.
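
For completeness, here is a hedged sketch of the "set missing days to zero" alternative mentioned above (assuming day is a Date column; this variant is not used further in this post):

library(tidyr)

# fill in every calendar day and treat days without records as zero-income days
retail_zero <- retail_p_day %>%
  complete(day = seq(min(day), max(day), by = "day")) %>%
  mutate(sum_income = ifelse(is.na(sum_income), 0, sum_income))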

retail_p_day_aug <- retail_p_day %>%
  rename(date = day) %>%
  select(model, date, sum_income) %>%
  tk_augment_timeseries_signature() %>%
  select(-contains("month"))

retail_p_day_aug <- retail_p_day_aug[complete.cases(retail_p_day_aug), ]

Preprocessing

Not all of these augmented features will be informative for our model. For example, we don’t have information about time of day, so features like hour, minute, second, etc. will be irrelevant here.

Let’s look at column variation for all numeric features and remove those with a variance of 0.

library(matrixStats)

(var <- data.frame(colnames = colnames(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]),
                   colvars = colVars(as.matrix(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]))) %>%
  filter(colvars == 0))
##   colnames colvars
## 1     hour       0
## 2   minute       0
## 3   second       0
## 4   hour12       0
## 5    am.pm       0

retail_p_day_aug <- select(retail_p_day_aug, -one_of(as.character(var$colnames)))

Next, we want to remove highly correlated features. By plotting them, we can get an idea about which cutoff to set.

library(ggcorrplot)

cor <- cor(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)])
p.cor <- cor_pmat(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)])

ggcorrplot(cor, type = "upper", outline.col = "white", hc.order = TRUE, p.mat = p.cor,
           colors = c(palette_light()[1], "white", palette_light()[2]))

I am going to choose a cutoff of 0.9 for removing features:

cor_cut <- findCorrelation(cor, cutoff = 0.9)
retail_p_day_aug <- select(retail_p_day_aug, -one_of(colnames(cor)[cor_cut]))

Now, I can split the data into training and test sets:

train <- filter(retail_p_day_aug, model == "train") %>% select(-model)
test <- filter(retail_p_day_aug, model == "test")

Modeling

To model the time series of the response variable sum_income, I am using a generalized linear model. We could try all kinds of different models and modeling parameters, but for this test I am keeping it simple.

fit_lm <- glm(sum_income ~ ., data = train)

We can examine our model e.g. by visualizing:

tidy(fit_lm) %>%
  gather(x, y, estimate:p.value) %>%
  ggplot(aes(x = term, y = y, color = x, fill = x)) +
    facet_wrap(~ x, scales = "free", ncol = 4) +
    geom_bar(stat = "identity", alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    scale_fill_manual(values = palette_light()) +
    theme_tq() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

augment(fit_lm) %>%
  ggplot(aes(x = date, y = .resid)) +
    geom_hline(yintercept = 0, color = "red") +
    geom_point(alpha = 0.5, color = palette_light()[[1]]) +
    geom_smooth() +
    theme_tq()

With this model, we can now add predictions and residuals for the test data…

pred_test <- test %>%
  add_predictions(fit_lm, "pred_lm") %>%
  add_residuals(fit_lm, "resid_lm")

… and visualize the residuals.

pred_test %>%
  ggplot(aes(x = date, y = resid_lm)) +
    geom_hline(yintercept = 0, color = "red") +
    geom_point(alpha = 0.5, color = palette_light()[[1]]) +
    geom_smooth() +
    theme_tq()

We can also compare the predicted with the actual sum income in the test set.

pred_test %>%
  gather(x, y, sum_income, pred_lm) %>%
  ggplot(aes(x = date, y = y, color = x)) +
    geom_point(alpha = 0.5) +
    geom_line(alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq()

Forecasting

Once we are satisfied with our model’s performance on the test set, we can use it to forecast future events. To create future time points for modeling, we need to extract the time index (the date column day in our data frame).

# Extract index
idx <- retail_p_day %>%
  tk_index()

From this index we can generate the future time series.

Here, we need to beware of a couple of things. Most importantly, we need to account for the irregularity of our data: We never have data for Saturdays and we have a few random missing values in between, as can be seen in the diff column of retail_p_day_aug (1 day difference == 86400 seconds).

retail_p_day_aug %>%
  ggplot(aes(x = date, y = diff)) +
    geom_point(alpha = 0.5, aes(color = as.factor(diff))) +
    geom_line(alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq()

What dates are these? Let’s filter for dates that are not Sundays and have a gap of more than one day since the last recorded day (as Saturdays are always off-days).

retail_p_day_aug %>%
  select(date, wday.lbl, diff) %>%
  filter(wday.lbl != "Sunday" & diff > 86400) %>%
  mutate(days_missing = diff / 86400 - 1)
## # A tibble: 5 x 4
##         date wday.lbl    diff days_missing
##
## 1 2011-01-04  Tuesday 1036800           11
## 2 2011-04-26  Tuesday  432000            4
## 3 2011-05-03  Tuesday  172800            1
## 4 2011-05-31  Tuesday  172800            1
## 5 2011-08-30  Tuesday  172800            1

retail_p_day_aug %>%
  select(date, wday.lbl, diff) %>%
  filter(wday.lbl == "Sunday" & diff > 172800) %>%
  mutate(days_missing = diff / 86400 - 1)
## # A tibble: 1 x 4
##         date wday.lbl   diff days_missing
##
## 1 2011-05-01   Sunday 259200            2

Let’s create a list of all missing days:

off_days <- c("2010-12-24", "2010-12-25", "2010-12-26", "2010-12-27",
              "2010-12-28", "2010-12-29", "2010-12-30", "2010-01-01",
              "2010-01-02", "2010-01-03", "2011-04-22", "2011-04-23",
              "2011-04-24", "2011-04-25", "2011-05-02", "2011-05-30",
              "2011-08-29", "2011-04-29", "2011-04-30") %>%
  ymd()

Official UK holidays during that time were:

  • 2011:
  • Boxing Day December 26
  • Christmas Day Holiday December 27

  • 2012:
  • New Year’s Day Holiday January 2
  • Good Friday April 6
  • Easter Monday April 9
  • Early May Bank Holiday May 7
  • Spring Bank Holiday June 4
  • Diamond Jubilee Holiday June 5
  • Summer Bank Holiday August 27

We can account for the missing Saturdays with inspect_weekdays = TRUE.

Ideally, we would now use skip_values and insert_values to specifically account for days with irregular missing data in our future time series, e.g. by accounting for holidays. Generally, it is very difficult to account for holidays, because they don’t occur with an easy to model rule (e.g. Easter is on the first Sunday after the first full moon in Spring). Unfortunately, in our dataset we have seen that holidays and randomly missing days did not have a big overlap in the past.

Because not all holidays are missing days and we have more missing days than official holidays, I am using the list of missing days for skipping values – even though this is only a best-guess approach and likely not going to match all days that will be missing in reality during the future time series.

idx_future <- idx %>%
  tk_make_future_timeseries(n_future = 300, inspect_weekdays = TRUE, inspect_months = FALSE, skip_values = off_days)

idx_future %>%
  tk_get_timeseries_signature() %>%
  ggplot(aes(x = index, y = diff)) +
    geom_point(alpha = 0.5, aes(color = as.factor(diff))) +
    geom_line(alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq()

Then, we can build the data frame for forecasting by using tk_get_timeseries_signature() and renaming the index column to date, so that it matches the features in the model. With this data frame, we can now predict future values and add this to the data frame.

data_future <- idx_future %>%
  tk_get_timeseries_signature() %>%
  rename(date = index)

pred_future <- predict(fit_lm, newdata = data_future)

pred_future <- data_future %>%
  select(date) %>%
  add_column(sum_income = pred_future)

retail_p_day %>%
  select(day, sum_income) %>%
  rename(date = day) %>%
  rbind(pred_future) %>%
  ggplot(aes(x = date, y = sum_income)) +
    scale_x_date() +
    geom_vline(xintercept = as.numeric(max(retail_p_day$day)), color = "red", size = 1) +
    geom_point(alpha = 0.5) +
    geom_line(alpha = 0.5) +
    theme_tq()

When we evaluate the forecast, we want to account for uncertainty of accuracy, e.g. by accounting for the standard deviation of the test residuals.

test_residuals <- pred_test$resid_lm
test_resid_sd <- sd(test_residuals, na.rm = TRUE)

pred_future <- pred_future %>%
  mutate(
    lo.95 = sum_income - 1.96 * test_resid_sd,
    lo.80 = sum_income - 1.28 * test_resid_sd,
    hi.80 = sum_income + 1.28 * test_resid_sd,
    hi.95 = sum_income + 1.96 * test_resid_sd
  )

This, we can then plot to show the forecast with confidence intervals:

retail_p_day %>%
  select(day, sum_income) %>%
  rename(date = day) %>%
  ggplot(aes(x = date, y = sum_income)) +
    geom_point(alpha = 0.5) +
    geom_line(alpha = 0.5) +
    geom_ribbon(aes(ymin = lo.95, ymax = hi.95), data = pred_future,
                fill = "#D5DBFF", color = NA, size = 0) +
    geom_ribbon(aes(ymin = lo.80, ymax = hi.80, fill = key), data = pred_future,
                fill = "#596DD5", color = NA, size = 0, alpha = 0.8) +
    geom_point(aes(x = date, y = sum_income), data = pred_future,
               alpha = 0.5, color = palette_light()[[2]]) +
    geom_smooth(aes(x = date, y = sum_income), data = pred_future,
                method = 'loess', color = "white") +
    theme_tq()

Our model predicts that income will follow a curve that is very similar to last year’s, with a drop after Christmas and an increase towards the later months of the year. In and of itself, this sounds reasonable. However, because we only have data from one year, we do not know whether the decline in January/February and the increase towards Christmas is an annually recurring trend or whether the increase we see at the end of 2011 will be independent of seasonality and continue to rise in the future.

Next time, I’ll compare how Facebook’s prophet will predict the future income.

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
##  [1] ggcorrplot_0.1.1              matrixStats_0.52.2
##  [3] modelr_0.1.0                  timekit_0.3.0
##  [5] broom_0.4.2                   tidyquant_0.5.1
##  [7] quantmod_0.4-8                TTR_0.23-1
##  [9] PerformanceAnalytics_1.4.3541 xts_0.9-7
## [11] zoo_1.8-0                     lubridate_1.6.0
## [13] caret_6.0-76                  lattice_0.20-35
## [15] dplyr_0.5.0                   purrr_0.2.2.2
## [17] readr_1.1.1                   tidyr_0.6.3
## [19] tibble_1.3.1                  ggplot2_2.2.1
## [21] tidyverse_1.1.1
##
## loaded via a namespace (and not attached):
##  [1] httr_1.2.1         jsonlite_1.5       splines_3.4.0
##  [4] foreach_1.4.3      assertthat_0.2.0   stats4_3.4.0
##  [7] cellranger_1.1.0   yaml_2.1.14        backports_1.0.5
## [10] quantreg_5.33      digest_0.6.12      rvest_0.3.2
## [13] minqa_1.2.4        colorspace_1.3-2   htmltools_0.3.6
## [16] Matrix_1.2-10      plyr_1.8.4         psych_1.7.5
## [19] SparseM_1.77       haven_1.0.0        padr_0.3.0
## [22] scales_0.4.1       lme4_1.1-13        MatrixModels_0.4-1
## [25] mgcv_1.8-17        car_2.1-4          nnet_7.3-12
## [28] lazyeval_0.2.0     pbkrtest_0.4-7     mnormt_1.5-5
## [31] magrittr_1.5       readxl_1.0.0       evaluate_0.10
## [34] nlme_3.1-131       MASS_7.3-47        forcats_0.2.0
## [37] xml2_1.1.1         foreign_0.8-68     tools_3.4.0
## [40] hms_0.3            stringr_1.2.0      munsell_0.4.3
## [43] compiler_3.4.0     rlang_0.1.1        grid_3.4.0
## [46] nloptr_1.0.4       iterators_1.0.8    labeling_0.3
## [49] rmarkdown_1.5      gtable_0.2.0       ModelMetrics_1.1.0
## [52] codetools_0.2-15   DBI_0.6-1          reshape2_1.4.2
## [55] R6_2.2.1           knitr_1.16         rprojroot_1.2
## [58] Quandl_2.8.0       stringi_1.1.5      parallel_3.4.0
## [61] Rcpp_0.12.11

Run massive parallel R jobs cheaply with updated doAzureParallel package

Fri, 06/09/2017 - 00:41

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

At the EARL conference in San Francisco this week, JS Tan from Microsoft gave an update (PDF slides here) on the doAzureParallel package. As we've noted here before, this package allows you to easily distribute parallel R computations to an Azure cluster. The package was recently updated to support using automatically-scaling Azure Batch clusters with low-priority nodes, which can be used at a discount of up to 80% compared to the price of regular high-availability VMs.

JS Tan using doAzureParallel #rstats package to run simulation on a cluster of 20 low-priority Azure VMs. Total cost: $0.02 #EARLConf2017 pic.twitter.com/Mpl3IUa9zY

— David Smith (@revodavid) June 7, 2017

Using the doAzureParallel package is simple. First, you need to define the cluster you're going to use as a JSON file. (You can see an example on the right.) Here, you'll specify your Azure credentials, the size of the cluster, and the type of nodes (CPUs and memory) to use in the cluster. You can also specify here R packages (from CRAN and/or Github) to be pre-loaded onto each node, and the maximum number of simultaneous tasks to run on each node (for within-node parallelism).

New to this update, the poolSize option allows you to specify the number of dedicated (standard) VM nodes to use, in addition to a number of low-priority nodes to use. Low-priority nodes can be pre-empted by the Azure system at any time, but are much cheaper to use. (Even if a node is pre-empted, your parallel computation will continue; it will just take a little longer with the reduced capacity.) You can even specify a minimum and maximum number of nodes of each class to use, in which case the cluster will automatically scale up and down according to either the workload or the time of day, your choice (e.g. only expand the low-priority part of the cluster on weekends, when pre-emption is less likely).


Once you've defined the parameters of your cluster, all you need to do is declare the cluster as a backend for the foreach package. The body of the foreach loop runs just like a for loop, except that multiple iterations run in parallel on the remote cluster. Here are the key parts of the option price simulation example JS presented at the conference.
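
The conference slide with the code isn't reproduced in this post, so the snippet below is only a hedged sketch of the general shape of such a job; the configuration file names and the loop body are placeholders, not JS Tan's actual option-pricing example.

library(doAzureParallel)
library(foreach)

# point the package at your Azure credentials and cluster definition (placeholder file names)
setCredentials("credentials.json")
cluster <- makeCluster("cluster.json")

# register the Azure Batch cluster as the parallel backend for foreach
registerDoAzureParallel(cluster)

# an embarrassingly parallel loop: iterations are distributed across the cluster nodes
results <- foreach(i = 1:1000, .combine = c) %dopar% {
  mean(rnorm(1e4))  # placeholder for the per-iteration simulation
}

stopCluster(cluster)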

This same approach can be used for any "embarrassingly parallel" iteration in R, and you can use any R function or package within the body of the loop. For example, you could use a cluster to reduce the time required for parameter tuning and cross-validation with the caret package, or speed up data preparation tasks when using the dplyr package.

In addition to support for auto-scaling clusters, this update to doAzureParallel also includes a few other new features. You'll also find new utility functions for managing multiple long-running R jobs, functions to read data from and write data to Azure Blob storage, and the ability to pre-load data into the cluster by specifying resource files.

The doAzureParallel package is available for download now from Github, under the open-source MIT license. For details on how to use the package, check out the README and the doAzureParallel guide.

Github (Azure): doAzureParallel

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Introduction to Set Theory and Sets with R

Thu, 06/08/2017 - 22:00

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

Sets define a ‘collection’ of objects, or things typically referred to as ‘elements’ or ‘members.’ The concept of sets arises naturally when dealing with any collection of objects, whether it be a group of numbers or anything else. Conceptually, the following examples can be defined as a ‘set’:

  • {1, 2, 3, 4}
  • {Red, Green, Blue}
  • {Cat, Dog}

The first example is the set of the first four natural numbers. The second defines a set of the primary colors while the third example denotes a set of common household pets.

Since its development, beginning in the 1870s with Georg Cantor and Richard Dedekind, set theory has become a foundational system of mathematics; its concepts therefore arise constantly in the study of mathematics, and it remains an area of active research today.

This post will introduce some of the basic concepts of set theory, specifically the Zermelo-Fraenkel axiomatic system (more on that later), with R code to demonstrate these concepts.

Set Notation

Sets can be defined with lowercase, uppercase, script or Greek letters (in addition to subscripts and the like). Using several types of letters helps when dealing with hierarchies. Before diving into set theory, it’s best to define the common notation employed. One benefit of set theory’s ubiquity in mathematics is that learning its notation also helps in understanding other mathematical concepts.

  • \forall x – for every set x
  • \exists x – there exists such a set x that
  • \neg – not
  • \wedge – and
  • \vee – or (one or the other or both)
  • \Rightarrow – implies
  • \Leftrightarrow – iff, if and only if.
  • A \cup B – union of sets A and B
  • A \cap B – intersection of sets A and B
  • x \in A – x is an element of a set A
  • x \notin A – x is not an element of a set A

The \Rightarrow notation for implies can be thought of like an if statement in that it denotes the relation ‘if a then b.’

Set Membership

Set membership is written similar to:

x \in A

Which can be stated as ‘x is an element of the set A.’ If x is not a member of A, we write:

x \notin A

Which is read ‘x is not an element of the set A.’ The symbol \in used to denote set membership originated with Giuseppe Peano (Enderton, pp. 26).

We can write an R function to implement the concept of set membership.

iselement <- function(x, A) {
  if (x %in% A) {
    return(TRUE)
  }
  return(FALSE)
}

Let’s use our simple function to test whether some candidate elements are members of the set A = \{3, 5, 7, 11\}.

A <- c(3, 5, 7, 11) eles <- c(3, 5, 9, 10, (5 + 6)) for (i in 1:length(eles)) { print(iselement(i, A)) } ## [1] FALSE ## [1] FALSE ## [1] TRUE ## [1] FALSE ## [1] TRUE

Set membership leads into one of the first axioms of set theory under the Zermelo-Fraenkel system, the Principle of Extensionality.

Principle of Extensionality

The principle of extensionality states if two sets have the same members, they are equal. The formal definition of the principle of extensionality can be stated more concisely using the notation given above:

\forall A \forall B (\forall x (x \in A \Leftrightarrow x \in B) \Rightarrow A = B)

Stated less concisely but still using set notation:

If two sets A and B are such that for every element (member) x:

x \in A \qquad iff \qquad x \in B

Then A = B.

We can express this axiom through an R function to test for set equality.

isequalset <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  an <- length(a)
  # sets with different numbers of distinct members cannot be equal
  if (an != length(b)) {
    return(FALSE)
  }
  for (i in 1:an) {
    if (!(a[i] %in% b)) {
      return(FALSE)
    }
  }
  return(TRUE)
}

We can now put the principle of extensionality in action with our R function!

# original set A to compare
A <- c(3, 5, 7, 11)

# define some sets to test for equality
B <- c(5, 7, 11, 3)
C <- c(3, 4, 6, 5)
D <- c(3, 5, 7, 11, 13)
E <- c(11, 7, 5, 3)
G <- c(3, 5, 5, 7, 7, 11)

# collect sets into a list to iterate
sets <- list(B, C, D, E, G)

# using the isequalset() function, print the results of the equality tests.
for (i in sets) {
  print(isequalset(i, A))
}
## [1] TRUE
## [1] FALSE
## [1] FALSE
## [1] TRUE
## [1] TRUE

Empty Sets and Singletons

So far we have only investigated sets with two or more members. The empty set, denoted \varnothing, is defined as a set containing no elements, and it occurs surprisingly frequently in set-theoretic operations despite its seemingly straightforward and simple nature.

The empty set axiom states the existence of an empty set concisely:

\exists B \forall x \qquad x \notin B

Which can also be stated as ‘there is a set having no members.’

A set {\varnothing} can be formed whose only member is \varnothing. It is important to note {\varnothing} \neq \varnothing because \varnothing \in {\varnothing} but \varnothing \notin \varnothing. One can conceptually think of {\varnothing} as a container with nothing in it.

A singleton is a set with exactly one element, typically denoted {a}. A nonempty set is, therefore, a set with one or more elements, so a singleton is also nonempty. We can define another quick function to test whether a given set is empty, a singleton or a larger nonempty set.

typeofset <- function(a) {
  if (length(a) == 0) {
    return('empty set')
  } else if (length(a) == 1) {
    return('singleton')
  } else if (length(a) > 1) {
    return('nonempty set')
  } else {
    stop('could not determine type of set')
  }
}

A <- c()
B <- c(0)
C <- c(1, 2)
D <- list(c())

set_types <- list(A, B, C, D)
for (i in set_types) {
  print(typeofset(i))
}
## [1] "empty set"
## [1] "singleton"
## [1] "nonempty set"
## [1] "singleton"

Note D is defined as a singleton because the set contains one element, the empty set \varnothing.
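As a small aside, the distinction between \varnothing and {\varnothing} can be mimicked with base R types (a hedged sketch, using c() and list() as stand-ins for sets):

empty <- c()               # plays the role of the empty set; c() returns NULL
set_of_empty <- list(c())  # a set whose single member is the empty set

length(empty)         # 0 -> typeofset() would report "empty set"
length(set_of_empty)  # 1 -> typeofset() would report "singleton"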

Summary

This post introduced some of the basic concepts of axiomatic set theory under the Zermelo-Fraenkel axioms by exploring the idea of a set, set membership and some particular cases of sets such as the empty set and singletons. It also introduced set notation that will be used not just in set-theoretic applications but throughout mathematics.

References

Barile, Margherita. “Singleton Set.” From MathWorld–A Wolfram Web Resource, created by Eric W. Weisstein. http://mathworld.wolfram.com/SingletonSet.html

Enderton, H. (1977). Elements of set theory (1st ed.). New York: Academic Press.

Weisstein, Eric W. “Empty Set.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/EmptySet.html

Weisstein, Eric W. “Nonempty Set.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/NonemptySet.html



Campaign Response Testing no longer published on Udemy

Thu, 06/08/2017 - 20:33

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Our free video course Campaign Response Testing is no longer published on Udemy. It remains available for free on YouTube with all source code available from GitHub. I’ll try to correct bad links as I find them.

Please read on for the reasons.

Udemy recently unilaterally instituted a new policy on free courses: “When a free course has a Recent Review Rating less than 4.1 and is flagged with a ‘high degree of confidence’ the course will be hidden from Udemy’s search.”

Campaign Response Testing is a free course with an all-time average rating of 4.14 and a recent rating of 3.85. We have kept the code up to date and answered student questions (there was no backlog of student questions).

Obviously others are entitled to their opinions of our public work (which may or may not be good). And we are keeping the course up for free, with no sign-up necessary, on YouTube (as we in no way want to hurt or inconvenience students).

But I do have an issue with Udemy’s new system: it is completely vulnerable to griefers and spammers. Accounts with “no skin in the game” (having paid nothing, possibly not even having watched the course, and without having to leave any review text in addition to the “high degree of confidence” star rating) can belittle free courses at will and in bulk. Currently there is no effective dispute mechanism (other than superficial palliatives such as “answer student questions”), and like all lop-sided “fights against semi-anonymous trolls” it would be a pointless effort anyway. I understand the desire to increase the quality of free content (it is Udemy’s “introduction”), but I felt the new system was obviously unworkable even before it was introduced.

So, with hopefully no harm to our subscribers, I am un-publishing the course. As is the case with Udemy policy, anybody already subscribed should continue to have access to the course through Udemy. And as I have said: we have long since ported the entire free course to YouTube and GitHub for free, login-free sharing.


Neural networks Exercises (Part-1)

Thu, 06/08/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


Neural networks have become a cornerstone of machine learning in the last decade. Created in the late 1940s with the intention of building computer programs that mimic the way neurons process information, these kinds of algorithms were long believed to be only an academic curiosity, deprived of practical use, since they require a lot of processing power and other machine learning algorithms outperformed them. However, since the mid-2000s, the creation of new neural network types and techniques, coupled with the increased availability of fast computers, has made the neural network a powerful tool that every data analyst or programmer must know.

In this series of articles, we’ll see how to fit a neural network with R, learn the core concepts needed to apply those algorithms well, and see how to evaluate whether our model is appropriate to use in production. This set of exercises is an introduction to neural networks, where we’ll use them to create two simple regression and clustering models. Doing so, we’ll use a lot of basic concepts we’ll explore further in future sets. If you want more information about neural networks, you can see this page.

Answers to the exercises are available here.

Exercise 1
We’ll start by creating the data set on which we want to do a simple regression. Set the seed to 42, generate 200 random points between -10 and 10 and store them in a vector named X. Then, create a vector named Y containing the value of sin(X). Neural networks are a lot more flexible than most regression algorithms and can fit complex functions with ease. The biggest challenge is to find a network function appropriate to the situation.
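If you want to check your work before looking at the linked answers, one possible sketch of this exercise (assuming runif() is used for the uniform draws) is:

set.seed(42)
X <- runif(200, min = -10, max = 10)  # 200 random points in [-10, 10]
Y <- sin(X)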

Exercise 2
A network function is made of three components: the network of neurons, the weight of each connection between neurons and the activation function of each neuron. For this example, we’ll use a feed-forward neural network and the logistic activation, which are the defaults for the nnet package. We take one number as input of our neural network and we want one number as output, so the input and output layers both have a size of one. For the hidden layer, we’ll start with three neurons. It’s good practice to randomize the initial weights, so create a vector of 10 random values, picked in the interval [-1,1].

Exercise 3
Neural networks have a strong tendency to overfit your data, meaning they become really good at describing the relationship between the values in your training set, but are not effective with data that wasn’t used to train your model. As a consequence, we need to cross-validate our model. Set the seed to 42, then create a training set containing 75% of the values in your initial data set and a test set containing the rest of your data.

Exercise 4
Load the nnet package and use the function of the same name to create your model. Pass your weights via the Wts argument and set the maxit argument to 50. We want to fit a function whose output can take a continuous range of values, so set the linout argument to TRUE. Finally, take the time to look at the structure of your model.
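One way this call could look (a sketch only, not the official answer; the names X, Y, weights and train_idx are assumptions carried over from Exercises 1-3):

library(nnet)

set.seed(42)
weights <- runif(10, -1, 1)           # Exercise 2: 10 initial weights in [-1, 1]
train_idx <- sample(200, size = 150)  # Exercise 3: 75% of the 200 points

fit <- nnet(x = X[train_idx], y = Y[train_idx],
            size = 3,       # one hidden layer with three neurons
            Wts = weights,  # user-supplied starting weights
            maxit = 50,
            linout = TRUE)  # linear output, since this is a regression
str(fit)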

Learn more about neural networks in the online course Machine Learning A-Z™: Hands-On Python & R In Data Science. In this course you will learn how to:

  • Work with Deep Learning networks and related packages in R
  • Create Natural Language Processing models
  • And much more

Exercise 5
Predict the output for the test set and compute the RMSE of your predictions. Plot the function sin(x) and then plot your predictions.

Exercise 6
The number of neurons in the hidden layer, as well as the number of hidden layer used, has a great influence on the effectiveness of your model. Repeat the exercises three to five, but this time use a hidden layer with seven neurons and initiate randomly 22 weights.

Exercise 7
Now let us use neural networks to solve a clustering problem, so let’s load the iris data set! It is good practice to normalize your input data to make the behavior of your model uniform over different ranges of values and to speed up training. Normalize each numeric variable so that it has a mean of zero and a standard deviation of 1, then create your train and test sets.
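A sketch of the normalization and splitting step (an illustration only, using base R’s scale(); the official answers may differ):

data(iris)
iris_scaled <- iris
iris_scaled[, 1:4] <- scale(iris[, 1:4])  # centre each predictor to mean 0, sd 1

set.seed(42)
train_idx <- sample(nrow(iris_scaled), size = floor(0.75 * nrow(iris_scaled)))
iris_train <- iris_scaled[train_idx, ]
iris_test  <- iris_scaled[-train_idx, ]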

Exercise 8
Use nnet() with a hidden layer of ten neurons to create your model. We want to fit a function which has a finite set of values as output. To do so, set the linout argument to TRUE. Look at the structure of your model. With a clustering problem, the output is usually a factor that is coded as multiple dummy variables, instead of a single numeric value. As a consequence, the output layer has one less neuron than the number of levels of the output factor.

Exercise 9
Make prediction with the values of the test set.

Exercise 10
Create the confusion table of your prediction and compute the accuracy of the model.

Related exercise sets:
  1. Evaluate your model with R Exercises
  2. Getting started with Plotly: basic Plots
  3. Network Analysis Part 3 Exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

Introducing the MonteCarlo Package

Thu, 06/08/2017 - 17:26

(This article was first published on R – first differences, and kindly contributed to R-bloggers)

My first R package has been released on CRAN recently. It is named MonteCarlo and aims to make simulation studies as easy as possible – including parallelization and the generation of tables.

What Are Simulation Studies Good For?

Monte Carlo simulations are an essential tool in statistics and related disciplines. They are routinely used to determine distributional properties, where no analytical results are available. A typical example is to study the finite sample properties of a new statistical procedure. Finite sample properties are often unknown, since theoretical results usually rely on asymptotic theory that only applies to infinite sample sizes. This may be a good approximation in practice if the sample size is large, but it may also be bad if the sample is small or if the rate of convergence is very slow.

Simulation studies are also extremely useful to compare the performance of alternative procedures for the same task. Imagine you want to test whether your sample comes from a Gaussian distribution and you have to decide whether to use the Jarque-Bera test or the Kolmogorov Smirnov test. Both appear to be legitimate choices. A simulation study that is tailored so that it reflects the situation at hand might uncover that one of the procedures is much more powerful than the other.

Finally, small scale simulation studies are an essential tool for statistical programming. Testing is an essential part of programming and software development. Usually, if one writes a function, it is good practice to let it run on some test cases. This is easy, if the outcome is deterministic. But if your function is a statistical test or an estimator, the output depends on the sample and is stochastic. Therefore, the only way to test whether the implementation is correct, is to generate a large number of samples and to see whether it has the expected statistical properties. For example one might test whether the mean squared error of an estimator converges to zero as the sample size increases, or whether the test has the correct alpha error.
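For example, even checking the alpha error of a plain t-test already takes a small hand-rolled simulation (a minimal base-R sketch, before using the package):

set.seed(123)
# proportion of false rejections at the 5% level when the null (mean = 0) is true
rejections <- replicate(1000, t.test(rnorm(50))$p.value < 0.05)
mean(rejections)  # should be close to 0.05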

Therefore, writing Monte Carlo simulations is an everyday task in many areas of statistics. This comes with considerable effort. It is not unusual that the lines of code required to produce a simulation study are a multiple of those needed to implement the procedure of interest. As a consequence, simulations are also one of the main sources of errors. On top of this, the large computational effort often requires parallelization, which brings additional complications and programming effort. These efforts can often be prohibitive, especially for less advanced users.

The MonteCarlo Package

The MonteCarlo package streamlines the process described above. It allows you to create simulation studies and to summarize their results in LaTeX tables quickly and easily.

There are only two main functions in the package:

  1. MonteCarlo() runs a simulation study for a user-defined parameter grid. It handles the generation of loops over the parameter grid and parallelizes the computation on a user-specified number of CPU units.
  2. MakeTable() creates LaTeX tables from the output of MonteCarlo(). It stacks high-dimensional output arrays into tables with a user-specified ordering of rows and columns.

To run a simulation study, the user has to nest both – the generation of a sample and the calculation of the desired statistics from this sample – in a single function. This function is passed to MonteCarlo(). No additional programming is required. This approach is not only very versatile – it is also very intuitive. The user formulates his experiment as if he/she was only interested in a single draw.

The aim of this approach is to allow the user full control and flexibility with regard to the design of the Monte Carlo experiment, but to take away all the effort of setting up the technical part of the simulation.

A Simple Example: The t-test

Suppose we want to evaluate the performance of a standard t-test for the hypothesis that the mean is equal to zero. We are interested to see how the size and power of the test change with the sample size (n), the distance from the null hypothesis (loc for location) and the standard deviation of the distribution (scale). The sample is generated from a normal distribution.

To conduct this analysis, we proceed as follows. First, we load the MonteCarlo package.

library(MonteCarlo)

Then define the following function.

#########################################
## Example: t-test

# Define function that generates data and applies the method of interest
ttest<-function(n,loc,scale){
  # generate sample:
  sample<-rnorm(n, loc, scale)
  # calculate test statistic:
  stat<-sqrt(n)*mean(sample)/sd(sample)
  # get test decision:
  decision<-abs(stat)>1.96
  # return result:
  return(list("decision"=decision))
}

As discussed above, ttest() is formulated in a way as if we only want to generate a single test decision. The arguments of the function are the parameters we are interested in. ttest() carries out 4 steps:

  1. Draw a sample of n observations from a normal distribution with mean loc and standard deviation scale.
  2. Calculate the t-statistic.
  3. Determine the test decision.
  4. Return the desired result in form of a list.

We then define the combinations of parameters that we are interested in and collect them in a list. The elements of the lists must have the same names as the parameters for which we want to supply grids.

# define parameter grid:
n_grid<-c(50,100,250,500)
loc_grid<-seq(0,1,0.2)
scale_grid<-c(1,2)

# collect parameter grids in list:
param_list=list("n"=n_grid, "loc"=loc_grid, "scale"=scale_grid)

To run the simulation, the function ttest() and the parameter grid (param_list) are passed to MonteCarlo(), together with the desired number of Monte Carlo repetitions (nrep=1000).

# run simulation:
MC_result<-MonteCarlo(func=ttest, nrep=1000, param_list=param_list)

There is no further coding required. All the mechanics of the Monte Carlo experiment are handled by the MonteCarlo() function.

Calling summary produces a short information on the simulation.

summary(MC_result)
## Simulation of function:
##
## function(n,loc,scale){
##
##   # generate sample:
##   sample<-rnorm(n, loc, scale)
##
##   # calculate test statistic:
##   stat<-sqrt(n)*mean(sample)/sd(sample)
##
##   # get test decision:
##   decision<-abs(stat)>1.96
##
##   # return result:
##   return(list("decision"=decision))
## }
##
## Required time: 13.38 secs for nrep = 1000 repetitions on 1 CPUs
##
## Parameter grid:
##
##     n : 50 100 250 500
##   loc : 0 0.2 0.4 0.6 0.8 1
## scale : 1 2
##
##
## 1 output arrays of dimensions: 4 6 2 1000

As one can see from the summary, the simulation results are stored in an array of dimension c(4,6,2,1000), where the Monte Carlo repetitions are collected in the last dimension of the array.

To summarize the results in a reasonable way and to include them as a table in a paper or report, we have to represent them in a matrix. This is handled by the MakeTable() function that stacks the submatrices collected in the array in the rows and columns of a matrix and prints the result in the form of code to generate a LaTeX table.

To determine in which order the results are stacked in rows and columns, we supply the function arguments rows and cols. These are vectors of the names of the parameters in the order in which we want them to appear in the table (sorted from the inside to the outside).

# generate table: MakeTable(output=MC_result, rows="n", cols=c("loc","scale"), digits=2, include_meta=FALSE) ## \begin{table}[h] ## \centering ## \resizebox{ 1 \textwidth}{!}{% ## \begin{tabular}{ rrrrrrrrrrrrrrr } ## \hline\hline\\\\ ## scale && \multicolumn{ 6 }{c}{ 1 } & & \multicolumn{ 6 }{c}{ 2 } \\ ## n/loc & & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 & & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 \\ ## & & & & & & & & & & & & & & \\ ## 50 & & 0.05 & 0.30 & 0.83 & 0.98 & 1.00 & 1.00 & & 0.05 & 0.10 & 0.28 & 0.55 & 0.79 & 0.94 \\ ## 100 & & 0.05 & 0.51 & 0.98 & 1.00 & 1.00 & 1.00 & & 0.07 & 0.16 & 0.53 & 0.84 & 0.98 & 1.00 \\ ## 250 & & 0.05 & 0.89 & 1.00 & 1.00 & 1.00 & 1.00 & & 0.05 & 0.35 & 0.90 & 1.00 & 1.00 & 1.00 \\ ## 500 & & 0.05 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 & & 0.06 & 0.58 & 1.00 & 1.00 & 1.00 & 1.00 \\ ## \\ ## \\ ## \hline\hline ## \end{tabular}% ## } ## \caption{ decision } ## \end{table}

To change the ordering, just change the vectors rows and cols.

# generate table: MakeTable(output=MC_result, rows=c("n","scale"), cols="loc", digits=2, include_meta=FALSE) ## \begin{table}[h] ## \centering ## \resizebox{ 1 \textwidth}{!}{% ## \begin{tabular}{ rrrrrrrrr } ## \hline\hline\\\\ ## scale & n/loc & & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 \\ ## & & & & & & & & \\ ## \multirow{ 4 }{*}{ 1 } & 50 & & 0.05 & 0.30 & 0.83 & 0.98 & 1.00 & 1.00 \\ ## & 100 & & 0.05 & 0.51 & 0.98 & 1.00 & 1.00 & 1.00 \\ ## & 250 & & 0.05 & 0.89 & 1.00 & 1.00 & 1.00 & 1.00 \\ ## & 500 & & 0.05 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 \\ ## & & & & & & & & \\ ## \multirow{ 4 }{*}{ 2 } & 50 & & 0.05 & 0.10 & 0.28 & 0.55 & 0.79 & 0.94 \\ ## & 100 & & 0.07 & 0.16 & 0.53 & 0.84 & 0.98 & 1.00 \\ ## & 250 & & 0.05 & 0.35 & 0.90 & 1.00 & 1.00 & 1.00 \\ ## & 500 & & 0.06 & 0.58 & 1.00 & 1.00 & 1.00 & 1.00 \\ ## \\ ## \\ ## \hline\hline ## \end{tabular}% ## } ## \caption{ decision } ## \end{table}

Now we can simply copy the code and add it to our paper, report or presentation. That is all. Only make sure that the package multirow is included in the header of the .tex file.

Parallelised Simulation

If the procedure you are interested in is not so fast, or you need a large number of replications to produce very accurate results, you might want to use parallelized computation on multiple cores of your computer (or cluster). To achieve this, simply specify the number of CPUs by supplying a value for the argument ncpus of MonteCarlo() as shown below. Of course you should actually have at least the specified number of CPU units.

# run simulation:
MC_result<-MonteCarlo(func=ttest, nrep=1000, param_list=param_list, ncpus=4)

This automatically sets up a snow cluster, including the export of functions and the loading of packages. The user does not have to take care of anything.

Further Information

This is an introduction to get you up and running with the MonteCarlo package as quickly as possible, so I only included a short example. However, the functions MonteCarlo() and particularly MakeTable() provide much more functionality. This is described in more detail in the package vignette, which also provides additional examples.

Filed under: R


Words growing or shrinking in Hacker News titles: a tidy analysis

Thu, 06/08/2017 - 16:00

(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

In May, some friends and I built Tagger News, a real-time automatic classifier of Hacker News articles based on their text (see here for more about how we built it). This process started me down some interesting paths, particularly analyzing trends in titles. By finding words that became more or less common in Hacker News titles over time, we can see what topics were gaining or losing interest in the tech world.

I thought I’d share some of these analyses here. As with most of my text analysis posts, I’ll be analyzing it with the tidytext package written by me and Julia Silge. (Check out our soon-to-be-released book on the subject, Text Mining with R!)

Processing one million Hacker News articles

The titles (and dates, and links) of Hacker News articles are helpfully stored on Google Bigquery. We start with a dataset of a million Hacker News article titles, which covers about 3 and a half years of posts. I downloaded it on 2017-06-04 with the following query:

SELECT id, score, time, title, url FROM [bigquery-public-data:hacker_news.full] WHERE deleted IS null AND dead IS null AND type = 'story' ORDER BY time DESC LIMIT 1000000

I’m hosting the dataset so that you can read it yourself (note that it’s a moderately large and compressed file so it may take a minute to read) as follows:

library(tidyverse)
library(lubridate)

stories <- read_csv("http://varianceexplained.org/files/stories_1000000.csv.gz") %>%
  mutate(time = as.POSIXct(time, origin = "1970-01-01"),
         month = round_date(time, "month"))

Whenever we read in a text dataset, the first step is usually to tokenize it (split it up into words) and remove stopwords (like “the” or “and”). As described here, we can do this with tidytext’s unnest_tokens function.

library(tidyverse)
library(tidytext)
library(stringr)

title_words <- stories %>%
  arrange(desc(score)) %>%
  distinct(title, .keep_all = TRUE) %>%
  unnest_tokens(word, title, drop = FALSE) %>%
  distinct(id, word, .keep_all = TRUE) %>%
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "[^\\d]")) %>%
  group_by(word) %>%
  mutate(word_total = n()) %>%
  ungroup()

This creates a pretty large data frame (4443083 rows) with one row per word per post. For example, we could use this to find and visualize the most common words in Hacker News posts during this 3.5 year period.

word_counts <- title_words %>%
  count(word, sort = TRUE)

word_counts %>%
  head(25) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col(fill = "lightblue") +
  scale_y_continuous(labels = comma_format()) +
  coord_flip() +
  labs(title = "Most common words in Hacker News titles",
       subtitle = "Among the last million stories; stop words removed",
       y = "# of uses")

The top result, “hn”, comes from the common constructions of “Show HN” and “Ask HN”, which are often prepended to a title. Among the other words, there’s nothing we wouldn’t have expected from following Hacker News. There are some notable technology companies like Google, Apple, and Facebook, along with common topics in tech discussions like “app”, “data” and “startup”.

Change over time

What words and topics have become more frequent, or less frequent, over time? These could give us a sense of the changing technology ecosystem, and let us predict what topics will continue to grow in relevance.

To achieve this, we’ll first count the occurrences of words in titles by month.

stories_per_month <- stories %>%
  group_by(month) %>%
  summarize(month_total = n())

word_month_counts <- title_words %>%
  filter(word_total >= 1000) %>%
  count(word, month) %>%
  complete(word, month, fill = list(n = 0)) %>%
  inner_join(stories_per_month, by = "month") %>%
  mutate(percent = n / month_total) %>%
  mutate(year = year(month) + yday(month) / 365)

word_month_counts
## # A tibble: 33,480 x 6
##    word      month     n month_total     percent     year
##  1  2.0 2013-10-01    42       20542 0.002044592 2013.751
##  2  2.0 2013-11-01    55       23402 0.002350226 2013.836
##  3  2.0 2013-12-01    26       21945 0.001184780 2013.918
##  4  2.0 2014-01-01    25       19186 0.001303033 2014.003
##  5  2.0 2014-02-01    31       22295 0.001390446 2014.088
##  6  2.0 2014-03-01    37       21118 0.001752060 2014.164
##  7  2.0 2014-04-01    33       23673 0.001393993 2014.249
##  8  2.0 2014-05-01    28       21394 0.001308778 2014.332
##  9  2.0 2014-06-01    47       19265 0.002439657 2014.416
## 10  2.0 2014-07-01    25       19829 0.001260780 2014.499
## # ... with 33,470 more rows

We can then use my broom package to fit a model (logistic regression) to examine whether the frequency of each word is increasing or decreasing over time. Every term will then have a growth rate (as an exponential term) associated with it.

library(broom)

mod <- ~ glm(cbind(n, month_total - n) ~ year, ., family = "binomial")

slopes <- word_month_counts %>%
  nest(-word) %>%
  mutate(model = map(data, mod)) %>%
  unnest(map(model, tidy)) %>%
  filter(term == "year") %>%
  arrange(desc(estimate))

slopes
## # A tibble: 744 x 6
##    word          term   estimate  std.error statistic       p.value
##  1 trump         year  1.7570144 0.04052662  43.35458  0.000000e+00
##  2 ai            year  0.7830541 0.01756776  44.57335  0.000000e+00
##  3 blockchain    year  0.6557110 0.02682345  24.44544 5.626574e-132
##  4 neural        year  0.6270933 0.02617919  23.95388 8.418336e-127
##  5 react         year  0.6027489 0.01628292  37.01725 6.045233e-300
##  6 vr            year  0.5260498 0.02247906  23.40177 4.099556e-121
##  7 bot           year  0.5178669 0.02600166  19.91669  2.916460e-88
##  8 iot           year  0.5076088 0.02514613  20.18636  1.290270e-90
##  9 microservices year  0.4933223 0.03180060  15.51299  2.833910e-54
## 10 slack         year  0.4718030 0.02287605  20.62432  1.660491e-94
## # ... with 734 more rows

We can now ask: what terms have been increasing in frequency in Hacker News titles?

It makes sense that “Trump” is the word that’s growing most quickly. While it shows a moderate growth after Trump announced his candidacy in mid-2015, it shows a sharp increase around the time of the 2016 election (though it hasn’t been quite as dominant in the months since his inauguration). The next fastest growing terms include “AI” and “blockchain”, both topics that have shown a recent surge of interest. (We can also see “machine learning”, “AWS”, and “bot” among the growing terms).
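The original post shows this as a faceted plot; a rough sketch of how such a figure could be rebuilt from the slopes and word_month_counts objects defined above (not necessarily the author's exact code) is:

slopes %>%
  head(9) %>%                                   # nine fastest-growing words
  inner_join(word_month_counts, by = "word") %>%
  mutate(word = reorder(word, -estimate)) %>%
  ggplot(aes(month, n / month_total)) +
  geom_line() +
  facet_wrap(~ word, scales = "free_y") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = "Year", y = "Percent of titles containing this word",
       title = "Fastest-growing words in Hacker News titles")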

What words have been decreasing in frequency in Hacker News titles?

This shows a few topics in which interest has died out since 2013, including Google Glass and Gmail. Discussion of Edward Snowden and the NSA was fresh news around 2013-2014, so it makes sense that it’s declined since. There are also some notable technologies that people write about less, such as AngularJS, HTML5, and Ruby on Rails. It’s interesting to compare these to Stack Overflow Trends during that time period, in which AngularJS has been growing in terms of Stack Overflow questions asked (HTML5 and Rails have leveled off). It’s possible that discussion on HN is a “leading indicator”, showing a surge of articles when a technology first gets popular.

(I don’t currently have a guess for why “million” and “billion” had sudden dropoffs in 2014. Is it some artifact of the Hacker News policy, with the word becoming edited or deleted in newer posts? Or is it a real change in what the site discusses?)

Interestingly, the conversation around “bitcoin” peaked around 2013-2014 (with a recent uptick due to a surge in Bitcoin’s price), while discussion of the blockchain has been growing in the years since. Comparing them on the same axes makes it clear that the blockchain has roughly “caught up to” bitcoin in terms of general interest:
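A sketch of that comparison, again built from word_month_counts (not necessarily the exact code behind the original figure):

word_month_counts %>%
  filter(word %in% c("bitcoin", "blockchain")) %>%
  ggplot(aes(month, percent, color = word)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = "Year", y = "Percent of titles containing this word", color = "Word")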

Words that passed their peak

We can learn a lot by examining words that grew or shrank steadily over the whole period, but we’re missing an important piece of the puzzle: there are some words that grew in interest for part of this time, then “passed their peak.” This is more complicated to explore quantitatively, but one approach is to find words where the ratio of the peak of the trend to the average is especially high. To reduce the effect of random noise, we use a cubic spline to smooth each trend before determining the peak.

library(splines)

mod2 <- ~ glm(cbind(n, month_total - n) ~ ns(year, 4), ., family = "binomial")

# Fit a cubic spline to each shape
spline_predictions <- word_month_counts %>%
  mutate(year = as.integer(as.Date(month)) / 365) %>%
  nest(-word) %>%
  mutate(model = map(data, mod2)) %>%
  unnest(map2(model, data, augment, type.predict = "response"))

# Find the terms with the highest peak / average ratio
peak_per_month <- spline_predictions %>%
  group_by(word) %>%
  mutate(average = mean(.fitted)) %>%
  top_n(1, .fitted) %>%
  ungroup() %>%
  mutate(ratio = .fitted / average) %>%
  filter(month != min(month), month != max(month)) %>%
  top_n(16, ratio)

Here are 16 terms that had strong peaks at various points in the last 3.5 years.

peak_per_month %>% select(word, peak = month) %>% inner_join(spline_predictions, by = "word") %>% mutate(word = reorder(word, peak)) %>% ggplot(aes(month, percent)) + geom_line(aes(color = word), show.legend = FALSE) + geom_line(aes(y = .fitted), lty = 2) + facet_wrap(~ word, scales = "free_y") + scale_y_continuous(labels = percent_format()) + expand_limits(y = 0) + labs(x = "Year", y = "Percent of titles containing this term", title = "16 words that peaked then declined in Hacker News titles", subtitle = "Spline fit (df = 4) shown.\nSelected based on the peak of the spline divided by the overall average; ordered by peak month.")

We can see a peak of discussion around net neutrality in 2015 (though it’s shown a recent resurgence of interest). You can spot the introduction of Swift and the Apple Watch, then a moderate decline in discussion around them, and there are sudden jolts of discussion around Oculus in 2014 (with Facebook’s purchase of the company) and the FBI in 2016 (news around Clinton’s email server). The graph suggests the discussion around Slack, bots, Tesla, and virtual reality may have leveled off (though in some cases it may be too early to tell).

What’s next

I have some more analyses of Hacker News posts I’ll be sharing in the future, including analyses of word correlations, examination of what words and topics tend to get upvoted, and some improvements on the Tagger News classification algorithm. The series will also give me the chance to highlight more techniques from Text Mining with R, including word networks and topic modeling.


Test-driving Microsoft Cognitive Toolkit in R using reticulate

Thu, 06/08/2017 - 15:27

(This article was first published on R snippets, and kindly contributed to R-bloggers)

New tools for data science pop up constantly, so it is hard to keep up with the changes and to choose those that promise to be useful in the long run.
Recently two new solutions were announced that look very promising: the reticulate package for R and Microsoft Cognitive Toolkit 2.0 (CNTK). Together with Wit Jakuczun we have decided to test drive them on an Azure Data Science Virtual Machine.
CNTK is a very promising deep learning toolkit, backed by Microsoft. Unfortunately it provides bindings to Python and not to R. Therefore it is an ideal situation to test the reticulate package, which provides an R interface to Python. We wanted to test CNTK on GPUs, so we decided to use an Azure Data Science Virtual Machine NV6 instance with a Tesla M60 GPU.
For our test drive we used the classic MNIST dataset. You can find all the source code and detailed results of the test on GitHub. The short conclusion is that all components worked and played together excellently:

  • on CNTK we built an MLP network model, which reported a 2.3% classification error without any special tuning, which is a pretty decent result;
  • calling a Python library from R using reticulate was simple and stable;
  • calculations using the GPU gave a significant speedup, and configuration of the Azure DSVM was really simple.
In summary, we have found the combo MSFT CNTK + R/reticulate + Azure DSVM to be simple to set up and powerful. It is definitely worth checking out in your next data science project.
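The CNTK modelling code itself lives in the linked GitHub repo; purely as a flavour of what reticulate enables, here is a minimal, unrelated sketch of calling a Python module from R (assuming Python and numpy are installed on the machine):

library(reticulate)

np <- import("numpy")   # import a Python module into the R session
np$mean(c(1, 2, 3, 4))  # the R vector is converted for Python; returns 2.5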


Deep Learning with R

Thu, 06/08/2017 - 14:40

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

For R users, there hasn’t been a production grade solution for deep learning (sorry MXNET). This post introduces the Keras interface for R and how it can be used to perform image classification. The post ends by providing some code snippets that show Keras is intuitive and powerful.

Tensorflow

Last January, Tensorflow for R was released, which provided access to the Tensorflow API from R. However, for most R users, the interface was not very R like.

Take a look at this code chunk for training a model:

cross_entropy <- tf$reduce_mean(-tf$reduce_sum(y_ * tf$log(y_conv), reduction_indices = 1L))
train_step <- tf$train$AdamOptimizer(1e-4)$minimize(cross_entropy)
correct_prediction <- tf$equal(tf$argmax(y_conv, 1L), tf$argmax(y_, 1L))
accuracy <- tf$reduce_mean(tf$cast(correct_prediction, tf$float32))
sess$run(tf$global_variables_initializer())

for (i in 1:20000) {
  batch <- mnist$train$next_batch(50L)
  if (i %% 100 == 0) {
    train_accuracy <- accuracy$eval(feed_dict = dict(
      x = batch[[1]], y_ = batch[[2]], keep_prob = 1.0))
    cat(sprintf("step %d, training accuracy %g\n", i, train_accuracy))
  }
  train_step$run(feed_dict = dict(
    x = batch[[1]], y_ = batch[[2]], keep_prob = 0.5))
}

Compare that with fitting a model through the Keras interface:

model %>% fit(
  x = train_x, y = train_y,
  epochs = epochs, batch_size = batch_size,
  validation_data = valid
)

Image Classification with Keras

So if you are still with me, let me show you how to build deep learning models using R, Keras, and Tensorflow together. You will find a Github repo that contains the code and data you will need. Included is an R notebook that walks through building an image classifier (telling cat from dog), but can easily be generalized to other images. The walk through includes advanced methods that are commonly used for production deep learning work including:

  • augmenting data
  • using the bottleneck features of a pre-trained network
  • fine-tuning the top layers of a pre-trained network
  • saving weights for your models
Code Snippets of Keras

The R interface to Keras truly makes it easy to build deep learning models in R. Here are some code snippets to illustrate how intuitive and useful Keras for R is:

To load picture from a folder:

train_generator <- flow_images_from_directory(train_directory, generator = image_data_generator(), target_size = c(img_width, img_height), color_mode = "rgb", class_mode = "binary", batch_size = batch_size, shuffle = TRUE, seed = 123)

To define a simple convolutional neural network:

model <- keras_model_sequential()

model %>%
  layer_conv_2d(filter = 32, kernel_size = c(3,3), input_shape = c(img_width, img_height, 3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_conv_2d(filter = 32, kernel_size = c(3,3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_conv_2d(filter = 64, kernel_size = c(3,3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_flatten() %>%
  layer_dense(64) %>%
  layer_activation("relu") %>%
  layer_dropout(0.5) %>%
  layer_dense(1) %>%
  layer_activation("sigmoid")

To augment data:

augment <- image_data_generator(rescale=1./255, shear_range=0.2, zoom_range=0.2, horizontal_flip=TRUE)

To load a pretrained network:

model_vgg <- application_vgg16(include_top = FALSE, weights = "imagenet")

To save model weights:

save_model_weights_hdf5(model_ft, 'finetuning_30epochs_vggR.h5', overwrite = TRUE)

I believe the Keras for R interface will make it much easier for R users and the R community to build and refine deep learning models with R. This means you don't have to force everyone to use Python to build, refine, and test your models. I really think this will open up deep learning to a wider audience that was a bit apprehensive about using Python. So for now, give it a spin!

Grab my github repo, fire up RStudio (or your IDE of choice), and go build a simple classifier using Keras.

    Related Post

    1. Unsupervised Learning and Text Mining of Emotion Terms Using R
    2. Using MCA and variable clustering in R for insights in customer attrition
    3. Web Scraping and Applied Clustering Global Happiness and Social Progress Index
    4. Key Phrase Extraction from Tweets
    5. Financial time series forecasting – an easy approach

    Add P-values and Significance Levels to ggplots

    Thu, 06/08/2017 - 11:48

    (This article was first published on Easy Guides, and kindly contributed to R-bloggers)

    In this article, we’ll describe how to easily i) compare means of two or multiple groups; ii) and to automatically add p-values and significance levels to a ggplot (such as box plots, dot plots, bar plots and line plots …).


Prerequisites

Install and load required R packages

    Required R package: ggpubr (version >= 0.1.3), for ggplot2-based publication ready plots.

    • Install from CRAN as follow:
    install.packages("ggpubr")
    • Or, install the latest developmental version from GitHub as follow:
    if(!require(devtools)) install.packages("devtools") devtools::install_github("kassambara/ggpubr")
    • Load ggpubr:
    library(ggpubr)

    Official documentation of ggpubr is available at: http://www.sthda.com/english/rpkgs/ggpubr

    Demo data sets

    Data: ToothGrowth data sets.

    data("ToothGrowth") head(ToothGrowth) len supp dose 1 4.2 VC 0.5 2 11.5 VC 0.5 3 7.3 VC 0.5 4 5.8 VC 0.5 5 6.4 VC 0.5 6 10.0 VC 0.5 Methods for comparing means

    The standard methods to compare the means of two or more groups in R, have been largely described at: comparing means in R.

    The most common methods for comparing means include:

Methods          R function        Description
T-test           t.test()          Compare two groups (parametric)
Wilcoxon test    wilcox.test()     Compare two groups (non-parametric)
ANOVA            aov() or anova()  Compare multiple groups (parametric)
Kruskal-Wallis   kruskal.test()    Compare multiple groups (non-parametric)
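As a quick reminder of what the underlying base-R calls look like on the ToothGrowth data loaded above (a minimal sketch; ggpubr wraps these tests for you):

t.test(len ~ supp, data = ToothGrowth)                # parametric, two groups
wilcox.test(len ~ supp, data = ToothGrowth)           # non-parametric, two groups
summary(aov(len ~ factor(dose), data = ToothGrowth))  # parametric, multiple groups
kruskal.test(len ~ factor(dose), data = ToothGrowth)  # non-parametric, multiple groups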

A practical guide to computing and interpreting the results of each of these methods is provided at the following links:

    R functions to add p-values

    Here we present two new R functions in the ggpubr package:

    • compare_means(): easy to use solution to performs one and multiple mean comparisons.
    • stat_compare_means(): easy to use solution to automatically add p-values and significance levels to a ggplot.
    compare_means()

    As we’ll show in the next sections, it has multiple useful options compared to the standard R functions.

    The simplified format is as follow:

    compare_means(formula, data, method = "wilcox.test", paired = FALSE, group.by = NULL, ref.group = NULL, ...)
    • formula: a formula of the form x ~ group, where x is a numeric variable and group is a factor with one or multiple levels. For example, formula = TP53 ~ cancer_group. It’s also possible to perform the test for multiple response variables at the same time. For example, formula = c(TP53, PTEN) ~ cancer_group.
    • data: a data.frame containing the variables in the formula.

    • method: the type of test. Default is “wilcox.test”. Allowed values include:

      • “t.test” (parametric) and “wilcox.test”” (non-parametric). Perform comparison between two groups of samples. If the grouping variable contains more than two levels, then a pairwise comparison is performed.
      • “anova” (parametric) and “kruskal.test” (non-parametric). Perform one-way ANOVA test comparing multiple groups.
    • paired: a logical indicating whether you want a paired test. Used only in t.test and in wilcox.test.

    • group.by: variables used to group the data set before applying the test. When specified the mean comparisons will be performed in each subset of the data formed by the different levels of the group.by variables.

    • ref.group: a character string specifying the reference group. If specified, for a given grouping variable, each of the group levels will be compared to the reference group (i.e. control group). ref.group can be also “.all.”. In this case, each of the grouping variable levels is compared to all (i.e. base-mean).

    stat_compare_means()

This function extends ggplot2 for adding mean comparison p-values to a ggplot, such as box plots, dot plots, bar plots and line plots.

    The simplified format is as follow:

stat_compare_means(mapping = NULL, comparisons = NULL, hide.ns = FALSE, label = NULL, label.x = NULL, label.y = NULL, ...)
    • mapping: Set of aesthetic mappings created by aes().

    • comparisons: A list of length-2 vectors. The entries in the vector are either the names of 2 values on the x-axis or the 2 integers that correspond to the index of the groups of interest, to be compared.

    • hide.ns: logical value. If TRUE, hide ns symbol when displaying significance levels.

    • label: character string specifying label type. Allowed values include “p.signif” (shows the significance levels), “p.format” (shows the formatted p value).

    • label.x,label.y: numeric values. coordinates (in data units) to be used for absolute positioning of the label. If too short they will be recycled.

    • ...: other arguments passed to the function compare_means() such as method, paired, ref.group.

    Compare two independent groups

    Perform the test:

    compare_means(len ~ supp, data = ToothGrowth) # A tibble: 1 x 8 .y. group1 group2 p p.adj p.format p.signif method 1 len OJ VC 0.06449067 0.06449067 0.064 ns Wilcoxon

    By default method = “wilcox.test” (non-parametric test). You can also specify method = “t.test” for a parametric t-test.

    Returned value is a data frame with the following columns:

    • .y.: the y variable used in the test.
    • p: the p-value
    • p.adj: the adjusted p-value. Default value for p.adjust.method = “holm”
    • p.format: the formatted p-value
    • p.signif: the significance level.
    • method: the statistical test used to compare groups.

    Create a box plot with p-values:

p <- ggboxplot(ToothGrowth, x = "supp", y = "len",
               color = "supp", palette = "jco", add = "jitter")

# Add p-value
p + stat_compare_means()

# Change method
p + stat_compare_means(method = "t.test")

    Add p-values and significance levels to ggplots

    Note that, the p-value label position can be adjusted using the arguments: label.x, label.y, hjust and vjust.

    The default p-value label displayed is obtained by concatenating the method and the p columns of the returned data frame by the function compare_means(). You can specify other combinations using the aes() function.

    For example,

    • aes(label = ..p.format..) or aes(label = paste0(“p =”, ..p.format..)): display only the formatted p-value (without the method name)
    • aes(label = ..p.signif..): display only the significance level.
    • aes(label = paste0(..method.., “\n”, “p =”, ..p.format..)): Use line break (“\n”) between the method name and the p-value.

    As an illustration, type this:

    p + stat_compare_means( aes(label = ..p.signif..), label.x = 1.5, label.y = 40)

    Add p-values and significance levels to ggplots

    If you prefer, it’s also possible to specify the argument label as a character vector:

p + stat_compare_means(label = "p.signif", label.x = 1.5, label.y = 40)

Compare two paired samples

    Perform the test:

    compare_means(len ~ supp, data = ToothGrowth, paired = TRUE) # A tibble: 1 x 8 .y. group1 group2 p p.adj p.format p.signif method 1 len OJ VC 0.004312554 0.004312554 0.0043 ** Wilcoxon

    Visualize paired data using the ggpaired() function:

    ggpaired(ToothGrowth, x = "supp", y = "len", color = "supp", line.color = "gray", line.size = 0.4, palette = "jco")+ stat_compare_means(paired = TRUE)

    Add p-values and significance levels to ggplots

    Compare more than two groups
    • Global test:
# Global test
compare_means(len ~ dose, data = ToothGrowth, method = "anova")
# A tibble: 1 x 6
#   .y.            p        p.adj p.format p.signif method
# 1 len 9.532727e-16 9.532727e-16  9.5e-16     **** Anova

    Plot with global p-value:

# Default method = "kruskal.test" for multiple groups
ggboxplot(ToothGrowth, x = "dose", y = "len", color = "dose", palette = "jco") +
  stat_compare_means()

# Change method to anova
ggboxplot(ToothGrowth, x = "dose", y = "len", color = "dose", palette = "jco") +
  stat_compare_means(method = "anova")

    Add p-values and significance levels to ggplots

    • Pairwise comparisons. If the grouping variable contains more than two levels, then pairwise tests will be performed automatically. The default method is “wilcox.test”. You can change this to “t.test”.
# Perform pairwise comparisons
compare_means(len ~ dose, data = ToothGrowth)
# A tibble: 3 x 8
#   .y. group1 group2            p        p.adj p.format p.signif method
# 1 len    0.5      1 7.020855e-06 1.404171e-05  7.0e-06     **** Wilcoxon
# 2 len    0.5      2 8.406447e-08 2.521934e-07  8.4e-08     **** Wilcoxon
# 3 len      1      2 1.772382e-04 1.772382e-04  0.00018      *** Wilcoxon

# Visualize: Specify the comparisons you want
my_comparisons <- list( c("0.5", "1"), c("1", "2"), c("0.5", "2") )
ggboxplot(ToothGrowth, x = "dose", y = "len", color = "dose", palette = "jco") +
  stat_compare_means(comparisons = my_comparisons) + # Add pairwise comparisons p-value
  stat_compare_means(label.y = 50)                   # Add global p-value

    Add p-values and significance levels to ggplots

    If you want to specify the precise y location of bars, use the argument label.y:

    ggboxplot(ToothGrowth, x = "dose", y = "len", color = "dose", palette = "jco")+ stat_compare_means(comparisons = my_comparisons, label.y = c(29, 35, 40))+ stat_compare_means(label.y = 45)

    Add p-values and significance levels to ggplots

    (Adding bars, connecting compared groups, has been facilitated by the ggsignif R package )

    • Multiple pairwise tests against a reference group:
# Pairwise comparison against reference
compare_means(len ~ dose, data = ToothGrowth, ref.group = "0.5", method = "t.test")
# A tibble: 2 x 8
#   .y. group1 group2            p        p.adj p.format p.signif method
# 1 len    0.5      1 6.697250e-09 6.697250e-09  6.7e-09     **** T-test
# 2 len    0.5      2 1.469534e-16 2.939068e-16  < 2e-16     **** T-test

# Visualize
ggboxplot(ToothGrowth, x = "dose", y = "len", color = "dose", palette = "jco") +
  stat_compare_means(method = "anova", label.y = 40) +                         # Add global p-value
  stat_compare_means(label = "p.signif", method = "t.test", ref.group = "0.5") # Pairwise comparison against reference

    Add p-values and significance levels to ggplots

    • Multiple pairwise tests against all (base-mean):
# Comparison of each group against base-mean
compare_means(len ~ dose, data = ToothGrowth, ref.group = ".all.", method = "t.test")
# A tibble: 3 x 8
#   .y. group1 group2            p        p.adj p.format p.signif method
# 1 len  .all.    0.5 1.244626e-06 3.733877e-06  1.2e-06     **** T-test
# 2 len  .all.      1 5.667266e-01 5.667266e-01     0.57       ns T-test
# 3 len  .all.      2 1.371704e-05 2.743408e-05  1.4e-05     **** T-test

# Visualize
ggboxplot(ToothGrowth, x = "dose", y = "len", color = "dose", palette = "jco") +
  stat_compare_means(method = "anova", label.y = 40) +                           # Add global p-value
  stat_compare_means(label = "p.signif", method = "t.test", ref.group = ".all.") # Pairwise comparison against all

    Add p-values and significance levels to ggplots

    A typical situation, where pairwise comparisons against “all” can be useful, is illustrated here using the myeloma data set from the survminer package.

    We’ll plot the expression profile of the DEPDC1 gene according to the patients’ molecular groups. We want to know if there is any difference between groups. If yes, where the difference is?

    To answer to this question, you can perform a pairwise comparison between all the 7 groups. This will lead to a lot of comparisons between all possible combinations. If you have many groups, as here, it might be difficult to interpret.

An easier solution is to compare each of the seven groups against “all” (i.e. the base mean). When that test is significant, you can conclude that DEPDC1 is significantly over- or under-expressed in the corresponding group compared to all the others.
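The “base mean” referenced here is simply the mean of DEPDC1 over all patients, the same quantity marked by the dashed reference line in the plot below (the myeloma data are loaded in the next chunk):

# Grand (base) mean of DEPDC1 across all molecular groups
mean(myeloma$DEPDC1)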

# Load myeloma data from the survminer package
if(!require(survminer)) install.packages("survminer")
data("myeloma", package = "survminer")

# Perform the test
compare_means(DEPDC1 ~ molecular_group, data = myeloma,
              ref.group = ".all.", method = "t.test")

# A tibble: 7 x 8
     .y. group1           group2            p        p.adj p.format p.signif method
1 DEPDC1  .all.       Cyclin D-1 1.496896e-01 4.490687e-01  0.14969       ns T-test
2 DEPDC1  .all.       Cyclin D-2 5.231428e-01 1.000000e+00  0.52314       ns T-test
3 DEPDC1  .all.     Hyperdiploid 2.815333e-04 1.689200e-03  0.00028      *** T-test
4 DEPDC1  .all. Low bone disease 5.083985e-03 2.541992e-02  0.00508       ** T-test
5 DEPDC1  .all.              MAF 8.610664e-02 3.444265e-01  0.08611       ns T-test
6 DEPDC1  .all.            MMSET 5.762908e-01 1.000000e+00  0.57629       ns T-test
7 DEPDC1  .all.    Proliferation 1.241416e-09 8.689910e-09  1.2e-09     **** T-test

# Visualize the expression profile
ggboxplot(myeloma, x = "molecular_group", y = "DEPDC1",
          color = "molecular_group", add = "jitter",
          legend = "none") +
  rotate_x_text(angle = 45)+
  geom_hline(yintercept = mean(myeloma$DEPDC1), linetype = 2)+ # Add horizontal line at base mean
  stat_compare_means(method = "anova", label.y = 1600)+        # Add global ANOVA p-value
  stat_compare_means(label = "p.signif", method = "t.test",
                     ref.group = ".all.")                      # Pairwise comparison against all

From the plot above, we can conclude that DEPDC1 is significantly overexpressed in the Proliferation group and significantly underexpressed in the Hyperdiploid and Low bone disease groups compared to all.

    Note that, if you want to hide the ns symbol, specify the argument hide.ns = TRUE.

# Visualize the expression profile
ggboxplot(myeloma, x = "molecular_group", y = "DEPDC1",
          color = "molecular_group", add = "jitter",
          legend = "none") +
  rotate_x_text(angle = 45)+
  geom_hline(yintercept = mean(myeloma$DEPDC1), linetype = 2)+ # Add horizontal line at base mean
  stat_compare_means(method = "anova", label.y = 1600)+        # Add global ANOVA p-value
  stat_compare_means(label = "p.signif", method = "t.test",
                     ref.group = ".all.", hide.ns = TRUE)      # Pairwise comparison against all

    Multiple grouping variables
    • Two independent sample comparisons after grouping the data by another variable:

    Perform the test:

compare_means(len ~ supp, data = ToothGrowth, group.by = "dose")

# A tibble: 3 x 9
   dose   .y. group1 group2           p      p.adj p.format p.signif   method
1   0.5   len     OJ     VC 0.023186427 0.04637285    0.023        * Wilcoxon
2   1.0   len     OJ     VC 0.004030367 0.01209110    0.004       ** Wilcoxon
3   2.0   len     OJ     VC 1.000000000 1.00000000    1.000       ns Wilcoxon

    In the example above, for each level of the variable “dose”, we compare the means of the variable “len” in the different groups formed by the grouping variable “supp”.
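As a quick sanity check (not part of the original tutorial), the same per-dose Wilcoxon p-values can be reproduced with base R by splitting the data on “dose”; the unadjusted values should match the p column above, up to how ties are handled:

# Wilcoxon test of len ~ supp within each level of dose (unadjusted p-values)
sapply(split(ToothGrowth, ToothGrowth$dose),
       function(d) wilcox.test(len ~ supp, data = d)$p.value)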

Visualize (1/2). Create multi-panel box plots facetted by group (here, “dose”):

# Box plot facetted by "dose"
p <- ggboxplot(ToothGrowth, x = "supp", y = "len",
               color = "supp", palette = "jco",
               add = "jitter",
               facet.by = "dose", short.panel.labs = FALSE)

# Use only p.format as label. Remove method name.
p + stat_compare_means(label = "p.format")

# Or use significance symbol as label
p + stat_compare_means(label = "p.signif", label.x = 1.5)

    To hide the ‘ns’ symbol, use the argument hide.ns = TRUE.
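For example, building on the facetted plot p created above:

# Hide non-significant comparisons in the facetted plot
p + stat_compare_means(label = "p.signif", label.x = 1.5, hide.ns = TRUE)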

Visualize (2/2). Create a single panel containing all the box plots: plot y = “len” by x = “dose”, colored by “supp”:

p <- ggboxplot(ToothGrowth, x = "dose", y = "len",
               color = "supp", palette = "jco",
               add = "jitter")
p + stat_compare_means(aes(group = supp))

# Show only the p-value
p + stat_compare_means(aes(group = supp), label = "p.format")

# Use significance symbol as label
p + stat_compare_means(aes(group = supp), label = "p.signif")

    • Paired sample comparisons after grouping the data by another variable:

    Perform the test:

compare_means(len ~ supp, data = ToothGrowth, group.by = "dose", paired = TRUE)

# A tibble: 3 x 9
   dose   .y. group1 group2          p      p.adj p.format p.signif   method
1   0.5   len     OJ     VC 0.03296938 0.06593876    0.033        * Wilcoxon
2   1.0   len     OJ     VC 0.01905889 0.05717667    0.019        * Wilcoxon
3   2.0   len     OJ     VC 1.00000000 1.00000000    1.000       ns Wilcoxon

Visualize. Create multi-panel box plots facetted by group (here, “dose”):

# Box plot facetted by "dose"
p <- ggpaired(ToothGrowth, x = "supp", y = "len",
              color = "supp", palette = "jco",
              line.color = "gray", line.size = 0.4,
              facet.by = "dose", short.panel.labs = FALSE)

# Use only p.format as label. Remove method name.
p + stat_compare_means(label = "p.format", paired = TRUE)

    Other plot types
    • Bar and line plots (one grouping variable):
# Bar plot of mean +/- se
ggbarplot(ToothGrowth, x = "dose", y = "len", add = "mean_se")+
  stat_compare_means() +                                  # Global p-value
  stat_compare_means(ref.group = "0.5", label = "p.signif",
                     label.y = c(22, 29))                 # Compare to ref.group

# Line plot of mean +/- se
ggline(ToothGrowth, x = "dose", y = "len", add = "mean_se")+
  stat_compare_means() +                                  # Global p-value
  stat_compare_means(ref.group = "0.5", label = "p.signif",
                     label.y = c(22, 29))

    • Bar and line plots (two grouping variables):
ggbarplot(ToothGrowth, x = "dose", y = "len", add = "mean_se",
          color = "supp", palette = "jco",
          position = position_dodge(0.8))+
  stat_compare_means(aes(group = supp), label = "p.signif", label.y = 29)

ggline(ToothGrowth, x = "dose", y = "len", add = "mean_se",
       color = "supp", palette = "jco")+
  stat_compare_means(aes(group = supp), label = "p.signif",
                     label.y = c(16, 25, 29))

    Infos

    This analysis has been performed using R software (ver. 3.3.2) and ggpubr (ver. 0.1.3).
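To check the versions on your own machine, base R provides both pieces of information (a small sketch):

# Report the R and ggpubr versions in the current session
R.version.string
packageVersion("ggpubr")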



    UK R Courses

    Thu, 06/08/2017 - 10:30

    (This article was first published on R – Why?, and kindly contributed to R-bloggers)

    Over the next few months we’re running a number of R, Stan and Scala courses around the UK.

    June (London)
    • Mon Jun 26 – Introduction to R
    • Tue Jun 27 – Statistical Modelling with R
    • Wed Jun 28 – Programming with R
    • Thu Jun 29 (2-day course) – Advanced R Programming
    August (Belfast)
    • Thu Aug 10 – Programming with R
    • Fri Aug 11 – Building an R Package
    September 2017 (Newcastle)
    • Mon Sep 11 – Introduction to R
    • Tue Sep 12 – Statistical Modelling with R
    • Wed Sep 13 – Programming with R
    • Thu Sep 14 – Advanced Graphics with R
    • Fri Sep 15 – Automated Reporting (first steps towards Shiny)
    • Mon Sep 18 (5-day course) – Bioconductor
    October (Glasgow)
    • Wed Oct 25 – Advanced Graphics with R
    • Thu Oct 26 (2-day course) – Advanced R Programming
    November (London)
    • Mon Nov 06 (1-day course) – Efficient R Programming
    • Tue Nov 07 (1-day course) – R for Big Data
    December (London/Newcastle)
    • Thu Dec 07 (2-day course) – Introduction to Bayesian Inference using RStan (Newcastle)
    • Mon Dec 11 (3-day course) – Scala for Statistical Computing and Data Science (London)

See the website for course descriptions. If you have any questions, feel free to contact me: colin@jumpingrivers.com

On-site courses are available on request.




    Unconf projects 4: cityquant, notary, packagemetrics, pegax

    Thu, 06/08/2017 - 09:00

    (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

Continuing our series of blog posts this week about unconf 17 (see day 1, day 2, day 3).

    cityquant

Summary: The goal of the cityquant project was to build a digital dashboard for sustainable cities.

The team also built a "spin-off" project called selfquant, which pulls data from a quantified-self Google Sheets template to track weekly performance in various categories.

    Team: Reka Solymosi, Ben Best, Chelsea Ursaner, Tim Phan, Jasmine Dumas

    Github: https://github.com/ropenscilabs/cityquant

    notary

    Summary: notary is actually two things:

    notary: An R package for signing and verification of R packages. It has functions for installing and verifying packages, validating GitHub releases, sourcing files with verification, and more.

    r-security-practices: A bookdown book targeting users, developers, and admins on R security best practices.

Team: Stephanie Locke, Oliver Keyes, Rich FitzJohn, Bob Rudis, Jeroen Ooms

    Github: https://github.com/ropenscilabs/notary / https://github.com/ropenscilabs/r-security-practices

    packagemetrics

    Summary: packagemetrics is a package for helping you choose which package to use. Their tool collects metrics including CRAN downloads, GitHub stars, whether it's tidyverse compatible, whether it has tests and vignettes, number of contributors, and more!

    This project combined two ideas from our brainstorming stage: Avoiding redundant / overlapping packages and A framework for reproducible tables.

    Team: Erin Grand, Sam Firke, Hannah Frick, Becca Krouse, Lori Shepherd

    Github: https://github.com/ropenscilabs/packagemetrics

    pegax

Summary: pegax is a very alpha client for parsing taxonomic names. Taxonomic names are things such as Homo sapiens (human beings), Ursus americanus (American black bear), or Balaenoptera musculus (blue whale). Taxonomic names can be hard to parse, so a Parsing Expression Grammar (PEG) can be employed to help. We were lucky that Oliver Keyes had just started an R package for PEGs called piton, which is now used in pegax to parse taxonomic names.

    Team:

    Scott Chamberlain (with help from Oliver Keyes)

    Github: https://github.com/ropenscilabs/pegax



