R-bloggers: R news and tutorials contributed by 750 R bloggers

Superpixels in imager

Fri, 03/24/2017 - 17:23

(This article was first published on R – dahtah, and kindly contributed to R-bloggers)

Superpixels are used in image segmentation as a pre-processing step. Instead of segmenting pixels directly, we first group similar pixels into “super-pixels”, which can then be processed further (and more cheaply).


(image from Wikimedia)

The current version of imager doesn’t implement them, but it turns out that SLIC superpixels are particularly easy to implement. SLIC is essentially k-means applied to pixels, with some bells and whistles.

We could use k-means to segment images based on colour alone. To get good results on colour segmentation the CIELAB colour space is appropriate, because it tries to be perceptually uniform.

library(tidyverse)
library(imager)

im <- load.image("https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/Aster_Tataricus.JPG/1024px-Aster_Tataricus.JPG")

#Convert to CIELAB colour space, then create a data.frame with three colour channels as columns
d <- sRGBtoLab(im) %>% as.data.frame(wide="c") %>% dplyr::select(-x,-y)

#Run k-means with 2 centers
km <- kmeans(d,2)

#Turn cluster index into an image
seg <- as.cimg(km$cluster,dim=c(dim(im)[1:2],1,1))

plot(im,axes=FALSE)
highlight(seg==1)

We mostly manage to separate the petals from the rest, with a few errors here and there.
SLIC does pretty much the same thing, except we (a) use many more centers and (b) we add pixel coordinates as features in the clustering. The latter ensures that only adjacent pixels get grouped together.

The code below implements SLIC. It’s mostly straightforward:

#Compute SLIC superpixels
#im: input image
#nS: number of superpixels
#compactness: determines compactness of superpixels.
#low values will result in pixels with weird shapes
#... further arguments passed to kmeans
slic <- function(im,nS,compactness=1,...)
{
    #If image is in colour, convert to CIELAB
    if (spectrum(im) == 3) im <- sRGBtoLab(im)

    #The pixel coordinates vary over 1...width(im) and 1...height(im)
    #Pixel values can be over a widely different range
    #We need our features to have similar scales, so
    #we compute relative scales of spatial dimensions to colour dimensions
    sc.spat <- (dim(im)[1:2]*.28) %>% max #Scale of spatial dimensions
    sc.col <- imsplit(im,"c") %>% map_dbl(sd) %>% max

    #Scaling ratio for pixel values
    rat <- (sc.spat/sc.col)/(compactness*10)

    X <- as.data.frame(im*rat,wide="c") %>% as.matrix
    #Generate initial centers from a grid
    ind <- round(seq(1,nPix(im)/spectrum(im),l=nS))
    #Run k-means
    km <- kmeans(X,X[ind,],...)

    #Return segmentation as image (pixel values index cluster)
    seg <- as.cimg(km$cluster,dim=c(dim(im)[1:2],1,1))
    #Superpixel image: each pixel is given the colour of the superpixel it belongs to
    sp <- map(1:spectrum(im),~ km$centers[km$cluster,2+.]) %>% do.call(c,.) %>% as.cimg(dim=dim(im))
    #Correct for ratio
    sp <- sp/rat
    if (spectrum(im)==3)
    {
        #Convert back to RGB
        sp <- LabtosRGB(sp)
    }
    list(km=km,seg=seg,sp=sp)
}

Use it as follows:

#400 superpixels
out <- slic(im,400)
#Superpixels
plot(out$sp,axes=FALSE)
#Segmentation
plot(out$seg,axes=FALSE)
#Show segmentation on original image
(im*add.colour(abs(imlap(out$seg)) == 0)) %>% plot(axes=FALSE)


The next step is to segment the superpixels but I’ll keep that for another time.


R Weekly Bulletin Vol – I

Fri, 03/24/2017 - 14:44

(This article was first published on R programming, and kindly contributed to R-bloggers)

We are starting R weekly bulletins, which will contain interesting ways and methods to write code in R and to solve nagging problems. We will also cover R functions and shortcut keys for beginners. We understand that there can be more than one way of writing code in R, and the solutions listed in the bulletins may not be the only reference point for you. Nevertheless, we believe the solutions listed will be helpful to many of our readers. We hope you like our R weekly bulletins. Enjoy reading them!

Shortcut Keys
  1. To move cursor to R Source Editor – Ctrl+1
  2. To move cursor to R Console – Ctrl+2
  3. To clear the R console – Ctrl+L
Problem Solving Ideas

Creating user input functionality

To create user input functionality in R, we can make use of the readline function. This gives us the flexibility to set the input values for variables of our choice when the code is run.

Example: Suppose that we have coded a backtesting strategy. We want to have the flexibility to choose the backtest period. To do so, we can create a user input “n”, signifying the backtest period in years, and add the line shown below at the start of the code.

When the code is run, it will prompt the user to enter the value for “n”. Upon entering the value, the R code will get executed for the set period and produce the desired output.

n = readline(prompt = "Enter the backtest period in years: ")
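Note that readline() always returns a character string, so if "n" is to be used in calculations it should be converted first. A minimal sketch, using the same prompt as above:

# as.numeric() turns the typed text into a number; it becomes NA if the input isn't numeric
n = as.numeric(readline(prompt = "Enter the backtest period in years: "))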

Refreshing code every x seconds

To refresh code every x seconds, we can use a while loop together with the Sys.sleep function. The while loop keeps executing the enclosed block of commands as long as its condition remains satisfied. We enclose the code in a while statement with the condition set to TRUE, so it keeps looping. At the end of the code, we add the Sys.sleep function and specify the delay time in seconds. This way the code re-runs every "x" seconds.

Example: In this example, we initialize the x value to zero. The code is refreshed every 1 second, and it will keep printing the value of x. One can hit the escape button on the keyboard to terminate the code.

x = 0
while (TRUE) {
  x = x + 1
  print(x)
  Sys.sleep(1)
}

Running multiple R scripts sequentially

To run multiple R scripts, one can have a main script which contains the names of the scripts to be run. Running the main script leads to the execution of the other R scripts. Assume the name of the main script is "NSE Stocks.R". In this script, we mention the names of the scripts we wish to run within the source function. In this example, we wish to run the "Top gainers.R" and "Top losers.R" scripts. These will be part of "NSE Stocks.R" as shown below, and we run the main script to run these two scripts.

source("Top gainers.R") source("Top losers.R")

Enclosing the R script name within the “source” function causes R to accept its input from the named file. Input is read and parsed from that file until the end of the file is reached, then the parsed expressions are evaluated sequentially in the chosen environment. Alternatively, one can also place the R script names in a vector, and use the sapply function.

Example:

filenames = c("Top gainers.R", "Top losers.R")
sapply(filenames, source)

Converting a date in the American format to Standard date format

The American date format is of the type mm/dd/yyyy, whereas the ISO 8601 standard format is yyyy-mm-dd. To convert a date from the American format to the standard date format we use the as.Date function along with the format argument. The example below illustrates the method.

Example:

# date in American format
dt = "07/24/2016"

# If we call the as.Date function on the date, it will throw an error,
# as the default format assumed by the as.Date function is yyyy-mm-dd.
as.Date(dt)

Error in charToDate(x): character string is not in a standard unambiguous format

# Correct way of formatting the date
as.Date(dt, format = "%m/%d/%Y")

[1] “2016-07-24”

How to remove all the existing files from a folder

To remove all the files from a particular folder, one can use the unlink function. Specify the path of the folder as the argument to the function. A forward slash with an asterisk is added at the end of the path. The syntax is given below.

unlink("path/*")

Example:

unlink("C:/Users/Documents/NSE Stocks/*")

This will remove all the files present in the “NSE Stocks” folder.
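If the wildcard form misbehaves on a particular platform, a hedged alternative (assuming the same folder path) is to list the files explicitly and remove them:

# Build the full paths of all files in the folder, then delete them
files = list.files("C:/Users/Documents/NSE Stocks", full.names = TRUE)
file.remove(files)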

Functions Demystified

write.csv function

If you want to save a data frame or matrix in a csv file, R provides for the write.csv function. The syntax for the write.csv function is given as:

write.csv(x, file="filename", row.names=FALSE)

If we specify row.names=TRUE, the function prepends each row with a label taken from the row.names attribute of the data. If the data doesn't have row names, the function just uses the row numbers. A column header line is written by default. Note that write.csv() ignores attempts to set col.names; to omit the header row, use write.table() instead (a sketch follows the example below).

Example:

# Create a data frame
Ticker = c("PNB","CANBK","KTKBANK","IOB")
Percent_Change = c(2.30,-0.25,0.50,1.24)
df = data.frame(Ticker,Percent_Change)

write.csv(df, file="Banking Stocks.csv", row.names=FALSE)

This will write the data contained in the “df” dataframe to the “Banking Stocks.csv” file. The file gets saved in the R working directory.
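Because write.csv ignores col.names, a hedged sketch for writing the same data frame without a header row uses write.table with a comma separator:

# write.table honours col.names; sep = "," keeps the file comma-separated
write.table(df, file = "Banking Stocks.csv", sep = ",", row.names = FALSE, col.names = FALSE)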

fix function

The fix function opens the underlying code of the function supplied as its argument in an editor, so you can view (and modify) it.

Example:

fix(sd)

The underlying code for the standard deviation function is shown below. It is displayed when we execute the fix function with "sd" as the argument.

function (x, na.rm = FALSE)
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
    na.rm = na.rm))

download.file function

The download.file function helps download a file from a website. This could be a webpage, a csv file, an R file, etc. The syntax for the function is given as:

download.file(url, destfile)

where,
url – the Uniform Resource Locator (URL) of the file to be downloaded
destfile – the location to save the downloaded file, i.e. path with a file name

Example: In this example, the function will download the file from the path given in the "url" argument and save it in the "Skills" folder on the D drive with the name "wacc.xls".

url = "http://www.exinfm.com/excel%20files/betawacc.xls"
destfile = "D:/Skills/wacc.xls"
download.file(url, destfile)
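One caveat not covered in the example: for binary files such as .xls, Windows users should pass mode = "wb", otherwise the downloaded file may be corrupted. A minimal sketch using the same url and destfile:

# "wb" writes in binary mode, which matters on Windows for non-text files
download.file(url, destfile, mode = "wb")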

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.




Web data acquisition: parsing json objects with tidyjson (Part 3)

Fri, 03/24/2017 - 12:56

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Part 2 presented the collection of example flight data in JSON format, describing the libraries and the structure of the POST request needed to collect the data in a JSON object. Although the process generated a proper response and transferred it locally, the collected data were neither in a structure suitable for data analysis nor immediately readable. They appear as one long string of information, nested and separated according to the JavaScript Object Notation syntax.

Thus, to visualize the deeply nested JSON object and make it human readable and ready for further processing, the JSON content can be copied and pasted into a common online parser. Such a tool lets you select each node of the tree and inspect the data structure down to the variables and data of interest for the statistical analysis. The bulk of the relevant information for the analysis of flight prices is hidden in the tripOption node, as shown in the following figure (only 50 flight solutions were requested).

Looking deeper into the object, several other elements are provided, such as the distance in miles, the segments, the duration, the carrier, etc. The R parser that transforms the JSON structure into a usable data frame requires the dplyr library for the pipe operator (%>%), which streamlines the code and makes the parser more readable. The library that actually wrangles through the lines, however, is tidyjson, with its powerful functions:

  • enter_object: enters and dives into a data object;
  • gather_array: stacks a JSON array;
  • spread_values: creates new columns from values assigning specific type (e.g. jstring, jnumber).
library(dplyr)    # for pipe operator %>% and other dplyr functions
library(tidyjson) # https://cran.r-project.org/web/packages/tidyjson/vignettes/introduction-to-tidyjson.html

data_items <- datajson %>%
  spread_values(kind = jstring("kind")) %>%
  spread_values(trips.kind = jstring("trips","kind")) %>%
  spread_values(trips.rid = jstring("trips","requestId")) %>%
  enter_object("trips","tripOption") %>%
  gather_array %>%
  spread_values(
    id = jstring("id"),
    saleTotal = jstring("saleTotal")) %>%
  enter_object("slice") %>%
  gather_array %>%
  spread_values(slice.kind = jstring("kind")) %>%
  spread_values(slice.duration = jstring("duration")) %>%
  enter_object("segment") %>%
  gather_array %>%
  spread_values(
    segment.kind = jstring("kind"),
    segment.duration = jnumber("duration"),
    segment.id = jstring("id"),
    segment.cabin = jstring("cabin")) %>%
  enter_object("leg") %>%
  gather_array %>%
  spread_values(
    segment.leg.aircraft = jstring("aircraft"),
    segment.leg.origin = jstring("origin"),
    segment.leg.destination = jstring("destination"),
    segment.leg.mileage = jnumber("mileage")) %>%
  select(kind, trips.kind, trips.rid, saleTotal, id,
         slice.kind, slice.duration,
         segment.kind, segment.duration, segment.id, segment.cabin,
         segment.leg.aircraft, segment.leg.origin,
         segment.leg.destination, segment.leg.mileage)

head(data_items)
                    kind             trips.kind              trips.rid saleTotal
1 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR178.38
2 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR178.38
3 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR235.20
4 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR235.20
5 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR248.60
6 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR248.60
                         id           slice.kind slice.duration
1 ftm7QA6APQTQ4YVjeHrxLI006 qpxexpress#sliceInfo            510
2 ftm7QA6APQTQ4YVjeHrxLI006 qpxexpress#sliceInfo            510
3 ftm7QA6APQTQ4YVjeHrxLI009 qpxexpress#sliceInfo            490
4 ftm7QA6APQTQ4YVjeHrxLI009 qpxexpress#sliceInfo            490
5 ftm7QA6APQTQ4YVjeHrxLI007 qpxexpress#sliceInfo            355
6 ftm7QA6APQTQ4YVjeHrxLI007 qpxexpress#sliceInfo            355
            segment.kind segment.duration       segment.id segment.cabin
1 qpxexpress#segmentInfo              160 GixYrGFgbbe34NsI         COACH
2 qpxexpress#segmentInfo              235 Gj1XVe-oYbTCLT5V         COACH
3 qpxexpress#segmentInfo              190 Grt369Z0shJhZOUX         COACH
4 qpxexpress#segmentInfo              155 GRvrptyoeTfrSqg8         COACH
5 qpxexpress#segmentInfo              100 GXzd3e5z7g-5CCjJ         COACH
6 qpxexpress#segmentInfo              105 G8axcks1R8zJWKrN         COACH
  segment.leg.aircraft segment.leg.origin segment.leg.destination segment.leg.mileage
1                  320                FCO                      IST                 859
2                  77W                IST                      LHR                1561
3                  73H                FCO                      ARN                1256
4                  73G                ARN                      LHR                 908
5                  319                FCO                      STR                 497
6                  319                STR                      LHR                 469

The data are now in an R-friendly structure, although not yet ready for analysis. As can be observed from the first rows, each record holds information on a single segment of the selected flight. A further aggregation step, using some SQL, is needed to end up with a data frame of flight data suitable for statistical analysis.

Next up, the aggregation, some data analysis and data visualization to complete the journey through the web data acquisition using R.




R – Change columns names in a spatial dataframe

Fri, 03/24/2017 - 12:32

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

Ordnance Survey have a great OpenRoads dataset, but unfortunately it contains a column called ‘primary’, which is a keyword in SQL. This makes it challenging/impossible to import the OpenRoads dataset into a SQL database (e.g. GRASS), without changing the offending column name.

Enter R! Or any other capable programming language. The following script reads a folder of shp files, changes a given column name and overwrites the original files. Note the use of the ‘@’ symbol to call a slot from the S4 class object (thanks Stack Exchange).

library(rgdal)

f = list.files("~/Downloads/OS/data", pattern="shp")
f = substr(f, 1, nchar(f) - 4)

lapply(f, function(i){
  x = readOGR("/home/mspencer/Downloads/OS/data", i)
  colnames(x@data)[11] = "prim"
  writeOGR(x,
           "/home/mspencer/Downloads/OS/data",
           i,
           "ESRI Shapefile",
           overwrite_layer=T)
})
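Hard-coding column 11 works for this particular dataset, but a slightly more defensive variant (a sketch, assuming the offending column is literally named "primary") renames it by name rather than by position:

# Rename by column name instead of position, in case the column order changes
names(x@data)[names(x@data) == "primary"] = "prim"

This line would replace the colnames(x@data)[11] assignment inside the lapply call above.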

 


Neural Networks for Learning Lyrics

Fri, 03/24/2017 - 12:21

(This article was first published on More or Less Numbers, and kindly contributed to R-bloggers)

I created a Twitter account inspired by a couple of Twitter accounts that applied a particular type of machine learning technique to learn how two (at the time) presidential hopefuls spoke. I thought, why not see what a model like this could do with lyrics from my favorite rock n roll artist? Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture that can be used to produce sentences or phrases by learning from text. The two Twitter accounts that inspired this were @deeplearnthebern and @deepdrumpf, which use this technique to produce phrases and sentences. I scraped a little more than 300 of his songs and fed them to an LSTM model using R and the mxnet library. Primarily I used mxnet.io to build and train the model – great site and tools. The tutorials on their site are very helpful, particularly this one.

The repository containing the code for the scraper and other information is here. Follow deeplearnbruce for tweets that are hopefully entertaining for Springsteen fans or anyone else.


Lesser known purrr tricks

Fri, 03/24/2017 - 12:00

purrr is a package that extends R's functional programming capabilities. It brings a lot of new stuff to the table, and in this post I show you some of the most useful (at least to me) functions included in purrr.

Getting rid of loops with map()

library(purrr)

numbers <- list(11, 12, 13, 14)

map_dbl(numbers, sqrt)
## [1] 3.316625 3.464102 3.605551 3.741657

You might wonder why this is preferable to a for loop. It's a lot less verbose, and you do not need to initialise any structure to hold the result. If you google "create empty list in R" you will see how common that question is. With the map() family of functions, there is no need for an initial structure. map_dbl() returns an atomic vector of real numbers, but if you use map() you will get a list back. Try them all out!
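For comparison, here is a sketch of the base-R loop that map_dbl() replaces, using the same numbers list:

# Pre-allocate a result vector, then fill it element by element
result <- numeric(length(numbers))
for (i in seq_along(numbers)) {
  result[i] <- sqrt(numbers[[i]])
}
result
## [1] 3.316625 3.464102 3.605551 3.741657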

Map conditionally

map_if()

# Create a helper function that returns TRUE if a number is even
is_even <- function(x){
  !as.logical(x %% 2)
}

map_if(numbers, is_even, sqrt)
## [[1]]
## [1] 11
##
## [[2]]
## [1] 3.464102
##
## [[3]]
## [1] 13
##
## [[4]]
## [1] 3.741657

map_at()

map_at(numbers, c(1,3), sqrt)
## [[1]]
## [1] 3.316625
##
## [[2]]
## [1] 12
##
## [[3]]
## [1] 3.605551
##
## [[4]]
## [1] 14

map_if() and map_at() take one more argument than map(): in the case of map_if(), a predicate function (a function that returns TRUE or FALSE), and in the case of map_at(), a vector of positions. This lets you map your function only when certain conditions are met, which is also something a lot of people google for.

Map a function with multiple arguments

numbers2 <- list(1, 2, 3, 4)

map2(numbers, numbers2, `+`)
## [[1]]
## [1] 12
##
## [[2]]
## [1] 14
##
## [[3]]
## [1] 16
##
## [[4]]
## [1] 18

You can map two lists to a function that takes two arguments using map2(). You can even map an arbitrary number of lists to any function using pmap().

By the way, try `+`(1,3) in the console and see what happens.
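A minimal pmap() sketch, using the two lists above plus a third, hypothetical one:

numbers3 <- list(10, 20, 30, 40)
# pmap() takes a list of lists and maps over their elements in parallel
pmap(list(numbers, numbers2, numbers3), function(a, b, c) a + b + c)
## [[1]]
## [1] 22
##
## [[2]]
## [1] 34
##
## [[3]]
## [1] 46
##
## [[4]]
## [1] 58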

Don’t stop execution of your function if something goes wrong

possible_sqrt <- possibly(sqrt, otherwise = NA_real_)

numbers_with_error <- list(1, 2, 3, "spam", 4)

map(numbers_with_error, possible_sqrt)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
##
## [[4]]
## [1] NA
##
## [[5]]
## [1] 2

Another very common issue is wanting to keep a loop running even when something goes wrong. In most cases the loop simply stops at the error, but you would like it to continue so you can see where it failed. Try googling "skip error in a loop" or some variation of it and you'll see that a lot of people really just want that. This is possible by combining map() and possibly(). Most solutions involve tryCatch(), which I personally do not find very easy to use.

Don’t stop execution of your function if something goes wrong and capture the error

safe_sqrt <- safely(sqrt, otherwise = NA_real_)

map(numbers_with_error, safe_sqrt)
## [[1]]
## [[1]]$result
## [1] 1
##
## [[1]]$error
## NULL
##
##
## [[2]]
## [[2]]$result
## [1] 1.414214
##
## [[2]]$error
## NULL
##
##
## [[3]]
## [[3]]$result
## [1] 1.732051
##
## [[3]]$error
## NULL
##
##
## [[4]]
## [[4]]$result
## [1] NA
##
## [[4]]$error
## <simpleError in .f(...): non-numeric argument to mathematical function>
##
##
## [[5]]
## [[5]]$result
## [1] 2
##
## [[5]]$error
## NULL

safely() is very similar to possibly(), but it returns a list of lists. Each element is thus a list holding the result and the accompanying error message. If there is no error, the error component is NULL; if there is an error, it contains the error message.

Transpose a list

safe_result_list <- map(numbers_with_error, safe_sqrt)

transpose(safe_result_list)
## $result
## $result[[1]]
## [1] 1
##
## $result[[2]]
## [1] 1.414214
##
## $result[[3]]
## [1] 1.732051
##
## $result[[4]]
## [1] NA
##
## $result[[5]]
## [1] 2
##
##
## $error
## $error[[1]]
## NULL
##
## $error[[2]]
## NULL
##
## $error[[3]]
## NULL
##
## $error[[4]]
## <simpleError in .f(...): non-numeric argument to mathematical function>
##
## $error[[5]]
## NULL

Here we transposed the above list. This means that we still have a list of lists, but where the first list holds all the results (which you can then access with safe_result_list$result) and the second list holds all the errors (which you can access with safe_result_list$error). This can be quite useful!

Apply a function to a lower depth of a list

transposed_list <- transpose(safe_result_list)

transposed_list %>%
  at_depth(2, is_null)
## $result
## $result[[1]]
## [1] FALSE
##
## $result[[2]]
## [1] FALSE
##
## $result[[3]]
## [1] FALSE
##
## $result[[4]]
## [1] FALSE
##
## $result[[5]]
## [1] FALSE
##
##
## $error
## $error[[1]]
## [1] TRUE
##
## $error[[2]]
## [1] TRUE
##
## $error[[3]]
## [1] TRUE
##
## $error[[4]]
## [1] FALSE
##
## $error[[5]]
## [1] TRUE

Sometimes working with lists of lists can be tricky, especially when we want to apply a function to the sub-lists. This is easily done with at_depth()!

Set names of list elements

name_element <- c("sqrt()", "ok?")

set_names(transposed_list, name_element)
## $`sqrt()`
## $`sqrt()`[[1]]
## [1] 1
##
## $`sqrt()`[[2]]
## [1] 1.414214
##
## $`sqrt()`[[3]]
## [1] 1.732051
##
## $`sqrt()`[[4]]
## [1] NA
##
## $`sqrt()`[[5]]
## [1] 2
##
##
## $`ok?`
## $`ok?`[[1]]
## NULL
##
## $`ok?`[[2]]
## NULL
##
## $`ok?`[[3]]
## NULL
##
## $`ok?`[[4]]
## <simpleError in .f(...): non-numeric argument to mathematical function>
##
## $`ok?`[[5]]
## NULL

Reduce a list to a single value

reduce(numbers, `*`)
## [1] 24024

reduce() applies the function * iteratively to the list of numbers. There’s also accumulate():

accumulate(numbers, `*`)
## [1] 11 132 1716 24024

which keeps the intermediary results.

This function is very general, and you can reduce anything:

Matrices:

mat1 <- matrix(rnorm(10), nrow = 2)
mat2 <- matrix(rnorm(10), nrow = 2)
mat3 <- matrix(rnorm(10), nrow = 2)

list_mat <- list(mat1, mat2, mat3)

reduce(list_mat, `+`)
##            [,1]       [,2]       [,3]       [,4]      [,5]
## [1,] -0.5228188  0.4813357  0.3808749 -1.1678164 0.3080001
## [2,] -3.8330509 -0.1061853 -3.8315768  0.3052248 0.3486929

even data frames:

df1 <- as.data.frame(mat1)
df2 <- as.data.frame(mat2)
df3 <- as.data.frame(mat3)

list_df <- list(df1, df2, df3)

reduce(list_df, dplyr::full_join)
## Joining, by = c("V1", "V2", "V3", "V4", "V5")
## Joining, by = c("V1", "V2", "V3", "V4", "V5")
##            V1         V2          V3         V4         V5
## 1  0.01587062  0.8570925  1.04330594 -0.5354500  0.7557203
## 2 -0.46872345  0.3742191 -1.88322431  1.4983888 -1.2691007
## 3 -0.60675851 -0.7402364 -0.49269182 -0.4884616 -1.0127531
## 4 -1.49619518  1.0714251  0.06748534  0.6650679  1.1709317
## 5  0.06806907  0.3644795 -0.16973919 -0.1439047  0.5650329
## 6 -1.86813223 -1.5518295 -2.01583786 -1.8582319  0.4468619

Hope you enjoyed this list of useful functions! If you enjoy the content of my blog, you can follow me on twitter.

RApiDatetime 0.0.1

Fri, 03/24/2017 - 02:30

Very happy to announce a new package of mine is now up on the CRAN repository network: RApiDatetime.

It provides six entry points for C-level functions of the R API for Date and Datetime calculations: asPOSIXlt and asPOSIXct convert between long and compact datetime representations, formatPOSIXlt and Rstrptime convert to and from character strings, and POSIXlt2D and D2POSIXlt convert between Date and POSIXlt datetime. These six functions are all fairly essential and useful, but none of them was previously exported by R. Hence the need to put them together in this package to complete the accessible API somewhat.

These should be helpful for fellow package authors, as many of us have either our own partial copies of some of this code, or else farm the work back out to R to get it done.

As a simple (yet real!) illustration, here is an actual Rcpp function which we could now cover at the C level rather than having to go back up to R (via Rcpp::Function()):

inline Datetime::Datetime(const std::string &s, const std::string &fmt) {
    Rcpp::Function strptime("strptime");    // we cheat and call strptime() from R
    Rcpp::Function asPOSIXct("as.POSIXct"); // and we need to convert to POSIXct
    m_dt = Rcpp::as<double>(asPOSIXct(strptime(s, fmt)));
    update_tm();
}

I had taken a first brief stab at this about two years ago, but never finished. With the recent emphasis on C-level function registration, coupled with a possible use case from anytime, I more or less put this together last weekend.

It currently builds and tests fine on POSIX-alike operating systems. If someone with some skill and patience in working on Windows would like to help complete the Windows side of things then I would certainly welcome help and pull requests.

For questions or comments please use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

QR Decomposition with the Gram-Schmidt Algorithm

Thu, 03/23/2017 - 21:00

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

QR decomposition is another technique for decomposing a matrix into a form that is easier to work with in further applications. The QR decomposition technique decomposes a square or rectangular matrix, which we will denote as A, into two components, Q, and R.
A = QR

Where Q is an orthogonal matrix, and R is an upper triangular matrix. Recall an orthogonal matrix is a square matrix with orthonormal row and column vectors such that Q^T Q = I, where I is the identity matrix. The term orthonormal implies the vectors are of unit length and are perpendicular (orthogonal) to each other.

QR decomposition is often used in linear least squares estimation and is, in fact, the method used by R in its lm() function. Signal processing and MIMO systems also employ QR decomposition. There are several methods for performing QR decomposition, including the Gram-Schmidt process, Householder reflections, and Givens rotations. This post is concerned with the Gram-Schmidt process.

The Gram-Schmidt Process

The Gram-Schmidt process is used to find an orthogonal basis from a non-orthogonal basis. An orthogonal basis has many properties that are desirable for further computations and expansions. As noted previously, an orthogonal matrix has row and column vectors of unit length:

||a_n|| = \sqrt{a_n \cdot a_n} = \sqrt{a_n^T a_n} = 1

Where a_n is a linearly independent column vector of a matrix. The vectors are also perpendicular in an orthogonal basis. The Gram-Schmidt process works by finding an orthogonal projection q_n for each column vector a_n and then subtracting its projections onto the previous projections (q_j). The resulting vector is then divided by the length of that vector to produce a unit vector.

Consider a matrix A with n column vectors such that:

A = \left[ a_1 | a_2 | \cdots | a_n \right]

The Gram-Schmidt process proceeds by finding the orthogonal projection of the first column vector a_1.

v_1 = a_1, \qquad e_1 = \frac{v_1}{||v_1||}

Because a_1 is the first column vector, there are no preceding projections to subtract. For the second column a_2, we subtract its projection onto the previous unit vector:

v_2 = a_2 – proj_{v_1} (a_2) = a_2 – (a_2 \cdot e_1) e_1, \qquad e_2 = \frac{v_2}{||v_2||}

This process continues up to the n column vectors, where each incremental step k + 1 is computed as:

v_{k+1} = a_{k+1} - (a_{k+1} \cdot e_{1}) e_1 - \cdots - (a_{k+1} \cdot e_k) e_k, \qquad e_{k+1} = \frac{v_{k+1}}{||v_{k+1}||}

The || \cdot || is the L_2 norm, which for a vector v_k with m components is defined as:

||v_k|| = \sqrt{\sum^m_{j=1} v_{kj}^2}

The projection used above can also be written in terms of the unit vectors as proj_{v_j}(a) = (a \cdot e_j) \, e_j.

Thus the matrix A can be factorized into the QR matrix as the following:

A = \left[a_1 | a_2 | \cdots | a_n \right] = \left[e_1 | e_2 | \cdots | e_n \right] \begin{bmatrix}a_1 \cdot e_1 & a_2 \cdot e_1 & \cdots & a_n \cdot e_1 \\\ 0 & a_2 \cdot e_2 & \cdots & a_n \cdot e_2 \\\ \vdots & \vdots & & \vdots \\\ 0 & 0 & \cdots & a_n \cdot e_n\end{bmatrix} = QR

Gram-Schmidt Process Example

Consider the matrix A:

\begin{bmatrix} 2 & – 2 & 18 \\\ 2 & 1 & 0 \\\ 1 & 2 & 0 \end{bmatrix}

We would like to orthogonalize this matrix using the Gram-Schmidt process. The resulting orthogonalized vector is also equivalent to Q in the QR decomposition.

The Gram-Schmidt process on the matrix A proceeds as follows:

v_1 = a_1 = \begin{bmatrix}2 \\\ 2 \\\ 1\end{bmatrix} \qquad e_1 = \frac{v_1}{||v_1||} = \frac{\begin{bmatrix}2 \\\ 2 \\\ 1\end{bmatrix}}{\sqrt{\sum{\begin{bmatrix}2 \\\ 2 \\\ 1\end{bmatrix}^2}}} e_1 = \begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix}
v_2 = a_2 – (a_2 \cdot e_1) e_1 = \begin{bmatrix}-2 \\\ 1 \\\ 2\end{bmatrix} – \left(\begin{bmatrix}-2 \\\ 1 \\\ 2\end{bmatrix}, \begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix}\right)\begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix}
v_2 = \begin{bmatrix}-2 \\\ 1 \\\ 2\end{bmatrix} \qquad e_2 = \frac{v_2}{||v_2||} = \frac{\begin{bmatrix}-2 \\\ 1 \\\ 2\end{bmatrix}}{\sqrt{\sum{\begin{bmatrix}-2 \\\ 1 \\\ 2\end{bmatrix}^2}}}
e_2 = \begin{bmatrix} -\frac{2}{3} \\\ \frac{1}{3} \\\ \frac{2}{3} \end{bmatrix}
v_3 = a_3 – (a_3 \cdot e_1) e_1 – (a_3 \cdot e_2) e_2
v_3 = \begin{bmatrix}18 \\\ 0 \\\ 0\end{bmatrix} – \left(\begin{bmatrix}18 \\\ 0 \\\ 0\end{bmatrix}, \begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix}\right)\begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix} – \left(\begin{bmatrix}18 \\\ 0 \\\ 0\end{bmatrix}, \begin{bmatrix} -\frac{2}{3} \\\ \frac{1}{3} \\\ \frac{2}{3} \end{bmatrix} \right)\begin{bmatrix} -\frac{2}{3} \\\ \frac{1}{3} \\\ \frac{2}{3} \end{bmatrix}
v_3 = \begin{bmatrix}2 \\\ – 4 \\\ 4 \end{bmatrix} \qquad e_3 = \frac{v_3}{||v_3||} = \frac{\begin{bmatrix}2 \\\ -4 \\\ 4\end{bmatrix}}{\sqrt{\sum{\begin{bmatrix}2 \\\ -4 \\\ 4\end{bmatrix}^2}}}
e_3 = \begin{bmatrix} \frac{1}{3} \\\ -\frac{2}{3} \\\ \frac{2}{3} \end{bmatrix}

Thus, the orthogonalized matrix resulting from the Gram-Schmidt process is:

\begin{bmatrix} \frac{2}{3} & -\frac{2}{3} & \frac{1}{3} \\\ \frac{2}{3} & \frac{1}{3} & -\frac{2}{3} \\\ \frac{1}{3} & \frac{2}{3} & \frac{2}{3} \end{bmatrix}

The component R of the QR decomposition can also be found from the calculations made in the Gram-Schmidt process as defined above.

R = \begin{bmatrix}a_1 \cdot e_1 & a_2 \cdot e_1 & \cdots & a_n \cdot e_1 \\\ 0 & a_2 \cdot e_2 & \cdots & a_n \cdot e_2 \\\ \vdots & \vdots & & \vdots \\\ 0 & 0 & \cdots & a_n \cdot e_n \end{bmatrix} = \begin{bmatrix} \begin{bmatrix} 2 \\\ 2 \\\ 1 \end{bmatrix} \cdot \begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix} & \begin{bmatrix} -2 \\\ 1 \\\ 2 \end{bmatrix} \cdot \begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix} & \begin{bmatrix} 18 \\\ 0 \\\ 0 \end{bmatrix} \cdot \begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix} \\\ 0 & \begin{bmatrix} -2 \\\ 1 \\\ 2 \end{bmatrix} \cdot \begin{bmatrix} -\frac{2}{3} \\\ \frac{1}{3} \\\ \frac{2}{3} \end{bmatrix} & \begin{bmatrix} 18 \\\ 0 \\\ 0 \end{bmatrix} \cdot \begin{bmatrix} -\frac{2}{3} \\\ \frac{1}{3} \\\ \frac{2}{3} \end{bmatrix} \\\ 0 & 0 & \begin{bmatrix} 18 \\\ 0 \\\ 0 \end{bmatrix} \cdot \begin{bmatrix} \frac{1}{3} \\\ -\frac{2}{3} \\\ \frac{2}{3} \end{bmatrix}\end{bmatrix}
R = \begin{bmatrix} 3 & 0 & 12 \\\ 0 & 3 & -12 \\\ 0 & 0 & 6 \end{bmatrix}

The Gram-Schmidt Algorithm in R

We use the same matrix A to verify our results above.

A <- rbind(c(2,-2,18), c(2,1,0), c(1,2,0))
A
##      [,1] [,2] [,3]
## [1,]    2   -2   18
## [2,]    2    1    0
## [3,]    1    2    0

The following function is an implementation of the Gram-Schmidt algorithm using the modified version of the algorithm. A good comparison of the classical and modified versions of the algorithm can be found here. The Modified Gram-Schmidt algorithm was used here due to its improved numerical stability, which results in more orthogonal columns than the Classical algorithm.

gramschmidt <- function(x) {
  x <- as.matrix(x)
  # Get the number of rows and columns of the matrix
  n <- ncol(x)
  m <- nrow(x)

  # Initialize the Q and R matrices
  q <- matrix(0, m, n)
  r <- matrix(0, n, n)

  for (j in 1:n) {
    v = x[,j] # Step 1 of the Gram-Schmidt process: v1 = a1
    # Skip the first column
    if (j > 1) {
      for (i in 1:(j-1)) {
        r[i,j] <- t(q[,i]) %*% x[,j] # Find the inner product (noted to be q^T a earlier)
        # Subtract the projection from v, which causes v to become perpendicular to all columns of Q
        v <- v - r[i,j] * q[,i]
      }
    }
    # Find the L2 norm of the jth diagonal of R
    r[j,j] <- sqrt(sum(v^2))
    # The orthogonalized result is found and stored in the jth column of Q.
    q[,j] <- v / r[j,j]
  }

  # Collect the Q and R matrices into a list and return
  qrcomp <- list('Q'=q, 'R'=r)
  return(qrcomp)
}

Perform the Gram-Schmidt orthogonalization process on the matrix A using our function.

gramschmidt(A)
## $Q
##           [,1]       [,2]       [,3]
## [1,] 0.6666667 -0.6666667  0.3333333
## [2,] 0.6666667  0.3333333 -0.6666667
## [3,] 0.3333333  0.6666667  0.6666667
##
## $R
##      [,1] [,2] [,3]
## [1,]    3    0   12
## [2,]    0    3  -12
## [3,]    0    0    6

The results of our function match those of our manual calculations!

The base R qr() function also computes the QR decomposition, though it uses Householder reflections (via LINPACK/LAPACK) rather than Gram-Schmidt, so the signs of some columns can differ. The qr() function does not output the Q and R matrices directly; they must be obtained by calling qr.Q() and qr.R(), respectively, on the qr object.

A.qr <- qr(A)
A.qr.out <- list('Q'=qr.Q(A.qr), 'R'=qr.R(A.qr))
A.qr.out
## $Q
##            [,1]       [,2]       [,3]
## [1,] -0.6666667  0.6666667  0.3333333
## [2,] -0.6666667 -0.3333333 -0.6666667
## [3,] -0.3333333 -0.6666667  0.6666667
##
## $R
##      [,1] [,2] [,3]
## [1,]   -3    0  -12
## [2,]    0   -3   12
## [3,]    0    0    6

Thus the qr() function in R matches our function and manual calculations as well, up to the signs of the columns of Q and the corresponding rows of R (a QR decomposition is only unique up to these signs).
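As a quick sanity check (not part of the original calculations), multiplying the factors back together should reproduce A up to floating-point error:

gs <- gramschmidt(A)
# Q %*% R should equal the original matrix A within numerical tolerance
all.equal(gs$Q %*% gs$R, A, check.attributes = FALSE)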

References

http://www.calpoly.edu/~jborzell/Courses/Year%2005-06/Spring%202006/304Gram_Schmidt_Exercises.pdf

http://cavern.uark.edu/~arnold/4353/CGSMGS.pdf

https://www.math.ucdavis.edu/~linear/old/notes21.pdf

http://www.math.ucla.edu/~yanovsky/Teaching/Math151B/handouts/GramSchmidt.pdf



Announcing R Tools 1.0 for Visual Studio 2015

Thu, 03/23/2017 - 19:16

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Shahrokh Mortazavi, Partner PM, Visual Studio Cloud Platform Tools at Microsoft

I’m delighted to announce the general availability of R Tools 1.0 for Visual Studio 2015 (RTVS). This release will be shortly followed by R Tools 1.0 for Visual Studio 2017 in early May.

RTVS is a free and open source plug-in that turns Visual Studio into a powerful and productive R development environment. Check out this video for a quick tour of its core features:

Core IDE Features

RTVS builds on Visual Studio, which means you get numerous features for free: from using multiple languages to world-class editing and debugging to over 7,000 extensions for every need:

  • A polyglot IDE – VS supports R, Python, C++, C#, Node.js, SQL, etc. projects simultaneously.
  • Editor – complete editing experience for R scripts and functions, including detachable/tabbed windows, syntax highlighting, and much more.
  • IntelliSense – (aka auto-completion) available in both the editor and the Interactive R window.
  • R Interactive Window – work with the R console directly from within Visual Studio.
  • History window – view, search, select previous commands and send to the Interactive window.
  • Variable Explorer – drill into your R data structures and examine their values.
  • Plotting – see all of your R plots in a Visual Studio tool window.
  • Debugging – breakpoints, stepping, watch windows, call stacks and more.
  • R Markdown – R Markdown/knitr support with export to Word and HTML.
  • Git – source code control via Git and GitHub.
  • Extensions – over 7,000 Extensions covering a wide spectrum from Data to Languages to Productivity.
  • Help – use ? and ?? to view R documentation within Visual Studio.

It’s Enterprise-Grade

RTVS includes various features that address the needs of individual as well as Data Science teams, for example:

SQL Server 2016

RTVS integrates with SQL Server 2016 R Services and SQL Server Tools for Visual Studio 2015. These separate downloads enhance RTVS with support for syntax coloring and Intellisense, interactive queries, and deployment of stored procedures directly from Visual Studio.

Microsoft R Client

Use the stock CRAN R interpreter, or the enhanced Microsoft R Client and its ScaleR functions that support multi-core and cluster computing for practicing data science at scale.

Visual Studio Team Services

Integrated support for git, continuous integration, agile tools, release management, testing, reporting, bug and work-item tracking through Visual Studio Team Services. Use our hosted service or host it yourself privately.

Remoting

Whether it’s data governance, security, or running large jobs on a powerful server, RTVS workspaces enable setting up your own R server or connecting to one in the cloud.

The road ahead

We’re very excited to officially bring another language to the Visual Studio family!  Along with Python Tools for Visual Studio, you have the two main languages for tackling most any analytics and ML related challenge.  In the near future (~May), we’ll release RTVS for Visual Studio 2017 as well. We’ll also resurrect the “Data Science workload” in VS2017 which gives you R, Python, F# and all their respective package distros in one convenient install. Beyond that, we’re looking forward to hearing from you on what features we should focus on next! R package development? Mixed R+C debugging? Model deployment? VS Code/R for cross-platform development? Let us know at the RTVS Github repository!

Thank you!

Bits: http://microsoft.github.io/RTVS-docs/installation
Code: https://github.com/Microsoft/RTVS
Docs: http://microsoft.github.io/RTVS-docs


Survminer Cheatsheet to Easily Create Survival Plots

Thu, 03/23/2017 - 16:46

We recently released survminer version 0.3, which includes many new features to help in visualizing and summarizing survival analysis results.

In this article, we present a cheatsheet for survminer, created by Przemysław Biecek, and provide an overview of main functions.

survminer cheatsheet

The cheatsheet can be downloaded from STHDA and from Rstudio. It contains selected important functions, such as:

  • ggsurvplot() for plotting survival curves
  • ggcoxzph() and ggcoxdiagnostics() for assessing the assumptions of the Cox model
  • ggforest() and ggcoxadjustedcurves() for summarizing a Cox model

Additional functions, that you might find helpful, are briefly described in the next section.

survminer overview

The main functions, in the package, are organized in different categories as follow.

Survival Curves

  • ggsurvplot(): Draws survival curves with the ‘number at risk’ table, the cumulative number of events table and the cumulative number of censored subjects table (a minimal usage sketch follows this list).

  • arrange_ggsurvplots(): Arranges multiple ggsurvplots on the same page.

  • ggsurvevents(): Plots the distribution of event’s times.

  • surv_summary(): Summary of a survival curve. Compared to the default summary() function, surv_summary() creates a data frame containing a nice summary from survfit results.

  • surv_cutpoint(): Determines the optimal cutpoint for one or multiple continuous variables at once. Provides the value of the cutpoint that corresponds to the most significant relation with survival.

  • pairwise_survdiff(): Multiple comparisons of survival curves. Calculate pairwise comparisons between group levels with corrections for multiple testing.
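
A minimal usage sketch (not taken from the cheatsheet itself, using the lung dataset shipped with the survival package):

library(survival)
library(survminer)

# Kaplan-Meier fit by sex, plotted with a risk table and a log-rank p-value
fit <- survfit(Surv(time, status) ~ sex, data = lung)
ggsurvplot(fit, risk.table = TRUE, pval = TRUE)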

Diagnostics of Cox Model

  • ggcoxzph(): Graphical test of proportional hazards. Displays a graph of the scaled Schoenfeld residuals, along with a smooth curve using ggplot2. Wrapper around plot.cox.zph().

  • ggcoxdiagnostics(): Displays diagnostics graphs presenting goodness of Cox Proportional Hazards Model fit.

  • ggcoxfunctional(): Displays graphs of continuous explanatory variables against the martingale residuals of a null Cox proportional hazards model. It helps to properly choose the functional form of a continuous variable in the Cox model.

Summary of Cox Model

  • ggforest(): Draws forest plot for CoxPH model.

  • ggcoxadjustedcurves(): Plots adjusted survival curves for coxph model.

Competing Risks

  • ggcompetingrisks(): Plots cumulative incidence curves for competing risks.

Find out more at http://www.sthda.com/english/rpkgs/survminer/, and check out the documentation and usage examples of each of the functions in survminer package.

Infos

This analysis has been performed using R software (ver. 3.3.2).




Make the [R] Kenntnis-Tage 2017 your stage

Thu, 03/23/2017 - 14:17

(This article was first published on eoda english R news, and kindly contributed to R-bloggers)

At the [R] Kenntnis-Tage 2017 on November 8 and 9, 2017, you will get the chance to benefit not only from the exchange about the programming language R in a business context and from practical tutorials, but also from the audience: use the [R] Kenntnis-Tage 2017 as your platform and hand in your topic for the call for papers.


[R] Kenntnis-Tage 2017: Call for Papers

The topics can be as diverse as the event itself: share your data science use case with the participants, tell them your personal lessons learned with R in the business environment, or show how your company uses data science and R. As a speaker at the [R] Kenntnis-Tage 2017, you will get free admission to the event.

Companies present themselves as big data pioneers

At last year’s event, many speakers took that chance: Julia Flad from Trumpf Laser talked about the application possibilities of data science in industry and shared her experiences with the participants. In his talk “Working efficiently with R – faster to the data product”, Julian Gimbel from Lufthansa Industry Solutions gave a vivid example of working with the open source programming language.

Take your chance to become part of the [R] Kenntnis-Tage 2017 and hand in your topic related to data science or R. For more information, e.g. on the length of the presentation or the choice of topic, visit our website.

You don’t want to give a talk but still want to participate? If you register by May 5 you can benefit from our early bird tickets. Data science goes professional – join in.


The Tidyverse Curse

Thu, 03/23/2017 - 14:08

(This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)

I’ve just finished a major overhaul to my widely read article, Why R is Hard to Learn. It describes the main complaints I’ve heard from the participants in my workshops, and how those complaints can often be mitigated. Here’s the only new section:

The Tidyverse Curse

There’s a common theme in many of the sections above: a task that is hard to perform using a base R function is made much easier by a function in the dplyr package. That package, and its relatives, are collectively known as the tidyverse. Its functions help with many tasks, such as selecting, renaming, or transforming variables, filtering or sorting observations, combining data frames, and doing by-group analyses. dplyr is such a helpful package that Rdocumentation.org shows that it is the single most popular R package (as of 3/23/2017). As much of a blessing as these commands are, they’re also a curse to beginners, as they’re more to learn. The main packages of dplyr, tibble, tidyr, and purrr contain a few hundred functions, though I use “only” around 60 of them regularly. As people learn R, they often comment that base R functions and tidyverse ones feel like two separate languages. The tidyverse functions are often the easiest to use, but not always; its pipe operator is usually simpler to use, but not always; tibbles are usually accepted by non-tidyverse functions, but not always; grouped tibbles may help do what you want automatically, but not always (i.e. you may need to ungroup or group_by higher levels). Navigating the balance between base R and the tidyverse is a challenge to learn.

A demonstration of the mental overhead required to use tidyverse functions involves the usually simple process of printing data. I mentioned this briefly in the Identity Crisis section above. Let’s look at an example using the built-in mtcars data set and R’s built-in print function:

> print(mtcars)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
...

We see the data, but the variable names actually ran off the top of my screen when viewing the entire data set, so I had to scroll backwards to see what they were. The dplyr package adds several nice new features to the print function. Below, I’m taking mtcars and sending it using the pipe operator “%>%” into dplyr’s as_data_frame function to convert it to a special type of tidyverse data frame called a “tibble” which prints better. From there I send it to the print function (that’s R’s default function, so I could have skipped that step). The output all fits on one screen since it stopped at a default of 10 observations. That allowed me to easily see the variable names that had scrolled off the screen using R’s default print method.  It also notes helpfully that there are 22 more rows in the data that are not shown. Additional information includes the row and column counts at the top (32 x 11), and the fact that the variables are stored in double precision (<dbl>).

> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   print()
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
*  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows

The new print format is helpful, but we also lost something important: the names of the cars! It turns out that row names get in the way of the data wrangling that dplyr is so good at, so tidyverse functions replace row names with 1, 2, 3…. However, the names are still available if you use the rownames_to_column() function:

> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
Error in function_list[[i]](value) :
  could not find function "rownames_to_column"

Oops, I got an error message; the function wasn’t found. I remembered the right command, and using the dplyr package did cause the car names to vanish, but the solution is in the tibble package that I “forgot” to load. So let’s load that too (dplyr is already loaded, but I’m listing it again here just to make each example stand alone.)

> library("dplyr")
> library("tibble")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
# A tibble: 32 × 12
             rowname   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
               <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1          Mazda RX4  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2      Mazda RX4 Wag  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3         Datsun 710  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4     Hornet 4 Drive  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5  Hornet Sportabout  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6            Valiant  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7         Duster 360  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8          Merc 240D  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9           Merc 230  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10          Merc 280  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows

Another way I could have avoided that problem is by loading the package named tidyverse, which includes both dplyr and tibble, but that’s another detail to learn.

In the above output, the row names are back! What if we now decided to save the data for use with a function that expects row names? It would not find them, because they’re now stored in a variable called rowname, not in the row-names position! Therefore, we would need to use either the built-in row.names function or the tibble package’s column_to_rownames function to restore the names to their previous position.
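A minimal sketch of that round trip (assuming dplyr and tibble are loaded, as above):

# Move the car names into a column, then push them back into the row-name slot
mtcars %>%
  rownames_to_column(var = "rowname") %>%
  column_to_rownames(var = "rowname") %>%
  head()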

Most other data science software requires row names to be stored in a standard variable e.g. rowname. You then supply its name to procedures with something like SAS’
“ID rowname;” statement. That’s less to learn.

This isn’t a defect of the tidyverse, it’s the result of an architectural decision on the part of the original language designers; it probably seemed like a good idea at the time. The tidyverse functions are just doing the best they can with the existing architecture.

Another example of the difference between base R and the tidyverse can be seen when dealing with long text strings. Here I have a data frame in tidyverse format (a tibble). I’m asking it to print the lyrics for the song American Pie. Tibbles normally print in a nicer format than standard R data frames, but for long strings, they only display what fits on a single line:

> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   print()
# A tibble: 1 × 1
                                                          lyrics
                                                           <chr>
1 a long long time ago i can still remember how that music used

The whole song can be displayed by converting the tibble to a standard R data frame by routing it through the as.data.frame function:

> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   as.data.frame() %>%
+   print()
... <truncated>
1 a long long time ago i can still remember how that music used to make me smile and i knew if i had my chance that i could make those people dance and maybe theyd be happy for a while but february made me shiver with every paper id deliver bad news on the doorstep i couldnt take one more step i cant remember if i cried ...

These examples demonstrate a small slice of the mental overhead you’ll need to deal with as you learn base R and the tidyverse packages, such as dplyr. Since this section has focused on what makes R hard to learn, it may make you wonder why dplyr is the most popular R package. You can get a feel for that by reading the Introduction to dplyr. Putting in the time to learn it is well worth the effort.


10 Million Dots: Mapping European Population

Thu, 03/23/2017 - 11:02

(This article was first published on R – Spatial.ly, and kindly contributed to R-bloggers)

Dot density maps are an increasingly popular way of showing population datasets. The technique has its limitations (see here for more info), but it can create some really nice visual outputs, particularly when trying to show density. I have tried to push this to the limit by using a high resolution population grid for Europe (GEOSTAT 2011) to plot a dot for every 50 people in Europe (interactive). The detailed data aren’t published for all European countries – there are modelled values for those missing but I have decided to stick to only those countries with actual counts here.

Giant dot density maps with R

I produced this map as part of an experiment to see how R could handle generating millions of dots from spatial data and then plotting them. Even though the final image is 3.8 metres by 3.8 metres – this is as big as I could make it without causing R to crash – many of the grid cells in urban areas are saturated with dots so a lot of detail is lost in those areas. The technique is effective in areas that are more sparsely populated. It would probably work better with larger spatial units that aren’t necessarily grid cells.

Generating the dots (using the spsample function) and then plotting them in one go would require way more RAM than I have access to. Instead I wrote a loop to select each grid cell of population data, generate the requisite number of points and then plot them. This created a much lighter load as my computer happily chugged away for a few hours to produce the plot. I could have plotted a dot for each person (500 million+) but this would have completely saturated the map, so instead I opted for 1 dot for every 50 people.

I have provided full code below. T&Cs on the data mean I can’t share that directly but you can download here.

Load rgdal and the spatial object required. In this case it’s the GEOSTAT 2011 population grid. In addition I am using the rainbow function to generate a rainbow palette for each of the countries.

library(rgdal)
Input <- readOGR(dsn = ".", layer = "SHAPEFILE")
Cols <- data.frame(Country = unique(Input$Country),
                   Cols = rainbow(nrow(data.frame(unique(Input$Country)))))

Create the initial plot. This is empty, but it enables the dots to be added iteratively to the plot. I have specified a 380cm by 380cm PNG at 150 dpi – this is as big as I could get it without crashing R.

png("europe_density.png",width=380, height=380, units="cm", res=150) par(bg='black') plot(t(bbox(Input)), asp=T, axes=F, ann=F)

Loop through each of the individual spatial units – grid cells in this case – and plot the dots for each. The number of dots is specified in spsample as the grid cell's population value/50.

for (j in 1:nrow(Cols)) {
  Subset <- Input[Input$Country == Cols[j, 1], ]
  for (i in 1:nrow(Subset@data)) {
    Poly <- Subset[i, ]
    pts1 <- spsample(Poly, n = 1 + round(Poly@data$Pop/50), "random", iter = 15)
    # The colours are taken from the Cols object.
    plot(pts1, pch = 16, cex = 0.03, add = T, col = as.vector(Cols[j, 2]))
  }
}
dev.off()

 

To leave a comment for the author, please follow the link and comment on their blog: R – Spatial.ly. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

New mlr Logo

Thu, 03/23/2017 - 01:00

(This article was first published on mlr-org, and kindly contributed to R-bloggers)

We at mlr are currently deciding on a new logo, and in the spirit of open-source, we would like to involve the community in the voting process!

You can vote for your favorite logo on GitHub by reacting to the logo with a +1.

Thanks to Hannah Atkin for designing the logos!

To leave a comment for the author, please follow the link and comment on their blog: mlr-org. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Euler Problem 17: Number Letter Counts

Wed, 03/22/2017 - 22:09

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Euler Problem 17 asks us to count the letters in numbers written out as words. This is a skill we all learnt in primary school, mainly useful when writing cheques—for those who still use them.

Each language has its own rules for writing numbers. My native language, Dutch, has very different logic from English. Both Dutch and English use distinct single words up to the number twelve. Linguists have theorised this is evidence that early Germanic numbers were duodecimal. This factoid is supported by the importance of a “dozen” as a counting word and the twelve hours on the clock. There is even a Dozenal Society that promotes the use of a number system based on 12.

The English language changes the rules when it reaches the number 21. Below twenty we say the unit first, as in “eighteen”, but we do not say “one-and-twenty”. Dutch stays consistent: the units digit is always spoken first. For example, 37 in English is “thirty-seven”, while in Dutch it is written as “zevenendertig” (seven and thirty).

Euler Problem 17 Definition

If the numbers 1 to 5 are written out in words: one, two, three, four, five, then there are 3 + 3 + 5 + 4 + 4 = 19 letters used in total. If all the numbers from 1 to 1000 (one thousand) inclusive were written out in words, how many letters would be used?

NOTE: Do not count spaces or hyphens. For example, 342 (three hundred and forty-two) contains 23 letters and 115 (one hundred and fifteen) contains 20 letters. The use of “and” when writing out numbers is in compliance with British usage.
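As a quick sanity check of the counting rule in R (a small sketch, dropping only spaces and hyphens before counting):

nchar(gsub("[ -]", "", "three hundred and forty-two"))  # 23
nchar(gsub("[ -]", "", "one hundred and fifteen"))      # 20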

Solution

The first piece of code provides a function that generates the words for the numbers 1 to 999,999. This is more than the problem asks for, but it might be a useful function for another application. The final lines concatenate the words for 1 to 1000, strip the spaces and count the characters.

numword.en <- function(x) {
  if (x > 999999) return("Error: Outside my vocabulary")
  # Vocabulary
  single <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine")
  teens <- c("ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
             "sixteen", "seventeen", "eighteen", "nineteen")
  tens <- c("ten", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
            "eighty", "ninety")
  # Translation
  numword.10 <- function(y) {
    a <- y %% 100
    if (a != 0) {
      and <- ifelse(y > 100, "and", "")
      if (a < 20) return(c(and, c(single, teens)[a]))
      else return(c(and, tens[floor(a / 10)], single[a %% 10]))
    }
  }
  numword.100 <- function(y) {
    a <- (floor(y / 100) %% 100) %% 10
    if (a != 0) return(c(single[a], "hundred"))
  }
  numword.1000 <- function(y) {
    a <- (1000 * floor(y / 1000)) / 1000
    if (a != 0) return(c(numword.100(a), numword.10(a), "thousand"))
  }
  numword <- paste(c(numword.1000(x), numword.100(x), numword.10(x)), collapse = " ")
  return(trimws(numword))
}

answer <- nchar(gsub(" ", "", paste0(sapply(1:1000, numword.en), collapse = "")))
print(answer)

Writing Numbers in Dutch

I went beyond Euler Problem 17 by translating the code to spell numbers in Dutch. An interesting bit of trivia is that it takes 307 fewer characters to spell the numbers 1 to 1000 in Dutch than it does in English.

It would be good if other people could submit functions for other languages in the comment section. Perhaps we can create an R package with a multilingual function for spelling numbers.

numword.nl <- function(x) {
  if (x > 999999) return("Error: Getal te hoog.")
  single <- c("een", "twee", "drie", "vier", "vijf", "zes", "zeven", "acht", "negen")
  teens <- c("tien", "elf", "twaalf", "dertien", "veertien", "vijftien",
             "zestien", "zeventien", "achttien", "negentien")
  tens <- c("tien", "twintig", "dertig", "veertig", "vijftig", "zestig",
            "zeventig", "tachtig", "negentig")
  numword.10 <- function(y) {
    a <- y %% 100
    if (a != 0) {
      if (a < 20) return(c(single, teens)[a])
      else return(c(single[a %% 10], "en", tens[floor(a / 10)]))
    }
  }
  numword.100 <- function(y) {
    a <- (floor(y / 100) %% 100) %% 10
    if (a == 1) return("honderd")
    if (a > 1) return(c(single[a], "honderd"))
  }
  numword.1000 <- function(y) {
    a <- (1000 * floor(y / 1000)) / 1000
    if (a == 1) return("duizend ")
    if (a > 0) return(c(numword.100(a), numword.10(a), "duizend "))
  }
  numword <- paste(c(numword.1000(x), numword.100(x), numword.10(x)), collapse = "")
  return(trimws(numword))
}

antwoord <- nchar(gsub(" ", "", paste0(sapply(1:1000, numword.nl), collapse = "")))
print(antwoord)
print(answer - antwoord)

The post Euler Problem 17: Number Letter Counts appeared first on The Devil is in the Data.

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Data Visualization – Part 2

Wed, 03/22/2017 - 20:50

(This article was first published on R-Projects – Stoltzmaniac, and kindly contributed to R-bloggers)

A Quick Overview of the ggplot2 Package in R

While it will be important to focus on theory, I want to explain the ggplot2 package because I will be using it throughout the rest of this series. Knowing how it works will keep the focus on the results rather than the code. It’s an incredibly powerful package and once you wrap your head around what it’s doing, your life will change for the better! There are a lot of tools out there which provide better charts, graphs and ease of use (e.g. plot.ly, d3.js, Qlik, Tableau), but ggplot2 is still a fantastic resource and I use it all of the time.

In case you missed it, here’s a link to Data Visualization – Part 1

Why would you use ggplot2?
  1. More robust plotting than the base plot package
  2. Better control over aesthetics – colors, axes, background, etc.
  3. Layering
  4. Variable Mapping (aes)
  5. Automatic aggregation of data
  6. Built in formulas & plotting (geom_smooth)
  7. The list goes on and on…

Basically, ggplot2 allows for a lot more customization of plots with a lot less code (the rest of it is behind the scenes). Once you are used to the syntax, there’s no going back. It’s faster and easier.

Why wouldn’t you use ggplot2?
  1. A bit of a learning curve
  2. Lack of user interactivity with the plots

Fundamentally, ggplot2 gives the user the ability to start a plot and layer everything in. There are many ways to accomplish the same thing, so figure out what makes sense for you and stick to it.

A Basic Example: Unemployment Over Time

library(dplyr)
library(ggplot2)
# Load the economics data from ggplot2
data(economics, package = 'ggplot2')

# Take a look at the format of the data
head(economics)

## # A tibble: 6 × 6
##         date   pce    pop psavert uempmed unemploy
##       <date> <dbl>  <int>   <dbl>   <dbl>    <int>
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
## 3 1967-09-01 516.3 199113    11.7     4.6     2958
## 4 1967-10-01 512.9 199311    12.5     4.9     3143
## 5 1967-11-01 518.1 199498    12.5     4.7     3066
## 6 1967-12-01 525.8 199657    12.1     4.8     3018

# Create the plot
ggplot(data = economics) +
  geom_line(aes(x = date, y = unemploy))

What happened to get that?
  • ggplot(economics) loaded the data frame
  • + tells ggplot() that there is more to be added to the plot
  • geom_line() defined the type of plot
  • aes(x = date, y = unemploy) mapped the variables

The aes() portion is what typically throws new users off but is my favorite feature of ggplot2. In simple terms, this is what “auto-magically” brings your plot to life. You are telling ggplot2, “I want ‘date’ to be on the x-axis and ‘unemploy’ to be on the y-axis.” It’s pretty straightforward in this case but there are more complex use cases as well.

Side Note: you could have achieved the same result by mapping the variables in the ggplot() function rather than in geom_line():
ggplot(data = economics, aes(x = date, y = unemploy)) + geom_line()

Here’s the basic formula for success:
  • Everything in ggplot2 starts with ggplot(data) and utilizes + to add on every element thereafter
  • Include your data frame (economics) in a ggplot function: ggplot(data = economics)
  • Input the type of plot you would like (i.e. line chart of unemployment over time): + geom_line(aes(x = date, y = unemploy))
    • “geom” stands for “geometric object” and determines the type of object (there can be more than one type per plot)
    • There are a lot of types of geometric objects – check them out here
  • Add in layers and utilize fill and col parameters within aes()

I’ll go through some of the examples from the Top 50 ggplot2 Visualizations Master List. I will be using their examples but I will also explain what’s going on.

Note: I believe the intention of the author of the Top 50 ggplot2 Visualizations Master List was to illustrate how to use ggplot2 rather than to give a full demonstration of important data visualization techniques – so keep that in mind as I go through these examples. Some of the visuals do not line up with the best practices addressed in my first post on data visualization.

As usual, some packages must be loaded.

library(reshape2)
library(lubridate)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(gridExtra)

The Scatterplot

This is one of the most visually powerful tools for data analysis. However, you have to be careful when using it because it’s primarily used by people doing analysis rather than reporting (depending on what industry you’re in).

The author of this chart was looking for a correlation between area and population.

# Use the "midwest"" data from ggplot2 data("midwest", package = "ggplot2") head(midwest)

## # A tibble: 6 × 28
##     PID    county state  area poptotal popdensity popwhite popblack
##   <int>     <chr> <chr> <dbl>    <int>      <dbl>    <int>    <int>
## 1   561     ADAMS    IL 0.052    66090  1270.9615    63917     1702
## 2   562 ALEXANDER    IL 0.014    10626   759.0000     7054     3496
## 3   563      BOND    IL 0.022    14991   681.4091    14477      429
## 4   564     BOONE    IL 0.017    30806  1812.1176    29344      127
## 5   565     BROWN    IL 0.018     5836   324.2222     5264      547
## 6   566    BUREAU    IL 0.050    35688   713.7600    35157       50
## # ... with 20 more variables: popamerindian <int>, popasian <int>,
## #   popother <int>, percwhite <dbl>, percblack <dbl>, percamerindan <dbl>,
## #   percasian <dbl>, percother <dbl>, popadults <int>, perchsd <dbl>,
## #   percollege <dbl>, percprof <dbl>, poppovertyknown <int>,
## #   percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>

Here’s the most basic version of the scatter plot

This can be called by geom_point() in ggplot2

# Scatterplot
ggplot(data = midwest, aes(x = area, y = poptotal)) +
  geom_point()

Here’s a version with some additional features

While adding point size and colour doesn’t add much value here, it does show the level of customization that’s possible with ggplot2.

ggplot(data = midwest, aes(x = area, y = poptotal)) +
  geom_point(aes(col = state, size = popdensity)) +
  geom_smooth(method = "loess", se = F) +
  xlim(c(0, 0.1)) +
  ylim(c(0, 500000)) +
  labs(subtitle = "Area Vs Population",
       y = "Population",
       x = "Area",
       title = "Scatterplot",
       caption = "Source: midwest")

Explanation:

ggplot(data = midwest, aes(x = area, y = poptotal)) +
Inputs the data and maps x and y variables as area and poptotal.

geom_point(aes(col=state, size=popdensity)) +
Creates a scatterplot and maps the color and size of points to state and popdensity.

geom_smooth(method="loess", se=F) +
Creates a smoothing curve to fit the data. method is the type of fit and se determines whether or not to show the confidence band around the curve.

xlim(c(0, 0.1)) +
Sets the x-axis limits.

ylim(c(0, 500000)) +
Sets the y-axis limits.

labs(subtitle="Area Vs Population",

y="Population",

x="Area",

title="Scatterplot",

caption = "Source: midwest")
Changes the labels of the subtitle, y-axis, x-axis, title and caption.

Notice that the legend was automatically created and placed to the side of the plot. This is also highly customizable and can be changed easily.

The Density Plot

Density plots are a great way to see how data is distributed. They are similar to histograms in a sense, but they show a smoothed estimate of the distribution rather than raw counts. In this example, the author used the mpg data set and is looking at how the distribution of City Mileage differs based on the number of cylinders the car has.

# Examine the mpg data set
head(mpg)

## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl      trans   drv   cty   hwy    fl
##          <chr> <chr> <dbl> <int> <int>      <chr> <chr> <int> <int> <chr>
## 1         audi    a4   1.8  1999     4   auto(l5)     f    18    29     p
## 2         audi    a4   1.8  1999     4 manual(m5)     f    21    29     p
## 3         audi    a4   2.0  2008     4 manual(m6)     f    20    31     p
## 4         audi    a4   2.0  2008     4   auto(av)     f    21    30     p
## 5         audi    a4   2.8  1999     6   auto(l5)     f    16    26     p
## 6         audi    a4   2.8  1999     6 manual(m5)     f    18    26     p
## # ... with 1 more variables: class <chr>

Sample Density Plot

g = ggplot(mpg, aes(cty))
g + geom_density(aes(fill = factor(cyl)), alpha = 0.8) +
  labs(title = "Density plot",
       subtitle = "City Mileage Grouped by Number of cylinders",
       caption = "Source: mpg",
       x = "City Mileage",
       fill = "# Cylinders")

You’ll notice one immediate difference here. The author decided to create the object g to hold ggplot(mpg, aes(cty)) – this is a nice trick and will save you some time if you plan on keeping ggplot(mpg, aes(cty)) as the fundamental plot and simply exploring other visualizations on top of it. It is also handy if you need to save the output of a chart to an image file.

ggplot(mpg, aes(cty)) loads the mpg data and aes(cty) assumes aes(x = cty)

g + geom_density(aes(fill=factor(cyl)), alpha=0.8) +
geom_density kicks off a density plot and the mapping of cyl is used for colors. alpha is the transparency/opacity of the area under the curve.

labs(title="Density plot",

subtitle="City Mileage Grouped by Number of cylinders",

caption="Source: mpg",

x="City Mileage",

fill="# Cylinders")
Labeling is cleaned up at the end.

How would you use your new knowledge to see the density by class instead of by number of cylinders?

**Hint: ** g = ggplot(mpg, aes(cty)) has already been established.

g + geom_density(aes(fill = factor(class)), alpha = 0.8) +
  labs(title = "Density plot",
       subtitle = "City Mileage Grouped by Class",
       caption = "Source: mpg",
       x = "City Mileage",
       fill = "Class")


Notice how I didn’t have to write out ggplot() again because it was already stored in the object g.
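Because the base plot lives in g, saving a finished chart to an image file takes only one extra call. Here is a minimal sketch (the file name and dimensions are illustrative choices, not from the original post):

# Build the full plot from the stored base, then write it to disk with ggsave()
p <- g + geom_density(aes(fill = factor(class)), alpha = 0.8)
ggsave("city_mileage_density.png", plot = p, width = 8, height = 5, dpi = 150)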

The Histogram

How could we show the city mileage in a histogram?

g = ggplot(mpg, aes(cty))
g + geom_histogram(bins = 20) +
  labs(title = "Histogram", caption = "Source: mpg", x = "City Mileage")

geom_histogram(bins=20) plots the histogram. If bins isn’t set, ggplot2 falls back to a default of 30 bins and warns you to pick a better value.
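If you would rather control the width of each bar than the number of bars, binwidth is the alternative. A quick sketch (2 mpg per bin is an arbitrary choice):

# Same histogram, but specifying the bin width instead of the bin count
g + geom_histogram(binwidth = 2) +
  labs(title = "Histogram", caption = "Source: mpg", x = "City Mileage")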

The Bar/Column Chart

For all intents and purposes, bar and column charts are essentially the same. Technically, the term “column chart” is used when the bars run vertically. The author of this chart was simply looking at the frequency of the vehicles listed in the data set.

# Data Preparation
freqtable <- table(mpg$manufacturer)
df <- as.data.frame.table(freqtable)
head(df)

##        Var1 Freq
## 1      audi   18
## 2 chevrolet   19
## 3     dodge   37
## 4      ford   25
## 5     honda    9
## 6   hyundai   14

# Set a theme
theme_set(theme_classic())
g <- ggplot(df, aes(Var1, Freq))
g + geom_bar(stat = "identity", width = 0.5, fill = "tomato2") +
  labs(title = "Bar Chart",
       subtitle = "Manufacturer of vehicles",
       caption = "Source: Frequency of Manufacturers from 'mpg' dataset") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))

The addition of theme_set(theme_classic()) adds a preset theme to the chart. You can create your own or select from a large list of themes. This can help set your work apart from others and save a lot of time.

However, theme_set() is different from the theme(axis.text.x = element_text(angle=65, vjust=0.6)) call used inside the plot itself in this case. The author decided to tilt the text along the x-axis; vjust=0.6 changes how far it sits from the axis line.

Within geom_bar() there is another new piece of information: stat="identity" which tells ggplot to use the actual value of Freq.
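As a side note, in ggplot2 2.2.0 and later geom_col() is shorthand for geom_bar(stat = "identity"), so the same chart can be sketched as:

# geom_col() uses the Freq values directly, no stat="identity" needed
g + geom_col(width = 0.5, fill = "tomato2") +
  labs(title = "Bar Chart",
       subtitle = "Manufacturer of vehicles",
       caption = "Source: Frequency of Manufacturers from 'mpg' dataset") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))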

You may also notice that ggplot arranged all of the data in alphabetical order based on the manufacturer. If you want to change the order, it’s best to use the reorder() function. This next chart reorders by Freq and uses coord_flip() to orient the chart differently.

g <- ggplot(df, aes(reorder(Var1, Freq), Freq))
g + geom_bar(stat = "identity", width = 0.5, fill = "tomato2") +
  labs(title = "Bar Chart",
       x = 'Manufacturer',
       subtitle = "Manufacturer of vehicles",
       caption = "Source: Frequency of Manufacturers from 'mpg' dataset") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) +
  coord_flip()

Let’s continue with bar charts – what if we wanted to see what hwy looked like by manufacturer and in terms of cyl?

g = ggplot(mpg, aes(x = manufacturer, y = hwy, col = factor(cyl), fill = factor(cyl)))
g + geom_bar(stat = 'identity', position = 'dodge') +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))

position='dodge' had to be used because the default setting is to stack the bars; 'dodge' places them side by side for comparison.

Despite the fact that the chart did what I wanted, it is very difficult to read due to how many manufacturers there are. This is where the facet_wrap() feature comes in handy.

theme_set(theme_bw())
g = ggplot(mpg, aes(x = factor(cyl), y = hwy, col = factor(cyl), fill = factor(cyl)))
g + geom_bar(stat = 'identity', position = 'dodge') +
  facet_wrap(~manufacturer)


This created a much nicer view of the information. It “auto-magically” split everything out by manufacturer!

Spatial Plots

Another nice feature of ggplot2 is the integration with maps and spatial plotting. In this simple example, I wanted to plot a few cities in Colorado and draw a border around them. Other than the addition of the map, ggplot simply places the dots directly on the locations via their longitude and latitude “auto-magically.”

This map is created with ggmap, which utilizes the Google Maps API.

library(ggmap)
library(ggalt)
foco <- geocode("Fort Collins, CO")  # get longitude and latitude

# Get the Map ----------------------------------------------
colo_map <- qmap("Colorado, United States", zoom = 7, source = "google")

# Get Coordinates for Places ---------------------
colo_places <- c("Fort Collins, CO", "Denver, CO", "Grand Junction, CO",
                 "Durango, CO", "Pueblo, CO")
places_loc <- geocode(colo_places)  # get longitudes and latitudes

# Plot Open Street Map -------------------------------------
colo_map +
  geom_point(aes(x = lon, y = lat), data = places_loc,
             alpha = 0.7, size = 7, color = "tomato") +
  geom_encircle(aes(x = lon, y = lat), data = places_loc,
                size = 2, color = "blue")

Final Thoughts

I hope you learned a lot about the basics of ggplot2 in this post. It’s extremely powerful yet easy to use once you get the hang of it. The best way to really learn it is to try it out. Find some data on your own and try to manipulate it and get it plotted. Without a doubt, you will have all kinds of errors pop up, data you expect to be plotted won’t show up, colors and fills will be different, etc. However, your visualizations will be leveled-up!

Coming soon:
  • Determining whether or not you need a visualization
  • Choosing the type of plot to use depending on the use case
  • Visualization beyond the standard charts and graphs

I made some modifications to the code, but almost all of the examples here were from Top 50 ggplot2 Visualizations – The Master List .

As always, the code used in this post is on my GitHub

To leave a comment for the author, please follow the link and comment on their blog: R-Projects – Stoltzmaniac. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Datashader is a big deal

Wed, 03/22/2017 - 19:38

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I recently got back from Strata West 2017 (where I ran a very well received workshop on R and Spark). One thing that really stood out for me at the exhibition hall was Bokeh plus datashader from Continuum Analytics.

I had the privilege of having Peter Wang himself demonstrate datashader for me and answer a few of my questions.

I am so excited about datashader’s capabilities that I will not wait for the functionality to be exposed in R through rbokeh. I am going to leave my usual knitr/rmarkdown world and dust off Jupyter Notebook just to use datashader plotting. This is worth trying, even for diehard R users.

datashader

Every plotting system has two important ends: the grammar where you specify the plot, and the rendering pipeline that executes the presentation. Switching plotting systems means switching how you specify plots and can be unpleasant (this is one of the reasons we wrap our most re-used plots in WVPlots to hide or decouple how the plots are specified from the results you get). Given the convenience of the ggplot2 grammar, I am always reluctant to move to other plotting systems unless they bring me something big (and even then sometimes you don’t have to leave: for example the absolutely amazing adapter plotly::ggplotly).

Currently, to use datashader you must talk directly to Python and Bokeh (i.e. learn a different language). But what that buys you is massive: in-pixel analytics. Let me clarify that.

datashader makes points and pixels first class entities in the graphics rendering pipeline. It admits they exist (many plotting systems render to an imaginary infinite resolution abstract plane) and allows the user to specify scale dependent calculations and re-calculations over them. It is easiest to show by example.

Please take a look at these stills from the datashader US Census example. We can ask pixels to be colored by the majority race in the region of Lake Michigan:

If we were to use the interactive version of this graph we could zoom in on Chicago and the majorities are re-calculated based on the new scale:

What is important to understand is that this is vastly more powerful than zooming in on a low-resolution rendering:

and even more powerful than zooming out on a static high-resolution rendering:

datashader can redo aggregations and analytics on the fly. It can recompute histograms and renormalize them relative to what is visible to maintain contrast. It can find patterns that emerge as we change scale: think of zooming in on a grey pixel that resolves into a black and white checkerboard.

You need to run datashader to really see the effect. The html exports, while interactive, sometimes do not correctly perform in all web browsers.

An R example

I am going to share a simple datashader example here. Again, to see the full effect you would have to copy it into a Jupyter notebook and run it. But I will use it to show my point.

After going through the steps to install Anaconda and Jupyter notebook (plus some more conda install steps to include necessary packages), we can make a plot of the ggplot2 example dataset diamonds.

ggplot2 renderings of diamonds typically look like the following (and show off the power and convenience of the grammar):
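For reference, here is a minimal sketch of that kind of ggplot2 view of diamonds (carat against price, with transparency as one way to cope with the roughly 54,000 overlapping points; the exact aesthetics are my choice, not the original figure):

library(ggplot2)
# Price against carat, with semi-transparent small points to reduce overplotting
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1, size = 0.5)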

A datashader rendering looks like the following:

If we use the interactive rectangle selector to zoom in on the apparently isolated point around $18300 and 3.025 carats we get the following dynamic re-render:

Notice the points shrunk (and didn’t subdivide) and there are some extremely faint points. There is something wrong with that as a presentation; but it isn’t because of datashader! It is something unexpected in the data which is now jumping out at us.

datashader is shading proportionally to the aggregated count. So the small point staying very dark (and being so dark it causes other points to render near transparent) means there are multiple observations in this tiny neighborhood. Going back to R we can look directly at the data:

> library("dplyr") > diamonds %>% filter(carat>=3, carat<=3.05, price>=18200, price<=18400) # A tibble: 5 × 10 carat cut color clarity depth table price x y z <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> 1 3.01 Premium I SI2 60.2 59 18242 9.36 9.31 5.62 2 3.01 Fair I SI2 65.8 56 18242 8.99 8.94 5.90 3 3.01 Fair I SI2 65.8 56 18242 8.99 8.94 5.90 4 3.01 Good I SI2 63.9 60 18242 9.06 9.01 5.77 5 3.01 Good I SI2 63.9 60 18242 9.06 9.01 5.77

There are actually 5 rows with the exact carat and pricing indicated by the chosen point. The point stood out at fine scale because it indicated something subtle in the data (repetitions) that the analyst may not have known about or expected. The “ugly” presentation was an important warning. This is hands-on work with the data, the quickest path to correct results.
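A quick follow-up sketch in R to see how widespread such repeated rows are in diamonds, counting full-row duplicates with dplyr (the choice to group on all ten columns is mine, for illustration):

library(ggplot2)  # for the diamonds data
library(dplyr)
# Count identical rows across every column and keep only those seen more than once
diamonds %>%
  count(carat, cut, color, clarity, depth, table, price, x, y, z) %>%
  filter(n > 1) %>%
  arrange(desc(n))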

For some web browsers, you don’t always see proper scaling, yielding artifacts like the following:

The Jupyter notebooks always work, and web-browsers usually work (I am assuming it is security or ad-blocking that is causing the effect, not a datashader issue).

Conclusion

datashader brings resolution-dependent, per-pixel analytics to production. This is a very powerful style of interaction that is going to appear in more and more places. It is something that the Continuum Analytics team has written about before, and it requires some interesting cross-compiling (Numba) to implement at scale. Now that analysts have seen this in action they are going to want it and ask for it.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Running your R code on Azure with mrsdeploy

Wed, 03/22/2017 - 18:09

(This article was first published on Revolutions, and kindly contributed to R-bloggers)


by John-Mark Agosta, data scientist manager at Microsoft

Let’s say you’ve built a model in R that is larger than you can conveniently run locally, and you want to take advantage of Azure’s resources simply to run it on a larger machine. This blog explains how to provision and run an Azure virtual machine (VM) for this, using the mrsdeploy library that comes installed with Microsoft’s R Server. We will work specifically with the Ubuntu Linux version of the VM, so you’ll need to be familiar with working with superuser privileges at the command line in Linux, and of course, familiar with R.

The fundamental architecture consists of your local machine as the client for which you create a server machine in the Cloud. You’ll set up a service on the remote machine — the one in the cloud. Once you do this, you needn’t interact directly with the remote machine; instead you issue commands to it and see the results returned at the client. This is one approach; there are many ways this can be done in Azure, depending on your choice of language, reliance on coding, capabilities of the service, and the complexity and scale of the task. A data scientist typically works first interactively to explore data on an individual machine, then puts the model thus built into production at scale, in this example, in the Cloud. The purpose of this posting is to clarify the deployment process, or as it is called, in a mouthful, operationalization. In short, using a VM running the mrsdeploy library in R Server lets you operationalize your code with little effort, at modest expense.

Alternatively, instead of setting up a service with R Server, one could (unadvisedly) just provision a bare virtual machine and log into it as one would any remote machine, with the manual encumbrance of having to work with multiple machines, load application software, and move data and code back and forth. But that’s what we avoid. The point of the Cloud is to make working with large data and compute resources as much as possible like working on your local computer.

Deploying Microsoft R Server (MRS) on an Azure VM

Azure Marketplace offers a Linux VM (Ubuntu version 16.04) preconfigured with R Server 2016. Additionally the Linux VM with R Server comes with mrsdeploy, a new R package for establishing a remote session in a console application and for publishing and managing a web service written in R. In order to use the R Server’s deployment and operationalization features, one needs to configure R Server for operationalization after installation, to act as a deployment server and host analytic web services.

Alternately there are other Azure platforms for operationalization using R Server in the Marketplace, with other operating systems and platforms including HDInsight, Microsoft’s Hadoop offering. Or, equivalently one could use the Data Science VM available in the Marketplace, since it has a copy of R Server installed. Configuration of these platforms is similar to the example covered in this posting.

Provisioning an R Server VM, as referenced in the documentation, takes a few steps that are detailed here, which consist of configuring the VM and setting up the server account to authorize remote access. To set up the server you’ll use the system account you set up as a user of the Linux machine. The server account is used for client interaction with the R Server, and should not be confused with the Linux system account. This is a major difference from the Windows version of the R Server VM, which uses Active Directory services for authentication.

Provisioning a machine from the Marketplace

You will want to install an Ubuntu Marketplace VM with R Server preinstalled. The best way to find it on portal.azure.com is to search for “r server”:

R Server in the Marketplace

Select the Ubuntu version. Do a conventional deployment—let’s say you name yours mymrs. Take note of the mymrs-ip public address and the mymrs-nsg network security group resources created for it, since you will want to customize them.

Log in to the VM using the system account you set up in the Portal, and add these aliases: one for the path to the version of the R executable, MRS (aka Revo64), and one for the mrsdeploy menu-driven administration tool.

alias rserver='/usr/local/bin/Revo64-9.0'
alias radmin='sudo /usr/local/bin/dotnet \
  /usr/lib64/microsoft-deployr/9.0.1/Microsoft.DeployR.Utils.AdminUtil/Microsoft.DeployR.Utils.AdminUtil.dll'

The following are a set of steps to bring up on the VM a combined web-compute server (a “one-box” server) that can be accessed remotely.

1. Check if you can run Microsoft R Server (MRS).

Just use the alias for MRS

$ rserver
[Note a line in the banner saying "Loading Microsoft R Server packages, ..."]

Here’s a simple test that the MRS libraries (the “rx” functions) are preloaded and run.

> rxSummary(formula = ~., data = iris)

2. Set up the MRS server for mrsdeploy

mrsdeploy operationalization runs two services, the web node and one or more compute nodes. In the simplest configuration, the one described here, both “nodes” are services running on the same VM. Alternately, by making these separate, larger loads can be handled with one web node and one or more compute nodes.

Use the alias you created for the admin tool.

$ radmin

This utility brings up a menu

*************************************
Administration Utility (v9.0.1)
*************************************
1. Configure R Server for Operationalization
2. Set a local admin password
3. Stop and start services
4. Change service ports
5. Encrypt credentials
6. Run diagnostic tests
7. Evaluate capacity
8. Exit

Web node endpoint: **http://localhost:12800/**
Please enter an option: 1
Set the admin password: *************
Confirm this password: *************

Configuration for Operationalization:
A. One-box (web + compute nodes)
B. Web node
C. Compute node
D. Reset machine to default install state
E. Return to main menu
Please enter an option: A

Success! Web node running (PID: 4172)
Success! Compute node running (PID: 4172)

At this point the setup should be complete. Running diagnostics with the admin tool can check that it is.

Run Diagnostic Tests:
A. Test Configuration
Please enter an option: 6
Preparing to run diagnostics...

***********************
DIAGNOSTIC RESULTS:
***********************
Overall Health: pass
Web Node Details:
  Logs: /usr/lib64/microsoft-deployr/9.0.1/Microsoft.DeployR.Server.WebAPI/logs
  Available compute nodes: 1
Compute Node Details:
  Health of 'http://localhost:12805/': pass
  Logs: /usr/lib64/microsoft-deployr/9.0.1/Microsoft.DeployR.Server.BackEnd/logs
Authentication Details:
  A local admin account was found. No other form of authentication is configured.
Database Details:
  Health: pass
  Type: sqlite

Code Execution Test: PASS
Code: ‘y <- cumprod(c(1500, 1+(rnorm(n=25, mean=.05, sd = 1.4)/100)))’

Yes, it even tests that the MRS interpreter runs! If the web node or the compute service has stopped, the following test will complain loudly. Note the useful links to the log directories for failure details. Services can be stopped and started from selection 3 in the top-level menu.

Run Diagnostic Tests:
B. Raw Server Status
**********************
SERVICE STATE (raw):
**********************
Please authenticate...
Username: admin
Password: *************
Server:
  Health: pass
  Details:
    logPath: /usr/lib64/microsoft-deployr/9.0.1/Microsoft.DeployR.Server.WebAPI/logs
backends:
  Health: pass
  http://localhost:12805/:
    Health: pass
    Details:
      maxPoolSize: 80
      activeShellCount: 1
      currentPoolSize: 5
      logPath: /usr/lib64/microsoft-deployr/9.0.1/Microsoft.DeployR.Server.BackEnd/logs
database:
  Health: pass
  Details:
    type: sqlite
    name: main
    state: Open

3. Verify that the MRS server is running from the server Linux prompt

The R Server web services can also be checked by looking at the machine’s open ports, without going into the admin tool. This command reveals the ports the Linux machine is listening on:

$ netstat -tupln
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address      Foreign Address  State   PID/Program name
tcp        0      0 127.0.0.1:29130    0.0.0.0:*        LISTEN  42527/mdsd
tcp        0      0 127.0.0.1:29131    0.0.0.0:*        LISTEN  2001/mdsd
tcp        0      0 0.0.0.0:22         0.0.0.0:*        LISTEN  1265/sshd
tcp        0      0 0.0.0.0:9054       0.0.0.0:*        LISTEN  55348/Rserve
tcp        0      0 0.0.0.0:9055       0.0.0.0:*        LISTEN  55348/Rserve
tcp6       0      0 :::12805           :::*             LISTEN  55327/dotnet
tcp6       0      0 :::22              :::*             LISTEN  1265/sshd
tcp6       0      0 :::12800           :::*             LISTEN  55285/dotnet
udp        0      0 0.0.0.0:68         0.0.0.0:*                1064/dhclient

We can see that port 12800 is active for the web service. 12805 is the compute server, running here on the same machine as the web service.

The next thing to do is see if you can connect to the service with R Server running locally, and load mrsdeploy.

4. Check the MRS server is running by logging in from the server itself.

Do this by running a remote mrsdeploy session from the server as localhost. This is the way one would “run MRS as R Client,” even though the full set of MRS features are available. Running MRS as both a client and a server on the same machine is possible, but I see no purpose other than to test that the web service is accessible. The sequence of steps is:

$ rserver
[ MRS banner...]
> endpoint <- "localhost:12800"  # The forum shows this format for logins.
> library(mrsdeploy)
> remoteLogin(endpoint)
Username: admin
Password: *************          # The password you set in the admin tool.
[...]
REMOTE>

If authentication is failing, you can look at the tail of the system log file for the error, like this

$ cd /usr/lib64/microsoft-deployr/9.0.1/Microsoft.DeployR.Server.WebAPI/logs
$ sudo tail $(ls -t1 | head -1)  # Look at the end of the most recent logfile
...
"Message":"The username doesn't belong to the admin user",...

Then, to end the remote session, the command is exit.

REMOTE> exit

5. Finish VM Configuration for remote access

Another two steps are needed before you can use the server over the network. You should set the public DNS (e.g. domain) address since the VM’s public IP address is dynamic and may change when the machine is restarted. And as a matter of security, the Azure firewall (the “network security gateway” resource) needs to be configured.

Go back to portal.azure.com and find these resources associated with the VM:
  • Public DNS address
  • Open incoming service ports

Public IP

To set the public DNS name, go to the portal’s VM overview pane and click on the public-IP item, for instance, “mymrs-ip”:

until you get to the configuration blade:

This will send you to the mymrs-ip component where you can change the DNS label.

Network Security Group

If you don’t do this, a remote mrsdeploy login attempt will fail with a message

Error: Couldn't connect to server

since only port 22 for ssh is allowed by default by the VM’s network security gateway. One option is to use ssh to set up port forwarding; I won’t explain that here. The alternative is to configure remote access on the server. For this you’ll need to open the port the admin tool reported as the web endpoint, typically 12800. The inbound security rules’ blade is buried in the VM choices -> Network Interfaces -> Network Security Group -> Inbound Security Rules. Choose “Add” to create a custom inbound rule for TCP port 12800. The result looks like this:

Now the server is ready for use!

6. Check that the MRS server is running from another machine

You’ll need a local copy of MRS to do this. Copies are available from a few sources, including a “client side only” copy called, naturally, R Client, which is licensed for free. R Client gives you all the remoting capabilities of R Server, as well as the same custom learning algorithms available with R Server, but unlike R Server it is limited to datasets that fit in memory.

The sources of R Server are several:

  • MSDN subscription downloads include R Server for different platforms
  • Also R Client is a free download on MSDN.
  • Microsoft SQL Server comes with R Server as an option. You can install R Server “standalone” with the SQL Server installer in addition to installing it as part of SQL Server.
  • If you have installed R Tools for Visual Studio (RTVS), the R Tools menu has an item to install R Client.
  • Of course any VM that comes with R Server will work too. Notably, the Data Science VM, which hosts an exhaustive collection of data science tools, includes a copy of R Server.

To remotely login from your local machine, the MRS commands are the same as before, except use the domain name of the server from your local client:

> endpoint <- "mymrs.southcentralus.azure.com:12800' > library(mrsdeploy) > remoteLogin(endpoint)

If, as shown, you do not include the admin account and password as arguments to remoteLogin, the command will bring up a modal dialog asking you for them. Be advised that this dialog may be hidden and not come to the front, and you’ll have to look for it.

The server will kindly return a banner with the differences between your client and the server MRS environments. Here’s what a proper remote session returns on initiation:

Diff report between local and remote R sessions...
Warning! R version mismatch
local:  R version 3.3.2 (2016-10-31)
remote: R version 3.2.3 (2015-12-10)

These R packages installed on the local machine are not on the remote R instance:
   Missing Packages
1  checkpoint
2  CompatibilityAPI
3  curl
...
23 RUnit

The versions of these installed R packages differ:
   Package Local Remote
1  base    3.3.2 3.2.3
...
23 utils   3.3.2 3.2.3

Your REMOTE R session is now active.
Commands:
- pause()  to switch to local session & leave remote session on hold.
- resume() to return to remote session.
- exit     to leave (and terminate) remote session.

Once at the REMOTE> prompt you can explore the remote interpreter environment. These handy R functions let you explore the remote environment further:

Sys.getenv()  # will show the machine's OS environment variables on the server.
Sys.info()    # returns a character string with machine and user descriptions

Environment differences: Adding custom packages to the server

The comparative listing of packages when you log into the remote should alert you to the need to accommodate the differences between local and remote environments. Different R versions generate this warning:

Warning! R version mismatch

Different versions will limit which packages are available for both versions.

Compatible but missing packages can be installed on the server. To be able to install packages when available packages differ, the remote session will need permission to write to one of the directories identified by .libPaths() on the remote server. This is not granted by default. If you feel comfortable with letting the remote user make modifications to the server, you could grant this permission by making this directory writable by everyone

$ sudo chmod a+w /usr/local/lib/R/site-library/

Then, to install a package, for example glmnet, into this directory use

REMOTE> install.packages("glmnet", lib="/usr/local/lib/R/site-library")

These installations will persist from one remote session to another, and the “missing packages” warning at login will be updated correctly, although, strangely, intellisense for package names always refers to the local list of packages, so it will make suggestions that are unavailable on the remote.

Running batch R job on the server

Congratulations! Now you can run large R jobs on a VM in the cloud!

There are various ways to take advantage of the VM, in addition to running interactively at the REMOTE> prompt. A simple case is to use the remote server to run large, time-consuming jobs. For instance, this iteration, which computes a regression’s leave-one-out r-squared values—

rsqr <- c()
system.time(
  for (k in 1:nrow(mtcars)) {
    rsqr[k] <- summary(lm(mpg ~ . , data = mtcars[-k, ]))$r.squared
  })
print(summary(rsqr))

—can be run the same way remotely:

remoteExecute("rsqr <- c()\ system.time(\ for (k in 1:nrow(mtcars)) {\ rsqr[k] <- summary(lm(mpg ~ . , data=mtcars[-k,]))$r.squared\ })")

We’ll need to recall the results separately, since only the last value in the remote expression output is printed:

remoteExecute("summary(rsqr)")

For larger chunks of code, you can include them in script files and execute the file remotely by using mrsdeploy::remoteScript("myscript.R"), which is simply a wrapper around mrsdeploy::remoteExecute("myscript.R", script=TRUE), where myscript.R is found in your local working directory.
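For example (a sketch with a hypothetical script file myscript.R in the local working directory, assuming an active remote session):

library(mrsdeploy)
# Runs the local file myscript.R in the remote R session
remoteScript("myscript.R")
# ...which is equivalent to:
remoteExecute("myscript.R", script = TRUE)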

Note that the mrsdeploy library is not needed in the script running remotely. Indeed, the VM with preinstalled Microsoft R Server 2016 (version 9.0.1) for Linux (Ubuntu version 16.04) runs R version 3.2.3, which does not include the mrsdeploy library. So both library(mrsdeploy) and install.packages("mrsdeploy") will generate an error in the remote session. If you’ve included these statements to enable your local script, be sure to remove them if you execute the script remotely, or the script will fail! If you want to use the same script in both places, a simple workaround is to avoid making the library call in the script when it runs in the remote session:

if ( Sys.info()["user"] != "rserve2" ) { library(mrsdeploy) }

The ability of mrsdeploy to execute a script remotely is just the tip of the iceberg. It also enables moving files and variables back and forth between local and remote, and most importantly, configuring R functions as production web services. This set of deployment features merits another entire blog posting.
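To give a flavour of that deployment side, here is a rough sketch of the shape of a publishService() call, adapted from the mrsdeploy documentation; the service name and the scoring function below are hypothetical, and argument details may differ by version:

# Hypothetical function to expose as a web service
add_one <- function(x) {
  x + 1
}

# Publish it from an authenticated mrsdeploy session (a sketch, not the
# author's example): inputs/outputs declare the service's schema.
api <- publishService(
  "addOneService",                     # service name (hypothetical)
  code = add_one,                      # R function to run on each request
  inputs = list(x = "numeric"),        # input schema
  outputs = list(answer = "numeric"),  # output schema
  v = "v1.0.0"                         # service version
)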

For more information

For details about different configuration options see Configuring R Server Operationalization. The libraries required by the Operationalization instructions are already configured on the VM.

To see what you can do with a remote session, have a look here. And, for a general overview, see this.

Go to the R Server documentation for the full API reference.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

February 2017 New Package Picks

Wed, 03/22/2017 - 17:00

(This article was first published on RStudio, and kindly contributed to R-bloggers)

by Joseph Rickert

One hundred and forty-five new packages were added to CRAN in February. Here are 47 interesting packages organized into five categories: Biostatistics, Data, Data Science, Statistics and Utilities.

Biostatistics
  • BaTFLED3D v0.1.7: Implements a machine learning algorithm to make predictions and determine interactions in data that varies along three independent modes. It was developed to predict the growth of cell lines when treated with drugs at different doses. The vignette shows an example with simulated data.

  • DClusterm v0.1: Implements methods for the model-based detection of disease clusters. Look at the JSS paper for details.

Data

Data Science
  • autothresholdr v0.2.0: Is an R port of the ImageJ image processing program. The vignette shows how to use it.

  • dlib v1.0: Provides an Rcpp interface to dlib, the C++ toolkit containing machine learning algorithms and computer vision tools.

  • liquidSVM v1.0.1: Provides several functions to support a fast support vector machine implementation. There is a demo vignette and supplemental installation documentation.

  • OOBCurve v0.1: Provides a function to calculate the out-of-bag learning curve for random forest models built with the randomForest or ranger packages.

  • opusminer v0.1-0: Provides an interface to the OPUS Miner algorithm for finding the top-k, non-redundant itemsets from transaction data.

Statistics
  • BayesCombo v1.0: Implements Bayesian meta-analysis methods to combine diverse evidence from multiple studies. The vignette provides a detailed example.

  • BayesianTools v0.1.0: Implements various Metropolis MCMC variants (including adaptive and/or delayed rejection MH), the T-walk, differential evolution MCMCs, DREAM MCMCs, and a sequential Monte Carlo particle filter, along with diagnostic and plot functions. The vignette will get you started.

  • FRK v0.1.1: Implements the Fixed Rank Kriging methods presented by Cressie and Johannesson in their 2008 paper. An extended vignette explains the math and provides several examples.

  • glmmTMB v0.1.1: Provides functions to fit Generalized Linear Mixed Models using the Template Model Builder (TMB) package. There are vignettes for getting started, Covariance Structures, post-hoc MCMC, simulation, and troubleshooting.

  • IMIFA v1.1.0: Provides flexible Gibbs sampler functions for fitting Infinite Mixtures of Infinite Factor Analysers and related models, introduced by Murphy et al. in a 2017 paper. The vignette shows examples.

  • ImputeRobust v1.1-1: Provides new imputation methods for the mice package, based on generalized additive models for location, scale, and shape (GAMLSS) as described in de Jong, van Buuren and Spiess.

  • lmvar v1.0.0: Provides functions to run linear regressions in which both the expected value and variance can vary by observation. Vignettes provide an introduction and explain the details of the math.

  • metaviz v0.1.0: Implements the rainforest plots proposed by Schild & Voracek as a variant of the forest plots used for meta-analysis. The vignette describes their use.

  • prophet v0.1: Implements a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly and weekly seasonality, plus holidays. There is a Quick Start guide.

  • robustarima v0.2.5: Provides functions for fitting a linear regression model with ARIMA errors, using a filtered tau-estimate.

  • rpgm v0.1.3: Provides functions that use the Ziggurat Method to generate Normal random variates quickly.

  • sarima v0.4-3: Provides functions for simulating and predicting with seasonal ARIMA models. The vignette presents a use case.

  • sppmix v1.0.0.0: Implements classes and methods for modeling spatial point patterns using inhomogeneous Poisson point processes, where the intensity surface is assumed to be analogous to a finite additive mixture of normal components, and the number of components is a finite, fixed or random integer.


Look here for documentation.

  • walkr v0.3.4: Provides functions to sample from the interaction of a hyperplane and an N Simplex. The vignette describes the math and the algorithms involved.
Utilities

  • odbc v1.0.1: Uses the DBI interface to implement a connection to ODBC compatible databases.

  • sonify v0.1-0: Contains a function to transform univariate data, sampled at regular or irregular intervals, into a continuous sound with time-varying frequency. The function is intended as a substitute for R’s plot function to simplify data analysis for the visually impaired.

  • widgetframe v0.1.0: Provides two functions for working with htmlwidgets and iframes, which may be useful when working with WordPress or R Markdown. There is a vignette.

  • wrapr v0.1.1: Contains the debugging functions DebugFnW to capture function context on error for debugging, and let to convert non-standard evaluation interfaces to standard evaluation interfaces.

To leave a comment for the author, please follow the link and comment on their blog: RStudio. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

The Hitchhiker’s Guide to Ggplot2 in R

Wed, 03/22/2017 - 17:00

(This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers)

Published: 2016-11-30
Updated: 2017-03-23

“Any bleeder knows that books are never finished, only abandoned.”
Why Information Grows

About the book

You can find the book here.

This is a book that may look complete, but changes in R packages constantly demand changes to the examples it contains. This is why the electronic format is perfect for the purpose of this work. Trapping it inside a dead-tree book is ultimately a waste of time and resources, in my own view.

Aside from being my first book, this is also my first collaborative work. I wrote it in a 50-50 collaboration with Jodie Burchell. Jodie is an amazing data scientist. I highly recommend reading her blog Standard Error where you can find really good material on Reproducible Research and more.

This is a technical book. The scope of the book is to go straight to the point and the writing style is similar to a recipe with detailed instructions. It is assumed that you know the basics of R and that you want to learn how to create beautiful plots.

Each chapter will explain how to create a different type of plot, and will take you step-by-step from a basic plot to a highly customised graph. The chapters are ordered by degree of difficulty.

Every chapter is independent of the others. You can read the whole book or jump straight to a section of interest; either way, the instructions should be easy to follow and our examples easy to reproduce without reading the earlier chapters.

In total the book contains 237 pages (letter paper size) of recipes for producing polished, aesthetically pleasing plots. You can download the book for free (yes, really!) from Leanpub.

How the book started

Almost a year ago I finished writing the eleventh tutorial in a series on using ggplot2 that I created with Jodie Burchell.

I asked Jodie to co-author some blog entries after I found her blog and realised that it reflected my own interests in data science. The book grew out of those entries on our blogs.

A few weeks later those tutorials evolved into an ebook, mainly because what we had started to write was unexpectedly successful: we even got retweets from prominent people in the R community such as Hadley Wickham. The book was finally released on Leanpub.

We also included a pack containing the Rmd files we used to generate every chart displayed in the book.

Why Leanpub?

Leanpub is a platform where you can easily write your book using MS Word, among other writing tools, and it even has GitHub and Dropbox integration. We went with R Markdown and LaTeX output, which shows that Leanpub is both easy to use and flexible.

What is more, Leanpub lets readers download your books for free, if you allow it, or you can define a price range with a suggested price. The site pays authors a royalty of 90% minus 50 cents per sale, which compares favourably with other platforms. You can also sell your books together with additional exercises, video lessons, and so on.

For example, last year I updated all the examples in the book just a few days after ggplot2 2.2 was released, and my readers received a notification email as soon as I uploaded the new version. Readers who paid for the book, and those who did not, can download newer versions for free.

If that is not enough, Leanpub allows you to create bundles and sell your books as a set, or to charge a different price for your book plus additional material such as R Markdown notebooks, instructional videos, and more.

What I learned from my first book

At the moment I am teaching Data Visualization, and from my students I have learned that good visualizations come only after they have learned the underlying concepts. Coding clearly helps, but coding comes after the fundamentals.

It would be better to teach visualization fundamentals first rather than in parallel with coding, especially when part of your audience has never written code before.

I got a lot of feedback from my students last term, which was really helpful for improving the book and for dividing some steps into smaller pieces to make the Grammar of Graphics easier to understand.

There are some remarkable books that the interested reader may want to read before mine. I highly recommend:

Those are really good books that show the fundamentals of Data Visualisation and provide the key concepts and rules needed to communicate effectively with data.

To leave a comment for the author, please follow the link and comment on their blog: Pachá (Batteries Included). R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
