R news and tutorials contributed by hundreds of R bloggers
Updated: 1 day 15 hours ago

### Massively-parallel computations on Azure clusters with R, made easy with doAzureParallel

Wed, 03/29/2017 - 18:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by JS Tan (Program Manager, Microsoft)

For users of the R language, scaling up their work to take advantage of cloud-based computing has generally been a complex undertaking. We are therefore excited to announce doAzureParallel, a lightweight R package built on Azure Batch that allows you to easily use Azure’s flexible compute resources right from your R session. The doAzureParallel package complements Microsoft R Server and provides the infrastructure you need to run massively parallel simulations on Azure directly from R.

The doAzureParallel package is a parallel backend for the popular foreach package, making it possible to execute multiple processes across a cluster of Azure virtual machines with just a few lines of R code. The package helps you create and manage the cluster in Azure, and register it as a parallel backend to be used with foreach.

With doAzureParallel, there is no need to manually create, configure and manage a cluster of individual VMs. Running your scale jobs is as easy as running algorithms on your local machine. With Azure Batch’s autoscaling capabilities, you can also increase or decrease your cluster size to fit your workloads, saving you time and money. doAzureParallel also uses the Azure Data Science Virtual Machine (DSVM), allowing Azure Batch to easily and quickly configure the appropriate environment in as little time as possible.

doAzureParallel is ideal for running embarrassingly parallel work such as parametric sweeps or Monte Carlo simulations, making it a great fit for many financial modelling algorithms (back-testing, portfolio scenario modelling, etc).

The doAzureParallel is available for download on Github under the open-source MIT license, and there is no additional cost for these capabilities – you only pay for the Azure VMs you use.

Performance Gains with doAzureParallel

To illustrate the kind of speed up you can get with doAzureParallel, we did a test that compares the performance of computing the Mandelbrot set on a single machine of 2 cores, a 5-node clusters of 10 cores, a 10-node cluster of 20 cores, and finally a 20-node cluster of 40 cores.

Time in seconds for computing the Mandelbrot set

From the graph, we can see that using doAzureParallel on Azure can get you a speed up of about 2X with a cluster of 5 nodes, a speed up of about 3X with a cluster of 10 nodes, and a speed up of about 6X with a cluster of 20 nodes. From a cost perspective, running the 20-node cluster for about 50 seconds with 6X performance ended up costing only $0.03 when using a cluster of Standard_F2 Linux VMs. (This was for the WestUS region. Pricing varies by region and can change over time.) The doAzureParallel package is available now on Github, where you can also find detailed documentation. For more detailed information, including installation steps and demo code, check out the announcement at the Microsoft Azure blog linked below. To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Coordinatized Data: A Fluid Data Specification Wed, 03/29/2017 - 17:33 (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers) Authors: John Mount and Nina Zumel. Introduction It’s been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion to and from row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting). Boris Artzybasheff illustration Real trust and understanding of this concept doesn’t fully form until one realizes that rows and columns are inessential implementation details when reasoning about your data. Many algorithms are sensitive to how data is arranged in rows and columns, so there is a need to convert between representations. However, confusing representation with semantics slows down understanding. In this article we will try to separate representation from semantics. We will advocate for thinking in terms of coordinatized data, and demonstrate advanced data wrangling in R. Example Consider four data scientists who perform the same set of modeling tasks, but happen to record the data differently. In each case the data scientist was asked to test two decision tree regression models (a and b) on two test-sets (x and y) and record both the model quality on the test sets under two different metrics (AUC and pseudo R-squared). The two models differ in tree depth (in this case model a has depth 5, and model b has depth 3), which is also to be recorded. Data Scientist 1 Data scientist 1 is an experienced modeler, and records their data as follows: library("tibble") d1 <- frame_data( ~model, ~depth, ~testset, ~AUC, ~pR2, 'a', 5, 'x', 0.4, 0.2, 'a', 5, 'y', 0.6, 0.3, 'b', 3, 'x', 0.5, 0.25, 'b', 3, 'y', 0.5, 0.25 ) print(d1) ## # A tibble: 4 × 5 ## model depth testset AUC pR2 ## <chr> <dbl> <chr> <dbl> <dbl> ## 1 a 5 x 0.4 0.20 ## 2 a 5 y 0.6 0.30 ## 3 b 3 x 0.5 0.25 ## 4 b 3 y 0.5 0.25 Data Scientist 1 uses what is called a denormalized form. In this form each row contains all of the facts we want ready to go. If we were thinking about "column roles" (a concept we touched on briefly in Section A.3.5 "How to Think in SQL" of Practical Data Science with R, Zumel, Mount; Manning 2014), then we would say the columns model and testset are key columns (together they form a composite key that uniquely identifies rows), the depth column is derived (it is a function of model), and AUC and pR2 are payload columns (they contain data). Denormalized forms are the most ready for tasks that reason across columns, such as training or evaluating machine learning models. Data Scientist 2 Data Scientist 2 has data warehousing experience and records their data in a normal form: models2 <- frame_data( ~model, ~depth, 'a', 5, 'b', 3 ) d2 <- frame_data( ~model, ~testset, ~AUC, ~pR2, 'a', 'x', 0.4, 0.2, 'a', 'y', 0.6, 0.3, 'b', 'x', 0.5, 0.25, 'b', 'y', 0.5, 0.25 ) print(models2) ## # A tibble: 2 × 2 ## model depth ## <chr> <dbl> ## 1 a 5 ## 2 b 3 print(d2) ## # A tibble: 4 × 4 ## model testset AUC pR2 ## <chr> <chr> <dbl> <dbl> ## 1 a x 0.4 0.20 ## 2 a y 0.6 0.30 ## 3 b x 0.5 0.25 ## 4 b y 0.5 0.25 The idea is: since depth is a function of the model name, it should not be recorded as a column unless needed. In a normal form such as above, every item of data is written only one place. This means that we cannot have inconsistencies such as accidentally entering two different depths for a given model. In this example all our columns are either key or payload. Data Scientist 2 is not concerned about any difficulty that might arise by this format as they know they can convert to Data Scientist 1’s format by using a join command: library("dplyr", warn.conflicts= FALSE) d1_2 <- left_join(d2, models2, by='model') %>% select(model, depth, testset, AUC, pR2) %>% arrange(model, testset) print(d1_2) ## # A tibble: 4 × 5 ## model depth testset AUC pR2 ## <chr> <dbl> <chr> <dbl> <dbl> ## 1 a 5 x 0.4 0.20 ## 2 a 5 y 0.6 0.30 ## 3 b 3 x 0.5 0.25 ## 4 b 3 y 0.5 0.25 all.equal(d1, d1_2) ## [1] TRUE Relational data theory (the science of joins) is the basis of Structured Query Language (SQL) and a topic any data scientist must master. Data Scientist 3 Data Scientist 3 has a lot of field experience, and prefers an entity/attribute/value notation. They log each measurement as a separate row: d3 <- frame_data( ~model, ~depth, ~testset, ~measurement, ~value, 'a', 5, 'x', 'AUC', 0.4, 'a', 5, 'x', 'pR2', 0.2, 'a', 5, 'y', 'AUC', 0.6, 'a', 5, 'y', 'pR2', 0.3, 'b', 3, 'x', 'AUC', 0.5, 'b', 3, 'x', 'pR2', 0.25, 'b', 3, 'y', 'AUC', 0.5, 'b', 3, 'y', 'pR2', 0.25 ) print(d3) ## # A tibble: 8 × 5 ## model depth testset measurement value ## <chr> <dbl> <chr> <chr> <dbl> ## 1 a 5 x AUC 0.40 ## 2 a 5 x pR2 0.20 ## 3 a 5 y AUC 0.60 ## 4 a 5 y pR2 0.30 ## 5 b 3 x AUC 0.50 ## 6 b 3 x pR2 0.25 ## 7 b 3 y AUC 0.50 ## 8 b 3 y pR2 0.25 In this form model, testset, and measurement are key columns. depth is still running around as a derived column and the new value column holds the measurements (which could in principle have different types in different rows!). Data Scientist 3 is not worried about their form causing problems as they know how to convert into Data Scientist 1’s format with an R command: library("tidyr") d1_3 <- d3 %>% spread('measurement', 'value') %>% select(model, depth, testset, AUC, pR2) %>% # to guarantee column order arrange(model, testset) # to guarantee row order print(d1_3) ## # A tibble: 4 × 5 ## model depth testset AUC pR2 ## <chr> <dbl> <chr> <dbl> <dbl> ## 1 a 5 x 0.4 0.20 ## 2 a 5 y 0.6 0.30 ## 3 b 3 x 0.5 0.25 ## 4 b 3 y 0.5 0.25 all.equal(d1, d1_3) ## [1] TRUE You can read a bit on spread() here. We will use the term moveValuesToColumns() for this operation later. The spread() will be replaced with the following. moveValuesToColumns(data = d3, columnToTakeKeysFrom = 'measurement', columnToTakeValuesFrom = 'value', rowKeyColumns = c('model', 'testset')) The above operation is a bit exotic and it (and its inverse) already go under number of different names: • pivot / un-pivot (Microsoft Excel) • pivot / anti-pivot (databases) • crosstab / un-crosstab (databases) • unstack / stack (R) • cast / melt (reshape, reshape2) • spread / gather (tidyr) • "widen" / "narrow" (colloquial) • moveValuesToColumns() and moveValuesToRows() (this writeup) And we are certainly neglecting other namings of the concept. We find none of these particularly evocative (though cheatsheets help), so one purpose of this note will be to teach these concepts in terms of the deliberately verbose ad-hoc terms: moveValuesToColumns() and moveValuesToRows(). Note: often the data re-arrangement operation is only exposed as part of a larger aggregating or tabulating operation. Also moveValuesToColumns() is considered the harder transform direction (as it has to group rows to work), so it is often supplied in packages, whereas analysts often use ad-hoc methods for the simpler moveValuesToRows() operation (to be defined next). Data Scientist 4 Data Scientist 4 picks a form that makes models unique keys, and records the results as: d4 <- frame_data( ~model, ~depth, ~x_AUC, ~x_pR2, ~y_AUC, ~y_pR2, 'a', 5, 0.4, 0.2, 0.6, 0.3, 'b', 3, 0.5, 0.25, 0.5, 0.25 ) print(d4) ## # A tibble: 2 × 6 ## model depth x_AUC x_pR2 y_AUC y_pR2 ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 a 5 0.4 0.20 0.6 0.30 ## 2 b 3 0.5 0.25 0.5 0.25 This is not a problem as it is possible to convert to Data Scientist 3’s format. d3_4 <- d4 %>% gather('meas', 'value', x_AUC, y_AUC, x_pR2, y_pR2) %>% separate(meas, c('testset', 'measurement')) %>% select(model, depth, testset, measurement, value) %>% arrange(model, testset, measurement) print(d3_4) ## # A tibble: 8 × 5 ## model depth testset measurement value ## <chr> <dbl> <chr> <chr> <dbl> ## 1 a 5 x AUC 0.40 ## 2 a 5 x pR2 0.20 ## 3 a 5 y AUC 0.60 ## 4 a 5 y pR2 0.30 ## 5 b 3 x AUC 0.50 ## 6 b 3 x pR2 0.25 ## 7 b 3 y AUC 0.50 ## 8 b 3 y pR2 0.25 all.equal(d3, d3_4) ## [1] TRUE We will replace the gather operation with moveValuesToRows() and the call will look like the following. moveValuesToRows(data = d4, nameForNewKeyColumn = 'meas', nameForNewValueColumn = 'value', columnsToTakeFrom = c('x_AUC', 'y_AUC', 'x_pR2', 'y_pR2')) moveValuesToRows() is (under some restrictions) an inverse of moveValuesToColumns(). Although we implement moveValuesToRows() and moveValuesToColums() as thin wrappers of tidyr‘s gather and spread, we find the more verbose naming (and calling interface) more intuitive. So we encourage you to think directly in terms of moveValuesToRows() as moving values to different rows (in the same column), and moveValuesToColums() as moving values to different columns (in the same row). It will usually be apparent from your problem which of these operations you want to use. The Theory of Coordinatized Data When you are working with transformations you look for invariants to keep your bearings. All of the above data share an invariant property we call being coordinatized data. In this case the invariant is so strong that one can think of all of the above examples as being equivalent, and the row/column transformations as merely changes of frame of reference. Let’s define coordinatized data by working with our examples. In all the above examples a value carrying (or payload) cell or entry can be uniquely named as follows: c(Table=tableName, (KeyColumn=KeyValue)*, ValueColumn=ValueColumnName) The above notations are the coordinates of the data item (hence "coordinatized data"). For instance: the AUC of 0.6 is in a cell that is named as follows for each of our scientists as: • Data Scientist 1: c(Table='d1', model='a', testset='y', ValueColumn='AUC') • Data Scientist 2: c(Table='d2', model='a', testset='y', ValueColumn='AUC') • Data Scientist 3: c(Table='d3', model='a', testset='y', measurement='AUC', ValueColumn='value') • Data Scientist 4: c(Table='d4', model='a', ValueColumn= paste('y', 'AUC', sep= '_')) From our point of view these keys all name the same data item. The fact that we are interpreting one position as a table name and another as a column name is just convention. We can even write R code that uses these keys on all our scientists’ data without performing any reformatting: # take a map from names to scalar conditions and return a value. # inefficient method; notional only lookup <- function(key) { table <- get(key[['Table']]) col <- key[['ValueColumn']] conditions <- setdiff(names(key), c('Table', 'ValueColumn')) for(ci in conditions) { table <- table[table[[ci]]==key[[ci]], , drop= FALSE] } table[[col]][[1]] } k1 <- c(Table='d1', model='a', testset='y', ValueColumn='AUC') k2 <- c(Table='d2', model='a', testset='y', ValueColumn='AUC') k3 <- c(Table='d3', model='a', testset='y', measurement='AUC', ValueColumn='value') k4 = c(Table='d4', model='a', ValueColumn= paste('y', 'AUC', sep= '_')) print(lookup(k1)) ## [1] 0.6 print(lookup(k2)) ## [1] 0.6 print(lookup(k3)) ## [1] 0.6 print(lookup(k4)) ## [1] 0.6 The lookup() procedure was able to treat all these keys and key positions uniformly. This illustrates that what is in tables versus what is in rows versus what is in columns is just an implementation detail. Once we understand that all of these data scientists recorded the same data we should not be surprised we can convert between representations. The thing to remember: coordinatized data is in cells, and every cell has unique coordinates. We are going to use this invariant as our enforced precondition before any data transform, which will guarantee our data meets this invariant as a postcondition. I.e., if we restrict ourselves to coordinatized data and exclude wild data, the operations moveValuesToColumns() and moveValuesToRows() become well-behaved and much easier to comprehend. In particular, they are invertible. (In math terms, the operators moveValuesToColumns() and moveValuesToRows() form a groupoid acting on coordinatized data.) The 15 puzzle: another groupoid By "wild" data we mean data where cells don’t have unique lookup() addresses. This often happens in data that has repeated measurements. Wild data is simply tamed by adding additional keying columns (such as an arbitrary experiment repetition number). Hygienic data collection practice nearly always produces coordinatized data, or at least data that is easy to coordinatize. Our position is that your data should always be coordinatized; if it’s not, you shouldn’t be working with it yet. Rows and Columns Many students are initially surprised that row/column conversions are considered "easy." Thus, it is worth taking a little time to review moving data between rows and columns. Moving From Columns to Rows ("Thinifying data") Moving data from columns to rows (i.e., from Scientist 1 to Scientist 3) is easy to demonstrate and explain. The only thing hard about this operation is remembering the name of the operation ("gather()") and the arguments. We can remove this inessential difficulty by writing a helper function (to check our preconditions) and a verbose wrapper function (also available as a package from CRAN or Github): library("wrapr") checkColsFormUniqueKeys <- function(data, keyColNames, allowNAKeys = FALSE) { # check for NA keys if((!allowNAKeys) && (length(keyColNames)>0)) { allGood <- data %>% dplyr::select(dplyr::one_of(keyColNames)) %>% complete.cases() %>% all() if(!allGood) { stop("saw NA in keys") } } # count the number of rows ndata <- nrow(data) if(ndata<=1) { return(TRUE) } # count the number of rows identifiable by keys nunique <- min(1, ndata) if(length(keyColNames)>0) { nunique <- data %>% dplyr::select(dplyr::one_of(keyColNames)) %>% dplyr::distinct() %>% nrow() } # compare return(nunique==ndata) } moveValuesToRows <- function(data, nameForNewKeyColumn, nameForNewValueColumn, columnsToTakeFrom) { cn <- colnames(data) dcols <- setdiff(cn, columnsToTakeFrom) if(!checkColsFormUniqueKeys(dplyr::select(data, dplyr::one_of(dcols)), dcols)) { stop("moveValuesToRows: rows were not uniquely keyed") } # assume gather_ is going to be deprecated, as is happening # with dplyr methods : # https://github.com/hadley/dplyr/blob/da7fc6ecef1c6d329f014feb96c9c99d6cebc880/R/select-vars.R wrapr::let(c(NAMEFORNEWKEYCOLUMM= nameForNewKeyColumn, NAMEFORNEWVALUECOLUMN= nameForNewValueColumn), tidyr::gather(data, key= NAMEFORNEWKEYCOLUMM, value= NAMEFORNEWVALUECOLUMN, dplyr::one_of(columnsToTakeFrom)) ) } In this notation moving from Data Scientist 1’s records to Data Scientist 3’s looks like the following. d3from1 <- moveValuesToRows(data=d1, nameForNewKeyColumn= 'measurement', nameForNewValueColumn= 'value', columnsToTakeFrom = c('AUC', 'pR2')) %>% select(model, depth, testset, measurement, value) %>% arrange(model, testset, measurement) print(d3from1) ## # A tibble: 8 × 5 ## model depth testset measurement value ## <chr> <dbl> <chr> <chr> <dbl> ## 1 a 5 x AUC 0.40 ## 2 a 5 x pR2 0.20 ## 3 a 5 y AUC 0.60 ## 4 a 5 y pR2 0.30 ## 5 b 3 x AUC 0.50 ## 6 b 3 x pR2 0.25 ## 7 b 3 y AUC 0.50 ## 8 b 3 y pR2 0.25 all.equal(d3, d3from1) ## [1] TRUE In a moveValuesToRows() operation each row of the data frame is torn up and used to make many rows. Each of the columns we specify that we want measurements from gives us a new row from each of the original data rows. The pattern is more obvious if we process any rows of d1 independently: row <- d1[3,] print(row) ## # A tibble: 1 × 5 ## model depth testset AUC pR2 ## <chr> <dbl> <chr> <dbl> <dbl> ## 1 b 3 x 0.5 0.25 moveValuesToRows(data=row, nameForNewKeyColumn= 'measurement', nameForNewValueColumn= 'value', columnsToTakeFrom = c('AUC', 'pR2')) %>% select(model, depth, testset, measurement, value) %>% arrange(model, testset, measurement) ## # A tibble: 2 × 5 ## model depth testset measurement value ## <chr> <dbl> <chr> <chr> <dbl> ## 1 b 3 x AUC 0.50 ## 2 b 3 x pR2 0.25 Moving From Rows to Columns ("Widening data") Moving data from rows to columns (i.e., from Scientist 3 to Scientist 1) is a bit harder to explain, and usually not explained well. In moving from rows to columns we group a set of rows that go together (match on keys) and then combine them into one row by adding additional columns. Note: to move data from rows to columns we must know which set of rows go together. That means some set of columns is working as keys, even though this is not emphasized in the spread() calling interface or explanations. For invertible data transforms, we want a set of columns (rowKeyColumns) that define a composite key that uniquely identifies each row of the result. For this to be true, the rowKeyColumns plus the column we are taking value keys from must uniquely identify each row of the input. To make things easier to understand and remember, we introduce another wrapping function. moveValuesToColumns <- function(data, columnToTakeKeysFrom, columnToTakeValuesFrom, rowKeyColumns) { cn <- colnames(data) # we insist that the rowKeyColumns plus # columnToTakeKeysFrom are unique keys over the input rows if(!checkColsFormUniqueKeys(data, c(rowKeyColumns, columnToTakeKeysFrom))) { stop(paste0("\n moveValuesToColumns: specified", "\n rowKeyColumns plus columnToTakeKeysFrom", "\n isn't unique across rows")) } # we are also checking that other columns don't prevent us # from matching values dcols <- setdiff(colnames(data), c(columnToTakeKeysFrom, columnToTakeValuesFrom)) dsub <- data %>% dplyr::select(dplyr::one_of(dcols)) %>% dplyr::distinct() if(!checkColsFormUniqueKeys(dsub, rowKeyColumns)) { stop(paste0("\n some columns not in", "\n c(rowKeyColumns, columnToTakeKeysFrom, columnToTakeValuesFrom)", "\n are splitting up row groups")) } wrapr::let(c(COLUMNTOTAKEKEYSFROM= columnToTakeKeysFrom, COLUMNTOTAKEVALUESFROM= columnToTakeValuesFrom), tidyr::spread(data, key= COLUMNTOTAKEKEYSFROM, value= COLUMNTOTAKEVALUESFROM) ) } This lets us rework the example of moving from Data Scientist 3’s format to Data Scientist 1’s: d1from3 <- moveValuesToColumns(data= d3, columnToTakeKeysFrom= 'measurement', columnToTakeValuesFrom= 'value', rowKeyColumns= c('model', 'testset')) %>% select(model, depth, testset, AUC, pR2) %>% arrange(model, testset) print(d1from3) ## # A tibble: 4 × 5 ## model depth testset AUC pR2 ## <chr> <dbl> <chr> <dbl> <dbl> ## 1 a 5 x 0.4 0.20 ## 2 a 5 y 0.6 0.30 ## 3 b 3 x 0.5 0.25 ## 4 b 3 y 0.5 0.25 all.equal(d1, d1from3) ## [1] TRUE If the structure of our data doesn’t match our expected keying we can have problems. We emphasize that these problems arise from trying to work with non-coordinatized data, and not from the transforms themselves. Too little keying If our keys don’t contain enough information to match rows together we can have a problem. Suppose our testset record was damaged or not present and look how a direct call to spread works: d3damaged <- d3 d3damaged$testset <- 'z' print(d3damaged) ## # A tibble: 8 × 5 ## model depth testset measurement value ## <chr> <dbl> <chr> <chr> <dbl> ## 1 a 5 z AUC 0.40 ## 2 a 5 z pR2 0.20 ## 3 a 5 z AUC 0.60 ## 4 a 5 z pR2 0.30 ## 5 b 3 z AUC 0.50 ## 6 b 3 z pR2 0.25 ## 7 b 3 z AUC 0.50 ## 8 b 3 z pR2 0.25 spread(d3damaged, 'measurement', 'value') ## Error: Duplicate identifiers for rows (1, 3), (5, 7), (2, 4), (6, 8)

This happens because the precondition is not met: the columns (model, testset, measurement) don’t uniquely represent each row of the input. Catching the error is good, and we emphasize that in our wrapper.

moveValuesToColumns(data= d3damaged, columnToTakeKeysFrom= 'measurement', columnToTakeValuesFrom= 'value', rowKeyColumns= c('model', 'testset')) ## Error in moveValuesToColumns(data = d3damaged, columnToTakeKeysFrom = "measurement", : ## moveValuesToColumns: specified ## rowKeyColumns plus columnToTakeKeysFrom ## isn't unique across rows

The above issue is often fixed by adding additional columns (such as measurement number or time of measurement).

Too much keying

Columns can also contain too fine a key structure. For example, suppose our data was damaged and depth is no longer a function of the model id, but contains extra detail. In this case a direct call to spread produces a way too large result because the extra detail prevents it from matching rows.

d3damaged <- d3 d3damaged$depth <- seq_len(nrow(d3damaged)) print(d3damaged) ## # A tibble: 8 × 5 ## model depth testset measurement value ## <chr> <int> <chr> <chr> <dbl> ## 1 a 1 x AUC 0.40 ## 2 a 2 x pR2 0.20 ## 3 a 3 y AUC 0.60 ## 4 a 4 y pR2 0.30 ## 5 b 5 x AUC 0.50 ## 6 b 6 x pR2 0.25 ## 7 b 7 y AUC 0.50 ## 8 b 8 y pR2 0.25 spread(d3damaged, 'measurement', 'value') ## # A tibble: 8 × 5 ## model depth testset AUC pR2 ## * <chr> <int> <chr> <dbl> <dbl> ## 1 a 1 x 0.4 NA ## 2 a 2 x NA 0.20 ## 3 a 3 y 0.6 NA ## 4 a 4 y NA 0.30 ## 5 b 5 x 0.5 NA ## 6 b 6 x NA 0.25 ## 7 b 7 y 0.5 NA ## 8 b 8 y NA 0.25 The frame d3damaged does not match the user’s probable intent: that the columns (model, testset) should uniquely specify row groups, or in other words, they should uniquely identify each row of the result. In the above case we feel it is good to allow the user to declare intent (hence the extra rowKeyColumns argument) and throw an exception if the data is not structured how the user expects (instead of allowing this data to possibly ruin a longer analysis in some unnoticed manner). moveValuesToColumns(data= d3damaged, columnToTakeKeysFrom= 'measurement', columnToTakeValuesFrom= 'value', rowKeyColumns= c('model', 'testset')) ## Error in moveValuesToColumns(data = d3damaged, columnToTakeKeysFrom = "measurement", : ## some columns not in ## c(rowKeyColumns, columnToTakeKeysFrom, columnToTakeValuesFrom) ## are splitting up row groups The above issue is usually fixed by one of two solutions (which one is appropriate depends on the situation): 1. Stricter control (via dplyr::select()) of which columns are in the analysis. In our example, we would select all the columns of d3damaged except depth. 2. Aggregating or summing out the problematic columns. For example if the problematic column in our example were runtime, which could legitimately vary for the same model and dataset, we could use dplyr::group_by/summarize to create a data frame with columns (model, testset, mean_runtime, measurement, value), so that (model, testset) does uniquely specify row groups. Conclusion The concept to remember is: organize your records so data cells have unique consistent abstract coordinates. For coordinatized data the actual arrangement of data into tables, rows, and columns is an implementation detail or optimization that does not significantly change what the data means. For coordinatized data different layouts of rows and columns are demonstrably equivalent. We document and maintain this equivalence by asking the analyst to describe their presumed keying structure to our methods, which then use this documentation to infer intent and check preconditions on the transforms. It pays to think fluidly in terms of coordinatized data and delay any format conversions until you actually need them. You will eventually need transforms as most data processing steps have a preferred format. For example, machine learning training usually requires a denormalized form. We feel the methods moveValuesToRows() and moveValuesToColumns() are easier to learn and remember than abstract terms such as "stack/unstack", "melt/cast", or "gather/spread" and thus are a good way to teach. Perhaps they are even a good way to document (and confirm) your intent in your own projects. To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### #1: Easy Package Registration Wed, 03/29/2017 - 13:52 (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers) Welcome to the first actual post in the R4 series, following the short announcement earlier this week. Context Last month, Brian Ripley announced on r-devel that registration of routines would now be tested for by R CMD check in r-devel (which by next month will become R 3.4.0). A NOTE will be issued now, this will presumably turn into a WARNING at some point. Writing R Extensions has an updated introduction) of the topic. Package registration has long been available, and applies to all native (i.e. "compiled") function via the .C(), .Call(), .Fortran() or .External() interfaces. If you use any of those — and .Call() may be the only truly relevant one here — then is of interest to you. Brian Ripley and Kurt Hornik also added a new helper function: tools::package_native_routine_registration_skeleton(). It parses the R code of your package and collects all native function entrypoints in order to autogenerate the registration. It is available in R-devel now, will be in R 3.4.0 and makes adding such registration truly trivial. But as of today, it requires that you have R-devel. Once R 3.4.0 is out, you can call the helper directly. As for R-devel, there have always been at least two ways to use it: build it locally (which we may cover in another R4 installment), or using Docker. Here will focus on the latter by relying on the Rocker project by Carl and myself. Use the Rocker drd Image We assume you can run Docker on your system. How to add it on Windows, macOS or Linux is beyond our scope here today, but also covered extensively elsewhere. So we assume you can execute docker and e.g. bring in the ‘daily r-devel’ image drd from our Rocker project via ~$ docker pull rocker/drd

With that, we can use R-devel to create the registration file very easily in a single call (which is a long command-line we have broken up with one line-break for the display below).

The following is real-life example when I needed to add registration to the RcppTOML package for this week’s update:

~/git/rcpptoml(master)$docker run --rm -ti -v$(pwd):/mnt rocker/drd \ ## line break RD --slave -e 'tools::package_native_routine_registration_skeleton("/mnt")' #include <R.h> #include <Rinternals.h> #include <stdlib.h> // for NULL #include <R_ext/Rdynload.h> /* FIXME: Check these declarations against the C/Fortran source code. */ /* .Call calls */ extern SEXP RcppTOML_tomlparseImpl(SEXP, SEXP, SEXP); static const R_CallMethodDef CallEntries[] = { {"RcppTOML_tomlparseImpl", (DL_FUNC) &RcppTOML_tomlparseImpl, 3}, {NULL, NULL, 0} }; void R_init_RcppTOML(DllInfo *dll) { R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); R_useDynamicSymbols(dll, FALSE); } edd@max:~/git/rcpptoml(master)$Decompose the Command We can understand the docker command invocation above through its components: • docker run is the basic call to a container • --rm -ti does subsequent cleanup (--rm) and gives a terminal (-t) that is interactive (-i) • -v$(pwd):/mnt uses the -v a:b options to make local directory a available as b in the container; here $(pwd) calls print working directory to get the local directory which is now mapped to /mnt in the container • rocker/drd invokes the ‘drd’ container of the Rocker project • RD is a shorthand to the R-devel binary inside the container, and the main reason we use this container • -e 'tools::package_native_routine_registration_skeleton("/mnt") calls the helper function of R (currently in R-devel only, so we use RD) to compute a possible init.c file based on the current directory — which is /mnt inside the container That it. We get a call to the R function executed inside the Docker container, examining the package in the working directory and creating a registration file for it which is display to the console. Copy the Output to src/init.c We simply copy the output to a file src/init.c; I often fold one opening brace one line up. Update NAMESPACE We also change one line in NAMESPACE from (for this package) useDynLib("RcppTOML") to useDynLib("RcppTOML", .registration=TRUE). Adjust accordingly for other package names. That’s it! And with that we a have a package which no longer provokes the NOTE as seen by the checks page. Calls to native routines are now safer (less of a chance for name clashing), get called quicker as we skip the symbol search (see the WRE discussion) and best of all this applies to all native routines whether written by hand or written via a generator such as Rcpp Attributes. So give this a try to get your package up-to-date. This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings. To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Burtin’s Antibiotics visualization in Plotly and R Wed, 03/29/2017 - 12:35 (This article was first published on R – Modern Data, and kindly contributed to R-bloggers) In this post, we’ll try to re-create Burtin’s antibiotics visualization using Plotly. The post follows Mike Bostock’s original re-creation in Protovis. See here. library(plotly) library(reshape2) library(dplyr) # Data df <- read.csv("https://cdn.rawgit.com/plotly/datasets/5360f5cd/Antibiotics.csv", stringsAsFactors = F) N <- nrow(df) # Melting for easier use later on df$Seq <- 1:N df <- melt(df, id.vars = c("Bacteria", "Gram", "Seq")) %>% arrange(Seq) # Angle generation theta.start <- (5/2)*pi - pi/10 theta.end <- pi/2 + pi/10 theta.range <- seq(theta.start, theta.end, length.out = N * 4 - 1) dtheta <- diff(theta.range)[1] # Fine adjustment for larger ticks (black) inc <- 0.04 # Angles for larger ticks big_ticks <- theta.range[1:length(theta.range) %% 4 == 0] big_ticks <- c(theta.range[1] - dtheta, big_ticks, theta.range[length(theta.range)] + dtheta) # Angles for smaller ticks small_ticks <- theta.range[1:length(theta.range) %% 4 != 0] # Set inner and outer radii inner.radius <- 0.3 outer.radius <- 1 # Set colors cols <- c("#0a3264","#c84632","#000000") pos_col <- "rgba(174, 174, 184, .8)" neg_col <- "rgba(230, 130, 110, .8)" # Function to calculate radius given minimum inhibitory concentration # Scaling function is sqrt(log(x)) radiusFUNC <- function(x){ min <- sqrt(log(0.001 * 1e4)) max <- sqrt(log(1000 * 1e4)) a <- (outer.radius - inner.radius)/(min - max) b <- inner.radius - a * max rad <- a * sqrt(log(x * 1e4)) + b return(rad) } SEQ <- function(start, by, length){ vec <- c() vec[1] <- start for(i in 2:length){ vec[i] <- vec[i-1] + by } return(vec) } # Generate x and y coordinates for large ticks radial_gridlines <- data.frame(theta = big_ticks, x = (inner.radius - inc) * cos(big_ticks), y = (inner.radius - inc) * sin(big_ticks), xend = (outer.radius + inc) * cos(big_ticks), yend = (outer.radius + inc) * sin(big_ticks)) # Generate x and y coordinates for antibiotics antibiotics <- df antibiotics$x <- inner.radius * cos(small_ticks) antibiotics$y <- inner.radius * sin(small_ticks) antibiotics$xend <- radiusFUNC(df$value) * cos(small_ticks) antibiotics$yend <- radiusFUNC(df$value) * sin(small_ticks) antibiotics$text <- with(antibiotics, paste("<b>Bacteria:</b>", Bacteria, "<br>", "<b>Antibiotic:</b>", variable, "<br>", "<b>Min. conc:</b>", value, "<br>")) # Generate x and y coordinates for white circles (grid) rad <- c(100, 10, 1, 0.1, 0.01, 0.001) rad <- c(inner.radius, radiusFUNC(rad)) theta <- seq(0, 2 * pi, length.out = 100) circles <- lapply(rad, function(x){ x.coord <- x * cos(theta) y.coord <- x * sin(theta) return(data.frame(label = which(rad == x), x = x.coord, y = y.coord)) }) circles <- do.call(rbind, lapply(circles, data.frame)) # Generate gram-negative polygon theta <- seq(big_ticks[1], big_ticks[10], length.out = 100) gram.neg <- data.frame(theta = theta, x = inner.radius * cos(theta), y = inner.radius * sin(theta)) theta <- rev(theta) gram.neg <- rbind(gram.neg, data.frame(theta = theta, x = outer.radius * cos(theta), y = outer.radius * sin(theta))) # Generate gram-positive polygon theta <- seq(big_ticks[10], big_ticks[length(big_ticks)], length.out = 100) gram.pos <- data.frame(theta = theta, x = inner.radius * cos(theta), y = inner.radius * sin(theta)) theta <- rev(theta) gram.pos <- rbind(gram.pos, data.frame(theta = theta, x = outer.radius * cos(theta), y = outer.radius * sin(theta))) # Text annotations - bacteria name bacteria.df <- data.frame(bacteria = unique(antibiotics$Bacteria), theta = big_ticks[-length(big_ticks)] - 0.17, stringsAsFactors = F) bacteria.df$x <- (outer.radius + inc) * cos(bacteria.df$theta) bacteria.df$y <- (outer.radius + inc) * sin(bacteria.df$theta) bacteria.df$textangle <- SEQ(-70, by = 21, length = nrow(bacteria.df)) bacteria.df$textangle[9:nrow(bacteria.df)] <- (bacteria.df$textangle[9:nrow(bacteria.df)] - 90) - 90 bacteria.df <- lapply(1:nrow(bacteria.df), function(x){ list(x = outer.radius*cos(bacteria.df$theta[x]) , y = outer.radius*sin(bacteria.df$theta[x]), xref = "plot", yref = "plot", xanchor = "center", yanchor = "middle", text = bacteria.df$bacteria[x], showarrow = F, font = list(size = 12, family = "arial"), textangle = bacteria.df$textangle[x]) }) # Title bacteria.df[[17]] <- list(x = 0, y = 1, xref = "paper", yref = "paper", xanchor = "left", yanchor = "top", text = "<b>Burtin’s Antibiotics</b>", showarrow = F, font = list(size = 30, family = "serif")) # Text annotations - scale scale.annotate <- data.frame(x = 0, y = c(inner.radius, radiusFUNC(c(100, 10, 1, 0.1, 0.01, 0.001))), scale = c("", "100", "10", "1", "0.1", "0.01","0.001")) # Plot p <- plot_ly(x = ~x, y = ~y, xend = ~xend, yend = ~yend, hoverinfo = "text", height = 900, width = 800) %>% # Gram negative sector add_polygons(data = gram.neg, x = ~x, y = ~y, line = list(color = "transparent"), fillcolor = neg_col, inherit = F, hoverinfo = "none") %>% # Gram positive sector add_polygons(data = gram.pos, x = ~x, y = ~y, line = list(color = "transparent"), fillcolor = pos_col, inherit = F, hoverinfo = "none") %>% # Antibiotics add_segments(data = antibiotics %>% filter(variable == "Penicillin"), text = ~text, line = list(color = cols[1], width = 7)) %>% add_segments(data = antibiotics %>% filter(variable == "Streptomycin"), text = ~text, line = list(color = cols[2], width = 7)) %>% add_segments(data = antibiotics %>% filter(variable == "Neomycin"), text = ~text, line = list(color = cols[3], width = 7)) %>% # Black large ticks add_segments(data = radial_gridlines, line = list(color = "black", width = 2), hoverinfo = "none") %>% # White circles add_polygons(data = circles, x = ~x, y = ~y, group_by = ~label, line = list(color = "#eeeeee", width = 1), fillcolor = "transparent", inherit = F, hoverinfo = "none") %>% # Scale labels add_text(data = scale.annotate, x = ~x, y = ~y, text = ~scale, inherit = F, textfont = list(size = 10, color = "black"), hoverinfo = "none") %>% # Gram labels add_markers(x = c(-0.05, -0.05), y = c(-1.4, -1.5), marker = list(color = c(neg_col, pos_col), size = 15), inherit = F, hoverinfo = "none") %>% add_text(x = c(0.15, 0.15), y = c(-1.4, -1.5), text = c("Gram Negative", "Gram Positive"), textfont = list(size = 10, color = "black"), inherit = F, hoverinfo = "none") %>% # Antibiotic legend add_markers(x = c(-0.15, -0.15, -0.15), y = c(0.1, 0, -0.1), marker = list(color = c(cols), size = 15, symbol = "square"), inherit = F, hoverinfo = "none") %>% add_text(x = c(0.05, 0.05, 0.05), y = c(0.1, 0, -0.1), text = c("Penicillin", "Streptomycin", "Neomycin"), textfont = list(size = 10, color = "black"), inherit = F, hoverinfo = "none") %>% layout(showlegend = F, xaxis = list(showgrid = F, zeroline = F, showticklabels = F, title = "", domain = c(0.05, 0.95)), yaxis = list(showgrid = F, zeroline = F, showticklabels = F, title = "", domain = c(0.05, 0.95)), plot_bgcolor = "rgb(240, 225, 210)", paper_bgcolor = "rgb(240, 225, 210)", # Annotations annotations = bacteria.df) p To leave a comment for the author, please follow the link and comment on their blog: R – Modern Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Make ggplot2 purrr Wed, 03/29/2017 - 06:45 (This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers) I’ll be honest: the title is a bit misleading. I will not use purrr that much in this blog post. Actually, I will use one single purrr function, at the very end. I use dplyr much more. However Make ggplot2 purrr sounds better than Make ggplot dplyr or whatever the verb for dplyr would be. Also, this blog post was inspired by a stackoverflow question and in particular one of the answers. So I don’t bring anything new to the table, but I found this stackoverflow answer so useful and so underrated (only 16 upvotes as I’m writing this!) that I wanted to write something about it. Basically the idea of this blog post is to show how to create graphs using ggplot2, but by grouping by a factor variable beforehand. To illustrate this idea, let’s use the data from the Penn World Tables 9.0. The easiest way to get this data is to install the package called pwt9 with: install.packages("pwt9") and then load the data with: data("pwt9.0") Now, let’s load the needed packages. I am also using ggthemes which makes themeing your ggplots very easy. I’ll be making Tufte-style plots. library(ggplot2) library(ggthemes) library(dplyr) library(purrr) library(pwt9) First let’s select a list of countries: country_list <- c("France", "Germany", "United States of America", "Luxembourg", "Switzerland", "Greece") small_pwt <- pwt9.0 %>% filter(country %in% country_list) Let’s us also order the countries in the data frame as I have written them in country_list: small_pwt <- small_pwt %>% mutate(country = factor(country, levels = country_list, ordered = TRUE)) You might be wondering why this is important. At the end of the article, we are going to save the plots to disk. If we do not re-order the countries inside the data frame as in country_list, the name of the files will not correspond to the correct plots! Now when you want to plot the same variable by countries, say avh (Average annual hours worked by persons engaged), the usual way to do this is with one of facet_wrap() or facet_grid(): ggplot(data = small_pwt) + theme_tufte() + geom_line(aes(y = avh, x = year)) + facet_wrap(~country) ggplot(data = small_pwt) + theme_tufte() + geom_line(aes(y = avh, x = year)) + facet_grid(country~.) As you can see, for this particular example, facet_grid() is not very useful, but do notice its argument, country~., which is different from facet_wrap()’s argument. This way, I get the graphs stacked horizontally. If I had used facet_grid(~country) the graphs would be side by side and completely unreadable. Now, let’s go to the meat of this post: what if you would like to have one single graph for each country? You’d probably think of using dplyr::group_by() to form the groups and then the graphs. This is the way to go, but you also have to use dplyr::do(). This is because as far as I understand, ggplot2 is not dplyr-aware, and using an arbitrary function with groups is only possible with dplyr::do(). plots <- small_pwt %>% group_by(country) %>% do(plot = ggplot(data = .) + theme_tufte() + geom_line(aes(y = avh, x = year)) + ggtitle(unique(.$country)) + ylab("Year") + xlab("Average annual hours worked by persons engaged"))

If you know dplyr at least a little bit, the above lines should be easy for you to understand. But notice how we get the title of the graphs, with ggtitle(unique(.$country)), which was actually the point of the stackoverflow question. What might be surprising though, is the object that is created by this code. Let’s take a look at plots: print(plots) ## Source: local data frame [6 x 2] ## Groups: <by row> ## ## # A tibble: 6 × 2 ## country plot ## * <ord> <list> ## 1 France <S3: gg> ## 2 Germany <S3: gg> ## 3 United States of America <S3: gg> ## 4 Luxembourg <S3: gg> ## 5 Switzerland <S3: gg> ## 6 Greece <S3: gg> As dplyr::do()’s documentation tells us, the return values get stored inside a list. And this is exactly what we get back; a list of plots! Lists are a very flexible and useful class, and you cannot spell list without purrr (at least not when you’re a neRd). Here are the final lines that use purrr::map2() to save all these plots at once inside your working directory: file_names <- paste0(country_list, ".pdf") map2(file_names, plots$plot, ggsave)

As I said before, if you do not re-order the countries inside the data frame, the names of the files and the plots will not match. Try running all the code without re-ordering, you’ll see!

I hope you found this post useful. You can follow me on twitter for blog updates.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### UK government using R to modernize reporting of official statistics

Wed, 03/29/2017 - 02:34

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Like all governments, the UK government is responsible for producing reports of official statistics on an ongoing basis. That process has traditionally been a highly manual one: extract data from government systems, load it into a mainframe statistical analysis tool and run models and forecasts, extract the results to a spreadsheet to prepare data for presentation, and ultimately combine it all in a manual document editing tool to produce the final report. The process in the UK looks much like this today:

Matt Upson, a Data Scientist at the UK Government Digital Service, is looking to modernize this process with a reproducible analytical pipeline. This new process, based on the UK Government's Technology Service Manual for new IT deployments, aims to simplify the process by using R — the open-source programming language for statistical analysis — to automate the data extraction, analysis, and document generation tasks.

The development of the new process is underway now, with the first target being a report on culture and sporting impacts on the UK economy:

Towards the end of 2016 we embarked on a project with a team in the Department for Culture, Media, and Sport (DCMS) who are responsible for the production of the Economic Estimates for DCMS Sectors Statistical First Release (SFR). Currently this publication is produced with a mix of manual and semi-manual processes. Our aim was to see if we could speed up production of the SFR, whilst maintaining the high standard of the publication and QA.

Central to the process is automating the report with Rmarkdown, a system for programmatically laying out a document while encapsulating all of the R code for data preparation, analysis, tabulation and charting into a collaborative document. As a comprehensive language, R has the functionality to handle all of these tasks natively, without the need to bring other systems into the chain. And it's an ideal tool for the statistical data analysis required for these reports, which must follow the guidance of the Aqua Book, the UK government's manual for producing quality analysis. (The Aqua Book is well worth reading for any data scientist, with excellent guidance on designing studies in the face of uncertain data.)

Naturally, the government has specific standards for the presentation of data, and the R-based process needs to be able to reproduce that look in the final report. The GDS team has produced a package for R, called govstyle, which standardizes the look of data charts while upgrading them to more modern design principles. For example this chart from a report on truancy in UK schools:

has been upgraded and reproduced in R to look like this:

An automated workflow allows for the inclusion of modern devops processes, too. Dependency management for R packages is handled with packrat. Test-driven development comes courtesy of the testthat package, and automated testing (including data verification and code coverage analysis) is provided by Travis CI. And source code control and collaboration is provided by Github, which is also where the entire process is documented as R package: the eesectors package.

For more on the UK government's use of R for official statistics reporting, see the post below from the UK Government Digital Service.

Data at GDS: Reproducible Analytical Pipeline

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### shinyHeatmaply – a shiny app for creating interactive cluster heatmaps

Wed, 03/29/2017 - 02:23

(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers)

My friend Jonathan Sidi and I (Tal Galili) are pleased to announce the release of shinyHeatmaply (0.1.0): a new Shiny application (and Shiny gadget) for creating interactive cluster heatmaps. shinyHeatmaply is based on the heatmaply R package which strives to make it easy as possible to create interactive cluster heatmaps.

The app introduces a functionality that saves to disk a self contained copy of the htmlwidget as an html file with your data and specifications you set from the UI, so it can be embedded in webpages, blogposts and online web appendices for academic publications.

You can see some of shinyHeatmaply‘s capabilities in the following 40 seconds video:

Installing shinyHeatmaply

From CRAN:

install_packages('shinyHeatmaply')

From github:

devtools::install_github('yonicd/shinyHeatmaply') Running the app/gadget

The application has an import interface as part of the application which currently supports csv, txt, tab, xls, xlsx, rd, rda. You can start the app using:

library(shiny) library(heatmaply) # If you didn't get shinyHeatmaply yet, you can run it through github: # runGitHub("yonicd/shinyHeatmaply",subdir = 'inst/shinyapp') # or just use your locally installed package: library(shinyHeatmaply) runApp(system.file("shinyapp", package = "shinyHeatmaply"))

The gadget is called from the R console and accepts input arguments. The object defined as the input to the shinyHeatmaply gadget is a data.frame or a list of data.frames. You can start it using the following code:

library(shinyHeatmaply)   #single data.frame data(mtcars) launch_heatmaply(mtcars)   #list data(iris) launch_heatmaply(list('Example1'=mtcars,'Example2'=iris))

You can see an example of a saved shinyHeatmaply output here. Or view the following iframe:

We would love to get your feedback!

For issue reports or feature requests, please visit the GitHub repo.

Post post credit: shinyHeatmaply was made thanks to the dedication of Jonathan Sidi, and based on recent features added to heatmaply by Alan O’Callaghan. I am very grateful to them both. This could also not be made possible by the amazing work of the RStudio’s team on Shiny applications, and of Carson Sievert on plotly. And lastly, to my adviser Yoav Benjamini for his support and advices.

To leave a comment for the author, please follow the link and comment on their blog: R – R-statistics blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Gradient Descent

Wed, 03/29/2017 - 02:00

(This article was first published on --Jean Arreola--, and kindly contributed to R-bloggers)

Trying gradient descent for linear regression

The best way to learn an algorith is to code it. So here it is, my take on Gradient Descent Algorithm for simple linear regression.

First, we fit a simple linear model with lm for comparison with gradient descent values.

#Load libraries library(dplyr) library(highcharter) #Scaling length variables from iris dataset. iris_demo <- iris[,c("Sepal.Length","Petal.Length")] %>% mutate(sepal_length = as.numeric(scale(Sepal.Length)), petal_length = as.numeric(scale(Petal.Length))) %>% select(sepal_length,petal_length) #Fit a simple linear model to compare coefficients. regression <- lm(iris_demo$petal_length~iris_demo$sepal_length) coef(regression) ## (Intercept) iris_demo$sepal_length ## 4.643867e-16 8.717538e-01 iris_demo_reg <- iris_demo iris_demo_reg$reg <- predict(regression,iris_demo) #Plot the model with highcharter highchart() %>% hc_add_series(data = iris_demo_reg, type = "scatter", hcaes(x = sepal_length, y = petal_length), name = "Sepal Length VS Petal Length") %>% hc_add_series(data = iris_demo_reg, type = "line", hcaes(x = sepal_length, y = reg), name = "Linear Regression") %>% hc_title(text = "Linear Regression")

open

We will try to acomplish the same coefficients, this time using Gradient Descent.

library(tidyr) set.seed(135) #To reproduce results #Auxiliary function # y = mx + b reg <- function(m,b,x) return(m * x + b) #Starting point b <- runif(1) m <- runif(1) #Gradient descent function gradient_desc <- function(b, m, data, learning_rate = 0.01){ # Small steps # Column names = Code easier to understand colnames(data) <- c("x","y") #Values for first iteration b_iter <- 0 m_iter <- 0 n <- nrow(data) # Compute the gradient for Mean Squared Error function for(i in 1:n){ # Partial derivative for b b_iter <- b_iter + (-2/n) * (data$y[i] - ((m * data$x[i]) + b)) # Partial derivative for m m_iter <- m_iter + (-2/n) * data$x[i] * (data$y[i] - ((m * data$x[i]) + b)) } # Move to the OPPOSITE direction of the derivative new_b <- b - (learning_rate * b_iter) new_m <- m - (learning_rate * m_iter) # Replace values and return new <- list(new_b,new_m) return(new) } # I need to store some values to make the motion plot vect_m <- m vect_b <- b # Iterate to obtain better parameters for(i in 1:1000){ if(i %in% c(1,100,250,500)){ # I keep some values in the iteration for the plot vect_m <- c(vect_m,m) vect_b <- c(vect_b,b) } x <- gradient_desc(b,m,iris_demo) b <- x[[1]] m <- x[[2]] } print(paste0("m = ", m)) ## [1] "m = 0.871753774273602" print(paste0("b = ", b)) ## [1] "b = 5.52239677041512e-10" The difference in the coefficients is minimal. We can see how the iterations work in the next plot: #Compute new values iris_demo$preit <- reg(vect_m[1],vect_b[1],iris_demo$sepal_length) iris_demo$it1 <- reg(vect_m[2],vect_b[2],iris_demo$sepal_length) iris_demo$it100 <- reg(vect_m[3],vect_b[3],iris_demo$sepal_length) iris_demo$it250 <- reg(vect_m[4],vect_b[4],iris_demo$sepal_length) iris_demo$it500 <- reg(vect_m[5],vect_b[5],iris_demo$sepal_length) iris_demo$finalit <- reg(m,b,iris_demo$sepal_length) iris_gathered <- iris_demo %>% gather(key = gr, value = val, preit:finalit) %>% select(-petal_length) %>% distinct() iris_start <- iris_gathered %>% filter(gr == "preit") iris_seq <- iris_gathered %>% group_by(sepal_length) %>% do(sequence = list_parse(select(., y = val))) iris_data <- left_join(iris_start, iris_seq) #Motion Plot irhc2 <- highchart() %>% hc_add_series(data = iris_data, type = "line", hcaes(x = sepal_length, y = val), name = "Gradient Descent") %>% hc_motion(enabled = TRUE, series = 0, startIndex = 0, labels = c("Iteration 1","Iteration 100","Iteration 250","Iteration 500","Final Iteration")) %>% hc_add_series(data = iris_demo_reg, type = "scatter", hcaes(x = sepal_length, y = petal_length), name = "Sepal Length VS Petal Length") %>% hc_title(text = "Gradient Descent Iterations") irhc2 open Maybe in a future post we can try a multivariate regression model! To leave a comment for the author, please follow the link and comment on their blog: --Jean Arreola--. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### A fast method to add annotations to a plot Wed, 03/29/2017 - 02:00 (This article was first published on Bluecology blog, and kindly contributed to R-bloggers) A fast method to add annotations to a plot Making professional looking plots in R can be fiddly. One task that I often spend ages doing is manually finding coordinates to add labels. Wouldn’t it be nice if you could just send the coordinates directly to your text editor? I did some searching and found on stackoverflow that you can send R objects to the clipboard. So here is my solution using that trick. Adding text to the right position on a plot can be a real hassle. Here I show how to use a simple function to click on a figurea and put coordinates onto your clipboard. You can get R to send data directly to the clipboard using the pipe command. Below is a little function I wrote that takes coordinates from locator() and sends them to your clipboard. Then you can just hit cmd-v to paste them into your text editor (nb I understand this may need some slight modifications to work on linux or windows, I use OSX): loccopy <- function(n, digits = 2){ data <- locator(n) data <- round(cbind(data$x, data$y), digits) clip <- pipe("pbcopy", "w") write.table(data, file = clip, col.names = F, row.names = F) close(clip) } Let’s test it out: set.seed(42) plot(runif(100)) loccopy(1) Now hit cmd-v (or equivalent on your OS). 69.23 0.84 Let’s add a label using our fast method for coordinates: text(69.23, 0.84, "Unusual data point", pos =4, offset = 0) The pos=4 and offset=0 ensures that the text goes directly to the right of our coordinates. That’s it. Hope it helps speed up your workflow. To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Understanding Linear SVM with R Wed, 03/29/2017 - 01:01 (This article was first published on DataScience+, and kindly contributed to R-bloggers) Linear Support Vector Machine or linear-SVM(as it is often abbreviated), is a supervised classifier, generally used in bi-classification problem, that is the problem setting, where there are two classes. Of course it can be extended to multi-class problem. In this work, we will take a mathematical understanding of linear SVM along with R code to understand the critical components of SVM. Taking the liberty to assume that some readers are new to machine learning, let us first find out what the supervised classifier is supposed to do. The term supervised comes from the concept of supervised learning. In case of supervised learning, you provide some examples along with their labels to your classifier and the classifier then tries to learn important features from the provided examples. For example, consider the following data set. You, will provide a part of this data to your linear SVM and tune the parameters such that your SVM can can act as a discriminatory function separating the ham messages from the spam messages. So, let us add the following R-code to our task. sms_data<-read.csv("sms_spam.csv",stringsAsFactors = FALSE) head(sms_data) type 1 ham 2 ham 3 spam 4 ham 5 ham 6 spam text 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... 2 Ok lar... Joking wif u oni... 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's 4 U dun say so early hor... U c already then say... 5 Nah I don't think he goes to usf, he lives around here though 6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, Â£1.50 to rcv What Parameters should be tuned Linear SVM tries to find a separating hyper-plane between two classes with maximum gap in-between. A hyper-plane in $$d$$- dimension is a set of points $$x \in \mathbb R^{d}$$ satisfying the equation $$w^{T}x+b=0$$ Let us denote $$h(x)=w^{T}(x)+b$$ Here $$w$$ is a $$d$$-dimensional weight vector while $$b$$ is a scalar denoting the bias. Both $$w$$ and $$b$$ are unknown to you. You will try to find and tune both $$w$$ and $$b$$ such that $$h(x)$$ can separate the spams from hams as accurately as possible. I am drawing an arbitrary line to demonstrate a separating hyper-plane using the following code. plot(c(-1,10), c(-1,10), type = "n", xlab = "x", ylab = "y", asp = 1) abline(a =12 , b = -5, col=3) text(1,5, "h(x)=0", col = 2, adj = c(-.1, -.1)) text(2,3, "h(x)>0,Ham", col = 2, adj = c(-.1, -.1)) text(0,3, "h(x)<0,Spam", col = 2, adj = c(-.1, -.1)) This is plot we get: Oh, btw, did you notice that Linear SVM has decision rules. Let $$y$$ be an indicator variable which takes the value -1 if the message is spam and 1 if it is a ham. Then, the decision rules are $$y=-1~iff~h(x)~is~less~than~0$$ $$y=1~iff~h(x)~is~greater~than~0$$ Notice that for any point $$x$$ in your data set, $$h(x)\neq 0$$ as we want the function $$w^{T}x+b$$ to clearly distinguish between spam and ham. This brings us to a important condition. Important Condition Linear SVM assumes that the two classes are linearly separable that is a hyper-plane can separate out the two classes and the data points from the two classes do not get mixed up. Of course this is not an ideal assumption and how we will discuss it later how linear SVM works out the case of non-linear separability. But for a reader with some experience here I pose a question which is like this Linear SVM creates a discriminant function but so does LDA. Yet, both are different classifiers. Why ? (Hint: LDA is based on Bayes Theorem while Linear SVM is based on the concept of margin. In case of LDA, one has to make an assumption on the distribution of the data per class. For a newbie, please ignore the question. We will discuss this point in details in some other post.) Properties of $$w$$ Consider two points $$a_{1}$$ and $$a_{2}$$ lying on the hyper-plane defined by $$w^{T}x+b$$. Thus, $$h(a_{1})=w^{T}a_{1}+b=0$$ and $$h(a_{2})=w^{T}a_{2}+b=0$$. Therefore, $$w^{T}(a_{1}-a_{2})=0$$. Notice that the vector $$a_{1}-a_{2}$$ is lying on the separating hyper-plane defined by $$w^{T}x+b$$. Thus, $$w$$ is orthogonal to the hyper-plane. Here it is the plot: The Role of $$w$$ Consider a point $$x_{i}$$ and let $$x_{p}$$ be the orthogonal projection of $$x_{i}$$ onto the hyper-plane. Assume$h(x_{i})>0$for $$x$$. $$\frac{w}{\vert\vert w\vert\vert}$$ is a unit vector in the direction of $$w$$. Here it is the plot: Let $$r$$ be the distance between $$x_{i}$$ and $$x_{p}$$. Since $$x_{i}$$ is projected on the hyper-plane, there is some loss of information which is calculated as $$(r=x_{i}-x_{p})$$. Notice that $$r$$ is parallel to the unit vector $$\frac{w}{\vert\vert w\vert\vert}$$ and therefore $$r$$ can be written as some multiple of $$\frac{w}{\vert\vert w\vert\vert}$$. Thus, $$r=c\times \frac{w}{\vert\vert w\vert\vert}$$, for some scalar value $$c$$. It can be shown that $$c=\frac{h(x_{i})}{\vert\vert w\vert\vert}$$ Actually, $$c\times \frac{w}{\vert\vert w\vert\vert}$$ represents the orthogonal distance of the point $$x_{i}$$ from the hyper-plane. Distance has to be positive. $$\frac{w}{\vert\vert w\vert\vert}$$ cannot be negative. However, $$c=\frac{h(x)}{\vert\vert w\vert\vert}$$. Though, before starting the equations we assumed that $$h(x_{i})$$ is positive, but that is not a general case. In general $$c$$ can be negative also. Therefore we multiply $$y$$ with $$c$$. Remember your $$y$$ from the decision rules. That is $$c=y\times \frac{h(x_{i})}{\vert\vert w\vert\vert}$$ The Concept of Margin Linear SVM is based on the concept of margin which is defined as the smallest value of $$c$$ for all the $$n$$ points in your training data set. That is $$\delta^{\star}=min\{c~\forall~points~in~the~training~data~set\}$$. All points $$x^{\star}$$ which are lying at an orthogonal distance of $$\delta^{\star}$$ from the hyper-plane $$w^{T}x+b=0$$ are called support vectors. Scaling Down the Margin Let $$\delta^{\star}=\frac{y^{\star}\times h(x^{\star})}{\vert\vert w\vert\vert}$$, where $$x^{\star}$$ is a support vector. We want $$\delta^{\star} = \frac{1}{\vert\vert w\vert\vert}$$. Clearly, we want to multiply some value $$s$$ such that $$s\times y^{\star}\times h(x^{\star})=1$$. Therefore, $$s=\frac{1}{y^{\star}\times h(x^{\star})}$$. Objective Criteria for Linear SVM After scaling down, we have the following constraint for SVM: $$y\times (w^{T}x+b)\geq \frac{1}{\vert\vert w\vert\vert}$$. Now consider two support vectors, one denoted by $$a_{1}$$ on the positive side, while the other $$a_{2}$$ on the negative side of the hyper-plane. For the time being let us not use $$y$$. Then, we have $$w^{T}a_{1}+b= \frac{1}{\vert\vert w\vert\vert}$$ and $$w^{T}a_{2}+b= \frac{-1}{\vert\vert w\vert\vert}$$.Thus, $$w^{T}(a_{1}-a_{2})=\frac{2}{\vert\vert w\vert\vert}$$.$$w^{T}(a_{1}-a_{2})$$ depicts the separation/gap between the support vectors on the either side. We want to maximize this gap $$\frac{2}{\vert\vert w\vert\vert}$$. For the purpose, we need to minimize $$\vert\vert w\vert\vert$$. Instead,we can minimize $$\frac{\vert\vert w\vert\vert^{2}}{2}$$. Formally, we want to solve $$Minimize_{w,b}\frac{\vert\vert w\vert\vert^{2}}{2}$$ subject to the constraint $$y_{i}(w^{T}x_{i}+b)\geq \frac{1}{\vert \vert w\vert\vert}~\forall ~x_{i}~in ~Training~Data~set$$. For linearly inseparable case, we introduce some penalty $$0\leq \xi_{i} \leq 1$$ in the objective function and our constraint. $$Minimize_{w,b}(\frac{\vert\vert w\vert\vert^{2}}{2}+ C\sum_{i=1}^{n}\xi_{i})$$ subject to the constraints $$(1)~y_{i}(w^{T}x_{i}+b)\geq 1-\xi_{i}~\forall ~x_{i}~in ~Training~Data~set~of~size~n$$ and $$(2)~\xi_{i}\geq 0~\forall ~n~points~in~the~Training~Data~set$$ $$C$$ is a user defined parameter. In case you wish to see how $$\xi_{i}$$ is used as penalty, please consider the the three figures below. You can also choose to skip them if the above details are okay for you. Try the Code with R Let us now start our coding. The needed code is here. library(tm) setwd(path to the directory) print(getwd()) sms_data<-read.csv("sms_spam.csv",stringsAsFactors = FALSE) print(head(sms_data)) sms_data$type<-factor(sms_data$type) simply_text<-Corpus(VectorSource(sms_data$text)) cleaned_corpus<-tm_map(simply_text, tolower) cleaned_corpus<-tm_map(cleaned_corpus,removeNumbers) cleaned_corpus<-tm_map(cleaned_corpus,removeWords,stopwords()) #cleaned_corpus<-tm_map(cleaned_corpus,removePunctuation) #cleaned_corpus<-tm_map(cleaned_corpus,stripWhitespace) sms_dtm<-DocumentTermMatrix(cleaned_corpus) sms_train<-sms_dtm[1:4000] sms_test<-sms_dtm[4001:5574] #sms_train<-cleaned_corpus[1:4000] #sms_test<-cleaned_corpus[4001:5574] freq_term=(findFreqTerms(sms_dtm,lowfreq=2)) #sms_freq_train<-DocumentTermMatrix(sms_train, list(dictionary=freq_term)) #sms_freq_test<-DocumentTermMatrix(sms_test,list(dictionary=freq_term)) y<-sms_data$type y_train<-y[1:4000] y_test<-y[4001:5574] library(e1071) tuned_svm<-tune(svm, train.x=sms_freq_train, train.y = y_train,kernel="linear", range=list(cost=10^(-2:2), gamma=c(0.1, 0.25,0.5,0.75,1,2)) ) print(tuned_svm) adtm.m<-as.matrix(sms_freq_train) adtm.df<-as.data.frame(adtm.m) svm_good_model<-svm(y_train~., data=adtm.df, kernel="linear",cost=tuned_svm$best.parameters$cost, gamma=tuned_svm$best.parameters$gamma) adtm.m_test<-as.matrix(sms_freq_test) adtm.df_test=as.data.frame(adtm.m_test) y_pred<-predict(svm_good_model,adtm.df_test) table(y_test,y_pred) library(gmodels) CrossTable(y_pred,y_test,prop.chisq = FALSE) type 1 ham 2 ham 3 spam 4 ham 5 ham 6 spam text 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... 2 Ok lar... Joking wif u oni... 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's 4 U dun say so early hor... U c already then say... 5 Nah I don't think he goes to usf, he lives around here though 6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, Â£1.50 to rcv Parameter tuning of ‘svm’: - sampling method: 10-fold cross validation - best parameters: cost gamma 1 0.1 - best performance: 0.01825 Cell Contents |-------------------------| | N | | N / Row Total | | N / Col Total | | N / Table Total | |-------------------------| Total Observations in Table: 1574 | y_test y_pred | ham | spam | Row Total | -------------|-----------|-----------|-----------| ham | 1295 | 191 | 1486 | | 0.871 | 0.129 | 0.944 | | 0.952 | 0.897 | | | 0.823 | 0.121 | | -------------|-----------|-----------|-----------| spam | 66 | 22 | 88 | | 0.750 | 0.250 | 0.056 | | 0.048 | 0.103 | | | 0.042 | 0.014 | | -------------|-----------|-----------|-----------| Column Total | 1361 | 213 | 1574 | | 0.865 | 0.135 | | -------------|-----------|-----------|-----------| In the above example we are not using the cleaned corpus. We see that the accuracy/precision of our classifier for ham is 0.871 and recall is 0.95. However for Spam messages, the precision is 0.250, while recall is just 0.103. We found that the usage of cleaned corpus degrades the performance for the SVM more. Related Post To leave a comment for the author, please follow the link and comment on their blog: DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### RStudio Connect 1.4.4.1 Tue, 03/28/2017 - 20:42 (This article was first published on RStudio Blog, and kindly contributed to R-bloggers) We’re excited to announce the release of RStudio Connect: version 1.4.4.1. This release includes the ability to manage different versions of your work on RStudio Connect. Rollback / Roll Forward The most notable feature of this release is the ability to “rollback” to a previously deployed version of your work or “roll forward” to a more recent version of your work. You can also download a particular version, perhaps as a starting place for a new report or application, and delete old versions that you want to remove from the server. Other important features allow you to: • Specify the number of versions to retain. You can alter the setting Applications.BundleRetentionLimit to specify how many versions of your applications you want to keep on disk. By default, we retain all bundles eternally. • Limit the number of scheduled reports that will be run concurrently using the Applications.ScheduleConcurrency setting. This setting will help ensure that your server isn’t overwhelmed by too many reports all scheduled to run at the same time of day. The default is set to 2. • Create a printable view of your content with a new “Print” menu option. • Notify users of unsaved changes before they take an action in parameterized reports. The release also includes numerous security and stability improvements. If you haven’t yet had a chance to download and try RStudio Connect we encourage you to do so. RStudio Connect is the best way to share all the work that you do in R (Shiny apps, R Markdown documents, plots, dashboards, etc.) with collaborators, colleagues, or customers. You can find more details or download a 45 day evaluation of the product at https://www.rstudio.com/products/connect/. Additional resources can be found below. To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### nanotime 0.1.2 Tue, 03/28/2017 - 13:32 (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers) A new minor version of the nanotime package for working with nanosecond timestamps arrived yesterday on CRAN. nanotime uses the RcppCCTZ package for (efficient) high(er) resolution time parsing and formatting up to nanosecond resolution, and the bit64 package for the actual integer64 arithmetic. This release just arranges things neatly before Leonardo Silvestri and I may shake things up with a possible shift to doing it all in S4 as we may need the added rigour for nanotime object operations for use in his ztsdb project. Changes in version 0.1.2 (2017-03-27) • The as.integer64 function is now exported as well. We also have a diff to the previous version thanks to CRANberries. More details and examples are at the nanotime page; code, issue tickets etc at the GitHub repository. This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings. To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Introducing Boardgame Recommendation website built with Shiny Tue, 03/28/2017 - 05:28 My family and I love boardgames. We are also on the lookout for new ones that would fit with the ones we already like to play. So I decided to create a boardgame recommendation website https://larrydag.shinyapps.io/boardgame_reco/ I built it using R. The recommendation engine uses a very simple collaborative filtering algortihm based on correlation scores from other boardgame players collection lists. The collections are gathered using the API from BoardgameGeek.com. It is very much in a beta project phase as I just wanted to get something built to get working. I also wanted another project to build in Shiny. I really like how easy it is to publish R projects with Shiny. Some of the features include: • Ability to enter your own collection • Get recommendation on your collection • Amazon link to buy boardgame that is recommended Its a work in progress. There is much to clean up and to make more presentable. Please take a look and offer comments to help improve the website. ### Le Monde puzzle [#1000…1025] Tue, 03/28/2017 - 00:17 (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers) Le Monde mathematical puzzle launched a competition to celebrate its 1000th puzzle! A fairly long-term competition as it runs over the 25 coming puzzles (and hence weeks). Starting with puzzle #1001. Here is the 1000th puzzle, not part of the competition: Alice & Bob spend five (identical) vouchers in five different shops, each time buying the maximum number of items to get close to the voucher value. In these five shops, they buy sofas at 421 euros each, beds at 347 euros each, kitchen appliances at 289 euros each, tables at 251 euros each and bikes at 211 euros each, respectively. Once the buying frenzy is over, they realise that within a single shop, they would have spent exactly four vouchers for the same products. What is the value of a voucher? Filed under: Kids, R Tagged: Alice and Bob, arithmetics, competition, Le Monde, mathematical puzzle, R, Tangente To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### How to Reach More People with your Next Data Science Talk Mon, 03/27/2017 - 18:00 (This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers) Speaking publicly about your data science projects is one of the best things you can do for your portfolio. Writing your presentation will help you refine what your project is, who it is for, and how it relates to other work in the area. Giving your presentation at a live, in-person event will make a large impact on the audience. That being said, public speaking has its drawbacks. For example: 1. Time to prepare. While your talk might take just 30 minutes to deliver, you could easily spend weeks writing and rehearsing it. 2. Cost. Some talks, such as those at local meetups, might be free to attend. But some will require significant travel. That costs money, even if your conference ticket is free. 3. Limited upside. When you deliver a conference talk, your upside is limited in two ways: 1. Reach. Only the people in the room at the time of your talk can hear your message. 2. Interaction. Your interaction with the audience is normally limited to a few minutes of Q&A, and whatever conversations you have with people during the rest of the conference. In my experience, this third pain point is the most significant. After all, if you had reached 10x the number of people with your last talk, then you probably would have been willing to spend more time writing it. And you probably would have been willing to travel further to give it. Imagine if your talk wasn’t limited by the conference itself For a moment, try to ignore the limits that the conferences place on your talk. Imagine if members of the audience could ask you questions days, or even weeks, after you gave your talk. Wouldn’t that be a game changer? And what about all the people who care about your topic but couldn’t make it to the conference? Wouldn’t it be great if they could hear your talk as well? All of this is possible I’ve recently worked out a system to solve these problems in my own public speaking events. My talks now reach more people than ever before, and lead to more interactions than ever before. Bonus: A checklist for making the most out of your data science talks! Before giving you the exact system I use, let me tell you the technology that you’ll need in order to implement it. Note that some of the links below are affiliate links. This means that, at no additional cost to you, I may get a commission if you wind up purchasing after clicking these links: 1. A self-hosted WordPress website. I use BlueHost. 2. An Email Service so you can follow up with people. I use Drip. 3. A Webinar Provider. I use Zoom. 4. A Landing Page Builder. I use Thrive. Step 1: Let the Audience Download your Slides Before the conference, take out a url such as <ConferenceName>Slides.com. The URL should be easy for the audience to remember. Set the URL to redirect to a separate page on your website. This page should a “landing page”. This means that the only Call-to-Action (CTA) on the page should be for the reader to enter their email address so that you can send them the slides. I recommend that the last slide in your deck contain just this URL. When the slide appears, say “If you would like to download the slide deck I used for this talk, visit this URL and enter your email address. I will then email you the slides.” Set up an automated email campaign just for this opt-in. The first email should fire immediately and send them the slides. The second email should fire a few days later, and ask people if they enjoyed the slides or have any questions. This tip alone should multiply the chances you have to interact with your audience after the talk ends. Step 2: Host a Post-Conference Webinar The sad fact is this: most people who would enjoy your talk won’t be in the audience. No conference can get even a fraction of the people who are interested in a topic. The best way around this is to host a live webinar where you go through the same exact same slides as your presentation. The week after your presentation, publish a blog post where you announce that you’ll be giving the same talk, but this time as a live webinar. Drive traffic to the registration page by announcing it on LinkedIn, Facebook and Twitter. The biggest impact your talk will have is the in-person presentation at the conference. The second biggest impact will be the live webinar. Step 3: Use the Webinar Recording After giving the webinar, you will now have a recording of the presentation. This is great because (again) most people who are interested in your talk will not have been at the conference, and will not have attended your live webinar. You can host the recording on your own website or on youtube. When you get a new subscriber, point them to the recording. You can also periodically drive traffic to the recording on social media. Step 4: Post Your Slides Online A week after your webinar I recommend posting your slides online. I also recommend creating a content upgrade so that readers can download the slides. A “content upgrade” simply means that readers enter their email address, and that you then email them the slides. This step will probably have the highest volume (i.e. the most people will interact with your content in this form), but the least impact (i.e. people will not get as much out of interacting with your talk in this form). Closing Notes As a data scientist with an online portfolio, conference talks are probably the most valuable type of content you create. Without additional effort, though, that value will be limited to people who are in the audience at the time of your presentation. With just a little bit of work, though, you can overcome that limitation and have more followup with your audience and reach more people. Bonus: A checklist for making the most out of your data science talks! The post How to Reach More People with your Next Data Science Talk appeared first on AriLamstein.com. To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Data science for Doctors: Inferential Statistics Exercises (part-3) Mon, 03/27/2017 - 18:00 (This article was first published on R-exercises, and kindly contributed to R-bloggers) Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people that are around medical field to enhance their data science skills. We will work with a health related database the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here. This is the sixth part of the series and it aims to cover partially the subject of Inferential statistics. Researchers rarely have the capability of testing many patients,or experimenting a new treatment to many patients, therefore making inferences out of a sample is a necessary skill to have. This is where inferential statistics comes into play. In more detail, in this part we will go through the hypothesis testing for Student’s t-distribution (Student’s t-test), which may be the most used test you will need to apply, since in most cases the standard deviation σ of the population is not known. We will cover the one-sample t-test and two-sample t-test(both with equal and unequal variance). If you are not aware of what are the mentioned distributions please go here to acquire the necessary background. Before proceeding, it might be helpful to look over the help pages for the t.test. Please run the code below in order to load the data set and transform it into a proper data frame format: url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data" data <- read.table(url, fileEncoding="UTF-8", sep=",") names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class') colnames(data) <- names data <- data[-which(data$mass ==0),]

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Suppose that we take a sample of 25 candidates that tried a diet and they had a average weight of 29 (generate 25 normal distributed samples with mean 29 and standard deviation 4) after the experiment.
Find the t-value.

Exercise 2

Find the p-value.

Exercise 3

Find the 95% confidence interval.

Exercise 4

Apply t-test with Null Hypothesis that the true mean of the sample is equal to the mean of the sample with 5% confidence level.

Exercise 5

Apply t-test with Null Hypothesis that the true mean of the sample is equal to the mean of the population and the alternative that the true mean is less than the mean of the population with 5% confidence level.

Exercise 6

Suppose that we want to compare the current diet with another one. We assume that we test a different diet to a sample of 27 with mass average of 31(generate normal distributed samples with mean 31 and standard deviation of 5). Test whether the two diets are significantly different.
Note that the two distributions have different variances.
hint: This is a two sample hypothesis testing with different variances.

Exercise 7

Test whether the the first diet is more efficient than the second.

Exercise 8

Assume that the second diet has the same variance as the first one. Is it significant different?

Exercise 9

Assume that the second diet has the same variance as the first one. Is it significantly better?

Exercise 10

Suppose that you take a sample of 27 with average mass of 29, and after the diet the average mass is 28(generate the sampled with rnorm(27,average,4)). Are they significant different?
hint: Paired Sample T-Test.

Related exercise sets:

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### An Introduction to Stock Market Data Analysis with R (Part 1)

Mon, 03/27/2017 - 17:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

Around September of 2016 I wrote two articles on using Python for accessing, visualizing, and evaluating trading strategies (see part 1 and part 2). These have been my most popular posts, up until I published my article on learning programming languages (featuring my dad’s story as a programmer), and has been translated into both Russian (which used to be on backtest.ru at a link that now appears to no longer work) and Chinese (here and here). R has excellent packages for analyzing stock data, so I feel there should be a “translation” of the post for using R for stock data analysis.

This post is the first in a two-part series on stock data analysis using R, based on a lecture I gave on the subject for MATH 3900 (Data Science) at the University of Utah. In these posts, I will discuss basics such as obtaining the data from Yahoo! Finance using pandas, visualizing stock data, moving averages, developing a moving-average crossover strategy, backtesting, and benchmarking. The final post will include practice problems. This first post discusses topics up to introducing moving averages.

NOTE: The information in this post is of a general nature containing information and opinions from the author’s perspective. None of the content of this post should be considered financial advice. Furthermore, any code written here is provided without any form of guarantee. Individuals who choose to use it do so at their own risk.

Introduction

Advanced mathematics and statistics have been present in finance for some time. Prior to the 1980s, banking and finance were well-known for being “boring”; investment banking was distinct from commercial banking and the primary role of the industry was handling “simple” (at least in comparison to today) financial instruments, such as loans. Deregulation under the Regan administration, coupled with an influx of mathematical talent, transformed the industry from the “boring” business of banking to what it is today, and since then, finance has joined the other sciences as a motivation for mathematical research and advancement. For example one of the biggest recent achievements of mathematics was the derivation of the Black-Scholes formula, which facilitated the pricing of stock options (a contract giving the holder the right to purchase or sell a stock at a particular price to the issuer of the option). That said, bad statistical models, including the Black-Scholes formula, hold part of the blame for the 2008 financial crisis.

In recent years, computer science has joined advanced mathematics in revolutionizing finance and trading, the practice of buying and selling of financial assets for the purpose of making a profit. In recent years, trading has become dominated by computers; algorithms are responsible for making rapid split-second trading decisions faster than humans could make (so rapidly, the speed at which light travels is a limitation when designing systems). Additionally, machine learning and data mining techniques are growing in popularity in the financial sector, and likely will continue to do so. In fact, a large part of algorithmic trading is high-frequency trading (HFT). While algorithms may outperform humans, the technology is still new and playing an increasing role in a famously turbulent, high-stakes arena. HFT was responsible for phenomena such as the 2010 flash crash and a 2013 flash crash prompted by a hacked Associated Press tweet about an attack on the White House.

My articles, however, will not be about how to crash the stock market with bad mathematical models or trading algorithms. Instead, I intend to provide you with basic tools for handling and analyzing stock market data with R. We will be using stock data as a first exposure to time series data, which is data considered dependent on the time it was observed (other examples of time series include temperature data, demand for energy on a power grid, Internet server load, and many, many others). I will also discuss moving averages, how to construct trading strategies using moving averages, how to formulate exit strategies upon entering a position, and how to evaluate a strategy with backtesting.

DISCLAIMER: THIS IS NOT FINANCIAL ADVICE!!! Furthermore, I have ZERO experience as a trader (a lot of this knowledge comes from a one-semester course on stock trading I took at Salt Lake Community College)! This is purely introductory knowledge, not enough to make a living trading stocks. People can and do lose money trading stocks, and you do so at your own risk!

Getting and Visualizing Stock Data Getting Data from Yahoo! Finance with quantmod

Before we analyze stock data, we need to get it into some workable format. Stock data can be obtained from Yahoo! Finance, Google Finance, or a number of other sources, and the quantmod package provides easy access to Yahoo! Finance and Google Finance data, along with other sources. In fact, quantmod provides a number of useful features for financial modelling, and we will be seeing those features throughout these articles. In this lecture, we will get our data from Yahoo! Finance.

# Get quantmod if (!require("quantmod")) { install.packages("quantmod") library(quantmod) } start <- as.Date("2016-01-01") end <- as.Date("2016-10-01") # Let's get Apple stock data; Apple's ticker symbol is AAPL. We use the # quantmod function getSymbols, and pass a string as a first argument to # identify the desired ticker symbol, pass 'yahoo' to src for Yahoo! # Finance, and from and to specify date ranges # The default behavior for getSymbols is to load data directly into the # global environment, with the object being named after the loaded ticker # symbol. This feature may become deprecated in the future, but we exploit # it now. getSymbols("AAPL", src = "yahoo", from = start, to = end) ## As of 0.4-0, 'getSymbols' uses env=parent.frame() and ## auto.assign=TRUE by default. ## ## This behavior will be phased out in 0.5-0 when the call will ## default to use auto.assign=FALSE. getOption("getSymbols.env") and ## getOptions("getSymbols.auto.assign") are now checked for alternate defaults ## ## This message is shown once per session and may be disabled by setting ## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for more details. ## [1] "AAPL" # What is AAPL? class(AAPL) ## [1] "xts" "zoo" # Let's see the first few rows head(AAPL) ## AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume ## 2016-01-04 102.61 105.37 102.00 105.35 67649400 ## 2016-01-05 105.75 105.85 102.41 102.71 55791000 ## 2016-01-06 100.56 102.37 99.87 100.70 68457400 ## 2016-01-07 98.68 100.13 96.43 96.45 81094400 ## 2016-01-08 98.55 99.11 96.76 96.96 70798000 ## 2016-01-11 98.97 99.06 97.34 98.53 49739400 ## AAPL.Adjusted ## 2016-01-04 102.61218 ## 2016-01-05 100.04079 ## 2016-01-06 98.08303 ## 2016-01-07 93.94347 ## 2016-01-08 94.44022 ## 2016-01-11 95.96942

Let’s briefly discuss this. getSymbols() created in the global environment an object called AAPL (named automatically after the ticker symbol of the security retrieved) that is of the xts class (which is also a zoo-class object). xts objects (provided in the xts package) are seen as improved versions of the ts object for storing time series data. They allow for time-based indexing and provide custom attributes, along with allowing multiple (presumably related) time series with the same time index to be stored in the same object. (Here is a vignette describing xts objects.) The different series are the columns of the object, with the name of the associated security (here, AAPL) being prefixed to the corresponding series.

Yahoo! Finance provides six series with each security. Open is the price of the stock at the beginning of the trading day (it need not be the closing price of the previous trading day), high is the highest price of the stock on that trading day, low the lowest price of the stock on that trading day, and close the price of the stock at closing time. Volume indicates how many stocks were traded. Adjusted close (abreviated as “adjusted” by getSymbols()) is the closing price of the stock that adjusts the price of the stock for corporate actions. While stock prices are considered to be set mostly by traders, stock splits (when the company makes each extant stock worth two and halves the price) and dividends (payout of company profits per share) also affect the price of a stock and should be accounted for.

Visualizing Stock Data

Now that we have stock data we would like to visualize it. I first use base R plotting to visualize the series.

plot(AAPL[, "AAPL.Close"], main = "AAPL")

A linechart is fine, but there are at least four variables involved for each date (open, high, low, and close), and we would like to have some visual way to see all four variables that does not require plotting four separate lines. Financial data is often plotted with a Japanese candlestick plot, so named because it was first created by 18th century Japanese rice traders. Use the function candleChart() from quantmod to create such a chart.

candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white")

With a candlestick chart, a black candlestick indicates a day where the closing price was higher than the open (a gain), while a red candlestick indicates a day where the open was higher than the close (a loss). The wicks indicate the high and the low, and the body the open and close (hue is used to determine which end of the body is the open and which the close). Candlestick charts are popular in finance and some strategies in technical analysis use them to make trading decisions, depending on the shape, color, and position of the candles. I will not cover such strategies today.

(Notice that the volume is tracked as a bar chart on the lower pane as well, with the same colors as the corresponding candlesticks. Some traders like to see how many shares are being traded; this can be important in trading.)

We may wish to plot multiple financial instruments together; we may want to compare stocks, compare them to the market, or look at other securities such as exchange-traded funds (ETFs). Later, we will also want to see how to plot a financial instrument against some indicator, like a moving average. For this you would rather use a line chart than a candlestick chart. (How would you plot multiple candlestick charts on top of one another without cluttering the chart?)

Below, I get stock data for some other tech companies and plot their adjusted close together.

# Let's get data for Microsoft (MSFT) and Google (GOOG) (actually, Google is # held by a holding company called Alphabet, Inc., which is the company # traded on the exchange and uses the ticker symbol GOOG). getSymbols(c("MSFT", "GOOG"), src = "yahoo", from = start, to = end) ## [1] "MSFT" "GOOG" # Create an xts object (xts is loaded with quantmod) that contains closing # prices for AAPL, MSFT, and GOOG stocks <- as.xts(data.frame(AAPL = AAPL[, "AAPL.Close"], MSFT = MSFT[, "MSFT.Close"], GOOG = GOOG[, "GOOG.Close"])) head(stocks) ## AAPL.Close MSFT.Close GOOG.Close ## 2016-01-04 105.35 54.80 741.84 ## 2016-01-05 102.71 55.05 742.58 ## 2016-01-06 100.70 54.05 743.62 ## 2016-01-07 96.45 52.17 726.39 ## 2016-01-08 96.96 52.33 714.47 ## 2016-01-11 98.53 52.30 716.03 # Create a plot showing all series as lines; must use as.zoo to use the zoo # method for plot, which allows for multiple series to be plotted on same # plot plot(as.zoo(stocks), screens = 1, lty = 1:3, xlab = "Date", ylab = "Price") legend("right", c("AAPL", "MSFT", "GOOG"), lty = 1:3, cex = 0.5)

What’s wrong with this chart? While absolute price is important (pricey stocks are difficult to purchase, which affects not only their volatility but your ability to trade that stock), when trading, we are more concerned about the relative change of an asset rather than its absolute price. Google’s stocks are much more expensive than Apple’s or Microsoft’s, and this difference makes Apple’s and Microsoft’s stocks appear much less volatile than they truly are (that is, their price appears to not deviate much).

One solution would be to use two different scales when plotting the data; one scale will be used by Apple and Microsoft stocks, and the other by Google.

plot(as.zoo(stocks[, c("AAPL.Close", "MSFT.Close")]), screens = 1, lty = 1:2, xlab = "Date", ylab = "Price") par(new = TRUE) plot(as.zoo(stocks[, "GOOG.Close"]), screens = 1, lty = 3, xaxt = "n", yaxt = "n", xlab = "", ylab = "") axis(4) mtext("Price", side = 4, line = 3) legend("topleft", c("AAPL (left)", "MSFT (left)", "GOOG"), lty = 1:3, cex = 0.5)

Not only is this solution difficult to implement well, it is seen as a bad visualization method; it can lead to confusion and misinterpretation, and cannot be read easily.

A “better” solution, though, would be to plot the information we actually want: the stock’s returns. This involves transforming the data into something more useful for our purposes. There are multiple transformations we could apply.

One transformation would be to consider the stock’s return since the beginning of the period of interest. In other words, we plot:

This will require transforming the data in the stocks object, which I do next.

# Get me my beloved pipe operator! if (!require("magrittr")) { install.packages("magrittr") library(magrittr) } ## Loading required package: magrittr stock_return % t %>% as.xts head(stock_return) ## AAPL.Close MSFT.Close GOOG.Close ## 2016-01-04 1.0000000 1.0000000 1.0000000 ## 2016-01-05 0.9749407 1.0045620 1.0009975 ## 2016-01-06 0.9558614 0.9863139 1.0023994 ## 2016-01-07 0.9155197 0.9520073 0.9791734 ## 2016-01-08 0.9203607 0.9549271 0.9631052 ## 2016-01-11 0.9352634 0.9543796 0.9652081 plot(as.zoo(stock_return), screens = 1, lty = 1:3, xlab = "Date", ylab = "Return") legend("topleft", c("AAPL", "MSFT", "GOOG"), lty = 1:3, cex = 0.5)

This is a much more useful plot. We can now see how profitable each stock was since the beginning of the period. Furthermore, we see that these stocks are highly correlated; they generally move in the same direction, a fact that was difficult to see in the other charts.

Alternatively, we could plot the change of each stock per day. One way to do so would be to plot the percentage increase of a stock when comparing day to day , with the formula:

But change could be thought of differently as:

These formulas are not the same and can lead to differing conclusions, but there is another way to model the growth of a stock: with log differences.

(Here, is the natural log, and our definition does not depend as strongly on whether we use or .) The advantage of using log differences is that this difference can be interpreted as the percentage change in a stock but does not depend on the denominator of a fraction.

We can obtain and plot the log differences of the data in stocks as follows:

stock_change % log %>% diff head(stock_change) ## AAPL.Close MSFT.Close GOOG.Close ## 2016-01-04 NA NA NA ## 2016-01-05 -0.025378648 0.0045516693 0.000997009 ## 2016-01-06 -0.019763704 -0.0183323194 0.001399513 ## 2016-01-07 -0.043121062 -0.0354019469 -0.023443064 ## 2016-01-08 0.005273804 0.0030622799 -0.016546113 ## 2016-01-11 0.016062548 -0.0005735067 0.002181138 plot(as.zoo(stock_change), screens = 1, lty = 1:3, xlab = "Date", ylab = "Log Difference") legend("topleft", c("AAPL", "MSFT", "GOOG"), lty = 1:3, cex = 0.5)

Which transformation do you prefer? Looking at returns since the beginning of the period make the overall trend of the securities in question much more apparent. Changes between days, though, are what more advanced methods actually consider when modelling the behavior of a stock. so they should not be ignored.

Moving Averages

Charts are very useful. In fact, some traders base their strategies almost entirely off charts (these are the “technicians”, since trading strategies based off finding patterns in charts is a part of the trading doctrine known as technical analysis). Let’s now consider how we can find trends in stocks.

A -day moving average is, for a series and a point in time , the average of the past days: that is, if denotes a moving average process, then:

Moving averages smooth a series and helps identify trends. The larger is, the less responsive a moving average process is to short-term fluctuations in the series . The idea is that moving average processes help identify trends from “noise”. Fast moving averages have smaller and more closely follow the stock, while slow moving averages have larger , resulting in them responding less to the fluctuations of the stock and being more stable.

quantmod allows for easily adding moving averages to charts, via the addSMA() function.

candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white") addSMA(n = 20)

Notice how late the rolling average begins. It cannot be computed until 20 days have passed. This limitation becomes more severe for longer moving averages. Because I would like to be able to compute 200-day moving averages, I’m going to extend out how much AAPL data we have. That said, we will still largely focus on 2016.

start = as.Date("2010-01-01") getSymbols(c("AAPL", "MSFT", "GOOG"), src = "yahoo", from = start, to = end) ## [1] "AAPL" "MSFT" "GOOG" # The subset argument allows specifying the date range to view in the chart. # This uses xts style subsetting. Here, I'm using the idiom # 'YYYY-MM-DD/YYYY-MM-DD', where the date on the left-hand side of the / is # the start date, and the date on the right-hand side is the end date. If # either is left blank, either the earliest date or latest date in the # series is used (as appropriate). This method can be used for any xts # object, say, AAPL candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white", subset = "2016-01-04/") addSMA(n = 20)

You will notice that a moving average is much smoother than the actual stock data. Additionally, it’s a stubborn indicator; a stock needs to be above or below the moving average line in order for the line to change direction. Thus, crossing a moving average signals a possible change in trend, and should draw attention.

Traders are usually interested in multiple moving averages, such as the 20-day, 50-day, and 200-day moving averages. It’s easy to examine multiple moving averages at once.

candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white", subset = "2016-01-04/") addSMA(n = c(20, 50, 200))

The 20-day moving average is the most sensitive to local changes, and the 200-day moving average the least. Here, the 200-day moving average indicates an overall bearish trend: the stock is trending downward over time. The 20-day moving average is at times bearish and at other times bullish, where a positive swing is expected. You can also see that the crossing of moving average lines indicate changes in trend. These crossings are what we can use as trading signals, or indications that a financial security is changing direction and a profitable trade might be made.

Visit next week to read about how to design and test a trading strategy using moving averages.

# Package/system information sessionInfo() ## R version 3.3.3 (2017-03-06) ## Platform: i686-pc-linux-gnu (32-bit) ## Running under: Ubuntu 15.10 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## attached base packages: ## [1] methods stats graphics grDevices utils datasets base ## ## other attached packages: ## [1] magrittr_1.5 quantmod_0.4-7 TTR_0.23-1 xts_0.9-7 ## [5] zoo_1.7-14 RWordPress_0.2-3 optparse_1.3.2 knitr_1.15.1 ## ## loaded via a namespace (and not attached): ## [1] lattice_0.20-34 XML_3.98-1.5 bitops_1.0-6 grid_3.3.3 ## [5] formatR_1.4 evaluate_0.10 highr_0.6 stringi_1.1.3 ## [9] getopt_1.20.0 tools_3.3.3 stringr_1.2.0 RCurl_1.95-4.8 ## [13] XMLRPC_0.3-0

To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Analyzing Accupedo step count data in R: Part 2 – Adding weather data

Mon, 03/27/2017 - 16:30

(This article was first published on Method Matters, and kindly contributed to R-bloggers)

In my last set of posts, I wrote about analyzing data from the Accupedo step counter app I have on my phone. In this post, I’ll talk about some additional analysis I’ve done by merging the step counter data with weather data from another source.

The website www.wunderground.com has freely available weather data available for most parts of the world. According to their website, their data are recorded by a crowd-sourced team of weather enthusiasts who report the weather conditions from personal stations at their homes. Pretty cool project, and it was really easy to search for my city and download the data in a .csv format. In this post, we will focus on the Mean Temperature (for the day) value, recorded in degrees Celsius.

Data Preparation

The step count data here are aggregated to the daily level, as I didn’t have more granular weather information. In all honesty, it probably makes more sense to analyze the data at the day level (as I’ve done); it would likely be hard to see hourly evolution in the weather matching hourly evolution in my step counts. My movements are pretty constrained during certain hours of the day (e.g. when I’m at work), and so examining that granular of a relationship is unlikely to be very illuminating.

I was able to match all of the dates that I had step counts for with the daily weather information from www.wunderground.com, and merged the weather data into the main step count dataset described in the previous post. I then created the analysis dataset, a day-level version of the walking data aggregated across the two variables I had created previously – day and day of week (essentially this aggregates to the day-level, while keeping a column describing the day of the week – e.g. Monday, Tuesday, etc.).

# load the required packages
library(plyr); library(dplyr)
library(lubridate)
library(ggplot2)

# make the variables to aggregate across
# the master data set with steps and temperature is called walking_weather
walking_weather$oneday <- cut(walking_weather$date_time, breaks = "1 day")
walking_weather$dow <- wday(walking_weather$date_time, label = TRUE)

# aggregate by day, day of week
# max of step count; mean of temperature
ww_per_day <- walking_weather %>% group_by(oneday, dow) %>%
summarise(steps = max(steps), temp = mean(Mean_TemperatureC))

As before, when aggregating I kept the maximum step count per day (as these counts are cumulative), and computed the average temperature in degrees Celsius per day (because there are multiple observations per day in the non-aggregated data, taking the average across days simply returns the daily average temperature from the daily weather data).

The resulting dataset looks like this:

Data Visualization and Analysis

As a first step, let’s make histograms of the temperature and step count variables:

# histograms of temperature and step count at a daily level
# using the base plotting system
hist(ww_per_day$temp, col ='darkblue', xlab = 'Daily Average Temperature (Celsius)', main = 'Histogram of Temperature') hist(ww_per_day$steps, col ='darkgreen',
xlab = 'Daily Total Step Count',
main = 'Histogram of Step Count')

The temperatures have ranged from -4 degrees to 27 degrees Celsius, with a mean of 11.66 (SD = 5.83).

The daily step counts have ranged from 6 to 65,337, with a mean of 18,294.04 (SD = 7,808.36).

In order to visualize the relationship between my daily step count and the daily average temperature, I made a simple scatterplot, with a linear model line superimposed on the graph (95% confidence intervals shown in light grey):

# plot of steps by temperature
p = ggplot(data = ww_per_day, aes(x = temp, y = steps)) +
geom_point() +
coord_cartesian(ylim = c(0, 25000)) +
geom_smooth(method="lm") +
theme(legend.title=element_blank()) +
labs(x = "Temperature (Celsius)", y = "Steps" )
p

There is a slight positive relationship between temperature and step count, such that on warmer days I tend to walk more. A simple linear regression will estimate the size of this relationship:

summary(lm(steps~temp , data = ww_per_day ))

The estimates given:

table.tableizer-table { font-size: 12px; border: 1px solid #CCC; font-family: Arial, Helvetica, sans-serif; } .tableizer-table td { padding: 4px; margin: 3px; border: 1px solid #CCC; } .tableizer-table th { background-color: #104E8B; color: #FFF; font-weight: bold; } Estimate Std. Error t value Pr(>|t|) Starz! (Intercept) 17277.29 681.76 25.342 < 2e-16 *** temp 87.21 52.31 1.667 0.096 .

---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7798 on 653 degrees of freedom
Multiple R-squared: 0.004238, Adjusted R-squared: 0.002713
F-statistic: 2.779 on 1 and 653 DF, p-value: 0.09597

The model estimates that I walk an additional 87.21 steps for each degree increase in the daily average temperature. However, the p-value associated with this estimate shows only a nonsignificant trend toward significance.

In a previous post we saw that my walking patterns tend to be different during the weekdays and the weekend. I was therefore curious to examine the relationship between temperature and step count across these two different types of days.

I generated the week/weekend variable (as in the previous post) and the plot with the following code:

# generate the weekday vs. weekend variable
ww_per_day$week_weekend[ww_per_day$dow == 'Sun'] <- 'Weekend'
ww_per_day$week_weekend[ww_per_day$dow == 'Sat'] <- 'Weekend'
ww_per_day$week_weekend[ww_per_day$dow == 'Mon'] <- 'Weekday'
ww_per_day$week_weekend[ww_per_day$dow == 'Tues'] <- 'Weekday'
ww_per_day$week_weekend[ww_per_day$dow == 'Wed'] <- 'Weekday'
ww_per_day$week_weekend[ww_per_day$dow == 'Thurs'] <- 'Weekday'
ww_per_day$week_weekend[ww_per_day$dow == 'Fri'] <- 'Weekday'

# plot the relationship between temperature and step count
# with separate colors and lines for weekday vs. weekend
p = ggplot(data = ww_per_day, aes(x = temp, y = steps, color=week_weekend)) +
geom_point() +
coord_cartesian(ylim = c(0, 25000)) +
geom_smooth(method="lm") +
theme(legend.title=element_blank()) +
labs(x = "Temperature (Celsius)", y = "Steps" )
p

The relationship between temperature and step count does indeed appear to be different on weekdays vs. the weekend:

The general pattern is that there’s not much of a relationship between temperature and step count during the week (after all, I have to walk to work whether it’s warm or cold), but on the weekends I walk quite a bit more when it’s warmer out.

Notice how the confidence intervals get wider at the extremes of the x-axis. These are regions where there are simply less data (by definition, extreme temperatures are rare). This pattern is especially noticeable for the weekend days (there are far fewer weekend days than there are week days, and so we have even less data here at the margins).

Although I’m in general not particularly interested in null-hypothesis significance testing of regression coefficients in this type of analysis, I went through the formality of testing the time period (weekday vs. weekend) by temperature interaction.

summary(lm(steps~temp + as.factor(week_weekend)+ temp:as.factor(week_weekend),
data = ww_per_day ))

table.tableizer-table { font-size: 12px; border: 1px solid #CCC; font-family: Arial, Helvetica, sans-serif; } .tableizer-table td { padding: 4px; margin: 3px; border: 1px solid #CCC; } .tableizer-table th { background-color: #104E8B; color: #FFF; font-weight: bold; }

Estimate Std. Error t value Pr(>|t|) Starz! (Intercept) 19178.76 774.45 24.76 < 2e-16 *** temp 23.46 59.52 0.39 0.69 as.factor(week_weekend)Weekend -6982.66 1486.25 -4.70 0.00 *** temp:as.factor(week_weekend)Weekend 246.76 113.76 2.17 0.03 * ---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7560 on 651 degrees of freedom
Multiple R-squared: 0.06682, Adjusted R-squared: 0.06252
F-statistic: 15.54 on 3 and 651 DF, p-value: 9.005e-10

The results show that the interaction term is statistically significant; we already understood the shape and meaning of the interaction with the above plot.

More interesting for our purposes is the model summary. The adjusted multiple R squared is about 6%, much greater than the .3% from the model above (where we only include temperature as a predictor), but not very high overall. The take-home message is that time period (weekday vs. weekend) and temperature have clear relationships with step count, but together and in combination they don’t explain a huge amount of variance in the amount of steps I walk each day.

I wondered about the importance of the interaction term, actually. Interactions are widely sought after in certain academic contexts, but I enjoy the simplicity and ease-of-interpretation of main effects models for applied problems which require interpretable solutions (e.g. “white box” models). Bluntly put: how important is the interaction term in this context? One might argue that, although the p-value for the interaction term passes the threshold for statistical significance, knowing that the interaction term is (statistically) significantly different from zero doesn’t bring us much further in our understanding of the problem at hand, after accounting for the main effects of time period and temperature.

Model Comparison

In order to more concretely investigate the importance of the time period by temperature interaction term, I computed the model predicting step count with just the main effects of time period and temperature (lm_1 below), and also the model with the main effects and their interaction term (lm_2 below). I then performed a model comparison to see if adding the interaction term results in a statistically significant reduction in the residual sum of squares. In other words, is the difference between the predictive performance of the first and second model greater than zero?

# main effects model
lm_1 <- lm(steps~temp + as.factor(week_weekend), data = ww_per_day )
summary(lm_1)

# main effects + interaction model
lm_2 <- lm(steps~temp + as.factor(week_weekend)+ temp:as.factor(week_weekend),
data = ww_per_day )
summary(lm_2)

# model comparison
anova(lm_1,lm_2)

Which gives the following result:
table.tableizer-table { font-size: 12px; border: 1px solid #CCC; font-family: Arial, Helvetica, sans-serif; } .tableizer-table td { padding: 4px; margin: 3px; border: 1px solid #CCC; } .tableizer-table th { background-color: #104E8B; color: #FFF; font-weight: bold; }

Model Res.Df RSS Df Sum of Sq F Pr(>F) 1 652 3.75E+10 2 651 3.72E+10 1 268916758 4.70 0.03 * ---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This comparison suggests that adding the interaction term results in a statistically significant decrease in the residual sum of squares. In other words, we can better predict my step count if we include the interaction term.

Conclusion

To sum up, in this post we took the daily total step counts from my Accupedo step counter app and added data about the average temperature on each day that steps were recorded. We used scatterplots to explore and understand the relationship between these two variables, both overall and according to the time period (e.g. weekday vs. weekend).

We also used linear regression to investigate the relationship between daily temperature and daily step count. When predicting step count from temperature, we saw that there was a non-statistically significant trend such that I walk more on warmer days. When adding time period as a categorical predictor in an interactive regression model, we saw a statistically significant interaction between temperature and time period, such that the relationship between temperature and step count was stronger on weekends vs. weekdays. In other words, I walk more when it’s warmer out, but only during the weekend.

Caveat: Null Hypothesis Significance Testing (NHST) in Exploratory Data Analysis

We used NHST and p-values to interpret regression coefficients and to compare the predictive performance of the main effects vs. interaction regression models. I’m somewhat ambivalent about the use of p-values in this context, honestly. I’ve been very influenced by Andrew Gelman’s work, particularly his wonderful blog, and am very conscious of the limitations of p-values in exploratory contexts like this one. If I had to choose the most interesting test of statistical significance in this post, I would choose the model comparison: it puts an explicit focus on the relative performance of competing (nested) models, rather than simply using significance testing to determine whether individual regression coefficients are significantly different from zero.

The statistical approach we’ve seen in this post conforms very much to the way I was trained in the social sciences. In this tradition, a strong focus is put on NHST (via p-values), and less frequently on interpreting model R square and performing formal model comparisons. In the work I do now as an applied data analyst (or “data scientist”), I often take an approach that borrows heavily from machine learning (though these ideas are also present in traditional statistical thinking): evaluating models via predictive accuracy on a test dataset that was not used in data modeling.

Coming Up Next

In the next set of posts, we will analyze a different dataset using an approach that leans more heavily towards this machine-learning perspective. We’ll also change analytical software, using Python/Pandas and the scikit-learn library to analyze the data. Stay tuned!

To leave a comment for the author, please follow the link and comment on their blog: Method Matters. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### #0: Introducing R^4

Mon, 03/27/2017 - 15:10

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

So I had been toying with the idea of getting back to the blog and more regularly writing / posting little tips and tricks. I even starting taking some notes but because perfect is always the enemy of the good it never quite materialized.

But the relatively broad discussion spawned by last week’s short rant on Suggests != Depends made a few things clear. There appears to be an audience. It doesn’t have to be long. And it doesn’t have to be too polished.

So with that, let’s get the blogging back from micro-blogging.

This note forms post zero of what will a new segment I call R4 which is shorthand for relatively random R rambling.

Stay tuned.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Introducing brotools

Mon, 03/27/2017 - 09:23

I’m happy to announce my first R package, called brotools. This is a package that contains functions that are specific to my needs but that you might find also useful. I blogged about some of these functions, so if you follow my blog you might already be familiar with some of them. It is not on CRAN and might very well never be. The code is hosted on bitbucket and you can install the package with

devtools::install_bitbucket("b-rodrigues/brotools")

Hope you’ll find the brotools useful!