R-bloggers: R news and tutorials contributed by hundreds of R bloggers

An overview of R with a curated learning path

Tue, 05/08/2018 - 07:05

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

I recently wrote an 80-page guide to getting a programming job without a degree, drawing on my experience helping students do just that at Springboard, a leading data science bootcamp. This excerpt is the part that gives an overview of the R programming language.

Description: R is an open-source programming language most often used by statisticians, data scientists, and academics to explore large data sets and distill insights from them. It offers multiple libraries that are useful for data processing tasks. R is an implementation of the S language, which was developed by John Chambers and colleagues at Bell Laboratories; R itself was created by Ross Ihaka and Robert Gentleman at the University of Auckland.

It has strong libraries for data visualization, time series analysis and a variety of statistical analysis tasks.

It’s free software that runs on a variety of operating systems, from UNIX to Windows to OS X. It is released under the GNU General Public License, an open-source license, and is thus free to use and adapt.

Salary: Median salaries for R users tend to vary, with the main split being the difference between data analysts who use R to query existing data pipelines and data scientists who build those data pipelines and train different models programmatically on top of larger data sets. The difference can be stark, with around a $90,000 median salary for data scientists who use R, vs about a $60,000 median salary for data analysts who use R.

Uses: R is often used to analyze datasets, especially in an academic context. Most frameworks that have evolved around R focus on different methods of data processing. The ggplot family of libraries is widely recognized as among the top tools for data visualization.

Differentiation: R is often compared to Python when it comes to data analysis and data science tasks. Its strong data visualization and manipulation libraries along with its data analysis-focused community help make it a strong contender for any data transformation, analysis, and visualization tasks.

Most Popular Github Projects:

1- Mal

Mal is a Clojure-inspired Lisp interpreter with implementations in many languages, including R. With 4,500 stars, Mal has one of the lowest star counts needed to qualify as the top repository for a programming language. It speaks to the fact that most of the open-source work on R resides outside of GitHub.

2- Prophet

Prophet is a library that is able to do rich time series analysis by adjusting forecasts to account for seasonal and non-linear trends. It was created by Facebook and forms a part of the strong corpus of data analysis frameworks and libraries that exist for the R programming language.
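For context, here is a minimal sketch of how Prophet is typically called from R. The data frame df is a placeholder for your own series; Prophet expects a date column named ds and a numeric column named y.

library(prophet)

m <- prophet(df)                                  # fit the model on columns ds and y
future <- make_future_dataframe(m, periods = 30)  # extend the timeline 30 days ahead
forecast <- predict(m, future)                    # forecast with uncertainty intervals
plot(m, forecast)                                 # plot the fit and the forecast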

3- ggplot2

ggplot2 is a data visualization library for R that is based on the Grammar of Graphics. It is often used by data analysts and data scientists to display their results in charts, heatmaps, and more.
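As a small illustration (using R’s built-in mtcars data, not an example from the original guide), a basic ggplot2 scatter plot looks like this:

library(ggplot2)

# Scatter plot of weight vs. fuel efficiency, coloured by cylinder count
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")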

4- H2o-3

H2o-3 is an open-source machine learning platform with an R interface, similar in spirit to scikit-learn for Python. It allows people using R to run deep learning and other machine learning techniques on their data, an essential capability in an era where data is far less useful without machine learning techniques.

5- Shiny

Shiny is an easy-to-use web application framework for R that allows you to build interactive websites in a few lines of code without any JavaScript. It provides intuitive UI (user interface) components based on Bootstrap. Shiny takes much of the guesswork out of building something for the web with the R programming language.
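To give a sense of the “few lines of code” claim, here is a minimal, self-contained Shiny app (a generic sketch, not taken from the guide):

library(shiny)

ui <- fluidPage(
  sliderInput("n", "Number of observations", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(input$n), main = "Random normal sample"))
}

shinyApp(ui, server)  # launches the interactive app locally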

Example Sites: There are not many websites built with R, which is used more for data analysis tasks and projects that are internal to one computer. However, you can use R Markdown to build different webpages. You might also use a web framework such as Shiny if you wanted to create simple interactive web applications around your data.

Frameworks: The awesome repository comes up again with a great list of different R packages and frameworks you can use. A few that are worth mentioning are packages such as dplyr that help you assemble data in an intuitive tabular fashion, ggplot2 to help with data visualization and plotly to help with interactive web displays of R analysis. R libraries and frameworks are some of the most robust for doing ad hoc data analysis and displaying the results in a variety of formats.
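For a flavour of how these pieces combine, here is a small illustrative dplyr-plus-ggplot2 pipeline on R’s built-in mtcars data (again, not an example from the guide itself):

library(dplyr)
library(ggplot2)

mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%         # dplyr: tabular summary
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +                                # ggplot2: visualize the summary
  labs(x = "Cylinders", y = "Mean miles per gallon")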

Learning Path: This article helps frame the resources you need to learn R, and how you should learn it, starting from syntax and going to specific packages. It makes for a great introduction to the field, even if you’re an absolute beginner. If you want to apply R to data science projects and different data analysis tasks, Datacamp will help you learn the skills and mentality you need to do just that — you’ll learn everything from machine learning practices with R to how to do proper data visualization of the results.

Resources: R-bloggers is a large community of R practitioners and writers who aim to share knowledge about R with each other. This list of 60+ resources on R can be used in case you ever get lost trying to learn R.

I hope this is helpful for you! Want more material like this? Check out my guide to how to get a programming job without a degree.


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

simmer 3.8.0

Mon, 05/07/2018 - 18:14

(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

The 3.8.0 release of simmer, the Discrete-Event Simulator for R, hit CRAN almost a week ago, and Windows binaries are already available. This version includes two highly requested new features that justify this second consecutive minor release.

Attachment of precomputed data

Until v3.7.0, the generator was the only means to attach data to trajectories, and it was primarily intended for dynamic generation of arrivals:

library(simmer)

set.seed(42)

hello_sayer <- trajectory() %>%
  log_("hello!")

simmer() %>%
  add_generator("dummy", hello_sayer, function() rexp(1, 1)) %>%
  run(until=2)
## 0.198337: dummy0: hello!
## 0.859232: dummy1: hello!
## 1.14272: dummy2: hello!
## 1.18091: dummy3: hello!
## 1.65409: dummy4: hello!
## simmer environment: anonymous | now: 2 | next: 3.11771876826972
## { Monitor: in memory }
## { Source: dummy | monitored: 1 | n_generated: 6 }

Although it may be used to attach precomputed data too, especially using the at() adaptor:

simmer() %>%
  add_generator("dummy", hello_sayer, at(seq(0, 10, 0.5))) %>%
  run(until=2)
## 0: dummy0: hello!
## 0.5: dummy1: hello!
## 1: dummy2: hello!
## 1.5: dummy3: hello!
## simmer environment: anonymous | now: 2 | next: 2
## { Monitor: in memory }
## { Source: dummy | monitored: 1 | n_generated: 21 }

Now, let’s say that we want to attach some empirical data, and our observations not only include arrival times, but also priorities and some attributes (e.g., measured service times), as in this question on StackOverflow:

myData <- data.frame(
  time = c(1:10, 1:5),
  priority = 1:3,
  duration = rnorm(15, 50, 5)) %>%
  dplyr::arrange(time)

This is indeed possible using generators, but it requires some trickery; more specifically, the clever usage of a consumer function as follows:

consume <- function(x, prio=FALSE) {
  i <- 0
  function() {
    i <<- i + 1
    if (prio) c(x[[i]], x[[i]], FALSE)
    else x[[i]]
  }
}

activityTraj <- trajectory() %>%
  seize("worker") %>%
  timeout_from_attribute("duration") %>%
  release("worker")

initialization <- trajectory() %>%
  set_prioritization(consume(myData$priority, TRUE)) %>%
  set_attribute("duration", consume(myData$duration)) %>%
  join(activityTraj)

arrivals_gen <- simmer() %>%
  add_resource("worker", 2, preemptive=TRUE) %>%
  add_generator("dummy_", initialization, at(myData$time)) %>%
  run() %>%
  get_mon_arrivals()

# check the resulting duration times
activity_time <- arrivals_gen %>%
  tidyr::separate(name, c("prefix", "n"), convert=TRUE) %>%
  dplyr::arrange(n) %>%
  dplyr::pull(activity_time)

all(activity_time == myData$duration)
## [1] TRUE

As of v3.8.0, the new data source add_dataframe greatly simplifies this process:

arrivals_df <- simmer() %>%
  add_resource("worker", 2, preemptive=TRUE) %>%
  add_dataframe("dummy_", activityTraj, myData, time="absolute") %>%
  run() %>%
  get_mon_arrivals()

identical(arrivals_gen, arrivals_df)
## [1] TRUE

On-disk monitoring

As some users noted (see 1, 2), the default in-memory monitoring capabilities can become problematic for very long simulations. To address this issue, the simmer() constructor gains a new argument, mon, to provide different types of monitors. Monitoring is still performed in-memory by default, but as of v3.8.0, it can be offloaded to disk through monitor_delim() and monitor_csv(), which produce flat delimited files.

mon <- monitor_csv()

mon
## simmer monitor: to disk (delimited files)
## { arrivals: /tmp/RtmpAlQH2g/file6933ce99281_arrivals.csv }
## { releases: /tmp/RtmpAlQH2g/file6933ce99281_releases.csv }
## { attributes: /tmp/RtmpAlQH2g/file6933ce99281_attributes.csv }
## { resources: /tmp/RtmpAlQH2g/file6933ce99281_resources.csv }

env <- simmer(mon=mon) %>%
  add_generator("dummy", hello_sayer, function() rexp(1, 1)) %>%
  run(until=2)
## 0.26309: dummy0: hello!
## 0.982183: dummy1: hello!

env
## simmer environment: anonymous | now: 2 | next: 2.29067480322535
## { Monitor: to disk (delimited files) }
## { arrivals: /tmp/RtmpAlQH2g/file6933ce99281_arrivals.csv }
## { releases: /tmp/RtmpAlQH2g/file6933ce99281_releases.csv }
## { attributes: /tmp/RtmpAlQH2g/file6933ce99281_attributes.csv }
## { resources: /tmp/RtmpAlQH2g/file6933ce99281_resources.csv }
## { Source: dummy | monitored: 1 | n_generated: 3 }

read.csv(mon$handlers["arrivals"])  # direct access
##     name start_time  end_time activity_time finished
## 1 dummy0  0.2630904 0.2630904             0        1
## 2 dummy1  0.9821828 0.9821828             0        1

get_mon_arrivals(env)               # adds the "replication" column
##     name start_time  end_time activity_time finished replication
## 1 dummy0  0.2630904 0.2630904             0        1           1
## 2 dummy1  0.9821828 0.9821828             0        1           1

See below for a comprehensive list of changes.

New features:
  • New data source add_dataframe enables the attachment of precomputed data, in the form of a data frame, to a trajectory. It can be used instead of (or along with) add_generator. The most notable advantage over the latter is that add_dataframe is able to automatically set attributes and prioritisation values per arrival based on columns of the provided data frame (#140 closing #123).
  • New set_source activity deprecates set_distribution(). It works both for generators and data sources (275a09c, as part of #140).
  • New monitoring interface allows for disk offloading. The simmer() constructor gains a new argument mon to provide different types of monitors. By default, monitoring is performed in-memory, as usual. Additionally, monitoring can be offloaded to disk through monitor_delim and monitor_csv, which produce flat delimited files. But more importantly, the C++ interface has been refactorised to enable the development of new monitoring backends (#146 closing #119).
Minor changes and fixes:
  • Some documentation improvements (1e14ed7, 194ed05).
  • New default until=Inf for the run method (3e6aae9, as part of #140).
  • branch and clone now accept lists of trajectories, in the same way as join, so that there is no need to use do.call (#142); see the sketch after this list.
  • The argument continue (present in seize and branch) is recycled if only one value is provided but several sub-trajectories are defined (#143).
  • Fix process reset: sources are reset in strict order of creation (e7d909b).
  • Fix infinite timeouts (#144).
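As an illustration of the new list-based interface mentioned above (a hedged sketch with made-up sub-trajectories, not an example from the release notes):

library(simmer)

t1 <- trajectory() %>% log_("took branch 1")
t2 <- trajectory() %>% log_("took branch 2")

# Before v3.8.0 the sub-trajectories had to be spliced in with do.call();
# now a plain list can be passed directly (and continue is recycled).
trj <- trajectory() %>%
  branch(function() sample(1:2, 1), continue = TRUE, list(t1, t2))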

Article originally published in Enchufa2.es: simmer 3.8.0.


To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2.

Tidying messy Excel data (Introduction)

Mon, 05/07/2018 - 15:42

(This article was first published on R – Stat Bandit, and kindly contributed to R-bloggers)

[Re-posted from Abhijit’s blog]

Personal expressiveness, or how data is stored in a spreadsheet

When you get data from a broad research community, the variability in how that data is formatted and stored is truly astonishing. Of course there are the standardized formats that are output from machines, like Next Generation Sequencing and other automated systems. That is a saving grace!

But for smaller data, or data collected in the lab, the possibilities are truly endless! You can get every possible color-coding of rows, columns and cells, merged cells, hidden columns and rows, inopportune blank spaces that convert numbers to characters, and notes where there should be numbers. That’s just from the more organized spreadsheets. Then you get multiple tables in the same spreadsheet, ad hoc computations in some cells, cells copied by hand (with errors), and sundry other variations on this theme. In other words, it can be a nightmare scenario for the analyst. To wit,

I feel like some sort of spreadsheet naturalist, where new specimens spotted in the wild must be logged and compared to known species

— Jenny Bryan (@JennyBryan) April 25, 2018

In thinking about this post, I went back and looked at the documentation of the readxl package, which has made reading Excel files into R a lot easier than before. This package is quite powerful, so as long as data are in a relatively clean tabular form, this tool can pull it into R; see these vignettes to get a real sense of how to process decently behaved Excel files with R.
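For reference, a minimal readxl call on a hypothetical file looks like this (the file name, sheet and skip values are placeholders, not from the original post):

library(readxl)

# Read the second sheet of a workbook, skipping two header rows
dat <- read_excel("lab_results.xlsx", sheet = 2, skip = 2)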

On a side note, how many ways of importing Excel files into R can you name, or have you used?

The readxl package has been my friend for a while, but then I received some well-intentioned spreadsheets that even readxl wouldn’t touch, despite much coaxing. Now, I had two options: ask the collaborator to re-format the spreadsheet (manually, of course), which in my experience is a disaster waiting to happen; or just take the spreadsheet as is and figure out how to import it into R. I almost always take the latter route, and tidyxl is my bosom friend in this effort. In the next part of this series, I’ll describe my experiences with tidyxl and why, every time I use it, I say a small blessing for the team that created it.
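As a teaser for the next part: tidyxl reads every cell of a workbook into one row of a tidy data frame, which you can then filter and reshape. A hedged sketch on a hypothetical file:

library(tidyxl)

cells <- xlsx_cells("well_intentioned.xlsx")  # one row per cell
head(cells[, c("sheet", "row", "col", "data_type", "character", "numeric")])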

Resources
  1. tidyeval meets Spreadsheet Hell
  2. Carpentries lesson on spreadsheet practices
  3. readxl workflows

To leave a comment for the author, please follow the link and comment on their blog: R – Stat Bandit.

RcppGSL 0.3.4

Mon, 05/07/2018 - 14:34

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A minor update, version 0.3.4, of RcppGSL is now on CRAN. It contains an improved Windows build system (thanks, Jeroen!) and updates the C++ headers by removing dynamic exception specifications, which C++11 frowns upon and which the compilers let us know about in no uncertain terms. Builds using RcppGSL will now be quieter. And as always, there is an extra treat for Solaris.

The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package.
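As a rough illustration of that interface (a hedged sketch, assuming GSL and RcppGSL are installed; the wrapper function besselJ0 below is made up for this example), you can call into the GSL special-function library from R via Rcpp attributes:

library(Rcpp)

sourceCpp(code = '
  // [[Rcpp::depends(RcppGSL)]]
  #include <RcppGSL.h>
  #include <gsl/gsl_sf_bessel.h>

  // [[Rcpp::export]]
  Rcpp::NumericVector besselJ0(Rcpp::NumericVector x) {
    Rcpp::NumericVector out(x.size());
    for (int i = 0; i < x.size(); ++i)
      out[i] = gsl_sf_bessel_J0(x[i]);  // call the GNU GSL Bessel function
    return out;
  }
')

besselJ0(c(0, 1, 2))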

No user-facing new code or features were added. The NEWS file entries follow below:

Changes in version 0.3.4 (2018-05-06)
  • Windows builds were updated (Jeroen Ooms in #16).

  • Remove dynamic exception specifications which are deprecated with C++11 or later (Dirk in #17).

  • Accommodate Solaris by being more explicit about sqrt.

Courtesy of CRANberries, a summary of changes to the most recent release is available.

More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Machine Learning Explained: Vectorization and matrix operations

Mon, 05/07/2018 - 14:29

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

Today in Machine Learning Explained, we will tackle a central (yet often overlooked) aspect of Machine Learning: vectorization. Let’s say you want to compute the sum of the values of an array. The naive way to do so is to loop over the elements and to sequentially sum them. This naive way is slow and tends to get even slower with large amounts of data and large data structures.
With vectorization, these operations can be seen as matrix operations, which are often more efficient than standard loops. Vectorized versions of an algorithm are typically several orders of magnitude faster and are easier to understand from a mathematical perspective.

A basic example of vectorization

Preliminary example – Python

 
Let’s compare the naive way and the vectorized way of computing the sum of the elements of an array. To do so, we will create a large (100,000 elements) NumPy array and compute the sum of its elements 1,000 times with each algorithm. The overall computation times will then be compared.

import numpy as np
import time

W = np.random.normal(0, 1, 100000)
n_rep = 1000

The naive way to compute the sum iterates over all the elements of the array and stores the sum:

start_time = time.time()
for i in range(n_rep):
    loop_res = 0
    for elt in W:
        loop_res += elt
time_loop = time.time() - start_time

If $W$ is our vector of interest, the sum of its elements can be expressed as the dot product $\mathbf{1}^T W$:

start_time = time.time()
for i in range(n_rep):
    one_dot = np.ones(W.shape)
    vect_res = one_dot.T.dot(W)
time_vect = time.time() - start_time

Finally, we can check that both methods yield the same result and compare their runtimes. The vectorized version runs approximately 100 to 200 times faster than the naive loop.

print(np.abs(vect_res - loop_res) < 10e-10)
print(time_loop / time_vect)

Note: the same result can be obtained with np.sum. The NumPy version has a very similar runtime to our vectorized version. Since NumPy is highly optimized, this shows that our vectorized sum is reasonably fast.

start_time = time.time()
for i in range(n_rep):
    vect_res = np.sum(W)
time_vect_np = time.time() - start_time

Preliminary example – R

The previous experiments can be replicated in R:

## Creation of the vector
W = matrix(rnorm(100000))
n_rep = 10000

# Naive way
library(tictoc)
tic('Naive computation')
for (rep in 1:n_rep) {
  res_loop = 0
  for (w_i in W) res_loop = w_i + res_loop
}
toc()

# Vectorized way
tic('Vectorized computation')
for (rep in 1:n_rep) {
  ones = rep(1, length(W))
  res_vect = crossprod(ones, W)
}
toc()

# Built-in way
tic('built-in computation')
for (rep in 1:n_rep) {
  res_built_in = sum(W)
}
toc()

In R, the vectorized version is only an order of magnitude faster than the naive way. The built-in way achieves the best performance and is an order of magnitude faster than our vectorized way.

Preliminary example – Results

   

   

Vectorization divides the computation time by several orders of magnitude, and the gap relative to loops increases with the size of the data. Hence, if you want to deal with large amounts of data, rewriting the algorithm as matrix operations may lead to important performance gains.

Why vectorization is (often) faster
  • R and Python are interpreted languages, which means that your instructions are analyzed and interpreted at each execution. Since they are not statically typed, loops have to assess the types of the operands at each iteration, which leads to a computational overhead.
  • R and Python linear algebra relies on optimized back-ends for matrix operations and linear algebra. These back-ends are written in C/C++ and can process loops efficiently. Furthermore, while loops are sequential, these back-ends can run operations in parallel, which improves the computation speed on modern CPUs.

Note 1: Though vectorization is often faster, it requires allocating the memory of the array. If your amount of RAM is limited or if the amount of data is large, loops may be required.
Note 2: When you deal with large arrays or computationally intensive algebra (like inversion, diagonalization, or eigenvalue computations), computations on a GPU can be orders of magnitude faster than on a CPU. To write efficient GPU code, the code needs to be composed of matrix operations. Hence, having vectorized code makes it easier to translate CPU code to a GPU (or to tensor-based frameworks).

A small zoo of matrix operations

The goal of this part is to show some basic matrix operations/vectorization and to end on a more complex example to show the thought process which underlies vectorization.

Column-wise matrix sum

The column-wise sum (and mean) can be expressed as a matrix product. Let $W$ be our $n \times p$ matrix of interest. Using the matrix multiplication formula, we have $(\mathbf{1}_n^T W)_j = \sum_{i=1}^{n} W_{ij}$. Hence, the column-wise sum of $W$ is $\mathbf{1}_n^T W$.

Python code:

def colWiseSum(W):
    ones = np.ones((W.shape[0], 1))
    return ones.T.dot(W)

R code:

colWiseSum = function(W) {
  ones = rep(1, nrow(W))
  t(W) %*% ones
}

Row-wise matrix sum

Similarly, the row-wise sum is $W \mathbf{1}_p$.

Python code:

def rowWiseSum(W):
    ones = np.ones((W.shape[1], 1))
    return W.dot(ones)

R code:

rowWiseSum = function(W) {
  ones = rep(1, ncol(W))
  W %*% ones
}

Matrix sum

The sum of all the elements of a matrix is the sum of the sums of its rows. Using the previous expressions, the sum of all the terms of $W$ is $\mathbf{1}_n^T W \mathbf{1}_p$.

Python code:

def matSum(W):
    rhs_ones = np.ones((W.shape[1], 1))
    lhs_ones = np.ones((W.shape[0], 1))
    return lhs_ones.T.dot(W).dot(rhs_ones)

R code:

matSum = function(W) {
  rhs_ones = rep(1, ncol(W))
  lhs_ones = rep(1, nrow(W))
  t(lhs_ones) %*% W %*% rhs_ones
}

Similarity matrix (Gram matrix)

Let’s say we have a set of words and, for each of these words, we want to find the most similar words from a dictionary. We assume that the words have been projected into a space of dimension $p$ (using word2vec). Let $X$ (our set of words) and $Y$ (our dictionary) be two matrices in $\mathbb{R}^{n_1 \times p}$ and $\mathbb{R}^{n_2 \times p}$ respectively. To compute the similarity of all the observations of $X$ and $Y$, we simply need to compute $X Y^T$.

Python code:

def gramMatrix(X, Y):
    return X.dot(Y.T)

R code:

gramMatrix = function(X, Y) {
  X %*% t(Y)
}

L2 distance

We want to compute the pair-wise distance between two sets of vectors. Let $X$ and $Y$ be two matrices in $\mathbb{R}^{n_1 \times p}$ and $\mathbb{R}^{n_2 \times p}$. For each vector of $X$ we need to compute the distance to all the vectors of $Y$. Hence, the output matrix should be of size $n_1 \times n_2$.

If $x$ and $y$ are two vectors, their squared distance is:

$\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\, x^T y$

To compute all pairwise distances, some work is required on the last equality. All the matrices involved should be of size $n_1 n_2 \times 1$; the resulting vector of distances is then of size $n_1 n_2$, which can be reshaped into a matrix of size $n_1 \times n_2$.

The first two terms $\|x_i\|^2$ and $\|y_j\|^2$ need to be computed for each $i$ and $j$. $X \odot X$ is the element-wise multiplication of $X$ with itself (its elements are $X_{ij}^2$). Hence, the $i$-th element of $(X \odot X)\mathbf{1}_p$ is the squared sum of the coordinates of the $i$-th observation, $\|x_i\|^2$.

However, this is a vector and its shape is $n_1 \times 1$. By replicating each of its elements $n_2$ times, we will get a vector of size $n_1 n_2 \times 1$. The replication can be done easily if we consider the right matrix $\Delta$.

Let $\mathbf{1}_{n_2}$ be a vector of ones of size $n_2$. Let $\Delta_{n_1 n_2}$ be the matrix of size $n_1 n_2 \times n_1$ with $n_1$ repetitions of $\mathbf{1}_{n_2}$ on the “diagonal”:

$\Delta_{n_1 n_2} = I_{n_1} \otimes \mathbf{1}_{n_2}$

Then, our final vector is $\Delta_{n_1 n_2} (X \odot X)\mathbf{1}_p$ (the same expression holds for $Y$ with $\Delta_{n_2 n_1}$). We denote by $R(\cdot)$ a reshape operator (used to transform the previous vector into a matrix). Combined with the previous part on the similarity matrix, we get the following expression for the pairwise distance:

$D = R\big(\Delta_{n_1 n_2}(X \odot X)\mathbf{1}_p\big) + R\big(\Delta_{n_2 n_1}(Y \odot Y)\mathbf{1}_p\big)^T - 2\, X Y^T$

The previous expression can seem complex, but this will help us a lot to code the pairwise distance. We only have to do the translation from maths to Numpy or R.

Python code:

def L2dist(X, Y):
    n_1 = X.shape[0]
    n_2 = Y.shape[0]
    p = X.shape[1]
    ones = np.ones((p, 1))
    x_sq = (X**2).dot(ones)[:, 0]
    y_sq = (Y**2).dot(ones)[:, 0]
    delta_n1_n2 = np.repeat(np.eye(n_1), n_2, axis=0)
    delta_n2_n1 = np.repeat(np.eye(n_2), n_1, axis=0)
    return np.reshape(delta_n1_n2.dot(x_sq), (n_1, n_2)) + np.reshape(delta_n2_n1.dot(y_sq), (n_2, n_1)).T - 2*gramMatrix(X, Y)

R Code:

L2dist = function(X, Y) {
  n_1 = dim(X)[1]
  n_2 = dim(Y)[1]
  p = dim(X)[2]
  ones = rep(1, p)
  x_sq = X**2 %*% ones
  x_sq = t(matrix(diag(n_1) %x% rep(1, n_2) %*% x_sq, n_2, n_1))
  y_sq = Y**2 %*% ones
  y_sq = matrix(diag(n_2) %x% rep(1, n_1) %*% y_sq, n_1, n_2)
  x_sq + y_sq - 2*gramMatrix(X, Y)
}

L2 distance (improved)

Actually, the previous L2dist is not completely optimized: it requires a lot of memory, since $\Delta_{n_1 n_2}$ has $n_1^2 n_2$ cells and is mostly empty. Using NumPy’s built-in repeat function, we can circumvent this multiplication by directly repeating the vector (which reduces the memory footprint by a factor of $n_1$):

Python code:

def L2dist_improved(X, Y):
    n_1 = X.shape[0]
    n_2 = Y.shape[0]
    p = X.shape[1]
    ones = np.ones((p, 1))
    x_sq = (X**2).dot(ones)[:, 0]
    y_sq = (Y**2).dot(ones)[:, 0]
    ## Replace the multiplication by a simple repeat
    X_rpt = np.repeat(x_sq, n_2).reshape((n_1, n_2))
    Y_rpt = np.repeat(y_sq, n_1).reshape((n_2, n_1)).T
    return X_rpt + Y_rpt - 2*gramMatrix(X, Y)

R code:

L2dist_improved = function(X, Y) {
  n_1 = dim(X)[1]
  n_2 = dim(Y)[1]
  p = dim(X)[2]
  ones = rep(1, p)
  x_sq = X**2 %*% ones
  x_sq = t(matrix(rep(x_sq, each = n_2), n_2, n_1))
  y_sq = Y**2 %*% ones
  y_sq = matrix(rep(y_sq, each = n_1), n_1, n_2)
  x_sq + y_sq - 2*gramMatrix(X, Y)
}

L2 distance – benchmark

To show the interest of our previous work, let’s compare the computation speed of the vectorized L2 distance, the naive implementation, and the scikit-learn implementation. The experiments are run on datasets of different sizes, with 100 repetitions each.

Computation time of the L2 distance

The vectorized implementation is 2 to 3 orders of magnitude faster than the naive implementation and on par with the scikit implementation.

The post Machine Learning Explained: Vectorization and matrix operations appeared first on Enhance Data Science.


To leave a comment for the author, please follow the link and comment on their blog: Enhance Data Science.

Coordinate systems in ggplot2: easily overlooked and rather underrated

Mon, 05/07/2018 - 08:05

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

All plots have coordinate systems. Perhaps because they are such an integral element of plots, they are easily overlooked. However, in ggplot2, there are several very useful options to customize the coordinate systems of plots, which we will not overlook but explore in this blog post.

Since it is spring, we will use a random subset of the famous iris data set. When we plot the petal length against the petal width, map species onto color, and play around a little with the shape, color and size of the aesthetics, we obtain this vernal plot:

# Base plot
plot_base <- ggplot(data = df_iris) +
  geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species),
             size = 3, alpha = 0.9, shape = 8) +
  geom_point(aes(x = Petal.Length, y = Petal.Width), color = "yellow", size = 0.4) +
  scale_color_manual(values = c("#693FE9", "#A089F8", "#0000FF")) +
  theme_minimal()

Cartesian coordinate system

Zooming in and out

The coordinate system can be manipulated by adding one of ggplot’s different coordinate systems. When you are imagining a coordinate system, you are most likely thinking of a Cartesian one. The Cartesian coordinate system combines the x and y dimensions orthogonally and is ggplot’s default (coord_cartesian).

There are also several variations of the familiar Cartesian coordinate system in ggplot, namely coord_fixed, coord_flip and coord_trans. For all of them, the displayed section of the data can be specified by defining the maximal value depicted on the x (xlim =) and y (ylim =) axes. This allows one to “zoom in” or “zoom out” of a plot. It is a great advantage that all manipulations of the coordinate system only alter the depiction of the data, not the data itself.

# Zooming in with xlim/ylim
plot_base +
  coord_cartesian(xlim = 5, ylim = 2) +
  ggtitle("coord_cartesian with xlim = 5 and ylim = 2")

Specifying the “aspect ratio” of the axes

Via coord_fixed one can specify the exact ratio of the length of a y unit relative to the length of an x unit within the final visualization.

# Setting the "aspect ratio" of y vs. x units plot_base + coord_fixed(ratio = 1/2) + ggtitle("coord_fixed with ratio = 1/2")

Transforming the scales of the axes

This helps to emphasize the exact insight one wants to communicate. Another way to do so is coord_trans, which allows several transformations of the x and y variable (see table below, taken from Wickham 2016 page 97). Let me stress this again, very conveniently such transformations only pertain to the depicted – not the actual – scale of the data. This also is the reason why, regardless of the conducted transformation, the original values are used as axis labels.

The available transformers are asn, exp, identity, log, log10, log2, logit, pow10, probit, recip, reverse and sqrt (the original post reproduces the table from Wickham 2016, p. 97, which also lists each transformer's function and inverse).

# Transforming the axes
plot_base +
  coord_trans(x = "log", y = "log2") +
  ggtitle("coord_trans with x = \"log\" and y = \"log2\"")

Swapping the axes

The last of the Cartesian options, coord_flip, swaps the x and y axes. For example, this option can be useful when we intend to change the orientation of univariate plots such as histograms, or of plot types – like box plots – that visualize the distribution of a continuous variable over the categories of another variable. Nonetheless, coord_flip also works with all other plots. This multiplies the overall possibilities for designing plots, especially since all Cartesian coordinate systems can be combined.

# Swapping axes
# base plot #2
p1 <- ggplot(data = df_iris) +
  geom_bar(aes(x = Species, fill = Species), alpha = 0.6) +
  scale_fill_manual(values = c("#693FE9", "#A089F8", "#4f5fb7")) +
  theme_minimal()

# base plot & coord_flip()
p2 <- ggplot(data = df_iris) +
  geom_bar(aes(x = Species, fill = Species), alpha = 0.6) +
  scale_fill_manual(values = c("#693FE9", "#A089F8", "#4f5fb7")) +
  theme_minimal() +
  coord_flip()

gridExtra::grid.arrange(p1, p2, top = "Bar plot without and with coord_flip")

Polar coordinate system

The customization of Cartesian coordinate systems allows for the fine tuning of plots. However, coord_polar, the final coordinate system discussed here, changes the whole character of a plot. By using coord_polar, bar geoms are transformed to pie charts or “bullseye” plots, while line geoms are transformed to radar charts. This is done by mapping x and y to the angle and radius of the resulting plot. By default, the x variable is mapped to the angle, but by setting the theta argument in coord_polar to “y” this can be changed.


While such plots might shine with respect to novelty and looks, their perceptual properties are intricate, and their correct interpretation may be quite hard and rather unintuitive.

# Base plot 2 (long format, x = 1 is summed up to generate count)
plot_base_2 <- df_iris %>%
  dplyr::mutate(x = 1) %>%
  ggplot(.) +
  geom_bar(aes(x = x, fill = Species), alpha = 0.6) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  scale_fill_manual(values = c("#693FE9", "#A089F8", "#4f5fb7")) +
  theme_minimal() +
  ggtitle("base plot")

# Bullseye plot
# geom_bar & coord_polar(theta = "x")
p2 <- plot_base_2 +
  coord_polar(theta = "x") +
  ggtitle("theta = \"x\"")

# Pie chart
# geom_bar & coord_polar(theta = "y")
p3 <- plot_base_2 +
  coord_polar(theta = "y") +
  ggtitle("theta = \"y\"")

gridExtra::grid.arrange(p2, p3, plot_base_2, top = "geom_bar & coord_polar", ncol = 2)

# Base plot 3 (long format, mean width/length of sepals/petals calculated)
plot_base_3 <- iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarise(Petal.Length = mean(Petal.Length),
                   Sepal.Length = mean(Sepal.Length),
                   Sepal.Width = mean(Sepal.Width),
                   Petal.Width = mean(Petal.Width)) %>%
  reshape2::melt() %>%
  ggplot() +
  geom_polygon(aes(group = Species, color = Species, y = value, x = variable), fill = NA) +
  scale_color_manual(values = c("#693FE9", "#A089F8", "#4f5fb7")) +
  theme_minimal() +
  ggtitle("base plot")

# Radar plot
# geom_polygon & coord_polar
p2 <- plot_base_3 +
  theme_minimal() +
  coord_polar() +
  ggtitle("coord_polar")

gridExtra::grid.arrange(plot_base_3, p2, top = "geom_polygon & coord_polar", ncol = 2)

 

References
  • Wickham, H. (2016). ggplot2: elegant graphics for data analysis. Springer.
About the author

Lea Waniek

Lea is a member of the Data Science team and also provides support in the area of statistics.

The post Coordinate systems in ggplot2: easily overlooked and rather underrated first appeared on STATWORX.


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX.

RcppMsgPack 0.2.2

Sun, 05/06/2018 - 23:18

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance release of RcppMsgPack got onto CRAN this afternoon. It contains a single static_cast fix to address a warning which g++-8.1 shows whereas older compilers remained silent—and CRAN asked us to address this.

MessagePack itself is an efficient binary serialization format. It lets you exchange data among multiple languages like JSON. But it is faster and smaller. Small integers are encoded into a single byte, and typical short strings require only one extra byte in addition to the strings themselves. RcppMsgPack brings both the C++ headers of MessagePack as well as clever code (in both R and C++) Travers wrote to access MsgPack-encoded objects directly from R.
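As a rough sketch of that R-level access (hedged: this assumes the msgpack_pack() and msgpack_unpack() helpers exported by the package; check the RcppMsgPack documentation for the exact interface):

library(RcppMsgPack)

raw_bytes <- msgpack_pack(1:10)  # serialize an R object to a raw MsgPack buffer
msgpack_unpack(raw_bytes)        # decode the buffer back into R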

Changes in version 0.2.2 (2018-05-06)
  • Apply a static_cast from upstream to suppress a warning from g++-8.1 as requested by CRAN.

Courtesy of CRANberries, there is also a diffstat report for this release.

More information is on the RcppMsgPack page. Issues and bug reports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Analyzing Spotify Song Metrics to Visualize Popular Songs

Sun, 05/06/2018 - 21:04

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

How would you go about describing your favorite new song? What makes it catchy to you? According to the people at Spotify, characteristics like the energy or mood of a song can actually be quantified, and they have many algorithms to describe music in an amazing number of ways.

Now, why would you care about what the people at Spotify have to say about a song? With 159 million monthly active users, determining the key factors that affect popularity can actually be a powerful tool for record label producers looking for new artists to sign, or for aspiring data scientists who want to show off some nice visualizations and determine what to put on the ultimate summer playlist. Popularity is well defined in their API notes as a function of the number of times a particular song was streamed, as well as how recent those streams are.

About the App

This app visualizes several key factors and investigates their correlation to popularity visually across a wide spectrum of music.

The first two plots offer a degree of interactivity, allowing the user to visualize the difference amongst the genres. The box plot helps to see a more quantitative take on the separation across a wide musical spectrum. The density plot helps more with visualizing across the 0 – 100 scale of popularity, to see if there are any abnormalities with popularity distribution (for instance, classical music seems to have a pretty well defined bi-modal distribution of popularity!)

Besides looking at each genre as a whole, I wanted to visualize a subset of each genre on a scatter plot to identify clusters and to look at how other variables like energy or danceability change along with popularity. However, I ran into many issues in trying to separate the genres and effectively display the information. I decided on a 3D scatter plot, adding another user-input variable to look at two separate correlations, with a very interactive plot that lets the user zoom in and rotate the axes to better display the information to their preference. I have also included a small table showing the Pearson correlation coefficient of several of the Spotify metrics with popularity.

Finally, I took the 50th percentile (in terms of popularity) from each genre in my dataset and displayed them in a datatable in terms of ‘threshold values’ for each genre. For instance, for a successfully popular metal song, the relative would need to be quite high, as the 50th percentile has a value of 0.902. Also interestingly enough, danceability seems to be a much more crucial factor for pop as opposed to indie pop.

About the Dataset

The dataset was obtained by using the Spotify Web API in combination with the Python 3 library Spotipy. For each genre I chose, I queried 3,000 songs for the Spotify audio analysis and features. The API has a 50-song limit per request, so I had to create a loop that queried the API repeatedly until each 3,000-song chunk was collected and stored in a relevant pandas dataframe. Afterwards, I wrote the data into a CSV to do the majority of the analysis within R.

The jupyter notebook used to query the server as well as the Shiny application can be found at this GitHub repo.

Future Work

I would love to continue this analysis of popularity metrics with clustering/regression analysis at a further date, or to be able to develop a predictive model and feed information into it via Spotipy to determine up-and-coming popular artists.

For any comments or questions, please reach me via e-mail at joshvichare@gmail.com.


To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.

Statistics Sunday: Tokenizing Text

Sun, 05/06/2018 - 18:12

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


I recently started working my way through Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. There are many interesting ways text analysis can be used, not only for research interests but for marketing and business insights. Today, I’d like to introduce one of the basic concepts necessary for conducting text analysis: tokens.

In text analysis, a token is any kind of meaningful text unit that can be analyzed. Frequently, a token is a word and the process of tokenizing splits up the text into individual words and counts up how many times each word appears in the text. But a token could also be a phrase (such as each two-word combination present in a text, which is called a bi-gram), a sentence, a paragraph, even a whole chapter. Obviously, the size of the token you choose impacts what kind of analysis you can do. Generally, people choose smaller tokens, like words.

Let’s use R to download the text of a classic book (which I did previously in this post, but today, I’ll do in an easier way) and tokenize it by word.

Any text available in the Project Gutenberg repository can be downloaded, with header and footer information stripped out, with the gutenbergr package.

install.packages("gutenbergr")
library(gutenbergr)

The package comes with a dataset, called gutenberg_metadata, that contains a list, by ID, of all text available. Let’s use The War of the Worlds by H.G. Wells as our target book. We can find our target book like this:

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ggplot2 2.2.1 purrr 0.2.4
## tibble 1.4.2 dplyr 0.7.4
## tidyr 0.8.0 stringr 1.3.0
## readr 1.1.1 forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## dplyr::filter() masks stats::filter()
## dplyr::lag() masks stats::lag()
gutenberg_metadata %>%
filter(title == "The War of the Worlds")
## # A tibble: 3 x 8
## gutenberg_id title author gutenberg_autho… language gutenberg_books…
##
## 1 36 The Wa… Wells, … 30 en Movie Books/Sci…
## 2 8976 The Wa… Wells, … 30 en Movie Books/Sci…
## 3 26291 The Wa… Wells, … 30 en
## # ... with 2 more variables: rights , has_text

The ID for The War of the Worlds is 36. Now I can use that information to download the text of the book into a data frame, using the gutenbergr function, gutenberg_download.

warofworlds<-gutenberg_download(36)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org

Now I have a dataset with two variables: one containing the Project Gutenberg ID for the text (which is helpful if you create a dataset with multiple texts, perhaps all by the same author or within the same genre) and one containing a line of text. To tokenize our dataset, we need the R package, tidytext.

install.packages("tidytext")
library(tidytext)

We can tokenize with the function, unnest_tokens: first we tell it to do so by word then we tell it which column to look in to find the tokens.

tidywow<-warofworlds %>%
unnest_tokens(word, text)

Now we have a dataset with each word from the book, one after the other. There are duplicates in here, because I haven’t told R to count up the words. Before I do that, I probably want to tell R to ignore extremely common words, like “the,” “and,” “to,” and so on. In text analysis, these are called stop words, and tidytext comes with a dataset called stop_words that can be used to drop stop words from your text data.

tidywow<-tidywow %>%
anti_join(stop_words)
## Joining, by = "word"

Last, we have R count up the words.

wowtokens<-tidywow %>%
count(word, sort=TRUE)
head(wowtokens)
## # A tibble: 6 x 2
## word n
##
## 1 martians 163
## 2 people 159
## 3 black 122
## 4 time 121
## 5 road 104
## 6 night 102

After removing stop words, of which there may be hundreds or thousands in any text, the most common words are: Martians, people, black, time, and road.
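As a small extension (not in the original post), the same unnest_tokens() call can produce the bi-grams mentioned earlier by switching the token type:

wowbigrams<-warofworlds %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
count(bigram, sort = TRUE)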


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.

Send tweets from R: A very short walkthrough

Sun, 05/06/2018 - 12:28

(This article was first published on Rcrastinate, and kindly contributed to R-bloggers)

There are a few reasons why you might want to send tweets from R. You might want to write a Twitter bot or – as in my case – you want to send yourself a tweet when a very long computation finishes.

So, here I will run you through all the steps you have to take using
– Twitter’s API and
– the twitteR package written by Jeff Gentry

The setup to send myself tweets is the following: I have my main twitter account and an additional account I am only using to tweet from R. I am following the additional account with my main account. On my phone’s Twitter app, my main account is logged in and notifications are activated only for the additional account. So, whenever I tweet something with the additional account, the new tweet pops up on my phone.

Let’s get started.

Step 1: Create an additional twitter account and log in with this one.
Step 2: Go to apps.twitter.com (Application Management) and create a new app. After completing the process your Application Management main screen should look something like this.

Step 3: Click on the name of your app and then select “manage keys and access tokens”. This should look like this (without the red boxes, of course).

Step 4: Copy your Consumer Key, Consumer Secret, Access Token and Access Token Secret. You will need them for authenticating from your R script.

Step 5: Fire up R and install the twitteR package.

install.packages("twitteR")
library(twitteR)
setup_twitter_oauth(consumer_key = "",
                    access_token = "",
                    consumer_secret = "",
                    access_secret = "")

Step 6: Tweet something!

tw <- updateStatus("My first tweet from R!")

Basically, that’s it. But you can do more cool stuff. If, for example, you only want to read the notification about the tweet on your phone, you can delete the latest tweet just as fast.
deleteStatus(tw)
Aaaand, it’s gone!
You can also include a plot you just did in R! In this case, do not delete the tweet, because the media file will be deleted too and you cannot view it. Let’s do this.
t1 <- Sys.time()
gauss <- rnorm(n = 10000)
jpeg("temp.jpg")
plot(density(gauss))
dev.off()
t2 <- difftime(Sys.time(), t1, units = “secs”)

tw <- updateStatus(paste("Finished in", round(t2, 2), "seconds"), mediaPath = "temp.jpg")
And just a few seconds later, this pops up on my phone.

 
I’m sure you’ll come up with a lot of more cool stuff you can do with that. Let’s hear them in the comments. Also, feel free to report any problems and I will try to help.


To leave a comment for the author, please follow the link and comment on their blog: Rcrastinate.

Deep Learning from first principles in Python, R and Octave – Part 8

Sun, 05/06/2018 - 05:28

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

1. Introduction

You don’t understand anything until you learn it more than one way. Marvin Minsky
No computer has ever been designed that is ever aware of what it’s doing; but most of the time, we aren’t either. Marvin Minsky
A wealth of information creates a poverty of attention. Herbert Simon

This post, Deep Learning from first Principles in Python, R and Octave-Part8, is my final post in my Deep Learning from first principles series. In this post, I discuss and implement a key functionality needed while building Deep Learning networks viz. ‘Gradient Checking’. Gradient Checking is an important method to check the correctness of your implementation, specifically the forward propagation and the backward propagation cycles of an implementation. In addition I also discuss some tips for tuning hyper-parameters of a Deep Learning network based on my experience.

My posts in this ‘Deep Learning Series’ so far were
1. Deep Learning from first principles in Python, R and Octave – Part 1 In part 1, I implement logistic regression as a neural network in vectorized Python, R and Octave
2. Deep Learning from first principles in Python, R and Octave – Part 2 In the second part I implement a simple Neural network with just 1 hidden layer and a sigmoid activation output function
3. Deep Learning from first principles in Python, R and Octave – Part 3 The 3rd part implemented a multi-layer Deep Learning Network with sigmoid activation output in vectorized Python, R and Octave
4. Deep Learning from first principles in Python, R and Octave – Part 4 The 4th part deals with multi-class classification. Specifically, I derive the Jacobian of the Softmax function and enhance my L-Layer DL network to include Softmax output function in addition to Sigmoid activation
5. Deep Learning from first principles in Python, R and Octave – Part 5 This post uses the Softmax classifier implemented to classify MNIST digits using a L-layer Deep Learning network
6. Deep Learning from first principles in Python, R and Octave – Part 6 The 6th part adds more bells and whistles to my L-Layer DL network, by including different initialization types namely He and Xavier. Besides L2 Regularization and random dropout is added.
7. Deep Learning from first principles in Python, R and Octave – Part 7 The 7th part deals with Stochastic Gradient Descent Optimization methods including momentum, RMSProp and Adam
8. Deep Learning from first principles in Python, R and Octave – Part 8 – This post implements a critical function for ensuring the correctness of a L-Layer Deep Learning network implementation using Gradient Checking

By the way, also take a look at my compact, minimal book “Practical Machine Learning with R and Python – Machine Learning in stereo”, available on Amazon in paperback ($9.99) and Kindle ($6.99) versions. This book is ideal as a quick reference for the various ML functions and associated measurements in both R and Python, which are essential to delve deep into Deep Learning.

Gradient Checking is based on the following approach. One iteration of Gradient Descent computes and updates the parameters by doing
$\theta := \theta - \alpha \frac{\partial J(\theta)}{\partial \theta}$.
To minimize the cost we will need to minimize $J(\theta)$.
Let $g(\theta)$ be a function that computes the derivative $\frac{d}{d\theta}J(\theta)$. Gradient Checking allows us to numerically evaluate the implementation of this function and verify its correctness.
We know the derivative of a function is given by
$\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \to 0} \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$
Note: the above is the 2-sided derivative. The 1-sided derivative is given by
$\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \to 0} \frac{J(\theta + \epsilon) - J(\theta)}{\epsilon}$
Gradient Checking is based on the 2-sided derivative because its error is of the order $O(\epsilon^{2})$, as opposed to $O(\epsilon)$ for the 1-sided derivative.
Hence Gradient Check uses the 2-sided derivative as follows:
$g(\theta) = \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$

In Gradient Check the following is done
A) Run one normal cycle of your implementation by doing the following
a) Compute the output activation by running 1 cycle of forward propagation
b) Compute the cost using the output activation
c) Compute the gradients using backpropation (grad)

B) Perform gradient check steps as below
a) Set ε = 10^-7. Flatten all ‘weights’ and ‘bias’ matrices and vectors to a single column vector θ.
b) Initialize θ+ by bumping θ up by adding ε (θ+ = θ + ε)
c) Perform forward propagation with θ+
d) Compute the cost with θ+, i.e. J(θ+)
e) Initialize θ- by bumping θ down by subtracting ε (θ- = θ - ε)
f) Perform forward propagation with θ-
g) Compute the cost with θ-, i.e. J(θ-)
h) Compute the numerical gradient ‘gradapprox’ as (J(θ+) - J(θ-))/(2ε), using the 2-sided derivative
i) Compute the L2 norm, or Euclidean distance, between ‘grad’ and ‘gradapprox’, normalized by the sum of their norms. If the
difference is of the order of 10^-5 or 10^-7 the implementation is correct. In the Deep Learning Specialization Prof Andrew Ng mentions that if the difference is of the order of 10^-7 then the implementation is correct. A difference of the order of 10^-5 is also ok. Anything more than that is a cause of worry and you should look at your code more closely. To see more details click Gradient checking and advanced optimization
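The whole recipe above can be sketched generically in a few lines of R on a toy cost function. This is only an illustration: the quadratic J, its analytic gradient and the values printed are my own stand-ins, not the post’s DLfunctions8 code.

# Minimal gradient check on a toy cost function
J <- function(theta) sum(theta^2) / 2   # toy cost
gradJ <- function(theta) theta          # analytic gradient (the "grad" from backprop)

theta <- c(0.5, -1.2, 3.0)
epsilon <- 1e-7
gradapprox <- numeric(length(theta))
for (i in seq_along(theta)) {
  thetaplus <- theta;  thetaplus[i]  <- thetaplus[i]  + epsilon
  thetaminus <- theta; thetaminus[i] <- thetaminus[i] - epsilon
  gradapprox[i] <- (J(thetaplus) - J(thetaminus)) / (2 * epsilon)
}
grad <- gradJ(theta)
# relative difference via L2 norms, as in the post
difference <- sqrt(sum((grad - gradapprox)^2)) /
  (sqrt(sum(grad^2)) + sqrt(sum(gradapprox^2)))
difference  # a very small number (around 1e-9 or less), i.e. the gradient is correct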

You can clone/download the code from Github at DeepLearning-Part8

After spending the better part of 3 days, I now realize how critical Gradient Check is for ensuring the correctness of your implementation. Initially I was getting a very high difference and did not know how to interpret the results or debug my implementation. After many hours of staring at the results, I was finally able to arrive at a way to localize issues in the implementation. In fact, I did catch a small bug in my Python code, which did not exist in the R and Octave implementations. I will demonstrate this below.

1.1a Gradient Check – Sigmoid Activation – Python import numpy as np import matplotlib exec(open("DLfunctions8.py").read()) exec(open("testcases.py").read()) #Load the data train_X, train_Y, test_X, test_Y = load_dataset() #Set layer dimensions layersDimensions = [2,4,1] parameters = initializeDeepModel(layersDimensions) #Perform forward prop AL, caches, dropoutMat = forwardPropagationDeep(train_X, parameters, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="sigmoid") #Compute cost cost = computeCost(AL, train_Y, outputActivationFunc="sigmoid") print("cost=",cost) #Perform backprop and get gradients gradients = backwardPropagationDeep(AL, train_Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="sigmoid") epsilon = 1e-7 outputActivationFunc="sigmoid" # Set-up variables # Flatten parameters to a vector parameters_values, _ = dictionary_to_vector(parameters) #Flatten gradients to a vector grad = gradients_to_vector(parameters,gradients) num_parameters = parameters_values.shape[0] #Initialize J_plus = np.zeros((num_parameters, 1)) J_minus = np.zeros((num_parameters, 1)) gradapprox = np.zeros((num_parameters, 1)) # Compute gradapprox using 2 sided derivative for i in range(num_parameters): # Compute J_plus[i]. thetaplus = np.copy(parameters_values) thetaplus[i][0] = thetaplus[i][0] + epsilon AL, caches, dropoutMat = forwardPropagationDeep(train_X, vector_to_dictionary(parameters,thetaplus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) J_plus[i] = computeCost(AL, train_Y, outputActivationFunc=outputActivationFunc) # Compute J_minus[i]. thetaminus = np.copy(parameters_values) thetaminus[i][0] = thetaminus[i][0] - epsilon AL, caches, dropoutMat = forwardPropagationDeep(train_X, vector_to_dictionary(parameters,thetaminus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) J_minus[i] = computeCost(AL, train_Y, outputActivationFunc=outputActivationFunc) # Compute gradapprox[i] gradapprox[i] = (J_plus[i] - J_minus[i])/(2*epsilon) # Compare gradapprox to backward propagation gradients by computing difference. numerator = np.linalg.norm(grad-gradapprox) denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox) difference = numerator/denominator #Check the difference if difference > 1e-5: print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m") else: print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m") print(difference) print("\n") # The technique below can be used to identify # which of the parameters are in error # Covert grad to dictionary m=vector_to_dictionary2(parameters,grad) print("Gradients from backprop") print(m) print("\n") # Convert gradapprox to dictionary n=vector_to_dictionary2(parameters,gradapprox) print("Gradapprox from gradient check") print(n) ## (300, 2) ## (300,) ## cost= 0.6931455556341791 ## [92mYour backward propagation works perfectly fine! 
difference = 1.1604150683743381e-06[0m ## 1.1604150683743381e-06 ## ## ## Gradients from backprop ## {'dW1': array([[-6.19439955e-06, -2.06438046e-06], ## [-1.50165447e-05, 7.50401672e-05], ## [ 1.33435433e-04, 1.74112143e-04], ## [-3.40909024e-05, -1.38363681e-04]]), 'db1': array([[ 7.31333221e-07], ## [ 7.98425950e-06], ## [ 8.15002817e-08], ## [-5.69821155e-08]]), 'dW2': array([[2.73416304e-04, 2.96061451e-04, 7.51837363e-05, 1.01257729e-04]]), 'db2': array([[-7.22232235e-06]])} ## ## ## Gradapprox from gradient check ## {'dW1': array([[-6.19448937e-06, -2.06501483e-06], ## [-1.50168766e-05, 7.50399742e-05], ## [ 1.33435485e-04, 1.74112391e-04], ## [-3.40910633e-05, -1.38363765e-04]]), 'db1': array([[ 7.31081862e-07], ## [ 7.98472399e-06], ## [ 8.16013923e-08], ## [-5.71764858e-08]]), 'dW2': array([[2.73416290e-04, 2.96061509e-04, 7.51831930e-05, 1.01257891e-04]]), 'db2': array([[-7.22255589e-06]])} 1.1b Gradient Check – Softmax Activation – Python (Error!!)

In the code below I show how I managed to spot a bug in my implementation

import numpy as np exec(open("DLfunctions8.py").read()) N = 100 # number of points per class D = 2 # dimensionality K = 3 # number of classes X = np.zeros((N*K,D)) # data matrix (each row = single example) y = np.zeros(N*K, dtype='uint8') # class labels for j in range(K): ix = range(N*j,N*(j+1)) r = np.linspace(0.0,1,N) # radius t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta X[ix] = np.c_[r*np.sin(t), r*np.cos(t)] y[ix] = j # Plot the data #plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral) layersDimensions = [2,3,3] y1=y.reshape(-1,1).T train_X=X.T train_Y=y1 parameters = initializeDeepModel(layersDimensions) #Compute forward prop AL, caches, dropoutMat = forwardPropagationDeep(train_X, parameters, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") #Compute cost cost = computeCost(AL, train_Y, outputActivationFunc="softmax") print("cost=",cost) #Compute gradients from backprop gradients = backwardPropagationDeep(AL, train_Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") # Note the transpose of the gradients for Softmax has to be taken L= len(parameters)//2 print(L) gradients['dW'+str(L)]=gradients['dW'+str(L)].T gradients['db'+str(L)]=gradients['db'+str(L)].T # Perform gradient check gradient_check_n(parameters, gradients, train_X, train_Y, epsilon = 1e-7,outputActivationFunc="softmax") cost= 1.0986187818144022 2 There is a mistake in the backward propagation! difference = 0.7100295155692544 0.7100295155692544 Gradients from backprop {'dW1': array([[ 0.00050125, 0.00045194], [ 0.00096392, 0.00039641], [-0.00014276, -0.00045639]]), 'db1': array([[ 0.00070082], [-0.00224399], [ 0.00052305]]), 'dW2': array([[-8.40953794e-05, -9.52657769e-04, -1.10269379e-04], [-7.45469382e-04, 9.49795606e-04, 2.29045434e-04], [ 8.29564761e-04, 2.86216305e-06, -1.18776055e-04]]), 'db2': array([[-0.00253808], [-0.00505508], [ 0.00759315]])} Gradapprox from gradient check {'dW1': array([[ 0.00050125, 0.00045194], [ 0.00096392, 0.00039641], [-0.00014276, -0.00045639]]), 'db1': array([[ 0.00070082], [-0.00224399], [ 0.00052305]]), 'dW2': array([[-8.40960634e-05, -9.52657953e-04, -1.10268461e-04], [-7.45469242e-04, 9.49796908e-04, 2.29045671e-04], [ 8.29565305e-04, 2.86104473e-06, -1.18776100e-04]]), 'db2': array([[-8.46211989e-06], [-1.68487446e-05], [ 2.53108645e-05]])}

Gradient Check gives a high value for the difference of 0.7100295. Inspecting the Gradients and Gradapprox we can see there is a very big discrepancy in db2. After I went over my code I discovered that my computation in the function layerActivationBackward for Softmax was

# Erroneous code
if activationFunc == 'softmax':
    dW = 1/numtraining * np.dot(A_prev,dZ)
    db = np.sum(dZ, axis=0, keepdims=True)
    dA_prev = np.dot(dZ,W)

instead of

# Fixed code
if activationFunc == 'softmax':
    dW = 1/numtraining * np.dot(A_prev,dZ)
    db = 1/numtraining * np.sum(dZ, axis=0, keepdims=True)
    dA_prev = np.dot(dZ,W)

After fixing this error when I ran Gradient Check I get

1.1c Gradient Check – Softmax Activation – Python (Corrected!!) import numpy as np exec(open("DLfunctions8.py").read()) N = 100 # number of points per class D = 2 # dimensionality K = 3 # number of classes X = np.zeros((N*K,D)) # data matrix (each row = single example) y = np.zeros(N*K, dtype='uint8') # class labels for j in range(K): ix = range(N*j,N*(j+1)) r = np.linspace(0.0,1,N) # radius t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta X[ix] = np.c_[r*np.sin(t), r*np.cos(t)] y[ix] = j # Plot the data #plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral) layersDimensions = [2,3,3] y1=y.reshape(-1,1).T train_X=X.T train_Y=y1 #Set layer dimensions parameters = initializeDeepModel(layersDimensions) #Perform forward prop AL, caches, dropoutMat = forwardPropagationDeep(train_X, parameters, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") #Compute cost cost = computeCost(AL, train_Y, outputActivationFunc="softmax") print("cost=",cost) #Compute gradients from backprop gradients = backwardPropagationDeep(AL, train_Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") # Note the transpose of the gradients for Softmax has to be taken L= len(parameters)//2 print(L) gradients['dW'+str(L)]=gradients['dW'+str(L)].T gradients['db'+str(L)]=gradients['db'+str(L)].T #Perform gradient check gradient_check_n(parameters, gradients, train_X, train_Y, epsilon = 1e-7,outputActivationFunc="softmax") ## cost= 1.0986193170234435 ## 2 ## [92mYour backward propagation works perfectly fine! difference = 5.268804859613151e-07[0m ## 5.268804859613151e-07 ## ## ## Gradients from backprop ## {'dW1': array([[ 0.00053206, 0.00038987], ## [ 0.00093941, 0.00038077], ## [-0.00012177, -0.0004692 ]]), 'db1': array([[ 0.00072662], ## [-0.00210198], ## [ 0.00046741]]), 'dW2': array([[-7.83441270e-05, -9.70179498e-04, -1.08715815e-04], ## [-7.70175008e-04, 9.54478237e-04, 2.27690198e-04], ## [ 8.48519135e-04, 1.57012608e-05, -1.18974383e-04]]), 'db2': array([[-8.52190476e-06], ## [-1.69954294e-05], ## [ 2.55173342e-05]])} ## ## ## Gradapprox from gradient check ## {'dW1': array([[ 0.00053206, 0.00038987], ## [ 0.00093941, 0.00038077], ## [-0.00012177, -0.0004692 ]]), 'db1': array([[ 0.00072662], ## [-0.00210198], ## [ 0.00046741]]), 'dW2': array([[-7.83439980e-05, -9.70180603e-04, -1.08716369e-04], ## [-7.70173925e-04, 9.54478718e-04, 2.27690089e-04], ## [ 8.48520143e-04, 1.57018842e-05, -1.18973720e-04]]), 'db2': array([[-8.52096171e-06], ## [-1.69964043e-05], ## [ 2.55162558e-05]])} 1.2a Gradient Check – Sigmoid Activation – R source("DLfunctions8.R") z <- as.matrix(read.csv("circles.csv",header=FALSE)) x <- z[,1:2] y <- z[,3] X <- t(x) Y <- t(y) #Set layer dimensions layersDimensions = c(2,5,1) parameters = initializeDeepModel(layersDimensions) #Perform forward prop retvals = forwardPropagationDeep(X, parameters,keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="sigmoid") AL <- retvals[['AL']] caches <- retvals[['caches']] dropoutMat <- retvals[['dropoutMat']] #Compute cost cost <- computeCost(AL, Y,outputActivationFunc="sigmoid", numClasses=layersDimensions[length(layersDimensions)]) print(cost) ## [1] 0.6931447 # Backward propagation. 
gradients = backwardPropagationDeep(AL, Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="sigmoid",numClasses=layersDimensions[length(layersDimensions)]) epsilon = 1e-07 outputActivationFunc="sigmoid" #Convert parameter list to vector parameters_values = list_to_vector(parameters) #Convert gradient list to vector grad = gradients_to_vector(parameters,gradients) num_parameters = dim(parameters_values)[1] #Initialize J_plus = matrix(rep(0,num_parameters), nrow=num_parameters,ncol=1) J_minus = matrix(rep(0,num_parameters), nrow=num_parameters,ncol=1) gradapprox = matrix(rep(0,num_parameters), nrow=num_parameters,ncol=1) # Compute gradapprox for(i in 1:num_parameters){ # Compute J_plus[i]. thetaplus = parameters_values thetaplus[i][1] = thetaplus[i][1] + epsilon retvals = forwardPropagationDeep(X, vector_to_list(parameters,thetaplus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) AL <- retvals[['AL']] J_plus[i] = computeCost(AL, Y, outputActivationFunc=outputActivationFunc) # Compute J_minus[i]. thetaminus = parameters_values thetaminus[i][1] = thetaminus[i][1] - epsilon retvals = forwardPropagationDeep(X, vector_to_list(parameters,thetaminus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) AL <- retvals[['AL']] J_minus[i] = computeCost(AL, Y, outputActivationFunc=outputActivationFunc) # Compute gradapprox[i] gradapprox[i] = (J_plus[i] - J_minus[i])/(2*epsilon) } # Compare gradapprox to backward propagation gradients by computing difference. #Compute L2Norm numerator = L2NormVec(grad-gradapprox) denominator = L2NormVec(grad) + L2NormVec(gradapprox) difference = numerator/denominator if(difference > 1e-5){ cat("There is a mistake, the difference is too high",difference) } else{ cat("The implementations works perfectly", difference) } ## The implementations works perfectly 1.279911e-06 # This can be used to check print("Gradients from backprop") ## [1] "Gradients from backprop" vector_to_list2(parameters,grad) ## $dW1 ## [,1] [,2] ## [1,] -7.641588e-05 -3.427989e-07 ## [2,] -9.049683e-06 6.906304e-05 ## [3,] 3.401039e-06 -1.503914e-04 ## [4,] 1.535226e-04 -1.686402e-04 ## [5,] -6.029292e-05 -2.715648e-04 ## ## $db1 ## [,1] ## [1,] 6.930318e-06 ## [2,] -3.283117e-05 ## [3,] 1.310647e-05 ## [4,] -3.454308e-05 ## [5,] -2.331729e-08 ## ## $dW2 ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0.0001612356 0.0001113475 0.0002435824 0.000362149 2.874116e-05 ## ## $db2 ## [,1] ## [1,] -1.16364e-05 print("Grad approx from gradient check") ## [1] "Grad approx from gradient check" vector_to_list2(parameters,gradapprox) ## $dW1 ## [,1] [,2] ## [1,] -7.641554e-05 -3.430589e-07 ## [2,] -9.049428e-06 6.906253e-05 ## [3,] 3.401168e-06 -1.503919e-04 ## [4,] 1.535228e-04 -1.686401e-04 ## [5,] -6.029288e-05 -2.715650e-04 ## ## $db1 ## [,1] ## [1,] 6.930012e-06 ## [2,] -3.283096e-05 ## [3,] 1.310618e-05 ## [4,] -3.454237e-05 ## [5,] -2.275957e-08 ## ## $dW2 ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0.0001612355 0.0001113476 0.0002435829 0.0003621486 2.87409e-05 ## ## $db2 ## [,1] ## [1,] -1.16368e-05 1.2b Gradient Check – Softmax Activation – R source("DLfunctions8.R") Z <- as.matrix(read.csv("spiral.csv",header=FALSE)) # Setup the data X <- Z[,1:2] y <- Z[,3] X <- t(X) Y <- t(y) layersDimensions = c(2, 3, 3) parameters = initializeDeepModel(layersDimensions) #Perform forward prop retvals = forwardPropagationDeep(X, parameters,keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="softmax") AL <- 
retvals[['AL']] caches <- retvals[['caches']] dropoutMat <- retvals[['dropoutMat']] #Compute cost cost <- computeCost(AL, Y,outputActivationFunc="softmax", numClasses=layersDimensions[length(layersDimensions)]) print(cost) ## [1] 1.098618 # Backward propagation. gradients = backwardPropagationDeep(AL, Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="softmax",numClasses=layersDimensions[length(layersDimensions)]) # Need to take transpose of the last layer for Softmax L=length(parameters)/2 gradients[[paste('dW',L,sep="")]]=t(gradients[[paste('dW',L,sep="")]]) gradients[[paste('db',L,sep="")]]=t(gradients[[paste('db',L,sep="")]]) #Perform gradient check gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7,outputActivationFunc="softmax") ## The implementations works perfectly 3.903011e-07[1] "Gradients from backprop" ## $dW1 ## [,1] [,2] ## [1,] 0.0007962367 -0.0001907606 ## [2,] 0.0004444254 0.0010354412 ## [3,] 0.0003078611 0.0007591255 ## ## $db1 ## [,1] ## [1,] -0.0017305136 ## [2,] 0.0005393734 ## [3,] 0.0012484550 ## ## $dW2 ## [,1] [,2] [,3] ## [1,] -3.515627e-04 7.487283e-04 -3.971656e-04 ## [2,] -6.381521e-05 -1.257328e-06 6.507254e-05 ## [3,] -1.719479e-04 -4.857264e-04 6.576743e-04 ## ## $db2 ## [,1] ## [1,] -5.536383e-06 ## [2,] -1.824656e-05 ## [3,] 2.378295e-05 ## ## [1] "Grad approx from gradient check" ## $dW1 ## [,1] [,2] ## [1,] 0.0007962364 -0.0001907607 ## [2,] 0.0004444256 0.0010354406 ## [3,] 0.0003078615 0.0007591250 ## ## $db1 ## [,1] ## [1,] -0.0017305135 ## [2,] 0.0005393741 ## [3,] 0.0012484547 ## ## $dW2 ## [,1] [,2] [,3] ## [1,] -3.515632e-04 7.487277e-04 -3.971656e-04 ## [2,] -6.381451e-05 -1.257883e-06 6.507239e-05 ## [3,] -1.719469e-04 -4.857270e-04 6.576739e-04 ## ## $db2 ## [,1] ## [1,] -5.536682e-06 ## [2,] -1.824652e-05 ## [3,] 2.378209e-05 1.3a Gradient Check – Sigmoid Activation – Octave source("DL8functions.m") ################## Circles data=csvread("circles.csv"); X=data(:,1:2); Y=data(:,3); #Set layer dimensions layersDimensions = [2 5 1]; #tanh=-0.5(ok), #relu=0.1 best! [weights biases] = initializeDeepModel(layersDimensions); #Perform forward prop [AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights, biases,keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="sigmoid"); #Compute cost cost = computeCost(AL, Y',outputActivationFunc=outputActivationFunc,numClasses=layersDimensions(size(layersDimensions)(2))); disp(cost); #Compute gradients from cost [gradsDA gradsDW gradsDB] = backwardPropagationDeep(AL, Y', activation_caches,forward_caches, droputMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="sigmoid", numClasses=layersDimensions(size(layersDimensions)(2))); epsilon = 1e-07; outputActivationFunc="sigmoid"; # Convert paramter cell array to vector parameters_values = cellArray_to_vector(weights, biases); #Convert gradient cell array to vector grad = gradients_to_vector(gradsDW,gradsDB); num_parameters = size(parameters_values)(1); #Initialize J_plus = zeros(num_parameters, 1); J_minus = zeros(num_parameters, 1); gradapprox = zeros(num_parameters, 1); # Compute gradapprox for i = 1:num_parameters # Compute J_plus[i]. 
thetaplus = parameters_values; thetaplus(i,1) = thetaplus(i,1) + epsilon; [weights1 biases1] =vector_to_cellArray(weights, biases,thetaplus); [AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights1, biases1, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc); J_plus(i) = computeCost(AL, Y', outputActivationFunc=outputActivationFunc); # Compute J_minus[i]. thetaminus = parameters_values; thetaminus(i,1) = thetaminus(i,1) - epsilon ; [weights1 biases1] = vector_to_cellArray(weights, biases,thetaminus); [AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X',weights1, biases1, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc); J_minus(i) = computeCost(AL, Y', outputActivationFunc=outputActivationFunc); # Compute gradapprox[i] gradapprox(i) = (J_plus(i) - J_minus(i))/(2*epsilon); endfor #Compute L2Norm numerator = L2NormVec(grad-gradapprox); denominator = L2NormVec(grad) + L2NormVec(gradapprox); difference = numerator/denominator; disp(difference); #Check difference if difference > 1e-04 printf("There is a mistake in the implementation "); disp(difference); else printf("The implementation works perfectly"); disp(difference); endif [weights1 biases1] = vector_to_cellArray(weights, biases,grad); printf("Gradients from back propagation"); disp(weights1); disp(biases1); [weights2 biases2] = vector_to_cellArray(weights, biases,gradapprox); printf("Gradients from gradient check"); disp(weights2); disp(biases2); 0.69315 1.4893e-005 The implementation works perfectly 1.4893e-005 Gradients from back propagation { [1,1] = 5.0349e-005 2.1323e-005 8.8632e-007 1.8231e-006 9.3784e-005 1.0057e-004 1.0875e-004 -1.9529e-007 5.4502e-005 3.2721e-005 [1,2] = 1.0567e-005 6.0615e-005 4.6004e-005 1.3977e-004 1.0405e-004 } { [1,1] = -1.8716e-005 1.1309e-009 4.7686e-005 1.2051e-005 -1.4612e-005 [1,2] = 9.5808e-006 } Gradients from gradient check { [1,1] = 5.0348e-005 2.1320e-005 8.8485e-007 1.8219e-006 9.3784e-005 1.0057e-004 1.0875e-004 -1.9762e-007 5.4502e-005 3.2723e-005 [1,2] = [1,2] = 1.0565e-005 6.0614e-005 4.6007e-005 1.3977e-004 1.0405e-004 } { [1,1] = -1.8713e-005 1.1102e-009 4.7687e-005 1.2048e-005 -1.4609e-005 [1,2] = 9.5790e-006 } 1.3b Gradient Check – Softmax Activation – Octave source("DL8functions.m") data=csvread("spiral.csv"); # Setup the data X=data(:,1:2); Y=data(:,3); # Set the layer dimensions layersDimensions = [2 3 3]; [weights biases] = initializeDeepModel(layersDimensions); # Run forward prop [AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights, biases,keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="softmax"); # Compute cost cost = computeCost(AL, Y',outputActivationFunc=outputActivationFunc,numClasses=layersDimensions(size(layersDimensions)(2))); disp(cost); # Perform backward prop [gradsDA gradsDW gradsDB] = backwardPropagationDeep(AL, Y', activation_caches,forward_caches, droputMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="softmax", numClasses=layersDimensions(size(layersDimensions)(2))); #Take transpose of last layer for Softmax L=size(weights)(2); gradsDW{L}= gradsDW{L}'; gradsDB{L}= gradsDB{L}'; #Perform gradient check difference= gradient_check_n(weights, biases, gradsDW,gradsDB, X, Y, epsilon = 1e-7, outputActivationFunc="softmax",numClasses=layersDimensions(size(layersDimensions)(2))); 1.0986 The implementation works perfectly 2.0021e-005 Gradients from back propagation { [1,1] = 
-7.1590e-005 4.1375e-005 -1.9494e-004 -5.2014e-005 -1.4554e-004 5.1699e-005 [1,2] = 3.3129e-004 1.9806e-004 -1.5662e-005 -4.9692e-004 -3.7756e-004 -8.2318e-005 1.6562e-004 1.7950e-004 9.7980e-005 } { [1,1] = -3.0856e-005 -3.3321e-004 -3.8197e-004 [1,2] = 1.2046e-006 2.9259e-007 -1.4972e-006 } Gradients from gradient check { [1,1] = -7.1586e-005 4.1377e-005 -1.9494e-004 -5.2013e-005 -1.4554e-004 5.1695e-005 3.3129e-004 1.9806e-004 -1.5664e-005 -4.9692e-004 -3.7756e-004 -8.2316e-005 1.6562e-004 1.7950e-004 9.7979e-005 } { [1,1] = -3.0852e-005 -3.3321e-004 -3.8197e-004 [1,2] = 1.1902e-006 2.8200e-007 -1.4644e-006 } 2.1 Tip for tuning hyperparameters

Deep Learning networks come with a large number of hyperparameters which require tuning. The hyperparameters are

1. α – learning rate
2. Number of layers
3. Number of hidden units
4. Number of iterations
5. Momentum – β – 0.9
6. RMSProp – β – 0.9
7. Adam – β1, β2 and ε
8. learning rate decay
9. mini batch size
10. Initialization method – He, Xavier
11. Regularization

– Among the above the most critical is the learning rate α. Rather than just trying out random values, it may help to try out values on a logarithmic scale. So we could try out
values α = 0.01, 0.1, 1.0, 10 etc. If we find that the cost is lowest between 0.01 and 0.1 we could use a technique similar to binary search, so we can try 0.01 and 0.05. If we need α to be bigger than 0.01 and 0.05 we could try 0.25 and then keep halving the distance, etc.
– The performance of Momentum and RMSProp is very good and they work well with β values around 0.9. Even with this, it is better to try out values of 1 - β on a logarithmic scale. So 1 - β could be 0.001, 0.01, 0.1 and hence β would be 0.999, 0.99 or 0.9
– Increasing the number of hidden units or the number of hidden layers needs to be done gradually. I have noticed that increasing the number of hidden layers heavily does not improve performance and sometimes degrades it.
– Sometimes I tend to increase the number of iterations if I think I see a steady decrease in the cost for a certain learning rate
– It may also help to add learning rate decay if you see an oscillation in the cost while it decreases.
– Xavier and He initializations also help in a fast convergence and are worth trying out.
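For instance, sampling the learning rate and β on a logarithmic scale rather than uniformly takes only a couple of lines of R (the ranges below are my own illustrative choices, not values from the post):

# Sample alpha uniformly on a log scale between 1e-4 and 1
set.seed(42)
r <- runif(10, min = -4, max = 0)
alpha <- 10^r
sort(alpha)

# For Momentum/RMSProp it is 1 - beta that should be sampled on a log scale
s <- runif(10, min = -3, max = -1)   # 1 - beta between 0.001 and 0.1
beta <- 1 - 10^s
sort(beta)                           # values between 0.9 and 0.999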

3.1 Final thoughts

As I come to a close in this Deep Learning Series from first principles in Python, R and Octave, I must admit that I learnt a lot in the process.

* Building a L-layer, vectorized Deep Learning Network in Python, R and Octave was extremely challenging but very rewarding
* One benefit of building vectorized versions in Python, R and Octave was that I was looking at each function that I was implementing thrice, and hence I was able to fix any bugs in any of the languages
* In addition, since I built the generic L-Layer DL network with all the bells and whistles layer by layer, I had a further opportunity to look at all the functions in each successive post.
* Each language has its advantages and disadvantages. From the performance perspective I think Python is the best, followed by Octave and then R
* Interestingly, I noticed that even if small bugs creep into your implementation, the DL network does learn and does generate a valid set of weights and biases; however, this may not be an optimum solution. In one case of an inadvertent bug, I was not updating the weights in the final layer of the DL network. Yet, using all the other layers, the DL network was able to come up with a reasonable solution (maybe, like random dropout, the remaining units can still learn the data!)
* Having said that, the Gradient Check method discussed and implemented in this post can be very useful in ironing out bugs.
Feel free to clone/download the code from Github at DeepLearning-Part8

Conclusion

These last couple of months, when I was writing the posts and also churning out the code in Python, R and Octave, were very hectic. There have been times when I found the implementation of some functions to be extremely demanding and I almost felt like giving up. At other times, I have spent quite some time on an intractable DL network which would not respond to changes in hyper-parameters. All in all, it was a great learning experience. I would suggest that you start from my first post Deep Learning from first principles in Python, R and Octave-Part 1 and work your way up. Feel free to take the code apart and try out things. That is the only way you will learn.

Hope you had as much fun as I had. Stay tuned. I will be back!!!

Also see
1. My book ‘Practical Machine Learning with R and Python’ on Amazon
2. Revisiting crimes against women in India
3. Literacy in India – A deepR dive
4. Sixer – R package cricketr’s new Shiny avatar
5. Bend it like Bluemix, MongoDB using Auto-scale – Part 1!
6. Computer Vision: Ramblings on derivatives, histograms and contours
7. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
8. A closer look at “Robot Horse on a Trot” in Android

To see all post click Index of Posts


To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Bad Stock Photos of My Job? Data Science on Pexels

Sun, 05/06/2018 - 02:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

I couldn’t miss the fun Twitter hashtag #BadStockPhotosOfMyJob thanks to a tweet by Julia Silge and another one by Colin Fay. The latter inspired me to actually go and look for what makes a data science photo… What characterizes “data science” stock photos?

My (not bad) stock photo source Pexels metadata for the win

Where to find information related to stock photos? In two previous blog posts of mine I used Pexels, a website providing CC0 pictures which is quite nice. My goal was to obtain the titles and the tags of stock photos of “data science”: for instance if you look at this picture, its tags are “business”, “contemporary”, “computer”, etc. Pexels tags are very useful metadata, saving me the effort to use machine learning methods to analyse images.

Responsible webscraping

When researching this post I discovered that Pexels has an API, documented here, but this API does not get you the title or the tags associated with a picture, so only webscraping could get me what I needed.

Webscraping is a powerful tool allowing one to rectangle webpages but with great power comes great responsibility. Being able to scrape a webpage does not mean you are allowed to. You could get sued or your IP could get blocked. I am far from being an expert but I often read Bob Rudis’ blog where I learnt about rOpenSci’s robotstxt package that does “robots.txt file parsing and checking for R”, which in plain language means it checks for you what a webpage legally allows you to do. See below,

# how I'll find pictures
robotstxt::paths_allowed("https://www.pexels.com/search")
## [1] TRUE
# where tags live
robotstxt::paths_allowed("https://www.pexels.com/photo")
## [1] TRUE

robots.txt files often also tell you how often you can hit a page by defining a “crawling delay”. Sadly Pexels robots.txt doesn’t:

robotstxt::get_robotstxt("https://www.pexels.com")
## Sitemap: https://s3.amazonaws.com/pexels/sitemaps/sitemap.xml.gz

But Bob Rudis, who was patient and nice enough to answer my questions, told me that I should probably respect the rate limit defined in Pexels API docs. “Do not abuse the API. The API is rate-limited to 200 requests per hour and 20,000 requests per month.” As I recently explained in a post on Locke Data’s blog, these days to limit rate of a function I use the very handy ratelimitr package by Tarak Shah.

limited_get <- ratelimitr::limit_rate(
  httr::GET,
  ratelimitr::rate(200, 60*60), # not more than 200 times an hour
  ratelimitr::rate(1, 5))       # not more than 1 time every 5 seconds

Elegant webscraping

At the time of the two aforelinked blog posts I had used RSelenium to scroll down and get the download link of many pictures, but Bob Rudis wrote an elegant and cool alternative using query parameters, on which I’ll build in this post.

I first re-wrote the function to get all 15 pictures of each page of results.

get_page <- function(num = 1, seed = 1) {
  message(num)
  limited_get(
    url = "https://www.pexels.com/search/data science/",
    query = list(
      page = num,
      seed = seed
    )
  ) -> res
  httr::stop_for_status(res)
  pg <- httr::content(res)
  tibble::tibble(
    url = rvest::html_attr(rvest::html_nodes(pg, xpath = "//a[@class='js-photo-link']"), "href"),
    title = rvest::html_attr(rvest::html_nodes(pg, xpath = "//a[@class='js-photo-link']"), "title"),
    tags = purrr::map(url, get_tags)
  )
}

I re-wrote it because I needed the “href” and because it seems that the structure of each page has changed a bit since the day the gist was written. To find out that I had to write “a[@class='js-photo-link']” I inspected the source of a page.
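As a quick illustration of what that xpath picks up, here is a toy snippet of my own (the real Pexels markup may of course differ):

library(rvest)  # read_html() is re-exported from xml2

html <- '<div>
  <a class="js-photo-link" href="/photo/1" title="Person Holding Laptop">x</a>
  <a class="js-photo-link" href="/photo/2" title="Black Laptop on Desk">x</a>
</div>'
pg <- read_html(html)
nodes <- html_nodes(pg, xpath = "//a[@class='js-photo-link']")
html_attr(nodes, "href")   # "/photo/1" "/photo/2"
html_attr(nodes, "title")  # "Person Holding Laptop" "Black Laptop on Desk"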

Then I wrote a function getting tags for each picture.

get_tags <- function(url){
  message(url)
  url <- paste0("https://www.pexels.com", url)
  res <- limited_get(url)
  httr::stop_for_status(res)
  pg <- httr::content(res)
  nodes <- rvest::html_nodes(pg, xpath = '//a[@data-track-label="tag" ]')
  rvest::html_text(nodes)
}

And finally I got results for 20 pages. I chose 20 without thinking too much. It seemed enough for my needs, and each of these pages had pictures.

ds_stock <- purrr::map_df(1:20, get_page)
ds_stock <- unique(ds_stock)
ds_stock <- tidyr::unnest(ds_stock, tags)

I got 300 unique pictures.

What’s in a data science stock photo?

Now that I have all this information at hand, I can describe data science stock photos!

Data science tags

library("ggplot2")
library("ggalt")
library("hrbrthemes")
tag_counts <- dplyr::count(ds_stock, tags, sort = TRUE)[1:10,]
dplyr::mutate(tag_counts, tags = reorder(tags, n)) %>%
  ggplot() +
  geom_lollipop(aes(tags, n), size = 2, col = "salmon") +
  hrbrthemes::theme_ipsum(base_size = 16, axis_title_size = 16) +
  xlab("Tag") +
  ggtitle("300 data science stock photos",
          subtitle = "Most frequent tags. Source: https://www.pexels.com") +
  coord_flip()

So the most common tags are data, technology, business and computer. Not too surprising!

Data science scenes

Now, let’s have a look at the titles, which are in general more descriptive of what’s happening/present in the photo (i.e. is the computer near a cup of coffee or is someone working on it). I tried using a technique described in Julia Silge’s and David Robinson’s Tidy text mining book, “Counting and correlating pairs of words with the widyr package”, described in this section of the book, but it wasn’t too interesting because most correlation values were too low. One issue was probably my having too few titles: only half of the pictures have titles! So I’ll resort to plotting the most common bigrams, which I learnt in the Tidy text mining book as well.

stopwords <- rcorpora::corpora("words/stopwords/en")$stopWords
ds_stock %>%
  dplyr::filter(!is.na(title)) %>%
  dplyr::select(title) %>%
  unique() %>%
  tidytext::unnest_tokens(bigram, title, token = "ngrams", n = 2) %>%
  tidyr::separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) %>%
  dplyr::filter(!word1 %in% stopwords) %>%
  dplyr::filter(!word2 %in% stopwords) %>%
  dplyr::count(bigram, sort = TRUE) %>%
  dplyr::mutate(bigram = reorder(bigram, n)) %>%
  head(n = 10) %>%
  ggplot() +
  geom_lollipop(aes(bigram, n), size = 2, col = "salmon") +
  hrbrthemes::theme_ipsum(base_size = 16, axis_title_size = 16) +
  ggtitle("300 data science stock photos",
          subtitle = "Most frequent bigrams in titles. Source: https://www.pexels.com") +
  coord_flip()

So there’s a lot of holding computer happening, and these laptops are either black or white… And well Macbook Pro probably looks more professional?

Hold my laptop and watch…

my trying to find a good post conclusion! In this post, I tried to responsibly and elegantly scrape rich photo metadata from Pexels to characterize stock photos of data science. Using tags, and most common bigrams in titles, I found that data science stock photos are associated with data, business and computers; and that they often show people holding computers. Now, you’ll excuse me while I try and comfort my poor pink laptop that feels a bit too un-data-sciency.


To leave a comment for the author, please follow the link and comment on their blog: Maëlle. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Remove password protection from Excel sheets using R

Sun, 05/06/2018 - 01:07

(This article was first published on "R" you ready?, and kindly contributed to R-bloggers)

Most data scientists wish that all data lived neatly managed in some DB. However, in reality, Excel files are ubiquitous and often a common way to disseminate results or data within many companies. Every now and then I found myself in the situation where I wanted to protect Excel sheets against users accidentally changing them. A few months later, however, I found that I sometimes had forgotten the password I used. The “good” thing is that protecting Excel sheets by password is far from safe and access can be recovered quite easily. The following works for .xlsx files as of Excel 2016 and I suppose for 2013 as well.

Before implementing the steps in R, I will outline how to remove the password protection “by hand”. The R way is simply the automation of these steps. The first thing one needs to understand is that a .xlsx file is just a collection of folders and files in a zip container. If you unzip a .xlsx file (e.g. using 7-Zip) you get the following folder structure (sorry, German UI):

In the folder ./xl/worksheets we find one XML file for each Excel sheet. The sheet’s password protection is encoded directly in the sheet. While there used to be the plain password text in the XML in former versions, now, we find the hashed password (see part marked in yellow below). In order to get rid of the password protection, we simply can remove the whole sheetProtection node from the XML. We can do that in any text editor and save the file.

As the last step, we need to recreate the .xlsx file by creating a zip folder that contains our modified XML file (German UI again).

Finally, we change the file extension from .zip back to .xlsx and voilà, we get an Excel file without password-protected sheets. Programming the steps outlined above in R is quite straightforward. The steps are commented in the GitHub gist below.
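The gist itself is not reproduced here, but a rough base-R sketch of those same steps could look like the following. The function name and the regex are my own, and utils::zip() needs a zip utility available on the system, so treat this as an outline rather than the original gist:

# Rough sketch: unzip the .xlsx, drop the sheetProtection node, re-zip
unprotect_xlsx <- function(xlsx, out = sub("\\.xlsx$", "_unprotected.xlsx", xlsx)) {
  tmp <- tempfile("xlsx_")
  dir.create(tmp)
  unzip(xlsx, exdir = tmp)                       # a .xlsx is just a zip container
  sheets <- list.files(file.path(tmp, "xl", "worksheets"),
                       pattern = "\\.xml$", full.names = TRUE)
  for (f in sheets) {
    xml <- readLines(f, warn = FALSE)
    writeLines(gsub("<sheetProtection[^>]*/>", "", xml), f)  # remove the protection node
  }
  out <- file.path(normalizePath(dirname(out)), basename(out))
  old <- setwd(tmp)
  on.exit(setwd(old))
  zip(zipfile = out, files = list.files(recursive = TRUE))   # re-zip the folder structure
  out                                                        # keeping the .xlsx extension
}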


To leave a comment for the author, please follow the link and comment on their blog: "R" you ready?. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Install github package on safe haven server

Sun, 05/06/2018 - 00:03

(This article was first published on R – DataSurg, and kindly contributed to R-bloggers)

I’ve had a few enquiries about how to install the summarizer package on a server without internet access, such as the NHS Safe Havens.

  1. Upload summarizer-master.zip from here to server.
  2. Unzip.
  3. Run this:


library(devtools)
source = devtools:::source_pkg("summarizer-master")
install(source)


To leave a comment for the author, please follow the link and comment on their blog: R – DataSurg. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R as learning tool: solving integrals

Sat, 05/05/2018 - 20:11

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Integrals are so easy only math teachers could make them difficult. When I was in high school I really disliked math and, with hindsight, I would say it was just because of the prehistoric teaching tools (when I saw this video I thought I’m not alone). I strongly believe that interaction CAUSES learning (I’m using “causes” here on purpose, being quite aware of the difference between correlation and causation), practice should come before theory and imagination is not a skill you, as a teacher, could assume in your students. Here follows a short and simple practical explanation of integrals. The only math-thing I will write here is the following: f(x) = x + 7. From now on everything will be coded in R. So, first of all, what is a function? Instead of using the complex math philosophy let’s just look at it with a programming eye: it is a tool that takes something in input and returns something else as output. For example, if we use the previous tool with 2 as an input we get a 9. Easy peasy. Let’s look at the code:

# here we create the tool (called "f")
# it just takes some input and adds 7 to it
f <- function(x){x+7}
# if we apply it to 2 it returns a 9
f(2)
9

Then the second question comes by itself. What is an integral? Even simpler, it is just the sum of this tool applied to many inputs in a range. Quite complicated, let’s make it simpler with code: 

# first we create the range of inputs
# basically x values go from 4 to 6
# with a very very small step (0.01)
# seq stands for sequence(start, end, step)
x <- seq(4, 6, 0.01)
x
4.00 4.01 4.02 4.03 4.04 4.05 4.06 4.07...
x[1]
4
x[2]
4.01

As you see, x has many values and each of them is indexed so it’s easy to find, e.g. the first element is 4 (x[1]). Now that we have many x values (201) within the interval from 4 to 6, we compute the integral.

# since we said that the integral is
# just a sum, let's call it IntSum and
# set it to the start value of 0
# in this way it will work as an accumulator
IntSum = 0

Differently from the theory in which the calculation of the integral produces a new non-sense formula (just kidding, but this seems to be what math teachers are supposed to explain), the integral does produce an output, i.e. a number. We find this number by summing the output of each input value we get from the tool (e.g. 4+7, 4.01+7, 4.02+7, etc) multiplied by the step between one value and the following (e.g. 4.01-4, 4.02-4.01, 4.03-4.02, etc). Let’s clarify this, look down here:

# for each value of x
for(i in 2:201){

    # we do a very simple thing:
    # we cumulate with a sum
    # the output value of the function f
    # multiplied by each step's difference

    IntSum = IntSum + f(x[i])*(x[i]-x[i-1])

    # So for example,
    # with the first and second x values the numbers will be:
    # 0.1101 = 0 + (4.01 + 7)*(4.01 - 4)

    # with the second and third:
    # 0.2203 = 0.1101 + (4.02 + 7)*(4.02 - 4.01)

    # with the third and fourth:
    # 0.3306 = 0.2203 + (4.03 + 7)*(4.03 - 4.02)

    # and so on... with the sum (integral) growing and growing
    # up until the last value
}
IntSum
24.01

Done! We have the integral, but let’s have a look at the visualization of this because it can be represented and made crystal clear. Let’s add a short line of code to serve the purpose of saving the single number added to the sum each time. The reason why we decide to call it “bin” instead of, for example, “many_sum” will be clear in a moment.

# we need to store 201 calculations and we
# simply do what we did for IntSum but 201 times
bin = rep(0, 201)
bin
0 0 0 0 0 0 0 0 0 0 0 0 ...

Basically, we created a sort of memory to host each of the calculations, as you see down here:

for (i in 2:201){

    # the sum as earlier
    IntSum = IntSum + f(x[i])*(x[i]-x[i-1])

    # overwrite each zero with each number
    bin[i] = f(x[i])*(x[i]-x[i-1])
}
IntSum
24.01
bin
0.0000 0.1101 0.1102 0.1103 0.1104 0.1105 ..
sum(bin)
24.01

Now if you look at the plot below you get the whole story: each bin is a tiny bar with a very small area and is the smallest part of the integral (i.e. the sum of all the bins).

# plotting them all
barplot(bin, names.arg=x)

This tells you a lot about the purpose of the integral and the possibility of calculating areas of curvy surfaces. To have an idea of this just change the function to, let’s say, sin(x) or log(x). What is happening? And what if you increase/decrease the number of bins? Have fun replicating the code, changing some numbers and functions. Integrals should be clearer in the end. That’s all folks! #R #rstats #maRche #Rbloggers This post is also shared in www.r-bloggers.com
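As a quick sanity check (my own addition, not part of the original post), the bin-sum can be compared with R’s built-in integrate(), both for f and for a curvy function like sin:

# compare the bin-sum approximation with R's built-in numerical integration
f <- function(x){x+7}
x <- seq(4, 6, 0.01)
sum(f(x[-1]) * diff(x))              # 24.01, the loop above in vectorized form
integrate(f, lower = 4, upper = 6)   # 24, the exact area under x + 7 on [4, 6]

sum(sin(x[-1]) * diff(x))            # roughly -1.611
integrate(sin, lower = 4, upper = 6) # close to cos(4) - cos(6), about -1.614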


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Confessions of a Data Scientist: Why I quit social media and still cut my own grass

Sat, 05/05/2018 - 17:41

(This article was first published on Scott Mutchler, and kindly contributed to R-bloggers)

I hate to have to read a long blog post before I get to the payoff.  So here is mine:

  • Social media fosters a short attention span by triggering a dopamine (reward) response in our brains. Innovation in Data Science (and many other technical fields) requires deep thought. Mindless tasks like mowing the grass allow you to enter into deep thought and tap the purely creative center of your brain. Revolutionary Data Science is as much about creativity as it is about having an analytical mind.

Over a year ago, I realized I had greatly reduced the number of technical books I was reading.  I usually read 1-2 books a month and truly enjoyed learning new analytical techniques with the driving purpose of creating something unique… something no one else had ever thought of (even if it was derivative of other work).  It’s how I’m wired I guess.

Instead of learning, I was spending 2-3 hours a day mindlessly scrolling through Facebook looking at cat videos and political hate speech.  I was also posting family pictures of our adventures and eagerly awaiting the dopamine reward of getting likes from friends and family.  My attention span was getting shorter and shorter by the day.  I was also fundamentally unhappy because I always felt that others were having more adventures than me.  I guess I wasn’t alone in this feeling.

Now I’ve been fortunate enough to have tremendous success in Data Science. I’ve built analytical solutions for Fortune 500 companies that have driven $100M’s of incremental revenue. I’m not bragging, I’m incredibly grateful to be one of the seemingly few data scientists that had leadership that would listen to my ideas and give me a long rope to implement them. I’m thinking specifically of you Kevin Freeland (LinkedIn profile).

As I thought of those times when I was happiest, I realized that I spent a lot of time thinking… especially during my leisure time. I took books on Data Mining, Oracle, etc. to the beach. I spent hours on long drives thinking of new ways to solve business problems with analytics. Specifically, I thought about the exact moment I came up with a new way to assort products in retail stores. I was mowing my grass. I had a big yard. The mindless repetition of mowing allowed my mind to access the creative center of my brain. Apparently it did for Joe Walsh also. Ironically, I live not far from where he crashed a riding mower when he thought up “Rocky Mountain Way”!

Everyone is different and you should take this advice with a huge bag of rock salt but here goes:

  • Try leaving social media for 1 month
  • Take long drives
  • Mow your own grass
  • Block 2-3 hours a day to do nothing… no work, no TV, no social media, no phone.

 

Make time to


To leave a comment for the author, please follow the link and comment on their blog: Scott Mutchler. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

RStudio:addins part 1. – code reproducibility testing

Sat, 05/05/2018 - 16:00

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

Contents
  1. Introduction
  2. Prerequisites and recommendations
  3. Step 1 – Creating a package
  4. Step 2 – Writing the first functions
  5. Step 3 – Setting up an addin
  6. Step 4 – Updating our DESCRIPTION and NAMESPACE
  7. What is next – Always paying our (technical) debts
  8. Wrapping up
  9. TL;DR – Just give me the package
  10. References
Introduction

This is the first post in the RStudio:addins series. The aim of the series is to walk the readers through creating an R package that will contain functionality for integrating useful addins into the RStudio IDE. At the end of this first article, your RStudio will be 1 useful addin richer.

The addin we will create in this article will let us run a script open in RStudio in R vanilla mode via a keyboard shortcut and open a file with the script’s output in RStudio.

This is useful for testing whether your script is reproducible by users that do not have the same start-up options as you (e.g. preloaded environment, site file, etc.), making it a good tool to test your scripts before sharing them.

If you want to get straight to the code, you can find it at https://gitlab.com/jozefhajnala/jhaddins.git

Prerequisites and recommendations

To make the most use of the series, you will need the following:

  1. R, ideally version 3.4.3 or more recent, 64bit
  2. RStudio IDE, ideally version 1.1.383 or more recent
  3. Also recommended
  • git, for version control
  • TortoiseGit, convenient shell interface to git for those using Windows, with pretty icons and all
  4. Recommended R packages (install with install.packages("packagename"), or via RStudio’s Packages tab):
  • devtools – makes your development life easier
  • testthat – provides a framework for unit testing integrated into RStudio
  • roxygen2 – makes code documentation easy
Step 1 – Creating a package
  1. Use devtools::create to create a package (note that we will update more DESCRIPTION fields later and you can also choose any path you like and it will be reflected in the name of the package)
devtools::create( path = "jhaddins" , description = list("License" = "GPL-3") )
  2. In RStudio or elsewhere navigate to the jhaddins folder and open the project jhaddins.Rproj (or the name of your project if you chose a different path)

  3. Run the first check and install the package

devtools::check()   # Ctrl+Shift+E or Check button on RStudio's build tab
devtools::install() # Ctrl+Shift+B or Install button on RStudio's build tab
  4. Optionally, initialize git for version control
devtools::use_git()

Step 2 – Writing the first functions

We will now write some functions into a file called makeCmd.R that will let us run the desired functionality:

  1. makeCmd to create a command executable via system or shell, with defaults set up for executing an R file specified by path
makeCmd <- function(path
                    , command = "Rscript"
                    , opts = "--vanilla"
                    , outputFile = NULL
                    , suffix = NULL
                    , addRhome = TRUE) {
  if (Sys.info()["sysname"] == "Windows") {
    qType <- "cmd2"
  } else {
    qType <- "sh"
  }
  if (isTRUE(addRhome)) {
    command <- file.path(R.home("bin"), command)
  }
  cmd <- paste(
    shQuote(command, type = qType)
    , shQuote(opts, type = qType)
    , shQuote(path, type = qType)
  )
  if (!is.null(outputFile)) {
    cmd <- paste(cmd, ">", shQuote(outputFile))
  }
  if (!is.null(suffix)) {
    cmd <- paste(cmd, suffix)
  }
  cmd
}
  2. executeCmd to execute a command
executeCmd <- function(cmd, intern = FALSE) {
  sysName <- Sys.info()["sysname"]
  stopifnot(
    is.character(cmd)
    , length(cmd) == 1
    , sysName %in% c("Windows", "Linux")
  )
  if (sysName == "Windows") {
    shell(cmd, intern = intern)
  } else {
    system(cmd, intern = intern)
  }
}
  3. replaceTilde for Linux purposes
replaceTilde <- function(path) {
  if (substr(path, 1, 1) == "~") {
    path <- sub("~", Sys.getenv("HOME"), path, fixed = TRUE)
  }
  file.path(path)
}
  4. And finally the function which will be used for the addin execution – runCurrentRscript to retrieve the path to the currently active file in RStudio, run it, write the output to a file output.txt and open the file with output.
runCurrentRscript <- function(
  path = replaceTilde(rstudioapi::getActiveDocumentContext()[["path"]])
  , outputFile = "output.txt") {
  cmd <- makeCmd(path, outputFile = outputFile)
  executeCmd(cmd)
  if (!is.null(outputFile) && file.exists(outputFile)) {
    rstudioapi::navigateToFile(outputFile)
  }
}

Step 3 – Setting up an addin

Now that we have all our functions ready, all we have to do is create a file addins.dcf under the \inst\rstudio folder of our package. We specify the Name of the addin, write a nice Description of what it does and most importantly specify the Binding to the function we want to call:

creating addins.dcf under inst/rstudio

Name: runCurrentRscript
Description: Executes the currently open R script file via Rscript with --vanilla option
Binding: runCurrentRscript
Interactive: false

Now we can rebuild and install our package and in RStudio’s menu navigate to Tools -> Addins -> Browse Addins..., and there it is – our first addin. For the best experience, we can click the Keyboard Shortcuts... button and assign a keyboard shortcut to our addin for easy use.

setting an RStudio addin keyboard shortcut

Now just open an R script, hit our shortcut and voilà, our script gets executed via Rscript in vanilla mode.

Step 4 – Updating our DESCRIPTION and NAMESPACE

As our last steps, we should

  1. Update our DESCRIPTION file with rstudioapi as Imports, as we will be needing it before using our package:
Package: jhaddins
Title: JH's RStudio Addins
Version: 0.0.0.9000
Authors@R: person("Jozef", "Hajnala", email = "jozef.hajnala@gmail.com", role = c("aut", "cre"))
Description: Useful addins to make RStudio even better.
Depends: R (>= 3.0.1)
Imports: rstudioapi (>= 0.7)
License: GPL-3
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1
  2. Update our NAMESPACE by importing the functions from other packages that we are using, namely:
importFrom(rstudioapi, navigateToFile)
importFrom(rstudioapi, getActiveDocumentContext)

Now we can finally rebuild and install our package again and run a CHECK to see that we have no errors, warnings and notes telling us something is wrong. Make sure to use the document = FALSE for now.

devtools::install() # Ctrl+Shift+B or Install button on RStudio's build tab devtools::check(document = FALSE) # Ctrl+Shift+E or Check button on RStudio's build tab What is next – Always paying our (technical) debts

In the next post of the series, we will pay our debt of

  • missing documentation for our functions, that will help us to generate updates to our NAMESPACE automatically and help us get a nice documentation so that we can read about our functions using ?
  • and unit tests to help us sleep better knowing that our functions get tested!
Wrapping up

We can quickly create an RStudio addin by:

  1. Creating an R package
  2. Writing a function in that package
  3. Creating a addins.dcf in \inst\rstudio folder of our package
TL;DR – Just give me the package

Use git clone from: https://gitlab.com/jozefhajnala/jhaddins.git

References

To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

A Different Way To Think About Drawdown — Geometric Calmar Ratio

Fri, 05/04/2018 - 20:40

(This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

This post will discuss the idea of the geometric Calmar ratio — a way to modify the Calmar ratio to account for compounding returns.

So, one thing that recently had me sort of annoyed in terms of my interpretation of the Calmar ratio is this: essentially, the way I interpret it is that it’s a back of the envelope measure of how many years it takes you to recover from the worst loss. That is, if a strategy makes 10% a year (on average), and has a loss of 10%, well, intuition serves that from that point on, on average, it’ll take about a year to make up that loss–that is, a Calmar ratio of 1. Put another way, it means that on average, a strategy will make money at the end of 252 trading days.

But, that isn’t really the case in all circumstances. If an investment manager is looking to create a small, meager return for their clients, and is looking to make somewhere between 5-10%, then sure, the Calmar ratio approximation and interpretation makes sense in that context. Or, it makes sense in the context of “every year, we withdraw all profits and deposit to make up for any losses”. But in the context of a hedge fund trying to create large, market-beating returns for its investors, those hedge funds can have fairly substantial drawdowns.

Citadel – one of the gold standards of the hedge fund industry – had a drawdown of more than 50% during the financial crisis, and of course, there was at least one fund (https://www.reuters.com/article/us-usa-fund-volatility/exclusive-ljm-partners-shutting-its-doors-after-vol-mageddon-losses-in-u-s-stocks-idUSKCN1GC29H) that blew up in the storm-in-a-teacup volatility spike on Feb. 5 (in other words, if those guys were professionals, what does that make me? Or if I'm an amateur, what does that make them?).

In any case, in order to recover from such losses, it's clear that a strategy would need to make back more than what it lost. Lose 25%? You need 33% to get back to the high-water mark. Lose 33%? About 50% to get back to even. Lose 50%? 100%. Beyond that? You get the idea. (The arithmetic is sketched below.)
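A quick sketch of that arithmetic: the gain needed to get back to the high-water mark after a drawdown dd is 1/(1 - dd) - 1, which is exactly the denominator used by the geometric Calmar ratio below.

# Gain required to recover from a given drawdown: 1/(1 - dd) - 1
drawdowns <- c(0.10, 0.25, 0.33, 0.50)
gain_to_recover <- 1 / (1 - drawdowns) - 1
round(data.frame(drawdown = drawdowns, gain_to_recover = gain_to_recover), 3)
##   drawdown gain_to_recover
## 1     0.10           0.111
## 2     0.25           0.333
## 3     0.33           0.493
## 4     0.50           1.000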

In order to capture this dynamic, we should write a new Calmar ratio to express this idea.

So here’s a function to compute the geometric calmar ratio:

require(PerformanceAnalytics)

geomCalmar <- function(r) {
  rAnn <- Return.annualized(r)
  maxDD <- maxDrawdown(r)
  toHighwater <- 1/(1 - maxDD) - 1
  out <- rAnn/toHighwater
  return(out)
}

So, let's compare how some symbols stack up. We'll take a high-volatility name (AMZN), the good old S&P 500 (SPY), and a very low volatility instrument (SHY).

library(quantmod)  # for getSymbols() and Ad()

getSymbols(c('AMZN', 'SPY', 'SHY'), from = '1990-01-01')
rets <- na.omit(cbind(Return.calculate(Ad(AMZN)),
                      Return.calculate(Ad(SPY)),
                      Return.calculate(Ad(SHY))))
compare <- rbind(table.AnnualizedReturns(rets), maxDrawdown(rets),
                 CalmarRatio(rets), geomCalmar(rets))
rownames(compare)[6] <- "Geometric Calmar"
compare

The returns start from July 31, 2002. Here are the statistics.

                          AMZN.Adjusted SPY.Adjusted SHY.Adjusted
Annualized Return             0.3450000   0.09110000   0.01940000
Annualized Std Dev            0.4046000   0.18630000   0.01420000
Annualized Sharpe (Rf=0%)     0.8528000   0.48860000   1.36040000
Worst Drawdown                0.6525491   0.55189461   0.02231459
Calmar Ratio                  0.5287649   0.16498652   0.86861760
Geometric Calmar              0.1837198   0.07393135   0.84923475

For my own proprietary volatility trading strategy, a strategy which has a Calmar above 2 (interpretation: finger in the air means that you make a new equity high every six months in the worst case scenario), here are the statistics:

> CalmarRatio(stratRetsAggressive[[2]]['2011::'])
                Close
Calmar Ratio 3.448497

> geomCalmar(stratRetsAggressive[[2]]['2011::'])
                     Close
Annualized Return 2.588094

Because losses compound, the geometric Calmar ratio will always be lower than the standard Calmar ratio.

I hope this gives you some food for thought about re-evaluating the Calmar ratio.

Thanks for reading.

NOTES: registration for R/Finance 2018 is open. As usual, I’ll be giving a lightning talk, this time on volatility trading.

I am currently contracting and seeking networking opportunities, along with information about prospective full-time roles starting in July. Those interested in my skill set can feel free to reach out to me on LinkedIn here.



tidyposterior slides

Fri, 05/04/2018 - 18:10

(This article was first published on Blog - Applied Predictive Modeling, and kindly contributed to R-bloggers)

tidyposterior is an R package for comparing models based on their resampling statistics. There are a few case studies on the webpage to illustrate the process.
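For readers who have not used the package, the basic workflow looks roughly like the sketch below. It assumes resampling results collected in a data frame with an id column and one numeric column per model; the data, column names, and exact arguments are illustrative assumptions rather than a definitive example.

# A sketch, not the author's example: compare two models' resampled accuracy.
library(tidyposterior)

resample_results <- data.frame(
  id = sprintf("Fold%02d", 1:10),       # resample labels
  glm = runif(10, 0.78, 0.84),          # hypothetical accuracy values
  rand_forest = runif(10, 0.80, 0.88)
)

# Fit a Bayesian hierarchical model to the resampling statistics
post <- perf_mod(resample_results, seed = 42)

# Posterior distribution of the difference between the two models
differences <- contrast_models(post, "rand_forest", "glm")
summary(differences, size = 0.02)       # 2% practical-effect (ROPE) threshold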

I gave a talk at the Open Data Science Conference (ODSC) yesterday. A PDF of the slides is here, and the animated gif above is here.



Do you really need a multilevel model? A preview of powerlmm 0.4.0

Fri, 05/04/2018 - 14:55

(This article was first published on R Psychologist - R, and kindly contributed to R-bloggers)

In this post I will show some of the new simulation features that will be available in powerlmm 0.4.0. You can already install the dev version from GitHub.

# GitHub
devtools::install_github("rpsychologist/powerlmm")

The revamped simulation functions offer 3 major new features:

  • Compare multiple model formulas, including OLS models (no random effects).
  • Evaluate a “forward” or “backward” model selection strategy using LRT.
  • Apply a data transformation during the simulation; this makes it possible to compare e.g. an ANCOVA versus a 2-level LMM, or to write your own custom MNAR or MAR missing data function.

I will cover these new functions in two examples.

library(powerlmm)
nsim <- 5000
cores <- 16

Example 1 Do I really need a LMM? 2-lvl LMM versus ANCOVA

This example will show both the new data_transform argument and the new support for multiple formulas, including formulas without random effects. Each formula is fit to the same data set (or a transformed version of it) during the simulation. Let's assume we'll run a trial where we measure patients for 11 weeks, from baseline to week 10. We can analyze such data in many ways; as an example, we will compare 3 popular models:

  • t-test: group differences at posttest
  • ANCOVA: group differences at posttest adjusting for pretest values
  • LMM: group difference in change over time using a 2-level linear-mixed effects model with a random intercept and slope

To make the LMM more similar to the ANCOVA we also fit a constrained model where we assume there are no group differences at baseline (which there aren't). Let's set up the models and run the simulation; we will try different amounts of random slope variance relative to the within-subjects variance (var_ratio), with and without dropout (30% at posttest).

p <- study_parameters(n1 = 11,
                      n2 = 40, # per treatment
                      icc_pre_subject = 0.5,
                      cor_subject = -0.5,
                      var_ratio = c(0, 0.01, 0.02, 0.03),
                      dropout = c(0, dropout_weibull(0.3, 1)),
                      effect_size = cohend(c(0, 0.8)))

# Formulas --------------------------------------------------------------------
# OLS (t-test)
f_PT <- sim_formula("y ~ treatment",
                    test = "treatment",
                    data_transform = transform_to_posttest)

# ANCOVA
f_PT_pre <- sim_formula("y ~ treatment + pretest",
                        test = "treatment",
                        data_transform = transform_to_posttest)

# analyze as 2-level longitudinal
f_LMM <- sim_formula("y ~ time * treatment + (1 + time | subject)")

# constrain treatment differences at pretest
f_LMM_c <- sim_formula("y ~ time + time:treatment + (1 + time | subject)")

# combine formulas
f <- sim_formula_compare("posttest" = f_PT,
                         "ANCOVA" = f_PT_pre,
                         "LMM" = f_LMM,
                         "LMM_c" = f_LMM_c)

# Run sim ---------------------------------------------------------------------
res <- simulate(p,
                formula = f,
                nsim = nsim,
                cores = cores,
                satterthwaite = TRUE,
                batch_progress = FALSE)

Let’s summarize the results for the treatment effect.

# need to specify what parameter estimates the treatment effect.
tests <- list("posttest" = "treatment",
              "ANCOVA" = "treatment",
              "LMM" = "time:treatment",
              "LMM_c" = "time:treatment")

x <- summary(res, para = tests)

# Only print the first 5
print(head(x, 5), add_cols = c("var_ratio"))
## Model: 'All' | Type: 'fixed' | Parameter(s): 'treatment', 'time:treatment'
## ---
##   model var_ratio   M_est theta M_se SD_est Power Power_bw Power_satt
##  ANCOVA      0.00 -0.0441     0  2.7    2.7 0.053    0.049      0.049
##  ANCOVA      0.01  0.0083     0  3.1    3.1 0.052    0.048      0.048
##  ANCOVA      0.02  0.0153     0  3.6    3.5 0.051    0.046      0.046
##  ANCOVA      0.03 -0.0822     0  4.0    4.0 0.047    0.043      0.043
##  ANCOVA      0.00 11.3523     0  2.7    2.8 0.982    0.981      0.981
## ---
## nsim: 5000 | 24 columns not shown

Since we have 16 × 4 model results, it is probably better to plot the results.

library(ggplot2)
library(viridis)

# re-order
x$model <- factor(x$model, levels = c("posttest", "ANCOVA", "LMM", "LMM_c"))

# Set custom limits per facet
d_limits <- data.frame(effect_size = c(0),
                       Power_satt = c(0.025, 0.075),
                       var_ratio = 0,
                       model = 0,
                       dropout = 0)

# Set hlines per facet
d_hline <- data.frame(effect_size = c(0, 0.8),
                      x = c(0.05, 0.8))

# Plot
ggplot(x, aes(model, Power_satt,
              group = interaction(var_ratio, dropout),
              color = factor(var_ratio),
              linetype = factor(dropout))) +
  geom_hline(data = d_hline, aes(yintercept = x), linetype = "dotted") +
  geom_point() +
  geom_blank(data = d_limits) +
  geom_line() +
  labs(y = "Power (Satterthwaite)",
       linetype = "Dropout",
       color = "Variance ratio",
       title = "Power and Type I error") +
  facet_grid(dropout ~ effect_size, scales = "free", labeller = "label_both") +
  coord_flip() +
  theme_minimal() +
  scale_color_viridis_d()

We can see that, for complete data, the ANCOVA is generally more powerful than the standard LMM as heterogeneity increases ("variance ratio"), whereas the constrained LMM is more powerful. The difference between the ANCOVA and the t-test ("posttest") also decreases with increasing heterogeneity in change over time, since this leads to a weaker correlation between pretest and posttest. For the scenarios with missing data, the LMMs are more powerful, as would be expected: the cross-sectional methods are missing 30% of the observations, whereas the LMMs can include the available responses up until dropout occurs.

It is worth noting that the 2-lvl LMM with 11 repeated measures is not always more powerful than a "cross-sectional" t-test at posttest. Obviously, this is a limited example, but it demonstrates that it is a mistake to base sample size calculations on a t-test when an LMM is planned, with the motivation that "an LMM will have more power" (I see such motivations quite often).

Example 2 Do I really need a multilevel model? Using LRT to perform model selection

A common strategy when analyzing (longitudinal) data is to build the model in a data-driven fashion: start with a random intercept model, then add a random slope and perform a likelihood ratio test (LRT), keep the random slope if it is significant, and so on. We can investigate how well such a strategy works using sim_formula_compare. We'll define our model formulas, starting with a 2-level random intercept model and working up to a 3-level random intercept and slope model; the true model is a 3-level model with only a random slope.
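To make the strategy concrete, this is roughly what such a data-driven comparison looks like when done by hand with lme4; it is a sketch with placeholder data and variable names (d and its columns y, time, treatment, subject), not powerlmm code.

# A sketch of the manual forward-selection strategy that powerlmm mimics.
# 'd' is assumed to be a long-format data set with y, time, treatment
# and subject columns (placeholder names).
library(lme4)

m_intercept <- lmer(y ~ time * treatment + (1 | subject), data = d, REML = TRUE)
m_slope     <- lmer(y ~ time * treatment + (1 + time | subject), data = d, REML = TRUE)

# Same fixed effects, nested random effects: compare the REML fits directly.
# refit = FALSE keeps the REML log likelihoods instead of refitting with ML.
anova(m_intercept, m_slope, refit = FALSE)

# If the LRT is significant, keep the random slope and test the next term
# (e.g. a cluster-level random slope); otherwise stop.

With that strategy in mind, let's first set up the models.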

p <- study_parameters(n1 = 11,
                      n2 = 40,
                      n3 = 3,
                      icc_pre_subject = 0.5,
                      cor_subject = -0.5,
                      icc_slope = 0.05,
                      var_ratio = 0.03)

f0 <- sim_formula("y ~ time * treatment + (1 | subject)")
f1 <- sim_formula("y ~ time * treatment + (1 + time | subject)")
f2 <- sim_formula("y ~ time * treatment + (1 + time | subject) + (0 + time | cluster)")
f3 <- sim_formula("y ~ time * treatment + (1 + time | subject) + (1 + time | cluster)")

f <- sim_formula_compare("subject-intercept" = f0,
                         "subject-slope" = f1,
                         "cluster-slope" = f2,
                         "cluster-intercept" = f3)

Then we run the simulation; the four model formulas in f will be fit to each data set.

res <- simulate(p,
                formula = f,
                nsim = nsim,
                satterthwaite = TRUE,
                cores = cores,
                CI = FALSE)

During each simulation the REML log likelihood is saved for each model, so we can perform the model selection in the summary method as a post-processing step. Since REML is used, it is assumed that the fixed effects are the same and that we compare a "full" model to a "reduced" model. Let's try a forward selection strategy, where we start with the first model and compare it to the next; if the comparison is significant we continue testing models, otherwise we stop. The summary function performs this model comparison for each of the nsim simulations and returns the results from the "winning" model in each simulation replication.

x <- summary(res, model_selection = "FW", LRT_alpha = 0.05)
x
## Model: model_selection
##
## Random effects
##
##          parameter   M_est  theta est_rel_bias prop_zero is_NA
##  subject_intercept 100.000 100.00      0.00029      0.00     0
##      subject_slope   2.900   2.80      0.00710      0.00     0
##              error 100.000 100.00     -0.00017      0.00     0
##        cor_subject  -0.490  -0.50     -0.01000      0.00     0
##      cluster_slope   0.270   0.15      0.82000      0.00     0
##  cluster_intercept   7.900   0.00      7.90000      0.00     0
##        cor_cluster  -0.081   0.00     -0.08100      0.67     0
##
## Fixed effects
##
##       parameter   M_est theta M_se SD_est Power Power_bw Power_satt
##     (Intercept)  0.0160     0 1.10   1.00 0.050        .          .
##            time -0.0045     0 0.25   0.28 0.130        .          .
##       treatment  0.0160     0 1.50   1.50 0.049        .          .
##  time:treatment -0.0024     0 0.36   0.40 0.130    0.048       0.12
## ---
## Number of simulations: 5000 | alpha: 0.05
## Time points (n1): 11
## Subjects per cluster (n2 x n3): 40 x 3 (treatment)
##                                 40 x 3 (control)
## Total number of subjects: 240
## ---
## Results based on LRT model comparisons, using direction: FW (alpha = 0.05)
## Model selected (proportion)
## cluster-intercept     cluster-slope     subject-slope
##            0.0054            0.4360            0.5586

The point of the model selection algorithm is to mimic the type of data-driven model selection that is quite common. We see that this strategy does not lead to nominal Type I errors in this scenario: the cluster level is left out of the model too often, leading to Type I errors around 12%. However, it is fairly common to increase the LRT's alpha level to try to improve this strategy. Let's try a range of alpha levels to see the impact.

alphas <- seq(0.01, 0.5, length.out = 50)
x <- vapply(alphas, function(a) {
  type1 <- summary(res, model_selection = "FW", LRT_alpha = a)
  type1$summary$model_selection$FE$Power_satt[4]
}, numeric(1))

d <- data.frame(LRT_alpha = alphas, type1 = x)

ggplot(d, aes(LRT_alpha, type1)) +
  geom_line() +
  geom_hline(yintercept = 0.05, linetype = "dotted") +
  labs(y = "Type I error (time:treatment)",
       title = "Impact of LRT alpha level for model selection") +
  theme_minimal()

The figure shows that the LRT alpha level needs to be very liberal to keep Type I errors for the treatment effect close to the 5% level.

We can also see the results from each of the four models. Here we will just look at the time:treatment effect.

x1 <- summary(res, para = "time:treatment")
x1
## Model: summary
##
## Fixed effects: 'time:treatment'
##
##              model   M_est theta M_se SD_est Power Power_bw Power_satt
##  subject-intercept -0.0024     0 0.14    0.4 0.500    0.330      0.500
##      subject-slope -0.0024     0 0.25    0.4 0.220    0.080      0.220
##      cluster-slope -0.0024     0 0.39    0.4 0.088    0.028      0.054
##  cluster-intercept -0.0024     0 0.40    0.4 0.082    0.026      0.043
## ---
## Number of simulations: 5000 | alpha: 0.05
## Time points (n1): 11
## Subjects per cluster (n2 x n3): 40 x 3 (treatment)
##                                 40 x 3 (control)
## Total number of subjects: 240

We see that the 2-lvl random intercept model leads to substantially inflated Type I errors (0.495). The 2-level model that also adds a random slope is somewhat better, but still not good, with Type I errors of 0.215. The correct 3-level model, which accounts for the third level using a random slope, has close to nominal Type I errors (0.054). The full 3-level model that adds an unnecessary random intercept is somewhat conservative, with Type I errors of 0.043.

When choosing a strategy, Type I error is not the only factor we care about; power is also important. So let's see how power is affected.

# See if power is impacted
p1 <- update(p, effect_size = cohend(0.8))

res_power <- simulate(p1,
                      formula = f,
                      nsim = nsim,
                      satterthwaite = TRUE,
                      cores = cores,
                      CI = FALSE)

# we can also summarize a single fixed effect for convenience
x <- summary(res_power, model_selection = "FW", LRT_alpha = 0.05, para = "time:treatment")
print(x, verbose = FALSE)
## Model: summary
##
## Fixed effects: 'time:treatment'
##
##            model M_est theta M_se SD_est Power Power_bw Power_satt
##  model_selection   1.1   1.1 0.36   0.39  0.82     0.61       0.69
## ---

x1 <- summary(res_power, para = "time:treatment")
print(x1, verbose = FALSE)
## Model: summary
##
## Fixed effects: 'time:treatment'
##
##              model M_est theta M_se SD_est Power Power_bw Power_satt
##  subject-intercept   1.1   1.1 0.14   0.39  0.98     0.97       0.98
##      subject-slope   1.1   1.1 0.25   0.39  0.95     0.86       0.94
##      cluster-slope   1.1   1.1 0.39   0.39  0.80     0.55       0.63
##  cluster-intercept   1.1   1.1 0.40   0.39  0.78     0.52       0.55
## ---

We can note that the power for the treatment effect based on LRT model selection is only slightly higher than for the correct 3-level model. If we balance this slight increase in power against the noticeable increase in Type I errors, it might be reasonable to conclude that for these data we should always fit the 3-level model. Lastly, the overspecified 3-level model that includes an unnecessary random intercept loses some power.

Summary

The improved simulation method in powerlmm makes it really convenient to plan and evaluate the analysis of longitudinal treatment trials with a possible third level of clustering (therapists, schools, groups, etc.). The support for data transforms and single-level (lm()) models also makes it possible to fit a lot of custom models.

Feedback, bugs, etc

I appreciate all types of feedback, e.g. typos, bugs, inconsistencies, feature requests, etc. Open an issue on github.com/rpsychologist/powerlmm/issues, comment on this post, or contact me at rpsychologist.com/about.


