
Building a website with pkgdown: a short guide

Tue, 08/01/2017 - 14:16

As promised in my last post, here is a short guide with some tips and tricks for building a documentation website for an R package using pkgdown.

This guide ended up way longer than I was expecting, but I hope you’ll find it useful, even though it often replicates information already available in the pkgdown documentation!

Prerequisites

To build a website using pkgdown, all you need is an R package hosted on GitHub, with a file structure “tweaked” with some functionality provided by devtools. Assuming you are using RStudio, and that you haven’t already done this, open the project corresponding to your package and (as a minimum) run:

require(devtools)
use_readme_rmd()
use_news_md()
use_vignette("test") # substitute with the name of your package

Since to use pkgdown your package must be on GitHub, you may also want use_github_links(), which will automatically populate some fields of the DESCRIPTION file that are later used to build the home page of your website. Other possibly useful commands (depending on the status of your package) are:

use_travis()
use_cran_badge()

(see the devtools documentation for further info)

At this point, within your package file structure you should have a README.Rmd file (which is also used to create the “README.md” file for GitHub), and a NEWS.md file. You should also have a man subfolder containing the .Rd files documenting your functions (usually auto-generated from roxygen comments using devtools::document()). Finally, you should have a my_package.Rmd file in the vignettes subfolder (which is used, for example, by devtools::build_vignettes() to automatically create a vignette for the package).
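For orientation, a package at this stage might look roughly like the layout below (the file names other than DESCRIPTION, README.Rmd and NEWS.md are just placeholders):

mypackage/
  DESCRIPTION
  README.Rmd
  NEWS.md
  man/
    myfun_1.Rd
    myfun_2.Rd
  vignettes/
    mypackage.Rmd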

Getting Started: creating a bare-bones website

To create a standard pkgdown site, just run:

devtools::install_github("hadley/pkgdown")
library(pkgdown)
build_site()

build_site() will do several things:

  • create a “docs” subfolder in your file structure, where it will place all the material needed for rendering the website;
  • knit README.Rmd to “docs/index.html”. This will be the home page of your website;
  • knit NEWS.md to “docs/news/index.html” (in this way, any time you update NEWS.md, the news section of the website can be automatically updated);
  • knit all your “Rd” files to “docs/reference/” as html files inheriting the name of the function (e.g., “docs/reference/myfun_1.html”, “docs/reference/myfun_2.html”, etc.). A “docs/reference/index.html” page is also created: this is a simple html page linking to the html documentation pages for the different functions (see for example here). By default, all functions will be included and listed in alphabetical order.
  • knit any Rmd files in your “vignettes” subfolder to “docs/articles” as single html files;
  • Scrape your package for various other info (e.g., Authors, GitHub address, License, Citation, badges, etc.) to create additional material on the right-hand side bar of the home page;
  • Put everything together by some magic to build a working website, and open a preview in RStudio Viewer or your browser.

The resulting website will look more or less like this:

“Standard” website built by pkgdown::build_site()

Your main vignette is linked in “Getting Started”, and all the other Rmds found in vignettes (if any) are linked in the “Articles” drop-down menu.

Considering that all that is needed is to run build_site() (in particular if the package is already using devtools tweaks), I’d say that this is already a nice result! However, spending some time in better configuring the structure of the site and tweaking some small things allows you to achieve a much nicer result, as explained below.

Customizing the appearance and structure of the website: the _pkgdown.yaml file

Your pkgdown website can be further customized by creating and customizing a text file named _pkgdown.yaml in the root folder of your project. The file needs to have three main sections, which I will describe here using the current .yaml file used in the MODIStsp Website as an example (the complete file can be found here).

Preamble Section

This is quite straightforward: first of all, give a title to your website and provide its URL. Then, select a template to customize its appearance from the ones available at bootswatch. Finally, add the Google Analytics tracking code if you wish.

title: MODIStsp
url: http://lbusett.github.io/MODIStsp
template:
  params:
    bootswatch: flatly
    ganalytics: UA-12345678-0

Reference Section

Here, you can configure the page of the website containing the index to the documentation of your functions (docs/reference/index.html). Instead of having a simple list in alphabetical order, you can define different groups and decide which functions to put in each group.

Each group is defined by a title, a description (use ~ for no description), and a contents section containing an indented list of functions to be included in that group. Syntax and indentation rules must be strictly followed but are very simple. Looking at the example below should suffice for understanding it. In this case I decided to use only two groups: one for exported functions, and one for everything else, but you can set-up as many groups as you need.

reference:
  - title: Exported Functions
    desc: Functions exported by MODIStsp
    contents:
      - '`MODIStsp`'
      - '`MODIStsp_extract`'
      - '`MODIStsp_addindex`'
      - '`install_MODIStsp_launcher`'
  - title: Internals
    desc: Internal functions and helpers
    contents:
      - '`MODIStsp_GUI`'
      - '`MODIStsp_process`'
      - '`MODIStsp_process_indexes`'
      - '`MODIStsp_process_QA_bits`'
      - '`MODIStsp_read_xml`'
      - '`lpdaac_getmod_dirs`'
      - '`lpdaac_getmod_dates`'
      - '`lpdaac_getmod_names`'
      - '`MODIStsp_check_files`'
      - '`MODIStsp_vrt_create`'
      - '`bbox_from_file`'
      - '`reproj_bbox`'

Navbar Section

The last section of _pkgdown.yaml is named navbar, and is where most of the customization occurs.

In the first line, you can choose whether to use a standard or inverse color scheme for your bootstrap template. The only way to understand what this does is to try both and see which one you prefer.

navbar:
  type: inverse

Next, you can define what content will be accessible through buttons or menus on the left side of the top-menu of the website.

  • Buttons linking to a single page are described by:
    1. a title text or an icon name (chosen from http://fontawesome.io/icons/) that will be shown on the button;
    2. a hyperlink to the page that will be accessed through the button (note that the hyperlinks are built relative to the root of the “docs” folder created by pkgdown – no need to specify full paths!).
  • Dropdown menus giving access to multiple pages are described by:
    1. a title text or an icon name;
    2. a “menu:” line;  
    3. an indented list of the text entries that will appear in the menu, with the associated hyperlinks.

In the example below, you can see how the content should be indented and organized:

left:
  - icon: fa-home
    href: index.html
  - text: "How To"
    menu:
      - text: Installation
        href: articles/installation.html
      - text: Interactive Execution - the MODIStsp GUI
        href: articles/interactive_execution.html
      - text: Non-Interactive Execution from within R
        href: articles/noninteractive_execution.html
      - text: Standalone Execution and Scheduled Processing
        href: articles/standalone_execution.html
      - text: Outputs Format and Naming Conventions
        href: articles/output.html
      - text: Accessing and Analyzing Processed Data from R
        href: articles/analyze.html
  - text: "Examples of Use"
    href: articles/examples.html

In the case of the MODIStsp website, I decided not to provide a link to the “full” vignette (which was linked from “Getting Started” in the “bare-bones” website). Instead, I divided the contents of that vignette into six shorter articles accessible from the “How To” dropdown menu.

To do that, I just had to create six separate Rmd files within the vignettes folder. All Rmds within that folder are automatically knitted by pkgdown when launching either build_site() or build_articles(), and are therefore available for linking from the menu.

Finally, in the last section of _pkgdown.yaml you can specify what content should be accessible from the right-hand side of the top menu. Syntax and indentation are identical to what is described above.

right:
  - text: "faq"
    icon: fa-question-circle-o
    href: articles/faq.html
  - icon: fa-newspaper-o
    text: "news"
    href: news/index.html
  - icon: fa-file-code-o
    text: "functions"
    href: reference/index.html
  - icon: fa-github fa-lg
    text: "github"
    href: https://github.com/lbusett/MODIStsp/

On the MODIStsp website, we are using the right side of the top menu bar to provide access to:

  • a static page containing a FAQ; 
  • the auto-generated news page (by linking to news/index.html); 
  • the function documentation index page (by linking to reference/index.html); 
  • an icon linking to the MODIStsp repository on GitHub.

(From the example, you can see that it is possible to associate each button with both an icon and a short title text. In that case, the icon and the text are shown one after the other.)

Once your .yaml file is complete, just run build_site() again and check the results. Then iterate ad libitum until you are satisfied with the resulting structure.

Fine tuning

When you are satisfied with the structure of the website, you can start tweaking its contents to achieve a better-looking final result. Here I’m just sharing some tips and tricks I learnt while building our website:

  • If (like me) you wish to have a different “home page” on the website and on the main GitHub page, you can create a new index.Rmd file in the root of the package. If index.Rmd is found, it is used instead of README.Rmd to populate the home page;
  • Text in the “standard” output is a bit too “compressed” for my taste. You can increase the spacing between the main sections of an article by simply adding a <br> tag at the end of each main section;
  • Similarly, you can add line separators between sections by simply adding a line of underscores under each section;
  • To reduce the “wall of text” effect, you can put any of the http://fontawesome.io/icons/ icons in an article by inserting its “full html specification” in the text of the corresponding Rmd. (For example, ”I want a github icon here: <i class="fa fa-github" aria-hidden="true"></i>” would render in the website with a “GitHub octopus” icon at the end);
  • Of course, you can add any image/R plot by linking/creating it in the Rmds of the different articles or of the home page;
  • If you build the site using build_site(run_dont_run = TRUE), the examples wrapped in \dontrun{} in the roxygen comments will be run, and their results will appear in the documentation page of each function. I find this really useful to illustrate the typical behaviour of functions;
  • To provide modifiers to the standard pkgdown.css and pkgdown.js files, create a new folder named “pkgdown” in your project root. As described in the pkgdown documentation, the content of the files “extra.css” and “extra.js” placed in this folder will be copied to docs/ and inserted into the <head> after the default pkgdown CSS and JS. I, for example, added the following lines in extra.css to have some additional syntax highlighting in the code snippets:
    .fl {color: blue;}
    .fu {color: blue;} /* function */
    .ch,.st {color: orange;} /* string */
    .kw {color: black;} /* keyword */
    .co {color: green;} /* comment */

    .message {color: gray; font-weight: bolder;}
    .error {color: red; font-weight: bolder;}
    .warning {color: purple; font-weight: bolder;}

If you are more fluent than me in CSS and JS, you can probably tweak the appearance a lot more.

  • You don’t need to run the whole build_site() every time you wish to adjust something and check the results. You can use build_reference(), build_news(), build_articles() and build_home() to update just some sections, as in the example below.
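For instance, a quick iteration cycle might look like this (these are all existing pkgdown functions; run whichever matches the part you just changed):

# Rebuild only the part of the site you touched
pkgdown::build_home()       # home page (index.Rmd / README.Rmd)
pkgdown::build_reference()  # function reference pages and their index
pkgdown::build_articles()   # the Rmd files in vignettes/
pkgdown::build_news()       # the news page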

Below, you can see how the .yaml file described before and the other small “tweaks” improved the appearance of the MODIStsp home page:

Final Home Page of the MODIStsp website

Deploying the website to GitHub and the web

Finally, it’s time to deploy the website. To do that:

  1. Commit and push your changes to the remote;
  2. Log in to GitHub, navigate to your repo and select “Settings”;
  3. Scroll down to find the “GitHub Pages” subsection and, under “Source”, select “master branch/docs folder” (from this, you can see that it is fundamental that your website material is pushed to the master branch);
  4. Click on the link of your website and… congratulations: you just published your new pkgdown website!
(PS: if you find any errors in this guide, or you think additional clarification is needed, please leave a comment on this post!)

How to use H2O with R on HDInsight

Mon, 07/31/2017 - 23:24

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

H2O.ai is an open-source AI platform that provides a number of machine-learning algorithms that run on the Spark distributed computing framework. Azure HDInsight is Microsoft's fully-managed Apache Hadoop platform in the cloud, which makes it easy to spin up and manage Azure clusters of any size. It's also easy to run H2O on HDInsight: the H2O AI Platform is available as an application on HDInsight, which pre-installs everything you need as the cluster is created.

You can also drive H2O from R, but the R packages don't come auto-installed on HDInsight. To make this easy, the Azure HDInsight team has provided a couple of scripts that will install the necessary components on the cluster for you. These include RStudio (to provide an R IDE on the cluster) and the rsparkling package. With these components installed, from R you can:

  • Query data in Spark using the dplyr interface, and add new columns to existing data sets.
  • Convert data for training, validation, and testing to "H2O Frames" in preparation for modeling.
  • Apply any of the machine learning models provided by Sparkling Water to your data, using the distributed computing capabilities provided by the HDInsight platform.
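To make the list above concrete, here is a rough sketch of that workflow. The connection settings, column names and split are illustrative only (they are not taken from the linked guide), and it assumes sparklyr, rsparkling and h2o are already installed on the cluster as described there:

library(sparklyr)
library(rsparkling)
library(dplyr)
library(h2o)

sc <- spark_connect(master = "yarn-client")      # connect to the cluster's Spark
cars_tbl <- copy_to(sc, mtcars)                  # a Spark DataFrame you can query with dplyr
cars_hf  <- as_h2o_frame(sc, cars_tbl)           # convert to an H2O Frame for modelling
splits   <- h2o.splitFrame(cars_hf, ratios = 0.8)  # training / validation split
fit <- h2o.glm(x = c("wt", "hp"), y = "mpg", training_frame = splits[[1]])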

For details on how to install the R components on HDInsight, follow the link below.

Azure Data Lake & Azure HDInsight Blog: Run H2O.ai in R on Azure HDInsight

 



Counterfactual estimation on nonstationary data, be careful!!!

Mon, 07/31/2017 - 22:00

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos

In a recent paper that can be downloaded here, Carvalho, Masini and Medeiros show that estimating counterfactuals in a non-stationary framework (when I say non-stationary, I mean integrated) is a tricky task. It is intuitive that the models will not work properly in the absence of cointegration (the spurious case), but what the authors show is that even with cointegration, the average treatment effect (ATE) converges to a non-standard distribution. As a result, standard tests on the ATE will identify treatment effects in several cases where there is no effect at all.

For those who are not familiar with counterfactual models: normally, these models have a treated unit (we want to know the effects of the treatment) and several untreated units that we call peers. These units may be cities, countries, companies, etc. Assuming that only one unit was treated and that there is no contamination, we can use information from the peers to project the behaviour of the treated unit as if it had not been treated, which is the counterfactual. The models should also be able to identify if the treatment had no effects on the treated unit.

Simulation tests

Here I am going to generate cointegrated data with no treatment to show the behaviour of the Synthetic Control and the ArCo (other techniques will have the same features). The data is very simple and is based on the mentioned article’s example. We are going to generate 100 observations of 10 time-series: 9 random walks and one that is the sum of these random walks plus an error, which is the treated unit. However, I will not include any treatment, and I want the models to show that there was no treatment. The hypothesis to be tested is that at t0 = 51 we had some treatment in the treated unit.

library(ArCo)
library(glmnet)
library(Synth)

T = 100 # Number of Observations
n = 10  # Number of Time Series
set.seed(1) # Seed for replication

# Generate random walk peers
peers = matrix(0, T, n - 1)
for(i in 2:T){
  peers[i, ] = peers[i - 1, ] + rnorm(n - 1)
}

# Generate the treated unit
y1 = rowSums(peers) + rnorm(T)

# Put all the TS together
Y = cbind(y1, peers)

# Plot all TS. The black line is the "treated unit"
matplot(Y, type = "l")

First, I am going to estimate the ArCo. The $delta shows the average treatment effect with a 95% confidence interval. As you can see, the test indicated that the treatment was statistically significant (zero is very far from the confidence interval). The plot also shows a dislocated counterfactual (blue line).

# Estimate the ArCo in the non-stationary data
arco = fitArCo(list(Y), cv.glmnet, predict, treated.unity = 1, t0 = 51)
arco$delta
##                  LB     delta        UB
## Variable1 -6.157713 -5.236705 -4.315697
plot(arco)

Now I am going to estimate the Synthetic Control. The Synth package does not have an implementation of a formal test. However, we can have a good idea from the counterfactual plot, which also indicates that the treatment was effective (even though there was no treatment).

# Estimate the Synthetic Control in the nonstationary data
# Put the data in the Synth package notation
sY = as.vector(Y)
timeid = rep(1:T, ncol(Y))
unitid = sort(rep(1:ncol(Y), T))
synthdata = data.frame(time = timeid, unit = unitid, Y = sY)
synthdataprep <- dataprep(
  foo = synthdata,
  predictors = c("Y"),
  predictors.op = "mean",
  dependent = "Y",
  unit.variable = "unit",
  time.variable = "time",
  treatment.identifier = 1,
  controls.identifier = c(2:10),
  time.predictors.prior = c(1:50),
  time.optimize.ssr = c(1:50),
  time.plot = 1:100
)
SC = synth(synthdataprep)
path.plot(SC, synthdataprep)
abline(v = 50, col = "blue", lty = 2)

As you can see, both the ArCo and the Synthetic Control wrongly rejected the null hypothesis of no treatment. The best way to solve the problem is to take the first difference of all non-stationary time-series. However, there is an important thing to point out here. Imagine the random walk:

y_t = y_{t-1} + e_t

Suppose we have a treatment at t0 that simply adds a constant c to the series only for t = t0, i.e. y_{t0} = c + y_{t0-1} + e_{t0}. For any t != t0 we will have Δy_t = e_t, and for t = t0 we will have Δy_{t0} = c + e_{t0}. Therefore, if we take the first difference the treatment will have an impact only at t0, which makes it impossible for any counterfactual model to identify an intervention. In this example, we would need an intervention that changed the drift of the Random Walk in a way that for all t >= t0 we have:

y_t = d + y_{t-1} + e_t

This will be better illustrated in an example. The plot below shows the difference between the two treatments. The red line is the case we cannot identify if we take the first difference. The blue line changes the drift of the Random Walk, and therefore we can identify it. It is impossible, with the tools I am using here, to tell whether we are on the black line or on the red line.

y = rep(0, T)
yt1 = yt2 = y
d = 1
c = 10
for(i in 2:T){
  e = rnorm(1)
  y[i] = y[i - 1] + e
  if(i == 51){
    yt1[i] = c + yt1[i - 1] + e
  }else{
    yt1[i] = yt1[i - 1] + e
  }
  if(i >= 51){
    yt2[i] = d + yt2[i - 1] + e
  }else{
    yt2[i] = yt2[i - 1] + e
  }
}
plot(yt2, type = "l", col = 4, ylab = "")
lines(yt1, col = 2)
lines(y, col = 1)
legend("topleft", legend = c("untreated", "treated by a constant in t0", "treated in the drift"),
       col = c(1, 2, 4), bty = "n", seg.len = 0.5, cex = 0.8, lty = 1)

Now let’s get back to the first example. The results clearly indicated an intervention where there was none. Now I will show what happens if we take the first difference.

# Take the first difference
dY = diff(Y, 1)
# Estimate the ArCo
darco = fitArCo(list(dY), cv.glmnet, predict, treated.unity = 1, t0 = 50)
darco$delta
##                   LB      delta        UB
## Variable1 -0.6892162 0.02737802 0.7439722
plot(darco)

# Adjust the data to the Synth notation and estimate the model
dsY = as.vector(dY)
timeid = rep(2:T, ncol(Y))
unitid = sort(rep(1:ncol(Y), T - 1))
dsynthdata = data.frame(time = timeid, unit = unitid, Y = dsY)
dsynthdataprep <- dataprep(
  foo = dsynthdata,
  predictors = c("Y"),
  predictors.op = "mean",
  dependent = "Y",
  unit.variable = "unit",
  time.variable = "time",
  treatment.identifier = 1,
  controls.identifier = c(2:10),
  time.predictors.prior = c(2:50),
  time.optimize.ssr = c(2:50),
  time.plot = 2:100
)
dSC = synth(dsynthdataprep)
##
## X1, X0, Z1, Z0 all come directly from dataprep object.
##
##
## ****************
##  optimization over w weights: computing synthtic control unit
##
##
##
## ****************
## ****************
## ****************
##
## MSPE (LOSS V): 8.176593
##
## solution.v:
##  1
##
## solution.w:
##  2.9e-09 1.4e-08 3.7e-09 0.9999999 7.4e-09 2.9e-09 1.16e-08 6.9e-09 4e-09
path.plot(dSC, dsynthdataprep)
abline(v = 50, col = "blue", lty = 2)

As you can see, the $delta in the ArCo showed that the treatment effect was statistically zero, which is true. The Synthetic Control plot also showed that there was no difference caused by the intervention, even though the model adjustment was not very good.

Finally, note that we have two scenarios. If there is no intervention and we do not take the first difference we will probably wrongly reject the null hypothesis of no intervention. However, if we are in the “red line” case of the plot and we take the first difference, we will not reject the null when it may be false depending on the type of variable we are dealing with.
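As a side note (not from the original post), a quick sanity check before deciding whether to difference is a unit-root test, for example with the tseries package:

library(tseries)
adf.test(Y[, 1])        # a large p-value: cannot reject a unit root, so difference the series
adf.test(diff(Y[, 1]))  # the differenced series should now look stationary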

References

Carvalho, Carlos, Ricardo Masini, and Marcelo C. Medeiros. “The perils of counterfactual analysis with integrated processes.” (2016).



15 Jobs for R users (2017-07-31) – from all over the world

Mon, 07/31/2017 - 20:41
To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs

 

More New Jobs
  1. Full-Time
    Financial Analyst/Modeler @ Mesa, Arizona, U.S. MD Helicopters – Posted by swhalen
    Mesa, Arizona, United States
    31 Jul 2017
  2. Full-Time
    Research volunteer in Cardiac Surgery @ Philadelphia, Pennsylvania, U.S. Thomas Jefferson University – Posted by CVSurgery
    Philadelphia, Pennsylvania, United States
    31 Jul 2017
  3. Freelance
    Fit GLM distribution models to data using R nandini arora
    Anywhere
    31 Jul 2017
  4. Freelance
    Financial services startup looking for freelance Shiny/R/UI developer incuhed
    Anywhere
    29 Jul 2017
  5. Freelance
    Data Scientists – PhD Paradise – Germany Data Science Talent – Posted by datasciencetalent
    Frankfurt am Main, Hessen, Germany
    22 Jul 2017
  6. Full-Time
    Technical Project Manager Virginia Tech Applied Research Corporation – Posted by miller703
    Arlington, Virginia, United States
    15 Jul 2017
  7. Full-Time
    Software Developer Virginia Tech Applied Research Corporation – Posted by miller703
    Arlington, Virginia, United States
    15 Jul 2017
  8. Full-Time
    Software Developer (with R experience) @ Arlington, Virginia, U.S. Virginia Tech Applied Research Corporation – Posted by miller703
    Arlington, Virginia, United States
    14 Jul 2017
  9. Full-Time
    Marketing Manager RStudio – Posted by bill@rstudio.com
    Anywhere
    14 Jul 2017
  10. Full-Time
    Solutions Engineer RStudio – Posted by nwstephens
    Anywhere
    14 Jul 2017
  11. Full-Time
    Lead Data Scientist Golden Rat Studios – Posted by goldenrat
    Los Angeles, California, United States
    13 Jul 2017
  12. Full-Time
    Customer Success Rep RStudio – Posted by jclemens1
    Anywhere
    12 Jul 2017
  13. Freelance
    R/Shiny and JavaScript Developer Leibniz-Institute of Freshwater Ecology and Inland Fisheries (IGB) – Posted by vbremerich
    Anywhere
    10 Jul 2017
  14. Part-Time
    Data Science Bootcamp Mentor Thinkful – Posted by baronowicz
    Anywhere
    9 Jul 2017
  15. Part-Time
    Call for Help: R/Shiny Developer Fantasy Football Analytics – Posted by val.pinskiy
    Anywhere
    9 Jul 2017

 

On R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).


Machine Learning Explained: Dimensionality Reduction

Mon, 07/31/2017 - 20:13

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

Dealing with a lot of dimensions can be painful for machine learning algorithms. High dimensionality will increase the computational complexity, increase the risk of overfitting (as your algorithm has more degrees of freedom) and the sparsity of the data will grow. Hence, dimensionality reduction will project the data into a space with fewer dimensions to limit these phenomena.
In this post, we will first try to get an intuition of what dimensionality reduction is, then we will focus on the most widely used techniques.

Spaces and variables with many dimensions

Let’s say we own a shop and want to collect some data on our clients. We can collect their age, how frequently they come to our shop, how much they spend on average and when they last came to our shop. Hence each of our clients can be represented by four variables (age, frequency, spending and last purchase date) and can be seen as a point in a four-dimensional space. Later on, variables and dimensions will have the same meaning.

Now, let’s add some complexity and some variables by using images. How many dimensions are used to represent the image below?

The image is 8 by 8 pixels, hence 64 pixels, and each of these pixels is represented by a quantitative variable: 64 variables are needed!
Even bigger, how many variables do you need to represent the picture below?

Its resolution is 640 by 426 and you need three color channels (Red, Green, and Blue). So this picture requires no fewer than … 640*426*3 = 817,920 variables. That’s how you go from four to almost one million dimensions (and you can go well above!)
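Just to make the counts concrete, here is the same arithmetic in R:

8 * 8           # 64 variables for the small 8x8 greyscale image
640 * 426 * 3   # 817920 variables for the colour photograph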

Dimensionality reduction in everyday life

Before seeing any algorithm, everyday life provides us with a great example of dimensionality reduction.

Each of these people can be represented as points in a 3-dimensional space. As a gross approximation, each person fits in a 50*50*200 (cm) cube. If we use a resolution of 1 cm and three color channels, each person can then be represented by around 1,000,000 variables.
On the other hand, the shadow is only in 2 dimensions and in black and white, so each shadow only needs 50*200 = 10,000 variables.
The number of variables was divided by 100! And if your goal is to detect humans vs cats, or even men vs women, the data from the shadow may be enough.

Why should you reduce the number of dimensions?

Dimensionality reduction has several advantages from a machine learning point of view.

  • Since your model has fewer degrees of freedom, the likelihood of overfitting is lower. The model will generalize more easily on new data.
  • If you use feature selection or linear methods (such as PCA), the reduction will promote the most important variables which will improve the interpretability of your model.
  • Most feature extraction techniques are unsupervised. You can train your autoencoder or fit your PCA on unlabeled data. This can be really helpful if you have a lot of unlabeled data (for instance text or images) and labeling is time-consuming and expensive.
Features selection as a basic reduction

The most obvious way to reduce dimensionality is to remove some dimensions and to select the most suitable variables for the problem.

Here are some ways to select variables:

  • Greedy algorithms which add and remove variables until some criterion is met. For instance, stepwise regression with forward selection will, at each step, add the variable which improves the fit in the most significant way
  • Shrinking and penalization methods, which add a cost for having too many variables. For instance, L1 regularization will cut some variables’ coefficients to zero (see the sketch after this list). Regularization limits the space in which the coefficients can live
  • Selection of models based on criteria that take the number of dimensions into account, such as the adjusted R², AIC or BIC. Contrary to regularization, the model is not trained to optimize these criteria; the criteria are computed after fitting to compare several models
  • Filtering of variables using correlation, VIF or some “distance measure” between the features
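As an example of the penalization approach, here is a minimal lasso sketch with glmnet; X (a numeric predictor matrix) and y (the response vector) are placeholders for your own data:

library(glmnet)
cv_fit <- cv.glmnet(X, y, alpha = 1)                # alpha = 1 gives the L1 (lasso) penalty
coefs  <- as.matrix(coef(cv_fit, s = "lambda.1se")) # coefficients at a fairly sparse lambda
selected <- rownames(coefs)[coefs[, 1] != 0]        # variables not shrunk to zero (the intercept always appears)
selected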
Features Engineering and extraction to reduce dimensionality

While feature selection is efficient, it is brutal, since variables are either kept or removed. However, over some intervals or for some modalities, a removed variable may still be useful, while a kept variable may be partly redundant. Feature extraction (or engineering) seeks to keep only the intervals or modalities of the features which contain information.

PCA and Kernel PCA

Principal component analysis (or PCA) is a linear transformation of the data which looks for the axes where the data have the most variance. PCA will create new variables which are linear combinations of the original ones; these new variables will be orthogonal (i.e. their correlation equals zero). PCA can be seen as a rotation of the initial space to find more suitable axes to express the variability in the data.
Once the new variables have been created, you can select the most important ones. The threshold is up to you and depends on how much variance you want to keep. You can check the tutorial below to see a working R example:

R Basics: PCA with R
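The tutorial covers the details; if you just want to see the mechanics, a minimal base-R sketch on the iris measurements looks like this:

pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)              # proportion of variance explained by each component
reduced <- pca$x[, 1:2]   # the data projected on the first two principal components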

Since PCA is linear, it mostly works on linearly separable data. Hence, if you want to perform classification on other data (like the donut below), linear PCA will probably make you lose a lot of information.

Example of Kernel PCA from Scikit-Learn

On the other hand, kernel PCA can work on nonlinearly separable data. The trick is simple:

  • before applying PCA, a function is applied to project the initial space into a bigger space where the data are separable. The function can be a polynomial, radial or some activation function.
  • Then a PCA is applied in this new space. (NB: It is not exactly PCA but close to it).

Hence you can easily separate the two rings of the donut, which will improve the performance of classifiers. Kernel PCA raises two problems: you need to find the right kernel, and the new variables are hard for humans to understand.

LDA and Kernel LDA

Linear discriminant analysis is similar to PCA but is supervised (while PCA does not require labels). The goal of the LDA is not to maximize variance but to create linear surfaces to separate the different groups. The new features will be axes that separate the data when the data are projected on them.

As with PCA, you can also apply the kernel trick to apply LDA on non-linearly separable data.
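For the plain (non-kernel) version, a minimal sketch with the MASS package, again on iris, might look like this:

library(MASS)
lda_fit <- lda(Species ~ ., data = iris)   # supervised: uses the class labels
projected <- predict(lda_fit)$x            # data projected on the discriminant axes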

ICA and Kernel ICA

Independent component analysis aims at creating independent variables from the original variables.
The typical example is the cocktail party problem: you are in a room with lots of different people having different conversations, and your goal is to separate the different conversations. If you have a lot of microphones in the room, each and every one of them will record a linear combination of all the conversations (some kind of noise). The goal of ICA is to disentangle these conversations from the noise.
This can be seen as dimensionality reduction: if you have 200 microphones but only ten independent conversations, you should be able to represent them with ten independent variables from the ICA.
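A minimal ICA sketch with the fastICA package; here X stands for a numeric matrix of mixed signals (e.g. one column per microphone) and is a placeholder:

library(fastICA)
ica <- fastICA(X, n.comp = 10)   # try to recover 10 independent components
S_hat <- ica$S                   # the estimated independent sources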

Autoencoder

Autoencoder is a powerful method to reduce the dimensionality of data. It is composed of a neural network (it can be feed-forward, convolutional or recurrent; most architectures can be adapted into an autoencoder) which will try to learn its input. For instance, an autoencoder trained on images will try to reconstruct these images.

In addition to this, the autoencoder has a bottleneck: the number of neurons in the autoencoder’s hidden layers will be smaller than the number of input variables. Hence, the autoencoder has to learn a compressed representation of the data where the number of dimensions is the number of neurons in the smallest hidden layer.

The main advantages of autoencoders are their non-linearity and how flexible they can be.
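A minimal sketch of such a bottleneck with the keras R package (the layer sizes are arbitrary and the input is assumed to be 64 flattened pixel values, as in the 8x8 image above; keras and a TensorFlow backend must be installed):

library(keras)
input   <- layer_input(shape = 64)
encoded <- input %>% layer_dense(units = 8, activation = "relu")       # the bottleneck
decoded <- encoded %>% layer_dense(units = 64, activation = "sigmoid") # try to rebuild the input
autoencoder <- keras_model(input, decoded)
autoencoder %>% compile(optimizer = "adam", loss = "mse")
# autoencoder %>% fit(X, X, epochs = 50)   # X: rows are observations (e.g. flattened images)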

T-SNE

T-SNE or t-distributed stochastic neighbor embedding is mainly used to fit data from large dimensions into 2D or 3D space. That is the technique we used to fit and visualize the Twitter data in our analysis of the French election. The main idea behind T-SNE is that nearby points in the original space should be close in the low-dimensional space, while distant points should also be distant in the smaller space.
T-SNE is highly non-linear, is originally non-parametric, depends on the random seed and does not preserve distances. Hence, I mainly use it to plot high-dimensional data and to visualise clusters and similarity.
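A minimal 2D embedding with the Rtsne package (check_duplicates is disabled because iris contains a duplicated row):

library(Rtsne)
tsne <- Rtsne(as.matrix(iris[, 1:4]), dims = 2, perplexity = 30, check_duplicates = FALSE)
plot(tsne$Y, col = iris$Species, pch = 19)   # the 2D map, coloured by species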

NB: There is also a parametric variant which seems less widely used than the original T-SNE.
https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf

Here ends our presentation of the most widely used dimensionality reduction techniques.

 

 

 

The post Machine Learning Explained: Dimensionality Reduction appeared first on Enhance Data Science.



Google Vision API in R – RoogleVision

Mon, 07/31/2017 - 19:54

(This article was first published on R-Projects – Stoltzmaniac, and kindly contributed to R-bloggers)

Using the Google Vision API in R

Utilizing RoogleVision

After doing my post last month on OpenCV and face detection, I started looking into other algorithms used for pattern detection in images. As it turns out, Google has done a phenomenal job with their Vision API. It’s absolutely incredible the amount of information it can spit back to you by simply sending it a picture.

Also, it’s 100% free! I believe that includes 1000 images per month. Amazing!

In this post I’m going to walk you through the absolute basics of accessing the power of the Google Vision API using the RoogleVision package in R.

As always, we’ll start off loading some libraries. I wrote some extra notation around where you can install them within the code.

# Normal Libraries
library(tidyverse)

# devtools::install_github("flovv/RoogleVision")
library(RoogleVision)
library(jsonlite) # to import credentials

# For image processing
# source("http://bioconductor.org/biocLite.R")
# biocLite("EBImage")
library(EBImage)

# For Latitude Longitude Map
library(leaflet)

Google Authentication

In order to use the API, you have to authenticate. There is plenty of documentation out there about how to setup an account, create a project, download credentials, etc. Head over to Google Cloud Console if you don’t have an account already.

# Credentials file I downloaded from the cloud console
creds = fromJSON('credentials.json')

# Google Authentication - Use Your Credentials
# options("googleAuthR.client_id" = "xxx.apps.googleusercontent.com")
# options("googleAuthR.client_secret" = "")
options("googleAuthR.client_id" = creds$installed$client_id)
options("googleAuthR.client_secret" = creds$installed$client_secret)
options("googleAuthR.scopes.selected" = c("https://www.googleapis.com/auth/cloud-platform"))
googleAuthR::gar_auth()

Now You’re Ready to Go

The function getGoogleVisionResponse takes three arguments:

  1. imagePath
  2. feature
  3. numResults

Numbers 1 and 3 are self-explanatory, “feature” has 5 options:

  • LABEL_DETECTION
  • LANDMARK_DETECTION
  • FACE_DETECTION
  • LOGO_DETECTION
  • TEXT_DETECTION

These are self-explanatory but it’s nice to see each one in action.

As a side note: there are also other features of the API which aren’t included (yet) in the RoogleVision package, such as “Safe Search”, which identifies inappropriate content, and “Properties”, which identifies dominant colors and aspect ratios; a few others can be found at the Cloud Vision website.

Label Detection

This is used to help determine content within the photo. It can basically add a level of metadata around the image.

Here is a photo of our dog when we hiked up to Audubon Peak in Colorado:

dog_mountain_label = getGoogleVisionResponse('dog_mountain.jpg', feature = 'LABEL_DETECTION')
head(dog_mountain_label)
##            mid           description     score
## 1     /m/09d_r              mountain 0.9188690
## 2 /g/11jxkqbpp mountainous landforms 0.9009549
## 3    /m/023bbt            wilderness 0.8733696
## 4     /m/0kpmf             dog breed 0.8398435
## 5    /m/0d4djn            dog hiking 0.8352048

All 5 responses were incredibly accurate! The “score” that is returned is how confident the Google Vision algorithms are, so there’s a 91.9% chance a mountain is prominent in this photo. I like “dog hiking” the best – considering that’s what we were doing at the time. Kind of a little bit too accurate…

Landmark Detection

This is a feature designed to specifically pick out a recognizable landmark! It provides the position in the image along with the geolocation of the landmark (in longitude and latitude).

My wife and I took this selfie at Linderhof Castle in Bavaria, Germany.

us_castle <- readImage('us_castle_2.jpg')
plot(us_castle)

The response from the Google Vision API was spot on. It returned “Linderhof Palace” as the description. It also provided a score (I reduced the resolution of the image which hurt the score), a boundingPoly field and locations.

  • Bounding Poly – gives x,y coordinates for a polygon around the landmark in the image
  • Locations – provides longitude,latitude coordinates
us_landmark = getGoogleVisionResponse('us_castle_2.jpg', feature = 'LANDMARK_DETECTION')
head(us_landmark)
##         mid      description     score
## 1 /m/066h19 Linderhof Palace 0.4665011
##                               vertices          locations
## 1 25, 382, 382, 25, 178, 178, 659, 659 47.57127, 10.96072

I plotted the polygon over the image using the coordinates returned. It does a great job (certainly not perfect) of getting the castle identified. It’s a bit tough to say what the actual “landmark” would be in this case due to the fact the fountains, stairs and grounds are certainly important and are a key part of the castle.

us_castle <- readImage('us_castle_2.jpg')
plot(us_castle)
xs = us_landmark$boundingPoly$vertices[[1]][1][[1]]
ys = us_landmark$boundingPoly$vertices[[1]][2][[1]]
polygon(x=xs,y=ys,border='red',lwd=4)

Turning to the locations – I plotted this using the leaflet library. If you haven’t used leaflet, start doing so immediately. I’m a huge fan of it due to speed and simplicity. There are a lot of customization options available as well that you can check out.

The location is spot on! While it isn’t a shock to me that Google could provide the location of “Linderhof Castle” – it is amazing to me that I don’t have to write a web crawler search function to find it myself! That’s just one of many little luxuries they have built into this API.

latt = us_landmark$locations[[1]][[1]][[1]]
lon = us_landmark$locations[[1]][[1]][[2]]
m = leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  setView(lng = lon, lat = latt, zoom = 5) %>%
  addMarkers(lng = lon, lat = latt)
m

Face Detection

My last blog post showed the OpenCV package utilizing the haar cascade algorithm in action. I didn’t dig into Google’s algorithms to figure out what is under the hood, but it provides similar results. However, rather than layering in each subsequent “find the eyes” and “find the mouth” and …etc… it returns more than you ever needed to know.

  • Bounding Poly = highest level polygon
  • FD Bounding Poly = polygon surrounding each face
  • Landmarks = (funny name) includes each feature of the face (left eye, right eye, etc.)
  • Roll Angle, Pan Angle, Tilt Angle = all of the different angles you’d need per face
  • Confidence (detection and landmarking) = how certain the algorithm is that it’s accurate
  • Joy, sorrow, anger, surprise, under exposed, blurred, headwear likelihoods = how likely it is that each face contains that emotion or characteristic

The likelihoods are another amazing piece of information returned! I have run about 20 images through this API and every single one has been accurate – very impressive!

I wanted to showcase the face detection and headwear first. Here’s a picture of my wife and I at “The Bean” in Chicago (side note: it’s awesome! I thought it was going to be really silly, but you can really have a lot of fun with all of the angles and reflections):

us_hats_pic <- readImage('us_hats.jpg')
plot(us_hats_pic)

us_hats = getGoogleVisionResponse('us_hats.jpg', feature = 'FACE_DETECTION')
head(us_hats)
## vertices
## 1 295, 410, 410, 295, 164, 164, 297, 297
## 2 353, 455, 455, 353, 261, 261, 381, 381
## vertices
## 1 327, 402, 402, 327, 206, 206, 280, 280
## 2 368, 439, 439, 368, 298, 298, 370, 370
##
## landmarks... landmarks
## rollAngle panAngle tiltAngle detectionConfidence landmarkingConfidence
## 1 7.103324 23.46835 -2.816312 0.9877176 0.7072066
## 2 2.510939 -1.17956 -7.393063 0.9997375 0.7268016
## joyLikelihood sorrowLikelihood angerLikelihood surpriseLikelihood
## 1 VERY_LIKELY VERY_UNLIKELY VERY_UNLIKELY VERY_UNLIKELY
## 2 VERY_LIKELY VERY_UNLIKELY VERY_UNLIKELY VERY_UNLIKELY
## underExposedLikelihood blurredLikelihood headwearLikelihood
## 1 VERY_UNLIKELY VERY_UNLIKELY VERY_LIKELY
## 2 VERY_UNLIKELY VERY_UNLIKELY VERY_LIKELY

us_hats_pic <- readImage('us_hats.jpg')
plot(us_hats_pic)
xs1 = us_hats$fdBoundingPoly$vertices[[1]][1][[1]]
ys1 = us_hats$fdBoundingPoly$vertices[[1]][2][[1]]
xs2 = us_hats$fdBoundingPoly$vertices[[2]][1][[1]]
ys2 = us_hats$fdBoundingPoly$vertices[[2]][2][[1]]
polygon(x=xs1,y=ys1,border='red',lwd=4)
polygon(x=xs2,y=ys2,border='green',lwd=4)

Here’s a shot that should be familiar (copied directly from my last blog) – and I wanted to highlight the different features that can be detected. Look at how many points are perfectly placed:

my_face_pic <- readImage('my_face.jpg')
plot(my_face_pic)

my_face = getGoogleVisionResponse('my_face.jpg', feature = 'FACE_DETECTION')
head(my_face)
## vertices
## 1 456, 877, 877, 456, NA, NA, 473, 473
## vertices
## 1 515, 813, 813, 515, 98, 98, 395, 395
## landmarks
## landmarks ...
## rollAngle panAngle tiltAngle detectionConfidence landmarkingConfidence
## 1 -0.6375801 -2.120439 5.706552 0.996818 0.8222974
## joyLikelihood sorrowLikelihood angerLikelihood surpriseLikelihood
## 1 VERY_LIKELY VERY_UNLIKELY VERY_UNLIKELY VERY_UNLIKELY
## underExposedLikelihood blurredLikelihood headwearLikelihood
## 1 VERY_UNLIKELY VERY_UNLIKELY VERY_UNLIKELY

head(my_face$landmarks)
## [[1]]
## type position.x position.y position.z
## 1 LEFT_EYE 598.7636 192.1949 -0.001859295
## 2 RIGHT_EYE 723.1612 192.4955 -4.805475700
## 3 LEFT_OF_LEFT_EYEBROW 556.1954 165.2836 15.825399000
## 4 RIGHT_OF_LEFT_EYEBROW 628.8224 159.9029 -23.345352000
## 5 LEFT_OF_RIGHT_EYEBROW 693.0257 160.6680 -25.614508000
## 6 RIGHT_OF_RIGHT_EYEBROW 767.7514 164.2806 7.637372000
## 7 MIDPOINT_BETWEEN_EYES 661.2344 185.0575 -29.068363000
## 8 NOSE_TIP 661.9072 260.9006 -74.153710000
## ...

my_face_pic <- readImage('my_face.jpg')
plot(my_face_pic)
xs1 = my_face$fdBoundingPoly$vertices[[1]][1][[1]]
ys1 = my_face$fdBoundingPoly$vertices[[1]][2][[1]]
xs2 = my_face$landmarks[[1]][[2]][[1]]
ys2 = my_face$landmarks[[1]][[2]][[2]]
polygon(x=xs1,y=ys1,border='red',lwd=4)
points(x=xs2,y=ys2,lwd=2, col='lightblue')

Logo Detection

To continue along the Chicago trip, we drove by Wrigley Field and I took a really bad photo of the sign from a moving car as it was under construction. It’s nice because it has a lot of different lines and writing; the Toyota logo isn’t incredibly prominent or necessarily fit to brand colors.

This call returns:

  • Description = Brand name of the logo detected
  • Score = Confidence of prediction accuracy
  • Bounding Poly = (Again) coordinates of the logo
wrigley_image <- readImage('wrigley_text.jpg')
plot(wrigley_image)

wrigley_logo = getGoogleVisionResponse('wrigley_text.jpg', feature = 'LOGO_DETECTION')
head(wrigley_logo)
##           mid description     score                               vertices
## 1 /g/1tk6469q      Toyota 0.3126611 435, 551, 551, 435, 449, 449, 476, 476

wrigley_image <- readImage('wrigley_text.jpg')
plot(wrigley_image)
xs = wrigley_logo$boundingPoly$vertices[[1]][[1]]
ys = wrigley_logo$boundingPoly$vertices[[1]][[2]]
polygon(x=xs,y=ys,border='green',lwd=4)

Text Detection

I’ll continue using the Wrigley Field picture. There is text all over the place and it’s fun to see what is captured and what isn’t. It appears as if the curved text at the top “field” isn’t easily interpreted as text. However, the rest is caught and the words are captured.

The response sent back is a bit more difficult to interpret than the rest of the API calls – it breaks things apart by word but also returns everything as one line. Here’s what comes back:

  • Locale = language, returned as source
  • Description = the text (the first line is everything, and then the rest are individual words)
  • Bounding Poly = I’m sure you can guess by now
wrigley_text = getGoogleVisionResponse('wrigley_text.jpg', feature = 'TEXT_DETECTION')
head(wrigley_text)
## locale
## 1 en
## description
## 1 RIGLEY F\nICHICAGO CUBS\nORDER ONLINE AT GIORDANOS.COM\nTOYOTA\nMIDWEST\nFENCE\n773-722-6616\nCAUTION\nCAUTION\n ORDER
## vertices
## 1 55, 657, 657, 55, 210, 210, 852, 852
## 2 343, 482, 484, 345, 217, 211, 260, 266

wrigley_image <- readImage('wrigley_text.jpg')
plot(wrigley_image)
for(i in 1:length(wrigley_text$boundingPoly$vertices)){
  xs = wrigley_text$boundingPoly$vertices[[i]]$x
  ys = wrigley_text$boundingPoly$vertices[[i]]$y
  polygon(x=xs,y=ys,border='green',lwd=2)
}

That’s about it for the basics of using the Google Vision API with the RoogleVision library. I highly recommend tinkering around with it a bit, especially because it won’t cost you a dime.

While I do enjoy the math under the hood and the thinking required to understand algorithms, I do think these sorts of APIs will become the way of the future for data science. Outside of specific use cases or special industries, it seems hard to imagine wanting to try and create algorithms that would be better than ones created for mass consumption. As long as they’re fast, free and accurate, I’m all about making my life easier! From the hiring perspective, I much prefer someone who can get the job done over someone who can slightly improve performance (as always, there are many cases where this doesn’t apply).

Please comment if you are utilizing any of the Google APIs for business purposes, I would love to hear about it!

As always you can find this on my GitHub



Upcoming Talk at the Bay Area R Users Group (BARUG)

Mon, 07/31/2017 - 17:00

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

Next Tuesday (August 8) I will be giving a talk at the Bay Area R Users Group (BARUG). The talk is titled Beyond Popularity: Monetizing R Packages. Here is an abstract of the talk:

This talk will cover my initial foray into monetizing an open source R package. In 2015, a year after publishing the initial version of choroplethr on CRAN, I made a concerted effort to try and monetize the package. In this workshop you will learn:

1. Why I decided to monetize choroplethr.
2. The monetization strategy I used.
3. Three tactics that can help you monetize your own R package.

I first gave this talk at the San Francisco EARL conference in June. The talk was well-received there, and I am looking forward to giving it again!

The talk is free to attend, but requires registration.

Learn More Here

The post Upcoming Talk at the Bay Area R Users Group (BARUG) appeared first on AriLamstein.com.



sparklyr 0.6

Mon, 07/31/2017 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re excited to announce a new release of the sparklyr package, available in CRAN today! sparklyr 0.6 introduces new features to:

  • Distribute R computations using spark_apply() to execute arbitrary R code across your Spark cluster. You can now use all of your favorite R packages and functions in a distributed context.
  • Connect to External Data Sources using spark_read_source(), spark_write_source(), spark_read_jdbc() and spark_write_jdbc().
  • Use the Latest Frameworks including dplyr 0.7, DBI 0.7, RStudio 1.1 and Spark 2.2.

and several improvements across:

  • Spark Connections add a new Databricks connection that enables using sparklyr in Databricks through mode="databricks", add support for Yarn Cluster through master="yarn-cluster" and connection speed was also improved.
  • Dataframes add support for sdf_pivot(), sdf_broadcast(), cbind(), rbind(), sdf_separate_column(), sdf_bind_cols(), sdf_bind_rows(), sdf_repartition() and sdf_num_partitions().
  • Machine Learning adds support for multinomial regression in ml_logistic_regression(), weights.column for GLM, ml_model_data() and a new ft_count_vectorizer() function for ml_lda().
  • Many other improvements: initial support for broom over ml_linear_regression() and ml_generalized_linear_regression(); dplyr support for %like%, %rlike% and %regexp%; sparklyr extensions now support download_scalac() to help you install the required Scala compilers while developing extensions; and Hive database management got simplified with tbl_change_db() and src_databases() to query and switch between Hive databases. RStudio also started a joint effort with Microsoft to support a cross-platform Spark installer under github.com/rstudio/spark-install.

Additional changes and improvements can be found in the sparklyr NEWS file.

Updated documentation and examples are available at spark.rstudio.com. For questions or feedback, please feel free to open a sparklyr github issue or a sparklyr stackoverflow question.

Distributed R

sparklyr 0.6 provides support for executing distributed R code through spark_apply(). For instance, after connecting and copying some data:

library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris)

We can apply an arbitrary R function, say jitter(), to each column over each row as follows:

iris_tbl %>% spark_apply(function(e) sapply(e[,1:4], jitter))
## # Source: table [?? x 4]
## # Database: spark_connection
## Sepal_Length Sepal_Width Petal_Length Petal_Width
##
## 1 5.102223 3.507372 1.406654 0.1990680
## 2 4.900148 3.002006 1.396052 0.2002922
## 3 4.699807 3.204126 1.282652 0.2023850
## 4 4.618854 3.084675 1.508538 0.2119644
## 5 4.985322 3.596079 1.388837 0.1846146
## 6 5.381947 3.881051 1.686600 0.3896673
## 7 4.613713 3.400265 1.404120 0.2954829
## 8 4.995116 3.408897 1.493193 0.1945901
## 9 4.418538 2.916306 1.392230 0.1976161
## 10 4.891340 3.096591 1.498078 0.1174069
## # ... with 140 more rows

One can also group by columns to apply an operation over each group of rows, say, to perform linear regression over each group as follows:

spark_apply(
  iris_tbl,
  function(e) broom::tidy(lm(Petal_Width ~ Petal_Length, e)),
  names = c("term", "estimate", "std.error", "statistic", "p.value"),
  group_by = "Species"
)
## # Source: table [?? x 6]
## # Database: spark_connection
## Species term estimate std.error statistic p.value
##
## 1 versicolor (Intercept) -0.08428835 0.16070140 -0.5245029 6.023428e-01
## 2 versicolor Petal_Length 0.33105360 0.03750041 8.8279995 1.271916e-11
## 3 virginica (Intercept) 1.13603130 0.37936622 2.9945505 4.336312e-03
## 4 virginica Petal_Length 0.16029696 0.06800119 2.3572668 2.253577e-02
## 5 setosa (Intercept) -0.04822033 0.12164115 -0.3964146 6.935561e-01
## 6 setosa Petal_Length 0.20124509 0.08263253 2.4354220 1.863892e-02

Packages can be used since they are automatically distributed to the worker nodes; however, using spark_apply() requires R to be installed over each worker node. Please refer to Distributing R Computations for additional information and examples.

External Data Sources

sparklyr 0.6 adds support for connecting Spark to databases. This feature is useful if your Spark environment is separate from your data environment, or if you use Spark to access multiple data sources. You can use spark_read_source() and spark_write_source() with any data connector available in Spark Packages. Alternatively, you can use spark_read_jdbc() and spark_write_jdbc() and a JDBC driver with almost any data source.

For example, you can connect to Cassandra using spark_read_source(). Notice that the Cassandra connector version needs to match the Spark version as defined in their version compatibility section.

config <- spark_config()
config[["sparklyr.defaultPackages"]] <- c(
  "datastax:spark-cassandra-connector:2.0.0-RC1-s_2.11")
sc <- spark_connect(master = "local", config = config)

spark_read_source(sc, "emp", "org.apache.spark.sql.cassandra",
                  list(keyspace = "dev", table = "emp"))

To connect to MySQL, one can download the MySQL connector and use spark_read_jdbc() as follows:

config <- spark_config()
config$`sparklyr.shell.driver-class-path` <-
  "~/Downloads/mysql-connector-java-5.1.41/mysql-connector-java-5.1.41-bin.jar"
sc <- spark_connect(master = "local", config = config)

spark_read_jdbc(sc, "person_jdbc", options = list(
  url = "jdbc:mysql://localhost:3306/sparklyr",
  user = "root", password = "",
  dbtable = "person"))

See also crassy, a sparklyr extension being developed to read data from Cassandra with ease.

Dataframe Functions

sparklyr 0.6 includes many improvements for working with DataFrames. Here are some additional highlights.

x_tbl <- sdf_copy_to(sc, data.frame(a = c(1,2,3), b = c(2,3,4)))
y_tbl <- sdf_copy_to(sc, data.frame(b = c(3,4,5), c = c("A","B","C")))

Pivoting DataFrames

It is now possible to pivot (i.e. cross tabulate) one or more columns using sdf_pivot().

sdf_pivot(y_tbl, b ~ c, list(b = "count"))
## # Source:   table [?? x 4]
## # Database: spark_connection
##       b     A     B     C
## 1     4   NaN     1   NaN
## 2     3     1   NaN   NaN
## 3     5   NaN   NaN     1

Binding Rows and Columns

Binding DataFrames by rows and columns is supported through sdf_bind_rows() and sdf_bind_cols():

sdf_bind_rows(x_tbl, y_tbl)
## # Source:   table [?? x 3]
## # Database: spark_connection
##       a     b     c
## 1     1     2
## 2     2     3
## 3     3     4
## 4   NaN     3     A
## 5   NaN     4     B
## 6   NaN     5     C

sdf_bind_cols(x_tbl, y_tbl)
## # Source:   lazy query [?? x 4]
## # Database: spark_connection
##       a   b.x   b.y     c
## 1     1     2     3     A
## 2     3     4     5     C
## 3     2     3     4     B

Separating Columns

Separate lists into columns with ease. This is especially useful when working with model predictions that are returned as lists instead of scalars. In this example, each element in the probability column contains two items. We can use sdf_separate_column() to isolate the item that corresponds to the probability that vs equals one.

library(magrittr)

mtcars[, c("vs", "mpg")] %>%
  sdf_copy_to(sc, .) %>%
  ml_logistic_regression("vs", "mpg") %>%
  sdf_predict() %>%
  sdf_separate_column("probability", list("P[vs=1]" = 2))
## # Source:   table [?? x 7]
## # Database: spark_connection
##       vs   mpg id58fb64e07a38 rawPrediction probability prediction
## 1      0  21.0              0                                    1
## 2      0  21.0              1                                    1
## 3      1  22.8              2                                    1
## 4      1  21.4              3                                    1
## 5      0  18.7              4                                    0
## 6      1  18.1              5                                    0
## 7      0  14.3              6                                    0
## 8      1  24.4              7                                    1
## 9      1  22.8              8                                    1
## 10     1  19.2              9                                    0
## # ... with 22 more rows, and 1 more variables: `P[vs=1]`

Machine Learning

Multinomial Regression

sparklyr 0.6 adds support for multinomial regression for Spark 2.1.0 or higher:

iris_tbl %>%
  ml_logistic_regression("Species",
                         features = c("Sepal_Length", "Sepal_Width"))
## Call: Species ~ Sepal_Length + Sepal_Width
##
## Coefficients:
##      (Intercept) Sepal_Length Sepal_Width
## [1,]   -201.5540     73.19269   -59.83942
## [2,]   -214.6001     75.09506   -59.43476
## [3,]    416.1541   -148.28775   119.27418

Improved Text Mining with LDA

ft_tokenizer() was introduced in sparklyr 0.5 but sparklyr 0.6 introduces ft_count_vectorizer() and vocabulary.only to simplify LDA:

library(janeaustenr)

lines_tbl <- sdf_copy_to(sc, austen_books()[c(1,3),])

lines_tbl %>%
  ft_tokenizer("text", "tokens") %>%
  ft_count_vectorizer("tokens", "features") %>%
  ml_lda("features", k = 4)
## An LDA model fit on 1 features
##
## Topics Matrix:
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.8996952 0.8569472 1.1249431 0.9366354
## [2,] 0.9815787 1.1721218 1.0795771 0.9990090
## [3,] 1.1738678 0.8600233 0.9864862 0.9573433
## [4,] 0.8951603 1.0730703 0.9562389 0.8899160
## [5,] 1.0528748 1.0291708 1.0699833 0.8731401
## [6,] 1.1857015 1.0441299 1.0987713 1.1247574
##
## Estimated Document Concentration:
## [1] 13.52071 13.52172 13.52060 13.51963

The vocabulary can be printed with:

lines_tbl %>%
  ft_tokenizer("text", "tokens") %>%
  ft_count_vectorizer("tokens", "features", vocabulary.only = TRUE)
## [1] "jane"        "sense"       "austen"      "by"          "sensibility"
## [6] "and"

That’s all for now, disconnecting:

spark_disconnect(sc)

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

Data visualization with googleVis exercises part 9

Sun, 07/30/2017 - 18:26

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Histogram & Calendar chart

This is part 9 of our series, and we are going to explore the features of two interesting chart types that googleVis provides: histogram and calendar charts.

Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let's begin!

Answers to the exercises are available here.

Package & Data frame

As you already know, the first thing you have to do is install and load the googleVis package with:
install.packages("googleVis")
library(googleVis)

To run this example we will first create an experimental data frame with:
Hist=data.frame(A=rpois(100, 10),
B=rpois(100, 20),
C=rpois(100, 30))

NOTE: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page. All charts require an Internet connection.

Histogram

It is quite simple to create a Histogram with googleVis. We will use the “Hist” data frame we just created. You can see the variables of your data frame with head().
Look at the example below to create a simple histogram:
HistC <- gvisHistogram(Hist)
plot(HistC)

Exercise 1

Create a list named “HistC” and pass to it the “Hist” data frame as a histogram. HINT: Use gvisHistogram().

Exercise 2

Plot the histogram. HINT: Use plot().

Options

To add a legend to your chart you can use:
options=list(
legend="{ position: 'top' }")

Exercise 3

Add a legend to the bottom of your histogram and plot it. HINT: Use list().

To decide the colours of your bars you can use:
options=list(
colors="['black', 'green', 'yellow']")
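To see how such option lists plug into a chart call, here is a minimal sketch of my own (reusing the “Hist” data frame created above; the option values are simply the ones shown in the snippets, not the exercise answers):

# Sketch: passing an options list to gvisHistogram()
HistC <- gvisHistogram(Hist,
                       options = list(
                         legend = "{ position: 'top' }",
                         colors = "['black', 'green', 'yellow']",
                         width = 400, height = 400))
plot(HistC)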

Learn more about using GoogleVis in the online course Mastering in Visualization with R programming. In this course you will learn how to:

  • Work extensively with the GoogleVis package and its functionality
  • Learn what visualizations exist for your specific use case
  • And much more

Exercise 4

Change the colours of the histogram’s bars to red, green and blue and plot it. HINT: Use colors.

To set the dimensions of your histogram you can use:
options=list(
width=400, height=400)

Exercise 5

Set width of your histogram to 500, its height to 400 and plot it.

Calendar chart

It is quite simple to create a Calendar Chart with googleVis. We will use the “Cairo” data set. You can see the variables of “Cairo” with head().
Look at the example below to create a simple calendar chart:
CalC <- gvisCalendar(Cairo)
plot(CalC)

Exercise 6

Create a list named “CalC” and pass to it the “Cairo” data set as a calendar chart. HINT: Use gvisCalendar().

Exercise 7

Plot the calendar chart. HINT: Use plot().

Options

You can add title to your chart and set the dimensions with:
options=list(
title="Title",
height=400)

Exercise 8

Add a title to your calendar chart, set height to 500 and plot it. HINT: Use list().

You can change the features of your labels with:
options=list(calendar="{yearLabel: { fontName: 'Times-Roman',
fontSize: 26, color: 'black', bold: false}}")

Exercise 9

Add labels to your chart, set the font of your labels to “Times-Roman”, their size to 30, their color to black, make them bold, and plot the chart.

To set more options for the cells, you can use:
cellSize: 15,
cellColor: { stroke: 'red', strokeOpacity: 0.5 },
focusedCellColor: {stroke:'red'}
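As a rough sketch of how these cell settings fit into a full call (my own illustration; they go inside the calendar string, and the title and height values are only examples):

CalC <- gvisCalendar(Cairo,
                     options = list(
                       title = "Example title",
                       height = 400,
                       calendar = "{ cellSize: 15,
                                     cellColor: { stroke: 'red', strokeOpacity: 0.5 },
                                     focusedCellColor: { stroke: 'red' } }"))
plot(CalC)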

Exercise 10

Set the size of the cells to 10, the focused color to green and plot the chart.

Related exercise sets:
  1. Data Visualization with googleVis exercises part 2
  2. Data Visualization with googleVis exercises part 4
  3. Data Visualization with googleVis exercises part 3
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Matching, Optimal Transport and Statistical Tests

Sun, 07/30/2017 - 14:54

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

To explain the “optimal transport” problem, we usually start with Gaspard Monge’s “Mémoire sur la théorie des déblais et des remblais”, which considers the problem of transporting a given distribution of matter (a pile of sand, for instance) into another (an excavation, for instance). This problem is usually formulated using distributions, and we seek the “optimal” transport from one distribution to the other. The formulation in the context of distributions was given in the 1940s by Leonid Kantorovich, e.g. from the distribution on the left to the distribution on the right.

Consider now the context of finite sets of points. We want to transport mass from one set of points to another set of points. It is a complicated combinatorial problem: for 4 points on each side there are only 24 possible assignments to consider, but the number exceeds 20 billion with 15 points on each side. For instance, the following one is usually seen as inefficient

while the following is usually seen as much better

Of course, it depends on the cost of the transport, which depends on the distance between the origin and the destination. That cost is usually either linear or quadratic.
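To make the combinatorial claim concrete, here is a small sketch of my own (not from the original post) that enumerates all 4! = 24 assignments between two sets of 4 points and picks the one with the smallest quadratic cost:

# Brute-force optimal matching for 4 points on each side
set.seed(1)
x <- matrix(rnorm(8), ncol = 2)   # 4 origin points
y <- matrix(rnorm(8), ncol = 2)   # 4 destination points

perms <- function(v) {            # all permutations of a vector
  if (length(v) == 1) return(matrix(v, 1, 1))
  do.call(rbind, lapply(seq_along(v), function(i)
    cbind(v[i], perms(v[-i]))))
}

P <- perms(1:4)                                        # 24 x 4 matrix of assignments
quad_cost <- function(p) sum(rowSums((x - y[p, ])^2))  # quadratic transport cost
costs <- apply(P, 1, quad_cost)
P[which.min(costs), ]                                  # the optimal assignment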

There are many applications of optimal transport in economics; see e.g. Alfred Galichon’s book Optimal Transport Methods in Economics. And there are also applications in statistics, which is what I discussed with Pierre while I was in Boston in June. For instance, suppose we want to test whether two samples were drawn from the same distribution,

set.seed(13)
npoints <- 25
mu1 <- c(1,1)
mu2 <- c(0,2)
Sigma1 <- diag(1, 2, 2)
Sigma2 <- diag(1, 2, 2)
Sigma2[2,1] <- Sigma2[1,2] <- -0.5
Sigma1 <- 0.4 * Sigma1
Sigma2 <- 0.4 *Sigma2
library(mnormt)
X1 <- rmnorm(npoints, mean = mu1, Sigma1)
X2 <- rmnorm(npoints, mean = mu2, Sigma2)
plot(X1[,1], X1[,2], col = "blue")
points(X2[,1], X2[,2], col = "red")

Here we use a parametric model to generate our sample (as always), and we might think of a parametric test (testing whether mean and variance parameters of the two distributions are equal).

or we might prefer a nonparametric test. The idea Pierre mentioned was based on optimal transport. Consider some quadratic loss

ground_p <- 2
p <- 1
w1 <- rep(1/npoints, npoints)
w2 <- rep(1/npoints, npoints)
C <- cost_matrix_Lp(t(X1), t(X2), ground_p)
library(transport)
a <- transport(w1, w2, costm = C^p, method = "shortsimplex")

then it is possible to match points in the two samples

nonzero <- which(a$mass != 0)
from_indices <- a$from[nonzero]
to_indices <- a$to[nonzero]
for (i in seq_along(from_indices)){
segments(X1[from_indices[i],1], X1[from_indices[i],2], X2[to_indices[i],1], X2[to_indices[i],2])
}

Here we can observe two things. The total cost can be seen as rather large

> cost=function(a,X1,X2){
nonzero <- which(a$mass != 0)
naa=a[nonzero,]
d=function(i) (X1[naa$from[i],1]-X2[naa$to[i],1])^2+(X1[naa$from[i],2]-X2[naa$to[i],2])^2
sum(Vectorize(d)(1:npoints))
}
> cost(a,X1,X2)
[1] 9.372472

and the angle of the transport direction is always (more or less) the same

> angle=function(a,X1,X2){
nonzero <- which(a$mass != 0)
naa=a[nonzero,]
d=function(i) (X1[naa$from[i],2]-X2[naa$to[i],2])/(X1[naa$from[i],1]-X2[naa$to[i],1])
atan(Vectorize(d)(1:npoints))
}
> mean(angle(a,X1,X2))
[1] -0.3266797

> library(plotrix)
> ag=(angle(a,X1,X2)/pi)*180
> ag[ag<0]=ag[ag<0]+360
> dag=hist(ag,breaks=seq(0,361,by=1)-.5)
> polar.plot(dag$counts,seq(0,360,by=1),main="Test Polar Plot",lwd=3,line.col=4)

(actually, the following plot has been obtained by generating a thousand samples of size 25)

In order to have a decent test, we need to see what happens under the null assumption, i.e. when drawing both samples from the same distribution.

Here is the optimal matching

Here is the distribution of the total cost, when drawing a thousand samples,

VC=rep(NA,1000)
VA=rep(NA,1000*npoints)
for(s in 1:1000){
X1a <- rmnorm(npoints, mean = mu1, Sigma1)
X1b <- rmnorm(npoints, mean = mu1, Sigma2)
ground_p <- 2
p <- 1
w1 <- rep(1/npoints, npoints)
w2 <- rep(1/npoints, npoints)
C <- cost_matrix_Lp(t(X1a), t(X1b), ground_p)
ab <- transport(w1, w2, costm = C^p, method = "shortsimplex")
VC[s]=cost(ab,X1a,X1b)
VA[s*npoints-(0:(npoints-1))]=angle(ab,X1a,X1b)
}
plot(density(VC))

So our cost of 9 obtained initially was not that high. Observe that when drawing from the same distribution, there is now no pattern in the optimal transport

ag=(VA/pi)*180
ag[ag<0]=ag[ag<0]+360
dag=hist(ag,breaks=seq(0,361,by=1)-.5)
polar.plot(dag$counts,seq(0,360,by=1),main="Test Polar Plot",lwd=3,line.col=4)

 

Nice, isn’t it? I guess I will spend some time next year working on those transport algorithms, since we have great R packages, and hundreds of applications in economics…


To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

Scripting for data analysis (with R)

Sun, 07/30/2017 - 09:55

(This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers)

Course materials (GitHub)

This was a PhD course given in the spring of 2017 at Linköping University. The course was organised by the graduate school Forum scientium and was aimed at people who might be interested in using R for data analysis. The materials developed from a part of a previous PhD course from a couple of years ago, an R tutorial given as part of the Behaviour genetics Masters course, and the Wright lab computation lunches.

Around twenty people attended the seminars, and a couple of handfuls of people completed the homeworks. I don’t know how much one should read into the course evaluation form, but the feedback was mostly positive. Some people had previous exposure to R, and did the first homework in an hour. Others had never programmed in any language, and had a hard time getting started.

There is certainly scope for improvement. For example, some of the packages used could be substituted for more contemporary tools. One could say that the course is slouching towards the tidyverse. But I worry a bit about making the participants feel too boxed in. I don’t want them to feel that they’re taught a way that will solve some anticipated type of problems very neatly, but that may not generalize. Once I’ve made the switch to dplyr and tidyr (and maybe even purrr … though I hesitate) fully myself, I would probably use them in teaching too. Another nice plus would be to be able to use R for Data Science as course literature. The readings now are scattered; maybe a monolithic book would be good.

I’ve tried, in every iteration, to emphasize the importance of writing scripts, even when working interactively with R. I still think I need to emphasize it even more. There is also a kind of ”do as I say, not as I do” issue, since in the seminars, I demo some things by just typing them into the console. I’ll force myself to write them into a script instead.

Possible alternate flavours for the course include: a longer version expanding on the same topics. I don’t think one should cram more content in. I’d like to have actual projects where the participants can analyze, visualize and present data and simulations.

This is the course plan we sent out:

1. A crash course in R

Why do data analysis with a scripting language
The RStudio interface
Using R as a calculator
Working interactively and writing code
Getting help
Reading and looking at data
Installing useful packages
A first graph with ggplot2

Homework for next time: The Unicorn Dataset, exercises in reading data, descriptive statistics, linear models and a few statistical graphs.

2. Programming for data analysis

Programming languages one may encounter in science
Common concepts and code examples
Data structures in R
Vectors
Data frames
Functions
Control flow

Homework for next time: The Unicorn Expression Dataset, exercises in data wrangling and more interesting graphs.

3. Working with moderately large data

Exercise followup
More about functions
Lists
Objects
Functional and imperative programming
Doing things many times, loops and plyr
Simulating data
Working on a cluster

Final homework: Design analysis by simulation: pick a data analysis project that you care about; simulate data based on a model and reasonable effect size; implement the data analysis; and apply it to simulated data with and without effects to estimate power and other design characteristics. This ties together skills from all seminars.
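As a rough sketch of what such a final project might look like (my own illustration, not course material), here is a simulation-based power estimate for a simple two-group comparison with an assumed effect size:

# Estimate power and type I error by simulation
simulate_once <- function(n = 30, effect = 0.5) {
  group <- rep(c("control", "treatment"), each = n)
  y <- rnorm(2 * n, mean = ifelse(group == "treatment", effect, 0))
  t.test(y ~ group)$p.value
}

pvals_effect <- replicate(1000, simulate_once(effect = 0.5))
pvals_null   <- replicate(1000, simulate_once(effect = 0))

mean(pvals_effect < 0.05)   # estimated power
mean(pvals_null < 0.05)     # estimated type I error rate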

Posted in: computer stuff, data analysis, english. Tagged: ggplot2, plyr, R


To leave a comment for the author, please follow the link and comment on their blog: R – On unicorns and genes.

Understanding Overhead Issues in Parallel Computation

Sun, 07/30/2017 - 03:14

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

In my talk at useR! earlier this month, I emphasized the fact that a major impediment to obtaining good speed from parallelizing an algorithm is systems overhead of various kinds, including:

  • Contention for memory/network.
  • Bandwidth limits — CPU/memory, CPU/network, CPU/GPU.
  • Cache coherency problems.
  • Contention for I/O ports.
  • OS and/or R limits on number of sockets (network connections).
  • Serialization.
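A small, self-contained illustration of the point (my own sketch, not from the talk or the book): when each task is tiny, the cost of setting up workers and serializing data back and forth can easily outweigh any parallel speedup.

library(parallel)

cl <- makeCluster(2)                            # two worker processes
small_task <- function(i) sqrt(i)               # trivially cheap per-iteration work

system.time(lapply(1:1e4, small_task))          # sequential baseline
system.time(parLapply(cl, 1:1e4, small_task))   # often slower: overhead dominates

stopCluster(cl)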

During the Q&A at the end, one person in the audience asked how R programmers without a computer science background might acquire this information. A similar question was posed here today by a reader on this blog, to which I replied,

That question was asked in my talk. I answered by saying that I have an introduction to such things in my book, but that this is not enough. One builds this knowledge in haphazard ways, e.g. by search terms like “cache miss” and “network latency” on the Web, and above all, by giving it careful thought and reasoning things out. (When Nobel laureate Richard Feynman was a kid, someone said in awe, “He fixes radios by thinking!”)

Join an R Users Group, if there is one in your area. (And if not, then start one!) Talk about these things with them (though if you follow my above advice, you may find you soon know more than they do).

The book I was referring to was Parallel Computing for Data Science: With Examples in R, C++ and CUDA (Chapman & Hall/CRC, The R Series, June 2015).

I have decided that the topic of system overhead issues in parallel computation is important enough for me to place Chapter 2 on the Web, which I have now done. Enjoy. I’d be happy to answer your questions (of a general nature, not on your specific code).

We are continuing to add more features to our R parallel computation package, partools. Watch this space for news!

By the way, the useR! 2017 videos are now on the Web, including my talk on parallel computing.


To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.

Memorable dataviz with the R program, talk awarded people’s choice prize

Sun, 07/30/2017 - 02:00

(This article was first published on Bluecology blog, and kindly contributed to R-bloggers)

“Memorable dataviz with the R program” awarded people’s choice prize

For the past two years Dr Nick Hamilton has invited me to give a talk on creating data visuals with the R program at the wonderful UQ Winterschool in Bioinformatics.

This year I was lucky enough to be awarded a prize for my talk – best speaker from a mid-career presenter, as voted by the audience.

Nick and the UQ Winterschool team have been kind enough to post my talk on Vimeo, so I am sharing it here in the hope that others find it useful. You can also get all the talk notes (and code) on my blog here.

I think it is something of a feat to have a talk win a people’s choice award, when that talk is fundamentally about computer programming. The talk’s success speaks not to my own presentation skill, but rather to the huge interest in the R program.

I also borrowed a lot of neat ideas from people much smarter than me (all duly acknowledged), like the Datasaurus (thanks Alberto Cairo). So check out the video; I hope you find it insightful, and a bit entertaining.


To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog.

Forecasting workshop in Perth

Sun, 07/30/2017 - 02:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

On 26-28 September 2017, I will be running my 3-day workshop in Perth on “Forecasting: principles and practice” based on my book of the same name.
Topics to be covered include seasonality and trends, exponential smoothing, ARIMA modelling, dynamic regression and state space models, as well as forecast accuracy methods and forecast evaluation techniques such as cross-validation.
Workshop participants are expected to be familiar with basic statistical tools such as multiple regression, but no knowledge of time series or forecasting will be assumed.


To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman.

More documentation for Win-Vector R packages

Sun, 07/30/2017 - 00:43

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

The Win-Vector public R packages now all have new pkgdown documentation sites! (And, a thank-you to Hadley Wickham for developing the pkgdown tool.)

Please check them out (hint: vtreat is our favorite).

The package sites:

For more on all of these packages, please see the Win-Vector blog.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Updated overbought/oversold plot function

Sat, 07/29/2017 - 19:34

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A good six years ago I blogged about plotOBOS() which charts a moving average (from one of several available variants) along with shaded standard deviation bands. That post has a bit more background on the why/how and motivation, but as a teaser here is the resulting chart of the SP500 index (with ticker ^GSPC):

 

The code uses a few standard finance packages for R (with most of them maintained by Joshua Ulrich given that Jeff Ryan, who co-wrote chunks of these, is effectively retired from public life). Among these, xts had a recent release reflecting changes which occurred during the four (!!) years since the previous release, and covering at least two GSoC projects. With that came subtle API changes: something we all generally try to avoid but which is at times the only way forward. In this case, the shading code I used (via polygon() from base R) no longer cooperated with the beefed-up functionality of plot.xts(). Luckily, Ross Bennett incorporated that same functionality into a new function addPolygon — which even credits this same post of mine.

With that, the updated code becomes

## plotOBOS -- displaying overbought/oversold as e.g. in Bespoke's plots
##
## Copyright (C) 2010 - 2017  Dirk Eddelbuettel
##
## This is free software: you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 2 of the License, or
## (at your option) any later version.

suppressMessages(library(quantmod))     # for getSymbols(), brings in xts too
suppressMessages(library(TTR))          # for various moving averages

plotOBOS <- function(symbol, n=50, type=c("sma", "ema", "zlema"),
                     years=1, blue=TRUE, current=TRUE,
                     title=symbol, ticks=TRUE, axes=TRUE) {

    today <- Sys.Date()
    if (class(symbol) == "character") {
        X <- getSymbols(symbol, from=format(today-365*years-2*n), auto.assign=FALSE)
        x <- X[,6]                              # use Adjusted
    } else if (inherits(symbol, "zoo")) {
        x <- X <- as.xts(symbol)
        current <- FALSE                        # don't expand the supplied data
    }

    n <- min(nrow(x)/3, 50)                     # as we may not have 50 days

    sub <- ""
    if (current) {
        xx <- getQuote(symbol)
        xt <- xts(xx$Last, order.by=as.Date(xx$`Trade Time`))
        colnames(xt) <- paste(symbol, "Adjusted", sep=".")
        x <- rbind(x, xt)
        sub <- paste("Last price: ", xx$Last, " at ",
                     format(as.POSIXct(xx$`Trade Time`), "%H:%M"), sep="")
    }

    type <- match.arg(type)
    xd <- switch(type,                          # compute xd as the central location via selected MA smoother
                 sma = SMA(x,n),
                 ema = EMA(x,n),
                 zlema = ZLEMA(x,n))
    xv <- runSD(x, n)                           # compute xv as the rolling volatility

    strt <- paste(format(today-365*years), "::", sep="")
    x <- x[strt]                                # subset plotting range using xts' nice functionality
    xd <- xd[strt]
    xv <- xv[strt]

    xyd <- xy.coords(.index(xd),xd[,1])         # xy coordinates for direct plot commands
    xyv <- xy.coords(.index(xv),xv[,1])

    n <- length(xyd$x)
    xx <- xyd$x[c(1,1:n,n:1)]                   # for polygon(): from first point to last and back

    if (blue) {
        blues5 <- c("#EFF3FF", "#BDD7E7", "#6BAED6", "#3182BD", "#08519C")  # cf brewer.pal(5, "Blues")
        fairlylight <<- rgb(189/255, 215/255, 231/255, alpha=0.625)  # aka blues5[2]
        verylight <<- rgb(239/255, 243/255, 255/255, alpha=0.625)    # aka blues5[1]
        dark <<- rgb(8/255, 81/255, 156/255, alpha=0.625)            # aka blues5[5]
        ## buglet in xts 0.10-0 requires the <<- here
    } else {
        fairlylight <<- rgb(204/255, 204/255, 204/255, alpha=0.5)    # two suitable grays, alpha-blending at 50%
        verylight <<- rgb(242/255, 242/255, 242/255, alpha=0.5)
        dark <<- 'black'
    }

    plot(x, ylim=range(range(x, xd+2*xv, xd-2*xv, na.rm=TRUE)),
         main=title, sub=sub, major.ticks=ticks, minor.ticks=ticks, axes=axes)  # basic xts plot setup
    addPolygon(xts(cbind(xyd$y+xyv$y, xyd$y+2*xyv$y), order.by=index(x)), on=1, col=fairlylight)  # upper band
    addPolygon(xts(cbind(xyd$y-xyv$y, xyd$y+1*xyv$y), order.by=index(x)), on=1, col=verylight)    # center band
    addPolygon(xts(cbind(xyd$y-xyv$y, xyd$y-2*xyv$y), order.by=index(x)), on=1, col=fairlylight)  # lower band
    lines(xd, lwd=2, col=fairlylight)           # central smoothed location
    lines(x, lwd=3, col=dark)                   # actual price, thicker
}

and the main change are the three calls to addPolygon. To illustrate, we call plotOBOS("SPY", years=2) with an updated plot of the ETF representing the SP500 over the last two years:

 

Comments and further enhancements welcome!

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

R Markdown exercises part 1

Sat, 07/29/2017 - 13:35

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

INTRODUCTION

R Markdown is one of the most popular data science tools; it is used to save and execute code and to create excellent reports which are easily shareable.

The documents that R Markdown provides are fully reproducible and support a wide variety of static and dynamic output formats.

R Markdown uses markdown syntax, which provides an easy way of creating documents that can be converted to many other file types, while embedding R code in the report, so it is not necessary to keep the report and the R script separately. Furthermore, the report is written as normal text, so knowledge of HTML is not required. And no additional files are needed, because everything is incorporated in the HTML file.
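For orientation, here is a minimal sketch of what an .Rmd file looks like (my own illustration; the header fields and chunk contents are just placeholders):

---
title: "My report"
author: "Author Name"
date: "2017-07-29"
output: html_document
---

Some narrative text, followed by an embedded R code chunk:

```{r}
summary(cars)
```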

Before proceeding, please follow our short tutorial.

Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R, without looking at the answers. Finally, check the solutions to verify your answers.

Exercise 1

Create a new R Markdown file (.Rmd) in RStudio.

Exercise 2

Insert a YAML Header with title, author and date of your choice at the top of your .Rmd script.

Exercise 3

Display the summary of “cars” dataset in your report. HINT: Use summary().

Exercise 4

Make a plot of the “cars” dataset under the summary you just created. HINT: Use plot().

Exercise 5

Create a small experimental data frame and display it in your report. HINT: Use data.frame().

Learn more about reporting your results in the online course: R for Data Science Solutions. In this course you will learn how to:

  • Build a complete workflow in R for your data science problem
  • Get in-depth guidance on how to report your results in an interactive way
  • And much more

Exercise 6

Hide the code from your report. HINT: Use echo.

Exercise 7

Load the package “knitr” in your .Rmd file and hide the code chunk. HINT: Use echo.

Exercise 8

Hide the warning message that appeared in your report. HINT: Use warning.

Exercise 9

Set fig.width and fig.height of your plot to 5.

Exercise 10

Change the file format of your plot from .png to .svg. HINT: Use dev.
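For reference, here is a rough sketch (my own, not from the exercise set) of how the chunk options touched on in Exercises 6-10 can be combined in a single chunk header; the chunk label and body are placeholders:

```{r cars-plot, echo=FALSE, warning=FALSE, fig.width=5, fig.height=5, dev='svg'}
plot(cars)
```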

Related exercise sets:
  1. How to create reports with R Markdown in RStudio
  2. Building Shiny App exercises part 1
  3. Shiny Applications Layouts Exercises (Part-6)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Stan Weekly Roundup, 28 July 2017

Fri, 07/28/2017 - 21:00

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

Here’s the roundup for this past week.

  • Michael Betancourt added case studies for methodology in both Python and R, based on the work he did getting the ML meetup together.

  • Michael Betancourt, along with Mitzi Morris, Sean Talts, and Jonah Gabry taught the women in ML workshop at Viacom in NYC and there were 60 attendees working their way up from simple linear regression, through Poisson regression to GPs.

  • Ben Goodrich has been working on new R^2 analyses and priors, as well as the usual maintenance on RStan and RStanArm.

  • Aki Vehtari was at the summer school in Valencia teaching Stan.

  • Aki has also been kicking off planning for StanCon in Helsinki 2019. Can’t believe we’re planning that far ahead!

  • Sebastian Weber was in Helsinki giving a talk on Stan, but there weren’t many Bayesians there to get excited about Stan; he’s otherwise been working with Aki on variable selection.

  • Imad Ali is finishing up the spatial models in RStanArm and moving on to new classes of models (we all know his goal is to model basketball, which is a very spatially continuous game!).

  • Ben Bales has been working on generic append array functions and vectorizing random number generators. We learned his day job was teaching robotics with lego to mechanical engineering students!

  • Charles Margossian is finishing up the algebraic solvers (very involved autodiff issues there, as with the ODE solvers) and wrapping up a final release of Torsten before he moves to Columbia to start the Ph.D. program in stats. He’s also writing the mixed solver paper with feedback from Michael Betancourt and Bill Gillespie.

  • Mitzi Morris added runtime warning messages for problems arising in declarations, which inadvertently fixed another bug arising for declarations with sizes for which constraints couldn’t be satisfied (as in size zero simplexes).

  • Miguel Benito, along with Mitzi Morris and Dan Simpson, with input from Michael Betancourt and Andrew Gelman, now have spatial models with matching results across GeoBUGS, INLA, and Stan. They further worked on better priors for Stan so that it’s now competitive in fitting; it turns out the sum-to-zero constraint on the spatial random effects hurt the geometry more than it helped identifiability.

  • Michael Andreae resubmitted papers with Ben Goodrich and Jonah Gabry and is working on some funding prospects.

  • Sean Talts (with help from Daniel Lee) has most of the C++11/C++14 dev ops in place so we’ll be able to start using all those cool toys.

  • Sean Talts and Michael Betancourt with some help from Mitzi Morris, have been doing large-scale Cook-Gelman-Rubin evaluations for simple and hierarchical models and finding some surprising results (being discussed on Discourse). My money’s on them getting to the bottom of what’s going on soon; Dan Simpson’s jumping in to help out on diagnostics, in the same thread on Discourse.
  • Aki Vehtari reports that Amazon UK (with Neil Lawrence and crew) are using Stan, so we expect to see some more GP activity at some point.

  • We spent a long time discussing how to solve the multiple translation unit problems. It looks at first glance like Eigen just inlines every function and that may also work for us (if a function is declared inline, it may be defined in multiple translation units).

  • Solène Desmée, along with France Mentre and others have been fitting time-to-event models in Stan and have a new open-access publication, Nonlinear joint models for individual dynamic prediction of risk of death using Hamiltonian Monte Carlo: application to metastatic prostate cancer. You may remember France as the host of last year’s PK/PD Stan conference in Paris.

The post Stan Weekly Roundup, 28 July 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.


To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science.

Learn parallel programming in R with these exercises for "foreach"

Fri, 07/28/2017 - 18:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The foreach package provides a simple looping construct for R: the foreach function, which you may be familiar with from other languages like Javascript or C#. It's basically a function-based version of a "for" loop. But what makes foreach useful isn't iteration: it's the way it makes it easy to run those iterations in parallel, and save time on multi-CPU and distributed systems.
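As a minimal sketch of the idea (my own example; the doParallel backend is just one of several backends you could register):

library(foreach)
library(doParallel)

# sequential evaluation with %do%
foreach(i = 1:3, .combine = c) %do% sqrt(i)

# parallel evaluation: register a backend, then switch to %dopar%
cl <- makeCluster(2)
registerDoParallel(cl)
foreach(i = 1:3, .combine = c) %dopar% sqrt(i)
stopCluster(cl)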

If you want to get familiar with the foreach function, Parallelizing Loops at Microsoft Docs will introduce you to foreach loops (and the companion concept, iterators), and the various "backends" you can use to make the loops run in parallel without changing your loop code. Then, to make sure you've captured the concepts, you can try these 10 parallel computing exercises using foreach, from R-exercises.com. If you get stuck, the solutions are available here.

To get started with foreach, you can install the foreach package from CRAN, or simply use any edition of Microsoft R including Microsoft R Open (all come with foreach preinstalled). The foreach package is maintained by Microsoft under the open-source Apache license.

R-exercises: Parallel Computing Exercises: Foreach and DoParallel (Part-2)


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Hacking Strings with stringi

Fri, 07/28/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In the last set of exercises, we worked on the basic concepts of string manipulation with stringr. In this one we will go further into the string-hacking universe and learn how to use the stringi package. Note that stringi acts as the backend of stringr, but it has many more useful string manipulation functions than stringr, and one should really know stringi for text manipulation.

Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
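Before diving in, here is a tiny sketch of my own (separate from the exercises) showing the kind of functions stringi offers:

library(stringi)

stri_trans_toupper("stringi")                  # case conversion
stri_count_fixed("to be or not to be", "be")   # count fixed-pattern matches
stri_join("string", "i", sep = "")             # string concatenation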

Exercise 1
Create two strings:
c1 <- "a quick brown fox jumps over a lazy dog"
c2 <- "a quick brown fox jump over a lazy dog"
Now, stringi comes with many functions (and wrappers around functions) to check whether two strings are equivalent. Check whether they are equivalent with
stri_compare() and %s<=%, and try to reason about the answers.

Learn more about text analysis in the online course Text Analytics/Text Mining Using R. In this course you will learn how to create, analyse and finally visualize your text-based data source. Having all the steps easily outlined will be a great reference source for future work.

Exercise 2

How would you find the number of words in c1 and c2? It's pretty easy with stringi. Find out how.

Exercise 3

Similarly, how would you find all the words in c1 and c2? Again, it's pretty straightforward with stringi. Find out how.

Exercise 4
Let's say you have a vector which contains famous mathematicians:
genius <- c("Godel", "Hilbert", "Cantor", "Gauss", "Godel", "Fermet", "Gauss")
Find the duplicates.

Exercise 5

Find the number of characters in each element of the genius vector using a stri_* function.

Exercise 6
It's important to keep the characters of a set of strings in the same encoding. Suppose you have a vector
Genius1 <- c("Godel","Hilbert","Cantor","Gauss", "Gödel", "Fermet","Gauss")
Now Godel and Gödel are basically the same person, but the encodings of the characters differ, so a naive comparison will treat them as different strings. For the sake of consistency, we should translate them to a common representation. Find out how.

Hint: use the "Latin-ASCII" transliterator with a stri_trans* function.

Exercise 7
How do we collapse the LETTERS vector in R such that it looks like this:
“A-B_C-D_E-F_G-H_I-J_K-L_M-N_O-P_Q-R_S-T_U-V_W-X_Y-Z_”

Exercise 8
Suppose you have a string of words like c1, which we created earlier. You might want to know the start and end indices of the first and last words. The start index of the first word and the end index of the last word are obvious, but the end index of the first word and the start index of the last word are not. How would you find these?

Exercise 9
Suppose I have a string
pun <- "A statistician can have his head in an oven and his feet in ice, and he will say that on the average he feels fine"
Suppose I want to replace statistician and average with mathematician and median in the string pun. How can I achieve that?
Hint: use a stri_replace* method.

Exercise 10
My string x is like
x <- "I AM SAM. I AM SAM. SAM I AM"
Replace the last SAM with ADAM.

Related exercise sets:
  1. Hacking strings with stringr
  2. Higher Order Functions Exercises
  3. Character Functions (Advanced)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
