R-bloggers: R news and tutorials contributed by 750 R bloggers

Dynamic Networks: Visualizing Arms Trades Over Time

Wed, 06/14/2017 - 16:09

(This article was first published on sweissblaug, and kindly contributed to R-bloggers)

I previously made some network graphs of Middle East country relationships here using Slate’s Middle East Friendship Chart. I was thinking of a way to visualize these relationships (and possibly other ones) with a more rule-based, objective measure over time. What kind of public dataset could capture country relationships accurately?

I used arms trades between nations to explore these relationships. I think arms trades are a good indicator of how friendly two countries are because 1) countries won’t sell to enemies and 2) if a country wants to befriend another country, buying its weapons is a good way to buy influence. SIPRI has a very detailed dataset of international arms transfers between more than 257 nations / organizations in the post-war era. It includes a type / description and quantity for every trade. For this analysis I use only the fact that two countries bought or sold some weapons in a particular year. One can create adjacency matrices from this dataset and make some network plots. However, doing this one year at a time will not produce smooth plots over time. Instead, I used a siamese network to embed a country indicator and a year variable in two dimensions, with the response variable being whether or not two countries traded arms in that particular year. This maps each country-year into a 2-dimensional space, which means that if two points are near each other, those countries probably traded in that year.
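As a rough illustration (not code from the post), here is a minimal R sketch of the per-year adjacency matrix idea; the data frame trades and its columns seller, buyer and year are hypothetical stand-ins for the SIPRI records.

# build a symmetric 0/1 adjacency matrix for one year of trades
# (assumes a data frame `trades` with columns seller, buyer and year)
trade_year <- 1960
yr <- subset(trades, year == trade_year)
countries <- sort(unique(c(yr$seller, yr$buyer)))
adj <- matrix(0, length(countries), length(countries),
              dimnames = list(countries, countries))
adj[cbind(as.character(yr$seller), as.character(yr$buyer))] <- 1
adj <- pmax(adj, t(adj))  # treat a trade in either direction as a link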

You can check out the graph here. It’s cool that we can see Iran moving away from the US towards Russia / the Soviet Union around 1979, while Egypt and Iraq move towards the US. One can also explore other relationships. Below is a graph of the world’s relationships in 1960. There is a Russia / Soviet Union cluster and a US / Western Europe cluster. Neat! The analysis was done in Python (using the Keras example for siamese networks) and R (for plotting).

To leave a comment for the author, please follow the link and comment on their blog: sweissblaug.


New course: Data Visualization in R with lattice

Wed, 06/14/2017 - 16:05

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Hello, R users! Today we’re launching Data Visualization in R with lattice by Deepayan Sarkar, the creator of the lattice package.

Visualization is an essential component of interactive data analysis in R. Traditional (base) graphics is powerful, but limited in its ability to deal with multivariate data. Trellis graphics is the natural successor to traditional graphics, extending its simple philosophy to gracefully handle common multivariable data visualization tasks. This course introduces the lattice package, which implements Trellis graphics for R, and illustrates its basic use.

Start course for free

Data Visualization in R with lattice features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a master in data visualization with lattice!

What you’ll learn

You’ll start off with an introduction to some basic plotting functions in lattice. Draw histograms, scatter plots, density plots, and box and whisker plots.

Chapter 2 will teach you to create “conditioned” plots consisting of multiple panels using the formula interface.
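For instance (a small sketch using the built-in mtcars data, not material from the course), a conditioned scatter plot needs only a formula with a conditioning variable after the |:

library(lattice)
# one scatter-plot panel per number of cylinders, via the formula interface
xyplot(mpg ~ wt | factor(cyl), data = mtcars,
       xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")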

You’ll then learn how to control and customize axis limits and visual appearance in chapter 3.

Chapter 4 will cover how to use panel and prepanel functions to enhance existing displays or create new ones.
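As a hedged sketch of the idea (again with mtcars, not course material), a custom panel function can call the standard panel functions and then add to them:

library(lattice)
xyplot(mpg ~ wt | factor(cyl), data = mtcars,
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)      # the default scatter panel
         panel.lmline(x, y, lty = 2)  # add a least-squares line per panel
       })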

Finally, you’ll see that the lattice package is not just meant to be used as a standalone collection of plotting functions. Rather, it is a framework that is used as a base by many other packages. Some of these are very specialized and beyond the scope of this course. In chapter 5, we give a brief survey of extensions that are generally useful to enhance displays or create new ones.

Start Data Visualization with lattice today!


To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog.


How to make and share an R package in 3 steps

Wed, 06/14/2017 - 14:07

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

If you find yourself often repeating the same scripts in R, you might come to the point where you want to turn them into reusable functions and create your own R package. I recently reached that point and wanted to learn how to build my own R package – as simple as possible.

It took me some time to figure out the full process from A to Z. I was often left with questions:

  • What to do with the import of external libraries?
  • What about documentation?
  • How to actually install my package?
  • How can I share it with others?

Therefore, I will explain how I made my first R package and which methods I found helpful. Of course, there are different approaches out there, some of them very well documented, like for example this one – but the following 3 easy steps worked for me, so maybe they will help you, too, to get your first R package ready, installed and online quickly.

My first package is a wrapper for the FlightStats API. You can sign up for FlightStats to get a free trial API key (which unfortunately works for one month only). However, you can of course make a wrapper for any API you like or make a non-API related package. Two links I found very useful for getting started making my first package are this one and this one (regarding the storing of API keys). Also, I learned a lot using this package as an example.

Prepare the functions you want to include

First of all, we need to have all the functions (and possibly dataframes & other objects) ready in our R environment. For example, imagine we want to include these two functions to store the API key and an app ID:

setAPIKey <- function(){
  input = readline(prompt="Enter your FlightStats API Key: ")
  # this is a simpler way of storing API keys, but it is only temporary:
  # next session the login details will have to be provided again
  # (see below for how to store login details in a more durable way)
  Sys.setenv(flightstats_api_key = input)
}

setAppId <- function(){
  input = readline(prompt="Enter your FlightStats appID: ")
  Sys.setenv(flightstats_app_id = input)
}

Now you can retrieve the login in details like this:

# if a key does not exist, Sys.getenv() returns an empty string (""); in that
# case the user should be prompted to use the setAPIKey() and setAppId() functions
ID  = Sys.getenv("flightstats_app_id")
KEY = Sys.getenv("flightstats_api_key")

However, if you would like the login details to still be accessible the next time you open R, you can define your login details in the .Renviron file (instead of the .Rprofile file). The .Renviron file can be found in your home directory. You can find the path to your home directory by running Sys.getenv("HOME"). For more info see here. The final function to store the API key in the FlightsR package looked like this.
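As an illustration only (not the actual FlightsR code), a function that stores the key in .Renviron might look roughly like this:

# hypothetical helper, not part of FlightsR: store the key in ~/.Renviron
setAPIKeyPermanent <- function() {
  input <- readline(prompt = "Enter your FlightStats API Key: ")
  renviron <- file.path(Sys.getenv("HOME"), ".Renviron")
  cat(sprintf("flightstats_api_key=%s\n", input), file = renviron, append = TRUE)
  Sys.setenv(flightstats_api_key = input)  # also make it available right away
}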

Side note: I actually did not know this before, but find it quite handy: if you want to see the code of any function in R, you can print it by just typing the function name (without parentheses) and hitting Enter. For example:

> read.csv
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
    fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
    dec = dec, fill = fill, comment.char = comment.char, ...)

Let’s make another function that we can add to the package. For example, here is a simple function to list all airlines:

listAirlines <- function(activeOnly=TRUE){
  ID = Sys.getenv("flightstats_app_id")
  if (ID == ""){
    stop("Please set your FlightStats AppID and API Key with the setAPIKey() and setAppId() function. You can obtain these from https://developer.flightstats.com.")
  }
  KEY = Sys.getenv("flightstats_api_key")
  if (KEY == ""){
    stop("Please set your FlightStats AppID and API Key with the setAPIKey() and setAppId() function. You can obtain these from https://developer.flightstats.com.")
  }
  if(missing(activeOnly)){
    choice = "active"
  }
  if(activeOnly == FALSE) {
    choice = "all"
  } else {
    choice = "active"
  }
  link = paste0("https://api.flightstats.com/flex/airlines/rest/v1/json/",choice,"?appId=",ID,"&appKey=",KEY)
  dat = getURL(link)
  dat_list <- fromJSON(dat)
  airlines <- dat_list$airlines
  return(airlines)
}

Prepare the documents

Use package.skeleton(), devtools & roxygen2 to let R prepare the necessary documents for you:

package.skeleton(name = "FlightsR",
                 list = c("listAirlines","listAirports","scheduledFlights",
                          "scheduledFlightsFullDay","searchAirline","searchAirport",
                          "setAPIKey","setAppId"))

That’s it! Now in your working directory folder there should be a new folder with the name you just gave to your package.

What is handy about the function above is that it creates the folders and files you need in a new package folder (“FlightsR” in this case). In the /R folder every function you added now has its own .R script, and in the /man folder there is an .Rd file for each of the functions.

You can now go and manually change everything in these files that needs to be changed (documentation needs to be added, the import of external packages to be defined, etc.) – or use roxygen2 and devtools to do it for you. roxygen2 will complete the documentation in each .Rd file correctly and will create a NAMESPACE file for you. To do this, make sure you delete the current incomplete files (that is, all the files in the /man folder and the NAMESPACE file), otherwise you will get an error when you use the document() function later.

Now extra information needs to be added to the functions (for example, the parameters of the function, an example usage, necessary imports, etc.), in the following way:

#' Function searches a specific airline by IATA code
#'
#' @param value character, an airline IATA code
#' @return data.frame() with the airline
#'
#' @author Emelie Hofland, \email{emelie_hofland@hotmail.com}
#'
#' @examples
#' searchAirline("FR")
#'
#' @import RCurl
#' @import jsonlite
#' @export
#'
searchAirline <- function(value){
  ID = Sys.getenv("flightstats_app_id")
  if (ID == ""){
    stop("Please set your FlightStats AppID and API Key with the setAPIKey() and setAppId() function. You can obtain these from https://developer.flightstats.com.")
  }
  KEY = Sys.getenv("flightstats_api_key")
  if (KEY == ""){
    stop("Please set your FlightStats AppID and API Key with the setAPIKey() and setAppId() function. You can obtain these from https://developer.flightstats.com.")
  }
  link = paste0("https://api.flightstats.com/flex/airlines/rest/v1/json/iata/",toupper(value),"?appId=",ID,"&appKey=",KEY)
  dat <- getURL(link)
  dat_list <- fromJSON(dat)
  result <- dat_list$airlines
  if (length(result)==0){
    warning("Please make sure that you provide a valid airline IATA code.")
  }
  return(result)
}

Do not forget to add the @export, otherwise your functions will not be there when you open your package!

Now, when you have added this information for all your functions, make sure that devtools and roxygen2 are installed.

install.packages("roxygen2")
install.packages("devtools")

Make sure your working directory is set to the folder of your package and run the following commands in R:

library(devtools)

# to automatically generate the documentation:
document()
# to build the package:
build()
# to install the package:
install()

Voila! You are done. I do not know if it is necessary, but just to be sure I restarted R at this point. In a new session you can now run library(YourPackageName) and this should work.

To adjust functions, you can just change things in the functions in the package and re-run the document(), build() and install() commands.

Pushing a custom made R package to GitHub (or GitLab)

Note: These steps assume that you have Git installed and configured on your PC.
1) Create a new repository in your github account.
2) Create and copy the https link to clone the repo on your PC.
3) Go to the folder on your PC where you want to save your repo, open the command line interface & type:

$ git clone https://github.com/YourGithub/YourPackage.git

4) Copy all the files from your package in the folder and run:

$ git add .
$ git commit -m "whatever message you want to add"
$ git push origin master

5) Voila – now your package should be on GitHub!

Now people can download & install your package straight from GitHub or GitLab – the devtools package has a function for this:

if (!require(devtools)) {
  install.packages('devtools')
}

# If your repo is on GitHub:
devtools::install_github('YourGithub/YourPackage')

# If your repo is a public repo on GitLab:
devtools::install_git("https://gitlab/link/to/your/repo")

# If your repo is a private repo on GitLab:
devtools::install_git("https://emelie.hofland:password@gitlab/link/to/your/repo.git")

Post a comment below if you have a question.

    Related Post

    1. Introductory Data Analysis with Python
    2. Understanding Linear SVM with R
    3. How to add a background image to ggplot2 graphs
    4. Streamline your analyses linking R to SAS: the workfloweR experiment
    5. R Programming – Pitfalls to avoid (Part 1)

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.


    Installing R packages with rxInstallPackages in Microsoft R Server

    Wed, 06/14/2017 - 13:07

    (This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

The MicrosoftML package brings – in my opinion – a long-anticipated function for installing R packages for SQL Server and Microsoft R Server. And I am super happy.

Last year, in one of my previous blog posts, I showed how to install an R package from SSMS using sp_execute_external_script. Now, with the new MicrosoftML package (part of Microsoft R Server 9.x and above), a new function is available that makes it easy to install packages, and a little bit more.

    Code is relatively simple and straightforward:

USE SQLR;
GO

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = N'
        packagesToInstall <- c("caret","tree","party")
        library(MicrosoftML)
        SqlServerCC <- RxInSqlServer(connectionString = "Driver=SQL Server;Server=SICN-KASTRUN\\SQLSERVER2017C2;Database=SQLR;Trusted_Connection=True;")
        rxInstallPackages(pkgs = packagesToInstall, owner = "", scope = "shared", computeContext = "SqlServerCC");';
GO

This is way too easy to be true, but it is. Make sure to do a couple of things prior to running this code:

    1. set the compute environment to where your packages are installed
    2. set up the correct permissions and access
3. also check the TCP/IP protocols

In the rxInstallPackages function, use the computeContext parameter to point either to “Local” or to your “SqlServer” environment. You can also set scope to shared or private (the difference being that a package installed as shared can be used by different users across different databases, whereas a private one cannot). You can also specify owner if you are running this command outside the db_owner role.
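For comparison, here is a hedged sketch of issuing the same call from a plain R session rather than through sp_execute_external_script; the server name is a placeholder and the parameters simply mirror the example above:

# sketch only: install packages into the SQL Server compute context from R
library(MicrosoftML)
packagesToInstall <- c("caret", "tree", "party")
SqlServerCC <- RxInSqlServer(connectionString =
  "Driver=SQL Server;Server=MYSERVER\\MYINSTANCE;Database=SQLR;Trusted_Connection=True;")
rxInstallPackages(pkgs = packagesToInstall,
                  scope = "shared",             # or "private"
                  computeContext = SqlServerCC) # or "Local" for the local context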

    Happy SQLR-ing!


To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.


    #7: C++14, R and Travis — A useful hack

    Wed, 06/14/2017 - 03:54

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    Welcome to the seventh post in the rarely relevant R ramblings series, or R4 for short.

    We took a short break as several conferences and other events interfered during the month of May, keeping us busy and away from this series. But we are back now with a short and useful hack I came up with this weekend.

    The topic is C++14, i.e. the newest formally approved language standard for C++, and its support in R and on Travis CI. With release R 3.4.0 of a few weeks ago, R now formally supports C++14. Which is great.

But there be devils. A little known fact is that R hangs on to its configuration settings from its own compile time. That matters in cases such as the one we are looking at here: Travis CI. Travis is a tremendously useful and widely-deployed service, most commonly connected to GitHub driving "continuous integration" (the ‘CI’) testing after each commit. But Travis CI, for as useful as it is, is also maddeningly conservative, still forcing everybody to live and die by Ubuntu 14.04 (http://releases.ubuntu.com/14.04/). So while we all benefit from the fine work by Michael who faithfully provides Ubuntu binaries for distribution via CRAN (based on the Debian builds provided by yours truly), we are stuck with Ubuntu 14.04. Which means that while Michael can provide us with current R 3.4.0, it will be built on ancient Ubuntu 14.04.

    Why does this matter, you ask? Well, if you just try to turn the very C++14 support added to R 3.4.0 on in the binary running on Travis, you get this error:

** libs
Error in .shlib_internal(args) :
  C++14 standard requested but CXX14 is not defined

    And you get it whether or not you define CXX14 in the session.

    So R (in version 3.4.0) may want to use C++14 (because a package we submitted requests it), but having been built on the dreaded Ubuntu 14.04, it just can’t oblige. Even when we supply a newer compiler. Because R hangs on to its compile-time settings rather than current environment variables. And that means no C++14 as its compile-time compiler was too ancient. Trust me, I tried: adding not only g++-6 (from a suitable repo) but also adding C++14 as the value for CXX_STD. Alas, no mas.

    The trick to overcome this is twofold, and fairly straightforward. First off, we just rely on the fact that g++ version 6 defaults to C++14. So by supplying g++-6, we are in the green. We have C++14 by default without requiring extra options. Sweet.

    The remainder is to tell R to not try to enable C++14 even though we are using it. How? By removing CXX_STD=C++14 on the fly and just for Travis. And this can be done easily with a small script configure which conditions on being on Travis by checking two environment variables:

#!/bin/bash

## Travis can let us run R 3.4.0 (from CRAN and the PPAs) but this R version
## does not know about C++14. Even though we can select CXX_STD = C++14, R
## will fail as the version we use there was built in too old an environment,
## namely Ubuntu "trusty" 14.04.
##
## So we install g++-6 from another repo and rely on the fact that it
## defaults to C++14. Sadly, we need R to not fail and hence, just on
## Travis, remove the C++14 instruction

if [[ "${CI}" == "true" ]]; then
    if [[ "${TRAVIS}" == "true" ]]; then
        echo "** Overriding src/Makevars and removing C++14 on Travis only"
        sed -i 's|CXX_STD = CXX14||' src/Makevars
    fi
fi

I have deployed this now for two sets of builds in two distinct repositories for two "under-development" packages not yet on CRAN, and it just works. If you turn on C++14 via the SystemRequirements: field in the file DESCRIPTION instead, you will need to adjust the script to modify that file.

    So to sum up, there it is: C++14 with R 3.4.0 on Travis. Only takes a quick Travis-only modification.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.


    Use a Join Controller to Document Your Work

    Wed, 06/14/2017 - 00:17

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

This note describes a useful replyr tool we call a "join controller" (and is part of our "R and Big Data" series; please see here for the introduction, and here for one of our big data courses).

When working on real world predictive modeling tasks in production, the ability to join data and document how you join data is paramount. There are very strong reasons to organize production data in something resembling one of the Codd normal forms. However, for machine learning we need a fully denormalized form (all columns populated into a single ready-to-go row, no matter what their provenance, keying, or stride).

    This is not an essential difficulty as in relational data systems moving between these forms can be done by joining, and data stores such as PostgreSQL or Apache Spark are designed to provide powerful join capabilities.

    However there are some inessential (in that they can be avoided) but substantial difficulties in managing and documenting long join plans. It is not uncommon to have to join 7 or more tables to get an analysis ready. This at first seems trivial, but once you add in the important tasks of narrowing tables (eliminating columns not used later) and disambiguating column names (ensuring unique names after the join) the documentation and work can become substantial. Specifying the join process directly in R code leads to hard to manage, hard to inspect, and hard to share spaghetti code (even when using a high-level data abstraction such as dplyr).

    If you have done non-trivial work with production data you have seen this pain point.

    The fix is to apply the following principles:

    • Anything long, repetitive, and tedious should not be done directly.
    • Moving specification out of code and into data is of huge benefit.
    • A common special case can be treated separately, as that documents intent.

To supply such a solution, the development version of replyr now supplies an item called a "join controller" under the method replyr::executeLeftJoinPlan().

    This is easiest to explain through a concrete example, which is what we will do here.

    First let’s load the needed packages.

# load packages
suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.7.0'

library("replyr")
packageVersion("replyr")
## [1] '0.4.1'

    Now let’s load some notional example data. For our example we have:

    • One primary table of measurements (called "meas1") keyed by id and date.
    • A fact table that maps ids to patient names (called "names", and keyed by id).
• A second table of additional measurements (called "meas2") that we consider "nice to have." That is: rows missing from this table should not censor out meas1 rows, and additional rows found here should not be included in the analysis.

    The data is given below:

# load notional example data
my_db <- dplyr::src_sqlite(":memory:", create = TRUE)

# example data
replyr_copy_to(my_db,
               data.frame(id= c(1,1,2,2),
                          date= c(1,2,1,2),
                          weight= c(200, 180, 98, 120),
                          height= c(60, 54, 12, 14)),
               'meas1_train')
replyr_copy_to(my_db,
               data.frame(id= seq_len(length(letters)),
                          name= letters,
                          stringsAsFactors=FALSE),
               'names_facts')
replyr_copy_to(my_db,
               data.frame(pid= c(2,3),
                          date= c(2,2),
                          weight= c(105, 110),
                          width= 1),
               'meas2_train')

    An important (and very neglected) step in data science tasks is documenting roles of tables, especially their key-structure (which we also call "stride" in the sense it describes how you move from row to row). replyr::tableDescription() is a function that builds an initial description of the tables. (Note: replyr::tableDescription() is misspelled in the current release version of replyr, we have fixed this in dev).

# map from abstract names to realized names
tables <- list('meas1' = 'meas1_train',
               'names' = 'names_facts',
               'meas2' = 'meas2_train')

# get the initial description of table defs
got <- lapply(names(tables),
              function(ni) {
                # get table reference from source by concrete name
                ti <- tbl(my_db, tables[[ni]])
                # map abstract name to reference
                tableDescription(ni, ti)
              })
tDesc <- bind_rows(got)

    tDesc is essentially a slightly enriched version of the data handle concordance described in "Managing Spark data handles in R." We can take a quick look at the stored simplified summaries:

print(tDesc %>% select(tableName, sourceClass, isEmpty))
## # A tibble: 3 x 3
##   tableName           sourceClass isEmpty
##
## 1     meas1 src_dbi, src_sql, src   FALSE
## 2     names src_dbi, src_sql, src   FALSE
## 3     meas2 src_dbi, src_sql, src   FALSE

print(tDesc$columns)
## [[1]]
## [1] "id"     "date"   "weight" "height"
##
## [[2]]
## [1] "id"   "name"
##
## [[3]]
## [1] "pid"    "date"   "weight" "width"

print(tDesc$colClass)
## [[1]]
##        id      date    weight    height
## "numeric" "numeric" "numeric" "numeric"
##
## [[2]]
##          id        name
##   "integer" "character"
##
## [[3]]
##       pid      date    weight     width
## "numeric" "numeric" "numeric" "numeric"

# add names for printing
names(tDesc$keys) <- tDesc$tableName
print(tDesc$keys)
## $meas1
##       id     date   weight   height
##     "id"   "date" "weight" "height"
##
## $names
##     id   name
##   "id" "name"
##
## $meas2
##      pid     date   weight    width
##    "pid"   "date" "weight"  "width"

The resulting description table, tDesc, holds the following:

    • tableName: the abstract name we wish to use for this table.
    • handle: the actual data handle (either a data.frame or a handle to a remote data source such as PostgreSQL or Spark). Notice in the example it is of class "tbl_sqlite" or "tbl_dbi" (depending on the version of dplyr).
    • columns: the list of columns in the table.
    • keys: a named list mapping abstract key names to table column names. The set of keys together is supposed to uniquely identify rows.
    • colClasses: a vector of column classes of the underlying table.
    • sourceClass: the declared class of the data source.
    • isEmpty: an advisory column indicating if any rows were present when we looked.

    The tableName is "abstract" in that it is only used to discuss tables (i.e., it is only ever used as row identifier in this table). The data is actually found through the handle. This is critical in processes where we may need to run the same set of joins twice on different sets of tables (such as building a machine learning model, and then later applying the model to new data).

    The intent is to build a detailed join plan (describing order, column selection, and column re-naming) from the tDesc table. We can try this with the supplied function buildJoinPlan(), which in this case tells us our table descriptions are not quite ready to specify a join plan:

tryCatch(
  buildJoinPlan(tDesc),
  error = function(e) {e}
)
##

In the above, the keys column is wrong in that it claims every column of each table is a table key. The join plan builder noticed this is unsupportable in that when it comes time to join the "names" table, not all of the columns that are claimed to be "names" keys are already known from previous tables. That is: the "names$name" column is not present in the earlier tables, and so cannot be joined on. We can’t check everything, but the join controller tries to "front load" or encounter as many configuration inconsistencies as possible early, before any expensive steps have been started.

The intent is: the user should edit the "tDesc" keys column and share it with partners for criticism. In our case we declare the primary key of the measurement tables to be PatientID and MeasurementDate, and the primary key of the names table to be PatientID. Notice we do this by specifying named lists or vectors mapping desired key names to the names actually used in the tables.

# declare keys (and give them consistent names)
tDesc$keys[[1]] <- c(PatientID= 'id', MeasurementDate= 'date')
tDesc$keys[[2]] <- c(PatientID= 'id')
tDesc$keys[[3]] <- c(PatientID= 'pid', MeasurementDate= 'date')

print(tDesc$keys)
## $meas1
##       PatientID MeasurementDate
##            "id"          "date"
##
## $names
## PatientID
##      "id"
##
## $meas2
##       PatientID MeasurementDate
##           "pid"          "date"

    The above key mapping could then be circulated to partners for comments and help. Notice since this is not R code we can easily share it with non-R users for comment and corrections.

It is worth confirming the keying is as expected (else some rows can reproduce in bad ways during joining). This is a potentially expensive operation, but it can be done as follows:

keysAreUnique(tDesc)
## meas1 names meas2
##  TRUE  TRUE  TRUE

    Once we are satisfied with our description of tables we can build a join plan. The join plan is an ordered sequence of left-joins.

In practice, when preparing data for predictive analytics or machine learning there is often a primary table that has exactly the set of rows you want to work over (especially when encountering production star-schemas). By starting joins from this table we can perform most of our transformations using only left-joins. To keep things simple we have only supplied a join controller for this case. This is obviously not the only join pattern needed; but it is the common one.
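To make the pattern concrete, the following is roughly the hand-written dplyr equivalent that such a join plan replaces (a sketch only, assuming meas1, names_tbl and meas2 refer to the three example tables above):

library("dplyr")
# rename to shared key names and left-join off the primary measurement table
res <- meas1 %>%
  rename(PatientID = id, MeasurementDate = date, meas1_weight = weight) %>%
  left_join(names_tbl %>% rename(PatientID = id),
            by = "PatientID") %>%
  left_join(meas2 %>% rename(PatientID = pid, MeasurementDate = date,
                             meas2_weight = weight),
            by = c("PatientID", "MeasurementDate"))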

    A join plan can now be built from our table descriptions:

# build the column join plan
columnJoinPlan <- buildJoinPlan(tDesc)
print(columnJoinPlan %>% select(tableName, sourceColumn, resultColumn, isKey, want))
## # A tibble: 10 x 5
##    tableName sourceColumn    resultColumn isKey  want
##
##  1     meas1           id       PatientID  TRUE  TRUE
##  2     meas1         date MeasurementDate  TRUE  TRUE
##  3     meas1       weight    meas1_weight FALSE  TRUE
##  4     meas1       height          height FALSE  TRUE
##  5     names           id       PatientID  TRUE  TRUE
##  6     names         name            name FALSE  TRUE
##  7     meas2          pid       PatientID  TRUE  TRUE
##  8     meas2         date MeasurementDate  TRUE  TRUE
##  9     meas2       weight    meas2_weight FALSE  TRUE
## 10     meas2        width           width FALSE  TRUE

    Essentially the join plan is an unnest of the columns from the table descriptions. This was anticipated in our article "Managing Spark Data Handles".

    We then alter the join plan to meet our needs (either through R commands or by exporting the plan to a spreadsheet and editing it there).

    Only columns named in the join plan with a value of TRUE in the want column are kept in the join (columns marked isKey must also have want set to TRUE). This is very useful as systems of record often have very wide tables (with hundreds of columns) of which we only want a few columns for analysis.

    For example we could decide to exclude the width column by either dropping the row or setting the row’s want column to FALSE.

    Since we have edited the join plan it is a good idea to both look at it and also run it through the inspectDescrAndJoinPlan() to look for potential inconsistencies.

# decide we don't want the width column
columnJoinPlan$want[columnJoinPlan$resultColumn=='width'] <- FALSE
# double check our plan
if(!is.null(inspectDescrAndJoinPlan(tDesc, columnJoinPlan))) {
  stop("bad join plan")
}

print(columnJoinPlan %>% select(tableName, sourceColumn, resultColumn, isKey, want))
## # A tibble: 10 x 5
##    tableName sourceColumn    resultColumn isKey  want
##
##  1     meas1           id       PatientID  TRUE  TRUE
##  2     meas1         date MeasurementDate  TRUE  TRUE
##  3     meas1       weight    meas1_weight FALSE  TRUE
##  4     meas1       height          height FALSE  TRUE
##  5     names           id       PatientID  TRUE  TRUE
##  6     names         name            name FALSE  TRUE
##  7     meas2          pid       PatientID  TRUE  TRUE
##  8     meas2         date MeasurementDate  TRUE  TRUE
##  9     meas2       weight    meas2_weight FALSE  TRUE
## 10     meas2        width           width FALSE FALSE

    The join plan is the neglected (and often missing) piece of documentation key to non-trivial data science projects. We strongly suggest putting it under source control, and circulating it to project partners for comment.

    As a diagram the key structure of the join plan looks like the following (produced by DiagrammeR::mermaid(makeJoinDiagramSpec(columnJoinPlan))):

    Note the diagramming ability is currently only in the dev version of replyr. These diagrams are kind of fun. For instance, here is a more complicated one from the help(makeJoinDiagramSpec) examples:

    Once you have a good join plan executing it is a one-line command with executeLeftJoinPlan() (once you have set up a temp name manager as described in "Managing intermediate results when using R/sparklyr"):

# manage the temp names as in:
# http://www.win-vector.com/blog/2017/06/managing-intermediate-results-when-using-rsparklyr/
tempNameGenerator <- makeTempNameGenerator("extmps")

# execute the left joins
results <- executeLeftJoinPlan(tDesc, columnJoinPlan,
                               verbose= TRUE,
                               tempNameGenerator= tempNameGenerator)
## [1] "start meas1 Tue Jun 13 15:30:25 2017"
## [1] " rename/restrict meas1"
## [1] " 'table_meas1_present' = 'table_meas1_present'"
## [1] " 'PatientID' = 'id'"
## [1] " 'MeasurementDate' = 'date'"
## [1] " 'meas1_weight' = 'weight'"
## [1] " 'height' = 'height'"
## [1] " res <- meas1"
## [1] "done meas1 Tue Jun 13 15:30:25 2017"
## [1] "start names Tue Jun 13 15:30:25 2017"
## [1] " rename/restrict names"
## [1] " 'table_names_present' = 'table_names_present'"
## [1] " 'PatientID' = 'id'"
## [1] " 'name' = 'name'"
## [1] " res <- left_join(res, names, by = c( 'PatientID' ))"
## [1] "done names Tue Jun 13 15:30:25 2017"
## [1] "start meas2 Tue Jun 13 15:30:25 2017"
## [1] " rename/restrict meas2"
## [1] " 'table_meas2_present' = 'table_meas2_present'"
## [1] " 'PatientID' = 'pid'"
## [1] " 'MeasurementDate' = 'date'"
## [1] " 'meas2_weight' = 'weight'"
## [1] " res <- left_join(res, meas2, by = c( 'PatientID', 'MeasurementDate' ))"
## [1] "done meas2 Tue Jun 13 15:30:26 2017"

    executeLeftJoinPlan() takes both a table description table (tDesc, keyed by tableName) and the join plan (columnJoinPlan, keyed by tableName and sourceColumn).

    The separation of concerns is strong: all details about the intended left-join sequence are taken from the columnJoinPlan, and only the mapping from abstract table names to tables (or table references/handles) is taken from tDesc. This is deliberate design and makes running the same join plan on two different sets of tables (say once for model construction, and later for model application) very easy. tDesc is a runtime entity (as it binds names to live handles, so can’t be serialized: you must save the code steps to produce it; note only the columns tableName and handle are used so there is no point re-editing the keys column after running tableDescription() on new tables) and columnJoinPlan is a durable entity (has only information, not handles).

    Basically you:

    • Build simple procedures to build up tDesc.
    • Work hard to get a good columnJoinPlan.
    • Save columnJoinPlan in source control and re-load it (not re-build it) when you need it.
    • Re-build new tDesc compatible with the saved columnJoinPlan later when you need to work with tables (note only the columns tableName and handle are used during join execution, so you only need to create those).

    As always: the proof is in the pudding. We should look at results:

print(results %>%
        select(PatientID, MeasurementDate, meas1_weight, height,
               name, table_meas2_present, meas2_weight),
      width= Inf)
## # Source:   lazy query [?? x 7]
## # Database: sqlite 3.11.1 [:memory:]
##   PatientID MeasurementDate meas1_weight height  name table_meas2_present meas2_weight
##
## 1         1               1          200     60     a                   0           NA
## 2         1               2          180     54     a                   0           NA
## 3         2               1           98     12     b                   0           NA
## 4         2               2          120     14     b                   1          105

    Notice the joiner added extra columns of the form table_*_present to show which tables had needed rows. This lets us tell different sorts of missingness apart (value NA as there was no row to join, versus value NA as a NA came from a row) and appropriately coalesce results easily. These columns are also very good for collecting statistics on data coverage, and in business settings often are very useful data quality and data provenance features which can often be directly included in machine learning models.

    Also notice the join plan is very specific: every decision (such as what order to operate and how to disambiguate column names) is already explicitly set in the plan. The executor is then allowed to simply move through the tables left-joining in the order the table names first appear in the plan.

    Having to "join a bunch of tables to get the data into simple rows" is a common step in data science. Therefore you do not want this to be a difficult and undocumented task. By using a join controller you essentially make the documentation the executable specification for the task.

# cleanup
temps <- tempNameGenerator(dumpList= TRUE)
for(ti in temps) {
  replyr_drop_table_name(my_db, ti)
}
rm(list=ls())
gc(verbose= FALSE)

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


    Syberia: A development framework for R code in production

    Tue, 06/13/2017 - 22:57

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Putting R code into production generally involves orchestrating the execution of a series of R scripts. Even if much of the application logic is encoded into R packages, a run-time environment typically involves scripts to ingest and prepare data, run the application logic, validate the results, and operationalize the output. Managing those scripts, especially in the face of working with multiple R versions, can be a pain — and worse, very complex scripts are difficult to understand and reuse for future applications.

That's where Syberia comes in: an open-source framework created by Robert Krzyzanowski and other engineers at the consumer lending company Avant. There, Syberia has been used by more than 30 developers to build a production data modeling system. In fact, building production R systems was the motivating tenet of Syberia:

    Developing classifiers using the Syberia modeling engine follows one primary tenet: development equals production. When a web developer experiments with a new website layout or feature, they are targeting a production system. After they are satisfied with their work, they push a button and deploy the code base so it is live and others can interact with it.

    Feature engineering and statistical modeling … should belong to the same class of work. When an architect designs a skyscraper their work has to be translated to reality through a construction team by replaying the design using a physical medium. This is the current industry standard for machine learning: prototype the model in one language, typically a Python or R "notebook," and then write it in a "solid production-ready" language so it can survive the harsh winds of the real world.

    This is wrong.

    In much the same way that ggplot2 is a completely different way of thinking about R graphics, Syberia is a completely different way of thinking about R scripts. It's also similarly difficult to get your head around at first, but once you do, it reveals itself as an elegant and customizable way of managing complex and interconnected R code — and a much better solution than simply source-ing an 800-line R script.

At its core, Syberia defines a set of conventions for defining the steps in a data-analysis workflow, and specifying them with a collection (in real-world projects, a large collection) of small R scripts in a standardized folder structure. Collectively, these define the complete data analysis process, which you can execute with a simple R command: run. To make modifying and maintaining this codebase (which you'd typically manage in a source-code control system) easier, Syberia is designed to isolate dependencies between files. For example, rather than specifying a file name and format (say, "data.csv") in a script that reads data, you'd instead define "adapters" to read and write data in the adapters/adapters.R script:

# ./adapters/csv.R
read <- function(key) {
  read.csv(file.path("/some/path", paste0(key, ".csv")))
}

write <- function(value, key) {
  write.csv(value, file.path("/some/path", paste0(key, ".csv")))
}

    Syberia will then use those "read" and "write" adapters to connect with your data. That way, when you later decide to source the data from a database, you can just write new adapters rather than trying to find the lines dealing with data I/O in a massive script. (This also helps avoid conflicts when working in a large code-base with multiple developers.) Syberia defines similar "patterns" for data preparation (including feature generation), statistical modeling, and testing; the "run" function conceptually synthesizes the entire codebase into a single R script to run.

    Syberia also encourages you to break up your process into a series of distinct steps, each of which can be run (and tested) independently. It also has a make-like feature, in that results from intermediate steps are cached, and do not need to be re-run each time unless their dependencies have been modified.

    Syberia can also be used to associate specific R versions with scripts, or even other R engines like Microsoft R. I was extremely impressed when during a 30-minute-break at the R/Finance conference last month, Robert was able to sketch out a Syberia implementation of a modeling process using the RevoScaleR library. In fact Robert's talk from the conference, embedded below, provides a nice introduction to Syberia.

    Syberia is available now under the open-source MIT license. (Note: git is required to install Syberia.) To get started with Syberia check out the documentation, which is available at the Syberia website linked below.

    Syberia: The development framework for R


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


    Keeping Users Safe While Collecting Data

    Tue, 06/13/2017 - 22:48

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    I caught a mention of this project by Pete Warden on Four Short Links today. If his name sounds familiar, he’s the creator of the DSTK, an O’Reilly author, and now works at Google. A decidedly clever and decent chap.

    The project goal is noble: crowdsource and make a repository of open speech data for researchers to make a better world. Said sourcing is done by asking folks to record themselves saying “Yes”, “No” and other short words.

    As I meandered over the blog post I looked in horror on the URL for the application that did the recording: https://open-speech-commands.appspot.com/.

    Why would the goal of the project combined with that URL give pause? Read on!

    You’ve Got Scams!

    Picking up the phone and saying something as simple as ‘Yes’ has been a major scam this year. By recording your voice, attackers can replay it on phone prompts and because it’s your voice it makes it harder to refute the evidence and can foil recognition systems that look for your actual voice.

    As the chart above shows, the Better Business Bureau has logged over 5,000 of these scams this year (searching for ‘phishing’ and ‘yes’). You can play with the data (a bit — the package needs work) in R with scamtracker.

    Now, these are “analog” attacks (i.e. a human spends time socially engineering a human). Bookmark this as you peruse section 2.

    Integrity Challenges in 2017

    I “trust” Pete’s intentions, but I sure don’t trust open-speech-commands.appspot.com (and, you shouldn’t either). Why? Go visit https://totally-harmless-app.appspot.com. It’s a Google App Engine app I made for this post. Anyone can make an appspot app and the https is meaningless as far as integrity & authenticity goes since I’m running on google’s infrastructure but I’m not google.

    You can’t really trust most SSL/TLS sessions as far as site integrity goes anyway. Let’s Encrypt put the final nail in the coffin with their Certs Gone Wild! initiative. With super-recent browser updates you can almost trust your eyes again when it comes to URLs, but you should be very wary of entering your info — especially uploading voice, prints or eye/face images — into any input box on any site if you aren’t 100% sure it’s a legit site that you trust.

    Tracking the Trackers

    If you don’t know that you’re being tracked 100% of the time on the internet then you really need to read up on the modern internet.

    In many cases your IP address can directly identify you. In most cases your device & browser profile — which most commercial sites log — can directly identify you. So, just visiting a web site means that it’s highly likely that web site can know that you are both not a dog and are in fact you.

    Still Waiting for the “So, What?”

    Many states and municipalities have engaged in awareness campaigns to warn citizens about the “Say ‘Yes’” scam. Asking someone to record themselves saying ‘Yes’ into a random web site pretty much negates that advice.

    Folks like me regularly warn about trust on the internet. I could have cloned the functionality of the original site to open-speech-commmands.appspot.com. Did you even catch the 3rd ‘m’ there? Even without that, it’s an appspot.com domain. Anyone can set one up.

    Even if the site doesn’t ask for your name or other info and just asks for your ‘Yes’, it can know who you are. In fact, when you’re enabling the microphone to do the recording, it could even take a picture of you if it wanted to (and you’d likely not know or not object since it’s for SCIENCE!).

    So, in the worst case scenario a malicious entity could be asking you for your ‘Yes’, tying it right to you and then executing the post-scam attacks that were being performed in the analog version.

    But, go so far as to assume this is a legit site with good intentions. Do you really know what’s being logged when you commit your voice info? If the data was mishandled, it would be just as easy to tie the voice files back to you (assuming a certain level of data logging).

    The “so what” is not really a warning to users but a message to researchers: You need to threat model your experiments and research initiatives, especially when innocent end users are potentially being put at risk. Data is the new gold, diamonds and other precious bits that attackers are after. You may think you’re not putting folks at risk and aren’t even a hacker target, but how you design data gathering can reinforce good or bad behaviour on the part of users. It can solidify solid security messages or tear them down. And, you and your data may be more of a target than you really know.

    Reach out to interdisciplinary colleagues to help threat model your data collection, storage and dissemination methods to ensure you aren’t putting yourself or others at risk.

    FIN

    Pete did the right thing:

    and, I’m sure the site will be on a “proper” domain soon. When it is, I’ll be one of the first in line to help make a much-needed open data set for research purposes.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.


    dplyr 0.7.0

    Tue, 06/13/2017 - 18:18

    (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

    I’m pleased to announce that dplyr 0.7.0 is now on CRAN! (This was dplyr 0.6.0 previously; more on that below.) dplyr provides a “grammar” of data transformation, making it easy and elegant to solve the most common data manipulation challenges. dplyr supports multiple backends: as well as in-memory data frames, you can also use it with remote SQL databases. If you haven’t heard of dplyr before, the best place to start is the Data transformation chapter in R for Data Science.

    You can install the latest version of dplyr with:

    install.packages("dplyr") Features

    dplyr 0.7.0 is a major release including over 100 improvements and bug fixes, as described in the release notes. In this blog post, I want to discuss one big change and a handful of smaller updates. This version of dplyr also saw a major revamp of database connections. That’s a big topic, so it’ll get its own blog post next week.

    Tidy evaluation

The biggest change is a new system for programming with dplyr, called tidy evaluation, or tidy eval for short. Tidy eval is a system for capturing expressions and later evaluating them in the correct context. It is important because it allows you to interpolate values in contexts where dplyr usually works with expressions:

my_var <- quo(homeworld)

starwars %>%
  group_by(!!my_var) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)
#> # A tibble: 49 x 3
#>         homeworld   height  mass
#>
#>  1       Alderaan 176.3333  64.0
#>  2    Aleen Minor  79.0000  15.0
#>  3         Bespin 175.0000  79.0
#>  4     Bestine IV 180.0000 110.0
#>  5 Cato Neimoidia 191.0000  90.0
#>  6          Cerea 198.0000  82.0
#>  7       Champala 196.0000   NaN
#>  8      Chandrila 150.0000   NaN
#>  9   Concord Dawn 183.0000  79.0
#> 10       Corellia 175.0000  78.5
#> # ... with 39 more rows

    This makes it possible to write your functions that work like dplyr functions, reducing the amount of copy-and-paste in your code:

starwars_mean <- function(my_var) {
  my_var <- enquo(my_var)

  starwars %>%
    group_by(!!my_var) %>%
    summarise_at(vars(height:mass), mean, na.rm = TRUE)
}

starwars_mean(homeworld)

    You can also use the new .data pronoun to refer to variables with strings:

    my_var <- "homeworld" starwars %>% group_by(.data[[my_var]]) %>% summarise_at(vars(height:mass), mean, na.rm = TRUE)

    This is useful when you’re writing packages that use dplyr code because it avoids an annoying note from R CMD check.

To learn more about how tidy eval helps solve data analysis challenges, please read the new programming with dplyr vignette. Tidy evaluation is implemented in the rlang package, which also provides a vignette on the theoretical underpinnings. Tidy eval is a rich system and takes a while to get your head around, but we are confident that learning tidy eval will pay off, especially as it rolls out to other packages in the tidyverse (tidyr and ggplot2 are next on the todo list).

    The introduction of tidy evaluation means that the standard evaluation (underscored) version of each main verb (filter_(), select_() etc) is no longer needed, and so these functions have been deprecated (but remain around for backward compatibility).
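If you are migrating code away from the underscored verbs, the tidy eval tools usually give a direct replacement. A minimal sketch (the column names are just illustrative, not from the release notes):

# Old standard-evaluation style (now deprecated)
mtcars %>% filter_(~ cyl == 6) %>% select_("mpg", "wt")

# Tidy eval replacements
mtcars %>% filter(cyl == 6) %>% select(mpg, wt)

# When the column name arrives as a string, use the .data pronoun
col <- "cyl"
mtcars %>% filter(.data[[col]] == 6)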

    Character encoding

    We have done a lot of work to ensure that dplyr works with encodings other than Latin1 on Windows. This is most likely to affect you if you work with data that contains Chinese, Japanese, or Korean (CJK) characters. dplyr should now just work with such data. Please let us know if you have problems!

    New datasets

    dplyr has some new datasets that will help write more interesting examples:

• starwars, shown above, contains information about characters from the Star Wars movies, sourced from the Star Wars API. It contains a number of list-columns.

starwars
#> # A tibble: 87 x 13
#>                  name height  mass    hair_color  skin_color eye_color
#>
#> 1      Luke Skywalker    172    77         blond        fair      blue
#> 2               C-3PO    167    75                      gold    yellow
#> 3               R2-D2     96    32               white, blue       red
#> 4         Darth Vader    202   136          none       white    yellow
#> 5         Leia Organa    150    49         brown       light     brown
#> 6           Owen Lars    178   120   brown, grey       light      blue
#> 7   Beru Whitesun lars   165    75         brown       light      blue
#> 8               R5-D4     97    32                white, red       red
#> 9   Biggs Darklighter    183    84         black       light     brown
#> 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
#> # ... with 77 more rows, and 7 more variables: birth_year ,
#> #   gender , homeworld , species , films ,
#> #   vehicles , starships
• storms has the trajectories of ~200 tropical storms. It contains a strong grouping structure.

storms
#> # A tibble: 10,010 x 13
#>    name  year month   day  hour   lat  long              status category
#>
#> 1   Amy  1975     6    27     0  27.5 -79.0 tropical depression       -1
#> 2   Amy  1975     6    27     6  28.5 -79.0 tropical depression       -1
#> 3   Amy  1975     6    27    12  29.5 -79.0 tropical depression       -1
#> 4   Amy  1975     6    27    18  30.5 -79.0 tropical depression       -1
#> 5   Amy  1975     6    28     0  31.5 -78.8 tropical depression       -1
#> 6   Amy  1975     6    28     6  32.4 -78.7 tropical depression       -1
#> 7   Amy  1975     6    28    12  33.3 -78.0 tropical depression       -1
#> 8   Amy  1975     6    28    18  34.0 -77.0 tropical depression       -1
#> 9   Amy  1975     6    29     0  34.4 -75.8      tropical storm        0
#> 10  Amy  1975     6    29     6  34.0 -74.8      tropical storm        0
#> # ... with 10,000 more rows, and 4 more variables: wind ,
#> #   pressure , ts_diameter , hu_diameter
• band_members, band_instruments and band_instruments2 have a tiny amount of data about bands. They're designed to be very simple so you can illustrate how joins work without getting distracted by the details of the data.

band_members
#> # A tibble: 3 x 2
#>   name    band
#>
#> 1 Mick  Stones
#> 2 John Beatles
#> 3 Paul Beatles

band_instruments
#> # A tibble: 3 x 2
#>   name  plays
#>
#> 1 John  guitar
#> 2 Paul    bass
#> 3 Keith guitar
    New and improved verbs
• The pull() generic allows you to extract a single column either by name or position. It's similar to select() but returns a vector, rather than a smaller tibble.

mtcars %>% pull(-1) %>% str()
#>  num [1:32] 4 4 1 1 2 1 4 2 2 4 ...

mtcars %>% pull(cyl) %>% str()
#>  num [1:32] 6 6 4 6 8 6 8 4 4 6 ...

      Thanks to Paul Poncet for the idea!

    • arrange() for grouped data frames gains a .by_group argument so you can choose to sort by groups if you want to (defaults to FALSE).
• All single table verbs now have scoped variants suffixed with _if(), _at() and _all(). Use these if you want to do something to every variable (_all), variables selected by their names (_at), or variables that satisfy some predicate (_if).

iris %>% summarise_if(is.numeric, mean)
starwars %>% select_if(Negate(is.list))
storms %>% group_by_at(vars(month:hour))
    Other important changes
• Local join functions can now control how missing values are matched. The default value is na_matches = "na", which treats two missing values as equal. To prevent missing values from matching, use na_matches = "never" (see the sketch just after this list).

    You can change the default behaviour by calling pkgconfig::set_config("dplyr::na_matches", "never").

    • bind_rows() and combine() are more strict when coercing. Logical values are no longer coerced to integer and numeric. Date, POSIXct and other integer or double-based classes are no longer coerced to integer or double to avoid dropping important metadata. We plan to continue improving this interface in the future.
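As a small sketch of the na_matches behaviour described above (toy tibbles, not taken from the release notes):

library(dplyr)

df1 <- tibble(x = c(1, NA))
df2 <- tibble(x = c(1, NA), y = c("one", "missing"))

# Default: two missing values are treated as equal
left_join(df1, df2, by = "x")

# Missing values never match, so the NA row gets no partner
left_join(df1, df2, by = "x", na_matches = "never")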
    Breaking changes

From time-to-time I discover that I made a mistake in an older version of dplyr and developed what is now a clearly suboptimal API. If the problem isn't too big, I try to just leave it – the cost of making small improvements is not worth it when compared to the cost of breaking existing code. However, there are bigger improvements where I believe the short-term pain of breaking code is worth the long-term payoff of a better API.

    Regardless, it’s still frustrating when an update to dplyr breaks your code. To minimise this pain, I plan to do two things going forward:

    • Adopt an odd-even release cycle so that API breaking changes only occur in odd numbered releases. Even numbered releases will only contain bug fixes and new features. This is why I’ve skipped dplyr 0.6.0 and gone directly to dplyr 0.7.0.
• Invest time in developing better tools for isolating packages across projects so that you can choose when to upgrade a package on a project-by-project basis, and if something goes wrong, easily roll back to a version that worked. Look for news about this later in the year.
    Contributors

    dplyr is truly a community effort. Apart from the dplyr team (myself, Kirill Müller, and Lionel Henry), this release wouldn’t have been possible without patches from Christophe Dervieux, Dean Attali, Ian Cook, Ian Lyttle, Jake Russ, Jay Hesselberth, Jennifer (Jenny) Bryan, @lindbrook, Mauro Lepore, Nicolas Coutin, Daniel, Tony Fischetti, Hiroaki Yutani and Sergio Oller. Thank you all for your contributions!


    To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    Using purrr with APIs – revamping my code

    Tue, 06/13/2017 - 12:20

I wrote a little while back about using Microsoft Cognitive Services APIs with R to first of all detect the language of pieces of text and then do sentiment analysis on them. I wasn't too happy with some of the code as it was very inelegant. I knew I could code better than I had, especially as I've been doing a lot more work with purrr recently. However, it had sat in drafts for a while. Then David Smith kindly posted about the process I used, which meant I had to get this nicer version of my code out ASAP!

    Get the complete code in this gist.

Prerequisites

Setup

library(httr)
library(jsonlite)
library(dplyr)
library(purrr)

cogapikey <- "XXX"

text = c("is this english?"
        ,"tak er der mere kage"
        ,"merci beaucoup"
        ,"guten morgen"
        ,"bonjour"
        ,"merde"
        ,"That's terrible"
        ,"R is awesome")

# Put data in an object that converts to the expected schema for the API
data_frame(text) %>%
  mutate(id = row_number()) ->
  textdf

textdf %>%
  list(documents = .) ->
  mydata

Language detection

    We need to identify the most likely language for each bit of text in order to send this additional bit of info to the sentiment analysis API to be able to get decent results from the sentiment analysis.

    cogapi<-"https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/languages?numberOfLanguagesToDetect=1" cogapi %>% POST(add_headers(`Ocp-Apim-Subscription-Key`=cogapikey), body=toJSON(mydata)) -> response # Process response response %>% content() %>% flatten_df() %>% select(detectedLanguages) %>% flatten_df()-> respframe textdf %>% mutate(language= respframe$iso6391Name) -> textdf Sentiment analysis

    With an ID, text, and a language code, we can now request the sentiment of our text be analysed.

# New info
mydata <- list(documents = textdf)

# New endpoint
cogapi <- "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment"

# Construct a request
cogapi %>%
  POST(add_headers(`Ocp-Apim-Subscription-Key` = cogapikey),
       body = toJSON(mydata)) ->
  response

# Process response
response %>%
  content() %>%
  flatten_df() %>%
  mutate(id = as.numeric(id)) ->
  respframe

# Combine
textdf %>%
  left_join(respframe) ->
  textdf

And… et voilà! A multi-language dataset with the language identified and the sentiment scored, using purrr for easier-to-read code.

    Using purrr with APIs makes code nicer and more elegant as it really helps interact with hierarchies from JSON objects. I feel much better about this code now!

Original

id  language  text                  score
1   en        is this english?      0.2852910
2   da        tak er der mere kage  NA
3   fr        merci beaucoup        0.8121097
4   de        guten morgen          NA
5   fr        bonjour               0.8118965
6   fr        merde                 0.0515683
7   en        That's terrible       0.1738841
8   en        R is awesome          0.9546152

Revised

text                  id  language  score
is this english?      1   en        0.2265771
tak er der mere kage  2   da        0.7455934
merci beaucoup        3   fr        0.8121097
guten morgen          4   de        0.8581840
bonjour               5   fr        0.8118965
merde                 6   fr        0.0515683
That's terrible       7   en        0.0068665
R is awesome          8   en        0.9973871

    Interestingly the scores for English have not stayed the same – for instance, Microsoft now sees “R is awesome” in a much more positive light. It’s also great to see German and Danish are now supported!

    Get the complete code in this gist.

    The post Using purrr with APIs – revamping my code appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.


    New rOpenSci Packages for Text Processing in R

    Tue, 06/13/2017 - 09:00

Textual data and natural language processing are still a niche domain within the R ecosystem. The NLP task view gives an overview of existing work; however, a lot of basic infrastructure is still missing.
    At the rOpenSci text workshop in April we discussed many ideas for improving text processing in R which revealed several core areas that need improvement:

• Reading: better tools for extracting text and metadata from documents in various formats (doc, rtf, pdf, etc).
    • Encoding: many text packages work well for ascii text but rapidly break down when text contains Hungarian, Korean or emojis.
    • Interchange: packages don't work well together due to lack of data classes or conventions for textual data (see also ropensci/tif)

Participants also had many good suggestions for C/C++ libraries that text researchers in R might benefit from. Over the past weeks I was able to look into these suggestions and work on a few packages for reading and analyzing text. Below is an update on new and improved rOpenSci tools for text processing in R!

    Google language detector 2 and 3

New packages cld2 and cld3 are wrappers for C++ libraries by Google for language identification. CLD2 is a Naïve Bayesian classifier, whereas CLD3 uses a neural network model. I found cld2 to give better results for short text.

    # Get the latest versions install.packages(c("cld2", "cld3")) # Vectorized function text <- c("À chaque fou plaît sa marotte.", "猿も木から落ちる", "Алты́нного во́ра ве́шают, а полти́нного че́ствуют.", "Nou breekt mijn klomp!") cld2::detect_language(text) # [1] "fr" "ja" "ru" "nl"

    Maëlle has written a cool post comparing language classification methods using 18000 "#RolandGarros2017" tweets and Thomas reminds us that algorithms can easily be fooled. Still I found the accuracy on real text quite astonishing given the relatively small size of these libraries.

    Note that the algorithm for CLD3 is still under development and the engineers at Google have recently opened their Github issues page for feedback.

    (anti) word and (un)rtf

    Many archived documents are only available in legacy formats such as .doc and .rtf. The only tools available for extracting text from these documents were difficult to install and could not be imported from packages and scripts.

    To make this a little easier we have packaged up utilities antiword and UnRTF to read MS doc and rtf files respectively.

    # Get the latest versions install.packages(c("antiword", "unrtf")) # Extract text from 'rtf' file text <- unrtf::unrtf("https://jeroen.github.io/files/sample.rtf", format = "text") cat(text) ### Lots of text... # Extract text from 'doc' file text <- antiword::antiword("https://jeroen.github.io/files/UDHR-english.doc") cat(text) ### Lots of text...

Also have a look at the meta packages readtext or textreadr, which wrap these and other packages for automatically reading text in many different formats.

    pdf utilities

    Our pdftools package now supports reading pdf (extracting text or metadata) and rendering pdf to png, jpeg, tiff, or raw vectors on all platforms (incl. Windows).

# Read some text
text <- pdftools::pdf_text('https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf')
cat(text[1])
# An Introduction to R
# Notes on R: A Programming Environment for Data Analysis and Graphics
# Version 3.4.0 (2017-04-21)
# W. N. Venables, D. M. Smith
# and the R Core Team

# Read meta data
pdftools::pdf_info('https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf')
# $version
# [1] "1.5"
#
# $pages
# [1] 105
#
# .... much more :)

You can either render a page into a raw bitmap array, or convert it directly to an image format such as png, jpeg or tiff.

files <- pdftools::pdf_convert('https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf',
                               format = "png", pages = 1:5)
# Converting page 1 to R-intro_1.png... done!
# Converting page 2 to R-intro_2.png... done!
# Converting page 3 to R-intro_3.png... done!
# Converting page 4 to R-intro_4.png... done!
# Converting page 5 to R-intro_5.png... done!

    To extract text from scanned images, also check out our tesseract package which wraps Google's powerful OCR engine.

    Stemming, tokenizing and spell checking

    Our hunspell package has had a few updates recently as well. The package is a wrapper for libhunspell which is a popular library for spell checking:

# Extract incorrect words from a piece of text
bad <- hunspell("spell checkers are not neccessairy for langauge ninja's")
print(bad[[1]])
# [1] "neccessairy" "langauge"

hunspell_suggest(bad[[1]])
# [[1]]
# [1] "necessary"    "necessarily"  "necessaries"  "recessionary" "accessory"    "incarcerate"
#
# [[2]]
# [1] "language"  "Langeland" "Lagrange"  "Lange"     "gaugeable" "linkage"   "Langland"

    The package is also used by devtools to spell-check manual pages in R packages:

devtools::spell_check()
#   WORD          FOUND IN
#   alltogether   pdftools.Rd:36
#   cairo         pdf_render_page.Rd:42
#   jpeg          pdf_render_page.Rd:40
#   libpoppler    pdf_render_page.Rd:42, pdftools.Rd:30, description:1
#   png           pdf_render_page.Rd:40
#   Poppler       pdftools.Rd:34

    Finally hunspell also exposes the underlying methods needed for spell checking such as stemming words:

# Find possible stems for each word
words <- c("loving", "loved", "lover", "lovely", "love")
hunspell_analyze(words)
# [[1]]
# [1] " st:loving"    " st:love fl:G"
#
# [[2]]
# [1] " st:loved"     " st:love fl:D"
#
# [[3]]
# [1] " st:lover"     " st:love fl:R"
#
# [[4]]
# [1] " st:lovely"    " st:love fl:Y"
#
# [[5]]
# [1] " st:love"

Hunspell also supports tokenizing words from html, latex, man, or plain text. For more advanced word extraction, check out the rOpenSci tokenizers package.
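As a rough illustration of that package (a minimal sketch; the sample sentence is made up):

# install.packages("tokenizers")
library(tokenizers)

tokenize_words("R makes text processing easier every year.")
# returns a list with one element of lower-cased, punctuation-free tokens:
# [[1]]
# [1] "r" "makes" "text" "processing" "easier" "every" "year"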


    Joining Tables in SparkR

    Tue, 06/13/2017 - 07:27

    (This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local")

df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true")

grp1 <- groupBy(filter(df1, "month in (1, 2, 3)"), "month")
sum1 <- withColumnRenamed(agg(grp1, min_dep = min(df1$dep_delay)), "month", "month1")

grp2 <- groupBy(filter(df1, "month in (2, 3, 4)"), "month")
sum2 <- withColumnRenamed(agg(grp2, max_dep = max(df1$dep_delay)), "month", "month2")

# INNER JOIN
showDF(merge(sum1, sum2, by.x = "month1", by.y = "month2", all = FALSE))
showDF(join(sum1, sum2, sum1$month1 == sum2$month2, "inner"))
#+------+-------+------+-------+
#|month1|min_dep|month2|max_dep|
#+------+-------+------+-------+
#|     3|    -25|     3|    911|
#|     2|    -33|     2|    853|
#+------+-------+------+-------+

# LEFT JOIN
showDF(merge(sum1, sum2, by.x = "month1", by.y = "month2", all.x = TRUE))
showDF(join(sum1, sum2, sum1$month1 == sum2$month2, "left"))
#+------+-------+------+-------+
#|month1|min_dep|month2|max_dep|
#+------+-------+------+-------+
#|     1|    -30|  null|   null|
#|     3|    -25|     3|    911|
#|     2|    -33|     2|    853|
#+------+-------+------+-------+

# RIGHT JOIN
showDF(merge(sum1, sum2, by.x = "month1", by.y = "month2", all.y = TRUE))
showDF(join(sum1, sum2, sum1$month1 == sum2$month2, "right"))
#+------+-------+------+-------+
#|month1|min_dep|month2|max_dep|
#+------+-------+------+-------+
#|     3|    -25|     3|    911|
#|  null|   null|     4|    960|
#|     2|    -33|     2|    853|
#+------+-------+------+-------+

# FULL JOIN
showDF(merge(sum1, sum2, by.x = "month1", by.y = "month2", all = TRUE))
showDF(join(sum1, sum2, sum1$month1 == sum2$month2, "full"))
#+------+-------+------+-------+
#|month1|min_dep|month2|max_dep|
#+------+-------+------+-------+
#|     1|    -30|  null|   null|
#|     3|    -25|     3|    911|
#|  null|   null|     4|    960|
#|     2|    -33|     2|    853|
#+------+-------+------+-------+


    To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    RcppMsgPack 0.1.1

    Tue, 06/13/2017 - 04:24

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A new package! Or at least new on CRAN as the very initial version 0.1.0 had been available via the ghrr drat for over a year. But now we have version 0.1.1 to announce as a CRAN package.

RcppMsgPack provides R with MessagePack header files for use via C++ (or C, if you must) packages such as RcppRedis.

    MessagePack itself is an efficient binary serialization format. It lets you exchange data among multiple languages like JSON. But it is faster and smaller. Small integers are encoded into a single byte, and typical short strings require only one extra byte in addition to the strings themselves.

    MessagePack is used by Redis and many other projects, and has bindings to just about any language.

To use this package, simply add it to the LinkingTo: field in the DESCRIPTION file of your R package—and the R package infrastructure tools will then know how to set include flags correctly on all architectures supported by R.
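A minimal sketch of the relevant DESCRIPTION entry (the package name is hypothetical; a package that also uses Rcpp would typically list it in LinkingTo as well):

Package: mypackage
LinkingTo: RcppMsgPack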

More information may be on the RcppMsgPack page. Issues and bug reports should go to the GitHub issue tracker.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    Data Science for Business – Time Series Forecasting Part 3: Forecasting with Facebook’s Prophet

    Tue, 06/13/2017 - 02:00

    (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

    In my last two posts (Part 1 and Part 2), I explored time series forecasting with the timekit package.

    In this post, I want to compare how Facebook’s prophet performs on the same dataset.

    Predicting future events/sales/etc. isn’t trivial for a number of reasons and different algorithms use different approaches to handle these problems. Time series data does not behave like a regular numeric vector, because months don’t have the same number of days, weekends and holidays differ between years, etc. Because of this, we often have to deal with multiple layers of seasonality (i.e. weekly, monthly, yearly, irregular holidays, etc.). Regularly missing days, like weekends, are easier to incorporate into time series models than irregularly missing days.

    Timekit uses a time series signature for modeling, which we used as features to build our model of choice (e.g. a linear model). This model was then used for predicting future dates.

    Prophet is Facebook’s time series forecasting algorithm that was just recently released as open source software with an implementation in R.

“Prophet is a procedure for forecasting time series data. It is based on an additive model where non-linear trends are fit with yearly and weekly seasonality, plus holidays. It works best with daily periodicity data with at least one year of historical data. Prophet is robust to missing data, shifts in the trend, and large outliers.”

    (I am not going to discuss forecast and ARIMA or other models because they are quite well established with lots and lots of excellent tutorials out there.)

    Training and Test data

    I am using the same training and test intervals as in my last post using timekit.

    Just as with timekit, prophet starts with a data frame that consists of a date column and the respective response variable for each date.

library(prophet)
library(tidyverse)
library(tidyquant)

retail_p_day <- retail_p_day %>%
  mutate(model = ifelse(day <= "2011-11-01", "train", "test"))

train <- filter(retail_p_day, model == "train") %>%
  select(day, sum_income) %>%
  rename(ds = day, y = sum_income)

test <- filter(retail_p_day, model == "test") %>%
  select(day, sum_income) %>%
  rename(ds = day)

    Model building

    In contrast to timekit, we do not “manually” augment the time series signature in prophet, we can directly feed our input data to the prophet() function (check the function help for details on optional parameters).

    To make it comparable, I am feeding the same list of irregularly missing days to the prophet() function. As discussed in the last post, I chose not to use a list of holidays because the holidays in the observation period poorly matched the days that were actually missing.

off_days <- data.frame(ds = as.Date(c("2010-12-24", "2010-12-25", "2010-12-26", "2010-12-27",
                                      "2010-12-28", "2010-12-29", "2010-12-30", "2010-01-01",
                                      "2010-01-02", "2010-01-03", "2011-04-22", "2011-04-23",
                                      "2011-04-24", "2011-04-25", "2011-05-02", "2011-05-30",
                                      "2011-08-29", "2011-04-29", "2011-04-30"))) %>%
  mutate(holiday = paste0("off_day_", seq_along(1:length(ds))))

prophet_model_test <- prophet(train,
                              growth = "linear",          # growth curve trend
                              n.changepoints = 100,       # Prophet automatically detects changes in trends by selecting changepoints from the data
                              yearly.seasonality = FALSE, # yearly seasonal component using Fourier series
                              weekly.seasonality = TRUE,  # weekly seasonal component using dummy variables
                              holidays = off_days)
## Initial log joint probability = -8.3297
## Optimization terminated normally:
##   Convergence detected: relative gradient magnitude is below tolerance

    Predicting test data

    With our model, we can now predict on the test data and compare the predictions with the actual values.

    forecast_test <- predict(prophet_model_test, test)

    Just as with timekit, I want to have a look at the residuals. Compared to timekit, the residuals actually look almost identical…

forecast_test %>%
  mutate(resid = sum_income - yhat) %>%
  ggplot(aes(x = ds, y = resid)) +
    geom_hline(yintercept = 0, color = "red") +
    geom_point(alpha = 0.5, color = palette_light()[[1]]) +
    geom_smooth() +
    theme_tq()

    … As does the comparison plot. So, here it seems that prophet built a model that is basically identical to the linear model I used with timekit.

forecast_test %>%
  gather(x, y, sum_income, yhat) %>%
  ggplot(aes(x = ds, y = y, color = x)) +
    geom_point(alpha = 0.5) +
    geom_line(alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq()

    Predicting future sales

    Now, let’s see whether the future predictions will be identical as well.

Just like with timekit, I am using a future time series of 300 days. Here, we see a slight difference in how we generate the future time series: with timekit I could use the entire index of observed dates, together with the list of missing days, while prophet uses the forecasting model that was generated for comparing the test data, i.e. it only considers the dates from the training set. We could build a new model with the entire dataset, but this would then be different from how I approached the modeling with timekit.

future <- make_future_dataframe(prophet_model_test, periods = 300)
forecast <- predict(prophet_model_test, future)

plot(prophet_model_test, forecast) +
  theme_tq()

    Interestingly, prophet’s forecast is distinctly different from timekit’s, despite identical performance on test samples! While timekit predicted a drop at the beginning of the year (similar to the training period), prophet predicts a steady increase in the future. It looks like timekit put more weight on the overall pattern during the training period, while prophet seems to put more weight on the last months, which showed a rise in net income.

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] tidyquant_0.5.1               quantmod_0.4-8
##  [3] TTR_0.23-1                    PerformanceAnalytics_1.4.3541
##  [5] xts_0.9-7                     zoo_1.8-0
##  [7] lubridate_1.6.0               dplyr_0.5.0
##  [9] purrr_0.2.2.2                 readr_1.1.1
## [11] tidyr_0.6.3                   tibble_1.3.1
## [13] ggplot2_2.2.1                 tidyverse_1.1.1
## [15] prophet_0.1.1                 Rcpp_0.12.11
##
## loaded via a namespace (and not attached):
##  [1] rstan_2.15.1         reshape2_1.4.2       haven_1.0.0
##  [4] lattice_0.20-35      colorspace_1.3-2     htmltools_0.3.6
##  [7] stats4_3.4.0         yaml_2.1.14          rlang_0.1.1
## [10] foreign_0.8-68       DBI_0.6-1            modelr_0.1.0
## [13] readxl_1.0.0         plyr_1.8.4           stringr_1.2.0
## [16] Quandl_2.8.0         munsell_0.4.3        gtable_0.2.0
## [19] cellranger_1.1.0     rvest_0.3.2          codetools_0.2-15
## [22] psych_1.7.5          evaluate_0.10        labeling_0.3
## [25] inline_0.3.14        knitr_1.16           forcats_0.2.0
## [28] parallel_3.4.0       broom_0.4.2          scales_0.4.1
## [31] backports_1.0.5      StanHeaders_2.15.0-1 jsonlite_1.5
## [34] gridExtra_2.2.1      mnormt_1.5-5         hms_0.3
## [37] digest_0.6.12        stringi_1.1.5        grid_3.4.0
## [40] rprojroot_1.2        tools_3.4.0          magrittr_1.5
## [43] lazyeval_0.2.0       xml2_1.1.1           extraDistr_1.8.5
## [46] assertthat_0.2.0     rmarkdown_1.5        httr_1.2.1
## [49] R6_2.2.1             nlme_3.1-131         compiler_3.4.0

    To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    thinning a Markov chain, statistically

    Tue, 06/13/2017 - 00:17

    (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

Art Owen has arXived a new version of his thinning MCMC paper, where he studies how thinning or subsampling can improve computing time in MCMC chains. I remember quite well the message sent by Mark Berliner and Steve MacEachern in an early 1990's paper that subsampling was always increasing the variance of the resulting estimators. We actually have this result in our Monte Carlo Statistical Methods book. Now, there are other perspectives on this, as for instance cases when thinning can be hard-wired by simulating directly a k-step move, delaying rejection or acceptance, prefetching, or simulating directly the accepted values as in our vanilla Rao-Blackwellisation approach. Here, Art considers the case when there is a cost θ of computing a transform of the simulation [when the transition costs a unit] and when those transforms are positively correlated with correlation ρ. Somewhat unsurprisingly, when θ is large enough, thinning becomes worth implementing. But it requires extra computations in evaluating the correlation ρ and the cost θ, which is rarely comparable with the cost of computing the likelihood itself, a requirement for the Metropolis-Hastings or Hamiltonian Monte Carlo step(s).  Subsampling while keeping the right target (which is a hard constraint!) should thus have a much more effective impact on computing budgets.
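As a rough numerical illustration of the Berliner/MacEachern point above (a made-up AR(1) chain standing in for MCMC output, not anything from Owen's paper):

set.seed(1)
rho <- 0.9   # positive autocorrelation, as in a typical slowly-mixing chain
n   <- 1e4   # fixed simulation budget
k   <- 10    # thinning interval

# Compare the variance of the chain mean with and without thinning,
# for the same number of simulated transitions
sim_means <- replicate(200, {
  x <- as.numeric(arima.sim(list(ar = rho), n = n))
  c(full = mean(x), thinned = mean(x[seq(1, n, by = k)]))
})

apply(sim_means, 1, var)
# the thinned estimator typically shows a noticeably larger variance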

    Filed under: Books, pictures, R, Statistics Tagged: autocorrelation, computing time, MCMC, MCMC convergence, Monte Carlo Statistical Methods, thinning, vanilla Rao-Blackwellisation


    To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    The magic trick that highlights significant results on any table

    Mon, 06/12/2017 - 22:29

    (This article was first published on R – Displayr, and kindly contributed to R-bloggers)

    This post describes the single biggest time saving technique that I know about for highlighting significant results on a table. The table below, which shows the results of a MANOVA, illustrates the trick. The coloring and bold fonts on the table quickly draw the eye to the two key patterns in the results: people aged 50 or more have lower averages, and the result is strongest for Intel. This is revealed by the coral-colored shading and the bolding.

     

    The same principles can be applied to any table containing numbers. The table below shows people’s attitude towards Microsoft by age. The attitude is split into three categories: Detractors, Neutrals, and Promoters. What story does it tell? If you are diligent you can work it out. But, it is hard work to do so because it requires you to compare and contrast 15 different numbers on the table.

    Now look at the table below. The colors and arrow tell us people aged 50 or more are less likely to be Promoters than people in the other age groups. Sure we could get to this insight by reading the first table and doing a whole lot of significance tests. But, the simple use of coloring and an arrow makes the key result much easier to find.

    Formatting the table to emphasize differences in residuals is the basic principle used in both examples illustrated above.

     

    A simple approach to computing residuals

    How do we work out which cells are interesting automatically? To illustrate this, I will walk through the calculations used in the table showing attitude towards Microsoft by age, returning to more complex analyses later in this post.

    We start with a table showing the total percentages (i.e., the number of people in each cell of the table, divided by the total sample size). With this table, our cell showing Promoters aged 50 or more shows a value of 3%. Is this result interesting?

    To answer this question we should look at the TOTALs. We can see that 27% is the TOTAL percentage for 50 or more. So, all else being equal we should expect that 27% of Promoters will be aged 50 or more. As 21% of the respondents are Promoters, we would then expect, all else being equal, that the percentage that are Promoters aged 50 or more will be 27% * 21% = 6%. The actual percentage shown on the table is only 3%, so we can conclude that the proportion of people that are Promoters and aged 50 or more is roughly half what we would expect if there was no relationship between age and attitude.

    The 6% value that I computed above for Promoters aged 50 or more is called the expected value. It is an estimate of the percentage of people in the sample who would be Promoters aged 50 or more, in the event that age and attitude are unrelated. The table below shows the expected values for all the cells in the table. That is, the values computed by multiplying together all the row and column totals for each cell. It is showing 5.6% rather than 6% for the Promoters aged 50 or more as I have used all the decimal places in the data when performing the calculations.

This next table shows the beginning of the magic. Now I have subtracted the expected values from the actual (observed) percentages. The resulting table shows the residuals (residuals = observed – expected values). Cells close to 0% tend to be uninteresting.

     

    Using statistical significance to save more time: standardized residuals

The residuals are difficult to correctly interpret. They have two problems. First, how big is “big”? Second, and less obviously, it is often not appropriate to compare them. We can readily see both of these problems by looking at the residual of -2.6 for Neutral and 30 to 49. This is the biggest residual on the table. It is bigger than the residual for Promoters aged 50 or more. However, the expected value for the Neutrals aged 30 to 49 is much closer to 50% than it is for the Promoters aged 50 or more. If you have studied your introductory stats, you will recall that this means that there is more uncertainty about the estimate (i.e., a higher sampling error).

    There is a simple solution to this problem: scale the residuals so that they show statistical significance. This is straightforward to do (to see the underlying R code, click here to go into Displayr and click on the tables). These standardized residuals are z-scores.  This means that values of more than 1.96, or less than -1.96, are significant at the 0.05 level. The only cell in this table that is significant at the 0.05 level is Promoters in the 50 or more age group. This tells us that this is the first cell we should look at when exploring the table.
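For a plain contingency table of counts, base R already returns all of these quantities; here is a minimal sketch with made-up counts (not the survey data behind the tables above):

# Toy counts: attitude (rows) by age group (columns)
tab <- matrix(c(40, 35, 30,
                50, 45, 55,
                35, 35, 15),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Detractor", "Neutral", "Promoter"),
                              c("18 to 29", "30 to 49", "50 or more")))

test <- chisq.test(tab)
test$expected                  # expected counts under independence (row total * column total / n)
test$observed - test$expected  # raw residuals
test$stdres                    # standardized residuals; |z| > 1.96 is significant at the 0.05 level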

     

    Formatting

    The standardized residuals provide two types of information that allow us to quickly see patterns on a table. First, we have the standardized residuals themselves. On the table below, negative residuals are shaded in coral and positive values in blue, with the degree of shading proportional to the values. The second bit of information is the statistical significance of the cells (e.g., whether the z-scores are significant at the 0.05 level). This has been shown via bolding.

    The table below uses the standardized residuals in a slightly different way:

• Arrow lengths are proportional to the standardized residuals.
    • Arrows only appear when the residual is significant.
    • The color of the fonts and the arrows indicates whether the residuals are positive or negative.
    • The order of the rows and columns follows the values of the standardized residuals (for Google and Fun).

    The net effect of these formatting choices is that it becomes relatively easy to extract key conclusions from the table. For example, Apple does very well on Stylish, but less so on the price-related dimensions.

     

    Techniques for other types of data

    The “magic” consists of two separate things: (1) emphasizing differences in residuals and (2) statistical significance. There are lots of different ways to achieve these outcomes.

    Computing residuals

The basic method for calculating the residuals illustrated above is technically only “correct” for tables with mutually exclusive categorical variables in the rows and columns. However, it is widely used with pretty much all types of data and does a good job. There is also no shortage of alternative approaches. For example, in the MANOVA example the residuals are t-statistics. The table immediately above uses a log-linear model to compute the expected values.

    Computing statistical significance

    The method to use to compute statistical significance depends both on the data and the assumptions you want to make. Provided that you can compute the standard error for every cell on the table, you can compute a z- or t-statistic by dividing the residual by the standard error. You can of course get more rigorous. In the MANOVA example above, for example, I have computed robust standard errors and applied the false discovery rate correction.

     

    Software

    For this post, I calculated and formatted the standardized residuals and the MANOVA using R (the formatting uses the formattable package). Click here to go into Displayr, then click on the MANOVA output table in Displayr and you will see the R CODE.
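As a rough sketch of how that kind of conditional formatting can be done with formattable (the data and thresholds are illustrative, not the code behind the tables above):

library(formattable)

resid_df <- data.frame(segment = c("18 to 29", "30 to 49", "50 or more"),
                       z = c(0.8, 1.1, -2.4))  # made-up standardized residuals

formattable(resid_df, list(
  z = formatter("span",
                style = x ~ style(color = ifelse(x < 0, "coral", "steelblue"),
                                  "font-weight" = ifelse(abs(x) > 1.96, "bold", "normal")))
))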

    To create the two tables that show standardized residuals with arrows and font colors, I used Displayr, which automatically computes standardized residuals on all tables, and formats them in this way.


    To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    Interfacing with APIs using R: the basics

    Mon, 06/12/2017 - 22:18

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    While R (and its package ecosystem) provides a wealth of functions for querying and analyzing data, in our cloud-enabled world there's now a plethora of online services with APIs you can use to augment R's capabilities. Many of these APIs use a RESTful interface, which means you will typically send/receive data encoded in the JSON format using HTTP commands.

    Fortunately, as Steph Locke explains in her most recent R Quick tip, the process is pretty simple using R:

    1. Obtain an authentication key for using the service 
    2. Find the URL of the API service you wish to use
    3. Convert your input data to JSON format using toJSON in the jsonlite package
    4. Send your data to the API service using the POST function in the httr package. Include your API key using the add_headers function
    5. Extract your results from the API response using the fromJSON function

Steph provides an example of detecting text language using the Microsoft Cognitive Services Text Analytics API, but that basic recipe should work for most APIs. (API keys for other Microsoft Cognitive Services APIs for vision, speech, language, knowledge and search are also available for free use with your Github account.) There are many APIs available for use around the web, and this list of public APIs maintained by Abhishek Banthia is also a useful resource.
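A generic sketch of that five-step recipe (the endpoint URL, key, and payload below are placeholders rather than a real service):

library(httr)
library(jsonlite)

api_key  <- "YOUR-KEY-HERE"                        # step 1: authentication key
endpoint <- "https://api.example.com/v1/analyze"   # step 2: URL of the API service (placeholder)

payload <- toJSON(list(documents = data.frame(id = 1, text = "R is awesome")))  # step 3

response <- POST(endpoint,
                 add_headers(`Ocp-Apim-Subscription-Key` = api_key),  # step 4: send the request
                 body = payload)

result <- fromJSON(content(response, as = "text"))  # step 5: extract the results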

    Locke Data: R Quick tip: Microsoft Cognitive Services’ Text Analytics API

     

     


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    Workshop on Monetizing R Packages

    Mon, 06/12/2017 - 21:35

    (This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

    Last week I gave a talk at the San Francisco EARL Conference about monetizing R packages. The talk was well received, so this Thursday at 9am PT I will be giving the same presentation as a live workshop.

    This workshop covers my initial foray into monetizing choroplethr, which occurred in 2015. You will learn:

    1. Why I decided to monetize choroplethr.
    2. The strategy I used.
    3. Three tactics that can help you monetize your own R package.

    The workshop is free to attend. There will be a live Q&A at the end.

    Register for the Workshop

    The post Workshop on Monetizing R Packages appeared first on AriLamstein.com.


    To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    Clustering

    Mon, 06/12/2017 - 20:53

    (This article was first published on R – Real Data, and kindly contributed to R-bloggers)

    Hello, everyone! I’ve been meaning to get a new blog post out for the
    past couple of weeks. During that time I’ve been messing around with
    clustering. Clustering, or cluster analysis, is a method of data mining
    that groups similar observations together. Classification and clustering
    are quite alike, but clustering is more concerned with exploration than
    an end result.

    Note: This post is far from an exhaustive look at all clustering has to
    offer. Check out this guide
    for more. I am reading Data Mining by Aggarwal presently, which is very informative.

    data("iris") head(iris) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa

    For simplicity, we’ll use the iris dataset. We’re going to try to use
    the numeric data to correctly group the observations by species. There
    are 50 of each species in the dataset, so ideally we would end up with
    three clusters of 50.

library(ggplot2)

ggplot() +
  geom_point(aes(iris$Sepal.Length, iris$Sepal.Width, col = iris$Species))

As you can see, there are already some groupings present. Let's use
hierarchical clustering first.

iris2 <- iris[, c(1:4)] # not going to use the `Species` column

medians <- apply(iris2, 2, median)
mads <- apply(iris2, 2, mad)
iris3 <- scale(iris2, center = medians, scale = mads)

dist <- dist(iris3)
hclust <- hclust(dist, method = 'median') # there are several methods for hclust
cut <- cutree(hclust, 3)

table(cut)
## cut
##  1  2  3
## 49 34 67

iris <- cbind(iris, cut)
iris$cut <- factor(iris$cut)
levels(iris$cut) <- c('setosa', 'versicolor', 'virginica')

err <- iris$Species == iris$cut
table(err)
## err
## FALSE  TRUE
##    38   112

ggplot() +
  geom_point(aes(iris2$Sepal.Length, iris2$Sepal.Width, col = iris$cut))

Nice groupings here, but it looks like the algorithm has some trouble
distinguishing between versicolor and virginica. If we used this
    information to classify the observations, we’d get an error rate of
    about .25. Let’s try another clustering technique: DBSCAN.

library(dbscan)

db <- dbscan(iris3, eps = 1, minPts = 20)
table(db$cluster)
##
##  0  1  2
##  5 48 97

iris2 <- cbind(iris2, db$cluster)
iris2$cluster <- factor(iris2$`db$cluster`)

ggplot() +
  geom_point(aes(iris2$Sepal.Length, iris2$Sepal.Width, col = iris2$cluster))

    DBSCAN classifies points into three different categories: core, border,
    and noise points on the basis of density. Thus, the versicolor/
    virginica cluster is treated as one group. Since our data is not
    structured in such a way that density is meaningful, DBSCAN is probably
    not a wise choice here.
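If you do want to tune DBSCAN anyway, the dbscan package ships a k-nearest-neighbour distance plot that is commonly used to pick eps (a sketch, not part of the original analysis):

# Look for the "knee" in the sorted k-NN distances; its height suggests an eps
dbscan::kNNdistplot(iris3, k = 20)
abline(h = 1, lty = 2)  # the eps = 1 used above, for reference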

    Let’s look at one last algo: the ROCK. No, not that ROCK.

library(cba)

distm <- as.matrix(dist)
rock <- rockCluster(distm, 3, theta = .02)
## Clustering:
## computing distances ...
## computing links ...
## computing clusters ...

iris$rock <- rock$cl
levels(iris$rock) <- c('setosa', 'versicolor', 'virginica')

ggplot() +
  geom_point(aes(iris2$Sepal.Length, iris2$Sepal.Width, col = rock$cl))

err <- (iris$Species == iris$rock)
table(err)
## err
## FALSE  TRUE
##    24   126

    While it may not look like it, the ROCK does the best job at determining
    clusters in this data – the error rate dropped to 16%. Typically this
    method is reserved for categorical data, but since we used dist it
    shouldn't cause any problems.

    I have been working on a project using some of these (and similar) data
    mining procedures to explore spatial data and search for distinct
    groups. While clustering the iris data may not be all that meaningful,
    I think it is illustrative of the power of clustering. I have yet to try
    higher-dimension clustering techniques, which might be even better at
    determining Species.

    Have any comments? Questions? Suggestions for future posts? I am always
    happy to hear from readers; please contact me!

    Happy clustering,

    Kiefer


    To leave a comment for the author, please follow the link and comment on their blog: R – Real Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    LASSO regression in R exercises

    Mon, 06/12/2017 - 18:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

Least Absolute Shrinkage and Selection Operator (LASSO) performs regularization and variable selection on a given model. Depending on the size of the penalty term, LASSO shrinks less relevant predictors to (possibly) zero. Thus, it enables us to consider a more parsimonious model. In this exercise set we will use the glmnet package (package description: here) to implement LASSO regression in R.
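As a quick orientation before the exercises, a LASSO fit with glmnet generally looks like this (a sketch on simulated data, not the diabetes dataset used below):

library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)  # simulated predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(100)    # only the first two predictors matter

fit <- glmnet(x, y, alpha = 1)  # alpha = 1 gives the LASSO penalty
plot(fit, xvar = "norm")        # coefficient paths against the L1 norm of the beta vector

cv <- cv.glmnet(x, y)           # cross-validation to choose lambda
coef(cv, s = "lambda.min")      # some coefficients are shrunk exactly to zero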

    Answers to the exercises are available here.

    Exercise 1
    Load the lars package and the diabetes dataset (Efron, Hastie, Johnstone and Tibshirani (2003) “Least Angle Regression” Annals of Statistics). This has patient level data on the progression of diabetes. Next, load the glmnet package that will be used to implement LASSO.

    Exercise 2
    The dataset has three matrices x, x2 and y. While x has a smaller set of independent variables, x2 contains the full set with quadratic and interaction terms. y is the dependent variable which is a quantitative measure of the progression of diabetes.
    It is a good idea to visually inspect the relationship of each of the predictors with the dependent variable. Generate separate scatterplots with the line of best fit for all the predictors in x with y on the vertical axis. Use a loop to automate the process.

    Exercise 3
    Regress y on the predictors in x using OLS. We will use this result as benchmark for comparison.

    Exercise 4
    Use the glmnet function to plot the path of each of x’s variable coefficients against the L1 norm of the beta vector. This graph indicates at which stage each coefficient shrinks to zero.

    Learn more about the glmnet package in the online course Regression Machine Learning with R. In this course you will learn how to:

    • Avoid model over-fitting using cross-validation for optimal parameter selection
    • Explore maximum margin methods such as best penalty of error term support vector machines with linear and non-linear kernels.
    • And much more

    Exercise 5
    Use the cv.glmnet function to get the cross validation curve and the value of lambda that minimizes the mean cross validation error.

    Exercise 6
    Using the minimum value of lambda from the previous exercise, get the estimated beta matrix. Note that some coefficients have been shrunk to zero. This indicates which predictors are important in explaining the variation in y.

    Exercise 7
    To get a more parsimonious model we can use a higher value of lambda that is within one standard error of the minimum. Use this value of lambda to get the beta coefficients. Note that more coefficients are now shrunk to zero.

    Exercise 8
    As mentioned earlier, x2 contains a wider variety of predictors. Using OLS, regress y on x2 and evaluate results.

    Exercise 9
    Repeat exercise-4 for the new model.

    Exercise 10
    Repeat exercises 5 and 6 for the new model and see which coefficients are shrunk to zero. This is an effective way to narrow down on important predictors when there are many candidates.

    Related exercise sets:
    1. Evaluate your model with R Exercises
    2. Multiple Regression (Part 3) Diagnostics
    3. Data science for Doctors: Inferential Statistics Exercises(Part-4)
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory

    To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

