R bloggers
R news and tutorials contributed by hundreds of R bloggers

RcppArmadillo 0.7.960.1.1

Sun, 08/20/2017 - 21:20

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

On the heels of the very recent bi-monthly RcppArmadillo release comes a quick bug-fix release 0.7.960.1.1 which just got onto CRAN (and I will ship a build to Debian in a moment).

There were three distinct issues I addressed in three quick pull requests:

  • The excellent Google Summer of Code work by Binxiang Ni had only encountered direct use of sparse matrices as produced by the Matrix package. However, while we waited for 0.7.960.1.0 to make it onto CRAN, the quanteda package switched to derived classes, which we now account for via the is() method of our S4 class. Thanks to Kevin Ushey for reminding me we had is().
  • We had somehow failed to account for the R 3.4.* and Rcpp 0.12.{11,12} changes for package registration (with .registration=TRUE), so we ensured there is only one fastLm symbol.
  • The build did not take too well to systems without OpenMP, so we now explicitly disable OpenMP support via an Armadillo configuration variable. In general, client packages probably want to enable C++11 support when using OpenMP (explicitly), but we prefer not to upset too many (old) users. However, our configure check now also requires g++ 4.7.2 or later, just like Armadillo.

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 382 other packages on CRAN, an increase of 52 since the CRAN release in June!

Changes in this release relative to the previous CRAN release are as follows:

Changes in RcppArmadillo version 0.7.960.1.1 (2017-08-20)
  • Added improved check for inherited S4 matrix classes (#162 fixing #161)

  • Changed fastLm C++ function to fastLm_impl to not clash with R method (#164 fixing #163)

  • Added OpenMP check for configure (#166 fixing #165)

Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Answer probability questions with simulation

Sun, 08/20/2017 - 18:33

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Probability is at the heart of data science. Simulation is also commonly used in algorithms such as the bootstrap. After completing this exercise, you will have a slightly stronger intuition for probability and for writing your own simulation algorithms.

Most of the problems in this set have an exact analytical solution, which is not the case for all probability problems, but they are great for practice since we can check against the exact correct answer.

To get the most out of the exercises, it pays off to read the instructions carefully and think about what the solution should be before starting to write R code. Often this helps you weed out irrelevant information that can otherwise make your algorithm unnecessarily complicated.

Answers are available here.

Exercise 1
In 100 coin tosses, what is the probability of having the same side come up 10 times in a row?

You might want to use some of the following functions to answer this question: sample(), rbinom(), rle().
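
One possible simulation sketch for Exercise 1 (an illustration, not the official solution):

set.seed(1)
hits <- replicate(10000, {
  tosses <- rbinom(100, size = 1, prob = 0.5)  # 100 coin tosses coded 0/1
  max(rle(tosses)$lengths) >= 10               # any run of 10 or more identical sides?
})
mean(hits)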

Exercise 2
Six kids are standing in line. What is the probability that they are in alphabetical order by name? Assume no two children have the same exact name.

Exercise 3
Remember the kids from the last question? There are three boys and three girls. How likely is it that all the girls come first?

Exercise 4
In six coin tosses, what is the probability of having a different side come up with each throw, that is, that you never get two tails or two heads in a row?

Exercise 5
A random five-card poker hand is dealt from a standard deck. What is the chance of a flush (all cards are the same suit)?

Exercise 6
In a random thirteen-card hand from a standard deck, what is the probability that none of the cards is an ace and none is a heart (♥)?

Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to

  • work with different binomial and logistic regression techniques,
  • know how to compare regression models and choose the right fit,
  • and much more.

Exercise 7
At four parties each attended by 13, 23, 33, and 53 people respectively, how likely is it that at least two individuals share a birthday at each party? Assume there are no leap days, that all years are 365 days, and that births are uniformly distributed over the year.
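
A possible sketch for Exercise 7 (illustrative, not the official answer):

set.seed(1)
# estimated probability of at least one shared birthday for each party size
sapply(c(13, 23, 33, 53), function(n) {
  mean(replicate(10000, any(duplicated(sample(1:365, n, replace = TRUE)))))
})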

Exercise 8
A famous coin tossing game has the following rules: The player tosses a coin repeatedly until a tail appears, or tosses it a maximum of 1000 times if no tail appears. The initial stake starts at 2 dollars and is doubled every time heads appears. The first time tails appears, the game ends and the player wins whatever is in the pot. Thus the player wins 2 dollars if tails appears on the first toss, 4 dollars if heads appears on the first toss and tails on the second, 8 dollars if heads appears on the first two tosses and tails on the third, and so on. Mathematically, the player wins 2^k dollars, where k equals the number of tosses up to and including the first tail. What is the probability of profit if it costs 15 dollars to participate?

Exercise 9
Back to coin tossing. What is the probability the pattern heads-heads-tails appears before tails-heads-heads?

Exercise 10
Suppose you’re on a game show, and you’re given the choice of three doors. Behind one door is a car; behind the others, goats. You pick a door, say #1, and the host, who knows what’s behind the doors, opens another door, say #3, which has a goat. He then says to you, “Do you want to pick door #2?” What is the probability of winning the car if you use the strategy of first picking a random door and then switching doors every time? Note that the host will always open a door you did not pick, and it always reveals a goat.
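
A possible sketch for Exercise 10, simulating the always-switch strategy (illustrative, not the official answer):

set.seed(1)
wins <- replicate(10000, {
  car  <- sample(1:3, 1)
  pick <- sample(1:3, 1)
  goats  <- setdiff(1:3, c(car, pick))                           # goat doors the host may open
  opened <- if (length(goats) == 1) goats else sample(goats, 1)  # avoid sample()'s single-number gotcha
  switched <- setdiff(1:3, c(pick, opened))                      # switch to the remaining door
  switched == car
})
mean(wins)  # should be close to 2/3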

Related exercise sets:
  1. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-4)
  2. Probability functions beginner
  3. Combinations Exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

#10: Compacting your Shared Libraries, After The Build

Sun, 08/20/2017 - 15:40

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Welcome to the tenth post in the rarely ranting R recommendations series, or R4 for short. A few days ago we showed how to tell the linker to strip shared libraries. As discussed in that post, there are two options: one can either set up ~/.R/Makevars to pass the strip-debug option to the linker, or one can adjust src/Makevars in the package itself with a bit of Makefile magic.

Of course, there is a third way: just run strip --strip-debug over all the shared libraries after the build. As the path is standardized, and the shell does proper globbing, we can just do

$ strip --strip-debug /usr/local/lib/R/site-library/*/libs/*.so

using a double-wildcard to get all packages (in that R package directory) and all their shared libraries. Users on macOS probably want .dylib on the end, users on Windows want another computer as usual (just kidding: use .dll). Either may have to adjust the path which is left as an exercise to the reader.

The impact can be Yuge, as illustrated in the dotplot in the original post.

This illustration is in response to a mailing list post. Last week, someone claimed on r-help that tidyverse would not install on Ubuntu 17.04. And this is of course patently false as many of us build and test on Ubuntu and related Linux systems, Travis runs on it, CRAN tests them etc pp. That poor user had somehow messed up their default gcc version. Anyway: I fired up a Docker container, installed r-base-core plus three required -dev packages (for xml2, openssl, and curl) and ran a single install.packages("tidyverse"). In a nutshell, following the launch of Docker for an Ubuntu 17.04 container, it was just

$ apt-get update
$ apt-get install r-base libcurl4-openssl-dev libssl-dev libxml2-dev
$ apt-get install mg              # a tiny editor
$ mg /etc/R/Rprofile.site         # to add a default CRAN repo
$ R -e 'install.packages("tidyverse")'

which not only worked (as expected) but also installed a whopping fifty-one packages (!!) of which twenty-six contain a shared library. A useful little trick is to run du with proper options to total, summarize, and use human units which reveals that these libraries occupy seventy-eight megabytes:

root@de443801b3fc:/# du -csh /usr/local/lib/R/site-library/*/libs/*so
4.3M  /usr/local/lib/R/site-library/Rcpp/libs/Rcpp.so
2.3M  /usr/local/lib/R/site-library/bindrcpp/libs/bindrcpp.so
144K  /usr/local/lib/R/site-library/colorspace/libs/colorspace.so
204K  /usr/local/lib/R/site-library/curl/libs/curl.so
328K  /usr/local/lib/R/site-library/digest/libs/digest.so
33M   /usr/local/lib/R/site-library/dplyr/libs/dplyr.so
36K   /usr/local/lib/R/site-library/glue/libs/glue.so
3.2M  /usr/local/lib/R/site-library/haven/libs/haven.so
272K  /usr/local/lib/R/site-library/jsonlite/libs/jsonlite.so
52K   /usr/local/lib/R/site-library/lazyeval/libs/lazyeval.so
64K   /usr/local/lib/R/site-library/lubridate/libs/lubridate.so
16K   /usr/local/lib/R/site-library/mime/libs/mime.so
124K  /usr/local/lib/R/site-library/mnormt/libs/mnormt.so
372K  /usr/local/lib/R/site-library/openssl/libs/openssl.so
772K  /usr/local/lib/R/site-library/plyr/libs/plyr.so
92K   /usr/local/lib/R/site-library/purrr/libs/purrr.so
13M   /usr/local/lib/R/site-library/readr/libs/readr.so
4.7M  /usr/local/lib/R/site-library/readxl/libs/readxl.so
1.2M  /usr/local/lib/R/site-library/reshape2/libs/reshape2.so
160K  /usr/local/lib/R/site-library/rlang/libs/rlang.so
928K  /usr/local/lib/R/site-library/scales/libs/scales.so
4.9M  /usr/local/lib/R/site-library/stringi/libs/stringi.so
1.3M  /usr/local/lib/R/site-library/tibble/libs/tibble.so
2.0M  /usr/local/lib/R/site-library/tidyr/libs/tidyr.so
1.2M  /usr/local/lib/R/site-library/tidyselect/libs/tidyselect.so
4.7M  /usr/local/lib/R/site-library/xml2/libs/xml2.so
78M   total
root@de443801b3fc:/#

Looks like dplyr wins this one at thirty-three megabytes just for its shared library.
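
As a cross-check from within R, a small sketch along the following lines (assuming the same Linux-style libs/*.so layout; not part of the original post) tallies the per-package sizes:

# sum up installed shared-library sizes from R
libs  <- Sys.glob(file.path(.libPaths()[1], "*", "libs", "*.so"))
sizes <- file.size(libs) / 2^20                   # size in megabytes
names(sizes) <- basename(dirname(dirname(libs)))  # package names
head(sort(sizes, decreasing = TRUE))              # biggest offenders first
sum(sizes)                                        # total, before or after stripping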

But with a single stroke of strip we can reduce all this down a lot:

root@de443801b3fc:/# strip --strip-debug /usr/local/lib/R/site-library/*/libs/*so
root@de443801b3fc:/# du -csh /usr/local/lib/R/site-library/*/libs/*so
440K  /usr/local/lib/R/site-library/Rcpp/libs/Rcpp.so
220K  /usr/local/lib/R/site-library/bindrcpp/libs/bindrcpp.so
52K   /usr/local/lib/R/site-library/colorspace/libs/colorspace.so
56K   /usr/local/lib/R/site-library/curl/libs/curl.so
120K  /usr/local/lib/R/site-library/digest/libs/digest.so
2.5M  /usr/local/lib/R/site-library/dplyr/libs/dplyr.so
16K   /usr/local/lib/R/site-library/glue/libs/glue.so
404K  /usr/local/lib/R/site-library/haven/libs/haven.so
76K   /usr/local/lib/R/site-library/jsonlite/libs/jsonlite.so
20K   /usr/local/lib/R/site-library/lazyeval/libs/lazyeval.so
24K   /usr/local/lib/R/site-library/lubridate/libs/lubridate.so
8.0K  /usr/local/lib/R/site-library/mime/libs/mime.so
52K   /usr/local/lib/R/site-library/mnormt/libs/mnormt.so
84K   /usr/local/lib/R/site-library/openssl/libs/openssl.so
76K   /usr/local/lib/R/site-library/plyr/libs/plyr.so
32K   /usr/local/lib/R/site-library/purrr/libs/purrr.so
648K  /usr/local/lib/R/site-library/readr/libs/readr.so
400K  /usr/local/lib/R/site-library/readxl/libs/readxl.so
128K  /usr/local/lib/R/site-library/reshape2/libs/reshape2.so
56K   /usr/local/lib/R/site-library/rlang/libs/rlang.so
100K  /usr/local/lib/R/site-library/scales/libs/scales.so
496K  /usr/local/lib/R/site-library/stringi/libs/stringi.so
124K  /usr/local/lib/R/site-library/tibble/libs/tibble.so
164K  /usr/local/lib/R/site-library/tidyr/libs/tidyr.so
104K  /usr/local/lib/R/site-library/tidyselect/libs/tidyselect.so
344K  /usr/local/lib/R/site-library/xml2/libs/xml2.so
6.6M  total
root@de443801b3fc:/#

Down to six point six megabytes. Not bad for one command. The chart visualizes the respective reductions. Clearly, C++ packages (and their template use) lead to more debugging symbols than plain old C code. But once stripped, the size differences are not that large.

And just to be plain, what we showed previously in post #9 does the same, only already at installation stage. The effects are not cumulative.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


joyplot for GSEA result

Sun, 08/20/2017 - 14:51

(This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

I am very glad to find that someone figured out how to use ggjoy with ggtree.

I really love ggjoy and believe it can be a good tool to visualize gene set enrichment analysis (GSEA) results. DOSE/clusterProfiler support several visualization methods.

The original post illustrates these with figures of the running score, the enrichment map, the category-gene-network, and the dotplot.

These visualization methods are designed for visualizing/summarizing enrichment results, and none of them can incorporate expression values.

In DOSE v>=3.3.2, I defined a joyplot function, which can visualize GSEA result with expression distribution of the enriched categories.

Here is an example:

require(clusterProfiler)
data(geneList)
x <- gseKEGG(geneList)
joyplot(x)


DART: Dropout Regulation in Boosting Ensembles

Sun, 08/20/2017 - 07:50

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

The dropout approach developed by Hinton has been widely employed in the context of deep learning to prevent deep neural networks from over-fitting, as shown in https://statcompute.wordpress.com/2017/01/02/dropout-regularization-in-deep-neural-networks.

In the paper http://proceedings.mlr.press/v38/korlakaivinayak15.pdf, dropout is also proposed to address over-fitting in tree boosting ensembles, e.g. MART, caused by so-called “over-specialization”. In particular, while the first few trees added at the beginning of the ensemble dominate the model performance, the trees added later can only contribute to learning for a small subset of observations, which increases the risk of over-fitting. The idea of DART is to build an ensemble by randomly dropping boosting tree members. The percentage of dropouts determines the degree of regularization for the tree ensemble.

Below is a demonstration showing the implementation of DART in the R xgboost package. First of all, after importing the data, we divided it into two pieces, one for training and the other for testing.

pkgs <- c('pROC', 'xgboost')
lapply(pkgs, require, character.only = T)
df1 <- read.csv("Downloads/credit_count.txt")
df2 <- df1[df1$CARDHLDR == 1, ]
set.seed(2017)
n <- nrow(df2)
sample <- sample(seq(n), size = n / 2, replace = FALSE)
train <- df2[sample, -1]
test <- df2[-sample, -1]

For comparison purposes, we first developed a tree boosting ensemble without dropouts, as shown below. For simplicity, all parameters were chosen heuristically. The max_depth is set to 3 because boosting tends to work well with weak learners. While the ROC for the training set can be as high as 0.95, the ROC for the testing set is only 0.60 in our case, implying an over-fitting issue.

mart.parm <- list(booster = "gbtree", nthread = 4, eta = 0.1, max_depth = 3,
                  subsample = 1, eval_metric = "auc")
mart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = mart.parm,
                nrounds = 500, verbose = 0, seed = 2017)
pred1 <- predict(mart, as.matrix(train[, -1]))
pred2 <- predict(mart, as.matrix(test[, -1]))
roc(as.factor(train$DEFAULT), pred1)
# Area under the curve: 0.9459
roc(as.factor(test$DEFAULT), pred2)
# Area under the curve: 0.6046

With the same set of parameters, we refitted the ensemble with dropouts, i.e. DART. As shown below, by dropping 10% of tree members, the ROC for the testing set can be improved from 0.60 to 0.65. In addition, the performance disparity between training and testing sets with DART decreases significantly.

dart.parm <- list(booster = "dart", rate_drop = 0.1, nthread = 4, eta = 0.1, max_depth = 3,
                  subsample = 1, eval_metric = "auc")
dart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = dart.parm,
                nrounds = 500, verbose = 0, seed = 2017)
pred1 <- predict(dart, as.matrix(train[, -1]))
pred2 <- predict(dart, as.matrix(test[, -1]))
roc(as.factor(train$DEFAULT), pred1)
# Area under the curve: 0.7734
roc(as.factor(test$DEFAULT), pred2)
# Area under the curve: 0.6517

Besides rate_drop = 0.1, a wide range of dropout rates has also been tested. In most cases, DART outperforms its counterpart without the dropout regularization.
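
The exact grid of rates tested is not given in the post; a sweep along the following lines (reusing the train/test objects defined above, with an illustrative grid of my own choosing) would be one way to reproduce the comparison:

# illustrative grid of dropout rates, not the author's
rates <- c(0.05, 0.1, 0.2, 0.3, 0.5)
aucs <- sapply(rates, function(r) {
  parm <- list(booster = "dart", rate_drop = r, nthread = 4, eta = 0.1,
               max_depth = 3, subsample = 1, eval_metric = "auc")
  fit <- xgboost(data = as.matrix(train[, -1]), label = train[, 1],
                 params = parm, nrounds = 500, verbose = 0, seed = 2017)
  as.numeric(roc(as.factor(test$DEFAULT), predict(fit, as.matrix(test[, -1])))$auc)
})
data.frame(rate_drop = rates, test_auc = aucs)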


How to install wgrib2 in OSX

Sun, 08/20/2017 - 07:18

(This article was first published on R – Bovine Aerospace, and kindly contributed to R-bloggers)

Prompted by both my own struggles with wgrib2 compilation and a plea on the rNOMADS email listserv, I’m going to describe how to compile and install wgrib2 on Mac OS.

First of all, some background: wgrib2 is an excellent utility written by Wesley Ebisuzaki at NOAA.  It allows for a number of swift and stable operations on GRIB2 files (a common file format for weather and climate data).  It is also a requirement for using grib files in rNOMADS (the function ReadGrib() in particular).

So here’s how to install it on Mac OS.

  1.  Get Command Line Tools for Xcode (search for it on Duckduckgo or your engine of choice) and also make sure gcc is installed. If you don’t know what gcc is, stop now and find someone who does (it will save you a lot of time).
  2. Download wgrib2 here (note download links are pretty far down the page).
  3. Untar the tarball somewhere, and roll up your sleeves.  cd into the resulting wgrib directory.
  4.  In the makefile, uncomment the lines

    #export cc=gcc
    #export FC=gfortran
    Also search for makefile.darwin in the makefile and uncomment the line containing it.  You’ll see instructions to this effect in the makefile anyway.
  5. Now we have to edit the included libpng package, since it is untarred by the makefile and doesn’t inherit our compiler specifications in step 4. Ensuring that we’re in the wgrib directory:

    tar -xvf libpng-1.2.57.tar.gz
    cd libpng-1.2.57/scripts/

    now edit the makefile.darwin file, changing

    CC=cc

    to

    CC=gcc

    Now, return to the wgrib directory, and re-tar libpng!

    tar -cf libpng-1.2.57.tar libpng-1.2.57
    gzip libpng-1.2.57.tar

    If it asks you if you want to replace the original tar.gz file, say “yes”.  What we’ve done here is edited libpng to make sure it uses the right compiler.
  6. Finally, type make to build wgrib2.

If you are still having problems (for example, libaec complaining that there is no C compiler), make sure that all the compiler commands (gcc, cc, etc.) point to something other than clang (the default compiler that comes with OSX).  You may have to edit your bash_profile file to ensure this.
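
Once the build finishes and the binary has been copied somewhere on your PATH (e.g. /usr/local/bin, an assumption on my part, not a step from the post), a quick check from R confirms that rNOMADS will be able to find it:

# ReadGrib() in rNOMADS shells out to wgrib2, so the binary must be on the PATH
Sys.which("wgrib2")   # returns the full path if found, "" otherwise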

As always, contact me on the form below if you’re having unresolvable issues.



Is dplyr Easily Comprehensible?

Sun, 08/20/2017 - 04:24

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

dplyr is one of the most popular R packages. It is powerful and important. But is it in fact easily comprehensible?

dplyr makes sense to those of us who use it a lot. And we can teach part-time R users a lot of the common good use patterns.

But, is it an easy task to study and characterize dplyr itself?

Please take our advanced dplyr quiz to test your dplyr mettle.


“Pop dplyr quiz, hot-shot! There is data in a pipe. What does each verb do?”


simplyR

Sat, 08/19/2017 - 23:56

(This article was first published on Easy Guides, and kindly contributed to R-bloggers)

simplyR is a web space where we’ll be posting practical and easy guides for solving real, important problems using the R programming language.

As we aren’t fans of unnecessary complications, we’ll keep the content of our tutorials / R codes as simple as possible.

Many tutorials are coming soon.

Topics we love include:

  • R programming
  • Biostatistics
  • Genomic data analysis
  • Survival analysis
  • Machine/statistical learning
  • Data visualization

Samples of our recent publications on R & Data Science are listed in the original post.

If you want to contribute, read this: http://www.sthda.com/english/pages/contribute-to-sthda


Zillow Rent Analysis

Sat, 08/19/2017 - 17:08

(This article was first published on R – Journey of Analytics, and kindly contributed to R-bloggers)

Hello Readers,

This is a notification post – Did you realize our website has moved? The blog is live at New JA Blog under the domain http://www.journeyofanalytics.com . You can read about the rent analysis post here.

If you received this post AND an email from anu_analytics, then please disregard this post.

If you received this post update from WordPress, but did NOT receive an email from anu_analytics (via MailChimp email) then please send us an email at anuprv@journeyofanalytics.com . The email from the main site was sent out 4 hours ago. Alternatively, you can sign up using this contact form.

(Email screenshot below)

JourneyofAnalytics email newsletter (screenshot)

Again, the latest blogposts are available at blog.journeyofanalytics.com and all code/project files are available under the Projects page.

See you at our new site. Happy Coding!

 

Filed under: learning resources, Project Updates, R Tagged: inferential statistics, journeyofanalytics, new site, rent analysis


More things with the New Zealand Election Study by @ellis2013nz

Sat, 08/19/2017 - 14:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

A new cross tab tool

I recently put up a simple web app, built with R Shiny, to let users explore the relationship between party vote in the 2014 New Zealand general election and a range of demographic and attitudinal questions in the New Zealand Election Study. The image below is a link to the web app:

The original motivation was to answer a question on Twitter for a breakdown of National party vote by gender. I was surprised how interesting I found the resulting tool though. Without a fancy graphic, just a table of numbers, there’s a lot to play around with here. I deliberately kept the functionality narrow, because I wanted to avoid a bewildering array of choices and confusing user interface, so it tries to do only one thing and does it well. The thing it does is show cross tabs of party vote with other variables from the study.

The source code of the Shiny app is available as is the preparation script but they’re quite unremarkable so I won’t reproduce them here; follow the links and read them on GitHub in their natural habitat.

A few interesting statistical points to note:

  • I produced a new set of weights so population totals would match the actual party vote. Even after the NZES team did their weighting, the sample wasn’t representative of the population of people on the electoral roll in terms of actual party vote. In fact, people who did not vote were particularly over-represented. This isn’t that surprising – people who don’t vote for whatever reason (whether it is apathy or being out of the country and busy with other things) are probably also disproportionately likely not to respond to surveys. The web app gives the user the choice of the original NZES weights or my re-calibrated ones, defaulting to the latter. I think that’s useful because people might use the app to say “X thousand voters for party Y have Z attitude”, so adding up to the right number of voters by party is important.
  • I included an option to see the Pearson residuals, which compare the observed cell count with what would have been expected under the (nearly always implausible) null hypothesis of no relationship at all between the two variables making up the cross tab. I think this is by far the best way to look at which cells of the table are unusual; a minimal sketch of the calculation appears after this list. For example, in the screenshot above, it is highlighted clearly that National voters had unusually strong levels of agreement with the statement “with lower welfare benefits people would learn to stand on their own two feet”, whereas voters for Labour and the Greens were unusually unlikely to agree with that statement (and likely to disagree). This won’t be a surprise for any watchers of New Zealand politics.
  • One version I tested had a little Chi square test of the null hypothesis of no relationship between the two. But it was nearly always returning a p value of zero, because of course there is a relationship between these variables. I decided it was uninteresting, and didn’t want to focus people on null hypothesis testing anyway, so left it out.
  • I resisted the urge to put multiple numbers in each cell of the table, as is done in some stats package output (eg SPSS). I think tables like these work as visualisations if the eye can sweep across, knowing that every number in the table is somehow comparable. This isn’t possible when you combine values in each cell (eg include both row-wise and column-wise percentages).
  • It was interesting to think through what should be the default way of calculating percentages in a table like this. I decided in the end to default to row-wise, which means the user is reading (for example) “Of the people who voted X, what percentage thought Y?” I don’t think there’s a right or wrong, just a contingent guess that this is most likely to be the first want of people.
  • An early version of the tool had an option for “margin of error” for each cell, and this drew my attention to the difficulty of conceptualising the margin of error in a single cell of a contingency table. I’m going to think more about this one.
  • Adding the heatmap colour was a late addition, made easy by the wonder of the easy combination of datatable JavaScript with R via the DT package.
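
The Pearson-residual idea mentioned in the list above can be illustrated with a minimal sketch on any built-in cross tab (illustrative data only, not the NZES):

# chisq.test() returns (observed - expected) / sqrt(expected) in $residuals
tab <- table(mtcars$cyl, mtcars$gear)
round(chisq.test(tab)$residuals, 2)   # cells far from 0 are the "unusual" ones
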
More stuff using the New Zealand Election Study

So I now have two web apps with this data:

… and six blog posts. To recap, here’s all the blog posts I’ve done so far with this data:

1. Attitudes to the “Dirty Politics” book

In my first post on the data, I did a quick demo analysis of the attitudes of voters for various parties to Nicky Hager’s book “Dirty Politics”

2. Modelling individual level party vote

I did some reasonably comprehensive modelling of who votes for whom. The main work here was deciding how many degrees of freedom could be spared for the various demographic variables, and clumping/tidying them up into analysis-ready form. This was also a good opportunity for some thinking about modelling strategy, the role of the bootstrap, and multiple imputation which is essential with this sort of problem.

3. Web app for individual vote

This led to my first web app with the New Zealand Election Study data, which lets you explore the predicted probability of different types of people voting for different parties.

4. Sankey chart of ‘transitions’ from 2011 vote to 2014

This was an interesting experiment in looking at what one survey can tell us about people swapping from party to party:

5. Modelling voter turnout

I adapted my approach of modelling party vote to the perhaps even more important question of who turns out to vote at all.

6. Cross tab tool

The sixth blog post is today’s.

For New Zealand readers, have a good final five weeks up to the 2017 election!


Obstacles to performance in parallel programming

Fri, 08/18/2017 - 21:18

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Making your code run faster is often the primary goal when using parallel programming techniques in R, but sometimes the effort of converting your code to use a parallel framework leads only to disappointment, at least initially. Norman Matloff, author of Parallel Computing for Data Science: With Examples in R, C++ and CUDA, has shared chapter 2 of that book online, and it describes some of the issues that can lead to poor performance. They include:

  • Communications overhead, particularly an issue with fine-grained parallelism consisting of a very large number of relatively small tasks (a tiny timing sketch follows this list);
  • Load balance, where the computing resources aren't contributing equally to the problem;
  • Impacts from use of RAM and virtual memory, such as cache misses and page faults;
  • Network effects, such as latency and bandwidth, that impact performance and communication overhead;
  • Interprocess conflicts and thread scheduling; 
  • Data access and other I/O considerations.
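
A tiny illustration of the communications-overhead point (my own sketch, not taken from the book or the post): with very many small tasks, the parallel version can easily be slower than plain lapply() because of per-chunk serialisation and data-transfer costs.

library(parallel)
cl <- makeCluster(2)
x <- as.list(1:1e5)
system.time(lapply(x, sqrt))          # sequential: trivial per-task cost
system.time(parLapply(cl, x, sqrt))   # parallel: often slower here, overhead dominates
stopCluster(cl)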

The chapter is well worth a read for anyone writing parallel code in R (or indeed any programming language). It's also worth checking out Norm Matloff's keynote from the useR!2017 conference, embedded below.

Norm Matloff: Understanding overhead issues in parallel computation


Starting a Rmarkdown Blog with Bookdown + Hugo + Github

Fri, 08/18/2017 - 19:54

(This article was first published on R – Tales of R, and kindly contributed to R-bloggers)

Finally, after 24 hours of failed attempts, I got my favourite Hugo theme up and running with RStudio and Blogdown.

All the steps I followed are detailed in my new Blogdown entry, which is also a GitHub repo.

After exploring some alternatives, like Shirin’s (with Jekyll) and Amber Thomas’s advice (which involved Git skills beyond my basic abilities), I was able to install Yihui’s hugo-lithium-theme in a new repository.

However, I wanted to explore other blog templates hosted on GitHub (the candidate themes are linked in the original post).

The first three themes are currently linked in the blogdown documentation as being the most simple and easy to set up for inexperienced blog programmers, but I hope the list will grow in the following months. For those who are willing to experiment, the complete list is here.

Finally I chose the hugo-tranquilpeak theme, by Thibaud Leprêtre, for which I mostly followed Tyler Clavelle’s entry on the topic. This approach turned out to be easy and good, given some conditions:

  • Contrary to Yihui Xie’s advice, I chose github.io to host my blog, instead of Netlify (I love my desktop integration with GitHub, so it was interesting for me not to move to another service for my static content).
  • In my machine, I installed Blogdown & Hugo using R studio (v 1.1.336).
  • On GitHub, it was easier for me to host the blog directly in my main GitHub pages repository (always named [USERNAME].github.io), in the master branch, following Tyler’s tutorial.
  • I have basic knowledge of HTML, CSS and JavaScript, so I didn’t mind tinkering around with the theme.
  • My custom styles didn’t involve theme rebuilding. At this moment they’re simple cosmetic tricks.

The steps I followed were:

Git & GitHub repos
  • Setting a GitHub repo with the name [USERNAME].github.io (in my case aurora-mareviv.github.io). See this and this.
  • Create a git repo in your machine:
    • Create manually a new directory called [USERNAME].github.io.
    • Run in the terminal (Windows users have to install git first):
    cd /Git/[USERNAME].github.io  # your path may be different
    git init  # initiates repo in the directory
    git remote add origin https://github.com/[USERNAME]/[USERNAME].github.io  # connects the local git repo to the remote GitHub repo
    git pull origin master  # in case you have LICENSE and Readme.md files in the GitHub repo, they're downloaded
  • For now, your repo is ready. We will now focus on creating & customising our Blogdown.
RStudio and blogdown
  • We will open RStudio (v 1.1.336, development version as of today).
    • First, you may need to install Blogdown in R:
    install.packages("blogdown")
    • In RStudio, select the Menu > File > New Project, following the lower half of these instructions. The wizard for setting up a Hugo Blogdown project may not yet be available in your RStudio version (probably not for much longer); a command-line alternative is sketched below.
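
If the wizard is missing, a roughly equivalent route from the R console is the following sketch (the theme's GitHub slug is my assumption, not taken from the post):

# install.packages("blogdown")   # if not yet installed
blogdown::install_hugo()
blogdown::new_site(theme = "kakawait/hugo-tranquilpeak-theme")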

Creating new Project

Selecting Hugo Blogdown format

Selecting Hugo Blogdown theme

 

config.toml file appears

Customising paths and styles

Before we build and serve our site, we need to tweak a couple of things in advance, if we want to smoothly deploy our blog into GitHub pages.

Modify config.toml file

To integrate with GitHub pages, these are the essential modifications at the top of our config.toml file:

  • We need to set up the base URL to the “root” of the web page (https://[USERNAME].github.io/ in this case)
  • By default, the web page is published in the “public” folder. We need it to be published in the root of the repository, to match the structure of the GitHub master branch:
baseurl = "/"
publishDir = "."
  • Other useful global settings:
ignoreFiles = ["\\.Rmd$", "\\.Rmarkdown$", "_files$", "_cache$"]
enableEmoji = true

Images & styling paths

We can revisit the config.toml file to make changes to the default settings.

The logo that appears in the corner must be in the root folder. To modify it in the config.toml:

picture = "logo.png" # the path to the logo

The cover (background) image must be located in /themes/hugo-tranquilpeak-theme/static/images . To modify it in the config.toml:

coverImage = "myimage.jpg"

We want some custom CSS and JS. We need to locate them in /static/css and in /static/js respectively.

# Custom CSS. Put here your custom CSS files. They are loaded after the theme CSS;
# they have to be referred from static root. Example
customCSS = ["css/my-style.css"]

# Custom JS. Put here your custom JS files. They are loaded after the theme JS;
# they have to be referred from static root. Example
customJS = ["js/myjs.js"]

Custom css

We can add arbitrary classes to our css file (see above).

I started writing in Bootstrap, and I miss it a lot. Since this theme already has Bootstrap classes, I brought in some others I didn’t find in the theme (they’re available for .md files, but currently not for .Rmd).

Here is my custom css file to date:

/* @import url('https://maxcdn.bootstrapcdn.com/bootswatch/3.3.7/cosmo/bootstrap.min.css'); may conflict with default theme*/
@import url('https://fonts.googleapis.com/icon?family=Material+Icons'); /*google icons*/
@import url('https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css'); /*font awesome icons*/
.input-lg { font-size: 30px; }
.input { font-size: 20px; }
.font-sm { font-size: 0.7em; }
.texttt { font-family: monospace; }
.alert { padding: 15px; margin-bottom: 20px; border: 1px solid transparent; border-radius: 4px; }
.alert-success { color: #3c763d; background-color: #dff0d8; border-color: #d6e9c6; }
.alert-danger, .alert-error { color: #b94a48; background-color: #f2dede; border-color: #eed3d7; }
.alert-info { color: #3a87ad; background-color: #d9edf7; border-color: #bce8f1; }
.alert-gray { background-color: #f2f3f2; border-color: #f2f3f2; }
/*style for printing*/
@media print { .noPrint { display:none; } }
/*link formatting*/
a:link { color: #478ca7; text-decoration: none; }
a:visited { color: #478ca7; text-decoration: none; }
a:hover { color: #82b5c9; text-decoration: none; }

Also, we have font-awesome icons!

Site build with blogdown

Once we have ready our theme, we can add some content, modifying or deleting the various examples we will find in /content/post .

We need to make use of Blogdown & Hugo to compile our .Rmd file and create our html post:

blogdown::build_site()
blogdown::serve_site()

In the viewer, at the right side of the IDE you can examine the resulting html and see if something didn’t go OK.

Deploying the site

Updating the local git repository

This can be done with simple git commands:

cd /Git/[USERNAME].github.io  # your path to the repo may be different
git add .  # indexes all files that will be added to the local repo
git commit -m "Starting my Hugo blog"  # adds all files to the local repo, with a commit message

Pushing to GitHub

git push origin master  # we push the changes from the local git repo to the remote repo (GitHub repo)

Just go to the page https://[USERNAME].github.io and enjoy your blog!

R code

Works just the same as in Rmarkdown. R code is compiled into an html file and published as static web content in a few steps. Welcome to the era of reproducible blogging!

Figure 1 uses the ggplot2 library:

library(ggplot2)
ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point()

Figure 1: diamonds plot with ggplot2.

Rmd source code

You can download it from here

I, for one, welcome the new era of reproducible blogging!


ggvis Exercises (Part-2)

Fri, 08/18/2017 - 18:26

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

INTRODUCTION

The ggvis package is used to make interactive data visualizations. The fact that it combines shiny’s reactive programming model and dplyr’s grammar of data transformation makes it a useful tool for data scientists.

This package allows us to implement features like interactivity, but on the other hand every interactive ggvis plot must be connected to a running R session.
Before proceeding, please follow our short tutorial.

Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R, without looking at the answers. Finally, check the solutions to verify your answers.

Exercise 1

Create a list which will include the variables “Horsepower” and “MPG.city” of the “Cars93” data set and make a scatterplot. HINT: Use ggvis() and layer_points().

Exercise 2

Add a slider to the scatterplot of Exercise 1 that sets the point size from 10 to 100. HINT: Use input_slider().
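
One possible sketch for Exercises 1 and 2 together (an illustration, not the official solution; Cars93 is assumed to come from the MASS package):

library(ggvis)
library(MASS)   # provides the Cars93 data set
Cars93 %>%
  ggvis(~Horsepower, ~MPG.city, size := input_slider(10, 100)) %>%
  layer_points()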

Learn more about using ggvis in the online course R: Complete Data Visualization Solutions. In this course you will learn how to:

  • Work extensively with the ggvis package and its functionality
  • Learn what visualizations exist for your specific use case
  • And much more

Exercise 3

Add a slider to the scatterplot of Exercise 1 that sets the point opacity from 0 to 1. HINT: Use input_slider().

Exercise 4

Create a histogram of the variable “Horsepower” of the “Cars93” data set. HINT: Use layer_histograms().

Exercise 5

Set the width and the center of the histogram bins you just created to 10.

Exercise 6

Add 2 sliders to the histogram you just created, one for width and the other for center with values from 0 to 10 and set the step to 1. HINT: Use input_slider().

Exercise 7

Add the labels “Width” and “Center” to the two sliders respectively. HINT: Use label.

Exercise 8

Create a scatterplot of the variables “Horsepower” and “MPG.city” of the “Cars93” dataset with size = 10 and opacity = 0.5.

Exercise 9

Add to the scatterplot you just created a function which will set the size with the left and right keyboard controls. HINT: Use left_right().

Exercise 10

Add interactivity to the scatterplot you just created using a function that shows the value of the “Horsepower” when you “mouseover” a certain point. HINT: Use add_tooltip().
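
A possible sketch covering Exercises 8 to 10 together (illustrative only, not the official answers; the left_right() range and step are arbitrary choices of mine):

library(ggvis)
library(MASS)
Cars93 %>%
  ggvis(~Horsepower, ~MPG.city,
        size := left_right(10, 1000, step = 50),   # arrow keys control point size
        opacity := 0.5) %>%
  layer_points() %>%
  add_tooltip(function(df) paste0("Horsepower: ", df$Horsepower), "hover")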

Related exercise sets:
  1. How to create interactive data visualizations with ggvis
  2. ggvis Exercises (Part-1)
  3. How to create visualizations with iPlots package in R
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

GoTr – R wrapper for An API of Ice And Fire

Fri, 08/18/2017 - 15:53

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Ava Yang

It’s Game of Thrones time again as the battle for Westeros is heating up. There are tons of ideas, ingredients and interesting analyses out there and I was craving for my own flavour. So step zero, where is the data?

Jenny Bryan’s purrr tutorial introduced the list got_chars, representing characters information from the first five books, which seems not much fun beyond exercising list manipulation muscle. However, it led me to an API of Ice and Fire, the world’s greatest source for quantified and structured data from the universe of Ice and Fire including the HBO series Game of Thrones. I decided to create my own API functions, or better, an R package (inspired by the famous rwar package).

The API resources cover 3 types of endpoint – Books, Characters and Houses. GoTr pulls data in JSON format and parses them to R list objects. httr’s Best practices for writing an API package by Hadley Wickham is another life saver.

The package contains:

  • One function, got_api()
  • Two ways to specify parameters generally, i.e. endpoint type + id, or url
  • Three endpoint types

## Install GoTr from github
# devtools::install_github("MangoTheCat/GoTr")
library(GoTr)
library(tidyverse)
library(listviewer)

# Retrieve books id 5
books_5 <- got_api(type = "books", id = 5)
# Retrieve characters id 583
characters_583 <- got_api(type = "characters", id = 583)
# Retrieve houses id 378
house_378 <- got_api(type = "houses", id = 378)
# Retrieve pov characters data in book 5
povData <- books_5$povCharacters %>%
  flatten_chr() %>%
  map(function(x) got_api(url = x))

# Helpful functions to check structure of list object
length(books_5)
## [1] 11
names(books_5)
##  [1] "url"           "name"          "isbn"          "authors"
##  [5] "numberOfPages" "publisher"     "country"       "mediaType"
##  [9] "released"      "characters"    "povCharacters"
names(house_378)
##  [1] "url"              "name"             "region"
##  [4] "coatOfArms"       "words"            "titles"
##  [7] "seats"            "currentLord"      "heir"
## [10] "overlord"         "founded"          "founder"
## [13] "diedOut"          "ancestralWeapons" "cadetBranches"
## [16] "swornMembers"
str(characters_583, max.level = 1)
## List of 16
##  $ url        : chr "https://anapioficeandfire.com/api/characters/583"
##  $ name       : chr "Jon Snow"
##  $ gender     : chr "Male"
##  $ culture    : chr "Northmen"
##  $ born       : chr "In 283 AC"
##  $ died       : chr ""
##  $ titles     :List of 1
##  $ aliases    :List of 8
##  $ father     : chr ""
##  $ mother     : chr ""
##  $ spouse     : chr ""
##  $ allegiances:List of 1
##  $ books      :List of 1
##  $ povBooks   :List of 4
##  $ tvSeries   :List of 6
##  $ playedBy   :List of 1
map_chr(povData, "name")
##  [1] "Aeron Greyjoy"     "Arianne Martell"   "Arya Stark"
##  [4] "Arys Oakheart"     "Asha Greyjoy"      "Brienne of Tarth"
##  [7] "Cersei Lannister"  "Jaime Lannister"   "Samwell Tarly"
## [10] "Sansa Stark"       "Victarion Greyjoy" "Areo Hotah"
# listviewer::jsonedit(povData)

Another powerful parameter is query which allows filtering by specific attribute such as the name of a character, pagination and so on.

It’s worth knowing about pagination. The first simple request will render a list of 10 elements, since the default number of items per page is 10. The maximum valid pageSize is 50, i.e. if 567 is passed on to it, you still get 50 characters.

# Retrieve character by name
Arya_Stark <- got_api(type = "characters", query = list(name = "Arya Stark"))
# Retrieve characters on page 3, change page size to 20.
characters_page_3 <- got_api(type = "characters", query = list(page = "3", pageSize = "20"))

So how do we get ALL books, characters or houses information? The package does not provide the function directly but here’s an implementation.

# Retrieve all books
booksAll <- got_api(type = "books", query = list(pageSize = "20"))
# Extract names of all books
map_chr(booksAll, "name")
##  [1] "A Game of Thrones"              "A Clash of Kings"
##  [3] "A Storm of Swords"              "The Hedge Knight"
##  [5] "A Feast for Crows"              "The Sworn Sword"
##  [7] "The Mystery Knight"             "A Dance with Dragons"
##  [9] "The Princess and the Queen"     "The Rogue Prince"
## [11] "The World of Ice and Fire"      "A Knight of the Seven Kingdoms"
# Retrieve all houses
houses <- 1:9 %>%
  map(function(x) got_api(type = "houses", query = list(page = x, pageSize = "50"))) %>%
  unlist(recursive = FALSE)
map_chr(houses, "name") %>% length()
## [1] 444
map_df(houses, `[`, c("name", "region")) %>% head()
## # A tibble: 6 x 2
##   name                        region
## 1 House Algood                The Westerlands
## 2 House Allyrion of Godsgrace Dorne
## 3 House Amber                 The North
## 4 House Ambrose               The Reach
## 5 House Appleton of Appleton  The Reach
## 6 House Arryn of Gulltown     The Vale

The houses list is a starting point for a social network analysis: mirror, mirror, tell me, who are the most influential houses in the Seven Kingdoms? Stay tuned, for that is the topic of the next blogpost.

Thanks to all open resources. Please comment, fork, issue, star the work-in-progress on our GitHub repository.


Estimating Gini coefficient when we only have mean income by decile by @ellis2013nz

Fri, 08/18/2017 - 14:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Income inequality data

Ideally the Gini coefficient to estimate inequality is based on original household survey data with hundreds or thousands of data points. Often this data isn’t available due to access restrictions from privacy or other concerns, and all that is published is some kind of aggregate measure. Some aggregations include the income at the 80th percentile divided by that at the 20th (or 90 and 10); the number of people at the top of the distribution whose combined income is equal to that of everyone else; or the income of the top 1% as a percentage of all income. I wrote a little more about this in one of my earliest blog posts.

One way aggregated data are sometimes presented is as the mean income in each decile or quintile. This is not the same as the actual quantile values themselves, which are the boundary between categories. The 20th percentile is the value of the 20/100th person when they are lined up in increasing order, whereas the mean income of the first quintile is the mean of all the incomes of a “bin” of everyone from 0/100 to 20/100, when lined up in order.

To explore estimating Gini coefficients from this type of binned data I used data from the wonderful Lakner-Milanovic World Panel Income Distribution database, which is available for free download. This useful collection contains the mean income by decile bin of many countries from 1985 onwards – the result of some careful and doubtless very tedious work with household surveys from around the world. This is an amazing dataset, and amongst other purposes it can be used (as Milanovic and co-authors have pioneered dating back to his World Bank days) in combination with population numbers to estimate “global inequality”, treating everyone on the planet as part of a single economic community regardless of national boundaries. But that’s for another day.

Here’s R code to download the data (in Stata format) and grab the first ten values, which happen to represent Angola in 1995. These particular data are based on consumption, which in poorer economies is often more sensible to measure than income:

library(tidyverse)
library(scales)
library(foreign)  # for importing Stata files
library(actuar)   # for access to the Burr distribution
library(acid)
library(forcats)

# Data described at https://www.gc.cuny.edu/CUNY_GC/media/LISCenter/brankoData/LaknerMilanovic2013WorldPanelIncomeDistributionLMWPIDDescription.pdf
# The database has been created specifically for the
# paper “Global Income Distribution: From the Fall of the Berlin Wall to the Great Recession”,
# World Bank Policy Research Working Paper No. 6719, December 2013, published also in World
# Bank Economic Review (electronically available from 12 August 2015).

download.file("https://wfs.gc.cuny.edu/njohnson/www/BrankoData/LM_WPID_web.dta",
              mode = "wb", destfile = "LM_WPID_web.dta")

wpid <- read.dta("LM_WPID_web.dta")

# inc_con  whether survey is income or consumption
# group    income decile group 1 to 10
# RRinc    is average per capita income of the decile in 2005 PPP

# the first 10 rows are Angola in 1995, so let's experiment with them
angola <- wpid[1:10, c("RRinc", "group")]

Here are the resulting 10 numbers:

And this is the Lorenz curve:

Those graphics were drawn with this code:

ggplot(angola, aes(x = group, y = RRinc)) +
  geom_line() +
  geom_point() +
  ggtitle("Mean consumption by decile in Angola in 1995") +
  scale_y_continuous("Annual consumption for each decile group", label = dollar) +
  scale_x_continuous("Decile group", breaks = 1:10) +
  labs(caption = "Source: Lakner/Milanovic World Panel Income Distribution data") +
  theme(panel.grid.minor = element_blank())

angola %>%
  arrange(group) %>%
  mutate(cum_inc_prop = cumsum(RRinc) / sum(RRinc),
         pop_prop = group / max(group)) %>%
  ggplot(aes(x = pop_prop, y = cum_inc_prop)) +
  geom_line() +
  geom_ribbon(aes(ymax = pop_prop, ymin = cum_inc_prop), fill = "steelblue", alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, colour = "steelblue") +
  labs(x = "Cumulative proportion of population",
       y = "Cumulative proportion of consumption",
       caption = "Source: Lakner/Milanovic World Panel Income Distribution data") +
  ggtitle("Inequality in Angola in 1995",
          "Lorenz curve based on binned decile mean consumption")

Calculating Gini directly from deciles?

Now, I could just treat these 10 deciles as a sample of 10 representative people (each observation after all represents exactly 10% of the population) and calculate the Gini coefficient directly. But my hunch was that this would underestimate inequality, because the straight lines in the Lorenz curve above are a simplification of the real, more curved, reality.

To investigate this issue, I started by creating a known population of 10,000 income observations from a Burr distribution, which is a flexible, continuous non-negative distribution often used to model income. That looks like this:

population <- rburr(10000, 1, 3, 3)

par(bty = "l", font.main = 1)
plot(density(population), main = "Burr(1,3,3) distribution")

Then I divided the data up into between 2 and 100 bins, took the means of the bins, and calculated the Gini coefficient of the bins. Doing this for 10 bins is the equivalent of calculating a Gini coefficient directly from decile data such as in the Lakner-Milanovic dataset. I got this result, which shows that when you have the means of 10 bins, you underestimate inequality slightly:

Here’s the code for that little simulation. I make myself a little function to bin data and return the mean values of the bins in a tidy data frame, which I’ll need for later use too:

#' Quantile averages
#'
#' Mean value in binned groups
#'
#' @param y numeric vector to provide summary statistics on
#' @param len number of bins to calculate means for
#' @details this is different from producing the actual quantiles; it is the mean value of y within each bin.
bin_avs <- function(y, len = 10){
   # argument checks:
   if(class(y) != "numeric"){stop("y should be numeric")}
   if(length(y) < len){stop("y should be longer than len")}
   # calculation:
   y <- sort(y)
   bins <- cut(y, breaks = quantile(y, probs = seq(0, 1, length.out = len + 1)))
   tmp <- data.frame(bin_number = 1:len,
                     bin_breaks = levels(bins),
                     mean = as.numeric(tapply(y, bins, mean)))
   return(tmp)
}

ginis <- numeric(99)
for(i in 1:99){
   ginis[i] <- weighted.gini(bin_avs(population, len = i + 1)$mean)$Gini
}

ginis_df <- data.frame(
   number_bins = 2:100,
   gini = ginis
)

ginis_df %>%
   mutate(label = ifelse(number_bins < 11 | round(number_bins / 10) == number_bins / 10, number_bins, "")) %>%
   ggplot(aes(x = number_bins, y = gini)) +
   geom_line(colour = "steelblue") +
   geom_text(aes(label = label)) +
   labs(x = "Number of bins", y = "Gini coefficient estimated from means within bins") +
   ggtitle("Estimating Gini coefficient from binned mean values of a Burr distribution population",
           paste0("Correct Gini is ", round(weighted.gini(population)$Gini, 3),
                  ". Around 25 bins needed for a really good estimate."))

A better method for Gini from deciles?

Maybe I should have stopped there; after all, there is hardly any difference between 0.32 and 0.34, probably much less than the sampling error from the survey. But I wanted to explore whether there was a better way. The method I chose was to:

  • choose a log-normal distribution that would generate (close to) the 10 decile averages we have;
  • simulate individual-level data from that distribution; and
  • estimate the Gini coefficient from that simulated data.

I also tried this with a Burr distribution, but the results were very unstable. The log-normal approach was quite good at generating data whose 10 bin means were very similar to the original data, and gave plausible values of the Gini coefficient just slightly higher than when calculated directly from the bins’ means.

Here’s how I did that:

# algorithm will be iterative
# 1. assume the 10 binned means represent the following quantiles: 0.05, 0.15, 0.25 ... 0.65, 0.75, 0.85, 0.95
# 2. pick the best lognormal distribution that fits those 10 quantile values.
#    Treat as a non-linear optimisation problem and solve with `optim()`.
# 3. generate data from that distribution and work out what the actual quantiles are
# 4. repeat 2, with these actual quantiles

n <- 10000
x <- angola$RRinc

fn2 <- function(params) {
   sum((x - qlnorm(p, params[1], params[2])) ^ 2 / x)
}

p <- seq(0.05, 0.95, length.out = 10)

fit2 <- optim(c(1,1), fn2)
simulated2 <- rlnorm(n, fit2$par[1], fit2$par[2])
p <- plnorm(bin_avs(simulated2)$mean, fit2$par[1], fit2$par[2])
fit2 <- optim(c(1,1), fn2)
simulated2 <- rlnorm(n, fit2$par[1], fit2$par[2])

And here are the results. The first table shows the means of the bins in my simulated log-normal population (mean) compared to the original data for Angola’s actual deciles in 1995 (x). The next two values, 0.415 and 0.402, are the Gini coefficients from the simulated and original data respectively:

> cbind(bin_avs(simulated2), x)
   bin_number         bin_breaks      mean    x
1           1         (40.6,222]  165.0098  174
2           2          (222,308]  266.9120  287
3           3          (308,393]  350.3674  373
4           4          (393,487]  438.9447  450
5           5          (487,589]  536.5547  538
6           6          (589,719]  650.7210  653
7           7          (719,887]  795.9326  785
8           8     (887,1.13e+03] 1000.8614  967
9           9 (1.13e+03,1.6e+03] 1328.3872 1303
10         10  (1.6e+03,1.3e+04] 2438.4041 2528

> weighted.gini(simulated2)$Gini
          [,1]
[1,] 0.4145321

> # compare to the value estimated directly from the data:
> weighted.gini(x)$Gini
         [,1]
[1,] 0.401936

As would be expected from my earlier simulation, the Gini coefficient from the estimated underlying log-normal distribution is very slightly higher than that calculated directly from the means of the decile bins.

Applying this method to the Lakner-Milanovic inequality data

I rolled up this approach into a function to convert means of deciles into Gini coefficients and applied it to all the countries and years in the World Panel Income Distribution data. Here are the results, first over time:

… and then as a snapshot:

Neither of these is great as a polished data visualisation, but it’s difficult data to present in a static snapshot, and will do for these illustrative purposes.

Here’s the code for that function (which depends on the previously defined bin_avs()) and for drawing those charts. Thanks to the convenience of Hadley Wickham’s dplyr and ggplot2, it’s easy to do this on the fly; in the code below I calculate the Gini coefficients twice, once for each chart. Technically this is wasteful, but with modern computers it isn’t a big deal, even though there is quite a bit of computationally intensive work going on under the hood; the code below only takes a minute or so to run.

#' Convert data that is means of deciles into a Gini coefficient
#'
#' @param x vector of 10 numbers, representing mean income (or whatever) for 10 deciles
#' @param n number of simulated values of the underlying log-normal distribution to generate
#' @details returns an estimate of Gini coefficient that is less biased than calculating it
#' directly from the deciles, which would be slightly biased downwards.
deciles_to_gini <- function(x, n = 1000){
   fn <- function(params) {
      sum((x - qlnorm(p, params[1], params[2])) ^ 2 / x)
   }

   # starting estimate of p based on binned means and parameters
   p <- seq(0.05, 0.95, length.out = 10)
   fit <- optim(c(1,1), fn)

   # calculate a better value of p
   simulated <- rlnorm(n, fit$par[1], fit$par[2])
   p <- plnorm(bin_avs(simulated)$mean, fit$par[1], fit$par[2])

   # new fit with the better p
   fit <- optim(c(1,1), fn)
   simulated <- rlnorm(n, fit$par[1], fit$par[2])

   output <- list(par = fit$par)
   output$Gini <- as.numeric(weighted.gini(simulated)$Gini)
   return(output)
}

# example usage:
deciles_to_gini(x = wpid[61:70, ]$RRinc)
deciles_to_gini(x = wpid[171:180, ]$RRinc)

# draw some graphs:
wpid %>%
   filter(country != "Switzerland") %>%
   mutate(inc_con = ifelse(inc_con == "C", "Consumption", "Income")) %>%
   group_by(region, country, contcod, year, inc_con) %>%
   summarise(Gini = deciles_to_gini(RRinc)$Gini) %>%
   ungroup() %>%
   ggplot(aes(x = year, y = Gini, colour = contcod, linetype = inc_con)) +
   geom_point() +
   geom_line() +
   facet_wrap(~region) +
   guides(colour = FALSE) +
   ggtitle("Inequality over time",
           "Gini coefficients estimated from decile data") +
   labs(x = "", linetype = "",
        caption = "Source: Lakner/Milanovic World Panel Income Distribution data")

wpid %>%
   filter(country != "Switzerland") %>%
   mutate(inc_con = ifelse(inc_con == "C", "Consumption", "Income")) %>%
   group_by(region, country, contcod, year, inc_con) %>%
   summarise(Gini = deciles_to_gini(RRinc)$Gini) %>%
   ungroup() %>%
   group_by(country) %>%
   filter(year == max(year)) %>%
   ungroup() %>%
   mutate(country = fct_reorder(country, Gini),
          region = fct_lump(region, 5)) %>%
   ggplot(aes(x = Gini, y = country, colour = inc_con, label = contcod)) +
   geom_text(size = 2) +
   facet_wrap(~region, scales = "free_y", nrow = 2) +
   labs(colour = "", y = "", x = "Gini coefficient",
        caption = "Source: Lakner-Milanovic World Panel Income Distribution") +
   ggtitle("Inequality by country",
           "Most recent year available up to 2008; Gini coefficients are estimated from decile mean income.")

There we go – deciles to Gini fun with world inequality data!

# cleanup
unlink("LM_WPID_web.dta")


Oil leakage… those old BMW’s are bad :-)

Fri, 08/18/2017 - 11:42

(This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers)

Introduction

My first car was a 13-year-old Mitsubishi Colt, for which I paid 3000 Dutch Guilders. I can still remember a friend who did not want me to park it in front of his house because of possible oil leakage.

Can you get an idea of which cars are likely to leak oil? Well, with open car data from the Dutch RDW you can. RDW is the Netherlands Vehicle Authority in the mobility chain.

RDW Data

There are many data sets that you can download. I have used the following:

  • Observed Defects. This set contains 22 mln. records on observed defects at car level (license plate number). Cars in The Netherlands have to be checked yearly, and the findings of each check are submitted to RDW.
  • Basic car details. This set contains 9 mln. records: all the cars in the Netherlands, with license plate number, brand, make, weight and type of car.
  • Defects code. This little table provides a description of all the possible defect codes. So I know that code ‘RA02’ in the observed defects data set represents ‘oil leakage’.
Simple Analysis in R

I imported the data into R and, with some simple dplyr statements, determined per car make and age (in years) the number of cars with an observed oil leakage defect. Then I determined how many cars there are per make and age; dividing those two numbers gives a so-called oil leak percentage.

 

For example, there are 2043 four-year-old Opel Astras in the Netherlands, three of which have an observed oil leak, so we have an oil leak percentage of 0.15%.
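A minimal dplyr sketch of that calculation (added here for illustration; it is not the author's code, which is linked in the next paragraph) might look like the following, assuming the defects and basic car details sets have been loaded as data frames named defects and cars, each with brand, age and (for defects) defect_code columns:

library(dplyr)

# Assumed inputs: `defects` (one row per observed defect) and
# `cars` (one row per registered car).
leaks <- defects %>%
  filter(defect_code == "RA02") %>%       # 'RA02' = oil leakage
  group_by(brand, age) %>%
  summarise(n_leaking = n())

oil_leak_pct <- cars %>%
  group_by(brand, age) %>%
  summarise(n_cars = n()) %>%
  left_join(leaks, by = c("brand", "age")) %>%
  mutate(n_leaking = ifelse(is.na(n_leaking), 0, n_leaking),
         leak_pct  = 100 * n_leaking / n_cars)   # e.g. 3 / 2043 is about 0.15%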

The graph below shows the oil leak percentages for different car brands and ages. Obviously, the older the car, the higher the leak percentage. But look at BMW: waaauwww, those old BMW’s are leaking oil like crazy…  The few lines of R code can be found here.

Conclusion

There is a lot in the open car data from RDW; you can look at many more aspects and defects of cars. Regarding my old car: according to this data, Mitsubishis have a low oil leak percentage, even the older ones.

Cheers, Longhow

 


RcppArmadillo 0.7.960.1.0

Fri, 08/18/2017 - 02:41

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The bi-monthly RcppArmadillo release is out with a new version 0.7.960.1.0 which is now on CRAN, and will get to Debian in due course.

And it is a big one. Lots of nice upstream changes from Armadillo, and lots of work on our end as the Google Summer of Code project by Binxiang Ni, plus a few smaller enhancements — see below for details.

Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use with a syntax deliberately close to a Matlab. RcppArmadillo integrates this library with the R environment and language–and is widely used by (currently) 379 other packages on CRAN—an increase of 49 since the last CRAN release in June!

Changes in this release relative to the previous CRAN release are as follows:

Changes in RcppArmadillo version 0.7.960.1.0 (2017-08-11)
  • Upgraded to Armadillo release 7.960.1 (Northern Banana Republic Deluxe)

    • faster randn() when using OpenMP (NB: usually omitted when used from R)

    • faster gmm_diag class, for Gaussian mixture models with diagonal covariance matrices

    • added .sum_log_p() to the gmm_diag class

    • added gmm_full class, for Gaussian mixture models with full covariance matrices

    • expanded .each_slice() to optionally use OpenMP for multi-threaded execution

  • Upgraded to Armadillo release 7.950.0 (Northern Banana Republic)

    • expanded accu() and sum() to use OpenMP for processing expressions with computationally expensive element-wise functions

    • expanded trimatu() and trimatl() to allow specification of the diagonal which delineates the boundary of the triangular part

  • Enhanced support for sparse matrices (Binxiang Ni as part of Google Summer of Code 2017)

    • Add support for dtCMatrix and dsCMatrix (#135)

    • Add conversion and unit tests for dgT, dtT and dsTMatrix (#136)

    • Add conversion and unit tests for dgR, dtR and dsRMatrix (#139)

    • Add conversion and unit tests for pMatrix and ddiMatrix (#140)

    • Rewrite conversion for dgT, dtT and dsTMatrix, and add file-based tests (#142)

    • Add conversion and unit tests for indMatrix (#144)

    • Rewrite conversion for ddiMatrix (#145)

    • Add a warning message for matrices that cannot be converted (#147)

    • Add new vignette for sparse matrix support (#152; Dirk in #153)

    • Add support for sparse matrix conversion from Python SciPy (#158 addressing #141)

  • Optional return of row or column vectors in collapsed form if appropriate #define is set (Serguei Sokol in #151 and #154)

  • Correct speye() for non-symmetric cases (Qiang Kou in #150 closing #149).

  • Ensure tests using Scientific Python and reticulate are properly conditioned on the packages being present.

  • Added .aspell/ directory with small local directory now supported by R-devel.

Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


2017 App Update

Fri, 08/18/2017 - 02:34

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

As you may have noticed, we have made a few changes to our apps for the 2017 season to bring you a smoother and quicker experience while also adding more advanced and customizable views.

Most visibly, we moved the apps to Shiny so we can continue to build on our use of R and add new features and improvements throughout the season.  We expect the apps to handle high traffic better this year, especially during draft season and other peak periods.

In addition to the ability to create and save custom settings, you can also choose the columns you view in our Projections tool.  We have also added more advanced metrics such as weekly VOR and Projected Points Per Dollar (ROI) for those of you in auction leagues.  With a free account, you’ll be able to create and save one custom setting.  If you get an FFA Insider subscription, you’ll be able to create and save unlimited custom settings.

Up next is the ability to upload custom auction values to make it easier to use during auction drafts.

We are also always looking to add new features, so feel free to drop us a suggestion in the Comments section below!

The post 2017 App Update appeared first on Fantasy Football Analytics.


Chapman University DataFest Highlights

Fri, 08/18/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Editor’s Note: The 2017 Chapman University DataFest was held during the weekend of April 21-23. The 2018 DataFest will be held during the weekend of April 27-29.

DataFest was founded by Rob Gould in 2011 at UCLA with 40 students. In just seven years, it has grown to 31 sites in three countries. Have a look at Mine Çetinkaya-Rundel’s post Growth of DataFest over the years for the details. In recent years, it has been difficult for UCLA to keep up with the growing interest and demand from southern California universities. This year, the Chapman DataFest became the second DataFest site in southern California, and the largest inaugural DataFest in the history of the event. We had 65 students who stayed the whole weekend from seven universities organized into 15 teams.

The event began on a Friday evening with Professor Rob Gould, the “founder” of DataFest, giving advice on goals for the weekend. He then introduced the Expedia dataset: nearly 11 million records representing users’ online searches for hotels, plus an associated file with detailed information about the hotel destinations.

Throughout the weekend, the organizers kept students motivated with data challenges (with cell phone chargers awarded as prizes), a mini-talk on tools for joining and merging data files, and a tutorial from bitScoop on using their API integration platform.

At noon on Sunday, the students submitted their two-slide presentations via email. At 1 pm, each team had five minutes to show their findings to the six-judge panel: Johnny Lin (UCLA), Joe Kurian (Mitsubishi UFG Union Bank, Irvine), Tao Song (Spectrum Pharmaceuticals), Pamela Hsu (Spectrum Pharmaceuticals), Lynn Langit (AWS, GCP IoT), and Brett Danaher (Chapman University).

The judges announced winners in three official categories:

Best Insight: CSU Northridge team “Mean Squares” (Jamie Decker, Matthew Jones, Collin Miller, Ian Postel, and Seyed Sajjadi). [See Seyed’s description of his team’s experience!]

Best Visualization: Chapman University team “Winners ‘); Drop Table” (Dylan Bowman, William Cortes, Shevis Johnson, and Tristan Tran).

Best Use of External Data: Chapman University team “BEST” (Brandon Makin, Sarah Lasman, and Timothy Kristedja).

Additionally, “Judges’ Choice” awards for “Best Use of Statistical Models” went to the USC “Big Data” team (Hsuanpei Lee, Omar Lopez, Yi Yang Tan, Grace Xu, and Xuejia Xu) and the USC “Quants” team (Cheng (Serena) Cheng, Chelsea Lee, and Hossein Shafii).

All winners were given certificates and medallions designed by Chapman’s Ideation Lab and printed on Chapman’s MLAT Lab 3D printer (see photo).

Winners also received free student memberships in the American Statistical Association.

Many thanks go to the Silver Sponsors: Children’s Hospital Orange County Medical Intelligence and Innovation Institute, Southern California Chapter of the American Statistical Association, and Chapman University MLAT Lab; and Bronze Sponsors: Experian, RStudio, Chapman University Computational and Data Sciences and Schmid College of Science and Technology, Orange County Long Beach ASA Chapter, the Missing Variables, USC Stats Club, Luke Thelen, and Google.

Thanks also to the 45 VIP consultants from BitScoop Labs, Chapman University, Compatiko, CSU Fullerton, CSU Long Beach, CSU San Bernardino, Education Management Services, Freelance Data Analysis, Hiner Partners, Mater Dei High School, Nova Technologies, Otoy, Southern California Edison, Sonikpass, Startup, SurEmail, UC Irvine, UCLA, USC, and Woodbridge High School, many of whom spent most of the weekend working with the students.

Overall, participants were enthusiastic about meeting students from other schools and the opportunity to work with the local professionals. (See the two student perspectives below.) DataFest will continue to grow as these students return to their schools and share their enthusiasm with their classmates!

The Mean Squares Perspective

by Seyed Sajjadi

For most of our team, this DataFest was only the first or second hackathon they ever attended, but the group gelled instantly.

Culture is important for a hackathon group, but talent and preparation play key roles in the success or failure. Our group spent more than a month in advance preparing for this competition. We practiced, practiced, and practiced some more for this event. We had weekly workshops where we presented the assignments that we had worked on for the past week.

The next essential for the competition may come as a surprise to most: having an artist design and prepare the presentation took an enormous amount of work off our shoulders. During the entire competition, we had a very talented artist design a fabulous slideshow for the presentation. This may sound boastful, but allowing specialized talent to work on the slideshow the entire competition is a lot better than designing it at the last minute.

The questions that were asked were not specific at all, and it was up to the participants to form and ask the proper questions. We focused on two questions: optimizing customer acquisition, and retention/conversion. We showed that online targeting and marketing can be optimized using regional historical data, meaning that most states’ residents tend to have similar preferences when it comes to the same destinations. For instance, most Californians go to Las Vegas to gamble, but most people from Texas go to Las Vegas for music events; these analyses can be used to better target potential customers from neighboring regions.

Regarding customer retention and the conversion of lookers into bookers, we calculated the optimum point in time at which Expedia could offer more special packages; this turned out to be around 14 sessions of interaction between the customer and the website. The biggest part of our analysis was achieved via hierarchical clustering.

A big aspect of the event had to do with the atmosphere and the organization. They invited people from industry to come and roam around the halls, which led to a great opportunity to meet professionals in the field of data science. We were situated in a huge room with all of the teams. We ended up crowding around a small table with everyone on their laptops and chairs. The room was big enough to have impromptu meetings, which allowed a lot of room to breathe. This hackathon was a huge growing experience for all of us on “The Mean Squares”.

Team Pineapples’ Perspective

by Annelise Hitchman

On day one, I could tell my enthusiasm to start working on the dataset was matched by the other dozens of students participating. The room was filled with interaction, and not just among the individual teams. I enjoyed talking with all the consultants in the room about the data, our approach, and even just learning about what they did for work. DataFest introduced me to real-world data that I had never seen in my classes. I learned quite a bit about data analysis from both my own team members and nearly everyone else at the event. Watching the final presentations was an inspiring and insightful end to DataFest. I really hope that DataFest is able to continue and be available to universities such as my own, so that all students interested in data analysis can participate.

Michael Fahy is Professor of Mathematics and Computer Science and Associate Dean, Schmid College of Science & Technology at Chapman University


RStudio Server Pro is ready for BigQuery on the Google Cloud Platform

Fri, 08/18/2017 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

RStudio is excited to announce the availability of RStudio Server Pro on the Google Cloud Platform.

RStudio Server Pro GCP is identical to RStudio Server Pro, but with additional convenience for data scientists, including pre-installation of multiple versions of R, common systems libraries, and the BigQuery package for R.

RStudio Server Pro GCP adapts to your unique circumstances: it allows you to choose a different GCP compute instance for RStudio Server Pro, no matter how large, whenever a project requires it, with hourly pricing.
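As a rough sketch (added here, not taken from RStudio's announcement) of what querying BigQuery from R can look like, assuming the pre-installed package is bigrquery and that "my-project-id" is a GCP project you can bill queries to:

library(bigrquery)

# bigrquery will prompt for Google OAuth on first use.
sql <- "SELECT year, COUNT(*) AS births
        FROM `bigquery-public-data.samples.natality`
        GROUP BY year ORDER BY year"
tb <- bq_project_query("my-project-id", sql)   # run the query in your project
births <- bq_table_download(tb)                # pull the result into a data frame
head(births)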

If the enhanced security, support for multiple R versions and multiple sessions, and commercially licensed and supported features of RStudio Server Pro appeal to you, please give RStudio Server Pro for GCP a try. Below are some useful links to get you started:

