R-bloggers: R news and tutorials contributed by 750 R bloggers

Follow-up forecasting forum in Eindhoven

Tue, 03/07/2017 - 23:35

Last October I gave a 3-day masterclass on “Forecasting with R” in Eindhoven, Netherlands. There is a follow-up event planned for Tuesday 18 April 2017. It is particularly designed for people who attended the 3-day class, but if anyone else wants to attend they would be welcome.

Please register here if you want to attend.

The preliminary schedule is as follows.

10.00 — 11.00 New developments in forecasting using R

  • forecast v8.0
  • prophet
  • forecastHybrid
  • opera
Coffee break
11.15 — 12.15 Open discussion of forecasting problems and questions
Lunch break
13.00 — 14.00 Brainstorm about possible features for forecast v9.0

Advanced Econometrics: Model Selection

Tue, 03/07/2017 - 21:00

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

On Thursday, March 23rd, I will give the third lecture of the PhD course on advanced tools for econometrics, on model selection and variable selection, where we will focus on ridge and lasso regressions. Slides are available online.

The first part was on Nonlinearities in Econometric models, and the second one on Simulations.


Building views with R

Tue, 03/07/2017 - 19:01

 

[Here you can see the Building views with R cheat sheet at a full resolution]

Queries

In database theory a query is a request for data or information from a database table or combination of tables.

Since the arrival of dplyr, we have something in R that conceptually resembles a query quite closely:

require(dplyr)

## Warning: package 'dplyr' was built under R version 3.2.5

require(pryr)

mtcars %>%
  tbl_df() %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg))

## # A tibble: 3 × 3
##     cyl mean_mpg   sd_mpg
##   <dbl>    <dbl>    <dbl>
## 1     4 26.66364 4.509828
## 2     6 19.74286 1.453567
## 3     8 15.10000 2.560048

What I particularly appreciate about dplyr is the possibility of building my query as a step-by-step sequence of R statements that I can test progressively at each step.
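As a small illustration of that workflow (my own sketch, not part of the original cheat sheet), each intermediate result can be stored and inspected before adding the next verb:

require(dplyr)

step1 <- mtcars %>% tbl_df()             # just a tibble wrapper around the data
step2 <- step1 %>% group_by(cyl)         # same rows, now grouped by cyl
step3 <- step2 %>% summarise(mean_mpg = mean(mpg),
                             sd_mpg   = sd(mpg))
step3                                    # the same result as the one-step query above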

 

Views

Again in database theory, a view is the result set of a stored query on the data, which database users can query just as they would a table.

I would like to have something similar to a view in R.

As far as I know, I can achieve this goal in three ways:

  • Function makeActiveBinding
  • Operator %<a-% from package pryr
  • My proposed %>>% operator

 

Function makeActiveBinding()

Function makeActiveBinding(sym, fun, env) installs a function in an environment env so that getting the value of sym calls fun with no arguments.

As a basic example, I can actively bind a function that simulates a dice roll to an object named dice:

makeActiveBinding("dice", function() sample(1:6, 1), env = globalenv())

so that:

replicate(5, dice)

## [1] 5 1 6 2 3

Similarly, I can wrap a dplyr expression into a function:

f <- function() {
  mtcars %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg))
}

and then actively bind it to a symbol:

makeActiveBinding('view', f, env = globalenv())

so that, any time we call view, the result of function f() is computed again:

view

## # A tibble: 3 × 3
##     cyl mean_mpg   sd_mpg
##   <dbl>    <dbl>    <dbl>
## 1     4 26.66364 4.509828
## 2     6 19.74286 1.453567
## 3     8 15.10000 2.560048

As a result, if I change any value of mpg within mtcars, view is automatically updated:

mtcars$mpg[c(1,3,5)] <- 0
view

## # A tibble: 3 × 3
##     cyl mean_mpg   sd_mpg
##   <dbl>    <dbl>    <dbl>
## 1     4 24.59091 9.231192
## 2     6 16.74286 7.504189
## 3     8 13.76429 4.601606

Clearly, I have to admit that all of this looks quite unfriendly, at least to me.

 

Operator %<a-%

A valid alternative, which wraps away the complexity of makeActiveBinding(), is provided by the operator %<a-% from package pryr:

view %<a-% {
  mtcars %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg))
}

Again, if I change any value of mpg within mtcars, the value of view gets automatically updated:

mtcars$mpg[c(1,3,5)] <- 50
view

## # A tibble: 3 × 3
##     cyl mean_mpg    sd_mpg
##   <dbl>    <dbl>     <dbl>
## 1     4 29.13636  8.159568
## 2     6 23.88571 11.593451
## 3     8 17.33571  9.688503

Note that in this case I have to enclose the whole expression within curly brackets.

Moreover, the assignment via %<a-% goes on the left-hand side of my chain of dplyr statements.

 

Operator %>>%

Finally, I would like to propose a third alternative, still based on makeActiveBinding(), which I named %>>%:

`%>>%` <- function(expr, x) {
  x <- substitute(x)
  call <- match.call()[-1]
  fun <- function() {NULL}
  body(fun) <- call$expr
  makeActiveBinding(sym = deparse(x), fun = fun, env = parent.frame())
  invisible(NULL)
}

that can be used as:

mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg)) %>>%
  view

And again, if I change the values of mpg:

mtcars$mpg[c(1,3,5)] <- 100

The content of view changes accordingly:

view

## # A tibble: 3 × 3
##     cyl mean_mpg   sd_mpg
##   <dbl>    <dbl>    <dbl>
## 1     4 33.68182 22.41624
## 2     6 31.02857 30.44321
## 3     8 20.90714 22.88454

I believe this operator offers two advantages:

  • It avoids the use of curly brackets around my dplyr expression
  • It allows me to actively bind the result of my chain of dplyr statements at the end of the chain, which feels more natural
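One small check worth adding (base R, not part of the proposed operator itself): you can verify that view is indeed an active binding and remove it like any other object when it is no longer needed.

bindingIsActive("view", globalenv())   # TRUE while the active binding exists
rm(view)                               # removes the binding
exists("view")                         # FALSE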

The post Building views with R appeared first on Quantide – R training & consulting.

In case you missed it: February 2017 roundup

Tue, 03/07/2017 - 18:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from February of particular interest to R users. 

Public policy researchers use R to predict neighbourhoods in US cities subject to gentrification.

The ggraph package provides a grammar-of-graphics framework for visualizing directed and undirected graphs.

Facebook has open-sourced the "prophet" package they use for forecasting time series at scale.

A preview of features coming soon to R Tools for Visual Studio 1.0.

On the differences between using Excel and R for data analysis.

A data scientist suggests a "Gloom Index" for identifying the most depressing songs by the band Radiohead.

Catterplots is a package that replaces scatterplot points with cats. (Tufte would not approve.)

A collection of tips on using Microsoft R Server from the support team.

A summary of the many improvements slated for R 3.4.0.

R code using the RevoScaleR package to classify a large database of galaxy images in SQL Server.

A review of four deep learning packages for R: MXNet, darch, deepnet and h2o.

An update on more than a dozen projects and community initiatives funded by R Consortium grants.

R has overtaken SAS for Statistics job listings on indeed.com.

ModernDive is a free online textbook on Statistics and data science using R. 

A solution (with R code) for modeling customer churn in the retail industry using SQL Server R Services.

The superheat package provides enhanced heatmap graphics for R.

The fst package provides a new serialization format for R data focused on performance.

Thomas Dinsmore reflects on major events in the R Project and for Microsoft in 2016.

And some general interest stories (not necessarily related to R): A big drain; 'Vous' vs 'tu'; a remembrance of the late Hans Rosling; and Ten Meter Tower, a short film.

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.


sigr: Simple Significance Reporting

Tue, 03/07/2017 - 16:44

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

sigr is a simple R package that conveniently formats a few statistics and their significance tests. This allows the analyst to use the correct test no matter what modeling package or procedure they use.

Model Example

Let’s take as our example the following linear relation between x and y:

library('sigr')
set.seed(353525)
d <- data.frame(x= rnorm(5))
d$y <- 2*d$x + 0.5*rnorm(nrow(d))

stats::lm() has among the most complete summaries of all models in R, so we can easily see the quality of fit and its significance:

model <- lm(y~x, d=d)
summary(model)

## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Residuals:
##       1       2       3       4       5 
##  0.3292  0.3421  0.0344 -0.1671 -0.5387 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.4201     0.2933   1.432  0.24747   
## x             2.1117     0.2996   7.048  0.00587 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4261 on 3 degrees of freedom
## Multiple R-squared:  0.943,  Adjusted R-squared:  0.9241 
## F-statistic: 49.67 on 1 and 3 DF,  p-value: 0.005871

sigr::wrapFTest() can render the relevant model quality summary.

cat(render(wrapFTest(model), pLargeCutoff=1))

F Test summary: (R2=0.94, F(1,3)=50, p=0.0059).

Note: sigr reports the un-adjusted R-squared, as this is the one whose significance is reported by summary.lm().

sigr also carries around the important summary components for use in code.

unclass(wrapFTest(model))

## $test
## [1] "F Test"
## 
## $numdf
## [1] 1
## 
## $dendf
## [1] 3
## 
## $FValue
## [1] 49.66886
## 
## $R2
## [1] 0.9430403
## 
## $pValue
## [1] 0.005871236

In this function sigr is much like broom::glance() or modelr::rsquare().

broom::glance(model)

##   r.squared adj.r.squared     sigma statistic     p.value df    logLik
## 1 0.9430403     0.9240538 0.4261232  49.66886 0.005871236  2 -1.552495
##       AIC      BIC deviance df.residual
## 1 9.10499 7.933304 0.544743           3

modelr::rsquare(model, d)

## [1] 0.9430403

This is something like what is reported by caret::train() (where caret controls the model training process).

cr <- caret::train(y~x, d, method = 'lm')

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.

cr$results

##   intercept      RMSE  Rsquared    RMSESD   RsquaredSD
## 1      TRUE 0.9631662 0.9998162 0.6239989 0.0007577834

summary(cr$finalModel)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      X1      X2      X3      X4      X5 
##  0.3292  0.3421  0.0344 -0.1671 -0.5387 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.4201     0.2933   1.432  0.24747   
## x             2.1117     0.2996   7.048  0.00587 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4261 on 3 degrees of freedom
## Multiple R-squared:  0.943,  Adjusted R-squared:  0.9241 
## F-statistic: 49.67 on 1 and 3 DF,  p-value: 0.005871

(I presume cr$results$Rsquared is a model quality report and not a consistency of cross-validation procedure report. If it is a model quality report it is somehow inflated, perhaps by the resampling procedure. So I apologize for using either caret::train() or its results incorrectly.)

Data example

The issues in taking summary statistics (and significances) from models include:

  • Working only from models limits you to models that include the statistic you want.
  • Working only from models is mostly "in-sample." You need additional procedures for test or hold-out data.

With sigr it is also easy to reconstruct quality and significance from the predictions, no matter where they came from (without needing the model data structures).

In-sample example

d$pred <- predict(model, newdata = d)
cat(render(wrapFTest(d, 'pred', 'y'), pLargeCutoff=1))

F Test summary: (R2=0.94, F(1,3)=50, p=0.0059).

Notice we reconstruct the summary statistic and significance, independent of the model data structures. This means the test is generic and can be used on any regression (modulo informing the significance model of the appropriate number of parameters). It also can be used on held-out or test data.

In this mode it is a lot like ModelMetrics::rmse() or caret::postResample().

ModelMetrics::rmse(d$y, d$pred)

## [1] 0.3300736

caret::postResample(d$y, d$pred)

##      RMSE  Rsquared 
## 0.3300736 0.9430403 

Out of sample example

If we had more data we can look at the quality of the prediction on that data (though you have to take care in understanding the number of degrees of freedom is different for held-out data).

d2 <- data.frame(x= rnorm(5))
d2$y <- 2*d2$x + 0.5*rnorm(nrow(d2))
d2$pred <- predict(model, newdata = d2)
cat(render(wrapFTest(d2, 'pred', 'y'), pLargeCutoff=1))

F Test summary: (R2=0.76, F(1,3)=9.6, p=0.054).

Exotic model example

We could have used glmnet instead of lm:

library("glmnet")
d$one <- 1  # make sure we have at least 2 columns
xmat <- as.matrix(d[, c('one','x')])
glmnetModel <- cv.glmnet(xmat, d$y)

## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold

summary(glmnetModel)

##            Length Class  Mode     
## lambda     53     -none- numeric  
## cvm        53     -none- numeric  
## cvsd       53     -none- numeric  
## cvup       53     -none- numeric  
## cvlo       53     -none- numeric  
## nzero      53     -none- numeric  
## name        1     -none- character
## glmnet.fit 12     elnet  list     
## lambda.min  1     -none- numeric  
## lambda.1se  1     -none- numeric  

d$predg <- as.numeric(predict(glmnetModel, newx= xmat))
cat(render(wrapFTest(d, 'predg', 'y'), pLargeCutoff=1))

F Test summary: (R2=0.91, F(1,3)=30, p=0.012).

Plotting

Because sigr can render to LaTeX, it can (when used in conjunction with latex2exp) also produce formatted titles for plots.

library("ggplot2")
library("latex2exp")

f <- paste0(format(model$coefficients['x'], digits= 3),
            '*x + ',
            format(model$coefficients['(Intercept)'], digits= 3))
title <- paste0("linear y ~ ", f, " relation")
subtitle <- latex2exp::TeX(render(wrapFTest(d, 'pred', 'y'), format= 'latex'))

ggplot(data=d, mapping=aes(x=pred, y=y)) +
  geom_point() +
  geom_abline(color='blue') +
  xlab(f) +
  ggtitle(title, subtitle= subtitle)

Conclusion

sigr computes a few statistics or summaries (with standard significance estimates) and returns the calculation in both machine readable and formatted forms. For non-standard summaries sigr includes some resampling methods for significance estimation. For formatting sigr tries to get near the formats specified by "The American Psychological Association." sigr works with models (such as lm and glm(family='binomial')) or data, and is a small package with few dependencies.


R 3.3.3 is released!

Tue, 03/07/2017 - 16:17

(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers)

R 3.3.3 (codename “Another Canoe”) was released yesterday. You can get the latest binary version from here (or the .tar.gz source code from here). The full list of bug fixes and new features is provided below.

A quick summary by David Smith:

R 3.3.3 fixes an issue related to attempting to use download.file on sites that automatically redirect from http to https: now, R will re-attempt to download the secure link rather than failing. Other fixes include support for long vectors in the vapply function, the ability to use pmax (and pmin) on ordered factors, improved accuracy for qbeta for some extreme cases, corrected spelling for “Cincinnati” in the precip data set, and a few other minor issues.
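As a tiny illustration of one of those fixes (my example, not from the release announcement), pmax() and pmin() work element-wise on ordered factors again in 3.3.3:

sizes <- c("low", "medium", "high")
a <- factor(c("low", "high", "medium"),   levels = sizes, ordered = TRUE)
b <- factor(c("medium", "low", "medium"), levels = sizes, ordered = TRUE)
pmax(a, b)   # medium high   medium
pmin(a, b)   # low    low    medium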

Upgrading to R 3.3.3 on Windows

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

install.packages("installr")  # install
setInternet2(TRUE)            # only for R versions older than 3.3.0
installr::updateR()           # updating R.

Running “updateR()” will detect if there is a new R version available, and if so it will download and install it (etc.). There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the installr package. If you only see the option to upgrade to an older version of R, then change your mirror or try again in a few hours (it usually takes around 24 hours for all CRAN mirrors to get the latest version of R).

I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package, you are invited to open an issue on the GitHub page.

CHANGES IN R 3.3.3 NEW FEATURES
  • Changes when redirection of a http:// URL to a https:// URL is encountered:
    • The internal methods of download.file() and url() now report that they cannot follow this (rather than failing silently).
    • (Unix-alike) download.file(method = "auto") (the default) re-tries with method = "libcurl".
    • (Unix-alike) url(method = "default") with an explicit open argument re-tries with method = "libcurl". This covers many of the usages, e.g. readLines() with a URL argument.
INSTALLATION on a UNIX-ALIKE
  • The configure check for the zlib version is now robust to versions longer than 5 characters, including 1.2.11.
UTILITIES
  • Environmental variable _R_CHECK_TESTS_NLINES_ controls how R CMD check reports failing tests (see §8 of the ‘R Internals’ manual).
DEPRECATED AND DEFUNCT
  • (C-level Native routine registration.) The undocumented styles field of the components of R_CMethodDef and R_FortranMethodDef is deprecated.
BUG FIXES
  • vapply(x, *) now works with long vectors x. (PR#17174)
  • isS3method("is.na.data.frame") and similar are correct now. (PR#17171)
  • grepRaw(<long>, <short>, fixed = TRUE) now works, thanks to a patch by Mikko Korpela. (PR#17132)
  • Package installation into a library where the package exists via symbolic link now should work wherever Sys.readlink() works, resolving PR#16725.
  • "Cincinnati" was missing an "n" in the precip dataset.
  • Fix buffer overflow vulnerability in pdf() when loading an encoding file. Reported by Talos (TALOS-2016-0227).
  • getDLLRegisteredRoutines() now produces its warning correctly when multiple DLLs match, thanks to Matt Dowle’s PR#17184.
  • Sys.timezone() now returns non-NA also on platforms such as Ubuntu 14.04.5 LTS, thanks to Mikko Korpela’s PR#17186.
  • format(x) for an illegal "POSIXlt" object x no longer segfaults.
  • methods(f) now also works for f "(" or "{".
  • (Windows only) dir.create() did not check the length of the path to create, and so could overflow a buffer and crash R. (PR#17206)
  • On some systems, very small hexadecimal numbers in hex notation would underflow to zero. (PR#17199)
  • pmin() and pmax() now work again for ordered factors and 0-length S3 classed objects, thanks to Suharto Anggono’s PR#17195 and PR#17200.
  • bug.report() did not do any validity checking on a package’s BugReports field. It now ignores an empty field, removes leading whitespace and only attempts to open http:// and https:// URLs, falling back to emailing the maintainer.
  • Bandwidth selectors bw.ucv() and bw.SJ() gave incorrect answers or incorrectly reported an error (because of integer overflow) for inputs longer than 46341. Similarly for bw.bcv() at length 5793.

    Another possible integer overflow is checked and may result in an error report (rather than an incorrect result) for much longer inputs (millions for a smooth distribution).

  • findMethod() failed if the active signature had expanded beyond what a particular package used. (Example with packages XR and XRJulia on CRAN.)
  • qbeta() underflowed too early in some very asymmetric cases. (PR#17178)
  • R CMD Rd2pdf had problems with packages with non-ASCII titles in ‘.Rd’ files (usually the titles were omitted).


The Ramones. Punk is Data, Too

Tue, 03/07/2017 - 15:16

(This article was first published on English – R-blog, and kindly contributed to R-bloggers)

 

The starting point of this post is a simple question: can we use R to analyze punk bands? And, as a result: what can we learn from applying data analytics methods to punk music?

 

Whether we like it or not, “punk rock is arguably the most important subgenre of music to come out of the ‘70s” and consequently still an integral part of our mainstream contemporary culture. After years of being declared too outrageous to be accepted, its legacy is so astonishingly extensive that it deserves careful consideration and serious attention. For decades, many music critics, fine arts experts, social and political scientists and historians of pop culture have devoted time and energy to studying the punk scene, its cultural production and legacy, the attitude of the punk generation, its tangle of ideologies, and the ways it was perceived and received. Facts and figures, however, are still missing, perhaps because there apparently is nothing more distant from data analytics than punk music. So, is data analytics of punk rock possible? Would it make any sense? My answer is a loud and bold yes –yes, statistics on punk rock matters.

 

Although the punk scene cannot be condensed into a single band, the Ramones are still considered by many as the first “pure punk band” and, perhaps –and more importantly–, one of the most influential. This does not imply that other punk rock bands (Clash, Dead Kennedys, The Stooges, Misfits, Sex Pistols, Social Distortion, Patti Smith Group, etc.) are less noteworthy or not as good. Yet, since I need to start somewhere, I decided that my first attempt would focus on the Ramones –which I paradoxically like a lot despite being more of a baroque and classical music person.

 

What did the Ramones do?

From 1976 to 1994, the Ramones released 14 studio albums. In their original USA release, the albums comprised 176 different songs in total that were quite short (median: 2M 32S) and mostly written in a Major key (only 2 songs are in a minor key: Em).

Year  Album                 Number of songs  Length
1976  Ramones               14               28M 52S
1977  Leave Home            14               28M 57S
1977  Rocket To Russia      14               28M 5S
1978  Road To Ruin          12               28M 9S
1980  End Of The Century    12               28M 50S
1981  Pleasant Dreams       12               28M 53S
1983  Subterranean Jungle   12               28M 21S
1985  Too Tough To Die      12               28M 18S
1986  Animal Boy            12               28M 44S
1987  Halfway To Sanity     12               28M 53S
1989  Brain Drain           12               28M 2S
1992  Mondo Bizarro         13               28M 25S
1993  Acid Eaters           12               28M 3S
1994  Adios Amigos          13               28M 1S

Musical purists always reproached the Ramones for knowing only a couple of chords and making excessive use of them. The data show that the band knew at least… 11 different chords (out of too-many-to-bother-counting possibilities), although 80% of their songs were built on no more than 6. And there is no evidence that the Ramones’ compositions grew more sophisticated over time.

Likewise, the number of different chords in a Ramones song is independent of the songwriter(s) –a t.test of the number of different chords by writer does not allow us to reject the null hypothesis of no difference– even though, according to the biographers, each band member had a very distinct personality.

 

In terms of official chart rankings in the USA, the success of the Ramones fluctuated over their career. The band’s first years were definitely the most successful, from its creation until the early 80s. Then, from 1985 onwards, it looks like the sales didn’t follow the strengthening of their reputation, both within and outside the punk rock scene.

 

What did the Ramones say?

In my dataset, the Ramones’ lyrics come from azlyrics.com. I preferred this source over many other available sources since that website provides the lyrics without the verse repeats, which, in my opinion, would over-emphasise and, ultimately, bias the relevance of n-grams or topics. The dataset (a data frame) contains a lyrics variable, i.e. a character string of the track (without the verse repeats) including the < br> tags that mark the end of each line.

An example of the lyrics variable is like the following:

Hey ho, let s go < br>Hey ho, let s go < br>They re forming in a straight line < br>They re going through a tight wind < br>The kids are losing their minds < br>The Blitzkrieg Bop < br>They re piling in the back seat < br>They re generating steam heat < br>Pulsating to the back beat < br>The Blitzkrieg Bop. < br>Hey ho, let s go < br>Shoot em in the back now < br>What they want, I dont know < br>They re all reved up and ready to go

Tidying the text up (adopting the data principles recommended by Hadley Wickham) is the necessary first step of the lyrics mining exercise. For that, I follow the tidy text approach developed by Julia Silge & David Robinson.
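The post does not reproduce its code here, but the tidying step looks roughly like the following sketch, assuming a data frame ramones_lyrics with one row per song and the columns album, song and lyrics (these names are mine, not the author's):

library(dplyr)
library(stringr)
library(tidytext)

tidy_lyrics <- ramones_lyrics %>%
  mutate(lyrics = str_replace_all(lyrics, "< br>", " ")) %>%  # drop the line-break tags
  unnest_tokens(word, lyrics)                                 # one row per word per song

tidy_lyrics %>% count(word, sort = TRUE)                      # raw uniGram frequencies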

 

First and foremost, it is worth noting that whatever the Ramones say, they say it in very few words! Ramones songs are brief in time, but also short in lyrics (though not so much in vocabulary, with 2,139 unique words in total).

Whereas uniGrams are usually considered suitable for analysis only after removal of stop words, in the Ramones lyrics the raw uniGrams already show an interesting pattern. The 2 most frequent words in the 14 studio albums are i and you. One could provocatively argue that Tea for Two, a well-known 1925 song by Vincent Youmans and Irving Caesar, is a good representation of the Ramones’ musical universe, which seems to be mainly centered on you and i, and i and you!

In the uniGrams table below, the columns of the cleaned uniGrams highlight that the top word in the Ramones lyrics is dont, expressing an atmosphere of clear negation. But there is also a fascinating tension pointing to the future that shows through words such as wanna, gonna and ll (will or shall). Rock and punk amongst the top 20 words definitely remind you what type of music you are listening to and what subculture the band belongs to. In an all-male band, words such as baby, love and girl attest to the significance of man-woman relationships in the Ramones songs. Perhaps it took a statistical analysis of the lyrics to risk forming the hypothesis of the Ramones as a romantic band…

All uniGrams  Freq  |  Cleaned uniGrams  Freq
i             1510  |  dont               317
you            800  |  baby               241
the            773  |  yeah               161
a              615  |  love               154
to             584  |  wanna              122
s              498  |  gonna              117
and            438  |  time                90
it             402  |  ll                  78
my             372  |  life                61
me             322  |  rock                58
dont           317  |  day                 57
oh             259  |  girl                55
in             258  |  hey                 55
of             251  |  remember            54
baby           241  |  punk                52
t              237  |  ve                  52
m              232  |  world               48
no             215  |  fun                 43
can            202  |  feel                42
on             200  |  bad                 41

 

The identification of most frequent uniGrams per album is a further step into a more granular analysis:

 

In addition to identifying the most frequent single words, we can also highlight when they are used across the discography using a simple Token Distribution Analysis. Let’s limit this exercise to just 5 words from the list of the top 20: love, gonna, rock (or rocker), life and dont.
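A hedged sketch of that token distribution analysis, reusing the hypothetical tidy_lyrics frame from the earlier sketch:

library(dplyr)
library(ggplot2)

track_words <- c("love", "gonna", "rock", "life", "dont")

tidy_lyrics %>%
  filter(word %in% track_words) %>%
  count(album, word) %>%                                   # word frequency per album
  ggplot(aes(x = album, y = n)) +
  geom_col() +
  facet_wrap(~ word, scales = "free_y") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))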

A quick visualisation of ‘raw’ nGrams (stop words not removed) confirms the feeling of a narrative universe mainly focused on i, you and negation (don’t).

 

What did the Ramones feel?

As a (brief) final chapter of this post, I would like to run a very quick –and limited– sentiment analysis of the lyrics of the Ramones’ studio albums. Actually, rather than a sentiment analysis, this is nothing but scratching the surface of sentiment analysis. The bing sentiment lexicon was used here, but a similar analysis could be carried out using the afinn or nrc lexicons (all available in the tidytext R package), or using all of them for a comparative approach.

Although the sentiment lexicon gives the word punk a negative value, there is little risk in asserting that this is not the way the Ramones intended it.
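A minimal sketch of such a bing-based scoring, again on the hypothetical tidy_lyrics frame (the lexicons ship with the tidytext package):

library(dplyr)
library(tidyr)
library(tidytext)

tidy_lyrics %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words in the lexicon
  count(album, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(score = positive - negative)                  # crude per-album sentiment score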

 

In order to both fine-tune and expand the approach, a more accurate sentiment analysis could be undertaken, paying attention to at least 5 additional tasks:

  • in the lyrics, identify the sentiment words preceded or followed by not;
  • review and, perhaps, amend the sentiment lexicon(s) to better reflect the punk rock subculture;
  • focus on relative more than absolute frequencies of words;
  • add terms’ inverse document frequency analysis to measure the impact of the words that are rarely used;
  • use ML to spot/predict combinations of n-Grams, topics, writers that would “guarantee” a better ranking in the charts.

 

The dataset and complete R code of this post can be downloaded from this link.


Missing values imputation with missMDA

Tue, 03/07/2017 - 15:06

(This article was first published on François Husson, and kindly contributed to R-bloggers)

“The best thing to do with missing values is not to have any” (Gertrude Mary Cox)

Unfortunately, missing values are ubiquitous and occur for plenty of reasons. One solution is single imputation which consists in replacing missing entries with plausible values. It leads to a complete dataset that can be analyzed by any statistical methods.

Based on dimensionality reduction methods, the missMDA package successfully imputes large and complex datasets with quantitative variables, categorical variables and mixed variables. Indeed, it imputes data with principal component methods that take into account the similarities between the observations and the relationship between variables. It has proven to be very competitive in terms of quality of the prediction compared to the state of the art methods.

With 3 lines of code, we impute the dataset orange available in missMDA:

library(missMDA)

data(orange)

nbdim <- estim_ncpPCA(orange) # estimate the number of dimensions to impute

res.comp <- imputePCA(orange, ncp = nbdim)

In the same way, imputeMCA imputes datasets with categorical variables and imputeFAMD imputes mixed datasets.
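For instance, a minimal sketch for the categorical case might look as follows, assuming don is a data frame of factors with missing entries (the object name is mine):

library(missMDA)

nb <- estim_ncpMCA(don)                    # estimate the number of dimensions to impute
res.mca <- imputeMCA(don, ncp = nb$ncp)    # impute the categorical dataset
head(res.mca$completeObs)                  # completed data, ready for further analysis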

With a completed dataset, we can pursue our analyses… however, we need to be careful and not forget that the data were incomplete! In a future post, we will see how to visualize the uncertainty on these predicted values.

You can find more information in this JSS paper, on this website, and in this tutorial given at useR! 2016 at Stanford.

You can also watch this playlist on YouTube to practice with R.

You can also enroll in this MOOC.


Announcing Icarus v0.3

Tue, 03/07/2017 - 14:45

(This article was first published on NC233 » R, and kindly contributed to R-bloggers)

This weekend I released version 0.3.0 of the Icarus package to CRAN.

Icarus provides tools to help perform calibration on margins, which is a very important method in sampling. One of these days I’ll write a blog post explaining calibration on margins! In the meantime, if you want to learn more, you can read our course on calibration (in French) or the original paper of Deville and Sarndal (1992). In short, calibration computes new sampling weights so that the sampling estimates match totals we already know thanks to another source (typically a census).
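To give a rough idea of what that means (this is not the Icarus API, just a bare-bones illustration of the linear calibration method on made-up data): starting from design weights d, auxiliary variables X observed on the sample, and their known population totals Tx, the calibrated weights are the smallest adjustment of d that reproduces Tx exactly.

set.seed(1)
n  <- 200
d  <- rep(5, n)                                   # design weights
X  <- cbind(intercept = 1, x = rnorm(n, 50, 10))  # auxiliary variables on the sample
Tx <- c(1000, 51000)                              # known population totals (e.g. from a census)

# linear-method calibration: w = d * (1 + X %*% lambda)
lambda <- solve(t(X) %*% (d * X), Tx - t(X) %*% d)
w <- as.vector(d * (1 + X %*% lambda))

colSums(w * X)   # the calibrated estimates now match Tx exactly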

In the industry, one of the most widely used pieces of software for performing calibration on margins is the SAS macro Calmar, developed at INSEE. Icarus is designed with the typical Calmar user in mind, should s/he wish to find a direct equivalent in R. The format expected by Icarus for the margins and the variables is directly inspired by Calmar’s (wiki and example here). Icarus also provides the same kind of graphs and statistics aimed at helping statisticians understand the quality of their data and estimates (especially on domains), and in general be able to understand and explain the reweighting process.


Example of Icarus in RStudio

I hope I soon find the time to finish a fully documented article to submit to a journal and set it as a vignette on CRAN. For now, here are the slides (in French, again) I presented at the “colloque francophone sondages” in Gatineau last October: http://nc233.com/icarus.

Kudos to the CRAN team for their amazing work!


Mapping “France at night” with the new sf package

Tue, 03/07/2017 - 11:45

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)



A few days ago, I was inspired by a set of photographs of Earth from space, at night.

The images are amazing, so I decided to try to replicate them using R.

To be a little more specific, I found a single dataset – a data set with French population data – and I’m using it to replicate the images for a single country: France.

A few notes and warnings before you get started

As usual, I’ve included the code for how to create the final data visualization.

Having said that, in this case I decided to use some new tools. Therefore, before trying to run the code yourself, there are a few notes and caveats to keep in mind.

This was built with the new sf package

I decided to build this with the new sf package.

sf is a new package for geospatial analysis that’s currently under development. You can read more about it here.

Warning: setting up sf and dependencies may be problematic

Before getting started, you should know that installing sf may require you to install some dependencies. You can find out more here.

If you decide to install sf and the dependencies, be warned: getting the dependencies set up (specifically, gdal) may be tricky and problematic, so attempt it at your own risk.

Code: mapping French population density

Ok, here’s the code.

First, we’ll load the packages that we’re going to need. (Of course, you need to have them installed as well.)

Notice also that we’re using the development version of ggplot2. We need this because we’ll be using geom_sf().

#=======================================
# INSTALL DEVELOPMENT VERSION OF ggplot2
#=======================================
library(devtools)
dev_mode(on = T)
install_github("hadley/ggplot2")

#==============
# LOAD PACKAGES
#==============
library(sf)
library(tidyverse)
library(stringr)
library(scales)

Get the data

Next, we’ll get the data from a URL at Sharp Sight and load it into R:

#=========
# GET DATA
#=========
url.france_pop <- url("http://sharpsightlabs.com/wp-content/datasets/france_population_data_2016.RData")
load(url.france_pop)

# INSPECT
glimpse(df.france)

Change column names to lower case

We won’t have to make too many adjustments to the data, but we will transform the variable names to lower case.

#======================================
# CHANGE COLUMN NAMES: lowercase
# - in the original dataset the column
#   names were capitalized
# - we'll transform the names to
#   lower case
#======================================
colnames(df.france) <- colnames(df.france) %>% str_to_lower()

# INSPECT
colnames(df.france)

Create density variable

In the visualization, we’re going to plot population density.

The dataset does not have a density variable, so we need to calculate it based on population and superficie (i.e. area). Keep in mind that superficie is in hectares, so we’ll be converting from “people/hectare” to “people/square kilometer.”

#=========================
# CREATE DENSITY VARIABLE
#=========================
df.france$population %>% summary()

#-------------------------
# CREATE VARIABLE: density
#-------------------------
df.france <- df.france %>% mutate(density = population/superficie*100)

# INSPECT
colnames(df.france)
head(df.france)

Plot a test chart

Next, we’ll plot a very simple test chart to see if things are roughly OK.

As we do this, note how simple it is to use geom_sf(). This is one of the advantages of using the new sf package.

#==========================================
# PLOT TEST CHART
# - here, we'll just plot a test chart
#   to see if the data are roughly correct
#   and to make sure that the data are
#   roughly in working order
#==========================================
#ggplot(df.france) + geom_sf()
ggplot(df.france) + geom_sf(aes(fill = density))



This appears to be OK, but it’s tough to tell because the borders are too thick.

Let’s do one more test chart to try to remove the lines and get a better look:

#==========================================
# PLOT SECOND TEST CHART
# - here, we'll plot another test with
#   smaller borders
# - we'll set size = .1
#==========================================
ggplot(df.france) + geom_sf(aes(fill = density), color = NA)



Ok, we can see the fill color now (although the borders are still there).

Looking at this, we’ll have to solve at least 2 problems:

  • Adjust the color scale
  • Remove the borders

Most of the remaining work will be modifying the fill color to get the colors right, and adjusting the scale so that differences between densities are more visible.

Calculate quantiles to identify color break points

Ultimately, we need to adjust the fill-colors associated with the fill aesthetic. We need to find “break points” to which we’ll associate specific fill values.

To do this, we’ll calculate quantiles.

#=======================================================
# IDENTIFY QUANTILES
# - the hardest part of creating the finalized
#   data visualization is getting the color scale
#   just right
# - on final analysis, the best way to adjust the color
#   scale was by capping the population density
# - we'll use these quantiles to find a good cap
#   via trial and error
#=======================================================

#---------
# VENTILES
#---------
quantile(df.france$density, seq(from = 0, to = 1, by = .05))

#           0%           5%          10%          15%          20%          25%          30%
#     0.000000     6.815208    10.032525    12.824891    15.614076    18.485419    21.719516
#          35%          40%          45%          50%          55%          60%          65%
#    25.510974    29.538791    34.568568    40.088124    46.762867    54.489134    64.356473
#          70%          75%          80%          85%          90%          95%         100%
#    77.376987    94.117647   118.935653   160.578630   242.086849   510.070301 41929.234973

#----------------------
# PERCENTILES: 90 - 100
#----------------------
quantile(df.france$density, seq(from = .9, to = 1, by = .01))

#       90%       91%       92%       93%       94%       95%       96%       97%       98%
#  242.0868  268.1167  300.2817  347.6989  412.2042  510.0703  637.9211  876.0090 1300.6435
#       99%      100%
# 2649.3653 41929.2350

# MEDIAN
median(df.france$density)
# 40.08812

Create finalized version of chart

Let’s plot the finalized version. I’ll give a little explanation afterwards.

#==================
# DETAILED VERSION
# - add theme
# - add colors
#==================
df.france %>%
  mutate(fill_var = density) %>%
  ggplot() +
    geom_sf(aes(fill = fill_var, color = fill_var)) +
    scale_fill_gradientn(colours = c("#000233","#f9cf86","#fceccf","white")
                         ,values = rescale(c(0,500,3000,40000))
                         ,breaks = c(0,500,1500)
                         ,guide = FALSE) +
    scale_color_gradientn(colours = c("#000233","#f9cf86","#fceccf","white")
                          ,values = rescale(c(0,500,3000,40000))
                          ,guide = FALSE) +
    labs(title = "Population density in France\nvisualized to imitate satellite images of France at night"
         ,subtitle = "(people/sq km)"
         ) +
    theme(text = element_text(family = "Gill Sans", color = "#E1E1E1")
          ,plot.title = element_text(size = 18, color = "#E1E1E1")
          ,plot.subtitle = element_text(size = 10)
          ,plot.background = element_rect(fill = "#000223")
          ,panel.background = element_rect(fill = "#000223")
          ,panel.grid.major = element_line(color = "#000223")
          ,axis.text = element_blank()
          ,axis.ticks = element_blank()
          ,legend.background = element_blank()
          )

#==================
# TURN OFF DEV MODE
#==================
dev_mode(on = F)



OK, the truth is that to create this finalized version, you actually need to do a lot of trial and error.

Essentially, you need to find the right cutoffs in the data and associate those values to specific fill values.

The point where we do that is in the following code:

scale_fill_gradientn(colours = c("#000233","#f9cf86","#fceccf","white")
                     ,values = rescale(c(0,500,3000,40000))

Here, we’re taking values of our fill variable, density, and we’re associating those values to specific colors. How do we know which values to use and which colors to use?

Trial and error.

You need to take the quantile values that we calculated earlier and test them out. Try different cutoff values for values =, and different fill colors for colours =.
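One thing that helps with that trial and error (a quick check of my own, not from the original tutorial) is printing what rescale() actually hands to scale_fill_gradientn(): it maps the chosen cutoffs onto the 0 to 1 range the scale expects.

library(scales)

rescale(c(0, 500, 3000, 40000))
# [1] 0.0000 0.0125 0.0750 1.0000

So the first three colours are squeezed into roughly the bottom 8% of the density range, which is exactly what makes the dense urban areas glow against an almost uniformly dark background.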

To create visualizations like this, master ggplot2

I’ll admit that this data visualization is a little challenging to create, simply because of the extensive trial and error that’s required.

It’s also a little out of the ordinary, because we’re using some packages that are under development.

Having said that, this visualization is still essentially just a fancy application of ggplot2. If you really know what you’re doing with ggplot2 – if you’ve already mastered the basics – then this plot is fairly straightforward.

So, if you want to learn to create compelling data visualizations, master ggplot2.

Sign up now to master ggplot2

To get a job as a highly-paid data scientist, you need to know how to create compelling data visualizations.

To create compelling visualizations in R, you need to master ggplot2.

Sign up now and we’ll show you how.

By signing up, you’ll get our free Data Science Crash Course, where you’ll learn:

  • the foundations of ggplot2
  • 3 essential data visualizations
  • a step-by-step data science learning plan

  • the 1 programming language you need to learn

  • how to do data manipulation in R
  • how to get started with machine learning
  • and more …

In addition, you’ll also get weekly data science tutorials, delivered right to your inbox.

SIGN UP NOW

The post Mapping “France at night” with the new sf package appeared first on SHARP SIGHT LABS.


Ensemble Learning in R

Tue, 03/07/2017 - 10:26

Previous research in data mining has devised numerous different algorithms for learning tasks. While an individual algorithm might already work decently, one can usually obtain better predictive performance by combining several. This approach is referred to as ensemble learning.
Common examples include random forests, boosting, and AdaBoost in particular.
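As a minimal, self-contained illustration (my example, not taken from the slide deck), a random forest, i.e. an ensemble of decision trees, can be fitted and evaluated in a few lines of R, assuming the randomForest package is installed:

library(randomForest)

set.seed(42)
idx <- sample(nrow(iris), 100)                     # simple train/test split

rf <- randomForest(Species ~ ., data = iris[idx, ], ntree = 500)
preds <- predict(rf, newdata = iris[-idx, ])

mean(preds == iris$Species[-idx])                  # hold-out accuracy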

Our slide deck is positioned at the intersection of teaching the basic idea of ensemble learning and providing practical insights in R.
Therefore, each algorithm comes with an easy-to-understand explanation on how to use it in R.

We hope that the slide deck enables practitioners to quickly adopt ensemble learning for their applications in R. Moreover, the materials might lay the groundwork for courses on data mining and machine learning.

Download the slides here
Download the exercise sheet here
The content was republished on r-bloggers.com with permission.

Frankenstein

Tue, 03/07/2017 - 08:30

Remember me, remember me, but ah! forget my fate (Dido’s Lament, Henry Purcell)

A Voronoi diagram divides a plane based on a set of original points. Each polygon, or Voronoi cell, contains an original point and all the points that are closer to it than to any other original point.

This is a nice example of a Voronoi tessellation. You can find good explanations of Voronoi diagrams and Delaunay triangulations here (in English) or here (in Spanish).
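A toy example (random points, not the Frankenstein data) shows how little code a tessellation needs with the deldir package used later in this post:

library(deldir)

set.seed(1)
pts  <- data.frame(x = runif(10), y = runif(10))
tess <- deldir(pts$x, pts$y)

plot(tess, wlines = "tess")   # "triang" would draw the Delaunay triangulation instead
points(pts, pch = 19)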

A grayscale image is simply a matrix where the darkness of the pixel located at coordinates (i, j) is represented by the value of its corresponding element of the matrix: a grayscale image is a dataset. This is a Voronoi diagram of Frankenstein:

To do it, I followed these steps:

  1. Read this image
  2. Convert it to grayscale
  3. Turn it into a pure black and white image
  4. Obtain a random sample of black pixels (the previous image corresponds to a sample of 6,000 points)
  5. Compute the Voronoi tessellation

Steps 1 to 3 were done with imager, a very appealing package to process and analyse images. Step 5 was done with deldir, also a convenient package, which computes the Delaunay triangulation and the Dirichlet or Voronoi tessellations.

The next grid shows tessellations for sample sizes from 500 to 12,000 points in steps of 500:

I gathered all the previous images in this gif, created with magick, another amazing R package I discovered recently:

This is the code:

library(imager)
library(dplyr)
library(deldir)
library(ggplot2)
library(scales)

# Download the image
file="http://ereaderbackgrounds.com/movies/bw/Frankenstein.jpg"
download.file(file, destfile = "frankenstein.jpg", mode = 'wb')

# Read and convert to grayscale
load.image("frankenstein.jpg") %>% grayscale() -> x

# This is just to define frame limits
x %>%
  as.data.frame() %>%
  group_by() %>%
  summarize(xmin=min(x), xmax=max(x), ymin=min(y), ymax=max(y)) %>%
  as.vector()->rw

# Filter image to convert it to bw
x %>% threshold("45%") %>% as.data.frame() -> df

# Function to compute and plot Voronoi tessellation depending on sample size
doPlot = function(n)
{
  # Voronoi tessellation
  df %>%
    sample_n(n, weight=(1-value)) %>%
    select(x,y) %>%
    deldir(rw=rw, sort=TRUE) %>%
    .$dirsgs -> data

  # This is just to add some alpha to lines depending on their length
  data %>%
    mutate(long=sqrt((x1-x2)^2+(y1-y2)^2),
           alpha=findInterval(long, quantile(long, probs = seq(0, 1, length.out = 20)))/21)-> data

  # A little bit of ggplot to plot results
  data %>%
    ggplot(aes(alpha=(1-alpha))) +
    geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2), color="black", lwd=1) +
    scale_x_continuous(expand=c(0,0))+
    scale_y_continuous(expand=c(0,0), trans=reverse_trans())+
    theme(legend.position  = "none",
          panel.background = element_rect(fill="white"),
          axis.ticks       = element_blank(),
          panel.grid       = element_blank(),
          axis.title       = element_blank(),
          axis.text        = element_blank())->plot

  return(plot)
}

# I call the previous function and store the resulting plot in jpeg format
i=5000
name=paste0("frankie",i,".jpeg")
jpeg(name, width = 600, height = 800, units = "px", quality = 100)
doPlot(i)
dev.off()

# Once all images are stored I can create the gif
library(magick)
frames=c()
images=list.files(pattern="jpeg")
for (i in length(images):1)
{
  x=image_read(images[i])
  x=image_scale(x, "300")
  c(x, frames) -> frames
}
animation=image_animate(frames, fps = 2)
image_write(animation, "Frankenstein.gif")

RProtoBuf 0.4.9

Tue, 03/07/2017 - 02:17

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

RProtoBuf provides R bindings for the Google Protocol Buffers ("Protobuf") data encoding and serialization library used and released by Google, and deployed as a language and operating-system agnostic protocol by numerous projects.

The RProtoBuf 0.4.9 release is the fourth and final update this weekend following the request by CRAN to not use package= in .Call() when PACKAGE= is really called for.

Some of the code in RProtoBuf had this bug; some other entry points had neither argument (!!). With the ongoing drive to establish proper registration of entry points, a few more issues were coming up, all of which are now addressed. And we had some other unreleased minor cleanup, so this made for a somewhat longer (compared to the other updates this weekend) NEWS list:

Changes in RProtoBuf version 0.4.9 (2017-03-06)
  • A new file init.c was added with calls to R_registerRoutines() and R_useDynamicSymbols()

  • Symbol registration is enabled in useDynLib

  • Several missing PACKAGE= arguments were added to the corresponding .Call invocations

  • Two (internal) C++ functions were renamed with suffix _cpp to disambiguate them from R functions with the same name

  • All of above were part of #26

  • Some editing corrections were made to the introductory vignette (David Kretch in #25)

  • The ‘configure.ac’ file was updated, and renamed from the older convention ‘configure.in’, along with ‘src/Makevars’. (PR #24 fixing #23)

CRANberries also provides a diff to the previous release. The RProtoBuf page has an older package vignette, a ‘quick’ overview vignette, a unit test summary vignette, and the pre-print for the JSS paper. Questions, comments etc should go to the GitHub issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


RVowpalWabbit 0.0.9

Tue, 03/07/2017 - 02:06

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The RVowpalWabbit package update is the third of four upgrades requested by CRAN, following RcppSMC 0.1.5 and RcppGSL 0.3.2.

This package being somewhat raw, the change was simple and just meant converting the single entry point to using Rcpp Attributes — which addressed the original issue in passing.

No new code or features were added.

We should mention that there is parallel work ongoing in a higher-level package interfacing the vw binary — rvw — as well as a plan to redo this package via the external libraries. If that sounds interesting to you, please get in touch.

More information is on the RVowpalWabbit page. Issues and bugreports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Hyper-parameter Tuning with Grid Search for Deep Learning

Tue, 03/07/2017 - 01:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

Last week I showed how to build a deep neural network with h2o and rsparkling. As we could see there, it is not trivial to optimize the hyper-parameters for modeling. Hyper-parameter tuning with grid search allows us to test different combinations of hyper-parameters and find one with improved accuracy.

Keep in mind though, that hyper-parameter tuning can only improve the model so much without overfitting. If you can’t achieve sufficient accuracy, the input features might simply not be adequate for the predictions you are trying to model. It might be necessary to go back to the original features and try e.g. feature engineering methods.

Preparing Spark instance and plotting theme

Check out last week’s post for details on how I prepared the data.

library(rsparkling)
options(rsparkling.sparklingwater.version = "2.0.3")

library(h2o)
library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.0.0")

library(ggplot2)
library(ggrepel)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 12),
      axis.title = element_text(size = 14),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "aliceblue"),
      strip.background = element_rect(fill = "darkgrey", color = "grey", size = 1),
      strip.text = element_text(face = "bold", size = 12, color = "white"),
      legend.position = "right",
      legend.justification = "top",
      panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
    )
}

arrhythmia_sc <- copy_to(sc, arrhythmia_subset)
arrhythmia_hf <- as_h2o_frame(sc, arrhythmia_sc, strict_version_check = FALSE)

arrhythmia_hf[, 2] <- h2o.asfactor(arrhythmia_hf[, 2])
arrhythmia_hf[, 3] <- h2o.asfactor(arrhythmia_hf[, 3])

splits <- h2o.splitFrame(arrhythmia_hf, ratios = c(0.7, 0.15), seed = 1)

train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

response <- "diagnosis"
weights <- "weights"
features <- setdiff(colnames(train), c(response, weights, "class"))

Grid Search

We can use the h2o.grid() function to perform a Random Grid Search (RGS). We could also test all possible combinations of parameters with Cartesian Grid or exhaustive search, but RGS is much faster when we have a large number of possible combinations and usually finds sufficiently accurate models.

For RGS, we first define a set of hyper-parameters and search criteria to fine-tune our models. Because there are many hyper-parameters, each with a range of possible values, we want to find an (ideally) optimal combination to maximize our model’s accuracy.
We can also specify how long we want to run the grid search for. Based on the results of each model tested in the grid, we can choose the one with the highest accuracy or the best performance for the question at hand.

Activation Functions
  • Rectifier: is the default activation function. It is the fastest and most versatile option. It can lead to instability though and tends to be lower in accuracy.
  • Tanh: The hyperbolic tangent is a scaled and shifted variant of the sigmoid activation function. It can take on values from -1 to 1 and centers around 0. Tanh needs more computational power than e.g. the Rectifier function.
  • Maxout: is an activation function that is the max of the inputs. It is computationally quite demanding but can produce high accuracy models.

  • ...WithDropout: When we specify with dropout, a random subset of the network is trained and the weights of all sub-networks are averaged. It works together with the parameter hidden_dropout_ratios, which controls the amount of layer neurons that are randomly dropped for each hidden layer. Hidden dropout ratios are useful for preventing overfitting on learned features.
Hidden layers
  • are the most important hyper-parameter to set for deep neural networks, as they specify how many hidden layers and how many nodes per hidden layer the model should learn
L1 and L2 penalties
  • L1: lets only strong weights survive
  • L2: prevents any single weight from getting too big.
Rho and Epsilon
  • rho: similar to prior weight updates
  • epsilon: prevents getting stuck in local optima
hyper_params <- list(
  activation = c("Rectifier", "Maxout", "Tanh", "RectifierWithDropout", "MaxoutWithDropout", "TanhWithDropout"),
  hidden = list(c(5, 5, 5, 5, 5), c(10, 10, 10, 10), c(50, 50, 50), c(100, 100, 100)),
  epochs = c(50, 100, 200),
  l1 = c(0, 0.00001, 0.0001),
  l2 = c(0, 0.00001, 0.0001),
  rate = c(0, 01, 0.005, 0.001),
  rate_annealing = c(1e-8, 1e-7, 1e-6),
  rho = c(0.9, 0.95, 0.99, 0.999),
  epsilon = c(1e-10, 1e-8, 1e-6, 1e-4),
  momentum_start = c(0, 0.5),
  momentum_stable = c(0.99, 0.5, 0),
  input_dropout_ratio = c(0, 0.1, 0.2),
  max_w2 = c(10, 100, 1000, 3.4028235e+38)
)

Early stopping criteria
  • stopping_metric: the metric that we want to use as the stopping criterion
  • stopping_tolerance and stopping_rounds: training stops when the stopping metric no longer improves by at least the stopping tolerance proportion (e.g. by 0.05, i.e. 5%) for the number of consecutive scoring rounds defined by stopping rounds.
search_criteria <- list(strategy = "RandomDiscrete",
                        max_models = 100,
                        max_runtime_secs = 900,
                        stopping_tolerance = 0.001,
                        stopping_rounds = 15,
                        seed = 42)
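
The list above relies on H2O's default stopping metric. A minimal sketch of how the metric could be named explicitly, for instance stopping when misclassification no longer improves (search_criteria_expl is just an illustrative name):

search_criteria_expl <- list(strategy = "RandomDiscrete",
                             max_models = 100,
                             max_runtime_secs = 900,
                             stopping_metric = "misclassification",  # explicit stopping metric
                             stopping_tolerance = 0.001,
                             stopping_rounds = 15,
                             seed = 42)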

Now we can train models with hyper-parameter combinations drawn from our grid, subject to the search and stopping criteria we specified.

dl_grid <- h2o.grid(algorithm = "deeplearning",
                    x = features,
                    y = response,
                    weights_column = weights,
                    grid_id = "dl_grid",
                    training_frame = train,
                    validation_frame = valid,
                    nfolds = 25,
                    fold_assignment = "Stratified",
                    hyper_params = hyper_params,
                    search_criteria = search_criteria,
                    seed = 42
)

We now want to extract the best model from the grid model list. What makes a model the best depends on the question you want to address with it: in some cases, the model with highest AUC is the most suitable, or the one with the lowest mean squared error, etc. See last week’s post again for a more detailed discussion of performance metrics.

For demonstration purposes, I am choosing the best models from a range of possible quality criteria. We first use the h2o.getGrid() function to sort all models by the quality metric we choose (depending on the metric, you want it ordered by descending or ascending values). We can then take the first model in that list and work with it further. This model's hyper-parameters can be found with best_model@allparameters. You can now work with your best model as with any regular model in h2o (for an example see last week's post).

# performance metrics where smaller is better -> order with decreasing = FALSE
sort_options_1 <- c("mean_per_class_error", "mse", "err")

for (sort_by_1 in sort_options_1) {
  grid <- h2o.getGrid("dl_grid", sort_by = sort_by_1, decreasing = FALSE)
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  assign(paste0("best_model_", sort_by_1), best_model)
}

# performance metrics where bigger is better -> order with decreasing = TRUE
sort_options_2 <- c("auc", "precision", "accuracy", "recall", "specificity")

for (sort_by_2 in sort_options_2) {
  grid <- h2o.getGrid("dl_grid", sort_by = sort_by_2, decreasing = TRUE)
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  assign(paste0("best_model_", sort_by_2), best_model)
}
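
For example, to inspect the hyper-parameters of the model that came out on top for AUC:

best_model_auc@allparameters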

Let’s plot the mean per class error for each best model:

library(tibble)

sort_options <- c("mean_per_class_error", "mse", "err",
                  "auc", "precision", "accuracy", "recall", "specificity")

for (sort_by in sort_options) {
  best_model <- get(paste0("best_model_", sort_by))
  errors <- h2o.mean_per_class_error(best_model, train = TRUE, valid = TRUE, xval = TRUE)
  errors_df <- data.frame(model_id = best_model@model_id, sort = sort_by, errors) %>%
    rownames_to_column(var = "rowname")

  if (sort_by == "mean_per_class_error") {
    errors_df_comb <- errors_df
  } else {
    errors_df_comb <- rbind(errors_df_comb, errors_df)
  }
}

order <- subset(errors_df_comb, rowname == "xval") %>%
  arrange(errors)

errors_df_comb %>%
  mutate(sort = factor(sort, levels = order$sort)) %>%
  ggplot(aes(x = sort, y = errors, fill = model_id)) +
    facet_grid(rowname ~ ., scales = "free") +
    geom_bar(stat = "identity", alpha = 0.8) +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
          plot.margin = unit(c(0.5, 0, 0, 1), "cm")) +
    labs(x = "")

Model performance

The ultimate performance test for our model will be its prediction accuracy on the test set it hasn't seen before. Here, I will compare the AUC and mean squared error for each best model from before. You could of course look at any other quality metric that is most appropriate for your model.

for (sort_by in sort_options) {
  best_model <- get(paste0("best_model_", sort_by))
  mse_auc_test <- data.frame(model_id = best_model@model_id,
                             sort = sort_by,
                             mse = h2o.mse(h2o.performance(best_model, test)),
                             auc = h2o.auc(h2o.performance(best_model, test)))

  if (sort_by == "mean_per_class_error") {
    mse_auc_test_comb <- mse_auc_test
  } else {
    mse_auc_test_comb <- rbind(mse_auc_test_comb, mse_auc_test)
  }
}

library(tidyr)

mse_auc_test_comb %>%
  gather(x, y, mse:auc) %>%
  ggplot(aes(x = sort, y = y, fill = model_id)) +
    facet_grid(x ~ ., scales = "free") +
    geom_bar(stat = "identity", alpha = 0.8, position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
          plot.margin = unit(c(0.5, 0, 0, 1.5), "cm")) +
    labs(x = "", y = "value", fill = "")

I will then create a dataframe with predictions for each test sample with all best models. As in last week’s post, I want to compare the default predictions with stringent predictions.

for (sort_by in sort_options) {
  best_model <- get(paste0("best_model_", sort_by))
  finalRf_predictions <- data.frame(model_id = rep(best_model@model_id, nrow(test)),
                                    sort = rep(sort_by, nrow(test)),
                                    class = as.vector(test$class),
                                    actual = as.vector(test$diagnosis),
                                    as.data.frame(h2o.predict(object = best_model, newdata = test)))

  finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict,
                                         "yes", "no")

  finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$arrhythmia > 0.8, "arrhythmia",
                                                  ifelse(finalRf_predictions$healthy > 0.8, "healthy", "uncertain"))
  finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes",
                                                   ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no"))

  if (sort_by == "mean_per_class_error") {
    finalRf_predictions_comb <- finalRf_predictions
  } else {
    finalRf_predictions_comb <- rbind(finalRf_predictions_comb, finalRf_predictions)
  }
}

To get a better overview, I am going to plot the predictions (default and stringent):

finalRf_predictions_comb %>%
  ggplot(aes(x = actual, fill = accurate)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    facet_wrap(~ sort, ncol = 4) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Default predictions")

finalRf_predictions_comb %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    facet_wrap(~ sort, ncol = 4) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Stringent predictions")

With the predictions made by the different models, we can see where each model performs best. This obviously corresponds to the quality metric we chose to define the best model. Stringent prediction thresholds reduced the number of false predictions, but of course they also reduced the total number of predictions made.

We can now decide which model is most relevant. Let's say we want this model to give recommendations for further validating a diagnosis of arrhythmia. In that case, we might want to detect as many arrhythmia cases as possible, while accepting that some healthy subjects will be flagged erroneously, because those patients would simply be referred for further testing. Here, this means that we want to minimize the number of wrongly predicted arrhythmia cases, so we would prefer the mse, auc and specificity model (which is the same model, chosen by all three quality metrics). The worst model for this purpose would be the recall model, which also did not make any predictions that passed the stringency threshold. The recall model would have been best if we wanted to be confident that healthy people get flagged correctly. However, in a real-life setting, it would be much worse if a sick patient got sent home because he was wrongly diagnosed as healthy than if a healthy person got referred for further testing for possible arrhythmia.
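
A quick way to put numbers on this trade-off is to tally, per best model, how many arrhythmia cases were missed and how many healthy subjects were flagged. A minimal sketch using the prediction data frame built above (the class labels "arrhythmia" and "healthy" are taken from the stringent-prediction code):

# counts per best model: missed arrhythmia cases (false negatives) vs
# healthy subjects flagged as arrhythmia (false positives)
finalRf_predictions_comb %>%
  group_by(sort) %>%
  summarise(missed_arrhythmia = sum(actual == "arrhythmia" & predict == "healthy"),
            false_alarms = sum(actual == "healthy" & predict == "arrhythmia"))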

Other machine learning topics I have covered include

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] tidyr_0.6.1         tibble_1.2          ggrepel_0.6.5
## [4] ggplot2_2.2.1.9000  sparklyr_0.5.3-9002 dplyr_0.5.0
## [7] h2o_3.10.3.6        rsparkling_0.1.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.9        RColorBrewer_1.1-2 plyr_1.8.4
## [4] bitops_1.0-6       base64enc_0.1-3    tools_3.3.2
## [7] digest_0.6.12      jsonlite_1.2       evaluate_0.10
## [10] gtable_0.2.0       shiny_1.0.0        DBI_0.5-1
## [13] rstudioapi_0.6     yaml_2.1.14        parallel_3.3.2
## [16] withr_1.0.2        httr_1.2.1         stringr_1.2.0
## [19] knitr_1.15.1       rappdirs_0.3.1     rprojroot_1.2
## [22] grid_3.3.2         R6_2.2.0           rmarkdown_1.3
## [25] reshape2_1.4.2     magrittr_1.5       backports_1.0.5
## [28] scales_0.4.1       htmltools_0.3.5    assertthat_0.1
## [31] mime_0.5           xtable_1.8-2       colorspace_1.3-2
## [34] httpuv_1.3.3       labeling_0.3       config_0.2
## [37] stringi_1.1.2      RCurl_1.95-4.8     lazyeval_0.2.0
## [40] munsell_0.4.3

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Create Air Travel Route Maps in ggplot: A Visual Travel Diary

Mon, 03/06/2017 - 21:24

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

I have been lucky to fly to a few countries around the world. Like any other bored traveller, I thumb through the airline magazines and look at the air travel route maps. These maps are beautifully stylised depictions of the world with gently curved lines between the many destinations. I always wanted such a map for my own travel adventures.

Create Air Travel Route Maps using ggplot2

The first step was to create a list of all the places I have flown between at least once. Paging through my travel photos and diaries, I managed to create a pretty complete list. The structure of this document is simply a list of all routes (From, To) and every flight only gets counted once. The next step finds the spatial coordinates for each airport by searching Google Maps using the geocode function from the ggmap package. In some instances, I had to add the country name to avoid confusion between places.

# Read flight list
flights <- read.csv("flights.csv", stringsAsFactors = FALSE)

# Lookup coordinates
library(ggmap)
airports <- unique(c(flights$From, flights$To))
coords <- geocode(airports)
airports <- data.frame(airport = airports, coords)

We now have a data frame of airports with their coordinates and can create air travel route maps. The data frames are merged so that we can draw each route with the curve geom. The borders function of ggplot2 creates the map data, and the ggrepel package helps to prevent overplotting of the text labels.

# Add coordinates to flight list
flights <- merge(flights, airports, by.x = "To", by.y = "airport")
flights <- merge(flights, airports, by.x = "From", by.y = "airport")

# Plot flight routes
library(ggplot2)
library(ggrepel)

worldmap <- borders("world", colour = "#efede1", fill = "#efede1") # create a layer of borders

ggplot() + worldmap +
  geom_curve(data = flights,
             aes(x = lon.x, y = lat.x, xend = lon.y, yend = lat.y),
             col = "#b29e7d", size = 1, curvature = .2) +
  geom_point(data = airports, aes(x = lon, y = lat), col = "#970027") +
  geom_text_repel(data = airports, aes(x = lon, y = lat, label = airport),
                  col = "black", size = 2, segment.color = NA) +
  theme(panel.background = element_rect(fill = "white"),
        axis.line = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())

I also tried to use the ggmap package to display the maps with a satellite image background. This did not work because the curve geom struggles with the map projection methods used in ggmap. Another problem is that the flight from Auckland to Los Angeles is drawn the wrong way. I hope no flat-earthers will see this map because they might use it as proof that the world is flat.
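
One possible fix for the Auckland to Los Angeles problem, sketched here under the assumption that the geosphere package is available and with approximate airport coordinates, is to compute great-circle points that break at the dateline and draw them with geom_path() instead of geom_curve():

library(geosphere)

akl <- c(174.79, -36.85)   # Auckland (lon, lat), approximate
lax <- c(-118.41, 33.94)   # Los Angeles (lon, lat), approximate

# this route crosses the dateline, so a list of two path segments is returned
gc_segments <- gcIntermediate(akl, lax, n = 100, addStartEnd = TRUE,
                              breakAtDateLine = TRUE)

gc_df <- do.call(rbind, lapply(seq_along(gc_segments), function(i)
  data.frame(gc_segments[[i]], segment = i)))

ggplot() + worldmap +
  geom_path(data = gc_df, aes(x = lon, y = lat, group = segment),
            col = "#b29e7d", size = 1)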

Alternative Visualisation

Another way of visualising this type of data is a network diagram, provided by the igraph package. This visualisation shows the logical connections between the locations rather than their spatial positions.

# Network visualisation
library(igraph)

edgelist <- as.matrix(flights[c("From", "To")])
g <- graph_from_edgelist(edgelist, directed = TRUE)
g <- simplify(g)

par(mar = rep(0, 4))
plot.igraph(g,
            edge.arrow.size = 0,
            edge.color = "black",
            edge.curved = TRUE,
            edge.width = 2,
            vertex.size = 3,
            vertex.color = NA,
            vertex.frame.color = NA,
            vertex.label = V(g)$name,
            vertex.label.cex = 3,
            layout = layout.fruchterman.reingold)

The post Create Air Travel Route Maps in ggplot: A Visual Travel Diary appeared first on The Devil is in the Data.

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R 3.3.3 now available

Mon, 03/06/2017 - 19:10

The R core group announced today the release of R 3.3.3 (code-name: "Another Canoe"). As the wrap-up release of the R 3.3 series, this update mainly contains minor bug-fixes. (Bigger changes are planned for R 3.4.0, expected in mid-April.) Binaries for the Windows version are already up on the CRAN master site, and binaries for all platforms will appear on your local CRAN mirror within the next couple of days. 

R 3.3.3 fixes an issue related to attempting to use download.file on sites that automatically redirect from http to https: now, R will re-attempt to download the secure link rather than failing. Other fixes include support for long vectors in the vapply function, the ability to use pmax (and pmin) on ordered factors, improved accuracy for qbeta for some extreme cases, corrected spelling for "Cincinnati" in the precip data set, and a few other minor issues.
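
As a minimal illustration of the pmax()/pmin() change (constructed here, not taken from the release notes):

a <- factor(c("low", "high", "medium"),
            levels = c("low", "medium", "high"), ordered = TRUE)
b <- factor(c("medium", "low", "high"),
            levels = c("low", "medium", "high"), ordered = TRUE)
# elementwise maxima on ordered factors: medium, high, high (works in R >= 3.3.3)
pmax(a, b)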

R 3.3.3 should be completely compatible with your existing R 3.3.x scripts and data sets. Unless some major unforeseen bug rears its head between now and the release of R 3.4.0, this is the final release of the R 3.3 series, so it's definitely a good idea to upgrade when you can.

As always, thanks go to the R Core Group for the continued and welcome improvements to R!

Step-Debugging magrittr/dplyr Pipelines in R with wrapr and replyr

Mon, 03/06/2017 - 18:59

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

In this screencast we demonstrate how to easily and effectively step-debug magrittr/dplyr pipelines in R using wrapr and replyr.



Some of the big issues in trying to debug magrittr/dplyr pipelines include:

  • Pipelines being large expressions that are hard to line-step into.
  • Visibility of intermediate results.
  • Localizing operations (in time and code position) in the presence of lazy evaluation semantics.

In this screencast we demonstrate how to mitigate the impact of all of these issues and quickly and effectively debug pipelined code.
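
The screencast and its example code (linked below) show the wrapr/replyr approach; as a rough illustration of the underlying idea only, here is a plain dplyr sketch that makes intermediate results visible by inserting an inspection step into the pipeline:

library(dplyr)

# returns its input unchanged, but reports (and can pause at) each stage
inspect_step <- function(x, tag = "") {
  message("step: ", tag, " - ", nrow(x), " rows, ", ncol(x), " cols")
  # browser()   # uncomment to stop here interactively
  x
}

mtcars %>%
  inspect_step("raw") %>%
  filter(cyl > 4) %>%
  inspect_step("after filter") %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  inspect_step("after summarise")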

The example code is here.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Mapping 5,000 Years of City Growth

Mon, 03/06/2017 - 17:41

(This article was first published on R – Spatial.ly, and kindly contributed to R-bloggers)

I recently stumbled upon a great dataset. It’s the first to provide comprehensive data for world city sizes as far back as 3700BC. The authors (Meredith Reba, Femke Reitsma & Karen Seto) write:

How were cities distributed globally in the past? How many people lived in these cities? How did cities influence their local and regional environments? In order to understand the current era of urbanization, we must understand long-term historical urbanization trends and patterns. However, to date there is no comprehensive record of spatially explicit, historic, city-level population data at the global scale. Here, we developed the first spatially explicit dataset of urban settlements from 3700 BC to AD 2000, by digitizing, transcribing, and geocoding historical, archaeological, and census-based urban population data previously published in tabular form by Chandler and Modelski.

These kinds of data are crying out to be mapped so that’s what I did…

Creating the Maps

I was keen to see how easy it would be to produce an animated map from these data with R and the ggplot2 package. It turned out to be a slightly bigger job than I thought – partly because the cleaned data doesn't have population estimates updating for every city in every year, so there's a bit of a messy loop at the start of the code to get that bit working, and partly because there are LOADS of different parameters to tweak in ggplot to get the maps looking the way I like. The script below will generate a series of PNGs that can be strung together into an animation using a graphics package. I'm a big fan of what ggplot2 can do now and I hope, at the very least, the chunks of ggplot2 code below will provide a useful template for similar mapping projects.

R Code

Load the libraries we need

library("ggplot2") library("ggthemes") library("rworldmap") library("classInt") library("gridExtra") library("grid") library("cowplot")

Load the data – this is the processed file from http://www.nature.com/articles/sdata201634. I offer no guarantees about whether I ran the script correctly etc.! I'd encourage you to take the data directly from the authors.

# City data
cities <- read.csv("alldata.csv")

# for some reason the coords were loaded as factors so I've created two new numeric fields here
cities$X <- as.numeric(as.vector(cities$Longitude))
cities$Y <- as.numeric(as.vector(cities$Latitude))

# World map base from rworldmap
world <- fortify(getMap(resolution = "low"))

Take a look at the data

head(cities)

Create the base map

base <- ggplot() +
  geom_map(data = world, map = world,
           aes(x = long, y = lat, map_id = id, group = group),
           fill = "#CCCCCC") +
  theme_map()

Now plot the cities on top, with circles scaled by city size. Here I am using the Mollweide projection. This script loops through the data and checks for updates from one year to the next. It then plots the cities as proportional circles for each year and saves out an image for every 100 years of data. If you are just interested in the maps themselves and have your own point data, then most of the below can be ignored.

Set the breaks for the size of the circles

size_breaks <- c(10000, 50000, 100000, 500000, 1000000, 5000000, 10000000)

# Here I'm looping through the data each year to select the years required
# and comparing/updating the values for each city.
count <- 1
for (i in unique(cities$year)) {
  if (count == 1) {
    Data <- cities[which(cities$year == i), ]
  } else {
    New_Data <- cities[which(cities$year == i), ]
    # replace old rows
    Data2 <- merge(Data, New_Data, by.x = "X", by.y = "X", all = T)
    New_Vals <- Data2[!is.na(Data2$pop.y), c("City.y", "X", "Y.y", "pop.y")]
    names(New_Vals) <- c("City", "X", "Y", "pop")
    Old_Vals <- Data2[is.na(Data2$pop.y), c("City.x", "X", "Y.x", "pop.x")]
    names(Old_Vals) <- c("City", "X", "Y", "pop")
    Data <- rbind(New_Vals, Old_Vals)
    Data <- aggregate(Data$pop, by = list(Data$City, Data$X, Data$Y), FUN = "mean")
    names(Data) <- c("City", "X", "Y", "pop")
  }

  # This statement sorts out the BC/AD in the title that we'll use for the plot.
  if (i < 0) {
    title <- paste0("Cities in the Year ", sqrt(i^2), " BC")
  } else {
    title <- paste0("Cities in the Year ", i, " AD")
  }

  # There's lots going on here with the myriad of different ggplot2 parameters.
  # I've broken each chunk up line by line to help make it a little clearer.
  Map <- base +
    geom_point(data = Data, aes(x = X, y = Y, size = pop),
               colour = "#9933CC", alpha = 0.3, pch = 20) +
    scale_size(breaks = size_breaks, range = c(2, 14),
               limits = c(min(cities$pop), max(cities$pop)),
               labels = c("10K", "50k", "100K", "500k", "1M", "5M", "10M+")) +
    coord_map("mollweide", ylim = c(-55, 80), xlim = c(-180, 180)) +
    theme(legend.position = "bottom",
          legend.direction = "horizontal",
          legend.justification = "center",
          legend.title = element_blank(),
          plot.title = element_text(hjust = 0.5, face = 'bold', family = "Helvetica")) +
    ggtitle(title) +
    guides(size = guide_legend(nrow = 1))

  # I only want to plot the map every 100 years
  if (i %% 100 == 0) {
    png(paste0("outputs/", i, "_moll.png"), width = 30, height = 20, units = "cm", res = 150)
    print(Map)
    dev.off()
  }
  count <- count + 1
}

Just to make things a little more interesting, I wanted to try a plot of two hemispheres (covering most, but not all, of the globe). I've also added the key in between them. This depends on the plot_grid() functionality, as well as a few extra steps that we didn't need above.

count <- 1
for (i in unique(cities$year)) {
  if (count == 1) {
    Data <- cities[which(cities$year == i), ]
  } else {
    New_Data <- cities[which(cities$year == i), ]
    # replace old rows
    Data2 <- merge(Data, New_Data, by.x = "X", by.y = "X", all = T)
    New_Vals <- Data2[!is.na(Data2$pop.y), c("City.y", "X", "Y.y", "pop.y")]
    names(New_Vals) <- c("City", "X", "Y", "pop")
    Old_Vals <- Data2[is.na(Data2$pop.y), c("City.x", "X", "Y.x", "pop.x")]
    names(Old_Vals) <- c("City", "X", "Y", "pop")
    Data <- rbind(New_Vals, Old_Vals)
    Data <- aggregate(Data$pop, by = list(Data$City, Data$X, Data$Y), FUN = "mean")
    names(Data) <- c("City", "X", "Y", "pop")
  }

  # Create a map for each hemisphere - to help with the positioning I needed to use
  # the `plot.margin` options in the `theme()` settings.
  map1 <- base +
    geom_point(data = Data, aes(x = X, y = Y, size = pop),
               colour = "#9933CC", alpha = 0.3, pch = 20) +
    scale_size(breaks = size_breaks, range = c(2, 16),
               limits = c(min(cities$pop), max(cities$pop))) +
    coord_map("ortho", orientation = c(10, 60, 0), ylim = c(-55, 70)) +
    theme(legend.position = "none", plot.margin = unit(c(0.5, 0.5, -0.1, 0.5), "cm"))

  map2 <- base +
    geom_point(data = Data, aes(x = X, y = Y, size = pop),
               colour = "#9933CC", alpha = 0.3, pch = 20) +
    scale_size(breaks = size_breaks, range = c(2, 16),
               limits = c(min(cities$pop), max(cities$pop))) +
    coord_map("ortho", orientation = c(10, -90, 0), ylim = c(-55, 70)) +
    theme(legend.position = "none", plot.margin = unit(c(0.5, 0.5, 0, 0.5), "cm"))

  # create dummy plot that we will use the legend from - basically I just want to create
  # something that has a legend (they've been switched off for the maps above)
  dummy <- ggplot() +
    geom_point(aes(1:7, 1, size = size_breaks), colour = "#9933CC", alpha = 0.3, pch = 20) +
    scale_size(breaks = size_breaks, range = c(2, 16),
               labels = c("10K", "50k", "100K", "500k", "1M", "5M", "10M+"),
               limits = c(min(cities$pop), max(cities$pop))) +
    theme(legend.position = "bottom",
          legend.direction = "vertical",
          legend.title = element_blank()) +
    guides(size = guide_legend(nrow = 1, byrow = TRUE))

  Legend <- get_legend(dummy)

  # This statement sorts out the BC/AD in the title.
  if (i < 0) {
    title <- ggdraw() + draw_label(paste0("Cities in the Year ", sqrt(i^2), " BC"),
                                   fontface = 'bold', fontfamily = "Helvetica")
  } else {
    title <- ggdraw() + draw_label(paste0("Cities in the Year ", i, " AD"),
                                   fontface = 'bold', fontfamily = "Helvetica")
  }

  # I only want to plot the map every 100 years
  if (i %% 100 == 0) {
    png(paste0("outputs/", i, ".png"), width = 20, height = 30, units = "cm", res = 150)
    # This is where the elements of the plot are combined
    print(plot_grid(title, map1, Legend, map2, ncol = 1, rel_heights = c(.1, 1, .1, 1)))
    dev.off()
  }
  count <- count + 1
}

 

To leave a comment for the author, please follow the link and comment on their blog: R – Spatial.ly. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Descriptive summary: Proportions of values in a vector #rstats

Mon, 03/06/2017 - 16:33

(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

When describing a sample, researchers in my field often show proportions of specific characteristics as description. For instance, proportion of female persons, proportion of persons with higher or lower income etc. Since it happens often that I like to know these characteristics when exploring data, I decided to write a function, prop(), which is part of my sjstats-package – a package dedicated to summary-functions, mostly for fit- or association-measures of regression models or descriptive statistics.

prop() is designed following a similar fashion to most functions of my sjmisc-package: first, the data; then a user-defined number of logical comparisons that define the proportions. A single comparison argument as input returns a vector; multiple comparisons return a tibble (where the first column contains the comparison and the second the related proportion).

An example from the mtcars dataset:

library(sjstats)
data(mtcars)

# proportions of observations in mpg that are greater than 25
prop(mtcars, mpg > 25)
#> [1] 0.1875

prop(mtcars, mpg > 25, disp > 200, gear == 4)
#> # A tibble: 3 × 2
#>   condition   prop
#>       <chr>  <dbl>
#> 1    mpg>25 0.1875
#> 2  disp>200 0.5000
#> 3   gear==4 0.3750

The function also works on grouped data frames and with labelled data. In the following example, we group a dataset of family carers by their gender and education, and then get the proportions of observations where the care-receivers are at least moderately dependent, and where they are male. To get an impression of what the raw variables look like, we first compute simple frequency tables with frq().

library(sjmisc) # for frq()-function
data(efc)

frq(efc, e42dep)
#> # elder's dependency
#>
#>   val                label frq raw.prc valid.prc cum.prc
#>     1          independent  66    7.27      7.33    7.33
#>     2   slightly dependent 225   24.78     24.97   32.30
#>     3 moderately dependent 306   33.70     33.96   66.26
#>     4   severely dependent 304   33.48     33.74  100.00
#>     5                   NA   7    0.77        NA      NA

frq(efc, e16sex)
#> # elder's gender
#>
#>   val  label frq raw.prc valid.prc cum.prc
#>     1   male 296   32.60     32.85   32.85
#>     2 female 605   66.63     67.15  100.00
#>     3     NA   7    0.77        NA      NA

efc %>%
  select(e42dep, c161sex, c172code, e16sex) %>%
  group_by(c161sex, c172code) %>%
  prop(e42dep > 2, e16sex == 1)
#> # A tibble: 6 × 4
#>   `carer's gender` `carer's level of education` `e42dep>2` `e16sex==1`
#>              <chr>                        <chr>      <dbl>       <dbl>
#> 1             Male           low level of education   0.6829      0.3659
#> 2             Male  intermediate level of education   0.6590      0.3155
#> 3             Male          high level of education   0.7872      0.2766
#> 4           Female           low level of education   0.7101      0.4638
#> 5           Female  intermediate level of education   0.5929      0.2832
#> 6           Female          high level of education   0.6881      0.2752

So, within the group of male family carers with low level of education, 68.29% of care-receivers are moderately or severely dependent, and 36.59% of care-receivers are male. Within female family carers with high level of education, 68.81% of care-receivers are at least moderately dependent and 27.52% are male.

Tagged: R, rstats

To leave a comment for the author, please follow the link and comment on their blog: R – Strenge Jacke!. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
