R-bloggers
R news and tutorials contributed by 750 R bloggers
Updated: 1 hour 20 min ago

Inverse Statistics – and how to create Gain-Loss Asymmetry plots in R

9 hours 58 min ago

(This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)

Asset returns have certain statistical properties, also called stylized facts. Important ones are:

  • Absence of autocorrelation: basically the direction of the return of one day doesn’t tell you anything useful about the direction of the next day.
  • Fat tails: returns are not normal, i.e. there are many more extreme events than there would be if returns were normal.
  • Volatility clustering: basically financial markets exhibit high-volatility and low-volatility regimes.
  • Leverage effect: high-volatility regimes tend to coincide with falling prices and vice versa.

A good introduction and overview can be found in R. Cont: Empirical properties of asset returns: stylized facts and statistical issues.
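These stylized facts are easy to eyeball yourself. Below is a minimal sketch (assuming the quantmod package and the same S&P 500 data that are loaded further down) of how one might check the first three on daily log-returns:

# minimal sketch: eyeballing three stylized facts on daily S&P 500 log-returns
library(quantmod)
getSymbols("^GSPC", from = "1950-01-01")
ret <- as.numeric(na.omit(ROC(Cl(GSPC))))      # daily log-returns

acf(ret)                                       # absence of autocorrelation: spikes beyond lag 0 are tiny
mean((ret - mean(ret))^4) / var(ret)^2 - 3     # fat tails: excess kurtosis far above 0
acf(ret^2)                                     # volatility clustering: squared returns are clearly autocorrelated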

One especially fascinating statistical property is the so-called gain-loss asymmetry: it basically states that upward movements tend to take a lot longer than downward movements, which often come in the form of sudden, hefty crashes. So, an abstract illustration of this property would be a sawtooth pattern:

Source: Wikimedia

The same effect in real life:

suppressWarnings(suppressMessages(library(quantmod)))
suppressWarnings(suppressMessages(getSymbols("^GSPC", from = "1950-01-01")))
## [1] "GSPC"

plot.zoo(GSPC$GSPC.Close, xlim = c(as.Date("2000-01-01"), as.Date("2013-01-01")),
         ylim = c(600, 1700), ylab = "", main = "S&P from 2000 to 2013")

The practical implication for your investment horizon is that your losses often come much faster than your gains (life is just not fair…). To illustrate this, authors often plot the investment horizon distribution, which shows how long you have to wait for a certain target return, negative as well as positive (for some examples see e.g. here, also the source of the following plot):

This is closely related to what statisticians call the first passage time: when is a given threshold passed for the first time? To perform such an analysis you need something called inverse statistics. Normally you would plot the distribution of returns given a fixed time window (= forward statistics). Here we do it the other way around: you fix the return and look for the shortest waiting time needed to obtain at least that return. To achieve that you have to test all possible time windows, which can be quite time-consuming.
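To make the first-passage idea concrete before the full implementation below, here is a toy sketch on simulated (not real) daily log-returns: fix a target return and find the first day on which the cumulative return from the start reaches it.

# toy sketch of a single first-passage computation on simulated log-returns
set.seed(42)
log_ret <- rnorm(1000, mean = 0.0005, sd = 0.01)   # hypothetical daily log-returns
target <- 0.05
wait_gain <- which(cumsum(log_ret) >= target)[1]   # days until a gain of at least 5% (NA if never)
wait_loss <- which(cumsum(log_ret) <= -target)[1]  # days until a loss of at least 5% (NA if never)
c(wait_gain, wait_loss)

The inverse-statistics analysis repeats this computation with every trading day as the starting point and then looks at the distribution of the resulting waiting times.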

Because I wanted to reproduce those plots I tried to find some code somewhere… to no avail. I then contacted some of the authors of the respective papers… no answer. I finally asked a question on Quantitative Finance StackExchange… and got no satisfying answer either. I therefore wrote the code myself and thereby answered my own question:

inv_stat <- function(symbol, name, target = 0.05) {
  p <- coredata(Cl(symbol))
  end <- length(p)
  days_n <- days_p <- integer(end)
  # go through all days and look when the target is first reached from there
  for (d in 1:end) {
    ret <- cumsum(as.numeric(na.omit(ROC(p[d:end]))))
    cond_n <- ret < -target
    cond_p <- ret > target
    suppressWarnings(days_n[d] <- min(which(cond_n)))
    suppressWarnings(days_p[d] <- min(which(cond_p)))
  }
  days_n_norm <- prop.table(as.integer(table(days_n, exclude = "Inf")))
  days_p_norm <- prop.table(as.integer(table(days_p, exclude = "Inf")))
  plot(days_n_norm, log = "x", xlim = c(1, 1000),
       main = paste0(name, " gain-/loss-asymmetry with target ", target),
       xlab = "days", ylab = "density", col = "red")
  points(days_p_norm, col = "blue")
  c(which.max(days_n_norm), which.max(days_p_norm)) # mode of days to obtain (at least) neg. and pos. target return
}

inv_stat(GSPC, name = "S&P 500")

## [1] 10 24

So, here you see that for the S&P 500 since 1950 the mode (peak) of the days to obtain a loss of at least 5% has been 10 days and a gain of the same size 24 days! That is the gain-loss asymmetry in action!

Still two things are missing in the code:

  • Detrending of the time series (see the sketch below).
  • Fitting a probability distribution (the generalized gamma distribution seems to work well).
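Regarding the first point (detrending), a crude sketch of one possible approach, not necessarily what the cited papers do, is to subtract the average daily drift from the log-returns before running the first-passage search:

# one possible (crude) detrending step: remove the long-run drift from the log-returns
ret <- as.numeric(na.omit(ROC(Cl(GSPC))))   # daily log-returns of the S&P 500
ret_detrended <- ret - mean(ret)            # subtract the average daily drift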

If you want to add them or if you have ideas how to improve the code, please let me know in the comments! Thank you and stay tuned!



Use RStudio Server in a Virtual Environment with Docker in Minutes!

15 hours 39 min ago

(This article was first published on r-bloggers – Telethon Kids Institute, and kindly contributed to R-bloggers)

A fundamental aspect of the reproducible research framework is that (statistical) analysis can be reproduced; that is, given a set of instructions (or a script file), another analyst can achieve the exact same results with the same raw data. This idea may seem intuitive, but in practice it can be difficult to achieve in an analytical environment that is always evolving and changing.

For example, Tidyverse users will have recently seen the following warning in R:

Warning message:
`data_frame()` is deprecated, use `tibble()`.

data_frame(a = 1:3)

This is because dplyr has changed, hopefully for the better, and tibble() is now the preferred route. It is conceivable that in a future release the data_frame() function will no longer be a part of dplyr at all.
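The fix is a one-line change; tibble() (from the tibble package, which dplyr re-exports) is the drop-in replacement:

library(tibble)
tibble(a = 1:3)   # preferred replacement for data_frame(a = 1:3)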

So, what does this have to do with reproducible research? Say you want to come back to your analysis in 5 years and re-run your old code. Your future installation of R with the latest Tidyverse may not be backwards compatible with your now-ancient code, and the analysis will crash.

There are a couple of strategies that we could use to deal with updating dependencies. The first, and possibly the easiest, is to use devtools to install older package versions (see here for further details). But what if the version of the package you used is not archived on CRAN? For example, the analysis could have used a package from GitHub that has since changed and the maintainer hasn’t used systematic releases for you to pull from. Thus, this strategy is quite likely to fail.
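For reference, pinning an archived CRAN version with devtools looks like the following sketch (the package and version number here are purely illustrative):

# install a specific archived version of a package from CRAN
# (package and version chosen for illustration only)
devtools::install_version("dplyr", version = "0.7.8",
                          repos = "https://cloud.r-project.org")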

Another solution is to put your entire project inside a static virtual environment that is isolated from the rest of your machine. The benefit of project isolation is that any dependencies that are installed within the environment can be made persistent, but are also separated from any dependency upgrades that might be applied to your host machine or to other projects.

Docker isn’t a new topic for regular R-Bloggers readers, but for those of you that are unfamiliar: Docker is a program that uses virtual containers, which isolate and bundle applications. The containers are defined by a set of instructions detailed in a docker-compose.yml or Dockerfile and can effectively be stacked to call upon the capabilities of base images. Furthermore, one of the fundamental tenets of Docker is portability – if a container will work on your local machine then it will work anywhere. And because base images are versioned, they will work for all time. This is also great for scalability onto servers and distribution to colleagues and collaborators who will see exactly what you have prepared regardless of their host operating system.

Our dockerised RStudio environment.

Our group at the Telethon Kids Institute has prepared source code to launch a dockerised RStudio Server instance via a NGINX reverse proxy enabled for HTTPS connections. Our GitHub repository provides everything you need to quickly establish a new project (just clone the repository) with the features of RStudio Web Server with the added benefit of SSL encryption (this will need some local configuration); even though RStudio Server is password protected, web encryption is still important for us as we routinely deal with individual level health data.

(Important note: even with data security measures in place, our policy is for patient data to be de-identified prior to analysis, as per good clinical practice norms, and, except in exceptional circumstances, we would use our intranet to further restrict access.)

The defining docker-compose.yml that we use can be found at our GitHub site via this link: https://github.com/TelethonKids/rstudio. We used an RStudio base image with the Tidyverse pre-installed on R 3.5.3, which is maintained by rocker at https://hub.docker.com/r/rocker/tidyverse. As updates are made to the base image, the repository will be updated with new releases that provide an opportunity to re-install a specific version of the virtual environment. It has also been set up with persistent volumes for the projects, rstudio, and lib/R directories, so that any changes made to the virtual environment are kept for back-up and further version control.

By combining tools like Docker and Git, we believe we can refine, and make commonplace within our institute and among those we collaborate with, a culture of reproducible research as we conduct world-class research to improve the lives of sick children.



Markov chain Monte Carlo doesn’t “explore the posterior”

Mon, 03/25/2019 - 19:52

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

First some background, then the bad news, and finally the good news.

Spoiler alert: The bad news is that exploring the posterior is intractable; the good news is that we don’t need to explore all of it.

Sampling to characterize the posterior

There’s a misconception among Markov chain Monte Carlo (MCMC) practitioners that the purpose of sampling is to explore the posterior. For example, I’m writing up some reproducible notes on probability theory and statistics through sampling (in pseudocode with R implementations) and have just come to the point where I’ve introduced and implemented Metropolis and want to use it to exemplify convergence monitoring. So I did what any right-thinking student would do and borrowed one of my mentor’s diagrams (which is why this will look familiar if you’ve read the convergence monitoring section of Bayesian Data Analysis 3).

First M steps of an isotropic random-walk Metropolis with proposal scale normal(0, 0.2), targeting a bivariate normal with unit variance and 0.9 correlation. After 50 iterations, we haven’t found the typical set, but after 500 iterations we have. Then after 5000 iterations, everything seems to have mixed nicely through this two-dimensional example.

This two-dimensional traceplot gives the misleading impression that the goal is to make sure each chain has moved through the posterior. This low-dimensional thinking is nothing but a trap in higher dimensions. Don’t fall for it!
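The figure itself is not reproduced here, but a minimal sketch (not the book's or the author's code) of the kind of run its caption describes might look like this:

# isotropic random-walk Metropolis targeting a bivariate normal with
# unit variances and correlation 0.9, proposal sd 0.2 (sketch only)
set.seed(1)
rho <- 0.9
log_target <- function(th) {
  -0.5 * (th[1]^2 - 2 * rho * th[1] * th[2] + th[2]^2) / (1 - rho^2)
}

metropolis <- function(M, scale = 0.2, init = c(-2.5, 2.5)) {
  draws <- matrix(NA_real_, M, 2)
  draws[1, ] <- init
  for (m in 2:M) {
    proposal <- draws[m - 1, ] + rnorm(2, 0, scale)   # isotropic normal jump
    accept <- log(runif(1)) < log_target(proposal) - log_target(draws[m - 1, ])
    draws[m, ] <- if (accept) proposal else draws[m - 1, ]
  }
  draws
}

for (M in c(50, 500, 5000)) {
  plot(metropolis(M), type = "o", pch = 20, cex = 0.3,
       xlim = c(-3, 3), ylim = c(-3, 3),
       xlab = expression(theta[1]), ylab = expression(theta[2]),
       main = paste("First", M, "iterations"))
}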

Bad news from higher dimensions

It’s simply intractable to “cover the posterior” in high dimensions. Consider a 20-dimensional standard normal distribution. There are 20 variables, each of which may be positive or negative, leading to a total of 2^20, or more than a million, orthants (generalizations of quadrants). In 30 dimensions, that’s 2^30, more than a billion. You get the picture: the number of orthants grows exponentially, so we’ll never cover them all explicitly through sampling.

Good news in expectation

Bayesian inference is based on probability, which means integrating over the posterior density. This boils down to computing expectations of functions of parameters conditioned on data. This we can do.

For example, we can construct point estimates that minimize expected square error by using posterior means, which are just expectations conditioned on data, which are in turn integrals, which can be estimated via MCMC,

\hat{\theta} \ = \ \mathbb{E}[\theta \mid y] \ \approx \ \frac{1}{M} \sum_{m=1}^M \theta^{(m)},

where \theta^{(1)}, \ldots, \theta^{(M)} are draws from the posterior p(\theta \mid y).

If we want to calculate predictions, we do so by using sampling to calculate the integral required for the expectation,

p(\tilde{y} \mid y) \ = \ \mathbb{E}\left[ p(\tilde{y} \mid \theta) \mid y \right] \ \approx \ \frac{1}{M} \sum_{m=1}^M p(\tilde{y} \mid \theta^{(m)}).

If we want to calculate event probabilities, it’s just the expectation of an indicator function, which we can calculate through sampling, e.g.,

\mbox{Pr}[\theta_1 > \theta_2] \ = \ \mathbb{E}\left[\mathrm{I}[\theta_1 > \theta_2] \mid y\right] \ \approx \ \frac{1}{M} \sum_{m=1}^M \mathrm{I}[\theta_1^{(m)} > \theta_2^{(m)}].

The good news is that we don’t need to visit the entire posterior to compute these expectations to within a few decimal places of accuracy. Even so, MCMC isn’t magic—those two or three decimal places will be zeroes for tail probabilities.
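As a small numerical illustration of that point, here is a sketch using independent Monte Carlo draws (rather than MCMC) from a hypothetical 20-dimensional standard-normal posterior: the expectations are already accurate with a few thousand draws, even though those draws visit only a tiny fraction of the orthants.

# sketch: expectations converge long before the orthants are covered
set.seed(1)
M <- 5000
D <- 20
draws <- matrix(rnorm(M * D), nrow = M, ncol = D)   # M independent "posterior" draws

mean(draws[, 1])                 # posterior mean of theta_1 (true value 0)
mean(draws[, 1] > draws[, 2])    # Pr[theta_1 > theta_2]     (true value 0.5)

# fraction of the 2^20 (> 1 million) orthants actually visited
visited <- unique(apply(draws > 0, 1, paste, collapse = ""))
length(visited) / 2^D            # well under 1%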



What is the interpretation of the diagonal for a ROC curve

Mon, 03/25/2019 - 19:00

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

Last Friday, we discussed the use of ROC curves to describe the goodness of a classifier. I did say that I would post a brief paragraph on the interpretation of the diagonal. If you look around, some say that it describes the “strategy of randomly guessing a class”, that it is obtained with “a diagnostic test that is no better than chance level”, or even that it is obtained by “making a prediction by tossing of an unbiased coin”.

Let us get back to ROC curves to illustrate those points. Consider a very simple dataset with 10 observations (that is not linearly separable)

x1 = c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
x2 = c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
y = c(1,1,1,1,1,0,0,1,0,0)
df = data.frame(x1=x1, x2=x2, y=as.factor(y))

here we can check that, indeed, it is not separable

plot(x1, x2, col=c("red","blue")[1+y], pch=19)

Consider a logistic regression (the course is on linear models)

reg = glm(y~x1+x2, data=df, family=binomial(link = "logit"))

but any model here can be used… We can use our own function

Y = df$y
S = predict(reg)
roc.curve = function(s, print=FALSE){
  Ps = (S >= s)*1
  FP = sum((Ps==1)*(Y==0))/sum(Y==0)
  TP = sum((Ps==1)*(Y==1))/sum(Y==1)
  if(print==TRUE){
    print(table(Observed=Y, Predicted=Ps))
  }
  vect = c(FP,TP)
  names(vect) = c("FPR","TPR")
  return(vect)
}

or any R package actually

library(ROCR)
perf = performance(prediction(S,Y), "tpr", "fpr")

We can plot the two simultaneously here

plot(performance(prediction(S,Y), "tpr", "fpr"))
V = Vectorize(roc.curve)(seq(-5,5,length=251))
points(V[1,], V[2,])
segments(0,0,1,1, col="light blue")

So our code works just fine, here. Let us consider various strategies that should lead us to the diagonal.

The first one is : everyone has the same probability (say 50%)

S = rep(.5, 10)
plot(performance(prediction(S,Y), "tpr", "fpr"))
V = Vectorize(roc.curve)(seq(0,1,length=251))
points(V[1,], V[2,])

Indeed, we have the diagonal. But to be honest, we have only two points here: (0,0) and (1,1). Claiming that we have a straight line is not very satisfying… Actually, note that we have this situation whatever the probability we choose

S = rep(.2, 10)
plot(performance(prediction(S,Y), "tpr", "fpr"))
V = Vectorize(roc.curve)(seq(0,1,length=251))
points(V[1,], V[2,])

We can try another strategy, like “making a prediction by tossing of an unbiased coin“. This is what we obtain

set.seed(1)
S = sample(0:1, size=10, replace=TRUE)
plot(performance(prediction(S,Y), "tpr", "fpr"))
V = Vectorize(roc.curve)(seq(0,1,length=251))
points(V[1,], V[2,])
segments(0,0,1,1, col="light blue")

We can also try some sort of “random classifier”, where we choose the score randomly, say uniform on the unit interval

set.seed(1)
S = runif(10)
plot(performance(prediction(S,Y), "tpr", "fpr"))
V = Vectorize(roc.curve)(seq(0,1,length=251))
points(V[1,], V[2,])
segments(0,0,1,1, col="light blue")

Let us try to go further on that one. For convenience, let us consider another function to plot the ROC curve

V = Vectorize(roc.curve)(seq(0,1,length=251))
roc_curve = Vectorize(function(x) max(V[2, which(V[1,] <= x)]))

We have the same line as previously

x = seq(0, 1, by=.025)
y = roc_curve(x)
lines(x, y, type="s", col="red")

But now, consider many scoring strategies, all randomly chosen

MY = matrix(NA, 500, length(y))
for(i in 1:500){
  S = runif(10)
  V = Vectorize(roc.curve)(seq(0,1,length=251))
  MY[i,] = roc_curve(x)
}
plot(performance(prediction(S,df$y), "tpr", "fpr"), col="white")
for(i in 1:500){
  lines(x, MY[i,], col=rgb(0,0,1,.3), type="s")
}
lines(c(0,x), c(0,apply(MY,2,mean)), col="red", type="s", lwd=3)
segments(0,0,1,1, col="light blue")

The red line is the average of all random classifiers. It is not a straight line, but we observe oscillations around the diagonal.

Consider a dataset with more observations

myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv", head=TRUE, sep=";")
myocarde$PRONO = (myocarde$PRONO=="SURVIE")*1
reg = glm(PRONO~., data=myocarde, family=binomial(link = "logit"))
Y = myocarde$PRONO
S = predict(reg)
plot(performance(prediction(S,Y), "tpr", "fpr"))
V = Vectorize(roc.curve)(seq(-5,5,length=251))
points(V[1,], V[2,])
segments(0,0,1,1, col="light blue")

Here is a “random classifier” where we draw scores randomly on the unit interval

S = runif(nrow(myocarde))
plot(performance(prediction(S,Y), "tpr", "fpr"))
V = Vectorize(roc.curve)(seq(-5,5,length=251))
points(V[1,], V[2,])
segments(0,0,1,1, col="light blue")

And if we do that 500 times, we obtain, on average

MY = matrix(NA, 500, length(y))
for(i in 1:500){
  S = runif(length(Y))
  V = Vectorize(roc.curve)(seq(0,1,length=251))
  MY[i,] = roc_curve(x)
}
plot(performance(prediction(S,Y), "tpr", "fpr"), col="white")
for(i in 1:500){
  lines(x, MY[i,], col=rgb(0,0,1,.3), type="s")
}
lines(c(0,x), c(0,apply(MY,2,mean)), col="red", type="s", lwd=3)
segments(0,0,1,1, col="light blue")

So, it looks like we might say that the diagonal is what we get, on average, when drawing scores randomly on the unit interval…

I did mention that an interesting visual tool could be related to the use of the Kolmogorov-Smirnov statistic on classifiers. We can plot the two empirical cumulative distribution functions of the scores, given the response

score = data.frame(yobs=Y, ypred=predict(reg, type="response"))
f0 = c(0, sort(score$ypred[score$yobs==0]), 1)
f1 = c(0, sort(score$ypred[score$yobs==1]), 1)
plot(f0, (0:(length(f0)-1))/(length(f0)-1), col="red", type="s", lwd=2, xlim=0:1)
lines(f1, (0:(length(f1)-1))/(length(f1)-1), col="blue", type="s", lwd=2)

we can also look at the distribution of the score, with the histogram (or density estimates)

S = score$ypred
hist(S[Y==0], col=rgb(1,0,0,.2), probability=TRUE, breaks=(0:10)/10, border="white")
hist(S[Y==1], col=rgb(0,0,1,.2), probability=TRUE, breaks=(0:10)/10, border="white", add=TRUE)
lines(density(S[Y==0]), col="red", lwd=2, xlim=c(0,1))
lines(density(S[Y==1]), col="blue", lwd=2)

The underlying idea is the following: we get a “perfect classifier” (a curve passing through the top-left corner) if the supports of the two conditional score distributions do not overlap; otherwise, we must make errors. If the supports overlap slightly, we might misclassify, say, 10% of the cases; with strongly overlapping supports, we get even more misclassification. And we get the diagonal when the two conditional distributions of the scores (given Y = 0 and Y = 1) are identical. Of course, that is only valid when the sample size is very large; otherwise, it is only what we observe on average…



Operator Notation for Data Transforms

Mon, 03/25/2019 - 17:17

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

As of version 1.0.8, cdata implements an operator notation for data transforms.

The idea is simple, yet powerful.

First let’s start with some data.

d <- wrapr::build_frame(
   "id", "measure", "value" |
   1   , "AUC"    , 0.7     |
   1   , "R2"     , 0.4     |
   2   , "AUC"    , 0.8     |
   2   , "R2"     , 0.5     )

knitr::kable(d)

id  measure  value
 1  AUC        0.7
 1  R2         0.4
 2  AUC        0.8
 2  R2         0.5

In the above data we have two measurements each for two individuals (individuals identified by the "id" column). Using cdata‘s new_record_spec() method we can capture a description of this record structure.

library("cdata") record_spec <- new_record_spec( wrapr::build_frame( "measure", "value" | "AUC" , "AUC" | "R2" , "R2" ), recordKeys = "id") print(record_spec) ## $controlTable ## measure value ## 1 AUC AUC ## 2 R2 R2 ## ## $recordKeys ## [1] "id" ## ## $controlTableKeys ## [1] "measure" ## ## attr(,"class") ## [1] "cdata_record_spec"

Once we have this specification we can transform the data using operator notation.

We can collect the record blocks into rows by a "division" (or aggregation/projection) step.

knitr::kable(d)

id  measure  value
 1  AUC        0.7
 1  R2         0.4
 2  AUC        0.8
 2  R2         0.5

d2 <- d %//% record_spec

knitr::kable(d2)

id  AUC   R2
 1  0.7  0.4
 2  0.8  0.5

We can expand record rows into blocks by a "multiplication" (or join) step.

knitr::kable(d2)

id  AUC   R2
 1  0.7  0.4
 2  0.8  0.5

d3 <- d2 %**% record_spec

knitr::kable(d3)

id  measure  value
 1  AUC        0.7
 1  R2         0.4
 2  AUC        0.8
 2  R2         0.5

And that is truly fluid data manipulation.

This article can be found in a vignette here.



Critical Thinking in Data Science

Mon, 03/25/2019 - 17:00

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Debbie Berebichez, a physicist, TV host, and data scientist who is currently the Chief Data Scientist at Metis in New York.

Introducing Debbie Berebichez

Hugo: Hi there, Debbie, and welcome to DataFramed.

Debbie: Hi, Hugo. It’s a pleasure of mine to be here.

Hugo: It is such a pleasure to have you on this show, and I’m really excited to be here today to talk in particular about critical thinking in data science, and what that actually means, and as we know, to get critical about critical thinking and to see what aspects of data science in the space, what ways we are being critical, where we can actually improve aspects of critical thinking, particularly with respect to data thinking in general. But before we get into that, I’d love to know a bit about you. So, could you start off by telling us what you’re known for in the data community?

Debbie: Sure. Thank you. Well, I’m not sure I’m that well known in the data science community, but if I am, I would say it’s because I’m a big promoter of both critical thinking and of getting minorities, such as women, and especially Hispanic women, to enter the fields of STEM, including data science, and I’ve promoted and have started a bunch of initiatives geared towards getting more women to get into science, technology, and engineering. The second reason could be because I cohost a TV show for the Discovery Channel called Outrageous Acts of Science, so I know that a lot of people know me from there.

Hugo: Right, and you also are at Metis, aren’t you, the data science bootcamp?

Debbie: Absolutely. I was gonna say that next. So, I’m the chief data scientist at Metis. Metis is a data science training company that is part of Kaplan, the large education company, and we basically have two modes of teaching data science. One is through bootcamps, which we host in person at four locations, in New York, Chicago, San Francisco, and Seattle, and the second mode of teaching is through corporate training and other products. So, we teach live online intro to data science as a pre-bootcamp course, but we also customize various courses for corporations that need either visualization courses or Python programming or big data techniques and whatnot, and we’ve had quite a bit of success with that.

Hugo: That’s great, and I look forward later on to talking about kind of the relationship to your work at Metis, bootcamps in general, how they can prepare people for a job market where … In the job market, in some respects, coding skills are at the forefront and not critical thinking skills, and how to deal with that trade-off in the education space, which is something we think a lot about.

Debbie: Absolutely.

Hugo: On top of that, though, you mentioned you’re a big promoter of women, in particular, Hispanic women in the space, and correct me if I’m wrong, I may have messed this up completely, but you were the first Mexican woman to get a PhD from Stanford in physics?

Debbie: Wow. You didn’t get it wrong.

Hugo: I got that right?

Debbie: Yes.

Hugo: Fantastic.

Debbie: That’s right, and I think it’s an important statistic not so much to brag about it, but to show that examples like mine, of persevering and working really hard and making your dream come true exist out there, and they’re so important to talk about because they really serve as inspiration for people who sometimes think that their particular minority group or so is not suited for a career in data science or STEM.

Hugo: So, is this how you got interested in data science and computation initially, through a physics PhD?

Debbie: Yeah, yeah. I have kind of a, I guess, not so atypical background for data science. I did my PhD in physics at Stanford, like you said, and I did theoretical physics. I did a lot of computational work the last two years, and so I learned about models and programming and working with data. Then I moved to New York to do two postdocs at Columbia University and at the Courant Institute, part of NYU, after which I decided, like a lot of physicists, to work in Wall Street for a few years as what is sometimes derogatorily called a quant. I was involved in creating risk models, and I did a lot of data analysis, and that’s when I realized that my skills in math and programming had other alternative ways of being applied, not just in physics.

Debbie: So then, after Wall Street, I thought that that was not the field for me because I didn’t really care about just making money, even though making money is nice, but I had bigger aspirations, and I wanted to do data and ethics and help the world and change the world in many ways, and so I’d heard about this new field, sort of new field for me at least, called data science about 10 years ago, and I took a course. It was kind of like a bootcamp. I had the skills, but I didn’t know how to translate them into the different techniques and algorithms that are typical of data science. So, after taking that course, I jumped ship and I started my career in data science.

Hugo: Awesome. That’s a really interesting trajectory, and I just want to step back a bit, and if you don’t want to talk about this, we don’t have to, but I’m just wondering, coming from where you were in Mexico, did you have kind of a social, cultural, and even familial or parental support to go down this path?

Debbie: No, I didn’t, and that is precisely why I care so much about inspiring and helping other young women who, like myself, feel attracted to a career in science or engineering, but who for some reason, whether it be financial or social, feel that they cannot achieve their dreams. From a very young age growing up in Mexico City, I was discouraged from pursuing a career in physics and math because I was a girl, and I was told by friends and parents and teachers in school that I better pick something more feminine, and that to do physics I had to practically be a genius, which I knew I wasn’t, and so they really discouraged me so much that I became insecure about my math skills and about my ability to conquer and study the field.

Debbie: So, years later when it came to go to university, I picked philosophy as an undergrad because I thought that that was something similar to physics. It had a lot of questions, and you could use your imagination to ask yourself why are we here, and all kinds of things that had to do with objects that surround us and their meaning and whatnot, but I realized, Hugo, that the more I tried to hide my love for physics and math, the more that this inner voice telling me to go for it and to study it was screaming at me, until two years into the bachelor’s program in Mexico, I decided behind everyone’s back to apply to schools in the US as a transfer student, and it was difficult because in Mexico we were paying an eighth of what universities cost in the US, and especially as a foreign student, it’s very hard to find scholarships and financial help, but I was extremely lucky that I got full scholarship offered to me by Brandeis University in Massachusetts, and so in the middle of my BA in philosophy, I transferred to Brandeis in the winter. I hadn’t seen the snow before, and I picked up philosophy courses, but right in my first semester I had the courage to take my first intro to physics class. It was a very large classroom with a hundred students, and the class was astronomy 101.

Debbie: In that class, I realized that my passion and my love for physics was not gonna go away, and I befriended the teaching assistant in the classroom who was a graduate student by the name of Rupesh, who came from India. He came from Darjeeling, a town in the Himalayas, and Rupesh and I became friends, and we would meet all the time, and he was the first person who truly believed in me, and he told me that I wasn’t the typical student that just wanted to get an A in the homework, that my questions were just so curious, and I was so inquisitive, and that I really, really cared about knowing about the planets and quantum mechanics and statistical mechanics, and all kinds of things, and so he really encouraged me to try to do physics, until one day, we were walking in Harvard Square in Cambridge, and we sat under a tree, and I looked at Rupesh with tears in my eyes, and I said, “Rupesh, I just don’t want to die without trying. I don’t want to die without trying to do physics.”

Debbie: He got up, and we didn’t have cellphones at the time, but he called his advisor who was the head of the physics department at Brandeis, Dr. Wardle, who was the professor in my astronomy class, and he said, “I have a student here who has a scholarship for only two years because she’s a transfer student, and I know that BA in physics takes normally four years to complete, but she’s really, really passionate. What can we do about it?” So, Dr. Wardle called me into his office, and we had a conversation, and he basically told me, me and Rupesh, who was there with me, he said, “Believe it or not, there’s somebody else who’s done this in the past at Brandeis. His name is Ed Witten. He is-”

Hugo: Wow …

Debbie: I know. For those people who know physics and know who he is, he’s basically the father of string theory, so he definitely qualifies as a genius, and so I thought he was pulling my leg, like okay, Ed Witten, there’s no way I could achieve this. But he said, "Ed switched at Brandeis from history to physics, and he did it in only two years," because I couldn’t ask my family to pay for another extra two years to stay there, and so what Dr. Wardle offered is he gave me a book called Div, Grad and Curl, which is vector calculus in three dimensions, and basically, he said to me, "If by the end of the summer you’re able to master this material," and Hugo, I didn’t even remember algebra at this point-

Hugo: And of course, there’s a whole bunch of linear algebra, which goes into this vector calculus. Right?

Debbie: Of course. There’s so much background you have to know to even get into studying this book. So, he said, "If in two months," because this was in the month of May, "you’re able to master this material, we’ll give you a test, and we’ll let you skip through the first two years of the physics major, so you can basically finish the whole BA in only two years." So, Rupesh looked at me, and he said, "We’re gonna do this," and he decided, incredibly, to devote his entire summer from mid June to end of August to teaching me and mentoring me, and basically covering all the subjects that I needed to master in order to enter the third year of physics in September.

Debbie: It was amazing because I was so incredibly hardworking and passionate that I didn’t move from my desk. Every day, Rupesh taught me from 9:00 in the morning till 9:00 p.m. We didn’t have much time, so it was just practical, knowing how to solve derivatives on Saturday. Sunday, we’ll do integrals. Monday, first three chapters of classical mechanics, and you get the idea. So, at the end of the summer I presented the test, and I passed. I tried to not burn too many capacitors in my first electronics lab at the time, and I remember how incredibly grateful I was to Rupesh, this person that absolutely changed the course of my life.

Debbie: I tell this story every time I have an opportunity because it’s incredible to me what Rupesh told me. I basically always wanted to pay him for all that he dedicated to me and all the effort he put into tutoring me, and he said to me that when he was growing up in India, in Darjeeling, there was an old man who used to climb up to his little town in this mountainous terrain, and used to teach him and his sisters the tabla, the musical instrument, math and English, and every time the family wanted to compensate this old man, he said, "No. The only way you could ever pay me back is if you do this with someone else in the world."

Debbie: That beautiful story is how my mission in life began, and Rupesh passed the torch of knowledge to me to inspire, help, and encourage other minorities who, like myself, dream of becoming scientists or engineers, but who for some reason lack the confidence or the skills at the time, and that has really informed my career. It has been the passion that connects everything that I’ve done, and I’m incredibly grateful to that pay-it-forward story. So, after graduating with highest honors from Brandeis is when I went to Stanford, and I reconnected with Rupesh only about seven years after that, because he had gone to the South Pole to be a submillimeter astronomer, and we connected, and he was incredibly proud that I managed to graduate and do my research with a Nobel Prize winner at Stanford, and it was a great story.

Critical Thinking and Data Science

Hugo: Firstly, Debbie, thank you so much for sharing that beautiful story. Secondly, I wish I had a box of tissues with me right now, and thirdly, I feel like I was sitting there under that tree with you and Rupesh solving all the vector calculus challenges, and I want to give Rupesh a big hug and a bunch of cash right now as well, but of course, I’ll do exactly what I’m trying to do and what we need to be doing, which is paying it forward, and I think that actually provides a great segue into talking about critical thinking and data science, how we think about critical thinking as educators, being critical of critical thinking, and maybe I want to frame this conversation by saying there’s just a lot of talk around the skills aspiring data scientists, data analysts, data-fluent, data-literate people need to know, and sometimes to me, anyway, the conversation around this seems to be a little bit superficial, and I was wondering, firstly, if that’s the case for you, and secondly, if it is, what seems superficial about it?

Debbie: Yes. I’m so glad you’re asking this question, Hugo. I can’t tell you how many times I have visited programs where I’ve been a mentor for high school students, and I’ll give you one example. One of these afternoon programs was receiving quite a bit of funding, and there were three groups of young girls from high school working in data science, and they had been taught SQL, so they were masters at it, much more than I was ever proficient at their age, so I was like, “Wow. These girls are really impressive.” There were three groups. They were working at a museum, and so one of them was working with a data set that was about birds in the museum, and they were trying to find patterns by looking at their demographics of the birds and their flying patterns and all this kind of information.

Debbie: Another group was looking at astronomical objects, and a third group was working with turtles because the museum had a whole bunch of turtles in an exhibit. So, I went to the third group that was working with turtles, and I looked at the data that they were working with, and one of the columns said weight, so the weight of the turtles, and so I said, “Oh, wow. So, just out of curiosity, how big are the turtles that you’re working with? Have you ever seen them?” They said, “Oh, yeah, we have. They’re about the size of the palm of my hand.” I said, “Oh, cute. I’d love to see those turtles.” I said, “Okay. So, is the weight here that you have in the column … You don’t have any units for it because you just have the number, and the numbers are around 150 and 200 and 300. So, is this weight in pounds? Is it in kilograms? What is this weight in? What are the units?”

Debbie: All of a sudden, these six girls in the group got all quiet, and none of them ventured to answer until one of them raised her hand and said, “Oh, I think it’s in pounds,” and I said, “Oh, wow. Let’s see. I’m about five-foot-three, and I weigh probably about 120 pounds, so this is interesting because a turtle that’s the size of my hand, basically, you’re telling me it weighs double the amount of pounds that I do. Does that make sense?” Then they all laughed and said, “Oh, yeah. You’re right. It doesn’t make sense,” and we had this very nice conversation, and we went back and forth. It turns out, after an hour, we finally found a teacher who knew for certain and gave us the information that the weight was actually in grams.

Hugo: Wow.

Debbie: So, the girls were surprised, and that story really caught my attention because I had been visiting a lot of schools and programs that are trying to teach coding in a very kind of fast and superficial way, just to be able to say, "Our students know how to code," and I realized that in an effort to get more and more people to know the skills for data science and for data analysis in a world that’s going way too fast where we need to prepare our students for jobs in AI and machine learning and whatnot, we are forgetting what all of this is for. Coding and analyzing data has a purpose. It’s not an end in itself. The purpose is to be able to solve problems and to have insights about what the data is telling us.

Debbie: If we’re not taught to ask the right questions and to think critically about where the data comes from, why is it being used or collected in a certain way, what other data could help or hurt my dataset, what biases are being introduced by this dataset, if we’re not teaching our kids to think what’s behind these techniques, then we’re basically failing, because we’re just making them like robots who can only perform a simple task if, and only if, the next dataset they see is similar in scope and structure to the one that they’re learning to work with.

Debbie: It was a very moving, and in a way, also painful experience to see, because I realized how needed are those critical skills, and not only in the education at the high school level, but how many projects haven’t we seen at companies, at very large companies and advanced data science groups where there’s a significant bias being introduced because no one bothered to include a certain minority but important group in the statistical sample, or bias was introduced because people didn’t bother to check what some outliers in the dataset were describing et cetera. So, I’m very, very passionate about teaching the critical thinking skills that are behind our why for why we do data science.

Collecting Data

Hugo: You’ve spoken to so many essential points there. The overarching one is critical thinking, and what I like to think of, data thinking or data understanding before even … There’s a movement to put data into models and throw models at data before even looking at, as you say, units or important features, or really getting to know your data, getting to understand it, and performing that type of exploratory data analysis, and a related point that underlay a lot of what you were discussing there is thinking about the data collection process as well, and if you’re collecting data in a certain way, what are you leaving out? What are your instruments not picking up? Is your data censored for any of these reasons? Are you leaving out certain demographics because they don’t use a particular part of your service?

Debbie: Mm-hmm (affirmative). Exactly. Exactly, and I think I see a lot of companies that don’t really know what data science is about, because it has become this buzzword, and everyone wants to be in it, but nobody really knows exactly what you can get out of it, and what’s happening is a lot of companies are investing significant dollar amounts in big data and solving big problems because they have collected so much data, they just build a huge infrastructure and try to find insights, but without really knowing if, first of all, those insights are important for the company, second of all, if they find them, would they be able to use them for something and enact policies or something that’s actually gonna be helpful for the goals of the company? I always remind them with this kind of simple example. One of my heroes in physics is Tycho Brahe, who was a very famous Danish astronomer. Basically, he was locked up in a tower on an island in Denmark, which I actually had the opportunity to visit last summer.

Hugo: Oh, really?

Debbie: Yes.

Hugo: Wow.

Debbie: He lived in the 1500s, an amazing man, but he also had a … Apparently, he was a nobleman. He had an awful personality, and he lost his nose in a duel.

Hugo: They say he replaced it with a golden bridge, I think.

Debbie: With a bronze-

Hugo: Bronze. Yeah. Okay, great.

Debbie: I think that has been discredited a bit. That’s what they told me in the museum. But anyways, yeah, this very interesting character, but the amazing thing about him is that he looked at the sky without any telescope. He basically had created these sophisticated instruments, but in the 1500s, it took him years, and he created a catalog of only about a thousand stars. That’s it. So, that’s a very, very small dataset by today’s standards, but from only those thousand data points, I think it was like 1,800 or so, to be more accurate, but he helped the theories that were later created by Kepler and Copernicus, and where the laws of planetary motion were derived.

Debbie: Basically, Kepler used that, and then Isaac Newton used it as the basis for the law of gravity. So, from those thousand data points came universal theories that we’re still using today, that are incredibly powerful and deep, and that is a good example to say that sometimes we can put a lot of investment into huge datasets, but when we’re talking about data literacy, large datasets also have a lot of noise, and you have to start by teaching that the most important thing is the insight that you’re going to derive from that dataset and not its size.

Big Data

Hugo: I’d like to speak to this idea of the focus on big data and the fact that a lot of us are collecting as much data as possible, thinking that all the information we need will be contained in there, even before asking critical questions, which is very dangerous, but before that, I just want to say tangentially, Tycho Brahe and Kepler’s story is so wild. I haven’t looked into it in a while, but if I recall correctly, Kepler wanted to unlock the secrets of planetary motion and figure out what was happening, and he realized that Tycho had the data. So, this is a story of someone realizing someone else has this data, and he went to work with him in Tycho Brahe’s, I think, final years, and Tycho didn’t even give him all the data at that point. He was actually very secretive about the data he had, and even when Brahe died, Kepler had to struggle with Brahe’s family in order to get the data. So, there were all types of data secrecy and data privacy issues at that point as well.

Debbie: Also, data ownership, because what-

Hugo: Exactly. That’s what I meant. Yeah.

Debbie: Most people know who Kepler was, but if you ask people about Tycho Brahe, very few non-science people know, and that’s because a lot of the credit went to Kepler, and some people argue that the one that did all the meticulous observations and had theories about it was Tycho, and so he deserved more credit. So, it was kind of a crazy time, and lots of fights about data were happening.

Hugo: Of course, we’re talking about a decoupling or a separation of, let’s say, humans into the people who are fantastic at collecting data and the people who are fantastic at analyzing it as well. This is a division in a lot of ways.

Debbie: Yeah, absolutely.

Hugo: But this focus on big data, the fact that even a lot of companies’ valuations are based around the fact that they have so much data, and it must be useful in the future, right? This is incredibly dangerous for practitioners, but also for society.

Debbie: Absolutely. I mean, we did have a tipping point in that we had the hope in the ’70s of AI and changing the landscape of our society, and it didn’t quite deliver in its promise because we didn’t have the capacity to analyze very, very large datasets like we do now, and there was a tipping point where now we are able to analyze these much, much larger datasets. I mean, I think every day in the world, we produce 10 to the 18 bytes of data, like 3 exabytes of data, something like that, that we generate. So, obviously these are enormous scales, but what’s important is not that we now have this capacity to analyze it, but are we really getting a significant marginal insight, or are the insights that we’re getting commensurate with the ones that we were getting when we didn’t have such large datasets?

Debbie: I think that question’s still out there. We haven’t been able to answer it because, as you know, the real important applications of AI are still being created and worked on. A lot of the AI things that we see out there are still simplistic in that they don’t use all of the incredible and deep capacities that AI has to solve problems. So, dimensionality of the data matters. It matters a lot, and probably for certain problems, it’s going to be hugely important. But my point is more about when you’re educating people or when you’re a company investing in certain technology, you have to be able to walk before you run, so start analyzing the smaller datasets, come up with strategies that are based more on critical thinking, and the questions that you’re trying to solve rather than the size of your dataset, and the size of the infrastructure that you’ve built.

Top 3 Critical Thinking Skills to Learn

Hugo: Great. So, I’ve got a thought experiment for you, which may happen all the time. I have no idea. But a student, an aspiring data scientist data analyst comes to you and says, "I need to learn some data thinking skills, some critical thinking skills to work with data. What are the top three critical thinking skills that you think I should learn, Debbie?"

Debbie: Thanks for that question, Hugo. I think the first one is you have to be a skeptic about data. You have to always … Just like when you read a scientific paper, you have to know who paid for this research. Was it the drug company that is sponsoring a paper that says their drug is the only and best drug in the world? Clearly, I’m not gonna trust that paper. So, a healthy skepticism about the team that collected the data, what biases could have been introduced, where was this data taken, how was it collected, what things were left out, what variables would be important in the future, et cetera. All those questions I think are super important. So, if you don’t ask them before even doing exploratory data analysis, it means you’re thinking about the data, and your relationship with the data is gonna be limited.

Debbie: The second one, and this one, I came up with it from another famous physicist, Richard Feynman, who said, "The ability to not fool oneself is one of the hardest and most important skills one can acquire in life," because it’s very easy … Sometimes we think, oh, I wouldn’t be fooled by anyone, not any marketing campaign, not any government is gonna fool me, but we fool ourselves much more often than the people interpreting the data out there. So, the ability to not fall in love with what we think our data should be telling us, that is what I call fooling yourself, that is super important.

Debbie: The third skill is connecting the code and the algorithms to the real world, like my example with high school girls that were working with the data. To be working with a database for three months and forgetting that behind the data are actual turtles, in this example, that’s a big mistake, the same way when Facebook is incredible at doing face recognition and analyzing relationships between groups and people, but if they’re forgetting that behind those connections are real people with real lives and real consequences, then we’re failing. We need to really connect our analysis to the world out there.

Hugo: I agree, and I just want to go through those again, because I’m sure our listeners are scribbling away trying to remember all of this. So, the first one was a healthy skepticism about data, the second, the ability to not fool yourself, and the third, connecting the code and the real world and all the stakeholders that actually exist on the ground.

Debbie: Correct. Thank you, Hugo.

Bias

Hugo: So, I just want to build slightly on the ability not to fool yourself. I mean, all of these are incredibly important, but there’s a paper called, I hope I get this right, Many Analysts, One Dataset, that we’ve discussed once or twice on the podcast before, and it gives a whole bunch of statisticians and domain experts a dataset, separates them into teams, and gives them the same dataset and asks … It’s a dataset of, I think, either yellow or red cards given to football players in football or soccer matches, and the question is, are these decisions to give cards, is there some sort of ethnic bias or a racial bias in these decisions?

Hugo: The fact is, what happened was 70% of the teams said one thing, 30% said the other thing, either yes or no, and then when they got to see everyone else’s results, nearly all the teams were even more sure of their own techniques and their own results. There are a lot of reasons for this, but one of the points is that people go in with a certain bias already, and if you have a bias going into a dataset, you make all these micro-decisions as an analyst, which helps you get to the place that you already thought you were going, right?

Debbie: Yeah. You reminded me, funnily, of a paper that I discussed. I don’t even think you could consider it a scientific, sophisticated paper, but it was a paper done for the astrology, not astronomy, but Astrology Association in India years ago, and I talked about it at a conference because they first decided the hypothesis is that through some astrological charts that tell you certain characteristics about some kids, if these people that were the gurus and the chart readers and predictors were able to guess, I think that they gave themself a pretty low score. They said, "If we are able to guess 60% of the outcomes," and I think the question was whether these students were intellectually gifted or just going to be average students in school, based just on their astrological chart, "and if we’re able to get 60% of them right, then that means we are gurus, and astrology is true, and we are able to predict this with very high confidence." That was their confidence level.

Debbie: The funny thing is even though they did slightly worse than a coin toss, that is they got 49% of them right, and anybody in their right mind would be able to say, "Well, clearly they did even worse than chance, a toss of a coin would’ve done better," but they themselves patted themselves on the back saying, "You see? We got 49% right. We can do this." So, it’s a very funny paper, and I encourage people to read it because it’s so easy to fool ourselves.

Hugo: Absolutely, and the best thing about doing worse than a coin toss is you could actually just switch all your decisions and do better. So, we’ve been talking about critical thinking at an individual and societal level. I’m wondering how you think about the needs for all these skills, critical thinking skills, how they should be spread through organizations, and what I mean is, what type of critical thinking and data thinking skills will be needed and are needed for people who don’t even work directly with data themselves, but in jobs impacted by data?

Debbie: Yes. That’s an excellent question because I think the more that our field of data science grows, the more that we get different dependencies in companies, different groups needing insights or even having contact with the data, and not everybody’s going to be a data scientist. We’re gonna have people just interpret visualizations that come from the data, others using APIs and having to interpret what the algorithms come up with and whatnot. So, I think it’s essential that we spread the critical thinking message across organizations, and it has to start early in school because the ability to ask the right questions in an industry setting in incredibly important, and I don’t think we’re putting enough emphasis in it. So, I think everybody in an organization has to be trained about things such as data ethics. How is the data being collected? Are we using it for the right purpose? Data ownership, data privacy, data security, all kinds of issues that impact the manipulation of data, and so that’s part of the critical thinking process.

Hugo: Hopefully, this aspect of understanding on the part of people in society and other working professionals who aren’t data scientists will result in less burden on the data scientists. What I really mean by that is … Well, there are a few ways to frame it. The first way is I think it was probably Nate Silver who said this. Any quotation I don’t know who it was, I’ll just say it’s Nate Silver, generally. But it was probably Nate Silver who said something like, "When a data scientist gets something right, they’re thought of as a god, and when they get something wrong, they’re thought of as they’ve made the worst mistakes ever," as opposed to a job in which sometimes you get it right, and sometimes you get it wrong.

Hugo: Another way to frame it is that data scientists are kind of viewed by people without data skills as, "I have no idea how to deal with this, so this is what you’re going to do." You’re a prophet, or you’re the holder of divine knowledge, or the "high priest of data science", as I like to call them. I wonder whether, as people who aren’t data scientists develop more data skills, that will actually help bridge this gap in a lot of ways. So, how do you think about these types of issues and challenges when building data science curricula at Metis and elsewhere?

Debbie: Yeah. It’s very important for me to learn … I’m not an expert in the field of learning science, but it’s very important to me to learn how to best build curriculum that optimizes these critical thinking principles and questions that I’m talking about, and so it really depends on the curriculum. So, for example, with a team including Cathy O’Neil, who I know you’ve interviewed before and who I love, and a group of others, seven executive women, with funding from Moody’s Analytics and the help of Girls, Inc., we developed the first data science curriculum for high school girls of under-served backgrounds, and we deployed it in several New York high schools.

Debbie: So, I think it was just this amazing experience because we tried to emphasize focusing on the topic and what the consequences were of every single step in the process, from data collecting, to choosing the algorithm, to knowing how to measure the accuracy, the recall, the precision, everything that we were doing, where it comes from, how to choose the metric that was right for the problem at hand, et cetera, and so the intention was very consciously about how to get the most insight about the limitations and the successes of the challenge or the problem at hand.

Debbie: When I build curriculum for the Metis bootcamp currently in my position, I want the students to have a pretty broad set of tools with which they can crack really hard problems. So, I may not focus on getting every single clustering algorithm there is in the curriculum, but I will focus on how to analyze the results of the clustering algorithms that we will see, and how to know if we’re using the right algorithms for the problem at hand, and how to be able to ask that question of our colleagues, of our communities, et cetera, because we all have limitations to our knowledge.

Metis

Hugo: Yeah. There are two things there I want to focus on. The first is, as you said, at Metis, thinking about the actual problems, and thinking about the question at hand before even getting to coding I think is incredibly important, and also, educating people through questions that really pertain to them and are interesting to them. So, students will ask me, "If I want to embark upon my first data science project, what would you suggest I do?" I say, "Well, what are you interested in," and if they have a fitness tracker, for example, I say, "Maybe you could analyze your own fitness data. If you’re a foodie, scrape Yelp reviews of restaurants and work with that type of stuff. If you love movies, if you’re a cinephile, the OMDB has a fantastic API."

Debbie: That’s exactly what we do at Metis. We have our students in the bootcamp use their own dataset, and they create their own project. So, it’s really cool. I encourage people to go to madeatmetis.com, and it’s a site where we have some of our greatest projects, and it’s incredible because you see people that had very basic math and programming skills coming in, and in three months they’re able to analyze contamination sources in the ocean, or some healthcare-related thing, or an app that helps you choose the best restaurant for crepes that evening, and stuff like … It’s really, really cool what you can do.

Hugo: Yeah, and I’ll build on that by saying I’ve been to several of Metis’s graduation presentations. What do you call them?

Debbie: Career Day.

Hugo: Yeah. They’re incredible, and seeing all the students there present the work they’ve done is amazing, and I know that … For example, you know I’ve had Emily Robinson on the podcast. I work with her now at DataCamp, and she completed Metis, and I think she went to Etsy straight from Metis. I could be wrong there.

Debbie: Yes. We love Emily.

Future of Data Science

Hugo: Yeah, incredible. So, we’re gonna wrap up in a few minutes, but I’d like to … We’ve talked about the state of play of critical thinking today, but I’d like to … It’s a prediction problem. So, what does the future of data science look like to you, Debbie?

Debbie: To me, it’s going to merge with the industry of IOT or the internet of things. That is, as we see the ubiquitous sensors, that these sensors are simply everywhere, from medical devices, to buildings that are smart buildings testing our comfort level, to apps that match our behavior, it’s-

Hugo: I mean, you’re right. We wear them, and we carry them in our pockets, right?

Debbie: Exactly, and just like the personal computer came to revolutionize the information technology field, in the same way, IOT is going to revolutionize it, and we’re gonna see a new paradigm where we’re going to collect substantially more data about ourselves, our behaviors, our connections, and so issues that have to do with data privacy, data ownership, security, analysis, and insights are going to become ever more important. So, what I predict is that with more automation, we’re gonna have more need for people that are not necessarily the data scientists working with the data, but are working in the field to analyze the ethical consequences of it, to act as peer reviewing committees, to see if there should be policies or regulations that should be enforced around certain applications, et cetera. So, that’s what I see for the future, more and more need for sort of adjacent professions that help with the data analysis process.

Hugo: Yeah, I think you’re right in terms of defining it anyway or describing it as a merging between data science and IOT and automation. I can’t quite remember, did you give a talk on the internet of things at the NYR… Jared’s conference, a few years ago?

Debbie: Yes, I did at the R… Yep. Yeah.

What is your favorite data science technique?

Hugo: Okay, great. Well, I loved that talk, and Jared puts all those talks up online, so I’ll find a link for that and put that in the show notes as well, if anyone’s interested. So, I want to get a bit technical. I’m wondering what one of your favorite data sciencey techniques or methodologies is, just something you love to do.

Debbie: I actually really, really love singular value decomposition, SVD. I’ve always loved linear algebra, and just the thought of being able to reduce the dimensionality of a problem is so sexy to me. In physics, we deal with it all the time, and my first encounter with it was when I worked briefly with David Botstein, who’s … This is many, many years ago at Stanford. He’s one of the creators of Genentech, the biotech company, and we were analyzing the data coming from DNA microarrays, which basically compare a sample of healthy DNA with a sample that came from a patient in order to conclude whether the patient had cancer, and in the case of a positive answer, what type of breast cancer it was.

Debbie: So, it was really, really interesting because, obviously, there are so many genes in our genome that the dimension of the problem was humongous, and so to apply SVD and be able to reduce it to the dimensions that were most important enabled them to come up with pretty customized drugs that I have heard, because I have since stopped working in that topic, but I’ve heard are working quite well for different types of breast cancer. So, the applications of SVD are incredible, and so I don’t know, I just really like that conceptually, and anything that has to do with that, even NLP and, I don’t know, just seeing what you can get by sacrificing a bit of information is just really interesting to me.

Hugo: Well, I’m sold. I mean, you’ve motivated it through linear algebra, which I also love, and then you gave some incredibly important examples of its use, and for those of you out there who know of PCA, I’d definitely suggest you check out SVD as well.

Debbie: Yeah.

Call to Action

Hugo: I’ve got one final question for you. Do you have a final call to action for our listeners out there?

Debbie: Yes, I do. I’ll repeat, Hugo, what I said in my Grace Hopper Celebration keynote speech a little over a year ago. Think deeply, be bold, and help others.

Hugo: I think that’s fantastic, Debbie, and what we’ll do is we’ll link to your Grace Hopper talk as well, because I think the way you explained in that talk all of these things, why it’s important to think deeply, be bold, and help others, which you’ve kind of gone through this talk as well, I think that talk can provide more context there also.

Debbie: Wonderful. This has been such an awesome conversation, Hugo. Thank you.

Hugo: Thank you so much, Debbie. It’s been an absolute pleasure having you on the show.


quantmod_0.4-14 on CRAN

Mon, 03/25/2019 - 12:53

(This article was first published on FOSS Trading, and kindly contributed to R-bloggers)

I just pushed a new release of quantmod to CRAN! I’m most excited about the update to getSymbols() so it doesn’t throw an error and stop processing if there’s a problem with one ticker symbol. Now getSymbols() will import all the data it can, and provide an informative error message for any ticker symbols it could not import.
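For example (a hedged sketch, not from the release announcement; the failing ticker below is made up and the result depends on the data source being reachable):

library(quantmod)

# One valid ticker and one that should not exist. With quantmod 0.4-14,
# getSymbols() is expected to import the valid symbol and report the
# problematic one instead of stopping with an error.
tickers <- c("SPY", "NOSUCHTICKER123")
getSymbols(tickers, src = "yahoo", from = "2019-01-01")

head(SPY)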

Another cool feature is that getQuote() can now import quotes from Tiingo. But don’t thank me; thank Ethan Smith for the feature request [#247] and pull request [#250].

There are also several bug fixes in this release. The most noticeable are fixes to getDividends() and getSplits(): getDividends() now returns raw dividends instead of split-adjusted dividends (thanks to Douglas Barnard for the report [#253]), and getSplits() now returns the actual split adjustment ratio instead of its inverse (e.g. now 1/2 instead of 2/1). Yahoo! Finance continues to have stability issues, so I suggest using a different data provider. See my post Yahoo! Finance Alternatives for some suggestions.

See the news file for the other bug fixes. Please let me know what you think about these changes.  I need your feedback and input to make quantmod even better!


Play with the cyphr package

Mon, 03/25/2019 - 02:01

(This article was first published on Shige's Research Blog, and kindly contributed to R-bloggers)

The cyphr package seems to provide a good choice for a small research group that shares sensitive data over the internet (e.g., Dropbox). I did a simple experiment myself and made sure it can actually serve my purpose.

I did my experiment on two computers (using openssl): I created the test data on my Linux workstation running Manjaro, then I tried to access the data on a Windows 7 laptop.

For creating the data (Linux workstation):

library(cyphr)

# Create the test data
data_dir <- file.path("~/Dropbox/temp_files", "data")
dir.create(data_dir)
dir(data_dir)

# Encrypt the test data
cyphr::data_admin_init(data_dir)

key <- cyphr::data_key(data_dir)

filename <- file.path(data_dir, "iris.rds")
cyphr::encrypt(saveRDS(iris, filename), key)
dir(data_dir)

# Cannot read the data without decrypting it
readRDS(filename)

# Read the decrypted version of the data
head(cyphr::decrypt(readRDS(filename), key))

For accessing the data (Windows laptop):

library(cyphr)

key <- data_key("C:/Users/Ssong/Dropbox/temp_files/data", path_user = "C:/Users/Ssong/.ssh")

# Make data access request
data_request_access("C:/Users/Ssong/Dropbox/temp_files/data",
                    path_user = "C:/Users/Ssong/.ssh")

On Windows 7, the system cannot locate the public key located in "~/.ssh", which is pretty dumb.

Going back to the Linux workstation to approve the data access request:

# Review the request and approve (to share with other users)
req <- data_admin_list_requests(data_dir)
data_admin_authorise(data_dir, yes = TRUE)
data_admin_list_keys(data_dir)

Now I can access the data on my Windows laptop:

key <- data_key("C:/Users/Ssong/Dropbox/temp_files/data", path_user = "C:/Users/Ssong/.ssh")

d <- decrypt( readRDS( "C:/Users/Ssong/Dropbox/temp_files/data/iris.rds"), key)


Summer Interns 2019

Mon, 03/25/2019 - 01:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We received almost 400 applications for our 2019 internship program from students with very diverse backgrounds. After interviewing several dozen people and making some very difficult decisions, we are pleased to announce that these twelve interns have accepted positions with us for this summer:

  • Therese Anders: Calibrated Peer Review. Prototype tools to conduct experiments to see whether calibrated peer review is a useful and feasible feedback strategy in introductory data science classes and industry workshops. (mentor: Mine Çetinkaya-Rundel)

  • Malcolm Barrett: R Markdown Enhancements. Tidy up and refactoring the R Markdown code base. (mentor: Rich Iannone)

  • Julia Blum: RStudio Community Sustainability. Study community.rstudio.com, enhance documentation and processes, and onboard new users. (mentor: Curtis Kephart)

  • Joyce Cahoon: Object Scrubbers. Help write a set of methods to scrub different types of objects to reduce their size on disk. (mentors: Max Kuhn and Davis Vaughan)

  • Daniel Chen: Grader Enhancements. Enhance grader (https://github.com/rstudio-education/grader) to identify students’ mistakes when doing automated tutorials. (mentor: Garrett Grolemund)

  • Marly Cormar: Production Testing Tools for Data Science Pipelines. Build on applicability domain methods from computational chemistry to create functions that can be included in a dplyr pipeline to perform statistical checks on data in production. (mentor: Max Kuhn)

  • Desiree De Leon: Teaching and Learning with RStudio. Create a one-stop guide to teaching with RStudio similar to Teaching and Learning with Jupyter. (mentor: Alison Hill)

  • Dewey Dunnington: ggplot2 Enhancements. Contribute to ggplot2 or an associated package (like scales) by writing R code for graphics and helping to manage a large, popular open source project. (mentor: Hadley Wickham)

  • Maya Gans: Tidy Blocks. Prototype and evaluate a block-based version of the tidyverse so that young students can do simple analysis using an interface like Scratch. (mentor: Greg Wilson)

  • Leslie Huang: Shiny Enhancements. Enhance Shiny’s UI, improve performance bottlenecks, fix bugs, and create a set of higher-order reactives for more sophisticated programming. (mentor: Barret Schloerke)

  • Grace Lawley: Tidy Practice. Develop practice projects so learners can practice tidyverse skills using interesting real-world data. (mentor: Alison Hill)

  • Yim Register: Data Science Training for Software Engineers. Develop course materials to teach basic data analysis to programmers using software engineering problems and data sets. (mentor: Greg Wilson)

We are very excited to welcome them all to the RStudio family, and we hope you’ll enjoy following their progress over the summer.


nice student project

Sun, 03/24/2019 - 20:24

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

In all of my undergraduate classes, I require a term project, done in groups of 3-4 students. Though the topic is specified, it is largely open-ended, a level of “freedom” that many students are unaccustomed to. However, some adapt quite well. The topic this quarter was to choose a CRAN package that does not use any C/C++, and try to increase speed by converting some of the code to C/C++.

Some of the project submissions were really excellent. I decided to place one on the course Web page, and chose this one. Nice usage of Rcpp and devtools (neither of which was covered in class), very nicely presented.


ShinyProxy 2.2.0

Sun, 03/24/2019 - 12:05

(This article was first published on Open Analytics, and kindly contributed to R-bloggers)

ShinyProxy is a novel, open source platform to deploy Shiny apps for the enterprise
or larger organizations.

Secured Embedding of Shiny Apps

Since version 2.0.1 ShinyProxy provides a REST API to manage (launch, shut down) Shiny apps and consume the content programmatically inside broader web applications or portals. This makes it possible to cleanly separate responsibility for the Shiny apps (data science teams) from responsibility for those broader applications (IT teams), while still achieving seamless integration between the two from the user’s perspective. With this release we go one step further and support the industry standard for protecting REST APIs, namely OAuth 2.0.

In practice this means the following: when users of the portal log on, they typically authenticate with an OAuth2 provider (e.g. Auth0). This then allows the web application to access the ShinyProxy API on their behalf and launch the Shiny apps over the ShinyProxy API. We leave out the details on authorization codes and access tokens, but the core message is that you can now embed Shiny apps in virtually any other web application in a secure way. If you want an actual example,
please head to our Github page with ShinyProxy configuration examples, where a sample Node.js application is made available to demonstrate the full scenario.

Miscellaneous improvements

More generally, users are deploying ShinyProxy on a wide array of cloud platforms and using a great variety of authentication technologies. A lot of the experience gained can now be found in updated documentation with additional examples on e.g. AWS Cognito (here) or Microsoft Azure AD B2C (here), next to Google and Auth0 (here). These production deployments also called for more extensive documentation on logging with ShinyProxy and at that level we also introduced a new setting logging.requestdump to enable full request dump for advanced debugging. Then, for user convenience we introduced user-friendly URLs to access an app either via the standard ShinyProxy interface (/app/) or directly (/app_direct/) if needed.

Full release notes can be found on the downloads page and updated documentation can be found on https://shinyproxy.io. As always community support on this new release is available at

https://support.openanalytics.eu

Don’t hesitate to send in questions or suggestions and have fun with ShinyProxy!


Using R and H2O to identify product anomalies during the manufacturing process.

Sun, 03/24/2019 - 06:31

(This article was first published on R-Analytics, and kindly contributed to R-bloggers)

Introduction:

We will identify anomalous products on the production line by using measurements from testing stations and deep learning models. Anomalous products are not failures; these anomalies are products close to the measurement limits. By flagging them we can display warnings before the process starts to make failed products, so the stations can get maintenance in time.

Before starting we need the following software installed and working:

– R language installed.
– All the R packages mentioned in the R sources.
– Testing station data; I suggest going station by station.
– H2O open source framework.
– Java 8 (for H2O). OpenJDK: https://github.com/ojdkbuild/contrib_jdk8u-ci/releases
– RStudio.

Get your data.

About the data: since I cannot use my real data, for this article I am using the SECOM Data Set from the UCI Machine Learning Repository.

How many records? Training data set: in my real project, I use 100 thousand test-passed records, which is around a month of production data. Testing data set: I use the last 24 hours of testing station data.

Let the fun begin: Deep Learning Model Creation And Testing.

# Load libraries
library( h2o )

h2o.init( nthreads = -1, max_mem_size = "5G", port = 6666 )

h2o.removeAll() ## Removes the data from the h2o cluster in preparation for our final model.

# Reading SECOM data file
allData = read.csv( "secom.data", sep = " ", header = FALSE, encoding = "UTF-8" )

# fixing the data set, there are a lot of NaN records
if( dim(na.omit(allData))[1] == 0 ){
  for( colNum in 1:dim( allData )[2] ){

    # Get valid values from the actual column values
    ValidColumnValues = allData[,colNum][!is.nan( allData[, colNum] )]

    # Check each value in the actual active column.
    for( rowNum in 1:dim( allData )[1] ){

    cat( "Processing row:", rowNum, ", Column:", colNum, "Data:", allData[rowNum, colNum], "\n" )

    if( is.nan( allData[rowNum, colNum] ) ) {

      # Assign random valid value to our row,column with NA value

      getValue = ValidColumnValues[ floor( runif(1, min = 1, max = length( ValidColumnValues ) ) ) ]

      allData[rowNum, colNum] = getValue
    }
  }
 }
}

# splitting all data, the first 90% for training and the remaining 10% for testing our model.
trainingData = allData[1:floor(dim(allData)[1]*.9),]
testingData = allData[(floor(dim(allData)[1]*.9)+1):dim(allData)[1],]

# Convert the training dataset to H2O format.

trainingData_hex = as.h2o( trainingData, destination_frame = "train_hex" )

# Set the input variables
featureNames = colnames(trainingData_hex)

# Creating the first model version.
trainingModel = h2o.deeplearning( x = featureNames
                                , training_frame = trainingData_hex
                                , model_id = "Station1DeepLearningModel"
                                , activation = "Tanh"
                                , autoencoder = TRUE
                                , reproducible = TRUE
                                , l1 = 1e-5
                                , ignore_const_cols = FALSE
                                , seed = 1234
                                , hidden = c( 400, 200, 400 ), epochs = 50 )

# Getting the anomalies from training data to set the min MSE ( Mean Squared Error )
# value before setting a record as an anomaly
trainMSE = as.data.frame( h2o.anomaly( trainingModel
                                     , trainingData_hex
                                     , per_feature = FALSE ) )

# Check the first 30 trainMSE records, sorted in descending order, to see our outliers
head( sort( trainMSE$Reconstruction.MSE , decreasing = TRUE ), 30)
# 0.020288603 0.017976305 0.012772556 0.011556780 0.010143009 0.009524983 0.007363854
# 0.005889714 0.005604329 0.005189614[11] 0.005185285 0.005118595 0.004639442 0.004497609
# 0.004438342 0.004419993 0.004298936 0.003961503 0.003651326 0.003426971 0.003367108
# 0.003169319 0.002901914 0.002852006 0.002772110 0.002765924 0.002754586 0.002748887
# 0.002619872 0.002474702

# Plotting errors of reconstructing our training data, to have a graphical view
# of our data reconstruction errors

plot( sort( trainMSE$Reconstruction.MSE ), main = 'Reconstruction Error', ylab = "MSE Value." )

# Looking at the chart and the first 30 MSE records sorted in decreasing order, we can choose .01
# as our minimum MSE before flagging a record as an anomaly, because just a few
# records have two decimals greater than zero, and we can treat those as outliers.
# This threshold is something you must decide for your own data.

# Updating trainingData data set with reconstruction error < .01
trainingDataNew = trainingData[ trainMSE$Reconstruction.MSE < .01, ]

h2o.removeAll() ## Remove the data from the h2o cluster in preparation for our final model.

# Convert our new training data frame to H2O format.
trainingDataNew_hex = as.h2o( trainingDataNew, destination_frame = "train_hex" )

# Creating the final model.
trainingModelNew = h2o.deeplearning( x = featureNames
                                   , training_frame = trainingDataNew_hex
                                   , model_id = "Station1DeepLearningModel"
                                   , activation = "Tanh"
                                   , autoencoder = TRUE
                                   , reproducible = TRUE
                                   , l1 = 1e-5
                                   , ignore_const_cols = FALSE
                                   , seed = 1234
                                   , hidden = c( 400, 200, 400 ), epochs = 50 )

################################
# Check our testing data for anomalies.
################################

# Convert our testing data frame to H2O format.
testingDataH2O = as.h2o( testingData, destination_frame = "test_hex" )

# Getting anomalies found in testing data.
testMSE = as.data.frame( h2o.anomaly( trainingModelNew
                                                               , testingDataH2O
                                                               , per_feature = FALSE ) )

# Binding our data.
testingData = cbind( MSE = testMSE$Reconstruction.MSE , testingData )

anomalies = testingData[ testingData$MSE >= .01,  ]

if( dim(anomalies)[1] > 0 ){
  cat( "Anomalies detected in the sample data, station needs maintenance." )
}

Here is the code on github: https://github.com/LaranIkal/ProductAnomaliesDetection

Enjoy it!!!
Carlos Kassab https://www.linkedin.com/in/carlos-kassab-48b40743/
More information about R: https://www.r-bloggers.com

How cdata Control Table Data Transforms Work

Sat, 03/23/2019 - 20:39

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

With all of the excitement surrounding cdata style control table based data transforms (the cdata ideas being named as the “replacements” for tidyr‘s current methodology, by the tidyr authors themselves!) I thought I would take a moment to describe how they work.

cdata defines two primary data manipulation operators: rowrecs_to_blocks() and blocks_to_rowrecs(). These are the fundamental transforms that convert between data representations. The two representations it converts between are:

  • A world where all facts about an instance or record are in a single row (“rowrecs”).
  • A world where all facts about an instance or record are in groups of rows (“blocks”).

It turns out that once you develop the idea of specifying the data transformation as explicit data (an application of Eric S. Raymond’s admonition: “fold knowledge into data, so program logic can be stupid and robust.”), you also have a great tool for reasoning about and teaching data transforms.

For example:

rowrecs_to_blocks() does the following. For each row record, make a replicate of the control table with values filled in. In relational terms rowrecs_to_blocks() is therefore a join of the data to the control table. Conversely blocks_to_rowrecs() combines groups of rows into single rows, so in relational terms it is an aggregation or projection. If each of these operations is faithful (keeps enough information around) they are then inverses of each other.
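As a small illustration, here is a hedged sketch with made-up data; the argument names follow the cdata documentation as I recall it, so verify them against the package before relying on this:

library(cdata)

# Row-record form: all facts about a student in one row.
d <- data.frame(student = c("Ann", "Bob"),
                test1   = c(0.8, 0.6),
                test2   = c(0.7, 0.9))

# Control table: the first column holds the keys that will label the blocks,
# the remaining cells name which row-record columns supply the values.
ct <- data.frame(test  = c("test1", "test2"),
                 score = c("test1", "test2"),
                 stringsAsFactors = FALSE)

# Row records -> block records: a join of the data against the control table.
blocks <- rowrecs_to_blocks(d, controlTable = ct, columnsToCopy = "student")

# Block records -> row records: an aggregation back to one row per student.
rows <- blocks_to_rowrecs(blocks, keyColumns = "student", controlTable = ct)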

We share some nifty tutorials on the ideas here:

One can build fairly clever illustrations and animations to teach the above.

The most common special cases of the above have been popularized in R as unpivot/pivot (pivot invented by Pito Salas), stack/unstack, melt/cast, or gather/spread. These special cases are handled in cdata by convenience functions unpivot_to_blocks() and pivot_to_rowrecs(). A great example of a “higher order” transform that isn’t one of the common ones is given here.

Note: the above theory and implementation is joint work of Nina Zumel and John Mount and can be found here. We would really appreciate any citations or credit you can send our way (or even politely correcting those who don’t attribute the work or attribute the work to others, as there are already a lot of mentions without credit or citation).

citation("cdata")

To cite package ‘cdata’ in publications use:

  John Mount and Nina Zumel (2019). cdata: Fluid Data Transformations.
  https://github.com/WinVector/cdata/, https://winvector.github.io/cdata/.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {cdata: Fluid Data Transformations},
    author = {John Mount and Nina Zumel},
    year = {2019},
    note = {https://github.com/WinVector/cdata/, https://winvector.github.io/cdata/},
  }

Strength of a Lennon song exposed with R function glue::glue

Sat, 03/23/2019 - 13:05

(This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers)


love_verse <- function(w1, w2, w3){
  glue::glue(
    "Love is {b}, {b} is love
    Love is {y}, {y} love
    Love is {u} to be loved",
    b = w1, y = w2, u = w3)
}

In return, the parameters sometimes give echoes of poetry.

love_verse('real', 'feeling', 'wanting')

Love is real, real is love
Love is feeling, feeling love
Love is wanting to be loved

love_verse('touch', 'reaching', 'asking')

Love is touch, touch is love
Love is reaching, reaching love
Love is asking to be loved

## refrain
Love is you
You and me
Love is knowing
We can be

love_verse('free', 'living', 'needing')

Love is free, free is love
Love is living, living love
Love is needing to be loved

https://www.youtube.com/watch?v=7er_xx7Wmg8

list(list(w1 = 'real',  w2 = 'feeling',  w3 = 'wanting'),
     list(w1 = 'touch', w2 = 'reaching', w3 = 'asking'),
     list(w1 = 'free',  w2 = 'living',   w3 = 'needing')) %>%
  purrr::map(function(x) do.call(love_verse, x))

[[1]]
Love is real, real is love
Love is feeling, feeling love
Love is wanting to be loved

[[2]]
Love is touch, touch is love
Love is reaching, reaching love
Love is asking to be loved

[[3]]
Love is free, free is love
Love is living, living love
Love is needing to be loved

We could also read the title of this article as “strength of an R function exposed with a Lennon song”…


Can you turn 1,500 R$ into 1,000,430 R$ by investing in the stock market?

Sat, 03/23/2019 - 01:00

(This article was first published on R on msperlin, and kindly contributed to R-bloggers)

In the last few weeks we’ve seen a great deal of controversy in Brazil regarding financial investments. To keep it short, Empiricus, an ad-based company that massively sells online courses and subscriptions, posted a YouTube ad where a young girl, Bettina, says the following:

Hi, I'm Bettina, I am 22 years old and, starting with R$ 1,500, I now own R$ 1,042,000 of accumulated wealth.

She later explains that she earned the money by investing in the stock market over three years. For my international audience, the proposed investment is equivalent to turning $394 into $263,169 over a three-year period.

Anyone with an economics or business background will easily spot that the financial returns stated in the ad are simply not possible. Even if Bettina is a very good investor, reaching this level of returns over a three-year period in the stock market is unheard of. The yearly rate of return of the investment is equal to 774% per year. The monthly rate proposed in the ad is equivalent to 19,8% per month.
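A quick sanity check of that arithmetic in R, using the final value of R$ 1.000.043 that the code later in this post works with:

(1000043 / 1500)^(1/3)  - 1  # implied yearly rate: roughly 7.74, i.e. ~774% per year
(1000043 / 1500)^(1/36) - 1  # implied monthly rate: roughly 0.198, i.e. ~19,8% per month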

To give perspective, Buffett, one of the greatest long-term investors of all time, has reached an approximate rate of 19% per year, around 1.46% per month. So, Bettina is either a financial genius who, at only 22 years old, was able to beat Buffett at his own game, or the ad is not fully committed to the truth. To be fair, even if we took the difference in inflation rates between Brazil and the US into account, the difference is still very impressive and misleading.

Others have pointed out that if you compound these returns over time, the result will be economically unrealistic. See next what happens to R$ 1.500 if we assume that you can replicate the alleged investment return of Bettina (774% per year) over a 10-year period.

Table 1: Compounding returns for Bettina

Number of years   Investment value
 1                R$ 13.103,89
 2                R$ 114.474,71
 3                R$ 1.000.043,00
 4                R$ 8.736.305,50
 5                R$ 76.319.752,11
 6                R$ 666.724.001,23
 7                R$ 5.824.454.109,93
 8                R$ 50.882.022.569,93
 9                R$ 444.501.780.243,15
10                R$ 3.883.136.374.301,74

If Bettina is a genius and can replicate her result over the years, she will be a billionaire in 7 years and a trillionaire in 10. If she waited one more year, she could even buy the whole country if she wanted to. The current GDP of Brazil is around 2 trillion USD (7.5 trillion in R$). She can easily reach this amount of cash in 12 years or more.

But let’s go further in this endeavor. Let’s stop being skeptical about her returns and see whether it’s possible to achieve them in the stock market. As you can expect, I’m taking a data-based approach. I’ll compare Bettina’s returns to GodBot, a computer algorithm that can perfectly predict stock prices.

First, let’s download some stock data from B3, the Brazilian stock exchange.

library(BatchGetSymbols)

df.sp500 <- GetSP500Stocks()
df.sp500 <- GetIbovStocks()

my.tickers <- paste0(df.sp500$tickers, '.SA')

df.stocks <- BatchGetSymbols(tickers = my.tickers,
                             first.date = '2010-01-01',
                             last.date = Sys.Date())[[2]]

## (BatchGetSymbols console output omitted: it prints a per-ticker download log
## for the 65 Ibovespa symbols; several tickers are dropped for not having
## enough data, thresh.bad.data = 75.0%.)

My god bot has a single and simple rule:

  • For each month, it will always invest 100% in the stock with the highest return for the month

Next we code this bot using tidyverse:

library(tidyverse)

res.inv <- df.stocks %>%
  mutate(ref.month = as.Date(format(ref.date, '%Y-%m-01'))) %>%
  group_by(ref.month, ticker) %>%
  summarise(ret.month = last(price.adjusted)/first(price.adjusted) - 1) %>%
  group_by(ref.month) %>%
  summarise(best.ticker = ticker[which.max(ret.month)],
            best.return = ret.month[which.max(ret.month)])

Now, let’s have a look in those returns:

library(ggplot2)

# bettinas returns
initial.cash <- 1500
last.cash <- 1000043
my.T <- 3 # years

r.aa <- (last.cash/initial.cash)^(1/3) - 1
r.am <- (last.cash/initial.cash)^(1/(3*12)) - 1

p <- ggplot(res.inv, aes(x = ref.month, y = best.return)) +
  geom_col() +
  geom_hline(yintercept = r.am, color = 'red', size = 1.5) +
  labs(x = 'Time', y = 'Monthly Returns',
       title = 'Monthly Returns of Bettina and GodBot',
       subtitle = paste0('- This plot shows the "alleged" returns from Bettina against a perfect predictor \n for the BR stock market\n',
                         '- The horizontal red line represents the return of Bettina (19.79% monthly)'),
       caption = 'www.msperlin.com/blog') +
  scale_y_continuous(labels = scales::percent)

p

As you can see, Bettina did well with a constant monthly return of 19,8%. But GodBot is better. Bettina is clearly missing out on something!

When looking at the nominal value of the investment, the effect of compound returns explodes the value of the portfolio.

library(tidyr)

format.cash <- function(x) {
  require(scales)
  x.formatted <- dollar(x, prefix = 'R$ ',
                        decimal.mark = ',',
                        big.mark = '.',
                        largest_with_cents = Inf)
  return(x.formatted)
}

df.cumret <- res.inv %>%
  mutate(cumret.godbot  = initial.cash*cumprod(1 + res.inv$best.return),
         cumret.bettina = initial.cash*cumprod(1 + rep(r.am, n()))) %>%
  select(-best.ticker, -best.return)

df.to.plot <- gather(df.cumret, 'Investor', "Value", -ref.month)

p <- ggplot(df.to.plot, aes(x = ref.month, y = Value, color = Investor)) +
  geom_line(size = 2) +
  labs(x = 'Time', y = 'Value of Return',
       title = 'Accumulated Value of Portfolio',
       subtitle = 'This figure shows the value of accumulated return for Bettina in comparison to GodBot, a perfect predictor of stock markets') +
  scale_y_continuous(labels = format.cash)

p

The results show that Bettina’s returns are possible. All you need to do is to perfectly predict, for each month, which stock will do best in the market. If you haven’t sensed my irony, let me be crystal clear: Market prices are impossible to predict. The result I just showed you is only possible in my computer. No one will ever be able to replicate it in practice.

The ad from Empiricus is very misleading. In my opinion as a finance professor, the real problem in this episode is that the great majority of the Brazilian population is not financially educated. Many people will believe that it is legally possible to reach a 20% return over a month.

I’ve seen countless cases of financial pyramids, usually tied to some exotic cryptocurrency, rise and quickly burn here in Brazil. This is especially, and sadly, most frequent in the poorer areas of the country. Those that follow Empiricus’ advice will soon learn the lesson. Making money in the short run in stocks is very difficult.

Unfortunately, every disappointed person that followed Empiricus’ advice is never going back to investing in financial markets. They will miss what is probably the greatest system ever designed for investing and passively creating wealth in the long run.


RcppArmadillo 0.9.300.2.0

Fri, 03/22/2019 - 23:57

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new RcppArmadillo release based on a new Armadillo upstream release arrived on CRAN and Debian today.

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 583 other packages on CRAN.

The (upstream-only this time) changes are listed below:

Changes in RcppArmadillo version 0.9.300.2.0 (2019-03-21)
  • Upgraded to Armadillo release 9.300.2 (Fomo Spiral)

    • Faster handling of compound complex matrix expressions by trace()

    • More efficient handling of element access for inplace modifications in sparse matrices

    • Added .is_sympd() to check whether a matrix is symmetric/hermitian positive definite

    • Added interp2() for 2D data interpolation

    • Added expm1() and log1p()

    • Expanded .is_sorted() with options "strictascend" and "strictdescend"

    • Expanded eig_gen() to optionally perform balancing prior to decomposition
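As a quick way to try one of the items above from R, here is a small sketch (my own, not part of the release notes) that calls the new .is_sympd() member function via Rcpp::cppFunction():

library(Rcpp)

# Compile a tiny C++ helper against RcppArmadillo; .is_sympd() reports
# whether a matrix is symmetric/hermitian positive definite.
cppFunction(depends = "RcppArmadillo", code = '
bool check_sympd(const arma::mat& m) {
  return m.is_sympd();
}')

check_sympd(diag(3))          # identity matrix: TRUE
check_sympd(matrix(0, 2, 2))  # zero matrix: not positive definite, FALSE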

Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Data Fun – Inspired by Darasaurus

Fri, 03/22/2019 - 22:10

(This article was first published on Econometrics By Simulation, and kindly contributed to R-bloggers)

After my recent post on Anscombe’s Quartet, in which I demonstrated how to efficiently adjust any data set to match mean, variance, correlation (x,y), as well as regression coefficients, Philip Waggoner turned me onto Justin Matejka and George Fitzmaurice’s Datasaurus R package/paper, in which the authors demonstrate an alternative method of modifying existing data to fit a specified mean and variance for X and Y. Their method randomly applies small disturbances to individual observations to gradually move the data to match a target preference set.

Inspired by their use of point images which match a specific parameter set, I have generated some of my own. For all of them, X has a mean and variance of 1 and 11, while Y has a mean and variance of 1 and 1. The correlation between X and Y is set to 0, which causes some distortion in the images. More on that below.

Figure 1: Shows a graph of 12 data sets each with 15 transitional data sets. The mean, variance, and correlations of X and Y are held constant throughout the sets and transitions.

Data Source

I generated the data myself using Mobilefish’s upload photo and record clicks webapp. The source images are from images I found online. The only slight trick to using the data generated by Mobilefish was that the y coordinates are typically tracked from the top of the page, yet most statistical graphing software plots with y starting from the bottom of the graph.

Raw Images

The raw data, when plotted, look like their source material.

New Images: Force Cor(X,Y)=0

When we force the correlation of X and Y to be zero, certain point distributions become distorted.

For Bart and the Cat forcing cor(X,Y) has noticable distortions while for the flower minimal distortions seem to have been introduced. New Images: Force Cor(X,Y)<>0 It gets even worse when we impose a constant correlation between X and Y. The following shows the distortions to the flower when we change b1, keeping Cor(X,Y) constant and fixing the Y plot limits. Figure 8: Shows the effect on Var(Y) that changing b1, when all other factors are held constant. Slight changes to the Anscombe-Generator Code In order to generate graphs that had cor(X,Y)=0 I had to modify my previous code to allow variation in Y that was completely independent of X. The problem with my code was that if b1=1, my calculation used SSE = (b1^2 * var(X))*n in order to infer how large the variation in u needed to be (varianceu = (SSE/corXY^2 – SSE)/n). This backwards inference does not work if b1=0. So, just for the special case of corXY=0 I have included an additional error term E which is helpful in the even that b1=0. Summary The thought of use points to make recognizable images had not occurred to me until I viewed Justin Matejka and George Fitzmaurice’s Datasaurus work. I hope that in making a slightly more efficient distribution manipulator I will allow new and better datasets to be generated which will help students understand the importance of graphical exploration of their data. Code My code as well as the slight change to Anscombe’s code can be found here. The standarized data sets can be found here. They are the ones that end with std.csv var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Econometrics By Simulation.

Why we Did Not Name the cdata Transforms wide/tall/long/short

Fri, 03/22/2019 - 19:53

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

We recently saw this UX (user experience) question from the tidyr author as he adapts tidyr to cdata techniques.

The terminology that he is not adopting from cdata is “unpivot_to_blocks()” and “pivot_to_rowrecs()”. One of the research ideas in the cdata package is that the important thing to call out is record structure.

The key point is: are we in a very de-normalized form where all the facts about an instance are in a single row (which we called “row records”), or are we in a record-oriented form where the facts about an instance are spread over several rows (which we called “block records”)? The point is: row records don’t necessarily have more columns than block records. This makes shape-based naming of the transforms problematic, no matter what names you pick for the shapes. There is an advantage to using intent- or semantics-based naming.

Below is a simple example.
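The post’s own example isn’t reproduced in this feed extract, so as a stand-in here is a small sketch of my own using the two cdata functions named above, on a made-up table of model scores (the data frame d and its columns are purely illustrative):

library(cdata)

d <- data.frame(id  = c(1, 2),
                AUC = c(0.6, 0.8),
                R2  = c(0.2, 0.3))

# row records -> block records: one row per (id, measurement) pair
blocks <- unpivot_to_blocks(d,
                            nameForNewKeyColumn   = "measurement",
                            nameForNewValueColumn = "value",
                            columnsToTakeFrom     = c("AUC", "R2"))
blocks

# block records -> row records: back to one row per id
pivot_to_rowrecs(blocks,
                 columnToTakeKeysFrom   = "measurement",
                 columnToTakeValuesFrom = "value",
                 rowKeyColumns          = "id")

Here the block form happens to have the same number of columns as the row form (id, measurement, value versus id, AUC, R2); with five measurement columns per id it would be narrower, and with only one it would be wider. That is exactly the point made below.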

Notice that the width of the result relative to the width of the input varies as a function of the input data, even though we are always calling the same transform. This makes it incorrect to characterize these transforms as merely widening or narrowing.

There are still some subtle points (for instance, row records are in fact instances of block records), but overall the scheme we (Nina Zumel and myself, John Mount) worked out, tested, and promoted is pretty good. A lot of our work researching this topic can be found here.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Decode Lyrics in Pop Music: Visualise Prose with the Songsim algorithm

Fri, 03/22/2019 - 18:14

(This article was first published on The Devil is in the Data – The Lucid Manager, and kindly contributed to R-bloggers)

The post Decode Lyrics in Pop Music: Visualise Prose with the Songsim algorithm appeared first on The Lucid Manager.

Music is an inherently mathematical form of art. Ancient Greek mathematician Pythagoras was the first to describe the logic of the scales that form melody and harmony. Numbers can also represent the rhythm of the music. Even the lyrics have a mathematical structure. Poets structure syllables and repeat words to create pleasing sounding prose. This article shows how to decode lyrics from pop songs and visualise them using the Songsim method to analyse their metre.

Decode Lyrics using the Songsim algorithm

Data visualiser, pop music appreciator and machine learner Colin Morris has extensively analysed the repetitiveness of song lyrics. Colin demonstrated that lyrics have become more repetitive since the early days of pop music. The most repetitive song is Around the World by Daft Punk, which should not be a surprise since the artist repeats the same phrase 144 times. Bohemian Rhapsody by Queen has some of the least repetitive lyrics in popular music.

The TedX presentation (see below) by Colin Morris shows how he visualises the repetitiveness of song lyrics with what he calls the Songsim algorithm. The more points in the graph, the more often a word is repeated.

Visualise the lyrics of Around the World and Bohemian Rhapsody.

The visual language of song lyrics

Morris decided to use a self-similarity matrix, a technique also used to visualise DNA sequences, to decode lyrics. In this method, the individual words of the song are both the column names and the row names of a matrix. Every point in the song where the row name equals the column name shows a dot. By definition, the diagonal of every self-similarity matrix is filled. The timeline of the song runs along the diagonal from top left to bottom right.

Patterns away from the diagonal represent two different points in time that have the same words. The more of these patterns we see, the more repetitive a song is. Let’s demonstrate this with the first words ever recorded by Thomas Edison in 1877.

Mary had a little lamb, whose fleece was white as snow. And everywhere that Mary went, the lamb was sure to go.

The similarity matrix below visualises the two first sentences of the famous nursery rhyme. It shows where the words “Mary”, “lamb” and “was” are repeated once.

Self-similarity matrix for Mary had a Little Lamb by Thomas Edison.
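A matrix like this can be reproduced in a few lines of base R. The following is a toy sketch of my own, not Colin Morris’s code; the full ggplot implementation appears later in the post.

# Toy sketch: word-level self-similarity matrix for the nursery rhyme.
rhyme <- "Mary had a little lamb whose fleece was white as snow and everywhere that Mary went the lamb was sure to go"
words <- strsplit(tolower(rhyme), " ")[[1]]
sim <- outer(words, words, "==")   # TRUE wherever two word positions hold the same word
# image() draws the diagonal from bottom left to top right; the ggplot version
# later in the post flips the y axis so it runs top left to bottom right.
image(1 * sim, col = c("white", "dodgerblue4"), axes = FALSE,
      main = "Mary had a little lamb: self-similarity")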

Snowflake diagrams are a visual language for decoding lyrics. The verses are the gutters with only diagonal lines: a verse is not very repetitive apart from some stop words, although the verse structure repeats through the song. Many songs also have a bridge that contrasts with the rest of the song; in most songs the bridge is a unique pattern with its own self-similarity.

The diagram below visualises the lyrics of one of the most famous pop songs ever, Waterloo by Abba. The first 30 words are the opening verse, which shows little repetition other than stop words and the pronoun “I”. After that, we see diagonal lines appearing that represent the repetitive use of the song title. Towards the end of the song we see the bridge, which is like a little snowflake within the diagram.

Decoding lyrics: Waterloo by Abba.

The next section shows how to implement this approach with ggplot, scraping pop song lyrics from the azlyrics.com website.

Implementing Songsim with ggplot

The code below visualises song lyrics or poetry as suggested by Colin Morris. The code uses four libraries. I use the tidyverse series of libraries because it makes life very easy. The tidytext library uses the tidyverse principles to analyse text. The old reshape2 library helps to transform a matrix, and lastly, rvest helps to scrape song lyrics from the azlyrics website.

The first function scrapes song lyrics from the azlyrics website using the artist and song as input. The first three lines clean the artist and song variables. This code removes any character that is not a number or a letter, converts the text to lowercase and, lastly, removes the definite article from the artist name. These two fields are then concatenated to create the URL, which the function prints. The remainder of the code scrapes the lyrics from the website, or fails with a 404 error when it cannot find the artist/song combination.

The second function implements the Morris method to visualise the lyrics. The code extracts single words from the text and places them in a data frame (tibble). This data frame is converted to a boolean matrix that contains the visualisation.

The code looks at each word and places the value TRUE wherever that word reappears in the song. The resulting vectors are then bound together into a matrix. Lastly, ggplot visualises the matrix as a raster.

What does your favourite song look like as a snowflake diagram?

The Songsim code

You can view the code below or download the latest version from my GitHub repository.

## Decoding lyrics
library(tidyverse)
library(tidytext)
library(reshape2)
library(rvest)

# Scrape song lyrics from azlyrics.com for a given artist and song
get_lyrics <- function(artist, song) {
    artist <- gsub("[^A-Za-z0-9]+", "", tolower(artist))
    song <- gsub("[^A-Za-z0-9]+", "", tolower(song))
    artist <- gsub("^the", "", artist)
    url <- paste("http://azlyrics.com/lyrics/", artist, "/", song, ".html", sep = "")
    print(url)
    azlyrics <- read_html(url)
    lyrics <- html_nodes(azlyrics, "div")
    lyrics <- html_text(lyrics[22])
    gsub("\r|\n", " ", lyrics)
}

# Build the word-level self-similarity matrix and plot it as a raster
plot_snowflake <- function(artist, song){
    lyrics <- get_lyrics(artist, song)
    lyrics <- data_frame(line = lyrics) %>%
        filter(line != "")
    words <- lyrics %>%
        unnest_tokens(word, line)
    words_matrix <- lapply(1:nrow(words), function(w) {
        as.character(words[w, 1]) == words
    }) %>%
        do.call(cbind, .)
    rownames(words_matrix) <- 1:nrow(words)
    colnames(words_matrix) <- 1:nrow(words)
    melt(words_matrix, varnames = c("x", "y")) %>%
        ggplot(aes(x, -y, fill = value)) +
        geom_raster() +
        scale_fill_manual(values = c("white", "dodgerblue4"), guide = FALSE) +
        theme_void() +
        ggtitle(artist, subtitle = song)
}

plot_snowflake("Abba", "Waterloo")

The post Decode Lyrics in Pop Music: Visualise Prose with the Songsim algorithm appeared first on The Lucid Manager.


To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data – The Lucid Manager.

A Quick and Tidy Look at the 2018 GSS

Fri, 03/22/2019 - 16:04

(This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers)

The data from the 2018 wave of the General Social Survey was released during the week, leading to a flurry of graphs showing various trends. The GSS is one of the most important sources of information on various aspects of U.S. society. One of the best things about it is that the data is freely available for more than forty years’ worth of surveys. Here I’ll walk through my own quick look at the data, in order to show how R can tidily manage data from a complex survey. I decided to revisit a topic that came up a few years ago, in the New York Times and elsewhere, regarding the beliefs of young men about gender roles. The idea was that several surveys seemed to point to some increasing conservatism on this front. The GSS has a longstanding question named fefam:

It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family.

Respondents may answer that they Strongly Agree, Agree, Disagree, or Strongly Disagree with the statement (as well as refusing to answer, or saying they don’t know).

We’ll grab the GSS Cumulative Data File in Stata’s .dta format and work from there in R, using the tidyverse tools, Thomas Lumley’s survey package, and Greg Freedman Ellis’s srvyr package, which provides a set of wrappers to survey functions that allows them to be piped and worked with in a way familiar to tidyverse residents.

library(tidyverse)
library(forcats)
library(ggrepel)
library(haven)
library(survey)
library(srvyr)
library(tools)

We also load some libraries that aren’t strictly needed, but that will make our plots conform to the house style.

library(showtext)
showtext_auto()
library(myriad)
import_myriad_semi()

This is a quick-and-dirty function we’ll use to clean some age group labels we’ll create in a minute.

convert_agegrp <- function(x){
    x <- gsub("\\(", "", x)
    x <- gsub("\\[", "", x)
    x <- gsub("\\]", "", x)
    x <- gsub(",", "-", x)
    x <- gsub("-89", "+", x)
    regex <- "^(.*$)"
    x <- gsub(regex, "Age \\1", x)
    x
}

# Colour palettes used in the plots below
my_colors <- function(palette = "cb"){
    cb.palette <- c("#000000", "#E69F00", "#56B4E9", "#009E73",
                    "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
    rcb.palette <- rev(cb.palette)
    bly.palette <- c("#E69F00", "#0072B2", "#000000", "#56B4E9",
                     "#009E73", "#F0E442", "#D55E00", "#CC79A7")
    if (palette == "cb") return(cb.palette)
    else if (palette == "rcb") return(rcb.palette)
    else if (palette == "bly") return(bly.palette)
    else stop("Choose cb, rcb, or bly only.")
}

The names of some of the weighting and stratifying variables.

cont_vars <- c("year", "id", "ballot", "age")

cat_vars <- c("race", "sex", "fefam")

wt_vars <- c("vpsu",
             "vstrat",
             "oversamp",
             "formwt",              # weight to deal with experimental randomization
             "wtssall",             # weight variable
             "sampcode",            # sampling error code
             "sample")              # sampling frame and method

vars <- c(cont_vars, cat_vars, wt_vars)

Load the data into gss_all and create a small subset of it, gss_sm, containing just the variables of interest.

gss_all <- read_stata("data/GSS7218_R1.DTA")

gss_sm <- gss_all %>%
    select(one_of(vars))

Let’s take a look at it:

gss_sm

## # A tibble: 64,814 x 14
##     year    id ballot        age race     sex      fefam        vpsu
##  1  1972     1 NA(i) [IAP]    23 1 [whi…  2 [fem…  NA(i) [IAP]  NA(i) [IAP]
##  2  1972     2 NA(i) [IAP]    70 1 [whi…  1 [mal…  NA(i) [IAP]  NA(i) [IAP]
##  3  1972     3 NA(i) [IAP]    48 1 [whi…  2 [fem…  NA(i) [IAP]  NA(i) [IAP]
##  4  1972     4 NA(i) [IAP]    27 1 [whi…  2 [fem…  NA(i) [IAP]  NA(i) [IAP]
##  5  1972     5 NA(i) [IAP]    61 1 [whi…  2 [fem…  NA(i) [IAP]  NA(i) [IAP]
##  6  1972     6 NA(i) [IAP]    26 1 [whi…  1 [mal…  NA(i) [IAP]  NA(i) [IAP]
##  7  1972     7 NA(i) [IAP]    28 1 [whi…  1 [mal…  NA(i) [IAP]  NA(i) [IAP]
##  8  1972     8 NA(i) [IAP]    27 1 [whi…  1 [mal…  NA(i) [IAP]  NA(i) [IAP]
##  9  1972     9 NA(i) [IAP]    21 2 [bla…  2 [fem…  NA(i) [IAP]  NA(i) [IAP]
## 10  1972    10 NA(i) [IAP]    30 2 [bla…  2 [fem…  NA(i) [IAP]  NA(i) [IAP]
## # … with 64,804 more rows, and 6 more variables: vstrat, oversamp, formwt,
## #   wtssall, sampcode, sample

The read_stata() function has carried over the labelling information from Stata, which might be useful to us under other circumstances. Columns with, e.g., dbl+lbl designations behave like regular double-precision floating point (dbl) columns, but carry the label information as metadata.
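As a small aside, here is a toy illustration (my own example, not the GSS data) of how haven’s labelled vectors behave like plain numbers while carrying their labels as metadata:

library(haven)

x <- labelled(c(1, 2, 1), labels = c(male = 1, female = 2))
as.numeric(x)    # 1 2 1 -- behaves like an ordinary double vector
as_factor(x)     # male female male -- uses the label metadata
zap_labels(x)    # drops the metadata, keeps the numbers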

Now we clean up gss_sm a bit, discarding some of the label and missing-value information we don’t need. We also create some new variables: age quintiles, a variable flagging whether a respondent is 25 or younger, and a version of fefam recoded to a binary “Agree” or “Disagree” (with non-responses dropped).

qrts <- quantile(as.numeric(gss_sm$age), na.rm = TRUE)
quintiles <- quantile(as.numeric(gss_sm$age), probs = seq(0, 1, 0.2), na.rm = TRUE)

gss_sm <- gss_sm %>%
    modify_at(vars(), zap_missing) %>%
    modify_at(wt_vars, as.numeric) %>%
    modify_at(cat_vars, as_factor) %>%
    modify_at(cat_vars, fct_relabel, toTitleCase) %>%
    mutate(ageq = cut(x = age, breaks = unique(qrts), include.lowest = TRUE),
           ageq = fct_relabel(ageq, convert_agegrp),
           agequint = cut(x = age, breaks = unique(quintiles), include.lowest = TRUE),
           agequint = fct_relabel(agequint, convert_agegrp),
           year_f = droplevels(factor(year)),
           young = ifelse(age < 26, "Yes", "No"),
           fefam = fct_recode(fefam, NULL = "IAP", NULL = "DK", NULL = "NA"),
           fefam_d = fct_recode(fefam,
                                Agree = "Strongly Agree",
                                Disagree = "Strongly Disagree"),
           fefam_n = car::recode(fefam_d, "'Agree'=0; 'Disagree'=1;", as.factor = FALSE))

gss_sm$compwt <- with(gss_sm, oversamp * formwt * wtssall)
gss_sm$samplerc <- with(gss_sm, ifelse(sample %in% 3:4, 3,
                                ifelse(sample %in% 6:7, 6,
                                       sample)))

Now we need to take this data and use the survey variables in it, so we can properly calculate population means, standard errors, and so on. We use srvyr’s wrappers to survey for this:

options(survey.lonely.psu = "adjust")
options(na.action = "na.pass")

gss_svy <- gss_sm %>%
    filter(year > 1974) %>%
    drop_na(fefam_d) %>%
    mutate(stratvar = interaction(year, vstrat)) %>%
    as_survey_design(id = vpsu,
                     strata = stratvar,
                     weights = wtssall,
                     nest = TRUE)

Now gss_svy is a survey object:

## Stratified 1 - level Cluster Sampling design (with replacement)
## With (3585) clusters.
## Called via srvyr
## Sampling variables:
##  - ids: vpsu
##  - strata: stratvar
##  - weights: wtssall
## Data variables: year (dbl), id (dbl), ballot (dbl+lbl), age (dbl+lbl),
##   race (fct), sex (fct), fefam (fct), vpsu (dbl), vstrat (dbl),
##   oversamp (dbl), formwt (dbl), wtssall (dbl), sampcode (dbl),
##   sample (dbl), ageq (fct), agequint (fct), year_f (fct), young (chr),
##   fefam_d (fct), fefam_n (dbl), compwt (dbl), samplerc (dbl), stratvar (fct)

We’re in a position to draw some pictures of fefam trends now.

out_ff <- gss_svy %>%
    group_by(year, sex, young, fefam_d) %>%
    summarize(prop = survey_mean(na.rm = TRUE, vartype = "ci"))

out_ff

## # A tibble: 168 x 7
##     year sex    young fefam_d   prop prop_low prop_upp
##  1  1977 Male   No    Agree    0.726    0.685    0.766
##  2  1977 Male   No    Disagree 0.274    0.234    0.315
##  3  1977 Male   Yes   Agree    0.551    0.469    0.633
##  4  1977 Male   Yes   Disagree 0.449    0.367    0.531
##  5  1977 Female No    Agree    0.674    0.639    0.709
##  6  1977 Female No    Disagree 0.326    0.291    0.361
##  7  1977 Female Yes   Agree    0.415    0.316    0.514
##  8  1977 Female Yes   Disagree 0.585    0.486    0.684
##  9  1985 Male   No    Agree    0.542    0.496    0.587
## 10  1985 Male   No    Disagree 0.458    0.413    0.504
## # … with 158 more rows

facet_names <- c("No" = "Age Over 25 when surveyed",
                 "Yes" = "Age 18-25 when surveyed")

p <- ggplot(subset(out_ff, fefam_d == "Disagree"),
            aes(x = year, y = prop,
                ymin = prop_low, ymax = prop_upp,
                color = sex, group = sex, fill = sex)) +
    geom_line(size = 1.2) +
    geom_ribbon(alpha = 0.3, color = NA) +
    scale_x_continuous(breaks = seq(1978, 2018, 4)) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    scale_color_manual(values = my_colors("bly")[2:1],
                       labels = c("Men", "Women"),
                       guide = guide_legend(title = NULL)) +
    scale_fill_manual(values = my_colors("bly")[2:1],
                      labels = c("Men", "Women"),
                      guide = guide_legend(title = NULL)) +
    facet_wrap(~ young, labeller = as_labeller(facet_names), ncol = 1) +
    coord_cartesian(xlim = c(1977, 2017)) +
    labs(x = "Year",
         y = "Percent Disagreeing",
         subtitle = "Disagreement with the statement, ‘It is much better for\neveryone involved if the man is the achiever outside the\nhome and the woman takes care of the home and family’",
         caption = "Kieran Healy http://socviz.co.\nData source: General Social Survey") +
    theme(legend.position = "bottom")

ggsave("figures/fefam_svy.png", p, width = 8, height = 6, dpi = 300)

Let’s take a closer look at the age breakdown.

out_ff_agequint <- gss_svy %>%
    group_by(year, agequint, fefam_d) %>%
    summarize(prop = survey_mean(na.rm = TRUE, vartype = "se")) %>%
    mutate(end_label = if_else(year == max(year),
                               socviz::prefix_strip(as.character(agequint), "Age "),
                               NA_character_),
           start_label = if_else(year == min(year),
                                 socviz::prefix_strip(as.character(agequint), "Age "),
                                 NA_character_))

## Warning: Factor `agequint` contains implicit NA, consider using
## `forcats::fct_explicit_na`

p <- ggplot(subset(out_ff_agequint, fefam_d == "Disagree"),
            aes(x = year, y = prop,
                ymin = prop - prop_se, ymax = prop + prop_se,
                color = agequint, group = agequint, fill = agequint)) +
    geom_line(size = 1.2) +
    geom_ribbon(alpha = 0.3, color = NA) +
    scale_x_continuous(breaks = seq(1978, 2018, 4)) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    scale_fill_manual(values = man_cols) +
    scale_color_manual(values = man_cols) +
    guides(fill = FALSE, color = FALSE) +
    annotate("text", x = 1974.5, y = 0.58, label = "Age at time\nof survey",
             size = 3, hjust = 0, fontface = "bold", lineheight = 0.9) +
    annotate("text", x = 2020.8, y = 0.95, label = "Age at time\nof survey",
             size = 3, hjust = 1, fontface = "bold", lineheight = 0.8) +
    geom_label_repel(aes(label = end_label), color = "white", nudge_x = 1) +
    geom_label_repel(aes(label = start_label), color = "white", nudge_x = -1) +
    coord_cartesian(xlim = c(1976, 2019)) +
    labs(x = "Year", y = "Percent",
         title = "Changing Attitudes to Gender Roles, by Age Quintiles",
         subtitle = "Percent Disagreeing with the statement, ‘It is much better for everyone involved if the man is the\nachiever outside the home and the woman takes care of the home and family’",
         caption = "Kieran Healy http://socviz.co.\nData source: General Social Survey. Shaded ranges are population-adjusted standard errors for each age group.") +
    theme(legend.position = "right")

ggsave("figures/fefam_age_quin_svy.png", p, height = 7, width = 12, dpi = 300)

## Warning: Removed 100 rows containing missing values (geom_label_repel).
## Warning: Removed 100 rows containing missing values (geom_label_repel).

Finally, we can make a plot to get a sense of generational replacement and cohort effects. We’ll make two panels. First, a comparison of more or less the same cohort (though not of course the same individuals): these are people who answered the fefam question in 1977 when they were aged 18-25 and those who answered in 2018 and were aged 63 or older. We’ll also look at two very different cohorts: people who were over 63 in 1977, and people who were aged 18-25 in 2018.

cohort_comp <- gss_svy %>%
    filter(year %in% c(1977, 2018) &
           agequint %in% c("Age 18-29", "Age 63+")) %>%
    mutate(cohort = interaction(agequint, year),
           cohort = droplevels(cohort)) %>%
    group_by(cohort, fefam) %>%
    summarize(prop = survey_mean(na.rm = TRUE, vartype = "se")) %>%
    mutate(cohort = fct_relabel(cohort, ~ gsub("\\.", " in ", .x)),
           cohort = factor(cohort,
                           levels = c("Age 18-29 in 2018", "Age 63+ in 1977",
                                      "Age 18-29 in 1977", "Age 63+ in 2018"),
                           ordered = TRUE),
           compare = case_when(cohort %in% c("Age 18-29 in 1977", "Age 63+ in 2018") ~
                                   "Comparing Approximately the Same Cohort in 1977 and 2018",
                               cohort %in% c("Age 18-29 in 2018", "Age 63+ in 1977") ~
                                   "Comparing the Old in 1977 vs the Young in 2018"),
           end_label = if_else(fefam == "Strongly Disagree",
                               socviz::prefix_strip(as.character(cohort), "Age "),
                               NA_character_))

p <- ggplot(cohort_comp,
            aes(x = fefam, y = prop,
                group = cohort, color = cohort, fill = cohort,
                ymin = prop - prop_se, ymax = prop + prop_se)) +
    geom_point(size = 3) +
    geom_line(size = 1.2) +
    geom_ribbon(alpha = 0.2, color = NA) +
    scale_color_manual(values = my_colors()) +
    scale_fill_manual(values = my_colors()) +
    guides(fill = FALSE, color = FALSE) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    geom_label_repel(aes(label = end_label), fill = "white",
                     size = 2.2, segment.colour = NA, nudge_x = 0.6) +
    facet_wrap(~ compare) +
    labs(y = "Percent", x = NULL,
         title = "Generational Replacement, or, People Don't Change Much, They Just Get Old",
         subtitle = "Responses to the statement ‘It is much better for everyone involved if the man is the\nachiever outside the home and the woman takes care of the home and family’",
         caption = "Kieran Healy http://socviz.co.\nData source: General Social Survey. Shaded ranges are population-adjusted standard errors for each age group.")

ggsave("figures/fefam_age_quin_svy_synth.png", p, height = 7, width = 13, dpi = 300)

## Warning: Removed 12 rows containing missing values (geom_label_repel).

To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.
