Upcoming data preparation and modeling article series
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN.
vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects.
vtreat handles, in a statistically sound fashion:
 Missing values.
 Encoding of categorical values for regularized inference and machine learning techniques.
 Categorical variables with very many values.
 Novel categorical values (that is values not seen during training).
 Variable pruning.
 yaware scaling.
 Structured crossvalidation.
 Mitigating nested model bias.
In our (biased) opinion opinion vtreat has the best methodology and documentation for these important data cleaning and preparation steps. vtreat‘s current public opensource implementation is for inmemory R analysis (we are considering ports and certifying ports of the package some time in the future, possibly for: data.table, Spark, Python/Pandas, and SQL).
vtreat brings a lot of power, sophistication, and convenience to your analyses, without a lot of trouble.
A new feature of vtreat version 0.6.0 is called “custom coders.” WinVector LLC‘s Dr. Nina Zumel is going to start a short article series to show how this new interface can be used to extend vtreat methodology to include the very powerful method of partial pooled inference (a term she will spend some time clearly defining and explaining). Time permitting, we may continue with articles on other applications of custom coding including: ordinal/faithful coders, monotone coders, unimodal coders, and setvalued coders.
Please help us share and promote this article series, which should start in a couple of days. This should be a fun chance to share very powerful methods with your colleagues.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part9)
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
Statistics are often taught in school by and for people who like Mathematics. As a consequence, in those class emphasis is put on leaning equations, solving calculus problems and creating mathematics models instead of building an intuition for probabilistic problems. But, if you read this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistic problems by writing simple R query and how to think in probabilistic terms.
In this series of article I have tried to help you create an intuition on how probabilities work. To do so, we have been using simulations to see how concrete random situation can unfold and learn simple statistics and probabilistic concepts. In today’s set, I would like to show you some deceptively difficult situations that will challenge the way you understand probability and statistics. By doing so, you will practice the simulation technique we have seen in past set, refined your intuition and, hopefully help you avoid some pitfall when you do your own statistical analysis.
Answers to the exercises are available here.
For other parts of this exercise set follow the tag Hacking stats
Exercise 1
Suppose that there are exactly 365 days in a year and that the distribution of birthday in the population is uniform, meaning that the proportion of birth on any given day is the same throughout the year. In a group of 25 people, what is the probability that at least two individuals share the same birthday? Use a simulation to answer that question, then repeat this process for group of 0,10,20,…,90 and 100 people and plot the results.
Of course, when the group size is of 366 we know that the probability that two people share the same birthday is equal to 1, since there are more people than day in the year and for a group of zero person this probability is equal to 0. What is counterintuitive here is the rate at which the probability of observing this grow. From the graph we can see that with just about 23 people we have a probability of about 50% of observing two people having the same birthday and that a group of about 70 people will have almost 100% chance to see this happening.
Exercise 2
Here’s a problem that can someday save your life. Imagine you are a war prisoner in an Asian Communist country and your jailer is getting bored. So to past the time, they set up a Russian roulette game where you and another inmate play against one another. A jailer takes a sixshooter revolver, put two bullets in two consecutive chamber, spin the chamber and give the gun to your opponent, who place the gun to his temple and pull the trigger. Luckily for him, the chamber was empty and the gun is passed to you. Now you have a choice to make: you can let the chamber as it is and play or you can spin the chamber before playing. Use 10000 simulations of both choices to find which choice give you the highest probability to survive.
The key details in this problem is that the bullet are in consecutive chamber. This mean that if your opponent pulls the trigger on an empty chamber, and that you don’t spin the chamber, it’s impossible that you pull the trigger on the second bullet. You can only have an empty chamber of pull the trigger on the first bullet, which means that you have 25% chance of dying vs 2/6=33% chance of dying if you spin the chamber.
Exercise 3
What is the probability that a mother, whose is pregnant with nonidentical twin, give birth to two boys, if we know that one of the unborn child is a boy, but we cannot identifie which one is the boy?
 Work with about different binomial and logistic regression techniques
 Know how to compare regression models and choose the right fit
 And much more
Exercise 4
Two friends play head or tail to pass the time. To make this game more fun they decide to gamble pennies, so for each coin flip one friend call head or tail and if he calls right, he gets a penny and lose one otherwise. Let’s say that they have 40 and 30 pennies respectively and that they will play until someone has all the pennies.
 Create a function that simulate a complete game and return how many coin flip has been done and who win.
 In average, how many coin flip is needed before someone has all the pennies.
 Plot the histogram of the number of coin flipped during a simulation.
 What is the probability that someone wins a coin flip?
 What is the probability that each friend wins all the pennies? Why is it different than the probability of winning a single coin flip?
When the number of coin flip get high enough, the probability of someone winning often enough to get all the pennies rise to 100%. Maybe they will have to play 24h a day for weeks, but someday, someone will lose often enough to be penniless. In this context, the player who started with the most money have a huge advantage since they can survive a much longer losing streak than their opponent.
In fact, in this context where the probability of winning a single game is equal for each opponent the probability of winning all the money is equal to the proportion of the money they start with. That’s in part why the casino always win since they got more money than each gambler that plays against them, as long they get them to play long enough they will win. The fact that they propose game where they have greater chance to win help them quite a bit too.
Exercise 5
A classic counter intuitive is the Monty Hall problem. Here’s the scenario, if you never heard of it: you are on a game show where you can choose one of three doors and if a prize is hidden behind this door, you win this prize. Here’s the twist: after you choose a door, the game show host open one of the two other doors to show that there’s no prize behind it. At this point, you can choose to look behind the door you choose in the first place to see if there’s a prize or you can choose to switch door and look behind the door you left out.
 Simulate 10 000 games where you choose to look behind the first door you have chosen to estimate the probability of winning if you choose to look behind this door.
 Repeat this process, but this time choose to switch door.
 Why the probabilities are different?
When you pick the first door, you have 1/3 chance to have the right door. When the show host open one of the door you didn’t pick he gives you a huge amount of information on where the price is because he opened a door with no prize behind it. So the second door has more chance to hide the prize than the door you took in the first place. Our simulation tell us that this probability is about 1/2. So, you should always switch door since this gives you a higher probability of winning the prize.
To better understand this, imagine that the Grand Canyon is filled with small capsule with a volume of a cube centimeter. Of all those capsules only one has a piece of paper and if you pick this capsule, you win a 50% discount on a tie. You choose a capsule at random and then all the other trillion capsules are discarded except one, such than the winning capsule is still in play. Assuming you really want this discount, which capsule would you choose?
Exercise 6
This problem is a real life example of a statistical pitfall that can easily be encountered in real life and has been published by Steven A. Julious and Mark A. Mullee. In this dataset, we can see if a a medical treatment for kidney stone has been effective. There are two treatments that can be used: treatment A which include all open surgical procedure and treatment B which include small puncture surgery and the kidney stone are classified in two categories depending on his size, small or large stones.
 Compute the success rate (number of success/total number of cases) of both treatments.
 Which treatment seems the more successful?
 Create a contingency table of the success.
 Compute the success rate of both treatments when treating small kidney stones.
 Compute the success rate of both treatments when treating large kidney stones.
 Which treatment is more successful for small kidney stone? For large kidney stone?
This is an example of the Simpson paradox, which is a situation where an effect appears to be present for the set of all observations, but disappears when the observations are categorized and the analysis is done on each group. It is important to test for this phenomenon since in practice most observations can be classified in sub classes and, as the last example showed, this can change drastically the result of your analysis.
Exercise 7
 Download this dataset and do a linear regression with the variable X and Y. Then, compute the slope of the trend line of the regression.
 Do a scatter plot of the variable X and Y and add the trend line to the graph.
 Repeat this process of each of the three categories.
We can see that the general trend of the data is different from the trends of each of the categories. In other words, the Simpson paradox can also be observed in a regression context. The moral of the story is: make sure that all the variables are included in your analysis or you gonna have a bad time!
Exercise 8
For this problem you must know what’s a true positive, false positive, true negative and false negative in a classification problem. You can look at this page for a quick review of those concepts.
A big data algorithm has been developed to detect potential terrorist by looking at their behavior on the internet, their consummation habit and their traveling. To develop this classification algorithm, the computer scientist used data from a population where there’s a lot of known terrorist since they needed data about the habits of real terrorist to validate their work. In this dataset, you will find observations from this high risk population and observations taken from a low risk population.
 Compute the true positive rate, the false positive rate, the true negative rate and the false negative rate of this algorithm for the population that has a high risk of terrorism.
 Repeat this process for the remaining observations. Is there a difference between those rate?
It is a known fact that false positive rate are a lot higher in lowincidence population and this is known as . Basically, when the incidence of a certain condition in the population is lower than the average false positive rate of a test, using that test on this population will result in a much higher false positive cases than usual. This is in part due to the fact that the diminution of true positive case make the proportion of false positive so much higher. As a consequence: don’t trust to much your classification algorithm!
Exercise 9
 Generate a population of 10000 values from a normal distribution of mean 42 and standard deviation of 10.
 Create a sample of 10 observations and estimate the mean of the population. Repeat this 200 times.
 Compute the variation of the estimation.
 Create a sample of 50 observations and estimate the mean of the population. Repeat this 200 times and compute the variation of these estimations.
 Create a sample of 100 observations and estimate the mean of the population. Repeat this 200 times and compute the variation of these estimations.
 Create a sample of 500 observations and estimate the mean of the population. Repeat this 200 times and compute the variation of these estimations.
 Plot the variance of the estimation of the means done with different sample size.
As you can see, the variance of the estimation of the mean is inversely proportional to the sample size, but this is not a linear relationship. A small sample can create an estimation that is a lot farther to the real value than a sample with more observations. Let’s see why this information is relevant to this set.
Exercise 10
A private school advertise that their small size help their student achieve better grade. In their advertisement, they claim that last year’s students have had an average 5 points higher than the average at the standardize state’s test and since no large school has such a high average, that’s proof that small school help student achieve better results.
Suppose that there is 200000 students in the state, their results at the state test was distributed normally with a mean of 76% and a standard deviation of 15, the school had 100 students and that an average school count 750 student. Does the school claim can be explained statistically?
A school can be seen as a sample of the population of student. A large school, like a large sample, has a lot more chance to be representative of the student’s population and their average score will often be near the population average, while small school can show average a lot more extreme just because they have a smaller body of student. I’m not saying that no school are better than other, but we must look at a lot of results to be sure we are not only in presence of a statistical abnormality.
Related exercise sets: Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part8)
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part6)
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part4)
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
How Random Forests improve simple Regression Trees?
(This article was first published on R – insightR, and kindly contributed to Rbloggers)
By Gabriel Vasconcelos
Regression TreesIn this post I am going to discuss some features of Regression Trees an Random Forests. Regression Trees are know to be very unstable, in other words, a small change in your data may drastically change your model. The Random Forest uses this instability as an advantage through bagging (you can see details about bagging here) resulting on a very stable model.
The first question is how a Regression Tree works. Suppose, fore example, that we have the number of points scored by a set of basketball players and we want to relate it to the player’s weight an height. The Regression Tree will simply split the heightweight space and assign a number of points to each partition. The figure below shows two different representations for a small tree. In the left we have the tree itself and in the right how the space is partitioned (the blue line shows the first partition and the red lines the following partitions). The numbers in the end of the tree (and in the partitions) represent the value of the response variable. Therefore, if a basketball player is higher than 1.85 meters and weights more than 100kg it is expected to score 27 points (I invented this data =] ).
You might be asking how I chose the partitions. In general, in each node the partition is chosen through a simple optimization problem to find the best pair variableobservation based on how much the new partition reduces the model error.
What I want to illustrate here is how unstable a Regression Tree can be. The package tree has some examples that I will follow here with some small modifications. The example uses computer CPUs data and the objective is to build a model for the CPU performance based on some characteristics. The data has 209 CPU observations that will be used to estimate two Regression Trees. Each tree will be estimate from a random resample with replacement. Since the data comes from the same place, it would be desirable to have similar results on both models.
library(ggplot2) library(reshape2) library(tree) library(gridExtra) data(cpus, package = "MASS") # = Load Data # = First Tree set.seed(1) # = Seed for Replication tree1 = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus[sample(1:209, 209, replace = TRUE), ]) plot(tree1); text(tree1) # = Second Tree set.seed(10) tree2 = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus[sample(1:209,209, replace=TRUE), ]) plot(tree2); text(tree2)
As you can see, the two trees are different from the start. We can use some figures to verify. First let us calculate the predictions of each model in the real data (not the resample). The first figure is a scatterplot of both predictions and the second figure shows their boxplots. Although the scatterplot shows some relation between the two predictions, it is far from good.
# = Calculate predictions pred = data.frame(p1 = predict(tree1, cpus) ,p2 = predict(tree2, cpus)) # = Plots g1 = ggplot(data = pred) + geom_point(aes(p1, p2)) g2 = ggplot(data = melt(pred)) + geom_boxplot(aes(variable, value)) grid.arrange(g1, g2, ncol = 2) Random ForestAs mentioned before, the Random Forest solves the instability problem using bagging. We simply estimate the desired Regression Tree on many bootstrap samples (resample the data many times with replacement and reestimate the model) and make the final prediction as the average of the predictions across the trees. There is one small (but important) detail to add. The Random Forest adds a new source of instability to the individual trees. Every time we calculate a new optimal variableobservation point to split the tree, we do not use all variables. Instead, we randomly select 2/3 of the variables. This will make the individual trees even more unstable, but as I mentioned here, bagging benefits from instability.
The question now is: how much improvement do we get from the Random Forest. The following example is a good illustration. I broke the CPUs data into a training sample (first 150 observations) and a test sample (remaining observations) and estimated a Regression Tree and a Random Forest. The performance is compared using the mean squared error.
library(randomForest) # = Regression Tree tree_fs = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus[1:150, ]) # = Random Forest set.seed(1) # = Seed for replication rf = randomForest(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data=cpus[1:150, ], nodesize = 10, importance = TRUE) # = Calculate MSE mse_tree = mean((predict(tree_fs, cpus[c(1:150), ])  log(cpus$perf)[c(1:150)])^2) mse_rf = mean((predict(rf, cpus[c(1:150), ])  log(cpus$perf[c(1:150)]))^2) c(rf = mse_rf, tree = mse_tree) ## rf tree ## 0.2884766 0.5660053As you can see, the Regression Tree has an error twice as big as the Random Forest. The only problem is that by using a combination of trees any kind of interpretation becomes really hard. Fortunately, there are importance measures that allow us to at least know which variables are more relevant in the Random Forest. In our case, both importance measures pointed to the cache size as the most important variable.
importance(rf) ## %IncMSE IncNodePurity ## syct 22.60512 22.373601 ## mmin 19.46153 21.965340 ## mmax 24.84038 27.239772 ## cach 27.92483 33.536185 ## chmin 13.77196 13.352793 ## chmax 17.61297 8.379306Finally, we can see how the model error decreases as we increase the number of trees in the Random Forest with the following code:
plot(rf)If you liked this post, you can find more details on Regression Trees and Random forest in the book Elements of Statistical learning, which can be downloaded direct from the authors page here.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – insightR. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Welcome to R/exams
(This article was first published on R/exams, and kindly contributed to Rbloggers)
Welcome everybody, we are proud to introduce the brand new web page and
blog http://www.Rexams.org/. This provides a central access point for
the opensource software “exams” implemented in the
R system for statistical computing.
R/exams is a oneforall exams generator using (potentially) dynamic
exercises written in R/Markdown or R/LaTeX and it can export a variety of
formats from learning management systems to classical written exams.
This post gives a brief overview of what has happened so far and what
you can expect to find here in the next months.
The package was originally started more than a decade ago to facilitate
classical written exams with randomized exercises for largelecture courses.
Like many other teachers of introductory statistics and mathematics courses,
we were in need of infrastructure for conducting written exams with about 1,0001,500
students. A team effort of colleagues at WU Wirtschaftsuniversität Wien
lead to a large collection of dynamic exercises and the software was eventually
released at https://CRAN.Rproject.org/package=exams.
Over the years learning management systems (like
Moodle,
Blackboard,
OLAT, etc.)
became easily available at virtually every university, creating a desire to
employ the same dynamic exercises also for online tests and quizzes. Hence,
the R/exams infrastructure was completely reimplemented allowing to export
the same exercises not only to written exams (with automatic scanning
and evaluation) but also to learning management systems, the live voting
system ARSnova, as well as customizable standalone
files in PDF, HTML, Docx, and beyond.
Despite (or rather because of) the flexibility of the software, novice R/exams
users often struggled with adopting it because the documentation provided in
the package is either somewhat technical and/or targeted towards more experienced
R users.
Hence, this web page and blog make it easy for new users to explore the possibilities
of R/exams before reading about a lot of technical details. It also provides accessible
guides to common tasks and examples for dynamic exercises with different complexity.
For a first tour you can check out the oneforall approach of
the package based on (potentially) dynamic exercise templates
for generating large numbers of personalized exams/quizzes/tests, either for
elearning or classical written exams (with
automatic evaluation).
Some tutorials already explain the installation of R/exams (with
dependencies like LaTeX or pandoc) and the first steps in writing dynamic exercises
using either Markdown or LaTeX markup along with randomization in R. There is
also a gallery of exercise templates, ranging from basic multiplechoice
questions with simple shuffling to more advanced exercises with random data, graphics,
etc.
For the next months we plan to write more tutorial blog posts that help to accomplish
common tasks, e.g., handson guidance for exporting exercises from R to Moodle or
tips how to write good randomized exercises. If you want to give us feedback or ask
us to cover certain topics please feel free to contact us – be it via email,
discussion forum, or twitter. Also if you want to link R/examsbased materials or share
share experiences of using R/exams in a guest post, please let us know.
Big shout to all contributors that helped R/exams to grow so much
over the last years. A special thank you goes to Patrik Keller, Hanna Krimm, and Reto
Stauffer who established the infrastructure for this web page (using R/Markdown and Jekyll)
and designed graphics/icons/photos/etc. Thanks also to everyone sharing this post,
especially on http://www.Rbloggers.com/ and https://RWeekly.org/.
To leave a comment for the author, please follow the link and comment on their blog: R/exams. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Big Data Analytics with H20 in R Exercises Part 1
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
We have dabbled with RevoScaleR before , In this exercise we will work with H2O , another high performance R library which can handle big data very effectively .It will be a series of exercises with increasing degree of difficulty . So Please do this in sequence .
H2O requires you to have Java installed in your system .So please install Java before trying with H20 .As always check the documentation before trying these exercise set .
Answers to the exercises are available here.
If you want to install the latest release from H20 , install it via this instructions .
Exercise 1
Download the latest stable release from h20 and initialize the cluster
Exercise 2
Check the cluster information via clusterinfo
Exercise 3
You can see how h2o works via the demo function , Check H2O’s glm via demo method .
Exercise 4
down load the loan.csv from H2O’s github repo and import it using H2O .
Exercise 5
Check the type of imported loan data and notice that its not a dataframe , check the summary of the loan data .
Hint use h2o.summary()
Exercise 6
One might want to transfer a dataframe from R environment to H2O , use as.h2o to conver the mtcars dataframe as a H2OFrame
 work with different data import techniques,
 know how to import data and transform it for a specific moddeling or analysis goal,
 and much more.
Exercise 7
Check the dimension of the loan H2Oframe via h2o.dim
Exercise 8
Find the colnames from the H2OFrame of loan data.
Exercise 9
Check the histogram of the loan amount of loan H2Oframe .
Exercise 10
Find the mean of loan amount by each home ownership group from the loan H2OFrame
 Big Data analytics with RevoScaleR Exercises2
 Data wrangling : I/O (Part1)
 DensityBased Clustering Exercises
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Tutorial: Launch a Spark and R cluster with HDInsight
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
If you'd like to get started using R with Spark, you'll need to set up a Spark cluster and install R and all the other necessary software on the nodes. A really easy way to achieve that is to launch an HDInsight cluster on Azure, which is just a managed Spark cluster with some useful extra components. You'll just need to configure the components you'll need, in our case R and Microsoft R Server, and RStudio Server.
This tutorial explains how to launch an HDInsight cluster for use with R. It explains how to size the cluster and launch the cluster, connect to it via SSH, install Microsoft R Server (with R) on each of the nodes, and install RStudio Server community edition to use as an IDE on the edge node. (If you find you need a larger or smaller cluster after you've set it up, it's easy to resize the cluster dynamically.) Once you have the cluster up an running, here are some things you can try:
 Import and explore data within the Spark cluster
 Use the sparklyr package to interact with Spark Data Frames from RStudio
 Operationalize R functions on the Spark cluster for remote calling
 Train statistical models on very large data sets with the RevoScaleR package
To get started with R and Spark, follow the instructions for setting up a HDInsight cluster at the link below.
Microsoft Azure Documentation: Get started using R Server on HDInsight
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
MultiDimensional Reduction and Visualisation with tSNE
(This article was first published on R Programming – DataScience+, and kindly contributed to Rbloggers)
tSNE is a very powerful technique that can be used for visualising (looking for patterns) in multidimensional data. Great things have been said about this technique. In this blog post I did a few experiments with tSNE in R to learn about this technique and its uses.
Its power to visualise complex multidimensional data is apparent, as well as its ability to cluster data in an unsupervised way.
What’s more, it is also quite clear that tSNE can aid machine learning algorithms when it comes to prediction and classification. But the inclusion of tSNE in machine learning algorithms and ensembles has to be ‘crafted’ carefully, since tSNE was not originally intended for this purpose.
All in all, tSNE is a powerful technique that merits due attention.
tSNELet’s start with a brief description. tSNE stands for tDistributed Stochastic Neighbor Embedding and its main aim is that of dimensionality reduction, i.e., given some complex dataset with many many dimensions, tSNE projects this data into a 2D (or 3D) representation while preserving the ‘structure’ (patterns) in the original dataset. Visualising highdimensional data in this way allows us to look for patterns in the dataset.
tSNE has become so popular because:
 it was demonstrated to be useful in many situations,
 it’s incredibly flexible, and
 can often find structure where other dimensionalityreduction algorithms cannot.
A good introduction on tSNE can be found here. The original paper on tSNE and some visualisations can be found at this site. In particular, I like this site which shows how tSNE is used for unsupervised image clustering.
While tSNE itself is computationally heavy, a faster version exists that uses what is known as the BarnesHut approximation. This faster version allows tSNE to be applied on large realworld datasets.
tSNE for Exploratory Data AnalysisBecause tSNE is able to provide a 2D or 3D visual representation of highdimensional data that preserves the original structure, we can use it during initial data exploration. We can use it to check for the presence of clusters in the data and as a visual check to see if there is some ‘order’ or some ‘pattern’ in the dataset. It can aid our intuition about what we think we know about the domain we are working in.
Apart from the initial exploratory stage, visualising information via tSNE (or any other algorithm) is vital throughout the entire analysis process – from those first investigations of the raw data, during data preparation, as well as when interpreting the outputs of machine learning algorithms and presenting insights to others. We will see further on that we can use tSNE even during the predcition/classification stage itself.
Experiments on the Optdigits datasetIn this post, I will apply tSNE to a wellknown dataset, called optdigits, for visualisation purposes.
The optdigits dataset has 64 dimensions. Can tSNE reduce these 64 dimensions to just 2 dimension while preserving structure in the process? And will this structure (if present) allow handwritten digits to be correctly clustered together? Let’s find out.
tsne packageWe will use the tsne package that provides an exact implementation of tSNE (not the BarnesHut approximation). And we will use this method to reduce dimensionality of the optdigits data to 2 dimensions. Thus, the final output of tSNE will essentially be an array of 2D coordinates, one per row (image). And we can then plot these coordinates to get the final 2D map (visualisation). The algorithm runs in iterations (called epochs), until the system converges. Every
\(K\) number of iterations and upon convergence, tSNE can call a usersupplied callback function, and passes the list of 2D coordinates to it. In our callback function, we plot the 2D points (one per image) and the corresponding class labels, and colourcode everything by the class labels.
The images below show how the clustering improves as more epochs pass.
As one can see from the above diagrams (especially the last one, for epoch 1000), tSNE does a very good job in clustering the handwritten digits correctly.
But the algorithm takes some time to run. Let’s try out the more efficient BarnesHut version of tSNE, and see if we get equally good results.
Rtsne packageThe Rtsne package can be used as shown below. The perplexity parameter is crucial for tSNE to work correctly – this parameter determines how the local and global aspects of the data are balanced. A more detailed explanation on this parameter and other aspects of tSNE can be found in this article, but a perplexity value between 30 and 50 is recommended.
traindata < read.table("optdigits.tra", sep=",") trn < data.matrix(traindata) require(Rtsne) # perform dimensionality redcution from 64D to 2D tsne < Rtsne(as.matrix(trn[,1:64]), check_duplicates = FALSE, pca = FALSE, perplexity=30, theta=0.5, dims=2) # display the results of tSNE cols < rainbow(10) plot(tsne$Y, t='n') text(tsne$Y, labels=trn[,65], col=cols[trn[,65] +1])Note how clearlydefined and distinctly separable the clusters of handwritten digits are. We have only a minimal amount of incorrect entries in the 2D map.
Of Manifolds and the Manifold AssumptionHow can tSNE achieve such a good result? How can it ‘drop’ 64dimensional data into just 2 dimensions and still preserve enough (or all) structure to allow the classes to be separated?
The reason has to do with the mathematical concept of manifolds. A Manifold is a \(d\)dimensional surface that lives in an \(D\)dimensional space, where \(d < D\). For the 3D case, imagine a 2D piece of paper that is embedded within 3D space. Even if the piece of paper is crumpled up extensively, it can still be ‘unwrapped’ (uncrumpled) into the 2D plane that it is. This 2D piece of paper is a manifold in 3D space. Or think of an entangled string in 3D space – this is a 1D manifold in 3D space.
Now there’s what is known as the Manifold Hypothesis, or the >Manifold Assumption, that states that natural data (like images, etc.) forms lower dimensional manifolds in their embedding space. If this assumption holds (there are theoretical and experimental evidence for this hypothesis), then tSNE should be able to find this lowerdimensional manifold, ‘unwrap it’, and present it to us as a lowerdimensional map of the original data.
tSNE vs. SOMtSNE is actually a tool that does something similar to SelfOrganising Maps (SOMs), though the underlying process is quite different. We have used a SOM on the optdigits dataset in a previous blog and obtained the following unsupervised clustering shown below.
The tSNE map is more clear than that obtained via a SOM and the clusters are separated much better.
tSNE for Shelter Animals datasetIn a previous blog, I applied machine learning algorithms for predicting the outcome of shelter animals. Let’s now apply tSNE to the dataset – I am using the cleaned and modified data as described in this blog entry.
trn < read.csv('train.modified.csv.zip') trn < data.matrix(trn) require(Rtsne) # scale the data first prior to running tSNE trn[,1] < scale(trn[,1]) tsne < Rtsne(trn[,1], check_duplicates = FALSE, pca = FALSE, perplexity=50, theta=0.5, dims=2) # display the results of tSNE cols < rainbow(5) plot(tsne$Y, t='n') text(tsne$Y, labels=as.numeric(trn[,1]), col=cols[trn[,1]])Here, while it is evident that there is some structure and patterns to the data, clustering by OutcomeType has not happened.
Let’s now use tSNE to perform dimensionality reduction to 3D on the same dataset. We just need to set the dims parameter to 3 instead of 2. And we will use package rgl for plotting the 3D map produced by tSNE.
tsne < Rtsne(trn[,1], check_duplicates = FALSE, pca = FALSE, perplexity=50, theta=0.5, dims=2) #display results of tSNE require(rgl) plot3d(tsne$Y, col=cols[trn[,1]]) legend3d("topright", legend = '0':'5', pch = 16, col = rainbow(5))This 3D map has a richer structure than the 2D version, but again the resulting clustering is not done by OutcomeType.
One possible reason could be that the Manifold Assumption is failing for this dataset when trying to reduce to 2D and 3D. Please note that the Manifold Assumption cannot be always true for a simple reason. If it were, we could take an arbitrary set of \(N\)dimensional points and conclude that they lie on some \(M\)dimensional manifold (where \(M < N\)). We could then take those \(M\)dimensional points and conclude that they lie on some other lowerdimensional manifold (with, say, \(K\) dimensions, where \(K < M\)). We could repeat this process indefinitely and eventually conclude that our arbitrary \(N\)dimensional dataset lies on a 1D manifold. Because this cannot be true for all datasets, the manifold assumption cannot always be true. In general, if you have \(N\)dimensional points being generated by a process with \(N\) degrees of freedom, your data will not lie on a lowerdimensional manifold. So probably the Animal Shelter dataset has degrees of freedom higher than 3.
So we can conclude that tSNE did not aid much the initial exploratory analysis for this dataset.
tSNE and machine learning?One disadvantage of tSNE is that there is currently no incremental version of this algorithm. In other words, it is not possible to run tSNE on a dataset, then gather a few more samples (rows), and “update” the tSNE output with the new samples. You would need to rerun tSNE from scratch on the full dataset (previous dataset + new samples). Thus tSNE works only in batch mode.
This disadvantage appears to makes it difficult for tSNE to be used in a machine learning system. But as we will see in a future post, it is still possible to use tSNE (with care) in a machine learning solution. And the use of tSNE can improve classification results, sometimes markedly. The limitation is the extra nonrealtime processing brought about by tSNE’s batch mode nature.
Stay tuned.
Related Post
 Comparing Trump and Clinton’s Facebook pages during the US presidential election, 2016
 Analyzing Obesity across USA
 Can we predict flu deaths with Machine Learning and R?
 Graphical Presentation of Missing Data; VIM Package
 Creating Graphs with Python and GooPyCharts
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
My advice on dplyr::mutate()
There are substantial differences between adhoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: adhoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.
Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.
“Character is what you are in the dark.”
John Whorfin quoting Dwight L. Moody.
I have found: to deliver production worthy data science and predictive analytic systems, one has to develop perteam and perproject field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.
What I want to do is share a single small piece of WinVector LLC‘s current guidance on using the R package dplyr.
 Disclaimer: WinVector LLC has no official standing with RStudio, or dplyr development.

However:
“One need not have been Caesar in order to understand Caesar.”
Alternately: Georg Simmmel or Max Webber.
WinVector LLC, as a consultancy, has experience helping large companies deploy enterprise big data solutions involving R, dplyr, sparklyr, and Apache Spark. WinVector LLC, as a training organization, has experience in how new users perceive, reason about, and internalize how to use R and dplyr. Our group knows how to help deploy production grade systems, and how to help new users master these systems.
From experience we have distilled a lot of best practices. And below we will share one.
From: “R for Data Science; Whickham, Grolemund; O’Reilly, 2017” we have:
Note that you can refer to columns that you’ve just created:
mutate(flights_sml, gain = arr_delay  dep_delay, hours = air_time / 60, gain_per_hour = gain / hours )Let’s try that with database backed data:
suppressPackageStartupMessages(library("dplyr")) packageVersion("dplyr") # [1] ‘0.7.3’ db < DBI::dbConnect(RSQLite::SQLite(), ":memory:") flights < copy_to(db, nycflights13::flights, 'flights') mutate(flights, gain = arr_delay  dep_delay, hours = air_time / 60, gain_per_hour = gain / hours ) # # Source: lazy query [?? x 22] # # Database: sqlite 3.19.3 [:memory:] # year month day dep_time sched_dep_time ... # ... # 1 2013 1 1 517 515 ... # ...That worked. One of the selling points of dplyr is a lot of dplyr is sourcegeneric or sourceagnostic: meaning it can be run against different data providers (inmemory, databases, Spark).
However, if a new user tries to extend such an example (say adding gain_per_minutes) they run into this:
mutate(flights, gain = arr_delay  dep_delay, hours = air_time / 60, gain_per_hour = gain / hours, gain_per_minute = 60 * gain_per_hour ) # Error in rsqlite_send_query(conn@ptr, statement) : # no such column: gain_per_hourIt is hard for experts to understand how frustrating the above is to a new R user or to a part time R user. It feels like any variation on the original code causes it to fail. None of the rules they have been taught anticipate this, or tell them how to get out of this situation.
This quickly leads to strong feelings of learned helplessness and anxiety.
Our rule for dplyr::mutate() has been for some time:
Each column name used in a single mutate must appear only on the lefthandside of a single assignment, or otherwise on the righthandside of any number of assignments (but never both sides, even if it is different assignments).
Under this rule neither of the above mutates() are allowed. The second should be written as (switching to pipenotation):
flights %>% mutate(gain = arr_delay  dep_delay, hours = air_time / 60) %>% mutate(gain_per_hour = gain / hours) %>% mutate(gain_per_minute = 60 * gain_per_hour)And the above works.
If we teach this rule we can train users to be properly cautious, and hopefully avoid them becoming frustrated, scared, anxious, or angry.
dplyr documentation (such as “help(mutate)“) does not strongly commit to what order mutate expressions are executed in, or visibility and durability of intermediate results (i.e., a full description of intended semantics). Our rule intentionally limits the user to a set of circumstances where none of those questions matter.
Now the error we saw above is a mere bug that one expects will be fixed some day (in fact it is dplyr issue 3095, we looked a bit at the generate queries here). It can be a bit unfair to criticize a package for having a bug.
However, confusion around reuse of column names has been driving dplyr issues for quite some time:
 dplyr issue 3095
 dplyr issue 2884
 dplyr issue 2883
 dplyr pull 2869
 dplyr issue 2842
 dplyr pull 2483
 dplyr issue 2481
 dplyr issue 2360
It makes sense to work in a reliable and teachable subdialect of dplyr that will serve users well (or barring that, you can use an adapter, such as seplyr). In production you must code to what systems are historically reliably capable of, not just the specification.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Mining USPTO full text patent data – Analysis of machine learning and AI related patents granted in 2017 so far – Part 1
(This article was first published on Alexej's blog, and kindly contributed to Rbloggers)
The United States Patent and Trademark office (USPTO) provides immense amounts of data (the data I used are in the form of XML files). After coming across these datasets, I thought that it would be a good idea to explore where and how my areas of interest fall into the intellectual property space; my areas of interest being machine learning (ML), data science, and artificial intelligence (AI).
I started this exploration by downloading the full text data (excluding images) for all patents that were assigned by the USPTO within the year 2017 up to the time of writing (Patent Grant Full Text Data/XML for the year 2017 through the week of Sept 12 from the USPTO Bulk Data Storage System).
In this blog post – “Part 1” – I address questions such as: How many ML and AI related patents were granted? Who are the most prolific inventors? The most frequent patent assignees? Where are inventions made? And when? Is the number of ML and AI related patents increasing over time? How long does it take to obtain a patent for a ML or AI related invention? Is the patent examination time shorter for big tech companies? etc.
“Part 2” will be an in depth analysis of the language used in patent titles, descriptions, and claims, and “Part 3” will be on experimentation with with deep neural nets applied to the patents’ title, description, and claims text.
Identifying patents related to machine learning and AIFirst, I curated a patent full text dataset consisting of “machine learning and AI related” patents.
I am not just looking for instances where actual machine learning or AI algorithms were patented; I am looking for inventions which are related to ML or AI in any/some capacity. That is, I am interested in patents where machine learning, data mining, predictive modeling, or AI is utilized as a part of the invention in any way whatsoever. The subset of relevant patents was determined by a keyword search as specified by the following definition.
Definition: For the purposes of this blog post, a machine learning or AI related patent is a patent that contains at least one of the keywords
“machine learning”, “deep learning”, “neural network”, “artificial intelligence”, “statistical learning”, “data mining”, or “predictive model”
in its invention title, description, or claims text (while of course accounting for capitalization, pluralization, etc.).1
With this keyword matching approach a total of 6665 patents have been selected. The bar graph below shows how many times each keyword got matched.
Interestingly the term “neural network” is even more common than the more general terms “machine learning” and “artificial intelligence”.
Some example patentsHere are three (randomly chosen) patents from the resulting dataset. For each printed are the invention title, the patent assignee, as well as one instance of the keyword match within the patent text.
## $`2867` ## [1] "Fuselage manufacturing system" ## [2] "THE BOEING COMPANY" ## [3] "... using various techniques. For example, at least ## one of an artificial intelligence program, a ## knowledgebase, an expert ..." ## ## $`1618` ## [1] "Systems and methods for validating wind farm ## performance measurements" ## [2] "General Electric Company" ## [3] "... present disclosure leverages and fuses ## accurate available sensor data using machine ## learning algorithms. That is, the more ..." ## ## $`5441` ## [1] "Trigger repeat order notifications" ## [2] "Accenture Global Services Limited" ## [3] "... including user data obtained from a user ## device; obtaining a predictive model that ## estimates a likelihood of ..."And here are three examples of (randomly picked) patents that contain the relevant keywords directly in their invention title.
## $`5742` ## [1] "Adaptive demodulation method and apparatus using an ## artificial neural network to improve data recovery ## in high speed channels" ## [2] "Kelquan Holdings Ltd." ## [3] "... THE INVENTION\nh0002\n1 The present invention ## relates to a neural network based integrated ## demodulator that mitigates ..." ## ## $`3488` ## [1] "Clustertrained machine learning for image processing" ## [2] "Amazon Technologies, Inc." ## [3] "... BACKGROUND\nh0001\n1 Artificial neural networks, ## especially deep neural network ..." ## ## $`3103` ## [1] "Methods and apparatus for machine learning based ## malware detection" ## [2] "Invincea, Inc." ## [3] "... a need exists for methods and apparatus that can ## use machine learning techniques to reduce the amount ..." Who holds these patents (inventors and assignees)?The first question I would like to address is who files most of the machine learning and AI related patents.
Each patent specifies one or several inventors, i.e., the individuals who made the patented invention, and a patent assignee which is typically the inventors’ employer company that holds the rights to the patent. The following bar graph visualizes the top 20 most prolific inventors and the top 20 most frequent patent assignees among the analyzed ML and AI related patents.
It isn’t surprising to see this list of companies. The likes of IBM, Google, Amazon, Microsoft, Samsung, and AT&T rule the machine learning and AI patent space. I have to admit that I don’t recognize any of the inventors’ names (but it might just be me not being familiar enough with the ML and AI community).
There are a number of interesting followup questions which for now I leave unanswered (hard to answer without additional data):
 What is the count of ML and AI related patents by industry or type of company (e.g., big tech companies vs. startups vs. reserach universities vs. government)?
 Who is deriving the most financial benefit by holding ML or AI related patents (either through licensing or by driving out the competition)?
Even though the examined patents were filed in the US, some of the inventions may have been made outside of the US.
In fact, the data includes specific geographic locations for each patent, so I can map in which cities within the US and the world inventors are most active.
The following figure is based on where the inventors are from, and shows the most active spots. Each point corresponds to the total number of inventions made at that location (though note that the color axis is a log10 scale, and so is the point size).
The results aren’t that surprising.
However, we see that most (ML and AI related) inventions patented with the USPTO were made in the US. I wonder if inventors in other countries prefer to file patents in their home countries’ patent offices rather the in the US.
Alternatively, we can also map the number of patents per inventors’ origin countries.
Sadly, there seem to be entire groups of countries (e.g., almost the entire African continent) which seem to be excluded from the USPTO’s patent game, at least with respect to machine learning and AI related inventions.
Whether it is a lack of access, infrastructure, education, political treaties or something else is an intriguing question.
Each patent has a date of filing and an assignment date attached to it. Based on the provided dates one can try to address questions such as:
When were these patents filed? Is the number of ML and AI related patents increasing over time? How long did it usually take from patent filing to assignment? And so on.
For the set of ML and AI related patents that were granted between Jan 3 and Sept 12 2017 the following figure depicts…
 …in the top panel: number of patents (yaxis) per their original month of filing (xaxis),
 …in the bottom panel: the number of patents (yaxis) that were assigned (approved) per week (xaxis) in 2017 so far.
The patent publication dates plot suggests that the number of ML and AI related patents seems to be increasing slightly throughout the year 2017.
The patent application dates plot suggests that the patent examination phase for the considered patents takes about 2.5 years. In fact the average time from patent filing to approval is 2.83 years with a standard deviation of 1.72 years in this dataset (that is, among the considered ML and AI related patents in 2017). However, the range is quite extensive spanning 0.2412.57 years.
The distribution of the duration from patent filing date to approval is depicted by following figure.
So, what are some of the inventions that took longest to get approved? Here are the five patents with the longest examination periods:
 “Printing and dispensing system for an electronic gaming device that provides an undisplayed outcome” (~12.57 years to approval; assignee: Aim Management, Inc.)
 “Apparatus for biological sensing and alerting of pharmacogenomic mutation” (~12.24 years to approval; assignee: NA)
 “System for tracking a player of gaming devices” (~12.06 years to approval; assignee: Aim Management, Inc.)
 “Device, method, and computer program product for customizing game functionality using images” (~11.72 years to approval; assignee: NOKIA TECHNOLOGIES OY)
 “Method for the spectral identification of microorganisms” (~11.57 years to approval; assignee: MCGILL UNIVERSITY, and HEALTH CANADA)
Each of these patents is related to either gaming or biotech. I wonder if that’s a coincidence…
We can also look at the five patents with the shortest approval time:
 “Home device application programming interface” (~91 days to approval; assignee: ESSENTIAL PRODUCTS, INC.)
 “Avoiding dazzle from lights affixed to an intraoral mirror, and applications thereof” (~104 days to approval; assignee: DENTAL SMARTMIRROR, INC.)
 “Media processing methods and arrangements” (~106 days to approval; assignee: Digimarc Corporation)
 “Machine learning classifier that compares price risk score, supplier risk score, and item risk score to a threshold” (~111 days to approval; assignee: ACCENTURE GLOBAL SOLUTIONS LIMITED)
 “Generating accurate reason codes with complex nonlinear modeling and neural networks” (~111 days to approval; assignee: SAS INSTITUTE INC.)
Interstingly the patent approved in the shortest amount of time among all 6665 analysed (ML and AI related) patents is some smart home thingy from Andy Rubin’s hyped up company Essential.
Do big tech companies get their patents approved faster than other companies (e.g., startups)?The following figure separates the patent approval times according to the respective assignee company, considering several of the most well known tech giants.
Indeed some big tech companies, such as AT&T or Samsung, manage to push their patent application though the USPTO process much faster than most other companies. However, there are other tech giants, such as Microsoft, which on average take longer to get their patent applications approved than even the companies in category “Other”. Also noteworthy is the fact that big tech companies tend to have fewer outliers regarding the patent examination process duration than companies in the category “Other”.
Of course it would also be interesting to categorize all patent assignees into categories like “Startup”, “Big Tech”, “University”, or “Government”, and compare the typical duration of the patent examination process between such groups. However, it’s not clear to me how to establish such categories without collecting additional data on each patent assignee, which at this point I don’t have time for :stuck_out_tongue:.
ConclusionThere is definitely a lot of promise in the USPTO full text patent data.
Here I have barely scratched the surface, and I hope that I will find the time to play around with these data some more.
The end goal is, of course, to replace the patent examiner with an AI trained on historical patent data. :stuck_out_tongue_closed_eyes:
This work (blog post and included figures) is licensed under a Creative Commons Attribution 4.0 International License.

There are two main aspects to my reasoning as to this particular choice of keywords. (1) I wanted to keep the list relatively short in order to have a more specific search, and (2) I tried to avoid keywords which may generate false positives (e.g., the term “AI” would match all sorts of codes present in the patent text, such as “123456789 AI N1”). In no way am I claiming that this is a perfect list of keywords to identify ML and AI related patents, but I think that it’s definitely a good start. ↩
To leave a comment for the author, please follow the link and comment on their blog: Alexej's blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Will Stanton hit 61 home runs this season?
(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to Rbloggers)
[edit: Juho Kokkala corrected my homework. Thanks! I updated the post. Also see some further elaboration in my reply to Andrew’s comment. As Andrew likes to say …]
So far, Giancarlo Stanton has hit 56 home runs in 555 at bats over 149 games. Miami has 10 games left to play. What’s the chance he’ll hit 61 or more home runs? Let’s make a simple backoftheenvelope Bayesian model and see what the posterior event probability estimate is.
Sampling notation
A simple model that assumes a home run rate per at bat with a uniform (conjugate) prior:
The data we’ve seen so far is 56 home runs in 555 at bats, so that gives us our likelihood.
Now we need to simulate the rest of the season and compute event probabilities. We start by assuming the atbats in the rest of the season is Poisson.
We then take the number of home runs to be binomial given the number of at bats and the home run rate.
Finally, we define an indicator variable that takes the value 1 if the total number of home runs is 61 or greater and the value of 0 otherwise.
Event probability
The probability Stanton hits 61 or more home runs (conditioned on our model and his performance so far) is then the posterior expectation of that indicator variable,
Computation in R
The posterior for is analytic because the prior is conjugate, letting us simulate the posterior chance of success given the observed successes (56) and number of trials (555). The number of at bats is independent and also easy to simulate with a Poisson random number generator. We then simulate the number of hits on the outside as a random binomial, and finally, we compare it to the total and then report the fraction of simulations in which the simulated number of home runs put Stanton at 61 or more:
> sum(rbinom(1e5, rpois(1e5, 10 * 555 / 149), rbeta(1e5, 1 + 56, 1 + 555  56)) >= (61  56)) / 1e5 [1] 0.34That is, I’d give Stanton about a 34% chance conditioned on all of our assumptions and what he’s done so far.
Disclaimer
The above is intended for recreational use only and is not intended for serious bookmaking.
Exercise
You guessed it—code this up in Stan. You can do it for any batter, any number of games left, etc. It really works for any old statistics. It’d be even better hierarchically with more players (that’ll pull the estimate for down toward the league average). Finally, the event probability can be done with an indicator variable in the generated quantities block.
The basic expression looks like you need discrete random variables, but we only need them here for the posterior calculation in generated quantities. So we can use Stan’s random number generator functions to do the posterior simulations right in Stan.
The post Will Stanton hit 61 home runs this season? appeared first on Statistical Modeling, Causal Inference, and Social Science.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Pirating Pirate Data for Pirate Day
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
This past Tuesday was Talk Like A Pirate Date, the unofficial holiday of R (aRRR!) users worldwide. In recognition of the day, Bob Rudis used R to create this map of worldwide piracy incidents from 2013 to 2017.
The post provides a useful and practical example of extracting data from a website without an API, otherwise known as "scraping" data. In this case the website was the International Chamber of Commerce, and the provided R code demonstrates several useful R packages for scraping:
 rvest, for extracting data from a web pages' HTML source
 purrr, and specifically the safely function, for streamlining the process of iterating over pages that may return errors
 purrrly, for iterating over the rows of scraped data
 splashr, for automating the process of taking a screenshot of a webpage
 and robotstxt, for checking whether automated downloads of the website content are allowed
That last package is an important one, because while it's almost always technically possible to automate the process of extracting data from a website, it's not always allowed or even legal. Inspecting the robots.txt file is one check you should definitely make, and you should also check the terms of service of the website which may also prohibit the practice. Even then, you should be respectful and take care not to overload the server with frequent and/or repeated calls, as Bob demonstrates by spacing requests by 5 seconds. Finally — and most importantly! — even if scraping isn't expressly forbidden, using scraped data may not be ethical, especially when the data is about people who are unable to give their individual consent to your use of the data. This Forbes article about analyzing data scraped from a dating website offers an instructive tale in that regard.
This piracy data example however provides a case study of using websites and the data it provides in the right way. Follow the link below all for the details.
rud.is: Pirating Web Content Responsibly With R
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Exploratory Data Analysis of Tropical Storms in R
(This article was first published on RProjects – Stoltzmaniac, and kindly contributed to Rbloggers)
Exploratory Data Analysis of Tropical Storms in RThe disastrous impact of recent hurricanes, Harvey and Irma, generated a large influx of data within the online community. I was curious about the history of hurricanes and tropical storms so I found a data set on data.world and started some basic Exploratory data analysis (EDA).
EDA is crucial to starting any project. Through EDA you can start to identify errors & inconsistencies in your data, find interesting patterns, see correlations and start to develop hypotheses to test. For most people, basic spreadsheets and charts are handy and provide a great place to start. They are an easytouse method to manipulate and visualize your data quickly. Data scientists may cringe at the idea of using a graphical user interface (GUI) to kickoff the EDA process but those tools are very effective and efficient when used properly. However, if you’re reading this, you’re probably trying to take EDA to the next level. The best way to learn is to get your hands dirty, let’s get started.
The original source of the data was can be found at DHS.gov.
Step 1: Take a look at your data set and see how it is laid out FID YEAR MONTH DAY AD_TIME BTID NAME LAT LONG WIND_KTS PRESSURE CAT 2001 1957 8 8 1800Z 63 NOTNAMED 22.5 140.0 50 0 TS 2002 1961 10 3 1200Z 116 PAULINE 22.1 140.2 45 0 TS 2003 1962 8 29 0600Z 124 C 18.0 140.0 45 0 TS 2004 1967 7 14 0600Z 168 DENISE 16.6 139.5 45 0 TS 2005 1972 8 16 1200Z 251 DIANA 18.5 139.8 70 0 H1 2006 1976 7 22 0000Z 312 DIANA 18.6 139.8 30 0 TDFortunately, this is a tidy data set which will make life easier and appears to be cleaned up substantially. The column names are relatively straightforward with the exception of “ID” columns.
The description as given by DHS.gov:
This dataset represents Historical North Atlantic and Eastern North Pacific Tropical Cyclone Tracks with 6hourly (0000, 0600, 1200, 1800 UTC) center locations and intensities for all subtropical depressions and storms, extratropical storms, tropical lows, waves, disturbances, depressions and storms, and all hurricanes, from 1851 through 2008. These data are intended for geographic display and analysis at the national level, and for large regional areas. The data should be displayed and analyzed at scales appropriate for 1:2,000,000scale data.
Step 2: View some descriptive statistics YEAR MONTH DAY WIND_KTS PRESSURE Min. :1851 Min. : 1.000 Min. : 1.00 Min. : 10.00 Min. : 0.0 1st Qu.:1928 1st Qu.: 8.000 1st Qu.: 8.00 1st Qu.: 35.00 1st Qu.: 0.0 Median :1970 Median : 9.000 Median :16.00 Median : 50.00 Median : 0.0 Mean :1957 Mean : 8.541 Mean :15.87 Mean : 54.73 Mean : 372.3 3rd Qu.:1991 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.: 70.00 3rd Qu.: 990.0 Max. :2008 Max. :12.000 Max. :31.00 Max. :165.00 Max. :1024.0We can confirm that this particular data had storms from 1851 – 2010, which means the data goes back roughly 100 years before naming storms started! We can also see that the minimum pressure values are 0, which likely means it could not be measured (due to the fact zero pressure is not possible in this case). We can see that there are recorded months from January to December along with days extending from 1 to 31. Whenever you see all of the dates laid out that way, you can smile and think to yourself, “if I need to, I can put dates in an easy to use format such as YYYYmmdd (20170912)!”
Step 3: Make a basic plotThis is a great illustration of our data set and we can easily notice an upward trend in the number of storms over time. Before we go running to tell the world that the number of storms per year is growing, we need to drill down a bit deeper. This could simply be caused because more types of storms were added to the data set (we know there are hurricanes, tropical storms, waves, etc.) being recorded. However, this could be a potential starting point for developing a hypothesis for timeseries data.
You will notice the data starts at 1950 rather than 1851. I made this choice because storms were not named until this point so it would be difficult to try and count the unique storms per year. It could likely be done by finding a way to utilize the “ID” columns. However, this is a preliminary analysis so I didn’t want to dig too deep.
Step 4: Make some calculations YEAR Distinct_Storms Distinct_Storms_Change Distinct_Storms_Pct_Change 1951 10 3 0.23 1952 6 4 0.40 1953 8 2 0.33 1954 8 0 0.00 1955 10 2 0.25 1956 7 3 0.30 1957 9 2 0.29 1958 10 1 0.11 1959 13 3 0.30 1960 13 0 0.00In this case, we can see the number of storms, nominal change and percentage change per year. These calculations help to shed light on what the growth rate looks like each year. So we can use another summary table:
Distinct_Storms Distinct_Storms_Change Distinct_Storms_Pct_Change Min. : 6.00 Min. :15.0000 Min. :0.42000 1st Qu.:15.75 1st Qu.: 3.0000 1st Qu.:0.12000 Median :25.00 Median : 1.0000 Median : 0.04000 Mean :22.81 Mean : 0.3448 Mean : 0.05966 3rd Qu.:28.00 3rd Qu.: 3.7500 3rd Qu.: 0.25000 Max. :43.00 Max. : 16.0000 Max. : 1.14000From the table we can state the following for the given time period:
 The mean number of storms is 23 per year (with a minimum of 6 and maximum of 43)
 The mean change in the number of storms per year is 0.34 (with a minimum of 15 and maximum of 16)
 The mean percent change in the number of storms per year is 6% (with a minimum of 42% and maximum of 114%)
Again, we have to be careful because these numbers are in aggregate and may not tell the whole story. Dividing these into groups of storms is likely much more meaningful.
Step 5: Make a more interesting plotBecause I was most interested in hurricanes, I filtered out only the data which was classified as “H (15).” By utilizing a data visualization technique called “small multiples” I was able to pull out the different types and view them within the same graph. While this is possible to do in tables and spreadsheets, it’s much easier to visualize this way. By holding the axes constant, we can see the majority of the storms are classified as H1 and then it appears to consistently drop down toward H5 (with very few actually being classified as H5). We can also see that most have an upward trend from 1950 – 2010. The steepest appears to be H1 (but it also flattens out over the last decade).
Step 6: Make a filtered calculation Distinct_Storms Distinct_Storms_Change Distinct_Storms_Pct_Change Min. : 4.00 Min. :11.00000 Min. :0.56000 1st Qu.: 9.25 1st Qu.: 4.00000 1st Qu.:0.25750 Median :13.50 Median : 0.00000 Median : 0.00000 Mean :12.76 Mean : 0.05172 Mean : 0.08379 3rd Qu.:15.00 3rd Qu.: 3.00000 3rd Qu.: 0.35250 Max. :24.00 Max. : 10.00000 Max. : 1.80000Now we are looking strictly at hurricane data (classified as H1H5):
 The mean number of hurricanes is 13 per year (with a minimum of 4 and maximum of 24)
 The mean change in the number of hurricanes per year is 0.05 (with a minimum of 11 and maximum of 10)
 The mean percent change in the number of hurricanes per year is 8% (with a minimum of 56% and maximum of 180%)
While it doesn’t really make sense to say “we had an average growth of 0.05 hurricanes per year between 1950 and 2010” … it may make sense to say “we saw an average of growth of 8% per year in the number of hurricanes between 1950 and 2010.”
That’s a great thing to put in quotes!
During EDA we discovered an average of growth of 8% per year in the number of hurricanes between 1950 and 2010.
Side Note: Be ready, as soon as you make a statement like that, you will likely have to explain how you arrived at that conclusion. That’s where having an RMarkdown notebook and data online in a repository will help you out! Reproducible research is all of the hype right now.
Step 7: Try visualizing your statementsA histogram and/or density plot is a great way to visualize the distribution of the data you are making statements about. This plot helps to show that we are looking at a rightskewed distribution with substantial variance. Knowing that we have n = 58 (meaning 58 years after being aggregated), it’s not surprising that our histogram looks sparse and our density plot has an unusual shape. At this point, you can make a decision to jot this down, research it in depth and then attack it with full force.
However, that’s not what we’re covering in this post.
Step 8: Plot another aspect of your data60K pieces of data can get out of hand quickly, we need to back this down into manageable chunks. Building on the knowledge from our last exploration, we should be able to think of a way to cut this down to get some better information. The concept of small multiples could come in handy again! Splitting the data up by type of storm could prove to be valuable. We can also tell that we are missing
After filtering the data down to hurricanes and utilizing a heat map rather than plotting individual points we can get a better handle on what is happening where. The H4 and H5 sections are probably the most interesting. It appears as if H4 storms are more frequent on the West coast of Mexico whereas the H5 are most frequent in the Gulf of Mexico.
Because we’re still in EDA mode, we’ll continue with another plot.
Here are some of the other storms from the data set. We can see that TD, TS and L have large geographical spreads. The E, SS, and SD storms are concentrated further North toward New England.
Digging into this type of data and building probabilistic models is a fascinating field. The actuarial sciences are extremely difficult and insurance companies really need good models. Having mapped this data, it’s pretty clear you could dig in and find out what parts of the country should expect what types of storms (and you’ve also known this just from being alive for 10+ years). More hypotheses could be formed about location at this stage and could be tested!
Step 9: Look for a relationshipWhat is the relationship between WIND_KTS and PRESSURE? This chart helps us to see that low PRESSURE and WIND_KTS are likely negatively correlated. We can also see that the WIND_KTS is essentially the predictor in the data set which can perfectly predict how a storm is classified. Well, it turns out, that’s basically the distinguishing feature when scientists are determining how to categorize these storms!
Step N……We have done some basic EDA and identified some patterns in the data. While doing some EDA can simply be just for fun, in most data analysis, it’s important to find a place to apply these discoveries by making and testing hypotheses! There are plenty of industries where you could find a use for this:
 Insurance – improve statistical modeling and risk analysis
 Real Estate Development – identify strategic investment locations
 Agriculture – crop selection
 …
The rest is up to you! This is a great data set and there are a lot more pieces of information lurking within it. I want people to do their own EDA and send me anything interesting!
Some fun things to check out:
 What was the most common name for a hurricane?
 Do the names actually follow an alphabetical pattern through time? (This is one is tricky)
 If the names are in alphabetical order, how often does a letter get repeated in a year?
 Can we merge this data with FEMA, charitable donations, or other aid data?
To get you started on the first one, here’s the Top 10 most common names for tropical storms. Why do you think it’s Florence?
Thank you for reading, I hope this helps you with your own data. The code is all written in R and is located on my GitHub. You can also find other data visualization posts and usages of ggplot2 on my blog Stoltzmaniac
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: RProjects – Stoltzmaniac. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
GoldMining – Week 3 (2017)
(This article was first published on R – Fantasy Football Analytics, and kindly contributed to Rbloggers)
Week 3 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!
The post GoldMining – Week 3 (2017) appeared first on Fantasy Football Analytics.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Don’t teach students the hard way first
(This article was first published on Variance Explained, and kindly contributed to Rbloggers)
Imagine you were going to a party in an unfamiliar area, and asked the host for directions to their house. It takes you thirty minutes to get there, on a path that takes you on a long winding road with slow traffic. As the party ends, the host tells you “You can take the highway on your way back, it’ll take you only ten minutes. I just wanted to show you how much easier the highway is.”
Wouldn’t you be annoyed? And yet this kind of attitude is strangely common in programming education.
I was recently talking to a friend who works with R and whose opinions I greatly respect. He was teaching some sessions to people in his company who hadn’t used R before, where he largely followed my philosophy on teaching the tidyverse to beginners. I agreed with his approach, until he said something far too familiar to me:
“I teach them dplyr’s group_by/summarize in the second lesson, but I teach them loops first just to show them how much easier dplyr is.”
I talk to people about teaching a lot, and that phrase keeps popping up: “I teach them X just to show them how much easier Y is”. It’s a trap a trap I’ve fallen into before when teaching, and one that I’d like to warn others against.
Students don’t share your nostalgiaFirst, why do some people make this choice? I think because when we teach a class, we accidentally bring in all of our own history and context.
For instance, I started programming with Basic and Perl in high school, then Java in college, then got really into Python, then got even more into R. Along the way I built up habits in each language that had to be painfully undone afterwards. I worked in an objectoriented style in Python, then switched to thinking in data frames and functional operations. I wrote too many loops when I started R, then I grew accustomed to the apply family of functions. Along the way there were thousands of little frustrations and little epiphanies, tragedies and triumphs in microcosm.
But when I’m teaching someone how to use R… they don’t care about any of that. They weren’t there! They didn’t make my mistakes, they don’t need to be talked out of them. Going back to the party host, perhaps the highway was built only last year, and maybe the host had to talk his friends, stuck in their habits, into taking the highway by explaining how much faster it is. But that doesn’t make any difference to the guest, who has never taken either route.
@ucfagls I think ppl teach base stuff preferentially because it’s what they still use or they’re reliving their own #rstats “journey” @drob
— Jenny Bryan (@JennyBryan) January 16, 2015
It’s true that I learned a lot about programming in the path I described above. I learned how to debug, how to compare programming paradigms, and when to switch from one approach to another. I think some teachers hope that by walking students through this “journey”, we can impart some of that experience. But it doesn’t work: feeding students two possible solutions in a row is nothing like the experience of them comparing solutions for themselves, and doesn’t grant them any of the same skills. Besides which, students will face plenty of choices and challenges as they continue their programming career, without us having to invent artificial ones.
Students should absolutely learn multiple approaches (there’s usually advantages to each one). But not in this order, from hardest to easiest, and not because it happened to be the order we learned it ourselves.
Bandwidth and trustThere are two reasons I recommend against teaching a harder approach first. One is educational bandwidth, and one is trust.
One of the most common mistakes teachers make (especially inexperienced ones) is to think they can teach more material than they can.
.@minebocek talks philosophy of Data Carpentry.
Highlight: don't take on too much material. Rushing discourages student questions #UseR2017 pic.twitter.com/txAcSd3ND3
— David Robinson (@drob) July 6, 2017
This comes down to what I sometimes call educational bandwidth: the total amount of information you can communicate to students is limited, especially since you need to spend time reinforcing and revisiting each concept. It’s not just about the amount of time you have in the lesson, either. Learning new ideas is hard work: think of the headache you can get at the end of a oneday workshop. This means you should make sure every idea you get across is valuable. If you teach them a method they’ll never have to use, you’re wasting time and energy.
The other reason is trust. Think of the imaginary host who gave poor directions before giving better ones. Wouldn’t it be unpleasant to be tricked like that? When a student goes through the hard work of learning a new method, telling them “just kidding, you didn’t need to learn that” is obnoxious, and hurts their trust in you as a teacher.
In some cases there’s a tradeoff between bandwidth and trust. For instance, as I’ve described before, I teach dplyr and the %>% operator before explaining what a function is, or even how to do a variable assignment. This conserves bandwidth (it gets students doing powerful things quickly) but it’s a minor violation of their trust (I’m hiding details of how R actually works). But there’s no tradeoff in teaching a hard method before an easier one.
ExceptionsIs there any situation where you might want to show students the hard way first? Yes: one exception is if the hard solution is something the student would have been tempted to do themselves.
One good example is in R for Data Science, by Hadley Wickham and Garrett Grolemund. Chapter 19.2 describes “When should you write a function”, and gives an example of rescaling several columns by copying and pasting code:
df$a < (df$a  min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE)  min(df$a, na.rm = TRUE)) df$b < (df$b  min(df$b, na.rm = TRUE)) / (max(df$b, na.rm = TRUE)  min(df$a, na.rm = TRUE)) df$c < (df$c  min(df$c, na.rm = TRUE)) / (max(df$c, na.rm = TRUE)  min(df$c, na.rm = TRUE)) df$d < (df$d  min(df$d, na.rm = TRUE)) / (max(df$d, na.rm = TRUE)  min(df$d, na.rm = TRUE))By showing them what this approach looks like (and even including an intentional typo in the second line, to show how copying and pasting code is prone to error), the book guides them towards the several steps involved in writing a function.
rescale01 < function(x) { rng < range(x, na.rm = TRUE) (x  rng[1]) / (rng[2]  rng[1]) }This educational approach makes sense because copying and pasting code is a habit beginners would fall into naturally, especially if they’ve never programmed before. It doesn’t take up any educational bandwidth because students already know how to do it, and it’s upfront about it’s approach (when I teach this way, I usually use the words “you might be tempted to…”).
However, teaching a loop (or split()/lapply(), or aggregate, or tapply) isn’t something beginners would do accidentally, and it’s therefore not something you need to be talking.
In conclusion: teaching programming is hard, don’t make it harder. Next time you’re teaching a course, or workshop, or writing a tutorial, or just helping a colleague getting set up in R, try teaching them your preferred method first, instead of meandering through subpar solutions. I think you’ll find that it’s worth it.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Variance Explained. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
ggformula: another option for teaching graphics in R to beginners
(This article was first published on SAS and R, and kindly contributed to Rbloggers)
A previous entry (http://sasandr.blogspot.com/2017/07/optionsforteachingrtobeginners.html) describes an approach to teaching graphics in R that also “get[s] students doing powerful things quickly”, as David Robinson suggested.
In this guest blog entry, Randall Pruim offers an alternative way based on a different formula interface. Here’s Randall:
For a number of years I and several of my colleagues have been teaching R to beginners using an approach that includes a combination of
 the lattice package for graphics,
 several functions from the stats package for modeling (e.g., lm(), t.test()), and
 the mosaic package for numerical summaries and for smoothing over edge cases and inconsistencies in the other two components.
Important in this approach is the syntactic similarity that the following “formula template” brings to all of these operations.
goal ( y ~ x , data = mydata, … )
Trouble in paradise
As the earlier post noted, the use of lattice has some drawbacks. While basic graphs like histograms, boxplots, scatterplots, and quantilequantile plots are simple to make with lattice, it is challenging to combine these simple plots into more complex plots or to plot data from multiple data sources. Splitting data into subgroups and either overlaying with multiple colors or separating into subplots (facets) is easy, but the labeling of such plots is not as convenient (and takes more space) than the equivalent plots made with ggplot2. And in our experience, students generally find the look of ggplot2 graphics more appealing. On the other hand, introducing ggplot2 into a first course is challenging. The syntax tends to be more verbose, so it takes up more of the limited space on projected images and course handouts. More importantly, the syntax is entirely unrelated to the syntax used for other aspects of the course. For those adopting a “Less Volume, More Creativity” approach, ggplot2 is tough to justify. ggformula: The thirdanda half way Danny Kaplan and I recently introduced ggformula, an R package that provides a formula interface to ggplot2 graphics. Our hope is that this provides the best aspects of lattice (the formula interface and lighter syntax) and ggplot2 (modularity, layering, and better visual aesthetics). For simple plots, the only thing that changes is the name of the plotting function. Each of these functions begins with gf. Here are two examples, either of which could replace the sidebyside boxplots made with lattice in the previous post. We can even overlay these two types of plots to see how they compare. To do so, we simply place what I call the “then” operator (%>%, also commonly called a pipe) between the two layers and adjust the transparency so we can see both where they overlap.
Comparing groups Groups can be compared either by overlaying multiple groups distinguishable by some attribute (e.g., color) or by creating multiple plots arranged in a grid rather than overlaying subgroups in the same space. The ggformula package provides two ways to create these facets. The first uses  very much like lattice does. Notice that the gf_lm() layer inherits information from the the gf_points() layer in these plots, saving some typing when the information is the same in multiple layers. The second way adds facets with gf_facet_wrap() or gf_facet_grid() and can be more convenient for complex plots or when customization of facets is desired. Fitting into the tidyverse work flow ggformala also fits into a tidyversestyle workflow (arguably better than ggplot2 itself does). Data can be piped into the initial call to a ggformula function and there is no need to switch between %>% and + when moving from data transformations to plot operations. Summary The “Less Volume, More Creativity” approach is based on a common formula template that has served well for several years, but the arrival of ggformula strengthens this approach by bringing a richer graphical system into reach for beginners without introducing new syntactical structures. The full range of ggplot2 features and customizations remains available, and the ggformula package vignettes and tutorials describe these in more detail. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: SAS and R. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Comparing Trump and Clinton’s Facebook pages during the US presidential election, 2016
(This article was first published on R Programming – DataScience+, and kindly contributed to Rbloggers)
R has a lot of packages for users to analyse posts on social media. As an experiment in this field, I decided to start with the biggest one: Facebook.
I decided to look at the Facebook activity of Donald Trump and Hillary Clinton during the 2016 presidential election in the United States.
The winner may be more famous for his Twitter account than his Facebook one, but he still used it to great effect to help pick off his Republican rivals in the primaries and to attack Hillary Clinton in the general election.
For this work we’re going to be using the Rfacebook package developed by Pablo Barbera, plus his excellent howto guide.
The first thing to do is to generate an access token from Facebook’s developer portal. Keep it anonymous (otherwise you’re gifting the world access to your account) and save it in your environment.
library(Rfacebook) options(scipen = 999) token < "Your token goes here"The next thing to do is to use the getPage() function to retrieve all the posts from each candidate.
I’m going to start the clock on January 1, 2016 and end it the day after the election on November 9, 2016 (which means it will stop on election day, the day before)
trump < getPage("donaldtrump", token, n = 5000, since='2016/01/01', until='2016/11/09') clinton < getPage("hillaryclinton", token, n = 5000, since='2016/01/01', until='2016/11/09') Caveat: The data doesn’t seem to contain all Trump and Clinton’s Facebook postsI ran the commands several times and got 545 posts for Trump and 692 posts for Clinton. However, I think I may have got more results the first time I ran the commands. I also searched their pages via Facebook and came up with some posts that don’t appear in the R datasets. If you have a solution for this, please let me know!
In the meantime, we will work with what we have
We want to calculate the average number of likes, comments and shares for each month for both candidates. Again, we will be using Pablo’s code for a while here.
First up, we will format the date:
format.facebook.date < function(datestring) { date < as.POSIXct(datestring, format = "%Y%m%dT%H:%M:%S+0000", tz = "GMT") }Then we will use his formula to calculate the average likes, comments and shares (metrics) per month:
aggregate.metric < function(metric) { m < aggregate(page[[paste0(metric, "_count")]], list(month = page$month), mean) m$month < as.Date(paste0(m$month, "15")) m$metric < metric return(m) }Now we run this data for both candidates:
#trump page < trump page$datetime < format.facebook.date(page$created_time) page$month < format(page$datetime, "%Y%m") df.list < lapply(c("likes", "comments", "shares"), aggregate.metric) trump_months head(trump_months) month x metric 1 20160115 17199.93 likes 2 20160215 15239.63 likes 3 20160315 22616.28 likes 4 20160415 19364.17 likes 5 20160515 14598.30 likes 6 20160615 32760.68 likes #clinton page < clinton page$datetime < format.facebook.date(page$created_time) page$month < format(page$datetime, "%Y%m") df.list < lapply(c("likes", "comments", "shares"), aggregate.metric) clinton_months < do.call(rbind, df.list)Before we combine them together, let’s label them so we know who’s who:
trump_months$candidate < "Donald Trump" clinton_months$candidate < "Hillary Clinton" both < rbind(trump_months, clinton_months)Now we have the data, we can visualise it. This is a neat opportunity to have a go at faceting using ggplot2.
Faceting is when you display two or more plots sidebyside for easy ataglance comparison.
library(ggplot2) library(scales) p < ggplot(both, aes(x = month, y = x, group = metric)) + geom_line(aes(color = metric)) + scale_x_date(date_breaks = "months", labels = date_format("%m")) + ggtitle("Facebook engagement during the 2016 election") + labs(y = "Count", x = "Month (2016)", aesthetic='Metric') + theme(text=element_text(family="Browallia New", color = "#2f2f2d")) + scale_colour_discrete(name = "Metric") #add in a facet p < p + facet_grid(. ~ candidate) p AnalysisClearly Trump’s Facebook engagement got far better results than Clinton’s. Even during his ‘off months’ he received more likes per page on average than Clinton managed at the height of the general election.
Trump’s comments per page also skyrocketed during October and November as the election neared.
Hillary Clinton enjoyed a spike in engagement around June. It was a good month for her: she was confirmed as the Democratic nominee and received the endorsement of President Obama.
ThemesTrump is famous for using nicknames for his political opponents. We had lowenergy Jeb, Little Marco, Lyin’ Ted and then Crooked Hillary.
The first usage of Crooked Hillary in the data came on April 26. A look through his Twitter feed shows he seems to have decided on Crooked Hillary around this time as well.
#DrainTheSwamp was one of his later hashtags, making the first appearance in the data on October 20, just a few weeks shy of the election on November 8.
Clinton meanwhile mentioned her rival’s surname in about a quarter of her posts. Her most popular ones were almost all on the eve of the election exhorting her followers to vote.
Of her earlier ones, only her New Year message and one from March appealing to Trump’s temperament resonated as strongly.
ConclusionIt’s frustrating that the API doesn’t seem to retrieve all the data.
Nonetheless, it’s a decent sample size and shows that the Trump campaign was far more effective than the Clinton one on Facebook during the 2016 election.
Related Post
 Analyzing Obesity across USA
 Can we predict flu deaths with Machine Learning and R?
 Graphical Presentation of Missing Data; VIM Package
 Creating Graphs with Python and GooPyCharts
 Visualizing Tennis Grand Slam Winners Performances
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Visualizing the Spanish Contribution to The Metropolitan Museum of Art
(This article was first published on R – Fronkonstin, and kindly contributed to Rbloggers)
TweetWell I walk upon the river like it’s easier than land
(Love is All, The Tallest Man on Earth)
The Metropolitan Museum of Art provides here a dataset with information on more than 450.000 artworks in its collection. You can do anything you want with these data: there are no restrictions of use. Each record contains information about the author, title, type of work, dimensions, date, culture and geography of a particular piece.
I can imagine a bunch of things to do with these data but since I am a big fan of highcharter, I have done a treemap, which is an artistic (as well as efficient) way to visualize hierarchical data. A treemap is useful to visualize frequencies. They can handle levels, allowing to navigate to go into detail about any category. Here you can find a good example of treemap.
To read data I use fread function from data.table package. I also use this package to do some data wrangling operations on the data set. After them, I filter it looking for the word SPANISH in the columns Artist Nationality and Culture and looking for the word SPAIN in the column Country. For me, any piece created by an Spanish artist (like this one), coming from Spanish culture (like this one) or from Spain (like this one) is Spanish (this is my very own definition and may do not match with any academical one). Once it is done, it is easy to extract some interesting figures:
 There are 5.294 Spanish pieces in The Met, which means a 1,16% of the collection
 This percentage varies significantly between departments: it raises to 9,01% in The Cloisters and to 4,83% in The Robert Lehman Collection; on the other hand, it falls to 0.52% in The Libraries and to 0,24% in Photographs.
 The Met is home to 1.895 highlights and 44 of them (2,32%) are Spanish; It means that Spanish art is twice as important as could be expected (remember that represents a 1,16% of the entire collection)
My treemap represents the distribution of Spanish artworks by department (column Department) and type of work (column Classification). There are two important things to know before doing a treemap with highcharter:
 You have to use treemap function from treemap package to create a list with your data frame that will serve as input for hctreemap function
 hctreemap fails if some category name is the same as any of its subcategories. To avoid this, make sure that all names are distinct.
This is the treemap:
Here you can see a full size version of it.
There can be seen several things at a glance: most of the pieces are drawings and prints and european sculpture and decorative arts (in concrete, prints and textiles), there is also big number of costumes, arms and armor is a very fragmented department … I think treemap is a good way to see what kind of works owns The Met.
Mi favorite spanish piece in The Met is the stunning Portrait of Juan de Pareja by Velázquez, which illustrates this post: how nice would be to see it next to El Primo in El Museo del Prado!
Feel free to use my code to do your own experiments:
library(data.table) library(dplyr) library(stringr) library(highcharter) library(treemap) file="MetObjects.csv" # Download data if (!file.exists(file)) download.file(paste0("https://media.githubusercontent.com/media/metmuseum/openaccess/master/", file), destfile=file, mode='wb') # Read data data=fread(file, sep=",", encoding="UTF8") # Modify column names to remove blanks colnames(data)=gsub(" ", ".", colnames(data)) # Clean columns to prepare for searching data[,`:=`(Artist.Nationality_aux=toupper(Artist.Nationality) %>% str_replace_all("\\[\\d+\\]", "") %>% iconv(from='UTF8', to='ASCII//TRANSLIT'), Culture_aux=toupper(Culture) %>% str_replace_all("\\[\\d+\\]", "") %>% iconv(from='UTF8', to='ASCII//TRANSLIT'), Country_aux=toupper(Country) %>% str_replace_all("\\[\\d+\\]", "") %>% iconv(from='UTF8', to='ASCII//TRANSLIT'))] # Look for Spanish artworks data[Artist.Nationality_aux %like% "SPANISH"  Culture_aux %like% "SPANISH"  Country_aux %like% "SPAIN"] > data_spain # Count artworks by Department and Classification data_spain %>% mutate(Classification=ifelse(Classification=='', "miscellaneous", Classification)) %>% mutate(Department=tolower(Department), Classification1=str_match(Classification, "(\\w+)(,\\)")[,2], Classification=ifelse(!is.na(Classification1), tolower(Classification1), tolower(Classification))) %>% group_by(Department, Classification) %>% summarize(Objects=n()) %>% ungroup %>% mutate(Classification=ifelse(Department==Classification, paste0(Classification, "#"), Classification)) %>% as.data.frame() > dfspain # Do treemap without drawing tm_dfspain < treemap(dfspain, index = c("Department", "Classification"), draw=F, vSize = "Objects", vColor = "Objects", type = "index") # Do highcharter treemap hctreemap( tm_dfspain, allowDrillToNode = TRUE, allowPointSelect = T, levelIsConstant = F, levels = list( list( level = 1, dataLabels = list (enabled = T, color = '#f7f5ed', style = list("fontSize" = "1em")), borderWidth = 1 ), list( level = 2, dataLabels = list (enabled = F, align = 'right', verticalAlign = 'top', style = list("textShadow" = F, "fontWeight" = 'light', "fontSize" = "1em")), borderWidth = 0.7 ) )) %>% hc_title(text = "Spanish Artworks in The Met") %>% hc_subtitle(text = "Distribution by Department") > plot plot var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Pandigital Products: Euler Problem 32
(This article was first published on The Devil is in the Data, and kindly contributed to Rbloggers)
Euler Problem 32 returns to pandigital numbers, which are numbers that contain one of each digit. Like so many of the Euler Problems, these numbers serve no practical purpose whatsoever, other than some entertainment value. You can find all pandigital numbers in base10 in the Online Encyclopedia of Interegers (A050278). The Numberhile video explains everything you ever wanted to
The Numberhile video explains everything you ever wanted to know about pandigital numbers but were afraid to ask.
Euler Problem 32 DefinitionWe shall say that an ndigit number is pandigital if it makes use of all the digits 1 to n exactly once; for example, the 5digit number, 15234, is 1 through 5 pandigital.
The product 7254 is unusual, as the identity, 39 × 186 = 7254, containing multiplicand, multiplier, and product is 1 through 9 pandigital.
Find the sum of all products whose multiplicand/multiplier/product identity can be written as a 1 through 9 pandigital.
HINT: Some products can be obtained in more than one way so be sure to only include it once in your sum.
Proposed SolutionThe pandigital.9 function tests whether a string classifies as a pandigital number. The pandigital.prod vector is used to store the multiplication.
The only way to solve this problem is brute force and try all multiplications but we can limit the solution space to a manageable number. The multiplication needs to have ten digits. For example, when the starting number has two digits, the second number should have three digits so that the total has four digits, e.g.: 39 × 186 = 7254. When the first number only has one digit, the second number needs to have four digits.
pandigital.9 < function(x) # Test if string is 9pandigital (length(x)==9 & sum(duplicated(x))==0 & sum(x==0)==0) t < proc.time() pandigital.prod < vector() i < 1 for (m in 2:100) { if (m < 10) n_start < 1234 else n_start < 123 for (n in n_start:round(10000 / m)) { # List of digits digs < as.numeric(unlist(strsplit(paste0(m, n, m * n), ""))) # is Pandigital? if (pandigital.9(digs)) { pandigital.prod[i] < m * n i < i + 1 print(paste(m, "*", n, "=", m * n)) } } } answer < sum(unique(pandigital.prod)) print(answer)Numbers can also be checked for pandigitality using mathematics instead of strings.
You can view the most recent version of this code on GitHub.
The post Pandigital Products: Euler Problem 32 appeared first on The Devil is in the Data.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Report from Mexico City
(This article was first published on R Views, and kindly contributed to Rbloggers)
Editors Note: It has been heartbreaking watching the images from México City. Teresa Ortiz, coorganizer of RLadies CDMX reports on efforts of data scientists to help. Our thoughts are with them, and with the people of México.
It has been a hard couple of days around here.
In less than 2 weeks, México has gone through two devastating earthquakes and the damages keep adding. Nevertheless, the response from the citizens has been outstanding and Mexican datadriven initiatives have not stayed behind.
An example is codeandoMexico, an open innovation platform which brings together top talent through public challenges, codeandoMexico developed a Citizen’s quick response center with a directory of emergency phone numbers, providing the location of places that need help, maps, and more. Most of the information in the site is updated from firsthand experience of volunteers that are helping throughout the affected areas. Technical collaboration with codeandoMexico is open to everyone by attending to issues on their website these include extending the services offered by the site or attending to bugs.
Social networks are being extensively used to spread information about where and how to help, however often times tweets and facebook posts are fake or obsolete. There are several efforts taking place to overcome this difficulty; RLadies CDMX’s cofounder Silvia Gutierrez is working in one such project to map information on shelters, affected buildings and more (http://bit.ly/2xfIwZm).
The full scope of the damage is yet to be seen, but the support, both from México and other countries, is encouraging.
_____='https://rviews.rstudio.com/2017/09/21/reportfrommexicocity/';
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R Views. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Monte Carlo Simulations & the "SimDesign" Package in R
(This article was first published on Econometrics Beat: Dave Giles' Blog, and kindly contributed to Rbloggers)
Past posts on this blog have included several relating to Monte Carlo simulation – e.g., see here, here, and here. Recently I came across a great article by Matthew Sigal and Philip Chalmers in the Journal of Statistics Education. It’s titled, “Play it Again: Teaching Statistics With Monte Carlo Simulation”, and the full reference appears below. The authors provide a really nice introduction to basic Monte Carlo simulation, using R. In particular, they contrast using a “for loop” approach, with using the “SimDesign” R package (Chalmers, 2017). Here’s the abstract of their paper:“Monte Carlo simulations (MCSs) provide important information about statistical phenomena that would be impossible to assess otherwise. This article introduces MCS methods and their applications to research and statistical pedagogy using a novel software package for the R Project for Statistical Computing constructed to lessen the often steep learning curve when organizing simulation code. A primary goal of this article is to demonstrate how wellsuited MCS designs are to classroom demonstrations, and how they provide a handson method for students to become acquainted with complex statistical concepts. In this article, essential programming aspects for writing MCS code in R are overviewed, multiple applied examples with relevant code are provided, and the benefits of using a generate–analyze–summarize coding structure over the typical “forloop” strategy are discussed.”
The SimDesign package provides an efficient, and safe template for setting pretty much any Monte Carlo experiment that you’re likely to want to conduct. It’s really impressive, and I’m looking forward to experimenting with it. The SigalChalmers paper includes helpful examples, with the associated R code and output. It would be superfluous for me to add that here. Needless to say, the SimDesign package is just as useful for simulations in econometrics as it is for those dealing with straight statistics problems. Try it out for yourself! ReferencesChalmers, R. P., 2017. SimDesign: Structure for Organizing Monte Carlo Simulation Designs, R package version 1.7. M. J. Sigal and R. P. Chalmers, 2016. Play it again: Teaching statistics with Monte Carlo simulation. Journal of Statistics Education, 24, 136156.
© 2017, David E. Giles var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Econometrics Beat: Dave Giles' Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...