Custom Level Coding in vtreat
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "onehot encoding"). Level coding can be computationally and statistically preferable to onehot encoding for variables that have an extremely large number of possible levels.
Level coding is like measurement: it summarizes categories of individuals into useful numbers. Source: USGS
By default, vtreat level codes to the difference between the conditional means and the grand mean (catN variables) when the outcome is numeric, and to the difference between the conditional loglikelihood and global loglikelihood of the target class (catB variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the ranger package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for treebased methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by vtreat‘s coding. This often isn’t a problem — but sometimes, it may be.
So the data scientist may want to use a level coding different from what vtreat defaults to. In this article, we will demonstrate how to implement custom level encoders in vtreat. We assume you are familiar with the basics of vtreat: the types of derived variables, how to create and apply a treatment plan, etc.
For our example, we will implement level coders based on partial pooling, or hierarchical/multilevel models (Gelman and Hill, 2007). We’ll leave the details of how partial pooling works to a subsequent article; for now, just think of it as a score that shrinks the estimate of the conditional mean to be closer to the unconditioned mean, and hence possibly closer to the unknown true values, when there are too few measurements to make an accurate estimate.
We’ll implement our partial pooling encoders using the lmer() (multilevel linear regression) and glmer() (multilevel generalized linear regression) functions from the lme4 package. For our example data, we’ll use radon levels by county for the state of Minnesota (Gelman and Hill, 2007. You can find the original data here).
The Data: Radon levels in Minnesota library("vtreat") library("lme4") library("dplyr") library("tidyr") library("ggplot2") # example data srrs = read.table("srrs2.dat", header=TRUE, sep=",", stringsAsFactor=FALSE) # target: log of radon activity (activity) # grouping variable: county radonMN = filter(srrs, state=="MN") %>% select("county", "activity") %>% filter(activity > 0) %>% mutate(activity = log(activity), county = base::trimws(county)) %>% mutate(critical = activity>1.5) str(radonMN) ## 'data.frame': 916 obs. of 3 variables: ## $ county : chr "AITKIN" "AITKIN" "AITKIN" "AITKIN" ... ## $ activity: num 0.788 0.788 1.065 0 1.131 ... ## $ critical: logi FALSE FALSE FALSE FALSE FALSE FALSE ...For this example we have three columns of interest:
 county: 85 possible values
 activity: the log of the radon reading (numerical outcome)
 critical: TRUE when activity > 1.5 (categorical outcome)
The goal is to level code county for either the regression problem (predict the log radon reading) or the categorization problem (predict whether the radon level is "critical").
As the graph shows, the conditional mean of log radon activity by county ranges from nearly zero to about 3, and the conditional expectation of a critical reading ranges from zero to one. On the other hand, the number of readings per county is quite low for many counties — only one or two — though some counties have a large number of readings. That means some of the conditional expectations are quite uncertain.
Implementing Level Coders for Partial PoolingLet’s implement level coders that use partial pooling to compute the level score.
Regression
For regression problems, the custom coder should be a function that takes as input:
 v: a string with the name of the categorical variable
 vcol: the actual categorical column (assumed character)
 y: the numerical outcome column
 weights: a column of row weights
The function should return a column of scores (the level codings). For our example, the function builds a lmer model to predict y as a function of vcol, then returns the predictions on the training data.
# @param v character variable name # @param vcol character, independent or input variable # @param y numeric, dependent or outcome variable to predict # @param weights row/example weights # @return scored training data column ppCoderN < function(v, vcol, y, weights) { # regression case y ~ vcol d < data.frame(x = vcol, y = y, stringsAsFactors = FALSE) m < lmer(y ~ (1  x), data=d, weights=weights) predict(m, newdata=d) }Categorization
For categorization problems, the function should assume that y is a logical column, where TRUE is assumed to be the target outcome. This is because vtreat converts the outcome column to a logical while creating the treatment plan.
# @param v character variable name # @param vcol character, independent or input variable # @param y logical, dependent or outcome variable to predict # @param weights row/example weights # @return scored training data column ppCoderC < function(v, vcol, y, weights) { # classification case y ~ vcol d < data.frame(x = vcol, y = y, stringsAsFactors = FALSE) m = glmer(y ~ (1  x), data=d, weights=weights, family=binomial) predict(m, newdata=d, type='link') }You can then pass the functions in as a named list into either designTreatmentsX or mkCrossFrameXExperiment to build the treatment plan. The format of the key is [nc].levelName[.option]*.
The prefacing picks the model type: numeric or regression starts with ‘n.’ and the categorical encoder starts with ‘c.’. Currently, the only supported option is ‘center,’ which directs vtreat to center the codes with respect to the estimated grand mean. ThecatN and catB level codings are centered in this way.
Our example coders can be passed in as shown below.
customCoders = list('n.poolN.center' = ppCoderN, 'c.poolC.center' = ppCoderC) Using the Custom CodersLet’s build a treatment plan for the regression problem.
# I only want to create the cleaned numeric variables, the isBAD variables, # and the level codings (not the indicator variables or catP, etc.) vartypes_I_want = c('clean', 'isBAD', 'catN', 'poolN') treatplanN = designTreatmentsN(radonMN, varlist = c('county'), outcomename = 'activity', codeRestriction = vartypes_I_want, customCoders = customCoders, verbose=FALSE) scoreFrame = treatplanN$scoreFrame scoreFrame %>% select(varName, sig, origName, code) ## varName sig origName code ## 1 county_poolN 1.343072e16 county poolN ## 2 county_catN 2.050811e16 county catNNote that the treatment plan returned both the catN variable (default level encoding) and the pooled level encoding (poolN). You can restrict to just using one coding or the other using the codeRestriction argument either during treatment plan creation, or in prepare().
Let’s compare the two level encodings.
# create a frame with one row for every county, measframe = data.frame(county = unique(radonMN$county), stringsAsFactors=FALSE) outframe = prepare(treatplanN, measframe) # If we wanted only the new pooled level coding, # (plus any numeric/isBAD variables), we would # use a codeRestriction: # # outframe = prepare(treatplanN, # measframe, # codeRestriction = c('clean', 'isBAD', 'poolN')) gather(outframe, key=scoreType, value=score, county_poolN, county_catN) %>% ggplot(aes(x=score)) + geom_density(adjust=0.5) + geom_rug(sides="b") + facet_wrap(~scoreType, ncol=1, scale="free_y") + ggtitle("Distribution of scores")Notice that the poolN scores are "tucked in" compared to the catN encoding. In a later article, we’ll show that the counties with the most tucking in (or shrinkage) tend to be those with fewer measurements.
We can also code for the categorical problem.
# For categorical problems, coding is catB vartypes_I_want = c('clean', 'isBAD', 'catB', 'poolC') treatplanC = designTreatmentsC(radonMN, varlist = c('county'), outcomename = 'critical', outcometarget= TRUE, codeRestriction = vartypes_I_want, customCoders = customCoders, verbose=FALSE) outframe = prepare(treatplanC, measframe) gather(outframe, key=scoreType, value=linkscore, county_poolC, county_catB) %>% ggplot(aes(x=linkscore)) + geom_density(adjust=0.5) + geom_rug(sides="b") + facet_wrap(~scoreType, ncol=1, scale="free_y") + ggtitle("Distribution of link scores")Notice that the poolC link scores are even more tucked in compared to the catB link scores, and that the catB scores are multimodal. The smaller link scores mean that the pooled model avoids estimates of conditional expectation close to either zero or one, because, again, these estimates come from counties with few readings. Multimodal summaries can be evidence of modeling flaws, including omitted variables and unmodeled mixing of different example classes. Hence, we do not want our inference procedure to suggest such structure until there is a lot of evidence for it. And, as is common in machine learning, there are advantages to lowervariance estimators when they do not cost much in terms of bias.
Other ConsiderationsFor this example, we used the lme4 package to create custom level codings. Once calculated, vtreat stores the coding as a lookup table in the treatment plan. This means lme4 is not needed to prepare new data. In general, using a treatment plan is not dependent on any special packages that might have been used to create it, so it can be shared with other users with no extra dependencies.
When using mkCrossFrameXExperiment, note that the resulting cross frame will have a slightly different distribution of scores than what the treatment plan produces. This is true even for catB and catN variables. This is because the treatment plan is built using all the data, while the cross frame is built using nfold cross validation on the data. See the cross frame vignette for more details.
Thanks to Geoffrey Simmons, Principal Data Scientist at Echo Global Logistics, for suggesting partial pooling based level coding (and testing it for us!), introducing us to the references, and reviewing our articles.
In a followup article, we will go into partial pooling in more detail, and motivate why you might sometimes prefer it to vtreat‘s default coding.
ReferencesGelman, Andrew and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Survival Analysis with R
(This article was first published on R Views, and kindly contributed to Rbloggers)
With roots dating back to at least 1662 when John Graunt, a London merchant, published an extensive set of inferences based on mortality records, survival analysis is one of the oldest subfields of Statistics [1]. Basic lifetable methods, including techniques for dealing with censored data, were discovered before 1700 [2], and in the early eighteenth century, the old masters – de Moivre working on annuities, and Daniel Bernoulli studying competing risks for the analysis of smallpox inoculation – developed the modern foundations of the field [2]. Today, survival analysis models are important in Engineering, Insurance, Marketing, Medicine, and many more application areas. So, it is not surprising that R should be rich in survival analysis functions. CRAN’s Survival Analysis Task View, a curated list of the best relevant R survival analysis packages and functions, is indeed formidable. We all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.
Looking at the Task View on a small screen, however, is a bit like standing too close to a brick wall – leftright, updown, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As wellorganized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. Newcomers – people either new to R or new to survival analysis or both – must find it overwhelming. So, it is with newcomers in mind that I offer the following narrow trajectory through the task view that relies on just a few packages: survival, ggplot2, ggfortify, and ranger
The survival package is the cornerstone of the entire R survival analysis edifice. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R. Dr. Terry Therneau, the package author, began working on the survival package in 1986. The first public release, in late 1989, used the Statlib service hosted by Carnegie Mellon University. Thereafter, the package was incorporated directly into Splus, and subsequently into R.
ggfortify enables producing handsome, oneline survival plots with ggplot2::autoplot.
ranger might be the surprise in my very short list of survival packages. The ranger() function is wellknown for being a fast implementation of the Random Forests algorithm for building ensembles of classification and regression trees. But ranger() also works with survival data. Benchmarks indicate that ranger() is suitable for building timetoevent models with the large, highdimensional data sets important to internet marketing applications. Since ranger() uses standard Surv() survival objects, it’s an ideal tool for getting acquainted with survival analysis in this machinelearning age.
Load the dataThis first block of code loads the required packages, along with the veteran dataset from the survival package that contains data from a twotreatment, randomized trial for lung cancer.
library(survival) library(ranger) library(ggplot2) library(dplyr) library(ggfortify) # data(veteran) head(veteran) ## trt celltype time status karno diagtime age prior ## 1 1 squamous 72 1 60 7 69 0 ## 2 1 squamous 411 1 70 5 64 10 ## 3 1 squamous 228 1 60 3 38 0 ## 4 1 squamous 126 1 60 9 63 10 ## 5 1 squamous 118 1 70 11 65 10 ## 6 1 squamous 10 1 20 5 49 0The variables in veteran are: * trt: 1=standard 2=test * celltype: 1=squamous, 2=small cell, 3=adeno, 4=large * time: survival time in days * status: censoring status * karno: Karnofsky performance score (100=good) * diagtime: months from diagnosis to randomization * age: in years * prior: prior therapy 0=no, 10=yes
Kaplan Meier AnalysisThe first thing to do is to use Surv() to build the standard survival object. The variable time records survival time; status indicates whether the patient’s death was observed (status = 1) or that survival time was censored (status = 0). Note that a “+” after the time in the print out of km indicates censoring.
# Kaplan Meier Survival Curve km < with(veteran, Surv(time, status)) head(km,80) ## [1] 72 411 228 126 118 10 82 110 314 100+ 42 8 144 25+ ## [15] 11 30 384 4 54 13 123+ 97+ 153 59 117 16 151 22 ## [29] 56 21 18 139 20 31 52 287 18 51 122 27 54 7 ## [43] 63 392 10 8 92 35 117 132 12 162 3 95 177 162 ## [57] 216 553 278 12 260 200 156 182+ 143 105 103 250 100 999 ## [71] 112 87+ 231+ 242 991 111 1 587 389 33To begin our analysis, we use the formula Surv(futime, status) ~ 1 and the survfit() function to produce the KaplanMeier estimates of the probability of survival over time. The times parameter of the summary() function gives some control over which times to print. Here, it is set to print the estimates for 1, 30, 60 and 90 days, and then every 90 days thereafter. This is the simplest possible model. It only takes three lines of R code to fit it, and produce numerical and graphical summaries.
km_fit < survfit(Surv(time, status) ~ 1, data=veteran) summary(km_fit, times = c(1,30,60,90*(1:10))) ## Call: survfit(formula = Surv(time, status) ~ 1, data = veteran) ## ## time n.risk n.event survival std.err lower 95% CI upper 95% CI ## 1 137 2 0.985 0.0102 0.96552 1.0000 ## 30 97 39 0.700 0.0392 0.62774 0.7816 ## 60 73 22 0.538 0.0427 0.46070 0.6288 ## 90 62 10 0.464 0.0428 0.38731 0.5560 ## 180 27 30 0.222 0.0369 0.16066 0.3079 ## 270 16 9 0.144 0.0319 0.09338 0.2223 ## 360 10 6 0.090 0.0265 0.05061 0.1602 ## 450 5 5 0.045 0.0194 0.01931 0.1049 ## 540 4 1 0.036 0.0175 0.01389 0.0934 ## 630 2 2 0.018 0.0126 0.00459 0.0707 ## 720 2 0 0.018 0.0126 0.00459 0.0707 ## 810 2 0 0.018 0.0126 0.00459 0.0707 ## 900 2 0 0.018 0.0126 0.00459 0.0707 #plot(km_fit, xlab="Days", main = 'Kaplan Meyer Plot') #base graphics is always ready autoplot(km_fit)Next, we look at survival curves by treatment.
km_trt_fit < survfit(Surv(time, status) ~ trt, data=veteran) autoplot(km_trt_fit)And, to show one more small exploratory plot, I’ll do just a little data munging to look at survival by age. First, I create a new data frame with a categorical variable AG that has values LT60 and GT60, which respectively describe veterans younger and older than sixty. While I am at it, I make trt and prior into factor variables. But note, survfit() and npsurv() worked just fine without this refinement.
vet < mutate(veteran, AG = ifelse((age < 60), "LT60", "OV60"), AG = factor(AG), trt = factor(trt,labels=c("standard","test")), prior = factor(prior,labels=c("N0","Yes"))) km_AG_fit < survfit(Surv(time, status) ~ AG, data=vet) autoplot(km_AG_fit)Although the two curves appear to overlap in the first fifty days, younger patients clearly have a better chance of surviving more than a year.
Cox Proportional Hazards ModelNext, I’ll fit a Cox Proportional Hazards Model that makes use of all of the covariates in the data set.
# Fit Cox Model cox < coxph(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior , data = vet) summary(cox) ## Call: ## coxph(formula = Surv(time, status) ~ trt + celltype + karno + ## diagtime + age + prior, data = vet) ## ## n= 137, number of events= 128 ## ## coef exp(coef) se(coef) z Pr(>z) ## trttest 2.946e01 1.343e+00 2.075e01 1.419 0.15577 ## celltypesmallcell 8.616e01 2.367e+00 2.753e01 3.130 0.00175 ** ## celltypeadeno 1.196e+00 3.307e+00 3.009e01 3.975 7.05e05 *** ## celltypelarge 4.013e01 1.494e+00 2.827e01 1.420 0.15574 ## karno 3.282e02 9.677e01 5.508e03 5.958 2.55e09 *** ## diagtime 8.132e05 1.000e+00 9.136e03 0.009 0.99290 ## age 8.706e03 9.913e01 9.300e03 0.936 0.34920 ## priorYes 7.159e02 1.074e+00 2.323e01 0.308 0.75794 ##  ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## exp(coef) exp(coef) lower .95 upper .95 ## trttest 1.3426 0.7448 0.8939 2.0166 ## celltypesmallcell 2.3669 0.4225 1.3799 4.0597 ## celltypeadeno 3.3071 0.3024 1.8336 5.9647 ## celltypelarge 1.4938 0.6695 0.8583 2.5996 ## karno 0.9677 1.0334 0.9573 0.9782 ## diagtime 1.0001 0.9999 0.9823 1.0182 ## age 0.9913 1.0087 0.9734 1.0096 ## priorYes 1.0742 0.9309 0.6813 1.6937 ## ## Concordance= 0.736 (se = 0.03 ) ## Rsquare= 0.364 (max possible= 0.999 ) ## Likelihood ratio test= 62.1 on 8 df, p=1.799e10 ## Wald test = 62.37 on 8 df, p=1.596e10 ## Score (logrank) test = 66.74 on 8 df, p=2.186e11 cox_fit < survfit(cox) #plot(cox_fit, main = "cph model", xlab="Days") autoplot(cox_fit)Note that the model flags small cell type, adeno cell type and karno as significant. However, some caution needs to be exercised in interpreting these results. While the Cox Proportional Hazard’s model is thought to be “robust”, a careful analysis would check the assumptions underlying the model. For example, the Cox model assumes that the covariates do not vary with time. In a vignette [12] that accompanies the survival package Therneau, Crowson and Atkinson demonstrate that the Karnofsky score (karno) is, in fact, timedependent so the assumptions for the Cox model are not met. The vignette authors go on to present a strategy for dealing with time dependent covariates.
Data scientists who are accustomed to computing ROC curves to assess model performance should be interested in the Concordance statistic. The documentation for the survConcordance() function in the survival package defines concordance as “the probability of agreement for any two randomly chosen observations, where in this case agreement means that the observation with the shorter survival time of the two also has the larger risk score. The predictor (or risk score) will often be the result of a Cox model or other regression” and notes that: “For continuous covariates concordance is equivalent to Kendall’s tau, and for logistic regression is is equivalent to the area under the ROC curve.”
To demonstrate using the survival package, along with ggplot2 and ggfortify, I’ll fit Aalen’s additive regression model for censored data to the veteran data. The documentation states: “The Aalen model assumes that the cumulative hazard H(t) for a subject can be expressed as a(t) + X B(t), where a(t) is a timedependent intercept term, X is the vector of covariates for the subject (possibly timedependent), and B(t) is a timedependent matrix of coefficients.”
The plots show how the effects of the covariates change over time. Notice the steep slope and then abrupt change in slope of karno.
aa_fit <aareg(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior , data = vet) aa_fit ## Call: ## aareg(formula = Surv(time, status) ~ trt + celltype + karno + ## diagtime + age + prior, data = vet) ## ## n= 137 ## 75 out of 97 unique event times used ## ## slope coef se(coef) z p ## Intercept 0.083400 3.81e02 1.09e02 3.490 4.79e04 ## trttest 0.006730 2.49e03 2.58e03 0.967 3.34e01 ## celltypesmallcell 0.015000 7.30e03 3.38e03 2.160 3.09e02 ## celltypeadeno 0.018400 1.03e02 4.20e03 2.450 1.42e02 ## celltypelarge 0.001090 6.21e04 2.71e03 0.229 8.19e01 ## karno 0.001180 4.37e04 8.77e05 4.980 6.28e07 ## diagtime 0.000243 4.92e05 1.64e04 0.300 7.65e01 ## age 0.000246 6.27e05 1.28e04 0.491 6.23e01 ## priorYes 0.003300 1.54e03 2.86e03 0.539 5.90e01 ## ## Chisq=41.62 on 8 df, p=1.6e06; test weights=aalen #summary(aa_fit) # provides a more complete summary of results autoplot(aa_fit) Random Forests ModelAs a final example of what some might perceive as a datasciencelike way to do timetoevent modeling, I’ll use the ranger() function to fit a Random Forests Ensemble model to the data. Note however, that there is nothing new about building tree models of survival data. Terry Therneau also wrote the rpart package, R’s basic treemodeling package, along with Brian Ripley. See section 8.4 for the rpart vignette [14] that contains a survival analysis example.
ranger() builds a model for each observation in the data set. The next block of code builds the model using the same variables used in the Cox model above, and plots twenty random curves, along with a curve that represents the global average for all of the patients. Note that I am using plain old base R graphics here.
# ranger model r_fit < ranger(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior, data = vet, mtry = 4, importance = "permutation", splitrule = "extratrees", verbose = TRUE) # Average the survival models death_times < r_fit$unique.death.times surv_prob < data.frame(r_fit$survival) avg_prob < sapply(surv_prob,mean) # Plot the survival models for each patient plot(r_fit$unique.death.times,r_fit$survival[1,], type = "l", ylim = c(0,1), col = "red", xlab = "Days", ylab = "survival", main = "Patient Survival Curves") # cols < colors() for (n in sample(c(2:dim(vet)[1]), 20)){ lines(r_fit$unique.death.times, r_fit$survival[n,], type = "l", col = cols[n]) } lines(death_times, avg_prob, lwd = 2) legend(500, 0.7, legend = c('Average = black'))The next block of code illustrates how ranger() ranks variable importance.
vi < data.frame(sort(round(r_fit$variable.importance, 4), decreasing = TRUE)) names(vi) < "importance" head(vi) ## importance ## karno 0.0734 ## celltype 0.0338 ## diagtime 0.0003 ## trt 0.0007 ## prior 0.0011 ## age 0.0027Notice that ranger() flags karno and celltype as the two most important; the same variables with the smallest pvalues in the Cox model. Also note that the importance results just give variable names and not level names. This is because ranger and other tree models do not usually create dummy variables.
But ranger() does compute Harrell’s cindex (See [8] p. 370 for the definition), which is similar to the Concordance statistic described above. This is a generalization of the ROC curve, which reduces to the WilcoxonMannWhitney statistic for binary variables, which in turn, is equivalent to computing the area under the ROC curve.
cat("Prediction Error = 1  Harrell's cindex = ", r_fit$prediction.error) ## Prediction Error = 1  Harrell's cindex = 0.308269An ROC value of .68 would normally be pretty good for a first try. But note that the ranger model doesn’t do anything to address the time varying coefficients. This apparently is a challenge. In a 2011 paper [16], Hamad observes:
However, in the context of survival trees, a further difficulty arises when time–varying effects are included. Hence, we feel that the interpretation of covariate effects with tree ensembles in general is still mainly unsolved and should attract future research.
I believe that the major use for treebased models for survival data will be to deal with very large data sets.
Finally, to provide an “eyeball comparison” of the three survival curves, I’ll plot them on the same graph.The following code pulls out the survival data from the three model objects and puts them into a data frame for ggplot().
# Set up for ggplot kmi < rep("KM",length(km_fit$time)) km_df < data.frame(km_fit$time,km_fit$surv,kmi) names(km_df) < c("Time","Surv","Model") coxi < rep("Cox",length(cox_fit$time)) cox_df < data.frame(cox_fit$time,cox_fit$surv,coxi) names(cox_df) < c("Time","Surv","Model") rfi < rep("RF",length(r_fit$unique.death.times)) rf_df < data.frame(r_fit$unique.death.times,avg_prob,rfi) names(rf_df) < c("Time","Surv","Model") plot_df < rbind(km_df,cox_df,rf_df) p < ggplot(plot_df, aes(x = Time, y = Surv, color = Model)) p + geom_line()For this data set, I would put my money on a carefully constructed Cox model that takes into account the time varying coefficients. I suspect that there are neither enough observations nor enough explanatory variables for the ranger() model to do better.
This fourpackage excursion only hints at the Survival Analysis tools that are available in R, but it does illustrate some of the richness of the R platform, which has been under continuous development and improvement for nearly twenty years. The ranger package, which suggests the survival package, and ggfortify, which depends on ggplot2 and also suggests the survival package, illustrate how opensource code allows developers to build on the work of their predecessors. The documentation that accompanies the survival package, the numerous online resources, and the statistics such as concordance and Harrell’s cindex packed into the objects produced by fitting the models gives some idea of the statistical depth that underlies almost everything R.
Some Tutorials and PapersFor a very nice, basic tutorial on survival analysis, have a look at the Survival Analysis in R [5] and the OIsurv package produced by the folks at OpenIntro.
Look here for an exposition of the Cox Proportional Hazard’s Model, and here [11] for an introduction to Aalen’s Additive Regression Model.
For an elementary treatment of evaluating the proportional hazards assumption that uses the veterans data set, see the text by Kleinbaum and Klein [13].
For an exposition of the sort of predictive survival analysis modeling that can be done with ranger, be sure to have a look at Manuel Amunategui’s post and video.
See the 1995 paper [15] by Intrator and Kooperberg for an early review of using classification and regression trees to study survival data.
ReferencesFor convenience, I have collected the references used throughout the post here.
[1] Hacking, Ian. (2006) The Emergence of Probability: A Philosophical Study of Early Ideas about Probability Induction and Statistical Inference. Cambridge University Press, 2nd ed., p. 11
[2] Andersen, P.K., Keiding, N. (1998) Survival analysis Encyclopedia of Biostatistics 6. Wiley, pp. 44524461 [3] Kaplan, E.L. & Meier, P. (1958). Nonparametric estimation from incomplete observations, J American Stats Assn. 53, pp. 457–481, 562–563. [4] Cox, D.R. (1972). Regression models and lifetables (with discussion), Journal of the Royal Statistical Society (B) 34, pp. 187–220.
[5] Diez, David. Survival Analysis in R, OpenIntro
[6] Klein, John P and Moeschberger, Melvin L. Survival Analysis Techniques for Censored and Truncated Data, Springer. (1997)
[7] Wright, Marvin & Ziegler, Andreas. (2017) ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R, JSS Vol 77, Issue 1.
[8] Harrell, Frank, Lee, Kerry & Mark, Daniel. Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Statistics in Medicine, Vol 15 (1996), pp. 361387 [9] Amunategui, Manuel. Survival Ensembles: Survival Plus Classification for Improved TimeBased Predictions in R
[10] NUS Course Notes. Chapter 3 The Cox Proportional Hazards Model
[11] Encyclopedia of Biostatistics, 2nd Edition (2005). Aalen’s Additive Regression Model [12] Therneau et al. Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model
[13] Kleinbaum, D.G. and Klein, M. Survival Analysis, A Self Learning Text Springer (2005) [14] Therneau, T and Atkinson, E. An Introduction to Recursive Partitioning Using RPART Routines
[15] Intrator, O. and Kooperberg, C. Trees and splines in survival analysis Statistical Methods in Medical Research (1995)
[16] BouHamad, I. A review of survival trees Statistics Surveys Vol.5 (2011)
Authors’s note: this post was originally published on April 26, 2017 but was subsequently withdrawn because of an error spotted by Dr. Terry Therneau. He observed that the Cox Portional Hazards Model fitted in that post did not properly account for the time varying covariates. This revised post makes use of a different data set, and points to resources for addressing time varying covariates. Many thanks to Dr. Therneau. Any errors that remain are mine.
_____='https://rviews.rstudio.com/2017/09/25/survivalanalysiswithr/';
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R Views. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Super excited for R promises
(This article was first published on Appsilon Data Science Blog, and kindly contributed to Rbloggers)
We at Appsilon are excited about RStudio introducing promises in R quite soon which is going to be a huge step forward in programming in R (we have already used futures and similar libraries to run code asynchronously, however this is going to be a standard and it looks like it’s going to be very easy to use). They support chaining which is a great way of building clean code by piping computations that take a long time.
We’ve used futures/promises/tasks in programming languages like C#, Scala, Javascript and promises have always had a big impact on the way code is structured and on the overall execution speed.
We recently spoke with Joe Chang, and he mentioned promises at the EARL conference. Here are some links if you’re interested in reading more about promises on github and medium.
What do you think? Let us know in the comments.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
eXtremely Boost your machine learning Exercises (Part1)
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
eXtreme Gradient Boosting is a machine learning model which became really popular few years ago after winning several Kaggle competitions. It is very powerful algorithm that use an ensemble of weak learners to obtain a strong learner. Its R implementation is available in xgboost package and it is really worth including into anyone’s machine learning portfolio.
This is the first part of eXtremely Boost your machine learning series. For other parts follow the tag xgboost.
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Load xgboost library and download German Credit dataset. Your goal in this tutorial will be to predict Creditability (the first column in the dataset).
Exercise 2
Convert columns c(2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20) to factors and then encode them as dummy variables. HINT: use model.matrix()
Exercise 3
Split data into training and test set 700:300. Create xgb.DMatrix for both sets with Creditability as label.
Exercise 4
Train xgboost with logistic objective and 30 rounds of training and maximal depth 2.
Exercise 5
To check model performance calculate test set classification error.
Exercise 6
Plot predictors importance.
 Create a machine learning algorithm from a beginner point of view
 Quickly dive into more advanced methods in an accessible pace and with more explanations
 And much more
This course shows a complete workflow start to finish. It is a great introduction and fallback when you have some experience.
Exercise 7
Use xgb.train() instead of xgboost() to add both train and test sets as a watchlist. Train model with same parameters, but 100 rounds to see how it performs during training.
Exercise 8
Train model again adding AUC and Log Loss as evaluation metrices.
Exercise 9
Plot how AUC and Log Loss for train and test sets was changing during training process. Use plotting function/library of your choice.
Exercise 10
Check how setting parameter eta to 0.01 influences the AUC and Log Loss curves.
 Model Evaluation 2
 How to prepare and apply machine learning to your dataset
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part5)
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RcppGSL 0.3.3
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A maintenance update RcppGSL 0.3.3 is now on CRAN. It switched the vignette to the our new pinp package and its twocolumn pdf default.
The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package.
No userfacing new code or features were added. The NEWS file entries follow below:
Changes in version 0.3.3 (20170924)
We also check for gslconfig at package load.

The vignette now uses the pinp package in twocolumn mode.

Minor other fixes to package and testing infrastructure.
Courtesy of CRANberries, a summary of changes to the most recent release is available.
More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RcppCNPy 0.2.7
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A new version of the RcppCNPy package arrived on CRAN yesterday.
RcppCNPy provides R with read and write access to NumPy files thanks to the cnpy library by Carl Rogers.
This version updates internals for function registration, but otherwise mostly switches the vignette over to the shiny new pinp twopage template and package.
Changes in version 0.2.7 (20170922)
Vignette updated to Rmd and use of pinp package

File src/init.c added for dynamic registration
CRANberries also provides a diffstat report for the latest release. As always, feedback is welcome and the best place to start a discussion may be the GitHub issue tickets page.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RcppClassic 0.9.8
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A bugfix release RcppClassic 0.9.8 for the very recent 0.9.7 release which fixes a build issue on macOS introduced in 0.9.7. No other changes.
Courtesy of CRANberries, there are changes relative to the previous release.
Questions, comments etc should go to the rcppdevel mailing list off the RForge page.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Upcoming data preparation and modeling article series
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN.
vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects.
vtreat handles, in a statistically sound fashion:
 Missing values.
 Encoding of categorical values for regularized inference and machine learning techniques.
 Categorical variables with very many values.
 Novel categorical values (that is values not seen during training).
 Variable pruning.
 yaware scaling.
 Structured crossvalidation.
 Mitigating nested model bias.
In our (biased) opinion opinion vtreat has the best methodology and documentation for these important data cleaning and preparation steps. vtreat‘s current public opensource implementation is for inmemory R analysis (we are considering ports and certifying ports of the package some time in the future, possibly for: data.table, Spark, Python/Pandas, and SQL).
vtreat brings a lot of power, sophistication, and convenience to your analyses, without a lot of trouble.
A new feature of vtreat version 0.6.0 is called “custom coders.” WinVector LLC‘s Dr. Nina Zumel is going to start a short article series to show how this new interface can be used to extend vtreat methodology to include the very powerful method of partial pooled inference (a term she will spend some time clearly defining and explaining). Time permitting, we may continue with articles on other applications of custom coding including: ordinal/faithful coders, monotone coders, unimodal coders, and setvalued coders.
Please help us share and promote this article series, which should start in a couple of days. This should be a fun chance to share very powerful methods with your colleagues.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part9)
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
Statistics are often taught in school by and for people who like Mathematics. As a consequence, in those class emphasis is put on leaning equations, solving calculus problems and creating mathematics models instead of building an intuition for probabilistic problems. But, if you read this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistic problems by writing simple R query and how to think in probabilistic terms.
In this series of article I have tried to help you create an intuition on how probabilities work. To do so, we have been using simulations to see how concrete random situation can unfold and learn simple statistics and probabilistic concepts. In today’s set, I would like to show you some deceptively difficult situations that will challenge the way you understand probability and statistics. By doing so, you will practice the simulation technique we have seen in past set, refined your intuition and, hopefully help you avoid some pitfall when you do your own statistical analysis.
Answers to the exercises are available here.
For other parts of this exercise set follow the tag Hacking stats
Exercise 1
Suppose that there are exactly 365 days in a year and that the distribution of birthday in the population is uniform, meaning that the proportion of birth on any given day is the same throughout the year. In a group of 25 people, what is the probability that at least two individuals share the same birthday? Use a simulation to answer that question, then repeat this process for group of 0,10,20,…,90 and 100 people and plot the results.
Of course, when the group size is of 366 we know that the probability that two people share the same birthday is equal to 1, since there are more people than day in the year and for a group of zero person this probability is equal to 0. What is counterintuitive here is the rate at which the probability of observing this grow. From the graph we can see that with just about 23 people we have a probability of about 50% of observing two people having the same birthday and that a group of about 70 people will have almost 100% chance to see this happening.
Exercise 2
Here’s a problem that can someday save your life. Imagine you are a war prisoner in an Asian Communist country and your jailer is getting bored. So to past the time, they set up a Russian roulette game where you and another inmate play against one another. A jailer takes a sixshooter revolver, put two bullets in two consecutive chamber, spin the chamber and give the gun to your opponent, who place the gun to his temple and pull the trigger. Luckily for him, the chamber was empty and the gun is passed to you. Now you have a choice to make: you can let the chamber as it is and play or you can spin the chamber before playing. Use 10000 simulations of both choices to find which choice give you the highest probability to survive.
The key details in this problem is that the bullet are in consecutive chamber. This mean that if your opponent pulls the trigger on an empty chamber, and that you don’t spin the chamber, it’s impossible that you pull the trigger on the second bullet. You can only have an empty chamber of pull the trigger on the first bullet, which means that you have 25% chance of dying vs 2/6=33% chance of dying if you spin the chamber.
Exercise 3
What is the probability that a mother, whose is pregnant with nonidentical twin, give birth to two boys, if we know that one of the unborn child is a boy, but we cannot identifie which one is the boy?
 Work with about different binomial and logistic regression techniques
 Know how to compare regression models and choose the right fit
 And much more
Exercise 4
Two friends play head or tail to pass the time. To make this game more fun they decide to gamble pennies, so for each coin flip one friend call head or tail and if he calls right, he gets a penny and lose one otherwise. Let’s say that they have 40 and 30 pennies respectively and that they will play until someone has all the pennies.
 Create a function that simulate a complete game and return how many coin flip has been done and who win.
 In average, how many coin flip is needed before someone has all the pennies.
 Plot the histogram of the number of coin flipped during a simulation.
 What is the probability that someone wins a coin flip?
 What is the probability that each friend wins all the pennies? Why is it different than the probability of winning a single coin flip?
When the number of coin flip get high enough, the probability of someone winning often enough to get all the pennies rise to 100%. Maybe they will have to play 24h a day for weeks, but someday, someone will lose often enough to be penniless. In this context, the player who started with the most money have a huge advantage since they can survive a much longer losing streak than their opponent.
In fact, in this context where the probability of winning a single game is equal for each opponent the probability of winning all the money is equal to the proportion of the money they start with. That’s in part why the casino always win since they got more money than each gambler that plays against them, as long they get them to play long enough they will win. The fact that they propose game where they have greater chance to win help them quite a bit too.
Exercise 5
A classic counter intuitive is the Monty Hall problem. Here’s the scenario, if you never heard of it: you are on a game show where you can choose one of three doors and if a prize is hidden behind this door, you win this prize. Here’s the twist: after you choose a door, the game show host open one of the two other doors to show that there’s no prize behind it. At this point, you can choose to look behind the door you choose in the first place to see if there’s a prize or you can choose to switch door and look behind the door you left out.
 Simulate 10 000 games where you choose to look behind the first door you have chosen to estimate the probability of winning if you choose to look behind this door.
 Repeat this process, but this time choose to switch door.
 Why the probabilities are different?
When you pick the first door, you have 1/3 chance to have the right door. When the show host open one of the door you didn’t pick he gives you a huge amount of information on where the price is because he opened a door with no prize behind it. So the second door has more chance to hide the prize than the door you took in the first place. Our simulation tell us that this probability is about 1/2. So, you should always switch door since this gives you a higher probability of winning the prize.
To better understand this, imagine that the Grand Canyon is filled with small capsule with a volume of a cube centimeter. Of all those capsules only one has a piece of paper and if you pick this capsule, you win a 50% discount on a tie. You choose a capsule at random and then all the other trillion capsules are discarded except one, such than the winning capsule is still in play. Assuming you really want this discount, which capsule would you choose?
Exercise 6
This problem is a real life example of a statistical pitfall that can easily be encountered in real life and has been published by Steven A. Julious and Mark A. Mullee. In this dataset, we can see if a a medical treatment for kidney stone has been effective. There are two treatments that can be used: treatment A which include all open surgical procedure and treatment B which include small puncture surgery and the kidney stone are classified in two categories depending on his size, small or large stones.
 Compute the success rate (number of success/total number of cases) of both treatments.
 Which treatment seems the more successful?
 Create a contingency table of the success.
 Compute the success rate of both treatments when treating small kidney stones.
 Compute the success rate of both treatments when treating large kidney stones.
 Which treatment is more successful for small kidney stone? For large kidney stone?
This is an example of the Simpson paradox, which is a situation where an effect appears to be present for the set of all observations, but disappears when the observations are categorized and the analysis is done on each group. It is important to test for this phenomenon since in practice most observations can be classified in sub classes and, as the last example showed, this can change drastically the result of your analysis.
Exercise 7
 Download this dataset and do a linear regression with the variable X and Y. Then, compute the slope of the trend line of the regression.
 Do a scatter plot of the variable X and Y and add the trend line to the graph.
 Repeat this process of each of the three categories.
We can see that the general trend of the data is different from the trends of each of the categories. In other words, the Simpson paradox can also be observed in a regression context. The moral of the story is: make sure that all the variables are included in your analysis or you gonna have a bad time!
Exercise 8
For this problem you must know what’s a true positive, false positive, true negative and false negative in a classification problem. You can look at this page for a quick review of those concepts.
A big data algorithm has been developed to detect potential terrorist by looking at their behavior on the internet, their consummation habit and their traveling. To develop this classification algorithm, the computer scientist used data from a population where there’s a lot of known terrorist since they needed data about the habits of real terrorist to validate their work. In this dataset, you will find observations from this high risk population and observations taken from a low risk population.
 Compute the true positive rate, the false positive rate, the true negative rate and the false negative rate of this algorithm for the population that has a high risk of terrorism.
 Repeat this process for the remaining observations. Is there a difference between those rate?
It is a known fact that false positive rate are a lot higher in lowincidence population and this is known as . Basically, when the incidence of a certain condition in the population is lower than the average false positive rate of a test, using that test on this population will result in a much higher false positive cases than usual. This is in part due to the fact that the diminution of true positive case make the proportion of false positive so much higher. As a consequence: don’t trust to much your classification algorithm!
Exercise 9
 Generate a population of 10000 values from a normal distribution of mean 42 and standard deviation of 10.
 Create a sample of 10 observations and estimate the mean of the population. Repeat this 200 times.
 Compute the variation of the estimation.
 Create a sample of 50 observations and estimate the mean of the population. Repeat this 200 times and compute the variation of these estimations.
 Create a sample of 100 observations and estimate the mean of the population. Repeat this 200 times and compute the variation of these estimations.
 Create a sample of 500 observations and estimate the mean of the population. Repeat this 200 times and compute the variation of these estimations.
 Plot the variance of the estimation of the means done with different sample size.
As you can see, the variance of the estimation of the mean is inversely proportional to the sample size, but this is not a linear relationship. A small sample can create an estimation that is a lot farther to the real value than a sample with more observations. Let’s see why this information is relevant to this set.
Exercise 10
A private school advertise that their small size help their student achieve better grade. In their advertisement, they claim that last year’s students have had an average 5 points higher than the average at the standardize state’s test and since no large school has such a high average, that’s proof that small school help student achieve better results.
Suppose that there is 200000 students in the state, their results at the state test was distributed normally with a mean of 76% and a standard deviation of 15, the school had 100 students and that an average school count 750 student. Does the school claim can be explained statistically?
A school can be seen as a sample of the population of student. A large school, like a large sample, has a lot more chance to be representative of the student’s population and their average score will often be near the population average, while small school can show average a lot more extreme just because they have a smaller body of student. I’m not saying that no school are better than other, but we must look at a lot of results to be sure we are not only in presence of a statistical abnormality.
Related exercise sets: Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part8)
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part6)
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part4)
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
How Random Forests improve simple Regression Trees?
(This article was first published on R – insightR, and kindly contributed to Rbloggers)
By Gabriel Vasconcelos
Regression TreesIn this post I am going to discuss some features of Regression Trees an Random Forests. Regression Trees are know to be very unstable, in other words, a small change in your data may drastically change your model. The Random Forest uses this instability as an advantage through bagging (you can see details about bagging here) resulting on a very stable model.
The first question is how a Regression Tree works. Suppose, fore example, that we have the number of points scored by a set of basketball players and we want to relate it to the player’s weight an height. The Regression Tree will simply split the heightweight space and assign a number of points to each partition. The figure below shows two different representations for a small tree. In the left we have the tree itself and in the right how the space is partitioned (the blue line shows the first partition and the red lines the following partitions). The numbers in the end of the tree (and in the partitions) represent the value of the response variable. Therefore, if a basketball player is higher than 1.85 meters and weights more than 100kg it is expected to score 27 points (I invented this data =] ).
You might be asking how I chose the partitions. In general, in each node the partition is chosen through a simple optimization problem to find the best pair variableobservation based on how much the new partition reduces the model error.
What I want to illustrate here is how unstable a Regression Tree can be. The package tree has some examples that I will follow here with some small modifications. The example uses computer CPUs data and the objective is to build a model for the CPU performance based on some characteristics. The data has 209 CPU observations that will be used to estimate two Regression Trees. Each tree will be estimate from a random resample with replacement. Since the data comes from the same place, it would be desirable to have similar results on both models.
library(ggplot2) library(reshape2) library(tree) library(gridExtra) data(cpus, package = "MASS") # = Load Data # = First Tree set.seed(1) # = Seed for Replication tree1 = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus[sample(1:209, 209, replace = TRUE), ]) plot(tree1); text(tree1) # = Second Tree set.seed(10) tree2 = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus[sample(1:209,209, replace=TRUE), ]) plot(tree2); text(tree2)
As you can see, the two trees are different from the start. We can use some figures to verify. First let us calculate the predictions of each model in the real data (not the resample). The first figure is a scatterplot of both predictions and the second figure shows their boxplots. Although the scatterplot shows some relation between the two predictions, it is far from good.
# = Calculate predictions pred = data.frame(p1 = predict(tree1, cpus) ,p2 = predict(tree2, cpus)) # = Plots g1 = ggplot(data = pred) + geom_point(aes(p1, p2)) g2 = ggplot(data = melt(pred)) + geom_boxplot(aes(variable, value)) grid.arrange(g1, g2, ncol = 2) Random ForestAs mentioned before, the Random Forest solves the instability problem using bagging. We simply estimate the desired Regression Tree on many bootstrap samples (resample the data many times with replacement and reestimate the model) and make the final prediction as the average of the predictions across the trees. There is one small (but important) detail to add. The Random Forest adds a new source of instability to the individual trees. Every time we calculate a new optimal variableobservation point to split the tree, we do not use all variables. Instead, we randomly select 2/3 of the variables. This will make the individual trees even more unstable, but as I mentioned here, bagging benefits from instability.
The question now is: how much improvement do we get from the Random Forest. The following example is a good illustration. I broke the CPUs data into a training sample (first 150 observations) and a test sample (remaining observations) and estimated a Regression Tree and a Random Forest. The performance is compared using the mean squared error.
library(randomForest) # = Regression Tree tree_fs = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus[1:150, ]) # = Random Forest set.seed(1) # = Seed for replication rf = randomForest(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data=cpus[1:150, ], nodesize = 10, importance = TRUE) # = Calculate MSE mse_tree = mean((predict(tree_fs, cpus[c(1:150), ])  log(cpus$perf)[c(1:150)])^2) mse_rf = mean((predict(rf, cpus[c(1:150), ])  log(cpus$perf[c(1:150)]))^2) c(rf = mse_rf, tree = mse_tree) ## rf tree ## 0.2884766 0.5660053As you can see, the Regression Tree has an error twice as big as the Random Forest. The only problem is that by using a combination of trees any kind of interpretation becomes really hard. Fortunately, there are importance measures that allow us to at least know which variables are more relevant in the Random Forest. In our case, both importance measures pointed to the cache size as the most important variable.
importance(rf) ## %IncMSE IncNodePurity ## syct 22.60512 22.373601 ## mmin 19.46153 21.965340 ## mmax 24.84038 27.239772 ## cach 27.92483 33.536185 ## chmin 13.77196 13.352793 ## chmax 17.61297 8.379306Finally, we can see how the model error decreases as we increase the number of trees in the Random Forest with the following code:
plot(rf)If you liked this post, you can find more details on Regression Trees and Random forest in the book Elements of Statistical learning, which can be downloaded direct from the authors page here.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – insightR. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Welcome to R/exams
(This article was first published on R/exams, and kindly contributed to Rbloggers)
Welcome everybody, we are proud to introduce the brand new web page and
blog http://www.Rexams.org/. This provides a central access point for
the opensource software “exams” implemented in the
R system for statistical computing.
R/exams is a oneforall exams generator using (potentially) dynamic
exercises written in R/Markdown or R/LaTeX and it can export a variety of
formats from learning management systems to classical written exams.
This post gives a brief overview of what has happened so far and what
you can expect to find here in the next months.
The package was originally started more than a decade ago to facilitate
classical written exams with randomized exercises for largelecture courses.
Like many other teachers of introductory statistics and mathematics courses,
we were in need of infrastructure for conducting written exams with about 1,0001,500
students. A team effort of colleagues at WU Wirtschaftsuniversität Wien
lead to a large collection of dynamic exercises and the software was eventually
released at https://CRAN.Rproject.org/package=exams.
Over the years learning management systems (like
Moodle,
Blackboard,
OLAT, etc.)
became easily available at virtually every university, creating a desire to
employ the same dynamic exercises also for online tests and quizzes. Hence,
the R/exams infrastructure was completely reimplemented allowing to export
the same exercises not only to written exams (with automatic scanning
and evaluation) but also to learning management systems, the live voting
system ARSnova, as well as customizable standalone
files in PDF, HTML, Docx, and beyond.
Despite (or rather because of) the flexibility of the software, novice R/exams
users often struggled with adopting it because the documentation provided in
the package is either somewhat technical and/or targeted towards more experienced
R users.
Hence, this web page and blog make it easy for new users to explore the possibilities
of R/exams before reading about a lot of technical details. It also provides accessible
guides to common tasks and examples for dynamic exercises with different complexity.
For a first tour you can check out the oneforall approach of
the package based on (potentially) dynamic exercise templates
for generating large numbers of personalized exams/quizzes/tests, either for
elearning or classical written exams (with
automatic evaluation).
Some tutorials already explain the installation of R/exams (with
dependencies like LaTeX or pandoc) and the first steps in writing dynamic exercises
using either Markdown or LaTeX markup along with randomization in R. There is
also a gallery of exercise templates, ranging from basic multiplechoice
questions with simple shuffling to more advanced exercises with random data, graphics,
etc.
For the next months we plan to write more tutorial blog posts that help to accomplish
common tasks, e.g., handson guidance for exporting exercises from R to Moodle or
tips how to write good randomized exercises. If you want to give us feedback or ask
us to cover certain topics please feel free to contact us – be it via email,
discussion forum, or twitter. Also if you want to link R/examsbased materials or share
share experiences of using R/exams in a guest post, please let us know.
Big shout to all contributors that helped R/exams to grow so much
over the last years. A special thank you goes to Patrik Keller, Hanna Krimm, and Reto
Stauffer who established the infrastructure for this web page (using R/Markdown and Jekyll)
and designed graphics/icons/photos/etc. Thanks also to everyone sharing this post,
especially on http://www.Rbloggers.com/ and https://RWeekly.org/.
To leave a comment for the author, please follow the link and comment on their blog: R/exams. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Big Data Analytics with H20 in R Exercises Part 1
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
We have dabbled with RevoScaleR before , In this exercise we will work with H2O , another high performance R library which can handle big data very effectively .It will be a series of exercises with increasing degree of difficulty . So Please do this in sequence .
H2O requires you to have Java installed in your system .So please install Java before trying with H20 .As always check the documentation before trying these exercise set .
Answers to the exercises are available here.
If you want to install the latest release from H20 , install it via this instructions .
Exercise 1
Download the latest stable release from h20 and initialize the cluster
Exercise 2
Check the cluster information via clusterinfo
Exercise 3
You can see how h2o works via the demo function , Check H2O’s glm via demo method .
Exercise 4
down load the loan.csv from H2O’s github repo and import it using H2O .
Exercise 5
Check the type of imported loan data and notice that its not a dataframe , check the summary of the loan data .
Hint use h2o.summary()
Exercise 6
One might want to transfer a dataframe from R environment to H2O , use as.h2o to conver the mtcars dataframe as a H2OFrame
 work with different data import techniques,
 know how to import data and transform it for a specific moddeling or analysis goal,
 and much more.
Exercise 7
Check the dimension of the loan H2Oframe via h2o.dim
Exercise 8
Find the colnames from the H2OFrame of loan data.
Exercise 9
Check the histogram of the loan amount of loan H2Oframe .
Exercise 10
Find the mean of loan amount by each home ownership group from the loan H2OFrame
 Big Data analytics with RevoScaleR Exercises2
 Data wrangling : I/O (Part1)
 DensityBased Clustering Exercises
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Tutorial: Launch a Spark and R cluster with HDInsight
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
If you'd like to get started using R with Spark, you'll need to set up a Spark cluster and install R and all the other necessary software on the nodes. A really easy way to achieve that is to launch an HDInsight cluster on Azure, which is just a managed Spark cluster with some useful extra components. You'll just need to configure the components you'll need, in our case R and Microsoft R Server, and RStudio Server.
This tutorial explains how to launch an HDInsight cluster for use with R. It explains how to size the cluster and launch the cluster, connect to it via SSH, install Microsoft R Server (with R) on each of the nodes, and install RStudio Server community edition to use as an IDE on the edge node. (If you find you need a larger or smaller cluster after you've set it up, it's easy to resize the cluster dynamically.) Once you have the cluster up an running, here are some things you can try:
 Import and explore data within the Spark cluster
 Use the sparklyr package to interact with Spark Data Frames from RStudio
 Operationalize R functions on the Spark cluster for remote calling
 Train statistical models on very large data sets with the RevoScaleR package
To get started with R and Spark, follow the instructions for setting up a HDInsight cluster at the link below.
Microsoft Azure Documentation: Get started using R Server on HDInsight
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
MultiDimensional Reduction and Visualisation with tSNE
(This article was first published on R Programming – DataScience+, and kindly contributed to Rbloggers)
tSNE is a very powerful technique that can be used for visualising (looking for patterns) in multidimensional data. Great things have been said about this technique. In this blog post I did a few experiments with tSNE in R to learn about this technique and its uses.
Its power to visualise complex multidimensional data is apparent, as well as its ability to cluster data in an unsupervised way.
What’s more, it is also quite clear that tSNE can aid machine learning algorithms when it comes to prediction and classification. But the inclusion of tSNE in machine learning algorithms and ensembles has to be ‘crafted’ carefully, since tSNE was not originally intended for this purpose.
All in all, tSNE is a powerful technique that merits due attention.
tSNELet’s start with a brief description. tSNE stands for tDistributed Stochastic Neighbor Embedding and its main aim is that of dimensionality reduction, i.e., given some complex dataset with many many dimensions, tSNE projects this data into a 2D (or 3D) representation while preserving the ‘structure’ (patterns) in the original dataset. Visualising highdimensional data in this way allows us to look for patterns in the dataset.
tSNE has become so popular because:
 it was demonstrated to be useful in many situations,
 it’s incredibly flexible, and
 can often find structure where other dimensionalityreduction algorithms cannot.
A good introduction on tSNE can be found here. The original paper on tSNE and some visualisations can be found at this site. In particular, I like this site which shows how tSNE is used for unsupervised image clustering.
While tSNE itself is computationally heavy, a faster version exists that uses what is known as the BarnesHut approximation. This faster version allows tSNE to be applied on large realworld datasets.
tSNE for Exploratory Data AnalysisBecause tSNE is able to provide a 2D or 3D visual representation of highdimensional data that preserves the original structure, we can use it during initial data exploration. We can use it to check for the presence of clusters in the data and as a visual check to see if there is some ‘order’ or some ‘pattern’ in the dataset. It can aid our intuition about what we think we know about the domain we are working in.
Apart from the initial exploratory stage, visualising information via tSNE (or any other algorithm) is vital throughout the entire analysis process – from those first investigations of the raw data, during data preparation, as well as when interpreting the outputs of machine learning algorithms and presenting insights to others. We will see further on that we can use tSNE even during the predcition/classification stage itself.
Experiments on the Optdigits datasetIn this post, I will apply tSNE to a wellknown dataset, called optdigits, for visualisation purposes.
The optdigits dataset has 64 dimensions. Can tSNE reduce these 64 dimensions to just 2 dimension while preserving structure in the process? And will this structure (if present) allow handwritten digits to be correctly clustered together? Let’s find out.
tsne packageWe will use the tsne package that provides an exact implementation of tSNE (not the BarnesHut approximation). And we will use this method to reduce dimensionality of the optdigits data to 2 dimensions. Thus, the final output of tSNE will essentially be an array of 2D coordinates, one per row (image). And we can then plot these coordinates to get the final 2D map (visualisation). The algorithm runs in iterations (called epochs), until the system converges. Every
\(K\) number of iterations and upon convergence, tSNE can call a usersupplied callback function, and passes the list of 2D coordinates to it. In our callback function, we plot the 2D points (one per image) and the corresponding class labels, and colourcode everything by the class labels.
The images below show how the clustering improves as more epochs pass.
As one can see from the above diagrams (especially the last one, for epoch 1000), tSNE does a very good job in clustering the handwritten digits correctly.
But the algorithm takes some time to run. Let’s try out the more efficient BarnesHut version of tSNE, and see if we get equally good results.
Rtsne packageThe Rtsne package can be used as shown below. The perplexity parameter is crucial for tSNE to work correctly – this parameter determines how the local and global aspects of the data are balanced. A more detailed explanation on this parameter and other aspects of tSNE can be found in this article, but a perplexity value between 30 and 50 is recommended.
traindata < read.table("optdigits.tra", sep=",") trn < data.matrix(traindata) require(Rtsne) # perform dimensionality redcution from 64D to 2D tsne < Rtsne(as.matrix(trn[,1:64]), check_duplicates = FALSE, pca = FALSE, perplexity=30, theta=0.5, dims=2) # display the results of tSNE cols < rainbow(10) plot(tsne$Y, t='n') text(tsne$Y, labels=trn[,65], col=cols[trn[,65] +1])Note how clearlydefined and distinctly separable the clusters of handwritten digits are. We have only a minimal amount of incorrect entries in the 2D map.
Of Manifolds and the Manifold AssumptionHow can tSNE achieve such a good result? How can it ‘drop’ 64dimensional data into just 2 dimensions and still preserve enough (or all) structure to allow the classes to be separated?
The reason has to do with the mathematical concept of manifolds. A Manifold is a \(d\)dimensional surface that lives in an \(D\)dimensional space, where \(d < D\). For the 3D case, imagine a 2D piece of paper that is embedded within 3D space. Even if the piece of paper is crumpled up extensively, it can still be ‘unwrapped’ (uncrumpled) into the 2D plane that it is. This 2D piece of paper is a manifold in 3D space. Or think of an entangled string in 3D space – this is a 1D manifold in 3D space.
Now there’s what is known as the Manifold Hypothesis, or the >Manifold Assumption, that states that natural data (like images, etc.) forms lower dimensional manifolds in their embedding space. If this assumption holds (there are theoretical and experimental evidence for this hypothesis), then tSNE should be able to find this lowerdimensional manifold, ‘unwrap it’, and present it to us as a lowerdimensional map of the original data.
tSNE vs. SOMtSNE is actually a tool that does something similar to SelfOrganising Maps (SOMs), though the underlying process is quite different. We have used a SOM on the optdigits dataset in a previous blog and obtained the following unsupervised clustering shown below.
The tSNE map is more clear than that obtained via a SOM and the clusters are separated much better.
tSNE for Shelter Animals datasetIn a previous blog, I applied machine learning algorithms for predicting the outcome of shelter animals. Let’s now apply tSNE to the dataset – I am using the cleaned and modified data as described in this blog entry.
trn < read.csv('train.modified.csv.zip') trn < data.matrix(trn) require(Rtsne) # scale the data first prior to running tSNE trn[,1] < scale(trn[,1]) tsne < Rtsne(trn[,1], check_duplicates = FALSE, pca = FALSE, perplexity=50, theta=0.5, dims=2) # display the results of tSNE cols < rainbow(5) plot(tsne$Y, t='n') text(tsne$Y, labels=as.numeric(trn[,1]), col=cols[trn[,1]])Here, while it is evident that there is some structure and patterns to the data, clustering by OutcomeType has not happened.
Let’s now use tSNE to perform dimensionality reduction to 3D on the same dataset. We just need to set the dims parameter to 3 instead of 2. And we will use package rgl for plotting the 3D map produced by tSNE.
tsne < Rtsne(trn[,1], check_duplicates = FALSE, pca = FALSE, perplexity=50, theta=0.5, dims=2) #display results of tSNE require(rgl) plot3d(tsne$Y, col=cols[trn[,1]]) legend3d("topright", legend = '0':'5', pch = 16, col = rainbow(5))This 3D map has a richer structure than the 2D version, but again the resulting clustering is not done by OutcomeType.
One possible reason could be that the Manifold Assumption is failing for this dataset when trying to reduce to 2D and 3D. Please note that the Manifold Assumption cannot be always true for a simple reason. If it were, we could take an arbitrary set of \(N\)dimensional points and conclude that they lie on some \(M\)dimensional manifold (where \(M < N\)). We could then take those \(M\)dimensional points and conclude that they lie on some other lowerdimensional manifold (with, say, \(K\) dimensions, where \(K < M\)). We could repeat this process indefinitely and eventually conclude that our arbitrary \(N\)dimensional dataset lies on a 1D manifold. Because this cannot be true for all datasets, the manifold assumption cannot always be true. In general, if you have \(N\)dimensional points being generated by a process with \(N\) degrees of freedom, your data will not lie on a lowerdimensional manifold. So probably the Animal Shelter dataset has degrees of freedom higher than 3.
So we can conclude that tSNE did not aid much the initial exploratory analysis for this dataset.
tSNE and machine learning?One disadvantage of tSNE is that there is currently no incremental version of this algorithm. In other words, it is not possible to run tSNE on a dataset, then gather a few more samples (rows), and “update” the tSNE output with the new samples. You would need to rerun tSNE from scratch on the full dataset (previous dataset + new samples). Thus tSNE works only in batch mode.
This disadvantage appears to makes it difficult for tSNE to be used in a machine learning system. But as we will see in a future post, it is still possible to use tSNE (with care) in a machine learning solution. And the use of tSNE can improve classification results, sometimes markedly. The limitation is the extra nonrealtime processing brought about by tSNE’s batch mode nature.
Stay tuned.
Related Post
 Comparing Trump and Clinton’s Facebook pages during the US presidential election, 2016
 Analyzing Obesity across USA
 Can we predict flu deaths with Machine Learning and R?
 Graphical Presentation of Missing Data; VIM Package
 Creating Graphs with Python and GooPyCharts
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
My advice on dplyr::mutate()
There are substantial differences between adhoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: adhoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.
Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.
“Character is what you are in the dark.”
John Whorfin quoting Dwight L. Moody.
I have found: to deliver production worthy data science and predictive analytic systems, one has to develop perteam and perproject field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.
What I want to do is share a single small piece of WinVector LLC‘s current guidance on using the R package dplyr.
 Disclaimer: WinVector LLC has no official standing with RStudio, or dplyr development.

However:
“One need not have been Caesar in order to understand Caesar.”
Alternately: Georg Simmmel or Max Webber.
WinVector LLC, as a consultancy, has experience helping large companies deploy enterprise big data solutions involving R, dplyr, sparklyr, and Apache Spark. WinVector LLC, as a training organization, has experience in how new users perceive, reason about, and internalize how to use R and dplyr. Our group knows how to help deploy production grade systems, and how to help new users master these systems.
From experience we have distilled a lot of best practices. And below we will share one.
From: “R for Data Science; Whickham, Grolemund; O’Reilly, 2017” we have:
Note that you can refer to columns that you’ve just created:
mutate(flights_sml, gain = arr_delay  dep_delay, hours = air_time / 60, gain_per_hour = gain / hours )Let’s try that with database backed data:
suppressPackageStartupMessages(library("dplyr")) packageVersion("dplyr") # [1] ‘0.7.3’ db < DBI::dbConnect(RSQLite::SQLite(), ":memory:") flights < copy_to(db, nycflights13::flights, 'flights') mutate(flights, gain = arr_delay  dep_delay, hours = air_time / 60, gain_per_hour = gain / hours ) # # Source: lazy query [?? x 22] # # Database: sqlite 3.19.3 [:memory:] # year month day dep_time sched_dep_time ... # ... # 1 2013 1 1 517 515 ... # ...That worked. One of the selling points of dplyr is a lot of dplyr is sourcegeneric or sourceagnostic: meaning it can be run against different data providers (inmemory, databases, Spark).
However, if a new user tries to extend such an example (say adding gain_per_minutes) they run into this:
mutate(flights, gain = arr_delay  dep_delay, hours = air_time / 60, gain_per_hour = gain / hours, gain_per_minute = 60 * gain_per_hour ) # Error in rsqlite_send_query(conn@ptr, statement) : # no such column: gain_per_hourIt is hard for experts to understand how frustrating the above is to a new R user or to a part time R user. It feels like any variation on the original code causes it to fail. None of the rules they have been taught anticipate this, or tell them how to get out of this situation.
This quickly leads to strong feelings of learned helplessness and anxiety.
Our rule for dplyr::mutate() has been for some time:
Each column name used in a single mutate must appear only on the lefthandside of a single assignment, or otherwise on the righthandside of any number of assignments (but never both sides, even if it is different assignments).
Under this rule neither of the above mutates() are allowed. The second should be written as (switching to pipenotation):
flights %>% mutate(gain = arr_delay  dep_delay, hours = air_time / 60) %>% mutate(gain_per_hour = gain / hours) %>% mutate(gain_per_minute = 60 * gain_per_hour)And the above works.
If we teach this rule we can train users to be properly cautious, and hopefully avoid them becoming frustrated, scared, anxious, or angry.
dplyr documentation (such as “help(mutate)“) does not strongly commit to what order mutate expressions are executed in, or visibility and durability of intermediate results (i.e., a full description of intended semantics). Our rule intentionally limits the user to a set of circumstances where none of those questions matter.
Now the error we saw above is a mere bug that one expects will be fixed some day (in fact it is dplyr issue 3095, we looked a bit at the generate queries here). It can be a bit unfair to criticize a package for having a bug.
However, confusion around reuse of column names has been driving dplyr issues for quite some time:
 dplyr issue 3095
 dplyr issue 2884
 dplyr issue 2883
 dplyr pull 2869
 dplyr issue 2842
 dplyr pull 2483
 dplyr issue 2481
 dplyr issue 2360
It makes sense to work in a reliable and teachable subdialect of dplyr that will serve users well (or barring that, you can use an adapter, such as seplyr). In production you must code to what systems are historically reliably capable of, not just the specification.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Mining USPTO full text patent data – Analysis of machine learning and AI related patents granted in 2017 so far – Part 1
(This article was first published on Alexej's blog, and kindly contributed to Rbloggers)
The United States Patent and Trademark office (USPTO) provides immense amounts of data (the data I used are in the form of XML files). After coming across these datasets, I thought that it would be a good idea to explore where and how my areas of interest fall into the intellectual property space; my areas of interest being machine learning (ML), data science, and artificial intelligence (AI).
I started this exploration by downloading the full text data (excluding images) for all patents that were assigned by the USPTO within the year 2017 up to the time of writing (Patent Grant Full Text Data/XML for the year 2017 through the week of Sept 12 from the USPTO Bulk Data Storage System).
In this blog post – “Part 1” – I address questions such as: How many ML and AI related patents were granted? Who are the most prolific inventors? The most frequent patent assignees? Where are inventions made? And when? Is the number of ML and AI related patents increasing over time? How long does it take to obtain a patent for a ML or AI related invention? Is the patent examination time shorter for big tech companies? etc.
“Part 2” will be an in depth analysis of the language used in patent titles, descriptions, and claims, and “Part 3” will be on experimentation with with deep neural nets applied to the patents’ title, description, and claims text.
Identifying patents related to machine learning and AIFirst, I curated a patent full text dataset consisting of “machine learning and AI related” patents.
I am not just looking for instances where actual machine learning or AI algorithms were patented; I am looking for inventions which are related to ML or AI in any/some capacity. That is, I am interested in patents where machine learning, data mining, predictive modeling, or AI is utilized as a part of the invention in any way whatsoever. The subset of relevant patents was determined by a keyword search as specified by the following definition.
Definition: For the purposes of this blog post, a machine learning or AI related patent is a patent that contains at least one of the keywords
“machine learning”, “deep learning”, “neural network”, “artificial intelligence”, “statistical learning”, “data mining”, or “predictive model”
in its invention title, description, or claims text (while of course accounting for capitalization, pluralization, etc.).1
With this keyword matching approach a total of 6665 patents have been selected. The bar graph below shows how many times each keyword got matched.
Interestingly the term “neural network” is even more common than the more general terms “machine learning” and “artificial intelligence”.
Some example patentsHere are three (randomly chosen) patents from the resulting dataset. For each printed are the invention title, the patent assignee, as well as one instance of the keyword match within the patent text.
## $`2867` ## [1] "Fuselage manufacturing system" ## [2] "THE BOEING COMPANY" ## [3] "... using various techniques. For example, at least ## one of an artificial intelligence program, a ## knowledgebase, an expert ..." ## ## $`1618` ## [1] "Systems and methods for validating wind farm ## performance measurements" ## [2] "General Electric Company" ## [3] "... present disclosure leverages and fuses ## accurate available sensor data using machine ## learning algorithms. That is, the more ..." ## ## $`5441` ## [1] "Trigger repeat order notifications" ## [2] "Accenture Global Services Limited" ## [3] "... including user data obtained from a user ## device; obtaining a predictive model that ## estimates a likelihood of ..."And here are three examples of (randomly picked) patents that contain the relevant keywords directly in their invention title.
## $`5742` ## [1] "Adaptive demodulation method and apparatus using an ## artificial neural network to improve data recovery ## in high speed channels" ## [2] "Kelquan Holdings Ltd." ## [3] "... THE INVENTION\nh0002\n1 The present invention ## relates to a neural network based integrated ## demodulator that mitigates ..." ## ## $`3488` ## [1] "Clustertrained machine learning for image processing" ## [2] "Amazon Technologies, Inc." ## [3] "... BACKGROUND\nh0001\n1 Artificial neural networks, ## especially deep neural network ..." ## ## $`3103` ## [1] "Methods and apparatus for machine learning based ## malware detection" ## [2] "Invincea, Inc." ## [3] "... a need exists for methods and apparatus that can ## use machine learning techniques to reduce the amount ..." Who holds these patents (inventors and assignees)?The first question I would like to address is who files most of the machine learning and AI related patents.
Each patent specifies one or several inventors, i.e., the individuals who made the patented invention, and a patent assignee which is typically the inventors’ employer company that holds the rights to the patent. The following bar graph visualizes the top 20 most prolific inventors and the top 20 most frequent patent assignees among the analyzed ML and AI related patents.
It isn’t surprising to see this list of companies. The likes of IBM, Google, Amazon, Microsoft, Samsung, and AT&T rule the machine learning and AI patent space. I have to admit that I don’t recognize any of the inventors’ names (but it might just be me not being familiar enough with the ML and AI community).
There are a number of interesting followup questions which for now I leave unanswered (hard to answer without additional data):
 What is the count of ML and AI related patents by industry or type of company (e.g., big tech companies vs. startups vs. reserach universities vs. government)?
 Who is deriving the most financial benefit by holding ML or AI related patents (either through licensing or by driving out the competition)?
Even though the examined patents were filed in the US, some of the inventions may have been made outside of the US.
In fact, the data includes specific geographic locations for each patent, so I can map in which cities within the US and the world inventors are most active.
The following figure is based on where the inventors are from, and shows the most active spots. Each point corresponds to the total number of inventions made at that location (though note that the color axis is a log10 scale, and so is the point size).
The results aren’t that surprising.
However, we see that most (ML and AI related) inventions patented with the USPTO were made in the US. I wonder if inventors in other countries prefer to file patents in their home countries’ patent offices rather the in the US.
Alternatively, we can also map the number of patents per inventors’ origin countries.
Sadly, there seem to be entire groups of countries (e.g., almost the entire African continent) which seem to be excluded from the USPTO’s patent game, at least with respect to machine learning and AI related inventions.
Whether it is a lack of access, infrastructure, education, political treaties or something else is an intriguing question.
Each patent has a date of filing and an assignment date attached to it. Based on the provided dates one can try to address questions such as:
When were these patents filed? Is the number of ML and AI related patents increasing over time? How long did it usually take from patent filing to assignment? And so on.
For the set of ML and AI related patents that were granted between Jan 3 and Sept 12 2017 the following figure depicts…
 …in the top panel: number of patents (yaxis) per their original month of filing (xaxis),
 …in the bottom panel: the number of patents (yaxis) that were assigned (approved) per week (xaxis) in 2017 so far.
The patent publication dates plot suggests that the number of ML and AI related patents seems to be increasing slightly throughout the year 2017.
The patent application dates plot suggests that the patent examination phase for the considered patents takes about 2.5 years. In fact the average time from patent filing to approval is 2.83 years with a standard deviation of 1.72 years in this dataset (that is, among the considered ML and AI related patents in 2017). However, the range is quite extensive spanning 0.2412.57 years.
The distribution of the duration from patent filing date to approval is depicted by following figure.
So, what are some of the inventions that took longest to get approved? Here are the five patents with the longest examination periods:
 “Printing and dispensing system for an electronic gaming device that provides an undisplayed outcome” (~12.57 years to approval; assignee: Aim Management, Inc.)
 “Apparatus for biological sensing and alerting of pharmacogenomic mutation” (~12.24 years to approval; assignee: NA)
 “System for tracking a player of gaming devices” (~12.06 years to approval; assignee: Aim Management, Inc.)
 “Device, method, and computer program product for customizing game functionality using images” (~11.72 years to approval; assignee: NOKIA TECHNOLOGIES OY)
 “Method for the spectral identification of microorganisms” (~11.57 years to approval; assignee: MCGILL UNIVERSITY, and HEALTH CANADA)
Each of these patents is related to either gaming or biotech. I wonder if that’s a coincidence…
We can also look at the five patents with the shortest approval time:
 “Home device application programming interface” (~91 days to approval; assignee: ESSENTIAL PRODUCTS, INC.)
 “Avoiding dazzle from lights affixed to an intraoral mirror, and applications thereof” (~104 days to approval; assignee: DENTAL SMARTMIRROR, INC.)
 “Media processing methods and arrangements” (~106 days to approval; assignee: Digimarc Corporation)
 “Machine learning classifier that compares price risk score, supplier risk score, and item risk score to a threshold” (~111 days to approval; assignee: ACCENTURE GLOBAL SOLUTIONS LIMITED)
 “Generating accurate reason codes with complex nonlinear modeling and neural networks” (~111 days to approval; assignee: SAS INSTITUTE INC.)
Interstingly the patent approved in the shortest amount of time among all 6665 analysed (ML and AI related) patents is some smart home thingy from Andy Rubin’s hyped up company Essential.
Do big tech companies get their patents approved faster than other companies (e.g., startups)?The following figure separates the patent approval times according to the respective assignee company, considering several of the most well known tech giants.
Indeed some big tech companies, such as AT&T or Samsung, manage to push their patent application though the USPTO process much faster than most other companies. However, there are other tech giants, such as Microsoft, which on average take longer to get their patent applications approved than even the companies in category “Other”. Also noteworthy is the fact that big tech companies tend to have fewer outliers regarding the patent examination process duration than companies in the category “Other”.
Of course it would also be interesting to categorize all patent assignees into categories like “Startup”, “Big Tech”, “University”, or “Government”, and compare the typical duration of the patent examination process between such groups. However, it’s not clear to me how to establish such categories without collecting additional data on each patent assignee, which at this point I don’t have time for :stuck_out_tongue:.
ConclusionThere is definitely a lot of promise in the USPTO full text patent data.
Here I have barely scratched the surface, and I hope that I will find the time to play around with these data some more.
The end goal is, of course, to replace the patent examiner with an AI trained on historical patent data. :stuck_out_tongue_closed_eyes:
This work (blog post and included figures) is licensed under a Creative Commons Attribution 4.0 International License.

There are two main aspects to my reasoning as to this particular choice of keywords. (1) I wanted to keep the list relatively short in order to have a more specific search, and (2) I tried to avoid keywords which may generate false positives (e.g., the term “AI” would match all sorts of codes present in the patent text, such as “123456789 AI N1”). In no way am I claiming that this is a perfect list of keywords to identify ML and AI related patents, but I think that it’s definitely a good start. ↩
To leave a comment for the author, please follow the link and comment on their blog: Alexej's blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Will Stanton hit 61 home runs this season?
(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to Rbloggers)
[edit: Juho Kokkala corrected my homework. Thanks! I updated the post. Also see some further elaboration in my reply to Andrew’s comment. As Andrew likes to say …]
So far, Giancarlo Stanton has hit 56 home runs in 555 at bats over 149 games. Miami has 10 games left to play. What’s the chance he’ll hit 61 or more home runs? Let’s make a simple backoftheenvelope Bayesian model and see what the posterior event probability estimate is.
Sampling notation
A simple model that assumes a home run rate per at bat with a uniform (conjugate) prior:
The data we’ve seen so far is 56 home runs in 555 at bats, so that gives us our likelihood.
Now we need to simulate the rest of the season and compute event probabilities. We start by assuming the atbats in the rest of the season is Poisson.
We then take the number of home runs to be binomial given the number of at bats and the home run rate.
Finally, we define an indicator variable that takes the value 1 if the total number of home runs is 61 or greater and the value of 0 otherwise.
Event probability
The probability Stanton hits 61 or more home runs (conditioned on our model and his performance so far) is then the posterior expectation of that indicator variable,
Computation in R
The posterior for is analytic because the prior is conjugate, letting us simulate the posterior chance of success given the observed successes (56) and number of trials (555). The number of at bats is independent and also easy to simulate with a Poisson random number generator. We then simulate the number of hits on the outside as a random binomial, and finally, we compare it to the total and then report the fraction of simulations in which the simulated number of home runs put Stanton at 61 or more:
> sum(rbinom(1e5, rpois(1e5, 10 * 555 / 149), rbeta(1e5, 1 + 56, 1 + 555  56)) >= (61  56)) / 1e5 [1] 0.34That is, I’d give Stanton about a 34% chance conditioned on all of our assumptions and what he’s done so far.
Disclaimer
The above is intended for recreational use only and is not intended for serious bookmaking.
Exercise
You guessed it—code this up in Stan. You can do it for any batter, any number of games left, etc. It really works for any old statistics. It’d be even better hierarchically with more players (that’ll pull the estimate for down toward the league average). Finally, the event probability can be done with an indicator variable in the generated quantities block.
The basic expression looks like you need discrete random variables, but we only need them here for the posterior calculation in generated quantities. So we can use Stan’s random number generator functions to do the posterior simulations right in Stan.
The post Will Stanton hit 61 home runs this season? appeared first on Statistical Modeling, Causal Inference, and Social Science.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Pirating Pirate Data for Pirate Day
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
This past Tuesday was Talk Like A Pirate Date, the unofficial holiday of R (aRRR!) users worldwide. In recognition of the day, Bob Rudis used R to create this map of worldwide piracy incidents from 2013 to 2017.
The post provides a useful and practical example of extracting data from a website without an API, otherwise known as "scraping" data. In this case the website was the International Chamber of Commerce, and the provided R code demonstrates several useful R packages for scraping:
 rvest, for extracting data from a web pages' HTML source
 purrr, and specifically the safely function, for streamlining the process of iterating over pages that may return errors
 purrrly, for iterating over the rows of scraped data
 splashr, for automating the process of taking a screenshot of a webpage
 and robotstxt, for checking whether automated downloads of the website content are allowed
That last package is an important one, because while it's almost always technically possible to automate the process of extracting data from a website, it's not always allowed or even legal. Inspecting the robots.txt file is one check you should definitely make, and you should also check the terms of service of the website which may also prohibit the practice. Even then, you should be respectful and take care not to overload the server with frequent and/or repeated calls, as Bob demonstrates by spacing requests by 5 seconds. Finally — and most importantly! — even if scraping isn't expressly forbidden, using scraped data may not be ethical, especially when the data is about people who are unable to give their individual consent to your use of the data. This Forbes article about analyzing data scraped from a dating website offers an instructive tale in that regard.
This piracy data example however provides a case study of using websites and the data it provides in the right way. Follow the link below all for the details.
rud.is: Pirating Web Content Responsibly With R
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Exploratory Data Analysis of Tropical Storms in R
(This article was first published on RProjects – Stoltzmaniac, and kindly contributed to Rbloggers)
Exploratory Data Analysis of Tropical Storms in RThe disastrous impact of recent hurricanes, Harvey and Irma, generated a large influx of data within the online community. I was curious about the history of hurricanes and tropical storms so I found a data set on data.world and started some basic Exploratory data analysis (EDA).
EDA is crucial to starting any project. Through EDA you can start to identify errors & inconsistencies in your data, find interesting patterns, see correlations and start to develop hypotheses to test. For most people, basic spreadsheets and charts are handy and provide a great place to start. They are an easytouse method to manipulate and visualize your data quickly. Data scientists may cringe at the idea of using a graphical user interface (GUI) to kickoff the EDA process but those tools are very effective and efficient when used properly. However, if you’re reading this, you’re probably trying to take EDA to the next level. The best way to learn is to get your hands dirty, let’s get started.
The original source of the data was can be found at DHS.gov.
Step 1: Take a look at your data set and see how it is laid out FID YEAR MONTH DAY AD_TIME BTID NAME LAT LONG WIND_KTS PRESSURE CAT 2001 1957 8 8 1800Z 63 NOTNAMED 22.5 140.0 50 0 TS 2002 1961 10 3 1200Z 116 PAULINE 22.1 140.2 45 0 TS 2003 1962 8 29 0600Z 124 C 18.0 140.0 45 0 TS 2004 1967 7 14 0600Z 168 DENISE 16.6 139.5 45 0 TS 2005 1972 8 16 1200Z 251 DIANA 18.5 139.8 70 0 H1 2006 1976 7 22 0000Z 312 DIANA 18.6 139.8 30 0 TDFortunately, this is a tidy data set which will make life easier and appears to be cleaned up substantially. The column names are relatively straightforward with the exception of “ID” columns.
The description as given by DHS.gov:
This dataset represents Historical North Atlantic and Eastern North Pacific Tropical Cyclone Tracks with 6hourly (0000, 0600, 1200, 1800 UTC) center locations and intensities for all subtropical depressions and storms, extratropical storms, tropical lows, waves, disturbances, depressions and storms, and all hurricanes, from 1851 through 2008. These data are intended for geographic display and analysis at the national level, and for large regional areas. The data should be displayed and analyzed at scales appropriate for 1:2,000,000scale data.
Step 2: View some descriptive statistics YEAR MONTH DAY WIND_KTS PRESSURE Min. :1851 Min. : 1.000 Min. : 1.00 Min. : 10.00 Min. : 0.0 1st Qu.:1928 1st Qu.: 8.000 1st Qu.: 8.00 1st Qu.: 35.00 1st Qu.: 0.0 Median :1970 Median : 9.000 Median :16.00 Median : 50.00 Median : 0.0 Mean :1957 Mean : 8.541 Mean :15.87 Mean : 54.73 Mean : 372.3 3rd Qu.:1991 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.: 70.00 3rd Qu.: 990.0 Max. :2008 Max. :12.000 Max. :31.00 Max. :165.00 Max. :1024.0We can confirm that this particular data had storms from 1851 – 2010, which means the data goes back roughly 100 years before naming storms started! We can also see that the minimum pressure values are 0, which likely means it could not be measured (due to the fact zero pressure is not possible in this case). We can see that there are recorded months from January to December along with days extending from 1 to 31. Whenever you see all of the dates laid out that way, you can smile and think to yourself, “if I need to, I can put dates in an easy to use format such as YYYYmmdd (20170912)!”
Step 3: Make a basic plotThis is a great illustration of our data set and we can easily notice an upward trend in the number of storms over time. Before we go running to tell the world that the number of storms per year is growing, we need to drill down a bit deeper. This could simply be caused because more types of storms were added to the data set (we know there are hurricanes, tropical storms, waves, etc.) being recorded. However, this could be a potential starting point for developing a hypothesis for timeseries data.
You will notice the data starts at 1950 rather than 1851. I made this choice because storms were not named until this point so it would be difficult to try and count the unique storms per year. It could likely be done by finding a way to utilize the “ID” columns. However, this is a preliminary analysis so I didn’t want to dig too deep.
Step 4: Make some calculations YEAR Distinct_Storms Distinct_Storms_Change Distinct_Storms_Pct_Change 1951 10 3 0.23 1952 6 4 0.40 1953 8 2 0.33 1954 8 0 0.00 1955 10 2 0.25 1956 7 3 0.30 1957 9 2 0.29 1958 10 1 0.11 1959 13 3 0.30 1960 13 0 0.00In this case, we can see the number of storms, nominal change and percentage change per year. These calculations help to shed light on what the growth rate looks like each year. So we can use another summary table:
Distinct_Storms Distinct_Storms_Change Distinct_Storms_Pct_Change Min. : 6.00 Min. :15.0000 Min. :0.42000 1st Qu.:15.75 1st Qu.: 3.0000 1st Qu.:0.12000 Median :25.00 Median : 1.0000 Median : 0.04000 Mean :22.81 Mean : 0.3448 Mean : 0.05966 3rd Qu.:28.00 3rd Qu.: 3.7500 3rd Qu.: 0.25000 Max. :43.00 Max. : 16.0000 Max. : 1.14000From the table we can state the following for the given time period:
 The mean number of storms is 23 per year (with a minimum of 6 and maximum of 43)
 The mean change in the number of storms per year is 0.34 (with a minimum of 15 and maximum of 16)
 The mean percent change in the number of storms per year is 6% (with a minimum of 42% and maximum of 114%)
Again, we have to be careful because these numbers are in aggregate and may not tell the whole story. Dividing these into groups of storms is likely much more meaningful.
Step 5: Make a more interesting plotBecause I was most interested in hurricanes, I filtered out only the data which was classified as “H (15).” By utilizing a data visualization technique called “small multiples” I was able to pull out the different types and view them within the same graph. While this is possible to do in tables and spreadsheets, it’s much easier to visualize this way. By holding the axes constant, we can see the majority of the storms are classified as H1 and then it appears to consistently drop down toward H5 (with very few actually being classified as H5). We can also see that most have an upward trend from 1950 – 2010. The steepest appears to be H1 (but it also flattens out over the last decade).
Step 6: Make a filtered calculation Distinct_Storms Distinct_Storms_Change Distinct_Storms_Pct_Change Min. : 4.00 Min. :11.00000 Min. :0.56000 1st Qu.: 9.25 1st Qu.: 4.00000 1st Qu.:0.25750 Median :13.50 Median : 0.00000 Median : 0.00000 Mean :12.76 Mean : 0.05172 Mean : 0.08379 3rd Qu.:15.00 3rd Qu.: 3.00000 3rd Qu.: 0.35250 Max. :24.00 Max. : 10.00000 Max. : 1.80000Now we are looking strictly at hurricane data (classified as H1H5):
 The mean number of hurricanes is 13 per year (with a minimum of 4 and maximum of 24)
 The mean change in the number of hurricanes per year is 0.05 (with a minimum of 11 and maximum of 10)
 The mean percent change in the number of hurricanes per year is 8% (with a minimum of 56% and maximum of 180%)
While it doesn’t really make sense to say “we had an average growth of 0.05 hurricanes per year between 1950 and 2010” … it may make sense to say “we saw an average of growth of 8% per year in the number of hurricanes between 1950 and 2010.”
That’s a great thing to put in quotes!
During EDA we discovered an average of growth of 8% per year in the number of hurricanes between 1950 and 2010.
Side Note: Be ready, as soon as you make a statement like that, you will likely have to explain how you arrived at that conclusion. That’s where having an RMarkdown notebook and data online in a repository will help you out! Reproducible research is all of the hype right now.
Step 7: Try visualizing your statementsA histogram and/or density plot is a great way to visualize the distribution of the data you are making statements about. This plot helps to show that we are looking at a rightskewed distribution with substantial variance. Knowing that we have n = 58 (meaning 58 years after being aggregated), it’s not surprising that our histogram looks sparse and our density plot has an unusual shape. At this point, you can make a decision to jot this down, research it in depth and then attack it with full force.
However, that’s not what we’re covering in this post.
Step 8: Plot another aspect of your data60K pieces of data can get out of hand quickly, we need to back this down into manageable chunks. Building on the knowledge from our last exploration, we should be able to think of a way to cut this down to get some better information. The concept of small multiples could come in handy again! Splitting the data up by type of storm could prove to be valuable. We can also tell that we are missing
After filtering the data down to hurricanes and utilizing a heat map rather than plotting individual points we can get a better handle on what is happening where. The H4 and H5 sections are probably the most interesting. It appears as if H4 storms are more frequent on the West coast of Mexico whereas the H5 are most frequent in the Gulf of Mexico.
Because we’re still in EDA mode, we’ll continue with another plot.
Here are some of the other storms from the data set. We can see that TD, TS and L have large geographical spreads. The E, SS, and SD storms are concentrated further North toward New England.
Digging into this type of data and building probabilistic models is a fascinating field. The actuarial sciences are extremely difficult and insurance companies really need good models. Having mapped this data, it’s pretty clear you could dig in and find out what parts of the country should expect what types of storms (and you’ve also known this just from being alive for 10+ years). More hypotheses could be formed about location at this stage and could be tested!
Step 9: Look for a relationshipWhat is the relationship between WIND_KTS and PRESSURE? This chart helps us to see that low PRESSURE and WIND_KTS are likely negatively correlated. We can also see that the WIND_KTS is essentially the predictor in the data set which can perfectly predict how a storm is classified. Well, it turns out, that’s basically the distinguishing feature when scientists are determining how to categorize these storms!
Step N……We have done some basic EDA and identified some patterns in the data. While doing some EDA can simply be just for fun, in most data analysis, it’s important to find a place to apply these discoveries by making and testing hypotheses! There are plenty of industries where you could find a use for this:
 Insurance – improve statistical modeling and risk analysis
 Real Estate Development – identify strategic investment locations
 Agriculture – crop selection
 …
The rest is up to you! This is a great data set and there are a lot more pieces of information lurking within it. I want people to do their own EDA and send me anything interesting!
Some fun things to check out:
 What was the most common name for a hurricane?
 Do the names actually follow an alphabetical pattern through time? (This is one is tricky)
 If the names are in alphabetical order, how often does a letter get repeated in a year?
 Can we merge this data with FEMA, charitable donations, or other aid data?
To get you started on the first one, here’s the Top 10 most common names for tropical storms. Why do you think it’s Florence?
Thank you for reading, I hope this helps you with your own data. The code is all written in R and is located on my GitHub. You can also find other data visualization posts and usages of ggplot2 on my blog Stoltzmaniac
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: RProjects – Stoltzmaniac. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
GoldMining – Week 3 (2017)
(This article was first published on R – Fantasy Football Analytics, and kindly contributed to Rbloggers)
Week 3 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!
The post GoldMining – Week 3 (2017) appeared first on Fantasy Football Analytics.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...