Using sparklyr with Microsoft R Server
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
The sparklyr package (by RStudio) provides a highlevel interface between R and Apache Spark. Among many other things, it allows you to filter and aggregate data in Spark using the dplyr syntax. In Microsoft R Server 9.1, you can now connect to a a Spark session using the sparklyr package as the interface, allowing you to combine the datapreparation capabilities of sparklyr and the dataanalysis capabilities of Microsoft R Server in the same environment.
In a presentation by at the Spark Summit (embedded below, and you can find the slides here), Ali Zaidi shows how to connect to a Spark session from Microsoft R Server, and use the sparklyr package to extract a data set. He then shows how to build predictive models on this data (specifically, a deep Neural Network and a Boosted Trees classifier). He also shows how to build general ensemble models, crossvalidate hyperparameters in parallel, and even gives a preview of forthcoming streaming analysis capabilities.
Any easy way to try out these capabilities is with Azure HDInsight 3.6, which provides a managed Spark 2.1 instance with Microsoft R Server 9.1.
Spark Summit: Extending the R API for Spark with sparklyr and Microsoft R Server
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Comparing Partial Least Squares to Johnson’s Relative Weights
(This article was first published on R – Displayr, and kindly contributed to Rbloggers)
In this post I explore two different methods for computing the relative importance of predictors in regression: Johnson’s Relative Weights and Partial Least Squares (PLS) regression. Both techniques solve a problem with Multiple Linear Regression, which can perform poorly when there are correlations between predictor variables.
When there is a very high correlation between two predictor variables, Multiple Linear Regression can lead to one of the variables being found to be a strong predictor, while the other is found to have a relatively small effect. Relative Weights computes importance scores that factor in the correlation between the predictors. The goal of PLS is slightly different. It is designed to work in situations where it is impossible to get stable results from multiple regression. Instability may be because of extremely high correlations between variables (aka multicollinearity), or where there are more predictor variables than observations.
Relative Weights is my “go to” method in situations where there are nontrivial correlations between predictor variables (click here to read more about some of its strengths). However, in some fields, such as chemometrics and sensory research, PLS is the “go to” method because they have more predictor variables than observations. In this post I seek to understand how similar the techniques are when applied to a data set with moderate correlations.
The case studyThe data set I am using for the case study contains 34 predictor variables and 1,893 observations. I use it to look at the relationship between 327 consumer’s perceptions of six different cola brands. Click here to read more about this case study. The correlations between the variables are shown below.
Comparing the methodsThe chart below shows the relative importance computed using Johnson’s Relative Weights, PLS, and Multiple Linear Regression. (Click here to see how to estimate PLS in R and here to see how to do this in Displayr.) In contrast to this earlier case study using this data, I am showing negative effects for Multiple Linear Regression. I use the same signs for Relative Weights because Relative Weights does not distinguish between whether the effect is positive or negative.
The chart shows us that there is a strong correlation between each of the methods. However, that should not be a great surprise, as they all use the same data and have the same functional form (i.e., linear with additive effects).
The R2 statistic computed predicting the PLS coefficients by the Multiple Regression coefficients is 0.98. I fitted this model without an intercept. This demonstrates that the Multiple Linear Regression and PLS results are very similar.
The R2 between the PLS and the Relative Weights is 0.85, which tells us that PLS is much more similar to Multiple Linear Regression than to Relative Weights. The R2 between Multiple Linear Regression and the Relative Weights is 0.81, which means that PLS is somewhere in between the two other techniques, but is much more similar to Multiple Linear Regression.
If we look at the chart we can see some striking differences between Relative Weights and the other two methods:
 Healthconscious is much less important for Relative Weights than in the other analyses.
 All the negative importance scores are much closer to 0 for Relative Weights than for the other analyses.
 The small positive importance scores are greater for Relative Weights than for the other analyses.
The above analysis shows that we get large and meaningful differences between Relative Weights and PLS. Why? One explanation relates to standardization. Relative weights automatically standardizes the predictor variables, presenting the importance scores in terms of their ability to explain R2. PLS does not standardize the predictors. A second explanation relates to negative coefficients. Relative Weights ignores the sign of coefficients, with the signs for the Relative Weights in the above analyses derived from a separate Multiple Regression. To appreciate the effect of these two things, I:
 Standardized all the variables (i.e., divide each variable’s raw data by the variable’s standard deviation).
 Reestimated all of the models, but only used variables with positive coefficients.
The coefficients are plotted below. They are a lot more similar than in the previous analysis. The PLS and Multiple Linear Regression, are particularly close, with an R2 of 0.998. The R2 for the PLS versus the Relative Weights is 0.95. For the Relative Weights versus Multiple Linear Regression it is 0.94. Given each analysis is for the same data set and the same core assumptions, 0.94 is still quite a difference.
Looking at the chart, we can see that the conclusions from the models are substantively different. In particular:
 Relative Weights tends to conclude that the relatively unimportant variables are more important than is the case with either PLS and Multiple Linear Regression. This can be seen from the bars at the very bottom of the chart.
 The conclusions of the models regarding Healthconscious remain striking. The Multiple Regression and PLS both conclude that is the 10th most important variables. By contrast, Relative Weights has it as the least important variable. This earlier case study discusses the reasons for the difference.
Although PLS and Johnson’s Relative Weights are both techniques for dealing with correlations between predictors, they give fundamentally different results. In this data set, where the correlations are moderate, PLS is little different from Multiple Linear Regression. By contrast, Relative Weights gives substantially different conclusions.
TRY IT OUT
All the analysis in this post was conducted using R in Displayr. Click here to review the underlying data and code, or to run your own analyses.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Understanding Data
(This article was first published on RStats – Tom Hopper, and kindly contributed to Rbloggers)
When analyzing data, I have often found it useful to think of the data as being one of four main types, according to the typology proposed by Stevens.[1] Different types of data have certain characteristics; understanding what type of data you have helps with selecting the analysis to perform and prevent basic analysis mistakes.
The types, or “scales of measurement,” are:
 Nominal
 Data identifying unique classifications or objects where the order of values is not meaningful. Examples include zip codes, gender, nationality, sports teams and multiple choice answers on a test.
 Ordinal
 Data where the order is important but the difference or distance between items is not important or not measured. Examples include team rankings in sport (team A is better than team B, but how much better is open to debate), scales such as health (e.g. “healthy” to “sick”), ranges of opinion (e.g. “strongly agree” to “strongly disagree” or “on a scale of 1 to 10”) and Intelligence Quotient.
 Interval
 Numeric data identified by values where the degree of difference between items is significant and meaningful, but their ratio is not. Common examples are dates—we can say 2000 CE is 1000CE + 1000 years, but 1000 CE is not half of 2000 CE in any meaningful way—and temperatures on the Celsius and Fahrenheit scales, where a difference of 10° is meaningful, but 10° is not twice as hot as 5°.
 Ratio
 Numeric data where the ratio between numbers is meaningful. Usually, such scales have a meaningful “0.” Examples include length, mass, velocity, acceleration, voltage, power, duration, energy and Kelvinscale temperature.
The generallyappropriate statistics and mathematical operations for each type are summarized in table 1.
Table 1: Scales of measurement and allowed statistical and mathematical operations. Scale Type Statistics Operations Nominal mode, frequency, chisquared, cluster analysis =, ≠ Ordinal above, plus: median, nonparametric tests, KruskalWallis, rankcorrelation =, ≠, >, < Interval plus: arithmetic mean, some parametric tests, correlation, regression, ANOVA (sometimes), factor analysis =, ≠, >, <, +, – Ratio plus: geometric and harmonic mean, ANOVA, regression, correlation coefficient =, ≠, >, <, +, , ×, ÷While this is a useful typology for most use, and certainly for initial consideration, there are valid criticisms of Stevens’ typology. For example, percentages and count data have some characteristics of ratioscale data, but with additional constraints. e.g. the average of the counts may not be meaningful. This typology is a useful thinking tool, but it is essential to understand the statistical methods being applied and their sensitivity to departures from underlying assumptions.
Types of data in RR[2] recognizes at least fifteen different types of data. Several of these are related to identifying functions and other objects—most users don’t need to worry about most of them. The main types that industrial engineers and scientists will need to use are:
 numeric
 Real numbers. Also known as double, real and single (note that R stores all real numbers in doubleprecision). May be used for all scales of measurement, but is particularly suited to ratio scale measurements.
 complex
 Imaginary real numbers can be manipulated directly as a data type using
x < 1 + i2
or
x < complex(real=1, imaginary=2)Like type numeric, may be used for all scales of measurement.
 integer
 Stores integers only, without any decimal point. Can be used mainly for ordinal or interval data, but may be used as ratio data—such as counts—with some caution.
 logical
 Stores Boolean values of TRUE or FALSE, typically used as nominal data.
 character
 Stores text strings and can be used as nominal or ordinal data.
The above types of data can be stored in several types, or structures, of variables. The equivalent to a variable in Excel would be rows, columns or tables of data. The main ones that we will use are:
 vector
 Contains one or many elements, and behaves like a column or row of data. Vectors can contain any of the above types of data but each vector is stored, or encoded, as a single type. The vector
c(1, 2, 1, 3, 4)
## [1] 1 2 1 3 4
is, by default, a numeric vector of type double, but
c(1, 2, 1, 3, 4, "name") ## [1] "1" "2" "1" "3" "4" "name"will be a character vector, or a vector where all data is stored as type character, and the numbers will be stored as characters rather than numbers. It will not be possible to perform mathematical operations on these numbersstoredascharacters without first converting them to type numeric.
 factor
 A special type of character vector, where the text strings signify factor levels and are encoded internally as integer counts of the occurrence of each factor. Factors can be treated as nominal data when the order does not matter, or as ordinal data when the order does matter. factor(c("a", "b", "c", "a"), levels=c("a","b","c","d")) ## [1] a b c a ## Levels: a b c d
 array
 A generalization of vectors from one dimension to two or more dimensions. Array dimensions must be predefined and can have any number of dimensions. Like vectors, all elements of an array must be of the same data type. (Note that the letters object used in the example below is a variable supplied by R that contains the letters a through z.) # letters a  c in 2x4 array array(data=letters[1:3], dim=c(2,4)) ## [,1] [,2] [,3] [,4] ## [1,] "a" "c" "b" "a" ## [2,] "b" "a" "c" "b" # numbers 1  3 in 2x4 array array(data=1:3, dim=c(2,4)) ## [,1] [,2] [,3] [,4] ## [1,] 1 3 2 1 ## [2,] 2 1 3 2
 matrix
 A special type of array with the properties of a mathematical matrix. It may only be twodimensional, having rows and columns, where all columns must have the same type of data and every column must have the same number of rows. R provides several functions specific to manipulating matrices, such as taking the transpose, performing matrix multiplication and calculation eigenvectors and eigenvalues. matrix(data = rep(1:3, times=2), nrow=2, ncol=3) ## [,1] [,2] [,3] ## [1,] 1 3 2 ## [2,] 2 1 3
 list
 Vectors whose elements are other R objects, where each object of the list can be of a different data type, and each object can be of different length and dimension than the other objects. Lists can therefore store all other data types, including other lists. list("text", "more", 2, c(1,2,3,2)) ## [[1]] ## [1] "text" ## ## [[2]] ## [1] "more" ## ## [[3]] ## [1] 2 ## ## [[4]] ## [1] 1 2 3 2
 data.frame
 For most industrial and data scientists, data frames are the most widely useful type of variable. A data.frame is the list analog to the matrix: it is an list where all columns must be vectors of the same number of rows (determined with NROW()). However, unlike matrices, different columns can contain different types of data and each row and column must have a name. If not named explicitly, R names rows by their row number and columns according to the data assigned assigned to the column. Data frames are typically used to store the sort of data that industrial engineers and scientists most often work with, and is the closest analog in R to an Excel spreadsheet. Usually data frames are made up of one or more columns of factors and one or more columns of numeric data. data.frame(rnorm(5), rnorm(5), rnorm(5)) ## rnorm.5. rnorm.5..1 rnorm.5..2 ## 1 0.2939566 1.28985202 0.01669957 ## 2 0.3672161 0.01663912 1.02064116 ## 3 1.0871615 1.13855476 0.78573775 ## 4 0.8501263 0.17928722 1.03848796 ## 5 1.6409403 0.34025455 0.62113545
More generally, in R all variables are objects, and R distinguishes between objects by their internal storage type and by their class declaration, which are accessible via the typeof() and class() functions. Functions in R are also objects, and the users can define new objects to control the output from functions like summary() and print(). For more on objects, types and classes, see section 2 of the R Language Definition.
Table 2 summarizes the internal storage and R classes of the main data and variable types.
Table 2: Table of R data and variable types. Variable type Storage type Class Measurement Scale vector of decimals double numeric ratio vector of integers integer integer ratio or interval vector of complex complex complex ratio vector of characters character character nominal factor vector integer factor nominal or ordinal matrix of decimals double matrix ratio data frame list data.frame mixed list list list mixed References Stevens, S. S. “On the Theory of Scales of Measurement.” Science. 103.2684 (1946): 677680. Print.
 R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.Rproject.org/.
To leave a comment for the author, please follow the link and comment on their blog: RStats – Tom Hopper. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Using Partial Least Squares to Conduct Relative Importance analysis in R
Partial Least Squares (PLS) is a popular method for relative importance analysis in fields where the data typically includes more predictors than observations. Relative importance analysis is a general term applied to any technique used for estimating the importance of predictor variables in a regression model. The output is a set of scores which enable the predictor variables to be ranked based upon how strongly each influences the outcome variable.
There are a number of different approaches to calculating relative importance analysis including Relative Weights and Shapley Regression as described here and here. In this blog post I briefly describe how to use an alternative method, Partial Least Squares, in R. Because it effectively compresses the data before regression, PLS is particularly useful when the number of predictor variables is more than the number of observations.
Partial Least SquaresPLS is a dimension reduction technique with some similarity to principal component analysis. The predictor variables are mapped to a smaller set of variables, and within that smaller space we perform a regression against the outcome variable. In Principal Component Analysis the dimension reduction procedure ignores the outcome variable. However PLS aims to choose new mapped variables that maximally explain the outcome variable.
To get started I’ll import some data into R and examine it with the following few lines of code:
cola.url = "http://wiki.qresearchsoftware.com/images/d/db/Stacked_colas.csv" colas = read.csv(cola.url) str(colas)The output below show 37 variables. I am going to predict pref, the strength of a respondent’s preference for a brand on a scale from 1 to 5. To do this I’ll use the 34 binary predictor variables that indicate whether the person perceives the brand to have a particular characteristic.
Using Partial Least Squares in RThe next step is to remove unwanted variables and then build a model. Cross validation is used to find the optimal number of retained dimensions. Then the model is rebuilt with this optimal number of dimensions. This is all contained in the R code below.
colas = subset(colas, select = c(URLID, brand)) library(pls) pls.model = plsr(pref ~ ., data = colas, validation = "CV") # Find the number of dimensions with lowest cross validation error cv = RMSEP(pls.model) best.dims = which.min(cv$val[estimate = "adjCV", , ])  1 # Rerun the model pls.model = plsr(pref ~ ., data = colas, ncomp = best.dims) Producing the OutputFinally, we extract the useful information and format the output.
coefficients = coef(pls.model) sum.coef = sum(sapply(coefficients, abs)) coefficients = coefficients * 100 / sum.coef coefficients = sort(coefficients[, 1 , 1]) barplot(tail(coefficients, 5))The regression coefficients are normalized so their absolute sum is 100 and the result is sorted.
The results below show that Reliable and Fun are positive predictors of preference. You could run the code barplot(head(coefficients, 5)) to see that at the other end of the scale Unconventional and Sleepy are negative predictors.
TRY IT OUT
Displayr is a data science platform providing analytics, visualization and the full power of R. You can perform this analysis for yourself in Displayr.
Second step with nonlinear regression: adding predictors
(This article was first published on R Programming – DataScience+, and kindly contributed to Rbloggers)
In this post we will see how to include the effect of predictors in nonlinear regressions. In other words, letting the parameters of nonlinear regressions vary according to some explanatory variables (or predictors). Be sure to check the first post on this if you are new to nonlinear regressions.
The example that I will use throughout this post is the logistic growth function, it is often used in ecology to model population growth. For instance, say you count the number of bacteria cells in a petri dish, in the beginning the cell counts will increase exponentially but after some time due to limits in resources (be it space or food), the bacteria population will reach an equilibrium. This will produce the classical Sshaped, nonlinear, logistic growth function. The logistic growth function has three parameters: the growth rate called “r”, the population size at equilibrium called “K” and the population size at the beginning called “n0”.
First Example: Categorical PredictorSay that we want to test the effect of food quality on the population dynamic of our bacteria. To do so we use three different type of food and we count the number of bacteria over time. Here is how to code this in R using the function nlsList from the package nlme:
#load libraries library(nlme) #first try effect of treatment on logistic growth Ks < c(100,200,150) n0 < c(5,5,6) r < c(0.15,0.2,0.15) time < 1:50 #this function returns population dynamics following #a logistic curves logF < function(time,K,n0,r){ d < K * n0 * exp(r*time) / (K + n0 * (exp(r*time)  1)) return(d) } #simulate some data dat < data.frame(Treatment=character(),Time=numeric(), Abundance=numeric()) for(i in 1:3){ Ab < logF(time = time,K=Ks[i],n0=n0[i],r=r[i]) tmp < data.frame(Treatment=paste0("T",i),Time=time, Abundance=Ab+rnorm(time,0,5)) #note that random deviates were added to the simulated #population density values dat <rbind(dat,tmp) } #the formula for the models lF<formula(Abundance~K*n0*exp(r*Time)/(K+n0*(exp(r*Time)1))  Treatment) #fit the model (m < nlsList(lF,data=dat,start=list(K=150,n0=10,r=0.5))) Call: Model: Abundance ~ K * n0 * exp(r * Time)/(K + n0 * (exp(r * Time)  1))  Treatment Data: dat Coefficients: K n0 r T1 105.2532 4.996502 0.1470740 T2 199.5149 4.841359 0.2006150 T3 150.2882 5.065196 0.1595293 Degrees of freedom: 150 total; 141 residual Residual standard error: 5.085229The code simulated population values using three sets of parameters (the r, K and n0’s). Then we specified the nonlinear regression formula, using the pipe “” symbol to explicitly ask for fitting different parameters to each Treatment. The model output gives us the estimated parameters for each Treatment. We can now plot the model predictions against the real data:
#derive the predicted lines dat$Pred <predict(m) #plot ggplot(dat,aes(x=Time,y=Abundance,color=Treatment))+ geom_point()+geom_line(aes(y=Pred)) Second Example: Continuous PredictorWe can also model the effect of continuous predictors on the parameters of the nonlinear regression. For instance, we might assume that bacterial colonies grow faster in warmer temperature. In this case we could model the effect of temperature on growth rate (assuming that it is linear) and use the fitted growth rate to model the number of bacteria, all this within one model, pretty neat.
To do so I will use another package: bbmle and its function mle2:
The first argument for mle2 is either a function returning the negative loglikelihood or a formula of the form “response ~ some distribution(expected value)”. In our case we fit a poisson distribution with expected values coming form the logistic equation and its three parameters. The dependency between the growth rate and the temperature is given using the “parameter” argument. The basic model output gives us all the estimated parameters, of interest here are the “r.(Intercept)” which is the growth rate when the temperature is at 0 and “r.Temp” which is the increase in the growth rate for every increase of temperature by 1.
Again we can get the fitted values for plotting:
datT$pred < predict(mll) ggplot(datT,aes(x=Time,y=Abund,color=factor(Temp)))+ geom_point()+geom_line(aes(y=pred)) Third Example: More on Continuous PredictorsThe cool thing with mle2 is that you can fit any models that you can imagine, as long as you are able to write down the loglikelihood functions. We will see this with an extension of the previous model. It is a common assumption in biology that species should have some optimum temperature, hence we can expect a bellshape relation between population equilibrium and temperature. So we will add to the previous model a quadratic relation between the population equilibrium (“K”) and temperature.
#simulate some data mm < sapply(Temp,function(x){ rpois(50,logF(time = 1:50,K = 100+20*xx**2,n0=n0,r = 0.05 + 0.01 * x))}) #note that this time both K and r are given as vectors to represent their #relation with temperature #some reshaping datT < melt(mm) names(datT) < c("Time","Temp","Abund") datT$Temp < Temp[datT$Temp] #the negative loglikelihood function LL < function(k0,k1,k2,n0,r0,r1){ K < k0 + k1*Temp + k2*Temp**2 r < r0 + r1*Temp lbd < K*n0*exp(r*Time)/(K+n0*(exp(r*Time)1)) sum(dpois(Abund,lbd,log=TRUE)) } (mll2 < mle2(LL,data=datT,start=list(k0=100,k1=1,k2=0.1,n0=10,r0=0.1,r1=0.1))) Call: mle2(minuslogl = LL, start = list(k0 = 100, k1 = 1, k2 = 0.1, n0 = 10, r0 = 0.1, r1 = 0.1), data = datT) Coefficients: k0 k1 k2 n0 r0 r1 100.47610915 20.09677780 1.00836974 4.84940342 0.04960558 0.01018838 Loglikelihood: 1700.91This time I defined the negative loglikelihood function to fit the model. This function takes as argument the model parameter. In this case we have 3 parameters to describe the quadratic relation between K and temperature, 1 parameter for the constant n0 and 2 parameters to describe the linear relation between r and temperature. The important thing to understand is what the last line is doing. Basically the dpois call return the probabilities to get the observed abundance values (“Abund”) based on the expected mean value (“lbd”). The LL function will return a single negative loglikelihood value (the sum) and the job of mle2 is to minimize this value. To do so mle2 will try many different parameter combination each time calling the LL function and when it reaches the minimum it will return the parameter set used. This is Maximum Likelihood Estimation (MLE). As before the model output gives us the estimated parameters, and we can use these to plot the fitted regressions:
cc < coef(mll2) #sorry for the awful coding, to get the model predicted values we need #to code the relation by hand which is rather clumsy ... datT$pred < melt(sapply(Temp,function(x) {logF(time=1:50,K=cc[1]+cc[2]*x+cc[3]*x**2,n0=cc[4],r=cc[5]+cc[6]*x))})[,3] ggplot(datT,aes(x=Time,y=Abund,color=factor(Temp)))+ geom_point()+geom_line(aes(y=pred)) ConclusionI showed three way by which you can include the effect of predictors on the parameters of some nonlinear regression: nlsList, mle2 with formula and mle2 with customized negative loglikelihood function. The first option is super easy and might be enough in many cases, while the last option though more complex gives you (almost) infinite freedom in the models that you can fit.
Happy nonlinear modelling!
Related Post
 Weather forecast with regression models – part 4
 Weather forecast with regression models – part 3
 Weather forecast with regression models – part 2
 Weather forecast with regression models – part 1
 Weighted Linear Support Vector Machine
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
12 Jobs for R users (20170619) – from all over the world
Just visit this link and post a new R job to the R community.
You can post a job for free (and there are also “featured job” options available for extra exposure).
Current R jobsJob seekers: please follow the links below to learn more and apply for your R job of interest:
Featured Jobs
 FullTime
 Data analyst – infectious disease mapping John Drake (University of Georgia) – Posted by John Drake
 Athens Georgia, United States
 16 Jun2017

 FullTime
 Kenya Product Innovations Analyst One Acre Fund – Posted by OAF
 Kakamega Kakamega County, Kenya
 14 Jun2017

 FullTime
 Software Developer One Acre Fund – Posted by OAF
 Jinja Eastern Region, Uganda
 8 Jun2017

 FullTime
 Senior Data Scientist Warner Bros. Entertainment Inc. – Posted by WBCareers
 Anywhere
 5 Jun2017

 FullTime
 Manager, Statistical Consulting & Data Science Warner Bros. Entertainment Inc. – Posted by WBCareers
 Burbank California, United States
 5 Jun2017

 FullTime
 Financial Controller RStudio, Inc. – Posted by tkawaf
 Anywhere
 5 Jun2017

 FullTime
 Software Engineer in Test for RStudio RStudio – Posted by agadrow
 Anywhere
 30 May 2017

 FullTime
 Data Scientist / Analytics Consultant Hedera Insights – Posted by HederaInsights
 Antwerpen Vlaanderen, Belgium
 18 May2017

 FullTime
 Data Analyst, Chicago Violence Reduction Strategy National Network for Safe Communities (NNSC) – Posted by nnsc
 Chicago Illinois, United States
 5 May 2017

 FullTime
 Data analyst – infectious disease mapping John Drake (University of Georgia) – Posted by John Drake
 Athens Georgia, United States
 16 Jun2017

 FullTime
 Kenya Product Innovations Analyst One Acre Fund – Posted by OAF
 Kakamega Kakamega County, Kenya
 14 Jun2017

 Freelance
 Development of HFT system Deuxus
 Anywhere
 13 Jun2017

 FullTime
 Research and Statistical Analyst – Housing @ London, England, U.K. Greater London Authority – Posted by jgleeson
 London England, United Kingdom
 8 Jun2017

 FullTime
 Data Scientist @ Garching bei München, Bayern, Germany Leibniz Supercomputing Centre – Posted by jamitzky
 Garching bei München Bayern, Germany
 8 Jun2017

 FullTime
 Software Developer One Acre Fund – Posted by OAF
 Jinja Eastern Region, Uganda
 8 Jun2017

 FullTime
 Senior Quantitative Analyst, Data Scientist Roshtech – Posted by bergerbesta
 Center District Israel
 7 Jun2017

 Freelance
 R data wrangler University of Gothenburg Sweden / University of Bristol UK – Posted by hanse
 Anywhere
 6 Jun2017

 FullTime
 Senior Data Scientist Warner Bros. Entertainment Inc. – Posted by WBCareers
 Anywhere
 5 Jun2017

 FullTime
 Manager, Statistical Consulting & Data Science Warner Bros. Entertainment Inc. – Posted by WBCareers
 BurbankCalifornia, United States
 5 Jun2017

 FullTime
 Financial Controller RStudio, Inc. – Posted by tkawaf
 Anywhere
 5 Jun2017

 Freelance
 Author Books on R for Packt Publishing Packt Publishing – Posted by Tushar G
 Anywhere
 1 Jun2017
In Rusers.com you can see all the R jobs that are currently available.
Rusers ResumesRusers also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.
(you may also look at previous R jobs posts).
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Ridge regression in R exercises
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
Bias vs Variance tradeoff is always encountered in applying supervised learning algorithms. Least squares regression provides a good fit for the training set but can suffer from high variance which lowers predictive ability. To counter this problem, we can regularize the beta coefficients by employing a penalization term. Ridge regression applies l2 penalty to the residual sum of squares. In contrast, LASSO regression, which was covered here previously, applies l1 penalty.
Using ridge regression, we can shrink the beta coefficients towards zero which would reduce variance at the cost of higher bias which can result in better predictive ability than least squares regression. In this exercise set we will use the glmnet package (package description: here) to implement ridge regression in R.
Answers to the exercises are available here.
Exercise 1
Load the lars package and the diabetes dataset (Efron, Hastie, Johnstone and Tibshirani (2003) “Least Angle Regression” (with discussion) Annals of Statistics). This is the same dataset from the LASSO exercise set and has patient level data on the progression of diabetes. Next, load the glmnet package that will that we will now use to implement ridge regression.
The dataset has three matrices x, x2 and y. x has a smaller set of independent variables while x2 contains the full set with quadratic and interaction terms. y is the dependent variable which is a quantitative measure of the progression of diabetes.
Generate separate scatterplots with the line of best fit for all the predictors in x with y on the vertical axis.
Regress y on the predictors in x using OLS. We will use this result as benchmark for comparison.
Exercise 2
Fit the ridge regression model using the glmnet function and plot the trace of the estimated coefficients against lambdas. Note that coefficients are shrunk closer to zero for higher values of lambda.
Exercise 3
Use the cv.glmnet function to get the cross validation curve and the value of lambda that minimizes the mean cross validation error.
Exercise 4
Using the minimum value of lambda from the previous exercise, get the estimated beta matrix. Note that coefficients are lower than least squares estimates.
Exercise 5
To get a more parsimonious model we can use a higher value of lambda that is within one standard error of the minimum. Use this value of lambda to get the beta coefficients. Note the shrinkage effect on the estimates.
 Avoid model overfitting using crossvalidation for optimal parameter selection
 Explore maximum margin methods such as best penalty of error term support vector machines with linear and nonlinear kernels.
 And much more
Exercise 6
Split the data randomly between a training set (80%) and test set (20%). We will use these to get the prediction standard error for least squares and ridge regression models.
Exercise 7
Fit the ridge regression model on the training and get the estimated beta coefficients for both the minimum lambda and the higher lambda within 1standard error of the minimum.
Exercise 8
Get predictions from the ridge regression model for the test set and calculate the prediction standard error. Do this for both the minimum lambda and the higher lambda within 1standard error of the minimum.
Exercise 9
Fit the least squares model on the training set.
Exercise 10
Get predictions from the least squares model for the test set and calculate the prediction standard error.
 LASSO regression in R exercises
 Evaluate your model with R Exercises
 Quantile Regression in R exercises
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
My set of packages for (daily) data analysis #rstats
(This article was first published on R – Strenge Jacke!, and kindly contributed to Rbloggers)
I started writing my first package as collection of various functions that I needed for (almost) daily work. Meanwhile, packages were growing and bit by bit I sourced out functions to put them into new packages. Although this means more work for CRAN members when they have more packages to manage on their network, from a userperspective it is much better if packages have a clear focus and a well defined set of functions. That’s why I now released a new package on CRAN, sjlabelled, which contains all functions that deal with labelled data. These functions use to live in the sjmiscpackage, where they now are deprecated and will be removed in a future update.
My aim is not only to provide packages with a clear focus, but also with a consistent design and philosophy, making it easier and more intuitive to use (see also here) – I prefer to follow the so called „tidyverse“approach here. It is still work in progress, but so far I think I’m on a good way…
So, what are the packages I use for (almost) daily work?
 sjlabelled – reading, writing and working with labelled data (especially since I collaborate a lot with people who use Stata or SPSS)
 sjmisc – data and variable transformation utilities (the complement to dplyr and tidyr, when it comes down from data frames to variables within the data wrangling process)
 sjstats – outofthebox statistics that is not already provided by base R or packages
 sjPlot – to quickly generate tables and plot
 ggeffects – to visualize regression models
The next step is creating cheat sheets for my packages. I think if you can map the scope and idea of your package (functions) on a cheat sheet, its focus is probably well defined.
Btw, if you also use some of the above packages more or less regularly, you can install the „strengejacke“package to load them all in one step. This package is not on CRAN, because its only purpose is to load other packages.
Disclaimer: Of course I use other packages everyday as well – this posting is just focussing on my packages that I created because I frequently needed these kind of functions.
Tagged: data visualization, ggplot, R, rstats, sjPlot, Statistik
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Strenge Jacke!. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
wrapr Implementation Update
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
IntroductionThe development version of our R helper function wrapr::let() has switched from stringbased substitution to abstract syntax tree based substitution (AST based subsitution, or language based substitution).
I am looking for some feedback from wrapr::let() users already doing substantial work with wrapr::let(). If you are already using wrapr::let() please test if the current development version of wrapr works with your code. If you run into problems: I apologize, and please file a GitHub issue.
The substitution modesThe development version of wrapr::let() now has three substitution implementations:
 Language substitution (subsMethod='langsubs' the new default). In this mode user code is captured as an abstract syntax tree (or parse tree) and substitution is performed only on nodes known to be symbols.
 String substitution (subsMethod='stringsubs', the CRAN current default). In this mode user code is captured as text and then string replacement on wordboundaries is used to substitute in variable remappings.
 Substitute substitution (subsMethod='subsubs', new to the development version). In this mode substitution is performed by R‘s own base::substitute().
The semantics of the three methods can be illustrated by showing the effects of substituting the variable name "y" for "X" in the somewhat complicated block of statements:
{ d < data.frame("X" = "X", X2 = "XX", d = X*X, .X = X_) X < list(X = d$X, X2 = d$"X", v1 = `X`, v2 = ` X`) }This block a lot of different examples and cornercases.
Language substitution (subsMethod='langsubs') library("wrapr") let( c(X = 'y'), { d < data.frame("X" = "X", X2 = "XX", d = X*X, .X = X_) X < list(X = d$X, X2 = d$"X", v1 = `X`, v2 = ` X`) }, eval = FALSE, subsMethod = 'langsubs') ## { ## d < data.frame(y = "X", X2 = "XX", d = y * y, .X = X_) ## y < list(y = d$y, X2 = d$y, v1 = y, v2 = ` X`) ## }Notice the substitution replaced all symbollike uses of "X", and only these (including some that were quoted!).
The reason I need more testing on this method is R language structures are fairly complicated, so there may be some corner cases we want additional coding for. For example treating d$"X" the same as d$X is in fact handled by some specialcase code.
String substitution (subsMethod='stringsubs') let( c(X = 'y'), { d < data.frame("X" = "X", X2 = "XX", d = X*X, .X = X_) X < list(X = d$X, X2 = d$"X", v1 = `X`, v2 = ` X`) }, eval = FALSE, subsMethod = 'stringsubs') ## expression({ ## d < data.frame(y = "y", X2 = "XX", d = y * y, .y = X_) ## y < list(y = d$y, X2 = d$y, v1 = y, v2 = ` y`) ## })Notice string substitution has a few flaws: it went after variable names that appeared to start with a wordboundary (the cases where the variable name started with a dot or a space). Substitution also occurred in some string constants (which as we have seen could be considered a good thing).
These situations are all avoidable as both the code inside the letblock and the substitution targets are chosen by the programmer, so they can be chosen to be simple and mutually consisitent. We suggest "ALL_CAPS" style substitution targets as they jump out as being macro targets. But, of course, it is better to have stricter control on substitution.
Think of the language substitution implementation as a lowerbound on a perfect implementation (cautious, with a few corner cases to get coverage) and string substitution as an upper bound on a perfect implementation (aggressive, with a few overreaches).
Substitute substitution (subsMethod='subsubs') let(c(X = 'y'), { d < data.frame("X" = "X", X2 = "XX", d = X*X, .X = X_) X < list(X = d$X, X2 = d$"X", v1 = `X`, v2 = ` X`) }, eval = FALSE, subsMethod = 'subsubs') ## { ## d < data.frame(X = "X", X2 = "XX", d = y * y, .X = X_) ## y < list(X = d$y, X2 = d$X, v1 = y, v2 = ` X`) ## }Notice base::substitute() doesn’t rewrite lefthandsides of argument bindings. This is why I originally didn’t consider using this implementation. Rewriting lefthandsides of assignments is critical in expressions such as dplyr::mutate( RESULTCOL = INPUTCOL + 1). Also base::substitute() doesn’t special case the d$"X" situation (which really isn’t that important).
Conclusionwrapr::let() when used prudently is already a safe and powerful tool. That being said it will be improved by switching to abstract syntax tree based substitution. I’d love to do that in the next release, so it would helpful to collect more experience using the current development version prior to that.
Appendix wrapr::let() for beginnersObviously wrapr::let() is not interesting if you have no idea why you would need it or how to use it.
wrapr::let() is designed to solve a single problem when programming in R: when in writing code you don’t yet know the name of a variable or a data.frame column you are going to use. That is: the name of the column will only be available later when your code is run.
For example suppose we have a data.frame "d" defined as follows:
d < data.frame(x = c(15, 30, 40)) print( d ) ## x ## 1 15 ## 2 30 ## 3 40If we know the name of the column we can access it as follows:
print( d$x ) ## [1] 15 30 40If we don’t know the name of the column (such as would be the case when writing a function) we write code like the following:
getColumn < function(df, columnName) { df[[columnName]] } print( getColumn(d, 'x') ) ## [1] 15 30 40This works because R takes a lot of trouble to supply parametric interfaces for most use cases.
We only run into trouble if code we are trying to work with strongly prefers nonparametric (also called nonstandardevaluation or NSE) interfaces.
A popular package that heavily emphasizes NSE interfaces is dplyr. It is the case that dplyr supplies its own methods to work around NSE issues ("underbar methods" and lazyeval for dplyr 0.5.0, and rlang/tidyeval for dplyr 0.7.0). However, we will only discuss wrapr::let() methodology here.
The issue is: it is common to write statements such as the following in dplyr:
suppressPackageStartupMessages(library("dplyr")) d %>% mutate( v2 = x + 1 ) ## x v2 ## 1 15 16 ## 2 30 31 ## 3 40 41If we were writing the above in a function it would plausible that we would not know the name of the desired result column or the name of the column to add one to. wrapr::let() lets us write such code easily:
library("wrapr") addOneToColumn < function(df, result_column_name, input_column_name) { let( c(RESULT_COL = result_column_name, INPUT_COL = input_column_name), df %>% mutate( RESULT_COL = INPUT_COL + 1 ) ) } d %>% addOneToColumn('v2', 'x') ## x v2 ## 1 15 16 ## 2 30 31 ## 3 40 41Again, writing the function addOneToColumn() was the goal. The issue was that in such a function we don’t know what column names the user is going to end up supplying. We work around this difficulty with wrapr::let()
All the user needs to keep in mind is: wrapr::let() takes two primary arguments:
 The aliases mapping concrete standin names (here shown in ALL_CAPS style) to the variables holding the actual variable names.
 The block of code to rewrite and execute (which can be a single statement, or a larger block with braces).
To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Sequential Fitting Strategies For Models of short RNA Sequencing Data
(This article was first published on R – Statistical Reflections of a Medical Doctor, and kindly contributed to Rbloggers)
After a (really long!) hiatus I am reactivating my statistical blog. The first article concerns the clarification of a point made in the manual of our recently published statistical model for short RNA sequencing data.
The background for this post, in case one wants to skip reading the manuscript (please do read it !), centers around the limitations of existing methods for the analysis of data for this very promising class of biomarkers. To overcome these limitations our group comprised from investigators from Division of Nephrology, University of New Mexico and the Galas Lab at Pacific Northwest Research Institute introduced a novel method for the analysis of short RNA sequencing (sRNAseq) data. This method (RNAseqGAMLSS), which was derived from first principles modeling of the short RNAseq process, was shown to have a number of desirable properties in an analysis of nearly 200 public and internal datasets:
 It can quantify the intrinsic, sequencing specific bias of sRNAseq from calibration, synthetic equimolar mixes of the target short RNAs (e.g. microRNAs)
 It can use such estimates to correct for the bias present in experimental samples of different composition and input than the calibration runs. This in turns opens the way for the use of short RNAseq measurements in personalized medicine applications (as explained here)
 Adapted to the problem of differential expression analysis, our method exhibited greater accuracy, higher sensitivity and specificity than six existing algorithms (DESeq2, edgeR, EBSeq, limma, DSS, voom) for the analysis of short RNAseq data.
 In contrast to these popular methods which force the expression profiles to have a certain form of symmetry (equal number and magnitude of overexpressed and underexpressed sequences), our method can be used to discover global, directional changes in expression profiles which are missed by the aforementioned methods. Accounting for such a possibility may be appropriate in certain instances, in which the disease process leads to loss or gain in the number of cells of origin of the affected organ/tissue.
The proposed methodology which is based on Generalized Additive Models for Location, Scale and Shape (GAMLSS) involves the fitting of simultaneous regression models for the location (mean) and the scale (dispersion) of sequence counts using either the Negative Binomial or a particular parameterization of the Normal distribution. However there is price to pay for the advantages of RNASeqGAMLSS over alternatives: this comes in the form of a small (but not infinitesimal) probability[1] that the fitting algorithm will execute successfully. In the manual of our method (Section 6.4) we explain that a numerically more stable way of fitting these complex models exists and should be adapted if one encounters numerical optimization errors with the default approach used in the Nucleic Acids Research (NAR) manuscript. The three steps of this sequential fitting strategy are as follows:
 One fits a Poisson count mixed model to the RNAseq data, to obtain estimates of the relative mean expression for the different short RNA species in the expression profile
 These estimates are used to fix the values of the mean parameter model of the RNASeqGAMLSS model while estimating the values of the dispersion parameters.
 Finally, one uses the values of the mean (Step 1) and dispersion (Step 2) parameters to fit the general RNASeqGAMLSS model
In essence one ignores the overdispersion (additional variability) of the short RNAseq data (Step 1) to guide the algorithm into estimates of the dispersion parameters (Step 2). Finally one uses the separate estimates of the mean (Step 1) and dispersion (Step 2) parameters as an initial point for the simultaneous estimation of both (Step 3). The reason that this approach works is because the dispersion parameters do not affect the mean parameters, so that the Poisson distribution of Step 1 has the same mean as the RNASeqGAMLSS model. Hence the estimates produced by this Step are identical (to the limit of numerical precision) to those that would have been produced by a successful execution of the RNASeqGAMLSS optimization algorithms. Fixing these values when fitting the RNASeqGAMLSS model in Step 2 facilitates estimation of the dispersion parameters. Having very good initial guesses for these parameters virtually guarantees convergence of the 3rd Step (which is the only step in the NAR paper).
A fully worked out example is shown below (note that the data used in the NAR paper, source code, manual that includes instructions to compile the source code of the RNASeqGAMLSS and Win64 DLL libraries are all available in the BitBucket repository for this project)
First, we load the data, the C++ libraries and extract the data to the two groups we would like to compare :
library(TMB) ## the TMB framework for optimizati library(lme4) ## load the DE C++ libraries dyn.load(dynlib("LQNO_DE")) ## Note about the form of data storage for use with this software ##================================================================================ ## the long format should be employed when storing microRNA data for ## GAMLSS type of analyses: in this format, the data are arranged in columns: ##  column miRNA : yields the name of the miRNA ##  column reads : reads of the miRNA from a *single sample* ##  column SampleID: the sample the reads were recorded at ## this is called long format, because data from experiments are stored in ## different rows (rather than appearing as consecutive columns) ##================================================================================ ## lads the data from the 286 series load("286.RData") ## Obtain data for GAMLSS  we will compare the two ratiometric series datRat<subset(dat286.long,(Series=="RatioB"  Series =="RatioA") & Amount=="100 fmoles") datRat$SampleID<factor(datRat$SampleID) datRat$Series<factor(datRat$Series) ## STEP 0: PREPARE THE DATA FOR THE RNASeqGAMLSS FIT u_X<as.numeric(factor(datRat$miRNA)) ## maps readings to the identity of the miRNA u_G<as.numeric(factor(datRat$Series)) ## maps counts to group y=datRat$reads ## extract the actual counts X<model.matrix(~Series,data=datRat) ## design matrix (ANOVA) for group comparisonsSecondly, we fit the Poisson model (Step 1), using the facilities of the lme4 R package:
## STEP 1: USE A POISSON MODEL TO OBTAIN ESTIMATES FOR THE MU SUBMODEL ##========================================================================================== ## fit the parameters for the mu submodel using the poisson GLM gl<glmer(reads~Series+(0+SeriesmiRNA),data=datRat,family="poisson")Then we extract the values of these parameters and used them to fix the values of the mean submodel (Step 2):
## STEP 2: USE THE MU MODEL ESTIMATES TO FIT THE PHI SUBMODEL ##========================================================================================== ## initializes standard deviation of RE for the mu submodel sigmu=sqrt(diag((summary(gl)[["varcor"]])[[1]])) sigsig=rep(1,max(u_G)) ## initializes standard deviation of RE for the phi submodel b=fixef(gl) ## initial values for the overall group means (mean submodel) ## initial values for the variation of miRNAs around their group mean (mean submodel) u_m=as.matrix(ranef(gl)$miRNA) ## Very rough initial values for the phi submodel parameters s_b=rep(0,ncol(X)) ## initial values for the overall group means (phi submodel) ## initial values for the variation of miRNAs around their group mean (phi submodel) u_s= matrix(0,max(u_X),max(u_G)) ## MAP list that allow us to fix some parameters to their values MAP<NULL MAP[["b"]]<factor(rep(NA,length(b))) MAP[["u_m"]]<factor(rep(NA,length(c(u_m)))) MAP[["sigmu"]]<factor(rep(NA,length(sigmu))) ## construct the AD object  note that we fix the mu at their values while estimating the ## phi submodel obj.TMB<MakeADFun(data=list(y=y,X=X,u_X=u_X,u_G=u_G), parameters=list(b=b,s_b=s_b,u_m=u_m,u_s=u_s, sigmu=sigmu,sigsig=sigsig), DLL="LQNO_DE",random=c("u_s"),hessian=FALSE,silent=TRUE, method="BFGS",random.start=expression(last.par.best[random]), ADReport=TRUE,map=MAP) ## parameter estimation  note errors may be generated during some iterations f.TMB<nlminb(obj.TMB$par,obj.TMB$fn,obj.TMB$gr, control=list(eval.max=10000,iter.max=10000),lower=30,upper=30) ## obtain the report on the parameters to extract the fitted values of the gamlss model rep<sdreport(obj.TMB) u_s = matrix(summary(rep,"random",p.value=FALSE)[,1],ncol=max(u_G)) dummy<summary(rep,"fixed",p.value=FALSE)[,1] ## parameter estimates s_b=dummy[1:max(u_G)] sigsig=dummy[(1:max(u_G))]Finally, we refit the model letting all parameters vary:
## STEP 3: REFIT THE MODEL WITHOUT FIXING ANY PARAMETERS ##========================================================================================== obj.TMB<MakeADFun(data=list(y=y,X=X,u_X=u_X,u_G=u_G), parameters=list(b=b,s_b=s_b,u_m=u_m,u_s=u_s, sigmu=sigmu,sigsig=sigsig), DLL="LQNO_DE",random=c("u_m","u_s"),hessian=TRUE,silent=TRUE, method="BFGS",random.start=expression(last.par.best[random]), ADReport=TRUE) ## scale objective by the magnitude of the deviance of the fitted Poisson model f.TMB<nlminb(obj.TMB$par,obj.TMB$fn,obj.TMB$gr, control=list(eval.max=10000,iter.max=10000,scale=deviance(gl)),lower=30,upper=30) ## obtain the report on the parameters rep<sdreport(obj.TMB) ## differential expression ratios, standard errors z and pvalues gamlssAD<summary(rep,"report",p.value=TRUE)[1:nlevels(datRat$miRNA),] rownames(gamlssAD)<levels(datRat$miRNA) ## rownames are the miRNAs; columns are estimates, standard error, z value and pvalue ## the final estimates with their standard errors head(gamlssAD)These steps (and the RNASeqGAMLSS code) is going to be incorporated in an upcoming Bioconductor package for the analysis of short RNA sequencing data by Dr Lorena Pantano PhD. Until this package becomes available, the aforementioned code snippets may adapted very easily to one’s application by suitable adaptations of the code (i.e. the names of the columns corresponding to the RNA species, sample identifiers and experimental groups).
1. This is particularly likely when the underlying software implementations are not compiled against the Intel®Math Kernel Libraries.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Reflections of a Medical Doctor. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Quantifying the magnitude of a population decline with Bayesian timeseries modelling
(This article was first published on Bluecology blog, and kindly contributed to Rbloggers)
Quantifying the magnitude of a population decline with Bayesian timeseries modellingPopulation abundances tend to vary year to year. This variation can make
it make it hard detect a change and hard to quantify exactly what that
change is.
Bayesian timeseries analysis can help us quantify a decline and put
uncertainty bounds on it too. Here I will use the RINLA
package to fit a timeseries model to a
population decline.
For instance, take the pictured timeseries. Quantifying change as the
difference between the first and last timepoints is obviously
misleading. Doing so would imply that abundance has declined by 77% from
the historical value.
Another approach would be to compare the average of the first and last
decades. Doing so would yield a 72% decline.
A better way might be to model the population trend over time and then
estimate our change from the model. An advantage of a model is we can be
more specific (and transparent) about assumptions. Using Bayesian
timeseries analysis we can also pick a model that is appropriate for
the distribution of the data. e.g. here I will use negative
binomial
errors, because the abundances are counts.
Let’s simulate a simple timeseries of counts that declines at a
constant (exponential) rate:
The results are pictured above. In short we calculate eta which is the
linear predictor. Its exponent determines the mean of the negative
binomial process, hence the exponential (not linear) decline. The
size = 5 means that the count data will be moderately overdispersed –
that is quite noisy (the variance will be > than the mean).
Notice that I put a bit of a ‘kink’ in eta by having the rate of decline
(acoef), increase halfway through the timeseries like this:
We can calculate the magnitude of change like this: 1  (y[n]/y[1]) =
0.77 which can be interpreted as what fraction of the intial years
biomass has been lost?
What we want to do is put a smoother over time, so our estimate accounts
for shortterm variation. We also want to get credibility intervals (95%
probability intervals) on the estimate of decline.
RINLA is a pretty handy package. It is a very
fast
way to fit generalized linear models and can also handle a huge range of
different types of random
effects. Here we will use
an autoregressive lag one process. That is saying that abundance at
time t depends on abundance at time t1 with some constant correlation
coefficient.
We could also include other covariates in the model, for instance,
abundance of a predator that might eat our organism of interest.
However, we won’t go into that here.
First up, we specify the formula in INLA’s notation:
library(INLA) f1 < y ~ 1 + f(z, model = "ar1")Which just says model y as a function of an intercept term (constant
mean) + an autoregressive process that depends on z (just an index for
the year). The f() is INLA’s standard notation for modelling its
different random effects models (AKA latent models), here we chose the
ar1 option.
Now the trick is to get INLA to compute the % difference between the
expectation for the first and last time points as it goes. That way
we not only get an estimate of the change but also we will get the full
posterior distribution, so we can get our CIs. Note I put emphasis on
expectation, because we won’t simply be calculating the difference
between the first and last points (we did that above), but will in fact
be calculating the difference between the model’s estimated mean values
for the first and last points.
To do that we use INLA’s linear
combination
feature.
This feature let’s us compute a linear combination of the estimated
covariates. Here we will ask for the difference between the first and
last values of z (the timeseries). inla will multiply the z
values by the numbers we give below (here 1 and 1), then sum them
together.
The row of NA in the middle just says to ignore the middle points.
Note that we have also ignored the intercept. More on this in a moment.
Now we can fit the model, using negative binomial errors and specifying
our linear combination:
And the results summary can be obtained (not shown here):
summary(m1)Note the large estimate for rho, which is the autocorrelation
parameter. That happens because we have such a strong decline.
We could have also fit just a linear trend to our data, however, because
the rate varies over time (and we wouldn’t know that beforehand if we
had real data, not data we made up), it is nice to use ar1 which has a
kind of smoothing effect.
Also worth noting that modelling an trending (nonstationary)
timeseries with an ar1 is not technically correct if we want to
estimate rho. However, here we use ar1 because it acts like a smoother
in a GAM [e.g. see
here.
It is pretty straightforward to plot INLA’s predictions for the
timeseries. They are all stored in the m1 object:
Now we can extract the change in population size from our linear
combination:
However, this is not quite the number we want. It is the change in eta
or the linear predictor. Negative binomial models have a loglink, so
that is the change in logspace. We want the real change. If we
calculate:
we get the change as a % from the historical mean The reason being is
that the linear combination is log(eta[n])  log(eta[1]). Taking the
exponent gets the change and one minus that is the % loss.
It is easy now to get the CIs on the linear combination:
m1$summary.lincomb.derived ## ID mean sd 0.025quant 0.5quant 0.975quant mode kld ## lc1 1 1.081196 0.309715 1.695217 1.08053 0.4716486 1.079157 0We could also look at the marginal and plot the posterior distribution
of the % loss:
To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Beware the Argument: The Flint Water Crisis and Quantiles
(This article was first published on Theory meets practice..., and kindly contributed to Rbloggers)
AbstractIf your tap water suddenly becomes brown while authorities claim everything is okay, you start to worry. LangkjærBain (2017) tells the Flint Water Crisis story from a statistical viewpoint: essentially the interest is in whether the 90th percentile in a sample of lead concentration measurements is above a certain threshold or not. We illustrate how to perform the necessary calculations with R’s quantile function and show that the typeargument of the function matters.
This work is licensed under a Creative Commons AttributionShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from .
IntroductionIn a recent Significance article, LangkjærBain (2017) tells the story about the Flint water crisis. In 2014 the city of Flint, Michigan, USA, decided to change its water supply to Flint River. Due to insufficient treatment of the water with corrosion inhibitors the lead concentration in the drinking water increased, because lead in the aging pipes leached into the water supply. This created serious health problems – as explained in this short video summary. In this blog post we investigate further the computation of the 90th percentile of the tap water lead concentration samples described in LangkjærBain (2017). Quantile estimation in this context has already been discussed in a recent blog entry entitled Quantiles and the Flint water crisis by Rick Wicklin.
The monitoring of drinking water in the US is regulated by the Lead and Copper Rule of the United States Environmental Protection Agency. The entire text of the rule is available as electronic code of federal regulation (eCFR) Title 40: Protection of Environment, Part 141 – National Primary Drinking Water Regulations. In particular the regulation defines a sampling plan for collecting tap water samples. The size of the sample depends on the number of people the water system serves. In case this number is bigger than 100,000 a sample of 100 sites is needed. If there are 10,001100,000 people served, then a sample from 60 sites is needed. For systems serving fewer than 10,000 sizes of 40, 20, 10 and 5 are defined – see §141.86(c) of the rule for details. Of interest for this blog post is that action needs to be taken, if too many of the samples are above a given threshold of 15 part per billion (ppb):
Figure: Excerpt of CFR 40 §141.80 with the quantile calculation highlighted in yellow.
We note the explicit duality in the CFR between the in more than 10% and the 90% quantile in the text. However, we also note that it is not clear how to proceed, if the number calculated in (c)(3)(ii) is not an integer. This is not a problem per se, because the CFR itself only operates with samples sizes 100, 60, 40, 20, 10 and 5, where 0.9 times the sample size always gives an integer number. But if one for some reason does not obtain exactly the required number this quickly can become an issue as we shall see below.
The Flint DataThe data of the spring 2015 Flint water supply monitoring conducted by the Michigan Department of Environmental Quality are presented in the figure on p. 20 of LangkjærBain (2017). Alternatively, they can also be taken directly from the Blog entry Quantiles and the Flint water crisis by Rick Wicklin.
##Read flint water monitoring data (the column lead is the measurement) flint < read.csv(file=file.path(filePath,"flint.csv")) ##Sort the measured lead values in ascending order lead < sort(flint$lead) ##Number of observations n < length(lead)The proportion of observations above the 15 ppb threshold is
mean(lead > 15) ## [1] 0.1126761In other words, the proportion 11.3% is above the 10% action threshold and, hence, something needs to be done. However, as LangkjærBain (2017) describes, the story is a little more complicated, because two of the values above 15 were removed with the argument that they originated from sites which are not covered by the sampling plan. Only private households at high risk, i.e. with lead pipelines, are supposed to be sampled. As one can read in the article the removal is highly controversial, in particular, because the proportion of critical observations falls below the 10% action threshold when these two values are removed. For this blog entry, we will, however, work with the full \(n=71\) sample and focus on the quantile aspect of the calculation.
Percentages and QuantilesLet \(n\) denote the size of the selected sample, which in our case is \(n=71\). If more than 10% of the sample is to be above 15 ppb, this means that \(\lfloor 0.1\cdot n + 1\rfloor\) of the samples need to be above 15 ppb, where \(\lfloor y \rfloor\) denotes the largest integer less than or equal to \(y\). We shall denote this the number of critical samples. If we divide this number by \(n\) we get the actual proportion of critical samples needed before action. It is worthwhile noting the difference between this critical proportion and the 10% threshold illustrated by the sawtooth curve in the figure below. The explanation for these sawtooth stepspikes is the discreteness of the decision problem (i.e. \(x\) out of \(n\)).
Turning to the equivalent 90% quantile is above 15 ppm decision criterion of the CFR, one will need to determine the 90% quantile from the finite sample of lead concentration measurements. How to estimate quantiles with statistical software is discussed in the excellent survey of Hyndman and Fan (1996). In their work they describe the nine different quantile estimators implemented in R’s quantile function. The estimators are all based on the order statistic of the sample, i.e. let \(x_{(1)} \leq x_{(2)} < \cdots \leq x_{(n)}\) denote the ordered lead concentration measurements. Then the simplest estimator for the \(p\cdot 100\%\) quantile is
\[
\hat{x}_p = \min_{k} \left\{\hat{F}(x_{(k)}) \geq p\right\} =
x_{(\lceil n \cdot p\rceil)},
\]
where \(\hat{F}\) is the empirical cumulative distribution function of the sample and \(\lceil y \rceil\) denotes the smallest integer greater than or equal to \(y\). This method corresponds to R’s type=1. For our application \(p\) would be 0.9 and the 90th percentil for the Flint data is thus
c("manual.90%"=lead[ceiling(n*p)], R=quantile(lead, probs=p, type=1)) ## manual.90% R.90% ## 18 18which is above the action threshold of 15 ppb. It is also important to note that when \(x_{(\lceil n \cdot 0.9\rceil)} > 15\> \text{ppb}\) then a total \(n – \lceil n \cdot 0.9\rceil + 1\) samples are above 15 ppm. In other words the proportion of samples above 15 ppm in this situation is \[ \frac{n – \lceil n \cdot 0.9\rceil + 1}{n}.\] To show the duality between the more than 10% critical samples and the 90% quantile being above 15 ppm we thus need to show that \((n – \lceil n \cdot 0.9\rceil + 1)/n = (\lfloor 0.1\cdot n + 1\rfloor)/n\). This is possible since the following relations hold for the floor and ceiling functions: \[ \begin{align*}
– \lceil y \rceil &= \lfloor y \rfloor \quad \text{and} \\
n + \lfloor y \rfloor &= \lfloor n + y \rfloor,
\end{align*}
\] with \(n\) being an integer and \(y\) any real number. Thus, \[
(n+1) – \lceil n \cdot 0.9\rceil = (n+1) + \lfloor n \cdot 0.9\rfloor =
\lfloor (n+1) – n \cdot 0.9\rfloor = \lfloor 0.1n+1 \rfloor.
\]
Important conclusion: We have thus shown that with the type=1 quantile method we have the duality between having more than 10% critical samples and the 90th percentile of the measurements being above 15 ppm.
Other Quantile TypesSince \(\hat{F}\) has jumps of size \(1/n\) the actual value of \(\hat{F}(\hat{x}_{p})\) can end up being somewhat larger than the desired \(p\). Therefore, Hyndman and Fan (1996) prefer estimators interpolating between two adjacent order statistics. Also because such estimators have a lower mean squared error in most cases (Dielman, Lowry, and Pfaffenberger 1994). As an example of such an extended estimator, the type=5 quantile estimator is defined by letting \(h=p n + 1/2\) and then computing: \[
\hat{x}_p = x_{\lfloor h \rfloor} + (h – \lfloor h \rfloor) (x_{\lfloor h
\rfloor + 1} – x_{\lfloor h \rfloor}).
\]
Doing this either manually or using the quantile function one obtains:
## Small function to manually compute the type=5 p*100th quantile ## for the sample x manual_quantile_5 < function(x, p) { h < length(x)*p + 0.5 x < sort(x) return(x[floor(h)] + (h  floor(h))* (x[floor(h)+1]  x[floor(h)])) } c("manual.90%"=manual_quantile_5(lead, p=0.9), R=quantile(lead, probs=0.9,type=5)) ## manual.90% R.90% ## 18.8 18.8Instead of reading the above or using the R code one can also instead watch a more didactic whiteboard explanation by Professor Christopher Gardiner for Michigan Radio on how to calculate the 90% quantile using a type=5 argument for the Flint sample. However, the important point of the above calculations is that this quantile type is of limited interest, because the Lead and Copper Rule implicitly defines that one has to use the type=1 quantile. To make this point even more explicit, we use sampling with replacement from the Flint data to construct a dataset, where the 90% type=5quantile is above 15 ppm, but the percentage of samples above the 15 ppm threshold is less than 10%.
##Function to compute the proportion critical as well as the 90% quantile ##using type (type)quantiles. Returns the quantile as well as the proportion ##above the 15 ppm threshold prop_critical_and_q90 < function(x, type=5) { q90 < as.numeric(quantile(x, type=type,probs=p)) prop < mean(x>15) c(q90=q90,prop= prop) } ##Make 100 datasets by sampling with replacement r < data.frame(seed=seq_len(100)) %>% rowwise %>% do({ set.seed(.$seed) newdata < sample(lead, replace=TRUE) as.data.frame(seed=.$seed, t(prop_critical_and_q90(newdata))) }) ##Check which datasets violate the duality between quantile and ##percentage above threshold assumption r %<>% mutate(violates_duality = q90 > 15 & prop < 0.1) ##Do the stats for this dataset (five < r %>% filter(violates_duality) %>% slice(1:5)) ## # A tibble: 5 × 3 ## q90 prop violates_duality ## ## 1 15.0 0.09859155 TRUE ## 2 15.0 0.09859155 TRUE ## 3 16.2 0.09859155 TRUE ## 4 15.0 0.09859155 TRUE ## 5 15.8 0.09859155 TRUEWe note that some of the lines in the above output are artifacts of lacking numerical precision: the quantile is only above 15 due to numerical imprecision in the calculation of the type=5 quantile:
print(five$q90, digits=20) ## [1] 15.000000000000028422 15.000000000000028422 16.200000000000045475 ## [4] 15.000000000000028422 15.800000000000039790This shows that regulatory business with floating point arithmetic is tricky. As a step towards fixing the problem, one could redefine the greater and less than operators, respectively, to only compare up to numerical precision:
##Function to do numerical safe greater than comparision "%greater%" < function(x,y) { equal_up_to_numerical_precision < isTRUE(all.equal(x,y)) return( (x > y) & !(equal_up_to_numerical_precision) ) } ##Function to do numerical safe less than comparision "%less%" < function(x,y) { equal_up_to_numerical_precision < isTRUE(all.equal(x,y)) return( (x < y) & !(equal_up_to_numerical_precision) ) } ##Add the new column, which does < and > comparisons only up to ##numerical precision r %<>% mutate(violates_duality_numsafe = (q90 %greater% 15) & (prop %less% 0.1)) ##Show five violation candidates for this corrected dataset (five < r %>% filter(violates_duality_numsafe) %>% slice(1:5)) ## # A tibble: 5 × 4 ## q90 prop violates_duality violates_duality_numsafe ## ## 1 16.2 0.09859155 TRUE TRUE ## 2 15.8 0.09859155 TRUE TRUE ## 3 15.8 0.09859155 TRUE TRUE ## 4 16.2 0.09859155 TRUE TRUE ## 5 16.2 0.09859155 TRUE TRUE DiscussionTo summarize the findings: The type of quantile estimation used in practice matters. It is not clear what to do, if \(0.9\cdot n\) is not integer in the estimation of the 90% quantile under the Lead and Copper Rule. For the Flint example the 63’rd sorted value is 13 which is below threshold, whereas the 64’th value is 18 which is above the threshold. If we use type=1 then \(\lceil 71\cdot 0.9\rceil=64\) would be the correct value to take and the 90% quantile of the sample would be estimated to be 18 ppb. This means that the 19 ppb vertical line in the figure of LangkjærBain (2017) is a little misleading, because this appears to be the rounded type=5 quantile. For the setting with \(n=71\) samples, both estimators are although above the action threshold of 15 ppb, so in the particular Flint application it does not matter so much which method to take. However, in other settings this might very well make a difference! So be careful with the type argument of the quantile function. As an example, the nine different types of R’s quantile function provide estimates for the 90% quantile in the range from 17.50 to 19.60 for the Flint data.
On another note, one can discuss if it is a good idea to rely on the type=1 quantile estimator in the rule, because it is well known that this type does not have as good estimation properties as, e.g., type=5. However, type=1 is simpler to compute, ensures duality with the intuitive critical proportion, and has the property that the obtained value is always one also occurring in the sample. The later thus avoids the issue of numerical instability.
Finally, the blog post by Rick Wicklin addresses quantile estimation from an even more statistical viewpoint by computing confidence intervals for the quantile – a topic, which has been previously treated theoretically in this blog. Compliance to a given quantile threshold based on samples has also been treated in the entirely different context of digital elevation models (Höhle and Höhle 2009). Typically, tests and the dual confidence intervals are in this regulation setting formulated in a reversed way, such that one needs to provide enough evidence to show that underlying 90th percentile is indeed below 15 ppm beyond reasonable doubt. An interesting question in this context is how large the sample needs to be in order to do this with a given certainty – see Höhle and Höhle (2009) for details. It is, however, worthwhile pointing out that the Lead and Copper Rule does not know about confidence intervals. Currently, estimation uncertainty is only treated implicitly by specifying sample size as a function of number of people served by the water system and then hardthresholding the result at 15 ppm.
On a Personal Note: If you want more details on the use of confidence intervals for quantiles, join my 5 minute lightning talk on 6th of July at the useR!2017 conference in Brussels.
Photo is copyright George Thomas, available under a CC BYNCND 2.0 license.
Acknowledgments
Thanks goes to Brian Tarran, editor of the Significance Magazine, for providing additional information about the quantile computation of the LangkjærBain (2017) article and for pointing out the Gardiner video.
ReferencesDielman, T., C. Lowry, and R. Pfaffenberger. 1994. “A Comparison of Quantile Estimators.” Communications in Statistics – Simulation and Computation 23 (2): 355–71. doi:10.1080/03610919408813175.
Höhle, J., and M. Höhle. 2009. “Accuracy Assessment of Digital Elevation Models by Means of Robust Statistical Methods.” ISPRS Journal of Photogrammetry and Remote Sensing 64 (4): 398–406. doi:10.1016/j.isprsjprs.2009.02.003.
Hyndman, R. J., and Y. Fan. 1996. “Sample Quantiles in Statistical Packages.” American Statistician 50 (4): 361–65. doi:10.2307/2684934.
LangkjærBain, Robert. 2017. “The Murky Tale of Flint’s Deceptive Water Data.” Significance 14 (2): 16–21. doi:10.1111/j.17409713.2017.01016.x.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice.... Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Partial least squares in R
(This article was first published on poissonisfish, and kindly contributed to Rbloggers)
Blood serum samples. From: https://health.onehowto.com
My last entry introduces principal component analysis (PCA), one of many unsupervised learning tools. I concluded the post with a demonstration of principal component regression (PCR), which essentially is a ordinary least squares (OLS) fit using the first principal components (PCs) from the predictors. This brings about many advantages:
 There is virtually no limit for the number of predictors. PCA will perform the decomposition no matter how many variables you handle. Simpler models such as OLS do not cope with more predictors than observations (i.e. n " title="p > n " class="latex" />).
 Correlated or noisy predictors do not undermine the regression fit. Collinearity is a problem for OLS, by widening the solution space, i.e. destabilizing coefficient estimation. PCA will always produce few uncorrelated PCs from a set of variables, correlated or not. On the other hand, and to some extent, uninformative predictors will not affect performance.
 The PCs carry the maximum amount of variance possible. In any predictive model, predictors with zero or nearzero variance often constitute a problem and behave as second intercepts. In the process of compression, PCA will find the projections where your data points spread out the most, thus facilitating learning from the subsequent OLS.
However, in many cases it is much wiser performing a decomposition similar to the PCA, yet constructing PCs that best explain the response, be it quantitative (regression) or qualitative (classification). This is the concept of partial least squares (PLS), whose PCs are more often designated latent variables (LVs), although in my understanding the two terms can be used interchangeably.
PLS safeguards advantages 1. and 2. and does not necessarily follow 3. In addition, it follows that if you use the same number of LVs and predictors, you are going to get exactly a OLS fit – each predictor gets its own LV, so it does not help much. PLS (regression) and PLS followed by discriminant analysis (PLSDA, classification) are tremendously useful in predictive modelling. They are adequate in a wide variety of experimental designs and linear in their parameters, therefore more easily interpretable.
Today we will perform PLSDA on the Arcene data set hosted at the UCI Machine Learning Repository that comprises 100 observations and 10,000 explanatory variables () in order to diagnose cancer from serum samples. From the 10,000 features, 7,000 comprise distinct mass spectrometry (MS) peaks, each determining the levels of a protein. For some reason the contributors added 3,000 random variables, probably to test robustness against noise (check more information in the link above).
Let’s get started with RFor predictive modelling I always use the caret package, which builds up on existing model packages and employs a single syntax. In addition, caret features countless functions that deal with preprocessing, imputation, summaries, plotting and much more we will see firsthand shortly.
For some reason there is an empty column sticking with the data set upon loading, so I have added the colClasses argument to get rid of it. Once you load the data set take a look and note that the levels of most proteins are rightskewed, which can be a problem. However, the authors of this study conducted an exhaustive preprocessing and looking into this would take far too long. The cancer / nocancer labels (coded as 1 / 1) are stored in a different file, so we can directly append it to the full data set and later use the formula syntax for training the models.
# Load caret, install if necessary library(caret) arcene < read.table("http://archive.ics.uci.edu/ml/machinelearningdatabases/arcene/ARCENE/arcene_train.data", sep = " ", colClasses = c(rep("numeric", 10000), "NULL")) # Add the labels as an additional column arcene$class < factor(scan("https://archive.ics.uci.edu/ml/machinelearningdatabases/arcene/ARCENE/arcene_train.labels", sep = "\t"))We can finally search missing values (NA’s). If these were present it would require some form of imputation – that is not the case. Calling any(is.na(arcene)) should prompt a FALSE.
So now the big questions are:
 How accurately can we predict whether a patient is sick based of the MS profile of his/her blood serum?
 Which proteins / MS peaks best discriminate sick and healthy patients?
Time to get started with the function train. With respect to preprocessing we will remove zerovariance predictors and center and scale all those remaining, in this exact order using the preProc argument. Considering the size of the sample (n = 100), I will pick a 10x repeated 5fold crossvalidation (CV) – a large number of repetitions compensates for the high variance stemming from a reduced number of folds – rendering a total of 50 estimates of accuracy. Many experts advise in favor of simpler models when performance is undistinguishable, so we will apply the “one SE” rule that selects the least complex model with the average crossvalidated accuracy within one standard error (SE) from that in the optimal model. Finally, we must define a range of values for the tuning parameter, the number of LVs. In this particular case we will consider all models with 120 LVs.
# Compile crossvalidation settings set.seed(100) myfolds < createMultiFolds(arcene$class, k = 5, times = 10) control < trainControl("repeatedcv", index = myfolds, selectionFunction = "oneSE") # Train PLS model mod1 < train(class ~ ., data = arcene, method = "pls", metric = "Accuracy", tuneLength = 20, trControl = control, preProc = c("zv","center","scale")) # Check CV profile plot(mod1)This figure depicts the CV profile, where we can learn the average accuracy (yaxis, %) obtained from models trained with different numbers of LVs (xaxis). Note the swift change in accuracy among models with 15 LVs. Although the model with six LVs had the highest average accuracy, calling mod1 in the console will show you that, because of the “one SE” rule, the selected model has five LVs (accuracy of 80.6%).
Now we will compare our PLSDA to the classifier homolog of PCR – linear discriminant analysis (LDA) following a PCA reductivestep (PCADA, if such thing exists). Note that the PCA preprocessing is also set in the preProc argument. We can also try some more complicated models, say, random forests (RF). I will not go into details about the workings and parameterisation of RF, as for today it will be purely illustrative. Note that RF will comparatively take very long (~15min in my 8Gb MacBook Pro from 2012). As usual, a simple object call will show you the summary of the corresponding CV.
# PCADA mod2 < train(class ~ ., data = arcene, method = "lda", metric = "Accuracy", trControl = control, preProc = c("zv","center","scale","pca")) # RF mod3 < train(class ~ ., data = arcene, method = "ranger", metric = "Accuracy", trControl = control, tuneGrid = data.frame(mtry = seq(10,.5*ncol(arcene),length.out = 6)), preProc = c("zv","center","scale"))Finally, we can compare PLSDA, PCADA and RF with respect to accuracy. We will compile the three models using caret::resamples, borrowing the plotting capabilities of ggplot2 to compare the 50 accuracy estimates from the optimal crossvalidated model in the three cases.
# Compile models and compare performance models < resamples(list("PLSDA" = mod1, "PCADA" = mod2, "RF" = mod3)) bwplot(models, metric = "Accuracy")It is clear that the long RF run did not translate into a excelling performance, quite the opposite. Although in average all three models have similar performances, the RF displays a much larger variance in accuracy, which is of course a concern if we seek a robust model. In this case PLSDA and PCADA exhibit the best performance (6395% accuracy) and either model would do well in diagnosing cancer in new serum samples.
To conclude, we will determine the ten proteins that best diagnose cancer using the variable importance in the projection (ViP), from both the PLSDA and PCADA. With varImp, this is given in relative levels (scaled to the range 0100).
plot(varImp(mod1), 10, main = "PLSDA") plot(varImp(mod2), 10, main = "PCADA")Very different figures, right? Idiosyncrasies aside, how many PCs did the PCADA use? By calling mod2$preProcess, we learn that “PCA needed 82 components to capture 95 percent of the variance”. This is because the PCA preprocessing functionality in caret uses, by default, as many PCs as it takes to cover 95% of the data variance. So, in one hand we have a PLSDA with five LVs, on the other hand a PCADA with 82 PCs. This not only makes PCADA cheaply complicated, but arcane at the interpretation level.
The PLSDA ViP plot above clearly distinguishes V1184 from all other proteins. This could be an interesting cancer biomarker. Of course, many other tests and models must be conducted in order to provide a reliable diagnostic tool. My aim here is only to introduce PLS and PLSDA.
And that’s it for today, I hope you enjoyed. Please drop me a comment or message if you have any suggestion for the next post. Cheers!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: poissonisfish. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Superstorm Sandy at the Delaware Estuary Revisited
(This article was first published on AdventuresInData, and kindly contributed to Rbloggers)
Continuing on the theme of updating older visualizations into newer formats, below is a clip of the impact of Superstorm Sandy on Delaware Estuary water surface elevations. The original MS Excel and screen capture version is posted here. The new version developed in R with the animation package is cleaner and more portable.
The graph shows measured and predicted water surface elevations provided by the NOAA PORTS system for the Delaware River and Bay. The shape of the Delaware Estuary amplifies the tidal signal at the mouth of the estuary resulting in a wider tidal range at the upper end of the estuary, which is also the more urbanized densely populated end. Storm surges are a particular concern since the potential exists for the surge to be amplified within the estuary.
The graph shows that the measured water surface elevations initially closely match those predicted by harmonic constituents (based on lunar and other cycles), but as the storm approaches the impact of the surge results in greater differences between the predicated and observed water levels.
Email me at JYagecic@gmail.com for the full length GIF or more details about construction of the animated graph.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: AdventuresInData. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Replicating the Apache Drill ‘Yelp’ Academic Dataset Analysis with sergeant
(This article was first published on R – rud.is, and kindly contributed to Rbloggers)
The Apache Drill folks have a nice walkthrough tutorial on how to analyze the Yelp Academic Dataset with Drill. It’s a bit out of date (the current Yelp data set structure is different enough that the tutorial will error out at various points), but it’s a great example of how to work with large, nested JSON files as a SQL data source. By ‘large’ I mean around 4GB of JSON data spread across 5 files.
If you have enough memory and wanted to work with “flattened” versions of the files in R you could use my ndjson package (there are other JSON “flattener” packages as well, and a new one — corpus::read_ndjson — is even faster than mine, but it fails to read this file). Drill doesn’t necessarily load the entire JSON structure into memory (you can check out the query profiles after the fact to see how much each worker component ended up using) and I’m only mentioning that “R can do this w/o Drill” to stave off some of those types of comments.
The main reasons for replicating their Yelp example was to both have a more robust test suite for sergeant (it’s hitting CRAN soon now that dplyr 0.7.0 is out) and to show some Drill SQL to R conversions. Part of the latter reason is also to show how to use SQL calls to create a tbl that you can then use dplyr verbs to manipulate.
The full tutorial replication is at https://rud.is/rpubs/yelp.html but also iframe’d below.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Using the xlsx package to create an Excel file
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
Microsoft Excel is perhaps the most popular data anlysis tool out there. While arguably convenient, spreadsheet software is error prone and Excel code can be very hard to review and test.
After successfully completing this exercise set, you will be able to prepare a basic Excel document using just R (no need to touch Excel yourself), leaving behind a reproducible Rscript.
Solutions are available here.
Exercise 1
Install and load the xlsx package, using the dependencies = TRUE option.
Exercise 2
Create an xlsx workbook object in your R workspace and call it wb.
Exercise 3
Create a sheet object in wb named iris assign it the name sheet1 in your workspace.
Exercise 4
Write the builtin Iris data.frame to the iris sheet without row names. Hint: use the addDataFrame() function.
Now you can write your workbook anytime to your working directory using saveWorkbook(wb, "filename.xlsx").
Learn more about working with excel and R in the online course Learn By Example: Statistics and Data Science in R. In this course you will learn how to: Learn some of the differences between working in Excel with regression modelling and R
 Learn about different statistical concepts
 And much more
Exercise 5
Apply ‘freeze pane’ on the top row.
Exercise 6
Set width of columns 1 through 5 to 12, that is 84 pixels.
Exercise 7
Use Font, CellBlock and CB.setFont to make the header in bold.
Exercise 8
Using tapply generate a table with the mean of ‘petal width’ by species and write to a new sheet called pw, from row 2 down.
Exercise 9
Add a title in cell A1 above the table, merge the cells of the first three columns.
Exercise 10
Save your workbook to your working directory and open using Excel. Go back to R and continue formatting and adding information to your workbook at will.
 Building Shiny App exercises part 1
 Data wrangling : I/O (Part1)
 Building Shiny App exercises part 4
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
All my talks in one place (plus a Hugo walkthrough!)
(This article was first published on R – Locke Data, and kindly contributed to Rbloggers)
I mentioned in an earlier post about how I’m revamping my presentation slides process but that I hadn’t tackled the user experience of browsing my slides, which wasted lots of the effort I put in. To tackle this part of it, I’ve made lockedata.uk using Hugo to be a way of finding and browsing presentations on R, SQL, and more. As Hugo is so easy, I thought I’d throw in a quick Hugo walkthrough too so that you could build your own blog/slides/company site if you wanted to.
The great thing about using Hugo with a sort of blog taxonomy system is now all my presentations have categories and tags so you can find something relevant. Additionally, I’m also ordering slides by the date of the last presentation. This means if you see me present, you can straight away get the slides – they’ll be the first thing on the page.
A number of the talks will also include the videos (embedded with Hugo shortcodes for the most part) from when I’ve presented the talk as that’s the most robust way of getting value out of the slides.
lockedata.uk – presentation website built with Hugo. Read the post to get a hugo walkthrough.
If you’re interested in the tech behind the site read on…
Hugo walkthroughThis site uses Hugo. Hugo is a “static site generator” which means you write a bunch of markdown and it generates html. This is great for building simple sites like company leafletware or blogs.
You can get Hugo across platforms and on Windows it’s just an executable you can put in your program files. You can then work with it like git in the command line.
You will want to read some of the docs to make customisations but the “happy path” Hugo walkthrough to get started is:
 Make a new site you simply run hugo new site NAME
 Let’s do some git, run git init inside the new site’s directory to get your git repo going
 Make a repo on GitHub that we’ll want to push to later on
 Find a theme to suit your needs and go to its github page
 Change to the themes directory then clone the theme into the themes directory
 Back on the toplevel of your site, extract your themes example content with something like cp av themes/academic/exampleSite/* .
 Get your site running locally with realtime changes by running hugo server w
 Start editing the sample content to meet your needs and using the site (available at localhost:1313) to see how it’s going
 When you’re happy, stop the local server (in Windows, this is ctrl+c)
 Make sure in the config.toml file, the site URL reflects where it’ll go on Github i.e. http://yourusername.gihub.io/yourreponame/
 Generate your site to the docs directory with hugo d docs
 Commit everything and push to your GitHub repo
 In the Settings page for the Github repo, set the Github Pages > Source option to “master branch /docs folder”
Now your site will be available yourusername.gihub.io/yourreponame
PS If you want to do a lot of R in your blog, you can utilise blogdown which is an rmarkdown Hugo hybrid.
The post All my talks in one place (plus a Hugo walkthrough!) appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Locke Data. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Weather forecast with regression models – part 4
(This article was first published on R Programming – DataScience+, and kindly contributed to Rbloggers)
Results so far obtained allow us to predict the RainTomorrow Yes/No variable. In the first part, we highlighted that such factor variable can be put in relationship with the Rainfall quantitative one by:
all.equal(weather_data$Rainfall > 1, weather_data$RainToday == "Yes")As a consequence, we are able so far to predict if tomorrow rainfall shall be above 1mm or not. Some more questions to be answered:
 In case of “at least moderate” Rainfall (arbitrarily defined as Rainfall > 15mm), how do our models perform?
In case of “at least moderate” rainfall, we would like to be as much reliable as possible in predicting {RainTomorrow = “Yes”}. Since RainTomorrow = “Yes” is perceived as the prediction of a potential threat of damages due to the rainfall, we have to alert Canberra’s citizens properly. That translates in having a very good specificity, as explained in the presecution of the analysis.
 Can we build additional models in order to predict further variables such as tomorrow’s Rainfall, Humidity3pm, WindGustSpeed, Sunshine, MinTemp and MaxTemp variables?
That is motivated by the fact that weather forecast comprises more than one prediction. For an example of the most typical ones, see:
Moderate Rainfall ScenarioSince Rainfall was one of the variable we drop off from our original dataset, we have to consider it back and identify the records which are associated to moderate rainfall. The following note applies.
Note on predictors availabilityIn the prosecution of the analysis, we will not make the assumption to have to implement separate 9AM, 3PM and late evening predictions as done in part #2. For simplicity and brevity, we assume to have to arrange this new goals in the late evening where all predictors are available. Similarly as done before, we will not take into account the following variables:
Date, Location, RISK_MM, RainToday, Rainfall, WindDir9am, WindDir3pmas Date and Location cannot be used as predictors; RISK_MM, RainToday, Rainfall are not used to make our task a little bit more difficult; we demonstrated in the first tutorial of this series that WinDir9am and WinDir3pm are not interesting predictors and they have associated a considerable number of NA values.
Having said that, we then start the moderate rainfall scenario analysis. At the same time, we prepare the dataset for the rest of the analysis.
weather_data6 < subset(weather_data, select = c(Date, Location, RISK_MM, RainToday, WindDir9am, WindDir3pm)) weather_data6$RainfallTomorrow < c(weather_data6$Rainfall[2:nrow(weather_data6)], NA) weather_data6$Humidity3pmTomorrow < c(weather_data6$Humidity3pm[2:nrow(weather_data6)], NA) weather_data6$WindGustSpeedTomorrow < c(weather_data6$WindGustSpeed[2:nrow(weather_data6)], NA) weather_data6$SunshineTomorrow < c(weather_data6$Sunshine[2:nrow(weather_data6)], NA) weather_data6$MinTempTomorrow < c(weather_data6$MinTemp[2:nrow(weather_data6)], NA) weather_data6$MaxTempTomorrow < c(weather_data6$MaxTemp[2:nrow(weather_data6)], NA) weather_data7 = weather_data6[complete.cases(weather_data6),] head(weather_data7) MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindSpeed9am WindSpeed3pm 1 8.0 24.3 0.0 3.4 6.3 NW 30 6 20 2 14.0 26.9 3.6 4.4 9.7 ENE 39 4 17 3 13.7 23.4 3.6 5.8 3.3 NW 85 6 6 4 13.3 15.5 39.8 7.2 9.1 NW 54 30 24 5 7.6 16.1 2.8 5.6 10.6 SSE 50 20 28 6 6.2 16.9 0.0 5.8 8.2 SE 44 20 24 Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainTomorrow 1 68 29 1019.7 1015.0 7 7 14.4 23.6 Yes 2 80 36 1012.4 1008.4 5 3 17.5 25.7 Yes 3 82 69 1009.5 1007.2 8 7 15.4 20.2 Yes 4 62 56 1005.5 1007.0 2 7 13.5 14.1 Yes 5 68 49 1018.3 1018.5 7 7 11.1 15.4 No 6 70 57 1023.8 1021.7 7 5 10.9 14.8 No RainfallTomorrow Humidity3pmTomorrow WindGustSpeedTomorrow SunshineTomorrow MinTempTomorrow 1 3.6 36 39 9.7 14.0 2 3.6 69 85 3.3 13.7 3 39.8 56 54 9.1 13.3 4 2.8 49 50 10.6 7.6 5 0.0 57 44 8.2 6.2 6 0.2 47 43 8.4 6.1 MaxTempTomorrow 1 26.9 2 23.4 3 15.5 4 16.1 5 16.9 6 18.2 hr_idx = which(weather_data7$RainfallTomorrow > 15)It is important to understand what is the segmentation of moderate rainfall records in terms of training and testing set, as herein shown.
(train_hr < hr_idx[hr_idx %in% train_rec]) [1] 9 22 30 52 79 143 311 (test_hr < hr_idx[!(hr_idx %in% train_rec)]) [1] 3 99 239 251We see that some of the “at least moderate” Rainfall records belong to the testing dataset, hence we can use it in order to have a measure based on unseen data by our model. We test the evening models with a testset comprising such moderate rainfall records.
rain_test < weather_data7[test_hr,] rain_test MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindSpeed9am WindSpeed3pm 3 13.7 23.4 3.6 5.8 3.3 NW 85 6 6 99 14.5 24.2 0.0 6.8 5.9 SSW 61 11 20 250 3.0 11.1 0.8 1.4 0.2 W 35 0 13 263 1.1 11.0 0.2 1.8 0.0 WNW 41 0 6 Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainTomorrow 3 82 69 1009.5 1007.2 8 7 15.4 20.2 Yes 99 76 76 999.4 998.9 7 7 17.9 20.3 Yes 250 99 96 1024.4 1021.1 7 8 6.3 8.6 Yes 263 92 87 1014.4 1009.0 7 8 2.4 8.7 Yes RainfallTomorrow Humidity3pmTomorrow WindGustSpeedTomorrow SunshineTomorrow MinTempTomorrow 3 39.8 56 54 9.1 13.3 99 16.2 58 41 5.6 12.4 250 16.8 72 35 6.5 2.9 263 19.2 56 54 7.5 2.3 MaxTempTomorrow 3 15.5 99 19.9 250 9.5 263 11.6Let us see how the first weather forecast evening model performs.
opt_cutoff < 0.42 pred_test < predict(mod_ev_c2_fit, rain_test, type="prob") prediction < ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No") data.frame(prediction = prediction, RainfallTomorrow = rain_test$RainfallTomorrow) prediction RainfallTomorrow 1 Yes 39.8 2 No 16.2 3 Yes 16.8 4 Yes 19.2 confusionMatrix(prediction, rain_test$RainTomorrow) Confusion Matrix and Statistics Reference Prediction No Yes No 0 1 Yes 0 3 Accuracy : 0.75 95% CI : (0.1941, 0.9937) No Information Rate : 1 PValue [Acc > NIR] : 1 Kappa : 0 Mcnemar's Test PValue : 1 Sensitivity : NA Specificity : 0.75 Pos Pred Value : NA Neg Pred Value : NA Prevalence : 0.00 Detection Rate : 0.00 Detection Prevalence : 0.25 Balanced Accuracy : NA 'Positive' Class : NoThen, the second evening weather forecast model.
opt_cutoff < 0.56 pred_test < predict(mod_ev_c3_fit, rain_test, type="prob") prediction < ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No") data.frame(prediction = prediction, RainfallTomorrow = rain_test$RainfallTomorrow) prediction RainfallTomorrow 1 Yes 39.8 2 Yes 16.2 3 No 16.8 4 Yes 19.2 confusionMatrix(prediction, rain_test$RainTomorrow) Confusion Matrix and Statistics Reference Prediction No Yes No 0 1 Yes 0 3 Accuracy : 0.75 95% CI : (0.1941, 0.9937) No Information Rate : 1 PValue [Acc > NIR] : 1 Kappa : 0 Mcnemar's Test PValue : 1 Sensitivity : NA Specificity : 0.75 Pos Pred Value : NA Neg Pred Value : NA Prevalence : 0.00 Detection Rate : 0.00 Detection Prevalence : 0.25 Balanced Accuracy : NA 'Positive' Class : NoFor both evening forecast models, one of the testing set predictions is wrong. If we like to improve it, we have to step back to the tuning procedure and determine a decision threshold value more suitable to capture such scenarios. From the tables that we show in the previous part, we can try to lower the cutoff value to increase specificity, however likely implying to reduce accuracy.
Chance of RainIn the previous part of this series, when discussing with the tuning of the decision threshold, we dealt with probabilities associated to the predicted RainTomorrow variable. The chances of having RainTomorrow == “Yes” are equivalent to the chance of rain. Hence the following utiligy function can be depicted at the purpose
chance_of_rain < function(model, data_record){ chance_frac < predict(mod_ev_c3_fit, data_record, type="prob")[, "Yes"] paste(round(chance_frac*100), "%", sep="") }We try it out passing ten records of the testing dataset.
chance_of_rain(mod_ev_c3_fit, testing[1:10,]) [1] "79%" "3%" "4%" "3%" "1%" "7%" "69%" "61%" "75%" "78%"To build all the following models, we are going to use all the data available in order to capture the variability of an entire year. For brevity, we do not make comparison among models for same predicted variable and we do not show regression models diagnostic plots.
Tomorrow’s Rainfall PredictionIf the logistic regression model predicts RainTomorrow = “Yes”, we would like to take advantage of a linear regression model capable to predict the Rainfall value for tomorrow. In other words, we are just interested in records whose Rainfall outcome is greater than 1 mm.
weather_data8 = weather_data7[weather_data7$RainfallTomorrow > 1,] rf_fit < lm(RainfallTomorrow ~ MaxTemp + Sunshine + WindGustSpeed  1, data = weather_data8) summary(rf_fit) Call: lm(formula = RainfallTomorrow ~ MaxTemp + Sunshine + WindGustSpeed  1, data = weather_data8) Residuals: Min 1Q Median 3Q Max 10.363 4.029 1.391 3.696 23.627 Coefficients: Estimate Std. Error t value Pr(>t) MaxTemp 0.3249 0.1017 3.196 0.002225 ** Sunshine 1.2515 0.2764 4.528 2.88e05 *** WindGustSpeed 0.1494 0.0398 3.755 0.000394 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 6.395 on 60 degrees of freedom Multiple Rsquared: 0.6511, Adjusted Rsquared: 0.6336 Fstatistic: 37.32 on 3 and 60 DF, pvalue: 9.764e14All predictors are reported as significant.
lm_pred < predict(rf_fit, weather_data8) plot(x = seq_along(weather_data8$RainfallTomorrow), y = weather_data8$RainfallTomorrow, type='p', xlab = "observations", ylab = "RainfallTomorrow") legend("topright", c("actual", "predicted"), fill = c("black", "red")) points(x = seq_along(weather_data8$RainfallTomorrow), y = lm_pred, col='red')The way we intend to use this model together with the logistic regression model predicting RainTomorrow factor variable is the following.
++ ++  logistic regression   linear regression   model +> {RainTomorrow = Yes} >+ model for > {RainfallTomorrow prediction}    RainFallTomorrow  +++ ++   + {RainTomorrow = No} > {Rainfall < 1 mm} Tomorrow’s Humidity 3pm Prediction h3pm_fit < lm(Humidity3pmTomorrow ~ Humidity3pm + Sunshine, data = weather_data7) summary(h3pm_fit) Call: lm(formula = Humidity3pmTomorrow ~ Humidity3pm + Sunshine, data = weather_data7) Residuals: Min 1Q Median 3Q Max 30.189 9.117 2.417 7.367 45.725 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 34.18668 5.45346 6.269 1.09e09 *** Humidity3pm 0.37697 0.06973 5.406 1.21e07 *** Sunshine 0.85027 0.33476 2.540 0.0115 *  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 14.26 on 344 degrees of freedom Multiple Rsquared: 0.2803, Adjusted Rsquared: 0.2761 Fstatistic: 66.99 on 2 and 344 DF, pvalue: < 2.2e16All predictors are reported as significant.
lm_pred < predict(h3pm_fit, weather_data7) plot(x = seq_along(weather_data7$Humidity3pmTomorrow), y = weather_data7$Humidity3pmTomorrow, type='p', xlab = "observations", ylab = "Humidity3pmTomorrow") legend("topright", c("actual", "predicted"), fill = c("black", "red")) points(x = seq_along(weather_data7$Humidity3pmTomorrow), y = lm_pred, col='red') Tomorrow’s WindGustSpeed Prediction wgs_fit < lm(WindGustSpeedTomorrow ~ WindGustSpeed + Pressure9am + Pressure3pm, data = weather_data7) summary(wgs_fit) Call: lm(formula = WindGustSpeedTomorrow ~ Pressure9am + Pressure3pm + WindGustSpeed, data = weather_data7) Residuals: Min 1Q Median 3Q Max 31.252 7.603 1.069 6.155 49.984 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 620.53939 114.83812 5.404 1.22e07 *** Pressure9am 2.17983 0.36279 6.009 4.78e09 *** Pressure3pm 2.76596 0.37222 7.431 8.66e13 *** WindGustSpeed 0.23365 0.05574 4.192 3.52e05 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 11.29 on 343 degrees of freedom Multiple Rsquared: 0.2628, Adjusted Rsquared: 0.2563 Fstatistic: 40.75 on 3 and 343 DF, pvalue: < 2.2e16All predictors are reported as significant.
lm_pred < predict(wgs_fit, weather_data7) plot(x = seq_along(weather_data7$WindGustSpeedTomorrow), y = weather_data7$WindGustSpeedTomorrow, type='p', xlab = "observations", ylab = "WindGustSpeedTomorrow") legend("topright", c("actual", "predicted"), fill = c("black", "red")) points(x = seq_along(weather_data7$WindGustSpeedTomorrow), y = lm_pred, col='red') Tomorrow’s Sunshine Prediction sun_fit < lm(SunshineTomorrow ~ Sunshine*Humidity3pm + Cloud3pm + Evaporation + I(Evaporation^2) + WindGustSpeed  1, data = weather_data7) summary(sun_fit) Call: lm(formula = SunshineTomorrow ~ Sunshine * Humidity3pm + Cloud3pm + Evaporation + I(Evaporation^2) + WindGustSpeed  1, data = weather_data7) Residuals: Min 1Q Median 3Q Max 9.9701 1.8230 0.6534 2.1907 7.0478 Coefficients: Estimate Std. Error t value Pr(>t) Sunshine 0.607984 0.098351 6.182 1.82e09 *** Humidity3pm 0.062289 0.012307 5.061 6.84e07 *** Cloud3pm 0.178190 0.082520 2.159 0.031522 * Evaporation 0.738127 0.245356 3.008 0.002822 ** I(Evaporation^2) 0.050735 0.020510 2.474 0.013862 * WindGustSpeed 0.036675 0.013624 2.692 0.007453 ** Sunshine:Humidity3pm 0.007704 0.002260 3.409 0.000729 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.134 on 340 degrees of freedom Multiple Rsquared: 0.8718, Adjusted Rsquared: 0.8692 Fstatistic: 330.3 on 7 and 340 DF, pvalue: < 2.2e16All predictors are reported as significant. Very good adjusted Rsquared.
lm_pred < predict(sun_fit, weather_data7) plot(x = seq_along(weather_data7$SunshineTomorrow), y = weather_data7$SunshineTomorrow, type='p', xlab = "observations", ylab = "SunshineTomorrow") legend("topright", c("actual", "predicted"), fill = c("black", "red")) points(x = seq_along(weather_data7$SunshineTomorrow), y = lm_pred, col='red')Above plot makes more evident that this model captures only a subset of the original Sunshine variance.
To give a more intuitive interpretation of the Sunshine information, it is suggestable to report a factor variable having levels:
{"Cloudy", "Mostly Cloudy", "Partly Cloudy", "Mostly Sunny", "Sunny"}In doing that, we have furthermore to take into account tomorrow’s Cloud9am and Cloud3pm. For those quantitative variables, corresponding predictions are needed, and, at the purpose, the following linear regression models based on the Sunshine predictor can be depicted.
cloud9am_fit < lm(Cloud9am ~ Sunshine, data = weather_data7) summary(cloud9am_fit) Call: lm(formula = Cloud9am ~ Sunshine, data = weather_data7) Residuals: Min 1Q Median 3Q Max 5.2521 1.7635 0.4572 1.6920 5.9127 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 8.51527 0.28585 29.79 <2e16 *** Sunshine 0.58031 0.03298 17.59 <2e16 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.163 on 345 degrees of freedom Multiple Rsquared: 0.4729, Adjusted Rsquared: 0.4714 Fstatistic: 309.6 on 1 and 345 DF, pvalue: < 2.2e16All predictors are reported as significant.
lm_pred < round(predict(cloud9am_fit, weather_data7)) lm_pred[lm_pred == 9] = 8 plot(x = weather_data7$Sunshine, y = weather_data7$Cloud9am, type='p', xlab = "Sunshine", ylab = "Cloud9am") legend("topright", c("actual", "predicted"), fill = c("black", "red")) points(x = weather_data7$Sunshine, y = lm_pred, col='red') cloud3pm_fit < lm(Cloud3pm ~ Sunshine, data = weather_data7) summary(cloud3pm_fit) Call: lm(formula = Cloud3pm ~ Sunshine, data = weather_data7) Residuals: Min 1Q Median 3Q Max 4.1895 1.6230 0.0543 1.4829 5.2883 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 8.01199 0.26382 30.37 <2e16 *** Sunshine 0.50402 0.03044 16.56 <2e16 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.996 on 345 degrees of freedom Multiple Rsquared: 0.4428, Adjusted Rsquared: 0.4412 Fstatistic: 274.2 on 1 and 345 DF, pvalue: < 2.2e16All predictors are reported as significant.
lm_pred < round(predict(cloud3pm_fit, weather_data7)) lm_pred[lm_pred == 9] = 8 plot(x = weather_data7$Sunshine, y = weather_data7$Cloud3pm, type='p', xlab = "Sunshine", ylab = "Cloud3pm") legend("topright", c("actual", "predicted"), fill = c("black", "red")) points(x = weather_data7$Sunshine, y = lm_pred, col='red')Once predicted the SunshineTomorrow value, we can use it as predictor input in Cloud9am/Cloud3pm linear regression models to determine both tomorrow’s Cloud9am and Cloud3pm predictions. We have preferred such approach in place of building a model based on one of the following formulas:
Cloud9amTomorrow ~ Cloud9am Cloud9amTomorrow ~ Cloud9am + Sunshine Cloud3pmTomorrow ~ Cloud3pmbecause they provide with a very low adjusted Rsquared (approximately in the interval [0.07, 0.08]).
Based on the Cloud9am and Cloud3pm predictions, the computation of a new factor variable named as CloudConditions and having levels {“Cloudy”, “Mostly Cloudy”, “Partly Cloudy”, “Mostly Sunny”, “Sunny”} will be shown. As we observed before our Sunshine prediction captures only a a subset of the variance of the modeled variable, and, as a consequence, our cloud conditions prediction is expected to rarely (possibly never) report “Cloudy” and “Sunny”.
Tomorrow’s MinTemp Prediction minTemp_fit < lm(MinTempTomorrow ~ MinTemp + Humidity3pm , data = weather_data7) summary(minTemp_fit) Call: lm(formula = MinTempTomorrow ~ MinTemp + Humidity3pm, data = weather_data7) Residuals: Min 1Q Median 3Q Max 10.4820 1.7546 0.3573 1.9119 10.0953 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 2.696230 0.520244 5.183 3.74e07 *** MinTemp 0.844813 0.027823 30.364 < 2e16 *** Humidity3pm 0.034404 0.009893 3.478 0.000571 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.112 on 344 degrees of freedom Multiple Rsquared: 0.7327, Adjusted Rsquared: 0.7311 Fstatistic: 471.4 on 2 and 344 DF, pvalue: < 2.2e16All predictors are reported as significant.
lm_pred < predict(minTemp_fit, weather_data7) plot(x = weather_data7$Sunshine, y = weather_data7$MinTemp, type='p', xlab = "Sunshine", ylab = "MinTemp") legend("topright", c("actual", "fitted"), fill = c("black", "red")) points(x = weather_data7$Sunshine, y = lm_pred, col='red') Tomorrow’s MaxTemp Prediction maxTemp_fit < lm(MaxTempTomorrow ~ MaxTemp + Evaporation, data = weather_data7) summary(maxTemp_fit) Call: lm(formula = MaxTempTomorrow ~ MaxTemp + Evaporation, data = weather_data7) Residuals: Min 1Q Median 3Q Max 13.217 1.609 0.137 2.025 9.708 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 2.69811 0.56818 4.749 3.01e06 *** MaxTemp 0.82820 0.03545 23.363 < 2e16 *** Evaporation 0.20253 0.08917 2.271 0.0237 *  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.231 on 344 degrees of freedom Multiple Rsquared: 0.7731, Adjusted Rsquared: 0.7718 Fstatistic: 586 on 2 and 344 DF, pvalue: < 2.2e16All predictors are reported as significant.
lm_pred < predict(maxTemp_fit, weather_data7) plot(x = weather_data7$Sunshine, y = weather_data7$MaxTemp, type='p', xlab = "Sunshine", ylab = "MaxTemp") legend("topright", c("actual", "fitted"), fill = c("black", "red")) points(x = weather_data7$Sunshine, y = lm_pred, col='red') CloudConditions computationBased on second reference given at the end, we have the following mapping between a descriptive cloud conditions string and the cloud coverage in ”oktas,” which are a unit of eights.
Cloudy: 8/8 opaque clouds Mostly Cloudy: 6/8  7/8 opaque clouds Partly Cloudy: 3/8  5/8 opaque clouds Mostly Sunny: 1/8  2/8 opaque clouds Sunny: 0/8 opaque cloudsWe can figure out the following utility function to help.
computeCloudConditions = function(cloud_9am, cloud_3pm) { cloud_avg = min(round((cloud_9am + cloud_3pm)/2), 8) cc_str = NULL if (cloud_avg == 8) { cc_str = "Cloudy" } else if (cloud_avg >= 6) { cc_str = "Mostly Cloudy" } else if (cloud_avg >= 3) { cc_str = "Partly Cloudy" } else if (cloud_avg >= 1) { cc_str = "Mostly Sunny" } else if (cloud_avg < 1) { cc_str = "Sunny" } cc_str } Weather Forecast ReportStarting from a basic example of weather dataset, we were able to build several regression models. The first one, based on logistic regression, is capable of predict the RainTomorrow factor variable. The linear regression models are to predict the Rainfall, Humidity3pm, WindGustSpeed, MinTemp, MaxTemp, CloudConditions weather metrics. The overall picture we got is the following.
++    +> RainTomorrow    Weather +> Rainfall   {Today's weather data} >+ Forecast +> Humidity3pm    Models +> WindGustSpeed ++    linear regression   +> Sunshine >+ models +>{Cloud9am, Cloud3pm}+    Cloud9am and Cloud3pm    +> MinTemp ++      +> MaxTemp     ++ ++   computing new   { Cloudy, Mostly Cloudy, Partly Cloudy, Mostly Sunny, Sunny} <+ factor variable +<+  CloudConditions  ++As said before, if RainTomorrow == “Yes” Rainfall is explicitely given, otherwise un upper bound Rainfall < 1 mm is. Chance of rain is computed only if RainTomorrow prediction is “Yes”. The Humidity3pm prediction is taken as humidity prediction for the whole day, in general.
weather_report < function(today_record, rain_tomorrow_model, cutoff) { # RainTomorrow prediction rainTomorrow_prob < predict(rain_tomorrow_model, today_record, type="prob") rainTomorrow_pred = ifelse(rainTomorrow_prob$Yes >= cutoff, "Yes", "No") # Rainfall prediction iff RainTomorrow prediction is Yes; chance of rain probability rainfall_pred < NA chance_of_rain < NA if (rainTomorrow_pred == "Yes") { rainfall_pred < round(predict(rf_fit, today_record), 1) chance_of_rain < round(rainTomorrow_prob$Yes*100) } # WindGustSpeed prediction wgs_pred < round(predict(wgs_fit, today_record), 1) # Humidity3pm prediction h3pm_pred < round(predict(h3pm_fit, today_record), 1) # sunshine prediction is used to fit Cloud9am and Cloud3pm sun_pred < predict(sun_fit, today_record) cloud9am_pred < min(round(predict(cloud9am_fit, data.frame(Sunshine=sun_pred))), 8) cloud3pm_pred < min(round(predict(cloud3pm_fit, data.frame(Sunshine=sun_pred))), 8) # a descriptive cloud conditions string is computed CloudConditions_pred < computeCloudConditions(cloud9am_pred, cloud3pm_pred) # MinTemp prediction minTemp_pred < round(predict(minTemp_fit, today_record), 1) # MaxTemp prediction maxTemp_pred < round(predict(maxTemp_fit, today_record), 1) # converting all numeric predictions to strings if (is.na(rainfall_pred)) { rainfall_pred_str < "< 1 mm" } else { rainfall_pred_str < paste(rainfall_pred, "mm", sep = " ") } if (is.na(chance_of_rain)) { chance_of_rain_str < "" } else { chance_of_rain_str < paste(chance_of_rain, "%", sep="") } wgs_pred_str < paste(wgs_pred, "Km/h", sep= " ") h3pm_pred_str < paste(h3pm_pred, "%", sep = "") minTemp_pred_str < paste(minTemp_pred, "°C", sep= "") maxTemp_pred_str < paste(maxTemp_pred, "°C", sep= "") report < data.frame(Rainfall = rainfall_pred_str, ChanceOfRain = chance_of_rain_str, WindGustSpeed = wgs_pred_str, Humidity = h3pm_pred_str, CloudConditions = CloudConditions_pred, MinTemp = minTemp_pred_str, MaxTemp = maxTemp_pred_str) report }Sure there are confidence and prediction intervals associated to our predictions. However, since we intend to report our forecasts to Canberra’s citizens, our message should be put in simple words to reach everybody and not just statisticians.
Finally we can try our tomorrow weather forecast report out.
(tomorrow_report < weather_report(weather_data[10,], mod_ev_c3_fit, 0.56)) Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp 1 < 1 mm 36.9 Km/h 39.7% Partly Cloudy 8.7°C 22.7°C (tomorrow_report < weather_report(weather_data[32,], mod_ev_c3_fit, 0.56)) Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp 1 < 1 mm 48 Km/h 41.3% Partly Cloudy 10.9°C 24.8°C (tomorrow_report < weather_report(weather_data[50,], mod_ev_c3_fit, 0.56)) Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp 1 8.3 mm 59% 36.9 Km/h 60.1% Partly Cloudy 10.8°C 22.2°C (tomorrow_report < weather_report(weather_data[100,], mod_ev_c3_fit, 0.56)) Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp 1 5.4 mm 71% 46.7 Km/h 51.3% Partly Cloudy 11.2°C 20.3°C (tomorrow_report < weather_report(weather_data[115,], mod_ev_c3_fit, 0.56)) Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp 1 < 1 mm 46.4 Km/h 35.7% Mostly Sunny 12.3°C 25.1°C (tomorrow_report < weather_report(weather_data[253,], mod_ev_c3_fit, 0.56)) Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp 1 9.4 mm 81% 52.4 Km/h 65.2% Partly Cloudy 1.3°C 10.3°C (tomorrow_report < weather_report(weather_data[311,], mod_ev_c3_fit, 0.56)) Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp 1 < 1 mm 40.7 Km/h 44.2% Mostly Cloudy 6.9°C 16.4°CPlease note that some records of the original weather dataset may show NA’s values and our regression models did not cover a predictor can be as such. To definitely evaluate the accuracy of our weather forecast report, we would need to check its unseen data predictions with the occurred tomorrow’s weather. Further, to improve the adjusted Rsquared metric of our linear regression models is a potential area of investigation.
ConclusionsStarting from a basic weather dataset, we went through an interesting datastory involving exploratory analysis and models building. We spot strengths and weaknesses of our prediction models. We also learned some insights of the meteorology terminology.
We hope you enjoyed this tutorial series on weather forecast. If you have any questions, please feel free to comment below.
References
Related Post
 Weather forecast with regression models – part 3
 Weather forecast with regression models – part 2
 Weather forecast with regression models – part 1
 Weighted Linear Support Vector Machine
 Logistic Regression Regularized with Optimization
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Automatic tools for improving R packages
On Tuesday I gave a talk at a meetup of the R users group of Barcelona. I got to choose the topic of my talk, and decided I’d like to expand a bit on a recent tweet of mine. There are tools that help you improve your R packages, some of them are not famous enough yet in my opinion, so I was happy to help spread the word! I published my slides online but thought that a blog post would be nice as well.
During my talk at RUG BCN, for each tool I gave a short introduction and then applied it to a small package I had created for the occasion. In that post I’ll just shortly present each tool. Most of them are only automatic because they automatically provide you with a list of things to fix, but they won’t do the work for you, sorry. If you have an R package you develop at hand, I’d really advise you to apply them on it and see what you get! I concentrated on tools improving the coding style, the package structure, testing, the documentation, but not features and performance.
Note: before my talk Luca Cerone gave a short introduction to package development with devtools and roxygen2. This is how I develop packages now although the first package I contributed to, surveillance, has separate man pages, a NAMESPACE maintained by hand, etc. I gave its maintainer a hard time with all my stupid beginner mistakes! Thankfully Sebastian was always patient and nice. In any case, if you want to start developping packages you should definitely read Hadley Wickham’s book, it’s a good one! I read it all once I started upping my package game – this upping being a continuous process of course.
The basics: run R CMD check!In his talk Luca said that when he develops a package he often doesn’t have time to check it passes R CMD check, which made me die inside a little bit. I totally understand the lack of time but I think that R CMD check complains about important stuff, and most often unknown NOTE, WARNING or ERROR messages can be googled and solved, and one learns something. So please do run R CMD check!
The RLady that uploaded this picture to the Meetup event told me that this is my “Run R CMD check, you fools!” face (she actually said Usad el R CMD check, malditos!!!).
lintr, Static Code Analysis for RI then presented Jim Hester’s lintr package which allows you to check some stuff in your code without running it. It will for instance tell you if you used camel case instead of snake case for a variable name, or if you defined a variable without using it afterwards. The README of the Github repository of the package provides a list of available linters, each linter being one of these annoying, hum useful, checks.
How does one use lintr? Just by typing lintr::lint_package() at the root of the package folder. This week I applied it to a package of mine and spent some time adding white space that made my code prettier. In my opinion using lintr helps you maintain a code that is nice to read, which is cool because then you, future you or another human can concentrate on important stuff when reading it.
I mentioned two small tips as well.
How to customize the lintr linters?For that one needs to create a .lintr file at the root of the package (and add it to .Rbuildignore). Here is an example of the content of such a file:
with_defaults(camel_case_linter = NULL, # you prefer camel case snake_case_linter, # so flag snake case line_length_linter(120)) # you love long lines How to never ever ignore lintr outputYou can add an unit test for lintr! Why would one ever want to do that? Well, you might want to force yourself, or contributors to your package, to respect some rules. The test would be:
if (requireNamespace("lintr", quietly = TRUE)) { context("lints") test_that("Package Style", { lintr::expect_lint_free() }) }After I published my slides someone added such a test to one of his packages which made me very proud.
One tip I forgot to mention during my talk is how to add exceptions, I was far too strict!
How to exclude lines from lintr checksFor that either put # nolint at the end of the line or # nolint start and # nolint end before and after a block of lines you’d like to exclude. A good example is for instance a very long line that you can’t break because it contains a very long URL.
goodpractice, Advice on R Package BuildingThis package by Gabor Csardi can only be installed from Github. It integrates a variety of tools, including lintr, for providing “Advice (that) includes functions and syntax to avoid, package structure, code complexity, code formatting, etc.”. For instance, goodpractice uses cyclocomp to check whether any of your functions is too complex. A participant of the meetup told me she got a lot of messages when running goodpractice::gp() on her first package, which is great because it means goodpractice will really have provided her with useful guidance. Some things that goodpractice::gp() will flag might be necessary features of your package, but in general it’s a nice list to read. Note: you can only run goodpractice::gp() once R CMD check passes.
Romain François who came to the Meetup from Montpellier (thanks, Romain!) asked me if I had ever run goodpractice::gp() on famous packages like dplyr. I haven’t but since the default advice of goodpractice seems quite consistent with the advice in Hadley Wickham’s book, I doubt I’d find anything interesting.
At rOpenSci onboarding, we run goodpractice::gp() on every submitted package as part of the editorial checks, which makes me like this tool a bit more every week. It’s actually tunable but the defaults are so great that I never needed to tune it!
I was asked when to run goodpractice::gp() and I’d say once in a while, e.g. before submitting a package to CRAN for the first time, or after creating its first working version.
devtools::spell_check()This function was written by Jeroen Ooms and is supported by his hunspell package. It will look for possible typos in the documentation of a package. In the output, there might be quite a few false positives, e.g. the name of an API, but I strongly believe that its helping you find typos in the documentation compensates the effort of reading the whole list of potential typos.
I’ve noticed that devtools::release(), which one can use when submitting a package to CRAN, now includes the output of devtools::spell_check(). It helped me find a typo at the last minute in one of my packages! Typos might not be that bad, but a typofree documentation looks much better.
pkgdown, generate static html documentation for an R packageThis last tool was the only one that’s truly automatic. This package was created by Hadley Wickham and is currently only on Github. It allows you to create a pretty website for your package without any big effort.
How does one use it? After installing it, run pkgdown::init() and pkgdown::build_site() at the root of your package. This will create a docs/ folder containing the static html documentation, that you can browse locally.
Now, how do you present it to the world? In theory I guess you could publish it anywhere but the really handy thing is to have it as a Github page, if you develop your package on Github. For this you just need to change the settings of the repository to make master/docs the source for Github pages.
Unless you typically redirect all links from your Github account to your domain, your package will live at account_name.github.io/package_name. See for instance this nice package of Dirk Schumacher’s whose website is here.
How to customize the website?This part is not automatic, and demands creating a _pkgdown.yml file at the root of your package. I’ve only mentioned two aspects:

One can change the style of the website, e.g. in this config file I chose another Boostrap theme than the default one. To choose one Boostrap theme one can browse them online or look at a pkgdown website that one likes and look at how it was created. A participant of the meetup told me that there is an RStudio addin for choosing a theme but I couldn’t locate it.

One can change the order of vignettes, or of the functions, and even group them, which might be extremely useful when a package has many functions. Usually the alphabetical order makes little sense. See here the config file of this website, the grouping is really well done.
Since this was a talk, not a long workshop, and also because I don’t know everything, I have not mentioned every useful tool in the world. I’d be grateful for suggestions in the comments, and in the meantime here are some tools I briefly mentioned at the end of my talk:

Continuous integration, with this well written post by Julia Silge

The Rhub project

The available package to check whether the name you want to use for your package is a) available and b) appropriate.
A very important tip I gave was to have your code read by humans. The automatic tools will flag some things, but the feedback of an human about your code and your documentation will be crucial. And when you review code, you’ll learn a lot! If you want to do that, you can volunteer to become a reviewer for rOpenSci, see more info here and fill the short form here.
After my talk I got a nice RMarkdown tshirt that has the same colour as my face! Thanks a lot to the organizers, in particular Luca Cerone!
NonStandard Evaluation and Function Composition in R
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
In this article we will discuss composing standardevaluation interfaces (SE) and composing nonstandardevaluation interfaces (NSE) in R.
In R the package tidyeval/rlang is a tool for building domain specific languages intended to allow easier composition of NSE interfaces.
To use it you must know some of its structure and notation. Here are some details paraphrased from the major tidyeval/rlang client, the package dplyr: vignette('programming', package = 'dplyr')).
 ":=" is needed to make lefthandside remapping possible (adding yet another "more than one assignment type operator running around" notation issue).
 "!!" substitution requires parenthesis to safely bind (so the notation is actually "(!! )", not "!!").
 Lefthandsides of expressions are names or strings, while righthandsides are quosures/expressions.
Let’s apply tidyeval/rlang notation to the task of building reusable generic in R.
# setup suppressPackageStartupMessages(library("dplyr")) packageVersion("dplyr") ## [1] '0.7.0'vignette('programming', package = 'dplyr') includes the following example:
my_mutate < function(df, expr) { expr < enquo(expr) mean_name < paste0("mean_", quo_name(expr)) sum_name < paste0("sum_", quo_name(expr)) mutate(df, !!mean_name := mean(!!expr), !!sum_name := sum(!!expr) ) }We can try this:
d < data.frame(a=1) my_mutate(d, a) ## a mean_a sum_a ## 1 1 1 1 SE ExampleFrom this example we can figure out how to use tidyeval/rlang notation to build a standard interface version of a function that adds one to a column and lands the value in an arbitrary column:
tidy_add_one_se < function(df, res_var_name, input_var_name) { input_var < as.name(input_var_name) res_var < res_var_name mutate(df, !!res_var := (!!input_var) + 1) } tidy_add_one_se(d, 'res', 'a') ## a res ## 1 1 2And we can rewrap tidy_add_one_se as into a "add one to self" function as we show here:
tidy_increment_se < function(df, var_name) { tidy_add_one_se(df, var_name, var_name) } tidy_increment_se(d, 'a') ## a ## 1 2 NSE ExampleWe can also use the tidyeval/rlang notation more as it is intended: to wrap or compose a nonstandard interface in another nonstandard interface.
tidy_add_one_nse < function(df, res_var, input_var) { input_var < enquo(input_var) res_var < quo_name(enquo(res_var)) mutate(df, !!res_var := (!!input_var) + 1) } tidy_add_one_nse(d, res, a) ## a res ## 1 1 2And we even wrap this again as a new "add one to self" function:
tidy_increment_nse < function(df, var) { var < enquo(var) tidy_add_one_nse(df, !!var, !!var) } tidy_increment_nse(d, a) ## a ## 1 2(The above enquo() then "!!" pattern is pretty much necissary, as the simpler idea of just passing var through doesn’t work.)
An IssueWe could try use base::substitute() instead of quo_name(enquo()) in the nonstandardevaluation wrapper. At first this appears to work, but it runs into trouble when we try to compose nonstandardevaluation functions with each other.
tidy_add_one_nse_subs < function(df, res_var, input_var) { input_var < enquo(input_var) res_var < substitute(res_var) mutate(df, !!res_var := (!!input_var) + 1) } tidy_add_one_nse_subs(d, res, a) ## a res ## 1 1 2However this seemingly similar variation is not recomposable in the same manner.
tidy_increment_nse_subs < function(df, var) { var < enquo(var) tidy_add_one_nse_subs(df, !!var, !!var) } tidy_increment_nse_subs(d, a) ## Error: LHS must be a name or stringLikely there is some way to get this to work, but my point is:
 The obvious way didn’t work.
 Some NSE functions can’t be reused in standard NSE composition. You may not know which ones those are ahead of time. Presumably functions from major packages are sovetted, but you may not be able to trust "one off compositions" to be safe to recompose.
It is easy to specify the function we want with wrapr as follows (both using standard evaluation, and using nonstandard evaluation):
SE version library("wrapr") wrapr_add_one_se < function(df, res_var_name, input_var_name) { wrapr::let( c(RESVAR= res_var_name, INPUTVAR= input_var_name), df %>% mutate(RESVAR = INPUTVAR + 1) ) } wrapr_add_one_se(d, 'res', 'a') ## a res ## 1 1 2Standard composition:
wrapr_increment_se < function(df, var_name) { wrapr_add_one_se(df, var_name, var_name) } wrapr_increment_se(d, 'a') ## a ## 1 2 NSE versionNonstandard evaluation interface:
wrapr_add_one_nse < function(df, res_var, input_var) { wrapr::let( c(RESVAR= substitute(res_var), INPUTVAR= substitute(input_var)), df %>% mutate(RESVAR = INPUTVAR + 1) ) } wrapr_add_one_nse(d, res, a) ## a res ## 1 1 2wrapr::let()‘s NSE composition pattern seems to work even when applied to itself:
wrapr_increment_nse < function(df, var) { wrapr::let( c(VAR= substitute(var)), wrapr_add_one_nse(df, VAR, VAR) ) } wrapr_increment_nse(d, a) ## a ## 1 2 Abstract Syntax Tree VersionOr, if you are uncomfortable with macros being implemented through stringsubstitution one can use wrapr::let() in "language mode" (where it works directly on abstract syntax trees).
SE redo wrapr_add_one_se < function(df, res_var_name, input_var_name) { wrapr::let( c(RESVAR= res_var_name, INPUTVAR= input_var_name), df %>% mutate(RESVAR = INPUTVAR + 1), subsMethod= 'langsubs' ) } wrapr_add_one_se(d, 'res', 'a') ## a res ## 1 1 2 wrapr_increment_se < function(df, var_name) { wrapr_add_one_se(df, var_name, var_name) } wrapr_increment_se(d, 'a') ## a ## 1 2 NSE redo wrapr_add_one_nse < function(df, res_var, input_var) { wrapr::let( c(RESVAR= substitute(res_var), INPUTVAR= substitute(input_var)), df %>% mutate(RESVAR = INPUTVAR + 1), subsMethod= 'langsubs' ) } wrapr_add_one_nse(d, res, a) ## a res ## 1 1 2 wrapr_increment_nse < function(df, var) { wrapr::let( c(VAR= substitute(var)), wrapr_add_one_nse(df, VAR, VAR), subsMethod= 'langsubs' ) } wrapr_increment_nse(d, a) ## a ## 1 2 Conclusiontidyeval/rlang provides general tools to compose or daisychain nonstandardevaluation functions (i.e., write new nonstandardevaluation functions in terms of others. This tries to abrogate the issue that it can be hard to compose nonstandard function interfaces (i.e., one can not parameterize them or program over them without a tool such as tidyeval/rlang). In contrast wrapr::let() concentrates on standard evaluation, providing a tool that allows one to rewrap nonstandardevaluation interfaces as standard evaluation interfaces.
A lot of the tidyeval/rlang design is centered on treating variable names as lexical closures that capture an environment they should be evaluated in. This does make them more like general R functions (which also have this behavior).
However, creating so many longterm bindings is actually counter to some common data analyst practice.
The my_mutate(df, expr) example itself from vignette('programming', package = 'dplyr') even shows the pattern I am referring to: the analyst transiently pairs a chosen concrete data set to abstract variable names. One argument is the data and the other is the expression to be applied to that data (and only that data, with clean code not capturing values from environments).
Many methods are written expecting to be rerun on different data (for example predict()). This has the huge advantage that it documents your intent to change out what data is being applied (such as running a procedure twice, once on training data and once on future application data).
This is a principle we also strongly apply in our join controller which has no issue sharing variables out as an external spreadsheet, because it thinks of variable names (here meaning columns names) as fundamentally being strings (not as quosures temporally working "under cover" in string representations).
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...