Minimum CRPS vs. maximum likelihood
(This article was first published on Achim Zeileis, and kindly contributed to R-bloggers)
In a new paper in Monthly Weather Review, minimum CRPS and maximum likelihood estimation are compared for fitting heteroscedastic (or nonhomogeneous) regression models under different response distributions. Minimum CRPS is more robust to distributional misspecification while maximum likelihood is slightly more efficient under correct specification. An R implementation is available in the crch package.
Citation: Manuel Gebetsberger, Jakob W. Messner, Georg J. Mayr, Achim Zeileis (2018). "Estimation Methods for Nonhomogeneous Regression Models: Minimum Continuous Ranked Probability Score versus Maximum Likelihood." Monthly Weather Review, 146(12), 4323-4338. doi:10.1175/MWR-D-17-0364.1
Abstract: Nonhomogeneous regression models are widely used to statistically postprocess numerical ensemble weather prediction models. Such regression models are capable of forecasting full probability distributions and correcting for ensemble errors in the mean and variance. To estimate the corresponding regression coefficients, minimization of the continuous ranked probability score (CRPS) has widely been used in meteorological postprocessing studies and has often been found to yield more calibrated forecasts compared to maximum likelihood estimation. From a theoretical perspective, both estimators are consistent and should lead to similar results, provided the correct distribution assumption about empirical data. Differences between the estimated values indicate a wrong specification of the regression model. This study compares the two estimators for probabilistic temperature forecasting with nonhomogeneous regression, where results show discrepancies for the classical Gaussian assumption. The heavy-tailed logistic and Student's t distributions can improve forecast performance in terms of sharpness and calibration, and lead to only minor differences between the estimators employed. Finally, a simulation study confirms the importance of appropriate distribution assumptions and shows that for a correctly specified model the maximum likelihood estimator is slightly more efficient than the CRPS estimator.
Software: https://CRAN.R-project.org/package=crch
The function crch() provides heteroscedastic (or nonhomogeneous) regression models for "gaussian" (i.e., normally distributed), "logistic", or "student" (i.e., t-distributed) response variables. Additionally, responses may be censored or truncated. Estimation methods include maximum likelihood (type = "ml", the default) and minimum CRPS (type = "crps"). Boosting can also be employed for model fitting (instead of full optimization). CRPS computations leverage the excellent scoringRules package.
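A minimal sketch of what this looks like in practice, with simulated data: the variable names and coefficient values are illustrative, not from the paper, and the fitting step assumes the crch package is installed.

```r
## Simulate heteroscedastic Gaussian data: location depends on x, scale on z.
set.seed(1)
n <- 500
x <- rnorm(n)
z <- runif(n)
mu    <- 1 + 2 * x
sigma <- exp(0.5 + 0.3 * z)
y <- rnorm(n, mean = mu, sd = sigma)

## Fit the same model by maximum likelihood and by minimum CRPS;
## the "|" separates the location part from the (log-)scale part.
if (requireNamespace("crch", quietly = TRUE)) {
  library(crch)
  fit_ml   <- crch(y ~ x | z, dist = "gaussian", type = "ml")
  fit_crps <- crch(y ~ x | z, dist = "gaussian", type = "crps")
  print(rbind(ml = coef(fit_ml), crps = coef(fit_crps)))
}
```

Under this correctly specified Gaussian model, both sets of coefficients should be close, matching the paper's point about consistency of the two estimators.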
Illustration: The plots below show histograms of the PIT (probability integral transform) for various nonhomogeneous regression models yielding probabilistic 1-day-ahead temperature forecasts at an Alpine site (Innsbruck). When the probabilistic forecasts are perfectly calibrated to the actual observations, the PIT histograms should form a straight line at density 1. The gray area illustrates the 95% consistency interval around perfect calibration, and binning is based on 5% intervals.
When a normally distributed or Gaussian response is assumed (left panel), the maximum-likelihood model (solid line) is not well calibrated because the tails are not heavy enough. (The legend denotes this "LS" because maximizing the likelihood is equivalent to minimizing the so-called log-score.) In contrast, the minimum-CRPS model is reasonably well calibrated.
When assuming a Student-t response (right panel) there is little deviation between the two estimation techniques, and both are well calibrated.
Thus, the differences between CRPS- and ML-based estimation with a Gaussian response come from assuming a distribution whose tails are not heavy enough. In this situation, minimum CRPS yields the somewhat more robust model fit, while both estimation techniques lead to very similar results if a more suitable response distribution is adopted. In the latter case, ML is slightly more efficient than minimum CRPS.
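The PIT itself is simple to compute: it is the forecast CDF evaluated at the observation. A base-R sketch with made-up numbers (not the Innsbruck data) shows why an under-dispersive forecast, i.e. one whose tails are too light, produces a non-flat histogram:

```r
set.seed(42)
obs <- rnorm(1000, mean = 10, sd = 3)   # pretend observations

## forecast distribution matches the truth: PIT values ~ Uniform(0, 1)
pit_good <- pnorm(obs, mean = 10, sd = 3)
## forecast sd too small (tails too light): PIT histogram becomes U-shaped
pit_bad  <- pnorm(obs, mean = 10, sd = 2)

hist(pit_good, breaks = seq(0, 1, by = 0.05), freq = FALSE,
     main = "calibrated forecast")
hist(pit_bad,  breaks = seq(0, 1, by = 0.05), freq = FALSE,
     main = "under-dispersive forecast")
```

The 5% binning mirrors the histograms described above; with the too-light-tailed forecast, observations land in the forecast tails too often, piling PIT values up near 0 and 1.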
To leave a comment for the author, please follow the link and comment on their blog: Achim Zeileis. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
linl 0.0.3: Micro release
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
Our linl package for writing LaTeX letters with (R)markdown had a fairly minor release today, following up on the previous release well over a year ago. This version contains just one change, which Mark van der Loo provided a few months ago with a clean PR. As another user was just bitten by the same issue when using an included letterhead – which was fixed but unreleased – we decided it was time for a release. So there it is.
linl makes it easy to write letters in markdown, with some extra bells and whistles thanks to some cleverness chiefly by Aaron.
Here is a screenshot of the vignette showing the simple input for some moderately fancy output:
The NEWS entry follows:
Changes in linl version 0.0.3 (2018-12-15)

- Correct LaTeX double loading of package color with different options (Mark van der Loo in #18 fixing #17).
Courtesy of CRANberries, there is a comparison to the previous release. More information is on the linl page. For questions or comments use the issue tracker off the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
Request for comments on planned features for futile.logger 1.5
(This article was first published on R – Cartesian Faith, and kindly contributed to R-bloggers)
I will be pushing a new version of futile.logger (version 1.5) to CRAN in January. This version introduces a number of enhancements and fixes some bugs. It will also contain at least one breaking change. I am making the release process public, since the package is now used in a number of other packages. If you use futile.logger, this is your opportunity to influence the direction of the package and prepare for changes. Please use the GitHub issue tracker for discussion.
There are two main themes for the enhancements: 1) integration with R's signal (condition) system, and 2) support for multiple appenders with different thresholds.
Hijacking the signal system

Currently, futile.logger is unaware of the signal system in R. The only tie-in is with ftry, which catches a warning or error and prints a message. Unfortunately, the behavior is different from try. This release will make ftry consistent with try. This is a breaking change, so if you use ftry in your code, it will no longer halt processing.
In addition to this fix, futile.logger will now have better integration with the signal system. Currently, if a function emits a warning or an error, these are printed to the screen. It would be convenient if futile.logger could capture these signals and associate them with the correct log levels, i.e. warning to WARN and error to ERROR.
My proposal is to create a hijack_signal_handlers function to override the existing signal handlers. This function can be called at the top of a script or within the .onLoad function of a package. Once called, any warnings or errors would be captured and handled by futile.logger. The implementation would look like this, giving granular control over whether to hijack just warnings, just errors, or both:
```r
hijack_signal_handlers <- function(warning = TRUE, error = TRUE) {
  if (warning) {
    # Override warning handler
  }
  if (error) {
    # Override error handler
  }
}
```
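For intuition, the capturing idea can be sketched in base R with withCallingHandlers(); log_warn here is a stand-in logger, not futile.logger's actual internals.

```r
## Stand-in logger: prints a timestamped WARN line (not futile.logger's API).
log_warn <- function(msg) {
  cat(sprintf("WARN [%s] %s\n", format(Sys.time()), msg))
}

## Evaluate an expression, routing any warning to the logger instead of
## letting R print it; the "muffleWarning" restart suppresses default output.
with_logged_warnings <- function(expr) {
  withCallingHandlers(expr,
    warning = function(w) {
      log_warn(conditionMessage(w))   # map warning -> WARN level
      invokeRestart("muffleWarning")
    })
}

with_logged_warnings(log(-1))   # the "NaNs produced" warning is logged
```

A package-wide hijack, as proposed above, would need to install such a handler globally rather than per expression, which is exactly where the .onLoad question arises.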
One issue I see with this function is when used in a package’s .onLoad. Suppose a user requires package A and B. These don’t use futile.logger. Now the user requires package C, which calls hijack_signal_handlers in its .onLoad. When this occurs, warnings and errors emitted from packages A and B would also be captured by futile.logger. From my perspective, this is probably a good thing, but I can appreciate why others may not want this behavior.
Emitting signals

The other half of the signal handler puzzle is being able to emit signals from futile.logger. For this case, we want flog.warn to emit a warning signal and flog.error to emit an error signal. One signature looks like
```r
flog.warn(message, emit = TRUE)
```

meaning that by default no signal is emitted. This would work for flog.warn, flog.error, and flog.fatal (which could map to error as well).
I have mixed feelings about this use case. Part of me says if futile.logger hijacks the signal system, then just use stop and futile.logger will catch it. On the other hand that seems roundabout and less capable than writing flog.error(message, emit=TRUE) or whatever.
My main concern is what happens when these two systems work together: will they play nicely, or will an error be emitted twice (likely)? If they don't, there is more logic that has to be built in, which ultimately adds complexity and wastes cycles. Any input here is encouraged!
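For concreteness, a hypothetical sketch of the proposed behavior — this is not the released futile.logger API, just the shape of it: log first, then optionally raise the matching condition.

```r
## Hypothetical stand-in for the proposed flog.warn(message, emit = ...):
## always write the log line, and only signal a real warning when asked.
flog_warn_sketch <- function(msg, emit = FALSE) {
  cat("WARN", msg, "\n")
  if (emit) warning(msg, call. = FALSE)   # emit an actual warning condition
  invisible(NULL)
}

flog_warn_sketch("disk nearly full")                 # log only (default)
# flog_warn_sketch("disk nearly full", emit = TRUE)  # log AND signal
```

With this shape, the double-emission question becomes concrete: if the hijacked warning handler also logs, an emit = TRUE call would log twice unless the handler checks where the condition came from.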
Six Sigma DMAIC Series in R – Part 4
(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)
Hope you liked Part 1, Part 2 and Part 3 of this series. In this Part 4, we will go through the tools used during the Improve phase of the Six Sigma DMAIC cycle. The most representative tool used during the Improve phase is DOE (Design of Experiments). Proper use of DOE can lead to process improvement, but a badly designed experiment can lead to undesired results: inefficiency and higher costs.
What is an experiment?

An experiment is a test or series of tests in which purposeful changes are made to the input variables of a process or system so that we may observe and identify the reasons for changes in the output response.
Consider the example of the simple process model, shown below, where we have controlled input factors X’s, output Y and uncontrolled factors Z’s.
The objective of the experiment can be:
 Process Characterization: To know the relationship between X, Y & Z.
 Process Control: Capture changes in X, so that Y is always near the desired nominal value.
 Process Optimization: Capture changes in X, so that the variability of Y is minimized.
 Robust Design: Capture changes in X, so that effect of uncontrolled Z is minimized.
Experiments allow us to control the values of the X's of the process and then measure the value of the Y's, to discover what values of the independent variables will allow us to improve the performance of our process. In contrast, in observational studies we have no influence on the variables we are measuring; we just collect the data and use the appropriate statistical technique.
There are some risks when the analysis is based on data gathered directly during the normal operation of the process: inconsistent data, a limited range of variable values (the behavior of the X's outside the observed range is unknown), and correlated variables.
Some of the Characteristics of wellplanned experiments are:
 The degree of precision: the probability should be high that the experiment can measure differences with the degree of precision the experimenter desires. This implies an appropriate design and sufficient replication.
 Simplicity: as simple as possible, consistent with the objectives of the experiment.
 The absence of systematic error: units receiving one treatment should not differ in any systematic way from those receiving another treatment.
 The range of validity of conclusions: experiments replicated in time and space increase the range of validity of the conclusions.
 Calculation of the degree of uncertainty: the possibility of obtaining the observed results by chance alone.
The three basic principles of experiments are randomization, replication and blocking.
Let's understand this through an example. A food manufacturer is searching for the best recipe for its main product: a pizza dough. The managers decided to perform an experiment to determine the optimal levels of the three main ingredients in the pizza dough: flour, salt, and baking powder. The other ingredients are fixed, as they do not affect the flavor of the final cooked pizza. The flavor of the product will be determined by a panel of experts who will give a score to each recipe. Therefore, we have three factors that we will call flour, salt, and baking powder (bakPow), with two levels each (- and +).
```r
pizzaDesign <- expand.grid(flour  = gl(2, 1, labels = c("-", "+")),
                           salt   = gl(2, 1, labels = c("-", "+")),
                           bakPow = gl(2, 1, labels = c("-", "+")),
                           score  = NA)
```

Now we have eight different experiments (recipes), including all the possible combinations of the three factors at two levels.
When we have more than 2 factors, the combination of levels of different factors may affect the response. Therefore, to discover the main effects and the interactions, we should vary more than one level at a time, performing experiments in all the possible combinations.
Two-level factorial experiments are widely used because the cost of the experiment grows quickly as the number of factor levels increases. To study variation under the same experimental conditions, replication is needed, i.e., more than one trial per factor combination. The number of replications depends on several aspects (e.g., the budget).
Once an experiment has been designed, we will proceed with its randomization.
```r
pizzaDesign$ord <- sample(1:8, 8)
pizzaDesign[order(pizzaDesign$ord), ]
```

Each time you repeat the command you get a different order due to randomization.
2^k factorial designs

2^k factorial designs are those in which k factors are studied, all of them at 2 levels. The number of experiments we need to carry out to obtain one complete replication is precisely 2^k. If we want n replications of the experiment, then the total number of experiments is n × 2^k.
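The run count is easy to verify in R:

```r
## Number of runs in a full 2^k factorial with n replications: n * 2^k.
k <- 3   # factors: flour, salt, baking powder
n <- 2   # replications
runs <- n * 2^k   # 16 pizzas to bake

## the same count falls out of expand.grid over the three two-level factors
design <- expand.grid(flour  = c("-", "+"),
                      salt   = c("-", "+"),
                      bakPow = c("-", "+"))
nrow(design) * n   # 16
```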
ANOVA can be used to estimate the effect of each factor and interaction and to assess which of these effects are significant.
Example contd.: The experiment is carried out by preparing the pizzas at the factory following the package instructions, namely: "bake the pizza for 9 min in an oven at 180°C."
After a blind trial is conducted, the experts gave scores to each of the eight (2^3) recipes in each replication of the experiment.
```r
ss.data.doe1 <- data.frame(repl = rep(1:2, each = 8),
                           rbind(pizzaDesign[, 1:3], pizzaDesign[, 1:3]))
ss.data.doe1$score <- c(5.33, 6.99, 4.23, 6.61, 2.26, 5.75, 3.26, 6.24,
                        5.70, 7.71, 5.13, 6.76, 2.79, 4.57, 2.48, 6.18)
```

The average for each recipe can be calculated as below:
```r
aggregate(score ~ flour + salt + bakPow, FUN = mean, data = ss.data.doe1)
```

The best recipe seems to be the one with a high level of flour and a low level of salt and baking powder. We fit a linear model and perform an ANOVA to find the significant effects.
```r
doe.model1 <- lm(score ~ flour + salt + bakPow + flour * salt +
                   flour * bakPow + salt * bakPow + flour * salt * bakPow,
                 data = ss.data.doe1)
summary(doe.model1)
```

The p-values show that the main effects of flour and baking powder are significant, while the effect of salt is not. Neither the two-way nor the three-way interactions are significant. Thus we can simplify the model by excluding the non-significant effects. The new model with only the significant effects is:
```r
doe.model2 <- lm(score ~ flour + bakPow, data = ss.data.doe1)
summary(doe.model2)
```

Therefore, the statistical model for our experiment is
score = 4.8306 + 2.4538 × flour - 1.8662 × bakPow
Thus, the recipe with a high level of flour and a low level of baking powder will be the best one, regardless of the level of salt (high or low). The estimated score for this recipe is
`4.8306 + 2.4538 × 1 + (−1.8662) × 0 = 7.284.`
The predict() function can be used to get the estimates for all the experimental conditions.
```r
predict(doe.model2)
```

Visualize the effect plot and interaction plot

The ggplot2 package can be used to visualize the effect plot. The effect of flour is positive while the effect of baking powder is negative.
```r
library(ggplot2)

prinEf <- data.frame(Factor = rep(c("A_Flour", "C_Baking Powder"), each = 2),
                     Level  = rep(c(-1, 1), 2),
                     Score  = c(aggregate(score ~ flour, FUN = mean,
                                          data = ss.data.doe1)[, 2],
                                aggregate(score ~ bakPow, FUN = mean,
                                          data = ss.data.doe1)[, 2]))
p <- ggplot(prinEf, aes(x = Level, y = Score)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = mean(ss.data.doe1$score),
             linetype = "dashed", color = "blue") +
  scale_x_continuous(breaks = c(-1, 1)) +
  facet_grid(. ~ Factor) +
  ggtitle("Plot of Factor Effects")
print(p)
```

The interaction plot is shown below. The lines do not cross, which means there is no interaction between the factors plotted.
```r
intEf <- aggregate(score ~ flour + bakPow, FUN = mean, data = ss.data.doe1)
q <- ggplot(intEf, aes(x = flour, y = score, color = bakPow)) +
  geom_point() +
  geom_line(aes(group = bakPow)) +
  geom_hline(yintercept = mean(ss.data.doe1$score),
             linetype = "dashed", color = "blue") +
  ggtitle("Interaction Plot")
print(q)
```

The normality of the residuals can be checked with the Shapiro-Wilk test. As the p-value is large, we fail to reject the null hypothesis of normality of the residuals.
```r
shapiro.test(residuals(doe.model2))
```

This was a brief introduction to DOE in R.
In the next part, we will go through the Control phase of the Six Sigma DMAIC process. Please let me know your feedback in the comments section. Make sure to like & share it. Happy learning!
Related Post
 NYC buses: company level predictors with R
 Visualizations for correlation matrices in R
 Interpretation of the AUC
 Simple Experiments with Smoothed Scatterplots
 Understanding the Covariance Matrix
Day 15 – little helper sci_palette
(This article was first published on R-bloggers – STATWORX, and kindly contributed to R-bloggers)
We at STATWORX work a lot with R and we often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. At first, there was no plan to make a package, but soon I realised that it would be much easier to share and improve those functions if they were within a package. Up till the 24th of December I will present one function each day from helfRlein. So, on the 15th day of Christmas my true love gave to me…
What can it do?

This little helper returns a set of colors which we often use at STATWORX. So, if – like me – you cannot remember each hex color code you need, this might help. Of course these are our colours, but you could rewrite it with your own palette. The main benefit, though, is the plotting method – you can see the color instead of only reading the hex code.
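To illustrate the pattern, here is a made-up mini version of such a palette function with an S3 plot() method — this is not helfRlein's actual code, and the hex values are just borrowed from the STATWORX palette:

```r
## Tiny palette function: a named character vector with its own S3 class.
my_palette <- function() {
  x <- c(main = "#013848", accent = "#0085AF", highlight = "#FF8000")
  class(x) <- "sci"
  x
}

## plot method: draw each colour as a block, so you see it instead of
## only reading the hex code
plot.sci <- function(x, ...) {
  barplot(rep(1, length(x)), col = x, names.arg = names(x),
          axes = FALSE, border = NA, ...)
}

plot(my_palette())   # dispatches to plot.sci via the "sci" class
```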
How to use it?

To see which hex code corresponds to which colour, and what it is used for:
```r
sci_palette()
##   main_color accent_color_1 accent_color_2 accent_color_3  highlight      black 
##    "#013848"      "#0085AF"      "#00A378"      "#09557F"  "#FF8000"  "#000000" 
##         text         grey_2     light_gray        special 
##    "#696969"      "#D9D9D9"      "#F8F8F8"      "#C62F4B" 
## attr(,"class")
## [1] "sci"
```

As mentioned above, there is a plot() method which gives the following picture.
```r
plot(sci_palette())
```

Overview

To see all the other functions you can either check out our GitHub or read about them here.
Have a merry advent season!
About the author: Jakob Gepp

About us: STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI.
The post Day 15 – little helper sci_palette first appeared on STATWORX.
Manipulate dates easily with {lubridate}
(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)
This blog post is an excerpt of my ebook Modern R with the tidyverse that you can read for
free here. This is taken from Chapter 5, which presents
the {tidyverse} packages and how to use them to compute descriptive statistics and manipulate data.
In the text below, I scrape a table from Wikipedia, which shows when African countries gained
independence from other countries. Then, using {lubridate} functions, I show you how you can
answer questions such as "Which countries gained independence before 1960?".
{lubridate} is yet another tidyverse package that makes dealing with dates or duration data
(and intervals) as painless as possible. I do not use every function contained in the package
daily, and as such will only focus on some of the functions. However, if you have to deal with
dates often, you might want to explore the package thoroughly.
Let’s get some data from a Wikipedia table:
```r
library(tidyverse)
library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/Decolonisation_of_Africa")

independence <- page %>%
  html_node(".wikitable") %>%
  html_table(fill = TRUE)

independence <- independence %>%
  select(-Rank) %>%
  map_df(~str_remove_all(., "\\[.*\\]")) %>%
  rename(country = `Country[a]`,
         colonial_name = `Colonial name`,
         colonial_power = `Colonial power[b]`,
         independence_date = `Independence date`,
         first_head_of_state = `First head of state[d]`,
         independence_won_through = `Independence won through`)
```

This dataset was scraped from the following Wikipedia table.
It shows when African countries gained independence from which colonial powers. In Chapter 11, I
will show you how to scrape Wikipedia pages using R. For now, let’s take a look at the contents
of the dataset:
As you can see, the date of independence is in a format that might make it difficult to answer questions
such as "Which African countries gained independence before 1960?" for two reasons. First of all,
the date uses the name of the month instead of the number of the month (well, this is not such a
big deal, but still), and second of all, the type of
the independence date column is character and not "date". So our first task is to correctly define the column
as being of type date, while making sure that R understands that January is supposed to be "01", and so
on.
There are several helpful functions included in {lubridate} to convert columns to dates. For instance,
if the column you want to convert is of the form "2012-11-21", then you would use the function ymd(),
for "year-month-day". If, however, the column is "2012-21-11", then you would use ydm(). There are
a few of these helper functions, and they can handle a lot of different formats for dates. In our case,
having the name of the month instead of the number might seem quite problematic, but it turns out
that this is a case that {lubridate} handles painlessly:
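The conversion step itself is a dmy() call, since the table stores dates as "day month year" strings. A toy sketch (assuming {lubridate} is installed; on the real table the call would presumably be wrapped in mutate()):

```r
library(lubridate)

## dmy() parses "day month year" strings, month names included
dmy(c("26 July 1847", "31 May 1910"))

## strings that are not a single date fail to parse -> NA (with a warning)
suppressWarnings(dmy("not a single date"))
```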
Some dates failed to parse, for instance for Morocco. This is because these countries have several
independence dates; this means that the string to convert looks like:
which obviously cannot be converted by {lubridate} without further manipulation. I ignore these cases for
simplicity’s sake.
Let’s take a look at the data now:
```r
independence
## # A tibble: 54 x 6
##    country colonial_name colonial_power independence_da… first_head_of_s…
##  1 Liberia Liberia       United States  1847-07-26       Joseph Jenkins …
##  2 South … Cape Colony … United Kingdom 1910-05-31       Louis Botha
##  3 Egypt   Sultanate of… United Kingdom 1922-02-28       Fuad I
##  4 Eritrea Italian Erit… Italy          1947-02-10       Haile Selassie
##  5 Libya   British Mili… United Kingdo… 1951-12-24       Idris
##  6 Sudan   Anglo-Egypti… United Kingdo… 1956-01-01       Ismail al-Azhari
##  7 Tunisia French Prote… France         1956-03-20       Muhammad VIII a…
##  8 Morocco French Prote… France Spain   NA               Mohammed V
##  9 Ghana   Gold Coast    United Kingdom 1957-03-06       Kwame Nkrumah
## 10 Guinea  French West … France         1958-10-02       Ahmed Sékou Tou…
## # ... with 44 more rows, and 1 more variable:
## #   independence_won_through
```

As you can see, we now have a date column in the right format. We can now answer questions such as
Which countries gained independence before 1960? quite easily, by using the functions year(),
month() and day(). Let’s see which countries gained independence before 1960:
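As a sketch, here is such a query on a small hand-typed subset (dates taken from the output above, plus Nigeria's 1960-10-01, which is not in the excerpt):

```r
library(lubridate)

ind <- data.frame(
  country = c("Liberia", "South Africa", "Libya", "Ghana", "Nigeria"),
  independence_date = as.Date(c("1847-07-26", "1910-05-31",
                                "1951-12-24", "1957-03-06",
                                "1960-10-01"))   # Nigeria: added for contrast
)

## year() turns the Date column into numbers we can compare against 1960
ind[year(ind$independence_date) < 1960, "country"]
```

On the full scraped table the same filter works unchanged, just with 54 rows instead of 5.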
You guessed it, year() extracts the year of the date column and converts it to a numeric so that we can work
with it. The same goes for month() and day(). Let's try to see if countries gained their independence on
Christmas Eve:
Seems like Libya was the only one! You can also operate on dates. For instance, let's compute the difference between
two dates using the interval() function:
The independent_since column now contains an interval object that we can convert to years:
```r
independence %>%
  mutate(today = lubridate::today()) %>%
  mutate(independent_since = interval(independence_date, today)) %>%
  select(country, independent_since) %>%
  mutate(years_independent = as.numeric(independent_since, "years"))
## # A tibble: 54 x 3
##    country      independent_since              years_independent
##  1 Liberia      1847-07-26 UTC--2018-12-15 UTC             171. 
##  2 South Africa 1910-05-31 UTC--2018-12-15 UTC             109. 
##  3 Egypt        1922-02-28 UTC--2018-12-15 UTC              96.8
##  4 Eritrea      1947-02-10 UTC--2018-12-15 UTC              71.8
##  5 Libya        1951-12-24 UTC--2018-12-15 UTC              67.0
##  6 Sudan        1956-01-01 UTC--2018-12-15 UTC              63.0
##  7 Tunisia      1956-03-20 UTC--2018-12-15 UTC              62.7
##  8 Morocco      NA--NA                                      NA  
##  9 Ghana        1957-03-06 UTC--2018-12-15 UTC              61.8
## 10 Guinea       1958-10-02 UTC--2018-12-15 UTC              60.2
## # ... with 44 more rows
```

We can now see for how long the last country to gain independence has been independent.
Because the data is not tidy (in some cases, an African country was colonized by two powers,
see Libya), I will only focus on 4 European colonial powers: Belgium, France, Portugal and the United Kingdom:
{lubridate} contains many more functions. If you often work with dates, duration or interval data, {lubridate}
is a package that you have to master.
Hope you enjoyed! If you found this blog post useful, you might want to follow
me on twitter for blog post updates and
buy me an espresso or paypal.me.
Learning R: A gentle introduction to higher-order functions
(This article was first published on R-bloggers – Learning Machines, and kindly contributed to R-bloggers)
Have you ever thought about why the definition of a function in R is different from many other programming languages? The part that causes the biggest difficulties (especially for beginners of R) is that you state the name of the function at the beginning and use the assignment operator – as if functions were like any other data type, like vectors, matrices or data frames…
Congratulations! You just encountered one of the big ideas of functional programming: functions are indeed like any other data type, they are not special – or in programming lingo, functions are first-class members. Now, you might ask: So what? Well, there are many ramifications, for example that you could use functions on other functions by using one function as an argument for another function. Sounds complicated?
In mathematics most of you will be familiar with taking the derivative of a function. When you think about it you could say that you put one function into the derivative function (or operator) and get out another function!
In R there are many applications as well, let us go through a simple example step by step.
Let’s say I want to apply the mean function on the first four columns of the iris dataset. I could do the following:
```r
mean(iris[ , 1])
## [1] 5.843333
mean(iris[ , 2])
## [1] 3.057333
mean(iris[ , 3])
## [1] 3.758
mean(iris[ , 4])
## [1] 1.199333
```

Quite tedious and not very elegant. Of course, we can use a for loop for that:
```r
for (x in iris[1:4]) {
  print(mean(x))
}
## [1] 5.843333
## [1] 3.057333
## [1] 3.758
## [1] 1.199333
```

This works fine but there is an even more intuitive approach. Just look at the original task: "apply the mean function on the first four columns of the iris dataset" – so let us do just that:
```r
apply(iris[1:4], 2, mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333
```

Wow, this is very concise and works perfectly (the 2 just stands for "go through the data column-wise"; 1 would be for "row-wise"). apply() is called a "higher-order function" and we could use it with all kinds of other functions:
```r
apply(iris[1:4], 2, sd)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.8280661    0.4358663    1.7652982    0.7622377
apply(iris[1:4], 2, min)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          4.3          2.0          1.0          0.1
apply(iris[1:4], 2, max)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          7.9          4.4          6.9          2.5
```

You can also use user-defined functions:
```r
midrange <- function(x) (min(x) + max(x)) / 2
apply(iris[1:4], 2, midrange)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##         6.10         3.20         3.95         1.30
```

We can even use new functions that are defined "on the fly" (or in functional programming lingo, "anonymous functions"):
apply(iris[1:4], 2, function(x) (min(x) + max(x)) / 2)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##         6.10         3.20         3.95         1.30

Let us now switch to another built-in data set, the mtcars dataset with 11 different variables of 32 cars (if you want to find out more, please consult the documentation):
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

To see the power of higher-order functions, let us create a (numeric) matrix with the minimum, first quartile, median, mean, third quartile and maximum for all 11 columns of the mtcars dataset with just one command!
apply(mtcars, 2, summary)
##              mpg    cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb
## Min.    10.40000 4.0000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000
## 1st Qu. 15.42500 4.0000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000
## Median  19.20000 6.0000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000
## Mean    20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
## 3rd Qu. 22.80000 8.0000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000
## Max.    33.90000 8.0000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000

Wow, that was easy and the result is quite impressive, is it not?
Or if you want to perform a linear regression for all ten variables separately against mpg and want to get a table with all coefficients – there you go:
sapply(mtcars, function(x) round(coef(lm(mpg ~ x, data = mtcars)), 3))
##             mpg    cyl   disp     hp   drat     wt   qsec     vs     am  gear   carb
## (Intercept)   0 37.885 29.600 30.099 -7.525 37.285 -5.114 16.617 17.147 5.623 25.872
## x             1 -2.876 -0.041 -0.068  7.678 -5.344  1.412  7.940  7.245 3.923 -2.056

Here we used another higher-order function, sapply, together with an anonymous function. sapply goes through all the columns of a data frame (i.e. elements of a list) and tries to simplify the result (here you get back a nice matrix).
Often, you might not even have realised that you were using higher-order functions! I can tell you that it is quite a hassle in many programming languages to program a simple function plotter, i.e. a function which plots another function. In R it has already been done for you: you just use the higher-order function curve and give it the function you want to plot as an argument:
curve(sin(x) + cos(1/2 * x), -10, 10)

I want to give you one last example of another very helpful higher-order function (which not too many people know or use): by. It comes in very handy when you want to apply a function to different attributes split by a factor. So let’s say you want to get a summary of all the attributes of iris split by (!) species – here it comes:
by(iris[1:4], iris$Species, summary)
## iris$Species: setosa
##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width
##  Min.   :4.300  Min.   :2.300  Min.   :1.000  Min.   :0.100
##  1st Qu.:4.800  1st Qu.:3.200  1st Qu.:1.400  1st Qu.:0.200
##  Median :5.000  Median :3.400  Median :1.500  Median :0.200
##  Mean   :5.006  Mean   :3.428  Mean   :1.462  Mean   :0.246
##  3rd Qu.:5.200  3rd Qu.:3.675  3rd Qu.:1.575  3rd Qu.:0.300
##  Max.   :5.800  Max.   :4.400  Max.   :1.900  Max.   :0.600
## ------------------------------------------------------------
## iris$Species: versicolor
##   Sepal.Length   Sepal.Width    Petal.Length  Petal.Width
##  Min.   :4.900  Min.   :2.000  Min.   :3.00  Min.   :1.000
##  1st Qu.:5.600  1st Qu.:2.525  1st Qu.:4.00  1st Qu.:1.200
##  Median :5.900  Median :2.800  Median :4.35  Median :1.300
##  Mean   :5.936  Mean   :2.770  Mean   :4.26  Mean   :1.326
##  3rd Qu.:6.300  3rd Qu.:3.000  3rd Qu.:4.60  3rd Qu.:1.500
##  Max.   :7.000  Max.   :3.400  Max.   :5.10  Max.   :1.800
## ------------------------------------------------------------
## iris$Species: virginica
##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width
##  Min.   :4.900  Min.   :2.200  Min.   :4.500  Min.   :1.400
##  1st Qu.:6.225  1st Qu.:2.800  1st Qu.:5.100  1st Qu.:1.800
##  Median :6.500  Median :3.000  Median :5.550  Median :2.000
##  Mean   :6.588  Mean   :2.974  Mean   :5.552  Mean   :2.026
##  3rd Qu.:6.900  3rd Qu.:3.175  3rd Qu.:5.875  3rd Qu.:2.300
##  Max.   :7.900  Max.   :3.800  Max.   :6.900  Max.   :2.500

This was just a very shy look at this huge topic. There are more very powerful higher-order functions in R, like lapply, aggregate, replicate (very handy for numerical simulations) and many more. A good overview can be found in the answers to this question on Stack Overflow (my answer there is on the rather elusive switch function).
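There is no magic behind such higher-order functions; a toy re-implementation of a column-wise apply (col_apply is a made-up name here, not from base R) shows the mechanics:

```r
# A toy re-implementation of a column-wise apply, to show that a
# higher-order function is just a function taking a function argument.
col_apply <- function(df, f) {
  result <- numeric(ncol(df))
  names(result) <- names(df)
  for (i in seq_along(df)) {
    result[i] <- f(df[[i]])  # call the passed-in function on each column
  }
  result
}

col_apply(iris[1:4], mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##     5.843333     3.057333     3.758000     1.199333
```

The function f is just another argument, which is all "higher-order" means.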
For some reason people tend to confuse higher-order functions with recursive functions, but that is the topic of another post, so stay tuned…
To leave a comment for the author, please follow the link and comment on their blog: R-bloggers – Learning Machines. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
In case you missed it: November 2018 roundup
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
In case you missed them, here are some articles from November of particular interest to R users.
David Gerard assesses the plausibility of a key plot point in 'Jurassic Park' with simulations in R.
In-database R is available in Azure SQL Database for private preview.
Introducing AzureR, a new suite of R packages for managing Azure resources in R.
The AzureRMR package provides an interface for Resource Manager.
Roundup of AI, Machine Learning and Data Science news from November 2018.
You can now use the AI capabilities of Microsoft Cognitive Services within a container you host.
A look back at some of the R applications presented at the EARL conference in Seattle.
Slides and notebooks from my ODSC workshop, AI for Good.
T-Mobile uses AI models implemented with R to streamline customer service.
A guide to R packages for importing and working with US Census data.
Azure Machine Learning Studio, the online drag-and-drop data analysis tool, upgrades its R support.
And some general interest stories (not necessarily related to R):
- How the planets would look in the sky, if they were as close as the Moon.
- Pavarotti and Freddie Mercury, impersonated by the same performer.
- The real-world physics of the sci-fi series The Expanse.
- Real robots and simulated people, dancing.
- The different ways human languages write animal sounds.
As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
Day 14 – little helper print_fs
(This article was first published on rbloggers – STATWORX, and kindly contributed to Rbloggers)
We at STATWORX work a lot with R and we often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. At first, there was no plan to make a package, but soon I realised that it would be much easier to share and improve those functions if they were within a package. Up until the 24th of December I will present one function each day from helfRlein. So, on the 14th day of Christmas my true love gave to me…
What can it do? This little helper returns the folder structure of a given path. With this, one can for example add a nice overview to the documentation of a project or within a Git repository. For the sake of automation, this function could run and change parts within a log or news file after a major change.
How to use it? If we take a look at the same example we used for the get_network function on day 5, we get the following:
print_fs("~/flowchart/", depth = 4)
1   flowchart
2    ¦--create_network.R
3    ¦--getnetwork.R
4    ¦--plots
5    ¦   ¦--example-network-helfRlein.png
6    ¦   °--improved-network.png
7    ¦--R_network_functions
8    ¦   ¦--dataprep
9    ¦   ¦   °--foo_01.R
10   ¦   ¦--method
11   ¦   ¦   °--foo_02.R
12   ¦   ¦--script_01.R
13   ¦   °--script_02.R
14   °--README.md

With depth we can adjust how deep we want to traverse through our folders.
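print_fs itself comes from helfRlein; for readers without the package, a rough base-R approximation of the idea (list a tree, indent entries by their depth) might look like this, demonstrated on a temporary folder:

```r
# Build a tiny throwaway folder so the sketch is self-contained.
root <- file.path(tempdir(), "flowchart")
dir.create(file.path(root, "plots"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(root, "create_network.R")))

# List everything (files and directories) and indent by nesting depth.
paths  <- list.files(root, recursive = TRUE, include.dirs = TRUE)
depths <- lengths(strsplit(paths, "/")) - 1
cat(paste0(strrep("  ", depths), basename(paths)), sep = "\n")
```

The real helper additionally draws the tree glyphs and respects the depth argument.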
Overview: To see all the other functions, you can either check out our GitHub or you can read about them here.
Have a merry advent season!
About the author: Jakob Gepp. ABOUT US: STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI.
Sign Up Now!
The post Day 14 – little helper print_fs first appeared on STATWORX.
To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX.
running plot [and simulated annealing]
(This article was first published on R – Xi'an's Og, and kindly contributed to Rbloggers)
Last weekend, I found a way to run updated plots within a loop in R, as calling plot() within the loop never updated in real time. The above suggestion of including a Sys.sleep(0.25) worked perfectly on a simulated annealing example for determining the most dispersed points in a unit disc.
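A minimal sketch of the trick (the points plotted here are random placeholders, not the simulated annealing example):

```r
# Redraw a plot inside a loop; without the pause, many graphics
# devices only show the final frame.
for (i in 1:5) {
  pts <- matrix(runif(10), ncol = 2)  # 5 random points in the unit square
  plot(pts, xlim = c(0, 1), ylim = c(0, 1),
       main = paste("iteration", i))
  Sys.sleep(0.25)  # give the graphics device time to refresh
}
```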
To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.
Easy CI/CD of GPU applications on Google Cloud, including bare metal, using GitLab and Kubernetes
(This article was first published on Angel Sevilla Camins' Blog, and kindly contributed to Rbloggers)
Summary: Are you a data scientist who only wants to focus on modelling and coding, and not on setting up a GPU cluster? Then this blog might be interesting for you. We developed an automated pipeline using GitLab and Kubernetes that is able to run code in two GPU environments, GCP and bare metal; no need to worry about drivers, Kubernetes cluster creation or deletion. The only thing that you need to do is push your code and it runs on a GPU!
Source code for both the custom Docker images and the Kubernetes object definitions can be found here and here, respectively.
See here the complete blog post.
To leave a comment for the author, please follow the link and comment on their blog: Angel Sevilla Camins' Blog.
Reusable Pipelines in R
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
Pipelines in R are popular; the most popular one is magrittr, as used by dplyr.
This note will discuss the advanced re-usable piping systems: rquery/rqdatatable operator trees and wrapr function object pipelines. In each case we have a set of objects designed to extract extra power from the wrapr dot-arrow pipe %.>%.
Piping
Piping is not much more than having a system that lets one treat “x %.>% f(.)” as a near synonym for “f(x)”. For the wrapr dot-arrow pipe the semantics are intentionally closer to (x %.>% f(.)) ~ {. <- x; f(.)}.
The pipe notation may be longer, but it avoids nesting and reversed right-to-left reading for many-stage operations (such as “x %.>% f1(.) %.>% f2(.) %.>% f3(.)” versus “f3(f2(f1(x)))”).
In addition to allowing users to write operations in this notation, most piping systems allow users to save pipelines for later reuse (though some others have issues serializing or saving such pipelines due to entanglement with the defining environment).
wrapr and rquery/rqdatatable supply a number of piping tools that are re-usable, serializable, and very powerful (via R S3 and S4 dispatch features). One of the most compelling features is “function objects”, which means objects can be treated like functions (applied to other objects by pipelines). We will discuss some of these features in the context of rquery/rqdatatable and wrapr.
rquery/rqdatatable: For quite a while the rquery and rqdatatable packages have supplied a sequence-of-operators abstraction called an “operator tree” or “operator pipeline”.
These pipelines are (deliberately) fairly strict:
- They must start with a table description or definition.
- Each step must be a table-to-table transform meeting certain column pre-conditions.
- Each step must advertise what columns it makes available or produces, for later condition checking.
For a guiding example, suppose we want to row-subset some data, get per-group means, and then sort the data by those means.
# our example data
d <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 2, 2, -10),
  stringsAsFactors = FALSE)

# load our package
library("rqdatatable")
## Loading required package: rquery

# build an operator tree
threshold <- 0.0
ops <-
  # define the data format
  local_td(d) %.>%
  # restrict to rows with value >= threshold
  select_rows_nse(., value >= threshold) %.>%
  # compute per-group aggregations
  project_nse(., groupby = "group", mean_value = mean(value)) %.>%
  # sort rows by mean_value decreasing
  orderby(., cols = "mean_value", reverse = "mean_value")

# show the tree/pipeline
cat(format(ops))
## table(d;
##   group,
##   value) %.>%
##  select_rows(.,
##    value >= 0) %.>%
##  project(., mean_value := mean(value),
##   g= group) %.>%
##  orderby(., desc(mean_value))

Of course the purpose of such a pipeline is to be able to apply it to data. This is done simply with the wrapr dot-arrow pipe:
d %.>% ops
##    group mean_value
## 1:     b        2.0
## 2:     a        1.5

rquery pipelines are designed to specify and execute data wrangling tasks. An important feature of rquery pipelines is that they are designed for serialization. This means we can save them and also send them to multiple nodes for parallel processing.
# save the optree
saveRDS(ops, "rquery_optree.RDS")

# simulate a fresh R session
rm(list = setdiff(ls(), "d"))
library("rqdatatable")

# read the optree back in
ops <- readRDS('rquery_optree.RDS')

# look at it
cat(format(ops))
## table(d;
##   group,
##   value) %.>%
##  select_rows(.,
##    value >= 0) %.>%
##  project(., mean_value := mean(value),
##   g= group) %.>%
##  orderby(., desc(mean_value))

# use it again
d %.>% ops
##    group mean_value
## 1:     b        2.0
## 2:     a        1.5

# clean up
rm(list = setdiff(ls(), "d"))

We can also run rqdatatable operations in “immediate mode”, without pre-defining the pipeline or tables:
threshold <- 0.0

d %.>%
  select_rows_nse(., value >= threshold) %.>%
  project_nse(., groupby = "group", mean_value = mean(value)) %.>%
  orderby(., cols = "mean_value", reverse = "mean_value")
##    group mean_value
## 1:     b        2.0
## 2:     a        1.5

wrapr function objects: A natural question is: given that we already have rquery pipelines, why do we need wrapr function object pipelines? The reason is that rquery/rqdatatable pipelines are strict and deliberately restricted to operations that can be hosted both in R (via data.table) and on databases (examples: PostgreSQL and Spark). One might also want a more general pipeline with fewer constraints, optimized for working in R directly.
The wrapr “function object” pipelines allow treatment of arbitrary objects as items we can pipe into. Their primary purpose is to partially apply functions, converting arbitrary objects and functions into single-argument (or unary) functions. This converted form is perfect for pipelining. This, in a sense, lets us treat these objects as functions. The wrapr function object pipeline also has less constraint checking than rquery pipelines, so it is more suitable for “black box” steps that do not publish their column use and production details (in fact wrapr function object pipelines work on arbitrary objects, not just data.frames or tables).
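The core idea, partially applying a function down to a single free argument, can be sketched in plain base R (partial is a made-up helper name here, not part of wrapr):

```r
# Turn a function plus fixed extra arguments into a unary function,
# which is exactly the shape a pipeline step wants.
partial <- function(f, ...) {
  extra <- list(...)
  function(x) do.call(f, c(list(x), extra))
}

sort_desc <- partial(sort, decreasing = TRUE)
sort_desc(c(1.5, 2.0, 0.5))
## [1] 2.0 1.5 0.5
```

wrapr's pkgfn()/srcfn() objects play this role in a serializable, dispatchable form.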
Let’s adapt our example above into a simple wrapr dot-arrow pipeline.
library("wrapr")

threshold <- 0

d %.>%
  .[.$value >= threshold, , drop = FALSE] %.>%
  tapply(.$value, .$group, 'mean') %.>%
  sort(., decreasing = TRUE)
##   b   a
## 2.0 1.5

All we have done is replace the rquery steps with typical base R commands. As we see, the wrapr dot-arrow can route data through a sequence of such commands to repeat our example.
Now let’s adapt our above example into a reusable wrapr function object pipeline.
library("wrapr")

threshold <- 0

pipeline <-
  srcfn(
    ".[.$value >= threshold, , drop = FALSE]" ) %.>%
  srcfn(
    "tapply(.$value, .$group, 'mean')" ) %.>%
  pkgfn(
    "sort",
    arg_name = "x",
    args = list(decreasing = TRUE))

cat(format(pipeline))
## UnaryFnList(
##    SrcFunction{ .[.$value >= threshold, , drop = FALSE] }(.=., ),
##    SrcFunction{ tapply(.$value, .$group, 'mean') }(.=., ),
##    base::sort(x=., decreasing))

We used two wrapr abstractions to capture the steps for reuse (something built into rquery, and now also supplied by wrapr). The abstractions are:
- srcfn(), which wraps arbitrary quoted code as a function object.
- pkgfn(), which wraps a package-qualified function name as a function object (“base” being the default package).
This sort of pipeline can be applied to data using pipe notation:
d %.>% pipeline
##   b   a
## 2.0 1.5

The above pipeline has one key inconvenience and one key weakness:
- For the srcfn() steps we had to place the source code in quotes, which defeats any sort of syntax highlighting and auto-completion in our R integrated development environment (IDE).
- The above pipeline has a reference to the value of threshold in our current environment; this means the pipeline is not sufficiently self-contained to serialize and share.
We can quickly address both of these issues with the wrapr::qe() (“quote expression”) function. It uses base::substitute() to quote its arguments, and the IDE doesn’t know the contents are quoted, so it can still help us with syntax highlighting and auto-completion. Also, we are using base::bquote() .()-style escaping to bind in the value of threshold.
pipeline <-
  srcfn( qe( .[.$value >= .(threshold), , drop = FALSE] )) %.>%
  srcfn( qe( tapply(.$value, .$group, 'mean') )) %.>%
  pkgfn(
    "sort",
    arg_name = "x",
    args = list(decreasing = TRUE))

cat(format(pipeline))
## UnaryFnList(
##    SrcFunction{ .[.$value >= 0, , drop = FALSE] }(.=., ),
##    SrcFunction{ tapply(.$value, .$group, "mean") }(.=., ),
##    base::sort(x=., decreasing))

d %.>% pipeline
##   b   a
## 2.0 1.5

Notice this pipeline works as before, but no longer refers to the external value threshold. This pipeline can be saved and shared.
Another recommended way to bind in values is with the args argument, which is a named list of values that are expected to be available when a srcfn() is evaluated, or additional named arguments that will be applied to a pkgfn().
In this notation the pipeline is written as follows.
pipeline <-
  srcfn( qe( .[.$value >= threshold, , drop = FALSE] ),
         args = list('threshold' = threshold)) %.>%
  srcfn( qe( tapply(.$value, .$group, 'mean') )) %.>%
  pkgfn(
    "sort",
    arg_name = "x",
    args = list(decreasing = TRUE))

cat(format(pipeline))
## UnaryFnList(
##    SrcFunction{ .[.$value >= threshold, , drop = FALSE] }(.=., threshold),
##    SrcFunction{ tapply(.$value, .$group, "mean") }(.=., ),
##    base::sort(x=., decreasing))

d %.>% pipeline
##   b   a
## 2.0 1.5

We can save this pipeline.
saveRDS(pipeline, "wrapr_pipeline.RDS")

And simulate using it in a fresh environment (i.e. simulate sharing it).
# simulate a fresh environment
rm(list = setdiff(ls(), "d"))
library("wrapr")

pipeline <- readRDS('wrapr_pipeline.RDS')

cat(format(pipeline))
## UnaryFnList(
##    SrcFunction{ .[.$value >= threshold, , drop = FALSE] }(.=., threshold),
##    SrcFunction{ tapply(.$value, .$group, "mean") }(.=., ),
##    base::sort(x=., decreasing))

d %.>% pipeline
##   b   a
## 2.0 1.5

Conclusion: And that is some of the power of wrapr piping, rquery/rqdatatable, and wrapr function objects. Essentially, wrapr function objects are a reference application of the S3/S4 piping abilities discussed in the wrapr pipe formal article.
The technique is very convenient when each of the steps is substantial (such as non-trivial data preparation and model application steps).
The above techniques can make reproducing and sharing methods much easier.
We have some more examples of the technique here and here.
# clean up after example
unlink("rquery_optree.RDS")
unlink("wrapr_pipeline.RDS")

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.
R community update: announcing sessions for useR Delhi December meetup
(This article was first published on R – Tech and Mortals, and kindly contributed to Rbloggers)
As mentioned in my last blog post, useR Delhi NCR is all set to host our second meetup on 15th December, i.e. the upcoming Saturday. We’ve finalized two exciting speaker sessions for it. They are as follows:
- Basics of Shiny and geospatial visualizations by Sean Angiolillo
- Up in the air: cloud storage based workflows in R by Himanshu Sikaria
Apart from the above, the meetup will feature lightning talks and a networking session for attendees.
If you haven’t registered yet, do it now via this form. Event details can be found on useR Delhi’s meetup page.
Also, if you know of someone who may be interested in speaking at a future event, please let us know via mail, Twitter, Facebook or Meetup.
Please share and retweet our event and/or this blog. We would be grateful.
To leave a comment for the author, please follow the link and comment on their blog: R – Tech and Mortals.
Gold-Mining Week 15 (2018)
(This article was first published on R – Fantasy Football Analytics, and kindly contributed to Rbloggers)
The Week 15 Gold Mining and Fantasy Football Projection Roundup is now available. Go get that free-agent gold!
The post Gold-Mining Week 15 (2018) appeared first on Fantasy Football Analytics.
To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.
RTutor: Better Incentive Contracts For Road Construction
(This article was first published on Economics and R  R posts, and kindly contributed to Rbloggers)
For about two weeks now, I have faced a large additional traffic jam every morning due to a construction site on the road. When passing the construction site, often only a few people, or sometimes nobody, seem to be working there. Being an economist, I really wonder how much of such traffic jams could be avoided with better contracts that give the construction company proper incentives to account for the social cost of the road blocking, and therefore to more often fully staff the construction site and finish earlier.
Patrick Bajari and Gregory Lewis have collected a detailed sample of 466 road construction projects in Minnesota to study this question in their very interesting article Moral Hazard, Incentive Contracts and Risk: Evidence from Procurement in the Review of Economic Studies, 2014.
They estimate a structural econometric model and find that changes in contract design could substantially reduce the duration of road blockages and largely increase total welfare at only minor increases in the risk that road construction firms face.
As part of his master's thesis at Ulm University, Claudius Schmid has created a nice and detailed RTutor problem set that allows you to replicate the findings in an interactive fashion. You learn a lot about the structure and outcomes of the currently used contracts, the theory behind better contract design, and how the structural model used to assess the quantitative effects can be estimated and simulated. At the same time, you can hone your general data science and R skills.
Here is a small screenshot:
As in previous RTutor problem sets, you enter free R code in a web-based Shiny app. The code is automatically checked, and you can get hints on how to proceed. In addition, you are challenged by multiple-choice quizzes.
You can test the problem set online on rstudio.cloud: https://rstudio.cloud/project/137023 Source the run.R file to start the problem set.
If you don’t want to register at rstudio.cloud, you can also check out the problem set on shinyapps.io: https://clamasch.shinyapps.io/RTutorIncentiveContracts
The free shinyapps.io account is capped at 25 hours total usage time per month. So it may be greyed out when you click at it. Also, unlike on rstudio.cloud, you cannot save your progress on shinyapps.io.
To locally install the problem set, follow the installation guide at the problem set’s Github repository: https://github.com/ClaMaSch/RTutorIncentiveContracts
If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at the RTutor Github page
https://github.com/skranz/RTutor
To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts.
Day 13 – little helper read_files
(This article was first published on rbloggers – STATWORX, and kindly contributed to Rbloggers)
We at STATWORX work a lot with R and we often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. At first, there was no plan to make a package, but soon I realised that it would be much easier to share and improve those functions if they were within a package. Up until the 24th of December I will present one function each day from helfRlein. So, on the 13th day of Christmas my true love gave to me…
What can it do? This little helper reads in multiple files of the same type and combines them into a data.table. What kind of “file reading function” should be used can be chosen via the FUN argument.
How to use it? If you have a list of files that all need to be loaded with the same function (e.g. read.csv), instead of using lapply and rbindlist you can now use this:
read_files(files, FUN = readRDS)
read_files(files, FUN = readLines)
read_files(files, FUN = read.csv, sep = ";")

Internally, it just uses lapply and rbindlist, but you don’t have to type it all the time. read_files combines the single files by their column names and returns one data.table. Why data.table? Because I like it. But let's not go down the rabbit hole of data.table vs dplyr (to the rabbit hole …).
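Conceptually, the helper boils down to something like the sketch below (read_files itself lives in helfRlein; read_files_sketch is a stand-in name, and the data.table dependency is assumed):

```r
library(data.table)

# Read many files with one reader function and bind the results
# row-wise, matching columns by name, into a single data.table.
read_files_sketch <- function(files, FUN = read.csv, ...) {
  rbindlist(lapply(files, FUN, ...), use.names = TRUE)
}
```

For example, read_files_sketch(list.files("data/", full.names = TRUE), FUN = read.csv, sep = ";") would load every file in a hypothetical data/ folder into one table.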
Overview: To see all the other functions, you can either check out our GitHub or you can read about them here.
Have a merry advent season!
About the author: Jakob Gepp. ABOUT US: STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI.
Sign Up Now!
The post Day 13 – little helper read_files first appeared on STATWORX.
To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX.
Recreating the NBA lead tracker graphic
(This article was first published on R – Statistical Odds & Ends, and kindly contributed to Rbloggers)
For each NBA game, nba.com has a really nice graphic which tracks the point differential between the two teams throughout the game. Here is the lead tracker graphic for the game between the LA Clippers and the Phoenix Suns on 10 Dec 2018:
I thought it would be cool to try recreating this graphic with R. You might ask: why try to replicate something that exists already? If we are able to pull out the data underlying this graphic, we could do much more than just replicate what is already out there; we have the power to make other visualizations which could be more informative or powerful. (For example, what does this chart look like for all games that the Golden State Warriors played in? Or, what does the chart look like for each quarter of the game?)
The full R code for this post can be found here. For a self-contained script that accepts a game ID parameter and produces the lead tracker graphic, click here.
First, we load the packages that we will use:
library(lubridate)
library(rvest)
library(stringr)
library(tidyverse)

We can get play-by-play data from Basketball-Reference.com (here is the link for the LAC @ PHX game on 2018-12-10). Here is a snippet of the play-by-play table on that webpage; we would like to extract the columns in red:
The code below extracts the webpage, then pulls out rows from the play-by-play table:
# get webpage
url <- paste0("https://www.basketball-reference.com/boxscores/pbp/", current_id, ".html")
webpage <- read_html(url)

# pull out the events from the play-by-play table
events <- webpage %>%
  html_nodes("#pbp") %>%
  html_nodes("tr") %>%
  html_text()

events is a character vector that looks like this:
We would really like to pull out the data in the boxes above. Timings are easy enough to pull out with regular expressions (e.g. start of the string: at least 1 digit, then :, then at least one digit, then ., then at least one digit). Pulling out the score is a bit trickier: we can’t just use the regular expression denoting a dash with a number on each side. An example of why that doesn’t work is in the purple box above. Whenever a team scores, basketball-reference.com puts a “+2” or “+3” on the left or right of the score, depending on which team scored. In events, these 3 columns get smushed together into one string. If the team on the left scores, pulling out number-dash-number will give the wrong value (e.g. the purple box above would give 22-2 instead of 2-2).
To avoid this issue, we extract the “+”s that may appear on either side of the score. In fact, this has an added advantage: we only need to extract a score if it is different from the previous timestamp. As such, we only have to keep the scores which have a “+” on either side of it. We then postprocess the scores.
# get event times & scores
times <- str_extract(events, "^\\d+:\\d+.\\d+")
scores <- str_extract(events, "[\\+]*\\d+-\\d+[\\+]*")
scores <- ifelse(str_detect(scores, "\\+"), scores, NA)
df <- data.frame(time = times, score = scores, stringsAsFactors = FALSE) %>%
  na.omit()

# remove the +'s
parseScore <- function(x) {
  if (startsWith(x, "+")) {
    return(str_sub(x, 3, str_length(x)))
  } else if (endsWith(x, "+")) {
    return(str_sub(x, 1, str_length(x) - 1))
  } else {
    return(x)
  }
}
df$score <- sapply(df$score, parseScore)

Next, we split the score into visitor and home score and compute the point differential (positive means the visitor team is winning):
# split score into visitor and home score, get home advantage
df <- df %>%
  separate(score, into = c("visitor", "home"), sep = "-") %>%
  mutate(visitor = as.numeric(visitor),
         home = as.numeric(home),
         time = ms(time)) %>%
  mutate(visitor_adv = visitor - home)

Next we need to process the timings. Each of the 4 quarters lasts for 12 minutes, while each overtime period (if any) lasts for 5 minutes. The time column shows the amount of time remaining in the current period. We will amend the times so that they show the time elapsed (in seconds) from the start of the game. This notion of time makes it easier for plotting, and works for any number of overtime periods as well.
# get period of play (e.g. Q1, Q2, ...)
df$period <- NA
period <- 0
prev_time <- ms("0:00")
for (i in 1:nrow(df)) {
  curr_time <- df[i, "time"]
  if (prev_time < curr_time) {
    period <- period + 1
  }
  df[i, "period"] <- period
  prev_time <- curr_time
}

# convert time such that it runs upwards. regular quarters are 12M long, OT
# periods are 5M long
df <- df %>%
  mutate(time = ifelse(period <= 4,
                       as.duration(12 * 60) - as.duration(time),
                       as.duration(5 * 60) - as.duration(time))) %>%
  mutate(time = ifelse(period <= 4,
                       time + as.duration(12 * 60 * (period - 1)),
                       time + as.duration(12 * 60 * 4) + as.duration(5 * 60 * (period - 5))))

At this point, we have enough to make crude approximations of the lead tracker graphic:
ggplot() +
  geom_line(data = df, aes(x = time, y = visitor_adv)) +
  labs(title = "LAC @ PHX, 2018-12-10") +
  theme_minimal() +
  theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5))

ggplot() +
  geom_step(data = df, aes(x = time, y = visitor_adv)) +
  labs(title = "LAC @ PHX, 2018-12-10") +
  theme_minimal() +
  theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5))

Getting the fill colors that NBA.com’s lead tracker has requires a bit more work. We need to split visitor_adv into two columns: the visitor’s lead (0 if they are behind) and the home’s lead (0 if they are behind). We can then draw the chart above and below the x-axis as two geom_ribbons. (It’s a little more complicated than that; see this StackOverflow question and this gist for details.) Colors were obtained using imagecolorpicker.com.
df$visitor_lead <- pmax(df$visitor_adv, 0)
df$home_lead <- pmin(df$visitor_adv, 0)

df_extraSteps <- df %>%
  mutate(visitor_adv = lag(visitor_adv),
         visitor_lead = lag(visitor_lead),
         home_lead = lag(home_lead))
df2 <- bind_rows(df_extraSteps, df) %>%
  arrange(time)

ggplot() +
  geom_ribbon(data = df2, aes(x = time, ymin = 0, ymax = visitor_lead), fill = "#F7174E") +
  geom_ribbon(data = df2, aes(x = time, ymin = home_lead, ymax = 0), fill = "#F16031") +
  labs(title = "LAC @ PHX, 2018-12-10") +
  theme_minimal() +
  theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5))

Almost there! The code below does some touch-up to the figure, giving it the correct limits for the y-axis as well as vertical lines for the end of the periods.
# get score differential range (round to nearest 5)
ymax <- round(max(df$visitor_adv) * 2, digits = -1) / 2
ymin <- round(min(df$visitor_adv) * 2, digits = -1) / 2

# get period positions and labels
periods <- unique(df$period)
x_value <- ifelse(periods <= 4,
                  12 * 60 * periods,
                  12 * 60 * 4 + 5 * 60 * (periods - 4))
x_label <- ifelse(periods <= 4, paste0("Q", periods), paste0("OT", periods - 4))

ggplot() +
  geom_ribbon(data = df2, aes(x = time, ymin = 0, ymax = visitor_lead), fill = "#F7174E") +
  geom_ribbon(data = df2, aes(x = time, ymin = home_lead, ymax = 0), fill = "#F16031") +
  geom_vline(aes(xintercept = x_value), linetype = 2, col = "grey") +
  scale_y_continuous(limits = c(ymin, ymax)) +
  labs(title = "LAC @ PHX, 2018-12-10") +
  scale_x_continuous(breaks = x_value, labels = x_label) +
  theme_minimal() +
  theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5),
        axis.title.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank())

The figure above is what we set out to plot. However, since we have the underlying data, we can now make plots of the same data that may reveal other trends (code at the end of this R file). Here are the line and ribbon plots where we look at the absolute score rather than the point differential:
Here, we add points to the line plot to indicate whether a free throw, 2-pointer, or 3-pointer was scored:
The possibilities are endless!
Twins on the up
(This article was first published on HighlandR, and kindly contributed to R-bloggers)
Are multiple births on the increase?
My twin boys turned 5 years old today. Wow, time flies. Life is never dull, because twins are still seen as something of a novelty, so wherever we go, we find ourselves in conversation with strangers, who are intrigued by the whole thing.
In order to save time if we ever meet, here are some FAQs:
 No, they’re not identical
 Yes, I’m sure
 No, they do not have similar personalities
 They like different things – One likes Hulk and Gekko, the other likes Iron Man and Catboy.
Recently I’ve been hearing and seeing anecdotal evidence that twins and multiple births are on the increase. I tried to find some data for Scotland, and while there is a lot of information on births in Scotland available, I couldn’t find breakdowns of multiple births.
However, I did find some information for England and Wales, so let’s look at that.
In this next bit, the key thing that may be of interest is the use of tidyr::gather.
There has been some discussion on #rstats Twitter about things people struggle with, and a surprising number of people struggle to remember the syntax for tidyr’s gather and spread.
(I can neither confirm nor deny that I am one of them.)
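For anyone in the same boat, here is a tiny self-contained reminder of the gather/spread syntax used below (the data frame here is made up for illustration): key names the new column that will hold the old column names, value names the new column that will hold the cell values, and a minus sign marks the columns to leave alone.

```r
library(tidyr)

# A small wide-format example: one row per year, one column per age group
wide <- data.frame(Year = c(2015, 2016),
                   Under20 = c(10, 12),
                   `20_to_24` = c(30, 28),
                   check.names = FALSE)

# gather: everything except Year becomes key/value pairs
long <- gather(wide, key = age_group, value = maternities, -Year)

# spread reverses the operation
wide_again <- spread(long, key = age_group, value = maternities)
```

With two years and two age-group columns, long has four rows and three columns (Year, age_group, maternities).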
The data was found here
library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)

data <- read_xls("birthcharacteristicsworkbook2016.xls",
                 sheet = "Table 11", range = "A10:I87")

data <- data %>%
  rename(Year = X__1,
         All_ages = `All maternities with multiple births`,
         Under20 = X__2,
         `20_to_24` = X__3,
         `25_to_29` = X__4,
         `30_to_34` = X__5,
         `35_to_39` = X__6,
         `40_to_44` = X__7,
         `45_and_over` = X__8)
# the 1981 data is borked, so ignore that

Note the use of gather to combine all the age groups into an age_group variable.
We use the Year column as an index so we have an entry for every age group, for every year, with the value represented as ‘maternities’.
Back to the code:
long_data <- data %>%
  filter(Year != "1981") %>%
  gather(key = age_group, value = "maternities", -Year)

long_data$Year <- as.numeric(long_data$Year)
long_data$age_group <- forcats::as_factor(long_data$age_group)
long_data$maternities <- as.numeric(long_data$maternities)

ggplot(long_data, aes(Year, maternities), group = age_group) +
  geom_line() +
  geom_point() +
  facet_wrap(vars(age_group), scales = "free_y") +
  ggtitle(label = "England and Wales maternities with multiple births - numbers",
          subtitle = "By age of mother, 1940 to 2016") +
  labs(x = NULL, y = "Multiple maternities")

# Let's do rates
rates <- read_xls("birthcharacteristicsworkbook2016.xls",
                  sheet = "Table 11", range = "A89:I166")

rates <- rates %>%
  rename(Year = X__1,
         All_ages = `All maternities with multiple births per 1,000 all maternities`,
         Under20 = X__2,
         `20_to_24` = X__3,
         `25_to_29` = X__4,
         `30_to_34` = X__5,
         `35_to_39` = X__6,
         `40_to_44` = X__7,
         `45_and_over` = X__8)

long_rates <- rates %>%
  filter(Year != 1981) %>%
  gather(key = age_group, value = "multiple_maternities_per_1000", -Year)

long_rates$Year <- as.numeric(long_rates$Year)
long_rates$age_group <- forcats::as_factor(long_rates$age_group)
long_rates$multiple_maternities_per_1000 <- as.numeric(long_rates$multiple_maternities_per_1000)

ggplot(long_rates, aes(Year, multiple_maternities_per_1000), group = age_group) +
  geom_line() +
  geom_point() +
  facet_wrap(vars(age_group)) +
  ggtitle(label = "England and Wales rate of maternities with multiple births - per 1,000 all maternities",
          subtitle = "By age of mother, 1940 to 2016") +
  labs(x = NULL, y = "Multiple maternities")

When we look at maternities with multiple births as a rate per 1,000 maternities, we see the increase in multiple births among older mothers, especially in the over-45 group.
Again, with free scales on the y-axis – which helps us see that almost all age groups are exhibiting an increase – compare the 20-24 age group as a rate and as a count, for example.
Looks to me that overall, the rate of multiple births is increasing.
What’s driving this?
Can it continue?
Will people ever stop asking us if the twins are identical?
Rsampling Fama French
(This article was first published on R Views, and kindly contributed to R-bloggers)
Today we will continue our work on Fama French factor models, but more as a vehicle to explore some of the awesome stuff happening in the world of tidy models. For new readers who want to get familiar with Fama French before diving into this post, see here, where we covered importing and wrangling the data; here, where we covered rolling models and visualization; my most recent previous post here, where we covered managing many models; and, if you’re into Shiny, this flexdashboard.
Our goal today is to explore k-fold cross-validation via the rsample package, and a bit of model evaluation via the yardstick package. We started the model evaluation theme last time when we used tidy(), glance() and augment() from the broom package. In this post, we will use the rmse() function from yardstick, but our main focus will be on the vfold_cv() function from rsample. We are going to explore these tools in the context of linear regression and Fama French, which might seem weird since these tools would typically be employed in the realms of machine learning, classification, and the like. We’ll stay in the world of explanatory models via linear regression for a few reasons.
First, and this is a personal preference, when getting to know a new package or methodology, I prefer to do so in a context that’s already familiar. I don’t want to learn about rsample whilst also getting to know a new data set and learning the complexities of some crazy machine learning model. Since Fama French is familiar from our previous work, we can focus on the new tools in rsample and yardstick. Second, factor models are important in finance, despite relying on good old linear regression. We won’t regret time spent on factor models, and we might even find creative new ways to deploy or visualize them.
The plan for today is to take the same models that we ran in the last post, only this time using k-fold cross-validation and bootstrapping to try to assess the quality of those models.
For that reason, we’ll be working with the same data as we did previously. I won’t go through the logic again, but in short, we’ll import data for daily prices of five ETFs, convert them to returns (have a look here for a refresher on that code flow), then import the Fama French five-factor data and join it to our five ETF returns data. Here’s the code to make that happen:
library(tidyverse)
library(broom)
library(tidyquant)
library(timetk)

symbols <- c("SPY", "EFA", "IJS", "EEM", "AGG")

# The prices object will hold our daily price data.
prices <- getSymbols(symbols, src = 'yahoo',
                     from = "2012-12-31", to = "2017-12-31",
                     auto.assign = TRUE, warnings = FALSE) %>%
  map(~Ad(get(.))) %>%
  reduce(merge) %>%
  `colnames<-`(symbols)

asset_returns_long <- prices %>%
  tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
  gather(asset, returns, -date) %>%
  group_by(asset) %>%
  mutate(returns = (log(returns) - log(lag(returns)))) %>%
  na.omit()

factors_data_address <- "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Global_5_Factors_Daily_CSV.zip"
factors_csv_name <- "Global_5_Factors_Daily.csv"
temp <- tempfile()

download.file(
  # location of file to be downloaded
  factors_data_address,
  # where we want R to store that file
  temp,
  quiet = TRUE)

Global_5_Factors <- read_csv(unz(temp, factors_csv_name), skip = 6) %>%
  rename(date = X1, MKT = `Mkt-RF`) %>%
  mutate(date = ymd(parse_date_time(date, "%Y%m%d"))) %>%
  mutate_if(is.numeric, funs(. / 100)) %>%
  select(-RF)

data_joined_tidy <- asset_returns_long %>%
  left_join(Global_5_Factors, by = "date") %>%
  na.omit()

After running that code, we have an object called data_joined_tidy. It holds daily returns for the five ETFs and the Fama French factors. Here’s a look at the first row for each ETF.
data_joined_tidy %>%
  slice(1)
# A tibble: 5 x 8
# Groups: asset [5]
#   date       asset returns    MKT    SMB    HML    RMW    CMA
# 1 2013-01-02 AGG   0.00117 0.0199 0.0043 0.0028 0.0028 0.0023
# 2 2013-01-02 EEM   0.0194  0.0199 0.0043 0.0028 0.0028 0.0023
# 3 2013-01-02 EFA   0.0154  0.0199 0.0043 0.0028 0.0028 0.0023
# 4 2013-01-02 IJS   0.0271  0.0199 0.0043 0.0028 0.0028 0.0023
# 5 2013-01-02 SPY   0.0253  0.0199 0.0043 0.0028 0.0028 0.0023

Let’s work with just one ETF for today and use filter(asset == "AGG") to shrink our data down to just that ETF.
agg_ff_data <- data_joined_tidy %>%
  filter(asset == "AGG")

Okay, we’re going to regress the daily returns of AGG on one factor, then three factors, then five factors, and we want to evaluate how well each model explains AGG’s returns. That means we need a way to test the model. Last time, we looked at the adjusted R-squared values when the model was run on the entirety of AGG returns. Today, we’ll evaluate the model using k-fold cross-validation. That’s a pretty jargon-heavy phrase that isn’t part of the typical finance lexicon. Let’s start with the second part, cross-validation. Instead of running our model on the entire data set – all the daily returns of AGG – we’ll run it on just part of the data set, then test the results on the part that we did not use. Those different subsets of our original data are often called the training and the testing sets, though rsample calls them the analysis and assessment sets. We validate the model results by applying them to the assessment data and seeing how the model performed.
The k-fold bit refers to the fact that we’re not just dividing our data into training and testing subsets; we’re actually going to divide it into a bunch of groups, a k number of groups, or a k number of folds. One of those folds will be used as the validation set; the model will be fit on the other k - 1 folds, and then tested on the validation set. We’re doing this with a linear model to see how well it explains the data; it’s typically used in machine learning to see how well a model predicts data (we’ll get there in 2019).1
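As a plain illustration of the idea (this is not rsample's implementation, just a hand-rolled sketch), assigning each row to one of k folds and then holding each fold out in turn looks like this:

```r
set.seed(1)
k <- 5
n <- 100

# Randomly assign each of the n rows to one of k folds
fold_id <- sample(rep(1:k, length.out = n))

for (i in 1:k) {
  assessment_rows <- which(fold_id == i)  # the held-out validation fold
  analysis_rows   <- which(fold_id != i)  # the remaining k - 1 folds
  # fit the model on analysis_rows, then evaluate it on assessment_rows ...
}
```

Each pass through the loop fits on 80% of the rows and validates on the remaining 20%, so every observation gets used for validation exactly once.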
If you’re like me, it will take a bit of tinkering to really grasp k-fold cross-validation, but rsample has a great function for dividing our data into k folds. If we wish to use five folds (the state of the art seems to be either five or ten folds), we call the vfold_cv() function, pass it our data object agg_ff_data, and set v = 5.
library(rsample)
library(yardstick)

set.seed(752)

cved_ff <- vfold_cv(agg_ff_data, v = 5)
cved_ff
# 5-fold cross-validation
# A tibble: 5 x 2
  splits id
1        Fold1
2        Fold2
3        Fold3
4        Fold4
5        Fold5

We have an object called cved_ff, with a column called splits and a column called id. Let’s peek at the first split.
cved_ff$splits[[1]]
# <1007/252/1259>

Three numbers. The first, 1007, tells us how many observations are in the analysis set. Since we have five folds, we should have 80% (or 4/5) of our data in the analysis set. The second number, 252, tells us how many observations are in the assessment set, which is 20% of our original data. The third number, 1259, is the total number of observations in our original data.
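Those counts can also be read off programmatically with rsample's analysis() and assessment() accessors; a self-contained illustration on a built-in data set (mtcars, 32 rows, four folds, so 24 analysis / 8 assessment rows per split) might look like:

```r
library(rsample)

set.seed(1)
cv <- vfold_cv(mtcars, v = 4)

nrow(analysis(cv$splits[[1]]))    # size of the analysis (training) set
nrow(assessment(cv$splits[[1]]))  # size of the held-out assessment set
```

The two accessors return ordinary data frames, so anything that works on a data frame (lm(), nrow(), summary()) works on them directly.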
Next, we want to apply a model to the analysis set of this k-folded data and test the results on the assessment set. Let’s start with one factor and run a simple linear model, lm(returns ~ MKT).
We want to run it on analysis(cved_ff$splits[[1]]) – the analysis set of our first split.
ff_model_test <- lm(returns ~ MKT, data = analysis(cved_ff$splits[[1]]))
ff_model_test

Call:
lm(formula = returns ~ MKT, data = analysis(cved_ff$splits[[1]]))

Coefficients:
(Intercept)          MKT
  0.0001025    0.0265516

Nothing too crazy so far. Now we want to test on our assessment data. The first step is to add that data to the original set. We’ll use augment() for that task, and pass it assessment(cved_ff$splits[[1]]).
ff_model_test %>%
  augment(newdata = assessment(cved_ff$splits[[1]])) %>%
  head() %>%
  select(returns, .fitted)

       returns      .fitted
1 0.0009021065 1.183819e-04
2 0.0011726989 4.934779e-05
3 0.0010815505 1.157267e-04
4 0.0024385815 7.544460e-05
5 0.0021715702 8.341007e-05
6 0.0028159467 3.865527e-04

We just added our fitted values to the assessment data, the subset of the data on which the model was not fit. How well did our model do when we compare the fitted values to the actual values in the held-out set?
We can use the rmse() function from yardstick to measure our model. RMSE stands for root mean-squared error: the square root of the average of the squared differences between our fitted values and the actual values in the assessment data. A lower RMSE is better!
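As a sanity check on that definition, RMSE is simple enough to compute by hand on a toy vector (the numbers below are made up for illustration):

```r
# Three "actual" values and three "fitted" values
actual <- c(0.5, -0.2, 0.1)
fitted <- c(0.4, -0.1, 0.0)

# RMSE = square root of the mean of the squared errors
rmse_by_hand <- sqrt(mean((actual - fitted)^2))
rmse_by_hand
# [1] 0.1
```

Each error here is 0.1 in absolute value, so the root mean-squared error is exactly 0.1; yardstick's rmse() computes the same quantity from a data frame's truth and estimate columns.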
ff_model_test %>%
  augment(newdata = assessment(cved_ff$splits[[1]])) %>%
  rmse(returns, .fitted)

# A tibble: 1 x 3
  .metric .estimator .estimate
1 rmse    standard     0.00208

Now that we’ve done that piece by piece, let’s wrap the whole operation into one function. This function takes one argument, a split, and we’re going to use pull() so we can extract the raw number instead of the entire tibble result.
model_one <- function(split) {
  split_for_model <- analysis(split)
  ff_model <- lm(returns ~ MKT, data = split_for_model)
  holdout <- assessment(split)
  rmse <- ff_model %>%
    augment(newdata = holdout) %>%
    rmse(returns, .fitted) %>%
    pull(.estimate)
}

Now we pass it our first split.
model_one(cved_ff$splits[[1]]) %>%
  head()
[1] 0.002080324

Now we want to apply that function to each of our five folds that are stored in cved_ff. We do that with a combination of mutate() and map_dbl(). We use map_dbl() instead of map because we are returning a number here and there’s not a good reason to store that number in a list column.
cved_ff %>%
  mutate(rmse = map_dbl(cved_ff$splits, model_one))

# 5-fold cross-validation
# A tibble: 5 x 3
  splits id        rmse
1        Fold1  0.00208
2        Fold2  0.00189
3        Fold3  0.00201
4        Fold4  0.00224
5        Fold5  0.00190

OK, we have five RMSEs, since we ran the model on five separate analysis fold sets and tested on five separate assessment fold sets. Let’s find the average RMSE by taking the mean() of the rmse column. That can help reduce noisiness that resulted from our random creation of those five folds.
cved_ff %>%
  mutate(rmse = map_dbl(cved_ff$splits, model_one)) %>%
  summarise(mean_rse = mean(rmse))

# 5-fold cross-validation
# A tibble: 1 x 1
  mean_rse
1  0.00202

We now have the mean RMSE after running our model, lm(returns ~ MKT), on all five of our folds.
That process for finding the mean RMSE can be applied to other models as well. Let’s suppose we wish to find the mean RMSE for two other models: lm(returns ~ MKT + SMB + HML), the Fama French three-factor model, and lm(returns ~ MKT + SMB + HML + RMW + CMA), the Fama French five-factor model. By comparing the mean RMSEs, we can evaluate which model explained the returns of AGG better. Since we’re just adding more and more factors, the models can be expected to get more and more accurate, but again, we are exploring the rsample machinery and creating a template where we can pop in whatever models we wish to compare.
First, let’s create two new functions that follow the exact same code pattern as above but house the three-factor and five-factor models.
model_two <- function(split) {
  split_for_model <- analysis(split)
  ff_model <- lm(returns ~ MKT + SMB + HML, data = split_for_model)
  holdout <- assessment(split)
  rmse <- ff_model %>%
    augment(newdata = holdout) %>%
    rmse(returns, .fitted) %>%
    pull(.estimate)
}

model_three <- function(split) {
  split_for_model <- analysis(split)
  ff_model <- lm(returns ~ MKT + SMB + HML + RMW + CMA, data = split_for_model)
  holdout <- assessment(split)
  rmse <- ff_model %>%
    augment(newdata = holdout) %>%
    rmse(returns, .fitted) %>%
    pull(.estimate)
}

Now we pass those three models to the same mutate() with map_dbl() flow that we used with just one model. The result will be three new columns of RMSEs, one for each of our three models applied to our five folds.
cved_ff %>%
  mutate(rmse_model_1 = map_dbl(splits, model_one),
         rmse_model_2 = map_dbl(splits, model_two),
         rmse_model_3 = map_dbl(splits, model_three))

# 5-fold cross-validation
# A tibble: 5 x 5
  splits id    rmse_model_1 rmse_model_2 rmse_model_3
1        Fold1      0.00208      0.00211      0.00201
2        Fold2      0.00189      0.00184      0.00178
3        Fold3      0.00201      0.00195      0.00194
4        Fold4      0.00224      0.00221      0.00213
5        Fold5      0.00190      0.00183      0.00177

We can also find the mean RMSE for each model.
cved_ff %>%
  mutate(rmse_model_1 = map_dbl(splits, model_one),
         rmse_model_2 = map_dbl(splits, model_two),
         rmse_model_3 = map_dbl(splits, model_three)) %>%
  summarise(mean_rmse_model_1 = mean(rmse_model_1),
            mean_rmse_model_2 = mean(rmse_model_2),
            mean_rmse_model_3 = mean(rmse_model_3))

# 5-fold cross-validation
# A tibble: 1 x 3
  mean_rmse_model_1 mean_rmse_model_2 mean_rmse_model_3
1           0.00202           0.00199           0.00192

That code flow worked just fine, but we had to repeat ourselves when creating the functions for each model. Let’s toggle to a flow where we define three models – the ones that we wish to test via cross-validation and RMSE – then pass those models to one function.
First, we use as.formula() to define our three models.
mod_form_1 <- as.formula(returns ~ MKT)
mod_form_2 <- as.formula(returns ~ MKT + SMB + HML)
mod_form_3 <- as.formula(returns ~ MKT + SMB + HML + RMW + CMA)

Now we write one function that takes split as an argument, same as above, but also takes formula as an argument, so we can pass it different models. This gives us the flexibility to more easily define new models and pass them to map, so I’ll append _flex to the name of this function.
ff_rmse_models_flex <- function(split, formula) {
  split_for_data <- analysis(split)
  ff_model <- lm(formula, data = split_for_data)
  holdout <- assessment(split)
  rmse <- ff_model %>%
    augment(newdata = holdout) %>%
    rmse(returns, .fitted) %>%
    pull(.estimate)
}

Now we use the same code flow as before, except we call map_dbl(), pass it our cved_ff$splits object, our new flex function called ff_rmse_models_flex(), and the model we wish to pass as the formula argument. First we pass it mod_form_1.
cved_ff %>%
  mutate(rmse_model_1 = map_dbl(cved_ff$splits,
                                ff_rmse_models_flex,
                                formula = mod_form_1))

# 5-fold cross-validation
# A tibble: 5 x 3
  splits id    rmse_model_1
1        Fold1      0.00208
2        Fold2      0.00189
3        Fold3      0.00201
4        Fold4      0.00224
5        Fold5      0.00190

Now let’s pass it all three models and find the mean RMSE.
cved_ff %>%
  mutate(rmse_model_1 = map_dbl(cved_ff$splits, ff_rmse_models_flex, formula = mod_form_1),
         rmse_model_2 = map_dbl(cved_ff$splits, ff_rmse_models_flex, formula = mod_form_2),
         rmse_model_3 = map_dbl(cved_ff$splits, ff_rmse_models_flex, formula = mod_form_3)) %>%
  summarise(mean_rmse_model_1 = mean(rmse_model_1),
            mean_rmse_model_2 = mean(rmse_model_2),
            mean_rmse_model_3 = mean(rmse_model_3))

# 5-fold cross-validation
# A tibble: 1 x 3
  mean_rmse_model_1 mean_rmse_model_2 mean_rmse_model_3
1           0.00202           0.00199           0.00192

Alright, that code flow seems a bit more flexible than our original method of writing a function to assess each model. We didn’t do much hard thinking about functional form here, but hopefully this provides a template where you could assess more nuanced models. We’ll get into bootstrapping and time series work next week, then head to Shiny to ring in the New Year!
And, finally, a couple of public service announcements.
First, thanks to everyone who has checked out my new book! The price just got lowered for the holidays. See on Amazon or on the CRC homepage (okay, that was more of an announcement about my book).
Second, applications are open for the Battlefin alternative data contest, and RStudio is one of the tools you can use to analyze the data. Check it out here. In January, they’ll announce 25 finalists who will get to compete for a cash prize and connect with some quant hedge funds. Go get ‘em!
Thanks for reading and see you next time.

For more on cross-validation, see “An Introduction to Statistical Learning”, chapter 5. Available online here: http://www-bcf.usc.edu/~gareth/ISL/.
Teaching and Learning Materials for Data Visualization
(This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers)
Data Visualization: A Practical Introduction will begin shipping next week. I’ve written an R package that contains datasets, functions, and a course packet to go along with the book. The socviz package contains about twenty-five datasets and a number of utility and convenience functions. The datasets range in size from things with just a few rows (used for purely illustrative purposes) to datasets with over 120,000 observations, for practicing with and exploring.
A course packet is also included in the package. This is a zipped file containing an RStudio project consisting of nine R Markdown documents that parallel the chapters in the book. They contain the code for almost all the figures in the book (and a few more besides). There are also some additional support files to help demonstrate things like reading in your own data locally in R.
Installing the package
To install the package, you can follow the instructions in the Preface to the book. Alternatively, first download and install R for MacOS, Windows or Linux, as appropriate. Then download and install RStudio. Launch RStudio and then type the following code at the Console prompt (>), hitting return at the end of each line:
my_packages <- c("tidyverse", "fs", "devtools")
install.packages(my_packages)
devtools::install_github("kjhealy/socviz")

Once everything has downloaded and been installed (which may take a little while), load the socviz package:
library(socviz)

The Course Packet
The supporting materials are contained in a compressed .zip file. To extract them to your Desktop, make sure the socviz package is loaded as described above. Then do this:
setup_course_notes()

This will copy the dataviz_course_notes.zip file to your Desktop, and uncompress it into a folder called dataviz_course_notes. Double-click the file named dataviz.Rproj to launch the project as a new RStudio session. If you want to uncompress the file somewhere other than your Desktop, e.g. your Documents folder, you can do this:
setup_course_notes(folder = "~/Documents")

The source code for socviz is available on GitHub. I plan on continuing to update and improve it as I use it myself in my own classes and workshops.