Data validation with the assertr package
(This article was first published on R – On the lambda, and kindly contributed to Rbloggers)
Version 2.0 of my data set validation package assertr hit CRAN just this weekend. It has some pretty great improvements over version 1. For those new to the package, what follows is a short and new introduction. For those who are already using assertr, the text below will point out the improvements.
I can (and have) go on and on about the treachery of messy/bad datasets. Though its substantially less exciting than… pretty much everything else, I believe (proportional to the heartache and stress it causes) we don’t spend enough time talking about it or building solutions around it. No matter how new and fancy your ML algorithm is, it’s success is predicated upon a properly sanitized dataset. If you are using bad data, your approach will fail—either flagrantly (best case), or unnoticeably (considerably more probable and considerably more pernicious).
assertr is a R package to help you identify common dataset errors. More specifically, it helps you easily spell out your assumptions about how the data should look and alert you of any deviation from those assumptions.
I’ll return to this point later in the post when we have more background, but I want to be up front about the goals of the package; assertr is not (and can never be) a “onestop shop” for all of your data validation needs. The specific kind of checks individuals or teams have to perform any particular dataset are often far too idiosyncratic to ever be exhaustively addressed by a single package (although, the assertive metapackage may come very close!) But all of these checks will reuse motifs and follow the same patterns. So, instead, I’m trying to sell assertr as a way of thinking about dataset validations—a set of common dataset validation actions. If we think of these actions as verbs, you could say that assertr attempts to impose a grammar of error checking for datasets.
In my experience, the overwhelming majority of data validation tasks fall into only five different patterns:
 For every element in a column, you want to make sure it fits certain criteria. Examples of this strain of error checking would be to make sure every element is a valid credit card number, or fits a certain regex pattern, or represents a date between two other dates. assertr calls this verb assert.
 For every element in a column, you want to make sure certain criteria are met but the criteria can only be decided only after looking at the entire column as a whole. For example, testing whether each element is within n standard deviations of the mean of the elements requires computation on the elements prior to inform the criteria to check for. assertr calls this verb insist.
 For every row of a dataset, you want to make sure certain assumptions hold. Examples include ensuring that no row has more than n number of missing values, or that a group of columns are jointly unique and never duplicated. assertr calls this verb assert_rows.
 For every row of a dataset, you want to make sure certain assumptions hold but the criteria can only be decided only after looking at the entire column as a whole. This closely mirrors the distinction between assert and insist, but for entire rows (not individual elements). An example of using this would be checking to make sure that the Mahalanobis distance between each row and all other rows are within n number of standard deviations of the mean distance. assertr calls this verb insist_rows.
 You want to check some property of the dataset as a whole object. Examples include making sure the dataset has more than n columns, making sure the dataset has some specified column names, etc… assertr calls this last verb verify.
Some of this might sound a little complicated, but I promise this is a worthwhile way to look at dataset validation. Now we can begin with an example of what can be achieved with these verbs. The following example is borrowed from the package vignette and README…
Pretend that, before finding the average miles per gallon for each number of engine cylinders in the mtcars dataset, we wanted to confirm the following dataset assumptions…
 that it has the columns mpg, vs, and am
 that the dataset contains more than 10 observations
 that the column for ‘miles per gallon’ (mpg) is a positive number
 that the column for ‘miles per gallon’ (mpg) does not contain a datum that is outside 4 standard deviations from its mean
 that the am and vs columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only
 each row contains at most 2 NAs
 each row is unique jointly between the mpg, am, and wt columns
 each row’s mahalanobis distance is within 10 median absolute deviations of all the distances (for outlier detection)
Before assertr version 2, the pipeline would immediately terminate at the first failure. Sometimes this is a good thing. However, sometimes we’d like to run a dataset through our entire suite of checks and record all failures. The latest version includes the chain_start and chain_end functions; all assumptions within a chain (below a call to chain_start and above chain_end) will run from beginning to end and accumulate errors along the way. At the end of the chain, a specific action can be taken but the default is to halt execution and display a comprehensive report of what failed including line numbers and the offending datum, where applicable.
Another major improvement since the last version of assertr of CRAN is that assertr errors are now S3 classes (instead of dumb strings). Additionally, the behavior of each assertion statement on success (no error) and failure can now be flexibly customized. For example, you can now tell assertr to just return TRUE and FALSE instead of returning the data passed in or halting execution, respectively. Alternatively, you can instruct assertr to just give a warning instead of throwing a fatal error. For more information on this, see help("success_and_error_functions")
Beyond these examplesSince the package was initially published on CRAN (almost exactly two years ago) many people have asked me how they should go about using assertr to test a particular assumption (and I’m very happy to help if you have one of your own, dear reader!) In every single one of these cases, I’ve been able to express it as an incantation using one of these 5 verbs. It also underscored, to me, that creating specialized functions for every need is a pipe dream. There is, however, two good pieces of news.
The first is that there is another package, assertive (vignette here) that greatly enhances the assertr experience. The predicates (functions that start with “is_”) from this (meta)package can be used in assertr pipelines just as easily as assertr’s own predicates. And assertive has an enormous amount of them! Some specialized and exciting examples include is_hex_color, is_ip_address, and is_isbn_code!
The second is if assertive doesn’t have what you’re looking for, with just a little bit of studying the assertr grammar, you can whip up your own predicates with relative ease. Using some these basic constructs and a little effort, I’m confident that the grammar is expressive enough to completely adapt to your needs.
If this package interests you, I urge you to read the most recent package vignette here. If you’re a assertr oldtimer, I point you to this NEWS file that list the changes from the previous version.
To leave a comment for the author, please follow the link and comment on their blog: R – On the lambda. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
House effects in New Zealand voting intention polls
This post is one of a series leading up to a purely datadriven probabilistic prediction model for the New Zealand general election in 2017. No punditry will be indulged in (if only to avoid complications with my weekday role as an apolitical public servant)! This is straight statistics, if there is such a thing…
There are important sources of uncertainty in political polling other than sampling errorAn important question in using polling/survey data to predict election results is how to quantify the “house effects” of the different polling companies conducting surveys of potential voters. This is important for establishing which of the various pollsters is best at predicting the overall result which has obvious commercial implications for them; and also for predicting future election results for polling aggregators.
The naive approach is the one taken in this article on polling leading up to the 2014 New Zealand general election, which takes polling results as predictions of the eventual result. A more sophisticated approach is to treat polling numbers as data in an a predictive model in which election vote is a timebound result. Basically, voters get to change their mind between a poll and the election – the more time between the two, the more randomness there is, and the uncertainty from this can certainly swamp the oftenquoted margins of error.
If we look at the data from three major pollsters that have both a track record and continue to poll today (March 2017) for New Zealand intended party vote, we see that discrepancies between election results and voting intention estimated from opinion polls are not purely a matter of pollsters’ uncertainty or error:
The case of the Labour Party in the period between the 2011 and 2014 elections is a particular case in point. All three pollsters in the chart had, on average, higher intended party vote for Labour during the period than eventually transpired. But the consistency of their patterns strongly suggests that there was a genuine trend over time, best captured by Reid Research and Roy Morgan, with a decrease in support in the months leading up to the actual election.
In contrast, in the 2011 and 2014 (and perhaps 2008) elections, all three pollsters do indeed seem to systematically underestimate the vote for New Zealand First.
There are several things to take account ofA wellregarded approach to studying the relationship between opinion polls and electoral results has been based on a seminal 2005 article by Simon Jackman, “Pooling the Polls Over an Election Campaign”:
“Poll results vary over the course of a campaign election and across polling organisations, making it difficult to track genuine changes in voter support. I present a statistical model that tracks changes in voter support over time by pooling the polls, and corrects for variation across polling organisations due to biases known as ‘house effects’. The result is a less biased and more precise estimate of vote intentions than is possible from any one poll alone.”
Simon Jackman
The method is also described in his classic text, Bayesian Analysis for the Social Sciences. The statistical method theorises a latent, largely unobserved (ie except on election day) state space of the voting intention each day, and that polls are various dirty variables linked to the state of that latent variable, but not direct observations of it. The state space changes at random due to unknown forces over time, and the various opinion polls are grossly imperfect glimpses of its reality. The method is state of the art but very computationally intensive – it requires estimating the state of the voting intention for each party on each day since observations began, resulting in many thousands of parameters if the model is fit over multiple election cycles. Computational methods exist for fitting such models and I intend to do so at some point, but I also wanted a quicker look at the data that doesn’t take hours or days to fit fit.
It’s worth noting at this point that there’s good reason to believe that the latent, unobserved voting intention isn’t as volatile as polls indicate – ie an endorsement of Jackman’s method or a similar smoothing approach to aggregate polls, with a conservative approach to change in the underlying intention compared even to the common weighted rolling average method of aggregation provides. A brilliant study during the 2012 USA Presidential Election by Andrew Gelman and Andrew Rothschild showed compellingly that much of the fluctuation in voting intention comes from bias in responses:
“When there was good news for Mitt Romney, more Republicans opted to respond to the poll; when Obama was riding high, Democrats were more likely to respond. The result was that large and systematic changes in nonresponse had the effect of amplifying small changes in actual voter intention.”
These results are compelling evidence for some kind of smoothing of polling results, that not only provides a weighted average of recent polls, but in addition some healthy statistical skepticism (or regularisation to shrink towards zero) to the apparent rapid moves up and down in voting intention.
Generalized additive models give a cheap and quick solution to a broad range of smoothing challengesTo get a glance at house effects in New Zealand polling, I decided to use generalized additive models for the polls leading up to individual elections, for one party at a time, to produce a predicted election result for each pollsterpartyelection combination. Most studies I’ve seen in this space have applied Jackman’s method to a single election cycle; the computational problems magnify with longer periods as do the data management challenges. My nzelect R package makes available multiple years of polling data sourced from Wikipedia (albeit unfortunately without sample size and margin of error information); using a GAM rather than a state space model massively reduces the computational challenges.
I limit myself to the seven parties with a party vote likely to influence the 2017 election and who also were in previous elections; and the three major pollsters who both have a track record and an intention of polling leading up to the 2017 election (my understanding is that Digipoll do not have such intention; otherwise they would be a fourth pollster to include in the analysis).
For each combination of party and election year I used the polling data from all pollsters to predict election result, allowing an absolute level of bias for each pollster in the model. For two of the pollsters I have four complete election cycles of data and I obtained these results:
Roy Morgan stands out as particularly inclined to overestimate the vote for the Greens; and Colmar Brunton to underestimate the vote for New Zealand First.
For the third pollster I only have two election cycles of data and I get these results:
Here is a table summarising the amount each pollster appears to over/under estimate voting intention, comparing the prediction for election day based on patterns to date with the actual result:
Party Colmar Brunton Reid Research Roy Morgan ACT 0.2% 0.1% 0.2% Green 0.6% 2% 2.6% Labour 0.1% 0.9% 1% Maori 0.2% 0.2% 0.1% National 3.2% 3.2% 0.1% NZ First 1.8% 2.5% 0% United Future 0.4% 0.3% 0% Number of pollsFor those who are interested, here is an exploratory chart on the number of actual polls available by pollster and election cycle which led me to my conclusion to fit models just to polls from Colmar Brunton, Reid Research and Roy Morgan:
R codeHere is the code in R for all the above. Caveat – this is very much a spare time project for me, and none of this has been peer reviewed. I’d never allow this situation in my professional life, but here I am publishing results on a political subject this way… Use at your own risk, and if you spot something wrong or an idea for improvement, please let me know.
library(nzelect) library(mgcv) library(tidyverse) library(scales) library(magrittr) library(forcats) library(RColorBrewer) #=====================data prep======================= # vector of colours to use in graphics house_colours < c("black", brewer.pal(3, "Set1")) names(house_colours) < c("Election result", "Reid Research", "Colmar Brunton", "Roy Morgan") # vector of just the seven main parties to use parties < polls %>% filter(ElectionYear == 2017) %>% distinct(Party) %>% filter(!Party %in% c("Destiny", "Progressive", "Mana", "Conservative")) %$% Party #===============introductory graphic======================== election_dates < polls %>% filter(Pollster == "Election result") %>% select(MidDate) %>% distinct() d1 < polls %>% filter(Party %in% parties) %>% filter(Pollster %in% c("Reid Research", "Colmar Brunton", "Roy Morgan", "Election result")) %>% mutate(Party = fct_reorder(Party, VotingIntention, .desc = TRUE), Pollster = fct_relevel(Pollster, "Election result")) d1 %>% ggplot(aes(x = MidDate, y = VotingIntention, colour = Pollster)) + geom_vline(xintercept = as.numeric(election_dates$MidDate), colour = "orange") + geom_line(alpha = 0.4) + geom_smooth(data = filter(d1, Pollster != "Election result"), span = .3, se = FALSE) + geom_line(data = filter(d1, Pollster == "Election result"), size = 1, alpha = 0.5) + geom_point(data = filter(d1, Pollster == "Election result"), size = 2) + scale_y_continuous("Voting intention", label = percent) + scale_x_date("") + labs( colour = "") + scale_colour_manual(values = house_colours) + ggtitle("Survey versus actual performance in New Zealand voting behaviour", "New Zealand First seems systematically underestimated; Greens perhaps overestimated.") + labs(caption = "Source: polls data collected by Wikipedia, available in the {nzelect} R package") + facet_wrap( ~ Party, scales = "free") + theme(legend.position = c(0.7, 0.15)) #=============estimate and present house "bias"============= house_bias < function(elect_years, pollsters){ # depends on these objects being in environmenet: # polls, house_colours, parties houses < expand.grid(elect_years, pollsters) names(houses) < c("ElectionYear", "Pollster") for(j in 1:length(parties)){ the_party = parties[j] # election results: results < polls %>% filter(ElectionYear %in% elect_years & ElectionYear != 2002) %>% filter(Pollster == "Election result") %>% filter(Party == the_party) for(i in 1:length(elect_years)){ # Note we include *all* pollsters in the data for fitting the model thedata < polls %>% filter(ElectionYear == elect_years[i] & Pollster != "Election result") %>% filter(Party == the_party) mod < gam(VotingIntention ~ s(as.numeric(MidDate)) + Pollster, family = "quasibinomial", data = thedata) # for predicting values, we only take the pollsters we have an interest in: preddata < data.frame(MidDate = as.numeric(results[i, "MidDate"]), Pollster = pollsters) # house effect is shown by the amount the predicted value from polling # is *more* than the actual vote. So a positive score means the poll # overestimated the actual vote: houseeffects < predict(mod, newdata = preddata, type = "response")  results[i, "VotingIntention"] houses[houses$ElectionYear == elect_years[i], the_party] < houseeffects } } p < houses %>% gather(Party, `Polling overestimate`, ElectionYear, Pollster) %>% ggplot(aes(x = ElectionYear, y = `Polling overestimate`, colour = Pollster)) + geom_hline(yintercept = 0, colour = "black") + geom_point() + geom_line() + facet_wrap(~Party, ncol = 4) + scale_colour_manual(values = house_colours) + scale_x_continuous("Election year", breaks = c(2005, 2008, 2011, 2014), limits = c(2004, 2015)) + scale_y_continuous(label = percent) + theme(legend.position = c(0.9, 0.18)) + ggtitle("Statistical forecast of election compared to actual result", "Forecasts use time series methods based on pollsters' results, are not actual pollsters' forecasts") + labs(caption = "Source: polls data collected by Wikipedia, available in the {nzelect} R package") print(p) houses_av < houses %>% gather(Party, Bias, ElectionYear, Pollster) %>% group_by(Party, Pollster) %>% summarise(Bias = mean(Bias)) return(houses_av) } hb1 < house_bias(elect_years = c(2005, 2008, 2011, 2014), pollsters = c("Colmar Brunton", "Roy Morgan")) hb2 < house_bias(elect_years = c(2011, 2014), pollsters = c("Reid Research", "Colmar Brunton", "Roy Morgan")) # table for blog post: hb2 %>% filter(Pollster == "Reid Research") %>% rbind(hb1) %>% arrange(Party, Pollster) %>% mutate(`Average bias` = paste0(round(Bias * 100, 1), "%")) %>% select(Bias) %>% spread(Pollster, `Average bias`) %>% knitr::kable(align = "lrrr") #===================how many polls per year========================== polls %>% select(ElectionYear, Pollster, MidDate) %>% distinct() %>% group_by(ElectionYear, Pollster) %>% summarise(Polls = n()) %>% ungroup() %>% mutate(Pollster = fct_reorder(Pollster, Polls)) %>% ggplot(aes(x = Polls, y = Pollster, colour = as.factor(ElectionYear))) + geom_point() + facet_wrap(~ElectionYear) + theme(legend.position = "none")8th MilanoR Meeting: April 5th
(This article was first published on MilanoR, and kindly contributed to Rbloggers)
MilanoR Staff is happy to announce the 8th MilanoR Meeting!
A MilanoR meeting is an occasion to bring together R users in the Milano area to share knowledge and experiences. The meeting is a free event, open to everybody. The meeting consists of two R talks and a free buffet + networking time.
This time we have two exceptional speakers: Romain François, historical author of the famous package Rcpp, and Stefano Iacus, member of the R Foundation and cofounder of Voices from the blogs.
WhenWednesday, April 5, 2017
from 6,00 to 9 pm
Welcome Presentation
by Quantide
Talk 1: R and C++: past, present and future
by Romain François
Talk 2: yuimaGUI a Shiny app for the analysis and simulation of financial time series
by Stefano Iacus
WhereMicrosoft House
Viale Pasubio 21, Milano
Buffet
Our sponsors will provide the buffet after the meeting.
MilanoR Meeting is a free event, open to all R users and enthusiasts or those who wish to learn more about R.
Places are limited so, if you would like to attend to the MilanoR meeting, please register here.
—— FAQ I want to know more about the talks..R and C++: past, present and future: Combining the comfort of R with C++ is a very popular way to achieve better performance for critical code. We will review the current state of the art with Rcpp and its family of packages and share some insights on how we imagine the future of R and C++ integration and the challenges it represents.
yuimaGUI a Shiny app for the analysis and simulation of financial time series: The yuimaGUI is an interactive dashboard developed on top of the yuima package, which is an S4 framework for the simulation and inference of several classes of time series widely used in finance and derivate pricing. The yuimaGUI makes easy and accessible to a wider audience the data I/O from many sources, the explorative data analysis and model fitting and selection using information criteria. Scenarios simulation based on the fitted models as well as Monte Carlo analysis is also available at the cost of few additional clicks. yuimaGUI is joint Project with E. Guidotti and L. Mercuri
..and the speakers..Romain François is a consulting datactive at ThinkR (www.thinkr.fr). Romain has been professionally involved in the R community for more than 10 years offering training and consulting services. As an historical author of Rcpp, Romain is dedicated to offering practical and seemless solutions to leverage the high performance that C++ has to offer to the R community.
Stefano Iacus is full professor of statistics the Department of Economics, Management and Quantitative Methods at the University of Milan. He has been a member of the R Core Team (19992014) for the development of the R statistical environment and now member of the R Foundation for Statistical Computing. He is also a cofounder of VOICES from the Blogs, a startup company of the University of Milan, specialised in Social Media and Big Data Analysis. His research interests include inference for stochastic processes, simulation, computational statistics, causal inference, text mining, and sentiment analysis.
…and the sponsor..Quantide is a provider of consulting services and training courses about Data Science and Big Data. It’s specialized in R, the open source software for statistical computing. Headquartered in Legnano, near Milan (Italy), Quantide has been supporting for 10 years customers from several industries around the world. Quantide is the founder and the supporter of the Milano R meeting, since its first edition in 2012.
Microsft – Well, who dosn’t know Microsoft? But may be of interest that Microsoft develops Microsoft R Open, formerly known as Revolution R Open (RRO), the enhanced distribution of R from Microsoft Corporation. It is a complete open source platform for statistical analysis and data science. It complements and completes the Data Platform offering from Microsoft that is well described and accessible in Azure, the Microsoft Cloud, via Cortana Intelligence Suite.
For any further informations you can contact us at admin@milanor.net
The post 8th MilanoR Meeting: April 5th appeared first on MilanoR.
To leave a comment for the author, please follow the link and comment on their blog: MilanoR. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
DIY – cheat sheets
I found recently, that in addition to a great list of cheatsheets designed by RStudio, one can also download a template for new cheatsheets from RStudio Cheat Sheets webpage.
With this template you can design your own cheatsheet, and submit it to the collection of Contributed Cheatsheets (Garrett Grolemund will help to improve the submission if needed).
Working on a new cheatsheet is pretty enjoying. You need to squeeze selected important stuff into a quite small surface like one or two pages.
Lots of fun.
I did it for eurostat and survminer packages (cheatsheets below).
And would love to see one for caret.
How Do You Discover R Packages?
(This article was first published on data science ish, and kindly contributed to Rbloggers)
Like I mentioned in my last blog post, I am contributing to a session at userR 2017 this coming July that will focus on discovering and learning about R packages. This is an increasingly important issue for R users as we all decide which of the 10,000+ packages to invest time in understanding and then use in our work.
library(dplyr) available.packages() %>% tbl_df() ## # A tibble: 10,276 × 17 ## Package Version Priority Depends ## <chr> <chr> <chr> <chr> ## 1 A3 1.0.0 <NA> R (>= 2.15.0), xtable, pbapply ## 2 abbyyR 0.5.0 <NA> R (>= 3.2.0) ## 3 abc 2.1 <NA> R (>= 2.10), abc.data, nnet, quantreg, MASS, locfit ## 4 ABCanalysis 1.2.1 <NA> R (>= 2.10) ## 5 abc.data 1.0 <NA> R (>= 2.10) ## 6 abcdeFBA 0.4 <NA> Rglpk,rgl,corrplot,lattice,R (>= 2.10) ## 7 ABCoptim 0.14.0 <NA> <NA> ## 8 ABCp2 1.2 <NA> MASS ## 9 ABC.RAP 0.9.0 <NA> R (>= 3.1.0) ## 10 abcrf 1.5 <NA> R(>= 3.1) ## Imports LinkingTo ## <chr> <chr> ## 1 <NA> <NA> ## 2 httr, XML, curl, readr, progress <NA> ## 3 <NA> <NA> ## 4 plotrix <NA> ## 5 <NA> <NA> ## 6 <NA> <NA> ## 7 Rcpp Rcpp ## 8 <NA> <NA> ## 9 graphics, stats, utils <NA> ## 10 readr, MASS, ranger, parallel, stringr <NA> ## # ... with 10,266 more rows, and 11 more variables: Suggests <chr>, Enhances <chr>, License <chr>, ## # License_is_FOSS <chr>, License_restricts_use <chr>, OS_type <chr>, Archs <chr>, MD5sum <chr>, ## # NeedsCompilation <chr>, File <chr>, Repository <chr>To prepare for this session and gain some understanding, I am running an online survey about how R users currently discover and learn about R packages. I know that online polls like this can’t give us the same kind of understanding as surveys with carefully designed samples, but it still will give us some insight into how users are currently going about the process of deciding which packages to use. This is important information both for package developers, the maintainers of CRAN Task Views, and R users in general.
There is one question on the survey that allows the respondent to select all the answers that apply:
How do you currently discover and learn about R packages? Email lists such as rhelp, rpackages, or rpkgdevel
 General search websites such as Google and Yahoo
 Rspecific search websites such as METACRAN or Rdocumentation
 R packages built for search such as the sos package
 CRAN Task Views
 Your personal network, such as colleagues and professors
 Conferences, meetups, or seminars
 Books, textbooks, or journal articles (JSS, JOSS, RJournal)
 Social media such as blogs, Rbloggers, Twitter, Slack, or GitHub contacts
 Other
If you are an R user, please go to the poll and vote. If you have other ways that you don’t feel were fairly covered in these options, feel free to leave a comment here on my blog and we can consider them in our discussion at useR.
To leave a comment for the author, please follow the link and comment on their blog: data science ish. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Practical Data Science with R: ACM SIGACT News Book Review and Discount!
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
Our book Practical Data Science with R has just been reviewed in Association for Computing Machinery Special Interest Group on Algorithms and Computation Theory (ACM SIGACT) News by Dr. Allan M. Miller (U.C. Berkeley)!
The book is half off at Manning form March 21st 2017 using the following code (please share/Tweet):
Deal of the Day March 21: Half off my book Practical Data Science with R. Use code dotd032117au at https://www.manning.com/dotd
Please read on for links and excerpts from the review.
We are really excited that our book for practitioners is getting some academic / professionalassociation attention.
The primary link for the ACM SIGACT review is here (but paywalled): doi :10.1145/3061640.3061644. However, editor Professor Fred Green eventually shares the reviews here (not up yet, but hopefully soon).
Dr. Nina Zumel and I worked very hard work through lot of substantial material into the book. It was a lot of work and requires some effort to work through. Please treat the book as a series of lectures and worked examples. Many readers have reported back that working through the book really pays off. We are thrilled with the reception the book has been getting.
Here are some excerpts from the review:
Zumel and Mount’s book “Practical Data Science with R” provides exactly what the title says: practical coverage of commonly used data science methods using the R programming language. Topics include: how to conduct data science projects, manage and explore datasets, choose and evaluate modeling methods, and present results. The R programming language is used throughout the book for managing data, building models, and generating both graphical and tabular results. The book includes useful appendices on basic R language and tools, and relevant statistical concepts.
…
The book can serve as a useful advanced introduction to data science for readers with at least a basic understanding of statistics and computer programming. However, it is not designed to be a beginner’s introduction. Readers seeking a purely introductory text would likely find it difficult to grasp many of the concepts covered, but at least can get a valuable overview of the subject. Most would likely return to the book many times as their level of understanding and experience doing data science grows.
…
“Practical Data Science with R” is a remarkable book, packed with both valuable technical material about data science, and practical advice for how to conduct a successful data science project. In a field that is so new, and growing so quickly, it is an essential guide for practitioners, especially for the large numbers of new data scientists moving into the field. It is not only a worthwhile read, it can serve as a useful ongoing technical reference and practical manual for the data science practitioner.
A lot of books cover tools and notation, we took a lot of effort to discuss what is needed to get a great data science result, and how to do it with R.
To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
News and Updates Surrounding plotly for R
(This article was first published on R – Modern Data, and kindly contributed to Rbloggers)
The plotly R package will soon release version 4.6.0 which includes new features that are over a year in the making. The NEWS file lists all the new features and changes. This webinar highlights the most important new features including animations and multiple linked views.
Concrete examples with code that you can run yourself will be covered in this webinar, however Carson will give you a more in depth learning experience at his workshop at plotcon taking place in Oakland on May 4th.
Here’s an example of the animation capabilities supported by the Plotly package
library(plotly) library(quantmod) library(zoo) library(dplyr) library(reshape2) library(PerformanceAnalytics) stocklist = c("AAPL","GOOGL","MSFT","BRKA","AMZN","FB","JNJ","XOM","JPM","WFC","BABA", "T","BAC","GE","PG","CHL","BUD","RDSA","WMT","V","VZ","PFE","CVX","ORCL", "KO","HD","NVS","CMCSA","DIS","MRK","PM","CSCO","TSM","C","INTC","IBM","UNH", "HSBC","PEP","MO","UL","CX","AMGN","MA","CCV","TOT","BTI","SAP","MMM","MDT") ddf < getSymbols(Symbols = stocklist[1], auto.assign = F) ddf < ddf[,6] pb < txtProgressBar(min = 0, max = length(stocklist)  1, style=3) for(i in stocklist[1]){ df < getSymbols(Symbols = i, auto.assign = F) df < df[,6] ddf < merge(ddf, df) setTxtProgressBar(pb, which(stocklist[1] == i)) } month < as.yearmon(index(ddf)) prices < data.frame(ddf, month) names(prices) < c(stocklist, "Month") prices < melt(prices, id.vars = "Month") # Calculate returns CalcRet < function(x, vec = F){ ret < (x[2:length(x)]  x[1:(length(x)  1)]) / x[1:(length(x)  1)] if(vec == T) { return(ret) }else{ return(mean(ret)) } } returns < prices %>% group_by(Month, variable) %>% summarize(Return = CalcRet(value)) returns < data.frame(returns, VAR = "Returns") names(returns) < c("Period", "Stock", "Value", "Variable") # Calculate volatility volatility < prices %>% group_by(Month, variable) %>% summarize(Volatility = sd(CalcRet(value, vec = T))) volatility < data.frame(volatility, VAR = "Volatility") names(volatility) < c("Period", "Stock", "Value", "Variable") # Create df for plotting plot.df < rbind(returns, volatility) plot.df < dcast(plot.df, Period + Stock ~ Variable, value.var = "Value") plot.df$Year < format(plot.df[,1], "%Y") p < plot_ly(plot.df, x = ~Volatility, y = ~Returns) %>% add_markers(color = ~Stock, size = ~(Returns / Volatility), frame = ~Year, marker = list(opacity = 0.6, line = list(width = 1, color = "black"))) %>% layout(title = "Monthly Return vs Volatility over last 10 years <br> for 50 US stocks over time", showlegend = F, plot_bgcolor = "#e6e6e6", paper_bgcolor = "#e6e6e6") %>% animation_opts(frame = 1000) About Carson
Carson Sievert is a freelance data scientist developing software and creating products that make data analysis more exciting and accessible. During his PhD, he became maintainer of the R package plotly and was recognized with the John Chambers Statistical Software Award. He is also author and maintainer of numerous other R packages including: LDAvis, animint, pitchRx, and rdom.
Follow Carson on Twitter
To leave a comment for the author, please follow the link and comment on their blog: R – Modern Data. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Data science for Doctors: Inferential Statistics Exercises (part2)
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people
to have some basic knowledge of data science. This series aims to help people that are around medical field to enhance their data science skills.
We will work with a health related database the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University.
Please find further information regarding the dataset there.
This is the fifth part of the series and it aims to cover partially the subject of Inferential statistics.
Researchers rarely have the capability of testing many patients,or experimenting a new treatment to many patients,
therefore making inferences out of a sample is a necessary skill to have. This is where inferential statistics comes into play.
In more detail, in this part we will go through the hypothesis testing for binomial distribution (Binomial test)
and normal distribution (Ztest). If you are not aware
of what are the mentioned distributions please go here to acquire
the necessary background.
Before proceeding, it might be helpful to look over the help pages for the binom.test, mean,sd ,sqrt, z.test.
Moreover it is crucial to be familiar with the Central Limit Theorem.
install.packages(“TeachingDemos”)
library(TeachingDemos)
Please run the code below in order to load the data set and transform it into a proper data frame format:
url < "https://archive.ics.uci.edu/ml/machinelearningdatabases/pimaindiansdiabetes/pimaindiansdiabetes.data"
data < read.table(url, fileEncoding="UTF8", sep=",")
names < c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) < names
data < data[which(data$mass ==0),]
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Suppose that we take a sample of 30 candidates that tried a medicine and 5 of them are positive.
The null hypothesis is H_{0}: p = average of classes, is to be tested against H1: p != average of classes.
This practically means whether the drug had an effect on the patients
Exercise 2
Apply the same test as above but instead of writing the number of samples try to apply the test in respect to the number of
successes and failures (5,25).
Exercise 3
Having the same null hypothesis as the exercises 1,2 apply a onesided test where H1: p < average of classes.
Exercise 4
At the previous exercises we didn’t specified the confidence interval, so it applied it with the default 0.95. Run the test from exercise 3 but instead of having confidence interval of 0.95 run it with confidence interval 0.99.
Exercise 5
We have created another drug and we tested it on other 30 candidates. After having taken the medicine for a few weeks only 2 out of 30 were positive. We got really excited and decided to set the confidence interval to 0.999. Does that drug have an actual impact?
Exercise 6
Suppose that we establish a new diet and the average of the sample,of size 30, of candidates who tried this diet had average mass 29 after the testing period. Find the confidence interval for significance level of 0.05. Keep in mind that we run the test and compare it in respect to the data$mass variable
Exercise 7
Find the Zscore of the sample.
Exercise 8
Find the pvalue for the experiment.
Exercise 9
Run the ztest using the z.test function with confidence level of 0.95 and let the alternative hypothesis be that the diet had an effect. (twosided test)
Exercise 10
Let’s get a bit more intuitive now, let the alternative hypothesis be that the diet would lead to lower average body mass with confidence level of 0.99. (onesided test)
Related exercise sets: Data Science for Doctors – Part 4 : Inferential Statistics (1/5)
 Data Science for Doctors – Part 2 : Descriptive Statistics
 Data Science for Doctors – Part 3 : Distributions
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Preparing Datetime Data for Analysis with padr and dplyr
(This article was first published on That’s so Random, and kindly contributed to Rbloggers)
Two months ago padr was introduced, followed by an improved version that allowed for applying pad on group level. See the introduction blogs or the vignette("padr") for more package information. In this blog I give four more elaborate examples on how to go from raw data to insight with padr, dplyr and ggplot2.
They might serve as recipes for time series problems you want to solve. The dataset emergency is available in padr, and contains about a year of emergency data in Montgomery County, PA. For this blog I only use the twelve most prevalent emergencies.
Lets start with a fairly simple problem, we are calculating the relative distribution of events within each month. We normalize the counts within each month, and then plot through time. geom_bar does most of the heavy lifiting here, we just need the months counts for each event. (I checked, all events have calls in each month, there is no need for padding.)
top12 %>% thicken('month', col = 'month') %>% count(title, month) %>% ggplot(aes(month, n)) + geom_bar(aes(fill = title), col = "black", stat = 'identity', position = "fill") 2) Get a distribution, per event, per time unitFor each clock hour, for each event, we want to compute the average of the number of occurences. The first step is to make the hourly distribution out of the raw data. For every hour we need to count the number of calls per event. This would give no records if there were no calls within a givin hour. If we did not account for this, the calculated averages would be too high. (Recently called “the zerobug” in this blog post). Therefore we use pad and fill_by_value to insert records and give them the value 0.
hourly_distribution < top12 %>% thicken('hour') %>% count(title, time_stamp_hour) %>% pad(group = 'title') %>% fill_by_value(n)Next step is extracting the clock hour from each timestamp, and calculate the mean per event, per clock hour.
hourly_distribution %>% mutate(clock_hour = lubridate::hour(time_stamp_hour) ) %>% group_by(clock_hour, title) %>% summarise(mn = mean(n)) %>% ggplot(aes(clock_hour, mn)) + geom_line() + facet_wrap(~title, scales = 'free') 3) Get the 30th observation per time unitFor this dataset it might not be terribly interesting, but there are many situations were it is very useful to extract the first, last, or nth observation per time unit. Lets find the 30th observation per week, where weeks are starting on Mondays. We need an offset in thicken here, to make the weeks start on Mondays.
first_day < top12$time_stamp %>% as.Date %>% min first_day %>% weekdays ## [1] "Thursday" top12 %>% thicken('week', start_val = first_day  3) %>% group_by(title, time_stamp_week) %>% filter(min_rank(time_stamp) == 30) %>% mutate(weekday = weekdays(time_stamp) %>% factor(levels = (c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday') %>% rev) )) %>% ggplot(aes(weekday)) + geom_bar(aes(fill = title)) + coord_flip() + guides(fill = FALSE) + facet_wrap(~title)So here we have a distribution on which day the 30th call came in, for each of the twelve events.
4) Make a moving averageMoving averages are very helpful for smoothing time series. It is often a better indication of the underlying trend than the raw data. I recently learned about the RcppRoll package, when I was browsing through R for Data Science. This is a nice package by Kevin Ushey, that makes it terribly easy to calculate rolling stats on a vector. Here we want the moving average of the daily count of each of the events.
top12 %>% thicken('day', col = 'day') %>% count(day, title) %>% pad(group = 'title') %>% fill_by_value(n) %>% group_by(title) %>% mutate(moving_average = RcppRoll::roll_meanr(x = n, n = 30, fill = NA)) %>% ggplot(aes(day, n)) + geom_line() + geom_line(aes(y = moving_average), col = 'red') + facet_wrap(~title, scales = 'free')There we have some recipes for preparing data containing timestamp for analysis. Do you have a time seriesrelated challenge not adressed here? I would love to hear from you, and try to figure out an effective way for solving it! Just send an email or post a comment.
To leave a comment for the author, please follow the link and comment on their blog: That’s so Random. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Preparing Datetime Data for Analysis with `padr` and `dplyr`
Two months ago padr was introduced, followed by an improved version that allowed for applying pad on group level. See the introduction blogs or the vignette("padr") for more package information. In this blog I give four more elaborate examples on how to go from raw data to insight with padr, dplyr and ggplot2.
They might serve as recipes for time series problems you want to solve. The dataset emergency is available in padr, and contains about a year of emergency data in Montgomery County, PA. For this blog I only use the twelve most prevalent emergencies.
Lets start with a fairly simple problem, we are calculating the relative distribution of events within each month. We normalize the counts within each month, and then plot through time. geom_bar does most of the heavy lifiting here, we just need the months counts for each event. (I checked, all events have calls in each month, there is no need for padding.)
top12 %>% thicken('month', col = 'month') %>% count(title, month) %>% ggplot(aes(month, n)) + geom_bar(aes(fill = title), col = "black", stat = 'identity', position = "fill") 2) Get a distribution, per event, per time unitFor each clock hour, for each event, we want to compute the average of the number of occurences. The first step is to make the hourly distribution out of the raw data. For every hour we need to count the number of calls per event. This would give no records if there were no calls within a givin hour. If we did not account for this, the calculated averages would be too high. (Recently called “the zerobug” in this blog post). Therefore we use pad and fill_by_value to insert records and give them the value 0.
hourly_distribution < top12 %>% thicken('hour') %>% count(title, time_stamp_hour) %>% pad(group = 'title') %>% fill_by_value(n)Next step is extracting the clock hour from each timestamp, and calculate the mean per event, per clock hour.
hourly_distribution %>% mutate(clock_hour = lubridate::hour(time_stamp_hour) ) %>% group_by(clock_hour, title) %>% summarise(mn = mean(n)) %>% ggplot(aes(clock_hour, mn)) + geom_line() + facet_wrap(~title, scales = 'free') 3) Get the 30th observation per time unitFor this dataset it might not be terribly interesting, but there are many situations were it is very useful to extract the first, last, or nth observation per time unit. Lets find the 30th observation per week, where weeks are starting on Mondays. We need an offset in thicken here, to make the weeks start on Mondays.
first_day < top12$time_stamp %>% as.Date %>% min first_day %>% weekdays ## [1] "Thursday" top12 %>% thicken('week', start_val = first_day  3) %>% group_by(title, time_stamp_week) %>% filter(min_rank(time_stamp) == 30) %>% mutate(weekday = weekdays(time_stamp) %>% factor(levels = (c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday') %>% rev) )) %>% ggplot(aes(weekday)) + geom_bar(aes(fill = title)) + coord_flip() + guides(fill = FALSE) + facet_wrap(~title)So here we have a distribution on which day the 30th call came in, for each of the twelve events.
4) Make a moving averageMoving averages are very helpful for smoothing time series. It is often a better indication of the underlying trend than the raw data. I recently learned about the RcppRoll package, when I was browsing through R for Data Science. This is a nice package by Kevin Ushey, that makes it terribly easy to calculate rolling stats on a vector. Here we want the moving average of the daily count of each of the events.
top12 %>% thicken('day', col = 'day') %>% count(day, title) %>% pad(group = 'title') %>% fill_by_value(n) %>% group_by(title) %>% mutate(moving_average = RcppRoll::roll_meanr(x = n, n = 30, fill = NA)) %>% ggplot(aes(day, n)) + geom_line() + geom_line(aes(y = moving_average), col = 'red') + facet_wrap(~title, scales = 'free')There we have some recipes for preparing data containing timestamp for analysis. Do you have a time seriesrelated challenge not adressed here? I would love to hear from you, and try to figure out an effective way for solving it! Just send an email or post a comment.
Rcpp 0.12.10: Some small fixes
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
The tenth update in the 0.12.* series of Rcpp just made it to the main CRAN repository providing GNU R with by now over 10,000 packages. Windows binaries for Rcpp, as well as updated Debian packages will follow in due course. This 0.12.10 release follows the 0.12.0 release from late July, the 0.12.1 release in September, the 0.12.2 release in November, the 0.12.3 release in January, the 0.12.4 release in March, the 0.12.5 release in May, the 0.12.6 release in July, the 0.12.7 release in September, the 0.12.8 release in November, and the 0.12.9 release in January — making it the fourteenth release at the steady and predictable bimontly release frequency.
Rcpp has become the most popular way of enhancing GNU R with C or C++ code. As of today, 975 packages on CRAN depend on Rcpp for making analytical code go faster and further. That is up by sixtynine packages over the two months since the last release — or just over a package a day!
The changes in this release are almost exclusively minor bugfixes and enhancements to documentation and features: James "coatless" Balamuta rounded out the API, Iñaki Ucar fixed a bug concerning onecharacter output, Jeroen Ooms allowed for finalizers on XPtr objects, Nathan Russell corrected handling of lower (upper) triangular matrices, Dan Dillon and I dealt with Intel compiler quirks for his algorithm.h header, and I added a C++17 plugin along with some (overdue!) documentation regarding the various C++ standards that are supported by Rcpp (which is in essence whatever your compiler supports, i.e., C++98, C++11, C++14 all the way to C++17 but always keep in mind what CRAN and different users may deploy).
Changes in Rcpp version 0.12.10 (20170317)
Changes in Rcpp API:

Added new size attribute aliases for number of rows and columns in DataFrame (James Balamuta in #638 addressing #630).

Fixed singlecharacter handling in Rstreambuf (Iñaki Ucar in #649 addressing #647).

XPtr gains a parameter finalizeOnExit to enable running the finalizer when R quits (Jeroen Ooms in #656 addressing #655).


Changes in Rcpp Sugar:

Changes in Rcpp Attributes
 The C++17 standard is supported with a new plugin (used eg for g++6.2).

Changes in Rcpp Documentation:
 An overdue explanation of how C++11, C++14, and C++17 can be used was added to the Rcpp FAQ.
Thanks to CRANberries, you can also look at a diff to the previous release. As always, even fuller details are on the Rcpp Changelog page and the Rcpp page which also leads to the downloads page, the browseable doxygen docs and zip files of doxygen output for the standard formats. A local directory has source and documentation too. Questions, comments etc should go to the rcppdevel mailing list off the RForge page.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
tidyquant Integrates Quandl: Getting Data Just Got Easier
(This article was first published on businessscience.io  Articles, and kindly contributed to Rbloggers)
Today I’m very pleased to introduce the new Quandl API integration that is available in the development version of tidyquant. Normally I’d introduce this feature during the next CRAN release (v0.5.0 coming soon), but it’s really useful and honestly I just couldn’t wait. If you’re unfamiliar with Quandl, it’s amazing: it’s a web service that has partnered with toptier data publishers to enable users to retrieve a wide range of financial and economic data sets, many of which are FREE! Quandl has it’s own R package (aptly named Quandl) that is overall very good but has one minor inconvenience: it doesn’t return multiple data sets in a “tidy” format. This slight inconvenience has been addressed in the integration that comes packaged in the latest development version of tidyquant. Now users can use the Quandl API from within tidyquant with three functions: quandl_api_key(), quandl_search(), and the core function tq_get(get = "quandl"). In this post, we’ll go through a usercontributed example, How To Perform a Fama French 3 Factor Analysis, that showcases how the Quandl integration fits into the “Collect, Modify, Analyze” financial analysis workflow. Interested readers can download the development version using devtools::install_github("businessscience/tidyquant"). More information is available on the tidyquant GitHub page including the updated development vignettes.
Table of Contents Overview
 Prerequisites
 Example: How to Perform a Fama French 3 Factor Analysis
 Conclusion
 Recap
 Further Reading
tidyquant: Bringing financial analysis to the tidyverse
The topic for today is the Quandl integration today. Quandl enables access to a wide range of financial and economic data. It has it’s own R library appropriately named Quandl. Users can sign up for a FREE account, and in return users get an API key that enables access to numerous free and paid data sets. The Quandl package is very good: it enables searching the Quandl databases from the R console. Once a data set is found, the data set “code” can be used to retrieve the data in various formats. The one downside is that, although you can get multiple data sets (e.g. for multiple stocks, FRED codes, etc), the data returned is not “tidy”. This is where the tidyquant integration fits in. The integration makes it even more convenient to get data, and when multiple data sets are retrieved they are returned in one “tidy” data frame (aka “long” format which is perfect for grouping and scaling analysis)! In addition, you only need to load one package, tidyquant, to get the full capabilities of the Quandl API. The figure below shows how Quandl fits into the “Collect, Modify, Analyze” tidyquant financial analysis workflow.
If you are new to tidyquant, there’s a few core functions that you need to be aware of. I’ve broken them down by step in the CMA process.

Upstream (Collect): tq_get() is a onestop shop for getting webbased financial data in a “tidy” data frame format. Get data for daily stock prices (historical), key statistics (realtime), key ratios (historical), financial statements, economic data from the FRED, FOREX rates from Oanda, and now Quandl!
 Midstream (Modify):
 tq_transmute() and tq_mutate() manipulate financial data. tq_mutate() is used to add a column to the data frame. tq_transmute() is used to return a new data frame which is necessary for periodicity changes.
 tq_portfolio() aggregates a group (or multiple groups) of asset returns into one or more portfolios.
 Downstream (Analyze): tq_performance() integrates PerformanceAnalytics functions that turn investment returns into performance metrics.
To learn more about the functions, browse the Development Vignettes on GitHub.
PrerequisitesTo use the Quandl integration and other new tidyquant features, you’ll need to install the development version available on the Business Science GitHub Site. You can download with devtools.
devtools::install_github("businessscience/tidyquant")Next, load tidyquant, broom and corrr packages. The broom and corrr packages will help in our analysis at the end of the financial analysis workflow.
# Loads tidyquant, tidyverse, lubridate, quantmod, TTR, xts, zoo, PerformanceAnalytics library(tidyquant) library(broom) # Use `tidy` and `glance` functions library(corrr) # tidy correlationsI also recommend the opensource RStudio IDE, which makes R Programming easy and efficient especially for financial analysis. Now, onto a really neat example showing off why Quandl is such a great tool.
Example: How To Perform a Fama French 3 Factor AnalysisBefore we get started, I’d like to thank Bob Rietveld for the usage case. He’s been doing a lot of work with Fama French three and five factor models. You can find an example of his FF analyses here. In this example, we’ll perform a Fama French three factor regression on a portfolio of the following stocks: 20% AAPL, 20% F, 40% GE, and 20% MSFT. According to Investopedia:
The Fama and French Three Factor Model is an asset pricing model that expands on the capital asset pricing model (CAPM) by adding size and value factors to the market risk factor in CAPM. This model considers the fact that value and smallcap stocks outperform markets on a regular basis. By including these two additional factors, the model adjusts for the outperformance tendency, which is thought to make it a better tool for evaluating manager performance.
The CMA process steps we’ll implement are as follows:
 Collect Data: We’ll use the new Quandl integration to get both stock prices and Fama French data sets.
 Modify Data: This is a portfolio analysis so we’ll need to aggregate stock returns into a weighted portfolio
 Analyze Data: We’ll perform a regression analysis, and we need the broom package for the tidy() and glance() functions.
In this step, we will collect two data frames. The first is the historical stock returns for individual stocks. The second is the Fama French three factor data set. We are going to use the Quandl API integration so first set your API key using quandl_api_key(). If you don’t have an API key yet, you can sign up with Quandl.
quandl_api_key("yourapikeyhere") Collecting Historical Stock ReturnsNext, let’s create a table of stocks. We will use the “WIKI” database which returns open, high low, close, volume, dividends, splits, and adjusted prices. The Quandl data sets use the following code format: “Database” / “Data Set”. For “AAPL”, this would be “WIKI/AAPL” indicating the WIKI database and AAPL data set. The code in the first column will allow us to pipe (%>%) the stock list to the tq_get() function next.
stock_list_quandl < tribble( ~code, ~symbol, "WIKI/AAPL", "AAPL", "WIKI/F", "F", "WIKI/GE", "GE", "WIKI/MSFT", "MSFT" ) stock_list_quandl ## # A tibble: 4 × 2 ## code symbol ## <chr> <chr> ## 1 WIKI/AAPL AAPL ## 2 WIKI/F F ## 3 WIKI/GE GE ## 4 WIKI/MSFT MSFTOnce we have the stocks, we can very easily use tq_get(get = "quandl") to get stock prices and even stock returns depending on the options we use. The following time series options are available to be passed to the underlying Quandl::Quandl()function:
 order = “asc”, “desc”
 start_date (from) = “yyyymmdd”
 end_date (to) = “yyyymmdd”
 column_index = numeric column number (e.g. 1)
 rows = numeric row number indicating first n rows (e.g. 100)
 collapse = “none”, “daily”, “weekly”, “monthly”, “quarterly”, “annual”
 transform = “none”, “diff”, “rdiff”, “cumul”, “normalize”
We’ll use from and to to select a ten year time period from the beginning of 2007 through the end of 2016, transform = "rdiff" to get percentage returns, collapse = "monthly" to get monthly data, and column_index = 11 to get the eleventh column, “adj.close”. We’ll rename the column from “adj.close” to “monthly.returns” to accurately describe the values.
stock_returns_quandl < stock_list_quandl %>% tq_get(get = "quandl", from = "20070101", to = "20161231", transform = "rdiff", collapse = "monthly", column_index = 11) %>% rename(monthly.returns = adj.close) stock_returns_quandl ## # A tibble: 476 × 4 ## code symbol date monthly.returns ## <chr> <chr> <date> <dbl> ## 1 WIKI/AAPL AAPL 20161231 0.047468354 ## 2 WIKI/AAPL AAPL 20161130 0.025893958 ## 3 WIKI/AAPL AAPL 20161031 0.005045587 ## 4 WIKI/AAPL AAPL 20160930 0.064750236 ## 5 WIKI/AAPL AAPL 20160831 0.023618063 ## 6 WIKI/AAPL AAPL 20160731 0.090062762 ## 7 WIKI/AAPL AAPL 20160630 0.042659724 ## 8 WIKI/AAPL AAPL 20160531 0.071799336 ## 9 WIKI/AAPL AAPL 20160430 0.139921094 ## 10 WIKI/AAPL AAPL 20160331 0.127210673 ## # ... with 466 more rows Collecting Fama French 3Factor Monthly DataNext, we need to get the Fama French data. Suppose we don’t know exactly what we are looking for. We’ll use the function, quandl_search(), to query the Quandl API (a wrapper for Quandl.search()). We can search within the R console by setting query to a descriptive value. We’ll set per_page = 5 to get the top 5 results. We’ll set silent = TRUE to turn off the meta data output (in practice it may be beneficial to leave this easytoread option on). The results returned contain the “id”, dataset_code, “database_code”, “name”, “description”, etc, which gives us both insight into the data set contents and the information needed to retrieve. I’ve removed “description” to make it easier to view the information.
quandl_search(query = "FAMA FRENCH", per_page = 5, silent = TRUE) %>% select(description) id dataset_code database_code name refreshed_at newest_available_date oldest_available_date column_names frequency type premium database_id 30216128 MOMENTUM_A KFRENCH Fama/French Factors (Annual) 20170318T21:08:12.712Z 20161231 19271231 Date, Momentum annual Time Series FALSE 389 30579533 FACTORS_A KFRENCH Fama/French Factors (Annual) 20170318T21:06:21.885Z 20161231 19271231 Date, MktRF, SMB, HML, RF annual Time Series FALSE 389 2292156 FACTORS_M KFRENCH Fama/French Factors (Monthly) 20170318T21:06:21.953Z 20170131 19260731 Date, MktRF, SMB, HML, RF monthly Time Series FALSE 389 2292158 FACTORS_W KFRENCH Fama/French Factors (Weekly) 20170318T21:06:25.103Z 20170127 19260702 Date, MktRF, SMB, HML, RF weekly Time Series FALSE 389 2676225 MOMENTUM_M KFRENCH Fama/French Factors (Monthly) 20170318T21:08:12.746Z 20170131 19270131 Date, Momentum monthly Time Series FALSE 389The third result, “FACTORS_M”, is what we need. We can retrieve with tq_get(get = "quandl") by piping (%>%) "KFRENCH/FACTORS_M". (Remember that the format is always database code / dataset code). We’ll tack on collapse = "monthly" to ensure the dates match up with the returns, stock_returns_quandl.
# Get Fama French 3 Factor Data fama_french_3_M < "KFRENCH/FACTORS_M" %>% tq_get(get = "quandl", collapse = "monthly") fama_french_3_M ## # A tibble: 1,087 × 5 ## date mkt.rf smb hml rf ## <date> <dbl> <dbl> <dbl> <dbl> ## 1 20170131 1.94 0.95 2.76 0.04 ## 2 20161231 1.81 0.04 3.52 0.03 ## 3 20161130 4.86 5.68 8.44 0.01 ## 4 20161031 2.02 4.41 4.15 0.02 ## 5 20160930 0.25 2.00 1.34 0.02 ## 6 20160831 0.50 0.94 3.18 0.02 ## 7 20160731 3.95 2.90 0.98 0.02 ## 8 20160630 0.05 0.61 1.49 0.02 ## 9 20160531 1.78 0.27 1.79 0.01 ## 10 20160430 0.92 0.68 3.25 0.01 ## # ... with 1,077 more rowsNow we have all of the data needed. We are ready to move on to modifying the data.
Step 2: Modify DataThere’s two parts to this step. First, we will aggregate the portfolio in the weights specified in the beginning of the example:
 20% AAPL, 20% F, 40% GE, and 20% MSFT.
Second, we will join the aggregated portfolio returns with the Fama French data.
Aggregate PortfolioPortfolio aggregation is performed using tq_portfolio() as follows. We create a tibble (“tidy” data frame) of weights that can be mapped using the first column, “stocks”.
# Creating tibble of stock weights that can be mapped to the portfolio assets weights_tib < tibble(stocks = c("AAPL", "F", "GE", "MSFT"), weights = c(0.2, 0.2, 0.4, 0.2)) weights_tib ## # A tibble: 4 × 2 ## stocks weights ## <chr> <dbl> ## 1 AAPL 0.2 ## 2 F 0.2 ## 3 GE 0.4 ## 4 MSFT 0.2Then we pass the individual stock returns, stock_returns_quandl, to the tq_portfolio() function specifying the assets column “symbol” and the returns column “monthly.returns”. The weights_tib tibble is also passed to the weights argument. Note that there is also an argument, rebalance_on = c(NA, "years", "quarters", "months", "weeks", "days") if rebalancing is a consideration to factor into the model. Last, the output column is renamed to “monthly.returns” using the col_rename argument.
# Aggregating weighted portfolio returns portfolio_returns < stock_returns_quandl %>% tq_portfolio(assets_col = symbol, returns_col = monthly.returns, weights = weights_tib, rebalance_on = NA, col_rename = "monthly.returns") portfolio_returns ## # A tibble: 119 × 2 ## date monthly.returns ## <date> <dbl> ## 1 20070228 0.034414463 ## 2 20070331 0.022735121 ## 3 20070430 0.050631578 ## 4 20070531 0.068295584 ## 5 20070630 0.028425228 ## 6 20070731 0.002249988 ## 7 20070831 0.001170495 ## 8 20070930 0.077241302 ## 9 20071031 0.113769305 ## 10 20071130 0.076217903 ## # ... with 109 more rows Join Portfolio Returns and Fama French DataWe can join the two data sets by the “date” column in each using left_join from the dplyr package.
portfolio_fama_french < left_join(portfolio_returns, fama_french_3_M, by = "date") portfolio_fama_french ## # A tibble: 119 × 6 ## date monthly.returns mkt.rf smb hml rf ## <date> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 20070228 0.034414463 1.96 1.32 0.10 0.38 ## 2 20070331 0.022735121 0.68 0.06 0.23 0.43 ## 3 20070430 0.050631578 3.49 2.07 1.15 0.44 ## 4 20070531 0.068295584 3.24 0.01 0.07 0.41 ## 5 20070630 0.028425228 1.96 0.79 1.11 0.40 ## 6 20070731 0.002249988 3.73 2.50 3.34 0.40 ## 7 20070831 0.001170495 0.92 0.10 2.25 0.42 ## 8 20070930 0.077241302 3.22 2.28 1.91 0.32 ## 9 20071031 0.113769305 1.80 0.22 2.62 0.32 ## 10 20071130 0.076217903 4.83 2.61 1.18 0.34 ## # ... with 109 more rowsNow we are ready to analyze.
Step 3: Anaylze DataIn the final step we will analyze two ways. First, we will perform the three factor regression, which yields model parameters. Second, we will review visually by plotting a correlation matrix.
Three Factor Regression ModelThe article, “Rolling Your Own: ThreeFactor Analysis”, by William J. Bernstein with Efficient Frontier goes through an excellent stepbystep explaining the method. We are concerned with the following variables:
 Return of the Total Market minus the TBill return (mkt.rf): The return of the total market (CRSP 110) minus the Tbill return (Mkt)
 Small Minus Big (smb): The return of small company stocks minus that of big company stocks
 High Minus Low (hml): The return of the cheapest third of stocks sorted by price/book minus the most expensive third
Three factor regression is performed with the lm() function by analyzing the relationship between the portfolio returns and the three FF factors.
ff_3 < lm(monthly.returns ~ mkt.rf + smb + hml, data = portfolio_fama_french)Using glance from the broom package, we can review the regression metrics. Note kable() from the knitr package is used to create aesthetically pleasing tables. We can see from the “r.squared” value that 67% of the variance of the portfolio returns is explained by the model.
glance(ff_3) %>% knitr::kable() r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual 0.6709785 0.6623953 0.0424483 78.17375 0 4 209.1576 408.3151 394.4195 0.2072133 115Using tidy from the broom package, we can review the model coefficients: these are the most interesting. The intercept is the alpha, and at 0.005 the portfolio is outperforming the model by approximately 0.005% per month or roughly 0.055% per year (although the pvalue indicates this is not statistically significant). Next are the “loadings” for the three factors. The “mkt.rf” is the beta for the portfolio, which indicates very low volatility compared to the market (anything less than 1.0 means lower volatility than the market). The “smb” value of essentially zero signifies largecap (anything below 0 is large cap, above 0.5 is small cap). The “hml” value of essentially zero signifies a growth fund (a zero value defines a growth portfolio, a value of more than 0.3, a value fund).
tidy(ff_3) %>% knitr::kable() term estimate std.error statistic p.value (Intercept) 0.0046336 0.0039400 1.1760314 0.2420111 mkt.rf 0.0138197 0.0009619 14.3665066 0.0000000 smb 0.0027443 0.0018407 1.4909358 0.1387162 hml 0.0013105 0.0014951 0.8765756 0.3825446 Visualize CorrelationsWe can also visualize the results using the new corrr package to get a sense of the relationship the portfolio returns to each of the factors. The correlate() function creates a correlation matrix. The shave() function removes the upper diagonal.
portfolio_corr < portfolio_fama_french %>% select(c(date, rf)) %>% correlate() %>% shave() # `fashion` used for pretty printing fashion(portfolio_corr) %>% knitr::kable() rowname monthly.returns mkt.rf smb hml monthly.returns mkt.rf .81 smb .22 .36 hml .22 .33 .19Visualizing is just one more step with rplot() (similar to ggplot() + correlation geom). We can see that the market reference is highly correlated to the monthly portfolio returns, but this is the only value that has a significant correlation.
portfolio_corr %>% rplot() + labs(title = "Visualizing the Correlations Between FF Factors", subtitle = "Highest Correlation is 'mkt.rf', other factors are less correlated") + theme_tq() ConclusionsThe new Quandl integration opens up a lot of doors with regards to financial and economic data. The API is now integrated into the “tidyverse” workflow enabling scaling and easy data manipulations following the “Collect, Modify, Analyze” financial analysis workflow we use with tidyquant. The Fama French analysis is just one example of new and interesting analyses that are now easily performed. This is just the beginning. Feel free to email us at Business Science with new and interesting ways you are using tidyquant!
RecapWe covered a lot of ground today. We exposed you to the new Quandl integration and how it fits within the “Collect, Modify, Analyze” financial analysis workflow. We used quandl_api_key() to set an API key, enabling access to the Quandl API. We used quandl_search() to search the Quandl API for Fama French data. We used tq_get(get = "quandl") to retrieve data from Quandl, passing various options to conveniently get monthly returns. We aggregated a portfolio using tq_portfolio and joined the portfolio returns with the Fama French data. We then performed a basic Fama French Three Factor analysis. The entire analysis from beginning to end was easy, efficient, and “tidy”! =)
Further Reading
TQ01Core Functions in tidyquant, Development Version: The Core Functions in tidyquant vignette (development version) has a nice section on the features related to the Quandl integration!

R for Data Science: A free book that thoroughly covers the “tidyverse”. A prerequisite for maximizing your abilities with tidyquant.
To leave a comment for the author, please follow the link and comment on their blog: businessscience.io  Articles. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Faces of #rstats Twitter
(This article was first published on Maëlle, and kindly contributed to Rbloggers)
This week I was impressed by this tweet where Daniel Pett, Digital Humanities Lead at the British Museum, presented a collage of Twitter profile pics of all his colleagues. He made this piece of art using R (for collecting the usernames) and Python. I’m a bit of an R fanatic (or a Python dummy…) so I decided to write a code in R only to make a collage of profile pics of Twitter #rstats users.
Note: if you’re interested in Daniel Pett’s code, the data gathering is here and the montage code is here.
Getting the usersOnce again I’ll use rtweet! In case you’ve been using twitteR, note that twitteR maintainer is currently passing the torch to rtweet so you’d better switch packages.
library("rtweet") library("dplyr") library("magick") library("httr") library("stringr") library("purrr") users < search_users(q= '#rstats', n = 1000, parse = TRUE) users < unique(users)I think that doing this I got only profiles whose description includes “rstats” or “#rstats”. Moreover, although I got 1,000 results, many of them were duplicates, so in the end I had about 450 #rstats Twitter users.
Gathering profile picsThis is the part where I felt slightly creepy downloading pictures of so many people.
I first had to create the link to each person’s profile page.
users < mutate(users, where = paste0("https://twitter.com/", screen_name))Then I wrote a function for getting the profile pic for each user. I worked for everyone except 1) eggs and 2) a person whose profile pic had no extension, so I was quite satisfied with the result.
Update: Daniel Pett told me I could have used the column “profile_image_url” of the data.frame obtained from rtweet::search_users… I had completely missed that column! Thanks a lot!
get_piclink < function(df){ content < httr::GET(df$where) content < httr::content(content, as = "text") image < str_extract(content, "class=\"ProfileAvatarimage \" src=\"https://pbs.twimg.com/profile_images/.*\\..*\" alt") image < str_replace(image, "class=\"ProfileAvatarimage \" src=\"", "") image < str_replace(image, "..alt", "") return(image) } users < by_row(users, get_piclink, .to = "piclink", .collate = "cols") readr::write_csv(users, path = "users.csv")After getting the links I could start saving the pictures, in a small 50px x 50px format. At that point I started using the magick package, a real game changer for image manipulation in R!
save_image < function(df){ image < try(image_read(df$piclink), silent = TRUE) if(class(image)[1] != "tryerror"){ image %>% image_scale("50x50") %>% image_write(paste0("data/20170319facesofr_users_images/", df$screen_name,".jpg")) } } users < filter(users, !is.na(piclink)) users < split(users, 1:nrow(users)) walk(users, save_image)I got 448 images.
Preparing the collage itselfI want to take a minute thinking about myself in elementary school. I am lefthanded and I had scissors for righthanded people that I used with my left hand because no one in my family or school knew any better. I was therefore very bad at collages because of the cutting part. Had I known that I’d do collages with a computer as a grownup, I would have been less sad to have to skip the break to keep on slowly cutting pieces of paper! (Note: I actually had a very happy childhood.)
Back to R talk now! For doing the collage I had to know how many columns and rows I wanted. For this I factorized the number of pics I had. Note that I also resampled the images, because I didn’t want all the RLadies chapters and their beautiful purple logo to end up near each other in the collage.
files < dir("data/20170319facesofr_users_images/", full.names = TRUE) set.seed(1) files < sample(files, length(files)) gmp::factorize(length(files)) ## Big Integer ('bigz') object of length 7: ## [1] 2 2 2 2 2 2 7Based on this I chose this format:
no_rows < 28 no_cols < 16In magick there’s a function for appending images, image_append, from which you get either a row or a column depending on the stack argument. I first created all columns.
make_column < function(i, files, no_rows){ image_read(files[(i*no_rows+1):((i+1)*no_rows)]) %>% image_append(stack = TRUE) %>% image_write(paste0("cols/", i, ".jpg")) } walk(0:(no_cols1), make_column, files = files, no_rows = no_rows)And finally, I appended the columns!
image_read(dir("cols/", full.names = TRUE)) %>% image_append(stack = FALSE) %>% image_write("20170319facesofr.jpg") The endHere is the result, where many, many R users are most certainly missing. Besides, can we talk about these accounts with the old R logo? They make me feel old, while the RLadies logo make me happy. Too bad Forwards wasn’t included, because their logo is really pretty.
Now you can try doing the same with your followers, the people you follow, or making a doormat with base plot advocates.
To leave a comment for the author, please follow the link and comment on their blog: Maëlle. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Python & R vs. SPSS & SAS
(This article was first published on Rbloggers – The Analytics Lab, and kindly contributed to Rbloggers)
When we’re working for clients we mostly come across the statistical programming languages SAS, SPSS, R and Python. Of these SAS and SPSS are probably the most used. However, the interest for the open source languages R and Python is increasing. In recent years, some of our clients migrated from using SAS or SPSS to using R and/or Python. And even if they haven’t (yet), most commercial software packages (including SAS and SPSS) make it possible to connect to R and Python nowadays.
SAS was developed at the North Carolina State University and was primarily developed to be able to analyse large quantities of agriculture data. The abbreviation SAS stands for Statistical Analysis System. In 1976 the company SAS was founded as the demand for such software increased. Statistical Package for the Social Sciences (SPSS) was developed for the social sciences and was the first statistical programming language for the PC. It was developed in 1968 at the University of Stanford and eight years later the company SPSS Inc. was founded, which was bought by IBM in 2009.
In 2000 the University of Auckland released the first version of R, a programming language primarily focused on statistical modeling and was open sourced under the GNU license. Python is the only one that was not developed at a university. Python was created by a Dutch guy who is a big fan of Monty Python (where the name comes from). He needed a project during Christmas and created this language which is based on ABC. ABC is a language, also created by him, with the goal to teach nonprogrammers how to program. Python is a multipurpose language, like C++ and Java, with the big difference and advantage that Python is way easier to learn. Programmers carried on and created lots of modules on top of Python and it therefore has a wide range of statistical modeling capabilities nowadays. That’s why Python definitely belongs in this list.
In this article, we compare the four languages on methods and techniques, ease of learning, visualisation, support and costs. We explicitly focus on the languages, the user interfaces SAS Enterprise Miner and SPSS Modeler are out of scope.
Statistical methods and TechniquesMy vision on Data Analysis is that there is continuum between explanatory models on one side and predictive models on the other side. The decisions you make during the modeling process depend on your goal. Let’s take Customer Churn as an example, you can ask yourself why are customers leaving? Or you can ask yourself which customers are leaving? The first question has as its primary goal to explain churn, while the second question has as its primary goal to predict churn. These are two fundamentally different questions and this has implications for the decisions you take along the way. The predictive side of Data Analysis is closely related to terms like Data Mining and Machine Learning.
When we’re looking at SPSS and SAS, both of these languages originate from the explanatory side of Data Analysis. They are developed in an academic environment, where hypotheses testing plays a major role. This makes that they have significant less methods and techniques in comparison to R and Python. Nowadays, SAS and SPSS both have data mining tools (SAS Enterprise Miner and SPSS Modeler), however these are different tools and you’ll need extra licenses.
One of the major advantages of open source tooling is that the community continuously improves and increases functionality. R was created by academics, who wanted their algorithms to spread as easily as possible. Ergo R has the widest range of algorithms, which makes R strong on the explanatory side and on the predictive side of Data Analysis.
Python is developed with a strong focus on (business) applications, not from an academic or statistical standpoint. This makes Python very powerful when algorithms are directly used in applications. Hence, we see that the statistical capabilities are primarily focused on the predictive side. Python is mostly used in Data Mining or Machine Learning applications where a data analyst doesn’t need to intervene. Python is therefore also strong in analysing images and videos, for example we’ve used Python this summer to build our own autonomous driving RC car. Python is also the easiest language to use when using Big Data Frameworks like Spark.
Ease of learningBoth SPSS and SAS have a comprehensive user interface, with the consequence that a user doesn’t necessarily need to code. Furthermore, SPSS has a pastefunction which creates syntaxes from steps executed in the user interface and SAS has Proc SQL, which makes SAScoding a lot easier for people who know the SQL query language. SAS and SPSS code are syntaxtically far from similar to each other and also very different from other relevant programming languages, so when you need to learn one of these from scratch, good luck with it!
Although there are GUI alternatives for R, like Rattle, it doesn’t come close to SAS or SPSS in terms of its functionality. R is easily to learn for programmers, however, a lot of analysts don’t have a background in programming. R has the steepest learning curve from all, it’s the most difficult one to start with. But once you get the basics, it gets easier soon. For this specific reason, we’ve created a R course, called Experience R, which kickstarts (aspiring) data analysts / scientists in learning R. Python is based on ABC, which is developed with the sole purpose of teaching nonprogrammers how to program. Readability is one of the key features of Python. This makes Python the easiest language to learn. As Python is so broad, there are no GUI’s for Python.
To conclude, as for ease of learning SPSS and SAS are the best option for starting analysts as they provide tools where the user doesn’t need to program.
SupportBoth SAS and SPSS are commercial products and therefore have official support. This motivates some companies to choose for these languages: if something goes wrong, they’ve got support.
There is a misconception around the support for opensource tooling. It’s true that there is no official support from the creators or owners, nonetheless, there’s a large community for both languages most willing to help you to solve your problem. And 99 out of 100 times (if not more often), your question has already been asked and answered on sites like Stack Overflow. On top of that, there are numerous companies that do provide professional support for R and Python. So, although there’s no official support for both R and Python, in practice we see that if you’ve got a question, you’ll likely have your answer sooner if it’s about R or Python than in case it’s SAS or SPSS related.
VisualisationThe graphical capabilities of SAS and SPSS are purely functional; although it is possible to make minor changes to graphs, to fully customize your plots and visualizations in SAS and SPSS can be very cumbersome or even impossible. R and Python offer much more opportunities to customize and optimize your graphs due to the wide range of modules that are available. The most widely used module for R is ggplot2, which has a wide set of graphs where you’re able to adjust practically everything. These graphs are also easily made interactive, which allows users to play with the data through applications like shiny.
Python and R learned (and still learn) a lot from each other. One of the best examples of this is that Python also has a ggplotmodule , which has practically the same functionality and syntax as it does in R. Another widely used module for visualisation in Python is Matplotlib.
CostsR and Python are open source, which makes them freely available for everybody. The downside is that, as we’ve discussed before, these are harder to learn languages compared to start using the SAS or SPSS GUI. As a result, analysts equipped with R and/or Python in their skillset have higher salaries than analyst that don’t. Educating employees that are currently not familiar with R and/or Python costs money as well. Therefore, in practice it isn’t the case that the open source programming language are completely free of costs, but when you compare it with the license fees for SAS or SPSS, the business case is very easily made: R and Python are way cheaper!
My choice“Software is like sex, it’s better when it’s free” – Linus Torvalds (creator Linux)
My goto tools are R and Python, I can use these languages everywhere without having to buy licenses. Also I don’t need to wait for the licenses. And time is a key feature in my job as a consultant. Aside from licenses, probably the main reason is the wide range of statistical methods; I can use any algorithm out there and choose the one that suits the challenge at hand best.
Which of the two languages I use depends on the goal, as mentioned above. Python is a multipurpose language and is developed with a strong focus on applications. Python is therefore strong in Machine Learning applications; hence I use Python for example for Face or Object Recognition or Deep Learning applications. I use R for goals which have to do with customer behaviour, where the explanatory side also plays a major role; if I know which customers are about to churn, I would also like to know why.
These two languages are for a large part complementary. There are libraries for R that allow you to run Python code (reticulate, rPython), and there are Python modules which allow you to run R code (rpy2). This makes the combination of the two languages even stronger.
Jeroen Kromme, Senior Consultant Data Scientist
To leave a comment for the author, please follow the link and comment on their blog: Rbloggers – The Analytics Lab. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Presentation “R for Data Science”
Some weeks ago I had a presentation at my work place about “R for data science” that I’d like to share with you. I’ve written the slides in R and rmarkdown and uploaded them to rpubs.com. I chose to use rmarkdown for my slides although we have great company PowerPoint templates, because I wanted to show how you can integrate creating slides in your data analysis workflow. In this post I share my presentation, some background information and how I prepared for the talk with you!
Writing the presentation in R + rmarkdownSince I have used markdown and rmarkdown before, preparing the slides was pretty straight forward. I only had an issue with image sizes which I could resolve with an answer from stackoverflow. Stackoverflow answer: Rpresentation in Rstudio – Make image fill out the whole screen.
Find out how I learned to love rmarkdown!
I used all the standard settings but for the next presentation I would definitely change the font size. We were sitting in a large meeting room and I think the people in the back could hardly read the text on my slides.
Practicing the presentationDue to the fact that I am always super nervous when it comes to presenting in front of people (that’s not an exaggeration) I wanted to practice my presentation first. My cats have listened to some of my presentations before but I was not really sure they could ask the right questions. So I thought we have this great Data Science Learning Club, which should help learn people data science things they are not yet good or comfortable at. For me, presenting is part of data science so I asked in our Learning Club Slack Channel if anyone was interested in listening to my presentation. Two days before the actual presentation I was able to practice my presentation in front of a (very small) machine learning audience. This helped me a lot
So if you have the same problem, just find someone online where you can practice your presentation using Skype, Hangout or whatever!
The presentationEvery year we can some work time on further education like university or online courses. The idea of this presentation was to share what we have learned in these courses so that everyone gets an overview. Since the regression courses were mostly theoretical, I thought an introduction to R might be a benefit for more people in our department. So please don’t be surprised why there is a slide with three regression courses at the beginning of the presentation
The focus of the presentation is:
 Data handling with dplyr
 Data visualisation with ggplot2
 Model training and evaluation with caret
 Presentation of results
My slides are located here.
Links I sharedAt the end of the presentation I shared some links which I think are a good starting point if you want to get to know R!
 R 4 Data Science (book)
 ggplot2 extensions (you might get a security warning)
 tidyverse (release post)
 tidyverse.org
 the caret package
 Markdown basics
 datacamp
 RStudio cheatsheets (awesome!)
The post Presentation “R for Data Science” appeared first on verenahaunschmid.
My 3 video presentations on “Essential R”
(This article was first published on R – Giga thoughts …, and kindly contributed to Rbloggers)
In this post I include my 3 video presentations on the topic “Essential R”. In these 3 presentations I cover the entire landscape of R. I cover the following
 R Language – The essentials
 Key R Packages (dplyr, lubridate, ggplot2, etc.)
 How to create R Markdown and share reports
 A look at Shiny apps
 How to create a simple R package
Essential R – Part 1
This video cover basic R data types – character, numeric, vectors, matrices, lists and data frames. It also touches on how to subset these data types
Essential R – Part 2
This video continues on how to subset dataframes (the most important data type) and some important packages. It also presents one of the most important job of a Data Scientist – that of cleaning and shaping the data. This is done with an example unclean data frame. It also touches on some key operations of dplyr like select, filter, arrange, summarise and mutate. Other packages like lubridate, quantmod are also included. This presentation also shows how to use base plot and ggplot2
Essential R – Part 3
This final session covers R Markdown , and touches on some of the key markdown elements. There is a brief overview of a simple Shiny app. Finally this presentation also shows the key steps to create an R package
These 3 R sessions cover most of the basic R topics that we tend to use in a our daytoday R way of life. With this you should be able to hit the ground running!
Hope you enjoy these video presentation and also hope you have an even greater time with R!
Also see
1. Introducing QCSimulator: A 5qubit quantum computing simulator in R
2. Computer Vision: Ramblings on derivatives, histograms and contours
3. Designing a Social Web Portal
4. Revisiting Whats up, Watson – Using Watson’s Question and Answer with Bluemix – Part 2
5. Reintroducing cricketr! : An R package to analyze performances of cricketers
To see all my posts click – Index of posts
To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Contours of statistical penalty functions as GIF images
(This article was first published on Alexej's blog, and kindly contributed to Rbloggers)
Many statistical modeling problems reduce to a minimization problem of the general form:
\begin{equation} \mathrm{minimize}\subscript{\boldsymbol{\beta}\in\mathbb{R}^m}\quad f(\mathbf{X}, \boldsymbol{\beta}) + \lambda g(\boldsymbol{\beta}), \end{equation}
or
%
where $f$ is some type of loss function, $\mathbf{X}$ denotes the data, and $g$ is a penalty, also referred to by other names, such as “regularization term” (problems (1) and (23) are often equivalent by the way). Of course both, $f$ and $g$, may depend on further parameters.
There are multiple reasons why it can be helpful to check out the contours of such penalty functions $g$:
 When $\boldsymbol{\beta}$ is twodimensional, the solution of problem (23) can be found by simply taking a look at the contours of $f$ and $g$.
 That builds intuition for what happens in more than two dimensions, and in other more general cases.
 From a Bayesian point of view, problem (1) can often be interpreted as an MAP estimator, in which case the contours of $g$ are also contours of the prior distribution of $\boldsymbol{\beta}$.
Therefore, it is meaningful to visualize the set of points that $g$ maps onto the unit ball in $\mathbb{R}^2$, i.e., the set
B\subscript{g} := \{ \mathbf{x}\in\mathbb{R}^2 : g(\mathbf{x}) \leq 1 \}.
Below you see GIF images of such sets $B\subscript{g}$ for various penalty functions $g$ in 2D, capturing the effect of varying certain parameters in $g$. The covered penalty functions include the family of $p$norms, the elastic net penalty, the fused penalty, and the sorted $\ell_1$ norm.
:white_check_mark: R code to reproduce the GIFs is provided.
pnorms in 2DFirst we consider the $p$norm,
g\subscript{p}(\boldsymbol{\beta}) = \lVert\boldsymbol{\beta}\rVert\subscript{p}^{p} = \lvert\beta\subscript{1}\rvert^p + \lvert\beta\subscript{2}\rvert^p,
with a varying parameter $p \in (0, \infty]$ (which actually isn’t a proper norm for $p < 1$). Many statistical methods, such as LASSO and Ridge Regression, employ $p$norm penalties. To find all $\boldsymbol{\beta}$ on the boundary of the 2D unit $p$norm ball, given $\beta_1$ (the first entry of $\boldsymbol{\beta}$), $\beta_2$ is easily obtained as
\beta_2 = \pm (1\beta_1^p)^{1/p}, \quad \forall\beta_1\in[1, 1].
Elastic net penalty in 2DThe elastic net penalty can be written in the form
g\subscript{\alpha}(\boldsymbol{\beta}) = \alpha \lVert \boldsymbol{\beta} \rVert\subscript{1} + (1  \alpha) \lVert \boldsymbol{\beta} \rVert\subscript{2}^{2},
for $\alpha\in(0,1)$. It is quite popular with a variety of regressionbased methods (such as the Elastic Net, of course). We obtain the corresponding 2D unit “ball”, by calculating $\beta\subscript{2}$ from a given $\beta\subscript{1}\in[1,1]$ as
\beta\subscript{2} = \pm \frac{\alpha + \sqrt{\alpha^2  4 (1  \alpha) ((1  \alpha) \beta\subscript{1}^2 + \alpha \beta\subscript{1}  1)}}{2  2 \alpha}.
Fused penalty in 2DThe fused penalty can be written in the form
g\subscript{\alpha}(\boldsymbol{\beta}) = \alpha \lVert \boldsymbol{\beta} \rVert\subscript{1} + (1  \alpha) \sum\subscript{i = 2}^m \lvert \beta\subscript{i}  \beta\subscript{i1} \rvert.
It encourages neighboring coefficients $\beta\subscript{i}$ to have similar values, and is utilized by the fused LASSO and similar methods.
(Here I have simply evaluated the fused penalty function on a grid of points in $[2,2]^2$, because figuring out equations in parametric form for the above polygons was too painful for my taste… :stuck_out_tongue:)
Sorted L1 penalty in 2DThe Sorted $\ell\subscript{1}$ penalty is used in a number of regressionbased methods, such as SLOPE and OSCAR. It has the form
g\subscript{\boldsymbol{\lambda}}(\boldsymbol{\beta}) = \sum\subscript{i = 1}^m \lambda\subscript{i} \lvert \beta \rvert\subscript{(i)},
where $\lvert \beta \rvert\subscript{(1)} \geq \lvert \beta \rvert\subscript{(2)} \geq \ldots \geq \lvert \beta \rvert\subscript{(m)}$ are the absolute values of the entries of $\boldsymbol{\beta}$ arranged in a decreasing order. In 2D this reduces to
g\subscript{\boldsymbol{\lambda}}(\boldsymbol{\beta}) = \lambda\subscript{1} \max\{\beta\subscript{1}, \beta\subscript{2}\} + \lambda\subscript{2} \min\{\beta\subscript{1}, \beta\subscript{2}\}.
Difference of pnormsIt holds that
\lVert \boldsymbol{\beta} \rVert\subscript{1} \geq \lVert \boldsymbol{\beta} \rVert\subscript{2},
or more generally, for all $p$norms it holds that
(\forall p \leq q) : \lVert \boldsymbol{\beta} \rVert\subscript{p} \geq \lVert \boldsymbol{\beta} \rVert\subscript{q}.
Thus, it is meaningful to define a penalty function of the form
g\subscript{\alpha}(\boldsymbol{\beta}) = \lVert \boldsymbol{\beta} \rVert\subscript{1}  \alpha \lVert \boldsymbol{\beta} \rVert\subscript{2},
for $\alpha\in[0,1]$, which results in the following.
We visualize the same for varying $p \geq 1$ fixing $\alpha = 0.6$, i.e., we define
g\subscript{\alpha}(\boldsymbol{\beta}) = \lVert \boldsymbol{\beta} \rVert\subscript{1}  0.6 \lVert \boldsymbol{\beta} \rVert\subscript{p},
and we obtain the following GIF.
CodeThe R code uses the libraries dplyr for data manipulation, ggplot2 for generation of figures, and magick to combine the individual images into a GIF.
Here are the R scripts that can be used to reproduce the above GIFs:
 pnorms in 2D
 Elastic net penalty in 2D
 Fused penalty in 2D
 Sorted L1 penalty in 2D
 Difference of $p$norms: $\ell\subscript{1} – \ell\subscript{2}$ in 2D
 Difference of $p$norms: $\ell\subscript{1} – \ell\subscript{p}$ in 2D
Should I come across other interesting penalty functions that make sense in 2D, then I will add corresponding further visualizations to the same Github repository.
This work is licensed under a Creative Commons Attribution 4.0 International License.
To leave a comment for the author, please follow the link and comment on their blog: Alexej's blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Data Science at StitchFix
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
If you want to see a great example of how data science can inform every stage of a business process, from product concept to operations, look no further than Stitch Fix's Algorithms Tour. Scroll down through this explainer to see how this personal styling service uses data and statistical inference to suggest clothes their customers will love, ship them from a nearby warehouse (already stocked thanks to demand modeling), and incorporate customer feedback and designer trends into the next round of suggestions.
The design of the Tour is interactive and elegant, and clearly someone put a lot of work into it. It's a great example of simplifying the detail of data science into a format where the impacts are immediately apparent. Communicating the benefits of data science to nonexperts is a key skill of any data scientist, and dialing down on the complexity, while not dumbing down the science, is key to that communication.
From the Tour, we can see that Stitch Fix's data science is far from simple. Their methods aren't just blackbox machine learning techniques: rather, they are using statistical inference to figure out why customers like particular products (or why operational strategies are efficient or not) rather than simply predicting what will happen. This is all driven by a sophisticated data science platform, that enables the team to employ techniques like Generalized Additive Models implemented in R, or Natural Language Processing implemented in Python. Check out their MultiThreaded blog for more details on data science applications at Stitch Fix, and be sure to check out the Algorithms Tour linked below.
Stitch Fix MultiThreaded: Algorithms Tour
To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
One way MANOVA exercises
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
In ANOVA our interest lies in knowing if one continuous dependent variable is affected by one or more categorical independent variables. MANOVA is an extension of ANOVA where we are now able to understand how several dependent variables are affected by independent variables. For example consider an investigation where a medical investigator has developed 3 back pain therapies. Patients are enrolled for a 10 week trial and at the end the investigator interviews them on reduction of physiological, emotional and cognitive pain. Interest is in knowing which therapy is best at reducing pain.
Just like in ANOVA we can have one way or two way MANOVA depending on number of independent variables.
When conducting MANOVA it is important to understand the assumptions that need to be satisfied so that the results are valid. The assumptions are explained below.
 The observations are independent. Observations that are collected over time, over space and in any groupings violate the assumption of independence.
 The data follows a multivariate normal distribution. When observations are many we can rely on the central limit theorem (CLT) to achieve normality. It has been generally accepted any distribution with more than that observations will follow a normal distribution. MANOVA is robust to any nonnormality that arises from skewness but it is not robust to nonnormality resulting from outliers. Outliers should be checked and appropriate action taken. Analysis can be done with and without the outliers to check sensitivity.
 The variance in all the groups is homogeneous. A Bartlett test is useful for assessing the homogeneity of variance. MANOVA is not robust to deviations from the assumption of normality therefore a transformation is required to stabilize variance.
MANOVA can be used to understand the interactions and main effects of independent variables. The four test statistics that can be used are Wilk’s lambda, Pillai trace, HotellingLawley trace and Roy’s maximum root. Among the four test statistics Pillai is least affected by any violations in assumptions but Wilk’s is the most commonly used.
In this first part of MANOVA exercises we will use data from a study investigating a control and three therapies aimed at reducing symptoms of koro. Forty patients were selected for inclusion in the study and 10 patients were assigned to each of the four groups. Interest is in understanding which therapy is best in reducing symptoms. We will create three variables that hold change in indices before and after treatment. Here we have one independent variable and three dependent variables resulting in a one way MANOVA.
Solutions to these exercises can be found here
Exercise 1
Import data into R
Exercise 2
Check the number of observations in each group
Exercise 3
Create the variables that hold the change in indices
Exercise 4
Summarize the change variables
Exercise 5
Get descriptive statistics for each therapy
Exercise 6
Obtain the correlation matrix
Exercise 7
Check for outliers
Exercise 8
Check for homogeneity of variance
Exercise 9
Run MANOVA test with outliers
Exercise 10
Interpret results
Related exercise sets: One Way Analysis of Variance Exercises
 Two Way ANOVA in R Exercises
 Paired ttest in R Exercises
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Experimenting With Sankey Diagrams in R and Python
(This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to Rbloggers)
A couple of days ago, I spotted a post by Oli Hawkins on Visualising migration between the countries of the UK which linked to a Sankey diagram demo of Internal migration flows in the UK.
One of the things that interests me about the Jupyter and RStudio centred reproducible research ecosystems is their support for libraries that generate interactive HTML/javascript outputs (charts, maps, etc) from a computational data analysis context such as R, or python/pandas, so it was only natural (?!) that I though I should see how easy it would be to generate something similar from a code context.
In an R context, there are several libraries available that support the generation of Sankey diagrams, including googleVis (which wraps Google Chart tools), and a couple of packages that wrap d3.js – an original rCharts Sankey diagram demo by @timelyporfolio, and a more recent HTMLWidgets demo (sankeyD3).
Here’s an example of the evolution of my Sankey diagram in R using googleVis – the Rmd code is here and a version of the knitred HTML output is here.
The original data comprised a matrix relating population flows between English regions, Wales, Scotland and Northern Ireland. The simplest rendering of the data using the googleViz Sankey diagram generator produces an output that uses default colours to label the nodes.
Using the country code indicator at the start of each region/country identifier, we can generate a mapping from country to a country colour that can then be used to identify the country associated with each node.
One of the settings for the diagram allows the source (or target) node colour to determine the edge colour. We can also play with the values we use as node labels:
If we exclude edges relating to flow between regions of the same country, we get a diagram that is more reminiscent of Oli’s orignal (country level) demo. Note also that the charts that are generated are interactive – in this case, we see a popup that describes the flow along one particular edge.
If we associate a country with each region, we can group the data and sum the flow values to produce country level flows. Charting this produces a chart similar to the original inspiration.
As well as providing the code for generating each of the above Sankey diagrams, the Rmd file linked above also includes demonstrations for generating basic Sankey diagrams for the original dataset using the rCharts and htmlwidgets R libraries.
In order to provide a point of comparison, I also generated a python/pandas workflow using Jupyter notebooks and the ipysankey widget. (In fact, I generated the full workflow through the different chart versions first in pandas – I find it an easier language to think in than R! – and then used that workflow as a crib for the R version…)
The original notebook is here and an example of the HTML version of it here. Note that I tried to save a rasterisation of the widgets but they don’t seem to have turned out that well…
The original (default) diagram looks like this:
and the final version, after a bit of data wrangling, looks like this:
Once again, all the code is provided in the notebook.
One of the nice things about all these packages is that they produce outputs than can be reused/embedded elsewhere, or that can be used as a first automatically produced draft of code that can be tweaked by hand. I’ll have more to say about that in a future post…
To leave a comment for the author, please follow the link and comment on their blog: Rstats – OUseful.Info, the blog…. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...