a perfectly normally distributed sample
(This article was first published on R – Xi'an's Og, and kindly contributed to Rbloggers)
When I saw this title on Rbloggers, I was wondering how “more perfect” a Normal sample could be when compared with the outcome of rnorm(n). Hence went checking the original blog on bayestestR in search of more information. Which was stating nothing more than how to generate a sample is perfectly normal by using the rnorm_perfect function. Still unsure of the meaning, I contacted one of the contributors who replied very quickly
…that’s actually a good question. I would say an empirical sample having characteristics as close as possible to a cannonic gaussian distribution. and again leaving me hungering for more details. I thus downloaded the package bayestestR and opened the rnorm_perfect function. Which is simply the sequence of nquantiles stats::qnorm(seq(1/n, 1 – 1/n, length.out = n), mean, sd) which I would definitely not call a sample as it has nothing random. And perfect?! Not really, unless one associates randomness and imperfection. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Likertplots and grouped Likertplots #rstats
(This article was first published on R – Strenge Jacke!, and kindly contributed to Rbloggers)
I’m pleased to anounce an update of my sjPlotpackage, a package for Data Visualization for Statistics in Social Science. Thanks to the help of Alexander, it is now possible to create grouped Likertplots. This is what I want to show in this post…
First, we load the required packages and sample data. We want to plot several items from an index, where all these items are 4point Likert scales. The items deal with coping of family caregivers, i.e. how well they cope with their role of caring for an older relative.
To find the required variables in the data set, we search all variables for a specific pattern. We know that the variables from the COPEindex all have a cop in their variable name. We can easily search for variables in a data set with the find_var()function.
library(sjPlot) library(sjmisc) data(efc) # find all variables from COPEIndex, which all # have a "cop" in their variable name, and then # plot the items as likertplot mydf < find_var(efc, pattern = "cop", out = "df") plot_likert(mydf)The plot is not perfect, because for those values with just a few answers, we have overlapping values. However, there are quite some options to tweak the plot. For instance, we can increase the axisrange (grid.range), show cumulative percentagevalues only at the ende of the bars (values = "sum.outside") and show the percentagesign (show.prc.sign = TRUE).
plot_likert( mydf, grid.range = c(1.2, 1.4), expand.grid = FALSE, values = "sum.outside", show.prc.sign = TRUE )The interesting question is, whether we can reduce the dimensions of this scale and try to extract principle components, in order to group single items into different subscales. To do that, we first run a PCA on the data. This can be done, e.g., with sjt.pca() or sjp.pca().
# creates a HTMLtable of the results of an PCA. sjt.pca(mydf) Principal Component Analysis Component 1 Component 2 do you feel you cope well as caregiver? 0.29 0.60 do you find caregiving too demanding? 0.60 0.42 does caregiving cause difficulties in your relationship with your friends? 0.69 0.16 does caregiving have negative effect on your physical health? 0.73 0.12 does caregiving cause difficulties in your relationship with your family? 0.64 0.01 does caregiving cause financial difficulties? 0.69 0.12 do you feel trapped in your role as caregiver? 0.68 0.38 do you feel supported by friends/neighbours? 0.07 0.64 do you feel caregiving worthwhile? 0.07 0.75 Cronbach’s α 0.78 0.45 varimaxrotationAs we can see, six items are associated with component one, while three items mainly load on the second component. The indices that indicate which items is associated with which component is returned by the function in the element $factor.index. So we save this in an object that can be used to create a grouped Likertplot.
groups < sjt.pca(mydf)$factor.index plot_likert(mydf, groups = groups, values = "sum.outside")There are even mote options to tweak the Likertplots. Find the full documentation at
https://strengejacke.github.io/sjPlot/index.html.
To leave a comment for the author, please follow the link and comment on their blog: R – Strenge Jacke!. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
EARL London 2019 speakers announced
(This article was first published on RBlog – Mango Solutions, and kindly contributed to Rbloggers)
We were thrilled with the overall quality and amount of abstracts we received for this year’s EARL London Conference. Which made our job of selecting the final speakers even more difficult!
We are pleased to share with you the speakers for EARL London 2019 – we will be interviewing some of our speakers over the next few months, so you can find out what to expect from their talks. As you can see we have a wide range of topics and industries covered – so there will be something for everyone.
The final agenda with times will be released in the next few weeks – in the meantime, take a look at who’s talking and make the most of our early bird ticket offers.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Converting Individual Binary vectors to a Value based on Column Names
(This article was first published on R – William Doane, and kindly contributed to Rbloggers)
When processing data downloaded from popular survey engines, it’s not uncommon for multiple choice questions to be represented as one column per possible response coded as 0/1. So, a question with just two responses might be downloaded as part of a CSV with one column for q1_1 and another for q1_2. If the responses are mutually exclusive, then (q1_1 == 0 iff q1_2 == 1) and (q1_1 == 1 iff q1_2 == 0). If the responses are part of a “choose all that apply” question, then it’s possible to have multiple 1s.
How can these individual binary indicator variables be reassembled into a single response variable?
First, let’s simulate some response data for nonmutually exclusive questions—each row represents one respondent’s choices:
df < data.frame( q1_1 = round(runif(5), 0), q1_2 = round(runif(5), 0), q1_3 = round(runif(5), 0), q2_1 = round(runif(5), 0), q2_2 = round(runif(5), 0), q3_1 = round(runif(5), 0), q3_2 = round(runif(5), 0), q3_3 = round(runif(5), 0), q3_4 = round(runif(5), 0) ) df q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4 1 1 0 0 1 0 1 1 0 1 2 0 1 0 0 1 1 1 0 1 3 1 1 1 1 1 0 0 0 1 4 0 1 0 1 1 1 1 0 1 5 0 0 1 1 0 1 0 0 0R’s dplyr package offers the coalesce function, which doesn’t suit my needs when the data contains 0s for nonselected response choices. Notice below in row 2, for example, that q1 and q2 select the first nonNA values, which is 0:
library(dplyr) df %>% mutate(q1 = coalesce(q1_1, q1_2, q1_3)) %>% mutate(q2 = coalesce(q2_1, q2_2)) %>% mutate(q3 = coalesce(q3_1, q3_2, q3_3, q3_4)) q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4 q1 q2 q3 1 1 0 0 1 0 1 1 0 1 1 1 1 2 0 1 0 0 1 1 1 0 1 0 0 1 3 1 1 1 1 1 0 0 0 1 1 1 1 4 0 1 0 1 1 1 1 0 1 0 1 1 5 0 0 1 1 0 1 0 0 0 0 1 1If you replace all 0s with NA, you can get closer to what you need:
df %>% mutate_all(~ifelse(. == 0, NA_real_, .)) %>% mutate(q1 = coalesce(q1_1, q1_2, q1_3)) %>% mutate(q2 = coalesce(q2_1, q2_2)) %>% mutate(q3 = coalesce(q3_1, q3_2, q3_3, q3_4)) q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4 q1 q2 q3 1 1 NA NA 1 NA 1 1 NA 1 1 1 1 2 NA 1 NA NA 1 1 1 NA 1 1 1 1 3 1 1 1 1 1 NA NA NA 1 1 1 1 4 NA 1 NA 1 1 1 1 NA 1 1 1 1 5 NA NA 1 1 NA 1 NA NA NA 1 1 1Unfortunately, the q1 vector here only tells us that there was some response by each respondent, not which response they gave for q1.
It would be nice to have a version of coalesce that gathered not the first nonNA value, but the column name of the first nonNA value. Here, I’ll use the structure of dplyr’s coalesce as a model:
coalesce_colname < function(...) { if (missing(..1)) { abort("At least one argument must be supplied") } colnames < as.character(as.list(match.call()))[1] values < list(...) x < values[[1]] x[!is.na(x)] < colnames[1] values < values[1] colnames < colnames[1] for (i in seq_along(values)) { x < ifelse(is.na(x) & !is.na(values[[i]]), colnames[i], x) } x }With this, you have a dropin replacement for coalesce that captures the column name:
df %>% mutate_all(~ifelse(. == 0, NA_real_, .)) %>% mutate(q1 = coalesce_colname(q1_1, q1_2, q1_3)) %>% mutate(q2 = coalesce_colname(q2_1, q2_2)) %>% mutate(q3 = coalesce_colname(q3_1, q3_2, q3_3, q3_4)) q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4 q1 q2 q3 1 1 NA NA 1 NA 1 1 NA 1 q1_1 q2_1 q3_1 2 NA 1 NA NA 1 1 1 NA 1 q1_2 q2_2 q3_1 3 1 1 1 1 1 NA NA NA 1 q1_1 q2_1 q3_4 4 NA 1 NA 1 1 1 1 NA 1 q1_2 q2_1 q3_1 5 NA NA 1 1 NA 1 NA NA NA q1_3 q2_1 q3_1and with a little effort, you can wrangle the column name to extract the response value:
df %>% mutate_all(~ifelse(. == 0, NA_real_, .)) %>% mutate(q1 = coalesce_colname(q1_1, q1_2, q1_3)) %>% mutate(q2 = coalesce_colname(q2_1, q2_2)) %>% mutate(q3 = coalesce_colname(q3_1, q3_2, q3_3, q3_4)) %>% mutate_at(c("q1", "q2", "q3"), ~stringr::str_extract(., "\\d+$")) q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4 q1 q2 q3 1 1 NA NA 1 NA 1 1 NA 1 1 1 1 2 NA 1 NA NA 1 1 1 NA 1 2 2 1 3 1 1 1 1 1 NA NA NA 1 1 1 4 4 NA 1 NA 1 1 1 1 NA 1 2 1 1 5 NA NA 1 1 NA 1 NA NA NA 3 1 1 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – William Doane. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Granular Weighted Binning by Generalized Boosted Model
(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to Rbloggers)
In the post https://statcompute.wordpress.com/2019/04/27/moregeneralweightedbinning, I’ve shown how to do the weighted binning with the function wqtl_bin() by the iterative partitioning. However, the outcome from wqtl_bin() sometimes can be too coarse. The function wgbm_bin() (https://github.com/statcompute/MonotonicBinning/blob/master/code/wgbm_bin.R) leverages the idea of gbm() that implements the Generalized Boosted Model and generates more granular weighted binning outcomes.
Below is the demonstration showing the difference between wqtl_bin() and wgbm_bin() outcomes. Even with the same data, the wgbm_bin() function is able to generate a more granular binning result and 14% higher Information Value.
.gist table { marginbottom: 0; } var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
x13binary 1.1.392
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
An updated x13binary package 1.1.392 of the X13ARIMASEATS program by the US Census Bureau (with upstream release 1.1.39) is now on CRAN, pretty much exactly two years after the previous release 1.1.391.
The x13binary package takes the pain out of installing X13ARIMASEATS by making it a fully resolved CRAN dependency. For example, if you install the excellent seasonal package by Christoph, then X13ARIMASEATS will get pulled in via the x13binary package and things just work: Depend on x13binary and on all relevant OSs supported by R, you should have an X13ARIMASEATS binary installed which will be called seamlessly by the higherlevel packages such as seasonal or gunsales. With this the full power of the what is likely the world’s most sophisticated deseasonalization and forecasting package is now at your fingertips and the R prompt, just like any other of the 14100+ CRAN packages. You can read more about this (and the seasonal package) in the recent Journal of Statistical Software paper by Christoph and myself.
There is almost no change in this release – apart from having to force StagedInstall: no following the R 3.6.0 release as the macOS build is otherwise broken now.
Courtesy of CRANberries, there is also a diffstat report for this release showing changes to the previous release.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Deep (learning) like Jacques Cousteau – Part 3 – Vectors
(This article was first published on Embracing the Random  R, and kindly contributed to Rbloggers)
(TL;DR: Vectors are ordered lists of numbers.)
LaTeX and MathJax warning for those viewing my feed: please view
directly on my website for MathJax goodness!
Sector, vector, the lyric inspector
XRay vision, power perfector
Kool Keith from ‘MC Champion’ by Ultramagnetic MC’s
Last
time,
we learnt about scalars. We’ll now start learning about
vectors!
A vector is essentially a list of numbers! They’re normally
depicted as columns of numbers (i.e. as column vectors):
\begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix}
The order in which each number appears in our vector is important.
For example:
\begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix}
However:
\begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix} \neq \begin{bmatrix} 3 \\ 2 \\ 1 \\ \end{bmatrix}
We can transpose vectors so that the rows become columns, or so that
the columns become rows. We’ll use an upper case “T” to denote the
transpose operator:
%
We now have ourselves a row vector! Applying the transpose operation
on our row vector brings us right back to where we started:
%
A tiny bit of notationWe will use the notation from Goodfellow, Ian, et
al. and refer to vectors using
lower case, bold letters:
\boldsymbol{x}
The individual elements of our vectors are often called
elements, components or entries. We’ll follow Goodfellow, Ian,
et al. and call them
elements. We’ll refer to the \boldsymbol{i{\textrm{th}}}
element of our vector \boldsymbol{x} using an italicised x with
a subscript indicating the element number:
x_i
For example, we’ll represent the second element of our vector \boldsymbol{x} like this:
x_{2}
How can we represent vectors in R? Common vector typesWe have many vector types in R. The vector types that I most commonly use are these:
 numeric,
 character, and
 logical
In R, all elements of our vectors must be of the same type. If a
single element in our vectors violates this condition, the entire vector
is coerced into a more generic class. For example, this results in
an integer vector:
However, this results in a numeric vector:
x < c(2L, 3L, 5L, 7) class(x) ## [1] "numeric"This makes sense as we learnt in part
one
that our set of integers, \mathbb{Z}, is a subset of our set of real
numbers, \mathbb{R}. That is:
\mathbb{Z} \subset \mathbb{R}
The \subset symbol means “is a proper/strict subset of”. It
indicates that all elements of the set on the left are contained within
the set on the right.
This means that we can represent both our integers and real numbers in a single vector of type numeric. However, we can’t represent all of our real numbers in a vector type of integer.
How can we create vectors?We can create vectors by using the c() function. For example, here is
a numeric vector:
Notice that when we print it out, it looks like a row vector.
However, when we use our vector in a data.frame, we get this:
Looks like a column to me! (Note: The numbers in the column on the left are row numbers)
If we then transpose it, we get this:
x_df_transposed < t(x_df) print(x_df_transposed) ## [,1] [,2] [,3] ## x 1 2 3Looks like a row vector now! Exciting! If we transpose it again, we’re
back to a column vector:
Vectors are ordered lists of numbers. We transposed the hell out of one.
Next time, we’ll combine our knowledge of scalars with our new knowledge
of vectors!
Justin
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Embracing the Random  R. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
10 Steps to Better Graphs in R
(This article was first published on Learn R Programming & Build a Data Science Career  Michael Toth, and kindly contributed to Rbloggers)
Over the last 5 years, I have created a LOT of graphs. And let me tell you, they haven’t all been pretty. But with each new graph that I’ve created, I’ve improved my knowledge of what works and what doesn’t.
And I’ve used that knowledge to develop a set of best practices that I follow every time I’m working on a new project that involves communicating results or information with graphs.
You see, when I’m making a graph, I’m not doing it just to explore some data or show something “interesting”. No. I want my graphs to speak to my audience and help them to understand and take action based on what they learn.
A good data scientist needs to be able to not only analyze data, but also to convey the insights hidden in that data in a way that convinces people to take appropriate action.
If you can use your data analysis skills to consistently drive change in your organization, you will quickly find yourself on a path toward promotion, increased responsibility, and greater control over your own work.
In my own experience, learning to graph effectively was the single biggest thing that helped me increase my impact and ability to drive change at work. That’s why I think it’s so important for you to learn to graph effectively, and that’s why I’m sharing this checklist with you today.
This is the exact checklist I go through when I’m working on graphs for big consulting projects with my clients. I keep a printed copy on my desk and I refer to it every time I’m working on a new graph. It helps me, and I think it will help you too. Be sure to get a copy of the checklist for yourself so you’ll always have it handy when you need it!
Get Your Free 10Step ggplot Graphing Checklist
Why Do You Need a ChecklistGraphs are a versatile tool that can be used for a variety of different goals. That’s one of the reasons why I think learning to graph effectively is one of the highest priorities for data scientists.
But I also think that leads to a lot of problems with graphs. You see, you can use graphs for exploratory data analysis, and you can also use graphs for presentation and sharing results. The problem that I see ALL THE TIME is that people try to use the same graphs for both of these things. Ahhh!
Look, I’ve been there. It’s tempting and easy to throw a few quick graphs together and call it a day.
I used to work in finance, and I was once tasked with building a model to predict of how likely borrowers were to default on loans they had taken out.
It was a challenging problem, and I spent around 6 weeks creating a sophisticated model that predicted defaults based on all kinds of data about the borrower’s income level, where they lived, how big their loan was, and what their interest rate was.
I was pretty pleased that the model I created worked really well! I knew our clients would find a lot of value in this model once we got it implemented into our platform.
Before that, I needed to summarize my results and share them with the rest of the company. At the time, I didn’t see this as an opportunity to advocate for my work and its benefits to our clients. Instead, I saw it as an annoying obligation that I had to do on top of my already extensive analysis.
I knew my analysis was good and that this change was valuable. I was sure other people understood that as well.
So I threw some graphs together, gave a quick presentation on the topic to a room full of glazedover eyes, and went back to my desk.
It took weeks for us to build this into the platform, when it could have been done in a matter of days if there was sufficient motivation.
Whose fault was that? At the time, it was easy for me to blame the engineering team for the slow implementation. But the reality is, it was my fault!
Everybody else is busy with their own work, and for the most part, they don’t really know what it is you do all day. You’re often so deep in the weeds of your own analysis that things you think are simple and obvious aren’t even on the minds of everybody else. That’s why it’s important to treat every presentation and every graph you share as an opportunity to educate others and inspire them to move forward with an action.
I’d love to say that I immediately changed my presentation and graphing style after that experience, but I didn’t. It took me years to develop the knowledge and skills to do this effectively.
But it doesn’t have to take years! I’m here to help you learn today how to effectively use graphs to communicate ideas and drive change.
If you implement these ideas consistently, you will see improvements in the impact of your work and your influence in the organization.
Now, let’s get into the checklist. And remember, if you want to keep a copy of this for yourself so you’ll always have it to refer back to, you can get that here:
Get Your Free 10Step Checklist to Graphing for Impact
Before you graph 1. Decide who this graph is forIn order to create an effective graph, you need to know who will be using this information. Many of your design decisions stem from this key point. If you understand your audience’s background, goals, and challenges, you’ll be far more effective in creating a graph to help them make a decision, which is what this is all about! In particular, remember that this graph is not for you. You have all kinds of specialized knowledge that your audience likely does not have. You need your graph to appeal to them.
2. Structure your graph to answer a questionYour graphs should answer an important question that your audience has. “How have our revenue numbers changed over time?” or “Which of our services has the lowest level of customer satisfaction?” Design your graph to answer their question, not just to explore data. This serves two purposes.
 It gives people a reason to pay attention. If you’re answering a big question they have, they’re going to listen to you.
 It provides a clear path from data to action, which is ultimately what you want. Remember: the entire point of this field is to extract insights from data to help businesses make better decisions!
The graph you use will depend on the data and the question you are answering.
 Line Graph: Use line graphs to track trends over time or show the relationship between two variables.
 Bar Charts: Use bar charts to compare quantitative data across multiple categories.
 Scatter Plots: Use scatter plots to assess the relationship between two variables.
 Pie Charts: Use pie charts to show parts of a whole. I personally do not use pie charts, and I advise you to be very careful with them. If you must use them, limit the number of categories, as more than 3 or 4 makes them unreadable.
Outliers are an inevitability. You need to decide how to handle this. Sometimes, the outlier itself can be a point of focus in your graph that you want to highlight. Other times, it can be a distraction from your message that you would prefer to remove. Before removing an outlier, think critically about why the outlier exists and make a judgment call as to whether removing it helps to clarify your point without being misleading.
Building Your Graph 5. Remove unnecessary dataYour audience should be able to clearly understand the point of your graph. Excessive and unnecessary data can distract from this goal. Decide what is necessary to answer your question and cut the rest.
6. Don’t be misleadingThere are many ways that graphs can be misleading, either intentionally or unintentionally. These two seem to come up most frequently:
If you’re using a bar chart, the baseline for the yaxis must start at 0. Otherwise, your graph will be misleading by amplifying the actual differences across the categories.
Your titles and captions should accurately describe your data. Titles and captions are a great way to bring salience to your graph, but you need to ensure the text reinforces what the data says, rather than changing the message.
Styling Your Graph 7. Decide on an appropriate color paletteColor is an important and oftenneglected aspect of graphs. For singlecolor graphs, choose a color that’s related to your organization’s brand or thematically related to the graph (for example, green for forestry data). For multicolor graphs, use Color Picker for Data, an excellent tool for building visually pleasing color palettes.
8. Make all of your axis titles and labels horizontalAll of the axis titles and labels in your graph should be horizontal. Horizontal labels greatly improve the readability and visual appeal of a graph.
9. Adjust your titles, labels, and legend textGive your graph a compelling title, and add descriptive and well formatted names to the axis titles and legend. A good choice for your graph title is to simply state the question you’re trying to answer. I also like to use a subtitle that drives home the message you want people to take away. You can use the labs function in ggplot to modify these labels.
BONUS: Add your company logo and brandingIf you’re sharing this graph with clients or the public, adding your company logo and branding elements can help your graph to stand out and to build credibility for your organization. This is great for you, because it will help you to grow your own influence and visibility within the company. Read my guide on this subject for more details on how to implement this tip.
Exporting Your Graph 10. Save your graph in a readable highresolution formatThink about how your graph is going to be read. Will it be online, printed, or in a slide for a presentation? Each format may require different adjustments to text and graph sizing to be readable. Be sure to test for yourself to ensure you can read your graph in its final format. This will avoid frustrating reworks or–even worse–sharing an unreadable graph! Use the ggsave function to save your graph and modify the resolution. Then, adjust sizes until you’re satisfied with the final result.
ConclusionSome of these tips may have been obvious, and others may have seemed like revelations. The important thing is to think through these steps and apply them consistently with every graph you produce. I promise you that if you incorporate this checklist into your workflow, you’re going to see a big change in how people respond to your analysis at work.
Remember to get a copy of this graphing checklist so you can be sure to go through it every time!
Get Your Free 10Step Checklist to Graphing for Impact
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Learn R Programming & Build a Data Science Career  Michael Toth. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Backtest Trading Strategies like a real Quant
(This article was first published on RBloggers – Learning Machines, and kindly contributed to Rbloggers)
R is one of the best choices when it comes to quantitative finance. Here we will show you how to load financial data, plot charts and give you a stepbystep template to backtest trading strategies. So, read on…
We begin by just plotting a chart of the Standard & Poor’s 500 (S&P 500), an index of the 500 biggest companies in the US. To get the index data and plot the chart we use the powerful quantmod package (on CRAN). After that we add two popular indicators, the simple moving average (SMI) and the exponential moving average (EMA).
Have a look at the code:
library(quantmod) ## Loading required package: xts ## Loading required package: zoo ## ## Attaching package: 'zoo' ## The following objects are masked from 'package:base': ## ## as.Date, as.Date.numeric ## Loading required package: TTR ## Version 0.40 included new data defaults. See ?getSymbols. getSymbols("^GSPC", from = "20000101") ## 'getSymbols' currently uses auto.assign=TRUE by default, but will ## use auto.assign=FALSE in 0.50. You will still be able to use ## 'loadSymbols' to automatically load data. getOption("getSymbols.env") ## and getOption("getSymbols.auto.assign") will still be checked for ## alternate defaults. ## ## This message is shown once per session and may be disabled by setting ## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details. ## [1] "^GSPC" head(GSPC) ## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume ## 20000103 1469.25 1478.00 1438.36 1455.22 931800000 ## 20000104 1455.22 1455.22 1397.43 1399.42 1009000000 ## 20000105 1399.42 1413.27 1377.68 1402.11 1085500000 ## 20000106 1402.11 1411.90 1392.10 1403.45 1092300000 ## 20000107 1403.45 1441.47 1400.73 1441.47 1225200000 ## 20000110 1441.47 1464.36 1441.47 1457.60 1064800000 ## GSPC.Adjusted ## 20000103 1455.22 ## 20000104 1399.42 ## 20000105 1402.11 ## 20000106 1403.45 ## 20000107 1441.47 ## 20000110 1457.60 tail(GSPC) ## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume ## 20190424 2934.00 2936.83 2926.05 2927.25 3448960000 ## 20190425 2928.99 2933.10 2912.84 2926.17 3425280000 ## 20190426 2925.81 2939.88 2917.56 2939.88 3248500000 ## 20190429 2940.58 2949.52 2939.35 2943.03 3118780000 ## 20190430 2937.14 2948.22 2924.11 2945.83 3919330000 ## 20190501 2952.33 2954.13 2923.36 2923.73 3645850000 ## GSPC.Adjusted ## 20190424 2927.25 ## 20190425 2926.17 ## 20190426 2939.88 ## 20190429 2943.03 ## 20190430 2945.83 ## 20190501 2923.73 chartSeries(GSPC, theme = chartTheme("white"), subset = "last 10 months", show.grid = TRUE) addSMA(20) addEMA(20)As you can see the moving averages are basically smoothed out versions of the original data shifted by the given number of days. While with the SMA (red curve) all days are weighted equally with the EMA (blue curve) the more recent days are weighted stronger, so that the resulting indicator detects changes quicker. The idea is that by using those indicators investors might be able to detect longer term trends and act accordingly. For example a trading rule could be to buy the index whenever it crosses the MA from below and sell when it goes the other direction. Judge for yourself if this could have worked.
Well, having said that it might not be that easy to find out the profitability of certain trading rules just by staring at a chart. We are looking for something more systematic! We would need a decent backtest! This can of course also be done with R, a great choice is the PerformanceAnalytics package (on CRAN).
To backtest a trading strategy I provide you with a stepbystep template:
 Load libraries and data
 Create your indicator
 Use indicator to create equity curve
 Evaluate strategy performance
As an example we want to test the idea that it might be profitable to sell the index when the financial markets exhibit significant stress. Interestingly enough “stress” can be measured by certain indicators that are freely available. One of them is the National Financial Conditions Index (NFCI) of the Federal Reserve Bank of Chicago (https://www.chicagofed.org/publications/nfci/index):
The Chicago Fed’s National Financial Conditions Index (NFCI) provides a comprehensive weekly update on U.S. financial conditions in money markets, debt and equity markets and the traditional and “shadow” banking systems. […] The NFCI [is] constructed to have an average value of zero and a standard deviation of one over a sample period extending back to 1971. Positive values of the NFCI have been historically associated with tighterthanaverage financial conditions, while negative values have been historically associated with looserthanaverage financial conditions.
To make it more concrete we want to create a buy signal when the index is above one standard deviation in negative territory and a sell signal otherwise.
Have a look at the following code:
# Step 1: Load libraries and data library(quantmod) library(PerformanceAnalytics) ## ## Attaching package: 'PerformanceAnalytics' ## The following object is masked from 'package:graphics': ## ## legend getSymbols('NFCI', src = 'FRED', , from = '20000101') ## [1] "NFCI" NFCI < na.omit(lag(NFCI)) # we can only act on the signal after release, i.e. the next day getSymbols("^GSPC", from = '20000101') ## [1] "^GSPC" data < na.omit(merge(NFCI, GSPC)) # merge before (!) calculating returns) data$GSPC < na.omit(ROC(Cl(GSPC))) # calculate returns of closing prices # Step 2: Create your indicator data$sig < ifelse(data$NFCI < 1, 1, 0) data$sig < na.locf(data$sig) # Step 3: Use indicator to create equity curve perf < na.omit(merge(data$sig * data$GSPC, data$GSPC)) colnames(perf) < c("Stressbased strategy", "SP500") # Step 4: Evaluate strategy performance table.DownsideRisk(perf) ## Stressbased strategy SP500 ## Semi Deviation 0.0075 0.0087 ## Gain Deviation 0.0071 0.0085 ## Loss Deviation 0.0079 0.0095 ## Downside Deviation (MAR=210%) 0.0125 0.0135 ## Downside Deviation (Rf=0%) 0.0074 0.0087 ## Downside Deviation (0%) 0.0074 0.0087 ## Maximum Drawdown 0.5243 0.6433 ## Historical VaR (95%) 0.0173 0.0188 ## Historical ES (95%) 0.0250 0.0293 ## Modified VaR (95%) 0.0166 0.0182 ## Modified ES (95%) 0.0268 0.0311 table.Stats(perf) ## Stressbased strategy SP500 ## Observations 4858.0000 4858.0000 ## NAs 0.0000 0.0000 ## Minimum 0.0690 0.0947 ## Quartile 1 0.0042 0.0048 ## Median 0.0003 0.0005 ## Arithmetic Mean 0.0002 0.0002 ## Geometric Mean 0.0002 0.0001 ## Quartile 3 0.0053 0.0057 ## Maximum 0.0557 0.1096 ## SE Mean 0.0001 0.0002 ## LCL Mean (0.95) 0.0001 0.0002 ## UCL Mean (0.95) 0.0005 0.0005 ## Variance 0.0001 0.0001 ## Stdev 0.0103 0.0120 ## Skewness 0.1881 0.2144 ## Kurtosis 3.4430 8.5837 charts.PerformanceSummary(perf) chart.RelativePerformance(perf[ , 1], perf[ , 2]) chart.RiskReturnScatter(perf)The first chart shows that the stressbased strategy (black curve) clearly outperformed its benchmark, the S&P 500 (red curve). This can also be seen in the second chart, showing the relative performance. In the third chart we see that both return (more) and (!) risk (less) of our backtested strategy are more favourable compared to the benchmark.
So, all in all this seems to be a viable strategy! But of course before investing real money many more tests have to be performed! You can use this framework for backtesting your own ideas.
Here is not the place to explain all of the above tables and plots but as you can see both packages are very, very powerful and I have only shown you a small fraction of their capabilities. To use their full potential you should have a look at the extensive documentation that comes with it on CRAN.
Disclaimer:
This is no investment advice! No responsibility is taken whatsoever if you lose money!
If you gain money though I would be happy if you could buy me a coffee… that is not too much to ask, is it?
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: RBloggers – Learning Machines. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Using Modules in R
(This article was first published on INWTBlogRBloggers, and kindly contributed to Rbloggers)
When a code base grows we may think of using several files first and then source them. Functions, of course, are rightfully advocated to new R users, and are the essential building block. Packages are then, already, the next level of abstraction we have to offer. With the modules package I want to provide something in between: local namespace definitions without, or within R packages. We find this feature implemented in various ways and languages: classes, namespaces, functions, packages, and sometimes also modules. Python, Julia, F#, Scala, and Erlang (and more) are languages which use modules as a language construct. Some of them in parallel to classes and packages.
More Than a Function, Less Than a PackageJust like classes and objects, modules present a way to group functions into one entity. They behave as a first class citizen in the sense that they can be treated like any other data structure in R:
 they can be created anywhere, including inside another module,
 they can be passed to functions,
 and returned from functions.
In addition they provide:
 local namespace features by declaring imports and exports
 encapsulation by introducing a local scope,
 code reuse by various modes of composition,
 interchangeability with other modules implementing the same interface.
The most important part for me is the freedom to create a local scope, or context, in which I can safely reuse verbs as function names. This is so important to me because naming things is so terribly difficult; many verbs have a different meaning in a different context; and a lot of expressive verbs are already used by base R or the tidyverse and I want to avoid name clashes. When you are working inside a package or in the REPL, aka console, R only knows one context: all values are bound to names in the global environment (or package scope). Modules present a way to create a new context and reuse names, among some other useful features. To make one thing clear: modules do not aim at being a substitute for packages. Instead we may use them to create local namespaces inside or outside of a package. On a scale of abstraction they are located between functions and packages: modules can comprise functions, sometimes data. Packages can comprise functions, data, and modules (furthermore documentation, tests, and are a solid way to share code).
A Simple Module DefinitionLet’s look at an example:
# install.packages("modules") # vignette("modulesInR", "modules") graphics < modules::module({ modules::import("ggplot2") barplot < function(df) { ## my example barplot ## df (data.frame) a data frame with group and count columns ggplot(df, aes(group, count)) + geom_bar(stat = "identity") + labs(x = "Some group variable", y = "Absolute frequency") } }) df < data.frame( # Just some sample data group = sample(LETTERS, 5), count = round(runif(5, max = 100)) ) graphics$barplot(df)First we instantiate a new module using modules::module(). What you should recognize is that you can simply put your function definitions into it; here we define the function barplot. We then can call that function by qualifying its name with graphics$barplot. The graphics module is just a list and as such can be passed around:
class(graphics) ## [1] "module" "list" graphics ## barplot: ## function(df) ## ## my example barplot ## ## df (data.frame) a data frame with group and count columnsSomething which is important about modules is that they aim to create a local scope, or a mini namespace. As such they support to define imports and exports. So far we have made all names exported by ggplot2 available to the scope of the module. We may gain finer control of the names available to the module as illustrated below:
graphics < modules::module({ modules::import("ggplot2", "aes", "geom_bar", "ggplot", "labs") modules::export("barplot") barplot < function(df) { ## my example barplot ## df (data.frame) a data frame with a group and count column ggplot(df, aes(group, count)) + geom_bar(stat = "identity") + labs(x = "Some group variable", y = "Absolute frequency") } })We do not have to use the modules::module function to create a module. When we write scripts it might be more convenient, to instead spread our codebase accross files. Each file can then act as a module. If you compare this to source the advantage is, that each file can have its own set of imports while avoiding any name clashes between them. Furthermore modules only export what is necessary and we avoid to spam the global environment with intermediate objects and function definitions:
graphics.R
modules::import("ggplot2", "aes", "geom_bar", "ggplot", "labs") modules::export("barplot") barplot < function(df) { ## my example barplot ## df (data.frame) a data frame with a group and count column ggplot(df, aes(group, count)) + geom_bar(stat = "identity") + labs(x = "Some group variable", y = "Absolute frequency") }Then we load this module using:
graphics < modules::use("graphics.R") graphics$barplot When Can Modules be Useful?Why would you want this? All too often we end up spending much more time than we should maintaining badly written code bases. And while the solution often is: write a package!, use a style guide!, structure! – we just tend to respond We can pretify later, I just need to finish this up first. Modules aim to provide a lightweight solution to a couple of problems I often see in code bases:
Imagine you write a report involving some sort of data analysis. Of course you use knitr or Sweave. So you combine your writing and coding in one file. After a while the report grows beyond the length of a blog post and is too long for one file. So as a faithful useR you start to write functions. Maybe now you also start to have several R scripts or multiple Rmd files where you put your function definitions. Of course your library statements go into some sort of header block, right? Your functions depending on these library statements are separated into files or can be found possibly thousands of lines down the road. After some time, maybe month, and the effort of several authors, you discover that your report only compiles in a fresh R session. Never twice in a row, you get different results then. Someone decided to add library statements into the source files. When you remove them and place them on top in that header section you are supposed to have, something breaks. When you remove some variables you do not need anymore something down the road breaks. You do not really know why and you also promised that you would get an update of the report ready: this afternoon. So you better leave everything unchanged and just get it done, somehow.
In those kind of situations we missed the point to get the codebase organized. It is uncontrolled growth and happens when we do not pay attention. It happens more often than we like to admit, and it is also avoidable. Aside from all good practices you can and should apply, modules offer:
 A safe(r) way to source files in one R session
 Avoid name clashes between packages: (a) by importing single functions and (b) by only importing them locally
 Avoid name clashes between different parts (maybe files) of your codebase. In many cases you rely on objects in your global workspace. With modules you can make them local
 Provide interfaces and hide the implementation
Before we end up in those situation, what can we do? We need to create boundaries. We need to be explicit which dependencies are needed and where they are needed. We need also to be explicit which objects are allowed to be used in later parts of the analysis. When we manage to create boundaries between the parts of our programs, or between the parts of a report, we reduce the scope we need to understand when writing, maintaining – aka debugging – or extending any of it. We also take control of how much coupling between the parts we allow. The list goes on: boundaries are a good thing, in programming.
There are a number of tools and a variety of concepts you can use to create boundaries. Modules can be one tool in your bag. They allow the use of local imports: ggplot2 in the above code was never attached to the search path of the main R session. They allow us to expose only what is necessary and thus help us protect private objects. Modules can sit in their own files and folders and can be loaded, without changing anything in the global environment. Modules are containers where we can put things that are related. They can help us to keep things clean and impose structure. Sometimes I feel that writing functions for a package is like putting all the things I have into a storage room. Using modules, I additionally have boxes.
Final RemarksWriting all our code in some class definition does not mean we do object orientation. Writing a lot of functions does not mean we do functional programming. Just by putting all your code into a module will not solve anything. Only when we know and understand the problems we try to solve can we pick the right tool. It is just plain hard to know in advance what the right structure is for your problem. Also it is not always obvious when the small script should have had a package to back it up. But we have to improve on this! Impose structure, decouple components, follow best practices and get organized! For these things, I hope that modules can be of use. Feel free to open an issue on GitHub if you have any questions.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: INWTBlogRBloggers. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
{attachment} is on CRAN !
(This article was first published on (en) The R Task Force, and kindly contributed to Rbloggers)
We are pleased to announce that our package {attachment} is now available on CRAN. The goal of attachment is to help to deal with package dependencies during package development. It also gives useful tools to install or list missing packages used inside Rscripts or Rmds.
Deal with dependencies during package developmentIf you ever had the following error when checking your package, {attachment} is for you:
R CMD check results 1 error  0 warnings  0 notes checking package dependencies … ERROR Namespace dependency not required: ‘collateral’
With this error, Colin would answer you:
Have you listed the package in the DESCRIPTION and in the NAMESPACE?Indeed, the steps to deal with dependencies in your R functions when creating a package are:
 Use package::function directly in the code or list the function in the {roxygen2} header with @importFrom package function
 Run devtools::document() so that functionappears in the NAMESPACE file
 Add the {package} in the list of Depends in the DESCRIPTION file
Also as you create a vignette and tests in your package, you need to remember to list packages in the Suggestssection of your DESCRIPTION file
{attachment} is here to helpPackage {attachment} will do all the above steps for you.
Install the package from CRAN or from Github:
If you correctly called the package dependencies in the {roxygen2} skeleton, in your functions, in your Rmarkdown vignettes and in your tests, you only need to run attachment::att_to_description()just before devtools::check(). And that’s it, there is nothing else to remember !
Use {attachment} out of package development{attachment} parses code of R scripts and Rmd. It lists all packages required to run that code. If you want to install all packages before trying to run the code of somebody else, you can use:
attachment::att_from_rmds(path = ".") %>% attachment::install_if_missing() attachment::att_from_rscripts(path = ".") %>% attachment::install_if_missing()Also, if you build packages or Shiny Apps in packages for your delivery to your customers, or if you have to install your R product on customers’ servers, you will need to install all required packages before installation. A good start is to use a R script that lists all packages required. With function attachment::create_dependencies_file(), you can build this kind of script:
# No Remotes  # remotes::install_github("ThinkRopen/fcuk") # Attachments  to_install < c("covr", "desc", "devtools", "glue", "knitr", "magrittr", "rmarkdown", "stats", "stringr", "testthat", "utils") for (i in to_install) { message(paste("looking for ", i)) if (!requireNamespace(i)) { message(paste(" installing", i)) install.packages(i) } } Documentation and participationTo read the full documentation of package {attachment}, you can follow this link to the {pkgodwn} site.
If you want to participate to the development, report bugs or propose pull requests, you will find the github page here.
Find our other contributions to opensource and the R community here.
The post {attachment} is on CRAN ! appeared first on (en) The R Task Force.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: (en) The R Task Force. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Repeated crossvalidation in cvms and groupdata2
(This article was first published on R, and kindly contributed to Rbloggers)
I have spent the last couple of days adding functionality for performing repeated crossvalidation to cvms and groupdata2. In this quick post I will show an example.
In crossvalidation, we split our training set into a number (often denoted “k”) of groups called folds. We repeatedly train our machine learning model on k1 folds and test it on the last fold, such that each fold becomes test set once. Then we average the results and celebrate with food and music.
The benefits of using groupdata2 to create the folds are 1) that it allows us to balance the ratios of our output classes (or simply a categorical column, if we are working with linear regression instead of classification), and 2) that it allows us to keep all observations with a specific ID (e.g. participant/user ID) in the same fold to avoid leakage between the folds.
The benefit of cvms is that it trains all the models and outputs a tibble (data frame) with results, predictions, model coefficients, and other sweet stuff, which is easy to add to a report or do further analyses on. It even allows us to crossvalidate multiple model formulas at once to quickly compare them and select the best model.
Repeated CrossvalidationIn repeated crossvalidation we simply repeat this process a couple of times, training the model on more combinations of our training set observations. The more combinations, the less one bad split of the data would impact our evaluation of the model.
For each repetition, we evaluate our model as we would have in regular crossvalidation. Then we average the results from the repetitions and go back to food and music.
groupdata2As stated, the role of groupdata2 is to create the folds. Normally it creates one column in the dataset called “.folds”, which contains a fold identifier for each observation (e.g. 1,1,2,2,3,3,1,1,3,3,2,2). In repeated crossvalidation it simply creates multiple of such fold columns (“.folds_1”, “.folds_2”, etc.). It also makes sure they are unique, so we actually train on different subsets.
# Install groupdata2 and cvms from github
devtools::install_github("ludvigolsen/groupdata2")
devtools::install_github("ludvigolsen/cvms")
# Attach packages
library(cvms) # cross_validate()
library(groupdata2) # fold()
library(knitr) # kable()
library(dplyr) # %>%
# Set seed for reproducibility
set.seed(7)
# Fold data
# Create 3 fold columns
# cat_col is the categorical column to balance between folds
# id_col is the column with IDs. Observations with the same ID will be put in the same fold.
# num_fold_cols determines the number of fold columns, and thereby the number of repetitions.
data < fold(data, k = 4, cat_col = 'diagnosis', id_col = 'participant', num_fold_cols = 3)
# Show first 15 rows of data
data %>% head(10) %>% kable()
In the cross_validate function, we specify our model formula for a logistic regression that classifies diagnosis. cvms currently supports linear regression and logistic regression, including mixed effects modelling. In the fold_cols (previously called folds_col), we specify the fold column names.
CV < cross_validate(data, "diagnosis~score",
fold_cols = c('.folds_1','.folds_2','.folds_3'),
family='binomial',
REML = FALSE)
# Show results
CV
Due to the number of metrics and useful information, it helps to break up the output into parts:
CV %>% select(1:7) %>% kable()
CV %>% select(8:14) %>% kable()
CV$Predictions[[1]] %>% head() %>% kable()
CV$`Confusion Matrix`[[1]] %>% head() %>% kable()
CV$Coefficients[[1]] %>% head() %>% kable()
CV$Results[[1]] %>% select(1:8) %>% kable()
We could have trained multiple models at once by simply adding more model formulas. That would add rows to the output, making it easy to compare the models.
The linear regression version has different evaluation metrics. These are listed in the help page at ?cross_validate.
Conclusioncvms and groupdata2 now have the functionality for performing repeated crossvalidation. We have briefly talked about this technique and gone through a short example. Check out cvms for more
Indlægget Repeated crossvalidation in cvms and groupdata2 blev først udgivet på .
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Hierarchical Customer Lifetime Value
(This article was first published on Brian Callander, and kindly contributed to Rbloggers)
Hierarchical Customer Lifetime Value
Posted on 5 May, 2019 by Brian
Tags: customer lifetime value, recency frequency, hierarchical model, centred
parameterisation, noncentred parameterisation, priorpredictive
distribution, stan, ebfmi, energy
Category: customer_lifetime_value
In a previous post, we described how a model of customer lifetime value (CLV) works, implemented it in Stan, and fit the model to simulated data. In this post, we’ll extend the model to use hierarchical priors in two different ways: centred and noncentred parameterisations. I’m not aware of any other HMCbased implementations of this hierarchical CLV model, so we’ll run some basic tests to check it’s doing the right thing. More specifically, we’ll fit it to a dataset drawn from the prior predictive distribution. The resulting fits pass the main diagnostic tests and the 90% posterior intervals capture about 91% of the true parameter values.
A word of warning: I came across a number of examples where the model showed severe EBFMI problems. This seems to happen when the parameters \(\mu\) (inverse expected lifetime) and \(\lambda\) (expected purchase rate) are fairly similar, but I haven’t pinned down a solid reason for these energy problems yet. I suspect this has something to do with the difficulty in distinguishing a short lifetime from a low purchase rate. This is a topic we’ll leave for a future post.
We’ll work with raw stan, which can be a bit fiddly sometimes. To keep this post within the limits of readability, I’ll define some custom functions to simplify the process. Check out the full code to see the details of these functions.
\(\DeclareMathOperator{\dbinomial}{Binomial}\DeclareMathOperator{\dbernoulli}{Bernoulli}\DeclareMathOperator{\dpoisson}{Poisson}\DeclareMathOperator{\dnormal}{Normal}\DeclareMathOperator{\dt}{t}\DeclareMathOperator{\dcauchy}{Cauchy}\DeclareMathOperator{\dexponential}{Exp}\DeclareMathOperator{\duniform}{Uniform}\DeclareMathOperator{\dgamma}{Gamma}\DeclareMathOperator{\dinvgamma}{InvGamma}\DeclareMathOperator{\invlogit}{InvLogit}\DeclareMathOperator{\logit}{Logit}\DeclareMathOperator{\ddirichlet}{Dirichlet}\DeclareMathOperator{\dbeta}{Beta}\)
Data Generating ProcessLet’s recap on the story from last time. We have a 2year old company that has grown linearly over that time to gain a total of 1000 customers.
set.seed(65130) # https://www.random.org/integers/?num=2&min=1&max=100000&col=5&base=10&format=html&rnd=new customers < tibble(id = 1:1000) %>% mutate( end = 2 * 365, start = runif(n(), 0, end  1), T = end  start )Within a customer’s lifetime \(\tau\), they will purchase with Poissonrate \(\lambda\). We can simulate the time \(t\) till last observed purchase and number of purchases \(k\) with sample_conditional.
sample_conditional < function(T, tau, lambda) { # start with 0 purchases t < 0 k < 0 # simulate time till next purchase wait < rexp(1, lambda) # keep purchasing till end of life/observation time while(t + wait <= pmin(T, tau)) { t < t + wait k < k + 1 wait < rexp(1, lambda) } # return tabular data tibble( t = t, k = k ) } s < sample_conditional(300, 200, 1) Example output from sample_conditionalt k 198.2929 169
In the above example, even though the observation time is \(T = 300\), the time \(t\) till last purchase will always be below the lifetime \(\tau = 200\). With a purchase rate of 1 per unit time, we expect around \(k = 200\) purchases.
ModelWe’ll use the same likelihood as before, which says that the probability of customer \(i\)’s data given their parameters is
\[
\begin{align}
\mathbb P(k, t, T \mid \mu_i, \lambda_i)
&=
\frac{\lambda_i^k}{\lambda_i + \mu_i}
\left( \mu_i e^{t(\lambda_i + \mu_i)} + \lambda_i e^{T(\lambda_i + \mu_i)} \right)
\\
&\propto
p \dpoisson(k \mid t\lambda_i)S(t \mid \mu_i)
\\
&\hphantom{\propto}
+ (1 – p) \dpoisson(k \mid t\lambda_i)\dpoisson(0 \mid (Tt)\lambda_i)S(T \mid \mu_i)
,
\\
p
&:=
\frac{\mu_i}{\lambda_i + \mu_i}
,
\end{align}
\]
where \(S\) is the exponential survival function.
To turn this into a Bayesian model, we’ll need priors for the parameters. The last time, we put simple gamma priors on the parameters \(\mu_i\) and \(\lambda_i\). For example, we could choose \(\lambda_i \sim \dgamma(2, 28)\) if we were to use the simple model from last time (similarly for \(\mu_i\)). This time we’re going hierarchical. There are various ways to make this hierarchical. Let’s look at two of them.
One method arises directly from the difficulty of specifying the gammaprior parameters. It involves just turning those parameters into random variables to be simultaneously estimated along with \(\mu_i\) and \(\lambda_i\). For example, we say \(\lambda_i \sim \dgamma(\alpha, \beta)\), where \(\alpha_i \sim \dgamma(\alpha_\alpha, \beta_\alpha)\) and \(\beta_i \sim \dgamma(\alpha_\beta, \beta_\beta)\) (and similarly for \(\mu_i\)). We eventually want to incorporate covariates, which is difficult with this parameterisation, so let’s move onto a different idea.
Another solution is to use lognormal priors. This means setting
\[
\begin{align}
\lambda_i
&:=
\exp(\alpha_i)
\\
\alpha_i
&\sim
\dnormal(\beta, \sigma)
\\
\beta
&\sim
\dnormal(m, s_m)
\\
\sigma
&\sim
\dnormal_+(0, s)
,
\end{align}
\]
where \(m\), \(s_m\), and \(s\) are constants specified by the user (and similarly for \(\mu_i\)). This implies
 there is an overall mean value \(e^\beta\),
 the customerlevel effects \(\alpha_i\) are deviations from the overall mean, and
 the extent of these deviations is controlled by the magnitude of \(\sigma\).
With \(\sigma \approx 0\), there can be very little deviation from the mean, so most customers would be the same. On the other hand, large values of \(\sigma\) allow for customers to be (almost) completely unrelated to each other. This means that \(\sigma\) is helping us to regularise the model.
The above parameterisation is called “centred”, which basically means the prior for \(\alpha_i\) is expressed in terms of other parameters (\(\beta\), \(\sigma\)). This can be rewritten as a “noncentred” parameterisation as
\[
\begin{align}
\lambda_i
&:=
\exp(\beta + \sigma \alpha_i)
\\
\alpha_i
&\sim
\dnormal(0, 1)
\\
\beta
&\sim
\dnormal(m, s_m)
\\
\sigma
&\sim
\dnormal_+(0, s)
.
\end{align}
\]
Notice the priors now contain no references to any other parameters. This is equivalent to the centred parameterisation because \(\beta + \sigma \alpha_i \sim \dnormal(\beta, \sigma)\). The noncentred parameterisation is interesting because it is known to increase the sampling efficiency of HMCbased samplers (such as Stan’s) in some cases.
Centred Stan implementationHere is a centred stan implementation of our lognormal hierarchical model.
centred < here::here('models/rf_centred.stan') %>% stan_model()Note that we have introduced the prior_only flag. When we specify that we want prior_only, then stan will not consider the likelihood and will instead just draw from the priors. This allows us to make priorpredictive simulations. We’ll generate a dataset using the priorpredictive distribution, then fit our model to that dataset. The least we can expect from a model is that it fits well to data drawn from its prior distribution.
Simulate the datasetTo simulate datasets we’ll use hyperpriors that roughly correspond to the priors from the previous post. In particular, the expected lifetime is around 31 days, and the expected purchase rate around once per fortnight.
data_hyperpriors < list( log_life_mean_mu = log(31), log_life_mean_sigma = 0.7, log_life_scale_sigma = 0.8, log_lambda_mean_mu = log(1 / 14), log_lambda_mean_sigma = 0.3, log_lambda_scale_sigma = 0.5 ) data_prior < customers %>% mutate(t = 0, k = 0) %>% tidybayes::compose_data(data_hyperpriors, prior_only = 1) data_prior %>% str() List of 14 $ id : int [1:1000(1d)] 1 2 3 4 5 6 7 8 9 10 ... $ end : num [1:1000(1d)] 730 730 730 730 730 730 730 730 730 730 ... $ start : num [1:1000(1d)] 328 707 408 342 666 ... $ T : num [1:1000(1d)] 401.6 22.6 322.2 388.5 64.5 ... $ t : num [1:1000(1d)] 0 0 0 0 0 0 0 0 0 0 ... $ k : num [1:1000(1d)] 0 0 0 0 0 0 0 0 0 0 ... $ n : int 1000 $ log_life_mean_mu : num 3.43 $ log_life_mean_sigma : num 0.7 $ log_life_scale_sigma : num 0.8 $ log_lambda_mean_mu : num 2.64 $ log_lambda_mean_sigma : num 0.3 $ log_lambda_scale_sigma: num 0.5 $ prior_only : num 1Let’s simulate 8 possible datasets from our priors. Notice how the centres and spreads of the datasets can vary.
centred_prior < centred %>% fit( # a wrapper around rstan::sampling to allow caching file = here::here('models/rf_centred_prior.rds'), # cache data = data_prior, pars = c('customer'), # ignore this parameter include = FALSE, chains = 8, cores = 4, warmup = 1000, # not sure why this needs to be so high iter = 1001, # one more than warmup because we just want one dataset per chain seed = 3901 # for reproducibility ) centred_prior_draws < centred_prior %>% get_draws( # rstan::extract but also with energy pars = c( 'lp__', 'energy__', 'theta', 'log_centres', 'scales' ) ) %>% name_parameters() # add customer id, and idx = 1 (mu) or 2 (lambda) Some priorpredictive draws.Here are the exact hyperparameters used.
hyper < centred_prior_draws %>% filter(str_detect(parameter, "^log_scales")) %>% select(chain, parameter, value) %>% spread(parameter, value) Hyperparameters of the priorpredictive draws.chain log_centres[1] log_centres[2] scales[1] scales[2] 1 3.643546 2.057017 1.0866138 1.0874820 2 4.586349 2.231530 1.2064439 0.2725037 3 2.673924 2.475513 0.8847336 0.9429385 4 3.490525 2.557564 0.7708666 1.0124164 5 3.422691 2.877842 1.3232360 0.2920695 6 4.205884 3.196397 1.9217956 0.5307348 7 3.381881 2.995128 1.0299251 0.7266123 8 3.046531 2.328561 1.3993460 0.4751674
We’ll add the prior predictive parameter draws from chain 1 to our customers dataset.
set.seed(33194) df < centred_prior_draws %>% filter(chain == 1) %>% filter(name == 'theta') %>% transmute( id = id %>% as.integer(), parameter = if_else(idx == '1', 'mu', 'lambda'), value ) %>% spread(parameter, value) %>% mutate(tau = rexp(n(), mu)) %>% inner_join(customers, by = 'id') %>% group_by(id) %>% group_map(~sample_conditional(.$T, .$tau, .$lambda) %>% bind_cols(.x)) data_df < data_hyperpriors %>% tidybayes::compose_data(df, prior_only = 0) Sample of customers and their propertiesid t k lambda mu tau end start T 1 44.453109 44 0.8508277 0.0383954 44.669086 730 328.4312 401.56876 2 21.237790 5 0.2202757 0.0091300 263.541504 730 707.3906 22.60937 3 0.000000 0 0.0285432 0.0583523 8.405014 730 407.8144 322.18558 4 61.946272 7 0.0970620 0.0424386 71.617849 730 341.5465 388.45349 5 8.273831 3 0.1732173 0.0608967 8.747799 730 665.5210 64.47898 6 107.182661 19 0.1224131 0.0159025 113.792604 730 388.0271 341.97287 Fit the model to simulations
Now we can fit the model to the priorpredictive data df.
centred_fit < centred %>% fit( # like rstan::sampling but with filecaching as in brms file = here::here('models/rf_centred_fit.rds'), # cache data = data_df, chains = 4, cores = 4, warmup = 2000, iter = 3000, control = list(max_treedepth = 12), seed = 24207, pars = c('customer'), include = FALSE ) centred_fit %>% check_hmc_diagnostics() Divergences: 0 of 4000 iterations ended with a divergence. Tree depth: 0 of 4000 iterations saturated the maximum tree depth of 12. Energy: EBFMI indicated no pathological behavior.The HMC diagnostics pass. However, in some of the runs not shown here, there were pretty severe problems with the EBFMI diagnostic (~0.01) and I’ve yet to figure out exactly which kinds of situations cause these energy problems. Let’s check out the pairwise posterior densities of energy with the hyperparameters.
Pairwise posterior densities of the centred modelThe scale parameter for the expected lifetime (scales[1]) is correlated with energy, which is associated with the energy problems described above. I’m not sure how much of a problem this poses, so let’s check out some more diagnostics.
neff < centred_fit %>% neff_ratio() %>% tibble( ratio = ., parameter = names(.) ) %>% filter(ratio < 0.5) %>% arrange(ratio) %>% head(20) Parameters with the lowest effective sample size.ratio parameter 0.0937772 scales[1] 0.1195873 lp__ 0.2592819 log_centres[1] 0.2748369 scales[2] 0.3228056 log_centres[2] 0.4340080 theta[574,2] 0.4381685 theta[169,1] 0.4800795 theta[38,1]
Both the lp__ and scales[1] parameters have low effective sample sizes. The rhat values seem fine though.
centred_rhat < centred_fit %>% rhat() %>% tibble( rhat = ., parameter = names(.) ) %>% summarise( min_rhat = min(rhat, na.rm = TRUE), max_rhat = max(rhat, na.rm = TRUE) ) Most extreme rhat valuesmin_rhat max_rhat 0.9990247 1.005071
Let’s now compare the 90% posterior intervals with the true values. Ideally close to 90% of the 90% posterior intervals capture their true value.
centred_cis < centred_draws %>% group_by(parameter) %>% summarise( lo = quantile(value, 0.05), point = quantile(value, 0.50), hi = quantile(value, 0.95) ) %>% filter(!str_detect(parameter, '__')) # exclude diagostic parametersThe table below shows we managed to recover three of the hyperparameters. The scales[2] parameter was estimated slightly too high.
calibration_hyper < hyper %>% filter(chain == 1) %>% gather(parameter, value, chain) %>% inner_join(centred_cis, by = 'parameter') %>% mutate(hit = lo <= value & value <= hi) The true hyperparameters and their 90% posterior intervals.parameter value lo point hi hit log_centres[1] 3.643546 3.6280113 3.733964 3.831287 TRUE log_centres[2] 2.057017 2.1786705 2.090098 2.008663 TRUE scales[1] 1.086614 0.9988373 1.119258 1.242780 TRUE scales[2] 1.087482 1.0938685 1.160415 1.232225 FALSE
We get fairly close to 90% of the customerlevel parameters.
true_values < df %>% select(id, mu, lambda) %>% gather(parameter, value, id) %>% mutate( idx = if_else(parameter == 'mu', 1, 2), parameter = str_glue("theta[{id},{idx}]") ) centred_calibration < centred_cis %>% inner_join(true_values, by = 'parameter') %>% ungroup() %>% summarise(mean(lo <= value & value <= hi)) %>% pull() %>% percent() centred_calibration [1] "91.1%"This is slightly higher than the ideal value of 90%.
Noncentred Stan implementationHere is a noncentred stan implementation of our lognormal hierarchical model. The important difference is in the expression for \(\theta\) and in the prior for \(\text{customer}\).
noncentred < here::here('models/rf_noncentred.stan') %>% stan_model()Since the noncentred and centred models are equivalent, we can also consider df as a draw from the noncentred prior predictive distribution.
noncentred_fit < noncentred %>% fit( # like rstan::sampling but with filecaching as in brms file = here::here('models/rf_noncentred_fit.rds'), # cache data = data_df, chains = 4, cores = 4, warmup = 2000, iter = 3000, control = list(max_treedepth = 12), seed = 1259, pars = c('customer'), include = FALSE ) noncentred_fit %>% check_hmc_diagnostics() Divergences: 0 of 4000 iterations ended with a divergence. Tree depth: 0 of 4000 iterations saturated the maximum tree depth of 12. Energy: EBFMI indicated no pathological behavior.Again, the HMC diagnostics indicate no problems. Let’s check the pairwise densities anyway.
Pairwise posterior densities of the noncentred modelThe correlation between scales[1] and energy__ is smaller with the noncentred parameterisation. This is reflected in the higher effective sample size for scales[1] below. Unfortunately, the effective sample size for the purchase rate hyperpriors has gone down.
neff < noncentred_fit %>% neff_ratio() %>% tibble( ratio = ., parameter = names(.) ) %>% filter(ratio < 0.5) %>% arrange(ratio) %>% head(20) Parameters with the lowest effective sample size.ratio parameter 0.0631821 log_centres[2] 0.0681729 scales[2] 0.1579867 lp__ 0.1776186 scales[1] 0.2217369 log_centres[1] 0.3910369 theta[830,1] 0.4767405 theta[639,1] 0.4792693 theta[250,2] 0.4847856 theta[41,1] 0.4978518 theta[231,1]
Again, the rhat values seem fine.
noncentred_rhat < noncentred_fit %>% rhat() %>% tibble( rhat = ., parameter = names(.) ) %>% summarise( min_rhat = min(rhat, na.rm = TRUE), max_rhat = max(rhat, na.rm = TRUE) ) Most extreme rhat valuesmin_rhat max_rhat 0.9990501 1.013372
Let’s check how many of the 90% posterior intervals contain the true value.
noncentred_cis < noncentred_draws %>% group_by(parameter) %>% summarise( lo = quantile(value, 0.05), point = quantile(value, 0.50), hi = quantile(value, 0.95) ) %>% filter(!str_detect(parameter, '__'))The hyperparameter estimates are much the same as with the centred parameterisation.
noncentred_calibration_hyper < hyper %>% filter(chain == 1) %>% gather(parameter, value, chain) %>% inner_join(noncentred_cis, by = 'parameter') %>% mutate(hit = lo <= value & value <= hi) The true hyperparameters and their 90% posterior intervals.parameter value lo point hi hit log_centres[1] 3.643546 3.6325159 3.735363 3.838160 TRUE log_centres[2] 2.057017 2.1770703 2.092537 2.014462 TRUE scales[1] 1.086614 0.9970698 1.109297 1.235166 TRUE scales[2] 1.087482 1.0963503 1.159626 1.234190 FALSE
About 91% of customerlevel posterior intervals contain the true value.
noncentred_calibration < noncentred_cis %>% inner_join(true_values, by = 'parameter') %>% summarise(mean(lo <= value & value <= hi)) %>% pull() %>% percent() noncentred_calibration [1] "91.0%" DiscussionBoth centred and noncentred models performed reasonably well on the dataset considered. The noncentred model showed slightly less correlation between scales and energy__, suggesting it might be the better one to tackle the low EBFMI problems. Since we only checked the fit on one priorpredictive draw, it would be a good idea to check out the fit to more draws. Some casual attempts of mine (not shown here) suggest there are situations that cause severe EBFMI problems. Identifying these situations would be an interesting next step. It would also be great to see how it performs on some of the benchmarked datasets mentioned in the BTYDPlus package.
/** * RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS. * LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configurationvariables */ /* var disqus_config = function () { this.page.url = PAGE_URL; // Replace PAGE_URL with your page's canonical URL variable this.page.identifier = PAGE_IDENTIFIER; // Replace PAGE_IDENTIFIER with your page's unique identifier variable }; */ (function() { // DON'T EDIT BELOW THIS LINE var d = document, s = d.createElement('script');
s.src = '//stappitgithubio.disqus.com/embed.js';
s.setAttribute('datatimestamp', +new Date()); (d.head  d.body).appendChild(s); })();
Please enable JavaScript to view the comments powered by Disqus.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Brian Callander. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Benchmark of popular graph/network packages
(This article was first published on R on Quasilinear Musings, and kindly contributed to Rbloggers)
In this post I benchmark the performance of 5 popular graph/network packages. This was inspired by two questions I had:

Recently, I have been working with large networks (millions of vertices and edges) and often wonder what is the best currently available package/tool that would scale well and handle large scale network analysis tasks. Having tried out a few (networkx in Python and igraph in R) but on different problems, I thought it would be nice to have a head to head comparison.

Running large scale computations is also much easier nowadays with the availability of virtual machines that could be easily spinned up thanks to the growth of cloud computing. I think the trend of powerful single machines will eliminate a lot of the need for enterprise clusters so it will be interesting to see how far we can push a single machine using optimised algorithms.1
To replicate the benchmark study and for the full codes, please refer to my github repository. Instructions on how to setup and install the packages are also located in the repository.
SetupThe benchmark was carried out using a Google Compute n1standard16 instance (16vCPU Haswell 2.3GHz, 60 GB memory). I compare 5 different packages:
Networkx is written in Python while the other four packages are based on C / C++ but have Python APIs. Igraph has a R and Mathematica binding as well but to be consistent the following benchmark was based on the Python one. The other 3 libraries (snap, networkit and graphtool) have an additional emphasis on performance with multiprocessing capabilities built in.
Selecting what tasks to compare on is not really a trivial task with each package offering various tools and capabilities. In the end, I decided to focus on 5 specific problems:
 loading the data
 single source shortest path
 page rank
 kcore decomposition
 strongly connected components
Loading is more of an I/O task while the other 4 are common graph algorithms.2
3 datasets from the Stanford Large Network Dataset Collection were used in the exercise:
While it is the easiest to rank the packages based on the runtime of the algorithms, it is only one of the many considerations of what makes a good package. I try to offer a more subjective view based on my experience with these packages.
Related BenchmarksHere’s a list of other comparative benchmarks for the interested viewer to check out:
 Graphtool performance comparison, compares graphtool with igraph and networkx
 SNAP research paper, compares snap with igraph and networkx
 Networkit technical paper, compares networkit with igraph and graphtool
Most of them were written in 2015/2016 and it will be interesting to see if anything has changed since then.
ResultsAll timings reported are normalised to reflect the run time for a single run of the task.
Networkx is much slower than any of the other libraries. Across all computation tasks and for all datasets it is around 10 times slower than the slowest library.3 For example, it took 67s to run the single source shortest path problem on the Pokec dataset compared to 6.8s for networkit (the next slowest). Page rank took more than 10 minutes to run compared to 1 minute for igraph. Hence, I left it out of the comparison plots.
Here are the run times of the remaining four packages:
Full results can be seen from the table below:
dataset Algorithm graphtool igraph networkit networkx snap Amazon connected components 0.09 0.48 0.21 5.94 0.40 Amazon kcore 0.11 0.33 0.01 8.62 0.42 Amazon loading 5.00 0.79 3.27 9.96 1.90 Amazon page rank 0.05 1.59 0.01 25.71 0.90 Amazon shortest path 0.06 0.12 0.32 3.31 0.14 Google connected components 0.32 2.23 0.65 21.71 2.02 Google kcore 0.57 1.68 0.06 153.21 1.57 Google loading 67.27 5.51 17.94 39.69 9.03 Google page rank 0.76 5.24 0.12 106.49 4.16 Google shortest path 0.20 0.69 0.98 12.33 0.30 Pokec connected components 1.35 17.75 4.69 108.07 15.28 Pokec kcore 5.73 10.87 0.34 649.81 8.87 Pokec loading 119.57 34.53 157.61 237.72 59.75 Pokec page rank 1.74 59.55 0.20 611.24 19.52 Pokec shortest path 0.86 0.87 6.87 67.15 3.09 I/OLooking at the plots above, graphtool and networkit loads data much more slowly than the other two libraries. I was reading the datasets as a tab delimited file and graphtool basically uses a Python code to parse the input. The other 3 packages should be using C libraries to read the files which result in better performance.4
AlgorithmsNetworkit and graphtool takes the top spot in most of the tests with graphtool having the shortest run time for the single source shortest path and connected components problems and networkit winning the race for kcore and page rank.
When networkit is fast, it is extremely fast. On the pokec dataset it takes just 0.2s to run the page rank algorithm (graphtool: 1.7s, igraph: 59.6s, snap: 19.5s). For the kcore decomposition it is also 10 times faster than all other competitors or 2000 times networkx. That is consistent with the findings of their research paper where they claim that using some of the latest state of the art algorithms led to their processing speed being faster by an order of magnitude. However, for the shortest path problem (not analysed in their paper) it lags behind all other packages.5
graphtool is the most steady performer and achieves very impressive performance across all four tasks. With openMP support it betters igraph and snap across all tasks. It is 3 to 10+ times faster than those two packages.
igraph and snap achieves mostly similar performance across all tasks with a slight edge towards snap. This is also consistent with snap’s research findings.
Other ConsiderationsThere are also other important considerations when making a switch from networkx or igraph to one graphtool or networkit.
Packages
First, the algorithms available differ quite significantly across the packages. Users interested in switching to one of these packages should read the documentation on the list of features available. For example, while they all contain the basic tools needed to manipulate networks, graphtool does not have the more usual modular clustering tools but has additional functionalities on statistical inference on graphs using stochastic block models.
Visualising networks is also an important part of the analytical tool chain. Igraph implements quite a few layout algorithms and renders them using the cairo library. Snap supports graphviz while graphtool supports both graphviz and cairo. Networkit takes a different approach and relies on networkx to draw while also providing support and integration with Gephi via its streaming plugin.
API
Moving away from native Python or R means that the syntax for the packages can sometimes be quite convoluted. I compare the syntax for the shortest path problem below. Writing it in networkx would look something like this:
igraph:
g.shortest_paths([g.vs[node_index]])graphtool:
shortest_distance(g, g.vertex(node_index))networkit:
distance.BFS(g, node_index).run()snap:
NIdToDistH = snap.TIntH() snap.GetShortPath(g, node_index, NIdToDistH, True)Of all, I find snap’s the most cumbersome since one has to define an additional variable (with the correct variable type) to store the results before running it. Running more advanced functions on graphtool and networkit also requires a user to predefine variables with the correct type to store results.6
Support and Documentation
User support and documentation is really important when one wants to use the project in an actual project setting. Networkx is by far the winner in this category with more than 4k github stars and lots of issues documented in github and stackoverflow. Igraph fairs decently as well with more than a thousand stars across its different modules.
graphtool and networkit has much smaller followings though the creators seem relatively responsive to user issues and the packages are in active development.
snap was last updated on July 2018 but still supports only Python 2.7.x versions.
ConclusionOverall, I am pleasantly surprised at the performance of the libraries especially graphtool and networkit and plan to play around with them further. The fact that they breeze through the Pokec dataset is a good sign, but it will be interesting to find out what is the limit before computation becomes slow or memory issues start appearing.
As for recommendations on which package people should learn, I think picking up networkx is still important as it makes network science very accessible with a wide range of tools and capabilities. If analysis starts being too slow (and maybe that’s why you are here) then I will suggest taking a look at graphtool or networkit to see if they contain the necessary algorithms for your needs.

A 240 GB machine with 64 vCPUs currently cost about $1.5k USD / month on Google Cloud Compute. While relatively expensive, one can spin up the machine just for very heavy computation tasks and shut it down once the task is complete. Such intensive computation once reserved for enterprises and research institutions can now be replicated by almost anyone.

Disclaimer: I try as much as possible to specify the same parameters for each algorithm but differences in API across the packages could translate to actual differences in how the algorithm is run and the final output

After trying it out for 1 test run, I run the profiling tests only 10 times fewer when using the networkx library.

Graphtool should read from other file types such as graphml or dot much faster though I did not actually try it out.

I used the BFS search to proxy for the single source shortest path problem as there does not seem to be a direct API call for it.

I guess this is required of statically typed languages like C and C++.
To leave a comment for the author, please follow the link and comment on their blog: R on Quasilinear Musings. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
rTRNG on CRAN now!
(This article was first published on Mirai Solutions, and kindly contributed to Rbloggers)
After dwelling happily on our GitHub repositories for an extended period of time, rTRNG has now finally made it to CRAN.
We are very happy to get this out just ahead of the R in Finance in Chicago, where the functionality had been showcased with an applied example two years ago.
The main changes since then are related to dependency upgrades and other maintenance, so if you watched the package presentation at useR!2017 and have tried it out before, it will likely feel very familiar.
In case you’ve never heard about it, no worries, we will be back with a second post very soon, which will discuss functionality and technical details.
In the meantime, below is a short story about the integration of Tina’s Random Number Generator1 with R :).
What started as a series of extensions to a largescale Credit Risk simulation application on a client project several years ago, evolved into more investigative and exploratory work.
This involved the methodological modeling side (with topics such as importance sampling and stochastic recovery rates), as well as the software engineering, or more specifically highperformance computing side, with a focus on parallelization and efficient random number generation.
In our search for libraries that could assist in this quest, we came across TRNG, which made the impression of a highquality, yet overseeable project. We gave it a try and thanks to the smoothness of Rcpp and RcppParallel, things worked out in a breeze, and we had developed a nice custom solution bringing high value to our client.
Our experience was so positive, that we felt it was worth sharing this with the wider community, and that’s where we decided to make a new R package on GitHub.
Now most longtime R users will claim that a package that isn’t on CRAN is only half a package, living in the shadows, never fully accredited.
Being aware of this, we set the aim for CRAN, though sidetracked with lots of other exciting and demanding projects, things ended up taking a bit longer… but now we are happy to say: the wait is over!
We sincerely hope that some of you will fall in love with this package similarly as we did.

Who’s Tina? Tina is no acronym and is the name of a Linux cluster at the Institute of Theoretical Physics at the University Magdeburg in Germany, where TRNG was developed.
To leave a comment for the author, please follow the link and comment on their blog: Mirai Solutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Deep (learning) like Jacques Cousteau – Part 2 – Scalars
(This article was first published on Embracing the Random  R, and kindly contributed to Rbloggers)
(TL;DR: Scalars are single numbers.)
I have so many scalars!
Opie, the open source snake
I’m sorry…that was lame…
Me
Last
time,
we covered some basic concepts regarding sets on our journey to
understanding vectors and matrices.
Let’s do this!
Today’s topic: Scalars What’s a scalar?A scalar is a single number! This seems very simple (and it is). But we
need to know this to understand operations like scalar multiplication
or statements like “The result of multiplying this by that is a
scalar”.
We will use the notation from Goodfellow, Ian, et
al. and depict them in lower case,
italicised letters like this:
n
How can we define the types of numbers our scalars should represent?Say, we define our arbitrary scalar, n, as a number from the set of
natural
numbers.
We would show this set membership like this:
n \in \mathbb{N}
The ‘\in’ symbol means
‘is a member/is an element of/belongs to (some set)’ Pick the one you like most!
However, the whole statement is often read as ’n is a natural
number’.
The symbol ‘\not\in’ means
‘is not a member/is not an element of/does not belongs to (some set)’. Easy!
What are the implications of defining our scalars as natural numbers?
Let’s start with an abstract example!
 Let’s say we start with the number 2, and we want to add
some arbitrary number, n, to it.  Let’s define n as a natural number. That is,
n belongs to the set of ‘whole’, positive numbers starting with 1 and increasing with no upper bound.
Here are some of the implications of our definition of
n \in \mathbb{N}:
 2 + n cannot equal 2 because 0 \not\in \mathbb{N}, and
therefore, n cannot take on the value of 0.  We can never get an answer where the first decimal place is
something other than zero. For example, there is no natural
number, n, where 2 + n = 2.5.
Now here is my (crappy) attempt at intuitively bringing this back to machine
learning!
 Let’s say that our scalar, n, is the value used to update the
parameters in our model after some iteration of training.  Then we are restricted to making crude updates of at least one
only!  Our algorithm may never converge and we might see the values of our
evaluation metric jumping about erratically as training progresses.
This might not be an academically rigorous explanation, but it’s
hopefully good enough to build some intuition.
We’ll make our universe of numbers into something larger where our scalars can take on more than just whole, positive values. We will
define our arbitrary scalars, x, as coming from the set of real
numbers. That is:
x \in \mathbb{R}
How can we represent scalars in R?R technically has no scalar data type! From Hadley Wickham’s ‘Advanced
R’, 1st edition, we can find this
in the ‘Data Structures’
chapter:
Note that R has no 0dimensional, or scalar types. Individual numbers
or strings, which you might think would be scalars, are actually
vectors of length one.
But in practice, we can emulate our real number scalar by doing something like this:
x < 123.532 print(x) ## [1] 123.532In the same section, we also find out that to test whether something is
a vector in R, one must use is.atomic():
Yes, we have ourselves a vector! How many elements do we have?
print(length(x)) ## [1] 1Hooray! We have ourselves a proxy for our scalar. Now what is the data
type of our scalar?
From the numeric help page in the R documentation, we find this:
numeric is identical to double (and real). It creates a
doubleprecision vector of the specified length with each element
equal to 0.
Then from the double help page, we find this:
All real numbers are stored in double precision format.
Let’s test it out!
class(x) ## [1] "numeric" typeof(x) ## [1] "double"Hooray!
ConclusionWe now know that scalars are members of sets. We have defined our scalars
as coming from the set of real numbers.
On to vectors!
Justin
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Embracing the Random  R. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Free online r course
(This article was first published on Mad (Data) Scientist, and kindly contributed to Rbloggers)
Recently a young relative mentioned that the campus R course she hoped to attend was full. What online alternatives did she have? So, I decided to start one of my own! https://github.com/matloff/fasteR Designed for complete beginners.
I now have six lessons up on the site. I hope to add one new lesson per week.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Why Use Weight of Evidence?
(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to Rbloggers)
I had been asked why I spent so much effort on developing SAS macros and R functions to do monotonic binning for the WoE transformation, given the availability of other cuttingedge data mining algorithms that will automatically generate the prediction with whatever predictors fed in the model. Nonetheless, what really distinguishes a good modeler from the rest is how to handle challenging data issues before feeding data in the model, including missing values, outliers, linearity, and predictability, in a scalable way that can be rolled out to hundreds or even thousands of potential model drivers in the production environment.
The WoE transformation through monotonic binning provides a convenient way to address each of aforementioned concerns.
1. Because WoE is a piecewise transformation based on the data discretization, all missing values would fall into a standalone category either by itself or to be combined with the neighbor that shares a similar event probability. As a result, the special treatment for missing values is not necessary.
2. After the monotonic binning of each variable, since the WoE value for each bin is a projection from the predictor into the response that is defined by the log ratio between event and nonevent distributions, any raw value of the predictor doesn’t matter anymore and therefore the issue related to outliers would disappear.
3. While many modelers would like to use log or power transformations to achieve a good linear relationship between the predictor and log odds of the response, which is heuristic at best with no guarantee for the good outcome, the WoE transformation is strictly linear with respect to log odds of the response with the unity correlation. It is also worth mentioning that a numeric variable and its strictly monotone functions should converge to the same monotonic WoE transformation.
4. At last, because the WoE is defined as the log ratio between event and nonevent distributions, it is indicative of the separation between cases with Y = 0 and cases with Y = 1. As the weighted sum of WoE values with the weight being the difference in event and nonevent distributions, the IV (Information Value) is an important statistic commonly used to measure the predictor importance.
Below is a simple example showing how to use WoE transformations in the estimation of a logistic regression.
.gist table { marginbottom: 0; } var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Fast food, causality and R packages, part 2
(This article was first published on Econometrics and Free Software, and kindly contributed to Rbloggers)
I am currently working on a package for the R programming language; its initial goal was to simply
distribute the data used in the Card and Krueger 1994 paper that you can read
here (PDF warning). However, I decided that I
would add code to perform diffindiff.
In my previous blog post I showed
how to set up the structure of your new package. In this blog post, I will only focus on getting
Card and Krueger’s data and prepare it for distribution. The next blog posts will focus on writing
a function to perform differenceindifferences.
If you want to distribute data through a package, you first need to use the usethis::use_data_raw()
function (as shown in part 1).
This creates a dataraw folder, and inside you will find the DATASET.R script. You can edit this
script to prepare the data.
First, let’s download the data from Card’s website, unzip it and load the data into R. All these
operations will be performed from R:
To download and unzip a file from R, first, you need to define where you want to save the file. Because
I am not interested in keeping the downloaded file, I use the tempfile() function to get a temporary
file in my /tmp/ folder (which is the folder that contains temporary files and folders in a GNU+Linux
system). Then, using download.file() I download the file, and save it in my temporary file. I then
create a temporary directory using tempdir() (the idea is the same as with tempfile()), and use
this folder to save the files that I will unzip, using the unzip() function. This folder now contains
several files:
check.sas is the SAS script Card and Krueger used. It’s interesting, because it is quite simple,
quite short (170 lines long) and yet the impact of Card and Krueger’s research was and has been
very important for the field of econometrics. This script will help me define my own functions.
codebook, you guessed it, contains the variables’ descriptions. I will use this to name the columns
of the data and to write the dataset’s documentation.
public.csv is the data. It does not contain any column names:
46 1 0 0 0 0 0 1 0 0 0 30.00 15.00 3.00 . 19.0 . 1 . 2 6.50 16.50 1.03 1.03 0.52 3 3 1 1 111792 1 3.50 35.00 3.00 4.30 26.0 0.08 1 2 6.50 16.50 1.03 . 0.94 4 4 49 2 0 0 0 0 0 1 0 0 0 6.50 6.50 4.00 . 26.0 . 0 . 2 10.00 13.00 1.01 0.90 2.35 4 3 1 1 111292 . 0.00 15.00 4.00 4.45 13.0 0.05 0 2 10.00 13.00 1.01 0.89 2.35 4 4 506 2 1 0 0 0 0 1 0 0 0 3.00 7.00 2.00 . 13.0 0.37 0 30.0 2 11.00 10.00 0.95 0.74 2.33 3 3 1 1 111292 . 3.00 7.00 4.00 5.00 19.0 0.25 . 1 11.00 11.00 0.95 0.74 2.33 4 3 56 4 1 0 0 0 0 1 0 0 0 20.00 20.00 4.00 5.00 26.0 0.10 1 0.0 2 10.00 12.00 0.87 0.82 1.79 2 2 1 1 111492 . 0.00 36.00 2.00 5.25 26.0 0.15 0 2 10.00 12.00 0.92 0.79 0.87 2 2 61 4 1 0 0 0 0 1 0 0 0 6.00 26.00 5.00 5.50 52.0 0.15 1 0.0 3 10.00 12.00 0.87 0.77 1.65 2 2 1 1 111492 . 28.00 3.00 6.00 4.75 13.0 0.15 0 2 10.00 12.00 1.01 0.84 0.95 2 2 62 4 1 0 0 0 0 1 0 0 2 0.00 31.00 5.00 5.00 26.0 0.07 0 45.0 2 10.00 12.00 0.87 0.77 0.95 2 2 1 1 111492 . . . . . 26.0 . 0 2 10.00 12.00 . 0.84 1.79 3 3Missing data is defined by . and the delimiter is the space character. read.me is a README file.
Finally, survey1.nj and survey2.nj are the surveys that were administered to the fast food
restaurants’ managers; one in February (before the raise) and the second one in November
(after the minimum wage raise).
The next lines import the codebook:
codebook < read_lines(file = paste0(tempdir_path, "/codebook")) variable_names < codebook %>% `[`(8:59) %>% `[`(c(5, 6, 13, 14, 32, 33)) %>% str_sub(1, 13) %>% str_squish() %>% str_to_lower()Once I import the codebook, I select lines 8 to 59 using the `[`() function.
If you’re not familiar with this notation, try the following in a console:
and compare:
seq(1, 100) %>% `[`(., 1:10) ## [1] 1 2 3 4 5 6 7 8 9 10both are equivalent, as you can see. You can also try the following:
1 + 10 ## [1] 11 1 %>% `+`(., 10) ## [1] 11Using the same trick, I remove lines that I do not need, and then using stringr::str_sub(1, 13)
I only keep the first 13 characters (which are the variable names, plus some white space characters)
and then, to remove all the unneeded white space characters I use stringr::squish(), and then
change the column names to lowercase.
I then load the data, and add the column names that I extracted before:
dataset < read_table2(paste0(tempdir_path, "/public.dat"), col_names = FALSE) dataset < dataset %>% select(X47) %>% `colnames<`(., variable_names) %>% mutate_all(as.numeric) %>% mutate(sheet = as.character(sheet))I use the same trick as before. I rename the 47th column, which is empty,
I name the columns with `colnames<`().
After this, I perform some data cleaning. It’s mostly renaming categories of categorical variables,
and creating a “true” panel format. Several variables were measured at several points in time. Variables
that were measured a second time have a “2” at the end of their name. I remove these variables,
and add an observation data variable. So my data as twice as many rows as the original data, but
that format makes it way easier to work with. Below you can read the full code:
Click if you want to see the code
dataset < dataset %>% mutate(chain = case_when(chain == 1 ~ "bk", chain == 2 ~ "kfc", chain == 3 ~ "roys", chain == 4 ~ "wendys")) %>% mutate(state = case_when(state == 1 ~ "New Jersey", state == 0 ~ "Pennsylvania")) %>% mutate(region = case_when(southj == 1 ~ "southj", centralj == 1 ~ "centralj", northj == 1 ~ "northj", shore == 1 ~ "shorej", pa1 == 1 ~ "pa1", pa2 == 1 ~ "pa2")) %>% mutate(meals = case_when(meals == 0 ~ "None", meals == 1 ~ "Free meals", meals == 2 ~ "Reduced price meals", meals == 3 ~ "Both free and reduced price meals")) %>% mutate(meals2 = case_when(meals2 == 0 ~ "None", meals2 == 1 ~ "Free meals", meals2 == 2 ~ "Reduced price meals", meals2 == 3 ~ "Both free and reduced price meals")) %>% mutate(status2 = case_when(status2 == 0 ~ "Refused 2nd interview", status2 == 1 ~ "Answered 2nd interview", status2 == 2 ~ "Closed for renovations", status2 == 3 ~ "Closed permanently", status2 == 4 ~ "Closed for highway construction", status2 == 5 ~ "Closed due to Mall fire")) %>% mutate(co_owned = if_else(co_owned == 1, "Yes", "No")) %>% mutate(bonus = if_else(bonus == 1, "Yes", "No")) %>% mutate(special2 = if_else(special2 == 1, "Yes", "No")) %>% mutate(type2 = if_else(type2 == 1, "Phone", "Personal")) %>% select(sheet, chain, co_owned, state, region, everything()) %>% select(southj, centralj, northj, shore, pa1, pa2) %>% mutate(date2 = lubridate::mdy(date2)) %>% rename(open2 = open2r) %>% rename(firstinc2 = firstin2) dataset1 < dataset %>% select(ends_with("2"), sheet, chain, co_owned, state, region, bonus) %>% mutate(type = NA_character_, status = NA_character_, date = NA) dataset2 < dataset %>% select(ends_with("2")) %>% #mutate(bonus = NA_character_) %>% rename_all(~str_remove(., "2")) other_cols < dataset %>% select(sheet, chain, co_owned, state, region, bonus) other_cols_1 < other_cols %>% mutate(observation = "February 1992") other_cols_2 < other_cols %>% mutate(observation = "November 1992") dataset1 < bind_cols(other_cols_1, dataset1) dataset2 < bind_cols(other_cols_2, dataset2) njmin < bind_rows(dataset1, dataset2) %>% select(sheet, chain, state, region, observation, everything())The line I would like to comment is the following:
dataset %>% select(ends_with("2"), sheet, chain, co_owned, state, region, bonus)This select removes every column that ends with the character “2” (among others). I split the data
in two, to then bind the rows together and thus create my long dataset. I then save the data
into the data/ folder:
This saves the data as an .rda file. To enable users to read the data by typing data("njmin"),
you need to create a data.R script in the R/ folder. You can read my data.R script below:
Click if you want to see the code
#' Data from the Card and Krueger 1994 paper *Minimum Wages and Employment: A Case Study of the FastFood Industry in New Jersey and Pennsylvania* #' #' This dataset was downloaded and distributed with the permission of David Card. The original #' data contains 410 observations and 46 variables. The data distributed in this package is #' exactly the same, but was changed from a wide to a long dataset, which is better suited for #' manipulation with *tidyverse* functions. #' #' @format A data frame with 820 rows and 28 variables: #' \describe{ #' \item{\code{sheet}}{Sheet number (unique store id).} #' \item{\code{chain}}{The fastfood chain: bk is Burger King, kfc is Kentucky Fried Chicken, wendys is Wendy's, roys is Roy Rogers.} #' \item{\code{state}}{State where the restaurant is located.} #' \item{\code{region}}{pa1 is northeast suburbs of Phila, pa2 is Easton etc, centralj is central NJ, northj is northern NJ, southj is south NJ.} #' \item{\code{observation}}{Date of first (February 1992) and second (November 1992) observation.} #' \item{\code{co_owned}}{"Yes" if company owned.} #' \item{\code{ncalls}}{Number of callbacks. Is 0 if contacted on first call.} #' \item{\code{empft}}{Number fulltime employees.} #' \item{\code{emppt}}{Number parttime employees.} #' \item{\code{nmgrs}}{Number of managers/assistant managers.} #' \item{\code{wage_st}}{Starting wage ($/hr).} #' \item{\code{inctime}}{Months to usual first raise.} #' \item{\code{firstinc}}{Usual amount of first raise (\$/hr).} #' \item{\code{bonus}}{"Yes" if cash bounty for new workers.} #' \item{\code{pctaff}}{\% of employees affected by new minimum.} #' \item{\code{meals}}{Free/reduced priced code.} #' \item{\code{open}}{Hour of opening.} #' \item{\code{hrsopen}}{Number of hours open per day.} #' \item{\code{psode}}{Price of medium soda, including tax.} #' \item{\code{pfry}}{Price of small fries, including tax.} #' \item{\code{pentree}}{Price of entree, including tax.} #' \item{\code{nregs}}{Number of cash registers in store.} #' \item{\code{nregs11}}{Number of registers open at 11:00 pm.} #' \item{\code{type}}{Type of 2nd interview.} #' \item{\code{status}}{Status of 2nd interview.} #' \item{\code{date}}{Date of 2nd interview.} #' \item{\code{nregs11}}{"Yes" if special program for new workers.} #' } #' @source \url{http://davidcard.berkeley.edu/data_sets.html} "njmin"I have documented the data, and using roxygen2::royxgenise() to create the dataset’s documentation.
The data can now be used to create some nifty plots:
ggplot(njmin, aes(wage_st)) + geom_density(aes(fill = state), alpha = 0.3) + facet_wrap(vars(observation)) + theme_blog() + theme(legend.title = element_blank(), plot.caption = element_text(colour = "white")) + labs(title = "Distribution of starting wage rates in fast food restaurants", caption = "On April 1st, 1992, New Jersey's minimum wage rose from $4.25 to $5.05. Source: Card and Krueger (1994)") ## Warning: Removed 41 rows containing nonfinite values (stat_density).In the next blog post, I am going to write a first function to perform diff and diff, and we will
learn how to make it available to users, document and test it!
Hope you enjoyed! If you found this blog post useful, you might want to follow
me on twitter for blog post updates and
buy me an espresso or paypal.me.
To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
One Step to Quickly Improve the Readability and Visual Appeal of ggplot Graphs
(This article was first published on Learn R Programming & Build a Data Science Career  Michael Toth, and kindly contributed to Rbloggers)
There’s something wonderful about a graph that communicates a point clearly. You know it when you see it. It’s the kind of graph that makes you pause and say ‘wow!’.
There are all kinds of different graphs that fit this description, but they usually have a few things in common:
 Clarity: The message of the graph is clear
 Simplicity: Extraneous details are removed
 Visual appeal: The graph should be pleasing to look at
Of course, your graph also needs to be communicating something worthwhile. But I see so many graphs that ultimately fall short of their potential because they don’t meet these three points above!
I’ve been there myself. Some of my earliest graphs in R fall short, in retrospect. But the key to improving is to keep learning new things and keep getting better over time.
It seems like many people learn how to create basic bar charts, scatter plots, and line graphs in R, and then stop developing their skills further. But you shouldn’t stop there!
I don’t think most people are doing this intentionally. In fact, I think most people simply don’t know what’s possible and what they should be aiming for.
If the only graphs you’ve ever seen are basic examples from statistics textbooks or code documentation, how would you possibly know that you can do better? How would you know that you can create graphs that capture attention, drive action, or inspire awe? You wouldn’t.
I want to teach you how to make graphs that get your point across with clarity, simplicity, and visual appeal. There are quick fixes you can make to your graphs right now that will get you much closer to making that a reality.
Today, I’m going to show you how you can use axis text rotation to greatly improve both the readability and visual appeal of your graphs.
Are you ready? Let’s go!
Creating a Base Graph to Work FromTo start, let’s load in the libraries we’ll be using throughout this post: tidyverse (for graphing and data manipulation), and hrbrthemes (a useful package that I use to improve the styling of my graphs).
library(hrbrthemes) library(tidyverse)For this post, we’ll be using the mtcars dataset to illustrate these graphing techniques. Here, I group the cars in that dataset by the number of cylinders in their engines (4, 6, or 8), and then calculate the average horsepower for each group.
# Calculate average horsepower for cars with 4, 6, and 8cylinder engines hp_by_cyl < mtcars %>% group_by(cyl) %>% summarise(avg_hp = mean(hp))Finally, I create a simple bar chart in ggplot to show this data. Let’s review this code briefly, so we’re all on the same page:
 The first two lines (ggplot and geom_bar) are what creates the base bar chart.
 The next 5 lines use the labs function to assign labels to the graph.
 The last line uses theme_ipsum from the hrbrthemes package to apply some nice styling to the graph.
This graph is fine, for the most part. But I don’t aim for fine, and you shouldn’t either!
I have a high attention to detail for graphing, and I want my graphs to be excellent. The easier I can make it for people to read and understand a graph, the better job I’ll do of convincing them to move forward with a particular course of action.
Rotate Your YAxis Title To Improve ReadabilityThere are several things we could do to improve this graph, but in this guide let’s focus on rotating the yaxis label. This simple change will make your graph so much better. That way, people won’t have to tilt their heads like me to understand what’s going on in your graph:
That’s not a look you want. Luckily, it’s super easy to rotate your axis title in ggplot! To do this, we’ll modify some parameters using ggplot‘s theme function, which can also be used to adjust all kinds of things in your graph like axis labels, gridlines, and text sizing.
Here, we specifically want to adjust the yaxis, which we can do using the axis.title.y parameter. To adjust a text element, we use element_text. You can use element_text this to adjust things like font, color, and size. But here, we’re interested in rotation, so we’re going to use angle. Setting the angle to 0 will make the yaxis text horizontal. Take a look:
# Modifying the graph from before (stored as g), to make the text horizontal g + theme(axis.title.y = element_text(angle = 0))It’s a minor change, but this graph is far more readable and more visually appealing than the graph we had before. That’s what we’re going for. Simplicity, clarity, and visual appeal.
More on Text Rotation in ggplotAs we just saw, when you need to rotate text in ggplot, you can accomplish this by adjusting the angle within element_text. Here we did this to adjust the axis title, but this works the same way for any text that you want to rotate in ggplot.
The angle parameter can take any value between 0 and 360, corresponding to the angle of rotation from a horizontal baseline. To briefly illustrate how different angle values work in ggplot, take a look at the following graph, where I explore four different angle rotations:
library(gridExtra) # 0Degree angle g1 < g + theme(axis.title.y = element_text(angle = 0)) + labs(title = 'YAxis Title at 0 degrees', subtitle = 'Using theme(axis.title.y = element_text(angle = 0))', caption = '') # 90Degree angle g2 < g + theme(axis.title.y = element_text(angle = 90)) + labs(title = 'YAxis Title at 90 degrees', subtitle = 'Using theme(axis.title.y = element_text(angle = 90))', caption = '') # 180Degree angle g3 < g + theme(axis.title.y = element_text(angle = 180)) + labs(title = 'YAxis Title at 180 degrees', subtitle = 'Using theme(axis.title.y = element_text(angle = 180))', caption = '') # 270Degree angle g4 < g + theme(axis.title.y = element_text(angle = 270)) + labs(title = 'YAxis Title at 270 degrees', subtitle = 'theme(axis.title.y = element_text(angle = 270))') # Add all graphs to a grid grid.arrange(g1, g2, g3, g4, nrow = 2, ncol = 2)You should now have a better understanding of how you can use axis title rotation to improve the readability and visual appeal of your graphs in ggplot!
You really need to think about minor details like this, especially when you’re going to be using a graph in a presentation or a report to a broader audience. Small details can really improve your graph, which in turn will make it easier for you to educate your audience, convince people of your conclusions, and drive change in your organization.
I will help you learn the specific skills you need to work more effectively, grow your income, and improve your career.
Sign up here to receive my best tips
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Learn R Programming & Build a Data Science Career  Michael Toth. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...