R news and tutorials contributed by hundreds of R bloggers

### Dataviz Workshop at RStudio::conf

Tue, 02/18/2020 - 18:29

Workshop materials are available here: https://rstd.io/conf20-dataviz
Consider buying the book; it’s good: Data Visualization: A Practical Introduction (Buy on Amazon)

I was delighted to have the opportunity to teach a two-day workshop on Data Visualization using ggplot2 at this year’s rstudio::conf(2020) in January. It was my first time attending the conference and it was a terrific experience. I particularly appreciated the friendly and constructive atmosphere that RStudio so clearly goes out of its way to encourage and sustain.

The workshop focused on learning how to think about good data visualization in principle, and how to do it in practice. After many years of trying and often failing to learn how to make good visualizations myself, I became convinced of two things. First, there is a real need for an approach that effectively combines the why of visualization with the how. A lot of otherwise excellent introductions to data visualization will teach you why some visualizations work better than others, and will present a series of mouth-watering examples of fabulous graphics. Then you sit down in front of an empty .Rmarkdown file and … now what? How do I do that?

Meanwhile, many very good, detailed introductions to writing ggplot2 code may be a little out of reach for beginners or—perhaps more often—will tend to get picked up by users in a recipe-like or piecemeal way. People cast around to find out how to more or less solve a particular problem they are having. But they leave without really having a good grasp on why the code they are using looks the way it does. The result is that even people who are pretty used to working in R and who regularly make graphs from data end up with a hazy idea of what they’re doing when they use ggplot.

The second thing I became convinced of as I developed this material was that data visualization is a fantastic way to introduce people to the world of data analysis with R generally. When visualizing data with R and ggplot, it’s possible to produce satisfying results almost right away. That makes it easier to introduce other tidyverse principles and tools in an organic fashion.

For both of those reasons, I ended up writing a book that approached things in just the way I wanted: a practical introduction to data visualization using ggplot2 that kept both the ideas and the code in view, and tried to do so in an engaging and approachable way. It was this text that formed the core of the workshop.

While teaching over the two days, I was assisted by four TAs:

When I saw the roster, my first reaction was that mine was the only name I didn’t recognize. Having Thomas as a TA, in particular, did rather threaten to cross the line from the merely embarrassing to the faintly absurd. It was a real treat to meet and work with everyone for the first time.

The materials from the workshop are available at the GitHub repository for the course. The repo contains all the code we went through, as well as PDFs of all of the slides. The code and the slides also include additional examples and other extensions that we did not have time to cover over the two days, or that I only mentioned in passing.

One of the benefits of teaching a short course like this is that I get a (sometimes sharp!) reminder of what works best and what needs tweaking across the various topics covered. Revisiting the code, in particular, is always necessary just because the best way to do something will change over time. For example, a few of the small tricks and workarounds that I show for dealing with boxplots will shortly become unnecessary, thanks to the work of Thomas, Dewey, and others on the next version of ggplot. I’m looking forward to incorporating those elements and more into the next version of the workshop.

Data visualization is a powerful way to explore your own data and communicate your results to other people. One of the themes of the book, and the workshop, is that it is in most ways a tool like any other. It won’t magically render you immune to error or make it impossible for you to fool others, or fool yourself. But once you get a feel for how to work with it, it makes your work easier and better in many ways. The great strength of the approach taken by the grammar of graphics in general and ggplot in particular is that it gives people a powerful “flow of action” to follow. It provides a set of concepts—mappings, geoms, scales, facets, layers, and so on—that let you look at other people’s graphics and really see how their component pieces fit together. And it implements those concepts as a series of functions that let you coherently assemble graphics yourself. The goal of the workshop was to bring people to the point where they could comfortably write code that would clearly say what they wanted to see.
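As a concrete illustration of that flow of action, here is a minimal sketch (my own example, not from the workshop) that maps variables, adds a geom, adjusts a scale, and facets, using the mpg data bundled with ggplot2:

```r
library(ggplot2)

# mappings: engine displacement on x, highway mileage on y
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +        # geom: draw one point per observation
  scale_x_log10() +     # scale: log-transform the x axis
  facet_wrap(~ drv)     # facets: one panel per drive train
p
```

Each `+` adds one named component, which is what makes other people’s plots readable piece by piece.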


### Get Better: R for absolute beginners

Tue, 02/18/2020 - 12:00

As part of the series on development of early career researchers in the lab, we spent three sessions over three weeks learning the basics of R.

In my book “The Digital Cell”, I advocate R as the main number-crunching software but the R literacy in my lab is actually quite mixed. In order to know how to pitch the training, I conducted a quick poll in our lab Slack workspace to find out what R skills people in the lab already had. I asked:

My R skills be like…

1. What’s R?
2. I’ve used it but didn’t know what I was doing
3. I’ve taken a class/course
4. I can write an R script
5. I am an R jedi, what can I teach you?

No one answered 4 or 5, and one person answered 1, so we needed to start from the basics.

I used the excellent STAT545 course materials as a framework and took examples from there. The three sessions were 1) the very basics, 2) data frames and ggplot, and 3) loading in real data, doing calculations and making a figure. Each session was approximately one hour in a room together with our laptops, followed by a homework exercise that was marked before the next session. Since some people voted 1 or 2 in the poll, session 1 was required. Even the people who voted 3 said that session 1 was a useful reminder, so it might be good to just start here regardless.

Details of the sessions are below and assume you are proficient enough in R to guide newbies through the exercises.

First session: the very basics

I explained what R and RStudio are. We ironed out people’s installation issues with the software before moving on to packages. As an example, we installed dplyr.

install.packages("dplyr", dependencies = TRUE)

We next covered assignment, objects and functions. Working from the console, we talked about each line:

x <- seq(1, 10)
date() # no arguments
objects()
ls()
rm(list = ls()) # or click broom

The discussion about objects and the broom in RStudio brought us onto what the different windows in the IDE are used for. We set up an R project and discussed the working directory.

getwd()

Still working in the console, we went through the following commands to save a simple file and generate a basic plot

a <- 2
b <- -3
sig_sq <- 0.5
x <- runif(40)
y <- a + b * x + rnorm(40, sd = sqrt(sig_sq))
avg_x <- mean(x)
write(avg_x, "avg_x.txt")
plot(x, y)
abline(a, b, col = "purple")

We ran through lines 5, 8 and 9 a few times to understand the rnorm function.
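As an extra illustration (not part of the session materials), the spread of the rnorm draws is controlled by its sd argument:

```r
# rnorm(n, mean, sd): n draws from a normal distribution (mean defaults to 0)
set.seed(123)                      # fix the seed so the draws are reproducible
noise <- rnorm(40, sd = sqrt(0.5))
mean(noise)                        # close to 0
sd(noise)                          # close to sqrt(0.5), about 0.71
```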

We talked about the History tab and how to rerun commands, either in the console or by moving them to an R script.

We made an R script based on the plot that we’d made and discussed the value of commenting. We discussed reproducibility and what our goal is: to have a dataset and to use a script to reproducibly generate the analysis and output. This was a good place to stop.

Homework 1

This was the first homework I set (scroll down the page for the answers).

# first we will set the same seed, 1234
set.seed(1234)
# make a vector called y of 100 random numbers from a normal distribution
# find the 10th value in the vector y (do this using a command)
# check the help to see what the mean of y should be (default value of the command you used)
# find the actual mean of the vector y
# make a vector called x of 100 integers using every other number (odd numbers). Hint: first = 1, last = 199
# make a scatter plot of y versus x
# add an orange line at y = 0
# make a histogram of y
# This should be line 19. The End

To help people check their work, I sent them the plots. Their task was to fill in the blanks in the above R script and submit it back to me to see if I could run it.
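For readers following along, one possible way to fill in the blanks looks like this (a sketch, not the author’s answer file):

```r
set.seed(1234)
y <- rnorm(100)                 # 100 random numbers from a normal distribution
y[10]                           # the 10th value of y
# ?rnorm shows the default mean is 0
mean(y)                         # the actual mean of y
x <- seq(1, 199, by = 2)        # 100 odd integers from 1 to 199
plot(x, y)                      # scatter plot of y versus x
abline(h = 0, col = "orange")   # orange line at y = 0
hist(y)                         # histogram of y
```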

Second session: data frames and ggplot

In this session we concentrated on working with data frames and making some nice-looking plots using ggplot. There’s no need to reinvent the wheel here; this session is straight from STAT545 and uses the gapminder library.

By executing one line after another and talking about what was happening, we could work through this exercise together to make some progress and finish with plots, which gave a sense of achievement.

install.packages("gapminder")
library(gapminder)
str(gapminder)
# we also installed the tidyverse - some people had problems
library(tidyverse)
# looking at gapminder
class(gapminder)
gapminder
head(gapminder)
tail(gapminder)
names(gapminder)
ncol(gapminder)
nrow(gapminder)
length(gapminder)
dim(gapminder)
summary(gapminder)
# doing some plots in base R
plot(lifeExp ~ year, gapminder)
plot(lifeExp ~ gdpPercap, gapminder)
plot(lifeExp ~ log(gdpPercap), gapminder)
# using data frame terminology - dollar signs for columns
head(gapminder$lifeExp)
summary(gapminder$lifeExp)
hist(gapminder$lifeExp)
class(gapminder$continent)
summary(gapminder$continent)
levels(gapminder$continent)
nlevels(gapminder$continent)
table(gapminder$continent)
barplot(table(gapminder$continent))
# using ggplot
p <- ggplot(filter(gapminder, continent != "Oceania"),
            aes(x = gdpPercap, y = lifeExp))
p <- p + scale_x_log10() # our background was set up as an object called p
p + geom_point()
p + geom_point(aes(color = continent))
p + geom_point(alpha = (1/3), size = 3) + geom_smooth(lwd = 3, se = FALSE)
p + geom_point(alpha = (1/3), size = 3) +
  facet_wrap(~ continent) +
  geom_smooth(lwd = 1.5, se = FALSE, colour = "orange")

In the end we have a nice-looking set of plots (below). Earlier in the session, trainees see some base R graphics and then a ggplot version, which highlights the aesthetics of ggplot.

Session 2 output

Homework 2

I supplied my R script (above) from the session, so that people could go over it in their own time. As before, the homework task this week was to fill in the blank lines and submit it back to me together with the plot.

# We'll use some built-in data called ToothGrowth.
# Type in ToothGrowth to have a look at it
# What sort of object is ToothGrowth? Type a command to answer this
# What is the median value of the len column?
# Use table and barplot functions to make a quick graph of the column that contains Factors
# boring graph! You should have seen 30 in both groups.
# OK let's do some ggplot
library(ggplot2)
# we'll convert the dose column to be a factor (it is currently numeric)
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
# make a ggplot object called p that has the aesthetics x=dose and y=len
# now add a violin plot to your plot object. Clue: you need to use the plus sign to add geom_violin()
# when you type p it should show the violin plot
p
# add a red dot to show the median. Use stat_summary to add a point that is size 2 and red (4 arguments)
# compare what you have to the example graph
# now add a boxplot to your violin plot. boxplot should be 0.1 wide
# when you are happy with this make a new object called p1 with this combination
# now save this by altering the line below to change "myName.png" to your name
ggsave("myName.png", p1, width = 5, height = 5, dpi = "print")
# save your R script as "plotYourName.R" and send me the script and the graph png

Session 2: this should be the output from line 21

Third session: real data

The aim of the final session was to give people a taste of generating a real figure, using a script and some real data. Again the emphasis was on reproducibility. I reminded them that the raw data should be untouched and that the outputs are disposable and can be regenerated at any time, including by other people. To generate some realistic data, we used Fiji to analyse the number of puncta in a microscopy movie. We did this by sharing the movie in a Dropbox folder and then executing a series of steps (in hindsight, this would’ve been better run as a script in Fiji). Everybody did this in the session by segmenting the image, running Analyze Particles, and saving the Results window as a CSV file in the data directory of their R project.

Results

This file was selected after line 2 in the following code, which we again went through line by line.
require(ggplot2)
file_name <- file.choose()
df1 <- read.csv(file_name, header = TRUE, stringsAsFactors = FALSE)
summary(df1$Mean)
# build this up
p1 <- ggplot(df1, aes(Slice, Mean)) +
  geom_point() +
  stat_summary(fun.y = mean, geom = "point", size = 2, colour = "orange") +
  geom_smooth() +
  labs(x = "Frame", y = "GFP")
p1
p2 <- ggplot(df1, aes(Slice, Area)) +
  geom_point() +
  stat_summary(fun.y = mean, geom = "point", size = 2, colour = "orange") +
  geom_smooth() +
  labs(x = "Frame", y = "Area (pixels)")
p2
p3 <- ggplot(df1, aes(Area, Mean)) + geom_point()
p3
# correct scaling
# 1 pixel is 0.069 * 0.069 um
pxSize <- 0.069 * 0.069
df1$Area_um2 <- df1$Area * pxSize
p4 <- ggplot(df1, aes(Area_um2, Mean)) + geom_point()
p4
# 5 sec per frame
df1$Time <- (df1$Slice - 1) * 5
p5 <- ggplot(df1, aes(Time, Mean)) +
  geom_point() +
  stat_summary(fun.y = mean, geom = "point", size = 2, colour = "orange") +
  geom_smooth() +
  labs(x = "Time (s)", y = "GFP")
p5
# count spots per frame
df2 <- as.data.frame(table(df1$Slice))
# note the column names
names(df2) <- c("Slice", "Count")
df2$Time <- (df2$Slice - 1) * 5 # doesn't work - why?
str(df2)
df2$Slice <- as.integer(df2$Slice)
str(df2)
df2$Time <- (df2$Slice - 1) * 5
p6 <- ggplot(df2, aes(Time, Count)) +
  geom_point() +
  scale_y_continuous(limits = c(0, 100)) +
  geom_smooth() +
  labs(x = "Time (s)", y = "Puncta")
p6

After this, we went through a further example using some data from one person in the lab who needed to plot out their results. We also incorporated a function I had written for bootstrapping confidence intervals as part of this figure. I have not made this part of the session available in the post.

Wrap up

The lab members found it useful and didn’t even complain about the homework. All of them could see the value in learning a bit of R, so it was rewarding all round. The people with no prior experience found the final session pretty tough, so I don’t think it would be possible to go through this material in fewer sessions.
Of course, the best way to learn is by doing, so the challenge is to get people to persist with R for their number crunching. A simple way is to keep advocating for analysis to be done in R; more drastically, other software could be outlawed. One idea I had was a follow-up session to work through a real dataset in a hackathon approach with everyone in the room. Hopefully this has given you some ideas for R training in your lab.

The answers

The answers to the two homework tasks are here. WordPress forbids *.R files; just change the extension after downloading.

homework1
homework2

It was possible to use diff to show each person my answers and compare their work with mine. I used FileMerge on a Mac to do this. I could send a screenshot and some comments back to each person in a few seconds.

The post title comes from “Get Better” by The New Fast Automatic Daffodils. The version I have is on a compilation of Martin Hannett-produced tracks called “And Here Is The Young Man”.

To leave a comment for the author, please follow the link and comment on their blog: Rstats – quantixed.

### What and who is the IT community? What does it take to be a part of it?

Tue, 02/18/2020 - 08:10

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers].
This blog post is long overdue and has been rattling around in my head for a long time. Simply and boldly put, the community is everyone involved behind the results of your search for a particular problem. By that I am referring to the IT community. The community (from now on, I am referring only to the IT community) consists of people who contribute and share their knowledge, experience, questions, solutions, FAQs, and so on with a broader group of people. Many (if not most) of us do this:

• for no monetary reward
• in our free (or extra) time
• to build up a wider knowledge base

In return, many of us expect from others:

• the same behavior – sharing when they find an interesting solution to a problem
• sharing their time and energy to help others
• participating by reading and gradually answering questions from other users
• commitment
• gratitude and respect for others.

In the past couple of years I have seen a decline in this sharing-and-caring type of exchange and interaction. Yet what I have seen more and more is:

• a decline in awareness of how much the community is worth
• a decline in respect and gratitude toward fellow community members
• the aging of the community
• a decline in general interest

Let me address each of these topics separately.

1. How much is the community worth?

Yes, what a bizarre question, and for sure I cannot answer it. But I will paint a different picture. We have all heard sayings like “programming is just learning how to google things”, but not everyone has asked themselves what that means. Where do all the hits, results, and solutions come from? From the community. From people stepping forward and taking the time to invest in understanding someone’s problem and solving it. Or people investing time and giving a lecture for free at a local meeting. Or people taking the time to write a blog post or record a YouTube video on how to do something. These are all people contributing to this knowledge base.
And next time you have a question or a problem to solve, remember that there was a person or a group of people who invested even more time into solving that issue before you. This is how much the community is worth, and much more.

2. Decline in respect

Pay respect to community people. You ask yourself: how? It is very easy to show respect:

• Say thanks (at the end of a forum post or blog post, after watching a video online, or after a community event) and be grateful.
• If you disagree, or if you have a better solution, engage in conversation and propose your solution. But be polite, nice, and respectful. Your improvement will be highly appreciated.
• Give feedback; your words count, especially when you describe your point of view.
• Don’t be afraid to give praise or critique. In general, when giving a positive or negative review, keep in mind to be precise and objective. Think about others when writing either: will it be helpful and worthwhile to them or not?
• Get involved and start helping others. Become an active member: start by observing, reading, listening, and conversing, and over time start giving back: answers, ideas, blog posts, videos, talks, etc. Don’t be afraid; if you are polite, kind, sharing, and warm, the community will embrace you.
• Words like “sorry” and “I don’t know”, or admitting that you didn’t know or that you were wrong, are not signs of weakness but signs of learning. This is the path to gaining more knowledge and getting on track to start helping others.
• Respect the general conduct of the website, the event, or the product. Play by the rules.

3. Aging of the community

The fact that the community is aging has to do with a social phenomenon: the lack of available literature before the popularity of the internet.
Those who spent long periods in libraries, analyzing every single page of an available book, know and understand the importance of available materials. The majority of these same IT people are community contributors themselves. They have grown with the community over the past 20 years (the massive emergence of the internet and the democratization of mass media), and they are also the large majority of the community still giving back. Draw a line through any type of IT event that has been around for more than 10 years and you will find the same people at the first iteration of those same events 10 years earlier. Teaching the community to give back knowledge, and encouraging people to participate more and more in any kind of community work, should start at a young age. Convincing the younger generation to start participating in and enjoying the community should also be introduced and discussed. Only in this manner will the knowledge be returned and not taken for granted.

4. Decline in general

How we live our lives, and how technology has changed our habits, directly (or indirectly) influences the community as well. With more and more options for the same subject matter, many people have the ability to choose.
Which is absolutely great, but it can have negative aspects:

• people sign up for far too many free videos, webinars, and events, more than they are able to attend
• people subscribe to far too many free and open posts, forums, and aggregators, leaving them confused when choosing
• more options are not necessarily a good thing
• to study one topic, people still need to take the time to study it thoroughly
• fast-changing technology and market needs leave users hindered
• too many technologies, too little time

I have been involved in the community for over 20 years, covering a variety of topics: programming languages (Delphi, .NET, Python, C#), statistical languages and tools (R, Python, SAS, IBM, SAP, machine learning, data science), and database software (Microsoft SQL Server, SAP HANA, Oracle). I have seen a decline in general, especially due to lack of time, the age gap, and the content gap. But at the same time, I have also seen many new, more practical events, blogs, vlogs, and articles that try not only to keep the existing community engaged and tickle their brains, but also to bring in new, younger people and teach them that sharing is caring, and caring builds new connections, new ties, and new possibilities.

Lastly, this is my personal view on the community, community work, evangelists, and knowledge builders. Many of them (and us) do this out of sheer enthusiasm, energy, passion, and drive, and not, as many would have thought, for recognition or financial reasons. I love to share the passion, energy, enthusiasm, and drive with anyone who wants to get started on a particular topic. But you must find this in yourself; otherwise it is useless. Feel free to share your personal view, your opinion, your positive or negative feedback; I would love to hear it. Much appreciated. With love!
To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

### How is information gain calculated?

Tue, 02/18/2020 - 02:55

[This article was first published on R – Open Source Automation, and kindly contributed to R-bloggers].

This post will explore the mathematics behind information gain. We’ll start with the basic intuition behind information gain, and then explain why it is calculated the way it is.

What is information gain?

Information gain is a measure frequently used in decision trees to determine which variable to split the input dataset on at each step in the tree. Before we formally define this measure, we first need to understand the concept of entropy. Entropy measures the amount of information or uncertainty in a variable’s possible values.

How to calculate entropy

The entropy of a random variable X is given by the following formula:

-Σi [p(Xi) * log2(p(Xi))]

Here, each Xi represents a possible (ith) value of X, and p(Xi) is the probability of that value.

Why is it calculated this way?
First, let’s build some intuition behind the entropy formula. The formula has several useful properties. For example, it’s always non-negative. Also, the information content of an event, -log2 p(Xi), is monotonically decreasing in its probability: the amount of information carried by an event (or value) of X decreases as the probability of that event increases.

That may sound a little abstract at first, so let’s consider a specific example. Suppose you’re predicting how much a movie will make in revenue. One of your predictors is a binary indicator – 1 if the record refers to a movie, and 0 otherwise. That predictor is useless! Mathematically speaking, it’s useless because every record in the dataset is a movie, so there’s a 100% probability of that event (i.e. the record being a movie) occurring. This means the variable provides no real information about the data. The closer a variable gets to having a single possible value, the less information that value gives you.

Why the log2?

Technically, entropy can be calculated using a logarithm of a different base (e.g. the natural log). However, it’s common to use base 2 because this returns a result in terms of bits. In this way, entropy can be thought of as the average number of bits needed to encode a value of a specific variable.

Case example

Information gain in the context of decision trees is the reduction in entropy when splitting on a variable X. Let’s do an example to make this clear. In the mini-dataset below, the label we’re trying to predict is the type of fruit, based on the size, color, and shape variables.

| Fruit | Size | Color | Shape |
|---|---|---|---|
| Watermelon | Big | Green | Round |
| Apple | Medium | Red | Round |
| Banana | Medium | Yellow | Thin |
| Grape | Small | Green | Round |
| Grapefruit | Medium | Yellow | Round |
| Lemon | Small | Yellow | Round |

Suppose we want to calculate the entropy of the color variable. 3 out of the 6 records are yellow, 2 are green, and 1 is red.
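The formula can also be checked numerically with a few lines of R (`entropy_bits` is my own helper name, not from the post):

```r
# Entropy in bits from a vector of category counts
entropy_bits <- function(counts) {
  p <- counts / sum(counts)   # convert counts to probabilities
  -sum(p * log2(p))           # -sum p * log2(p)
}

entropy_bits(c(3, 2, 1))  # color counts: 3 yellow, 2 green, 1 red -> 1.459148...
```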
Proportionally, the probability of a yellow fruit is 3 / 6 = 0.5; 2 / 6 = 0.333… for green, and 1 / 6 = 0.1666… for red. Using the formula from above, we can calculate the entropy like this:

-([3/6 * log2(3/6)] + [2/6 * log2(2/6)] + [1/6 * log2(1/6)]) = 1.459148

Likewise, we can calculate the entropy for the size variable. In this case, 3 / 6 of the fruits are medium-sized, 2 / 6 are small, and 1 / 6 is big, so the calculation (and result) happens to be the same:

-([3/6 * log2(3/6)] + [2/6 * log2(2/6)] + [1/6 * log2(1/6)]) = 1.459148

Lastly, we have the shape variable. Here, 5 / 6 of the fruits are round and 1 / 6 is thin.

-([5/6 * log2(5/6)] + [1/6 * log2(1/6)]) = 0.650022

How to calculate information gain in R

So, how do we calculate information gain in R? Thankfully, this is fairly simple to do using the FSelector package:

library(FSelector)

info <- data.frame(fruits = c("watermelon", "apple", "banana", "grape", "grapefruit", "lemon"))
info$sizes <- c("big", "medium", "medium", "small", "medium", "small")
info$colors <- c("green", "red", "yellow", "green", "yellow", "yellow")
info$shapes <- c("round", "round", "thin", "round", "round", "round")

# get information gain
information.gain(formula(info), info)

Conclusion

That’s all for now! Information gain is just one of many possible feature importance methods, and I’ll have more articles in the future exploring other possibilities. If you liked this post, please follow my blog on Twitter!

The post How is information gain calculated? appeared first on Open Source Automation.


Tue, 02/18/2020 - 02:32

To compute Lasso regression, that is to minimize

\frac{1}{2}\|\mathbf{y}-\mathbf{X}\mathbf{\beta}\|_{\ell_2}^2+\lambda\|\mathbf{\beta}\|_{\ell_1}

define the soft-thresholding function

S(z,\gamma)=\text{sign}(z)\cdot(|z|-\gamma)_+=\begin{cases}z-\gamma&\text{ if }\gamma<|z|\text{ and }z>0\\z+\gamma&\text{ if }\gamma<|z|\text{ and }z<0\\0&\text{ if }\gamma\geq|z|\end{cases}

The R function would be

soft_thresholding = function(x, a) sign(x) * pmax(abs(x) - a, 0)

To solve our optimization problem, set

\mathbf{r}_j=\mathbf{y} - \left(\beta_0\mathbf{1}+\sum_{k\neq j}\beta_k\mathbf{x}_k\right)=\mathbf{y}-\widehat{\mathbf{y}}^{(j)}
so that the optimization problem can be written, equivalently
\min\left\lbrace\frac{1}{2n}\|\mathbf{r}_j-\beta_j\mathbf{x}_j\|^2+\lambda |\beta_j|\right\rbrace
hence, dropping the term that does not involve \beta_j,
\min\left\lbrace\frac{1}{2n}\left[\beta_j^2\|\mathbf{x}_j\|^2-2\beta_j\mathbf{r}_j^T\mathbf{x}_j\right]+\lambda |\beta_j|\right\rbrace
and one gets
\beta_{j,\lambda} = \frac{1}{\|\mathbf{x}_j\|^2}S(\mathbf{r}_j^T\mathbf{x}_j,n\lambda)
or, if we develop
\beta_{j,\lambda} = \frac{1}{\sum_i x_{ij}^2}S\left(\sum_ix_{i,j}[y_i-\widehat{y}_i^{(j)}],n\lambda\right)
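To see this update in action, here is a small numerical sketch on simulated data (the names and values are my own, not from the post); the soft threshold always shrinks the ordinary least squares value toward zero:

```r
set.seed(1)
n <- 50
x_j <- rnorm(n)             # the j-th column of X
r_j <- 2 * x_j + rnorm(n)   # a partial residual, simulated here for illustration
lambda <- 0.1
S <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)  # soft-thresholding

beta_ols <- sum(r_j * x_j) / sum(x_j^2)        # the update with no penalty
beta_j   <- S(sum(r_j * x_j), n * lambda) / sum(x_j^2)
c(beta_ols, beta_j)                            # beta_j is shrunk toward zero
```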
Again, if there are weights \mathbf{\omega}=(\omega_i), the coordinate-wise update becomes
\beta_{j,\lambda,{\color{red}{\omega}}} = \frac{1}{\sum_i {\color{red}{\omega_i}}x_{ij}^2}S\left(\sum_i{\color{red}{\omega_i}}x_{i,j}[y_i-\widehat{y}_i^{(j)}],n\lambda\right)
The code to compute this componentwise descent is

lasso_coord_desc = function(X, y, beta, lambda, tol = 1e-6, maxiter = 1000){
  beta = as.matrix(beta)
  X = as.matrix(X)
  omega = rep(1/length(y), length(y))
  obj = numeric(length = (maxiter + 1))
  betalist = list(length(maxiter + 1))
  betalist[[1]] = beta
  beta0list = numeric(length(maxiter + 1))
  beta0 = sum(y - X %*% beta)/(length(y))
  beta0list[1] = beta0
  for (j in 1:maxiter){
    for (k in 1:length(beta)){
      r = y - X[, -k] %*% beta[-k] - beta0 * rep(1, length(y))
      beta[k] = (1/sum(omega * X[, k]^2)) *
        soft_thresholding(t(omega * r) %*% X[, k], length(y) * lambda)
    }
    beta0 = sum(y - X %*% beta)/(length(y))
    beta0list[j + 1] = beta0
    betalist[[j + 1]] = beta
    obj[j] = (1/2) * (1/length(y)) *
      norm(omega * (y - X %*% beta - beta0 * rep(1, length(y))), 'F')^2 +
      lambda * sum(abs(beta))
    if (norm(rbind(beta0list[j], betalist[[j]]) - rbind(beta0, beta), 'F') < tol) { break }
  }
  return(list(obj = obj[1:j], beta = beta, intercept = beta0))
}

For instance, consider the following (simple) dataset, with three covariates

that we can “normalize”

X = model.matrix(lm(Fire ~ ., data = chicago))[, 2:4]
for(j in 1:3) X[, j] = (X[, j] - mean(X[, j]))/sd(X[, j])
y = chicago$Fire
y = (y - mean(y))/sd(y)

To initialize the algorithm, use the OLS estimate

beta_init = lm(Fire ~ 0 + ., data = chicago)$coef
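Side note: the manual centering and scaling above is exactly what base R's scale() does (the values below are purely illustrative):

```r
# The manual standardization matches scale(), which also divides by
# the (n-1)-denominator standard deviation
x = c(2, 4, 6, 8)
manual = (x - mean(x)) / sd(x)
all.equal(manual, as.numeric(scale(x)))   # TRUE
```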

For instance

lasso_coord_desc(X, y, beta_init, lambda = .001)

$obj
[1] 0.001014426 0.001008009 0.001009558 0.001011094 0.001011119 0.001011119

$beta
          [,1]
X_1  0.0000000
X_2  0.3836087
X_3 -0.5026137

$intercept
[1] 2.060999e-16

and we can get the standard Lasso plot by looping over values of lambda.

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics.

### Hyperparameter tuning and #TidyTuesday food consumption

Tue, 02/18/2020 - 01:00

[This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers].

Last week I published a screencast demonstrating how to use the tidymodels framework and specifically the recipes package. Today, I’m using this week’s #TidyTuesday dataset on food consumption around the world to show hyperparameter tuning! Here is the code I used in the video, for those who prefer reading instead of, or in addition to, video.

Explore the data

Our modeling goal here is to predict which countries are Asian countries and which countries are not, based on their patterns of food consumption in the eleven categories from the #TidyTuesday dataset.
The original data is in a long, tidy format, and includes information on the carbon emissions associated with each category of food consumption.

library(tidyverse)

food_consumption <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv")

food_consumption

## # A tibble: 1,430 x 4
##    country   food_category            consumption co2_emmission
##  1 Argentina Pork                           10.5          37.2
##  2 Argentina Poultry                        38.7          41.5
##  3 Argentina Beef                           55.5        1712
##  4 Argentina Lamb & Goat                     1.56         54.6
##  5 Argentina Fish                            4.36          6.96
##  6 Argentina Eggs                           11.4          10.5
##  7 Argentina Milk - inc. cheese            195.          278.
##  8 Argentina Wheat and Wheat Products      103.           19.7
##  9 Argentina Rice                            8.77         11.2
## 10 Argentina Soybeans                        0             0
## # … with 1,420 more rows

Let’s build a dataset for modeling that is wide instead of long using pivot_wider() from tidyr. We can use the countrycode package to find which continent each country is in, and create a new variable for prediction, asia, that tells us whether a country is in Asia or not.

library(countrycode)
library(janitor)

food <- food_consumption %>%
  select(-co2_emmission) %>%
  pivot_wider(
    names_from = food_category,
    values_from = consumption
  ) %>%
  clean_names() %>%
  mutate(continent = countrycode(
    country,
    origin = "country.name",
    destination = "continent"
  )) %>%
  mutate(asia = case_when(
    continent == "Asia" ~ "Asia",
    TRUE ~ "Other"
  )) %>%
  select(-country, -continent) %>%
  mutate_if(is.character, factor)

food

## # A tibble: 130 x 12
##     pork poultry  beef lamb_goat  fish  eggs milk_inc_cheese wheat_and_wheat…
##  1  10.5    38.7  55.5      1.56  4.36 11.4            195.            103.
##  2  24.1    46.1  33.9      9.87 17.7   8.51           234.             70.5
##  3  10.9    13.2  22.5     15.3   3.85 12.5            304.            139.
##  4  21.7    26.9  13.4     21.1  74.4   8.24           226.             72.9
##  5  22.3    35.0  22.5     18.9  20.4   9.91           137.             76.9
##  6  27.6    50.0  36.2      0.43 12.4  14.6            255.             80.4
##  7  16.8    27.4  29.1      8.23  6.53 13.1            211.            109.
##  8  43.6    21.4  29.9      1.67 23.1  14.6            255.            103.
##  9  12.6    45    39.2      0.62 10.0   8.98           149.             53
## 10  10.4    18.4  23.4      9.56  5.21  8.29           288.             92.3
## # … with 120 more rows, and 4 more variables: rice , soybeans ,
## #   nuts_inc_peanut_butter , asia

This is not a big dataset, but it will be good for demonstrating how to tune hyperparameters. Before we get started on that, how are the categories of food consumption related? Since these are all numeric variables, we can use ggscatmat() for a quick visualization.

library(GGally)
ggscatmat(food, columns = 1:11, color = "asia", alpha = 0.7)

Notice how important rice is! Also see how the relationships between different food categories are different for Asian and non-Asian countries; a tree-based model like a random forest is good at learning interactions like this.

Tune hyperparameters

Now it’s time to tune the hyperparameters for a random forest model. First, let’s create a set of bootstrap resamples to use for tuning, and then let’s create a model specification for a random forest where we will tune mtry (the number of predictors to sample at each split) and min_n (the number of observations needed to keep splitting nodes). These are hyperparameters that can’t be learned from data when training the model.

library(tidymodels)

set.seed(1234)
food_boot <- bootstraps(food, times = 30)
food_boot

## # Bootstrap sampling
## # A tibble: 30 x 2
##    splits id
##  1        Bootstrap01
##  2        Bootstrap02
##  3        Bootstrap03
##  4        Bootstrap04
##  5        Bootstrap05
##  6        Bootstrap06
##  7        Bootstrap07
##  8        Bootstrap08
##  9        Bootstrap09
## 10        Bootstrap10
## # … with 20 more rows

rf_spec <- rand_forest(
  mode = "classification",
  mtry = tune(),
  trees = 1000,
  min_n = tune()
) %>%
  set_engine("ranger")

rf_spec

## Random Forest Model Specification (classification)
##
## Main Arguments:
##   mtry = tune()
##   trees = 1000
##   min_n = tune()
##
## Computational engine: ranger

We can’t learn the right values when training a single model, but we can train a whole bunch of models and see which ones turn out best.
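To make the bootstrap idea concrete, here is a base-R sketch (hypothetical, not the tidymodels internals) of what one bootstrap resample is: row indices are sampled with replacement, and the rows never drawn form the out-of-bag "assessment" set used to estimate performance.

```r
# One bootstrap resample of a 130-row dataset, done by hand
set.seed(123)
n = 130                               # rows in the food data
idx = sample(n, n, replace = TRUE)    # the "analysis" set (with duplicates)
oob = setdiff(seq_len(n), idx)        # out-of-bag rows, held out for assessment
length(idx)    # 130, with duplicates
length(oob)    # roughly n * exp(-1), i.e. about a third of the rows
```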
We can use parallel processing to make this go faster, since the different parts of the grid are independent.

doParallel::registerDoParallel()

rf_grid <- tune_grid(
  asia ~ .,
  model = rf_spec,
  resamples = food_boot
)

rf_grid

## # Bootstrap sampling
## # A tibble: 30 x 4
##    splits id          .metrics .notes
##  1        Bootstrap01
##  2        Bootstrap02
##  3        Bootstrap03
##  4        Bootstrap04
##  5        Bootstrap05
##  6        Bootstrap06
##  7        Bootstrap07
##  8        Bootstrap08
##  9        Bootstrap09
## 10        Bootstrap10
## # … with 20 more rows

Once we have our tuning results, we can check them out.

rf_grid %>%
  collect_metrics()

## # A tibble: 20 x 7
##     mtry min_n .metric  .estimator  mean     n std_err
##  1     2     4 accuracy binary     0.836    30 0.00798
##  2     2     4 roc_auc  binary     0.843    30 0.00861
##  3     2    12 accuracy binary     0.830    30 0.00760
##  4     2    12 roc_auc  binary     0.833    30 0.00930
##  5     4    33 accuracy binary     0.815    30 0.00873
##  6     4    33 roc_auc  binary     0.818    30 0.0101
##  7     4    37 accuracy binary     0.814    30 0.00875
##  8     4    37 roc_auc  binary     0.820    30 0.0103
##  9     5    31 accuracy binary     0.817    30 0.00864
## 10     5    31 roc_auc  binary     0.822    30 0.0103
## 11     6     9 accuracy binary     0.824    30 0.00895
## 12     6     9 roc_auc  binary     0.831    30 0.00947
## 13     7    21 accuracy binary     0.815    30 0.00947
## 14     7    21 roc_auc  binary     0.824    30 0.0101
## 15     8    18 accuracy binary     0.817    30 0.00929
## 16     8    18 roc_auc  binary     0.824    30 0.0103
## 17     9    26 accuracy binary     0.816    30 0.0100
## 18     9    26 roc_auc  binary     0.822    30 0.0104
## 19    11    15 accuracy binary     0.813    30 0.0110
## 20    11    15 roc_auc  binary     0.825    30 0.0102

And we can see which models performed the best, in terms of some given metric.

rf_grid %>%
  show_best("roc_auc")

## # A tibble: 5 x 7
##    mtry min_n .metric .estimator  mean     n std_err
##  1     2     4 roc_auc binary     0.843    30 0.00861
##  2     2    12 roc_auc binary     0.833    30 0.00930
##  3     6     9 roc_auc binary     0.831    30 0.00947
##  4    11    15 roc_auc binary     0.825    30 0.0102
##  5     8    18 roc_auc binary     0.824    30 0.0103

If you would like to specify the grid for tuning yourself, check out the dials package!
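Conceptually, show_best() just sorts the metrics table by the chosen metric and keeps the top rows. A base-R sketch with a toy metrics data frame (these values are made up for illustration, not the post's results):

```r
# What "pick the best hyperparameters" amounts to: sort by mean metric
metrics = data.frame(
  mtry  = c(2, 2, 6),
  min_n = c(4, 12, 9),
  mean  = c(0.843, 0.833, 0.831)
)
best = metrics[order(-metrics$mean), ][1, ]
best$mtry    # 2
best$min_n   # 4
```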
To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.

### rstudio::conf 2020 Videos

Tue, 02/18/2020 - 01:00

[This article was first published on R Views, and kindly contributed to R-bloggers].

rstudio::conf 2020 is already receding in the rear-view mirror, but the wealth of resources generated by the conference will be valuable for quite some time. All of the materials from the workshops, and now all one hundred and four videos of conference talks, are available. This unique video collection offers valuable insight into how developers, data scientists, statisticians, journalists, physicians, educators and other R-savvy professionals are using their domain knowledge, analytical expertise and coding skills to make the world a better place. Talks range from very technical, R-specific topics to scientific contributions, best practices, applications of reproducible research, and the culture and sociology surrounding open source software. There is a lot to see, but my advice is to begin with J.J. Allaire’s keynote, which presents open source software for data science through the lens of RStudio’s history and mission.

J.J.
Allaire’s Keynote at rstudio::conf 2020, announcing RStudio, PBC

To leave a comment for the author, please follow the link and comment on their blog: R Views.

### Getting started in R markdown

Tue, 02/18/2020 - 01:00

[This article was first published on R on Stats and R, and kindly contributed to R-bloggers].

Photo by Jon Tyson

If you have spent some time writing code in R, you have probably heard of generating dynamic reports incorporating R code, R outputs (results) and text or comments. In this article, I will explain how R Markdown works and give you the basic elements you need to get started easily with the production of these dynamic reports.

R Markdown: what, why and how?

R Markdown allows you to generate a report (most of the time in PDF, HTML, Word or as a Beamer presentation) that is automatically generated from a file written within RStudio. The generated documents can serve as a neat record of your analysis that can be shared and published as a detailed and complete report. Even if you never expect to present the results to someone else, it can also be used as a personal notebook to look back at, so you can see what you did at that time.
An R Markdown file has the extension .Rmd, while an R script file has the extension .R. The first main advantage of using R Markdown over R is that, in an R Markdown document, you can combine three important parts of any statistical analysis:

• R code, to show how the analyses have been done. For instance, the data and the functions you used. This allows readers to follow your code and to check that the analyses were correctly performed.

• Results of the code, that is, the output of your analyses. For example, the output of your linear model, plots, or the results of the hypothesis test you just coded. This allows readers to see the results of your analyses.

• Text, comments and interpretations of the results. For instance, after computing the main descriptive statistics and plotting some graphs, you can interpret them in the context of your problem and highlight important findings. This enables readers to understand your results thanks to your interpretations and your comments, delivered as if you wrote a document explaining your work.

Another advantage of R Markdown is that the reports are dynamic and reproducible by anyone who has access to the .Rmd file (and the data, if external data are used, of course), making it perfectly suited to collaboration and the dissemination of results. By dynamic, we mean that if your data changes, your results and your interpretations will change accordingly, without any work on your side.

The production of the reports is done in two stages:

1. The .Rmd file, which contains blocks of R code (called chunks) and text, is passed to the {knitr} package, which executes the R code to get the output and creates a document in markdown (.md) format. This document then contains the R code, the results (or outputs), and the text.

2. This .md file is then converted to the desired format (HTML, PDF or Word) by the markdown package, based on pandoc (i.e., a document conversion tool).
Before you start

To create a new R Markdown document (.Rmd), you first need to install and load the following packages:

install.packages(c("knitr", "rmarkdown", "markdown"))

library(knitr)
library(rmarkdown)
library(markdown)

Then click on File -> New File -> R Markdown or click on the small white sheet with a green cross in the top left corner and select R Markdown:

Create a new R Markdown document

A window will open; choose the title and the author and click on OK. The default output format is HTML. It can be changed later to PDF or Word.

After you have clicked on OK, a new .Rmd file, which serves as an example, has been created. We are going to use this file as a starting point for our more complex and more personalized file.

To compile your R Markdown document into an HTML document, click on the Knit button located at the top:

Knit an R Markdown document

A preview of the HTML report appears, and it is also saved in your working directory (see a reminder of what a working directory is).

Components of a .Rmd file

YAML header

A .Rmd file starts with the YAML header, enclosed by two series of ---. By default, this includes the title, author, date and the format of the report. If you want to generate the report as a PDF document, replace output: html_document by output: pdf_document. This information from the YAML header will appear at the top of the generated report after you compile it (i.e., after knitting the document).
To add a table of contents to your documents, replace output: html_document by:

output:
  html_document:
    toc: true

Here are my usual settings regarding the format of an HTML document (remove toc_float: true if you render the document in PDF, as PDF documents do not accept floating tables of contents):

output:
  html_document:
    toc: true
    toc_depth: 6
    number_sections: true
    toc_float: true

In addition to adding a table of contents, this sets its depth, adds section numbering, and makes the table of contents float when scrolling down the document (floating is only available for HTML documents).

You can visualize your table of contents even before knitting the document, or go directly to a specific section, by clicking on the small icon in the top right corner. Your table of contents will appear; click on a section to go to that section in your .Rmd document:

Visualize your table of contents

In addition to this enhanced table of contents, I usually set the following date in the YAML header:

Dynamic date in R Markdown

This piece of code writes the current date automatically, without my having to change it myself. This is very convenient for projects that last several weeks or months, to always have an updated date at the top of the document.

Code chunks

Below the YAML header, there is a first code chunk which is used for the setup options of your entire document. It is best to leave it like this for the moment; we can change it later if needed.

Code chunks in R Markdown documents are used to write R code. Every time you want to include R code, you will need to enclose it with three backticks.
For instance, to compute the mean of the values 1, 7 and 11, we first need to insert an R code chunk by clicking on the Insert button located at the top and selecting R (see the picture below), then we need to write the corresponding code inside the code chunk we just inserted:

Insert R code chunk in R Markdown

Example of code chunk

In the example file, you can see that the first R code chunk (except the setup code chunk) includes the function summary() applied to the preloaded dataset cars: summary(cars). If you look at the HTML document that is generated from this example file, you will see that the summary measures are displayed just after the code chunk.

The next code chunk in this example file is plot(pressure), which will produce a plot. Try writing other R code and knit (i.e., compile the document by clicking on the Knit button) the document to see if your code is rendered correctly. If you have already written code in an R script and want to reuse it in your R Markdown document, you can simply copy and paste your code inside code chunks. Do not forget to always include your code inside code chunks, or R will throw an error when compiling your document.

As you can see, there are two additional arguments in the code chunk of the plot compared to my code chunk of the mean presented above. The first argument following the letter r (without a comma between the two) is used to set the name of the chunk. In general, do not bother with this; it is mainly used to refer to a specific code chunk. You can remove the name of the chunk, but do not remove the letter r between the {} as it tells R that the code that follows corresponds to R code (yes, you read that right: that also means you can include code from another programming language, e.g., Python, SQL, etc.).

After the name of the chunk (after pressure in the example file), you can see that there is an additional argument: echo=FALSE.
This argument, called an option, indicates that you want to hide the code and display only the output of the code. Try removing it (or changing it to echo=TRUE), and you will see that after knitting the document, both the code AND the output will appear, while previously only the results appeared.

You can specify whether you want to hide or display the code alongside its output for each code chunk separately, for instance if you want to show the code for some code chunks but not for others. Alternatively, if you want to always hide/display the code together with the output for the entire document, you can specify it in the setup code chunk located just after the YAML header. The options passed to this setup code chunk will determine the options for all code chunks, except for those that have been specifically modified. By default, the only setup option when you open a new R Markdown file is knitr::opts_chunk$set(echo = TRUE), meaning that, by default, all outputs are accompanied by their corresponding code. If you want to display only the results without the code for the whole document, replace it with knitr::opts_chunk$set(echo = FALSE).

Two other options often passed to this setup code chunk are warning = FALSE and message = FALSE, to prevent warnings and messages from being displayed in the report. If you want to pass several options, do not forget to separate them with a comma:

Several options for a code chunk

You can also choose to display the code but not the result. For this, pass the option results = "hide". Alternatively, with the option include = FALSE, you can prevent the code and the results from appearing in the finished file while R still runs the code in order to use it at a later stage. If you want to prevent both the code and the results from appearing, and do not want R to run the code, use eval = FALSE. To edit the width and height of figures, use the options fig.width and fig.height.
Tip: when writing in R Markdown, you will very often need to insert new R code chunks. To insert a new R code chunk more rapidly, press CTRL + ALT + I on Windows or command + option + I on Mac. If you are interested in such shortcuts making you more efficient, see other tips and tricks in R Markdown.

Note that a code chunk can be run without the need to compile the entire document, if you want to check the results of a specific code chunk for instance. In order to run a specific code chunk, select the code and run it as you would do in an R script (.R), by clicking on Run or by pressing CTRL + Enter on Windows or command + Enter on Mac. Results of this code chunk will be displayed directly in the R Markdown document, just below the code chunk.

Text

Text can be added everywhere outside code chunks. R Markdown documents use the Markdown syntax for the formatting of the text. In our example file, just below the setup code chunk, some text has been inserted. To insert text, you simply write it without any enclosing. Try adding some sentences and knit the document to see how they appear in the HTML document.

Example of text below an example of R code chunk

Markdown syntax can be used to change the formatting of your text in the output file, for example to format some text in italics, in bold, etc. Below are some common formatting commands:

• Title: # Title

• Subtitle: ## Subtitle

• Subsubtitle: ### Subsubtitle. These headings will automatically be included in the table of contents if you included one.

• italics: *italics* or _italics_

• bold: **bold** or __bold__

• Link: [link](https://www.statsandr.com/) (do not forget https:// if it is an external URL)

• Equations: enclose your equation (written in LaTeX) with one $ to have it in the text:

$A = \pi*r^{2}$

This is a well-known equation $A = \pi*r^{2}$.

Enclose your LaTeX equation with two $$ to have it centered on a new line:

$$A = \pi*r^{2}$$

This is another well-known equation:

$E = mc^2$

Lists:

• Unordered list, item 1: * Unordered list, item 1
• Unordered list, item 2: * Unordered list, item 2
1. Ordered list, item 1: 1. Ordered list, item 1
2. Ordered list, item 2: 2. Ordered list, item 2
Code inside text

Before going further, I would like to introduce an important feature of R Markdown. It is often the case that, when writing interpretations or detailing an analysis, we would like to refer to a result directly in our text. For instance, suppose we work on the iris dataset (preloaded in R). We may want to explain in words, that the mean of the length of the petal is a certain value, while the median is another value.

Without R Markdown, the user would need to compute the mean and median, and then report it manually. Thanks to R Markdown, it is possible to report these two descriptive statistics directly in the text, without manually encoding it. Even better, if the dataset happens to change because we removed some observations, the mean and median reported in the generated document will change automatically, without any change in the text from our side.

Here is an illustration with the mean and median of the length of the sepal for the iris dataset. We can insert results directly in the interpretations (i.e., in the text) by placing a backward apostrophe followed by the letter r before the code, and close it with another backward apostrophe:

Inline code in R Markdown
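In the .Rmd source, the sentence shown in the screenshot is written with two inline code chunks; a sketch using the iris example from the text (the exact wording in the screenshot may differ slightly):

```
The mean of the length of the sepal is `r mean(iris$Sepal.Length)` and
the standard deviation is `r sd(iris$Sepal.Length)`.
```

When knitted, each backtick-enclosed `r …` expression is replaced by its computed value.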

This combination of text and code will give the following output in the generated report:

The mean of the length of the sepal is 5.8433333 and the standard deviation is 0.8280661.

This technique, referred to as inline code, allows you to insert results directly into the text of an R Markdown document. And as mentioned, if the dataset changes, the results incorporated in the text (the mean and standard deviation in our case) will automatically be adjusted to the new dataset, exactly like the output of a code chunk is dynamically updated if the dataset changes.

This technique of inline code, and the fact that it is possible to combine code, outputs of code, and text commenting on the outputs, makes R Markdown my favorite tool for statistical analyses. Since I discovered the power of R Markdown (and I am still learning, as it has a huge number of possibilities and features), I almost never write R code in plain scripts anymore. Every piece of R code I write is supplemented by text and inline code in an R Markdown document, resulting in a professional and complete final document ready to be shared, published, or stored for future use. If you are unfamiliar with this type of document, I invite you to learn more about it and to try it with your next analysis; you will most likely not go back to R scripts anymore.

Images

In addition to code, results and text, you can also insert images in your final document. To insert an image, place it in your working directory and, outside a code chunk, write one of the following (the second form adds alternative text):

![](path_to_your_image.jpg)

![alt text here](path_to_your_image.jpg)

Tables

There are two options to insert tables in R Markdown documents:

1. the kable() function from the {knitr} package
2. the pander() function from the {pander} package

Here is an example of a table without any formatting, and the same table with the two functions applied, using the iris dataset:

# without formatting
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50

# with kable()
library(knitr)
kable(summary(iris))

# with pander()
library(pander)
pander(summary(iris))

The advantage of pander() over kable() is that it can be used for many more types of output than tables. Try it on your own code, with the results of a linear regression or a simple vector, for example.

For more advanced users, R Markdown files can also be used to create Shiny apps, websites (this website is built thanks to R Markdown and the {blogdown} package), to write scientific papers based on templates from several international journals (with the {rticles} package), or even to write books (with the {bookdown} package).

To continue learning about R Markdown, see two complete cheat sheets from the RStudio team here and here, and a more complete guide here, written by Yihui Xie, J. J. Allaire and Garrett Grolemund.

Thanks for reading. I hope this article convinced you to use R Markdown for your future projects. See more tips and tricks in R Markdown to further increase your efficiency with R Markdown.

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion. If you find a mistake or bug, you can inform me by raising an issue on GitHub. For all other requests, you can contact me here.


### RStudio 1.3 Preview: Configuration and Settings

Tue, 02/18/2020 - 01:00

This blog post is part of a series on new features in RStudio 1.3, currently available as a preview release.

Today, we’re going to talk about a number of improvements we’ve made to RStudio 1.3 around configuration and settings. To set the stage, here’s how you configure RStudio today:

This point-and-click dialog makes it easy for users to select the settings they want, but has a couple of limitations:

1. For users, there is no way to back up or save settings in e.g., a dotfiles repo, nor a way to view or manipulate preferences with external tools.
2. For administrators, there is no way to establish defaults for users.

In RStudio 1.3, we’ve overhauled the settings and configuration system to address both of these issues, and along the way we’ve made several portions of the IDE more amenable to configuration.

User Preferences

All the preferences in the Global Options dialog (and a number of other preferences that aren’t in the dialog) are now saved in a simple, plain-text JSON file named rstudio-prefs.json. Here’s an example:

{
  "posix_terminal_shell": "bash",
  "editor_theme": "Night Owl",
  "wrap_tab_navigation": false
}

The example above instructs RStudio to use the bash shell for the Terminal tab, to apply the Night Owl theme to the IDE, and to avoid wrapping around when navigating through tabs. All other settings will have their default values.

By default, this file lives in AppData/Roaming/RStudio on Windows, and ~/.config/rstudio on other systems. While RStudio writes this file whenever you change a setting, you can also edit it yourself to change settings. You can back it up, or put it in a version control system. You’re in control!

If you’re editing this file by hand, you’ll probably want a reference. A full list of RStudio’s settings, along with their data types, allowable values, etc., can be found in the Session User Settings section of the RStudio Server Professional Administration Guide.

If you’re an administrator of an RStudio Server, you can establish defaults for any setting by using a global set of user preferences, placed here:

/etc/rstudio/rstudio-prefs.json

RStudio’s new configuration system complies with the XDG Base Directory Specification. This means that in addition to using XDG defaults for most directories, it is also possible to customize the location using environment variables. For example, you can set XDG_CONFIG_HOME for your users so that their configuration is loaded from somewhere other than ~/.config, or XDG_CONFIG_DIRS to establish a different folder for server-wide configuration.
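The fallback logic can be illustrated with a small R sketch (this is not RStudio's actual implementation, just the path-resolution idea): use XDG_CONFIG_HOME when it is set, and ~/.config otherwise.

```r
# Where user prefs would be looked up under XDG rules
config_home = Sys.getenv("XDG_CONFIG_HOME", unset = "~/.config")
prefs_path = file.path(config_home, "rstudio", "rstudio-prefs.json")
prefs_path   # e.g. "~/.config/rstudio/rstudio-prefs.json" when the variable is unset
```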

The Configuration Folder

The user preferences aren’t the only thing that lives in the configuration folder. In RStudio 1.3, we’ve reorganized a number of user-level files and settings so that they’re all in the same place. This makes your RStudio configuration much more portable; simply unpacking a backup of this folder will make it possible to apply all of your RStudio customizations at once.

Here’s what’s inside:

File/Folder          Content
rstudio-prefs.json   User preferences
dictionaries/        Custom spelling dictionaries
keybindings/         Editor and workbench keybindings, in JSON format
snippets/            Console and source snippets (*.snippet)
templates/           Default content for new files
themes/              Custom color themes (*.rstheme)

Every one of these elements can now be configured both globally (in e.g., the /etc/rstudio configuration folder) and per-user (in e.g., the ~/.config/rstudio folder).

So, for example, an administrator could pre-install custom themes for their users by placing them in /etc/rstudio/themes/, and then instruct RStudio to default to one of the custom themes by changing the editor_theme setting in /etc/rstudio/rstudio-prefs.json. Or, they could establish a system-wide default template for .R files in /etc/rstudio/templates/default.R.

Customizing Session Settings

Try it Out!

If you’d like to give the new configuration system a spin, we’d very much welcome your feedback on our community forum. You can download the RStudio 1.3 preview here:

RStudio 1.3 Preview


### Efficient Data Management in R

Tue, 02/18/2020 - 01:00

[This article was first published on R on Methods Bites, and kindly contributed to R-bloggers.]

The software environment R is widely used for data analysis and data visualization in the social sciences and beyond. Additionally, it is becoming increasingly popular as a tool for data and file management. Focusing on these latter aspects, this Methods Bites Tutorial by Marcel Neunhoeffer, Oliver Rittmann and our team members Denis Cohen and Cosima Meyer illustrates the workflow and best practices for efficient data management in R. Specifically, it covers:

• the workflow for organizing and conducting complex analyses in R
• creating, editing, and accessing directory hierarchies and their contents
• data merging, data management and data manipulation using tidy R and base R
• the basics of programming and debugging

To illustrate these steps, we will work through an example from comparative political behavior – the question under which conditions center-left parties can successfully mobilize votes among the unemployed. To tackle this question, we will combine micro-level voting data from Round 9 of the European Social Survey with contextual data on election dates from ParlGov, party positions from the Manifesto Project, and unemployment rates from the World Bank.

Note: This blog post is based on our workshop in the MZES Social Science Data Lab in Spring 2020. The corresponding workshop materials can be found on our GitHub.

Setup

Over the course of this tutorial, we will use the following R packages:

Code: Packages used in this tutorial

```r
## Save package names as a vector of strings
pkgs <- c(
  "foreign",      ### read data stored in various formats
  "readstata13",  ### read data stored by Stata 13-16
  "reshape2",     ### flexibly reshape data
  "countrycode",  ### convert country names and country codes
  "lubridate",    ### dates and time
  "dplyr",        ### tools for data manipulation
  "magrittr",     ### piping operations
  "tidyr",        ### tool to deal with messy data (and get "tidy data")
  "ggplot2",      ### data visualization using a grammar of graphics
  "fabricatr"     ### imagine your data before you collect it
)

## Install uninstalled packages
lapply(pkgs[!(pkgs %in% installed.packages())], install.packages)

## Load all packages to library and adjust options
lapply(pkgs, library, character.only = TRUE)
```

Workflow for a Reproducible Data Project

Even the setup of a reproducible data project should follow reproducible steps. And since we need to repeat these steps every time we start a new data project, it is a good idea to keep everything simple and to automate what can be automated. For the following steps to work you need three things:

1. A github.com account.
2. A current installation of R and RStudio on the computer you want to work on.
3. An installation of git on the computer you want to work on.

If you struggle with any of these three prerequisites, happygitwithr.com is an excellent resource that includes a step-by-step guide on how to set up git with R. It will also give you a more in-depth introduction to what git and GitHub actually are and why it is worthwhile to include them in any data project right from the start!

Git is a version control system that makes it easy to track changes and work on code collaboratively. GitHub is a hosting service for git. You can think of it like a public Dropbox for code. As a student, you even get unlimited private repositories which you can use if you don’t feel like sharing your code with the rest of the world (yet).

Let’s start a new project:

• Start a new project on GitHub:
• Click on New on the left side where your repositories are displayed.
• Give your project repository a meaningful name. For this course you could name it: efficient-data-management.
• Click Create repository.
• Copy the URL to the repository; it will look something like this: https://github.com/&lt;your-username&gt;/efficient-data-management.git
• Open the project in RStudio:
• Open RStudio.
• Click on File > New Project....
• Create a new project from Version Control.
• Choose Git.
• Paste the URL to the repository in Repository URL.
• Choose a folder on your computer where the project will be stored locally.
• And finally click on Create.
Initializing the project with a .Rprofile.

Now you have a new Rproject that is linked with GitHub. Before we start coding, we want an efficient folder structure. Ideally, we want to use the same structure for all of our projects. This way, it is easier to focus on the things that really matter – like producing interesting research. To that end, we initialize our Rproject with a .Rprofile that automates the process of setting up our project structure.

• Initialize your Rproject with a .Rprofile:
• If you follow these steps for the very first time, install the renv package in R.
• In RStudio, with your new Rproject open, open a new Text File.
• Paste the following code and save it as .Rprofile.
• Now close RStudio and re-open your project from your project folder.
```r
.First <- function() {
  dir.create(paste0(getwd(), "/figures"), showWarnings = F)
  dir.create(paste0(getwd(), "/processed-data"), showWarnings = F)
  dir.create(paste0(getwd(), "/raw-data"), showWarnings = F)
  dir.create(paste0(getwd(), "/scripts"), showWarnings = F)
  dir.create(paste0(getwd(), "/manuscript"), showWarnings = F)
  if (!("renv" %in% list.files())) {
    renv::init()
  } else {
    source("renv/activate.R")
  }
  cat("\nWelcome to your R-Project:", basename(getwd()), "\n")
}
```

R will automatically source the .Rprofile at startup. Here, we first create the project folders with dir.create(). Don't worry: folders that already exist will not be created again, so this will not overwrite existing files. We then initialize or activate the renv package and display a welcome message with cat().

After re-opening the Rproject, you should now see the project setup in your folder.

What is renv?

In the past, sharing your project with someone else and getting R to show the exact same behavior on different machines could be a pain. The renv package makes this super easy. It creates a local package directory for your project. This means that it keeps track of all the packages and package versions that you use in your project. If someone else wants to work with the exact same package environment to reproduce your data project, they can easily restore it from your renv package directory. To learn more about renv, check out its documentation here.
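A minimal sketch of that round trip, using renv's documented top-level functions:

```r
# In a fresh project: create a project-local library and lockfile (renv.lock)
renv::init()

# After installing or updating packages: record the exact versions in renv.lock
renv::snapshot()

# On a collaborator's machine (after cloning the project):
# reinstall exactly the versions recorded in renv.lock
renv::restore()
```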

Leveraging git and renv at the end of a working session

Now that we have our nice project setup, we should not forget to leverage it. At the end of a working session, follow these steps:

• Run renv::snapshot() to record all the packages you used in your local package directory.
• Commit all your changes to git. This can be easily done by using the Git pane in RStudio.
• Push everything to GitHub.

Whenever you re-open your Rproject, make sure to start your working session with a Pull from GitHub. That way, you will always work with the most recent version of your project.

Managing and Accessing Directories

Now that we have a directory structure for our project, we need to learn how to work with it. What goes into the different folders should be rather self-explanatory. Yet, it is good practice to also include a simple ReadMe file in each of the folders to explain what it should contain.
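One way to automate this practice is a short helper like the following sketch (the folder names and descriptions are assumptions based on our project setup; adjust to taste):

```r
# Sketch: write a one-line README.md into each project folder.
folders <- c(
  "figures"        = "Final figures produced by the scripts.",
  "processed-data" = "Cleaned and merged data sets; never edit by hand.",
  "raw-data"       = "Original data as downloaded; read-only.",
  "scripts"        = "Data management and analysis scripts.",
  "manuscript"     = "Manuscript drafts and .Rmd files."
)
for (f in names(folders)) {
  writeLines(folders[[f]], file.path(f, "README.md"))
}
```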

What is a working directory?

Your working directory is the folder on your computer that R will use to read and save files. With your .Rproject and .Rmd files, you do not have to worry about this too much. There is only one golden rule: Never change your working directory (especially, never to absolute paths specific to your computer).

When we created our .Rproject, we had to decide where all the files should be stored. If you open your .Rproject, the working directory will automatically be set to this folder. If you have a .Rmd file somewhere in your project (e.g. in the /manuscript folder), RStudio will automatically set the working directory to the folder containing the .Rmd.

What does this all mean?

This means you have to access all data or code (stored in the folders of your project) relative to your current working directory.

• If you work with a simple .R script (and you store the file in /scripts) in your .Rproject and you want to open data stored in your /raw-data folder: Simply open the file directly from the folder e.g. with read.csv("raw-data/20200127_parlgov.csv")
• If you work in a .Rmd file in your manuscript folder you have to navigate up one directory level by using ../: Open the file directly from the folder with read.csv("../raw-data/20200127_parlgov.csv")
• Similarly, if you want to store processed data or a figure you would directly store it to the right folder.
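The same logic applies when writing outputs; a short sketch (the object and file names are made up for illustration):

```r
# From a script in /scripts, run with the project root as working directory:
write.csv(ess_clean, "processed-data/ess_clean.csv", row.names = FALSE)
ggsave("figures/unemployment_by_country.pdf")  # saves the last ggplot

# From an .Rmd in /manuscript, go up one directory level first:
ess_clean <- read.csv("../processed-data/ess_clean.csv")
```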
Data Management and Data Manipulation

Getting to know the tidyverse

The R universe essentially builds upon two (seemingly contradictory) approaches: base R and the tidyverse. While these two approaches are often presented as different philosophies, they can form a symbiosis. We therefore recommend picking whichever works best for you – or combining the two.

Whereas base R ships with R itself, using the tidyverse requires users to load additional packages. People often find base R unintuitive and hard to read. This is why Hadley Wickham developed the tidyverse – a more intuitive approach to managing and wrangling data. Code written before 2014 was usually written in base R, whereas the tidyverse style has since become increasingly widespread. Again, which approach you prefer is a matter of personal taste rather than a decision between right and wrong.

The following figure shows two code chunks that are substantially identical and visualize the main differences between the two approaches.

The logic of tidyverse is fairly simple: As you can see in the graphic above, when using the tidyverse style, you start with your main object (me) and pipe (%>%) through this object by filtering, selecting, renaming, … parts of it. With these pipe operators, you read the code from left to right (or, across multiple lines, from top to bottom). In base R you would – in contrast – wrap the commands around your main object and thus read the code from the inside out.
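Since the figure is not reproduced here, a minimal sketch of the contrast (me is a stand-in for any data frame with columns task and hours):

```r
# tidyverse style: read left to right, top to bottom
me %>%
  filter(task == "writing") %>%
  select(task, hours)

# base R style: read from the inside out
me[me$task == "writing", c("task", "hours")]
```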

We first familiarize ourselves with the basic logic of the tidyverse style using some examples. For this, we use some generated data.

Code: Generate data

```r
# We use code that is adjusted and based on examples provided here:
# https://declaredesign.org/r/fabricatr/articles/building_importing.html

# We set a seed to make our dataset reproducible.
set.seed(68159)

data <- fabricate(
  countries = add_level(
    N = 10,
    month = recycle(month.abb),
    gdppercapita = runif(N, min = 10000, max = 50000),
    unemployment_rate = 5 + runif(N, 30, 50) + ((gdppercapita > 50) * 10),
    life_expectancy = 50 + runif(N, 10, 20) + ((gdppercapita > 30000) * 10)
  )
)

# Drop countries
data <- data %>%
  select(-countries)

# We add artificial NAs
data <- data %>%
  mutate(
    gdppercapita = ifelse(gdppercapita > 40000 & gdppercapita < 44000,
                          NA, gdppercapita),
    unemployment_rate = ifelse(
      unemployment_rate > 50 & unemployment_rate < 54,
      NA, unemployment_rate
    )
  )

# Have a look at the data
data
##    month gdppercapita unemployment_rate life_expectancy
## 1    Jan     31819.79          48.90126        70.19808
## 2    Feb     45485.76          58.51337        76.16142
## 3    Mar     18327.78          55.89483        67.68735
## 4    Apr     22275.86          48.61502        60.98838
## 5    May     23069.79          47.18671        66.39749
## 6    Jun     34896.12                NA        72.52299
## 7    Jul     14370.76          58.33964        66.42222
## 8    Aug           NA          59.60927        74.91735
## 9    Sep           NA                NA        73.82523
## 10   Oct     24195.20                NA        65.26416

# If your dataset becomes larger, the command "head(data)" is a useful
# alternative to get a quick glance at the data. By default it shows the
# first 6 rows.
```

Filtering the data

Our dataset contains information on months (characters) as well as GDP per capita, unemployment rates, and life expectancies (all numeric). Suppose we are only interested in data from January. For this purpose, we can use the filter() function, provided by the package dplyr.1

```r
data %>%
  filter(month == "Jan")
##   month gdppercapita unemployment_rate life_expectancy
## 1   Jan     31819.79          48.90126        70.19808
```

How would you do it in base R?

```r
data[data$month == "Jan", ]
```

RStudio also provides shortcuts that allow you to write the pipe operator (%>%) quickly:

• Shift + Cmd + M (Mac)
• Shift + Ctrl + M (Windows)

Selecting specific variables

Sometimes, you want to select specific variables. We are interested in the unemployment rate and the months but not so much in the other variables. dplyr's command select() allows you to do exactly this.

```r
data %>%
  select(month, unemployment_rate) %>%
  head()
##   month unemployment_rate
## 1   Jan          48.90126
## 2   Feb          58.51337
## 3   Mar          55.89483
## 4   Apr          48.61502
## 5   May          47.18671
## 6   Jun                NA
```

How would you do it in base R?

```r
head(data[, c("month", "unemployment_rate")])
```

Arranging variables

Let's say we want to know the largest and the lowest unemployment rate in the entire dataset. This can be done with the arrange() command, added with another %>% operator to our previous code. By default, the arrange() function sorts the data in ascending order. To display the data in descending order, we add desc().

```r
# Arrange in ascending order
data %>%
  select(month, unemployment_rate) %>%
  arrange(unemployment_rate)
##    month unemployment_rate
## 1    May          47.18671
## 2    Apr          48.61502
## 3    Jan          48.90126
## 4    Mar          55.89483
## 5    Jul          58.33964
## 6    Feb          58.51337
## 7    Aug          59.60927
## 8    Jun                NA
## 9    Sep                NA
## 10   Oct                NA

# Arrange in descending order
data %>%
  select(month, unemployment_rate) %>%
  arrange(desc(unemployment_rate))
##    month unemployment_rate
## 1    Aug          59.60927
## 2    Feb          58.51337
## 3    Jul          58.33964
## 4    Mar          55.89483
## 5    Jan          48.90126
## 6    Apr          48.61502
## 7    May          47.18671
## 8    Jun                NA
## 9    Sep                NA
## 10   Oct                NA
```

How would you do it in base R?

```r
# Arrange in ascending order
data[order(data$unemployment_rate), c("month", "unemployment_rate")]

# Arrange in descending order
data[order(data$unemployment_rate, decreasing = TRUE),
     c("month", "unemployment_rate")]
```

As we can see, the highest unemployment rate was in August with 59.61% and the lowest in May with 47.19%.
Group-wise operations

To get the highest unemployment rates by month, we use a combination of group_by() (to group our results by month) and the summarise() function.

```r
data %>%
  group_by(month) %>%
  summarise(max_unemployment = max(unemployment_rate))
## # A tibble: 10 x 2
##    month max_unemployment
##  1 Apr               48.6
##  2 Aug               59.6
##  3 Feb               58.5
##  4 Jan               48.9
##  5 Jul               58.3
##  6 Jun               NA
##  7 Mar               55.9
##  8 May               47.2
##  9 Oct               NA
## 10 Sep               NA
```

How would you do it in base R?

```r
aggregate(data$unemployment_rate, by = list(data$month), max)
```

In some cases, you might no longer need your variables grouped after performing transformations. In this case, you can pipe another command with %>% ungroup() at the end to ungroup your data.

Extracting unique observations

Which unique months are included in the dataset? To get this information, we use the distinct() command in dplyr. As we see, all months from January to October are present. R allows us to also sort this data alphabetically with arrange().

```r
# Get the distinct months in our dataset
data %>%
  distinct(month)
##    month
## 1    Jan
## 2    Feb
## 3    Mar
## 4    Apr
## 5    May
## 6    Jun
## 7    Jul
## 8    Aug
## 9    Sep
## 10   Oct

# Sort data alphabetically
data %>%
  distinct(month) %>%
  arrange(month)
##    month
## 1    Apr
## 2    Aug
## 3    Feb
## 4    Jan
## 5    Jul
## 6    Jun
## 7    Mar
## 8    May
## 9    Oct
## 10   Sep
```

How would you do it in base R?

```r
# Get the distinct months in our dataset
unique(data$month)

# Sort data alphabetically
sort(unique(data$month))
```

Renaming variables

The variable name gdppercapita is hard to read. We therefore want to rename it to gdp_per_capita using the rename() function from the dplyr package. The compound assignment pipe operator %<>% from the magrittr package simultaneously serves as the first pipe in a chain of commands and assigns the transformed data to the left-hand side object.

```r
# Rename the variable "gdppercapita" to "gdp_per_capita"
data %<>%
  rename(gdp_per_capita = gdppercapita)
```

How would you do it in base R?
```r
names(data)[names(data) == "gdppercapita"] <- "gdp_per_capita"
```

Creating new variables

The mutate() command allows you to generate a new variable. Let's say you want to generate a dummy that indicates whether it is summer or not. We call this variable summer.

```r
# Create a new variable called "summer"
data %>%
  mutate(summer = ifelse(month %in% c("Jun", "Jul", "Aug"), 1, 0))
##    month gdp_per_capita unemployment_rate life_expectancy summer
## 1    Jan       31819.79          48.90126        70.19808      0
## 2    Feb       45485.76          58.51337        76.16142      0
## 3    Mar       18327.78          55.89483        67.68735      0
## 4    Apr       22275.86          48.61502        60.98838      0
## 5    May       23069.79          47.18671        66.39749      0
## 6    Jun       34896.12                NA        72.52299      1
## 7    Jul       14370.76          58.33964        66.42222      1
## 8    Aug             NA          59.60927        74.91735      1
## 9    Sep             NA                NA        73.82523      0
## 10   Oct       24195.20                NA        65.26416      0
```

How would you do it in base R?

```r
data$summer <- ifelse(data$month %in% c("Jun", "Jul", "Aug"), 1, 0)
```

Additional features

mutate() and summarise() also have several scoped variants that allow us to apply our commands simultaneously to several variables:

• _all() affects all variables
• _at() affects specific selected variables
• _if() affects conditionally selected variables

How does this work in practice? Let's say we have good reason to believe that all NAs in our observations are 0 and we want to transform all variables with NAs at once. We then use mutate_if():

```r
data %<>%
  mutate_if(is.numeric, replace_na, 0)
```

How would you do it in base R?

```r
# Note: assigning to data[] (rather than data) keeps the data.frame class
data[] <- lapply(data, function(x) replace(x, is.na(x), 0))
```

Data wrangling for complex data structures: A walkthrough

Following this short intro to the tidyverse, we want to showcase how sequences of data-transforming operations can be used to manage and combine complex data structures. We use a classical multi-level setup: we want to enrich multinational survey data with country-specific contextual information.
For this purpose, we use data from Round 9 of the European Social Survey (ESS), a cross-national survey of nearly 20 European countries fielded between August 2018 and May 2019, and augment it with party positions from the Manifesto Project and unemployment rates from the World Bank.

Note: As shown in the examples above, a sequence of commands may be combined using a series of piped operations. For didactic purposes, we break these sequences into shorter segments in the following examples.

Micro-level survey data from the European Social Survey

We start with some initial data management for the ESS data. The code below selects a range of relevant variables. First, we select some administrative variables, including country identifiers (cntry), information on the start time of the interview (inwyys, inwmms, and inwdds), and the design weights (dweight).

Second, we select variables on individuals' voting behavior and party affinity. Next to two variables that establish whether respondents voted in the last general election (vote) or feel close to any party (clsprty), respectively, this includes a range of country-specific vote recall and party ID variables (starting with prtv and prtcl, respectively). For respondents who voted (feel close to any party), these variables specify which national party they voted for (feel closest to).

Our main predictor is unemployment. We therefore select a variable that records whether respondents have ever been unemployed for longer than three months (uemp3m) and whether any such period occurred during the past five years (uemp5yr). Combining the information from both variables allows us to distinguish individuals with experiences of long-term unemployment during the last five years from those without such experiences. Additionally, we keep information on respondents' gender, age, education, and ethnic minority status.
Code: European Social Survey – Initial data management

```r
ess <- read.dta13("efficient-data-r/raw-data/202090127_ESS9e01_1.dta",
                  fromEncoding = "UTF-8") %>%
  select(                     ### select relevant variables
    cntry,                    ### ISO2C country codes
    inwyys,                   ### start of interview, year
    inwmms,                   ### start of interview, month
    inwdds,                   ### start of interview, day
    dweight,                  ### design weights
    vote,                     ### voted in last general election?
    starts_with("prtv"),      ### country-specific vote recall
    clsprty,                  ### close to any party?
    starts_with("prtcl"),     ### country-specific party ID
    uemp3m, uemp5yr,          ### unemployed in the past 5 years
    gndr,                     ### gender
    agea,                     ### age
    eisced,                   ### education (ISCED)
    blgetmg                   ### ethnic minority
  ) %>%
  mutate_if(is.factor, as.character) %>%     ### factor -> character
  mutate(uemp5yr = ifelse(uemp3m == "No",    ### recode uemp5yr
                          "No", uemp5yr)) %>%
  mutate_if(is.character,     ### recode responses to missing values
            function (x) ifelse(
              x %in% c("Refusal", "Don't know", "No answer", "Not applicable"),
              NA, x
            ))
```

Election Dates

Now that we have selected and partly recoded our micro-level variables, we are confronted with a new challenge: the ESS records respondents' recall of their voting behavior in the most recent national election. We thus need to identify the correct electoral context for each country before we can add information on party strategies. This requires that we identify the date of the most recent national election prior to a given interview.

As a first step, we need to collect information on the country-specific ESS field times. We use information on the date on which individual interviews were started, stored in separate variables for year, month, and day: inwyys, inwmms, and inwdds. We combine these three variables in YYYYMMDD format using the base R sprintf() command and save them as a date using the ymd() command from the tidyverse's lubridate package. After this, we can drop the three variables from our data using dplyr's select() command.
```r
ess %<>%
  mutate(inwdate = sprintf('%04d%02d%02d', inwyys, inwmms, inwdds)) %>%
  mutate(inwdate = ymd(inwdate)) %>%
  select(-inwyys, -inwmms, -inwdds)
```

To retrieve information on country-specific field periods, we need to find the earliest and latest interview dates within each country. We therefore use the dplyr commands group_by() and summarize() to generate an auxiliary data frame named ess_dates.

```r
ess_dates <- ess %>%
  group_by(cntry) %>%
  summarize(field_start = min(inwdate),
            field_end = max(inwdate)) %>%
  print()
## # A tibble: 19 x 3
##    cntry field_start field_end
##  1 AT    2018-09-18  2019-01-12
##  2 BE    2018-09-20  2019-01-27
##  3 BG    2018-11-16  2018-12-15
##  4 CH    2018-08-31  2019-02-11
##  5 CY    2018-09-18  2019-05-26
##  6 CZ    2018-11-17  2019-02-02
##  7 DE    2018-09-03  2019-03-03
##  8 EE    2018-10-01  2019-03-02
##  9 FI    2018-09-06  2019-02-18
## 10 FR    2018-10-20  2019-04-01
## 11 GB    2018-09-01  2019-02-16
## 12 HU    2019-01-31  2019-05-22
## 13 IE    2018-11-07  2019-03-31
## 14 IT    2018-12-17  2019-03-10
## 15 NL    2018-09-03  2019-01-20
## 16 NO    2018-10-04  2019-05-16
## 17 PL    2018-10-26  2019-03-20
## 18 RS    2018-09-28  2019-02-24
## 19 SI    2018-09-24  2019-02-01
```

Now that we know the country-specific field periods, we need to find the election date preceding the respective field periods. For this purpose, we use the comprehensive collection on parliaments and elections from ParlGov. The data carries information on parties' performance in both national and European parliamentary elections. As the data set comes in .csv format, we use the read.csv() command to load it into R. In order to use the data, we need to do a bit of housekeeping first:

1. We use the countrycode package to convert country names to ISO2C codes. This allows us to combine them with the data from ess_dates later on.
2. We retain information on national parliamentary elections, thus dropping observations for elections to the European parliament.
3. We select the three variables we need: country codes in ISO2C format, election dates, and unique election IDs.
4. As the data set still contains many observations per election (one for each party), we use group_by()–summarize_all()–ungroup() to collapse the data frame. As a result, it now contains one unique value for cntry and election_date for each value of election_id.
5. Lastly, we combine the election dates data from pgov with the field times from ess_dates, using the harmonized cntry identifiers. Using right_join() ensures that we only keep the election dates for the 19 countries included in the right-hand side data in ess_dates.

```r
pgov <- read.csv("efficient-data-r/raw-data/20200127_parlgov.csv") %>%
  mutate(cntry = countrycode(country_name, "country.name", "iso2c")) %>%
  filter(election_type == "parliament") %>%
  select(cntry, election_date, election_id) %>%
  group_by(election_id) %>%
  summarize_all(unique) %>%
  ungroup() %>%
  right_join(ess_dates, by = "cntry")
```

At this point, we have added all available election dates for each available country. What we want, however, is only the date of the most recent election preceding a country's ESS field period. So what do we do?

1. We save election_date as an actual date variable. This allows us to compare it to other date variables.
2. For each country, we exclude all elections that took place after the respective start date of the ESS field period.2
3. We group_by() countries and arrange() the observations by election_date from earliest to latest. We then identify the most_recent election, i.e., the one with the smallest duration between the start of the field period and the preceding election date (while we're at it, we also store the date of the previous election for each country, which will come in handy later on). We then ungroup().
4. We can now use filter() to keep only those observations where election_date == most_recent evaluates to true.
```r
pgov %<>%
  mutate(election_date = ymd(election_date)) %>%
  filter(election_date <= field_start) %>%
  group_by(cntry) %>%
  arrange(election_date) %>%
  mutate(most_recent = election_date[which.min(field_start - election_date)],
         prev_election_date = lag(election_date)) %>%
  ungroup() %>%
  filter(election_date == most_recent)
```

The rest is just a little housekeeping: we extract the years of the most recent and previous election dates and select only those variables which we are going to need later on.

Code: Election dates (housekeeping)

```r
pgov %<>%
  mutate(
    election_year = year(election_date),
    prev_election_year = year(prev_election_date)
  ) %>%
  select(cntry, election_date, prev_election_date,
         election_year, prev_election_year)
```

Party Positions

We now know the exact dates of the elections in which individuals cast their votes as reported in the ESS. This finally allows us to combine the survey data with information on the policy positions on which center-left parties campaigned in the respective elections. We therefore load the data set from the Manifesto Project, which comes in Stata 13+ format and therefore requires the read.dta13() command from the readstata13 package. We start with a little housekeeping, converting country names to ISO2C codes, subsetting the data to include only social democratic parties, and storing election dates as actual date variables.

Code: Manifesto Project – Initial data management

```r
cmp <- read.dta13("efficient-data-r/raw-data/20200127_cmp2019b.dta") %>%
  mutate(cntry = countrycode(countryname, "country.name", "iso2c")) %>%
  filter(parfam == "soc social democratic") %>%
  mutate(election_date = as.Date(edate))
```

We then right_join() the data frame pgov to the Manifesto data in cmp, matching observations by both cntry and election_date, and select all relevant variables. Next to pervote, which contains parties' vote shares in the most recent election, we keep rile and welfare.
The former is parties' general left-right position, the latter their position on welfare issues.

```r
cmp %<>%
  right_join(pgov, by = c("cntry", "election_date")) %>%
  select(
    cntry,
    election_date,
    election_year,
    prev_election_date,
    prev_election_year,
    party,
    partyname,
    partyabbrev,
    pervote,
    rile,
    welfare
  )
```

One problem we still need to address is that some countries have multiple parties of the social democratic party family. For the sake of simplicity, we choose to focus on only one party per country, namely the strongest. Therefore, we again group by country and keep only those observations with the highest vote percentage. We now have a data set that contains the positions of each country's main center-left party, based on the most recent election preceding the ESS, which we can easily merge with our ESS survey data using the cntry identifier.

```r
cmp %<>%
  group_by(cntry) %>%
  filter(pervote == max(pervote, na.rm = TRUE)) %>%
  ungroup()

## View data
cmp %>%
  select(cntry, election_year, partyname, partyabbrev, welfare)

## Add to ESS
ess %<>%
  left_join(cmp, by = "cntry")
```

Unemployment Rates

Now, suppose we also want to retrieve information on the average national unemployment rate during the legislative period preceding the most recent national election. For this purpose, we use data on annual national unemployment rates from the World Bank.

Code: World Bank unemployment rates – Initial data management

```r
unemp <- read.csv("efficient-data-r/raw-data/20200114_worldbank_unemprate.csv",
                  skip = 4L) %>%
  mutate(cntry = countrycode(Country.Name, "country.name", "iso2c")) %>%
  filter(cntry %in% cmp$cntry)
```

Following some initial data management, we can see that the data is in wide format: Every country has only one row while their annual unemployment rates are stored in multiple columns named X1960, …, X2019. So what do we do?

First, we select our country identifier along with all yearly observation columns starting with X (except an unspecified additional variable named X). We then use the melt() command from the reshape2 package. This changes the data format from wide to long: we now have 60 observations per country (for the years 1960-2019), uniquely identified by a new variable which we rename to year, with the corresponding values stored in a new variable which we rename to unemprate. The year variable contains the previous variable names (X1960, …, X2019). After omitting the leading X character, for which we use the base R substr() command, we store the years as numeric values.

```r
unemp %<>%
  select(cntry, starts_with("X"), -X) %>%
  melt(id.vars = "cntry") %>%
  rename(year = variable, unemprate = value) %>%
  mutate(year = as.numeric(substr(year, 2L, 5L)))
```

Using the long-format data, we then proceed as follows:

1. We select the country-specific information in the most recent and previous election years from the pgov data frame, and left_join() it with the unemp data frame using our country identifiers. Thereby, we supplement each of the 60 yearly observations per country with the information on the relevant time span.
2. After grouping the data set by cntry, we can use this information to filter() only the years of each country’s past legislative period.
3. The summarize() function, lastly, allows us to average across these observations to retrieve a single observation per country. This gives us the desired average unemployment rate during the past legislative period, which we may merge with the survey data using the cntry identifier.
```r
unemp %<>%
  left_join(pgov %>%
              select(cntry, election_year, prev_election_year),
            by = c("cntry")) %>%
  group_by(cntry) %>%
  filter(year >= prev_election_year & year <= election_year) %>%
  summarize(unemprate = mean(unemprate))

ess %<>%
  left_join(unemp, by = "cntry")
```

Programming and Debugging

Identifying center-left parties in the ESS

Now we need a variable which indicates whether a respondent of the ESS voted for a center-left party. Unfortunately, the variables indicating the party a respondent voted for at the last election are rather messy: each country has its own variable, and party names are not coded consistently; sometimes the ESS uses abbreviations, sometimes full names. Thus, we need to check the ESS codebook and the ESS Round 9 Appendix on Political Parties to manually code the names of center-left parties.

To save the information, we generate a new auxiliary data frame which we use later on.

Code: Preparation of auxiliary data

## Main center-left parties
aux <- list(
  AT = "SPÖ",
  BE = "SP.A",
  BG = "Balgarska sotsialisticheska partiya (BSP)",
  CH = "Social Democratic Party / Socialist Party",
  CY = "Progressive Party of Working People (AKEL)",
  CZ = "ČSSD",
  DE = "SPD",
  EE = "Sotsiaaldemokraatlik Erakond",
  FI = "Social Democratic Party",
  FR = "PS (Parti Socialiste)",
  GB = "Labour",
  HU = "MSZP (Magyar Szocialista Párt)",
  IE = "Labour",
  IT = "Partido Democratico (PD)",
  NL = "Socialist Party",
  NO = "Arbeiderpartiet",
  PL = "Zjednoczona Lewica",
  RS = "Boris Tadic, Cedomir Jovanovic - Savez za bolju Srbiju - LDP, LSV, SDS",
  SI = "L - Levica"
) %>%
  melt() %>%
  rename(ess_party_name = value, cntry = L1)

## Match vote variables and country codes
vote_var <- ess %>%
  select(starts_with("prtv"), -prtvede1) %>%
  names()

vote_var_order <- sapply(aux$cntry, function(x) which(grepl(x, toupper(vote_var))))

## Add to auxiliary data frame
aux %<>%
  mutate(vote_var = vote_var[vote_var_order]) %>%
  mutate_if(is.factor, as.character)

Programming a function

Now, we want to add a variable that indicates whether a respondent voted for a center-left party (as opposed to voting for a different party or not voting at all). To this end, we program a function named ess_center_left(). This function takes three input arguments: micro_data, aux_data, and group_var. micro_data is a data frame containing the survey data, i.e., our ess object. aux_data is a data frame with auxiliary information, i.e., our aux object. group_var, lastly, is a character string containing the name of the grouping variable (cntry) in both data frames. After several operations, which we discuss in detail below, the function returns the modified data frame micro_data, now with an added variable v_center_left.

ess_center_left <- function(micro_data, aux_data, group_var) {
  ## A priori consistency checks
  ...
  ## Main function
  ...
  ## A posteriori consistency checks
  ...
  ## Return output
  return(micro_data)
}

So what goes into our function body?
We divide our function into three parts:

1. A priori consistency checks
2. The main function
3. A posteriori consistency checks

In the following, we walk through each part separately and subsequently combine them into one function.

A priori consistency checks

A priori consistency checks test whether we supply the correct inputs to our function. In a first step, we check if both micro_data and aux_data actually contain a variable of the name specified in group_var. If either of them does not, the function stops and returns an error message. In a second step, we also check for the unique values (i.e., country codes) of group_var in micro_data and aux_data and establish their intersection in group_vals_both. The third step establishes which, if any, values of group_var are missing in the micro and auxiliary data. In the fourth and final step, we print corresponding warning messages that inform us which countries are not included in both the micro and auxiliary data. For these, the new variable v_center_left will not be created.
# 1) Check if group_var is a variable in micro_data and aux_data
if (not(group_var %in% names(micro_data))) {
  stop("group_var is not a variable in micro_data.")
} else if (not(group_var %in% names(aux_data))) {
  stop("group_var is not a variable in aux_data.")
} else {
  # 2) Unique values of group_var
  group_vals_micro <- unique(micro_data[, group_var])
  group_vals_aux <- unique(aux_data[, group_var])
  group_vals_both <- group_vals_micro[group_vals_micro %in% group_vals_aux]

  # 3) Missing values of group_var in micro_data and aux_data
  group_vals_only_micro <-
    group_vals_micro[not(group_vals_micro %in% group_vals_both)]
  group_vals_only_aux <-
    group_vals_aux[not(group_vals_aux %in% group_vals_both)]

  # 4) Missing group_var values in micro_data or aux_data
  if (length(group_vals_only_micro) > 0) {
    warning(paste("Group values only in micro_data:",
                  group_vals_only_micro, sep = " "))
  }
  if (length(group_vals_only_aux) > 0) {
    warning(paste("Group values only in aux_data:",
                  group_vals_only_aux, sep = " "))
  }
}

Main function

If our data inputs pass the initial consistency checks, the function generates the new variable v_center_left within each available country. We begin by defining an auxiliary identifier, aux_id, that allows us to uniquely identify observations in the ESS data. This will allow us to merge the newly created variable to the original data later on. We then generate a list, vote_recoded, that serves as a container for the country-specific data frames that we generate in the subsequent loop through the different countries. In this loop, we first retrieve the name of the ESS vote choice variable and the name of the center-left party in a given country. We then subset the data to observations from this country and retain three variables only: aux_id, vote (i.e., whether someone voted or not), and the country-specific vote choice variable, which we subsequently rename to vote_choice and store in our system’s native encoding.
Following this, we assign one of four possible values to the new variable v_center_left:

1. If we have no information on the respondent’s voting behavior: NA
2. If the respondent did not vote or was not eligible to vote: Did not vote
3. If the respondent voted for the country’s main center-left party: Yes
4. If the respondent voted for a different party: No

After this, we collapse the list of country-specific data frames into a single data frame, which we then add to the original data frame micro_data by our identifier aux_id.

# Auxiliary ID
micro_data$aux_id <- seq_len(nrow(micro_data))

# List container for data frames containing v_center_left
vote_recoded <- list()

# Loop through groups
for (j in group_vals_both) {
  # Name of the group's vote choice variable
  vote_var_j <- aux_data$vote_var[aux_data$cntry == j]

  # Name of the group's center-left party (in native encoding)
  ess_party_name_j <- aux_data$ess_party_name[aux_data$cntry == j]

  # Generate v_center_left for this group
  vote_recoded[[j]] <- micro_data %>%
    filter(cntry == j) %>%                          # subset to group
    select(aux_id, vote, vote_var_j) %>%            # select variables
    rename(vote_choice = !!as.name(vote_var_j)) %>% # rename vote choice
    mutate(vote_choice = enc2native(vote_choice),   # harmonize encoding
           v_center_left =                          # create v_center_left
             ifelse(is.na(vote) & is.na(vote_choice), NA, # missing information
                    ifelse(vote %in% c("No", "Not eligible to vote"),
                           "Did not vote",                # non-voters
                           ifelse(vote_choice == ess_party_name_j,
                                  "Yes",                  # center-left voters
                                  "No"))))                # voted for other party
}

# Collapse list of data frames to a single data frame
vote_recoded %<>%
  bind_rows() %>%
  select(aux_id, v_center_left)

# Remove old versions of v_center_left if present
if ("v_center_left" %in% names(micro_data)) {
  micro_data %<>% select(-v_center_left)
}

# Add new variable to micro_data by aux_id
micro_data %<>%
  full_join(vote_recoded, by = "aux_id") %>% # add new variable
  select(-aux_id)                            # drop auxiliary ID

A posteriori consistency checks

After generating our new variable v_center_left within each country, we want to know if our function actually worked. For this purpose, we check whether any country has zero percent of center-left voters. Given that we supplied a name for one center-left party per country, this would likely suggest a mismatch in our inputs.
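The logic of this check can be sketched in base R on a toy data frame (hypothetical values, with one country deliberately lacking any center-left voters; the dplyr version used inside the function follows below):

```r
# Toy data: DE has no center-left voters at all
df <- data.frame(cntry = c("AT", "AT", "DE", "DE"),
                 v_center_left = c("Yes", "No", "No", "No"))

# Share of "Yes" per country
prop <- aggregate((v_center_left == "Yes") ~ cntry, data = df, FUN = mean)
names(prop)[2] <- "prop_v_center_left"

# Countries with a zero share, i.e., suspect inputs
zero <- prop$cntry[prop$prop_v_center_left == 0]
if (length(zero) > 0) {
  warning(paste("No center-left voters in", paste(zero, collapse = " ")))
}
```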

# Proportions of center-left voters within each group
sample_prop <- micro_data %>%
  select(group_var, v_center_left) %>%
  group_by(!!as.name(group_var)) %>%
  summarize(prop_v_center_left = mean(v_center_left == "Yes", na.rm = TRUE))

# Check if any group has 0% center-left voters
if (any(sample_prop$prop_v_center_left == 0)) {
  warning(paste(
    "No center-left voters in",
    paste(sample_prop$cntry[sample_prop$prop_v_center_left == 0],
          collapse = " "),
    "- check your inputs!",
    sep = " "
  ))
}

Now, we can put the parts together and define our function.

Code: Full function

ess_center_left <- function(micro_data, aux_data, group_var) {

  ## A priori consistency checks
  # 1) Check if group_var is a variable in micro_data and aux_data
  if (not(group_var %in% names(micro_data))) {
    stop("group_var is not a variable in micro_data.")
  } else if (not(group_var %in% names(aux_data))) {
    stop("group_var is not a variable in aux_data.")
  } else {
    # 2) Unique values of group_var
    group_vals_micro <- unique(micro_data[, group_var])
    group_vals_aux <- unique(aux_data[, group_var])
    group_vals_both <- group_vals_micro[group_vals_micro %in% group_vals_aux]

    # 3) Missing values of group_var in micro_data and aux_data
    group_vals_only_micro <-
      group_vals_micro[not(group_vals_micro %in% group_vals_both)]
    group_vals_only_aux <-
      group_vals_aux[not(group_vals_aux %in% group_vals_both)]

    # 4) Missing group_var values in micro_data or aux_data
    if (length(group_vals_only_micro) > 0) {
      warning(paste("Group values only in micro_data:",
                    group_vals_only_micro, sep = " "))
    }
    if (length(group_vals_only_aux) > 0) {
      warning(paste("Group values only in aux_data:",
                    group_vals_only_aux, sep = " "))
    }
  }

  ## Main function
  # Auxiliary ID
  micro_data$aux_id <- seq_len(nrow(micro_data))

  # List container for data frames containing v_center_left
  vote_recoded <- list()

  # Loop through groups
  for (j in group_vals_both) {
    # Name of the group's vote choice variable
    vote_var_j <- aux_data$vote_var[aux_data$cntry == j]

    # Name of the group's center-left party (in native encoding)
    ess_party_name_j <- aux_data$ess_party_name[aux_data$cntry == j]

    # Generate v_center_left for this group
    vote_recoded[[j]] <- micro_data %>%
      filter(cntry == j) %>%                          # subset to group
      select(aux_id, vote, vote_var_j) %>%            # select variables
      rename(vote_choice = !!as.name(vote_var_j)) %>% # rename vote choice
      mutate(vote_choice = enc2native(vote_choice),   # harmonize encoding
             v_center_left =                          # create v_center_left
               ifelse(is.na(vote) & is.na(vote_choice), NA, # missing information
                      ifelse(vote %in% c("No", "Not eligible to vote"),
                             "Did not vote",                # non-voters
                             ifelse(vote_choice == ess_party_name_j,
                                    "Yes",                  # center-left voters
                                    "No"))))                # voted for other party
  }

  # Collapse list of data frames to a single data frame
  vote_recoded %<>%
    bind_rows() %>%
    select(aux_id, v_center_left)

  # Remove old versions of v_center_left if present
  if ("v_center_left" %in% names(micro_data)) {
    micro_data %<>% select(-v_center_left)
  }

  # Add new variable to micro_data by aux_id
  micro_data %<>%
    full_join(vote_recoded, by = "aux_id") %>% # add new variable
    select(-aux_id)                            # drop auxiliary ID

  ## A posteriori consistency checks
  # Proportions of center-left voters within each group
  sample_prop <- micro_data %>%
    select(group_var, v_center_left) %>%
    group_by(!!as.name(group_var)) %>%
    summarize(prop_v_center_left = mean(v_center_left == "Yes", na.rm = TRUE))

  # Check if any group has 0% center-left voters
  if (any(sample_prop$prop_v_center_left == 0)) {
    warning(paste(
      "No center-left voters in",
      paste(sample_prop$cntry[sample_prop$prop_v_center_left == 0],
            collapse = " "),
      "- check your inputs!",
      sep = " "
    ))
  }

  ## Return output
  return(micro_data)
}

Debugging

Now we apply this function:

ess %<>% ess_center_left(aux_data = aux, group_var = "cntry")

## Warning in ess_center_left(., aux_data = aux, group_var = "cntry"): No
## center-left voters in DE - check your inputs!

There seems to be something wrong with Germany.
Let’s have a look:

ess %>%
  filter(cntry == "DE") %>%
  select(prtvede2) %>%
  table()

## .
## Alliance 90/The Greens (Bündnis 90/Die Grünen)
##                                            289
## Alternative for Germany (AFD)
##                                            111
## Christian Democratic Union/Christian Social Union (CDU/CSU)
##                                            558
## Free Democratic Party (FDP)
##                                            147
## National Democratic Party (NPD)
##                                              2
## Other
##                                             36
## Pirate Party (Piratenpartei)
##                                              4
## Social Democratic Party (SPD)
##                                            355
## The Left (Die Linke)
##                                            125

aux %>% filter(cntry == "DE")

##   ess_party_name cntry vote_var
## 1            SPD    DE prtvede2

As we can see, we do have a sizable number of center-left voters in Germany. However, the correct code for the SPD is not “SPD” but “Social Democratic Party (SPD)”. So let’s fix this bug and apply the function again:

# Fix the party name
aux$ess_party_name[aux$cntry == "DE"] <- "Social Democratic Party (SPD)"

# Apply the function
ess %<>% ess_center_left(aux_data = aux, group_var = "cntry")

Lastly, we save the data in the gen-data folder:

save(ess, file = "efficient-data-r/gen-data/ess-proc.RData")

Working with the data

We can now use the data to approach our initial question through some simple descriptives. We first generate a binary version of our outcome variable v_center_left, where Yes equals TRUE whereas both No and Did not vote equal FALSE. We then split the data into a list of 19 country-specific data frames. Using lapply(), we apply a function for estimation and prediction to each of the 19 data sets. Specifically, we run country-wise linear probability models of v_center_left on uemp5yr. We then use the predict() function to retrieve the predicted probability of a center-left vote among the unemployed, along with its 95% confidence interval, for each country.
ess_est <- ess %>%
  mutate(v_center_left = (v_center_left == "Yes")) %>%
  group_split(cntry) %>%
  lapply(function(dat) {
    # Country-wise linear probability model, weighted by design weights
    mod <- glm(v_center_left ~ uemp5yr, data = dat, weights = dweight)

    # Predicted probability of a center-left vote among the unemployed
    pred <- predict(mod, newdata = data.frame(uemp5yr = "Yes"), se.fit = TRUE)

    output <- data.frame(
      est = pred$fit,
      lower95 = pred$fit + qnorm(.025) * pred$se.fit,
      upper95 = pred$fit + qnorm(.975) * pred$se.fit,
      welfare = unique(dat$welfare)
    )
    return(output)
  }) %>%
  bind_rows()

We can then plot the relationship between center-left support among the unemployed and center-left parties’ welfare policy positions. As we can see below, based on 14 out of the 19 countries for which data on party positions is available, there does not seem to be a straightforward relationship between the two variables.

ess_est %>%
  ggplot(aes(x = welfare, y = est)) +
  geom_errorbar(aes(ymin = lower95, ymax = upper95)) +
  geom_point() +
  labs(x = "Center Left Welfare Policy Position",
       y = "Proportion of Unemployed Voting for Center Left")

Denis Cohen

is a postdoctoral fellow in the Data and Methods Unit at the Mannheim Centre for European Social Research (MZES), University of Mannheim, and one of the organizers of the MZES Social Science Data Lab. His research focus lies at the intersection of political preference formation, electoral behavior, and political competition. His methodological interests include quantitative approaches to the analysis of clustered data, measurement models, data visualization, strategies for causal identification, and Bayesian statistics.

Cosima Meyer

is a doctoral researcher and lecturer at the University of Mannheim and one of the organizers of the MZES Social Science Data Lab. Motivated by the continuing recurrence of conflicts in the world, her research interest on conflict studies became increasingly focused on post-civil war stability. In her dissertation, she analyzes leadership survival – in particular in post-conflict settings. Using a wide range of quantitative methods, she further explores questions on conflict elections, women’s representation as well as autocratic cooperation.

Marcel Neunhoeffer

is a PhD Candidate and Research Associate at the chair of Political Science, Quantitative Methods in the Social Sciences, at the University of Mannheim. His research focuses on political methodology, specifically on the application of deep learning algorithms to social science problems. His substantive interests include data privacy, political campaigns, and forecasting elections.

Oliver Rittmann is a PhD Candidate and Research Associate at the chair of Political Science, Quantitative Methods in the Social Sciences, at the University of Mannheim. His research focuses on legislative studies and political representation. His methodological expertise includes statistical modeling, automated text and video analysis, and subnational public opinion estimation.

1. {By default, when several packages use identical names for different functions, R and RStudio use the package that was loaded last. This masks functions of the same name from packages loaded earlier. The order in which you load your packages in R is therefore important. It is good practice to prefix each function with the required package, i.e., packagename::function(). Alternatively, the conflicted package is helpful: it turns every package conflict into an error and tells you which packages export a function of the same name. You can then identify the package that you want or need to use. This way, you make sure that you do not accidentally use a function from a different package without noticing it.}

2. {Additional problems could arise if a national election had taken place during the ESS field period. To identify such cases, we could simply create a true/false indicator variable using mutate(flag = election_date >= field_start & election_date <= field_end).}


### Clustered randomized trials and the design effect

Tue, 02/18/2020 - 01:00

I am always saying that simulation can help illuminate interesting statistical concepts or ideas. The design effect that underlies much of clustered analysis could benefit from a little exploration through simulation. I’ve written about cluster-related methods so much on this blog that I won’t provide links – just peruse the list of entries on the home page and you are sure to spot a few. But I haven’t written explicitly about the design effect.

When individual outcomes in a group are correlated, we learn less about the group from adding a new individual than we might think. Take an extreme example where every individual in a group is perfectly correlated with all the others: we will learn nothing new about the group by adding someone new. In fact, we might as well just look at a single member, since she is identical to all the others. The design effect is a value that in a sense quantifies how much information we lose (or, surprisingly, possibly gain) by this interdependence.
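To make this intuition concrete, here is a minimal base-R sketch (the values of m and rho are illustrative): under an exchangeable correlation structure with correlation rho among the m members of a group, the variance of the group mean is inflated by exactly the factor 1 + (m-1)*rho relative to independent observations, so each added member contributes less new information.

```r
# Variance of the mean of m equally correlated (exchangeable) observations.
# Analytically: Var(ybar) = sigma2 / m * (1 + (m - 1) * rho)
var_of_mean <- function(m, rho, sigma2 = 1) {
  S <- matrix(rho * sigma2, m, m)  # exchangeable covariance matrix
  diag(S) <- sigma2
  w <- rep(1 / m, m)               # weights of the sample mean
  drop(t(w) %*% S %*% w)
}

m <- 10; rho <- 0.3
var_of_mean(m, rho)            # via the covariance matrix
(1 / m) * (1 + (m - 1) * rho)  # closed form, same value
```

With rho = 0, both expressions reduce to the familiar sigma2/m; with rho = 1, the variance of the mean equals sigma2, as if there were a single observation.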

Let’s just jump right into it.

The context

Imagine a scenario where an underlying population of interest is structurally defined by a group of clusters. The classic case is students in schools or classrooms. I don’t really do any school-based education research (I learned from debating my teacher-wife that this is a dangerous area to tread), but this example seems so clear. We might be interested in measuring the effect of some intervention (it may or may not take place in school) on an educational attainment outcome of high school-aged kids in a city (I am assuming a continuous outcome here just because it is so much easier to visualize). It does not seem crazy to think that the outcomes of kids from the same school might be correlated, either because the school itself does such a good (or poor) job of teaching or because similar types of kids tend to go to the same school.

The unit of randomization

We have at least three ways to design our study. We could just recruit kids out and about in the city and randomize them each individually to intervention or control. In the second approach, we decide that it is easier to randomize the schools to intervention or control – and recruit kids from each of the schools. This means that all kids from one school will be in the same intervention arm. And for the third option, we can go halfway: we go to each school and recruit kids, randomizing half of the kids in each school to control, and the other half to the intervention. This last option assumes that we could ensure that the kids in the school exposed to the intervention would not influence their unexposed friends.

In all three cases the underlying assumptions are the same – there is a school effect on the outcome, an individual effect, and an intervention effect. But it turns out that the variability of the intervention effect depends entirely on how we randomize. And since variability of the outcome affects sample size, each approach has implications for sample size. (I’ll point you to a book by Donner & Klar, which gives a comprehensive and comprehensible overview of cluster randomized trials.)

Simulation of each design

Just to be clear about these different randomization designs, I’ll simulate 1500 students under each one. I’ve set a seed in case you’d like to recreate the results shown here (and indicate the libraries I am using).

library(simstudy)
library(data.table)
library(ggplot2)
library(clusterPower)
library(parallel)
library(lmerTest)

RNGkind("L'Ecuyer-CMRG") # enables seed for parallel process
set.seed(987)

Randomization by student

I’ve written a function for each of the three designs to generate the data, because later I am going to need to generate multiple iterations of each design. In the first case, randomization is applied to the full group of students:

independentData <- function(N, d1) {
  di <- genData(N)
  di <- trtAssign(di, grpName = "rx")
  di <- addColumns(d1, di)
  di[]
}

The outcome is a function of intervention status and a combined effect of the student’s school and the student herself. We cannot disentangle the variance components, because we do not know the identity of the school:

defI1 <- defDataAdd(varname = "y", formula = "0.8 * rx",
                    variance = 10, dist = "normal")

dx <- independentData(N = 30 * 50, defI1)

The observed effect size and variance should be close to the specified parameters of 0.8 and 10, respectively:

dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]
## [1] 1

dx[, var(y)]
## [1] 10.3

Here is a plot of the individual observations that highlights the group differences and individual variation:

Randomization by site

Next, the intervention status is assigned to each of the $$k$$ schools/clusters before generating $$m$$ students per cluster. In this case, the outcome (defined by defI2) is a function of the cluster effect, individual effect, and the intervention status. Note here, the variance components are disentangled, but together they sum to 10, suggesting that total variance should be the same as the first scenario:

clusteredData <- function(k, m, d1, d2) {
  dc <- genData(k, d1)
  dc <- trtAssign(dc, grpName = "rx")
  di <- genCluster(dc, "site", m, level1ID = "id")
  di <- addColumns(d2, di)
  di[]
}

defC <- defData(varname = "ceff", formula = 0, variance = 0.5,
                id = "site", dist = "normal")

defI2 <- defDataAdd(varname = "y", formula = "ceff + 0.8 * rx",
                    variance = 9.5, dist = "normal")

dx <- clusteredData(k = 30, m = 50, defC, defI2)

The effect size and variation across all observations should be quite similar to the previous design, though now the data has a structure that is determined by the clusters:

dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]
## [1] 1.4

dx[, var(y)]
## [1] 11

Randomization within site

In the last design, the treatment assignment is made after both the clusters and individuals have been generated. Randomization within site is specified using the strata argument:

withinData <- function(k, m, d1, d2) {
  dc <- genData(k, d1)
  di <- genCluster(dc, "site", m, "id")
  di <- trtAssign(di, strata = "site", grpName = "rx")
  di <- addColumns(d2, di)
  di[]
}

dx <- withinData(30, 50, defC, defI2)

dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]
## [1] 0.787

dx[, var(y)]
## [1] 10.1

The design effect

There’s a really nice paper by Vierron & Giraudeau that describes many of the issues I am only touching on here. In particular, they define the design effect and then relate this definition to formulas that are frequently used to simplify the estimation of the design effect.

Consider the statistics $$\sigma^2_{\Delta_{bc}}$$ and $$\sigma^2_{\Delta_{i}}$$, which are the variance of the effect sizes under the cluster randomization and the individual randomization designs, respectively:

$\sigma^2_{\Delta_{bc}} = Var(\bar{Y}_1^{bc} - \bar{Y}_0^{bc})$

and

$\sigma^2_{\Delta_{i}} = Var(\bar{Y}_1^{i} - \bar{Y}_0^{i})$

These variances are never observed, since they are based on a very large (really, an infinite) number of repeated experiments. However, the theoretical variances can be derived (as they are in the paper), and can be simulated (as they will be here). The design effect $$\delta_{bc}$$ is defined as

$\delta_{bc} = \frac{\sigma^2_{\Delta_{bc}}}{\sigma^2_{\Delta_{i}}}$

This ratio represents the adjustment in sample size required to make the two designs equivalent in the sense that they provide the same amount of information. This will hopefully become clear with the simulations below.

I have decided to use $$k = 50$$ clusters to ensure enough clusters to estimate the variance properly. I need to know how many individuals per cluster are required for 80% power in the cluster randomized design, given the effect size and variance assumptions I’ve been using here. I’ll use the clusterPower package (which unfortunately uses $$m$$ for the number of clusters in each arm, so don’t let that confuse you). Based on this, we should have 18 students per school, for a total sample of 900 students:

crtpwr.2mean(m = 50/2, d = 0.8, icc = 0.05, varw = 9.5)
##    n
## 17.9

Now, I am ready to generate effect sizes for each of 2000 iterations of the experiment assuming randomization by cluster. With this collection of effect sizes in hand, I will be able to estimate their variance:

# Number of iterations per repeated experiment (see text)
niters <- 2000

genDifFromClust <- function(k, m, d1, d2) {
  dx <- clusteredData(k, m, d1, d2)
  dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]
}

resC <- unlist(mclapply(1:niters,
  function(x) genDifFromClust(k = 50, m = 18, defC, defI2)))

Here is an estimate of $$\sigma^2_{\Delta_{bc}}$$ based on the repeated experiments:

(s2.D_bc <- var(resC))
## [1] 0.0818

And here is the estimate of $$\sigma^2_{\Delta_{i}}$$ (the variance of the effect sizes based on individual-level randomization experiments):

genDifFromInd <- function(N, d1) {
  dx <- independentData(N, d1)
  dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]
}

resI <- unlist(mclapply(1:niters,
  function(x) genDifFromInd(N = 50 * 18, defI1)))

(s2.D_i <- var(resI))
## [1] 0.0432

So, now we can use these variance estimates to derive the estimate of the design effect $$\delta_{bc}$$, which, based on the earlier definition, is:

(d_bc <- s2.D_bc / s2.D_i)
## [1] 1.89

The Vierron & Giraudeau paper derives a simple formula for the design effect assuming equal cluster sizes and an ICC $$\rho$$. This (or some close variation, when cluster sizes are not equal) is quite commonly used:

$\delta_{bc} = 1 + (m-1)*\rho$

As the ICC increases, the design effect increases. Based on the parameters for $$m$$ and $$\rho$$ we have been using in these simulations (note that $$\rho = 0.5/(0.5+9.5) = 0.05$$), the standard formula gives us this estimate of $$\delta_{bc.formula}$$ that is quite close to our experimental value:

(d_bc_form <- 1 + (18 - 1) * 0.05)
## [1] 1.85

But what is the design effect?

OK, finally, we can now see what the design effect actually represents. As before, we will generate repeated data sets; this time, we will estimate the treatment effect using an appropriate model. (In the case of cluster randomization, this is a linear mixed effects model; in the case of individual randomization, it is a linear regression model.) For each iteration, I am saving the p-value for the treatment effect parameter in the model. We expect close to 80% of the p-values to be lower than 0.05 (this is 80% power given a true treatment effect of 0.8).
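The same recipe, estimating power as the share of p-values below 0.05 across simulated experiments, can be sketched with plain base R. This toy version uses a two-sample t-test in place of the regression models, with the effect size (0.8) and total variance (10) borrowed from the simulations above:

```r
# Estimate power by simulation: the share of p-values below 0.05
set.seed(42)
pvals <- replicate(2000, {
  y0 <- rnorm(250, mean = 0,   sd = sqrt(10))  # control arm
  y1 <- rnorm(250, mean = 0.8, sd = sqrt(10))  # treatment arm, effect 0.8
  t.test(y1, y0)$p.value
})
mean(pvals < 0.05)  # empirical power, roughly 0.8 with these settings
```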

First, here is the cluster randomized experiment and the estimate of power:

genEstFromClust <- function(k, m, d1, d2) {
  dx <- clusteredData(k, m, d1, d2)
  summary(lmerTest::lmer(y ~ rx + (1|site), data = dx))$coef["rx", 5]
}

resCest <- unlist(mclapply(1:niters,
  function(x) genEstFromClust(k = 50, m = 18, defC, defI2)))

mean(resCest < 0.05) # power
## [1] 0.805

In just over 80% of the cases, we would have rejected the null. And here is the estimated power under the individual randomization experiment, but with a twist. Since the design effect is 1.85, the cluster randomized experiment needs a sample size 1.85 times larger than an equivalent (individual-level) RCT to provide the same information, or to have equivalent power. So, in our simulations, we will use a reduced sample size for the individual RCT. Since we used 900 individuals in the CRT, we need only $$900/1.85 \approx 487$$ individuals in the RCT:

(N.adj <- ceiling(50 * 18 / d_bc_form))
## [1] 487

genEstFromInd <- function(N, d1) {
  dx <- independentData(N, d1)
  summary(lm(y ~ rx, data = dx))$coef["rx", 4]
}

resIest <- unlist(mclapply(1:niters,
  function(x) genEstFromInd(N = N.adj, defI1)))

The power for this second experiment is also quite close to 80%:

mean(resIest < 0.05) # power
## [1] 0.796

Within cluster randomization

It is interesting to see what happens when we randomize within the cluster. I think there may be some confusion here, because I have seen folks incorrectly apply the standard formula for $$\delta_{bc}$$, rather than this formula for $$\delta_{wc}$$ that is derived (again, under the assumption of equal cluster sizes) in the Vierron & Giraudeau paper as

$\delta_{wc} = 1- \rho$

This implies that the sample size requirement actually declines as intra-cluster correlation increases! In this case, since $$\rho = 0.05$$, the total sample size for the within-cluster randomization needs to be only 95% of the sample size for the individual RCT.
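The two design effects can be wrapped in a small helper for comparing scenarios. This is a sketch built directly from the equal-cluster-size formulas above; the values of m and rho match the simulations in this post:

```r
# Design effects under equal cluster sizes (Vierron & Giraudeau):
# randomizing whole clusters inflates the required sample size,
# randomizing individuals within clusters slightly deflates it
design_effects <- function(m, rho) {
  c(between = 1 + (m - 1) * rho,  # delta_bc: cluster randomization
    within  = 1 - rho)            # delta_wc: within-cluster randomization
}

design_effects(m = 18, rho = 0.05)  # between = 1.85, within = 0.95
```

Note the asymmetry: delta_bc grows with the cluster size m, while delta_wc depends on rho alone.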

As before, let’s see if the simulated data confirms this design effect based on the definition

$\delta_{wc} = \frac{\sigma^2_{\Delta_{wc}}}{\sigma^2_{\Delta_{i}}}$

genDifFromWithin <- function(k, m, d1, d2) {
  dx <- withinData(k, m, d1, d2)
  dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]
}

resW <- unlist(mclapply(1:niters,
  function(x) genDifFromWithin(k = 50, m = 18, defC, defI2)))

(s2.D_wc <- var(resW))
## [1] 0.0409

The estimated design effect is quite close to the expected design effect of 0.95:

(d_wc <- s2.D_wc / s2.D_i)
## [1] 0.947

And to finish things off, we can estimate an adjusted cluster size based on both design effects: we first reduce the cluster size $$m=18$$ of the cluster randomized trial by $$\delta_{bc.formula}$$ to get the appropriate sample size for the RCT, and then adjust by $$\delta_{wc} = 0.95$$ to get the appropriate cluster size for the within-cluster randomization, which is about 9 students. This study will have only 450 students, fewer than the RCT:

(m.adj <- round((18 / d_bc_form) * 0.95, 0))
## [1] 9

genEstFromWithin <- function(k, m, d1, d2) {
  dx <- withinData(k, m, d1, d2)
  summary(lmerTest::lmer(y ~ rx + (1|site), data = dx))$coef["rx", 5]
}

resWest <- unlist(mclapply(1:niters,
  function(x) genEstFromWithin(k = 50, m = ceiling(m.adj), defC, defI2)))

mean(resWest < 0.05)
## [1] 0.786

References:

Donner, Allan, and Neil Klar. Design and Analysis of Cluster Randomization Trials in Health Research. New York (2010).

Vierron, Emilie, and Bruno Giraudeau. “Design effect in multicenter studies: gain or loss of power?” BMC Medical Research Methodology 9, no. 1 (2009): 39.

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

### Quantile Regression (home made, part 2)

Mon, 02/17/2020 - 21:38

[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers].

A few months ago, I posted a note with some home-made code for quantile regression… there was something odd in the output, but it was because there was a (small) mathematical problem in my equation. So since I should teach this tomorrow, let me fix it.

Median

Consider a sample $$\{y_1,\cdots,y_n\}$$.
To compute the median, solve
\min_\mu \left\lbrace\sum_{i=1}^n|y_i-\mu|\right\rbrace
which can be solved using linear programming techniques. More precisely, this problem is equivalent to
\min_{\mu,\mathbf{a},\mathbf{b}}\left\lbrace\sum_{i=1}^n a_i+b_i\right\rbrace
with a_i,b_i\geq 0 and y_i-\mu=a_i-b_i, \forall i=1,\cdots,n. Heuristically, the idea is to write y_i=\mu+\varepsilon_i, and then define the a_i‘s and b_i‘s so that \varepsilon_i=a_i-b_i and |\varepsilon_i|=a_i+b_i, i.e.
a_i=(\varepsilon_i)_+=\max\lbrace0,\varepsilon_i\rbrace=|\varepsilon_i|\cdot\boldsymbol{1}_{\varepsilon_i>0}
and
b_i=(-\varepsilon_i)_+=\max\lbrace0,-\varepsilon_i\rbrace=|\varepsilon_i|\cdot\boldsymbol{1}_{\varepsilon_i<0}
which denote respectively the positive and the negative parts.

Unfortunately (that was the error in my previous post), the standard form of a linear program is
\min_{\mathbf{z}}\left\lbrace\boldsymbol{c}^\top\mathbf{z}\right\rbrace\text{ s.t. }\boldsymbol{A}\mathbf{z}=\boldsymbol{b},\ \mathbf{z}\geq\boldsymbol{0}
With the a_i‘s and b_i‘s above, we’re not far away, except that we have \mu\in\mathbb{R}, while it should be positive. So similarly, set \mu=\mu^+-\mu^- where \mu^+=(\mu)_+ and \mu^-=(-\mu)_+.
Thus, let
\mathbf{z}=\big(\mu^+;\mu^-;\boldsymbol{a};\boldsymbol{b}\big)^\top\in\mathbb{R}_+^{2n+2}
and then write the constraint as \boldsymbol{A}\mathbf{z}=\boldsymbol{b} with \boldsymbol{b}=\boldsymbol{y} and
\boldsymbol{A}=\big[\boldsymbol{1}_n;-\boldsymbol{1}_n;\mathbb{I}_n;-\mathbb{I}_n\big]
And for the objective function
\boldsymbol{c}=\big(0,0,\boldsymbol{1}_n,\boldsymbol{1}_n\big)^\top\in\mathbb{R}_+^{2n+2}

To illustrate, consider a sample from a lognormal distribution,

n = 101
set.seed(1)
y = rlnorm(n)
median(y)
## [1] 1.077415

For the optimization problem, we use the matrix form, with n equality constraints and 2n+2 parameters,

library(lpSolve)
X = rep(1,n)
A = cbind(X, -X, diag(n), -diag(n))
b = y
c = c(rep(0,2), rep(1,n), rep(1,n))
equal_type = rep("=", n)
r = lp("min", c, A, equal_type, b)
head(r$solution,1)
## [1] 1.077415

It looks like it’s working well…
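As a further sanity check (my addition, not in the original post), we can verify in base R that the median is indeed the minimizer of the sum of absolute deviations that the linear program targets:

```r
# sanity check: the median minimizes the l1 objective used by the LP
set.seed(1)
y <- rlnorm(101)

obj <- function(mu) sum(abs(y - mu))       # l1 loss, sum |y_i - mu|
mu.hat <- optimize(obj, range(y))$minimum  # numerical minimizer

c(mu.hat, median(y))  # both approximately 1.077415, as in the LP output above
```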

Quantile

Of course, we can adapt our previous code for quantiles

tau = .3
quantile(y,tau)
##       30%
## 0.6741586

The linear program is now
\min_{q^+,q^-,\mathbf{a},\mathbf{b}}\left\lbrace\sum_{i=1}^n\tau a_i+(1-\tau)b_i\right\rbrace
with a_i,b_i,q^+,q^-\geq 0 and y_i=q^+-q^-+a_i-b_i, \forall i=1,\cdots,n. The R code is now

c = c(rep(0,2), tau*rep(1,n), (1-tau)*rep(1,n))
r = lp("min", c, A, equal_type, b)
head(r$solution,1)
## [1] 0.6741586

So far so good…

Quantile Regression

Consider the following dataset, with rents of flats in a major German city, as a function of the surface, the year of construction, etc.

base=read.table("http://freakonometrics.free.fr/rent98_00.txt",header=TRUE)

The linear program for the quantile regression is now
\min_{\boldsymbol{\beta}^+,\boldsymbol{\beta}^-,\mathbf{a},\mathbf{b}}\left\lbrace\sum_{i=1}^n\tau a_i+(1-\tau)b_i\right\rbrace
with a_i,b_i\geq 0 and y_i=\boldsymbol{x}_i^\top[\boldsymbol{\beta}^+-\boldsymbol{\beta}^-]+a_i-b_i, \forall i=1,\cdots,n and \beta_j^+,\beta_j^-\geq 0, \forall j=0,\cdots,k. So we use here

require(lpSolve)
tau = .3
n = nrow(base)
X = cbind(1, base$area)
y = base$rent_euro
K = ncol(X)
N = nrow(X)
A = cbind(X, -X, diag(N), -diag(N))
c = c(rep(0, 2*ncol(X)), tau*rep(1,N), (1-tau)*rep(1,N))
b = base$rent_euro
const_type = rep("=", N)
r = lp("min", c, A, const_type, b)
beta = r$sol[1:K] - r$sol[(1:K)+K]
beta
## [1] 148.946864   3.289674

Of course, we can use R function to fit that model

library(quantreg)
rq(rent_euro~area, tau=tau, data=base)
## Coefficients:
## (Intercept)        area
##  148.946864    3.289674

Here again, it seems to work quite well. We can use a different probability level, of course, and get a plot

plot(base$area, base$rent_euro,
     xlab=expression(paste("surface (",m^2,")")),
     ylab="rent (euros/month)", col=rgb(0,0,1,.4), cex=.5)
sf = 0:250
yr = r$solution[2*n+1] + r$solution[2*n+2]*sf
lines(sf, yr, lwd=2, col="blue")
tau = .9
r = lp("min", c, A, const_type, b)
tail(r$solution,2)
## [1] 121.815505   7.865536
yr = r$solution[2*n+1] + r$solution[2*n+2]*sf
lines(sf, yr, lwd=2, col="blue")

And we can adapt the latter to multiple regression, of course,

X = cbind(1, base$area, base$yearc)
K = ncol(X)
N = nrow(X)
A = cbind(X, -X, diag(N), -diag(N))
c = c(rep(0, 2*ncol(X)), tau*rep(1,N), (1-tau)*rep(1,N))
b = base$rent_euro
const_type = rep("=", N)
r = lp("min", c, A, const_type, b)
beta = r$sol[1:K] - r$sol[(1:K)+K]
beta
## [1] -5542.503252     3.978135     2.887234

to be compared with

library(quantreg)
rq(rent_euro ~ area + yearc, tau=tau, data=base)
## Coefficients:
##  (Intercept)         area        yearc
## -5542.503252     3.978135     2.887234
##
## Degrees of freedom: 4571 total; 4568 residual

### Part 6: How not to validate your model with optimism corrected bootstrapping

Mon, 02/17/2020 - 17:32

When evaluating a machine learning model, using the same data to train and test the model results in overfitting: the model performs much better in predictive ability than it would on completely new data, because it learns from the random noise within the data when making its predictions. New data will have different noise, so a model that has overfitted to the training noise will struggle to predict accurately on data it has not seen.

Keeping the training and test datasets completely separate throughout the machine learning process, from feature selection to model fitting, mitigates this problem. This is the basis for cross validation, which splits the data into different chunks and iterates over them, holding each chunk out in turn. Normally, we use 100 times repeated 10-fold cross validation for medium to large datasets and leave-one-out cross validation (LOOCV) for small datasets.
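As a minimal illustration of the idea (a base-R sketch, not the pipeline used in this post), k-fold cross validation just partitions the rows, holds each fold out in turn, and averages the held-out error:

```r
# 10-fold cross validation of a simple linear model on mtcars (base R sketch)
set.seed(1)
k <- 10
n <- nrow(mtcars)
folds <- sample(rep(1:k, length.out = n))   # random fold assignment

cv.mse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt, data = train)       # fit on the k-1 training folds
  mean((test$mpg - predict(fit, test))^2)   # error on the held-out fold
})

mean(cv.mse)  # cross-validated mean squared error
```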

There is another technique used to correct for the “optimism” that results from fitting a model to the same data used to test it, called optimism corrected bootstrapping. However, it is based on unsound statistical principles that can introduce major bias into the results under certain conditions.

This is the optimism corrected bootstrapping method:

1. Fit a model M to the entire data S and estimate its predictive ability, C.
2. Iterate for b=1…B:
   1. Take a bootstrap sample S* from the original data S.
   2. Fit a bootstrap model M* to S* and get its predictive ability on S*, C_boot.
   3. Use the bootstrap model M* to get its predictive ability on the original data S, C_orig.
3. Calculate the optimism as O = mean(C_boot – C_orig).
4. Calculate the optimism corrected performance as C – O.
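To make the steps concrete, here is a compact base-R sketch of the procedure, using a logistic model and in-sample accuracy instead of AUC (the data, variable names, and metric here are my own, for illustration only):

```r
# optimism corrected bootstrap, steps 1-4, on a small simulated dataset
set.seed(1)
n <- 60
d <- data.frame(x = rnorm(n), y = rbinom(n, 1, 0.5))

# "predictive ability" C: in-sample classification accuracy
acc <- function(fit, data) {
  mean((predict(fit, data, type = "response") > 0.5) == data$y)
}

# step 1: fit M to the entire data S and estimate C
C <- acc(glm(y ~ x, data = d, family = binomial), d)

# step 2: bootstrap loop
B <- 100
optimism <- replicate(B, {
  Sstar <- d[sample(n, replace = TRUE), ]              # 2.1 bootstrap sample S*
  Mstar <- glm(y ~ x, data = Sstar, family = binomial) # 2.2 fit M* to S*
  acc(Mstar, Sstar) - acc(Mstar, d)                    # C_boot - C_orig
})

# steps 3 and 4: optimism O, and the corrected performance C - O
O <- mean(optimism)
C - O
```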

As I have stated previously, because the same data is used for training and testing in step 3 of the bootstrap loop, there is an information leak here: the data used to fit M* overlaps heavily with the data used to test it when M* is applied to S. This leads to C_orig being too high, so the optimism (O) is expected to be too low, and the corrected performance from step 4 ends up too high. This can be a major problem.

The strength of this bias depends strongly on: 1) the number of variables used, 2) the number of samples used, and 3) the machine learning model used.

I can show the bias using glmnet and random forest in these new experiments. This code differs from Part 5 because I have changed the ROC calculation to use my own evaluation package, MLeval (https://cran.r-project.org/web/packages/MLeval/index.html), and I am also examining random forest instead of just logistic regression.

I am going to use a low-N, high-p simulation, because that is what I am used to working with (N is the number of observations and p the number of features). Such dimensionalities are very common in the health care and bioinformatics industries and in academia, where using this method can lead to highly erroneous results, some of which end up in publications.

I am simulating null datasets here with random features sampled from a Gaussian distribution, so we should not expect any predictive ability; the area under the ROC curve should be about 0.5 (chance expectation).
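The underlying problem is easy to demonstrate in base R alone: with enough random features, in-sample fit looks impressive even though the outcome is pure noise (a sketch of my own, separate from the experiments below):

```r
# in-sample performance on pure noise inflates with the number of features
set.seed(42)
n <- 100
p <- 50
X <- matrix(rnorm(n * p), n, p)   # 50 random Gaussian features
y <- rnorm(n)                     # outcome unrelated to X

fit <- lm(y ~ X)
summary(fit)$r.squared            # large in-sample R^2 despite no signal
```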

Experiment 1: my implementation.

N=100 and p=5 to 500 by 50. Random forest can be selected by using ‘ranger’ or lasso by using ‘glmnet’.

## my implementation of optimism corrected bootstrapping
library(ggplot2)
library(MLeval)
library(ranger)
library(glmnet)

N <- 100
alg <- 'ranger'

## loop over increasing numbers of features
cc <- c()
for (zz in seq(5,500,50)){
  print(zz)
  ## set up test data
  test <- matrix(rnorm(N*zz, mean = 0, sd = 1),
                 nrow = N, ncol = zz, byrow = TRUE)
  labelsa <- as.factor(sample(c(rep('A',N/2),rep('B',N/2))))
  colnames(test) <- paste('Y',seq(1,zz),sep='')
  row.names(test) <- paste('Z',seq(1,N),sep='')
  test <- data.frame(test)
  test[,zz+1] <- labelsa
  colnames(test)[zz+1] <- 'XXX'
  print(dim(test))
  ## 1. fit model to entire data and predict labels on same data
  if (alg == 'glmnet'){
    #orig <- glm(XXX~.,data=test,family='binomial')
    X <- test[,1:zz]
    Y <- labelsa
    orig <- glmnet(as.matrix(X),Y,family='binomial')
    preds <- predict(orig,newx=as.matrix(test[,-(zz+1)]),type='response',s=0.01)
  }else{
    orig <- ranger(XXX~.,data=test,probability=TRUE)
    preds <- predict(orig,data=test[,-(zz+1)],type='response')
  }
  ## 2. use MLeval to get the AUC
  if (alg == 'ranger'){
    preds <- data.frame(preds$predictions)
  }else{
    preds <- data.frame(B=preds[,1],A=1-preds[,1])
  }
  preds$obs <- labelsa
  x <- evalm(preds,silent=TRUE,showplots=FALSE)
  x <- x$stdres
  auc <- x$Group1[13,1]
  original_auc <- auc
  print(paste('original:',auc))
  ## 3. bootstrapping to estimate the optimism
  B <- 50
  results <- matrix(ncol=2,nrow=B)
  for (b in seq(1,B)){
    # get the bootstrapped data
    boot <- test[sample(row.names(test),N,replace=TRUE),]
    labelsb <- boot[,ncol(boot)]
    # use the bootstrapped model to predict bootstrapped data labels
    if (alg == 'ranger'){
      bootf <- ranger(XXX~.,data=boot,probability=TRUE)
      preds <- predict(bootf,data=boot[,-(zz+1)],type='response')
    }else{
      #bootf <- glm(XXX~.,data=boot,family='binomial')
      X <- boot[,1:zz]
      Y <- labelsb
      bootf <- glmnet(as.matrix(X),Y,family='binomial')
      preds <- predict(bootf,newx=as.matrix(boot[,-(zz+1)]),type='response',s=0.01)
    }
    # get AUC of the bootstrap model on the bootstrap data
    if (alg == 'ranger'){
      preds <- data.frame(preds$predictions)
    }else{
      preds <- data.frame(B=preds[,1],A=1-preds[,1])
    }
    preds$obs <- labelsb
    x <- evalm(preds,silent=TRUE,showplots=FALSE)
    x <- x$stdres
    boot_auc <- x$Group1[13,1]
    # use the bootstrap model to predict the original labels (entire data)
    if (alg == 'ranger'){
      preds <- predict(bootf,data=test[,-(zz+1)],type='response')
    }else{
      preds <- predict(bootf,newx=as.matrix(test[,-(zz+1)]),type='response',s=0.01)
    }
    # get AUC
    if (alg == 'ranger'){
      preds <- data.frame(preds$predictions)
    }else{
      preds <- data.frame(B=preds[,1],A=1-preds[,1])
    }
    preds$obs <- labelsa
    x <- evalm(preds,silent=TRUE,showplots=FALSE)
    x <- x$stdres
    boot_original_auc <- x$Group1[13,1]
    ## add the data to the results
    results[b,1] <- boot_auc
    results[b,2] <- boot_original_auc
  }
  ## calculate the optimism
  O <- mean(results[,1]-results[,2])
  ## calculate the optimism corrected measure of prediction (AUC)
  corrected <- original_auc-O
  ## append
  cc <- c(cc,corrected)
  print(cc)
}

We can see that even with random features, optimism corrected bootstrapping can give very inflated performance metrics: random data should give an AUC of about 0.5. This experiment shows the value of simulating null datasets and testing your machine learning pipeline on data with similar characteristics to the data you will be using.

Below is code and results from the Caret implementation of this method (https://cran.r-project.org/web/packages/caret/index.html). Note that the results are very similar to mine.

Caret will tune over a range of lambda for lasso by default (results below), whereas above I just selected a single value of this parameter. The bias for this method is much worse for models such as random forest and gbm, I suspect because they use decision trees, which are non-linear and so can more easily build accurate models out of noise when tested on the same data used to train them.

You can experiment easily with different types of cross validation and bootstrapping on null datasets with Caret, and I invite you to do so: if you change the cross validation method below to, e.g., LOOCV or repeatedcv, you will see that these methods are not subject to this major optimistic-results bias. They have their own issues, but I have found that optimism corrected bootstrapping can very easily give completely the wrong conclusion, and it seems not everyone knows of this problem before using it.

I recommend using Caret for machine learning; the model type and cross validation method can be changed simply, without much effort. If ranger is changed to gbm in the code below, we get a similar bias to using ranger.

Experiment 2: Caret implementation.

N=100 and p=5 to 500 by 50. Random forest can be selected by using ‘ranger’ or lasso by using ‘glmnet’.

## caret optimism corrected bootstrapping test
library(caret)

cc <- c()
i = 0
N <- 100
## iterate over increasing numbers of features
for (zz in seq(5,500,50)){
  i = i + 1
  # simulate data
  test <- matrix(rnorm(N*zz, mean = 0, sd = 1),
                 nrow = N, ncol = zz, byrow = TRUE)
  labelsa <- as.factor(sample(c(rep('A',N/2),rep('B',N/2))))
  colnames(test) <- paste('Y',seq(1,zz),sep='')
  row.names(test) <- paste('Z',seq(1,N),sep='')
  test <- data.frame(test)
  test[,zz+1] <- labelsa
  colnames(test)[zz+1] <- 'XXX'
  # optimism corrected bootstrapping with caret
  ctrl <- trainControl(method = 'optimism_boot',
                       summaryFunction=twoClassSummary,
                       classProbs=T,
                       savePredictions = T,
                       verboseIter = F)
  fit4 <- train(as.formula( paste( 'XXX', '~', '.' ) ),
                data=test,
                method="ranger",
                # preProc=c("center", "scale")
                trControl=ctrl,
                metric = "ROC")
  #
  cc <- c(cc, max(fit4$results$ROC))
  print(max(fit4$results$ROC))
}

https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/


### Creating MS Word reports using the officer package

Mon, 02/17/2020 - 13:30

[This article was first published on R – The Princess of Science, and kindly contributed to R-bloggers].

Commonly, the final product that a data scientist or a statistician generates is a report, usually in MS Word format. The officer package enables generating such a report from within R. It also enables generating PowerPoint presentations, but this is beyond the scope of this post.

While the package has many great features, using it is not intuitive: the package manual covers no fewer than 80 pages, with many functions that control many aspects of Word documents and PowerPoint presentations.

What I really need is to perform a few recurring tasks: inserting tables and figures into my Word document, adding titles, empty lines and page breaks, and applying a standard format to all of my reports.

For these needs, I wrote a few functions that enable me to create standardized Word document reports using an intuitive syntax.

In addition to the officer package, the flextable package is also needed:

library(officer) library(flextable)

The first function creates a new object that represents the Word document to be created. It simply wraps officer’s read_docx() function; note that if an object with the same name already exists, it is overwritten with a new “clean” object:

# create a new Word document object
new.word.doc=function(){
  my.doc=read_docx()
  return(my.doc)
}

The next two functions add an empty line or a page break:
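The code for these two helpers did not survive in this copy of the post; a minimal sketch, consistent with how add.empty.line() and add.page.break() are called in the final example (the function names come from that usage, the bodies are my guess), might look like this:

```r
# add an empty line (sketch; the original helper code is missing from this copy)
add.empty.line=function(doc){
  body_add_par(doc, " ")
  return("empty line added")
}

# add a page break (sketch)
add.page.break=function(doc){
  body_add_break(doc, pos="after")
  return("page break added")
}
```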

The next two functions are used to set the orientation of the next page to landscape, and then back to portrait:

# start landscape orientation
start.landscape=function(doc){
  doc=body_end_section_continuous(doc)
  return("landscape orientation started")
}

# end landscape orientation
end.landscape=function(doc){
  doc=body_end_section_landscape(doc)
  return("landscape orientation ended")
}

This function adds a title. I use the fp_text() function to set the font size to 14, and the font type to bold. I also add an empty line after the title:
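The add.title() code is also missing from this copy of the post; based on the description (fp_text() with size 14 and bold, plus a trailing empty line), a sketch might be:

```r
# add a bold, size-14 title followed by an empty line
# (sketch; the original function body is missing from this copy)
add.title=function(doc, my.title){
  my.prop=fp_text(font.size=14, bold=TRUE)
  body_add_fpar(doc, fpar(ftext(my.title, prop=my.prop)))
  body_add_par(doc, " ")
  return("title added")
}
```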

The next function is a little bit more complicated. It adds to the document a figure that already exists as an image file in the working directory. The parameters h and w indicate the height and width of the image within the document, in inches. My usual values of choice are h=5 and w=5, so I set them as defaults. I align the figure to the center of the page.

# add an image, such as a jpg or png file
add.image=function(doc, image, h=5, w=5){
  body_add_img(doc, src=image,
               height=h, width=w,
               style="centered")
  return("image added")
}

The last function, add.table(), is the most complicated one. The table to be added is a data frame, and my function assumes that the variables in this data frame are either character, factor or numeric.

I have two parameters to set the number of decimals to present for each of the numeric variables. The first parameter receives the names of the numeric variables, and the second contains the number of decimals for each of these variables, respectively. Note that I do not check the inputs of these two parameters; I trust myself to insert the right input.

In addition, the function sets the format of the table (borders, fonts, and header formatting) according to my standard report style, using officer functions such as border_outer(), bold() etc.

# add a data frame as a table
add.table=function(doc, tbl, col.keys=NULL, col.digits=NULL){
  # create a basic flextable
  f.table=qflextable(tbl)
  # set the number of decimals for numeric variables, if specified
  if(!is.null(col.keys)){
    for(j in 1:length(col.keys)){
      f.table=colformat_num(x=f.table,
                            col_keys=col.keys[j],
                            digits=col.digits[j])
    }
  }
  # set table borders
  f.table=border_outer(f.table, part="all",
                       border=fp_border(color="black", width = 1))
  f.table=border_inner_h(f.table, part="all",
                         border=fp_border(color="black", width = 1))
  f.table=border_inner_v(f.table, part="all",
                         border=fp_border(color="black", width = 1))
  # set fonts
  f.table=font(f.table, fontname = "Times", part = "all")
  # also set the table's header font as bold
  f.table=bold(f.table, part = "header")
  # add the table to the document
  flextable::body_add_flextable(doc,
                                value = f.table,
                                align = "left")
  return("table added")
}

Now we are all set to create the report:

# create a histogram and save it as a png file:
png(filename="histogram.png", width = 6, height = 6, units = 'in', res = 300)
hist(mtcars$wt)
dev.off()

# create a data frame that will become a table in my report
wide.table=mtcars[1:6, ]
wide.table$car=rownames(wide.table)
wide.table=wide.table[, c(12, 1:11)]
narrow.table=wide.table[, 1:4]

# create a new document object
doc=new.word.doc()

# add the report title and an empty line
add.title(doc, "My report")
add.empty.line(doc)

add.title(doc, "narrow table")
add.table(doc, narrow.table, col.keys=c("mpg", "disp"), col.digits=c(1,0))
add.page.break(doc)

# add the histogram with an appropriate title
add.title(doc, "Histogram - portrait")
add.image(doc, "histogram.png", h=3, w=3)

# set the orientation to landscape
start.landscape(doc)
add.title(doc, "narrow table - landscape")
add.table(doc, narrow.table, col.keys=c("mpg", "disp"), col.digits=c(1,0))
add.empty.line(doc)

# add the wide table in landscape page orientation
add.title(doc, "wide table in landscape orientation")
add.table(doc, wide.table, col.keys=c("mpg", "disp"), col.digits=c(1,0))

# set the orientation back to portrait
end.landscape(doc)
add.title(doc, "narrow table")
add.table(doc, narrow.table, col.keys=c("mpg", "disp"), col.digits=c(1,0))

# generate the Word document using the print function
print(doc, target="My report.docx")

Since I cannot upload a Word document to WordPress, here is a screen print of the report:


### What R you? (R dataframe vs python dataframe)

Mon, 02/17/2020 - 01:00

Recap

Previously in this series, we discovered the equivalent python data structures for a number of R data structures.

In this post, we will look at translating R data frames into python. We will also compare and contrast data frames in R and python.

R data frame is a python…

Pretty straightforward: an R data frame is a python data frame.
We will use a built-in data frame, OrchardSprays, for our illustration. This dataset is from a study which examined the uptake of sugar solutions by bees when different concentrations of pesticide were added to the syrup.

OrchardSprays has four variables:

1. rowpos: The row position of the solution

2. colpos: The column position of the solution

3. treatment: The type of orchard spray added to the solution

4. decrease: The decrease in volume of the solution at the end of the study which indicates the amount of syrup consumed by the bees.

library(tidyverse)
library(reticulate)
py_run_string("import pandas as pd")

OrchardSprays %>%
  select(rowpos, colpos, treatment, decrease) %>%
  head()
##   rowpos colpos treatment decrease
## 1      1      1         D       57
## 2      2      1         E       95
## 3      3      1         B        8
## 4      4      1         H       69
## 5      5      1         G       92
## 6      6      1         F       90

Let’s check the data structure of OrchardSprays in R.

class(OrchardSprays)
## [1] "data.frame"

And now let us check the data structure when it is translated to python.

r_to_py(OrchardSprays) %>% class()
## [1] "pandas.core.frame.DataFrame"
## [2] "pandas.core.generic.NDFrame"
## [3] "pandas.core.base.PandasObject"
## [4] "pandas.core.accessor.DirNamesMixin"
## [5] "pandas.core.base.SelectionMixin"
## [6] "python.builtin.object"

Data frames in python are governed by the pandas package.

List columns

Occasionally, columns in a data frame contain lists instead of atomic vectors. These data frames are also known as nested data frames. Let us see how python handles a nested data frame from R. First, we will create a nested data frame. Variables “rowpos”, “colpos”, “decrease” will be nested inside the type of treatment. The nested variables appear as a list column called “data”.

Orchard_nest <- OrchardSprays %>% nest(-treatment)
knitr::kable(Orchard_nest)

|treatment |data |
|:---------|:----|
|D |list(decrease = c(57, 36, 22, 51, 28, 27, 20, 39), rowpos = c(1, 4, 5, 8, 2, 6, 7, 3), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |
|E |list(decrease = c(95, 51, 39, 114, 43, 47, 61, 55), rowpos = c(2, 5, 4, 3, 1, 7, 8, 6), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |
|B |list(decrease = c(8, 6, 4, 10, 7, 4, 8, 14), rowpos = c(3, 2, 8, 7, 6, 5, 1, 4), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |
|H |list(decrease = c(69, 127, 72, 130, 81, 76, 81, 86), rowpos = c(4, 3, 2, 1, 7, 8, 6, 5), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |
|G |list(decrease = c(92, 71, 72, 24, 60, 77, 72, 80), rowpos = c(5, 8, 7, 6, 3, 4, 2, 1), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |
|F |list(decrease = c(90, 69, 87, 20, 71, 44, 57, 114), rowpos = c(6, 7, 1, 5, 8, 3, 4, 2), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |
|C |list(decrease = c(15, 84, 16, 9, 17, 29, 13, 19), rowpos = c(7, 1, 6, 4, 5, 2, 3, 8), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |
|A |list(decrease = c(2, 2, 5, 4, 5, 12, 4, 3), rowpos = c(8, 6, 3, 2, 4, 1, 5, 7), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |

A nested data frame is still a R data frame.

class(Orchard_nest)
## [1] "data.frame"

Hence, it should be treated as a python data frame when translated.

r_to_py(Orchard_nest) %>% class()
## [1] "pandas.core.frame.DataFrame"
## [2] "pandas.core.generic.NDFrame"
## [3] "pandas.core.base.PandasObject"
## [4] "pandas.core.accessor.DirNamesMixin"
## [5] "pandas.core.base.SelectionMixin"
## [6] "python.builtin.object"

The dimension of a nested data frame in R and the translated data frame is the same.

dim(Orchard_nest)
## [1] 8 2

r_to_py(Orchard_nest) %>% dim()
## [1] 8 2

Inspecting the internal structure of the nested data frame, we can see for the column “data” there are nested variables within it .

str(Orchard_nest)
## 'data.frame':    8 obs. of  2 variables:
##  $ treatment: Factor w/ 8 levels "A","B","C","D",..: 4 5 2 8 7 6 3 1
##  $ data     :List of 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  57 36 22 51 28 27 20 39
##   .. ..$ rowpos  : num  1 4 5 8 2 6 7 3
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  95 51 39 114 43 47 61 55
##   .. ..$ rowpos  : num  2 5 4 3 1 7 8 6
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  8 6 4 10 7 4 8 14
##   .. ..$ rowpos  : num  3 2 8 7 6 5 1 4
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  69 127 72 130 81 76 81 86
##   .. ..$ rowpos  : num  4 3 2 1 7 8 6 5
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  92 71 72 24 60 77 72 80
##   .. ..$ rowpos  : num  5 8 7 6 3 4 2 1
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  90 69 87 20 71 44 57 114
##   .. ..$ rowpos  : num  6 7 1 5 8 3 4 2
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  15 84 16 9 17 29 13 19
##   .. ..$ rowpos  : num  7 1 6 4 5 2 3 8
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  2 2 5 4 5 12 4 3
##   .. ..$ rowpos  : num  8 6 3 2 4 1 5 7
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8

However, when inspecting the structure of the translated nested data frame, the list column appears as an ordinary python column.

# run using a {python} code chunk; unable to run in an {r} code chunk with py_eval or py_run_string
r.Orchard_nest.info(verbose=True)
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 8 entries, 0 to 7
## Data columns (total 2 columns):
## treatment    8 non-null category
## data         8 non-null object
## dtypes: category(1), object(1)
## memory usage: 584.0+ bytes

Also, when you print a nested data frame from R in python, the list column appears partially printed with misalignment.

py_eval("r.Orchard_nest")
##   treatment                                        data
## 0         D     decrease  rowpos  colpos
## 0      57.0     1....
## 1         E     decrease  rowpos  colpos
## 0      95.0     2....
## 2         B     decrease  rowpos  colpos
## 0       8.0     3....
## 3         H     decrease  rowpos  colpos
## 0      69.0     4....
## 4         G     decrease  rowpos  colpos
## 0      92.0     5....
## 5         F     decrease  rowpos  colpos
## 0      90.0     6....
## 6         C     decrease  rowpos  colpos
## 0      15.0     7....
## 7         A     decrease  rowpos  colpos
## 0       2.0     8....

Tibbles

Tibbles are the modern cousin of traditional R data frames. Let us see how they are treated in python. First, we will convert an R data frame into a tibble, using another built-in data frame in R, LifeCycleSavings. This dataset is about the savings ratios of different countries from 1960-1970.

LifeCycleSavings %>% head()
##              sr pop15 pop75     dpi ddpi
## Australia 11.43 29.35  2.87 2329.68 2.87
## Austria   12.07 23.32  4.41 1507.99 3.93
## Belgium   13.17 23.80  4.43 2108.47 3.82
## Bolivia    5.75 41.89  1.67  189.13 0.22
## Brazil    12.88 42.19  0.83  728.47 4.56
## Canada     8.79 31.72  2.85 2982.88 2.43

The output tbl_df indicates the structure is a tibble.

LCS_tibble <- as_tibble(LifeCycleSavings)
class(LCS_tibble)
## [1] "tbl_df"     "tbl"        "data.frame"

One thing to note about tibbles is that they do not have row names; thus the row names were dropped when the data frame was converted into a tibble.

LCS_tibble %>% head()
## # A tibble: 6 x 5
##      sr pop15 pop75   dpi  ddpi
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11.4   29.4  2.87 2330.  2.87
## 2 12.1   23.3  4.41 1508.  3.93
## 3 13.2   23.8  4.43 2108.  3.82
## 4  5.75  41.9  1.67  189.  0.22
## 5 12.9   42.2  0.83  728.  4.56
## 6  8.79  31.7  2.85 2983.  2.43

Now let’s see how python recognizes a tibble.

r_to_py(LCS_tibble) %>% class()
## [1] "pandas.core.frame.DataFrame"
## [2] "pandas.core.generic.NDFrame"
## [3] "pandas.core.base.PandasObject"
## [4] "pandas.core.accessor.DirNamesMixin"
## [5] "pandas.core.base.SelectionMixin"
## [6] "python.builtin.object"

The tibble is translated into a python data frame.
Let us check that there are no glitches when python translates a tibble into a python dataframe.

py_eval("r.LCS_tibble.head()")
##       sr  pop15  pop75      dpi  ddpi
## 0  11.43  29.35   2.87  2329.68  2.87
## 1  12.07  23.32   4.41  1507.99  3.93
## 2  13.17  23.80   4.43  2108.47  3.82
## 3   5.75  41.89   1.67   189.13  0.22
## 4  12.88  42.19   0.83   728.47  4.56

The translated tibble prints the same as an R tibble.

Differences between R and python data frames (row names and indexes)

Intro

If the data frame in R has row names, they will be maintained in the python data frame.

LifeCycleSavings %>% head() %>% row.names()
## [1] "Australia" "Austria"   "Belgium"   "Bolivia"   "Brazil"    "Canada"

The difference is just that in python, row names are termed indexes.

py_eval("r.LifeCycleSavings.head().index")
## Index(['Australia', 'Austria', 'Belgium', 'Bolivia', 'Brazil'], dtype='object')

Creating row names and indexes

There are R data frames without row names, such as the pre-installed data frame presidential. This dataset is about the terms of the eleven USA presidents from Eisenhower to Obama.

presidential %>% head()
## # A tibble: 6 x 4
##   name       start      end        party
##   <chr>      <date>     <date>     <chr>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican
## 2 Kennedy    1961-01-20 1963-11-22 Democratic
## 3 Johnson    1963-11-22 1969-01-20 Democratic
## 4 Nixon      1969-01-20 1974-08-09 Republican
## 5 Ford       1974-08-09 1977-01-20 Republican
## 6 Carter     1977-01-20 1981-01-20 Democratic

Occasionally, you may decide to create row names for your data frame using your own set of values or values from a variable from the data frame. Let us designate the values from the “start” column to be the row names.

presidential_names <- presidential %>%
  mutate(start_term=start) %>%
  column_to_rownames(var="start") %>%
  select(name, start=start_term, everything())

presidential_names %>% head()
##                  name      start        end      party
## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican
## 1961-01-20    Kennedy 1961-01-20 1963-11-22 Democratic
## 1963-11-22    Johnson 1963-11-22 1969-01-20 Democratic
## 1969-01-20      Nixon 1969-01-20 1974-08-09 Republican
## 1974-08-09       Ford 1974-08-09 1977-01-20 Republican
## 1977-01-20     Carter 1977-01-20 1981-01-20 Democratic

And we will expect the row names to appear as index in python.

py_run_string("PyPresident = r.presidential_names")  # renamed for the convenience of future use
py_eval("PyPresident.head()")
##                  name      start        end      party
## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican
## 1961-01-20    Kennedy 1961-01-20 1963-11-22 Democratic
## 1963-11-22    Johnson 1963-11-22 1969-01-20 Democratic
## 1969-01-20      Nixon 1969-01-20 1974-08-09 Republican
## 1974-08-09       Ford 1974-08-09 1977-01-20 Republican

Filtering/slicing differences

One difference is that the dplyr verbs in R don’t use row names for slicing, whereas in python you can use the index for slicing.
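One nuance worth noting (my addition): base R’s `[` operator, unlike the tidyverse verbs, can subset by row name directly:

```r
# base R does allow indexing by row name, even though dplyr verbs ignore row names
df <- data.frame(x = 1:2, y = c("a2", "b2"), row.names = c("r1", "r2"))
df["r1", , drop = FALSE]  # returns the single row labelled "r1"
```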

py_eval("PyPresident.loc[['1953-01-20']]")
##                  name      start        end      party
## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican

Another difference is that when you filter rows in R the row name is dropped by default as the output is printed as a tibble. Remember, tibbles omit row names.

presidential_names %>% filter(start=="1953-01-20")
##         name      start        end      party
## 1 Eisenhower 1953-01-20 1961-01-20 Republican

However, the index is preserved when rows are filtered in Python.

# Filtering using == py_eval("PyPresident[PyPresident.index=='1953-01-20']") ## name start end party ## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican # Filtering using .loc() py_eval("PyPresident.loc[['1953-01-20']]") ## name start end party ## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican

Other differences

R only supports strings for row names.

presidential_names %>% row.names() %>% class() ## [1] "character"

Even if you explicitly try to coerce the row names into another type, it does not work. Let us try to convert the row names back to a Date format; as the output shows, they are still characters.

presidential_Date<-presidential_names row.names(presidential_Date)<-as.Date(presidential_names %>% row.names(), "%Y-%m-%d") presidential_Date %>% row.names() %>% class() ## [1] "character"

When this R data frame is converted into a Python data frame, the index inherits this character format (object is Python’s equivalent of R’s character type).

py_eval("PyPresident.head().index") ## Index(['1953-01-20', '1961-01-20', '1963-11-22', '1969-01-20', '1974-08-09'], dtype='object')

The disadvantage of having the index as strings is that you can neither use partial temporal information

py_eval("PyPresident.loc['1953']") # prints-> KeyError: '1953'

nor other permutations of temporal information for slicing.

py_eval("PyPresident.loc['Jan 1953']") # prints -> KeyError: 'Jan 1953'

You need to use the complete string to filter out the row.

py_eval("PyPresident.loc[['1953-01-20']]") ## name start end party ## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican

Changing the index from strings to date-time format makes both possible.

py_run_string("PyPresident_Date= PyPresident.copy()") py_run_string("PyPresident_Date.index = pd.to_datetime(PyPresident_Date.index)") py_eval("PyPresident_Date.head().index") ## DatetimeIndex(['1953-01-20', '1961-01-20', '1963-11-22', '1969-01-20', ## '1974-08-09'], ## dtype='datetime64[ns]', freq=None)

Filtering for presidents using only the year, e.g., which president started his term in 1953?

py_eval("PyPresident_Date.loc['1953']") ## name start end party ## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican

Filtering for presidents using the year and a shorthand for the month, e.g., which president started his term in January 1953?

py_eval("PyPresident_Date.loc['Jan 1953']") ## name start end party ## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican

Differences between R and Python data frames (subsetting)

For this section, we will use the msleep dataset from ggplot2. This dataset contains sleep patterns and other biological information for different mammals.

msleep %>% head() %>% knitr::kable() name genus vore order conservation sleep_total sleep_rem sleep_cycle awake brainwt bodywt Cheetah Acinonyx carni Carnivora lc 12.1 NA NA 11.9 NA 50.000 Owl monkey Aotus omni Primates NA 17.0 1.8 NA 7.0 0.01550 0.480 Mountain beaver Aplodontia herbi Rodentia nt 14.4 2.4 NA 9.6 NA 1.350 Greater short-tailed shrew Blarina omni Soricomorpha lc 14.9 2.3 0.1333333 9.1 0.00029 0.019 Cow Bos herbi Artiodactyla domesticated 4.0 0.7 0.6666667 20.0 0.42300 600.000 Three-toed sloth Bradypus herbi Pilosa NA 14.4 2.2 0.7666667 9.6 NA 3.850

As with subsetting lists, there are two approaches to subsetting data frames.

1. Preserved subsetting
2. Simplified subsetting

Preserved subsetting

In preserved subsetting, the output has the same data structure as the input.

Preserved subsetting in R

In R, the techniques for preserved subsetting work for selecting both single and multiple variables.

1. base R via single square brackets [ ].
msleep["name"] %>% class() ## [1] "tbl_df" "tbl" "data.frame" msleep[, c("name", "sleep_total")] %>% class() ## [1] "tbl_df" "tbl" "data.frame"
2. tidyverse verb, dplyr::select.
msleep %>% select(name) %>% class() ## [1] "tbl_df" "tbl" "data.frame" msleep %>% select(name, sleep_total) %>% class() ## [1] "tbl_df" "tbl" "data.frame"

Preserved subsetting in Python

In Python, the number of techniques for preserved subsetting depends on whether a single variable or multiple variables are selected.

1. double square brackets [[ ]]

It can be used to select both single and multiple variables, and it is the only way to do preserved subsetting for a single variable.

py_eval("type(r.msleep[['name']])") ## <class 'pandas.core.frame.DataFrame'> py_eval("type(r.msleep[['name', 'sleep_total']])") ## <class 'pandas.core.frame.DataFrame'>

For both kinds of subsetting, the number of square brackets used in Python is the opposite of what is used in R. In this case, R uses single square brackets for preserved subsetting while Python uses double square brackets for the same kind of subsetting.

2. .loc attribute

It can only be used for selecting multiple variables if preserved sub setting is desired.

py_eval("type(r.msleep.loc[:, ('name', 'sleep_total')])") ## <class 'pandas.core.frame.DataFrame'>

Simplified subsetting

In simplified subsetting, “the simplest possible data structure that can represent the output” is returned. The simplest data structures are one-dimensional, so it is impossible to store values from multiple variables in them. In other words, you can only use simplified subsetting to select a single variable from a data frame.

Simplified subsetting in R

In R, there are three techniques for simplified subsetting.

1. base R via dual square brackets [[ ]]
msleep[["name"]] %>% class() ## [1] "character"
2. tidyverse verb, dplyr::pull
msleep %>% pull(name) %>% class() ## [1] "character"
3. dollar sign syntax $

msleep$name %>% class() ## [1] "character"

Simplified subsetting in Python

When you subset a Python data frame in a simplified manner, you get a Series, not a Python list or array. A Series is a one-dimensional array with named axis labels, provided by the pandas package. There are three techniques for simplified subsetting of Python data frames.

1. single square brackets [ ]

Similar to preserved subsetting, the number of brackets for simplified subsetting in Python differs from R: single square brackets [ ] are used for simplified subsetting in Python, compared to double square brackets in R.

py_eval("type(r.msleep['name'])") ## <class 'pandas.core.series.Series'>
2. .loc attribute
py_eval("type(r.msleep.loc[:,('name')])") ## <class 'pandas.core.series.Series'>
3. dot (.) technique
py_eval("type(r.msleep.name)") ## <class 'pandas.core.series.Series'>

Side note about working with R data frames

Occasionally, you may need to convert multi-element named vectors into data frames, for example, when working with quantiles.

(Rvec_name<-quantile(rnorm(100))) ## 0% 25% 50% 75% 100% ## -2.2622171 -0.6303713 0.1307985 0.7885520 2.5857533

We can present data frames in either a long format or a wide format. In a long data frame, the number of rows is equal to the length of the vector; thus the values of the vector form a column. The names in the vector appear as row names in the data frame.

data.frame(percentiles=names(Rvec_name), values=(Rvec_name)) ## percentiles values ## 0% 0% -2.2622171 ## 25% 25% -0.6303713 ## 50% 50% 0.1307985 ## 75% 75% 0.7885520 ## 100% 100% 2.5857533

In a wide data frame, the number of columns (not rows) is equal to the length of the vector, so each value of the vector occupies its own column. A neat trick to get a wide data frame directly from a named vector, without any transposition, is to use tibble with !!!.

tibble(!!!Rvec_name) ## # A tibble: 1 x 5 ## 0% 25% 50% 75% 100% ## ## 1 -2.26 -0.630 0.131 0.789 2.59

### How to compute the z-score with R

Sun, 02/16/2020 - 18:26

[This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers].


Sometimes it is necessary to standardize data because of its distribution, or simply because we need a fair comparison of a value (e.g., body weight) against a reference population (e.g., school, city, state, country). The calculation of the z-score is simple, but there is little information on the web about its purpose and meaning.

In this post, I will explain what the z-score means, how it is calculated with an example, and how to create a new z-score variable in R. As usual, I will use the data from National Health and Nutrition Examination Survey (NHANES).

What is Z-score

In short, the z-score is a measure of how far (below or above) a specific value lies from the mean of a given dataset, expressed in standard deviations. In the example below, I am going to calculate the z-score of body mass index (BMI) in a dataset from NHANES.

Get the data and packages

library(tidyverse) library(RNHANES) dat = nhanes_load_data("DEMO_E", "2007-2008") %>% select(SEQN, RIAGENDR) %>% left_join(nhanes_load_data("BMX_E", "2007-2008"), by="SEQN") %>% select(SEQN, RIAGENDR, BMXBMI) %>% filter(RIAGENDR == "1", !is.na(BMXBMI)) %>% transmute(SEQN, Gender = RIAGENDR, BMI = BMXBMI) dat SEQN Gender BMI 1 41475 2 58.04 2 41476 2 15.18 3 41477 1 30.05 4 41479 1 27.56 5 41480 1 17.93 6 41481 1 23.34 7 41482 1 33.64 8 41483 1 44.06 9 41485 2 25.99 10 41486 2 31.21

How to calculate the z-score for BMI

To calculate the z-score of BMI, we need the mean and the standard deviation of BMI.

Mean of BMI:

mean(dat$BMI) ## [1] 25.70571

Standard deviation of BMI:

sd(dat$BMI) ## [1] 7.608628

Suppose we want to calculate the z-score of the first participant in the dataset dat. The calculation is: take the actual BMI (58.04), subtract the mean (25.70571), and divide the difference by the standard deviation (7.608628). The result is 4.249687, which indicates that this participant's BMI is 4.249687 standard deviations above the population mean.

(58.04 - 25.70571)/7.608628 = 4.249687

How to calculate the z-score in R

dat %>% mutate(zscore = (BMI - mean(BMI))/sd(BMI)) SEQN Gender BMI zscore 1 41475 2 58.04 4.249687006 2 41476 2 15.18 -1.383391690 3 41477 1 30.05 0.570968558 4 41479 1 27.56 0.243708503 5 41480 1 17.93 -1.021959902 6 41481 1 23.34 -0.310925004 7 41482 1 33.64 1.042801328 8 41483 1 44.06 2.412299228 9 41485 2 25.99 0.037363810 10 41486 2 31.21 0.723427057

Now we see the z-score for each individual, and the values correspond to what we calculated above.

If you calculate the mean and standard deviation of the zscore above, you will find that mean is 0, and standard deviation is 1.
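This property can be checked directly in R. The snippet below is a quick sketch using a small made-up vector rather than the NHANES data, but the same check applies to the zscore column above.

```r
# Standardizing any numeric vector yields mean 0 and standard deviation 1.
x <- c(58.04, 15.18, 30.05, 27.56, 17.93)  # made-up BMI-like values
z <- (x - mean(x)) / sd(x)
round(mean(z), 10)  # 0 (up to floating-point error)
sd(z)               # 1
```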

Feel free to comment!


### Building a base dplyr with primitives

Sun, 02/16/2020 - 02:12

Introduction

In one of my latest posts, I discussed the idea of turning base R’s get and set operators ([, [[, [<-, [[<-) into human-readable and pipeable functions. It was kindly pointed out in the comments that the get() and set() functions I defined in that blog post are actually exported by the magrittr package as magrittr::extract2() ([[) and magrittr::inset2() ([[<-). In fact, there is a whole host of “alias” functions exported by magrittr; see ?magrittr::extract2 for more. However, if we are developing a package, we may not necessarily want to put magrittr in Imports:; we may only want to Suggest it as a package that complements ours. This is especially true when the functions we would be importing are simple aliases of other functions that we can just as easily create ourselves. Sure, a lot of people already have and use magrittr, in which case they can use it, but not everyone wants or uses it, so we shouldn’t force that dependency on users.

Take, for example, creating a package that recreates dplyr’s main verbs, select(), filter(), mutate(), and arrange(), using base R only. Think of it as a “poor man’s” dplyr (of course I jest; base is awesome). Oftentimes the main complaints I hear about dplyr are the sheer number of dependencies it has and the installation times that come with that, not to mention that its API has changed a few times over the years. base, on the other hand, comes pre-installed with R and its API is extremely stable. The reason people like dplyr, however, is that the API on offer is extremely flexible and easy to understand. This blog post will show how we can recreate these verbs using base R and aliases to R’s operator functions, and use them in conjunction with magrittr.

Select

dplyr::select() allows the user to subset the columns of a data.frame and always return a data.frame. Thus to recreate this function we will need the operator for subsetting columns of a data.frame which is [, or more specifically, [.data.frame. Let’s take a look at the arguments for this function.

args(`[.data.frame`) # function (x, i, j, drop = if (missing(i)) TRUE else length(cols) == # 1) # NULL

We see that it takes:

• x – the data.frame
• i – the rows to subset
• j – the columns to subset
• drop – whether to return a vector if only one column is left

We will define our wrapper for the [ function in the same way that magrittr does.

extract <- `[`

As this is an S3 generic, R will know to dispatch to [.data.frame when it is passed a data.frame. Hence, we can now define a select() function which is similar in functionality to that of dplyr::select(). Note that we tell R that we wish to subset all of the rows in the i position by leaving the argument blank.

select <- function(.data, ...) { cols <- vapply(substitute(...()), deparse, NA_character_) extract(.data, , cols, drop = FALSE) }

This function uses a couple of tricks here, so I’ll break them down. To use non-standard evaluation in the same way that dplyr does, that is to pass non-quoted column names, we must deparse them. We loop over the columns passed via ... using a vapply(). The substitute(...()) gives us a list-like object of all the symbols we pass which we can loop over. Using this function, we can now select a single column.

mtcars %>% select(mpg) # mpg # Mazda RX4 21.0 # Mazda RX4 Wag 21.0 # Datsun 710 22.8 # Hornet 4 Drive 21.4 # Hornet Sportabout 18.7 # Valiant 18.1 # Duster 360 14.3 # Merc 240D 24.4 # Merc 230 22.8 # ... 23 rows omitted

Or multiple columns by passing a vector.

mtcars %>% select(mpg, cyl) # mpg cyl # Mazda RX4 21.0 6 # Mazda RX4 Wag 21.0 6 # Datsun 710 22.8 4 # Hornet 4 Drive 21.4 6 # Hornet Sportabout 18.7 8 # Valiant 18.1 6 # Duster 360 14.3 8 # Merc 240D 24.4 4 # Merc 230 22.8 4 # ... 23 rows omitted

And of course, this function works without magrittr.

select(mtcars, mpg, cyl) # mpg cyl # Mazda RX4 21.0 6 # Mazda RX4 Wag 21.0 6 # Datsun 710 22.8 4 # Hornet 4 Drive 21.4 6 # Hornet Sportabout 18.7 8 # Valiant 18.1 6 # Duster 360 14.3 8 # Merc 240D 24.4 4 # Merc 230 22.8 4 # ... 23 rows omitted

For bonus points, we can write an equivalent of dplyr::pull() by setting the drop = TRUE argument and removing the cols parameter since we are only dealing with one column.

pull <- function(.data, var) { var <- deparse(substitute(var)) stopifnot(length(var) == 1) extract(.data, , var, drop = TRUE) } mtcars %>% pull(mpg) # [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 # [18] 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4

Filter

As we saw in the previous section, [.data.frame takes i as an argument representing the rows to filter. We can therefore use a similar method to that used for select(), except that in this case we must build the expressions by which to filter, join them with ampersands, and then parse and evaluate the result.

filter <- function(.data, ...) { conditions <- paste(vapply(substitute(...()), deparse, NA_character_), collapse = " & ") extract(.data, with(.data, eval(parse(text = conditions))), ) } mtcars %>% filter(cyl == 4) # mpg cyl disp hp drat wt qsec vs am gear carb # Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 # Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 # Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 # Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 # Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 # Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 # Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 # Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 # Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 # ... 2 rows omitted mtcars %>% filter(cyl <= 5 & am > 0) # mpg cyl disp hp drat wt qsec vs am gear carb # Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 # Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 # Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 # Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 # Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 # Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 # Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 # Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 mtcars %>% filter(cyl == 4 | cyl == 8) # mpg cyl disp hp drat wt qsec vs am gear carb # Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 # Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 # Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 # Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 # Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 # Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 # Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 # Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 # Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 # ... 
# 16 rows omitted
mtcars %>% filter(!(cyl %in% c(4, 6)), am != 0) # mpg cyl disp hp drat wt qsec vs am gear carb # Ford Pantera L 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4 # Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8

We can also get a copy of dplyr::slice() really cheaply.

slice <- function(.data, ...) { stopifnot(is.numeric(...) || is.integer(...)) extract(.data, ..., ) } mtcars %>% slice(1:3) # mpg cyl disp hp drat wt qsec vs am gear carb # Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 # Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 # Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1

Arrange

The final function using the extract() alias that I want to highlight is arrange(). I want to highlight this function because of the required trick with eval.parent() (note there are other ways we could achieve this).

arrange <- function(.data, ...) { rows <- eval.parent(substitute(with(.data, order(...)))) extract(.data, rows, , drop = FALSE) }

We use eval.parent() instead of eval(), because the eval()/substitute() combo doesn’t play well with nested functions. The eval.parent() trick has been proposed by @MoodyMudskipper as a way to address this problem and allows us to seamlessly use arrange() inside other functions, including magrittr pipes:

mtcars %>% arrange(mpg) # mpg cyl disp hp drat wt qsec vs am gear carb # Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 # Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 # Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 # Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 # Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 # Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 # Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 # AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 # Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 # ... 23 rows omitted mtcars %>% arrange(cyl, mpg) # mpg cyl disp hp drat wt qsec vs am gear carb # Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 # Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 # Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 # Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 # Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 # Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 # Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 # Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 # Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 # ... 23 rows omitted

Thanks go to Artem Sokolov for pointing this out.

Mutate

If we wish to create new columns in our dataset, particularly columns created using existing columns in the data, we must use the [<- operator, specifically, [<-.data.frame.

args(`[<-.data.frame`) # function (x, i, j, value) # NULL

[<-.data.frame takes the arguments:

• x – the data
• i – the rows to create
• j – the columns to create
• value – the value to give to the rows/columns

We will assign this operator to inset – the same as magrittr does.

inset <- `[<-`

Here we lapply() over each of the conditions to return a list containing the evaluated results of our expressions. We then use the inset() function to add these vectors as new columns to the data.frame.

mutate <- function(.data, ...) { conditions <- vapply(substitute(...()), deparse, NA_character_) new_data <- lapply( conditions, function(x, .data) with(.data, eval(parse(text = x))), .data ) inset(.data, , names(conditions), new_data) } mtcars %>% mutate(mpg2 = mpg * 2) # mpg cyl disp hp drat wt qsec vs am gear carb mpg2 # Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 42.0 # Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 42.0 # Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 45.6 # Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 42.8 # Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 37.4 # Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 36.2 # Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 28.6 # Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 48.8 # Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 45.6 # ... 23 rows omitted mtcars %>% mutate(mpg2 = mpg * 2, cyl2 = cyl * 2) # mpg cyl disp hp drat wt qsec vs am gear carb mpg2 cyl2 # Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 42.0 12 # Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 42.0 12 # Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 45.6 8 # Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 42.8 12 # Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 37.4 16 # Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 36.2 12 # Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 28.6 16 # Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 48.8 8 # Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 45.6 8 # ... 23 rows omitted

Note that unlike dplyr::mutate(), we cannot create columns that refer to other columns created in the same call; for example, the following would not work.

mtcars %>% mutate(mpg2 = mpg * 2, mpg3 = mpg2 * 3) # Error in eval(parse(text = x)): object 'mpg2' not found
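One way around this limitation, using only the mutate() defined above, is to chain two calls, so that the first new column exists in the data frame before the second expression refers to it:

```r
# Each mutate() call evaluates its expressions against the data frame it
# receives, so mpg2 is a real column by the time mpg3 is computed.
mtcars %>%
  mutate(mpg2 = mpg * 2) %>%
  mutate(mpg3 = mpg2 * 3)
```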

As a bonus, we can combine our mutate() function with extract() to create a copy of dplyr::transmute().

transmute <- function(.data, ...) { conditions <- vapply(substitute(...()), deparse, NA_character_) mutated <- mutate(.data, ...) extract(mutated, names(conditions)) } mtcars %>% transmute(mpg2 = mpg * 2, cyl2 = cyl * 2) # mpg2 cyl2 # Mazda RX4 42.0 12 # Mazda RX4 Wag 42.0 12 # Datsun 710 45.6 8 # Hornet 4 Drive 42.8 12 # Hornet Sportabout 37.4 16 # Valiant 36.2 12 # Duster 360 28.6 16 # Merc 240D 48.8 8 # Merc 230 45.6 8 # ... 23 rows omitted

Chaining

As a final note, it should be clear that due to the nature of magrittr, your standard chaining of functions will still work.

mtcars %>% filter(cyl == 4) %>% select(mpg, cyl, wt, disp) # mpg cyl wt disp # Datsun 710 22.8 4 2.320 108.0 # Merc 240D 24.4 4 3.190 146.7 # Merc 230 22.8 4 3.150 140.8 # Fiat 128 32.4 4 2.200 78.7 # Honda Civic 30.4 4 1.615 75.7 # Toyota Corolla 33.9 4 1.835 71.1 # Toyota Corona 21.5 4 2.465 120.1 # Fiat X1-9 27.3 4 1.935 79.0 # Porsche 914-2 26.0 4 2.140 120.3 # ... 2 rows omitted

Conclusion

The idea behind this blog post was to highlight how we can use more human-readable versions of R’s primitive operators to build pipeable data manipulation functions. Of course, the solutions provided in this blog post are over-engineered, and you would probably write them differently if you were seriously thinking about releasing them as a package. Also, whilst these functions are available via an import of magrittr, you may not wish to force the user to import magrittr and may wish to keep it as a suggestion instead. This reduces the number of dependencies of your package.

For what it’s worth, I have included all of the above code in a package called poorman on my GitHub account. These functions haven’t been thoroughly tested and there may well be bugs. There are, however, much more detailed and dedicated recreations of dplyr using base R. If you are interested, check out: bplyr (note this package uses rlang) and tbltools.


### Apple Health Export Part I

Sun, 02/16/2020 - 01:00

[This article was first published on R on Can I Blog Too, and kindly contributed to R-bloggers].

This post is Part I of a dive into the contents of the Apple Health Export.
It will work through the mechanics of moving data from the Apple Health app
out of your iPhone and into R where you can analyze it. It also will describe in detail the problem of
adjusting the time stamps for daylight saving time and travel across time zones.
Unfortunately, time zones in the Apple Health Export are a complicated subject.

First, Export the Data from the Health App

You can export all of the data you are able to view via the Health app.
Open the Health app on your iPhone.
To export, go to your personal settings by
clicking on your icon near the upper right corner of the Browse screen.
(See the first screenshot below.)
Click on the icon and you will see some of your personal settings.
You will need to scroll to the bottom of this page, where you will
see a clickable line “Export All Health Data”, as shown in the second screenshot below.

Tap upper right icon:

Scroll to very bottom:

Once you click OK to go ahead with the export, it may take a significant amount of time.
On my iPhone 8 it takes more than five minutes. Once it is complete, you’ll get a
dialog that asks where to send the exported data. I use AirDrop to send it to the
Mac where I am running RStudio. It ends up in the Downloads folder on that Mac.
If you need to move the data to a Windows computer, you may need to send it
via email or Dropbox.
The exported file is named export.zip. If you double-click on that file it
will expand into a folder called apple_health_export. The uncompressed file is huge
in comparison with the size of the zip file. In my case, export.zip is about
79 megabytes which becomes an apple_health_export folder that is 2.45 gigabytes!
In my R code, I uncompress the file into my Downloads folder, which is excluded
from my Time Machine backups.

R Code to Expand the Export File and Import It As XML Data

The R code below shows how to decompress export.zip and follow some
basic steps to import it into R. I’m following in the footsteps of
several people who have published code to accomplish these steps.
See work by Ryan Praskievicz, Taras Kaduk, and Deepankar Datta (who has
created a package called AppleHealthAnalysis). I’m sure there are other examples
using R, and there are quite a number of examples using python (e.g., by Mark Koester).

The R code uncompresses the zip file and replaces the apple_health_export folder.
(Because of the size of this folder, I try to avoid having multiple copies and also
avoid having it saved to my disk backup.)
The big file inside that folder is export.xml. Following the examples cited above, I
use the XML package to convert the major elements of the XML file into
tidy data frames.
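As a rough sketch of those steps (the paths are assumptions based on my setup, with export.zip AirDropped into ~/Downloads; XML:::xmlAttrsToDataFrame is the unexported helper used in the examples cited above):

```r
library(XML)

# Uncompress the export; this replaces any existing apple_health_export folder.
unzip("~/Downloads/export.zip", exdir = path.expand("~/Downloads"))

# Parse the big XML file (slow: it is gigabytes uncompressed).
health_xml <- xmlParse("~/Downloads/apple_health_export/export.xml")

# Convert the attributes of each node type into a data frame.
records  <- XML:::xmlAttrsToDataFrame(health_xml["//Record"])
activity <- XML:::xmlAttrsToDataFrame(health_xml["//ActivitySummary"])
workouts <- XML:::xmlAttrsToDataFrame(health_xml["//Workout"])
```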

I won’t go into the details of the XML structure of the health export.
For most purposes, the Record, ActivitySummary, Workout, and Clinical data types
will provide all that you are looking for. My expanded Apple Health Export
folder also includes workout GPX files, electrocardiograms, and clinical
records imported from my health system’s medical records system.

I have a bit over two years of Apple Watch data in my iPhone.
After a full career working
as a data analyst, this is the largest number of data points
I have ever dealt with. Extracting the “Record” data from export.xml produces
more than four million rows and takes about 70 seconds on my 2019 iMac.

The Types of Data in the Export

The counts by “type”, which describe the breadth and quantity of the data, are shown in Table 1.

Table 1: Frequency by Type and Data Source
type | Watch | Phone | Lose It! | Other | Total
HKQuantityTypeIdentifierActiveEnergyBurned | 1,459,787 | 0 | 0 | 1 | 1,459,788
HKQuantityTypeIdentifierBasalEnergyBurned | 911,246 | 0 | 0 | 0 | 911,246
HKQuantityTypeIdentifierDistanceWalkingRunning | 664,925 | 140,647 | 0 | 1 | 805,573
HKQuantityTypeIdentifierHeartRate | 655,002 | 0 | 0 | 1,684 | 656,686
HKQuantityTypeIdentifierStepCount | 55,350 | 136,637 | 0 | 1 | 191,988
HKQuantityTypeIdentifierAppleExerciseTime | 53,944 | 337 | 0 | 0 | 54,281
HKQuantityTypeIdentifierFlightsClimbed | 13,358 | 11,988 | 0 | 1 | 25,347
HKCategoryTypeIdentifierAppleStandHour | 20,276 | 7 | 0 | 0 | 20,283
HKQuantityTypeIdentifierAppleStandTime | 9,657 | 0 | 0 | 0 | 9,657
HKQuantityTypeIdentifierEnvironmentalAudioExposure | 6,586 | 0 | 0 | 0 | 6,586
HKQuantityTypeIdentifierDietaryFatTotal | 0 | 0 | 5,628 | 0 | 5,628
HKQuantityTypeIdentifierDietaryEnergyConsumed | 0 | 0 | 5,064 | 0 | 5,064
HKQuantityTypeIdentifierDietaryFatSaturated | 0 | 0 | 5,040 | 0 | 5,040
HKQuantityTypeIdentifierDietaryProtein | 0 | 0 | 4,736 | 0 | 4,736
HKQuantityTypeIdentifierDietarySodium | 0 | 0 | 4,699 | 0 | 4,699
HKQuantityTypeIdentifierDietaryFiber | 0 | 0 | 4,695 | 0 | 4,695
HKQuantityTypeIdentifierDietarySugar | 0 | 0 | 4,603 | 1 | 4,604
HKQuantityTypeIdentifierHeartRateVariabilitySDNN | 4,573 | 0 | 0 | 0 | 4,573
HKQuantityTypeIdentifierDietaryCholesterol | 0 | 0 | 4,447 | 0 | 4,447
HKCategoryTypeIdentifierSleepAnalysis | 0 | 0 | 0 | 3,377 | 3,377
HKQuantityTypeIdentifierBloodPressureDiastolic | 0 | 0 | 0 | 2,428 | 2,428
HKQuantityTypeIdentifierBloodPressureSystolic | 0 | 0 | 0 | 2,428 | 2,428
HKQuantityTypeIdentifierDistanceCycling | 1,415 | 0 | 0 | 0 | 1,415
HKQuantityTypeIdentifierRestingHeartRate | 861 | 0 | 0 | 0 | 861
HKQuantityTypeIdentifierWalkingHeartRateAverage | 781 | 0 | 0 | 0 | 781
HKQuantityTypeIdentifierBodyMass | 0 | 1 | 263 | 1 | 265
HKQuantityTypeIdentifierVO2Max | 187 | 0 | 0 | 0 | 187
HKCategoryTypeIdentifierMindfulSession | 76 | 0 | 0 | 0 | 76
HKQuantityTypeIdentifierHeadphoneAudioExposure | 0 | 17 | 0 | 0 | 17
HKQuantityTypeIdentifierHeight | 0 | 1 | 0 | 1 | 2
HKQuantityTypeIdentifierDietaryCaffeine | 0 | 0 | 0 | 1 | 1
HKQuantityTypeIdentifierDietaryCarbohydrates | 0 | 0 | 0 | 1 | 1
HKQuantityTypeIdentifierNumberOfTimesFallen | 1 | 0 | 0 | 0 | 1
Total | 3,858,025 | 289,635 | 39,175 | 9,926 | 4,196,761

The same data may be repeated from multiple sources
so it is important to pay attention to sourceName.
Note that step counts, flights climbed, and distance walking/running
are recorded from both the Watch and the iPhone.
On any particular day you probably want to include only one of the sources.
Otherwise you risk double counting.
Generally I focus on the Watch data, but I have almost three years of
data from my iPhone before I started wearing the Apple Watch.
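For example, to keep only one source for step counts, you can filter on sourceName before summing. This is a sketch: it assumes the Record data frame built earlier is called records, and the exact sourceName values depend on how your devices are named.

```r
library(dplyr)

watch_steps <- records %>%
  filter(type == "HKQuantityTypeIdentifierStepCount",
         grepl("Watch", sourceName)) %>%  # keep only the Watch rows
  mutate(value = as.numeric(value))       # Record values arrive as character
```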

The largest quantity of data is collected via my Apple Watch rather than my iPhone.
There are other sources as well.
I have been using the Lose It! app on my iPhone for about six months to count calories,
and that produces a noticeable amount of data.
The free version of the app that I am using does not display much beyond
basic calorie counts.
It’s interesting to see that the more detailed nutrition breakdowns are passed
into the Health app.

As for the “Other” category, I’m using an Omron blood pressure cuff
that can transfer readings to the Omron app on the iPhone via Bluetooth. Those
readings then appear in the Apple Health database.
There are a few odds and ends contributed by apps on my iPhone
such as AllTrails, Breathe, and AutoSleep. Note that heart rate data comes both from
the watch and from the blood pressure readings.

The other major XML categories are ActivitySummary, Workout, and ClinicalRecord.
For ActivitySummary I have one row per
day which basically summarizes some of the intra-day activity data.
Workout has one row per workout. If I were
still running I would focus much more on that data frame.
For each workout it shows type, distance, duration,
and energy burned. Most of my workouts are outdoor walks.
Quite often I forget to end the workout when I end the walk,
which certainly reduces the usefulness of the data.
But I would imagine that for a runner or a swimmer or a cyclist,
the Workout information would be interesting and useful.

ClinicalRecord is a bit tricky.
I have set things up so that my health organization shares my health records with the
Apple Health app. The count by type is shown in the table below.

| type | n |
|---|---:|
| AllergyIntolerance | 1 |
| Condition | 12 |
| DiagnosticReport | 63 |
| Immunization | 15 |
| MedicationOrder | 7 |
| MedicationStatement | 13 |
| Observation | 694 |
| Patient | 1 |
| Procedure | 7 |

In the clinical data frame there is a column called resourceFilePath that contains
the path to a json dataset in the Apple Health Export/clinical records folder. Presumably
this would allow you to retrieve items such as individual lab test results.
I haven’t attempted to get into this data. I only know what’s available
here because I can view it via the Apple Health app.
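For anyone who does want to dig in: the files referenced by resourceFilePath are FHIR-style JSON, so jsonlite can read them into nested lists. This is a hypothetical sketch only; the minimal Observation below is made up to show the shape of the data, not taken from my export.

```r
library(jsonlite)

# A made-up, minimal FHIR-style Observation, standing in for one of the
# json files under the clinical records folder. In practice you would call
# fromJSON(resourceFilePath) on a row of the clinical data frame.
txt <- '{"resourceType": "Observation",
         "valueQuantity": {"value": 5.4, "unit": "mmol/L"}}'
obs <- fromJSON(txt)
obs$resourceType          # "Observation"
obs$valueQuantity$value   # 5.4
```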

The Problem of Time Zones and of Daylight Savings

When I first looked at day by day data for resting heart rate I bumped into
problems caused by the issue of time zones. I have about 800 rows of data of type
HKQuantityTypeIdentifierRestingHeartRate which should be one per day.
I quickly discovered I had several days where there were two values in a single day,
and these were related to
occasions when I traveled by air to a different time zone.

This leads to a long digression on the subject of computer time stamps and time zones.
(Here is an entertaining video that describes
the general problem of time stamps and time zones, but it doesn’t relate to the specific
problems that I will get into here. It’s fun if you want to nerd out on this topic.)

Attention Conservation Notice: If you don’t care about time of day and can tolerate
a few things like resting heart rate being a day off, then you can ignore the
issues with the time stamps. Your life will be a lot simpler. You can basically skip
the rest of this post and look forward to Part II.

Now we shall dive deep into the weeds.
Each item in the records of the health dataset has a creation, start, and end time stamp.
In the export dataset
they appear as a character string that looks like this: “2019-04-10 08:10:34 -0500”.
There is “-0500” on the end
because, as I write this, local time is Eastern Standard Time, which is five hours
behind UTC (Coordinated Universal Time).
At first I thought the UTC offset at the end of the text string would take care of everything.
In fact, it is useless.
As near as I can tell, the internal data has no indication of time zone.1
The UTC offset is attached to the date and time information when the data is exported.
Every single time stamp in the exported dataset has the same “-0500” offset,
which merely represents my local offset at the time the export was done.
When I exported the data during Eastern Daylight Time, all of the offsets appeared as “-0400”.
In fact, the exported data has no information about time zone or daylight savings.
One should think of the actual time stamp as being in universal time (UTC).
The export creates a character string which displays the date and time
in the time zone where and when the export is created and attaches an offset to UTC.

Here is a detailed example.
When I go into the Activity app on my iPhone I see that it claims I started a walk in England
on 9/1/2019 at 05:51:58. I’m not that much of an early riser.
I know from my travel diary that I actually got started five hours later than that
at 10:51:58 local time (British Summer Time) because I had to take a bus before
I could start the walk. When I exported the workout data last October during
Daylight Savings, the exported
date showed “2019-09-01 05:51:58 -0400”. When I export the data now during standard
time it shows “2019-09-01 04:51:58 -0500”. If I feed either of those
character strings into lubridate::as_datetime I get “2019-09-01 09:51:58 UTC”.
In absolute terms, there’s no ambiguity about when the observation was added to the data.
The ambiguity is the local time as it appeared on my watch at the moment the data was added.
Sometimes that matters and sometimes it doesn’t.
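This can be checked directly with lubridate: the two exported strings name the same instant, and as_datetime() normalizes both to UTC.

```r
library(lubridate)

# The same workout start, as exported during daylight savings and
# during standard time:
daylight <- as_datetime("2019-09-01 05:51:58 -0400")
standard <- as_datetime("2019-09-01 04:51:58 -0500")

identical(daylight, standard)  # TRUE: one instant, two offsets
daylight                       # "2019-09-01 09:51:58 UTC"
```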

When does it matter?
A number of items like resting heart rate are recorded once per day.
But because of time zone issues, you can end up with two on one day and none on another.
Also, there may be situations where you want to look at patterns over the course of a day.
At one point I wanted to look at whether there were periods when my heart rate was
unusually high during normal sleeping hours.
I looked for a heart rate above 120 between the hours of 11PM and 6AM.
I got hits for when I was hiking in a different time zone because the
time stamp appeared to be during those night time hours when in fact the local time
was shifted five or seven hours because of the different time zone.

The time stamp issues are tricky. If these kinds of situations are not a problem for you,
then ignore them and skip this section. Your life will be simpler.

I use the lubridate package to deal with the time stamps.
R relies on a Unix-based standard for dates and time called
POSIX that is implemented as a class
called POSIXct. You can see lots of references to POSIXct in the lubridate documentation.
The as_datetime function in lubridate allows you to add a tz parameter that
specifies the time zone.
A significant difficulty is that in R the time zone is stored as an attribute of the vector
rather than as part of the data.
If you have a vector of datetime data,
the time zone attribute applies to the entire vector, not to individual elements
in the vector.
If you want to store time zones that vary within the vector, you need to store them in a separate
vector, and that’s not part of the R standard for handling dates and times.
You’re on your own. The lubridate package includes some functions to help convert vectors
from one time zone to another and to deal somewhat with
daylight savings.
But it does not automatically help with a vector that contains datetime information from
varying time zones (as well as different daylight savings issues).
(See Clayton Yochum
for a more detailed discussion of general time zone messiness in R.)
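The single-attribute limitation is easy to see in a couple of lines. The key lubridate distinction is between with_tz (same instant, new clock reading) and force_tz (same clock reading, new instant); the times below are made up for illustration.

```r
library(lubridate)

x <- ymd_hms("2019-09-01 10:00:00", tz = "UTC")

# with_tz changes only how the instant is displayed:
with_tz(x, "America/New_York")   # same instant, clock reads 06:00 EDT

# force_tz keeps the clock reading and reinterprets it in a new zone,
# which produces a different instant:
force_tz(x, "America/New_York")  # clock still reads 10:00, but now EDT

# The time zone is one attribute for the whole vector -- you cannot mark
# one element as New York time and another as London time:
y <- x + hours(0:1)
tz(y)  # "UTC" for every element
```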

As I searched the web for tips on how to approach this issue, I discovered that
there’s a population of people who are working hard to maintain a streak
in filling their activity rings in the Apple Activity app.
Some of those individuals get frustrated because they are tripped up
by movement across time zones or even changes to daylight savings.
There are a number of tips out there for activity tracking in the face of
crossing time zones.

My Treatment of Time Zones and the Apple Health Export

Here I describe the way I handled the time zone issue.
There will be three steps.
1. Identify when I was in a different time zone.
2. Attach a time zone to each row of health_df.
3. Use the time zone information to attach a local start and end time to each observation.

The first task is to figure out when I was outside of my home time zone.
To do that I need a data frame that shows me the time when
I arrived in another time zone.
I created a table that contains the date and time my
watch switched to a new time zone.
In my case that means identifying airplane trips
that landed in a different time zone.
As a bonus, I will describe in detail how I
used data exported from TripIt to create most
of the table of time zone changes. Later I converted
that information to UTC and then used the table to
establish the time zone in each of my 4+ million rows
of data from the Apple Health Export.

My TripIt data is missing one overseas trip. I needed to
add that trip manually. Here is a tribble that creates a table
to describe a trip to Greece via a two day stop
in Amsterdam. If you have no
data in TripIt then the manual table would have to
describe all your trips to a different time zone.

```r
manual_timezone_changes <- tibble::tribble(
  ~local_arrive,         ~local_timezone,
  "2018-04-18 09:15:00", "Europe/Amsterdam",
  "2018-04-20 21:00:00", "Europe/Athens",
  "2018-04-30 15:23:00", "America/New_York"
) %>%
  mutate(local_arrive = as_datetime(local_arrive))
```

The local_arrive column records when I arrived in a new time zone and
is the scheduled arrival time for the flight to that time zone.
For example, the first line represents a flight that I took
from JFK to Amsterdam that arrived at 9:15 the next morning
in Amsterdam time. My watch was on New York time until I turned off
airplane mode after arrival in Amsterdam so I am focusing on
the arrival time. One
could do similar lines of data for arrivals by car, train, or boat.
Later we will see how to transform this data so that it can
be applied to the rows of the Apple Health Export. First,
let’s go into the details of how to complete this table
using data from TripIt.

Using TripIt Data to Track Your Plane Flights

I will use TripIt to get most of my flight information. This
was inspired by a talk by Hadley Wickham. The point of the talk was
data visualization,
but one of his examples
was based on the relationship between
when he was traveling and the number of commits on GitHub.
For that example he used the history of his trips on TripIt, and
that’s the data I need to correct the time stamps.

OK, here we go. Remember that I warned you that adjusting the time
stamps would involve complications.

The first step is to fetch my trip history from TripIt. There’s a TripIt developer page that points you to
more information. TripIt has an OAuth API for full-strength applications. That was more
than I needed. Hadley used a simpler form of authorization. Here’s the TripIt description:

TripIt API offers a very simple way of authenticating the API that should only be used for testing and development purposes…. Note that this authentication scheme is off by default for every TripIt user. If you want to have this turned on for your account so you can use it for development purposes please send email to .

I sent them an email to request simple authentication, and TripIt support responded the next day.
From then on I could use httr functions to get the data from TripIt via information
from documentation of the TripIt API.
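The post doesn’t show the GET_tripit helper itself, so here is a hypothetical sketch of what it might look like. The use of HTTP basic authentication (TripIt’s simple scheme) and the query mechanics are my assumptions; check the TripIt API documentation for the exact way to request JSON.

```r
library(httr)

# Hypothetical helper, assuming TripIt's simple authentication scheme
# has been enabled for your account. Credentials come from environment
# variables rather than being hard-coded.
GET_tripit <- function(url) {
  resp <- GET(url,
              authenticate(Sys.getenv("TRIPIT_EMAIL"),
                           Sys.getenv("TRIPIT_PASSWORD")),
              # assumption: how to ask for JSON -- verify against the API docs
              query = list(format = "json"))
  stop_for_status(resp)
  content(resp, as = "parsed", type = "application/json")
}
```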

I did one GET call from httr to get a list of my TripIt trips. Next I used the purrr package
to extract data from the nested JSON lists returned by TripIt. In particular, I used the map function
to get TripIt “air” objects for each trip ID. Individual airplane flights are “segments”
with each air object. For example, a trip might be two connecting flights to get to the
destination and two flights to return home, each represented by a “segment”.
I always feel like I’m a few steps away from a thorough understanding of
purrr, and I tend to rely on a certain amount of trial and error to get a sequence of flatten
and map calls that extracts what I need. The code is available from a GitHub repo I made to support
this post.

I end up with a
data frame that has the scheduled departure and arrival for each flight and conveniently provides
the time zone for each airport. Note that in practice this data might not be perfect. It is scheduled
flights only and would not account for cancelled flights or even the time of a delayed flight. So keep
that in mind before you
try to use this data to examine something detailed such as whether your heart rate is elevated during takeoffs and landings.

I took a quick detour and explored whether I could use the FlightAware API to get the actual arrival
times. It is now easy to get free access to limited FlightAware data. But the API calls are
oriented to retrieving current rather than historical data so I can’t use it to find out about
my past flights.

Once I have the trip IDs, I use each trip ID to fetch the flight
information. I used the RStudio View() function to examine
the results I got back from calls to the TripIt API.
I get one trip at a time, fetch the
air segments, and then bind them together with purrr::map_dfr. I
fetched more than the minimum columns I needed partly out of curiosity over
what was in the TripIt data. It’s interesting to see all my flight information
in one table, although much is not directly relevant to the task at hand.

```r
# get list of trips
trips_list <- GET_tripit("https://api.tripit.com/v1/list/trip/past/true/false")
trip_ids <- trips_list$Trip %>% map_chr("id")

GET_air <- function(trip_id) {
  atrip <- GET_tripit(
    paste0(
      "https://api.tripit.com/v1/get/trip/id/",
      trip_id,
      "/include_objects/true"
    )
  )
  air_trip <- atrip[["AirObject"]][["Segment"]]
  flights <- dplyr::tibble(
    trip_id = trip_id,
    trip_start = atrip[["Trip"]][["start_date"]],
    start_date = air_trip %>% purrr::map("StartDateTime") %>% map_chr("date"),
    start_time = air_trip %>% purrr::map("StartDateTime") %>% map_chr("time"),
    start_timezone = air_trip %>% purrr::map("StartDateTime") %>% map_chr("timezone"),
    start_city = air_trip %>% purrr::map_chr("start_city_name"),
    end_date = air_trip %>% purrr::map("EndDateTime") %>% map_chr("date"),
    end_time = air_trip %>% purrr::map("EndDateTime") %>% map_chr("time"),
    end_timezone = air_trip %>% purrr::map("EndDateTime") %>% map_chr("timezone"),
    end_city = air_trip %>% purrr::map_chr("end_city_name"),
    airline = air_trip %>% purrr::map_chr("marketing_airline"),
    code = air_trip %>% purrr::map_chr("marketing_airline_code"),
    number = air_trip %>% purrr::map_chr("marketing_flight_number"),
    aircraft = air_trip %>% purrr::map_chr("aircraft_display_name"),
    distance = air_trip %>% purrr::map_chr("distance"),
    duration = air_trip %>% purrr::map_chr("duration")
  )
}
GET_air_mem <- memoise::memoise(GET_air)
# I used memoise because TripIt typically doesn't let me fetch
# all the data in a single call. With memoise, for each call
# I only fetch the items it doesn't remember.

flying <- trip_ids %>% map_dfr(GET_air_mem)
```

I used memoise with the GET_air_mem function because TripIt generally doesn’t give me all my flights on one try. It seems to limit how much data it will return. Because of memoise, GET_air_mem remembers the results of previous successful fetches from TripIt and only asks for flights it doesn’t already know about. I ended up with a tibble with one row per flight.
I selected the two columns I needed and used bind_rows to combine them with the manual data I created above. I focused on the ending time and location of each flight. That’s when I assume my watch gets changed to a new time zone and life in that time zone begins. Life in that time zone ends when life in the next time zone begins. I wrote a short function to convert the local scheduled time in the flight information into UTC time comparable to what is in the health export. So after I bind together rows from TripIt and the manual rows above, I set utc_until = lead(utc_arrive) and until_timezone = lead(local_timezone), and I set the last utc_until to now().

```r
local_to_utc <- function(dt, timezone) {
  # UTC of this local time. Note: not vectorized
  if (is.character(dt)) dt <- as_datetime(dt)
  tz(dt) <- timezone
  xt <- with_tz(dt, "UTC")
  tz(xt) <- "UTC"
  return(xt)
}

arrivals <- flying %>%
  mutate(local_start = ymd_hms(paste0(start_date, start_time)),
         local_arrive = ymd_hms(paste0(end_date, end_time))) %>%
  filter(start_timezone != end_timezone) %>% # only trips that change time zone matter
  select(local_arrive, local_timezone = end_timezone) %>%
  # manual additions here
  bind_rows(manual_timezone_changes) %>%
  mutate(utc_arrive = map2_dbl(local_arrive, local_timezone, local_to_utc) %>%
           as_datetime()) %>%
  arrange(utc_arrive) %>%
  mutate(utc_until = lead(utc_arrive),
         until_timezone = lead(local_timezone))
arrivals$utc_until[nrow(arrivals)] <- with_tz(now(), "UTC")
arrivals$until_timezone[nrow(arrivals)] <- Sys.timezone() # I don't really need this.
# Check that arrivals is set up properly so that my last arrival is in
# my home time zone; otherwise warn
if (arrivals$local_timezone[nrow(arrivals)] != Sys.timezone())
  warning("Expected to end in Sys.timezone: ", Sys.timezone(),
          " rather than ", arrivals$local_timezone[nrow(arrivals)])
```

Assign a Time Zone to Each Row of the Data

The next step was to associate a time zone with each of the 4+ million rows in health_df.
I thought this would be difficult and very slow. I was wrong!
The fuzzyjoin package
by Dave Robinson
came to the rescue. I can’t do a simple join between flights and data rows because there is
not an exact match for the time stamps. Instead I want to join the time stamps in health_df
with a start and end time stamp for each row in the arrivals table. It turns out
fuzzyjoin has that covered.

The fuzzyjoin package provides a variety of special joins that do not rely on an exact match.
In this case, what I needed was the interval_left_join function which joins tables based
on overlapping intervals. The help for interval_left_join explains that this function requires
the IRanges package available from Bioconductor and points to
instructions for installation.
This was the first time I have used anything from the Bioconductor repository. I’m
impressed by the speed of interval_left_join. I thought it would be impractical to run it
on the full dataset, but it feels fast. I also used it to relate rows in
health_df to rows in workout_df, but I’ll describe that in Part II.
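fuzzyjoin does the real work, but the core matching logic is easy to sanity-check in base R: because the arrival times partition the timeline, findInterval() tells you which arrival interval contains each observation time stamp. The times and zones below are made up for illustration.

```r
# Sorted arrival times act as breakpoints; each observation falls into
# exactly one interval, whose time zone applies to it.
arrive <- as.POSIXct(c("2019-08-20 00:00:00", "2019-09-10 00:00:00"), tz = "UTC")
zone   <- c("Europe/London", "America/New_York")

obs <- as.POSIXct(c("2019-09-01 12:00:00", "2019-10-15 12:00:00"), tz = "UTC")
zone[findInterval(obs, arrive)]
# "Europe/London" "America/New_York"
```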

```r
health_df <- health_df %>%
  mutate(utc_start = as_datetime(startDate),
         utc_end = as_datetime(endDate)) %>%
  filter(!is.na(utc_start)) %>%
  interval_left_join(arrivals %>%
                       select(utc_arrive, utc_until, timezone = local_timezone),
                     by = c("utc_start" = "utc_arrive",
                            "utc_start" = "utc_until"))
```

Do a quick report to check whether the assignment of time zone makes sense.
If data documenting the transition from one time zone to
another is missing, the whole table will be off-kilter. If
there’s something wrong with the arrivals table, it should show up as an
unexpected result in Table 2.

```r
# Shows trip by trip to check whether time zone assignment works as expected:
# health_df %>%
#   filter(timezone != "America/New_York") %>%
#   mutate(date = as_date(utc_start)) %>%
#   group_by(timezone, utc_arrive) %>%
#   summarise(arrive = min(utc_start), leave = max(utc_start),
#             days = length(unique(date)),
#             n = n())

# Do a quick check whether the distribution of time zones makes sense
health_df %>%
  mutate(date = as_date(utc_start)) %>%
  group_by(timezone) %>%
  summarise(dates = length(unique(date)), observations = n()) %>%
  kable(format.args = list(decimal.mark = " ", big.mark = ","),
        table.attr = 'class="myTable"',
        caption = "Frequency of Days by Time Zone")
```

Table 2: Frequency of Days by Time Zone
| timezone | dates | observations |
|---|---:|---:|
| America/Chicago | 19 | 52,281 |
| America/Los_Angeles | 15 | 74,973 |
| America/New_York | 1,800 | 3,458,188 |
| America/Phoenix | 20 | 10,535 |
| Europe/Amsterdam | 3 | 5,105 |
| Europe/Athens | 11 | 19,555 |
| Europe/London | 29 | 574,659 |
| Europe/Rome | 14 | 1,467 |

Using Row-by-Row Time Zone Data to Adjust Time Stamps

Now that I have time zone information attached to each row of health_df,
I can translate the UTC time into the local time as it appeared on my watch.
In practice I find it quite hard to wrap my head around exactly
what is happening with all these time manipulations. Sometimes it feels like
a science fiction time travel story2.

Remember that for the datetime class
in R, time zone is an attribute that applies to an entire vector.
That’s a limitation we need to work around. First we will convert
the character version of the time stamp (with universal time offset)
to a datetime class value with UTC as the time zone.

Next I create
a function utc_dt_to_local which converts the time as
it appears in the Apple Health Export to the time as it appeared when
the data was originally added to the health database.
As a side effect the lubridate conversion
functions will also adjust for daylight savings changes.
This function
will apply to data for a single time zone only so that it can be
vectorized. This is important because it will be applied to millions of rows.

```r
utc_dt_to_local <- function(dt, time_zone) {
  # Adjust a vector of datetimes from the time zone where the data was
  # exported to a particular time_zone that applies to the whole vector.
  tz(dt) <- "UTC"
  local <- with_tz(dt, time_zone) # now adjust UTC to the time zone I want
  tz(local) <- "UTC"
  # I mark the vector as UTC because I will be row_bind-ing vectors
  # together and all need to have the same time zone attribute.
  # Although the vector is marked as UTC,
  # in the end I will treat the hour as being whatever the local
  # time was that I experienced then.
  return(local)
}
```

Next I will apply the utc_dt_to_local function to the character time stamps
in the Apple Health Export. The function needs to be applied to a vector with
the same time zone for all elements in the vector. By doing group_by(timezone)
before I use the function inside mutate, the function is applied with a
different time zone for each group.
That way the function is vectorized within each group and is reasonably fast.
I did not group the time zones separately for the start date and the end date. Usually
they would be in the same time zone, and even if they are not, I want to handle them as if they were.
All the time stamp vectors end up having the time zone attribute of “UTC”.
Remember that time zone is a single attribute that has to apply to the whole vector.
In this case it is labelled UTC, but each time stamp describes the local time
when and where the observation was recorded. So 16:42 means 4:42 PM in the time zone
(and daylight savings status) at the place and time of the measurement that goes
with that time stamp.
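A tiny standalone example of what this relabeling yields, using the same with_tz-then-relabel logic as utc_dt_to_local (the time stamp is made up): 14:51 UTC during British Summer Time reads as 15:51 on the watch.

```r
library(lubridate)

dt <- ymd_hms("2019-09-01 14:51:58", tz = "UTC")   # the instant, in UTC
local <- with_tz(dt, "Europe/London")              # clock reads 15:51:58 BST
tz(local) <- "UTC"                                 # relabel; clock reading kept

hour(local)  # 15 -- the hour as experienced on the watch
```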

```r
health_df <- health_df %>%
  group_by(timezone) %>%
  # assume end_date is in the same time zone as start_date
  mutate(local_start = utc_dt_to_local(utc_start, first(timezone)),
         local_end = utc_dt_to_local(utc_end, first(timezone))) %>%
  ungroup() %>%
  mutate(date = as_date(utc_start),
         start_time = as.integer(difftime(local_start,
                                          floor_date(local_start, "day"),
                                          unit = "secs")) %>%
           hms::hms()) %>%
  arrange(type, utc_start) %>%
  ungroup()

# Here I'll adjust time for workout_df as well
workout_df <- workout_df %>%
  mutate(utc_start = as_datetime(startDate),
         utc_end = as_datetime(endDate)) %>%
  filter(!is.na(utc_start)) %>%
  interval_left_join(arrivals %>%
                       select(utc_arrive, utc_until, timezone = local_timezone),
                     by = c("utc_start" = "utc_arrive",
                            "utc_start" = "utc_until"))

workout_df <- workout_df %>%
  group_by(timezone) %>%
  mutate(local_start = utc_dt_to_local(utc_start, first(timezone)),
         local_end = utc_dt_to_local(utc_end, first(timezone))) %>%
  arrange(utc_start) %>%
  ungroup()
# I'm going to focus on health_df and workout_df,
# but I could adjust times in the other df's as well
```

At this point we have two sets of date and time stamps for health_df and workout_df.
The prefix “utc” is for universal time and
should be used for sorting by time. The prefix “local” gives the local time as shown on the watch when the
measurement was recorded. Local time and date are what we need if we want to examine time of day.
For health_df there is also a column called start_time which is just the time of day, without date, expressed as hh:mm:ss.

Figure 1 is a density plot that shows
the difference between looking at
time of day in terms of UTC versus local time.
The first takeaway from this plot is that the frequency of measurements is
strongly related to time of day. There are relatively few measurements during the
night period when I am usually asleep in bed. There is also a deep dip around
lunch time. The Apple Watch makes more measurements when I am being active.
I’ll look at that in more detail in Part II.

The two distributions (UTC and local time) mostly overlap,
but there are some systematic differences.
I live in the Eastern Time Zone, which is five hours behind
UTC during standard
time and four hours behind during daylight savings.
As a compromise I
subtracted 4.5 hours from the UTC time of day to make the distributions fit on the same scale.
The red scale shows the distribution of observations in
the export data in terms of UTC time of day and the blue scale shows local time. The blue scale corresponds more
to my subjective experience of time of day. You can see that the red distribution is “fatter” during the period from
4AM to 8AM than is the blue scale. That’s primarily because of walking trips in England that are not in
the Eastern Time. Observations that appear to be at 4AM in terms of UTC – 4.5 are probably happening in
British Summer Time (because I’m on a hike in England during the summer) which is actually UTC + 1. In terms
of my experience of local time, they are happening when my watch says 9AM. (The dataset also includes
a 10 day walking holiday in Greece which is another two hours farther east than England.)

If one looks at sleep time between 00:00 and 06:00, the Local Time distribution (blue) shows a
consistent pattern of fewer observations because of less activity. A larger proportion of my
activity outside of the Eastern Time Zone was in Europe rather than in Pacific Time and that’s why
the UTC distribution seems shifted to the left relative to Local Time. Also, Local Time is responding to
daylight savings changes and UTC is not. I think that explains why the peaks before and after
lunch time are higher in the Local Time distribution than in UTC. I’m actually fairly rigid about when
I take time for lunch (and presumably reduce my physical activity) and taking daylight savings
into account makes that more clear.

In summary, if I did not correct for time zone and daylight savings the general bi-modal pattern would
still be apparent. But the translation into local time makes the time of day pattern more clear
and more accurate, especially if one focuses on hours when I would expect to be asleep.

Figure 1: Density Plot by Time of Observation

Apple, If You’re Listening…

There may be a relatively simple item of data that Apple could add to the health data that would
make it easier (and more accurate) to execute the adjustments for time zone described in this post.
Whenever time zone on the watch or phone is changed, surely there is a log of that event. If
those change events were added to the health dataset as a separate item, then one could construct
an effective way to adjust all the time stamps in the datasets, similar
to the way I have used the flight arrival times. I don’t know whether there are
logs to accomplish this retrospectively, but even if it were added going forward
that would be a big help. It would add very little data to the database.
Travel and daylight savings are the
only events that cause me to change the time on my phone or watch.
Perhaps there are some unusual situations where someone lives close to the border of
a time zone where they frequently change among cell phone towers in different
time zones. Perhaps that would add a lot more time change data, but such a person
really needs a way to adjust for time zone even more than most people. The fancy solution from Apple
would be if they added time change events to the data and then used that data to adjust the UTC offset
at the time they produced the Apple Health Export. That would be a great help and make the data
less confusing. In that case the time zone offsets that already appear in the
exported data would actually mean something useful and would allow easy translation between
UTC and local time.

Conclusion

This ends Part I. I decided I needed a separate Part I because it is mostly about
the machinations required to adjust time stamps to recover local time. For most people that’s
an issue that can be ignored, meaning the bulk of Part I is not relevant to them.
In Part II I will examine some of the data elements in more detail and give
some examples using the data.

1. The documentation for Apple HealthKit does offer a way for developers to store time zone metadata via the HKMetadataKeyTimeZone key. It appears that not even Apple uses this feature, and the Apple documentation only suggests using it for sleep data. It would be impractical to try to attach this metadata to every single observation.

2. For the ultimate in time travel paradox, see Heinlein’s
All You Zombies, described here.
There’s a movie version called Predestination starring Ethan Hawke.


### new release of offensive programming packages

Sat, 02/15/2020 - 01:00

The offensive programming ecosystem has been upgraded. You may now use the new package versions listed in the following table.

| package name (available on CRAN) | recommended version |
|---|---|
| wyz.code.offensiveProgramming | 1.1.17 or higher |
| wyz.code.testthat | 1.1.16 or higher |
| wyz.code.metaTesting | 1.1.11 or higher |
| wyz.code.rdoc | 1.1.15 or higher |

New functionalities

Global support of environment, S3, S4, RC, and R6 objects has been enhanced and strengthened. Offensive programming instrumentation is available for your favorite R class flavor.

Package wyz.code.rdoc allows you to create R manual pages in a programmatic way, thus allowing greater reuse of documentation pieces from one manual page to another.

Moreover, you can now also use it with standard R functions. Offensive programming instrumentation is no longer a required precondition for generating manual pages with package wyz.code.rdoc.

Scope of supported manual pages for generation is now complete: data, function, objects/classes and package manual pages can be generated in a programmatic way.

Main enhancements

You have

1. higher usability: more intuitive and friendly functions
2. increased robustness: more than 400 tests added
3. improved documentation: both manual pages (made using wyz.code.rdoc) and vignettes have been rewritten
4. increased code coverage: a global score higher than 99%.

Feel free to download directly from CRAN for trial, experimentation or for professional use.

Read the online offensive programming book to discover how to easily generate R manual pages from your code, while retaining full control over production options and content.
