R news and tutorials contributed by hundreds of R bloggers
Updated: 4 hours 20 min ago

### Teaching to machines: What is learning in machine learning entails?

Fri, 11/17/2017 - 04:40

(This article was first published on Memo's Island, and kindly contributed to R-bloggers)

Preamble

Figure 1: The oldest learning institution
in the world; University of Bologna
(Source: Wikipedia). Machine Learning (ML) is now a de-facto skill for every quantitative job and almost every industry embraced it, even though fundamentals of the field is not new at all. However, what does it mean to teach to a machine? Unfortunately, for even moderate technical people coming from different backgrounds, answer to this question is not apparent in the first instance. This sounds like a conceptual and jargon issue, but it lies in the very success of supervised learning algorithms.

What is a machine in machine learning

First of all here, machine does not mean a machine in conventional sense, but computational modules or set of instructions. It is called machine because of thermodynamics can be applied to analyse this computational modules, their algorithmic complexity can be map to an energy consumption and work production, recall Landauer Principle. Charles Bennett has an article on this principle, here.

Learning curve for machines

What about learning bit? Learning is a natural process for humans, specially for the younger ones. We learn new things naturally without explicitly thinking with instruction and some training, such as practised in University of Bologna a millennia ago, see Figure 1. Usually,  it takes time for us to learn new thing, and the process is called learning curve, due to Hermann Ebinghauss. Essentially, learning in machine learning exactly roots in Ebinghauss approach.  But question remains, how this manifests quantitatively in machine learning. Answer to this question is going to be given here.

The concept of learning curve and  effect of sample size in training machine learning models for both experienced and novice practitioners. It is often this type of analysis is omitted in producing supervised learning solutions. In the advent of deep learning architecture, multi-layered neural networks, this concept becomes more pronounced.

Quantifying learning curve lies in measuring performance over increasing experience. If the performance of the machine, or human, increases over experience we denote that learning is achieved. We distinguish good learner.

On the misconception of unsupervised learning

Figure 2: Donald Webb,
father of unsupervised
learning.
(Source: Wikipedia)

A generic misconception appear on what unsupervised learning means. Clustering, or categorising unlabelled data, is not learning in Ebbinghaus sense. The goodness of fit or cluster validation exercise do not account an increase in experience, at least this is not established in the literature, judging from cluster validation techniques, see, Jain-Dubes’s Algorithms for clustering data. Wikipedia defines unsupervised learning “inferring a function to describe hidden structure from “unlabeled” data”, this is a function inference is not learning,.

Then, what is unsupervised learning? It originates from Hebbian Theory from neuroscience that “Cells that fire together wire together”. This implies, unsupervised learning is about how information is encoded, not how it is labelled. One good practical model that could be classified as unsupervised learning, so called spiking network model.

Outlook

The concept of learning from Machine Learning perspective is summarised in this post. In data science, it is a good practice to differentiate what we call learning and function inference/optimisation. Being attentive to this details would help us.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Memo's Island. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### pool package on CRAN

Fri, 11/17/2017 - 01:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

The pool package makes it easier for Shiny developers to connect to databases. Up until now, there wasn’t a clearly good way to do this. As a Shiny app author, if you connect to a database globally (outside of the server function), your connection won’t be robust because all sessions would share that connection (which could leave most users hanging when one of them is using it, or even all of them if the connection breaks). But if you try to connect each time that you need to make a query (e.g. for every reactive you have), your app becomes a lot slower, as it can take in the order of seconds to establish a new connection. The pool package solves this problem by taking care of when to connect and disconnect, allowing you to write performant code that automatically reconnects to the database only when needed.

So, if you are a Shiny app author who needs to connect and interact with databases inside your apps, keep reading because this package was created to make your life easier.

What the pool package does

The pool package adds a new level of abstraction when connecting to a database: instead of directly fetching a connection from the database, you will create an object (called a “pool”) with a reference to that database. The pool holds a number of connections to the database. Some of these may be currently in-use and some of these may be idle, waiting for a new query or statement to request them. Each time you make a query, you are querying the pool, rather than the database. Under the hood, the pool will either give you an idle connection that it previously fetched from the database or, if it has no free connections, fetch one and give it to you. You never have to create or close connections directly: the pool knows when it should grow, shrink or keep steady. You only need to close the pool when you’re done.

Since pool integrates with both DBI and dplyr, there are very few things that will be new to you, if you’re already using either of those packages. Essentially, you shouldn’t feel the difference, with the exception of creating and closing a “Pool” object (as opposed to connecting and disconnecting a “DBIConnection” object). See this copy-pasteable app that uses pool and dplyr to query a MariaDB database (hosted on AWS) inside a Shiny app.

Very briefly, here’s how you’d connect to a database, write a table into it using DBI, query it using dplyr, and finally disconnect (you must have DBI, dplyr and pool installed and loaded in order to be able to run this code):

conn <- dbConnect(RSQLite::SQLite(), dbname = ":memory:") dbWriteTable(conn, "quakes", quakes) tbl(conn, "quakes") %>% select(-stations) %>% filter(mag >= 6) ## # Source: lazy query [?? x 4] ## # Database: sqlite 3.19.3 [:memory:] ## lat long depth mag ## ## 1 -20.70 169.92 139 6.1 ## 2 -13.64 165.96 50 6.0 ## 3 -15.56 167.62 127 6.4 ## 4 -12.23 167.02 242 6.0 ## 5 -21.59 170.56 165 6.0 dbDisconnect(conn)

And here’s how you’d do the same using pool:

pool <- dbPool(RSQLite::SQLite(), dbname = ":memory:") dbWriteTable(pool, "quakes", quakes) tbl(pool, "quakes") %>% select(-stations) %>% filter(mag >= 6) ## # Source: lazy query [?? x 4] ## # Database: sqlite 3.19.3 [:memory:] ## lat long depth mag ## ## 1 -20.70 169.92 139 6.1 ## 2 -13.64 165.96 50 6.0 ## 3 -15.56 167.62 127 6.4 ## 4 -12.23 167.02 242 6.0 ## 5 -21.59 170.56 165 6.0 poolClose(pool) What problem pool was created to solve

As mentioned before, the goal of the pool package is to abstract away the logic of connection management and the performance cost of fetching a new connection from a remote database. These concerns are especially prominent in interactive contexts, like Shiny apps. (So, while this package is of most practical value to Shiny developers, there is no harm if it is used in other contexts.)

The rest of this post elaborates some more on the specific problems of connection management inside of Shiny, and how pool addresses them.

The connection management spectrum

When you’re connecting to a database, it’s important to manage your connections: when to open them (taking into account that this is a potentially long process for remote databases), how to keep track of them, and when to close them. This is always true, but it becomes especially relevant for Shiny apps, where not following best practices can lead to many slowdowns (from inadvertently opening too many connections) and/or many leaked connections (i.e. forgetting to close connections once you no longer need them). Over time, leaked connections could accumulate and substantially slow down your app, as well as overwhelming the database itself.

Oversimplifying a bit, we can think of connection management in Shiny as a spectrum ranging from the extreme of just having one connection per app (potentially serving several sessions of the app) to the extreme of opening (and closing) one connection for each query you make. Neither of these approaches is great: the former is fast, but not robust, and the reverse is true for the latter.

In particular, opening only one connection per app makes it fast (because, in the whole app, you only fetch one connection) and your code is kept as simple as possible. However:

• it cannot handle simultaneous requests (e.g. two sessions open, both querying the database at the same time);
• if the connection breaks at some point (maybe the database server crashed), you won’t get a new connection (you have to exit the app and re-run it);
• finally, if you are not quite at this extreme, and you use more than one connection per app (but fewer than one connection per query), it can be difficult to keep track of all your connections, since you’ll be opening and closing them in potentially very different places.

While the other extreme of opening (and closing) one connection for each query you make resolves all of these points, it is terribly slow (each time we need to access the database, we first have to fetch a connection), and you need a lot more (boilerplate) code to connect and disconnect the connection within each reactive/function.

If you’d like to see actual code that illustrates these two approaches, check this section of the pool README.

The best of both worlds

The pool package was created so that you don’t have to worry about this at all. Since pool abstracts away the logic of connection management, for the vast majority of cases, you never have to deal with connections directly. Since the pool “knows” when it should have more connections and how to manage them, you have all the advantages of the second approach (one connection per query), without the disadvantages. You are still using one connection per query, but that connection is always fetched and returned to the pool, rather than getting it from the database directly. This is a whole lot faster and more efficient. Finally, the code is kept just as simple as the code in the first approach (only one connection for the entire app), since you don’t have to continuously call dbConnect and dbDisconnect.

Feedback

This package has quietly been around for a year and it’s now finally on CRAN, following lots of the changes in the database world (both in DBI and dplyr). All pool-related feedback is welcome. Issues (bugs and features requests) can be posted to the github tracker. Requests for help with code or other questions can be posted to community.rstudio.com/c/shiny, which I check regularly (they can, of course, also be posted to Stack Overflow, but I’m extremely likely to miss it).

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Six tips for running a successful unconference

Fri, 11/17/2017 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Attendees at the May 2017 rOpenSci unconference. Photo credit: Nistara Randhawa

In May 2017, I helped run a wildly successful “unconference” that had a huge positive impact on the community I serve. rOpenSci is a non-profit initiative enabling open and reproducible research by creating technical infrastructure in the form of staff- and community-contributed software tools in the R programming language that lower barriers to working with scientific data sources on the web, and creating social infrastructure through a welcoming and diverse community of software users and developers. Our 4th annual unconference brought together 70 people to hack on projects they dreamed up and to give them opportunities to meet and work together in person. One third of the participants had attended before, and two thirds were first-timers, selected from an open call for applications. We paid all costs up front for anyone who requested this in order to lower barriers to participation.

It’s called an “unconference” because there is no schedule set before the event – participants discuss project ideas online in advance and projects are selected by participant-voting at the start. I’m sharing some tips here on how to do this well for this particular flavour of unconference.

1. Have a code of conduct

Having a code of conduct that the organizers promote in the welcome goes a long way to creating a welcoming and safe environment and preventing violations in the first place.

2. Host online discussion of project ideas before the unconference

Our unconference centered on teams working on programming projects, rather than discussions, so prior to the unconference, we asked all participants to suggest project ideas using an open online system, called GitHub, that allows everyone to see and comment on ideas or just share enthusiastic emoji to show support.

3. Have a pre-unconference video-chat with first-time participants

Our AAAS CEFP training emphasizes the importance of extending a personalized welcome to community members, so I was inspired to make the bold move of talking with more than 40 first-time participants prior to the unconference. I asked each person to complete a short questionnaire to get them to consider the roles they anticipated playing prior to our chat. Frequently, within an hour of a video-chat, without prompting, I would see the person post a new project idea or share their thoughts about someone else’s. Other times, our conversation gave the person an opportunity to say “this idea maybe isn’t relevant but…” and I would help them talk through it, inevitably leading to “oh my gosh this is such a cool idea”. I got a huge return on my investment. People’s questions like “how do you plan to have 20 different projects present their work?” led to better planning on my part. Specific ideas for improvements came from me responding with “well…how would YOU do it?”

Between the emails, slack channel, issues on GitHub, and personal video chats, I felt completely at ease going into the unconf (where I knew next to no one!).

-rOpenSci unconf17 participant

4. Run an effective ice breaker

I adapted the “Human Barometer” ice breaker to enable 70 people to share opinions across all perceived levels of who a participant is and introduce themselves to the entire group within a 1 hour period. Success depended on creating questions that were relevant to the unconference crowd, and on visually keeping track of who had spoken up in order to call on those who had not. Ice breakers and the rOpenSci version of the Human Barometer will be the subject of a future CEFP blog post.

5. Have a plan to capture content

So much work and money go into running a great unconference, you can’t afford to do it without a plan to capture and disseminate stories about the people and the products. Harness the brain work that went into the ideas! I used the concept of content pillars to come up with a plan. Every project group was given a public repository on GitHub to house their code and documentation. In a 2-day unconference with 70 people in 20 projects, how do people present their results?! We told everyone that they had just three minutes to present, and that the only presentation material they could use was their project README (the page of documentation in their code repository). No slides allowed! This kept their focus on great documentation for the project rather than an ephemeral set of pretty slides. Practically speaking, this meant that all presentations were accessible from a single laptop connected to the projector and that to access their presentation, a speaker had only to click on the link to their repo. Where did the essence of this great idea come from? From a pre-unconference chat of course!

In the week following the unconference, we used the projects’ README documentation to create a series of five posts released Monday through Friday, noting every one of the 20 projects with links to their code repositories. To get more in-depth stories about people and projects, I let participants know we were keen to host community-contributed blog posts and that accepted posts would be tweeted to rOpenSci’s >13,000 followers. Immediately after the unconference, we invited selected projects to contribute longer form narrative blog posts and posted these once a week. The series ended with Unconf 2017: The Roads Not Taken, about all the great ideas that were not yet pursued and inviting people to contribute to these.

All of this content is tied together in one blog post to summarize the unconference and link to all staff- and community-contributed posts in the unconference projects series as well as to posts of warm personal and career-focussed reflections by some participants.

6. Care about other people’s success

Community managers do a lot of “emotional work”. In all of this, my #1 key to running a successful unconference is to genuinely and deeply care that participants arrive feeling ready to hit the ground running and leave feeling like they got what they wanted out of the experience. Ultimately, the success of our unconference is more about the human legacy – building people’s capacity as software users and developers, and the in-person relationships they develop within our online community – than it is about the projects.

“I’ve steered clear of ‘hackathons’ because they feel intimidating and the ‘bad’ kind of competitive. But, this….this was something totally different.”

– rOpenSci unconf17 participant

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### The City of Chicago uses R to issue beach safety alerts

Thu, 11/16/2017 - 23:27

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Among the many interesting talks I saw a the Domino Data Science Pop-Up in Chicago earlier this week was the presentation by Gene Lynes and Nick Lucius from the City of Chicago. The City of Chicago Tech Plan encourages smart communities and open government, and as part of that initiative the city has undertaken dozens of open-source, open-data projects in areas such as food safety inspections, preventing the spread of West Nile virus, and keeping sidewalks clear of snow.

This talk was on the Clear Water initiative, a project to monitor the water quality of Chicago's many public beaches on Lake Michigan, and to issue safety alerts (or in serious cases, beach closures) when E Coli levels in the water get too high. The problem is that E Coli levels can change rapidly: water levels can be normal for weeks, and then spike for a single day. But traditional culture tests take many days to return results, and while rapid DNA-based tests do exist, it's not possible conduct these tests daily at every beach.

The solution is to build a predictive model, which uses meteorological data and rapid DNA tests for some beaches, combined with historical (culture-based) evaluations of water quality, to predict E Coli levels at all beaches every day. The analysis is performed using R (you can find the R code at this Github repository).

The analysis was developed in conjunction with citizen scientists at Chi Hack Night and statisticians from DePaul University. In 2017, the model was piloted in production in Chicago to issue beach safety alerts and to create a live map of beach water quality. This new R-based model predicted 60 additional occurrences of poor water quality, compared with the process used in prior years.

Still, water quality is hard to predict: once you have the slower test data and an actual result to compare with, that's an accuracy rate of 38%, with fewer than 2% false alarms. (The city plans to use clustering techniques to further improve that number.) That model uses rapid DNA testing at five beaches to predict all beaches along Lake Michigan. A Shiny app (linked below) lets you explore the impact of testing at a different set of beaches, and adjusting the false positive rate, on the overall accuracy of the model.

You can find more details about the City of Chicago Clear Water initiative at the link below.

City of Chicago: Clear Water

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Data science courses in R (/python/and others) for $10 at Udemy (Black Friday sale) Thu, 11/16/2017 - 22:05 Udemy is offering readers of R-bloggers access to its global online learning marketplace for only$10 per course! This deal (offering over 50%-90% discount) is for hundreds of their courses – including many R-Programming, data science, machine learning etc.

• R programming

Introductory R courses:

Top data science courses (not in R):

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### New IBM Plex Sans Support in hrbrthemes + Automating Axis Text Justification

Thu, 11/16/2017 - 15:07

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

IBM has a new set of corporate typefaces — dubbed “Plex” — and has released them with a generous open license. IBM Plex Sans is not too shabby:

(that image was grifted from a Font Squirrel preview page)

The digit glyphs are especially nice for charts and the font iself is just different enough from Helvetica that I felt it was worth including in hrbrthemes. I converted the fonts to TTF (for best compatibility with R ‘stuff’), added them to the package (so you can install them from there), cloned the other ipsum functions into _ps versions and tried to figure out the best fonts for various theme elements. Here’s the result so far:

(I really like those digits.)

I haven’t tried theme_ipsum_ps() on anything but macOS, so if you kick the tyres and encounter problems, please file an issue.

Flushing ticks’ text

Readers with keen eyes will see that the axis text labels have different hjust settings on the ends (vs the 0.5 in the inner ones. I don’t always set them that was as it is not always appropriate to do so (IMO). When I do, it’s usually been a “figure out the breaks and then hand-craft the theme() components” endeavour. Well, not anymore (at least in simple cases) thanks to the new flush_ticks()

It’s a dead-simple process but it requires knowing the breaks and you pretty much have to build the plot to do that, so the function takes your completed ggplot2 object, introspects it (post-temporary build) and sets the tick text justification in the style seen in the above image. The code below generates that image and demonstrates use of flush_ticks():

library(hrbrthemes) library(ggplot2) ggplot(mpg, aes(displ, hwy)) + geom_jitter(aes(color=class, fill=class), size=3, shape=21, alpha=1/2) + scale_x_continuous(expand=c(0,0), limits=c(1, 8), breaks=1:8) + scale_y_continuous(expand=c(0,0), limits=c(10, 50)) + scale_color_ipsum() + scale_fill_ipsum() + facet_wrap(~class, scales="free") + labs( title="IBM Plex Sans Test", subtitle="This is a subtitle to see the how it looks in IBM Plex Sans", caption="Source: hrbrthemes & IBM" ) + theme_ipsum_ps(grid="XY", axis="xy") + theme(legend.position="none") -> gg flush_ticks(gg)

As indicated, this won’t presently work for free-scales but it’s going to save me a few minutes for each plot I would have done the more manual way. But, releasing it into the wild will help catch edge cases I haven’t considered yet. In the coming weeks, I’m going to add I’ve added an option to flush_ticks() to generate theme(axis.text...) statements vs just return the plot so you can include the theme() code directly vs have to rely on the function to display the plot. I may even make it an RStudio addin (though folks who would like to do that are encouraged to submit a PR).

I hope both new hrbrthemes features are useful to folks and they should be on CRAN in mid-December with an R markdown set of themes to go with the package.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Visualizing how confounding biases estimates of population-wide (or marginal) average causal effects

Thu, 11/16/2017 - 01:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

When we are trying to assess the effect of an exposure or intervention on an outcome, confounding is an ever-present threat to our ability to draw the proper conclusions. My goal (starting here and continuing in upcoming posts) is to think a bit about how to characterize confounding in a way that makes it possible to literally see why improperly estimating intervention effects might lead to bias.

Confounding, potential outcomes, and causal effects

Typically, we think of a confounder as a factor that influences both exposure and outcome. If we ignore the confounding factor in estimating the effect of an exposure, we can easily over- or underestimate the size of the effect due to the exposure. If sicker patients are more likely than healthier patients to take a particular drug, the relatively poor outcomes of those who took the drug may be due to the initial health status rather than the drug.

A slightly different view of confounding is tied to the more conceptual framework of potential outcomes, which I wrote a bit about earlier. A potential outcome is the outcome we would observe if an individual were subjected to a particular exposure. We may or may not observe the potential outcome – this depends on the actual exposure. (To simplify things here, I will assume we are interested only in two different exposures.) $$Y_0$$ and $$Y_1$$ represent the potential outcomes for an individual with and without exposure, respectively. We observe $$Y_0$$ if the individual is not exposed, and $$Y_1$$ if she is.

The causal effect of the exposure for the individual $$i$$ can be defined as $$Y_{1i} – Y_{0i}$$. If we can observe each individual in both states (with and without the exposure) long enough to measure the outcome $$Y$$, we are observing both potential outcomes and can measure the causal effect for each individual. Averaging across all individuals in the sample provides an estimate the population average causal effect. (Think of a crossover or N-of-1 study.)

Unfortunately, in the real world, it is rarely feasible to expose an individual to multiple conditions. Instead, we use one group as a proxy for the other. For example, the control group represents what would have happened to the exposed group had the exposed group not been exposed. This approach only makes sense if the control group is identical in every way to the exposure group (except for the exposure, of course.)

Our goal is to compare the distribution of outcomes for the control group with the exposed group. We often simplify this comparison by looking at the means of each distribution. The average causal effect (across all individuals) can be written as $$E(Y_1 – Y_0)$$, where $$E()$$ is the expectation or average. In reality, we cannot directly measure this since only one potential outcome is observed for each individual.

Using the following logic, we might be able to convince ourselves that we can use observed measurements to estimate unobservable average causal effects. First, we can say $$E(Y_1 – Y_0) = E(Y_1) – E(Y_0)$$, because expectation is linear. Next, it seems fairly reasonable to say that $$E(Y_1 | A = 1) = E(Y | A = 1)$$, where $$A=1$$ for exposure, $$A=0$$ for control. In words, this states that the average potential outcome of exposure for the exposed group is the same as what we actually observe for the exposed group (this is the consistency assumption in causal inference theory). Along the same lines, $$E(Y_0 | A = 0) = E(Y | A = 0)$$. Finally, if we can say that $$E(Y_1) = E(Y_1 | A = 1)$$ – the potential outcome of exposure for everyone is equal to the potential outcome of exposure for those exposed – then we can say that $$E(Y_1) = E(Y | A = 1)$$ (the potential outcome with exposure for everyone is the same as the observed outcome for the exposed. Similarly, we can make the same argument to conclude that $$E(Y_0) = E(Y | A = 0)$$. At the end of this train of logic, we conclude that we can estimate $$E(Y_1 – Y_0)$$ using observed data only: $$E(Y | A = 1) – E(Y | A = 0)$$.

This nice logic fails if $$E(Y_1) \ne E(Y | A = 1)$$ and/or $$E(Y_0) \ne E(Y | A = 0)$$. That is, this nice logic fails when there is confounding.

This is all a very long-winded way of saying that confounding arises when the distributions of potential outcomes for the population are different from those distributions for the subgroups we are using for analysis. For example, if the potential outcome under exposure for the population as a whole ($$Y_1$$) differs from the observed outcome for the subgroup that was exposed ($$Y|A=1$$), or the potential outcome without exposure for the entire population ($$Y_0$$) differs from the observed outcome for the subgroup that was not exposed ($$Y|A=0$$), any estimates of population level causal effects using observed data will be biased.

However, if we can find a factor $$L$$ (or factors) where

\begin{aligned} P(Y_1 | L=l) &= P(Y | A = 1 \text{ and } L=l) \\ P(Y_0 | L=l) &= P(Y | A = 0 \text{ and } L=l) \end{aligned} both hold for all levels or values of $$L$$, we can remove confounding (and get unbiased estimates of the causal effect) by “controlling” for $$L$$. In some cases, the causal effect we measure will be conditional on $$L$$, sometimes it will be a population-wide average (or marginal) causal effect, and sometimes it will be both.

What confounding looks like …

The easiest way to illustrate the population/subgroup contrast is to generate data from a process that includes confounding. In this first example, the outcome is continuous, and is a function of both the exposure ($$A$$) and a covariate ($$L$$). For each individual, we can generate both potential outcomes $$Y_0$$ and $$Y_1$$. (Note that both potential outcomes share the same individual level noise term $$e$$ – this is not a necessary assumption.) This way, we can “know” the true population, or marginal causal effect of exposure. The observed outcome $$Y$$ is determined by the exposure status. For the purposes of plotting a smooth density curve, we generate a very large sample – 2 million.

library(simstudy) defC <- defData(varname = "e", formula = 0, variance = 2, dist = "normal") defC <- defData(defC, varname = "L", formula = 0.4, dist = "binary") defC <- defData(defC, varname = "Y0", formula = "1 + 4*L + e", dist = "nonrandom") defC <- defData(defC, varname = "Y1", formula = "5 + 4*L + e", dist = "nonrandom") defC <- defData(defC, varname = "A", formula = "0.3 + 0.3 * L", dist = "binary") defC <- defData(defC, varname = "Y", formula = "1 + 4*A + 4*L + e", dist = "nonrandom") set.seed(2017) dtC <- genData(n = 2000000, defC) dtC[1:5] ## id e L Y0 Y1 A Y ## 1: 1 2.02826718 1 7.0282672 11.028267 1 11.0282672 ## 2: 2 -0.10930734 0 0.8906927 4.890693 0 0.8906927 ## 3: 3 1.04529790 0 2.0452979 6.045298 0 2.0452979 ## 4: 4 -2.48704266 1 2.5129573 6.512957 1 6.5129573 ## 5: 5 -0.09874778 0 0.9012522 4.901252 0 0.9012522

Feel free to skip over this code – I am just including in case anyone finds it useful to see how I generated the following series of annotated density curves:

library(ggplot2) getDensity <- function(vector, weights = NULL) { if (!is.vector(vector)) stop("Not a vector!") if (is.null(weights)) { avg <- mean(vector) } else { avg <- weighted.mean(vector, weights) } close <- min(which(avg < density(vector)$x)) x <- density(vector)$x[close] if (is.null(weights)) { y = density(vector)$y[close] } else { y = density(vector, weights = weights)$y[close] } return(data.table(x = x, y = y)) } plotDens <- function(dtx, var, xPrefix, title, textL = NULL, weighted = FALSE) { dt <- copy(dtx) if (weighted) { dt[, nIPW := IPW/sum(IPW)] dMarginal <- getDensity(dt[, get(var)], weights = dt$nIPW) } else { dMarginal <- getDensity(dt[, get(var)]) } d0 <- getDensity(dt[L==0, get(var)]) d1 <- getDensity(dt[L==1, get(var)]) dline <- rbind(d0, dMarginal, d1) brk <- round(dline$x, 1) p <- ggplot(aes(x=get(var)), data=dt) + geom_density(data=dt[L==0], fill = "#ce682f", alpha = .4) + geom_density(data=dt[L==1], fill = "#96ce2f", alpha = .4) if (weighted) { p <- p + geom_density(aes(weight = nIPW), fill = "#2f46ce", alpha = .8) } else p <- p + geom_density(fill = "#2f46ce", alpha = .8) p <- p + geom_segment(data = dline, aes(x = x, xend = x, y = 0, yend = y), size = .7, color = "white", lty=3) + annotate(geom="text", x = 12.5, y = .24, label = title, size = 5, fontface = 2) + scale_x_continuous(limits = c(-2, 15), breaks = brk, name = paste(xPrefix, var)) + theme(panel.grid = element_blank(), axis.text.x = element_text(size = 12), axis.title.x = element_text(size = 13) ) if (!is.null(textL)) { p <- p + annotate(geom = "text", x = textL[1], y = textL[2], label = "L=0", size = 4, fontface = 2) + annotate(geom = "text", x = textL[3], y = textL[4], label="L=1", size = 4, fontface = 2) + annotate(geom = "text", x = textL[5], y = textL[6], label="Population distribution", size = 4, fontface = 2) } return(p) } library(gridExtra) grid.arrange(plotDens(dtC, "Y0", "Potential outcome", "Full\npopulation", c(1, .24, 5, .22, 2.6, .06)), plotDens(dtC[A==0], "Y", "Observed", "Unexposed\nonly"), plotDens(dtC, "Y1", "Potential outcome", "Full\npopulation"), plotDens(dtC[A==1], "Y", "Observed", "Exposed\nonly"), nrow = 2 )

Looking at the various plots, we can see a few interesting things. The density curves on the left represent the entire population. The conditional distributions of the potential outcomes at the population level are all normally distributed, with means that depend on the exposure and covariate $$L$$. We can also see that the population-wide distribution of $$Y_0$$ and $$Y_1$$ (in blue) are non-symmetrically shaped, because they are a mixture of the conditional normal distributions, weighted by the proportion of each level of $$L$$. Since the proportions for the top and bottom plots are in fact the population proportion, the population-level density curves for $$Y_0$$ and $$Y_1$$ are similarly shaped, with less mass on the higher end, because individuals are less likely to have an $$L$$ value of 1:

dtC[, .(propLis1 = mean(L))] ## propLis1 ## 1: 0.399822

The shape of the marginal distribution of $$Y_1$$ is identical to $$Y_0$$ (in this case, because that is the way I generated the data), but shifted to the right by an amount equal to the causal effect. The conditional effect sizes are 4, as is the population or marginal effect size.

The subgroup plots on the right are a different story. In this case, the distributions of $$L$$ vary across the exposed and unexposed groups:

dtC[, .(propLis1 = mean(L)), keyby = A] ## A propLis1 ## 1: 0 0.2757937 ## 2: 1 0.5711685

So, even though the distributions of (observed) $$Y$$ conditional on $$L$$ are identical to their potential outcome counterparts in the whole population – for example, $$P(Y | A=0 \text{ and } L = 1) = P(Y_0 | L = 1)$$ – the marginal distributions of $$Y$$ are quite different for the exposed and unexposed. For example, $$P(Y | A = 0) \ne P(Y_0)$$. This is directly due to the fact that the mixing weights (the proportions of $$L$$) are different for each of the groups. In the unexposed group, about 28% have $$L=1$$, but for the exposed group, about 57% do. Using the subgroup data only, the conditional effect sizes are still 4 (comparing mean outcomes $$Y$$ within each level of $$L$$). However the difference in means between the marginal distributions of each subgroup is about 5.2 (calculated by 7.3 – 2.1). This is confounding.

No confounding

Just so we can see that when the covariate $$L$$ has nothing to do with the probability of exposure, the marginal distributions of the subgroups do in fact look like their population-level potential outcome marginal distributions:

defC <- updateDef(defC, "A", newformula = 0.5) # change data generation dtC <- genData(n = 2000000, defC) dtC[, .(propLis1 = mean(L)), keyby = A] # subgroup proportions ## A propLis1 ## 1: 0 0.4006499 ## 2: 1 0.3987437 dtC[, .(propLis1 = mean(L))] # population/marginal props ## propLis1 ## 1: 0.3996975 grid.arrange(plotDens(dtC, "Y0", "Potential outcome", "Population", c(1, .24, 5, .22, 2.6, .06)), plotDens(dtC[A==0], "Y", "Observed", "Unexposed"), plotDens(dtC, "Y1", "Potential outcome", "Population"), plotDens(dtC[A==1], "Y", "Observed", "Exposed"), nrow = 2 )

Estimation of causal effects (now with confounding)

Generating a smaller data set, we estimate the causal effects using simple calculations and linear regression:

library(broom) # change back to confounding defC <- updateDef(defC, "A", newformula = ".3 + .3 * L") dtC <- genData(2500, defC)

The true average (marginal) causal effect from the average difference in potential outcomes for the entire population:

dtC[, mean(Y1 - Y0)] ## [1] 4

And the true average causal effects conditional on the covariate $$L$$:

dtC[, mean(Y1 - Y0), keyby = L] ## L V1 ## 1: 0 4 ## 2: 1 4

If we try to estimate the marginal causal effect by using a regression model that does not include $$L$$, we run into problems. The estimate of 5.2 we see below is the same biased estimate we saw in the plot above. This model is reporting the differences of the means (across both levels of $$L$$) for the two subgroups, which we know (because we saw) are not the same as the potential outcome distributions in the population due to different proportions of $$L$$ in each subgroup:

tidy(lm(Y ~ A, data = dtC)) ## term estimate std.error statistic p.value ## 1 (Intercept) 2.027132 0.06012997 33.71251 1.116211e-205 ## 2 A 5.241004 0.09386145 55.83766 0.000000e+00

If we estimate a model that conditions on $$L$$, the estimates are on target because in the context of normal linear regression without interaction terms, conditional effects are the same as marginal effects (when confounding has been removed, or think of the comparisons being made within the orange groups and green groups in the fist set of plots above, not within the purple groups):

tidy(lm(Y ~ A + L , data = dtC)) ## term estimate std.error statistic p.value ## 1 (Intercept) 0.9178849 0.03936553 23.31697 5.809202e-109 ## 2 A 4.0968358 0.05835709 70.20288 0.000000e+00 ## 3 L 3.9589109 0.05862583 67.52844 0.000000e+00 Inverse probability weighting (IPW)

What follows briefly here is just a sneak preview of IPW (without any real explanation), which is one way to recover the marginal mean using observed data with confounding. For now, I am ignoring the question of why you might be interested in knowing the marginal effect when the conditional effect estimate provides the same information. Suffice it to say that the conditional effect is not always the same as the marginal effect (think of data generating processes that include interactions or non-linear relationships), and sometimes the marginal effect estimate may the best that we can do, or at least that we can do easily.

If we weight each individual observation by the inverse probability of exposure, we can remove confounding and estimate the marginal effect of exposure on the outcome. Here is a quick simulation example.

After generating the dataset (the same large one we started out with so you can compare) we estimate the probability of exposure $$P(A=1 | L)$$, assuming that we know the correct exposure model. This is definitely a questionable assumption, but in this case, we actually do. Once the model has been fit, we assign the predicted probability to each individual based on her value of $$L$$.

set.seed(2017) dtC <- genData(2000000, defC) exposureModel <- glm(A ~ L, data = dtC, family = "binomial") tidy(exposureModel) ## term estimate std.error statistic p.value ## 1 (Intercept) -0.847190 0.001991708 -425.3584 0 ## 2 L 1.252043 0.003029343 413.3053 0 dtC[, pA := predict(exposureModel, type = "response")]

The IPW is not based exactly on $$P(A=1 | L)$$ (which is commonly used in propensity score analysis), but rather, the probability of the actual exposure at each level of $$L$$: $$P(A=a | L)$$, where $$a\in(0,1)$$:

# Define two new columns defC2 <- defDataAdd(varname = "pA_actual", formula = "A * pA + (1-A) * (1-pA)", dist = "nonrandom") defC2 <- defDataAdd(defC2, varname = "IPW", formula = "1/pA_actual", dist = "nonrandom") # Add weights dtC <- addColumns(defC2, dtC) round(dtC[1:5], 2) ## id e L Y0 Y1 A Y pA pA_actual IPW ## 1: 1 2.03 1 7.03 11.03 1 11.03 0.6 0.6 1.67 ## 2: 2 -0.11 0 0.89 4.89 0 0.89 0.3 0.7 1.43 ## 3: 3 1.05 0 2.05 6.05 0 2.05 0.3 0.7 1.43 ## 4: 4 -2.49 1 2.51 6.51 1 6.51 0.6 0.6 1.67 ## 5: 5 -0.10 0 0.90 4.90 0 0.90 0.3 0.7 1.43

To estimate the marginal effect on the log-odds scale, we use function lm again, but with weights specified by IPW. The true value of the marginal effect of exposure (based on the population-wide potential outcomes) was 4.0. I know I am repeating myself here, but first I am providing the biased estimate that we get when we ignore covariate $$L$$ to convince you that the relationship between exposure and outcome is indeed confounded:

tidy(lm(Y ~ A , data = dtC)) ## term estimate std.error statistic p.value ## 1 (Intercept) 2.101021 0.002176711 965.2275 0 ## 2 A 5.184133 0.003359132 1543.2956 0

And now, with the simple addition of the weights but still not including $$L$$ in the model, our weighted estimate of the marginal effect is spot on (but with such a large sample size, this is not so surprising):

tidy(lm(Y ~ A , data = dtC, weights = IPW)) ## term estimate std.error statistic p.value ## 1 (Intercept) 2.596769 0.002416072 1074.789 0 ## 2 A 4.003122 0.003416842 1171.585 0

And finally, here is a plot of the IPW-adjusted density. You might think I am just showing you the plots for the unconfounded data again, but you can see from the code (and I haven’t hidden anything) that I am still using the data set with confounding. In particular, you can see that I am calling the routine plotDens with weights.

grid.arrange(plotDens(dtC, "Y0", "Potential outcome", "Population", c(1, .24, 5, .22, 2.6, .06)), plotDens(dtC[A==0], "Y", "Observed", "Unexposed", weighted = TRUE), plotDens(dtC, "Y1", "Potential outcome", "Population"), plotDens(dtC[A==1], "Y", "Observed", "Exposed", weighted = TRUE), nrow = 2 )

As I mentioned, I hope to write more on IPW, and marginal structural models, which make good use of this methodology to estimate effects that can be challenging to get a handle on.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Where to live in the US

Thu, 11/16/2017 - 01:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

I was fascinated by this xkcd comic about where to live based on your temperature preferences. I also thought it’d be fun to try to make a similar one from my R session! Since I’m no meteorologist and was a bit unsure of how to define winter and summer, and of their relevance in countries like, say, India which has monsoon, I decided to focus on a single country located in one hemisphere only and big enough to offer some variety… the USA! So, dear American readers, where should you live based on your temperature preferences?

Defining data sources Weather data

The data for the original xkcd graph comes from weatherbase. I changed sources because 1) I was not patient enough to wait for weatherbase to send me a custom dataset which I imagine is what xkcd author did and 2) I’m the creator of a cool package accessing airport weather data for the whole word including the US! My package is called “riem” like “R Iowa Environmental Mesonet” (the source of the data, a fantastic website) and “we laugh” in Catalan (at the time I wrote the package I was living in Barcelona and taking 4 hours of Catalan classes a week!). It’s a simple but good package which underwent peer-review at rOpenSci onboarding, thank you Brooke!

I based my graph on data from the last winter and the last summer. I reckon that one should average over more years, but nothing important is at stake here, right?

Cities sample

My package has a function for downloading weather data for a given airport based on its ID. For instance Los Angeles airport is LAX. At that point I just needed a list of cities in the US with their airport code. Indeed with my package you can get all airport weather networks, one per US state, and all the stations withing that network… this is a lot of airports with no way to automatically determine how big they are! And to get the city name since it’d be so hard to parse the airport name I’d have resorted to geocoding with e.g. this package. A bit complicated for a simple fun graph!

So I went to the Wikipedia page of the busiest airports in the US and ended up getting this dataset from the US Department of Transportation (such a weird open data format by the way… I basically copy-pasted the first two columns in a spreadsheet!). This is not perfect, getting a list of major cities in every state would be more fair but hey reader I want you to live in a really big city so that I might know where it is.

Ok so let’s get the data us_airports <- readr::read_csv("data/2017-11-16_wheretoliveus_airports.csv") knitr::kable(us_airports[1:10,]) Name Code Atlanta, GA (Hartsfield-Jackson Atlanta International) ATL Los Angeles, CA (Los Angeles International) LAX Chicago, IL (Chicago O’Hare International) ORD Dallas/Fort Worth, TX (Dallas/Fort Worth International) DFW New York, NY (John F. Kennedy International) JFK Denver, CO (Denver International) DEN San Francisco, CA (San Francisco International) SFO Charlotte, NC (Charlotte Douglas International) CLT Las Vegas, NV (McCarran International) LAS Phoenix, AZ (Phoenix Sky Harbor International) PHX

I first opened my table of 50 airports. Then, I made the call to the Iowa Environment Mesonet.

summer_weather <- purrr::map_df(us_airports$Code, riem::riem_measures, date_start = "2017-06-01", date_end = "2017-08-31") winter_weather <- purrr::map_df(us_airports$Code, riem::riem_measures, date_start = "2016-12-01", date_end = "2017-02-28")

We then remove the lines that will be useless for further computations (The xkcd graph uses average temperature in the winter and average Humidex in the summer which uses both temperature and dew point).

summer_weather <- dplyr::filter(summer_weather, !is.na(tmpf), !is.na(dwpf)) winter_weather <- dplyr::filter(winter_weather, !is.na(tmpf))

I quickly checked that there was “nearly no missing data” so I didn’t remove any station nor day but if I were doing this analysis for something serious I’d do more checks including the time difference between measures for instance. Note that I end up with 48 airports only, there was no measure for Honolulu, HI (Honolulu International), San Juan, PR (Luis Munoz Marin International). Too bad!

Calculating the two weather values

I started by converting all temperatures to Celsius degrees because although I like you, American readers, Fahrenheit degrees do not mean anything to me which is problematic for checking results plausibility for instance. I was lazy and just used Brooke Anderson’s (yep the same Brooke who reviewed riem for rOpenSci) weathermetrics package.

summer_weather <- dplyr::mutate(summer_weather, tmpc = weathermetrics::convert_temperature(tmpf, old_metric = "f", new_metric = "c"), dwpc = weathermetrics::convert_temperature(dwpf, old_metric = "f", new_metric = "c")) winter_weather <- dplyr::mutate(winter_weather, tmpc = weathermetrics::convert_temperature(tmpf, old_metric = "f", new_metric = "c")) Summer values

Summer values are Humidex values. The author explained Humidex “combines heat and dew point”. Once again I went to Wikipedia and found the formula. I’m not in a mood to fiddle with formula writing on the blog so please go there if you want to see it. There’s also a package with a function returning the Humidex. I was feeling adventurous and therefore wrote my own function and checked the results with the numbers from Wikipedia. It was first wrong because I wrote a “+” instead of a “-“… checking one’s code is crucial.

calculate_humidex <- function(temp, dewpoint){ temp + 0.5555*(6.11 *exp(5417.7530*(1/273.16 - 1/(273.15 + dewpoint))) - 10) } calculate_humidex(30, 15) ## [1] 33.969 calculate_humidex(30, 25) ## [1] 42.33841

And then calculating the summer Humidex values by station was quite straightforward…

library("magrittr") summer_weather <- dplyr::mutate(summer_weather, humidex = calculate_humidex(tmpc, dwpc)) summer_values <- summer_weather %>% dplyr::group_by(station) %>% dplyr::summarise(summer_humidex = mean(humidex, na.rm = TRUE))

… as it was for winter temperatures.

winter_values <- winter_weather %>% dplyr::group_by(station) %>% dplyr::summarise(winter_tmpc = mean(tmpc, na.rm = TRUE)) Prepare data for plotting

I first joined winter and summer values

climates <- dplyr::left_join(winter_values, summer_values, by = "station")

Then I re-added city airport names.

climates <- dplyr::left_join(climates, us_airports, by = c("station" = "Code"))

I only kept the city name.

head(climates$Name) ## [1] "Atlanta, GA (Hartsfield-Jackson Atlanta International)" ## [2] "Austin, TX (Austin - Bergstrom International)" ## [3] "Nashville, TN (Nashville International)" ## [4] "Boston, MA (Logan International)" ## [5] "Baltimore, MD (Baltimore/Washington International Thurgood Marshall)" ## [6] "Cleveland, OH (Cleveland-Hopkins International)" climates <- dplyr::mutate(climates, city = stringr::str_replace(Name, " \\(.*", "")) head(climates$city) ## [1] "Atlanta, GA" "Austin, TX" "Nashville, TN" "Boston, MA" ## [5] "Baltimore, MD" "Cleveland, OH" Plotting!

When imitating an xkcd graph, one should use the xkcd package! I had already done that in this post.

library("xkcd") library("ggplot2") library("extrafont") library("ggrepel") xrange <- range(climates$summer_humidex) yrange <- range(climates$winter_tmpc) set.seed(42) ggplot(climates, aes(summer_humidex, winter_tmpc)) + geom_point() + geom_text_repel(aes(label = city), family = "xkcd", max.iter = 50000)+ ggtitle("Where to live based on your temperature preferences", subtitle = "Data from airports weather stations, 2016-2017") + xlab("Summer heat and humidity via Humidex")+ ylab("Winter temperature in Celsius degrees") + xkcdaxis(xrange = xrange, yrange = yrange)+ theme_xkcd() + theme(text = element_text(size = 16, family = "xkcd"))

So, which city seems attractive to you based on this plot? Or would you use different weather measures, e.g. do you care about rain?

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Maëlle. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Shiny App for making Pixel Art Models

Thu, 11/16/2017 - 01:00

(This article was first published on blog, and kindly contributed to R-bloggers)

Last weekend, I discovered the pixel art. The goal is to reproduce a pixelated drawing. Anyone can do this without any drawing skills because you just have to reproduce the pixels one by one (on a squared paper). Kids and big kids can quickly become addicted to this.

Example

For this pixelated ironman, you need only 3 colors (black, yellow and red). At the beginning I thought this would be really easy and quick. It took me approximately 15 minutes to reproduce this. Children could take more than 1 hour to reproduce this, so it’s nice if you want to keep them busy.

Make your own pixel art models

On the internet, there are lots of models. There are also tutos on how to make models with Photoshop. Yet, I wanted to make an R package for making pixel art models, based on any pictures. The pipeline I came up with is the following:

• read an image with package magick
• downsize this image for processing
• use K-means to project colors in a small set of colors
• downsize the image and project colors
• plot the pixels and add lines to separate them

I think there may be a lot to improve but from what I currently know about images, it’s the best I could come up with as a first shot.

I made a package called pixelart, with an associated Shiny App.

# Installation devtools::install_github("privefl/pixelart") # Run Shiny App pixelart::run_app()

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Data Wrangling at Scale

Wed, 11/15/2017 - 20:15

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Just wrote a new R article: “Data Wrangling at Scale” (using Dirk Eddelbuettel’s tint template).

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Programming, meh… Let’s Teach How to Write Computational Essays Instead

Wed, 11/15/2017 - 13:11

(This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to R-bloggers)

From Stephen Wolfram, a nice phrase to describe the sorts of thing you can create using tools like Jupyter notebooks, Rmd and Mathematica notebooks: computational essays that complements the “computational narrative” phrase that is also used to describe such documents.

Wolfram’s recent blog post What Is a Computational Essay?, part essay, part computational essay,  is primarily a pitch for using Mathematica notebooks and the Wolfram Language. (The Wolfram Language provides computational support plus access to a “fact engine” database that ca be used to pull factual information into the coding environment.)

But it also describes nicely some of the generic features of other “generative document” media (Jupyter notebooks, Rmd/knitr) and how to start using them.

There are basically three kinds of things [in a computational essay]. First, ordinary text (here in English). Second, computer input. And third, computer output. And the crucial point is that these three kinds of these all work together to express what’s being communicated.

In Mathematica, the view is something like this:

In Jupyter notebooks:

In its raw form, an RStudio Rmd document source looks something like this:

A computational essay is in effect an intellectual story told through a collaboration between a human author and a computer. …

The ordinary text gives context and motivation. The computer input gives a precise specification of what’s being talked about. And then the computer output delivers facts and results, often in graphical form. It’s a powerful form of exposition that combines computational thinking on the part of the human author with computational knowledge and computational processing from the computer.

When we originally drafted the OU/FutureLearn course Learn to Code for Data Analysis (also available on OpenLearn), we wrote the explanatory text – delivered as HTML but including static code fragments and code outputs – as a notebook, and then ‘ran” the notebook to generate static HTML (or markdown) that provided the static course content. These notebooks were complemented by actual notebooks that students could work with interactively themselves.

(Actually, we prototyped authoring both the static text, and the elements to be used in the student notebooks, in a single document, from which the static HTML and “live” notebook documents could be generated: Authoring Multiple Docs from a Single IPython Notebook. )

Whilst the notion of the computational essay as a form is really powerful, I think the added distinction between between generative and generated documents is also useful. For example, a raw Rmd document of Jupyter notebook is a generative document that can be used to create a document containing text, code, and the output generated from executing the code. A generated document is an HTML, Word, or PDF export from an executed generative document.

Note that the generating code can be omitted from the generated output document, leaving just the text and code generated outputs. Code cells can also be collapsed so the code itself is hidden from view but still available for inspection at any time:

Notebooks also allow “reverse closing” of cells—allowing an output cell to be immediately visible, even though the input cell that generated it is initially closed. This kind of hiding of code should generally be avoided in the body of a computational essay, but it’s sometimes useful at the beginning or end of an essay, either to give an indication of what’s coming, or to include something more advanced where you don’t want to go through in detail how it’s made.

Even if notebooks are not used interactively, they can be used to create correct static texts where outputs that are supposed to relate to some fragment of code in the main text actually do so because they are created by the code, rather than being cut and pasted from some other environment.

However, making the generative – as well as generated – documents available means readers can learn by doing, as well as reading:

One feature of the Wolfram Language is that—like with human languages—it’s typically easier to read than to write. And that means that a good way for people to learn what they need to be able to write computational essays is for them first to read a bunch of essays. Perhaps then they can start to modify those essays. Or they can start creating “notes essays”, based on code generated in livecoding or other classroom sessions.

In terms of our own learnings to date about how to use notebooks most effectively as part of a teaching communication (i.e. as learning materials), Wolfram seems to have come to many similar conclusions. For example, try to limit the amount of code in any particular code cell:

In a typical computational essay, each piece of input will usually be quite short (often not more than a line or two). But the point is that such input can communicate a high-level computational thought, in a form that can readily be understood both by the computer and by a human reading the essay.

So what can go wrong? Well, like English prose, can be unnecessarily complicated, and hard to understand. In a good computational essay, both the ordinary text, and the code, should be as simple and clean as possible. I try to enforce this for myself by saying that each piece of input should be at most one or perhaps two lines long—and that the caption for the input should always be just one line long. If I’m trying to do something where the core of it (perhaps excluding things like display options) takes more than a line of code, then I break it up, explaining each line separately.

It can also be useful to “preview” the output of a particular operation that populates a variable for use in the following expression to help the reader understand what sort of thing that expression is evaluating:

Another important principle as far as I’m concerned is: be explicit. Don’t have some variable that, say, implicitly stores a list of words. Actually show at least part of the list, so people can explicitly see what it’s like.

In many respects, the computational narrative format forces you to construct an argument in a particular way: if a piece of code operates on a particular thing, you need to access, or create, the thing before you can operate on it.

[A]nother thing that helps is that the nature of a computational essay is that it must have a “computational narrative”—a sequence of pieces of code that the computer can execute to do what’s being discussed in the essay. And while one might be able to write an ordinary essay that doesn’t make much sense but still sounds good, one can’t ultimately do something like that in a computational essay. Because in the end the code is the code, and actually has to run and do things.

One of the arguments I’ve been trying to develop in an attempt to persuade some of my colleagues to consider the use of notebooks to support teaching is the notebook nature of them. Several years ago, one of the en vogue ideas being pushed in our learning design discussions was to try to find ways of supporting and encouraging the use of “learning diaries”, where students could reflect on their learning, recording not only things they’d learned but also ways they’d come to learn them. Slightly later, portfolio style assessment became “a thing” to consider.

Wolfram notes something similar from way back when…

The idea of students producing computational essays is something new for modern times, made possible by a whole stack of current technology. But there’s a curious resonance with something from the distant past. You see, if you’d learned a subject like math in the US a couple of hundred years ago, a big thing you’d have done is to create a so-called ciphering book—in which over the course of several years you carefully wrote out the solutions to a range of problems, mixing explanations with calculations. And the idea then was that you kept your ciphering book for the rest of your life, referring to it whenever you needed to solve problems like the ones it included.

Well, now, with computational essays you can do very much the same thing. The problems you can address are vastly more sophisticated and wide-ranging than you could reach with hand calculation. But like with ciphering books, you can write computational essays so they’ll be useful to you in the future—though now you won’t have to imitate calculations by hand; instead you’ll just edit your computational essay notebook and immediately rerun the Wolfram Language inputs in it.

One of the advantages that notebooks have over some other environments in which students learn to code is that structure of the notebook can encourage you to develop a solution to a problem whilst retaining your earlier working.

The earlier working is where you can engage in the minutiae of trying to figure out how to apply particular programming concepts, creating small, playful, test examples of the sort of the thing you need to use in the task you have actually been set. (I think of this as a “trial driven” software approach rather than a “test driven* one; in a trial,  you play with a bit of code in the margins to check that it does the sort of thing you want, or expect, it to do before using it in the main flow of a coding task.)

One of the advantages for students using notebooks is that they can doodle with code fragments to try things out, and keep a record of the history of their own learning, as well as producing working bits of code that might be used for formative or summative assessment, for example.

Another advantage is that by creating notebooks, which may include recorded fragments of dead ends when trying to solve a particular problem, is that you can refer back to them. And reuse what you learned, or discovered how to do, in them.

And this is one of the great general features of computational essays. When students write them, they’re in effect creating a custom library of computational tools for themselves—that they’ll be in a position to immediately use at any time in the future. It’s far too common for students to write notes in a class, then never refer to them again. Yes, they might run across some situation where the notes would be helpful. But it’s often hard to motivate going back and reading the notes—not least because that’s only the beginning; there’s still the matter of implementing whatever’s in the notes.

Looking at many of the notebooks students have created from scratch to support assessment activities in TM351, it’s evident that many of them are not using them other than as an interactive code editor with history. The documents contain code cells and outputs, with little if any commentary (what comments there are are often just simple inline code comments in a code cell). They are barely computational narratives, let alone computational essays; they’re more of a computational scratchpad containing small code fragments, without context.

This possibly reflects the prior history in terms of code education that students have received, working “out of context” in an interactive Python command line editor, or a traditional IDE, where the idea is to produce standalone files containing complete programmes or applications. Not pieces of code, written a line at a time, in a narrative form, with example output to show the development of a computational argument.

(One argument I’ve heard made against notebooks is that they aren’t appropriate as an environment for writing “real programmes” or “applications”. But that’s not strictly true: Jupyter notebooks can be used to define and run microservices/APIs as well as GUI driven applications.)

However, if you start to see computational narratives as a form of narrative documentation that can be used to support a form of literate programming, then once again the notebook format can come in to its own, and draw on styling more common in a text document editor than a programming environment.

(By default, Jupyter notebooks expect you to write text content in markdown or markdown+HTML, but WYSIWYG editors can be added as an extension.)

Use the structured nature of notebooks. Break up computational essays with section headings, again helping to make them easy to skim. I follow the style of having a “caption line” before each input. Don’t worry if this somewhat repeats what a paragraph of text has said; consider the caption something that someone who’s just “looking at the pictures” might read to understand what a picture is of, before they actually dive into the full textual narrative.

As well as allowing you to create documents in which the content is generated interactively – code cells can be changed and re-run, for example – it is also possible to embed interactive components in both generative and generated documents.

On the one hand, it’s quite possible to generate and embed an interactive map or interactive chart that supports popups or zooming in a generated HTML output document.

On the other, Mathematica and Jupyter both support the dynamic creation of interactive widget controls in generative documents that give you control over code elements in the document, such as sliders to change numerical parameters or list boxes to select categorical text items. (In the R world, there is support for embedded shiny apps in Rmd documents.)

These can be useful when creating narratives that encourage exploration (for example, in the sense of  explorable explantations, though I seem to recall Michael Blastland expressing concern several years ago about how ineffective interactives could be in data journalism stories.

The technology of Wolfram Notebooks makes it straightforward to put in interactive elements, like Manipulate, [interact/interactive in Jupyter notebooks] into computational essays. And sometimes this is very helpful, and perhaps even essential. But interactive elements shouldn’t be overused. Because whenever there’s an element that requires interaction, this reduces the ability to skim the essay.”

I’ve also thought previously that interactive functions are a useful way of motivating the use of functions in general when teaching introductory programming. For example, An Alternative Way of Motivating the Use of Functions?.

One of the issues in trying to set up student notebooks is how to handle boilerplate code that is required before the student can create, or run, the code you actually want them to explore. In TM351, we preload notebooks with various packages and bits of magic; in my own tinkerings, I’m starting to try to package stuff up so that it can be imported into a notebook in a single line.

Sometimes there’s a fair amount of data—or code—that’s needed to set up a particular computational essay. The cloud is very useful for handling this. Just deploy the data (or code) to the Wolfram Cloud, and set appropriate permissions so it can automatically be read whenever the code in your essay is executed.

As far as opportunities for making increasing use of notebooks as a kind of technology goes, I came to a similar conclusion some time ago to Stephen Wolfram when he writes:

[I]t’s only very recently that I’ve realized just how central computational essays can be to both the way people learn, and the way they communicate facts and ideas. Professionals of the future will routinely deliver results and reports as computational essays. Educators will routinely explain concepts using computational essays. Students will routinely produce computational essays as homework for their classes.

Regarding his final conclusion, I’m a little bit more circumspect:

The modern world of the web has brought us a few new formats for communication—like blogs, and social media, and things like Wikipedia. But all of these still follow the basic concept of text + pictures that’s existed since the beginning of the age of literacy. With computational essays we finally have something new.

In many respects, HTML+Javascript pages have been capable of delivering, and actually delivering, computationally generated documents for some time. Whether computational notebooks offer some sort of step-change away from that, or actually represent a return to the original read/write imaginings of the web with portable and computed facts accessed using Linked Data?

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Rstats – OUseful.Info, the blog…. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Speeding up package installation

Wed, 11/15/2017 - 12:17

(This article was first published on R – Why?, and kindly contributed to R-bloggers)

Can’t be bothered reading, tell me now

A simple one-line tweak can significantly speed up package installation and updates.

See my post at the Jumping Rivers blog (no point duplicating content in two places)

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Why?. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Using Shiny with Scheduled and Streaming Data

Wed, 11/15/2017 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Shiny applications are often backed by fluid, changing data. Data updates can occur at different time scales: from scheduled daily updates to live streaming data and ad-hoc user inputs. This article describes best practices for handling data updates in Shiny, and discusses deployment strategies for automating data updates.

This post builds off of a 2017 rstudio::conf talk. The recording of the original talk and the sample code for this post are available.

The end goal of this example is a dashboard to help skiers in Colorado select a resort to visit. Recommendations are based on:

1. Snow reports that provide useful metrics like number of runs open and amount of new snow. Snow reports are updated daily.
2. Weather data, updated in near real-time from a live stream.
3. User preferences, entered in the dashboard.

Visit the live app!

The backend for the dashboard looks like:

The first challenge is preparing the daily data. In this case, the data preparation requires a series of API requests and then basic data cleansing. The code for this process is written into an R Markdown document, alongside process documentation and a few simple graphs that help validate the new data. The R Markdown document ends by saving the cleansed data into a shared data directory. The entire R Markdown document is scheduled for execution.

It may seem odd at first to use a R Markdown document as the scheduled task. However, our team has found it incredibly useful to be able to look back through historical renderings of the “report” to gut-check the process. Using R Markdown also forces us to properly document the scheduled process.

We use RStudio Connect to easily schedule the document, view past historical renderings, and ultimately to host the application. If the job fails, Connect also sends us an email containing stdout from the render, which helps us stay on top of errors. (Connect can optionally send the successfully rendered report, as well.) However, the same scheduling could be accomplished with a workflow tool or even CRON.

Make sure the data, written to shared storage, is readable by the user running the Shiny application – typically a service account like rstudio-connect or shiny can be set as the run-as user to ensure consistent behavior.

Alternatively, instead of writing results to the file system, prepped data can be saved to a view in a database.

Using Scheduled Data in Shiny

The dashboard needs to look for updates to the underlying shared data and automatically update when the data changes. (It wouldn’t be a very good dashboard if users had to refresh a page to see new data.) In Shiny, this behavior is accomplished with the reactiveFileReader function:

The function checks the shared data file’s update timestamp every intervalMillis to see if the data has changed. If the data has changed, the file is re-read using readFunc. The resulting data object, daily_data, is reactive and can be used in downstream functions like render***.

If the cleansed data is stored in a database instead of written to a file in shared storage, use reactivePoll. reactivePoll is similar to reactiveFileReader, but instead of checking the file’s update timestamp, a second function needs to be supplied that identifies when the database is updated. The function’s help documentation includes an example.

Streaming Data

The second challenge is updating the dashboard with live streaming weather data. One way for Shiny to ingest a stream of data is by turning the stream into “micro-batches”. The invalidateLater function can be used for this purpose:

liveish_data <- reactive({ invalidateLater(100) httr::GET(...) })

This causes Shiny to poll the streaming API every 100 milliseconds for new data. The results are available in the reactive data object liveish_data. Picking how often to poll for data depends on a few factors:

1. Does the upstream API enforce rate limits?
2. How long does a data update take? The application will be blocked while it polls data.

The goal is to pick a polling time that balances the user’s desire for “live” data with these two concerns.

Conclusion

To summarize, this architecture provides a number of benefits: No more painful, manual running of R code every day! Dashboard code is isolated from data prep code. There is enough flexibility to meet user requirements for live and daily data, while preventing un-necessary number crunching on the backend.

_____='https://rviews.rstudio.com/2017/11/15/shiny-and-scheduled-data-r/';

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Mapping data using R and leaflet

Wed, 11/15/2017 - 00:02

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

The R language provides many different tools for creating maps and adding data to them. I’ve been using the leaflet package at work recently, so I thought I’d provide a short example here.

Whilst searching for some data that might make a nice map, I came across this article at ABC News. It includes a table containing Australian members of parliament, their electorate and their voting intention regarding legalisation of same-sex marriage. Since I reside in New South Wales, let’s map the data for electorates in that state.

Here’s the code at Github. The procedure is pretty straightforward:

• Obtain a shapefile of New South Wales electorates (from here) and read into R
• Read the data from the ABC News web page into a data frame (very easy using rvest)
• Match the electorate names from the two sources (they match perfectly, well done ABC!)
• Use the match to add a voting intention variable to the shapefile data
• Use leaflet to generate the map of New South Wales with electorates coloured by voting intention

At right, a non-interactive screen grab of the result (click for larger version). For the full interactive map visit this RPubs page where you can zoom, drag and mouse-over to see how Leaflet maps work.

I like R/leaflet a lot. It generates high-quality interactive maps and it’s easy to experiment in and publish from Rstudio. Some of the syntax feels a little clunky (a mix of newer “pipes” and older “formula-style”) and generating colour palettes feels strange if you spend most time in ggplot2. However, some of that is probably my inexperience with the package as much as anything else.

Filed under: australia, australian news, R, statistics Tagged: leaflet, maps, politics, ssm

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – What You're Doing Is Rather Desperate. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### How to plot basic maps with ggmap

Tue, 11/14/2017 - 18:40

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

When you’re learning data science – and specifically data science in R using the Tidyverse – it is very helpful to learn individual techniques in a highly modular way.

This is because large projects, when you really break them down, are just combinations of small techniques. If you master all of those little tools and techniques on a small scale then you can combine them together later to create more complex charts, graphs, and analyses.

At a very basic level, this means that you should master fundamental tools like geom_bar(), geom_line(), and geom_point(). After you master those very simple tools, you can learn additional tools that can be combined with the basics.

A good example of this is ggmap. The ggmap package allows you to download and plot basic maps from Google maps (and a few other sources). These maps can then be used as layers within the ggplot2 plotting system. Essentially, you can plot maps from ggmap, and then use ggplot2 to plot points and other geoms on top of the map. Using ggplot+ggmap this way is extremely useful for quick-and-dirty geospatial visualization.

With that in mind, I want to quickly show you how to use ggmap to download and plot maps.

First, let’s just load the packages that we will use. Here, we’re going to load ggmap and tidyverse. ggmap will allow us to access Google Maps, but we’ll also use tidyverse for some additional plotting and other functionality (like pipes).

#============== # LOAD PACKAGES #============== library(ggmap) library(tidyverse)

Now that our packages are downloaded, let’s just get a simple map. Here, we’ll retrieve a basic map of Tokyo from Google Maps.

#================= # GET MAP OF Tokyo #================= map.tokyo <- get_map("Tokyo") ggmap(map.tokyo)

Notice that we’ve used two functions.

First we used get_map() to retrieve the map from Google Maps. To do this, we just specified the name of the location (more on that later).

Next, we used ggmap() to plot the map.

In doing this, I saved the map first with the name map.tokyo. It can be useful sometimes to save a map like this with a name, but sometimes you don’t need the name. In fact, if you don’t need to save the map, then it can be useful to avoid doing so; finding names for little objects like this can become tiresome.

Having said that, we can actually retrieve and plot the map in a single line of code, without saving the map object.

To do this, we will use the tidyverse “pipe” operator, %>%. (Note: this is one of the reasons that we loaded the tidyverse package.)

#============================================= # USE THE PIPE OPERATOR # - Here, we're basically doing the same thing # but we will do it in one line of code #============================================= get_map("Tokyo") %>% ggmap()

Essentially, what we’ve done here is retrieved the map using get_map(), but then immediately “piped” it into ggmap(). Again, this can be useful because the code is cleaner … we don’t need an intermediate name.

Now let’s plot a different map. Here we will plot a map of Japan.

#========================================== # GET MAP OF JAPAN # - this doens't work well without zooming #========================================== get_map("Japan") %>% ggmap()

This code works in a similar way: we just provide the location name to get_map(), and then pipe it into ggmap().

Having said that, this plot is not very good, because it’s not really zoomed properly.

To fix this, we will use the zoom parameter of get_map() to zoom out.

#=========================================== # GET MAP OF JAPAN, ZOOMED # - here, we are manually setting the zoom # to get a better map # - to find the best setting for 'zoom', # you'll need to use some trial-and-error #=========================================== get_map("Japan", zoom = 5) %>% ggmap()

This is much better.

Keep in mind that to get the right setting for zoom, you’ll typically need to use a bit of trial-and-error.

Next, we’ll get a map of a more specific location. Here we will get a map of Shinjuku, an area of Tokyo.

Notice that as we do this, we are again just specifying the location and zooming in properly with the zoom parameter.

#============================== # GET MAP OF SPECIFIC LOCATION #============================== # note: not zoomed enough get_map("Shinjuku") %>% ggmap() # this is properly zoomed get_map("Shinjuku", zoom = 16) %>% ggmap()

So there you have it. That’s the quick introduction to ggmap.

To be clear, there’s actually quite a bit more functionality for get_map(), but in the interest of simplicity, you should master these techniques first. Later, once you have mastered these simple ggmap tools, you can start combining maps from ggmap() with tools from ggplot2:

# CREATE DATA FRAME df.tokyo_locations <- tibble(location = c("Ueno, Tokyo, Japan" ,"Shibuya, Tokyo, Japan" ,"Shinjuku, Tokyo, Japan")) # GEOCODE geo.tokyo_locations <- geocode(df.tokyo_locations$location) # COMBINE DATA df.tokyo_locations <- cbind(df.tokyo_locations, geo.tokyo_locations) # USE WITH GGPLOT get_map("Tokyo", zoom = 12) %>% ggmap() + geom_point(data = df.tokyo_locations, aes(x = lon, y = lat), color = 'red', size = 3) By the way, this is exactly why I recommend learning and mastering simple tools like geom_point() first … once you have mastered simple tools, you can start combining them together like building-blocks to create more advanced visualizations and analyses. Sign up to master R Want to master R? Here at Sharp Sight, we teach data science in R. And we not only show you the techniques, we will show you how to practice those techniques so that you master them. So if you want to master data science and become “fluent” in R, sign up for our email newsletter. When you sign up, you’ll get: – ggplot2 data visualization tutorials – tutorials on how to practice ggplot2 syntax (so you can write code “with your eyes closed”) – practice tips to help you master and memorize syntax You’ll also get access to our “Data Science Crash Course” for free. SIGN UP NOW The post How to plot basic maps with ggmap appeared first on SHARP SIGHT LABS. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Visualize KEGG pathway and fold enrichment Tue, 11/14/2017 - 18:29 (This article was first published on R – Fabio Marroni's Blog, and kindly contributed to R-bloggers) As a useful note to self, I paste here an easy example on the use of the pathview package by Weijun Luo to plot the log fold change of gene expression across a given KEGG pathway. This example is on Vitis vinifera (as the prefix vvi can suggest), but the approach is general. Basically, you just need to feed pathview the pathway argument and a gene.data argument. In this case, gene.data is a named vector, with names being the (entrez) gene names, and the value is the log Fold Change. I selected a pathway for which KEGG has a nice representation, you might not be so lucky! library(pathview) mypathway<-"vvi03060" genes<-c("100241050","100243802","100244217","100244265","100247624","100247887","100248517"," 100248990","100250268","100250385","100250458","100251379","100252350","100252527","100252725"," 100252902","100253826","100254350","100254429","100254996","100255515","100256046","100256113"," 100256412","100256941","100257568","100257730","100258179","100258854","100259285","100259443"," 100260422","100260431","100261219","100262919","100263033","100264739","100265371","100266802"," 100267343","100267692","100852861","100853033","100854110","100854416","100854647","100855182"," 104879783","109122671") logFC<-rnorm(length(genes),-0.5,1) names(logFC)<-genes pathview(gene.data=logFC,species="vvi",pathway=mypathway) The result should be something like this: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Fabio Marroni's Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Advice to aspiring data scientists: start a blog Tue, 11/14/2017 - 17:45 (This article was first published on Variance Explained, and kindly contributed to R-bloggers) Last week I shared a thought on Twitter: When you’ve written the same code 3 times, write a function When you’ve given the same in-person advice 3 times, write a blog post — David Robinson (@drob) November 9, 2017 Ironically, this tweet hints at a piece of advice I’ve given at least 3 dozen times, but haven’t yet written a post about. I’ve given this advice to almost every aspiring data scientist who asked me what they should do to find a job: start a blog, and write about data science.1 What could you write about if you’re not yet working as a data scientist? Here are some possible topics (each attached to examples from my own blog): • Analyses of datasets you find interesting (example, example) • Intuitive explanations of concepts you’ve recently mastered (example, example) • Explorations of how to make a specific piece of code faster (example) • Announcements about open source projects you’ve released (example) • Links to interactive applications you’ve built (example, example) • Sharing a writeup of conferences or meetups you’ve attended (example) • Expressing opinions about data science or educational practice (example) In a future post I’d like to share advice about the how of data science blogging (such as how to choose a topic, how to structure a post, and how to publish with blogdown). But here I’ll focus on the why. If you’re in the middle of a job search, you’re probably very busy (especially if you’re currently employed), and blogging is a substantial commitment. So here I’ll lay out three reasons that a data science blog is well worth your time. Practice analyzing data and communicating about it If you’re hoping to be a data scientist, you’re (presumably) not one yet. A blog is your chance to practice the relevant skills. • Data cleaning: One of the benefits of working with a variety of datasets is that you learn to take data “as it comes”, whether it’s in the form of a supplementary file from a journal article or a movie script • Statistics: Working with unfamiliar data lets you put statistical methods into practice, and writing posts that communicate and teach concepts helps build your own understanding • Machine learning: There’s a big difference between having used a predictive algorithm once and having used it on a variety of problems, while understanding why you’d choose one over another • Visualization: Having an audience for your graphs encourages you to start polishing them and building your personal style • Communication: You gain experience writing and get practice structuring a data-driven argument. This is probably the most relevant skill that blogging develops since it’s hard to practice elsewhere, and it’s an essential part of any data science career I can’t emphasize enough how important this kind of practice is. No matter how many Coursera, DataCamp or bootcamp courses you’ve taken, you still need experience applying those tools to real problems. This isn’t unique to data science: whatever you currently do professionally, I’m sure you’re better at it now than when you finished taking classes in it. … and that concludes Machine Learning 101. Now, go forth and apply what you've learned to real data! pic.twitter.com/D6wSKgdjeM — ML Hipster (@ML_Hipster) August 19, 2015 One of the great thrills of a data science blog is that, unlike a course, competition, or job, you can analyze any dataset you like! No one was going to pay me to analyze Love Actually’s plot or Hacker News titles. Whatever amuses or interests you, you can find relevant data and write some posts about it. Create a portfolio of your work and skills Graphic designers don’t typically get evaluated based on bullet points on their CV or statements in a job interview: they share a portfolio with examples of their work. I think the data science field is shifting in the same direction: the easiest way to evaluate a candidate is to see a few examples of data analyses they’ve performed. Blogging is an especially good fit for showing off your skills because, unlike a technical interview, you get to put your “best foot forward.” Which of your skills are you proudest of? • If you’re skilled at visualizing data, write some analyses with some attractive and informative graphs (“Here’s an interactive visualization of mushroom populations in the United States”) • If you’re great at teaching and communicating, write some lessons about statistical concepts (“Here’s an intuitive explanation of PCA”) • If you have a knack for fitting machine learning models, blog about some predictive accomplishments (“I was able to determine the breed of a dog from a photo with 95% accuracy”) • If you’re an experienced programmer, announce open source projects you’ve developed and share examples of how they can be used (“With my sparkcsv package, you can load CSV datasets into Spark 10X faster than previous methods”) • If your real expertise is in a specific domain, try focusing on that (“Here’s how penguin populations have been declining in the last decade, and why”) Just because you’re expecting employers to look at your work doesn’t mean it has to be perfect. Generally, when I’m evaluating a candidate, I’m excited to see what they’ve shared publicly, even if it’s not polished or finished. And sharing anything is almost always better than sharing nothing. "Things that are still on your computer are approximately useless." –@drob #eUSR #eUSR2017 pic.twitter.com/nS3IBiRHBn — Amelia McNamara (@AmeliaMN) November 3, 2017 In this post I shared how I got my current job, when a Stack Overflow engineer saw one of my posts and reached out to me. That certainly qualifies as a freak accident. But the more public work you do, the higher the chance of a freak accident like that: of someone noticing your work and pointing you towards a job opportunity, or of someone who’s interviewing you having heard of work you’ve done. And the purpose of blogging isn’t only to advertise yourself to employers. You also get to build a network of colleagues and fellow data scientists, which helps both in finding a job and in your future career. (I’ve found #rstats users on Twitter to be a particularly terrific community). A great example of someone who succeeded in this strategy is my colleague Julia Silge, who started her excellent blog while she was looking to shift her career into data science, and both got a job and built productive relationships through it. Get feedback and evaluation Suppose you’re currently looking for your first job as a data scientist. You’ve finished all the relevant DataCamp courses, worked your way through some books, and practiced some analyses. But you still don’t feel like you’re ready, or perhaps your applications and interviews haven’t been paying off, and you decide you need a bit more practice. What should you do next? What skills could you improve on? It’s hard to tell when you’re developing a new set of skills how far along you are, and what you should be learning next. This is one of the challenges of self-driven learning as opposed to working with a teacher or mentor. A blog is one way to get this kind of feedback from others in the field. This might sound scary, like you could get a flood of criticism that pushes you away from a topic. But in practice, you can usually sense that you’re not ready well before you finish a blog post.2 For instance, even if you’re familiar with the basics of random forests, you might discover that you can’t achieve the accuracy you’d hoped for on a Kaggle dataset- and you have a chance to hold off on your blog post until you’ve learned more. What’s important is the committment: it’s easy to think “I probably could write this if I wanted”, but harder to try writing it. Which of your skills are more developed, or more important, than you thought you were? This is the positive side of self-evaluation. Once you’ve shared some analyses and code, you’ll probably find that you were underrating yourself in some areas. This affects everyone but it’s especially important for graduating Ph.D. students, who spend several years becoming an expert in a specific topic while surrounded by people who are already experts- a recipe for impostor syndrome. Imposter Syndrome: be honest with yourself about what you know and have accomplished & focus less on the difference. pic.twitter.com/VTjS5KdR6Y — David Whittaker (@rundavidrun) April 13, 2015 For instance, I picked up the principles of empirical Bayes estimation while I was a graduate student, and since it was a simplification of “real” Bayesian analysis I assumed it wasn’t worth talking about. But once I blogged about empirical Bayes, I learned that those posts had a substantial audience, and that there’s a real lack of intuitive explanations for the topic. I ended up expanding the posts into an e-book: most of the material in the book would never qualify for an academic publication, but it was still worth sharing with the wider world. One question I like to ask of PhD students, and anyone with hard-won but narrow expertise, is “What’s the simplest thing you understand that almost no one outside your field does?” That’s a recipe for a terrific and useful blog post. Conclusion One of the hardest mental barriers to starting a blog is the worry that you’re “shouting into the void”. If you haven’t developed an audience yet, it’s possible almost no one will read your blog posts- so why put work into them? First, a lot of the benefits I describe above are just as helpful whether you have ten Twitter followers or ten thousand. You can still practice your analysis and writing skills, and point potential employers towards your work. And it helps you get into the habit of sharing work publicly, which will become increasingly relevant as your network grows. Secondly, this is where people who are already members of the data science community can help. My promise is this: if you’re early in your career as a data scientist and you start a data-related blog, tweet me a link at @drob and I’ll tweet about your first post (in fact, the offer’s good for each of your first three posts). Don’t worry if it’s polished or “good enough to share”- just share the first work you find interesting!3 I have a decently-sized audience, and more importantly my followers include a lot of data scientists who are very supportive of beginners and are interested in promoting their work. Good luck and I’m excited to see what you come up with! 1. It’s also a great idea to blog if you’re currently a data scientist! But the reasons are a bit different, and I won’t be exploring them in this post. 2. Even if you do post an analysis with some mistakes or inefficiencies, if you’re part of a welcoming community the comments are likely to trend towards constructive (“Nice post! Have you considered vectorizing that operation?”) rather than toxic (“That’s super slow, dummy!”). In short, if as a beginner you post something that gets nasty comments, it’s not your fault, it’s the community’s 3. A few common-sense exceptions: I wouldn’t share work that’s ethically compromised, such as if it publicizes private data or promotes invidious stereotypes. Another exception is posts that are just blatant advertisements for a product or service. This probably doesn’t apply to you: just don’t actively try to abuse it! var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Variance Explained. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Comparing Some Strategies from Easy Volatility Investing, and the Table.Drawdowns Command Tue, 11/14/2017 - 16:38 (This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers) This post will be about comparing strategies from the paper “Easy Volatility Investing”, along with a demonstration of R’s table.Drawdowns command. First off, before going further, while I think the execution assumptions found in EVI don’t lend the strategies well to actual live trading (although their risk/reward tradeoffs also leave a lot of room for improvement), I think these strategies are great as benchmarks. So, some time ago, I did an out-of-sample test for one of the strategies found in EVI, which can be found here. Using the same source of data, I also obtained data for SPY (though, again, AlphaVantage can also provide this service for free for those that don’t use Quandl). Here’s the new code. require(downloader) require(quantmod) require(PerformanceAnalytics) require(TTR) require(Quandl) require(data.table) download("http://www.cboe.com/publish/scheduledtask/mktdata/datahouse/vix3mdailyprices.csv", destfile="vxvData.csv") VIX <- fread("http://www.cboe.com/publish/scheduledtask/mktdata/datahouse/vixcurrent.csv", skip = 1) VIXdates <- VIX$Date VIX$Date <- NULL; VIX <- xts(VIX, order.by=as.Date(VIXdates, format = '%m/%d/%Y')) vxv <- xts(read.zoo("vxvData.csv", header=TRUE, sep=",", format="%m/%d/%Y", skip=2)) ma_vRatio <- SMA(Cl(VIX)/Cl(vxv), 10) xivSigVratio <- ma_vRatio < 1 vxxSigVratio <- ma_vRatio > 1 # V-ratio (VXV/VXMT) vRatio <- lag(xivSigVratio) * xivRets + lag(vxxSigVratio) * vxxRets # vRatio <- lag(xivSigVratio, 2) * xivRets + lag(vxxSigVratio, 2) * vxxRets # Volatility Risk Premium Strategy spy <- Quandl("EOD/SPY", start_date='1990-01-01', type = 'xts') spyRets <- Return.calculate(spy$Adj_Close) histVol <- runSD(spyRets, n = 10, sample = FALSE) * sqrt(252) * 100 vixDiff <- Cl(VIX) - histVol maVixDiff <- SMA(vixDiff, 5) vrpXivSig <- maVixDiff > 0 vrpVxxSig <- maVixDiff < 0 vrpRets <- lag(vrpXivSig, 1) * xivRets + lag(vrpVxxSig, 1) * vxxRets obsCloseMomentum <- magicThinking # from previous post compare <- na.omit(cbind(xivRets, obsCloseMomentum, vRatio, vrpRets)) colnames(compare) <- c("BH_XIV", "DDN_Momentum", "DDN_VRatio", "DDN_VRP")

So, an explanation: there are four return streams here–buy and hold XIV, the DDN momentum from a previous post, and two other strategies.

The simpler one, called the VRatio is simply the ratio of the VIX over the VXV. Near the close, check this quantity. If this is less than one, buy XIV, otherwise, buy VXX.

The other one, called the Volatility Risk Premium strategy (or VRP for short), compares the 10 day historical volatility (that is, the annualized running ten day standard deviation) of the S&P 500, subtracts it from the VIX, and takes a 5 day moving average of that. Near the close, when that’s above zero (that is, VIX is higher than historical volatility), go long XIV, otherwise, go long VXX.

Again, all of these strategies are effectively “observe near/at the close, buy at the close”, so are useful for demonstration purposes, though not for implementation purposes on any large account without incurring market impact.

Here are the results, since 2011 (that is, around the time of XIV’s actual inception):

To note, both the momentum and the VRP strategy underperform buying and holding XIV since 2011. The VRatio strategy, on the other hand, does outperform.

Here’s a summary statistics function that compiles some top-level performance metrics.

stratStats <- function(rets) { stats <- rbind(table.AnnualizedReturns(rets), maxDrawdown(rets)) stats[5,] <- stats[1,]/stats[4,] stats[6,] <- stats[1,]/UlcerIndex(rets) rownames(stats)[4] <- "Worst Drawdown" rownames(stats)[5] <- "Calmar Ratio" rownames(stats)[6] <- "Ulcer Performance Index" return(stats) }

And the result:

> stratStats(compare['2011::']) BH_XIV DDN_Momentum DDN_VRatio DDN_VRP Annualized Return 0.3801000 0.2837000 0.4539000 0.2572000 Annualized Std Dev 0.6323000 0.5706000 0.6328000 0.6326000 Annualized Sharpe (Rf=0%) 0.6012000 0.4973000 0.7172000 0.4066000 Worst Drawdown 0.7438706 0.6927479 0.7665093 0.7174481 Calmar Ratio 0.5109759 0.4095285 0.5921650 0.3584929 Ulcer Performance Index 1.1352168 1.2076995 1.5291637 0.7555808

To note, all of the benchmark strategies suffered very large drawdowns since XIV’s inception, which we can examine using the table.Drawdowns command, as seen below:

> table.Drawdowns(compare[,1]['2011::'], top = 5) From Trough To Depth Length To Trough Recovery 1 2011-07-08 2011-11-25 2012-11-26 -0.7439 349 99 250 2 2015-06-24 2016-02-11 2016-12-21 -0.6783 379 161 218 3 2014-07-07 2015-01-30 2015-06-11 -0.4718 236 145 91 4 2011-02-15 2011-03-16 2011-04-20 -0.3013 46 21 25 5 2013-04-15 2013-06-24 2013-07-22 -0.2877 69 50 19 > table.Drawdowns(compare[,2]['2011::'], top = 5) From Trough To Depth Length To Trough Recovery 1 2014-07-07 2016-06-27 2017-03-13 -0.6927 677 499 178 2 2012-03-27 2012-06-13 2012-09-13 -0.4321 119 55 64 3 2011-10-04 2011-10-28 2012-03-21 -0.3621 117 19 98 4 2011-02-15 2011-03-16 2011-04-21 -0.3013 47 21 26 5 2011-06-01 2011-08-04 2011-08-18 -0.2723 56 46 10 > table.Drawdowns(compare[,3]['2011::'], top = 5) From Trough To Depth Length To Trough Recovery 1 2014-01-23 2016-02-11 2017-02-14 -0.7665 772 518 254 2 2011-09-13 2011-11-25 2012-03-21 -0.5566 132 53 79 3 2012-03-27 2012-06-01 2012-07-19 -0.3900 80 47 33 4 2011-02-15 2011-03-16 2011-04-20 -0.3013 46 21 25 5 2013-04-15 2013-06-24 2013-07-22 -0.2877 69 50 19 > table.Drawdowns(compare[,4]['2011::'], top = 5) From Trough To Depth Length To Trough Recovery 1 2015-06-24 2016-02-11 2017-10-11 -0.7174 581 161 420 2 2011-07-08 2011-10-03 2012-02-03 -0.6259 146 61 85 3 2014-07-07 2014-12-16 2015-05-21 -0.4818 222 115 107 4 2013-02-20 2013-07-08 2014-06-10 -0.4108 329 96 233 5 2012-03-27 2012-06-01 2012-07-17 -0.3900 78 47 31

Note that the table.Drawdowns command only examines one return stream at a time. Furthermore, the top argument specifies how many drawdowns to look at, sorted by greatest drawdown first.

One reason I think that these strategies seem to suffer the drawdowns they do is that they’re either all-in on one asset, or its exact opposite, with no room for error.

One last thing, for the curious, here is the comparison with my strategy since 2011 (essentially XIV inception) benchmarked against the strategies in EVI (which I have been trading with live capital since September, and have recently opened a subscription service for):

stratStats(compare['2011::']) QST_vol BH_XIV DDN_Momentum DDN_VRatio DDN_VRP Annualized Return 0.8133000 0.3801000 0.2837000 0.4539000 0.2572000 Annualized Std Dev 0.3530000 0.6323000 0.5706000 0.6328000 0.6326000 Annualized Sharpe (Rf=0%) 2.3040000 0.6012000 0.4973000 0.7172000 0.4066000 Worst Drawdown 0.2480087 0.7438706 0.6927479 0.7665093 0.7174481 Calmar Ratio 3.2793211 0.5109759 0.4095285 0.5921650 0.3584929 Ulcer Performance Index 10.4220721 1.1352168 1.2076995 1.5291637 0.7555808

NOTE: I am currently looking for networking and full-time opportunities related to my skill set. My LinkedIn profile can be found here.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### An update for MRAN

Tue, 11/14/2017 - 16:09

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

MRAN, the Microsoft R Application Network has been migrated to a new high-performance, high-availability server, and we've taken the opportunity to make a few upgrades along the way. You shouldn't notice any breaking changes (of course if you do, please let us know), but you should notice faster performance for the MRAN site and for the checkpoint package. (MRAN is also the home of daily archives of CRAN, which checkpoint relies on to deliver specific package versions for its reproducibility functions.)

Among the improvements you will find:

As always, you can find MRAN at mran.microsoft.com.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### intsvy: PISA for research and PISA for teaching

Tue, 11/14/2017 - 12:42

(This article was first published on SmarterPoland.pl » English, and kindly contributed to R-bloggers)

The Programme for International Student Assessment (PISA) is a worldwide study of 15-year-old school pupils’ scholastic performance in mathematics, science, and reading. Every three years more than 500 000 pupils from 60+ countries are surveyed along with their parents and school representatives. The study yields in more than 1000 variables concerning performance, attitude and context of the pupils that can be cross-analyzed. A lot of data.

OECD prepared manuals and tools for SAS and SPSS that show how to use and analyze this data. What about R? Just a few days ago Journal of Statistical Software published an article ,,intsvy: An R Package for Analyzing International Large-Scale Assessment Data”. It describes the intsvy package and gives instructions on how to download, analyze and visualize data from various international assessments with R. The package was developed by Daniel Caro and me. Daniel prepared various video tutorials on how to use this package; you may find them here: http://users.ox.ac.uk/~educ0279/.

PISA is intended not only for researchers. It is a great data set also for teachers who may employ it as an infinite source of ideas for projects for students. In this post I am going to describe one such project that I have implemented in my classes in R programming.

I usually plan two or three projects every semester. The objective of my projects is to show what is possible with R. They are not set to verify knowledge nor practice a particular technique for data analysis. This year the first project for R programming class was designed to experience that ,,With R you can create an automated report that summaries various subsets of data in one-page summaries”.
PISA is a great data source for this. Students were asked to write a markdown file that generates a report in the form of one-page summary for every country. To do this well you need to master loops, knitr, dplyr and friends (we are rather focused on tidyverse). Students had a lot of freedom in trying out different things and approaches and finding out what works and how.

This project has finished just a week ago and the results are amazing.
Here you will find a beamer presentation with one-page summary, smart table of contents on every page, and archivist links that allow you to extract each ggplot2 plots and data directly from the report (click to access full report or the R code).

Here you will find one-pagers related to the link between taking extra math and students’ performance for boys and girls separately (click to access full report or the R code).

And here is a presentation with lots of radar plots (click to access full report or the R code).

Find all projects here: https://github.com/pbiecek/ProgramowanieWizualizacja2017/tree/master/Projekt_1.

And if you are willing to use PISA data for your students or if you need any help, just let me know.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: SmarterPoland.pl » English. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...