R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Where to live in the US

Thu, 11/16/2017 - 01:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

I was fascinated by this xkcd comic about where to live based on your temperature preferences. I also thought it’d be fun to try to make a similar one from my R session! Since I’m no meteorologist and was a bit unsure of how to define winter and summer, and of their relevance in countries like, say, India, which has a monsoon season, I decided to focus on a single country located in one hemisphere only and big enough to offer some variety… the USA! So, dear American readers, where should you live based on your temperature preferences?

Defining data sources

Weather data

The data for the original xkcd graph comes from weatherbase. I changed sources because 1) I was not patient enough to wait for weatherbase to send me a custom dataset, which I imagine is what the xkcd author did, and 2) I’m the creator of a cool package accessing airport weather data for the whole world, including the US! My package is called “riem” like “R Iowa Environmental Mesonet” (the source of the data, a fantastic website) and “we laugh” in Catalan (at the time I wrote the package I was living in Barcelona and taking 4 hours of Catalan classes a week!). It’s a simple but good package which underwent peer review at rOpenSci onboarding, thank you Brooke!

I based my graph on data from the last winter and the last summer. I reckon that one should average over more years, but nothing important is at stake here, right?

Cities sample

My package has a function for downloading weather data for a given airport based on its ID; for instance, the Los Angeles airport is LAX. At that point I just needed a list of US cities with their airport codes. With my package you can get all airport weather networks, one per US state, and all the stations within each network… but that is a lot of airports, with no way to automatically determine how big they are! And to get the city name, since it’d be hard to parse the airport name, I’d have had to resort to geocoding with e.g. this package. A bit complicated for a simple fun graph!
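For reference, here is roughly what listing networks and stations with riem looks like (the California network code below is just an example, and output is not shown):

library(riem)
networks <- riem_networks()                        # one ASOS network per US state, plus others
ca_stations <- riem_stations(network = "CA_ASOS")  # all airport stations in that network
nrow(ca_stations)                                  # lots of airports, big and small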

So I went to the Wikipedia page of the busiest airports in the US and ended up getting this dataset from the US Department of Transportation (such a weird open data format, by the way… I basically copy-pasted the first two columns into a spreadsheet!). This is not perfect; getting a list of major cities in every state would be fairer, but hey, reader, I want you to live in a really big city so that I might know where it is.

Ok, so let’s get the data.

us_airports <- readr::read_csv("data/2017-11-16_wheretoliveus_airports.csv")
knitr::kable(us_airports[1:10,])

Name                                                        Code
Atlanta, GA (Hartsfield-Jackson Atlanta International)      ATL
Los Angeles, CA (Los Angeles International)                 LAX
Chicago, IL (Chicago O’Hare International)                  ORD
Dallas/Fort Worth, TX (Dallas/Fort Worth International)     DFW
New York, NY (John F. Kennedy International)                JFK
Denver, CO (Denver International)                           DEN
San Francisco, CA (San Francisco International)             SFO
Charlotte, NC (Charlotte Douglas International)             CLT
Las Vegas, NV (McCarran International)                      LAS
Phoenix, AZ (Phoenix Sky Harbor International)              PHX

I first opened my table of 50 airports. Then, I made the calls to the Iowa Environmental Mesonet.

summer_weather <- purrr::map_df(us_airports$Code, riem::riem_measures,
                                date_start = "2017-06-01", date_end = "2017-08-31")
winter_weather <- purrr::map_df(us_airports$Code, riem::riem_measures,
                                date_start = "2016-12-01", date_end = "2017-02-28")

We then remove the rows that would be useless for further computations (the xkcd graph uses the average temperature in winter and the average Humidex in summer, the latter being computed from both temperature and dew point).

summer_weather <- dplyr::filter(summer_weather, !is.na(tmpf), !is.na(dwpf))
winter_weather <- dplyr::filter(winter_weather, !is.na(tmpf))

I quickly checked that there was “nearly no missing data”, so I didn’t remove any station or day; if I were doing this analysis for something serious I’d do more checks, including looking at the time difference between measures, for instance. Note that I ended up with 48 airports only: there were no measures for Honolulu, HI (Honolulu International) or San Juan, PR (Luis Munoz Marin International). Too bad!

Calculating the two weather values

I started by converting all temperatures to degrees Celsius because, although I like you, American readers, Fahrenheit degrees do not mean anything to me, which is problematic when checking the plausibility of results, for instance. I was lazy and just used Brooke Anderson’s (yep, the same Brooke who reviewed riem for rOpenSci) weathermetrics package.

summer_weather <- dplyr::mutate(
  summer_weather,
  tmpc = weathermetrics::convert_temperature(tmpf, old_metric = "f", new_metric = "c"),
  dwpc = weathermetrics::convert_temperature(dwpf, old_metric = "f", new_metric = "c"))

winter_weather <- dplyr::mutate(
  winter_weather,
  tmpc = weathermetrics::convert_temperature(tmpf, old_metric = "f", new_metric = "c"))

Summer values

Summer values are Humidex values. The xkcd author explained that the Humidex “combines heat and dew point”. Once again I went to Wikipedia and found the formula; I’m not in the mood to fiddle with formula writing on the blog, so please go there if you want to see it. There’s also a package with a function returning the Humidex, but I was feeling adventurous and therefore wrote my own function and checked the results against the numbers from Wikipedia. It was wrong at first because I had written a “+” instead of a “-”… checking one’s code is crucial.

calculate_humidex <- function(temp, dewpoint){
  temp + 0.5555*(6.11 * exp(5417.7530*(1/273.16 - 1/(273.15 + dewpoint))) - 10)
}

calculate_humidex(30, 15)
## [1] 33.969
calculate_humidex(30, 25)
## [1] 42.33841

And then calculating the summer Humidex values by station was quite straightforward…

library("magrittr")
summer_weather <- dplyr::mutate(summer_weather, humidex = calculate_humidex(tmpc, dwpc))
summer_values <- summer_weather %>%
  dplyr::group_by(station) %>%
  dplyr::summarise(summer_humidex = mean(humidex, na.rm = TRUE))

… as it was for winter temperatures.

winter_values <- winter_weather %>%
  dplyr::group_by(station) %>%
  dplyr::summarise(winter_tmpc = mean(tmpc, na.rm = TRUE))

Prepare data for plotting

I first joined the winter and summer values.

climates <- dplyr::left_join(winter_values, summer_values, by = "station")

Then I re-added city airport names.

climates <- dplyr::left_join(climates, us_airports, by = c("station" = "Code"))

I only kept the city name.

head(climates$Name)
## [1] "Atlanta, GA (Hartsfield-Jackson Atlanta International)"
## [2] "Austin, TX (Austin - Bergstrom International)"
## [3] "Nashville, TN (Nashville International)"
## [4] "Boston, MA (Logan International)"
## [5] "Baltimore, MD (Baltimore/Washington International Thurgood Marshall)"
## [6] "Cleveland, OH (Cleveland-Hopkins International)"

climates <- dplyr::mutate(climates, city = stringr::str_replace(Name, " \\(.*", ""))
head(climates$city)
## [1] "Atlanta, GA"   "Austin, TX"    "Nashville, TN" "Boston, MA"
## [5] "Baltimore, MD" "Cleveland, OH"

Plotting!

When imitating an xkcd graph, one should use the xkcd package! I had already done that in this post.

library("xkcd")
library("ggplot2")
library("extrafont")
library("ggrepel")

xrange <- range(climates$summer_humidex)
yrange <- range(climates$winter_tmpc)

set.seed(42)
ggplot(climates, aes(summer_humidex, winter_tmpc)) +
  geom_point() +
  geom_text_repel(aes(label = city), family = "xkcd", max.iter = 50000) +
  ggtitle("Where to live based on your temperature preferences",
          subtitle = "Data from airports weather stations, 2016-2017") +
  xlab("Summer heat and humidity via Humidex") +
  ylab("Winter temperature in Celsius degrees") +
  xkcdaxis(xrange = xrange, yrange = yrange) +
  theme_xkcd() +
  theme(text = element_text(size = 16, family = "xkcd"))

So, which city seems attractive to you based on this plot? Or would you use different weather measures, e.g. do you care about rain?



Shiny App for making Pixel Art Models

Thu, 11/16/2017 - 01:00

(This article was first published on blog, and kindly contributed to R-bloggers)

Last weekend, I discovered pixel art. The goal is to reproduce a pixelated drawing. Anyone can do this without any drawing skills because you just have to reproduce the pixels one by one (on squared paper). Kids and big kids can quickly become addicted to it.

Example

For this pixelated ironman, you need only 3 colors (black, yellow and red). At the beginning I thought this would be really easy and quick, yet it took me approximately 15 minutes to reproduce. Children could take more than an hour, so it’s nice if you want to keep them busy.

Make your own pixel art models

On the internet, there are lots of models. There are also tutorials on how to make models with Photoshop. Yet, I wanted to make an R package for making pixel art models based on any picture. The pipeline I came up with is the following:

  • read an image with package magick
  • downsize this image for processing
  • use K-means to project colors in a small set of colors
  • downsize the image and project colors
  • plot the pixels and add lines to separate them

I think there may be a lot to improve but from what I currently know about images, it’s the best I could come up with as a first shot.
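To give an idea of the k-means step, here is a rough, standalone sketch of projecting an image onto a small set of colors (this is not the actual pixelart code; the file name, sizes and number of colors are made up):

library(magick)

img <- image_read("ironman.png")      # read an image
small <- image_scale(img, "40")       # downsize it for processing

# image_data() returns a raw array of dimension channels x width x height
px <- as.integer(image_data(small, channels = "rgb"))
rgb_mat <- matrix(px, ncol = 3, byrow = TRUE,
                  dimnames = list(NULL, c("r", "g", "b")))

# k-means projects all pixel colors onto a small set of cluster centers
km <- kmeans(rgb_mat, centers = 3)
quantized <- km$centers[km$cluster, ] / 255
head(rgb(quantized[, "r"], quantized[, "g"], quantized[, "b"]))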

I made a package called pixelart, with an associated Shiny App.

# Installation
devtools::install_github("privefl/pixelart")

# Run Shiny App
pixelart::run_app()



Data Wrangling at Scale

Wed, 11/15/2017 - 20:15

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Just wrote a new R article: “Data Wrangling at Scale” (using Dirk Eddelbuettel’s tint template).

Please check it out.



Programming, meh… Let’s Teach How to Write Computational Essays Instead

Wed, 11/15/2017 - 13:11

(This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to R-bloggers)

From Stephen Wolfram, a nice phrase to describe the sorts of things you can create using tools like Jupyter notebooks, Rmd and Mathematica notebooks: computational essays, which complements the “computational narrative” phrase that is also used to describe such documents.

Wolfram’s recent blog post What Is a Computational Essay?, part essay, part computational essay, is primarily a pitch for using Mathematica notebooks and the Wolfram Language. (The Wolfram Language provides computational support plus access to a “fact engine” database that can be used to pull factual information into the coding environment.)

But it also describes nicely some of the generic features of other “generative document” media (Jupyter notebooks, Rmd/knitr) and how to start using them.

There are basically three kinds of things [in a computational essay]. First, ordinary text (here in English). Second, computer input. And third, computer output. And the crucial point is that these three kinds of things all work together to express what’s being communicated.

In Mathematica, the view is something like this:


In Jupyter notebooks:

In its raw form, an RStudio Rmd document source looks something like this:

A computational essay is in effect an intellectual story told through a collaboration between a human author and a computer. …

The ordinary text gives context and motivation. The computer input gives a precise specification of what’s being talked about. And then the computer output delivers facts and results, often in graphical form. It’s a powerful form of exposition that combines computational thinking on the part of the human author with computational knowledge and computational processing from the computer.

When we originally drafted the OU/FutureLearn course Learn to Code for Data Analysis (also available on OpenLearn), we wrote the explanatory text – delivered as HTML but including static code fragments and code outputs – as a notebook, and then ‘ran’ the notebook to generate static HTML (or markdown) that provided the static course content. These notebooks were complemented by actual notebooks that students could work with interactively themselves.

(Actually, we prototyped authoring both the static text, and the elements to be used in the student notebooks, in a single document, from which the static HTML and “live” notebook documents could be generated: Authoring Multiple Docs from a Single IPython Notebook. )

Whilst the notion of the computational essay as a form is really powerful, I think the added distinction between generative and generated documents is also useful. For example, a raw Rmd document or Jupyter notebook is a generative document that can be used to create a document containing text, code, and the output generated from executing the code. A generated document is an HTML, Word, or PDF export from an executed generative document.
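To make that concrete with a minimal sketch (the file name here is hypothetical): the .Rmd source is the generative document, and rendering it produces the generated one.

# 'analysis.Rmd' is the generative document; the HTML file written out by
# render() is the corresponding generated document
rmarkdown::render("analysis.Rmd", output_format = "html_document")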

Note that the generating code can be omitted from the generated output document, leaving just the text and the code-generated outputs. Code cells can also be collapsed so the code itself is hidden from view but still available for inspection at any time:

Notebooks also allow “reverse closing” of cells—allowing an output cell to be immediately visible, even though the input cell that generated it is initially closed. This kind of hiding of code should generally be avoided in the body of a computational essay, but it’s sometimes useful at the beginning or end of an essay, either to give an indication of what’s coming, or to include something more advanced where you don’t want to go through in detail how it’s made.

Even if notebooks are not used interactively, they can be used to create correct static texts where outputs that are supposed to relate to some fragment of code in the main text actually do so because they are created by the code, rather than being cut and pasted from some other environment.

However, making the generative – as well as generated – documents available means readers can learn by doing, as well as reading:

One feature of the Wolfram Language is that—like with human languages—it’s typically easier to read than to write. And that means that a good way for people to learn what they need to be able to write computational essays is for them first to read a bunch of essays. Perhaps then they can start to modify those essays. Or they can start creating “notes essays”, based on code generated in livecoding or other classroom sessions.

In terms of our own learnings to date about how to use notebooks most effectively as part of a teaching communication (i.e. as learning materials), Wolfram seems to have come to many similar conclusions. For example, try to limit the amount of code in any particular code cell:

In a typical computational essay, each piece of input will usually be quite short (often not more than a line or two). But the point is that such input can communicate a high-level computational thought, in a form that can readily be understood both by the computer and by a human reading the essay.

So what can go wrong? Well, like English prose, [code] can be unnecessarily complicated, and hard to understand. In a good computational essay, both the ordinary text, and the code, should be as simple and clean as possible. I try to enforce this for myself by saying that each piece of input should be at most one or perhaps two lines long—and that the caption for the input should always be just one line long. If I’m trying to do something where the core of it (perhaps excluding things like display options) takes more than a line of code, then I break it up, explaining each line separately.

It can also be useful to “preview” the output of an operation that populates a variable used in a following expression, to help the reader understand what sort of thing that expression is evaluating:

Another important principle as far as I’m concerned is: be explicit. Don’t have some variable that, say, implicitly stores a list of words. Actually show at least part of the list, so people can explicitly see what it’s like.

In many respects, the computational narrative format forces you to construct an argument in a particular way: if a piece of code operates on a particular thing, you need to access, or create, the thing before you can operate on it.

[A]nother thing that helps is that the nature of a computational essay is that it must have a “computational narrative”—a sequence of pieces of code that the computer can execute to do what’s being discussed in the essay. And while one might be able to write an ordinary essay that doesn’t make much sense but still sounds good, one can’t ultimately do something like that in a computational essay. Because in the end the code is the code, and actually has to run and do things.

One of the arguments I’ve been trying to develop in an attempt to persuade some of my colleagues to consider the use of notebooks to support teaching is the notebook nature of them. Several years ago, one of the en vogue ideas being pushed in our learning design discussions was to try to find ways of supporting and encouraging the use of “learning diaries”, where students could reflect on their learning, recording not only things they’d learned but also ways they’d come to learn them. Slightly later, portfolio style assessment became “a thing” to consider.

Wolfram notes something similar from way back when…

The idea of students producing computational essays is something new for modern times, made possible by a whole stack of current technology. But there’s a curious resonance with something from the distant past. You see, if you’d learned a subject like math in the US a couple of hundred years ago, a big thing you’d have done is to create a so-called ciphering book—in which over the course of several years you carefully wrote out the solutions to a range of problems, mixing explanations with calculations. And the idea then was that you kept your ciphering book for the rest of your life, referring to it whenever you needed to solve problems like the ones it included.

Well, now, with computational essays you can do very much the same thing. The problems you can address are vastly more sophisticated and wide-ranging than you could reach with hand calculation. But like with ciphering books, you can write computational essays so they’ll be useful to you in the future—though now you won’t have to imitate calculations by hand; instead you’ll just edit your computational essay notebook and immediately rerun the Wolfram Language inputs in it.

One of the advantages that notebooks have over some other environments in which students learn to code is that the structure of the notebook can encourage you to develop a solution to a problem whilst retaining your earlier working.

The earlier working is where you can engage in the minutiae of trying to figure out how to apply particular programming concepts, creating small, playful, test examples of the sort of thing you need to use in the task you have actually been set. (I think of this as a “trial driven” software approach rather than a “test driven” one; in a trial, you play with a bit of code in the margins to check that it does the sort of thing you want, or expect, it to do before using it in the main flow of a coding task.)

One of the advantages for students using notebooks is that they can doodle with code fragments to try things out, and keep a record of the history of their own learning, as well as producing working bits of code that might be used for formative or summative assessment, for example.

Another advantage of creating notebooks, which may include recorded fragments of dead ends encountered while trying to solve a particular problem, is that you can refer back to them. And reuse what you learned, or discovered how to do, in them.

And this is one of the great general features of computational essays. When students write them, they’re in effect creating a custom library of computational tools for themselves—that they’ll be in a position to immediately use at any time in the future. It’s far too common for students to write notes in a class, then never refer to them again. Yes, they might run across some situation where the notes would be helpful. But it’s often hard to motivate going back and reading the notes—not least because that’s only the beginning; there’s still the matter of implementing whatever’s in the notes.

Looking at many of the notebooks students have created from scratch to support assessment activities in TM351, it’s evident that many of them are not using them other than as an interactive code editor with history. The documents contain code cells and outputs, with little if any commentary (what comments there are are often just simple inline code comments in a code cell). They are barely computational narratives, let alone computational essays; they’re more of a computational scratchpad containing small code fragments, without context.

This possibly reflects the prior history in terms of code education that students have received, working “out of context” in an interactive Python command line editor, or a traditional IDE, where the idea is to produce standalone files containing complete programmes or applications. Not pieces of code, written a line at a time, in a narrative form, with example output to show the development of a computational argument.

(One argument I’ve heard made against notebooks is that they aren’t appropriate as an environment for writing “real programmes” or “applications”. But that’s not strictly true: Jupyter notebooks can be used to define and run microservices/APIs as well as GUI driven applications.)

However, if you start to see computational narratives as a form of narrative documentation that can be used to support a form of literate programming, then once again the notebook format can come into its own, and draw on styling more common in a text document editor than a programming environment.

(By default, Jupyter notebooks expect you to write text content in markdown or markdown+HTML, but WYSIWYG editors can be added as an extension.)

Use the structured nature of notebooks. Break up computational essays with section headings, again helping to make them easy to skim. I follow the style of having a “caption line” before each input. Don’t worry if this somewhat repeats what a paragraph of text has said; consider the caption something that someone who’s just “looking at the pictures” might read to understand what a picture is of, before they actually dive into the full textual narrative.

As well as allowing you to create documents in which the content is generated interactively – code cells can be changed and re-run, for example – it is also possible to embed interactive components in both generative and generated documents.

On the one hand, it’s quite possible to generate and embed an interactive map or interactive chart that supports popups or zooming in a generated HTML output document.

On the other, Mathematica and Jupyter both support the dynamic creation of interactive widget controls in generative documents that give you control over code elements in the document, such as sliders to change numerical parameters or list boxes to select categorical text items. (In the R world, there is support for embedded shiny apps in Rmd documents.)

These can be useful when creating narratives that encourage exploration (for example, in the sense of explorable explanations), though I seem to recall Michael Blastland expressing concern several years ago about how ineffective interactives could be in data journalism stories.

The technology of Wolfram Notebooks makes it straightforward to put in interactive elements, like Manipulate [interact/interactive in Jupyter notebooks], into computational essays. And sometimes this is very helpful, and perhaps even essential. But interactive elements shouldn’t be overused. Because whenever there’s an element that requires interaction, this reduces the ability to skim the essay.

I’ve also thought previously that interactive functions are a useful way of motivating the use of functions in general when teaching introductory programming. For example, An Alternative Way of Motivating the Use of Functions?.

One of the issues in trying to set up student notebooks is how to handle boilerplate code that is required before the student can create, or run, the code you actually want them to explore. In TM351, we preload notebooks with various packages and bits of magic; in my own tinkerings, I’m starting to try to package stuff up so that it can be imported into a notebook in a single line.

Sometimes there’s a fair amount of data—or code—that’s needed to set up a particular computational essay. The cloud is very useful for handling this. Just deploy the data (or code) to the Wolfram Cloud, and set appropriate permissions so it can automatically be read whenever the code in your essay is executed.

As far as opportunities for making increasing use of notebooks as a kind of technology goes, I came to a similar conclusion some time ago to Stephen Wolfram when he writes:

[I]t’s only very recently that I’ve realized just how central computational essays can be to both the way people learn, and the way they communicate facts and ideas. Professionals of the future will routinely deliver results and reports as computational essays. Educators will routinely explain concepts using computational essays. Students will routinely produce computational essays as homework for their classes.

Regarding his final conclusion, I’m a little bit more circumspect:

The modern world of the web has brought us a few new formats for communication—like blogs, and social media, and things like Wikipedia. But all of these still follow the basic concept of text + pictures that’s existed since the beginning of the age of literacy. With computational essays we finally have something new.

In many respects, HTML+Javascript pages have been capable of delivering, and actually have been delivering, computationally generated documents for some time. Do computational notebooks offer some sort of step-change away from that, or do they actually represent a return to the original read/write imaginings of the web, with portable and computed facts accessed using Linked Data?



Speeding up package installation

Wed, 11/15/2017 - 12:17

(This article was first published on R – Why?, and kindly contributed to R-bloggers)

Can’t be bothered reading, tell me now

A simple one-line tweak can significantly speed up package installation and updates.

See my post at the Jumping Rivers blog (no point duplicating content in two places).

 



Using Shiny with Scheduled and Streaming Data

Wed, 11/15/2017 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Shiny applications are often backed by fluid, changing data. Data updates can occur at different time scales: from scheduled daily updates to live streaming data and ad-hoc user inputs. This article describes best practices for handling data updates in Shiny, and discusses deployment strategies for automating data updates.

This post builds off of a 2017 rstudio::conf talk. The recording of the original talk and the sample code for this post are available.

The end goal of this example is a dashboard to help skiers in Colorado select a resort to visit. Recommendations are based on:

  1. Snow reports that provide useful metrics like number of runs open and amount of new snow. Snow reports are updated daily.
  2. Weather data, updated in near real-time from a live stream.
  3. User preferences, entered in the dashboard.

Visit the live app!

The backend for the dashboard looks like:

Automate Scheduled Data Updates

The first challenge is preparing the daily data. In this case, the data preparation requires a series of API requests and then basic data cleansing. The code for this process is written into an R Markdown document, alongside process documentation and a few simple graphs that help validate the new data. The R Markdown document ends by saving the cleansed data into a shared data directory. The entire R Markdown document is scheduled for execution.
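As a minimal sketch of the hand-off at the end of such a document (the object and path names here are illustrative assumptions, not the post's actual code):

# final chunk of the scheduled R Markdown document: write the cleansed data
# to shared storage where the Shiny app can read it
readr::write_csv(snow_report, "/shared-data/snow_report.csv")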

It may seem odd at first to use an R Markdown document as the scheduled task. However, our team has found it incredibly useful to be able to look back through historical renderings of the “report” to gut-check the process. Using R Markdown also forces us to properly document the scheduled process.

We use RStudio Connect to easily schedule the document, view past historical renderings, and ultimately to host the application. If the job fails, Connect also sends us an email containing stdout from the render, which helps us stay on top of errors. (Connect can optionally send the successfully rendered report, as well.) However, the same scheduling could be accomplished with a workflow tool or even CRON.

Make sure the data, written to shared storage, is readable by the user running the Shiny application – typically a service account like rstudio-connect or shiny can be set as the run-as user to ensure consistent behavior.

Alternatively, instead of writing results to the file system, prepped data can be saved to a view in a database.

Using Scheduled Data in Shiny

The dashboard needs to look for updates to the underlying shared data and automatically update when the data changes. (It wouldn’t be a very good dashboard if users had to refresh a page to see new data.) In Shiny, this behavior is accomplished with the reactiveFileReader function:

daily_data <- reactiveFileReader(
  intervalMillis = 100,
  filePath = 'path/to/shared/data',
  readFunc = readr::read_csv,
  session = NULL  # inside a server function, pass the session object instead
)

The function checks the shared data file’s update timestamp every intervalMillis milliseconds to see if the data has changed. If it has, the file is re-read using readFunc. The resulting data object, daily_data, is reactive and can be used in downstream functions like the render* family.

If the cleansed data is stored in a database instead of written to a file in shared storage, use reactivePoll. reactivePoll is similar to reactiveFileReader, but instead of checking the file’s update timestamp, a second function needs to be supplied that identifies when the database is updated. The function’s help documentation includes an example.
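A hedged sketch of what that could look like for a database-backed version (the table, column, and connection names below are made up for illustration):

daily_data <- reactivePoll(
  intervalMillis = 1000,
  session = session,
  # cheap check: a value that changes whenever the table is updated
  checkFunc = function() {
    DBI::dbGetQuery(con, "SELECT MAX(updated_at) FROM snow_report")
  },
  # full read: only re-run when the check value changes
  valueFunc = function() {
    DBI::dbGetQuery(con, "SELECT * FROM snow_report")
  }
)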

Streaming Data

The second challenge is updating the dashboard with live streaming weather data. One way for Shiny to ingest a stream of data is by turning the stream into “micro-batches”. The invalidateLater function can be used for this purpose:

liveish_data <- reactive({
  invalidateLater(100)
  httr::GET(...)
})

This causes Shiny to poll the streaming API every 100 milliseconds for new data. The results are available in the reactive data object liveish_data. Picking how often to poll for data depends on a few factors:

  1. Does the upstream API enforce rate limits?
  2. How long does a data update take? The application will be blocked while it polls data.

The goal is to pick a polling time that balances the user’s desire for “live” data with these two concerns.

Conclusion

To summarize, this architecture provides a number of benefits:

  • No more painful, manual running of R code every day!
  • Dashboard code is isolated from data prep code.
  • There is enough flexibility to meet user requirements for live and daily data, while preventing unnecessary number crunching on the backend.




Mapping data using R and leaflet

Wed, 11/15/2017 - 00:02

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

The R language provides many different tools for creating maps and adding data to them. I’ve been using the leaflet package at work recently, so I thought I’d provide a short example here.

Whilst searching for some data that might make a nice map, I came across this article at ABC News. It includes a table containing Australian members of parliament, their electorate and their voting intention regarding legalisation of same-sex marriage. Since I reside in New South Wales, let’s map the data for electorates in that state.

Here’s the code at Github. The procedure is pretty straightforward (a rough sketch follows the list):

  • Obtain a shapefile of New South Wales electorates (from here) and read into R
  • Read the data from the ABC News web page into a data frame (very easy using rvest)
  • Match the electorate names from the two sources (they match perfectly, well done ABC!)
  • Use the match to add a voting intention variable to the shapefile data
  • Use leaflet to generate the map of New South Wales with electorates coloured by voting intention
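The full code is on GitHub, but the steps above could be sketched roughly as follows (the package choices, file name, column names and palette here are my assumptions, not the author's actual code):

library(rvest)
library(sf)
library(leaflet)

# 1. read the NSW electorate boundaries from a shapefile
electorates <- st_read("nsw_electorates.shp")

# 2. scrape the voting-intention table from the ABC News article
page <- read_html("https://www.abc.net.au/news/...")   # article URL elided
intentions <- html_table(page, fill = TRUE)[[1]]

# 3. & 4. match electorate names and attach voting intention to the shapefile data
electorates$intention <- intentions$Position[
  match(electorates$ELECTORATE, intentions$Electorate)]

# 5. colour electorates by voting intention
pal <- colorFactor("Set1", electorates$intention)
leaflet(electorates) %>%
  addTiles() %>%
  addPolygons(color = ~pal(intention), weight = 1, fillOpacity = 0.6)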

At right, a non-interactive screen grab of the result (click for larger version). For the full interactive map visit this RPubs page where you can zoom, drag and mouse-over to see how Leaflet maps work.

I like R/leaflet a lot. It generates high-quality interactive maps and it’s easy to experiment in and publish from RStudio. Some of the syntax feels a little clunky (a mix of newer “pipes” and older “formula-style”) and generating colour palettes feels strange if you spend most time in ggplot2. However, some of that is probably my inexperience with the package as much as anything else.




How to plot basic maps with ggmap

Tue, 11/14/2017 - 18:40

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

When you’re learning data science – and specifically data science in R using the Tidyverse – it is very helpful to learn individual techniques in a highly modular way.

This is because large projects, when you really break them down, are just combinations of small techniques. If you master all of those little tools and techniques on a small scale then you can combine them together later to create more complex charts, graphs, and analyses.

At a very basic level, this means that you should master fundamental tools like geom_bar(), geom_line(), and geom_point(). After you master those very simple tools, you can learn additional tools that can be combined with the basics.

A good example of this is ggmap. The ggmap package allows you to download and plot basic maps from Google maps (and a few other sources). These maps can then be used as layers within the ggplot2 plotting system. Essentially, you can plot maps from ggmap, and then use ggplot2 to plot points and other geoms on top of the map. Using ggplot+ggmap this way is extremely useful for quick-and-dirty geospatial visualization.

With that in mind, I want to quickly show you how to use ggmap to download and plot maps.

First, let’s just load the packages that we will use. Here, we’re going to load ggmap and tidyverse. ggmap will allow us to access Google Maps, but we’ll also use tidyverse for some additional plotting and other functionality (like pipes).

#==============
# LOAD PACKAGES
#==============

library(ggmap)
library(tidyverse)

Now that our packages are loaded, let’s just get a simple map. Here, we’ll retrieve a basic map of Tokyo from Google Maps.

#=================
# GET MAP OF Tokyo
#=================

map.tokyo <- get_map("Tokyo")

ggmap(map.tokyo)



Notice that we’ve used two functions.

First we used get_map() to retrieve the map from Google Maps. To do this, we just specified the name of the location (more on that later).

Next, we used ggmap() to plot the map.

In doing this, I saved the map first with the name map.tokyo. It can be useful sometimes to save a map like this with a name, but sometimes you don’t need the name. In fact, if you don’t need to save the map, then it can be useful to avoid doing so; finding names for little objects like this can become tiresome.

Having said that, we can actually retrieve and plot the map in a single line of code, without saving the map object.

To do this, we will use the tidyverse “pipe” operator, %>%. (Note: this is one of the reasons that we loaded the tidyverse package.)

#=============================================
# USE THE PIPE OPERATOR
# - Here, we're basically doing the same thing
#   but we will do it in one line of code
#=============================================

get_map("Tokyo") %>% ggmap()



Essentially, what we’ve done here is retrieved the map using get_map(), but then immediately “piped” it into ggmap(). Again, this can be useful because the code is cleaner … we don’t need an intermediate name.

Now let’s plot a different map. Here we will plot a map of Japan.

#==========================================
# GET MAP OF JAPAN
# - this doesn't work well without zooming
#==========================================

get_map("Japan") %>% ggmap()



This code works in a similar way: we just provide the location name to get_map(), and then pipe it into ggmap().

Having said that, this plot is not very good, because it’s not really zoomed properly.

To fix this, we will use the zoom parameter of get_map() to zoom out.

#===========================================
# GET MAP OF JAPAN, ZOOMED
# - here, we are manually setting the zoom
#   to get a better map
# - to find the best setting for 'zoom',
#   you'll need to use some trial-and-error
#===========================================

get_map("Japan", zoom = 5) %>% ggmap()



This is much better.

Keep in mind that to get the right setting for zoom, you’ll typically need to use a bit of trial-and-error.

Next, we’ll get a map of a more specific location. Here we will get a map of Shinjuku, an area of Tokyo.

Notice that as we do this, we are again just specifying the location and zooming in properly with the zoom parameter.

#==============================
# GET MAP OF SPECIFIC LOCATION
#==============================

# note: not zoomed enough
get_map("Shinjuku") %>% ggmap()

# this is properly zoomed
get_map("Shinjuku", zoom = 16) %>% ggmap()



So there you have it. That’s the quick introduction to ggmap.

To be clear, there’s actually quite a bit more functionality for get_map(), but in the interest of simplicity, you should master these techniques first. Later, once you have mastered these simple ggmap tools, you can start combining maps from ggmap() with tools from ggplot2:

# CREATE DATA FRAME
df.tokyo_locations <- tibble(location = c("Ueno, Tokyo, Japan"
                                          ,"Shibuya, Tokyo, Japan"
                                          ,"Shinjuku, Tokyo, Japan"))

# GEOCODE
geo.tokyo_locations <- geocode(df.tokyo_locations$location)

# COMBINE DATA
df.tokyo_locations <- cbind(df.tokyo_locations, geo.tokyo_locations)

# USE WITH GGPLOT
get_map("Tokyo", zoom = 12) %>%
  ggmap() +
  geom_point(data = df.tokyo_locations, aes(x = lon, y = lat), color = 'red', size = 3)



By the way, this is exactly why I recommend learning and mastering simple tools like geom_point() first … once you have mastered simple tools, you can start combining them together like building-blocks to create more advanced visualizations and analyses.

Sign up to master R

Want to master R?

Here at Sharp Sight, we teach data science in R.

And we not only show you the techniques, we will show you how to practice those techniques so that you master them.

So if you want to master data science and become “fluent” in R, sign up for our email newsletter.

When you sign up, you’ll get:
– ggplot2 data visualization tutorials
– tutorials on how to practice ggplot2 syntax (so you can write code “with your eyes closed”)
– practice tips to help you master and memorize syntax

You’ll also get access to our “Data Science Crash Course” for free.

SIGN UP NOW




Visualize KEGG pathway and fold enrichment

Tue, 11/14/2017 - 18:29

(This article was first published on R – Fabio Marroni's Blog, and kindly contributed to R-bloggers)

As a useful note to self, I paste here an easy example of using the pathview package by Weijun Luo to plot the log fold change of gene expression across a given KEGG pathway. This example is on Vitis vinifera (as the prefix vvi suggests), but the approach is general.
Basically, you just need to feed pathview the pathway argument and a gene.data argument. In this case, gene.data is a named vector whose names are the (Entrez) gene IDs and whose values are the log fold changes.
I selected a pathway for which KEGG has a nice representation; you might not be so lucky!

library(pathview)
mypathway <- "vvi03060"
genes <- c("100241050","100243802","100244217","100244265","100247624","100247887","100248517",
           "100248990","100250268","100250385","100250458","100251379","100252350","100252527",
           "100252725","100252902","100253826","100254350","100254429","100254996","100255515",
           "100256046","100256113","100256412","100256941","100257568","100257730","100258179",
           "100258854","100259285","100259443","100260422","100260431","100261219","100262919",
           "100263033","100264739","100265371","100266802","100267343","100267692","100852861",
           "100853033","100854110","100854416","100854647","100855182","104879783","109122671")
logFC <- rnorm(length(genes), -0.5, 1)
names(logFC) <- genes
pathview(gene.data = logFC, species = "vvi", pathway = mypathway)

The result should be something like this:



Advice to aspiring data scientists: start a blog

Tue, 11/14/2017 - 17:45

(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

Last week I shared a thought on Twitter:

When you’ve written the same code 3 times, write a function

When you’ve given the same in-person advice 3 times, write a blog post

— David Robinson (@drob) November 9, 2017

Ironically, this tweet hints at a piece of advice I’ve given at least 3 dozen times, but haven’t yet written a post about. I’ve given this advice to almost every aspiring data scientist who asked me what they should do to find a job: start a blog, and write about data science.1

What could you write about if you’re not yet working as a data scientist? Here are some possible topics (each attached to examples from my own blog):

  • Analyses of datasets you find interesting (example, example)
  • Intuitive explanations of concepts you’ve recently mastered (example, example)
  • Explorations of how to make a specific piece of code faster (example)
  • Announcements about open source projects you’ve released (example)
  • Links to interactive applications you’ve built (example, example)
  • Sharing a writeup of conferences or meetups you’ve attended (example)
  • Expressing opinions about data science or educational practice (example)

In a future post I’d like to share advice about the how of data science blogging (such as how to choose a topic, how to structure a post, and how to publish with blogdown). But here I’ll focus on the why. If you’re in the middle of a job search, you’re probably very busy (especially if you’re currently employed), and blogging is a substantial commitment. So here I’ll lay out three reasons that a data science blog is well worth your time.

Practice analyzing data and communicating about it

If you’re hoping to be a data scientist, you’re (presumably) not one yet. A blog is your chance to practice the relevant skills.

  • Data cleaning: One of the benefits of working with a variety of datasets is that you learn to take data “as it comes”, whether it’s in the form of a supplementary file from a journal article or a movie script
  • Statistics: Working with unfamiliar data lets you put statistical methods into practice, and writing posts that communicate and teach concepts helps build your own understanding
  • Machine learning: There’s a big difference between having used a predictive algorithm once and having used it on a variety of problems, while understanding why you’d choose one over another
  • Visualization: Having an audience for your graphs encourages you to start polishing them and building your personal style
  • Communication: You gain experience writing and get practice structuring a data-driven argument. This is probably the most relevant skill that blogging develops since it’s hard to practice elsewhere, and it’s an essential part of any data science career

I can’t emphasize enough how important this kind of practice is. No matter how many Coursera, DataCamp or bootcamp courses you’ve taken, you still need experience applying those tools to real problems. This isn’t unique to data science: whatever you currently do professionally, I’m sure you’re better at it now than when you finished taking classes in it.

… and that concludes Machine Learning 101. Now, go forth and apply what you've learned to real data! pic.twitter.com/D6wSKgdjeM

— ML Hipster (@ML_Hipster) August 19, 2015

One of the great thrills of a data science blog is that, unlike a course, competition, or job, you can analyze any dataset you like! No one was going to pay me to analyze Love Actually’s plot or Hacker News titles. Whatever amuses or interests you, you can find relevant data and write some posts about it.

Create a portfolio of your work and skills

Graphic designers don’t typically get evaluated based on bullet points on their CV or statements in a job interview: they share a portfolio with examples of their work. I think the data science field is shifting in the same direction: the easiest way to evaluate a candidate is to see a few examples of data analyses they’ve performed.

Blogging is an especially good fit for showing off your skills because, unlike a technical interview, you get to put your “best foot forward.” Which of your skills are you proudest of?

  • If you’re skilled at visualizing data, write some analyses with some attractive and informative graphs (“Here’s an interactive visualization of mushroom populations in the United States”)
  • If you’re great at teaching and communicating, write some lessons about statistical concepts (“Here’s an intuitive explanation of PCA”)
  • If you have a knack for fitting machine learning models, blog about some predictive accomplishments (“I was able to determine the breed of a dog from a photo with 95% accuracy”)
  • If you’re an experienced programmer, announce open source projects you’ve developed and share examples of how they can be used (“With my sparkcsv package, you can load CSV datasets into Spark 10X faster than previous methods”)
  • If your real expertise is in a specific domain, try focusing on that (“Here’s how penguin populations have been declining in the last decade, and why”)

Just because you’re expecting employers to look at your work doesn’t mean it has to be perfect. Generally, when I’m evaluating a candidate, I’m excited to see what they’ve shared publicly, even if it’s not polished or finished. And sharing anything is almost always better than sharing nothing.

"Things that are still on your computer are approximately useless." –@drob #eUSR #eUSR2017 pic.twitter.com/nS3IBiRHBn

— Amelia McNamara (@AmeliaMN) November 3, 2017

In this post I shared how I got my current job, when a Stack Overflow engineer saw one of my posts and reached out to me. That certainly qualifies as a freak accident. But the more public work you do, the higher the chance of a freak accident like that: of someone noticing your work and pointing you towards a job opportunity, or of someone who’s interviewing you having heard of work you’ve done.

And the purpose of blogging isn’t only to advertise yourself to employers. You also get to build a network of colleagues and fellow data scientists, which helps both in finding a job and in your future career. (I’ve found #rstats users on Twitter to be a particularly terrific community). A great example of someone who succeeded in this strategy is my colleague Julia Silge, who started her excellent blog while she was looking to shift her career into data science, and both got a job and built productive relationships through it.

Get feedback and evaluation

Suppose you’re currently looking for your first job as a data scientist. You’ve finished all the relevant DataCamp courses, worked your way through some books, and practiced some analyses. But you still don’t feel like you’re ready, or perhaps your applications and interviews haven’t been paying off, and you decide you need a bit more practice. What should you do next?

What skills could you improve on? It’s hard to tell when you’re developing a new set of skills how far along you are, and what you should be learning next. This is one of the challenges of self-driven learning as opposed to working with a teacher or mentor. A blog is one way to get this kind of feedback from others in the field.

This might sound scary, like you could get a flood of criticism that pushes you away from a topic. But in practice, you can usually sense that you’re not ready well before you finish a blog post.2 For instance, even if you’re familiar with the basics of random forests, you might discover that you can’t achieve the accuracy you’d hoped for on a Kaggle dataset- and you have a chance to hold off on your blog post until you’ve learned more. What’s important is the commitment: it’s easy to think “I probably could write this if I wanted”, but harder to try writing it.

Which of your skills are more developed, or more important, than you thought you were? This is the positive side of self-evaluation. Once you’ve shared some analyses and code, you’ll probably find that you were underrating yourself in some areas. This affects everyone but it’s especially important for graduating Ph.D. students, who spend several years becoming an expert in a specific topic while surrounded by people who are already experts- a recipe for impostor syndrome.

Imposter Syndrome: be honest with yourself about what you know and have accomplished & focus less on the difference. pic.twitter.com/VTjS5KdR6Y

— David Whittaker (@rundavidrun) April 13, 2015

For instance, I picked up the principles of empirical Bayes estimation while I was a graduate student, and since it was a simplification of “real” Bayesian analysis I assumed it wasn’t worth talking about. But once I blogged about empirical Bayes, I learned that those posts had a substantial audience, and that there’s a real lack of intuitive explanations for the topic. I ended up expanding the posts into an e-book: most of the material in the book would never qualify for an academic publication, but it was still worth sharing with the wider world.

One question I like to ask of PhD students, and anyone with hard-won but narrow expertise, is “What’s the simplest thing you understand that almost no one outside your field does?” That’s a recipe for a terrific and useful blog post.

Conclusion

One of the hardest mental barriers to starting a blog is the worry that you're "shouting into the void". If you haven't developed an audience yet, it's possible almost no one will read your blog posts, so why put work into them?

First, a lot of the benefits I describe above are just as helpful whether you have ten Twitter followers or ten thousand. You can still practice your analysis and writing skills, and point potential employers towards your work. And it helps you get into the habit of sharing work publicly, which will become increasingly relevant as your network grows.

Secondly, this is where people who are already members of the data science community can help. My promise is this: if you're early in your career as a data scientist and you start a data-related blog, tweet me a link at @drob and I'll tweet about your first post (in fact, the offer's good for each of your first three posts). Don't worry about whether it's polished or "good enough to share"; just share the first work you find interesting!3 I have a decently-sized audience, and more importantly my followers include a lot of data scientists who are very supportive of beginners and are interested in promoting their work.

Good luck and I’m excited to see what you come up with!

  1. It’s also a great idea to blog if you’re currently a data scientist! But the reasons are a bit different, and I won’t be exploring them in this post. 

  2. Even if you do post an analysis with some mistakes or inefficiencies, if you're part of a welcoming community the comments are likely to trend towards constructive ("Nice post! Have you considered vectorizing that operation?") rather than toxic ("That's super slow, dummy!"). In short, if as a beginner you post something that gets nasty comments, it's not your fault, it's the community's.

  3. A few common-sense exceptions: I wouldn’t share work that’s ethically compromised, such as if it publicizes private data or promotes invidious stereotypes. Another exception is posts that are just blatant advertisements for a product or service. This probably doesn’t apply to you: just don’t actively try to abuse it! 


Comparing Some Strategies from Easy Volatility Investing, and the Table.Drawdowns Command

Tue, 11/14/2017 - 16:38

(This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

This post will be about comparing strategies from the paper "Easy Volatility Investing", along with a demonstration of the table.Drawdowns function from R's PerformanceAnalytics package.

First off, before going further: while I think the execution assumptions found in EVI make the strategies poorly suited to actual live trading (and their risk/reward tradeoffs also leave a lot of room for improvement), I think these strategies are great as benchmarks.

So, some time ago, I did an out-of-sample test for one of the strategies found in EVI, which can be found here.

Using the same source of data, I also obtained data for SPY (though, again, AlphaVantage can also provide this service for free for those that don’t use Quandl).

Here’s the new code.

require(downloader)
require(quantmod)
require(PerformanceAnalytics)
require(TTR)
require(Quandl)
require(data.table)

download("http://www.cboe.com/publish/scheduledtask/mktdata/datahouse/vix3mdailyprices.csv",
         destfile="vxvData.csv")

VIX <- fread("http://www.cboe.com/publish/scheduledtask/mktdata/datahouse/vixcurrent.csv", skip = 1)
VIXdates <- VIX$Date
VIX$Date <- NULL; VIX <- xts(VIX, order.by=as.Date(VIXdates, format = '%m/%d/%Y'))

vxv <- xts(read.zoo("vxvData.csv", header=TRUE, sep=",", format="%m/%d/%Y", skip=2))

# xivRets and vxxRets (the XIV and VXX return series) are built in the previous post linked above
ma_vRatio <- SMA(Cl(VIX)/Cl(vxv), 10)
xivSigVratio <- ma_vRatio < 1
vxxSigVratio <- ma_vRatio > 1

# V-ratio (VXV/VXMT)
vRatio <- lag(xivSigVratio) * xivRets + lag(vxxSigVratio) * vxxRets
# vRatio <- lag(xivSigVratio, 2) * xivRets + lag(vxxSigVratio, 2) * vxxRets

# Volatility Risk Premium Strategy
spy <- Quandl("EOD/SPY", start_date='1990-01-01', type = 'xts')
spyRets <- Return.calculate(spy$Adj_Close)
histVol <- runSD(spyRets, n = 10, sample = FALSE) * sqrt(252) * 100
vixDiff <- Cl(VIX) - histVol
maVixDiff <- SMA(vixDiff, 5)

vrpXivSig <- maVixDiff > 0
vrpVxxSig <- maVixDiff < 0
vrpRets <- lag(vrpXivSig, 1) * xivRets + lag(vrpVxxSig, 1) * vxxRets

obsCloseMomentum <- magicThinking # from previous post

compare <- na.omit(cbind(xivRets, obsCloseMomentum, vRatio, vrpRets))
colnames(compare) <- c("BH_XIV", "DDN_Momentum", "DDN_VRatio", "DDN_VRP")

So, an explanation: there are four return streams here: buy-and-hold XIV, the DDN momentum strategy from a previous post, and two other strategies.

The simpler one, called the VRatio, is simply the ratio of the VIX over the VXV. Near the close, check this quantity: if it is less than one, buy XIV; otherwise, buy VXX.

The other one, called the Volatility Risk Premium strategy (or VRP for short), takes the 10-day historical volatility of the S&P 500 (that is, the annualized running ten-day standard deviation), subtracts it from the VIX, and takes a 5-day moving average of that. Near the close, when that average is above zero (that is, when the VIX is higher than historical volatility), go long XIV; otherwise, go long VXX.

Again, all of these strategies are effectively “observe near/at the close, buy at the close”, so are useful for demonstration purposes, though not for implementation purposes on any large account without incurring market impact.

Here are the results, since 2011 (that is, around the time of XIV’s actual inception):

To note, both the momentum and the VRP strategy underperform buying and holding XIV since 2011. The VRatio strategy, on the other hand, does outperform.

Here’s a summary statistics function that compiles some top-level performance metrics.

stratStats <- function(rets) {
  stats <- rbind(table.AnnualizedReturns(rets), maxDrawdown(rets))
  stats[5,] <- stats[1,]/stats[4,]
  stats[6,] <- stats[1,]/UlcerIndex(rets)
  rownames(stats)[4] <- "Worst Drawdown"
  rownames(stats)[5] <- "Calmar Ratio"
  rownames(stats)[6] <- "Ulcer Performance Index"
  return(stats)
}

And the result:

> stratStats(compare['2011::'])
                             BH_XIV DDN_Momentum DDN_VRatio   DDN_VRP
Annualized Return         0.3801000    0.2837000  0.4539000 0.2572000
Annualized Std Dev        0.6323000    0.5706000  0.6328000 0.6326000
Annualized Sharpe (Rf=0%) 0.6012000    0.4973000  0.7172000 0.4066000
Worst Drawdown            0.7438706    0.6927479  0.7665093 0.7174481
Calmar Ratio              0.5109759    0.4095285  0.5921650 0.3584929
Ulcer Performance Index   1.1352168    1.2076995  1.5291637 0.7555808

To note, all of the benchmark strategies suffered very large drawdowns since XIV’s inception, which we can examine using the table.Drawdowns command, as seen below:

> table.Drawdowns(compare[,1]['2011::'], top = 5)
        From     Trough         To   Depth Length To Trough Recovery
1 2011-07-08 2011-11-25 2012-11-26 -0.7439    349        99      250
2 2015-06-24 2016-02-11 2016-12-21 -0.6783    379       161      218
3 2014-07-07 2015-01-30 2015-06-11 -0.4718    236       145       91
4 2011-02-15 2011-03-16 2011-04-20 -0.3013     46        21       25
5 2013-04-15 2013-06-24 2013-07-22 -0.2877     69        50       19

> table.Drawdowns(compare[,2]['2011::'], top = 5)
        From     Trough         To   Depth Length To Trough Recovery
1 2014-07-07 2016-06-27 2017-03-13 -0.6927    677       499      178
2 2012-03-27 2012-06-13 2012-09-13 -0.4321    119        55       64
3 2011-10-04 2011-10-28 2012-03-21 -0.3621    117        19       98
4 2011-02-15 2011-03-16 2011-04-21 -0.3013     47        21       26
5 2011-06-01 2011-08-04 2011-08-18 -0.2723     56        46       10

> table.Drawdowns(compare[,3]['2011::'], top = 5)
        From     Trough         To   Depth Length To Trough Recovery
1 2014-01-23 2016-02-11 2017-02-14 -0.7665    772       518      254
2 2011-09-13 2011-11-25 2012-03-21 -0.5566    132        53       79
3 2012-03-27 2012-06-01 2012-07-19 -0.3900     80        47       33
4 2011-02-15 2011-03-16 2011-04-20 -0.3013     46        21       25
5 2013-04-15 2013-06-24 2013-07-22 -0.2877     69        50       19

> table.Drawdowns(compare[,4]['2011::'], top = 5)
        From     Trough         To   Depth Length To Trough Recovery
1 2015-06-24 2016-02-11 2017-10-11 -0.7174    581       161      420
2 2011-07-08 2011-10-03 2012-02-03 -0.6259    146        61       85
3 2014-07-07 2014-12-16 2015-05-21 -0.4818    222       115      107
4 2013-02-20 2013-07-08 2014-06-10 -0.4108    329        96      233
5 2012-03-27 2012-06-01 2012-07-17 -0.3900     78        47       31

Note that the table.Drawdowns command only examines one return stream at a time. Furthermore, the top argument specifies how many drawdowns to look at, sorted by greatest drawdown first.

One reason I think these strategies suffer the drawdowns they do is that they're either all-in on one asset or all-in on its exact opposite, with no room for error.

One last thing, for the curious, here is the comparison with my strategy since 2011 (essentially XIV inception) benchmarked against the strategies in EVI (which I have been trading with live capital since September, and have recently opened a subscription service for):

stratStats(compare['2011::'])
                             QST_vol    BH_XIV DDN_Momentum DDN_VRatio   DDN_VRP
Annualized Return          0.8133000 0.3801000    0.2837000  0.4539000 0.2572000
Annualized Std Dev         0.3530000 0.6323000    0.5706000  0.6328000 0.6326000
Annualized Sharpe (Rf=0%)  2.3040000 0.6012000    0.4973000  0.7172000 0.4066000
Worst Drawdown             0.2480087 0.7438706    0.6927479  0.7665093 0.7174481
Calmar Ratio               3.2793211 0.5109759    0.4095285  0.5921650 0.3584929
Ulcer Performance Index   10.4220721 1.1352168    1.2076995  1.5291637 0.7555808

Thanks for reading.

NOTE: I am currently looking for networking and full-time opportunities related to my skill set. My LinkedIn profile can be found here.


An update for MRAN

Tue, 11/14/2017 - 16:09

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

MRAN, the Microsoft R Application Network, has been migrated to a new high-performance, high-availability server, and we've taken the opportunity to make a few upgrades along the way. You shouldn't notice any breaking changes (of course if you do, please let us know), but you should notice faster performance for the MRAN site and for the checkpoint package. (MRAN is also the home of daily archives of CRAN, which checkpoint relies on to deliver specific package versions for its reproducibility functions.)
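If you use checkpoint against those daily snapshots, nothing about your workflow changes; as a reminder, here is a minimal sketch of the usual pattern (the snapshot date below is just an illustrative choice, not a recommendation):

# Pin this project's packages to the CRAN snapshot that MRAN archived on a given day.
# The date is illustrative; use whichever snapshot your project standardises on.
library(checkpoint)
checkpoint("2017-11-14")

# From here on, package installs and library() calls resolve against that snapshot.
library(ggplot2)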

Among the improvements you will find:

As always, you can find MRAN at mran.microsoft.com.


intsvy: PISA for research and PISA for teaching

Tue, 11/14/2017 - 12:42

(This article was first published on SmarterPoland.pl » English, and kindly contributed to R-bloggers)

The Programme for International Student Assessment (PISA) is a worldwide study of 15-year-old school pupils' scholastic performance in mathematics, science, and reading. Every three years more than 500 000 pupils from 60+ countries are surveyed along with their parents and school representatives. The study yields more than 1000 variables concerning performance, attitude and context of the pupils that can be cross-analyzed. A lot of data.

OECD prepared manuals and tools for SAS and SPSS that show how to use and analyze this data. What about R? Just a few days ago the Journal of Statistical Software published an article "intsvy: An R Package for Analyzing International Large-Scale Assessment Data". It describes the intsvy package and gives instructions on how to download, analyze and visualize data from various international assessments with R. The package was developed by Daniel Caro and me. Daniel prepared various video tutorials on how to use this package; you may find them here: http://users.ox.ac.uk/~educ0279/.
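To give a flavour of the package, here is a sketch of the kind of one-liner it enables; treat the exact function name and arguments as assumptions based on my reading of the documentation, and pisa2012 as a placeholder for a student-level PISA data frame you have already imported:

# Sketch only: pisa.mean.pv and its arguments are assumed from the intsvy docs,
# and pisa2012 stands in for an already-imported student-level PISA data set.
library(intsvy)

# Mean mathematics performance (plausible values) by country
pisa.mean.pv(pvlabel = "MATH", by = "CNT", data = pisa2012)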

PISA is intended not only for researchers. It is also a great data set for teachers, who may employ it as an infinite source of ideas for student projects. In this post I am going to describe one such project that I have implemented in my R programming classes.

I usually plan two or three projects every semester. The objective of my projects is to show what is possible with R. They are not set to verify knowledge nor practice a particular technique for data analysis. This year the first project for the R programming class was designed to let students experience that "With R you can create an automated report that summarizes various subsets of data in one-page summaries".
PISA is a great data source for this. Students were asked to write a markdown file that generates a report in the form of a one-page summary for every country. To do this well you need to master loops, knitr, dplyr and friends (we are rather focused on the tidyverse). Students had a lot of freedom in trying out different things and approaches and finding out what works and how.
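For readers who want to try this themselves, a minimal sketch of the pattern most groups converge on: a parameterised R Markdown template rendered once per country (the file name and params setup here are illustrative, not the students' actual code):

# Hypothetical driver script: render one parameterised one-pager per country.
# Assumes a template 'country_report.Rmd' whose YAML header declares 'params: country'.
library(rmarkdown)

countries <- c("POL", "DEU", "FRA")   # illustrative subset of PISA country codes
for (cnt in countries) {
  render("country_report.Rmd",
         params = list(country = cnt),
         output_file = paste0("report_", cnt, ".pdf"))
}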

This project finished just a week ago and the results are amazing.
Here you will find a beamer presentation with a one-page summary, a smart table of contents on every page, and archivist links that allow you to extract each ggplot2 plot and its data directly from the report (click to access the full report or the R code).

Here you will find one-pagers related to the link between taking extra math and students’ performance for boys and girls separately (click to access full report or the R code).

And here is a presentation with lots of radar plots (click to access full report or the R code).

Find all projects here: https://github.com/pbiecek/ProgramowanieWizualizacja2017/tree/master/Projekt_1.

And if you would like to use PISA data with your students, or if you need any help, just let me know.


Functional peace of mind

Tue, 11/14/2017 - 01:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

I think what I enjoy the most about functional programming is the peace of mind that comes with it. With functional programming, there’s a lot of stuff you don’t need to think about. You can write functions that are general enough so that they solve a variety of problems. For example, imagine for a second that R does not have the sum() function anymore. If you want to compute the sum of, say, the first 100 integers, you could write a loop that would do that for you:

numbers = 0
for (i in 1:100){
  numbers = numbers + i
}
print(numbers)
## [1] 5050

The problem with this approach is that you cannot reuse any of the code there, even if you put it inside a function. For instance, what if you want to merge 4 datasets together? You would need something like this:

library(dplyr)
data(mtcars)

mtcars1 = mtcars %>% mutate(id = "1")
mtcars2 = mtcars %>% mutate(id = "2")
mtcars3 = mtcars %>% mutate(id = "3")
mtcars4 = mtcars %>% mutate(id = "4")

datasets = list(mtcars1, mtcars2, mtcars3, mtcars4)

temp = datasets[[1]]
for(i in 1:3){
  temp = full_join(temp, datasets[[i+1]])
}
## Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "id")
## Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "id")
## Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "id")

glimpse(temp)
## Observations: 128
## Variables: 12
## $ mpg  21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
## $ id   "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1...

Of course, the logic is very similar to before, but you need to think carefully about the structure holding your elements (which can be numbers, datasets, characters, etc…) as well as be careful about indexing correctly… and depending on the type of objects you are working on, you might need to tweak the code further.

How would a functional programming approach make this easier? Of course, you could use purrr::reduce() to solve these problems. However, since I assumed that sum() does not exist, I will also assume that purrr::reduce() does not exist either and write my own, clumsy implementation. Here’s the code:

my_reduce = function(a_list, a_func, init = NULL, ...){

  if(is.null(init)){
    init = `[[`(a_list, 1)
    a_list = tail(a_list, -1)
  }

  car = `[[`(a_list, 1)
  cdr = tail(a_list, -1)

  init = a_func(init, car, ...)

  if(length(cdr) != 0){
    my_reduce(cdr, a_func, init, ...)
  } else {
    init
  }
}

This can look much more complicated than before, but the idea is quite simple if you know about recursive functions (functions that call themselves). I won't explain how the function works, because it is not the main point of the article (but if you're curious, I encourage you to play around with it). The point is that now I can do the following:

my_reduce(list(1,2,3,4,5), `+`)
## [1] 15

my_reduce(datasets, full_join) %>% glimpse
## Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "id")
## Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "id")
## Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "id")
## Observations: 128
## Variables: 12
## $ mpg  21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
## $ id   "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1...

And if I need to merge another dataset, I don’t need to change anything at all. Plus, because my_reduce() is very general, I can even use it for situation I didn’t write it for in the first place:

my_reduce(list("a", "b", "c", "d", "e"), paste)
## [1] "a b c d e"

Of course, paste() is vectorized, so you could just as well do paste(1, 2, 3, 4, 5), but again, I want to insist on the fact that writing or using such functions allows you to abstract over a lot of things. There is nothing specific to any type of object in my_reduce(), whereas the loops have to be tailored to the kind of object you're working with. As long as the a_func argument is a binary operator that combines the elements inside a_list, it's going to work. And I don't need to think about indexing, about having temporary variables, or about the structure that will hold my results.
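For completeness, here is how the same calls look with the tools that do exist, base R's Reduce() and purrr::reduce(), just to connect my_reduce() back to everyday code:

library(purrr)
library(dplyr)

Reduce(`+`, list(1, 2, 3, 4, 5))           # base R equivalent: 15
reduce(list(1, 2, 3, 4, 5), `+`)           # purrr equivalent: 15
reduce(datasets, full_join) %>% glimpse()  # merges the four mtcars copies as before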


Come and work with me

Tue, 11/14/2017 - 01:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

I have funding for a new post-doctoral research fellow, on a 2-year contract, to work with me and Professor Kate Smith-Miles on analysing large collections of time series data. We are particularly seeking someone with a PhD in computational statistics or statistical machine learning.
Desirable characteristics:
  • Experience with time series data.
  • Experience with R package development.
  • Familiarity with reproducible research practices (e.g., git, rmarkdown, etc.).
  • A background in machine learning or computational statistics.


2017 rOpenSci ozunconf :: Reflections and the realtime Package

Tue, 11/14/2017 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

This year’s rOpenSci ozunconf was held in Melbourne, bringing together over 45 R enthusiasts from around the country and beyond. As is customary, ideas for projects were discussed in GitHub Issues (41 of them by the time the unconf rolled around!) and there was no shortage of enthusiasm, interesting concepts, and varied experience.

I’ve been to a few unconfs now and I treasure the time I get to spend with new people, new ideas, new backgrounds, new approaches, and new insights. That’s not to take away from the time I get to spend with people I met at previous unconfs; I’ve gained great friendships and started collaborations on side projects with these wonderful people.

When the call for nominations came around this year it was an easy decision. I don’t have employer support to attend these things so I take time off work and pay my own way. This is my networking time, my development time, and my skill-building time. I wasn’t sure what sort of project I’d be interested in but I had no doubts something would come up that sounded interesting.

As it happened, I had been playing around with a bit of code, purely out of interest and hoping to learn how htmlwidgets work. The idea I had was to make a classic graphic equaliser visualisation like this

using R.

This presents several challenges: how can I get live audio into R, and how fast can I plot the signal? I had doubts about both parts, partly because of the way that R calls tie up the session (for now…) and partly because constructing a ggplot2 object is somewhat slow (in terms of raw audio speeds). I'd heard about htmlwidgets and thought there must be a way to leverage that towards my goal.

I searched for a graphic equaliser javascript library to work with and didn’t find much that aligned with what I had in my head. Eventually I stumbled on p5.js and its examples page which has an audio-input plot with a live demo. It’s a frequency spectrum, but I figured that’s just a bit of binning away from what I need. Running the example there looks like

This seemed to be worth a go. I managed to follow enough of this tutorial to have the library called from R. I modified the javascript canvas code to look a little more familiar, and the first iteration of geom_realtime() was born

This seemed like enough of an idea that I proposed it in the GitHub Issues for the unconf. It got a bit of attention, which was worrying, because I had no idea what to do with this next. Peter Hickey pointed out that Sean Kross had already wrapped some of the p5.js calls into R calls with his p5 package, so this seemed like a great place to start. It’s quite a clever way of doing it too; it involves re-writing the javascript which htmlwidgets calls on each time you want to do something.

Fast forward to the unconf and a decent number of people gathered around a little slip of paper with geom_realtime() written on it. I had to admit to everyone that the ggplot2 aspect of my demo was a sham (it’s surprisingly easy to draw a canvas in just the right shade of grey with white gridlines), but people stayed, and we got to work seeing what else we could do with the idea. We came up with some suggestions for input sources, some different plot types we might like to support, and set about trying to understand what Sean’s package actually did.

As it tends to work out, we had a great mix of people with different experience levels in different aspects of the project; some who knew how to make a package, some who knew how to work with javascript, some who knew how to work with websockets, some who knew about realtime data sources, and some who knew about nearly none of these things (✋ that would be me). If everyone knew every aspect about how to go about an unconf project I suspect the endeavor would be a bit boring. I love these events because I get to learn so much about so many different topics.

I shared my demo script and we deconstructed the pieces. We dug into the inner workings of the p5 package and started determining which parts we could siphon off to meet our own needs. One of the aspects that we wanted to figure out was how to simulate realtime data. This could be useful both for testing, and also in the situation where one might want to ’re-cast’ some time-coded data. We were thankful that Jackson Kwok had done a deep dive into websockets, and pretty soon (surprisingly soon, perhaps; within the first day) we had examples of (albeit constructed) real-time (every 100ms) data streaming from a server and being plotted at speed

Best of all, running the plot code didn’t tie up the session; it uses a listener written into the javascript so it just waits for input on a particular port.

With the core goal well underway, people started branching out into aspects they found most interesting. We had some people work on finding and connecting actual data sources, such as the bitcoin exchange rate

and a live-stream of binary-encoded data from the Australian National University (ANU) Quantum Random Numbers Server

Others formalised the code so that it can be piped into different ‘themes’, and retain the p5 structure for adding more components

These were still toy examples of course, but they highlight what’s possible. They were each constructed using an offshoot of the p5 package whereby the javascript is re-written to include various features each time the plot is generated.

Another route we took was to use the direct javascript binding API with factory functions. This had less flexibility in terms of adding modular components, but meant that the javascript could be modified without worrying so much about how it needed to interact with p5. This resulted in some outstanding features such as side-scrolling and date-time stamps. We also managed to pipe the data off to another thread for additional processing (in R) before it was sent to the plot.

The example we ended up with reads the live-feed of Twitter posts under a given hashtag, computes a sentiment analysis on the words with R, and live-plots the result:

Overall I was amazed at the progress we made over just two days. Starting from a silly idea/demo, we built a package which can plot realtime data, and can even serve up some data to be plotted. I have no expectations that this will be the way of the future, but it's been a fantastic learning experience for me (and hopefully others too). It's highlighted that there are ways to achieve realtime plots, even if we've used a library built for drawing rather than one built for plotting per se.

It’s even inspired offshoots in the form of some R packages; tRainspotting which shows realtime data on New South Wales public transport using leaflet as the canvas

and jsReact which explores the interaction between R and Javascript

The possibilities are truly astounding. My list of ‘things to learn’ has grown significantly since the unconf, and projects are still starting up/continuing to develop. The ggeasy package isn’t related, but it was spawned from another unconf Github Issue idea. Again; ideas and collaborations starting and developing.

I had a great time at the unconf, and I can’t wait until the next one. My hand will be going up to help out, attend, and help start something new.

My thanks and congratulations go out to each of the realtime developers: Richard Beare, Jonathan Carroll, Kim Fitter, Charles Gray, Jeffrey O Hanson, Yan Holtz, Jackson Kwok, Miles McBain and the entire cohort of 2017 rOpenSci ozunconf attendees. In particular, my thanks go to the organisers of such a wonderful event; Nick Tierney, Rob Hyndman, Di Cook, and Miles McBain.


Updated curl package provides additional security for R on Windows

Tue, 11/14/2017 - 00:27

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

There are many R packages that connect to the internet, whether it's to import data (readr), install packages from Github (devtools), connect with cloud services (AzureML), or many other web-connected tasks. There's one R package in particular that provides the underlying connection between R and the Web: curl, by Jeroen Ooms, who is also the new maintainer for R for Windows. (The name comes from curl, a command-line utility and interface library for connecting to web-based services). The curl package provides replacements for the standard url and download.file functions in R with support for encryption, and the package was recently updated to enhance its security, particularly on Windows.
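As a quick illustration of what "replacement" means in practice, here is a small sketch (the URL is just an example):

library(curl)

# curl() behaves like base url(), returning a connection, but negotiates SSL
# through the system's encryption backend
con <- curl("https://httpbin.org/get")
readLines(con)
close(con)

# curl_download() is the analogue of download.file()
curl_download("https://httpbin.org/get", destfile = tempfile())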

To implement secure communications, the curl package needs to connect with a library that handles the SSL (secure socket layer) encryption. On Linux and Macs, curl has always used the OpenSSL library, which is included on those systems. Windows doesn't have this library (at least, outside of the Subsystem for Linux), so on Windows the curl package included the OpenSSL library and associated certificate. This raises its own set of issues (see the post linked below for details), so version 3.0 of the package instead uses the built-in winSSL library. This means curl uses the same security architecture as other connected applications on Windows.

This shouldn't have any impact on your web connectivity from R now or in the future, except the knowledge that the underlying architecture is more secure. Nonetheless, it's possible to switch back to OpenSSL-based encryption (and this remains the default on Windows 7, which does not include winSSL).

Version 3.0 of the curl package is available now on CRAN (though you'll likely never need to load it explicitly — packages that use it do that for you automatically). You can learn more about the changes at the link below. If you'd like to know more about what the curl package can do, this vignette is a great place to start. Many thanks to Jeroen Ooms for this package.

rOpenSci: Changes to Internet Connectivity in R on Windows


normal variates in Metropolis step

Tue, 11/14/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A definitely puzzled participant on X validated was confusing the Normal variate (or variable) used in the random walk Metropolis-Hastings step with its Normal density… It took some cumulated efforts to point out the distinction. Especially as the originator of the question had a rather strong a priori about his or her background:

“I take issue with your assumption that advice on the Metropolis Algorithm is useless to me because of my ignorance of variates. I am currently taking an experimental course on Bayesian data inference and I’m enjoying it very much, i believe i have a relatively good understanding of the algorithm, but i was unclear about this specific.”

despite pondering the meaning of the call to rnorm(1)… I will keep this question in store to use in class when I teach Metropolis-Hastings in a couple of weeks.
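For the record, here is the sort of minimal random-walk Metropolis sketch that makes the distinction plain (my own toy example, not the one from the X validated thread): rnorm(1) supplies the Normal variate that proposes the move, while densities only enter through the target in the acceptance ratio.

set.seed(42)
target <- function(x) dnorm(x, mean = 2, sd = 0.5)  # toy target density

n <- 5000
x <- numeric(n)
for (t in 2:n) {
  prop  <- x[t - 1] + rnorm(1)              # Normal variate: the random step
  ratio <- target(prop) / target(x[t - 1])  # densities are evaluated, not simulated
  x[t]  <- if (runif(1) < ratio) prop else x[t - 1]
}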


Spatial networks – case study St James centre, Edinburgh (2/3)

Mon, 11/13/2017 - 23:46

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

This is part two in a series I’m writing on network analysis. The first part is here. In this section I’m going to cover allocating resources, again using the St James’ development in Edinburgh as an example. Most excitingly (for me), the end of this post covers the impact of changes in resource allocation.

Edinburgh (and surrounds) has more than one shopping centre. Many more. I’ve had a stab at narrowing these down to those that are similar to the St James centre, i.e. they’re big, (generally) covered and may have a cinema. You can see a plot of these below. As you can see the majority are concentrated around the population centre of Edinburgh.

Location of big shopping centres in and around Edinburgh.

As with the previous post I’ve used GRASS GIS for the network analysis, QGIS for cartography and R for some subsequent analysis. I’ve used the Ordnance Survey code-point open and openroads datasets for the analysis and various Ordnance Survey maps for the background.

An allocation map shows how you can split your network to be serviced by different resource centres. I like to think of it as deciding which fire station sends an engine to which road. But this can be extended to any resource with multiple locations: bank branches, libraries, schools, swimming pools. In this case we’re using shopping centres. As always the GRASS manual page contains a full walk through of how to run the analysis. I’ll repeat the steps I took below:

# connect points to network
v.net roads_EH points=shopping_centres out=centres_net op=connect thresh=200

# allocate, specifying range of center cats (easier to catch all):
v.net.alloc centres_net out=centres_alloc center_cats=1-100000 node_layer=2

# Create db table
v.db.addtable map=centres_alloc@shopping_centres

# Join allocation and centre tables
v.db.join map=centres_alloc column=cat other_table=shopping_centres other_column=cat

# Write to shp
v.out.ogr -s input=centres_alloc output=shopping_alloc format=ESRI_Shapefile output_layer=shopping_alloc

The last step isn’t strictly necessary, as QGIS and R can connect directly to the GRASS database, but old habits die hard! We’ve now got a copy of the road network where all roads are tagged with which shopping centre they’re closest too. We can see this below:

Allocation network of EH shopping centres.

A few things stand out for me:

  • Ocean terminal is a massive centre but is closest to few people.
  • Some of the postcodes closest to St James are really far away.
  • The split between Fort Kinnaird and St James is really stark just east of the A702.

If I were a councillor coordinating shopping centres in a car-free world, I now know where I'd be lobbying for better public transport!

We can also do a similar analysis using the shortest path, as in the previous post. Instead of looking for the shortest path to a single point, we can get GRASS to calculate the distance from each postcode to its nearest shopping centre (note this is using the postcodes_EH file from the previous post):

# connect postcodes to streets as layer 2
v.net --overwrite input=roads_EH points=postcodes_EH output=roads_net1 operation=connect thresh=400 arc_layer=1 node_layer=2

# connect shops to streets as layer 3
v.net --overwrite input=roads_net1 points=shopping_centres output=roads_net2 operation=connect thresh=400 arc_layer=1 node_layer=3

# inspect the result
v.category in=roads_net2 op=report

# shortest paths from postcodes (points in layer 2) to nearest stations (points in layer 3)
v.net.distance --overwrite in=roads_net2 out=pc_2_shops flayer=2 to_layer=3

# Join postcode and distance tables
v.db.join map=postcodes_EH column=cat other_table=pc_2_shops other_column=cat

# Join station and distance tables
v.db.join map=postcodes_EH column=tcat other_table=shopping_centres other_column=cat subset_columns=Centre

# Make a km column
# Really short field name so we can output to shp
v.db.addcolumn map=postcodes_EH columns="dist_al_km double precision"
v.db.update map=postcodes_EH column=dist_al_km qcol="dist/1000"

# Make a st james vs column
# Uses results from the previous blog post
v.db.addcolumn map=postcodes_EH columns="diff_km double precision"
v.db.update map=postcodes_EH column=diff_km qcol="dist_km-dist_al_km"

# Write to shp
v.out.ogr -s input=postcodes_EH output=pc_2_shops format=ESRI_Shapefile output_layer=pc_2_shops

Again we can plot these up in QGIS (below). These results are really similar to the road allocation above, but give us a little more detail on where the population is, as each postcode is shown. However, the eagle-eyed among you will have noticed that we pulled out the distance for each postcode in the code above and then compared it to the distance to St James alone. We can use this to consider the impact of resource allocation.

Closest shopping centre for each EH postcode.

Switching to R, we can interrogate the postcode data further. Using R's rgdal package we can read in the shp file and generate some summary statistics:

Centre          No. of postcodes closest
Almondvale      4361
Fort Kinnaird   7813
Gyle            3437
Ocean terminal  1321
St James        7088

# Package
install.packages("rgdal")
library(rgdal)

# Read file
postcodes = readOGR("/home/user/dir/dir/network/data/pc_2_shops.shp")

# How many postcodes for each centre?
table(postcodes$Centre)

We can also look at the distribution of distances for each shopping centre using a box and whisker plot. As in the map, we can see that Fort Kinnaird and St James are closest to the most distant postcodes, and that Ocean terminal has a small geographical catchment. The code for this plot is at the end of this post.

We can also repeat the plot from the previous blog post and look at how many postcodes are within walking and cycling distance of their nearest centre. In the previous post I showed the solid line and circle points for the St James centre. We can now compare those results to the impact of people travelling to their closest centre (below). The number of postcodes within walking distance of their nearest centre is nearly double that for St James alone, and those within cycling distance rise to nearly 50%! Code at the end of the post.

We also now have two curves on the above plot, and the area between them is the distance saved if each postcode travelled to its closest shopping centre instead of the St James.

The total distance saved is a whopping 123,680 km!

This impact analysis is obviously of real use in these times of reduced public services. My local council, Midlothian, is considering closing all its libraries bar one. What impact would this have on users? How would the road network around the remaining library cope? Why have they just been building new libraries? It's also the kind of analysis I really hope the DWP undertook before closing job centres across Glasgow. Hopefully the work of this post helps people investigate these impacts themselves.

# distance saved
# NA value is one postcode too far to be joined to road - oops!
sum(postcodes$diff_km, na.rm=T)

# Boxplot
png("~/dir/dir/network/figures/all-shops_distance_boxplot.png", height=600, width=800)
par(cex=1.5)
boxplot(dist_al_km ~ Centre, postcodes,
        lwd=2, range=0,
        main="Box and whiskers of EH postcodes to their nearest shopping centre",
        ylab="Distance (km)")
dev.off()

# Line plot
# Turn into percentage instead of postcode counts
x = sort(postcodes$dist_km)
x = quantile(x, seq(0, 1, by=0.01))
y = sort(postcodes$dist_al_km)
y = quantile(y, seq(0, 1, by=0.01))

png("~/dir/dir/network/figures/all-shops_postcode-distance.png", height=600, width=800)
par(cex=1.5)
plot(x, type="l",
     main="EH postcode: shortest road distances to EH shopping centres",
     xlab="Percentage of postcodes",
     ylab="Distance (km)",
     lwd=3)
lines(y, lty=2, lwd=3)
points(max(which(x<2)), 2, pch=19, cex=2, col="purple4")
points(max(which(x<5)), 5, pch=19, cex=2, col="darkorange")
points(max(which(y<2)), 2, pch=18, cex=2.5, col="purple4")
points(max(which(y<5)), 5, pch=18, cex=2.5, col="darkorange")
legend("topleft",
       c("St James", "Nearest centre",
         paste0(max(which(x<2)), "% postcodes within 2 km (walking) of St James"),
         paste0(max(which(x<5)), "% postcodes within 5 km (cycling) of St James"),
         paste0(max(which(y<2)), "% postcodes within 2 km (walking) of nearest centre"),
         paste0(max(which(y<5)), "% postcodes within 5 km (cycling) of nearest centre")),
       col=c("black", "black", "purple4", "darkorange", "purple4", "darkorange"),
       pch=c(NA, NA, 19, 19, 18, 18),
       lwd=c(3),
       lty=c(1, 2, NA, NA, NA, NA),
       pt.cex=c(NA, NA, 2, 2, 2.5, 2.5))
dev.off()


SQL Saturday statistics – Web Scraping with R and SQL Server

Mon, 11/13/2017 - 20:08

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

I wanted to check a simple query: how many times has a particular topic been presented, and by how many different presenters?

Sounds interesting. Tackling it should not be a problem, but the final numbers may vary, since some text analysis is involved.

First of all, some web scraping to get the information from the SQLSaturday web page. Reading the information from the website, with the R/Python integration in SQL Server, is a fairly straightforward task:

EXEC sp_execute_external_script
     @language = N'R'
    ,@script = N'
        library(rvest)
        library(XML)
        library(dplyr)

        #URL to schedule
        url_schedule <- ''http://www.sqlsaturday.com/687/Sessions/Schedule.aspx''

        #Read HTML
        webpage <- read_html(url_schedule)

        # Event schedule
        schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'') # OK

        # Extracting HTML content
        ht <- html_text(schedule_info)
        df <- data.frame(data=ht)

        #create empty DF
        df_res <- data.frame(title=c(), speaker=c())

        for (i in 1:nrow(df)){
            #print(df[i])
            if (i %% 2 != 0) #odd flow
                print(paste0("title is: ", df$data[i]))
            if (i %% 2 == 0) #even flow
                print(paste0("speaker is: ", df$data[i]))
            df_res <- rbind(df_res, data.frame(title=df$data[i], speaker=df$data[i+1]))
        }
        df_res_new = df_res[seq(1, nrow(df_res), 2), ]
        OutputDataSet <- df_res_new'

Python offers the BeautifulSoup library, which will do pretty much the same (or an even better) job as the rvest and XML packages combined. Nevertheless, once we have the data from a test page (in this case I am reading the Slovenian SQLSaturday 2017 schedule, simply because it is awesome), we can “walk through” the whole web page and generate all the needed information.

The SQLSaturday website has every event enumerated, making it very easy to parametrize the web scraping process:

So we will scrape through the last 100 events, simply by incrementing the event's integer ID; the input parameter will be parsed as:

http://www.sqlsaturday.com/600/Sessions/Schedule.aspx

http://www.sqlsaturday.com/601/Sessions/Schedule.aspx

http://www.sqlsaturday.com/602/Sessions/Schedule.aspx

and so on, regardless of whether the website for a given event still works or not. Results will be returned back to the SQL Server database.
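(In R this parametrisation is a one-liner; a throwaway sketch:)

# Build the schedule URLs for events 600 through 690
urls <- paste0("http://www.sqlsaturday.com/", 600:690, "/Sessions/Schedule.aspx")
head(urls, 3)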

Creating a stored procedure will do the job:

USE SqlSaturday;
GO

CREATE OR ALTER PROCEDURE GetSessions
    @eventID SMALLINT
AS

DECLARE @URL VARCHAR(500)
SET @URL = 'http://www.sqlsaturday.com/' + CAST(@eventID AS NVARCHAR(5)) + '/Sessions/Schedule.aspx'

PRINT @URL

DECLARE @TEMP TABLE
(
     SqlSatTitle NVARCHAR(500)
    ,SQLSatSpeaker NVARCHAR(200)
)

DECLARE @RCODE NVARCHAR(MAX)
SET @RCODE = N'
    library(rvest)
    library(XML)
    library(dplyr)
    library(httr)
    library(curl)
    library(selectr)

    #URL to schedule
    url_schedule <- "'

DECLARE @RCODE2 NVARCHAR(MAX)
SET @RCODE2 = N'"
    #Read HTML
    webpage <- html_session(url_schedule) %>% read_html()

    # Event schedule
    schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'') # OK

    # Extracting HTML content
    ht <- html_text(schedule_info)
    df <- data.frame(data=ht)

    #create empty DF
    df_res <- data.frame(title=c(), speaker=c())

    for (i in 1:nrow(df)){
        #print(df[i])
        if (i %% 2 != 0) #odd flow
            print(paste0("title is: ", df$data[i]))
        if (i %% 2 == 0) #even flow
            print(paste0("speaker is: ", df$data[i]))
        df_res <- rbind(df_res, data.frame(title=df$data[i], speaker=df$data[i+1]))
    }
    df_res_new = df_res[seq(1, nrow(df_res), 2), ]
    OutputDataSet <- df_res_new
';

DECLARE @FINAL_RCODE NVARCHAR(MAX)
SET @FINAL_RCODE = CONCAT(@RCODE, @URL, @RCODE2)

INSERT INTO @Temp
EXEC sp_execute_external_script
     @language = N'R'
    ,@script = @FINAL_RCODE

INSERT INTO SQLSatSessions (sqlSat, SqlSatTitle, SQLSatSpeaker)
SELECT @EventID AS sqlsat
      ,SqlSatTitle
      ,SqlSatSpeaker
FROM @Temp

 

Before you run this, just a little environment setup:

USE [master];
GO

CREATE DATABASE SQLSaturday;
GO

USE SQLSaturday;
GO

CREATE TABLE SQLSatSessions
(
     id SMALLINT IDENTITY(1,1) NOT NULL
    ,SqlSat SMALLINT NOT NULL
    ,SqlSatTitle NVARCHAR(500) NOT NULL
    ,SQLSatSpeaker NVARCHAR(200) NOT NULL
)

 

There you go! Now you can run a stored procedure for a particular event (in this case SQL Saturday Slovenia 2017):

EXECUTE GetSessions @eventID = 687

or you can run this procedure against multiple SQLSaturday events and web scrape data from the SQLSaturday.com website in one go (a sketch of one way to drive that from R follows below).
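One way to drive that loop is from R itself; a hedged sketch using DBI/odbc, where the connection details are placeholders for your own server and authentication setup:

# Hypothetical driver: call the stored procedure for a range of events.
# Connection settings are placeholders; adjust driver/server/auth to your setup.
library(DBI)
library(odbc)

con <- dbConnect(odbc(), Driver = "SQL Server", Server = "localhost",
                 Database = "SQLSaturday", Trusted_Connection = "Yes")

for (ev in 600:690) {
  dbExecute(con, paste0("EXECUTE GetSessions @eventID = ", ev))
}
dbDisconnect(con)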

For Slovenian SQLSaturday, I get the following sessions and speakers list:

Please note that if you are running this code behind a firewall or proxy, some additional changes for the proxy or firewall might be needed!

So, going back to the original question: how many times has Query Store been presented at SQL Saturdays (from SQLSat600 until SQLSat690)? Here is the frequency table:

Or presented with a pandas graph:

Query Store is popular, ahead of all the R, Python or Azure ML topics, but PowerShell is gaining popularity like crazy. Good work, PowerShell people!
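If you want to reproduce the count yourself, here is a sketch of how the original question can be answered once the scraped sessions are back in R (here sessions stands for the SQLSatSessions table read into a data frame; the column names match the table created above):

library(dplyr)

# sessions: data frame with columns SqlSat, SqlSatTitle, SQLSatSpeaker
sessions %>%
  filter(grepl("query store", tolower(SqlSatTitle))) %>%
  summarise(times_presented   = n(),
            distinct_speakers = n_distinct(SQLSatSpeaker))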

As always, code is available at Github.

 
