Subscribe to R bloggers feed R bloggers
R news and tutorials contributed by 750 R bloggers
Updated: 21 hours 18 sec ago

Meetup: Machine Learning in Production with Szilard Pafka

Fri, 04/28/2017 - 09:55

Machine Learning in Production
by Szilard Pafka

In this talk I will discuss the main steps involved in having machine learning models in production. No hype (“artificial intelligence”), no bullshitting and no marketing of expensive products (the best tools are all open source and free, you just need to know what you are doing).

Bio: Szilard has trained, deployed and maintained machine learning models in production (starting with random forests) since 2007. He’s a Chief Scientist and a Physics PhD with 20+ years experience with data, computing and modeling. He founded the first data meetup in LA in 2009 (by chance) and he has organized several data science events since.

We’ll also have a lightning talk (10-mins) before the main talk:

Opening the black box: Attempts to understand the results of machine learning models
by Michael Tiernay, Data Scientist at Edmunds


– 6:30pm arrival, food/drinks and networking
– 7:30pm talks
– 9 pm more networking

You must have a confirmed RSVP, and please arrive by 7:25pm at the latest. Please RSVP here on Eventbrite.

Venue: Edmunds, 2401 Colorado Avenue (this is Edmunds’ NEW LOCATION, don’t go to the old one)
Park underneath the building (Colorado Center), Edmunds will validate.

Salaries by alma mater – an interactive visualization with R and plotly

Fri, 04/28/2017 - 06:08

(This article was first published on Alexej's blog, and kindly contributed to R-bloggers)

Based on an interesting dataset from the Wall Street Journal, I made the above visualization of the median starting salary for US college graduates from different undergraduate institutions (I have also looked at the mid-career salaries and the salary increase, but more on that later). However, I thought that it would be a lot more informative if it were interactive. At the very least, I wanted to be able to see the school names when hovering over or clicking on the points with the mouse.

Luckily, this kind of interactivity can be easily achieved in R with the library plotly, especially due to its excellent integration with ggplot2, which I used to produce the above figure. In the following I describe how exactly this can be done.

Before I show you the interactive visualizations, a few words on the data preprocessing, and on how the map and the points are plotted with ggplot2:

  • I generally use functions from the tidyverse R packages.
  • I save the data in the data frame salaries, and transform the given amounts to proper floating point numbers, stripping the dollar signs and extra whitespace.
  • The data provide school names. However, I need to find out the exact geographical coordinates of each school to put it on the map. This can be done in a very convenient way by using the geocode function from the ggmap R package:

    school_longlat <- geocode(salaries$school)
    school_longlat$school <- salaries$school
    salaries <- left_join(salaries, school_longlat)

  • For the visualization I want to disregard the colleges in Alaska and Hawaii to avoid shrinking the rest of the map. The respective rows of salaries can be easily determined with a grep search:

    grep("alaska", salaries$school, ignore.case = TRUE) # [1] 206
    grep("hawaii", salaries$school, ignore.case = TRUE) # [1] 226

  • A data frame containing geographical data that can be used to plot the outline of all US states can be loaded using the function map_data from the ggplot2 package: states <- map_data("state")

  • And I load a yellow-orange-red palette with the function brewer.pal from the RColorBrewer library, to use as a scale for the salary amounts: yor_col <- brewer.pal(6, "YlOrRd")

  • Finally the (yet non-interactive) visualization is created with ggplot2:

    p <- ggplot(salaries[-c(206, 226), ]) +
      geom_polygon(aes(x = long, y = lat, group = group),
                   data = states, fill = "black", color = "white") +
      geom_point(aes(x = lon, y = lat, color = starting, text = school)) +
      coord_fixed(1.3) +
      scale_color_gradientn(name = "Starting\nSalary",
                            colors = rev(yor_col), labels = comma) +
      guides(size = FALSE) +
      theme_bw() +
      theme(axis.text = element_blank(), axis.line = element_blank(),
            axis.ticks = element_blank(), panel.border = element_blank(),
            panel.grid = element_blank(), axis.title = element_blank())
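As an aside, the dollar-stripping step mentioned in the bullets above might look something like this (a minimal sketch; `clean_salary` is a hypothetical helper, not from the original post):

```r
# Strip "$", commas and whitespace from salary strings, then convert to numeric.
clean_salary <- function(x) as.numeric(gsub("[$,[:space:]]", "", x))
clean_salary(c("$45,700 ", " $161,000"))  # -> 45700 161000
```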

Now, entering p into the R console will generate the figure shown at the top of this post.

However, we want to…

…make it interactive

The function ggplotly immediately generates a plotly interactive visualization from a ggplot object. It’s that simple! :smiley: (Though I must admit that, more often than I would be okay with, some elements of the ggplot visualization disappear or don’t look as expected. :fearful:)

The function argument tooltip can be used to specify which aesthetic mappings from the ggplot call should be shown in the tooltip. So, the code

ggplotly(p, tooltip = c("text", "starting"), width = 800, height = 500)

generates the following interactive visualization.

Now, if you want to publish a plotly visualization to the plotly website, you first need to communicate your account info to the plotly R package:

Sys.setenv("plotly_username" = "??????")
Sys.setenv("plotly_api_key" = "????????????")

and after that, posting the visualization to your account is as simple as:

plotly_POST(filename = "Starting", sharing = "public")

More visualizations

Finally, based on the same dataset I have generated an interactive visualization of the median mid-career salaries by undergraduate alma mater (the R script is almost identical to the one described above).
The resulting interactive visualization is embedded below.

Additionally, it is quite informative to look at a visualization of the salary increase from starting to mid-career.

To leave a comment for the author, please follow the link and comment on their blog: Alexej's blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

NY R Conference

Fri, 04/28/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

The 2017 New York R Conference was held last weekend in Manhattan. For the third consecutive year, the organizers – a partnership including Lander Analytics, The New York Meetup and Work-Bench – pulled off a spectacular event. There was a wide range of outstanding talks, some technical and others more philosophical, a palpable sense of community and inclusiveness, great food, beer and Bloody Marys.

Fortunately, the talks were recorded. While watching the videos (which I imagine will be available in a couple of weeks) will be no substitute for the experience, I expect that the extended R community will find the talks valuable. In this post, I would like to mention just a couple of the presentations that touched on the professions and practices of data science and statistics.

In his talk, “The Humble (Programmer) Data Scientist: Essence and Accident in (Software Engineering) Data Science”, Friederike Schüür invoked an analogy between the current effort to establish data science as a profession and the events of sixty years ago to obtain professional status for programmers. Schüür called out three works from master programmers: The Humble Programmer by Edsger Dijkstra, Computer Programming as an Art by Donald Knuth, and No Silver Bullet – Essence and Accident in Software Engineering by Fred Brooks. These are all classic papers, perennially relevant, and well worth reading and re-reading.

In the first paper, Dijkstra writes:

Another two years later, in 1957, I married and Dutch marriage rites require you to state your profession and I stated that I was a programmer. But the municipal authorities of the town of Amsterdam did not accept it on the grounds that there was no such profession. . . So much for the slowness with which I saw the programming profession emerge in my own country. Since then I have seen more of the world, and it is my general impression that in other countries, apart from a possible shift of dates, the growth pattern has been very much the same.

Social transformations appear to happen much more quickly in the twenty-first century. Nevertheless, one thing seems to be invariant: establishing a new profession requires many different kinds of people to alter their world views.

The second talk I would like to mention was by Andrew Gelman. The title listed in the program is “Theoretical Statistics is the Theory of Applied Statistics: How to Think About What We Do”. Since Gelman spoke without slides in front of a dark screen, I am not sure that is actually the talk he gave, but whatever talk he gave, it was spellbinding. Gelman spoke so quickly and threw out so many ideas that my mental buffers were completely overwritten. I will have to wait for the video to sort out my impressions. Nevertheless, there were four quotations that I managed to write down that I think are worth considering:

  • “Taking something that was outside of the realm of statistics and putting it into statistics is a good idea.”
  • “I would like workflows to be more formalized.”
  • “Much of workflow is still outside of the tent of statistical theory.”
  • “I think we need a theory of models.”

Taken together, and I hope not taken out of context, the sentiment expressed here argues for expanding the domain of theoretical statistics to include theories of inferential models and the workflows in which they are produced.

The idea of the importance of workflows to science itself seems to be gaining some traction in the statistical community. In his yet-to-be-published but widely circulated paper, 50 years of Data Science, theoretical statistician David Donoho writes:

A crucial hidden component of variability in science is the analysis workflow. Different studies of the same intervention may follow different workflows, which may cause the studies to get different conclusions… Joshua Carp studied analysis workflows in 241 fMRI studies. He found nearly as many unique workflows as studies! In other words researchers are making up a new workflow for pretty much every study.

My take on things is that the movement to establish the profession of data science already has too much momentum behind it to be stopped any time soon. Whether it is the data scientists or the statisticians who come to own the theories of models and workflows doesn’t matter all that much. No matter who develops these research areas, we will all benefit. The social and intellectual friction caused by the movement of data science is heating things up in a good way.


To leave a comment for the author, please follow the link and comment on their blog: R Views.

a secretary problem with maximum ability

Fri, 04/28/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

The Riddler of today has a secretary problem, where one sequentially measures N random variables until one deems the current variable to be the largest of the whole sample. The classical secretary problem has a counter-intuitive solution where one first measures N/e random variables without taking any decision, and then and only then picks the first subsequent outcome larger than the largest of the first group. The added information in the current riddle is that the distribution of those iid random variables is set to be uniform on {1,…,M}, which calls for a modification of the algorithm — for instance, one should obviously stop upon observing M on the current draw.

The approach I devised is clearly suboptimal: I decided to pick the currently observed value if the (conditional) probability that it is the largest is greater than the probability that one of the subsequent draws exceeds it. This translates into the following R code:

M=100 #maximum value
N=10  #total number of draws
hyprob=function(m){
  # m is sequence of draws so far
  n=length(m);mmax=max(m)
  if ((m[n]<mmax)||(mmax-n<N-n)){prob=0
  }else{
    prob=prod(sort((1:mmax)[-m],dec=TRUE)[1:(N-n)]/((M-n):(M-N+1)))}
  return(prob)}

decision=function(draz=sample(1:M,N)){
  i=0
  keepgoin=TRUE
  while ((keepgoin)&(i<N)){
    i=i+1
    keepgoin=(hyprob(draz[1:i])<0.5)}
  return(c(i,draz[i],(draz[i]<max(draz))))}

which produces a winning rate of around 62% when N=10 and M=100, hence much better than the expected performance of the secretary algorithm, with its winning frequency of 1/e.
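For reference, the 1/e baseline mentioned above can itself be checked by simulating the classical rule (a self-contained sketch, not the author's code; for small N the cutoff rule actually wins somewhat more often than 1/e):

```r
# Classical secretary rule: skip the first floor(N/e) draws, then stop at the
# first draw beating everything seen so far.
set.seed(1)
N <- 10; M <- 100
classical <- function() {
  draws <- sample(1:M, N)
  k <- floor(N / exp(1))                 # size of the observation phase
  best_seen <- max(draws[1:k])
  for (i in (k + 1):N)
    if (draws[i] > best_seen) return(draws[i] == max(draws))
  FALSE                                  # never stopped: the max was in the first k draws
}
mean(replicate(1e4, classical()))        # close to 0.4 for N = 10
```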

Filed under: Kids, R Tagged: mathematical puzzle, R, secretary problem, stopping rule, The Riddler

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

Where Europe lives, in 14 lines of R Code

Thu, 04/27/2017 - 23:56

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Via Max Galka, always a great source of interesting data visualizations, we have this lovely visualization of population density in Europe in 2011, created by Henrik Lindberg:


Impressively, the chart was created with just 14 lines of R code.

(To recreate it yourself, download the data file from Eurostat and place the two .csv files it contains alongside your R script.) The code parses the latitude/longitude of population centers listed in the CSV file, arranges them into a 0.01 by 0.01 degree grid, and plots each row as a horizontal line with population as the vertical axis. Grid cells with zero population cause breaks in the line and leave white gaps in the map. It's quite an elegant effect!
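The original 14 lines are not reproduced here, but the approach just described — bin the points onto a grid, then draw one line per latitude row — can be sketched in base R (simulated stand-in data and a coarser 0.5-degree grid, not the original code):

```r
# Simulate point "population centres", bin them onto a lat/lon grid, and draw
# one horizontal path per latitude row with the population count as the height.
set.seed(42)
pts <- data.frame(lon = runif(5000, -10, 30), lat = runif(5000, 35, 60))
pts$glat <- round(pts$lat * 2) / 2        # 0.5-degree rows (the post uses 0.01)
pts$glon <- round(pts$lon * 2) / 2
grid <- aggregate(list(pop = rep(1, nrow(pts))), by = pts[c("glat", "glon")], FUN = sum)
plot(NULL, xlim = range(grid$glon), ylim = range(grid$glat) + c(0, 2),
     xlab = "", ylab = "", axes = FALSE)
for (la in sort(unique(grid$glat))) {
  row <- grid[grid$glat == la, ]
  row <- row[order(row$glon), ]
  lines(row$glon, la + row$pop / max(grid$pop))  # population spike along this row
}
```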

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Load, Save, and .rda files

Thu, 04/27/2017 - 16:44

(This article was first published on The Practical R, and kindly contributed to R-bloggers)

A couple of weeks ago I stumbled across a feature in R that I had never heard of before: the functions save() and load(), and the R file type .rda.

The .rda files allow a user to save their R data structures such as vectors, matrices, and data frames. The file is automatically compressed, with user options for additional compression. Let’s take a look.

First, we will grab one of the built-in R datasets. We can view these by calling data(). Let’s use the “Orange” dataset.

# get the Orange data
Orange

   Tree  age circumference
1     1  118            30
2     1  484            58
3     1  664            87
4     1 1004           115
5     1 1231           120
6     1 1372           142
7     1 1582           145
8     2  118            33
9     2  484            69
10    2  664           111
11    2 1004           156
12    2 1231           172
13    2 1372           203
14    2 1582           203
15    3  118            30
16    3  484            51
17    3  664            75
18    3 1004           108
19    3 1231           115
20    3 1372           139
21    3 1582           140
22    4  118            32
23    4  484            62
24    4  664           112
25    4 1004           167
26    4 1231           179
27    4 1372           209
28    4 1582           214
29    5  118            30
30    5  484            49
31    5  664            81
32    5 1004           125
33    5 1231           142
34    5 1372           174
35    5 1582           177

Next, let’s save each column individually as vectors.

# save the Orange data as vectors
count <- Orange$Tree
age <- Orange$age
circumference <- Orange$circumference

Now if we look at our variables in the RStudio environment, we can see count, age, and circumference saved there.

Next, let’s set our R working directory, so the .rda file will save in the correct location. First we’ll use getwd() to find our current working directory, then we’ll adjust it (if needed) using setwd(). I set my working directory to a folder on the D drive.

#get and set working directory
> getwd()
[1] "D:/Users"
> setwd("D:/r-temp")
> getwd()
[1] "D:/r-temp"

Finally, let’s use the save() command to save our 3 vectors to an .rda file. The “file” name will be the name of the new .rda file.

#save to rda file
save(count, age, circumference, file = "mydata.rda")

Next we will remove our R environment variables using the command rm().

#remove variables
rm(age, circumference, count)

Now we can see that we no longer have saved variables in our R workspace.

Now, we can check that our .rda file (mydata.rda) does in fact store our data by using the load() command.
Note: If we had not properly set our working directory, then we would have needed to provide a full path to the .rda file. For example, “C:/Users/Documents/R files/mydata.rda” rather than just “mydata.rda”.

#load the rda file
load(file = "mydata.rda")

Great, now we can see that our variables are back in the R environment for use once more.

Saving and loading data in R can be very useful when you’re working with large datasets that you want to clear from memory but would also like to save for later. It can also be useful for long, complex R workflows and scripts. You can control the compression of the file using the ‘compress’ and ‘compression_level’ arguments of save().
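For instance, the two compression arguments can be compared like this (a small sketch; the file names are arbitrary temporary files):

```r
# 'compress' and 'compression_level' are documented arguments of save().
x <- rnorm(1e5)
f_gz <- tempfile(fileext = ".rda")
f_xz <- tempfile(fileext = ".rda")
save(x, file = f_gz, compress = "gzip")
save(x, file = f_xz, compress = "xz", compression_level = 9)
file.size(f_gz); file.size(f_xz)  # xz is often (not always) smaller
rm(x)
load(f_gz)                        # x is back in the environment
```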

That’s all for now!

To leave a comment for the author, please follow the link and comment on their blog: The Practical R.

Data Science for Operational Excellence (Part-4)

Thu, 04/27/2017 - 16:26

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Suppose your friend is a restaurant chain owner (only 3 units) facing a competitive challenge related to low prices; let’s call it a price war. Inside his business he knows that there is not much cost left to cut. But he thinks that if he tries harder to find better suppliers with lower freight and product costs, he could be in a better situation. So he decided to hire you, a recently graduated data scientist, to figure out how to solve this problem and to build a tool so that your findings can be incorporated into his daily operations. As a data scientist, you know that this problem can be solved with the lpSolve package.

Our goal here is to expand your knowledge by creating custom constraints to be used in real business problems.
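Before diving into the exercises, a toy illustration of the lp() interface from lpSolve may help (a generic sketch with made-up numbers, not an exercise answer):

```r
library(lpSolve)

# Minimize ordering cost from two hypothetical suppliers:
# cost = 2*x1 + 3*x2, subject to meeting demand (x1 + x2 >= 100)
# and a supply cap on supplier 1 (x1 <= 60).
res <- lp(direction    = "min",
          objective.in = c(2, 3),
          const.mat    = rbind(c(1, 1),
                               c(1, 0)),
          const.dir    = c(">=", "<="),
          const.rhs    = c(100, 60),
          all.int      = TRUE)
res$solution  # optimal order quantities: 60 from supplier 1, 40 from supplier 2
res$objval    # total cost: 240
```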

Answers to the exercises are available here.

Exercise 1
We will solve a transportation problem for your friend’s restaurant chain with 2 different products sold from 4 different suppliers. Create a cost vector that model different costs for each combination of restaurant, supplier and product. Use integer random numbers from 0 to 1000 to fill this vector. In order to be reproducible, set seed equals to 1234.

Exercise 2
Create the demand constraints. Consider that every restaurant need a specific quantity for each product. Use integer random numbers from 100 to 500 to define minimum quantities to keep the restaurant open without running out of any supplies.

Exercise 3
Create the offer constraints. Consider that every supplier can deliver a specific quantity related to each product. Use integer random numbers from 200 to 700 to define maximum quantities that each supplier can deliver.

Exercise 4
Prepare the parameter of the lp() function using the variables created above.

Exercise 5
Now, solve the problem with the constraints created so far.

Exercise 6
We know that some suppliers have minimum order quantities. Create a new set of constraints to represent that. Use integer random numbers from 50 to 70 to define minimum quantities that we can order from each supplier.

Exercise 7
Now, solve the problem with the constraints created so far.

Exercise 8
We also know that some vehicles have maximum capacity in terms of weight and volume. Create a new set of constraints to represent that. Use integer random numbers from 100 to 500 to define maximum quantities that we can order from each supplier.

Exercise 9
Prepare the lp() function parameters again using the variables created above.

Exercise 10
Now, solve the problem with all constraints.

Related exercise sets:
  1. Lets Begin with something sample
  2. Data Science for Operational Excellence (Part-1)
  3. Data Exploration with Tables exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Overcome the Fear of Programming

Thu, 04/27/2017 - 15:08

(This article was first published on R programming, and kindly contributed to R-bloggers)

By Milind Paradkar

You say you never programmed in life before? Never heard of words like Classes and Objects, Dataframe, Methods, Inheritance, Loops? Are you fearful of programming, huh?

Don’t be! Programming can be fun, stimulating, and once you start and learn to program many of you would love spending hours programming different strategies; you would love to see your own codes run in the blink of an eye and would see how powerful these codes can be.

The Executive Programme in Algorithmic Trading (EPAT™) course makes extensive use of the Python and R programming languages to teach strategies, backtesting, and their optimization. Let us take the help of R to demonstrate how you can overcome the fear of programming. Here are some suggestions for newbie programmers.

1) Think and let the questions pop in your mind

As a newbie programmer when you have a task to code, even before you start on it, spend some time ideating on how you would like to solve it step-by-step. Simply let questions pop up in your mind, as many questions as your mind may throw up.

Here are a few questions:
Is it possible to download stock price data in R from Google Finance?
How to delete a column in R?
How to compute an exponential moving average (EMA)?
How do I draw a line chart in R?
How to merge two data sets?
Is it possible to save the results in an Excel workbook using R?
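As it happens, a couple of these questions have one-line answers in base R. A quick sketch with a made-up data frame:

```r
# Delete a column from a data frame.
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
df$b <- NULL                              # column "b" is gone
names(df)                                 # "a" "c"

# Merge two data sets on a shared key.
left   <- data.frame(id = 1:2, x = c("A", "B"))
right  <- data.frame(id = 1:2, y = c(10, 20))
merged <- merge(left, right, by = "id")
```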

2) Google the questions for answers

Use Google search to see whether solutions exist for the questions that you have raised. Let us take the second question: how to delete a column in R? We posted the question in Google search, and as we can see from the screenshot below, we have the solution in the very first result shown by Google.

R is an open-source project, and there are hundreds of articles, blogs, forums, tutorials, YouTube videos, and books on the net which will help you overcome the fear of programming and transition from a beginner to an intermediate level, and eventually to an expert if you aspire to.

The chart below shows the number of questions/threads posted by newbie and expert programmers on two popular websites. As you can see, R clearly tops the results with more than 10 thousand questions/threads.
(Source: )

Let us search on Google whether QuantInsti™ has put up any programming material on R.
As you can see from the google results, QuantInsti™ has posted quality content on its website to help newbie programmers design and model quantitative trading strategies in R. You can read all the rich content posted regularly by QuantInsti™ here –

3) Use the print command in R

As a newbie programmer, don’t get intimidated when you come across complex-looking code on the internet. If you are unable to figure out what exactly the code does, just copy the code into R. You can use a simple print() command to help understand the code’s working.

One can also use Ctrl+Enter to execute the code line-by-line and see the results in the console.

Let us take an example of an MACD trading strategy posted on QuantInsti’s blog.

An example of a trading strategy coded using Quantmod Package in R

I am unsure of the working of the commands at line 9 and line 11, so I simply inserted a print(head(returns)) command at line 10 and one more at line 12. Thereafter I ran the code. Below is the result as shown in the console window of R.

The returns = returns['2008-06-02/2015-09-22'] command simply trims the original NSEI.Close price-returns series: the series earlier started from 2007-09-17; it now starts from 2008-06-02 and ends at 2015-09-22.

4) Use help() and example() functions in R

One can also make use of the help() and example() functions in R to understand a code, and also learn new ways of coding. Continuing with the code above, I am unsure what the ROC function does at line 9 in the code.

I used the help("ROC") command, and R displays all the relevant information regarding the usage and arguments of the ROC function.

There are hundreds of add-on packages in R which make programming easy and yet powerful.

Below is the link to view all the available packages in R:

5) Give time to programming

Programming can be a very rewarding experience, and we expect that you devote some time towards learning and honing your programming skills. Below is a word cloud of some essential characteristics a good programmer should possess. The best suggestion would be to just start programming!!

Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to be a successful trader. Enroll now!

As a newbie programmer, you have just made a start. The faculty at QuantInsti™ will teach and guide you through different aspects of programming in R and Python. Over the course of the program, you will learn different data structures, classes and objects, functions, and many other aspects which will enable you to program algorithmic trading strategies in the most efficient and powerful way.

The post Overcome the Fear of Programming appeared first on .

To leave a comment for the author, please follow the link and comment on their blog: R programming.

Population Lines: How and Why I Created It

Thu, 04/27/2017 - 13:54

(This article was first published on R –, and kindly contributed to R-bloggers)

Thanks to the power of Reddit, the “Population Lines” print (buy here) I created back in 2013 has attracted a huge amount of interest in the past week or so (a Europe-only version made by Henrik Lindberg made the Reddit front page). There’s been lots of subsequent discussion about its inspiration and effectiveness as a form of visualisation, and requests for the original code. So for the purposes of posterity I thought I’d run through the map’s conception and history.

I was asked to create a graphic for the front cover for the programme of the Royal Geographical Society’s Annual Conference in 2013. The theme for the conference was “New Geographical Frontiers” and given the RGS’s long association with exploration I immediately thought of path breaking across undiscovered lands or summiting mountains. These activities have very little to do with the annual conference, but I wanted to evoke them a little on the cover. To me the greatest frontier for geographers is better understanding the world population and also appreciating that Europe/ the USA are no longer at the centre of the action. I therefore played around with various population map options (you can see some inspiration ideas here). I was keen to get something that showed the true peaks of global population, which is how I came upon what I thought was a fairly original idea of showing population density spikes along lines of latitude using data from NASA’s SEDAC. After creating this – and as with so many things in geography/cartography/ dataviz – I later discovered that mapping spatial data in this way had been done before by the pioneers of computer mapping using software called SYMAP (I can’t find an image of this but I’ve definitely seen it!). So it’s nothing new in terms of a way of geographically representing the world.

As for the design I was seeking something akin to the iconic Joy Division album cover. The original RGS cover was black and white, but thanks to a suggestion from Oliver, I opted to mark key cities in gold to help with orientation etc. I’ve provided the original R code below for those interested. Since creating Population Lines I’ve only created one other map using this technique and it was for the book London: the Information Capital. For this we wanted to show the huge population spike experienced by the City of London during the working day. I think it works really well for this. I’m not convinced the technique is good at showing exact data values at precise locations but if you are looking to draw attention to extremes and you are dealing with familiar geographies/places then it can tell a compelling story.

R Code

Here is the original R code I used back in May 2013. R has come on a lot since then, but this still works! You should use the latest population grid from NASA’s SEDAC.


#Load packages
library(sp)
library(rgdal)
library(reshape)
library(ggplot2)
library(maptools)
library(rgeos)

#create a range standardisation function
range01 <- function(x){(x-min(x))/(max(x)-min(x))}

#load in a population grid - in this case it's population density from NASA's SEDAC (see link above).
#It's spatial data if you haven't seen this before in R.
input <- readGDAL("glp10ag30.asc") # the latest data come as a Tiff so you will need to tweak.
## glp10ag30.asc has GDAL driver AAIGrid
## and has 286 rows and 720 columns
proj4string(input) = CRS("+init=epsg:4326")
## Warning in `proj4string<-`(`*tmp*`, value = <S4 object of class structure("CRS", package = "sp")>):
## A new CRS was assigned to an object with an existing CRS:
## +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## without reprojecting.
## For reprojection, use function spTransform

#Get the data out of the spatial grid format using "melt" and rename the columns.
values <- melt(input)
names(values) <- c("pop", "x", "y")

#Rescale the values. This is to ensure that you can see the variation in the data
values$pop_st <- range01(values$pop)*0.20
values$x_st <- range01(values$x)
values$y_st <- range01(values$y)

#Switch off various ggplot things
xquiet <- scale_x_continuous("", breaks=NULL)
yquiet <- scale_y_continuous("", breaks=NULL)
quiet <- list(xquiet, yquiet)

#Add 180 to the latitude to remove negatives in southern hemisphere, then order them.
values$ord <- values$y+180
values_s <- values[order(-values$ord),]

#Create an empty plot called p
p <- ggplot()

#This loops through each line of latitude and produces a filled polygon that will mask out the
#lines beneath, and then plots the paths on top. The p object becomes a big ggplot2 plot.
for (i in unique(values_s$ord)) {
  p <- p + geom_polygon(data=values_s[values_s$ord==i,], aes(x_st, pop_st+y_st, group=y_st),
                        size=0.2, fill="white", col="white") +
    geom_path(data=values_s[values_s$ord==i,], aes(x_st, pop_st+y_st, group=y_st),
              size=0.2, lineend="round")
}

#show plot!
p + theme(panel.background = element_rect(fill='white', colour='white')) + quiet

If you want to learn more, see here for another tutorial that takes a slightly different approach to achieve the same effect.

To leave a comment for the author, please follow the link and comment on their blog: R – offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Genetic Music: From Schoenberg to Bach

Thu, 04/27/2017 - 09:30

Bach, the epitome of a musician who strove all life long and finally acquired the ‘Habit of Perfection’, was a thoroughly imperfect human being (John Eliot Gardiner, Bach: Music in the Castle of Heaven)

Sometimes I daydream and imagine I am a famous musician. I fantasize about being Paco de Lucía playing Mi niño Curro alone on the stage, Thom Yorke singing Fake plastic trees at Glastonbury or Noel Gallagher singing Don’t look back in anger for a devoted crowd.

My parents gave me the opportunity to learn music, and this has been one of the best gifts I have ever received. I played the cello intensively until I had children, but I still have enough skill to play some pieces. One of them is the Prelude from Suite No. 1 by J. S. Bach. It is very close to the limit of my abilities, but I love it. It is timeless, thrilling, provocative and elegant: an absolute masterpiece. I also often imagine myself playing it as well as my admired Yo-Yo Ma does.

The aim of this experiment is to discern the first 4 beats of the prelude using a genetic algorithm. First of all, let’s listen to our goal melody, created with the tuneR package (sorry for the sound, Mr. Bach):

The frequency range of the cello goes from 65.41 Hz to 987.77 Hz. Using the basic formula for note frequencies, this means a cello can produce 48 different notes. I generated the following codification for the 48 notes of the cello:

frequency (Hz)   note       code
65.41            C2         a
69.30            C#2/Db2    b
73.42            D2         c
77.78            D#2/Eb2    d
82.41            E2         e
87.31            F2         f
92.50            F#2/Gb2    g
98.00            G2         h
103.83           G#2/Ab2    i
110.00           A2         j
116.54           A#2/Bb2    k
123.47           B2         l
130.81           C3         m
138.59           C#3/Db3    n
146.83           D3         o
155.56           D#3/Eb3    p
164.81           E3         q
174.61           F3         r
185.00           F#3/Gb3    s
196.00           G3         t
207.65           G#3/Ab3    u
220.00           A3         v
233.08           A#3/Bb3    w
246.94           B3         x
261.63           C4         y
277.18           C#4/Db4    z
293.66           D4         A
311.13           D#4/Eb4    B
329.63           E4         C
349.23           F4         D
369.99           F#4/Gb4    E
392.00           G4         F
415.30           G#4/Ab4    G
440.00           A4         H
466.16           A#4/Bb4    I
493.88           B4         J
523.25           C5         K
554.37           C#5/Db5    L
587.33           D5         M
622.25           D#5/Eb5    N
659.26           E5         O
698.46           F5         P
739.99           F#5/Gb5    Q
783.99           G5         R
830.61           G#5/Ab5    S
880.00           A5         T
932.33           A#5/Bb5    U
987.77           B5         V
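These 48 frequencies follow the standard equal-temperament formula, f(n) = 440 * 2^(n/12), where n counts semitones from A4 (440 Hz). A quick sketch of my own, matching the formula the post's code uses later, reproduces the table's range:

```r
# Frequencies of the 48 cello notes: C2 sits 33 semitones below A4
# and B5 sits 14 semitones above it, giving 48 notes in total.
freq <- function(n) 440 * (2^(1/12))^n
cello_freqs <- round(freq(-33:14), 2)

length(cello_freqs)   # 48
head(cello_freqs, 3)  # 65.41 69.30 73.42
tail(cello_freqs, 1)  # 987.77
```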

So our goal melody is codified like this:


I start with a population of 500 random melodies. All of them have 64 notes, the same length as the goal melody. Given a melody, the algorithm compares it with the goal melody to calculate its fitness, with the following formula:

For example, a melody with 5 correct notes has a fitness of 32. Being correct means being the right note in the right place. After measuring the fitness of all melodies, I select 250 couples of individuals depending on their fitness (the higher the fitness, the higher the probability of being selected). Each couple generates two children for the next generation with a certain probability, called the crossover rate; crossing is not always applied. Once two parents are selected, a random crossover point is chosen. At that point in both strings, the genetic material from the left side of one parent is spliced to the material from the right side of the other parent. The next figure illustrates the idea:

So two parents give birth to two children for the next generation. The last thing to do is to mutate the children. Once again, mutation is not always applied, since it depends on a rate that is usually small. Mutation introduces some new notes (new genetic material) into the next population. It increases convergence speed and reduces the probability of getting stuck in a local optimum.
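The two operators can be sketched on toy 8-note melodies as follows (a simplified illustration of the idea with function names of my own, not the post's exact code):

```r
set.seed(1)
genes <- c(letters, LETTERS)[1:48]  # the 48 note codes

# Single-point crossover: splice the left part of one parent to the
# right part of the other, at a random cut point.
crossover <- function(p1, p2) {
  len <- nchar(p1)
  mix <- sample(1:(len - 1), 1)
  c(paste0(substr(p1, 1, mix), substr(p2, mix + 1, len)),
    paste0(substr(p2, 1, mix), substr(p1, mix + 1, len)))
}

# Mutation: each note is replaced by a random one with a small probability.
mutate_melody <- function(m, rate = 0.01) {
  for (j in 1:nchar(m))
    if (runif(1) < rate) substr(m, j, j) <- sample(genes, 1)
  m
}

children <- vapply(crossover("tAJHJAJA", "tCKJKCKC"), mutate_melody, character(1))
```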

How many 32-length melodies can be written with 48 notes? The answer is 48^32, which is this extremely big number:


To understand how enormous this number is, let’s suppose we could work with Sunway TaihuLight, currently the fastest supercomputer in the world. This monster can do roughly 93 quadrillion floating-point operations per second, so it would spend more than 214.995.831.974.513.789.322.026.202.008 years calculating the fitness of all possible melodies: brute force is not an option.
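A back-of-the-envelope check of that figure, computed on a log scale so R's doubles don't overflow (I assume one fitness evaluation per floating-point operation, at roughly 93 petaflops, TaihuLight's approximate LINPACK score):

```r
log10_melodies <- 32 * log10(48)        # ~53.8, i.e. about 6e53 melodies
seconds <- 10^log10_melodies / 93e15    # at ~93 petaflops
years <- seconds / (365.25 * 24 * 3600) # on the order of 2e29 years
```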

A genetic algorithm does the job in just a few iterations. The best melodies introduce innovations which increase the average fitness of the whole population as well as its maximum fitness. The next table shows the evolution of one execution of the algorithm for a crossover rate equal to 75% and a mutation rate of 1% (not exhaustive):


The optimum is reached in just 125 iterations. It is fun to merge the best melodies from several iterations. This sample blends four of them: the first comes from the initial population (the Schoenberg-flavored one) and the last is our goal melody; the other two were randomly picked from the remaining iterations. It is nice to hear how the genetic algorithm turns randomness into Bach’s wonderful melody:

This experiment was inspired by The Computational Beauty of Nature, a splendid book by Gary William Flake that I strongly recommend.

This is the code of the experiment:

library(tuneR)
library(stringdist)
library(dplyr)

#Function to calculate frequency
freq = function(n) 440*(2^(1/12))^n

#cello notes
notes = c("C2", "C#2/Db2", "D2", "D#2/Eb2", "E2", "F2", "F#2/Gb2", "G2",
          "G#2/Ab2", "A2", "A#2/Bb2", "B2", "C3", "C#3/Db3", "D3", "D#3/Eb3",
          "E3", "F3", "F#3/Gb3", "G3", "G#3/Ab3", "A3", "A#3/Bb3", "B3",
          "C4", "C#4/Db4", "D4", "D#4/Eb4", "E4", "F4", "F#4/Gb4", "G4",
          "G#4/Ab4", "A4", "A#4/Bb4", "B4", "C5", "C#5/Db5", "D5", "D#5/Eb5",
          "E5", "F5", "F#5/Gb5", "G5", "G#5/Ab5", "A5", "A#5/Bb5", "B5")

#Table of frequencies
frequencies = data.frame(n = -33:14) %>%
  mutate(frequency = round(freq(n), 4),
         note = notes,
         code = c(letters, toupper(letters))[1:48])

#Codification of the goal melody
prelude = "tAJHJAJAtAJHJAJAtCKJKCKCtCKJKCKCtEKJKEKEtEKJKEKEtFJHJFJFtFJHJFJF"

#Sample wav
if (exists("all_wave")) rm(all_wave)
frequencies %>% filter(code == substr(prelude, 1, 1)) %>%
  select(frequency) %>% as.numeric %>% sine(duration = 10000) -> all_wave
for (i in 2:nchar(prelude))
  frequencies %>% filter(code == substr(prelude, i, i)) %>%
    select(frequency) %>% as.numeric %>%
    sine(duration = 10000) %>% bind(all_wave, .) -> all_wave
play(all_wave)
writeWave(all_wave, 'PreludeSample.wav')

popsize = 500 #Population size
length = nchar(prelude)
genes = frequencies$code
maxfitness = 2^(1-(stringdist(prelude, prelude, method = "hamming")-length))
maxiter = 200 #Max number of iterations
iter = 1
mutrate = 0.01

#Initial population
replicate(popsize, sample(genes, length, replace = TRUE)) %>%
  apply(2, function(x) paste(x, collapse = "")) -> population

#Fitness evaluation
fitness = sapply(population,
                 function(x) 2^(1-(stringdist(x, prelude, method = "hamming")-length)),
                 USE.NAMES = FALSE)
#Maximum fitness
maxfitenss_iter = max(fitness)
#Best melody
which((fitness) == max(fitness)) %>% min %>% population[.] -> bestfit
results = data.frame(iteration = iter, best_melody = bestfit,
                     correct_notes = log(maxfitenss_iter, base = 2)-1)

#Execution of the algorithm
while(maxfitenss_iter < maxfitness & iter < maxiter)
{
  population2 = c()
  for (i in 1:(popsize/2))
  {
    parents = sample(1:popsize, size = 2, prob = fitness/sum(fitness), replace = FALSE)
    mix = sample(1:(length-1), 1)
    if (runif(1) > .25)
    {
      p1 = paste0(substr(population[parents[1]], 1, mix), substr(population[parents[2]], mix+1, length))
      p2 = paste0(substr(population[parents[2]], 1, mix), substr(population[parents[1]], mix+1, length))
    } else
    {
      p1 = population[parents[1]]
      p2 = population[parents[2]]
    }
    for (j in 1:length) if(runif(1) < mutrate) substr(p1, j, j) = sample(genes, 1)
    for (j in 1:length) if(runif(1) < mutrate) substr(p2, j, j) = sample(genes, 1)
    c(p1, p2) %>% c(population2) -> population2
  }
  #New population
  population = population2
  fitness = sapply(population,
                   function(x) 2^(1-(stringdist(x, prelude, method = "hamming")-length)),
                   USE.NAMES = FALSE)
  which((fitness) == max(fitness)) %>% min %>% population[.] -> bestfit
  print(paste0("Iteration ", iter, ": ", bestfit))
  maxfitenss_iter = max(fitness)
  iter = iter+1
  data.frame(iteration = iter, best_melody = bestfit,
             correct_notes = log(maxfitenss_iter, base = 2)-1) %>%
    rbind(results) -> results
}

Gender and verbs across 100,000 stories: a tidy analysis

Thu, 04/27/2017 - 09:30

(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

Previously in this series

I was fascinated by my colleague Julia Silge’s recent blog post on what verbs tend to occur after “he” or “she” in several novels, and what they might imply about gender roles within fictional work. This made me wonder what trends could be found across a larger dataset of stories.

Mark Riedl’s Wikipedia plots dataset that I examined in yesterday’s post offers a great opportunity to analyze this question. The dataset contains over 100,000 descriptions of plots from films, novels, TV shows, and video games. The stories span centuries and come from tens of thousands of authors, but the descriptions are written by a modern audience, which means we can quantify gender roles across a wide variety of genres. Since the dataset contains plot descriptions rather than primary sources, it’s also more about what happens in each story than about how an author describes the work: we’re less likely to see “thinks” or “says”, but more likely to see “shoots” or “escapes”.

As I usually do for text analysis, I’ll be using the tidytext package Julia and I developed last year. To learn more about analyzing datasets like this, see our online book Text Mining with R: A Tidy Approach, soon to be published by O’Reilly. I’ll provide code for the text mining sections so you can follow along. I don’t show the code for most of the visualizations to keep the post concise, but as with all of my posts the code can be found here on GitHub.


We start with the same code from the last post, which reads in the plot_text variable from the raw dataset. Just as Julia did, we then tokenize the text into bigrams, or consecutive pairs of words, with the tidytext package, then filter for cases where a word occurred after “he” or “she”.1

library(dplyr)
library(tidyr)     # for separate()
library(tidytext)

bigrams <- plot_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2, collapse = FALSE)

bigrams_separated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

he_she_words <- bigrams_separated %>%
  filter(word1 %in% c("he", "she"))

he_she_words
## # A tibble: 797,388 × 4
##    story_number                      title word1      word2
##           <dbl>                      <chr> <chr>      <chr>
## 1             1                Animal Farm    he     refers
## 2             1                Animal Farm    he    accuses
## 3             1                Animal Farm    he  collapses
## 4             1                Animal Farm    he celebrates
## 5             1                Animal Farm    he  abolishes
## 6             2 A Clockwork Orange (novel)    he         is
## 7             2 A Clockwork Orange (novel)    he  describes
## 8             2 A Clockwork Orange (novel)    he      meets
## 9             2 A Clockwork Orange (novel)    he    invites
## 10            2 A Clockwork Orange (novel)    he      drugs
## # ... with 797,378 more rows

For example, we see the plot description for “Animal Farm” has five uses of a verb after “he”, such as “he refers” and “he accuses”. (Note that throughout this post I’ll refer to these after-pronoun words as “verbs” since the vast majority are, but some are conjunctions like “and” or adverbs like “quickly”).

Gender-associated verbs

Which words were most shifted towards occurring after “he” or “she”? We’ll filter for words that appeared at least 200 times.

he_she_counts <- he_she_words %>%
  count(word1, word2) %>%
  spread(word1, n, fill = 0) %>%
  mutate(total = he + she,
         he = (he + 1) / sum(he + 1),
         she = (she + 1) / sum(she + 1),
         log_ratio = log2(she / he),
         abs_ratio = abs(log_ratio)) %>%
  arrange(desc(log_ratio))
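A toy example of my own (made-up counts, not the post's data) showing what the `(n + 1) / sum(n + 1)` smoothing and the log2 ratio accomplish: adding one to each count avoids taking log2 of zero for words seen after only one pronoun, and log2 makes equal shifts toward "he" and "she" symmetric around zero.

```r
he  <- c(kills = 200, cries = 40)
she <- c(kills = 50,  cries = 160)
log_ratio <- log2(((she + 1) / sum(she + 1)) / ((he + 1) / sum(he + 1)))
round(log_ratio, 2)  # kills -1.79 (toward "he"), cries 2.16 (toward "she")
```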

This can be visualized in a bar plot of the most skewed words.2

I think this paints a somewhat dark picture of gender roles within typical story plots. Women are more likely to be in the role of victims: “she screams”, “she cries”, or “she pleads”. Men tend to be the aggressors: “he kidnaps” or “he beats”. Not all male-oriented terms are negative; many, like “he saves” and “he rescues”, are distinctly positive, but almost all are active rather than receptive.

We could alternatively visualize the data by comparing the total number of words to the difference in association with “he” and “she”. This helps find common words that show a large shift.

library(ggplot2)
library(scales)    # for comma_format()

he_she_counts %>%
  filter(!word2 %in% c("himself", "herself", "she"),
         total >= 100) %>%
  ggplot(aes(total, log_ratio)) +
  geom_point() +
  scale_x_log10(breaks = c(100, 1000, 10000, 1e5),
                labels = comma_format()) +
  geom_text(aes(label = word2), vjust = 1, hjust = 1,
            check_overlap = TRUE) +
  scale_y_continuous(breaks = seq(-2, 2),
                     labels = c('4X "he"', '2X "he"', "Same", '2X "she"', '4X "she"')) +
  labs(x = 'Total uses after "he" or "she" (note log scale)',
       y = 'Relative uses after "she" to after "he"',
       title = "Gendered verbs: comparing frequency to pronoun shift",
       subtitle = "Only words occurring at least 100 times after he/she. Overlapping labels were removed.") +
  expand_limits(x = 75)

There are a number of very common words (“is”, “has”, “was”) that occur equally often after “he” or “she”, but also some fairly common ones (“agrees”, “loves”, “tells”) that are shifted. “She accepts” and “He kills” are the two most shifted verbs that occurred at least a thousand times, as well as the most frequent words with more than a twofold shift.

“Poison is a woman’s weapon”: terms related to violence

Women in storylines are not always passive victims: the fact that the verb “stabs” is shifted towards female characters is interesting. What does the shift look like for other words related to violence or crime?

There’s an old stereotype (that’s appeared in works like Game of Thrones and Sherlock Holmes) that “poison is a woman’s weapon”, and this is supported in our analysis. Female characters are more likely to “poison”, “stab”, or “kick”; male characters are more likely to “beat”, “strangle”, or simply “murder” or “kill”. Men are moderately more likely to “steal”, but much more likely to “rob”.

It’s interesting to compare this to an analysis from the Washington Post of real murders in America. Based on this text analysis, a fictional murderer is about 2.5X as likely to be male as female, but in America (and likely elsewhere) real murderers are about 9X more likely to be male than female. This suggests female murderers are overrepresented in fiction relative to reality.

The fact that men are only slightly more likely to “shoot” in fiction is also notable, since the article noted that men are considerably more likely to choose guns as a murder weapon than women are.

Try it yourself

This data shows a shift in what verbs are used after “he” and “she”, and therefore what roles male and female characters tend to have within stories. However, it’s only scratching the surface of the questions that can be examined with this data.

  • Is the shift stronger in some formats or genres than in others? We could split the works into films, novels, and TV series, and ask whether these gender roles are equally strong in each.
  • Is the shift different between male- and female-created works?
  • Has the difference changed over time? Some examination indicates the vast majority of these plots come from stories written in the last century, and most of them from the last few decades (not surprising, since many are movies or television episodes, and since Wikipedia users are more likely to describe contemporary work).

I’d also note that we could expand the analysis to include not only pronouns but first names (e.g. not only “she tells”, but “Mary tells” or “Susan tells”), which would probably improve the accuracy of the analysis.

Again, the full code for this post is available here, and I hope others explore this data more deeply.

  1. Gender is not a binary, so please note that this analysis is examining how the Wikipedia description’s author refers to the character. For example, we miss cases of singular they, but it would be challenging to separate it from the more common plural. 

  2. I’m also skipping the words “himself” and “herself”, which are the most gender-shifted words but aren’t interesting for our purposes. 

To leave a comment for the author, please follow the link and comment on their blog: Variance Explained.

Creating a VIX Futures Term Structure In R From Official CBOE Settlement Data

Thu, 04/27/2017 - 09:21

This post will be detailing a process to create a VIX term structure from freely available CBOE VIX settlement data and a calendar of freely obtainable VIX expiry dates. This has applications for volatility trading strategies.

So this post, as has been the usual for quite some time, will not be about a strategy, but rather, a tool that can be used for exploring future strategies. Particularly, volatility strategies–which seems to have been a hot topic on this blog some time ago (and might very well be still, especially since the Volatility Made Simple blog has just stopped tracking open-sourced strategies for the past year).

This post’s topic is the VIX term structure: creating a set of continuous contracts, properly rolled according to VIX contract specifications, rather than the hodgepodge of generic algorithms found on some other websites. The idea is that, as of the previous day’s settlement (or whenever the CBOE actually releases its data), you can construct a curve of contracts and see whether it’s in contango (front month cheaper than the next month, and so on) or backwardation (front month more expensive than the next month, etc.).
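As a tiny illustration of those two regimes, here is a helper of my own (not from the post) that labels a settlement curve ordered front month first:

```r
# Classify a futures curve by the sign of its month-to-month differences.
curve_state <- function(settlements) {
  d <- diff(settlements)
  if (all(d > 0)) "contango"
  else if (all(d < 0)) "backwardation"
  else "mixed"
}

curve_state(c(12.48, 13.13, 13.98, 14.43))  # upward-sloping: "contango"
curve_state(c(16.20, 15.40, 15.10, 14.90))  # downward-sloping: "backwardation"
```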

The first (and most code-intensive) part of the procedure is fairly simple–map the contracts to an expiration date, then put their settlement dates and times to expiry into two separate xts objects, with one column for each contract.

The expiries text file is simply a collection of copied and pasted expiry dates from this site. It includes the January 2018 expiration date. Here is what it looks like:

> head(expiries)
  V1       V2   V3
1 18  January 2006
2 15 February 2006
3 22    March 2006
4 19    April 2006
5 17      May 2006
6 21     June 2006

require(xts)
require(data.table)

# 06 through 17
years <- c(paste0("0", c(6:9)), as.character(c(10:17)))

# futures months
futMonths <- c("F", "G", "H", "J", "K", "M", "N", "Q", "U", "V", "X", "Z")

# expiries come from the calendar linked above
expiries <- read.table("expiries.txt", header = FALSE, sep = " ")

# convert expiries into dates in R
dateString <- paste(expiries$V3, expiries$V2, expiries$V1, sep = "-")
dates <- as.Date(dateString, format = "%Y-%B-%d")

# map futures months to numbers for dates
monthMaps <- cbind(futMonths,
                   c("01", "02", "03", "04", "05", "06",
                     "07", "08", "09", "10", "11", "12"))
monthMaps <- data.frame(monthMaps)
colnames(monthMaps) <- c("futureStem", "monthNum")

dates <- data.frame(dates)
dates$dateMon <- substr(dates$dates, 1, 7)

contracts <- expand.grid(futMonths, years)
contracts <- paste0(contracts[, 1], contracts[, 2])
contracts <- c(contracts, "F18")
stem <- ""
#contracts <- paste0(stem, contracts, "_VX.csv")

masterlist <- list()
timesToExpiry <- list()
for (i in 1:length(contracts)) {
  # obtain data
  contract <- contracts[i]
  dataFile <- paste0(stem, contract, "_VX.csv")
  expiryYear <- paste0("20", substr(contract, 2, 3))
  expiryMonth <- monthMaps$monthNum[monthMaps$futureStem == substr(contract, 1, 1)]
  expiryDate <- dates$dates[dates$dateMon == paste(expiryYear, expiryMonth, sep = "-")]
  data <- suppressWarnings(fread(dataFile))

  # create dates
  dataDates <- as.Date(data$`Trade Date`, format = '%m/%d/%Y')

  # create time to expiration xts
  toExpiry <- xts(expiryDate - dataDates, order.by = dataDates)
  colnames(toExpiry) <- contract
  timesToExpiry[[i]] <- toExpiry

  # get settlements
  settlement <- xts(data$Settle, order.by = dataDates)
  colnames(settlement) <- contract
  masterlist[[i]] <- settlement
}

# cbind outputs
masterlist <- do.call(cbind, masterlist)
timesToExpiry <- do.call(cbind, timesToExpiry)

# NA out zeroes in settlements
masterlist[masterlist == 0] <- NA

From there, we need to visualize how many contracts are being traded at once on any given day (i.e., what's a good steady-state number of contracts for the term structure).

sumNonNA <- function(row) { return(sum(!is.na(row))) }
simultaneousContracts <- xts(apply(masterlist, 1, sumNonNA),
                             order.by = index(masterlist))
chart.TimeSeries(simultaneousContracts)  # from PerformanceAnalytics

The result looks like this:

So, 8 contracts (give or take) at any given point in time. This is confirmed by the end of the master list of settlements.

dim(masterlist)
tail(masterlist[, 135:145])

> dim(masterlist)
[1] 3002  145
> tail(masterlist[, 135:145])
           H17    J17    K17    M17    N17    Q17    U17    V17    X17    Z17   F18
2017-04-18  NA 14.725 14.325 14.525 15.175 15.475 16.225 16.575 16.875 16.925    NA
2017-04-19  NA 14.370 14.575 14.525 15.125 15.425 16.175 16.575 16.875 16.925    NA
2017-04-20  NA     NA 14.325 14.325 14.975 15.375 16.175 16.575 16.875 16.900    NA
2017-04-21  NA     NA 14.325 14.225 14.825 15.175 15.925 16.350 16.725 16.750    NA
2017-04-24  NA     NA 12.675 13.325 14.175 14.725 15.575 16.025 16.375 16.475 17.00
2017-04-25  NA     NA 12.475 13.125 13.975 14.425 15.225 15.675 16.025 16.150 16.75

Using this information, an algorithm can create eight continuous contracts, ranging from the front month to eight months out. The algorithm starts at the first day of the master list and runs to the first expiry, then moves between expiry windows, appending the front-month contract and the next seven contracts to a list before rbinding them together; it does the same with the expiry structure.

termStructure <- list()
expiryStructure <- list()
masterDates <- unique(c(first(index(masterlist)),
                        dates$dates[dates$dates %in% index(masterlist)],
                        Sys.Date() - 1))
for (i in 1:(length(masterDates) - 1)) {
  subsetDates <- masterDates[c(i, i + 1)]
  dateRange <- paste(subsetDates[1], subsetDates[2], sep = "::")
  subset <- masterlist[dateRange, c(i:(i + 7))]
  subset <- subset[-1, ]
  expirySubset <- timesToExpiry[index(subset), c(i:(i + 7))]
  colnames(subset) <- colnames(expirySubset) <- paste0("C", c(1:8))
  termStructure[[i]] <- subset
  expiryStructure[[i]] <- expirySubset
}
termStructure <- do.call(rbind, termStructure)
expiryStructure <- do.call(rbind, expiryStructure)

Again, one more visualization of when we have a suitable number of contracts:

simultaneousContracts <- xts(apply(termStructure, 1, sumNonNA),
                             order.by = index(termStructure))
chart.TimeSeries(simultaneousContracts)

And in order to preserve the most data, we’ll cut the burn-in period off when we first have 7 contracts trading at once.

first(index(simultaneousContracts)[simultaneousContracts >= 7])
termStructure <- termStructure["2006-10-23::"]
expiryStructure <- expiryStructure[index(termStructure)]

So there you have it–your continuous VIX futures contract term structure, as given by the official CBOE settlements. While some may try and simulate a trading strategy based on these contracts, I myself prefer to use them as indicators or features to a model that would rather buy XIV or VXX.

One last trick for those who want to visualize things: a way to actually plot the term structure on any given day, in particular the most recent one in the term structure.

plot(t(coredata(last(termStructure))), type = 'b')

A clear display of contango.

A post on how to compute synthetic constant-expiry contracts (e.g. constant 30-day expiry contracts) will be forthcoming in the near future.
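As a preview of that idea under my own assumptions (a generic linear-interpolation sketch, not the forthcoming post's method): a constant 30-day price can be built as a time-weighted average of the two contracts whose days to expiry straddle 30.

```r
# Weight the nearer contract by how far the further one sits past the
# target maturity, so the blend has a constant 30-day weighted expiry.
constant_maturity <- function(p_near, p_far, t_near, t_far, target = 30) {
  w <- (t_far - target) / (t_far - t_near)  # weight on the nearer contract
  w * p_near + (1 - w) * p_far
}

constant_maturity(p_near = 12.5, p_far = 13.1, t_near = 23, t_far = 51)  # 12.65
```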

Thanks for reading.

NOTE: I am currently interested in networking and full-time positions which may benefit from my skills. I may be contacted at my LinkedIn profile found here.

Welcome to our rOpenSci Interns

Thu, 04/27/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

There's a lot of work that goes in to making software: the code that does the thing itself, unit testing, examples, tutorials, documentation, and support. rOpenSci software is created and maintained both by our staff and by our (awesome) community. In keeping with our aim to build capacity of software users and developers, three interns from our academic home at UC Berkeley are now working with us as well. Our interns are mentored by Carl Boettiger, Scott Chamberlain, and Karthik Ram and they will receive academic credit and/or pay for their work.

The interns
  • Katie Rice:
    • Is a third-year undergraduate student doing a major in Environmental Sciences and a minor in Sustainable Design
    • Katie on GitHub
  • Diana Ly:
    • Is a third-year undergraduate student doing a major in Mathematics.
    • Diana on GitHub
  • Siwei (Steven) Ye:
    • Is a second-year undergraduate student doing a double major in Mathematics and Statistics
    • Steven on GitHub
What they're working on

Katie

READMEs are the first thing someone reads when landing on a GitHub repository. Thus, it's important that the README has sufficient information to tell the reader what the software is for, who it's meant for, what it does, how to install it and how to give feedback.

Most software maintainers do a good job with how to install and how to give feedback (usually a link to the issues tab), but we often fall short on describing at a high level what the piece of software is about.

This is where Katie comes in! She's been going through rOpenSci repositories on GitHub, improving the high level description of the software and then sending pull requests with changes to the README's.

Check out Katie's GitHub activity for rOpenSci related work.


Diana

Diana is just getting started. She'll be working on documentation and maintenance.

In addition, she'll be working on an R package that will make it easy to make cheatsheets for packages from simple markdown templates – no editing powerpoint or keynote files needed!


Steven

Software unit tests (a method to determine whether software components perform as designed) are very important. Our policy is that packages submitted to our onboarding repository must have tests.
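In base-R terms, a unit test is just a set of assertions that fail loudly when a component misbehaves. Here is a minimal sketch with a made-up function (rOpenSci packages use the testthat framework rather than raw stopifnot):

```r
# Hypothetical unit under test
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# The "tests": each assertion documents one designed behaviour
stopifnot(
  identical(rescale01(c(2, 4, 6)), c(0, 0.5, 1)),  # maps onto [0, 1]
  !is.unsorted(rescale01(1:10))                    # preserves ordering
)
```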

In addition, we build and check our software on each change (which includes running tests). However, since rOpenSci has been around for a while, there are still some packages that don't have tests, or not enough of them. When you do have tests, you can calculate test coverage using, for example, Codecov, which is a good way to target which tests to write.

Lastly, it's ideal to signal to potential users that you have continuous integration and test coverage through badges, like this one:

This is where Steven comes in! When a package already has tests, he adds Codecov integration and a badge (like the one above) if it's missing. When packages don't have tests, he writes them, including integrating Codecov.

Check out Steven's GitHub activity for rOpenSci related work.

Want to be an rOpenSci intern?

We'll be looking for new interns from time to time. Contact us at with any questions.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.

Euler Problem 18 & 67: Maximum Path Sums

Thu, 04/27/2017 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

An example of a pedigree. Source: Wikimedia.

Euler Problem 18 and 67 are exactly the same, except that the data set in the second version is larger than in the first. In this post, I kill two Eulers with one piece of code.

These problems deal with binary trees: a data structure in which each node has two children. A practical example of a binary tree is a pedigree chart, where each person or animal has two parents, four grandparents and so on.

Euler Problem 18 Definition

By starting at the top of the triangle below and moving to adjacent numbers on the row below, the maximum total from top to bottom is 23.

3
7 4
2 4 6
8 5 9 3

That is, 3 + 7 + 4 + 9 = 23. Find the maximum total from top to bottom of the triangle below:

75
95 64
17 47 82
18 35 87 10
20 04 82 47 65
19 01 23 75 03 34
88 02 77 73 07 63 67
99 65 04 28 06 16 70 92
41 41 26 56 83 40 80 70 33
41 48 72 33 47 32 37 16 94 29
53 71 44 65 25 43 91 52 97 51 14
70 11 33 28 77 73 17 78 39 68 17 57
91 71 52 38 17 14 91 43 58 50 27 29 48
63 66 04 68 89 53 67 30 73 16 69 87 40 31
04 62 98 27 23 09 70 98 73 93 38 53 60 04 23

As there are only 16,384 routes, it is possible to solve this problem by trying every route. However, Problem 67 is the same challenge with a triangle containing one hundred rows; it cannot be solved by brute force, and requires a clever method! ;o)
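As a quick aside of mine (not from the problem text), the route count follows from one left-or-right choice at each of the 14 steps down the 15-row triangle:

```r
routes <- 2^(15 - 1)  # one binary choice per step down
routes  # 16384
```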


This problem seeks a maximum path sum in a binary tree. The brute force method, as indicated in the problem definition, is a very inefficient way to solve this problem. The video visualises the quest for the maximum path, which takes eleven minutes of hypnotic animation.

A more efficient method is to define the maximum path layer by layer, starting at the bottom. For the 2 in the second-to-last row, the maximum of 2+8 and 2+5 is 10; for the 4, the maximum of 4+5 and 4+9 is 13; and for the 6, the maximum of 6+9 and 6+3 is 15. These maxima replace the second-to-last row, and the process repeats until only one number is left. This algorithm solves the sample triangle in four steps:

Step 1:

3
7 4
2 4 6
8 5 9 3

Step 2:

3
7 4
10 13 15

Step 3:

3
20 19

Step 4:

23
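The four steps can be replayed in a few lines of R (an illustration of mine; the post's full matrix-based solution appears below):

```r
# Collapse the bottom row into the row above, keeping the larger of the
# two adjacent sums, until one number (the maximum path sum) remains.
rows <- list(3, c(7, 4), c(2, 4, 6), c(8, 5, 9, 3))
while (length(rows) > 1) {
  n <- length(rows)
  bottom <- rows[[n]]
  rows[[n - 1]] <- rows[[n - 1]] + pmax(bottom[-length(bottom)], bottom[-1])
  rows[[n]] <- NULL
}
rows[[1]]  # 23
```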
In the code below, the data is stored in a triangle matrix. The variables rij (row) and kol (column) drive the search for the maximum path. The triangle for Euler Problem 18 is created manually, and the triangle for Euler Problem 67 is read from the website.

path.sum <- function(triangle) {
  for (rij in nrow(triangle):2) {
    for (kol in 1:(ncol(triangle)-1)) {
      triangle[rij - 1, kol] <- max(triangle[rij, kol:(kol + 1)]) + triangle[rij - 1, kol]
    }
    triangle[rij, ] <- NA
  }
  return(max(triangle, na.rm = TRUE))
}

# Euler Problem 18
triangle <- matrix(ncol = 15, nrow = 15)
triangle[1, 1]   <- 75
triangle[2, 1:2] <- c(95, 64)
triangle[3, 1:3] <- c(17, 47, 82)
triangle[4, 1:4] <- c(18, 35, 87, 10)
triangle[5, 1:5] <- c(20, 04, 82, 47, 65)
triangle[6, 1:6] <- c(19, 01, 23, 75, 03, 34)
triangle[7, 1:7] <- c(88, 02, 77, 73, 07, 63, 67)
triangle[8, 1:8] <- c(99, 65, 04, 28, 06, 16, 70, 92)
triangle[9, 1:9] <- c(41, 41, 26, 56, 83, 40, 80, 70, 33)
triangle[10, 1:10] <- c(41, 48, 72, 33, 47, 32, 37, 16, 94, 29)
triangle[11, 1:11] <- c(53, 71, 44, 65, 25, 43, 91, 52, 97, 51, 14)
triangle[12, 1:12] <- c(70, 11, 33, 28, 77, 73, 17, 78, 39, 68, 17, 57)
triangle[13, 1:13] <- c(91, 71, 52, 38, 17, 14, 91, 43, 58, 50, 27, 29, 48)
triangle[14, 1:14] <- c(63, 66, 04, 68, 89, 53, 67, 30, 73, 16, 69, 87, 40, 31)
triangle[15, 1:15] <- c(04, 62, 98, 27, 23, 09, 70, 98, 73, 93, 38, 53, 60, 04, 23)
answer <- path.sum(triangle)
print(answer)

Euler Problem 67

The solution for problem number 67 is exactly the same. The data is read directly from the Project Euler website.

# Euler Problem 67
triangle.file <- read.delim("", stringsAsFactors = F, header = F)
triangle.67 <- matrix(nrow = 100, ncol = 100)
for (i in 1:100) {
  triangle.67[i, 1:i] <- as.numeric(unlist(strsplit(triangle.file[i, ], " ")))
}
answer <- path.sum(triangle.67)
print(answer)

The post Euler Problem 18 & 67: Maximum Path Sums appeared first on The Devil is in the Data.

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.

Using NYC Citi Bike Data to Help Bike Enthusiasts Find their Mate

Thu, 04/27/2017 - 01:12

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

There is no shortage of analyses on the NYC bike share system. Most of them aim at predicting the demand for bikes and balancing bike stock, i.e. forecasting when to remove bikes from fully occupied stations and when to refill stations before the supply runs dry.


This is why I decided to take a different approach and use the Citi Bike data to help its users instead.


The Challenge

The online dating scene is complicated and unreliable: there is a discrepancy between what online daters say and what they do. Although this challenge is not relevant to me anymore – I am married – I wished that, as a bike enthusiast, I had had a platform where I could have spotted like-minded people who actually rode a bike (and didn't just pretend they did).

The goal of this project was to turn the Citi Bike data into an app where a rider could identify the best spots and times to meet other Citi Bike users and cyclists in general.



The Data

As of March 31, 2016, the total number of annual subscribers was 163,865, and Citi Bike riders took an average of 38,491 rides per day in 2016 (source: Wikipedia).

This is more than 14 million rides in 2016!

I used the Citi Bike data for the month of May 2016 (approximately 1 million observations). Citi Bike provides the following variables:

  • Trip duration (in seconds).
  • Timestamps for when the trip started and ended.
  • Station locations for where the trip started and ended (both the names and coordinates).
  • Rider’s gender and birth year – this is the only demographic data we have.
  • Rider’s plan (annual subscriber, 7-day pass user or 1-day pass user).
  • Bike ID.

Before moving ahead with building the app, I was interested in exploring the data and identifying patterns in relation to gender, age and day of the week. Answering the following questions helped identify which variables influence how riders use the Citi Bike system and form better features for the app:

  • Who are the primary users of Citi Bike?
  • What is the median age per Citi Bike station?
  • How do the days of the week impact biking behaviours?
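These questions translate directly into a few dplyr summaries. The sketch below is illustrative: the `trips` data frame and its values are hypothetical stand-ins, with column names modeled on the Citi Bike CSV layout, not the original dataset.

```r
library(dplyr)

# Hypothetical stand-in for the Citi Bike trip data.
trips <- data.frame(
  gender     = c(1, 1, 2, 1, 2, 1),   # Citi Bike codes: 1 = male, 2 = female
  birth.year = c(1988, 1990, 1985, 1979, 1992, 1983),
  start.hour = c(8, 9, 12, 17, 18, 8)
)

# Who are the primary users? Share of trips by gender.
trips %>% count(gender) %>% mutate(share = n / sum(n))

# How old are the riders? Median age at the time of the 2016 data.
trips %>% summarise(median_age = median(2016 - birth.year))

# When do they ride? Trips per hour of the day.
trips %>% count(start.hour, name = "rides")
```

The same three summaries, grouped by station instead of run over the whole table, drive the per-station maps further down.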

As I expected, based on my daily rides from Queens to Manhattan, 75% of the Citi Bike trips are taken by males. The primary users are 25 to 34 years old.

Riders per Age Group


Distribution of Riders per Hour of the Day (weekdays)

However, while we might expect these young professionals to be the primary users on weekdays around 8-9am and 5-6pm (when they commute to and from work), and an older audience to take over the Citi Bike system midday, this hypothesis proved wrong. Nor are tourists the explanation: short-term customers represent only 10% of the dataset.

Distribution of Riders per Hour of the Day (weekdays only)


Median Age per Departure Station

Looking at the median age of the riders for each departure station, we see the youngest riders in the East Village, while older riders start their commute from Lower Manhattan (as shown in the map below). The age trends disappear when mapping the arrival stations, especially in the Financial District (in Lower Manhattan), which is populated by the young wolves of Wall Street (map not shown).

The map also confirms that the Citi Bike riders are mostly between 30 and 45 years old.

Median Age per Departure Station



Rides by Hour of the Day

Finally, when analyzing how the days of the week impacted biking behaviours, I was surprised to see that Citi Bike users didn’t ride for a longer period of time during the weekend: the median trip duration is 19 minutes for each day of the week.

Trip Duration per Gender and Age Group


However, as illustrated below, there is a difference in peak hours; during the weekend, riders hop on a bike later during the day, with most of the rides happening midday while the peak hours during the weekdays are around 8-9am and 5-7pm when riders commute to and from work.


Number of Riders per Hour of the Day (weekdays vs. weekends)



The App

Where does this analysis leave us?

  • The day of the week and the hour of the day are meaningful variables which we need to take into account in the app.
  • Most of the users are between 30 and 45 years old. This means that the age groups 25-34 and 35-44 won't be granular enough when app users need to filter their search. We will let them filter by age instead.


The Citi Tinder app in a few words and screenshots.

There are 3 steps to the app:

  • The “when“: find the times and days where your ideal mate is more likely to ride.


  • The “where“: once you know the best times and days, filter out the location by day of the week, time of the day, gender and age. You can also select if you want to spot where they arrive or depart.


  • The “how”: the final step is to grab a Citi Bike and get to those hot spots. The app calls the Google Maps API to show the directions, with a little extra: you can compare the time estimated by Google to connect two stations with the average time it took Citi Bike users. I believe the latter is more accurate because it factors in the time of day and day of the week (which the app lets you filter).


Although screenshots are nice, the interactive app is better so head to the first step of the app to get started!



Would Have, Should Have, Could Have

This is the first of the four projects from the NYC Data Science Academy Data Science Bootcamp program. With a two-week timeline and only 24 hours in a day, some things gotta give… Below is a quick list of the analysis I could have, would have and should have done if given more time and data:

  • Limited scope : I only took the data from May 2016. However, I expect the Citi Bike riders to behave differently depending on the season, temperature, etc. Besides, the bigger the sample size the more reliable the insights are.
  • Missing data : There was no data on the docks available per station that could be scraped from the Citi Bike website. The map would have been more complete if the availability of docks had been displayed.
  • Limited number of variables : I would have liked to have more demographics data (aside from gender and age); a dating app with only the age and gender as filters is restrictive…
  • Incomplete filters : With more time, I’d have added a filter ‘speed’ in the 2nd step of the app (the ‘where’ part) to enable the hard core cyclists to filter the fastest ones…
  • Sub-optimal visualization : I am aware that the map in the introduction page (with the dots displaying the median age per station) is hard to read and with more time, I’d have used polygons instead to group by neighbourhoods.
  • Finally, I would have liked to track unique users. Although users don’t have a unique identifier in the Citi Bike dataset, I could have identified unique users by looking at their gender, age, zip and usual start/end stations.
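The last idea can be sketched as a composite key. This is only an assumption about how such a pseudo-identifier might look; the `trips` data frame and its column names below are illustrative, not the real dataset.

```r
library(dplyr)

# Illustrative trips; rows 1 and 2 share the same attributes and stations,
# so they plausibly belong to the same rider.
trips <- data.frame(
  gender           = c(1, 1, 2),
  birth.year       = c(1988, 1988, 1992),
  start.station.id = c(72, 72, 435),
  end.station.id   = c(173, 173, 521)
)

# Combine the stable attributes into one pseudo-identifier per "user".
trips <- trips %>%
  mutate(pseudo_id = paste(gender, birth.year,
                           start.station.id, end.station.id, sep = "-"))

n_distinct(trips$pseudo_id)  # the first two rows collapse into one rider
```

A key like this will inevitably merge genuinely distinct riders with identical attributes, so it is a rough proxy at best.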

The post Using NYC Citi Bike Data to Help Bike Enthusiasts Find their Mate appeared first on NYC Data Science Academy Blog.

To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.

qualtRics 1.0 now available from CRAN

Wed, 04/26/2017 - 23:02

(This article was first published on Jasper Ginn's blog, and kindly contributed to R-bloggers)

Qualtrics allows users to collect online data through surveys. My R package qualtRics contains convenience functions to pull survey results straight into R using the Qualtrics API instead of having to download survey results manually.

Currently, the package contains three functions:

  1. getSurveys() fetches a list of all surveys that you own or have access to from Qualtrics.
  2. getSurvey() downloads a survey from Qualtrics and loads it into R.
  3. readSurvey() allows you to read CSV files you download manually from Qualtrics.

Getting started with the package is straightforward. Note that your institution must support API access and that it must be enabled for your account. Whoever manages your Qualtrics account can help you with this. Refer to the Qualtrics documentation to find your API token.


Register your Qualtrics API key. You need to do this once every R session:

library(qualtRics)
registerApiKey(API.TOKEN = "<yourapitoken>")

Get a data frame of all surveys to which you have access:

surveys <- getSurveys(root_url="") # URL is for my own institution

Export a survey and load it into R:

mysurvey <- getSurvey(surveyID = surveys$id[6], root_url = "", verbose = TRUE)

You can add a from/to date to only retrieve responses between those dates:

surv <- getSurvey(survs$id[4], root_url = "", startDate = "2016-09-18", endDate = "2016-10-01", useLabels = FALSE, verbose = TRUE)

You may also reference a response ID. getSurvey will then download all responses that were submitted after that response:

surv <- getSurvey(survs$id[4], root_url = "", lastResponseId = "R_3mmovCIeMllvsER", useLabels = FALSE, verbose = TRUE)

You can store the results in a specific location if you like:

mysurvey <- getSurvey(surveyID = surveys$id[6], save_dir = "/users/jasper/desktop/", root_url = "", verbose = TRUE)

Note that surveys that are stored in this way will be saved as an RDS file rather than e.g. a CSV. Reading an RDS file is as straightforward as this:

mysurvey <- readRDS(file = "/users/jasper/desktop/mysurvey.rds")

You can read a survey that you downloaded manually using readSurvey:

mysurvey <- readSurvey("/users/jasper/desktop/mysurvey.csv")

To avoid special characters (mainly periods) in header names, readSurvey uses question labels as the header names. The question belonging to that label is then added using the sjmisc package. Qualtrics gives names to these labels automatically, but you can easily change them.

If you have any questions or feature requests, please leave them here.

To leave a comment for the author, please follow the link and comment on their blog: Jasper Ginn's blog.

Assorted Shiny apps collection, full code and data

Wed, 04/26/2017 - 19:51

(This article was first published on rbloggers – SNAP tech blog, and kindly contributed to R-bloggers)

Here is an assortment of R Shiny apps that you may find useful for exploration if you are in the process of learning Shiny and looking for something different. Some of these apps are very small and simple whereas others are large and complex. This repository provides full code and any necessary accompanying data sets. The repo also links to the apps hosted online at so that you can run apps in your browser without having to download the entire collection repo to run apps locally. That and other details can be found at the repo linked above. This isn’t a tutorial or other form of support, but it’s plenty of R code to peruse if that is what you are looking for.

A bit of backstory. If I recall correctly, I began exploring RStudio's Shiny package when I first heard of it in late 2012. Needless to say, a lot has changed since then: the alpha-release code-breaking changes I had to adjust to when making my first apps, the features and capabilities Shiny has grown to offer, and simply how I go about coding apps, which has evolved symbiotically with the package's continued development. None of the apps in this repository are quite that old, though a few are close. Even so, they have been maintained, updated and tweaked since then to keep with the times as necessary.

Most of the apps are newer. But one nice thing about this collection is that it shows a diversity of approaches to coding different features and behavior into apps depending on their purposes and how for me that has changed over time. For example, some apps are heavy on maps. Prior to the robust availability of Leaflet in Shiny, I would tend to have a Shiny app display maps using static (but reactive) plots made with Lattice or ggplot2. There are many ways to do the same thing, and the way that is best in one case is not always the best way.

Across these apps there are many other examples of different ways to implement the same general task, depending on how I want it presented to the user in a specific app. In other cases, some approaches have proven more powerful and outright superior, and have won out; it is equally useful to see examples where what was once considered "good enough" no longer is.

Lastly, if you do happen to stumble upon something that is actually broken, I am unaware of it, so please let me know.

To leave a comment for the author, please follow the link and comment on their blog: rbloggers – SNAP tech blog.

Shiny Application Layouts Exercises (Part-2)

Wed, 04/26/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


In the second part of our series we will build another small shiny app but use another UI.

More specifically we will present the example of a UI with a plot at the top and columns at the bottom that contain the inputs that drive the plot. For our case we are going to use the diamonds dataset to create a Diamonds Analyzer App.

This part can be useful for you in two ways.

First of all, you can see different ways to enhance the appearance and the utility of your shiny app.

Secondly you can make a revision on what you learnt in the “Building Shiny App” series as we will build basic shiny stuff in order to present it in the proper way.

Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let's begin!

Answers to the exercises are available here.

Shiny Installation

In order to create the app we have to install and load the package shiny.

Exercise 1

Install and load shiny.

Grid Layout

The sidebarLayout uses Shiny's grid layout functions. Rows are created by the fluidRow function and include columns defined by the column function. Column widths should add up to 12 within a fluidRow.
The first parameter to the column function is its width. You can also change the position of columns to decide the location of UI elements. You can put columns to the right by adding the offset parameter to the column function. Each unit of offset increases the left margin of a column by a whole column.
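As a minimal illustration of these grid rules (a generic sketch, separate from the Diamonds app we build below): three columns whose widths plus the offset add up to 12.

```r
library(shiny)

# Widths within a fluidRow add up to 12: 4 + (1 offset + 4) + 3 = 12.
ui <- fluidPage(
  fluidRow(
    column(4, "left column, width 4"),
    column(4, offset = 1, "middle column, width 4, pushed right by 1"),
    column(3, "right column, width 3")
  )
)
```

Wrapping this ui with an empty server function in shinyApp() renders the three cells side by side.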

Now let's begin to build our UI. First of all we will place the fluidPage with a title, as below:

ui <- fluidPage(
  title = "Diamonds",
  h4("Diamonds Analyzer")
)

server <- function(input, output) {}

Exercise 2

Create the initial UI of your app and name it “Diamonds Analyzer”.

You can use the fluidRow function with the column function inside it, like this:

ui <- fluidPage(
  title = "Diamonds",
  h4("Diamonds Analyzer"),
  fluidRow(
    column(4, ...),
    column(4, ...)
  )
)

Exercise 3

Create a fluidrow with two columns of width = 4 inside it. NOTE: Do not expect to see something yet.

Now it is time to fill these columns with some tools that will help us determine the variables that we are going to use for our plot.
In the first 4 columns we will put a selectInput as the code below.
h4("Variable X"),
selectInput('x', 'X', names(diamonds)))

Exercise 4

Put a selectInput in the first 4 columns of your UI. Name it “Variable X”. HINT:Use names to get the names of the dataset diamonds as inputs.

Now let's move to the next four columns. We are going to put another selectInput there and select the second of the dataset's names as default. We are also going to see what offset does by setting it to 1 and then deactivating it again, like the example below. You can use the code as it is or change the parameters given to understand the logic behind its use.

column(4, offset = 1,
  h4("Variable Y"),
  selectInput('y', 'Y', names(diamonds), names(diamonds)[[2]]))

Exercise 5

Create a selectInput from column 5 to column 8. Choose the second of the dataset’s name as default. Name it “Variable Y”. HINT: Use names to get the names of the dataset diamonds as inputs.

Exercise 6

Set the offset parameter to 1 from columns 5 to 8.

Now let's call our plot and put it on the top of our UI. Look at the example below.

plotOutput('plot'),
hr(),

Exercise 7

Place the plot on the top of your UI. HINT: Use plotOutput and hr. NOTE: You are not going to see the plot in your UI because you have not created the server side yet.

We are going to create a reactive expression in order to combine the selected variables into a new data frame. Look at the example:

selectedData <- reactive({
  diamonds[, c(input$x, input$y)]
})

Exercise 8

Create a reactive expression in order to combine the selected variables into a new data frame. HINT: Use reactive.

Now plot your new data frame like the example:

output$plot <- renderPlot({
  plot(selectedData())
})

Exercise 9

Plot your data frame. HINT: Use renderPlot.

As mentioned before, the width of our UI is equal to 12 columns. So what is going to happen if we add a new column of width = 4 next to the other two? You have to find out in order to understand better how it works.

Exercise 10

Create a new selectInput and try to put it next to “Variable Y”. Try to explain the result. NOTE: You do not have to connect it with your plot.

Related exercise sets:
  1. Building Shiny App exercises part 1
  2. Building Shiny App exercises part 4
  3. Building Shiny App exercises part 3
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

dv01 uses R to bring greater transparency to the consumer lending market

Wed, 04/26/2017 - 17:35

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The founder of the NYC-based startup dv01 watched the 2008 financial crisis and was inspired to bring greater transparency to institutional investors in the consumer lending market. Despite being an open-source shop, they switched their data services to Microsoft SQL Server to provide better performance (reducing latency for queries from tens of seconds to under two seconds). They also use R for modeling, as you can see in the video below:

Principal Data Scientist Wei Wei Lu also mentions Microsoft Machine Learning in the video. The MicrosoftML package, included in SQL Server R Services and Microsoft R Server 9, provides parallel implementations of many popular algorithms as R functions. The video below (presented by Gleb Krivosheev and Yunling Wang) provides a brief overview of some of the new capabilities in Microsoft ML, including the new pre-trained models for sentiment analysis and image featurization.

To try out the MicrosoftML package, download SQL Server 2016 or Microsoft R Server 9.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Binning Outliers in a Histogram

Wed, 04/26/2017 - 16:30

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

I guess we all use it, the good old histogram. One of the first things we are taught in Introduction to Statistics and routinely applied whenever coming across a new continuous variable. However, it easily gets messed up by outliers. Putting most of the data into a single bin or a few bins, and scattering the outliers barely visible over the x-axis. This distribution might look familiar

library(tidyverse)
set.seed(42)
hist_data <- data_frame(x = c(rexp(1000, .5), runif(50, 0, 500)))
ggplot(hist_data, aes(x)) +
  geom_histogram(binwidth = .1, col = "black", fill = "cornflowerblue")

Two strategies that make the above more interpretable are taking the logarithm of the variable or omitting the outliers. Neither shows the original distribution, however. Another way to go is to create one bin for all the outlier values. This way we would see the original distribution where the density is highest, while at the same time getting a feel for the number of outliers. A quick and dirty implementation of this would be

hist_data %>%
  mutate(x_new = ifelse(x > 10, 10, x)) %>%
  ggplot(aes(x_new)) +
  geom_histogram(binwidth = .1, col = "black", fill = "cornflowerblue")

Not bad. However, it now suggests incorrectly that many observations are exactly 10. I routinely make these plots for my own information, but they cannot be shared without explaining what happened to the outliers and what their original range was. Since a plot with a manual is not that great either, I recently did a hacking session into the ggplot object. The resulting gg_outlier_bin function not only indicates the range of the last bin, it also allows for a different fill color of the bin. Now we are clearly distinguishing the outlier aggregation

gg_outlier_bin(hist_data, "x", cut_off_floor = NA, cut_off_ceiling = 10, binwidth = 0.1)

It is still a bit experimental, but it seems to work in most situations. Below you find the function code for making histograms with outlier bins. You can also get it by installing the package accompanying this blog devtools::install_github("edwinth/thatssorandom"). By the way, it works on both floor and ceiling outliers. Like in the following

data_frame(x = c(runif(100, 0, 100), rnorm(1000, 50, 2))) %>%
  gg_outlier_bin("x", 45, 55, binwidth = .1)

gg_outlier_bin <- function(x, var_name, cut_off_floor, cut_off_ceiling,
                           col = "black", fill = "cornflowerblue",
                           fill_outlier_bins = "forestgreen", binwidth = NULL) {
  printing_min_max <- x %>%
    summarise_(sprintf("round(min(%s, na.rm = TRUE), 1)", var_name),
               sprintf("round(max(%s, na.rm = TRUE), 1)", var_name))

  ceiling_filter <- ifelse(!,
                           sprintf("%s < %f", var_name, cut_off_ceiling),
                           "1 == 1")
  floor_filter <- ifelse(!,
                         sprintf("%s > %f", var_name, cut_off_floor),
                         "1 == 1")

  x_regular <- x %>% filter_(ceiling_filter, floor_filter) %>% select_(var_name)

  x_to_roll_ceiling <- x %>%
    filter_(sprintf("%s >= %f", var_name, cut_off_ceiling)) %>%
    select_(var_name)
  if (! x_to_roll_ceiling[, 1] <- cut_off_ceiling

  x_to_roll_floor <- x %>%
    filter_(sprintf("%s <= %f", var_name, cut_off_floor)) %>%
    select_(var_name)
  if (! x_to_roll_floor[, 1] <- cut_off_floor

  plot_obj <- ggplot(x_regular, aes_string(var_name)) +
    geom_histogram(col = col, fill = fill, binwidth = binwidth)

  if (! {
    ticks_for_ceiling <- update_tickmarks_ceiling(plot_obj, cut_off_ceiling,
                                                  printing_min_max[1, 2])
    plot_obj <- plot_obj +
      geom_histogram(data = x_to_roll_ceiling, fill = fill_outlier_bins,
                     col = col, binwidth = binwidth) +
      scale_x_continuous(breaks = ticks_for_ceiling$tick_positions,
                         labels = ticks_for_ceiling$tick_labels)
  }

  if (! {
    ticks_for_floor <- update_tickmarks_floor(plot_obj, cut_off_floor,
                                              printing_min_max[1, 1])
    plot_obj <- plot_obj +
      geom_histogram(data = x_to_roll_floor, fill = fill_outlier_bins,
                     col = col, binwidth = binwidth) +
      scale_x_continuous(breaks = ticks_for_floor$tick_positions,
                         labels = ticks_for_floor$tick_labels)
  }
  return(plot_obj)
}

update_tickmarks_ceiling <- function(gg_obj, co, max_print) {
  ranges <- suppressMessages(ggplot_build(gg_obj)$layout$panel_ranges[[1]])
  label_to_add <- sprintf("(%s , %s)", round(co, 1), max_print)
  tick_positions <- ranges$x.major_source
  tick_labels <- ranges$x.labels
  if (overlap_ceiling(tick_positions, co)) {
    tick_positions <- tick_positions[-length(tick_positions)]
    tick_labels <- tick_labels[-length(tick_labels)]
  }
  return(list(tick_positions = c(tick_positions, co),
              tick_labels = c(tick_labels, label_to_add)))
}

overlap_ceiling <- function(positions, cut_off) {
  n <- length(positions)
  ticks_dif <- positions[n] - positions[n - 1]
  (cut_off - positions[n]) / ticks_dif < 0.25
}

update_tickmarks_floor <- function(gg_obj, co, min_print) {
  ranges <- suppressMessages(ggplot_build(gg_obj)$layout$panel_ranges[[1]])
  label_to_add <- sprintf("(%s , %s)", min_print, round(co, 1))
  tick_positions <- ranges$x.major_source
  tick_labels <- ranges$x.labels
  if (overlap_floor(tick_positions, co)) {
    tick_positions <- tick_positions[-1]
    tick_labels <- tick_labels[-1]
  }
  return(list(tick_positions = c(co, tick_positions),
              tick_labels = c(label_to_add, tick_labels)))
}

overlap_floor <- function(positions, cut_off) {
  ticks_dif <- positions[2] - positions[1]
  (positions[1] - cut_off) / ticks_dif < 0.25
}

To leave a comment for the author, please follow the link and comment on their blog: That’s so Random.