R-bloggers: R news and tutorials contributed by 750 R bloggers

My 3 video presentations on “Essential R”

Sat, 03/18/2017 - 06:39

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

In this post I include my 3 video presentations on the topic “Essential R”. Together these 3 presentations cover the entire landscape of R, including the following:

  • R Language – The essentials
  • Key R Packages (dplyr, lubridate, ggplot2, etc.)
  • How to create R Markdown and share reports
  • A look at Shiny apps
  • How to create a simple R package

Essential R – Part 1
This video covers basic R data types – character, numeric, vectors, matrices, lists and data frames. It also touches on how to subset these data types.

Essential R – Part 2
This video continues with how to subset data frames (the most important data type) and introduces some important packages. It also presents one of the most important jobs of a Data Scientist – that of cleaning and shaping the data – using an example of an unclean data frame. It touches on key dplyr operations like select, filter, arrange, summarise and mutate, and other packages like lubridate and quantmod are also included. This presentation also shows how to use base plot and ggplot2.

Essential R – Part 3
This final session covers R Markdown and touches on some of the key markdown elements. There is a brief overview of a simple Shiny app. Finally, this presentation also shows the key steps to create an R package.

These 3 R sessions cover most of the basic R topics that we tend to use in our day-to-day work with R. With this you should be able to hit the ground running!

Hope you enjoy these video presentations, and I also hope you have an even greater time with R!

Also see
1. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
2. Computer Vision: Ramblings on derivatives, histograms and contours
3. Designing a Social Web Portal
4. Revisiting Whats up, Watson – Using Watson’s Question and Answer with Bluemix – Part 2
5. Re-introducing cricketr! : An R package to analyze performances of cricketers

To see all my posts click – Index of posts


Contours of statistical penalty functions as GIF images

Sat, 03/18/2017 - 06:00

(This article was first published on Alexej's blog, and kindly contributed to R-bloggers)

Many statistical modeling problems reduce to a minimization problem of the general form:

\begin{equation} \mathrm{minimize}_{\boldsymbol{\beta}\in\mathbb{R}^m}\quad f(\mathbf{X}, \boldsymbol{\beta}) + \lambda g(\boldsymbol{\beta}), \end{equation}

or, equivalently in constrained form (for a suitable constant $t > 0$),

\begin{align} \mathrm{minimize}_{\boldsymbol{\beta}\in\mathbb{R}^m}\quad & f(\mathbf{X}, \boldsymbol{\beta}) \\ \text{subject to}\quad & g(\boldsymbol{\beta}) \leq t, \end{align}
where $f$ is some type of loss function, $\mathbf{X}$ denotes the data, and $g$ is a penalty, also referred to by other names such as a “regularization term” (problems (1) and (2-3) are often equivalent, by the way). Of course both $f$ and $g$ may depend on further parameters.

There are multiple reasons why it can be helpful to check out the contours of such penalty functions $g$:

  1. When $\boldsymbol{\beta}$ is two-dimensional, the solution of problem (2-3) can be found by simply taking a look at the contours of $f$ and $g$.
  2. That builds intuition for what happens in more than two dimensions, and in other more general cases.
  3. From a Bayesian point of view, problem (1) can often be interpreted as an MAP estimator, in which case the contours of $g$ are also contours of the prior distribution of $\boldsymbol{\beta}$.

Therefore, it is meaningful to visualize the set of points that $g$ maps onto the unit ball in $\mathbb{R}^2$, i.e., the set

B_{g} := \{ \mathbf{x}\in\mathbb{R}^2 : g(\mathbf{x}) \leq 1 \}.

Below you see GIF images of such sets $B_{g}$ for various penalty functions $g$ in 2D, capturing the effect of varying certain parameters in $g$. The covered penalty functions include the family of $p$-norms, the elastic net penalty, the fused penalty, and the sorted $\ell_1$ norm.

R code to reproduce the GIFs is provided.

p-norms in 2D

First we consider the $p$-norm,

g_{p}(\boldsymbol{\beta}) = \lVert\boldsymbol{\beta}\rVert_{p}^{p} = \lvert\beta_{1}\rvert^p + \lvert\beta_{2}\rvert^p,

with a varying parameter $p \in (0, \infty]$ (which actually isn’t a proper norm for $p < 1$). Many statistical methods, such as LASSO and Ridge Regression, employ $p$-norm penalties. To find all $\boldsymbol{\beta}$ on the boundary of the 2D unit $p$-norm ball, given $\beta_1$ (the first entry of $\boldsymbol{\beta}$), $\beta_2$ is easily obtained as

\beta_2 = \pm (1-|\beta_1|^p)^{1/p}, \quad \forall\beta_1\in[-1, 1].
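
As a quick illustration (my own sketch, not the author's linked scripts), this boundary can be traced and plotted with ggplot2 as follows:

library(ggplot2)

# Trace the boundary of the 2D unit p-norm ball for a single value of p,
# using beta2 = +/- (1 - |beta1|^p)^(1/p) for beta1 in [-1, 1].
pnorm_ball <- function(p, n = 400) {
  beta1 <- seq(-1, 1, length.out = n)
  upper <- (1 - abs(beta1)^p)^(1 / p)
  data.frame(beta1 = c(beta1, rev(beta1)),
             beta2 = c(upper, -rev(upper)),
             p     = p)
}

# Overlay a few values of p; in the GIF the frames instead step through p.
dat <- do.call(rbind, lapply(c(0.5, 1, 2, 4), pnorm_ball))
ggplot(dat, aes(beta1, beta2, group = p, colour = factor(p))) +
  geom_path() +
  coord_equal() +
  labs(colour = "p")

Saving one such frame per value of p and combining the frames (for example with magick, as described in the Code section below) is what produces the GIF.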

Elastic net penalty in 2D

The elastic net penalty can be written in the form

g_{\alpha}(\boldsymbol{\beta}) = \alpha \lVert \boldsymbol{\beta} \rVert_{1} + (1 - \alpha) \lVert \boldsymbol{\beta} \rVert_{2}^{2},

for $\alpha\in(0,1)$. It is quite popular with a variety of regression-based methods (such as the Elastic Net, of course). We obtain the corresponding 2D unit “ball” by calculating $\beta_{2}$ from a given $\beta_{1}\in[-1,1]$ as

\beta_{2} = \pm \frac{-\alpha + \sqrt{\alpha^2 - 4 (1 - \alpha) ((1 - \alpha) \beta_{1}^2 + \alpha \beta_{1} - 1)}}{2 - 2 \alpha}.

Fused penalty in 2D

The fused penalty can be written in the form

g_{\alpha}(\boldsymbol{\beta}) = \alpha \lVert \boldsymbol{\beta} \rVert_{1} + (1 - \alpha) \sum_{i = 2}^m \lvert \beta_{i} - \beta_{i-1} \rvert.

It encourages neighboring coefficients $\beta_{i}$ to have similar values, and is utilized by the fused LASSO and similar methods.

(Here I have simply evaluated the fused penalty function on a grid of points in $[-2,2]^2$, because figuring out equations in parametric form for the above polygons was too painful for my taste…)
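
A minimal sketch of that grid-based approach, as my own illustration with $\alpha$ fixed at 0.5:

library(ggplot2)

# Fused penalty in 2D: alpha * (|b1| + |b2|) + (1 - alpha) * |b2 - b1|
fused <- function(b1, b2, alpha) {
  alpha * (abs(b1) + abs(b2)) + (1 - alpha) * abs(b2 - b1)
}

# Evaluate on a grid over [-2, 2]^2 and draw the g(beta) = 1 contour.
grid <- expand.grid(b1 = seq(-2, 2, length.out = 301),
                    b2 = seq(-2, 2, length.out = 301))
grid$g <- fused(grid$b1, grid$b2, alpha = 0.5)

ggplot(grid, aes(b1, b2, z = g)) +
  geom_contour(breaks = 1) +
  coord_equal()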

Sorted L1 penalty in 2D

The Sorted $\ell_{1}$ penalty is used in a number of regression-based methods, such as SLOPE and OSCAR. It has the form

g_{\boldsymbol{\lambda}}(\boldsymbol{\beta}) = \sum_{i = 1}^m \lambda_{i} \lvert \beta \rvert_{(i)},

where $\lvert \beta \rvert_{(1)} \geq \lvert \beta \rvert_{(2)} \geq \ldots \geq \lvert \beta \rvert_{(m)}$ are the absolute values of the entries of $\boldsymbol{\beta}$ arranged in decreasing order. In 2D this reduces to

g_{\boldsymbol{\lambda}}(\boldsymbol{\beta}) = \lambda_{1} \max\{|\beta_{1}|, |\beta_{2}|\} + \lambda_{2} \min\{|\beta_{1}|, |\beta_{2}|\}.

Difference of p-norms

It holds that

\lVert \boldsymbol{\beta} \rVert_{1} \geq \lVert \boldsymbol{\beta} \rVert_{2},

or more generally, for all $p$-norms it holds that

(\forall p \leq q) : \lVert \boldsymbol{\beta} \rVert_{p} \geq \lVert \boldsymbol{\beta} \rVert_{q}.

Thus, it is meaningful to define a penalty function of the form

g_{\alpha}(\boldsymbol{\beta}) = \lVert \boldsymbol{\beta} \rVert_{1} - \alpha \lVert \boldsymbol{\beta} \rVert_{2},

for $\alpha\in[0,1]$, which results in the following.

We visualize the same idea for varying $p \geq 1$, fixing $\alpha = 0.6$, i.e., we define

g_{\alpha}(\boldsymbol{\beta}) = \lVert \boldsymbol{\beta} \rVert_{1} - 0.6 \lVert \boldsymbol{\beta} \rVert_{p},

and we obtain the following GIF.

Code

The R code uses the libraries dplyr for data manipulation, ggplot2 for generation of figures, and magick to combine the individual images into a GIF.
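
As a rough sketch of the magick step, assuming one PNG frame per parameter value has already been saved (the file names below are hypothetical):

library(magick)

# Read the previously saved frames and combine them into an animated GIF.
frames <- image_read(sprintf("frame_%02d.png", 1:20))
gif    <- image_animate(frames, fps = 5)
image_write(gif, "penalty_contours.gif")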

Here are the R scripts that can be used to reproduce the above GIFs:

  1. p-norms in 2D
  2. Elastic net penalty in 2D
  3. Fused penalty in 2D
  4. Sorted L1 penalty in 2D
  5. Difference of $p$-norms: $\ell_{1} - \ell_{2}$ in 2D
  6. Difference of $p$-norms: $\ell_{1} - \ell_{p}$ in 2D

Should I come across other interesting penalty functions that make sense in 2D, I will add further visualizations to the same GitHub repository.



This work is licensed under a Creative Commons Attribution 4.0 International License.


Data Science at StitchFix

Fri, 03/17/2017 - 20:56

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you want to see a great example of how data science can inform every stage of a business process, from product concept to operations, look no further than Stitch Fix's Algorithms Tour. Scroll down through this explainer to see how this personal styling service uses data and statistical inference to suggest clothes their customers will love, ship them from a nearby warehouse (already stocked thanks to demand modeling), and incorporate customer feedback and designer trends into the next round of suggestions.

The design of the Tour is interactive and elegant, and clearly someone put a lot of work into it. It's a great example of simplifying the detail of data science into a format where the impacts are immediately apparent. Communicating the benefits of data science to non-experts is a key skill of any data scientist, and dialing down the complexity, while not dumbing down the science, is key to that communication.

From the Tour, we can see that Stitch Fix's data science is far from simple. Their methods aren't just black-box machine learning techniques: rather, they use statistical inference to figure out why customers like particular products (or why operational strategies are efficient or not) rather than simply predicting what will happen. This is all driven by a sophisticated data science platform that enables the team to employ techniques like Generalized Additive Models implemented in R, or Natural Language Processing implemented in Python. Check out their MultiThreaded blog for more details on data science applications at Stitch Fix, and be sure to visit the Algorithms Tour linked below.

Stitch Fix MultiThreaded: Algorithms Tour

 


One way MANOVA exercises

Fri, 03/17/2017 - 19:30

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In ANOVA our interest lies in knowing whether one continuous dependent variable is affected by one or more categorical independent variables. MANOVA is an extension of ANOVA in which we can understand how several dependent variables are affected by the independent variables. For example, consider an investigation where a medical investigator has developed 3 back pain therapies. Patients are enrolled in a 10-week trial and at the end the investigator interviews them on the reduction of physiological, emotional and cognitive pain. Interest is in knowing which therapy is best at reducing pain.

Just like in ANOVA, we can have one-way or two-way MANOVA depending on the number of independent variables.

When conducting MANOVA it is important to understand the assumptions that need to be satisfied so that the results are valid. The assumptions are explained below.

  • The observations are independent. Observations that are collected over time, over space and in any groupings violate the assumption of independence.
  • The data follow a multivariate normal distribution. When there are many observations we can rely on the central limit theorem (CLT) to achieve approximate normality; it is generally accepted that with a sufficiently large number of observations per group the sampling distribution approaches normality. MANOVA is robust to non-normality that arises from skewness, but it is not robust to non-normality resulting from outliers. Outliers should be checked and appropriate action taken; the analysis can be run with and without the outliers to check sensitivity.
  • The variance in all the groups is homogeneous. A Bartlett test is useful for assessing the homogeneity of variance. MANOVA is not robust to deviations from this assumption, so when variances are heterogeneous a variance-stabilizing transformation may be required.

MANOVA can be used to understand the interactions and main effects of independent variables. The four test statistics that can be used are Wilks' lambda, Pillai's trace, the Hotelling-Lawley trace and Roy's largest root. Among the four, Pillai's trace is least affected by violations of the assumptions, but Wilks' lambda is the most commonly used.

In this first part of the MANOVA exercises we will use data from a study investigating a control and three therapies aimed at reducing symptoms of koro. Forty patients were selected for inclusion in the study and 10 patients were assigned to each of the four groups. Interest is in understanding which therapy is best at reducing symptoms. We will create three variables that hold the change in each index before and after treatment. Here we have one independent variable and three dependent variables, resulting in a one-way MANOVA.
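
As a preview, here is a minimal sketch (mine, not necessarily the exercise solution) of how such a one-way MANOVA could be fitted in base R; the data frame koro and the column names are hypothetical placeholders for the actual exercise data:

# therapy is a factor with four levels; the three change scores are
# computed as post - pre for each symptom index.
fit <- manova(cbind(change_phys, change_emot, change_cog) ~ therapy,
              data = koro)

# The four test statistics mentioned above:
summary(fit, test = "Wilks")
summary(fit, test = "Pillai")
summary(fit, test = "Hotelling-Lawley")
summary(fit, test = "Roy")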

Solutions to these exercises can be found here

Exercise 1

Import data into R

Exercise 2

Check the number of observations in each group

Exercise 3

Create the variables that hold the change in indices

Exercise 4

Summarize the change variables

Exercise 5

Get descriptive statistics for each therapy

Exercise 6

Obtain the correlation matrix

Exercise 7

Check for outliers

Exercise 8

Check for homogeneity of variance

Exercise 9

Run MANOVA test with outliers

Exercise 10

Interpret results

Related exercise sets:
  1. One Way Analysis of Variance Exercises
  2. Two Way ANOVA in R Exercises
  3. Paired t-test in R Exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory


Experimenting With Sankey Diagrams in R and Python

Fri, 03/17/2017 - 18:52

(This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to R-bloggers)

A couple of days ago, I spotted a post by Oli Hawkins on Visualising migration between the countries of the UK which linked to a Sankey diagram demo of Internal migration flows in the UK.

One of the things that interests me about the Jupyter- and RStudio-centred reproducible research ecosystems is their support for libraries that generate interactive HTML/JavaScript outputs (charts, maps, etc.) from a computational data analysis context such as R or Python/pandas, so it was only natural (?!) that I thought I should see how easy it would be to generate something similar from a code context.

In an R context, there are several libraries available that support the generation of Sankey diagrams, including googleVis (which wraps Google Chart tools), and a couple of packages that wrap d3.js – an original rCharts Sankey diagram demo by @timelyporfolio, and a more recent HTMLWidgets demo (sankeyD3).

Here’s an example of the evolution of my Sankey diagram in R using googleVis – the Rmd code is here and a version of the knitted HTML output is here.

The original data comprised a matrix relating population flows between English regions, Wales, Scotland and Northern Ireland. The simplest rendering of the data using the googleVis Sankey diagram generator produces an output that uses default colours to label the nodes.
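
To give an idea of the kind of call involved, here is a minimal googleVis sketch on a toy flow data frame (the values are made up and are not the actual migration figures):

library(googleVis)

# Toy flow data: one row per origin/destination pair (hypothetical values).
flows <- data.frame(
  From   = c("Scotland", "Scotland", "Wales", "Northern Ireland"),
  To     = c("England", "Wales", "England", "England"),
  Weight = c(46, 3, 53, 11)
)

sk <- gvisSankey(flows, from = "From", to = "To", weight = "Weight",
                 options = list(height = 300))
plot(sk)  # opens the interactive HTML chart in a browser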

Using the country code indicator at the start of each region/country identifier, we can generate a mapping from country to a country colour that can then be used to identify the country associated with each node.

One of the settings for the diagram allows the source (or target) node colour to determine the edge colour. We can also play with the values we use as node labels:

If we exclude edges relating to flow between regions of the same country, we get a diagram that is more reminiscent of Oli's original (country level) demo. Note also that the charts that are generated are interactive – in this case, we see a popup that describes the flow along one particular edge.

If we associate a country with each region, we can group the data and sum the flow values to produce country level flows. Charting this produces a chart similar to the original inspiration.

As well as providing the code for generating each of the above Sankey diagrams, the Rmd file linked above also includes demonstrations for generating basic Sankey diagrams for the original dataset using the rCharts and htmlwidgets R libraries.

In order to provide a point of comparison, I also generated a python/pandas workflow using Jupyter notebooks and the ipysankeywidget package. (In fact, I generated the full workflow through the different chart versions first in pandas – I find it an easier language to think in than R! – and then used that workflow as a crib for the R version…)

The original notebook is here and an example of the HTML version of it here. Note that I tried to save a rasterisation of the widgets but they don’t seem to have turned out that well…

The original (default) diagram looks like this:

and the final version, after a bit of data wrangling, looks like this:

Once again, all the code is provided in the notebook.

One of the nice things about all these packages is that they produce outputs that can be reused/embedded elsewhere, or that can be used as a first automatically produced draft of code that can be tweaked by hand. I'll have more to say about that in a future post…


DataChats: An Interview with Hank Roark

Fri, 03/17/2017 - 17:45

Hola! We just released episode 14 of our DataChats video series; we think you'll like this one.

In this episode, we interview Hank Roark. Hank is a Senior Data Scientist at Boeing and a long time user of the R language. Prior to his current role, he led the Customer Data Science team at H2O.ai, a leading provider of machine learning and predictive analytics services. You can take Hank’s course, Unsupervised Learning in R, here!

Here, Hank Roark answers Nick's questions about his path into data science, work automation, the importance for a data scientist of communicating results in both directions, and much more!

We hope that you enjoy watching this series and make sure not to miss any of our upcoming episodes by subscribing to DataCamp’s YouTube channel!

Another R [Non-]Standard Evaluation Idea

Fri, 03/17/2017 - 17:42

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Jonathan Carroll had an interesting R language idea: to use @-notation to request value substitution in a non-standard evaluation environment (inspired by MySQL user-defined variables).

He even picked the right image:

The idea is kind of the reverse of some Lisp ideas ("evaled unless ticked"), but it is an interesting possibility. We can play along with it a bit in R as follows. The disadvantages of our simulation include:

  • The user must both call wrapr::ateval and place their code in quotes.
  • The effect is still achieved through string substitution.

But here it is for what it is worth:

# devtools::install_github("WinVector/wrapr")
library("wrapr")
library("dplyr")

The original example function from the Tweet:

f <- function(col1, col2, new_col_name) {
  ateval('mtcars %>% mutate(@new_col_name = @col1 + @col2)')
}

And the requested effect actually realized:

d <- f('gear', 'carb', 'nonsense')
head(d)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb nonsense
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4        8
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4        8
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1        5
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1        4
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2        5
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1        4

The point is that a standard function (f()) can accept user-supplied column names and then itself use convenient non-standard evaluation notation (in this case dplyr) to perform operations. This package-based simulation is not as good as actual language support, but simulations like this are how we collect experience with possible new language features.

The real point is that a user wants a flexible language with macros and functions (R uses an intermediate form called "an Fexpr" for just about everything) that they can both use interactively and program over. This means they eventually want an execution environment where they can both pass in parametric values (what R calls standard evaluation) and have code elements treated directly as values (a convenience related to what R calls non-standard evaluation).

The classic Lisp solution organizes things a bit differently, and uses various "back-tick" notations to specify control of the interpretation of symbols. I think R has picked a different set of defaults as to how symbols, values, expressions, and execution interact, so any notation is going to be a bit different.

The development version of wrapr can be found here (ateval is not yet in the CRAN version, which supplies the let alternative). The example shown in this article can be found in markdown form here.


Quandl and Forecasting

Fri, 03/17/2017 - 17:04

(This article was first published on RStudio, and kindly contributed to R-bloggers)

A Reproducible Finance with R Post
by Jonathan Regenstein

Welcome to another installment of Reproducible Finance with R. Today we are going to shift focus in recognition of the fact that there’s more to Finance than stock prices, and there’s more to data download than quantmod/getSymbols. In this post, we will explore commodity prices using data from Quandl, a repository for both free and paid data sources. We will also get into the forecasting game a bit and think about how best to use dygraphs when visualizing predicted time series as an extension of historical data. We are not going to do anything too complex, but we will expand our toolkit by getting familiar with Quandl, commodity prices, the forecast() function, and some advanced dygraph work.

Before we dive in, a few thoughts to frame the notebook underlying this post.

  • We are using oil data from Quandl, but the original data is from FRED. There’s nothing wrong with grabbing the data directly from FRED, of course, and I browse FRED frequently to check out economic data, but I tend to download the data into my RStudio environment using Quandl. I wanted to introduce Quandl today because it’s a nice resource that will be involved in the next few posts in this series. Plus, it’s gaining in popularity, and if you work in the financial industry, you might start to encounter it in your work.

  • This post marks our first foray into the world of predictive modeling, albeit in a very simple way. But the complexity and accuracy of the forecasting methodology we use here is almost irrelevant since I expect that most R coders, whether in industry or otherwise, will have their own proprietary models. Rather, what I want to accomplish here is a framework where models can be inserted, visualized, and scrutinized in the future. I harp on reproducible workflows a lot, and that’s not going to change today because one goal of this Notebook is to house a forecast that can be reproduced in the future (at which point, we will know if the forecast was accurate or not), and then tweaked/criticized/updated/heralded.

  • This post walks through a detailed example of importing, forecasting, and visualizing oil prices. In the near future, I will repeat those steps for gold and copper, and we will examine the relationship between the copper/gold price ratio and interest rates. We are starting simple, but stay tuned.

Now, let’s get to the data download! In the chunk below, as we import WTI oil prices, notice that Quandl makes it easy to choose the type of object (raw/dataframe, xts, or zoo), the period (daily, weekly, or monthly) and the start/end dates.

Also note that I specified an end date of February 2017. This indicates that the Notebook houses a model that was built and run using data as of February 2017. Without the end date, the Notebook would house a model built and run using data as of time t. Which you choose depends on how you want the Notebook to function for your team.
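
The import chunk itself is not shown in this excerpt; it would look something like the following, where the Quandl code "FRED/DCOILWTICO" and the exact dates are my assumptions:

library(Quandl)

# WTI oil prices from Quandl (original source: FRED), frozen at end-Feb 2017.
oil_daily   <- Quandl("FRED/DCOILWTICO", type = "xts",
                      start_date = "2006-01-01", end_date = "2017-02-28")
oil_monthly <- Quandl("FRED/DCOILWTICO", type = "xts", collapse = "monthly",
                      start_date = "2006-01-01", end_date = "2017-02-28")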

Looking back at those oil price objects, each would work well for the rest of this project, but let’s stick with the monthly data. We will be dealing with the date index quite a bit below, so let’s use the seq() function and mdy() from the lubridate package to put the date into a nicer format.

Now we have a cleaner date format. Our base data object is in good shape. As always, we like to have a look at the data in graphical format, so let’s fire up dygraphs. Since we imported an xts object directly from Quandl, we can just plug it straight into the dygraph() function.

Alright, nothing too shocking here. We see a peak in mid-2008, followed by a precipitous decline through the beginning of 2009.

Now we’ll make things a bit more interesting and try to extract some meaning from that data. Let’s use the forecast() function to predict what oil prices will look like over the next six months. This is the part of the code where you might want to insert whatever model your team has built or wish to test. We can think of it as a placeholder for any proprietary model or models that could be dropped into this Notebook. For our purposes, we will simply pass in the monthly oil prices object and supply a lookahead parameter of 6. The forecast() function will then supply some interesting numbers about the next six months of oil prices.
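
A minimal placeholder along those lines, using forecast()'s default model choice rather than any particular proprietary model (the conversion to a ts object and the January 2006 start are my assumptions):

library(forecast)

# Convert the xts series to a monthly ts and forecast six months ahead.
oil_ts <- ts(as.numeric(oil_monthly), frequency = 12, start = c(2006, 1))
oil_forecast <- forecast(oil_ts, h = 6)
oil_forecast  # point forecasts plus 80% and 95% prediction intervals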

The mean forecast is right around $54. It looks like the 95% confidence level has a high of $78 in August and a low of $29 in March. We won’t dwell on these numbers because I imagine you will want to use your own model here – this Notebook is more of a skeleton where those models can be inserted and then tested or evaluated at a later date.

Let’s move on to visualizing the results of the forecast along with the historical data. The base plot() function does a decent job here.

That plot looks OK, but it’s not great. We can see that the mean forecast is to stay around $50, with the 95% bands stretching all the way to around $80 and $30, but honestly, I have to squint to really see those 95% intervals. We don’t like squinting, so let’s put in some extra work to make use of dygraphs, which will have the benefit of allowing a reader to zoom on the predicted portion of the graph.

This is where things require a bit more thought. We want one xts object to hold both the historical data and the forecasted prices.

We already have our monthly prices in the xts object we imported from Quandl, but the forecasted prices are currently in a list with a different date convention than we would like.

First, let’s move the mean forecast and 95% confidence bands to a dataframe, along with a date column. We predicted oil out six months, so we will need a date column for the six months after February.

The data we want is now housed in its own dataframe. Let’s convert that to an xts object.

Now we can combine the historical xts object with the forecasted xts object using cbind().

It looks as it should. From January 2006 to February 2017 we have our actual data and NAs for the forecasted data. From March 2017 to August 2017, we have our mean forecast and 95% confidence levels and NAs for actual data. Said another way, we have four time series with observations at different dates, some of which are in the future. Most fortunately, dygraph provides a nice way to plot our actual time series versus our three forecasted time series because it simply does not plot the NAs.
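
A sketch of that wrangling, assuming the oil_monthly and oil_forecast objects from the sketches above (all object and column names here are my own, not the author's):

library(xts)
library(dygraphs)
library(lubridate)

# Dates for the six forecasted months (March through August 2017).
future_dates <- seq(mdy("03/01/2017"), by = "months", length.out = 6)

# Mean forecast plus the 95% bounds as an xts object.
forecast_xts <- xts(cbind(mean  = as.numeric(oil_forecast$mean),
                          lower = oil_forecast$lower[, 2],
                          upper = oil_forecast$upper[, 2]),
                    order.by = future_dates)

# Bind with the historical series; non-overlapping dates become NA.
oil_combined <- cbind(oil_monthly, forecast_xts)
colnames(oil_combined) <- c("actual", "mean", "lower", "upper")

# dygraphs skips the NAs, so the actual and forecasted series line up.
dygraph(oil_combined) %>%
  dySeries("actual", label = "Actual") %>%
  dySeries(c("lower", "mean", "upper"), label = "Forecast")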

Take a quick look back at our previous graph using the plot() function. At first glance, the dygraph might not seem so different. But we can now make use of hovering/tooltips and, more importantly, we can zoom in on the forecasted numbers to see them much more clearly. Plus, the whole world of dygraph functionality is now available to us!

That’s all for today. We have gotten some familiarity with Quandl, used forecast() to predict the next six months of oil prices, and done some data wrangling so we can use our old friend dygraphs. Next time, we will wrap this into a Shiny app so that users can choose their own parameters, and maybe even choose different commodities. See you then!


Because it's Friday… The IKEA Billy index

Fri, 03/17/2017 - 10:47

(This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers)

Introduction

Because it is Friday, here is another ‘playful and frivolous’ data exercise.

IKEA is more than a store; it is a very nice experience to go through. I can drop off my two kids at Småland, have some ‘quality time’ walking around the store with my wife, and eat some delicious Swedish meatballs. Back at home, IKEA furniture is a good ‘relationship tester’: try building a big wardrobe together with your wife…..

The nice thing about IKEA is that you don’t have to come to the store for nothing: you can check the availability of an item on the IKEA website.

According to the website, this availability gets refreshed every 1.5 hours. This gave me an idea: if I check the availability every 1.5 hours, I can get a sense of the number of units sold of a particular item.

The IKEA Billy index

Probably the most iconic IKEA item is the Billy bookcase. Just in case you don’t know what this bookcase looks like, below is a picture; simplicity in its most elegant form….

Every 1.5 hours over the last few months, I have checked the Dutch IKEA website for the availability of this famous item in the 13 stores in the Netherlands, and calculated the negative differences between consecutive values.

The data that you get from this playful little exercise do not necessarily represent the number of Billy bookcases actually sold. Maybe the stock got replenished in between, maybe items were moved internally to other stores. For example, if there are 50 Billys available in Amsterdam and 1.5 hours later there are 45, maybe 5 were sold, or maybe 6 were sold and 1 was returned or replenished? I just don’t know!

All I see are movements in availability that might have been caused by products being sold. But anyway, let’s call these movements in the availability of Billys the IKEA Billy index.

Some graphs of the Billy index

Trends and forecasts

Facebook released a nice R package called prophet. It can be used to perform forecasting on time series, and it is used internally by Facebook across many applications. I ran the prophet forecasting algorithm on the IKEA Billy index; the graph below shows the result.
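
A rough sketch of what that step might look like (billy_index is a hypothetical data frame of the scraped availability movements; prophet expects columns named ds and y):

library(prophet)

# prophet expects a data frame with columns ds (date/time) and y (value).
history <- data.frame(ds = billy_index$timestamp, y = billy_index$movement)

m <- prophet(history)
future <- make_future_dataframe(m, periods = 60)  # forecast 60 periods ahead
fcst <- predict(m, future)

plot(m, fcst)                      # history plus forecast
prophet_plot_components(m, fcst)   # trend and weekly pattern (the Saturday peak)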

There are some high peaks at the end of October and at the end of December. We can also clearly see the Saturday peaks that the algorithm has picked up from the historic data and projected into its future forecasts.

Weekday and color

The graph above already showed that the Billy index is high on Saturdays; what about the other days? The graph below depicts the sum of the IKEA Billy index per weekday since I started collecting this data (end of September). Wednesdays and Thursdays are less active days.

 

Clearly, most of the Billys are white.

Correlations

Does the daily Billy index correlate with other data? I have used some Dutch weather data that can be downloaded from the Royal Netherlands Meteorological Institute (KNMI). The data consist of many daily weather variables. The graph below shows a correlation matrix of the IKEA Billy index and a selection of these weather variables.

 

The only meaningful correlation between the IKEA Billy index and a weather variable is with wind speed (-0.19): increasing wind speed means decreasing Billys.

 

It’s an explainable correlation of course…. You wouldn’t want to go to IKEA on (very) windy days, it is not easy to drive through strong winds with your Billy on top of your car.

 

Cheers, Longhow.


How many digits into pi do you need to go to find your birthday?

Fri, 03/17/2017 - 07:13

(This article was first published on R – Decision Science News, and kindly contributed to R-bloggers)

FIND YOUR BIRTHDAY IN PI, IN THREE DIFFERENT FORMATS

It was Pi Day (March 14, like 3/14, like 3.14, get it?) recently and Time Magazine did a fun interactive app in which you can find your birthday inside the digits of pi.

However:
1) They only used one format of birthday, in which July 4th would be 704. But there are other ways to write July 4. Like 0704 or 74. And those must be investigated, right?
2) Our friend Dave Pennock said he wanted to see the locations of every possible birthday inside pi.

So here they are. Enjoy! And R code below if you are interested. Plot and other stuff use Hadley Wickham’s Tidyverse tools.

And what the heck, have a spreadsheet of fun (i.e., spreadsheet with all the birthdays and locations in it).

Birthday location Shorter form location Shortest form loc 01-01 2563 1-01 853 1-1 95 01-02 6599 1-02 164 1-2 149 01-03 25991 1-03 3487 1-3 111 01-04 1219 1-04 270 1-4 2 01-05 3982 1-05 50 1-5 4 01-06 1011 1-06 1012 1-6 41 01-07 23236 1-07 1488 1-7 96 01-08 16216 1-08 2535 1-8 425 01-09 4555 1-09 207 1-9 38 01-10 3253 1-10 175 1-10 175 01-11 19627 1-11 154 1-11 154 01-12 4449 1-12 710 1-12 710 01-13 362 1-13 363 1-13 363 01-14 4678 1-14 2725 1-14 2725 01-15 27576 1-15 922 1-15 922 01-16 5776 1-16 396 1-16 396 01-17 6290 1-17 95 1-17 95 01-18 7721 1-18 446 1-18 446 01-19 494 1-19 495 1-19 495 01-20 43172 1-20 244 1-20 244 01-21 6305 1-21 711 1-21 711 01-22 660 1-22 484 1-22 484 01-23 27847 1-23 1925 1-23 1925 01-24 8619 1-24 1081 1-24 1081 01-25 15458 1-25 1351 1-25 1351 01-26 4372 1-26 2014 1-26 2014 01-27 9693 1-27 298 1-27 298 01-28 1199 1-28 149 1-28 149 01-29 2341 1-29 500 1-29 500 01-30 5748 1-30 745 1-30 745 01-31 5742 1-31 1097 1-31 1097 02-01 9803 2-01 245 2-1 94 02-02 7287 2-02 1514 2-2 136 02-03 3831 2-03 1051 2-3 17 02-04 28185 2-04 375 2-4 293 02-05 12631 2-05 1327 2-5 90 02-06 9809 2-06 885 2-6 7 02-07 14839 2-07 2374 2-7 29 02-08 6141 2-08 77 2-8 34 02-09 24361 2-09 54 2-9 187 02-10 9918 2-10 1318 2-10 1318 02-11 4353 2-11 94 2-11 94 02-12 30474 2-12 712 2-12 712 02-13 524 2-13 281 2-13 281 02-14 10340 2-14 103 2-14 103 02-15 24666 2-15 3099 2-15 3099 02-16 4384 2-16 992 2-16 992 02-17 546 2-17 547 2-17 547 02-18 2866 2-18 424 2-18 424 02-19 716 2-19 717 2-19 717 02-20 15715 2-20 1911 2-20 1911 02-21 17954 2-21 1737 2-21 1737 02-22 1889 2-22 1736 2-22 1736 02-23 25113 2-23 136 2-23 136 02-24 12782 2-24 536 2-24 536 02-25 3019 2-25 1071 2-25 1071 02-26 6742 2-26 965 2-26 965 02-27 18581 2-27 485 2-27 485 02-28 4768 2-28 2528 2-28 2528 02-29 6707 2-29 186 2-29 186 03-01 1045 3-01 493 3-1 1 03-02 1439 3-02 818 3-2 16 03-03 15339 3-03 195 3-3 25 03-04 4460 3-04 4461 3-4 87 03-05 29985 3-05 366 3-5 10 03-06 3102 3-06 116 3-6 286 03-07 3813 3-07 65 3-7 47 03-08 35207 3-08 520 3-8 18 03-09 5584 3-09 421 3-9 44 03-10 7692 3-10 442 3-10 442 03-11 11254 3-11 846 3-11 846 03-12 11647 3-12 2632 3-12 2632 03-13 858 3-13 859 3-13 859 03-14 3496 3-14 1 3-14 1 03-15 8285 3-15 314 3-15 314 03-16 13008 3-16 238 3-16 238 03-17 4778 3-17 138 3-17 138 03-18 7291 3-18 797 3-18 797 03-19 4530 3-19 1166 3-19 1166 03-20 3767 3-20 600 3-20 600 03-21 18810 3-21 960 3-21 960 03-22 12744 3-22 3434 3-22 3434 03-23 8948 3-23 16 3-23 16 03-24 16475 3-24 2495 3-24 2495 03-25 12524 3-25 3147 3-25 3147 03-26 7634 3-26 275 3-26 275 03-27 9208 3-27 28 3-27 28 03-28 24341 3-28 112 3-28 112 03-29 16096 3-29 3333 3-29 3333 03-30 2568 3-30 365 3-30 365 03-31 13852 3-31 1128 3-31 1128 04-01 19625 4-01 1198 4-1 3 04-02 17679 4-02 1368 4-2 93 04-03 4776 4-03 724 4-3 24 04-04 1272 4-04 1273 4-4 60 04-05 10748 4-05 596 4-5 61 04-06 39577 4-06 71 4-6 20 04-07 4258 4-07 2008 4-7 120 04-08 5602 4-08 146 4-8 88 04-09 33757 4-09 340 4-9 58 04-10 11495 4-10 163 4-10 163 04-11 9655 4-11 2497 4-11 2497 04-12 20103 4-12 297 4-12 297 04-13 6460 4-13 1076 4-13 1076 04-14 3756 4-14 385 4-14 385 04-15 7735 4-15 3 4-15 3 04-16 10206 4-16 1709 4-16 1709 04-17 15509 4-17 1419 4-17 1419 04-18 7847 4-18 728 4-18 728 04-19 4790 4-19 37 4-19 37 04-20 3286 4-20 702 4-20 702 04-21 16028 4-21 93 4-21 93 04-22 4796 4-22 1840 4-22 1840 04-23 8220 4-23 2995 4-23 2995 04-24 1878 4-24 1110 4-24 1110 04-25 5716 4-25 822 4-25 822 04-26 1392 4-26 1393 4-26 1393 04-27 36946 4-27 630 4-27 630 04-28 910 4-28 203 4-28 203 04-29 4514 4-29 1230 4-29 1230 
04-30 4619 4-30 519 4-30 519 05-01 21816 5-01 1344 5-1 49 05-02 25156 5-02 32 5-2 173 05-03 19743 5-03 838 5-3 9 05-04 9475 5-04 1921 5-4 192 05-05 36575 5-05 132 5-5 131 05-06 2094 5-06 1116 5-6 211 05-07 683 5-07 684 5-7 405 05-08 14658 5-08 1122 5-8 11 05-09 17171 5-09 1147 5-9 5 05-10 10274 5-10 49 5-10 49 05-11 444 5-11 395 5-11 395 05-12 2294 5-12 1843 5-12 1843 05-13 597 5-13 110 5-13 110 05-14 42236 5-14 2118 5-14 2118 05-15 10458 5-15 1100 5-15 1100 05-16 5367 5-16 3516 5-16 3516 05-17 24365 5-17 2109 5-17 2109 05-18 751 5-18 471 5-18 471 05-19 17096 5-19 390 5-19 390 05-20 20247 5-20 326 5-20 326 05-21 19553 5-21 173 5-21 173 05-22 2041 5-22 535 5-22 535 05-23 4317 5-23 579 5-23 579 05-24 15635 5-24 1615 5-24 1615 05-25 3233 5-25 1006 5-25 1006 05-26 3984 5-26 613 5-26 613 05-27 7300 5-27 241 5-27 241 05-28 18917 5-28 867 5-28 867 05-29 6373 5-29 1059 5-29 1059 05-30 367 5-30 368 5-30 368 05-31 26607 5-31 2871 5-31 2871 06-01 2400 6-01 1218 6-1 220 06-02 14854 6-02 291 6-2 21 06-03 9019 6-03 264 6-3 313 06-04 1172 6-04 1173 6-4 23 06-05 5681 6-05 750 6-5 8 06-06 13163 6-06 311 6-6 118 06-07 10861 6-07 287 6-7 99 06-08 13912 6-08 618 6-8 606 06-09 3376 6-09 128 6-9 42 06-10 13704 6-10 269 6-10 269 06-11 7448 6-11 427 6-11 427 06-12 11585 6-12 220 6-12 220 06-13 7490 6-13 971 6-13 971 06-14 2887 6-14 1612 6-14 1612 06-15 25066 6-15 1030 6-15 1030 06-16 6017 6-16 1206 6-16 1206 06-17 886 6-17 887 6-17 887 06-18 14424 6-18 1444 6-18 1444 06-19 6351 6-19 843 6-19 843 06-20 14967 6-20 76 6-20 76 06-21 2819 6-21 2820 6-21 2820 06-22 30206 6-22 185 6-22 185 06-23 51529 6-23 456 6-23 456 06-24 5491 6-24 510 6-24 510 06-25 5971 6-25 1569 6-25 1569 06-26 16707 6-26 21 6-26 21 06-27 5424 6-27 462 6-27 462 06-28 72 6-28 73 6-28 73 06-29 11928 6-29 571 6-29 571 06-30 10902 6-30 1900 6-30 1900 07-01 10640 7-01 167 7-1 40 07-02 544 7-02 545 7-2 140 07-03 10108 7-03 408 7-3 300 07-04 9882 7-04 2670 7-4 57 07-05 9233 7-05 561 7-5 48 07-06 5134 7-06 97 7-6 570 07-07 3815 7-07 755 7-7 560 07-08 8111 7-08 3061 7-8 67 07-09 3641 7-09 121 7-9 14 07-10 3169 7-10 681 7-10 681 07-11 4250 7-11 1185 7-11 1185 07-12 7695 7-12 243 7-12 243 07-13 36438 7-13 627 7-13 627 07-14 9546 7-14 610 7-14 610 07-15 5668 7-15 344 7-15 344 07-16 10242 7-16 40 7-16 40 07-17 15743 7-17 568 7-17 568 07-18 2988 7-18 2776 7-18 2776 07-19 8664 7-19 541 7-19 541 07-20 44221 7-20 1009 7-20 1009 07-21 756 7-21 650 7-21 650 07-22 2375 7-22 2140 7-22 2140 07-23 3300 7-23 1632 7-23 1632 07-24 8811 7-24 302 7-24 302 07-25 16244 7-25 140 7-25 140 07-26 288 7-26 289 7-26 289 07-27 12328 7-27 406 7-27 406 07-28 4879 7-28 1584 7-28 1584 07-29 4078 7-29 771 7-29 771 07-30 6892 7-30 898 7-30 898 07-31 2808 7-31 785 7-31 785 08-01 9508 8-01 3418 8-1 68 08-02 18639 8-02 1993 8-2 53 08-03 8283 8-03 85 8-3 27 08-04 1284 8-04 775 8-4 19 08-05 2292 8-05 957 8-5 172 08-06 9289 8-06 968 8-6 75 08-07 15787 8-07 451 8-7 306 08-08 4545 8-08 106 8-8 35 08-09 3009 8-09 1003 8-9 12 08-10 3207 8-10 206 8-10 206 08-11 10884 8-11 153 8-11 153 08-12 147 8-12 148 8-12 148 08-13 10981 8-13 734 8-13 734 08-14 4274 8-14 882 8-14 882 08-15 25257 8-15 324 8-15 324 08-16 1601 8-16 68 8-16 68 08-17 4857 8-17 319 8-17 319 08-18 30399 8-18 490 8-18 490 08-19 12786 8-19 198 8-19 198 08-20 4787 8-20 53 8-20 53 08-21 12923 8-21 102 8-21 102 08-22 5149 8-22 135 8-22 135 08-23 11957 8-23 114 8-23 114 08-24 10358 8-24 1196 8-24 1196 08-25 828 8-25 89 8-25 89 08-26 4294 8-26 2059 8-26 2059 08-27 619 8-27 620 8-27 620 08-28 24614 8-28 2381 8-28 2381 08-29 1123 8-29 335 8-29 
335 08-30 816 8-30 492 8-30 492 08-31 25355 8-31 237 8-31 237 09-01 658 9-01 659 9-1 250 09-02 2074 9-02 715 9-2 6 09-03 6335 9-03 357 9-3 15 09-04 8107 9-04 909 9-4 59 09-05 23683 9-05 3191 9-5 31 09-06 8297 9-06 1294 9-6 181 09-07 2986 9-07 543 9-7 13 09-08 7828 9-08 815 9-8 81 09-09 12077 9-09 248 9-9 45 09-10 17009 9-10 2874 9-10 2874 09-11 6125 9-11 1534 9-11 1534 09-12 29379 9-12 483 9-12 483 09-13 17515 9-13 1096 9-13 1096 09-14 249 9-14 250 9-14 250 09-15 3413 9-15 1315 9-15 1315 09-16 5930 9-16 3370 9-16 3370 09-17 341 9-17 342 9-17 342 09-18 4731 9-18 2220 9-18 2220 09-19 2127 9-19 417 9-19 417 09-20 328 9-20 329 9-20 329 09-21 422 9-21 423 9-21 423 09-22 7963 9-22 687 9-22 687 09-23 23216 9-23 63 9-23 63 09-24 5122 9-24 1430 9-24 1430 09-25 4008 9-25 337 9-25 337 09-26 5380 9-26 6 9-26 6 09-27 1177 9-27 977 9-27 977 09-28 5768 9-28 2192 9-28 2192 09-29 7785 9-29 1854 9-29 1854 09-30 1494 9-30 194 9-30 194 10-01 15762 10-01 15762 10-1 853 10-02 6740 10-02 6740 10-2 164 10-03 12292 10-03 12292 10-3 3487 10-04 3849 10-04 3849 10-4 270 10-05 2881 10-05 2881 10-5 50 10-06 3481 10-06 3481 10-6 1012 10-07 4076 10-07 4076 10-7 1488 10-08 8281 10-08 8281 10-8 2535 10-09 1817 10-09 1817 10-9 207 10-10 853 10-10 853 10-10 853 10-11 3846 10-11 3846 10-11 3846 10-12 8618 10-12 8618 10-12 8618 10-13 8271 10-13 8271 10-13 8271 10-14 2781 10-14 2781 10-14 2781 10-15 2564 10-15 2564 10-15 2564 10-16 9987 10-16 9987 10-16 9987 10-17 8041 10-17 8041 10-17 8041 10-18 1224 10-18 1224 10-18 1224 10-19 15483 10-19 15483 10-19 15483 10-20 9808 10-20 9808 10-20 9808 10-21 2750 10-21 2750 10-21 2750 10-22 6400 10-22 6400 10-22 6400 10-23 6771 10-23 6771 10-23 6771 10-24 12736 10-24 12736 10-24 12736 10-25 12926 10-25 12926 10-25 12926 10-26 14679 10-26 14679 10-26 14679 10-27 164 10-27 164 10-27 164 10-28 3242 10-28 3242 10-28 3242 10-29 8197 10-29 8197 10-29 8197 10-30 20819 10-30 20819 10-30 20819 10-31 3495 10-31 3495 10-31 3495 11-01 2780 11-01 2780 11-1 154 11-02 12721 11-02 12721 11-2 710 11-03 3494 11-03 3494 11-3 363 11-04 23842 11-04 23842 11-4 2725 11-05 175 11-05 175 11-5 922 11-06 13461 11-06 13461 11-6 396 11-07 21819 11-07 21819 11-7 95 11-08 7450 11-08 7450 11-8 446 11-09 3254 11-09 3254 11-9 495 11-10 22898 11-10 22898 11-10 22898 11-11 12701 11-11 12701 11-11 12701 11-12 12702 11-12 12702 11-12 12702 11-13 3504 11-13 3504 11-13 3504 11-14 23209 11-14 23209 11-14 23209 11-15 27054 11-15 27054 11-15 27054 11-16 3993 11-16 3993 11-16 3993 11-17 154 11-17 154 11-17 154 11-18 14376 11-18 14376 11-18 14376 11-19 984 11-19 984 11-19 984 11-20 3823 11-20 3823 11-20 3823 11-21 710 11-21 710 11-21 710 11-22 12018 11-22 12018 11-22 12018 11-23 6548 11-23 6548 11-23 6548 11-24 25705 11-24 25705 11-24 25705 11-25 1350 11-25 1350 11-25 1350 11-26 12703 11-26 12703 11-26 12703 11-27 4252 11-27 4252 11-27 4252 11-28 14719 11-28 14719 11-28 14719 11-29 4450 11-29 4450 11-29 4450 11-30 9107 11-30 9107 11-30 9107 12-01 244 12-01 244 12-1 711 12-02 19942 12-02 19942 12-2 484 12-03 60873 12-03 60873 12-3 1925 12-04 11885 12-04 11885 12-4 1081 12-05 3329 12-05 3329 12-5 1351 12-06 3259 12-06 3259 12-6 2014 12-07 7511 12-07 7511 12-7 298 12-08 3714 12-08 3714 12-8 149 12-09 4729 12-09 4729 12-9 500 12-10 3456 12-10 3456 12-10 3456 12-11 50289 12-11 50289 12-11 50289 12-12 711 12-12 711 12-12 711 12-13 47502 12-13 47502 12-13 47502 12-14 4560 12-14 4560 12-14 4560 12-15 11942 12-15 11942 12-15 11942 12-16 6986 12-16 6986 12-16 6986 12-17 11078 12-17 11078 12-17 11078 12-18 9167 12-18 9167 12-18 9167 12-19 1426 
12-19 1426 12-19 1426 12-20 36498 12-20 36498 12-20 36498 12-21 8732 12-21 8732 12-21 8732 12-22 17882 12-22 17882 12-22 17882 12-23 9550 12-23 9550 12-23 9550 12-24 661 12-24 661 12-24 661 12-25 9417 12-25 9417 12-25 9417 12-26 964 12-26 964 12-26 964 12-27 484 12-27 484 12-27 484 12-28 5183 12-28 5183 12-28 5183 12-29 30418 12-29 30418 12-29 30418 12-30 7146 12-30 7146 12-30 7146 12-31 9451 12-31 9451 12-31 9451

R CODE

library(ggplot2)
library(readr)
library(stringr)
library(tidyr)

# Read the digits of pi as one long string.
pistring = read_file("pi.txt")

# Choose 2016 b/c it's a leap year, so every possible birthday appears.
dayvec = seq(as.Date('2016-01-01'), as.Date('2016-12-31'), by = 1)

# Keep only the month-day part, e.g. "07-04".
dayvec = str_sub(dayvec, start = 6, end = 10)

# Shorter form drops the leading zero of the month ("7-04");
# shortest form also drops the leading zero of the day ("7-4").
shorterdayvec = sub("0(.-.)", "\\1", dayvec)
shortestdayvec = sub("(.-)0(.)$", "\\1\\2", shorterdayvec)

# Locate each birthday format (with the dash removed) inside the digits of pi.
df = data.frame(
  birthday = dayvec,
  loc = str_locate(pistring, gsub("-", "", dayvec))[, 1],
  shorter_bday = shorterdayvec,
  shorter_loc = str_locate(pistring, gsub("-", "", shorterdayvec))[, 1],
  shortest_bday = shortestdayvec,
  shortest_loc = str_locate(pistring, gsub("-", "", shortestdayvec))[, 1]
)

write.csv(df, file = "birthdays_in_pi.csv", row.names = FALSE)

# Reshape to long format and plot the density of locations for each format.
long_df = df %>% gather(key, location, loc, shorter_loc, shortest_loc)
p = ggplot(long_df, aes(x = location, fill = key))
p = p + geom_density(alpha = .3)
p = p + coord_cartesian(xlim = c(0, 40000))
p = p + theme(legend.position = "bottom")
p
ggsave(
  plot = p,
  file = "birthdays_in_pi.pdf",
  height = 4,
  width = 4
)

The post How many digits into pi do you need to go to find your birthday? appeared first on Decision Science News.


what does more efficient Monte Carlo mean?

Fri, 03/17/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

“I was just thinking that there might be a magic trick to simulate directly from this distribution without having to go for less efficient methods.”

In a simple question on X validated a few days ago [about simulating from x²φ(x)], the remark popped up that the person asking the question wanted a direct simulation method for higher efficiency, compared with an accept-reject solution. This shows a misunderstanding of what “efficiency” means in Monte Carlo situations. If it means anything, I would think it is reflected in the average time taken to return one simulation, and possibly in the worst case. But there is no reason to call an inverse cdf method more efficient than an accept-reject or a transform approach, since it all depends on the time it takes to make the inversion compared with the other solutions… Since inverting the closed-form cdf in this example is much more expensive than generating a Gamma(½,½) and taking plus or minus its root, this is certainly the case here. Maybe a ziggurat method could be devised, especially since x²φ(x)<φ(x) when |x|≤1, but I am not sure it is worth the effort!



Book Review: Testing R Code

Thu, 03/16/2017 - 22:00

When it comes to getting things right in data science, most of the focus goes to the data and the statistical methodology used. But when a misplaced parenthesis can throw off your results entirely, ensuring correctness in your programming is just as important. A new book published by CRC Press, Testing R Code by Richard (Richie) Cotton, provides all the guidance you’ll need to write robust, correct code in the R language.

This is not a book exclusively for package developers: data science, after all, is a combination of programming and statistics, and this book provides lots of useful advice to help ensure the programming side of things is correct. The book suggests you begin by using the assertive package to provide alerts when your code runs off the rails. From there, you can move on to creating formal unit tests as you code, using the testthat package. There's also a wealth of useful advice for organizing and writing code that's more maintainable in the long run.
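
To give a flavour of that combination, here is a small hypothetical example, not taken from the book, of run-time assertions plus a testthat unit test:

library(assertive)
library(testthat)

# Run-time checks inside a function: fail fast with an informative error
# if the input is not what we expect.
geometric_mean <- function(x) {
  assert_is_numeric(x)
  assert_all_are_positive(x)
  exp(mean(log(x)))
}

# A formal unit test for the same function.
test_that("geometric_mean handles a simple case", {
  expect_equal(geometric_mean(c(1, 4, 16)), 4)
  expect_error(geometric_mean(c(-1, 2)), "positive")
})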

If you are a package developer, there's lots here for you too. There are detailed instructions on incorporating unit tests into your package, and on how tests interact with version control and continuous integration systems. It also touches on advanced topics relevant to package authors, like debugging C or C++ code embedded via the Rcpp package. And there's more good advice, like how to write better error messages so your users can recover more easily when an error does occur.

Testing is a topic that doesn't get as much attention as it deserves in data science disciplines. One reason may be that it's a fairly dry topic, but Cotton does a good job of making the material engaging, with practical examples and regular exercises (with answers in the appendix). Frequent (and often amusing) footnotes help make this an entertaining read (given the material) and will hopefully motivate you to make testing a standard (and early) part of your R programming process.

Testing R Code is available from your favorite bookseller now, in hardback and electronic formats.

 

Mapping Housing Data with R

Thu, 03/16/2017 - 14:28

What is my home worth?  Many homeowners in America ask themselves this question, and many have an answer.  What does the market think, though?  The best way to estimate a property’s value is by looking at other, similar properties that have sold recently in the same area – the comparable sales approach.  In an effort to allow homeowners to do some exploring (and because I needed a new project), I developed a small Shiny app with R.

My day job allows me access to the local multiple listing service, which provides a wealth of historic data.  The following project makes use of that data to map real estate that has sold near Raleigh, NC in the past six months.  Without getting too lost in the weeds I’ll go over a few key parts of the process.  Feel free to jump over to my GitHub page to check out the full source code.  Click here to view the app!

  1. Geocode everything.  The data did not come with latitude and longitude coordinates, so we’ll have to do some geocoding.  I haven’t found an efficient way to do this in R, so, like in the mailing list example, I’ll use QGIS to process my data and return a .csv for each town I’m interested in.
  2. Setup your data.  To make sure that everything runs smoothly later on, we’ve got to import our data using readr and make sure each attribute is typed properly.

     library(readr)
     apex <- read_csv("apex2.csv")
     #Remove non-character elements from these columns.
     df$`Sold Price` <- as.numeric(gsub("[^0-9]", "", df$`Sold Price`))
     df$`List Price` <- as.numeric(gsub("[^0-9]", "", df$`List Price`))
     #Some re-typing for later.
     df$Fireplace <- as.numeric(df$Fireplace)
     df$`New Constr` <- as.factor(df$`New Constr`)
     #Assign some placeholders.
     assign("latitude", NA, envir = .GlobalEnv)
     assign("longitude", NA, envir = .GlobalEnv)
  3. Get info from the user.  The first thing the app wants you to do is give it some characteristics about the subject property, a property that you are interested in valuating.  A function further down uses this information to produce a map using these inputs.

     #What city's dataset are we using?
     selectInput("city", label = "City", c("Apex", "Cary", "Raleigh"))
     #Get some info.
     textInput("address", label = "Insert Subject Property Address", value = "2219 Walden Creek Drive"),
     numericInput("dist", label = "Miles from Subject", value = 5, max = 20),
     numericInput("footage", label = "Square Footage", value = 2000),
     selectInput("acres", label = "How Many Acres?", acresf)

     #Changes datasets based on what city you choose on the frontend.
     #This expression is followed by two more else if statements.
     observeEvent(input$city, {
       if(input$city == "Apex") {
         framework_retype(apex)
         cityschools <- schoolsdf$features.properties %>% filter(ADDRCITY_1 == "Apex")
         assign("cityschools", cityschools, envir = .GlobalEnv)

     #Draw the map on click.
     observeEvent(input$submit, {
       output$map01 <- renderLeaflet({
         distanceFrom(input$address, input$footage, input$acres, tol = .15, input$dist)
       })
     })
  4. Filter the data.  The distanceFrom function above uses dplyr to filter the properties in the selected city by square footage, acreage, and distance from the subject property.  The tol argument is used to give a padding around square footage – few houses match exactly in that respect.

     #Filter once.
     houses_filtered <- houses %>%
       filter(Acres == acres) %>%
       filter(LvngAreaSF >= ((1-tol)*sqft)) %>%
       filter(LvngAreaSF <= ((1+tol)*sqft))
     #This grabs lat & long from Google.
     getGeoInfo(subj_address)
     longitude_subj <- as.numeric(longitude)
     latitude_subj <- as.numeric(latitude)
     #Use the comparable house locations.
     xy <- houses_filtered[,1:2]
     xy <- as.matrix(xy)
     #Calculate distance.
     d <- spDistsN1(xy, c(longitude_subj, latitude_subj), longlat = TRUE)
     d <- d/1.60934
     d <- substr(d, 0, 4)
     #Filter again.
     distance <- houses_filtered %>% filter(distanceMi <= dist)
  5. Draw the map. The most important piece, the map, is drawn using Leaflet.  I have the Schools layer hidden initially because it detracts from the main focus – the houses.

     map <- leaflet() %>%
       addTiles(group = "Detailed") %>%
       addProviderTiles("CartoDB.Positron", group = "Simple") %>%
       addAwesomeMarkers(lng = longitude, lat = latitude, popup = subj_address,
                         icon = awesomeIcons(icon = 'home', markerColor = 'red'),
                         group = "Subject Property") %>%
       addAwesomeMarkers(lng = distance$X, lat = distance$Y,
                         popup = paste(distance$Address, distance$`Sold Price`, distance$distanceMi, sep = ""),
                         icon = awesomeIcons(icon = 'home', markerColor = 'blue'),
                         group = "Comps") %>%
       addAwesomeMarkers(lng = schoolsdf$long, lat = schoolsdf$lat,
                         icon = awesomeIcons(icon = 'graduation-cap', library = 'fa',
                                             markerColor = 'green', iconColor = '#FFFFFF'),
                         popup = schoolsdf$features.properties$NAMELONG,
                         group = "Schools") %>%
       addLayersControl(
         baseGroups = c("Simple", "Detailed"),
         overlayGroups = c("Subject Property", "Comps", "Schools"),
         options = layersControlOptions(collapsed = FALSE))

     map <- map %>% hideGroup(group = "Schools")
  6. Regression model.  The second tab at the top of the page collects more information, which is used to create a predictive model for the subject property.  The implementation is somewhat messy, so if you'd like to check it out, the code is at the bottom of app.R in the GitHub repo; a rough sketch of the general idea follows below.
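To give a flavour of that last step without reproducing the messy part, here is a minimal sketch, not the app's actual code, of the kind of comparables-based model the second tab could fit. The column names (`Sold Price`, LvngAreaSF, Acres, Fireplace) come from the snippets above; the choice of predictors and the Fireplace value for the subject are assumptions for illustration.

# Hypothetical sketch: fit a simple hedonic model on the filtered comparables,
# then score the subject property. Not the app's actual implementation.
comp_model <- lm(`Sold Price` ~ LvngAreaSF + Acres + Fireplace, data = distance)

subject <- data.frame(LvngAreaSF = input$footage,      # square footage from the UI
                      Acres = as.numeric(input$acres), # acreage choice from the UI
                      Fireplace = 1)                   # assumed value for illustration

predict(comp_model, newdata = subject, interval = "prediction")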

That’s it!  It took a while to get all the pieces together, but I think the final product is useful and I learned a lot along the way.  There are a few places I want to improve: simplifying the re-typing sections and making elements refresh without clicking Submit, among others.  If you have any questions about the code, please leave a comment or feel free to send me an email.

Happy coding,

Kiefer Smith

 

 

 

 

Forrester’s 2017 Take on Tools for Data Science

Thu, 03/16/2017 - 13:00

In my ongoing quest to track The Popularity of Data Science Software, I’ve updated the discussion of the annual report from Forrester, which I repeat here to save you from having to read through the entire document. If your organization is looking for training in the R language, you might consider my books, R for SAS and SPSS Users or R for Stata Users, or my on-site workshops.

Forrester Research, Inc. is a company that provides reports which analyze the competitive position of tools for data science. The conclusions from their 2017 report, Forrester Wave: Predictive Analytics and Machine Learning Solutions, are summarized in Figure 3b. On the x-axis they list the strength of each company’s strategy, while the y-axis measures the strength of their current offering. The size and shading of the circles around each data point indicate the strength of each vendor in the marketplace (70% vendor size, 30% ISV and service partners).

As with the Gartner 2017 report discussed above, IBM, SAS, KNIME, and RapidMiner are considered leaders. However, Forrester sees several more companies in this category: Angoss, FICO, and SAP. This is quite different from the Gartner analysis, which places Angoss and SAP in the middle of the pack and considers FICO a niche player.

Figure 3b. Forrester Wave plot of predictive analytics and machine learning software.

In their Strong Performers category, they have H2O.ai, Microsoft, Statistica, Alpine Data, Dataiku, and, just barely, Domino Data Labs. Gartner rates Dataiku quite a bit higher, but the two firms generally agree on the others. The exception is that Gartner dropped coverage of Alpine Data in 2017. Finally, Salford Systems is in the Contenders section. Salford was recently purchased by Minitab, a company that has never been rated by either Gartner or Forrester before because it focused on being a statistics package rather than expanding into machine learning or artificial intelligence tools as most other statistics packages have (another notable exception: Stata). It will be interesting to see how they’re covered in future reports.

Compared to last year’s Forrester report, KNIME shot up from barely being a Strong Performer into the Leaders segment. RapidMiner and FICO moved from the middle of the Strong Performers segment to join the Leaders. The only other major move was a lateral one for Statistica, whose score on Strategy went down while its score on Current Offering went up (last year Statistica belonged to Dell; this year it’s part of Quest Software).

The size of the “market presence” circle for RapidMiner indicates that Forrester views its position in the marketplace to be as strong as that of IBM and SAS. I find that perspective quite a stretch indeed!

Alteryx, Oracle, and Predixion were all dropped from this year’s Forrester report. They mention Alteryx and Oracle as having “capabilities embedded in other tools”, implying that that is not the focus of this report. No mention was made of why Predixion was dropped, but considering that Gartner also dropped coverage of them in 2017, it doesn’t bode well for the company.

R + D3, A powerful and beautiful combo

Thu, 03/16/2017 - 01:00

For a while now, I have been looking for ways to incorporate pretty D3 graphics into R Markdown reports or Shiny apps. Why? I like R for the ability to do data handling, stats, and ML very easily with minimal code. But when you want to present something to clients or the public, there is no competition for front-end web tools such as d3.js. The answer: htmlwidgets.

Here is my attempt so far

Ok, well, I wasn’t looking too hard, because it completely escaped me that you could do this with htmlwidgets. I stumbled upon it while browsing the Shiny user showcase, where I came across the FRISS Analytics example, which is accompanied by a really fantastic set of tutorials under ‘other tutorials’. If you want to customise this for your own plotting needs, I strongly suggest you start with those tutorials. They are very thorough and step-wise and will let you build an R-D3 binding for the first time with very little hassle, as I have done.

It looks a little like this…

  • You write a package.
  • The package contains R functions.
  • Those functions invoke JavaScript as you define it.
  • htmlwidgets binds them together and provides the plumbing between your R objects and the JavaScript that you feed them into.

The outcome is that, in the end, you can create an interactive graphic simply by calling an R function. The real beauty, though, is being able to update the JavaScript plot in response to user input on the R side, without drawing a whole new plot each time in the slightly awkward way I was previously doing it. If you have loaded the app, try typing extra zeros into the sample size; you’ll see the plot update as the underlying data is updated. Smooth. This is what I was looking for.

Of course, you don’t need to be a JavaScript programmer to get this done. You can use higher-level JS libraries, such as C3 in my case, or NVD3; maybe there are more.

So all in all the chain is…

R -> htmlwidgets -> || C3 -> D3 -> JavaScript.

Where htmlwidgets is reaching through the border between R and JavaScript.
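To make the chain a little more concrete, here is a minimal sketch of the R side of such a binding, loosely following the htmlwidgets conventions covered in the FRISS tutorials. The widget name, payload, and package name are all hypothetical, and a real widget also needs a matching JavaScript binding and YAML file under inst/htmlwidgets/ in the package.

library(htmlwidgets)

# Hypothetical widget constructor: wraps the data into a list that htmlwidgets
# serialises to JSON and hands to the JavaScript binding of the same name.
simpleC3 <- function(values, width = NULL, height = NULL) {
  createWidget(
    name = "simpleC3",       # must match inst/htmlwidgets/simpleC3.js
    x = list(values = values),
    width = width,
    height = height,
    package = "mypackage"    # hypothetical package name
  )
}

# Shiny plumbing, so the widget can be used as an output/render pair:
simpleC3Output <- function(outputId, width = "100%", height = "400px") {
  shinyWidgetOutput(outputId, "simpleC3", width, height, package = "mypackage")
}

renderSimpleC3 <- function(expr, env = parent.frame(), quoted = FALSE) {
  if (!quoted) expr <- substitute(expr)
  shinyRenderWidget(expr, simpleC3Output, env, quoted = TRUE)
}

In a Shiny app the UI would then call simpleC3Output("plot1") and the server would assign output$plot1 <- renderSimpleC3(simpleC3(values)); changing values inside that reactive is what drives the smooth in-place updates described above, provided the JavaScript side redraws on new data.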

I forked the R package from FRISS and when I get some time or there is a need, I will try to port some more C3 based templates and expose them to R functions in this way.

This post is obviously not a tutorial just a flag and a signpost, to find out how to do this yourself, go to the FRISS Analytics tutorial either here on rstudio or here on github. Thanks a million to the folks at FRISS analytics and the authors of htmlwidgets.

R, GIS, and fuzzyjoin to reconstruct demographic data for NUTS regions of Denmark

Thu, 03/16/2017 - 01:00
NUTS

NUTS stands for the Nomenclature of Territorial Units for Statistics. The history of NUTS dates back to the beginning of the 1970s, when European countries developed unified standards for systems of administrative geography. Yet it was not until the beginning of this century that such a system finally became widely used. There are three main hierarchical levels of NUTS, and the most commonly used for regional analysis is NUTS-2.


Figure 1. Illustration of the principle of NUTS hierarchical system

One of the objectives of NUTS was to provide more or less comparable administrative divisions for all countries of Europe. Nevertheless, in 2013, population figures for single NUTS-2 regions ranged from 28.5 thousand in the Åland islands (Finland) to almost 12 million in Ile-de-France (Paris and its surroundings, France).

The broken time series

Quite arbitrary in its essence, territorial division tends to evolve. Changes in administrative boundaries can cause problems for regional analysis, as they break time series and therefore make it harder to analyze trends. Despite this inconvenience, the boundaries of regions actually change quite often, based on the needs and interests of local or national governments. Eurostat tracks all modifications, providing detailed explanations of all the changes that happen between versions of NUTS (figure 2).


Figure 2. Changes in NUTS between versions 2006 and 2010

Despite this, Eurostat does not recalculate historic demographic data to match the most recent NUTS version. This means that, for the most recent version of NUTS, there is missing data for all years before the latest administrative change. So researchers have to reconstruct historical data manually to obtain a long time series. Of course, crude assumptions often have to be accepted in order to approximate the population figures for the current regions that did not exist in the past.

To make things even more complex, Eurostat provides the data only for the latest version of NUTS (at least, I did not work out how to download previous versions). In my PhD project I carry out regional analysis for the NUTS-2 regions of the European Union. To have the longest possible time series when I did the data preparation in 2015, I chose the 2010 version of NUTS, on which the regional demographic projection EUROPOP2013 is based. For reproducibility, I uploaded the precise versions of the Eurostat data at NUTS-2 level on population age structures and deaths, as downloaded in 2015, to figshare.

Denmark

Some countries had to make major changes to their systems of territorial division to fit the NUTS standards. The most significant reform happened in Denmark in 2007, where the former 271 municipalities were transformed into 98 new municipalities. At the same time, NUTS was introduced, so that the 98 municipalities were allocated to 11 NUTS-3 regions, which aggregate to 5 NUTS-2 regions. Typically for a small country, Denmark has only one NUTS-1 region: the whole country.

As far as I know, there has been no official attempt by Eurostat to reconstruct the time series for Denmark before 2007. The typical Eurostat map for the pre-2007 period shows Denmark as a “no data available” country (figure 3).


Figure 3. Life expectancy at birth in European NUTS-2 regions, 2006; a screenshot from the Eurostat’s interactive data exploratory tool

Such a data loss is somewhat surprising for a country such as Denmark. It might be quite difficult to match the old and new municipal systems; but it should be relatively easy to re-aggregate the old municipalities into the new (higher level) NUTS regions. That is precisely what I did during my data preparation1 and what I now want to share in this post.

The task is basically to identify which of the old 271 municipalities are located within the modern 11 NUTS-3 regions and to aggregate the municipal data. Then, NUTS-3 data is easily aggregated for the NUTS-2 level. Such a task could have meant working late into the night, but luckily we live in the GIS era. I used GIS to match the old municipalities with the NUTS-3 regions. Here I want to show (with code) how the task can be performed using the amazing and opensource R. Below I show the process of matching old municipalities to the NUTS regions and the process that I used to aggregate population data.

Data

The data on the population age structures for the old 271 municipalities of Denmark was downloaded from the official website of Statistics Denmark. The system only allows unregistered users to grab up to 10K cells and registered users up to 100K, so downloading the data involves some tedious manual manipulation. For the purposes of my PhD project, I downloaded the data for the period 2001-2006, but, if needed, the data is available back to 1979. The data, downloaded in 2015 and ‘tidied up’, can be found here.

I have spent a lot of time trying to find geodata with the boundaries of the old municipalities. Now, coming back to the topic more than 1.5 years later, I have failed to identify the original source of the shapefile, though I am pretty sure that it came from here 2. The copy of the shapefile that I used can be found here.

Finally, we need a shapefile of NUTS-3 regions. It can be easily downloaded from Eurostat geodata repository. The shapefile that I used is “NUTS_2010_20M_SH.zip”. The selection of the 11 Danish regions can be found here.

The projection used for both shapefiles is EPSG-3044, the one often used to map Denmark.
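As a side note of mine (not from the original workflow): if your local copies of the shapefiles happen to arrive in a different projection, they can be reprojected before plotting. A tiny illustrative snippet, where some_layer is just a placeholder for any Spatial* object read with readOGR():

library(rgdal)

# reproject a layer to EPSG:3044 if it comes in another CRS (illustrative only)
some_layer_3044 <- spTransform(some_layer, CRS("+init=epsg:3044"))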

Now, the code to prepare the R session and load the data.

# set locale and encoding parameters to read Danish
if(Sys.info()['sysname']=="Linux"){
        Sys.setlocale("LC_CTYPE", "da_DK.utf8")
        danish_ecnoding <- "WINDOWS-1252"
}else if(Sys.info()['sysname']=="Windows"){
        Sys.setlocale("LC_CTYPE", "danish")
        danish_ecnoding <- "Danish_Denmark.1252"
}

# load required packages (install first if needed)
library(tidyverse)    # version: 1.0.0
library(ggthemes)     # version: 3.3.0
library(rgdal)        # version: 1.2-4
library(rgeos)        # version: 0.3-21
library(RColorBrewer) # version: 1.1-2
brbg <- brewer.pal(11, "BrBG") # the palette referenced as brbg[] in the plots below
library(fuzzyjoin)    # version: 0.1.2
library(viridis)      # version: 0.3.4

# load Denmark pop structures for the old municipalities
df <- read_csv("https://ikashnitsky.github.io/doc/misc/nuts2-denmark/BEF1A.csv.gz")

# create a directory for geodata
ifelse(!dir.exists("geodata"), dir.create("geodata"), "Directory already exists")

# download, unzip and read Danish NUTS-3 geodata (31KB)
url_nuts <- "https://ikashnitsky.github.io/doc/misc/nuts2-denmark/denmark-nuts3-espg3044.tgz"
path_nuts <- "geodata/denmark-nuts3-espg3044.tgz"
ifelse(!file.exists(path_nuts), download.file(url_nuts, path_nuts, mode="wb"), 'file already exists')
# If there are problems downloading the data automatically, please download it manually from
# https://ikashnitsky.github.io/doc/misc/nuts2-denmark/denmark-nuts3-espg3044.tgz
untar(tarfile = path_nuts, exdir = "geodata")
sp_nuts3 <- readOGR(dsn = "geodata/.", layer = "denmark-nuts3-espg3044")
gd_nuts3 <- fortify(sp_nuts3, region = "NUTS_ID") # to the ggplot format

# download, unzip and read Danish old municipal geodata (6.0MB)
url_mun <- "https://ikashnitsky.github.io/doc/misc/nuts2-denmark/kommune2006win1252.tgz"
path_mun <- "geodata/kommune2006win1252.tgz"
ifelse(!file.exists(path_mun), download.file(url_mun, path_mun, mode="wb"), 'file already exists')
# If there are problems downloading the data automatically, please download it manually from
# https://ikashnitsky.github.io/doc/misc/nuts2-denmark/kommune2006utf8.tgz
untar(tarfile = path_mun, exdir = "geodata")
sp_mun <- readOGR(dsn = "geodata/.", layer = "kommune2006win1252", encoding = danish_ecnoding)
gd_mun <- fortify(sp_mun)

# coordinates of the municipalities
mun_coord <- bind_cols(as.data.frame(coordinates(sp_mun)), sp_mun@data[,1:3]) %>%
        transmute(long = V1, lat = V2, enhedid, objectid, name = navn)

Spatial matching

Let’s first have a look at the map.

ggplot()+
        geom_polygon(data = gd_nuts3, aes(long, lat, group = group),
                     color = brbg[3], fill = "grey90", size = 1)+
        geom_point(data = mun_coord, aes(long, lat),
                   color = brbg[10], size = 1)+
        theme_map()


Figure 4. Reference map of the old municipalities and NUTS-3 regions of Denmark

We can easily see that the boundaries of the municipalities (light blue) are much more precise than those of the NUTS-3 regions (orange/brown). This is not a problem as long as all the centroids of the municipalities fall within the boundaries of the NUTS-3 regions, which seems to be true for all municipalities except the easternmost one. A quick check reveals that this is Christiansø, a tiny fortified island whose history goes back to the Middle Ages. It has a special status and is not included in the NUTS system. For further manipulations, Christiansø can safely be merged with nearby Bornholm.

To identify which municipalities fall into which NUTS regions, I use the spatial overlay function (over) from the sp package. Here I should thank Roger Bivand, the person who made it possible to do any spatial analysis in R.

# municipality coordinates to Spatial
mun_centr <- SpatialPoints(coordinates(sp_mun), proj4string = CRS(proj4string(sp_nuts3)))

# spatial intersection with sp::over
inter <- bind_cols(mun_coord, over(mun_centr, sp_nuts3[,"NUTS_ID"])) %>%
        transmute(long, lat, objectid, nuts3 = as.character(NUTS_ID), nuts2 = substr(nuts3, 1, 4))

Let’s again check visually if the spatial matching worked okay.

ggplot()+
        geom_polygon(data = gd_mun, aes(long, lat, group = group),
                     color = brbg[9], fill = "grey90", size = .1)+
        geom_polygon(data = gd_nuts3, aes(long, lat, group = group),
                     color = brbg[3], fill = NA, size = 1)+
        geom_point(data = inter, aes(long, lat, color = nuts3), size = 1)+
        geom_point(data = inter[is.na(inter$nuts3),], aes(long, lat),
                   color = "red", size = 7, pch = 1)+
        theme_map(base_size = 15)+
        theme(legend.position = c(1, 1), legend.justification = c(1, 1))


Figure 5. Checking the spatial intersection between the old municipalities and NUTS-3 regions of Denmark

Not bad. But there is an “NA” category that represents all the cases where the spatial match failed. How many such cases do we have?

# how many failed cases do we have
sum(is.na(inter$nuts3))
## [1] 3

# where the intersection failed
inter[is.na(inter$nuts3),]
##         long     lat objectid nuts3 nuts2
## 23  892474.0 6147918    46399  <NA>  <NA>
## 65  504188.4 6269329   105319  <NA>  <NA>
## 195 533446.8 6312770    47071  <NA>  <NA>

As there are only 3 cases, I decided to fix them manually.

# fix the three cases manually
fixed <- inter
fixed[fixed$objectid=="46399", 4:5] <- c("DK014", "DK01")
fixed[fixed$objectid=="105319", 4:5] <- c("DK041", "DK04")
fixed[fixed$objectid=="47071", 4:5] <- c("DK050", "DK05")

The final visual check.

ggplot()+
        geom_polygon(data = gd_mun, aes(long, lat, group = group),
                     color = brbg[9], fill = "grey90", size = .1)+
        geom_polygon(data = gd_nuts3, aes(long, lat, group = group),
                     color = brbg[3], fill = NA, size = 1)+
        geom_point(data = fixed, aes(long, lat, color = nuts3), size = 1)+
        theme_map(base_size = 15)+
        theme(legend.position = c(1, 1), legend.justification = c(1, 1))


Figure 6. Re-checking the spatial intersection between the old municipalities and NUTS-3 regions of Denmark

Now everything seems okay.

Joining spatial and statistical data (fuzzy join)

The next task is to join the spatial and statistical data together. The spatial layer for municipalities does not contain the codes that are used by Statistics Denmark, so I have to match municipalities in the two datasets by their names. This is quite a difficult task: names can be written slightly differently, there are some special characters in the Danish alphabet, and some municipalities may have changed their name. To solve the task most efficiently, I used the fuzzy string matching approach implemented in the fuzzyjoin package by David Robinson; a tiny toy illustration of the matching behaviour follows below.
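As a quick aside (my addition, with made-up mini tables), this is how fuzzyjoin::regex_left_join behaves in this setting: each name in the second table is treated as a pattern searched for within the name column of the first table, which is why short names such as ‘hasle’ or ‘rønne’ end up matching longer ones such as ‘haslev’ or ‘rønnede’ further down.

library(fuzzyjoin)

# hypothetical mini versions of the two name tables
geo  <- data.frame(name = c("haslev", "rønnede"), stringsAsFactors = FALSE)
stat <- data.frame(name = c("haslev", "hasle", "rønnede", "rønne"),
                   code = c("313", "403", "385", "407"),
                   stringsAsFactors = FALSE)

# each stat name is used as a pattern matched inside the geo names,
# so every geo row here picks up two candidate matches
regex_left_join(geo, stat, by = "name")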

First, I simplify the names in both datasets turning them into lowercase, replacing the character “å” with “aa”, and removing the “Kommune” word in the spatial dataset names. Please note that I downloaded (separately) a small selection from Statistics Denmark to have a lightweight dataframe with municipal codes and names.

# simplify municipalities names
mun_geo <- mun_coord %>%
        transmute(name = sub(x = name, " Kommune", replacement = ""), objectid) %>%
        mutate(name = gsub(x = tolower(name), "å", "aa"))

mun_stat <- read.csv2("https://ikashnitsky.github.io/doc/misc/nuts2-denmark/stat-codes-names.csv",
                      fileEncoding = danish_ecnoding) %>%
        select(name) %>%
        separate(name, into = c("code", "name"), sep = " ", extra = "merge") %>%
        mutate(name = gsub("\\s*\\([^\\)]+\\)", "", x = name)) %>%
        mutate(name = gsub(x = tolower(name), "å", "aa"))

Let’s try fuzzy join.

# first attempt
fuz_joined_1 <- regex_left_join(mun_geo, mun_stat, by = "name")

The resulting dataframe has 278 rows instead of 271. That means that for some municipalities in the spatial dataset there was more than one match. Let’s identify them.

# identify more than 1 match (7 cases) and select which to drop
fuz_joined_1 %>% group_by(objectid) %>% mutate(n = n()) %>% filter(n > 1)

## Source: local data frame [14 x 5]
## Groups: objectid [7]
##
##         name.x objectid  code      name.y     n
##          <chr>    <dbl> <chr>       <chr> <int>
## 1       haslev   105112   313      haslev     2
## 2       haslev   105112   403       hasle     2
## 3  brønderslev    47003   739       rønde     2
## 4  brønderslev    47003   805 brønderslev     2
## 5    hirtshals    47037   817        hals     2
## 6    hirtshals    47037   819   hirtshals     2
## 7      rønnede    46378   385     rønnede     2
## 8      rønnede    46378   407       rønne     2
## 9     hvidebæk    46268   317    hvidebæk     2
## 10    hvidebæk    46268   681     videbæk     2
## 11    ryslinge    46463   477    ryslinge     2
## 12    ryslinge    46463   737          ry     2
## 13     aarslev    46494   497     aarslev     2
## 14     aarslev    46494   861        aars     2

So, for 7 municipalities, two matches were found. I will drop the imperfect match variants in the next iteration of fuzzy join.

The other issue is the municipalities for which no match was found in the statistical data.

# show the non-matched cases
fuz_joined_1 %>% filter(is.na(code))

##              name.x objectid code name.y
## 1              faxe   105120 <NA>   <NA>
## 2 nykøbing falsters    46349 <NA>   <NA>
## 3       herstederne    46101 <NA>   <NA>

As there are only three such cases, I corrected them manually in the spatial data to match the statistical data. There are two cases where the municipality name is written differently and one case of a name change.

# correct the 3 non-matching geo names
mun_geo_cor <- mun_geo
mun_geo_cor[mun_geo_cor$name=="faxe", "name"] <- "fakse"
mun_geo_cor[mun_geo_cor$name=="nykøbing falsters", "name"] <- "nykøbing f."
mun_geo_cor[mun_geo_cor$name=="herstederne", "name"] <- "albertslund"

Now the second attempt to match the datasets (spatial dataset is corrected).

# second attempt
fuz_joined_2 <- regex_left_join(mun_geo_cor, mun_stat, by = "name")

# drop non-perfect match
fuz_joined_2 <- fuz_joined_2 %>%
        group_by(objectid) %>% mutate(n = n()) %>% ungroup() %>%
        filter(n < 2 | name.x==name.y)

fuz_joined_2 <- fuz_joined_2 %>% transmute(name = name.x, objectid, code)

The output looks perfect. Now, the last step – using the matched “objectid” field, I will finally attach the NUTS data to statistical codes.

# finally, attach the NUTS info to matched table
key <- left_join(fuz_joined_2, fixed, "objectid")

Aggregate old municipal data to NUTS levels

The previous manipulations yielded a dataframe that links statistical codes of the old municipalities with the corresponding NUTS regions. The last thing that has to be done is aggregation. I will attach the “key” dataset to a statistical dataset and aggregate the data at NUTS-3 and NUTS-2 levels.

# finally, we only need to aggregate the old stat data
df_agr <- left_join(key, df, "code") %>%
        filter(!is.na(name)) %>%
        gather("year", "value", y2001:y2006)

df_nuts3 <- df_agr %>%
        group_by(year, sex, age, nuts3) %>%
        summarise(value = sum(value)) %>%
        ungroup()

df_nuts2 <- df_agr %>%
        group_by(year, sex, age, nuts2) %>%
        summarise(value = sum(value)) %>%
        ungroup()

Let’s now calculate the shares of working age population in Danish NUTS-3 regions in 2001 and map the information.

# total population in 2001 by NUTS-3 regions
tot_01 <- df_nuts3 %>%
        filter(year=="y2001") %>%
        group_by(nuts3) %>%
        summarise(tot = sum(value, na.rm = TRUE)) %>%
        ungroup()

# working-age population in 2001 by NUTS-3 regions
working_01 <- df_nuts3 %>%
        filter(year=="y2001", age %in% paste0("a0", 15:64)) %>%
        group_by(nuts3) %>%
        summarise(work = sum(value, na.rm = TRUE)) %>%
        ungroup()

# calculate the shares of working age population
sw_01 <- left_join(working_01, tot_01, "nuts3") %>%
        mutate(sw = work / tot * 100)

# map the shares of working age population in 2001 by NUTS-3 regions
ggplot()+
        geom_polygon(data = gd_nuts3 %>% left_join(sw_01, c("id" = "nuts3")),
                     aes(long, lat, group = group, fill = sw),
                     color = "grey50", size = 1) +
        scale_fill_viridis()+
        theme_map(base_size = 15)+
        theme(legend.position = c(1, 1), legend.justification = c(1, 1))


Figure 7. The share of working age (15-64) population by NUTS-3 regions of Denmark in 2001

The result (thankfully!) looks realistic, with higher shares of the working-age population in the capital region, and in other regions that have relatively big cities.

This post is written for Demotrends
  1. I have spent quite some time searching to see whether someone else had done this job before me, and failed to find anything. 

  2. There is a note on the website saying that, due to a planned change in the structure of the website, there might be some problems with data acquisition. I failed to download the geodata on 2017-02-23. 

Plotting trees from Random Forest models with ggraph

Thu, 03/16/2017 - 01:00

Today, I want to show how I use Thomas Lin Pedersen’s awesome ggraph package to plot decision trees from Random Forest models.

I am very much a visual person, so I try to plot as much of my results as possible because it helps me get a better feel for what is going on with my data.

A nice aspect of using tree-based machine learning, like Random Forest models, is that they are more easily interpreted than, e.g., neural networks, because they are based on decision trees. So, when I am using such models, I like to plot the final decision trees (if they aren’t too large) to get a sense of which decisions underlie my predictions.

There are a few very convenient ways to plot the outcome if you are using the randomForest package, but I like to have as much control as possible over the layout, colors, labels, etc. And because I didn’t find a solution I liked for caret models, I developed the following little function (below you may find information about how I built the model):

As input, it takes the part of the output from model_rf <- caret::train(... "rf" ...) that contains the trees of the final model: model_rf$finalModel$forest. From these trees, you can specify which one to plot by its index.

library(dplyr)
library(ggraph)
library(igraph)

tree_func <- function(final_model, tree_num) {

        # get tree by index
        tree <- randomForest::getTree(final_model, k = tree_num, labelVar = TRUE) %>%
                tibble::rownames_to_column() %>%
                # make leaf split points to NA, so the 0s won't get plotted
                mutate(`split point` = ifelse(is.na(prediction), `split point`, NA))

        # prepare data frame for graph
        graph_frame <- data.frame(from = rep(tree$rowname, 2),
                                  to = c(tree$`left daughter`, tree$`right daughter`))

        # convert to graph and delete the last node that we don't want to plot
        graph <- graph_from_data_frame(graph_frame) %>% delete_vertices("0")

        # set node labels
        V(graph)$node_label <- gsub("_", " ", as.character(tree$`split var`))
        V(graph)$leaf_label <- as.character(tree$prediction)
        V(graph)$split <- as.character(round(tree$`split point`, digits = 2))

        # plot
        plot <- ggraph(graph, 'dendrogram') +
                theme_bw() +
                geom_edge_link() +
                geom_node_point() +
                geom_node_text(aes(label = node_label), na.rm = TRUE, repel = TRUE) +
                geom_node_label(aes(label = split), vjust = 2.5, na.rm = TRUE, fill = "white") +
                geom_node_label(aes(label = leaf_label, fill = leaf_label), na.rm = TRUE,
                                repel = TRUE, colour = "white", fontface = "bold", show.legend = FALSE) +
                theme(panel.grid.minor = element_blank(),
                      panel.grid.major = element_blank(),
                      panel.background = element_blank(),
                      plot.background = element_rect(fill = "white"),
                      panel.border = element_blank(),
                      axis.line = element_blank(),
                      axis.text.x = element_blank(),
                      axis.text.y = element_blank(),
                      axis.ticks = element_blank(),
                      axis.title.x = element_blank(),
                      axis.title.y = element_blank(),
                      plot.title = element_text(size = 18))

        print(plot)
}

We can now plot, e.g., the tree with the smallest number of nodes:

tree_num <- which(model_rf$finalModel$forest$ndbigtree == min(model_rf$finalModel$forest$ndbigtree))

tree_func(final_model = model_rf$finalModel, tree_num)

Or we can plot the tree with the biggest number of nodes:

tree_num <- which(model_rf$finalModel$forest$ndbigtree == max(model_rf$finalModel$forest$ndbigtree))

tree_func(final_model = model_rf$finalModel, tree_num)

Preparing the data and modeling

The data set I am using in these example analyses is the Breast Cancer Wisconsin (Diagnostic) Dataset. The data was downloaded from the UC Irvine Machine Learning Repository.

The first data set looks at the predictor classes:

  • malignant or
  • benign breast mass.

The features characterize cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nuclei
  • Mitosis
  • Classes, i.e. diagnosis
bc_data <- read.table("datasets/breast-cancer-wisconsin.data.txt",
                      header = FALSE, sep = ",")
colnames(bc_data) <- c("sample_code_number", "clump_thickness", "uniformity_of_cell_size",
                       "uniformity_of_cell_shape", "marginal_adhesion", "single_epithelial_cell_size",
                       "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitosis", "classes")

bc_data$classes <- ifelse(bc_data$classes == "2", "benign",
                          ifelse(bc_data$classes == "4", "malignant", NA))

bc_data[bc_data == "?"] <- NA

# impute missing data
library(mice)

bc_data[,2:10] <- apply(bc_data[, 2:10], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data[, 2:10], print = FALSE)
bc_data <- cbind(bc_data[, 11, drop = FALSE], mice::complete(dataset_impute, 1))

bc_data$classes <- as.factor(bc_data$classes)

# how many benign and malignant cases are there?
summary(bc_data$classes)

# separate into training and test data
library(caret)

set.seed(42)
index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
train_data <- bc_data[index, ]
test_data  <- bc_data[-index, ]

# run model
set.seed(42)
model_rf <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 10,
                                                  savePredictions = TRUE,
                                                  verboseIter = FALSE))

If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.

sessionInfo()

## R version 3.3.3 (2017-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] igraph_1.0.1 ggraph_1.0.0 ggplot2_2.2.1.9000
## [4] dplyr_0.5.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.9 nloptr_1.0.4 plyr_1.8.4
## [4] viridis_0.3.4 iterators_1.0.8 tools_3.3.3
## [7] digest_0.6.12 lme4_1.1-12 evaluate_0.10
## [10] tibble_1.2 gtable_0.2.0 nlme_3.1-131
## [13] lattice_0.20-34 mgcv_1.8-17 Matrix_1.2-8
## [16] foreach_1.4.3 DBI_0.6 ggrepel_0.6.5
## [19] yaml_2.1.14 parallel_3.3.3 SparseM_1.76
## [22] gridExtra_2.2.1 stringr_1.2.0 knitr_1.15.1
## [25] MatrixModels_0.4-1 stats4_3.3.3 rprojroot_1.2
## [28] grid_3.3.3 caret_6.0-73 nnet_7.3-12
## [31] R6_2.2.0 rmarkdown_1.3 minqa_1.2.4
## [34] udunits2_0.13 tweenr_0.1.5 deldir_0.1-12
## [37] reshape2_1.4.2 car_2.1-4 magrittr_1.5
## [40] units_0.4-2 backports_1.0.5 scales_0.4.1
## [43] codetools_0.2-15 ModelMetrics_1.1.0 htmltools_0.3.5
## [46] MASS_7.3-45 splines_3.3.3 randomForest_4.6-12
## [49] assertthat_0.1 pbkrtest_0.4-6 ggforce_0.1.1
## [52] colorspace_1.3-2 labeling_0.3 quantreg_5.29
## [55] stringi_1.1.2 lazyeval_0.2.0 munsell_0.4.3

Improved Python-style Logging in R

Thu, 03/16/2017 - 00:53
This entry is part 21 of 21 in the series Using R

Last August, in Python-style Logging in R, we described using an R script as a wrapper around the futile.logger package to generate log files for an operational R data processing script. Today, we highlight an improved, documented version that can be sourced by your R scripts or dropped into your package’s R/ directory to provide easy file and console logging.

The improved pylogging.R script enables the following use case:

  1. set up log files for different log levels
  2. set console log level
  3. six available log levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL

All of these capabilities depend upon the excellent futile.logger package (CRAN or github). This script just wraps this package to get python-style file logging. Please see futile.logger’s documentation for details on output formatting, etc.
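As a rough illustration of what such a wrapper does under the hood (a sketch of mine, not the contents of pylogging.R), file and console logging can be configured directly with futile.logger along these lines; the logger names and file paths are just placeholders, and the real script adds argument checking, roxygen2 documentation, and the logger.setup()/logger.setLevel() interface described here.

library(futile.logger)

# Console logging: the root logger prints to the console; set its threshold.
flog.threshold(ERROR)                              # only ERROR and FATAL reach the console

# File logging: separate named loggers, each with its own appender and threshold.
flog.logger("debug", DEBUG, appender = appender.file("debug.log"))
flog.logger("error", ERROR, appender = appender.file("error.log"))

# A wrapper like logger.error() can then fan a message out to every relevant logger.
msg <- "Having trouble staying awake ..."
flog.error(msg)                                    # console (root logger)
flog.error(msg, name = "debug")                    # lands in debug.log
flog.error(msg, name = "error")                    # lands in error.log

The actual pylogging.R also refuses to emit messages before logger.setup() has been called, which is what produces the error at the start of the session transcript below.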

The pylogging.R script is fully documented with roxygen2 comments and can be incorporated into packages as long as their DESCRIPTION file adds a dependency on futile.logger.  For those developing operational processing pipelines using R, python style logging can be very useful.

To demonstrate log files and console output you can download pylogging.R and the following sleepy.R script:

# sleepy.R

logger.info("Getting sleepy ...")
Sys.sleep(1)
logger.warn("Getting really tired ...")
Sys.sleep(2)
logger.error("Having trouble staying awake ...")
Sys.sleep(3)
logger.fatal("Not gonna marzlmurrrzzzzz ...")
stop("Snap out of it!")

The following R session demonstrates the general functionality:

> list.files()
[1] "pylogger.R" "sleepy.R"
> # Nothing up my sleeve
>
> source("pylogger.R")
> source("sleepy.R")
Error: You must initialize with 'logger.setup()' before issuing logger statements.
> # Setup is required
>
> logger.setup()
> source("sleepy.R")
FATAL [2017-03-15 16:34:15] Not gonna marzlmurrrzzzzz ...
Error in eval(expr, envir, enclos) : Snap out of it!!
> # The console log level is set to FATAL by default
>
> list.files()
[1] "pylogger.R" "sleepy.R"
> # No log files created
>
> # Now modify console log level
> logger.setLevel(ERROR)
> source("sleepy.R")
ERROR [2017-03-15 16:35:29] Having trouble staying awake ...
FATAL [2017-03-15 16:35:32] Not gonna marzlmurrrzzzzz ...
Error in eval(expr, envir, enclos) : Snap out of it!!
> # Got ERROR and higher
>
> logger.setLevel(DEBUG)
> source("sleepy.R")
INFO [2017-03-15 16:35:42] Getting sleepy ...
WARN [2017-03-15 16:35:43] Getting really tired ...
ERROR [2017-03-15 16:35:45] Having trouble staying awake ...
FATAL [2017-03-15 16:35:48] Not gonna marzlmurrrzzzzz ...
Error in eval(expr, envir, enclos) : Snap out of it!!
> # Got DEBUG and higher
>
> list.files()
[1] "pylogger.R" "sleepy.R"
> # Still no log files
>
> # Set up log files for two levels
> logger.setup(debugLog="debug.log", errorLog="error.log")
> logger.setLevel(FATAL)
> source("sleepy.R")
FATAL [2017-03-15 16:36:43] Not gonna marzlmurrrzzzzz ...
Error in eval(expr, envir, enclos) : Snap out of it!!
> # Expected console output
>
> list.files()
[1] "debug.log"  "error.log"  "pylogger.R" "sleepy.R"
> readr::read_lines("debug.log")
[1] "INFO [2017-03-15 16:36:37] Getting sleepy ..."
[2] "WARN [2017-03-15 16:36:38] Getting really tired ..."
[3] "ERROR [2017-03-15 16:36:40] Having trouble staying awake ..."
[4] "FATAL [2017-03-15 16:36:43] Not gonna marzlmurrrzzzzz ..."
> readr::read_lines("error.log")
[1] "ERROR [2017-03-15 16:36:40] Having trouble staying awake ..."
[2] "FATAL [2017-03-15 16:36:43] Not gonna marzlmurrrzzzzz ..."
>
> # Got two log files containing DEBUG-and-higher and ERROR-and-higher

Best Wishes for Better Logging!

 

 

Puts as Protection

Wed, 03/15/2017 - 21:10

Many asset management firms are happily enjoying record revenue and profits driven not by inorganic growth or skillful portfolio management but by a seemingly endless increase in US equity prices. These firms are effectively commodity producers entirely dependent on the price of an index over which the firm has no control. The options market presents an easy, cheap, and liquid form of protection

Ensemble Methods are Doomed to Fail in High Dimensions

Wed, 03/15/2017 - 20:28

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

Ensemble methods

By ensemble methods, I (Bob, not Andrew) mean approaches that scatter points in parameter space and then make moves by interpolating or extrapolating among subsets of them. Two prominent examples are:

There are extensions and computer implementations of these algorithms. For example, the Python package emcee implements Goodman and Weare’s walker algorithm and is popular in astrophysics.

Typical sets in high dimensions

If you want to get the background on typical sets, I’d highly recommend Michael Betancourt’s video lectures on MCMC in general and HMC in particular; they both focus on typical sets and their relation to the integrals we use MCMC to calculate:

It was Michael who made a doughnut in the air, pointed at the middle of it and said, “It’s obvious ensemble methods won’t work.” This is just fleshing out the details with some code for the rest of us without such sharp geometric intuitions.

MacKay’s information theory book is another excellent source on typical sets. Don’t bother with the Wikipedia on this one.

Why ensemble methods fail: Executive summary

  1. We want to draw a sample from the typical set
  2. The typical set is a thin shell a fixed radius from the mode in a multivariate normal
  3. Interpolating or extrapolating between two points in this shell is unlikely to produce a point that falls in the shell
  4. The only steps that get accepted will be near one of the starting points
  5. The samplers devolve to a random walk with poorly biased choice of direction

Several years ago, Peter Li built the Goodman and Weare walker methods for Stan (all they need is log density) on a branch for evaluation. They failed in practice exactly the way the theory says they will fail. Which is too bad, because the ensemble methods are very easy to implement and embarrassingly parallel.

Why ensemble methods fail: R simulation

OK, so let’s see why they fail in practice. I’m going to write some simple R code to do the job for us. Here’s an R function to generate a 100-dimensional standard isotropic normal variate (each element is generated normal(0, 1) independently):

normal_rng <- function(K) rnorm(K);

This function computes the log density of a draw:

normal_lpdf <- function(y) sum(dnorm(y, log=TRUE));

Next, generate two draws from a 100-dimensional version:

K <- 100; y1 <- normal_rng(K); y2 <- normal_rng(K);

and then interpolate by choosing a point between them:

lambda = 0.5; y12 <- lambda * y1 + (1 - lambda) * y2;

Now let's see what we get:

print(normal_lpdf(y1), digits=1); print(normal_lpdf(y2), digits=1); print(normal_lpdf(y12), digits=1);

[1] -153
[1] -142
[1] -123

Hmm. Why is the log density of the interpolated vector so much higher? Given that it's a multivariate normal, the answer is that it's closer to the mode. That should be a good thing, right? No, it's not. The typical set is defined as a region within "typical" density bounds. When I take a random draw from a 100-dimensional standard normal, I expect log densities that hover between -140 and -160 or so. That interpolated vector y12, with a log density of -123, isn't in the typical set!!! It's a bad draw, even though it's closer to the mode. Still confused? Watch Michael's videos above. Ironically, the Goodman and Weare paper contains a description, in a discussion of why they can use ensemble averages, that also explains why their own sampler doesn't scale: the variance of averages is lower than the variance of individual draws, and we want to cover the actual posterior, not get closer to the mode.
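As a quick sanity check on those numbers (this aside is mine, not part of the original post), the expectations can be computed in closed form: for y ~ normal(0, I_K), log p(y) = -K/2 * log(2*pi) - sum(y^2)/2 with E[sum(y^2)] = K, while each component of y12 = (y1 + y2)/2 has variance 1/2, so E[sum(y12^2)] = K/2.

K <- 100
e_lp_draw   <- -K/2 * log(2*pi) - K/2   # about -141.9: a typical independent draw
e_lp_interp <- -K/2 * log(2*pi) - K/4   # about -116.9: the midpoint of two draws
c(e_lp_draw, e_lp_interp)

These match the means of the two simulated summaries reported just below.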

So let's put this in a little sharper perspective by simulating thousands of draws from a multivariate normal and thousands of draws interpolating between pairs of draws and plot them in two histograms. First, draw them and print a summary:

lp1 <- vector();
for (n in 1:1000) lp1[n] <- normal_lpdf(normal_rng(K));
print(summary(lp1));

lp2 <- vector()
for (n in 1:1000) lp2[n] <- normal_lpdf((normal_rng(K) + normal_rng(K))/2);
print(summary(lp2));

from which we get:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   -177    -146    -141    -142    -136    -121

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   -129    -119    -117    -117    -114    -108

That's looking bad. It's even clearer with a faceted histogram:

library(ggplot2);
df <- data.frame(list(log_pdf = c(lp1, lp2),
                      case = c(rep("Y1", 1000), rep("(Y1 + Y2)/2", 1000))));
plot <- ggplot(df, aes(log_pdf)) +
  geom_histogram(color="grey") +
  facet_grid(case ~ .);

Here's the plot:

The bottom plot shows the distribution of log densities in independent draws from the standard normal (these are pure Monte Carlo draws). The top plot shows the distribution of the log density of the vector resulting from interpolating two independent draws from the same distribution. Obviously, the log densities of the averaged draws are much higher. In other words, they are atypical of draws from the target standard normal density.

Exercise

Check out what happens as (1) the number of dimensions K varies, and (2) as lambda varies within or outside of [0, 1].

Hint: What you should see is that as lambda approaches 0 or 1, the draws get more and more typical, and the sampler behaves more and more like random-walk Metropolis with a small step size. As dimensionality increases, the typical set becomes more attenuated and the problem becomes worse (and vice versa as it decreases). One way to set up this experiment is sketched below.
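For readers who want a starting point, here is a small sketch (my addition, reusing normal_rng() and normal_lpdf() from above) that loops over a few values of K and lambda and reports the mean log density of the interpolated or extrapolated points:

for (K in c(10, 100, 1000)) {
  for (lambda in c(0, 0.25, 0.5, 1.25)) {
    lp <- replicate(500, normal_lpdf(lambda * normal_rng(K) + (1 - lambda) * normal_rng(K)))
    cat("K =", K, " lambda =", lambda, " mean log density =", round(mean(lp), 1), "\n")
  }
}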

Does Hamiltonian Monte Carlo (HMC) have these problems?

Not so much. It scales much better with dimension. It'll slow down, but it won't break and devolve to a random walk like ensemble methods do.

The post Ensemble Methods are Doomed to Fail in High Dimensions appeared first on Statistical Modeling, Causal Inference, and Social Science.

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
