R-bloggers: R news and tutorials contributed by R bloggers

Quandl and Forecasting

Fri, 03/17/2017 - 17:04

(This article was first published on RStudio, and kindly contributed to R-bloggers)

A Reproducible Finance with R Post
by Jonathan Regenstein

Welcome to another installment of Reproducible Finance with R. Today we are going to shift focus in recognition of the fact that there’s more to Finance than stock prices, and there’s more to data download than quantmod/getSymbols. In this post, we will explore commodity prices using data from Quandl, a repository for both free and paid data sources. We will also get into the forecasting game a bit and think about how best to use dygraphs when visualizing predicted time series as an extension of historical data. We are not going to do anything too complex, but we will expand our toolkit by getting familiar with Quandl, commodity prices, the forecast() function, and some advanced dygraph work.

Before we dive in, a few thoughts to frame the notebook underlying this post.

  • We are using oil data from Quandl, but the original data is from FRED. There’s nothing wrong with grabbing the data directly from FRED, of course, and I browse FRED frequently to check out economic data, but I tend to download the data into my RStudio environment using Quandl. I wanted to introduce Quandl today because it’s a nice resource that will be involved in the next few posts in this series. Plus, it’s gaining in popularity, and if you work in the financial industry, you might start to encounter it in your work.

  • This post marks our first foray into the world of predictive modeling, albeit in a very simple way. But the complexity and accuracy of the forecasting methodology we use here is almost irrelevant since I expect that most R coders, whether in industry or otherwise, will have their own proprietary models. Rather, what I want to accomplish here is a framework where models can be inserted, visualized, and scrutinized in the future. I harp on reproducible workflows a lot, and that’s not going to change today because one goal of this Notebook is to house a forecast that can be reproduced in the future (at which point, we will know if the forecast was accurate or not), and then tweaked/criticized/updated/heralded.

  • This post walks through a detailed example of importing, forecasting, and visualizing oil prices. In the near future, I will repeat those steps for gold and copper, and we will examine the relationship between the copper/gold price ratio and interest rates. We are starting simple, but stay tuned.

Now, let’s get to the data download! In the chunk below, as we import WTI oil prices, notice that Quandl makes it easy to choose the type of object (raw/dataframe, xts, or zoo), the period (daily, weekly, or monthly), and the start/end dates.
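
A minimal sketch of that import step follows; the Quandl code "FRED/DCOILWTICO" and the object name oil_monthly are assumptions for illustration, not necessarily the exact choices in the original chunk.

library(Quandl)
# Quandl.api_key("your-key")   # optional for this free FRED series

# monthly WTI prices as an xts object, January 2006 through February 2017
oil_monthly <- Quandl("FRED/DCOILWTICO", type = "xts", collapse = "monthly",
                      start_date = "2006-01-01", end_date = "2017-02-28")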

Also note that I specified the end date of February 2017. This indicates that the Notebook houses a model that was built and run using data as of February 2017. Without the end date, this Notebook would house a model that was built and run using data as of time t. Which you choose depends on how you want the Notebook to function for your team.

Looking back at those oil price objects, each would work well for the rest of this project, but let’s stick with the monthly data. We will be dealing with the date index quite a bit below, so let’s use the seq() function and mdy() from the lubridate package to put the date into a nicer format.
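
A sketch of that date clean-up, reusing the assumed oil_monthly object from above (it assumes the series runs from January 2006 through February 2017):

library(xts)
library(lubridate)

# replace the index with first-of-month dates built by seq() and mdy()
index(oil_monthly) <- seq(mdy("01/01/2006"), mdy("02/01/2017"), by = "months")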

Now we have a cleaner date format. Our base data object is in good shape. As always, we like to have a look at the data in graphical format, so let’s fire up dygraphs. Since we imported an xts object directly from Quandl, we can just plug it straight into the dygraph() function.
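
The plot itself is a one-liner (same assumed object name):

library(dygraphs)
dygraph(oil_monthly, main = "Monthly WTI oil prices")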

Alright, nothing too shocking here. We see a peak in mid-2008, followed by a precipitous decline through the beginning of 2009.

Now we’ll make things a bit more interesting and try to extract some meaning from that data. Let’s use the forecast() function to predict what oil prices will look like over the next six months. This is the part of the code where you might want to insert whatever model your team has built or wish to test. We can think of it as a placeholder for any proprietary model or models that could be dropped into this Notebook. For our purposes, we will simply pass in the monthly oil prices object and supply a lookahead parameter of 6. The forecast() function will then supply some interesting numbers about the next six months of oil prices.
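
A sketch of that step; forecast() chooses a model automatically (typically an exponential smoothing model for a plain univariate series), and h = 6 is the six-month lookahead.

library(forecast)

oil_6month <- forecast(oil_monthly[, 1], h = 6)
summary(oil_6month)   # point forecasts plus 80% and 95% intervals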

The mean forecast is right around $54. It looks like the 95% confidence level has a high of $78 in August and a low of $29 in March. We won’t dwell on these numbers because I imagine you will want to use your own model here – this Notebook is more of a skeleton where those models can be inserted and then tested or evaluated at a later date.

Let’s move on to visualizing the results of the forecast along with the historical data. The base plot() function does a decent job here.

That plot looks OK, but it’s not great. We can see that the mean forecast is to stay around $50, with the 95% bands stretching all the way to around $80 and $30, but honestly, I have to squint to really see those 95% intervals. We don’t like squinting, so let’s put in some extra work to make use of dygraphs, which will have the benefit of allowing a reader to zoom in on the predicted portion of the graph.

This is where things require a bit more thought. We want one xts object to hold both the historical data and the forecasted prices.

We already have our monthly prices in the xts object we imported from Quandl, but the forecasted prices are currently in a list with a different date convention than we would like.

First, let’s move the mean forecast and 95% confidence bands to a dataframe, along with a date column. We predicted oil out six months, so we will need a date column for the six months after February.
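
A sketch of that wrangling step; the column names are my own, and the 95% bounds sit in the second column of the forecast object’s $lower and $upper matrices.

library(lubridate)

oil_forecast_df <- data.frame(
  date     = seq(mdy("03/01/2017"), by = "months", length.out = 6),
  forecast = as.numeric(oil_6month$mean),
  lo_95    = as.numeric(oil_6month$lower[, 2]),
  hi_95    = as.numeric(oil_6month$upper[, 2])
)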

The data we want is now housed in its own dataframe. Let’s convert that to an xts object.
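
Converting that data frame to xts is one call, using the date column as the index:

library(xts)
oil_forecast_xts <- xts(oil_forecast_df[, -1], order.by = oil_forecast_df$date)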

Now we can combine the historical xts object with the forecasted xts object using cbind().
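
A sketch of the cbind() step, with the columns renamed for readability:

oil_combined_xts <- cbind(oil_monthly, oil_forecast_xts)
colnames(oil_combined_xts) <- c("actual", "forecast", "lo_95", "hi_95")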

It looks as it should. From January 2006 to February 2017 we have our actual data and NAs for the forecasted data. From March 2017 to August 2017, we have our mean forecast and 95% confidence levels and NAs for actual data. Said another way, we have four time series with observations at different dates, some of which are in the future. Most fortunately, dygraph provides a nice way to plot our actual time series versus our three forecasted time series because it simply does not plot the NAs.

Take a quick look back at our previous graph using the plot() function. At first glance, the dygraph might not seem so different. But, we can now make use of hovering/tooltips and, more importantly, we can zoom in on the forecasted numbers see them much more clearly. Plus, the whole world of dygraph functionality is now available to us!
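
A sketch of that final dygraph; dySeries() with a vector of lower/value/upper column names draws the forecast together with its 95% band, and dyRangeSelector() makes zooming easy. Column names follow the assumed combined object above.

library(dygraphs)

dygraph(oil_combined_xts, main = "WTI oil prices: actual and forecast") %>%
  dySeries("actual", label = "Actual") %>%
  dySeries(c("lo_95", "forecast", "hi_95"), label = "Forecast") %>%
  dyRangeSelector()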

That’s all for today. We have gotten some familiarity with Quandl, used forecast() to predict the next six months of oil prices, and done some data wrangling so we can use our old friend dygraphs. Next time, we will wrap this into a Shiny app so that users can choose their own parameters, and maybe even choose different commodities. See you then!


Because it’s Friday… The IKEA Billy index

Fri, 03/17/2017 - 10:47

(This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers)

Introduction

Because it is Friday, another ‘playful and frivolous’ data exercise.

IKEA is more than a store; it is a very nice experience to go through. I can drop off my two kids at Småland, have some ‘quality time’ walking around the store with my wife, and eat some delicious Swedish meatballs. Back at home, the IKEA furniture is a good ‘relation-tester’: try building a big wardrobe together with your wife…..

The nice thing about IKEA is that you don’t have to come to the store for nothing: you can check the availability of an item on the IKEA website beforehand.

According to the website, this availability is refreshed every 1.5 hours. That gave me an idea: if I check the availability every 1.5 hours, I could get an idea of the number of units sold of a particular item.

The IKEA Billy index

Probably the most iconic IKEA item is the Billy bookcase. Just in case you don’t know what this bookcase looks like, below is a picture: simplicity in its most elegant form….

Every 1.5 hours over the last few months, I have checked the Dutch IKEA website for the availability of this famous item in the 13 stores in the Netherlands, and calculated the negative differences between consecutive values.

The data you get from this little playful exercise do not necessarily represent the number of Billy bookcases actually sold. Maybe the stock was replenished in between, maybe items were moved internally to other stores. For example, if there are 50 Billys available in Amsterdam and 1.5 hours later there are 45, maybe 5 were sold, or maybe 6 were sold and 1 was returned or restocked? I just don’t know!

All I see are movements in availability that might have been caused by products being sold. But anyway, let’s call the movements in availability of the Billys the IKEA Billy index.
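
The sketch below illustrates the idea of turning availability snapshots into such an index; it is not the author’s actual collection code, and the snapshot table availability (with columns timestamp, store, and available) is an assumption.

library(dplyr)

# treat every decrease between consecutive snapshots as a proxy for items sold
billy_daily <- availability %>%                     # hypothetical table of 1.5-hourly snapshots
  group_by(store) %>%
  arrange(timestamp) %>%
  mutate(change = available - lag(available)) %>%   # difference between consecutive snapshots
  ungroup() %>%
  filter(change < 0) %>%                            # keep only decreases in availability
  mutate(sold_proxy = -change, date = as.Date(timestamp)) %>%
  group_by(date) %>%
  summarise(index = sum(sold_proxy))                # the daily IKEA Billy index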

Some graphs of the Billy Index

Trends and forecasts

Facebook released a nice R package, called prophet. It can be used to perform forecasting on time series, and it is used internally by Facebook across many applications. I ran the prophet forecasting algorithm on the IKEA Billy index. The graph below shows the result.
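
A hedged sketch of running prophet on such a daily index follows; prophet expects a data frame with columns ds and y, and billy_daily is the assumed aggregate from the sketch above.

library(prophet)

m <- prophet(data.frame(ds = billy_daily$date, y = billy_daily$index))
future <- make_future_dataframe(m, periods = 30)   # forecast 30 days ahead
fcst <- predict(m, future)
plot(m, fcst)                                      # history plus forecast
prophet_plot_components(m, fcst)                   # trend and weekly seasonality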

There are some high peaks at the end of October and at the end of December. We can also clearly see the Saturday peaks, which the algorithm has picked up from the historical data and projected into its future forecasts.

Weekday and color

The graph above already showed that the Billy index is high on Saturdays; what about the other days? The graph below depicts the sum of the IKEA Billy index per weekday since I started collecting this data (end of September). Wednesdays and Thursdays are less active days.


Clearly, most of the Billys are white.

Correlations

Does the daily Billy Index correlate with other data? I have used some Dutch weather data that can be downloaded from the Royal Netherlands Meteorological Institute (KNMI). The data consists of many daily weather variables. The graph below shows a correlation matrix of the IKEA Billy Index and only some of these weather variables.
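
A minimal sketch of such a correlation check follows; the KNMI file name and the weather column names (mean_temp, sunshine, precipitation, wind_speed) are illustrative assumptions, not the actual KNMI variable names.

library(dplyr)

knmi <- read.csv("knmi_daily.csv") %>%   # hypothetical local extract of daily KNMI data
  mutate(date = as.Date(date))

cor_data <- billy_daily %>%
  inner_join(knmi, by = "date") %>%
  select(billy_index = index, mean_temp, sunshine, precipitation, wind_speed)

round(cor(cor_data, use = "pairwise.complete.obs"), 2)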


The only correlation of some meaning between the IKEA Billy index and a weather variable is with wind speed (-0.19): increasing wind speed means fewer Billys.


It’s an explainable correlation, of course…. You wouldn’t want to go to IKEA on (very) windy days; it is not easy to drive through strong winds with a Billy on top of your car.


Cheers, Longhow.


How many digits into pi do you need to go to find your birthday?

Fri, 03/17/2017 - 07:13

(This article was first published on R – Decision Science News, and kindly contributed to R-bloggers)

FIND YOUR BIRTHDAY IN PI, IN THREE DIFFERENT FORMATS

It was Pi Day (March 14, like 3/14, like 3.14, get it?) recently and Time Magazine did a fun interactive app in which you can find your birthday inside the digits of pi.

However:
1) They only used one format of birthday, in which July 4th would be 704. But there are other ways to write July 4. Like 0704 or 74. And those must be investigated, right?
2) Our friend Dave Pennock said he wanted to see the locations of every possible birthday inside pi.

So here they are. Enjoy! And R code below if you are interested. Plot and other stuff use Hadley Wickham’s Tidyverse tools.

And what the heck, have a spreadsheet of fun (i.e., spreadsheet with all the birthdays and locations in it).

Birthday   Location   Shorter form   Location   Shortest form   Location
01-01      2563       1-01           853        1-1             95
01-02      6599       1-02           164        1-2             149
01-03      25991      1-03           3487       1-3             111
01-04      1219       1-04           270        1-4             2
01-05      3982       1-05           50         1-5             4
…          …          …              …          …               …
12-30      7146       12-30          7146       12-30           7146
12-31      9451       12-31          9451       12-31           9451

(The full table, with the locations of all 366 possible birthdays in each of the three formats, is in the spreadsheet linked above.)

R CODE

library(ggplot2)
library(readr)
library(stringr)
library(tidyr)
pistring = read_file("pi.txt")
#choose 2016 b/c it's a leap year
dayvec = seq(as.Date('2016-01-01'), as.Date('2016-12-31'), by = 1)
dayvec = str_sub(dayvec, start = 6, end = 10)
shorterdayvec = sub("0(.-.)", "\\1", dayvec)
shortestdayvec = sub("(.-)0(.)$", "\\1\\2", shorterdayvec)
df = data.frame(
birthday = dayvec,
loc = str_locate(pistring, gsub("-", "", dayvec))[, 1],
shorter_bday = shorterdayvec,
shorter_loc = str_locate(pistring, gsub("-", "", shorterdayvec))[, 1],
shortest_bday = shortestdayvec,
shortest_loc = str_locate(pistring, gsub("-", "", shortestdayvec))[, 1]
)
write.csv(df, file = "birthdays_in_pi.csv", row.names = FALSE)
long_df = df %>% gather(key, location, loc, shorter_loc, shortest_loc)
p = ggplot(long_df, aes(x = location, fill = key))
p = p + geom_density(alpha = .3)
p = p + coord_cartesian(xlim = c(0, 40000))
p = p + theme(legend.position = "bottom")
p
ggsave(
plot = p,
file = "birthdays_in_pi.pdf",
height = 4,
width = 4
)

The post How many digits into pi do you need to go to find your birthday? appeared first on Decision Science News.


what does more efficient Monte Carlo mean?

Fri, 03/17/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

“I was just thinking that there might be a magic trick to simulate directly from this distribution without having to go for less efficient methods.”

A simple question on X validated a few days ago [about simulating from x²φ(x)] came with the remark that the person asking wanted a direct simulation method for higher efficiency, compared with an accept-reject solution. Which shows a misunderstanding of what “efficiency” means in Monte Carlo settings. If it means anything, I would think it is reflected in the average time taken to return one simulation, and possibly in the worst case. But there is no reason to call an inverse-cdf method more efficient than an accept-reject or a transform approach, since it all depends on the time it takes to make the inversion compared with the other solutions… Since inverting the closed-form cdf in this example is much more expensive than generating a Gamma(3/2,1/2) and taking plus or minus its square root, this is certainly the case here. Maybe a ziggurat method could be devised, especially since x²φ(x)<φ(x) when |x|≤1, but I am not sure it is worth the effort!
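
As a concrete illustration (my own sketch, not code from the post), here are both samplers for the density proportional to x²φ(x); which one is “more efficient” comes down to nothing more than the time each takes.

set.seed(1)
n <- 1e5

# direct method: X^2 ~ Gamma(3/2, 1/2), i.e. a chi-squared with 3 df, then attach a random sign
x_direct <- sample(c(-1, 1), n, replace = TRUE) * sqrt(rgamma(n, shape = 3/2, rate = 1/2))

# accept-reject with a N(0, sd = sqrt(2)) proposal; the bound M = 4*sqrt(2)/e is attained at |x| = 2
f <- function(x) x^2 * dnorm(x)           # target density (already normalised)
g <- function(x) dnorm(x, sd = sqrt(2))   # proposal density
M <- 4 * sqrt(2) / exp(1)
y <- rnorm(3 * n, sd = sqrt(2))           # acceptance rate 1/M ~ 0.48, so 3n proposals suffice in practice
x_ar <- y[runif(3 * n) < f(y) / (M * g(y))][1:n]

# comparing the two with system.time() is what actually settles the efficiency question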



Book Review: Testing R Code

Thu, 03/16/2017 - 22:00

When it comes to getting things right in data science, most of the focus goes to the data and the statistical methodology used. But when a misplaced parenthesis can throw off your results entirely, ensuring correctness in your programming is just as important. A new book published by CRC Press, Testing R Code by Richard (Richie) Cotton, provides all the guidance you’ll need to write robust, correct code in the R language.

This is not a book exclusively for package developers: data science, after all, is a combination of programming and statistics, and this book provides lots of useful advice on helping to ensure the programming side of things is correct. The book suggests you begin by using the assertive package to provide alerts when your code runs off the rails. From there, you can move on to creating formal unit tests as you code, using the testthat package. There's also a wealth of useful advice for organizing and writing code that's more maintainable in the long run.
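
For illustration, here is a minimal sketch of those two styles (my own example, not an excerpt from the book): a run-time assertion with assertive and a unit test with testthat.

library(assertive)
library(testthat)

half <- function(x) {
  assert_is_numeric(x)   # run-time check: fail fast with a readable error
  x / 2
}

test_that("half() halves numeric input", {
  expect_equal(half(4), 2)
  expect_error(half("a"))   # the assertion turns bad input into an explicit error
})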

If you are a package developer, there's lots here for you too. There are detailed instructions on incorporating unit tests into your package, and on how tests interact with version control and continuous integration systems. It also touches on advanced topics relevant to package authors, like debugging C or C++ code embedded via the Rcpp package. And there's more good advice, like how to write better error messages so your users can recover more easily when an error does occur.

Testing is a topic that doesn't get as much attention as it deserves in data science disciplines. One reason may be that it's a fairly dry topic, but Cotton does a good job in making the material engaging with practical examples and regular exercises (with answers in the appendix). Frequent (and often amusing) footnotes help make this an entertaining read (given the material) and hopefully will motivate you to make testing a standard (and early) part of your R programming process. 

Testing R Code is available from your favorite bookseller now, in hardback and electronic formats.


Mapping Housing Data with R

Thu, 03/16/2017 - 14:28

What is my home worth?  Many homeowners in America ask themselves this question, and many have an answer.  What does the market think, though?  The best way to estimate a property’s value is by looking at other, similar properties that have sold recently in the same area – the comparable sales approach.  In an effort to allow homeowners to do some exploring (and because I needed a new project), I developed a small Shiny app with R.

My day job allows me access to the local multiple listing service, which provides a wealth of historic data.  The following project makes use of that data to map real estate that has sold near Raleigh, NC in the past six months.  Without getting too lost in the weeds I’ll go over a few key parts of the process.  Feel free to jump over to my GitHub page to check out the full source code.  Click here to view the app!

  1. Geocode everything.  The data did not come with latitude and longitude coordinates, so we’ll have to do some geocoding.  I haven’t found an efficient way to do this in R, so, like in the mailing list example, I’ll use QGIS to process my data and return a .csv for each town I’m interested in.
  2. Setup your data.  To make sure that everything runs smoothly later on, we’ve got to import our data using readr and make sure each attribute is typed properly.

     library(readr)
     apex <- read_csv("apex2.csv")

     #Remove non-character elements from these columns.
     df$`Sold Price` <- as.numeric(gsub("[^0-9]","",df$`Sold Price`))
     df$`List Price` <- as.numeric(gsub("[^0-9]","",df$`List Price`))

     #Some re-typing for later.
     df$Fireplace <- as.numeric(df$Fireplace)
     df$`New Constr` <- as.factor(df$`New Constr`)

     #Assign some placeholders.
     assign("latitude", NA, envir = .GlobalEnv)
     assign("longitude", NA, envir = .GlobalEnv)
  3. Get info from the user.  The first thing the app wants you to do is give it some characteristics about the subject property, a property that you are interested in valuating.  A function further down uses this information to produce a map using these inputs.

     #What city's dataset are we using?
     selectInput("city", label = "City", c("Apex", "Cary", "Raleigh"))

     #Get some info.
     textInput("address", label = "Insert Subject Property Address", value = "2219 Walden Creek Drive"),
     numericInput("dist", label = "Miles from Subject", value = 5, max = 20),
     numericInput("footage", label = "Square Footage", value = 2000),
     selectInput("acres", label = "How Many Acres?", acresf)

     #Changes datasets based on what city you choose on the frontend.
     #This expression is followed by two more else if statements.
     observeEvent(input$city, {
       if(input$city == "Apex") {
         framework_retype(apex)
         cityschools <- schoolsdf$features.properties %>% filter(ADDRCITY_1 == "Apex")
         assign("cityschools", cityschools, envir = .GlobalEnv)

     #Draw the map on click.
     observeEvent(input$submit, {
       output$map01 <- renderLeaflet({
         distanceFrom(input$address, input$footage, input$acres, tol = .15, input$dist)
       })
     })
  4. Filter the data.  The distanceFrom function above uses dplyr to filter the properties in the selected city by square footage, acreage, and distance from the subject property.  The tol argument is used to give a padding around square footage – few houses match exactly in that respect.

     #Filter once.
     houses_filtered <- houses %>%
       filter(Acres == acres) %>%
       filter(LvngAreaSF >= ((1-tol)*sqft)) %>%
       filter(LvngAreaSF <= ((1+tol)*sqft))

     #This grabs lat & long from Google.
     getGeoInfo(subj_address)
     longitude_subj <- as.numeric(longitude)
     latitude_subj <- as.numeric(latitude)

     #Use the comparable house locations.
     xy <- houses_filtered[,1:2]
     xy <- as.matrix(xy)

     #Calculate distance.
     d <- spDistsN1(xy, c(longitude_subj, latitude_subj), longlat = TRUE)
     d <- d/1.60934
     d <- substr(d, 0, 4)

     #Filter again.
     distance <- houses_filtered %>% filter(distanceMi <= dist)
  5. Draw the map. The most important piece, the map, is drawn using Leaflet.  I have the Schools layer hidden initially because it detracts from the main focus – the houses.

     map <- leaflet() %>%
       addTiles(group = "Detailed") %>%
       addProviderTiles("CartoDB.Positron", group = "Simple") %>%
       addAwesomeMarkers(lng = longitude, lat = latitude, popup = subj_address,
                         icon = awesomeIcons(icon = 'home', markerColor = 'red'),
                         group = "Subject Property") %>%
       addAwesomeMarkers(lng = distance$X, lat = distance$Y,
                         popup = paste(distance$Address, distance$`Sold Price`, distance$distanceMi, sep = ""),
                         icon = awesomeIcons(icon = 'home', markerColor = 'blue'),
                         group = "Comps") %>%
       addAwesomeMarkers(lng = schoolsdf$long, lat = schoolsdf$lat,
                         icon = awesomeIcons(icon = 'graduation-cap', library = 'fa',
                                             markerColor = 'green', iconColor = '#FFFFFF'),
                         popup = schoolsdf$features.properties$NAMELONG,
                         group = "Schools") %>%
       addLayersControl(baseGroups = c("Simple", "Detailed"),
                        overlayGroups = c("Subject Property", "Comps", "Schools"),
                        options = layersControlOptions(collapsed = FALSE))

     map <- map %>% hideGroup(group = "Schools")
  6. Regression model.  The second tab at the top of the page leads to more information input that is used in creating a predictive model for the subject property.  The implementation is somewhat messy, so if you’d like to check it out the code is at the bottom of app.R in the GitHub repo.

That’s it!  It took a while to get all the pieces together, but I think the final product is useful and I learned a lot along the way.  There are a few places I want to improve: simplify the re-typing sections, make elements refresh without clicking submit, among others.  If you have any questions about the code please leave a comment or feel free to send me an email.

Happy coding,

Kiefer Smith


Forrester’s 2017 Take on Tools for Data Science

Thu, 03/16/2017 - 13:00

In my ongoing quest to track The Popularity of Data Science Software, I’ve updated the discussion of the annual report from Forrester, which I repeat here to save you from having to read through the entire document. If your organization is looking for training in the R language, you might consider my books, R for SAS and SPSS Users or R for Stata Users, or my on-site workshops.

Forrester Research, Inc. is a company that provides reports which analyze the competitive position of tools for data science. The conclusions from their 2017 report, Forrester Wave: Predictive Analytics and Machine Learning Solutions, are summarized in Figure 3b. On the x-axis they list the strength of each company’s strategy, while the y-axis measures the strength of their current offering. The size and shading of the circles around each data point indicate the strength of each vendor in the marketplace (70% vendor size, 30% ISV and service partners).

As with the Gartner 2017 report discussed above, IBM, SAS, KNIME, and RapidMiner are considered leaders. However, Forrester sees several more companies in this category: Angoss, FICO, and SAP. This is quite different from the Gartner analysis, which places Angoss and SAP in the middle of the pack, while FICO is considered a niche player.

Figure 3b. Forrester Wave plot of predictive analytics and machine learning software.

In their Strong Performers category, they have H2O.ai, Microsoft, Statistica, Alpine Data, Dataiku, and, just barely, Domino Data Labs. Gartner rates Dataiku quite a bit higher, but they generally agree on the others. The exception is that Gartner dropped coverage of Alpine Data in 2017. Finally, Salford Systems is in the Contenders section. Salford was recently purchased by Minitab, a company that has never been rated by either Gartner or Forrester before as they focused on being a statistics package rather than expanding into machine learning or artificial intelligence tools as most other statistics packages have (another notable exception: Stata). It will be interesting to see how they’re covered in future reports.

Compared to last year’s Forrester report, KNIME shot up from barely being a Strong Performer into the Leader’s segment. RapidMiner and FICO moved from the middle of the Strong Performers segment to join the Leaders. The only other major move was a lateral one for Statistica, whose score on Strategy went down while its score on Current Offering went up (last year Statistica belonged to Dell, this year it’s part of Quest Software.)

The size of the “market presence” circle for RapidMiner indicates that Forrester views its position in the marketplace to be as strong as that of IBM and SAS. I find that perspective quite a stretch indeed!

Alteryx, Oracle, and Predixion were all dropped from this year’s Forrester report. They mention Alteryx and Oracle as having “capabilities embedded in other tools”, implying that that is not the focus of this report. No mention was made of why Predixion was dropped, but considering that Gartner also dropped coverage of them in 2017, it doesn’t bode well for the company.

R + D3, A powerful and beautiful combo

Thu, 03/16/2017 - 01:00

For a while now, I have been looking for ways to incorporate pretty D3 graphics into R Markdown reports or Shiny apps. Why? I like R for the ability to do data handling, stats, and ML very easily with minimal code. But when you want to present something to clients or the public, there is no competition for the front-end web stuff, e.g. d3.js. The answer: htmlwidgets.

Here is my attempt so far

Ok, well, I wasn’t looking too hard, because it completely escaped me that you could do this with htmlwidgets. I stumbled upon it when I was browsing the Shiny user showcase and came across the FRISS Analytics example, which is accompanied by a really fantastic set of tutorials under ‘other tutorials’. If you want to customise this for your own plotting needs, I strongly suggest you start with those tutorials. They are very thorough and step-wise, and will allow you to build an R-D3 binding for the first time with very little hassle, as I have done.

It looks a little like this…

  • You write a package.
  • The package contains R functions.
  • Those functions invoke JavaScript as you define it.
  • htmlwidgets binds them together and provides the plumbing between your R objects and the JavaScript that you feed them into (a minimal sketch of the R side follows).
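
The sketch below shows what the R side of such a binding can look like; the widget name, the payload list, and the package name are illustrative assumptions, not code from the FRISS tutorial.

library(htmlwidgets)

# the function users call from R; everything in `x` is serialised to JSON
# and handed to the JavaScript binding with the same name
c3_gauge <- function(value, width = NULL, height = NULL) {
  createWidget(
    name = "c3_gauge",        # must match the .js binding and the .yaml dependency file
    x = list(value = value),  # payload passed to the JavaScript renderValue() function
    width = width,
    height = height,
    package = "mywidgets"     # hypothetical package that ships the JS assets
  )
}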

The outcome is that you can easily create an interactive graphic simply by calling an R function. The real beauty, though, is being able to update the JavaScript plot in response to user input on the R side, without drawing a new plot each time in the slightly awkward way that I was previously doing. If you have loaded the app, try typing extra zeros into the sample size; you’ll see the plot update as the underlying data is updated. Smooth. This is what I was looking for.

Of course, you don’t need to be a JavaScript programmer to get this done. You can use higher-level JS libraries such as C3 (in my case) or NVD3; maybe there are more?

So all in all the chain is…

R -> htmlwidgets -> || C3 -> D3 -> JavaScript.

Where htmlwidgets is reaching through the border between R and JavaScript.

I forked the R package from FRISS and when I get some time or there is a need, I will try to port some more C3 based templates and expose them to R functions in this way.

This post is obviously not a tutorial just a flag and a signpost, to find out how to do this yourself, go to the FRISS Analytics tutorial either here on rstudio or here on github. Thanks a million to the folks at FRISS analytics and the authors of htmlwidgets.

R, GIS, and fuzzyjoin to reconstruct demographic data for NUTS regions of Denmark

Thu, 03/16/2017 - 01:00
NUTS

NUTS stands for the Nomenclature of Territorial Units for Statistics. The history of NUTS dates back to the beginning of the 1970s, when European countries developed unified standards for systems of administrative geography. It was not until the beginning of this century that such a system finally became widely used. There are three main hierarchical levels of NUTS, and the most commonly used for regional analysis is NUTS-2.


Figure 1. Illustration of the principle of NUTS hierarchical system

One of the objectives of NUTS was to provide more or less comparable administrative divisions for all countries of Europe. Nevertheless, in 2013, population figures for single NUTS-2 regions ranged from 28.5 thousand in the Åland Islands (Finland) to almost 12 million in Ile-de-France (Paris and surroundings, France).

The broken time series

Quite arbitrary in its essence, territorial division tends to evolve. Changes in administrative boundaries can cause problems for regional analysis, as they break the time series and therefore make it harder to analyze trends. Despite this inconvenience, the boundaries of regions actually change quite often, based on the needs and interests of local or national governments. Eurostat tracks all modifications, providing detailed explanations of all the changes that happen between versions of NUTS (figure 2).


Figure 2. Changes in NUTS between versions 2006 and 2010

Despite this, Eurostat does not recalculate historic demographic data to match the most recent NUTS version. This means that, for the most recent version of NUTS, there is missing data for all years before the latest administrative change. So researchers have to reconstruct historical data manually to obtain a long time series. Of course, crude assumptions often have to be accepted in order to approximate the population figures for the current regions that did not exist in the past.

To make things even more complex, Eurostat provides the data only for the latest version of NUTS (at least, I did not work out how to download previous versions). In my PhD project I carry out regional analysis for the NUTS-2 regions of the European Union. To have the longest possible time series, when I did the data preparation in 2015, I chose the 2010 version of NUTS, on which the regional demographic projection EUROPOP2013 is based. For reproducibility, I uploaded the precise versions of the Eurostat data at NUTS-2 level on population age structures and deaths, as downloaded in 2015, to figshare.

Denmark

Some countries had to make major changes to their systems of territorial division to fit the NUTS standards. The most significant reform happened in Denmark in 2007, when the former 271 municipalities were transformed into 98 new municipalities. At the same time, NUTS was introduced, so that the 98 municipalities were allocated to 11 NUTS-3 regions, which aggregate to 5 NUTS-2 regions. As is typical for a small country, there is only one NUTS-1 region in Denmark, which is the whole country.

As far as I know, there was no official attempt by Eurostat to reconstruct the time series for Denmark before 2007. The typical Eurostat map for the pre-2007 period shows Denmark as a “no data available” country (figure 3).


Figure 3. Life expectancy at birth in European NUTS-2 regions, 2006; a screenshot from the Eurostat’s interactive data exploratory tool

Such a data loss is somewhat surprising for a country such as Denmark. It might be quite difficult to match the old and new municipal systems; but it should be relatively easy to re-aggregate the old municipalities into the new (higher-level) NUTS regions. That is precisely what I did during my data preparation [1], and what I now want to share in this post.

The task is basically to identify which of the old 271 municipalities are located within the modern 11 NUTS-3 regions and to aggregate the municipal data. Then, the NUTS-3 data is easily aggregated to the NUTS-2 level. Such a task could have meant working late into the night, but luckily we live in the GIS era. I used GIS to match the old municipalities with the NUTS-3 regions. Here I want to show (with code) how the task can be performed using the amazing and open-source R. Below I show the process of matching old municipalities to the NUTS regions and the process that I used to aggregate population data.

Data

The data on the population age structures for the old 271 municipalities of Denmark was downloaded from the official website of Statistics Denmark. The system only allows you to grab up to 10K cells for unregistered users and up to 100K for registered users, so the process of downloading the data involves some tedious manual manipulations. For the purpose of my PhD project, I downloaded the data for the period 2001-2006; but, if needed, the data is available since 1979. The data, downloaded in 2015 and ‘tidied up’, can be found here.

I have spent a lot of time trying to find geodata with the boundaries of the old municipalities. Now, coming back to the topic more than 1.5 years later, I failed to identify the original source of the shapefile, though I am pretty sure that it came from here [2]. The copy of the shapefile that I used can be found here.

Finally, we need a shapefile of NUTS-3 regions. It can be easily downloaded from Eurostat geodata repository. The shapefile that I used is “NUTS_2010_20M_SH.zip”. The selection of the 11 Danish regions can be found here.

The projection used for both shapefiles is EPSG-3044, the one often used to map Denmark.

Now, the code to prepare the R session and load the data.

# set locale and encoding parameters to read Danish
if(Sys.info()['sysname']=="Linux"){
  Sys.setlocale("LC_CTYPE", "da_DK.utf8")
  danish_ecnoding <- "WINDOWS-1252"
}else if(Sys.info()['sysname']=="Windows"){
  Sys.setlocale("LC_CTYPE", "danish")
  danish_ecnoding <- "Danish_Denmark.1252"
}

# load required packages (install first if needed)
library(tidyverse)    # version: 1.0.0
library(ggthemes)     # version: 3.3.0
library(rgdal)        # version: 1.2-4
library(rgeos)        # version: 0.3-21
library(RColorBrewer) # version: 1.1-2
brbg <- brewer.pal(11, "BrBG")  # palette used in the maps below
library(fuzzyjoin)    # version: 0.1.2
library(viridis)      # version: 0.3.4

# load Denmark pop structures for the old municipalities
df <- read_csv("https://ikashnitsky.github.io/doc/misc/nuts2-denmark/BEF1A.csv.gz")

# create a directory for geodata
ifelse(!dir.exists("geodata"), dir.create("geodata"), "Directory already exists")

# download, unzip and read Danish NUTS-3 geodata (31KB)
url_nuts <- "https://ikashnitsky.github.io/doc/misc/nuts2-denmark/denmark-nuts3-espg3044.tgz"
path_nuts <- "geodata/denmark-nuts3-espg3044.tgz"
ifelse(!file.exists(path_nuts), download.file(url_nuts, path_nuts, mode="wb"), 'file already exists')
# If there are problems downloading the data automatically, please download it manually from
# https://ikashnitsky.github.io/doc/misc/nuts2-denmark/denmark-nuts3-espg3044.tgz
untar(tarfile = path_nuts, exdir = "geodata")
sp_nuts3 <- readOGR(dsn = "geodata/.", layer = "denmark-nuts3-espg3044")
gd_nuts3 <- fortify(sp_nuts3, region = "NUTS_ID") # to the ggplot format

# download, unzip and read Danish old municipal geodata (6.0MB)
url_mun <- "https://ikashnitsky.github.io/doc/misc/nuts2-denmark/kommune2006win1252.tgz"
path_mun <- "geodata/kommune2006win1252.tgz"
ifelse(!file.exists(path_mun), download.file(url_mun, path_mun, mode="wb"), 'file already exists')
# If there are problems downloading the data automatically, please download it manually from
# https://ikashnitsky.github.io/doc/misc/nuts2-denmark/kommune2006utf8.tgz
untar(tarfile = path_mun, exdir = "geodata")
sp_mun <- readOGR(dsn = "geodata/.", layer = "kommune2006win1252", encoding = danish_ecnoding)
gd_mun <- fortify(sp_mun)

# coordinates of the municipalities
mun_coord <- bind_cols(as.data.frame(coordinates(sp_mun)), sp_mun@data[,1:3]) %>%
  transmute(long = V1, lat = V2, enhedid, objectid, name = navn)

Spatial matching

Let’s first have a look at the map.

ggplot()+
  geom_polygon(data = gd_nuts3, aes(long, lat, group = group),
               color = brbg[3], fill = "grey90", size = 1)+
  geom_point(data = mun_coord, aes(long, lat), color = brbg[10], size = 1)+
  theme_map()


Figure 4. Reference map of the old municipalities and NUTS-3 regions of Denmark

We can easily see that the boundaries of the municipalities (light blue) are much more precise than those of the NUTS-3 regions (orange/brown). This is not a problem as long as all the centroids of the municipalities fall within the boundaries of the NUTS-3 regions, which seems to be true for all municipalities except the easternmost one. A quick check reveals that this is Christiansø, a tiny fortified island whose history goes back to the Middle Ages. It has a special status and is not included in the NUTS system. For further manipulations, Christiansø can safely be merged with the nearby Bornholm.

To identify which municipalities fall into which NUTS regions, I use the spatial overlay function (over) from the sp package. Here I should thank Roger Bivand, the person who made it possible to do any spatial analysis in R.

# municipality coordinates to Spatial
mun_centr <- SpatialPoints(coordinates(sp_mun), proj4string = CRS(proj4string(sp_nuts3)))

# spatial intersection with sp::over
inter <- bind_cols(mun_coord, over(mun_centr, sp_nuts3[,"NUTS_ID"])) %>%
  transmute(long, lat, objectid, nuts3 = as.character(NUTS_ID), nuts2 = substr(nuts3, 1, 4))

Let’s again check visually if the spatial matching worked okay.

ggplot()+
  geom_polygon(data = gd_mun, aes(long, lat, group = group),
               color = brbg[9], fill = "grey90", size = .1)+
  geom_polygon(data = gd_nuts3, aes(long, lat, group = group),
               color = brbg[3], fill = NA, size = 1)+
  geom_point(data = inter, aes(long, lat, color = nuts3), size = 1)+
  geom_point(data = inter[is.na(inter$nuts3),], aes(long, lat),
             color = "red", size = 7, pch = 1)+
  theme_map(base_size = 15)+
  theme(legend.position = c(1, 1), legend.justification = c(1, 1))


Figure 5. Checking the spatial intersection between the old municipalities and NUTS-3 regions of Denmark

Not bad. But there is an “NA” category that represents all the cases where the spatial match failed. How many such cases do we have?

# how many failed cases do we have
sum(is.na(inter$nuts3))

## [1] 3

# where the intersection failed
inter[is.na(inter$nuts3),]

##         long     lat objectid nuts3 nuts2
## 23  892474.0 6147918    46399  <NA>  <NA>
## 65  504188.4 6269329   105319  <NA>  <NA>
## 195 533446.8 6312770    47071  <NA>  <NA>

As there are only 3 cases, I decided to fix them manually.

# fix the three cases manually
fixed <- inter
fixed[fixed$objectid=="46399", 4:5] <- c("DK014", "DK01")
fixed[fixed$objectid=="105319", 4:5] <- c("DK041", "DK04")
fixed[fixed$objectid=="47071", 4:5] <- c("DK050", "DK05")

The final visual check.

ggplot()+
  geom_polygon(data = gd_mun, aes(long, lat, group = group),
               color = brbg[9], fill = "grey90", size = .1)+
  geom_polygon(data = gd_nuts3, aes(long, lat, group = group),
               color = brbg[3], fill = NA, size = 1)+
  geom_point(data = fixed, aes(long, lat, color = nuts3), size = 1)+
  theme_map(base_size = 15)+
  theme(legend.position = c(1, 1), legend.justification = c(1, 1))


Figure 6. Re-checking the spatial intersection between the old municipalities and NUTS-3 regions of Denmark

Now everything seems okay.

Joining spatial and statistical data (fuzzy join)

The next task is to join the spatial data and the statistical data together. The spatial layer for municipalities does not contain the codes that are used by Statistics Denmark, so I have to match municipalities in the two datasets by their names. This is quite a difficult task: names can be written slightly differently, there are some special characters in the Danish alphabet, and some municipalities may have experienced a change of name. To solve the task most efficiently, I used the ‘Fuzzy String Matching’ approach, which is implemented in the fuzzyjoin package by David Robinson.

First, I simplify the names in both datasets turning them into lowercase, replacing the character “å” with “aa”, and removing the “Kommune” word in the spatial dataset names. Please note that I downloaded (separately) a small selection from Statistics Denmark to have a lightweight dataframe with municipal codes and names.

# simplify municipalities names
mun_geo <- mun_coord %>%
  transmute(name = sub(x = name, " Kommune", replacement = ""), objectid) %>%
  mutate(name = gsub(x = tolower(name), "å", "aa"))

mun_stat <- read.csv2("https://ikashnitsky.github.io/doc/misc/nuts2-denmark/stat-codes-names.csv",
                      fileEncoding = danish_ecnoding) %>%
  select(name) %>%
  separate(name, into = c("code", "name"), sep = " ", extra = "merge") %>%
  mutate(name = gsub("\\s*\\([^\\)]+\\)", "", x = name)) %>%
  mutate(name = gsub(x = tolower(name), "å", "aa"))

Let’s try fuzzy join.

# first attempt
fuz_joined_1 <- regex_left_join(mun_geo, mun_stat, by = "name")

The resulting dataframe has 278 rows instead of 271. That means that for some municipalities in the spatial dataset there was more than one match. Let’s identify them.

# identify more that 1 match (7 cases) and select which to drop
fuz_joined_1 %>% group_by(objectid) %>% mutate(n = n()) %>% filter(n > 1)

## Source: local data frame [14 x 5]
## Groups: objectid [7]
##
##         name.x objectid  code      name.y     n
##          <chr>    <dbl> <chr>       <chr> <int>
## 1       haslev   105112   313      haslev     2
## 2       haslev   105112   403       hasle     2
## 3  brønderslev    47003   739       rønde     2
## 4  brønderslev    47003   805 brønderslev     2
## 5    hirtshals    47037   817        hals     2
## 6    hirtshals    47037   819   hirtshals     2
## 7      rønnede    46378   385     rønnede     2
## 8      rønnede    46378   407       rønne     2
## 9     hvidebæk    46268   317    hvidebæk     2
## 10    hvidebæk    46268   681     videbæk     2
## 11    ryslinge    46463   477    ryslinge     2
## 12    ryslinge    46463   737          ry     2
## 13     aarslev    46494   497     aarslev     2
## 14     aarslev    46494   861        aars     2

So, for 7 municipalities, two matches were found. I will drop the imperfect match variants in the next iteration of fuzzy join.

The other issue is the municipalities for which no match was found in that statistical data.

# show the non-matched cases
fuz_joined_1 %>% filter(is.na(code))

##              name.x objectid code name.y
## 1              faxe   105120 <NA>   <NA>
## 2 nykøbing falsters    46349 <NA>   <NA>
## 3       herstederne    46101 <NA>   <NA>

As there are only three such cases, I corrected them manually in the spatial data to match the statistical data. There are two cases where the name of the municipality is simply written differently and one case of a name change.

# correct the 3 non-matching geo names
mun_geo_cor <- mun_geo
mun_geo_cor[mun_geo_cor$name=="faxe", "name"] <- "fakse"
mun_geo_cor[mun_geo_cor$name=="nykøbing falsters", "name"] <- "nykøbing f."
mun_geo_cor[mun_geo_cor$name=="herstederne", "name"] <- "albertslund"

Now the second attempt to match the datasets (spatial dataset is corrected).

# second attempt
fuz_joined_2 <- regex_left_join(mun_geo_cor, mun_stat, by = "name")

# drop non-perfect match
fuz_joined_2 <- fuz_joined_2 %>%
  group_by(objectid) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  filter(n < 2 | name.x==name.y)

fuz_joined_2 <- fuz_joined_2 %>%
  transmute(name = name.x, objectid, code)

The output looks perfect. Now, the last step – using the matched “objectid” field, I will finally attach the NUTS data to statistical codes.

# finally, attach the NUTS info to matched table
key <- left_join(fuz_joined_2, fixed, "objectid")

Aggregate old municipal data to NUTS levels

The previous manipulations yielded a dataframe that links statistical codes of the old municipalities with the corresponding NUTS regions. The last thing that has to be done is aggregation. I will attach the “key” dataset to a statistical dataset and aggregate the data at NUTS-3 and NUTS-2 levels.

# finally, we only need to aggregate the old stat data
df_agr <- left_join(key, df, "code") %>%
  filter(!is.na(name)) %>%
  gather("year", "value", y2001:y2006)

df_nuts3 <- df_agr %>%
  group_by(year, sex, age, nuts3) %>%
  summarise(value = sum(value)) %>%
  ungroup()

df_nuts2 <- df_agr %>%
  group_by(year, sex, age, nuts2) %>%
  summarise(value = sum(value)) %>%
  ungroup()

Let’s now calculate the shares of working age population in Danish NUTS-3 regions in 2001 and map the information.

# total population in 2001 by NUTS-3 regions
tot_01 <- df_nuts3 %>% 
        filter(year=="y2001") %>% 
        group_by(nuts3) %>% 
        summarise(tot = sum(value, na.rm = TRUE)) %>% 
        ungroup()

# working-age population in 2001 by NUTS-3 regions
working_01 <- df_nuts3 %>% 
        filter(year=="y2001", age %in% paste0("a0", 15:64)) %>% 
        group_by(nuts3) %>% 
        summarise(work = sum(value, na.rm = TRUE)) %>% 
        ungroup()

# calculate the shares of working age population
sw_01 <- left_join(working_01, tot_01, "nuts3") %>% 
        mutate(sw = work / tot * 100)

# map the shares of working age population in 2001 by NUTS-3 regions
ggplot()+
        geom_polygon(data = gd_nuts3 %>% left_join(sw_01, c("id" = "nuts3")), 
                     aes(long, lat, group = group, fill = sw), 
                     color = "grey50", size = 1) +
        scale_fill_viridis()+
        theme_map(base_size = 15)+
        theme(legend.position = c(1, 1), legend.justification = c(1, 1))


Figure 7. The share of working age (15-64) population by NUTS-3 regions of Denmark in 2001

The result (thankfully!) looks realistic, with higher shares of the working-age population in the capital region, and in other regions that have relatively big cities.

This post is written for Demotrends
  1. I spent quite some time searching whether someone else had already done this job before me, and failed to find anything. 

  2. There is a note on the website saying that, due to a planned change in the structure of the website, there might be some problems with data acquisition. I failed to download the geodata on 2017-02-23. 

Plotting trees from Random Forest models with ggraph

Thu, 03/16/2017 - 01:00

Today, I want to show how I use Thomas Lin Pedersen’s awesome ggraph package to plot decision trees from Random Forest models.

I am very much a visual person, so I try to plot as much of my results as possible because it helps me get a better feel for what is going on with my data.

A nice aspect of using tree-based machine learning, like Random Forest models, is that they are more easily interpreted than e.g. neural networks, because they are based on decision trees. So, when I am using such models, I like to plot the final decision trees (if they aren’t too large) to get a sense of which decisions are underlying my predictions.

There are a few very convenient ways to plot the outcome if you are using the randomForest package, but I like to have as much control as possible over the layout, colors, labels, etc. And because I didn’t find a solution I liked for caret models, I developed the following little function (below you may find information about how I built the model):

As input, it takes part of the output from model_rf <- caret::train(... "rf" ...), that gives the trees of the final model: model_rf$finalModel$forest. From these trees, you can specify which one to plot by index.

library(dplyr)
library(ggraph)
library(igraph)

tree_func <- function(final_model, tree_num) {
  
  # get tree by index
  tree <- randomForest::getTree(final_model, 
                                k = tree_num, 
                                labelVar = TRUE) %>%
    tibble::rownames_to_column() %>%
    # make leaf split points to NA, so the 0s won't get plotted
    mutate(`split point` = ifelse(is.na(prediction), `split point`, NA))
  
  # prepare data frame for graph
  graph_frame <- data.frame(from = rep(tree$rowname, 2),
                            to = c(tree$`left daughter`, tree$`right daughter`))
  
  # convert to graph and delete the last node that we don't want to plot
  graph <- graph_from_data_frame(graph_frame) %>%
    delete_vertices("0")
  
  # set node labels
  V(graph)$node_label <- gsub("_", " ", as.character(tree$`split var`))
  V(graph)$leaf_label <- as.character(tree$prediction)
  V(graph)$split <- as.character(round(tree$`split point`, digits = 2))
  
  # plot
  plot <- ggraph(graph, 'dendrogram') + 
    theme_bw() +
    geom_edge_link() +
    geom_node_point() +
    geom_node_text(aes(label = node_label), na.rm = TRUE, repel = TRUE) +
    geom_node_label(aes(label = split), vjust = 2.5, na.rm = TRUE, fill = "white") +
    geom_node_label(aes(label = leaf_label, fill = leaf_label), na.rm = TRUE, 
                    repel = TRUE, colour = "white", fontface = "bold", show.legend = FALSE) +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major = element_blank(),
          panel.background = element_blank(),
          plot.background = element_rect(fill = "white"),
          panel.border = element_blank(),
          axis.line = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank(),
          plot.title = element_text(size = 18))
  
  print(plot)
}

We can now plot, for example, the tree with the smallest number of nodes:

tree_num <- which(model_rf$finalModel$forest$ndbigtree == min(model_rf$finalModel$forest$ndbigtree))

tree_func(final_model = model_rf$finalModel, tree_num)

Or we can plot the tree with the biggest number of nodes:

tree_num <- which(model_rf$finalModel$forest$ndbigtree == max(model_rf$finalModel$forest$ndbigtree))

tree_func(final_model = model_rf$finalModel, tree_num)

Preparing the data and modeling

The data set I am using in these example analyses is the Breast Cancer Wisconsin (Diagnostic) Dataset. The data was downloaded from the UC Irvine Machine Learning Repository.

The first data set looks at the predictor classes:

  • malignant or
  • benign breast mass.

The features characterize cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nuclei
  • Mitosis
  • Classes, i.e. diagnosis
bc_data <- read.table("datasets/breast-cancer-wisconsin.data.txt", 
                      header = FALSE, 
                      sep = ",")
colnames(bc_data) <- c("sample_code_number", 
                       "clump_thickness", 
                       "uniformity_of_cell_size", 
                       "uniformity_of_cell_shape", 
                       "marginal_adhesion", 
                       "single_epithelial_cell_size", 
                       "bare_nuclei", 
                       "bland_chromatin", 
                       "normal_nucleoli", 
                       "mitosis", 
                       "classes")

bc_data$classes <- ifelse(bc_data$classes == "2", "benign",
                          ifelse(bc_data$classes == "4", "malignant", NA))

bc_data[bc_data == "?"] <- NA

# impute missing data
library(mice)

bc_data[, 2:10] <- apply(bc_data[, 2:10], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data[, 2:10], print = FALSE)
bc_data <- cbind(bc_data[, 11, drop = FALSE], mice::complete(dataset_impute, 1))

bc_data$classes <- as.factor(bc_data$classes)

# how many benign and malignant cases are there?
summary(bc_data$classes)

# separate into training and test data
library(caret)

set.seed(42)
index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
train_data <- bc_data[index, ]
test_data  <- bc_data[-index, ]

# run model
set.seed(42)
model_rf <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))

If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.

sessionInfo()

## R version 3.3.3 (2017-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
## 
## other attached packages:
## [1] igraph_1.0.1       ggraph_1.0.0       ggplot2_2.2.1.9000
## [4] dplyr_0.5.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9        nloptr_1.0.4       plyr_1.8.4
##  [4] viridis_0.3.4      iterators_1.0.8    tools_3.3.3
##  [7] digest_0.6.12      lme4_1.1-12        evaluate_0.10
## [10] tibble_1.2         gtable_0.2.0       nlme_3.1-131
## [13] lattice_0.20-34    mgcv_1.8-17        Matrix_1.2-8
## [16] foreach_1.4.3      DBI_0.6            ggrepel_0.6.5
## [19] yaml_2.1.14        parallel_3.3.3     SparseM_1.76
## [22] gridExtra_2.2.1    stringr_1.2.0      knitr_1.15.1
## [25] MatrixModels_0.4-1 stats4_3.3.3       rprojroot_1.2
## [28] grid_3.3.3         caret_6.0-73       nnet_7.3-12
## [31] R6_2.2.0           rmarkdown_1.3      minqa_1.2.4
## [34] udunits2_0.13      tweenr_0.1.5       deldir_0.1-12
## [37] reshape2_1.4.2     car_2.1-4          magrittr_1.5
## [40] units_0.4-2        backports_1.0.5    scales_0.4.1
## [43] codetools_0.2-15   ModelMetrics_1.1.0 htmltools_0.3.5
## [46] MASS_7.3-45        splines_3.3.3      randomForest_4.6-12
## [49] assertthat_0.1     pbkrtest_0.4-6     ggforce_0.1.1
## [52] colorspace_1.3-2   labeling_0.3       quantreg_5.29
## [55] stringi_1.1.2      lazyeval_0.2.0     munsell_0.4.3

Improved Python-style Logging in R

Thu, 03/16/2017 - 00:53
This entry is part 21 of 21 in the series Using R

Last August, in Python-style Logging in R, we described using an R script as a wrapper around the futile.logger package to generate log files for an operational R data processing script. Today, we highlight an improved, documented version that can be sourced by your R scripts or dropped into your package’s R/ directory to provide easy file and console logging.

The improved pylogging.R script enables the following use case:

  1. set up log files for different log levels
  2. set console log level
  3. six available log levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL

All of these capabilities depend upon the excellent futile.logger package (CRAN or github). This script just wraps this package to get python-style file logging. Please see futile.logger’s documentation for details on output formatting, etc.
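To give a rough idea of what such a wrapper can look like, here is a minimal sketch of my own (not the actual pylogging.R) of combined file and console logging built on futile.logger; the names logger.setup(), logger.setLevel() and logger.info() mirror the post, but the bodies are illustrative only and cover just part of the functionality:

library(futile.logger)

# Minimal sketch only -- the real pylogging.R is more complete and documented.
logger.setup <- function(debugLog = NULL, errorLog = NULL) {
  # console (root) logger starts at FATAL, as described in the post
  flog.threshold(FATAL)
  # optional file loggers for DEBUG-and-higher and ERROR-and-higher
  if (!is.null(debugLog)) flog.logger("debug", DEBUG, appender = appender.file(debugLog))
  if (!is.null(errorLog)) flog.logger("error", ERROR, appender = appender.file(errorLog))
  invisible(TRUE)
}

logger.setLevel <- function(level) {
  # change the console (root) threshold, e.g. logger.setLevel(ERROR)
  flog.threshold(level)
}

logger.info <- function(msg, ...) {
  flog.info(msg, ..., name = "debug")  # goes to the debug file, if one was set up
  flog.info(msg, ...)                  # root logger decides what reaches the console
}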

The pylogging.R script is fully documented with roxygen2 comments and can be incorporated into packages as long as their DESCRIPTION file adds a dependency on futile.logger. For those developing operational processing pipelines using R, python-style logging can be very useful.

To demonstrate log files and console output you can download pylogging.R and the following sleepy.R script:

# sleepy.R

logger.info("Getting sleepy ...")
Sys.sleep(1)
logger.warn("Getting really tired ...")
Sys.sleep(2)
logger.error("Having trouble staying awake ...")
Sys.sleep(3)
logger.fatal("Not gonna marzlmurrrzzzzz ...")
stop("Snap out of it!")

The following R session demonstrates the general functionality:

> list.files()
[1] "pylogger.R" "sleepy.R"
> # Nothing up my sleeve
>
> source("pylogger.R")
> source("sleepy.R")
Error: You must initialize with 'logger.setup()' before issuing logger statements.
> # Setup is required
>
> logger.setup()
> source("sleepy.R")
FATAL [2017-03-15 16:34:15] Not gonna marzlmurrrzzzzz ...
Error in eval(expr, envir, enclos) : Snap out of it!!
> # The console log level is set to FATAL by default
>
> list.files()
[1] "pylogger.R" "sleepy.R"
> # No log files created
>
> # Now modify console log level
> logger.setLevel(ERROR)
> source("sleepy.R")
ERROR [2017-03-15 16:35:29] Having trouble staying awake ...
FATAL [2017-03-15 16:35:32] Not gonna marzlmurrrzzzzz ...
Error in eval(expr, envir, enclos) : Snap out of it!!
> # Got ERROR and higher
>
> logger.setLevel(DEBUG)
> source("sleepy.R")
INFO [2017-03-15 16:35:42] Getting sleepy ...
WARN [2017-03-15 16:35:43] Getting really tired ...
ERROR [2017-03-15 16:35:45] Having trouble staying awake ...
FATAL [2017-03-15 16:35:48] Not gonna marzlmurrrzzzzz ...
Error in eval(expr, envir, enclos) : Snap out of it!!
> # Got DEBUG and higher
>
> list.files()
[1] "pylogger.R" "sleepy.R"
> # Still no log files
>
> # Set up log files for two levels
> logger.setup(debugLog="debug.log", errorLog="error.log")
> logger.setLevel(FATAL)
> source("sleepy.R")
FATAL [2017-03-15 16:36:43] Not gonna marzlmurrrzzzzz ...
Error in eval(expr, envir, enclos) : Snap out of it!!
> # Expected console output
>
> list.files()
[1] "debug.log"  "error.log"  "pylogger.R" "sleepy.R"
> readr::read_lines("debug.log")
[1] "INFO [2017-03-15 16:36:37] Getting sleepy ..."
[2] "WARN [2017-03-15 16:36:38] Getting really tired ..."
[3] "ERROR [2017-03-15 16:36:40] Having trouble staying awake ..."
[4] "FATAL [2017-03-15 16:36:43] Not gonna marzlmurrrzzzzz ..."
> readr::read_lines("error.log")
[1] "ERROR [2017-03-15 16:36:40] Having trouble staying awake ..."
[2] "FATAL [2017-03-15 16:36:43] Not gonna marzlmurrrzzzzz ..."
>
> # Got two log files containing DEBUG-and-higher and ERROR-and-higher

Best Wishes for Better Logging!

 

 

Puts as Protection

Wed, 03/15/2017 - 21:10

Many asset management firms are happily enjoying record revenue and profits driven not by inorganic growth or skillful portfolio management but by a seemingly endless increase in US equity prices. These firms are effectively commodity producers entirely dependent on the price of an index over which the firm has no control. The options market presents an easy, cheap, and liquid form of protection.

Ensemble Methods are Doomed to Fail in High Dimensions

Wed, 03/15/2017 - 20:28

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

Ensemble methods

By ensemble methods, I (Bob, not Andrew) mean approaches that scatter points in parameter space and then make moves by interpolating or extrapolating among subsets of them. Two prominent examples are:

There are extensions and computer implementations of these algorithms. For example, the Python package emcee implements Goodman and Weare’s walker algorithm and is popular in astrophysics.
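As a point of reference for what follows, here is a hedged R sketch of the Goodman and Weare “stretch move” as it is usually described (my own illustration, not code from Stan, emcee, or the original paper): a proposal is formed by moving one walker along the line through another randomly chosen walker.

# One stretch-move proposal for walker x_k, using another walker x_j (j != k).
# Assumes the textbook form of the algorithm with tuning parameter a (often 2).
stretch_move <- function(x_k, x_j, a = 2) {
  u <- runif(1)
  z <- ((a - 1) * u + 1)^2 / a            # z ~ g(z) proportional to 1/sqrt(z) on [1/a, a]
  proposal <- x_j + z * (x_k - x_j)       # interpolate/extrapolate along the line
  log_accept_adjust <- (length(x_k) - 1) * log(z)  # the z^(d-1) factor in the acceptance ratio
  list(proposal = proposal, log_accept_adjust = log_accept_adjust)
}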

Typical sets in high dimensions

If you want to get the background on typical sets, I’d highly recommend Michael Betancourt’s video lectures on MCMC in general and HMC in particular; they both focus on typical sets and their relation to the integrals we use MCMC to calculate:

It was Michael who made a doughnut in the air, pointed at the middle of it and said, “It’s obvious ensemble methods won’t work.” This is just fleshing out the details with some code for the rest of us without such sharp geometric intuitions.

MacKay’s information theory book is another excellent source on typical sets. Don’t bother with the Wikipedia on this one.

Why ensemble methods fail: Executive summary

  1. We want to draw a sample from the typical set
  2. The typical set is a thin shell at a fixed radius from the mode in a multivariate normal
  3. Interpolating or extrapolating between two points in this shell is unlikely to land in the shell
  4. The only steps that get accepted will be near one of the starting points
  5. The samplers devolve to a random walk with poorly biased choice of direction

Several years ago, Peter Li built the Goodman and Weare walker methods for Stan (all they need is log density) on a branch for evaluation. They failed in practice exactly the way the theory says they will fail. Which is too bad, because the ensemble methods are very easy to implement and embarrassingly parallel.

Why ensemble methods fail: R simulation

OK, so let’s see why they fail in practice. I’m going to write some simple R code to do the job for us. Here’s an R function to generate a 100-dimensional standard isotropic normal variate (each element is generated normal(0, 1) independently):

normal_rng <- function(K) rnorm(K);

This function computes the log density of a draw:

normal_lpdf <- function(y) sum(dnorm(y, log=TRUE));

Next, generate two draws from a 100-dimensional version:

K <- 100;
y1 <- normal_rng(K);
y2 <- normal_rng(K);

and then interpolate by choosing a point between them:

lambda = 0.5;
y12 <- lambda * y1 + (1 - lambda) * y2;

Now let's see what we get:

print(normal_lpdf(y1), digits=1);
print(normal_lpdf(y2), digits=1);
print(normal_lpdf(y12), digits=1);

[1] -153
[1] -142
[1] -123

Hmm. Why is the log density of the interpolated vector so much higher? Given that it's a multivariate normal, the answer is that it's closer to the mode. That should be a good thing, right? No, it's not. The typical set is defined as an area within "typical" density bounds. When I take a random draw from a 100-dimensional standard normal, I expect log densities that hover between -140 and -160 or so. That interpolated vector y12 with a log density of -123 isn't in the typical set!!! It's a bad draw, even though it's closer to the mode. Still confused? Watch Michael's videos above. Ironically, there's a description in the Goodman and Weare paper in a discussion of why they can use ensemble averages that also explains why their own sampler doesn't scale---the variance of averages is lower than the variance of individual draws; and we want to cover the actual posterior, not get closer to the mode.
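As a quick aside (my own check, not part of the original post), those “-140 to -160 or so” numbers follow directly from the distribution of the log density of a standard normal draw:

# For y ~ N(0, I_K): E[log p(y)] = -K/2 * log(2*pi) - K/2 and sd = sqrt(K/2),
# which for K = 100 gives roughly -142 plus or minus 7.
K <- 100
c(mean = -K/2 * log(2 * pi) - K/2, sd = sqrt(K/2))
##      mean        sd 
## -141.8939    7.0711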

So let's put this in a little sharper perspective by simulating thousands of draws from a multivariate normal and thousands of draws interpolating between pairs of draws and plot them in two histograms. First, draw them and print a summary:

lp1 <- vector();
for (n in 1:1000) lp1[n] <- normal_lpdf(normal_rng(K));
print(summary(lp1));

lp2 <- vector()
for (n in 1:1000) lp2[n] <- normal_lpdf((normal_rng(K) + normal_rng(K))/2);
print(summary(lp2));

from which we get:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -177    -146    -141    -142    -136    -121 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -129    -119    -117    -117    -114    -108 

That's looking bad. It's even clearer with a faceted histogram:

library(ggplot2);
df <- data.frame(list(log_pdf = c(lp1, lp2),
                      case = c(rep("Y1", 1000), rep("(Y1 + Y2)/2", 1000))));
plot <- ggplot(df, aes(log_pdf)) +
  geom_histogram(color="grey") +
  facet_grid(case ~ .);

Here's the plot:

The bottom plot shows the distribution of log densities in independent draws from the standard normal (these are pure Monte Carlo draws). The top plot shows the distribution of the log density of the vector resulting from interpolating two independent draws from the same distribution. Obviously, the log densities of the averaged draws are much higher. In other words, they are atypical of draws from the target standard normal density.

Exercise

Check out what happens as (1) the number of dimensions K varies, and (2) as lambda varies within or outside of [0, 1].

Hint: What you should see is that as lambda approaches 0 or 1, the draws get more and more typical, and more and more like random walk Metropolis with a small step size. As dimensionality increases, the typical set becomes more attenuated and the problem becomes worse (and vice-versa as it decreases).
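One possible way to run that exercise (a sketch of my own, not the author's code) is to sweep over K and lambda and look at the mean log density of the interpolated draws:

for (K in c(10, 100, 1000)) {
  for (lambda in c(0, 0.25, 0.5, 1.25)) {
    lp <- replicate(500, {
      y1 <- rnorm(K); y2 <- rnorm(K)
      sum(dnorm(lambda * y1 + (1 - lambda) * y2, log = TRUE))
    })
    cat(sprintf("K = %4d  lambda = %.2f  mean log density = %8.1f\n",
                K, lambda, mean(lp)))
  }
}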

Does Hamiltonian Monte Carlo (HMC) have these problems?

Not so much. It scales much better with dimension. It'll slow down, but it won't break and devolve to a random walk like ensemble methods do.

The post Ensemble Methods are Doomed to Fail in High Dimensions appeared first on Statistical Modeling, Causal Inference, and Social Science.

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Jobs for R users – 7 R jobs from around the world (2017-03-15)

Wed, 03/15/2017 - 19:00
To post your R job on the next post

Just visit this link and post a new R job to the R community. You can post a job for free (and there are also two “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs
More New Jobs
  1. Full-Time
    R Shiny Developer
    Summit Consulting LLC – Posted by jdieterle
    Washington
    District of Columbia, United States
    14 Mar2017
  2. Full-Time
    Data Scientist for MDClone
    MDClone – Posted by MDClone
    Be’er Sheva
    South District, Israel
    14 Mar2017
  3. Freelance
    Authoring a training course : Machine learning with R – for Packt
    Koushik Sen – Posted by Koushik Sen
    Anywhere
    7 Mar2017
  4. Full-Time
    Quantitative Research Associate
    The Millburn Corporation – Posted by themillburncorporation
    New York
    New York, United States
    6 Mar2017
  5. Full-Time
    Data Scientist – Analytics @ booking.com
    Booking.com – Posted by work_at_booking
    Anywhere
    4 Mar2017
  6. Full-Time
    Data Manager and Data Analysis Expert @ Leipzig, Sachsen, Germany
    Max Planck Institute for Cognitive and Brain Sciencer – Posted by Mandy Vogel
    Leipzig
    Sachsen, Germany
    3 Mar2017
  7. Full-Time
    Postdoctoral fellow @ Belfast, Northern Ireland, United Kingdom
    Queen’s University Belfast – Posted by Reinhold
    Belfast
    Northern Ireland, United Kingdom
    2 Mar2017

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).

New screencast: using R and RStudio to install and experiment with Apache Spark

Wed, 03/15/2017 - 17:40

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I have a new short screencast up: using R and RStudio to install and experiment with Apache Spark.

More material from my recent Strata workshop Modeling big data with R, sparklyr, and Apache Spark can be found here.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

RevoScaleR package dependencies with graph visualization

Wed, 03/15/2017 - 17:10

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

MRAN currently holds 7520 R packages. We can see this with the following command (assuming you are using the MRAN R version):

library(tools)

df_ap <- data.frame(available.packages())
head(df_ap)

By importing the tools package, we get many useful functions for finding additional information on packages.

The function package.dependencies() parses and checks the dependencies of a package in the current environment. The function package_dependencies() (with an underscore rather than a dot) will find all dependent and reverse-dependent packages.
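For example, the following small sketch (using ggplot2 purely as an illustration, not a package from the post) shows both directions:

db <- available.packages()

# forward dependencies of a package
tools::package_dependencies("ggplot2", db = db, which = c("Depends", "Imports"))

# reverse dependencies: which packages depend on it
tools::package_dependencies("ggplot2", db = db, reverse = TRUE)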

With the following code I can extract the packages and their dependencies (this also normalizes the data):

net <- data.frame(df_ap[, c(1, 4)])

library(dplyr)
library(tidyr)   # needed for unnest()

netN <- net %>%
        mutate(Depends = strsplit(as.character(Depends), ",")) %>%
        unnest(Depends)

netN

And the result is:

Source: local data frame [14,820 x 2]

   Package        Depends
    (fctr)          (chr)
1       A3  R (>= 2.15.0)
2       A3         xtable
3       A3        pbapply
4   abbyyR   R (>= 3.2.0)
5      abc    R (>= 2.10)
6      abc       abc.data
7      abc           nnet
8      abc       quantreg
9      abc           MASS
10     abc         locfit
..     ...            ...

The data presented this way still needs some further cleaning and preparation.
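For instance, one way that cleaning step might look (a sketch of my own, not the author's exact code; the edges and deptree objects used in the final chunk presumably come out of this preparation step) is to strip the version requirements and drop the base R entries:

# Strip version requirements and whitespace, e.g. "R (>= 2.15.0)" becomes "R",
# then drop the base-R rows that are not interesting as graph edges.
edges_sketch <- netN %>%
        mutate(Depends = trimws(gsub("\\s*\\(.*\\)", "", Depends))) %>%
        filter(!is.na(Depends), Depends != "R") %>%
        rename(src = Package, dst = Depends)   # column names chosen to mirror the later code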

Once the data are normalized, we can use any of the network packages to visualize them. Using the igraph package, I created a visual presentation of the RevoScaleR package: its dependencies and imported packages.

With the following code I filter the edges down to the RevoScaleR package and its dependency tree, and create the visual:

library(igraph)

netN_g <- graph.data.frame(edges[edges$src %in% c('RevoScaleR', deptree), ])
plot(netN_g)

 

Happy Ring!

 

 

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Data Structures Exercises

Wed, 03/15/2017 - 17:06

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

There are 5 important basic data structures in R: vector, matrix, array, list and data frame. They can be 1-dimensional (vector and list), 2-dimensional (matrix and data frame) or multidimensional (array). They also differ in the homogeneity of the elements they can contain: while all elements of a vector, matrix or array must be of the same type, a list or data frame can contain multiple types.
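As a quick refresher before the exercises (these snippets are mine, not part of the exercise set), the five structures can be created like this:

v  <- c(1, 2, 3)                               # vector: 1-dimensional, homogeneous
m  <- matrix(1:6, nrow = 2)                    # matrix: 2-dimensional, homogeneous
a  <- array(1:24, dim = c(2, 3, 4))            # array: n-dimensional, homogeneous
l  <- list(x = 1:3, y = "text")                # list: 1-dimensional, heterogeneous
df <- data.frame(id = 1:2, name = c("a", "b")) # data frame: 2-dimensional, columns may differ in type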

In this set of exercises we shall practice casting between different types of these data structures, together with some basic operations on them. You can find more about data structures on the Advanced R – Data structures page.

Answers to the exercises are available here.

If you have a different solution, feel free to post it.

Exercise 1

Create a vector named v which contains 10 random integer values between -100 and +100.

Exercise 2

Create a two-dimensional 5×5 array named a comprised of a sequence of even integers greater than 25.

Create a list named s containing a sequence of 20 capital letters, starting with ‘C’.

Exercise 3

Create a list named l and put all previously created objects in it. Name them a, b and c respectively. How many elements are there in the list? Show the structure of the list. Count all elements recursively.

Exercise 4

Without running commands in R, answer the following questions:

  1. What is the result of l[[3]]?
  2. How would you access the letter at a random position in the list element c?
  3. If you convert list l to a vector, what will be the type of its elements?
  4. Can this list be converted to an array? What will be the data type of the elements in the array?

Check the results with R.

Exercise 5

Remove letters from the list l. Convert the list l to a vector and check its class. Compare it with the result from exercise 4, question #3.

Exercise 6

Find the difference between the elements in l[["a"]] and l[["b"]]. Find the intersection between them. Is the number 33 in their union?

Exercise 7

Create a 5×5 matrix named m and fill it with random numeric values rounded to two decimal places, ranging from 1.00 to 100.00.

Exercise 8

Answer the following question without running any R commands, then check the result.

What will be the class of data structure if you convert matrix m to:

  • vector
  • list
  • data frame
  • array?

Exercise 9

Transpose the array l$b and then convert it to a matrix.

Exercise 10

Get the union of the matrix m and all elements in the list l, and sort it in ascending order.

Related exercise sets:
  1. Matrix exercises
  2. Array exercises
  3. Mode exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Why I love R Notebooks

Wed, 03/15/2017 - 17:00

(This article was first published on RStudio, and kindly contributed to R-bloggers)

by Nathan Stephens

Note: R Notebooks requires RStudio Version 1.0 or later

I’m a big fan of the R console. During my early years with R, that’s all I had, so I got very comfortable with pasting my code into the console. Since then I’ve used many code editors for R, but they all followed the same paradigm – script in one window and get output in another window. Notebooks on the other hand combine code, output, and narrative into a single document. Notebooks allow you to interactively build narratives around small chunks of code and then publish the complete notebook as a report.

R Notebooks is a new feature of RStudio which combines the benefits of other popular notebooks (such as Jupyter, Zeppelin, and Beaker) with the benefits of R Markdown documents. As a long time R user I was skeptical that I would like this new paradigm, but after a few months I became a big fan. Here are my top three reasons why I love R Notebooks.

Number 3: Notebooks are for doing science

If scripting is for writing software then notebooks are for doing data science. In high school science class I used a laboratory notebook that contained all my experiments. When I conducted an experiment, I drew sketches and wrote down my results. I also wrote down my ideas and thoughts. The process had a nice flow which helped me improve my thinking. Doing science with physical notebooks is an idea that is centuries old.

Electronic notebooks follow the same pattern as physical notebooks but apply the pattern to code. With notebooks you break your script into manageable code chunks. You add narrative and output around the code chunk which puts it into context and makes it reproducible. When you are done, you have an elegant report that can be shared with others. Here is the thought process for doing data science with notebooks:

  • I have a chunk of code that I want to tell you about.
  • I am going to execute this chunk of code and show you the output.
  • I am going to share all chunks of code with you in a single, reproducible document.

If you do data science with R scripts, on the other hand, you develop your code as a single script. You add comments to the code, but the comments tend to be terse or nonexistent. Your output may or may not be captured at all. Sharing your results in a report requires a separate, time consuming process. Here is the thought process for doing data science with scripts:

  • I have a thousand lines of code and you get to read my amazing comments!
  • Hold onto your hats while I batch execute this entire script!
  • You can find my code and about 50 plots under the project directory (I hope you have permissions).
Number 2: R Notebooks have great features

R Notebooks are based on R Markdown documents. That means they are written in plain text and work well with version control. They can be used to create elegantly formatted output in multiple formats (e.g. HTML, PDF, and Word).

R Notebooks have some features that are not found in traditional notebooks. These are not necessarily inherent differences, but differences of emphasis. For example, R Markdown documents give you many options when selecting graphics, templates, and formats for your output.

Feature                                   R Notebooks            Traditional Notebooks
Plain text representation                 ✓
Same editor/tools used for R scripts      ✓
Works well with version control           ✓
Create elegantly formatted output         ✓
Output inline with code                   ✓                      ✓
Output cached across sessions             ✓                      ✓
Share code and output in a single file    ✓                      ✓
Emphasized execution model                Interactive & Batch    Interactive

When you execute a code chunk in an R Notebook, the output is cached and rendered inside the IDE. When you save the notebook, the same cache is rendered inside a document. The HTML output of R Notebooks is a dual file format that contains both the HTML and the R Markdown source code. The dual format gives you a single file that can be viewed in a browser or opened in the RStudio IDE.

Number 1: R Notebooks make it easy to create and share reports

My favorite part of R Notebooks is having the ability to easily share my work. With R notebooks a high quality report is an automatic byproduct of a completed analysis. If I write down my thoughts while I analyze my code chunks, then all I have to do is push a button to render a report. I can share this report by publishing it to the web, emailing it to my colleagues, or presenting it with slides. This video shows how easy it is to create a report from an R Notebook.




R Notebooks for data science

The following table summarizes the differences between notebooks and scripts.

Building a narrative
  R Notebook: Rich text is added throughout the analytic process describing the motivation and the conclusions for each chunk of the code.
  R Script: Comments are added to the script, and a report that describes the entire analysis is drafted after the script is completed.

Organizing plots, widgets, and tables
  R Notebook: All output is embedded in a single document and collocated with the narrative and code chunk to which it belongs.
  R Script: Each individual output is sent to file and is collected later into a report.

Creating reports
  R Notebook: Rendering the final report is instant. The same document can be published to multiple formats (e.g. HTML, PDF, Word). Since the document is based on code, future changes are easy to implement and the document is reproducible by others.
  R Script: Creating a report is a separate, time consuming step. Any changes to the report can be time consuming and prone to error. Since the report is not tied to code, it is not reproducible.

R Notebooks are not designed for all the work you do in R. If you are writing software for a new package or building a Shiny app you will want to use an R script. However, if you are doing data science you might try R Notebooks. They are great for tasks like exploratory data analysis, model building, and communicating insights. Notebooks are useful for data science because they organize narrative, code, and text around manageable code chunks; and creating quality, reproducible reports is easy.

References: For an introduction to R Notebooks see this video or blog post. For more detailed information, see this workflow presentation or the reference site.

To leave a comment for the author, please follow the link and comment on their blog: RStudio. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Adding figure labels (A, B, C, …) in the top left corner of the plotting region

Wed, 03/15/2017 - 16:39

I decided to submit a manuscript using only R with knitr, pandoc and make. Actually, it went quite well. Certainly, revisions of a manuscript with complex figures did not require much manual work once the R code for the figures had been created. The manuscript ended up as a Word file (for the sake of co-authors), looking no different than any other manuscript. However, you can look up precisely how all the figures have been generated and, with a single command, re-create the manuscript (with all figures and supplementary data) after you changed a parameter.

One of the small problems I faced was adding labels to pictures. You know — like A, B, C… in the top left corner of each panel of a composite figure. Here is the output I was striving for:

Doing it proved to be more tedious than I thought at first. By default, you can only plot things in the plotting region, everything else gets clipped — you cannot put arbitrary text anywhere outside the rectangle containing the actual plot:

plot(rnorm(100))
text(-20, 0, "one two three four", cex=2)

This is because the plotting area is the red rectangle in the figure below, and everything outside it will not be shown by default:

One can use the function mtext to put text in the margins. However, there is no easy way to say “put the text in the top left corner of the figure”, and the results I was able to get were never perfect. Anyway, to push the label really to the very left of the figure region using mtext, you first need to have the user coordinates of that region (to be able to use the option ‘at’). However, if you know these coordinates, it is much easier to achieve the desired effect using text.

However, we need to figure out a few things. First, to avoid clipping of the region, one needs to change the parameter xpd:

par(xpd=NA)

Then, we need to know where to draw the label. We can get the coordinates of the device (in inches), and then we can translate these to user coordinates with appropriate functions:

plot(rnorm(100))
di <- dev.size("in")
x <- grconvertX(c(0, di[1]), from="in", to="user")
y <- grconvertY(c(0, di[2]), from="in", to="user")

x[1] and y[2] are the coordinates of the top left corner of the device… but not of the figure. Since we might have used, for example, par(mar=...) or layout to put multiple plots on the device, and we would like to always label the current plot only (i.e. put the label in the corner of the current figure, not of the whole device), we have to take this into account as well:

fig <- par("fig")
x <- x[1] + (x[2] - x[1]) * fig[1:2]
y <- y[1] + (y[2] - y[1]) * fig[3:4]

However, before plotting, we have to adjust this position by half of the text string width and height, respectively:

txt <- "A"
x <- x[1] + strwidth(txt, cex=3) / 2
y <- y[2] - strheight(txt, cex=3) / 2
text(x, y, txt, cex=3)

Looks good! That is exactly what I wanted:

Below you will find an R function that draws a label in one of the three regions — figure (default), plot or device. You specify the position of the label using the labels also used by legend: “topleft”, “bottomright” etc.

fig_label <- function(text, region="figure", pos="topleft", cex=NULL, ...) {

  region <- match.arg(region, c("figure", "plot", "device"))
  pos <- match.arg(pos, c("topleft", "top", "topright", 
                          "left", "center", "right", 
                          "bottomleft", "bottom", "bottomright"))

  if(region %in% c("figure", "device")) {
    ds <- dev.size("in")
    # xy coordinates of device corners in user coordinates
    x <- grconvertX(c(0, ds[1]), from="in", to="user")
    y <- grconvertY(c(0, ds[2]), from="in", to="user")

    # fragment of the device we use to plot
    if(region == "figure") {
      # account for the fragment of the device that
      # the figure is using
      fig <- par("fig")
      dx <- (x[2] - x[1])
      dy <- (y[2] - y[1])
      x <- x[1] + dx * fig[1:2]
      y <- y[1] + dy * fig[3:4]
    }
  }

  # much simpler if in plotting region
  if(region == "plot") {
    u <- par("usr")
    x <- u[1:2]
    y <- u[3:4]
  }

  sw <- strwidth(text, cex=cex) * 60/100
  sh <- strheight(text, cex=cex) * 60/100

  x1 <- switch(pos,
    topleft     = x[1] + sw,
    left        = x[1] + sw,
    bottomleft  = x[1] + sw,
    top         = (x[1] + x[2])/2,
    center      = (x[1] + x[2])/2,
    bottom      = (x[1] + x[2])/2,
    topright    = x[2] - sw,
    right       = x[2] - sw,
    bottomright = x[2] - sw)

  y1 <- switch(pos,
    topleft     = y[2] - sh,
    top         = y[2] - sh,
    topright    = y[2] - sh,
    left        = (y[1] + y[2])/2,
    center      = (y[1] + y[2])/2,
    right       = (y[1] + y[2])/2,
    bottomleft  = y[1] + sh,
    bottom      = y[1] + sh,
    bottomright = y[1] + sh)

  old.par <- par(xpd=NA)
  on.exit(par(old.par))

  text(x1, y1, text, cex=cex, ...)
  return(invisible(c(x,y)))
}
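For example, labelling the two panels of a composite figure could look like this (my own usage example, assuming the fig_label() function above has been sourced):

par(mfrow = c(1, 2))
plot(rnorm(100))
fig_label("A", cex = 2)   # label in the top left corner of the first panel's figure region
plot(runif(100))
fig_label("B", cex = 2)   # and of the second panel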

New Course: Unsupervised Learning in R

Wed, 03/15/2017 - 15:13

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Hi there – today we’re launching a new machine learning course on Unsupervised Learning in R by Hank Roark!

Many times in machine learning, the goal is to find patterns in data without trying to make predictions. This is called unsupervised learning. One common use case of unsupervised learning is grouping consumers based on demographics and purchasing history to deploy targeted marketing campaigns. Another example is wanting to describe the unmeasured factors that most influence crime differences between cities. This course provides a basic introduction to clustering and dimensionality reduction in R from a machine learning perspective so that you can get from data to insights as quickly as possible.

Start for free

Unsupervised Learning in R features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a master at machine learning in R!

What you’ll learn:

The k-means algorithm is one common approach to clustering. Learn how the algorithm works under the hood, implement k-means clustering in R, visualize and interpret the results, and select the number of clusters when it’s not known ahead of time. By the end of the first chapter, you’ll have applied k-means clustering to a fun “real-world” dataset!
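To give a flavor of what that first chapter covers (this snippet is purely illustrative and not taken from the course), k-means in base R looks like this:

set.seed(1)
# two artificial clusters in two dimensions
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 20)   # nstart restarts guard against poor initializations
plot(x, col = km$cluster, pch = 19)
points(km$centers, pch = 4, cex = 2, lwd = 2)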

In chapter 2, you’ll learn about hierarchical clustering, which is another popular method for clustering. The goal of this chapter is to go over how it works, how to use it, and how it compares to k-means clustering.

Chapter 3 covers principal component analysis, or PCA, which is a common approach to dimensionality reduction. Learn exactly what PCA does, visualize the results of PCA with biplots and scree plots, and deal with practical issues such as centering and scaling the data before performing PCA.
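Again purely as an illustration of the base-R functions involved (not course material), PCA with centering and scaling can be run as:

pr <- prcomp(USArrests, center = TRUE, scale. = TRUE)
summary(pr)   # proportion of variance explained by each component
biplot(pr)    # biplot of the first two principal components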

The goal of the final chapter is to guide you through a complete analysis using the unsupervised learning techniques covered in the first three chapters. You’ll extend what you’ve learned by combining PCA as a preprocessing step to clustering using data that consist of measurements of cell nuclei of human breast masses.

About Hank Roark: Hank is a Senior Data Scientist at Boeing and a long time user of the R language. Prior to his current role, he led the Customer Data Science team at H2O.ai, a leading provider of machine learning and predictive analytics services.

Start course for free

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
