R-bloggers: R news and tutorials contributed by 750 R bloggers

Manipulate dates easily with {lubridate}

Sat, 12/15/2018 - 01:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)


This blog post is an excerpt of my ebook Modern R with the tidyverse that you can read for
free here. It is taken from Chapter 5, which presents
the {tidyverse} packages and how to use them to compute descriptive statistics and manipulate data.
In the text below, I scrape a table from Wikipedia that shows when African countries gained
independence from other countries. Then, using {lubridate} functions, I show you how you can
answer questions such as "Which countries gained independence before 1960?".

Set-up: scraping some data from Wikipedia

{lubridate} is yet another tidyverse package, one that makes dealing with dates or duration data
(and intervals) as painless as possible. I do not use every function contained in the package
daily, and as such will only focus on some of them. However, if you have to deal with
dates often, you might want to explore the package thoroughly.

Let’s get some data from a Wikipedia table:

library(tidyverse)
library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/Decolonisation_of_Africa")

independence <- page %>%
  html_node(".wikitable") %>%
  html_table(fill = TRUE)

independence <- independence %>%
  select(-Rank) %>%
  map_df(~str_remove_all(., "\\[.*\\]")) %>%
  rename(country = `Country[a]`,
         colonial_name = `Colonial name`,
         colonial_power = `Colonial power[b]`,
         independence_date = `Independence date`,
         first_head_of_state = `First head of state[d]`,
         independence_won_through = `Independence won through`)

This dataset was scraped from the following Wikipedia table.
It shows when African countries gained independence from which colonial powers. In Chapter 11, I
will show you how to scrape Wikipedia pages using R. For now, let’s take a look at the contents
of the dataset:

independence
## # A tibble: 54 x 6
## country colonial_name colonial_power independence_da… first_head_of_s…
##
## 1 Liberia Liberia United States 26 July 1847 Joseph Jenkins …
## 2 South … Cape Colony … United Kingdom 31 May 1910 Louis Botha
## 3 Egypt Sultanate of… United Kingdom 28 February 1922 Fuad I
## 4 Eritrea Italian Erit… Italy 10 February 1947 Haile Selassie
## 5 Libya British Mili… United Kingdo… 24 December 1951 Idris
## 6 Sudan Anglo-Egypti… United Kingdo… 1 January 1956 Ismail al-Azhari
## 7 Tunisia French Prote… France 20 March 1956 Muhammad VIII a…
## 8 Morocco French Prote… France Spain 2 March 19567 A… Mohammed V
## 9 Ghana Gold Coast United Kingdom 6 March 1957 Kwame Nkrumah
## 10 Guinea French West … France 2 October 1958 Ahmed Sékou Tou…
## # ... with 44 more rows, and 1 more variable:
## # independence_won_through

As you can see, the date of independence is in a format that might make it difficult to answer questions
such as "Which African countries gained independence before 1960?", for two reasons. First of all,
the date uses the name of the month instead of the number of the month (well, this is not such a
big deal, but still), and second of all, the type of
the independence date column is character and not "date". So our first task is to correctly define the column
as being of type date, while making sure that R understands that January is supposed to be "01", and so
on.

Using {lubridate}

There are several helpful functions included in {lubridate} to convert columns to dates. For instance,
if the column you want to convert is of the form "2012-11-21", then you would use the function ymd(),
for "year-month-day". If, however, the column is of the form "2012-21-11", then you would use ydm(). There are
a few of these helper functions, and they can handle a lot of different date formats. In our case,
having the name of the month instead of the number might seem quite problematic, but it turns out
that this is a case that {lubridate} handles painlessly:

library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date

independence <- independence %>%
  mutate(independence_date = dmy(independence_date))
## Warning: 5 failed to parse.

Some dates failed to parse, for instance for Morocco. This is because these countries have several
independence dates; this means that the string to convert looks like:

"2 March 1956 7 April 1956 10 April 1958 4 January 1969"

which obviously cannot be converted by {lubridate} without further manipulation. I ignore these cases for
simplicity’s sake.
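If one did want to keep these countries, a possible workaround (a hypothetical sketch of mine, not part of the original analysis) would be to keep only the first listed date before parsing:

library(stringr)
library(lubridate)

# extract the first "day month year" pattern from a string containing several dates
first_date <- function(x) {
  str_extract(x, "\\d{1,2} [A-Za-z]+ \\d{4}")
}

dmy(first_date("2 March 1956 7 April 1956 10 April 1958 4 January 1969"))
## [1] "1956-03-02"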

Let’s take a look at the data now:

independence
## # A tibble: 54 x 6
## country colonial_name colonial_power independence_da… first_head_of_s…
##
## 1 Liberia Liberia United States 1847-07-26 Joseph Jenkins …
## 2 South … Cape Colony … United Kingdom 1910-05-31 Louis Botha
## 3 Egypt Sultanate of… United Kingdom 1922-02-28 Fuad I
## 4 Eritrea Italian Erit… Italy 1947-02-10 Haile Selassie
## 5 Libya British Mili… United Kingdo… 1951-12-24 Idris
## 6 Sudan Anglo-Egypti… United Kingdo… 1956-01-01 Ismail al-Azhari
## 7 Tunisia French Prote… France 1956-03-20 Muhammad VIII a…
## 8 Morocco French Prote… France Spain NA Mohammed V
## 9 Ghana Gold Coast United Kingdom 1957-03-06 Kwame Nkrumah
## 10 Guinea French West … France 1958-10-02 Ahmed Sékou Tou…
## # ... with 44 more rows, and 1 more variable:
## # independence_won_through

As you can see, we now have a date column in the right format. We can now answer questions such as
"Which countries gained independence before 1960?" quite easily, by using the functions year(),
month() and day(). Let's see which countries gained independence before 1960:

independence %>%
  filter(year(independence_date) <= 1960) %>%
  pull(country)
## [1] "Liberia" "South Africa"
## [3] "Egypt" "Eritrea"
## [5] "Libya" "Sudan"
## [7] "Tunisia" "Ghana"
## [9] "Guinea" "Cameroon"
## [11] "Togo" "Mali"
## [13] "Madagascar" "Democratic Republic of the Congo"
## [15] "Benin" "Niger"
## [17] "Burkina Faso" "Ivory Coast"
## [19] "Chad" "Central African Republic"
## [21] "Republic of the Congo" "Gabon"
## [23] "Mauritania"

You guessed it, year() extracts the year of the date column and converts it to a numeric so that we can work
with it. The same is true for month() and day(). Let's try to see if any countries gained their independence on
Christmas Eve:

independence %>%
  filter(month(independence_date) == 12,
         day(independence_date) == 24) %>%
  pull(country)
## [1] "Libya"

Seems like Libya was the only one! You can also operate on dates. For instance, let's compute the difference between
two dates, using the interval() function:

independence %>%
  mutate(today = lubridate::today()) %>%
  mutate(independent_since = interval(independence_date, today)) %>%
  select(country, independent_since)
## # A tibble: 54 x 2
## country independent_since
##
## 1 Liberia 1847-07-26 UTC--2018-12-15 UTC
## 2 South Africa 1910-05-31 UTC--2018-12-15 UTC
## 3 Egypt 1922-02-28 UTC--2018-12-15 UTC
## 4 Eritrea 1947-02-10 UTC--2018-12-15 UTC
## 5 Libya 1951-12-24 UTC--2018-12-15 UTC
## 6 Sudan 1956-01-01 UTC--2018-12-15 UTC
## 7 Tunisia 1956-03-20 UTC--2018-12-15 UTC
## 8 Morocco NA--NA
## 9 Ghana 1957-03-06 UTC--2018-12-15 UTC
## 10 Guinea 1958-10-02 UTC--2018-12-15 UTC
## # ... with 44 more rows

The independent_since column now contains an interval object that we can convert to years:

independence %>%
  mutate(today = lubridate::today()) %>%
  mutate(independent_since = interval(independence_date, today)) %>%
  select(country, independent_since) %>%
  mutate(years_independent = as.numeric(independent_since, "years"))
## # A tibble: 54 x 3
## country independent_since years_independent
##
## 1 Liberia 1847-07-26 UTC--2018-12-15 UTC 171.
## 2 South Africa 1910-05-31 UTC--2018-12-15 UTC 109.
## 3 Egypt 1922-02-28 UTC--2018-12-15 UTC 96.8
## 4 Eritrea 1947-02-10 UTC--2018-12-15 UTC 71.8
## 5 Libya 1951-12-24 UTC--2018-12-15 UTC 67.0
## 6 Sudan 1956-01-01 UTC--2018-12-15 UTC 63.0
## 7 Tunisia 1956-03-20 UTC--2018-12-15 UTC 62.7
## 8 Morocco NA--NA NA
## 9 Ghana 1957-03-06 UTC--2018-12-15 UTC 61.8
## 10 Guinea 1958-10-02 UTC--2018-12-15 UTC 60.2
## # ... with 44 more rows

We can now see, for each colonial power, how long its last colony to gain independence has been independent.
Because the data is not tidy (in some cases, an African country was colonized by two powers,
see Libya), I will only focus on 4 European colonial powers: Belgium, France, Portugal and the United Kingdom:

independence %>%
  filter(colonial_power %in% c("Belgium", "France", "Portugal", "United Kingdom")) %>%
  mutate(today = lubridate::today()) %>%
  mutate(independent_since = interval(independence_date, today)) %>%
  mutate(years_independent = as.numeric(independent_since, "years")) %>%
  group_by(colonial_power) %>%
  summarise(last_colony_independent_for = min(years_independent, na.rm = TRUE))
## # A tibble: 4 x 2
## colonial_power last_colony_independent_for
##
## 1 Belgium 56.5
## 2 France 41.5
## 3 Portugal 43.1
## 4 United Kingdom 42.5

{lubridate} contains many more functions. If you often work with dates, duration or interval data, {lubridate}
is a package that you have to master.

Hope you enjoyed! If you found this blog post useful, you might want to follow
me on twitter for blog post updates and
buy me an espresso or paypal.me.


Learning R: A gentle introduction to higher-order functions

Fri, 12/14/2018 - 23:02

(This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)

Have you ever thought about why the definition of a function in R is different from many other programming languages? The part that causes the biggest difficulties (especially for beginners of R) is that you state the name of the function at the beginning and use the assignment operator – as if functions were like any other data type, like vectors, matrices or data frames…

Congratulations! You just encountered one of the big ideas of functional programming: functions are indeed like any other data type, they are not special – or in programming lingo, functions are first-class members. Now, you might ask: So what? Well, there are many ramifications, for example that you could use functions on other functions by using one function as an argument for another function. Sounds complicated?

In mathematics most of you will be familiar with taking the derivative of a function. When you think about it you could say that you put one function into the derivative function (or operator) and get out another function!
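This idea is easy to express directly in R. Here is a minimal sketch (my own illustration, not from the original post) of a numerical "derivative operator": a function that takes a function and returns another function:

derivative <- function(f, h = 1e-6) {
  # return a new function that approximates f' via central differences
  function(x) (f(x + h) - f(x - h)) / (2 * h)
}

dsin <- derivative(sin)
dsin(0)      # approximately cos(0) = 1
dsin(pi / 2) # approximately cos(pi / 2) = 0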

In R there are many applications as well, let us go through a simple example step by step.

Let’s say I want to apply the mean function on the first four columns of the iris dataset. I could do the following:

mean(iris[ , 1])
## [1] 5.843333
mean(iris[ , 2])
## [1] 3.057333
mean(iris[ , 3])
## [1] 3.758
mean(iris[ , 4])
## [1] 1.199333

Quite tedious and not very elegant. Of course, we can use a for loop for that:

for (x in iris[1:4]) {
  print(mean(x))
}
## [1] 5.843333
## [1] 3.057333
## [1] 3.758
## [1] 1.199333

This works fine but there is an even more intuitive approach. Just look at the original task: “apply the mean function on the first four columns of the iris dataset” – so let us do just that:

apply(iris[1:4], 2, mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333

Wow, this is very concise and works perfectly (the 2 just stands for “go through the data column wise”, 1 would be for “row wise”). apply is called a “higher-order function” and we could use it with all kinds of other functions:

apply(iris[1:4], 2, sd)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.8280661 0.4358663 1.7652982 0.7622377

apply(iris[1:4], 2, min)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4.3 2.0 1.0 0.1

apply(iris[1:4], 2, max)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 7.9 4.4 6.9 2.5

You can also use user-defined functions:

midrange <- function(x) (min(x) + max(x)) / 2
apply(iris[1:4], 2, midrange)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 6.10 3.20 3.95 1.30

We can even use new functions that are defined “on the fly” (or in functional programming lingo “anonymous functions”):

apply(iris[1:4], 2, function(x) (min(x) + max(x)) / 2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 6.10 3.20 3.95 1.30

Let us now switch to another built-in dataset, the mtcars dataset, with 11 different variables for 32 cars (if you want to find out more, please consult the documentation):

head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

To see the power of higher-order functions let us create a (numeric) matrix with minimum, first quartile, median, mean, third quartile and maximum for all 11 columns of the mtcars dataset with just one command!

apply(mtcars, 2, summary)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Min. 10.40000 4.0000 71.1000 52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000
## 1st Qu. 15.42500 4.0000 120.8250 96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000
## Median 19.20000 6.0000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000
## Mean 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
## 3rd Qu. 22.80000 8.0000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000
## Max. 33.90000 8.0000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000

Wow, that was easy and the result is quite impressive, is it not?

Or if you want to perform a linear regression for all ten variables separately against mpg and want to get a table with all coefficients – there you go:

sapply(mtcars, function(x) round(coef(lm(mpg ~ x, data = mtcars)), 3))
## mpg cyl disp hp drat wt qsec vs am gear carb
## (Intercept) 0 37.885 29.600 30.099 -7.525 37.285 -5.114 16.617 17.147 5.623 25.872
## x 1 -2.876 -0.041 -0.068 7.678 -5.344 1.412 7.940 7.245 3.923 -2.056

Here we used another higher-order function, sapply, together with an anonymous function. sapply goes through all the columns of a data frame (i.e. elements of a list) and tries to simplify the result (here you get back a nice matrix).
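To see what "tries to simplify" means, compare it with its sibling lapply, which always returns a list (a small illustration of my own):

lapply(iris[1:4], mean)  # returns a list with one element per column
sapply(iris[1:4], mean)  # simplifies the same result to a named numeric vector
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333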

Often, you might not even have realised when you were using higher-order functions! I can tell you that it is quite a hassle in many programming languages to program a simple function plotter, i.e. a function which plots another function. In R it has already been done for you: you just use the higher-order function curve and give it the function you want to plot as an argument:

curve(sin(x) + cos(1/2 * x), -10, 10)

I want to give you one last example of another very helpful higher-order function (which not too many people know or use): by. It comes in very handy when you want to apply a function on different attributes split by a factor. So let’s say you want to get a summary of all the attributes of iris split by (!) species – here it comes:

by(iris[1:4], iris$Species, summary)
## iris$Species: setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
## 1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
## Median :5.000 Median :3.400 Median :1.500 Median :0.200
## Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
## 3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
## Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
## --------------------------------------------------------------
## iris$Species: versicolor
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000
## 1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200
## Median :5.900 Median :2.800 Median :4.35 Median :1.300
## Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
## 3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
## Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
## --------------------------------------------------------------
## iris$Species: virginica
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
## 1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
## Median :6.500 Median :3.000 Median :5.550 Median :2.000
## Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
## 3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
## Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500

This was just a very shy look at this huge topic. There are very powerful higher-order functions in R, like lapply, aggregate, replicate (very handy for numerical simulations) and many more. A good overview can be found in the answers to this question: stackoverflow (my answer there is on the rather elusive switch function: switch).
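For instance, replicate turns a simple simulation into a one-liner (a small sketch of my own, not from the original post):

# sampling distribution of the mean of 30 standard normal draws
sims <- replicate(10000, mean(rnorm(30)))
sd(sims)       # should be close to the theoretical value below
1 / sqrt(30)
## [1] 0.1825742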

For some reason people tend to confuse higher-order functions with recursive functions but that is the topic of another post, so stay tuned…


In case you missed it: November 2018 roundup

Fri, 12/14/2018 - 16:29

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from November of particular interest to R users.

David Gerard assesses the plausibility of a key plot point in 'Jurassic Park' with simulations in R.

In-database R is available in Azure SQL Database for private preview. 

Introducing AzureR, a new suite of R packages for managing Azure resources in R.

The AzureRMR package provides an interface for Resource Manager.

Roundup of AI, Machine Learning and Data Science news from November 2018.

You can now use the AI capabilities of Microsoft Cognitive Services within a container you host.

A look back at some of the R applications presented at the EARL conference in Seattle.

Slides and notebooks from my ODSC workshop, AI for Good.

T-Mobile uses AI models implemented with R to streamline customer service.

A guide to R packages for importing and working with US Census data.

Azure Machine Learning Studio, the online drag-and-drop data analysis tool, upgrades its R support.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.


Day 14 – little helper print_fs

Fri, 12/14/2018 - 08:00

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

We at STATWORX work a lot with R and we often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. At first, there was no plan to make a package, but soon I realised that it would be much easier to share and improve those functions if they were within a package. Up until the 24th of December I will present one function each day from helfRlein. So, on the 14th day of Christmas my true love gave to me…

What can it do?

This little helper returns the folder structure of a given path. With this, one can for example add a nice overview to the documentation of a project or within a git repository. For the sake of automation, this function could run and change parts within a log or news file after a major change.

How to use it?

If we take a look at the same example we used for the get_network function on day 5, we get the following:

print_fs("~/flowchart/", depth = 4) 1 flowchart 2 ¦--create_network.R 3 ¦--getnetwork.R 4 ¦--plots 5 ¦ ¦--example-network-helfRlein.png 6 ¦ °--improved-network.png 7 ¦--R_network_functions 8 ¦ ¦--dataprep 9 ¦ ¦ °--foo_01.R 10 ¦ ¦--method 11 ¦ ¦ °--foo_02.R 12 ¦ ¦--script_01.R 13 ¦ °--script_02.R 14 °--README.md

With depth we can adjust how deep we want to traverse through our folders.
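For instance, calling it with depth = 2 on the same folder would stop at the first level below the project root, giving output roughly like this (illustrative only, not actual output):

print_fs("~/flowchart/", depth = 2)
1 flowchart
2 ¦--create_network.R
3 ¦--getnetwork.R
4 ¦--plots
5 ¦--R_network_functions
6 °--README.md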

Overview

To see all the other functions you can either check out our GitHub or you can read about them here.

Have a merry advent season!

About the author

Jakob Gepp

Numbers were always my passion and as a data scientist and statistician at STATWORX I can fulfill my nerdy needs. I am also responsible for our blog. So if you have any questions or suggestions, just send me an email!

ABOUT US

STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. 


The post Day 14 – little helper print_fs first appeared on STATWORX.


running plot [and simulated annealing]

Fri, 12/14/2018 - 07:26

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

Last weekend, I found out a way to run continuously updated plots within a loop in R, since calling plot() within the loop never refreshed the plot in real time. The suggestion of adding a Sys.sleep(0.25) after each plot() call worked perfectly on a simulated annealing example for determining the most dispersed points in a unit disc.
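A minimal sketch of that pattern (my own toy illustration, not the simulated annealing code from the post): redraw the plot inside the loop and pause briefly so the graphics device has time to refresh.

x <- numeric(0)
for (i in 1:100) {
  # grow a random walk one step at a time and redraw it
  x <- c(x, rnorm(1))
  plot(cumsum(x), type = "l", xlab = "step", ylab = "position")
  Sys.sleep(0.25)  # give the device a moment to update before the next iteration
}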


Easy CI/CD of GPU applications on Google Cloud including bare-metal using Gitlab and Kubernetes

Fri, 12/14/2018 - 01:00

(This article was first published on Angel Sevilla Camins' Blog, and kindly contributed to R-bloggers)

Summary

Are you a data scientist who only wants to focus on modelling and coding and not on setting up a GPU cluster? Then this blog might be interesting for you. We developed an automated pipeline using GitLab and Kubernetes that is able to run code in two GPU environments, GCP and bare metal; no need to worry about drivers, or Kubernetes cluster creation and deletion. The only thing that you have to do is push your code and it runs on a GPU!
Source code for both the custom Docker images and the Kubernetes objects definitions can be found here and here respectively.
See here the complete blog post.


Reusable Pipelines in R

Thu, 12/13/2018 - 22:46

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Pipelines in R are popular; the most popular one is the magrittr pipe, as used by dplyr.

This note will discuss the advanced re-usable piping systems: rquery/rqdatatable operator trees and wrapr function object pipelines. In each case we have a set of objects designed to extract extra power from the wrapr dot-arrow pipe %.>%.

Piping

Piping is not much more than having a system that lets one treat “x %.>% f(.)” as a near synonym for “f(x)”. For the wrapr dot arrow pipe the semantics are intentionally closer to (x %.>% f(.)) ~ {. <- x; f(.)}.

The pipe notation may be longer, but it avoids nesting and reversed right to left reading for many-stage operations (such as “x %.>% f1(.) %.>% f2(.) %.>% f3(.)” versus “f3(f2(f1(x)))”).
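As a toy example of that equivalence (my own illustration):

library("wrapr")

5 %.>% sin(.)
## [1] -0.9589243
sin(5)
## [1] -0.9589243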

In addition to allowing users to write operations in this notation, most piping systems allow users to save pipelines for later re-use (though some others have issues serializing or saving such pipelines due to entanglement with the defining environment).

wrapr and rquery/rqdatatable supply a number of piping tools that are re-usable, serializable, and very powerful (via R S3 and S4 dispatch features). One of the most compelling features is "function objects", which means objects can be treated like functions (applied to other objects by pipelines). We will discuss some of these features in the context of rquery/rqdatatable and wrapr.

rquery/rqdatatable

For quite a while the rquery and rqdatatable packages have supplied a sequence of operators abstraction called an “operator tree” or “operator pipeline”.

These pipelines are (deliberately) fairly strict:

  • They must start with a table description or definition.
  • Each step must be a table to table transform meeting certain column pre-conditions.
  • Each step must advertise what columns it makes available or produces, for later condition checking.

For a guiding example suppose we want to row-subset some data, get per-group means, and then sort the data by those means.

# our example data
d <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c( 1, 2, 2, -10),
  stringsAsFactors = FALSE
)

# load our package
library("rqdatatable")
## Loading required package: rquery

# build an operator tree
threshold <- 0.0

ops <-
  # define the data format
  local_td(d) %.>%
  # restrict to rows with value >= threshold
  select_rows_nse(., value >= threshold) %.>%
  # compute per-group aggregations
  project_nse(., groupby = "group", mean_value = mean(value)) %.>%
  # sort rows by mean_value decreasing
  orderby(., cols = "mean_value", reverse = "mean_value")

# show the tree/pipeline
cat(format(ops))
## table(d;
## group,
## value) %.>%
## select_rows(.,
## value >= 0) %.>%
## project(., mean_value := mean(value),
## g= group) %.>%
## orderby(., desc(mean_value))

Of course the purpose of such a pipeline is to be able to apply it to data. This is done simply with the wrapr dot arrow pipe:

d %.>% ops
## group mean_value
## 1: b 2.0
## 2: a 1.5

rquery pipelines are designed to specify and execute data wrangling tasks. An important feature of rquery pipelines is: they are designed for serialization. This means we can save them and also send them to multiple nodes for parallel processing.

# save the optree
saveRDS(ops, "rquery_optree.RDS")

# simulate a fresh R session
rm(list=setdiff(ls(), "d"))

library("rqdatatable")

# read the optree back in
ops <- readRDS('rquery_optree.RDS')

# look at it
cat(format(ops))
## table(d;
## group,
## value) %.>%
## select_rows(.,
## value >= 0) %.>%
## project(., mean_value := mean(value),
## g= group) %.>%
## orderby(., desc(mean_value))

# use it again
d %.>% ops
## group mean_value
## 1: b 2.0
## 2: a 1.5

# clean up
rm(list=setdiff(ls(), "d"))

We can also run rqdatatable operations in “immediate mode”, without pre-defining the pipeline or tables:

threshold <- 0.0

d %.>%
  select_rows_nse(., value >= threshold) %.>%
  project_nse(., groupby = "group", mean_value = mean(value)) %.>%
  orderby(., cols = "mean_value", reverse = "mean_value")
## group mean_value
## 1: b 2.0
## 2: a 1.5

wrapr function objects

A natural question is: given we already have rquery pipelines, why do we need wrapr function object pipelines? The reason is: rquery/rqdatatable pipelines are strict and deliberately restricted to operations that can be hosted both in R (via data.table) or on databases (examples: PostgreSQL and Spark). One might also want a more general pipeline with fewer constraints, optimized for working in R directly.

The wrapr “function object” pipelines allow treatment of arbitrary objects as items we can pipe into. Their primary purpose is to partially apply functions to convert arbitrary objects and functions into single-argument (or unary) functions. This converted form is perfect for pipelining. This, in a sense, lets us treat these objects as functions. The wrapr function object pipeline also has less constraint checking than rquery pipelines, so is more suitable for “black box” steps that do not publish their column use and production details (in fact wrapr function object pipelines work on arbitrary objects, not just data.frames or tables).

Let’s adapt our above example into a simple wrapr dot arrow pipeline.

library("wrapr") threshold <- 0 d %.>% .[.$value >= threshold, , drop = FALSE] %.>% tapply(.$value, .$group, 'mean') %.>% sort(., decreasing = TRUE) ## b a ## 2.0 1.5

All we have done is replace the rquery steps with typical base-R commands. As we see the wrapr dot arrow can route data through a sequence of such commands to repeat our example.

Now let’s adapt our above example into a re-usable wrapr function object pipeline.

library("wrapr") threshold <- 0 pipeline <- srcfn( ".[.$value >= threshold, , drop = FALSE]" ) %.>% srcfn( "tapply(.$value, .$group, 'mean')" ) %.>% pkgfn( "sort", arg_name = "x", args = list(decreasing = TRUE)) cat(format(pipeline)) ## UnaryFnList( ## SrcFunction{ .[.$value >= threshold, , drop = FALSE] }(.=., ), ## SrcFunction{ tapply(.$value, .$group, 'mean') }(.=., ), ## base::sort(x=., decreasing))

We used two wrapr abstractions to capture the steps for re-use (something built in to rquery, and now also supplied by wrapr). The abstractions are:

  • srcfn() which wraps arbitrary quoted code as a function object.
  • pkgfn() which wraps a package qualified function name as a function object (“base” being the default package).

This sort of pipeline can be applied to data using pipe notation:

d %.>% pipeline
## b a
## 2.0 1.5

The above pipeline has one key inconvenience and one key weakness:

  • For the srcfn() steps we had to place the source code in quotes, which defeats any sort of syntax highlighting and auto-completing in our R integrated development environment (IDE).
  • The above pipeline has a reference to the value of threshold in our current environment, this means the pipeline is not sufficiently self-contained to serialize and share.

We can quickly address both of these issues with the wrapr::qe() (“quote expression”) function. It uses base::substitute() to quote its arguments, and the IDE doesn’t know the contents are quoted and thus can help us with syntax highlighting and auto-completion. Also we are using base::bquote() .()-style escaping to bind in the value of threshold.

pipeline <-
  srcfn( qe( .[.$value >= .(threshold), , drop = FALSE] )) %.>%
  srcfn( qe( tapply(.$value, .$group, 'mean') )) %.>%
  pkgfn( "sort",
         arg_name = "x",
         args = list(decreasing = TRUE))

cat(format(pipeline))
## UnaryFnList(
## SrcFunction{ .[.$value >= 0, , drop = FALSE] }(.=., ),
## SrcFunction{ tapply(.$value, .$group, "mean") }(.=., ),
## base::sort(x=., decreasing))

d %.>% pipeline
## b a
## 2.0 1.5

Notice this pipeline works as before, but no longer refers to the external value threshold. This pipeline can be saved and shared.

Another recommended way to bind in values is with the args argument, which is a named list of values that are expected to be available when a srcfn() is evaluated, or additional named arguments that will be applied to a pkgfn().

In this notation the pipeline is written as follows.

pipeline <-
  srcfn( qe( .[.$value >= threshold, , drop = FALSE] ),
         args = list('threshold' = threshold)) %.>%
  srcfn( qe( tapply(.$value, .$group, 'mean') )) %.>%
  pkgfn( "sort",
         arg_name = "x",
         args = list(decreasing = TRUE))

cat(format(pipeline))
## UnaryFnList(
## SrcFunction{ .[.$value >= threshold, , drop = FALSE] }(.=., threshold),
## SrcFunction{ tapply(.$value, .$group, "mean") }(.=., ),
## base::sort(x=., decreasing))

d %.>% pipeline
## b a
## 2.0 1.5

We can save this pipeline.

saveRDS(pipeline, "wrapr_pipeline.RDS")

And simulate using it in a fresh environment (i.e. simulate sharing it).

# simulate a fresh environment
rm(list = setdiff(ls(), "d"))

library("wrapr")

pipeline <- readRDS('wrapr_pipeline.RDS')

cat(format(pipeline))
## UnaryFnList(
## SrcFunction{ .[.$value >= threshold, , drop = FALSE] }(.=., threshold),
## SrcFunction{ tapply(.$value, .$group, "mean") }(.=., ),
## base::sort(x=., decreasing))

d %.>% pipeline
## b a
## 2.0 1.5

Conclusion

And that is some of the power of wrapr piping, rquery/rqdatatable, and wrapr function objects. Essentially wrapr function objects are a reference application of the S3/S4 piping abilities discussed in the wrapr pipe formal article.

The technique is very convenient when each of the steps is substantial (such as non-trivial data preparation and model application steps).

The above techniques can make reproducing and sharing methods much easier.

We have some more examples of the technique here and here.

# clean up after example
unlink("rquery_optree.RDS")
unlink("wrapr_pipeline.RDS")


R community update: announcing sessions for useR Delhi December meetup

Thu, 12/13/2018 - 18:45

(This article was first published on R – Tech and Mortals, and kindly contributed to R-bloggers)

As referenced in my last blog post, useR Delhi NCR is all set to host our second meetup on 15th December, i.e. the upcoming Saturday. We've finalized two exciting speaker sessions for it. They are as follows:

  • Basics of Shiny and geospatial visualizations by Sean Angiolillo
  • Up in the air: cloud storage based workflows in R by Himanshu Sikaria

Apart from the above, the meetup will feature lightning talks and a networking session for attendees.
If you haven’t registered yet, do it now via this form. Event details can be found on useR Delhi’s meetup page.

Also, if you know of someone who may be interested in speaking at a future event, please let us know via mail, Twitter, Facebook or Meetup.
Please share and retweet our event and/or this blog. We would be grateful.


Gold-Mining Week 15 (2018)

Thu, 12/13/2018 - 16:02

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 15 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

The post Gold-Mining Week 15 (2018) appeared first on Fantasy Football Analytics.


RTutor: Better Incentive Contracts For Road Construction

Thu, 12/13/2018 - 09:00

(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

Since about two weeks ago, I have faced a large additional traffic jam every morning due to a construction site on the road. When passing the construction site, often only a few people, or sometimes nobody, seem to be working there. Being an economist, I really wonder how much of this congestion could be avoided with better contracts that give the construction company proper incentives to account for the social cost of the road blocking, and therefore to more often fully staff the construction site and finish earlier.

Patrick Bajari and Gregory Lewis have collected a detailed sample of 466 road construction projects in Minnesota to study this question in their very interesting article Moral Hazard, Incentive Contracts and Risk: Evidence from Procurement in the Review of Economic Studies, 2014.

They estimate a structural econometric model and find that changes in contract design could substantially reduce the duration of road blockages and largely increase total welfare at only minor increases in the risk that road construction firms face.

As part of his Master's thesis at Ulm University, Claudius Schmid has created a nice and detailed RTutor problem set that allows you to replicate the findings in an interactive fashion. You learn a lot about the structure and outcomes of the currently used contracts, the theory behind better contract design, and how the structural model to assess the quantitative effects can be estimated and simulated. At the same time, you can hone your general data science and R skills.

Here is a small screenshot:

Like in previous RTutor problem sets, you can enter free R code in a web based shiny app. The code will be automatically checked and you can get hints how to proceed. In addition you are challenged by multiple choice quizzes.

You can test the problem set online on rstudio.cloud: https://rstudio.cloud/project/137023 Source the run.R file to start the problem set.

If you don't want to register at rstudio.cloud, you can also check out the problem set on shinyapps.io: https://clamasch.shinyapps.io/RTutorIncentiveContracts

The free shinyapps.io account is capped at 25 hours total usage time per month. So it may be greyed out when you click at it. Also, unlike on rstudio.cloud, you cannot save your progress on shinyapps.io.

To locally install the problem set, follow the installation guide at the problem set’s Github repository: https://github.com/ClaMaSch/RTutorIncentiveContracts

If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at the RTutor Github page

https://github.com/skranz/RTutor


Day 13 – little helper read_files

Thu, 12/13/2018 - 08:00

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

We at STATWORX work a lot with R and we often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. At first, there was no plan to make a package, but soon I realised that it would be much easier to share and improve those functions if they were within a package. Up until the 24th of December I will present one function each day from helfRlein. So, on the 13th day of Christmas my true love gave to me…

What can it do?

This little helper reads in multiple files of the same type and combines them into a data.table. What kind of "file reading function" should be used is chosen via the FUN argument.

How to use it?

If you have a list of files that all need to be loaded in with the same function (e.g. read.csv), instead of using lapply and rbindlist you can now use this:

read_files(files, FUN = readRDS)
read_files(files, FUN = readLines)
read_files(files, FUN = read.csv, sep = ";")

Internally, it just uses lapply and rbindlist, but you don't have to type it all the time. read_files combines the single files by their column names and returns one data.table. Why data.table? Because I like it. But let's not go down the rabbit hole of data.table vs dplyr (to the rabbit hole …).
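To give a rough idea of what such a wrapper can look like internally, here is a hypothetical sketch (not the actual helfRlein source):

library(data.table)

read_files_sketch <- function(files, FUN = read.csv, ...) {
  # read each file with the supplied function, then stack the results
  # by column name into a single data.table
  contents <- lapply(files, FUN, ...)
  data.table::rbindlist(contents, use.names = TRUE, fill = TRUE)
}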

Overview

To see all the other functions you can either check out our GitHub or you can read about them here.

Have a merry advent season!

About the author

Jakob Gepp

Numbers were always my passion and as a data scientist and statistician at STATWORX I can fulfill my nerdy needs. I am also responsible for our blog. So if you have any questions or suggestions, just send me an email!

ABOUT US

STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. 


The post Day 13 – little helper read_files first appeared on STATWORX.


Recreating the NBA lead tracker graphic

Thu, 12/13/2018 - 07:38

(This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers)

For each NBA game, nba.com has a really nice graphic which tracks the point differential between the two teams throughout the game. Here is the lead tracker graphic for the game between the LA Clippers and the Phoenix Suns on 10 Dec 2018:

Taken from https://www.nba.com/games/20181210/LACPHX#/matchup

I thought it would be cool to try recreating this graphic with R. You might ask: why try to replicate something that exists already? If we are able to pull out the data underlying this graphic, we could do much more than just replicate what is already out there; we have the power to make other visualizations which could be more informative or powerful. (For example, how does this chart look like for all games that the Golden State Warriors played in? Or, how does the chart look like for each quarter of the game?)

The full R code for this post can be found here. For a self-contained script that accepts a game ID parameter and produces the lead tracker graphic, click here.

First, we load the packages that we will use:

library(lubridate)
library(rvest)
library(stringr)
library(tidyverse)

We can get play-by-play data from Basketball-Reference.com (here is the link for the LAC @ PHX game on 2018-12-10). Here is a snippet of the play-by-play table on that webpage; we would like to extract the columns in red:

Play-by-play data from basketball-reference.com.

The code below extracts the webpage, then pulls out rows from the play-by-play table:

# get webpage
url <- paste0("https://www.basketball-reference.com/boxscores/pbp/",
              current_id, ".html")
webpage <- read_html(url)

# pull out the events from the play-by-play table
events <- webpage %>%
  html_nodes("#pbp") %>%
  html_nodes("tr") %>%
  html_text()

events is a character vector that looks like this:

We would really like to pull out the data in the boxes above. Timings are easy enough to pull out with regular expressions (e.g. start of the string: at least 1 digit, then :, then at least one digit, then ., then at least one digit). Pulling out the score is a bit trickier: we can’t just use the regular expression denoting a dash – with a number on each side. An example of why that doesn’t work is in the purple box above. Whenever a team scores, basketball-reference.com puts a “+2” or “+3” on the left or right of the score, depending on which team scored. In events, these 3 columns get smushed together into one string. If the team on the left scores, pulling out number-dash-number will give the wrong value (e.g. the purple box above would give 22-2 instead of 2-2).

To avoid this issue, we extract the “+”s that may appear on either side of the score. In fact, this has an added advantage: we only need to extract a score if it is different from the previous timestamp. As such, we only have to keep the scores which have a “+” on either side of it. We then post-process the scores.

# get event times & scores
times <- str_extract(events, "^\\d+:\\d+.\\d+")
scores <- str_extract(events, "[\\+]*\\d+-\\d+[\\+]*")
scores <- ifelse(str_detect(scores, "\\+"), scores, NA)
df <- data.frame(time = times, score = scores, stringsAsFactors = FALSE) %>%
  na.omit()

# remove the +'s
parseScore <- function(x) {
  if (startsWith(x, "+")) {
    return(str_sub(x, 3, str_length(x)))
  } else if (endsWith(x, "+")) {
    return(str_sub(x, 1, str_length(x) - 1))
  } else {
    return(x)
  }
}
df$score <- sapply(df$score, parseScore)

Next, we split the score into visitor and home score and compute the point differential (positive means the visitor team is winning):

# split score into visitor and home score, get home advantage
df <- df %>%
  separate(score, into = c("visitor", "home"), sep = "-") %>%
  mutate(visitor = as.numeric(visitor),
         home = as.numeric(home),
         time = ms(time)) %>%
  mutate(visitor_adv = visitor - home)

Next we need to process the timings. Each of the 4 quarters lasts for 12 minutes, while each overtime period (if any) lasts for 5 minutes. The time column shows the amount of time remaining in the current period. We will amend the times so that they show the time elapsed (in seconds) from the start of the game. This notion of time makes it easier for plotting, and works for any number of overtime periods as well.

# get period of play (e.g. Q1, Q2, ...)
df$period <- NA
period <- 0
prev_time <- ms("0:00")
for (i in 1:nrow(df)) {
  curr_time <- df[i, "time"]
  if (prev_time < curr_time) {
    period <- period + 1
  }
  df[i, "period"] <- period
  prev_time <- curr_time
}

# convert time such that it runs upwards. regular quarters are 12M long, OT
# periods are 5M long
df <- df %>%
  mutate(time = ifelse(period <= 4,
                       as.duration(12 * 60) - as.duration(time),
                       as.duration(5 * 60) - as.duration(time))) %>%
  mutate(time = ifelse(period <= 4,
                       time + as.duration(12 * 60 * (period - 1)),
                       time + as.duration(12 * 60 * 4) + as.duration(5 * 60 * (period - 5))))

At this point, we have enough to make crude approximations of the lead tracker graphic:

ggplot() +
  geom_line(data = df, aes(x = time, y = visitor_adv)) +
  labs(title = "LAC @ PHX, 2018-12-10") +
  theme_minimal() +
  theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5))

ggplot() +
  geom_step(data = df, aes(x = time, y = visitor_adv)) +
  labs(title = "LAC @ PHX, 2018-12-10") +
  theme_minimal() +
  theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5))

Getting the fill colors that NBA.com’s lead tracker has requires a bit more work. We need to split the visitor_adv into two columns: the visitor’s lead (0 if they are behind) and the home’s lead (0 if they are behind). We can then draw the chart above and below the x-axis as two geom_ribbons. (It’s a little more complicated than that, see this StackOverflow question and this gist for details.) Colors were obtained using imagecolorpicker.com.

df$visitor_lead <- pmax(df$visitor_adv, 0)
df$home_lead <- pmin(df$visitor_adv, 0)

df_extraSteps <- df %>%
  mutate(visitor_adv = lag(visitor_adv),
         visitor_lead = lag(visitor_lead),
         home_lead = lag(home_lead))
df2 <- bind_rows(df_extraSteps, df) %>%
  arrange(time)

ggplot() +
  geom_ribbon(data = df2, aes(x = time, ymin = 0, ymax = visitor_lead), fill = "#F7174E") +
  geom_ribbon(data = df2, aes(x = time, ymin = home_lead, ymax = 0), fill = "#F16031") +
  labs(title = "LAC @ PHX, 2018-12-10") +
  theme_minimal() +
  theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5))

Almost there! The code below does some touch up to the figure, giving it the correct limits for the y-axis as well as vertical lines for the end of the periods.

# get score differential range (round to nearest 5)
ymax <- round(max(df$visitor_adv) * 2, digits = -1) / 2
ymin <- round(min(df$visitor_adv) * 2, digits = -1) / 2

# get period positions and labels
periods <- unique(df$period)
x_value <- ifelse(periods <= 4,
                  12 * 60 * periods,
                  12 * 60 * 4 + 5 * 60 * (periods - 4))
x_label <- ifelse(periods <= 4,
                  paste0("Q", periods),
                  paste0("OT", periods - 4))

ggplot() +
  geom_ribbon(data = df2, aes(x = time, ymin = 0, ymax = visitor_lead), fill = "#F7174E") +
  geom_ribbon(data = df2, aes(x = time, ymin = home_lead, ymax = 0), fill = "#F16031") +
  geom_vline(aes(xintercept = x_value), linetype = 2, col = "grey") +
  scale_y_continuous(limits = c(ymin, ymax)) +
  labs(title = "LAC @ PHX, 2018-12-10") +
  scale_x_continuous(breaks = x_value, labels = x_label) +
  theme_minimal() +
  theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5),
        axis.title.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank())

The figure above is what we set out to plot. However, since we have the underlying data, we can now make plots of the same data that may reveal other trends (code at the end of this R file). Here are the line and ribbon plots where we look at the absolute score rather than the point differential:

Here, we add points to the line plot to indicate whether a free throw, 2 pointer or 3 pointer was scored:

The possibilities are endless!


Twins on the up

Thu, 12/13/2018 - 01:00

(This article was first published on HighlandR, and kindly contributed to R-bloggers)

Are multiple births on the increase?

My twin boys turned 5 years old today. Wow, time flies. Life is never dull, because twins are still seen as something of a novelty, so wherever we go, we find ourselves in conversation with strangers, who are intrigued by the whole thing.

In order to save time if we ever meet, here’s some FAQ’s:

  • No, they’re not identical
  • Yes, I’m sure
  • No, they do not have similar personalities
  • They like different things – one likes Hulk and Gekko, the other likes Iron Man and Catboy.

Recently I’ve been hearing and seeing anecdotal evidence that twins and multiple births are on the increase. I tried to find some data for Scotland, and while there is a lot of information on births in Scotland available, I couldn’t find breakdowns of multiple births.
However, I did find some information for England and Wales, so let’s look at that.

In this next bit, the key thing that may be of interest is the use of tidyr::gather.
There has been some discussion on #rstats Twitter about things people struggle with, and a surprising number of people struggle to remember the syntax for tidyr's gather and spread.

(I can neither confirm nor deny that I am one of them).
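For anyone in the same boat, here is a tiny reminder example on a made-up tibble (not the birth data):

library(tidyr)

toy <- tibble::tibble(Year = 2015:2016,
                      Under20 = c(10, 12),
                      `20_to_24` = c(30, 28))

# wide -> long: gather(data, key, value, columns to stack)
long <- gather(toy, key = age_group, value = maternities, -Year)

# long -> wide again: spread(data, key, value)
spread(long, key = age_group, value = maternities)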

The data was found here

library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)

data <- read_xls("birthcharacteristicsworkbook2016.xls",
                 sheet = "Table 11",
                 range = "A10:I87")

data <- data %>%
  rename(Year = X__1,
         All_ages = `All maternities with multiple births`,
         Under20 = X__2,
         `20_to_24` = X__3,
         `25_to_29` = X__4,
         `30_to_34` = X__5,
         `35_to_39` = X__6,
         `40_to_44` = X__7,
         `45_and_over` = X__8)

# the 1981 data is borked, so ignore that

Note use of gather to combine all the age groups into an age_group variable.

We use the Year column as an index so we have an entry for every age group, for every year, with the value represented as ‘maternities’.

Back to the code:

long_data <- data %>%
  filter(Year != "1981") %>%
  gather(key = age_group, value = "maternities", -Year)

long_data$Year <- as.numeric(long_data$Year)
long_data$age_group <- forcats::as_factor(long_data$age_group)
long_data$maternities <- as.numeric(long_data$maternities)

ggplot(long_data, aes(Year, maternities), group = age_group) +
  geom_line() +
  geom_point() +
  facet_wrap(vars(age_group), scales = "free_y") +
  ggtitle(label = "England and Wales maternities with multiple births - numbers",
          subtitle = "By age of mother, 1940 to 2016") +
  labs(x = NULL, y = "Multiple maternities")

# Let's do rates
rates <- read_xls("birthcharacteristicsworkbook2016.xls",
                  sheet = "Table 11",
                  range = "A89:I166")

rates <- rates %>%
  rename(Year = X__1,
         All_ages = `All maternities with multiple births per 1,000 all maternities`,
         Under20 = X__2,
         `20_to_24` = X__3,
         `25_to_29` = X__4,
         `30_to_34` = X__5,
         `35_to_39` = X__6,
         `40_to_44` = X__7,
         `45_and_over` = X__8)

long_rates <- rates %>%
  filter(Year != 1981) %>%
  gather(key = age_group, value = "multiple_maternities_per_1000", -Year)

long_rates$Year <- as.numeric(long_rates$Year)
long_rates$age_group <- forcats::as_factor(long_rates$age_group)
long_rates$multiple_maternities_per_1000 <- as.numeric(long_rates$multiple_maternities_per_1000)

ggplot(long_rates, aes(Year, multiple_maternities_per_1000), group = age_group) +
  geom_line() +
  geom_point() +
  facet_wrap(vars(age_group)) +
  ggtitle(label = "England and Wales Rate of maternities with multiple births - per 1,000 all maternities",
          subtitle = "By age of mother, 1940 to 2016") +
  labs(x = NULL, y = "Multiple maternities")

When we look at maternities with multiple births as a rate per 1000 maternities, we see the increase in multiple births among older mothers, especially in the over 45 group.

Again, with free scales on the y axis – which helps us see that almost all age groups are exhibiting an increase – compare the 20-24 age group as a rate and as a count, for example.

Looks to me that overall, the rate of multiple births is increasing.
What’s driving this?
Can it continue?
Will people ever stop asking us if the twins are identical?



Rsampling Fama French

Thu, 12/13/2018 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Today we will continue our work on Fama French factor models, but more as a vehicle to explore some of the awesome stuff happening in the world of tidy models. For new readers who want to get familiar with Fama French before diving into this post, see here where we covered importing and wrangling the data, here where we covered rolling models and visualization, my most recent previous post here where we covered managing many models, and if you're into Shiny, this flexdashboard.

Our goal today is to explore k-fold cross-validation via the rsample package, and a bit of model evaluation via the yardstick package. We started the model evaluation theme last time when we used tidy(), glance() and augment() from the broom package. In this post, we will use the rmse() function from yardstick, but our main focus will be on the vfold_cv() function from rsample. We are going to explore these tools in the context of linear regression and Fama French, which might seem weird since these tools would typically be employed in the realms of machine learning, classification, and the like. We'll stay in the world of explanatory models via linear regression for a few reasons.

First, and this is a personal preference, when getting to know a new package or methodology, I prefer to do so in a context that’s already familiar. I don’t want to learn about rsample whilst also getting to know a new data set and learning the complexities of some crazy machine learning model. Since Fama French is familiar from our previous work, we can focus on the new tools in rsample and yardstick. Second, factor models are important in finance, despite relying on good old linear regression. We won’t regret time spent on factor models, and we might even find creative new ways to deploy or visualize them.

The plan for today is to take the same models that we ran in the last post, only this time using k-fold cross-validation and bootstrapping to try to assess the quality of those models.

For that reason, we’ll be working with the same data as we did previously. I won’t go through the logic again, but in short, we’ll import data for daily prices of five ETFs, convert them to returns (have a look here for a refresher on that code flow), then import the five Fama French factor data and join it to our five ETF returns data. Here’s the code to make that happen:

library(tidyverse)
library(broom)
library(tidyquant)
library(timetk)

symbols <- c("SPY", "EFA", "IJS", "EEM", "AGG")

# The prices object will hold our daily price data.
prices <-
  getSymbols(symbols, src = 'yahoo',
             from = "2012-12-31",
             to = "2017-12-31",
             auto.assign = TRUE,
             warnings = FALSE) %>%
  map(~Ad(get(.))) %>%
  reduce(merge) %>%
  `colnames<-`(symbols)

asset_returns_long <-
  prices %>%
  tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
  gather(asset, returns, -date) %>%
  group_by(asset) %>%
  mutate(returns = (log(returns) - log(lag(returns)))) %>%
  na.omit()

factors_data_address <-
  "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Global_5_Factors_Daily_CSV.zip"

factors_csv_name <- "Global_5_Factors_Daily.csv"

temp <- tempfile()

download.file(
  # location of file to be downloaded
  factors_data_address,
  # where we want R to store that file
  temp,
  quiet = TRUE)

Global_5_Factors <-
  read_csv(unz(temp, factors_csv_name), skip = 6) %>%
  rename(date = X1, MKT = `Mkt-RF`) %>%
  mutate(date = ymd(parse_date_time(date, "%Y%m%d"))) %>%
  mutate_if(is.numeric, funs(. / 100)) %>%
  select(-RF)

data_joined_tidy <-
  asset_returns_long %>%
  left_join(Global_5_Factors, by = "date") %>%
  na.omit()

After running that code, we have an object called data_joined_tidy. It holds daily returns for 5 ETFs and the Fama French factors. Here's a look at the first row for each ETF.

data_joined_tidy %>% slice(1) # A tibble: 5 x 8 # Groups: asset [5] date asset returns MKT SMB HML RMW CMA 1 2013-01-02 AGG -0.00117 0.0199 -0.0043 0.0028 -0.0028 -0.0023 2 2013-01-02 EEM 0.0194 0.0199 -0.0043 0.0028 -0.0028 -0.0023 3 2013-01-02 EFA 0.0154 0.0199 -0.0043 0.0028 -0.0028 -0.0023 4 2013-01-02 IJS 0.0271 0.0199 -0.0043 0.0028 -0.0028 -0.0023 5 2013-01-02 SPY 0.0253 0.0199 -0.0043 0.0028 -0.0028 -0.0023

Let’s work with just one ETF for today and use filter(asset == "AGG") to shrink our data down to just that ETF.

agg_ff_data <- data_joined_tidy %>% filter(asset == "AGG")

Okay, we’re going to regress the daily returns of AGG on one factor, then three factors, then five factors, and we want to evaluate how well each model explains AGG’s returns. That means we need a way to test the model. Last time, we looked at the adjusted r-squared values when the model was run on the entirety of AGG returns. Today, we’ll evaluate the model using k-fold cross validation. That’s a pretty jargon-heavy phrase that isn’t part of the typical finance lexicon. Let’s start with the second part, cross-validation. Instead of running our model on the entire data set – all the daily returns of AGG – we’ll run it on just part of the data set, then test the results on the part that we did not use. Those different subsets of our original data are often called the training and the testing sets, though rsample calls them the analysis and assessment sets. We validate the model results by applying them to the assessment data and seeing how the model performed.

The k-fold bit refers to the fact that we’re not just dividing our data into training and testing subsets, we’re actually going to divide it into a bunch of groups, a k number of groups, or a k number of folds. One of those folds will be used as the validation set; the model will be fit on the other k - 1 sets, and then tested on the validation set. We’re doing this with a linear model to see how well it explains the data; it’s typically used in machine learning to see how well a model predicts data (we’ll get there in 2019).1
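To make the mechanics concrete, here is a conceptual sketch of the splitting step in base R (for intuition only; rsample's vfold_cv() does this for us below):

# assign each row of agg_ff_data to one of 5 folds at random
k <- 5
folds <- sample(rep(1:k, length.out = nrow(agg_ff_data)))

# for fold 1: fit on the other four folds, assess on fold 1
analysis_1 <- agg_ff_data[folds != 1, ]
assessment_1 <- agg_ff_data[folds == 1, ]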

If you're like me, it will take a bit of tinkering to really grasp k-fold cross-validation, but rsample has a great function for dividing our data into k folds. If we wish to use five folds (the state of the art seems to be either five or ten folds), we call the vfold_cv() function, pass it our data object agg_ff_data, and set v = 5.

library(rsample) library(yardstick) set.seed(752) cved_ff<- vfold_cv(agg_ff_data, v = 5) cved_ff # 5-fold cross-validation # A tibble: 5 x 2 splits id 1 Fold1 2 Fold2 3 Fold3 4 Fold4 5 Fold5

We have an object called cved_ff, with a column called splits and a column called id. Let’s peek at the first split.

cved_ff$splits[[1]] <1007/252/1259>

Three numbers. The first, 1007, tells us how many observations are in the analysis set. Since we have five folds, we should have roughly 80% (or 4/5) of our data in the analysis set. The second number, 252, tells us how many observations are in the assessment set, which is roughly 20% of our original data. The third number, 1259, is the total number of observations in our original data.
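We can confirm those numbers directly from the split object, as a quick sanity check:

first_split <- cved_ff$splits[[1]]

nrow(analysis(first_split))   # 1007 rows used to fit the model
nrow(assessment(first_split)) # 252 rows held out for testing
nrow(agg_ff_data)             # 1259 rows in total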

Next, we want to apply a model to the analysis set of this k-folded data and test the results on the assessment set. Let’s start with one factor and run a simple linear model, lm(returns ~ MKT).

We want to run it on analysis(cved_ff$splits[[1]]) – the analysis set of our first split.

ff_model_test <- lm(returns ~ MKT, data = analysis(cved_ff$splits[[1]])) ff_model_test Call: lm(formula = returns ~ MKT, data = analysis(cved_ff$splits[[1]])) Coefficients: (Intercept) MKT 0.0001025 -0.0265516

Nothing too crazy so far. Now we want to test on our assessment data. The first step is to add our fitted values to that data. We'll use augment() for that task, and pass it assessment(cved_ff$splits[[1]])

ff_model_test %>% augment(newdata = assessment(cved_ff$splits[[1]])) %>% head() %>% select(returns, .fitted) returns .fitted 1 0.0009021065 1.183819e-04 2 0.0011726989 4.934779e-05 3 0.0010815505 1.157267e-04 4 -0.0024385815 -7.544460e-05 5 -0.0021715702 -8.341007e-05 6 0.0028159467 3.865527e-04

We just added our fitted values to the assessment data, the subset of the data on which the model was not fit. How well did our model do when we compare the fitted values to the actual values in the held-out set?

We can use the rmse() function from yardstick to measure our model. RMSE stands for root mean squared error: the square root of the mean of the squared differences between our fitted values and the actual values in the assessment data. A lower RMSE is better!

ff_model_test %>% augment(newdata = assessment(cved_ff$splits[[1]])) %>% rmse(returns, .fitted) # A tibble: 1 x 3 .metric .estimator .estimate 1 rmse standard 0.00208
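As a quick check on what rmse() is reporting, we can compute the same quantity by hand from the augmented data (a sketch using the objects defined above):

ff_model_test %>%
    augment(newdata = assessment(cved_ff$splits[[1]])) %>%
    summarise(rmse_by_hand = sqrt(mean((returns - .fitted)^2)))

The result should agree with the yardstick output up to rounding.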

Now that we’ve done that piece by piece, let’s wrap the whole operation into one function. This function takes one argument, a split, and we’re going to use pull() so we can extract the raw number, instead of the entire tibble result.

model_one <- function(split) {
    split_for_model <- analysis(split)
    ff_model <- lm(returns ~ MKT, data = split_for_model)
    holdout <- assessment(split)

    rmse <- ff_model %>%
        augment(newdata = holdout) %>%
        rmse(returns, .fitted) %>%
        pull(.estimate)
}

Now we pass it our first split.

model_one(cved_ff$splits[[1]]) %>% head() [1] 0.002080324

Now we want to apply that function to each of our five folds, which are stored in cved_ff. We do that with a combination of mutate() and map_dbl(). We use map_dbl() instead of map() because we are returning a number here and there's not a good reason to store that number in a list column.

cved_ff %>% mutate(rmse = map_dbl(cved_ff$splits, model_one)) # 5-fold cross-validation # A tibble: 5 x 3 splits id rmse * 1 Fold1 0.00208 2 Fold2 0.00189 3 Fold3 0.00201 4 Fold4 0.00224 5 Fold5 0.00190

OK, we have five RMSEs since we ran the model on five separate analysis sets and tested it on five separate assessment sets. Let's find the average RMSE by taking the mean() of the rmse column. That can help reduce the noisiness that results from our random creation of those five folds.

cved_ff %>% mutate(rmse = map_dbl(cved_ff$splits, model_one)) %>% summarise(mean_rse = mean(rmse)) # 5-fold cross-validation # A tibble: 1 x 1 mean_rse 1 0.00202

We now have the mean RMSE after running on our model, lm(returns ~ MKT), on all five of our folds.

That process for finding the mean RMSE can be applied to other models, as well. Let's suppose we wish to find the mean RMSE for two other models: lm(returns ~ MKT + SMB + HML), the Fama French three-factor model, and lm(returns ~ MKT + SMB + HML + RMW + CMA), the Fama French five-factor model. By comparing the mean RMSEs, we can evaluate which model explained the returns of AGG better. Since we're just adding more and more factors, the models can be expected to get more and more accurate, but again, we are exploring the rsample machinery and creating a template where we can pop in whatever models we wish to compare.

First, let's create two new functions that follow the exact same code pattern as above but house the three-factor and five-factor models.

model_two <- function(split) {
    split_for_model <- analysis(split)
    ff_model <- lm(returns ~ MKT + SMB + HML, data = split_for_model)
    holdout <- assessment(split)

    rmse <- ff_model %>%
        augment(newdata = holdout) %>%
        rmse(returns, .fitted) %>%
        pull(.estimate)
}

model_three <- function(split) {
    split_for_model <- analysis(split)
    ff_model <- lm(returns ~ MKT + SMB + HML + RMW + CMA, data = split_for_model)
    holdout <- assessment(split)

    rmse <- ff_model %>%
        augment(newdata = holdout) %>%
        rmse(returns, .fitted) %>%
        pull(.estimate)
}

Now we pass those three models to the same mutate() with map_dbl() flow that we used with just one model. The result will be three new columns of RMSE’s, one for each of our three models applied to our five folds.

cved_ff %>% mutate( rmse_model_1 = map_dbl( splits, model_one), rmse_model_2 = map_dbl( splits, model_two), rmse_model_3 = map_dbl( splits, model_three)) # 5-fold cross-validation # A tibble: 5 x 5 splits id rmse_model_1 rmse_model_2 rmse_model_3 * 1 Fold1 0.00208 0.00211 0.00201 2 Fold2 0.00189 0.00184 0.00178 3 Fold3 0.00201 0.00195 0.00194 4 Fold4 0.00224 0.00221 0.00213 5 Fold5 0.00190 0.00183 0.00177

We can also find the mean RMSE for each model.

cved_ff %>% mutate( rmse_model_1 = map_dbl( splits, model_one), rmse_model_2 = map_dbl( splits, model_two), rmse_model_3 = map_dbl( splits, model_three)) %>% summarise(mean_rmse_model_1 = mean(rmse_model_1), mean_rmse_model_2 = mean(rmse_model_2), mean_rmse_model_3 = mean(rmse_model_3)) # 5-fold cross-validation # A tibble: 1 x 3 mean_rmse_model_1 mean_rmse_model_2 mean_rmse_model_3 1 0.00202 0.00199 0.00192

That code flow worked just fine, but we had to repeat ourselves when creating the functions for each model. Let's toggle to a flow where we define three models – the ones that we wish to test via cross-validation and RMSE – then pass those models to one function.

First, we use as.formula() to define our three models.

mod_form_1 <- as.formula(returns ~ MKT) mod_form_2 <- as.formula(returns ~ MKT + SMB + HML) mod_form_3 <- as.formula(returns ~ MKT + SMB + HML + RMW + CMA)

Now we write one function that takes split as an argument, same as above, but also takes formula as an argument, so we can pass it different models. This gives us the flexibility to more easily define new models and pass them to map, so I’ll append _flex to the name of this function.

ff_rmse_models_flex <- function(split, formula) {
    split_for_data <- analysis(split)
    ff_model <- lm(formula, data = split_for_data)
    holdout <- assessment(split)

    rmse <- ff_model %>%
        augment(newdata = holdout) %>%
        rmse(returns, .fitted) %>%
        pull(.estimate)
}

Now we use the same code flow as before, except we call map_dbl(), pass it our cved_ff$splits object, our new flex function called ff_rmse_models_flex(), and the model we wish to pass as the formula argument. First we pass it mod_form_1.

cved_ff %>% mutate(rmse_model_1 = map_dbl(cved_ff$splits, ff_rmse_models_flex, formula = mod_form_1)) # 5-fold cross-validation # A tibble: 5 x 3 splits id rmse_model_1 * 1 Fold1 0.00208 2 Fold2 0.00189 3 Fold3 0.00201 4 Fold4 0.00224 5 Fold5 0.00190

Now let’s pass it all three models and find the mean RMSE.

cved_ff %>% mutate(rmse_model_1 = map_dbl(cved_ff$splits, ff_rmse_models_flex, formula = mod_form_1), rmse_model_2 = map_dbl(cved_ff$splits, ff_rmse_models_flex, formula = mod_form_2), rmse_model_3 = map_dbl(cved_ff$splits, ff_rmse_models_flex, formula = mod_form_3)) %>% summarise(mean_rmse_model_1 = mean(rmse_model_1), mean_rmse_model_2 = mean(rmse_model_2), mean_rmse_model_3 = mean(rmse_model_3)) # 5-fold cross-validation # A tibble: 1 x 3 mean_rmse_model_1 mean_rmse_model_2 mean_rmse_model_3 1 0.00202 0.00199 0.00192

Alright, that code flow seems a bit more flexible than our original method of writing a function to assess each model. We didn’t do much hard thinking about functional form here, but hopefully this provides a template where you could assess more nuanced models. We’ll get into bootstrapping and time series work next week, then head to Shiny to ring in the New Year!

And, finally, a couple of public service announcements.

First, thanks to everyone who has checked out my new book! The price just got lowered for the holidays. See on Amazon or on the CRC homepage (okay, that was more of an announcement about my book).

Second, applications are open for the Battlefin alternative data contest, and RStudio is one of the tools you can use to analyze the data. Check it out here. In January, they’ll announce 25 finalists who will get to compete for a cash prize and connect with some quant hedge funds. Go get ‘em!

Thanks for reading and see you next time.

  1. For more on cross-validation, see “An Introduction to Statistical Learning”, chapter 5. Available online here: http://www-bcf.usc.edu/~gareth/ISL/.




Teaching and Learning Materials for Data Visualization

Wed, 12/12/2018 - 19:12

(This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers)


Data Visualization: A Practical Introduction will begin shipping next week. I've written an R package that contains datasets, functions, and a course packet to go along with the book. The socviz package contains about twenty-five datasets and a number of utility and convenience functions. The datasets range in size from things with just a few rows (used for purely illustrative purposes) to datasets with over 120,000 observations, for practicing with and exploring.

A course packet is also included in the package. This is a zipped file containing an RStudio project consisting of nine R Markdown documents that parallel the chapters in the book. They contain the code for almost all the figures in the book (and a few more besides). There are also some additional support files, to help demonstrate things like reading in your own data locally in R.

Installing the package

To install the package, you can follow the instructions in the Preface to the book. Alternatively, first download and install R for MacOS, Windows or Linux, as appropriate. Then download and install RStudio. Launch RStudio and then type the following code at the Console prompt (>), hitting return at the end of each line:

my_packages <- c("tidyverse", "fs", "devtools") install.packages(my_packages) devtools::install_github("kjhealy/socviz")

Once everything has downloaded and been installed (which may take a little while), load the socviz package:

library(socviz)

The Course Packet

The supporting materials are contained in a compressed .zip file. To extract them to your Desktop, make sure the socviz package is loaded as described above. Then do this:

setup_course_notes()

This will copy the dataviz_course_notes.zip file to your Desktop, and uncompress it into a folder called dataviz_course_notes. Double-click the file named dataviz.Rproj to launch the project as a new RStudio session. If you want to uncompress the file somewhere other than your Desktop, e.g. your Documents folder, you can do this:

setup_course_notes(folder = "~/Documents")

The source code for socviz is available on GitHub. I plan on continuing to update and improve it as I use it myself in my own classes and workshops.



How to deploy a predictive service to Kubernetes with R and the AzureContainers package

Wed, 12/12/2018 - 17:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It's easy to create a function in R, but what if you want to call that function from a different application, with the scale to support a large number of simultaneous requests? This article shows how you can deploy an R fitted model as a Plumber web service in Kubernetes, using Azure Container Registry (ACR) and Azure Kubernetes Service (AKS). We use the AzureContainers package to create the necessary resources and deploy the service.

Fit the model

We’ll fit a simple model for illustrative purposes, using the Boston housing dataset (which ships with R in the MASS package). To make the deployment process more interesting, the model we fit will be a random forest, using the randomForest package. This is not part of R, so we’ll have to install it from CRAN.

data(Boston, package="MASS")
install.packages("randomForest")
library(randomForest)

# train a model for median house price
bos_rf <- randomForest(medv ~ ., data=Boston, ntree=100)

# save the model
saveRDS(bos_rf, "bos_rf.rds")

Scoring script for plumber

Now that we have the model, we also need a script to obtain predicted values from it given a set of inputs:

# save as bos_rf_score.R
bos_rf <- readRDS("bos_rf.rds")
library(randomForest)

#* @param df data frame of variables
#* @post /score
function(req, df)
{
    df <- as.data.frame(df)
    predict(bos_rf, df)
}

This is fairly straightforward, but the comments may require some explanation. They are plumber annotations that tell it to call the function if the server receives an HTTP POST request with the path /score and query parameter df. The value of the df parameter is then converted to a data frame and passed to the randomForest predict method. For a fuller description of how Plumber works, see the Plumber website.
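Before building the container, you may want to check the scoring script locally (a minimal sketch, assuming the plumber package is installed and that bos_rf.rds and bos_rf_score.R sit in the working directory):

library(plumber)

# plumb the script and serve it locally on port 8000
pr <- plumb("bos_rf_score.R")
pr$run(host = "127.0.0.1", port = 8000)

# from a second R session, post a request against the local endpoint:
# httr::POST("http://127.0.0.1:8000/score",
#            body = list(df = MASS::Boston[1:5, ]), encode = "json")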

Create a Dockerfile

Let’s package up the model and the scoring script into a Docker image. A Dockerfile to do this is shown below. This uses the base image supplied by Plumber (trestletech/plumber), installs randomForest, and then adds the model and the above scoring script. Finally, it runs the code that will start the server and listen on port 8000.

# example Dockerfile to expose a plumber service

FROM trestletech/plumber

# install the randomForest package
RUN R -e 'install.packages(c("randomForest"))'

# copy model and scoring script
RUN mkdir /data
COPY bos_rf.rds /data
COPY bos_rf_score.R /data
WORKDIR /data

# plumb and run server
EXPOSE 8000
ENTRYPOINT ["R", "-e", \
    "pr <- plumber::plumb('/data/bos_rf_score.R'); \
    pr$run(host='0.0.0.0', port=8000)"]

Build and upload the image

The code to store our image on Azure Container Registry is as follows. This calls AzureRMR to login to Resource Manager, creates an Azure Container Registry resource (a Docker registry hosted in Azure), and then pushes the image to the registry.

If this is the first time you are using AzureRMR, you’ll have to create a service principal first. For more information on how to do this, see the AzureRMR readme.

library(AzureContainers)

az <- AzureRMR::az_rm$new(
    tenant="myaadtenant.onmicrosoft.com",
    app="app_id",
    password="password")

# create a resource group for our deployments
deployresgrp <- az$
    get_subscription("subscription_id")$
    create_resource_group("deployresgrp", location="australiaeast")

# create container registry
deployreg_svc <- deployresgrp$create_acr("deployreg")

# build image 'bos_rf'
call_docker("build -t bos_rf .")

# upload the image to Azure
deployreg <- deployreg_svc$get_docker_registry()
deployreg$push("bos_rf")

If you run this code, you should see a lot of output indicating that R is downloading, compiling and installing randomForest, and finally that the image is being pushed to Azure. (You will see this output even if your machine already has the randomForest package installed. This is because the package is being installed to the R session inside the container, which is distinct from the one running the code shown here.)

All docker calls in AzureContainers, like the one to build the image, return the actual docker commandline as the cmdline attribute of the (invisible) returned value. In this case, the commandline is docker build -t bos_rf . Similarly, the push() method actually involves two Docker calls, one to retag the image, and the second to do the actual pushing; the returned value in this case will be a 2-component list with the command lines being docker tag bos_rf deployreg.azurecr.io/bos_rf and docker push deployreg.azurecr.io/bos_rf.
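For example, to inspect that attribute yourself (a sketch; note the return value is invisible, so assign it first):

# re-run the build and capture the (invisible) return value
res <- call_docker("build -t bos_rf .")

# the docker command line that was executed, as described above
attr(res, "cmdline")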

Deploy to a Kubernetes cluster

The code to create an AKS resource (a managed Kubernetes cluster in Azure) is quite simple:

# create a Kubernetes cluster with 2 nodes, running Linux
deployclus_svc <- deployresgrp$create_aks("deployclus",
    agent_pools=aks_pools("pool1", 2))

Creating a Kubernetes cluster can take several minutes. By default, the create_aks() method will wait until the cluster provisioning is complete before it returns.

Having created the cluster, we can deploy our model and create a service. We’ll use a YAML configuration file to specify the details for the deployment and service API.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: bos-rf
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: bos-rf
    spec:
      containers:
      - name: bos-rf
        image: deployreg.azurecr.io/bos_rf
        ports:
        - containerPort: 8000
        resources:
          requests:
            cpu: 250m
          limits:
            cpu: 500m
      imagePullSecrets:
      - name: deployreg.azurecr.io
---
apiVersion: v1
kind: Service
metadata:
  name: bos-rf-svc
spec:
  selector:
    app: bos-rf
  type: LoadBalancer
  ports:
  - protocol: TCP
    port: 8000

The following code will obtain the cluster endpoint from the AKS resource and then deploy the image and service to the cluster. The configuration details for the deployclus cluster are stored in a file located in the R temporary directory; all of the cluster’s methods will use this file. Unless told otherwise, AzureContainers does not touch your default Kubernetes configuration (~/kube/config).

# get the cluster endpoint
deployclus <- deployclus_svc$get_cluster()

# pass registry authentication details to the cluster
deployclus$create_registry_secret(deployreg, email="me@example.com")

# create and start the service
deployclus$create("bos_rf.yaml")

To check on the progress of the deployment, run the get() methods specifying the type and name of the resource to get information on. As with Docker, these correspond to calls to the kubectl commandline tool, and again, the actual commandline is stored as the cmdline attribute of the returned value.

deployclus$get("deployment bos-rf") #> Kubernetes operation: get deployment bos-rf --kubeconfig=".../kubeconfigxxxx" #> NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE #> bos-rf 1 1 1 1 5m deployclus$get("service bos-rf-svc") #> Kubernetes operation: get service bos-rf-svc --kubeconfig=".../kubeconfigxxxx" #> NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE #> bos-rf-svc LoadBalancer 10.0.8.189 52.187.249.58 8000:32276/TCP 5m

Once the service is up and running, as indicated by the presence of an external IP in the service details, let's test it with an HTTP request. The response should look like this.

response <- httr::POST("http://52.187.249.58:8000/score", body=list(df=MASS::Boston[1:10,]), encode="json") httr::content(response, simplifyVector=TRUE) #> [1] 25.9269 22.0636 34.1876 33.7737 34.8081 27.6394 21.8007 22.3577 16.7812 18.9785

Finally, once we are done, we can tear down the service and deployment:

deployclus$delete("service", "bos-rf-svc") deployclus$delete("deployment", "bos-rf")

And if required, we can also delete all the resources created here, by simply deleting the resource group (AzureContainers will prompt you for confirmation):

deployresgrp$delete()

See also

An alternative to Plumber is the model operationalisation framework found in Microsoft Machine Learning Server. While it is proprietary software, unlike Plumber which is open source, ML Server provides a number of features not available in the latter. These include model management, so that you can easily access multiple versions of a given model; user authentication, so that only authorised users can access your service; and batch (asynchronous) requests. For more information, see the MMLS documentation.



Day 12 – little helper dive

Wed, 12/12/2018 - 08:00

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

We at STATWORX work a lot with R, and we often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. At first, there was no plan to make a package, but soon I realised that it would be much easier to share and improve those functions if they were within a package. Up until the 24th of December I will present one function each day from helfRlein. So, on the 12th day of Christmas my true love gave to me…

What can it do?

This little helper is a debug function for those who do not work with RStudio, and it has its own blog post here. Before I just say the same thing twice – go check it out!

Overview

To see all the other functions you can either check out our GitHub or you can read about them here.

Have a merry advent season!

About the author

Jakob Gepp

Numbers were always my passion, and as a data scientist and statistician at STATWORX I can fulfill my nerdy needs. I am also responsible for our blog. So if you have any questions or suggestions, just send me an email!

ABOUT US

STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. 


The post Day 12 – little helper dive first appeared on STATWORX.



Visualizing Hurricane Data with Shiny

Wed, 12/12/2018 - 04:56

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Visualizing Hurricane Data with Shiny Motivation for Project

Around the time that I was selecting a topic for this project, my parents and my hometown found themselves in the path of a Category 1 hurricane. Thankfully, everyone was OK, and there was only minor damage to their property. But this event made me think about how long it had been since the last time my hometown had been in the path of a Category 1 hurricane. I also wanted to study trends in hurricane intensity over time to see if they correspond to the popular impression that storms have grown stronger over the past few years.

The Dataset

The dataset that I used for this project was provided by the National Hurricane Center. It includes historical data on storms in the Atlantic Ocean. Though this dataset has measurements for hurricanes dating all the way back to 1851, the data from the earlier years is incomplete. The measurements included in the dataset are of the longitude and latitude of the storm, the maximum sustained wind speed, the minimum pressure, and the time of the measurement. Using the maximum wind speed, the storms are categorized with the Saffir-Simpson scale, which divides the storms between Category 1 and Category 5. The National Hurricane Center also provides data on tropical storms and tropical depressions because they can grow in intensity to hurricane level storms.
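For reference, here is a sketch of how that categorisation could be reproduced in R, using the standard Saffir-Simpson thresholds and assuming wind speeds in mph (this is not code from the app, and the column name max_wind is hypothetical):

categorise_storm <- function(max_wind) {
  # below 39 mph: tropical depression; 39-73 mph: tropical storm; 74+ mph: hurricane
  cut(max_wind,
      breaks = c(-Inf, 38, 73, 95, 110, 129, 156, Inf),
      labels = c("Tropical Depression", "Tropical Storm",
                 "Category 1", "Category 2", "Category 3",
                 "Category 4", "Category 5"))
}

categorise_storm(c(30, 60, 80, 120, 160))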

Objective

I wanted to use R and the shiny framework to create a shiny app for exploring the hurricane data and examining the trends of storm intensity to see if there has been any increase in intensity.

Challenges

Unfortunately, the early tracking data is not complete, since it relied on observations that were collected either by ships that encountered one of these storms at sea or by coastal towns that were in the path of a hurricane and happened to have the necessary equipment to collect the desired data. 1851 is the first year with recorded measurements, and only six storms were tracked that year. A couple of the 1851 storms have only a single data point. In comparison, the data for 2017, the latest year available, has 18 storms that were tracked throughout their lifespans. I had to decide where to set a cutoff in the data if I wanted to analyse trends over the years, and I considered a few starting points. The year 1950 was when meteorologists started giving names to hurricane-level storms, and in 1953 they started the custom of assigning human names that varied from year to year. I ended up choosing 1967 because that is the first hurricane season in which data was collected using weather-tracking satellites for all the storms. I still kept the pre-1967 data available for analysis, but the only trend I explored within those years was the peak intensity for each storm.

Mapping the Storms

With the data containing the longitude and latitude for each storm at the time that they were measured, it is possible to recreate the path for each storm. For each path, I colored the location of the measurement according to the maximum wind speed at the time of the measurement. Tropical depressions are yellow, tropical storms are orange, and all hurricane level storms are red. I also added additional information for each storm in a tooltip; clicking on a point shows the storm’s name, maximum wind speed, minimum pressure, and the time that the storm was measured. This makes it possible to view the path for an individual storm or all of the storms during a year.
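The post doesn't show the mapping code, but one way to build such a map (not necessarily the author's implementation) is with the leaflet package, assuming a data frame track with hypothetical columns lon, lat, name, wind, pressure, time and category:

library(leaflet)

# colour scheme mirroring the description above
pal <- colorFactor(c("yellow", "orange", "red"),
                   domain = c("Tropical Depression", "Tropical Storm", "Hurricane"))

leaflet(track) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lon, lat = ~lat,
                   color = ~pal(category), radius = 4,
                   popup = ~paste0(name,
                                   "<br>Max wind: ", wind,
                                   "<br>Min pressure: ", pressure,
                                   "<br>Time: ", time))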

Analyzing Storm Data

In the shiny app, I implemented two ways to explore the hurricane data: by year and by storm name (storm number for pre-1950 storms). When exploring by name, you get an analysis of the storm by the amount of time it was classified in each category.

For exploring by year, I categorized each storm by the peak wind speed measured over the lifespan of the storm. Because the plots begin to get crowded when two years are compared, I started aggregating the storms by peak intensity in each year. This makes it possible to compare the type of storms that occurred each year.

While working with the data, I found that the best way to compare the intensities of the storms per year was to look at the average length of time for each category during the hurricane season. This makes it easier to compare the composition of the hurricane season. A proportional comparison is useful because there can be a large variation in the number of storms per year. For example, comparing the 2008 season against 2009 reveals that 2008 had above-average hurricane activity in the Atlantic while 2009's activity was below average. However, looking at the proportional composition of each season shows that even though there was a large difference in the total number of storms, categorically the years were very similar.
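As an illustration of that kind of proportional summary, here is a sketch (the data frame measurements and its columns year and category are hypothetical, with one row per measurement):

library(dplyr)

season_composition <- measurements %>%
  count(year, category) %>%
  group_by(year) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# e.g. compare the composition of two seasons side by side
season_composition %>% filter(year %in% c(2008, 2009))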

Analyzing Hurricane Trends

When viewing the post-1967-only data, I added an option to view all years available for analysis; selecting this option displays information from all the storms that occurred between 1967 and 2017. The count on the y axis is a measure of the number of storms that had a peak intensity in each category. An increase in the intensity of storms is noticeable in that the number of storms that peaked as a tropical depression has steadily dropped over this timeframe, while the number of storms that peaked as tropical storms has increased.

This increasing trend in intensity is more apparent when looking at the proportional composition of all hurricane seasons since 1967. With this view, I can show that the amount of time that storms spend in the tropical depression stage has decreased while the time as a tropical storm has increased. While this plot makes it possible to make the argument that there has been an increase in the overall intensity of the storms since 1967, there are problems with this conclusion. One of the main issues with this conclusion is that the trend doesn’t include data related to climate change, such as the increase in ocean temperature. Incorporating this additional data would be helpful in making a more conclusive argument. Ultimately more data is required to arrive at an informed conclusion.

Ideas for Future Development

One thing that I learned while analyzing this data is that the Saffir-Simpson scale is a very limited way to categorize storms, since it only uses a single variable (maximum wind speed) to make a classification. I believe that a better metric for intensity would take into account other attributes of a storm, like the minimum pressure or area. Starting in 2004, the National Hurricane Center began taking measurements of the size of the storm in addition to the other measurements that had been taken in the past. I believe that using the area of the storm would be a better way to track the intensity of the storm than the Saffir-Simpson scale. Adding this metric along with additional information on ocean temperatures would create a more robust analysis of the historical data.

Another improvement that I would like to make is to implement a way to use the map view to see the progression of a storm from when it was first formed until its final measurement. This would be useful for seeing how a storm develops over time, as well as the ways that the intensity increases and decreases over time and location.




Scraping the Turkey Accordion

Wed, 12/12/2018 - 03:00

(This article was first published on R on datawookie, and kindly contributed to R-bloggers)





Network Centrality in R: New ways of measuring Centrality

Wed, 12/12/2018 - 01:00

(This article was first published on schochastics, and kindly contributed to R-bloggers)

This is the third post of a series on the concept of “network centrality” with
applications in R and the package netrankr. The last part introduced the concept of
neighborhood-inclusion and its implications for centrality. In this post, we
extend the concept to a broader class of dominance relations by deconstructing indices
into a series of building blocks and introduce new ways of evaluating centrality.

library(igraph)
library(ggraph)
library(tidyverse)
library(netrankr)

Introduction

Neighborhood-inclusion seems to underlie many different centrality indices and as such
serves (or better: could serve) as a defining property of a centrality index. That is:

An index is a measure of centrality if and only if it preserves the partial ranking induced by neighborhood-inclusion

While this gives us a theoretical basis for centrality, it comes with a bit of a problem.
Namely, that we do not expect many comparable pairs in (large) real-world networks.
Take the third example network from the first post.

g3 <- readRDS("example_3.rds") P <- neighborhood_inclusion(g3) comparable_pairs(P) ## [1] 0.00600532

Only 0.6% of pairs are comparable. This means that
centrality indices are at liberty to rank 99.4% of pairs in any order. So, neighborhood-inclusion
may not be very restrictive in large networks. This is much like with structural equivalence.
It is a very intuitive concept for equivalences in networks (having exactly the same neighbors), however we do not expect many equivalent pairs in large networks. This does not render the concept useless, but requires some relaxations.

In the following, we introduce dominance relations, which incrementally extend the
partial ranking of neighborhood-inclusion and thus tighten the partial ranking
that indices preserve. We start by illustrating how indices can be decomposed into a series
of building blocks.

Deconstructing Indices

Centrality indices assess structural importance based on a great variety of different
graph theoretic notions, like shortest paths (closeness) or walks (subgraph centrality).
Implicitly, though, they all follow a simple recipe:

  • derive an indirect relation
  • aggregate the indirect relations

As mentioned above, indirect relations are commonly derived via graph trajectories such as paths
and walks, for instance to compute distances. The aggregation is usually a simple summation of all relations of a node, but others are
possible too (e.g. \(\max\) as in eccentricity). In its most generic form, we can thus write a centrality index as
\[
c(i)= \sum_j \tau(x)_{ij}
\]
where \(x\) are the observed relations (basically the adjacency matrix) and \(\tau\)
a generic transformation function. Replacing \(\tau\) with \(id\), we obtain degree centrality
and setting \(\tau=dist\), we obtain closeness. A suitable function \(\tau\) can be
defined for all centrality indices, such that any index can basically be seen as degree in an
appropriately transformed network.

The package netrankr provides a set of 24 different indirect relations that can be used
to construct indices. A few common examples are given below.

# closeness
g %>%
  indirect_relations(type='dist_sp') %>%
  aggregate_positions(type='invsum')

# betweenness
g %>%
  indirect_relations(type='depend_sp') %>%
  aggregate_positions(type='sum')

# eigenvector
g %>%
  indirect_relations(type='walks', FUN=walks_limit_prop) %>%
  aggregate_positions(type='sum')

# subgraph
g %>%
  indirect_relations(type='walks', FUN=walks_exp) %>%
  aggregate_positions(type='self')

(Consult the help for indirect_relations() to see all options)
Note that we make use of the %>% operator. This should appeal to the recipe
idea from above: network %>% indirect_relation %>% aggregation. The package also
includes a handy RStudio addin, which can be used to build the pipelines more easily.

Defining indices in this way is certainly more cumbersome than using, say, betweenness(g).
However, it allows us to intervene at any step and do something else.

Extended Dominance Relations

To illustrate the “something else”, we look at our small example network again.

g1 <- readRDS("example_1.rds")

Following the recipe, you have decided that the relations of interest for your analysis
are the distances between nodes. The problem is, aggregating them into an index can still
be done in various ways. Three distance based centrality examples are shown below.

# classic closeness
c_C <- g1 %>%
  indirect_relations(type="dist_sp") %>%
  aggregate_positions(type="invsum")

# harmonic closeness
c_HC <- g1 %>%
  indirect_relations(type="dist_sp", FUN=dist_inv) %>%
  aggregate_positions(type="sum")

# residual-type closeness
c_RC <- g1 %>%
  indirect_relations(type="dist_sp", FUN=dist_2pow) %>%
  aggregate_positions(type="sum")

Any of the above indices starts with the derivation of distances, but then proceeds with
a different form of aggregation:
\[
c_C(i)=\frac{1}{\sum dist(i,j)},\quad c_{HC}(i)=\sum\frac{1}{dist(i,j)}, \quad c_{RC}(i)=\sum 2^{-dist(i,j)}
\]
Possibilities are virtually endless for aggregating distances into an index.
From the previous part, we know that any of these indices preserve neighborhood-inclusion.
Once we have settled for a relation, as in this case, we can extend the partial ranking
using the following considerations: If \(dist(i,k)\) is larger than \(dist(j,k)\) for all
nodes \(k\), then no matter how we aggregate the relations (as long as the aggregation is monotonic),
\(i\) will always be less central than \(j\). With a bit more formalism:
\[
dist(i,k) \geq dist(j,k) \text{ for all } k \implies c_x(i)\leq c_x(j)
\]
where \(c_x\) is an arbitrary centrality index based on distances. This actually defines
a new dominance relation among nodes. In fact, we can go a step further. It does
not really matter in which order we aggregate the distances, the result will always be the same.
Hence, we can permute all relations of a single node without affecting the result.
A convenient choice of permutation is simply to reorder all relations in descending
order. Afterwards, we can compute the dominance relations as above. (More formal details on these dominance relations can be found in this article.)

We can compute this new dominance relation using the function positional_dominance().
The benefit parameter is set to FALSE since large distances are not beneficial for
a node to have. Setting map=TRUE invokes the above mentioned reordering. For comparison,
we also compute neighborhood-inclusion again.

P <- neighborhood_inclusion(g1) D <- g1 %>% indirect_relations(type="dist_sp") %>% positional_dominance(benefit=FALSE,map=TRUE) c("neighborhood-inclusion"=comparable_pairs(P),"distance dominance"=comparable_pairs(D)) ## neighborhood-inclusion distance dominance ## 0.1636364 0.8727273

By fixing our relation of interest to distances and allowing reordering of relations, we
went from only 16% of comparable pairs to 87%! Hence, no matter what index based on distance we use,
results will always be very similar. As a sanity check, we can verify that all distance based indices from above preserve the dominance relations.

c("classic"=is_preserved(D,c_C),"harmonic"=is_preserved(D,c_HC),"residual"=is_preserved(D,c_RC)) ## classic harmonic residual ## TRUE TRUE TRUE Partial Centrality

By now, we should have understood that there are various kinds of partial rankings in networks,
which form the basis of centrality. Indices extend these partial rankings into one possible ranking,
but, as we will see later, there might be hundreds of thousands of possible rankings. And hence, hundreds of
thousands of indices that produce these rankings. Instead of inventing hundreds of thousands of
indices, why not just study the partial rankings? Or why not be extremely bold, and try to
analyse all possible rankings at once?

In this section, we consider the former question, by introducing rank intervals. A rank interval
of a node is the set of ranks a node can have in any ranking that extends a given partial ranking.
Let us consider two extreme cases. A node that is neither dominated nor dominates any other node can potentially
have any rank. So its rank interval is \([1,n]\). (We use the convention that \(n\) is the top rank and \(1\) the lowest possible rank.)
On the other hand, if a node dominates all other nodes, it can only be ranked on the top. So its rank interval is just a point.

netrankr includes the function rank_intervals() which returns the rank intervals for all nodes in the network.
A visual representation of the intervals can be obtained with the plot_rank_intervals() function, as done below for the
first example network and neighborhood-inclusion as partial ranking input. We also included an optional data.frame
containing centrality scores of five indices.

cent_scores <- data.frame( degree=degree(g1), betweenness=betweenness(g1), closeness=closeness(g1), eigen=eigen_centrality(g1)$vector, subgraph=subgraph.centrality(g1)) plot_rank_intervals(P,cent.df = cent_scores,ties.method = "average")


The rank intervals are extremely big for this network. Node 10, for instance, can take any possible rank.
The most constrained interval is that of node 1, containing 6 possible ranks. The rank intervals can be interpreted as
a sort of confidence interval for centrality. The larger the interval, the less explanatory power
a single index may have. Again, consider node 10. It is the most central node according to subgraph centrality, but ranks very low in
betweenness.

We have learned that we can extend neighborhood-inclusion by choosing a relation of interest as
basis for our analysis. For the example network, we considered distances.
The below figure shows the resulting rank intervals.

cent_scores <- data.frame( classic=c_C, harmonic=c_HC, residual=c_RC) plot_rank_intervals(D,cent.df = cent_scores)


Notice how much smaller they got. The intervals of node 1 and 2 even collapse into a single point.
They will thus always be ranked at the bottom in any distance based centrality ranking.

Rank intervals are a convenient choice to assess the possibilities of rankings in a network.
It is important to understand, though, that the ranks in each interval do not occur with
uniform probability. A rank interval \([6,7]\) does not mean that the node is ranked 6th in 50%
of all possible rankings. We address the rank probabilities in the next section.

Probabilistic Centrality

A node ranking can be defined as a mapping
\[rk: V \to \{1,\ldots,n\},\]
where we use the convention that \(u\) is the top ranked node if \(rk(u)=n\) and the
bottom ranked one if \(rk(u)=1\). The set of all possible rankings can then be characterized as
\[
\mathcal{R}(\leq)=\{rk:V \to \{1,\ldots,n\}\; : \; u\leq v \implies rk(u)\leq rk(v)\}.
\]
This set contains all rankings that could be obtained for centrality indices that preserve the
partial ranking of a dominance relation “\(\leq\)”.

Once \(\mathcal{R}(\leq)\) is calculated, it can be used for a probabilistic assessment of centrality,
analyzing all possible rankings at once. Examples include relative rank probabilities
(How likely is it, that a node \(u\) is more central than another node \(v\)?) or
expected ranks (How central do we expect a node \(u\) to be).

netrankr includes the function exact_rank_prob(), which helps to answer the above posted questions.
We stick with our small example network and apply the function to both, neighborhood-inclusion
and dominance based on distances.

probNI <- exact_rank_prob(P) probD <- exact_rank_prob(D)

The function returns a large list of different outputs, which we discuss in the following.
The number of possible rankings is stored in lin.ext.

c("neighborhood-inclusion"=probNI$lin.ext,"distances"=probD$lin.ext) ## neighborhood-inclusion distances ## 739200 20

So, for this tiny network, there are still more than 700,000 possibilities to rank the nodes differently.
If we restrict ourselves to distances, we end up with only 20.

The rank probabilities (for example how likely is it that node \(u\) is ranked on top?)
are stored in the matrix rank.prob. We here focus on the probability to be the most central node.

top_rank <- ncol(probNI$rank.prob) probNI$rank.prob[,11] ## V1 V2 V3 V4 V5 V6 ## 0.00000000 0.00000000 0.00000000 0.13636364 0.16363636 0.10909091 ## V7 V8 V9 V10 V11 ## 0.10909091 0.13636364 0.09090909 0.09090909 0.16363636

Nodes 5 and 11 have the highest probability of being the most central node. You can think of these
probabilities as follows: if we applied thousands of indices to the network, node 5 would come out
as the most central node in about 16% of the cases.
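If you want to pull the candidates for the top spot out of the matrix programmatically, a minimal base-R sketch (relying on the column and row names V1, ..., V11 as they appear in the output above) is:

# probability of occupying the top rank = last column of rank.prob
top_prob <- probNI$rank.prob[, ncol(probNI$rank.prob)]

# nodes ordered by their probability of being the most central node
sort(top_prob, decreasing = TRUE)

# candidates for the top spot (small tolerance in case of
# floating-point noise in the computed probabilities)
names(which(top_prob >= max(top_prob) - 1e-12))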

Relative rank probabilities (How likely is it that \(u\) is less central than \(v\)?) are stored in the matrix relative.rank.

round(probNI$relative.rank, 2)
##       V1   V2   V3   V4   V5   V6   V7   V8   V9  V10  V11
## V1  0.00 0.67 1.00 0.95 1.00 1.00 1.00 0.95 0.86 0.86 1.00
## V2  0.33 0.00 0.67 1.00 0.92 0.83 0.83 1.00 0.75 0.75 0.92
## V3  0.00 0.33 0.00 0.80 1.00 0.75 0.75 0.80 0.64 0.64 1.00
## V4  0.05 0.00 0.20 0.00 0.56 0.44 0.44 0.50 0.38 0.38 0.56
## V5  0.00 0.08 0.00 0.44 0.00 0.38 0.38 0.44 0.32 0.32 0.50
## V6  0.00 0.17 0.25 0.56 0.62 0.00 0.50 0.56 0.43 0.43 0.62
## V7  0.00 0.17 0.25 0.56 0.62 0.50 0.00 0.56 0.43 0.43 0.62
## V8  0.05 0.00 0.20 0.50 0.56 0.44 0.44 0.00 0.38 0.38 0.56
## V9  0.14 0.25 0.36 0.62 0.68 0.57 0.57 0.62 0.00 0.50 0.68
## V10 0.14 0.25 0.36 0.62 0.68 0.57 0.57 0.62 0.50 0.00 0.68
## V11 0.00 0.08 0.00 0.44 0.50 0.37 0.37 0.44 0.32 0.32 0.00

For example, the probability that node 2 is less central than node 1 is \(0.33\).
The closer a probability is to \(0.5\) (see nodes 5 and 11), the less reason there is to put either node above the other.
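Such pairs can also be queried directly. A small base-R sketch (assuming the rows and columns of relative.rank are named V1, ..., V11, as in the printed matrix above):

rr <- probNI$relative.rank

# probability that node 2 ends up below node 1 in a ranking
rr["V2", "V1"]

# pairs that are hardest to separate: probabilities close to 0.5
# (upper triangle only, so that each pair is listed once)
which(abs(rr - 0.5) < 0.05 & upper.tri(rr), arr.ind = TRUE)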

The last return values of interest are the expected ranks and their standard deviations, stored in expected.rank and rank.spread.
The expected ranks can be seen as a kind of baseline ranking: applying hundreds of random indices, this is the ranking we would expect to get on average.

exp_rk <- round(probNI$expected.rank, 2)
sd_rk  <- round(probNI$rank.spread, 2)

tibble(nodes = 1:11, expected = exp_rk, sd = sd_rk) %>%
  ggplot(aes(x = reorder(nodes, expected))) +
  geom_segment(aes(y = expected - sd, yend = expected + sd,
                   xend = reorder(nodes, expected))) +
  geom_point(aes(y = expected)) +
  theme_minimal() +
  labs(y = "Rank", x = "Node")


The standard deviations are quite large for neighborhood-inclusion, which was to be expected given the wide rank
intervals. The figure below shows the expected ranks for the distance-based dominance.

exp_rk <- round(probD$expected.rank, 2)
sd_rk  <- round(probD$rank.spread, 2)

tibble(nodes = 1:11, expected = exp_rk, sd = sd_rk) %>%
  ggplot(aes(x = reorder(nodes, expected))) +
  geom_segment(aes(y = expected - sd, yend = expected + sd,
                   xend = reorder(nodes, expected))) +
  geom_point(aes(y = expected)) +
  theme_minimal() +
  labs(y = "Rank", x = "Node")

A word of warning: finding all possible rankings of a partial ranking is a computationally
hard problem, so it is advisable to use exact_rank_prob() only for small
networks. Some benchmark results and approximation methods for larger networks can be found here.
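Before calling exact_rank_prob() on a larger network, it can help to check how dense the partial ranking already is. netrankr provides comparable_pairs() for this; treat the exact name and output as an assumption if you are on an older package version.

# fraction of node pairs that are already comparable in the partial ranking;
# the sparser the partial ranking, the more linear extensions exist and the
# more expensive exact_rank_prob() becomes
comparable_pairs(P)
comparable_pairs(D)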

Summary

After this post, it is time to take stock of what we have done so far.
To date, putting it bluntly, network centrality is nothing more than the application of indices
to a network:


The only degree of freedom is the choice of index, and it is hard to justify that choice
without resorting to data-driven reasoning, as in “We used betweenness because it worked best”.

The introduced neighborhood-inclusion and the more specific
dominance concepts allow for additional ways of analyzing centrality in networks,
described in the simplified diagram below.

Any centrality analysis starts with identifying the relation of interest, which
replaces the choice of index. The relation of interest is usually some graph-theoretic
property (e.g. distances) that we assume to be indicative of centrality. The relations
of a node to all others are called its position. Aggregating these relations leads
to the definition of indices, and hence to the usual centrality concept. However, positions can also
be compared via positional dominance, the dominance relation introduced in this post, which leads to partial centrality rankings and
the option to calculate probabilistic centrality rankings.
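Put into code, the workflow for the distance example could look roughly like the sketch below. The function and type names follow the netrankr documentation as I understand it and may differ slightly between package versions; D_alt is just an illustrative name to avoid clashing with the D used earlier, and g1 is the example graph from above.

library(igraph)
library(netrankr)

# 1) choose a relation of interest, e.g. shortest-path distances
rel <- indirect_relations(g1, type = "dist_sp")

# 2a) aggregate the positions into an index
#     (inverse-distance sums give a closeness-type score)
cls <- aggregate_positions(rel, type = "invsum")

# 2b) ...or compare positions directly via positional dominance
#     (benefit = FALSE because smaller distances are better)
D_alt <- positional_dominance(rel, benefit = FALSE)

# 3) analyse the resulting partial ranking probabilistically
res <- exact_rank_prob(D_alt)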

So far, we have mostly been toying around with small contrived networks. The final post
will illustrate the introduced methodology by means of a more realistic application example.


