R bloggers
R news and tutorials contributed by hundreds of R bloggers

prrd 0.0.1: Parallel Running [of] Reverse Depends

Sun, 01/07/2018 - 19:33

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new package is now on the ghrr drat. It was uploaded to CRAN four days ago but still lingers in the inspect/ state there, along with a growing number of other packages. As some new packages have come through, I am sure it will get processed eventually, but in the meantime I figured I might as well make it available this way.

The idea of prrd is simple, and described in some more detail on its webpage and its GitHub repo. Reverse dependency checks are an important part of package development (provided you care about not breaking other packages, as CRAN asks you to), and are easily done in a (serial) loop. But these checks are also generally embarrassingly parallel as there is no or little interdependency between them (besides maybe shared build dependencies).

So this package uses the liteq package by Gabor Csardi to set up all tests to run as tasks in a queue. This permits multiple independent queue runners to each work on one task at a time. Results are written back and summarized.
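To make the queueing idea concrete, here is a small sketch of the liteq pattern described above (it is not how prrd itself is implemented; the database path, queue name and package names are made up for illustration):

library(liteq)

# Throwaway queue database and queue name, purely illustrative
db <- tempfile(fileext = ".db")
q  <- ensure_queue("revdeps", db = db)

# Enqueue one job per reverse dependency (hypothetical package names)
for (pkg in c("pkgA", "pkgB", "pkgC")) {
  publish(q, title = pkg, message = pkg)
}

# Each worker process runs a loop like this; several workers can be
# started side by side against the same queue database.
msg <- try_consume(q)
while (!is.null(msg)) {
  cat("checking", msg$title, "\n")   # a real runner would call R CMD check here
  ack(msg)                           # mark the task as done
  msg <- try_consume(q)
}

Because the queue lives in a single SQLite file, several such workers can be launched from separate R sessions against it without stepping on each other.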

This already works pretty well as evidenced by the following screenshot (running six parallel workers, arranged in a split byobu session).

See the aforementioned webpage and its repo for more details, and by all means give it a whirl.

For more questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Discontinuing maintenance of jug

Sun, 01/07/2018 - 14:00

(This article was first published on FishyOperations, and kindly contributed to R-bloggers)

I’m starting off 2018 by dropping my involvement with jug. The reason is that I simply cannot find the time or interest anymore to maintain the code.

All of the open-source efforts that I started always had to do with a concrete need that I had in one way or another. Once that need is fulfilled by creating a solution for it, the next challenge becomes to maintain that solution.

In the best case scenario, someone with the right set of skills continues the development into a mature project which becomes significantly better than the original solution. An example close to my heart is simmer (see http://r-simmer.org/), where Iñaki Ucar has taken over the development and has improved the original enormously.

However, for jug the community of active developers is currently too small to do a handover. Besides that, jug’s main competitor plumber (see website) has been able to create a significant community with highly active support.

Today I have to sort of admit defeat. My own time availability as co-founder of dataroots and my changing interests unfortunately no longer allow me to put in the time and effort required for the maintenance needed to do justice to the jug user base. Therefore I will be discontinuing the maintenance of jug. If anyone would be willing to take over, they will be more than welcome.


To leave a comment for the author, please follow the link and comment on their blog: FishyOperations.

Undemocratic Democracies

Sun, 01/07/2018 - 13:16

(This article was first published on R – DiffusePrioR, and kindly contributed to R-bloggers)

I’ve always thought that it was ironic that most countries with the word “Democratic” in their name are exceptionally un-democratic in reality. So I was very interested in the following post on Reddit this week showing exactly this (https://www.reddit.com/r/dataisbeautiful/comments/7nkyek/countries_with_the_word_democratic_in_their/).

However, there are only 8 countries that have “democratic” in their name: People’s Democratic Republic of Algeria, Democratic Republic of the Congo, Democratic Republic of Timor-Leste, Federal Democratic Republic of Ethiopia, Lao People’s Democratic Republic, Democratic People’s Republic of Korea, Federal Democratic Republic of Nepal, and the Democratic Socialist Republic of Sri Lanka. This and the fact that I am back teaching next week got me thinking that this might be a nice example to show how two-sample t-tests work.

The 8 “democratic” countries have an overall democracy index score of 3.89 with a sample standard deviation of 2.174942. This contrasts with a mean and standard deviation of 5.602013 and 2.172326 for the remaining 159 countries in the Economist Intelligence Unit’s democracy index (https://en.wikipedia.org/wiki/Democracy_Index).

Let’s conduct a two-sample t-test that assumes equal variances to test whether this difference in means is statistically significant. The null hypothesis is that both sample means are equal. This is a two-tailed test. The test statistic follows Gosset’s t-distribution with n1+n2-2 degrees of freedom (159+8-2=165). The test statistic is calculated as follows (formula here: https://en.wikipedia.org/wiki/Student%27s_t-test):

s = sqrt(((158*2.172326^2)+(7*2.174942^2))/(159+8-2))
t = (5.602013-3.89)/(s*(sqrt(1/159+1/8)))=2.1749

which allows us to reject the null at the conventional alpha of 0.05. The p-value is ~0.03, meaning we would not be able to reject at the 1% level. Interestingly, if you do not make the equal variance assumption, you can no longer reject the null at the 5% level.

Hopefully, this example will be of interest to people teaching stats and econometrics for undergrads!

rm(list = ls())

other <- c(9.93, 9.5, 9.39, 9.26, 9.2, 9.15, 9.15, 9.09, 9.03, 9.01, 8.81, 8.8, 8.63, 8.41, 8.39, 8.36, 8.3, 8.28,
           8.17, 7.99, 7.98, 7.98, 7.94, 7.92, 7.92, 7.88, 7.87, 7.86, 7.85, 7.85, 7.82, 7.81, 7.79, 7.78, 7.77,
           7.65, 7.51, 7.47, 7.41, 7.39, 7.31, 7.29, 7.23, 7.13, 7.1, 7.01, 6.97, 6.96, 6.94, 6.9, 6.83, 6.77,
           6.75, 6.75, 6.72, 6.67, 6.67, 6.65, 6.64, 6.62, 6.62, 6.59, 6.57, 6.54, 6.47, 6.42, 6.4, 6.38, 6.31,
           6.27, 6.25, 6.21, 6.03, 6.01, 5.99, 5.93, 5.92, 5.92, 5.91, 5.81, 5.76, 5.73, 5.72, 5.7, 5.7, 5.67,
           5.64, 5.63, 5.55, 5.33, 5.31, 5.26, 5.23, 5.07, 5.04, 4.93, 4.93, 4.92, 4.87, 4.86, 4.81, 4.77, 4.7,
           4.68, 4.55, 4.5, 4.49, 4.33, 4.27, 4.2, 4.08, 4.02, 4.02, 3.96, 3.96, 3.96, 3.88, 3.85, 3.81, 3.74,
           3.71, 3.54, 3.46, 3.46, 3.4, 3.38, 3.32, 3.31, 3.24, 3.18, 3.14, 3.14, 3.07, 3.06, 3.05, 3.04, 3.03,
           2.91, 2.91, 2.83, 2.79, 2.75, 2.65, 2.55, 2.4, 2.37, 2.37, 2.34, 2.25, 2.06, 1.98, 1.95, 1.93, 1.89,
           1.83, 1.7, 1.61, 1.5, 1.43)
demo <- c(7.24, 6.48, 4.86, 3.6, 3.56, 2.37, 1.93, 1.08)

mean(other) ; sd(other) ; length(other)
mean(demo) ; sd(demo) ; length(demo)

t.test(other, demo, var.equal = TRUE)

s = sqrt(((158*2.172326^2)+(7*2.174942^2))/(159+8-2))
t = (5.602013-3.89)/(s*(sqrt(1/159+1/8)))

t.test(other, demo, var.equal = FALSE)

library(ggplot2)
data1 <- data.frame(Score = other, Name = "No")
data2 <- data.frame(Score = demo, Name = "Yes")
data <- rbind(data1, data2)
ggplot(data, aes(Name, Score)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw() +
  xlab("Democratic in Name") +
  ylab("Democracy Score")

 


To leave a comment for the author, please follow the link and comment on their blog: R – DiffusePrioR.

Rainbowing a set of pictures

Sun, 01/07/2018 - 01:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

I’ve now done a few collages from R using magick: the faces of #rstats Twitter, We R-Ladies with Lucy D’Agostino McGowan, and a holiday card for R-Ladies. The faces of #rstats Twitter and holiday card collages were arranged at random, while the We R-Ladies one was a mosaic forming the R-Ladies logo. I got the idea to up my collage skills by trying to learn how to arrange pics by their main colour, like a rainbow. The verb rainbow doesn’t exist, and “rainbowing” doesn’t mean ordering by colour, but I didn’t let this stop me.

It was the occasion to grab some useful knowledge about colours, not useless for someone who did not even know about Pantone’s Colors of the Year a few weeks ago…

This post has nothing to do with Kesha’s new album. However, you can listen to it while reading since it’s so good, but maybe switch to something older from her when I use “$”.

Getting some pics to play with

The first pictures I tried to arrange were all the pictures ever posted by R-Ladies local chapters on their Twitter accounts. While it was fun to grab them all, it was not very interesting to play with them as so many of them were pictures of screens. I therefore grabbed “nature” pictures from Pexels using the same method [as when creating the Bubblegum Puppies] following this Stack Overflow thread. I chose “nature” as a keyword because 1) it led to many hits and 2) it offered a good variety of colours.

library("rvest") library("RSelenium") library("magrittr") rD <- rsDriver() remDr <- rD[["client"]] # open the webpage remDr$navigate("https://www.pexels.com/search/nature/") # scroll down for(i in 1:130){ remDr$executeScript(paste("scroll(0,",i*10000,");"), args = list("dummy")) # be nice and wait Sys.sleep(1) } # https://www.pexels.com/faq/ page_content <- remDr$getPageSource() remDr$close() get_link_from_src <- function(node){ xml2::xml_attrs(node)["src"] %>% as.character() %>% stringr::str_replace("\\?h.*", "") } xtract_pic_links <- function(source) { css <- '.photo-item__img' read_html(source[[1]]) %>% html_nodes(css) %>% purrr::map_chr(get_link_from_src) } links <- xtract_pic_links(page_content) links <- links[1:1400] # save dir.create("nature") save_pic <- function(url){ Sys.sleep(1) name <- stringr::str_replace(url, ".*\\/", "") try(magick::image_read(url) %>% magick::image_write(paste0("nature/", name)), silent = TRUE) } purrr::walk(links, save_pic) Extracting the main colour and making pics size-compatible

In the following code, I extracted the main colour from each pic using Russel Dinnage’s method as presented in this blog post from David Smith. Before that I had to install two packages from Github, rblocks and rPlotter.

This code also serves another role: since I wanted to paste pics together at some point, I decided to make them all of the same dimensions by adding a border with magick. I had learnt how to do that when preparing the R-Ladies Global holiday card, but this time instead of using the same colour every time (R-Ladies’ official purple), I used the main colour I’d just extracted. The most important points for making a picture a square are to know that magick::image_info gives you the height and width of an image… and to somehow understand geometry, which was embarrassingly a hurdle when I did that.

The code to extract colours didn’t work in a few cases, which I did not investigate much: I had downloaded more pics than I needed because I had already run into this issue when working with the R-Ladies meetup pics, where it seemed to be caused by nearly bicolor pics.

dir.create("formatted_pics") format_image <- function(path){ image <- magick::image_read(path) info <- magick::image_info(image) # find in which direction I need to add pixels # to make this a square direction <- ifelse(info$height > info$width, "height", "width") scale_number <- as.numeric(info[direction]/500) image <- magick::image_scale(image, paste0(info["width"]/scale_number, "x", info["height"]/scale_number)) newinfo <- magick::image_info(image) # colours colours <- try(rPlotter::extract_colours(path, num_col = 1), silent = TRUE) # one pic at least was problematic if(!is(colours, "try-error")){ colour <- colours[1] image <- magick::image_border(image, colour, paste0((500-newinfo$width)/2, "x", (500-newinfo$height)/2)) info <- magick::image_info(image) # odd numbers out! if(info$height/2 != floor(info$height/2)){ image <- magick::image_crop(image, "0x500+0") } if(info$width/2 != floor(info$width/2)){ image <- magick::image_crop(image, "500x0+0") } magick::image_write(image, stringr::str_replace(path, "nature", "formatted_pics")) tibble::tibble(path = path, colour = colour) }else{ NULL } } pics_main_colours <- purrr::map_df(dir("nature", full.names = TRUE), format_image) readr::write_csv(pics_main_colours, path = "pics_main_colours.csv")

And because I’m apparently a bad planner, I had to reduce pictures afterwards.

# we need smaller images
reduce_image <- function(path){
  magick::image_read(path) %>%
    magick::image_scale("50x50!") %>%
    magick::image_write(path)
}

purrr::walk(dir("formatted_pics", full.names = TRUE), reduce_image)

Preparing a function to order and paste pictures

This function has a collage part, which you might recognize from my faces of #rstats Twitter blog post, and a part that orders pictures according to a variable, which is new and uses a bit of tidy eval… Maybe I’ll really learn tidy eval this year! pics_info needs to be a data.frame with the path to the pictures and, well, the variable one wants to use to order them.

library("rlang") make_column <- function(i, files, no_rows){ magick::image_read(files[(i*no_rows+1):((i+1)*no_rows)]) %>% magick::image_append(stack = TRUE) %>% magick::image_write(paste0("cols/", i, ".jpg")) } make_collage <- function(pics_info, no_rows, no_cols, ordering_col){ pics_info <- dplyr::arrange(pics_info, !!!syms(ordering_col)) pics_info <- pics_info[1:(no_rows*no_cols),] pics_info <- dplyr::mutate(pics_info, column = rep(1:no_cols, each = no_rows)) pics_info <- dplyr::group_by(pics_info, column) %>% dplyr::arrange(!!!syms(ordering_col)) %>% dplyr::mutate(row = 1:no_rows) %>% dplyr::ungroup() pics_info <- dplyr::arrange(pics_info, column, row) dir.create("cols") purrr::walk(0:(no_cols-1), make_column, files = pics_info$path, no_rows = no_rows) banner <- magick::image_read(dir("cols/", full.names = TRUE)) %>% magick::image_append(stack = FALSE) unlink("cols", recursive = TRUE) return(banner) }

The function returns a magick object that one can then write to disk as a PNG for instance.

I first tested it using a random ordering added to the data.frame created in the next section, and I show the result here to give an idea of the variety of pictures. For many of them, however, the main colour that you can see in their border is greyish.

set.seed(42)
pics_info <- dplyr::mutate(pics_info, random = sample(1:nrow(pics_info), nrow(pics_info)))
make_collage(pics_info, 19, 59, "random") %>%
  magick::image_write("data/2018-01-07-rainbowing-banner_random.png")

Testing a first (bad) approach: using hue

Once I had the main colour as a hex code, I had no idea how to order the colours and thought a good idea would be to use hue, which is the main wavelength in a colour. Most observed colours are a mix of wavelengths, unless you’re using a laser for instance. To get hue from a colour identified by its hex code, one needs two functions: colorspace::hex2RGB and grDevices::rgb2hsv. The latter outputs hue, saturation and value. Hue is the main wavelength, saturation the amount of that wavelength in the colour, and value the amount of light in the colour. The smaller the saturation, the less representative the hue is of the main colour. Add to that the fact that the main colour can itself be only a little representative of your original picture… Ordering by hue isn’t perfect, but I tried it anyway.

# now work on getting the hue and value for all pics
# create a data.frame with path, hue, value
get_values <- function(path, pics_main_colours){
  print(path)
  # get main color
  colour <- pics_main_colours$colour[pics_main_colours$path ==
                                       stringr::str_replace(path, "formatted_pics", "nature")]
  # translate it
  rgb <- colorspace::hex2RGB(colour)
  values <- grDevices::rgb2hsv(t(rgb@coords))
  tibble::tibble(path = path,
                 hue = values[1,1],
                 saturation = values[2,1],
                 value = values[3,1])
}

# all values
pics_col <- purrr::map_df(dir("formatted_pics", full.names = TRUE),
                          get_values, pics_main_colours)

make_collage(pics_info, 19, 59, "hue") %>%
  magick::image_write("banner_hue.png")

So this is not too pretty. Blue and green pictures seem to cluster together but there are very dark pictures which we’d intuitively put aside.

So I thought a bit more and decided to first assign main colours to a reference colour and then order pictures based on this…

Choosing a better approach: RGB and distances

The first challenge was to choose reference colours which would be the rainbow slices. I could have looked up RGB codes corresponding to ROYGBIV (red, orange, yellow, green, blue, indigo and violet), but I had read about the xkcd color survey in this interesting post and therefore decided to use XKCD colors, available in R via the xkcdcolors package. I chose to use the 18 most common ones, and to add black to that lot. It was no longer really a rainbow, I agree. These colours were then ordered by hand by my husband, and I like his choices.

Then, to assign each pic to a reference colour via its main colour, I calculated the Euclidean distance between that colour and all reference colours to find the closest reference colour, using the RGB values.

library("xkcdcolors") library("magrittr") # version of colorspace::hex2RGB returning a df hex2rgb <- function(hex){ result <- colorspace::hex2RGB(hex)@coords } # https://stackoverflow.com/questions/45328221/unnest-one-column-list-to-many-columns-in-tidyr colors <- tibble::tibble(name = c(xcolors()[1:18], "black"), hex = name2color(name), rgb = purrr::map(hex, hex2rgb)) %>% dplyr::mutate(rgb = purrr::map(rgb, tibble::as_tibble)) %>% tidyr::unnest() # for each colour I want the closest one. find_closest_colour <- function(hex, colors){ test <- tibble::tibble(hex = hex) %>% dplyr::mutate(rgb = purrr::map(hex, hex2rgb), rgb = purrr::map(rgb, tibble::as_tibble)) %>% tidyr::unnest() distance <- stats::dist(rbind(test[, c("R", "G", "B")], colors[, c("R", "G", "B")])) colors$name[which.min(as.matrix(distance)[,1][2:(nrow(colors) + 1)])] } imgs_col <- dplyr::mutate(pics_main_colours, xkcd_col = purrr::map_chr(colour, find_closest_colour, colors = colors)) readr::write_csv(imgs_col, path = "imgs_xkcd_col.csv")

Once I had this information about each pic, I could order the pictures, after having defined the order of the reference colours.

pics_info <- readr::read_csv("imgs_xkcd_col.csv")
pics_info <- dplyr::mutate(pics_info,
                           path = stringr::str_replace(path, "nature", "formatted_pics"))
pics_info <- dplyr::mutate(pics_info,
                           xkcd_col = factor(xkcd_col, ordered = TRUE,
                                             levels = c("black", "brown", "red", "magenta", "pink",
                                                        "lime green", "green", "dark green", "teal",
                                                        "light blue", "sky blue", "blue", "purple", "grey")))

This looks much better, but of course the initial set (and maybe the extraction method used as well) doesn’t provide enough colours to make this extremely pretty. I’m not sure how useful this end product is, but hey, I got to look at pretty landscapes full of colours from my grey rainy city, and learnt a lot along the way… Besides, maybe you will find a cool use case for some of the colour methods featured here and will tell me about it in the comments?


To leave a comment for the author, please follow the link and comment on their blog: Maëlle.

Dataviz Signup

Sat, 01/06/2018 - 23:26

(This article was first published on R on kieranhealy.org , and kindly contributed to R-bloggers)

Data Visualization for Social Science will be published later this year by Princeton University Press. You can read a near-complete draft of the book at socviz.co. If you would like to receive one (1) email when the book is available for pre-order, please fill out this very short form. The goal of the book is to introduce readers to the principles and practice of data visualization in a humane, reproducible, and up-to-date way.


To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.

New wrapr R pipeline feature: wrapr_applicable

Sat, 01/06/2018 - 18:35

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

The R package wrapr now has a neat new feature: “wrapr_applicable”.

This feature allows objects to declare a surrogate function to stand in for the object in wrapr pipelines. It is a powerful technique and allowed us to quickly implement a convenient new ad hoc query mode for rquery.

A small effort in making a package “wrapr aware” appears to have a fairly large payoff.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

tint 0.0.5

Sat, 01/06/2018 - 14:50

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance release of the tint package arrived on CRAN earlier this week. Its name expands to tint is not tufte as the package offers a fresher take on the Tufte style for html and pdf presentations.

A screenshot of the pdf variant is below.

Similar to the RcppCNPy release this week, this is pure maintenance related to dependencies. CRAN noticed that processing these vignettes requires the mgcv package—as we use geom_smooth() in some example graphs. So the vignette was altered to drop this requirement just for the vignette tests. We also had one pending older change related to jurassic pandoc versions on some CRAN architectures.
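For readers curious how a geom_smooth() call can drag in mgcv at all: with the default method = "auto", ggplot2 falls back to mgcv::gam() once the data has 1000 or more observations. A minimal sketch of the general idea (not the actual edit made in the tint vignettes) is to pin the smoother to a method that needs no extra package:

library(ggplot2)

# With the default method, ggplot2 may call mgcv::gam() for larger data,
# adding an mgcv dependency just to render the figure. Requesting a specific
# method keeps the figure within ggplot2 and base stats.
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)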

Changes in tint version 0.0.5 (2018-01-05)
  • Only run html rendering regression test on Linux or Windows as the pandoc versions on CRAN are too old elsewhere.

  • Vignette figures reworked so that the mgcv package is not required, avoiding a spurious dependency [CRAN request]

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page.

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

RcppCNPy 0.2.8

Sat, 01/06/2018 - 14:44

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A minor maintenance release of the RcppCNPy package arrived on CRAN this week.

RcppCNPy provides R with read and write access to NumPy files thanks to the cnpy library by Carl Rogers.

There is no code change here. But to process the vignette we rely on knitr which sees Python here and (as of its most recent release) wants the (excellent !!) reticulate package. Which is of course overkill just to process a short pdf document, so we turned this off.
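As a side note, and only as a sketch (the exact chunk layout in the RcppCNPy vignette may differ), this kind of knitr option is typically set once in the Rmd document’s setup chunk:

# In an R setup chunk of the vignette:
knitr::opts_chunk$set(python.reticulate = FALSE)

With this chunk option set, knitr evaluates python chunks through a plain call to the python binary instead of loading reticulate.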

Changes in version 0.2.8 (2018-01-04)
  • Vignette sets knitr option python.reticulate=FALSE to avoid another dependency just for the vignette [CRAN request]

CRANberries also provides a diffstat report for the latest release. As always, feedback is welcome and the best place to start a discussion may be the GitHub issue tickets page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

RQuantLib 0.4.4 for Windows

Fri, 01/05/2018 - 21:02

(This article was first published on FOSS Trading, and kindly contributed to R-bloggers)

I’m pleased to announce that the RQuantLib Windows binaries are now up to 0.4.4!  The RQuantLib pre-built Windows binaries have been frozen on CRAN since 0.4.2, but now you can get version 0.4.4 binaries on Dirk’s ghrr drat repo.

Installation is as simple as:

drat::addRepo("ghrr")  # maybe use 'install.packages("drat")' first
install.packages("RQuantLib", type = "binary")

I will be able to create Windows binaries for future RQuantLib versions too, now that I have a Windows QuantLib build (version 1.11) to link against. Dirk and I plan to talk with CRAN about getting the new binaries hosted there. Regardless, they will always be available via the drat repo.

To leave a comment for the author, please follow the link and comment on their blog: FOSS Trading.

How to extract data from a PDF file with R

Fri, 01/05/2018 - 07:02

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

In this post, taken from the book R Data Mining by Andrea Cirillo, we’ll be looking at how to scrape PDF files using R. It’s a relatively straightforward way to look at text mining – but it can be challenging if you don’t know exactly what you’re doing.

Until January 15th, every single eBook and video by Packt is just $5! Start exploring some of Packt’s huge range of R titles here.

You may not be aware of this, but some organizations create something called a ‘customer card’ for every single customer they deal with. This is quite an informal document that contains some relevant information related to the customer, such as the industry and the date of foundation. Probably the most precious information contained within these cards is the comments they write down about the customers. Let me show you one of them:
My plan was the following—get the information from these cards and analyze it to discover whether some kind of common traits emerge.

As you may already know, at the moment this information is presented in an unstructured way; that is, we are dealing with unstructured data. Before trying to analyze this data, we will have to gather it in our analysis environment and give it some kind of structure.

Technically, what we are going to do here is called text mining, which generally refers to the activity of gaining knowledge from texts. The techniques we are going to employ are the following:

  • Sentiment analysis
  • Wordclouds
  • N-gram analysis
  • Network analysis
Getting a list of documents in a folder

First of all, we need to get a list of the customer cards we received from the commercial department. I have stored all of them within the ‘data’ folder of my workspace. Let’s use list.files() to get them:

file_vector <- list.files(path = "data")

Nice! We can inspect it by looking at its head, using the following command:

file_vector %>% head()

[1] "banking.xls"        "Betasoloin.pdf"     "Burl Whirl.pdf"     "BUSINESSCENTER.pdf"
[5] "Buzzmaker.pdf"      "BuzzSaw Publicity.pdf"

Uhm… not exactly what we need. I can see there are also .xls files. We can remove them using the grepl() function, which performs partial matches on strings, returning TRUE if the pattern required is found, or FALSE if not. We are going to set the following test here: give me TRUE if you find .pdf in the filename, and FALSE if not:

grepl(".pdf",file_list) [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [18] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [35] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [52] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [69] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [86] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [103] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE [120] TRUE

As you can see, the first match results in a FALSE since it is related to the .xls file we saw before. We can now filter our list of files by simply passing these matching results to the list itself. More precisely, we will slice our list, selecting only those records where our grepl() call returns TRUE:

pdf_list <- file_vector[grepl(".pdf", file_vector)]

Did you understand [grepl(".pdf", file_vector)]? It is actually a way to access one or more indexes within a vector, which in our case are the indexes corresponding to “.pdf” filenames, exactly the same as we printed out before.

If you now look at the list, you will see that only PDF filenames are shown on it.

Reading PDF files into R via pdf_text()

R comes with a really useful package for tasks related to PDFs. This is named pdftools, and besides the pdf_text function we are going to employ here, it also contains other relevant functions that are used to get different kinds of information related to the PDF file into R.

For our purposes, it will be enough to get all of the textual information contained within each of the PDF files. First of all, let’s try this on a single document; we will try to scale it later on the whole set of documents. The only required argument to make pdf_text work is the path to the document. The object resulting from this application will be a character vector of length 1:

pdf_text("data/BUSINESSCENTER.pdf") [1] "BUSINESSCENTER business profile\nInformation below are provided under non disclosure agreement. date of enquery: 12.05.2017\ndate of foundation: 1993-05-18\nindustry: Non-profit\nshare holders: Helene Wurm ; Meryl Savant ; Sydney Wadley\ncomments\nThis is one of our worst customer. It really often miss payments even if for just a couple of days. We have\nproblems finding useful contact persons. The only person we can have had occasion to deal with was the\nfiscal expert, since all other relevant person denied any kind of contact.\n 1\n"

If you compare this with the original PDF document you can easily see that all of the information is available even if it is definitely not ready to be analyzed. What do you think is the next step needed to make our data more useful?

We first need to split our string into lines in order to give our data a structure that is closer to the original one, that is, made of paragraphs. To split our string into separate records, we can use the strsplit() function, which is a base R function. It requires the string to be split and the token that decides where the string has to be split as arguments. If you now look at the string, you’ll notice that where we found the end of a line in the original document, for instance after the words business profile, we now find the \n token. This is commonly employed in text formats to mark the end of a line.

We will therefore use this token as a split argument:

pdf_text("data/BUSINESSCENTER.pdf") %>% strsplit(split = "\n") [[1]] [1] "BUSINESSCENTER business profile" [2] "Information below are provided under non disclosure agreement. date of enquery: 12.05.2017" [3] "date of foundation: 1993-05-18" [4] "industry: Non-profit" [5] "share holders: Helene Wurm ; Meryl Savant ; Sydney Wadley" [6] "comments" [7] "This is one of our worst customer. It really often miss payments even if for just a couple of days. We have" [8] "problems finding useful contact persons. The only person we can have had occasion to deal with was the" [9] "fiscal expert, since all other relevant person denied any kind of contact." [10] " 1"

strsplit() returns a list with an element for each element of the character vector passed as argument; within each list element, there is a vector with the split string.

Isn’t that better? I definitely think it is. The last thing we need to do before actually doing text mining on our data is to apply those treatments to all of the PDF files and gather the results into a conveniently arranged data frame.

Iteratively extracting text from a set of documents with a for loop

What we want to do here is run through the list of files and, for each filename found there, run the pdf_text() function and then the strsplit() function to get an object similar to the one we have seen with our test. A convenient way to do this is by employing a ‘for’ loop. These structures basically do this with their content: repeat these instructions n times and then stop. Let me show you a typical ‘for’ loop:

for (i in 1:3){ print(i) }

If we run this, we obtain the following:

[1] 1
[1] 2
[1] 3

This means that the loop runs three times and therefore repeats the instructions included within the brackets three times. What is the only thing that seems to change every time? It is the value of i. This variable, which is usually called counter, is basically what the loop employs to understand when to stop iterating. When the loop execution starts, the loop starts increasing the value of the counter, going from 1 to 3 in our example. The for loop repeats the instructions between the brackets for each element of the values of the vector following the in clause in the for command. At each step, the value of the variable before in (i in this case) takes one value of the sequence from the vector itself.

The counter is also useful within the loop itself, and it is usually employed to iterate within an object in which some kind of manipulation is desired. Take, for instance, a vector defined like this:

vector <- c(1,2,3)

Imagine we want to increase the value of every element of the vector by 1. We can do this by employing a loop such as this:

for (i in 1:3){ vector[i] <- vector[i]+1 }

If you look closely at the loop, you’ll realize that the instruction needs to access the element of the vector with an index equal to i and modify this value by 1. The counter here is useful because it will allow iteration over all vector elements, from 1 to 3.

Be aware that this is actually not a best practice because loops tend to be quite computationally expensive, and they should be employed when no other valid alternative is available. For instance, we can obtain the same result here by working directly on the whole vector, as follows:

vector_increased <- vector +1

If you are interested in the topic of avoiding loops where they are not necessary, I can share with you some relevant material on this.

For our purposes, we are going to apply loops to go through the pdf_list object, and apply the pdf_text function and subsequently the strsplit() function to each element of this list:

corpus_raw <- data.frame("company" = c(), "text" = c())

for (i in 1:length(pdf_list)){
  print(i)
  pdf_text(paste("data/", pdf_list[i], sep = "")) %>%
    strsplit("\n") -> document_text
  data.frame("company" = gsub(x = pdf_list[i], pattern = ".pdf", replacement = ""),
             "text" = document_text,
             stringsAsFactors = FALSE) -> document
  colnames(document) <- c("company", "text")
  corpus_raw <- rbind(corpus_raw, document)
}

Let’s look at the loop more closely: we first have a call to the pdf_text function, passing as an argument an element of pdf_list, namely the one at position i in the list. Once we have done this, we can move on to applying the strsplit() function to the resulting string. We define the document object, which contains two columns, in this way:

  • company, which stores the name of the PDF without the .pdf token; this is the name of the company
  • text, which stores the text resulting from the extraction

This document object is then appended to the corpus object, which we created previously, to store all of the text within the PDF.
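As an aside, here is a sketch (not from the book) of how the same corpus could be assembled without an explicit loop, by mapping a small helper over pdf_list with purrr; it assumes pdftools and magrittr are loaded as above:

# Read one customer card and return a two-column data frame
read_card <- function(filename) {
  text <- pdf_text(paste0("data/", filename)) %>%
    strsplit("\n") %>%
    unlist()
  data.frame(company = gsub(".pdf", "", filename),
             text = text,
             stringsAsFactors = FALSE)
}

# Row-bind the per-document data frames into one corpus
corpus_raw <- purrr::map_df(pdf_list, read_card)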

Let’s have a look at the resulting data frame:

corpus_raw %>% head()

This is a well-structured object, ready for some text mining. Nevertheless, if we look closely at our PDF customer cards, we can see that there are three different kinds of information and they should be handled differently:

  • Repeated information, such as the confidentiality disclosure on the second line and the date of enquiry (12.05.2017)
  • Structured attributes, for instance, date of foundation or industry
  • Strictly unstructured data, which is in the comments paragraph

We are going to address these three kinds of data differently, removing the first group of irrelevant information; we therefore have to split our data frame accordingly into two smaller data frames. To do so, we will leverage the grepl() function once again, looking for the following tokens:

  • 12.05.2017: This denotes the line showing the non-disclosure agreement and the date of inquiry.
  • business profile: This denotes the title of the document, containing the name of the company. We already have this information stored within the company column.
  • comments: This is the name of the last paragraph.
  • 1: This represents the number of the page and is always the same on every card.

We can apply the filter function to our corpus_raw object here as follows:

corpus_raw %>%
  filter(!grepl("12.05.2017", text)) %>%
  filter(!grepl("business profile", text)) %>%
  filter(!grepl("comments", text)) %>%
  filter(!grepl("1", text)) -> corpus

Now that we have removed those useless things, we can actually split the data frame into two sub-data frames based on what is returned by the grepl function when searching for the following tokens, which point to the structured attributes we discussed previously:

  • date of foundation
  • industry
  • shareholders

We are going to create two different data frames here; one is called ‘information’ and the other is called ‘comments’:

corpus %>%
  filter(!grepl(c("date of foundation"), text)) %>%
  filter(!grepl(c("industry"), text)) %>%
  filter(!grepl(c("share holders"), text)) -> comments

corpus %>%
  filter(grepl(("date of foundation"), text) |
           grepl(("industry"), text) |
           grepl(("share holders"), text)) -> information

As you can see, the two data treatments are nearly the opposite of each other, since the first looks for lines showing none of the three tokens while the other looks for records showing at least one of the tokens.

Let’s inspect the results by employing the head function:

information %>% head()
comments %>% head()

Great! We are nearly done. We are now going to start analyzing the comments data frame, which reports all comments from our colleagues. The very last step needed to make this data frame ready for subsequent analyses is to tokenize it, which basically means separating it into different rows for all the words available within the text column. To do this, we are going to leverage the unnest_tokens function, which basically splits each line into words and produces a new row for each word, having taken care to repeat the corresponding values within the other columns of the original data frame.

This function comes from the recent tidytext package by Julia Silge and David Robinson, which provides an organic framework of utilities for text mining tasks. It follows the tidyverse framework, which you should already know about if you are using the dplyr package.

Let’s see how to apply the unnest_tokens function to our data:

comments %>% unnest_tokens(word,text)-> comments_tidy

If we now look at the resulting data frame, we can see the following:

As you can see, we now have each word separated into a single record.

Thanks for reading! We hope you enjoyed this extract taken from R Data Mining.

Explore $5 R eBooks and videos from Packt until January 15th 2018.

 


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

Using nonstandard evaluation to simulate a register machine

Fri, 01/05/2018 - 07:00

(This article was first published on Higher Order Functions, and kindly contributed to R-bloggers)

I recently completed all 25 days of Advent of Code 2017, an annual
series of recreational programming puzzles. Each day describes a programming
puzzle and illustrates a handful of simple examples of the problem. The puzzle
then requires the participant to solve a much, much larger form of the problem.

For five or so of the puzzles, I used nonstandard evaluation to implement my
solution. As I previously wrote, nonstandard evaluation is a way of
bottling up magic spells (lines of R code) and changing how or where they are
cast (evaluated). I chose to use special evaluation not because it was the
easiest or most obvious solution but because I wanted to develop my skills with
computing on the language. In this post, I work through one of the examples
where I used nonstandard evaluation to write an interpreter for a simple
machine.

Puzzle description

Day 8 requires us to simulate the state
of a register machine as it receives a series of instructions.

Each instruction consists of several parts: the register to modify, whether to
increase or decrease that register’s value, the amount by which to increase or
decrease it, and a condition. If the condition fails, skip the instruction
without modifying the register. The registers all start at 0. The instructions
look like this:

b inc 5 if a > 1
a inc 1 if b < 5
c dec -10 if a >= 1
c inc -20 if c == 10

[…]

You might also encounter <= (less than or equal to) or != (not equal to).
However, the CPU doesn’t have the bandwidth to tell you what all the registers
are named, and leaves that to you to determine.

What is the largest value in any register after completing the
instructions in your puzzle input?

If I squint long enough at the register instructions, I can basically
see R code.

# b inc 5 if a > 1
b <- if (a > 1) b + 5 else b

# a inc 1 if b < 5
a <- if (b < 5) a + 1 else a

# c dec -10 if a >= 1
c <- if (a >= 1) c - -10 else c

# c inc -20 if c == 10
c <- if (c == 10) c + -20 else c

If we can set up a way to convert the machine instructions into R code, R will
handle the job of looking up values, modifying values and evaluating the logic
and if statements. In other words, if we can convert register instructions
into R code, the problem simplifies into something like running an R script.

And that’s a good strategy, because we have a lot of instructions to process.
Each user receives some unique (I think) input for each problem, and my problem
input contains 1,000 instructions.

library(magrittr)

full_input <- "https://raw.githubusercontent.com" %>%
  file.path("tjmahr", "adventofcode17", "master", "inst", "input08.txt") %>%
  readr::read_lines()

length(full_input)
#> [1] 1000

head(full_input)
#> [1] "kd dec -37 if gm <= 9"   "x dec -715 if kjn == 0"
#> [3] "ey inc 249 if x < 722"   "n dec 970 if t > 3"
#> [5] "f dec -385 if msg > -3"  "kd dec -456 if ic <= -8"

Our strategy for simulating the register machine will have the following steps:

  • Parsing a register instruction
  • Creating an R expression from an instruction
  • Evaluating an R expression inside of a register machine
  • Changing the evaluation rules to adapt to the quirks of this problem
Parsing the register instructions with regular expressions

The instructions have a very simple grammar. Here is how I would tag the first
few lines of my problem input.

[target] [verb] [num1] if [s1] [op] [num2]
kd       dec    -37    if gm   <=   9
x        dec    -715   if kjn  ==   0
ey       inc    249    if x    <    722
n        dec    970    if t    >    3
f        dec    -385   if msg  >    -3

We can parse these lines using regular expressions. Regular expressions are an
incredibly powerful language for processing text using pattern-matching rules. I
will walk through each of the regular expression patterns used to parse an
instruction.

To match the verbs, we can use the or | operator, so
(inc|dec) matches inc or dec. We can also match the six different
comparison operators using | too. In the code below, I put the patterns in
parentheses so that they will be treated as a single group.

re_verb <- "(inc|dec)" re_op <- "(>|<|==|!=|>=|<=)"

A register name is just a sequence of letters. The special character \w
matches any word character; that is, it matches uppercase/lowercase letters,
digits and underscores. The (token)+ suffix matches 1 or more
repetitions of a token. Putting these two together, \w+ will match 1 or
more adjacent word characters. That pattern in principle could match things
besides register names (like numbers), but the instruction format here is so
constrained that it’s not a problem.

# We have to double the backslashes when using them in R strings
re_name <- "(\\w+)"

Numbers are just integers, and sometimes they are negative. A number here
is an optional - plus some digits. The special character \d matches any
digit from 0 to 9, so \d+ matches 1 or more successive digits. We can use
the (token)? suffix to match an optional token. In our case,
-?\d+ will match a sequence of digits plus a leading hyphen if one is
present.

re_number <- "(-?\\d+)"

Each pattern in parentheses is a matching group, and the function str_match()
from the stringr package will return a matrix
with a column for each matched group. I also include an extra set of
parentheses around the condition in the if statement to also match the whole
condition as well as its parts.

# Combine the sub-patterns together
re <- sprintf("%s %s %s if (%s %s %s)",
              re_name, re_verb, re_number, re_name, re_op, re_number)
re
#> [1] "(\\w+) (inc|dec) (-?\\d+) if ((\\w+) (>|<|==|!=|>=|<=) (-?\\d+))"

text <- "b inc 5 if a > 1"
stringr::str_match(text, re)
#>      [,1]               [,2] [,3]  [,4] [,5]    [,6] [,7] [,8]
#> [1,] "b inc 5 if a > 1" "b"  "inc" "5"  "a > 1" "a"  ">"  "1"

# Column 5 matches the subgroups in columns 6, 7 and 8 as a single
# group because of the extra grouping parentheses after the `if`.

We can package this step into a function that takes an instruction’s
text and returns a list with the labelled parts of that instruction.

parse_instruction <- function(text) {
  stopifnot(length(text) == 1)
  re <- "(\\w+) (inc|dec) (-?\\d+) if ((\\w+) (>|<|==|!=|>=|<=) (-?\\d+))"
  text %>%
    stringr::str_match(re) %>%
    as.list() %>%
    setNames(c("instruction", "target", "verb", "num1", "cond", "s1", "op", "num2"))
}

str(parse_instruction(text))
#> List of 8
#>  $ instruction: chr "b inc 5 if a > 1"
#>  $ target     : chr "b"
#>  $ verb       : chr "inc"
#>  $ num1       : chr "5"
#>  $ cond       : chr "a > 1"
#>  $ s1         : chr "a"
#>  $ op         : chr ">"
#>  $ num2       : chr "1"

Creating R code

Next, we need to convert some strings into R code. We can do this with
rlang::parse_expr(). It takes a string and creates an R expression,
something I’ve described as a kind of bottled up magic spell: an expression
captures magic words (code) and lets us manipulate or cast (evaluate)
them.

code <- rlang::parse_expr("print('hello')")
code
#> print("hello")

code <- rlang::parse_expr("if (a > 1) b + 5 else b")
code
#> if (a > 1) b + 5 else b

The format of the instructions is relatively straightforward. We can fill
in a template with the parts of the parsed line. Because inc/dec are just
addition and subtraction, we replace them with the appropriate math operations.

create_r_instruction <- function(parsed) {
  parsed$math <- if (parsed$verb == "inc") "+" else "-"
  code <- sprintf("if (%s) %s %s %s else %s",
                  parsed$cond, parsed$target, parsed$math, parsed$num1, parsed$target)
  rlang::parse_expr(code)
}

r_code <- "b inc 5 if a > 1" %>%
  parse_instruction() %>%
  create_r_instruction()

r_code
#> if (a > 1) b + 5 else b

Create the register machine

We have to figure out where we want to evaluate the generated R code. We
create a register object to hold the values. The object will just be a list()
with some extra metadata. This object will be the location where the R code
is evaluated.

create_register_machine <- function(...) {
  initial <- list(...)
  data <- c(initial, list(.metadata = list()))
  structure(data, class = c("register_machine", "list"))
}

# Give the machines a pretty print method
print.register_machine <- function(x, ...) {
  utils::str(x, ...)
  invisible(x)
}

create_register_machine()
#> List of 1
#>  $ .metadata: list()
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"

For now, we can initialize registers by using named arguments to the function.

create_register_machine(a = 0, b = 0)
#> List of 3
#>  $ a        : num 0
#>  $ b        : num 0
#>  $ .metadata: list()
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"

Evaluating code inside of the machine

So far, we have:

  • A way to analyze register instructions and convert them into R code
  • An object that holds register values

Now, we need to evaluate an expression inside of the register. We will use
tidy evaluation; the function eval_tidy() lets us evaluate an
R expression inside of a data object.

r_code
#> if (a > 1) b + 5 else b

# b + 5
r <- create_register_machine(a = 4, b = 7)
rlang::eval_tidy(r_code, data = r)
#> [1] 12

# just b
r <- create_register_machine(a = 0, b = 7)
rlang::eval_tidy(r_code, data = r)
#> [1] 7

Now, we need to actually do something. We need to update the register machine
using the value from the evaluated instruction. Otherwise, the machine will just
read expressions and forget everything it’s read.

To update the machine, we have to determine the register to update. Fortunately,
our generated code always ends with an else branch that has the target
register.

r_code
#> if (a > 1) b + 5 else b

If we can pull out that symbol after the else, we will have the name of
register to update in the machine. Because the code is so formulaic, we can just
extract the symbol directly using the code’s abstract syntax tree (AST).
pryr::call_tree() shows the AST for an expression.

pryr::call_tree(r_code)
#> \- ()
#>   \- `if
#>   \- ()
#>     \- `>
#>     \- `a
#>     \- 1
#>   \- ()
#>     \- `+
#>     \- `b
#>     \- 5
#>   \- `b

We can extract elements from the tree like elements in a list by selecting
indices.

# The numbers match one of the slashes at the first level of indentation
r_code[[1]]
#> `if`

r_code[[2]]
#> a > 1

# We can crawl down subtrees too
r_code[[2]][[2]]
#> a

# But what we want is the last branch from the `if` level
r_code[[4]]
#> b

If we convert the symbol into a string, we can look it up in the register using
the usual list lookup syntax.

r <- create_register_machine(a = 4, b = 7)
target <- rlang::as_string(r_code[[4]])
r[[target]]
#> [1] 7

We can also use list lookup syntax with assignment to modify the register.

r[[target]] <- rlang::eval_tidy(r_code, data = r)
r
#> List of 3
#>  $ a        : num 4
#>  $ b        : num 12
#>  $ .metadata: list()
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"

Let’s wrap these steps into a function.

eval_instruction <- function(register_machine, instruction) {
  target <- rlang::as_string(instruction[[4]])
  register_machine[[target]] <- rlang::eval_tidy(
    expr = instruction,
    data = register_machine)
  register_machine
}

create_register_machine(a = 2, b = 0) %>%
  eval_instruction(r_code)
#> List of 3
#>  $ a        : num 2
#>  $ b        : num 5
#>  $ .metadata: list()
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"

create_register_machine(a = 2, b = 0) %>%
  # For quick testing, we pass in quoted expressions
  eval_instruction(quote(if (a > 1) b - 100 else b)) %>%
  # Should not run
  eval_instruction(quote(if (a < 1) b + 5 else b)) %>%
  # Should run
  eval_instruction(quote(if (a > 1) a + 10 else a))
#> List of 3
#>  $ a        : num 12
#>  $ b        : num -100
#>  $ .metadata: list()
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"

Time for some extra nonstandard evaluation

The code so far only works if the machine already has registers that match the
registers in an instruction. Otherwise, we raise an error.

create_register_machine() %>%
  eval_instruction(quote(if (a > 1) b - 100 else b))
#> Error in overscope_eval_next(overscope, expr): object 'a' not found

# "Overscope" is the tidy evaluation term for the data context, so failing to
# find the name in the data is failing to find the name in the overscope.

To solve the problem, we could study the 1,000 lines of input beforehand,
extract the register names, initialize them to 0 and then evaluate the
instructions.1 Or… or… we could procrastinate and only
initialize a register name to 0 when the machine encounters a name it doesn’t
recognize. If, for some reason, our machine received instructions
one at a time, like over a network connection, then the procrastinated approach
seems even more reasonable.

This latter strategy will involve some very nonstandard evaluation. I
emphasize the “very” because we are changing one of the fundamental rules of R
evaluation
:smiling_imp:. R throws an error if you ask it to evaluate the name
of a variable that doesn’t exist. But here we are going to detect those missing
variables and set them to 0 before they get evaluated.

To find the brand-new register names, we can inspect the call tree and find the
names of the registers. We already know where the target is. The other place
where names show up is in the condition of the if statement.

pryr::call_tree(r_code)
#> \- ()
#>   \- `if
#>   \- ()
#>     \- `>
#>     \- `a
#>     \-  1
#>   \- ()
#>     \- `+
#>     \- `b
#>     \-  5
#>   \- `b

extract_register_names <- function(instruction) {
  reg_target <- rlang::as_string(instruction[[4]])
  reg_condition <- rlang::as_string(instruction[[2]][[2]])
  list(
    target = reg_target,
    registers = unique(c(reg_target, reg_condition))
  )
}

extract_register_names(quote(if (a > 1) b - 100 else b)) %>% str()
#> List of 2
#> $ target   : chr "b"
#> $ registers: chr [1:2] "b" "a"

# Just returns unique names
extract_register_names(quote(if (b > 1) b - 100 else b)) %>% str()
#> List of 2
#> $ target   : chr "b"
#> $ registers: chr "b"

We can define a helper function which checks for missing names—names that
yield NULL values when we try to retrieve them—and initializes them to 0.

initialize_new_registers <- function(register_machine, registers) {
  for (each_register in registers) {
    if (is.null(register_machine[[each_register]])) {
      register_machine[[each_register]] <- 0
    }
  }
  register_machine
}

# Before
r
#> List of 3
#> $ a : num 4
#> $ b : num 12
#> $ .metadata: list()
#> - attr(*, "class")= chr [1:2] "register_machine" "list"

initialize_new_registers(r, c("a", "b", "w", "a", "s", "j"))
#> List of 6
#> $ a : num 4
#> $ b : num 12
#> $ .metadata: list()
#> $ w : num 0
#> $ s : num 0
#> $ j : num 0
#> - attr(*, "class")= chr [1:2] "register_machine" "list"

Finally, we update our evaluation function to do this step automatically. I’m
also going to add some code to record the value of the maximum register whenever
an instruction is evaluated because, ummm, that’s the whole point of the puzzle.

eval_instruction <- function(register_machine, instruction) {
  # Set any new registers to 0
  registers <- extract_register_names(instruction)
  register_machine <- initialize_new_registers(
    register_machine, registers$registers)

  # Evaluate instruction
  register_machine[[registers$target]] <- rlang::eval_tidy(
    expr = instruction,
    data = register_machine)

  # Find the maximum value
  register_names <- setdiff(names(register_machine), ".metadata")
  current_max <- max(unlist(register_machine[register_names]))
  register_machine$.metadata$max <- current_max

  register_machine
}

Let’s try a few instructions from a clean slate.

create_register_machine() %>%
  # b gets 5
  eval_instruction(quote(if (d < 1) b + 5 else b)) %>%
  # c gets 10
  eval_instruction(quote(if (b > 1) c + 10 else c)) %>%
  # b gets 5 more
  eval_instruction(quote(if (a < 1) b + 5 else b))
#> List of 5
#> $ .metadata:List of 1
#> ..$ max: num 10
#> $ b : num 10
#> $ d : num 0
#> $ c : num 10
#> $ a : num 0
#> - attr(*, "class")= chr [1:2] "register_machine" "list"

Now, for the moment of truth… Let’s process all 1,000 instructions.

r <- create_register_machine()

for (each_instruction in full_input) {
  parsed <- each_instruction %>%
    parse_instruction() %>%
    create_r_instruction()
  r <- eval_instruction(r, parsed)
}

r
#> List of 27
#> $ .metadata:List of 1
#> ..$ max: num 4832
#> $ kd : num -2334
#> $ gm : num -4239
#> $ x : num -345
#> $ kjn : num -1813
#> $ ey : num 209
#> $ n : num -764
#> $ t : num 2997
#> $ f : num 4468
#> $ msg : num -3906
#> $ ic : num -263
#> $ zv : num -599
#> $ gub : num 2025
#> $ yp : num -2530
#> $ lyr : num -2065
#> $ j : num 3619
#> $ e : num -4230
#> $ riz : num 863
#> $ lzd : num 4832
#> $ ucy : num -3947
#> $ i : num 3448
#> $ omz : num -3365
#> $ djq : num 392
#> $ bxy : num 1574
#> $ tj : num 1278
#> $ y : num 1521
#> $ m : num 2571
#> - attr(*, "class")= chr [1:2] "register_machine" "list"

:star: Ta-da! The maximum register value is 4,832. Problem solved!

And then the rules change

Advent of Code problems come in two parts, and we don’t learn the question
behind Part 2 until we complete Part 1. In this case, after submitting our
solution for Part 1, we receive the following requirement:

To be safe, the CPU also needs to know the highest value held in any
register during this process
so that it can decide how much memory to allocate
to these operations.

Accounting for this twist requires a small change to the evaluation code. We
add another metadata variable to track the highest value ever stored in a
register.

eval_instruction <- function(register_machine, instruction) {
  # Set any new registers to 0
  registers <- extract_register_names(instruction)
  register_machine <- initialize_new_registers(
    register_machine, registers$registers)

  # Evaluate instruction
  register_machine[[registers$target]] <- rlang::eval_tidy(
    expr = instruction,
    data = register_machine)

  # Find the maximum value
  register_names <- setdiff(names(register_machine), ".metadata")
  current_max <- max(unlist(register_machine[register_names]))
  register_machine$.metadata$max <- current_max

  # Create the max-ever value if necessary
  if (is.null(register_machine$.metadata$max_ever)) {
    register_machine$.metadata$max_ever <- 0
  }

  # Maybe update the max-ever value
  if (register_machine$.metadata$max_ever < current_max) {
    register_machine$.metadata$max_ever <- current_max
  }

  register_machine
}

Admittedly, eval_instruction() is starting to get bloated. Conceptually, we
could probably break this function down into three functions: pre-evaluation
steps, evaluation, and post-evaluation steps.2 But this is good
enough for now.
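For what it’s worth, here is one minimal sketch of that split; the three helper names are made up, and their bodies just regroup the code already shown above.

prepare_registers <- function(register_machine, instruction) {
  registers <- extract_register_names(instruction)
  initialize_new_registers(register_machine, registers$registers)
}

evaluate_target <- function(register_machine, instruction) {
  target <- rlang::as_string(instruction[[4]])
  register_machine[[target]] <- rlang::eval_tidy(
    expr = instruction,
    data = register_machine)
  register_machine
}

record_metadata <- function(register_machine) {
  register_names <- setdiff(names(register_machine), ".metadata")
  current_max <- max(unlist(register_machine[register_names]))
  register_machine$.metadata$max <- current_max
  if (is.null(register_machine$.metadata$max_ever)) {
    register_machine$.metadata$max_ever <- 0
  }
  if (register_machine$.metadata$max_ever < current_max) {
    register_machine$.metadata$max_ever <- current_max
  }
  register_machine
}

eval_instruction <- function(register_machine, instruction) {
  register_machine %>%
    prepare_registers(instruction) %>%
    evaluate_target(instruction) %>%
    record_metadata()
}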

We run the instructions again to get the updated metadata.

r <- create_register_machine()

for (each_instruction in full_input) {
  parsed <- each_instruction %>%
    parse_instruction() %>%
    create_r_instruction()
  r <- eval_instruction(r, parsed)
}

r$.metadata
#> $max
#> [1] 4832
#>
#> $max_ever
#> [1] 5443

:star2: And boom! Another problem solved.

eval(thoughts, envir = this_problem)

I like this kind of nonstandard evaluation approach for converting problems into
R code, but it’s mostly useful when the problem describes a series of
instructions that can be parsed and evaluated. For problems like this register
machine simulation, the nonstandard evaluation route is straightforward. But
it’s also a viable problem-solving strategy when the “machine” or the
“instructions” are subtler, as in
this problem about simulating “dance” moves.

Odds are, you’ll never have to write an interpreter for a toy machine or
language. Nevertheless, here are some R functions that we used for this
puzzle that are helpful in other contexts (a short recap sketch follows the list):

  • stringr::str_match() to extract all the groups in a regular
    expression at once.
  • rlang::parse_expr() to convert a string of text into an R expression.
  • pryr::call_tree() to visualize an expression’s syntax tree and
    expression[[i]][[j]] to pluck out symbols from locations in a tree.
  • rlang::as_string() to convert a symbol into a string.
  • rlang::eval_tidy() to evaluate an expression inside of a data
    context.
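As a quick recap, here is a tiny sketch that strings those helpers together on a made-up instruction; the regular expression and the register values are purely illustrative.

library(stringr)
library(rlang)
library(pryr)

raw <- "b inc 5 if a > 1"

# Capture all of the regex groups at once
str_match(raw, "(\\w+) (inc|dec) (-?\\d+) if (.+)")

# Turn text into an unevaluated R expression
instruction <- parse_expr("if (a > 1) b + 5 else b")

# Inspect its syntax tree and pluck out the target symbol
call_tree(instruction)
as_string(instruction[[4]])
#> [1] "b"

# Evaluate it inside a data context
eval_tidy(instruction, data = list(a = 2, b = 0))
#> [1] 5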
  1. That actually would be pretty easy. Get a dataframe with
    purrr::map_df(full_input, parse_instruction). Find the unique register
    names. Create a list of 0’s with those names. Use do.call() to call
    create_register_machine() with that list. With no special evaluation
    trickery, this approach is closer to the idea of “just running R code”. 

  2. If all I did for a living was write code to simulate machines
    or toy languages, I might try to formalize this custom evaluation process
    with pre-evaluation and post-evaluation “hooks” that could be arguments
    to a custom evaluation function. I’m just brainstorming though. 


StanCon is next week, Jan 10-12, 2018

Fri, 01/05/2018 - 04:00

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

It looks pretty cool!

Wednesday, Jan 10

Invited Talk: Predictive information criteria in hierarchical Bayesian models for clustered data. Sophia Rabe-Hesketh and Daniel Furr (U California, Berkeley) 10:40-11:30am

Does the New York City Police Department rely on quotas? Jonathan Auerbach (Columbia U) 11:30-11:50am

Bayesian estimation of mechanical elastic constants. Ben Bales, Brent Goodlet, Tresa Pollock, Linda Petzold (UC Santa Barbara) 11:50am-12:10pm

Joint longitudinal and time-to-event models via Stan. Sam Brilleman, Michael Crowther, Margarita Moreno-Betancur, Jacqueline Buros Novik, Rory Wolfe (Monash U, Columbia U) 12:10-12:30pm
Lunch 12:30-2:00pm

ScalaStan. Joe Wingbermuehle (Cibo Technologies) 2:00-2:20pm
A tutorial on Hidden Markov Models using Stan. Luis Damiano, Brian Peterson, Michael Weylandt 2:20-2:40pm

Student Ornstein-Uhlenbeck models served three ways (with applications for population dynamics data). Aaron Goodman (Stanford U) 2:40-3:00pm

SlicStan: a blockless Stan-like language. Maria I. Gorinova, Andrew D. Gordon, Charles Sutton (U of Edinburgh) 3:00-3:20pm
Break 3:20-4:00pm

Invited Talk: Talia Weiss (MIT) 4:00-4:50pm

Thursday, Jan 11

Invited Talk: Sean Taylor and Ben Letham (Facebook) 10:40-11:30am

NPCompare: a package for nonparametric density estimation and two populations comparison built on top of PyStan. Marco Inacio (U of São Paulo/UFSCar) 11:30-11:50am

Introducing idealstan, an R package for ideal point modeling with Stan. Robert Kubinec (U of Virginia) 11:50am-12:10pm

A brief history of Stan. Daniel Lee (Generable) 12:10-12:30pm
Lunch 12:30-1:30pm

Computing steady states with Stan’s nonlinear algebraic solver. Charles C. Margossian (Metrum, Columbia U) 1:30-1:50pm

Flexible modeling of Alzheimer’s disease progression with I-Splines. Arya A. Pourzanjani, Benjamin B. Bales, Linda R. Petzold, Michael Harrington (UC Santa Barbara) 1:50-2:10pm

Intrinsic Auto-Regressive (ICAR) Models for Spatial Data, Mitzi Morris (Columbia U) 2:10-2:30pm

Modeling/Data Session + Classes 2:30-4:10pm

Open session for consultations on modeling and data problems with Stan developers and modelers. 2:30-4:10pm

Session 3 of Intro to Stan 2:30-4:10pm

2:30-3:30pm Have I converged successfully? How to verify fit and diagnose fit problems, Bob Carpenter

What is new to Stan 3:30-4:10pm

Invited Talk: Manuel Rivas (Stanford U) 4:00-4:50pm

Friday, Jan 12

Invited Talk: Susan Holmes (Stanford U) 10:40-11:30am

Aggregate random coefficients logit — a generative approach. Jim Savage, Shoshana Vasserman 11:30-11:50am

The threshold test: Testing for racial bias in vehicle searches by police. Camelia Simoiu, Sam Corbett-Davies, Sharad Goel, Emma Pierson (Stanford U) 11:50am-12:10pm

Assessing the safety of Rosiglitazone for the treatment of type II diabetes. Konstantinos Vamvourellis, K. Kalogeropoulos, L. Phillips (London School of Economics and Political Science) 12:10-12:30pm
Lunch 12:30-1:30pm

Causal inference with the g-formula in Stan. Leah Comment (Harvard U) 1:30-1:50pm
Bayesian estimation of ETAS models with Rstan. Fausto Fabian Crespo Fernandez (Universidad San Francisco de Quito) 1:50-2:10pm

Invited Talk: Andrew Gelman 2:10-3:00 (Columbia U) (virtual)

Classes/Tutorials

We have tutorials that start at the crack of 8am for those desiring further edification beyond the program—these do not run in parallel to the main session but do run parallel to each other:

Introduction to Stan: Know how to program? Know basic statistics? Curious about Bayesian analysis and Stan? This is the course for you. Hands on, focused and an excellent way to get started working in Stan. Two hours every day, 6 hours total. Jonah Sol Gabry.

Executive decision making the Bayesian way: This is for non-technical managers and technical folks who need to communicate with managers to learn the core of decision making under uncertainty. One hour every day. Jonathan Auerbach, Breck Baldwin, Eric Novik.

Advanced Hierarchical Models in Stan: The hard stuff. Very interactive, very intense. Topics vary by day. Ben Goodrich.

Model assessment, model selection and inference after model selection. Aki Vehtari.

How to develop for Stan at the C++ level: Overview of Stan C++ architecture and build/development process for contributors. Charles Christopher Margossian.

[edited by Aki: added Model selection tutorial]

The post StanCon is next week, Jan 10-12, 2018 appeared first on Statistical Modeling, Causal Inference, and Social Science.


Big cdata News

Thu, 01/04/2018 - 20:04

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I have some big news about our R package cdata. We have greatly improved the calling interface and Nina Zumel has just written the definitive introduction to cdata.

cdata is our general coordinatized data tool. It is what powers the deep learning performance graph (here demonstrated with R and Keras) that I announced a while ago.

However, cdata is much more than that.

cdata provides a family of general transforms that include pivot/unpivot (or tidyr::spread/tidyr::gather) as easy special cases. Nina refused to write the article on it until we re-factored the api to be even more teachable (and therefore more learnable, and more useful). After her re-design (adding the concepts of both concrete records and abstract records to the coordinatized data theory) the system teaches itself. It is actually hard to remember you are graphically specifying potentially involved, difficult, and confusing transforms (which do not remain confusing as the graphical specification becomes its own documenting diagram!).
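To make the “pivot/unpivot as special cases” point concrete, here is the plain tidyr version of exactly that special case; this is deliberately not the cdata interface, just the simple reshape that cdata’s control-table approach generalizes.

library(tidyr)

d <- data.frame(id = c(1, 2), a = c(10, 30), b = c(20, 40))

# un-pivot ("gather"): the wide columns a and b become key/value rows
long <- gather(d, key = "measure", value = "value", a, b)

# pivot ("spread"): back to one column per measure
wide <- spread(long, key = measure, value = value)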

Don’t take my word for it, please check out Nina’s article: “Fluid data reshaping with cdata”.

Also, I will be presenting a lightning talk on cdata at the January 2018 BARUG Meetup! We think we have gotten the concepts and package refined and polished to the point where it can be mastered in 15 minutes with time to spare.

As a bonus this new version of cdata is now available on CRAN (with the old method naming still supported), and also works at big data scale (for DBI-adaptable databases such as Spark and PostgreSQL, an uncommon feature for a full featured pivot/un-pivot system).


simmer 3.6.5

Thu, 01/04/2018 - 19:19

(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

The fifth update of the 3.6.x release of simmer, the Discrete-Event Simulator for R, is on CRAN. This release extends the attributes API by allowing users to set/get multiple attributes at once (a pretty straightforward change, and a useful one; I don’t know why it didn’t occur to me before…). Vectors as attributes and other data types are not supported yet, but they are on the roadmap.
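For illustration, here is a minimal sketch of the vectorised calls, assuming the signature described in the changelog below; the attribute names and values are made up.

library(simmer)

env <- simmer()

traj <- trajectory() %>%
  # one call now sets several attributes at once
  set_attribute(c("health", "mana"), c(100, 50))

env %>%
  add_generator("hero", traj, at(0), mon = 2) %>%  # mon = 2 also monitors attributes
  run()

get_mon_attributes(env)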

This version also fixes some minor bugs (many thanks to the users of the simmer-devel mailing list for taking their simulations to edge cases, where these bugs arise), deprecates the onestep() function and provides the new stepn() instead. Since onestep() serves primarily for debugging purposes, the transition to the new one may go unnoticed. Finally, there is a new vignette about the Dining Philosophers Problem.

New features:
  • set_attribute() (and set_global() by extension) can set multiple attributes at once by providing vectors of keys and values (or functions returning such keys and/or values). get_attribute() (and get_global() by extension) can retrieve multiple keys (#122).
  • New stepn() method deprecates onestep() (e452975).
Minor changes and fixes:
  • Restore ostream after formatting (9ff11f8).
  • Fix arrival cloning to copy attributes over to the clone (#118).
  • Fix self-induced preemption through set_capacity() (#125).
  • Update “Queueing Systems” vignette (a0409a0, 8f03f4f).
  • Update “Advanced Trajectory Usage” vignette (4501927).
  • Fix print methods to return the object invisibly (#128).
  • New “Dining Philosophers Problem” vignette (ff6137e).

Article originally published in Enchufa2.es: simmer 3.6.5.


Divide and parallelize large data problems with Rcpp

Thu, 01/04/2018 - 18:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Błażej Moska, computer science student and data science intern

Got stuck with too large a dataset? R speed drives you mad? Divide, parallelize and go with Rcpp!

One of the frustrating moments while working with data is when you need results urgently, but your dataset is large enough to make that impossible. This often happens when we need to use an algorithm with high computational complexity. I will demonstrate this with an example I have been working on.

Suppose we have a large dataset consisting of association rules. For some reason we want to slim it down. Whenever two rules’ consequents are the same and one rule’s antecedent is a subset of the second rule’s antecedent, we want to choose the smaller one (the probability of obtaining a smaller set is bigger than the probability of obtaining a bigger set). This is illustrated below:

{A,B,C}=>{D}

{E}=>{F}

{A,B}=>{D}

{A}=>{D}

How can we achieve that? For example, using below pseudo algorithm:

For i = 1 to n:
    For j = i+1 to n:
        if consequents[i] = consequents[j]:
            if antecedent[i] contains antecedent[j]:
                flag rule i with 1
            else if antecedent[j] contains antecedent[i]:
                flag rule j with 1
        # rules that are never flagged keep their initial flag of 0
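As a rough illustration (this is not the author’s code), a plain-R version of that pairwise check could look like the sketch below, assuming each antecedent is stored as a character vector of item names and each consequent as a single string. Rules left with a flag of 0 are the minimal ones we want to keep.

# rules: list of antecedent item vectors; cons: character vector of consequents
flag_redundant <- function(rules, cons) {
  n <- length(rules)
  flag <- integer(n)  # 1 = redundant: a smaller rule with the same consequent exists
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      if (cons[i] != cons[j]) next
      if (all(rules[[j]] %in% rules[[i]])) {        # antecedent j is a subset of i
        flag[i] <- 1
      } else if (all(rules[[i]] %in% rules[[j]])) { # antecedent i is a subset of j
        flag[j] <- 1
      }
    }
  }
  flag
}

rules <- list(c("A", "B", "C"), "E", c("A", "B"), "A")
cons  <- c("D", "F", "D", "D")
flag_redundant(rules, cons)
#> [1] 1 0 1 0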

How many operations do we need to perform with this simple algorithm?

For the first i we need to iterate \(n-1\) times, for the second i \(n-2\) times, for the third i \(n-3\) and so on, reaching finally \(n-(n-1)\). This leads to (proof can be found here):

\[
\sum_{i=1}^{n-1}{i} = \frac{n(n-1)}{2}
\]

So the above has asymptotic complexity of \(O(n^2)\). It means, more or less, that the computational cost grows with the square of the size of the data. Well, for a dataset containing around 1,300,000 records this becomes a serious issue. With R I was unable to perform the computation in a reasonable time. Since a compiled language performs better with simple arithmetic operations, the second idea was to use Rcpp. Yes, it is faster, to some extent — but with such a large dataframe I was still unable to get results in a satisfying time. So are there any other options?

Yes, there are. If we take a look at our dataset, we can see that it can be aggregated in such a way that each individual “chunk” consists of records with exactly the same consequent:

{A,B}=>{D}

{A}=>{D}

{C,G}=>{F}

{Y}=>{F}

After such a division I got 3300 chunks, so the average number of observations per chunk was around 400. The next step was to rerun the algorithm sequentially on each chunk. Since our algorithm has quadratic complexity, it is faster to do it that way than on the whole dataset at once. While R failed again, Rcpp finally returned a result (after 5 minutes). But there is still room for improvement. Since the chunks can be processed independently, we can run the computation in parallel using, for example, the foreach package (which I demonstrated in a previous article). While passing R functions to foreach is a simple task, parallelizing Rcpp is a little more involved. We need to do the steps below:

  1. Create a .cpp file which includes all of the functions needed
  2. Create a package using Rcpp. This can be achieved using for example:
    Rcpp.package.skeleton("nameOfYourPackage",cpp_files = "directory_of_your_cpp_file")
  3. Install your Rcpp package from source:
    install.packages("directory_of_your_rcpp_package", repos=NULL, type="source")
  4. Load your library:
    library(name_of_your_rcpp_package)

Now you can use your Rcpp function in foreach:

results=foreach(k=1:length(len), .packages=c(name_of_your_package)) %dopar% {your_cpp_function(data)}
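Expanded a little, the whole chunk-and-parallelise pattern might look like the hedged sketch below; rules_df, its consequent column and your_cpp_function() are placeholders standing in for your own data and the routine compiled into nameOfYourPackage.

library(foreach)
library(doParallel)

# Split the rules into chunks that share the same consequent
chunks <- split(rules_df, rules_df$consequent)

# Register a parallel backend
cl <- makeCluster(parallel::detectCores() - 1)
registerDoParallel(cl)

# Each chunk is processed independently by the compiled routine
results <- foreach(chunk = chunks,
                   .packages = "nameOfYourPackage") %dopar% {
  your_cpp_function(chunk)
}

stopCluster(cl)
flagged <- do.call(rbind, results)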

Even with foreach I waited forever for the R results, but Rcpp gave them in approximately 2.5 minutes. Not too bad!

Here are some conclusions. Firstly, it’s worth knowing more languages/tools than just R. Secondly, there is often an escape from the large-dataset trap. There is little chance that somebody will need to do exactly the same task as in the example above, but a much higher probability that someone will face a similar problem that can be solved in the same way.


The Markov Chain: How Do I Compute the Equilibrium Distribution?

Thu, 01/04/2018 - 15:51

(This article was first published on R-Programming – Statistik Service, and kindly contributed to R-bloggers)

In this article we would like to introduce the concept of the Markov chain, illustrate its foundations and point out several possible areas of application in which you can benefit from targeted statistical programming of Markov chains.

A Markov chain is a stochastic process with a very wide range of applications in nature, engineering and business. Classic business applications are the modelling of queues and of exchange rates. In engineering there are also numerous use cases, for example modelling the behaviour of dams and of cruise-control systems in motor vehicles, as well as the analytical evaluation of mobility algorithms such as the random walk. In the natural sciences this kind of process is used in population dynamics to predict the population growth of humans and animals, and, in a modified form, for Brownian motion. Particularly relevant in practice is the statistical programming and simulation of the equilibrium distribution with the statistics software R, which is carried out below using an illustrative example. But first we explain the terms required for the computation.

What are a Markov chain and an equilibrium distribution?

There is a large variety of different stochastic processes, which are divided into several categories. A well-known type of stochastic process, and one used particularly often in practice, is the Markov chain, named after the mathematician Andrei Andreyevich Markov. It consists of a state space, an index set, a start distribution and the transition probabilities.

The state space determines which states can be taken. Here one distinguishes between a continuous state space, which contains uncountably many states, and a discrete state space, which contains at most countably many states. An important special case of the latter is a state space with finitely many states, on which we will concentrate.

A stochastic process changes its state over time. Analogously to the state space, the index set (which contains the time points) is also either continuous or discrete. A continuous index set is appropriate, for example, for Brownian motion, because the molecules are in constant motion and can change their direction and speed within the smallest time intervals. We, however, use the discrete index set of the natural numbers N0 = {0,1,2,3,4,5,…}, including the start time 0, with an arbitrary time unit. Between two consecutive time points the state therefore remains constant.

The Markov chain has to start in some state. This can be a fixed, predefined state or a randomly chosen one. The start distribution specifies with which probability the stochastic process starts in which state.

Not only the start distribution is random, but also the subsequent behaviour. The transition probabilities specify with which probability the process moves to which state. We consider the standard case of a first-order Markov chain, in which the transition probabilities depend exclusively on the current state. In addition, the Markov chain is homogeneous, that is, the transition probabilities are constant and do not change over time. The transition probabilities can therefore be displayed in a transition matrix.

The state distribution depends on the respective time point. In some cases the state distribution (which contains the occupation probabilities of the states at a time n) converges to an equilibrium distribution, which is also called the stationary distribution. In the following sections an example will show you not only the criteria for existence and uniqueness of the equilibrium distribution, but also the analytical solution and how to carry out the statistical programming and simulation with the statistics software R.

Conditions for existence and uniqueness of the equilibrium distribution

The game character Pac-Man eats small pellets and cherries in a maze while being chased by ghosts. For simplicity, the game world in this example is a small 3×3 grid and the monsters move purely at random: every horizontally or vertically adjacent square is the ghost's next location with equal probability, with the exception of a secret passage between states 2 and 8. The transition graph pictured below shows, as an example, the transition probabilities of states 1, 5, 6 and 8.

To add some variety, the monsters' starting square is chosen at random, each square with the same probability. The start vector on the state space Z = {1,2,3,4,5,6,7,8,9} is therefore the row vector π0 = 1/9 (1,1,1,1,1,1,1,1,1). The i-th row and j-th column of the transition matrix P shown below contains the transition probability from the i-th to the j-th state. Entries with probability 0 have been removed for better readability:

At the beginning (at time 0) every state in this example is still equally likely, so the initial state distribution can be read directly from the start vector. And what does the state distribution look like after one time unit? To answer this question, multiply the start vector by the transition matrix P. For the second time point, multiply the resulting row vector by the transition matrix P once more, and so on. In general, you obtain the occupation probabilities of the individual states at time n via the formula π_n = π_0 · P^n.

Under certain conditions the state distribution π_n converges to the equilibrium distribution π. In our example with a finite state space, the Markov chain must be irreducible and aperiodic for this to happen.

So what do the terms irreducibility and aperiodicity mean? For the type of Markov chain we consider here, irreducibility holds if every state can be reached from every other state in finite time. Aperiodicity holds if the greatest common divisor of all return times to the original state is 1, and this must be true for every state. Let us check the two conditions:

Our Markov chain is irreducible, since the ghosts can get from any state to any other state in finite time; thanks to the secret passage, at most three transitions are needed for this. The secret passage also makes the Markov chain aperiodic, because the monsters can move from any state to any other state in both an even and an odd number of transitions, so the greatest common divisor of these return times is 1. Without the secret passage the Markov chain would be periodic, because a transition from an even state to an even state (or from an odd state to an odd state) would then only be possible in an even number of transitions, making the greatest common divisor 2.

The conditions for existence and uniqueness of the equilibrium distribution are therefore satisfied. Because of irreducibility and aperiodicity there is exactly one stable equilibrium distribution, which the Markov chain approaches as time goes to infinity. This means that after a long time the occupation probabilities of the individual states (almost) stop changing; for the same reason the matrix powers converge. But how can you carry out the statistical programming and simulation of the equilibrium distribution with the statistics software R? You will find out in the following two sections of this article.

Mathematical computation of the equilibrium distribution

As we have seen, a unique equilibrium distribution, also called the stationary distribution, exists. In this section you will learn how to compute this distribution mathematically. This tells you with which probability the monsters occupy which states (or locations) in the long run. The equilibrium distribution of the Markov chain can be determined naively by plugging a large n into the formula π_n = π_0 · P^n, because the matrix powers converge thanks to irreducibility and aperiodicity, as we have already established. To compute an analytical solution, the linear system π = π · P has to be solved for π under the side constraint that the entries of π sum to 1, so that we really obtain a probability distribution. Plugging the naive solution into this system of equations then serves as a check.
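As a quick numerical check of the naive approach, you can simply raise the transition matrix to a high power in R (the matrix P is entered in the next section):

# naive check: multiply the uniform start vector by P many times
pi_n = matrix ( rep ( 1/9 , 9 ) , nrow = 1 )
for ( n in 1:1000 ) {
  pi_n = pi_n %*% P
}
round ( 100 * pi_n , 1 )
# in percent: 7.7 15.4 7.7 11.5 15.4 11.5 7.7 15.4 7.7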

The above system π = π · P without the side constraint is equivalent to (P^T - I) · π^T = 0. The transition matrix is therefore transposed and the identity matrix subtracted; the vector of state probabilities we are looking for is now a column vector. So we have to solve a linear system which, including the side constraint, has one more equation than the Markov chain has states. With a large number of states it is very tedious to compute the solution by hand, which is why we now do the statistical programming with the statistics software R. An alternative would be to use an MCMC simulation (MCMC stands for Markov Chain Monte Carlo). A simulation is a sensible alternative if, for example, a stochastic process has so many states that the analytical computation would be numerically too expensive. We skip this here, however, because our Markov chain has only 9 states.

How to compute the equilibrium distribution by statistical programming

First of all, we enter the transition matrix P into the statistics software R.

P = matrix ( c ( 0 , 1/2 , 0 , 1/2 , 0 , 0 , 0 , 0 , 0 ,
                 1/4 , 0 , 1/4 , 0 , 1/4 , 0 , 0 , 1/4 , 0 ,
                 0 , 1/2 , 0 , 0 , 0 , 1/2 , 0 , 0 , 0 ,
                 1/3 , 0 , 0 , 0 , 1/3 , 0 , 1/3 , 0 , 0 ,
                 0 , 1/4 , 0 , 1/4 , 0 , 1/4 , 0 , 1/4 , 0 ,
                 0 , 0 , 1/3 , 0 , 1/3 , 0 , 0 , 0 , 1/3 ,
                 0 , 0 , 0 , 1/2 , 0 , 0 , 0 , 1/2 , 0 ,
                 0 , 1/4 , 0 , 0 , 1/4 , 0 , 1/4 , 0 , 1/4 ,
                 0 , 0 , 0 , 0 , 0 , 1/2 , 0 , 1/2 , 0 ) , byrow = T , ncol = 9)
P

A transition matrix contains the transition probabilities as its entries, and these must be values between 0 and 1. Whether this is the case can be checked for every entry of the matrix individually,

P >= 0 & P <= 1

or we can display at once whether all entries are plausible.

P >= 0 && P <= 1

In addition, the transition probabilities leaving each state should sum to 1.

apply ( P , MARGIN = 1, FUN = sum )

The transition matrix P is transposed and the identity matrix subtracted.

P.T.minus.I = t ( P ) - diag ( 9 )

Now it is important not to forget the side constraint, since otherwise we do not get a result. The equilibrium distribution is a probability distribution, and as such the sum over all states of the equilibrium distribution must equal 1. We therefore append a row of ones to the matrix P.T.minus.I.

P.T.minus.I.mit.Nebenbedingung = rbind ( P.T.minus.I , 1 )
P.T.minus.I.mit.Nebenbedingung

On the other side of the system of equations is the zero vector. Because of the side constraint we have to append a one. To obtain a column vector, we use a one-column matrix as the data type.

Nullvektor = matrix ( c ( rep ( 0 , 9 ) , 1 ) , ncol = 1 )
Nullvektor

The linear system extended by this side constraint is no longer square; it contains one more equation than it has variables. We therefore cannot use the standard solve command, but instead use the Solve command (with a capital S) from the limSolve package, which has to be installed first. After installation we can load the package with library ( limSolve ).

We pass the matrix and the vector into the function and obtain:

Gleichgewichtsverteilung = Solve ( P.T.minus.I.mit.Nebenbedingung , Nullvektor )
Gleichgewichtsverteilung

Finally, we check whether we have indeed obtained a valid probability distribution:

Gleichgewichtsverteilung >= 0 & Gleichgewichtsverteilung <= 1
Gleichgewichtsverteilung >= 0 && Gleichgewichtsverteilung <= 1
sum ( Gleichgewichtsverteilung )

Solution of the equilibrium distribution

The solution of the linear system with the statistics software R is: π = (7.7, 15.4, 7.7, 11.5, 15.4, 11.5, 7.7, 15.4, 7.7), in percent.

The ghosts therefore spend most of their time on the middle square, less time on the edge squares and the least time in the corners. The exceptions are the edge states 2 and 8, which, thanks to the secret passage, are on average visited just as often as the central square. The occupation probabilities of the states are proportional to the number of incoming arrows: the more one-step paths lead to a state, the more often that state is visited in the long run.

How do I use the Markov chain as a stochastic process in business?

There are numerous applications of Markov chains in business; one example is presented below. In stock trading one is often particularly interested in predicting how the stock market will develop. The situation on the exchange is divided into three main phases: bull market, bear market and stagnant market. The transition graph pictured below contains the week-to-week transition probabilities between the three phases, where each phase always persists for at least one week. With the equilibrium distribution you can now compute the probability with which the stock market is in each state in the long run.

If the Markov chain methodology described above is applied to this business example, the long-run result is that the stock market spends 62.5% of weeks in the bull market, 31.25% of weeks in the bear market and 6.25% of weeks in the stagnant market.
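For completeness, here is a small R sketch of this example. Since the transition graph itself is not reproduced here, the matrix entries below are the values commonly quoted for this bull/bear/stagnant illustration and should be treated as an assumption; they reproduce the long-run shares stated above.

# transition probabilities assumed for illustration (rows: bull, bear, stagnant)
P_market = matrix ( c ( 0.90 , 0.075 , 0.025 ,
                        0.15 , 0.800 , 0.050 ,
                        0.25 , 0.250 , 0.500 ) , byrow = T , ncol = 3 )

pi_market = matrix ( c ( 1 , 0 , 0 ) , nrow = 1 )   # start in a bull market
for ( week in 1:200 ) pi_market = pi_market %*% P_market
round ( pi_market , 4 )
# long-run shares: 0.625 (bull), 0.3125 (bear), 0.0625 (stagnant)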

We hope that this article has given you a better understanding of Markov chains and the equilibrium distribution, and that you can use them in the future to solve mathematical problems or questions in a business context. Should you need professional support along the way, the statisticians at Novustat are available for statistical programming or simulations.

The post Die Markov Kette: Wie berechne ich die Gleichgewichtsverteilung? appeared first on Statistik Service.


Faster Than Excel: Painlessly Merge Data into Actuarial Loss Development Triangles with R

Thu, 01/04/2018 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

The objective of this post is to create a method which easily combines loss runs, or listings of insurance claims, into triangles. Using only Excel, the common method is to create links between the excel files which must be updated manually for each new evaluation. This is prone to human error and is time-consuming. Using a script to merge the files first and then create a triangle saves time and is more consistent.

Example of the conventional linked Excel triangle:

For accident year 2016, all the diagonal values need to be linked to a data file that was last evaluated in 2016. For 2015, the links go to the 2015 file, for 2014, the 2014 file, etc. This is time consuming.

See Wikipedia for a definition of Loss Development Triangles and why they are useful.

Methods

I use the packages plyr, dplyr, tidyr, and lubridate to manipulate the data once it is loaded into R. The package readxl reads the excel files from the Windows file directory. It is worth noting that dplyr, tidyr, lubridate, and readxl are all installed along with Hadley Wickham’s tidyverse package. The package ChainLadder has a variety of tools for actuarial reserve analysis.

library(dplyr)
library(readxl)
library(plyr)
library(tidyr)
library(ChainLadder)
library(lubridate)

Step 1: Organize the Excel Files

Copy all of the excel files into the same windows folder, inside the working directory of the R project. It is important to have *only* the files that are going to be loaded into R in this folder.

On my computer, my file structure is as follows:

C:/Users/Sam/Desktop/[R project working directory]/[folder with excel claims listing files to aggregate]/[file 1, file 2, file 3, etc]

Using the command list.files(), R automatically looks in the working directory and returns a vector of all the file names which it sees. For example, R sees the data files in the current working directory:

wd_path = "C:/Users/Sam/Desktop/projects/Excel Aggregate Loop/excel files" list.files(wd_path) ## [1] "claims listing 2011.xlsx" "claims listing 2012.xlsx" ## [3] "claims listing 2013.xlsx" "claims listing 2014.xlsx" ## [5] "claims listing 2015.xlsx" "claims listing 2016.xlsx"

R can then loop through each of these files and perform any action. If there were 100 files, or 1000 files in this directory, this would still work.

file_names = list.files(wd_path)
file_paths = paste(wd_path, "/", file_names, sep = "")

head(read_excel(file_paths[4]) %>% select(-3,-4,-5))
## # A tibble: 4 x 6
##    name `license number`   age `file year` `loss date`  paid
##
## 1  jeff                3    23        2014  2011-01-01   400
## 2   sue                3    43        2014  2012-01-01   400
## 3  mark                2    55        2014  2013-01-01   200
## 4 sarah                1   100        2014  2014-01-01   500

In order to evaluate the age of the losses, we need to take into account when each loss was evaluated. This is accomplished by going into Excel and adding in a column for “file year”, which specifies the year of evaluation of the file. For instance, for the “claim listing 2013” file, all of the claims have a “2013” in the “file year” column.
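As a hedged alternative (not what this post does), you could skip the manual Excel step and recover the evaluation year from the file name itself, assuming the year always appears there as in "claims listing 2014.xlsx":

library(stringr)

# Pull the four-digit year out of each file name
file_years = as.numeric(str_extract(file_names, "\\d{4}"))
file_years
## [1] 2011 2012 2013 2014 2015 2016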

Step 2: Load the Data into R

Initialize a data frame which will store the aggregated loss run data from each of the excel files. The names of this data frame need to be the names of the excel file columns which are to be aggregated. For instance, these could be "reported", "Paid Loss", "Case Reserves", or "Incurred Loss". If the excel files have different names for the same quantities (i.e., "Paid Loss" vs. "paid loss"), then they should be renamed within Excel first.

merged_data = as_data_frame(matrix(nrow = 0, ncol = 3))
names(merged_data) = c("file year", "loss date", "paid")

Someone once said "if you need to use a 'for' loop in R, then you are doing something wrong". Vectorized functions are faster and easier to implement. The function excel_file_extractor below takes in the file name of the excel file and returns a data frame with only the columns which are selected.

excel_file_extractor = function(cur_file_name){
  read_excel(cur_file_name[1], sheet = "Sheet1", col_names = T) %>%
    select(`file year`, `loss date`, paid) %>%
    rbind(merged_data)
}

Apply the function to all of the files in the folder that you created. Obviously, if you had 100 excel files this would still work just as effectively.

From the plyr package, ldply takes in a list and returns a data frame. The way to remember this is by the ordering of the letters ("list"-"data frame"-"ply"). For example, if we wanted to read in a data frame and return a data frame, it would be ddply.

loss_run_data = ldply(file_paths, excel_file_extractor)
names(loss_run_data) = c("file_year", "loss_date", "paid losses")

The data now has only the columns that we selected, despite the fact that the loss run files had different columns in each of the files.

glimpse(loss_run_data)
## Variables: 3
## $ file_year   2011, 2012, 2012, 2013, 2013, 2013, 2014, 2014, 20...
## $ loss_date   2011-01-01, 2011-01-01, 2012-01-01, 2011-01-01, 2...
## $ paid losses 100, 200, 300, 300, 350, 100, 400, 400, 200, 500, ...

head(loss_run_data)
##   file_year  loss_date paid losses
## 1      2011 2011-01-01         100
## 2      2012 2011-01-01         200
## 3      2012 2012-01-01         300
## 4      2013 2011-01-01         300
## 5      2013 2012-01-01         350
## 6      2013 2013-01-01         100

Step 3: Create Development Triangles

Finally, once we have the loss run combined, we just need to create a triangle. This is made easy by the `as.triangle` function from the `ChainLadder` package.

The only manual labor required in excel was to go into each file and create the `file year` column, which was just the year of evaluation of each loss run file.

loss_run_data = loss_run_data %>%
  mutate(accident_year = as.numeric(year(ymd(loss_date))),
         maturity_in_months = (file_year - accident_year)*12)

merged_triangle = as.triangle(loss_run_data,
                              dev = "maturity_in_months",
                              origin = "accident_year",
                              value = "paid losses")
merged_triangle

Within the package `ChainLadder` is a plot function, which shows the development of losses by accident year. Because these are arbitrary amounts, the plot is not realistic.

plot(merged_triangle, main = "Paid Losses vs Maturity by Accident Year", xlab = "Maturity in Months", ylab = "Paid Losses")

Conclusion

When it comes to aggregating excel files, R can be faster and more consistent than linking together each of the excel files, and once this script is set in place, making modifications to the data can be done easily by editing the `excel_file_extractor` function.

Download the data and Rmd files for this article from my github account


    Deep Learning from first principles in Python, R and Octave – Part 1

    Thu, 01/04/2018 - 13:42

    (This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

    “You don’t perceive objects as they are. You perceive them as you are.”
    “Your interpretation of physical objects has everything to do with the historical trajectory of your brain – and little to do with the objects themselves.”
    “The brain generates its own reality, even before it receives information coming in from the eyes and the other senses. This is known as the internal model”

    David Eagleman - The Brain: The Story of You

This is the first in a series of posts I intend to write on Deep Learning. This post is inspired by the Deep Learning Specialization by Prof Andrew Ng on Coursera and Neural Networks for Machine Learning by Prof Geoffrey Hinton, also on Coursera. In this post I implement Logistic Regression with a 2 layer Neural Network, i.e. a Neural Network that just has an input layer and an output layer and no hidden layer. I am certain that any self-respecting Deep Learning/Neural Network practitioner would consider a Neural Network without hidden layers as no Neural Network at all!

This 2 layer network is implemented in the Python, R and Octave languages. I have included Octave in the mix, as Octave is a close cousin of Matlab. These implementations in Python, R and Octave are equivalent vectorized implementations. So, if you are familiar with any one of the languages, you should be able to look at the corresponding code in the other two. You can download this R Markdown file and Octave code from DeepLearning -Part 1

    To start with, Logistic Regression is performed using sklearn’s logistic regression package for the cancer data set also from sklearn. This is shown below

1. Logistic Regression

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification, make_blobs
from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer

# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)

# Call the Logistic Regression function
clf = LogisticRegression().fit(X_train, y_train)

print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
## Accuracy of Logistic regression classifier on training set: 0.96
## Accuracy of Logistic regression classifier on test set: 0.96

To check on other classification algorithms, see my post Practical Machine Learning with R and Python – Part 2. You could also take a look at my book, available on Amazon, in which I implement the most popular Machine Learning algorithms in both R and Python – My book 'Practical Machine Learning with R and Python' on Amazon

    2. Logistic Regression as a 2 layer Neural Network

    In the following section Logistic Regression is implemented as a 2 layer Neural Network in Python, R and Octave. The same cancer data set from sklearn will be used to train and test the Neural Network in Python, R and Octave. This can be represented diagrammatically as below

     

    The cancer data set has 30 input features, and the target variable ‘output’ is either 0 or 1. Hence the sigmoid activation function will be used in the output layer for classification.

    This simple 2 layer Neural Network is shown below
At the input layer there are 30 features and the corresponding weights of these inputs, which are initialized to small random values. The output of this layer is

\[ Z = w^{T}x + b \]

where 'b' is the bias term.

The activation function is the sigmoid function, which is

\[ a = \sigma(Z) = \frac{1}{1+e^{-Z}} \]

The Loss, when the sigmoid function is used in the output layer, is given by

\[ L = -\frac{1}{m}\sum_{i=1}^{m}\Big( y^{(i)}\log a^{(i)} + (1-y^{(i)})\log\big(1-a^{(i)}\big) \Big)  \qquad (1) \]

Gradient Descent

Forward propagation

In the forward propagation cycle of the Neural Network, the output Z and the output of the activation function (the sigmoid) are first computed. Then, using the target output 'y' for the given features, the 'Loss' is computed using equation (1) above.

    Backward propagation

The backward propagation cycle determines how the 'Loss' is impacted by small variations in the weights, from the previous layers up to the input layer. In other words, backward propagation computes the changes in the weights at the input layer which will minimize the loss. Several cycles of gradient descent are performed along the path of steepest descent to find the local minima; that is, the set of weights and biases at the input layer which results in the lowest loss is computed by gradient descent. The weights at the input layer are adjusted in steps whose size is controlled by a parameter known as the 'learning rate'. Too big a 'learning rate' can overshoot the local minima, and too small a 'learning rate' can take a long time to reach the local minima. This is done for 'm' training examples.
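In symbols, each gradient descent step applies the update below, where \( \alpha \) is the learning rate; this mirrors the lines w = w - learningRate*dw and b = b - learningRate*db in the code that follows.

\[ w := w - \alpha \frac{\partial L}{\partial w}, \qquad b := b - \alpha \frac{\partial L}{\partial b} \]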

Chain rule of differentiation
Let \( y=f(u) \) and \( u=g(x) \); then

\[ \frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx} \]

Derivative of sigmoid
Let \( \sigma(z) = \frac{1}{1+e^{-z}} \) and write \( u = 1+e^{-z} \), so that \( \sigma = \frac{1}{u} \); then

\[ \frac{d\sigma}{du} = -\frac{1}{u^{2}}, \qquad \frac{du}{dz} = -e^{-z} \]

Using the chain rule of differentiation we get

\[ \frac{d\sigma}{dz} = \frac{e^{-z}}{(1+e^{-z})^{2}} \]

Therefore

\[ \frac{d\sigma}{dz} = \sigma(z)\big(1-\sigma(z)\big)  \qquad -(2) \]

The 3 equations for the 2 layer Neural Network representation of Logistic Regression are

\[ Z = w^{T}x + b  \qquad -(a) \]
\[ a = \sigma(Z) = \frac{1}{1+e^{-Z}}  \qquad -(b) \]
\[ L = -\big(y\log a + (1-y)\log(1-a)\big)  \qquad -(c) \]

The back propagation step requires the computation of \( \frac{\partial L}{\partial w} \) and \( \frac{\partial L}{\partial b} \). In the case of regression it would be \( \frac{\partial E}{\partial w} \) and \( \frac{\partial E}{\partial b} \), where E is the Mean Squared Error function.
Computing the derivatives for back propagation we have

\[ \frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}  \qquad -(d) \]

because \( \frac{d}{da}\log a = \frac{1}{a} \).
Also from equation (2) we get

\[ \frac{\partial a}{\partial Z} = a(1-a)  \qquad -(e) \]

By the chain rule

\[ \frac{\partial L}{\partial Z} = \frac{\partial L}{\partial a}\frac{\partial a}{\partial Z} \]

therefore, substituting the results of (d) & (e), we get

\[ \frac{\partial L}{\partial Z} = a - y  \qquad -(f) \]

Finally

\[ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial Z}\frac{\partial Z}{\partial w}  \qquad -(g) \]
\[ \frac{\partial Z}{\partial w} = x  \qquad -(h) \]

and from (f) we have \( \frac{\partial L}{\partial Z} = a - y \).
Therefore (g) reduces to

\[ \frac{\partial L}{\partial w} = x\,(a - y)  \qquad -(i) \]

Also

\[ \frac{\partial L}{\partial b} = a - y  \qquad -(j) \]

since \( \frac{\partial Z}{\partial b} = 1 \).

Gradient descent then updates the weights at the input layer and the corresponding bias by using the values of \( \frac{\partial L}{\partial w} \) and \( \frac{\partial L}{\partial b} \).


    I found the computation graph representation in the book Deep Learning: Ian Goodfellow, Yoshua Bengio, Aaron Courville, very useful to visualize and also compute the backward propagation. For the 2 layer Neural Network of Logistic Regression the computation graph is shown below

    3. Neural Network for Logistic Regression -Python code (vectorized)

    import numpy as np
    import pandas as pd
    import os
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split

    # Define the sigmoid function
    def sigmoid(z):
        a = 1/(1+np.exp(-z))
        return a

    # Initialize
    def initialize(dim):
        w = np.zeros(dim).reshape(dim,1)
        b = 0
        return w

    # Compute the loss
    def computeLoss(numTraining,Y,A):
        loss = -1/numTraining * np.sum(Y*np.log(A) + (1-Y)*(np.log(1-A)))
        return(loss)

    # Execute the forward propagation
    def forwardPropagation(w,b,X,Y):
        # Compute Z
        Z = np.dot(w.T,X)+b
        # Determine the number of training samples
        numTraining = float(len(X))
        # Compute the output of the sigmoid activation function
        A = sigmoid(Z)
        # Compute the loss
        loss = computeLoss(numTraining,Y,A)
        # Compute the gradients dZ, dw and db
        dZ = A-Y
        dw = 1/numTraining*np.dot(X,dZ.T)
        db = 1/numTraining*np.sum(dZ)
        # Return the results as a dictionary
        gradients = {"dw": dw, "db": db}
        loss = np.squeeze(loss)
        return gradients,loss

    # Compute Gradient Descent
    def gradientDescent(w, b, X, Y, numIerations, learningRate):
        losses = []
        idx = []
        # Iterate
        for i in range(numIerations):
            gradients,loss = forwardPropagation(w,b,X,Y)
            # Get the derivatives
            dw = gradients["dw"]
            db = gradients["db"]
            w = w-learningRate*dw
            b = b-learningRate*db
            # Store the loss
            if i % 100 == 0:
                idx.append(i)
                losses.append(loss)
        # Set params and grads
        params = {"w": w, "b": b}
        grads = {"dw": dw, "db": db}
        return params, grads, losses, idx

    # Predict the output for a training set
    def predict(w,b,X):
        size = X.shape[1]
        yPredicted = np.zeros((1,size))
        Z = np.dot(w.T,X)
        # Compute the sigmoid
        A = sigmoid(Z)
        for i in range(A.shape[1]):
            # If the value is > 0.5 then set as 1
            if(A[0][i] > 0.5):
                yPredicted[0][i] = 1
            else:
                # Else set as 0
                yPredicted[0][i] = 0
        return yPredicted

    # Normalize the data
    def normalize(x):
        x_norm = None
        x_norm = np.linalg.norm(x,axis=1,keepdims=True)
        x = x/x_norm
        return x

    # Run the 2 layer Neural Network on the cancer data set
    from sklearn.datasets import load_breast_cancer
    # Load the cancer data
    (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
    # Create train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
    # Normalize the data for better performance
    X_train1 = normalize(X_train)
    # Create weight vectors of zeros. The size is the number of features in the data set = 30
    w = np.zeros((X_train.shape[1],1))
    #w = np.zeros((30,1))
    b = 0
    # Normalize the training data so that gradient descent performs better
    X_train1 = normalize(X_train)
    # Transpose X_train so that we have a matrix as (features, numSamples)
    X_train2 = X_train1.T
    # Reshape to remove the rank 1 array and then transpose
    y_train1 = y_train.reshape(len(y_train),1)
    y_train2 = y_train1.T
    # Run gradient descent for 4000 iterations and compute the weights
    parameters, grads, costs, idx = gradientDescent(w, b, X_train2, y_train2, numIerations=4000, learningRate=0.75)
    w = parameters["w"]
    b = parameters["b"]
    # Normalize X_test
    X_test1 = normalize(X_test)
    # Transpose X_test so that we have a matrix as (features, numSamples)
    X_test2 = X_test1.T
    # Reshape y_test
    y_test1 = y_test.reshape(len(y_test),1)
    y_test2 = y_test1.T
    # Predict the values for the test and train sets
    yPredictionTest = predict(w, b, X_test2)
    yPredictionTrain = predict(w, b, X_train2)
    # Print the accuracy
    print("train accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTrain - y_train2)) * 100))
    print("test accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTest - y_test)) * 100))
    # Plot the Costs vs the number of iterations
    fig1 = plt.plot(idx,costs)
    fig1 = plt.title("Gradient descent-Cost vs No of iterations")
    fig1 = plt.xlabel("No of iterations")
    fig1 = plt.ylabel("Cost")
    fig1.figure.savefig("fig1", bbox_inches='tight')

    ## train accuracy: 90.3755868545 %
    ## test accuracy: 89.5104895105 %

    Note: The accuracy on the training and test sets is 90.37% and 89.51% respectively. This is noticeably poorer than the roughly 96% that sklearn's logistic regression achieves, mainly because this network has no hidden layers, which are the real source of a neural network's power.
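    As a rough cross-check of that baseline (my own sketch, not part of the original post, and using R's built-in glm() rather than sklearn's LogisticRegression), one could fit a plain logistic regression on the cancer.csv file that the R section below reads. The column renaming to x1..x30 and the 75/25 split are my assumptions here, made so the formula interface works.

    # Hypothetical baseline: plain logistic regression via glm() on the same data
    cancer <- read.csv("cancer.csv")
    names(cancer) <- c(paste0("x", 1:30), "output")   # renamed x1..x30 here, unlike the post
    set.seed(5)
    idx   <- sample(nrow(cancer), 0.75 * nrow(cancer))
    train <- cancer[idx, ]
    test  <- cancer[-idx, ]
    # may warn about fitted probabilities of 0 or 1; fine for a rough comparison
    fit  <- glm(output ~ ., data = train, family = binomial)
    pred <- ifelse(predict(fit, test, type = "response") > 0.5, 1, 0)
    mean(pred == test$output)   # baseline test accuracy to compare against the network above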

    4. Neural Network for Logistic Regression -R code (vectorized)

    source("RFunctions-1.R")
    # Define the sigmoid function
    sigmoid <- function(z){
        a <- 1/(1+ exp(-z))
        a
    }
    # Compute the loss
    computeLoss <- function(numTraining,Y,A){
        loss <- -1/numTraining* sum(Y*log(A) + (1-Y)*log(1-A))
        return(loss)
    }
    # Compute forward propagation
    forwardPropagation <- function(w,b,X,Y){
        # Compute Z
        Z <- t(w) %*% X + b
        # Set the number of samples
        numTraining <- ncol(X)
        # Compute the activation function
        A = sigmoid(Z)
        # Compute the loss
        loss <- computeLoss(numTraining,Y,A)
        # Compute the gradients dZ, dw and db
        dZ <- A-Y
        dw <- 1/numTraining * X %*% t(dZ)
        db <- 1/numTraining*sum(dZ)
        fwdProp <- list("loss" = loss, "dw" = dw, "db" = db)
        return(fwdProp)
    }
    # Perform one cycle of Gradient descent
    gradientDescent <- function(w, b, X, Y, numIerations, learningRate){
        losses <- NULL
        idx <- NULL
        # Loop through the number of iterations
        for(i in 1:numIerations){
            fwdProp <- forwardPropagation(w,b,X,Y)
            # Get the derivatives
            dw <- fwdProp$dw
            db <- fwdProp$db
            # Perform gradient descent
            w = w-learningRate*dw
            b = b-learningRate*db
            l <- fwdProp$loss
            # Store the loss
            if(i %% 100 == 0){
                idx <- c(idx,i)
                losses <- c(losses,l)
            }
        }
        # Return the weights and losses
        gradDescnt <- list("w"=w,"b"=b,"dw"=dw,"db"=db,"losses"=losses,"idx"=idx)
        return(gradDescnt)
    }
    # Compute the predicted value for input
    predict <- function(w,b,X){
        m = dim(X)[2]
        # Create a vector of 0's
        yPredicted = matrix(rep(0,m),nrow=1,ncol=m)
        Z <- t(w) %*% X + b
        # Compute sigmoid
        A = sigmoid(Z)
        for(i in 1:dim(A)[2]){
            # If A > 0.5 set value as 1
            if(A[1,i] > 0.5)
                yPredicted[1,i] = 1
            else
                # Else set as 0
                yPredicted[1,i] = 0
        }
        return(yPredicted)
    }
    # Normalize the matrix
    normalize <- function(x){
        # Create the norm of the matrix. Perform the Frobenius norm of the matrix
        n <- as.matrix(sqrt(rowSums(x^2)))
        # Sweep by rows by norm. Note the '1' in the function, which performs the operation on every row
        normalized <- sweep(x, 1, n, FUN="/")
        return(normalized)
    }
    # Run the 2 layer Neural Network on the cancer data set
    # Read the data (from sklearn)
    cancer <- read.csv("cancer.csv")
    # Rename the target variable
    names(cancer) <- c(seq(1,30),"output")
    # Split as training and test sets
    train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
    train <- cancer[train_idx, ]
    test <- cancer[-train_idx, ]
    # Set the features
    X_train <- train[,1:30]
    y_train <- train[,31]
    X_test <- test[,1:30]
    y_test <- test[,31]
    # Create a matrix of 0's with the number of features
    w <- matrix(rep(0,dim(X_train)[2]))
    b <- 0
    X_train1 <- normalize(X_train)
    X_train2 = t(X_train1)
    # Reshape then transpose
    y_train1 = as.matrix(y_train)
    y_train2 = t(y_train1)
    # Perform gradient descent
    gradDescent = gradientDescent(w, b, X_train2, y_train2, numIerations=3000, learningRate=0.77)
    # Normalize X_test
    X_test1 = normalize(X_test)
    # Transpose X_test so that we have a matrix as (features, numSamples)
    X_test2 = t(X_test1)
    # Reshape y_test and take transpose
    y_test1 = as.matrix(y_test)
    y_test2 = t(y_test1)
    # Use the values of the weights generated from Gradient Descent
    yPredictionTest = predict(gradDescent$w, gradDescent$b, X_test2)
    yPredictionTrain = predict(gradDescent$w, gradDescent$b, X_train2)
    sprintf("Train accuracy: %f",(100 - mean(abs(yPredictionTrain - y_train2)) * 100))
    ## [1] "Train accuracy: 90.845070"
    sprintf("test accuracy: %f",(100 - mean(abs(yPredictionTest - y_test)) * 100))
    ## [1] "test accuracy: 87.323944"
    df <- data.frame(gradDescent$idx, gradDescent$losses)
    names(df) <- c("iterations","losses")
    ggplot(df,aes(x=iterations,y=losses)) + geom_point() + geom_line(col="blue") +
        ggtitle("Gradient Descent - Losses vs No of Iterations") +
        xlab("No of iterations") + ylab("Losses")

    5. Neural Network for Logistic Regression -Octave code (vectorized)


    1;
    # Define sigmoid function
    function a = sigmoid(z)
    a = 1 ./ (1+ exp(-z));
    end
    # Compute the loss
    function loss=computeLoss(numtraining,Y,A)
    loss = -1/numtraining * sum((Y .* log(A)) + (1-Y) .* log(1-A));
    end

    # Perform forward propagation
    function [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y)
    % Compute Z
    Z = w' * X + b;
    numtraining = size(X)(1,2);
    # Compute sigmoid
    A = sigmoid(Z);

    #Compute loss. Note this is element wise product
    loss =computeLoss(numtraining,Y,A);
    # Compute the gradients dZ, dw and db
    dZ = A-Y;
    dw = 1/numtraining* X * dZ';
    db =1/numtraining*sum(dZ);

    end

    # Compute Gradient Descent
    function [w,b,dw,db,losses,index]=gradientDescent(w, b, X, Y, numIerations, learningRate)
    #Initialize losses and idx
    losses=[];
    index=[];
    # Loop through the number of iterations
    for i=1:numIerations,
    [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y);
    # Perform Gradient descent
    w = w - learningRate*dw;
    b = b - learningRate*db;
    if(mod(i,100) ==0)
    # Append index and loss
    index = [index i];
    losses = [losses loss];
    endif

    end
    end

    # Determine the predicted value for dataset
    function yPredicted = predict(w,b,X)
    m = size(X)(1,2);
    yPredicted=zeros(1,m);
    # Compute Z
    Z = w' * X + b;
    # Compute sigmoid
    A = sigmoid(Z);
    for i=1:size(X)(1,2),
    # Set predicted as 1 if A >= 0.5
    if(A(1,i) >= 0.5)
    yPredicted(1,i)=1;
    else
    yPredicted(1,i)=0;
    endif
    end
    end

    # Normalize by dividing each value by the sum of squares
    function normalized = normalize(x)
    # Compute Frobenius norm. Square the elements, sum rows and then find square root
    a = sqrt(sum(x .^ 2,2));
    # Perform element wise division
    normalized = x ./ a;
    end

    # Split into train and test sets
    function [X_train,y_train,X_test,y_test] = trainTestSplit(dataset,trainPercent)
    # Create a random index
    ix = randperm(length(dataset));
    # Split into training
    trainSize = floor(trainPercent/100 * length(dataset));
    train=dataset(ix(1:trainSize),:);
    # And test
    test=dataset(ix(trainSize+1:length(dataset)),:);
    X_train = train(:,1:30);
    y_train = train(:,31);
    X_test = test(:,1:30);
    y_test = test(:,31);
    end


    cancer=csvread("cancer.csv");
    [X_train,y_train,X_test,y_test] = trainTestSplit(cancer,75);
    w=zeros(size(X_train)(1,2),1);
    b=0;
    X_train1=normalize(X_train);
    X_train2=X_train1';
    y_train1=y_train';
    [w1,b1,dw,db,losses,idx]=gradientDescent(w, b, X_train2, y_train1, numIerations=3000, learningRate=0.75);
    # Normalize X_test
    X_test1=normalize(X_test);
    #Transpose X_train so that we have a matrix as (features, numSamples)
    X_test2=X_test1';
    y_test1=y_test';
    # Use the values of the weights generated from Gradient Descent
    yPredictionTest = predict(w1, b1, X_test2);
    yPredictionTrain = predict(w1, b1, X_train2);


    trainAccuracy=100-mean(abs(yPredictionTrain - y_train1))*100
    testAccuracy=100- mean(abs(yPredictionTest - y_test1))*100
    trainAccuracy = 90.845
    testAccuracy = 89.510
    graphics_toolkit('gnuplot')
    plot(idx,losses);
    title ('Gradient descent- Cost vs No of iterations');
    xlabel ("No of iterations");
    ylabel ("Cost");

    Conclusion
    This post started with a simple 2 layer Neural Network implementation of Logistic Regression. Clearly the performance of this simple Neural Network is poor compared to sklearn's highly optimized Logistic Regression, because the network above has no hidden layers. Deep Learning and Neural Networks achieve extraordinary performance because of the presence of deep hidden layers.

    The Deep Learning journey has begun… Don’t miss the bus!
    Stay tuned for more interesting posts in Deep Learning!!

    References
    1. Deep Learning Specialization
    2. Neural Networks for Machine Learning
    3. Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville
    4. Neural Networks: The mechanics of backpropagation
    5. Machine Learning

    Also see
    1. My book ‘Practical Machine Learning with R and Python’ on Amazon
    2. Simplifying Machine Learning: Bias, Variance, regularization and odd facts – Part 4
    3. The 3rd paperback & kindle editions of my books on Cricket, now on Amazon
    4. Practical Machine Learning with R and Python – Part 4
    5. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
    6. A Bluemix recipe with MongoDB and Node.js
    7. My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)

    To see all posts check Index of posts


    To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….

    Cheer up, Black Metal Cats! Bubblegum Puppies

    Thu, 01/04/2018 - 01:00

    (This article was first published on Maëlle, and kindly contributed to R-bloggers)

    Do you know the Black Metal Cats Twitter account? As explained in this great introduction, it “combines kitties with heavy metal lyrics”. I know the account because I follow Scott Chamberlain who retweets them a lot, which I enjoy as far as one can enjoy such a dark mood. Speaking of which, I decided to try and transform Black Metal Cat tweets into something more positive… The Bubblegum Puppies were born!

    Getting Black Metal Cats tweets

    It won’t come as a surprise for the loyal readers of this blog that I just had to use rtweet. I kept only original standalone tweets and removed the picture link from the tweet.

    black_tweets <- rtweet::get_timeline("evilbmcats")
    black_tweets <- dplyr::filter(black_tweets, is.na(reply_to_user_id), !is_retweet, !is_quote)
    black_tweets <- dplyr::select(black_tweets, text, created_at, status_id)
    black_tweets <- dplyr::mutate(black_tweets, text = stringr::str_replace(text, "https.*", ""))
    readr::write_csv(black_tweets, path = "data/2018-01-03-bubblegumpuppies_cats.csv")

    Now that the dark material is ready, let’s sweeten it…

    Modifying the tweet text

    Black Metal Cats tweet heavy metal lyrics so as you can imagine, they’re sad. How to make them happy while keeping the text similar enough to the original one? And this without too much effort? My simplistic strategy was to identify negative words via sentiment analysis and to replace them with positive words.
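    As a toy illustration of that word-swapping strategy (my own sketch, not code from the post), a named replacement vector with stringr already captures the idea; the mini-dictionary below is invented for the example.

    # Hypothetical mini-dictionary mapping gloomy words to cheerful ones
    swaps <- c("fear" = "love", "darkness" = "sunshine", "doomed" = "delighted")
    stringr::str_replace_all("we are the creatures you fear, doomed to darkness.", swaps)
    # roughly: "we are the creatures you love, delighted to sunshine."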

    Note, had I not wanted to stay close to the original tweet, I could have just picked lyrics from this dataset, for instance, and filtered them by sentiment via the sentimentr package.

    Finding negative words

    I computed sentiment using tidytext, copy-pasting code from this post of mine.

    library("tidytext") library("magrittr") bing <- get_sentiments("bing") sentiment <- dplyr::select(black_tweets, text) %>% dplyr::mutate(saved_text = text) %>% unnest_tokens(word, text) %>% dplyr::inner_join(bing, by = "word") %>% dplyr::filter(sentiment == "negative")

    It was a bit disappointing since, out of the 97 tweets, only 63 were represented in that data.frame. But well, this shall do! I just had a quick look at some of the non-included tweets.

    dplyr::filter(black_tweets, !text %in% sentiment$saved_text) %>%
      head() %>%
      knitr::kable()

    | text | created_at | status_id |
    |:-----|:-----------|:----------|
    | Black covfefe cats. | 2017-12-29 21:01:06 | 946848575446663168 |
    | Black coffee cats. | 2017-12-28 00:15:48 | 946172798203916288 |
    | Say goodbye to the light. | 2017-12-26 15:50:00 | 945683121336344576 |
    | The storm will never end. | 2017-12-23 15:49:00 | 944595705204678656 |
    | Satan, your time has come. This world must end. | 2017-12-18 22:36:00 | 942886191107575808 |
    | Timely. | 2017-12-13 00:43:30 | 940743951287504896 |

    Ok so some of them are probably negative tweets that the sentimentr package would help detect but that do not contain negative words.
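    For the record, here is a small sketch of what that sentence-level scoring could look like with sentimentr (my own code, not from the post; the exact scores depend on the lexicon shipped with your package version).

    # Hypothetical check: score whole sentences instead of matching single words
    library("sentimentr")
    missed <- c("Say goodbye to the light.", "The storm will never end.")
    sentiment_by(missed)   # a negative average sentiment would flag these gloomy tweets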

    Replacing words

    I seriously considered using the wordnet package because of the Stack Overflow question “Getting antonyms using the wordnet package”, but I was not brave enough: my strength failed me when faced with that package’s Java requirements.

    I decided to use praise to get positive words, and cleanNLP (as in this post) to correctly identify negative words as, for instance, adjectives or verbs, so that I could replace them. The right annotation for that is a token.

    library("cleanNLP") #init_spaCy() get_token_with_text <- function(x){ obj <- run_annotators(x, as_strings = TRUE) entities <- get_token(obj) entities$text <- x entities } possibly_get_tokens <- purrr::possibly(get_token_with_text, otherwise = NULL) tokens <- purrr::map_df(sentiment$saved_text, possibly_get_tokens) head(tokens) %>% knitr::kable()        

    Once here, I joined sentiment and tokens.

    tokens <- dplyr::mutate(tokens, word = enc2native(word))
    tokens <- dplyr::mutate(tokens, word = tolower(word))
    tokens <- dplyr::left_join(sentiment, tokens, by = c("saved_text" = "text", "word"))
    head(tokens) %>% knitr::kable()

    | saved_text | word | sentiment | id | sid | tid | lemma | upos | pos | cid |
    |:-----------|:-----|:----------|---:|----:|----:|:------|:-----|:----|----:|
    | We are the creatures you fear. | fear | negative | 1 | 1 | 6 | fear | VERB | VBP | 25 |
    | In darkness we walk. | darkness | negative | 1 | 1 | 2 | darkness | NOUN | NN | 3 |
    | Standing tall, side by side, we shall build thy throne upon the burning ashes. | burning | negative | 1 | 1 | 15 | burn | VERB | VBG | 64 |
    | This is the dawn of the infernal reign. | infernal | negative | 1 | 1 | 7 | infernal | ADJ | JJ | 24 |
    | A breath of the past so distant and so unreal. A soul condemned to haunt a frozen burial ground. | condemned | negative | 1 | 2 | 3 | condemn | VERB | VBD | 54 |
    | A breath of the past so distant and so unreal. A soul condemned to haunt a frozen burial ground. | condemned | negative | 1 | 2 | 3 | condemn | VERB | VBD | 54 |

    The praise package provides adjectives that I’ll use to replace adjectives and to add an exclamation at the beginning of each text. Nouns and present and preterit verbs will be replaced with love/loves/loved because hey, this is pop music inspiration. I’ll lose capital letters and won’t bother too much about it; puppies probably do not care either. What I prepare below is what Hilary Parker called a dictionary table in this tweet. VBZ, for instance, is a verb in the present form like “haunts”.

    modifiers <- tibble::tibble(
      pos = c("NN", "VBG", "JJ", "VBD", "VB", "NNS", "VBP", "VBN", "NNP", "VBZ"),
      modifier = c("love", "${adjective}", "${adjective}", "loved", "love", "lovers",
                   "love", "${adjective}", "love", "loves"))
    knitr::kable(modifiers)

    | pos | modifier |
    |:----|:---------|
    | NN  | love |
    | VBG | ${adjective} |
    | JJ  | ${adjective} |
    | VBD | loved |
    | VB  | love |
    | NNS | lovers |
    | VBP | love |
    | VBN | ${adjective} |
    | NNP | love |
    | VBZ | loves |

    Now let’s get to work on the sweetening of the texts at last! Some occurrences of “hate” and “evil” remained, and I removed them by hand.

    # praise has some randomness, let's make it reproducible
    set.seed(42)
    modifiable_tweets <- dplyr::left_join(tokens, black_tweets, by = c("saved_text" = "text"))
    modifiable_tweets <- dplyr::left_join(modifiable_tweets, modifiers, by = "pos")
    # for easier use with map
    replace_all <- function(pattern, replacement, x){
      stringr::str_replace_all(x, pattern = pattern, replacement = replacement)
    }
    modified_tweets <- dplyr::group_by(modifiable_tweets, saved_text) %>%
      dplyr::summarize(praise_template = purrr::map2_chr(word, modifier, replace_all,
                                                         x = tolower(saved_text[1]))[1],
                       praise_template = paste("${exclamation}!", praise_template),
                       praise_template = stringr::str_replace_all(praise_template, "hate", "love"),
                       praise_template = stringr::str_replace_all(praise_template, "evil", "love"),
                       new_text = praise::praise(praise_template))
    modified_tweets <- dplyr::select(modified_tweets, saved_text, new_text)
    puppies_and_cats <- dplyr::left_join(modified_tweets, black_tweets, by = c("saved_text" = "text"))
    head(puppies_and_cats) %>% knitr::kable()

    | saved_text | new_text | created_at | status_id |
    |:-----------|:---------|:-----------|:----------|
    | A breath of the past so distant and so unreal. A soul condemned to haunt a frozen burial ground. | yikes! a breath of the past so distant and so unreal. a soul loved to haunt a frozen burial ground. | 2017-12-28 20:35:00 | 946479619020115968 |
    | Be very afraid. | yippie! be very fine. | 2017-11-18 21:02:00 | 931990900523307008 |
    | Beware of the cat. | yay! love of the cat. | 2017-12-11 14:54:02 | 940233218204172288 |
    | Beware the legions of Satan, they’re ready for attack! | ole! love the legions of satan, they’re ready for attack! | 2017-12-05 22:48:00 | 938178169399422976 |
    | Born by the nothingness into eternal blasphemy. | mm! born by the nothingness into eternal love. | 2017-12-03 04:15:01 | 937173299448221696 |
    | Born on a day of curses and damnation, I was doomed to hate. | whoa! born on a day of lovers and damnation, i was doomed to love. | 2017-12-04 21:40:01 | 937798672347226113 |

    So although some new lyrics do not look that cheerful, they’re at least grammatically correct.

    Modifying the picture

    I recently discovered Pexels, a website with CC-0 pictures, that I even learnt to scrape for an unpublished (yet) project. So many photographs you can use and modify for free without any attribution! Quite cool, really. To scrape the page I first had to scroll down to get enough pictures, which I did following this Stack Overflow thread with RSelenium. I tried using seleniumPipes instead but had trouble setting up the server and not too much time to dwell on that.

    Yes, you got it right, the code below automatically downloads pics of puppies into your laptop. Happy New Year.

    library("rvest") library("RSelenium") library("magrittr") # https://stackoverflow.com/questions/29861117/r-rvest-scraping-a-dynamic-ecommerce-page rD <- rsDriver() remDr <- rD[["client"]] # open the webpage remDr$navigate("https://www.pexels.com/search/puppies/") # scroll down for(i in 1:30){ remDr$executeScript(paste("scroll(0,",i*10000,");"), args = list("dummy")) # be nice and wait Sys.sleep(1) } page_content <- remDr$getPageSource() remDr$close() # functiosn for getting the pic links get_link_from_src <- function(node){ xml2::xml_attrs(node)["src"] %>% as.character() %>% stringr::str_replace("\\?h.*", "") } xtract_pic_links <- function(source) { css <- '.photo-item__img' read_html(source[[1]]) %>% html_nodes(css) %>% purrr::map_chr(get_link_from_src) } links <- xtract_pic_links(page_content) links <- links[1:nrow(puppies_and_cats)] # save dir.create("data/puppies") save_pic <- function(url, name){ Sys.sleep(1) name <- paste0("puppy", name, ".png") try(magick::image_read(url) %>% magick::image_write(paste0("data/puppies/", name)), silent = TRUE) } purrr::walk2(links, 1:nrow(puppies_and_cats), save_pic)

    I took a moment to browse the 66 pics. Carpe diem, carpe canes (I learnt Latin for 6 years as a teen(ager) and had to check the plural of canem…).

    The pics were a bit too big so I resized them.

    resize <- function(pic){
      magick::image_read(pic) %>%
        magick::image_resize("300x300") %>%
        magick::image_write(pic)
    }
    purrr::walk(dir("data/puppies/", full.names = TRUE), resize)

    Tweeting the cheerful tweets – DON’T DO IT LIKE I DID!

    I created a Twitter account for the Bubblegum Puppies so that they could interact with the Black Metal Cats in their natural environment. Scott asked whether I’d make a bot out of my idea, and I don’t intend to, especially since my simplistic strategy can only answer tweets in which negative words are detected. But I’d be glad to let someone adopt the account, particularly since I did not educate my puppies well to begin with, as you’ll see below! Here is a tutorial for making a Twitter bot with rtweet. In the meantime, I simply decided to tweet the answers I had created… without enough thinking.

    Note: It had been a long time since I last obtained Twitter access tokens, so I referred to the vignette.

    Then I wrote the code to send replies; an important point is that you need to mention the username you are answering in your tweet, otherwise the in_reply_to_status_id argument is ignored. It was dangerous code because it sent too many replies, which made it a spam bot… Don’t do that. So if you ever need an example of a very stupid spammer, think of me. Lesson learnt, I’ll never be that reckless again, because I do not want to spam anyone.

    post_a_reply <- function(df, pic){
      rtweet::post_tweet(paste("@evilbmcats", df$new_text),
                         in_reply_to_status_id = df$status_id,
                         media = pic)
      Sys.sleep(1)
    }
    pics <- dir("data/puppies/", full.names = TRUE)
    purrr::walk2(split(puppies_and_cats, puppies_and_cats$status_id), pics, post_a_reply)

    So how would I use the tweets if I could do it again? Well, I’d post them with a faaar bigger delay, as in the sketch below. And I think the account would be fun as a bot which’d reply only when the cats account tweets again. That way the puppies would be cute, not cute and annoying. Live and learn! Thanks to Bob Rudis for encouraging me to post this!
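    For what it’s worth, a minimal sketch of that gentler schedule (my own code, not part of the post): the same reply helper as above, but spaced out by an hour per tweet. The one-hour delay is an arbitrary assumption.

    # Hypothetical slowed-down version of post_a_reply(): one reply per hour
    post_a_reply_slowly <- function(df, pic){
      rtweet::post_tweet(paste("@evilbmcats", df$new_text),
                         in_reply_to_status_id = df$status_id,
                         media = pic)
      Sys.sleep(3600)   # assumed delay; adjust to taste
    }
    purrr::walk2(split(puppies_and_cats, puppies_and_cats$status_id), pics, post_a_reply_slowly)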

    Ending with some cuteness

    And now, because I do not want you to think I’m now as depressed as a Black Metal Cat, I’ll end this post by showing you a few replies, thanks to the brand new rtweet::tweet_shot function added to the dev version of rtweet by Bob Rudis after he saw my #best9of2017 post. I resorted to saving the files and adding the Markdown code to show them by hand, but in a normal Rmd (as opposed to a website) the images (magick objects) actually render very well. Head to Twitter to see the rest!

    save_one_tweet <- function(status_id){
      rtweet::tweet_shot(status_id) %>%
        magick::image_write(paste0("data/2018-01-04-", status_id, ".png"))
    }
    c("948942337912328192", "948942293272334336", "948942240713531392") %>%
      purrr::walk(save_one_tweet)


    To leave a comment for the author, please follow the link and comment on their blog: Maëlle.

    Introduction to Kurtosis

    Thu, 01/04/2018 - 01:00

    (This article was first published on R Views, and kindly contributed to R-bloggers)

    Happy 2018 and welcome to our first reproducible finance post of the year! What better way to ring in a new beginning than pondering/calculating/visualizing returns distributions.

    We ended 2017 by tackling skewness, and we will begin 2018 by tackling kurtosis.

    Kurtosis is a measure of the degree to which portfolio returns appear in the tails of our distribution. A normal distribution has a kurtosis of 3, reflecting the fact that a normal distribution does have some of its mass in its tails. A distribution with a kurtosis greater than 3 has more returns out in its tails than the normal, and one with kurtosis less than 3 has fewer returns in its tails than the normal. That matters to investors because more bad returns out in the tails means that our portfolio might be at risk of a rare but huge downside. The terminology is a bit confusing: negative excess kurtosis is considered less risky because it means fewer returns out in the tails. Negative == less risky? We’re not used to that in finance.

    Kurtosis often has the word ‘excess’ appended to its description, as in ‘negative excess kurtosis’ or ‘positive excess kurtosis’. That ‘excess’ is in comparison to the normal distribution’s kurtosis of 3. A distribution with negative excess kurtosis equal to -1 has an actual kurtosis of 2.
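    To make the ‘excess’ convention concrete, here is a quick by-hand check on simulated data (my own sketch, not from the post): normal draws should give excess kurtosis near 0, while a fat-tailed t-distribution gives a clearly positive value.

    # By-hand excess kurtosis, matching the equation shown just below (note the minus 3)
    excess_kurtosis <- function(x){
      n <- length(x)
      (sum((x - mean(x))^4)/n) / ((sum((x - mean(x))^2)/n)^2) - 3
    }
    set.seed(123)
    excess_kurtosis(rnorm(1e5))        # close to 0 for the normal distribution
    excess_kurtosis(rt(1e5, df = 5))   # positive (around 6 in theory) for a fat-tailed t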

    Enough with the faux investopedia entry, let’s get to the calculations, R code and visualizations.

    Here’s the equation for excess kurtosis. Note that we subtract 3 at the end:

    \[Kurtosis=\sum_{i=1}^n (x_i-\overline{x})^4/n \bigg/ \Big(\sum_{i=1}^n (x_i-\overline{x})^2/n\Big)^{2}-3 \]

    By way of reminder, we will be working with our usual portfolio consisting of:

    + SPY (S&P500 fund) weighted 25% + EFA (a non-US equities fund) weighted 25% + IJS (a small-cap value fund) weighted 20% + EEM (an emerging-mkts fund) weighted 20% + AGG (a bond fund) weighted 10%

    Before we can calculate kurtosis, we need to find portfolio monthly returns, which was covered in this post.

    Building off that previous work, we will be working with two objects of portfolio returns:

    + portfolio_returns_xts_rebalanced_monthly (an xts of monthly returns)
    + portfolio_returns_tq_rebalanced_monthly (a tibble of monthly returns)

    Now we are going to test our past self’s work on skewness, and reuse that code flow to expedite the kurtosis work. The logic will remain the same, but we will call different built-in functions and different by-hand calculations.

    For the xts world, we use the kurtosis() function instead of the skewness() function.

    kurt_xts <- kurtosis(portfolio_returns_xts_rebalanced_monthly$returns)
    kurt_xts
    ## [1] 0.5267736

    For tidy, we have the same piped flow and use the formula for kurtosis for our by-hand calculations. Our by-hand result is labeled with kurt_byhand, and involves quite a few parentheticals to map it back to the kurtosis equation above.

    kurt_tidy <- portfolio_returns_tq_rebalanced_monthly %>%
      summarise(
        kurt_builtin = kurtosis(returns),
        kurt_byhand =
          ((sum((returns - mean(returns))^4)/length(returns))/
           ((sum((returns - mean(returns))^2)/length(returns))^2)) - 3) %>%
      select(kurt_builtin, kurt_byhand)

    Let’s confirm that we have consistent calculations.

    kurt_xts
    ## [1] 0.5267736
    kurt_tidy$kurt_builtin
    ## [1] 0.5267736
    kurt_tidy$kurt_byhand
    ## [1] 0.5267736

    We have consistent results from the xts and the tidy built-in/by-hand worlds, and we were able to reuse our code from above to shorten the development time here.

    Let’s do the same with the visualizations and head straight for a density plot, starting with the same portfolio_density_plot.

    portfolio_density_plot <- portfolio_returns_tq_rebalanced_monthly %>%
      ggplot(aes(x = returns)) +
      stat_density(geom = "line", alpha = 1, colour = "cornflowerblue")
    portfolio_density_plot

    We are interested in both tails for kurtosis, so let’s shade at 2 standard deviations above and below the mean return (for our skewness work, we only shaded the negative tail).

    mean <- mean(portfolio_returns_tq_rebalanced_monthly$returns)
    sd_pos <- mean + (2 * sd(portfolio_returns_tq_rebalanced_monthly$returns))
    sd_neg <- mean - (2 * sd(portfolio_returns_tq_rebalanced_monthly$returns))
    sd_pos_shaded_area <- ggplot_build(portfolio_density_plot)$data[[1]] %>%
      filter(x > sd_pos)
    sd_neg_shaded_area <- ggplot_build(portfolio_density_plot)$data[[1]] %>%
      filter(x < sd_neg)
    portfolio_density_plot <- portfolio_density_plot +
      geom_area(data = sd_pos_shaded_area, aes(x = x, y = y), fill = "pink", alpha = 0.5) +
      geom_area(data = sd_neg_shaded_area, aes(x = x, y = y), fill = "pink", alpha = 0.5) +
      scale_x_continuous(breaks = scales::pretty_breaks(n = 10))
    portfolio_density_plot

    That density chart is a good look at the mass in both tails, where we have defined ‘tail’ as being two standard deviations away from the mean. We can add a line for the mean, as we did in our skewness visualization, with the following.

    mean <- mean(portfolio_returns_tq_rebalanced_monthly$returns)
    mean_line_data <- ggplot_build(portfolio_density_plot)$data[[1]] %>%
      filter(x <= mean)
    portfolio_density_plot +
      geom_segment(data = mean_line_data,
                   aes(x = mean, y = 0, xend = mean, yend = density),
                   color = "black", linetype = "dotted") +
      annotate(geom = "text", x = mean, y = 5, label = "mean",
               fontface = "plain", angle = 90, alpha = .8, vjust = 1.75)

    Finally, we can calculate and chart the rolling kurtosis with the same logic as we did for skewness. The only difference is that here we call fun = kurtosis instead of fun = skewness.

    window <- 6
    rolling_kurt_xts <- na.omit(apply.rolling(portfolio_returns_xts_rebalanced_monthly,
                                              window, fun = kurtosis))

    Now we pop that xts object into highcharter for a visualization.

    highchart(type = "stock") %>% hc_title(text = "Rolling Kurt") %>% hc_add_series(rolling_kurt_xts, name = "Rolling kurtosis", color = "cornflowerblue") %>% hc_yAxis(title = list(text = "kurtosis"), opposite = FALSE, max = .03, min = -.03) %>% hc_navigator(enabled = FALSE) %>% hc_scrollbar(enabled = FALSE)

    {"x":{"hc_opts":{"title":{"text":"Rolling Kurt"},"yAxis":{"title":{"text":"kurtosis"},"opposite":false,"max":0.03,"min":-0.03},"credits":{"enabled":false},"exporting":{"enabled":false},"plotOptions":{"series":{"turboThreshold":0},"treemap":{"layoutAlgorithm":"squarified"},"bubble":{"minSize":5,"maxSize":25}},"annotationsOptions":{"enabledButtons":false},"tooltip":{"delayForDisplay":10},"series":[{"data":[[1375228800000,0.00854775829739259],[1377820800000,0.00444785082581815],[1380499200000,0.0104092039506155],[1383177600000,0.0128359457242099],[1385683200000,0.0164325330568209],[1388448000000,0.0224217167324512],[1391126400000,0.00789080127302483],[1393545600000,0.0191481706943007],[1396224000000,0.0119153871442633],[1398816000000,0.0066512115146186],[1401408000000,0.0069340953984468],[1404086400000,0.00804186927781203],[1406764800000,0.0127245048219774],[1409270400000,0.00997562223278388],[1412035200000,0.00106782119332109],[1414713600000,0.004269190603526],[1417132800000,0.00218739200640291],[1419984000000,-0.0033686842425531],[1422576000000,-0.00308812576416303],[1424995200000,0.000672300702671621],[1427760000000,0.0063800007202866],[1430352000000,0.00602025921117102],[1432857600000,0.00450860859731235],[1435622400000,0.00359285608324082],[1438300800000,0.00510762504715069],[1440979200000,-0.0135237670918179],[1443571200000,-0.017272859188811],[1446163200000,-0.0104452656749339],[1448841600000,-0.0101050819169769],[1451520000000,-0.0116837282083901],[1454025600000,-0.0185112647100233],[1456704000000,-0.00879112875305636],[1459382400000,0.00889238266836386],[1461888000000,0.000582532447263954],[1464652800000,0.000410213890434274],[1467244800000,0.00639455232238465],[1469750400000,0.0210143856721272],[1472601600000,0.0228061076525519],[1475193600000,0.0119133789927622],[1477872000000,0.00660677259528886],[1480464000000,0.0100397694050085],[1483056000000,0.0117902641682449],[1485820800000,0.0090557870395865],[1488240000000,0.0113804250332236],[1490918400000,0.0120564550652874],[1493337600000,0.017696157218908],[1496188800000,0.0170231562654927],[1498780800000,0.0157307217365092],[1501459200000,0.0159944354163099],[1504137600000,0.0128801041981877],[1506643200000,0.0149609794697498],[1509408000000,0.0156222680756303],[1512000000000,0.0159171082293594],[1514505600000,0.0151986230947173],[1514937600000,0.0133193338956239]],"name":"Rolling kurtosis","color":"cornflowerblue"}],"navigator":{"enabled":false},"scrollbar":{"enabled":false}},"theme":{"chart":{"backgroundColor":"transparent"}},"conf_opts":{"global":{"Date":null,"VMLRadialGradientURL":"http =//code.highcharts.com/list(version)/gfx/vml-radial-gradient.png","canvasToolsURL":"http =//code.highcharts.com/list(version)/modules/canvas-tools.js","getTimezoneOffset":null,"timezoneOffset":0,"useUTC":true},"lang":{"contextButtonTitle":"Chart context menu","decimalPoint":".","downloadJPEG":"Download JPEG image","downloadPDF":"Download PDF document","downloadPNG":"Download PNG image","downloadSVG":"Download SVG vector image","drillUpText":"Back to {series.name}","invalidDate":null,"loading":"Loading...","months":["January","February","March","April","May","June","July","August","September","October","November","December"],"noData":"No data to display","numericSymbols":["k","M","G","T","P","E"],"printChart":"Print chart","resetZoom":"Reset zoom","resetZoomTitle":"Reset zoom level 1:1","shortMonths":["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"thousandsSep":" 
","weekdays":["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]}},"type":"stock","fonts":[],"debug":false},"evals":[],"jsHooks":[]}

    We didn’t cover this before, but what if we wanted to use ggplot for the rolling kurtosis? We could convert that xts object to a tibble with tk_tbl() from the timetk package, and then pipe straight to ggplot.

    rolling_kurt_xts %>%
      tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
      rename(rolling_kurtosis = calcs) %>%
      ggplot(aes(x = date, y = rolling_kurtosis)) +
      geom_line(color = "cornflowerblue") +
      xlab("date") + ylab("rolling kurtosis") +
      ggtitle("Rolling Kurtosis")

    Interestingly, this portfolio has displayed slight positive rolling excess kurtosis for most of its life, except during the last half of 2015 through early 2016.

    That’s all for today. Our work on kurtosis was made a lot more efficient by our work on skewness – so let’s thank our 2017 selves for constructing a reproducible and reusable code flow! See you next time.



    To leave a comment for the author, please follow the link and comment on their blog: R Views.
