prrd 0.0.1: Parallel Running [of] Reverse Depends
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
A new package is now on the ghrr drat. It was uploaded to CRAN four days ago but still lingers in the inspect/ state, along with a growing number of other packages. As some new packages have come through, I am sure it will get processed eventually; in the meantime I figured I may as well make it available this way.
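Until it clears CRAN's queue, installing from the drat is a short sketch like the following (assuming the drat package is already installed; "ghrr" is the repo name used on the drat's page):

```r
# install prrd from the ghrr drat while it awaits CRAN processing
# (assumes the 'drat' package is available)
drat::addRepo("ghrr")      # register the ghrr drat repository
install.packages("prrd")   # then install as usual
```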
The idea of prrd is simple, and described in some more detail on its webpage and its GitHub repo. Reverse dependency checks are an important part of package development (provided you care about not breaking other packages, as CRAN asks you to), and are easily done in a (serial) loop. But these checks are also generally embarrassingly parallel, as there is little or no interdependency between them (besides perhaps shared build dependencies).
So this package uses the liteq package by Gabor Csardi to set up all tests to run as tasks in a queue. This permits multiple independent queue runners to each work on one task at a time. Results are written back and summarized.
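As a rough sketch of the queue idea (this is not prrd's actual internals; the queue name, database path, and package names below are made up), liteq lets one process publish tasks and any number of workers consume them:

```r
library(liteq)

# a SQLite-backed queue that all workers share (names are made up)
q <- ensure_queue("revdep_checks", db = file.path(tempdir(), "prrd.db"))

# producer: enqueue one task per reverse dependency
for (pkg in c("pkgA", "pkgB", "pkgC")) {
  publish(q, title = pkg, message = paste("check", pkg))
}

# worker loop: each independent runner pulls the next available task
msg <- try_consume(q)
while (!is.null(msg)) {
  cat("checking", msg$title, "\n")  # a real runner would run R CMD check here
  ack(msg)                          # mark the task as done
  msg <- try_consume(q)
}
```

Because the queue lives in a shared database, starting six such worker processes gives the parallelism shown in the screenshot below.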
This already works pretty well, as evidenced by the following screenshot (running six parallel workers, arranged in a split byobu session).
See the aforementioned webpage and its repo for more details, and by all means give it a whirl.
For more questions or comments use the issue tracker off the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in third-party for-profit settings.
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Discontinuing maintenance of jug
(This article was first published on FishyOperations, and kindly contributed to R-bloggers)
I’m starting off 2018 by dropping my involvement with jug. The reason is that I simply cannot find the time or interest anymore to maintain the code.
All of the open-source efforts I started had to do with a concrete need of mine in one way or another. Once that need is fulfilled by creating a solution for it, the next challenge becomes maintaining that solution.
In the best-case scenario, someone with the right set of skills continues the development into a mature project that becomes significantly better than the original solution. An example close to my heart is simmer (see http://r-simmer.org/), where Iñaki Ucar has taken over the development and has improved the original enormously.
However, for jug the community of active developers is currently too small to do a handover. On top of that, jug’s main competitor, plumber (see website), has been able to build a significant community with highly active support.
Today I have to sort of admit defeat. My own time availability as co-founder of dataroots and my changing interests unfortunately no longer allow me to put in the time and effort required to do the necessary maintenance to do justice to the jug user base. Therefore I will be discontinuing the maintenance of jug. If anyone is willing to take over, they will be more than welcome.
Undemocratic Democracies
(This article was first published on R – DiffusePrioR, and kindly contributed to R-bloggers)
I’ve always thought that it was ironic that most countries with the word “Democratic” in their name are exceptionally undemocratic in reality. So I was very interested in the following post on Reddit this week showing exactly this (https://www.reddit.com/r/dataisbeautiful/comments/7nkyek/countries_with_the_word_democratic_in_their/).
However, there are only 8 countries that have “democratic” in their name: the People’s Democratic Republic of Algeria, the Democratic Republic of the Congo, the Democratic Republic of Timor-Leste, the Federal Democratic Republic of Ethiopia, the Lao People’s Democratic Republic, the Democratic People’s Republic of Korea, the Federal Democratic Republic of Nepal, and the Democratic Socialist Republic of Sri Lanka. This, and the fact that I am back teaching next week, got me thinking that this might be a nice example to show how two-sample t-tests work.
The 8 “democratic” countries have an overall democracy index score of 3.89 with a sample standard deviation of 2.174942. This contrasts with a mean and standard deviation of 5.602013 and 2.172326 for the remaining 159 countries in the Economist Intelligence Unit’s democracy index (https://en.wikipedia.org/wiki/Democracy_Index).
Let’s conduct a two-sample t-test that assumes equal variances to test whether this difference in means is statistically significant. The null hypothesis is that both population means are equal. This is a two-tailed test. The test statistic follows Gosset’s t-distribution with n1+n2-2 degrees of freedom (159+8-2=165). The test statistic is calculated as follows (formula here: https://en.wikipedia.org/wiki/Student%27s_t-test):
s = sqrt(((158*2.172326^2)+(7*2.174942^2))/(159+8-2))
t = (5.602013-3.89)/(s*(sqrt(1/159+1/8))) = 2.1749
which allows us to reject at the conventional alpha of 0.05. The p-value is ~0.03, meaning we would not be able to reject at the 1% level. Interestingly, if you drop the equal-variance assumption, you can no longer reject the null at the 5% level.
Hopefully, this example will be of interest to people teaching stats and econometrics for undergrads!
rm(list = ls())

other <- c(9.93, 9.5, 9.39, 9.26, 9.2, 9.15, 9.15, 9.09, 9.03, 9.01, 8.81, 8.8,
           8.63, 8.41, 8.39, 8.36, 8.3, 8.28, 8.17, 7.99, 7.98, 7.98, 7.94, 7.92,
           7.92, 7.88, 7.87, 7.86, 7.85, 7.85, 7.82, 7.81, 7.79, 7.78, 7.77, 7.65,
           7.51, 7.47, 7.41, 7.39, 7.31, 7.29, 7.23, 7.13, 7.1, 7.01, 6.97, 6.96,
           6.94, 6.9, 6.83, 6.77, 6.75, 6.75, 6.72, 6.67, 6.67, 6.65, 6.64, 6.62,
           6.62, 6.59, 6.57, 6.54, 6.47, 6.42, 6.4, 6.38, 6.31, 6.27, 6.25, 6.21,
           6.03, 6.01, 5.99, 5.93, 5.92, 5.92, 5.91, 5.81, 5.76, 5.73, 5.72, 5.7,
           5.7, 5.67, 5.64, 5.63, 5.55, 5.33, 5.31, 5.26, 5.23, 5.07, 5.04, 4.93,
           4.93, 4.92, 4.87, 4.86, 4.81, 4.77, 4.7, 4.68, 4.55, 4.5, 4.49, 4.33,
           4.27, 4.2, 4.08, 4.02, 4.02, 3.96, 3.96, 3.96, 3.88, 3.85, 3.81, 3.74,
           3.71, 3.54, 3.46, 3.46, 3.4, 3.38, 3.32, 3.31, 3.24, 3.18, 3.14, 3.14,
           3.07, 3.06, 3.05, 3.04, 3.03, 2.91, 2.91, 2.83, 2.79, 2.75, 2.65, 2.55,
           2.4, 2.37, 2.37, 2.34, 2.25, 2.06, 1.98, 1.95, 1.93, 1.89, 1.83, 1.7,
           1.61, 1.5, 1.43)
demo <- c(7.24, 6.48, 4.86, 3.6, 3.56, 2.37, 1.93, 1.08)

mean(other) ; sd(other) ; length(other)
mean(demo) ; sd(demo) ; length(demo)

t.test(other, demo, var.equal = TRUE)
s <- sqrt(((158*2.172326^2)+(7*2.174942^2))/(159+8-2))
t <- (5.602013-3.89)/(s*(sqrt(1/159+1/8)))
t.test(other, demo, var.equal = FALSE)

library(ggplot2)
data1 <- data.frame(Score = other, Name = "No")
data2 <- data.frame(Score = demo, Name = "Yes")
data <- rbind(data1, data2)
ggplot(data, aes(Name, Score)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw() +
  xlab("Democratic in Name") +
  ylab("Democracy Score")
Rainbowing a set of pictures
(This article was first published on Maëlle, and kindly contributed to R-bloggers)
I’ve now done a few collages from R using magick: the faces of #rstats Twitter, We R-Ladies with Lucy D’Agostino McGowan, and a holiday card for R-Ladies. The faces of #rstats Twitter and holiday card collages were arranged at random, while the We R-Ladies one was a mosaic forming the R-Ladies logo. I got the idea to up my collage skills by trying to learn how to arrange pics by their main colour, like a rainbow. The verb rainbow doesn’t exist, and “rainbowing” doesn’t mean ordering by colour, but I didn’t let this stop me.
It was the occasion to grab some useful knowledge about colours, not useless for someone who did not even know about Pantone’s Colors of the Year a few weeks ago…
This post has nothing to do with Kesha’s new album. However, you can listen to it while reading since it’s so good, but maybe switch to something older from her when I use “$”.
Getting some pics to play with
The first pictures I tried to arrange were all the pictures ever posted by R-Ladies local chapters on their Twitter accounts. While it was fun to grab them all, it was not very interesting to play with them, as so many of them were pictures of screens. I therefore grabbed “nature” pictures from Pexels using the same method [as when creating the Bubblegum Puppies], following this Stack Overflow thread. I chose “nature” as a keyword because 1) it led to many hits and 2) it offered a good variety of colours.
library("rvest")
library("RSelenium")
library("magrittr")

rD <- rsDriver()
remDr <- rD[["client"]]
# open the webpage
remDr$navigate("https://www.pexels.com/search/nature/")
# scroll down
for(i in 1:130){
  remDr$executeScript(paste("scroll(0,", i*10000, ");"), args = list("dummy"))
  # be nice and wait
  Sys.sleep(1)
}
# https://www.pexels.com/faq/
page_content <- remDr$getPageSource()
remDr$close()

get_link_from_src <- function(node){
  xml2::xml_attrs(node)["src"] %>%
    as.character() %>%
    stringr::str_replace("\\?h.*", "")
}

xtract_pic_links <- function(source){
  css <- '.photo-item__img'
  read_html(source[[1]]) %>%
    html_nodes(css) %>%
    purrr::map_chr(get_link_from_src)
}

links <- xtract_pic_links(page_content)
links <- links[1:1400]

# save
dir.create("nature")
save_pic <- function(url){
  Sys.sleep(1)
  name <- stringr::str_replace(url, ".*\\/", "")
  try(magick::image_read(url) %>%
        magick::image_write(paste0("nature/", name)),
      silent = TRUE)
}
purrr::walk(links, save_pic)

Extracting the main colour and making pics size-compatible
In the following code, I extracted the main colour from each pic using Russel Dinnage’s method as presented in this blog post from David Smith. Before that I had to install two packages from GitHub, rblocks and rPlotter.
This code also serves another role: since I wanted to paste pics together at some point, I decided to make them all of the same dimensions by adding a border with magick. I had learnt how to do that when preparing the R-Ladies Global holiday card, but this time, instead of using the same colour every time (R-Ladies’ official purple), I used the main colour I’d just extracted. The most important points for making a picture a square are knowing that magick::image_info gives you the height and width of an image… and somehow understanding geometry, which was embarrassingly a hurdle when I did that.
The code to extract colours didn’t work in a few cases, which I did not investigate much: I had downloaded more pics than I needed because I had experienced the issue before when working with R-Ladies meetup pics, and had seen it was caused by seemingly bicolour pics.
dir.create("formatted_pics")

format_image <- function(path){
  image <- magick::image_read(path)
  info <- magick::image_info(image)
  # find in which direction I need to add pixels
  # to make this a square
  direction <- ifelse(info$height > info$width, "height", "width")
  scale_number <- as.numeric(info[direction]/500)
  image <- magick::image_scale(image,
                               paste0(info["width"]/scale_number, "x",
                                      info["height"]/scale_number))
  newinfo <- magick::image_info(image)
  # colours
  colours <- try(rPlotter::extract_colours(path, num_col = 1), silent = TRUE)
  # one pic at least was problematic
  if(!is(colours, "try-error")){
    colour <- colours[1]
    image <- magick::image_border(image, colour,
                                  paste0((500 - newinfo$width)/2, "x",
                                         (500 - newinfo$height)/2))
    info <- magick::image_info(image)
    # odd numbers out!
    if(info$height/2 != floor(info$height/2)){
      image <- magick::image_crop(image, "0x500+0")
    }
    if(info$width/2 != floor(info$width/2)){
      image <- magick::image_crop(image, "500x0+0")
    }
    magick::image_write(image, stringr::str_replace(path, "nature", "formatted_pics"))
    tibble::tibble(path = path, colour = colour)
  }else{
    NULL
  }
}

pics_main_colours <- purrr::map_df(dir("nature", full.names = TRUE), format_image)
readr::write_csv(pics_main_colours, path = "pics_main_colours.csv")

And because I’m apparently a bad planner, I had to reduce the pictures afterwards.
# we need smaller images
reduce_image <- function(path){
  magick::image_read(path) %>%
    magick::image_scale("50x50!") %>%
    magick::image_write(path)
}
purrr::walk(dir("formatted_pics", full.names = TRUE), reduce_image)

Preparing a function to order and paste pictures
This function has a collage part, which you might recognize from my “the faces of #rstats Twitter” blog post, and a part ordering pictures according to a variable, which is new and uses a bit of tidy eval… Maybe I’ll really learn tidy eval this year! pics_info needs to be a data.frame with the path to the pictures and, well, the variable one wants to use to order them.
library("rlang")

make_column <- function(i, files, no_rows){
  magick::image_read(files[(i*no_rows + 1):((i + 1)*no_rows)]) %>%
    magick::image_append(stack = TRUE) %>%
    magick::image_write(paste0("cols/", i, ".jpg"))
}

make_collage <- function(pics_info, no_rows, no_cols, ordering_col){
  pics_info <- dplyr::arrange(pics_info, !!!syms(ordering_col))
  pics_info <- pics_info[1:(no_rows*no_cols),]
  pics_info <- dplyr::mutate(pics_info, column = rep(1:no_cols, each = no_rows))
  pics_info <- dplyr::group_by(pics_info, column) %>%
    dplyr::arrange(!!!syms(ordering_col)) %>%
    dplyr::mutate(row = 1:no_rows) %>%
    dplyr::ungroup()
  pics_info <- dplyr::arrange(pics_info, column, row)
  dir.create("cols")
  purrr::walk(0:(no_cols - 1), make_column,
              files = pics_info$path, no_rows = no_rows)
  banner <- magick::image_read(dir("cols/", full.names = TRUE)) %>%
    magick::image_append(stack = FALSE)
  unlink("cols", recursive = TRUE)
  return(banner)
}

The function returns a magick object that one can then write to disk, as a PNG for instance.
I first tested it using a random ordering added to the data.frame created in the next section, and show the result here to give an idea of the variety of pictures. For many of them, however, the main colour that you can see in their border is greyish.
set.seed(42)
pics_info <- dplyr::mutate(pics_info,
                           random = sample(1:nrow(pics_info), nrow(pics_info)))
make_collage(pics_info, 19, 59, "random") %>%
  magick::image_write("data/20180107rainbowingbanner_random.png")

Testing a first (bad) approach: using hue
Once I had the main colour as a hex code, I had no idea how to order the colours, and thought a good idea would be to use hue, which is the main wavelength in a colour. Most observed colours are a mix of wavelengths, unless you’re using a laser for instance. To get hue from a colour identified by its hex code, one needs two functions: colorspace::hex2RGB and grDevices::rgb2hsv. The latter outputs hue, saturation and value. Hue is the main wavelength, saturation the amount of that wavelength in the colour, and value the amount of light in the colour. The smaller the saturation, the less representative the hue is of the main colour. Add to that the fact that the main colour can itself be only somewhat representative of your original picture… Ordering by hue isn’t perfect, but I tried it anyway.
# now work on getting the hue and value for all pics
# create a data.frame with path, hue, value
get_values <- function(path, pics_main_colours){
  print(path)
  # get main colour
  colour <- pics_main_colours$colour[pics_main_colours$path ==
                                       stringr::str_replace(path, "formatted_pics", "nature")]
  # translate it
  rgb <- colorspace::hex2RGB(colour)
  values <- grDevices::rgb2hsv(t(rgb@coords))
  tibble::tibble(path = path,
                 hue = values[1,1],
                 saturation = values[2,1],
                 value = values[3,1])
}

# all values
pics_info <- purrr::map_df(dir("formatted_pics", full.names = TRUE),
                           get_values, pics_main_colours)

make_collage(pics_info, 19, 59, "hue") %>%
  magick::image_write("banner_hue.png")

So this is not too pretty. Blue and green pictures seem to cluster together, but there are very dark pictures which we’d intuitively put aside.
So I thought a bit more and decided to first assign each main colour to a reference colour, and then order the pictures based on that…
Choosing a better approach: RGB and distances
The first challenge was to choose reference colours that would be the rainbow’s slices. I could have looked up the RGB codes corresponding to ROYGBIV (red, orange, yellow, green, blue, indigo and violet), but I had read about the xkcd color survey in this interesting post, and therefore decided to use XKCD colors, available in R via the xkcdcolors package. I chose to use the 18 most common ones, and added black to the lot. It was no longer really a rainbow, I agree. The colours were then ordered by hand by my husband, and I like his choices.
Then, to assign each pic to a reference colour via its main colour, I calculated the Euclidean distance, using the RGB values, between that colour and all reference colours, and picked the closest reference colour.
library("xkcdcolors")
library("magrittr")

# version of colorspace::hex2RGB returning a matrix of coordinates
hex2rgb <- function(hex){
  result <- colorspace::hex2RGB(hex)@coords
}

# https://stackoverflow.com/questions/45328221/unnest-one-column-list-to-many-columns-in-tidyr
colors <- tibble::tibble(name = c(xcolors()[1:18], "black"),
                         hex = name2color(name),
                         rgb = purrr::map(hex, hex2rgb)) %>%
  dplyr::mutate(rgb = purrr::map(rgb, tibble::as_tibble)) %>%
  tidyr::unnest()

# for each colour I want the closest reference colour
find_closest_colour <- function(hex, colors){
  test <- tibble::tibble(hex = hex) %>%
    dplyr::mutate(rgb = purrr::map(hex, hex2rgb),
                  rgb = purrr::map(rgb, tibble::as_tibble)) %>%
    tidyr::unnest()
  distance <- stats::dist(rbind(test[, c("R", "G", "B")],
                                colors[, c("R", "G", "B")]))
  colors$name[which.min(as.matrix(distance)[,1][2:(nrow(colors) + 1)])]
}

imgs_col <- dplyr::mutate(pics_main_colours,
                          xkcd_col = purrr::map_chr(colour, find_closest_colour,
                                                    colors = colors))
readr::write_csv(imgs_col, path = "imgs_xkcd_col.csv")

Once I had this information about each pic, I could order the pictures, after having defined the order of the reference colours.
pics_info <- readr::read_csv("imgs_xkcd_col.csv")
pics_info <- dplyr::mutate(pics_info,
                           path = stringr::str_replace(path, "nature", "formatted_pics"))
pics_info <- dplyr::mutate(pics_info,
                           xkcd_col = factor(xkcd_col, ordered = TRUE,
                                             levels = c("black", "brown", "red", "magenta", "pink",
                                                        "lime green", "green", "dark green", "teal",
                                                        "light blue", "sky blue", "blue", "purple", "grey")))

This looks much better, but of course the initial set (and maybe the extraction method used as well) doesn’t provide enough colours to make this extremely pretty. I’m not sure how useful this end product is, but hey, I got to look at pretty landscapes full of colours from my grey rainy city, and learnt a lot along the way… Besides, maybe you will find a cool use case for some of the colour methods featured here and will tell me about it in the comments?
Dataviz Signup
(This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers)
Data Visualization for Social Science will be published later this year by Princeton University Press. You can read a near-complete draft of the book at socviz.co. If you would like to receive one (1) email when the book is available for pre-order, please fill out this very short form. The goal of the book is to introduce readers to the principles and practice of data visualization in a humane, reproducible, and up-to-date way.
New wrapr R pipeline feature: wrapr_applicable
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
The R package wrapr now has a neat new feature: “wrapr_applicable”.
This feature allows objects to declare a surrogate function to stand in for the object in wrapr pipelines. It is a powerful technique and allowed us to quickly implement a convenient new ad hoc query mode for rquery.
A small effort in making a package “wrapr aware” appears to have a fairly large payoff.
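To illustrate the general idea (this is a hand-rolled sketch using plain S3-style attributes, not wrapr's actual implementation; the operator and attribute names are made up), a pipe can check whether its right-hand argument carries a surrogate function and call that instead of the object:

```r
# toy pipe: if the rhs declares a surrogate function, apply it to the lhs
# (illustrative only; wrapr's real mechanism differs in detail)
`%to%` <- function(lhs, rhs) {
  surrogate <- attr(rhs, "surrogate_fn")
  if (!is.null(surrogate)) {
    surrogate(lhs, rhs)   # the object decides how it is applied
  } else if (is.function(rhs)) {
    rhs(lhs)
  } else {
    stop("right-hand side is not applicable")
  }
}

# an object that is not a function, but knows how to act like one:
# roughly the shape of an ad hoc query object
query <- list(center = TRUE)
attr(query, "surrogate_fn") <- function(data, obj) {
  if (obj$center) data - mean(data) else data
}

c(1, 2, 3) %to% query   # the surrogate runs: -1 0 1
4 %to% sqrt             # plain functions still work: 2
```

The payoff is that arbitrary objects can participate in pipelines without the pipeline needing to know anything about them in advance.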
tint 0.0.5
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
A maintenance release of the tint package arrived on CRAN earlier this week. Its name expands to "tint is not tufte", as the package offers a fresher take on the Tufte style for html and pdf presentations.
A screenshot of the pdf variant is below.
Similar to the RcppCNPy release this week, this is pure maintenance related to dependencies. CRAN noticed that processing the vignettes requires the mgcv package, as we use geom_smooth() in some example graphs. So the vignette was altered to avoid that requirement just for the vignette tests. We also had one pending older change related to jurassic pandoc versions on some CRAN architectures.
Changes in tint version 0.0.5 (2018-01-05)
Only run the html rendering regression test on Linux or Windows, as the pandoc versions on CRAN are too old elsewhere.
Vignette figures reworked so that the mgcv package is not required, avoiding a spurious dependency [CRAN request]
Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page.
For questions or comments use the issue tracker off the GitHub repo.
RcppCNPy 0.2.8
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
A minor maintenance release of the RcppCNPy package arrived on CRAN this week.
RcppCNPy provides R with read and write access to NumPy files thanks to the cnpy library by Carl Rogers.
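Basic usage is a sketch like the following (the filename is made up; npySave() and npyLoad() are the package's core write and read functions):

```r
library(RcppCNPy)

# write an R matrix of doubles out as a NumPy .npy file
m <- matrix(seq(0, 2.5, by = 0.5), nrow = 2)
f <- file.path(tempdir(), "example.npy")
npySave(f, m)

# read it back; the values round-trip to the same matrix
m2 <- npyLoad(f)
all.equal(m, m2)
```

The same file can then be loaded on the Python side with numpy.load().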
There is no code change here. But to process the vignette we rely on knitr, which sees Python here and (as of its most recent release) wants the (excellent!) reticulate package. That is of course overkill just to process a short pdf document, so we turned this off.
Changes in version 0.2.8 (2018-01-04)
Vignette sets the knitr option python.reticulate=FALSE to avoid another dependency just for the vignette [CRAN request]
CRANberries also provides a diffstat report for the latest release. As always, feedback is welcome and the best place to start a discussion may be the GitHub issue tickets page.
RQuantLib 0.4.4 for Windows
(This article was first published on FOSS Trading, and kindly contributed to R-bloggers)
I’m pleased to announce that the RQuantLib Windows binaries are now up to 0.4.4! The RQuantLib pre-built Windows binaries had been frozen on CRAN since 0.4.2, but now you can get version 0.4.4 binaries from Dirk’s ghrr drat repo.
Installation is as simple as:

drat::addRepo("ghrr") # maybe use 'install.packages("drat")' first
install.packages("RQuantLib", type = "binary")

I will be able to create Windows binaries for future RQuantLib versions too, now that I have a Windows QuantLib build (version 1.11) to link against. Dirk and I plan to talk with CRAN about getting the new binaries hosted there. Regardless, they will always be available via the drat repo.
How to extract data from a PDF file with R
(This article was first published on R-posts.com, and kindly contributed to R-bloggers)
In this post, taken from the book R Data Mining by Andrea Cirillo, we’ll be looking at how to scrape PDF files using R. It’s a relatively straightforward way to look at text mining – but it can be challenging if you don’t know exactly what you’re doing.
Until January 15th, every single eBook and video by Packt is just $5! Start exploring some of Packt’s huge range of R titles here.
You may not be aware of this, but some organizations create something called a ‘customer card’ for every single customer they deal with. This is quite an informal document that contains some relevant information related to the customer, such as the industry and the date of foundation. Probably the most precious information contained within these cards is the comments they write down about the customers. Let me show you one of them:
My plan was the following—get the information from these cards and analyze it to discover whether some kind of common traits emerge.
As you may already know, at the moment this information is presented in an unstructured way; that is, we are dealing with unstructured data. Before trying to analyze this data, we will have to gather it in our analysis environment and give it some kind of structure.
Technically, what we are going to do here is called text mining, which generally refers to the activity of gaining knowledge from texts. The techniques we are going to employ are the following:
 Sentiment analysis
 Wordclouds
 N-gram analysis
 Network analysis
First of all, we need the list of customer cards we received from the commercial department. I have stored all of them within the ‘data’ folder in my workspace. Let’s use list.files() to get them:
file_vector <- list.files(path = "data")

Nice! We can inspect it by looking at its head, using the following command:
library("magrittr") # for the %>% pipe
file_vector %>% head()
[1] "banking.xls" "Betasoloin.pdf" "Burl Whirl.pdf" "BUSINESSCENTER.pdf"
[5] "Buzzmaker.pdf" "BuzzSaw Publicity.pdf"

Uhm… not exactly what we need. I can see there are also .xls files. We can remove them using the grepl() function, which performs partial matches on strings, returning TRUE if the pattern is found, or FALSE if not. We are going to set up the following test here: give me TRUE if you find ".pdf" in the filename, and FALSE if not:
grepl(".pdf", file_vector)
  [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [18] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [35] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [52] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [69] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [86] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[103] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
[120] TRUE

As you can see, the first match results in a FALSE, since it relates to the .xls file we saw before. We can now filter our list of files by simply passing these matching results to the list itself. More precisely, we will slice our list, selecting only those records where our grepl() call returns TRUE:
pdf_list <- file_vector[grepl(".pdf", file_vector)]
Did you understand [grepl(".pdf", file_vector)]? It is actually a way to access one or more indexes within a vector; in our case, these are the indexes where grepl() returned TRUE, exactly as we printed out before.
If you now look at the list, you will see that only PDF filenames are shown on it.
Reading PDF files into R via pdf_text()
R comes with a really useful package for tasks related to PDFs. It is named pdftools, and besides the pdf_text function we are going to employ here, it also contains other relevant functions for getting different kinds of information about a PDF file into R.
For our purposes, it will be enough to get all of the textual information contained within each of the PDF files. First of all, let’s try this on a single document; we will try to scale it later on the whole set of documents. The only required argument to make pdf_text work is the path to the document. The object resulting from this application will be a character vector of length 1:
pdf_text("data/BUSINESSCENTER.pdf")
[1] "BUSINESSCENTER business profile\nInformation below are provided under non disclosure agreement. date of enquery: 12.05.2017\ndate of foundation: 19930518\nindustry: Nonprofit\nshare holders: Helene Wurm ; Meryl Savant ; Sydney Wadley\ncomments\nThis is one of our worst customer. It really often miss payments even if for just a couple of days. We have\nproblems finding useful contact persons. The only person we can have had occasion to deal with was the\nfiscal expert, since all other relevant person denied any kind of contact.\n 1\n"
If you compare this with the original PDF document, you can easily see that all of the information is available, even if it is definitely not ready to be analyzed. What do you think is the next step needed to make our data more useful?
We first need to split our string into lines in order to give our data a structure that is closer to the original one, that is, made of paragraphs. To split our string into separate records, we can use the strsplit() function, which is a base R function. It requires the string to be split and the token that decides where the string has to be split as arguments. If you now look at the string, you’ll notice that where we found the end of a line in the original document, for instance after the words business profile, we now find the \n token. This is commonly employed in text formats to mark the end of a line.
We will therefore use this token as a split argument:
pdf_text("data/BUSINESSCENTER.pdf") %>% strsplit(split = "\n")
[[1]]
 [1] "BUSINESSCENTER business profile"
 [2] "Information below are provided under non disclosure agreement. date of enquery: 12.05.2017"
 [3] "date of foundation: 19930518"
 [4] "industry: Nonprofit"
 [5] "share holders: Helene Wurm ; Meryl Savant ; Sydney Wadley"
 [6] "comments"
 [7] "This is one of our worst customer. It really often miss payments even if for just a couple of days. We have"
 [8] "problems finding useful contact persons. The only person we can have had occasion to deal with was the"
 [9] "fiscal expert, since all other relevant person denied any kind of contact."
[10] "            1"
strsplit() returns a list with an element for each element of the character vector passed as an argument; within each list element, there is a vector with the split string.
Isn’t that better? I definitely think it is. The last thing we need to do before actually doing text mining on our data is to apply those treatments to all of the PDF files and gather the results into a conveniently arranged data frame.
Iteratively extracting text from a set of documents with a for loop
What we want to do here is run through the list of files and, for each filename found there, run the pdf_text() function and then the strsplit() function, to get an object similar to the one we have seen with our test. A convenient way to do this is by employing a ‘for’ loop. These structures basically tell R: repeat these instructions n times and then stop. Let me show you a typical ‘for’ loop:
for (i in 1:3){
  print(i)
}
If we run this, we obtain the following:
[1] 1
[1] 2
[1] 3
This means that the loop runs three times and therefore repeats the instructions included within the brackets three times. What is the only thing that seems to change every time? It is the value of i. This variable, which is usually called the counter, is basically what the loop employs to understand when to stop iterating. The for loop repeats the instructions between the brackets for each element of the vector following the in clause, going from 1 to 3 in our example. At each step, the variable before in (i in this case) takes one value of the sequence from the vector itself.
The counter is also useful within the loop itself, and it is usually employed to iterate within an object in which some kind of manipulation is desired. Take, for instance, a vector defined like this:
vector <- c(1,2,3)
Imagine we want to increase the value of every element of the vector by 1. We can do this by employing a loop such as this:
for (i in 1:3){
  vector[i] <- vector[i] + 1
}
If you look closely at the loop, you’ll realize that the instruction accesses the element of the vector with an index equal to i and increases its value by 1. The counter here is useful because it allows us to iterate over all of the vector’s indexes, from 1 to 3.
Be aware that this is actually not a best practice because loops tend to be quite computationally expensive, and they should be employed when no other valid alternative is available. For instance, we can obtain the same result here by working directly on the whole vector, as follows:
vector_increased <- vector + 1
If you are interested in the topic of avoiding loops where they are not necessary, I can share with you some relevant material on this.
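As a quick taste of those alternatives, here is a self-contained sketch (my own illustration, not from the book extract) comparing two loop-free approaches:

```r
vector <- c(1, 2, 3)

# Vectorized arithmetic: R adds 1 to every element at once,
# with no explicit counter to manage.
vector_increased <- vector + 1

# The same result via vapply(), handy when the per-element
# operation is more involved than simple arithmetic.
vector_increased2 <- vapply(vector, function(x) x + 1, numeric(1))

identical(vector_increased, vector_increased2)
#> [1] TRUE
```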
For our purposes, we are going to apply loops to go through the pdf_list object, and apply the pdf_text function and subsequently the strsplit() function to each element of this list:
corpus_raw <- data.frame("company" = c(), "text" = c())
for (i in 1:length(pdf_list)){
  print(i)
  pdf_text(paste("data/", pdf_list[i], sep = "")) %>%
    strsplit("\n") -> document_text
  data.frame("company" = gsub(x = pdf_list[i], pattern = ".pdf", replacement = ""),
             "text" = document_text, stringsAsFactors = FALSE) -> document
  colnames(document) <- c("company", "text")
  corpus_raw <- rbind(corpus_raw, document)
}
Let’s look closer at the loop: we first have a call to the pdf_text function, passing an element of pdf_list as an argument; it references the i position in the list. Once we have done this, we apply the strsplit() function to the resulting string. We then define the document object, which contains two columns, in this way:
 company, which stores the name of the PDF without the .pdf token; this is the name of the company
 text, which stores the text resulting from the extraction
This document object is then appended to the corpus object, which we created previously, to store all of the text within the PDF.
Let’s have a look at the resulting data frame:
corpus_raw %>% head()
This is a well-structured object, ready for some text mining. Nevertheless, if we look closely at our PDF customer cards, we can see that there are three different kinds of information, and they should be handled differently:
 Repeated information, such as the confidentiality disclosure on the second line and the date of enquiry (12.05.2017)
 Structured attributes, for instance, date of foundation or industry
 Strictly unstructured data, which is in the comments paragraph
We are going to address these three kinds of data differently, removing the first group of irrelevant information; we therefore have to split our data frame accordingly into two smaller data frames. To do so, we will leverage the grepl() function once again, looking for the following tokens:
 12.05.2017: This denotes the line showing the non-disclosure agreement and the date of inquiry.
 business profile: This denotes the title of the document, containing the name of the company. We already have this information stored within the company column.
 comments: This is the name of the last paragraph.
 1: This represents the number of the page and is always the same on every card.
We can apply the filter function to our corpus_raw object here as follows:
corpus_raw %>%
  filter(!grepl("12.05.2017", text)) %>%
  filter(!grepl("business profile", text)) %>%
  filter(!grepl("comments", text)) %>%
  filter(!grepl("1", text)) -> corpus
Now that we have removed those useless lines, we can actually split the data frame into two sub-data frames based on what is returned by the grepl function when searching for the following tokens, which point to the structured attributes we discussed previously:
 date of foundation
 industry
 shareholders
We are going to create two different data frames here; one is called ‘information’ and the other is called ‘comments’:
corpus %>%
  filter(!grepl(c("date of foundation"), text)) %>%
  filter(!grepl(c("industry"), text)) %>%
  filter(!grepl(c("share holders"), text)) -> comments

corpus %>%
  filter(grepl(("date of foundation"), text) |
         grepl(("industry"), text) |
         grepl(("share holders"), text)) -> information
As you can see, the two data treatments are nearly the opposite of each other: the first looks for lines showing none of the three tokens, while the other looks for records showing at least one of them.
Let’s inspect the results by employing the head function:
information %>% head()
comments %>% head()
Great! We are nearly done. We are now going to start analyzing the comments data frame, which reports all comments from our colleagues. The very last step needed to make this data frame ready for subsequent analyses is to tokenize it, which basically means separating it into different rows for each of the words available within the text column. To do this, we are going to leverage the unnest_tokens function, which splits each line into words and produces a new row for each word, taking care to repeat the corresponding values within the other columns of the original data frame.
This function comes from the recent tidytext package by Julia Silge and David Robinson, which provides an organic framework of utilities for text mining tasks. It follows the tidyverse framework, which you should already know about if you are using the dplyr package.
Let’s see how to apply the unnest_tokens function to our data:
comments %>% unnest_tokens(word, text) -> comments_tidy
If we now look at the resulting data frame, we can see the following:
As you can see, we now have each word separated into a single record.
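To make the tokenization step concrete, here is a self-contained toy example (the company name and sentence are invented for illustration):

```r
library(dplyr)
library(tidytext)

toy <- data.frame(
  company = "BUSINESSCENTER",
  text    = "This is one of our worst customers",
  stringsAsFactors = FALSE)

# Each word of `text` becomes its own row; the company column
# is repeated on every resulting row, and words are lowercased.
toy %>% unnest_tokens(word, text)
```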
Thanks for reading! We hope you enjoyed this extract taken from R Data Mining.
Explore $5 R eBooks and videos from Packt until January 15th 2018.
To leave a comment for the author, please follow the link and comment on their blog: Rposts.com.
Using nonstandard evaluation to simulate a register machine
(This article was first published on Higher Order Functions, and kindly contributed to Rbloggers)
I recently completed all 25 days of Advent of Code 2017, an annual
series of recreational programming puzzles. Each day describes a programming
puzzle and illustrates a handful of simple examples of the problem. The puzzle
then requires the participant to solve a much, much larger form of the problem.
For five or so of the puzzles, I used nonstandard evaluation to implement my
solution. As I previously wrote, nonstandard evaluation is a way of
bottling up magic spells (lines of R code) and changing how or where they are
cast (evaluated). I chose to use special evaluation not because it was the
easiest or most obvious solution but because I wanted to develop my skills with
computing on the language. In this post, I work through one of the examples
where I used nonstandard evaluation to write an interpreter for a simple
machine.
Day 8 requires us to simulate the state
of a register machine as it receives a series of instructions.
Each instruction consists of several parts: the register to modify, whether to
increase or decrease that register’s value, the amount by which to increase or
decrease it, and a condition. If the condition fails, skip the instruction
without modifying the register. The registers all start at 0. The instructions
look like this:
[…]
You might also encounter <= (less than or equal to) or != (not equal to).
However, the CPU doesn’t have the bandwidth to tell you what all the registers
are named, and leaves that to you to determine.
What is the largest value in any register after completing the
instructions in your puzzle input?
If I squint long enough at the register instructions, I can basically
see R code.
If we can set up a way to convert the machine instructions into R code, R will
handle the job of looking up values, modifying values and evaluating the logic
and if statements. In other words, if we can convert register instructions
into R code, the problem simplifies into something like running an R script.
And that’s a good strategy, because we have a lot of instructions to process.
Each user receives some unique (I think) input for each problem, and my problem
input contains 1,000 instructions.
Our strategy for simulating the register machine will have the following steps:
 Parsing a register instruction
 Creating an R expression from an instruction
 Evaluating an R expression inside of a register machine
 Changing the evaluation rules to adapt to the quirks of this problem
The instructions have a very simple grammar. Here is how I would tag the first
few lines of my problem input.
We can parse these lines using regular expressions. Regular expressions are an
incredibly powerful language for processing text using patternmatching rules. I
will walk through each of the regular expression patterns used to parse an
instruction.
To match the verbs, we can use the or operator |, so
(inc|dec) matches inc or dec. We can also match the six different
comparison operators using | too. In the code below, I put the patterns in
parentheses so that they will be treated as a single group.
A register name is just a sequence of letters. The special character \w
matches any word character; that is, it matches uppercase/lowercase letters,
digits and underscores. The (token)+ suffix matches 1 or more
repetitions of a token. Putting these two together, \w+ will match 1 or
more adjacent word characters. That pattern in principle could match things
besides register names (like numbers), but the instruction format here is so
constrained that it’s not a problem.
Numbers are just integers, and sometimes they are negative. A number here
is an optional - plus some digits. The special character \d matches any
digit from 0 to 9, so \d+ matches 1 or more successive digits. We can use
the (token)? suffix to match an optional token. In our case,
-?\d+ will match a sequence of digits along with a leading hyphen if one is
present.
Each pattern in parentheses is a matching group, and the function str_match()
from the stringr package will return a matrix
with a column for each matched group. I also include an extra set of
parentheses around the condition in the if statement to also match the whole
condition as well as its parts.
We can package this step into a function that takes an instruction’s
text and returns a list with the labelled parts of that instruction.
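The post’s code isn’t reproduced here, but a parser along the lines just described might look like this (the function name, list field names, and exact pattern are my reconstruction):

```r
library(stringr)

parse_instruction <- function(text) {
  # One matching group per part of the instruction, plus an extra
  # group wrapped around the whole condition of the if statement.
  pattern <- "(\\w+) (inc|dec) (-?\\d+) if ((\\w+) (>=|<=|==|!=|>|<) (-?\\d+))"
  m <- str_match(text, pattern)
  list(
    target    = m[2],  # register to modify
    verb      = m[3],  # inc or dec
    num       = m[4],  # amount
    condition = m[5],  # the whole condition, e.g. "a > 1"
    cond_reg  = m[6],  # register tested by the condition
    cond_op   = m[7],
    cond_num  = m[8])
}

parse_instruction("b inc 5 if a > 1")
```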
Next, we need to convert some strings into R code. We can do this with
rlang::parse_expr(). It takes a string and creates an R expression,
something I’ve described as a kind of bottled up magic spell: An
expression captures magic words (code), allowing us to manipulate or cast (evaluate)
them.
The format of the instructions is relatively straightforward. We can fill
in a template with the parts of the parsed line. Because inc/dec are just
addition and subtraction, we replace them with the appropriate math operations.
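As a sketch of that templating step (my own reconstruction; sprintf() stands in for whatever the original post used, and the parsed fields are assumed to be plain strings):

```r
create_r_instruction <- function(parsed) {
  # inc/dec become + and -
  op <- if (parsed$verb == "inc") "+" else "-"
  code <- sprintf("if (%s) %s %s %s else %s",
                  parsed$condition, parsed$target, op,
                  parsed$num, parsed$target)
  rlang::parse_expr(code)
}

parsed <- list(target = "b", verb = "inc", num = "5", condition = "a > 1")
create_r_instruction(parsed)
#> if (a > 1) b + 5 else b
```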
We have to figure out where we want to evaluate the generated R code. We
create a register object to hold the values. The object will just be a list()
with some extra metadata. This object will be the location where the R code
is evaluated.
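A minimal constructor along those lines could be (a reconstruction; the real function may carry more metadata):

```r
create_register_machine <- function(...) {
  # Registers are just named list entries; .metadata holds
  # bookkeeping values alongside them.
  machine <- list(...)
  machine$.metadata <- list()
  structure(machine, class = c("register_machine", "list"))
}
```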
For now, we can initialize registers by using named arguments to the function.
create_register_machine(a = 0, b = 0)
#> List of 3
#>  $ a        : num 0
#>  $ b        : num 0
#>  $ .metadata: list()
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"
Evaluating code inside of the machine
So far, we have:
 A way to analyze register instructions and convert them into R code
 An object that holds register values
Now, we need to evaluate an expression inside of the register. We will use
tidy evaluation; the function eval_tidy() lets us evaluate an
R expression inside of a data object.
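For instance, here is a standalone illustration of eval_tidy() with a plain list serving as the data context:

```r
library(rlang)

# Names in the expression are looked up in `data` first,
# so `a` here resolves to 2.
instruction <- parse_expr("if (a > 1) a + 10 else a")
eval_tidy(instruction, data = list(a = 2))
#> [1] 12
```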
Now, we need to actually do something. We need to update the register machine
using the value from the evaluated instruction. Otherwise, the machine will just
read expressions and forget everything it’s read.
To update the machine, we have to determine the register to update. Fortunately,
our generated code always ends with an else branch that has the target
register.
If we can pull out that symbol after the else, we will have the name of
register to update in the machine. Because the code is so formulaic, we can just
extract the symbol directly using the code’s abstract syntax tree (AST).
pryr::call_tree() shows the AST for an expression.
We can extract elements from the tree like elements in a list by selecting
indices.
If we convert the symbol into a string, we can look it up in the register using
the usual list lookup syntax.
We can also use list lookup syntax with assignment to modify the register.
r[[target]] <- rlang::eval_tidy(r_code, data = r)
r
#> List of 3
#>  $ a        : num 4
#>  $ b        : num 12
#>  $ .metadata: list()
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"
Let’s wrap these steps into a function.
eval_instruction <- function(register_machine, instruction) {
  target <- rlang::as_string(instruction[[4]])
  register_machine[[target]] <- rlang::eval_tidy(
    expr = instruction,
    data = register_machine)
  register_machine
}

create_register_machine(a = 2, b = 0) %>%
  eval_instruction(r_code)
#> List of 3
#>  $ a        : num 2
#>  $ b        : num 5
#>  $ .metadata: list()
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"

create_register_machine(a = 2, b = 0) %>%
  # For quick testing, we pass in quoted expressions
  eval_instruction(quote(if (a > 1) b - 100 else b)) %>%
  # Should not run
  eval_instruction(quote(if (a < 1) b + 5 else b)) %>%
  # Should run
  eval_instruction(quote(if (a > 1) a + 10 else a))
#> List of 3
#>  $ a        : num 12
#>  $ b        : num -100
#>  $ .metadata: list()
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"
Time for some extra nonstandard evaluation
The code so far only works if the machine already has registers that match the
registers in an instruction. Otherwise, we raise an error.
To solve the problem, we could study the 1,000 lines of input beforehand,
extract the register names, initialize them to 0 and then evaluate the
instructions.1 Or… or… we could procrastinate and only
initialize a register name to 0 when the machine encounters a name it doesn’t
recognize. If, for some reason, our machine received instructions
one at a time, like over a network connection, then the procrastinated approach
seems even more reasonable.
This latter strategy will involve some very nonstandard evaluation. I
emphasize the “very” because we are changing one of the fundamental rules of R
evaluation :smiling_imp:. R throws an error if you ask it to evaluate the name
of a variable that doesn’t exist. But here we are going to detect those missing
variables and set them to 0 before they get evaluated.
To find the brandnew register names, we can inspect the call tree and find the
names of the registers. We already know where the target is. The other place
where names show up is in the condition of the if statement.
We can define a helper function which checks for missing names—names that
yield NULL values when we try to retrieve them—and initializes them to 0.
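One way to write such a helper (a reconstruction assuming the registers live as plain list entries):

```r
library(rlang)

# Set every name not yet present in the machine to 0.
initialize_new_registers <- function(machine, names) {
  for (name in names) {
    if (is.null(machine[[name]])) {
      machine[[name]] <- 0
    }
  }
  machine
}

# The names used by an instruction: the target after `else`,
# and the register inside the condition.
code <- quote(if (d < 1) b + 5 else b)
needed <- c(as_string(code[[4]]), as_string(code[[2]][[2]]))
str(initialize_new_registers(list(), needed))
#> List of 2
#>  $ b: num 0
#>  $ d: num 0
```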
Finally, we update our evaluation function to do this step automatically. I’m
also going to add some code to record the value of the maximum register whenever
an instruction is evaluated because, ummm, that’s the whole point of the puzzle.
Let’s try four instructions from a clean slate.
create_register_machine() %>%
  # b gets 5
  eval_instruction(quote(if (d < 1) b + 5 else b)) %>%
  # c gets 10
  eval_instruction(quote(if (b > 1) c + 10 else c)) %>%
  # b gets 5 more
  eval_instruction(quote(if (a < 1) b + 5 else b))
#> List of 5
#>  $ .metadata:List of 1
#>   ..$ max: num 10
#>  $ b        : num 10
#>  $ d        : num 0
#>  $ c        : num 10
#>  $ a        : num 0
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"
Now, for the moment of truth… Let’s process all 1,000 instructions.
r <- create_register_machine()
for (each_instruction in full_input) {
  parsed <- each_instruction %>%
    parse_instruction() %>%
    create_r_instruction()
  r <- eval_instruction(r, parsed)
}
r
#> List of 27
#>  $ .metadata:List of 1
#>   ..$ max: num 4832
#>  $ kd       : num 2334
#>  $ gm       : num 4239
#>  $ x        : num 345
#>  $ kjn      : num 1813
#>  $ ey       : num 209
#>  $ n        : num 764
#>  $ t        : num 2997
#>  $ f        : num 4468
#>  $ msg      : num 3906
#>  $ ic       : num 263
#>  $ zv       : num 599
#>  $ gub      : num 2025
#>  $ yp       : num 2530
#>  $ lyr      : num 2065
#>  $ j        : num 3619
#>  $ e        : num 4230
#>  $ riz      : num 863
#>  $ lzd      : num 4832
#>  $ ucy      : num 3947
#>  $ i        : num 3448
#>  $ omz      : num 3365
#>  $ djq      : num 392
#>  $ bxy      : num 1574
#>  $ tj       : num 1278
#>  $ y        : num 1521
#>  $ m        : num 2571
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"
:star: Tada! The maximum register value is 4,832. Problem solved!
And then the rules change
Advent of Code problems come in two parts, and we don’t learn the question
behind Part 2 until we complete Part 1. In this case, after submitting our
solution for Part 1, we receive the following requirement:
To be safe, the CPU also needs to know the highest value held in any
register during this process so that it can decide how much memory to allocate
to these operations.
Accounting for this twist requires a small change to the evaluation code. We
add another metadata variable to track the highest value ever stored in a
register.
Admittedly, eval_instruction() is starting to get bloated. Conceptually, we
could probably break this function down into three functions: pre-evaluation
steps, evaluation, and post-evaluation steps.2 But this is good
enough for now.
We run the instructions again to get the updated metadata.
r <- create_register_machine()
for (each_instruction in full_input) {
  parsed <- each_instruction %>%
    parse_instruction() %>%
    create_r_instruction()
  r <- eval_instruction(r, parsed)
}
r$.metadata
#> $max
#> [1] 4832
#>
#> $max_ever
#> [1] 5443
:star2: And boom! Another problem solved.
eval(thoughts, envir = this_problem)
I like this kind of nonstandard evaluation approach for converting problems into
R code, but it’s mostly useful when the problem describes a series of
instructions that can be parsed and evaluated. For problems like this register
machine simulation, the nonstandard evaluation route is straightforward. But
it’s also a viable problemsolving strategy when the “machine” or the
“instructions” are subtler, as in
this problem about simulating “dance” moves.
Odds are, you’ll never have to write an interpreter for a toy machine or
language. Nevertheless, here are some R functions that we used for this
puzzle that are helpful in other contexts:
 stringr::str_match() to extract all the groups in a regular expression at once.
 rlang::parse_expr() to convert a string of text into an R expression.
 pryr::call_tree() to visualize an expression’s syntax tree and expression[[i]][[j]] to pluck out symbols from locations in a tree.
 rlang::as_string() to convert a symbol into a string.
 rlang::eval_tidy() to evaluate an expression inside of a data context.

That actually would be pretty easy. Get a dataframe with
purrr::map_df(full_input, parse_instruction). Find the unique register
names. Create a list of 0’s with those names. Use do.call() to call
create_register_machine() with that list. With no special evaluation
trickery, this approach is closer to the idea of “just running R code”. 
If all I did for a living was write code to simulate machines
or toy languages, I might try to formalize this custom evaluation process
with preevaluation and postevaluations “hooks” that could be arguments
to a custom evaluation function. I’m just brainstorming though.
To leave a comment for the author, please follow the link and comment on their blog: Higher Order Functions.
StanCon is next week, Jan 1012, 2018
(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to Rbloggers)
It looks pretty cool!
Wednesday, Jan 10
Invited Talk: Predictive information criteria in hierarchical Bayesian models for clustered data. Sophia Rabe-Hesketh and Daniel Furr (U California, Berkeley) 10:40-11:30am
Does the New York City Police Department rely on quotas? Jonathan Auerbach (Columbia U) 11:30-11:50am
Bayesian estimation of mechanical elastic constants. Ben Bales, Brent Goodlet, Tresa Pollock, Linda Petzold (UC Santa Barbara) 11:50am-12:10pm
Joint longitudinal and time-to-event models via Stan. Sam Brilleman, Michael Crowther, Margarita Moreno-Betancur, Jacqueline Buros Novik, Rory Wolfe (Monash U, Columbia U) 12:10-12:30pm
Lunch 12:30-2:00pm
ScalaStan. Joe Wingbermuehle (Cibo Technologies) 2:00-2:20pm
A tutorial on Hidden Markov Models using Stan. Luis Damiano, Brian Peterson, Michael Weylandt 2:20-2:40pm
Student Ornstein-Uhlenbeck models served three ways (with applications for population dynamics data). Aaron Goodman (Stanford U) 2:40-3:00pm
SlicStan: a blockless Stan-like language. Maria I. Gorinova, Andrew D. Gordon, Charles Sutton (U of Edinburgh) 3:00-3:20pm
Break 3:20-4:00pm
Invited Talk: Talia Weiss (MIT) 4:00-4:50pm
Thursday, Jan 11
Invited Talk: Sean Taylor and Ben Letham (Facebook) 10:40-11:30am
NPCompare: a package for nonparametric density estimation and two populations comparison built on top of PyStan. Marco Inacio (U of São Paulo/UFSCar) 11:30-11:50am
Introducing idealstan, an R package for ideal point modeling with Stan. Robert Kubinec (U of Virginia) 11:50am-12:10pm
A brief history of Stan. Daniel Lee (Generable) 12:10-12:30pm
Lunch 12:30-1:30pm
Computing steady states with Stan’s nonlinear algebraic solver. Charles C. Margossian (Metrum, Columbia U) 1:30-1:50pm
Flexible modeling of Alzheimer’s disease progression with I-Splines. Arya A. Pourzanjani, Benjamin B. Bales, Linda R. Petzold, Michael Harrington (UC Santa Barbara) 1:50-2:10pm
Intrinsic Auto-Regressive (ICAR) Models for Spatial Data. Mitzi Morris (Columbia U) 2:10-2:30pm
Modeling/Data Session + Classes 2:30-4:10pm
Open session for consultations on modeling and data problems with Stan developers and modelers. 2:30-4:10pm
Session 3 of Intro to Stan 2:30-4:10pm
2:30-3:30pm Have I converged successfully? How to verify fit and diagnose fit problems, Bob Carpenter
What is new to Stan 3:30-4:10pm
Invited Talk: Manuel Rivas (Stanford U) 4:00-4:50pm
Friday, Jan 12
Invited Talk: Susan Holmes (Stanford U) 10:40-11:30am
Aggregate random coefficients logit — a generative approach. Jim Savage, Shoshana Vasserman 11:30-11:50am
The threshold test: Testing for racial bias in vehicle searches by police. Camelia Simoiu, Sam Corbett-Davies, Sharad Goel, Emma Pierson (Stanford U) 11:50am-12:10pm
Assessing the safety of Rosiglitazone for the treatment of type II diabetes. Konstantinos Vamvourellis, K. Kalogeropoulos, L. Phillips (London School of Economics and Political Science) 12:10-12:30pm
Lunch 12:30-1:30pm
Causal inference with the g-formula in Stan. Leah Comment (Harvard U) 1:30-1:50pm
Bayesian estimation of ETAS models with Rstan. Fausto Fabian Crespo Fernandez (Universidad San Francisco de Quito) 1:50-2:10pm
Invited Talk: Andrew Gelman 2:10-3:00pm (Columbia U) (virtual)
Classes/Tutorials
We have tutorials that start at the crack of 8am for those desiring further edification beyond the program—these do not run in parallel to the main session but do run parallel to each other:
Introduction to Stan: Know how to program? Know basic statistics? Curious about Bayesian analysis and Stan? This is the course for you. Hands on, focused and an excellent way to get started working in Stan. Two hours every day, 6 hours total. Jonah Sol Gabry.
Executive decision making the Bayesian way: This is for nontechnical managers and technical folks who need to communicate with managers to learn the core of decision making under uncertainty. One hour every day. Jonathan Auerbach, Breck Baldwin, Eric Novik.
Advanced Hierarchical Models in Stan: The hard stuff. Very interactive, very intense. Topics vary by day. Ben Goodrich.
Model assessment, model selection and inference after model selection. Aki Vehtari.
How to develop for Stan at the C++ level: Overview of Stan C++ architecture and build/development process for contributors. Charles Christopher Margossian.
[edited by Aki: added Model selection tutorial]
The post StanCon is next week, Jan 1012, 2018 appeared first on Statistical Modeling, Causal Inference, and Social Science.
To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science.
Big cdata News
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
I have some big news about our R package cdata. We have greatly improved the calling interface and Nina Zumel has just written the definitive introduction to cdata.
cdata is our general coordinatized data tool. It is what powers the deep learning performance graph (here demonstrated with R and Keras) that I announced a while ago.
However, cdata is much more than that.
cdata provides a family of general transforms that include pivot/unpivot (or tidyr::spread/tidyr::gather) as easy special cases. Nina refused to write the article on it until we refactored the api to be even more teachable (and therefore more learnable, and more useful). After her redesign (adding the concepts of both concrete records and abstract records to the coordinatized data theory) the system teaches itself. It is actually hard to remember you are graphically specifying potentially involved, difficult, and confusing transforms (which do not remain confusing as the graphical specification becomes its own documenting diagram!).
Don’t take my word for it; please check out Nina’s article: “Fluid data reshaping with cdata”.
Also, I will be presenting a lightning talk on cdata at the January 2018 BARUG Meetup! We think we have gotten the concepts and package refined and polished to the point where it can be mastered in 15 minutes with time to spare.
As a bonus, this new version of cdata is now available on CRAN (with the old method naming still supported), and also works at big-data scale (for DBI-adaptable databases such as Spark and PostgreSQL, an uncommon feature for a full-featured pivot/unpivot system).
simmer 3.6.5
(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)
The fifth update of the 3.6.x release of simmer, the Discrete-Event Simulator for R, is on CRAN. This release extends the attributes API by allowing users to set/get multiple attributes at once (a pretty straightforward as well as useful change; I don’t know why it didn’t occur to me before…). Vectors as attributes and other data types are not supported yet, but they are on the roadmap.
This version also fixes some minor bugs (many thanks to the users of the simmer-devel mailing list for taking their simulations to edge cases, where these bugs arise), deprecates the onestep() function and provides the new stepn() instead. Since onestep() serves primarily for debugging purposes, the transition to the new one may go unnoticed. Finally, there is a new vignette about the Dining Philosophers Problem.
New features:

- set_attribute() (and set_global() by extension) can set multiple attributes at once by providing vectors of keys and values (or functions returning such keys and/or values). get_attribute() (and get_global() by extension) can retrieve multiple keys (#122).
- New stepn() method deprecates onestep() (e452975).

Minor changes and fixes:

- Restore ostream after formatting (9ff11f8).
- Fix arrival cloning to copy attributes over to the clone (#118).
- Fix self-induced preemption through set_capacity() (#125).
- Update “Queueing Systems” vignette (a0409a0, 8f03f4f).
- Update “Advanced Trajectory Usage” vignette (4501927).
- Fix print methods to return the object invisibly (#128).
- New “Dining Philosophers Problem” vignette (ff6137e).
Article originally published in Enchufa2.es: simmer 3.6.5.
Divide and parallelize large data problems with Rcpp
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
by Błażej Moska, computer science student and data science intern
Got stuck with too large a dataset? Does R's speed drive you mad? Divide, parallelize and go with Rcpp!
One of the frustrating moments when working with data is when you need results urgently, but your dataset is large enough to make that impossible. This often happens when we need to use an algorithm with high computational complexity. I will demonstrate it on an example I’ve been working with.
Suppose we have a large dataset consisting of association rules. For some reason we want to slim it down. Whenever two rules’ consequents are the same and one rule’s antecedent is a subset of the second rule’s antecedent, we want to choose the rule with the smaller antecedent (the probability of obtaining a smaller set is higher than the probability of obtaining a bigger set). This is illustrated below:
{A,B,C}=>{D}
{E}=>{F}
{A,B}=>{D}
{A}=>{D}
How can we achieve that? For example, using the pseudo-algorithm below:
For i = 1 to n:
    For j = i+1 to n:
        if consequents[i] = consequents[j]:
            if antecedent[i] contains antecedent[j]: flag antecedent[i] with 1, otherwise with 0
            else if antecedent[j] contains antecedent[i]: flag antecedent[j] with 1, otherwise with 0

How many operations do we need to perform with this simple algorithm?
For the first i we need to iterate \(n-1\) times, for the second i \(n-2\) times, for the third i \(n-3\), and so on, finally reaching \(n-(n-1)=1\). This leads to (proof can be found here):
\[
\sum_{i=1}^{n-1}{i} = \frac{n(n-1)}{2}
\]
So the above has an asymptotic complexity of \(O(n^2)\). It means, more or less, that the computational cost grows with the square of the size of the data. Well, for a dataset containing around 1,300,000 records this becomes a serious issue: with R I was unable to perform the computation in reasonable time. Since a compiled language performs better with simple arithmetic operations, the second idea was to use Rcpp. Yes, it is faster, to some extent, but with such a large dataframe I was still unable to get results in satisfying time. So are there any other options?
Yes, there are. If we take a look at our dataset, we can see that it can be split up in such a way that each individual “chunk” consists of records with exactly the same consequents:
{A,B}=>{D}
{A}=>{D}
{C,G}=>{F}
{Y}=>{F}
After such a division I got 3300 chunks, so the average number of observations per chunk was around 400. The next step was to run the algorithm sequentially on each chunk. Since our algorithm has quadratic complexity, it is faster to do it that way than on the whole dataset at once. While R failed again, Rcpp finally returned a result (after 5 minutes). But there is still room for improvement. Since our chunks can be processed independently, we can perform the computation in parallel using, for example, the foreach package (which I demonstrated in a previous article). While passing R functions to foreach is a simple task, parallelizing Rcpp is a little more time-consuming. We need to do the following steps:
 Create a .cpp file which includes all of the functions needed
 Create a package using Rcpp. This can be achieved using for example:
Rcpp.package.skeleton("nameOfYourPackage", cpp_files = "directory_of_your_cpp_file")

 Install your Rcpp package from source:

install.packages("directory_of_your_rcpp_package", repos = NULL, type = "source")

 Load your library:
library(name_of_your_rcpp_package)
Now you can use your Rcpp function in foreach:
results = foreach(k = 1:length(len), .packages = c(name_of_your_package)) %dopar% { your_cpp_function(data) }

Even with foreach I waited forever for the R results, but Rcpp gave them in approximately 2.5 minutes. Not too bad!
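For readers who want to see the shape of the chunked computation, here is a serial base-R sketch with made-up rules (the function and column names are hypothetical; in the post's setup the inner check was done in Rcpp, and the lapply is what foreach ... %dopar% replaces when parallelizing):

```r
# Hypothetical rules: antecedents are character vectors in a list column
rules <- data.frame(consequent = c("D", "D", "F"))
rules$antecedent <- list(c("A", "B", "C"), c("A", "B"), "E")

# Flag a rule with 1 when another rule in the same chunk has the same
# consequent and a strictly smaller antecedent contained in it
flag_supersets <- function(chunk) {
  n <- nrow(chunk)
  flag <- integer(n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i != j &&
          length(chunk$antecedent[[j]]) < length(chunk$antecedent[[i]]) &&
          all(chunk$antecedent[[j]] %in% chunk$antecedent[[i]])) {
        flag[i] <- 1L  # rule i is dominated by the smaller rule j
      }
    }
  }
  chunk$flag <- flag
  chunk
}

# Split into chunks sharing a consequent, process each independently, recombine
chunks <- split(rules, rules$consequent)
result <- do.call(rbind, lapply(chunks, flag_supersets))
```

Replacing the lapply() with a foreach() loop over `chunks` distributes the quadratic inner work across workers, exactly because the chunks share no data.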
 Here is the Rcpp code for the issub function
 Here is the R code that partitions the data and calls the issub function in parallel
Here are some conclusions. Firstly, it’s worth knowing more languages/tools than just R. Secondly, there is often an escape from the large-dataset trap. There is little chance that somebody will face exactly the same task as in the above example, but a much higher probability that someone will face a similar problem and be able to solve it in the same way.
The Markov Chain: How Do I Compute the Stationary Distribution?
(This article was first published on R-Programming – Statistik Service, and kindly contributed to R-bloggers)
In this article we introduce the concept of the Markov chain, illustrate its foundations, and present several application areas in which targeted statistical programming of Markov chains can pay off.

A Markov chain is a stochastic process with a wide range of applications in nature, engineering, and economics. Classic economic applications include the modelling of queues and exchange rates. In engineering there are also numerous uses, such as modelling the behaviour of dams and of cruise-control systems in vehicles, as well as the analytical evaluation of mobility algorithms such as the random walk. In the natural sciences, the process is used in population dynamics to predict the growth of human and animal populations, and in a modified form for Brownian motion. Particularly relevant for practice is the statistical programming and simulation of the stationary distribution with the statistics software R, which we carry out below using an illustrative example. First, however, we explain the terms required for the computation.
What are a Markov chain and a stationary distribution?

There are many different kinds of stochastic processes, divided into various categories. A well-known type that is used particularly often in practice is the Markov chain, named after the mathematician Andrei Andreyevich Markov. It consists of a state space, an index set, an initial distribution, and the transition probabilities.

The state space determines which states can be taken. One distinguishes between a continuous state space, which contains uncountably many states, and a discrete state space, which contains at most countably many states. An important special case of the latter is a state space with finitely many states, on which we will focus.

A stochastic process changes its state over time. Analogously to the state space, the index set (which contains the time points) is either continuous or discrete. A continuous index set is appropriate, for example, for Brownian motion, because the molecules are in constant motion and can change direction and speed within the smallest time intervals. We, however, use the discrete index set of the natural numbers N0 = {0,1,2,3,4,5,…}, including the start time 0, with an arbitrary time unit. Between two consecutive time points the state therefore remains constant.

The Markov chain has to start in some state. This can be a fixed or a randomly chosen state. The initial distribution determines with which probability the stochastic process starts in which state.

Not only the start is random, but also the subsequent behaviour. The transition probabilities determine with which probability the process moves into which state. We consider the standard case of a first-order Markov chain, in which the transition probabilities depend only on the current state. Moreover, the Markov chain is homogeneous, i.e. the transition probabilities are constant and do not change over time. They can therefore be arranged in a transition matrix.

The state distribution depends on the respective time point. In some cases the state distribution (which contains the occupation probabilities of the states at a time n) converges to an equilibrium distribution, also called the stationary distribution. In the following sections an example shows not only the criteria for existence and uniqueness of the stationary distribution, but also the analytical solution and how to carry out the statistical programming and simulation with the statistics software R.
Conditions for existence and uniqueness of the stationary distribution

The game character Pac-Man eats small pellets and cherries in a maze while being chased by ghosts. For simplicity, the game world in this example is a small 3×3 grid, and the monsters move purely at random. Each horizontally or vertically adjacent cell is the ghost's next location with equal probability, with the exception of a secret passage between states 2 and 8. The transition graph below shows, as an example, the transition probabilities of states 1, 5, 6 and 8.

To add variety, the monsters' starting cell is chosen at random, each cell with equal probability. The start vector on the state space Z = {1,2,3,4,5,6,7,8,9} is therefore the row vector π0 = 1/9 · (1,1,1,1,1,1,1,1,1). The i-th row and j-th column of the transition matrix P shown below contains the transition probability from state i to state j. Entries with probability 0 have been omitted for readability:

At the beginning (time 0) every state in this example is equally likely, so the initial state distribution can be read directly off the start vector. What does the state distribution look like after one time unit? To answer this question, multiply the start vector by the transition matrix P. For the second time point, multiply the resulting row vector by P once more, and so on. In general, the occupation probabilities of the individual states at time n are given by the formula: πn = π0 · P^n
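In R, πn = π0 · P^n is just repeated matrix multiplication. A minimal sketch with a made-up 3-state chain (not the 3×3 grid of the example):

```r
# Hypothetical 3-state transition matrix: every row sums to 1
P <- matrix(c(0.50, 0.50, 0.00,
              0.25, 0.50, 0.25,
              0.00, 0.50, 0.50),
            byrow = TRUE, ncol = 3)
pi0 <- rep(1/3, 3)                 # uniform initial distribution
pi1 <- pi0 %*% P                   # state distribution after one step
pin <- pi0
for (k in 1:100) pin <- pin %*% P  # pi_n for n = 100, near the stationary limit
```

For this toy chain the limit is (0.25, 0.5, 0.25): the middle state, which has the most incoming probability, is occupied most often, mirroring the centre cell of the grid.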
Under certain conditions the state distribution πn converges to the stationary distribution π. In our example with a finite state space, the Markov chain must be irreducible and aperiodic for this to happen.

What do the terms irreducibility and aperiodicity mean? For the type of Markov chain considered here, irreducibility holds if every state can be reached from every other state in finite time. Aperiodicity holds if the greatest common divisor of all return times to the original state is 1, and this must hold for every state. Let us check both conditions:

Our Markov chain is irreducible, since the ghosts can move from any state to any other state in finite time; thanks to the secret passage, at most three transitions are needed. The secret passage also makes the Markov chain aperiodic, because the monsters can move between any two states in both an even and an odd number of transitions, so the greatest common divisor of these return times is 1. Without the secret passage the chain would be periodic, because a transition from an even state to an even state (or from an odd state to an odd state) would then only be possible in an even number of steps, making the greatest common divisor 2.

The conditions for existence and uniqueness of the stationary distribution are therefore fulfilled. Because of irreducibility and aperiodicity there is exactly one stable stationary distribution, which the Markov chain attains after an infinitely long time. That is, after a long time the occupation probabilities of the individual states (almost) stop changing. For the same reason the matrix powers converge as well. But how can you compute the stationary distribution through statistical programming and simulation with the statistics software R? You will find out in the following two sections of this article.
Mathematical computation of the stationary distribution

As we have seen, a unique stationary (equilibrium) distribution exists. This section shows how to compute it mathematically. This tells you with which probability the monsters occupy which states (i.e. cells) in the long run. The stationary distribution of the Markov chain can be determined naively by plugging a large n into the system πn = π0 · P^n, because the matrix powers converge due to irreducibility and aperiodicity, as we established above. To obtain an analytical solution, solve the linear system π = π · P for π under the constraint that the entries of π sum to 1, so that we actually obtain a probability distribution. Plugging the naive solution into this system then serves as a check.

The system π = π · P without the constraint is equivalent to (P^T − I) · π^T = 0: the transition matrix is transposed and the identity matrix subtracted. The unknown vector of state probabilities is now a column vector. We therefore have to solve a linear system which, including the constraint, has one more equation than the Markov chain has states. With a large number of states it is very tedious to compute the solution by hand, so we carry out the statistical programming with the statistics software R. An alternative would be an MCMC simulation (MCMC stands for Markov chain Monte Carlo); a simulation is a sensible alternative when, for example, a stochastic process has so many states that the analytical computation becomes numerically too expensive. We skip it here, however, because our Markov chain has only 9 states.
How to compute the stationary distribution through statistical programming

First, we enter the transition matrix P into the statistics software R.
P = matrix(c(0,   1/2, 0,   1/2, 0,   0,   0,   0,   0,
             1/4, 0,   1/4, 0,   1/4, 0,   0,   1/4, 0,
             0,   1/2, 0,   0,   0,   1/2, 0,   0,   0,
             1/3, 0,   0,   0,   1/3, 0,   1/3, 0,   0,
             0,   1/4, 0,   1/4, 0,   1/4, 0,   1/4, 0,
             0,   0,   1/3, 0,   1/3, 0,   0,   0,   1/3,
             0,   0,   0,   1/2, 0,   0,   0,   1/2, 0,
             0,   1/4, 0,   0,   1/4, 0,   1/4, 0,   1/4,
             0,   0,   0,   0,   0,   1/2, 0,   1/2, 0),
           byrow = T, ncol = 9)
P

A transition matrix contains the transition probabilities as its entries, and these must be values between 0 and 1. This can be checked for each entry of the matrix individually,

P >= 0 & P <= 1

or one can display at once whether all entries are plausible.

all(P >= 0 & P <= 1)

In addition, the transition probabilities leaving each state should sum to 1.

apply(P, MARGIN = 1, FUN = sum)

The transition matrix P is transposed and the identity matrix subtracted.
P.T.minus.I = t(P) - diag(9)

Now it is important not to forget the constraint, since otherwise we obtain no result. The stationary distribution is a probability distribution, and as such the sum over all its states must equal 1. We therefore append a row of ones to the matrix P.T.minus.I.

P.T.minus.I.mit.Nebenbedingung = rbind(P.T.minus.I, 1)
P.T.minus.I.mit.Nebenbedingung

On the other side of the system of equations stands the zero vector, to which we have to append a one because of the constraint. To obtain a column vector, we use a one-column matrix as the data type.

Nullvektor = matrix(c(rep(0, 9), 1), ncol = 1)
Nullvektor

The linear system extended by the constraint is no longer square; it contains one more equation than it has variables. We therefore cannot use the standard command solve, but instead use the command Solve (with a capital S) from the package limSolve, which we must install first. After installation we can load the package with library(limSolve).
We pass the matrix and the vector into the function and obtain:

Gleichgewichtsverteilung = Solve(P.T.minus.I.mit.Nebenbedingung, Nullvektor)
Gleichgewichtsverteilung

Finally, we check whether we have indeed obtained a valid probability distribution:

Gleichgewichtsverteilung >= 0 & Gleichgewichtsverteilung <= 1
all(Gleichgewichtsverteilung >= 0 & Gleichgewichtsverteilung <= 1)
sum(Gleichgewichtsverteilung)

Solution of the stationary distribution

Solving the linear system with the statistics software R yields (in percent): π = (7.7, 15.4, 7.7, 11.5, 15.4, 11.5, 7.7, 15.4, 7.7)
The ghosts therefore spend the most time in the centre, less on the edge, and the least in the corners. The edge states 2 and 8 are an exception: because of the secret passage they are visited on average just as often as the central cell. The occupation probabilities of the states are proportional to the number of incoming arrows; the more one-step paths lead to a state, the more often that state is visited in the long run.
How do I use the Markov chain as a stochastic process in economics?

There are numerous applications of Markov chains in economics; one example is presented below. In stock trading one is often particularly interested in predicting how the market will develop. The situation on the stock exchange is divided into three main phases: bull market, bear market and stagnant market. The transition graph below contains the week-to-week transition probabilities between the three phases, where each phase always persists for at least one week. With the stationary distribution you can now compute with which probability the stock market is in which state in the long run.

Applying the Markov chain methodology described above to this economic example, the stock market spends, in the long run, 62.5% of weeks in a bull market, 31.25% of weeks in a bear market, and 6.25% of weeks in a stagnant market.
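The transition graph itself is not reproduced in this text, but the quoted long-run shares are those of the well-known textbook market example, so the computation can be sketched in base R assuming that standard matrix (and using base R's qr.solve in place of limSolve::Solve):

```r
# Assumed weekly transition matrix: rows/columns are bull, bear, stagnant
P <- matrix(c(0.90, 0.075, 0.025,
              0.15, 0.80,  0.05,
              0.25, 0.25,  0.50),
            byrow = TRUE, ncol = 3)

# Solve (P^T - I) pi = 0, with an extra row enforcing sum(pi) = 1
A <- rbind(t(P) - diag(3), 1)
b <- c(rep(0, 3), 1)
pi_stat <- qr.solve(A, b)  # least-squares solve of the overdetermined system
round(100 * pi_stat, 2)    # long-run percentages: bull, bear, stagnant
```

qr.solve handles the non-square system directly; since the constrained system is consistent, the least-squares solution is exact here.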
We hope this article has given you a good introduction to Markov chains and stationary distributions, and that you will be able to use them in the future to solve mathematical problems or business questions. Should you need professional support, the statisticians at Novustat are available for statistical programming and simulations.
The post “Die Markov Kette: Wie berechne ich die Gleichgewichtsverteilung?” first appeared on Statistik Service.
Faster Than Excel: Painlessly Merge Data into Actuarial Loss Development Triangles with R
(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)
The objective of this post is to create a method which easily combines loss runs, or listings of insurance claims, into triangles. Using only Excel, the common method is to create links between the Excel files, which must be updated manually for each new evaluation. This is prone to human error and is time-consuming. Using a script to merge the files first and then create a triangle saves time and is more consistent.
Example of the conventional linked Excel triangle:
For accident year 2016, all the diagonal values need to be linked to a data file that was last evaluated in 2016. For 2015, the links go to the 2015 file; for 2014, the 2014 file; etc. This is time-consuming.
See Wikipedia for a definition of Loss Development Triangles and why they are useful.
Methods

I use the packages plyr, dplyr, tidyr, and lubridate to manipulate the data once it is loaded into R. The package readxl reads the Excel files from the Windows file directory. It is worth noting that most of these packages are included in Hadley Wickham’s tidyverse package. The package ChainLadder has a variety of tools for actuarial reserve analysis.
library(dplyr)
library(readxl)
library(plyr)
library(tidyr)
library(ChainLadder)
library(lubridate)

Step 1: Organize the Excel Files

Copy all of the Excel files into the same Windows folder, inside the working directory of the R project. It is important to have *only* the files that are going to be loaded into R in this folder.
On my computer, my file structure is as follows:
C:/Users/Sam/Desktop/[R project working directory]/[folder with excel claims listing files to aggregate]/[file 1, file 2, file 3, etc]
Using the command list.files(), R looks in the given directory and returns a vector of all the file names it sees. For example, R sees the data files in the current working directory:
wd_path = "C:/Users/Sam/Desktop/projects/Excel Aggregate Loop/excel files"
list.files(wd_path)
## [1] "claims listing 2011.xlsx" "claims listing 2012.xlsx"
## [3] "claims listing 2013.xlsx" "claims listing 2014.xlsx"
## [5] "claims listing 2015.xlsx" "claims listing 2016.xlsx"

R can then loop through each of these files and perform any action. If there were 100 files, or 1000 files, in this directory, this would still work.
file_names = list.files(wd_path)
file_paths = paste(wd_path, "/", file_names, sep = "")
head(read_excel(file_paths[4]) %>% select(3,4,5))
## # A tibble: 4 x 6
##   name  `license number`   age `file year` `loss date` paid
## 1 jeff                 3    23        2014  2011-01-01  400
## 2 sue                  3    43        2014  2012-01-01  400
## 3 mark                 2    55        2014  2013-01-01  200
## 4 sarah                1   100        2014  2014-01-01  500

In order to evaluate the age of the losses, we need to take into account when each loss was evaluated. This is accomplished by going into Excel and adding a column for “file year”, which specifies the year of evaluation of the file. For instance, in the “claims listing 2013” file, all of the claims have “2013” in the “file year” column.
Step 2: Load the Data into R

Initialize a data frame which will store the aggregated loss run data from each of the Excel files. **The names of this data frame need to be the names of the Excel file columns which need to be aggregated.** For instance, these could be “reported”, “Paid Loss”, “Case Reserves”, or “Incurred Loss”. If the Excel files have different names for the same quantities (i.e., “Paid Loss” vs. “paid loss”), they should be renamed within Excel first.
merged_data = as_data_frame(matrix(nrow = 0, ncol = 3))
names(merged_data) = c("file year", "loss date", "paid")

Someone once said, “if you need to use a ‘for’ loop in R, then you are doing something wrong”. Vectorized functions are faster and easier to implement. The function excel_file_extractor below takes in the file name of the Excel file and returns a data frame with only the selected columns.
excel_file_extractor = function(cur_file_name){
  read_excel(cur_file_name[1], sheet = "Sheet1", col_names = T) %>%
    select(`file year`, `loss date`, paid) %>%
    rbind(merged_data)
}

Apply the function to all of the files in the folder that you created. Obviously, if you had 100 Excel files this would still work just as effectively.
From the plyr package, ldply takes in a list and returns a data frame. The way to remember this is by the ordering of the letters (“list”, “data frame”, “ply”). For example, if we wanted to read in a data frame and return a data frame, it would be ddply.
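A tiny illustration of the list-in, data-frame-out behaviour (using a made-up function f; plyr::ldply(1:3, f) returns the same rows as the base-R equivalent shown here):

```r
# Each call to f builds a one-row data frame; ldply row-binds the results
f <- function(i) data.frame(i = i, square = i^2)

# Base-R equivalent of plyr::ldply(1:3, f)
result <- do.call(rbind, lapply(1:3, f))
```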
loss_run_data = ldply(file_paths, excel_file_extractor)
names(loss_run_data) = c("file_year", "loss_date", "paid losses")

The data now has only the columns that we selected, even though the loss run files each contained different columns.
glimpse(loss_run_data)
## Variables: 3
## $ file_year   2011, 2012, 2012, 2013, 2013, 2013, 2014, 2014, 20...
## $ loss_date   2011-01-01, 2011-01-01, 2012-01-01, 2011-01-01, 2...
## $ paid losses 100, 200, 300, 300, 350, 100, 400, 400, 200, 500, ...

head(loss_run_data)
##   file_year  loss_date paid losses
## 1      2011 2011-01-01         100
## 2      2012 2011-01-01         200
## 3      2012 2012-01-01         300
## 4      2013 2011-01-01         300
## 5      2013 2012-01-01         350
## 6      2013 2013-01-01         100

Step 3: Create Development Triangles

Finally, once we have the loss runs combined, we just need to create a triangle. This is made easy by the `as.triangle` function from the `ChainLadder` package.
The only manual labor required in Excel was to go into each file and create the `file year` column, which was just the year of evaluation of each loss run file.
loss_run_data = loss_run_data %>%
  mutate(accident_year = as.numeric(year(ymd(loss_date))),
         maturity_in_months = (file_year - accident_year)*12)

merged_triangle = as.triangle(loss_run_data,
                              dev = "maturity_in_months",
                              origin = "accident_year",
                              value = "paid losses")
merged_triangle

Within the `ChainLadder` package is a plot function, which shows the development of losses by accident year. Because these are arbitrary amounts, the plot is not realistic.
plot(merged_triangle, main = "Paid Losses vs Maturity by Accident Year",
     xlab = "Maturity in Months", ylab = "Paid Losses")

Conclusion

When it comes to aggregating Excel files, R can be faster and more consistent than linking each of the Excel files together, and once this script is in place, modifications to the data can be made easily by editing the `excel_file_extractor` function.
Download the data and Rmd files for this article from my GitHub account.
Related Post
 The Mcomp Package
 Building a Hacker News scraper with 8 lines of R code using rvest library
 Building a Telecom Dictionary scraping web using rvest in R
 Scraping Javascriptrendered web content using R
 Analysing Cryptocurrency Market in R
Deep Learning from first principles in Python, R and Octave – Part 1
(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)
“You don’t perceive objects as they are. You perceive them as you are.”
“Your interpretation of physical objects has everything to do with the historical trajectory of your brain – and little to do with the objects themselves.”
“The brain generates its own reality, even before it receives information coming in from the eyes and the other senses. This is known as the internal model”
This is the first in a series of posts I intend to write on Deep Learning. This post is inspired by the Deep Learning Specialization by Prof Andrew Ng on Coursera and Neural Networks for Machine Learning by Prof Geoffrey Hinton, also on Coursera. In this post I implement Logistic Regression with a 2 layer Neural Network, i.e. a Neural Network that just has an input layer and an output layer, with no hidden layer. I am certain that any self-respecting Deep Learning/Neural Network practitioner would consider a Neural Network without hidden layers as no Neural Network at all!
This 2 layer network is implemented in Python, R and Octave. I have included Octave in the mix, as Octave is a close cousin of Matlab. The implementations in Python, R and Octave are equivalent vectorized implementations, so if you are familiar with any one of the languages, you should be able to follow the corresponding code in the other two. You can download this R Markdown file and the Octave code from DeepLearning Part 1.
To start with, Logistic Regression is performed using sklearn’s logistic regression package for the cancer data set also from sklearn. This is shown below
1. Logistic Regression

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification, make_blobs
from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer

# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
# Call the Logistic Regression function
clf = LogisticRegression().fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
      .format(clf.score(X_test, y_test)))
## Accuracy of Logistic regression classifier on training set: 0.96
## Accuracy of Logistic regression classifier on test set: 0.96

To check on other classification algorithms, see my post Practical Machine Learning with R and Python – Part 2. You could also take a look at my book, available on Amazon, in which I implement the most popular Machine Learning algorithms in both R and Python – My book ‘Practical Machine Learning with R and Python’ on Amazon.
2. Logistic Regression as a 2 layer Neural Network
In the following section Logistic Regression is implemented as a 2 layer Neural Network in Python, R and Octave. The same cancer data set from sklearn will be used to train and test the Neural Network in Python, R and Octave. This can be represented diagrammatically as below.
The cancer data set has 30 input features, and the target variable ‘output’ is either 0 or 1. Hence the sigmoid activation function will be used in the output layer for classification.
This simple 2 layer Neural Network is shown below
At the input layer there are 30 features, and the corresponding weights of these inputs are initialized to small random values. The output of the layer is

\[Z = w^{T}x + b\]

where \(b\) is the bias term.

The activation function is the sigmoid function, which is

\[a = \sigma(Z) = \frac{1}{1+e^{-Z}}\]

The Loss, when the sigmoid function is used in the output layer, is given by

\[L = -(y \log a + (1-y)\log(1-a)) \quad (1)\]
In the forward propagation cycle of the Neural Network, the output Z and the output of the activation function (the sigmoid) are first computed. Then, using the actual output ‘y’ for the given features, the ‘Loss’ is computed using equation (1) above.
Backward propagation
The backward propagation cycle determines how the ‘Loss’ is impacted by small variations in the layers, from the output back up to the input layer. In other words, backward propagation computes the changes in the weights at the input layer which will minimize the loss. Several cycles of gradient descent are performed along the path of steepest descent to find the local minima; that is, the set of weights and biases at the input layer which result in the lowest loss. In each cycle the weights at the input layer are adjusted by a fraction of the gradient, scaled by a parameter known as the ‘learning rate’. Too big a ‘learning rate’ can overshoot the local minima, and too small a ‘learning rate’ can take a long time to reach the local minima. This is done for ‘m’ training examples.
Chain rule of differentiation
Let \(y=f(u)\) and \(u=g(x)\); then

\[\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx}\]

Derivative of sigmoid
Let \(\sigma(z) = \frac{1}{1+e^{-z}}\). Using the chain rule of differentiation we get

\[\frac{d\sigma}{dz} = \sigma(z)(1-\sigma(z)) \quad (2)\]
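The sigmoid derivative in equation (2) is easy to verify numerically. The sketch below (pure Python, illustrative only) compares the analytic derivative sigma(z)*(1 - sigma(z)) against a central finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Analytic derivative from equation (2): sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Central finite-difference check at a few points
h = 1e-6
for z in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    assert abs(numeric - sigmoid_grad(z)) < 1e-6
```

The derivative peaks at z = 0, where sigma(0) = 0.5 and the slope is 0.25.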
The 3 equations for the 2 layer Neural Network representation of Logistic Regression are

\[Z = w^{T}X + b \quad (a)\]
\[A = \sigma(Z) \quad (b)\]
\[L = -(Y \log A + (1-Y)\log(1-A)) \quad (c)\]
The back propagation step requires the computation of \(\frac{\partial L}{\partial w}\) and \(\frac{\partial L}{\partial b}\). In the case of regression these would be \(\frac{\partial E}{\partial w}\) and \(\frac{\partial E}{\partial b}\), where \(E\) is the Mean Squared Error function.

Computing the derivatives for back propagation we have

\[\frac{\partial L}{\partial A} = -\frac{y}{A} + \frac{1-y}{1-A} \quad (d)\]

because \(L = -(y \log A + (1-y)\log(1-A))\).

Also from equation (2) we get

\[\frac{\partial A}{\partial Z} = A(1-A) \quad (e)\]

By chain rule

\[\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial A}\cdot\frac{\partial A}{\partial Z}\]

therefore, substituting the results of (d) & (e), we get

\[\frac{\partial L}{\partial Z} = A - y \quad (f)\]

Finally

\[\frac{\partial L}{\partial w} = \frac{\partial L}{\partial Z}\cdot\frac{\partial Z}{\partial w} \quad (g)\]

\[\frac{\partial Z}{\partial w} = x \quad (h)\]

and from (f) we have \(\frac{\partial L}{\partial Z} = A - y\). Therefore (g) reduces to

\[\frac{\partial L}{\partial w} = x(A-y) \quad (i)\]

Also

\[\frac{\partial L}{\partial b} = A-y \quad (j)\]

since \(\frac{\partial Z}{\partial b} = 1\).

Gradient descent updates the weights at the input layer and the corresponding bias using the values of \(\frac{\partial L}{\partial w}\) and \(\frac{\partial L}{\partial b}\).
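Putting equations (a)–(c) and the gradients (f), (i) and (j) together, one forward/backward pass fits in a few lines. The following is a minimal pure-Python sketch on made-up toy data (the vectorized R and Octave versions on the actual cancer data set follow in the next sections):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(w, b, X, Y):
    """One forward/backward pass of 2 layer logistic regression.
    X: list of feature vectors (one per sample), Y: list of 0/1 labels."""
    m = len(X)
    dw = [0.0] * len(w)
    db = 0.0
    loss = 0.0
    for x, y in zip(X, Y):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b            # (a) Z = w^T x + b
        a = sigmoid(z)                                          # (b) A = sigma(Z)
        loss += -(y * math.log(a) + (1 - y) * math.log(1 - a))  # (c) cross-entropy loss
        dz = a - y                                              # (f) dL/dZ = A - y
        for j, xj in enumerate(x):
            dw[j] += xj * dz / m                                # (i) dL/dw = x (A - y)
        db += dz / m                                            # (j) dL/db = A - y
    return loss / m, dw, db

# One gradient-descent step on toy data (made-up numbers, not the cancer set)
X = [[0.5, 1.0], [-1.0, 0.2], [1.5, -0.3]]
Y = [1, 0, 1]
w, b, lr = [0.0, 0.0], 0.0, 0.1
loss, dw, db = forward_backward(w, b, X, Y)
w = [wj - lr * dwj for wj, dwj in zip(w, dw)]  # weight update, scaled by the learning rate
b = b - lr * db
```

With all-zero weights every prediction starts at 0.5, so the initial average loss is exactly log 2; repeating the update step drives the loss down.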
I found the computation graph representation in the book Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville very useful to visualize and also compute the backward propagation. For the 2 layer Neural Network of Logistic Regression the computation graph is shown below.
Note: It can be seen that the accuracy on the training and test sets is 90.37% and 89.51% respectively. This is poorer than the 96% which sklearn’s logistic regression achieves! But this is mainly because of the absence of hidden layers, which are the real power of neural networks.
4. Neural Network for Logistic Regression R code (vectorized)

source("RFunctions-1.R")
# Define the sigmoid function
sigmoid <- function(z){
    a <- 1/(1 + exp(-z))
    a
}
# Compute the loss
computeLoss <- function(numTraining, Y, A){
    loss <- -1/numTraining * sum(Y*log(A) + (1-Y)*log(1-A))
    return(loss)
}
# Compute forward propagation
forwardPropagation <- function(w, b, X, Y){
    # Compute Z
    Z <- t(w) %*% X + b
    # Set the number of samples
    numTraining <- ncol(X)
    # Compute the activation function
    A <- sigmoid(Z)
    # Compute the loss
    loss <- computeLoss(numTraining, Y, A)
    # Compute the gradients dZ, dw and db
    dZ <- A - Y
    dw <- 1/numTraining * X %*% t(dZ)
    db <- 1/numTraining * sum(dZ)
    fwdProp <- list("loss" = loss, "dw" = dw, "db" = db)
    return(fwdProp)
}
# Perform one cycle of Gradient descent
gradientDescent <- function(w, b, X, Y, numIerations, learningRate){
    losses <- NULL
    idx <- NULL
    # Loop through the number of iterations
    for(i in 1:numIerations){
        fwdProp <- forwardPropagation(w, b, X, Y)
        # Get the derivatives
        dw <- fwdProp$dw
        db <- fwdProp$db
        # Perform gradient descent
        w <- w - learningRate*dw
        b <- b - learningRate*db
        l <- fwdProp$loss
        # Store the loss every 100 iterations
        if(i %% 100 == 0){
            idx <- c(idx, i)
            losses <- c(losses, l)
        }
    }
    # Return the weights and losses
    gradDescnt <- list("w"=w, "b"=b, "dw"=dw, "db"=db, "losses"=losses, "idx"=idx)
    return(gradDescnt)
}
# Compute the predicted value for input
predict <- function(w, b, X){
    m <- dim(X)[2]
    # Create a vector of 0's
    yPredicted <- matrix(rep(0, m), nrow=1, ncol=m)
    Z <- t(w) %*% X + b
    # Compute sigmoid
    A <- sigmoid(Z)
    for(i in 1:dim(A)[2]){
        # If A > 0.5 set value as 1
        if(A[1,i] > 0.5)
            yPredicted[1,i] <- 1
        else
            # Else set as 0
            yPredicted[1,i] <- 0
    }
    return(yPredicted)
}
# Normalize the matrix
normalize <- function(x){
    # Create the norm of the matrix: the Frobenius norm of each row
    n <- as.matrix(sqrt(rowSums(x^2)))
    # Sweep rows by the norm; note the '1' makes sweep operate on every row
    normalized <- sweep(x, 1, n, FUN="/")
    return(normalized)
}
# Run the 2 layer Neural Network on the cancer data set
# Read the data (from sklearn)
cancer <- read.csv("cancer.csv")
# Rename the target variable
names(cancer) <- c(seq(1,30), "output")
# Split as training and test sets
train_idx <- trainTestSplit(cancer, trainPercent=75, seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]
# Set the features
X_train <- train[, 1:30]
y_train <- train[, 31]
X_test <- test[, 1:30]
y_test <- test[, 31]
# Create a matrix of 0's with the number of features
w <- matrix(rep(0, dim(X_train)[2]))
b <- 0
X_train1 <- normalize(X_train)
X_train2 <- t(X_train1)
# Reshape then transpose
y_train1 <- as.matrix(y_train)
y_train2 <- t(y_train1)
# Perform gradient descent
gradDescent <- gradientDescent(w, b, X_train2, y_train2, numIerations=3000, learningRate=0.77)
# Normalize X_test
X_test1 <- normalize(X_test)
# Transpose X_test so that we have a matrix as (features, numSamples)
X_test2 <- t(X_test1)
# Reshape y_test and take transpose
y_test1 <- as.matrix(y_test)
y_test2 <- t(y_test1)
# Use the values of the weights generated from Gradient Descent
yPredictionTest <- predict(gradDescent$w, gradDescent$b, X_test2)
yPredictionTrain <- predict(gradDescent$w, gradDescent$b, X_train2)
sprintf("Train accuracy: %f", (100 - mean(abs(yPredictionTrain - y_train2)) * 100))
## [1] "Train accuracy: 90.845070"
sprintf("test accuracy: %f", (100 - mean(abs(yPredictionTest - y_test)) * 100))
## [1] "test accuracy: 87.323944"
df <- data.frame(gradDescent$idx, gradDescent$losses)
names(df) <- c("iterations", "losses")
ggplot(df, aes(x=iterations, y=losses)) + geom_point() + geom_line(col="blue") +
    ggtitle("Gradient Descent - Losses vs No of Iterations") +
    xlab("No of iterations") + ylab("Losses")

4. Neural Network for Logistic Regression Octave code (vectorized)
1;
# Define sigmoid function
function a = sigmoid(z)
a = 1 ./ (1 + exp(-z));
end
# Compute the loss
function loss=computeLoss(numtraining,Y,A)
loss = -1/numtraining * sum((Y .* log(A)) + (1-Y) .* log(1-A));
end
# Perform forward propagation
function [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y)
% Compute Z
Z = w' * X + b;
numtraining = size(X)(1,2);
# Compute sigmoid
A = sigmoid(Z);
#Compute loss. Note this is element wise product
loss =computeLoss(numtraining,Y,A);
# Compute the gradients dZ, dw and db
dZ = A - Y;
dw = 1/numtraining* X * dZ';
db =1/numtraining*sum(dZ);
end
# Compute Gradient Descent
function [w,b,dw,db,losses,index]=gradientDescent(w, b, X, Y, numIerations, learningRate)
#Initialize losses and idx
losses=[];
index=[];
# Loop through the number of iterations
for i=1:numIerations,
[loss,dw,db,dZ] = forwardPropagation(w,b,X,Y);
# Perform Gradient descent
w = w  learningRate*dw;
b = b  learningRate*db;
if(mod(i,100) ==0)
# Append index and loss
index = [index i];
losses = [losses loss];
endif
end
end
# Determine the predicted value for dataset
function yPredicted = predict(w,b,X)
m = size(X)(1,2);
yPredicted=zeros(1,m);
# Compute Z
Z = w' * X + b;
# Compute sigmoid
A = sigmoid(Z);
for i=1:size(X)(1,2),
# Set predicted as 1 if A >= 0.5
if(A(1,i) >= 0.5)
yPredicted(1,i)=1;
else
yPredicted(1,i)=0;
endif
end
end
# Normalize by dividing each value by the sum of squares
function normalized = normalize(x)
# Compute Frobenius norm. Square the elements, sum rows and then find square root
a = sqrt(sum(x .^ 2,2));
# Perform element wise division
normalized = x ./ a;
end
# Split into train and test sets
function [X_train,y_train,X_test,y_test] = trainTestSplit(dataset,trainPercent)
# Create a random index
ix = randperm(length(dataset));
# Split into training
trainSize = floor(trainPercent/100 * length(dataset));
train=dataset(ix(1:trainSize),:);
# And test
test=dataset(ix(trainSize+1:length(dataset)),:);
X_train = train(:,1:30);
y_train = train(:,31);
X_test = test(:,1:30);
y_test = test(:,31);
end
cancer=csvread("cancer.csv");
[X_train,y_train,X_test,y_test] = trainTestSplit(cancer,75);
w=zeros(size(X_train)(1,2),1);
b=0;
X_train1=normalize(X_train);
X_train2=X_train1';
y_train1=y_train';
[w1,b1,dw,db,losses,idx]=gradientDescent(w, b, X_train2, y_train1, numIerations=3000, learningRate=0.75);
# Normalize X_test
X_test1=normalize(X_test);
#Transpose X_train so that we have a matrix as (features, numSamples)
X_test2=X_test1';
y_test1=y_test';
# Use the values of the weights generated from Gradient Descent
yPredictionTest = predict(w1, b1, X_test2);
yPredictionTrain = predict(w1, b1, X_train2);
trainAccuracy = 100 - mean(abs(yPredictionTrain - y_train1))*100
testAccuracy = 100 - mean(abs(yPredictionTest - y_test1))*100
trainAccuracy = 90.845
testAccuracy = 89.510
graphics_toolkit('gnuplot')
plot(idx,losses);
title ('Gradient descent Cost vs No of iterations');
xlabel ("No of iterations");
ylabel ("Cost");
Conclusion
This post starts with a simple 2 layer Neural Network implementation of Logistic Regression. Clearly the performance of this simple Neural Network is poor compared to the highly optimized Logistic Regression of sklearn. This is because the above neural network does not have any hidden layers. Deep Learning & Neural Networks achieve extraordinary performance because of the presence of deep hidden layers.
The Deep Learning journey has begun… Don’t miss the bus!
Stay tuned for more interesting posts in Deep Learning!!
References
1. Deep Learning Specialization
2. Neural Networks for Machine Learning
3. Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville
4. Neural Networks: The mechanics of backpropagation
5. Machine Learning
Also see
1. My book ‘Practical Machine Learning with R and Python’ on Amazon
2. Simplifying Machine Learning: Bias, Variance, regularization and odd facts – Part 4
3. The 3rd paperback & kindle editions of my books on Cricket, now on Amazon
4. Practical Machine Learning with R and Python – Part 4
5. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
6. A Bluemix recipe with MongoDB and Node.js
7. My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)
To see all posts check Index of posts
To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….
Cheer up, Black Metal Cats! Bubblegum Puppies
(This article was first published on Maëlle, and kindly contributed to Rbloggers)
Do you know the Black Metal Cats Twitter account? As explained in this great introduction, it “combines kitties with heavy metal lyrics”. I know the account because I follow Scott Chamberlain who retweets them a lot, which I enjoy as far as one can enjoy such a dark mood. Speaking of which, I decided to try and transform Black Metal Cat tweets into something more positive… The Bubblegum Puppies were born!
Getting Black Metal Cats tweets
It won’t come as a surprise for the loyal readers of this blog that I just had to use rtweet. I kept only original standalone tweets and removed the picture link from the tweet.
black_tweets <- rtweet::get_timeline("evilbmcats")
black_tweets <- dplyr::filter(black_tweets, is.na(reply_to_user_id), !is_retweet, !is_quote)
black_tweets <- dplyr::select(black_tweets, text, created_at, status_id)
black_tweets <- dplyr::mutate(black_tweets, text = stringr::str_replace(text, "https.*", ""))
readr::write_csv(black_tweets, path = "data/20180103bubblegumpuppies_cats.csv")
Now that the dark material is ready, let’s sweeten it…
Modifying the tweet text
Black Metal Cats tweet heavy metal lyrics, so as you can imagine, they’re sad. How to make them happy while keeping the text similar enough to the original one? And this without too much effort? My simplistic strategy was to identify negative words via sentiment analysis and to replace them with positive words.
Note, had I not wanted to stay close to the original tweet, I could have just chosen lyrics picked from this dataset for instance, and filtered them by sentiment via the sentimentr package.
Finding negative words
I computed sentiment using tidytext, copy-pasting code from this post of mine.
library("tidytext")
library("magrittr")
bing <- get_sentiments("bing")
sentiment <- dplyr::select(black_tweets, text) %>%
  dplyr::mutate(saved_text = text) %>%
  unnest_tokens(word, text) %>%
  dplyr::inner_join(bing, by = "word") %>%
  dplyr::filter(sentiment == "negative")
It was a bit disappointing since out of the 97 tweets only 63 were represented in that data.frame. But well, this shall do! I just looked rapidly at some non-included tweets.
dplyr::filter(black_tweets, !text %in% sentiment$saved_text) %>%
  head() %>%
  knitr::kable()

text created_at status_id
Black covfefe cats. 2017-12-29 21:01:06 946848575446663168
Black coffee cats. 2017-12-28 00:15:48 946172798203916288
Say goodbye to the light. 2017-12-26 15:50:00 945683121336344576
The storm will never end. 2017-12-23 15:49:00 944595705204678656
Satan, your time has come. This world must end. 2017-12-18 22:36:00 942886191107575808
Timely. 2017-12-13 00:43:30 940743951287504896

Ok so some of them are probably negative tweets that the sentimentr package would help detect, but that do not contain negative words.
Replacing words
I seriously considered using the wordnet package because of this Stack Overflow question “Getting antonyms using the wordnet package” but I was not brave enough: my strength failed me in front of the Java needs of that package.
I decided to use praise to get positive words, and cleanNLP (as in this post) to try and identify correctly negative words as adjective or verbs for instance in order to be able to replace them. The right annotation for that is a token.
library("cleanNLP")
#init_spaCy()
get_token_with_text <- function(x){
  obj <- run_annotators(x, as_strings = TRUE)
  entities <- get_token(obj)
  entities$text <- x
  entities
}
possibly_get_tokens <- purrr::possibly(get_token_with_text, otherwise = NULL)
tokens <- purrr::map_df(sentiment$saved_text, possibly_get_tokens)
head(tokens) %>%
  knitr::kable()
Once here, I joined sentiment and tokens.
tokens <- dplyr::mutate(tokens, word = enc2native(word))
tokens <- dplyr::mutate(tokens, word = tolower(word))
tokens <- dplyr::left_join(sentiment, tokens, by = c("saved_text" = "text", "word"))
head(tokens) %>%
  knitr::kable()

saved_text word sentiment id sid tid lemma upos pos cid
We are the creatures you fear. fear negative 1 1 6 fear VERB VBP 25
In darkness we walk. darkness negative 1 1 2 darkness NOUN NN 3
Standing tall, side by side, we shall build thy throne upon the burning ashes. burning negative 1 1 15 burn VERB VBG 64
This is the dawn of the infernal reign. infernal negative 1 1 7 infernal ADJ JJ 24
A breath of the past so distant and so unreal. A soul condemned to haunt a frozen burial ground. condemned negative 1 2 3 condemn VERB VBD 54
A breath of the past so distant and so unreal. A soul condemned to haunt a frozen burial ground. condemned negative 1 2 3 condemn VERB VBD 54

The praise package provides adjectives that I’ll use to replace adjectives and to add an exclamation at the beginning of each text. Nouns, present and preterit verbs will be replaced with love/loves/loved because hey, this is pop music inspiration. I’ll lose capital letters and won’t bother too much; puppies probably do not care either. What I prepare below is what Hilary Parker called a dictionary table in this tweet. VBZ is for instance a verb in the present form like “haunts”.
modifiers <- tibble::tibble(pos = c("NN", "VBG", "JJ", "VBD", "VB",
                                    "NNS", "VBP", "VBN", "NNP", "VBZ"),
                            modifier = c("love", "${adjective}", "${adjective}",
                                         "loved", "love", "lovers", "love",
                                         "${adjective}", "love", "loves"))
knitr::kable(modifiers)

pos modifier
NN love
VBG ${adjective}
JJ ${adjective}
VBD loved
VB love
NNS lovers
VBP love
VBN ${adjective}
NNP love
VBZ loves

Now let’s get to work on the sweetening of the texts at last! Some occurrences of “hate” and “evil” remained, and I removed them by hand.
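The dictionary-table lookup amounts to: for each word flagged as negative, substitute a replacement chosen by its POS tag. A toy pure-Python sketch of the same idea (hand-tagged hypothetical tokens, not the cleanNLP output):

```python
# Toy version of the dictionary-table substitution: map a POS tag to a
# replacement, then rewrite the tokens flagged as negative.
modifiers = {
    "NN": "love", "NNS": "lovers", "JJ": "${adjective}",
    "VBD": "loved", "VBZ": "loves", "VBP": "love",
}

def sweeten(tokens):
    """tokens: list of (word, pos, is_negative) triples (made-up tagging)."""
    out = []
    for word, pos, negative in tokens:
        if negative and pos in modifiers:
            out.append(modifiers[pos])
        else:
            out.append(word)
    return " ".join(out)

tagged = [("we", "PRP", False), ("fear", "VBP", True),
          ("the", "DT", False), ("darkness", "NN", True)]
print(sweeten(tagged))  # we love the love
```

The "${adjective}" entries are left as praise templates, exactly as in the R dictionary table, so a later templating pass can fill in a random positive adjective.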
# praise has some randomness, let's make it reproducible
set.seed(42)
modifiable_tweets <- dplyr::left_join(tokens, black_tweets, by = c("saved_text" = "text"))
modifiable_tweets <- dplyr::left_join(modifiable_tweets, modifiers, by = "pos")
# for easier use with map
replace_all <- function(pattern, replacement, x){
  stringr::str_replace_all(x, pattern = pattern, replacement = replacement)
}
modified_tweets <- dplyr::group_by(modifiable_tweets, saved_text) %>%
  dplyr::summarize(praise_template = purrr::map2_chr(word, modifier, replace_all,
                                                     x = tolower(saved_text[1]))[1],
                   praise_template = paste("${exclamation}!", praise_template),
                   praise_template = stringr::str_replace_all(praise_template, "hate", "love"),
                   praise_template = stringr::str_replace_all(praise_template, "evil", "love"),
                   new_text = praise::praise(praise_template))
modified_tweets <- dplyr::select(modified_tweets, saved_text, new_text)
puppies_and_cats <- dplyr::left_join(modified_tweets, black_tweets, by = c("saved_text" = "text"))
head(puppies_and_cats) %>%
  knitr::kable()

saved_text new_text created_at status_id
A breath of the past so distant and so unreal. A soul condemned to haunt a frozen burial ground. yikes! a breath of the past so distant and so unreal. a soul loved to haunt a frozen burial ground. 2017-12-28 20:35:00 946479619020115968
Be very afraid. yippie! be very fine. 2017-11-18 21:02:00 931990900523307008
Beware of the cat. yay! love of the cat. 2017-12-11 14:54:02 940233218204172288
Beware the legions of Satan, they’re ready for attack! ole! love the legions of satan, they’re ready for attack! 2017-12-05 22:48:00 938178169399422976
Born by the nothingness into eternal blasphemy. mm! born by the nothingness into eternal love. 2017-12-03 04:15:01 937173299448221696
Born on a day of curses and damnation, I was doomed to hate. whoa! born on a day of lovers and damnation, i was doomed to love. 2017-12-04 21:40:01 937798672347226113

So although some new lyrics do not look that cheerful, they’re at least grammatically correct.
Modifying the picture
I recently discovered Pexels, a website with CC0 pictures, that I even learnt to scrape for an unpublished (yet) project. So many photographs you can use and modify for free without any attribution! Quite cool, really. To scrape the page I first had to scroll down to get enough pictures, which I did following this Stack Overflow thread with RSelenium. I tried using seleniumPipes instead but had trouble setting up the server and not too much time to dwell on that.
Yes, you got it right, the code below automatically downloads pics of puppies into your laptop. Happy New Year.
library("rvest")
library("RSelenium")
library("magrittr")
# https://stackoverflow.com/questions/29861117/r-rvest-scraping-a-dynamic-ecommerce-page
rD <- rsDriver()
remDr <- rD[["client"]]
# open the webpage
remDr$navigate("https://www.pexels.com/search/puppies/")
# scroll down
for(i in 1:30){
  remDr$executeScript(paste("scroll(0,", i*10000, ");"), args = list("dummy"))
  # be nice and wait
  Sys.sleep(1)
}
page_content <- remDr$getPageSource()
remDr$close()
# functions for getting the pic links
get_link_from_src <- function(node){
  xml2::xml_attrs(node)["src"] %>%
    as.character() %>%
    stringr::str_replace("\\?h.*", "")
}
xtract_pic_links <- function(source) {
  css <- '.photo-item__img'
  read_html(source[[1]]) %>%
    html_nodes(css) %>%
    purrr::map_chr(get_link_from_src)
}
links <- xtract_pic_links(page_content)
links <- links[1:nrow(puppies_and_cats)]
# save
dir.create("data/puppies")
save_pic <- function(url, name){
  Sys.sleep(1)
  name <- paste0("puppy", name, ".png")
  try(magick::image_read(url) %>%
        magick::image_write(paste0("data/puppies/", name)),
      silent = TRUE)
}
purrr::walk2(links, 1:nrow(puppies_and_cats), save_pic)
I took a moment to browse the 66 pics. Carpe diem, carpe canes (I learnt Latin for 6 years as a teenager and had to check the plural of canem…).
The pics were a bit too big so I resized them.
resize <- function(pic){
  magick::image_read(pic) %>%
    magick::image_resize("300x300") %>%
    magick::image_write(pic)
}
purrr::walk(dir("data/puppies/", full.names = TRUE), resize)
Tweeting the cheerful tweets – DON’T DO IT LIKE I DID!
I created a Twitter account for the Bubblegum Puppies so that they could interact with the Black Metal Cats in their natural environment. I was asked by Scott whether I’d make a bot out of my idea and I don’t intend to, especially since my simplistic strategy can only answer tweets with negative words detected, but I’d be glad to let someone adopt the account, especially since I did not educate my puppies well to begin with as you see below! Here is a tutorial for making a Twitter bot with rtweet. In the meantime, I simply decided to tweet the answers I had created… without enough thinking.
Note: It had been a long time since I last obtained Twitter access tokens, so I referred to the vignette.
Then I wrote the code to send replies, an important point being that you need to mention the username you answer to in your tweet, otherwise the in_reply_to_status_id argument is ignored. It was dangerous code because it sent too many replies, which made it a spam bot… Don’t do that. So if you ever need an example of a very stupid spammer, think of me. Lesson learnt, I’ll never be that reckless again, because I do not want to spam anyone.
post_a_reply <- function(df, pic){
  rtweet::post_tweet(paste("@evilbmcats", df$new_text),
                     in_reply_to_status_id = df$status_id,
                     media = pic)
  Sys.sleep(1)
}
pics <- dir("data/puppies/", full.names = TRUE)
purrr::walk2(split(puppies_and_cats, puppies_and_cats$status_id), pics, post_a_reply)
So how would I use the tweets if I could do it again? Well, I’d post them with a faaar bigger delay. And I think the account would be fun as a bot which’d reply only when the cats account tweets again. That way the puppies would be cute, not cute and annoying. Live and learn! Thanks to Bob Rudis for encouraging me to post this!
Ending with some cuteness
And now, because I do not want you to think I’m now as depressed as a Black Metal Cat, I’ll end this post by showing you a few replies, thanks to the brand new rtweet::tweet_shot function added in the dev version of rtweet by Bob Rudis after he saw my #best9of2017 post. I resorted to saving the files and adding the Markdown code to show them by hand, but in a normal Rmd (not in a website) the images (magick objects) actually render very well. Head to Twitter to see the rest!
save_one_tweet <- function(status_id){
  rtweet::tweet_shot(status_id) %>%
    magick::image_write(paste0("data/20180104", status_id, ".png"))
}
c("948942337912328192", "948942293272334336", "948942240713531392") %>%
  purrr::walk(save_one_tweet)
To leave a comment for the author, please follow the link and comment on their blog: Maëlle.
Introduction to Kurtosis
(This article was first published on R Views, and kindly contributed to Rbloggers)
Happy 2018 and welcome to our first reproducible finance post of the year! What better way to ring in a new beginning than pondering/calculating/visualizing returns distributions.
We ended 2017 by tackling skewness, and we will begin 2018 by tackling kurtosis.
Kurtosis is a measure of the degree to which portfolio returns appear in the tails of our distribution. A normal distribution has a kurtosis of 3, which follows from the fact that a normal distribution does have some of its mass in its tails. A distribution with a kurtosis greater than 3 has more returns out in its tails than the normal, and one with kurtosis less than 3 has fewer returns in its tails than the normal. That matters to investors because more returns out in the tails means that our portfolio might be at risk of a rare but huge downside. The terminology is a bit confusing: negative kurtosis is considered less risky because it means fewer returns out in the tails. Negative == less risky? We’re not used to that in finance.
Kurtosis often has the word ‘excess’ appended to its description, as in ‘negative excess kurtosis’ or ‘positive excess kurtosis’. That ‘excess’ is in comparison to the normal distribution’s kurtosis of 3. A distribution with negative excess kurtosis of -1 has an actual kurtosis of 2.
Enough with the faux investopedia entry, let’s get to the calculations, R code and visualizations.
Here’s the equation for excess kurtosis. Note that we subtract 3 at the end:
\[Kurtosis=\sum_{i=1}^n (x_i-\overline{x})^4/n \bigg/ \left(\sum_{i=1}^n (x_i-\overline{x})^2/n\right)^{2} - 3 \]
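As a quick sanity check on this formula, here is a tiny pure-Python version on toy numbers (illustrative only, independent of the R code in this post):

```python
def excess_kurtosis(xs):
    """Excess kurtosis: m4 / m2^2 - 3, using population moments (divide by n)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n   # second central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

# A short arithmetic sequence is flat-topped, so its excess kurtosis is negative
print(excess_kurtosis([1, 2, 3, 4, 5]))  # ≈ -1.3
```

A thin-tailed, flat sample like this lands well below zero, while a sample with a few extreme outliers pushes the value above zero.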
By way of reminder, we will be working with our usual portfolio consisting of:
+ SPY (S&P500 fund) weighted 25%
+ EFA (a non-US equities fund) weighted 25%
+ IJS (a small-cap value fund) weighted 20%
+ EEM (an emerging-mkts fund) weighted 20%
+ AGG (a bond fund) weighted 10%

Before we can calculate kurtosis, we need to find portfolio monthly returns, which was covered in this post.
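As a reminder of the weighting arithmetic, a portfolio's monthly return is just the weighted sum of the asset returns. A pure-Python sketch with made-up monthly returns (not the actual fund data):

```python
# The portfolio weights from the post
weights = {"SPY": 0.25, "EFA": 0.25, "IJS": 0.20, "EEM": 0.20, "AGG": 0.10}

def portfolio_return(asset_returns, weights):
    """Weighted sum of one month's asset returns."""
    return sum(weights[a] * r for a, r in asset_returns.items())

# Made-up monthly returns, for illustration only
month = {"SPY": 0.02, "EFA": 0.01, "IJS": -0.01, "EEM": 0.03, "AGG": 0.004}
r = portfolio_return(month, weights)
print(round(r, 4))  # 0.0119
```

Doing this for every month produces the series whose tails the kurtosis calculation below examines.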
Building off that previous work, we will be working with two objects of portfolio returns:
+ portfolio_returns_xts_rebalanced_monthly (an xts of monthly returns)
+ portfolio_returns_tq_rebalanced_monthly (a tibble of monthly returns)

Now we are going to test our past self’s work on skewness, and reuse that code flow to expedite the kurtosis work. The logic will remain the same, but we will call different built-in functions and different by-hand calculations.
For the xts world, we use the kurtosis() function instead of the skewness() function.
kurt_xts <- kurtosis(portfolio_returns_xts_rebalanced_monthly$returns)
kurt_xts
## [1] 0.5267736
For tidy, we have the same piped flow and use the formula for kurtosis in our by-hand calculations. Our by-hand result is labeled kurt_byhand, and involves quite a few parentheticals to map it back to the kurtosis equation above.
kurt_tidy <- portfolio_returns_tq_rebalanced_monthly %>%
  summarise(
    kurt_builtin = kurtosis(returns),
    kurt_byhand =
      ((sum((returns - mean(returns))^4)/length(returns))/
       ((sum((returns - mean(returns))^2)/length(returns))^2)) - 3) %>%
  select(kurt_builtin, kurt_byhand)
kurt_xts
## [1] 0.5267736
kurt_tidy$kurt_builtin
## [1] 0.5267736
kurt_tidy$kurt_byhand
## [1] 0.5267736
We have consistent results from the xts and tidy built-in/by-hand worlds, and we were able to reuse our code from above to shorten the development time here.
Let’s do the same with the visualizations and head straight for a density plot, starting with the same portfolio_density_plot.
portfolio_density_plot <- portfolio_returns_tq_rebalanced_monthly %>%
  ggplot(aes(x = returns)) +
  stat_density(geom = "line", alpha = 1, colour = "cornflowerblue")
portfolio_density_plot
We are interested in both tails for kurtosis, so let’s shade at 2 standard deviations above and below the mean return (for our skewness work, we only shaded the negative tail).
mean <- mean(portfolio_returns_tq_rebalanced_monthly$returns)
sd_pos <- mean + (2 * sd(portfolio_returns_tq_rebalanced_monthly$returns))
sd_neg <- mean - (2 * sd(portfolio_returns_tq_rebalanced_monthly$returns))
sd_pos_shaded_area <- ggplot_build(portfolio_density_plot)$data[[1]] %>%
  filter(x > sd_pos)
sd_neg_shaded_area <- ggplot_build(portfolio_density_plot)$data[[1]] %>%
  filter(x < sd_neg)
portfolio_density_plot <- portfolio_density_plot +
  geom_area(data = sd_pos_shaded_area, aes(x = x, y = y), fill = "pink", alpha = 0.5) +
  geom_area(data = sd_neg_shaded_area, aes(x = x, y = y), fill = "pink", alpha = 0.5) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10))
portfolio_density_plot
That density chart is a good look at the mass in both tails, where we have defined ‘tail’ as being two standard deviations away from the mean. We can add a line for the mean, as we did in our skewness visualization, with the following.
```r
mean <- mean(portfolio_returns_tq_rebalanced_monthly$returns)

mean_line_data <-
  ggplot_build(portfolio_density_plot)$data[[1]] %>%
  filter(x <= mean)

portfolio_density_plot +
  geom_segment(data = mean_line_data,
               aes(x = mean, y = 0, xend = mean, yend = density),
               color = "black", linetype = "dotted") +
  annotate(geom = "text", x = mean, y = 5, label = "mean",
           fontface = "plain", angle = 90, alpha = .8, vjust = 1.75)
```

Finally, we can calculate and chart the rolling kurtosis with the same logic we used for skewness. The only difference is that here we call `fun = kurtosis` instead of `fun = skewness`.
```r
window <- 6

rolling_kurt_xts <-
  na.omit(apply.rolling(portfolio_returns_xts_rebalanced_monthly,
                        window, fun = kurtosis))
```

Now we pop that xts object into highcharter for a visualization.
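To make the rolling logic concrete, here is a base-R sketch (not from the original post) of what the rolling calculation is doing under the hood: slide a 6-period window across the return series and apply the excess-kurtosis formula to each window. The toy `returns` vector stands in for the real xts returns object, and `excess_kurtosis` is our own illustrative helper.

```r
# Illustrative helper implementing the by-hand excess-kurtosis formula.
excess_kurtosis <- function(x) {
  n <- length(x)
  (sum((x - mean(x))^4) / n) / ((sum((x - mean(x))^2) / n)^2) - 3
}

window <- 6

# Toy stand-in for two years of monthly portfolio returns.
set.seed(123)
returns <- rnorm(24, mean = 0.005, sd = 0.03)

# One kurtosis value per complete 6-period window.
rolling_kurt <- sapply(window:length(returns), function(i) {
  excess_kurtosis(returns[(i - window + 1):i])
})

length(rolling_kurt)  # 24 - 6 + 1 = 19 complete windows
```

`apply.rolling()` handles the windowing, the xts index, and the leading `NA`s for us; the sketch just shows the arithmetic being repeated per window.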
```r
highchart(type = "stock") %>%
  hc_title(text = "Rolling Kurt") %>%
  hc_add_series(rolling_kurt_xts, name = "Rolling kurtosis",
                color = "cornflowerblue") %>%
  hc_yAxis(title = list(text = "kurtosis"),
           opposite = FALSE,
           max = .03, min = -.03) %>%
  hc_navigator(enabled = FALSE) %>%
  hc_scrollbar(enabled = FALSE)
```

(Interactive highcharter chart: “Rolling Kurt”, the rolling 6-month kurtosis of portfolio returns from mid-2013 through early 2018.)
We didn’t cover this before, but what if we wanted to use ggplot for the rolling kurtosis? We could convert that xts object to a tibble with tk_tbl() from the timetk package, and then pipe straight to ggplot.
```r
rolling_kurt_xts %>%
  tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
  rename(rolling_kurtosis = calcs) %>%
  ggplot(aes(x = date, y = rolling_kurtosis)) +
  geom_line(color = "cornflowerblue") +
  xlab("date") +
  ylab("rolling kurtosis") +
  ggtitle("Rolling Kurtosis")
```

Interestingly, this portfolio has displayed slightly positive rolling excess kurtosis for most of its life, except during the last half of 2015 through early 2016.
That’s all for today. Our work on kurtosis was made a lot more efficient by our work on skewness – so let’s thank our 2017 selves for constructing a reproducible and reusable code flow! See you next time.
This post first appeared on R Views: https://rviews.rstudio.com/2018/01/04/introduction-to-kurtosis/