
In case you missed it: April 2018 roundup

Wed, 05/09/2018 - 20:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from April of particular interest to R users.

Microsoft R Open 3.4.4, based on R 3.4.4, is now available.

An R script by Ryan Timpe converts a photo into instructions for rendering it as LEGO bricks.

R functions to build a random maze in Minecraft, and have your avatar solve the maze automatically.

A dive into some of the internal changes bringing performance improvements to the new R 3.5.0.

AI, Machine Learning and Data Science Roundup, April 2018

An analysis with R shows that Uber has overtaken taxis for trips in New York City.

News from the R Consortium: new projects, results from a survey on licenses, and R-Ladies is promoted to a top-level project.

A talk, aimed at Artificial Intelligence developers, making the case for using R.

Bob Rudis analyzes data from the R-bloggers.com website, and lists the top 10 authors.

An R-based implementation of Silicon Valley's "Not Hotdog" application.

An R package for creating interesting designs with generative algorithms.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.



Flow charts in R

Wed, 05/09/2018 - 18:29

(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

Flow charts are an important part of a clinical trial report. Making them can be a pain though. One good way to do it seems to be with the grid and Gmisc packages in R. X and Y coordinates can be designated based on the center of the boxes in normalized device coordinates (proportions of the device space – 0.5 is the middle), which saves a lot of messing around with corners of boxes and arrows.

A very basic flow chart, based very roughly on the CONSORT version, can be generated as follows…

library(grid)
library(Gmisc)

grid.newpage()

# set some parameters to use repeatedly
leftx <- .25
midx <- .5
rightx <- .75
width <- .4
gp <- gpar(fill = "lightgrey")

# create boxes
(total <- boxGrob("Total\n N = NNN", x=midx, y=.9, box_gp = gp, width = width))
(rando <- boxGrob("Randomized\n N = NNN", x=midx, y=.75, box_gp = gp, width = width))
# connect boxes like this
connectGrob(total, rando, "v")

(inel <- boxGrob("Ineligible\n N = NNN", x=rightx, y=.825, box_gp = gp, width = .25, height = .05))
connectGrob(total, inel, "-")

(g1 <- boxGrob("Allocated to Group 1\n N = NNN", x=leftx, y=.5, box_gp = gp, width = width))
(g2 <- boxGrob("Allocated to Group 2\n N = NNN", x=rightx, y=.5, box_gp = gp, width = width))
connectGrob(rando, g1, "N")
connectGrob(rando, g2, "N")

(g11 <- boxGrob("Followed up\n N = NNN", x=leftx, y=.3, box_gp = gp, width = width))
(g21 <- boxGrob("Followed up\n N = NNN", x=rightx, y=.3, box_gp = gp, width = width))
connectGrob(g1, g11, "N")
connectGrob(g2, g21, "N")

(g12 <- boxGrob("Completed\n N = NNN", x=leftx, y=.1, box_gp = gp, width = width))
(g22 <- boxGrob("Completed\n N = NNN", x=rightx, y=.1, box_gp = gp, width = width))
connectGrob(g11, g12, "N")
connectGrob(g21, g22, "N")

Sections of code to make the boxes are wrapped in brackets to print them immediately. The code creates something like the following figure:

For detailed info, see the Gmisc vignette. This code is also on github.
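As an aside, the bracket trick mentioned above is plain R: wrapping an assignment in parentheses both assigns and prints in one step. A minimal illustration:

x <- rnorm(3)    # assigns silently
(x <- rnorm(3))  # assigns and prints the value immediately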

 

 



10 PRINT mazes with ggplot2

Wed, 05/09/2018 - 07:00

(This article was first published on Higher Order Functions, and kindly contributed to R-bloggers)

There is a celebrated Commodore 64 program that randomly prints out / and \
characters and fills the screen with neat-looking maze designs. It is just one
line of code, but there is a whole book written about
it.

10 PRINT CHR$(205.5+RND(1)); : GOTO 10

Screenshots of the 10 PRINT program in action. Images taken from the 10 PRINT book.

The basic idea, from my reading of the code, is that CHR$(205) is \,
CHR$(206) is /, and the program randomly selects between the two by adding a
random number to 205.5. Endlessly looping over this command fills the screen
with that pleasing maze pattern.
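As a quick check of that arithmetic, here is a minimal R sketch (the 205 → \ and 206 → / mapping follows the reading above): truncating 205.5 + RND(1) gives 205 or 206 with equal probability.

# Simulate CHR$(205.5 + RND(1)): truncation gives 205 or 206 with equal chance
codes <- floor(205.5 + runif(10000))
table(codes)
# Map the two codes to characters, as read from the program above
cat(c(`205` = "\\", `206` = "/")[as.character(codes[1:40])], sep = "")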

In R, we can replicate this functionality by randomly sampling the
slashes:

sample_n_slashes <- function(n) { sample(c("/", "\\"), size = n, replace = TRUE) } withr::with_options( list(width = 40), cat(sample_n_slashes(800), sep = "", fill = TRUE) ) #> //////\/\\/\/\/\/\\//\\\\//\//\/\//\///\ #> \\\/\//\\/\/////////////\\/\//\\\/\//\\\ #> /\\/\/\//\/\/\\/\\/\/\//\\\\//\/\\\/\/// #> \\//\/\\\/\///\\/\/\/////\//\///\/\/\//\ #> /\/\//\///\//\//\//\/\//\\/\\\\\/\\\//\/ #> //\\\\\//////\//\//\/\//\\////\/\\\/\/\/ #> \\////\/\\/\////\//////\\/\\/////\/////\ #> /\/\/\/\/\\///\/\\\\/\/\\//\/\\/\\\\\\// #> //\\\/\///\///\/\\\//\//\\\\//\\/\\/\/\/ #> /\\/\//\\//\/\\\\\/\\\/\\\/\/\/\/\////\/ #> /\//\\//\////\\\///\//\/\\\/\\///\//\\\\ #> /\//\\////\/\\\//\\\//\/\///\\\\\/\/\\// #> \\////\\/\\\\\\///\///\//\\\/\\\\//\//// #> \\\///\/\//\//\/\//\/\/\\\/\/\\///\///// #> \\/\\//\\\\\\///\\\\/\\/\\//\//\\\/\\/\/ #> ////\\\////\/\\\\\/////\/\\\////\\///\\\ #> \//\\\//\///\/\\\\//\\\////\\//\\/\/\//\ #> \/\//\\//\\///\\\\\\//\///\/\\\\\\\\/\\/ #> ///\/\//\\/\/\\\\\\\\/\///\//\\///\\//\\ #> /////\\///\/\\/\/\\//\\//\/\\/\//\//\\\\

where withr::with_options() lets us temporarily change the print width and
cat() concatenates the slashes and prints out the characters as text.

We can also make this much prettier by drawing the patterns using ggplot2.

Drawing line segments with ggplot2

Instead of writing out slashes, we will draw a grid of diagonal line segments,
some of which will be flipped at random. To draw a segment, we need a starting
xy coordinate and an ending xy coordinate. geom_segment() will
connect the two coordinates with a line. Here’s a small example where we draw
four “slashes”.

library(ggplot2) library(dplyr) data <- tibble::tribble( ~row, ~col, ~x_start, ~x_end, ~y_start, ~y_end, 1, 1, 0, 1, 0, 1, 1, 2, 1, 2, 1, 0, # flipped 2, 1, 0, 1, 1, 2, 2, 2, 1, 2, 1, 2) ggplot(data) + aes(x = x_start, xend = x_end, y = y_start, yend = y_end) + geom_segment()

The programming task now is to make a giant grid of these slashes. Let’s start
with an observation: To draw two slashes, we needed three x
values: 0, 1, 2. The first two served as segment starts and the last two
as segment ends. The same idea applies to the y values. We can generate a
bunch of starts and ends by taking a sequence of steps and removing the first
and last elements.

# We want a 20 by 20 grid rows <- 20 cols <- 20 x_points <- seq(0, 1, length.out = cols + 1) x_starts <- head(x_points, -1) x_ends <- tail(x_points, -1) y_points <- seq(0, 1, length.out = rows + 1) y_starts <- head(y_points, -1) y_ends <- tail(y_points, -1)

Each x_starts–x_ends pair is a column in the grid, and each
y_starts–y_ends is a row in the grid. To make a slash at each
row–column combination, we have to map out all the combinations of the rows
and columns. We can do this with crossing() which creates all crossed
combinations of values. (If it helps, you might think of crossing like crossed
experiments or the Cartesian product of sets.)

grid <- tidyr::crossing( # columns data_frame(x_start = x_starts, x_end = x_ends), # rows data_frame(y_start = y_starts, y_end = y_ends)) %>% # So values move left to right, bottom to top arrange(y_start, y_end) # 400 rows because 20 rows x 20 columns grid #> # A tibble: 400 x 4 #> x_start x_end y_start y_end #> #> 1 0 0.05 0 0.05 #> 2 0.05 0.1 0 0.05 #> 3 0.1 0.15 0 0.05 #> 4 0.15 0.2 0 0.05 #> 5 0.2 0.25 0 0.05 #> 6 0.25 0.3 0 0.05 #> 7 0.3 0.35 0 0.05 #> 8 0.35 0.4 0 0.05 #> 9 0.4 0.45 0 0.05 #> 10 0.45 0.5 0 0.05 #> # ... with 390 more rows

We can confirm that the segments in the grid fill out a plot. (I
randomly color the line segments to make individual ones visible.)

ggplot(grid) + aes( x = x_start, y = y_start, xend = x_end, yend = y_end, color = runif(400)) + geom_segment(size = 1) + guides(color = FALSE)

Finally, we need to flip slashes at random. A segment becomes flipped if the
y_start and y_end are swapped. In the code below, we flip the slash in each
row if a randomly drawn number between 0 and 1 is less than .5. For style, we
also use theme_void() to strip away the plotting theme, leaving us with just
the maze design.

p_flip <- .5 grid <- grid %>% arrange(y_start, y_end) %>% mutate( p_flip = p_flip, flip = runif(length(y_end)) <= p_flip, y_temp1 = y_start, y_temp2 = y_end, y_start = ifelse(flip, y_temp2, y_temp1), y_end = ifelse(flip, y_temp1, y_temp2)) %>% select(x_start:y_end, p_flip) ggplot(grid) + aes(x = x_start, y = y_start, xend = x_end, yend = y_end) + geom_segment(size = 1, color = "grey20") last_plot() + theme_void()

Now, we wrap all these steps together into a pair of functions.

make_10_print_data <- function(rows = 20, cols = 20, p_flip = .5) {
  x_points <- seq(0, 1, length.out = cols + 1)
  x_starts <- head(x_points, -1)
  x_ends <- tail(x_points, -1)

  y_points <- seq(0, 1, length.out = rows + 1)
  y_starts <- head(y_points, -1)
  y_ends <- tail(y_points, -1)

  grid <- tidyr::crossing(
    data.frame(x_start = x_starts, x_end = x_ends),
    data.frame(y_start = y_starts, y_end = y_ends))

  grid %>%
    arrange(y_start, y_end) %>%
    mutate(
      p_flip = p_flip,
      flip = runif(length(y_end)) <= p_flip,
      y_temp1 = y_start,
      y_temp2 = y_end,
      y_start = ifelse(flip, y_temp2, y_temp1),
      y_end = ifelse(flip, y_temp1, y_temp2)) %>%
    select(x_start:y_end, p_flip)
}

draw_10_print <- function(rows = 20, cols = 20, p_flip = .5) {
  grid <- make_10_print_data(rows = rows, cols = cols, p_flip = p_flip)

  ggplot(grid) +
    aes(x = x_start, y = y_start, xend = x_end, yend = y_end) +
    geom_segment(size = 1, color = "grey20")
}

Now the fun part: custom flipping probabilities

We can vary the probability of flipping the slashes. For example, we can use the
density of a normal distribution to make flipping more likely for central values
and less likely for more extreme values.

xs <- seq(0, 1, length.out = 40) p_flip <- dnorm(seq(-4, 4, length.out = 40)) ggplot(data.frame(x = xs, y = p_flip)) + aes(x, y) + geom_line() + labs( x = "x position", y = "p(flipping)", title = "normal density") # We repeat p_flip for each row of the grid draw_10_print(rows = 40, cols = 40, p_flip = rep(p_flip, 40)) + theme_void()

We can use the cumulative density of the normal distribution so that
flipping becomes more likely as x increases.

xs <- seq(0, 1, length.out = 40) p_flip <- pnorm(seq(-4, 4, length.out = 40)) ggplot(data.frame(x = xs, y = p_flip)) + aes(x, y) + geom_line() + labs( x = "x position", y = "p(flipping)", title = "cumulative normal") draw_10_print(rows = 40, cols = 40, p_flip = rep(p_flip, 40)) + theme_void()

The Cauchy distribution is said to have “thicker” tails than the normal
distribution, so here it shows more flips on the left and right extremes.

xs <- seq(0, 1, length.out = 40) p_flip <- dcauchy(seq(-4, 4, length.out = 40)) ggplot(data.frame(x = xs, y = p_flip)) + aes(x, y) + geom_line() + labs( x = "x position", y = "p(flipping)", title = "Cauchy density") draw_10_print(rows = 40, cols = 40, p_flip = rep(p_flip, 40)) + theme_void()

The exponential distribution is a spike that quickly peters out. We can make a
probability “bowl” by splicing an exponential and a reversed exponential
together.

# Use flipped exponential densities as probabilities p_flip <- c(dexp(seq(0, 4, length.out = 20)), dexp(seq(4, 0, length.out = 20))) ggplot(data.frame(x = xs, y = p_flip)) + aes(x, y) + geom_line() + labs( x = "x position", y = "p(flipping)", title = "exponential + flipped exponential") draw_10_print(rows = 40, cols = 40, p = rep(p_flip, 40)) + theme_void()

We might have the probabilities increase by 10% from row to row. In the code
below, I use a simple loop to boost some random probability values by 10% from
row to row. This gives us nice streaks in the grid as a column starts to flip
for every row.

boost_probs <- function(p_flip, nrows, factor = 1.1) { output <- p_flip for (i in seq_len(nrows - 1)) { p_flip <- p_flip * factor output <- c(output, p_flip) } output } draw_10_print(cols = 40, rows = 40, p = boost_probs(runif(40), 40, 1.1)) + theme_void()

The probabilities can be anything we like. Here I compute the frequency of
English alphabet letters as they appear in Pride and Prejudice and base the
flipping probability on those values.

char_counts <- janeaustenr::prideprejudice %>% tolower() %>% stringr::str_split("") %>% unlist() %>% table() letter_counts <- char_counts[letters] %>% as.vector() p_letter <- letter_counts / sum(letter_counts) ggplot(data.frame(x = letters, y = p_letter)) + aes(x, y, label = x) + geom_text() + labs( x = NULL, y = "p(letter)", title = "letter frequencies in Pride and Prejudice")

draw_10_print(cols = 26, rows = 80, p = rep(p_letter, 80)) + theme_void()



RStudio Connect v1.6.2

Wed, 05/09/2018 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

RStudio Connect version 1.6.2 is now available!

There are a handful of new features that are highlighted below. We encourage
you to upgrade as soon as possible!

Recommend Building R from Source

If you have installed R using apt-get, yum, or some other package manager,
we recommend that you install R from source instead. This protects you from
application breakages when the system version
of R is upgraded. We have updated our documentation to reflect these best
practices concerning R administration for use with Connect.

Installing R from source allows installing multiple versions of R side by side,
and allows content to persist as published without risk of breaking during an
upgrade to the version of R. This also allows publishers to publish to a
version of R that more closely approximates their development environment.

User Filtering

For Connect implementations with many users, we have added features to the User
page that allow administrators to filter users by various states. Last release,
we added the ability for administrators to filter between users that are
counting against the Connect license and those that are inactive. This release,
we also exposed the ability for administrators to filter inactive users into
those that were manually locked and those that are not counting against your
license due to inactivity.

Other Changes

Connect Server API

The Connect Server API was introduced in Connect 1.6.0. This release, we added
an endpoint called /audit_logs where audit logs can be accessed and paged. There
is an example of how to use the API in the Connect Server API Cookbook.
Stay tuned as we add endpoints to the Connect Server API to allow programmatic
interaction with the features of RStudio Connect.
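For illustration, here is a hedged sketch of calling that endpoint from R with httr; the server URL is a placeholder, the API key is read from an environment variable, and the exact path and header conventions should be checked against the Connect Server API Cookbook.

library(httr)

connect_server <- "https://connect.example.com"      # placeholder server URL
api_key <- Sys.getenv("CONNECT_API_KEY")             # API key from the environment

resp <- GET(
  paste0(connect_server, "/__api__/v1/audit_logs"),  # assumed endpoint path
  add_headers(Authorization = paste("Key", api_key))
)
stop_for_status(resp)
audit <- content(resp, as = "parsed")
str(audit, max.level = 1)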

Content Filtering

Content searching and content filtering state now persists when a user navigates
to a new page and then returns. To return to the default content screen, the
user can select “Reset all filters.”

Email Customization

In the last few releases, Connect has added features that allow publishers of
RMarkdown documents to customize the email output of scheduled RMarkdown
reports. In this release, publishers gain the ability to optionally suppress
attachments or suppress email output altogether.

These options can be set in the YAML header for an R Markdown document, but are
probably more useful when set inline by editing rmarkdown::output_metadata.
For instance, in financial market analysis, I might decide that variance is not
significant enough to warrant an email update, and thereby ensure email updates
only occur when there is critical information to deliver.
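As an illustration of that inline approach, here is a hedged sketch of a chunk in a scheduled report; the option names rsc_email_suppress_scheduled and rsc_email_suppress_report_attachment follow the Connect documentation, and compute_variance() and the 0.05 threshold are made up for the example.

# Inside a code chunk of the scheduled R Markdown report
variance <- compute_variance()   # hypothetical helper returning the metric of interest

if (variance < 0.05) {
  # Nothing noteworthy this run: suppress the scheduled email entirely
  rmarkdown::output_metadata$set(rsc_email_suppress_scheduled = TRUE)
} else {
  # Send the email, but without the report attachment
  rmarkdown::output_metadata$set(rsc_email_suppress_report_attachment = TRUE)
}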

New Icons

Connect previously used Gravatar icons. We have
changed this to standard monogram icons, which should fit better with offline
installations and other enterprise applications.

Upgrade Planning

There are no special precautions to be aware of when upgrading from v1.6.0 to
v1.6.2. You can expect the installation and startup of v1.6.2 to be complete in
under a minute.

If you’re upgrading from a release older than v1.6.0, be sure to consider the
“Upgrade Planning” notes from the intervening releases, as well.

If you haven’t yet had a chance to download and try RStudio Connect, we
encourage you to do so.
RStudio Connect is the best way to share all the work that you do in R (Shiny
apps, R Markdown documents, plots, dashboards, Plumber APIs, etc.) with
collaborators, colleagues, or customers.

You can find more details or download a 45-day evaluation of the product at
https://www.rstudio.com/products/connect/.
Additional resources can be found below.



Harry Potter and competition results with comperes

Wed, 05/09/2018 - 02:00

(This article was first published on QuestionFlow , and kindly contributed to R-bloggers)

Exploration of Harry Potter Books Survey results with the help of my new comperes package.

Prologue

About a month ago I decided to add an interesting data set to my almost finished (hopefully, soon to be on CRAN) comperes package. The data should represent the results of some not-so-ordinary competition. After some thought I picked a “competition” between Harry Potter books, with the eventual goal of rating them from worst to best. After a series of events I ended up creating the data myself. You can read more about that in my previous post.

The post and survey were popularized mostly among R users via R-bloggers (which gave me ~53 respondents), Twitter (which added the rest) and Reddit (which added ~0 people, as the post was deleted soon after publication). In total the survey managed to attract 182 respondents. I want to greatly thank everyone who took the time to take part in and spread the word about my survey. Special thanks go to Mara Averick, who started a Twitter wave.

This post has two goals:

  • Present and explore results of the survey.
  • Demonstrate basic functionality of comperes package. To learn more go to its README and vignettes.
Overview

Survey results can be obtained by installing the development version of the comperes package from GitHub. They are present as package data named hp_survey.
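A minimal sketch of getting the data, assuming the development repository lives at echasnovski/comperes on GitHub:

# install.packages("devtools")
devtools::install_github("echasnovski/comperes")
library(comperes)
data("hp_survey", package = "comperes")
dplyr::glimpse(hp_survey)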

This post will cover the following topics:

  • Exploration of survey results (most important being Book scores section).
  • Description of comperes competition results formats, with conversion of hp_survey to one of them.
  • Head-to-Head “performance” of books against each other.

We will need the following setup:

library(dplyr)
library(tidyr)
library(rlang)
library(stringr)
library(ggplot2)
library(comperes)

set.seed(201805)
theme_set(theme_bw())

# Authenticity palette
hp_pal <- c(Gryff = "#D02037", Huffl = "#F0C346", Raven = "#2450A8",
            Raven_light = "#0088FF", Slyth = "#09774A")

# For less noisy bar charts
theme_bar <- function() {
  list(theme(panel.grid.major.x = element_blank(),
             panel.grid.minor.x = element_blank()))
}

Exploration

Data preparation

hp_survey is a tibble (enhanced data frame) and has the following columns:

  • person : Identifier of a person.
  • book : Identifier of a Harry Potter book. Its values are of the form “HP_x” where “x” represents book’s number in the series (from 1 to 7).
  • score : Book’s score. Can be one of “1 – Poor”, “2 – Fair”, “3 – Good”, “4 – Very Good”, “5 – Excellent”.

For exploration, let’s transform hp_survey for more expressive code and results:

  • Convert scores to numerical.
  • Add book names.
book_names <- c(
  "Philosopher's (Sorcerer's) Stone (#1)",
  "Chamber of Secrets (#2)",
  "Prisoner of Azkaban (#3)",
  "Goblet of Fire (#4)",
  "Order of the Phoenix (#5)",
  "Half-Blood Prince (#6)",
  "Deathly Hallows (#7)"
)
book_name_tbl <- tibble(
  book = paste0("HP_", 1:7),
  book_name = factor(book_names, levels = book_names)
)

hp <- hp_survey %>%
  # Extract numerical score
  rename(score_chr = score) %>%
  mutate(score = as.integer(gsub("[^0-9].*$", "", score_chr))) %>%
  # Add book names
  left_join(y = book_name_tbl, by = "book")

hp
## # A tibble: 657 x 5
##   person book  score_chr     score book_name
##
## 1      1 HP_6  5 - Excellent     5 Half-Blood Prince (#6)
## 2      1 HP_7  5 - Excellent     5 Deathly Hallows (#7)
## 3      2 HP_1  3 - Good          3 Philosopher's (Sorcerer's) Stone (#1)
## 4      2 HP_4  5 - Excellent     5 Goblet of Fire (#4)
## 5      2 HP_5  2 - Fair          2 Order of the Phoenix (#5)
## # ... with 652 more rows

Subset uniformity

The first step in the survey was to choose the first element in a randomly shuffled list, to simulate generation of a random subset of all books. Each of the 127 list elements was connected to one subset. Let’s visualize subset frequency to check the good faith of respondents:

# Compute subset representations hp_subsets <- hp %>% arrange(person, book) %>% group_by(person) %>% summarise(subset = paste0(book, collapse = "-")) # Compute the number of actually picked subsets n_distinct(hp_subsets$subset) ## [1] 95 # Visualize hp_subsets %>% ggplot(aes(subset)) + geom_bar(fill = hp_pal["Gryff"]) + labs( x = "Subset", y = "Number of times subset was picked", title = "Picked subsets have fairly uniform distribution" ) + scale_x_discrete(labels = NULL) + theme_bar() + theme(axis.ticks.x = element_blank())

So there are 95 subsets actually picked and their distribution seems reasonably uniform. This is enough for me to confirm that randomization for subsets was successful.

Book presence

Another important thing to explore is the number of times each book was actually rated:

hp %>% ggplot(aes(book_name)) + geom_bar(fill = hp_pal["Huffl"]) + # Cool way to wrap labels for a given width scale_x_discrete(labels = function(x) str_wrap(x, width = 15)) + labs( x = "", y = "Number of present scores", title = "Some books were rated more times than others", subtitle = "But it seems fine" ) + theme_bar()

Book scores

The most obvious way to summarise a book’s “performance” is its mean score on the numerical representation of the scale. Using the mean is not harmful in this study, as no outliers can be present.

hp_book_score <- hp %>% group_by(book_name) %>% summarise(mean_score = round(mean(score), digits = 2)) %>% arrange(desc(mean_score)) hp_book_score ## # A tibble: 7 x 2 ## book_name mean_score ## ## 1 Prisoner of Azkaban (#3) 4.19 ## 2 Half-Blood Prince (#6) 4.13 ## 3 Goblet of Fire (#4) 4.00 ## 4 Deathly Hallows (#7) 3.96 ## 5 Philosopher's (Sorcerer's) Stone (#1) 3.91 ## 6 Order of the Phoenix (#5) 3.90 ## 7 Chamber of Secrets (#2) 3.55

So, “the best” book seems to be “Harry Potter and the Prisoner of Azkaban (#3)”.

For a better understanding of the results, let’s also visualize the score distributions.

hp %>% # Compute share of score per book count(book_name, score) %>% group_by(book_name) %>% mutate(share = n / sum(n)) %>% ungroup() %>% # Visualize ggplot() + geom_col( aes(score, share, colour = score, fill = score), show.legend = FALSE ) + geom_text( data = hp_book_score, mapping = aes(label = paste0("Mean = ", mean_score)), x = -Inf, y = Inf, hjust = -0.05, vjust = 1.3 ) + facet_wrap(~ book_name) + scale_x_continuous( breaks = 1:5, labels = c("1\nPoor", "2\nFair", "3\nGood", "4\nVery\nGood", "5\nExcellent") ) + scale_fill_gradient(low = hp_pal["Raven"], high = hp_pal["Raven_light"]) + scale_colour_gradient(low = hp_pal["Raven"], high = hp_pal["Raven_light"]) + labs( x = "", y = "Score share per book", title = '"Prisoner of Azkaban (#3)" seems to be "the best" HP book', caption = "@echasnovski" ) + theme_bar()

Competition results

Formats of comperes

The understanding of a competition here is quite general: it is a set of games (abstract events) in which players (abstract entities) gain some abstract scores (typically numeric). Within a game all players are treated equally. The most natural example is sports results, but it is not the only one. For example, product ratings can be considered a competition between products as “players”. Here a “game” is a customer who reviews a set of products by rating them with a numerical “score” (stars, points, etc.).

In the case of the Harry Potter Books Survey results, a “game” is the act of a respondent taking part in the survey, a “player” is a Harry Potter book, and a “score” is the discrete scale value converted to a numerical score from 1 to 5.

In comperes there are two supported formats of competition results:

  • Long format. It is the most abstract way of presenting competition results. Basically, it is a data frame (or tibble) with columns game (game identifier), player (player identifier) and score, where each row represents the score of a particular player in a particular game. One game can consist of a variable number of players, which makes this format more usable. Extra columns are allowed.
  • Wide format is a more convenient way to store results with a fixed number of players in a game. Each row represents the scores of all players in a particular game. Data should be organized in pairs of “player”–“score” columns. The identifier of a pair should go after the respective keyword and consist only of digits. For example: player1, score1, player2, score2. Order doesn’t matter. The column game is optional. Extra columns are also allowed.

Programmatically these formats are implemented as S3 classes longcr and widecr respectively. Essentially, they are tibbles with a fixed structure. Objects of these classes should be created using the functions as_longcr() and as_widecr(), which also convert between the formats.

Conversion

hp_survey presents results in long format.

hp_cr <- hp_survey %>% transmute( game = person, player = book, score = as.integer(gsub("[^0-9].*$", "", score)) ) %>% as_longcr() hp_cr ## # A longcr object: ## # A tibble: 657 x 3 ## game player score ## ## 1 1 HP_6 5 ## 2 1 HP_7 5 ## 3 2 HP_1 3 ## 4 2 HP_4 5 ## 5 2 HP_5 2 ## # ... with 652 more rows

Here is a demonstration of conversion to wide format. It detects the maximum number of players in a game, which is 7, and assumes that data is missing in games with fewer players.

as_widecr(hp_cr)
## # A widecr object:
## # A tibble: 182 x 15
##   game player1 score1 player2 score2 player3 score3 player4 score4
##
## 1    1 HP_6         5 HP_7         5         NA             NA
## 2    2 HP_1         3 HP_4         5 HP_5         2 HP_6         4
## 3    3 HP_1         3 HP_3         4 HP_5         1         NA
## 4    4 HP_6         5 HP_7         5         NA             NA
## 5    5 HP_4         4 HP_5         3         NA             NA
## # ... with 177 more rows, and 6 more variables: player5 ,
## #   score5 , player6 , score6 , player7 , score7

Head-to-Head

Functionality of comperes

A Head-to-Head value is a summary statistic of direct confrontation between two players. It is assumed that this value can be computed based only on the players’ matchups (results for ordered pairs of players from one game). In other words, every game is converted into a series of “subgames” between ordered pairs of players (including self-play), which is stored as a widecr object. After that, a summary for each pair, defined by columns player1 and player2, is computed.

comperes has function get_matchups() for computing matchups:

get_matchups(hp_cr) ## # A widecr object: ## # A tibble: 2,697 x 5 ## game player1 score1 player2 score2 ## ## 1 1 HP_6 5 HP_6 5 ## 2 1 HP_6 5 HP_7 5 ## 3 1 HP_7 5 HP_6 5 ## 4 1 HP_7 5 HP_7 5 ## 5 2 HP_1 3 HP_1 3 ## # ... with 2,692 more rows

To compute multiple Head-to-Head values, use h2h_long() supplying competition results and summarizing expressions in dplyr::summarise() fashion. They will be applied to a data frame of matchups.

hp_cr_h2h <- hp_cr %>% h2h_long( # Number of macthups n = n(), # Number of wins plus half the number of ties # num_wins() is a function from comperes to compute number of times # first score is bigger than second one num_wins = num_wins(score1, score2, half_for_draw = TRUE), # Mean rating of a book scored in matchups with other books mean_score = mean(score1), # Mean rating difference of books scored in direct matchups mean_score_diff = mean(score1 - score2) ) %>% mutate_if(is.numeric, funs(round(., 2))) hp_cr_h2h ## # A long format of Head-to-Head values: ## # A tibble: 49 x 6 ## player1 player2 n num_wins mean_score mean_score_diff ## ## 1 HP_1 HP_1 88. 44.0 3.91 0. ## 2 HP_1 HP_2 42. 29.5 3.88 0.500 ## 3 HP_1 HP_3 51. 19.5 3.92 -0.390 ## 4 HP_1 HP_4 48. 24.0 3.79 0.0400 ## 5 HP_1 HP_5 42. 21.5 3.79 0. ## # ... with 44 more rows

So here we see, for example, that HP_1 and HP_2 had 42 matchups, i.e. they were rated by the same person 42 times. HP_1 “won” 29.5 times (respecting ties), gained a mean score of 3.88 in those matchups and scored, on average, 0.5 points more.

There is also an h2h_mat() function which computes a matrix of Head-to-Head values for one expression.

hp_cr %>% h2h_mat(num_wins(score1, score2, half_for_draw = TRUE)) ## # A matrix format of Head-to-Head values: ## HP_1 HP_2 HP_3 HP_4 HP_5 HP_6 HP_7 ## HP_1 44.0 29.5 19.5 24.0 21.5 17.0 24.0 ## HP_2 12.5 40.0 12.0 11.5 10.5 12.0 19.0 ## HP_3 31.5 32.0 49.0 31.5 28.0 25.0 33.5 ## HP_4 24.0 33.5 26.5 49.5 23.5 30.5 31.5 ## HP_5 20.5 25.5 15.0 24.5 42.0 23.0 24.5 ## HP_6 25.0 30.0 20.0 27.5 24.0 50.0 34.0 ## HP_7 26.0 34.0 21.5 29.5 25.5 26.0 54.0

For more convenient usage, comperes has a list h2h_funs of some common Head-to-Head functions stored as expressions. To use them you need a little bit of rlang’s unquoting magic.

h2h_funs[1:3]
## $mean_score_diff
## mean(score1 - score2)
##
## $mean_score_diff_pos
## max(mean(score1 - score2), 0)
##
## $mean_score
## mean(score1)

hp_cr %>% h2h_long(!!! h2h_funs)
## # A long format of Head-to-Head values:
## # A tibble: 49 x 11
##   player1 player2 mean_score_diff mean_score_diff_pos mean_score
##
## 1 HP_1    HP_1             0.                  0.           3.91
## 2 HP_1    HP_2             0.500               0.500        3.88
## 3 HP_1    HP_3            -0.392               0.           3.92
## 4 HP_1    HP_4             0.0417              0.0417       3.79
## 5 HP_1    HP_5             0.                  0.           3.79
## # ... with 44 more rows, and 6 more variables: sum_score_diff ,
## #   sum_score_diff_pos , sum_score , num_wins ,
## #   num_wins2 , num

Harry Potter books

Head-to-Head “performance” of Harry Potter books is summarised in the following plot:

hp_cr_h2h %>% gather(h2h_fun, value, -player1, -player2) %>% # Manually produce a dummy colour variable to use in facets group_by(h2h_fun) %>% mutate(col = (value - min(value)) / (max(value) - min(value))) %>% ungroup() %>% # Make factors for correct orders mutate( player1 = factor(player1, levels = rev(sort(unique(player1)))), player2 = factor(player2, levels = sort(unique(player2))), h2h_fun = factor(h2h_fun, levels = c("n", "num_wins", "mean_score", "mean_score_diff")), h2h_fun = recode( h2h_fun, n = "Number of matchups (ratings by common person)", num_wins = 'Number of "wins" in matchups (half for ties)', mean_score = "Mean score in matchups", mean_score_diff = "Mean score difference in matchups" ) ) %>% # Visualize ggplot(aes(player1, player2)) + geom_text( aes(label = value, colour = col), size = 5, fontface = "bold", show.legend = FALSE ) + facet_wrap(~ h2h_fun, scales = "free") + # To coordinate well with matrix form of Head-to-Head results coord_flip() + scale_colour_gradient(low = hp_pal["Slyth"], high = hp_pal["Gryff"]) + labs( x = "", y = "", title = "Head-to-Head performance of Harry Potter books", subtitle = paste0( '"HP_x" means Harry Potter book number "x" in series\n', "Numbers are Head-to-Head values of book in row against book in column" ), caption = "@echasnovski" ) + theme_classic() + theme(strip.text = element_text(face = "bold"))

There is a lot of information hidden in this plot. The most obvious discoveries:

  • It happened that book “HP_7” (“Deathly Hallows”) was rated with “HP_4” (“Goblet of Fire”) by one person the most: 61 times.
  • “HP_7” scored over “HP_2” (“Chamber of Secrets”) the most wins (34, half for ties) as did “HP_6” (“Half-Blood Prince”) over “HP_7”.
  • Book “HP_6” made the highest mean score of 4.36 in matchups with “HP_2”, which is 0.23 higher than its overall mean score.
  • In terms of score differences, “HP_3” (“Prisoner of Azkaban”) did best in matchups with “HP_2”, scoring on average 0.77 points more. This pair also represents “the best” and “the worst” books in terms of mean score.
Conclusion
  • A public call for help in creating a data set for an R package shouldn’t be made on Reddit, but rather on R-bloggers or Twitter.
  • Among all original Harry Potter books, “Harry Potter and the Prisoner of Azkaban” seems to be considered “best” among R users. “Harry Potter and the Chamber of Secrets” suffers the opposite fate.
  • Package comperes is useful for storing, manipulating and summarising abstract competition results.
  • However informative, manually inspecting competition results with direct summaries and Head-to-Head tables is hard, even though they can display the complex nature of performance relations between players. The next analysis of the Harry Potter Books Survey data will use my package comperank, which implements different ranking methods for automatic discovery of players’ performance.

sessionInfo()

sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
##
## locale:
##  [1] LC_CTYPE=ru_UA.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=ru_UA.UTF-8        LC_COLLATE=ru_UA.UTF-8
##  [5] LC_MONETARY=ru_UA.UTF-8    LC_MESSAGES=ru_UA.UTF-8
##  [7] LC_PAPER=ru_UA.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base
##
## other attached packages:
## [1] bindrcpp_0.2.2   comperes_0.2.0   ggplot2_2.2.1    stringr_1.3.0
## [5] rlang_0.2.0      tidyr_0.8.0.9000 dplyr_0.7.5.9000
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.16     pillar_1.2.1     compiler_3.4.4   plyr_1.8.4
##  [5] bindr_0.1.1      tools_3.4.4      digest_0.6.15    evaluate_0.10.1
##  [9] tibble_1.4.2     gtable_0.2.0     pkgconfig_2.0.1  cli_1.0.0
## [13] yaml_2.1.17      blogdown_0.5     xfun_0.1         knitr_1.20
## [17] rprojroot_1.3-2  grid_3.4.4       tidyselect_0.2.4 glue_1.2.0
## [21] R6_2.2.2         rmarkdown_1.9    bookdown_0.7     purrr_0.2.4
## [25] magrittr_1.5     backports_1.1.2  scales_0.5.0     htmltools_0.3.6
## [29] assertthat_0.2.0 colorspace_1.3-2 labeling_0.3     utf8_1.1.3
## [33] stringi_1.1.6    lazyeval_0.2.1   munsell_0.4.3    crayon_1.3.4


R Courses in Hamburg

Wed, 05/09/2018 - 02:00

(This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers)

Big news: from the 13th until the 27th of June, Jumping Rivers will be running 6 courses on R in Hamburg! It should be noted that each course runs for one day, apart from the Predictive Analytics course, which runs for 2 days. The courses are as follows:

Introduction to R – 13th

For this course you need no prior programming knowledge of any type! You will learn how to efficiently manipulate, plot, import and export data using R.

Mastering the Tidyverse – 14th

A course on the suite of R packages known as the tidyverse. The tidyverse is essential for any statistician or data scientist who deals with data on a day-to-day basis. By focusing on small key tasks, the tidyverse suite of packages removes the pain of data manipulation. You will learn how to use packages at the forefront of cutting edge data analytics such as dplyr and tidyr. Learn how to efficiently manage dates and times with lubridate and learn how to import and export data from anything from csv and excel files to SAS and SPSS files using the fantastic readr, readxl and foreign packages.

Next Steps in the Tidyverse – 15th

After mastering only half the potential of the tidyverse in Mastering the Tidyverse, this course will cover the other tidyverse essentials. This includes the broom package, for tidying up statistical outputs. Tired of writing for loops? Look no further, the purrr package provides a complete and consistent set of tools for working with functions and vectors. Learn how to use stringr, regular expressions and tidytext, providing you with the tools for complex string and text analysis.

Automated Reporting (First Steps Towards Shiny) – 20th

Tired of having to write a new report for every single data set? With R Markdown and knitr you can build interactive reports, documents and dashboards that update when your data updates!

Interactive Graphics with Shiny – 21st

The R package Shiny allows you to create cutting edge, interactive web graphics and applications that react to your data in a new and innovative way. Not only that, but shiny provides a platform where you can host your web applications for free!

Predictive Analytics – 27/28th

You will learn about popular analytical techniques practised in industry today such as simple regression, clustering, discriminant analysis, random forests, splines and many more. Of course by the end of the day you will be able to use R to apply these methods to your data, often in one line of code!

For any more info, pay us a visit at jumpingrivers.com or feel free to send me an email at theo@jumpingrivers.com

Cheers!



Open-Source Machine Learning in Azure

Tue, 05/08/2018 - 18:28

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The topic for my talk at the Microsoft Build conference yesterday was "Migrating Existing Open Source Machine Learning to Azure". The idea behind the talk was to show how you can take the open-source tools and workflows you already use for machine learning and data science, and easily transition them to the Azure cloud to take advantage of its capacity and scale. The theme for the talk was "no surprises", and other than the Azure-specific elements I tried to stick to standard OSS tools rather than Microsoft-specific things, to make the process as familiar as possible.

In the talk I covered:

  • Using Visual Studio Code as a cross-platform, open-source editor and interface to Azure services
  • Using the Azure CLI to script the deployment, functions, and deletion of resources in the Azure cloud
  • Using the range of data science and machine learning tools included in the Data Science Virtual Machine on Ubuntu Linux
  • Using Python and Tensorflow to train a deep neural network on a GPU cluster with Azure Batch AI
  • Running sparklyr with RStudio on a Spark cluster using aztk

It's not quite the same without the demos (and there's no recording available just yet), but you may find the notes and references in the slides below useful.



Tidying messy Excel data (tidyxl)

Tue, 05/08/2018 - 18:15

(This article was first published on R – Stat Bandit, and kindly contributed to R-bloggers)

Reposted from Abhijit’s blog. Some <- have been replaced by = due to idiosyncrasies of the WordPress platform.

Well, here’s what I was dealing with:

Exemplar Excel file from collaborator

Notice that we have 3 header rows, first with patient IDs, second with spine region, and third with variable names (A and B, to protect the innocent).

Goal

A dataset that, for each patient and each angle gives us corresponding values of A and B. So this would be a four-column data set with ID, angle, A and B.

Attempt 1 (readxl) d1 <- readxl::read_excel('spreadsheet1.xlsx') head(d1)

## # A tibble: 6 x 26
## X__1 patient `44` `44__1` `10` `10__1` `3` `3__1` `53` `53__1`
##
## 1 IDS T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 T5/T6
## 2 angles A B A B A B A B
## 3 60 31.83… 1 31.52… 1 32.9… 0 31.8… 0
## 4 65 31.66… 1 31.33… 1 32.2… 0 32.3… 0
## 5 70 31.45… 1 31.09… 0.20200… 31.7… 0 32.5… 0
## 6 75 31.08… 1 30.96… 0.44831… 31.2… 8.641… 32.3… 1
## # … with 16 more variables: `2` , `2__1` `8` ,
## # `8__1` , `6` , `6__1` , `43` , `43__1` ,
## # `48` , `48__1` , `46` , `46__1` , `4` ,
## # `4__1` , `9` , `9__1`

This strategy gives us funky column names, and pushes two of the headers into data rows. Since the headers are in rows, they’re a little harder to extract and work with. More worrisome is the fact that since the headers leaked into the data rows, the columns are all of type character rather than type numeric, which would now require further careful conversion after cleaning. So I don’t think readxl is the way to go here, if there’s a better solution.

Attempt 2 (tidyxl)

d2 <- tidyxl::xlsx_cells('spreadsheet1.xlsx')
head(d2)

## # A tibble: 6 x 21
## sheet address row col is_blank data_type error logical numeric
##
## 1 T5T6 B1 1 2 FALSE character NA NA
## 2 T5T6 C1 1 3 FALSE numeric NA 44.
## 3 T5T6 D1 1 4 FALSE numeric NA 44.
## 4 T5T6 E1 1 5 FALSE numeric NA 10.
## 5 T5T6 F1 1 6 FALSE numeric NA 10.
## 6 T5T6 G1 1 7 FALSE numeric NA 3.
## # … with 12 more variables: date , character ,
## # character_formatted , formula , is_array ,
## # formula_ref , formula_group , comment , height ,
## # width , style_format , local_format_id

The xlsx_cells captures the data in a tidy fashion, explicitly calling out rows and columns and other metadata within each cell. We can clean up this data using tidyverse functions:

library(tidyverse) cleanData1 = function(d) { angle = d %>% filter(row >= 4, col == 1) %>% pull(numeric) name = d %>% filter(row %in% c(1,3), col >= 3) %>% mutate(character = ifelse(is.na(character), as.character(numeric), character)) %>% select(row, col, character) %>% filter(!is.na(character)) %>% spread(row, character) %>% unite(ID, `1`:`3`, sep = '_') %>% pull(ID) data = d %>% filter(row >= 4, col >= 3) %>% filter(!is.na(numeric)) %>% select(row, col, numeric) %>% spread(col, numeric) %>% select(-row) %>% set_names(name) %>% cbind(angle) %>% gather(variable, value, -angle) %>% separate(variable, c('ID','Measure'), sep = '_') %>% spread(Measure, value) %>% select(ID, angle, A, B) %>% arrange(ID, angle) return(data) } head(cleanData1(d2)) ## ID angle A B ## 1 10 60 31.52867 1.000000 ## 2 10 65 31.33477 1.000000 ## 3 10 70 31.09272 0.202002 ## 4 10 75 30.96078 0.448317 ## 5 10 80 30.79397 0.670876 ## 6 10 85 30.52185 0.461406

This is a lot of data munging, and though dplyr is powerful, it took a lot of trial and error to get the final pipeline done. Nonetheless, I was really psyched about tidyxl, since it automated a job that would have taken manual manipulation (I had 12 spreadsheets like this to process). I was going to write a blog post on this cool package that made my life dealing with messy Excel files a piece of cake. But wait, there’s more…

Attempt 3 (tidyxl + unpivotr)

I didn’t know about unpivotr until this post:

When your spreadsheet is too for readxl, tidyxl + unpivotr helps you tackle charming features like “data as formatting” and “data in the layout”. https://t.co/ABerpfHT8W

— Jenny Bryan (@JennyBryan) December 7, 2017

So maybe all that complicated munging can be simplified.

# devtools::install_github('nacnudus/unpivotr') library(unpivotr) cleanData2 = function(d){ bl = d %>% select(row, col, data_type, numeric, character) %>% behead('N', ID) %>% behead('N', spine) %>% behead('N', variable) # Extract the angles column bl1 = bl %>% filter(variable == 'angles') %>% spatter(variable) %>% select(row, angles) # Extract the rest of the columns bl2 = bl %>% filter(variable %in% c('A','B')) %>% select(-spine, -col) %>% spatter(ID) %>% # Spread to columns select(-character) %>% # All my variables are numeric gather(ID, value, -row, -variable) %>% spread(variable, value) final = bl1 %>% left_join(bl2) %>% # put things back together arrange(ID, angles) %>% select(ID, everything(),-row) # re-arrange columns return(final) } cleanData2(d2) ## # A tibble: 588 x 4 ## ID angles A B ## ## 1 10 60. 31.5 1.00 ## 2 10 65. 31.3 1.00 ## 3 10 70. 31.1 0.202 ## 4 10 75. 31.0 0.448 ## 5 10 80. 30.8 0.671 ## 6 10 85. 30.5 0.461 ## 7 10 90. 30.3 0.245 ## 8 10 95. 30.0 0.159 ## 9 10 100. 29.7 0.170 ## 10 10 105. 29.2 0.421 ## # ... with 578 more rows

In this example, I’m using the behead function (available in the development version of unpivotr on GitHub) to extract out the three rows of headers. Then I’m extracting out the angles column separately and merging it with the rest of the columns.

In case you’re wondering about the “N” in the behead code, unpivotr has a geographic options system for specifying where the headers sit with respect to the main data. This vignette explains the nomenclature.

Attempt 4 (tidyxl + unpivotr)

After re-reading the unpivotr documentation, I realized that the angles column could be treated as a row header in the unpivotr code. So I further modified the function:

cleanData3 = function(d) {
  final = d %>%
    select(row, col, data_type, numeric, character) %>%
    behead('N', ID) %>%       # Extract column headers
    behead('N', spine) %>%
    behead('N', variable) %>%
    behead('W', angles) %>%   # angles as row header
    select(numeric, ID:angles, data_type, -spine) %>% # all vars are numeric
    filter(variable %in% c('A','B')) %>%  # Kills off some extra columns
    spatter(variable)         # Spreads, using data_type, numeric
  return(final)
}
cleanData3(d2)
## # A tibble: 588 x 4
##    ID    angles     A     B
##
##  1 10       60.  31.5 1.00
##  2 10       65.  31.3 1.00
##  3 10       70.  31.1 0.202
##  4 10       75.  31.0 0.448
##  5 10       80.  30.8 0.671
##  6 10       85.  30.5 0.461
##  7 10       90.  30.3 0.245
##  8 10       95.  30.0 0.159
##  9 10      100.  29.7 0.170
## 10 10      105.  29.2 0.421
## # ... with 578 more rows

I get to the same output, but with much cleaner code. This is cool!! I’m going to go deeper into the unpivotr documentation and see what else can become part of my regular pipeline. A big thank you to the tool-makers who create these tools that make everyday activities easier and help us stay sane.


Wrangling Data Table Out Of the FBI 2017 IC3 Crime Report

Tue, 05/08/2018 - 16:10

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

The U.S. FBI Internet Crime Complaint Center was established in 2000 to receive complaints of Internet crime. They produce an annual report, and the 2017 edition was just released; I need the data from it. Since I have to wrangle it out, I thought some folks might like to play along at home, especially since it turns out I had to use both tabulizer and pdftools to accomplish my goal.

Concepts presented:

  • PDF scraping (with both tabulizer and pdftools)
  • asciiruler
  • general string manipulation
  • case_when() vs ifelse() for text cleanup
  • reformatting data for ggraph treemaps

Let’s get started! (NOTE: you can click/tap on any image for a larger version)

library(stringi) library(pdftools) library(tabulizer) library(igraph) library(ggraph) # devtools::install_github("thomasp85/ggraph") library(hrbrthemes) library(tidyverse) ic3_file <- "~/Data/2017-ic3-report.pdf" # change "~/Data" for your system if (!file.exists(ic3_file)) { # don't waste anyone's bandwidth download.file("https://pdf.ic3.gov/2017_IC3Report.pdf", ic3_file) }

Let's try pdftools since I like text wrangling

cat(pdftools::pdf_text(ic3_file)[[20]]) ## 2017 Internet Crime Report 20 ## 2017 Crime Types ## By Victim Count ## Crime Type Victims Crime Type Victims ## Non-Payment/Non-Delivery 84,079 Misrepresentation 5,437 ## Personal Data Breach 30,904 Corporate Data Breach 3,785 ## Phishing/Vishing/Smishing/Pharming 25,344 Investment 3,089 ## Overpayment 23,135 Malware/Scareware/Virus 3,089 ## No Lead Value 20,241 Lottery/Sweepstakes 3,012 ## Identity Theft 17,636 IPR/Copyright and 2,644 ## Counterfeit ## Advanced Fee 16,368 Ransomware 1,783 ## Harassment/Threats of Violence 16,194 Crimes Against Children 1,300 ## Employment 15,784 Denial of Service/TDoS 1,201 ## BEC/EAC 15,690 Civil Matter 1,057 ## Confidence Fraud/Romance 15,372 Re-shipping 1,025 ## Credit Card Fraud 15,220 Charity 436 ## Extortion 14,938 Health Care Related 406 ## Other 14,023 Gambling 203 ## Tech Support 10,949 Terrorism 177 ## Real Estate/Rental 9,645 Hacktivist 158 ## Government Impersonation 9,149 ## Descriptors* ## Social Media 19,986 *These descriptors relate to the medium or ## Virtual Currency 4,139 tool used to facilitate the crime, and are used ## by the IC3 for tracking purposes only. They ## are available only after another crime type ## has been selected.

OK, I don't like text wrangling that much. How about tabulizer?

tabulizer::extract_tables(ic3_file, pages = 20) ## list()

Well, that's disappointing. Perhaps if we target the tables on the PDF pages we'll have better luck. You can find them on pages 20 and 21 if you downloaded your own copy. Here are some smaller, static views of them:

I can't show the tabulizer pane (well I could if I had time to screen capture and make an animated gif) but run this to get the areas:

areas <- tabulizer::locate_areas(ic3_file, pages = 20:21) # this is what ^^ produces for my rectangles: list( c(top = 137.11911357341, left = 66.864265927978, bottom = 413.5512465374, right = 519.90581717452), c(top = 134.92520775623, left = 64.670360110803, bottom = 458.52631578947, right = 529.7783933518) ) -> areas

Now, see if tabulizer can do a better job. We'll start with the first page:

tab <- tabulizer::extract_tables(ic3_file, pages = 20, area = areas[1]) tab ## [[1]] ## [,1] [,2] ## [1,] "" "By Victim Cou nt" ## [2,] "Crime Type" "Victims" ## [3,] "Non-Payment/Non-Delivery" "84,079" ## [4,] "Personal Data Breach" "30,904" ## [5,] "Phishing/Vishing/Smishing/Pharming" "25,344" ## [6,] "Overpayment" "23,135" ## [7,] "No Lead Value" "20,241" ## [8,] "Identity Theft" "17,636" ## [9,] "" "" ## [10,] "Advanced Fee" "16,368" ## [11,] "Harassment/Threats of Violence" "16,194" ## [12,] "Employment" "15,784" ## [13,] "BEC/EAC" "15,690" ## [14,] "Confidence Fraud/Romance" "15,372" ## [15,] "Credit Card Fraud" "15,220" ## [16,] "Extortion" "14,938" ## [17,] "Other" "14,023" ## [18,] "Tech Support" "10,949" ## [19,] "Real Estate/Rental" "9,645" ## [20,] "G overnment Impersonation" "9,149" ## [21,] "" "" ## [22,] "Descriptors*" "" ## [,3] [,4] ## [1,] "" "" ## [2,] "Crime Type" "Victims" ## [3,] "Misrepresentation" "5,437" ## [4,] "Corporate Data Breach" "3,785" ## [5,] "Investment" "3,089" ## [6,] "Malware/Scareware/Virus" "3,089" ## [7,] "Lottery/Sweepstakes" "3,012" ## [8,] "IPR/Copyright and" "2,644" ## [9,] "Counterfeit" "" ## [10,] "Ransomware" "1,783" ## [11,] "Crimes Against Children" "1,300" ## [12,] "Denial of Service/TDoS" "1,201" ## [13,] "Civil Matter" "1,057" ## [14,] "Re-shipping" "1,025" ## [15,] "Charity" "436" ## [16,] "Health Care Related" "406" ## [17,] "Gambling" "203" ## [18,] "Terrorism" "177" ## [19,] "Hacktivist" "158" ## [20,] "" "" ## [21,] "" "" ## [22,] "" ""

Looking good. How does it look data-frame'd?

tab <- as_data_frame(tab[[1]]) print(tab, n=50) ## # A tibble: 22 x 4 ## V1 V2 V3 V4 ## 1 "" By Victim Cou nt "" "" ## 2 Crime Type Victims Crime Type Vict… ## 3 Non-Payment/Non-Delivery 84,079 Misrepresent… 5,437 ## 4 Personal Data Breach 30,904 Corporate Da… 3,785 ## 5 Phishing/Vishing/Smishing/Pharming 25,344 Investment 3,089 ## 6 Overpayment 23,135 Malware/Scar… 3,089 ## 7 No Lead Value 20,241 Lottery/Swee… 3,012 ## 8 Identity Theft 17,636 IPR/Copyrigh… 2,644 ## 9 "" "" Counterfeit "" ## 10 Advanced Fee 16,368 Ransomware 1,783 ## 11 Harassment/Threats of Violence 16,194 Crimes Again… 1,300 ## 12 Employment 15,784 Denial of Se… 1,201 ## 13 BEC/EAC 15,690 Civil Matter 1,057 ## 14 Confidence Fraud/Romance 15,372 Re-shipping 1,025 ## 15 Credit Card Fraud 15,220 Charity 436 ## 16 Extortion 14,938 Health Care … 406 ## 17 Other 14,023 Gambling 203 ## 18 Tech Support 10,949 Terrorism 177 ## 19 Real Estate/Rental 9,645 Hacktivist 158 ## 20 G overnment Impersonation 9,149 "" "" ## 21 "" "" "" "" ## 22 Descriptors* "" "" ""

Still pretty good. Cleaning it up is pretty simple from here. Just filter out some rows, parse some numbers, fix some chopped labels and boom - done:

tab <- filter(tab[3:21,], !V2 == "") bind_rows( select(tab, crime = V1, n_victims = V2), select(tab, crime = V3, n_victims = V4) ) %>% filter(crime != "") %>% mutate(n_victims = readr::parse_number(n_victims)) %>% mutate(crime = case_when( stri_detect_fixed(crime, "G o") ~ "Government Impersonation", stri_detect_fixed(crime, "IPR/C") ~ "IPR/Copyright and Counterfeit", TRUE ~ crime )) %>% print(n=50) -> ic3_2017_crimes_victim_ct ## # A tibble: 33 x 2 ## crime n_victims ## ## 1 Non-Payment/Non-Delivery 84079. ## 2 Personal Data Breach 30904. ## 3 Phishing/Vishing/Smishing/Pharming 25344. ## 4 Overpayment 23135. ## 5 No Lead Value 20241. ## 6 Identity Theft 17636. ## 7 Advanced Fee 16368. ## 8 Harassment/Threats of Violence 16194. ## 9 Employment 15784. ## 10 BEC/EAC 15690. ## 11 Confidence Fraud/Romance 15372. ## 12 Credit Card Fraud 15220. ## 13 Extortion 14938. ## 14 Other 14023. ## 15 Tech Support 10949. ## 16 Real Estate/Rental 9645. ## 17 Government Impersonation 9149. ## 18 Misrepresentation 5437. ## 19 Corporate Data Breach 3785. ## 20 Investment 3089. ## 21 Malware/Scareware/Virus 3089. ## 22 Lottery/Sweepstakes 3012. ## 23 IPR/Copyright and Counterfeit 2644. ## 24 Ransomware 1783. ## 25 Crimes Against Children 1300. ## 26 Denial of Service/TDoS 1201. ## 27 Civil Matter 1057. ## 28 Re-shipping 1025. ## 29 Charity 436. ## 30 Health Care Related 406. ## 31 Gambling 203. ## 32 Terrorism 177. ## 33 Hacktivist 158.

Now, on to page 2!

tab <- tabulizer::extract_tables(ic3_file, pages = 21, area = areas[2]) tab ## [[1]] ## [,1] [,2] ## [1,] "" "By Victim Lo ss" ## [2,] "Crime Type" "Loss Crime Type" ## [3,] "BEC/EAC" "$676,151,185 Misrepresentation" ## [4,] "Confidence Fraud/Romance" "$211,382,989 Harassment/Threats" ## [5,] "" "of Violence" ## [6,] "Non-Payment/Non-Delivery" "$141,110,441 Government" ## [7,] "" "Impersonation" ## [8,] "Investment" "$96,844,144 Civil Matter" ## [9,] "Personal Data Breach" "$77,134,865 IPR/Copyright and" ## [10,] "" "Counterfeit" ## [11,] "Identity Theft" "$66,815,298 Malware/Scareware/" ## [12,] "" "Virus" ## [13,] "Corporate Data Breach" "$60,942,306 Ransomware" ## [14,] "Advanced Fee" "$57,861,324 Denial of Service/TDoS" ## [15,] "Credit Card Fraud" "$57,207,248 Charity" ## [16,] "Real Estate/Rental" "$56,231,333 Health Care Related" ## [17,] "Overpayment" "$53,450,830 Re-Shipping" ## [18,] "Employment" "$38,883,616 Gambling" ## [19,] "Phishing/Vishing/Smishing/" "$29,703,421 Crimes Against" ## [20,] "Pharming" "Children" ## [21,] "Other" "$23,853,704 Hacktivist" ## [22,] "Lottery/Sweepstakes" "$16,835,001 Terrorism" ## [23,] "Extortion" "$15,302,792 N o Lead Value" ## [24,] "Tech Support" "$14,810,080" ## [25,] "" "" ## [26,] "" "" ## [,3] ## [1,] "" ## [2,] "Loss" ## [3,] "$14,580,907" ## [4,] "$12,569,185" ## [5,] "" ## [6,] "$12,467,380" ## [7,] "" ## [8,] "$5,766,550" ## [9,] "$5,536,912" ## [10,] "" ## [11,] "$5,003,434" ## [12,] "" ## [13,] "$2,344,365" ## [14,] "$1,466,195" ## [15,] "$1,405,460" ## [16,] "$925,849" ## [17,] "$809,746" ## [18,] "$598,853" ## [19,] "$46,411" ## [20,] "" ## [21,] "$20,147" ## [22,] "$18,926" ## [23,] "$0" ## [24,] "" ## [25,] "" ## [26,] "Descriptors*"

:facepalm: That's disappointing. Way too much scrambled content. So, back to pdftools!

cat(pg21 <- pdftools::pdf_text(ic3_file)[[21]]) ## Internet Crime Complaint Center 21 ## 2017 Crime Types Continued ## By Victim Loss ## Crime Type Loss Crime Type Loss ## BEC/EAC $676,151,185 Misrepresentation $14,580,907 ## Confidence Fraud/Romance $211,382,989 Harassment/Threats $12,569,185 ## of Violence ## Non-Payment/Non-Delivery $141,110,441 Government $12,467,380 ## Impersonation ## Investment $96,844,144 Civil Matter $5,766,550 ## Personal Data Breach $77,134,865 IPR/Copyright and $5,536,912 ## Counterfeit ## Identity Theft $66,815,298 Malware/Scareware/ $5,003,434 ## Virus ## Corporate Data Breach $60,942,306 Ransomware $2,344,365 ## Advanced Fee $57,861,324 Denial of Service/TDoS $1,466,195 ## Credit Card Fraud $57,207,248 Charity $1,405,460 ## Real Estate/Rental $56,231,333 Health Care Related $925,849 ## Overpayment $53,450,830 Re-Shipping $809,746 ## Employment $38,883,616 Gambling $598,853 ## Phishing/Vishing/Smishing/ $29,703,421 Crimes Against $46,411 ## Pharming Children ## Other $23,853,704 Hacktivist $20,147 ## Lottery/Sweepstakes $16,835,001 Terrorism $18,926 ## Extortion $15,302,792 No Lead Value $0 ## Tech Support $14,810,080 ## Descriptors* ## Social Media $56,478,483 *These descriptors relate to the medium or ## Virtual Currency $58,391,810 tool used to facilitate the crime, and are used ## by the IC3 for tracking purposes only. They ## are available only after another crime type ## has been selected.

This is (truthfully) not too bad. Just make columns from substring ranges and do some cleanup. The asciiruler package can definitely help here since it makes it much easier to see start/stop points (I used a new editor pane and copied some lines into it):

stri_split_lines(pg21)[[1]] %>% .[-(1:4)] %>% # remove header & bits above header .[-(26:30)] %>% # remove trailing bits map_df(~{ list( crime = stri_trim_both(c(stri_sub(.x, 1, 25), stri_sub(.x, 43, 73))), cost = stri_trim_both(c(stri_sub(.x, 27, 39), stri_sub(.x, 74))) # no length/to in the last one so it goes until eol ) }) %>% filter(!(crime == "" | cost == "")) %>% # get rid of blank rows mutate(cost = suppressWarnings(readr::parse_number(cost))) %>% # we can use NAs generated to remove non-data rows filter(!is.na(cost)) %>% mutate(crime = case_when( stri_detect_fixed(crime, "Phish") ~ "Phishing/Vishing/Smishing/Pharming", stri_detect_fixed(crime, "Malware") ~ "Malware/Scareware/Virus", stri_detect_fixed(crime, "IPR") ~ "IPR/Copyright and Counterfeit", stri_detect_fixed(crime, "Harassment") ~ "Harassment/Threats of Violence", TRUE ~ crime )) %>% print(n=50) -> ic3_2017_crimes_cost ## # A tibble: 35 x 2 ## crime cost ## 1 BEC/EAC 676151185. ## 2 Misrepresentation 14580907. ## 3 Confidence Fraud/Romance 211382989. ## 4 Harassment/Threats of Violence 12569185. ## 5 Non-Payment/Non-Delivery 141110441. ## 6 Government 12467380. ## 7 Investment 96844144. ## 8 Civil Matter 5766550. ## 9 Personal Data Breach 77134865. ## 10 IPR/Copyright and Counterfeit 5536912. ## 11 Identity Theft 66815298. ## 12 Malware/Scareware/Virus 5003434. ## 13 Corporate Data Breach 60942306. ## 14 Ransomware 2344365. ## 15 Advanced Fee 57861324. ## 16 Denial of Service/TDoS 1466195. ## 17 Credit Card Fraud 57207248. ## 18 Charity 1405460. ## 19 Real Estate/Rental 56231333. ## 20 Health Care Related 925849. ## 21 Overpayment 53450830. ## 22 Re-Shipping 809746. ## 23 Employment 38883616. ## 24 Gambling 598853. ## 25 Phishing/Vishing/Smishing/Pharming 29703421. ## 26 Crimes Against 46411. ## 27 Other 23853704. ## 28 Hacktivist 20147. ## 29 Lottery/Sweepstakes 16835001. ## 30 Terrorism 18926. ## 31 Extortion 15302792. ## 32 No Lead Value 0. ## 33 Tech Support 14810080. ## 34 Social Media 56478483. ## 35 Virtual Currency 58391810.

Now that we have real data, we can take a look at the IC3 crimes by loss and victims.

We'll use treemaps first then take a quick look at the relationship between counts and losses.

Just need to do some data wrangling for ggraph, starting with victims:

ic3_2017_crimes_victim_ct %>% mutate(crime = case_when( crime == "Government Impersonation" ~ "Government\nImpersonation", crime == "Corporate Data Breach" ~ "Corporate\nData\nBreach", TRUE ~ crime )) %>% mutate(crime = stri_replace_all_fixed(crime, "/", "/\n")) %>% mutate(grp = "ROOT") %>% add_row(grp = "ROOT", crime="ROOT", n_victims=0) %>% select(grp, crime, n_victims) %>% arrange(desc(n_victims)) -> gdf select(gdf, -grp) %>% mutate(lab = sprintf("%s\n(%s)", crime, scales::comma(n_victims))) %>% mutate(lab = ifelse(n_victims > 1300, lab, "")) %>% # don't show a label when blocks are small mutate(lab_col = ifelse(n_victims > 40000, "#2b2b2b", "#cccccc")) -> vdf # change up colors when blocks are lighter g <- graph_from_data_frame(gdf, vertices=vdf) ggraph(g, "treemap", weight=n_victims) + geom_node_tile(aes(fill=n_victims), size=0.25) + geom_text( aes(x, y, label=lab, size=n_victims, color = I(lab_col)), family=font_ps, lineheight=0.875 ) + scale_x_reverse(expand=c(0,0)) + scale_y_continuous(expand=c(0,0)) + scale_size_continuous(trans = "sqrt", range = c(0.5, 8)) + labs(title = "FBI 2017 Internet Crime Report — Crimes By Victim Count") + ggraph::theme_graph(base_family = font_ps) + theme(plot.title = element_text(color="#cccccc", family = "IBMPlexSans-Bold")) + theme(panel.background = element_rect(fill="black")) + theme(plot.background = element_rect(fill="black")) + theme(legend.position="none")

# Now, do the same for losses: ic3_2017_crimes_cost %>% mutate(crime = case_when( crime == "Phishing/Vishing/Smishing/Pharming" ~ "Phishing/Vishing/\nSmishing/Pharming", crime == "Harassment/Threats of Violence" ~ "Harassment/\nThreats of Violence", crime == "Lottery/Sweepstakes" ~ "Lottery\nSweepstakes", TRUE ~ crime )) %>% filter(cost > 0) %>% mutate(cost = cost / 1000000) %>% mutate(grp = "ROOT") %>% add_row(grp = "ROOT", crime="ROOT", cost=0) %>% select(grp, crime, cost) %>% arrange(desc(cost)) -> gdf_cost select(gdf_cost, -grp) %>% mutate(lab = sprintf("%s\n($%s M)", crime, prettyNum(cost, digits=2))) %>% mutate(lab = ifelse(cost > 5.6, lab, "")) %>% mutate(lab_col = ifelse(cost > 600, "#2b2b2b", "#cccccc")) -> vdf_cost g_cost <- graph_from_data_frame(gdf_cost, vertices=vdf_cost) ggraph(g_cost, "treemap", weight=cost) + geom_node_tile(aes(fill=cost), size=0.25) + geom_text( aes(x, y, label=lab, size=cost, color=I(lab_col)), family=font_ps, lineheight=0.875 ) + scale_x_reverse(expand=c(0,0)) + scale_y_continuous(expand=c(0,0)) + scale_size_continuous(trans = "sqrt", range = c(0.5, 8)) + labs(title = "FBI 2017 Internet Crime Report — Crime Loss By Category") + ggraph::theme_graph(base_family = font_ps) + theme(plot.title = element_text(color="#cccccc", family = "IBMPlexSans-Bold")) + theme(panel.background = element_rect(fill="black")) + theme(plot.background = element_rect(fill="black")) + theme(legend.position="none")

Let's plot victim counts vs losses to see what stands out:

left_join(ic3_2017_crimes_victim_ct, ic3_2017_crimes_cost) %>% filter(!is.na(cost)) %>% ggplot(aes(n_victims, cost)) + geom_point() + ggrepel::geom_label_repel(aes(label = crime), family=font_ps, size=3) + scale_x_comma() + scale_y_continuous(labels=scales::dollar) + labs( x = "# of Victims", y = "Loss magnitude", title = "FBI 2017 Internet Crime Report — Crime Loss By Victim Count ~ Category" ) + theme_ipsum_ps(grid="XY")

BEC == "Business email compromise" and it's definitely a major problem, but those two count/loss outliers are not helping us see the rest of the data. Let's zoom in:

left_join(ic3_2017_crimes_victim_ct, ic3_2017_crimes_cost) %>% filter(!is.na(cost)) %>% filter(cost < 300000000) %>% filter(n_victims < 40000) %>% ggplot(aes(n_victims, cost)) + geom_point() + ggrepel::geom_label_repel(aes(label = crime), family=font_ps, size=3) + scale_x_comma() + scale_y_continuous(labels=scales::dollar) + labs( x = "# of Victims", y = "Loss magnitude", title = "FBI 2017 Internet Crime Report — Crime Loss By Victim Count ~ Category", subtitle = "NOTE: BEC/EAC and Non-payment/Non-delivery removed" ) + theme_ipsum_ps(grid="XY")

Better, but let's zoom in a bit more:

left_join(ic3_2017_crimes_victim_ct, ic3_2017_crimes_cost) %>% filter(!is.na(cost)) %>% filter(cost < 50000000) %>% filter(n_victims < 10000) %>% ggplot(aes(n_victims, cost)) + geom_point() + ggrepel::geom_label_repel(aes(label = crime), family=font_ps, size=3) + scale_x_comma() + scale_y_continuous(labels=scales::dollar) + labs( x = "# of Victims", y = "Loss magnitude", title = "FBI 2017 Internet Crime Report — Crime Loss By Victim Count ~ Category", subtitle = "NOTE: Only includes losses between $0-50 M USD & victim counts <= 10,000 " ) + theme_ipsum_ps(grid="XY")

Looks like the ransomware folks have quite a bit of catching up to do (at least when it comes to crimes reported to the IC3).


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

An overview of R with a curated learning path

Tue, 05/08/2018 - 07:05

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

I recently wrote an 80-page guide on how to get a programming job without a degree, curated from my experience helping students do just that at Springboard, a leading data science bootcamp. This excerpt is a part where I focus on an overview of the R programming language.

Description: R is an open-source programming language most often used by statisticians, data scientists, and academics to explore large data sets and distill insights from them. It offers multiple libraries that are useful for data processing tasks. R is a refined, open-source implementation of its precursor language S, which was developed by John Chambers and colleagues at Bell Laboratories.

It has strong libraries for data visualization, time series analysis and a variety of statistical analysis tasks.

It’s free software that runs on a variety of operating systems, from UNIX to Windows to OS X. It is released under the GNU General Public License and is thus free to use and adapt.

Salary: Median salaries for R users tend to vary, with the main split being the difference between data analysts who use R to query existing data pipelines and data scientists who build those data pipelines and train different models programmatically on top of larger data sets. The difference can be stark, with around a $90,000 median salary for data scientists who use R, vs about a $60,000 median salary for data analysts who use R.

Uses: R is often used to analyze datasets, especially in an academic context. Most frameworks that have evolved around R focus on different methods of data processing. The ggplot family of libraries has been widely recognized as some of the top programming modules for data visualization.

Differentiation: R is often compared to Python when it comes to data analysis and data science tasks. Its strong data visualization and manipulation libraries along with its data analysis-focused community help make it a strong contender for any data transformation, analysis, and visualization tasks.

Most Popular Github Projects:

1- Mal

Mal is a Clojure-inspired Lisp interpreter with implementations in many languages, including R. With 4,500 stars, Mal has one of the lowest star counts needed to qualify as the top repository for a programming language. It speaks to the fact that most of the open-source work done on the R programming language resides outside of GitHub.

2- Prophet

Prophet is a library that is able to do rich time series analysis by adjusting forecasts to account for seasonal and non-linear trends. It was created by Facebook and forms a part of the strong corpus of data analysis frameworks and libraries that exist for the R programming language.
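A minimal sketch of Prophet's R quick-start workflow (the simulated daily series below is made up purely for illustration):

library(prophet)

# Prophet expects a data frame with columns `ds` (dates) and `y` (the values to forecast)
history <- data.frame(
  ds = seq(as.Date("2017-01-01"), by = "day", length.out = 365),
  y  = rnorm(365, mean = 100, sd = 10)
)

m <- prophet(history)                             # fit the model
future <- make_future_dataframe(m, periods = 30)  # extend the frame 30 days ahead
forecast <- predict(m, future)
plot(m, forecast)                                 # forecast with uncertainty intervals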

3- ggplot2

ggplot2 is a data visualization library for R that is based on the Grammar of Graphics. It is often used by data analysts and data scientists to display their results in charts, heatmaps, and more.
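A minimal example of that grammar-of-graphics style, using the mpg dataset that ships with ggplot2:

library(ggplot2)

# Map engine displacement to x, highway mileage to y, and car class to colour
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(title = "Fuel economy by engine size",
       x = "Engine displacement (litres)",
       y = "Highway miles per gallon")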

4- H2o-3

H2o-3 is the open source machine learning platform behind the h2o package for R, similar in spirit to scikit-learn for Python. It allows people using the R programming language to run deep learning and other machine learning techniques on their data, an essential utility in an era where data is not nearly as useful without machine learning techniques.

5- Shiny

Shiny is an easy web application framework for R that allows you to build interactive websites in a few lines of code without any JavaScript. It uses intuitive UI (user interface) components based on Bootstrap. Shiny can take all of the guesswork out of building something for the web with the R programming language.
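Those “few lines” really are few; here is a minimal sketch of a complete app (a slider and a plot that reacts to it, not tied to any particular dataset):

library(shiny)

ui <- fluidPage(
  sliderInput("n", "Number of points:", min = 10, max = 500, value = 100),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    # Re-drawn automatically whenever the slider changes
    plot(rnorm(input$n), rnorm(input$n))
  })
}

shinyApp(ui = ui, server = server)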

Example Sites: There are not many websites built with R, which is used more for data analysis tasks and projects that are internal to one computer. However, you can build webpages with R Markdown. You might also use a web development framework such as Shiny if you wanted to create simple interactive web applications around your data.

Frameworks: The awesome repository comes up again with a great list of different R packages and frameworks you can use. A few that are worth mentioning are packages such as dplyr that help you assemble data in an intuitive tabular fashion, ggplot2 to help with data visualization and plotly to help with interactive web displays of R analysis. R libraries and frameworks are some of the most robust for doing ad hoc data analysis and displaying the results in a variety of formats.

Learning Path: This article helps frame the resources you need to learn R, and how you should learn it, starting from syntax and going to specific packages. It makes for a great introduction to the field, even if you’re an absolute beginner. If you want to apply R to data science projects and different data analysis tasks, Datacamp will help you learn the skills and mentality you need to do just that — you’ll learn everything from machine learning practices with R to how to do proper data visualization of the results.

Resources: R-bloggers is a large community of R practitioners and writers who aim to share knowledge about R with each other. This list of 60+ resources on R can be used in case you ever get lost trying to learn R.

I hope this is helpful for you! Want more material like this? Check out my guide to how to get a programming job without a degree.


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

simmer 3.8.0

Mon, 05/07/2018 - 18:14

(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

The 3.8.0 release of simmer, the Discrete-Event Simulator for R, hit CRAN almost a week ago, and Windows binaries are already available. This version includes two highly requested new features that justify this second consecutive minor release.

Attachment of precomputed data

Until v3.7.0, the generator was the only means to attach data to trajectories, and it was primarily intended for dynamic generation of arrivals:

library(simmer) set.seed(42) hello_sayer <- trajectory() %>% log_("hello!") simmer() %>% add_generator("dummy", hello_sayer, function() rexp(1, 1)) %>% run(until=2) ## 0.198337: dummy0: hello! ## 0.859232: dummy1: hello! ## 1.14272: dummy2: hello! ## 1.18091: dummy3: hello! ## 1.65409: dummy4: hello! ## simmer environment: anonymous | now: 2 | next: 3.11771876826972 ## { Monitor: in memory } ## { Source: dummy | monitored: 1 | n_generated: 6 }

Although it may be used to attach precomputed data too, especially using the at() adaptor:

simmer() %>% add_generator("dummy", hello_sayer, at(seq(0, 10, 0.5))) %>% run(until=2) ## 0: dummy0: hello! ## 0.5: dummy1: hello! ## 1: dummy2: hello! ## 1.5: dummy3: hello! ## simmer environment: anonymous | now: 2 | next: 2 ## { Monitor: in memory } ## { Source: dummy | monitored: 1 | n_generated: 21 }

Now, let’s say that we want to attach some empirical data, and our observations not only include arrival times, but also priorities and some attributes (e.g., measured service times), as in this question on StackOverflow:

myData <- data.frame( time = c(1:10,1:5), priority = 1:3, duration = rnorm(15, 50, 5)) %>% dplyr::arrange(time)

This is indeed possible using generators, but it requires some trickery; more specifically, the clever usage of a consumer function as follows:

consume <- function(x, prio=FALSE) { i <- 0 function() { i <<- i + 1 if (prio) c(x[[i]], x[[i]], FALSE) else x[[i]] } } activityTraj <- trajectory() %>% seize("worker") %>% timeout_from_attribute("duration") %>% release("worker") initialization <- trajectory() %>% set_prioritization(consume(myData$priority, TRUE)) %>% set_attribute("duration", consume(myData$duration)) %>% join(activityTraj) arrivals_gen <- simmer() %>% add_resource("worker", 2, preemptive=TRUE) %>% add_generator("dummy_", initialization, at(myData$time)) %>% run() %>% get_mon_arrivals() # check the resulting duration times activity_time <- arrivals_gen %>% tidyr::separate(name, c("prefix", "n"), convert=TRUE) %>% dplyr::arrange(n) %>% dplyr::pull(activity_time) all(activity_time == myData$duration) ## [1] TRUE

As of v3.8.0, the new data source add_dataframe greatly simplifies this process:

arrivals_df <- simmer() %>% add_resource("worker", 2, preemptive=TRUE) %>% add_dataframe("dummy_", activityTraj, myData, time="absolute") %>% run() %>% get_mon_arrivals() identical(arrivals_gen, arrivals_df) ## [1] TRUE
On-disk monitoring

As some users noted (see 12), the default in-memory monitoring capabilities can turn problematic for very long simulations. To address this issue, the simmer() constructor gains a new argument, mon, to provide different types of monitors. Monitoring is still performed in-memory by default, but as of v3.8.0, it can be offloaded to disk through monitor_delim() and monitor_csv(), which produce flat delimited files.

mon <- monitor_csv() mon ## simmer monitor: to disk (delimited files) ## { arrivals: /tmp/RtmpAlQH2g/file6933ce99281_arrivals.csv } ## { releases: /tmp/RtmpAlQH2g/file6933ce99281_releases.csv } ## { attributes: /tmp/RtmpAlQH2g/file6933ce99281_attributes.csv } ## { resources: /tmp/RtmpAlQH2g/file6933ce99281_resources.csv } env <- simmer(mon=mon) %>% add_generator("dummy", hello_sayer, function() rexp(1, 1)) %>% run(until=2) ## 0.26309: dummy0: hello! ## 0.982183: dummy1: hello! env ## simmer environment: anonymous | now: 2 | next: 2.29067480322535 ## { Monitor: to disk (delimited files) } ## { arrivals: /tmp/RtmpAlQH2g/file6933ce99281_arrivals.csv } ## { releases: /tmp/RtmpAlQH2g/file6933ce99281_releases.csv } ## { attributes: /tmp/RtmpAlQH2g/file6933ce99281_attributes.csv } ## { resources: /tmp/RtmpAlQH2g/file6933ce99281_resources.csv } ## { Source: dummy | monitored: 1 | n_generated: 3 } read.csv(mon$handlers["arrivals"]) # direct access ## name start_time end_time activity_time finished ## 1 dummy0 0.2630904 0.2630904 0 1 ## 2 dummy1 0.9821828 0.9821828 0 1 get_mon_arrivals(env) # adds the "replication" column ## name start_time end_time activity_time finished replication ## 1 dummy0 0.2630904 0.2630904 0 1 1 ## 2 dummy1 0.9821828 0.9821828 0 1 1

See below for a comprehensive list of changes.

New features:
  • New data source add_dataframe enables the attachment of precomputed data, in the form of a data frame, to a trajectory. It can be used instead of (or along with) add_generator. The most notable advantage over the latter is that add_dataframe is able to automatically set attributes and prioritisation values per arrival based on columns of the provided data frame (#140 closing #123).
  • New set_source activity deprecates set_distribution(). It works both for generators and data sources (275a09c, as part of #140).
  • New monitoring interface allows for disk offloading. The simmer() constructor gains a new argument mon to provide different types of monitors. By default, monitoring is performed in-memory, as usual. Additionally, monitoring can be offloaded to disk through monitor_delim and monitor_csv, which produce flat delimited files. But more importantly, the C++ interface has been refactorised to enable the development of new monitoring backends (#146 closing #119).
Minor changes and fixes:
  • Some documentation improvements (1e14ed7, 194ed05).
  • New default until=Inf for the run method (3e6aae9, as part of #140).
  • branch and clone now accept lists of trajectories, in the same way as join, so that there is no need to use do.call (#142).
  • The argument continue (present in seize and branch) is recycled if only one value is provided but several sub-trajectories are defined (#143).
  • Fix process reset: sources are reset in strict order of creation (e7d909b).
  • Fix infinite timeouts (#144).

Article originally published in Enchufa2.es: simmer 3.8.0.


To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2.

Tidying messy Excel data (Introduction)

Mon, 05/07/2018 - 15:42

(This article was first published on R – Stat Bandit, and kindly contributed to R-bloggers)

[Re-posted from Abhijit’s blog]
Personal expressiveness, or how data is stored in a spreadsheet

When you get data from a broad research community, the variability in how that data is formatted and stored is truly astonishing. Of course there are the standardized formats that are output from machines, like Next Generation Sequencing and other automated systems. That is a saving grace!

But for smaller data, or data collected in the lab, the possibilities are truly endless! You can get every possible color-coding of rows, columns and cells, merged cells, hidden columns and rows, and inopportune blank spaces that convert numbers to characters, and notes where there should be numbers. That’s just from the more organized spreadsheets. Then you get multiple tables in the same spreadsheet, ad hoc computations in some cells, cells copied by hand (with error), and sundry other variations on this theme. In other words, it can be a nightmare scenario for the analyst. To wit,

I feel like some sort of spreadsheet naturalist, where new specimens spotted in the wild must be logged and compared to known species

— Jenny Bryan (@JennyBryan) April 25, 2018

In thinking about this post, I went back and looked at the documentation of the readxl package, which has made reading Excel files into R a lot easier than before. This package is quite powerful, so as long as data are in a relatively clean tabular form, this tool can pull it into R; see these vignettes to get a real sense of how to process decently behaved Excel files with R.
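For decently behaved files, the basic call really is short; a minimal sketch (the file name, sheet name and skip value here are hypothetical):

library(readxl)

# Hypothetical workbook: skip two rows of notes above the header row,
# and treat blanks and "NA" as missing values
df <- read_excel("lab_results.xlsx", sheet = "plate_1",
                 skip = 2, na = c("", "NA"))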

On a side note, how many ways of importing Excel files into R can you name, or have you used?

The readxl package has been my friend for a while, but then I received some well-intentioned spreadsheets that even readxl wouldn’t touch, despite much coaxing. Now, I had two options: ask the collaborator to re-format the spreadsheet (manually, of course), which in my experience is a disaster waiting to happen; or just take the spreadsheet as is and figure out how to import it into R. I almost always take the latter route, and tidyxl is my bosom friend in this effort. In the next part of this series, I’ll describe my experiences with tidyxl and why, every time I use it, I say a small blessing for the team that created it.

Resources
  1. tidyeval meets Spreadsheet Hell
  2. Carpentries lesson on spreadsheet practices
  3. readxl workflows

To leave a comment for the author, please follow the link and comment on their blog: R – Stat Bandit.

RcppGSL 0.3.4

Mon, 05/07/2018 - 14:34

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A minor update, version 0.3.4 of RcppGSL, is now on CRAN. It contains an improved Windows build system (thanks, Jeroen!) and updates the C++ headers by removing dynamic exception specifications, which C++11 frowns upon and which the compilers let us know about in no uncertain terms. Builds using RcppGSL will now be quieter. And as always, there is an extra treat for Solaris.

The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package.

No user-facing new code or features were added. The NEWS file entries follow below:

Changes in version 0.3.4 (2018-05-06)
  • Windows builds were updated (Jeroen Ooms in #16).

  • Remove dynamic exception specifications which are deprecated with C++11 or later (Dirk in #17).

  • Accomodate Solaris by being more explicit about sqrt.

Courtesy of CRANberries, a summary of changes to the most recent release is available.

More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Machine Learning Explained: Vectorization and matrix operations

Mon, 05/07/2018 - 14:29

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

Today in Machine Learning Explained, we will tackle a central (yet often overlooked) aspect of Machine Learning: vectorization. Let’s say you want to compute the sum of the values of an array. The naive way to do so is to loop over the elements and to sequentially sum them. This naive way is slow and tends to get even slower with large amounts of data and large data structures.
With vectorization, these operations can be seen as matrix operations, which are often more efficient than standard loops. Vectorized versions of an algorithm are several orders of magnitude faster and are easier to understand from a mathematical perspective.

A basic example of vectorization
Preliminary example – Python

 
Let’s compare the naive way and the vectorized way of computing the sum of the elements of an array. To do so, we will create a large (100,000 elements) Numpy array and compute the sum of its elements 1,000 times with each algorithm. The overall computation time will then be compared.

import numpy as np import time W=np.random.normal(0,1,100000) n_rep=1000

The naive way to compute the sum iterates over all the elements of the array and stores the sum:

start_time = time.time() for i in range(n_rep): loop_res=0 for elt in W: loop_res+=elt time_loop = time.time() - start_time

If $W$ is our vector of interest, the sum of its elements can be expressed as the dot product $\mathbf{1}^\top W$, where $\mathbf{1}$ is a vector of ones of the same length as $W$:

start_time = time.time() for i in range(n_rep): one_dot=np.ones(W.shape) vect_res=one_dot.T.dot(W) time_vect = time.time() - start_time

Finally, we can check that both methods yield the same results and compare their runtime. The vectorized version run approximately 100 to 200 times faster than the naive loop.

print(np.abs(vect_res-loop_res)<10e-10) print(time_loop/time_vect)

Note: The same results can be obtained with np.sum. The numpy version has a very similar runtime to our vectorized version. Since numpy is very optimized, this shows that our vectorized sum is reasonably fast.

start_time = time.time() for i in range(n_rep): vect_res=np.sum(W) time_vect_np = time.time() - start_time
Preliminary example – R

The previous experiments can be replicated in R:

##Creation of the vector W=matrix(rnorm(100000)) n_rep=10000 #Naive way: library(tictoc) tic('Naive computation') for (rep in 1:n_rep) { res_loop=0 for (w_i in W) res_loop=w_i+res_loop } toc() tic('Vectorized computation') # vectorized way for (rep in 1:n_rep) { ones=rep(1,length(W)) res_vect= crossprod(ones,W) } toc() tic('built-in computation') # built-in way for (rep in 1:n_rep) { res_built_in= sum(W) } toc()

In R, the vectorized version is only an order of magnitude faster than the naive way. The built-in way achieves the best performance and is an order of magnitude faster than our vectorized way.

Preliminary example – Results

Vectorization divides the computation times by several orders of magnitude, and the difference with loops increases with the size of the data. Hence, if you want to deal with large amounts of data, rewriting the algorithm as matrix operations may lead to important performance gains.

Why vectorization is (often) faster
  • R and Python are interpreted languages, which means that your instructions are analyzed and interpreted at each execution. Since they are not statically typed, loops have to assess the type of the operands at each iteration, which leads to computational overhead.
  • R and Python linear algebra relies on optimized back-ends for matrix operations. These back-ends are written in C/C++ and can process loops efficiently. Furthermore, while loops are sequential, these back-ends can run operations in parallel, which improves the computation speed on modern CPUs.

Note 1: Though vectorization is often faster, it requires allocating the memory for intermediate arrays. If your amount of RAM is limited or if the amount of data is large, loops may be required.
Note 2: When you deal with large arrays or computationally intensive algebra (like inversion, diagonalization, eigenvalue computations, …), computations on a GPU are even orders of magnitude faster than on a CPU. To write efficient GPU code, the code needs to be composed of matrix operations. Hence, having vectorized code makes it easier to translate CPU code to GPU (or tensor-based frameworks).

A small zoo of matrix operations

The goal of this part is to show some basic matrix operations/vectorization and to end on a more complex example to show the thought process which underlies vectorization.

Column-wise matrix sum

The column-wise sum (and mean) can be expressed as a matrix product. Let $W \in \mathbb{R}^{n \times p}$ be our matrix of interest and $\mathbf{1}_n$ the vector of ones of length $n$. Using the matrix multiplication formula, we have $(\mathbf{1}_n^\top W)_j = \sum_{i=1}^{n} W_{ij}$. Hence, the column-wise sum of $W$ is $\mathbf{1}_n^\top W$.

Python code:

def colWiseSum(W): ones=np.ones((W.shape[0],1)) return ones.T.dot(W)

R code:

colWiseSum=function(W) { ones=rep(1,nrow(W)) t(W)%*%ones }
Row-wise matrix sum

Similarly, the row-wise sum is $W\,\mathbf{1}_p$, where $\mathbf{1}_p$ is the vector of ones of length $p$.

Python code:

def rowWiseSum(W): ones=np.ones((W.shape[1],1)) return W.dot(ones)

R code:

rowWiseSum=function(W) { ones=rep(1,ncol(W)) W%*%ones }
Matrix sum

The sum of all the elements of a matrix is the sum of its row sums. Using the previous expressions, the sum of all the terms of $W$ is $\mathbf{1}_n^\top W\,\mathbf{1}_p$.

Python code:

def matSum(W): rhs_ones=np.ones((W.shape[1],1)) lhs_ones=np.ones((W.shape[0],1)) return lhs_ones.T.dot(W).dot(rhs_ones)

R code:

matSum=function(W) { rhs_ones=rep(1,ncol(W)) lhs_ones=rep(1,nrow(W)) t(lhs_ones) %*% W%*% rhs_ones }
Similarity matrix (Gram matrix)

Let’s say we have a set of words, and for each of these words we want to find the most similar words from a dictionary. We assume that the words have been projected into a space of dimension $p$ (using word2vec). Let $X$ (our set of words) and $Y$ (our dictionary) be two matrices, resp. in $\mathbb{R}^{n_1 \times p}$ and $\mathbb{R}^{n_2 \times p}$. To compute the similarity of all the observations of $X$ and $Y$, we simply need to compute $XY^\top$.

Python code:

def gramMatrix(X,Y): return X.dot(Y.T)

R code:

gramMatrix=function(X,Y) { X %*% t(Y) }
L2 distance

We want to compute the pairwise distances between two sets of vectors. Let $X$ and $Y$ be two matrices in $\mathbb{R}^{n_1 \times p}$ and $\mathbb{R}^{n_2 \times p}$. For each vector of $X$ we need to compute the distance to all the vectors of $Y$. Hence, the output matrix should be of size $n_1 \times n_2$.

If $x$ and $y$ are two vectors, their squared distance is:

$$\lVert x - y \rVert^2 \;=\; \sum_{j=1}^{p} (x_j - y_j)^2 \;=\; \lVert x \rVert^2 + \lVert y \rVert^2 - 2\, x^\top y$$

To compute all pairwise distances, some work is required on the last equality. All the terms should end up as matrices of size $n_1 \times n_2$; the intermediate vector of distances will be of size $n_1 n_2$, which can then be reshaped into a matrix of size $n_1 \times n_2$.

The first two terms, $\lVert x \rVert^2$ and $\lVert y \rVert^2$, need to be computed for each $x$ and $y$. $X \odot X$ is the element-wise multiplication of $X$ with itself (its elements are $X_{ij}^2$). Hence, the $i$-th element of $(X \odot X)\,\mathbf{1}_p$ is the sum of the squared coordinates of the $i$-th observation, $\lVert x_i \rVert^2$.

However, this is a vector and its shape is $n_1 \times 1$. By replicating each of its elements $n_2$ times, we will get a matrix of size $n_1 \times n_2$. The replication can be done easily if we consider the right matrix $\Delta_{n_1,n_2}$.

Let $\mathbf{1}_{n_2}$ be a vector of ones of size $n_2$. Let $\Delta_{n_1,n_2}$ be the matrix of size $n_1 n_2 \times n_1$ with repetitions of $\mathbf{1}_{n_2}$ on the “diagonal”:

$$\Delta_{n_1,n_2} = \begin{pmatrix} \mathbf{1}_{n_2} & 0 & \cdots & 0 \\ 0 & \mathbf{1}_{n_2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{1}_{n_2} \end{pmatrix}$$

Then our final vector is $\Delta_{n_1,n_2}\,(X \odot X)\,\mathbf{1}_p$ (the same expression holds for $Y$ with $\Delta_{n_2,n_1}$). We denote by $R(\cdot)$ a reshape operator (used to transform the previous vector into a matrix). With the previous part on the similarity matrix, we get the following expression of the pairwise distance:

$$D = R\big(\Delta_{n_1,n_2} (X \odot X)\,\mathbf{1}_p\big) \;+\; R\big(\Delta_{n_2,n_1} (Y \odot Y)\,\mathbf{1}_p\big)^\top \;-\; 2\, X Y^\top$$

The previous expression can seem complex, but this will help us a lot to code the pairwise distance. We only have to do the translation from maths to Numpy or R.

Python code:

def L2dist(X,Y): n_1=X.shape[0] n_2=Y.shape[0] p=X.shape[1] ones=np.ones((p,1)) x_sq=(X**2).dot(ones)[:,0] y_sq=(Y**2).dot(ones)[:,0] delta_n1_n2=np.repeat(np.eye(n_1),n_2,axis=0) delta_n2_n1=np.repeat(np.eye(n_2),n_1,axis=0) return np.reshape(delta_n1_n2.dot(x_sq),(n_1,n_2))+np.reshape(delta_n2_n1.dot(y_sq),(n_2,n_1)).T-2*gramMatrix(X,Y)

R Code:

L2dist=function(X,Y) { n_1=dim(X)[1] n_2=dim(Y)[1] p=dim(X)[2] ones=rep(1,p) x_sq=X**2 %*% ones x_sq=t(matrix(diag(n_1) %x% rep(1, n_2) %*% x_sq, n_2,n_1)) y_sq=Y**2 %*% ones y_sq=matrix(diag(n_2) %x% rep(1, n_1) %*% y_sq,n_1,n_2) x_sq+y_sq-2*gramMatrix(X,Y) }
L2 distance (improved)

Actually, the previous L2dist is not completely optimized: it requires a lot of memory, since $\Delta_{n_1,n_2}$ has $n_1^2 n_2$ cells and is mostly empty. Using Numpy's built-in repeat function, we can circumvent this multiplication by directly repeating the vector (which reduces the memory footprint by a factor of $n_1$):

Python code:

def L2dist_improved(X,Y): n_1=X.shape[0] n_2=Y.shape[0] p=X.shape[1] ones=np.ones((p,1)) x_sq=(X**2).dot(ones)[:,0] y_sq=(Y**2).dot(ones)[:,0] ##Replace multiplication by a simple repeat X_rpt=np.repeat(x_sq,n_2).reshape((n_1,n_2)) Y_rpt=np.repeat(y_sq,n_1).reshape((n_2,n_1)).T return X_rpt+Y_rpt-2*gramMatrix(X,Y)

R code:

L2dist_improved=function(X,Y) { n_1=dim(X)[1] n_2=dim(Y)[1] p=dim(X)[2] ones=rep(1,p) x_sq=X**2 %*% ones x_sq=t(matrix(rep(x_sq,each=n_2),n_2,n_1)) y_sq=Y**2 %*% ones y_sq=matrix(rep(y_sq,each=n_1),n_1,n_2) x_sq+y_sq-2*gramMatrix(X,Y) }
L2 distance – benchmark

To show the interest of our previous work, let’s compare the computation speed of the vectorized L2 distance, the naive implementation and the scikit-learn implementation. The experiments are run on different dataset sizes with 100 repetitions.

Computation time of the L2 distance

The vectorized implementation is 2 to 3 orders of magnitude faster than the naive implementation and on par with the scikit implementation.

The post Machine Learning Explained: Vectorization and matrix operations appeared first on Enhance Data Science.


To leave a comment for the author, please follow the link and comment on their blog: Enhance Data Science.

Coordinate systems in ggplot2: easily overlooked and rather underrated

Mon, 05/07/2018 - 08:05

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

All plots have coordinate systems. Perhaps because they are such an integral element of plots, they are easily overlooked. However, in ggplot2, there are several very useful options to customize the coordinate systems of plots, which we will not overlook but explore in this blog post.

Since it is spring, we will use a random subset of the famous iris data set. When we plot the petal length against the petal width, and map species onto color and play around a little with the shape, color and sizes of aesthetics, one obtains this vernal plot:

# Base plot plot_base <- ggplot(data = df_iris) + geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species), size = 3, alpha = 0.9, shape = 8) + geom_point(aes(x = Petal.Length, y = Petal.Width), color = "yellow", size = 0.4) + scale_color_manual(values = c("#693FE9", "#A089F8", "#0000FF")) + theme_minimal()
Cartesian coordinate system
Zooming in and out

The coordinate system can be manipulated by adding one of ggplot2’s different coordinate systems. When you are imagining a coordinate system, you are most likely thinking of a Cartesian one. The Cartesian coordinate system combines the x and y dimensions orthogonally and is ggplot2’s default (coord_cartesian).

There are also several variations of the familiar Cartesian coordinate system in ggplot2, namely coord_fixed, coord_flip and coord_trans. For all of them, the displayed section of the data can be specified by defining the maximal value depicted on the x (xlim =) and y (ylim =) axes. This allows you to “zoom in” or “zoom out” of a plot. It is a great advantage that all manipulations of the coordinate system only alter the depiction of the data, not the data itself.

# Zooming in with xlim/ylim plot_base + coord_cartesian(xlim = 5, ylim = 2) + ggtitle("coord_cartesian with xlim = 5 and ylim = 2")

Specifying the “aspect ratio” of the axes

Via coord_fixed one can specify the exact ratio of the length of a y unit relative to the length of a x unit within the final visualization.

# Setting the "aspect ratio" of y vs. x units plot_base + coord_fixed(ratio = 1/2) + ggtitle("coord_fixed with ratio = 1/2")

Transforming the scales of the axes

This helps to emphasize the exact insight one wants to communicate. Another way to do so is coord_trans, which allows several transformations of the x and y variables (see the table below, taken from Wickham 2016, page 97). Let me stress this again: very conveniently, such transformations only pertain to the depicted – not the actual – scale of the data. This is also why, regardless of the conducted transformation, the original values are used as axis labels.

The available transformers (each listed with its name, function and inverse in the original table) are: asn, exp, identity, log, log10, log2, logit, pow10, probit, recip, reverse and sqrt.

# Transforming the axes plot_base + coord_trans(x = "log", y = "log2") + ggtitle("coord_trans with x = \"log\" and y = \"log2\"")

Swapping the axes

The last of the Cartesian options, coord_flip, swaps the x and y axes. For example, this option can be useful when we intend to change the orientation of univariate plots such as histograms, or of plot types – like box plots – that visualize the distribution of a continuous variable over the categories of another variable. Nonetheless, coord_flip also works with all other plots. This multiplies the overall possibilities for designing plots, especially since all Cartesian coordinate systems can be combined.

# Swapping axes # base plot #2 p1 <- ggplot(data = df_iris) + geom_bar(aes(x = Species, fill = Species), alpha = 0.6) + scale_fill_manual(values = c("#693FE9", "#A089F8", "#4f5fb7")) + theme_minimal() # base plot & coord_flip() p2 <- ggplot(data = df_iris) + geom_bar(aes(x = Species, fill = Species), alpha = 0.6) + scale_fill_manual(values = c("#693FE9", "#A089F8", "#4f5fb7")) + theme_minimal() + coord_flip() gridExtra::grid.arrange(p1, p2, top = "Bar plot without and with coord_flip")

Polar coordinate system

The customization of Cartesian coordinate systems allows for the fine-tuning of plots. However, coord_polar, the final coordinate system discussed here, changes the whole character of a plot. By using coord_polar, bar geoms are transformed to pie charts or “bullseye” plots, while line geoms are transformed to radar charts. This is done by mapping x and y to the angle and radius of the resulting plot. By default, the x variable is mapped to the angle, but this can be changed by setting the theta argument of coord_polar to “y”.


While such plots might shine with respect to novelty and looks, their perceptual properties are intricate, and their correct interpretation may be quite hard and rather unintuitive.

# Base plot 2 (long format, x = 1 is summed up to generate count) plot_base_2 <- df_iris %>% dplyr::mutate(x = 1) %>% ggplot(.) + geom_bar(aes(x = x, fill = Species), alpha = 0.6) + theme(axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank()) + scale_fill_manual(values = c("#693FE9", "#A089F8", "#4f5fb7")) + theme_minimal() + ggtitle("base plot") # Bullseye plot # geom_bar & coord_polar(theta = "x") p2 <- plot_base_2 + coord_polar(theta = "x") + ggtitle("theta = \"x\"") # Pie chart # geom_bar & coord_polar(theta = "y") p3 <- plot_base_2 + coord_polar(theta = "y") + ggtitle("theta = \"y\"") gridExtra::grid.arrange(p2, p3, plot_base_2, top = "geom_bar & coord_polar", ncol = 2) # Base plot 3 (long format, mean width/length of sepals/petals calculated) plot_base_3 <- iris %>% dplyr::group_by(Species) %>% dplyr::summarise(Petal.Length = mean(Petal.Length), Sepal.Length = mean(Sepal.Length), Sepal.Width = mean(Sepal.Width), Petal.Width = mean(Petal.Width)) %>% reshape2::melt() %>% ggplot() + geom_polygon(aes(group = Species, color = Species, y = value, x = variable), fill = NA) + scale_color_manual(values = c("#693FE9", "#A089F8", "#4f5fb7")) + theme_minimal() + ggtitle("base plot") # Radar plot # geom_polygon & coord_polar p2 <- plot_base_3 + theme_minimal() + coord_polar() + ggtitle("coord_polar") gridExtra::grid.arrange(plot_base_3, p2, top = "geom_polygon & coord_polar", ncol = 2)

 

References
  • Wickham, H. (2016). ggplot2: elegant graphics for data analysis. Springer.
About the author

Lea Waniek

Lea is a member of the Data Science team and also provides support in the area of statistics.

The post Coordinate systems in ggplot2: easily overlooked and rather underrated first appeared on STATWORX.


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX.

RcppMsgPack 0.2.2

Sun, 05/06/2018 - 23:18

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance release of RcppMsgPack got onto CRAN this afternoon. It contains a single static_cast fix to address a warning which g++-8.1 shows whereas older compilers remained silent, and which CRAN asked us to address.

MessagePack itself is an efficient binary serialization format. It lets you exchange data among multiple languages like JSON. But it is faster and smaller. Small integers are encoded into a single byte, and typical short strings require only one extra byte in addition to the strings themselves. RcppMsgPack brings both the C++ headers of MessagePack as well as clever code (in both R and C++) Travers wrote to access MsgPack-encoded objects directly from R.

Changes in version 0.2.2 (2018-05-06)
  • Apply a static_cast from upstream to suppress a warning from g++-8.1 as requested by CRAN.

Courtesy of CRANberries, there is also a diffstat report for this release.

More information is on the RcppMsgPack page. Issues and bug reports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Analyzing Spotify Song Metrics to Visualize Popular Songs

Sun, 05/06/2018 - 21:04

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

How would you go about describing your favorite new song? What makes it catchy to you? According to the people at Spotify, characteristics like the energy or mood of a song can actually be quantified, and Spotify has many algorithms to describe music in an amazing number of ways.

Now, why would you care about what the people at Spotify have to say about a song? With Spotify's user base of 159 million monthly active users, determining key factors that affect popularity can actually be a powerful tool for record label producers to find new artists to sign, or for aspiring data scientists to show off some nice visualizations and determine what to put on the ultimate summer playlist. Popularity is well defined in their API notes as a function of the number of times a particular song was streamed, as well as how recent those streams are.

About the App

This app visualizes several key song metrics and investigates their correlation with popularity across a wide spectrum of music.

The first two plots offer a degree of interactivity, allowing the user to visualize the difference amongst the genres. The box plot helps to see a more quantitative take on the separation across a wide musical spectrum. The density plot helps more with visualizing across the 0 – 100 scale of popularity, to see if there are any abnormalities with popularity distribution (for instance, classical music seems to have a pretty well defined bi-modal distribution of popularity!)
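The app's code is linked further below; as a rough idea of what the density panel boils down to, here is a minimal ggplot2 sketch, assuming a data frame spotify_df with one row per track and columns genre and popularity:

library(ggplot2)

# Hypothetical data frame: one row per track, with its genre label and the
# 0-100 popularity score returned by the Spotify API
ggplot(spotify_df, aes(x = popularity, fill = genre)) +
  geom_density(alpha = 0.4) +
  labs(x = "Spotify popularity (0-100)", y = "Density",
       title = "Popularity distribution by genre")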

Besides looking at each genre as a whole, I wanted to visualize a subset of each genre on a scatter plot to identify clusters and to look at other variables, like energy or danceability, and how they change along with popularity. However, I ran into many issues in trying to separate the genres and effectively display the information. I decided on a 3D scatter plot, adding another user-input variable to look at two separate correlations, with a very interactive plot that lets the user zoom in and rotate the axes to better display the information to their preference. I have also included a small table to look at the Pearson correlation coefficient of several of the metrics from Spotify with popularity.

Finally, I took the 50th percentile (in terms of popularity) from each genre in my dataset and displayed them in a datatable in terms of ‘threshold values’ for each genre. For instance, for a successfully popular metal song, the relative energy would need to be quite high, as the 50th percentile has a value of 0.902. Also interestingly enough, danceability seems to be a much more crucial factor for pop as opposed to indie pop.

About the Dataset

The dataset was obtained by using the Spotify Web API in combination with the Python 3 library Spotipy. For each genre I chose, I queried 3,000 songs for the Spotify audio analysis and features. The API has a 50-song limit per request, so I had to create a loop that queried the API in 50-song chunks until each genre's 3,000 songs were collected, and store the results in a pandas dataframe. Afterwards, I wrote the data into a CSV to do the majority of the analysis within R.
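The actual querying was done in Python with Spotipy (see the notebook linked below), but the paging idea translates directly to R. Here is a rough httr sketch against the Spotify Web API search endpoint; the genre: query filter, the helper name get_tracks_for_genre and the token variable are assumptions for illustration, not code from the project:

library(httr)
library(purrr)

# Assumes `token` already holds a valid Spotify Web API OAuth access token
get_tracks_for_genre <- function(genre, n_tracks = 3000, token) {
  offsets <- seq(0, n_tracks - 50, by = 50)   # one request per 50-track page
  map_dfr(offsets, function(off) {
    res <- GET(
      "https://api.spotify.com/v1/search",
      add_headers(Authorization = paste("Bearer", token)),
      query = list(q = paste0("genre:", genre), type = "track",
                   limit = 50, offset = off)
    )
    items <- content(res)$tracks$items
    map_dfr(items, ~ tibble::tibble(id = .x$id,
                                    name = .x$name,
                                    popularity = .x$popularity))
  })
}

metal_tracks <- get_tracks_for_genre("metal", token = token)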

The jupyter notebook used to query the server as well as the Shiny application can be found at this GitHub repo.

Future Work

I would love to continue this analysis of popularity metrics with clustering/regression analysis at a further date, or to be able to develop a predictive model and feed information into it via Spotipy to determine up-and-coming popular artists.

For any comments or questions, please reach me via e-mail at joshvichare@gmail.com.


To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.

Statistics Sunday: Tokenizing Text

Sun, 05/06/2018 - 18:12

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


I recently started working my way through Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. There are many interesting ways text analysis can be used, not only for research interests but for marketing and business insights. Today, I’d like to introduce one of the basic concepts necessary for conducting text analysis: tokens.

In text analysis, a token is any kind of meaningful text unit that can be analyzed. Frequently, a token is a word and the process of tokenizing splits up the text into individual words and counts up how many times each word appears in the text. But a token could also be a phrase (such as each two-word combination present in a text, which is called a bi-gram), a sentence, a paragraph, even a whole chapter. Obviously, the size of the token you choose impacts what kind of analysis you can do. Generally, people choose smaller tokens, like words.
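As a quick illustration of a larger token, here is a minimal bi-gram sketch using the tidytext package that is loaded later in this post (the toy text_df data frame is made up for the example):

library(dplyr)
library(tidytext)

# Toy one-column data frame of raw text
text_df <- tibble::tibble(text = c("the war of the worlds is a classic",
                                   "the martians arrive and the war begins"))

# token = "ngrams" with n = 2 splits the text into overlapping two-word phrases
text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)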

Let’s use R to download the text of a classic book (which I did previously in this post, but today, I’ll do in an easier way) and tokenize it by word.

Any text available in the Project Gutenberg repository can be downloaded, with header and footer information stripped out, with the gutenbergr package.

install.packages("gutenbergr")
library(gutenbergr)

The package comes with a dataset, called gutenberg_metadata, that contains a list, by ID, of all text available. Let’s use The War of the Worlds by H.G. Wells as our target book. We can find our target book like this:

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ggplot2 2.2.1 purrr 0.2.4
## tibble 1.4.2 dplyr 0.7.4
## tidyr 0.8.0 stringr 1.3.0
## readr 1.1.1 forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## dplyr::filter() masks stats::filter()
## dplyr::lag() masks stats::lag()
gutenberg_metadata %>%
filter(title == "The War of the Worlds")
## # A tibble: 3 x 8
## gutenberg_id title author gutenberg_autho… language gutenberg_books…
##
## 1 36 The Wa… Wells, … 30 en Movie Books/Sci…
## 2 8976 The Wa… Wells, … 30 en Movie Books/Sci…
## 3 26291 The Wa… Wells, … 30 en
## # ... with 2 more variables: rights , has_text

The ID for The War of the Worlds is 36. Now I can use that information to download the text of the book into a data frame, using the gutenbergr function, gutenberg_download.

warofworlds<-gutenberg_download(36)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org

Now I have a dataset with two variables: one containing the Project Gutenberg ID for the text (which is helpful if you create a dataset with multiple texts, perhaps all by the same author or within the same genre) and one containing a line of text. To tokenize our dataset, we need the R package, tidytext.

install.packages("tidytext")
library(tidytext)

We can tokenize with the function unnest_tokens: first we tell it to tokenize by word, then we tell it which column to look in to find the text.

tidywow<-warofworlds %>%
unnest_tokens(word, text)

Now we have a dataset with each word from the book, one after the other. There are duplicates in here, because I haven’t told R to count up the words. Before I do that, I probably want to tell R to ignore extremely common words, like “the,” “and,” “to,” and so on. In text analysis, these are called stop words, and tidytext comes with a dataset called stop_words that can be used to drop stop words from your text data.

tidywow<-tidywow %>%
anti_join(stop_words)
## Joining, by = "word"

Last, we have R count up the words.

wowtokens<-tidywow %>%
count(word, sort=TRUE)
head(wowtokens)
## # A tibble: 6 x 2
## word n
##
## 1 martians 163
## 2 people 159
## 3 black 122
## 4 time 121
## 5 road 104
## 6 night 102

After removing stop words, of which there may be hundreds or thousands in any text, the most common words are: Martians, people, black, time, and road.
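If you want a quick visual check of these counts, here is a short sketch with ggplot2, building on the wowtokens data frame created above:

library(ggplot2)

# Bar chart of the 10 most frequent (non-stop) words
wowtokens %>%
  top_n(10, n) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Number of occurrences")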


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Send tweets from R: A very short walkthrough

Sun, 05/06/2018 - 12:28

(This article was first published on Rcrastinate, and kindly contributed to R-bloggers)

There are a few reasons why you might want to send tweets from R. You might want to write a Twitter bot or – as in my case – you want to send yourself a tweet when a very long computation finishes.

So, here I will run you through all the steps you have to take using
– Twitter’s API and
– the twitteR package written by Jeff Gentry.

The setup to send myself tweets is the following: I have my main twitter account and an additional account I am only using to tweet from R. I am following the additional account with my main account. On my phone’s Twitter app, my main account is logged in and notifications are activated only for the additional account. So, whenever I tweet something with the additional account, the new tweet pops up on my phone.

Let’s get started.

Step 1: Create an additional twitter account and log in with this one.
Step 2: Go to apps.twitter.com (Application Management) and create a new app. After completing the process your Application Management main screen should look something like this.

Step 3: Click on the name of your app and then select “manage keys and access tokens”. This should look like this (without the red boxes, of course).

Step 4: Copy your Consumer Key, Consumer Secret, Access Token and Access Token Secret. You will need them for authenticating from your R script.

Step 5: Fire up R and install the twitteR package.

install.packages("twitteR")
library(twitteR)
setup_twitter_oauth(consumer_key = "",
                    access_token = "",
                    consumer_secret = "",
                    access_secret = "")

Step 6: Tweet something!
tw <- updateStatus("My first tweet from R!")

Basically, that’s it. But you can do more cool stuff. If, for example, you only want to read the notification about the tweet on your phone, you can delete the latest tweet just as fast.
deleteStatus(tw)
Aaaand, it’s gone!
You can also include a plot you just did in R! In this case, do not delete the tweet, because the media file will be deleted too and you cannot view it. Let’s do this.
t1 <- Sys.time()
gauss <- rnorm(n = 10000)
jpeg("temp.jpg")
plot(density(gauss))
dev.off()
t2 <- difftime(Sys.time(), t1, units = "secs")

tw <- updateStatus(paste("Finished in", round(t2, 2), "seconds"), mediaPath = "temp.jpg")
And just a few seconds later, this pops up on my phone.
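One handy pattern is to wrap the notification in a small helper so any long-running script can call it at the end. This is just a sketch: the name notifyMe is arbitrary, and it assumes setup_twitter_oauth() has already been run in the session.

notifyMe <- function(msg) {
  # Appending the time keeps each status unique, so Twitter does not
  # reject it as a duplicate tweet.
  tryCatch(updateStatus(paste(msg, "at", Sys.time())),
           error = function(e) message("Tweet failed: ", conditionMessage(e)))
}

# At the end of a long-running script:
# notifyMe("Simulation finished")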

 
I’m sure you’ll come up with a lot more cool stuff you can do with this. Let’s hear about it in the comments. Also, feel free to report any problems and I will try to help.


To leave a comment for the author, please follow the link and comment on their blog: Rcrastinate. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Deep Learning from first principles in Python, R and Octave – Part 8

Sun, 05/06/2018 - 05:28

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

1. Introduction

You don’t understand anything until you learn it more than one way. Marvin Minsky
No computer has ever been designed that is ever aware of what it’s doing; but most of the time, we aren’t either. Marvin Minsky
A wealth of information creates a poverty of attention. Herbert Simon

This post, Deep Learning from first Principles in Python, R and Octave-Part8, is my final post in my Deep Learning from first principles series. In this post, I discuss and implement a key functionality needed while building Deep Learning networks viz. ‘Gradient Checking’. Gradient Checking is an important method to check the correctness of your implementation, specifically the forward propagation and the backward propagation cycles of an implementation. In addition I also discuss some tips for tuning hyper-parameters of a Deep Learning network based on my experience.

My posts in this ‘Deep Learning Series’ so far were
1. Deep Learning from first principles in Python, R and Octave – Part 1 In part 1, I implement logistic regression as a neural network in vectorized Python, R and Octave
2. Deep Learning from first principles in Python, R and Octave – Part 2 In the second part I implement a simple Neural network with just 1 hidden layer and a sigmoid activation output function
3. Deep Learning from first principles in Python, R and Octave – Part 3 The 3rd part implemented a multi-layer Deep Learning Network with sigmoid activation output in vectorized Python, R and Octave
4. Deep Learning from first principles in Python, R and Octave – Part 4 The 4th part deals with multi-class classification. Specifically, I derive the Jacobian of the Softmax function and enhance my L-Layer DL network to include Softmax output function in addition to Sigmoid activation
5. Deep Learning from first principles in Python, R and Octave – Part 5 This post uses the Softmax classifier implemented to classify MNIST digits using a L-layer Deep Learning network
6. Deep Learning from first principles in Python, R and Octave – Part 6 The 6th part adds more bells and whistles to my L-Layer DL network, by including different initialization types, namely He and Xavier. Besides this, L2 regularization and random dropout are added.
7. Deep Learning from first principles in Python, R and Octave – Part 7 The 7th part deals with Stochastic Gradient Descent Optimization methods including momentum, RMSProp and Adam
8. Deep Learning from first principles in Python, R and Octave – Part 8 – This post implements a critical function for ensuring the correctness of a L-Layer Deep Learning network implementation using Gradient Checking

By the way, also take a look at my compact, minimal book “Practical Machine Learning with R and Python – Machine Learning in stereo”, available on Amazon in paperback ($9.99) and Kindle ($6.99) versions. This book is ideal as a quick reference for the various ML functions and associated measurements in both R and Python, which are essential for delving deep into Deep Learning.

Gradient Checking is based on the following approach. One iteration of Gradient Descent computes and updates the parameters \theta by doing
\theta := \theta - \alpha \frac{\partial J(\theta)}{\partial \theta}
To minimize the cost we will need to minimize J(\theta).
Let g(\theta) be a function that computes the derivative \frac{d}{d\theta}J(\theta). Gradient Checking allows us to numerically evaluate the implementation of this function and verify its correctness.
We know the derivative of a function is given by
\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \to 0} \frac{J(\theta+\epsilon) - J(\theta-\epsilon)}{2\epsilon}
Note: The above is the 2-sided derivative. The 1-sided derivative is given by
\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \to 0} \frac{J(\theta+\epsilon) - J(\theta)}{\epsilon}
Gradient Checking is based on the 2-sided derivative because its error is of the order O(\epsilon^{2}), as opposed to O(\epsilon) for the 1-sided derivative.
Hence Gradient Check uses the 2-sided approximation
g(\theta) \approx \frac{J(\theta+\epsilon) - J(\theta-\epsilon)}{2\epsilon}
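As a quick numerical illustration of the difference in accuracy (a toy R sketch, not part of the network code), take J(θ) = θ³, whose exact derivative at θ = 1 is 3:

J <- function(theta) theta^3   # toy cost with known derivative 3*theta^2
theta <- 1
eps <- 1e-2

twoSided <- (J(theta + eps) - J(theta - eps)) / (2 * eps)
oneSided <- (J(theta + eps) - J(theta)) / eps

# Errors relative to the exact derivative 3: about 1e-04 (order eps^2)
# for the 2-sided version versus about 3e-02 (order eps) for the 1-sided one
c(twoSided - 3, oneSided - 3)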

In Gradient Check the following is done
A) Run one normal cycle of your implementation by doing the following
a) Compute the output activation by running 1 cycle of forward propagation
b) Compute the cost using the output activation
c) Compute the gradients using backpropagation (‘grad’)

B) Perform the gradient check steps as below
a) Set ε = 1e-7. Flatten all ‘weights’ and ‘bias’ matrices and vectors into a column vector θ.
b) Initialize θ+ by bumping θ up by ε, i.e. θ+ = θ + ε
c) Perform forward propagation with θ+
d) Compute the cost with θ+, i.e. J(θ+)
e) Initialize θ- by bumping θ down by ε, i.e. θ- = θ − ε
f) Perform forward propagation with θ-
g) Compute the cost with θ-, i.e. J(θ-)
h) Compute g(θ), or ‘gradapprox’, as (J(θ+) − J(θ-)) / (2ε), using the 2-sided derivative.
i) Compute the L2 norm (Euclidean distance) of the difference between ‘grad’ and ‘gradapprox’, relative to their norms. If this difference is of the order of 1e-5 to 1e-7, the implementation is correct. In the Deep Learning Specialization Prof Andrew Ng mentions that if the difference is of the order of 1e-7 then the implementation is correct; a difference of 1e-5 is also ok. Anything more than that is a cause for worry and you should look at your code more closely. To see more details click Gradient checking and advanced optimization. (A condensed sketch of these steps follows below.)
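The steps above can be condensed into a short generic sketch in R – a toy quadratic cost with an analytic gradient standing in for backprop, not the full L-layer implementation shown later:

costFn <- function(theta) sum(theta^2)   # toy cost J(theta)
gradFn <- function(theta) 2 * theta      # analytic gradient, stands in for backprop's 'grad'

theta <- c(0.5, -1.2, 2.0)               # flattened parameter vector
eps <- 1e-7
grad <- gradFn(theta)
gradapprox <- numeric(length(theta))

for (i in seq_along(theta)) {
  thetaplus  <- theta; thetaplus[i]  <- thetaplus[i]  + eps
  thetaminus <- theta; thetaminus[i] <- thetaminus[i] - eps
  gradapprox[i] <- (costFn(thetaplus) - costFn(thetaminus)) / (2 * eps)
}

# Relative difference via L2 norms; a value around 1e-7 indicates a correct gradient
numerator <- sqrt(sum((grad - gradapprox)^2))
denominator <- sqrt(sum(grad^2)) + sqrt(sum(gradapprox^2))
numerator / denominator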

You can clone/download the code from Github at DeepLearning-Part8

After spending the better part of 3 days, I now realize how critical Gradient Check is for ensuring the correctness of your implementation. Initially I was getting a very high difference and did not know how to interpret the results or debug my implementation. After many hours of staring at the results, I was finally able to arrive at a way to localize issues in the implementation. In fact, I did catch a small bug in my Python code, which did not exist in the R and Octave implementations. I will demonstrate this below.

1.1a Gradient Check – Sigmoid Activation – Python import numpy as np import matplotlib exec(open("DLfunctions8.py").read()) exec(open("testcases.py").read()) #Load the data train_X, train_Y, test_X, test_Y = load_dataset() #Set layer dimensions layersDimensions = [2,4,1] parameters = initializeDeepModel(layersDimensions) #Perform forward prop AL, caches, dropoutMat = forwardPropagationDeep(train_X, parameters, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="sigmoid") #Compute cost cost = computeCost(AL, train_Y, outputActivationFunc="sigmoid") print("cost=",cost) #Perform backprop and get gradients gradients = backwardPropagationDeep(AL, train_Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="sigmoid") epsilon = 1e-7 outputActivationFunc="sigmoid" # Set-up variables # Flatten parameters to a vector parameters_values, _ = dictionary_to_vector(parameters) #Flatten gradients to a vector grad = gradients_to_vector(parameters,gradients) num_parameters = parameters_values.shape[0] #Initialize J_plus = np.zeros((num_parameters, 1)) J_minus = np.zeros((num_parameters, 1)) gradapprox = np.zeros((num_parameters, 1)) # Compute gradapprox using 2 sided derivative for i in range(num_parameters): # Compute J_plus[i]. thetaplus = np.copy(parameters_values) thetaplus[i][0] = thetaplus[i][0] + epsilon AL, caches, dropoutMat = forwardPropagationDeep(train_X, vector_to_dictionary(parameters,thetaplus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) J_plus[i] = computeCost(AL, train_Y, outputActivationFunc=outputActivationFunc) # Compute J_minus[i]. thetaminus = np.copy(parameters_values) thetaminus[i][0] = thetaminus[i][0] - epsilon AL, caches, dropoutMat = forwardPropagationDeep(train_X, vector_to_dictionary(parameters,thetaminus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) J_minus[i] = computeCost(AL, train_Y, outputActivationFunc=outputActivationFunc) # Compute gradapprox[i] gradapprox[i] = (J_plus[i] - J_minus[i])/(2*epsilon) # Compare gradapprox to backward propagation gradients by computing difference. numerator = np.linalg.norm(grad-gradapprox) denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox) difference = numerator/denominator #Check the difference if difference > 1e-5: print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m") else: print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m") print(difference) print("\n") # The technique below can be used to identify # which of the parameters are in error # Covert grad to dictionary m=vector_to_dictionary2(parameters,grad) print("Gradients from backprop") print(m) print("\n") # Convert gradapprox to dictionary n=vector_to_dictionary2(parameters,gradapprox) print("Gradapprox from gradient check") print(n) ## (300, 2) ## (300,) ## cost= 0.6931455556341791 ## [92mYour backward propagation works perfectly fine! 
difference = 1.1604150683743381e-06[0m ## 1.1604150683743381e-06 ## ## ## Gradients from backprop ## {'dW1': array([[-6.19439955e-06, -2.06438046e-06], ## [-1.50165447e-05, 7.50401672e-05], ## [ 1.33435433e-04, 1.74112143e-04], ## [-3.40909024e-05, -1.38363681e-04]]), 'db1': array([[ 7.31333221e-07], ## [ 7.98425950e-06], ## [ 8.15002817e-08], ## [-5.69821155e-08]]), 'dW2': array([[2.73416304e-04, 2.96061451e-04, 7.51837363e-05, 1.01257729e-04]]), 'db2': array([[-7.22232235e-06]])} ## ## ## Gradapprox from gradient check ## {'dW1': array([[-6.19448937e-06, -2.06501483e-06], ## [-1.50168766e-05, 7.50399742e-05], ## [ 1.33435485e-04, 1.74112391e-04], ## [-3.40910633e-05, -1.38363765e-04]]), 'db1': array([[ 7.31081862e-07], ## [ 7.98472399e-06], ## [ 8.16013923e-08], ## [-5.71764858e-08]]), 'dW2': array([[2.73416290e-04, 2.96061509e-04, 7.51831930e-05, 1.01257891e-04]]), 'db2': array([[-7.22255589e-06]])} 1.1b Gradient Check – Softmax Activation – Python (Error!!)

In the code below I show how I managed to spot a bug in my implementation

import numpy as np exec(open("DLfunctions8.py").read()) N = 100 # number of points per class D = 2 # dimensionality K = 3 # number of classes X = np.zeros((N*K,D)) # data matrix (each row = single example) y = np.zeros(N*K, dtype='uint8') # class labels for j in range(K): ix = range(N*j,N*(j+1)) r = np.linspace(0.0,1,N) # radius t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta X[ix] = np.c_[r*np.sin(t), r*np.cos(t)] y[ix] = j # Plot the data #plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral) layersDimensions = [2,3,3] y1=y.reshape(-1,1).T train_X=X.T train_Y=y1 parameters = initializeDeepModel(layersDimensions) #Compute forward prop AL, caches, dropoutMat = forwardPropagationDeep(train_X, parameters, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") #Compute cost cost = computeCost(AL, train_Y, outputActivationFunc="softmax") print("cost=",cost) #Compute gradients from backprop gradients = backwardPropagationDeep(AL, train_Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") # Note the transpose of the gradients for Softmax has to be taken L= len(parameters)//2 print(L) gradients['dW'+str(L)]=gradients['dW'+str(L)].T gradients['db'+str(L)]=gradients['db'+str(L)].T # Perform gradient check gradient_check_n(parameters, gradients, train_X, train_Y, epsilon = 1e-7,outputActivationFunc="softmax") cost= 1.0986187818144022 2 There is a mistake in the backward propagation! difference = 0.7100295155692544 0.7100295155692544 Gradients from backprop {'dW1': array([[ 0.00050125, 0.00045194], [ 0.00096392, 0.00039641], [-0.00014276, -0.00045639]]), 'db1': array([[ 0.00070082], [-0.00224399], [ 0.00052305]]), 'dW2': array([[-8.40953794e-05, -9.52657769e-04, -1.10269379e-04], [-7.45469382e-04, 9.49795606e-04, 2.29045434e-04], [ 8.29564761e-04, 2.86216305e-06, -1.18776055e-04]]), 'db2': array([[-0.00253808], [-0.00505508], [ 0.00759315]])} Gradapprox from gradient check {'dW1': array([[ 0.00050125, 0.00045194], [ 0.00096392, 0.00039641], [-0.00014276, -0.00045639]]), 'db1': array([[ 0.00070082], [-0.00224399], [ 0.00052305]]), 'dW2': array([[-8.40960634e-05, -9.52657953e-04, -1.10268461e-04], [-7.45469242e-04, 9.49796908e-04, 2.29045671e-04], [ 8.29565305e-04, 2.86104473e-06, -1.18776100e-04]]), 'db2': array([[-8.46211989e-06], [-1.68487446e-05], [ 2.53108645e-05]])}

Gradient Check gives a high value for the difference (0.7100295). Inspecting the gradients and gradapprox we can see there is a very big discrepancy in db2. After I went over my code, I discovered that my computation in the function layerActivationBackward for Softmax was

# Erroneous code
if activationFunc == 'softmax':
    dW = 1/numtraining * np.dot(A_prev,dZ)
    db = np.sum(dZ, axis=0, keepdims=True)
    dA_prev = np.dot(dZ,W)

instead of

# Fixed code
if activationFunc == 'softmax':
    dW = 1/numtraining * np.dot(A_prev,dZ)
    db = 1/numtraining * np.sum(dZ, axis=0, keepdims=True)
    dA_prev = np.dot(dZ,W)

After fixing this error, when I ran Gradient Check I got

1.1c Gradient Check – Softmax Activation – Python (Corrected!!) import numpy as np exec(open("DLfunctions8.py").read()) N = 100 # number of points per class D = 2 # dimensionality K = 3 # number of classes X = np.zeros((N*K,D)) # data matrix (each row = single example) y = np.zeros(N*K, dtype='uint8') # class labels for j in range(K): ix = range(N*j,N*(j+1)) r = np.linspace(0.0,1,N) # radius t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta X[ix] = np.c_[r*np.sin(t), r*np.cos(t)] y[ix] = j # Plot the data #plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral) layersDimensions = [2,3,3] y1=y.reshape(-1,1).T train_X=X.T train_Y=y1 #Set layer dimensions parameters = initializeDeepModel(layersDimensions) #Perform forward prop AL, caches, dropoutMat = forwardPropagationDeep(train_X, parameters, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") #Compute cost cost = computeCost(AL, train_Y, outputActivationFunc="softmax") print("cost=",cost) #Compute gradients from backprop gradients = backwardPropagationDeep(AL, train_Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") # Note the transpose of the gradients for Softmax has to be taken L= len(parameters)//2 print(L) gradients['dW'+str(L)]=gradients['dW'+str(L)].T gradients['db'+str(L)]=gradients['db'+str(L)].T #Perform gradient check gradient_check_n(parameters, gradients, train_X, train_Y, epsilon = 1e-7,outputActivationFunc="softmax") ## cost= 1.0986193170234435 ## 2 ## [92mYour backward propagation works perfectly fine! difference = 5.268804859613151e-07[0m ## 5.268804859613151e-07 ## ## ## Gradients from backprop ## {'dW1': array([[ 0.00053206, 0.00038987], ## [ 0.00093941, 0.00038077], ## [-0.00012177, -0.0004692 ]]), 'db1': array([[ 0.00072662], ## [-0.00210198], ## [ 0.00046741]]), 'dW2': array([[-7.83441270e-05, -9.70179498e-04, -1.08715815e-04], ## [-7.70175008e-04, 9.54478237e-04, 2.27690198e-04], ## [ 8.48519135e-04, 1.57012608e-05, -1.18974383e-04]]), 'db2': array([[-8.52190476e-06], ## [-1.69954294e-05], ## [ 2.55173342e-05]])} ## ## ## Gradapprox from gradient check ## {'dW1': array([[ 0.00053206, 0.00038987], ## [ 0.00093941, 0.00038077], ## [-0.00012177, -0.0004692 ]]), 'db1': array([[ 0.00072662], ## [-0.00210198], ## [ 0.00046741]]), 'dW2': array([[-7.83439980e-05, -9.70180603e-04, -1.08716369e-04], ## [-7.70173925e-04, 9.54478718e-04, 2.27690089e-04], ## [ 8.48520143e-04, 1.57018842e-05, -1.18973720e-04]]), 'db2': array([[-8.52096171e-06], ## [-1.69964043e-05], ## [ 2.55162558e-05]])} 1.2a Gradient Check – Sigmoid Activation – R source("DLfunctions8.R") z <- as.matrix(read.csv("circles.csv",header=FALSE)) x <- z[,1:2] y <- z[,3] X <- t(x) Y <- t(y) #Set layer dimensions layersDimensions = c(2,5,1) parameters = initializeDeepModel(layersDimensions) #Perform forward prop retvals = forwardPropagationDeep(X, parameters,keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="sigmoid") AL <- retvals[['AL']] caches <- retvals[['caches']] dropoutMat <- retvals[['dropoutMat']] #Compute cost cost <- computeCost(AL, Y,outputActivationFunc="sigmoid", numClasses=layersDimensions[length(layersDimensions)]) print(cost) ## [1] 0.6931447 # Backward propagation. 
gradients = backwardPropagationDeep(AL, Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="sigmoid",numClasses=layersDimensions[length(layersDimensions)]) epsilon = 1e-07 outputActivationFunc="sigmoid" #Convert parameter list to vector parameters_values = list_to_vector(parameters) #Convert gradient list to vector grad = gradients_to_vector(parameters,gradients) num_parameters = dim(parameters_values)[1] #Initialize J_plus = matrix(rep(0,num_parameters), nrow=num_parameters,ncol=1) J_minus = matrix(rep(0,num_parameters), nrow=num_parameters,ncol=1) gradapprox = matrix(rep(0,num_parameters), nrow=num_parameters,ncol=1) # Compute gradapprox for(i in 1:num_parameters){ # Compute J_plus[i]. thetaplus = parameters_values thetaplus[i][1] = thetaplus[i][1] + epsilon retvals = forwardPropagationDeep(X, vector_to_list(parameters,thetaplus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) AL <- retvals[['AL']] J_plus[i] = computeCost(AL, Y, outputActivationFunc=outputActivationFunc) # Compute J_minus[i]. thetaminus = parameters_values thetaminus[i][1] = thetaminus[i][1] - epsilon retvals = forwardPropagationDeep(X, vector_to_list(parameters,thetaminus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) AL <- retvals[['AL']] J_minus[i] = computeCost(AL, Y, outputActivationFunc=outputActivationFunc) # Compute gradapprox[i] gradapprox[i] = (J_plus[i] - J_minus[i])/(2*epsilon) } # Compare gradapprox to backward propagation gradients by computing difference. #Compute L2Norm numerator = L2NormVec(grad-gradapprox) denominator = L2NormVec(grad) + L2NormVec(gradapprox) difference = numerator/denominator if(difference > 1e-5){ cat("There is a mistake, the difference is too high",difference) } else{ cat("The implementations works perfectly", difference) } ## The implementations works perfectly 1.279911e-06 # This can be used to check print("Gradients from backprop") ## [1] "Gradients from backprop" vector_to_list2(parameters,grad) ## $dW1 ## [,1] [,2] ## [1,] -7.641588e-05 -3.427989e-07 ## [2,] -9.049683e-06 6.906304e-05 ## [3,] 3.401039e-06 -1.503914e-04 ## [4,] 1.535226e-04 -1.686402e-04 ## [5,] -6.029292e-05 -2.715648e-04 ## ## $db1 ## [,1] ## [1,] 6.930318e-06 ## [2,] -3.283117e-05 ## [3,] 1.310647e-05 ## [4,] -3.454308e-05 ## [5,] -2.331729e-08 ## ## $dW2 ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0.0001612356 0.0001113475 0.0002435824 0.000362149 2.874116e-05 ## ## $db2 ## [,1] ## [1,] -1.16364e-05 print("Grad approx from gradient check") ## [1] "Grad approx from gradient check" vector_to_list2(parameters,gradapprox) ## $dW1 ## [,1] [,2] ## [1,] -7.641554e-05 -3.430589e-07 ## [2,] -9.049428e-06 6.906253e-05 ## [3,] 3.401168e-06 -1.503919e-04 ## [4,] 1.535228e-04 -1.686401e-04 ## [5,] -6.029288e-05 -2.715650e-04 ## ## $db1 ## [,1] ## [1,] 6.930012e-06 ## [2,] -3.283096e-05 ## [3,] 1.310618e-05 ## [4,] -3.454237e-05 ## [5,] -2.275957e-08 ## ## $dW2 ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0.0001612355 0.0001113476 0.0002435829 0.0003621486 2.87409e-05 ## ## $db2 ## [,1] ## [1,] -1.16368e-05 1.2b Gradient Check – Softmax Activation – R source("DLfunctions8.R") Z <- as.matrix(read.csv("spiral.csv",header=FALSE)) # Setup the data X <- Z[,1:2] y <- Z[,3] X <- t(X) Y <- t(y) layersDimensions = c(2, 3, 3) parameters = initializeDeepModel(layersDimensions) #Perform forward prop retvals = forwardPropagationDeep(X, parameters,keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="softmax") AL <- 
retvals[['AL']] caches <- retvals[['caches']] dropoutMat <- retvals[['dropoutMat']] #Compute cost cost <- computeCost(AL, Y,outputActivationFunc="softmax", numClasses=layersDimensions[length(layersDimensions)]) print(cost) ## [1] 1.098618 # Backward propagation. gradients = backwardPropagationDeep(AL, Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="softmax",numClasses=layersDimensions[length(layersDimensions)]) # Need to take transpose of the last layer for Softmax L=length(parameters)/2 gradients[[paste('dW',L,sep="")]]=t(gradients[[paste('dW',L,sep="")]]) gradients[[paste('db',L,sep="")]]=t(gradients[[paste('db',L,sep="")]]) #Perform gradient check gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7,outputActivationFunc="softmax") ## The implementations works perfectly 3.903011e-07[1] "Gradients from backprop" ## $dW1 ## [,1] [,2] ## [1,] 0.0007962367 -0.0001907606 ## [2,] 0.0004444254 0.0010354412 ## [3,] 0.0003078611 0.0007591255 ## ## $db1 ## [,1] ## [1,] -0.0017305136 ## [2,] 0.0005393734 ## [3,] 0.0012484550 ## ## $dW2 ## [,1] [,2] [,3] ## [1,] -3.515627e-04 7.487283e-04 -3.971656e-04 ## [2,] -6.381521e-05 -1.257328e-06 6.507254e-05 ## [3,] -1.719479e-04 -4.857264e-04 6.576743e-04 ## ## $db2 ## [,1] ## [1,] -5.536383e-06 ## [2,] -1.824656e-05 ## [3,] 2.378295e-05 ## ## [1] "Grad approx from gradient check" ## $dW1 ## [,1] [,2] ## [1,] 0.0007962364 -0.0001907607 ## [2,] 0.0004444256 0.0010354406 ## [3,] 0.0003078615 0.0007591250 ## ## $db1 ## [,1] ## [1,] -0.0017305135 ## [2,] 0.0005393741 ## [3,] 0.0012484547 ## ## $dW2 ## [,1] [,2] [,3] ## [1,] -3.515632e-04 7.487277e-04 -3.971656e-04 ## [2,] -6.381451e-05 -1.257883e-06 6.507239e-05 ## [3,] -1.719469e-04 -4.857270e-04 6.576739e-04 ## ## $db2 ## [,1] ## [1,] -5.536682e-06 ## [2,] -1.824652e-05 ## [3,] 2.378209e-05 1.3a Gradient Check – Sigmoid Activation – Octave source("DL8functions.m") ################## Circles data=csvread("circles.csv"); X=data(:,1:2); Y=data(:,3); #Set layer dimensions layersDimensions = [2 5 1]; #tanh=-0.5(ok), #relu=0.1 best! [weights biases] = initializeDeepModel(layersDimensions); #Perform forward prop [AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights, biases,keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="sigmoid"); #Compute cost cost = computeCost(AL, Y',outputActivationFunc=outputActivationFunc,numClasses=layersDimensions(size(layersDimensions)(2))); disp(cost); #Compute gradients from cost [gradsDA gradsDW gradsDB] = backwardPropagationDeep(AL, Y', activation_caches,forward_caches, droputMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="sigmoid", numClasses=layersDimensions(size(layersDimensions)(2))); epsilon = 1e-07; outputActivationFunc="sigmoid"; # Convert paramter cell array to vector parameters_values = cellArray_to_vector(weights, biases); #Convert gradient cell array to vector grad = gradients_to_vector(gradsDW,gradsDB); num_parameters = size(parameters_values)(1); #Initialize J_plus = zeros(num_parameters, 1); J_minus = zeros(num_parameters, 1); gradapprox = zeros(num_parameters, 1); # Compute gradapprox for i = 1:num_parameters # Compute J_plus[i]. 
thetaplus = parameters_values; thetaplus(i,1) = thetaplus(i,1) + epsilon; [weights1 biases1] =vector_to_cellArray(weights, biases,thetaplus); [AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights1, biases1, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc); J_plus(i) = computeCost(AL, Y', outputActivationFunc=outputActivationFunc); # Compute J_minus[i]. thetaminus = parameters_values; thetaminus(i,1) = thetaminus(i,1) - epsilon ; [weights1 biases1] = vector_to_cellArray(weights, biases,thetaminus); [AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X',weights1, biases1, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc); J_minus(i) = computeCost(AL, Y', outputActivationFunc=outputActivationFunc); # Compute gradapprox[i] gradapprox(i) = (J_plus(i) - J_minus(i))/(2*epsilon); endfor #Compute L2Norm numerator = L2NormVec(grad-gradapprox); denominator = L2NormVec(grad) + L2NormVec(gradapprox); difference = numerator/denominator; disp(difference); #Check difference if difference > 1e-04 printf("There is a mistake in the implementation "); disp(difference); else printf("The implementation works perfectly"); disp(difference); endif [weights1 biases1] = vector_to_cellArray(weights, biases,grad); printf("Gradients from back propagation"); disp(weights1); disp(biases1); [weights2 biases2] = vector_to_cellArray(weights, biases,gradapprox); printf("Gradients from gradient check"); disp(weights2); disp(biases2); 0.69315 1.4893e-005 The implementation works perfectly 1.4893e-005 Gradients from back propagation { [1,1] = 5.0349e-005 2.1323e-005 8.8632e-007 1.8231e-006 9.3784e-005 1.0057e-004 1.0875e-004 -1.9529e-007 5.4502e-005 3.2721e-005 [1,2] = 1.0567e-005 6.0615e-005 4.6004e-005 1.3977e-004 1.0405e-004 } { [1,1] = -1.8716e-005 1.1309e-009 4.7686e-005 1.2051e-005 -1.4612e-005 [1,2] = 9.5808e-006 } Gradients from gradient check { [1,1] = 5.0348e-005 2.1320e-005 8.8485e-007 1.8219e-006 9.3784e-005 1.0057e-004 1.0875e-004 -1.9762e-007 5.4502e-005 3.2723e-005 [1,2] = [1,2] = 1.0565e-005 6.0614e-005 4.6007e-005 1.3977e-004 1.0405e-004 } { [1,1] = -1.8713e-005 1.1102e-009 4.7687e-005 1.2048e-005 -1.4609e-005 [1,2] = 9.5790e-006 } 1.3b Gradient Check – Softmax Activation – Octave source("DL8functions.m") data=csvread("spiral.csv"); # Setup the data X=data(:,1:2); Y=data(:,3); # Set the layer dimensions layersDimensions = [2 3 3]; [weights biases] = initializeDeepModel(layersDimensions); # Run forward prop [AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights, biases,keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="softmax"); # Compute cost cost = computeCost(AL, Y',outputActivationFunc=outputActivationFunc,numClasses=layersDimensions(size(layersDimensions)(2))); disp(cost); # Perform backward prop [gradsDA gradsDW gradsDB] = backwardPropagationDeep(AL, Y', activation_caches,forward_caches, droputMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="softmax", numClasses=layersDimensions(size(layersDimensions)(2))); #Take transpose of last layer for Softmax L=size(weights)(2); gradsDW{L}= gradsDW{L}'; gradsDB{L}= gradsDB{L}'; #Perform gradient check difference= gradient_check_n(weights, biases, gradsDW,gradsDB, X, Y, epsilon = 1e-7, outputActivationFunc="softmax",numClasses=layersDimensions(size(layersDimensions)(2))); 1.0986 The implementation works perfectly 2.0021e-005 Gradients from back propagation { [1,1] = 
-7.1590e-005 4.1375e-005 -1.9494e-004 -5.2014e-005 -1.4554e-004 5.1699e-005 [1,2] = 3.3129e-004 1.9806e-004 -1.5662e-005 -4.9692e-004 -3.7756e-004 -8.2318e-005 1.6562e-004 1.7950e-004 9.7980e-005 } { [1,1] = -3.0856e-005 -3.3321e-004 -3.8197e-004 [1,2] = 1.2046e-006 2.9259e-007 -1.4972e-006 } Gradients from gradient check { [1,1] = -7.1586e-005 4.1377e-005 -1.9494e-004 -5.2013e-005 -1.4554e-004 5.1695e-005 3.3129e-004 1.9806e-004 -1.5664e-005 -4.9692e-004 -3.7756e-004 -8.2316e-005 1.6562e-004 1.7950e-004 9.7979e-005 } { [1,1] = -3.0852e-005 -3.3321e-004 -3.8197e-004 [1,2] = 1.1902e-006 2.8200e-007 -1.4644e-006 } 2.1 Tip for tuning hyperparameters

Deep Learning Networks come with a large number of hyper parameters which require tuning. The hyper parameters are

1. α – learning rate
2. Number of layers
3. Number of hidden units
4. Number of iterations
5. Momentum – β – 0.9
6. RMSProp – β – 0.9
7. Adam – β1, β2 and ε
8. learning rate decay
9. mini batch size
10. Initialization method – He, Xavier
11. Regularization

– Among the above, the most critical is the learning rate α and its decay. Rather than just trying out random values, it may help to try out values on a logarithmic scale. So we could try α values of 0.01, 0.1, 1.0, 10 etc. If we find that the best results occur for α between 0.01 and 0.1 we could use a technique similar to binary search, so we can try 0.01, 0.05. If α needs to be bigger than 0.01 and 0.05 we could try 0.25 and then keep halving the distance etc. (See the short sampling sketch after this list.)
– The performance of Momentum and RMSProp is very good and they work well with β values around 0.9. Even so, it is better to try out values of 1−β in the logarithmic range. So 1−β could be 0.001, 0.01, 0.1 and hence β would be 0.999, 0.99 or 0.9.
– Increasing the number of hidden units or the number of hidden layers needs to be done gradually. I have noticed that increasing the number of hidden layers heavily does not improve performance and sometimes degrades it.
– Sometimes, I tend to increase the number of iterations if I think I see a steady decrease in the cost for a certain learning rate.
– It may also help to add learning rate decay if you see an oscillation while the cost decreases.
– Xavier and He initializations also help with fast convergence and are worth trying out.
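A small sketch of what sampling on a logarithmic scale looks like in R (illustrative ranges only):

# Sample 5 candidate learning rates uniformly on a log scale between 1e-4 and 1
r <- runif(5, min = -4, max = 0)
alpha <- 10^r
alpha

# For momentum/RMSProp, sample 1 - beta on a log scale between 0.001 and 0.1
s <- runif(5, min = -3, max = -1)
beta <- 1 - 10^s
beta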

3.1 Final thoughts

As I come to a close in this Deep Learning Series from first principles in Python, R and Octave, I must admit that I learnt a lot in the process.

* Building an L-layer, vectorized Deep Learning Network in Python, R and Octave was extremely challenging but very rewarding
* One benefit of building vectorized versions in Python, R and Octave was that I was looking at each function I implemented thrice, and hence I was able to fix bugs in any of the languages
* In addition, since I built the generic L-Layer DL network with all the bells and whistles layer by layer, I had a further opportunity to look at all the functions in each successive post.
* Each language has its advantages and disadvantages. From the performance perspective I think Python is the best, followed by Octave and then R
* Interestingly, I noticed that even if small bugs creep into your implementation, the DL network does learn and does generate a valid set of weights and biases; however, this may not be an optimum solution. In one case of an inadvertent bug, I was not updating the weights in the final layer of the DL network. Yet, using all the other layers, the DL network was able to come up with a reasonable solution (maybe, as with random dropout, the remaining units can still learn the data!)
* Having said that, the Gradient Check method discussed and implemented in this post can be very useful in ironing out bugs.
Feel free to clone/download the code from Github at DeepLearning-Part8

Conclusion

These last couple of months, when I was writing the posts and also churning out the code in Python, R and Octave, were very hectic. There were times when I found the implementation of some function extremely demanding and almost felt like giving up. Other times, I spent quite some time on an intractable DL network which would not respond to changes in hyper-parameters. All in all, it was a great learning experience. I would suggest that you start from my first post Deep Learning from first principles in Python, R and Octave – Part 1 and work your way up. Feel free to take the code apart and try out things. That is the only way you will learn.

Hope you had as much fun as I had. Stay tuned. I will be back!!!

Also see
1. My book ‘Practical Machine Learning with R and Python’ on Amazon
2. Revisiting crimes against women in India
3. Literacy in India – A deepR dive
4. Sixer – R package cricketr’s new Shiny avatar
5. Bend it like Bluemix, MongoDB using Auto-scale – Part 1!
6. Computer Vision: Ramblings on derivatives, histograms and contours
7. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
8. A closer look at “Robot Horse on a Trot” in Android

To see all post click Index of Posts


To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
