R-bloggers: R news and tutorials contributed by 750 R bloggers

Sankey charts for swinging voters

Sat, 05/20/2017 - 14:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Continuing my examination of the individual level voting behaviour from the New Zealand Election Study, I wanted to look at the way individuals swap between parties, and between “did not vote” and a party, from one election to another. How much and how this happens is obviously an important question for both political scientists and for politicians.

Vote transition visualisations

I chose a Sankey chart as a way of showing the transition matrix from self-reported party vote in the 2011 election to the 2014 election. Here’s a static version:

And here is the more screen-friendly interactive version, with mouseover tooltips to give actual estimates:

The point with these graphics is to highlight the transitions. For example, what were the implications of turnout being higher in 2014 than 2011 (77.9% of enrolled voters in 2014 compared to 74.2% in 2011)? Judging from this survey data, the National Party gained 6.6% of the enrolled population in 2014 by converting them from a 2011 “did not vote” and lost only 3.6% in the other direction. This net gain of three percentage points was enough to win the election for the National-led coalition. In contrast, the Labour party had a net gain from “did not vote” in 2011 of only 0.2 percentage points. Remember though that these are survey-based estimates, and subject to statistical error.

I find setting up and polishing Sankey charts – controlling colours for example – a bit of a pain, so the code at the bottom of this post on how this was done might be of interest.

Weighting, missing data, population and age complications

Those visualisations have a few hidden fishhooks, which careful readers would find if they compare the percentages in the tooltips of the interactive version with percentages reported by the New Zealand Electoral Commission.

  • The 2014 percentages are proportions of the enrolled population. As the 2014 turnout of enrolled voters was 77.9%, the numbers here are noticeably less than the usually cited percentages which were used to translate into seat counts (for example, National Party reported party vote of 47.0% of votes becomes 36.6% of enrolled voters)
  • The 2011 percentages are even harder to explain, because I’ve chosen not only to scale the party vote and “did not vote” to the 2011 enrolled population as reported by the Commission, but also to add in around 5% of the 2014 population that were too young to vote in 2011.
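That second adjustment can be sketched with a couple of lines of R. The proportions below are made up for illustration, not the Electoral Commission's figures:

```r
# Illustrative sketch of the 2011 rescaling: shrink the reported 2011
# categories to make room for an extra "Too young to vote" category
# covering roughly 5% of the 2014 enrolled population.
vote_2011 <- c(National = 0.35, Labour = 0.20, Other = 0.19, `Did not vote` = 0.26)
too_young <- 0.05  # assumed share of the 2014 population too young to vote in 2011

rescaled <- c(vote_2011 * (1 - too_young), `Too young to vote` = too_young)
round(rescaled, 3)  # the proportions still sum to 1
```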

Two things I would have liked to have taken into account but wasn’t able to were:

  • The “leakage” from the 2011 election of people who were deceased or had left the country by 2014
  • Explicit recognition of people who voted in 2014 but not in 2011 because they were out of the country. There is a variable in the survey that picks up the year the respondent came to live in New Zealand if not born here, but for only 10 respondents was this 2012 or later (in contrast to age – there were 58 respondents aged 20 or less in 2014).

I re-weighted the survey so the 2014 and 2011 reported party votes matched the known totals (with the addition of people aged 15 to 17 in 2011). One doesn’t normally re-weight a survey based on answers provided by the respondents, but in this case I think it makes perfect sense to calibrate to the public totals. The biggest impact is that for both years, but particularly 2011, relying on the respondents’ self-report and the published weighting of the NZES, totals for “did not vote” are materially understated, compared to results when calibrated to the known totals.

When party vote in 2011 had been forgotten or was an NA, and this wasn’t explained by being too young in 2011, I used multiple imputation based on a subset of relevant variables to give five instances of probable party vote to each such respondent.
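The replicate-then-impute trick can be sketched in miniature like this (toy data, not the NZES, and a plain mice call rather than the full model used below):

```r
# Replicate each row five times, then run a single mice imputation; the five
# copies of an incomplete row can receive different imputed values, which
# mimics multiple imputation for simple downstream summaries.
library(mice)
set.seed(42)

toy <- data.frame(age  = c(25, 40, NA, 60),
                  vote = factor(c("A", "B", "A", NA)))

toy5 <- toy[rep(1:nrow(toy), each = 5), ]
imputed <- complete(mice(toy5, m = 1, printFlag = FALSE))
# no missing values remain after imputation
```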

Taken together, all this gives the visualisations a perspective based in 2014. It is better to think of it as “where did the 2014 voters come from” than “where did the 2011 voters go”. This is fairly natural when we consider it is the 2014 New Zealand Election Study, but is worth keeping in mind in interpretation.

Age (and hence the impact of new young voters coming in, and of older voters passing on) is important in voting behaviour, as even the most casual observation of politics shows. In New Zealand, the age distribution of party voters in 2014 is seen in the chart below:

Non-voters, Green voters and to a certain extent Labour voters are young; New Zealand First voters are older. If this interests you though, I suggest you look at the multivariate analysis in this blog post or, probably more fun, this fancy interactive web app which lets you play with the predicted probabilities of voting based on a combination of demographic and socio-economic status variables.


Here’s the R code that did that imputation, weighting, and the graphics:

library(tidyverse)
library(forcats)
library(riverplot)
library(sankeyD3)
library(foreign)
library(testthat)
library(mice)
library(grid)
library(survey)

# Caution: networkD3 has its own sankeyNetwork with fewer features.
# Make sure you know which one you're using!
# Don't load up networkD3 in the same session.

#============import data======================
# See previous blog posts for where this comes from:
unzip("D:/Downloads/")
nzes_orig <- read.spss("NZES2014GeneralReleaseApril16.sav",
                = TRUE, trim.factor.names = TRUE)

# Five versions of each row, so when we do imputation later on we are
# in effect doing multiple imputation:
nzes <- nzes_orig[rep(1:nrow(nzes_orig), each = 5), ] %>%
   # lump up party vote in 2014:
   mutate(partyvote2014 = ifelse(, "Did not vote", as.character(dpartyvote)),
          partyvote2014 = gsub("M.ori", "Maori", partyvote2014),
          partyvote2014 = gsub("net.Man", "net-Man", partyvote2014),
          partyvote2014 = fct_infreq(partyvote2014)) %>%
   mutate(partyvote2014 = fct_lump(partyvote2014, 5))

# party vote in 2011, and a smaller set of columns to do the imputation:
nzes2 <- nzes %>%
   mutate(partyvote2011 = as.factor(ifelse(dlastpvote == "Don't know", NA,
                                           as.character(dlastpvote)))) %>%
   # select a smaller number of variables so imputation is feasible:
   select(dwtfin, partyvote2014, partyvote2011, dethnicity_m, dage, dhhincome,
          dtradeunion, dprofassoc, dhousing, dclass, dedcons, dwkpt)

# Impute the missing values. Although we are imputing only a single set of values,
# because we are doing it with each row repeated five times this has the same impact,
# for the simple analysis we do later on, as multiple imputation:
nzes3 <- complete(mice(nzes2, m = 1))

# Lump up the 2011 vote, tidy labels, and work out who was too young to vote:
nzes4 <- nzes3 %>%
   mutate(partyvote2011 = fct_lump(partyvote2011, 5),
          partyvote2011 = ifelse(grepl("I did not vote", partyvote2011),
                                 "Did not vote", as.character(partyvote2011)),
          partyvote2011 = ifelse(dage < 21, "Too young to vote", partyvote2011),
          partyvote2014 = as.character(partyvote2014))

#===============re-weighting to match actual votes in 2011 and 2014===================
# This makes a big difference, for both years, but more for 2011. "Did not vote" is
# only 16% in 2011 if we only calibrate to 2014 voting outcomes, but in reality was
# 25.8%. We calibrate for both 2011 and 2014 so we get the better proportions for
# both elections.

#------------2014 actual outcomes----------------
actual_vote_2014 <- data_frame(
   partyvote2014 = c("National", "Labour", "Green", "NZ First", "Other", "Did not vote"),
   Freq = c(1131501, 604534, 257356, 208300,
            31850 + 16689 + 5286 + 95598 + 34095 + 10961 + 5113 + 1730 + 1096 + 872 + 639,
            NA)
)

# calculate the did not vote, from the 77.9 percent turnout
actual_vote_2014[6, 2] <- (100 / 77.9 - 1) * sum(actual_vote_2014[1:5, 2])

# check I did the turnout sums right:
expect_equal(0.779 * sum(actual_vote_2014[ , 2]), sum(actual_vote_2014[1:5, 2]))

#---------------2011 actual outcomes---------------------
actual_vote_2011 <- data_frame(
   partyvote2011 = c("National", "Labour", "Green", "NZ First", "Other",
                     "Did not vote", "Too young to vote"),
   Freq = c(1058636, 614937, 247372, 147511,
            31982 + 24168 + 23889 + 13443 + 59237 + 11738 + 1714 + 1595 + 1209,
            NA, NA)
)

# calculate the did not vote, from the 74.21 percent turnout at
actual_vote_2011[6, 2] <- (100 / 74.21 - 1) * sum(actual_vote_2011[1:5, 2])

# check I did the turnout sums right:
expect_equal(0.7421 * sum(actual_vote_2011[1:6, 2]), sum(actual_vote_2011[1:5, 2]))

# From the below, we conclude 4.8% of the 2014 population (as estimated by NZES)
# were too young to vote in 2011:
nzes_orig %>%
   mutate(tooyoung = dage < 21) %>%
   group_by(tooyoung) %>%
   summarise(pop = sum(dwtfin), n = n()) %>%
   ungroup() %>%
   mutate(prop = pop / sum(pop))

# this is pretty approximate but will do for our purposes:
actual_vote_2011[7, 2] <- 0.048 * sum(actual_vote_2011[1:6, 2])

# Force the 2011 numbers to match the 2014 population
actual_vote_2011$Freq <- actual_vote_2011$Freq / sum(actual_vote_2011$Freq) *
   sum(actual_vote_2014$Freq)

#------------create new weights--------------------
# set up survey design with the original weights:
nzes_svy <- svydesign(~1, data = nzes4, weights = ~dwtfin)

# calibrate weights to the known total party votes in 2011 and 2014:
nzes_cal <- calibrate(nzes_svy,
                      list(~partyvote2014, ~partyvote2011),
                      list(actual_vote_2014, actual_vote_2011))

# extract those weights for use in straight R (not just "survey")
nzes5 <- nzes4 %>%
   mutate(weight = weights(nzes_cal))

# See impact of calibrated weights for the 2014 vote:
prop.table(svytable(~partyvote2014, nzes_svy)) # original weights
prop.table(svytable(~partyvote2014, nzes_cal)) # recalibrated weights

# See impact of calibrated weights for the 2011 vote:
prop.table(svytable(~partyvote2011, nzes_svy)) # original weights
prop.table(svytable(~partyvote2011, nzes_cal)) # recalibrated weights

#=======================previous years vote riverplot version============
the_data <- nzes5 %>%
   mutate(
      # add two spaces to ensure the partyvote2011 does not have any
      # names that exactly match the party vote in 2014
      partyvote2011 = paste(partyvote2011, "  "),
      partyvote2011 = gsub("M.ori", "Maori", partyvote2011)) %>%
   group_by(partyvote2011, partyvote2014) %>%
   summarise(vote_prop = sum(weight)) %>%
   ungroup()

# change names to the names wanted by makeRiver
names(the_data) <- c("col1", "col2", "Value")

# node IDs need to be characters I think
c1 <- unique(the_data$col1)
c2 <- unique(the_data$col2)

nodes_ref <- data_frame(fullname = c(c1, c2)) %>%
   mutate(position = rep(c(1, 2), times = c(length(c1), length(c2)))) %>%
   mutate(ID = LETTERS[1:n()])

edges <- the_data %>%
   left_join(nodes_ref[ , c("fullname", "ID")], by = c("col1" = "fullname")) %>%
   rename(N1 = ID) %>%
   left_join(nodes_ref[ , c("fullname", "ID")], by = c("col2" = "fullname")) %>%
   rename(N2 = ID) %>%
   as.data.frame(stringsAsFactors = FALSE)

rp <- makeRiver(nodes = as.vector(nodes_ref$ID), edges = edges,
                node_labels = nodes_ref$fullname,
                # manual vertical positioning by parties. Look at
                # nodes_ref to see the order in which positions are set.
                # This is a pain, so I just let it stay with the defaults:
                # node_ypos = c(4, 1, 1.8, 3, 6, 7, 5, 4, 1, 1.8, 3, 6, 7),
                node_xpos = nodes_ref$position,
                # set party colours; all based on those in nzelect::parties_v:
                node_styles = list(
                   C = list(col = "#d82a20"), # red Labour
                   D = list(col = "#00529F"), # blue National
                   E = list(col = "black"),   # black NZ First
                   B = list(col = "#098137"), # green
                   J = list(col = "#d82a20"), # Labour
                   I = list(col = "#098137"), # Green
                   K = list(col = "#00529F"), # National
                   L = list(col = "black"))) # NZ First

# Turn the text horizontal, and pale grey
ds <- default.style()
ds$srt <- 0
ds$textcol <- "grey95"

mygp <- gpar(col = "grey75")

# using PNG rather than SVG as vertical lines appear in the SVG version
par(bg = "grey40")

# note the plot_area argument - for some reason, defaults to using only half the
# vertical space available, so set this to higher than 0.5!:
plot(rp, default_style = ds, plot_area = 0.9)
title(main = "Self-reported party vote in 2011 compared to 2014",
      col.main = "white", font.main = 1)
grid.text(x = 0.15, y = 0.1, label = "2011 party vote", gp = mygp)
grid.text(x = 0.85, y = 0.1, label = "2014 party vote", gp = mygp)
grid.text(x = 0.95, y = 0.03, gp = gpar(fontsize = 7, col = "grey75"), just = "right",
          label = "Source: New Zealand Election Study, analysed at")

#=======================sankeyD3 version====================
nodes_ref2 <- nodes_ref %>%
   mutate(ID = as.numeric(as.factor(ID)) - 1) %>%
   as.data.frame()

edges2 <- the_data %>%
   ungroup() %>%
   left_join(nodes_ref2[ , c("fullname", "ID")], by = c("col1" = "fullname")) %>%
   rename(N1 = ID) %>%
   left_join(nodes_ref2[ , c("fullname", "ID")], by = c("col2" = "fullname")) %>%
   rename(N2 = ID) %>%
   as.data.frame(stringsAsFactors = FALSE) %>%
   mutate(Value = Value / sum(Value) * 100)

pal <- 'd3.scaleOrdinal() .range(["#DCDCDC", "#098137", "#d82a20", "#00529F", "#000000", "#DCDCDC", "#DCDCDC", "#098137", "#d82a20", "#00529F", "#000000", "#DCDCDC"]);'

sankeyNetwork(Links = edges2, Nodes = nodes_ref2,
              Source = "N1", Target = "N2", Value = "Value",
              NodeID = "fullname", NodeGroup = "fullname", LinkGroup = "col2",
              fontSize = 12, nodeWidth = 30,
              colourScale = JS(pal),
              numberFormat = JS('d3.format(".1f")'),
              fontFamily = "Calibri", units = "%",
              nodeShadow = TRUE, showNodeValues = FALSE,
              width = 700, height = 500)

#=======other by the by analysis==================
# Age density plot by party vote.
# Remember to weight by the survey weights - in this case it controls for
# the under or over sampling by age in the original design.
# (parties_v comes from the nzelect package.)
nzes5 %>%
   ggplot(aes(x = dage, fill = partyvote2014, weight = weight / sum(nzes5$weight))) +
   geom_density(alpha = 0.3) +
   facet_wrap(~partyvote2014, scales = "free_y") +
   scale_fill_manual(values = parties_v) +
   theme(legend.position = "none") +
   labs(x = "Age at time of 2014 election",
        caption = "Source: New Zealand Election Study") +
   ggtitle("Age distribution by Party Vote in the 2014 New Zealand General Election")

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R Weekly Bulletin Vol – IX

Sat, 05/20/2017 - 08:42

This week’s R bulletin will cover topics on how to list files, extracting file names, and creating a folder using R.

We will also cover functions like the select function, filter function, and the arrange function. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. Run the current chunk – Ctrl+Alt+C
2. Run the next chunk – Ctrl+Alt+N
3. Run the current function definition – Ctrl+Alt+F

Problem Solving Ideas

How to list files with a particular extension

To list files with a particular extension, one can use the pattern argument of the list.files function. For example, to list CSV files, use the following syntax.


files = list.files(pattern = "\\.csv$")

This will list all the csv files present in the current working directory. To list files in any other folder, you need to provide the folder path.

list.files(path = "C:/Users/MyFolder", pattern = "\\.csv$")

The $ at the end anchors the match to the end of the string, and \. matches a literal dot, ensuring that you match only files with the extension .csv.
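To see why the escaping matters, here is a small check (the file names are made up for illustration):

```r
# An unescaped dot matches any character, so ".csv$" would also match a
# file called "notacsv"; the escaped "\\.csv$" matches only real .csv files.
files <- c("MRF.csv", "PAGEIND.csv", "notacsv")

grepl(".csv$", files)    # TRUE TRUE TRUE - "a" in "notacsv" matches the dot
grepl("\\.csv$", files)  # TRUE TRUE FALSE - requires a literal dot
```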

Extracting file name using gsub function

When we download stock data from Google Finance, the file's name corresponds to the stock's symbol. If we want to extract the stock symbol from the file name, we can do it using the gsub function. The function searches for a match to the pattern argument and replaces all matches with the value given in the replacement argument. The syntax of the function is:

gsub(pattern, replacement, x)


pattern – is a character string containing a regular expression to be matched in the given character vector.
replacement – a replacement for matched pattern.
x – is a character vector where matches are sought.

In the example given below, we extract the file names for files stored in the "Reading MFs" folder within the R working directory. We have downloaded the stock price data for two companies, namely MRF and PAGEIND Ltd.


folderpath = paste(getwd(), "/Reading MFs", sep = "")
temp = list.files(folderpath, pattern = "\\.csv$")
print(temp)

[1] “MRF.csv”  “PAGEIND.csv”

gsub("\\.csv$", "", temp)

[1] “MRF”   “PAGEIND”

Create a folder using R

One can create a folder via R with the help of the “dir.create” function. The function creates a folder with the name as specified in the last element of the path. Trailing path separators are discarded.

The syntax is given as:

dir.create(path, showWarnings = FALSE, recursive = FALSE)


dir.create("D:/RCodes", showWarnings = FALSE, recursive = FALSE)

This will create a folder called “RCodes” in the D drive.

Functions Demystified

select function

The select function comes from the dplyr package and can be used to select certain columns of a data frame which you need. Consider the data frame “df” given in the example.


library(dplyr)
Ticker = c("INFY", "TCS", "HCL", "TECHM")
OpenPrice = c(2012, 2300, 900, 520)
ClosePrice = c(2021, 2294, 910, 524)
df = data.frame(Ticker, OpenPrice, ClosePrice)
print(df)

# Suppose we wanted to select the first 2 columns only. We can use the names of the
# columns in the second argument to select them from the main data frame.
subset_df = select(df, Ticker:OpenPrice)
print(subset_df)

# Suppose we want to omit the OpenPrice column using the select function. We can do so
# by using the negative sign along with the column name as the second argument to the function.
subset_df = select(df, -OpenPrice)
print(subset_df)

# We can also use the 'starts_with' and the 'ends_with' arguments for selecting columns
# from the data frame. The example below will select all the columns which end with the
# word 'Price'.
subset_df = select(df, ends_with("Price"))
print(subset_df)

filter function

The filter function comes from the dplyr package and is used to extract subsets of rows from a data frame. This function is similar to the subset function in R.


library(dplyr)
Ticker = c("INFY", "TCS", "HCL", "TECHM")
OpenPrice = c(2012, 2300, 900, 520)
ClosePrice = c(2021, 2294, 910, 524)
df = data.frame(Ticker, OpenPrice, ClosePrice)
print(df)

# Suppose we want to select stocks with closing prices above 750. We can do so using
# the filter function in the following manner:
subset_df = filter(df, ClosePrice > 750)
print(subset_df)

# One can also use a combination of conditions as the second argument in filtering a data set.
subset_df = filter(df, ClosePrice > 750 & OpenPrice < 2000)
print(subset_df)

arrange function

The arrange function is part of the dplyr package, and is used to reorder rows of a data frame according to one of the columns. Columns can be arranged in descending order or ascending order by using the special desc() operator.


library(dplyr)
Ticker = c("INFY", "TCS", "HCL", "TECHM")
OpenPrice = c(2012, 2300, 900, 520)
ClosePrice = c(2021, 2294, 910, 524)
df = data.frame(Ticker, OpenPrice, ClosePrice)
print(df)

# Arrange in descending order
subset_df = arrange(df, desc(OpenPrice))
print(subset_df)

# Arrange in ascending order.
subset_df = arrange(df, -desc(OpenPrice))
print(subset_df)

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

Download the PDF Now!


Which science is all around? #BillMeetScienceTwitter

Sat, 05/20/2017 - 02:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

I’ll admit I didn’t really know who Bill Nye was before yesterday. His name sounds a bit like Bill Nighy’s, that’s all I knew. But well, science is all around, and quite often scientists on Twitter start interesting campaigns. Remember the #actuallylivingscientists campaign, whose animals I dedicated a blog post to? This time, the Twitter campaign is the #BillMeetScienceTwitter hashtag, with which scientists introduce themselves to the famous science TV host Bill Nye. Here is a nice article about the movement.

Since I like surfing on Twitter trends, I decided to download a few of these tweets and to use my own R interface to the Monkeylearn machine learning API, monkeylearn (part of the rOpenSci project!), to classify the tweets in the hope of finding the most represented science fields. So, which science is all around?

Getting the tweets

It might sound a bit like trolling by now, but if you wanna get Twitter data, I recommend using rtweet because it’s a good package and because it’s going to replace twitteR which you might know from other blogs.

I only keep tweets in English, and moreover original ones, i.e. not retweets.

library("rtweet")
billmeet <- search_tweets(q = "#BillMeetScienceTwitter", n = 18000, type = "recent")
billmeet <- unique(billmeet)
billmeet <- dplyr::filter(billmeet, lang == "en")
billmeet <- dplyr::filter(billmeet, is_retweet == FALSE)

I’ve ended up with 2491 tweets.

Classifying the tweets

I’ve chosen to use this taxonomy classifier, which classifies text according to generic topics and had quite a few stars on the MonkeyLearn website. I don’t think it was trained on tweets, and it wasn’t trained to classify science topics in particular, which is not optimal, but it has the merit of being readily available. I’ve still not started training my own algorithms, and anyway, if I did I’d start by creating a very crucial algorithm for determining animal fluffiness on pictures, not text mining stuff. This was a bit off topic, let’s go back to science Twitter!

When I decided to use my own package, I had forgotten that it takes charge of cutting the request vector into groups of 20 tweets, since the API only accepts 20 texts at a time. I thought I’d have to do that splitting myself, but no: since I did it once in the code of the package, I’ll never need to write that code ever again. Great feeling! Look at how easy the code is after cleaning up the tweets a bit! One just needs to wait a bit before getting all the results.

output <- monkeylearn::monkeylearn_classify(request = billmeet$text,
                                            classifier_id = "cl_5icAVzKR")
str(output)

## Classes 'tbl_df', 'tbl' and 'data.frame': 4466 obs. of 4 variables:
## $ category_id: int 64638 64640 64686 64696 64686 64687 64689 64692 64648 64600 ...
## $ probability: num 0.207 0.739 0.292 0.784 0.521 0.565 0.796 0.453 0.301 0.605 ...
## $ label : chr "Computers & Internet" "Internet" "Humanities" "Religion & Spirituality" ...
## $ text_md5 : chr "f7b28f45ea379b4ca6f34284ce0dc4b7" "f7b28f45ea379b4ca6f34284ce0dc4b7" "b95429d83df2cabb9cd701a562444f0b" "b95429d83df2cabb9cd701a562444f0b" ...
## - attr(*, "headers")=Classes 'tbl_df', 'tbl' and 'data.frame': 0 obs. of 0 variables

In the output, the package creator decided not to put the whole text corresponding to each line but its digest, produced by the MD5 algorithm. So to join the output to the tweets again, I’ll have to first digest each tweet, which I do by copying the code from the package. After all, I wrote it. Maybe it was the only time I successfully used vapply in my whole life.

billmeet <- dplyr::mutate(billmeet,
                          text_md5 = vapply(X = text,
                                            FUN = digest::digest,
                                            FUN.VALUE = character(1),
                                            USE.NAMES = FALSE,
                                            algo = "md5"))
billmeet <- dplyr::select(billmeet, text, text_md5)
output <- dplyr::left_join(output, billmeet, by = "text_md5")

Looking at this small sample, some things make sense, others make less sense, either because the classification isn’t good or because the tweet looks like spam. Since my own field isn’t text analysis, I’ll consider myself happy with these results, but I’d of course be happy to read any better version of it.

As in my #first7jobs post, I’ll make a very arbitrary decision and keep only the labels to which a probability higher than 0.5 was attributed.

output <- dplyr::filter(output, probability > 0.5)

This covers 45% of the original tweet sample. I can only hope it’s a representative sample.
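That coverage figure can be computed by comparing the distinct tweets surviving the filter with the original sample. A sketch with toy data, since the real `billmeet` and `output` objects come from the API calls above:

```r
# Toy illustration of the coverage calculation (hypothetical data, not the
# actual #BillMeetScienceTwitter tweets):
billmeet_demo <- data.frame(text_md5 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
output_demo   <- data.frame(text_md5 = c("a", "b", "b"),      stringsAsFactors = FALSE)

coverage <- length(unique(output_demo$text_md5)) / nrow(billmeet_demo)
coverage  # 0.5 in this toy example
```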

How many labels do I have by tweet?

dplyr::group_by(output) %>%
  dplyr::summarise(nlabels = n()) %>%
  dplyr::group_by(nlabels) %>%
  dplyr::summarise(n_tweets = n()) %>%
  knitr::kable()

| nlabels | n_tweets |
|--------:|---------:|
|    1415 |        1 |

Perfect, only one.

Looking at the results

I know I suck at finding good section titles… At least I like the title of the post, which is a reference to the song Bill Nighy, not Bill Nye, sings in Love Actually. My husband assumed that science Twitter has more biomedical stuff. Now, even if my results were to support this fact, note that this could as well be because it’s easier to classify biomedical tweets.

I’ll first show a few examples of tweets for given labels.

dplyr::filter(output, label == "Chemistry") %>% head(n = 5) %>% knitr::kable()

| category_id | probability | label | text_md5 | text |
|---|---|---|---|---|
| 64701 | 0.530 | Chemistry | e82fc920b07ea9d08850928218529ca9 | Hi @billnye I started off running BLAST for other ppl but now I have all the money I make them do my DNA extractions #BillMeetScienceTwitter |
| 64701 | 0.656 | Chemistry | d21ce4386512aae5458565fc2e36b686 | .@uw’s biochemistry dept – home to Nobel Laureate Eddy Fischer & ZymoGenetics co founder Earl Davie… |
| 64701 | 0.552 | Chemistry | 1d5be9d1e169dfbe2453b6cbe07a4b34 | Yo @BillNye – I’m a chemist who plays w lasers & builds to study protein interactions w materials #BillMeetScienceTwitter |
| 64701 | 0.730 | Chemistry | 1b6a25fcb66deebf35246d7eeea34b1f | Meow @BillNye! I’m Zee and I study quantum physics and working on a Nobel prize. #BillMeetScienceTwitter |
| 64701 | 0.873 | Chemistry | 701d8c53e3494961ee7f7146b28b9c8c | Hi @BillNye, I’m a organic chemist studying how molecules form materials like the liquid crystal shown below.… |

dplyr::filter(output, label == "Aquatic Mammals") %>% head(n = 5) %>% knitr::kable()

| category_id | probability | label | text_md5 | text |
|---|---|---|---|---|
| 64609 | 0.515 | Aquatic Mammals | f070a05b09d2ccc85b4b1650139b6cd0 | Hi Bill, I am Anusuya. I am a palaeo-biologist working at the University of Cape Town. @BillNye #BillMeetScienceTwitter |
| 64609 | 0.807 | Aquatic Mammals | bb06d18a1580c28c255e14e15a176a0f | Hi @BillNye! I worked with people at APL to show that California blue whales are nearly recovered #BillMeetScienceTwitter |
| 64609 | 0.748 | Aquatic Mammals | 1ca07aad8bc1abe54836df8dd1ff1a9d | Hi @BillNye! I’m researching marine ecological indicators to improve Arctic marine monitoring and management… |
| 64609 | 0.568 | Aquatic Mammals | a140320fcf948701cfc9e7b01309ef8b | More like as opposed to vaginitis in dolphins or chimpanzees or sharks #BillMeetScienceTwitter |
| 64609 | 0.520 | Aquatic Mammals | 06d1e8423a7d928ea31fd6db3c5fee05 | Hi @BillNye I study visual function in ppl born w/o largest connection between brain hemispheres #callosalagenesis… |

dplyr::filter(output, label == "Internet") %>% head(n = 5) %>% knitr::kable()

| category_id | probability | label | text_md5 | text |
|---|---|---|---|---|
| 64640 | 0.739 | Internet | f7b28f45ea379b4ca6f34284ce0dc4b7 | @BillNye #AskBillNye @BillNye join me @AllendaleCFD. More details at #BillMeetScienceTwitter |
| 64640 | 0.725 | Internet | b2b7843dc9fcd9cd959c828beb72182d | @120Stat you could also use #actuallivingscientist #womeninSTEM or #BillMeetScienceTwitter to spread the word about your survey as well |
| 64640 | 0.542 | Internet | a357e1216c5e366d7f9130c7124df316 | Thank you so much for the retweet, @BillNye! I’m excited for our next generation of science-lovers!… |
| 64640 | 0.839 | Internet | 61712f61e877f3873b69fed01486d073 | @ParkerMolloy Hi @BillNye, Im an elem school admin who wants 2 bring in STEM/STEAM initiatives 2 get my students EX… |
| 64640 | 0.924 | Internet | 4c7f961acfa2cdd17c9af655c2e81684 | I just filled my twitter-feed with brilliance. #BIllMeetScienceTwitter |

Based on that, and on the huge number of internet-labelled tweets, I decided to remove those.

library("ggplot2")
library("viridis")

label_counts <- output %>%
  dplyr::filter(label != "Internet") %>%
  dplyr::group_by(label) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n))

label_counts <- label_counts %>%
  dplyr::mutate(label = ifelse(n < 5, "others", label)) %>%
  dplyr::group_by(label) %>%
  dplyr::summarize(n = sum(n)) %>%
  dplyr::arrange(desc(n))

label_counts <- dplyr::mutate(label_counts,
                              label = factor(label, ordered = TRUE,
                                             levels = unique(label)))

ggplot(label_counts) +
  geom_bar(aes(label, n, fill = label), stat = "identity") +
  scale_fill_viridis(discrete = TRUE, option = "plasma") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1),
        text = element_text(size = 25),
        legend.position = "none")

In the end, I’m always skeptical when looking at the results of such classifiers, and well at the quality of my sample to begin with – but then I doubt there ever was a hashtag that was perfectly used to only answer the question and not spam it and comment it (which is what I’m doing). I’d say it seems to support my husband’s hypothesis about biomedical stuff.

I’m pretty sure Bill Nye won’t have had the time to read all the tweets, but I think he should save them, or at least all the ones he can get via the Twitter API thanks to e.g. rtweet, in order to be able to look through them next time he needs an expert. And in the random sample of tweets he’s read, let’s hope he was exposed to a great diversity of science topics (and of scientists), although, hey, the health and life related stuff is the most interesting of course. Just kidding. I liked reading tweets about various scientists, science rocks! And these last words would be labelled with “performing arts”, perfect way to end this post.

To leave a comment for the author, please follow the link and comment on their blog: Maëlle.

New R Users group in Münster!

Sat, 05/20/2017 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

This is to announce that Münster now has its very own R users group!

If you are from the area, come join us (or if you happen to know someone who is and who might be interested, please forward the info).

You can find us on and we are also on the R User groups list.

Code for the logo, which of course has been created in R:

library(ggplot2)
library(ggmap)
library(ggthemes)

munster <- get_map(location = 'Munster', zoom = 12, maptype = "watercolor")

ggmap(munster) +
  scale_y_continuous(limits = c(51.94, 51.97)) +
  annotate("text", x = 7.565, y = 51.96, label = "MünsteR", size = 8) +
  annotate("text", x = 7.685, y = 51.945, label = "R Users Group", size = 5) +
  theme_map()

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

A Primer in Functional Programming in R Exercises (Part – 1)

Fri, 05/19/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In the exercises below we cover the basics of functional programming in R (part 1 of a two-part series of exercises on functional programming). We consider recursion in R, the apply family of functions, and higher-order functions such as Map, Reduce, and Filter.
Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Create a function which calculates the factorial of a number with the help of Reduce.
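One possible approach (a sketch, not the official solution linked above): fold `*` over 1..n with Reduce, seeding the fold with 1 so that the n = 0 case also works.

```r
# Factorial as a fold; the initial value 1 handles n = 0
fact_reduce <- function(n) Reduce(`*`, seq_len(n), 1)

fact_reduce(5)  # 120
```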

Exercise 2
Create the same factorial function, but with recursion and memoization.
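A sketch of one way to combine recursion with memoization, caching results in an environment (the official answer may differ):

```r
# Recursive factorial with an environment used as a cache
memo <- new.env()
fact_memo <- function(n) {
  key <- as.character(n)
  if (exists(key, envir = memo, inherits = FALSE))
    return(get(key, envir = memo))                    # cache hit
  result <- if (n <= 1) 1 else n * fact_memo(n - 1)   # recursive step
  assign(key, result, envir = memo)                   # remember for next time
  result
}

fact_memo(6)  # 720; repeated calls are answered from the cache
```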

Exercise 3
Create a function cum_add which performs cumulative summation, e.g. if x <- 1:3 then cum_add(x) will result in 1 3 6. Don't use cumsum.
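Reduce with accumulate = TRUE keeps every intermediate result, which is exactly a cumulative sum; a minimal sketch:

```r
# Cumulative sum without cumsum: an accumulating fold
cum_add <- function(x) Reduce(`+`, x, accumulate = TRUE)

cum_add(1:3)  # 1 3 6
```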

Exercise 4
Create a function which takes a data frame as input and returns the mean, minimum, and maximum of all numeric columns, e.g. your_func(iris). Try to avoid loops; it's possible to do it in one line. Show off your functional skills.
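One way to sketch this with higher-order functions: Filter keeps the numeric columns and sapply summarises each (the exact output layout is my choice, not part of the exercise statement):

```r
# Mean / min / max for every numeric column of a data frame
num_summary <- function(df)
  sapply(Filter(is.numeric, df),
         function(x) c(mean = mean(x), min = min(x), max = max(x)))

num_summary(iris)  # 3 x 4 matrix: one column per numeric variable
```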

Learn more about Functions in the online course R Programming: Advanced Analytics In R For Data Science. In this course you will learn how to prepare your data quicker and more efficiently, leaving you more time to focus on analyzing and visualizing your results!

Exercise 5
Create a function centerColumnAroundMean which takes a numeric data frame and manipulates it in such a way that each column's values are centered around the mean value of that column. For example, if my data frame is like this

then centerColumnAroundMean(df) will result in

It is possible to achieve this with a single line of R. Please don't use loops.
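The example tables did not survive in this copy of the post, so, assuming "centered" means subtracting each column's mean, a one-line sketch with scale:

```r
# Center each column around its mean; scale(..., scale = FALSE) only centers
centerColumnAroundMean <- function(df)
  as.data.frame(scale(df, center = TRUE, scale = FALSE))

centerColumnAroundMean(data.frame(a = 1:3, b = c(10, 20, 60)))
```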
Exercise 6
I have a list of movies with two franchises, Star Wars and LOTR, as its elements. You can create the movie list with:
my_movie_list <- list(
  STARWARS = list("A NEW HOPE", "The Last Jedi", "The Force Awakens"),
  LOTR = list("THE FELLOWSHIP OF THE RING", "THE Two Towers", "The RETURN of the KING",
              "Hobbit" = list("An unexpected Journey", "The Battle of the FIVE ARMY", "The Desolation of Smaug"))
)
The problem I am facing is that some of the titles are in capitals and some in lower case, without any particular order. I would like every title in the list to have a format like "The Last Jedi". Help me write a function that does this. Please keep in mind that the list is nested.
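A sketch of one solution: rapply visits every character leaf of the nested list, and a small helper title-cases each word (the target format is assumed from the "The Last Jedi" example):

```r
# Title-case a single string: "THE Two Towers" -> "The Two Towers"
title_case <- function(s) {
  words <- strsplit(tolower(s), " ", fixed = TRUE)[[1]]
  paste0(toupper(substring(words, 1, 1)), substring(words, 2), collapse = " ")
}

# Apply it to every character leaf, preserving the nested structure
tidy_titles <- function(l)
  rapply(l, title_case, classes = "character", how = "replace")

demo <- list(SW = list("A NEW HOPE"),
             LOTR = list("THE Two Towers",
                         Hobbit = list("An unexpected Journey")))
tidy_titles(demo)$LOTR$Hobbit[[1]]  # "An Unexpected Journey"
```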

Exercise 7
Load the diamonds data set from the ggplot2 package. I want to buy a diamond of each color but don't want to pay the top price; I assume the second-highest price is good enough for me. Help me find the second-highest price for each color in the dataset.
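The grouping pattern can be sketched in base R; with ggplot2 loaded, second_highest(diamonds$price, diamonds$color) would answer the exercise (treating tied top prices as one price, which is my assumption):

```r
# Second-highest distinct value per group
second_highest <- function(values, groups)
  sapply(split(values, groups),
         function(v) sort(unique(v), decreasing = TRUE)[2])

second_highest(c(5, 3, 5, 2), c("a", "a", "b", "b"))  # a: 3, b: 2
```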

Exercise 8
Use the diamonds dataset already loaded in the previous exercise. I want to know the average price for each combination of cut and color; your output should be similar to the following. Don't use table.
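tapply cross-tabulates group means directly. The diamonds data needs ggplot2, so this sketch uses mtcars, which ships with R; for the exercise itself it would be tapply(diamonds$price, list(diamonds$cut, diamonds$color), mean):

```r
# Mean of a value for every combination of two grouping variables
avg_by <- function(values, g1, g2) tapply(values, list(g1, g2), mean)

avg_by(mtcars$mpg, mtcars$cyl, mtcars$am)  # cyl x transmission matrix of mean mpg
```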

Exercise 9
Load the iris dataset. I want to get the third row for each species group. Help me write a short function to achieve that.
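A short sketch using split and lapply (the answers page may do it differently):

```r
# n-th row of each group, stacked back into one data frame
nth_row_by_group <- function(df, group, n = 3)
  do.call(rbind, lapply(split(df, group), function(d) d[n, ]))

nth_row_by_group(iris, iris$Species)  # one row per species
```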

Exercise 10
Create a new environment with the new.env() command and create 3 vector variables in that environment: a = 1:10; b = 100:500; c = 1000:1500.
Without manually calling mean on each variable, can you print the mean of every variable in the new environment?
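Base R's eapply does exactly this: it maps a function over every object in an environment. A minimal sketch:

```r
# Mean of every variable in an environment, without naming them individually
e <- new.env()
e$a <- 1:10
e$b <- 100:500
e$c <- 1000:1500

eapply(e, mean)  # list with a = 5.5, b = 300, c = 1250
```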

Related exercise sets:
  1. Accessing Dataframe Objects Exercises
  2. Correlation and Correlogram Exercises
  3. Building Shiny App exercises part 4
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Text Mining with R: A Tidy Approach

Fri, 05/19/2017 - 18:00

(This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers)

About the book

This book applies tidy data principles to text analysis. The aim is to present tools to make many text mining tasks easier, more effective, and consistent with tools already in use, and in particular it presents the tidytext R package.

I love this ebook. At the moment you can read the chapters at the book's website, and I hope it will soon be available on Amazon so I can get a paperback copy.

The authors of this beautiful exposition of methodology and coding are Julia Silge and David Robinson. Kudos to both of them. In particular, I’ve been following Julia’s blog posts in the last two years and using it as a reference to teach R in my courses.

Table of contents

List of chapters:

  1. The tidy text format
  2. Sentiment analysis with tidy data
  3. Analyzing word and document frequency: tf-idf
  4. Relationships between words: n-grams and correlations
  5. Converting to and from non-tidy formats
  6. Topic modeling
  7. Case study: comparing Twitter archives
  8. Case study: mining NASA metadata
  9. Case study: analyzing usenet text
  10. References
Remarkable contributions of this book

In my opinion, chapter 5 is one of the best expositions of data structures in R. By using modern R packages such as dplyr and tidytext, among others, the authors move between tibble, DocumentTermMatrix and VCorpus, while presenting a set of good practices in R, and they include ggplot2 charts to make concepts such as sentiment analysis clear.

If you often hear colleagues saying that R syntax is awkward, show them this material. People who last used R five years ago or more will probably be amazed to see how the %>% operator is used here.

Text analysis requires working with a variety of tools, many of which have inputs and outputs that aren’t in a tidy form. What the authors present here is a noble and remarkable piece of work.

To leave a comment for the author, please follow the link and comment on their blog: Pachá (Batteries Included).

AzureDSVM: a new R package for elastic use of the Azure Data Science Virtual Machine

Fri, 05/19/2017 - 17:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft)

The Azure Data Science Virtual Machine (DSVM) is a curated VM which provides commonly-used tools and software for data science and machine learning, pre-installed. AzureDSVM is a new R package that enables seamless interaction with the DSVM from a local R session, by providing functions for the following tasks:

  1. Deployment, deallocation, deletion of one or multiple DSVMs;
  2. Remote execution of local R scripts: compute contexts available in Microsoft R Server can be enabled for enhanced computation efficiency for either a single DSVM or a cluster of DSVMs;
  3. Retrieval of cost consumption and total expense spent on using DSVM(s).

AzureDSVM is built upon the AzureSMR package and depends on the same set of R packages such as httr, jsonlite, etc. It requires the same initial set up on Azure Active Directory (for authentication).

To install AzureDSVM with devtools package:

library(devtools)
devtools::install_github("Azure/AzureDSVM")
library("AzureDSVM")

When deploying a Data Science Virtual Machine, the machine name, size, OS type, etc. must be specified. AzureDSVM supports DSVMs on Ubuntu, CentOS, Windows, and Windows with the Deep Learning Toolkit (on GPU-class instances). For example, the following code fires up a D4 v2 Ubuntu DSVM located in South East Asia:

deployDSVM(context,
           "example",
           location = "southeastasia",
           size = "Standard_D4_v2",
           os = "Ubuntu",
           hostname = "mydsvm",
           username = "myname",
           pubkey = "pubkey")

where context is an azureActiveContext object created by AzureSMR::createAzureContext() function that encapsulates credentials (Tenant ID, Client ID, etc.) for Azure authentication.

In addition to launching a single DSVM, the AzureDSVM package makes it easy to launch a cluster with multiple virtual machines. Multi-deployment supports:

  1. creating a collection of independent DSVMs which can be distributed to a group of data scientists for collaborative projects, as well as
  2. clustering a set of connected DSVMs for high-performance computation.

To create a cluster of 5 Ubuntu DSVMs with default VM size, use:

cluster <- deployDSVMCluster(context, ,
                             location = "southeastasia",
                             hostnames = "mydsvm",
                             usernames = "myname",
                             pubkeys = "pubkey",
                             count = 5)

To execute a local script on remote cluster of DSVMs with a specified Microsoft R Server compute context, use the executeScript function. (NOTE: only Linux-based DSVM instances are supported at the moment as underneath the remote execution is achieved via SSH. Microsoft R Server 9.x allows remote interaction for both Linux and Windows, and more details can be found here.) Here, we use the RxForeachDoPar context (as indicated by the compute.context option):

executeScript(context,
              "southeastasia",
              machines = "dsvm_names_in_the_cluster",
              remote = "fqdn_of_dsvm_used_as_master",
              user = "myname",
              script = "path_to_the_script_for_remote_execution",
              master = "fqdn_of_dsvm_used_as_master",
              slaves = "fqdns_of_dsvms_used_as_slaves",
              compute.context = "clusterParallel")

Information of cost consumption and expense spent on DSVMs can be retrieved with:

consum <- expenseCalculator(context,
                            instance = "mydsvm",
                            time.start = "time_stamp_of_starting_point",
                            time.end = "time_stamp_of_ending_point",
                            granularity = "Daily",
                            currency = "USD",
                            locale = "en-US",
                            offerId = "offer_id_of_azure_subscription",
                            region = "southeastasia")
print(consum)

Detailed introductions and tutorials can be found in the AzureDSVM Github repository, linked below.

Github (Azure): AzureDSVM

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R/Finance 2017 livestreaming today and tomorrow

Fri, 05/19/2017 - 16:18

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you weren't able to make it to Chicago for R/Finance, the annual conference devoted to applications of R in the financial industry, don't fret: the entire conference is being livestreamed (with thanks to the team at Microsoft). You can watch the proceedings at, and recordings will be available at the same link after the event.

Check out the conference program below for the schedule of events (times in US Central Daylight Time).

Friday, May 19th, 2017
09:30 – 09:35  Kickoff
09:35 – 09:40  Sponsor Introduction
09:40 – 10:10  Marcelo Perlin: GetHFData: An R package for downloading and aggregating high frequency trading data from Bovespa
               Jeffrey Mazar: The obmodeling Package
               Yuting Tan: Return Volatility, Market Microstructure Noise, and Institutional Investors: Evidence from High Frequency Market
               Stephen Rush: Adverse Selection and Broker Execution
               Jerzy Pawlowski: How Can Machines Learn to Trade?
10:10 – 10:30  Michael Hirsch: Revealing High-Frequency Trading Provisions of Liquidity with Visualization in R
10:30 – 10:50  Matthew Dixon: MLEMVD: A R Package for Maximum Likelihood Estimation of Multivariate Diffusion Models
10:50 – 11:10  Break
11:10 – 11:30  Seoyoung Kim: Zero-Revelation RegTech: Detecting Risk through Linguistic Analysis of Corporate Emails and News
11:30 – 12:10  Szilard Pafka: No-Bullshit Data Science
12:10 – 13:30  Lunch
13:30 – 14:00  Francesco Bianchi: Measuring Risk with Continuous Time Generalized Autoregressive Conditional Heteroscedasticity Models
               Eina Ooka: Bunched Random Forest in Monte Carlo Risk Simulation
               Matteo Crimella: Operational Risk Stress Testing: An Empirical Comparison of Machine Learning Algorithms and Time Series Forecasting Methods
               Thomas Zakrzewski: Using R for Regulatory Stress Testing Modeling
               Andy Tang: How much structure is best?
14:00 – 14:20  Robert McDonald: Ratings and Asset Allocation: An Experimental Analysis
14:20 – 14:50  Break
14:50 – 15:10  Dries Cornilly: Nearest Comoment Estimation with Unobserved Factors and Linear Shrinkage
15:10 – 15:30  Bernhard Pfaff: R package: mcrp: Multiple criteria risk contribution optimization
15:30 – 16:00  Oliver Haynold: Practical Options Modeling with the sn Package, Fat Tails, and How to Avoid the Ultraviolet Catastrophe
               Shuang Zhou: A Nonparametric Estimate of the Risk-Neutral Density and Its Applications
               Luis Damiano: A Quick Intro to Hidden Markov Models Applied to Stock Volatility
               Oleg Bondarenko: Rearrangement Algorithm and Maximum Entropy
               Xin Chen: Risk and Performance Estimator Standard Errors for Serially Correlated Returns
16:00 – 16:20  Qiang Kou: Text analysis using Apache MxNet
16:20 – 16:40  Robert Krzyzanowski: Syberia: A development framework for R
16:40 – 16:52  Matt Dancho: New Tools for Performing Financial Analysis Within the 'Tidy' Ecosystem
               Leonardo Silvestri: ztsdb, a time-series DBMS for R users

Saturday, May 20th, 2017
08:00 – 09:00  Coffee/Breakfast
09:00 – 09:05  Kickoff
09:05 – 09:35  Stephen Bronder: Integrating Forecasting and Machine Learning in the mlr Framework
               Leopoldo Catania: Generalized Autoregressive Score Models in R: The GAS Package
               Guanhao Feng: Regularizing Bayesian Predictive Regressions
               Jonas Rende: partialCI: An R package for the analysis of partially cointegrated time series
               Carson Sievert: Interactive visualization for multiple time series
09:35 – 09:55  Emanuele Guidotti: yuimaGUI: A graphical user interface for the yuima package
09:55 – 10:15  Daniel Kowal: A Bayesian Multivariate Functional Dynamic Linear Model
10:15 – 10:45  Break
10:45 – 11:05  Jason Foster: Scenario Analysis of Risk Parity using RcppParallel
11:05 – 11:35  Michael Weylandt: Convex Optimization for High-Dimensional Portfolio Construction
               Lukas Elmiger: Risk Parity Under Parameter Uncertainty
               Ilya Kipnis: Global Adaptive Asset Allocation, and the Possible End of Momentum
               Vyacheslav Arbuzov: Dividend strategy: towards the efficient market
               Nabil Bouamara: The Alpha and Beta of Equity Hedge UCITS Funds – Implications for Momentum Investing
11:35 – 12:15  Dave DeMers: Risk Fast and Slow
12:15 – 13:35  Lunch
13:35 – 13:55  Eric Glass: Equity Factor Portfolio Case Study
13:55 – 14:15  Jonathan Regenstein: Reproducible Finance with R: A Global ETF Map
14:15 – 14:35  David Ardia: Markov-Switching GARCH Models in R: The MSGARCH Package
14:35 – 14:55  Keven Bluteau: Forecasting Performance of Markov-Switching GARCH Models: A Large-Scale Empirical Study
14:55 – 15:07  Riccardo Porreca: Efficient, Consistent and Flexible Credit Default Simulation
               Maisa Aniceto: Machine Learning and the Analysis of Consumer Lending
15:07 – 15:27  David Smith: Detecting Fraud at 1 Million Transactions per Second
15:27 – 15:50  Break
15:50 – 16:10  Thomas Harte: The PE package: Modeling private equity in the 21st century
16:10 – 16:30  Guanhao Feng: The Market for English Premier League (EPL) Odds
16:30 – 16:50  Bryan Lewis: Project and conquer
16:50 – 17:00  Prizes and Feedback
17:00 – 17:05  Conclusion

R/Finance 2017 livestream:

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Improving automatic document production with R

Fri, 05/19/2017 - 10:02

(This article was first published on R – Locke Data, and kindly contributed to R-bloggers)

In this post, I describe the latest iteration of my automatic document production with R. It improves upon the methods used in Rtraining; previous work on this topic can be read under the auto deploying R documentation tag.

I keep banging on about this area because reproducible research / analytical document pipelines is an area I have a keen interest in. I see it as a core part of DataOps, as it's vital for helping us ensure our models and analysis are correct, and for boosting our productivity.

Even after (or because of) a few years of off and on again development to the process, Rtraining had a number of issues:

  • build time was long because all presentations and their respective dependencies were required
  • if a single presentation broke, any later presentations would not get generated
  • the presentation build step was in “after_success” so didn’t trigger notifications
  • the build script for the presentations did a lot of stuff I thought could be removed
  • the dynamic index page sucks

This post covers how I’m attempting to fix all bar the last problem (more on that in a later post).

With the problems outlined, let’s look at my new base solution and how it addresses these issues.


I have built a template that can be used to generate multiple presentations and publish them to a docs/ directory for online hosting by GitHub. I can now use this template to produce category repositories, based on the folders in inst/slides/ in Rtraining. I can always split them out further at a later date.

The new repo is structured like so:

  • Package infrastructure
    • DESCRIPTION – used for managing dependencies primarily
    • R/ – store any utility functions
    • .Rbuildignore – avoid non-package stuff getting checked
  • Presentations
    • pres/ – directory for storing presentation .Rmd files
    • pres/_output.yml – file with render preferences
  • Output directory
    • docs/ – directory for putting generated slides in
  • Document generation infrastructure
    • .travis.yml – used to generate documents every time we push a commit
    • – shell script doing the git workflow and calling R
    • buildpres.R – R script that performs the render step
  • My Rtraining repo contained all presentations in the inst/slidedecks/ directory with further categories. This meant that if someone installed Rtraining, they’d get all the decks. I think this is a sub-optimal experience for people, especially because it mean installing so many packages, and I’ll be focusing instead on improving the web delivery.
  • Render requirements are now stored in an _output.yml file instead of being hard coded into the render step so that I can add more variant later
  • I’m currently using a modified version of the revealjs package as I’ve built a heavily customised theme. As I’m not planning on any of these presentation packages ever going on CRAN, I can use the Remotes field in DESCRIPTION to describe the location. This simplifies the code significantly.
Document generation
Automatic document generation with R Travis

I use travis-ci to perform the presentation builds. The instructions I provide travis are:

language: r
cache: packages
latex: false
warnings_are_errors: false

install:
  - R -e 'install.packages("devtools")'
  - R -e 'devtools::install_deps(dep = T)'

script:
  - R CMD build --no-build-vignettes --no-manual .
  - R CMD check --no-build-vignettes --no-manual *tar.gz
  - Rscript -e 'devtools::install(pkg = ".")'
  - chmod +x ./
  - ./

One important thing to note here is that I used some arguments on my package build and check steps along with latex: false to drastically reduce the build time as I have no intention of producing PDFs normally.

The install section is the prep work, and then the script section does the important bit. Now if there are errors, I’ll get notified!


The script that gets executed in my Travis build:

  • changes over to a GITHUB_PAT based connection to the repo to facilitate pushing changes and does some other config
  • executes the R render step
  • puts the R execution log in docs/ for debugging
  • commits all the changes using the important prefix [CI SKIP] so we don’t get infinite build loops
  • pushes the changes

git remote set-url origin $GITURL
git pull
git checkout master
git config --global user.name $AUTHORNAME
git config --global user.email $AUTHOREMAIL

R CMD BATCH './buildpres.R'

cp buildpres.Rout docs/

git add docs/
git commit -am "[ci skip] Documents produced in clean environment via Travis $TRAVIS_BUILD_NUMBER"
git push -u --quiet origin master

The R step is now very minimal in that it works out what presentations to generate, then loops through them and builds each one according to the options specified in _output.yml.


library(rmarkdown)

# find the .Rmd presentations (assuming the pres/ directory described above)
slides <- list.files("pres", pattern = "\\.Rmd$", full.names = TRUE)

for (f in slides) render(f, output_dir = "docs")
Next steps for me

This work has substantially mitigated most of the issues I had with my previous Rtraining workflow. I now have to get all my slide decks building under this new process.

I will be writing about making an improved presentation portal and how to build and maintain your own substantially modified revealjs theme at a later date.

The modified workflow and scripts also have implications for my pRojects package, which I'm currently developing along with Jon Calder. I'd be very interested to hear from you if you have thoughts on how to make things more streamlined.

The post Improving automatic document production with R appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.

To leave a comment for the author, please follow the link and comment on their blog: R – Locke Data.

How to interpret correspondence analysis plots (it probably isn’t the way you think)

Fri, 05/19/2017 - 09:36

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

Correspondence analysis is a popular data science technique. It takes a large table, and turns it into a seemingly easy-to-read visualization. Unfortunately, it is not quite as easy to read as most people assume.

In How correspondence analysis works (a simple explanation), I provide a basic explanation of how to interpret correspondence analysis, so if you are completely new to the field, please read that post first. In this post I provide lots of examples to illustrate some of the more complex issues.

1. Check conclusions using the raw data

The key to correctly interpreting correspondence analysis is to check any important conclusions by referring back to the original data. In this post I list 9 other things to think about when interpreting correspondence analysis. But, so long as you always remember this first rule, you will not go wrong.

The reason for this rule is illustrated in the example below. It shows 24 months of sales data by different retail categories. The visualization shows that Department stores are associated with December (i.e., Christmas, Dec-15 and Dec-16). We can see that Food retailing is on the opposite side of the map, which most people would interpret as meaning that Food retailing sales are lower in December.

Now, take a look at the actual data, shown below. Even though Food retailing is a long way from December on the map:

  1. Food retailing has the highest sales in December of any of the categories.
  2. Food retailing’s biggest month is December.

How can this be? The data seems to say the exact opposite of the visualization. If you have read How correspondence analysis works (a simple explanation), you should understand that this is because correspondence analysis is all about the relativities. If we dig deeper into the data, we can see that the map above does make sense, once you know how to read it.

While Food retailing does peak at Christmas, its sales are only 19% above its average monthly sales. By contrast, Department store sales spike to 85% above average in December. This is what correspondence analysis is trying to show us. Correspondence analysis does not show us which rows have the highest numbers, nor which columns have the highest numbers. It instead shows us the relativities. If your interest is instead in which categories sell the most, or how sales change over time, you are better off plotting the raw data than using correspondence analysis.


2. The further things are from the origin, the more discriminating they are

The correspondence analysis plot below is from a big table consisting of 42 rows, each representing a different brand, and 15 columns. You can see the original data here. Correspondence analysis has greatly simplified the story in the data. As you hopefully remember from school, the origin is where the x- and y-axes are both at 0. It is shown below as the intersection of two dashed lines. The further labels are from the origin, the more discriminating they are. Thus, Lee Jeans (at the top) is highly differentiated. Similarly, Outdoorsy is a highly discriminating attribute.

3. The closer things are to origin, the less distinct they probably are

In the map above, we see that Qantas is bang smack in the middle of the visualization. Thus, the conclusion probably is that it is not differentiated based on any of the data in the study. I explain the use of the weasel-word “probably” in the next section.

Here is another example. In the center of the map we have Wallaby and Lucky. Does this mean wallabies are lucky animals? No. They get hit by cars a lot. If you follow rugby, you will know that 99 times out of 100 a Wallaby is no match for even a Kiwi. If you look at the table below, you can see that the Wallaby is pretty average on all the variables being measured. As it has nothing that differentiates it, the result is that it is in the middle of the map (i.e., near the origin). Similarly, Lucky does not differentiate, so it is also near the center. That they are both in the center tells us that they are both indistinct, and that is all that they have in common (in the data).

4. The more variance explained, the fewer insights will be missed

I have reproduced the correspondence analysis of the brand personality data below. You will hopefully recall my mentioning that Qantas being in the middle meant that it was probably not differentiated based on the data. Why did I write “probably”? If you sum up the proportion of variance explained by the horizontal and vertical dimensions (shown in the axis labels), we see that the visualization displays 57% of the variance in the data. And, remember, this is only 57% of the variance in the relativities. So, a lot of the data has been left out of the summary. Perhaps Qantas is highly differentiated on some dimension that is irrelevant for most of the brands; the only way to know for sure is to check the data.

Now, in fairness to correspondence analysis, it is important to appreciate that it is actually a great achievement for the map to explain 57% of the variation with such a large input table. To represent all of the relativities of this table requires 14 dimensions, but we have only plotted two. Correspondence analysis is not the problem. The problem is the quantity of the data. The more data, the greater the chance that any good summary will miss out important details.
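For readers curious where the coordinates and the "variance explained" percentages on the axes come from, the mechanics can be sketched from first principles in base R. Everything below is illustrative: the input table is invented, not taken from the post's survey data, and real analyses would normally use a dedicated package. Both rows and columns are given principal coordinates (the "principal normalization" discussed later), and the squared singular values give each axis's share of the variance (inertia).

```r
# Correspondence analysis from first principles (base R only); the counts
# below are made up for illustration
ca_sketch <- function(tab) {
  P  <- tab / sum(tab)                 # correspondence matrix
  r  <- rowSums(P)                     # row masses
  cm <- colSums(P)                     # column masses
  # standardised residuals, whose SVD drives the map
  S  <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
  sv <- svd(S)
  list(rows = diag(1 / sqrt(r))  %*% sv$u[, 1:2] %*% diag(sv$d[1:2]),  # principal row coords
       cols = diag(1 / sqrt(cm)) %*% sv$v[, 1:2] %*% diag(sv$d[1:2]),  # principal column coords
       variance_explained = sv$d^2 / sum(sv$d^2))                      # inertia per axis
}

tab <- matrix(c(30, 10,  5,
                10, 25,  8,
                12, 11, 10), nrow = 3, byrow = TRUE)
res <- ca_sketch(tab)
round(res$variance_explained[1:2], 2)  # the shares shown on the axis labels
```

Distance from the origin in these principal coordinates is exactly what points 2 and 3 above read off the map, and the mass-weighted average of each set of coordinates sits at the origin.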

5. Proximity between row labels probably indicates similarity (if properly normalized)

As discussed in some detail in How correspondence analysis works (a simple explanation), we should be able to gauge the similarity of row labels based on their distance on the map (i.e., their proximity).  “Should” is another weasel word! Why? Three things are required in order for this to be true:

  1. We need to be explaining a high proportion of variance in the data. If we are not, there is always the risk that the two row labels are highly distinct, but are still shown on the map as if not distinct.
  2. The normalization, which is a technical option in correspondence analysis software, needs to have been set to either principal or row principal. I return to this in the next section.
  3. The aspect ratio of the map needs to be fixed at 1. That is, the horizontal and vertical coordinates of the map need to match each other. If your maps are in Excel or, as in the example below, PowerPoint, you may well have a problem. In the chart below, the really big pattern is that there is an enormous gap between the pro-Trump camp, on the far left, and the pro-Clinton camp on the far right. If you have even a passing understanding of American politics, this will make sense. However, if you look at the scale of the labels on the x- and y- axes you will see a problem. A distance of 0.2 on the horizontal is equivalent to a distance of 0.6 on the vertical. The map below this has the aspect ratio set to 1, and it tells a different story. Yes, the pro- and anti-Trump camps are well apart, but the disenfranchised youth are now much more prominent.

6. Proximity between column labels indicates similarity (if properly normalized)

This is a repeat of the previous point, but applying to columns. Here, the normalization needs to be either principal or column principal. You may recall me writing in the previous point that to compare between rows, we need to be using either principal or row principal normalization. So, setting the normalization to principal seems the obvious solution. But, before jumping to this conclusion, which has its own problems (as discussed in the next section), I will illustrate what these different normalization settings look like. The visualization below is based on the principal normalization. Principal is the default in some apps, such as Displayr, Q, and the R package flipDimensionReduction. However, it is not the default in SPSS, which means that comparing the distances between rows labels in a map created by SPSS with defaults is dangerous.

The plot below uses the column principal normalization. If you look very carefully, you will see that the positions of the column points are unchanged (although the map has been zoomed out). But, the positions of the row labels, representing the brands, have changed. There are two ways that the row labels' positions have changed. First, they have been stretched out to be further from the origin. Second, the degree of stretching has been greater vertically. With the principal plot shown above, the horizontal differences for the row labels are, in relative terms, bigger. With the column principal shown below, the vertical differences are bigger. So, to repeat the point made a bit earlier: the distances between the column points are valid for both principal and column principal, but the distances between the row points are not correct in the column principal shown below.

The visualization below shows the row principal normalization. Now the distances between the row labels are meaningful and consistent with those shown in the principal normalization, but the differences between the column coordinates are now misleading.


7. If there is a small angle connecting a row and column label to the origin, they are probably associated

Take a look at the plot above. Would you say Lift is more strongly associated with Cheers you up or Relax? If you have said Relax, you are interpreting the map correctly. As discussed in How correspondence analysis works (a simple explanation) it is wrong to look at the distance between row labels and column labels. Instead, we should imagine a line connecting the row and column labels with the origin. The sharper the angle, the stronger the relationship. Thus, there is a strong relationship between Relax and Lift (although, if you look at the data shown below, you will see that Lift is very small, so it does not in any sense “own” Relax).

If you have not yet had your coffee for the day, go get it now. We are at the hard bit. In the plot above, the angles are informative. However, interpreting the angles is only strictly valid when you have either row principal, column principal, or symmetrical (1/2) normalization. So, if wanting to make inferences about the relationships between the rows and columns (e.g., brands and attributes in the examples above), we are better off not using the default principal normalization. This is really the yuckiest aspect of correspondence analysis. No one normalization is appropriate for everything. Or, stated from a glass half full perspective, our choice of normalization is really a choice of how we want to mislead the viewer!

Additional complexity is added to this problem by the tendency of people not to report the normalization they used. Fortunately, we can make an educated guess based on the dispersion of the points (if the row points are all near the origin we probably have row principal, and vice versa for the columns).

Depending on the day of the week I have two ways of dealing with this issue. Most of the time, my preference is to use the principal normalization, and remind viewers to check everything in the raw data. Sometimes though, where my interest is principally in the rows of a table, I use row principal and a moon plot. The brands are plotted inside a circle, and the distances between them are meaningful. The column labels are shown on the outside of the circle. They have the same angles as on the plot above, but now the font size represents what was previously indicated by the distance between the column labels and the origin. The beauty of this representation is that we can no longer compare distances between column labels and points, so the plot is much harder to misread, and we have no need to educate the reader about the whole reading of angles. The information regarding the relativities of the column labels is harder to gauge, but this is arguably beneficial, as the construction of the plot makes it clear that the focus is on the rows (brands).

8. A row and column label are probably not associated if their angle to the origin is 90 degrees

In the moon plot above, if you draw a line connecting Red Bull to the origin, and back out to Kids, you will see that it forms roughly a right angle (90 degrees). This tells us that there is no association between Kids and Red Bull. Again, I have written “probably”. If you look at the data, shown in the table above, there is clearly a negative association. Remember, always look at the data!
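To make the angle rules concrete, here is a small base-R sketch that computes the angle formed at the origin between a row point and a column point. The coordinates are invented for illustration; they are not taken from the plots in this post.

```r
# Coordinates invented for illustration (2-D map positions)
row_point <- c(0.9, 0.4)   # e.g., a brand
col_close <- c(0.8, 0.5)   # attribute at a small angle to the brand
col_far   <- c(-0.4, 0.7)  # attribute at roughly a right angle

# Angle (in degrees) formed at the origin between two points
angle_deg <- function(p, q) {
  acos(sum(p * q) / sqrt(sum(p^2) * sum(q^2))) * 180 / pi
}

angle_deg(row_point, col_close)  # small angle: probably associated
angle_deg(row_point, col_far)    # near 90 degrees: probably unrelated
```

The same calculation underlies both rule 7 (small angle, likely association) and rule 8 (right angle, likely no association).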

9. A row and column label are probably negatively associated if they are on opposite sides of the origin

The plot below shows the traits that people want in an American president by age. What do the 25 to 34 year olds yearn for? There is a strong association with Entertaining. What is the next strongest association? You may think it would be concern about global warming and minorities. This is not the case. The next strongest associations are negative ones: the 25 to 34 year olds are less keen on a Christian President, one who has been successful in business, and one who is plain-speaking. We can see this because these traits are on the opposite side of the origin, and are a long way from the origin, whereas the traits relating to global warming and welfare of minorities are all closer to the origin, and thus are less discriminating.

Here’s another example. The correct way to read this visualization is that Yahoo is, in relative terms, stronger than Google on Fun. However, if you look at the raw data it shows that Google is much more fun than Yahoo (54% versus 28%). The reason that Yahoo has stronger association with Fun is that it is its second best performing attribute (with 29% for Easy-to-use). By contrast, while Google is twice as fun as Yahoo, it scores three times as high on High quality, and High performance, which are on the opposite side of the map, and this is what drags Google away from Yahoo.

10. The further a point is from the origin, the stronger its positive or negative associations

The visualization below shows movement of Yahoo’s perceptions from 2012 to 2017, with the arrow head showing 2017 and the base of the arrow showing 2012. The obvious way to read this is that Yahoo has become more fun, more innovative, and easier-to-use. However, such a conclusion would be misplaced.

A better interpretation is:

  • In 2012, the angle formed by connecting the base of Yahoo to the origin and back to Fun is very small, which tells us that they are associated.
  • As Fun is relatively far from the origin we know that Fun is a relatively good discriminator between the brands.
  • As Yahoo was very far from the origin, and associated with Fun, we can conclude that Yahoo and Fun were closely associated in 2012 (remember, correspondence analysis focuses on relativities; in 2012 Yahoo’s Fun score was half of Google’s).
  • From 2012 to 2017, Yahoo moved much closer to the origin, which tells us that Yahoo’s relative strengths in terms of Fun, Easy-to-Use, and Innovative, have likely declined (and, in reality, they have declined sharply; Google is now more than four times as fun).

It is really, really, important to always check key conclusions from correspondence analysis by inspecting the raw data.


Hopefully you like the look of the plots in this post! They can all be created in R using the Displayr/flipDimensionReduction package, or in Displayr and Q via the menus. More detail about the various plots shown in this post, and R code, can be found in the other correspondence analysis posts on this blog.

To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

2017-01 Variable-width lines in R

Fri, 05/19/2017 - 02:29

(This article was first published on R – Stat Tech, and kindly contributed to R-bloggers)

This document describes the ‘vwline’ package, which provides an R interface for drawing variable-width curves. The package provides functions to draw line segments through a set of locations, or a smooth curve relative to a set of control points, with the width of the line allowed to vary along the length of the line.

Paul Murrell


To leave a comment for the author, please follow the link and comment on their blog: R – Stat Tech.

Review of Efficient R Programming

Fri, 05/19/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

In the crowded market space of data science and R language books, Lovelace and Gillespie’s Efficient R Programming (2016) stands out from the crowd. Over the course of ten comprehensive chapters, the authors address the primary tenets of developing efficient R programs. Unless you happen to be a member of the R core development team, you will find this book useful whether you are a novice R programmer or an established data scientist and engineer. This book is chock full of useful tips and techniques that will help you improve the efficiency of your R programs, as well as the efficiency of your development processes. Although I have been using R daily (and nearly exclusively) for the past 4+ years, every chapter of this book provided me with new insights into how to improve my R code while helping solidify my understanding of previously learned techniques. Each chapter of Efficient R Programming is devoted to a single topic, each of which includes a “top five tips” list, covers numerous packages and techniques, and contains useful exercises and problem sets for consolidating key insights.

In Chapter 1. Introduction, the authors orient the audience to the key characteristics of R that affect its efficiency, compared to other programming languages. Importantly, the authors address R efficiency not just in the expected sense of algorithmic speed and complexity, but broaden its scope to include programmer productivity and how it relates to programming idioms, IDEs, coding conventions, and community support – all things that can improve the efficiency of writing and maintaining code. This is doubly important for a language like R, which is notoriously flexible in its ability to solve problems in multiple ways. The first chapter concludes by introducing the reader to two valuable packages: (1) microbenchmark, an accurate benchmarking tool with nanosecond precision; and (2) profvis, a handy tool for profiling larger chunks of code. These two packages are repeatedly used throughout the remainder of the book to illustrate key concepts and highlight efficient techniques.

In Chapter 2. Efficient Setup, the reader is introduced to techniques for setting up a development environment that facilitates efficient workflow. Here the authors cover choices in operating system, R version, R start-up, alternative R interpreters, and how to maintain up-to-date packages with tools like packrat and installr. I found their overview of the R startup process particularly useful, as the authors taught me how to modify my .Renviron and .Rprofile files to cache external API keys and customize my R environment, for example by adding alias shortcuts to commonly used functions. The chapter concludes by discussing how to setup and customize the RStudio environment (e.g., modifying code editing preference, editing keyboard shortcuts, and turning off restore .Rdata to help prevent bugs), which can greatly improve individual efficiency.

Chapter 3. Efficient Programming introduces the reader to efficient programming by discussing “big picture” programming techniques and how they relate to the R language. This chapter will most likely be beneficial to established programmers who are new to R, as well as to data scientists and analysts who have limited exposure to programming in a production environment. In this chapter the authors introduce the “golden rule of R programming” before delving into the usual suspects of inefficient R code. Usefully, the book illustrates multiple ways of performing the same task (e.g., data selection) with different code snippets, and highlights the performance differences through benchmarked results. Here we learn about the pitfalls of growing vectors, the benefits of vectorization, and the proper use of factors. The chapter wraps up with the requisite overview of the apply function family, before discussing the use of variable caching (package memoise) and byte compilation as important techniques in writing fast, responsive R code.
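As a minimal sketch (not taken from the book) of the growing-vector pitfall and its vectorized alternative that the chapter benchmarks:

```r
# Slow: the vector is reallocated on every iteration of the loop
grow <- function(n) {
  x <- numeric(0)
  for (i in 1:n) x <- c(x, i^2)
  x
}

# Better: preallocate the result once
prealloc <- function(n) {
  x <- numeric(n)
  for (i in 1:n) x[i] <- i^2
  x
}

# Best: let R's vectorized arithmetic do the loop in C
vectorized <- function(n) (1:n)^2

n <- 1e4
identical(grow(n), prealloc(n))        # TRUE: same values
all.equal(prealloc(n), vectorized(n))  # TRUE: same values
# microbenchmark::microbenchmark(grow(n), prealloc(n), vectorized(n))
# would show the vectorized version winning by a wide margin
```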

Chapter 4. Efficient Workflow will be of primary use to junior-level programmers, analysts, and project managers who haven’t had enough time or practice to develop their own efficient workflows. This chapter discusses the importance of project planning, audience, and scope before delving into common tools that facilitate project management. In my opinion, one of the best aspects of R is the huge, maddeningly broad number of packages that are available on CRAN and GitHub. The authors provide useful advice and techniques for identifying the packages that will be of most use to your project. A brief discussion on the use of R Markdown and knitr concludes this chapter.

Chapter 5. Efficient Input/Output is devoted to efficient read/write operations. Anybody who has ever struggled with loading a big file into R for analysis will appreciate this discussion and the packages covered in this chapter. The rio package, which can handle a wide variety of common data file types, provides a useful starting point for exploratory work on a new project. Other packages that are discussed (including readr and data.table) provide more efficient I/O than those in base R. The chapter ends with a discussion of two new file formats and associated packages, (feather and RProtoBuf), that can be used for cross-language, fast, efficient serialized data I/O.

Chapter 6. Efficient Data Carpentry introduces what are, in my opinion, the most useful R tools for data munging – what Lovelace and Gillespie prefer to call by the more admirable term “data carpentry.” This chapter could more aptly be titled the “Tidyverse” or the “Hadleyverse”, for most of the tools discussed in this chapter were developed by prolific R package writer, Hadley Wickham. Sections of the chapter are devoted to each of the primary packages of the tidyverse: tibble, a more useful and user-friendly data.frame; tidyr, used for reshaping data between short and long forms; stringr, which provides a consistent API over obtuse regex functions; dplyr, used for efficient data processing including filtering, sorting, mutating, joining, and summarizing; and of course magrittr, for piping all these operations together with %>%. A brief section on package data.table rounds out the discussion on efficient data carpentry.
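The verb-chaining style the chapter teaches can be sketched without the tidyverse packages themselves, using only base R and the native `|>` pipe (available from R 4.1); with dplyr the same pipeline would use `filter()`/`mutate()`/`summarise()` chained with `%>%`:

```r
# Filter, mutate and summarise steps expressed with base-R verbs
res <- mtcars |>
  subset(cyl == 4) |>                  # keep four-cylinder cars
  transform(kml = mpg * 0.425144)      # add a km-per-litre column

aggregate(kml ~ gear, data = res, FUN = mean)  # mean efficiency by gear
```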

Chapter 7. Efficient Optimization begins with the requisite optimization quote by computer scientist Donald Knuth:

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

In this chapter, the authors introduce profvis, and they illustrate the utility of this package by showing how it can be used to identify bottlenecks in a Monte Carlo simulation of a Monopoly game. The authors next examine alternative methods in base R that can be used for greater efficiency. These include discussion of if() vs. ifelse(), sorting operations, AND (&) and OR (|) vs. && and ||, row/column operations, and sparse matrices. The authors then apply these tricks to the Monopoly code to show a 20-fold decrease in run time. The chapter concludes with a discussion and examples of parallelization, and the use of Rcpp as an R interface to underlying fast and efficient C++ code.
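The distinction between the vectorized `&` and the scalar, short-circuiting `&&` is easy to illustrate (a small sketch, not taken from the book):

```r
x <- c(1, -2, 3)
x > 0 & x < 2          # element-wise: TRUE FALSE FALSE

# && evaluates left to right and stops as soon as the answer is known,
# so the right-hand side may never run at all
calls <- 0
rhs <- function() { calls <<- calls + 1; TRUE }
FALSE && rhs()         # FALSE
calls                  # still 0: rhs() was never evaluated
```

This is why `&&` and `||` belong in `if()` conditions, while `&` and `|` belong in vectorized data operations.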

I found the chapter Efficient Hardware to be the least useful in the book (spoiler alert: add more RAM or migrate to cloud-based services), though the chapter on Efficient Collaboration will be particularly useful for novice data scientists lacking real-world experience developing data artifacts and production applications in a distributed, collaborative environment. In this chapter, the authors discuss the importance of coding style, code comments, version control, and code review. The final chapter, Efficient Learning, will find appreciative readers among those just getting started with R (and if this describes you, I would suggest that you start with this chapter first!). Here the authors discuss using and navigating R’s excellent internal help utility, as well as the importance of vignettes and source code in learning and understanding. After briefly introducing swirl, the book concludes with a discussion of online resources, including Stack Overflow; the authors thankfully provide the newbie with important information on how to ask the right questions and the importance of providing a great R reproducible example.

In summary, Lovelace and Gillespie’s Efficient R Programming does an admirable job of illustrating the key techniques and packages for writing efficient R programs. The book will appeal to a wide audience from advanced R programmers to those just starting out. In my opinion, the book hits that pragmatic sweet spot between breadth and depth, and it usefully contains links to external resources for those wishing to delve deeper into a specific topic. After reading this book, I immediately went to work refactoring a Shiny dashboard application I am developing and several internal R packages I maintain for our data science team. In a matter of a few short hours, I witnessed a 5 to 10-fold performance increase in these applications just by implementing a couple of new techniques. I was particularly impressed with the greatly improved end-user performance and the ease with which I implemented intelligent caching with the memoise package for a consumer decision tree application I am developing. If you care deeply about writing beautiful, clean, efficient code and bringing your data science to the next level, I highly recommend adding Efficient R Programming to your arsenal.

The book is published by O’Reilly Media and is available online at the authors’ website, as well as through Safari.


To leave a comment for the author, please follow the link and comment on their blog: R Views.

A Note on on.exit()

Fri, 05/19/2017 - 02:00

(This article was first published on English Blog on Yihui Xie | 谢益辉, and kindly contributed to R-bloggers)

I have used on.exit() for several years, but it was not until the other day that I realized a very weird thing about it: you’d better follow the default positions of its arguments expr and add, i.e., the first argument has to be expr and the second has to be add.

on.exit(expr = NULL, add = FALSE)

If you do on.exit(add = TRUE, {...}), weird things can happen. I discovered this by accident. I have never switched the positions of expr and add before, and I was surprised that R CMD check failed on Travis with an error message that confused me in the beginning:

Error in on.exit(add = TRUE, if (file.exists(main)) { : invalid 'add' argument

I was thinking why add = TRUE was considered invalid. Then I guessed perhaps the expression if (file.exists(main)) {} was treated as the actual value of add. So I switched to the normal order of arguments, and the error was gone.

I tested it a bit more and was totally confused, e.g., why was 1 printed twice below? I guess TRUE was not printed because add was treated as expr.

f = function() {
  on.exit(add = print(TRUE), print(1))
}
f()
# [1] 1
# [1] 1

I don’t have the capability to understand the source code in C, and I’ll leave it to experts to explain the weird things I observed. For me, I’ll just never move add before expr again.

BTW, I don’t know what the rationale is for the default add = FALSE in on.exit(), but I have not used add = FALSE a single time, so I feel add = TRUE might be a better default. When I want to do something on exit, I almost surely mean do it in addition to the things that I assigned to on.exit() before, instead of cleaning up all previous tasks and only doing this one (add = FALSE).
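To see the difference add = TRUE makes (keeping expr and add in their default positions, as advised above), here is a quick sketch:

```r
f <- function() {
  on.exit(message("first"))
  on.exit(message("second"), add = TRUE)  # appended to the first handler
}
g <- function() {
  on.exit(message("first"))
  on.exit(message("replacement"))         # default add = FALSE: overwrites
}
f()  # emits "first" then "second"
g()  # emits only "replacement"
```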


To leave a comment for the author, please follow the link and comment on their blog: English Blog on Yihui Xie | 谢益辉.

continental divide

Fri, 05/19/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

While the Riddler puzzle this week was anticlimactic, as it meant filling all digits in the above division towards a null remainder, it came as an interesting illustration of how differently division is taught in the US versus France: when I saw the picture above, I had to go and check an American primary school on-line introduction to division, since the way I was taught in France is something like that

with the solution being that 12128316 = 124 x 97809… Solved by a dumb R exploration of all constraints:

for (y in 111:143) for (z4 in 8:9) for (oz in 0:999){
  z=oz+7e3+z4*1e4
  x=y*z
  digx=digits(x)
  digz=digits(z)
  if ((digz[2]==0)&(x>=1e7)&(x<1e8)){
    r1=trunc(x/1e4)-digz[5]*y
    if ((digz[5]*y>=1e3)&(digz[4]*y<1e4)&(r1>9)&(r1<100)){
      r2=10*r1+digx[4]-7*y
      if ((7*y>=1e2)&(7*y<1e3)&(r2>=1e2)&(r2<1e3)){
        r3=10*r2+digx[3]-digz[3]*y
        if ((digz[3]*y>=1e2)&(digz[3]*y<1e3)&(r3>9)&(r3<1e2)){
          r4=10*r3+digx[2]
          if (r4

Looking for a computer-free resolution, the constraints on z exhibited by the picture are that (a) the second digit is 0 and the fourth digit is 7, and (b) the first and fifth digits are larger than 7, since y times each of these digits is a four-digit number. Better, since the second subtraction from a three-digit number by 7y returns a three-digit number and the third subtraction from a four-digit number by ny returns a two-digit number, n is larger than 7 but less than the first and fifth digits. Ergo, z is necessarily 97809. Furthermore, 8y<10³ and 9y≥10³, which means 111 < y < 125.
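The deduction can be checked numerically. The digits() helper used in the code above is not defined in the post; a little-endian version is assumed here:

```r
# Little-endian digit extraction: digits(97809) -> 9 0 8 7 9
digits <- function(n) {
  d <- integer(0)
  while (n > 0) { d <- c(d, n %% 10); n <- n %/% 10 }
  d
}

y <- 124; z <- 97809
y * z                  # 12128316, the dividend in the puzzle
digz <- digits(z)
digz[2] == 0           # second digit (from the right) is 0
digz[4] == 7           # fourth digit is 7
c(8 * y < 1e3, 9 * y >= 1e3)  # both TRUE, so 111 < y < 125
```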

Filed under: Books, Kids, pictures, R Tagged: arithmetics, division, FiveThirtyEight, long division, The Riddler

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

Machine Learning. Artificial Neural Networks (Strength of Concrete).

Thu, 05/18/2017 - 18:52

(This article was first published on Data R Value, and kindly contributed to R-bloggers)

It is important to mention that the present posts series began as a personal way of practicing R programming and machine learning. Subsequently feedback from the community, urged me to continue performing these exercises and sharing them. The bibliography and corresponding authors are cited at all times and this posts series is a way of honoring and giving them the credit they deserve for their work.
We will develop an artificial neural network example. The example was originally published in “Machine Learning in R” by Brett Lantz, PACKT publishing 2015 (Open Source Community Experience Distilled).

The example we will develop is about predicting the strength of concrete based on the ingredients used to make it.

We will carry out the exercise verbatim as published in the aforementioned reference.

For more details on artificial neural network algorithms it is recommended to check the aforementioned reference or any other bibliography of your choice.

### “Machine Learning in R” by Brett Lantz,
### PACKT publishing 2015
### (Open Source Community Experience Distilled)
### based on: Yeh IC. “Modeling of Strength of
### High Performance Concrete Using Artificial
### Neural Networks.” Cement and Concrete Research
### 1998; 28:1797-1808.

### Strength of concrete example
### relationship between the ingredients used in
### concrete and the strength of finished product

### Dataset
### Compressive strength of concrete
### UCI Machine Learning Data Repository

### install and load required packages
if (!'neuralnet' %in% installed.packages()) install.packages('neuralnet')
library(neuralnet)

### read and explore the data
concrete <- read.csv("concrete.csv")

### neural networks work best when the input data
### are scaled to a narrow range around zero

### normalize the dataset values
normalize <- function(x){
  return((x - min(x)) / (max(x) - min(x)))
}

### apply normalize() to the dataset columns
concrete_norm <-, normalize))

### confirm and compare normalization
summary(concrete_norm$strength)
summary(concrete$strength)

### split the data into training and testing sets
### 75% – 25%

concrete_train <- concrete_norm[1:773, ]
concrete_test <- concrete_norm[774:1030, ]

### training model on the data
concrete_model <- neuralnet(strength ~ cement + slag +
              ash + water + superplastic + coarseagg +
              fineagg + age, data = concrete_train)

### visualize the network topology
plot(concrete_model)


### there is one input node for each of the eight
### features, followed by a single hidden node and
### a single output node that predicts the concrete
### strength
### at the bottom of the figure, R reports the number
### of training steps and an error measure called the
### the sum of squared errors (SSE)

### evaluating model performance

### predictions
model_results <- compute(concrete_model, concrete_test[1:8])
predicted_strength <- model_results$net.result

### because this is a numeric prediction problem rather
### than a classification problem, we cannot use a confusion
### matrix to examine model accuracy
### obtain correlation between our predicted concrete strength
### and the true value

cor(predicted_strength, concrete_test$strength)

### correlation indicate a strong linear relationships between
### two variables

### improving model performance
### increase the number of hidden nodes to five

concrete_model2 <- neuralnet(strength ~ cement + slag +
                 ash + water + superplastic + coarseagg +
            fineagg + age, data = concrete_train, hidden = 5)

### visualize the new network topology
plot(concrete_model2)

### SSE has been reduced significantly

### predictions
model_results2 <- compute(concrete_model2, concrete_test[1:8])
predicted_strength2 <- model_results2$net.result

### performance
cor(predicted_strength2, concrete_test$strength)

### notice that results can differ because neuralnet
### begins with random weights
### if you’d like to match results exactly, use set.seed(12345)
### before building the neural network 

You can get the example and the dataset in:


To leave a comment for the author, please follow the link and comment on their blog: Data R Value.

Character Functions (Advanced)

Thu, 05/18/2017 - 18:10

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

This set of exercises will help you improve your skills with character functions in R. Most of the exercises are related to text mining, a technique that analyses text statistically. If you find them interesting I would suggest checking the library tm, which includes functions designed for this task. There are many applications of text mining; a pretty popular one is the ability to associate a text with its author, which is how J.K. Rowling (the Harry Potter author) was caught publishing a new novel under an alias. Before proceeding, it might be helpful to look over the help pages for nchar, tolower, toupper, grep, sub and strsplit. Also take a look at the library stringr and the functions it includes, such as str_sub.
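As a warm-up, here is what some of these functions do on a toy string (the sentence below is just an example, not part of the exercise corpus):

```r
s <- "Omne tulit punctum, qui miscuit utile dulci."

nchar(s)                           # number of characters (not words)
tolower(s)                         # lower-case the whole string
sub("dulci", "Horatio", s)         # replace the first match only
words <- strsplit(s, "\\s+")[[1]]  # split on whitespace
length(words)                      # a crude word count
grep("^u", words, value = TRUE)    # words beginning with "u"
```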

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Before starting the set of exercises run the following code lines :

if (!'tm' %in% installed.packages()) install.packages('tm')
library(tm)
txt = system.file("texts", "txt", package = "tm")
ovid = VCorpus(DirSource(txt, encoding = "UTF-8"),
readerControl = list(language = "lat"))
TEXT = lapply(ovid[1:5], as.character)
OVID = c(data.frame(text=unlist(TEXT), stringsAsFactors = F))
TEXT1 = TEXT[[4]]

Exercise 1

Delete all the punctuation marks from TEXT1

Exercise 2

How many letters does TEXT1 contain?

Exercise 3

How many words does TEXT1 contain?

Exercise 4

What is the most common word in TEXT1?

Learn more about Text analysis in the online course Text Analytics/Text Mining Using R. In this course you will learn how create, analyse and finally visualize your text based data source. Having all the steps easily outlined will be a great reference source for future work.

Exercise 5

Get an object that contains all the words with at least one capital letter (Make sure the object contains each word only once)

Exercise 6

Which are the 5 most common letters in the object OVID?

Exercise 7

Which letters from the alphabet are not in the object OVID?

Exercise 8

In the OVID object there is a character from the popular sitcom ‘FRIENDS’. Who is he/she? There were six main characters (Chandler, Phoebe, Ross, Monica, Joey, Rachel).

Exercise 9

Find the line where this character is mentioned

Exercise 10

How many words end with a vowel, and how many with a consonant?

Related exercise sets:
  1. Regular Expressions Exercises – Part 1
  2. Scan exercises
  3. Building Shiny App exercises part 4
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

shinydashboard 0.6.0

Thu, 05/18/2017 - 17:43

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Shinydashboard 0.6.0 is now on CRAN! This release of shinydashboard was aimed at both fixing bugs and also bringing the package up to speed with users’ requests and Shiny itself (especially fully bringing bookmarkable state to shinydashboard’s sidebar). In addition to bug fixes and new features, we also added a new “Behavior” section to the shinydashboard website to explain this release’s two biggest new features, and also to provide users with more material about shinydashboard-specific behavior.


This release introduces two new sidebar inputs. One of these inputs reports whether the sidebar is collapsed or expanded, and the other reports which (if any) menu item in the sidebar is expanded. In the screenshot below, the Charts tab is expanded.

These inputs are unusual since they’re automatically available without you needing to declare them, and they have a fixed name. The first input is accessible via input$sidebarCollapsed and can have only two values: TRUE, which indicates that the sidebar is collapsed, and FALSE, which indicates that it is expanded (default).

The second input is accessible via input$sidebarItemExpanded. If no menuItem() in the sidebar is currently expanded, the value of this input is NULL. Otherwise, input$sidebarItemExpanded holds the value of the expandedName of whichever menuItem() is currently expanded (expandedName is a new argument to menuItem(); if none is provided, shinydashboard creates a sensible default).

Full changes

As usual, you can view the full changelog for shinydashboard in the NEWS file.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

New EARL Conference app

Thu, 05/18/2017 - 16:44

(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

Get the most out of the EARL Conference with the new phone app, available on iTunes and Google play.

View the agenda, speakers, and sponsors in the palm of your hand.

Bookmark the sessions you’d like to attend and rate them when they’re finished.

You can even stay up-to-date with the #EARLConf2017 Twitter feed without switching apps.

Just download the app, sign up and view the San Francisco event.

To download the app:
Or search for EARL Conference in the stores.

See you at EARL!

Click here to get 15% off tickets to San Francisco EARL.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions » R Blog.

On indexing operators and composition

Thu, 05/18/2017 - 15:15

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

In this article I will discuss array indexing, operators, and composition in depth. If you work through this article you should end up with a very deep understanding of array indexing and the deep interpretation available when we realize indexing is an instance of function composition (or an example of permutation groups or semigroups: some very deep yet accessible pure mathematics).

A permutation of indices

In this article I will be working hard to convince you a very fundamental true statement is in fact true: array indexing is associative; and to simultaneously convince you that you should still consider this amazing (as it is a very strong claim with very many consequences). Array indexing respecting associative transformations should not be a-priori intuitive to the general programmer, as array indexing code is rarely re-factored or transformed, so programmers tend to have little experience with the effect. Consider this article an exercise to build the experience to make this statement a posteriori obvious, and hence something you are more comfortable using and relying on.

R‘s array indexing notation is really powerful, so we will use it for our examples. This is going to be long (because I am trying to slow the exposition down enough to see all the steps and relations) and hard to follow without working examples (say with R), and working through the logic with pencil and a printout (math is not a spectator sport). I can’t keep all the steps in my head without paper, so I don’t really expect readers to keep all the steps in their heads without paper (though I have tried to organize the flow of this article and signal intent often enough to make this readable).

R’s square-bracket array index operator

In R array or vector indexing is commonly denoted by the square-bracket “[]“. For example if we have an array of values we can read them off as follows:

array <- c('a', 'b', 'x', 'y')
print(array[1])
# [1] "a"
print(array[2])
# [1] "b"
print(array[3])
# [1] "x"
print(array[4])
# [1] "y"

A cool thing about R‘s array indexing operator is: you can pass in arrays or vectors of values and get many results back at the same time:

print(array[c(2, 3)])
# [1] "b" "x"

You can even use this notation on the left-hand side (LHS) during assignment:

array[c(2, 3)] <- 'zzz'
print(array)
# [1] "a"   "zzz" "zzz" "y"

This ability to address any number of elements is the real power of R‘s array operator. However, if you know you only want one value I strongly suggest always using R‘s double-square operator “[[]]” which confirms you are selecting exactly one argument and is also the correct operator when dealing with lists.
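As a quick illustration of the difference (my example, not from the original post), the two operators behave differently on lists:

```r
lst <- list(a = 1:3, b = "x")

str(lst["a"])    # single bracket: returns a sub-list of length one
str(lst[["a"]])  # double bracket: returns the element itself

stopifnot(is.list(lst["a"]))           # still a list
stopifnot(identical(lst[["a"]], 1:3))  # the underlying vector
```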

Let’s get back to the single-square bracket “[]” and its vectorized behavior.

Application: computing ranks

Let’s use the square bracket to work with ranks.

Consider the following data.frame.

d <- data.frame(x = c('d', 'a', 'b', 'c'),
                origRow = 1:4,
                stringsAsFactors = FALSE)
print(d)
#   x origRow
# 1 d       1
# 2 a       2
# 3 b       3
# 4 c       4

Suppose we want to compute the rank of the x-values. This is easy as R has a built in rank-calculating function:

print(rank(d$x))
# [1] 4 1 2 3

Roughly (and ignoring controlling treatment of ties) rank calculation can also be accomplished by sorting the data so d$x is ordered, ranking in this trivial configuration (just writing an increasing sequence), and then returning the data to its original order. We are going to use R‘s order() command. This calculates a permutation such that data is in sorted order (in this article all permutations are represented as arrays of length n containing each of the integers from 1 through n exactly once). order() works as follows:

ord <- order(d$x)
print(ord)
# [1] 2 3 4 1
print(d$x[ord])
# [1] "a" "b" "c" "d"

The rank calculation written in terms of order() then looks like the following:

d2 <- d[ord, ]
d2$rankX <- 1:nrow(d2)
d3 <- d2[order(d2$origRow), ]
print(d3$rankX)
# [1] 4 1 2 3

And we again have the rankings.
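As an aside (my addition, not in the original post): for distinct values the whole sort-label-unsort round trip collapses to order(order(x)), which equals rank(x):

```r
x <- c('d', 'a', 'b', 'c')

# order(x) gives the sorting permutation; applying order again
# recovers each element's position in sorted order, i.e. its rank
r <- order(order(x))
print(r)
# [1] 4 1 2 3

stopifnot(all(r == rank(x)))
```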

Returning to the original order

Of particular interest are the many ways we can return d2 to original d$origRow-order. My absolute favorite way is indexing the left-side as in:

d4 <- d2         # scratch frame to ready for indexed assignment
d4[ord, ] <- d2  # invert by assignment
print(d4)
#   x origRow rankX
# 2 d       1     4
# 3 a       2     1
# 4 b       3     2
# 1 c       4     3

The idea is d2 <- d[ord, ] applies the permutation represented by ord and d4[ord, ] <- d2 undoes the permutation represented by ord. The notation is so powerful it almost looks like declarative programming (and a lot like the explicit fixed-point operators we were able to write in R here).

Let’s see that again:

print(ord)
# [1] 2 3 4 1
invOrd <- numeric(length(ord))  # empty vector to ready for indexed assignment
invOrd[ord] <- 1:length(ord)    # invert by assignment
print(invOrd[ord])
# [1] 1 2 3 4
print(invOrd)
# [1] 4 1 2 3

We used the assignment invOrd[ord] <- 1:length(ord) to “say” we wanted invOrd[ord] to be “1 2 3 4” and it is “1 2 3 4“. This means invOrd looks like an inverse of ord, which is why it can undo the ord permutation. We can get d2 into the correct order by writing d2[invOrd, ].
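A side note (my addition): order() itself computes the inverse of a permutation, so invOrd could also have been written order(ord). A quick check:

```r
ord <- c(2, 3, 4, 1)

# invert by indexed assignment, as in the post
invOrd <- integer(length(ord))
invOrd[ord] <- seq_along(ord)

# order(ord) yields the same inverse permutation
stopifnot(identical(invOrd, order(ord)))
stopifnot(all(invOrd[ord] == seq_along(ord)))
```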

Why did that indexing work?

To work out why the above transformations are correct we will need a couple of transform rules (both to be established later!):

  • Associativity of indexing: a[b[c]] == (a[b])[c] (a, b, and c all permutations of 1:n).
  • Equivalence of left and right inverses: a[b] == 1:n if and only if b[a] == 1:n (a and b both permutations of 1:n). Note one does not have a[b] == b[a] in general (check with a <- c(2,1,3); b <- c(1,3,2)).

The above follow from the fact that composition of permutations (here seen as composition of indexing) form a group similar to function composition, but let’s work up to that.

Here is our reasoning to show d4 has its rows back in the original order of those of d:

  1. d2$origRow == (1:n)[ord] (as d2 == d[ord, ]), so d2$origRow == ord (as (1:n)[ord] == ord).
  2. d4$origRow[ord] == d2$origRow (by the left hand side assignment), so d4$origRow[ord] == ord (by the last step).

    If we could just cancel off the pesky “ord” from both sides of this equation we would be done. That is in fact how we continue, bringing in rules that justify the cancellation.

  3. (d4$origRow[ord])[invOrd] == d4$origRow[ord[invOrd]] (by associativity, which we will prove later!). So we can convert (d4$origRow[ord])[invOrd] == ord[invOrd] (derivable from the last step) into d4$origRow[ord[invOrd]] == ord[invOrd].
  4. Substituting in ord[invOrd] == 1:n (the other big thing we will show is: ord[invOrd] == invOrd[ord] == 1:n) yields d4$origRow == 1:n. This demonstrates d4‘s rows must be back in the right order (that which we were trying to show).

So all that remains is to discuss associativity, and also show why invOrd[ord] == 1:n (which was established by the assignment invOrd[ord] <- 1:length(ord)) implies ord[invOrd] == 1:n (what we actually used in our argument).

Some operator notation

To make later steps easier, let’s introduce some R-operator notation. Define:

  • An array indexing operator:

    `%[]%` <- function(a,b) { a[b] }
  • A function composition operator:

    `%.%` <- function(f,g) { function(x) f(g(x)) }

    (this is pure function composition, which allows us to write abs(sin(1:4)) as (abs %.% sin)(1:4)).

  • A function application operator:

    `%+%` <- function(f,g) { f(g) }

    (which is only right-associative as in abs %+% ( sin %+% 1:4 ) == abs(sin(1:4))). This notation is interesting as it moves us towards “point free notation”, though to move all the way there we would need a point-free function abstraction operator.

  • A reverse application operator such as “magrittr::`%>%`” or the following imitation:

    `%|>%` <- function(f,g) { g(f) }

    (which can write abs(sin(1:4)) as 1:4 %|>% sin %|>% abs, the notation being in reference to F#‘s as mentioned here). Here we left out the parenthesis not because %|>% is fully associative (it is not), but because it is left-associative which is also the order R decides to evaluate user operators (so 1:4 %|>% sin %|>% abs is just shorthand for ( 1:4 %|>% sin ) %|>% abs).

With %[]% our claim about inverses is written as follows. For permutations on 1:n: ord %[]% invOrd == 1:n implies invOrd %[]% ord == 1:n. (Again we do not have a %[]% b == b %[]% a in general, as shown by a <- c(2,1,3); b <- c(1,3,2).)
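The claimed non-commutativity is easy to check directly (a small check of my own, declaring %[]% here even though the post defines it above):

```r
`%[]%` <- function(a, b) { a[b] }

a <- c(2, 1, 3)
b <- c(1, 3, 2)

print(a %[]% b)  # [1] 2 3 1
print(b %[]% a)  # [1] 3 1 2

# the two orders of composition disagree
stopifnot(!identical(a %[]% b, b %[]% a))
```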

Semigroup axioms

In operator notation we claim the following are true (not all of which we will confirm here). Call an element “a permutation” if it is an array of length n containing each integer from 1 through n exactly once. Call an element “a sequenced selection” if it is an array of length n containing only integers from 1 through n (possibly with repetitions). n will be held at a single value throughout. All permutations are sequenced selections.

Permutations and sequenced selections can be confirmed to obey the following axioms:

  1. For all a, b sequenced selections we have a %[]% b is itself a sequenced selection. If both a, b are permutations then a %[]% b is also a permutation.
  2. (this is the big one) For all a, b, c sequenced selections we have: ( a %[]% b ) %[]% c == a %[]% ( b %[]% c ).
  3. For all a that are sequenced selections we have: (1:n) %[]% a == a and a %[]% (1:n) == a. We call 1:n the multiplicative identity and it is often denoted as e <- 1:n.
  4. For all “a” a permutation there exist La, Ra permutations of 1:n such that La %[]% a == 1:n and a %[]% Ra == 1:n. Also, we can derive La == Ra from the earlier axioms.

The above are essentially the axioms defining a mathematical object called a semigroup (for the sequenced selections) or group (for the permutations).
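Axiom 2 can be spot-checked by brute force over random sequenced selections (a quick sanity check of my own, not a proof):

```r
`%[]%` <- function(a, b) { a[b] }

set.seed(2017)
n <- 5
for (i in 1:1000) {
  # random sequenced selections: values from 1:n, repeats allowed
  a <- sample.int(n, n, replace = TRUE)
  b <- sample.int(n, n, replace = TRUE)
  c <- sample.int(n, n, replace = TRUE)
  # associativity: both parenthesizations must agree
  stopifnot(identical((a %[]% b) %[]% c, a %[]% (b %[]% c)))
}
```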

A famous realization of a permutation group: Rubik’s Cube.

Checking our operator “%[]%” obeys axioms 1 and 3 is fairly mechanical. Confirming axiom 4 roughly follows from the fact you can sort arrays (i.e. that order() works). Axiom 2 (associativity) is the amazing one. A lot of the power of groups comes from the associativity and a lot of the math of things like Monads is just heroic work trying to retain useful semigroup-like properties.

Nim: an associative irreversible state space.

A bit more on inverses

Suppose we had confirmed all of the above axioms, then our remaining job would be to confirm that invOrd[ord] == 1:n implies ord[invOrd] == 1:n. This is easy in the %[]% notation.

The argument is as follows:

  • By axiom 4 we know there exist L and R such that L %[]% ord == 1:n and ord %[]% R == 1:n. In fact we have already found L == invOrd. So if we can show L == R we are done.
  • Expand L %[]% ord %[]% R two ways using axiom 2 (associativity):
    • L %[]% ord %[]% R == ( L %[]% ord ) %[]% R == (1:n) %[]% R == R
    • L %[]% ord %[]% R == L %[]% ( ord %[]% R ) == L %[]% (1:n) == L
  • So: L == R == invOrd as required.


An important property of %[]% and %.% is that they are fully associative over their values (permutations and functions respectively). We can safely re-parenthesize them (causing different execution order and different intermediate results) without changing outcomes. This is in contrast to %+%, %|>%, and %>% which are only partially associative (yet pick up most of their power from properly managing what associativity they do have).
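For example (my own check, not from the post), re-parenthesizing %.% changes the intermediate closures but not the result:

```r
`%.%` <- function(f, g) { function(x) f(g(x)) }

f1 <- (abs %.% sin) %.% sqrt   # left association
f2 <- abs %.% (sin %.% sqrt)   # right association

x <- c(1, 4, 9)
# both compute abs(sin(sqrt(x)))
stopifnot(isTRUE(all.equal(f1(x), f2(x))))
```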

%>% works well because it associates in the same direction as R‘s parser. We don’t write parenthesis in “-4 %>% abs %>% sqrt” because in this case it is unambiguous that it must be the case that the parenthesis are implicitly “( -4 %>% abs ) %>% sqrt” (by R left-associative user operator parsing) as “-4 %>% ( abs %>% sqrt)” would throw an exception (so post hoc ergo propter hoc could not be how R interpreted “-4 %>% abs %>% sqrt“, as that did not throw an exception). So it isn’t that both associations are equal (they are not), it is that only one of them is “well formed” and that one happens to be the way R‘s parser works. It isn’t that R‘s parser is magically looking ahead to solve this, it is just the conventions match.

%[]% is also neat in that values have a nice interpretation as functions over values. All the other operators are either more about functions (%.%) or more about values (%+%, %|>%, and %>%).

The amazing emergence of full associativity

Now consider the following two (somewhat complicated) valid R expressions involving permutations a, b, and c:

  1. a[b[c]]: which means calculate x <- b[c] and then calculate a[x].
  2. (a[b])[c]: which means calculate y <- a[b] and then calculate y[c].

Consider this as a possible bar bet with programming friends: can they find three vectors that are permutations of 1:n (or even just length n vectors consisting of any combination of values taken from the integers 1:n) where the above two calculations disagree?

For example we could try the following:

n <- 4
a <- c(1, 3, 2, 4)
b <- c(4, 3, 2, 1)
c <- c(2, 3, 4, 1)
x <- b[c]
print(a[x])
# [1] 2 3 1 4
y <- a[b]
print(y[c])
# [1] 2 3 1 4
# whoa, they are equal

It is an amazing fact for the types of values we are discussing we always have:

a[b[c]] == (a[b])[c]

This is what we claim when we claim:

a %[]% ( b %[]% c ) == ( a %[]% b ) %[]% c

(i.e., when we claim associativity).

The above means we can neglect parenthesis and unambiguously write “a %[]% b %[]% c” as both common ways of inserting the parenthesis yield equivalent values (though specify different execution order and entail different intermediate results).

Nested Indexing is an Instance of Function Composition

We can confirm associativity by working through all the details of array indexing (which would be a slog), or we can (as we will here) confirm it by an appeal to algebra. Either way the claim is true, but sufficiently subtle that you certainly will not believe it without some more experience (which we will try to supply here). Associativity of indexing is unintuitive mostly because it is unfamiliar; one rarely sees code re-factored based on associativity.

As we mentioned above, each different association or parenthesization specifies a different calculation, with different intermediate values, but both result in the same value.

The proof is as follows. Consider each sequenced selection a, b, c as a function that maps the integers 1:n to the integers 1:n (with a(i) defined as a[i], and similar for b and c). Some inspection shows sequenced selections composed with the operator %[]% must behave just as functions composed under %.%. Function composition is famously fully associative, therefore (by the parallel behavior between %.% and %[]%) we know %[]% is fully associative.
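The correspondence can be made concrete (my sketch): wrap each selection as a function on indices and compare %[]% against %.%:

```r
`%[]%` <- function(a, b) { a[b] }
`%.%` <- function(f, g) { function(x) f(g(x)) }

a <- c(3, 1, 2)
b <- c(2, 3, 1)

# each sequenced selection, viewed as a function on 1:n
fa <- function(i) a[i]
fb <- function(i) b[i]

# indexing composes exactly as the functions compose
stopifnot(identical(a %[]% b, (fa %.% fb)(seq_along(a))))
```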

Function composition being fully associative usually comes as a surprise to non-mathematicians. I doubt most users of math regularly think about this. Here is why function composition being fully associative is hard to grasp.

  • Function composition notation in standard mathematics (pre lambda-calculus) is actually written informally (even in “formal” proofs) and often confuses the reader with notation that seems to imply the function composition operator is “`%+%` <- function(f,g) { f(g) }” (which is the application operator, not the composition operator). The function composition operator is the more ungainly “`%.%` <- function(f,g) { function(x) f(g(x)) }“.
  • Function composition being fully associative is a very strong statement. So strong one should not accept it without proof and working examples. So in that sense it is not intuitive.
  • Function composition being associative follows very quickly from the proper axioms. In particular associativity is proven by simply substituting in a value for a variable (more on this later). This is by design: axioms are adjusted until fundamental concepts are near them. In fact Gian-Carlo Rota wrote “Great definitions are what mathematics contributes to the world” (Indiscrete Thoughts, p. 46). However, “by definition” proofs tend to be so short and move so fast that they look like mere notational tricks (and are thus hard to internalize).

Some nice writing on proving function composition is associative can be found here: “Is composition of functions associative?”. Function composition is a very learnable concept (and well worth thinking about it). Do not worry if it takes you a long time to get comfortable with it. Nobody understands it quickly the first time (though it is a very sensible topic deeply understood by very many mathematical teachers and writers).


In this writeup we had the rare pleasure of showing two different implementations of a concept (nested indexing) are equivalent. In programming there are very few operations that are so regular and interchangeable. This is why I advocate design choices that preserve referential transparency (the statement that you can safely substitute values for variables, which is one of the few things that lets us reason about programs) to keep as many of these opportunities as practical.

At this point I hope you find the vectorized square-bracket as nifty as I do. It allows some very succinct expressions of powerful sorting and permutation steps. The “find the inverse by putting the square-bracket on the left side” trick is one of my favorites, and actually quite useful in arranging data for analysis (especially ordered data such as time series, or when you need to work with ranks or quantiles). It always seems “a bit magic.” It really is a bit magic, but it is also formally correct and reliable.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Unsupervised Learning and Text Mining of Emotion Terms Using R

Thu, 05/18/2017 - 14:07

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Unsupervised learning refers to data science approaches that involve learning without prior knowledge about the classification of sample data. Wikipedia describes unsupervised learning as “the task of inferring a function to describe hidden structure from ‘unlabeled’ data (a classification or categorization is not included in the observations)”. The overarching objectives of this post were to evaluate and understand the co-occurrence and/or co-expression of emotion words in individual letters, and to determine whether there were any differential expression profiles/patterns of emotion words among the 40 annual shareholder letters. Differential expression of emotion words here refers to quantitative differences in emotion word frequency counts among letters, as well as qualitative differences in certain emotion words occurring uniquely in some letters but not in others.

The dataset

This is the second part to a companion post I have on “parsing textual data for emotion terms”. As with the first post, the raw text data set for this analysis consisted of Mr. Warren Buffett’s annual shareholder letters from the past 40 years (1977 – 2016). The R code to retrieve the letters was from here.

## Retrieve the letters
library(pdftools)
library(rvest)
library(XML)

# Getting & Reading in HTML Letters
urls_77_97 <- paste('', seq(1977, 1997), '.html', sep = '')
html_urls <- c(urls_77_97, '', '', '', '')
letters_html <- lapply(html_urls, function(x) read_html(x) %>% html_text())

# Getting & Reading in PDF Letters
urls_03_16 <- paste('', seq(2003, 2016), 'ltr.pdf', sep = '')
pdf_urls <- data.frame('year' = seq(2002, 2016),
                       'link' = c('', urls_03_16))
download_pdfs <- function(x) {
  myfile = paste0(x['year'], '.pdf')
  download.file(url = x['link'], destfile = myfile, mode = 'wb')
  return(myfile)
}
pdfs <- apply(pdf_urls, 1, download_pdfs)
letters_pdf <- lapply(pdfs, function(x) pdf_text(x) %>% paste(collapse = " "))
tmp <- lapply(pdfs, function(x) if(file.exists(x)) file.remove(x))

# Combine letters in a data frame
letters <-, Map(data.frame,
                             year = seq(1977, 2016),
                             text = c(letters_html, letters_pdf)))
letters$text <- as.character(letters$text)

Analysis of emotions terms usage

## Load additional required packages
require(tidyverse)
require(tidytext)
require(gplots)
require(SnowballC)
require(sqldf)
theme_set(theme_bw(12))

### pull emotion words and aggregate by year and emotion terms
emotions <- letters %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!grepl('[0-9]', word)) %>%
  left_join(get_sentiments("nrc"), by = "word") %>%
  filter(!(sentiment == "negative" | sentiment == "positive")) %>%
  group_by(year, sentiment) %>%
  summarize(freq = n()) %>%
  mutate(percent = round(freq/sum(freq)*100)) %>%
  select(-freq) %>%
  spread(sentiment, percent, fill = 0) %>%
  ungroup()

## Normalize data
sd_scale <- function(x) { (x - mean(x))/sd(x) }
emotions[, c(2:9)] <- apply(emotions[, c(2:9)], 2, sd_scale)
emotions <-
rownames(emotions) <- emotions[, 1]
emotions3 <- emotions[, -1]
emotions3 <- as.matrix(emotions3)

## Using a heatmap and clustering to visualize and profile emotion terms expression data
heatmap.2(
  emotions3,
  dendrogram = "both",
  scale = "none",
  trace = "none",
  key = TRUE,
  col = colorRampPalette(c("green", "yellow", "red"))
)

Heatmap and clustering:

The colors of the heatmap represent high levels of emotion terms expression (red), low levels of emotion terms expression (green) and average or moderate levels of emotion terms expression (yellow).

Co-expression profiles of emotion words usage

Based on the expression profiles combined with the vertical dendrogram, there are about four co-expression profiles of emotion terms: i) emotion terms referring to fear and sadness appeared to be co-expressed together; ii) anger and disgust showed similar expression profiles and hence were co-expressed emotion terms; iii) emotion terms referring to joy, anticipation and surprise appeared to be similarly expressed; and iv) emotion terms referring to trust showed the least co-expression with the other emotion terms.

Emotion expression profiling of annual shareholder letters

Looking at the horizontal dendrogram and heatmap above, approximately six groups of emotions expressions profiles can be recognized among the 40 annual shareholder letters.
Group-1 letters showed over-expression of emotion terms referring to anger, disgust, sadness or fear. Letters of this group included 1982, 1987, 1989 & 2003 (first 4 rows of the heatmap).
Group-2 and group-3 letters showed over-expression of emotion terms referring to surprise, disgust, anger and sadness. Letters of this group included 1985, 1994, 1996, 1998, 2010, 2014, 2016, 1990 – 1993, 2002, 2004, 2005 & 2015 (rows 5 – 19 of the heatmap).
Group-4 letters showed over-expression of emotion terms referring to surprise, joy and anticipation. Letters of this group included 1995, 1997 – 2000, 2006, 2007, 2009 & 2011 – 2013 (rows 20 – 30).
Group-5 letters showed over-expression of emotion terms referring to fear, sadness, anger and disgust. Letters of this group included 1984, 1986, 2001 & 2008 (rows 31 – 34).
Group-6 letters showed over-expression of emotion terms referring to trust. Letters of this group included those of the early letters (1977 – 1981 & 1983) (rows 35 – 40).

Numbers of emotion terms expressed uniquely or in common among the heatmap groups

The next steps of the analysis attempted to determine the numbers of emotion words that were uniquely expressed in any of the heatmap groups. I also wanted to see whether some emotion words were expressed in all of the heatmap groups. For these analyses, I chose to include a word stemming procedure. Wikipedia describes word stemming as “the process of reducing inflected (or sometimes derived) words to their word stem, base or root form - generally a written word form”. In practice, the stemming step removes word endings such as “es”, “ed” and “’s”, so that the various word forms are treated as the same when considering uniqueness and/or counting word frequency (see the example below for before and after applying a word stemmer function in R).

Examples of word stemmer output

There are several word stemmers in R. One such function, wordStem in the SnowballC package, extracts the stem of each of the given words in a vector (see the example below).

Before <- c("produce", "produces", "produced", "producing", "product", "products", "production")
wstem <-
names(wstem) <- "After"
pander::pandoc.table(cbind(Before, wstem))

Here is the example output for stemming

## ------------------
##   Before    After
## ---------- -------
##  produce   produc
##  produces  produc
##  produced  produc
## producing  produc
##  product   product
##  products  product
## production product
## ------------------

Analysis of unique and commonly expressed emotion words

## pull emotions words for selected heatmap groups and apply stemming
set.seed(456)
emotions_final <- letters %>%
  unnest_tokens(word, text) %>%
  filter(!grepl('[0-9]', word)) %>%
  left_join(get_sentiments("nrc"), by = "word") %>%
  filter(!(sentiment == "negative" | sentiment == "positive" | sentiment == "NA")) %>%
  subset(year==1987 | year==1989 | year==2001 | year==2008 | year==2012 | year==2013) %>%
  mutate(word = wordStem(word)) %>%
  ungroup()

group1 <- emotions_final %>%
  subset(year==1987 | year==1989) %>%
  select(-year, -sentiment) %>%
  unique()

set.seed(456)
group4 <- emotions_final %>%
  subset(year==2012 | year==2013) %>%
  select(-year, -sentiment) %>%
  unique()

set.seed(456)
group5 <- emotions_final %>%
  subset(year==2001 | year==2008) %>%
  select(-year, -sentiment) %>%
  unique()

Unique and common emotion words among two groups

Let’s draw a two-way Venn diagram to find out which emotions terms were unique or commonly expressed between heatmap group-1 & group-4.

# common and unique words between two groups emo_venn2 <- venn(list(group1$word, group4$word))

Two-way Venn diagram:

A total of 293 emotion terms were expressed in common between group-1 (A) and group-4 (B). There were also 298 and 162 emotion words used uniquely in heatmap groups 1 & 4, respectively.

Unique and common emotion words among three groups

Let’s draw a three way Venn diagram to find out which emotions terms were uniquely or commonly expressed among group-1, group-4 & group-5.

# common and unique words among three groups emo_venn3 <- venn(list(group1$word, group4$word, group5$word))

Three way Venn diagram:

A total of 225 emotion terms were expressed in common among the three heatmap groups. On the other hand, there were 193, 108 and 159 emotion words used uniquely in heatmap group-1 (A), group-4 (B) and group-5 (C), respectively. The Venn diagram also shows the various combinations of unique and common emotion word expression. Of particular interest were the 105 emotion words that were expressed in common between heatmap group-1 and group-5. Recall that group-1 & group-5 were highlighted for their high expression of emotions referring to disgust, anger, sadness and fear.

I also wanted to see a list of those emotion words that were expressed uniquely and/or in common among several groups. For instance, the R code below requests a list of the 159 unique emotion words that were expressed solely in the group-5 letters of 2001 & 2008.

## The code below pulled a list of all common/unique emotion words expressed
## in all possible combinations of the three heatmap groups
venn_list <- (attr(emo_venn3, "intersection"))
## and then print only the list of unique emotion words expressed in group-5.
print(venn_list$'C')
##  [1] "manual"    "familiar"  "sentenc"
##  [4] "variabl"   "safeguard" "abu"
##  [7] "egregi"    "gorgeou"   "hesit"
## [10] "strive"    "loyalti"   "loom"
## [13] "approv"    "like"      "winner"
## [16] "entertain" "tender"    "withdraw"
## [19] "harm"      "strike"    "straightforward"
## [22] "victim"    "iron"      "bounti"
## [25] "chaotic"   "bloat"     "proviso"
## [28] "frank"     "honor"     "child"
## [31] "lemon"     "prospect"  "relev"
## [34] "threaten"  "terror"    "quak"
## [37] "scarciti"  "question"  "manipul"
## [40] "deton"     "terrorist" "attack"
## [43] "ill"       "nation"    "hydra"
## [46] "disast"    "sadli"     "prolong"
## [49] "concern"   "urgenc"    "presid"

The output from the above code included a list of 159 words, but the list above contains only the first 51 for space considerations. Also, you may have noticed that some of the emotion words were truncated and did not look like proper words; that is due to stemming.

Dispersion Plot

A dispersion plot is a graphical display that can be used to represent the approximate locations and densities of emotion terms across the length of a text document. Shown below are three dispersion plots of the unique emotion words of heatmap group-1 (1987, 1989), group-5 (2001, 2008) and group-4 (2012, 2013) shareholder letters. For the dispersion plots, all words in the listed years were sequentially ordered by year of the letters, and the presence and approximate location of each unique word was identified/displayed by a stripe. Each stripe represents an instance of a unique word in the shareholder letters.

Confirmation of emotion words expressed uniquely in heatmap group-1

## Confirmation of unique emotion words in heatmap group-1
group1_U <-$'A')
names(group1_U) <- "terms"
uniq1 <- sqldf("select t1.*, g1.terms
                from emotions_final t1
                left join group1_U g1 on t1.word = g1.terms")
uniq1a <- !$terms)
uniqs1 <- rep(NA, nrow(uniq1))
uniqs1[uniq1a] <- 1
plot(uniqs1,
     main = "Dispersion plot of emotions words \n unique to heatmap group 1",
     xlab = "Length (Word count)", ylab = " ",
     col = "red", type = 'h', ylim = c(0, 1), yaxt = 'n')

Dispersion plot of emotion words:

Heatmap group-1 letters included those in 1987/1989. As expected, the dispersion plot above confirmed that the unique emotion words in group-1 were confined at the start of the dispersion plot.

Confirmation of emotion words expressed uniquely in heatmap group-5

## confirmation of unique emotion words in heatmap group-5
group5_U <-$'C')
names(group5_U) <- "terms"
uniq5 <- sqldf("select t1.*, g5.terms
                from emotions_final t1
                left join group5_U g5 on t1.word = g5.terms")
uniq5a <- !$terms)
uniqs5 <- rep(NA, nrow(uniq5))
uniqs5[uniq5a] <- 1
plot(uniqs5,
     main = "Dispersion plot of emotions words \n unique to heatmap group 5",
     xlab = "Length (Word count)", ylab = " ",
     col = "red", type = 'h', ylim = c(0, 1), yaxt = 'n')

Dispersion plot of emotions words:

Heatmap group-5 letters included those in 2001 & 2008. As expected, the dispersion plot above confirmed that the unique emotion words in group-5 were confined at the middle parts of the dispersion plot.

Confirmation of emotion words expressed uniquely in heatmap group-4

## confirmation of unique emotion words in heatmap group-4
group4_U <-$'B')
names(group4_U) <- "terms"
uniq4 <- sqldf("select t1.*, g4.terms
                from emotions_final t1
                left join group4_U g4 on t1.word = g4.terms")
uniq4a <- !$terms)
uniqs4 <- rep(NA, nrow(uniq4))
uniqs4[uniq4a] <- 1
plot(uniqs4,
     main = "Dispersion plot of emotions words \n unique to heatmap group 4",
     xlab = "Length (Word count)", ylab = " ",
     col = "red", type = 'h', ylim = c(0, 1), yaxt = 'n')

Dispersion plot of emotions words:

Heatmap group4 letters included those in 2012 & 2013. As expected, the dispersion plot above confirmed that the unique emotion words in group-4 were confined towards the end of the dispersion plot.

Annual Returns on Investment in S&P500 (1977 – 2016)

Finally, a graph of the annual returns on investment in the S&P 500 during the same 40 years covered by the annual shareholder letters is displayed below for perspective. The S&P 500 data was downloaded from here using R code from here.

## You need to first download the raw data before running the code to recreate the graph below.
ggplot(sp500[50:89, ], aes(x = year, y = return, colour = return > 0)) +
  geom_segment(aes(x = year, xend = year, y = 0, yend = return),
               size = 1.1, alpha = 0.8) +
  geom_point(size = 1.0) +
  xlab("Investment Year") +
  ylab("S&P500 Annual Returns") +
  labs(title = "Annual Returns on Investment in S&P500",
       subtitle = "source:") +
  theme(legend.position = "none") +
  coord_flip()

Annual Returns on Investment in S&P500:

Concluding Remarks

R offers several packages and functions for the evaluation and analyses of differential expression and co-expression profiling of emotion words in textual data, as well as visualization and presentation of analyses results. Some of those functions, techniques and tools have been attempted in two companion posts. Hopefully, you found the examples helpful.

    Related Post

    1. Using MCA and variable clustering in R for insights in customer attrition
    2. Web Scraping and Applied Clustering Global Happiness and Social Progress Index
    3. Key Phrase Extraction from Tweets
    4. Financial time series forecasting – an easy approach
    5. Outlier detection and treatment with R

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.