R news and tutorials contributed by hundreds of R bloggers
Updated: 9 hours 24 min ago

### Don’t teach students the hard way first

Thu, 09/21/2017 - 16:00

(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

Imagine you were going to a party in an unfamiliar area, and asked the host for directions to their house. It takes you thirty minutes to get there, on a path that takes you on a long winding road with slow traffic. As the party ends, the host tells you “You can take the highway on your way back, it’ll take you only ten minutes. I just wanted to show you how much easier the highway is.”

Wouldn’t you be annoyed? And yet this kind of attitude is strangely common in programming education.

I was recently talking to a friend who works with R and whose opinions I greatly respect. He was teaching some sessions to people in his company who hadn’t used R before, where he largely followed my philosophy on teaching the tidyverse to beginners. I agreed with his approach, until he said something far too familiar to me:

“I teach them dplyr’s group_by/summarize in the second lesson, but I teach them loops first just to show them how much easier dplyr is.”

I talk to people about teaching a lot, and that phrase keeps popping up: “I teach them X just to show them how much easier Y is”. It’s a trap- a trap I’ve fallen into before when teaching, and one that I’d like to warn others against.

First, why do some people make this choice? I think because when we teach a class, we accidentally bring in all of our own history and context.

For instance, I started programming with Basic and Perl in high school, then Java in college, then got really into Python, then got even more into R. Along the way I built up habits in each language that had to be painfully undone afterwards. I worked in an object-oriented style in Python, then switched to thinking in data frames and functional operations. I wrote too many loops when I started R, then I grew accustomed to the apply family of functions. Along the way there were thousands of little frustrations and little epiphanies, tragedies and triumphs in microcosm.

But when I’m teaching someone how to use R… they don’t care about any of that. They weren’t there! They didn’t make my mistakes, they don’t need to be talked out of them. Going back to the party host, perhaps the highway was built only last year, and maybe the host had to talk his friends, stuck in their habits, into taking the highway by explaining how much faster it is. But that doesn’t make any difference to the guest, who has never taken either route.

@ucfagls I think ppl teach base stuff preferentially because it’s what they still use or they’re reliving their own #rstats “journey” @drob

— Jenny Bryan (@JennyBryan) January 16, 2015

It’s true that I learned a lot about programming in the path I described above. I learned how to debug, how to compare programming paradigms, and when to switch from one approach to another. I think some teachers hope that by walking students through this “journey”, we can impart some of that experience. But it doesn’t work: feeding students two possible solutions in a row is nothing like the experience of them comparing solutions for themselves, and doesn’t grant them any of the same skills. Besides which, students will face plenty of choices and challenges as they continue their programming career, without us having to invent artificial ones.

Students should absolutely learn multiple approaches (there’s usually advantages to each one). But not in this order, from hardest to easiest, and not because it happened to be the order we learned it ourselves.

Bandwidth and trust

There are two reasons I recommend against teaching a harder approach first. One is educational bandwidth, and one is trust.

One of the most common mistakes teachers make (especially inexperienced ones) is to think they can teach more material than they can.

.@minebocek talks philosophy of Data Carpentry.

Highlight: don't take on too much material. Rushing discourages student questions #UseR2017 pic.twitter.com/txAcSd3ND3

— David Robinson (@drob) July 6, 2017

This comes down to what I sometimes call educational bandwidth: the total amount of information you can communicate to students is limited, especially since you need to spend time reinforcing and revisiting each concept. It’s not just about the amount of time you have in the lesson, either. Learning new ideas is hard work: think of the headache you can get at the end of a one-day workshop. This means you should make sure every idea you get across is valuable. If you teach them a method they’ll never have to use, you’re wasting time and energy.

The other reason is trust. Think of the imaginary host who gave poor directions before giving better ones. Wouldn’t it be unpleasant to be tricked like that? When a student goes through the hard work of learning a new method, telling them “just kidding, you didn’t need to learn that” is obnoxious, and hurts their trust in you as a teacher.

In some cases there’s a tradeoff between bandwidth and trust. For instance, as I’ve described before, I teach dplyr and the %>% operator before explaining what a function is, or even how to do a variable assignment. This conserves bandwidth (it gets students doing powerful things quickly) but it’s a minor violation of their trust (I’m hiding details of how R actually works). But there’s no tradeoff in teaching a hard method before an easier one.

Exceptions

Is there any situation where you might want to show students the hard way first? Yes: one exception is if the hard solution is something the student would have been tempted to do themselves.

One good example is in R for Data Science, by Hadley Wickham and Garrett Grolemund. Chapter 19.2 describes “When should you write a function”, and gives an example of rescaling several columns by copying and pasting code:

df$a <- (df$a - min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE)) df$b <- (df$b - min(df$b, na.rm = TRUE)) / (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE)) df$c <- (df$c - min(df$c, na.rm = TRUE)) / (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE)) df$d <- (df$d - min(df$d, na.rm = TRUE)) / (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))

By showing them what this approach looks like (and even including an intentional typo in the second line, to show how copying and pasting code is prone to error), the book guides them towards the several steps involved in writing a function.

rescale01 <- function(x) { rng <- range(x, na.rm = TRUE) (x - rng[1]) / (rng[2] - rng[1]) }

This educational approach makes sense because copying and pasting code is a habit beginners would fall into naturally, especially if they’ve never programmed before. It doesn’t take up any educational bandwidth because students already know how to do it, and it’s upfront about it’s approach (when I teach this way, I usually use the words “you might be tempted to…”).

However, teaching a loop (or split()/lapply(), or aggregate, or tapply) isn’t something beginners would do accidentally, and it’s therefore not something you need to be talking.

In conclusion: teaching programming is hard, don’t make it harder. Next time you’re teaching a course, or workshop, or writing a tutorial, or just helping a colleague getting set up in R, try teaching them your preferred method first, instead of meandering through subpar solutions. I think you’ll find that it’s worth it.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Variance Explained. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### ggformula: another option for teaching graphics in R to beginners

Thu, 09/21/2017 - 15:03

(This article was first published on SAS and R, and kindly contributed to R-bloggers)

A previous entry (http://sas-and-r.blogspot.com/2017/07/options-for-teaching-r-to-beginners.html) describes an approach to teaching graphics in R that also “get[s] students doing powerful things quickly”, as David Robinson suggested

In this guest blog entry, Randall Pruim offers an alternative way based on a different formula interface. Here’s Randall:

For a number of years I and several of my colleagues have been teaching R to beginners using an approach that includes a combination of

• the lattice package for graphics,
• several functions from the stats package for modeling (e.g., lm(), t.test()), and
• the mosaic package for numerical summaries and for smoothing over edge cases and inconsistencies in the other two components.

Important in this approach is the syntactic similarity that the following “formula template” brings to all of these operations.

goal ( y ~ x , data = mydata, … )

Many data analysis operations can be executed by filling in four pieces of information (goal, y, x, and mydata) with the appropriate information for the desired task. This allows students to become fluent quickly with a powerful, coherent toolkit for data analysis.

As the earlier post noted, the use of lattice has some drawbacks. While basic graphs like histograms, boxplots, scatterplots, and quantile-quantile plots are simple to make with lattice, it is challenging to combine these simple plots into more complex plots or to plot data from multiple data sources. Splitting data into subgroups and either overlaying with multiple colors or separating into sub-plots (facets) is easy, but the labeling of such plots is not as convenient (and takes more space) than the equivalent plots made with ggplot2. And in our experience, students generally find the look of ggplot2 graphics more appealing. On the other hand, introducing ggplot2 into a first course is challenging. The syntax tends to be more verbose, so it takes up more of the limited space on projected images and course handouts. More importantly, the syntax is entirely unrelated to the syntax used for other aspects of the course. For those adopting a “Less Volume, More Creativity” approach, ggplot2 is tough to justify. ggformula: The third-and-a half way Danny Kaplan and I recently introduced ggformula, an R package that provides a formula interface to ggplot2 graphics. Our hope is that this provides the best aspects of lattice (the formula interface and lighter syntax) and ggplot2 (modularity, layering, and better visual aesthetics). For simple plots, the only thing that changes is the name of the plotting function. Each of these functions begins with gf. Here are two examples, either of which could replace the side-by-side boxplots made with lattice in the previous post. We can even overlay these two types of plots to see how they compare. To do so, we simply place what I call the “then” operator (%>%, also commonly called a pipe) between the two layers and adjust the transparency so we can see both where they overlap.
Comparing groups Groups can be compared either by overlaying multiple groups distinguishable by some attribute (e.g., color) or by creating multiple plots arranged in a grid rather than overlaying subgroups in the same space. The ggformula package provides two ways to create these facets. The first uses | very much like lattice does. Notice that the gf_lm() layer inherits information from the the gf_points() layer in these plots, saving some typing when the information is the same in multiple layers. The second way adds facets with gf_facet_wrap() or gf_facet_grid() and can be more convenient for complex plots or when customization of facets is desired. Fitting into the tidyverse work flow ggformala also fits into a tidyverse-style workflow (arguably better than ggplot2 itself does). Data can be piped into the initial call to a ggformula function and there is no need to switch between %>% and + when moving from data transformations to plot operations. Summary The “Less Volume, More Creativity” approach is based on a common formula template that has served well for several years, but the arrival of ggformula strengthens this approach by bringing a richer graphical system into reach for beginners without introducing new syntactical structures. The full range of ggplot2 features and customizations remains available, and the  ggformula  package vignettes and tutorials describe these in more detail.

— Randall Pruim

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: SAS and R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Comparing Trump and Clinton’s Facebook pages during the US presidential election, 2016

Thu, 09/21/2017 - 14:05

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

R has a lot of packages for users to analyse posts on social media. As an experiment in this field, I decided to start with the biggest one: Facebook.

I decided to look at the Facebook activity of Donald Trump and Hillary Clinton during the 2016 presidential election in the United States.

The winner may be more famous for his Twitter account than his Facebook one, but he still used it to great effect to help pick off his Republican rivals in the primaries and to attack Hillary Clinton in the general election.

For this work we’re going to be using the Rfacebook package developed by Pablo Barbera, plus his excellent how-to guide.

The next thing to do is to use the getPage() function to retrieve all the posts from each candidate.

I’m going to start the clock on January 1, 2016 and end it the day after the election on November 9, 2016 (which means it will stop on election day, the day before)

trump <- getPage("donaldtrump", token, n = 5000, since='2016/01/01', until='2016/11/09') clinton <- getPage("hillaryclinton", token, n = 5000, since='2016/01/01', until='2016/11/09') Caveat: The data doesn’t seem to contain all Trump and Clinton’s Facebook posts

I ran the commands several times and got 545 posts for Trump and 692 posts for Clinton. However, I think I may have got more results the first time I ran the commands. I also searched their pages via Facebook and came up with some posts that don’t appear in the R datasets. If you have a solution for this, please let me know!

In the meantime, we will work with what we have

We want to calculate the average number of likes, comments and shares for each month for both candidates. Again, we will be using Pablo’s code for a while here.

First up, we will format the date:

format.facebook.date <- function(datestring) { date <- as.POSIXct(datestring, format = "%Y-%m-%dT%H:%M:%S+0000", tz = "GMT") }

Then we will use his formula to calculate the average likes, comments and shares (metrics) per month:

aggregate.metric <- function(metric) { m <- aggregate(page[[paste0(metric, "_count")]], list(month = page$month), mean) m$month <- as.Date(paste0(m$month, "-15")) m$metric <- metric return(m) }

Now we run this data for both candidates:

#trump page <- trump page$datetime <- format.facebook.date(page$created_time) page$month <- format(page$datetime, "%Y-%m") df.list <- lapply(c("likes", "comments", "shares"), aggregate.metric) trump_months head(trump_months) month x metric 1 2016-01-15 17199.93 likes 2 2016-02-15 15239.63 likes 3 2016-03-15 22616.28 likes 4 2016-04-15 19364.17 likes 5 2016-05-15 14598.30 likes 6 2016-06-15 32760.68 likes #clinton page <- clinton page$datetime <- format.facebook.date(page$created_time) page$month <- format(page$datetime, "%Y-%m") df.list <- lapply(c("likes", "comments", "shares"), aggregate.metric) clinton_months <- do.call(rbind, df.list)

Before we combine them together, let’s label them so we know who’s who:

trump_months$candidate <- "Donald Trump" clinton_months$candidate <- "Hillary Clinton" both <- rbind(trump_months, clinton_months)

Now we have the data, we can visualise it. This is a neat opportunity to have a go at faceting using ggplot2.

Faceting is when you display two or more plots side-by-side for easy at-a-glance comparison.

library(ggplot2) library(scales) p <- ggplot(both, aes(x = month, y = x, group = metric)) + geom_line(aes(color = metric)) + scale_x_date(date_breaks = "months", labels = date_format("%m")) + ggtitle("Facebook engagement during the 2016 election") + labs(y = "Count", x = "Month (2016)", aesthetic='Metric') + theme(text=element_text(family="Browallia New", color = "#2f2f2d")) + scale_colour_discrete(name = "Metric") #add in a facet p <- p + facet_grid(. ~ candidate) p Analysis

Clearly Trump’s Facebook engagement got far better results than Clinton’s. Even during his ‘off months’ he received more likes per page on average than Clinton managed at the height of the general election.

Trump’s comments per page also skyrocketed during October and November as the election neared.

Hillary Clinton enjoyed a spike in engagement around June. It was a good month for her: she was confirmed as the Democratic nominee and received the endorsement of President Obama.

Themes

Trump is famous for using nicknames for his political opponents. We had low-energy Jeb, Little Marco, Lyin’ Ted and then Crooked Hillary.

The first usage of Crooked Hillary in the data came on April 26. A look through his Twitter feed shows he seems to have decided on Crooked Hillary around this time as well.

#DrainTheSwamp was one of his later hashtags, making the first appearance in the data on October 20, just a few weeks shy of the election on November 8.

Clinton meanwhile mentioned her rival’s surname in about a quarter of her posts. Her most popular ones were almost all on the eve of the election exhorting her followers to vote.

Of her earlier ones, only her New Year message and one from March appealing to Trump’s temperament resonated as strongly.

Conclusion

It’s frustrating that the API doesn’t seem to retrieve all the data.
Nonetheless, it’s a decent sample size and shows that the Trump campaign was far more effective than the Clinton one on Facebook during the 2016 election.

Related Post

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Visualizing the Spanish Contribution to The Metropolitan Museum of Art

Thu, 09/21/2017 - 08:30

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Well I walk upon the river like it’s easier than land
(Love is All, The Tallest Man on Earth)

The Metropolitan Museum of Art provides here a dataset with information on more than 450.000 artworks in its collection. You can do anything you want with these data: there are no restrictions of use. Each record contains information about the author, title, type of work, dimensions, date, culture and  geography of a particular piece.

I can imagine a bunch of things to do with these data but since I am a big fan of highcharter,  I have done a treemap, which is an artistic (as well as efficient) way to visualize hierarchical data. A treemap is useful to visualize frequencies. They can handle levels, allowing to navigate to go into detail about any category. Here you can find a good example of treemap.

To read data I use fread function from data.table package. I also use this package to do some data wrangling operations on the data set. After them, I filter it looking for the word SPANISH in the columns Artist Nationality and Culture and looking for the word SPAIN in the column Country. For me, any piece created by an Spanish artist (like this one), coming from Spanish culture (like this one) or from Spain (like this one) is Spanish (this is my very own definition and may do not match with any academical one). Once it is done, it is easy to extract some interesting figures:

• There are 5.294 Spanish pieces in The Met, which means a 1,16% of the collection
• This percentage varies significantly between departments: it raises to 9,01% in The Cloisters and to 4,83% in The Robert Lehman Collection; on the other hand, it falls to 0.52% in The Libraries and to 0,24% in Photographs.
• The Met is home to 1.895 highlights and 44 of them (2,32%) are Spanish; It means that Spanish art is twice as important as could be expected (remember that represents a 1,16% of the entire collection)

My treemap represents the distribution of Spanish artworks by department (column Department) and type of work (column Classification). There are two important things to know before doing a treemap with highcharter:

• You have to use treemap function from treemap package to create a list with your data frame that will serve as input for hctreemap function
• hctreemap fails if some category name is the same as any of its subcategories. To avoid this, make sure that all names are distinct.

This is the treemap:

Here you can see a full size version of it.

There can be seen several things at a glance: most of the pieces are drawings and prints and european sculpture and decorative arts (in concrete, prints and textiles), there is also big number of costumes, arms and armor is a very fragmented department … I think treemap is a good way to see what kind of works owns The Met.

Mi favorite spanish piece in The Met is the stunning Portrait of Juan de Pareja by Velázquez, which illustrates this post: how nice would be to see it next to El Primo in El Museo del Prado!

Feel free to use my code to do your own experiments:

library(data.table) library(dplyr) library(stringr) library(highcharter) library(treemap) file="MetObjects.csv" # Download data if (!file.exists(file)) download.file(paste0("https://media.githubusercontent.com/media/metmuseum/openaccess/master/", file), destfile=file, mode='wb') # Read data data=fread(file, sep=",", encoding="UTF-8") # Modify column names to remove blanks colnames(data)=gsub(" ", ".", colnames(data)) # Clean columns to prepare for searching data[,:=(Artist.Nationality_aux=toupper(Artist.Nationality) %>% str_replace_all("\$\\d+\$", "") %>% iconv(from='UTF-8', to='ASCII//TRANSLIT'), Culture_aux=toupper(Culture) %>% str_replace_all("\$\\d+\$", "") %>% iconv(from='UTF-8', to='ASCII//TRANSLIT'), Country_aux=toupper(Country) %>% str_replace_all("\$\\d+\$", "") %>% iconv(from='UTF-8', to='ASCII//TRANSLIT'))] # Look for Spanish artworks data[Artist.Nationality_aux %like% "SPANISH" | Culture_aux %like% "SPANISH" | Country_aux %like% "SPAIN"] -> data_spain # Count artworks by Department and Classification data_spain %>% mutate(Classification=ifelse(Classification=='', "miscellaneous", Classification)) %>% mutate(Department=tolower(Department), Classification1=str_match(Classification, "(\\w+)(-|,|\\|)")[,2], Classification=ifelse(!is.na(Classification1), tolower(Classification1), tolower(Classification))) %>% group_by(Department, Classification) %>% summarize(Objects=n()) %>% ungroup %>% mutate(Classification=ifelse(Department==Classification, paste0(Classification, "#"), Classification)) %>% as.data.frame() -> dfspain # Do treemap without drawing tm_dfspain <- treemap(dfspain, index = c("Department", "Classification"), draw=F, vSize = "Objects", vColor = "Objects", type = "index") # Do highcharter treemap hctreemap( tm_dfspain, allowDrillToNode = TRUE, allowPointSelect = T, levelIsConstant = F, levels = list( list( level = 1, dataLabels = list (enabled = T, color = '#f7f5ed', style = list("fontSize" = "1em")), borderWidth = 1 ), list( level = 2, dataLabels = list (enabled = F, align = 'right', verticalAlign = 'top', style = list("textShadow" = F, "fontWeight" = 'light', "fontSize" = "1em")), borderWidth = 0.7 ) )) %>% hc_title(text = "Spanish Artworks in The Met") %>% hc_subtitle(text = "Distribution by Department") -> plot plot var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Pandigital Products: Euler Problem 32

Thu, 09/21/2017 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Euler Problem 32 returns to pandigital numbers, which are numbers that contain one of each digit. Like so many of the Euler Problems, these numbers serve no practical purpose whatsoever, other than some entertainment value. You can find all pandigital numbers in base-10 in the Online Encyclopedia of Interegers (A050278). The Numberhile video explains everything you ever wanted to

The Numberhile video explains everything you ever wanted to know about pandigital numbers but were afraid to ask.

Euler Problem 32 Definition

We shall say that an n-digit number is pandigital if it makes use of all the digits 1 to n exactly once; for example, the 5-digit number, 15234, is 1 through 5 pandigital.

The product 7254 is unusual, as the identity, 39 × 186 = 7254, containing multiplicand, multiplier, and product is 1 through 9 pandigital.

Find the sum of all products whose multiplicand/multiplier/product identity can be written as a 1 through 9 pandigital.

HINT: Some products can be obtained in more than one way so be sure to only include it once in your sum.

Proposed Solution

The pandigital.9 function tests whether a string classifies as a pandigital number. The pandigital.prod vector is used to store the multiplication.

The only way to solve this problem is brute force and try all multiplications but we can limit the solution space to a manageable number. The multiplication needs to have ten digits. For example, when the starting number has two digits, the second number should have three digits so that the total has four digits, e.g.: 39 × 186 = 7254. When the first number only has one digit, the second number needs to have four digits.

pandigital.9 <- function(x) # Test if string is 9-pandigital (length(x)==9 & sum(duplicated(x))==0 & sum(x==0)==0) t <- proc.time() pandigital.prod <- vector() i <- 1 for (m in 2:100) { if (m < 10) n_start <- 1234 else n_start <- 123 for (n in n_start:round(10000 / m)) { # List of digits digs <- as.numeric(unlist(strsplit(paste0(m, n, m * n), ""))) # is Pandigital? if (pandigital.9(digs)) { pandigital.prod[i] <- m * n i <- i + 1 print(paste(m, "*", n, "=", m * n)) } } } answer <- sum(unique(pandigital.prod)) print(answer)

Numbers can also be checked for pandigitality using mathematics instead of strings.

You can view the most recent version of this code on GitHub.

The post Pandigital Products: Euler Problem 32 appeared first on The Devil is in the Data.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Report from Mexico City

Thu, 09/21/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Editors Note: It has been heartbreaking watching the images from México City. Teresa Ortiz, co-organizer of R-Ladies CDMX reports on efforts of data scientists to help. Our thoughts are with them, and with the people of México.

It has been a hard couple of days around here.

In less than 2 weeks, México has gone through two devastating earthquakes and the damages keep adding. Nevertheless, the response from the citizens has been outstanding and Mexican data-driven initiatives have not stayed behind.

An example is codeandoMexico, an open innovation platform which brings together top talent through public challenges, codeandoMexico developed a Citizen’s quick response center with a directory of emergency phone numbers, providing the location of places that need help, maps, and more. Most of the information in the site is updated from first-hand experience of volunteers that are helping throughout the affected areas. Technical collaboration with codeandoMexico is open to everyone by attending to issues on their website these include extending the services offered by the site or attending to bugs.

Social networks are being extensively used to spread information about where and how to help, however often times tweets and facebook posts are fake or obsolete. There are several efforts taking place to overcome this difficulty; R-Ladies CDMX’s cofounder Silvia Gutierrez is working in one such project to map information on shelters, affected buildings and more (http://bit.ly/2xfIwZm).

The full scope of the damage is yet to be seen, but the support, both from México and other countries, is encouraging.

_____='https://rviews.rstudio.com/2017/09/21/report-from-mexico-city/';

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Monte Carlo Simulations & the "SimDesign" Package in R

Wed, 09/20/2017 - 21:43

(This article was first published on Econometrics Beat: Dave Giles' Blog, and kindly contributed to R-bloggers)

Past posts on this blog have included several relating to Monte Carlo simulation – e.g., see here, here, and here.

Recently I came across a great article by Matthew Sigal and Philip Chalmers in the Journal of Statistics Education. It’s titled, “Play it Again: Teaching Statistics With Monte Carlo Simulation”, and the full reference appears below. The authors provide a really nice introduction to basic Monte Carlo simulation, using R. In particular, they contrast using a “for loop” approach, with using the “SimDesign” R package (Chalmers, 2017).  Here’s the abstract of their paper:

“Monte Carlo simulations (MCSs) provide important information about statistical phenomena that would be impossible to assess otherwise. This article introduces MCS methods and their applications to research and statistical pedagogy using a novel software package for the R Project for Statistical Computing constructed to lessen the often steep learning curve when organizing simulation code. A primary goal of this article is to demonstrate how well-suited MCS designs are to classroom demonstrations, and how they provide a hands-on method for students to become acquainted with complex statistical concepts. In this article, essential programming aspects for writing MCS code in R are overviewed, multiple applied examples with relevant code are provided, and the benefits of using a generate–analyze–summarize coding structure over the typical “for-loop” strategy are discussed.”

The SimDesign package provides an efficient, and safe template for setting pretty much any Monte Carlo experiment that you’re likely to want to conduct. It’s really impressive, and I’m looking forward to experimenting with it. The Sigal-Chalmers paper includes helpful examples, with the associated R code and output. It would be superfluous for me to add that here. Needless to say, the SimDesign package is just as useful for simulations in econometrics as it is for those dealing with straight statistics problems. Try it out for yourself! References
Chalmers, R. P., 2017. SimDesign: Structure for Organizing Monte Carlo Simulation Designs, R package version 1.7. M. J. Sigal and R. P. Chalmers, 2016. Play it again: Teaching statistics with Monte Carlo simulation. Journal of Statistics Education, 24, 136-156.

© 2017, David E. Giles var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Econometrics Beat: Dave Giles' Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Answer probability questions with simulation (part-2)

Wed, 09/20/2017 - 18:22

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

This is the second exercise set on answering probability questions with simulation. Finishing the first exercise set is not a prerequisite. The difficulty level is about the same – thus if you are looking for a challenge aim at writing up faster more elegant algorithms.

As always, it pays off to read the instructions carefully and think about what the solution should be before starting to code. Often this helps you weed out irrelevant information that can otherwise make your algorithm unnecessarily complicated and slow.

Exercise 1
If you take cards numbered from 1-10 and shuffle them, and lay them down in order, what is the probability that at least one card matches its position. For example card 3 comes down third?

Exercise 2
Consider an election 3 candidates and 35 voters and who casts his ballot randomly for one of the candidates. What is the probability of a tie for the first position?

Exercise 3
If you were to randomly find a playing card on the floor every day, how many days would it take on average to find a full standard deck?

Exercise 4
Throw two dice. What is the probability the difference between them is 3, 4, or 5 instead of 0, 1, or 2?

Exercise 5
What is the expected number of distinct birthdays in a group of 400 people? Assume 365 days and that all are equally likely.

Exercise 6
What is the probability that a five-card hand in standard deck of cards has exactly three aces?

Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to

• work with different binomial and logistic regression techniques,
• know how to compare regression models and choose the right fit,
• and much more.

Exercise 7
Randomly select three distinct integers a, b, c from the set {1, 2, 3, 4, 5, 6, 7}.
What is the probability that a + b > c?

Exercise 8
Given that a throw of three dice show three different faces what is the probability if the sum of all the dice is 8.

Exercise 9
Throwing a standard die until you get 6. What is the expected number of throws (including the throw giving 6) conditioned on the event that all throws gave even numbers.

Exercise 10
Choose two-digit integer at random, what is the probability that it is divisible by each of its digits. This is not exactly a simulation proplem – but the concept is similar make R do the hard work.

(Picture by Gil)

Related exercise sets: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### EARL London 2017 – That’s a wrap!

Wed, 09/20/2017 - 18:04

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Beth Ashlee, Data Scientist

After a successful first-ever EARL San Francisco in June, it was time to head back to the birth place EARL – London. With more abstracts submitted than ever before, the conference was made up of 54 fantastic talks and 5 key notes from an impressive selection of industries. With so many talks to pick from we thought we would summarise a few of my favourites!

Day 1 highlights:

After brilliant keynotes from Tom Smith (ONS) and Rstudio’s Jenny Bryan in session 1, Derek Norton and Neera Talbert from Microsoft took us through the Microsoft process of moving a company from SAS to R in session 2. They explained that with the aim of shrinking the ‘SAS footprint’, it’s important to think about the drivers behind a company leaving SAS as well as considering the impact to end users. Their approach focused on converting program logic rather than specific code.

After lunch, Luisa Pires discussed the digital change occurring within the Bank of England. She highlighted the key selling points behind choosing R as a platform and the process behind organizing their journey. They first ran divisional R training, before progressing through to produce a data science training programme to enable the use of R as a project tool.

Finishing up the day, Jessica Peterka-Bonetta gave a fascinating talk on sentiment analysis when considering the use of emojis. She demonstrated that even though adding emojis into your sentiment analysis can add complexity to your process, in the right context they can add real value to tracking the sentiment of a trend. It was an engaging talk which prompted some interesting audience questions such as – “What about combinations of emojis; how would they effect sentiment?”.

After a Pimms reception it was all aboard the Symphony Cruiser for a tour of the River Thames. On board we enjoyed food, drinks and live music (which resulted in some impromptu dancing, but what happens at a Conference, stays at the Conference!).

Day 2 highlights:

Despite a ‘lively’ evening, the EARL kicked off in full swing on Thursday morning. There were three fantastic keynotes, including Hilary Parker – her talk filled with analogies and movie references to describe her reproducible work flow methods – definitely something I could relate to!

The morning session included a talk from Mike Smith from Pfizer. Mike showed us his use of Shiny as a tool for determining wait times when submitting large jobs to Pfizer’s high performance compute grid. Mike used real time results to visualise whether it was beneficial to submit a large job at the current time or wait until later. He outlined some of the frustrations of changing data sources in a such a large company and his reluctance to admit he was ‘data science-ing’.

After lunch, my colleague Adnan Fiaz gave an interesting talk on the data pipeline, comparing it to the process involved in oil pipelines. He spoke of the methods and actions that must be taken before being able to process your data and generate results. The comparison showed surprising similarities between the two processes, and clarified the method that we must take at Mango to ensure we can safely, efficiently and aptly analyse data.

The final session of day 2 finished on a high, with a packed room gathering to hear from RStudio’s Joe Cheng. Joe highlighted the exciting new asynchronous programming feature that will be available in the next release of Shiny. This new feature is set to revolutionise the responsiveness of Shiny applications, allowing users to overcome the restrictions of R’s single threaded natured.

I get so much out of EARL each year and this year was no different; I just wish I had a timeturner to get to all the presentations!

On behalf of the rest of the Mango team, a massive thank you to all our speakers and attendees, we hope you enjoyed EARL as much as we did!

All slides we have permission to publish are available under the Speakers section on the EARL conference website.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Preview: ALTREP promises to bring major performance improvements to R

Wed, 09/20/2017 - 16:55

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Changes are coming to the internals of the R engine which promise to improve performance and reduce memory use, with dramatic impacts in some circumstances. The changes were first proposed by Gabe Becker at the DSC Conference in 2016 (and updated in 2017), and the implementation by Luke Tierney and Gabe Becker is now making its way into the development branches of the R sources.

ALTREP is an alternative way of representing R objects internally. (The details are described in this draft vignette, and there are a few more examples in this presentation.) It's easiest to illustrate by example.

Today, a vector of numbers in R is represented as a contiguous block of memory in RAM. In other words, if you create the sequence of a million integers 1:1000000, R creates a block of memory 4Mb in size (4 bytes per number) to store each number in the sequence. But what if we could use a more efficient representation just for sequences like this? WIth ALTREP, a sequence like this is instead represented by just its start and end values, which takes up almost no memory at all. That means, in the future you'll be able to write loops like this:

for (i in 1:1e10) do_something()

without getting errors like this:

Error: cannot allocate vector of size 74.5 Gb

ALTREP has the potential to make many other operations faster or even instantaneous, even on very large objects. Here are a few examples of functions that could be sped up:

• is.na(x) — ALTREP will keep track of whether a vector includes NAs or not, so that R no longer has to inspect the entire vector
• sort(x) — ALTREP will keep track of whether a vector is already sorted, and sorting will be instantaneous in this case
• x < 5 — knowing that the vector is already sorted, ALTREP can find the breakpoint very quickly (in O(log n) time), and return a "sequence" logical vector that consumes basically no memory
• match(y,x) — if ALTREP knows that x is already sorted, matching is also much faster
• as.character(numeric_vector) — ALTREP can defer converting a numeric vector to characters until the character representation is actually needed

That last benefit will likely have a large impact on the handling of data frames, which carry around a column of character row labels which start out as a numeric sequence. Development builds are already demonstrating a huge performance gain in the linear regression function lm() as a result:

> n <- 10000000 > x <- rnorm(n) > y <- rnorm(n) # With R 3.4 > system.time(lm(y ~ x)) user system elapsed 9.225 0.703 9.929 # With ALTREP > system.time(lm(y ~ x)) user system elapsed 1.886 0.610 2.496

The ALTREP framework is designed to be extensible, so that package authors can define their own alternative representations of standard R objects. For example, an R vector could be represented as a distributed object as in a system like Spark, while still behaving like an ordinary R vector to the R engine. An example package, simplemmap, illustrates this concept by defining an implementation of vectors as memory-mapped objects that live on disk instead of in RAM.

There's no definite date yet when ALTREP will be in an official R release, and my guess is that there's likely to be an extensive testing period to shake out any bugs caused by changing the internal representation of R objects. But the fact that the implementation is already making its way into the R sources is hugely promising, and I look forward to testing out the real-world impacts. You can read more about the current state of ALTREP in the draft vignette by Luke Tierney, Gabe Becker and Tomas Kalibera linked below.

R-Project.org: ALTREP: Alternative Representations for R Objects

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### pinp 0.0.2: Onwards

Wed, 09/20/2017 - 15:00

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A first update 0.0.2 of the pinp package arrived on CRAN just a few days after the initial release.

We added a new vignette for the package (see below), extended a few nice features, and smoothed a few corners.

The NEWS entry for this release follows.

Changes in tint version 0.0.2 (2017-09-20)
• The YAML segment can be used to select font size, one-or-two column mode, one-or-two side mode, linenumbering and watermarks (#21 and #26 addressing #25)

• If pinp.cls or jss.bst are not present, they are copied in ((#27 addressing #23)

• Output is now in shaded framed boxen too (#29 addressing #28)

• Endmatter material is placed in template.tex (#31 addressing #30)

• Expanded documentation of YAML options in skeleton.Rmd and clarified available one-column option (#32).

• Section numbering can now be turned on and off (#34)

• The default bibliography style was changed to jss.bst.

• A short explanatory vignette was added.

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page. For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Major update of D3partitionR: Interactive viz’ of nested data with R and D3.js

Wed, 09/20/2017 - 10:00

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

D3partitionR is an R package to visualize interactively nested and hierarchical data using D3.js and HTML widget. These last few weeks I’ve been working on a major D3partitionR update which is now available on GitHub. As soon as enough feedbacks are collected, the package will be on uploaded on the CRAN. Until then, you can install it using devtools

library(devtools) install_github("AntoineGuillot2/D3partitionR")

Here is a quick overview of the possibilities using the Titanic data:

A major update

This update is a major update from the previous version which will break code from 0.3.1

New functionalities
• Additional data for nodes: Additional data can be added for some given nodes. For instance, if a comment or a link needs to be shown in the tooltip or label of some nodes, they can be added through the add_nodes_data function

• Variable selection and computation, now, you can provide a variable for:
• sizing (i.e. the size of each node)
• color, any variable from your data.frame or from the nodes data can be used as a color.
• label, any variable from your data.frame or from the nodes data can be used as a label.
• tooltip, you can provide several variables to be displayed in the tooltip.
• aggregation function, when numerical variables are provided, you can choose the aggregation function you want.
• Coloring: The color scale can now be continuous. For instance, you can use the mean survival rate to the Titanic accident in each node, this make it easy to visualise quickly women in 1st class are more likely to survive than men in 3rd class.

Treemap to show the survival rate to the Titanic accident
• Label: Labels providing the showing the node’s names (or any other variable) can now be added to the plot.
• Breadcrumb: To avoid overlapping, the width of each breadcrumb is now variable and dependant on the length of the word

• Legend: By default, the legend now shows all the modalities/levels that are in the plot. To avoid wrapping, enabling the zoom_subset option will only shows the modalities in the direct children of the zoomed root.
API and backend change
• Easy data preprocessing, The data preparation was tedious in the previous versions. Now, you just need to aggregate your data.frame at the right level, the data.frame can directly be used in the D3partitionR functions to avoid to deal with nesting a data.frame which can be pretty complicated.
• The R API is greatly improved, D3partitionR are now S3 objects with a clearly named list of function to add data and to modify the chart appearance and parameters. Using pipes now makes D3partitionR syntax looks gg-like
##Treemap D3partitionR()%>% add_data(data_plot,count = 'N',steps=c('Sex','Embarked','Pclass','Survived'))%>% set_chart_type('treemap')%>% plot() ##Circle treemap D3partitionR()%>% add_data(data_plot,count = 'N',steps=c('Sex','Embarked','Pclass','Survived'))%>% set_chart_type('circle_treemap')%>% plot()

Style consistency among the different type of chart. Now, it’s easy to switch from a treemap to a circle treemap or a sunburst and keep consistent styling policy.

Update to d3.js V4 and modularization. Each type of charts now has its own file and function. This function draws the chart at its root level with labels and colors, it returns a zoom function. The on-click actions (such as the breadcrumb update or the legend update) and the hover action (tooltips) are defined in a ‘global’ function.

Hence, adding new visualizations will be easy, the drawing and zooming script will just need to be adapted to this previous template.

What’s next

Thanks to the several feedbacks that will be collected during next week, a stable release version should soon be on the CRAN. I will also post more ressources on D3partitionR with use cases and example of Shiny Applications build on it.

The post Major update of D3partitionR: Interactive viz’ of nested data with R and D3.js appeared first on Enhance Data Science.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Enhance Data Science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Regression Analysis — What You Should’ve Been Taught But Weren’t, and Were Taught But Shouldn’t Have Been

Wed, 09/20/2017 - 09:55

The above title was the title of my talk this evening at our Bay Area R Users Group. I had been asked to talk about my new book, and I presented four of the myths that are dispelled in the book.

Hadley also gave an interesting talk, “An introduction to tidy evaluation,” involving some library functions that are aimed at writing clearer, more readable R. The talk came complete with audience participation, very engaging and informative.

The venue was GRAIL, a highly-impressive startup. We will be hearing a lot more about this company, I am sure.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### 12 Visualizations to Show a Single Number

Wed, 09/20/2017 - 08:53

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

Infographics, dashboards, and reports often need to highlight or visualize a single number. But how do you highlight a single number so that it has an impact and looks good? It can be a big challenge to make a lonely, single number look great. In this post, I show 12 different ways of representing a single number. Most of these visualizations have been created automatically using R.

When to create a visualization to represent a single number

There are a number of situations in which it can be advantageous to create a visualization to represent a single number:

• To communicate with less numerate viewers/readers;
• Infographics and dashboards commonly use one important number;
• To attract the attention of distracted or busy viewers/readers;
• To add some humanity or “color”, to create an emotional connection;
• Or to increase the redundancy of the presentation (see Improve the Quality of Data Visualizations Using Redundancy).

Option 1: Standard text formatting: font, size, style

Sometimes the plain text is the best option. Make fonts big and simple so they stand out.

669 people died

Option 2: Informative formatting

Colors, bolding, emphasis, and other formatting options can be used to draw attention to figures and to communicate additional information. For example, the red font could draw attention to low results. Example: Sales declined by 23%.

You can do this in a sentence or in a visualization, such as in the bar chart below, where color is used to encode statistical testing results.

And you could also use informative formatting via a colored background for numbers, as in the visual confection below. In this instance, traffic-light coloring indicates the relative levels of performance of different departments in a supermarket.

Option 3: Pie charts

Proportions are often illustrated using pie charts with varying degrees of success. They can be particularly elegant for displaying single numbers that are proportions.

Option 4: Donut charts

Similarly with donut charts. It’s just a matter of taste.

Option 5: Portion of an image

The two-category pie chart and donut chart are special cases of a more general strategy, which is to show a portion of an image.

Option 6:  Overlaid images

A twist on showing a portion of an image is to proportionally color an image.

A common, but misleading, criticism of overlaid image visualizations and all the pictograph type of visualizations is that they are imprecise at best, and innumerate at worst. The three visualizations above have all been created to illustrate this point. The one on the left is not too bad. The President Trump approval visualization can readily be criticized in that the actual area shaded in blue is less than 37%. This is due to the greater amount of whitespace over the shoulders. Particularly problematic is the age visualization. This implicitly compares a number, 37, against some unknown and silly benchmark implied by the top of the image.

While such criticisms are technically correct, they are misleading. Consider the “worst” of the visualizations, which shows the average age. The purpose of the visualization is simply to communicate to the viewer that the average age figure is in some way low. How low? This is communicated by the number at the base. If the actual correct number is shown, there is little prospect of the viewer being misled. However, showing a number without the visualization runs the risk that the viewer fails to notice the point at all. This leads to a much higher error.

Furthermore, there are many contexts in which precision is not even important. How often is half a glass of wine actually precisely half a glass?

Option 7:  Countable pictographs

While all pictographs have a bad reputation, in the case of the countable pictograph, it is quite undeserved. Countable pictographs achieve redundancy and thus likely improve the accuracy with which the underlying data is understood by the viewer.

Option 8: Uncountable pictographs

The goal of the uncountable pictograph is to suggest, in a graphical way, “big”. It is often most useful when next to a comparable countable pictograph.

Option 9: Dials, gauges, and thermometers

This data is from a study by the Pew Research Center. Go to the original report for a better visualization, which uses a thermometer.

Option 10: Physical representations

This photo, from The Guardian, shows 888,246 ceramic red poppies installed around the Tower of London. Each poppy represents a British or Colonial serviceman killed in World War I. It is the same basic idea as the uncountable pictograph but on a much grander scale.

Option 11: Artwork/graphic design

Option 12: Evocative collages

And finally the most common of all approaches in marketing reporting is to find an image or images that in some way represent, or evoke, something relevant to the number.

Software

You can only create the visualizations in option 2 easily in Displayr. The visualizations in options 3 through 8 were all created using the open-source GitHub R packages Displayr/rhtmlPictographs and Displayr/flipPictographs, which were created by my colleagues Kyle and Carmen.

Everyone can access the Displayr document used to create these visualizations here. To modify the visualizations with your own data, click on each visualization and either change the Inputs or modify the underlying R code (Properties > R CODE).

Acknowledgments

Google’s gauge inspired the gauge in this post. I haven’t been able to find any copyright information for the beautiful Coca-cola photo. So, if you know where it is from, please tell me!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### From Biology to Industry. A Blogger’s Journey to Data Science.

Wed, 09/20/2017 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

Today, I have given a webinar for the Applied Epidemiology Didactic of the University of Wisconsin – Madison titled “From Biology to Industry. A Blogger’s Journey to Data Science.”

I talked about how blogging about R and Data Science helped me become a Data Scientist. I also gave a short introduction to Machine Learning, Big Data and Neural Networks.

My slides can be found here: https://www.slideshare.net/ShirinGlander/from-biology-to-industry-a-bloggers-journey-to-data-science

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### A simstudy update provides an excuse to talk a little bit about latent class regression and the EM algorithm

Wed, 09/20/2017 - 02:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

I was just going to make a quick announcement to let folks know that I’ve updated the simstudy package to version 0.1.4 (now available on CRAN) to include functions that allow conversion of columns to factors, creation of dummy variables, and most importantly, specification of outcomes that are more flexibly conditional on previously defined variables. But, as I was coming up with an example that might illustrate the added conditional functionality, I found myself playing with package flexmix, which uses an Expectation-Maximization (EM) algorithm to estimate latent classes and fit regression models. So, in the end, this turned into a bit more than a brief service announcement.

Defining data conditionally

Of course, simstudy has always enabled conditional distributions based on sequentially defined variables. That is really the whole point of simstudy. But, what if I wanted to specify completely different families of distributions or very different regression curves based on different individual characteristics? With the previous version of simstudy, it was not really easy to do. Now, with the addition of two key functions, defCondition and addCondition the process is much improved. defCondition is analogous to the function defData, in that this new function provides an easy way to specify conditional definitions (as does defReadCond, which is analogous to defRead). addCondition is used to actually add the data column, just as addColumns adds columns.

It is probably easiest to see in action:

library(simstudy) # Define baseline data set def <- defData(varname="x", dist="normal", formula=0, variance=9) def <- defData(def, varname = "group", formula = "0.2;0.5;0.3", dist = "categorical") # Generate data set.seed(111) dt <- genData(1000, def) # Convert group to factor - new function dt <- genFactor(dt, "group", replace = TRUE) dt

defCondition is the same as defData, except that instead of specifying a variable name, we need to specify a condition that is based on a pre-defined field:

defC <- defCondition(condition = "fgroup == 1", formula = "5 + 2*x", variance = 4, dist = "normal") defC <- defCondition(defC, condition = "fgroup == 2", formula = 4, variance = 3, dist="normal") defC <- defCondition(defC, condition = "fgroup == 3", formula = "3 - 2*x", variance = 2, dist="normal") defC ## condition formula variance dist link ## 1: fgroup == 1 5 + 2*x 4 normal identity ## 2: fgroup == 2 4 3 normal identity ## 3: fgroup == 3 3 - 2*x 2 normal identity

A subsequent call to addCondition generates a data table with the new variable, in this case $$y$$:

dt <- addCondition(defC, dt, "y") dt ## id y x fgroup ## 1: 1 5.3036869 0.7056621 2 ## 2: 2 2.1521853 -0.9922076 2 ## 3: 3 4.7422359 -0.9348715 3 ## 4: 4 16.1814232 -6.9070370 3 ## 5: 5 4.3958893 -0.5126281 3 ## --- ## 996: 996 -0.8115245 -2.7092396 1 ## 997: 997 1.9946074 0.7126094 2 ## 998: 998 11.8384871 2.3895135 1 ## 999: 999 3.3569664 0.8123200 1 ## 1000: 1000 3.4662074 -0.4653198 3

In this example, I’ve partitioned the data into three subsets, each of which has a very different linear relationship between variables $$x$$ and $$y$$, and different variation. In this particular case, all relationships are linear with normally distributed noise, but this is absolutely not required.

Here is what the data look like:

library(ggplot2) mycolors <- c("#555bd4","#d4555b","#d4ce55") ggplot(data = dt, aes(x = x, y = y, group = fgroup)) + geom_point(aes(color = fgroup), size = 1, alpha = .4) + geom_smooth(aes(color = fgroup), se = FALSE, method = "lm") + scale_color_manual(name = "Cluster", values = mycolors) + scale_x_continuous(limits = c(-10,10), breaks = c(-10, -5, 0, 5, 10)) + theme(panel.grid = element_blank(), panel.background = element_rect(fill = "grey96", color = "grey80"))

Latent class regression models

Suppose we come across the same data set, but are not privy to the group classification, and we are still interested in the relationship between $$x$$ and $$y$$. This is what the data set would look like – not as user-friendly:

rawp <- ggplot(data = dt, aes(x = x, y = y, group = fgroup)) + geom_point(color = "grey75", size = .5) + scale_x_continuous(limits = c(-10,10), breaks = c(-10, -5, 0, 5, 10)) + theme(panel.grid = element_blank(), panel.background = element_rect(fill = "grey96", color = "grey80")) rawp

We might see from the plot, or we might have some subject-matter knowledge that suggests there are are several sub-clusters within the data, each of which appears to have a different relationship between $$x$$ and $$y$$. (Obviously, we know this is the case, since we generated the data.) The question is, how can we estimate the regression lines if we don’t know the class membership? That is where the EM algorithm comes into play.

The EM algorithm, very, very briefly

The EM algorithm handles model parameter estimation in the context of incomplete or missing data. In the example I’ve been discussing here, the subgroups or cluster membership are the missing data. There is an extensive literature on EM methods (starting with this article by Dempster, Laird & Rubin), and I am barely even touching the surface, let alone scratching it.

The missing data (cluster probabilities) are estimated in the Expectation- or E-step. The unknown model parameters (intercept, slope, and variance) for each of the clusters is estimated in the Maximization- or M-step, which in this case assumes the data come from a linear process with normally distributed noise – both the linear coefficients and variation around the line are conditional on cluster membership. The process is iterative. First, the E-step, which is based on some starting model parameters at first and then updated with the most recent parameter estimates from the prior M-step. Second, the M-step is based on estimates of the maximum likelihood of all the data (including the ‘missing’ data estimated in the prior E-step). We iterate back and forth until the parameter estimates in the M-step reach a steady state, or the overal likelihood estimate becomes stable.

The strength or usefulness of the EM method is that the likelihood of the full data (both observed data – $$x$$’s and $$y$$’s – and unobserved data – cluster probabilities) is much easier to write down and estimate than the likelihood of the observed data only ($$x$$’s and $$y$$’s). Think of the first plot above with the structure given by the colors compared to the second plot in grey without the structure. The first seems so much more manageable than the second – if only we knew the underlying structure defined by the clusters. The EM algorithm builds the underlying structure so that the maximum likelihood estimation problem becomes much easier.

Here is a little more detail on what the EM algorithm is estimating in our application. (See this for the much more detail.) First, we estimate the probability of membership in cluster $$j$$ for our linear regression model with three clusters:

$P_i(j|x_i, y_i, \pi_j, \alpha_{j0}, \alpha_{j1}, \sigma_j) = p_{ij}= \frac{\pi_jf(y_i|x_i, \alpha_{j0}, \alpha_{j1}, \sigma_j))}{\sum_{k=1}^3 \pi_k f(y_i|x_i, \alpha_{k0}, \alpha_{k1}, \sigma_k))},$ where $$\alpha_{j0}$$ and $$\alpha_{j1}$$ are the intercept and slope for cluster $$j$$, and $$\sigma_j$$ is the standard deviation for cluster $$j$$. $$\pi_j$$ is the probability of any individual being in cluster $$j$$, and is estimated by taking and average of the $$p_{ij}$$’s across all individuals. Finally, $$f(.|.)$$ is the density from the normal distribution $$N(\alpha_{j0} + \alpha_{j1}x, \sigma_j^2)$$.

Second, we maximize each of the three cluster-specific log-likelihoods, where each individual is weighted by its probability of cluster membership (which is $$P_i(j)$$, estimated in the E-step). In particular, we are maximizing the cluster-specific likelihood with respect to the three unknown parameters $$\alpha_{j0}$$, $$\alpha_{j1}$$, and $$\sigma_j$$:

$\sum_{n=1}^N \hat{p}_{nk} \text{log} (f(y_n|x_n,\alpha_{j0},\alpha_{j1},\sigma_j)$ In R, the flexmix package has implemented an EM algorithm to estimate latent class regression models. The package documentation provides a really nice, accessible description of the two-step procedure, with much more detail than I have provided here. I encourage you to check it out.

Iterating slowly through the EM algorithm

Here is a slow-motion version of the EM estimation process. I show the parameter estimates (visually) at the early stages of estimation, checking in after every three steps. In addition, I highlight two individuals and show the estimated probabilities of cluster membership. At the beginning, there is little differentiation between the regression lines for each cluster. However, by the 10th iteration the parameter estimates for the regression lines are looking pretty similar to the original plot.

library(flexmix) selectIDs <- c(508, 775) # select two individuals ps <- list() count <- 0 p.ij <- data.table() # keep track of estimated probs pi.j <- data.table() # keep track of average probs for (i in seq(1,10, by=3)) { count <- count + 1 set.seed(5) # fit model up to "i" iterations - either 1, 4, 7, or 10 exMax <- flexmix(y ~ x, data = dt, k = 3, control = list(iter.max = i) ) p.ij <- rbind(p.ij, data.table(i, selectIDs, posterior(exMax)[selectIDs,])) pi.j <- rbind(pi.j, data.table(i, t(apply(posterior(exMax), 2, mean)))) dp <- as.data.table(t(parameters(exMax))) setnames(dp, c("int","slope", "sigma")) # flexmix rearranges columns/clusters dp[, grp := c(3, 1, 2)] setkey(dp, grp) # create plot for each iteration ps[[count]] <- rawp + geom_abline(data = dp, aes(intercept = int, slope = slope, color=factor(grp)), size = 1) + geom_point(data = dt[id %in% selectIDs], color = "black") + scale_color_manual(values = mycolors) + ggtitle(paste("Iteration", i)) + theme(legend.position = "none", plot.title = element_text(size = 9)) } library(gridExtra) grid.arrange(ps[[1]], ps[[2]], ps[[3]], ps[[4]], nrow = 1)

For the two individuals, we can see the probabilities converging to a level of certainty/uncertainty. The individual with ID #775 lies right on the regression line for cluster 3, far from the other lines, and the algorithm quickly assigns a probability of 100% to cluster 3 (its actual cluster). The cluster assignment is less certain for ID #508, which lies between the two regression lines for clusters 1 and 2.

# actual cluster membership dt[id %in% selectIDs, .(id, fgroup)] ## id fgroup ## 1: 508 2 ## 2: 775 3 setkey(p.ij, selectIDs, i) p.ij[, .(selectIDs, i, C1 = round(V2, 2), C2 = round(V3,2), C3 = round(V1,2))] ## selectIDs i C1 C2 C3 ## 1: 508 1 0.32 0.36 0.32 ## 2: 508 4 0.29 0.44 0.27 ## 3: 508 7 0.25 0.65 0.10 ## 4: 508 10 0.24 0.76 0.00 ## 5: 775 1 0.35 0.28 0.37 ## 6: 775 4 0.33 0.14 0.53 ## 7: 775 7 0.11 0.01 0.88 ## 8: 775 10 0.00 0.00 1.00

In addition, we can see how the estimate of overall group membership (for all individuals) changes through the iterations. The algorithm starts by assigning equal probability to each cluster (1/3) and slowly moves towards the actual distribution used to generate the data (20%, 50%, and 30%).

pi.j[, .(i, C1 = round(V2, 2), C2 = round(V3,2), C3 = round(V1,2))] ## i C1 C2 C3 ## 1: 1 0.33 0.34 0.33 ## 2: 4 0.31 0.34 0.35 ## 3: 7 0.25 0.39 0.36 ## 4: 10 0.23 0.44 0.33 Final estimation of linear models

The final estimation is shown below, and we can see that the parameters have largely converged to the values used to generate the data.

# Estimation until convergence set.seed(5) ex1 <- flexmix(y ~ x, data = dt, k = 3) # paramter estimates data.table(parameters(ex1))[, .(param = c("int", "slope", "sd"), C1 = round(Comp.2, 2), C2 = round(Comp.3, 2), C3 = round(Comp.1, 2))] ## param C1 C2 C3 ## 1: int 5.18 3.94 3.00 ## 2: slope 1.97 -0.03 -1.99 ## 3: sd 2.07 1.83 1.55 # estimates of cluster probabilities round(apply(posterior(ex1), 2, mean), 2)[c(2,3,1)] ## [1] 0.19 0.51 0.30 # estimates of individual probabilities data.table(posterior(exMax)[selectIDs,])[,.(selectIDs, C1 = round(V2, 2), C2 = round(V3, 2), C3 = round(V1, 2))] ## selectIDs C1 C2 C3 ## 1: 508 0.24 0.76 0 ## 2: 775 0.00 0.00 1 How do we know the relationship is linear?

In reality, there is no reason to assume that the relationship between $$x$$ and $$y$$ is simply linear. We might want to look at other possibilities, such as a quadratic relationship. So, we use flexmix to estimate an expanded model, and then we plot the fitted lines on the original data:

ex2 <- flexmix(y ~ x + I(x^2), data = dt, k = 3) dp <- as.data.table(t(parameters(ex2))) setnames(dp, c("int","slope", "slope2", "sigma")) dp[, grp := c(1,2,3)] x <- c(seq(-10,10, by =.1)) dp1 <- data.table(grp = 1, x, dp[1, int + slope*x + slope2*(x^2)]) dp2 <- data.table(grp = 2, x, dp[2, int + slope*x + slope2*(x^2)]) dp3 <- data.table(grp = 3, x, dp[3, int + slope*x + slope2*(x^2)]) dp <- rbind(dp1, dp2, dp3) rawp + geom_line(data=dp, aes(x=x, y=V3, group = grp, color = factor(grp)), size = 1) + scale_color_manual(values = mycolors) + theme(legend.position = "none")

And even though the parameter estimates appear to be reasonable, we would want to compare the simple linear model with the quadratic model, which we can use with something like the BIC. We see that the linear model is a better fit (lower BIC value) – not surprising since this is how we generated the data.

summary(refit(ex2)) ## $Comp.1 ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.440736 0.309576 4.6539 3.257e-06 *** ## x -0.405118 0.048808 -8.3003 < 2.2e-16 *** ## I(x^2) -0.246075 0.012162 -20.2337 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ##$Comp.2 ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 6.955542 0.289914 23.9918 < 2.2e-16 *** ## x 0.305995 0.049584 6.1712 6.777e-10 *** ## I(x^2) 0.263160 0.014150 18.5983 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## \$Comp.3 ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 3.9061090 0.1489738 26.2201 < 2e-16 *** ## x -0.0681887 0.0277366 -2.4584 0.01395 * ## I(x^2) 0.0113305 0.0060884 1.8610 0.06274 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # Comparison of the two models BIC(ex1) ## [1] 5187.862 BIC(ex2) ## [1] 5316.034 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Visualizing dataset to apply machine learning-exercises

Fri, 09/08/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

INTRODUCTION

If you are a newbie in the world of machine learning, then this tutorial is exactly what you need in order to introduce yourself to this exciting new part of the data science world.

This post includes a full machine learning project that will guide you step by step to create a “template,” which you can use later on other datasets.

Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R and without looking at the answers. Then see solutions to check your answers.

Exercise 1

Create a variable “x” and attach to it the input attributes of the “iris” dataset. HINT: Use columns 1 to 4.

Exercise 2

Create a variable “y” and attach to it the output attribute of the “iris” dataset. HINT: Use column 5.

Exercise 3

Create a whisker plot (boxplot) for the variable of the first column of the “iris” dataset. HINT: Use boxplot().

Exercise 4

Now create a whisker plot for each one of the four input variables of the “iris” dataset in one image. HINT: Use par().

Learn more about machine learning in the online course Beginner to Advanced Guide on Machine Learning with R Tool. In this course you will learn how to:

• Create a machine learning algorithm from a beginner point of view
• Quickly dive into more advanced methods in an accessible pace and with more explanations
• And much more

This course shows a complete workflow start to finish. It is a great introduction and fallback when you have some experience.

Exercise 5

Create a barplot to breakdown your output attribute. HINT: Use plot().

Exercise 6

Create a scatterplot matrix of the “iris” dataset using the “x” and “y” variables. HINT: Use featurePlot().

Exercise 7

Create a scatterplot matrix with ellipses around each separated group. HINT: Use plot="ellipse".

Exercise 8

Create box and whisker plots of each input variable again, but this time broken down into separated plots for each class. HINT: Use plot="box".

Exercise 9

Create a list named “scales” that includes the “x” and “y” variables and set relation to “free” for both of them. HINT: Use list()

Exercise 10

Create a density plot matrix for each attribute by class value. HINT: Use featurePlot().

Related exercise sets: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Hurricane Harvey’s rains, visualized in R by USGS

Fri, 09/08/2017 - 17:36

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

On August 26 Hurricane Harvey became the largest hurricane to make landfall in the United States in over 20 years. (That record may yet be broken by Irma, now bearing down on the Florida peninsula.) Harvey's rains brought major flooding to Houston and other coastal areas in the Gulf of Mexico. You can see the rainfall generated by Harvey across Texas and Louisiana in this animation from the US Geological Survey of county-by-county precipitation as the storm makes its way across land.

Watch #Harvey move thru SE #Texas spiking rainfall rates in each county (blue colors) Interactive version: https://t.co/qFrnyq3Sbm pic.twitter.com/md0hiUs9Bb

— USGS Coastal Change (@USGSCoastChange) September 6, 2017

Interestingly, the heaviest rains appear to fall somewhat away from the eye of the storm, marked in orange on the map. The animation features Harvey's geographic track, along with a choropleth of hourly rainfall totals and a joyplot of river gage flow rates, and was created using the R language. You can find the data and R code on Github, which makes use of the USGS's own vizlab package which facilitates the rapid assembly of web-ready visualizations.

USGS Vizlab: Hurricane Harvey's Water Footprint (via Laura DeCicco)

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Experiences as a first time rOpenSci package reviewer

Fri, 09/08/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

It all started January 26th this year when I signed up to volunteer as
a reviewer for R packages submitted to rOpenSci. My main motivation for
wanting to volunteer was to learn something new and to
contribute to the R open source community. If you are wondering why the
people behind rOpenSci are doing this, you can read How rOpenSci uses Code Review to Promote Reproducible Science.

Three months later I was contacted by Maelle Salmon asking whether I was interested in
reviewing the R package patentsview for rOpenSci. And yes, I
was! To be honest I was a little bit thrilled.

The packages are submitted for review to rOpenSci via an issue to their
GitHub repository and also the reviews happen there. So you can check out
all previous package submissions and reviews.
With all the information you
get from rOpenSci and also the help from the editor it is straightforward
to do the package review. Before I started I read the
reviewer guides (links below) and checked out a few of the existing
reviews. I installed the package patentsview from GitHub and also
downloaded the source code so I could check out how it was implemented.

I started by testing core functionality of the package by
running the examples that were mentioned in the README of the
package. I think this is a good
starting point because you get a feeling of what the author wants to
achieve with the package. Later on I came up with my
own queries (side note: this R package interacts with an API from which
you can query patents). During the process I used to switch between
writing queries like a normal user of the package
would do and checking the code. When I saw something in the code that
wasn't quite clear to me or looked wrong I went back to writing new
queries to check whether the behavior of the methods was as expected.

With this approach I was able to give feedback to the package author
which led to the inclusion of an additional unit test, a helper function
that makes the package easier to use, clarification of an error message
and an improved documentation. You can find the review I did here.

There are several R packages that helped me get started with my review,
e.g. devtools and
goodpractice. These
example for a very useful method is devtools::spell_check(), which
performs a spell check on the package description and on manual pages.
At the beginning I had an issue with goodpractice::gp() but Maelle Salmon
(the editor) helped me resolve it.

review.

Contributing to the open source community

When people think about contributing to the open source community, the
first thought is about creating a new R package or contributing to one
of the major packages out there. But not everyone has the resources
(e.g. time) to do so. You also don't have awesome ideas every other day
which can immediately be implemented into new R packages to be used by
the community. Besides contributing with code there are also lots of
other things than can be useful for other R users, for example writing
blog posts about problems you solved, speaking at meetups or reviewing
code to help improve it. What I like much about reviewing code is that
people see things differently and have other experiences. As a reviewer,
you see a new package from the user's perspective which can be hard for
the programmer themselves. Having someone else
review your code helps finding things that are missing because they seem
obvious to the package author or detect code pieces that require more
testing. I had a great feeling when I finished the review, since I had
helped improve an already amazing R package a little bit more.

Reviewing helps improve your own coding style

When I write R code I usually try to do it in the best way possible.
is a good start to get used to coding best practice in R and I also
Tidbits
. So normally
when I think some piece of code can be improved (with respect to speed,
readability or memory usage) I check online whether I can find a
better solution. Often you just don't think something can be
improved because you always did it in a certain way or the last time you
checked there was no better solution. This is when it helps to follow
other people's code. I do this by reading their blogs, following many R
users on Twitter and checking their GitHub account. Reviewing an R
package also helped me a great deal with getting new ideas because I
checked each function a lot more carefully than when I read blog posts.
In my opinion, good code does not only use the best package for each
problem but also the small details are well implemented. One thing I
used to do wrong for a long time was filling of data.frames until I
found a better (much faster)
solution on stackoverflow.
And with respect to this you
can learn a lot from someone else's code. What I found really cool in
the package I reviewed was the usage of small helper functions (see
utils.R).
Functions like paste0_stop and paste0_message make the rest of the
code a lot easier to read.

Good start for writing your own packages

When reviewing an R package, you check the code like a really observant
user. I noticed many things that you usually don't care about when using
examples are, and also how well unit tests cover the code. I think that
reviewing a few good packages can prepare you very well for writing your
own packages.

Do you want to contribute to rOpenSci yourself?

is a list of useful things if you want to become an rOpenSci reviewer
like me.

If you are generally interested in either submitting or reviewing an R package, I would like to invite you to the Community Call on rOpenSci software review and onboarding.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### The writexl package: zero dependency xlsx writer for R

Fri, 09/08/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

We have started working on a new rOpenSci package called writexl. This package wraps the very powerful libxlsxwriter library which allows for exporting data to Microsoft Excel format.

The major benefit of writexl over other packages is that it is completely written in C and has absolutely zero dependencies. No Java, Perl or Rtools are required.

Getting Started

The write_xlsx function writes a data frame to an xlsx file. You can test that data roundtrips properly by reading it back using the readxl package. Columns containing dates and factors get automatically coerced to character strings.

You can also give it a named list of data frames, in which case each data frame becomes a sheet in the xlsx file:

write_xlsx(list(iris = iris, cars = cars, mtcars = mtcars), "mydata.xlsx")

Performance is good too; in our benchmarks writexl is about twice as fast as openxlsx:

library(microbenchmark) library(nycflights13) microbenchmark( writexl = writexl::write_xlsx(flights, tempfile()), openxlsx = openxlsx::write.xlsx(flights, tempfile()), times = 5 ) ## Unit: seconds ## expr min lq mean median uq max neval ## writexl 8.884712 8.904431 9.103419 8.965643 9.041565 9.720743 5 ## openxlsx 17.166818 18.072527 19.171003 18.669805 18.756661 23.189206 5 Roadmap

The initial version of writexl implements the most important functionality for R users: exporting data frames. However the underlying libxlsxwriter library actually provides far more sophisticated functionality such as custom formatting, writing complex objects, formulas, etc.

Most of this probably won't be useful to R users. But if you have a well defined use case for exposing some specific features from the library in writexl, open an issue on Github and we'll look into it!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...