Don’t teach students the hard way first
(This article was first published on Variance Explained, and kindly contributed to R-bloggers)
Imagine you were going to a party in an unfamiliar area, and asked the host for directions to their house. It takes you thirty minutes to get there, on a path that takes you on a long winding road with slow traffic. As the party ends, the host tells you “You can take the highway on your way back, it’ll take you only ten minutes. I just wanted to show you how much easier the highway is.”
Wouldn’t you be annoyed? And yet this kind of attitude is strangely common in programming education.
I was recently talking to a friend who works with R and whose opinions I greatly respect. He was teaching some sessions to people in his company who hadn’t used R before, where he largely followed my philosophy on teaching the tidyverse to beginners. I agreed with his approach, until he said something far too familiar to me:
“I teach them dplyr’s group_by/summarize in the second lesson, but I teach them loops first just to show them how much easier dplyr is.”
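For concreteness, here is the kind of contrast he meant: a per-group mean computed with a loop versus with dplyr's group_by/summarize. This is an illustrative sketch of mine using the built-in mtcars data, not his actual lesson code.

```r
library(dplyr)

# The hard way: loop over the groups and fill in a result vector by hand
means <- numeric(0)
for (cyl_val in unique(mtcars$cyl)) {
  means[as.character(cyl_val)] <- mean(mtcars$mpg[mtcars$cyl == cyl_val])
}

# The easy way: one group_by/summarize pipeline
mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))
```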
I talk to people about teaching a lot, and that phrase keeps popping up: “I teach them X just to show them how much easier Y is”. It’s a trap I’ve fallen into before when teaching, and one that I’d like to warn others against.
Students don’t share your nostalgia

First, why do some people make this choice? I think because when we teach a class, we accidentally bring in all of our own history and context.
For instance, I started programming with Basic and Perl in high school, then Java in college, then got really into Python, then got even more into R. Along the way I built up habits in each language that had to be painfully undone afterwards. I worked in an object-oriented style in Python, then switched to thinking in data frames and functional operations. I wrote too many loops when I started R, then I grew accustomed to the apply family of functions. Along the way there were thousands of little frustrations and little epiphanies, tragedies and triumphs in microcosm.
But when I’m teaching someone how to use R… they don’t care about any of that. They weren’t there! They didn’t make my mistakes, they don’t need to be talked out of them. Going back to the party host, perhaps the highway was built only last year, and maybe the host had to talk his friends, stuck in their habits, into taking the highway by explaining how much faster it is. But that doesn’t make any difference to the guest, who has never taken either route.
@ucfagls I think ppl teach base stuff preferentially because it’s what they still use or they’re reliving their own #rstats “journey” @drob
— Jenny Bryan (@JennyBryan) January 16, 2015
It’s true that I learned a lot about programming in the path I described above. I learned how to debug, how to compare programming paradigms, and when to switch from one approach to another. I think some teachers hope that by walking students through this “journey”, we can impart some of that experience. But it doesn’t work: feeding students two possible solutions in a row is nothing like the experience of them comparing solutions for themselves, and doesn’t grant them any of the same skills. Besides which, students will face plenty of choices and challenges as they continue their programming career, without us having to invent artificial ones.
Students should absolutely learn multiple approaches (there are usually advantages to each one). But not in this order, from hardest to easiest, and not just because that happened to be the order we learned them ourselves.
Bandwidth and trust

There are two reasons I recommend against teaching a harder approach first. One is educational bandwidth, and one is trust.
One of the most common mistakes teachers make (especially inexperienced ones) is overestimating how much material they can actually teach.
.@minebocek talks philosophy of Data Carpentry.
Highlight: don't take on too much material. Rushing discourages student questions #UseR2017 pic.twitter.com/txAcSd3ND3
— David Robinson (@drob) July 6, 2017
This comes down to what I sometimes call educational bandwidth: the total amount of information you can communicate to students is limited, especially since you need to spend time reinforcing and revisiting each concept. It’s not just about the amount of time you have in the lesson, either. Learning new ideas is hard work: think of the headache you can get at the end of a oneday workshop. This means you should make sure every idea you get across is valuable. If you teach them a method they’ll never have to use, you’re wasting time and energy.
The other reason is trust. Think of the imaginary host who gave poor directions before giving better ones. Wouldn’t it be unpleasant to be tricked like that? When a student goes through the hard work of learning a new method, telling them “just kidding, you didn’t need to learn that” is obnoxious, and hurts their trust in you as a teacher.
In some cases there’s a trade-off between bandwidth and trust. For instance, as I’ve described before, I teach dplyr and the %>% operator before explaining what a function is, or even how to do a variable assignment. This conserves bandwidth (it gets students doing powerful things quickly) but it’s a minor violation of their trust (I’m hiding details of how R actually works). But there’s no trade-off in teaching a hard method before an easier one.
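For what it’s worth, the kind of first-lesson pipeline I mean looks like this (an illustrative sketch using the built-in mtcars data, not the exact lesson material):

```r
library(dplyr)

# Students can read this top-to-bottom before they know what a
# "function" or a "variable assignment" is
mtcars %>%
  filter(cyl == 4) %>%
  summarize(average_mpg = mean(mpg))
```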
Exceptions

Is there any situation where you might want to show students the hard way first? Yes: one exception is if the hard solution is something the student would have been tempted to do themselves.
One good example is in R for Data Science, by Hadley Wickham and Garrett Grolemund. Chapter 19.2 describes “When should you write a function”, and gives an example of rescaling several columns by copying and pasting code:
df$a <- (df$a - min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))

By showing them what this approach looks like (and even including an intentional typo in the second line, to show how copying and pasting code is prone to error), the book guides them towards the several steps involved in writing a function.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

This educational approach makes sense because copying and pasting code is a habit beginners would fall into naturally, especially if they’ve never programmed before. It doesn’t take up any educational bandwidth because students already know how to do it, and it’s upfront about its approach (when I teach this way, I usually use the words “you might be tempted to…”).
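Once a rescaling function exists, the duplicated block collapses to one call per column (or a single lapply), with no room for the copy-paste typo. This sketch repeats the book's rescale01 for self-containedness and uses a hypothetical data frame df standing in for the book's:

```r
# Hypothetical data frame with the same shape as the book's df
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Rescale every column in one step
df[] <- lapply(df, rescale01)
range(df$a)  # now 0 to 1
```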
However, teaching a loop (or split()/lapply(), or aggregate(), or tapply()) isn’t something beginners would do accidentally, and it’s therefore not something you need to be teaching.
In conclusion: teaching programming is hard, don’t make it harder. Next time you’re teaching a course or workshop, writing a tutorial, or just helping a colleague get set up in R, try teaching them your preferred method first, instead of meandering through subpar solutions. I think you’ll find that it’s worth it.
To leave a comment for the author, please follow the link and comment on their blog: Variance Explained. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
ggformula: another option for teaching graphics in R to beginners
(This article was first published on SAS and R, and kindly contributed to R-bloggers)
A previous entry (http://sasandr.blogspot.com/2017/07/optionsforteachingrtobeginners.html) describes an approach to teaching graphics in R that also “get[s] students doing powerful things quickly”, as David Robinson suggested.
In this guest blog entry, Randall Pruim offers an alternative way based on a different formula interface. Here’s Randall:
For a number of years I and several of my colleagues have been teaching R to beginners using an approach that includes a combination of
 the lattice package for graphics,
 several functions from the stats package for modeling (e.g., lm(), t.test()), and
 the mosaic package for numerical summaries and for smoothing over edge cases and inconsistencies in the other two components.
Important in this approach is the syntactic similarity that the following “formula template” brings to all of these operations.
goal ( y ~ x , data = mydata, … )
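As a sketch of how the same template plays out across graphics, summaries, and models (the mosaic package and its KidsFeet data set are my choice of illustration, not necessarily the examples from the post):

```r
library(mosaic)  # also loads lattice and mosaicData

# The same  goal(y ~ x, data = ...)  shape throughout:
bwplot(length ~ sex, data = KidsFeet)    # lattice graphic
mean(length ~ sex, data = KidsFeet)      # mosaic numerical summary
t.test(length ~ sex, data = KidsFeet)    # inference from stats
lm(length ~ sex, data = KidsFeet)        # model from stats
```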
Trouble in paradise
As the earlier post noted, the use of lattice has some drawbacks. While basic graphs like histograms, boxplots, scatterplots, and quantile-quantile plots are simple to make with lattice, it is challenging to combine these simple plots into more complex plots or to plot data from multiple data sources. Splitting data into subgroups and either overlaying with multiple colors or separating into subplots (facets) is easy, but the labeling of such plots is not as convenient (and takes more space) as in the equivalent plots made with ggplot2. And in our experience, students generally find the look of ggplot2 graphics more appealing.

On the other hand, introducing ggplot2 into a first course is challenging. The syntax tends to be more verbose, so it takes up more of the limited space on projected images and course handouts. More importantly, the syntax is entirely unrelated to the syntax used for other aspects of the course. For those adopting a “Less Volume, More Creativity” approach, ggplot2 is tough to justify.

ggformula: The third-and-a-half way

Danny Kaplan and I recently introduced ggformula, an R package that provides a formula interface to ggplot2 graphics. Our hope is that this provides the best aspects of lattice (the formula interface and lighter syntax) and ggplot2 (modularity, layering, and better visual aesthetics). For simple plots, the only thing that changes is the name of the plotting function. Each of these functions begins with gf_. Here are two examples, either of which could replace the side-by-side boxplots made with lattice in the previous post. We can even overlay these two types of plots to see how they compare. To do so, we simply place what I call the “then” operator (%>%, also commonly called a pipe) between the two layers and adjust the transparency so we can see both where they overlap.
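Since the plots themselves are not reproduced here, a sketch of what such calls look like (the KidsFeet data set from mosaicData is my stand-in, not necessarily the data used in the original post):

```r
library(ggformula)
library(mosaicData)  # for the KidsFeet data set (illustrative choice)

# Two one-liners that could replace lattice's side-by-side boxplots
gf_boxplot(length ~ sex, data = KidsFeet)
gf_violin(length ~ sex, data = KidsFeet)

# Overlay the two layers with the "then" operator, turning the
# transparency down so both layers stay visible where they overlap
gf_violin(length ~ sex, data = KidsFeet, alpha = 0.3) %>%
  gf_boxplot(length ~ sex, data = KidsFeet, alpha = 0.5, width = 0.2)
```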
Comparing groups

Groups can be compared either by overlaying multiple groups distinguishable by some attribute (e.g., color) or by creating multiple plots arranged in a grid rather than overlaying subgroups in the same space. The ggformula package provides two ways to create these facets. The first uses |, very much like lattice does. Notice that the gf_lm() layer inherits information from the gf_point() layer in these plots, saving some typing when the information is the same in multiple layers. The second way adds facets with gf_facet_wrap() or gf_facet_grid() and can be more convenient for complex plots or when customization of facets is desired.

Fitting into the tidyverse workflow

ggformula also fits into a tidyverse-style workflow (arguably better than ggplot2 itself does). Data can be piped into the initial call to a ggformula function, and there is no need to switch between %>% and + when moving from data transformations to plot operations.

Summary

The “Less Volume, More Creativity” approach is based on a common formula template that has served well for several years, but the arrival of ggformula strengthens this approach by bringing a richer graphical system into reach for beginners without introducing new syntactic structures. The full range of ggplot2 features and customizations remains available, and the ggformula package vignettes and tutorials describe these in more detail.
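A sketch of the two faceting routes and the piped workflow described above (again using KidsFeet as a hypothetical data set):

```r
library(ggformula)
library(mosaicData)  # for the KidsFeet data set (illustrative choice)

# Facets in the formula with |, as in lattice; gf_lm() inherits
# the formula and data from the gf_point() layer
gf_point(length ~ width | sex, data = KidsFeet) %>%
  gf_lm()

# The same plot built with an explicit faceting layer instead
gf_point(length ~ width, data = KidsFeet) %>%
  gf_lm() %>%
  gf_facet_wrap(~ sex)

# Data can be piped straight into the first ggformula call
KidsFeet %>% gf_point(length ~ width)
```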
Comparing Trump and Clinton’s Facebook pages during the US presidential election, 2016
(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)
R has a lot of packages for users to analyse posts on social media. As an experiment in this field, I decided to start with the biggest one: Facebook.
I decided to look at the Facebook activity of Donald Trump and Hillary Clinton during the 2016 presidential election in the United States.
The winner may be more famous for his Twitter account than his Facebook one, but he still used it to great effect to help pick off his Republican rivals in the primaries and to attack Hillary Clinton in the general election.
For this work we’re going to be using the Rfacebook package developed by Pablo Barbera, plus his excellent how-to guide.
The first thing to do is to generate an access token from Facebook’s developer portal. Keep it anonymous (otherwise you’re gifting the world access to your account) and save it in your environment.
library(Rfacebook)
options(scipen = 999)
token <- "Your token goes here"

The next thing to do is to use the getPage() function to retrieve all the posts from each candidate.
I’m going to start the clock on January 1, 2016 and end it the day after the election on November 9, 2016 (which means it will stop on election day, the day before).
trump <- getPage("donaldtrump", token, n = 5000, since = '2016/01/01', until = '2016/11/09')
clinton <- getPage("hillaryclinton", token, n = 5000, since = '2016/01/01', until = '2016/11/09')

Caveat: the data doesn’t seem to contain all of Trump’s and Clinton’s Facebook posts

I ran the commands several times and got 545 posts for Trump and 692 posts for Clinton. However, I think I may have got more results the first time I ran the commands. I also searched their pages via Facebook and came up with some posts that don’t appear in the R datasets. If you have a solution for this, please let me know!
In the meantime, we will work with what we have.
We want to calculate the average number of likes, comments and shares for each month for both candidates. Again, we will be using Pablo’s code for a while here.
First up, we will format the date:
format.facebook.date <- function(datestring) {
  date <- as.POSIXct(datestring, format = "%Y-%m-%dT%H:%M:%S+0000", tz = "GMT")
}

Then we will use his formula to calculate the average likes, comments and shares (metrics) per month:
aggregate.metric <- function(metric) {
  m <- aggregate(page[[paste0(metric, "_count")]], list(month = page$month), mean)
  m$month <- as.Date(paste0(m$month, "-15"))
  m$metric <- metric
  return(m)
}

Now we run this for both candidates:
# trump
page <- trump
page$datetime <- format.facebook.date(page$created_time)
page$month <- format(page$datetime, "%Y-%m")
df.list <- lapply(c("likes", "comments", "shares"), aggregate.metric)
trump_months <- do.call(rbind, df.list)

head(trump_months)
       month        x metric
1 2016-01-15 17199.93  likes
2 2016-02-15 15239.63  likes
3 2016-03-15 22616.28  likes
4 2016-04-15 19364.17  likes
5 2016-05-15 14598.30  likes
6 2016-06-15 32760.68  likes

# clinton
page <- clinton
page$datetime <- format.facebook.date(page$created_time)
page$month <- format(page$datetime, "%Y-%m")
df.list <- lapply(c("likes", "comments", "shares"), aggregate.metric)
clinton_months <- do.call(rbind, df.list)

Before we combine them together, let’s label them so we know who’s who:
trump_months$candidate <- "Donald Trump"
clinton_months$candidate <- "Hillary Clinton"
both <- rbind(trump_months, clinton_months)

Now that we have the data, we can visualise it. This is a neat opportunity to have a go at faceting using ggplot2.
Faceting is when you display two or more plots side-by-side for easy at-a-glance comparison.
library(ggplot2)
library(scales)

p <- ggplot(both, aes(x = month, y = x, group = metric)) +
  geom_line(aes(color = metric)) +
  scale_x_date(date_breaks = "months", labels = date_format("%m")) +
  ggtitle("Facebook engagement during the 2016 election") +
  labs(y = "Count", x = "Month (2016)") +
  theme(text = element_text(family = "Browallia New", color = "#2f2f2d")) +
  scale_colour_discrete(name = "Metric")

# add in a facet
p <- p + facet_grid(. ~ candidate)
p

Analysis

Clearly Trump’s Facebook engagement got far better results than Clinton’s. Even during his ‘off months’ he received more likes per post on average than Clinton managed at the height of the general election.
Trump’s comments per post also skyrocketed during October and November as the election neared.
Hillary Clinton enjoyed a spike in engagement around June. It was a good month for her: she was confirmed as the Democratic nominee and received the endorsement of President Obama.
Themes

Trump is famous for using nicknames for his political opponents. We had low-energy Jeb, Little Marco, Lyin’ Ted and then Crooked Hillary.
The first usage of Crooked Hillary in the data came on April 26. A look through his Twitter feed shows he seems to have decided on Crooked Hillary around this time as well.
#DrainTheSwamp was one of his later hashtags, making its first appearance in the data on October 20, just a few weeks shy of the election on November 8.
Clinton meanwhile mentioned her rival’s surname in about a quarter of her posts. Her most popular ones were almost all on the eve of the election exhorting her followers to vote.
Of her earlier ones, only her New Year message and one from March appealing to Trump’s temperament resonated as strongly.
Conclusion

It’s frustrating that the API doesn’t seem to retrieve all the data.
Nonetheless, it’s a decent sample size and shows that the Trump campaign was far more effective than the Clinton one on Facebook during the 2016 election.
Visualizing the Spanish Contribution to The Metropolitan Museum of Art
(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)
Well I walk upon the river like it’s easier than land
(Love is All, The Tallest Man on Earth)
The Metropolitan Museum of Art provides here a dataset with information on more than 450,000 artworks in its collection. You can do anything you want with these data: there are no restrictions of use. Each record contains information about the author, title, type of work, dimensions, date, culture and geography of a particular piece.
I can imagine a bunch of things to do with these data, but since I am a big fan of highcharter, I have done a treemap, which is an artistic (as well as efficient) way to visualize hierarchical data. A treemap is useful to visualize frequencies. It can handle levels, allowing you to navigate down into the detail of any category. Here you can find a good example of a treemap.
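As a minimal illustration of the hierarchical idea (a tiny made-up data frame of my own, not The Met data):

```r
library(treemap)

# Two-level hierarchy: each tile's area is proportional to n,
# and subgroups nest inside their parent group
df <- data.frame(
  group    = c("A", "A", "B", "B", "C"),
  subgroup = c("a1", "a2", "b1", "b2", "c1"),
  n        = c(40, 10, 30, 15, 5)
)
treemap(df, index = c("group", "subgroup"), vSize = "n", type = "index")
```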
To read the data I use the fread function from the data.table package. I also use this package to do some data wrangling operations on the data set. After that, I filter it looking for the word SPANISH in the columns Artist Nationality and Culture, and for the word SPAIN in the column Country. For me, any piece created by a Spanish artist (like this one), coming from Spanish culture (like this one) or from Spain (like this one) is Spanish (this is my very own definition and may not match any academic one). Once that is done, it is easy to extract some interesting figures:
 There are 5,294 Spanish pieces in The Met, which amounts to 1.16% of the collection
 This percentage varies significantly between departments: it rises to 9.01% in The Cloisters and to 4.83% in The Robert Lehman Collection; on the other hand, it falls to 0.52% in The Libraries and to 0.24% in Photographs.
 The Met is home to 1,895 highlights, and 44 of them (2.32%) are Spanish; this means that Spanish art is twice as prominent among the highlights as could be expected (remember that it represents 1.16% of the entire collection)
My treemap represents the distribution of Spanish artworks by department (column Department) and type of work (column Classification). There are two important things to know before doing a treemap with highcharter:
 You have to use the treemap function from the treemap package to create a list from your data frame that will serve as input for the hctreemap function
 hctreemap fails if some category name is the same as any of its subcategories. To avoid this, make sure that all names are distinct.
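One way to guarantee distinct names is to tag any subcategory that repeats its parent's name (a sketch with a tiny hypothetical summary table; the marker character is arbitrary):

```r
# Hypothetical Department / Classification summary table
dfspain <- data.frame(
  Department     = c("drawings and prints", "textiles"),
  Classification = c("prints", "textiles"),
  Objects        = c(100, 20)
)

# Append a marker whenever a subcategory repeats its parent's name,
# so hctreemap sees every node label as unique
dfspain$Classification <- ifelse(dfspain$Department == dfspain$Classification,
                                 paste0(dfspain$Classification, "#"),
                                 dfspain$Classification)
dfspain$Classification  # "prints" "textiles#"
```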
This is the treemap:
Here you can see a full size version of it.
Several things can be seen at a glance: most of the pieces are drawings and prints and European sculpture and decorative arts (specifically, prints and textiles), there is also a big number of costumes, arms and armor is a very fragmented department… I think the treemap is a good way to see what kind of works The Met owns.
My favorite Spanish piece in The Met is the stunning Portrait of Juan de Pareja by Velázquez, which illustrates this post: how nice it would be to see it next to El Primo in El Museo del Prado!
Feel free to use my code to do your own experiments:
library(data.table)
library(dplyr)
library(stringr)
library(highcharter)
library(treemap)

file="MetObjects.csv"
# Download data
if (!file.exists(file)) download.file(paste0("https://media.githubusercontent.com/media/metmuseum/openaccess/master/", file),
                                      destfile=file, mode='wb')

# Read data
data=fread(file, sep=",", encoding="UTF-8")

# Modify column names to remove blanks
colnames(data)=gsub(" ", ".", colnames(data))

# Clean columns to prepare for searching
data[,`:=`(Artist.Nationality_aux=toupper(Artist.Nationality) %>% str_replace_all("\\[\\d+\\]", "") %>% iconv(from='UTF-8', to='ASCII//TRANSLIT'),
           Culture_aux=toupper(Culture) %>% str_replace_all("\\[\\d+\\]", "") %>% iconv(from='UTF-8', to='ASCII//TRANSLIT'),
           Country_aux=toupper(Country) %>% str_replace_all("\\[\\d+\\]", "") %>% iconv(from='UTF-8', to='ASCII//TRANSLIT'))]

# Look for Spanish artworks
data[Artist.Nationality_aux %like% "SPANISH" | Culture_aux %like% "SPANISH" | Country_aux %like% "SPAIN"] -> data_spain

# Count artworks by Department and Classification
data_spain %>%
  mutate(Classification=ifelse(Classification=='', "miscellaneous", Classification)) %>%
  mutate(Department=tolower(Department),
         Classification1=str_match(Classification, "(\\w+)(,|\\))")[,2],
         Classification=ifelse(!is.na(Classification1), tolower(Classification1), tolower(Classification))) %>%
  group_by(Department, Classification) %>%
  summarize(Objects=n()) %>%
  ungroup %>%
  mutate(Classification=ifelse(Department==Classification, paste0(Classification, "#"), Classification)) %>%
  as.data.frame() -> dfspain

# Do treemap without drawing
tm_dfspain <- treemap(dfspain,
                      index = c("Department", "Classification"),
                      draw = F,
                      vSize = "Objects",
                      vColor = "Objects",
                      type = "index")

# Do highcharter treemap
hctreemap(
  tm_dfspain,
  allowDrillToNode = TRUE,
  allowPointSelect = T,
  levelIsConstant = F,
  levels = list(
    list(
      level = 1,
      dataLabels = list(enabled = T, color = '#f7f5ed', style = list("fontSize" = "1em")),
      borderWidth = 1
    ),
    list(
      level = 2,
      dataLabels = list(enabled = F, align = 'right', verticalAlign = 'top',
                        style = list("textShadow" = F, "fontWeight" = 'light', "fontSize" = "1em")),
      borderWidth = 0.7
    )
  )) %>%
  hc_title(text = "Spanish Artworks in The Met") %>%
  hc_subtitle(text = "Distribution by Department") -> plot
plot
Pandigital Products: Euler Problem 32
(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)
Euler Problem 32 returns to pandigital numbers, which are numbers that contain one of each digit. Like so many of the Euler Problems, these numbers serve no practical purpose whatsoever, other than some entertainment value. You can find all pandigital numbers in base 10 in the On-Line Encyclopedia of Integer Sequences (A050278).

The Numberphile video explains everything you ever wanted to know about pandigital numbers but were afraid to ask.
Euler Problem 32 Definition

We shall say that an n-digit number is pandigital if it makes use of all the digits 1 to n exactly once; for example, the 5-digit number, 15234, is 1 through 5 pandigital.
The product 7254 is unusual, as the identity, 39 × 186 = 7254, containing multiplicand, multiplier, and product is 1 through 9 pandigital.
Find the sum of all products whose multiplicand/multiplier/product identity can be written as a 1 through 9 pandigital.
HINT: Some products can be obtained in more than one way so be sure to only include it once in your sum.
Proposed Solution

The pandigital.9 function tests whether a string classifies as a pandigital number. The pandigital.prod vector is used to store the products.
The only way to solve this problem is brute force: try all multiplications. But we can limit the solution space to a manageable number. The multiplicand, multiplier and product together need to have nine digits. For example, when the starting number has two digits, the second number should have three digits so that the product has four digits, e.g.: 39 × 186 = 7254. When the first number only has one digit, the second number needs to have four digits.
pandigital.9 <- function(x)   # Test if string is 9-pandigital
    (length(x) == 9 & sum(duplicated(x)) == 0 & sum(x == 0) == 0)

t <- proc.time()
pandigital.prod <- vector()
i <- 1
for (m in 2:100) {
    if (m < 10) n_start <- 1234 else n_start <- 123
    for (n in n_start:round(10000 / m)) {
        # List of digits
        digs <- as.numeric(unlist(strsplit(paste0(m, n, m * n), "")))
        # is Pandigital?
        if (pandigital.9(digs)) {
            pandigital.prod[i] <- m * n
            i <- i + 1
            print(paste(m, "*", n, "=", m * n))
        }
    }
}
answer <- sum(unique(pandigital.prod))
print(answer)

Numbers can also be checked for pandigitality using mathematics instead of strings.
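For that closing remark, here is one possible digit-based check (my own sketch using modular arithmetic to peel off digits; the post does not specify an implementation):

```r
# Check whether the digits of all numbers in nums, taken together,
# are exactly the digits 1 through 9, using %% and %/% instead of strsplit
pandigital9.math <- function(nums) {
  seen <- rep(FALSE, 9)
  for (n in nums) {
    while (n > 0) {
      d <- n %% 10
      if (d == 0 || seen[d]) return(FALSE)  # zero or repeated digit
      seen[d] <- TRUE
      n <- n %/% 10
    }
  }
  all(seen)
}

pandigital9.math(c(39, 186, 7254))  # TRUE: 39 * 186 = 7254 is 1-9 pandigital
pandigital9.math(c(39, 186, 7255))  # FALSE: the digit 5 repeats
```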
You can view the most recent version of this code on GitHub.
The post Pandigital Products: Euler Problem 32 appeared first on The Devil is in the Data.
Report from Mexico City
(This article was first published on R Views, and kindly contributed to R-bloggers)
Editor’s Note: It has been heartbreaking watching the images from México City. Teresa Ortiz, co-organizer of R-Ladies CDMX, reports on the efforts of data scientists to help. Our thoughts are with them, and with the people of México.
It has been a hard couple of days around here.
In less than two weeks, México has gone through two devastating earthquakes, and the damage keeps adding up. Nevertheless, the response from the citizens has been outstanding, and Mexican data-driven initiatives have not stayed behind.
An example is codeandoMexico, an open innovation platform which brings together top talent through public challenges. codeandoMexico developed a Citizen’s Quick Response Center with a directory of emergency phone numbers, the location of places that need help, maps, and more. Most of the information on the site is updated from the first-hand experience of volunteers who are helping throughout the affected areas. Technical collaboration with codeandoMexico is open to everyone by attending to issues on their website; these include extending the services offered by the site or fixing bugs.
Social networks are being used extensively to spread information about where and how to help; however, tweets and Facebook posts are often fake or obsolete. There are several efforts taking place to overcome this difficulty; R-Ladies CDMX’s co-founder Silvia Gutierrez is working on one such project to map information on shelters, affected buildings and more (http://bit.ly/2xfIwZm).
The full scope of the damage is yet to be seen, but the support, both from México and other countries, is encouraging.
Monte Carlo Simulations & the "SimDesign" Package in R
(This article was first published on Econometrics Beat: Dave Giles' Blog, and kindly contributed to R-bloggers)
Past posts on this blog have included several relating to Monte Carlo simulation – e.g., see here, here, and here. Recently I came across a great article by Matthew Sigal and Philip Chalmers in the Journal of Statistics Education. It’s titled, “Play it Again: Teaching Statistics With Monte Carlo Simulation”, and the full reference appears below. The authors provide a really nice introduction to basic Monte Carlo simulation, using R. In particular, they contrast using a “for loop” approach with using the “SimDesign” R package (Chalmers, 2017).

Here’s the abstract of their paper:

“Monte Carlo simulations (MCSs) provide important information about statistical phenomena that would be impossible to assess otherwise. This article introduces MCS methods and their applications to research and statistical pedagogy using a novel software package for the R Project for Statistical Computing constructed to lessen the often steep learning curve when organizing simulation code. A primary goal of this article is to demonstrate how well-suited MCS designs are to classroom demonstrations, and how they provide a hands-on method for students to become acquainted with complex statistical concepts. In this article, essential programming aspects for writing MCS code in R are overviewed, multiple applied examples with relevant code are provided, and the benefits of using a generate–analyze–summarize coding structure over the typical “for-loop” strategy are discussed.”
The SimDesign package provides an efficient and safe template for setting up pretty much any Monte Carlo experiment that you’re likely to want to conduct. It’s really impressive, and I’m looking forward to experimenting with it. The Sigal–Chalmers paper includes helpful examples, with the associated R code and output, so it would be superfluous for me to add that here. Needless to say, the SimDesign package is just as useful for simulations in econometrics as it is for those dealing with straight statistics problems. Try it out for yourself!
References
Chalmers, R. P., 2017. SimDesign: Structure for Organizing Monte Carlo Simulation Designs. R package version 1.7.
Sigal, M. J. and R. P. Chalmers, 2016. Play it again: Teaching statistics with Monte Carlo simulation. Journal of Statistics Education, 24, 136-156.
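To give a flavour of the structure the paper advocates, here is a minimal base-R sketch of a generate–analyse–summarise workflow (the function names are purely illustrative, not the actual SimDesign API): estimating the empirical coverage of the usual 95% t-interval.

```r
set.seed(42)

# Generate: simulate one dataset
generate <- function(n) rnorm(n, mean = 0, sd = 1)

# Analyse: does the 95% t-interval cover the true mean of zero?
analyse <- function(dat) {
  ci <- t.test(dat)$conf.int
  ci[1] <= 0 && 0 <= ci[2]
}

# Summarise: empirical coverage across replications
results  <- replicate(5000, analyse(generate(n = 30)))
coverage <- mean(results)
coverage  # close to the nominal 0.95
```

In SimDesign these three steps become the user-supplied functions passed to the package's simulation driver, which adds bookkeeping, error handling, and reproducibility on top.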
© 2017, David E. Giles
Answer probability questions with simulation (part2)
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
This is the second exercise set on answering probability questions with simulation. Finishing the first exercise set is not a prerequisite, and the difficulty level is about the same – so if you are looking for a challenge, aim at writing faster, more elegant algorithms.
As always, it pays off to read the instructions carefully and think about what the solution should be before starting to code. Often this helps you weed out irrelevant information that can otherwise make your algorithm unnecessarily complicated and slow.
Answers are available here.
Exercise 1
If you take cards numbered 1-10, shuffle them, and lay them down in order, what is the probability that at least one card matches its position (for example, card 3 comes down third)?
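As a flavour of the simulation approach (an illustrative sketch, not the official answer sheet linked above), Exercise 1 takes only a few lines:

```r
set.seed(1)

# One trial: shuffle cards 1-10 and check for at least one positional match
match_somewhere <- function() any(sample(10) == 1:10)

p_hat <- mean(replicate(1e5, match_somewhere()))
p_hat  # close to 1 - 1/e, about 0.632
```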
Exercise 2
Consider an election with 3 candidates and 35 voters, each of whom casts his ballot randomly for one of the candidates. What is the probability of a tie for first position?
Exercise 3
If you were to randomly find a playing card on the floor every day, how many days would it take on average to find a full standard deck?
Exercise 4
Throw two dice. What is the probability the difference between them is 3, 4, or 5 instead of 0, 1, or 2?
Exercise 5
What is the expected number of distinct birthdays in a group of 400 people? Assume 365 days and that all are equally likely.
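A sketch of how one might simulate this (illustrative only; the exact value also follows from the closed form 365 * (1 - (364/365)^400)):

```r
set.seed(1)

# One trial: count the distinct birthdays among 400 people
distinct_bdays <- function() length(unique(sample(365, 400, replace = TRUE)))

m_hat <- mean(replicate(20000, distinct_bdays()))
m_hat  # close to 365 * (1 - (364/365)^400), about 243.2
```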
Exercise 6
What is the probability that a five-card hand from a standard deck of cards has exactly three aces?
Exercise 7
Randomly select three distinct integers a, b, c from the set {1, 2, 3, 4, 5, 6, 7}.
What is the probability that a + b > c?
Exercise 8
Given that a throw of three dice shows three different faces, what is the probability that the sum of the dice is 8?
Exercise 9
Throw a standard die until you get a 6. What is the expected number of throws (including the one giving the 6), conditioned on the event that all throws gave even numbers?
Exercise 10
Choose a two-digit integer at random. What is the probability that it is divisible by each of its digits? This is not exactly a simulation problem, but the concept is similar: make R do the hard work.
(Picture by Gil)
Related exercise sets: Answer probability questions with simulation
 Probability functions intermediate
 Lets Begin with something sample
EARL London 2017 – That’s a wrap!
(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)
Beth Ashlee, Data Scientist
After a successful first-ever EARL San Francisco in June, it was time to head back to the birthplace of EARL – London. With more abstracts submitted than ever before, the conference was made up of 54 fantastic talks and 5 keynotes from an impressive selection of industries. With so many talks to pick from, I thought I would summarise a few of my favourites!
After brilliant keynotes from Tom Smith (ONS) and RStudio’s Jenny Bryan in session 1, Derek Norton and Neera Talbert from Microsoft took us through the Microsoft process of moving a company from SAS to R in session 2. They explained that with the aim of shrinking the ‘SAS footprint’, it’s important to think about the drivers behind a company leaving SAS as well as considering the impact on end users. Their approach focused on converting program logic rather than specific code.
After lunch, Luisa Pires discussed the digital change occurring within the Bank of England. She highlighted the key selling points behind choosing R as a platform and the process behind organizing their journey. They first ran divisional R training, before progressing through to produce a data science training programme to enable the use of R as a project tool.
Finishing up the day, Jessica Peterka-Bonetta gave a fascinating talk on sentiment analysis when considering the use of emojis. She demonstrated that even though adding emojis into your sentiment analysis can add complexity to your process, in the right context they can add real value to tracking the sentiment of a trend. It was an engaging talk which prompted some interesting audience questions, such as: “What about combinations of emojis; how would they affect sentiment?”.
After a Pimms reception it was all aboard the Symphony Cruiser for a tour of the River Thames. On board we enjoyed food, drinks and live music (which resulted in some impromptu dancing, but what happens at a Conference, stays at the Conference!).
Day 2 highlights:
Despite a ‘lively’ evening, EARL kicked off in full swing on Thursday morning. There were three fantastic keynotes, including Hilary Parker’s – her talk was filled with analogies and movie references to describe her reproducible workflow methods – definitely something I could relate to!
The morning session included a talk from Mike Smith from Pfizer. Mike showed us his use of Shiny as a tool for determining wait times when submitting large jobs to Pfizer’s high-performance compute grid. Mike used real-time results to visualise whether it was beneficial to submit a large job at the current time or wait until later. He outlined some of the frustrations of changing data sources in such a large company and his reluctance to admit he was ‘data scienceing’.
After lunch, my colleague Adnan Fiaz gave an interesting talk on the data pipeline, comparing it to the process involved in oil pipelines. He spoke of the methods and actions that must be taken before being able to process your data and generate results. The comparison showed surprising similarities between the two processes, and clarified the method that we must take at Mango to ensure we can safely, efficiently and aptly analyse data.
The final session of day 2 finished on a high, with a packed room gathering to hear from RStudio’s Joe Cheng. Joe highlighted the exciting new asynchronous programming feature that will be available in the next release of Shiny. This new feature is set to revolutionise the responsiveness of Shiny applications, allowing users to overcome the restrictions of R’s single-threaded nature.
I get so much out of EARL each year and this year was no different; I just wish I had a time-turner to get to all the presentations!
On behalf of the rest of the Mango team, a massive thank you to all our speakers and attendees, we hope you enjoyed EARL as much as we did!
All slides we have permission to publish are available under the Speakers section on the EARL conference website.
Preview: ALTREP promises to bring major performance improvements to R
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
Changes are coming to the internals of the R engine which promise to improve performance and reduce memory use, with dramatic impacts in some circumstances. The changes were first proposed by Gabe Becker at the DSC Conference in 2016 (and updated in 2017), and the implementation by Luke Tierney and Gabe Becker is now making its way into the development branches of the R sources.
ALTREP is an alternative way of representing R objects internally. (The details are described in this draft vignette, and there are a few more examples in this presentation.) It's easiest to illustrate by example.
Today, a vector of numbers in R is represented as a contiguous block of memory in RAM. In other words, if you create the sequence of a million integers 1:1000000, R creates a block of memory 4Mb in size (4 bytes per number) to store each number in the sequence. But what if we could use a more efficient representation just for sequences like this? With ALTREP, a sequence like this is instead represented by just its start and end values, which takes up almost no memory at all. That means, in the future you'll be able to write loops like this:
for (i in 1:1e10) do_something()
without getting errors like this:
Error: cannot allocate vector of size 74.5 Gb
ALTREP has the potential to make many other operations faster or even instantaneous, even on very large objects. Here are a few examples of functions that could be sped up:
 is.na(x) — ALTREP will keep track of whether a vector includes NAs or not, so that R no longer has to inspect the entire vector
 sort(x) — ALTREP will keep track of whether a vector is already sorted, and sorting will be instantaneous in this case
 x < 5 — knowing that the vector is already sorted, ALTREP can find the breakpoint very quickly (in O(log n) time), and return a "sequence" logical vector that consumes basically no memory
 match(y,x) — if ALTREP knows that x is already sorted, matching is also much faster
 as.character(numeric_vector) — ALTREP can defer converting a numeric vector to characters until the character representation is actually needed
That last benefit will likely have a large impact on the handling of data frames, which carry around a column of character row labels which start out as a numeric sequence. Development builds are already demonstrating a huge performance gain in the linear regression function lm() as a result:
> n <- 10000000
> x <- rnorm(n)
> y <- rnorm(n)

# With R 3.4
> system.time(lm(y ~ x))
   user  system elapsed
  9.225   0.703   9.929

# With ALTREP
> system.time(lm(y ~ x))
   user  system elapsed
  1.886   0.610   2.496

The ALTREP framework is designed to be extensible, so that package authors can define their own alternative representations of standard R objects. For example, an R vector could be represented as a distributed object as in a system like Spark, while still behaving like an ordinary R vector to the R engine. An example package, simplemmap, illustrates this concept by defining an implementation of vectors as memory-mapped objects that live on disk instead of in RAM.
There's no definite date yet when ALTREP will be in an official R release, and my guess is that there's likely to be an extensive testing period to shake out any bugs caused by changing the internal representation of R objects. But the fact that the implementation is already making its way into the R sources is hugely promising, and I look forward to testing out the real-world impacts. You can read more about the current state of ALTREP in the draft vignette by Luke Tierney, Gabe Becker and Tomas Kalibera linked below.
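For readers who want to poke at this themselves: in R builds that include ALTREP (it eventually shipped in release R, in version 3.5.0), a huge integer sequence really is created instantly without allocating the full vector. A small sketch:

```r
# With ALTREP, this sequence is stored as just its start and end values,
# so creating it is instantaneous and uses essentially no memory.
x <- 1:1e9

length(x)     # 1000000000
x[123456789]  # 123456789 -- single elements are computed on demand

# Note: modifying any element (e.g. x[1] <- 0L) forces expansion to a
# full ~4 GB vector, so we avoid doing that here.
```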
R-Project.org: ALTREP: Alternative Representations for R Objects
pinp 0.0.2: Onwards
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
A first update 0.0.2 of the pinp package arrived on CRAN just a few days after the initial release.
We added a new vignette for the package (see below), extended a few nice features, and smoothed a few corners.
The NEWS entry for this release follows.
Changes in pinp version 0.0.2 (2017-09-20)
The YAML segment can be used to select font size, one-or-two column mode, one-or-two side mode, line numbering and watermarks (#21 and #26 addressing #25)
If pinp.cls or jss.bst are not present, they are copied in (#27 addressing #23)
Output is now in shaded framed boxen too (#29 addressing #28)
End-matter material is placed in template.tex (#31 addressing #30)
Expanded documentation of YAML options in skeleton.Rmd and clarified available one-column option (#32)
Section numbering can now be turned on and off (#34)
The default bibliography style was changed to jss.bst.
A short explanatory vignette was added.
Courtesy of CRANberries, there is a comparison to the previous release. More information is on the pinp page. For questions or comments use the issue tracker off the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
Major update of D3partitionR: Interactive viz’ of nested data with R and D3.js
(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)
D3partitionR is an R package to visualize nested and hierarchical data interactively using D3.js and HTML widgets. These last few weeks I’ve been working on a major D3partitionR update, which is now available on GitHub. As soon as enough feedback is collected, the package will be uploaded to CRAN. Until then, you can install it using devtools:
library(devtools)
install_github("AntoineGuillot2/D3partitionR")
Here is a quick overview of the possibilities using the Titanic data:
A major update
This is a major update from the previous version and will break code written for 0.3.1.
New functionalities
Additional data for nodes: additional data can be added for given nodes. For instance, if a comment or a link needs to be shown in the tooltip or label of some nodes, it can be added through the add_nodes_data function. You can easily add a specific hyperlink or text in the tooltip.
Variable selection and computation: you can now provide a variable for:
 sizing (i.e. the size of each node)
 color, any variable from your data.frame or from the nodes data can be used as a color.
 label, any variable from your data.frame or from the nodes data can be used as a label.
 tooltip, you can provide several variables to be displayed in the tooltip.
 aggregation function, when numerical variables are provided, you can choose the aggregation function you want.
 Coloring: the color scale can now be continuous. For instance, you can use the mean survival rate of the Titanic accident in each node; this makes it easy to see quickly that women in 1st class were more likely to survive than men in 3rd class.
Treemap to show the survival rate to the Titanic accident
 Label: labels showing the node’s name (or any other variable) can now be added to the plot.
 Breadcrumb: to avoid overlapping, the width of each breadcrumb is now variable and dependent on the length of the word.
Variable breadcrumb width
 Legend: by default, the legend now shows all the modalities/levels that are in the plot. To avoid wrapping, enabling the zoom_subset option will only show the modalities in the direct children of the zoomed root.
 Easy data preprocessing: data preparation was tedious in previous versions. Now you just need to aggregate your data.frame at the right level; it can then be used directly in the D3partitionR functions, avoiding the need to nest a data.frame, which can be pretty complicated.
 The R API is greatly improved: D3partitionR charts are now S3 objects with a clearly named list of functions to add data and to modify the chart’s appearance and parameters. Using pipes now makes the D3partitionR syntax look gg-like.
Style consistency among the different types of charts. Now it’s easy to switch from a treemap to a circle treemap or a sunburst and keep a consistent styling policy.
Update to D3.js v4 and modularization. Each type of chart now has its own file and function. This function draws the chart at its root level with labels and colors, and returns a zoom function. The on-click actions (such as the breadcrumb update or the legend update) and the hover action (tooltips) are defined in a ‘global’ function.
Hence, adding new visualizations will be easy: the drawing and zooming script will just need to be adapted to the previous template.
What’s next
Thanks to the feedback that will be collected during the next weeks, a stable release version should soon be on CRAN. I will also post more resources on D3partitionR, with use cases and examples of Shiny applications built on it.
The post Major update of D3partitionR: Interactive viz’ of nested data with R and D3.js appeared first on Enhance Data Science.
Regression Analysis — What You Should’ve Been Taught But Weren’t, and Were Taught But Shouldn’t Have Been
The above title was the title of my talk this evening at our Bay Area R Users Group. I had been asked to talk about my new book, and I presented four of the myths that are dispelled in the book.
Hadley also gave an interesting talk, “An introduction to tidy evaluation,” involving some library functions that are aimed at writing clearer, more readable R. The talk came complete with audience participation, very engaging and informative.
The venue was GRAIL, a highly impressive startup. We will be hearing a lot more about this company, I am sure.
12 Visualizations to Show a Single Number
(This article was first published on R – Displayr, and kindly contributed to R-bloggers)
Infographics, dashboards, and reports often need to highlight or visualize a single number. But how do you highlight a single number so that it has an impact and looks good? It can be a big challenge to make a lonely, single number look great. In this post, I show 12 different ways of representing a single number. Most of these visualizations have been created automatically using R.
When to create a visualization to represent a single number
There are a number of situations in which it can be advantageous to create a visualization to represent a single number:
 To communicate with less numerate viewers/readers;
 Infographics and dashboards commonly use one important number;
 To attract the attention of distracted or busy viewers/readers;
 To add some humanity or “color”, to create an emotional connection;
 Or to increase the redundancy of the presentation (see Improve the Quality of Data Visualizations Using Redundancy).
Option 1: Standard text formatting: font, size, style
Sometimes the plain text is the best option. Make fonts big and simple so they stand out.
669 people died
Option 2: Informative formatting
Colors, bolding, emphasis, and other formatting options can be used to draw attention to figures and to communicate additional information. For example, the red font could draw attention to low results. Example: Sales declined by 23%.
You can do this in a sentence or in a visualization, such as in the bar chart below, where color is used to encode statistical testing results.
And you could also use informative formatting via a colored background for numbers, as in the visual confection below. In this instance, trafficlight coloring indicates the relative levels of performance of different departments in a supermarket.
Option 3: Pie charts
Proportions are often illustrated using pie charts with varying degrees of success. They can be particularly elegant for displaying single numbers that are proportions.
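To give a flavour of how little code this needs, here is a sketch in base R graphics (the 37% figure is purely hypothetical; the visualizations in this post were built with other tools, listed in the Software section below):

```r
value  <- 0.37                    # the single number to display
slices <- c(value, 1 - value)     # shown slice plus the grey remainder

pie(slices,
    labels     = c(sprintf("%.0f%%", 100 * value), ""),
    col        = c("steelblue", "grey90"),
    init.angle = 90, clockwise = TRUE, border = NA,
    main       = "Approval")
```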
Option 4: Donut charts
Similarly with donut charts. It’s just a matter of taste.
Option 5: Portion of an image
The two-category pie chart and donut chart are special cases of a more general strategy, which is to show a portion of an image.
Option 6: Overlaid images
A twist on showing a portion of an image is to proportionally color an image.
A common, but misleading, criticism of overlaid image visualizations and all the pictograph type of visualizations is that they are imprecise at best, and innumerate at worst. The three visualizations above have all been created to illustrate this point. The one on the left is not too bad. The President Trump approval visualization can readily be criticized in that the actual area shaded in blue is less than 37%. This is due to the greater amount of whitespace over the shoulders. Particularly problematic is the age visualization. This implicitly compares a number, 37, against some unknown and silly benchmark implied by the top of the image.
While such criticisms are technically correct, they are misleading. Consider the “worst” of the visualizations, which shows the average age. The purpose of the visualization is simply to communicate to the viewer that the average age figure is in some way low. How low? This is communicated by the number at the base. If the actual correct number is shown, there is little prospect of the viewer being misled. However, showing a number without the visualization runs the risk that the viewer fails to notice the point at all. This leads to a much higher error.
Furthermore, there are many contexts in which precision is not even important. How often is half a glass of wine actually precisely half a glass?
Option 7: Countable pictographs
While all pictographs have a bad reputation, in the case of the countable pictograph, it is quite undeserved. Countable pictographs achieve redundancy and thus likely improve the accuracy with which the underlying data is understood by the viewer.
Option 8: Uncountable pictographs
The goal of the uncountable pictograph is to suggest, in a graphical way, “big”. It is often most useful when next to a comparable countable pictograph.
Option 9: Dials, gauges, and thermometers
This data is from a study by the Pew Research Center. Go to the original report for a better visualization, which uses a thermometer.
Option 10: Physical representations
This photo, from The Guardian, shows 888,246 ceramic red poppies installed around the Tower of London. Each poppy represents a British or Colonial serviceman killed in World War I. It is the same basic idea as the uncountable pictograph but on a much grander scale.
Option 11: Artwork/graphic design
Option 12: Evocative collages
And finally the most common of all approaches in marketing reporting is to find an image or images that in some way represent, or evoke, something relevant to the number.
Software
You can only create the visualizations in option 2 easily in Displayr. The visualizations in options 3 through 8 were all created using the open-source GitHub R packages Displayr/rhtmlPictographs and Displayr/flipPictographs, which were created by my colleagues Kyle and Carmen.
Create your own visualizations
Everyone can access the Displayr document used to create these visualizations here. To modify the visualizations with your own data, click on each visualization and either change the Inputs or modify the underlying R code (Properties > R CODE).
Acknowledgments
Google’s gauge inspired the gauge in this post. I haven’t been able to find any copyright information for the beautiful Coca-Cola photo, so if you know where it is from, please tell me!
From Biology to Industry. A Blogger’s Journey to Data Science.
(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)
Today, I have given a webinar for the Applied Epidemiology Didactic of the University of Wisconsin – Madison titled “From Biology to Industry. A Blogger’s Journey to Data Science.”
I talked about how blogging about R and Data Science helped me become a Data Scientist. I also gave a short introduction to Machine Learning, Big Data and Neural Networks.
My slides can be found here: https://www.slideshare.net/ShirinGlander/frombiologytoindustryabloggersjourneytodatascience
A simstudy update provides an excuse to talk a little bit about latent class regression and the EM algorithm
(This article was first published on ouR data generation, and kindly contributed to R-bloggers)
I was just going to make a quick announcement to let folks know that I’ve updated the simstudy package to version 0.1.4 (now available on CRAN) to include functions that allow conversion of columns to factors, creation of dummy variables, and most importantly, specification of outcomes that are more flexibly conditional on previously defined variables. But, as I was coming up with an example that might illustrate the added conditional functionality, I found myself playing with package flexmix, which uses an Expectation-Maximization (EM) algorithm to estimate latent classes and fit regression models. So, in the end, this turned into a bit more than a brief service announcement.
Defining data conditionally
Of course, simstudy has always enabled conditional distributions based on sequentially defined variables. That is really the whole point of simstudy. But what if I wanted to specify completely different families of distributions or very different regression curves based on different individual characteristics? With the previous version of simstudy, it was not really easy to do. Now, with the addition of two key functions, defCondition and addCondition, the process is much improved. defCondition is analogous to the function defData, in that this new function provides an easy way to specify conditional definitions (as does defReadCond, which is analogous to defRead). addCondition is used to actually add the data column, just as addColumns adds columns.
It is probably easiest to see in action:
library(simstudy)

# Define baseline data set
def <- defData(varname = "x", dist = "normal", formula = 0, variance = 9)
def <- defData(def, varname = "group", formula = "0.2;0.5;0.3", dist = "categorical")

# Generate data
set.seed(111)
dt <- genData(1000, def)

# Convert group to factor - new function
dt <- genFactor(dt, "group", replace = TRUE)
dt

defCondition is the same as defData, except that instead of specifying a variable name, we need to specify a condition that is based on a predefined field:
defC <- defCondition(condition = "fgroup == 1", formula = "5 + 2*x", variance = 4, dist = "normal")
defC <- defCondition(defC, condition = "fgroup == 2", formula = 4, variance = 3, dist = "normal")
defC <- defCondition(defC, condition = "fgroup == 3", formula = "3 - 2*x", variance = 2, dist = "normal")
defC

##      condition formula variance   dist     link
## 1: fgroup == 1 5 + 2*x        4 normal identity
## 2: fgroup == 2       4        3 normal identity
## 3: fgroup == 3 3 - 2*x        2 normal identity

A subsequent call to addCondition generates a data table with the new variable, in this case y:
dt <- addCondition(defC, dt, "y")
dt

##         id          y         x fgroup
##    1:    1  5.3036869 0.7056621      2
##    2:    2  2.1521853 0.9922076      2
##    3:    3  4.7422359 0.9348715      3
##    4:    4 16.1814232 6.9070370      3
##    5:    5  4.3958893 0.5126281      3
##   ---
##  996:  996  0.8115245 2.7092396      1
##  997:  997  1.9946074 0.7126094      2
##  998:  998 11.8384871 2.3895135      1
##  999:  999  3.3569664 0.8123200      1
## 1000: 1000  3.4662074 0.4653198      3

In this example, I’ve partitioned the data into three subsets, each of which has a very different linear relationship between variables x and y, and different variation. In this particular case, all relationships are linear with normally distributed noise, but this is absolutely not required.
Here is what the data look like:
library(ggplot2)

mycolors <- c("#555bd4", "#d4555b", "#d4ce55")

ggplot(data = dt, aes(x = x, y = y, group = fgroup)) +
  geom_point(aes(color = fgroup), size = 1, alpha = .4) +
  geom_smooth(aes(color = fgroup), se = FALSE, method = "lm") +
  scale_color_manual(name = "Cluster", values = mycolors) +
  scale_x_continuous(limits = c(-10, 10), breaks = c(-10, -5, 0, 5, 10)) +
  theme(panel.grid = element_blank(),
        panel.background = element_rect(fill = "grey96", color = "grey80"))

Latent class regression models

Suppose we come across the same data set but are not privy to the group classification, and we are still interested in the relationship between \(x\) and \(y\). This is what the data set would look like – not as user-friendly:
rawp <- ggplot(data = dt, aes(x = x, y = y, group = fgroup)) +
  geom_point(color = "grey75", size = .5) +
  scale_x_continuous(limits = c(-10, 10), breaks = c(-10, -5, 0, 5, 10)) +
  theme(panel.grid = element_blank(),
        panel.background = element_rect(fill = "grey96", color = "grey80"))

rawp

We might see from the plot, or we might have some subject-matter knowledge that suggests, that there are several subclusters within the data, each of which appears to have a different relationship between \(x\) and \(y\). (Obviously, we know this is the case, since we generated the data.) The question is, how can we estimate the regression lines if we don’t know the class membership? That is where the EM algorithm comes into play.
The EM algorithm, very, very briefly

The EM algorithm handles model parameter estimation in the context of incomplete or missing data. In the example I’ve been discussing here, the subgroup or cluster memberships are the missing data. There is an extensive literature on EM methods (starting with this article by Dempster, Laird & Rubin), and I am barely even touching the surface, let alone scratching it.
The missing data (cluster probabilities) are estimated in the Expectation, or E-step. The unknown model parameters (intercept, slope, and variance) for each of the clusters are estimated in the Maximization, or M-step, which in this case assumes the data come from a linear process with normally distributed noise – both the linear coefficients and the variation around the line are conditional on cluster membership. The process is iterative. The E-step is based on starting values of the model parameters at first, and thereafter on the parameter estimates from the prior M-step. The M-step then maximizes the likelihood of the full data (including the ‘missing’ data estimated in the prior E-step). We iterate back and forth until the parameter estimates from the M-step reach a steady state, or the overall likelihood estimate becomes stable.
The strength or usefulness of the EM method is that the likelihood of the full data (both observed data – \(x\)’s and \(y\)’s – and unobserved data – cluster probabilities) is much easier to write down and estimate than the likelihood of the observed data only (\(x\)’s and \(y\)’s). Think of the first plot above with the structure given by the colors compared to the second plot in grey without the structure. The first seems so much more manageable than the second – if only we knew the underlying structure defined by the clusters. The EM algorithm builds the underlying structure so that the maximum likelihood estimation problem becomes much easier.
Here is a little more detail on what the EM algorithm is estimating in our application. (See this for much more detail.) First, we estimate the probability of membership in cluster \(j\) for our linear regression model with three clusters:
\[p_{ij} = P_i(j \mid x_i, y_i, \pi_j, \alpha_{j0}, \alpha_{j1}, \sigma_j) = \frac{\pi_j f(y_i \mid x_i, \alpha_{j0}, \alpha_{j1}, \sigma_j)}{\sum_{k=1}^3 \pi_k f(y_i \mid x_i, \alpha_{k0}, \alpha_{k1}, \sigma_k)},\] where \(\alpha_{j0}\) and \(\alpha_{j1}\) are the intercept and slope for cluster \(j\), and \(\sigma_j\) is the standard deviation for cluster \(j\). \(\pi_j\) is the probability of any individual being in cluster \(j\), and is estimated by taking an average of the \(p_{ij}\)’s across all individuals. Finally, \(f(\cdot)\) is the density of the normal distribution \(N(\alpha_{j0} + \alpha_{j1}x, \ \sigma_j^2)\).
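As a concrete illustration of this E-step, here is a minimal base-R sketch (my own helper, not part of flexmix; the parameter values below are made up):

```r
# E-step: posterior probability that each observation belongs to each cluster,
# for a 3-cluster mixture of linear regressions
e_step <- function(x, y, pi_k, alpha0, alpha1, sigma) {
  # dens[i, j] = pi_j * f(y_i | x_i, alpha_j0, alpha_j1, sigma_j)
  dens <- sapply(1:3, function(j) {
    pi_k[j] * dnorm(y, mean = alpha0[j] + alpha1[j] * x, sd = sigma[j])
  })
  dens / rowSums(dens)  # normalize so each row sums to 1
}

# Toy usage with made-up parameter values
set.seed(1)
x <- rnorm(20)
y <- 5 + 2 * x + rnorm(20, sd = 2)
p <- e_step(x, y,
            pi_k   = c(1, 1, 1) / 3,
            alpha0 = c(5, 4, 3),
            alpha1 = c(2, 0, -2),
            sigma  = c(2, sqrt(3), sqrt(2)))
rowSums(p)  # each row sums to 1 by construction
```

Each row of the result is an observation's estimated membership distribution over the three clusters.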
Second, we maximize each of the three cluster-specific log-likelihoods, where each individual is weighted by its probability of cluster membership (that is, \(p_{ij}\), estimated in the E-step). In particular, we maximize the cluster-specific likelihood with respect to the three unknown parameters \(\alpha_{j0}\), \(\alpha_{j1}\), and \(\sigma_j\):
\[\sum_{n=1}^N \hat{p}_{nj} \, \text{log} \, f(y_n \mid x_n, \alpha_{j0}, \alpha_{j1}, \sigma_j)\] In R, the flexmix package has implemented an EM algorithm to estimate latent class regression models. The package documentation provides a really nice, accessible description of the two-step procedure, with much more detail than I have provided here. I encourage you to check it out.
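The M-step amounts to a weighted least-squares fit for each cluster. Here is a minimal base-R sketch, assuming the posterior weights from the E-step are already in hand (again, my own illustration, not flexmix's implementation):

```r
# M-step: weighted regression per cluster, using posterior probabilities
# w[, j] from the E-step as case weights
m_step <- function(x, y, w) {
  lapply(1:ncol(w), function(j) {
    fit <- lm(y ~ x, weights = w[, j])
    # Weighted estimate of the residual standard deviation for cluster j
    sigma_j <- sqrt(sum(w[, j] * residuals(fit)^2) / sum(w[, j]))
    c(int = coef(fit)[[1]], slope = coef(fit)[[2]], sigma = sigma_j)
  })
}

# Toy usage: two clusters with made-up membership weights
set.seed(2)
x <- rnorm(50)
y <- 1 + 3 * x + rnorm(50)
w <- matrix(runif(100), ncol = 2)
w <- w / rowSums(w)  # rows are membership probabilities
m_step(x, y, w)
```

Iterating e_step and m_step until the parameters stabilize is the whole algorithm; flexmix adds starting-value strategies, convergence checks, and much more.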
Iterating slowly through the EM algorithm

Here is a slow-motion version of the EM estimation process. I show the parameter estimates (visually) at the early stages of estimation, checking in after every three steps. In addition, I highlight two individuals and show the estimated probabilities of cluster membership. At the beginning, there is little differentiation between the regression lines for each cluster. However, by the 10th iteration the parameter estimates for the regression lines look pretty similar to the original plot.
library(flexmix)

selectIDs <- c(508, 775)  # select two individuals

ps <- list()
count <- 0
p.ij <- data.table()  # keep track of estimated probs
pi.j <- data.table()  # keep track of average probs

for (i in seq(1, 10, by = 3)) {

  count <- count + 1
  set.seed(5)

  # fit model up to "i" iterations - either 1, 4, 7, or 10
  exMax <- flexmix(y ~ x, data = dt, k = 3,
                   control = list(iter.max = i))

  p.ij <- rbind(p.ij, data.table(i, selectIDs, posterior(exMax)[selectIDs, ]))
  pi.j <- rbind(pi.j, data.table(i, t(apply(posterior(exMax), 2, mean))))

  dp <- as.data.table(t(parameters(exMax)))
  setnames(dp, c("int", "slope", "sigma"))

  # flexmix rearranges columns/clusters
  dp[, grp := c(3, 1, 2)]
  setkey(dp, grp)

  # create plot for each iteration
  ps[[count]] <- rawp +
    geom_abline(data = dp, aes(intercept = int, slope = slope,
                               color = factor(grp)), size = 1) +
    geom_point(data = dt[id %in% selectIDs], color = "black") +
    scale_color_manual(values = mycolors) +
    ggtitle(paste("Iteration", i)) +
    theme(legend.position = "none", plot.title = element_text(size = 9))
}

library(gridExtra)
grid.arrange(ps[[1]], ps[[2]], ps[[3]], ps[[4]], nrow = 1)

For the two individuals, we can see the probabilities converging to a level of certainty/uncertainty. The individual with ID #775 lies right on the regression line for cluster 3, far from the other lines, and the algorithm quickly assigns a probability of 100% to cluster 3 (its actual cluster). The cluster assignment is less certain for ID #508, which lies between the two regression lines for clusters 1 and 2.
# actual cluster membership
dt[id %in% selectIDs, .(id, fgroup)]

##     id fgroup
## 1: 508      2
## 2: 775      3

setkey(p.ij, selectIDs, i)
p.ij[, .(selectIDs, i, C1 = round(V2, 2), C2 = round(V3, 2), C3 = round(V1, 2))]

##    selectIDs  i   C1   C2   C3
## 1:       508  1 0.32 0.36 0.32
## 2:       508  4 0.29 0.44 0.27
## 3:       508  7 0.25 0.65 0.10
## 4:       508 10 0.24 0.76 0.00
## 5:       775  1 0.35 0.28 0.37
## 6:       775  4 0.33 0.14 0.53
## 7:       775  7 0.11 0.01 0.88
## 8:       775 10 0.00 0.00 1.00

In addition, we can see how the estimate of overall group membership (for all individuals) changes through the iterations. The algorithm starts by assigning equal probability to each cluster (1/3) and slowly moves towards the actual distribution used to generate the data (20%, 50%, and 30%).
pi.j[, .(i, C1 = round(V2, 2), C2 = round(V3, 2), C3 = round(V1, 2))]

##     i   C1   C2   C3
## 1:  1 0.33 0.34 0.33
## 2:  4 0.31 0.34 0.35
## 3:  7 0.25 0.39 0.36
## 4: 10 0.23 0.44 0.33

Final estimation of linear models

The final estimation is shown below, and we can see that the parameters have largely converged to the values used to generate the data.
# Estimation until convergence
set.seed(5)
ex1 <- flexmix(y ~ x, data = dt, k = 3)

# parameter estimates
data.table(parameters(ex1))[, .(param = c("int", "slope", "sd"),
                                C1 = round(Comp.2, 2),
                                C2 = round(Comp.3, 2),
                                C3 = round(Comp.1, 2))]

##    param   C1   C2    C3
## 1:   int 5.18 3.94  3.00
## 2: slope 1.97 0.03 -1.99
## 3:    sd 2.07 1.83  1.55

# estimates of cluster probabilities
round(apply(posterior(ex1), 2, mean), 2)[c(2, 3, 1)]

## [1] 0.19 0.51 0.30

# estimates of individual probabilities
data.table(posterior(exMax)[selectIDs, ])[, .(selectIDs,
                                              C1 = round(V2, 2),
                                              C2 = round(V3, 2),
                                              C3 = round(V1, 2))]

##    selectIDs   C1   C2 C3
## 1:       508 0.24 0.76  0
## 2:       775 0.00 0.00  1

How do we know the relationship is linear?

In reality, there is no reason to assume that the relationship between \(x\) and \(y\) is simply linear. We might want to look at other possibilities, such as a quadratic relationship. So, we use flexmix to estimate an expanded model, and then we plot the fitted lines on the original data:
ex2 <- flexmix(y ~ x + I(x^2), data = dt, k = 3)

dp <- as.data.table(t(parameters(ex2)))
setnames(dp, c("int", "slope", "slope2", "sigma"))
dp[, grp := c(1, 2, 3)]

x <- c(seq(-10, 10, by = .1))

dp1 <- data.table(grp = 1, x, dp[1, int + slope*x + slope2*(x^2)])
dp2 <- data.table(grp = 2, x, dp[2, int + slope*x + slope2*(x^2)])
dp3 <- data.table(grp = 3, x, dp[3, int + slope*x + slope2*(x^2)])
dp <- rbind(dp1, dp2, dp3)

rawp +
  geom_line(data = dp, aes(x = x, y = V3, group = grp, color = factor(grp)),
            size = 1) +
  scale_color_manual(values = mycolors) +
  theme(legend.position = "none")

And even though the parameter estimates appear to be reasonable, we would want to compare the simple linear model with the quadratic model, which we can do with something like the BIC. We see that the linear model is a better fit (lower BIC value) – not surprising, since this is how we generated the data.
summary(refit(ex2))

## $Comp.1
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 1.440736   0.309576   4.6539 3.257e-06 ***
## x           0.405118   0.048808   8.3003 < 2.2e-16 ***
## I(x^2)      0.246075   0.012162  20.2337 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $Comp.2
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 6.955542   0.289914  23.9918 < 2.2e-16 ***
## x           0.305995   0.049584   6.1712 6.777e-10 ***
## I(x^2)      0.263160   0.014150  18.5983 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $Comp.3
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.9061090  0.1489738 26.2201  < 2e-16 ***
## x           0.0681887  0.0277366  2.4584  0.01395 *
## I(x^2)      0.0113305  0.0060884  1.8610  0.06274 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Comparison of the two models
BIC(ex1)
## [1] 5187.862
BIC(ex2)
## [1] 5316.034

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Visualizing dataset to apply machine learning – exercises
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
Dear reader,
If you are a newbie in the world of machine learning, then this tutorial is exactly what you need in order to introduce yourself to this exciting new part of the data science world.
This post includes a full machine learning project that will guide you step by step to create a “template,” which you can use later on other datasets.
Before proceeding, please follow our short tutorial.
Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R, without looking at the answers. Finally, check your answers against the solutions.
Exercise 1
Create a variable “x” and attach to it the input attributes of the “iris” dataset. HINT: Use columns 1 to 4.
Exercise 2
Create a variable “y” and attach to it the output attribute of the “iris” dataset. HINT: Use column 5.
Exercise 3
Create a whisker plot (boxplot) for the variable of the first column of the “iris” dataset. HINT: Use boxplot().
Exercise 4
Now create a whisker plot for each one of the four input variables of the “iris” dataset in one image. HINT: Use par().
Learn more about machine learning in the online course Beginner to Advanced Guide on Machine Learning with R Tool. In this course you will learn how to: Create a machine learning algorithm from a beginner point of view
 Quickly dive into more advanced methods at an accessible pace and with more explanations
 And much more
This course shows a complete workflow from start to finish. It is a great introduction, and a useful fallback once you have some experience.
Exercise 5
Create a barplot to break down your output attribute. HINT: Use plot().
Exercise 6
Create a scatterplot matrix of the “iris” dataset using the “x” and “y” variables. HINT: Use featurePlot().
Exercise 7
Create a scatterplot matrix with ellipses around each separated group. HINT: Use plot="ellipse".
Exercise 8
Create box and whisker plots of each input variable again, but this time broken down into separated plots for each class. HINT: Use plot="box".
Exercise 9
Create a list named “scales” that includes the “x” and “y” variables and set relation to “free” for both of them. HINT: Use list()
Exercise 10
Create a density plot matrix for each attribute by class value. HINT: Use featurePlot().
Related exercise sets: How to prepare and apply machine learning to your dataset
 Summarizing dataset to apply machine learning – exercises
 Building Shiny App exercises part 6
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises.
Hurricane Harvey’s rains, visualized in R by USGS
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
On August 26, Hurricane Harvey became the largest hurricane to make landfall in the United States in over 20 years. (That record may yet be broken by Irma, now bearing down on the Florida peninsula.) Harvey's rains brought major flooding to Houston and other coastal areas on the Gulf of Mexico. You can see the rainfall generated by Harvey across Texas and Louisiana in this animation from the US Geological Survey of county-by-county precipitation as the storm makes its way across land.
Watch #Harvey move thru SE #Texas spiking rainfall rates in each county (blue colors) Interactive version: https://t.co/qFrnyq3Sbm pic.twitter.com/md0hiUs9Bb
— USGS Coastal Change (@USGSCoastChange) September 6, 2017
Interestingly, the heaviest rains appear to fall somewhat away from the eye of the storm, marked in orange on the map. The animation features Harvey's geographic track, along with a choropleth of hourly rainfall totals and a joyplot of river gage flow rates, and was created using the R language. You can find the data and R code on Github; the code makes use of the USGS's own vizlab package, which facilitates the rapid assembly of web-ready visualizations.
You can find more information about the animation, including the web-based interactive version, at the link below.
USGS Vizlab: Hurricane Harvey's Water Footprint (via Laura DeCicco)
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
Experiences as a first time rOpenSci package reviewer
(This article was first published on rOpenSci Blog, and kindly contributed to Rbloggers)
It all started January 26th this year when I signed up to volunteer as
a reviewer for R packages submitted to rOpenSci. My main motivation for
wanting to volunteer was to learn something new and to
contribute to the R open source community. If you are wondering why the
people behind rOpenSci are doing this, you can read How rOpenSci uses Code Review to Promote Reproducible Science.
Three months later I was contacted by Maelle Salmon asking whether I was interested in
reviewing the R package patentsview for rOpenSci. And yes, I
was! To be honest I was a little bit thrilled.
The packages are submitted for review to rOpenSci via an issue in their
GitHub repository, and the reviews happen there as well. So you can check out
all previous package submissions and reviews.
With all the information you
get from rOpenSci and also the help from the editor it is straightforward
to do the package review. Before I started I read the
reviewer guides (links below) and checked out a few of the existing
reviews. I installed the package patentsview from GitHub and also
downloaded the source code so I could check out how it was implemented.
I started by testing core functionality of the package by
running the examples that were mentioned in the README of the
package. I think this is a good
starting point because you get a feeling of what the author wants to
achieve with the package. Later on I came up with my
own queries (side note: this R package interacts with an API from which
you can query patents). During the process I switched between
writing queries as a normal user of the package would
and checking the code. When I saw something in the code that
wasn't quite clear to me or looked wrong I went back to writing new
queries to check whether the behavior of the methods was as expected.
With this approach I was able to give feedback to the package author
which led to the inclusion of an additional unit test, a helper function
that makes the package easier to use, clarification of an error message
and an improved documentation. You can find the review I did here.
There are several R packages that helped me get started with my review,
e.g. devtools and
goodpractice. These
packages can also help you when you start writing your own packages. An
example for a very useful method is devtools::spell_check(), which
performs a spell check on the package description and on manual pages.
At the beginning I had an issue with goodpractice::gp() but Maelle Salmon
(the editor) helped me resolve it.
In the rest of this article you can read what I gained personally from doing a
review.
When people think about contributing to the open source community, the
first thought is about creating a new R package or contributing to one
of the major packages out there. But not everyone has the resources
(e.g. time) to do so. You also don't have awesome ideas every other day
that can immediately be implemented as new R packages to be used by
the community. Besides contributing code, there are also lots of
other things that can be useful for other R users, for example writing
blog posts about problems you solved, speaking at meetups, or reviewing
code to help improve it. What I like most about reviewing code is that
people see things differently and have other experiences. As a reviewer,
you see a new package from the user's perspective which can be hard for
the programmer themselves. Having someone else
review your code helps find things that are missing because they seem
obvious to the package author, or detect pieces of code that require more
testing. I had a great feeling when I finished the review, since I had
helped improve an already amazing R package a little bit more.
When I write R code I usually try to do it in the best way possible.
Google's R Style Guide
is a good start to get used to coding best practice in R and I also
enjoyed reading Programming Best Practices
Tidbits. So normally
when I think some piece of code can be improved (with respect to speed,
readability or memory usage) I check online whether I can find a
better solution. Often you just don't think something can be
improved because you always did it in a certain way or the last time you
checked there was no better solution. This is when it helps to follow
other people's code. I do this by reading their blogs, following many R
users on Twitter and checking their GitHub account. Reviewing an R
package also helped me a great deal with getting new ideas because I
checked each function a lot more carefully than when I read blog posts.
In my opinion, good code not only uses the best package for each
problem; the small details are also well implemented. One thing I
used to do wrong for a long time was the filling of data.frames, until I
found a better (much faster)
solution on stackoverflow.
And with respect to this you
can learn a lot from someone else's code. What I found really cool in
the package I reviewed was the usage of small helper functions (see
utils.R).
Functions like paste0_stop and paste0_message make the rest of the
code a lot easier to read.
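To illustrate the data.frame-filling point with a hypothetical example of my own (not the actual stackoverflow solution): growing a data.frame row by row with rbind copies the whole object at every step, while preallocating and filling in place avoids that.

```r
# Growing a data.frame inside a loop (slow: copies the whole object each time)
grow <- function(n) {
  df <- data.frame(x = numeric(0), y = numeric(0))
  for (i in 1:n) df <- rbind(df, data.frame(x = i, y = i^2))
  df
}

# Preallocating and filling in place (much faster)
prealloc <- function(n) {
  df <- data.frame(x = numeric(n), y = numeric(n))
  for (i in 1:n) {
    df$x[i] <- i
    df$y[i] <- i^2
  }
  df
}

all.equal(grow(100)$y, prealloc(100)$y)  # same result, very different speed
```

The gap widens quickly with n, since each rbind reallocates the entire data.frame.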
When reviewing an R package, you check the code like a really observant
user. I noticed many things that you usually don't care about when using
an R package, like comments, how helpful the documentation and the
examples are, and also how well unit tests cover the code. I think that
reviewing a few good packages can prepare you very well for writing your
own packages.
If I motivated you to become an rOpenSci reviewer, please sign up! Here
is a list of useful things if you want to become an rOpenSci reviewer
like me.

While writing this blog post I found a nice article about contributing
to the tidyverse which is
mostly also applicable to other R packages in my opinion.
If you are generally interested in either submitting or reviewing an R package, I would like to invite you to the Community Call on rOpenSci software review and onboarding.
To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.
The writexl package: zero dependency xlsx writer for R
(This article was first published on rOpenSci Blog, and kindly contributed to Rbloggers)
We have started working on a new rOpenSci package called writexl. This package wraps the very powerful libxlsxwriter library which allows for exporting data to Microsoft Excel format.
The major benefit of writexl over other packages is that it is completely written in C and has absolutely zero dependencies. No Java, Perl or Rtools are required.
Getting Started

The write_xlsx function writes a data frame to an xlsx file. You can test that data round-trips properly by reading it back using the readxl package. Columns containing dates and factors get automatically coerced to character strings.
library(writexl)
library(readxl)

write_xlsx(iris, "iris.xlsx")

# read it back
out <- read_xlsx("iris.xlsx")

You can also give it a named list of data frames, in which case each data frame becomes a sheet in the xlsx file:
write_xlsx(list(iris = iris, cars = cars, mtcars = mtcars), "mydata.xlsx")

Performance is good too; in our benchmarks writexl is about twice as fast as openxlsx:
library(microbenchmark)
library(nycflights13)

microbenchmark(
  writexl = writexl::write_xlsx(flights, tempfile()),
  openxlsx = openxlsx::write.xlsx(flights, tempfile()),
  times = 5
)

## Unit: seconds
##      expr       min        lq      mean    median        uq       max neval
##   writexl  8.884712  8.904431  9.103419  8.965643  9.041565  9.720743     5
##  openxlsx 17.166818 18.072527 19.171003 18.669805 18.756661 23.189206     5

Roadmap

The initial version of writexl implements the most important functionality for R users: exporting data frames. However, the underlying libxlsxwriter library actually provides far more sophisticated functionality, such as custom formatting, writing complex objects, formulas, etc.
Most of this probably won't be useful to R users. But if you have a well-defined use case for exposing some specific features from the library in writexl, open an issue on Github and we'll look into it!
To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.