A Few Old Books
(This article was first published on R Views, and kindly contributed to Rbloggers)
Greg Wilson is a data scientist and professional educator at RStudio.
My previous column looked at a few new books about R. In this one, I’d like to explore a few books about programming that people coming from data science backgrounds may not have stumbled upon.
The first is Michael Nygard’s Release It!, which more than lives up to its subtitle, “Design and Deploy ProductionReady Software”. Most of us can write programs that work for us on our machines; this book explores what it takes to create software that will work reliably for other people, on machines you’ve never met, long after you’ve moved on to your next project. It focuses on software that’s deployed for general use rather than installed on individuals’ machines, and covers stability patterns and antipatterns, designing software to meet production needs, security, and a range of other pragmatic issues. You might not need to take care of these things yourself, but whoever has to get your software running on the departmental cluster will be grateful that you thought about it, and can have a sensible conversation about tradeoffs.
The second book is Andreas Zeller’s Why Programs Fail, which bills itself as “a guide to pragmatic debugging”, and has been turned into a Udacity course. Programmers spend anywhere from a quarter to three quarters of their time debugging, but most only get an inpassing overview of how to do this well, and are never shown tools more advanced than print statements and breakpoint debuggers. Zeller starts with that, but goes much further to look at automatic and semiautomatic ways of simplifying programs to localize problems, isolating values’ origins, program slicing, anomaly detection, and much more. Some of the methods he describes will seem very familiar to data scientists, though the domain is new; others will take readers without a computerscience background into new territory in the same way that Advanced R does.
Our third entry is Michael Feathers’ Working Effectively with Legacy Code. Feathers defines legacy code as software that we’re reluctant to modify because we don’t understand how it works and are afraid of breaking. Having a comprehensive test suite allays this fear, but how can we construct one after the fact for a tangled mess of code? The bulk of the book explores answers to this question, including how to identify seams where code can be split, how to break dependencies so that parts can be improved incrementally, and so on. Some of the examples may seem a little out of date (the book is almost 15 years old), but they all apply directly to the unholy mixture of Perl, shell scripts, hundredline SQL statements, and tenpage R scripts that you were just handed.
Number four is Jeff Johnson’s GUI Bloopers. I was in two startups in the 1990s, and in both of them, I was told after a few weeks that I was never allowed to work on the user interface again. It was the right decision, but this book might have made it unnecessary. Rather than trying to explain the rules for designing a good user interface, Johnson gives example after example of how to fix bad ones. The companion book, Web Bloopers, is less useful today because web interfaces have evolved so rapidly, but either will help you make an interface that is at least not bad.
The last entry for this post is Ashley Davis’s Data Wrangling with JavaScript. As its title suggests, it doesn’t spend very much time on statistical theory; instead, it covers the “other 90%” of squeezing answers out of data, from establishing your data pipeline and getting started with Node (a widelyused commandline version of JavaScript) to cleaning, analyzing, and visualizing data. There are lots of code samples and plenty of diagrams, and you can download both the data sets the author uses in examples and his DataForge library. I suspect readers will need some prior familiarity with JavaScript to dive into this, but Davis shows just how far you can go with what’s available today, and that the journey is a lot smoother than people might think.
_____='https://rviews.rstudio.com/2019/04/25/afewoldbooks/';
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R Views. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
A new image on DigitalOcean to start using RStudio Server without waiting more than 2 minutes
(This article was first published on Pachá, and kindly contributed to Rbloggers)
Motivation I initially made an closeaccess image for DigitalOcean because I wanted to spend my lectures and workshops giving useful examples and not solving software installation issues. Now you can use my RStudio image which available on DigitalOcean Marketplace, it costs $0 except for the cost of running the server, and it’s also scalable.
For novice users, installing R can be hard, that’s why I wrote a guide with videos in english here and I translated it to spanish here.
To leave a comment for the author, please follow the link and comment on their blog: Pachá. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
I walk the (train) line – part trois – Dijkstra’s revenge
(This article was first published on Embracing the Random  R, and kindly contributed to Rbloggers)
(TL;DR: Author algorithmically confirms what he already knows – that there is a way to get from Newtown Station to a tasty burger. Shortest path from Newtown to tasty burger discovered. Author can’t stop thinking about cheeseburgers. Why didn’t he choose a gym as his destination?)
This is part three of our epic: here is part one, and here is part two.
Now for some unfinished business: back to OpenStreetMapWe have so far:
 gathered some OpenStreetMap (OSM) data, and
 we have subset this data for a region containing Newtown Station and Small Bar (where some tasty burgers await us), and
 we have plotted our map using our pretty_graph() function, and
 we have learnt about breadthfirst search, but have not applied it to our map.
Here is our map of interest:
We want to convert this into a graph so that we can apply our breadthfirst search algorithm from the last post to it.
What information do we need to apply our breadthfirst search? All we need are edges and nodes! Let’s take a look at how many nodes there are in our OSM data:
length(unique(osm_data_subset$nodes$attrs$id)) ## [1] 85126How many ways?
length(osm_data_subset$ways$attrs$id) ## [1] 16212These numbers feel a little bit too high. Our OSM data may contain information about nodes that are not associated with any way in our dataset. Let’s investigate!
Reading the valuable OpenStreetMap wiki page describing OSM Elements, we find that this may in fact be the case:
Nodes can be used to define standalone point features. For example, a node could represent a park bench or a water well.
As the running time of our breathfirstsearch algorithm depends on the number of nodes and edges in our graph, we could speed it up by applying it only to the relevant nodes and edges. There’s no sense in applying our algorithm to an edge that makes up part of a river, or to the nodes that represent water wells. I cannot walk on water, nor am I interested in water wells!
How can we reduce the number of elements in our dataset to speed up our algorithm?Let’s learn some more about the OpenStreetMap data model.
The OSM data model – hierarchical, it isHere are the definitions of nodes, ways and relations taken from here:
A node represents a specific point on the earth’s surface defined by its latitude and longitude. Each node comprises at least an id number and a pair of coordinates.
A way is an ordered list of between 2 and 2,000 nodes that define a polyline. Ways are used to represent linear features such as rivers and roads.
A relation is a multipurpose data structure that documents a relationship between two or more data elements (nodes, ways, and/or other relations).
If we were to depict the above definitions in a graphlike structure, it would looks like this:
There are two functions from the osmar package that let us traverse this hierarchy. These are find_up() and find_down().
find_*() them elements!According to the osmar documentation for find_up() and find_down(), this is what they do:
For a given ID these functions return all IDs of related elements.
Relations, ways and nodes have IDs in our dataset. We are good to proceed.
find_up()According to the previous visualisation, nodes are at the bottom of our hierarchy. So by passing one or more node IDs into find_up(), the function returns the associated way IDs and relation IDs, if any. For ways, we can return associated way IDs and relation IDs, if any. As relations can be made up of other relations, we return associated relation IDs, if any.
find_down()This is the inverse of the above. I’m sure you can figure this one out!
Putting it all togetherThis is what we will do:
 Get all the relevant way IDs in our osmar object. We are only interested in roadlike ways (remember, ways include things like rivers and wells). So we will keep ways that have a key, k, equal to ‘highway’. According to this OSM wiki page, highways identify any kind of road, street or path. So we will get ourselves a vector of highwayway IDs.
 We will use find_down() on these way IDs to return a list of the highwayway IDs and the node IDs that make up those highwayways.
 We will then use the subset() function on the full osmar object that we started with to return a reduced osmar object containing the nodes and ways that we are interested in.
 Our original Newtown Station and Small Bar nodes can’t be found in the set of nodes that make up the ways in our graph as they are landmarks that don’t fall neatly on any ‘way’. We will find the nearest nodes to Newtown Station and Small Bar that are related to some ‘way’ as a proxy for our source and destination nodes.
 We will then convert our OSM subset into an igraph object.
 Convert our graph into an adjacency list.
We will then have all that is required to apply our breadthfirst search algorithm to our adjacency list!
Steps 1 through to 3 – subsetting the dataBoom!
highways < find(osm_data_subset, way(tags(k == 'highway'))) highways_nodes < find_down(osm_data_subset, way(highways)) osmar_highways_and_nodes < subset(osm_data_subset, ids = highways_nodes)Let’s take a look at our reduced osmar object:
print(osmar_highways_and_nodes) ## osmar object ## 28965 nodes, 6580 ways, 0 relationsYes! We started out with 85,126 nodes. We are now down to 28,965. We started out with 16,212 ways and are now down to 6,580 ways. If we plot the reduced map, we can see that it looks a lot sparser!
pretty_graph(osmar_highways_and_nodes, edge_alpha = 0.2) Step 4 – finding the nearest node in a wayThese are our original source and destination node IDs.
newtown_station < '4735625743' small_bar < '1346628719'I won’t go into much detail here. Here is a little function named find_pedestrian_node which calls osmar's find_nearest_node function and returns the nearest node included in a highwayway that is also a pedestrianway. Here it is in full. It can also be found in the repo.
find_pedestrian_node < function(osm_data, node) { node < find_nearest_node(osm_data, node(node), way(tags(k == "highway" & v == "pedestrian"))) nearest_highway_node < subset(osm_data, node_ids = node) return(nearest_highway_node) }Let’s apply it!
nearest_to_newtown < find_pedestrian_node(osm_data_subset, newtown_station) nearest_to_small_bar < find_pedestrian_node(osm_data_subset, small_bar) print(nearest_to_newtown$nodes$attrs$id) ## [1] 3891123785 print(nearest_to_small_bar$nodes$attrs$id) ## [1] 1427601758 Step 5 – convert our data into a graph using igraphigraph is an open source graph analysis package written in C. It contains a bunch of useful tools to convert our data structures into different representations of graphs (such as the adjacency list representation from part two of this series). The R port of this package interfaces with igraph's C library so it should generally be faster than the lowlevel R code that we’ve written thus far!
library(igraph)One of the suggested packages listed on osmar’s CRAN entry is igraph. One reason for this is that the osmar package contains a function that easily converts our osmar object into an igraph object!
osmar's as_igraph functionConverting our osmar object into an igraph object is simple. Let’s use osmar’s as_igraph function:
igraph_highways_and_nodes < as_igraph(osmar_highways_and_nodes) %>% as.undirected()Let’s check its class.
class(igraph_highways_and_nodes) ## [1] "igraph"Woohoo! Now that we’ve got ourselves an igraph object, we can use some handy igraph functions on our graph. For example, let’s get the number of vertices in our graph:
vcount(igraph_highways_and_nodes) ## [1] 19004What about the number of edges?:
ecount(igraph_highways_and_nodes) ## [1] 22356But wait a minute…why do we have only 19,004 nodes in our graph when our original osmar object contains 28,965 nodes?
I have an answer for you! Remember that ways are collections of nodes? Our subset osmar object contains ways that refer to nodes that fall outside of our map’s perimeter. For example, way id 3188239 is made up of 9 nodes. Only one of those nodes falls within our map’s perimeter!
If you’re interested in how to debug issues like this, follow me down the rabbit hole in the next section. Otherwise, skip to the next section
Sidenote – using the OpenStreetMap API to get details of nodes outside of our map subsetThe latest version of the OSM API is detailed here. Here is how you can use it to get the details of any node, way or relation by ID.
 Navigate to the OSM website.
 Open up Chrome developer tools.

Make sure you’re on the ‘Console’ tab. We’ll be writing some JavaScript here!
 We’ll be making a call to the API. According to the API specification, we can get an XML response by making a call of this format: GET /api/0.6/[nodewayrelation]/#id. So we will be making a GET request, which will take one of the following three forms:
 /api/0.6/node/ to get details of the
 /api/0.6/way/ to get details of the
 /api/0.6/relation/ to get the details of the
I’m going to pick node ID 2240223186 which is part of way ID 3188239. I already know that this node falls outside of our subset map. Let’s get its details using some JavaScript! Type this in the Chrome Console:
var httpRequest = new XMLHttpRequest(); httpRequest.open('GET', '/api/0.6/node/2240223186'); httpRequest.send();If all goes well, we can checkout our our XML reponse like this:
httpRequest.responseText;Which should print something like this:
<?xml version="1.0" encoding="UTF8"?> <osm version="0.6" generator="CGImap 0.6.1 (11637 thorn03.openstreetmap.org)" copyright="OpenStreetMap and contributors" attribution="http://www.openstreetmap.org/copyright" license="http://opendatacommons.org/licenses/odbl/10/"> <node id="2240223186" visible="true" version="2" changeset="19370313" timestamp="20131210T04:07:03Z" user="mrpulley" uid="67896" lat="33.8671718" lon="151.2167234"/> osm>We can use the lat and lon attributes in our response to search for this node in the OpenStreetMap search box:
This is where the node is:
And we can see that it sits just outside of our map perimeter!
We have survived our quest and have returned home stronger.
Step 6 – convert our graph into an adjacency listWe know how to make an adjacency list using lowlevel code from part two. Now that you know how adjacency lists work, we can be lazy and use igraph's as_adj_list function!
adj_list_highways_and_nodes < igraph_highways_and_nodes %>% as_adj_list()The resulting adjacency list is a R list object.
class(adj_list_highways_and_nodes) ## [1] "list"The list element names are the node IDs!
names(adj_list_highways_and_nodes) %>% head(10) ## [1] "20827842" "3814697125" "2869677788" "1810713242" "8965944" ## [6] "33683546" "1826240863" "1826240636" "8410958" "8408936"If we index into a node in our list, we can apply the name function to get our adjacent nodes. Let’s use our source node as an example:
adj_list_highways_and_nodes[['3891123785']] %>% names() ## [1] "3891123784" "3891123778" It’s breadthfirst search time!I will source the breadthfirst search function from last time. It can be found here.
source('breadth_first_search.R') bfs_path < breadth_first_search(adj_list_highways_and_nodes, nearest_to_newtown, nearest_to_small_bar) head(bfs_path, 10) ## [1] "3891123785" "3891123778" "6128173386" "3891123774" "1843636548" ## [6] "3123414079" "1833285840" "5802864448" "3123414075" "1840727738"Success! We have some nodes on our path!
Is there a path from Newtown to Small Bar?Yes! Of course a path exists from Newtown Station to Small Bar.
pretty_graph(osm_data_subset, nearest_to_newtown, nearest_to_small_bar, path_nodes = bfs_path, node_alpha = 0.7, edge_alpha = 0.1) What is the path’s length?I will be using my function get_path_lengths to get our path’s distance in kilometres. It can be found here.
source('get_path_length.R') bfs_distance_in_kms < get_path_length(bfs_path, igraph_highways_and_nodes) print(bfs_distance_in_kms) ## [1] 6.554328~6.5 kms…but Google Maps says:
NOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO!
With BFS, we have shown the existance of a path, which is an easier condition to satisfy than finding the shortest path from our source to destination. Let’s tackle this trickier problem.
Is this the shortest path?According to Google Maps, there is a shorter path. Therefore, our path is not the shortest path!
Introducing Dijkstra’s algorithmDijkstra’s algorithm is a famous algorithm which was created by Dutch computer scientist Edsger Dijkstra. It is a solution to the singlesource shortest path problem for graphs with nonnegative edge weights. Singlesource means that the algorithm solves for the shortest paths from a single source node to all other nodes in our graph. In our case, we’re really only interested in the shortest path from our source node (Newtown Station), to our destination node (Small Bar). But the output of Dijkstra will give us the shortest path from Newtown Station to every other node in our graph! Exciting!
Here is my implementation in (Rlike) pseudo code.
dijkstra(graph_representation, source_node) { # input: #  some graph represented as a weighted adjacency list or matrix #  source node ID # output: #  vec of distances of shortest paths from all nodes in graph to source node #  vec of predecessors of nodes on shortests paths LARGE_DISTANCE = 999999 vec_of_distances = vector of all node IDs in graph initialised with LARGE_DISTANCE vec_of_distances[source_node] = 0 vec_of_predecessors = vector of all node IDs initialised to NA vec_of_predecessors[source_node] = source_node vec_of_nodes_to_process = vector of all node IDs in graph while nodes_left_to_process(vec_of_nodes_to_process) { min_distance_node = node with smallest distance in vec_of_distances min_distance = distance to min_distance_node for each adjacent_node to min_dist_node { distance_from_adj_to_min_dist = distance from adjacent node to min_distance_node known_dist_to_adj_node = currently known distance to adjacent node from vec_of_distances # if the distance to the adjacent node is smaller if we go via the min_distance_node... if min_distance + distance_from_adj_to_min_dist < known_dist_to_adj_node { # set the distance to the adjacent node to the new, smaller distance via the min dist node vec_of_distances[adjacent_node] = min_distance + distance_from_adj_to_min_dist # set the predecessor to the adjacent node as the min dist node vec_of_predecessors[adjacent_node] = min_distance_node } } # we are done processing the min_dist_node so remove it from vec_of_nodes_to_process remove_min_dist_node(vec_of_nodes_to_process) } }Let’s break this down!
Input: some representation of a weighted graphPreviously, we cared about the existence of a path. The lengths of the edges on the path didn’t matter. All we wanted to know was whether we could get from one point of the graph to another. This time, we are trying to answer a more complex problem. We want to know the answer to this question:
Of all the paths from the source node to the destination node, which is the shortest?
To answer this question, we need to know the edge weights (in this case, the lengths of edges between nodes, in metres. osmar's as_igraph function creates estimates of these edge weights for us in this line by calling distHaversine from the geosphere package like so:
weights < distHaversine(x[from, c("lon", "lat")], x[to, c("lon", "lat")])So it turns out that we already have our weights in our igraph_highways_and_nodes object! To find out about the **haversine formula**, see this awesome page!
Let’s create ourselves a weighted adjacency list. I couldn’t find a way to do it using the osmar and igraph packages so here is my implementation.
Initialise!There are a few things we need to do before we can get to the loops that are at the heart of Dijkstra.
What is the LARGE_DISTANCE constant and vec_of_distances?The type of graph search we are going to use in Dijkstra is a priorityfirst search. This implies that we will be processing our nodes in some specific order by some definition of priority.
So what is our definition of priority? We want to process nodes in order of distance from our source node, where those closest to our source node are processed before those that are further away.
Let’s compare this to our breadthfirst search algorithm from the previous post. Here, we started with our source node and processed each node adjacent to it in some arbitrary order (in this case, the order they happened to appear in our data!). We paid no attention to edge weights. We added each adjacent node to the end of our queue (again, in some arbitrary order). Then we procesed the node at the front of our queue. Et cetera, et cetera, et cetera! So it follows that if all edge weights in are graph are equal, Dijkstra becomes a breathfirst search algorithm as we lose the ability to prioritise nodes by distance from our source node!
To make our algorithm process nodes in order of distance from our source node, we create a vector of node distances, where the names attribute of our distances vector contains node IDs. We initialise all our distances to some extremelly large number (or infinity). We then find our source node in our distances vector, and set its distance value to zero. This ensures that When we ask the question “Of all the nodes left to process, which has the smallest distance from our source node?” in our first iteration of our while loop, we start at our source node!
Here is a slice of our distances vector immediately after its initialisation:
What is vec_of_predecessors?We create another vector of all the nodes in our graph. But this time, the values of our vector will be the predecessor of the node along the currently known minimum distance path. Once our algorithm has finished running, we’ll use this to recover our shortest path in the same way as we did in the last post for BFS.
In this case, we initialise all values with NAs. Here’s a slice of our predecessors vector:
What is vec_of_nodes_to_process?This one is easy! This is a vector of all the node IDs in our graph. It’ll be used to keep track of which nodes are yet to be processed in our graph.
Time to get loopy while….While we still have nodes left…
find the min_distance_nodeHere, we find the node with the currently known smallest distance from our source node in our distances vector. This is where we find which node to prioritise in our priorityfirst search!
and for each adjacent node…We now know which node to prioritise. We will then go through each node adjacent to our min_distance_node in no particular order and compare some distances.
perform edge relaxationWhat the hell is edge relaxation? I will explain this visually using one of the many iterations of our while loop.
We have 2 coloured nodes in the below graph:
 Green – our source node (Newtown Station).
 Red – our minimum distance node. Of all our nodes that haven’t been processed yet, this one has the smallest distance from our source node.
Here is the path from our source node to our minimum distance node.
We then start processing nodes adjacent to our minimum distance node. Our adjacent node is coloured in orange:
Here is the currently known path from our source node to the adjacent node.
We can clearly see that this path is longer than the path to our minimum distance node! We can get to the adjacent node via the minimum distance node in a shorter distance than via the orange path.
How can we check for shorter ways to get to our adjacent node using some code? In our pseudocode, we have the following inequality:
if min_distance + distance_from_adj_to_min_dist < known_dist_to_adj_node
Here is a description of each variable:
 min_distance = the length of the path from source to the minimum distance node that is currently being processed (i.e. the length of the red path).
 distance_from_adj_to_min_dist = the weight of the direct edge (i.e. the distance) from our minimum distance node to our adjacent node.
 known_dist_to_adj_node = the length of the currently known path from our source node to our adjacent node (i.e. our orange path).
In other words, if the length of the minimum distance path plus the incremental edge length from the minimum distance node to the adjacent node is less than the currently known path to the adjacent node, we can lower the cost (i.e. distance) of getting to the adjacent node by going via the minimum distance node.
Edge relaxation is defined by these lines:
if min_distance + distance_from_adj_to_min_dist < known_dist_to_adj_node
> vec_of_distances[adjacent_node] = min_distance + distance_from_adj_to_min_dist > vec_of_predecessors[adjacent_node] = min_distance_nodeWe have established that we can get to the adjacent node via the minimum distance node in a shorter distance than that of the currently known path to the adjacent node. So we simply update our vector of distances and vector of predecessors to reflect our finding!
This new path via the minimum distance node is coloured in green in this graph:
Then remove min_distance_node from vec_of_nodes_to_processOnce all nodes adjacent to the minimum distance node have been processed, we remove it from the vector of nodes that are left to process.
NEXT!While we still have nodes left to process, we take the one with the currently known minimum distance and continue until we run out of nodes!
Dijkstra time!Let’s do this! Here is my implementation of Dijkstra.
source('dijkstra.R')Execute!
results < dijkstra_shortest_path(igraph_highways_and_nodes, nearest_to_newtown, nearest_to_small_bar)In the following, satisfyingtowatch video, the minimum distance nodes are coloured in blue. You will see orange circles jumping around our graph – these are the adjacent nodes affected by edge relaxation!
Here is our shortest path:
pretty_graph(osm_data_subset, nearest_to_newtown, nearest_to_small_bar, path_nodes = results$path, node_alpha = 0.7, edge_alpha = 0.1)And here is its length:
distances_dijkstra < results$distances results$distances[get_node_id(nearest_to_small_bar)]/1000 ## 1427601758 ## 4.877405 pretty_graph(osm_data_subset, nearest_to_newtown, nearest_to_small_bar, path_nodes = results$path, node_alpha = 0.7, edge_alpha = 0.1)~4.9 kms! That’s more like it!!!
What have we learnt? OSM data is hierarchical
 The igraph package contains some useful functions to help us work with graphs
 We ran breadthfirst search on our map after we converted it to an adjacency list
 We learnt about the OSM API
 We learnt about the details of Dijkstra’s algorithm and wrote a low level implementation of it
In the final (I promise!) post in this series, we will cover a few things.
 Why does Dijkstra work? It’s time to get mathematical!
 Unfortunately, the shortest path turns out to be a boring path to walk. Our original problem was to come up with (reasonable and interesting) new paths to walk to work. We will figure out a way to accomplish this.
Justin
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Embracing the Random  R. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
A Detailed Guide to the ggplot Scatter Plot in R
(This article was first published on Learn R Programming & Build a Data Science Career  Michael Toth, and kindly contributed to Rbloggers)
When it comes to data visualization, flashy graphs can be fun. But if you’re trying to convey information, especially to a broad audience, flashy isn’t always the way to go.
Last week I showed how to work with line graphs in R.
In this article, I’m going to talk about creating a scatter plot in R. Specifically, we’ll be creating a ggplot scatter plot using ggplot‘s geom_point function.
A scatter plot is a twodimensional data visualization that uses points to graph the values of two different variables – one along the xaxis and the other along the yaxis. Scatter plots are often used when you want to assess the relationship (or lack of relationship) between the two variables being plotted.
Scatter Plot of Adam Sandler Movies from FiveThirtyEight
For example, in this graph, FiveThirtyEight uses Rotten Tomatoes ratings and Box Office gross for a series of Adam Sandler movies to create this scatter plot. They’ve additionally grouped the movies into 3 categories, highlighted in different colors.
The Famous Gapminder Scatter Plot of Life Expectancy vs. Income by Country
This scatter plot, initially created by Hans Rosling, is famous among data visualization practitioners. It graphs the life expectancy vs. income for countries around the world. It also uses the size of the points to map country population and the color of the points to map continents, adding 2 additional variables to the traditional scatter plot.
Hans Rosling used a famously provocative and animated presentation style to make this data come alive. He used his presentations to advocate for sustainable global development through the Gapminder Foundation.
Hans Rosling’s example shows how simple graphic styles can be powerful tools for communication and change when used properly! Convinced? Let’s dive into this guide to creating a ggplot scatter plot in R!
Follow Along With the WorkbookI’ve created a free workbook to help you apply what you’re learning as you read.
The workbook is an R file that contains all the code shown in this post as well as additional questions and exercises to help you understand the topic even deeper.
If you want to really learn how to create a scatter plot in R so that you’ll still remember weeks or even months from now, you need to practice.
So Download the workbook now and practice as you read this post!
Introduction to ggplotBefore we get into the ggplot code to create a scatter plot in R, I want to briefly touch on ggplot and why I think it’s the best choice for plotting graphs in R.
ggplot is a package for creating graphs in R, but it’s also a method of thinking about and decomposing complex graphs into logical subunits.
ggplot takes each component of a graph–axes, scales, colors, objects, etc–and allows you to build graphs up sequentially one component at a time. You can then modify each of those components in a way that’s both flexible and userfriendly. When components are unspecified, ggplot uses sensible defaults. This makes ggplot a powerful and flexible tool for creating all kinds of graphs in R. It’s the tool I use to create nearly every graph I make these days, and I think you should use it too!
Investigating our datasetThroughout this post, we’ll be using the mtcars dataset that’s built into R. This dataset contains details of design and performance for 32 cars. Let’s take a look to see what it looks like:
The mtcars dataset contains 11 columns:
 mpg: Miles/(US) gallon
 cyl: Number of cylinders
 disp: Displacement (cu.in.)
 hp: Gross horsepower
 drat: Rear axle ratio
 wt: Weight (1000 lbs)
 qsec: 1/4 mile time
 vs: Engine (0 = Vshaped, 1 = straight)
 am: Transmission (0 = automatic, 1 = manual)
 gear: Number of forward gears
 carb: Number of carburetors
ggplot uses geoms, or geometric objects, to form the basis of different types of graphs. Previously I talked about geom_line, which is used to produce line graphs. Today I’ll be focusing on geom_point, which is used to create scatter plots in R.
library(tidyverse) ggplot(mtcars) + geom_point(aes(x = wt, y = mpg))Here we are starting with the simplest possible ggplot scatter plot we can create using geom_point. Let’s review this in more detail:
First, I call ggplot, which creates a new ggplot graph. It’s essentially a blank canvas on which we’ll add our data and graphics. In this case, I passed mtcars to ggplot to indicate that we’ll be using the mtcars data for this particular ggplot scatter plot.
Next, I added my geom_point call to the base ggplot graph in order to create this scatter plot. In ggplot, you use the + symbol to add new layers to an existing graph. In this second layer, I told ggplot to use wt as the xaxis variable and mpg as the yaxis variable.
And that’s it, we have our scatter plot! It shows that, on average, as the weight of cars increase, the milespergallon tends to fall.
Changing point color in a ggplot scatter plotExpanding on this example, we can now play with colors in our scatter plot.
ggplot(mtcars) + geom_point(aes(x = wt, y = mpg), color = 'blue')You’ll note that this geom_point call is identical to the one before, except that we’ve added the modifier color = 'blue' to to end of the line. Experiment a bit with different colors to see how this works on your machine. You can use most color names you can think of, or you can use specific hex colors codes to get more granular.
Now, let’s try something a little different. Compare the ggplot code below to the code we just executed above. There are 3 differences. See if you can find them and guess what will happen, then scroll down to take a look at the result. If you’ve read my previous ggplot guides, this bit should look familiar!
mtcars$am < factor(mtcars$am) ggplot(mtcars) + geom_point(aes(x = wt, y = mpg, color = am))This graph shows the same data as before, but now there are two different colors! The red dots correspond to automatic transmission vehicles, while the blue dots represent manual transmission vehicles. Did you catch the 3 changes we used to change the graph? They were:
 First, we converted the am variable to a factor. What do you think happens if we don’t do this? Give it a try!
 Instead of specifying color = 'blue', we specified color = am
 We moved the color parameter inside of the aes() parentheses
Let’s review each of these changes:
Converting the am variable to a factorIn the dataset, am was initially a numeric variable. You can check this by running class(mtcars$am). When you pass a numeric variable to a color scale in ggplot, it creates a continuous color scale.
In this case, however, there are only 2 values for the am field, corresponding to automatic and manual transmission. So it makes our graph more clear to use a discrete color scale, with 2 color options for the two values of am. We can accomplish this by converting the am field from a numeric value to a factor, as we did above.
On your own, try graphing both with and without this conversion to factor. If you’ve already converted to factor, you can reload the dataset by running data(mtcars) to try graphing as numeric!
This point is a bit tricky. Check out my workbook for this post for a guided exploration of this issue in more detail!
Specifying color = am and moving it within the aes() parenthesesI’m combining these because these two changes work together.
Before, we told ggplot to change the color of the points to blue by adding color = 'blue' to our geom_point() call.
What we’re doing here is a bit more complex. Instead of specifying a single color for our points, we’re telling ggplot to map the data in the am column to the color aesthetic.
This means we are telling ggplot to use a different color for each value of am in our data! This mapping also lets ggplot know that it also needs to create a legend to identify the transmission types, and it places it there automatically!
Changing point shapes in a ggplot scatter plotLet’s look at a related example. This time, instead of changing the color of the points in our scatter plot, we will change the shape of the points:
mtcars$am < factor(mtcars$am) ggplot(mtcars) + geom_point(aes(x = wt, y = mpg, shape = am))The code for this ggplot scatter plot is identical to the code we just reviewed, except we’ve substituted shape for color. The graph produced is quite similar, but it uses different shapes (triangles and circles) instead of different colors in the graph. You might consider using something like this when printing in black and white, for example.
A deeper review of aes() (aesthetic) mappings in ggplotWe just saw how we can create graphs in ggplot that map the am variable to color or shape in a scatter plot. ggplot refers to these mappings as aesthetic mappings, and they include everything you see within the aes() in ggplot.
Aesthetic mappings are a way of mapping variables in your data to particular visual properties (aesthetics) of a graph.
I know this can sound a bit theoretical, so let’s review the specific aesthetic mappings you’ve already seen as well as the other mappings available within geom_point.
Reviewing the list of geom_point aesthetic mappingsThe main aesthetic mappings for a ggplot scatter plot include:
 x: Map a variable to a position on the xaxis
 y: Map a variable to a position on the yaxis
 color: Map a variable to a point color
 shape: Map a variable to a point shape
 size: Map a variable to a point size
 alpha: Map a variable to a point transparency
From the list above, we’ve already seen the x, y, color, and shape aesthetic mappings.
x and y are what we used in our first ggplot scatter plot example where we mapped the variables wt and mpg to xaxis and yaxis values. Then, we experimented with using color and shape to map the am variable to different colored points or shapes.
In addition to those, there are 2 other aesthetic mappings commonly used with geom_point. We can use the alpha aesthetic to change the transparency of the points in our graph. Finally, the size aesthetic can be used to change the size of the points in our scatter plot.
Note there are two additional aesthetic mappings for ggplot scatter plots, stroke, and fill, but I’m not going to cover them here. They’re only used for particular shapes, and have very specific use cases beyond the scope of this guide.
Changing the size aesthetic mapping in a ggplot scatter plot ggplot(mtcars) + geom_point(aes(x = wt, y = mpg, size = cyl))In the code above, we map the number of cylinders (cyl), to the size aesthetic in ggplot. Cars with more cylinders display as larger points in this graph.
Note: A scatter plot where the size of the points vary based on a variable in the data is sometimes called a bubble chart. The scatter plot above could be considered a bubble chart!
In general, we see that cars with more cylinders tend to be clustered in the bottom right of the graph, with larger weights and lower miles per gallon, while those with fewer cylinders are on the top left. That said, it’s a bit hard to make out all the points in the bottom right corner. How can we solve that issue? Let’s learn more about the alpha aesthetic to find out!
Changing transparency in a ggplot scatter plot with the alpha aesthetic ggplot(mtcars) + geom_point(aes(x = wt, y = mpg, alpha = cyl))In this code we’ve mapped the alpha aesthetic to the variable cyl. Cars with fewer cylinders appear more transparent, while those with more cylinders are more opaque. But in this case, I don’t think this helps us to understand relationships in the data any better. Instead, it just seems to highlight the points on the bottom right. I think this is a bad graph!
How else can we use the alpha aesthetic to improve the readability of our graph? Let’s turn back to our code from above where we mapped the cylinders to the size variable, creating what I called a bubble chart. Remember how it was difficult to make out all of the cars in the bottom right? What if we made all of the points in the graph semitransparent so that we can see through the bubbles that are overlapping? Let’s try!
ggplot(mtcars) + geom_point(aes(x = wt, y = mpg, size = cyl), alpha = 0.3)This makes it much easier to see the clustering of larger cars in the bottom right while not reducing the importance of those points in the top left! This is my favorite use of the alpha aesthetic in ggplot: adding transparency to get more insight into dense regions of points.
Aesthetic mappings vs. parameters in ggplotAbove, we saw that we are able to use color in two different ways with geom_point. First, we were able to set the color of our points to blue by specifying color = 'blue' outside of our aes() mappings. Then, we were able to map the variable am to color by specifying color = am inside of our aes() mappings.
Similarly, we saw two different ways to use the alpha aesthetic as well. First, we mapped the variable cyl to alpha by specifying alpha = cyl inside of our aes() mappings. Then, we set the alpha of all points to 0.3 by specifying alpha = 0.3 outside of our aes() mappings.
What is the difference between these two ways of dealing with the aesthetic mappings available to us?
Each of the aesthetic mappings you’ve seen can also be used as a parameter, that is, a fixed value defined outside of the aes() aesthetic mappings. You saw how to do this with color when we made the scatter plot points blue with color = 'blue' above. Then, you saw how to do this with alpha when we set the transparency to 0.3 with alpha = 0.3. Now let’s look at an example of how to do this with shape in the same manner:
ggplot(mtcars) + geom_point(aes(x = wt, y = mpg), shape = 18)Here, we specify to use shape 18, which corresponds to this diamond shape you see here. Because we specified this outside of the aes(), this applies to all of the points in this graph!
To review what values shape, size, and alpha accept, just run ?shape, ?size, or ?alpha from your console window! For even more details, check out vignette("ggplot2specs")
Common errors with aesthetic mappings and parameters in ggplotWhen I was first learning R and ggplot, the difference between aesthetic mappings (the values included inside your aes()), and parameters (the ones outside your aes()) was constantly confusing me. Luckily, over time, you’ll find that this becomes second nature. But in the meantime, I can help you speed along this process with a few common errors that you can keep an eye out for.
Trying to include aesthetic mappings outside your aes() callIf you’re trying to map the cyl variable to shape, you should include shape = cyl within the aes() of your geom_point call. What happens if you include it outside accidentally, and instead run ggplot(mtcars) + geom_point(aes(x = wt, y = mpg), shape = cyl)? You’ll get an error message that looks like this:
Whenever you see this error about object not found, be sure to check that you’re including your aesthetic mappings inside the aes() call!
Trying to specify parameters inside your aes() callOn the other hand, if we try including a specific parameter value (for example, color = 'blue') inside of the aes() mapping, the error is a bit less obvious. Take a look:
ggplot(mtcars) + geom_point(aes(x = wt, y = mpg, color = 'blue'))In this case, ggplot actually does produce a scatter plot, but it’s not what we intended.
For starters, the points are all red instead of the blue we were hoping for! Also, there’s a legend in the graph that simply says ‘blue’.
What’s going on here? Under the hood, ggplot has taken the string ‘blue’ and effectively created a new hidden column of data where every value simple says ‘blue’. Then, it’s mapped that column to the color aesthetic, like we saw before when we specified color = am. This results in the legend label and the color of all the points being set, not to blue, but to the default color in ggplot.
If this is confusing, that’s okay for now. Just remember: when you run into issues like this, double check to make sure you’re including the parameters of your graph outside your aes() call!
You should now have a solid understanding of how to create a scatter plot in R using the ggplot scatter plot function, geom_point!
Experiment with the things you’ve learned to solidify your understanding. You can download my free workbook with the code from this article to work through on your own.
I’ve found that working through code on my own is the best way for me to learn new topics so that I’ll actually remember them when I need to do things on my own in the future.
Download the workbook now to practice what you learned!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Learn R Programming & Build a Data Science Career  Michael Toth. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Google’s Eigenvector… or how a Random Surfer finds the most relevant Webpages
(This article was first published on RBloggers – Learning Machines, and kindly contributed to Rbloggers)
Like most people you will have used a search engine lately, like Google. But have you ever thought about how it manages to give you the most fitting results? How does it order the results so that the best are on top? Read on to find out!
The earliest search engines either had human curated indices, like Yahoo! or used some simple heuristic like the more often the keyword you were looking for was mentioned on a page the better, like Altavista – which led to all kinds of crazy effects like certain keywords being repeated thousands of times on webpages to make them more “relevant”.
Now, most of those search engines are long gone because a new kid arrived on the block: Google! Google’s search engine results were much better than all of the competition and they became the dominant player in no time. How did they do that?
The big idea was in fact published by the two founders: Sergey Brin and Lawrence Page, it is called the pagerank algorithm (which is of course a pun because one of the authors was named Page too). The original paper can be found here: S. Brin, L. Page: The Anatomy of a LargeScale Hypertextual Web Search Engine.
Let us start with another, related question: which properties are the best to own in Monopoly? Many would instinctively answer with the most expensive ones, i.e. Park Place and Boardwalk. But a second thought reveals that those might be the the ones where you get the biggest rent if somebody lands on them but that the last part is the caveat… “IF” somebody lands on them! The best streets are actually the ones where players land on the most. Those happen to be the orange streets, St. James Place, Tennessee Avenue and New York Avenue and therefore they are the key to winning the game.
How do find those properties? For example by simulation: you just simulate thousands of dice rolls and see where the players land.
A similar idea holds true for finding the best web pages: you just start from a random position and simulate a surfer who visits different web pages by chance. For each surfing session you tally the respective webpage where she ends up and after many runs we get a percentage for each page. The higher this percentage is the more relevant the webpage!
Let us do this with some R code. First we define a very small net and plot it (the actual example can be found in chapter 30 of the very good book “Chaotic Fishponds and Mirror Universes” by Richard Elwes):
library(igraph) ## ## Attaching package: 'igraph' ## The following objects are masked from 'package:stats': ## ## decompose, spectrum ## The following object is masked from 'package:base': ## ## union # cols represent outgoing links, rows incoming links # A links to C, D; B links to A; C links to A; D links to A,B,C M < matrix(c(0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0), nrow = 4) colnames(M) < rownames(M) < c("A", "B", "C", "D") M ## A B C D ## A 0 1 1 1 ## B 0 0 0 1 ## C 1 0 0 1 ## D 1 0 0 0 g < graph_from_adjacency_matrix(t(M)) # careful with how the adjacency matrix is defined > transpose of matrix plot(g)Now, we are running the actual simulation. We define two helper functions for that, next_page for getting a random but possible next page given the page our surfer is on at the moment and last_page which gives the final page after N clicks:
next_page < function(page, graph) { l < sample(rownames(graph)[as.logical(graph[ , as.logical(page)])], 1) as.numeric(rownames(graph) == l) } last_page < function(page, graph, N = 100) { for (i in seq(N)) { page < next_page(page, graph) } page } current_page < c(1, 0, 0, 0) # random surfer starting from A random_surfer < replicate(2e4, last_page(current_page, M, 50)) round(rowSums(random_surfer) / sum(random_surfer), 2) ## [1] 0.43 0.07 0.28 0.22So we see that page A is the most relevant one because our surfer ends up being there in more than 40% of all sessions, after that come the pages C, D and B. When you look at the net that makes sense, since all pages refer to A whereas B gets only one link, so it doesn’t seem to be that relevant.
As you have seen the simulation even for this small net took quite long so we need some clever mathematics to speed up the process. One idea is to transform our matrix which represents the network into a matrix which gives the probabilities of landing on the next pages and then multiply the probability matrix with the current position (and thereby transform the probabilities). Let us do this for the first step:
M_prob < prop.table(M, 2) # create probability matrix M_prob ## A B C D ## A 0.0 1 1 0.3333333 ## B 0.0 0 0 0.3333333 ## C 0.5 0 0 0.3333333 ## D 0.5 0 0 0.0000000 M_prob %*% current_page ## [,1] ## A 0.0 ## B 0.0 ## C 0.5 ## D 0.5The result says that there is a fiftyfifty chance of landing on C or D. When you look at the graph you see that this is correct since there are two links, one to C and one to D! For the next step you would have to multiply the matrix with the result again, or first multiply the matrix with itself before multiplying with the current position, which gives:
If we want to do this a hundred times we just have to raise this probability matrix to the one hundredth power:
We use the %^% operator in the expm package (on CRAN) for that:
library(expm) ## Loading required package: Matrix ## ## Attaching package: 'expm' ## The following object is masked from 'package:Matrix': ## ## expm r < M_prob %^% 100 %*% current_page r ## [,1] ## A 0.42857143 ## B 0.07142857 ## C 0.28571429 ## D 0.21428571Again, we get the same result! You might ask: why ? The answer is that this is in most cases enough to get a stable result so that any further multiplication still results in the same result:
The last equations opens up still another possibility: we are obviously looking for a vector which goes unaffected when multiplied by the matrix . There is a mathematical name for that kind of behaviour: eigenvector! As you might have guessed the name is an import from the German language where it means something like “own vector”.
This hints at the problem we were solving all along (without consciously realizing perhaps): a page is the more relevant the more relevant a page is that links to it… now we have to know the importance of that page but that page two is the more relevant… and so on and so forth, we are going in circles here. The same is true when you look at the equation above: you define in terms of – is the eigenvector of matrix !
There are very fast and powerful methods to find the eigenvectors of a matrix, and the corresponding eigen function is even a function in base R:
lr < Re(eigen(M_prob)$vectors[ , 1]) # real parts of biggest eigenvector lr / sum(lr) # normalization ## [1] 0.42857143 0.07142857 0.28571429 0.21428571Again, the same result! You can now understand the title of this post and titles of other articles about the pagerank algorithm and Google like “The $25,000,000,000 eigenvector”.
Yet, a word of warning is in order: there are cases where the probability matrix is not diagonalizable (we won’t get into the mathematical details here), which means that the eigenvector method won’t give sensible results. To check this the following code must evaluate to TRUE:
ev < eigen(M_prob)$values length(unique(ev)) == length(ev) ## [1] TRUEWe now repeat the last two methods for a bigger network:
set.seed(1415) n < 10 g < sample_gnp(n, p = 1/4, directed = TRUE) # create random graph g < set_vertex_attr(g, "name", value = LETTERS[1:n]) plot(g) M < t(as_adjacency_matrix(g, sparse = FALSE)) M_prob < prop.table(M, 2) # create probability matrix M_prob ## A B C D E F G H I J ## A 0.00 0 0 1 0.5 0.5 0.5 0.0000000 0.0000000 0.5 ## B 0.00 0 0 0 0.0 0.0 0.0 0.3333333 0.0000000 0.0 ## C 0.00 1 0 0 0.0 0.0 0.0 0.0000000 0.3333333 0.5 ## D 0.25 0 0 0 0.0 0.0 0.0 0.0000000 0.0000000 0.0 ## E 0.25 0 0 0 0.0 0.0 0.5 0.3333333 0.3333333 0.0 ## F 0.00 0 1 0 0.0 0.0 0.0 0.0000000 0.3333333 0.0 ## G 0.25 0 0 0 0.0 0.0 0.0 0.0000000 0.0000000 0.0 ## H 0.00 0 0 0 0.5 0.0 0.0 0.0000000 0.0000000 0.0 ## I 0.00 0 0 0 0.0 0.5 0.0 0.0000000 0.0000000 0.0 ## J 0.25 0 0 0 0.0 0.0 0.0 0.3333333 0.0000000 0.0 current_page < c(1, rep(0, n1)) r < M_prob %^% 100 %*% current_page r ## [,1] ## A 0.27663574 ## B 0.02429905 ## C 0.08878509 ## D 0.06915881 ## E 0.14579434 ## F 0.10654199 ## G 0.06915881 ## H 0.07289723 ## I 0.05327107 ## J 0.09345787 lr < Re(eigen(M_prob)$vectors[ , 1]) lr / sum(lr) # normalization of the real parts ## [1] 0.27663551 0.02429907 0.08878505 0.06915888 0.14579439 0.10654206 ## [7] 0.06915888 0.07289720 0.05327103 0.09345794We can now order the pages according to their importance – like the first 10 results of a google search:
search < data.frame(Page = LETTERS[1:n], Rank = r) search[order(search$Rank, decreasing = TRUE), ] ## Page Rank ## A A 0.27663574 ## E E 0.14579434 ## F F 0.10654199 ## J J 0.09345787 ## C C 0.08878509 ## H H 0.07289723 ## D D 0.06915881 ## G G 0.06915881 ## I I 0.05327107 ## B B 0.02429905Looking at the net, does the resulting order make sense to you?
Congratulations, you now understand the big idea behind one the greatest revolutions in information technology!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: RBloggers – Learning Machines. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
conditionz: control how many times conditions are thrown
(This article was first published on rOpenSci  open tools for open science, and kindly contributed to Rbloggers)
conditionz is a new (just on CRAN today) R package for controlling how many times conditions are thrown.
This package arises from an annoyance in another set of packages I maintain: The brranching package uses
the taxize package internally, calling it’s function taxize::tax_name(). The taxize::tax_name() function
throws useful messages to the user if their API key is not found, and gives them instructions
on how to find it. However, the user does not have to get an API key. If they don’t they then get subjected to
lots of repeats of the same message.
I wondered if there’s anything that could be done about this. That is, if the same message is going to be
thrown that was already thrown within a function call, just skip additional messages that are the same.
(The tibble package has something like this, but as part of the package itself AFAICT)
The package has the following API:
 ConditionKeeper
 handle_conditions()
 handle_messages()
 handle_warnings()
Exported but mostly meant for internal use:
 capture_message()
 capture_warning()
Links:
 conditionz source code: https://github.com/ropenscilabs/conditionz
 conditionz on CRAN: https://cran.rstudio.com/web/packages/conditionz/
 blog post on the braintstorming period of this package: https://recology.info/2018/12/conditioncontrol/
Install from CRAN
install.packages("conditionz")Development version
remotes::install_github("ropenscilabs/conditionz")Load conditionz
library(conditionz) ConditionKeeperConditionKeeper is an R6 class that keeps track of conditions and lets
us determine if conditions have been encountered, how many times they’ve
been encountered, etc.
A simple function that throws messages
squared < function(x) { stopifnot(is.numeric(x)) y < x^2 if (y > 20) message("woops, > than 20! check your numbers") return(y) } foo < function(x) { vapply(x, function(z) squared(z), numeric(1)) } bar < function(x, times = 1) { y < ConditionKeeper$new(times = times) on.exit(y$purge()) vapply(x, function(z) y$handle_conditions(squared(z)), numeric(1)) }Running the function normally throws many messages
foo(1:10) #> woops, > than 20! check your numbers #> woops, > than 20! check your numbers #> woops, > than 20! check your numbers #> woops, > than 20! check your numbers #> woops, > than 20! check your numbers #> woops, > than 20! check your numbers #> [1] 1 4 9 16 25 36 49 64 81 100Using in ConditionKeeper allows you to control how many messages
are thrown
The function handle_conditions/handle_warnings/handle_messages use
ConditionKeeper internally, and are meant as a simpler, but less flexible
alternative.
For now, handle_conditions/handle_warnings/handle_messages only allow
throwing the condition 1 time with the times parameter. I hope to fix this
so you can choose any number you like.
A small example. Here a function that prints a message with every part of the
input vector.
If we call that function with the vector 1:5 we get five messages
foo(1:5) #> you gave: 1 #> you gave: 2 #> you gave: 3 #> you gave: 4 #> you gave: 5 #> [1] 1 2 3 4 5If we wrap the foo() function call in handle_conditions we get only
one message thrown
The default for the condition parameter is “message”, and calling handle_messages
does the same thing
For warnings, call handle_warnings() or set handle_conditions(..., condition = "warning")
To do I’m expecting use cases that I haven’t accounted for, the socalled unknown unknowns
 The handle_conditions/handle_warnings/handle_messages need some work to be able
to actually have times parameter work as advertised  I’ll use this package in the brranching package soon, so it will get a real world
test
Get in touch if you have any conditionz questions in the
issue tracker or the
rOpenSci discussion forum.
To leave a comment for the author, please follow the link and comment on their blog: rOpenSci  open tools for open science. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Join, split, and compress PDF files with pdftools
(This article was first published on rOpenSci  open tools for open science, and kindly contributed to Rbloggers)
Last month we released a new version of pdftools and a new companion package qpdf for working with pdf files in R. This release introduces the ability to perform pdf transformations, such as splitting and combining pages from multiple files. Moreover, the pdf_data() function which was introduced in pdftools 2.0 is now available on all major systems.
Split and Join PDF filesIt is now possible to split, join, and compress pdf files with pdftools. For example the pdf_subset() function creates a new pdf file with a selection of the pages from the input file:
# Load pdftools library(pdftools) # extract some pages pdf_subset('https://cran.rproject.org/doc/manuals/rrelease/Rintro.pdf', pages = 1:3, output = "subset.pdf") # Should say 3 pdf_length("subset.pdf")Similarly pdf_combine() is used to join several pdf files into one.
# Generate another pdf pdf("test.pdf") plot(mtcars) dev.off() # Combine them with the other one pdf_combine(c("test.pdf", "subset.pdf"), output = "joined.pdf") # Should say 4 pdf_length("joined.pdf")The split and join features are provided via a new package qpdf which wraps the qpdf C++ library. The main benefit of qpdf is that no external software (such as pdftk) is needed. The qpdf package is entirely self contained and works reliably on all operating systems with zero system dependencies.
Data extraction now available on Linux tooThe pdftools 2.0 announcement post from December introduced the new pdf_data() function for extracting individual text boxes from pdf files. However it was noted that this function was not yet available on most Linux distributions because it requires a recent fix from poppler 0.73.
I am happy to say that this should soon work on all major Linux distributions. Ubuntu has upgraded to poppler 0.74 on Ubuntu Disco which was released this week. I also created a PPA for Ubuntu 16.04 (Xenial) and 18.04 (Bionic) with backports of poppler 0.74. This makes it possible to use pdf_data on Ubuntu LTS servers, including Travis:
sudo addaptrepository ppa:opencpu/poppler sudo aptget update sudo aptget install libpopplercppdevMoreover, the upcoming Fedora 30 will ship with popplerdevel 0.73.
Finally, the upcoming Debian “Buster” release will ship with poppler 0.71, but the Debian maintainers were nice enough to let me backport the required patch from poppler 0.73, so pdf_data() will work on Debian (and hence CRAN) as well!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: rOpenSci  open tools for open science. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
77th Tokyo.R Users Meetup Roundup!
(This article was first published on R by R(yo), and kindly contributed to Rbloggers)
As the sakura bloomed in Tokyo, another TokyoR User
Meetup was held, this time
at SONY City! On April 13th useRs from all over Tokyo (and some even
from further afield) flocked to Osaki, Tokyo for a special session
focused on beginner R users, BeginneRs. We’ve also had several special
“themed” sessions in the past like TokyoR #69: Bayesian
statistics in June last year as well as
the TokyoR #70: Hadley Wickham + Joe Rickert
Special! last July. Although
mainly for BeginneRs, the LTs for this session were applications of
R/case studies and may be of more interest to nonbeginners if you want
to skip ahead to that section.
Like my previous round up posts (for TokyoR
#76 and
JapanR Conference
#9)
I will be going over around half of all the talks. Hopefully, my efforts
will help spread the vast knowledge of Japanese R users to the wider R
community. Throughout I will also post helpful blog posts and links from
other sources if you are interested in learning more about the topic of
a certain talk. You can follow Tokyo.R by searching for the
#TokyoR hashtag on Twitter.
These were the same set of beginner user focused talks that happens at
the start of every TokyoR meetup:
 Basics of R Programming by
kotatyamtema  Data munging with the tidyverse by
y_mattu  Visualization with R by
koriakane
Since this was a special BeginneR session, the main talks were all
focused on even more introductory stuff on top of the usual beginner’s
session in the previous section.
First up, @kilometer00 talked about doing data analysis with R. As a
brief introduction he talked about what exactly IS data analysis, how we
might go about forming a research problem and hypothesis, and how these
steps are important in figuring out what kind of modeling we might
attempt to do on the data at hand. The modeling techniques
@kilometer00 covered were just the basics such as single/multiple
linear regression and PCA.
In the last part of the talk he covered nested data modelling. By using
group_by() and nest() we can create a data frame that has one row
per group (for example, a row for each species group in the iris
data set) with has a new column called data which is itself a data
frame. Now, instead of having to repeat the modelling code over and over
again for each subset of rows (a model for each species in the iris
data set), by using the purrr::map_*() family of functions you can
apply your model to each of the groups that you specified earlier.
Filled with great explanatory graphics, plots, and code the slides are a
good example of teaching the basics of modelling with R.
Some other resources about modelling with R:
 Model buidling chapter in R for Data Science – Hadley
Wickham  Picking the best model with caret – Rachael
Tatman  Introduction to Statistical Learning – Gareth James, Daniela
Witten, Trevor Hastie, & Rob
Tibshirani  Applied Predictive Modeling – Max Kuhn & Kjell
Johnson
Another TokyoR organizer, @aad34210, talked about using the R Studio
IDE to maximize R’s strengths for programming and data analysis. After a
brief spiel on the early days of using R solely from the console he
talked about R Studio’s capabilities and the various options that can be
customized to suit your R needs. From installing R Studio, configuring
the four main panes, using R Projects, and using Global options,
aad34210 opened up his own R Studio window and pointed out the various
menu options in thorough detail to help beginneRs navigate without
getting overwhelmed. He rounded off the talk by showing the various
Cheat sheets (included one for R Studio itself) that can be obtained
from the Help tab.
Some other resources one might consider to learn R Studio are:
 Introduction to RStudio – Software
Carpentry  RStudio Essentials – RStudio
 RStudio IDE Cheat Sheet (shown above) –
RStudio
@u_ribo gave a talk about the benefits of creating a reproducible and
enclosed R environment using git and Docker. As an instructor who has
ran many R workshops, @u_ribo has ran into the problem of his learners
all having different OSs, versions of packages, and versions of R itself
causing much headache for all involved. This is also the case when
working in a team environment where analyses need to be shared and
reproducibility of results is essential.
To reduce these problems he gave a livedemo using a variety of R tools
such as the here package, the usethis package, and managing a
project environment with R Studio Projects (.Rproj). To go further in
depth he also talked about using git (made very easy with its seamless
integration with R Studio) and the use of Docker. To run Docker you
need an Docker “image” and a Docker “container”. An image is a file,
called a Dockerfile, that has the information about and configures the
Operating System for the environment. The container is the the actual
running instance of the “image” that is defined by the Docker file.
Going back to the issue of running a workshop, using Docker allows all
of the participants to run R inside a specific container, an enclosed environment
set up by the instructor, so that all the different dependencies and
version differences won’t prevent you from running the code provided in
the workshop.
Other good introductions to Docker and R:
 Docker, R, And Reproducibility – Colin
Fay  R Docker Tutorial – rOpenSci
Labs  Docker and Packrat – Joel
Nitta
@niszet talked about reproducible reporting with R Markdown. He was
certainly the right person to give a talk on this topic as he is the
author of the minibooks, “Create a Word document with R Markdown” and
“Create a PowerPoint presentation with R Markdown”. To begin, he talked
about cases where one writes a document, report, or any kind of output
and how it might be a good idea to be able to create it again for
“future me” or anybody else that might want to take a look. Normally,
you would run into problems such as “where did my data go?”, “how did I
preprocess my data?”, “” but you can mitigate these problems by using R
Markdown reports. Whether it’s importing your raw data, the
preprocessing/modelling/viz code, to the actual report and
documentation writeup R Markdown renders the document in a completely
clean environment each time so the results should be the same, every
time! As noted in the beginning, you can create many different kinds of
output such as Word document, PowerPoint presentation, html document,
presentation slides, and more – even business cards (with the pagedown
package)! Following an explanation of what you can do with R Markdown,
@niszet talked about how exactly one would use it to its best
capabilities. In a livedemo he took us through all the different
components of an R Markdown document:
 YAML header: Where one defines the “how” of the RMD such as the
title, template, output directory, output type, etc.  Code chunk and options: Where all your code (can be languages
besides R) that you want to be run should be located. Chunk options such as
whether to evaluate the code within, toggle showing the code, and
many more are also specified here.  Markdown text: Regular text using markdown. Can also include inline
code using “.  Various buttons/shortcut keys: Such as Ctrl + Shift + K to Knit!
Some other intros to R Markdown:
 R Markdown: The Definitive Guide – Yihui Xie, JJ Allaire, Garrett
Grolemund  R Markdown Cheat Sheet –
RStudio  R Markdown 入門 – Kazuhiro Maeda (in
Japanese)
It’s only been 3 (three!) days since @GotaMorishita started learning R
yet here he was giving a LT on finding a new place to live in Tokyo
using R! Tired of living with his parents @GotaMorishita decided to
live by himself and started thinking about ways to use R and machine
learning to search for a place with an optimal price and location. After
webscraping a housing website and preprocessing the data he came
across a problem: if he split the data into a training and test set for
selecting the best predictive model then he would be throwing away a
considerable amount of possible candidates for housing.
If @GotaMorishita took out 90% of the houses/apartments from the
training data and kept those as candidates for the test data, it woudl’ve meant
that the training data will have a markedly different distribution
compared to the test data set and the model created from the training
set wouldn’t be as accurate. This problem, called covariate shifting, is when the training data and
test data have different distributions but the conditional distribution
of the output values given the input data is unchanged. Using standard
methods of model selection such as crossvalidation or AIC in this
situation leads to biasedness. However, this problem can be mitigated by
weighting the loss function by importance (the ratio of training and
test input densities). You can find a more detailed description in the
research papers below. @GotaMorishita used xgboost to implement his
models, one with importance weighting and another without, and used
groupwise crossvalidation to tune the hyperparameters. The results are
shown below, comparing the overall test scores for all Tokyo districts
(left) and just the Sangenjaya district (right), the RMSE was smaller
when Importance Weighting was used.
Some more info on covariate shifting and importance weighting:
 Sugiyama, M., Krauledat, M., & Mueller, K. R. (2007). Covariate
shift adaptation by importance weighted cross validation. Journal of
Machine Learning Research, 8(May),
9851005  Sugiyama, M. (2012). Learning under nonstationarity: Covariate
shift adaptation by importance weighting. In Handbook of
Computational Statistics (pp. 927952). Springer, Berlin,
Heidelberg.
@sk_bono36 gave a presentation on using R for marketing purposes with
the rtweet package. In marketing there is a concept called a “persona”
which is a blueprint of what a certain type of person in your target
audience for your product/service is like. A basic persona template
can consist of their name, job title, demographic details, values, and
hobbies. You create these ideal customers through careful market
research involving both social media analytics and
interviewing/surveying the customers themselves. @sk_bono36 focused on
creating a persona (with “自転車/Bicycle” as the keyword for this case
study) by using rtweet then running Japanese text analysis with
RMeCab. Visualizing the data with word clouds and network graphs of
bigrams he was able to find a lot of information on Twitter users who
have “bicycle” on their profile or tweets and extract the common
characteristics of this type of person.
As a result of his analysis @sk_bono36 was able to create a persona of
people who like bicycles:
 20~30 Years Old
 Owns a road bike
 Friendly disposition
 Likes Anime/video games
 Does weight lifting
Some other intros to the rtweet package:
 rtweet workshop – Mike Kearney (author of
rtweet)  21 Recipes for Mining Twitter Data with rtweet – Bob
Rudis
@igjit, who also presented at
Japan.R back in December on
building an R compiler with
R,
gave a talk about another recent project of his which is a package that
acts as a type checking system for R. A problem that he finds in R is
that errors relating to having the wrong data type in the arguments of R
functions are only found when code is executed, not before. Frustrated
by this @igjit created his own package called
typrr that type checks your code! The
underlying checker that typrr runs is
Flycheck which is a syntax
checking extension for Emacs.
For now, the package only checks for the basic data types found in R,
integer, double, logical, and character and it can only check functions
with one argument only. He rounded off the talk by saying that he
created this package just for fun and he advised all the beginneRs in
the audience that people learn from doing rather than just
reading so to truly get better it’s best to go out and experiment!
 ao053934144: R in the Healthcare Industry
 hana_orin: Using Hierarchical
Linear Modeling to predict housing prices  AsaKiriSun: Predicting sample
size from Rbinom(n=10, size,
prob)
Following all of the talks, those who were staying for the afterparty
were served sushi and drinks! With a loud rendition of “kampai!”
(cheers!) R users from all over Tokyo began to talk about their
successes and struggles with R. A fun tradition at TokyoR is a
RockPaperScissors tournament with the prize being free data
science books (I still haven’t won any nor even gotten close to the last rounds)!
The prizes for this month was:
 A couple copies of “Create a Word document with R Markdown”
minibook by niszet.  3 copies of the Japanese translation (by Hoxom
Inc.) of “Feature Engineering for Machine
Learning” by Alice Zheng and Amanda Casari provided by
uribo.
TokyoR happens almost monthly and it’s a great way to mingle with
Japanese R users as it’s the largest regular meetup here in Japan. Talks
in English are also welcome so if you’re ever in Tokyo come join us!
To leave a comment for the author, please follow the link and comment on their blog: R by R(yo). Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Probability of winning a bestof7 series
(This article was first published on R – Statistical Odds & Ends, and kindly contributed to Rbloggers)
The NBA playoffs are in full swing! A total of 16 teams are competing in a playoffformat competition, with the winner of each bestof7 series moving on to the next round. In each matchup, two teams play 7 basketball games against each other, and the team that wins more games progresses. Of course, we often don’t have to play all 7 games: we can determine the overall winner once either team reaches 4 wins.
During one of the games, a commentator made a remark along the lines of “you have no idea how hard it is for the better team to lose in a bestof7 series”. In this post, I’m going to try and quantify how hard it is to do so! I will do this not via analytical methods, but by Monte Carlo simulation. (You can see the code all at once here.)
Let’s imagine an idealized setup, where for any given game, team A has probability of beating team B. This probability is assumed to be the same from game to game, and the outcome of each game is assumed to be independent of the others. With these assumptions, the number of games that team A wins has a binomial distribution .
In our simulation, for each , we will generate a large number of random values and determine the proportion of them that are : that would be our estimate for the winning probability of the 7game series. How many random values should we draw? We should draw enough so that we are reasonably confident of the accuracy of the proportion estimate. If we draw samples, then the plugin estimate of standard error is . In what follows, we will plot estimates with error bars indicating standard error.
First, let’s set up a function that takes in three parameters: the number of simulation runs (simN), the total number of games in a series (n), and the probability that team A wins (p). The function returns the proportion of the simN runs which team A wins.
sim_fn < function(simN, n, p) { if (n %% 2 == 0) { stop("n should be an odd number") } mean(rbinom(simN, n, p) > n / 2) }Next, we set up a data frame with each row representing a different win probability for team A. Because of symmetry inherent in the problem, we focus on win probabilities which are .
n < 7 df < data.frame(p = seq(0.5, 1, length.out = 26), n = n)For a start, we set simN to be 100. For each row, we run simN simulation runs to get the win probability estimate and compute the associated standard error estimate. We also compute the upper and lower 1 standard error deviations from the probability estimate, to be used when plotting error bars.
set.seed(1043) simN < 100 df$win_prob < apply(df, 1, function(x) sim_fn(simN, x[2], x[1])) # compute std err & p_hat \pm 1 std err df$std_err < with(df, sqrt(win_prob * (1  win_prob) / simN)) df$lower < with(df, win_prob  std_err) df$upper < with(df, win_prob + std_err)Here is the code for the plot and the plot itself:
library(ggplot2) ggplot(df, aes(x = p, y = win_prob)) + geom_line() + geom_linerange(aes(ymin = lower, ymax = upper)) + labs(title = paste("Win probability for best of", n, "series"), x = "Win prob for single game", y = "Win prob for series")We can see that the line is still quite wiggly with large error bars, suggesting that 100 simulation runs is not enough. (Indeed, we expect the graph to be monotonically increasing.) Below, we run 100,000 simulations for each value of p instead (with the same random seed):
We get a much smoother line, and the error bars are so small we can hardly see them. My conclusion here is that while it is harder for the weaker team to win a bestof7 series that a single game, the odds are not insurmountable. For example, a team that has a 70% chance of winning any one game, which is a huge advantage, still has about a 13% chance of losing a bestof7 series: not insignificant!
How do these probabilities change if we consider bestofn series for different values of n? The code below is very similar to that above, except that our data frame contains data for n equal to all the odd numbers from 1 to 15, not just n = 7.
p < seq(0.5, 1, length.out = 26) n < seq(1, 15, length.out = 8) df < expand.grid(p, n) names(df) < c("p", "n") set.seed(422) simN < 100000 df$win_prob < apply(df, 1, function(x) sim_fn(simN, x[2], x[1])) # compute std err & p_hat \pm 1 std err df$std_err < with(df, sqrt(win_prob * (1  win_prob) / simN)) df$lower < with(df, win_prob  std_err) df$upper < with(df, win_prob + std_err) ggplot(df, aes(x = p, y = win_prob, col = factor(n))) + geom_line() + geom_linerange(aes(ymin = lower, ymax = upper)) + labs(title = paste("Win probability for best of n series"), x = "Win prob for single game", y = "Win prob for series") + theme(legend.position = "bottom")Here is the plot:
As we expect, the series win probabilities increase as n increases. Also, the n = 1 graph corresponds to the line. It looks like the win probabilities don’t change much from bestof9 to bestof15.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Comparing PointandClick Front Ends for R
(This article was first published on R – r4stats.com, and kindly contributed to Rbloggers)
Now that I’ve completed seven detailed reviews of Graphical User Interfaces (GUIs) for R, let’s compare them. It’s easy enough to count their features and plot them, so let’s start there. I’m basing the counts on the number of menu items in each category. That’s not too hard to get, but it’s far from perfect. Some software has fewer menu choices, depending instead on dialog box choices. Studying every menu and dialog box would be too timeconsuming, so be aware of this limitation. I’m putting the details of each measure in the appendix so you can adjust the figures and create your own graphs. If you decide to make your own graphs, I’d love to hear from you in the comments below.
Figure 1 shows the number of analytic methods each software supports on the xaxis and the number of graphics methods on the yaxis. The analytic methods count combines statistical features, machine learning / artificial intelligence ones (ML/AI), and the ability to create R model objects. The graphics features count totals up the number of bar charts, scatterplots, etc. each package can create.
The ideal place to be in this graph is in the upper right corner. We see that BlueSky and R Commander offer quite a lot of both analytic and graphical features. Rattle stands out as having the second greatest number of graphics features. JASP is the lowest on graphics features and 3rd from the bottom on analytic ones.
Next, let’s swap out the yaxis for general usability features. These consist of a variety of features that make your work easier, including data management capabilities (see appendix for details).
Figure 2 shows that BlueSky and R Commander still in the top two positions overall, but now Deducer has nearly caught up with R Commander on the number of general features. That’s due to its reasonably strong set of data management tools, plus its output is in true word processing tables saving you the trouble of formatting it yourself. Rattle is much lower in this plot since, while its graphics capabilities are strong (at least in relation to ML/AI tasks), it has minimal data management capabilities.
These plots help show us three main overall feature sets, but each package offers things that the others don’t. Let’s look at a brief overview of each. Remember that each of these has a detailed review that follows my standard template. I’ll start with the two that have come out on top, then follow in alphabetical order.
The R Commander – This is the oldest GUI, having been around since at least 2005. There are an impressive 41 plugins developed for it. It is currently the only R GUI that saves R Markdown files, but it does not create word processing tables by default, as some of the others do. The R code it writes is classic, rarely using the newer tidyverse functions. It works as a partner to R; you install R separately, then use it to install and start R Commander. It makes it easy to blend menubased analysis with coding. If your goal is to learn to code in classic R, this is an excellent choice.
BlueSky Statistics – This software was created by former SPSS employees and it shares many of SPSS’ features. BlueSky is only a few years old, and it converted from commercial to open source just a few months ago. Although BlueSky and R Commander offer many of the same features, they do them in different ways. When using BlueSky, it’s not initially apparent that R is involved at all. Unless you click the “Syntax” button that every dialog box has, you’ll never see the R code or the code editor. Its output is in publicationquality tables which follow the popular style of the American Psychological Association.
Deducer – This has a very nicelooking interface, and it’s probably the first to offer true word processing tables by default. Being able to just cut and paste a table into your word processor saves a lot of time and it’s a feature that has been copied by several others. Deducer was released in 2008, and when I first saw it, I thought it would quickly gain developers. It got a few, but development seems to have halted. Deducer’s installation is quite complex, and it depends on the troublesome Java software. It also used JGR, which never became as popular as the similar RStudio. The main developer, Ian Fellows, has moved on to another very interesting GUI project called Vivid.
jamovi – The developers who form the core of the jamovi project used to be part of the JASP team. Despite the fact that they started a couple of years later, they’re ahead of JASP in several ways at the moment. Its developers decided that the R code it used should be visible and any R code should be executable, something that differentiated it from JASP. jamovi has an extremely interactive interface that shows you the result of every selection in each dialog box. It also saves the settings in every dialog box, and lets you reuse every step on a new dataset by saving a “template.” That’s extremely useful since GUI users often don’t want to learn R code. jamovi’s biggest weakness its dearth of data management tasks, though there are plans to address that.
JASP – The biggest advantage JASP offers is its emphasis on Bayesian analysis. If that’s your preference, this might be the one for you. At the moment JASP is very different from all the other GUIs reviewed here because it won’t show you the R code it’s writing, and you can’t execute your own R code from within it. Plus the software has not been open to outside developers. The development team plans to address those issues, and their deep pockets should give them an edge.
Rattle – If your work involves ML/AI (a.k.a. data mining) instead of standard statistical methods, Rattle may be the best GUI for you. It’s focused on ML/AI, and its tabbedbased interface makes quick work of it. However, it’s the weakest of them all when it comes to statistical analysis. It also lacks many standard data management features. The only other GUI that offers many ML/AI features is BlueSky.
RKWard – This GUI blends a nice pointandclick interface with an integrated development environment that is the most advanced of all the other GUIs reviewed here. It’s easy to install and start, and it saves all your dialog box settings, allowing you to rerun them. However, that’s done stepbystep, not all at once as jamovi’s templates allow. The code RKWard creates is classic R, with no tidyverse at all.
ConclusionI hope this brief comparison will help you choose the R GUI that is right for you. Each offers unique features that can make life easier for nonprogrammers. If one catches your eye, don’t forget to read the full review of it here.
AcknowledgementsWriting this set of reviews has been a monumental undertaking. It would not have been possible without the assistance of Bruno Boutin, Anil Dabral, Ian Fellows, John Fox, Thomas Friedrichsmeier, Rachel Ladd, Jonathan Love, Ruben Ortiz, Christina Peterson, Josh Price, EricJan Wagenmakers, and Graham Williams.
Appendix: Guide to ScoringIn figures 1 and 2, Analytic Features adds up: statistics, machine learning / artificial intelligence, the ability to create R model objects, and the ability to validate models using techniques such as kfold crossvalidation. The Graphics Features is the sum of two rows, the number of graphs the software can create plus one point for small multiples, or facets, if it can do them. Usability is everything else, with each row worth 1 point, except where noted.
Feature Definition Simpleinstallation Is it done in one step? Simple startup Does it start on its own without starting R, loading
packages, etc.? Import Data Files How many files types can it import? Import
Database How many databases can it read from? Export Data Files How many file formats can it write to? Data Editor Does it have a data editor? Can work on >1 file Can it work on more than one file at a time? Variable
View Does it show metadata in a variable view, allowing for many fast edits to metadata? Data
Management How many data management tasks can it do? Transform
Many Can it transform many variables at once? Types How many graph types does it have? Small
Multiples Can it show small multiples (facets)? Model Objects Can it create R model objects? Statistics How many statistical methods does it have? ML/AI How many ML / AI methods does it have? Model Validation Does it offer model validation (kfold, etc.)? R Code IDE Can you edit and execute R code? GUI Reuse Does it let you reuse work without code? Code Reuse Does it let you rerun all using code? Package Management Does it manage packages for you? Table of Contents Does output have a table of contents? Reorder Can you reorder output? Publication Quality Is output in publication quality by default? R Markdown Can it create R Markdown? Add
comments Can you add comments to output? Groupby Does it do groupby repetition of any other task? Output as
Input Does it save equivalent to broom’s tidy, glance, augment? (They earn 1 point for each) Developer tools Does it offer developer tools? Scores Feature BlueSky Deducer JASP jamovi Rattle Rcmdr RKWard Simple installation 1 0 1 1 0 0 1 Simple startup 1 1 1 1 0 0 1 Import Data Files 7 13 4 5 9 7 5 Import Database 5 0 0 0 1 0 0 Export Data Files 5 7 1 4 1 3 3 Data Editor 1 1 0 1 0 1 1 Can work on >1 file 1 1 0 0 0 0 0 Variable View 1 1 0 0 0 0 0 Data Management 30 9 2 3 9 25 4 Transform Many 1 1 0 1 1 1 0 Types 25 16 9 12 24 21 14 Small Multiples 1 1 0 0 0 1 0 Model Objects 1 1 0 0 0 1 1 Statistics 96 37 26 44 8 95 22 ML/AI 9 0 0 0 12 0 0 Model Validation 1 0 0 0 1 0 0 R Code IDE 1 1 0 1 0 0 1 GUI Reuse 0 0 1 0 0 0 1 Code Reuse 1 1 0 1 1 1 1 Package Management 1 0 1 1 0 0 0 Output: Table of Contents 1 0 0 0 0 0 0 Output: Reorder 0 0 0 0 0 0 0 Output: Publication Quality 1 1 1 1 0 0 0 Output: R Markdown 0 0 0 0 0 1 0 Output: Add comments 0 0 1 0 0 1 0 Groupby / Split File 1 0 0 0 0 0 0 Output as Input 3 1 0 0 0 1 0 Developer tools 1 1 0 1 0 1 1 Total 196 94 48 77 67 160 56 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Le Monde puzzle [#1094]
(This article was first published on R – Xi'an's Og, and kindly contributed to Rbloggers)
A rather blah number Le Monde mathematical puzzle:
Find all integer multiples of 11111 with exactly one occurrence of each decimal digit..
Which I solved by brute force, by looking at the possible range of multiples (and borrowing stringr:str_count from Robin!)
> combien=0 > for (i in 90001:900008){ j=i*11111 combien=combien+(min(stringr::str_count(j,paste(0:9)))==1)} > combien [1] 3456And a bonus one:
Find all integers y that can write both as x³ and (10z)³+a with 1≤a≤999.
which does not offer much in terms of solutions since x³v³=(xv)(x²+xv+v²)=a shows that x² is less than 2a/3, meaning x is at most 25. Among such numbers only x=11,12 lead to a solution as x³=1331,1728.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Using R/exams for Written Exams in Finance Classes
(This article was first published on R/exams, and kindly contributed to Rbloggers)
Experiences with using R/exams for written exams in finance classes with a moderate number of students at Texas A&M International University (TAMIU).
Guest post by Nathaniel P. Graham (Texas A&M International University, Division of International Banking and Finance Studies).
BackgroundWhile R/exams was originally written for large statistics classes, there is nothing specific to either large courses or statistics. In this article, I will describe how I use R/exams for my “Introduction to Finance (FIN 3310)” course. While occasionally I might have a class of 60 students, I generally have smaller sections of 20–25. These courses are taught and exams are given inperson (some material and homework assignments are delivered via the online learning management system, but I do not currently use R/exams to generate that).
My written exams are generated by exams2nops() and have two components: a multiplechoice portion and a shortanswer portion. The former are generated using R/exams’s dynamic exercise format while the latter are included from static PDF documents. Example pages from a typical exam are linked below:
Because I have a moderate number of students, I grade all my exams manually. This blog post outlines how the exercises are set up and which R scripts I use to automate the generation of the PDF exams.
Multiplechoice questions ShufflingThe actual questions in my exams are not different from many other R/exams examples, except that they are about finance instead of statistics. An example is included below with LaTeX markup and both the R/LaTeX (.Rnw) and R/Markdown (.Rmd) versions can be downloaded: goalfinance.Rnw, goalfinance.Rmd. I have 7 possible answers, but only 5 will be randomly chosen for a given exam (always including the 1 correct answer). To do so, many of my nonnumerical questions take advantage of the exshuffle option here. (If you are one of my students, I told you this would be on the exam!)
Periodic updates to individual questions (stored as separate files) keep the exams fresh.
\begin{question} The primary goal of financial management is to maximize the \dots \begin{answerlist} \item Present value of future operating cash flows \item Net income, according to GAAP/IFRS \item Market share \item Value of the firm \item Number of shares outstanding \item Book value of shareholder equity \item Firm revenue \end{answerlist} \end{question} \exname{Goal of Financial Management} \extype{schoice} \exsolution{0001000} \exshuffle{5} Numerical exercisesSince this is for a finance class, there are numerical exercises as well. Every semester, I promise my students that this problem will be on the second midterm. The problem is simple: given some cash flows and a discount rate, calculate the net present value (NPV), see npvcalc.Rnw, npvcalc.Rmd. Since the cash flows, the discount rate, and the answers are generated from random numbers, I would not gain much from defining more than five possible, the way I did in the qualitative question above. As you might expect, some of the incorrect answers are common mistakes students make when approaching NPV, so I have not lost anything relative to the carefully crafted questions and answers I used in the past to find out who really learned the material.
<>= discountrate < round(runif(1, min = 6.0, max = 15.0), 2) r < discountrate / 100.0 cf0 < sample(10:20, 1) * 100 ocf < sample(seq(200, 500, 25), 5) discounts < sapply(1:5, function(i) (1 + r) ** i) npv < round(sum(ocf / discounts) + cf0, 2) notvm < round(sum(ocf) + cf0, 2) wrongtvm < round(sum(ocf / (1.0 + r)) + cf0, 2) revtvm < round(sum(ocf * (1.0 + r)) + cf0, 2) offnpv < round(npv + sample(c(200.0, 200.0), 1), 2) @ \begin{question} Assuming the discount rate is \Sexpr{discountrate}\%, find the net present value of a project with the following cash flows, starting at time 0: \$\Sexpr{cf0}, \Sexpr{ocf[1]}, \Sexpr{ocf[2]}, \Sexpr{ocf[3]}, \Sexpr{ocf[4]}, \Sexpr{ocf[5]}. \begin{answerlist} \item \$\Sexpr{wrongtvm} \item \$\Sexpr{notvm} \item \$\Sexpr{npv} \item \$\Sexpr{revtvm} \item \$\Sexpr{offnpv} \end{answerlist} \end{question} \exname{Calculating NPV} \extype{schoice} \exsolution{00100} \exshuffle{5} Leveraging R’s flexibilityWhile there are other methods of generating randomized exams out there, few are as flexible as R/exams, in large part because we have full access to R. The next example also appears on my second midterm; it asks students to compute the internal rate of return for a set of cash flows, see irrcalc.Rnw, irrcalc.Rmd. Numerically, this is a simple rootfinding problem, but systems that do not support more advanced mathematical operations (such as Blackboard’s “Calculated Formula” questions) can make this difficult or impossible to implement directly.
<>= cf0 < sample(10:16, 1) * 100 ocf < sample(seq(225, 550, 25), 5) npvfunc < function(r) { discounts < sapply(1:5, function(i) (1 + r) ** i) npv < (sum(ocf / discounts) + cf0) ** 2 return(npv) } res < optimize(npvfunc, interval = c(1,1)) irr < round(res$minimum, 4) * 100.0 wrong1 < irr + sample(c(1.0, 1.0), 1) wrong2 < irr + sample(c(0.25, 0.25), 1) wrong3 < irr + sample(c(0.5, 0.5), 1) wrong4 < irr + sample(c(0.75, 0.75), 1) @ \begin{question} Find the internal rate of return of a project with the following cash flows, starting at time 0: \$\Sexpr{cf0}, \Sexpr{ocf[1]}, \Sexpr{ocf[2]}, \Sexpr{ocf[3]}, \Sexpr{ocf[4]}, \Sexpr{ocf[5]}. \begin{answerlist} \item \Sexpr{wrong1}\% \item \Sexpr{wrong2}\% \item \Sexpr{irr}\% \item \Sexpr{wrong3}\% \item \Sexpr{wrong4}\% \end{answerlist} \end{question} \exname{Calculating IRR} \extype{schoice} \exsolution{00100} \exshuffle{5} Shortanswer questionsWhile exams2nops() has some support for openended questions that are scored manually, it is very limited. I create and format those questions in the traditional manner (Word or LaTeX) and save the resulting file as a PDF. Alternatively, a custom exams2pdf() could be used for this. Finally, I add this PDF file on to the end of the exam using the pages option of exams2nops() (as illustrated in more detail below). As an example, the shortanswer questions from the final exam above are given in the following PDF file:
To facilitate automatic inclusion of these shortanswer questions, a naming convention is adopted in the scripts below. The script looks for a file named examNsa.pdf, where N is the exam number (1 for the first midterm exam, 2 for the second, and 3 for the final exam) and sets pages to it. So for the final exam, it looks for the file exam3sa.pdf. If I want to update my shortanswer questions, I just replace the old PDF with a new one.
Midterm examsI keep the questions for each exam in their own directory, which makes the script below that I use to generate exams 1 and 2 straightforward. For the moment, the script uses all the questions in an exam’s directory, but I if want to guarantee that I have exactly (for example) 20 multiple choice questions, I could set nsamp = 20 instead of however many Rnw (or Rmd) files it finds.
Note that exlist is a list with one element (a vector of file names), so that R/exams will randomize the order of the questions as well. If I wanted to specify the order, I could make exlist a list of N elements, where each element was exactly one file/question. The nsamp option could then be left unset (the default is NULL).
When I generate a new set of exams, I only need to update the first four lines, and the script does the rest. Note that I use a modified version of the English en.dcf language configuration file in order to adapt some terms to TAMIU’s terminology, e.g., “ID Number” instead of the exams2nops() default “Registration Number”. See en_tamiu.dcf for the details. Since the TAMIU student ID numbers have 8 digits, I use the reglength = 8 argument, which sets the number of digits in the ID Number to 8.
library("exams") ## configuration: change these to make a new batch of exams exid < 2 exnum < 19 exdate < "20190404" exsemester < "SP19" excourse < "FIN 3310" ## options derived from configuration above exnam < paste0("fin3310exam", exid) extitle < paste(excourse, exsemester, "Exam", exid, sep = " ") saquestion < paste0("SA_questions/exam", exid, "sa.pdf") ## exercise directory (edir) and output directory (odir) exedir < paste0("fin3310_exam", exid, "_exercises") exodir < "nops" if(!dir.exists(exodir)) dir.create(exodir) ## in case it was previously deleted exodir < paste0(exodir, "/exam", exid, "_", format(Sys.Date(), "%Y%m%d")) ## exercise list: one element with a vector of all available file names exlist < list(dir(path = exedir, pattern = "\.Rnw$")) ## generate exam set.seed(54321) # not the actual random seed exams2nops(file = exlist, n = exnum, nsamp = length(exlist[[1]]), dir = exodir, edir = exedir, name = exnam, date = exdate, course = excourse, title = extitle, institution = "Texas A\\&M International University", language = "en_tamiu.dcf", encoding = "UTF8", pages = saquestion, blank = 1, logo = NULL, reglength = 8, samepage = TRUE) Final examMy final exam is comprehensive, so I would like to include questions from the previous exams. I do not want to keep separate copies of those questions just for the final, in case I update one version and forget to update the other, so I need a script that gathers up questions from the first two midterms and adds in some questions specific to the final.
The script below uses a feature that has long been available in R/exams but was undocumented up to version 2.32: You can set edir to a directory and all its subdirectories will be included in the search path. I have specified some required questions that I want to appear on every student’s exam; each student will also get a random draw of other questions from the first two midterms in addition to some questions that only appear on the final. How many of each is controlled by exnsamp, which is passed to the nsamp argument. They add up to 40 currently, so exams2nops()’s 45 question limit does not affect me.
library("exams") ## configuration: change these to make a new batch of exams exnum < 19 exdate < "20190430" exsemester < "SP19" excourse < "FIN 3310" ## options derived from configuration above extitle < paste(excourse, exsemester, "Exam", exid, sep = " ") ## exercise directory (edir) and output directory (odir) exedir < getwd() exodir < "nops" if(!dir.exists(exodir)) dir.create(exodir) exodir < paste0(exodir, "/exam3", "_", format(Sys.Date(), "%Y%m%d")) ## exercises: required and from previous midterms exrequired < c("goalfinance.Rnw", "pvcalc.Rnw", "cashcoverageortie.Rnw", "ocfcalc.Rnw", "discountratecalc.Rnw", "npvcalc.Rnw", "npvcalc2.Rnw", "stockpriceisabouttopaycalc.Rnw", "annuitycalc.Rnw", "irrcalc.Rnw") exlist1 < dir(path = "fin3310_exam1_exercises", pattern = "\.Rnw$") exlist2 < dir(path = "fin3310_exam2_exercises", pattern = "\.Rnw$") exlist3 < dir(path = "fin3310_exam3_exercises", pattern = "\.Rnw$") ## final list and corresponding number to be sampled exlist < list( exrequired, setdiff(exlist1, exrequired), setdiff(exlist2, exrequired), exlist3 ) exnsamp < c(10, 5, 10, 15) ## generate exam set.seed(54321) # not the actual random seed exams2nops(file = exlist, n = exnum, nsamp = exnsamp, dir = exodir, edir = exedir, name = "fin3310exam3", date = exdate, course = excourse, title = extitle, institution = "Texas A\\&M International University", language = "en_tamiu.dcf", encoding = "UTF8", pages = "SA_questions/exam3sa.pdf", blank = 1, logo = NULL, reglength = 8, samepage = TRUE) GradingR/exams can read scans of the answer sheet, automating grading for the multiplechoice questions. For a variety of reasons, I do not use any of those features. If I had to teach larger classes I would doubtless find a way to make it convenient, but for the foreseeable future I will continue to use the function below to produce the answer key for each exam, which I grade by hand. This is not especially onorous, since I have to grade the shortanswer questions by hand anyway.
Given the serialized R data file (.rds) produced by exam2nops() the corresponding exams_metainfo() can be extracted. This comes with a print() method that displays all correct answers for a given exam. Starting from R/exams version 2.33 (current RForge devel version) it is also possible to split the output into blocks of five for easy reading (matching the blocks on the answer sheets). As an example the first few correct answers of the third exam are extracted:
fin3310exam3 < readRDS("fin3310exam3.rds") print(exams_metainfo(fin3310exam1), which = 3, block = 5) ## 19043000003 ## 1. Discount Rate: 1 ## 2. Calculating IRR: 4 ## 3. Present Value: 3 ## 4. Calculating stock prices: 5 ## 5. Calculating NPV: 3 ## ## 6. Annuity PV: 3 ## 7. Goal of Financial Management: 1 ## 8. Finding OCF: 5 ## ...It can be convenient to display the correct answers in a customized format, e.g., with letters instead of numbers and omitting the exercise title text. To do so, the code below sets up a function exam_solutions(), applying it to the same exam as above.
exam_solutions < function(examdata, examnum) { solutions < LETTERS[sapply(examdata[[examnum]], function(x) which(x$metainfo$solution))] data.frame(Solution = solutions) } split(exam_solutions(fin3310exam3, examnum = 3), rep(1:8, each = 5)) ## $`1` ## Solution ## 1 A ## 2 D ## 3 C ## 4 E ## 5 C ## ## $`2` ## Solution ## 6 C ## 7 A ## 8 E ## ...The only downside of the manual grading approach is that I do not have students’ responses in an electronic format for easy statistical analysis, but otherwise grading is very fast.
Wrapping upUsing the system above, R/exams works well even for small courses with inperson, paper exams. I do not need to worry about copies of my exams getting out and students memorizing it, or any of the many ways students have found to cheat over the years. By doing all the formatting work for me, R/exams helps me avoid a lot of the finicky aspects of adding, adjusting, or removing questions from an existing exam, and generally keeps exam construction focused on the important part: writing questions that gauge students’ progress.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R/exams. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Practical Data Science with R Book Update (April 2019)
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
I thought I would give a personal update on our book: Practical Data Science with R 2nd edition; Zumel, Mount; Manning 2019.
The second edition should be fully available this fall! Nina and I have finished up through chapter 10 (of 12), and Manning has released previews of up through chapter 7 (with more to come!).
At this point the second edition is very real. So it makes me feel a bit conflicted about the 1st edition. If you need a good guide to data science in R in paper form right now please consider buying the 1st edition. If you are happy with an ebook: please consider buying the 2nd edition MEAP (Manning Early Access Program) now, it gets you previews of the new book, a complete ecopy of the 1st edition, and a complete ecopy of the 2nd edition when that is done! And we expect the fully printed version of the 2nd edition to be available this fall.
The 1st edition remains an excellent book that we are very proud of. I have 12 more copies of it (that I paid for) in my closet for gifting. Just four days ago I achieved a personal 1st edition goal: handing a print copy of the 1st edition to Robert Gentleman in person!
But, the 1st edition’s time is coming to an end.
The 1st edition has been with Nina and me since we started work on it in November of 2012 (getting the 1st edition out in April 2014). Earlier drafts were titled “Successful Data Science in R”, and issues around book title made me so nervous the entire book production directory was named “DataScienceBook” until last year about 4 years after the book publication!
Well, now for me Practical Data Science with R, 2nd edition is “the book.” Nina and I have put a lot of time into revising and improving the book for the new edition. In fact we started the 2nd edition March of 2018, so we have been working on this new edition for quite some time. If you wonder why you have not seen Nina for the last year, this has been the major reason (she is the leader on both editions of the book).
Practical Data Science with R 2nd edition; Zumel, Mount; Manning 2019 is intended as an improved revision of the 1st edition. We fixed things, streamlined things, incorporated new packages, and updated the content to catch up with over 5 years of evolution of the R ecosystem. We also had the chance to add some new content on boosted trees, model explanation, and data wrangling (including data.table!).
On a technical side we have a free support site for the book: https://github.com/WinVector/PDSwR2. As with the first edition: we share all code examples, so you don’t have to try and copy and paste from the ecopy. This is fairly substantial, as Practical Data Science with R 2nd edition; Zumel, Mount; Manning 2019 is an example driven book. We also have just released Rmarkdown rerenderings of all of the examples from the book. These confirm the examples run correctly and are where you can look if you are hung up on trying to run an example (which will almost always be an issue of adding the path from where you have the code to the data, or a small matter of installing packages).
If you have and liked the 1st edition, please consider buying the 2nd edition: it is a substantial improvement. It is the same book, but a better version of it. If you want a book that teaches you how to work as data scientist using R as the example: please buy Practical Data Science with R 2nd edition; Zumel, Mount; Manning 2019.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Free Course: Help Your Team Learn R!
(This article was first published on R – AriLamstein.com, and kindly contributed to Rbloggers)
Today I am happy to announce a new free course: Help Your Team Learn R!
Over the last few years I’ve helped a number of data teams train their analysts to use R. At each company there was a skilled R user who was leading the team’s effort to adopt R.
Each of these internal R advocates, however, had a similar problem: they did not have a background in training. So while they knew that R could help their company, and they were R experts themselves, they struggled to get their team to actually learn R.
Help Your Team Learn R is designed to help people in this situation by providing an introduction to some of the key concepts in the field of technical training.
What’s in the Course?In terms of Pedagogy, the course focuses on Criterion Referenced Instruction, which is the branch of training that I have found most useful when teaching R.
The course contains seven lessons that are sent via email over the course of six days. Each email contains:
 One key concept from the field of training
 An example of how I have applied that concept when I teach R
 An exercise to help you apply that concept to your own teaching
If you are looking for help in teaching R to your team, then I encourage you to sign up for Help Your Team Learn R using the form below!
The post Free Course: Help Your Team Learn R! appeared first on AriLamstein.com.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
India has 100k records on iNaturalist
(This article was first published on Vijay Barve, and kindly contributed to Rbloggers)
Biodiversity citizen scientists use iNaturalist to post their observations with photographs. The observations are then curated there by crowdsourcing the identifications and other trait related aspects too. The data once converted to “research grade” is passed on to GBIF as occurrence records.
Exciting news from India in 3rd week of April 2019 is:
Another important milestone in #Biodiversity Citizen Science in #India. This week we crossed 100K verifiable records on @inaturalist this data is about ~10K species by 4500+ observers #CitSci pic.twitter.com/DCF3QxQl1i
— Vijay Barve (@vijaybarve) April 21, 2019
Being interested in biodiversity data visualizations and completeness, I was waiting for 100k records to explore the data. Here is what I did and found out.
Step 1: Download the data from iNaturalist website. Which can be done very easily by visiting the website and choosing the right options.
https://www.inaturalist.org/observations?place_id=6681
I downloaded the file as .zip and extracted the observationsxxxx.csv. [In my case it was observations51008.csv].
Step 2: Read the data file in R
library(readr) observations_51008 < read_csv("input/observations51008.csv")Step 3: Clean up the data and set it up to be used in package bdvis.
library(bdvis) inatc < list( Latitude="latitude", Longitude="longitude", Date_collected="observed_on", Scientific_name="scientific_name" ) inat < format_bdvis(observations_51008,config = inatc)Step 4: We still need to rename some more columns for ease in visualizations like rather than ‘taxon_family_name’ it will be easy to have field called ‘Family’
rename_column < function(dat,old,new){ if(old %in% colnames(dat)){ colnames(dat)[which(names(dat) == old)] < new } else { print(paste("Error: Fieldname not found...",old)) } return(dat) } inat < rename_column(inat,'taxon_kingdom_name','Kingdom') inat < rename_column(inat,'taxon_phylum_name','Phylum') inat < rename_column(inat,'taxon_class_name','Class') inat < rename_column(inat,'taxon_order_name','Order_') inat < rename_column(inat,'taxon_family_name','Family') inat < rename_column(inat,'taxon_genus_name','Genus') # Remove records excess of 100k inat < inat[1:100000,]Step 5: Make sure the data is loaded properly
bdsummary(inat)will produce some like this:
Total no of records = 100000 Temporal coverage... Date range of the records from 18980101 to 20190419 Taxonomic coverage... No of Families : 1345 No of Genus : 5638 No of Species : 13377 Spatial coverage ... Bounding box of records 6.806092 , 68.532  35.0614769085 , 97.050133 Degree celles covered : 336 % degree cells covered : 39.9524375743163The data looks good. But we have a small issue, we have some records from year 1898, which might cause trouble with some of our visualizations. So let us drop records before year 2000 for the time being.
inat = inat[which(inat$Date_collected > "20000101"),]Now we are ready to explore the data. First one I always like to see is geographical coverage of the data. First let us try it at 1 degree (~100km) grid cells. Note here I have Admin2.shp file with India states map.
mapgrid(inat,ptype="records",bbox=c(60,100,5,40), shp = "Admin2.shp")This shows a fairly good geographical coverage of the data at this scale. We have very few degree cells with no data. How about fines scale though? Say at 0.1 degree (~10km) grid. Let us generate that.
mapgrid(inat,ptype="records",bbox=c(60,100,5,40), shp = "Admin2.shp", gridscale=0.1)Now the pattern is clear, where the data is coming from.
To be continued…
References
 Barve, Vijay, and Javier Otegui. 2016. “Bdvis: Visualizing Biodiversity Data in R.” Bioinformatics. doi:http://dx.doi.org/10.1093/bioinformatics/btw333.
To leave a comment for the author, please follow the link and comment on their blog: Vijay Barve. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Reproducible Environments
(This article was first published on R Views, and kindly contributed to Rbloggers)
Great data science work should be reproducible. The ability to repeat
experiments is part of the foundation for all science, and reproducible work is
also critical for business applications. Team collaboration, project validation,
and sustainable products presuppose the ability to reproduce work over time.
In my opinion, mastering just a handful of important tools will make
reproducible work in R much easier for data scientists. R users should be
familiar with version control, RStudio projects, and literate programming
through R Markdown. Once these tools are mastered, the major remaining challenge
is creating a reproducible environment.
An environment consists of all the dependencies required to enable your code to
run correctly. This includes R itself, R packages, and system dependencies. As
with many programming languages, it can be challenging to manage reproducible R
environments. Common issues include:
 Code that used to run no longer runs, even though the code has not changed.
 Being afraid to upgrade or install a new package, because it might break your code or someone else’s.
 Typing install.packages in your environment doesn’t do anything, or doesn’t do the right thing.
These challenges can be addressed through a careful combination of tools and
strategies. This post describes two use cases for reproducible environments:
 Safely upgrading packages
 Collaborating on a team
The sections below each cover a strategy to address the use case, and the necessary
tools to implement each strategy. Additional use cases, strategies, and tools are
presented at https://environments.rstudio.com. This website is a work in
progress, but we look forward to your feedback.
Upgrading packages can be a risky affair. It is not difficult to find serious R
users who have been in a situation where upgrading a package had unintended
consequences. For example, the upgrade may have broken parts of their current code, or upgrading a
package for one project accidentally broke the code in another project. A
strategy for safely upgrading packages consists of three steps:
 Isolate a project
 Record the current dependencies
 Upgrade packages
The first step in this strategy ensures one project’s packages and upgrades
won’t interfere with any other projects. Isolating projects is accomplished by
creating perproject libraries. A tool that makes this easy is the new renv
package. Inside of your R project, simply use:
The second step is to record the current dependencies. This step is critical
because it creates a safety net. If the package upgrade goes poorly, you’ll be
able to revert the changes and return to the record of the working state. Again,
the renv package makes this process easy.
With an isolated project and a safety net in place, you can now proceed to
upgrade or add new packages, while remaining certain the current functional
environment is still reproducible. The pak
package can be used to install and upgrade
packages in an interactive environment:
The safety net provided by the renv package relies on access to older versions
of R packages. For public packages, CRAN provides these older versions in the
CRAN archive. Organizations can
use tools like RStudio Package
Manager to make multiple versions
of private packages available. The “snapshot and
restore” approach can also be used
to promote content to production. In
fact, this approach is exactly how RStudio
Connect and
shinyapps.io deploy thousands of R applications to
production each day!
A common challenge on teams is sharing and running code. One strategy that
administrators and R users can adopt to facilitate collaboration is
shared baselines. The basics of the strategy are simple:
 Administrators setup a common environment for R users by installing RStudio Server.
 On the server, administrators install multiple versions of R.
 Each version of R is tied to a frozen repository using a Rprofile.site file.
By using a frozen repository, either administrators or users can install
packages while still being sure that everyone will get the same set of packages.
A frozen repository also ensures that adding new packages won’t upgrade other
shared packages as a sideeffect. New packages and upgrades are offered to users
over time through the addition of new versions of R.
Frozen repositories can be created by manually cloning CRAN, accessing a service
like MRAN, or utilizing a supported product like RStudio Package
Manager.
The prior sections presented specific strategies for creating reproducible
environments in two common cases. The same strategy may not be appropriate for
every organization, R user, or situation. If you’re a student reporting an
error to your professor, capturing your sessionInfo() may be all you need. In
contrast, a statistician working on a clinical trial will need a robust
framework for recreating their environment. Reproducibility is not binary!
To help pick between strategies, we’ve developed a strategy
map. By answering two questions,
you can quickly identify where your team falls on this map and identify the
nearest successful strategy. The two questions are represented on the x and
yaxis of the map:
 Do I have any restrictions on what packages can be used?
 Who is responsible for managing installed packages?
{"x":{"data":[{"x":[0.05,1.05],"y":[0.15,1.25],"text":"","type":"scatter","mode":"lines","line":{"width":1.88976377952756,"color":"rgba(0,0,0,0.2)","dash":"solid"},"hoveron":"points","showlegend":false,"xaxis":"x","yaxis":"y","hoverinfo":"skip","frame":null},{"x":[0,0,0.8,0],"y":[0.2,1,1,0.2],"text":"NA","type":"scatter","mode":"lines","line":{"width":1.88976377952756,"color":"transparent","dash":"solid"},"fill":"toself","fillcolor":"rgba(255,0,0,0.1)","hoveron":"fills","showlegend":false,"xaxis":"x","yaxis":"y","hoverinfo":"skip","frame":null},{"x":[0,1,0.8,0,0],"y":[null,0.8,1,0.2,null],"text":"NA","type":"scatter","mode":"lines","line":{"width":1.88976377952756,"color":"transparent","dash":"solid"},"fill":"toself","fillcolor":"rgba(0,255,0,0.1)","hoveron":"fills","showlegend":false,"xaxis":"x","yaxis":"y","hoverinfo":"skip","frame":null},{"x":[0,0,1,0.2,0],"y":[0,0.2,0.8,0,0],"text":"","type":"scatter","mode":"lines","line":{"width":1.88976377952756,"color":"transparent","dash":"solid"},"fill":"toself","fillcolor":"rgba(0,255,0,0.1)","hoveron":"fills","showlegend":false,"xaxis":"x","yaxis":"y","hoverinfo":"skip","frame":null},{"x":[0.2,1,1,0.2],"y":[0,0,0.8,0],"text":"NA","type":"scatter","mode":"lines","line":{"width":1.88976377952756,"color":"transparent","dash":"solid"},"fill":"toself","fillcolor":"rgba(255,0,0,0.1)","hoveron":"fills","showlegend":false,"xaxis":"x","yaxis":"y","hoverinfo":"skip","frame":null},{"x":[0.05,1.05],"y":[0.25,0.85],"text":"","type":"scatter","mode":"lines","line":{"width":1.88976377952756,"color":"rgba(0,0,0,0.2)","dash":"solid"},"hoveron":"points","showlegend":false,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null},{"x":[0.5,0.75,0.2],"y":[0.75,0.2,0.8],"text":["Open access,
not reproducible,
how we learn","Backdoor package access,
offline systems without a strategy","Admins involved,
no testing,
slow updates,
high risk of breakage"],"type":"scatter","mode":"markers","marker":{"autocolorscale":false,"color":"rgba(255,0,0,1)","opacity":1,"size":5.66929133858268,"symbol":"circle","line":{"width":1.88976377952756,"color":"rgba(255,0,0,1)"}},"hoveron":"points","name":"FALSE","legendgroup":"FALSE","showlegend":true,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null},{"x":[0.1,0.5,0.8],"y":[0.1,0.5,0.8],"text":["Admins test and approve
a subset of CRAN","All or most of CRAN,
updated with R versions,
tied to a system library","Open access, user or system
records perproject dependencies"],"type":"scatter","mode":"markers","marker":{"autocolorscale":false,"color":"rgba(163,197,134,1)","opacity":1,"size":5.66929133858268,"symbol":"circle","line":{"width":1.88976377952756,"color":"rgba(163,197,134,1)"}},"hoveron":"points","name":" TRUE","legendgroup":" TRUE","showlegend":true,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null},{"x":[0.125,0.525,0.525,0.825,0.775,0.225],"y":[0.125,0.525,0.775,0.825,0.225,0.825],"text":["Validated","Shared Baseline","Wild West","Snapshot","Blocked","Ticket System"],"hovertext":["","","","","",""],"textfont":{"size":14.6645669291339,"color":"rgba(0,0,0,1)"},"type":"scatter","mode":"text","hoveron":"points","showlegend":false,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null}],"layout":{"margin":{"t":43.7625570776256,"r":7.30593607305936,"b":40.1826484018265,"l":89.8630136986302},"font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187},"title":"Reproducing Environments: Strategies and Danger Zones","titlefont":{"color":"rgba(0,0,0,1)","family":"","size":17.5342465753425},"xaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[0.05,1.05],"tickmode":"array","ticktext":["Admins","","","","Users"],"tickvals":[0,0.25,0.5,0.75,1],"categoryorder":"array","categoryarray":["Admins","","","","Users"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"y","title":"Who is Responsible for Reproducing the Environment?","titlefont":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187},"hoverformat":".2f"},"yaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[0.05,1.05],"tickmode":"array","ticktext":["Locked Down","","","","Open"],"tickvals":[0,0.25,0.5,0.75,1],"categoryorder":"array","categoryarray":["Locked Down","","","","Open"],"nticks":null,"ticks":"","tickcolor":null,"ticklen":3.65296803652968,"tickwidth":0,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(235,235,235,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"x","title":"Package Access","titlefont":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187},"hoverformat":".2f"},"shapes":[{"type":"rect","fillcolor":null,"line":{"color":null,"width":0,"linetype":[]},"yref":"paper","xref":"paper","x0":0,"x1":1,"y0":0,"y1":1}],"showlegend":false,"legend":{"bgcolor":null,"bordercolor":null,"borderwidth":0,"font":{"color":"rgba(0,0,0,1)","family":"","size":11.689497716895},"y":1},"hovermode":"closest","barmode":"relative"},"config":{"doubleClick":"reset","modeBarButtonsToAdd":[{"name":"Collaborate","icon":{"width":1000,"ascent":500,"descent":50,"path":"M487 375c710 923 536l79259c3121123223111822123512l263 0c15 029 543 1513 1023 2328 375 135 251 37 0 0 0 3 1 7 1 5 1 8 1 11 0 2 0 41 6 0 31 51 6 1 2 2 4 3 6 1 2 2 4 4 6 2 3 4 5 5 7 5 7 9 16 13 26 4 10 7 19 9 26 0 2 0 5 0 91 41 6 0 8 0 2 2 5 4 8 3 3 5 5 5 7 4 6 8 15 12 26 4 11 7 19 7 26 1 1 0 4 0 91 41 7 0 8 1 2 3 5 6 8 4 4 6 6 6 7 4 5 8 13 13 24 4 11 7 20 7 28 1 1 0 4 0 71 31 61 7 0 2 1 4 3 6 1 1 3 4 5 6 2 3 3 5 5 6 1 2 3 5 4 9 2 3 3 7 5 10 1 3 2 6 4 10 2 4 4 7 6 9 2 3 4 5 7 7 3 2 7 3 11 3 3 0 8 0 131l01c7 2 12 2 14 2l218 0c14 0 255 3216 810 1023 637l79259c722133720437719103710l248 0c5 0921152327 012 413 1820 4120l264 0c5 0 10 2 16 5 5 3 8 6 10 11l85 282c2 5 2 10 2 17 73 137 1713z m304 0c1315 07 11 32 62l174 0c2 0 4 1 7 2 2 2 4 4 5 7l6 18c0 3 0 51 71 13 26 2l173 0c3 05182224447z m2473c1315 07 22 32 62l174 0c2 0 5 0 7 2 3 2 4 4 5 7l6 18c1 2 0 51 61 23 35 3l174 0c3 05173314456z"},"click":"function(gd) { \n // is this being viewed in RStudio?\n if (location.search == '?viewer_pane=1') {\n alert('To learn about plotly for collaboration, visit:\\n https://cpsievert.github.io/plotly_book/plotlyforcollaboration.html');\n } else {\n window.open('https://cpsievert.github.io/plotly_book/plotlyforcollaboration.html', '_blank');\n }\n }"}],"cloud":false,"displayModeBar":false},"source":"A","attrs":{"f87a7b9b28cc":{"intercept":{},"slope":{},"type":"scatter"},"f87a793a87a":{"x":{},"y":{},"text":{},"x.1":{},"y.1":{}},"f87a6f19e578":{"x":{},"y":{},"text":{},"x.1":{},"y.1":{}},"f87ad286244":{"x":{},"y":{},"text":{},"x.1":{},"y.1":{}},"f87a564b651b":{"x":{},"y":{},"text":{},"x.1":{},"y.1":{}},"f87a6fdafbdf":{"intercept":{},"slope":{}},"f87a11ce26d8":{"x":{},"y":{},"colour":{},"text":{},"x.1":{},"y.1":{}},"f87a75583809":{"x":{},"y":{},"label":{},"x.1":{},"y.1":{}}},"cur_data":"f87a7b9b28cc","visdat":{"f87a7b9b28cc":["function (y) ","x"],"f87a793a87a":["function (y) ","x"],"f87a6f19e578":["function (y) ","x"],"f87ad286244":["function (y) ","x"],"f87a564b651b":["function (y) ","x"],"f87a6fdafbdf":["function (y) ","x"],"f87a11ce26d8":["function (y) ","x"],"f87a75583809":["function (y) ","x"]},"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.2,"selected":{"opacity":1},"debounce":0},"base_url":"https://plot.ly",".hideLegend":true},"evals":["config.modeBarButtonsToAdd.0.click"],"jsHooks":[]}
For more information on picking and using these strategies, please visit
https://environments.rstudio.com. By adopting a strategy for reproducible
environments, R users, administrators, and teams can solve a number of important
challenges. Ultimately, reproducible work adds credibility, creating a solid
foundation for research, business applications, and production systems. We are
excited to be working on tools to make reproducible work in R easy and fun. We
look forward to your feedback, community discussions, and future posts.
_____='https://rviews.rstudio.com/2019/04/22/reproducibleenvironments/';
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R Views. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
survivalists [a Riddler’s riddle]
(This article was first published on R – Xi'an's Og, and kindly contributed to Rbloggers)
A neat question from The Riddler on a multiprobability survival rate:
Nine processes are running in a loop with fixed survivals rates .99,….,.91. What is the probability that the first process is the last one to die? Same question with probabilities .91,…,.99 and the probability that the last process is the last one to die.
The first question means that the realisation of a Geometric G(.99) has to be strictly larger than the largest of eight Geometric G(.98),…,G(.91). Given that the cdf of a Geometric G(a) is [when counting the number of attempts till failure, included, i.e. the Geometric with support the positive integers]
the probability that this happens has the nice (?!) representation
which leads to an easy resolution by recursion since
and
and a value of 0.5207 returned by R (Monte Carlo evaluation of 0.5207 based on 10⁷ replications). The second question is quite similar, with solution
and value 0.52596 (Monte Carlo evaluation of 0.52581 based on 10⁷ replications).
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Binning with Weights
(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to Rbloggers)
After working on the MOB package, I received requests from multiple users if I can write a binning function that takes the weighting scheme into consideration. It is a legitimate request from the practical standpoint. For instance, in the development of fraud detection models, we often would sample down nonfraud cases given an extremely low frequency of fraud instances. After the sample down, a weight value > 1 should be assigned to all nonfraud cases to reflect the fraud rate in the presample data.
While accommodating the request for weighting cases is trivial, I’d like to do a simple experitment showing what the impact might be with the consideration of weighting.
– First of all, let’s apply the monotonic binning to a variable named “tot_derog”. In this unweighted binning output, KS = 18.94, IV = 0.21, and WoE values range from 0.38 to 0.64.
– In the first trial, a weight value = 5 is assigned to cases with Y = 0 and a weight value = 1 assigned to cases with Y = 1. As expected, frequency, distribution, bad_frequency, and bad_rate changed. However, KS, IV, and WoE remain identical.
– In the second trial, a weight value = 1 is assigned to cases with Y = 0 and a weight value = 5 assigned to cases with Y = 1. Once again, KS, IV, and WoE are still the same as the unweighted output.
The conclusion from this demonstrate is very clear. In cases of twovalue weights assigned to the binary Y, the variable importance reflected by IV / KS and WoE values should remain identical with or without weights. However, if you are concerned about the binning distribution and the bad rate in each bin, the function wts_bin() should do the correction and is available in the project repository (https://github.com/statcompute/MonotonicBinning).
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Familiarisation with the Australian Election Study by @ellis2013nz
(This article was first published on free range statistics  R, and kindly contributed to Rbloggers)
The Australian Election Study is an impressive long term research project that has collected the attitudes and behaviours of a sample of individual voters after each Australian federal election since 1987. All the datasets and documentation are freely available. An individual survey of this sort is a vital complement to the necessarily aggregate results provided by the Australian Electoral Commission, and is essential for coming closer to understanding political and social change. Many countries have a comparable data source.
Many thanks to the original researchers:
McAllister, Ian; Makkai, Toni; Bean, Clive; Gibson, Rachel Kay, 2017, “Australian Election Study, 2016”, doi:10.4225/87/7OZCZA, ADA Dataverse, V2, UNF:6:TNnUHDn0ZNSlIM94TQphWw==; 1. Australian Election Study December 2016 Technical Report.pdf
In this blog post I just scratch the surface of the latest available data, from the time of the 2016 federal election. In later posts I’ll get into modelling it a bit deeper, and hopefully get a chance to look at some changes over time. Some of the code I use in this post for preparing the data might find its way into the ozfedelect R package, but not the data itself.
Getting the dataThe data need to be downloaded by hand from the Australian Data Archive via the Dataverse (an international open source research data repository) in order to agree to the terms and conditions of use. The following code requires the SPSS version of the data and the Excel version of the data dictionary to have been saved in an ./aes/ subfolder of the working directory.
The SPSS data has column names truncated at eight characters long, which means that some of the variables in the data don’t match those in the data dictionary. The code below makes the necessary manual adjustments to variable names from the dictionary to deal with this, and reclasses the categorical responses as factors with the correct level labels in the correct order. I do this by importing and joining to the data dictionary rather than extracting the attributes information from the SPSS import, which might have also worked; I prefer my manual approach so I can deal with some anomalies explicitly (for example, many of the values listed as 999 “skipped” in the data dictionary are NA in the SPSS data).
Post continues after code extract
#====================Preparation============= #packages library(tidyverse) library(scales) library(rio) library(svglite) library(frs) library(readxl) library(testthat) library(survey) library(vcd) library(RColorBrewer) #Import and tidy data # Download by hand from https://dataverse.ada.edu.au/dataset.xhtml?persistentId=doi:10.4225/87/7OZCZA aes2016_orig < import("aes/2. Australian Election Study, 2016.sav") aes2016_code_orig < read_excel("aes/1. Australian Election Study, 2016  Codebook.xlsx", sheet = "Data Dictionary") aes2016_code < aes2016_code_orig %>% rename(question_label = Label...3, answer_label = Label...5) %>% rename_all(tolower) %>% fill(variable, position, question_label) %>% mutate(var_abb = substring(variable, 1, 8)) %>% mutate(var_abb = case_when( var_abb == "H19_STAT" ~ "H19STATE", var_abb == "H19_PCOD" ~ "H19pcoRV", var_abb == "H20_othe" ~ "H20_oth", var_abb == "H25_othe" ~ "H25_oth", var_abb == "Final_ST" ~ "finSTATE", var_abb == "Samp_STA" ~ "SamSTATE", var_abb == "detailed" ~ "doutcome", var_abb == "Responde" ~ "responID", var_abb == "Start_Ti" ~ "Sta_time", var_abb == "Samp_PCO" ~ "SampcoRV", var_abb == "Final_PC" ~ "finpcoRV", var_abb == "Total_Du" ~ "totaldur", TRUE ~ var_abb )) %>% rbind(tibble( variable = "StateMap", position = NA, question_label = "Unidentified state variable", value = 1:8, answer_label = c("New South Wales", "Victoria", "Queensland", "South Australia", "Western Australia", "Tasmania", "Northern Territory", "Australian Capital Territory"), var_abb = "StateMap" )) aes2016_questions < distinct(aes2016_code, var_abb, position, question_label) aes2016_answers < aes2016_code %>% select(var_abb, variable, value, answer_label) %>% filter(!is.na(value)) # Check all the names in the data now match those in the data dictionary: expect_equal( names(aes2016_orig)[!names(aes2016_orig) %in% aes2016_questions$var_abb], character(0) ) # ... and vice versa: expect_equal( unique(aes2016_questions$var_abb)[!unique(aes2016_questions$var_abb) %in% names(aes2016_orig)], character(0) ) aes2016 < aes2016_orig %>% as_tibble() attributes(aes2016)$question_labels < names(aes2016) for(i in 1:ncol(aes2016)){ this_q < filter(aes2016_questions, var_abb == names(aes2016)[i]) # Sometimes a code like 999 'Skipped' is not present in the data but has already # been replaced with an NA, so we don't want it in our factor level. So we need # to find out what answers are actually used in the i'th column used_a < unique(pull(aes2016, i)) # the answer labels for this particular question these_a < aes2016_code %>% filter(var_abb == names(aes2016)[i]) %>% filter(value %in% used_a) %>% arrange(value) attributes(aes2016)$question_labels[i] < pull(this_q, question_label) # half a dozen open text questions don't match to the data dictionary so we exclude those: if(nrow(these_a) > 0 & !pull(this_q, question_label) %in% c("What kind of work do you do? Full job title", "What are (or were) the main tasks that you usually perform(ed)?", "What kind of business or industry is (or was) that in?", "What was your partner's main activity last week? (other specify)", "What kind of work does (or did) your partner do? Provide job title", "Generally speaking, does your partner usually think of himself or herself as Liberal, Labor, National or what? (other specify)") ){ # for the rest, we turn the response into a factor with the correct labels aes2016[ , i] < factor(pull(aes2016, i), labels = pull(these_a, answer_label)) } } aes2016 < aes2016 %>% mutate(age_2016 = 2016  H2, agegrp_2016 = cut(age_2016, breaks = c(0, 17.5, 24.5, 34.5, 44.5, 54.5, 64.5, Inf), labels = c("017", "1824", "2534", "3544", "4554", "5564", "65+")), agegrp_2016 = as.factor(replace_na(as.character(agegrp_2016), "Age unknown")), sex = replace_na(as.character(H1), "Sex unknown"), first_pref_hr_grp = case_when( B9_1 %in% c("Liberal Party", "National (Country) Party") ~ "Coalition", B9_1 == "Labor Party (ALP)" ~ "Labor", B9_1 == "Greens" ~ "Greens", B9_1 == "Voted informal"  is.na(B9_1) ~ "Did not vote/voted informal", TRUE ~ "Other" ), first_pref_sen_grp = case_when( B9_2 %in% c("Liberal Party", "National (Country) Party") ~ "Coalition", B9_2 == "Labor Party (ALP)" ~ "Labor", B9_2 == "Greens" ~ "Greens", B9_2 == "Voted informal"  is.na(B9_2) ~ "Did not vote/voted informal", TRUE ~ "Other" ), first_pref_sen_grp2 = case_when( B9_2 == "Liberal Party" ~ "Liberal", B9_2 == "National (Country) Party" ~ "National", B9_2 == "One Nation" ~ "One Nation", B9_2 == "Labor Party (ALP)" ~ "Labor", B9_2 == "Greens" ~ "Greens", TRUE ~ "Other (including did not vote)" ) ) %>% mutate(first_pref_sen_grp2 = toupper(first_pref_sen_grp2), first_pref_sen_grp2 = ifelse(grepl("^OTHER", first_pref_sen_grp2), "OTHER (incl. did not vote)", first_pref_sen_grp2)) %>% mutate(first_pref_sen_grp2 = fct_relevel(first_pref_sen_grp2, toupper(c("Greens", "Labor", "Liberal", "National", "One Nation")))) %>% mutate(tpp_alp = B11_1 %in% "Labor Party (ALP)"  B9_1 %in% "Labor Party (ALP)")The last few lines of that code also defines some summarised variables that lump various parties or nonvotes together and will be useful in later analysis.
Survey design and weightsThe AES in 2016 used two separate sampling frames: one provided by the Australian Electoral Commission, and one from the Geocoded National Address File or GNAF, a large, authoritative open data source of every address in Australia. For inference about Australian adults on the election roll (enrolment is required of all resident adult citizens by legislation), the AES team recommend using a combination of the two samples, and weights are provided in the wt_enrol column to do so.
The technical report describes data collection as mixedmode (hard copy and online versions), with a solid system of managing contacts, reminders and nonreturns.
The AEC sample is stratified by state, and the GNAF sample by Greater Capital City (GCC) statistical areas (which divide states into two eg “Greater Sydney” and “Rest of New South Wales”). As far as I can see no GCC variable is provided for enduse analysts such as myself. Nor are the various postcode variables populated, and I can’t see a variable for electorate/division (presumably for statistical disclosure control purposes) which is a shame as it would be an important variable to use for a variety of multilevel analyses. So we have to use “state” as our best regional variable.
Poststratification weighting (raking) was used by the AES to match the population proportions by sex, age group, state, and first preference vote in the House of Representatives. This latter calibration is particularly important to ensure that analysis gives appropriate weight to the supporters of the various parties; it is highly likely that nonresponse is related to political views (or nonviews).
The weights have been scaled to add up to the sample size, rather than to population numbers (as would be the approach of eg a national statistical office). This is presumably out of deference for SPSS’ behaviour of treating survey weights as frequencies, which leads to appalling results if they add up to anything other than the sample size (and still suboptimal results even then).
Table 16 of the technical report is meant to provide the weighting benchmarks for the poststratification weighting, but the state, age and sex totals given are only for the GNAF sample; and the benchmarks given for which party voted for appear to be simply incorrect in a way I cannot determine. They add up to much more than either the AEC or GNAF sample, but much less than their combined total; and I can’t match their results with any of the plausible combinations of filters I’ve tried. I’d be happy for anyone to point out if I’m doing something wrong, but my working hypothesis is that Table 16 simply contains some mistakes, whereas the weights in the actual data are correct.
The code below explores the three sets of weights provided, and defines a survey design object (using Thomas Lumley’s survey package) that treats state as the strata for sampling, and sex, state, age and House of Representatives vote as poststratification variables. Although I won’t make much use of the survey package today, I’ll almost certainly want it for more sophisticated analysis in a later post.
Post continues after code extract
#Weighting # The sample has two parts  addresses supplied by the AEC, and addresses from # the GNAF (Geocoded National Address File). There are separate weights for each, # so they can be analysed as two separate surveys or you can use the combined set of weights: table(aes2016$wt_aec > 0, useNA = 'always') # for comparison to previous waves of AES table(aes2016$wt_gnaf > 0, useNA = 'always') # inference about Australian adults table(aes2016$wt_enrol > 0, useNA = 'always') # inference about enrolled Australian adults # There are 107 cases from the GNAF sample that have zero weight in the final combined sample: filter(aes2016, wt_enrol <= 0) %>% select(wt_aec, wt_gnaf, wt_enrol) # wt_enrol had been raked for age group, sex, state and first preference vote. See pages 3233 # of the technical guide. # These three state variables seem identical, and are all different from those in Table 16 # of the technical report: cbind( table(aes2016$finSTATE), table(aes2016$StateMap), table(aes2016$SamSTATE) ) # In fact, Table 16 has the state, age and sex only of the GNAF sample, not the whole sample: table(filter(aes2016, wt_gnaf >0)$finSTATE) table(filter(aes2016, wt_gnaf >0)$agegrp_2016) table(filter(aes2016, wt_gnaf >0)$H1) # The first pref party vote in Table 16 looks to be just wrong. The results below match those # in the code book, but not in Table 16 table(aes2016$first_pref_hr_grp, useNA = 'always') table(aes2016$first_pref_sen_grp, useNA = 'always') # Well, we'll assume that wt_enrol has actually been weighted as described, and ignore # the problems in Table 16. # Set up survey design: aes2016_svy < svydesign(~1, strata = ~SamSTATE, weights = ~wt_enrol, data = aes2016) # Deduce population totals from the sums of weights (if weights were raked to match pop totals, # then the sum of the weights should reveal what they were) age_pop < aes2016 %>% group_by(agegrp_2016) %>% summarise(Freq = sum(wt_enrol)) sex_pop < aes2016 %>% group_by(sex) %>% summarise(Freq = sum(wt_enrol)) state_pop < aes2016 %>% group_by(SamSTATE) %>% summarise(Freq = sum(wt_enrol)) age2016_svy < rake(aes2016_svy, sample.margins = list(~sex, ~agegrp_2016, ~SamSTATE), population.margins = list(sex_pop, age_pop, state_pop)) # Compare the weighted and unweighted response to a survey question filter(aes2016_questions, var_abb == "H14_2") svytable(~H14_2, aes2016_svy) table(aes2016$H14_2) # Many more "yes" posted answers to the Internet when weighted. So people active on the internet # needed extra weight (probably younger) Analysis – vote and one questionNow we’re ready for some exploratory analysis. In Australia, the Senate is elected by proportional representation, a wider range of small parties stand candidates, and the first preference vote for that house is less subject to gaming by voters than is the singletransferrablevote system in the House of Representatives. So I prefer to use Senate vote when examining the range of political opinion, particularly of supporters of the smaller parties. Here is one of the graphics that I love during exploration but which I despair of using for communication purposes – a mosaic plot, which beautifully visualises a contingency table of counts (or, in this case, weighted counts). This particular graphic shows the views of people who voted for different parties in the Senate on the trade off between reducing taxes and spening more on social services:
As any casual observer of Australian politics would have predicted, disproportionately large numbers (coloured blue to show the positive discrepency from the nointeraction null hypothesis) of Greens voters (and to a less extent Labor) are mildly or strongly in favour of spending more on social services. Correspondingly these parties’ supporters have relatively few people counted in the top left corner (coloured red to show a negative discrepancy from the null hypothesis) supporting reducing tax as the priority.
In contrast, there are relatively few Liberal voters in the social services camp, and disproportionately high numbers who favour reducing taxes.
The “other” voters – which includes those who did not vote, could not remember, or did not vote for one of the five most prominent parties – are disproportionately likely to sit on the fence or not have a view on the tradeoff between social services and tax cuts, which again is consistent with expectations.
What about a different angle – what influences people when they decide their vote?
We see that Greens voters decided their vote disproportionately on the basis of “the policy issues”. Liberal voters have less people in the “policy” box than would be expected under the nointeraction null hypothesis and were more likely to decide based on either “the party leaders” or “the parties taken as a whole”. National voters stand out as the only party with a noticeable disproportionate count (high, in their case) in the “candidates in your electorate” box, emphasising the known spatiallyspecific nature of National’s support.
Here’s the code for drawing those mosaic plots, using the vcd R package that implements some useful ways of visualising categorical data.
Post continues after code extract
#======================selected bivariate============ #tax/social services x < svytable(~E1 + first_pref_sen_grp2, aes2016_svy) x < round(x, 2) colnames(x)[grepl("OTHER", colnames(x))] < "OTHER" rownames(x) < paste0(str_wrap(rownames(x), 20)) mosaic(x, shade = TRUE, border = "grey70", direction = "v", legend = legend_resbased(10, fontfamily = "Roboto", pvalue = FALSE), xlab = "x", ylab = "y", labeling = labeling_values( suppress = c(0, 1000), gp_labels = gpar(fontfamily = "Roboto", cex = 0.8), gp_varnames = gpar(col = "transparent"), just_labels = c("center", "right"), alternate_labels = c(TRUE, FALSE), rot_labels = c(0,90,0,0), offset_labels = c(1,0,1.5,0), digits = 0 )) #how decide vote x < svytable(~B5 + first_pref_sen_grp2, aes2016_svy) x < round(x, 2) colnames(x)[grepl("OTHER", colnames(x))] < "OTHER" rownames(x) < paste0(str_wrap(rownames(x), 20)) mosaic(x, shade = TRUE, border = "grey70", direction = "v", legend = legend_resbased(10, fontfamily = "Roboto", pvalue = FALSE), xlab = "x", ylab = "y", labeling = labeling_values( suppress = c(0, 1000), gp_labels = gpar(fontfamily = "Roboto", cex = 0.8), gp_varnames = gpar(col = "transparent"), just_labels = c("center", "right"), alternate_labels = c(TRUE, FALSE), rot_labels = c(0,90,0,0), offset_labels = c(1,0,1.5,0), digits = 0 )) Analysis – vote and a battery of questionsLike many social sciences surveys, the AES contains a number of batteries of questions with similar response sets. An advantage of this, both for an analyst and the consumer of their results, is that we can set up a common approach to comparing the responses to those questions. Here is a selection of diverging likert plots depicting interesting results from these questions:
Above, we see that Greens voters disproportionately think that Australia hasn’t gone far enough in allowing more migrants into the country, and One Nation voters unsurprisingly tend to think the opposite. Views on change in Australia relating to Aboriginal assistance and land rights follow a similar pattern. Interestingly, there is less partisan difference on some other issues such as “equal opportunities for women”. Overall, the pattern is clear of increasing tendency to think that “things have gone to far” as we move across the spectrum (GreensLaborLiberalNationalOne Nation).
When it comes to attitudes on the leftright economic issues, we see subtle but expected party differences (chart above). Liberal supporters are particularly inclined to think trade unions have too much power and should be regulated; but to disagree with the statement that income and wealth should be distributed towards ordinary working people.
In the social issues chart above, we see the Greens’ voters strongly disagreeing with turning back boats of asylum seekers, and reintroducing the death penalty – issues where One Nation voters tend to agree with the statement. Again, the clear sequencing of parties’ supporters (GreensLaborLiberalNationalOne Nation) is most obvious on these social/cultural touchpoint issues, in contrast to the leftright economic debate.
In the above, we see high levels of confidence in the armed forces, police and universities across all parties’ supporters, and distrust in the press, television (what does it even mean to trust the television?) and political parties. Differences by party in trust in institutions varies along predictable party lines eg voters of the Liberal party tend to have less confidence in unions. One Nation voters have particularly low trust in the Australian political system, consistent with the mainstream interpretation in political science of the rise of socially conservative populist parties of this sort.
Attitudes to public spend are fairly consistent across parties, with subtle but unsurprising differences in a few valuebased touch points such as defence (Greens favour spending less) and unemployment benefits (Greens favour spending more but the three conservative parties favour spending less; Labor voters are in the middle).
Finally, when it comes to factual knowledge of election law and constitutional issues, there is little to say from a party perspective. Many people don’t know the maximum period between federal elections for the lower house (the correct answer is three); and doubtless niether know nor care about the need to pay a deposit to stand for election. But there is no obvious difference by party voted for.
In summary, on many of the key social variables of the day, we see the expected greatest contrast between supporters of the Greens on one hand and One Nation on the other, with the ALP, Liberal and National voters strung out between. Economic issues indicate another but closely related dimension, with One Nation closer to the Greens than they are to the other socially conservative parties on statements such as “big business in this country has too much power”, “the government should take measures to reduce differences in income levels” and “trade unions in this country have too much power”.
Here’s the code for drawing those divergingLikert plots. I’m not particularly proud of it, it’s a bit repetitive (I hadn’t realised I’d be doing six of these), but it get’s the job done. There are some specialist R packages for drawing these charts, but I prefer (at this point) build them with the standard grammar of graphics tools in ggplot2 rather learn a new specific API.
 Post continues after code extract*
OK, watch this space for more analysis using the AES (2016 and other years), if and when I get time.
Note – I have no affiliation or contact with the Australian Election Study, and the AES researchers bear no responsibility for my analysis or interpretation of their data. Use of the AES data by me is solely at my risk and I indemnify the Australian Data Archive and its host institution, The Australian National University. The Australian Data Archive and its host institution, The Australian National University, are not responsible for the accuracy and completeness of the material supplied. Similarly, if you use my analysis, it is at your risk. Don’t blame me for anything that goes wrong, even if I made a mistake. But it would be nice if you let me know.
thankr::shoulders() %>% mutate(maintainer = str_squish(gsub("<.+>", "", maintainer)), maintainer = ifelse(maintainer == "Rcore", "R Core Team", maintainer)) %>% group_by(maintainer) %>% summarise(`Number packages` = sum(no_packages), packages = paste(packages, collapse = ", ")) %>% arrange(desc(`Number packages`)) %>% knitr::kable() %>% clipr::write_clip() maintainer Number packages packages Hadley Wickham 17 assertthat, dplyr, ellipsis, forcats, ggplot2, gtable, haven, httr, lazyeval, modelr, plyr, rvest, scales, stringr, testthat, tidyr, tidyverse R Core Team 13 base, compiler, datasets, graphics, grDevices, grid, methods, splines, stats, tools, utils, nlme, foreign Gábor Csárdi 4 cli, crayon, pkgconfig, zip Kirill Müller 4 DBI, hms, pillar, tibble Lionel Henry 4 purrr, rlang, svglite, tidyselect Winston Chang 4 extrafont, extrafontdb, R6, Rttf2pt1 Yihui Xie 4 evaluate, knitr, rmarkdown, xfun Achim Zeileis 3 colorspace, lmtest, zoo Jim Hester 3 glue, withr, readr Yixuan Qiu 3 showtext, showtextdb, sysfonts Dirk Eddelbuettel 2 digest, Rcpp Jennifer Bryan 2 readxl, cellranger Jeroen Ooms 2 curl, jsonlite Simon Urbanek 2 Cairo, audio “Thomas Lumley” 1 survey Alex Hayes 1 broom Alexander Walker 1 openxlsx Brian Ripley 1 MASS Brodie Gaslam 1 fansi Charlotte Wickham 1 munsell David Gohel 1 gdtools David Meyer 1 vcd Deepayan Sarkar 1 lattice Erich Neuwirth 1 RColorBrewer James Hester 1 xml2 Jeremy Stephens 1 yaml Joe Cheng 1 htmltools Justin Talbot 1 labeling Kamil Slowikowski 1 ggrepel Kevin Ushey 1 rstudioapi Marek Gagolewski 1 stringi Martin Maechler 1 Matrix Matt Dowle 1 data.table Max Kuhn 1 generics Michel Lang 1 backports Patrick O. Perry 1 utf8 Peter Ellis 1 frs Rasmus Bååth 1 beepr Simon Garnier 1 viridisLite Stefan Milton Bache 1 magrittr Terry M Therneau 1 survival Thomas J. Leeper 1 rio Vitalie Spinu 1 lubridate var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: free range statistics  R. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
FizzBuzz in R and Python
(This article was first published on Method Matters, and kindly contributed to Rbloggers)
In this post, we will solve a simple problem (called “FizzBuzz“) that is asked by some employers in data scientist job interviews. The question seeks to ascertain the applicant’s familiarity with basic programming concepts. We will see 2 different ways to solve the problem in 2 different statistical programming languages: R and Python.
The FizzBuzz QuestionI came across the FizzBuzz question in this excellent blog post on conducting data scientist interviews and have seen it referenced elsewhere on the web. The intent of the question is to probe the job applicant’s knowledge of basic programming concepts.
The prompt of the question is as follows:
In pseudocode or whatever language you would like: write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”. The Solution in RWe will first solve the problem in R. We will make use of control structures: statements “which result in a choice being made as to which of two or more paths to follow.” This is necessary for us to A) iterate over a number sequence and B) print the correct output for each number in the sequence. We will solve this problem in two slightly different ways.
Solution 1: Using a “for loop”
The classical way of solving this problem is via a “for loop“: we use the loop to explicitly iterate through a sequence of numbers. For each number, we use ifelse statements to evaluate which output to print to the console.
The code* looks like this:
for (i in 1:100){if(i%%3 == 0 & i%%5 == 0) {
print('FizzBuzz')
}
else if(i%%3 == 0) {
print('Fizz')
}
else if (i%%5 == 0){
print('Buzz')
}
else {
print(i)
}
}
Our code first explicitly defines the number sequence from 1 to 100. For each number in this sequence, we evaluate a series of ifelse statements, and based on the results of these evaluations, we print the appropriate output (as defined in the question prompt).
The double parentheses is a modulo operator, which allows us to determine the remainder following a division. If the remainder is zero (which we test via the double equals signs), the first number is divisible by the second number.
We start with the double evaluation – any other ordering of our ifelse statements would cause the program to fail (because numbers which pass the double evaluation by definition also pass the single evaluations). If a given number in the sequence from 1 to 100 is divisible both by 3 and by 5, we print “FizzBuzz” to the console.
We then continue our testing, using “else if” statements. The program checks each one of these statements in turn. If the number fails the double evaluation, we test to see if it is divisible by 3. If this is the case, we print “Fizz” to the console. If the number fails this evaluation, we check to see if it is divisible by 5 (in which case we print “Buzz” to the console). If the number fails each of these evaluations (meaning it is not divisible by 3 or 5), we simply print the number to the console.
Solution 2: Using a function
In general, for loops are discouraged in R. In earlier days, they were much slower than functions, although this seems to be no longer true. The wonderful R programmer Hadley Wickham advocates for a functional approach based on the fact that functions are generally clearer about what they’re trying to accomplish. As he writes: “A for loop conveys that it’s iterating over something, but doesn’t clearly convey a high level goal… [Functions are] tailored for a specific task, so when you recognise the functional you know immediately why it’s being used.” (There are many interesting and strong opinions about using loops vs. functions in R; I won’t get into the details here, but it’s informative to see the various opinions out there.)
The FizzBuzz solution using a function looks like this:
# define the functionfizz_buzz < function(number_sequence_f){
if(number_sequence_f%%3 == 0 & number_sequence_f%%5 == 0) {
print('FizzBuzz')
}
else if(number_sequence_f%%3 == 0) {
print('Fizz')
}
else if (number_sequence_f%%5 == 0){
print('Buzz')
}
else {
print(number_sequence_f)
}
}
# apply it to the numbers 1 to 100
sapply(seq(from = 1, to = 100, by = 1), fizz_buzz)
The code above defines a function called “fizz_buzz”. This function takes an input (which I have defined in the function as number_sequence_f), and then passes it through the logical sequence we defined using the “if” statements above.
We then use sapply to apply the FizzBuzz function to a sequence of numbers, which we define directly in the apply statement itself. (The seq command allows us to define a sequence of numbers, and we explicitly state that we want to start at 1 and end at 100, advancing in increments of 1).
This approach is arguably more elegant. We define a function, which is quite clear as to what its goal is. And then we use a singleline command to apply that function to the sequence of numbers we generate with the seq function.
However, note that both solutions produce the output that was required by the question prompt!
The Solution in PythonWe will now see how to write these two types of programs in Python. The logic will be the same as that used above, so I will only comment on the slight differences between how the languages implement the central ideas.
Solution 1: Using a “for loop”
The code to solve the FizzBuzz problem looks like this:
for i in range(1,101):if (i % 3 == 0) & (i % 5 == 0):
print('FizzBuzz')
elif (i % 3 == 0):
print('Fizz')
elif (i % 5 == 0):
print('Buzz')
else:
print(i)
Note that the structure of definition of the “for loop” is different. Whereas R requires {} symbols around all the code in the loop, and for the code inside of each ifelse statement, Python uses the colon and indentations to indicate the different parts of the control structure.
I personally find the Python implementation cleaner and easier to read. The indentations are simpler and make the code less cluttered. I get the sense that Python is a true programming language, whereas R’s origins in the statistical community have left it with comparatively more complicated ways to accomplish the same programming goal.
Note also that, if we want to stop at 100, we have to specify that the range in our for loop stops at 101. Python has a different way of defining ranges. The first element in the range indicates where to start, but the last number in the range should be 1 greater than the number you actually want to stop at.
Solution 2: Using a function
Note that we can also use the functional approach to solve this problem in Python. As we did above, we wrap our control flow in a function (called fizz_buzz), and apply the function to the sequence of numbers from 1 to 100.
There are a number of different ways to do this in Python. Below I use pandas, a tremendously useful Python library for data manipulation and munging, to apply our function to the numbers from 1 to 100. Because the pandas apply function cannot be used on arrays (the element returned by the numpy np.arange), I convert the array to a Series and call the apply function on that.
# import the pandas libraryimport pandas as pd
# define the function
def fizz_buzz(number_sequence_f):
if (number_sequence_f % 3 == 0) & (number_sequence_f % 5 == 0):
return('FizzBuzz')
elif (number_sequence_f % 3 == 0):
return('Fizz')
elif (number_sequence_f % 5 == 0):
return('Buzz')
else:
return(number_sequence_f)
# apply it to the number series
pd.Series(np.arange(1,101)).apply(fizz_buzz)
One final bit of programming details. The for loop defined above calls range, which is actually an iterator, not a series of numbers. Long story short, this is appropriate for the for loop, but as we need to apply our function to an actual series of numbers, we can’t use the range command here. Instead, in order to apply our function to a series of numbers, we use the numpy arange command (which returns an array of numbers). As described above, we convert the array to a pandas Series, which is then passed to our function.
Does this make sense as an interview question?As part of a larger interview, I suppose, this question makes sense. A large part of applied analytical work involves manipulating data programmatically, and so getting some sense of how the candidate can use control structures could be informative.
However, this doesn’t really address the biggest challenges I face as a data scientist, which involve case definition and scoping: e.g. how can a data scientist partner with stakeholders to develop an analytical solution that answers a business problem while avoiding unnecessary complexity? The key issues involve A) analytical maturity to know what numerical solutions are likely to give a reasonable solution with the available data, B) business sense to understand the stakeholder problem and translate it into a form that allows a data science solution, and C) a combination of A and B (analytical wisdom and business sense) to identify the use cases for which analytics will have the largest business impact.**
So data science is a complicated affair, and I’m not sure the FizzBuzz question gives an interviewer much sense of the things I see as being important in a job candidate.*** For another interesting opinion on what’s important in applied analytics, I can recommend this excellent blog post which details a number of other important but underdiscussed aspects of the data science profession.
ConclusionIn this post, we saw how to solve a data science interview question called “FizzBuzz.” The question requires the applicant to think through (and potentially code) a solution which involves cycling through a series of numbers, and based on certain conditions, printing a specific outcome.
We solved this problem in two different ways in both R and Python. The solutions allowed us to see different programming approaches (for loops and functions) for implementing control flow logic. The for loops are perhaps the most intuitive approach, but potentially more difficult for others to interpret. The functional approach benefits from clarity of purpose, but requires a slightly higher level of abstract thinking to code. The examples here serve as a good illustration of the differences between R and Python code. R has its origins in the statistical community, and its implementation of control structures seems less straightforward than the comparable implementations in Python.
Finally, we examined the value of the FizzBuzz interview question when assessing qualities in a prospective data science candidate. While the FizzBuzz question definitely provides some information, it misses a lot of what (in my view) comprises the critical skillset that is needed to deliver data science projects in a business setting.
Coming Up Next
In the next post, we will see how to use R to download data from Fitbit devices, employing a combination of existing R packages and custom API calls.
Stay tuned!
—
* My goto html code formatter for R has disappeared! If anyone knows of such a resource, it would be hugely helpful in making the R code on this blog more readable. Please let me know in the comments! (The Python code is converted with tohtml).
** And then, of course, once a proper solution has been developed, it has to be deployed. Can you partner with data engineers and/or other IT stakeholders to integrate your results into business processes? (Notice the importance of collaboration – a critical but seldomdiscussed aspect of the data science enterprise).
*** In defense of the abovereferenced blog post, the original author is quite nuanced and does not only rely on the FizzBuzz question when interviewing candidates!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Method Matters. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...