
### House effects, herding, and the last few days before the election by @ellis2013nz

Tue, 05/14/2019 - 16:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

So, we’re down to the last few days before the Australian federal election, the first one that I’ve been tracking polls and making forecasts for. I thought I’d address a couple of points raised on Twitter about my forecasts. My forecasts are generally a bit more sceptical of a clean ALP win (ie 76 or more seats in the House of Representatives) than most of the published punditry (like this “Labor will win this election… virtually unquestionable” piece, for instance). However, my forecasts are mostly about uncertainty – I’m not even forecasting a close election as such, as much as saying that we have fairly low level knowledge about what is going to happen, which happens to translate into a 38% chance of a clean ALP win. It might be a biggish win; very little will surprise me. That’s a difficult thing to sell to the public.

House effects

Kevin Bonham pointed out on Twitter that my forecasts assume that house effects (ie the structural under- or over-estimates of voting behaviour by individual polling firms) are the same over time. As surveying firms try to improve their methods, and as public behaviour changes, we would expect to see differences in the “biases” (speaking statistically) of each firm over time. Indeed, the same firm can publish data simultaneously from surveys that collect data from different modes (eg phone versus online). I haven’t attempted to capture those nuances, never mind changes over time as the firms try to fix issues.

This is a definite limitation of my method, and one I’d like to fix for future elections. I also reflected on this issue in my forecasts of the New Zealand election, and even did a blog post on it; I concluded from the minimal effects seen there that the house effects changed surprisingly slowly so I could ignore the phenomenon and judge each firm on its overall results. That might not be the case for Australian polls, so I should at least try building in changing house effects over time. Unfortunately this will have the impact of increasing even further the uncertainty in translating polling numbers to election results.

While I don’t have time for that today, I do have time to at least explore the most likely problem candidate, Roy Morgan. My Bayesian state space model finds a strong over-estimate of the ALP vote in Roy Morgan, comparing it to the other polling firms and to actual election outcomes as far back as I have data. Bonham suggests that more recently, Roy Morgan polls if anything now under-estimate the ALP vote.
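The intuition behind a house effect can be shown with a toy calculation. This is only the intuition, not the actual approach (the real estimates come from a Bayesian state space model that fits house effects jointly with latent voting intention); the firms and figures below are invented for illustration:

```r
# Toy illustration of a house effect: each firm's average published ALP vote
# minus the average across all firms over the same period.
# Firms "A", "B", "C" and their numbers are made up.
polls <- data.frame(
  firm = c("A", "A", "B", "B", "C", "C"),
  alp  = c(54, 55, 52, 51, 52.5, 51.5)
)

house_effects <- tapply(polls$alp, polls$firm, mean) - mean(polls$alp)
round(house_effects, 2)
#     A     B     C
#  1.83 -1.17 -0.67
```

On this crude view, firm A leans about 1.8 points towards the ALP relative to the pack; the real model does the comparison against estimated true voting intention (anchored by election results), not against the raw average of other firms.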

Here’s the house effects shown by my current model:

Let’s look at the voting intention reported by the four current main pollsters by election year. Just looking at these density curves certainly does suggest that this time around, Roy Morgan has reversed and is now underestimating ALP vote; the purple density is noticeably to the left of the others:

Compare to how far the yellow is to the right of the crowd in the first plot. Sorry for not matching the colours; tempus fugit.

However, I’m not going to draw firm conclusions from this. We can see that Roy Morgan have published relatively few polls this election cycle, and they have mostly been relatively late:

ALP support now is lower than it was six months ago, so the fact that Roy Morgan’s average reported intended vote for the ALP is lower than the other firms’ is not as convincing as it might seem. There are just too few data points to be sure. In effect, my model judges Roy Morgan on their performance over the full period for which data are available, and overall that data shows an overestimate of the ALP vote. It’s possible they’ve fixed the problem, but they’ll need to get quite a few runs on the board to overcome their bad history on this.

Having said that, I think there is enough data (and other more theoretical considerations) to be confident that house effects in general change over time, so when I can I will address it. I can also say that I think this issue is not important in my current forecasts; excluding Roy Morgan from my data altogether does not lead to a noticeably different result. The real driver of my less-certain-than-normal-pundits’ outlook for an ALP win is the uncertainty in my model, particularly in translating an uncertain two-party-preferred swing into an even more uncertain set of seat changes.

Here’s the R code for the above examination of Roy Morgan’s house effect:

Post continues after code extract

```r
library(ozfedelect)
library(tidyverse)
library(lubridate)

#-------------Has Roy Morgan fixed the pro-ALP overestimate----------------
d <- ozpolls %>%
  filter(firm %in% c("Election result", "Roy Morgan", "Newspoll",
                     "Galaxy", "Ipsos", "Essential")) %>%
  filter(preference_type == "Two-party-preferred") %>%
  filter(party == "ALP")

d %>%
  filter(firm != "Election result") %>%
  ggplot(aes(x = intended_vote, colour = firm)) +
  geom_density() +
  facet_wrap(~election_year) +
  labs(x = "Intended two-party-preferred vote for the ALP") +
  ggtitle("Differences in polling firms' estimates of ALP vote by election year")

d %>%
  filter(firm != "Election result") %>%
  ggplot(aes(x = mid_date, y = intended_vote, colour = firm)) +
  geom_point() +
  facet_wrap(~firm) +
  ggtitle("Roy Morgan polls for the 2019 election are few and late",
          "We can't conclude the ALP overestimate is fixed from these data points.") +
  labs(x = "Survey mid date", y = "Intended two-party-preferred vote for the ALP")
```

Herding

Another issue I and others have been pondering is the low level of variation between polls. Of the last 16 published national polls, 7 have placed the ALP two-party-preferred vote at 51%, 7 at 52%, and one each at 52.5% and 53%. As Mark the Ballot points out, that’s less variation than you’d expect if they really were based on random samples. The most likely explanation for this underdispersion is that polling firms are introducing errors in their processing that push their polls towards a consensus view. This is a known phenomenon around the world, as this 2014 article by Nate Silver on FiveThirtyEight demonstrates. The net effect is to reduce the usefulness of polls, by bringing them towards a pundit-based folly of the crowds consensus.
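To see how underdispersed this is, compare the sampling standard error we would expect from genuine simple random samples with the spread actually observed — a back-of-envelope check, before any rounding effects are considered (the 52% consensus share used below is an approximation):

```r
# Expected standard error of a single poll of 1,000 respondents,
# in percentage points, at roughly the consensus ALP share
p <- 0.52
n <- 1000
se <- sqrt(p * (1 - p) / n) * 100
round(se, 2)   # 1.58

# Standard deviation actually observed across those 16 published polls
observed <- c(rep(51, 7), rep(52, 7), 52.5, 53)
round(sd(observed), 2)   # 0.65
```

Pure sampling noise alone should spread individual polls about 1.6 points around the true value, yet the published polls cluster within well under half that.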

It’s not clear if the errors are being introduced accidentally (for example, by unconsciously giving more scrutiny and checks to “suspicious” points that break with the consensus view) or more deliberately through some kind of smoothing or regularisation. Either way, it means that the polling information is worth less than it would be with a more straightforward production process. But it’s something we’re going to have to live with, and just note yet another reason for being less confident in what we think is going to happen on Saturday – whatever it is that we think!

Here’s a comparison of what we’ve seen in the reported polls versus what we’d expect in actual random samples. I’ve used Mark the Ballot’s assumption of sample sizes of 1,000; in practice, some samples are larger and some smaller, but as we are ignoring non-sampling error I am happy to err a little on the small side for sampling. Also, note that the reported results are (usually) rounded in ways that impact on the distribution of overall results in complicated but unimportant ways. To derive the results below I’ve mimicked that rounding process in 100,000 simulated samples of 25 polls, all of them rounded to the nearest percent:

Here’s the code for that:

Post continues after code extract

```r
#----------How bad is the underdispersion---------
set.seed(123)

d2 <- d %>%
  filter(year(mid_date) == 2019) %>%
  select(mid_date, observed_polls = intended_vote)

n <- nrow(d2)
reps <- 100000

sims <- matrix(round(rbinom(n * reps, 1000, mean(d2$observed_polls) / 100) / 10),
               ncol = n, nrow = reps)

d4 <- tibble(standard_deviations = apply(sims, 1, sd))

d4 %>%
  ggplot(aes(x = standard_deviations)) +
  geom_density(fill = "grey", alpha = 0.5) +
  geom_vline(xintercept = sd(d2$observed_polls), colour = "red") +
  annotate("text", x = 0.45, y = 0.5, hjust = 0,
           label = paste0("Only ",
                          round(mean(d4$standard_deviations < sd(d2$observed_polls)), 3),
                          " of simulated poll\nsequences have a standard\ndeviation less than has been\nobserved in the ",
                          n, " polls in 2019.")) +
  annotate("text", x = 1, y = 1.5, colour = "red", label = "Observed", hjust = 1) +
  labs(x = paste0("Standard deviations of simulated sequences of ", n,
                  " surveys with 1,000 respondents, with survey results rounded")) +
  ggtitle("The polls published in 2019 vary somewhat less than they should if random",
          "The amount of variance is surprisingly low, but not impossibly so.")
```

For the record – forecasts as at 15 May 2019

I find it interesting that some people react to forecasts as though they are normative. I got a number of retweets from rightists – including Australians who proudly wear their Make America Great Again caps to early voting and tweet about it, thereby handing cultural scientists all sorts of interesting material – for my most recent update on my election predictions. And a few from left-leaning sceptics. Let me state my own intent to keep my own political preferences (which are strong) out of this; I want to avoid the problems that lead to herding of opinion pollsters, for one thing.
While I can understand people who don’t like to see forecasts that their side is less likely to win than they thought (or do like to see forecasts that their own party may have a chance after all), I think we should endeavour to avoid our preferences clouding our judgement. Here are my forecasts – with big chunks of uncertainty – as at 15 May 2019:

My forecast for the Australian 2019 federal election on 18 May 2019 (with the latest polling data available at 14 May 2019) is a narrow ALP win in their own right (36% chance) or with the presumed support of one or two Green members (47% chance in total). However, a range of other options are very much in play. An 80% prediction interval for the ALP in the House of Representatives is 68 to 80 seats. The Liberal-National Coalition is likely to end up with between 67 and 78 seats. An outright win is a distinct possibility, and a win with the help of independents or minor parties more so. The Greens are most likely to end up retaining their single seat, but have a fair chance of picking up a second. In total, parties other than the ALP and Coalition are likely to end up with three to six seats.
The range of possibilities with regard to numbers of seats:

The predicted two-party-preferred vote:

The R packages that make this possible

```r
thankr::shoulders() %>%
  mutate(maintainer = str_squish(gsub("<.+>", "", maintainer)),
         maintainer = ifelse(maintainer == "R-core", "R Core Team", maintainer)) %>%
  group_by(maintainer) %>%
  summarise(`Number packages` = sum(no_packages),
            packages = paste(packages, collapse = ", ")) %>%
  arrange(desc(`Number packages`)) %>%
  knitr::kable() %>%
  clipr::write_clip()
```

| maintainer | Number packages | packages |
|------------|-----------------|----------|
| Hadley Wickham | 15 | assertthat, dplyr, forcats, ggplot2, gtable, haven, httr, lazyeval, modelr, plyr, rvest, scales, stringr, tidyr, tidyverse |
| R Core Team | 11 | base, compiler, datasets, graphics, grDevices, grid, methods, stats, tools, utils, nlme |
| Yihui Xie | 5 | evaluate, highr, knitr, rmarkdown, xfun |
| Kirill Müller | 4 | DBI, hms, pillar, tibble |
| Lionel Henry | 4 | purrr, rlang, svglite, tidyselect |
| Winston Chang | 4 | extrafont, extrafontdb, R6, Rttf2pt1 |
| Gábor Csárdi | 3 | cli, crayon, pkgconfig |
| Jim Hester | 3 | glue, withr, readr |
| Yixuan Qiu | 3 | showtext, showtextdb, sysfonts |
| Dirk Eddelbuettel | 2 | digest, Rcpp |
| Edzer Pebesma | 2 | sf, units |
| Jennifer Bryan | 2 | cellranger, readxl |
| Peter Ellis | 2 | frs, ozfedelect |
| Simon Urbanek | 2 | audio, Cairo |
| Achim Zeileis | 1 | colorspace |
| Alex Hayes | 1 | broom |
| Brian Ripley | 1 | class |
| Brodie Gaslam | 1 | fansi |
| Charlotte Wickham | 1 | munsell |
| Claus O. Wilke | 1 | cowplot |
| David Gohel | 1 | gdtools |
| David Meyer | 1 | e1071 |
| Deepayan Sarkar | 1 | lattice |
| James Hester | 1 | xml2 |
| Jeremy Stephens | 1 | yaml |
| Jeroen Ooms | 1 | jsonlite |
| Joe Cheng | 1 | htmltools |
| Justin Talbot | 1 | labeling |
| Kamil Slowikowski | 1 | ggrepel |
| Kevin Ushey | 1 | rstudioapi |
| Marek Gagolewski | 1 | stringi |
| Matthew Lincoln | 1 | clipr |
| Max Kuhn | 1 | generics |
| Michel Lang | 1 | backports |
| Patrick O. Perry | 1 | utf8 |
| Rasmus Bååth | 1 | beepr |
| Roger Bivand | 1 | classInt |
| Stefan Milton Bache | 1 | magrittr |
| Vitalie Spinu | 1 | lubridate |

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### A Detailed Guide to ggplot colors

Tue, 05/14/2019 - 16:00

(This article was first published on Learn R Programming & Build a Data Science Career | Michael Toth, and kindly contributed to R-bloggers)

Once you’ve figured out how to create the standard scatter plots, bar charts, and line graphs in ggplot, the next step to really elevate your graphs is to master working with color. Strategic use of color can really help your graphs to stand out and make an impact. In this guide, you’ll learn how to incorporate your own custom color palettes into your graphs by modifying the base ggplot colors.
By the end of this tutorial, you’ll know how to:

• Change all items in a graph to a static color of your choice
• Differentiate between setting a static color and mapping a variable in your data to a color palette so that each color represents a different level of the variable
• Customize your own continuous color palettes using the scale_color_gradient, scale_fill_gradient, scale_color_gradient2, and scale_fill_gradient2 functions
• Customize your own color palettes for categorical data using the scale_color_manual and scale_fill_manual functions

Introducing video tutorials!

I’m also excited to try something new in this guide! I’ll be adding video tutorials to accompany the content, so please let me know what you think about these and if you find them helpful. I’d love to do more of this in the future if you find them valuable!

Get my free workbook to build a deeper understanding of ggplot colors!

Have you ever read a tutorial or guide, learned a bunch of interesting things, only to forget them shortly after you finish reading? Me too. And it’s really annoying! Unfortunately, our brains aren’t good at remembering what we read. We need to think critically and be engaged in solving problems to learn information so it sticks. That’s why I’ve created a free workbook to accompany this post. The workbook is an R file that includes additional questions and exercises to help you engage with this material.

Get your free workbook to master working with colors in ggplot

A high-level overview of ggplot colors

By default, ggplot graphs use a black color for lines and points and a gray color for shapes like the rectangles in bar graphs. Sometimes this is fine for your purposes, but often you’ll want to modify these colors to something different. Depending on the type of graph you’re working with, there are two primary attributes that affect the colors in a graph.
You use the color attribute to change the outline of a shape, and you use the fill attribute to fill the inside of a shape. Specifically, we use the color attribute to change the color of any points and lines in your graph. This is because points and lines are 0- and 1-dimensional objects, so they don’t have any inside to fill!

```r
library(tidyverse)

ggplot(iris) +
  geom_point(aes(x = Sepal.Width, y = Sepal.Length), color = 'red')
```

In contrast, bars and other 2-dimensional shapes do have an inside to fill, so you will be using the fill attribute to change the color of these items in your graph:

```r
ggplot(mpg) +
  geom_bar(aes(x = class), fill = 'red')
```

Side note: technically you can also use the color attribute to change the outline of shapes like bars in a bar graph. I use this functionality very rarely, and for the sake of simplicity I will not go into this in further detail in this guide.

Except for the difference in naming, color and fill operate very similarly in ggplot. As you’ll see, the functions that exist for modifying your ggplot colors all come in both color and fill varieties. But before we get to modifying the colors in your graphs, there’s one other thing we need to touch on first.

Modifying ggplot colors: static color vs. color mapping

We need to distinguish between two different ways of modifying colors in a ggplot graph. The two things we can do are:

1. setting a static color for our entire graph
2. mapping a variable to a color so each level of the variable is a different color in our graph

In the earlier examples, we used a static color (red) to modify all of the points and bars in the two graphs that we created. It’s often the case, however, that we want to use color to convey additional information in our graph. Usually, we do this by mapping a variable in our dataset to the color or fill aesthetic, which tells ggplot to use a different color for each level of that variable in the data.
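The whole distinction comes down to whether the colour is specified inside or outside of aes(). A minimal side-by-side, using the built-in iris data:

```r
library(ggplot2)

# Static colour: specified OUTSIDE aes(), so every point is red
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(color = 'red')

# Mapped colour: specified INSIDE aes(), so each level of Species
# gets its own colour, and ggplot adds a legend automatically
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(aes(color = Species))
```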
Setting a static color is pretty straightforward, and you can use the two examples above as references for how to accomplish that. In the rest of this guide, I’m going to show you how you can map variables in your data to colors in your graph. You’ll learn about the different functions in ggplot to set your own color palettes and how they differ for continuous and categorical variables.

Working with Color Palettes for Continuous Data

Let’s start with a simple example of the default continuous color palette in ggplot. First, we’ll generate some random data that we’ll use for our graph.

```r
df <- data.frame(
  x = runif(100),       # 100 uniformly distributed random values
  y = runif(100),       # 100 uniformly distributed random values
  z1 = rnorm(100),      # 100 normally distributed random values
  z2 = abs(rnorm(100))  # 100 normally distributed random values mapped to positive
)
```

On sequential color scales and ggplot colors

When we map a continuous variable to a color scale, we map the values for that variable to a color gradient. You can see the default ggplot color gradient below. This is called a sequential color scale, because it maps data sequentially from one color to another color. The minimum value in your dataset will be mapped to the left side (dark blue) of this sequential color gradient, while the maximum value will be mapped to the right side (light blue). You can imagine stretching a number line across this gradient of colors. Then, for every value in your data, you find it on the number line, take the color at that location, and graph using that resulting color. Let’s see how this works in practice. Using the random data we generated above, we’ll graph a scatter plot of the x and y variables.
To illustrate the color gradient, we’ll map the z2 variable to the color aesthetic:

```r
# Default colour scale colours from dark blue (low) to light blue (high)
g1 <- ggplot(df, aes(x, y)) +
  geom_point(aes(color = z2))

g1
```

Note the contrast between this syntax and the syntax before, where we set a static color for our graph. Here, we aren’t specifying the color to use; we’re simply telling ggplot to map the z2 variable to the color aesthetic by including the mapping color = z2 within the aes function. In the dataset that I created, the minimum value for the z2 variable is 0.0024422, while the maximum value is 2.6241346. All values–and therefore all colors–fall between these minimum and maximum levels.

Modifying our ggplot colors for continuous data using scale_color_gradient

Now that you understand how ggplot can map a continuous variable to a sequential color gradient, let’s go into more detail on how you can modify the specific colors used within that gradient. Instead of the default blue gradient that ggplot uses, we can use any color gradient we want! To modify the colors used in this scale, we’ll be using the scale_color_gradient function to modify our ggplot colors.

Side note: if we were instead graphing bars or other fillable shapes, we would use the scale_fill_gradient function. For brevity, I won’t be including an example of this function. It operates in exactly the same way as the scale_color_gradient function, so you can easily modify this code to work for filling graphs with color as well.

Using the same graph from before, we simply add a call to the scale_color_gradient function to modify our color palette. Here, we can specify our own values for low and high to customize the gradient of colors in our graph. In this case, we’ll be mapping low values to greenyellow and high values to forestgreen.
```r
g1 + scale_color_gradient(low = 'greenyellow', high = 'forestgreen')
```

By modifying the values you’re passing to the scale_color_gradient function, you can create a sequential color scale between any two colors! Under the hood, ggplot was already using this color scale with the dark blue and light blue colors that show up by default. By adding this color scale to the graph and specifying your own colors, you’re simply overriding the default values that ggplot was already using.

On diverging gradient scales and ggplot colors

Sequential color scales are great when you want to easily differentiate between low and high values in a dataset. Sometimes, however, that’s not what you want. Sometimes you want to look at deviations from a certain baseline value, and you care about distinguishing both positive and negative deviations. For this type of data, we use what’s called a diverging color scale. A diverging color scale creates a gradient between three different colors, allowing you to easily identify low, middle, and high values within your data. You can see an example of a diverging color scale below. In this color scale, we see that blue is associated with values on the low end, white with values in the middle, and red with values on the high end. Among other things, this type of scale is often used when presenting United States presidential election results.

Instead of the scale_color_gradient function that we used for a sequential color palette, we’re now going to use the scale_color_gradient2 function to produce a diverging palette.

Side note: Again, there is a similar function called scale_fill_gradient2 that we would use if we were instead graphing bars or other fillable shapes. I won’t be including an example of this function, but it operates in exactly the same way as the scale_color_gradient2 function, so you can easily modify this code to work for filling graphs with color as well.
As before, we tell the scale_color_gradient2 function which colors to map to low and high values of our variable. In addition, we also specify a color to map the mid values to. As in the color scale we just reviewed, we’ll use the blue-white-red color palette for this example:

```r
ggplot(df, aes(x, y)) +
  geom_point(aes(colour = z1)) +
  scale_color_gradient2(low = 'blue', mid = 'white', high = 'red')
```

While you can technically specify any 3 colors for a diverging color scale, the convention is to use a light color like white or light yellow in the middle and darker colors of different hues for both low and high values, like we’ve done here.

Working with Color Palettes for Categorical Data

When working with continuous data, each value in your dataset was automatically mapped to a value on a 2-color sequential gradient or 3-color diverging gradient, as we just saw. The goal was to show a smooth transition between colors, highlighting low and high values or low, middle, and high values in the data. When working with categorical data, each distinct level in your dataset will be mapped to a distinct color in your graph. With categorical data, the goal is to have highly differentiated colors so that you can easily identify data points from each category. There are built-in functions within ggplot to generate categorical color palettes. That said, I’ve always preferred the control I get from generating my own, and that’s what I’m going to show you how to do here.

My favorite tool for building categorical color palettes: Color Picker for Data

My favorite way of generating beautiful color palettes is to use Tristen Brown’s tool Color Picker for Data. It offers an intuitive visual interface to build and export a color palette that you can use directly within ggplot.

Mapping Categorical Data to Color in ggplot

For this example, we’ll be working with the mtcars dataset. We’re going to create a scatter plot of weight and miles per gallon.
Then, we’ll use the color aesthetic to map 4-, 6-, and 8-cylinder engines each to a different color using the default ggplot colors:

```r
g2 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl)))

g2
```

Similar to how we worked with continuous data, we simply map a variable to the color aesthetic by including the code color = your_variable within the aes function of our geom_point call. The one caveat is that here we’re converting the cyl variable to a factor before we do this mapping. Because cyl is recorded as a numerical variable, it would by default map to the color gradients we saw before, which isn’t what we want in this case, as we’re treating cyl as a categorical variable with 3 levels. Remember, just because a variable happens to have numeric values does not necessarily mean it should be mapped as a continuous scale!

We now know how to map a categorical variable to the color aesthetic to produce different colors in our graph for each level in our dataset. How can we modify those colors to a color palette of our choice?

Modifying our ggplot colors for categorical data using scale_color_manual

Once you have your color palette, you can use the scale_color_manual function to map the levels in your dataset to different colors in your generated color palette.

Side note: Can you guess? Yes, again, there is a similar function called scale_fill_manual that we would use if we were instead graphing bars or other fillable shapes. I won’t be including an example of this function, but it operates in exactly the same way as the scale_color_manual function, so you can easily modify this code to work for filling graphs with color as well.

Here, we start by creating a vector that maps the different levels in our data, in this case “4”, “6”, and “8”, to different colors. We then use the scale_color_manual function and specify the mapping by passing our colors vector to the values argument of scale_color_manual.
It will then go through each entry in the cyl column, mapping it to the relevant color in our colors vector.

```r
colors <- c("4" = "#D9717D", "6" = "#4DB6D0", "8" = "#BECA55")

g2 + scale_color_manual(values = colors)
```

A Summary of Working with ggplot Colors

Congratulations! You now know how to work with colors in your ggplot graphs. In this guide, you learned:

• How to change all items in a graph to a static color of your choice
• To distinguish between two ways of modifying color in your ggplot graph:
1. Setting a static color for all elements in your graph
2. Mapping a variable in your data to a color palette so that each color represents a different level of the variable
• How you can customize your sequential color scales using the scale_color_gradient and scale_fill_gradient functions for continuous data
• How to customize your diverging color scales using the scale_color_gradient2 and scale_fill_gradient2 functions for continuous data
• How to customize your color palettes for categorical data using the scale_color_manual and scale_fill_manual functions

Don’t Forget to Practice!

Right now, you should have a pretty good understanding of how you can work with and modify the colors in your ggplot graphs. But if you don’t practice, you’re going to forget this stuff! That’s why I’ve created a free workbook that you can use to apply what you’ve learned in this guide. The workbook is an R file that includes additional questions and exercises to help you engage with this material.

Get your free workbook to master working with colors in ggplot

To leave a comment for the author, please follow the link and comment on their blog: Learn R Programming & Build a Data Science Career | Michael Toth.
### Transforming E-commerce Data into SCM Value

Tue, 05/14/2019 - 10:02

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Introduction

The encroachment of e-commerce platforms on brick-and-mortar retailers is one eminent illustration of the big data phenomenon that is wiping out many traditional retail industries. There are many well-known aspects and causes of Amazon’s huge success in e-commerce; however, turning data into actionable insights in the supply chain can be held back by an inability to transform transactional data into ‘customer-centric’ value. There are numerous opportunities within retail industries to optimize and take control of the data accumulated within the supply chain network. My intention in this project is to tackle the order-management part of the business, which often receives much less appreciation than the spotlight of demand forecasting.

List of SCM Opportunities

The motives of this R Shiny app are to visualize:

1. Transactional lead time from point of sale to delivery to the end consumer
2. The backorder ratio to be maintained
3. The sales run-rate, which will allow users to optimize warehouse resource allocation in the future

The Data

Finding a decent public sales dataset that conveys a real-life situation is often difficult because of confidentiality; thankfully, however, Olist, the largest department store in Brazilian marketplaces, has published an anonymized dataset of ~100,000 orders on Kaggle.
The whole dataset is well organized, comprising 8 tables in a star schema with key-paired relationships between approximately 45 features in total. The dataset is fairly self-explanatory, although the naming conventions of some categorical features and the customer reviews being written in Portuguese were troublesome factors.

Limitations

Above all, my lack of expertise in Brazilian market and logistics conditions was an impediment to clear analysis. I have generalized my understanding to fit closely with circumstances here in the United States.

Approach

Change internal data to an Enterprise Resource Planning (ERP) format, to hopefully enable blending with a Warehouse Management System (WMS) at Olist, by creating features that capture the duration of each partition of the lead time.

Findings and Explanation

Sales

• Consumers spent over R$7 million (Brazilian reais) in 2017 alone
• Shows a YoY growth rate of 143% (weeks 1-35, 2017 vs 2018), illustrating the potential of e-commerce in Brazil.
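For clarity, the YoY figure is just the ratio of comparable-period revenues minus one; the revenue totals below are invented purely to show the arithmetic, not the real Olist figures:

```r
# Hypothetical revenue totals for weeks 1-35 of each year (illustrative only)
rev_2017 <- 2.9e6
rev_2018 <- 7.05e6

yoy_growth <- (rev_2018 / rev_2017 - 1) * 100
round(yoy_growth)   # 143
```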

This is a breakdown of the latency between the initiation and execution of product delivery, which is an essential part of this analysis. Link for the picture below.

Explanation:

Black dotted lines – Customer RDD (Required Delivery Date).

Definition: Estimated delivery date informed to the customer (approximately 24 days out)

Problem: Setting the customer RDD too far beyond the actual delivery date often interferes with sellers' ability to prioritize order fulfillment, especially when there are stockouts, and prevents good demand-forecasting practice. Note that forecasting should use actual shipments as the KPI rather than POS (point of sale).

Green Bar – Purchased Order to Sales Order

Definition: duration of order purchase to payment approval time.

• It takes about 12 hours for the system's ETL transaction to run fraud checks on both credit cards and boletos.

Blue Bar – Sales Order to Goods Issue

Definition: duration of sellers shipping to the carriers.

• About 3 days for sellers to deliver to each carrier.
• This indicates that inventory outages aren't yet a problem for many sellers; however, this portion is likely to increase as demand rises, assuming forecasting is done correctly.

Red Bar: Goods Issue to Goods Receipt

Definition: duration of carriers shipping to the end-consumers

• About 9 days: for carriers to deliver the goods to the customers.

Problem: This is an atrocious result, accounting for 73% of the whole delivery process. Carriers should not hold onto products for this long.

This suggests carriers maintain high inventory levels, resulting in immense inventory-holding costs, opportunity loss, and vulnerability to theft (a well-known problem in Brazil).
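The bar definitions above can be reproduced directly from the order timestamps. A minimal base-R sketch, assuming the column names of the Olist orders table (the toy timestamps themselves are invented for illustration):

```r
# Toy reconstruction of the three lead-time segments from order timestamps.
orders <- data.frame(
  order_purchase_timestamp      = as.POSIXct(c("2017-10-02 10:56:33", "2017-07-24 20:41:37"), tz = "UTC"),
  order_approved_at             = as.POSIXct(c("2017-10-02 22:15:19", "2017-07-26 03:24:27"), tz = "UTC"),
  order_delivered_carrier_date  = as.POSIXct(c("2017-10-04 19:55:00", "2017-07-26 14:31:00"), tz = "UTC"),
  order_delivered_customer_date = as.POSIXct(c("2017-10-10 21:25:13", "2017-08-07 15:27:45"), tz = "UTC")
)

# PO -> SO (green bar): purchase to payment approval, in hours
orders$po_to_so_hrs <- as.numeric(difftime(orders$order_approved_at,
                                           orders$order_purchase_timestamp, units = "hours"))
# SO -> GI (blue bar): approval to handover to the carrier, in days
orders$so_to_gi_days <- as.numeric(difftime(orders$order_delivered_carrier_date,
                                            orders$order_approved_at, units = "days"))
# GI -> GR (red bar): carrier to the end consumer, in days
orders$gi_to_gr_days <- as.numeric(difftime(orders$order_delivered_customer_date,
                                            orders$order_delivered_carrier_date, units = "days"))
```

Summarising these three columns across all orders gives the relative share of each segment in the total lead time.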

Backorder Rate

The backorder-rate KPI measures how many orders cannot be filled at the time a customer places them. Here it is measured by the frequency of orders delivered later than the estimated delivery date mentioned above. I have drawn a red dotted line at 5%, the ideal ceiling to stay under.

Backorder Ratio = (Number of Customer Orders Delayed due to Backorder / Total Number of Customer Orders Shipped) * 100
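As a quick illustration of the formula, a hedged base-R sketch with invented delivery and estimate dates:

```r
# An order counts as backordered when it is delivered after its estimated date.
delivered <- as.Date(c("2017-11-10", "2017-11-20", "2017-12-01", "2017-12-05"))
estimated <- as.Date(c("2017-11-15", "2017-11-18", "2017-12-03", "2017-11-30"))

backorder_rate <- function(delivered, estimated) {
  mean(delivered > estimated) * 100
}
backorder_rate(delivered, estimated)  # 2 of the 4 orders are late, so 50
```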

• The rate is likely to increase during the peak season.
• One odd finding: the correlation plot indicates that backorder status correlates more strongly with customer review scores than the type of the product does.

Extra

1. Office furniture is one of the product categories prone to long lead times, mainly because sellers aren't able to fulfill the orders on time. Given that furniture may take comparatively longer to manufacture, a better forecasting strategy may be required by the sellers.
2. Amapá, Roraima, and Amazonas are the top three states in Brazil for average delivery lead time, at > 25 days (date range: 2016/09/15 – 2018/08/27).
Conclusion:

As we all know, the biggest differentiator between Amazon and many other online retailers is the speed of delivery, which could be a big factor in reducing customer churn. My goal now is: how can we shorten the duration from carrier shipping (GI) to the end customer (GR)?

In other words, if we need to choose specific cities in Brazil as ideal locations for warehouses or carriers, where should they be?

I used the K-means clustering algorithm on the geolocation coordinates of customer locations to find cluster centroids. The central position of each partition of Brazil indicates an ideal warehouse location for that region. Obviously, the more fulfillment centers, the faster the delivery; however, we want to be cost-efficient, which means determining the optimal number of clusters.

Deciding the best value for the number of clusters is not obvious with over 100K geolocation points. The elbow method was used to determine the best value of K for clustering. It looks at the within-cluster variation for each candidate number of clusters, calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids. As K increases, the total variation within each cluster shrinks, and at some point the marginal gain drops, producing an acute angle — the "elbow" — in the graph.
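A minimal sketch of the elbow method with stats::kmeans on invented two-dimensional points (two obvious groups), not the app's actual geolocation data:

```r
# Elbow-method sketch: total within-cluster sum of squares for k = 1..8.
set.seed(42)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
wss <- sapply(1:8, function(k) kmeans(pts, centers = k, nstart = 10)$tot.withinss)
# plot(1:8, wss, type = "b")  # the bend ("elbow") suggests a good k
```

With two well-separated groups, the curve drops sharply from k = 1 to k = 2 and flattens after that, so the elbow lands at k = 2.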

Below are the ideal carrier/warehouse locations for each of the 5 clusters.

Future work:

The Earth isn’t flat: after doing the analysis, I realized that my K-means result should be reasonably accurate only near the equator. To cluster using the shortest curve distances along the surface of the Earth, density-based algorithms such as HDBSCAN or DBSCAN can be used for spatial clustering.
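For intuition, the great-circle distance K-means ignores can be computed with the haversine formula; a small base-R sketch (the city coordinates are approximate):

```r
# Great-circle (haversine) distance in km on a spherical Earth.
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}
# Sao Paulo to Manaus, roughly 2700 km by great circle:
haversine_km(-23.55, -46.63, -3.12, -60.02)
```

A distance matrix built this way can be fed to a density-based clusterer in place of plain Euclidean distances on raw latitude/longitude.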

Efficient coding: the clustering part of this project is not reactive, partly due to the performance limitations of R and shinyapps.io's 1 GB data restriction. Clustering performance was very slow, and plotting over 100K points with ggplot2 would result in a shutdown. It would be good to find an option to bypass this problem.

LAST UPDATE: 5/14/2019


### Was the Bavarian Abitur too hard this time?

Tue, 05/14/2019 - 10:00

(This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)

Bavaria is known for its famous Oktoberfest… and within Germany also for its presumably difficult Abitur, a qualification granted by university-preparatory schools in Germany.

A mandatory part for all students is maths. This year many students protested that the maths part was way too hard; they even started an online petition with more than seventy thousand supporters at the time of writing!

It is not yet clear whether their marks will be adjusted upwards; the ministry of education is investigating the case. As a professor in Bavaria who also teaches statistics, I will take the opportunity to share with you an actual question from the original examination, with a solution, so read on…

Let us have a look at the first (and easiest) question in the stochastics part:

Every sixth visitor to the Oktoberfest wears a gingerbread heart around his neck. During the festival 25 visitors are chosen at random. Determine the probability that at most one of the selected visitors will have a gingerbread heart.

Of course students are not allowed to use R in the examination but in general the good thing about this kind of questions is that if you have no idea how to solve them analytically solving them by simulation is often much easier:

set.seed(12)
N <- 1e7
v <- matrix(sample(c(0L, 1L), size = 25 * N, replace = TRUE, prob = c(5/6, 1/6)), ncol = 25)
sum(rowSums(v) <= 1) / N
## [1] 0.062936

Now for the analytical solution: “at most one” implies that we have to differentiate between two cases, “no gingerbread heart” and “exactly one gingerbread heart”. “No gingerbread heart” is just (5/6)^25. “Exactly one gingerbread heart” is 25 * 1/6 * (5/6)^24, because there are 25 possibilities for where the gingerbread heart could occur. We have to add both probabilities:

(5/6)^25 + 25*1/6*(5/6)^24
## [1] 0.06289558

If you know a little bit about probability distributions you will recognize the above as the binomial distribution:

pbinom(q = 1, size = 25, prob = 1/6)
## [1] 0.06289558

Of course it is a little unfair to judge just on the basis of the easiest task, and without knowing the general maths level that is required. But still, I would like to hear your opinion on this. Outsiders’ views from different countries and different educational systems are also very welcome! So, what do you think:

Was the Bavarian Abitur too hard this time? Please leave your reply in the comment section below!


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.

### Transfer learning and semi-supervised learning with ruimtehol

Tue, 05/14/2019 - 07:23

(This article was first published on bnosac :: open analytical helpers - bnosac :: open analytical helpers, and kindly contributed to R-bloggers)

Last week the R package ruimtehol was updated on CRAN, giving R users who perform Natural Language Processing access to the ability to:

• Do semi-supervised learning (learning where you have both texts and labels, but not always both of them on the same document identifier).
• Do transfer learning, by passing on an embedding matrix (e.g. obtained via fastText or GloVe or the like) and continuing training based on that matrix, or just using the embeddings in your Natural Language Processing flow.

More information can be found in the package vignette shown below or which you can obtain by installing the package and visiting the vignette with the following R code. Enjoy!

install.packages("ruimtehol")
vignette("ground-control-to-ruimtehol", package = "ruimtehol")



To leave a comment for the author, please follow the link and comment on their blog: bnosac :: open analytical helpers.

### Upgrading to R 3.6.0 on a Mac – May 14, 2019

Tue, 05/14/2019 - 02:00

(This article was first published on Chuck Powell, and kindly contributed to R-bloggers)

Every time there is a new major update from The R Foundation (like the recent 3.6.0 release in April), I’m always happy to see the continuing progress and the combination of new features and bug fixes, but I also face the issue of what to do about the burgeoning number of packages (libraries) I have installed.

Up until now, I confess, I have simply “winged it”: done the needed updates or just grabbed a few essentials, and then let my needs dictate whatever else I reloaded. This time I decided to get serious about the process and pay attention to not only what I was doing but also to documenting it and keeping a record via some amount of code (and this post).

I’m aware that there are full-fledged package managers like packrat and checkpoint, and even a package designed to manage the upgrade for you on Windows, but I’m a Mac user, I wanted to do things my own way, and I don’t need that level of sophistication.

So I set out to do the following:

1. Capture a list of everything I had installed under R 3.5.3 and, very importantly, as much as I could about where I got each package, e.g. CRAN or GitHub or ???
2. Keep a copy for my own edification and potential future use.
3. Do a clean R 3.6.0 install and not copy any library directories manually.
4. Work from the list produced in #1, mainly to just download and install the exact same packages if I can find them.
5. Make the process mainly scripted, automatic, and available again for the future.

As I was searching the web I found a few helpful posts that saved me some time, especially this post on Stack Overflow. I wanted to extend the function listed there to do a little more of my work for me. Instead of just being able to generate a listing of what I had installed from GitHub, I wanted to be able to determine most of the places I get packages from, which are CRAN, GitHub and R-Forge.

Let’s load the tidyverse and then build a dataframe called allmypackages with the basic information about the packages I currently have installed in R 3.5.3.

Note – I’m writing this after already upgrading so there will be a few
inconsistencies in the output

• This could just as easily be a tibble but I chose as.data.frame
• I am deliberately removing base packages from the dataframe by
filter
• I am eliminating columns I really don’t care about with select

require(tidyverse)
allmypackages <- as.data.frame(installed.packages())
allmypackages <- allmypackages %>%
  filter(Priority != "base" | is.na(Priority)) %>%
  select(-c(Enhances:MD5sum, LinkingTo:Suggests)) %>%
  droplevels()
str(allmypackages)
## 'data.frame': 533 obs. of 8 variables:
## $ Package         : Factor w/ 533 levels "abind","acepack",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ LibPath         : Factor w/ 1 level "/Library/Frameworks/R.framework/Versions/3.6/Resources/library": 1 1 1 1 1 1 1 1 1 1 ...
## $ Version         : Factor w/ 361 levels "0.0-8","0.0.1",..: 223 227 57 18 202 352 179 165 49 177 ...
## $ Priority        : Factor w/ 1 level "recommended": NA NA NA NA NA NA NA NA NA NA ...
## $ Depends         : Factor w/ 194 levels "base, stats, R (>= 3.3.0)",..: 25 NA 131 167 NA 145 132 NA NA 97 ...
## $ Imports         : Factor w/ 356 levels "abind, coda, graphics, grDevices, methods, nlme, utils",..: 234 NA 250 75 323 242 1 324 328 336 ...
## $ NeedsCompilation: Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 2 1 2 ...
## $ Built           : Factor w/ 1 level "3.6.0": 1 1 1 1 1 1 1 1 1 1 ...

A function to do the hard work

As I mentioned above, the Stack Overflow post was a good start, but I wanted as much information as possible about where I got each package. The package_source function will be applied to the Package column for each row of our dataframe. For example, as.character(packageDescription("ggplot2")$Repository) will return “CRAN”, and as.character(packageDescription("CHAID")$Repository) will yield “R-Forge”. For GitHub packages the result is character(0), which has a length of zero. So we’ll test with an if else clause. If we get an answer like “CRAN” we’ll just return it. If not, we’ll see if there is a GitHub repo listed with as.character(packageDescription(pkg)$GithubRepo), as well as a GitHub username with as.character(packageDescription(pkg)$GithubUsername). If they exist we’ll concatenate and return them. If not we’ll return “Other”. Besides being good defensive programming, this may catch a package you have built for yourself, as is the case for me.

package_source <- function(pkg){
  x <- as.character(packageDescription(pkg)$Repository)
  if (length(x) == 0) {
    y <- as.character(packageDescription(pkg)$GithubRepo)
    z <- as.character(packageDescription(pkg)$GithubUsername)
    if (length(y) == 0) {
      return("Other")
    } else {
      return(str_c("GitHub repo = ", z, "/", y))
    }
  } else {
    return(x)
  }
}
# show the first 60 as an example
head(sapply(allmypackages$Package, package_source), 60)
##  [1] "CRAN"    "CRAN"    "CRAN"
##  [4] "CRAN"    "CRAN"    "CRAN"
##  [7] "CRAN"    "CRAN"    "CRAN"
## [10] "CRAN"    "CRAN"    "CRAN"
## [13] "CRAN"    "CRAN"    "CRAN"
## [16] "CRAN"    "CRAN"    "CRAN"
## [19] "CRAN"    "CRAN"    "CRAN"
## [22] "CRAN"    "CRAN"    "CRAN"
## [25] "CRAN"    "CRAN"    "CRAN"
## [28] "CRAN"    "CRAN"    "CRAN"
## [31] "CRAN"    "CRAN"    "CRAN"
## [34] "CRAN"    "CRAN"    "CRAN"
## [37] "CRAN"    "CRAN"    "CRAN"
## [40] "CRAN"    "CRAN"    "CRAN"
## [43] "CRAN"    "CRAN"    "Other"
## [46] "R-Forge" "CRAN"    "CRAN"
## [49] "CRAN"    "CRAN"    "CRAN"
## [52] "CRAN"    "CRAN"    "CRAN"
## [55] "CRAN"    "CRAN"    "CRAN"
## [58] "CRAN"    "GitHub repo = cjtexas/colourgen"    "CRAN"

What’s in your libraries?

Now that we have the package_source function we can add a column to
our data frame and do a little looking.

allmypackages$whereat <- sapply(allmypackages$Package, package_source)
str(allmypackages)
## 'data.frame': 533 obs. of 9 variables:
## $ Package         : Factor w/ 533 levels "abind","acepack",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ LibPath         : Factor w/ 1 level "/Library/Frameworks/R.framework/Versions/3.6/Resources/library": 1 1 1 1 1 1 1 1 1 1 ...
## $ Version         : Factor w/ 361 levels "0.0-8","0.0.1",..: 223 227 57 18 202 352 179 165 49 177 ...
## $ Priority        : Factor w/ 1 level "recommended": NA NA NA NA NA NA NA NA NA NA ...
## $ Depends         : Factor w/ 194 levels "base, stats, R (>= 3.3.0)",..: 25 NA 131 167 NA 145 132 NA NA 97 ...
## $ Imports         : Factor w/ 356 levels "abind, coda, graphics, grDevices, methods, nlme, utils",..: 234 NA 250 75 323 242 1 324 328 336 ...
## $ NeedsCompilation: Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 2 1 2 ...
## $ Built           : Factor w/ 1 level "3.6.0": 1 1 1 1 1 1 1 1 1 1 ...
## $ whereat         : chr "CRAN" "CRAN" "CRAN" "CRAN" ...

table(allmypackages$whereat)
##
##                              CRAN   GitHub repo = cjtexas/colourgen
##                               526                                 1
## GitHub repo = duncantl/RWordPress     GitHub repo = duncantl/XMLRPC
##                                 1                                 1
##   GitHub repo = leeper/slopegraph GitHub repo = lorenzwalthert/stylermd
##                                 1                                 1
##                             Other                           R-Forge
##                                 1                                 1

allmypackages %>%
  filter(whereat == "Other") %>%
  select(Package, Version)
##        Package Version
## 1 CGPfunctions   0.5.3

And just to be on the safe side we’ll also write a copy out as a csv
file so we have it around in case we ever need to refer back.

write.csv(allmypackages, "mypackagelistMay2019.csv")

Go ahead and install R 3.6.0

Download and install R 3.6.0. At the end of the installation process you’ll have a pristine copy with a new library directory. When next you restart R and RStudio you’ll see a clean new version. Let’s make use of our data frame to automate most of the process of getting nice clean copies of the libraries we want.

We’ll start by getting the entire tidyverse since we need several
parts and because installing it will trigger the installation of quite a
few dependencies and bootstrap our work.

# post upgrade with output suppressed
install.packages("tidyverse")
library(tidyverse)

Now we have R 3.6.0 and some additional packages. Let’s see what we can do. First let’s create two dataframes, one with our old list and one with what we have right now. Then we can use anti_join to make a dataframe that lists the differences, thediff. We can use filter and pull to generate a vector of just the packages on CRAN that we want to install.

**Note – I’m faking the output rather than reinstalling all these packages on my machine, so you will see packages from the tidyverse in the listing**

oldpackages <- read.csv("mypackagelistMay2019.csv")
allmypackages <- as.data.frame(installed.packages())
allmypackages <- allmypackages %>%
  filter(Priority != "base" | is.na(Priority)) %>%
  select(-c(Enhances:MD5sum, LinkingTo:Suggests))
thediff <- anti_join(oldpackages, allmypackages, by = "Package")
## Warning: Column Package joining factors with different levels, coercing to character vector
thediff <- droplevels(thediff)
thediff %>%
  filter(whereat == "CRAN") %>%
  pull(Package) %>%
  as.character
##  [1] "abind"         "acepack"       "afex"          "alphavantager" "antiword"      "ape"
##  [7] "arm"           "askpass"       "assertthat"    "backports"     "base64enc"     "BayesFactor"
## [13] "bayesplot"     "bayestestR"    "BDgraph"       "beeswarm"      "BH"            "bibtex"
## [19] "bigmemory"     "bigmemory.sri" "bindr"         "bindrcpp"      "bitops"        "blavaan"
## [25] "bookdown"      "Boom"          "BoomSpikeSlab" "boot"          "brew"          "broom"
## [31] "broom.mixed"   "broomExtra"    "BSDA"          "bsts"          "BWStest"       "ca"
## [37] "Cairo"         "callr"         "car"           "carData"       "caret"         "caTools"
## [43] "cdata"         "cellranger"    "checkmate"     "class"         "classInt"      "cli"
## [49] "clipr"         "clisymbols"    "cluster"       "coda"          "codetools"     "coin"
## [55] "colormap"      "colorspace"    "colourpicker"  "combinat"      "commonmark"    "CompQuadForm"

Just do it!

Now that you have a nice automated list of everything that is a CRAN package, you can give it a final look and see if there is anything else you’d like to filter out. Once you are sure the list is right, one final pipe will set the process in motion.

thediff %>%
  filter(whereat == "CRAN") %>%
  pull(Package) %>%
  as.character %>%
  install.packages

Depending on the speed of your network connection and the number of
packages you have that will run for a few minutes.

That takes care of our CRAN packages. What about GitHub?

thediff %>%
  filter(str_detect(whereat, "GitHub repo")) %>%
  select(Package, Version, NeedsCompilation, whereat)
##      Package    Version NeedsCompilation                               whereat
## 1  colourgen      0.2.0               no       GitHub repo = cjtexas/colourgen
## 2 RWordPress      0.2-3               no     GitHub repo = duncantl/RWordPress
## 3 slopegraph     0.1.14               no       GitHub repo = leeper/slopegraph
## 4   stylermd 0.1.0.9000               no GitHub repo = lorenzwalthert/stylermd
## 5     XMLRPC      0.3-1               no        GitHub repo = duncantl/XMLRPC

Here’s another chance to review what you have and whether you still want or need these packages. I could automate the process and once again feed the right vector to devtools::install_github(), but instead I chose to handle these manually, as in devtools::install_github("leeper/slopegraph").

Same with the one package I get from R-Forge…

allmypackages %>%
  filter(str_detect(whereat, "R-Forge")) %>%
  select(Package, Version, NeedsCompilation, whereat)

install.packages("CHAID", repos = "http://R-Forge.R-project.org")

At the end of this process you should have a nice clean R install that
has all the packages you choose to maintain as well as a detailed
listing of what those are.

Done!

I hope you’ve found this useful. I am always open to comments,
corrections and suggestions.

Chuck (ibecav at gmail dot com)


To leave a comment for the author, please follow the link and comment on their blog: Chuck Powell.

### Introducing SAML in RStudio Connect

Tue, 05/14/2019 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

RStudio Connect 1.7.4 builds off of the prior release with significant improvements for RStudio Connect administrators. SAML support is available for production, a new admin view of scheduled reports helps identify potential conflicts, additional server APIs support deployment workflows, and our email configuration has been completely re-vamped.

View Scheduled Content

Scheduled Content Calendar

This release includes a new calendar view, pictured above, that helps administrators review all scheduled content in a single place. This view can help administrators identify reports that are running too frequently, or times when multiple reports have overlapping schedules; e.g., Monday at 9am.

SAML Support

We are very excited to announce the release of SAML 2.0 as a production authentication method for RStudio Connect. This opens the door for integration with common identity providers in the enterprise, along with single-sign-on, multi-factor authentication, and other important security conventions.

As a part of this release, we prepared SAML integration templates to simplify your integration with common cloud identity providers. RStudio Connect also supports the SAML 2.0 protocol for integrations with many other authentication providers or homegrown SAML tools.

| Identity Provider | Status | More Information |
| --- | --- | --- |
| Azure Active Directory | Tested and Integrated | Azure Portal |
| Okta | Tested & Integrated | Okta Integration Guide |
| OneLogin | Tested & Integrated | Search OneLogin Portal After Login |
| Google, JumpCloud, ADFS, WSO2, PingIdentity, Shibboleth | Tested | Configuration Guide |
| General SAML 2.0 | Supported | Configuration Guide |
| Duo, Centrify, Auth0 | Not Supported | Coming in a Future Release |

RStudio Connect’s SAML 2.0 authentication method supports Just-In-Time provisioning, either local or remote group management, Identity Provider metadata, and a handful of other configuration options that can be explored in the RStudio Connect Admin Guide.

Server APIs

The Connect Server API allows teams to interact with RStudio Connect from code. This release lets you programmatically update content settings and manage content bundles, so you can build custom workflows, such as promoting content from a staging server to production.

Integrate Connect into CI/CD Toolchains

Learn more about different approaches to asset deployment, including how to integrate RStudio Connect into CI/CD toolchains.

Email Overhaul

RStudio Connect uses email to distribute content, manage user accounts, notify publishers of errors, and more. In order to send emails, administrators must configure Connect to use sendmail or an SMTP client. In prior versions of RStudio Connect, this configuration was done in the RStudio Connect dashboard. Version 1.7.4 and above removes this support in favor of managing email settings with the RStudio Connect configuration file. This change makes setup easier and more consistent for administrators.

When upgrading to RStudio Connect 1.7.4, administrators should follow these instructions to migrate email settings to the configuration file.

In addition, RStudio Connect no longer requires email setup. Disabling email can be useful for groups starting a proof-of-concept, or for teams running Connect in certain locked-down environments. In these cases, Connect will gracefully disable settings that require email. For full functionality we strongly recommend an email integration.

Breaking Changes and Deprecations
• Network information is no longer collected and stored at {Server.DataDir}/metrics/rrd/network-*.rrd. This information was never used by RStudio Connect and is removed to save storage space.
• The experimental content API has changed, see the new API documentation for details.
• The [LoadBalancing].EnforceMinRsconnectVersion setting now defaults to true and has been deprecated. RStudio Connect will now require rsconnect version 0.8.3 or above.
• The next version of RStudio Connect will not support discovering R versions through the file /etc/rstudio/r-versions. Administrators using this file should migrate this information to the Connect configuration file’s [Server].RVersion field.

Please consult the full release notes.

RStudio Connect 1.7.4 introduces a new upgrade process to facilitate changes
to the RStudio Connect configuration file. This process requires administrators
to upgrade RStudio Connect and then update the configuration file, as described here.
Specific instructions for the 1.7.4 upgrade are also available. If
you are upgrading from an earlier release than 1.7.2, please consult the
intermediate release notes.

If you have not yet tried RStudio Connect, we encourage you to do so.
RStudio Connect is the best way to share all the work that you do in R and Python (Shiny
apps, R Markdown documents, plots, dashboards, Plumber APIs, Jupyter, etc.) with
collaborators, colleagues, or customers.

You can find more details or download a 45-day evaluation of the product at
https://www.rstudio.com/products/connect/.
Additional resources can be found below.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

### POWER to the People

Tue, 05/14/2019 - 02:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

NASA generates and provides heaps of data to the scientific community. Not all
of it is looking out at the stars. Some of it is looking back at us here on
Earth. NASA’s Earth science program observes, understands and models the
Earth system1. We can use these data to discover how our Earth is changing,
to better predict change, and to understand the consequences for life on Earth.

The Earth science program includes the
Prediction of Worldwide Energy Resource (POWER)
project, which was initiated to improve upon the current renewable energy data
set and to create new data sets from new satellite systems. The POWER project
targets three user communities: 1) Renewable Energy (SSE), 2) Sustainable
Buildings (SB) and 3) Agroclimatology (AG)1 and covers 140+ different
parameters.

How Did This Package Happen?

I first became acquainted with the POWER data when I joined the GIS lab
at the International Rice Research Institute
(IRRI)
in 2011. We commonly used the
agroclimatology data from POWER to map and model rice projects. In most
of the work I did, I used it with the EPIRICE model2, but it was also used
for other crop modelling. Since then I have used the POWER data in projects with
EPIRICE myself3 and have worked with other researchers who use it for crop
simulation modelling exercises4,5. The data were a great resource for
agricultural modelling: even if a bit coarse at 1˚ x 1˚ (covering roughly
12225 km2 or 4721 mi2 per grid cell at the equator, and less
as you approach the poles), the full data set had global coverage, offering data
where we often needed it in areas that lacked good weather station coverage.
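For intuition about that coarseness, the ground area of a 1˚ x 1˚ cell scales with the cosine of latitude; a quick base-R check, assuming a spherical Earth of radius 6371 km:

```r
# Approximate area of a 1-degree x 1-degree grid cell at a given latitude.
cell_area_km2 <- function(lat_deg, r = 6371) {
  to_rad <- pi / 180
  # width of 1 deg of longitude (shrinks with latitude) times height of 1 deg of latitude
  (r * to_rad * cos(lat_deg * to_rad)) * (r * to_rad)
}
round(cell_area_km2(0))   # at the equator
round(cell_area_km2(60))  # half the equatorial area at 60 degrees latitude
```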

Because I had used the data, and I knew plenty of others used it too, in 2017 I
started writing nasapower6,7 to interface with the POWER website and run
queries to get the data from the server, but only for agricultural weather data
(the AG community), as this was my main (really only) interest. This replaced
repeated queries through the website’s point-and-click interface. I submitted
it for review with rOpenSci and was happy with the feedback I received from the
reviewers, quickly making the suggested changes.

What Happened Next

However, in late 2017 very shortly after the package was reviewed, the
POWER team announced that the POWER data would be getting a makeover.
The data would be served at 0.5˚ x 0.5˚ and use a new
API to access them. This
complicated things. I had never worked with an official API before. I
had only worked with FTP servers, which are at best a very rudimentary
API, but easy to work with in R. So I went back to the reviewers and
editor and suggested that I needed to rewrite the whole package due to
the major changes that were coming. The editor, Scott Chamberlain,
readily agreed and over the next several months I learned how to write
an R package that is an API client, a proper API client.

Thankfully rOpenSci has a great package, crul8, that makes this much easier.
Even better, Maëlle Salmon had written a
blog post on how to use it! So I
got down to business writing the new version of nasapower using crul.

I quickly found out that the easy part was using crul; the hard part
was validating user inputs. I decided early in the process of writing
the package to validate the users’ input on the client side before even
querying the server to make sure that only well-formed requests were
sent. That way I could provide the user with feedback on what may have
been entered incorrectly to make it easier for them to correct rather
than relying on the server’s response.

There are over 140 different parameters for the three “communities”
(AG, SSE and SB) that the POWER data provides. One of the first issues I
encountered was how to validate the users’ requests against what was available.
Thankfully Ben Raymond found a JSON file that was served up for the API
documentation, which I was able to use to create an internal list object to
check against; this made the parameters and communities easy to validate.

Formatting dates, and checking them given all the different ways we enter them,
proved to be another challenge entirely, taking nearly 100 lines of code.
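A stripped-down sketch of that idea (the formats here are assumed for illustration; the package's real checks are far more thorough):

```r
# Try a handful of common date formats in turn and return the first that
# parses; fail with an informative message otherwise.
check_dates <- function(x) {
  formats <- c("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")
  for (fmt in formats) {
    d <- as.Date(as.character(x), format = fmt)
    if (!is.na(d)) return(d)
  }
  stop("Could not parse date: ", x, call. = FALSE)
}

check_dates("1983-01-01")
check_dates("31/12/2018")  # both normalise to Date objects
```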

Along with learning how to write an API client package, one of the
methods I used in this package that I had not made full use of before
was lists. Internally in that 100+ lines of code there are several
inter-related checks and values that are provided for the query. By
using lists I was able to return more than one value from a function
after it was checked and validated and then provide that to the function
I created to generate the request that crul sends, e.g., the
longitude, latitude and whether it is a single point or
region
.
By doing this I was able to simplify the parameters that the user had to
enter when constructing a query in their R session because the API
dictates by default many values that cannot co-occur.
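The list idiom looks roughly like this (hypothetical function and field names, not the package's actual internals):

```r
# Validate a lonlat argument and return several related values at once as a
# named list: the coordinates plus whether the query is a point or a region.
check_lonlat <- function(lonlat) {
  if (length(lonlat) == 2) {
    stopifnot(lonlat[1] >= -180, lonlat[1] <= 180,
              lonlat[2] >= -90, lonlat[2] <= 90)
    list(lon = lonlat[1], lat = lonlat[2], type = "point")
  } else if (length(lonlat) == 4) {
    list(bbox = lonlat, type = "region")
  } else {
    stop("lonlat must be length 2 (point) or 4 (region)", call. = FALSE)
  }
}

vals <- check_lonlat(c(-117.218, 36.7266))
vals$type  # downstream code can branch on this
```

The caller receives everything it needs in one object instead of repeating the checks.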

Are We Doing it Right?

There were a few hang-ups along the way. At one point the POWER team
made a change to the API response while the package was in review.
Initially it provided an option for just a CSV file that was properly
rectangular. Suddenly it changed overnight with no warning to offering a
CSV file with information in a “header” as they called it. To me a
header in a CSV file is just column names. This was more. This was
metadata about the data itself. The quickest way to deal with this was
to simply read only the CSV file portion of the returned file. However,
I thought it might be useful to users to include the metadata that POWER
was now providing in this format as well. So I learned something else
new: how to write a custom S3 print() method that displays the
metadata in the console along with the weather or climate data.
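The pattern, an object carrying its metadata as an attribute plus a print() method that shows the metadata before the data, can be sketched in a few lines (a toy illustration, not nasapower's actual implementation):

```r
# Attach metadata to a data frame and print it via a custom S3 method.
new_power_data <- function(df, metadata) {
  structure(df, metadata = metadata, class = c("power_data", class(df)))
}

print.power_data <- function(x, ...) {
  cat(attr(x, "metadata"), sep = "\n")  # show the metadata header first
  NextMethod()                          # then fall through to the usual print
}

dat <- new_power_data(
  data.frame(YYYYMMDD = as.Date("1983-01-01"), T2M_MIN = -4.73),
  metadata = "NASA/POWER Daily Averaged Data"
)
print(dat)
```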

Later, in early 2019 I ran into an issue with the method I’d used to
validate user inputs for community when building the query that was
sent. Daniel Reis Pereira reported
that for some reason nasapower was failing to download 2 m wind data,
necessary for calculating evapotranspiration for crop modelling. Looking
into the JSON file I used (as suggested by Ben Raymond), I found the issue. The
POWER team did not list the AG community as having this data available. I
contacted them thinking it was a mistake in the file, since the data are
indeed available. After some back and forth with the NASA-POWER team I never
was really clear on how I should handle this from their perspective, aside from
their suggestion of just looking at their webpage to see what was available.
Additionally, when I checked, it appeared that the data were available for all
three communities. So I ended up dropping the check for community from the
user input validations, because the community specified only affects the units
in which the data are reported; see the POWER documentation for more on
this. I still check user inputs for the correct parameter specification and
temporal matches, but assume all communities offer the same data. It’s not
optimal but until the POWER team can deliver a more robust way for us to check
against their records it will have to do.

Finally, It (Mostly?) Works!

The package is relatively simple as far as functionality. It only
fetches data and reformats it for use in R or crop modelling software.

The Three Functions

The first function, get_power(), is the main function that is used and
offers the greatest flexibility. Say you want to know the maximum and
minimum temperatures for Death Valley in California, USA every day for
the last 35 years, you can easily do this with nasapower using
coordinates for the valley floor.

If you are not sure which parameters you need to get the
data, you can always have a look at the list of available parameters
using ?parameters. That shows that T2M_MIN and T2M_MAX
are the minimum and maximum temperatures at 2 meters. We will use those
to construct the query.

library(nasapower)

dv <- get_power(
  community = "AG",
  lonlat = c(-117.2180, 36.7266),
  dates = c("1983-01-01", "2018-12-31"),
  temporal_average = "DAILY",
  pars = c("T2M_MIN", "T2M_MAX")
)

dv
## NASA/POWER SRB/FLASHFlux/MERRA2/GEOS 5.12.4 (FP-IT) 0.5 x 0.5 Degree Daily Averaged Data
## Dates (month/day/year): 01/01/1983 through 12/31/2018
## Location: Latitude 36.7266 Longitude -117.218
## Elevation from MERRA-2: Average for 1/2x1/2 degree lat/lon region = 1192.53 meters Site = na
## Climate zone: na (reference Briggs et al: http://www.energycodes.gov)
## Value for missing model data cannot be computed or out of model availability range: -99
##
## Parameters:
## T2M_MIN MERRA2 1/2x1/2 Minimum Temperature at 2 Meters (C) ;
## T2M_MAX MERRA2 1/2x1/2 Maximum Temperature at 2 Meters (C)
##
## # A tibble: 13,149 x 9
##      LON   LAT  YEAR    MM    DD   DOY YYYYMMDD   T2M_MIN T2M_MAX
##  1 -117.  36.7  1983     1     1     1 1983-01-01   -4.73    7.24
##  2 -117.  36.7  1983     1     2     2 1983-01-02   -3.33    7.72
##  3 -117.  36.7  1983     1     3     3 1983-01-03    0.05   11.8
##  4 -117.  36.7  1983     1     4     4 1983-01-04    1.27   14.9
##  5 -117.  36.7  1983     1     5     5 1983-01-05    2.55   15.9
##  6 -117.  36.7  1983     1     6     6 1983-01-06    2.63   17.1
##  7 -117.  36.7  1983     1     7     7 1983-01-07    3.2    17.2
##  8 -117.  36.7  1983     1     8     8 1983-01-08    1.96   18.3
##  9 -117.  36.7  1983     1     9     9 1983-01-09    0.7    14.0
## 10 -117.  36.7  1983     1    10    10 1983-01-10    1.3    17.6
## # … with 13,139 more rows

The metadata header shows where the data come from and what dates have been
queried and returned, as well as the elevation data.

The tibble contains a few columns not in the original data, but that
can make it easier to work within R. The original data only include YEAR
and DOY. Looking at the data returned, there are:

• LON = the queried longitude as a double;

• LAT = the queried latitude as a double;

• YEAR = the queried year as a double;

• MM = the queried month as a double;

• DD = the queried day as a double;

• DOY = the day of year or Julian date as an integer;

• YYYYMMDD = the full date as a date object; and the requested parameters,

• T2M_MIN = the minimum temperature at 2 meters above the Earth’s surface as a double, and

• T2M_MAX = the maximum temperature at 2 meters above the Earth’s surface as a double.
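The YYYYMMDD column is redundant with YEAR and DOY; if you ever need to rebuild it yourself, base R can do it directly:

```r
# Reconstruct full dates from YEAR and DOY, e.g., for the rows shown above.
year <- 1983
doy <- c(1, 10)
dates <- as.Date(doy - 1, origin = paste0(year, "-01-01"))
dates  # "1983-01-01" "1983-01-10"
```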

Visualising the Data

To visualise these data I will use ggplot2, but first I need to gather the data
into long format using tidyr’s gather().

library(tidyr)
library(ggplot2)
library(hrbrthemes)

dv_long <- tidyr::gather(dv,
                         key = "T2M",
                         value = "Degrees",
                         c("T2M_MIN", "T2M_MAX"))

ggplot(dv_long, aes(x = YYYYMMDD, y = Degrees, colour = Degrees)) +
  geom_line() +
  xlab("Year") +
  ggtitle(label = "Death Valley, CA Daily Max and Min Temperatures",
          subtitle = "Source: NASA POWER") +
  scale_colour_viridis_c() +
  theme_ipsum()

Figure 1: Daily temperature extremes at 2 meters above the Earth’s surface for the grid cell covering Death Valley, California, USA for the time-period from 1983 to 2018.

That is quite a swing in air temperatures from well over 40˚ C to well
below 0˚ C throughout the year. I was going to put together a comparison
with station data using GSODR [9], but
instead found a good reason why you might want to use nasapower to get
POWER data. When I checked for stations near this specified point,
there were two in the GSOD database, 746190-99999 and 999999-53139.
However, neither offers weather data for this time period:
746190-99999 only offers data from 1982 to 1992, and
999999-53139 only provides data from 2004 to 2019. This nicely
illustrates one of the advantages of the POWER data: they are available for
any point on the globe. One of the weaknesses, which you might have noticed
if you looked at the elevation in the metadata header, is that it gives
the elevation of Death Valley as over 1,000 m above sea level, when we
know that it is 86 m below sea level. This is because the
data are the average of a 0.5˚ x 0.5˚ area (roughly
3,000 km² near the equator) as opposed to the single point that a station
provides.
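The "roughly 3,000 km²" figure holds near the equator; since a degree of longitude shrinks with the cosine of latitude, the cell over Death Valley is closer to 2,500 km². A quick back-of-envelope check (assuming about 111.32 km per degree of latitude):

```r
# Approximate area of a 0.5 x 0.5 degree grid cell at a given latitude:
# one degree of latitude is roughly 111.32 km everywhere, while a degree of
# longitude shrinks with cos(latitude).
cell_area_km2 <- function(lat, km_per_deg = 111.32) {
  (0.5 * km_per_deg) * (0.5 * km_per_deg * cos(lat * pi / 180))
}

cell_area_km2(0)        # about 3100 km^2 at the equator
cell_area_km2(36.7266)  # about 2500 km^2 at Death Valley's latitude
```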

Crop Modelling From Space

This can be advantageous as well, however. Some agricultural scientists work
with models that predict crop yields in response to changes to the crop,
weather, farmer inputs or even climate change. Crop yield modelling often
uses daily weather data covering large relatively continuous areas that
are less affected by things like elevation, so for these purposes, the
POWER data can be very useful. Another reason the POWER data are useful
could be that the same data set is available for many areas, making
model output comparisons easier.

If you do any crop modelling work you are likely familiar with the
Decision Support System for Agrotechnology Transfer
DSSAT platform [10, 11, 12]. The new POWER API
provides properly formatted ICASA files,
which are the format that DSSAT uses. Naturally I took advantage of this and
wrote a function, create_icasa(), to download these files for use in
crop simulations.

But, being in Toowoomba, Queensland, I had to acknowledge another crop
simulation model, the Agricultural Production Systems sIMulator,
APSIM [13]. APSIM was developed here and has similar
functionality to DSSAT. However, the POWER API did not offer properly formatted
APSIM .met files. So I wrote a function, create_met(), that takes advantage of
the POWER data API and the R APSIM package [14] to generate the proper weather
.met files, since many APSIM users use R in their modelling pipeline, e.g.,
APSIMBatch [15] and apsimr [16].

Both of these functions simply download data and write the values to
disk in the specialised file formats that these crop modelling platforms
use, so I have not illustrated their usage in this blog
post.

There Is More Than Just Daily Data for Single Cells

Retrieving Climatology

Daily weather data are not the only data offered by this API. Two other options
exist for the temporal_average parameter, INTERANNUAL and CLIMATOLOGY.
INTERANNUAL data provide monthly averages for the same 0.5˚ x 0.5˚ grid as the
daily data for the time-period the user specifies, while CLIMATOLOGY provides
0.5˚ x 0.5˚ gridded data for a thirty-year period, from January 1984 to
December 2013.

The CLIMATOLOGY data are the only way to get the entire surface in one query,
but single cell and regional data are also available for this temporal average.

library(raster)
## Loading required package: sp
##
## Attaching package: 'raster'
## The following object is masked from 'package:tidyr':
##
##     extract

library(viridisLite)

global_t2m <- get_power(
  community = "AG",
  pars = "T2M",
  temporal_average = "CLIMATOLOGY",
  lonlat = "GLOBAL"
)

# view the tibble
global_t2m
## NASA/POWER SRB/FLASHFlux/MERRA2/GEOS 5.12.4 (FP-IT) 0.5 x 0.5 Degree Climatologies
## 22-year Additional Solar Parameter Monthly & Annual Climatologies (July 1983 - June 2005), 30-year Meteorological and Solar Monthly & Annual Climatologies (January 1984 - December 2013)
## Location: Global
## Value for missing model data cannot be computed or out of model availability range: -99
## Parameter(s):
## T2M MERRA2 1/2x1/2 Temperature at 2 Meters (C)
##
## Parameters:
## T2M MERRA2 1/2x1/2 Temperature at 2 Meters (C)
##
## # A tibble: 259,200 x 16
##      LON   LAT PARAMETER   JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG
##  1 -180. -89.8 T2M       -29.0 -40.7 -52.9 -57.8 -59.1 -59.6 -61.3 -61.8
##  2 -179. -89.8 T2M       -29.0 -40.7 -52.9 -57.8 -59.1 -59.6 -61.3 -61.8
##  3 -179. -89.8 T2M       -29.0 -40.7 -52.9 -57.8 -59.1 -59.6 -61.3 -61.8
##  4 -178. -89.8 T2M       -29.0 -40.7 -52.9 -57.8 -59.1 -59.6 -61.3 -61.8
##  5 -178. -89.8 T2M       -29.0 -40.7 -52.9 -57.8 -59.1 -59.6 -61.3 -61.8
##  6 -177. -89.8 T2M       -28.9 -40.7 -52.9 -57.9 -59.1 -59.6 -61.3 -61.8
##  7 -177. -89.8 T2M       -28.9 -40.7 -52.9 -57.9 -59.1 -59.6 -61.3 -61.8
##  8 -176. -89.8 T2M       -28.9 -40.7 -53.0 -57.9 -59.1 -59.6 -61.3 -61.8
##  9 -176. -89.8 T2M       -28.9 -40.7 -53.0 -57.9 -59.1 -59.6 -61.3 -61.8
## 10 -175. -89.8 T2M       -28.9 -40.7 -53.0 -57.9 -59.1 -59.6 -61.3 -61.8
## # … with 259,190 more rows, and 5 more variables: SEP, OCT,
## #   NOV, DEC, ANN

# map only the annual average temperatures by converting the annual values
# (column 16) to a raster object
T2M_ann <- rasterFromXYZ(
  global_t2m[c(1:2, 16)],
  crs = "+proj=eqc +lat_ts=0 +lat_0=0 +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")

# how many unique values
n <- length(unique(global_t2m$ANN))

# plot the annual average temperature using viridis
plot(T2M_ann,
     col = viridis(n = n),
     xlab = "Longitude",
     ylab = "Latitude")

Figure 2: Global 30-year annual meteorological (January 1984 – December 2013) average temperature at 2 meters above the Earth’s surface modelled from satellite derived data. You can mostly make out the outlines of the continents and especially the mountain ranges such as the Andes and Rocky Mountains to the left and the Tibetan plateau at about 100˚ longitude (x-axis) and 45˚ latitude (y-axis).

Retrieving Regional Data

If your interests cover a large area, it is possible to retrieve an area of cells, rather than a single cell, in a query. A regional query may not contain more than 100 points in total, i.e., an area of 4.5˚ x 4.5˚. Regional coverage is simply specified by providing a bounding box as “lower left (lon, lat)” and “upper right (lon, lat)” coordinates, i.e., lonlat = c(xmin, ymin, xmax, ymax) in that order for a given region, e.g., a bounding box for the south-western corner of Australia: lonlat = c(112.5, -55.5, 115.5, -50.5).
regional_t2m <- get_power(
  community = "AG",
  pars = "T2M",
  temporal_average = "CLIMATOLOGY",
  lonlat = c(112.5, -55.5, 115.5, -50.5)
)

# view the tibble
regional_t2m
## NASA/POWER SRB/FLASHFlux/MERRA2/ 0.5 x 0.5 Degree Climatologies
## 22-year Additional Solar Parameter Monthly & Annual Climatologies (July 1983 - June 2005), 30-year Meteorological and Solar Monthly & Annual Climatologies (January 1984 - December 2013)
## Location: Regional
## Elevation from MERRA-2: Average for 1/2x1/2 degree lat/lon region = na meters Site = na
## Climate zone: na (reference Briggs et al: http://www.energycodes.gov)
## Value for missing model data cannot be computed or out of model availability range: -99
##
## Parameters:
## T2M MERRA2 1/2x1/2 Temperature at 2 Meters (C)
##
## # A tibble: 77 x 16
##      LON   LAT PARAMETER   JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG
##  1  113. -55.2 T2M        2.98  3.38  3.13  2.4   1.65  1.01  0.41 -0.05
##  2  113. -55.2 T2M        3.07  3.47  3.22  2.5   1.75  1.11  0.51  0.05
##  3  114. -55.2 T2M        3.14  3.55  3.32  2.6   1.85  1.21  0.61  0.15
##  4  114. -55.2 T2M        3.2   3.62  3.39  2.69  1.94  1.29  0.69  0.22
##  5  115. -55.2 T2M        3.26  3.69  3.46  2.76  2.01  1.37  0.77  0.28
##  6  115. -55.2 T2M        3.3   3.73  3.51  2.82  2.06  1.43  0.83  0.34
##  7  116. -55.2 T2M        3.32  3.77  3.54  2.86  2.1   1.49  0.88  0.39
##  8  113. -54.8 T2M        3.12  3.51  3.3   2.57  1.82  1.17  0.59  0.12
##  9  113. -54.8 T2M        3.2   3.6   3.39  2.68  1.92  1.26  0.69  0.21
## 10  114. -54.8 T2M        3.27  3.7   3.49  2.78  2.02  1.36  0.78  0.3
## # … with 67 more rows, and 5 more variables: SEP, OCT,
## #   NOV, DEC, ANN

# map only the annual average temperatures by converting the annual values
# (column 16) to a raster object
T2M_ann_regional <- rasterFromXYZ(
  regional_t2m[c(1:2, 16)],
  crs = "+proj=eqc +lat_ts=0 +lat_0=0 +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")

# how many unique values
n <- length(unique(regional_t2m$ANN))

# plot the annual average temperature using viridis
plot(T2M_ann_regional,
     col = viridis(n = n),
     xlab = "Longitude",
     ylab = "Latitude")

Figure 3: Regional 30-year annual meteorological (January 1984 – December 2013) average temperature at 2 meters above the Earth’s surface modelled from satellite derived data for the south-western coastal area of Australia, illustrating the maximum allowable cells in a regional query.

As you can see, because the data are georeferenced it is easy to use
them in R’s spatial packages, including sf and raster.
Emerson Del Ponte (@edelponte) used the data in a talk at the
International Congress of Plant Pathology in August 2018, “Can Rainfall
be a Useful Predictor of Epidemic Risk Across Temporal and Spatial
Scales?”; see slides 23, 24 and 25
for maps created using nasapower and the POWER data for two states in
Brazil. These maps took a bit more work to query the API and generate,
but I plan to add an example vignette detailing how this can be done in
a future release.

Conclusion

Even though the package took me well over a year to write and debug,
and has only three functions, I learned quite a bit from
the experience. The new API really does improve the ability for a
developer to write a client to interface with the data. Now the
nasapower package is able to access all of the data available, whereas
the initial version could only work with the “AG” community data.

While nasapower does not redistribute the data or provide them in any
way, if you use the data we encourage you to follow the requests of
the POWER Project Team.

When POWER data products are used in a publication, we request the
following acknowledgement be included: “These data were obtained from
the NASA Langley Research Center POWER Project funded through the NASA
Earth Science Directorate Applied Science Program.”

The approach I have used mostly works, but there has been at least one
example where data are listed as being available in the POWER database
but querying by the apparently proper pars does not work; see Issue
#34. Because of
issues like this, in what appear to be edge cases, I suggest checking
the web interface if you experience difficulties querying data.
Hopefully the API itself has settled out a bit and will not have the
sudden changes that I experienced. The POWER team have been supportive
of the package, and I have received feedback and interaction from other
users along the way that helped me by finding and reporting bugs.

Acknowledgements

I would like to recognise the valuable comments from Emerson Del Ponte and
Paul Melloy on this blog post.
Most of all, I appreciate the rOpenSci review process, with
valuable contributions from Hazel Kavılı, Alison Boyer and, of course,
my ever-patient editor and third reviewer, Scott Chamberlain.
I welcome further feedback, issues and suggestions;
they have only made this package better. If you are interested in this or
want to know more about how to use the package, I encourage you to
browse the vignette and other on-line documentation:
https://ropensci.github.io/nasapower/articles/nasapower.html.

1. Stackhouse, Paul W., Jr., Taiping Zhang, David Westberg, A. Jason Barnett, Tyler Bristow, Bradley Macpherson, and James M. Hoell. 2018. “POWER Release 8 (with GIS Applications) Methodology (Data Parameters, Sources, & Validation)”, Data Version 8.0.1. NASA. https://power.larc.nasa.gov/documents/POWER_Data_v8_methodology.pdf.
2. Savary, Serge, Andrew Nelson, Laetitia Willocquet, Ireneo Pangga, and Jorrel Aunario. 2012. “Modeling and Mapping Potential Epidemics of Rice Diseases Globally.” Crop Protection 34 (Supplement C): 6–17. https://doi.org/10.1016/j.cropro.2011.11.009.
3. Duku, Confidence, Adam H. Sparks, and Sander J. Zwart. 2016. “Spatial Modelling of Rice Yield Losses in Tanzania Due to Bacterial Leaf Blight and Leaf Blast in a Changing Climate.” Climatic Change 135 (3): 569–83. https://doi.org/10.1007/s10584-015-1580-2.
4. van Wart, Justin, Patricio Grassini, and Kenneth G. Cassman. 2013. “Impact of Derived Global Weather Data on Simulated Crop Yields.” Global Change Biology 19 (12): 3822–34. https://doi.org/10.1111/gcb.12302.
5. van Wart, Justin, Patricio Grassini, Haishun Yang, Lieven Claessens, Andrew Jarvis, and Kenneth G. Cassman. 2015. “Creating Long-Term Weather Data from Thin Air for Crop Simulation Modeling.” Agricultural and Forest Meteorology 209-210 (Supplement C): 49–58. https://doi.org/10.1016/j.agrformet.2015.02.020.
6. Sparks, Adam, 2018. “nasapower: A NASA POWER Global Meteorology, Surface Solar Energy and Climatology Data Client for R”. Journal of Open Source Software, 3(30), 1035, https://doi.org/10.21105/joss.01035.
7. Sparks, Adam. 2019. “nasapower: NASA-POWER Data from R”. R package version 1.1.1, https://ropensci.github.io/nasapower/.
8. Chamberlain, Scott. 2018. “crul: HTTP Client”. rOpenSci. https://CRAN.R-project.org/package=crul.
9. Sparks, Adam H., Tomislav Hengl and Andrew Nelson. 2017. “GSODR: Global Summary Daily Weather Data in R”. The Journal of Open Source Software, 2(10). https://doi.org/10.21105/joss.00177.
10. Jones, J. W., G. Y. Tsuji, G. Hoogenboom, L. A. Hunt, P. K. Thornton, P. W. Wilkens, D. T. Imamura, W. T. Bowen, and U. Singh. 1998. “Decision Support System for Agrotechnology Transfer: DSSAT v3.” In Understanding Options for Agricultural Production, 157–77. Springer.
11. Jones, James W, Gerrit Hoogenboom, Cheryl H Porter, Ken J Boote, William D Batchelor, LA Hunt, Paul W Wilkens, Upendra Singh, Arjan J Gijsman, and Joe T Ritchie. 2003. “The DSSAT Cropping System Model.” European Journal of Agronomy 18 (3-4): 235–65.
12. Hoogenboom, G, CH Porter, V Shelia, KJ Boote, U Singh, JW White, LA Hunt, et al. 2017. “Decision Support System for Agrotechnology Transfer (DSSAT) Version 4.7 https://dssat.net. DSSAT Foundation, Gainesville, Florida.” USA.
13. Keating, Brian A., Peter S. Carberry, Graeme L. Hammer, Mervyn E. Probert, Michael J. Robertson, D. Holzworth, Neil I. Huth, et al. 2003. “An Overview of APSIM, a Model Designed for Farming Systems Simulation.” European Journal of Agronomy 18 (3-4): 267–88.
14. Fainges, Justin. 2017. “APSIM: General Utility Functions for the ’Agricultural Production Systems Simulator’”. CSIRO.
15. Bangyou Zheng (2012). “APSIMBatch: Analysis the output of Apsim software.” R package version 0.1.0.2374. https://CRAN.R-project.org/package=APSIMBatch
16. Bryan Stanfill (2015). “apsimr: Edit, Run and Evaluate APSIM Simulations Easily Using R”. R package version 1.2. https://CRAN.R-project.org/package=apsimr

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Generating and modeling over-dispersed binomial data

Tue, 05/14/2019 - 02:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

A couple of weeks ago, I was inspired by a study to write about a classic design issue that arises in cluster randomized trials: should we focus on the number of clusters or the size of those clusters? This trial, which is concerned with preventing opioid use disorder for at-risk patients in primary care clinics, has also motivated this second post, which concerns another important issue – over-dispersion.

A count outcome

In this study, one of the primary outcomes is the number of days of opioid use over a six-month follow-up period (to be recorded monthly by patient-report and aggregated for the six-month measure). While one might get away with assuming that the outcome is continuous, it really is not; it is a count outcome, and the possible range is 0 to 180. There are two related questions here – what model will be used to analyze the data once the study is complete? And, how should we generate simulated data to estimate the power of the study?

In this particular study, the randomization is at the physician level so that all patients in a particular physician practice will be in control or treatment. (For the purposes of simplification here, I am going to assume there is no treatment effect, so that all variation in the outcome is due to physicians and patients only.) One possibility is to assume the outcome $$Y_{ij}$$ for patient $$i$$ in group $$j$$ has a binomial distribution with 180 different “experiments” – every day we ask did the patient use opioids? – so that we say $$Y_{ij} \sim Bin(180, \ p_{ij})$$.

The probability parameter

The key parameter here is $$p_{ij}$$, the probability that patient $$i$$ (in group $$j$$) uses opioids on any given day. Given the binomial distribution, the number of days of opioid use we expect to observe for patient $$i$$ is $$180p_{ij}$$. There are at least three ways to think about how to model this probability (though there are certainly more):

• $$p_{ij} = p$$: everyone shares the same probability. The collection of all patients will represent a sample from $$Bin(180, p)$$.
• $$p_{ij} = p_j$$: the probability of the outcome is determined by the cluster or group alone. The data within the cluster will have a binomial distribution, but the collective data set will not have a strict binomial distribution and will be over-dispersed.
• $$p_{ij}$$ is unique for each individual. Once again the collective data are over-dispersed, potentially even more so.
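
A quick base-R simulation (a sketch, independent of the simstudy code below; the 0.3 and 0.5 standard deviations are illustrative choices, not the post's parameters) makes the progression concrete: the variance of the counts grows as $$p$$ varies by cluster, and again when it varies by individual.

```r
set.seed(716)
n_clusters <- 200
n_per <- 100
n <- n_clusters * n_per

# (1) everyone shares p = 0.4
y1 <- rbinom(n, size = 180, prob = 0.4)

# (2) p varies by cluster: logit-normal cluster effects
p2 <- plogis(-0.405 + rep(rnorm(n_clusters, 0, 0.3), each = n_per))
y2 <- rbinom(n, size = 180, prob = p2)

# (3) p varies by individual as well
p3 <- plogis(-0.405 + rep(rnorm(n_clusters, 0, 0.3), each = n_per) +
               rnorm(n, 0, 0.5))
y3 <- rbinom(n, size = 180, prob = p3)

c(var(y1), var(y2), var(y3))  # increasing over-dispersion
```
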
Modeling the outcome

The correct model depends, of course, on the situation at hand. What data generation process fits what we expect to be the case? Hopefully, there are existing data to inform the likely model. If not, it may be most prudent to be conservative, which usually means assuming more variation (unique $$p_{ij}$$) rather than less ($$p_{ij} = p$$).

In the first case, the probability (and counts) can be estimated using a generalized linear model (GLM) with a binomial distribution. In the second, one solution (that I will show here) is a generalized linear mixed effects model (GLMM) with a binomial distribution and a group-level random effect. In the third case, a GLMM with a negative binomial distribution would be more likely to properly estimate the variation. (I have described other ways to think about these kinds of data here and here.)

Case 1: binomial distribution

Even though there is no clustering effect in this first scenario, let’s assume there are clusters. Each individual will have a probability of 0.4 of using opioids on any given day (log odds = -0.405):

library(simstudy)
library(ggplot2)

def <- defData(varname = "m", formula = 100,
               dist = "nonrandom", id = "cid")

defa <- defDataAdd(varname = "x", formula = -.405,
                   variance = 180, dist = "binomial", link = "logit")

Generate the data:

set.seed(5113373)

dc <- genData(200, def)
dd <- genCluster(dc, cLevelVar = "cid", numIndsVar = "m", level1ID = "id")
dd <- addColumns(defa, dd)

Here is a plot of 20 of the 200 groups:

dplot <- dd[cid %in% c(1:20)]
davg <- dplot[, .(avgx = mean(x)), keyby = cid]

ggplot(data = dplot, aes(y = x, x = factor(cid))) +
  geom_jitter(size = .5, color = "grey50", width = 0.2) +
  geom_point(data = davg, aes(y = avgx, x = factor(cid)),
             shape = 21, fill = "firebrick3", size = 2) +
  theme(panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank()) +
  xlab("Group") +
  scale_y_continuous(limits = c(0, 185), breaks = c(0, 60, 120, 180))

Looking at the plot, we can see that a mixed effects model is probably not relevant.

Case 2: over-dispersion from clustering

def <- defData(varname = "ceffect", formula = 0, variance = 0.08,
               dist = "normal", id = "cid")
def <- defData(def, varname = "m", formula = "100", dist = "nonrandom")

defa <- defDataAdd(varname = "x", formula = "-0.405 + ceffect",
                   variance = 180, dist = "binomial", link = "logit")

dc <- genData(200, def)
dd <- genCluster(dc, cLevelVar = "cid", numIndsVar = "m", level1ID = "id")
dd <- addColumns(defa, dd)

This plot suggests that variation within the groups is pretty consistent, though there is variation across the groups. This suggests that a binomial GLMM with a group level random effect would be appropriate.

Case 3: added over-dispersion due to individual differences

defa <- defDataAdd(varname = "ieffect", formula = 0,
                   variance = .25, dist = "normal")
defa <- defDataAdd(defa, varname = "x", formula = "-0.405 + ceffect + ieffect",
                   variance = 180, dist = "binomial", link = "logit")

dd <- genCluster(dc, cLevelVar = "cid", numIndsVar = "m", level1ID = "id")
dd <- addColumns(defa, dd)

In this last case, it is not obvious what model to use. Since there is variability within and between groups, it is probably safe to use a negative binomial model, which is most conservative.

Estimating the parameters under a negative binomial assumption

We can fit the data we just generated with a 2-level mixed effects model, using a single-level mixed effects model with a negative binomial distribution, to estimate the parameters we can use for one last simulated data set. Here is the model fit:

library(lme4)

nbfit <- glmer.nb(x ~ 1 + (1 | cid), data = dd,
                  control = glmerControl(optimizer = "bobyqa"))

broom::tidy(nbfit)
## # A tibble: 2 x 6
##   term               estimate std.error statistic p.value group
## 1 (Intercept)           4.29     0.0123      347.       0 fixed
## 2 sd_(Intercept).cid    0.172   NA            NA       NA cid

And to generate the negative binomial data using simstudy, we need a dispersion parameter, which can be extracted from the estimated model:

(theta <- 1 / getME(nbfit, "glmer.nb.theta"))
## [1] 0.079

revar <- lme4::getME(nbfit, name = "theta")^2
revar
## cid.(Intercept)
##            0.03

Generating the data from the estimated model allows us to see how well the negative binomial model fit the dispersed binomial data that we generated. A plot of the two data sets should look pretty similar, at least with respect to the distribution of the cluster means and within-cluster individual counts.

def <- defData(varname = "ceffect", formula = 0, variance = revar,
               dist = "normal", id = "cid")
def <- defData(def, varname = "m", formula = "100", dist = "nonrandom")

defa <- defDataAdd(varname = "x", formula = "4.28 + ceffect",
                   variance = theta, dist = "negBinomial", link = "log")

dc <- genData(200, def)
ddnb <- genCluster(dc, cLevelVar = "cid", numIndsVar = "m", level1ID = "id")
ddnb <- addColumns(defa, ddnb)

The two data sets do look like they came from the same distribution. The one limitation of the negative binomial distribution is that the sample space is not limited to numbers between 0 and 180; in fact, the sample space is all non-negative integers. For at least two clusters shown, there are some individuals with counts that exceed 180 days, which of course is impossible. Because of this, it might be safer to use the over-dispersed binomial data as the generating process for a power calculation, but it would be totally fine to use the negative binomial model as the analysis model (in both the power calculation and the actual data analysis).
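That point is easy to verify by simulation, using the fitted intercept and dispersion from above (note that rnbinom's size parameter is the reciprocal of simstudy's dispersion parameter):

```r
# With mu = exp(4.28) (about 72 days) and dispersion theta = 0.079, draws
# above the 180-day maximum do occur, even though they are impossible counts.
set.seed(716)
mu <- exp(4.28)
size <- 1 / 0.079  # rnbinom's size = 1 / dispersion
draws <- rnbinom(1e6, size = size, mu = mu)
mean(draws > 180)  # a small but non-zero proportion of impossible values
```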

Estimating power

One could verify that power is indeed reduced as we move from Case 1 to Case 3. (I’ll leave that as an exercise for you – I think I’ve provided many examples in the past on how one might go about doing this. If, after struggling for a while, you aren’t successful, feel free to get in touch with me.)


To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

### Who named their kid Daenerys or Khaleesi?

Tue, 05/14/2019 - 01:25

(This article was first published on You Know, and kindly contributed to R-bloggers)

The Game of Thrones character, Daenerys Targaryen, suffered a hit to her reputation in the final season of the much-watched HBO series. For the past few years, she has been a crowd favorite–combining brave determination, style, and dragon(s). Many people went a step further in their admiration and named their offspring after her. Below are a couple of charts showing just how many, plus a demographic breakdown.

Daenerys
It turns out that one of her titles, Khaleesi, is even more popular.

Khaleesi

These charts are screenshots from this shiny app built using R. The shiny app builds on Hadley Wickham’s babynames package. Enter a name and the app shows a chart for that name. Enter many names and get a basic sense of the overall mix of those names.
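The underlying lookup is easy to reproduce outside the app. A minimal sketch using the babynames package (the subset() call is mine, not the app’s code):

```r
library(babynames)  # Hadley Wickham's package of US Social Security name counts

# All girls named "Khaleesi", by year; the name only shows up in
# recent years, after the show began airing.
kh <- subset(babynames, name == "Khaleesi" & sex == "F")
kh[order(kh$year), c("year", "n", "prop")]
```

The app essentially wraps this kind of filter in a reactive input and hands the result to a plotting function.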

Here is a link to the github repo.


To leave a comment for the author, please follow the link and comment on their blog: You Know.

### Baby Name Animation

Mon, 05/13/2019 - 21:30

(This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers)

I was playing around with the gganimate package this morning and thought I’d make a little animation showing a favorite finding about the distribution of baby names in the United States. This is the fact—I think first noticed by Laura Wattenberg, of the Baby Name Voyager—that there has been a sharp, relatively recent rise in boys’ names ending in the letter ‘n’, at the expense of names with ‘e’, ‘l’, and ‘y’ endings.

Our goal is to animate this distribution as a bar chart with one bar for each letter, which we’ll draw once for each year from 1880 to 2017 and then smoothly stitch together with gganimate’s tools.

Here’s the code in full, using the babynames dataset, which can be installed from CRAN.

The babynames data looks like this:

> babynames
# A tibble: 1,924,665 x 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# … with 1,924,655 more rows

We’re going to create a plot object, p. We take the data and subset it to boys’ names, then calculate a table of end-letter frequencies by year. Finally, we add the instructions for the plot.

## Create the plot object
p <- babynames %>%
  filter(sex == "M") %>%
  mutate(endletter = stringr::str_sub(name, -1)) %>%
  group_by(year, endletter) %>%
  summarize(letter_count = n()) %>%
  mutate(letter_prop = letter_count / sum(letter_count),
         rank = min_rank(-letter_prop) * 1) %>%
  ungroup() %>%
  ggplot(aes(x = factor(endletter, levels = letters, ordered = TRUE),
             y = letter_prop,
             group = endletter,
             fill = factor(endletter),
             color = factor(endletter))) +
  geom_col(alpha = 0.8) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  guides(color = FALSE, fill = FALSE) +
  labs(title = "Distribution of Last Letters of U.S. Boys' Names over Time",
       subtitle = '{closest_state}',
       x = "", y = "Names ending in letter",
       caption = "Data: US Social Security Administration. @kjhealy / socviz.co") +
  theme(plot.title = element_text(size = rel(2)),
        plot.subtitle = element_text(size = rel(3)),
        plot.caption = element_text(size = rel(2)),
        axis.text.x = element_text(face = "bold", size = rel(3)),
        axis.text.y = element_text(size = rel(3)),
        axis.title.y = element_text(size = rel(2))) +
  transition_states(year, transition_length = 4, state_length = 1) +
  ease_aes('cubic-in-out')

The first bit of code finds the last letter of every name in babynames using stringr’s str_sub() function. Then we count the number of ending letters and calculate a proportion, which we then rank. From there we hand things over to ggplot to draw a column chart. With gganimate you draw and polish the plot as normal—here just a column chart using geom_col()—and then add the animation instructions using transition_states() and ease_aes(). The only other trick is the use of the placeholder macro '{closest_state}' in the labs() call, where we specify the subtitle. This is what gives us the year counter.

With the plot object ready to go, we call animate() to save it to a file.

animate(p, fps = 25, duration = 20, width = 800, height = 600, renderer = gifski_renderer("figures/name_endings_boys.gif"))

And here’s the result:

Animated distribution of end-letter names for boys.

By switching the filter from "M" to "F" we can do the same for girls’ names:

Animated distribution of end-letter names for girls.

Girls’ names show a very different distribution, with ‘a’ and ‘e’ endings very common (perhaps unsurprisingly, given the legacy of Latinate names), but there’s still substantial variation in how popular ‘a’ endings are over time. The dominance of ‘a’ and ‘e’ is interesting because girls’ names show more heterogeneity overall, with parents more willing to experiment with the names of their daughters than with those of their sons.


To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.

### RcppAnnoy 0.0.12

Mon, 05/13/2019 - 03:38

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new release of RcppAnnoy is now on CRAN.

RcppAnnoy is the Rcpp-based R integration of the nifty Annoy library by Erik Bernhardsson. Annoy is a small and lightweight C++ template header library for very fast approximate nearest neighbours, originally developed to drive the famous Spotify music discovery algorithm.

This release brings several updates: seed setting follows up on changes in the previous release 0.0.11 and is now documented in the vignette thanks to James Melville; more documentation was added thanks to Adam Spannbauer; unit tests now use the brand-new tinytest package; and vignette building was decoupled from package building. All the changes in this version are summarized, with appropriate links, below:

Changes in version 0.0.12 (2019-05-12)
• Allow setting of seed (Dirk in #41 fixing #40).

• Document setSeed (James Melville in #42 documenting #41).

• Switched unit testing to the new tinytest package (Dirk in #45).

• The vignette is now pre-made and included as-is in a Sweave document, reducing the number of suggested packages.

Courtesy of CRANberries, there is also a diffstat report for this release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

### Faster Way to Slice Dataframe by Row

Mon, 05/13/2019 - 03:11

When we’d like to slice a dataframe by row, we can employ the split() function or the iter() function in the iterators package.

By leveraging the power of parallelism, I wrote a utility function, slice(), to slice the dataframe faster. In the example shown below, slice() is about three times more efficient than split() or iter() at selecting 2 records out of 5,960 rows.

df <- read.csv("hmeq.csv")
nrow(df)
# [1] 5960

slice <- function(df) {
  return(parallel::mcMap(function(i) df[i, ], seq(nrow(df)),
                         mc.cores = parallel::detectCores()))
}

Reduce(rbind, Filter(function(x) x$DEROG == 10, slice(df)))
#      BAD  LOAN MORTDUE VALUE  REASON   JOB YOJ DEROG DELINQ     CLAGE NINQ CLNO  DEBTINC
# 3094   1 16800   16204 27781 HomeImp Other   1    10      0 190.57710    0    9 27.14689
# 3280   1 17500   76100 98500 DebtCon Other   5    10      1  59.83333    5   16       NA

rbenchmark::benchmark(replications = 10, order = "elapsed", relative = "elapsed",
  columns = c("test", "replications", "elapsed", "relative"),
  "SPLIT" = Reduce(rbind, Filter(Negate(function(x) x$DEROG != 10), split(df, seq(nrow(df))))),
  "ITER " = Reduce(rbind, Filter(Negate(function(x) x$DEROG != 10), as.list(iterators::iter(df, by = "row")))),
  "SLICE" = Reduce(rbind, Filter(Negate(function(x) x$DEROG != 10), slice(df)))
)
#    test replications elapsed relative
#   SLICE           10   2.224    1.000
#   SPLIT           10   7.185    3.231
#   ITER            10   7.375    3.316

### Can The New York Times Improve Their Recipes?

Mon, 05/13/2019 - 02:20

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Introduction

The New York Times Cooking section is a favorite among many who read the online publication. Many great authors and cooks contribute to the postings, although some recipes fare much better than others. This shows up both in a recipe’s overall rating and in the number of people the site advertises have viewed it. In this project, I set out to explore trends in recipe popularity and to quantify the differences I was observing. The goal was to produce recommendations for future recipe postings that would boost customer interaction with the website.

My project can be viewed in an R Shiny app here, and the code used to scrape and analyse the data can be found on my GitHub here.

I chose 5 variables to analyse within each recipe page:

• Rating/Number of ratings
• Cooking Category
• Length of recipe
• Recipe Description
• Tag Words

Individual recipes included all, some, or only a few of these variables. I chose these five categories because they are important visual aspects of the recipes and have the potential to significantly increase consumer traffic.

To scrape the data from the Times website, I used Scrapy and built a data-collection spider to acquire info from around 1,800 recipes. The scrape required the spider to travel down 3 levels of hyperlinks to arrive at each individual recipe page. Each ‘Editors’ Collection’ spanned a total of 7 pages, which the spider navigated to collect every recipe.

Data Analysis

I transformed the data from the html format into a workable dataframe using both R and Python. I was able to parse the timestamps that were provided by the NYT and use them for time analysis after converting them to a common measurement.
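I haven’t shown the parsing code, but the conversion step can be sketched. The helper below is a hypothetical reconstruction of my own, assuming the scraped cook times arrive as strings like “45 minutes” or “1 hour 30 minutes” (the actual NYT markup may differ):

```r
# Hypothetical helper: convert cook-time strings to minutes.
# Assumes formats like "45 minutes", "1 hour", or "1 hour 30 minutes".
to_minutes <- function(x) {
  hrs  <- regmatches(x, regexpr("[0-9]+(?= *hour)",   x, perl = TRUE))
  mins <- regmatches(x, regexpr("[0-9]+(?= *minute)", x, perl = TRUE))
  hrs  <- if (length(hrs))  as.numeric(hrs)  else 0
  mins <- if (length(mins)) as.numeric(mins) else 0
  hrs * 60 + mins
}

sapply(c("45 minutes", "1 hour", "1 hour 30 minutes"), to_minutes)
# returns 45, 60, and 90 respectively
```

With every time expressed in minutes, recipes of any stated duration can be compared on one axis.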

My analysis of the data addressed 5 main questions:

• Is the overall number of ratings correlated with the star rating?
• Does the inclusion of a picture affect popularity?
• Does the inclusion of a lengthy description affect number of reviews?
• Does the inclusion of a “NYT cooking category” affect popularity?
• What are the most used and most popular recipe tags?
Results: Reviews

The star rating (0 to 5 stars) and the number of ratings were positively correlated: broadly, as a recipe’s number of ratings increased, so did its star rating. I used the number of ratings, rather than the star rating, as my measure of popularity because it is a much more granular statistic; the star rating behaves more like a categorical variable, with effectively only 3 tiers.

Picture or No Picture

I assumed the inclusion of a picture on a recipe would have a large impact on its popularity, i.e. that having a picture would result in a more popular post. This did not turn out to be the case. Although the “Picture” group was centered around a higher average number of reviews, the difference was not significant. Having a picture may be associated with a small increase in reviews, but I could not conclude empirically that this is true.

Recipe Category

There were 5 different ‘recipe categories’ that a recipe could have apart from not being given a category:

• Easy
• Editor’s Pick
• Cookbook Exclusive
• Times Classic
• Healthy

These five categories are correlated with very different popularity results. The majority of “Healthy” recipes appear much less popular than even recipes with no category at all. The only category with a significantly different average number of reviews from the no-category group is “Times Classic”, which performs much better than a similar recipe published without the category tag.

Recipe Length

After converting the recipe times to a standard measurement in minutes, you can deduce from the graphs that there is no real difference in the popularity of recipes based on how long they take to cook. Longer recipes have a slightly lower median number of reviews, though this did not prove to be a significant difference.

Description

The inclusion of a description blurb at the beginning of a recipe proved to make the most significant difference in overall popularity out of all of the variables. This makes sense as the description is a chance for the author to ‘sell’ the recipe to the reader before they take the plunge to cook it.

Tag Words

The two charts shown here list the most popular tags sorted by, respectively, the average number of ratings of the recipes they appear on and the total number of recipes they appear on. The two charts surface very different tags and give insight into which categories of recipes people gravitate towards. That difference is itself useful for designing future recipes: just because a tag word is used in many recipes does not mean it is providing value.

Conclusion

Based on the data collected and analysis, I can conclude the following from the findings:

• Popularity of a recipe, measured by the total number of ratings, is positively correlated with its star rating.
• Having a picture does not significantly affect a recipe’s popularity. This matters when deciding whether to pay for a professional food photographer: money could be saved on the back end without a significant cost on the front end.
• Having a recipe category can both positively and negatively affect the popularity of the recipe. The “Times Classic” category is correlated with significantly higher popularity. Although this might be a reason to add the category to more highly rated recipes, that would have to be analysed further; including the tag on more recipes may dilute the effect.
• Including a recipe description significantly increases the popularity of the recipe. This is an easy change to make: add a description and, generally, your recipe will become more popular.
• The inclusion of specific tag words may significantly increase the popularity of the recipe. The tag words with the highest average number of ratings may be a good “map” to steer the direction of new recipes.

### Virtual Morel Foraging with R

Mon, 05/13/2019 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Bryan Lewis is a mathematician, R developer and mushroom forager.

Morchella Americana by Bryan W. Lewis, see https://ohiomushroomsociety.wordpress.com/

It’s that time of year again, when people in the Midwestern US go nuts for morel
mushrooms. Although fairly common in Western Pennsylvania, Ohio, Indiana,
Illinois, Wisconsin, and, especially, Michigan1, they can still be
tricky to find due to the vagaries of weather and mysteries of morel
reproduction.

Morels are indeed delicious mushrooms, but I really think a big part of their
appeal is their elusive nature. It’s so exciting when you finally find some–or
even one!–after hours and hours of hiking in the woods.

For all of you not fortunate to be in the Midwest in the spring, here is a
not-so-serious note on virtual morel foraging. But really, this note explores
ways you can mine image data from the internet using a cornucopia of data
science software tools orchestrated by R.

Typical forays begin with a slow, deliberate hunt for mushrooms in the forest.
Morels, like many mushrooms, may form complex symbiotic relationships with
plants and trees, so seek out tree species that they like (elms, tulip trees,
apple trees, and some others). Upon finding a mushroom, look around and closely
observe its habitat, maybe photograph it, and perhaps remove the fruiting body
for closer inspection and analysis, and maybe for eating. The mushroom you
pick is kind of like a fruit – it is the spore-distributing body of a much
larger organism that lives below the ground. When picking, get in the habit of
carefully examining a portion of the mushroom below the ground because
sometimes that includes important identifying characteristics. Later, use
field guide keys and expert advice to identify the mushroom, maybe even
examining spores under a microscope. Sometimes we might even send a portion of
clean tissue in for DNA analysis (see, for instance,
https://mycomap.com/projects). Then, finally, for choice edible mushrooms like
morels, once you are sure of your bounty, cook and eat them!

Edible morels are actually pretty easy to identify in the field. There are a
few poisonous mushrooms that kind of look like morels, but on closer inspection
not really. Chief among them are the false morels, or Gyromitra, some of which we will find below in
our virtual foray!

Our virtual foray proceeds similarly as follows:

1. Virtually hunt for images of morel mushrooms on the internet.
2. Inspect each image for GPS location data.
3. Map the results!

Now, I know what you’re saying: most mushroom
hunters – especially morel hunters – are secretive
about their locations, and will strip GPS information from their pictures.
And we will see that is exactly the case: only about 1% of the pictures
we find include GPS data.
But there are lots of pictures on the internet,
so eventually even that 1% can be interesting to look at…

The Hunt

Our virtual mushroom foray begins as any real-world foray does, looking around
for mushrooms! But instead of a forest, we’ll use the internet. In particular,
let’s ask popular search engines to search for images of morels, and then
inspect those images for GPS coordinates.

But how can we ask internet search engines to return image information directly
to R? Unfortunately, the main image search engines like Google and Bing today
rely on interactive JavaScript operation, precluding simple use of, say, R’s
excellent curl package. Fortunately, there exists a tool for
web browser automation called Selenium and, of course, a corresponding R interface package called RSelenium.
RSelenium essentially allows R to use a web browser like a human, including
clicking on buttons, etc. Using web browser automation is not ideal because
we rely on fragile front-end web page/JavaScript interfaces that can change
at any time instead of something well-organized like HTML, but we
seem to be forced into this approach by the modern internet.

Our hunt requires that the Google Chrome browser is installed on your
system2, and of course you’ll need R! You’ll need at least the
following R packages installed. If you don’t have them, install them from CRAN:

library(wdman)
library(RSelenium)
library(jsonlite)
library(leaflet)
library(parallel)
library(htmltools)

Let’s define two functions, one to search Microsoft Bing images, and another
to search Google images. Each function takes an RSelenium browser and
a search term as input, and returns a list of search result URLs.

bing = function(wb, search_term) {
  url = sprintf("https://www.bing.com/images/search?q=%s&FORM=HDRSC2", search_term)
  wb$navigate(url)
  # infinite scroll down to load more results...
  invisible(replicate(200, wb$executeScript("window.scrollBy(0, 10000)")))
  x = wb$findElements(using = "class name", value = "btn_seemore")  # more results...
  if(length(x) > 0) x[[1]]$click()
  invisible(replicate(200, wb$executeScript("window.scrollBy(0, 10000)")))
  Map(function(x) {
    y = x$getElementAttribute("innerHTML")
    y = gsub(".* m=\\\"", "", y)
    y = gsub("\\\".*", "", y)
    y = gsub("&quot;", "\\\"", y)
    y = gsub("&amp;", "&", y)
    fromJSON(y)[c("purl", "murl")]
  }, wb$findElements(using = "class name", value = "imgpt"))
}

google = function(wb, search_term) {
  url = sprintf("https://www.google.com/search?q=%s&source=lnms&tbm=isch", search_term)
  wb$navigate(url)
  invisible(replicate(400, wb$executeScript("window.scrollBy(0, 10000)")))
  Map(function(x) {
    ans = fromJSON(x$getElementAttribute("innerHTML")[[1]])[c("isu", "ou")]
    names(ans) = c("purl", "murl")  # comply with Bing (cf.)
    ans
  }, wb$findElements(using = "xpath", value = '//div[contains(@class,"rg_meta")]'))
}

These functions emulate what a human would do by scrolling down to get more image results (both web sites use an ‘infinite scroll’ paradigm), and, in the Bing case, clicking a button. This is what I meant above when I said that this approach is fragile and not optimal – it’s quite possible that some small change in either search engine in the future will cause the above functions to not work.

Let’s finally run our virtual mushroom hunt!
We set up a Google Chrome-based RSelenium web browser interface, and run some searches:

eCaps = list(chromeOptions = list(
  args = c('--headless', '--disable-gpu', '--window-size=1280,800')))
cr = chrome(port = 4444L)
wb = remoteDriver(browserName = "chrome", port = 4444L, extraCapabilities = eCaps)
wb$open()

foray = c(google(wb, "morels"),
          google(wb, "indiana morel"),
          google(wb, "michigan morel"),
          google(wb, "oregon morel"),
          bing(wb, "morels"),
          bing(wb, "morel mushrooms"),
          bing(wb, "michigan morels"))
wb$close()

Feel free to try out different search terms. The result is a big list of possible image URLs that just might contain pictures of morels with their coordinates. This particular foray, run in late April, 2019, returned about 2000 results.

Identification

Next, we scan every result for GPS coordinates using the nifty external command-line tool called exiftool and the venerable curl program. If you don’t have those tools, you’ll need to install them on your computer. They are available for most major operating systems.
On Debian flavors of GNU/Linux like Ubuntu it’s really easy, just run:

sudo apt-get install exiftool curl

Once the curl and exiftool programs are installed, we can invoke them for each image URL result from R to efficiently scan through part of the image for GPS coordinates using these functions:

#' Extract exif image data
#' @param url HTTP image URL
#' @return vector of exif character data or NA
exif = function(url) {
  tryCatch({
    cmd = sprintf("curl --max-time 5 --connect-timeout 2 -s \"%s\" | exiftool -fast2 -", url)
    system(cmd, intern = TRUE)
  }, error = function(e) NA)
}

#' Convert an exif GPS character string into decimal latitude and longitude coordinates
#' @param x an exif GPS string
#' @return a named numeric vector of lat/lon coordinates or NA
decimal_degrees = function(x) {
  s = strsplit(strsplit(x, ":")[[1]][2], ",")[[1]]
  ans = Map(function(y)
    ifelse(y[4] == "S" || y[4] == "W", -1, 1) *
      (as.integer(y[1]) + as.numeric(y[2]) / 60 + as.numeric(y[3]) / 3600),
    strsplit(gsub(" +", " ", gsub("^ +", "", gsub("deg|'|\"", " ", s))), " "))
  names(ans) = c("lat", "lon")
  ans
}

#' Evaluate a picture and return GPS info if available
#' @param url image URL
#' @return a list with pic, date, month, label, lat, lon entries or NULL
forage = function(url) {
  ex = exif(url)
  i = grep("GPS Position", ex)
  if(length(i) == 0) return(NULL)
  pos = decimal_degrees(ex[i])
  date = tryCatch(strsplit(ex[grep("Create Date", ex)], ": ")[[1]][2],
                  error = function(e) NA)
  month = ifelse(is.na(date), NA,
                 as.numeric(strftime(strptime(date, "%Y:%m:%d %H:%M:%S"), format = "%m")))
  label = paste(date, " source: ", url)
  list(pic = paste0(""), date = date, month = month, label = label,
       lat = pos$lat, lon = pos$lon)
}

Now, there might be many search results to evaluate. Each evaluation is not very compute intensive. And the results are independent of each other. So why not run this evaluation step in parallel? R makes this easy to do, although with some differences between operating systems.
The following works well on Linux or Mac systems. It will also run fine on Windows systems, but sequentially.

options(mc.cores = detectCores() + 2)  # overload cpu a bit
print(system.time({
  bounty = do.call(function(...) rbind.data.frame(..., stringsAsFactors = FALSE),
                   mcMap(function(x) forage(x$murl), foray))
}))

# Omit zero-ed out lat/lon coordinates
bounty = bounty[round(bounty$lat) != 0 & round(bounty$lon) != 0, ]

The above R code runs through every image result, returning those containing
GPS coordinates as observations in a data frame with image URL, date, month,
label, and decimal latitude and longitude variables.

Starting with over 2,000 image results, I ended up with about 20 pictures
with GPS coordinates. Morels are as elusive in the virtual world as the
real one!

Finally, let’s plot each result colored by the month of the image on
a map using the superb R leaflet package. You can click on each point
to see its picture.

colors = c(January = "#555555", February = "#ffff00", March = "#000000",
           April = "#0000ff", May = "#00aa00", June = "#ff9900",
           July = "#00ffff", August = "#ff00ff", September = "#55aa11",
           October = "#aa9944", November = "#77ffaa", December = "#ccaa99")
clr = as.vector(colors[bounty$month])

map = addTiles(leaflet(width = "100%"))
map = addCircleMarkers(map, data = bounty, lng = ~lon, lat = ~lat,
                       fillOpacity = 0.6, stroke = FALSE, fillColor = clr,
                       label = ~label, popup = ~pic)
i = sort(unique(bounty$month))
map = addLegend(map, position = "bottomright", colors = colors[i],
                labels = names(colors)[i], title = "Month", opacity = 1)
map

Closing Notes

You may have noticed that not all of the pictures are of morels. Indeed,
there are several foray group photos, a picture of a deer, and even a few
pictures of (poisonous) false morel mushrooms.

What could be done about that? Well, if you are truly geeky and somewhat
bored – OK very bored – you could train a deep neural network to identify morels,
and then feed the above image results into that. Me, I prefer wasting my time
wandering actual woods looking for interesting mushrooms… Even if there are
no morels to find, wandering in the woods is almost always fun. It’s also worth
pointing out that the false morel and morel habitats are often quite similar, so
those false morel sightings spotted in the map above might actually be
interesting places to forage.

1. I tried first to use the phantomjs driver from R’s wdman package that doesn’t require
an external web browser. But I could only get that to work for searching
Microsoft Bing image results, not Google image search. Help or
advice on getting phantomjs to work would be appreciated!



To leave a comment for the author, please follow the link and comment on their blog: R Views.

### Deep (learning) like Jacques Cousteau – Part 4 – Scalar multiplication

Sun, 05/12/2019 - 23:30

(This article was first published on Embracing the Random | R, and kindly contributed to R-bloggers)

(TL;DR: Multiply a vector by a scalar one element at a time.)

LaTeX and MathJax warning for those viewing my feed: please view
directly on website!

We build, we stack, we multiply

Nate Dogg from ‘Multiply’ by Xzibit

Last time, we learnt about scalars. What happens when we multiply a vector by a scalar?

(I don’t know where I’m going with this diagram…but bear with me!)

Today’s topic: Multiplying vectors by scalars

Let’s use our vector \boldsymbol{x} from last time.

\boldsymbol{x} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix}

Let’s pick a scalar to multiply it by. I like the number two, so
let’s multiply it by two!

2\boldsymbol{x} = 2 \begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix}

To evaluate this, we perform scalar multiplication. That is, we
multiply each element of our vector by our scalar. Easy!

2\boldsymbol{x} = 2 \begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix}

= \begin{bmatrix} 2 \times 1 \\ 2 \times 2 \\ 2 \times 3 \\ \end{bmatrix}

= \begin{bmatrix} 2 \\ 4 \\ 6 \\ \end{bmatrix}

More generally, if our vector \boldsymbol{x} contains n elements
and we multiply it by some scalar c \in \mathbb{R}, we get:

c \boldsymbol{x} = c \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} c x_1 \\ c x_2 \\ \vdots \\ c x_n \end{bmatrix}

How can we perform scalar multiplication in R?

This is easy. It’s what R does by default.

Let’s define our vector, x.

x <- c(1, 2, 3)
print(x)
## [1] 1 2 3

Let’s define our scalar, c.

c <- 2
print(c)
## [1] 2

Now, let’s multiply our vector by our scalar.

c * x
## [1] 2 4 6

Boom! The power of vectorisation!

How does type coercion affect scalar multiplication?

Let’s define x as an integer vector.

x <- c(1L, 2L, 3L)
class(x)
## [1] "integer"

Our scalar c may also look like an integer, but it has been stored
as a numeric type, which is our proxy for real numbers.

print(c)
## [1] 2
class(c)
## [1] "numeric"

So when we multiply a numeric type by our integer vector, we
get a result in the more general numeric type!

class(c * x)
## [1] "numeric"

Conclusion

To multiply a vector by a scalar, simply multiply each element of the vector by the scalar. This is pretty easy, isn’t it?

Let’s learn how to add two vectors before we cover dot products.
Only then can we enter the matrix!

Justin


To leave a comment for the author, please follow the link and comment on their blog: Embracing the Random | R.

### Quick Hit: Updates to QuickLookR and {rdatainfo}

Sun, 05/12/2019 - 16:06

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I’m using GitUgh links here b/c the issue was submitted there. Those not wishing to be surveilled by Microsoft can find the macOS QuickLook plugin project and {rdatainfo} project in SourceHut and GitLab (~hrbrmstr and hrbrmstr accounts respectively).

I hadn’t touched QuickLookR or {rdatainfo} at all since 2016, as it was really just a proof of concept. Yet the suggestion to have it handle R Markdown (Rmd) files felt useful, so I updated {rdatainfo} to better handle data loading for rds, rdata, and rda file extensions, and made a small update to the macOS {QuickLookR} QuickLook extension project to treat Rmd files as text files, which can be previewed and then edited with the default Finder editor that you (or your apps) have set for Rmd files.

The {rdatainfo} package is only needed if you need/want R data file preview support (i.e. it’s not necessary for R markdown files). Just unzip the plugin release and put it into ~/Library/QuickLook. Here are examples for the four file types (the example code under saveRDS() and save() was used to generate those data files and the R markdown file is the default one):

file icons

Rmd Preview

rds preview

rdata preview

FIN

This is my first Xcode app build under macOS 10.14 so definitely file issues if you’re having trouble installing or compiling (there are some new shared library “gotchas” that I don’t think apply to this Xcode project but may).


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

### Earthquake Analysis (3/4): Visualizing Data on Maps

Sun, 05/12/2019 - 14:42

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)


This is the third part of our post series about the exploratory analysis of a publicly available dataset reporting earthquakes and similar events within a specific 30-day time span. In this post, we are going to show static, interactive and animated earthquake maps of different flavors by using the functionality provided by a pool of R packages, as specifically explained below.

For static maps:

• ggplot2 package
• tmap package
• ggmap package

For interactive maps:

• leaflet package
• tmap package
• mapview package

For animated maps:

• animation package
• gganimate package
• tmap package
Packages

I am going to take advantage of the following packages.

suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(ggmap))
suppressPackageStartupMessages(library(ggsn))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(sf))
suppressPackageStartupMessages(library(spData))
suppressPackageStartupMessages(library(tmap))
suppressPackageStartupMessages(library(leaflet))
suppressPackageStartupMessages(library(mapview))
suppressPackageStartupMessages(library(animation))
suppressPackageStartupMessages(library(gganimate))
suppressPackageStartupMessages(library(ggthemes))
suppressPackageStartupMessages(library(gifski))
suppressPackageStartupMessages(library(av))

Packages versions are herein listed.

packages <- c("ggplot2", "ggmap", "ggsn", "dplyr", "lubridate", "sf", "spData",
              "tmap", "leaflet", "mapview", "animation", "gganimate", "ggthemes",
              "gifski", "av")
version <- lapply(packages, packageVersion)
version_c <- do.call(c, version)
data.frame(packages = packages, version = as.character(version_c))
##     packages version
## 1    ggplot2   3.1.0
## 2      ggmap   3.0.0
## 3       ggsn   0.5.0
## 4      dplyr 0.8.0.1
## 5  lubridate   1.7.4
## 6         sf   0.7.3
## 7     spData   0.3.0
## 8       tmap     2.2
## 9    leaflet   2.0.2
## 10   mapview   2.6.3
## 11 animation     2.6
## 12 gganimate   1.0.2
## 13  ggthemes   4.1.0
## 14    gifski   0.8.6
## 15        av     0.2

I am running the following R language version on Windows 10.

R.version
##                _
## platform       x86_64-w64-mingw32
## arch           x86_64
## os             mingw32
## system         x86_64, mingw32
## status
## major          3
## minor          5.2
## year           2018
## month          12
## day            20
## svn rev        75870
## language       R
## version.string R version 3.5.2 (2018-12-20)
## nickname       Eggshell Igloo

Getting Data

As shown in the previous posts, we download the earthquake dataset from earthquake.usgs.gov, specifically the last 30 days dataset. Please note that this earthquake dataset is updated day by day to cover the last 30 days of data collection; moreover, it is not the most recent dataset available, as I collected it some weeks ago. The earthquakes dataset is in CSV format. If it is not yet present in our workspace, we download and save it. Then we load it into the quakes local variable.

if ("all_month.csv" %in% dir(".") == FALSE) {
  url <- "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv"
  download.file(url = url, destfile = "all_month.csv")
}
quakes <- read.csv("all_month.csv", header = TRUE, sep = ',', stringsAsFactors = FALSE)
quakes$time <- ymd_hms(quakes$time)
quakes$updated <- ymd_hms(quakes$updated)
quakes$magType <- as.factor(quakes$magType)
quakes$net <- as.factor(quakes$net)
quakes$type <- as.factor(quakes$type)
quakes$status <- as.factor(quakes$status)
quakes$locationSource <- as.factor(quakes$locationSource)
quakes$magSource <- as.factor(quakes$magSource)
quakes <- arrange(quakes, -row_number())
# earthquakes dataset
earthquakes <- quakes %>% filter(type == "earthquake")

Static Maps

We herein show three flavors of maps by taking advantage of the functionalities provided within the packages ggplot2, tmap and ggmap.

ggplot2 package

Taking advantage of the ggplot2 package, we create a static map of the earthquake events. Here is a description of the steps to create such a map; the numbering corresponds to the comments within the source code.

1. we get the “world” data frame, which provides data on world regions in a form suitable for plotting
2. we set the title string of our map based on the start and end of our earthquakes dataset’s timeline
3. we create a ggplot object based on the geom_map() geometry in order to obtain polygons from a reference map
4. we add the point graphic objects (by geom_point()) located where the earthquakes happened
5. we add the title to the map
6. we add the compass at the bottom right of the map, adjusting its location further with the anchor parameter
7. we add the scale bar at the bottom left, in km units, with a 2500 km minimum distance represented and the WGS84 ellipsoid model
#1
world <- map_data('world')
#2
title <- paste("Earthquakes map from ",
               paste(as.Date(earthquakes$time[1]),
                     as.Date(earthquakes$time[nrow(earthquakes)]), sep = " to "))
#3
p <- ggplot() +
  geom_map(data = world, map = world,
           aes(x = long, y = lat, group = group, map_id = region),
           fill = "white", colour = "#7f7f7f", size = 0.5)
#4
p <- p + geom_point(data = earthquakes, aes(x = longitude, y = latitude, colour = mag)) +
  scale_colour_gradient(low = "#00AA00", high = "#FF00AA")
#5
p <- p + ggtitle(title)
#6
p <- p + ggsn::north(data = earthquakes, location = "bottomright",
                     anchor = c("x" = 200, "y" = -80), symbol = 15)
#7
p <- p + ggsn::scalebar(data = earthquakes, location = "bottomleft",
                        dist = 2500, dist_unit = "km",
                        transform = TRUE, model = "WGS84")
p

tmap package

In this example, we take advantage of the tmap package. For the purpose, we instantiate a Simple Features object by taking advantage of the sf package. The Simple Features is an open standard developed and endorsed by the Open Geospatial Consortium (OGC), a not-for-profit organization. Simple Features is a hierarchical data model that represents a wide range of geometry types.

Here is the description of the steps to create such a map, numbering corresponds to the comments within the source code.

1. we set WGS84 as the projection string to be passed as an input parameter to the function that builds our spatial object
2. we set the title string
3. we convert our earthquake dataset to an sf (simple features) object with the st_as_sf() function from the sf package
4. we create a tmap-element based on the world Simple Features object available within the spData package
5. we choose the classic style for our map, with white fill and borders for regions
6. we set the title onto the map
7. we add the compass, choosing the 8star type, in the right+bottom position
8. we add a scale bar (0-2500-5000 km) in the left+bottom position
9. we add the Simple Features object previously built at step #3
10. we use the dot symbol to indicate earthquake events on the map, with a color scale associated with the magnitude of the event
#1
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
#2
title <- paste("Earthquakes map from ",
               paste(as.Date(earthquakes$time[1]),
                     as.Date(earthquakes$time[nrow(earthquakes)]), sep = " to "))
#3
df <- st_as_sf(x = earthquakes, coords = c("longitude", "latitude"), crs = projcrs)
#4
p <- tm_shape(spData::world)
#5
p <- p + tm_style("classic") + tm_fill(col = "white") + tm_borders()
#6
p <- p + tm_layout(main.title = title)
#7
p <- p + tm_compass(type = "8star", position = c("right", "bottom"))
#8
p <- p + tm_scale_bar(breaks = c(0, 2500, 5000), size = 1, position = c("left", "bottom"))
#9
p <- p + tm_shape(df)
#10
p <- p + tm_dots(size = 0.1, col = "mag", palette = "YlOrRd")
p
## Scale bar set for latitude km and will be different at the top and bottom of the map.

Before introducing further map flavors, we take some time to show how the typical problem of creating a map for a specific region and its data can be solved. Specifically, suppose we would like to draw a California map showing the earthquakes that occurred in that geographical region. The quick-and-dirty approach is to find interval ranges for longitude and latitude that determine a rectangular geographical region containing the California boundaries. This is depicted in the source code below, whose steps are described as follows.

1. we obtain a new dataset from the earthquakes one by filtering on specific longitude and latitude intervals
2. we get the California map by filtering the California state from the United States map made available by the spData package
3. we convert our california_data dataset to an sf (simple features) object with the st_as_sf() function from the sf package
4. we create a tmap-element based on the California map obtained at step #2
5. we choose the classic style for our map, with pale green fill and borders for regions
6. we set the title onto the map
7. we add the compass, choosing the 8star type, in the right+top position
8. we add a scale bar (0-100-200 km) in the left+bottom position
9. we add the Simple Features object previously built from the earthquake dataset
10. we use the dot symbol to indicate earthquake events on the map, with a color scale associated with the magnitude of the event
#1
california_data <- earthquakes %>%
  filter(longitude >= -125 & longitude <= -114 & latitude <= 42.5 & latitude >= 32.5)
#2
map_california <- us_states %>% filter(NAME == "California")
#3
df <- st_as_sf(x = california_data, coords = c("longitude", "latitude"),
               crs = st_crs(map_california))
#4
p <- tm_shape(map_california)
#5
p <- p + tm_style("classic") + tm_fill(col = "palegreen4") + tm_borders()
#6
p <- p + tm_layout(main.title = paste("California earthquakes map from ",
         paste(as.Date(california_data$time[1]),
               as.Date(california_data$time[nrow(california_data)]), sep = " to ")))
#7
p <- p + tm_compass(type = "8star", position = c("right", "top"))
#8
p <- p + tm_scale_bar(breaks = c(0, 100, 200), size = 1, position = c("left", "bottom"))
#9
p <- p + tm_shape(df)
#10
p <- p + tm_dots(size = 0.1, col = "mag", palette = "YlOrRd")
p

A better result can be achieved by first doing an inner join between our earthquake dataset and the map of California, so as to determine exactly which earthquakes fall within the California boundaries. Here are the steps to do it.

1. we convert our earthquakes dataset to an sf (simple features) object with the st_as_sf() function from the sf package
2. we inner join (left = FALSE) our simple features object with the California map; that gives a new simple features object holding the earthquakes that occurred exactly within the California geographical boundaries
3. we create a tmap-element based on the California map
4. we choose the classic style for our map, with pale green fill and borders for regions
5. we set the title onto the map
6. we add the compass, choosing the 8star type, in the right+top position
7. we add a scale bar (0-100-200 km) in the left+bottom position
8. we add the Simple Features object resulting from the inner join at step #2
9. we use the dot symbol to indicate earthquake events on the map, with a color scale associated with the magnitude of the event
#1
df <- st_as_sf(x = earthquakes, coords = c("longitude", "latitude"),
               crs = st_crs(map_california))
#2
df_map_inner_join <- st_join(df, map_california, left = FALSE)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
#3
p <- tm_shape(map_california)
#4
p <- p + tm_style("classic") + tm_fill(col = "palegreen4") + tm_borders()
#5
p <- p + tm_layout(main.title = paste("California earthquakes map from ",
         paste(as.Date(california_data$time[1]),
               as.Date(california_data$time[nrow(california_data)]), sep = " to ")))
#6
p <- p + tm_compass(type = "8star", position = c("right", "top"))
#7
p <- p + tm_scale_bar(breaks = c(0, 100, 200), size = 1, position = c("left", "bottom"))
#8
p <- p + tm_shape(df_map_inner_join)
#9
p <- p + tm_dots(size = 0.1, col = "mag", palette = "YlOrRd")
p

ggmap package

We show a static map as obtained by taking advantage of the qmplot() function within the ggmap package.

So, we draw a map where the point colour highlights the magnitude of each event. The qmplot() function is the ggmap equivalent of the ggplot2 package function qplot() and allows for the quick plotting of maps with data. The stamen map flavor is used.

title <- paste("Earthquakes map from ",
               paste(as.Date(earthquakes$time[1]),
                     as.Date(earthquakes$time[nrow(earthquakes)]), sep = " to "))
magnitude <- factor(round(earthquakes$mag))
suppressMessages(qmplot(x = longitude, y = latitude, data = earthquakes,
                        geom = "point", colour = magnitude,
                        source = "stamen", zoom = 3) +
                   scale_color_brewer(palette = 8) +
                   ggtitle(title))

Interactive Maps

An interactive map offers results that depend upon mouse click actions on the map itself; for example, showing a pop-up with some data when the user clicks on the symbol indicating the location of an earthquake event. Unfortunately, the interaction with the map cannot be reproduced when embedding it into DS+ site posts; you have to try it out by yourself using the following source code. This is true for all the examples we are going to show.

leaflet package

The leaflet package provides functionality to create and customize interactive maps using the Leaflet JavaScript library and the htmlwidgets package. These maps can be used directly from the R console, from RStudio, and in Shiny applications and R Markdown documents. Here is an example where a pop-up is defined to provide the place, identifier, time, magnitude and depth data. Further, the cluster option eases the user experience by means of a hierarchical representation in terms of clusters that incrementally show up. Here is what we can get.

earthquakes %>% leaflet() %>% addTiles() %>%
  addMarkers(~longitude, ~latitude,
             popup = paste("Place: ", earthquakes$place, "<br>",
                           "Id: ", earthquakes$id, "<br>",
                           "Time: ", earthquakes$time, "<br>",
                           "Magnitude: ", earthquakes$mag, "<br>",
                           "Depth: ", earthquakes$depth),
             clusterOptions = markerClusterOptions())

tmap package

We take advantage of the tmap package again, this time for an interactive map, which can easily be created from the previously shown source code for the same package once the tmap view mode is set. Here are the steps to generate such a map.

1. we set WGS84 as the projection string to be passed as an input parameter to the function that builds our spatial object
2. we set the title string
3. we convert our starting earthquake dataset to an sf (simple features) object with the st_as_sf() function from the sf package
4. we set the tmap_mode equal to “view” to allow for interaction (the default is “plot”)
5. we create a tmap-element based on the world Simple Features object available within the spData package
6. we choose the classic style for our map, with white fill and borders for regions
7. we set the title onto the map
8. since the compass is not supported in view mode, we comment out the line of code previously used for static maps
9. we add a scale bar (0-2500-5000 km) in the left+bottom position
10. we add the Simple Features object previously built at step #3
11. we use the dot symbol to indicate earthquake events on the map, with a color scale associated with the magnitude of the event
#1
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
#2
title <- paste("Earthquakes map from ",
               paste(as.Date(earthquakes$time[1]),
                     as.Date(earthquakes$time[nrow(earthquakes)]), sep = " to "))
#3
df <- st_as_sf(x = earthquakes, coords = c("longitude", "latitude"), crs = projcrs)
#4
tmap_mode("view")
## tmap mode set to interactive viewing
#5
p <- tm_shape(spData::world)
#6
p <- p + tm_style("classic") + tm_fill(col = "white") + tm_borders()
#7
p <- p + tm_layout(main.title = title)
#8 compass is not supported in view mode
#p <- p + tm_compass(type = "8star", position = c("right", "bottom"))
#9
p <- p + tm_scale_bar(breaks = c(0, 2500, 5000), size = 1, position = c("left", "bottom"))
#10
p <- p + tm_shape(df)
#11
p <- p + tm_dots(size = 0.01, col = "mag", palette = "YlOrRd")
p

We set the mode to “plot” so to revert back to the static map flavor.

tmap_mode("plot")
## tmap mode set to plotting

mapview package

We show an interactive map as obtained by taking advantage of the mapview package. Here are the steps.

1. we set the WGS84 as a string projection that will be passed as input parameter to the function which will build our spatial object
2. we set the title string
3. we convert our starting earthquake dataset to a sf (simple features) object by st_as_sf() function within the sf package
4. we create a palette by using colorRampPalette(), a function that returns a palette-generating function which interpolates among the given colors for a requested number of colors
5. we create the interactive map of CartoDB.Positron flavor with popups showing place and magnitude; we control the size of the dot by the cex parameter; we choose to interpolate on 12 colors scale starting from the palette created at step #4
#1
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
#2
title <- paste("Earthquakes map from ",
               paste(earthquakes$time[1], earthquakes$time[nrow(earthquakes)], sep = " to "))
#3
df <- st_as_sf(x = earthquakes, coords = c("longitude", "latitude"), crs = projcrs)
#4
pal <- colorRampPalette(c("green", "yellow", "red", "violet"))
#5
mapview(df, popup = popupTable(df, zcol = c("place", "mag")),
        zcol = "mag", legend = TRUE,
        map.types = c("CartoDB.Positron"),
        cex = 4, alpha = 0.3, col.regions = pal(12))

As map flavor, you can choose among CartoDB.Positron, CartoDB.DarkMatter, OpenStreetMap, Esri.WorldImagery, OpenTopoMap.
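As a sketch of how that choice works (reusing the df and pal objects built in the chunk above), passing several flavors to map.types gives the resulting widget an interactive layer switcher, for example:

```r
# assumes df (the sf object) and pal (the colorRampPalette function) from the chunk above
mapview(df,
        zcol = "mag",                         # color the dots by event magnitude
        legend = TRUE,
        map.types = c("Esri.WorldImagery",    # satellite imagery basemap
                      "OpenTopoMap"),         # topographic alternative, toggled in the widget
        cex = 4, alpha = 0.3,
        col.regions = pal(12))
```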

Animated Maps

We would like to create an animated map showing day by day the location of the events. We then show different implementations for achieving such a goal.

animation package

We herein show how to create an animated GIF by means of the animation package. The option to generate a short video in the “avi” format is also outlined. Here are the steps to do it.

1. we determine the days’ array to be displayed on the map
2. we determine how many pictures to generate, which is equal to the number of days in the timespan
3. we define a custom color vector of colors name strings
4. we determine the box area to be used for map display; that is useful for avoiding an annoying map resizing effect from frame to frame during the resulting animation
5. we get the map of stamen flavor and terrain type
6. we save the GIF resulting from a loop of maps plot where:
• we get the earthquake day
• we filter the earthquake dataset to find the events associated to the day of step 6.1

• we translate the earthquake magnitude to a new factor variable

• we create the map by ggmap()

• we use point geometries to highlight where the earthquake happened

• we use a color scale custom defined as based on the colors vector defined at step #3

• we add the map title

• we plot the resulting map

• we set as animation options:
– time interval of the animation to 1
– maximum number of steps for each loop equal to 1
– map width and height both equal to 1000 pixels
– resulting animated GIF file name
#1
days <- unique(as.Date(earthquakes$time))
#2
l <- length(days)
#3
cols <- c("-1" = "grey", "0" = "darkgrey", "1" = "green", "2" = "blue",
          "3" = "yellow", "4" = "pink", "5" = "orange", "6" = "red",
          "7" = "violet", "8" = "black", "NA" = "white")
#4
bbox <- make_bbox(lon = c(-180, 180), lat = c(-70, 70), f = 0)
#5
map <- get_map(location = bbox, zoom = 3, source = "stamen",
               maptype = "terrain", force = FALSE)
#6
saveGIF({
  for (i in 1:l) {
    #6.1
    the_day <- days[i]
    #6.2
    earthquakes_of_day <- earthquakes %>% filter(as.Date(time) == the_day)
    #6.3
    magnitude <- factor(round(earthquakes_of_day$mag))
    #6.4
    p <- ggmap(map)
    #6.5
    p <- p + geom_point(data = earthquakes_of_day,
                        aes(x = longitude, y = latitude, color = magnitude))
    #6.6
    p <- p + scale_colour_manual(values = cols)
    #6.7
    p <- p + ggtitle(the_day)
    #6.8
    plot(p)
  }
#6.9
}, interval = 1, nmax = l, ani.width = 1000, ani.height = 1000,
   movie.name = "earthquakes1.gif")

Please click on the picture below to see its animation

gganimate package

Another way to create an animated GIF is to leverage the animate() function of the gganimate package. Here are the steps to do it.

1. we determine the days’ array to be displayed on the map
2. we determine the number of frames to be generated for our video, as equal to the number of days our dataset reports
3. we create a new date column in the earthquakes dataset from the event time
4. we translate the earthquake magnitude to a new factor variable
5. we define a custom color vector of colors name strings
6. we first create the world map showing regions with borders and fill gray colors
7. we highlight the earthquake events on the map by means of point geometry with point colors based on the magnitude; the cumulative flag allows for building up an object or path over time
8. we set our custom color scale as defined at step #5
9. we add a label showing the day associated to the currently displayed frame within the resulting animation
10. we define the frame-to-frame transition variable as the earthquake event day
11. we set the fade effect to be applied to the frame visualization sequence
12. we set the animation options specifying the image size in pixel units
13. we generate the animation specifying its frames length equal to the number of days and the resulting animated GIF file name
#1
days <- unique(as.Date(earthquakes$time))
#2
l <- length(days)
#3
earthquakes$date <- as.Date(earthquakes$time)
#4
magnitude <- factor(round(earthquakes$mag))
#5
cols <- c("-1" = "grey", "0" = "darkgrey", "1" = "green", "2" = "blue",
          "3" = "yellow", "4" = "pink", "5" = "orange", "6" = "red",
          "7" = "violet", "8" = "black", "NA" = "white")
#6
map <- ggplot() + borders("world", colour = "gray65", fill = "gray60") + theme_map()
#7
map <- map + geom_point(aes(x = longitude, y = latitude, colour = magnitude,
                            frame = date, cumulative = TRUE), data = earthquakes)
#8
map <- map + scale_colour_manual(values = cols)
#9
map <- map + geom_text(data = earthquakes, aes(-100, 100, label = date))
#10
map <- map + transition_time(date)
#11
map <- map + enter_fade() + exit_fade()
#12
options(gganimate.dev_args = list(width = 1000, height = 600))
#13
suppressMessages(animate(map, nframes = l, duration = l,
                         renderer = gifski_renderer("earthquakes2.gif")))
## Frame 1 (3%) Frame 2 (6%) Frame 3 (9%) Frame 4 (12%) Frame 5 (16%) Frame 6 (19%) Frame 7 (22%) Frame 8 (25%) Frame 9 (29%) Frame 10 (32%) Frame 11 (35%) Frame 12 (38%) Frame 13 (41%) Frame 14 (45%) Frame 15 (48%) Frame 16 (51%) Frame 17 (54%) Frame 18 (58%) Frame 19 (61%) Frame 20 (64%) Frame 21 (67%) Frame 22 (70%) Frame 23 (74%) Frame 24 (77%) Frame 25 (80%) Frame 26 (83%) Frame 27 (87%) Frame 28 (90%) Frame 29 (93%) Frame 30 (96%) Frame 31 (100%)
## Finalizing encoding... done!

Please click on the picture below to see its animation

If you like to produce your animation in avi format, change the renderer as herein shown.

animate(map, nframes = 31, duration = 31, renderer = av_renderer("earthquakes2.avi"))

tmap package

The tmap package provides the function tmap_animation() to create animated maps. Here are the steps to do it.

1. we create a new date column inside our earthquake dataset
2. we convert our starting earthquake dataset to an sf (simple features) object with the st_as_sf() function from the sf package
3. we create a tmap-element based on the world Simple Features object available within the spData package
4. we choose the classic style for our map, with gray fill and borders for regions
5. we add the compass, choosing the 8star type, in the right+top position
6. we add a scale bar (0-2500-5000 km) in the left+bottom position
7. we add the Simple Features object previously built at step #2
8. we use the dot symbol to indicate earthquake events on the map, with a color scale associated with the magnitude of the event
9. we define facets based on the earthquake event date
10. we save the result in a variable to be used later
#1
earthquakes$date <- as.Date(earthquakes$time)
#2
df <- st_as_sf(x = earthquakes, coords = c("longitude", "latitude"),
               crs = st_crs(spData::world))
#3
p <- tm_shape(spData::world)
#4
p <- p + tm_style("classic") + tm_fill(col = "gray") + tm_borders()
#5
p <- p + tm_compass(type = "8star", position = c("right", "top"))
#6
p <- p + tm_scale_bar(breaks = c(0, 2500, 5000), size = 1, position = c("left", "bottom"))
#7
p <- p + tm_shape(df)
#8
p <- p + tm_dots(size = 0.2, col = "mag", palette = "YlOrRd")
#9
p <- p + tm_facets(along = "date", free.coords = TRUE)
#10
maps_result_for_anim <- p

Finally, the animated map is produced by means of the tmap_animation() function.

tmap_animation(maps_result_for_anim, filename = "earthquakes3.gif", delay = 100)

Please click on the picture below to see its animation

If you would like to read more about map visualization with R, take a look at the references list below.

If you have any questions, please feel free to comment below.

References

Related Post


To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

### Make it explainable!

Sun, 05/12/2019 - 11:03

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

Most people make the mistake of thinking design is what it looks like… People think it’s this veneer — that the designers are handed this box and told, ‚Make it look good!’ That’s not what we think design is. It’s not just what it looks like and feels like. Design is how it works.

Steve Jobs, The New York Times, 2003.

The same goes for interpretable machine learning.
Recently, I have been talking a lot about interpretations and explainability, and I sometimes get the impression that techniques like SHAP, Break Down, LIME and SAFE are treated like magical incantations that convert complex predictive models into “something interpretable”.

But interpretability/explainability is not a binary feature that you either have or don’t. It’s a process. The goal is to increase our understanding of the model’s behavior. Try different techniques to broaden your knowledge about the model or its predictions.
Maybe you will never explain 100% of it, but you will understand more.

XAI/IML (eXplainable Artificial Intelligence/Interpretable Machine Learning) techniques can be used not only for post-hoc explainability, but also for model maintenance, debugging, or in the early phases of modeling. Visual tools like PDP/ALE/Ceteris Paribus will change the way we approach modeling and how we interact with models, whether as model developers, model auditors or users.

Together with Tomasz Burzykowski from UHasselt, I am working on a book about the methodology for visual exploration, explanation and debugging of predictive models.

Find the early version here https://pbiecek.github.io/PM_VEE/.

There are a lot of R snippets that show how to use DALEX (and sometimes other packages like shapper, ingredients, iml, iBreakDown, condvis, localModel, pdp) to better understand some aspect of your predictive model.

It’s a work in progress, still in an early, dirty phase (despite the fact that we started a year ago).
Feel free to comment on it or suggest improvements. The easiest way to do this is to add a new issue.

Code snippets work fully thanks to archivist hooks. I think it’s the first book that uses archivist hooks for a blended experience: you can read about a model online, and with just one line of code you can download that object into your R console.
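The mechanism looks roughly like this (a hedged sketch: the repository path and hash below are placeholders for illustration, not real identifiers from the book):

```r
library(archivist)

# aread() downloads a previously archived R object from a GitHub-backed
# archivist repository, identified as "user/repo/md5hash"
model <- aread("user/repo/0123456789abcdef")  # placeholder identifier
```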

First chapters show how to use Ceteris Paribus Profiles / Individual Conditional Expectations to perform what-if/sensitivity analysis of a model.


To leave a comment for the author, please follow the link and comment on their blog: English – SmarterPoland.pl.

### The Secret of Landusia: A Text Adventure in the R Language

Sun, 05/12/2019 - 10:38

(This article was first published on The Devil is in the Data – The Lucid Manager, and kindly contributed to R-bloggers)

In the 1980s and early 1990s, I spent a lot of time playing and writing text adventure games on my Atari 8-bit computer. Text adventures, or interactive fiction, are games where the player reads a story on the screen. The player can change the outcomes of the story by issuing text instructions to the computer. These lines are an extract from Colossal Cave, by Will Crowther and Don Woods, the very first text adventure written in the mid-1970s:

You are standing at the end of a road before a small brick building. Around you is a forest. A small stream flows out of the building and down a gully.

> go south

You are in a valley in the forest beside a stream tumbling along a rocky bed.

> enter stream

The basic premise of a text adventure is that the player finds herself in a simulated world. Unlike contemporary simulations, this virtual world is created only with text. This might seem a strange idea to modern gamers, but simulated worlds built from text have existed for thousands of years. A text adventure enhances the traditional book by giving the player the freedom to act within the narrative. The objective of these games is to solve problems to reach a specified goal.

This article describes how to write a text adventure in the R language. It replicates a game and the techniques described by Hal Renko and Sam Edwards in the Dutch book Adventures! (1986).

The sections below describe the detailed workings and the narrative of the game. If you don’t like spoilers, then perhaps you should download the code and play The Secret of Landusia before reading further.

The Secret of Landusia Text Adventure

This section explains the detailed workings of the Landusia game. To view the code, please go to the GitHub repository. Besides the main R script, the text adventure consists of three data files:

• actions.md: Markdown file with descriptions of rooms and actions.
• map.csv: Matrix of the relationships between rooms.
• objects.csv: Matrix of the status of objects and actors.

The Game Loop

The computer displays the current state of the game with a short piece of prose. The player then enters a command in natural language, which changes the state of the game. When the state of the player or the game reaches a certain point, the player has either won or lost the game. The game loop processes the commands of the player and determines the new state of the game. The state of the game consists of:

• Player: The player can have many attributes, such as health, objects they are carrying and their location.
• Map: Logic of the game localities and a description of each area.
• Objects: Are either placed in a room or are with the player.
• Actors: Can act independently.

The initialisation defines the vocabulary through a vector of verbs and a vector of directions. The code reads the map, objects and prose into memory and sets the starting variables for the player. The initialisation includes two auxiliary functions that the player cannot call directly.

The prose function takes a key phrase and finds the location of the heading for that key in the actions variable. The following lines of code print the text up to the next heading indicator. The prose function describes rooms, actions, actors and so on. Using a markdown file simplifies writing the text so it is not hidden amongst code. The prose function reads keywords after double hashtags. Other headings are used for convenience.
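A minimal sketch of such a prose helper might look as follows; it assumes the markdown file has already been read into a character vector with readLines() and that every key phrase appears as a level-two heading. The sample text and names are illustrative, not the author's actual code.

```r
# Illustrative content, as if read with: actions <- readLines("actions.md")
actions <- c("## forest",
             "You are in a dark forest.",
             "Tall trees surround you.",
             "## castle",
             "You stand before Clovar's castle.")

# Print the text under the heading "## key", up to the next heading
prose <- function(key) {
  headings <- grep("^## ", actions)                  # locations of all headings
  start <- grep(paste0("^## ", key, "$"), actions)   # heading for this key
  nxt <- headings[headings > start]                  # next heading, if any
  end <- if (length(nxt) == 0) length(actions) else nxt[1] - 1
  cat(actions[(start + 1):end], sep = "\n")
}

prose("forest")
```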

The inventory function lists all objects that the player carries. This function is a separate verb in most adventure games, but in this game, it is integrated with other verbs.

The Player

The program simulates the player with a set of parameters, such as their location, level of health and whatever other aspect is relevant. A player can, for example, only carry a certain number of objects, can only endure two blows by a sword, and so on.

In the Landusia game, the player has a level of health and a maximum carrying capacity. The room variable keeps track of where the player is located.

The Map

The game consists of 28 ‘rooms’. The map is stored in a matrix which indicates the relationships between the rooms. Each direction can have one or no exit. The map can also dynamically change during the game. Specific directions can, for example, only become available after the player opens a door. A rockfall could close an entrance, and so on. The player can move around the map by issuing directional commands. These commands are the four cardinal compass directions, and up and down.
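The idea of a map matrix with one row per room and one column per direction can be sketched as follows; the room numbers and exits here are made up for illustration and are not the actual Landusia map.

```r
# One row per room, one column per direction; 0 means no exit that way.
directions <- c("north", "east", "south", "west", "up", "down")
map <- matrix(0, nrow = 3, ncol = 6, dimnames = list(NULL, directions))
map[1, "south"] <- 2   # room 1 exits south into room 2
map[2, "north"] <- 1
map[2, "east"]  <- 3

# Move the player: return the destination, or stay put if there is no exit
move <- function(room, direction) {
  destination <- map[room, direction]
  if (destination == 0) {
    cat("You cannot go that way.\n")
    room
  } else {
    destination
  }
}

room <- move(1, "south")   # the player is now in room 2
```

Because the map is just a matrix, dynamic changes such as opening a door are a single assignment, e.g. `map[14, "north"] <- 15`.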

In Landusia, the first nine rooms are a forest. The blacksmith lives south of the forest. Clovar’s castle lies to the south-east and the temple north-east of the forest. The connection between rooms 14 and 15 opens after the player uses the rope in room 14.

Secret of Landusia map (click to enlarge).

Objects

The descriptions of the different rooms of the game can contain objects. These objects are fixed in that their description is hard-coded into the game. A text adventure can also have movable objects. Each object can have its own properties, such as weight, and location. The location of the object can either be a room or the player (room zero). These properties are stored in another matrix.

The Landusia game has four objects: bandage, sword, flute and rope. These objects are scattered around the map and have a certain weight. The capacity variable limits what the player can carry simultaneously.
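The object table and the carrying-capacity check can be sketched like this. The weights and starting locations are illustrative assumptions (the original stores the data in a matrix read from objects.csv; a data frame is used here for readability).

```r
# Location 0 means the player carries the object; other numbers are rooms.
# Weights and locations are made up for this sketch.
objects <- data.frame(name     = c("bandage", "sword", "flute", "rope"),
                      weight   = c(1, 3, 1, 2),
                      location = c(5, 11, 100, 14))
capacity <- 5

# Total weight the player currently carries
carrying <- function() sum(objects$weight[objects$location == 0])
```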

Actors

Actors are like objects, with the difference that they are active in the game. As the game progresses, actors take independent actions. Actors can move around the map and respond to the actions of the player. They can have properties, such as health and strength. Actors can be either friends or enemies that create a dynamic narrative.

Landusia has four actors, which are stored in the same matrix as the objects. Each actor has a name, location, health and status. The actors function controls their actions.

Wizard

The wizard appears in the living quarters. His only task is to give the player the flute, once they manage to reach him. After this one action, the wizard vanishes by changing his location to room 100.

Crow

When the player is in the forest, a random variable determines whether they are swooped by a crow. Another random variable determines whether the player is hurt. If the crow manages to hit the player, the code reduces the health variable.
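A hedged sketch of such a handler is shown below; the forest rooms (1–9) follow the map description above, but the swoop and hit probabilities are assumptions, not the game's actual values.

```r
# Crow handler sketch: a random swoop in the forest, a random chance to hit.
crow <- function(room, health) {
  if (room <= 9 && runif(1) < 0.3) {   # rooms 1-9 are the forest
    cat("A crow swoops down at you!\n")
    if (runif(1) < 0.5) {              # the crow may or may not hit
      cat("The crow hurts you.\n")
      health <- health - 1
    }
  }
  health
}
```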

Dragon

The dragon sequence depends on the room the player is in. In room 17, the dragon has a 50% chance of hitting, while in room 18, the dragon is always successful. The player needs to kill the dragon with the sword (see below). The success of these fights depends on a random variable.

Blacksmith

The blacksmith has the most complex handler of them all. He starts lying wounded on the floor. If the player does not help the blacksmith by using the bandage, he will die. If the player uses the bandage, the blacksmith is healed and moves to the workshop to forge a sword for the player. The status variable controls the evolution of the blacksmith in the game.
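The status-driven logic can be sketched as a small state machine. The status codes and the five-turn limit below are assumptions for illustration, not values taken from the original code.

```r
# Blacksmith state sketch. Assumed status codes:
# 1 = wounded, 2 = healed and forging, 3 = dead
blacksmith <- function(status, turns_wounded, player_used_bandage) {
  if (status == 1) {
    if (player_used_bandage) {
      status <- 2                   # healed: he moves to the workshop
      cat("The blacksmith limps to his workshop to forge a sword.\n")
    } else {
      turns_wounded <- turns_wounded + 1
      if (turns_wounded > 5) {      # assumed limit before he succumbs
        status <- 3
        cat("The blacksmith has died from his wounds.\n")
      }
    }
  }
  list(status = status, turns = turns_wounded)
}
```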

Command interpreter

The command interpreter asks the player for instructions, which the program then interprets. The state of the game (player, map, objects and actors) is changed, depending on the input. If the state of the game reaches a particular defined state, the game will end either successfully or miserably.

The basic principle of all command interpreters in interactive fiction is that they analyse the input using regex-type approaches. The interpreter compares the player’s instructions with a vocabulary and extracts the verbs and nouns.

For example, “take lamp” calls the take function and passes it the parameter lamp. The object lamp will then be moved from its current location to the player. The game could also check if the player still has enough carrying capacity and so on.
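A minimal version of this matching step can be sketched as follows; the vocabularies are illustrative, not the game's full word lists.

```r
# Minimal command interpreter sketch: split the input into words and
# match them against small vocabularies of verbs and nouns.
verbs <- c("take", "put", "use", "kill", "look", "wait")
nouns <- c("bandage", "sword", "flute", "rope", "lamp")

parse_command <- function(input) {
  words <- tolower(strsplit(input, "\\s+")[[1]])
  verb <- words[words %in% verbs][1]   # first recognised verb, or NA
  noun <- words[words %in% nouns][1]   # first recognised noun, or NA
  if (is.na(verb)) {
    cat("I don't understand that.\n")
    NULL
  } else {
    list(verb = verb, noun = noun)
  }
}

cmd <- parse_command("take the lamp")
# cmd$verb is "take", cmd$noun is "lamp"
```

The main loop can then dispatch on `cmd$verb` and pass `cmd$noun` to the matching verb function.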

If the player enters a command that the interpreter cannot parse, then a negative response follows. The better the text adventure, the higher the likelihood that the game understands the command. Traditional text adventures use elaborate command interpreters to maximise the level of realism of the simulated world. This text adventure uses a minimised vocabulary and each of the verbs has its own function.

Look

The look function takes a room number as input and displays the prose related to the room. The next section lists the available passages and objects in the room. The final part of the look function displays any objects that the player is carrying.

Take and Put

These two functions allow a player to take an object or place it on the ground. The weight of the object is added or subtracted to the carrying capacity. When the object is not present in the room or not carried by the player, the code responds appropriately.

These functions cannot respond to objects mentioned in the room descriptions or actors, such as trees or the dragon. This can be easily remedied by creating two types of objects, moveable and immovable.
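A hedged sketch of the take function, assuming the illustrative objects data frame and capacity from the Objects section above (these names and values are assumptions, not the author's actual code):

```r
objects <- data.frame(name     = c("bandage", "sword"),
                      weight   = c(1, 3),
                      location = c(5, 11))  # room numbers; 0 = the player
capacity <- 5
room <- 5   # the player's current room

take <- function(noun) {
  i <- which(objects$name == noun)
  carried <- sum(objects$weight[objects$location == 0])
  if (length(i) == 0 || objects$location[i] != room) {
    cat("There is no", noun, "here.\n")
  } else if (carried + objects$weight[i] > capacity) {
    cat("You are carrying too much.\n")
  } else {
    objects$location[i] <<- 0   # move the object to the player
    cat("You take the", noun, "\n")
  }
}
```

The put function is the mirror image: it sets the object's location back to the current room.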

Wait

The wait function does nothing. Waiting might be useful to let other actors do their thing. The wait function takes a dummy parameter because it needs none, but the main loop passes one to every verb function. This verb is needed to give the blacksmith enough time to forge the sword.

Use

The use function starts by checking whether the player actually has the object in their possession. The remainder of the function contains instructions for each of the four objects. The bandage will heal either the player or the blacksmith. The sword is not very useful in this context. The flute activates the magical teleportation by changing the room variable. Finally, using the rope changes the map by opening the connection between rooms 14 and 15. This action also changes the prose so that the room descriptions are enhanced.

Kill

This last function is conditional upon having the sword. If the player has no sword, then their fists are the weapon of choice. If an opponent is present in the same room as the player, they have a 60% chance of hitting the opponent. The sword has a higher strength than a fist fight. The actor function controls the return hits.
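The combat logic can be sketched like this. The 60% hit chance comes from the text; the damage values for sword and fists are illustrative assumptions.

```r
# Kill verb sketch: hit chance from the text, damage values assumed.
kill <- function(has_sword, opponent_health) {
  if (runif(1) < 0.6) {                  # 60% chance of hitting the opponent
    damage <- if (has_sword) 3 else 1    # the sword is stronger than fists
    opponent_health <- opponent_health - damage
    cat("You hit your opponent!\n")
  } else {
    cat("You miss.\n")
  }
  opponent_health
}
```

After the player's blow, the actor function would give the opponent its return hit.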

Expanding the Game

The game is modular in structure, so it should be easy to enhance it or to write a completely different game. The key to writing a good adventure is simplicity. It is easy to get carried away writing complex command structures that the player then has to guess.

Feel free to leave a message below if you find bugs or would like to enhance the prose or code.

The next wave of interactive fiction?

A lot has changed in computing since the heyday of text adventures. Although these games provided an illusion of freedom to act in a simulated world, the player is always limited by what the writer intended them to do.

Perhaps natural language processing, complex Markov chains and text-generating deep learning methods can generate the ultimate text adventure. Imagine a text adventure where the computer understands the basic structure but freely develops the prose and interactions.

Other Games in the R Language

The Secret of Landusia: A Text Adventure in the R Language

Celebrate Halloween with Creepy Computer Games in R

Tic Tac Toe Simulation: The Intelligent Minimax Algorithm

R-Cade Games: Simulating the Legendary Pong Game

The post The Secret of Landusia: A Text Adventure in the R Language appeared first on The Lucid Manager.


To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data – The Lucid Manager.