Boston EARL Keynote speaker announcement: Tareef Kawaf
(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)
Mango Solutions are thrilled to announce that Tareef Kawaf, President of RStudio, will be joining us at EARL Boston as our third Keynote Speaker.
Tareef is an experienced software startup executive and a member of teams that built up ATG’s eCommerce offering and Brightcove’s Online Video Platform, helping both companies grow from early startups to publicly traded companies. He joined RStudio in early 2013 to help define its commercial product strategy and build the team. He is a software engineer by training, and an aspiring student of advanced analytics and R.
This will be Tareef’s second time speaking at EARL Boston. We’re big supporters of RStudio’s mission to provide the most widely used open source and enterprise-ready professional software for the R statistical computing environment, so we’re looking forward to him taking to the podium again this year.
Want to join Tareef at EARL Boston? Abstract submissions close on 31 August, so time is running out to share your R adventures and innovations with fellow R users.
All accepted speakers receive a 1-day Conference pass and a ticket to the evening networking reception.
Buy a ticket
Early bird tickets are now available! Save more than $100 on a Full Conference pass.
To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Analyzing Google Trends Data in R
Google Trends shows the changes in the popularity of search terms over a given time (i.e., number of hits over time). It can be used to find search terms with growing or decreasing popularity or to review periodic variations from the past such as seasonality. Google Trends search data can be added to other analyses, manipulated and explored in more detail in R.
This post describes how you can use R to download data from Google Trends, and then include it in a chart or other analysis. We’ll discuss first how you can get overall (global) data on a search term (query), how to plot it as a simple line chart, and then how you can break the data down by geographical region. The first example I will look at is the rise and fall of the Blu-ray.
Analyzing Google Trends in R

I have never bought a Blu-ray disc and probably never will. In my world, technology moved from DVDs to streaming without the need for a high definition physical medium. I still see them in some shops, but it feels as though they are declining. Using Google Trends we can find out when interest in Blu-rays peaked.
The following R code retrieves the global search history since 2004 for Blu-ray.
library(gtrendsR)
library(reshape2)

google.trends = gtrends(c("blu-ray"), gprop = "web", time = "all")[[1]]
google.trends = dcast(google.trends, date ~ keyword + geo, value.var = "hits")
rownames(google.trends) = google.trends$date
google.trends$date = NULL

The first argument to the gtrends function is a list of up to 5 search terms. In this case, we have just one item. The second argument, gprop, is the medium searched on and can be any of web, news, images or youtube. The third argument, time, can be any of now 1-d, now 7-d, today 1-m, today 3-m, today 12-m, today+5-y or all (which means since 2004). A final possibility for time is to specify a custom date range, e.g. 2010-12-31 2011-06-30.
Note that I am using gtrendsR version 1.9.9.0. This version improves upon the CRAN version 1.3.5 (as of August 2017) by not requiring a login. You may see a warning if your timezone is not set – this can be avoided by adding the following line of code:
Sys.setenv(TZ = "UTC")

After retrieving the data from Google Trends, I format it into a table with dates for the row names and search terms along the columns. The table below shows the result of running this code.
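To make the dcast() reshaping step concrete, here is a toy example. The data frame below is made up for illustration; a real gtrends() result has the same date/keyword/geo/hits columns.

```r
# Toy illustration of the dcast() step above: long-format hits data
# is spread into one column per keyword/geo combination.
library(reshape2)

toy <- data.frame(
  date    = rep(c("2017-01-01", "2017-02-01"), each = 2),
  keyword = "skiing",
  geo     = rep(c("CA", "NZ"), times = 2),
  hits    = c(40, 25, 35, 20),
  stringsAsFactors = FALSE
)

wide <- dcast(toy, date ~ keyword + geo, value.var = "hits")
rownames(wide) <- wide$date
wide$date <- NULL
wide
#            skiing_CA skiing_NZ
# 2017-01-01        40        25
# 2017-02-01        35        20
```

The resulting table is ready for plotting, with one row per date and one column per search term/region pair.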
Plotting Google Trends data: Identifying seasonality and trends

Plotting the Google Trends data as an R chart we can draw two conclusions. First, interest peaked around the end of 2008. Second, there is a strong seasonal effect, with significant spikes around Christmas every year.
Note that results are relative to the total number of searches at each time point, with the maximum being 100. We cannot infer anything about the absolute volume of Google searches. But we can say that, as a proportion of all searches, Blu-ray was about half as frequent in June 2008 as in December 2008. An explanation of the Google Trends methodology is here.
Google Trends by geographic region

Next, I will illustrate the use of country codes. To do so I will find the search history for skiing in Canada and New Zealand. I use the same code as previously, except modifying the gtrends line as below.
google.trends = gtrends(c("skiing"), geo = c("CA", "NZ"), gprop = "web", time = "2010-06-30 2017-06-30")[[1]]

The new argument to gtrends is geo, which allows the user to specify geographic codes to narrow the search region. The awkward part about geographical codes is that they are not always obvious. Country codes consist of two letters, for example, CA and NZ in this case. We could also use region codes such as US-CA for California. I find the easiest way to get these codes is to use this Wikipedia page.
An alternative way to find all the region-level codes for a given country is to use the following snippet of R code. In this case, it retrieves all the regions of Italy (IT).
library(gtrendsR)
geo.codes = sort(unique(countries[substr(countries$sub_code, 1, 2) == "IT", ]$sub_code))

Plotting the ski data below, we note the contrast between northern and southern hemisphere winters. Skiing is also relatively more popular in Canada than in New Zealand. The 2014 Winter Olympics caused a notable spike in both countries, but particularly in Canada.
Create your own analysis
In this post I have shown how to import data from Google Trends using the R package gtrendsR. Anyone can click on this link to explore the examples used in this post or create your own analysis (just sign into Displayr first).
Hard-nosed Indian Data Scientist Gospel Series – Part 1: Incertitude around Tools and Technologies
(This article was first published on Coastal Econometrician Views, and kindly contributed to R-bloggers)
Before the recession, one commercial tool was popular in the country, hence there was not much uncertainty around tools and technology; however, after the recession, incertitude (i.e. uncertainty) around tools and technology has preoccupied, and continues to occupy, data science learning, delivery and deployment.
While Python was continuing as a general programming language, R was the best remaining choice (it became more popular with the advent of an IDE, i.e. RStudio), and the author still sees its popularity among data scientists from non-programming backgrounds (i.e. other than computer scientists). Yet, in local meetups, panel discussions and webinars, the author still notices aspirants to data science asking, as an everyday concern, which tool is better, as shown in the image below.
The author has undertaken several projects, courses and programs in data science over more than a decade; the views expressed here are drawn from his industry experience. He can be reached at mavuluri.pradeep@gmail or besteconometrician@gmail.com for more details.
Find out more about the author at http://in.linkedin.com/in/pradeepmavuluri
To leave a comment for the author, please follow the link and comment on their blog: Coastal Econometrician Views.
Digit fifth powers: Euler Problem 30
(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)
Euler problem 30 is another number crunching problem that deals with numbers to the power of five. Two other Euler problems dealt with raising numbers to a power. The previous problem looked at permutations of powers and problem 16 asks for the sum of the digits of 2^1000.
Numberphile has a nice video about a trick to quickly calculate the fifth root of a number that makes you look like a mathematical wizard.
Euler Problem 30 Definition

Surprisingly there are only three numbers that can be written as the sum of fourth powers of their digits:

1634 = 1^4 + 6^4 + 3^4 + 4^4
8208 = 8^4 + 2^4 + 0^4 + 8^4
9474 = 9^4 + 4^4 + 7^4 + 4^4
As 1 = 1^4 is not a sum, it is not included.
The sum of these numbers is 1634 + 8208 + 9474 = 19316. Find the sum of all the numbers that can be written as the sum of fifth powers of their digits.
Proposed Solution

The problem asks for a brute-force solution, but we have a halting problem. How far do we need to go before we can be certain there are no more sums of fifth-power digits? The highest digit is 9 and 9^5 = 59049, which has five digits. If we then look at 6 × 9^5 = 354294, which has six digits, we have a good endpoint for the loop: the digit sum of any seven-digit number is at most 7 × 9^5 = 413343, which has only six digits, so no seven-digit (or longer) number can equal the sum of the fifth powers of its digits. The loop itself cycles through the digits of each number and tests whether the sum of the fifth powers equals the number.
largest <- 6 * 9^5
answer <- 0
for (n in 2:largest) {
    power.sum <- 0
    i <- n
    while (i > 0) {
        d <- i %% 10
        i <- floor(i / 10)
        power.sum <- power.sum + d^5
    }
    if (power.sum == n) {
        print(n)
        answer <- answer + n
    }
}
print(answer)

View the most recent version of this code on GitHub.
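The same brute-force search can also be written in vectorised form. The sketch below is my rewrite, not the author's code: it sums the fifth powers of the digits of every candidate at once, peeling off one digit per loop iteration.

```r
# Vectorised sketch of the same brute-force search (a rewrite, not the
# original author's code). For every n up to 6 * 9^5, accumulate the
# fifth powers of its digits; keep the numbers equal to that sum.
largest <- 6 * 9^5
n <- 2:largest

power.sum <- numeric(length(n))
i <- n
while (any(i > 0)) {
    power.sum <- power.sum + (i %% 10)^5  # current last digit, ^5
    i <- i %/% 10                         # drop the last digit
}

hits <- n[power.sum == n]
hits       # the numbers that equal the sum of the fifth powers of their digits
sum(hits)  # the answer to the problem
```

Because the while loop runs only six times (once per digit position), this is much faster in R than the element-by-element loop above.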
The post Digit fifth powers: Euler Problem 30 appeared first on The Devil is in the Data.
To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.
Sentiment analysis using tidy data principles at DataCamp
(This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)
I’ve been developing a course at DataCamp over the past several months, and I am happy to announce that it is now launched!
The course is Sentiment Analysis in R: the Tidy Way and I am excited that it is now available for you to explore and learn from. This course focuses on digging into the emotional and opinion content of text using sentiment analysis, and it does this from the specific perspective of using tools built for handling tidy data. The course is organized into four case studies (one per chapter), and I don’t think it’s too much of a spoiler to say that I wear a costume for part of it. I’m just saying you should probably check out the course trailer.
Course description

Text datasets are diverse and ubiquitous, and sentiment analysis provides an approach to understand the attitudes and opinions expressed in these texts. In this course, you will develop your text mining skills using tidy data principles. You will apply these skills by performing sentiment analysis in four case studies, on text data from Twitter to TV news to Shakespeare. These case studies will allow you to practice important data handling skills, learn about the ways sentiment analysis can be applied, and extract relevant insights from real-world data.
Learning objectives
 Learn the principles of sentiment analysis from a tidy data perspective
 Practice manipulating and visualizing text data using dplyr and ggplot2
 Apply sentiment analysis skills to several real-world text datasets
Check the course out, have fun, and start practicing those text mining skills!
To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.
Recreating and updating Minard with ggplot2
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
Minard's chart depicting Napoleon's 1812 march on Russia is a classic of data visualization that has inspired many homages using different time-and-place data. If you'd like to recreate the original chart, or create one of your own, Andrew Heiss has created a tutorial on using the ggplot2 package to re-envision the chart in R:
The R script provided in the tutorial is driven by historical data on the location and size of Napoleon's armies during the 1812 campaign, but you could adapt the script to use new data as well. Andrew also shows how to combine the chart with a geographical or satellite map, which is how the cities appear in the version above (unlike in Minard's original).
The data behind the Minard chart is available from Michael Friendly, and you can find the R scripts in this GitHub repository. For the complete tutorial, follow the link below.
Andrew Heiss: Exploring Minard’s 1812 plot with ggplot2 (via Jenny Bryan)
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
Basics of data.table: Smooth data exploration
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
The data.table package provides perhaps the fastest way for data wrangling in R. The syntax is concise and is made to resemble SQL. After studying the basics of data.table and finishing this exercise set successfully you will be able to start easing into using data.table for all your data manipulation needs.
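Before starting the exercises, the basic shape of data.table's syntax is worth a glance. The toy table below is my own illustration (the exercises use the Fertility data from the AER package instead); the general form is dt[i, j, by], which reads much like SQL's WHERE, SELECT and GROUP BY.

```r
# A minimal sketch of data.table's dt[i, j, by] syntax, using a made-up
# toy table (the exercises below use the Fertility data instead).
library(data.table)

dt <- data.table(
  age  = c(22, 24, 31, 35, 28),
  work = c(0, 12, 52, 20, 0),
  race = c("white", "black", "white", "other", "white")
)

dt[age %between% c(22, 28)]               # i: filter rows (WHERE)
dt[, mean(work)]                          # j: compute on columns (SELECT)
dt[, .(avg.work = mean(work)), by = race] # by: grouped aggregation (GROUP BY)
```

Each of the ten exercises below is a variation on one or more of these three slots.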
We will use data drawn from the 1980 US Census on married women aged 21–35 with two or more children. The data includes gender of first and second child, as well as information on whether the woman had more than two children, race, age and number of weeks worked in 1979. For more information please refer to the reference manual for the package AER.
Answers are available here.
Exercise 1
Load the data.table package. Furthermore, (install and) load the AER package and run the command data("Fertility"), which loads the dataset Fertility to your workspace. Turn it into a data.table object.
Exercise 2
Select rows 35 to 50 and print their age and work entries to the console.
Exercise 3
Select the last row in the dataset and print it to the console.
Exercise 4
Count how many women proceeded to have a third child.
Exercise 5
There are four possible gender combinations for the first two children. Which is the most common? Use the by argument.
Exercise 6
By racial composition, what is the proportion of women working four weeks or less in 1979?
Exercise 7
Use %between% to get the subset of women aged between 22 and 24, and calculate the proportion who had a boy as their firstborn.
Exercise 8
Add a new column, age squared, to the dataset.
Exercise 9
Out of all the racial compositions in the dataset, which had the lowest proportion of boys as the firstborn? With the same command, display the number of observations in each category as well.
Exercise 10
Calculate the proportion of women who have a third child, by the gender combination of the first two children.
 Vector exercises
 Data frame exercises Vol. 2
 Instrumental Variables in R exercises (Part1)
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
Going Bayes #rstats
(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)
Some time ago I started working with Bayesian methods, using the great rstanarm package. Beside the fantastic package vignettes, and books like Statistical Rethinking or Doing Bayesian Data Analysis, I also found the resources from Tristan Mahr helpful for better understanding both Bayesian analysis and rstanarm. This motivated me to implement tools for Bayesian analysis into my packages as well.
Due to the latest tidyr update, I had to update some of my packages in order to make them work again, so – beside some other features – some Bayes stuff is now available in my packages on CRAN.
Finding shape or location parameters from distributions

The following functions are included in the sjstats package. Given some known quantiles or percentiles, or a certain value or ratio and its standard error, the functions find_beta(), find_normal() or find_cauchy() help finding the parameters of a distribution. Taking the example from here, the plot indicates that the mean value of the normal distribution is somewhat above 50. We can find the exact parameters with find_normal(), using the information given in the text:
library(sjstats)
find_normal(x1 = 30, p1 = .1, x2 = 90, p2 = .8)
#> $mean
#> [1] 53.78387
#>
#> $sd
#> [1] 30.48026

High Density Intervals for MCMC samples

The hdi() function computes the high density interval for posterior samples. This is nothing special, since there are other packages with such functions as well – however, you can use this function not only on vectors, but also on stanreg objects (i.e. the results of models fitted with rstanarm). And, if required, you can also transform the HDI values, e.g. if you need these intervals on an exponentiated scale.
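For intuition, a highest density interval can be computed directly from a vector of posterior draws: sort the draws, slide a window containing the requested probability mass over them, and keep the narrowest such window. The sketch below is my own illustration (the function name is made up; it is not the sjstats implementation):

```r
# Minimal sketch of a highest density interval for a vector of
# posterior draws (illustration only, not the sjstats implementation):
# among all intervals containing `prob` of the sorted draws, take the
# narrowest one.
hdi_simple <- function(x, prob = 0.9) {
  x <- sort(x)
  n.keep <- ceiling(prob * length(x))          # draws inside the interval
  n.windows <- length(x) - n.keep + 1          # candidate start positions
  widths <- x[n.keep:length(x)] - x[1:n.windows]
  lo <- which.min(widths)                      # narrowest window wins
  c(hdi.low = x[lo], hdi.high = x[lo + n.keep - 1])
}

set.seed(42)
hdi_simple(rnorm(10000), prob = 0.9)
```

For a symmetric unimodal posterior like this one, the HDI nearly coincides with the equal-tailed interval; for skewed posteriors the two can differ noticeably.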
library(rstanarm)

fit <- stan_glm(mpg ~ wt + am, data = mtcars, chains = 1)
hdi(fit)
#>          term   hdi.low  hdi.high
#> 1 (Intercept) 32.158505 42.341421
#> 2          wt -6.611984 -4.022419
#> 3          am -2.567573  2.343818
#> 4       sigma  2.564218  3.903652

# fit logistic regression model
fit <- stan_glm(
  vs ~ wt + am,
  data = mtcars,
  family = binomial("logit"),
  chains = 1
)

hdi(fit, prob = .89, trans = exp)
#>          term      hdi.low     hdi.high
#> 1 (Intercept) 4.464230e+02 3.725603e+07
#> 2          wt 6.667981e-03 1.752195e-01
#> 3          am 8.923942e-03 3.747664e-01

Marginal effects for rstanarm models

The ggeffects package creates tidy data frames of model predictions, which are ready to use with ggplot (though there’s a plot() method as well). ggeffects supports a wide range of models, and makes it easy to plot marginal effects for specific predictors, including interaction terms. In the past updates, support for more model types was added, for instance polr (pkg MASS), hurdle and zeroinfl (pkg pscl), betareg (pkg betareg), truncreg (pkg truncreg), coxph (pkg survival) and stanreg (pkg rstanarm).
ggpredict() is the main function that computes marginal effects. Predictions for stanreg models are based on the posterior distribution of the linear predictor (posterior_linpred()), mostly for convenience reasons. It is recommended to use the posterior predictive distribution (posterior_predict()) for inference and model checking, and you can do so using the ppd argument when calling ggpredict(); however, especially for binomial or Poisson models, it is harder (and much slower) to compute the „confidence intervals“. That’s why relying on posterior_linpred() is the default for stanreg models with ggpredict().
Here is an example with two plots, one without raw data and one including data points:
library(sjmisc)
library(rstanarm)
library(ggeffects)

data(efc)

# make categorical
efc$c161sex <- to_label(efc$c161sex)

# fit model
m <- stan_glm(neg_c_7 ~ c160age + c12hour + c161sex, data = efc)

dat <- ggpredict(m, terms = c("c12hour", "c161sex"))
dat
#> # A tibble: 128 x 5
#>        x predicted conf.low conf.high  group
#>  1     4  10.80864 10.32654  11.35832   Male
#>  2     4  11.26104 10.89721  11.59076 Female
#>  3     5  10.82645 10.34756  11.37489   Male
#>  4     5  11.27963 10.91368  11.59938 Female
#>  5     6  10.84480 10.36762  11.39147   Male
#>  6     6  11.29786 10.93785  11.61687 Female
#>  7     7  10.86374 10.38768  11.40973   Male
#>  8     7  11.31656 10.96097  11.63308 Female
#>  9     8  10.88204 10.38739  11.40548   Male
#> 10     8  11.33522 10.98032  11.64661 Female
#> # ... with 118 more rows

plot(dat)
plot(dat, rawdata = TRUE)

As you can see, if you work with labelled data, the model-fitting functions from the rstanarm package preserve all value and variable labels, making it easy to create annotated plots. The „confidence bands“ are actually high density intervals, computed with the above mentioned hdi() function.
Next…

Next I will integrate ggeffects into my sjPlot package, making sjPlot more generic and supporting more model types. Furthermore, sjPlot shall get a generic plot_model() function which will replace former single functions like sjp.lm(), sjp.glm(), sjp.lmer() or sjp.glmer(). plot_model() should then produce a plot – either marginal effects, forest plots, interaction terms and so on – and accept (m)any model class. This should help make sjPlot more convenient to work with, more stable and easier to maintain…
Tagged: Bayes, data visualization, ggplot, R, rstanarm, rstats, sjPlot, Stan
To leave a comment for the author, please follow the link and comment on their blog: R – Strenge Jacke!.
Rcpp now used by 10 percent of CRAN packages
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
Over the last few days, Rcpp passed another noteworthy hurdle. It is now used by over 10 percent of packages on CRAN (as measured by Depends, Imports and LinkingTo, but excluding Suggests). As of this morning 1130 packages use Rcpp out of a total of 11275 packages. The graph on the left shows the growth of both outright usage numbers (in darker blue, left axis) and relative usage (in lighter blue, right axis).
Older posts on this blog took note when Rcpp passed round hundreds of packages, most recently in April for 1000 packages. The growth rates for both Rcpp, and of course CRAN, are still staggering. A big thank you to everybody who makes this happen, from R Core and CRAN to all package developers, contributors, and of course all users driving this. We have built ourselves a rather impressive ecosystem.
So with that a heartfelt Thank You! to all users and contributors of R, CRAN, and of course Rcpp, for help, suggestions, bug reports, documentation, encouragement, and, of course, code.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.
Gender roles in film direction, analyzed with R
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
What do women do in films? If you analyze the stage directions in film scripts — as Julia Silge, Russell Goldenberg and Amber Thomas have done for this visual essay for ThePudding — it seems that women (but not men) are written to snuggle, giggle and squeal, while men (but not women) shoot, gallop and strap things to other things.
This is all based on an analysis of almost 2,000 film scripts mostly from 1990 and after. The words come from pairs of words beginning with "he" and "she" in the stage directions (but not the dialogue) in the screenplays — directions like "she snuggles up to him, strokes his back" and "he straps on a holster under his sealskin cloak". The essay also includes an analysis of words by the writer and character's gender, and includes lots of lovely interactive elements (including the ability to see examples of the stage directions).
The analysis, including the chart above, was created using the R language, and the R code is available on GitHub. The screenplay analysis makes use of the tidytext package, which simplifies the process of handling the text-based data (the screenplays), extracting the stage directions, and tabulating the word pairs.
You can find the complete essay linked below, and it's well worth checking out to experience the interactive elements.
ThePudding: She Giggles, He Gallops
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
Caching httr Requests? This means WAR[C]!
(This article was first published on R – rud.is, and kindly contributed to R-bloggers)
I’ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on GitHub to work with WARC files, it’s a tad fragile, and improving it would mean reinventing many wheels (i.e. there are long-standing, solid implementations of WARC libraries in many other languages that could be tapped vs writing a C++-backed implementation).
One of those implementations is JWAT, a library written in Java (as many WARC use-cases involve working in what would traditionally be called map-reduce environments). It has a small footprint and is structured well enough that I decided to take it for a spin as a set of R packages that wrap it with rJava. There are two packages, since it follows a recommended CRAN model of having one package for the core Java Archive (JAR) files — since they tend not to change as frequently as the functional R package would and they tend to take up a modest amount of disk space — and another for the actual package that does the work. They are jwatjars and jwatr.
I’ll exposit on the full package at some later date, but I wanted to post a snippet showing that you may have a use for WARC files that you hadn’t considered before: pairing WARC files with httr web scraping tasks to maintain a local cache of what you’ve scraped.
Web scraping consumes network & compute resources on the server end that you typically don’t own and — in many cases — do not pay for. While there are scraping tasks that need to access the latest possible data, many times tasks involve scraping data that won’t change.
The same principle works for caching the results of API calls: you may make those calls and use some data, but then realize you wanted to use more data and make the same API calls again. Caching the raw API results can also help with reproducibility, especially if the site you were using goes offline (like the U.S. Government sites that are being taken down by the anti-science folks in the current administration).
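The caching idea can be sketched independently of WARC: wrap the fetch in a helper that checks a local store before touching the network. The sketch below is my own illustration, not the jwatr or httr API; the function name `cached_fetch` and the RDS-file cache are made up for the example.

```r
# Hypothetical file-based cache around a fetch function (illustration
# only -- not part of jwatr or httr). `fetcher` is any function that
# takes a URL and returns a serialisable R object.
cached_fetch <- function(url, fetcher, cache.dir = "cache") {
  dir.create(cache.dir, showWarnings = FALSE, recursive = TRUE)
  # derive a file name for this URL (crude but sufficient for a sketch)
  key <- file.path(cache.dir, paste0(gsub("[^A-Za-z0-9]", "_", url), ".rds"))
  if (file.exists(key)) return(readRDS(key))  # cache hit: no network call
  res <- fetcher(url)
  saveRDS(res, key)                           # cache miss: fetch and store
  res
}

# Usage with a stub fetcher that counts how often it is really called:
calls <- 0
stub <- function(url) { calls <<- calls + 1; paste("body of", url) }

cache <- tempfile("warc-cache")
a <- cached_fetch("https://example.com/api?x=1", stub, cache.dir = cache)
b <- cached_fetch("https://example.com/api?x=1", stub, cache.dir = cache)
# second call is served from disk; `calls` is still 1
```

The WARC wrappers below do the storing part in a standard, replayable archive format instead of ad-hoc RDS files, which is what makes them useful beyond a single script.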
To that end I’ve put together the beginning of some “WARC wrappers” for httr verbs that make it seamless to cache scraping or API results as you gather and process them. Let’s work through an example using the U.K. open data portal on crime and policing API.
First, we’ll need some helpers:
library(rJava)
library(jwatjars) # devtools::install_github("hrbrmstr/jwatjars")
library(jwatr)    # devtools::install_github("hrbrmstr/jwatr")
library(httr)
library(jsonlite)
library(tidyverse)

Just doing library(jwatr) would have covered much of that, but I wanted to show some of the work R does behind the scenes for you.
Now, we’ll grab some neighbourhood and crime info:
wf <- warc_file("~/Data/wraptest")

res <- warc_GET(wf, "https://data.police.uk/api/leicestershire/neighbourhoods")

str(jsonlite::fromJSON(content(res, as="text")), 2)
## 'data.frame': 67 obs. of 2 variables:
## $ id : chr "NC04" "NC66" "NC67" "NC68" ...
## $ name: chr "City Centre" "Cultural Quarter" "Riverside" "Clarendon Park" ...

res <- warc_GET(wf, "https://data.police.uk/api/crimes-street/all-crime",
                query = list(lat=52.629729, lng=-1.131592, date="2017-01"))

res <- warc_GET(wf, "https://data.police.uk/api/crimes-at-location",
                query = list(location_id="884227", date="2017-02"))

close_warc_file(wf)

As you can see, the standard httr response object is returned for processing, and the HTTP response itself is being stored away for us as we process it.
file.info("~/Data/wraptest.warc.gz")$size
## [1] 76020

We can use these results later and, pretty easily, since the WARC file will be read in as a tidy R tibble (fancy data frame):
xdf <- read_warc("~/Data/wraptest.warc.gz", include_payload = TRUE)

glimpse(xdf)
## Observations: 3
## Variables: 14
## $ target_uri                 "https://data.police.uk/api/leicestershire/neighbourhoods", "https://data.police.uk/api/crimes-street...
## $ ip_address                 "54.76.101.128", "54.76.101.128", "54.76.101.128"
## $ warc_content_type          "application/http; msgtype=response", "application/http; msgtype=response", "application/http; msgtyp...
## $ warc_type                  "response", "response", "response"
## $ content_length             2984, 511564, 688
## $ payload_type               "application/json", "application/json", "application/json"
## $ profile                    NA, NA, NA
## $ date                       2017-08-22, 2017-08-22, 2017-08-22
## $ http_status_code           200, 200, 200
## $ http_protocol_content_type "application/json", "application/json", "application/json"
## $ http_version               "HTTP/1.1", "HTTP/1.1", "HTTP/1.1"
## $ http_raw_headers           [<48, 54, 54, 50, 2f, 31, 2e, 31, 20, 32, 30, 30, 20, 4f, 4b, 0d, 0a, 61, 63, 63, 65, 73, 73, 2d, 63...
## $ warc_record_id             "", "", ...
## $ payload                    [<5b, 7b, 22, 69, 64, 22, 3a, 22, 4e, 43, 30, 34, 22, 2c, 22, 6e, 61, 6d, 65, 22, 3a, 22, 43, 69, 74...

xdf$target_uri
## [1] "https://data.police.uk/api/leicestershire/neighbourhoods"
## [2] "https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date=2017-01"
## [3] "https://data.police.uk/api/crimes-at-location?location_id=884227&date=2017-02"

The URLs are all there, so it will be easier to map the original calls to them.
Now, the payload field is the HTTP response body and there are a few ways we can decode and use it. First, since we know it’s JSON content (that’s what the API returns), we can just decode it:
for (i in 1:nrow(xdf)) {
  res <- jsonlite::fromJSON(readBin(xdf$payload[[i]], "character"))
  print(str(res, 2))
}
## 'data.frame': 67 obs. of 2 variables:
##  $ id  : chr "NC04" "NC66" "NC67" "NC68" ...
##  $ name: chr "City Centre" "Cultural Quarter" "Riverside" "Clarendon Park" ...
## NULL
## 'data.frame': 1318 obs. of 9 variables:
##  $ category        : chr "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
##  $ location_type   : chr "Force" "Force" "Force" "Force" ...
##  $ location        :'data.frame': 1318 obs. of 3 variables:
##   ..$ latitude : chr "52.616961" "52.629963" "52.641646" "52.635184" ...
##   ..$ street   :'data.frame': 1318 obs. of 2 variables:
##   ..$ longitude: chr "-1.120719" "-1.122291" "-1.131486" "-1.135455" ...
##  $ context         : chr "" "" "" "" ...
##  $ outcome_status  :'data.frame': 1318 obs. of 2 variables:
##   ..$ category: chr NA NA NA NA ...
##   ..$ date    : chr NA NA NA NA ...
##  $ persistent_id   : chr "" "" "" "" ...
##  $ id              : int 54163555 54167687 54167689 54168393 54168392 54168391 54168386 54168381 54168158 54168159 ...
##  $ location_subtype: chr "" "" "" "" ...
##  $ month           : chr "2017-01" "2017-01" "2017-01" "2017-01" ...
## NULL
## 'data.frame': 1 obs. of 9 variables:
##  $ category        : chr "violent-crime"
##  $ location_type   : chr "Force"
##  $ location        :'data.frame': 1 obs. of 3 variables:
##   ..$ latitude : chr "52.643950"
##   ..$ street   :'data.frame': 1 obs. of 2 variables:
##   ..$ longitude: chr "-1.143042"
##  $ context         : chr ""
##  $ outcome_status  :'data.frame': 1 obs. of 2 variables:
##   ..$ category: chr "Unable to prosecute suspect"
##   ..$ date    : chr "2017-02"
##  $ persistent_id   : chr "4d83433f3117b3a4d2c80510c69ea188a145bd7e94f3e98924109e70333ff735"
##  $ id              : int 54726925
##  $ location_subtype: chr ""
##  $ month           : chr "2017-02"
## NULL

We can also use a jwatr helper function — payload_content() — which mimics the httr::content() function:
for (i in 1:nrow(xdf)) {
  payload_content(
    xdf$target_uri[i],
    xdf$http_protocol_content_type[i],
    xdf$http_raw_headers[[i]],
    xdf$payload[[i]],
    as = "text"
  ) %>%
    jsonlite::fromJSON() -> res
  print(str(res, 2))
}

The same output is printed, so I'm saving some blog content space by not including it.
Future Work

I kept this example small, but ideally one would write a warcinfo record as the first WARC record to identify the file, and I need to add options and functionality to store a WARC request record as well as a response record. But I wanted to toss this out there to get feedback on the idiom and on what functionality should be added.
So, please kick the tyres and file as many issues as you have time or interest for. I'm still designing the full package API and making refinements to existing functions, so there's plenty of opportunity to tailor this to the more data-science-y and reproducibility use cases R folks have.
To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Some Neat New R Notations
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
The R package seplyr supplies a few neat new coding notations.
An Abacus, which gives us the term “calculus.”
The first notation is an operator called the “named map builder”. This is a cute notation that essentially does the job of stats::setNames(). It allows for code such as the following:
library("seplyr")

names <- c('a', 'b')
names := c('x', 'y')
#>   a   b
#> "x" "y"

This can be very useful when programming in R, as it allows indirection or abstraction on the left-hand side of inline name assignments (unlike c(a = 'x', b = 'y'), where all left-hand sides are concrete values even if not quoted).
A nifty property of the named map builder is that it commutes (in the sense of algebra or category theory) with R's "c()" combine/concatenate function. That is: c('a' := 'x', 'b' := 'y') is the same as c('a', 'b') := c('x', 'y'). Roughly, this means the two operations play well with each other.
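Since := here is essentially stats::setNames(), the commuting property can be checked with base R alone (a translation of the seplyr identity above, not seplyr code):

```r
# c('a' := 'x', 'b' := 'y') translated to base R:
piecewise <- c(setNames('x', 'a'), setNames('y', 'b'))

# c('a', 'b') := c('x', 'y') translated to base R:
at_once <- setNames(c('x', 'y'), c('a', 'b'))

identical(piecewise, at_once)
#> [1] TRUE
```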
The second notation is an operator called “anonymous function builder“. For technical reasons we use the same “:=” notation for this (and, as is common in R, pick the correct behavior based on runtime types).
The function construction is written as: “variables := { code }” (the braces are required) and the semantics are roughly the same as “function(variables) { code }“. This is derived from some of the work of Konrad Rudolph who noted that most functional languages have a more concise “lambda syntax” than “function(){}” (please see here and here for some details, and be aware the seplyr notation is not as concise as is possible).
This notation allows us to write the squares of 1 through 4 as:
sapply(1:4, x := { x^2 })

instead of writing:
sapply(1:4, function(x) x^2)

It is only a few characters of savings, but being able to choose notation can be a big deal. A real victory would be being able to directly use lambda-calculus notation such as "(λx.x^2)". In the development version of seplyr we are experimenting with the following additional notations:
sapply(1:4, lambda(x)(x^2))
sapply(1:4, λ(x, x^2))

(Both of these currently work in the development version, though we are not sure about submitting source files with non-ASCII characters to CRAN.)
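For readers curious how such a lambda() constructor might work, here is a minimal base-R sketch; it is an illustration only, not seplyr's actual implementation:

```r
# Hypothetical sketch of a lambda(variable, body) constructor in base R.
lambda <- function(sym, expr) {
  sym  <- as.character(substitute(sym))  # capture the argument name
  expr <- substitute(expr)               # capture the body, unevaluated
  f <- function() NULL
  formals(f) <- setNames(alist(x = ), sym)  # one formal argument named `sym`
  body(f) <- expr
  environment(f) <- parent.frame()
  f
}

sapply(1:4, lambda(x, x^2))
#> [1]  1  4  9 16
```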
To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.
Onboarding visdat, a tool for preliminary visualisation of whole dataframes
(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)
Take a look at the data
This is a phrase that comes up when you first get a dataset.
It is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?
Starting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix-ups of data types – height in cm coded as a factor, categories that are numerics with decimals, strings that are date-times, and somehow a date-time that is one long number. And let's not forget everyone's favourite: missing data.
These growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.
The visdat package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to "get a look at the data".
Making visdat was fun, and it was easy to use. But I couldn't help but think that maybe visdat could be more.
 I felt like the code was a little sloppy, and that it could be better.
 I wanted to know whether others found it useful.
What I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.
Too much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci onboarding process provides.
rOpenSci onboarding basics

Onboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with JOSS.
What's in it for the author?
 Feedback on your package
 Support from rOpenSci members
 Maintain ownership of your package
 Publicity from it being under rOpenSci
 Contribute something to rOpenSci
 Potentially a publication
What can rOpenSci do that CRAN cannot?
The rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN 1. Here's what rOpenSci does that CRAN cannot:
 Assess documentation readability / usability
 Provide a code review to find weak points / points of improvement
 Determine whether a package is overlapping with another.
So I submitted visdat to the onboarding process. For me, I did this for three reasons.
 So visdat could become a better package
 Pending acceptance, I would get a publication in JOSS
 I get to contribute back to rOpenSci
Submitting the package was actually quite easy – you go to submit an issue on the onboarding page on GitHub, and it provides a magical template for you to fill out 2, with no submission gotchas – this could be the future 3. Within 2 days of submitting the issue, I had a response from the editor, Noam Ross, and two reviewers assigned, Mara Averick, and Sean Hughes.
I submitted visdat and waited, somewhat apprehensively. What would the reviewers think?
In fact, Mara Averick wrote a post: "So you (don't) think you can review a package" about her experience evaluating visdat as a firsttime reviewer.
Getting feedback

Unexpected extras from the review

Even before the review started officially, I got some great concrete feedback from Noam Ross, the editor for the visdat submission.
 Noam used the goodpractice package to identify bad code patterns and other places to immediately improve upon in a concrete way. This resulted in me:
 Fixing error-prone code such as using 1:length(...), or 1:nrow(...)
 Improving testing using the visualisation testing software vdiffr
 Reducing long code lines to improve readability
 Defining global variables to avoid a NOTE ("no visible binding for global variable")
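The 1:length(...) pattern that goodpractice flags deserves a quick illustration: it silently misbehaves on empty vectors, which is why seq_along() is preferred:

```r
x <- character(0)  # an empty vector, e.g. no files matched a pattern

1:length(x)   # counts DOWN from 1 to 0, so a loop over it runs twice!
#> [1] 1 0

seq_along(x)  # the safe alternative: a zero-length sequence
#> integer(0)
```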
So before the review even started, visdat was in better shape, with 99% test coverage and clearance from goodpractice.
The feedback from reviewers

I received prompt replies from the reviewers, and I got to hear really nice things like "I think visdat is a very worthwhile project and have already started using it in my own work." and "Having now put it to use in a few of my own projects, I can confidently say that it is an incredibly useful early step in the data analysis workflow. vis_miss(), in particular, is helpful for scoping the task at hand …". In addition to these nice things, there was also great critical feedback from Sean and Mara.
A common thread in both reviews was that I initially had visdat set up with the first row of the dataset at the bottom left and the variable names at the bottom. However, this doesn't reflect what a dataframe typically looks like – with the names of the variables at the top, and the first row also at the top. There were also suggestions to add the percentage of missing data in each column.
On the left are the old vis_dat and vis_miss plots, and on the right are the new vis_dat and vis_miss plots.
Changing this makes the plots make a lot more sense, and read better.
Mara made me aware of the warning and error messages that I had let crop up in the package. This was something I had grown to accept – the plot worked, right? But Mara pointed out that, from a user's perspective, seeing these warnings and messages can be a negative experience, and something that might stop them from using the package – how do they know if their plot is accurate with all these warnings? Are they using it wrong?
Sean gave practical advice on reducing code duplication, explaining how to write a general construction method to prepare the data for the plots. Sean also explained how to write C++ code to improve the speed of vis_guess().
From both reviewers I got nitty-gritty feedback about my writing – places where the documentation was just a bunch of notes I had made, or where I had reversed the order of a statement.
What did I think?

I think that getting feedback on your own work can be a bit hard to take sometimes. We get attached to our ideas; we've seen them grow from little thought bubbles all the way to "all growed up" R packages. I was apprehensive about getting feedback on visdat. But the feedback process from rOpenSci was, as Tina Turner put it, "simply the best".
Boiling the onboarding review process down to a few key points, I would say it is transparent, friendly, and thorough.
Having the entire review process on GitHub means that everyone is accountable for what they say, and that you can track exactly what everyone said in one place. No email-chain hell with (mis)attached documents, accidental reply-alls or single replies. The whole internet is cc'd in on this discussion.
Being an rOpenSci initiative, the process is incredibly friendly and respectful of everyone involved. Comments are upbeat, but are also, importantly, thorough, providing constructive feedback.
So what does visdat look like?

library(visdat)
vis_dat(airquality)

This shows us a visual analogue of our data: the variable names are shown at the top, and the class of each variable is shown, along with where the missing data are.
You can focus in on missing data with vis_miss()
vis_miss(airquality)

This shows only the missing and present information in the data. In addition to what vis_dat() shows, it gives the percentage of missing data for each variable and the overall amount of missing data. vis_miss() will also indicate when a dataset has no missing data at all, or only a very small percentage.
The future of visdat

There are some really exciting changes coming up for visdat. The first is a plotly version of all of the figures that provides useful tooltips and interactivity. The second, to bring in later down the track, is the idea of visualising expectations, where the user can search their data for particular things, such as characters like "~", values like 99 or 0, or conditions like "x > 101", and visualise them. Another idea is to make it easy to visually compare two dataframes of differing size. We also want to work on providing consistent palettes for particular data types; for example, character, numeric, integer, and date-time would all have different (and consistently different) colours.
I am very interested to hear how people use visdat in their work, so if you have suggestions or feedback I would love to hear from you! The best way to leave feedback is by filing an issue, or perhaps sending me an email at nicholas [dot] tierney [at] gmail [dot] com.
The future of your R package?

If you have an R package, you should give some serious thought to submitting it to rOpenSci through their onboarding process. There are very clear guidelines on their onboarding GitHub page. If you aren't sure about package fit, you can submit a pre-submission enquiry – the editors are nice and friendly, and a positive experience awaits you!

1. CRAN is an essential part of what makes the R project successful, and certainly without CRAN R simply would not be the language that it is today. The tasks provided by the rOpenSci onboarding require human hours, and there just isn't enough spare time and energy amongst CRAN managers.

2. Never used GitHub? Don't worry, creating an account is easy, and the template is all there for you. You provide very straightforward information, and it's all there at once.

3. With some journals, the submission process means you aren't always clear about what information you need ahead of time. Gotchas include things like "what is the residential address of every co-author", or getting everyone to sign a copyright notice.
To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.
How to Create an Online Choice Simulator
(This article was first published on R – Displayr, and kindly contributed to R-bloggers)
What is a choice simulator?
A choice simulator is an online app or an Excel workbook that allows users to specify different scenarios and get predictions. Here is an example of a choice simulator.
Choice simulators have many names: decision support systems, market simulators, preference simulators, desktop simulators, conjoint simulators, and choice model simulators.
How to create a choice simulator

In this post, I show how to create an online choice simulator, with the calculations done using R and the simulator hosted in Displayr.
Step 1: Import the model results

First of all, choice simulators are based on models, so the first step in building a choice simulator is to obtain the model results that are to be used in the simulator. For example, here I use respondent-level parameters from a latent class model, but there are many other types of data that could have been used (e.g., parameters from a GLM, draws from the posterior distribution, beta draws from a maximum simulated likelihood model).
If practical, it is usually a good idea to have model results at the case level (e.g., respondent level), as the resulting simulator can then easily be automatically weighted and/or filtered. If you have case-level data, the model results should be imported into Displayr as a Data Set. See Introduction to Displayr 2: Getting your data into Displayr for an overview of ways of getting data into Displayr.
The table below shows estimated parameters of respondents from a discrete choice experiment on the market for eggs. You can work your way through the choice simulator example used in this post here (the link will first take you to a login page in Displayr and then to a document that contains the data in the variable set called Individual-Level Parameter Means for Segments 26-Jun-17 9:01:57 AM).
Step 2: Simplify calculations using variable sets
Variable sets are a novel and very useful aspect of Displayr. Variable sets are related variables that are grouped. We can simplify the calculations of a choice simulator by using the variable sets, with one variable set for each attribute.
In this step, we group the variables for each attribute into separate variable sets, so that they appear as shown on the right. This is done as follows:
 If the variables are already grouped into a variable set, select the variable set, and select Data Manipulation > Split (Variables). In the dataset that I am using, all the variables I need for my calculation are already grouped into a single variable set called Individual-Level Parameter Means for Segments 26-Jun-17 9:01:57 AM, so I click on this and split it.
 Next, select the first attribute's variables. In my example, this is the four variables that start with Weight:, each of which represents the respondent-level parameters for different egg weights. (The first of these contains only 0s, as dummy coding was used.)
 Then, go to Data Manipulation > Combine (Variables).
 Next set the Label for the new variable set to something appropriate. For reasons that will become clearer below, it is preferable to set it to a single, short word. For example, Weight.
 Set the Label field for each of the variables to whatever label you plan to show in the choice simulator. For example, if you want people to be able to choose an egg weight of 55g (about 2 ounces), set the Label to 55g.
 Finally, repeat this process for all the attributes. If you have any numeric attributes, then leave these as a single variable, like Price in the example here.
In my choice simulator, I have separate columns of controls (i.e., combo boxes) for each of the brands. The fast way to do this is to first create them for the first alternative (column), and then copy and paste them:
 Insert > Control (More).
 Type the levels, separated by semicolons, into the Item list. These must match, exactly, the labels that you have entered as the Labels for the first attribute in point 5 of the previous step. For example: 55g; 60g; 65g; 70g. I recommend using copy and paste because any typos will be difficult to track down. Where you have a numeric attribute, such as Price in the example, you enter the range of values that you wish the user to be able to choose from (e.g., 1.50; 2.00; 2.50; 3.00; 3.50; 4.00; 4.50; 5.00).
 Select the Properties tab in the Object Inspector and set the Name of the control to whatever you set as the Label for the corresponding variable set, with the number 1 affixed at the end. For example, Weight.1. (You can use any name, but following this convention will save you time later on.)
 Click on the control and select the first level. For example, 55g.
 Repeat these steps until you have created controls for each of the attributes, positioned one under another, as shown above.
 Select all the controls that you have created, and then select Home > Copy and Home > Paste, and move the new set of labels to the right of the previous labels. Repeat this for as many sets of alternatives as you wish to include. In my example, there are four alternatives.
 Finally, add labels for the brands and attributes: Insert > TextBox (Text and Images).
See also Adding a Combo Box to a Displayr Dashboard for an intro to creating combo boxes.
Step 4: Calculate preference shares

Insert an R Output (Insert > R Output (Analysis)), setting it to Automatic with the appropriate code, and position it underneath the first column of combo boxes. Press the Calculate button, and it should calculate the share for the first alternative. If you paste the code below, and everything is set up properly, you will get a value of 25%.
 Now, click on the R Output you just created, and copy and paste it. Position the new version immediately below the second column of combo boxes.
 Modify the very last line of code, replacing [1] with [2], which tells it to show the results of the second alternative.
 Repeat steps 2 and 3 for alternatives 3 and 4.
The code below can easily be modified for other models. A few key aspects of the code:
 It works with four alternatives and is readily modified to deal with different numbers of alternatives.
 The formulas for the utility of each alternative are expressed as simple mathematical expressions. Because I was careful with the naming of the variable sets and the controls, they are easy to read. If you are using Displayr, you can hover over the various elements of the formula and you will get a preview of their data.
 The code is already set up to deal with weights. Just click on the R Output that contains the formula and apply a weight (Home > Weight).
 It is set up to automatically deal with any filters. More about this below.
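To give a flavour of what such R Output code computes, here is a hedged sketch (not the post's exact code; the utility matrix is a placeholder rather than the Displayr variable sets). Preference shares follow the logit rule, share_j = exp(U_j) / sum_k exp(U_k), averaged over respondents:

```r
# Sketch of a logit preference-share calculation. Each row of `utility` is
# one respondent's total utility for each of the four alternatives; here the
# alternatives are identical, so each should end up with a 25% share.
utility <- matrix(0, nrow = 100, ncol = 4)

shares <- exp(utility) / rowSums(exp(utility))  # logit rule, per respondent
round(100 * colMeans(shares))                   # average over respondents
#> [1] 25 25 25 25
```

Applying a weight or filter amounts to weighting or subsetting the rows before averaging, which is why case-level results make weighting and filtering so convenient.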
If you wish, you can make your choice simulator prettier. The R Outputs and the controls all have formatting options. In my example, I got our designer, Nat, to create the pretty background screen, which she did in Photoshop, and then added it using Insert > Image.
Step 6: Add filters

If you have stored the data as variable sets, you can quickly create filters. Note that the calculations will automatically update when the viewer selects the filters.
Step 7: Share

To share the dashboard, go to the Export tab in the ribbon (at the top of the screen) and click on the black triangle under the Web Page button. Next, check the option Hide page navigation on exported page, then click Export… and follow the prompts.
Note: the URL for the choice simulator I am using in this example is https://app.displayr.com/Dashboard?id=21043f6445d047af9797cd4180805849. This URL is public, but for security reasons it cannot be guessed or found by web search. If, however, you give the URL to someone, then they can access the document. Alternatively, if you have an annual Displayr account, you can instead go into Settings for the document (the cog at the top right of the screen) and press Disable Public URL. This will limit access to people who are set up as users for your organization. You can set up people as users in the company's Settings, accessible by clicking on the cog at the top right of the screen. If you don't see these settings, contact support@displayr.com to buy a license.
Worked example of a choice simulator

You can see the choice simulator in View Mode here (as an end-user will see it), or you can create your own choice simulator here (first log into Displayr and then edit or modify a copy of the document used to create this post).
To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.
Introducing routr – Routing of HTTP and WebSocket in R
(This article was first published on Data Imaginist – R posts, and kindly contributed to R-bloggers)
routr is now available on CRAN, and I couldn't be happier. Its release marks
the completion of an idea that stretches back longer than my attempts to bring
network visualization and ggplot2 together (see this post for ref).
While my PhD was still concerned with proteomics, I began developing GUIs based
on shiny for managing different parts of the proteomics workflow. I soon came
to realize that I was spending an inordinate amount of time battling shiny
itself because I wanted more than it was meant for. Thus began my idea of
creating an expressive and powerful web server framework for R in the vein of
express.js and the like that could be made to do anything. The idea lingered in
my head for a long time and went through several iterations until I finally
released fiery in the late summer of 2016. fiery was never meant to stand
alone though and I boldly proclaimed that routr would come next. That didn’t
seem to happen. I spent most of the following year developing tools for
visualization and network analysis while having a guilty conscience about the
project I'd put on hold. Fortunately I've been able to put in some time for
taking up development for the fiery ecosystem once again, so without further
ado…
While I spent some time in the introduction talking about the whole development
path of fiery, I would like to start here by saying that routr is a
server-agnostic tool. Sure, I've built it for use with fiery, but I've been very
deliberate in making it completely independent of it, except for the code that
is involved in the fiery plugin functionality. So, you’re completely free to
use routr with whatever server framework you wish (e.g. hook it directly to
an httpuv instance). But how does it work? Read on…
routr is basically built up of two different concepts: routes and
route stacks. Routes are a collection of handlers attached to specific HTTP
request methods (e.g. GET, POST, PUT) and paths. When a request lands at a route
one of the handlers is chosen and called, based on the nature of the request. A
route stack is a collection of routes. When a request lands at a route stack it
will pass it through all the routes it contains sequentially, potentially
stopping if one of the handlers signals it. In the following these two concepts
will be discussed in detail.
In its essence a router is a decision mechanism for directing HTTP requests
to the correct handler function based on the request URL. It makes sure that
e.g. requests for http://example.com/info end up in a different handler than
http://example.com/users/thomasp85. This functionality is encapsulated in the
Route class. The basic use is illustrated below:
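A sketch of that basic use, based on routr's documented API (the handler bodies here are illustrative placeholders, not prescribed by the package):

```r
library(routr)

route <- Route$new()

# Handlers receive a reqres Request/Response pair plus the `keys` argument
# discussed below, and return TRUE to let later routes run as well.
route$add_handler('get', '/info', function(request, response, keys, ...) {
  response$status <- 200L
  response$body <- 'This is an info page'
  TRUE
})
route$add_handler('get', '/users/thomasp85', function(request, response, keys, ...) {
  response$status <- 200L
  response$body <- 'The user thomasp85'
  TRUE
})

route
```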
Let’s walk through what happened here. First we created a new Route object and
then we added two handlers to it, using the eponymous add_handler() method.
Both of the handlers respond to the GET method but differ in the path they
are listening for. routr uses reqres under the hood, so each handler method
is passed a Request and Response pair (we’ll get back to the keys
argument). Lastly, each handler must return either TRUE indicating that the
next route should be called, or FALSE indicating no further routes should be
called. As the request and response objects are R6 objects any changes to them
will be kept outside of the handler and there is thus no need to return them.
Now, consider the situation where I have built my super fancy web service into a
thriving business with millions of users – would I need to add a handler for
every user? No. This would be a case for a parameterized path.
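A parameterized handler could then look like this (a sketch; the Route object is recreated here so the snippet stands alone, and the handler body is a placeholder):

```r
library(routr)
route <- Route$new()  # stands in for the route built up above

# ':user_id' matches any single path element and is delivered to the
# handler as keys$user_id.
route$add_handler('get', '/users/:user_id', function(request, response, keys, ...) {
  response$status <- 200L
  response$body <- paste0('This is the user ', keys$user_id)
  TRUE
})
```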
As can be seen, prefixing a path element with : will make it into a variable,
matching anything that is put in there and adding it as an element to the keys
argument. Paths can contain as many variable elements as wanted in order to
reuse handlers as efficiently as possible.
There’s a last piece of path functionality left to discuss: The wildcard. While
parameterized path elements only match a single element (e.g.
/users/:user_id will match /users/johndoe but not /users/johndoe/settings)
the wildcard matches anything. Let’s try one of these:
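The two wildcard handlers described next could be sketched like this (again with placeholder bodies, on a freshly created Route so the snippet stands alone):

```r
library(routr)
route <- Route$new()  # stands in for the route built up above

# Deny access to anything under the /settings location
route$add_handler('get', '/settings/*', function(request, response, keys, ...) {
  response$status <- 403L
  FALSE  # stop all further processing
})

# A catch-all implementing a custom 404 Not Found page
route$add_handler('all', '/*', function(request, response, keys, ...) {
  response$status <- 404L
  response$body <- 'Sorry, page not found'
  FALSE
})
```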
Here we add two new handlers, one preventing access to anything under the
/settings location, and one implementing a custom 404 Not Found page. Both
return FALSE as they are meant to prevent any further processing.
Now there’s a slight pickle with the current situation. If I ask for
/users/thomasp85 it can match three different handlers: /users/thomasp85,
/users/:user_id, and /*. Which to choose? routr decides on the handler
based on path specificity, where handlers are prioritized by the number of
elements in the path (the more the better), the number of parameterized elements
(the fewer the better), and the presence of wildcards (better with none). In the
above case this means that the /users/thomasp85 handler will be chosen. The
handler priority can always be seen when printing the Route object.
The request method is less complicated than the path. It simply matches the
method used in the request, ignoring case. There’s one special method:
all. This one will match any method, but only if a handler does not exist for
that specific method.
Conceptually, route stacks are much simpler than routes, in that they are just
a sequential collection of routes, with the means to pass requests through them.
Let’s create some additional routes and collect them in a RouteStack:
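The RouteStack chunk did not survive extraction either. A sketch under the assumption (implied by the following paragraph) that the parser and formatter routes use reqres’ default_parsers and default_formatters:

```r
# A route that parses the request body; parse() returns FALSE on failure
parser <- Route$new()
parser$add_handler('all', '/*', function(request, response, keys, ...) {
  request$parse(reqres::default_parsers)
})

# A route that formats the response body based on content negotiation
formatter <- Route$new()
formatter$add_handler('all', '/*', function(request, response, keys, ...) {
  response$format(reqres::default_formatters)
})

# Stack the routes; requests pass through them sequentially
router <- RouteStack$new()
router$add_route(parser, 'parse')
router$add_route(route, 'main')      # the route built in the prior section
router$add_route(formatter, 'format')
```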
Now, when our router receives a request it will first pass it to the parser
route and attempt to parse the body. If that is unsuccessful it will abort (the
parse() method returns FALSE if it fails); if not, it will pass the request
on to the route we built up in the prior section. If the chosen handler returns
TRUE, the request will then end up in the formatter route and the response body
will be formatted based on content negotiation with the request. As can be seen,
route stacks are an effective way to extract common functionality into
well-defined handlers.
If you’re using fiery, RouteStack objects are also what will be used as
plugins. Whether to use the router for request, header, or message
(WebSocket) events is decided by the attach_to field.
Lastly, routr comes with a few predefined routes, which I will briefly
mention. The ressource_route maps files on the server to handlers. If you wish
to serve static content in some way, this facilitates it and takes care of a
lot of HTTP header logic such as caching. It will also automatically serve
compressed files if they exist and the client accepts them:
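The static-file chunk is missing. Given that the next paragraph says /DESCRIPTION serves the package description file, the original presumably mapped the server root to routr’s installation directory, something like:

```r
# Map the server root ('/') to the directory routr is installed in,
# so e.g. /DESCRIPTION serves the package DESCRIPTION file
file_route <- ressource_route('/' = system.file(package = 'routr'))
```

This route would then be added to a RouteStack like any other route, typically before the application routes.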
Now, you can get the package description file by visiting /DESCRIPTION. If a
file is found the handler returns FALSE, so the file is simply served; if
nothing is found it returns TRUE so that other routes can decide what to
do.
If you wish to limit the size of requests, you can use the sizelimit_route and
e.g. attach it to the header event in a fiery app, so that requests that are
too big will get rejected before the body is fetched.
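A sketch of attaching a size limit to a fiery app’s header event; the 5 MB limit and the plugin name are invented for illustration:

```r
library(routr)
library(fiery)

# Reject request bodies larger than 5 MB
limit_route <- sizelimit_route(limit = 5 * 1024^2)

router <- RouteStack$new(sizelimit = limit_route)
router$attach_to <- 'header'  # run before the request body is fetched

app <- Fire$new()
app$attach(router)
```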
As I started by saying, the release of routr marks a point of maturity for my
fiery ecosystem. I’m extremely happy with this, but it is in no way the end of
development. I will pivot to working on more specialized plugins now concerned
with areas such as security and scalability, but the main approach to building
fiery server side logic is now up and running – I hope you’ll take it for a
spin.
To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist - R posts.
Understanding gender roles in movies with text mining
(This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)
I have a new visual essay up at The Pudding today, using text mining to explore how women are portrayed in film.
The R code behind this analysis is publicly available on GitHub.
I was so glad to work with the talented Russell Goldenberg and Amber Thomas on this project, and many thanks to Matt Daniels for inviting me to contribute to The Pudding. I’ve been a big fan of their work for a long time!
To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.
Tidier BLS data with the blscrapeR package
(This article was first published on Data Science Riot!, and kindly contributed to R-bloggers)
The recent release of the blscrapeR package brings the “tidyverse” into the fold. Inspired by my recent collaboration with Kyle Walker on his excellent tidycensus package, blscrapeR has been optimized for use within the tidyverse as of the current version 3.0.0.
New things you’ll notice right away include:
 All data now returned as tibbles.
 dplyr and purrr are now imported packages, along with magrittr and ggplot2, which were imported from the start.
 No need to call any packages other than tidyverse and blscrapeR.
 Switched from base R to dplyr in instances where performance could be increased.
 Standard apply functions replaced with purrr map() functions where performance could be increased.
The American Time Use Survey is one of the BLS’ more interesting data sets. Below is an API query that compares the time Americans spend watching TV each day with the time they spend socializing and communicating.
It should be noted that some familiarity with BLS series ID numbers is required here. The BLS Data Finder is a nice tool for finding series ID numbers.
library(blscrapeR)
library(tidyverse)
tbl <- bls_api(c("TUU10101AA01014236", "TUU10101AA01013951")) %>%
    spread(seriesID, value) %>%
    dateCast() %>%
    rename(watching_tv = TUU10101AA01014236,
           socializing_communicating = TUU10101AA01013951)
tbl
## # A tibble: 3 x 7
##    year period periodName footnotes socializing_communicating watching_tv       date
## 1  2014                                                  0.71        2.82 2014-01-01
## 2  2015                                                  0.68        2.78 2015-01-01
## 3  2016                                                  0.65        2.73 2016-01-01
Unemployment Rates
The main attraction of the BLS is the monthly employment and unemployment data. Below is an API query and plot of three of the major BLS unemployment rates.

U-3: The “official unemployment rate.” Total unemployed, as a percent of the civilian labor force.
U-5: Total unemployed, plus discouraged workers, plus all other marginally attached workers, as a percent of the civilian labor force plus all marginally attached workers.
U-6: Total unemployed, plus all marginally attached workers, plus total employed part time for economic reasons, as a percent of the civilian labor force plus all marginally attached workers.
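The query-and-plot chunk for these rates was lost in extraction. A sketch using the standard BLS series IDs for these measures (LNS14000000 for U-3, LNS13327708 for U-5, LNS13327709 for U-6; verify them against the BLS Data Finder, and note the year range is my choice):

```r
library(blscrapeR)
library(tidyverse)

# Pull the three unemployment measures from the BLS API
df <- bls_api(c("LNS14000000", "LNS13327708", "LNS13327709"),
              startyear = 2008, endyear = 2017) %>%
    dateCast()

# Plot the three rates over time
ggplot(df, aes(x = date, y = value, color = seriesID)) +
    geom_line() +
    labs(title = "U-3, U-5 and U-6 unemployment rates",
         x = "Date", y = "Rate (%)")
```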
For more information and examples, please see the package vignettes.
To leave a comment for the author, please follow the link and comment on their blog: Data Science Riot!.
Free simmer hexagon stickers!
(This article was first published on FishyOperations, and kindly contributed to R-bloggers)
Do you want to get your own simmer hexagon sticker? Just fill in this form and get one sent to you for free.
Check out r-simmer.org or CRAN for more information on simmer, a discrete-event simulation package for R.
To leave a comment for the author, please follow the link and comment on their blog: FishyOperations.
Highlights of the Data Science Track at Microsoft Ignite
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
I will be at the AI Summit in San Francisco next month, which means I can't make it to Ignite in Orlando this year. Which is a bit of a shame, because there's a fantastic Data Science track at Ignite. There are 25 sessions on offer, with presentations from my Microsoft colleagues on Microsoft R, Cognitive Toolkit, Bot Framework, the Team Data Science Process, and much more. There will also be applications of these technologies and systems presented by the Portland Trail Blazers, Enbridge, and Jack Henry and Associates.
 Big data machine learning with Microsoft R Server
 Patterns, Architecture, & Best Practices: Scaling Machine Learning Algorithms with Azure HDInsight
 Cert Exam Prep: Exam 70-773: Analyzing Big Data with Microsoft R
 How to build machine learning applications using R and Python in SQL Server 2017
 How to modernize analytics by migrating from SAS to the Microsoft platform
You can access these talks plus keynote presentations by Satya Nadella and Joseph Sirosh with the Data Science, Machine Learning and AI Pass. For the complete list of talks included, follow the link below.
Microsoft Ignite: Data Science, Machine Learning and AI Pass Session Catalog
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
Bayesian A/B Testing Made Easy
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
A/B Testing is a familiar task for many working in business analytics. Essentially, A/B Testing is a simple form of hypothesis testing with one control group and one treatment group. Classical frequentist methodology instructs the analyst to estimate the expected effect of the treatment, calculate the required sample size, and perform a test to determine if a large enough effect is observed. This method is somewhat lacking: it only leaves one with point estimates for the control and the treatment groups, and a verdict to reject (effect is observed) or to fail to reject (effect is not observed).
Let’s consider an alternative approach following Bayesian methods with the bayesAB package. Suppose that we have the current version and a proposed version of a web page, each containing a button of interest, and we wish to determine whether the proposed version leads to more clicks on the button of interest. Currently, approximately half of all visitors click the button of interest. Suppose the proposed version of the web page is actually much worse and only 30 percent will click it.
To test this, we randomly assign some visitors to the current and other visitors to the proposed version. Since a visitor either clicks the button of interest or not, we can treat this as a Bernoulli random variable with parameter theta. For the control and the treatment groups, we will assign the same prior distribution on theta, e.g., a beta distribution with mean 0.5. You can think of this as the analog of the null hypothesis.
Now, let’s simulate 20 observations for each group and compare the posterior probabilities for the control and treatment groups. The package automatically computes the probability that the mean of the treatment is greater than the mean of the control.
# First collection
control_1 <- rbinom(20, 1, 0.5)
treatment_1 <- rbinom(20, 1, 0.3)

# First analysis (mirroring the second analysis below)
test1 <- bayesTest(treatment_1, control_1, distribution = "bernoulli",
                   priors = c("alpha" = 10, "beta" = 10))
plot(test1)
The treatment posterior distribution is in red and the control posterior is in green. After 40 observations in total, the posteriors have started to separate, and the probability that the treatment is less than the control is approaching 95 percent.
Let’s simulate 20 more observations for each group and compare.
# Second collection
control_2 <- c(control_1, rbinom(20, 1, 0.5))
treatment_2 <- c(treatment_1, rbinom(20, 1, 0.3))

# Second analysis
test2 <- bayesTest(treatment_2, control_2, distribution = "bernoulli",
                   priors = c("alpha" = 10, "beta" = 10))
print(test2)
summary(test2)
plot(test2)
We can see that with the additional 40 observations, the distributions have separated further, and the probability that the treatment is less than the control is 98 percent.
In addition to point estimates and a verdict, we have full distributions for each of the parameters, which makes computing prediction intervals, for instance, easy. There’s much more to explore in this package than can be covered in this tutorial, so try getting your hands dirty with a few examples contained in the documentation.
Related exercise sets:
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part 4)
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part 2)
 Cross Tabulation with Xtabs exercises
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.