R-bloggers: R news and tutorials contributed by hundreds of R bloggers

Practical Guide to Principal Component Methods in R

Thu, 08/24/2017 - 14:04

(This article was first published on Easy Guides, and kindly contributed to R-bloggers)

Introduction

Although there are several good books on principal component methods (PCMs) and related topics, we felt that many of them are either too theoretical or too advanced.

This book provides solid practical guidance on summarizing, visualizing and interpreting the most important information in large multivariate data sets, using principal component methods in R.

Where to find the book:

The following figure illustrates the type of analysis to be performed depending on the type of variables contained in the data set.

There are a number of R packages implementing principal component methods. These packages include: FactoMineR, ade4, stats, ca, MASS and ExPosition.

However, the results are presented differently depending on the package used.

To help in the interpretation and in the visualization of multivariate analysis – such as cluster analysis and principal component methods – we developed an easy-to-use R package named factoextra (official online documentation: http://www.sthda.com/english/rpkgs/factoextra).

No matter which package you decide to use to compute principal component methods, the factoextra R package can help you easily extract the analysis results from the different packages mentioned above, in a human-readable data format. factoextra also provides convenient functions to create beautiful ggplot2-based graphs.

Methods whose outputs can be visualized using the factoextra package are shown in the figure below:

In this book, we’ll use mainly:

  • the FactoMineR package to compute principal component methods;
  • and the factoextra package for extracting, visualizing and interpreting the results.

The other packages – ade4, ExPosition, etc. – will also be presented briefly.
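As a taste of that workflow, here is a minimal sketch (my own illustration, not an excerpt from the book), assuming the decathlon2 example data set shipped with factoextra:

library(FactoMineR)
library(factoextra)

data(decathlon2)
active <- decathlon2[1:23, 1:10]            # keep only active individuals and variables

res.pca <- PCA(active, graph = FALSE)       # compute PCA with FactoMineR

fviz_eig(res.pca)                           # eigenvalues / scree plot
fviz_pca_var(res.pca, col.var = "contrib")  # variables coloured by contribution
fviz_pca_ind(res.pca, col.ind = "cos2")     # individuals coloured by cos2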

How this book is organized

This book contains 4 parts.

Part I provides a quick introduction to R and presents the key features of FactoMineR and factoextra.

Part II describes classical principal component methods to analyze data sets containing, predominantly, either continuous or categorical variables. These methods include:

  • Principal Component Analysis (PCA, for continuous variables),
  • Simple correspondence analysis (CA, for large contingency tables formed by two categorical variables)
  • Multiple correspondence analysis (MCA, for a data set with more than 2 categorical variables).

In Part III, you’ll learn advanced methods for analyzing a data set containing a mix of variables (continuous and categorical) structured or not into groups:

  • Factor Analysis of Mixed Data (FAMD) and,
  • Multiple Factor Analysis (MFA).

Part IV covers hierarchical clustering on principal components (HCPC), which is useful for performing clustering with a data set containing only categorical variables or with mixed data of categorical and continuous variables.

Key features of this book

This book presents the basic principles of the different methods and provides many examples in R. It offers solid guidance in data mining for students and researchers.

Key features:

  • Covers principal component methods and implementation in R
  • Highlights the most important information in your data set using ggplot2-based elegant visualization
  • Short, self-contained chapters with tested examples that allow for flexibility in designing a course and for easy reference

At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter. Additionally, we provide links to other resources and to our hand-curated list of videos on principal component methods for further learning.

Examples of plots

Some examples of plots generated in this book are shown hereafter. You’ll learn how to create, customize and interpret these plots.

  1. Eigenvalues/variances of principal components: the proportion of information retained by each principal component.

  2. PCA – Graph of variables:
  • Control variable colors using their contributions to the principal components.
  • Highlight the variables that contribute most to each principal dimension.

  3. PCA – Graph of individuals:
  • Automatically control the color of individuals using their cos2 (the quality of representation of the individuals on the factor map).
  • Change the point size according to the cos2 of the corresponding individuals.

  4. PCA – Biplot of individuals and variables.

  5. Correspondence analysis: association between categorical variables.

  6. FAMD/MFA – Analyzing mixed and structured data.

  7. Clustering on principal components.

Book preview

Download the preview of the book at: Principal Component Methods in R (Book preview)

Order now

About the author

Alboukadel Kassambara holds a PhD in Bioinformatics and Cancer Biology. He has worked for many years on genomic data analysis and visualization (read more: http://www.alboukadel.com/).

He has experience in statistical and computational methods for identifying prognostic and predictive biomarker signatures through integrative analysis of large-scale genomic and clinical data sets.

He created GenomicScape (www.genomicscape.com), an easy-to-use web tool for gene expression data analysis and visualization.

He also developed a training website on data science, STHDA (Statistical Tools for High-throughput Data Analysis, www.sthda.com/english), which contains many tutorials on data analysis and visualization using R software and packages.

He is the author of many popular R packages.

Recently, he published three books on data analysis and visualization:

  1. Practical Guide to Cluster Analysis in R (https://goo.gl/DmJ5y5)
  2. Guide to Create Beautiful Graphics in R (https://goo.gl/vJ0OYb).
  3. Complete Guide to 3D Plots in R (https://goo.gl/v5gwl0).



Notice: Changes to the site

Thu, 08/24/2017 - 13:51

(This article was first published on R – Locke Data, and kindly contributed to R-bloggers)

I wanted to give everyone a heads up about a major rebrand and some probable downtime happening over the weekend.

I’m going to be consolidating my Locke Data consulting company materials, the blog, the talks, and my package documentation into a single site. The central URL itsalocke.com won’t be changing but there will be a ton of changes happening.

I think I’ve got the redirects and the RSS feeds pretty much sorted, but I’ve converted more than 300 pages of content to new systems – I’ve likely gotten things wrong. You might notice some issues when clicking through from R-Bloggers, on twitter, or from other people’s sites.

I hope you’ll like the changes I’ve made, but I’m going to have a “bug bounty” in place. If you find a broken link, a blog post that isn’t rendered correctly, or some other bug, then report it. Filling in the form is easy, and if you provide your name and address I’ll send you a sticker as thanks!

If you want to get a preview of the site, check it out on its temporary home lockelife.com.

The post Notice: Changes to the site appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.


Boston EARL Keynote speaker announcement: Tareef Kawaf

Thu, 08/24/2017 - 13:25

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Mango Solutions are thrilled to announce that Tareef Kawaf, President of RStudio, will be joining us at EARL Boston as our third Keynote Speaker.

Tareef is an experienced software startup executive and a member of teams that built up ATG’s eCommerce offering and Brightcove’s Online Video Platform, helping both companies grow from early startups to publicly traded companies. He joined RStudio in early 2013 to help define its commercial product strategy and build the team. He is a software engineer by training, and an aspiring student of advanced analytics and R.

This will be Tareef’s second time speaking at EARL Boston and we’re big supporters of RStudio’s mission to provide the most widely used open source and enterprise-ready professional software for the R statistical computing environment, so we’re looking forward to him taking to the podium again this year.

Want to join Tareef at EARL Boston? Speak

Abstract submissions close on 31 August, so time is running out to share your R adventures and innovations with fellow R users.

All accepted speakers receive a 1-day Conference pass and a ticket to the evening networking reception.

Submit your abstract here.

Buy a ticket

Early bird tickets are now available! Save more than $100 on a Full Conference pass.

Buy tickets here.


Analyzing Google Trends Data in R

Thu, 08/24/2017 - 05:48

Google Trends shows the changes in the popularity of search terms over a given time (i.e., number of hits over time). It can be used to find search terms with growing or decreasing popularity or to review periodic variations from the past such as seasonality. Google Trends search data can be added to other analyses, manipulated and explored in more detail in R.

This post describes how you can use R to download data from Google Trends, and then include it in a chart or other analysis. We’ll first discuss how to get overall (global) data on a search term (query), how to plot it as a simple line chart, and then how you can break the data down by geographical region. The first example I will look at is the rise and fall of the Blu-ray.

Analyzing Google Trends in R

I have never bought a Blu-ray disc and probably never will. In my world, technology moved from DVDs to streaming without the need for a high definition physical medium. I still see them in some shops, but it feels as though they are declining. Using Google Trends we can find out when interest in Blu-rays peaked.

The following R code retrieves the global search history since 2004 for Blu-ray.

library(gtrendsR)
library(reshape2)

google.trends = gtrends(c("blu-ray"), gprop = "web", time = "all")[[1]]
google.trends = dcast(google.trends, date ~ keyword + geo, value.var = "hits")
rownames(google.trends) = google.trends$date
google.trends$date = NULL

The first argument to the gtrends function is a list of up to 5 search terms. In this case, we have just one item. The second argument gprop is the medium searched on and can be any of web, news, images or youtube. The third argument time can be any of now 1-d, now 7-d, today 1-m, today 3-m, today 12-m, today+5-y or all (which means since 2004). A final possibility for time is to specify a custom date range, e.g. 2010-12-31 2011-06-30.
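For example, the custom date-range form of the time argument looks like this (a small sketch reusing the same keyword):

google.trends.h1 <- gtrends(c("blu-ray"), gprop = "web", time = "2010-12-31 2011-06-30")[[1]]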

Note that I am using gtrendsR version 1.9.9.0. This version improves upon the CRAN version 1.3.5 (as of August 2017) by not requiring a login. You may see a warning if your timezone is not set – this can be avoided by adding the following line of code:

Sys.setenv(TZ = "UTC")

After retrieving the data from Google Trends, I format it into a table with dates for the row names and search terms along the columns. The table below shows the result of running this code.

Plotting Google Trends data: Identifying seasonality and trends

Plotting the Google Trends data as an R chart we can draw two conclusions. First, interest peaked around the end of 2008. Second, there is a strong seasonal effect, with significant spikes around Christmas every year.

Note that results are relative to the total number of searches at each time point, with the maximum being 100. We cannot infer anything about the absolute volume of Google searches. But we can say that, as a proportion of all searches, Blu-ray was about half as frequent in June 2008 as in December 2008. An explanation of the Google Trends methodology is here.
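As a sketch of one way to draw that line chart from the table built above (the ggplot2 code here is my own illustration, not from the original post; it assumes the single reshaped "blu-ray" series is the first column of google.trends):

library(ggplot2)

plot.df <- data.frame(date = as.Date(rownames(google.trends)),
                      hits = google.trends[[1]])

ggplot(plot.df, aes(x = date, y = hits)) +
  geom_line() +
  labs(x = NULL, y = "Relative search interest (max = 100)")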

Google Trends by geographic region

Next, I will illustrate the use of country codes. To do so I will find the search history for skiing in Canada and New Zealand. I use the same code as previously, except modifying the gtrends line as below.

google.trends = gtrends(c("skiing"), geo = c("CA", "NZ"), gprop = "web", time = "2010-06-30 2017-06-30")[[1]]

The new argument to gtrends is geo, which allows the user to specify geographic codes to narrow the search region. The awkward part about geographic codes is that they are not always obvious. Country codes consist of two letters, for example CA and NZ in this case. We could also use region codes, such as US-CA for California. I find the easiest way to get these codes is to use this Wikipedia page.

An alternative way to find all the region-level codes for a given country is to use the following snippet of R code. In this case, it retrieves all the regions of Italy (IT).

library(gtrendsR)
geo.codes = sort(unique(countries[substr(countries$sub_code, 1, 2) == "IT", ]$sub_code))

Plotting the ski data below, we note the contrast between northern and southern hemisphere winters. Skiing is also relatively more popular in Canada than in New Zealand. The 2014 Winter Olympics caused a notable spike in both countries, particularly in Canada.

Create your own analysis

In this post I have shown how to import data from Google Trends using the R package gtrendsR. Anyone can click on this link to explore the examples used in this post or to create their own analysis (just sign into Displayr first).


Hard-nosed Indian Data Scientist Gospel Series – Part 1 : Incertitude around Tools and Technologies

Thu, 08/24/2017 - 05:34

(This article was first published on Coastal Econometrician Views, and kindly contributed to R-bloggers)

Before the recession, a commercial tool was popular in the country, so there was little uncertainty around tools and technology; after the recession, however, incertitude (i.e. uncertainty) around tools and technology has preoccupied, and continues to occupy, data science learning, delivery and deployment.

While Python continued as a general-purpose programming language, R was the best remaining choice (it became more popular with the advent of an IDE, namely RStudio), and the author still sees its popularity among data scientists from non-programming backgrounds (i.e. other than computer scientists). Yet, in local meet-ups, panel discussions and webinars, the author still notices aspiring data scientists asking, as an everyday concern, which of the two is better, as shown in the image below.

The author has undertaken several projects, courses and programs in data science over more than a decade; the views expressed here are drawn from his industry experience. He can be reached at mavuluri.pradeep@gmail or besteconometrician@gmail.com for more details.
Find more about the author at http://in.linkedin.com/in/pradeepmavuluri


Digit fifth powers: Euler Problem 30

Thu, 08/24/2017 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Euler problem 30 is another number-crunching problem that deals with numbers raised to the power of five. Two other Euler problems dealt with raising numbers to a power. The previous problem looked at permutations of powers and problem 16 asks for the sum of the digits of 2^1000.

Numberphile has a nice video about a trick to quickly calculate the fifth root of a number that makes you look like a mathematical wizard.

Euler Problem 30 Definition

Surprisingly there are only three numbers that can be written as the sum of fourth powers of their digits:

1634 = 1^4 + 6^4 + 3^4 + 4^4
8208 = 8^4 + 2^4 + 0^4 + 8^4
9474 = 9^4 + 4^4 + 7^4 + 4^4

As 1 = 1^4 is not a sum, it is not included.

The sum of these numbers is 1634 + 8208 + 9474 = 19316. Find the sum of all the numbers that can be written as the sum of fifth powers of their digits.

Proposed Solution

The problem asks for a brute-force solution, but we have a halting problem: how far do we need to go before we can be certain there are no more sums of fifth-power digits? The highest digit is 9, and 9^5 = 59049, which has five digits. If we then look at 6 × 9^5 = 354294, it has six digits, while a seven-digit number can never equal the sum of its digits' fifth powers (7 × 9^5 = 413343 has only six digits), so 6 × 9^5 is a good endpoint for the loop. The loop itself cycles through the digits of each number and tests whether the sum of the fifth powers equals the number.

largest <- 6 * 9^5
answer <- 0
for (n in 2:largest) {
  power.sum <- 0
  i <- n
  while (i > 0) {
    d <- i %% 10
    i <- floor(i / 10)
    power.sum <- power.sum + d^5
  }
  if (power.sum == n) {
    print(n)
    answer <- answer + n
  }
}
print(answer)
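As a cross-check (my own sketch, not part of the original post), the same search can be written by splitting each number into its digits with strsplit() and comparing the digit power sums in one pass:

digit_power_sum <- function(n, p = 5) {
  digits <- as.integer(strsplit(as.character(n), "")[[1]])
  sum(digits^p)
}

candidates <- 2:(6 * 9^5)
sums <- vapply(candidates, digit_power_sum, numeric(1))
sum(candidates[sums == candidates])   # should match the answer printed above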

View the most recent version of this code on GitHub.

The post Digit fifth powers: Euler Problem 30 appeared first on The Devil is in the Data.


Sentiment analysis using tidy data principles at DataCamp

Thu, 08/24/2017 - 02:00

(This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

I’ve been developing a course at DataCamp over the past several months, and I am happy to announce that it is now launched!

The course is Sentiment Analysis in R: the Tidy Way and I am excited that it is now available for you to explore and learn from. This course focuses on digging into the emotional and opinion content of text using sentiment analysis, and it does this from the specific perspective of using tools built for handling tidy data. The course is organized into four case studies (one per chapter), and I don’t think it’s too much of a spoiler to say that I wear a costume for part of it. I’m just saying you should probably check out the course trailer.

Course description

Text datasets are diverse and ubiquitous, and sentiment analysis provides an approach to understand the attitudes and opinions expressed in these texts. In this course, you will develop your text mining skills using tidy data principles. You will apply these skills by performing sentiment analysis in four case studies, on text data from Twitter to TV news to Shakespeare. These case studies will allow you to practice important data handling skills, learn about the ways sentiment analysis can be applied, and extract relevant insights from real-world data.

Learning objectives
  • Learn the principles of sentiment analysis from a tidy data perspective
  • Practice manipulating and visualizing text data using dplyr and ggplot2
  • Apply sentiment analysis skills to several real-world text datasets

Check the course out, have fun, and start practicing those text mining skills!


Recreating and updating Minard with ggplot2

Wed, 08/23/2017 - 23:59

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Minard's chart depicting Napoleon's 1812 march on Russia is a classic of data visualization that has inspired many homages using different time-and-place data. If you'd like to recreate the original chart, or create one of your own, Andrew Heiss has created a tutorial on using the ggplot2 package to re-envision the chart in R:

The R script provided in the tutorial is driven by historical data on the location and size of Napoleon's armies during the 1812 campaign, but you could adapt the script to use new data as well. Andrew also shows how to combine the chart with a geographical or satellite map, which is how the cities appear in the version above (unlike in Minard's original). 

The data behind the Minard chart is available from Michael Friendly and you can find the R scripts at this Github repository. For the complete tutorial, follow the link below.

Andrew Heiss: Exploring Minard’s 1812 plot with ggplot2 (via Jenny Bryan)

 


Basics of data.table: Smooth data exploration

Wed, 08/23/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

The data.table package provides perhaps the fastest way for data wrangling in R. The syntax is concise and is made to resemble SQL. After studying the basics of data.table and finishing this exercise set successfully you will be able to start easing into using data.table for all your data manipulation needs.
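As a reminder before you start, the general form of a data.table call is DT[i, j, by]: filter rows with i, compute j, grouped by by. Here is a minimal sketch with a made-up toy table (not the Fertility data used below):

library(data.table)

DT <- data.table(gender1 = c("male", "female", "male", "female"),
                 age     = c(24, 27, 31, 29))

DT[age > 25, .N, by = gender1]   # count rows with age > 25, grouped by gender1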

We will use data drawn from the 1980 US Census on married women aged 21–35 with two or more children. The data includes the gender of the first and second child, as well as information on whether the woman had more than two children, her race, age and the number of weeks worked in 1979. For more information please refer to the reference manual of the AER package.

Answers are available here.

Exercise 1
Load the data.table package. Furthermore, (install and) load the AER package and run the command data("Fertility"), which loads the dataset Fertility to your workspace. Turn it into a data.table object.

Exercise 2
Select rows 35 to 50 and print their age and work entries to the console.

Exercise 3
Select the last row in the dataset and print to console.

Exercise 4
Count how many women proceeded to have a third child.

Learn more about the data.table package in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

  • work with different data manipulation packages,
  • import, transform and prepare your dataset for modelling,
  • and much more.

Exercise 5
There are four possible gender combinations for the first two children. Which is the most common? Use the by argument.

Exercise 6
By racial composition, what is the proportion of women who worked four weeks or less in 1979?

Exercise 7
Use %between% to get a subset of women aged between 22 and 24 and calculate the proportion who had a boy as their firstborn.

Exercise 8
Add a new column, age squared, to the dataset.

Exercise 9
Out of all the racial compositions in the dataset, which had the lowest proportion of boys as the firstborn? With the same command, display the number of observations in each category as well.

Exercise 10
Calculate the proportion of women who had a third child, by gender combination of the first two children.

Related exercise sets:
  1. Vector exercises
  2. Data frame exercises Vol. 2
  3. Instrumental Variables in R exercises (Part-1)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

Going Bayes #rstats

Wed, 08/23/2017 - 15:07

(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

Some time ago I started working with Bayesian methods, using the great rstanarm-package. Besides the fantastic package vignettes, and books like Statistical Rethinking or Doing Bayesian Data Analysis, I also found the resources from Tristan Mahr helpful for better understanding both Bayesian analysis and rstanarm. This motivated me to implement tools for Bayesian analysis into my packages as well.

Due to the latest tidyr update, I had to update some of my packages to make them work again, so – besides some other features – some Bayes stuff is now available in my packages on CRAN.

Finding shape or location parameters from distributions

The following functions are included in the sjstats-package. Given some known quantiles or percentiles, or a certain value or ratio and its standard error, the functions find_beta(), find_normal() or find_cauchy() help find the parameters of a distribution. Taking the example from here, the plot indicates that the mean value for the normal distribution is somewhat above 50. We can find the exact parameters with find_normal(), using the information given in the text:

library(sjstats)

find_normal(x1 = 30, p1 = .1, x2 = 90, p2 = .8)
#> $mean
#> [1] 53.78387
#>
#> $sd
#> [1] 30.48026

High Density Intervals for MCMC samples

The hdi()-function computes the high density interval for posterior samples. This is nothing special, since other packages provide such functions as well – however, you can use this function not only on vectors, but also on stanreg-objects (i.e. the results from models fitted with rstanarm). And, if required, you can also transform the HDI-values, e.g. if you need these intervals on an exponentiated scale.

library(rstanarm)

fit <- stan_glm(mpg ~ wt + am, data = mtcars, chains = 1)
hdi(fit)
#>          term   hdi.low  hdi.high
#> 1 (Intercept) 32.158505 42.341421
#> 2          wt -6.611984 -4.022419
#> 3          am -2.567573  2.343818
#> 4       sigma  2.564218  3.903652

# fit logistic regression model
fit <- stan_glm(
  vs ~ wt + am,
  data = mtcars,
  family = binomial("logit"),
  chains = 1
)
hdi(fit, prob = .89, trans = exp)
#>          term      hdi.low     hdi.high
#> 1 (Intercept) 4.464230e+02 3.725603e+07
#> 2          wt 6.667981e-03 1.752195e-01
#> 3          am 8.923942e-03 3.747664e-01

Marginal effects for rstanarm-models

The ggeffects-package creates tidy data frames of model predictions, which are ready to use with ggplot (though there’s a plot()-method as well). ggeffects supports a wide range of models, and makes it easy to plot marginal effects for specific predictors, including interaction terms. In the past updates, support for more model types was added, for instance polr (pkg MASS), hurdle and zeroinfl (pkg pscl), betareg (pkg betareg), truncreg (pkg truncreg), coxph (pkg survival) and stanreg (pkg rstanarm).

ggpredict() is the main function that computes marginal effects. Predictions for stanreg-models are based on the posterior distribution of the linear predictor (posterior_linpred()), mostly for convenience reasons. It is recommended to use the posterior predictive distribution (posterior_predict()) for inference and model checking, and you can do so using the ppd-argument when calling ggpredict(), however, especially for binomial or poisson models, it is harder (and much slower) to compute the „confidence intervals“. That’s why relying on posterior_linpred() is the default for stanreg-models with ggpredict().

Here is an example with two plots, one without raw data and one including data points:

library(sjmisc)
library(rstanarm)
library(ggeffects)

data(efc)

# make categorical
efc$c161sex <- to_label(efc$c161sex)

# fit model
m <- stan_glm(neg_c_7 ~ c160age + c12hour + c161sex, data = efc)

dat <- ggpredict(m, terms = c("c12hour", "c161sex"))
dat
#> # A tibble: 128 x 5
#>        x predicted conf.low conf.high  group
#>  1     4  10.80864 10.32654  11.35832   Male
#>  2     4  11.26104 10.89721  11.59076 Female
#>  3     5  10.82645 10.34756  11.37489   Male
#>  4     5  11.27963 10.91368  11.59938 Female
#>  5     6  10.84480 10.36762  11.39147   Male
#>  6     6  11.29786 10.93785  11.61687 Female
#>  7     7  10.86374 10.38768  11.40973   Male
#>  8     7  11.31656 10.96097  11.63308 Female
#>  9     8  10.88204 10.38739  11.40548   Male
#> 10     8  11.33522 10.98032  11.64661 Female
#> # ... with 118 more rows

plot(dat)
plot(dat, rawdata = TRUE)

As you can see, if you work with labelled data, the model-fitting functions from the rstanarm-package preserve all value and variable labels, making it easy to create annotated plots. The „confidence bands“ are actually high density intervals, computed with the above mentioned hdi()-function.

Next…

Next I will integrate ggeffects into my sjPlot-package, making sjPlot more generic and supporting more model types. Furthermore, sjPlot shall get a generic plot_model()-function which will replace former single functions like sjp.lm(), sjp.glm(), sjp.lmer() or sjp.glmer(). plot_model() should then produce a plot – either marginal effects, forest plots, interaction terms and so on – and accept (m)any model class. This should help make sjPlot more convenient to work with, more stable and easier to maintain…

Tagged: Bayes, data visualization, ggplot, R, rstanarm, rstats, sjPlot, Stan


Rcpp now used by 10 percent of CRAN packages

Wed, 08/23/2017 - 13:18

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Over the last few days, Rcpp passed another noteworthy hurdle. It is now used by over 10 percent of packages on CRAN (as measured by Depends, Imports and LinkingTo, but excluding Suggests). As of this morning 1130 packages use Rcpp out of a total of 11275 packages. The graph on the left shows the growth of both outright usage numbers (in darker blue, left axis) and relative usage (in lighter blue, right axis).
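For the curious, a rough way to reproduce that kind of count from the CRAN package database is sketched below; this is my own illustration, not the script behind the graph:

# Count CRAN packages that reference Rcpp in Depends, Imports or LinkingTo
db <- tools::CRAN_package_db()
uses_rcpp <- grepl("\\bRcpp\\b", paste(db$Depends, db$Imports, db$LinkingTo))
c(used_by = sum(uses_rcpp), total = nrow(db), share = sum(uses_rcpp) / nrow(db))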

Older posts on this blog took note when Rcpp passed round hundreds of packages, most recently in April for 1000 packages. The growth rates for both Rcpp, and of course CRAN, are still staggering. A big thank you to everybody who makes this happen, from R Core and CRAN to all package developers, contributors, and of course all users driving this. We have built ourselves a rather impressive ecosystem.

So with that a heartfelt Thank You! to all users and contributors of R, CRAN, and of course Rcpp, for help, suggestions, bug reports, documentation, encouragement, and, of course, code.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Gender roles in film direction, analyzed with R

Tue, 08/22/2017 - 23:33

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

What do women do in films? If you analyze the stage directions in film scripts — as Julia Silge, Russell Goldenberg and Amber Thomas have done for this visual essay for ThePudding — it seems that women (but not men) are written to snuggle, giggle and squeal, while men (but not women) shoot, gallop and strap things to other things.  

This is all based on an analysis of almost 2,000 film scripts mostly from 1990 and after. The words come from pairs of words beginning with "he" and "she" in the stage directions (but not the dialogue) in the screenplays — directions like "she snuggles up to him, strokes his back" and "he straps on a holster under his sealskin cloak". The essay also includes an analysis of words by the writer and character's gender, and includes lots of lovely interactive elements (including the ability to see examples of the stage directions).

The analysis, including the chart above, was created using the R language, and the R code is available on GitHub. The screenplay analysis makes use of the tidytext package, which simplifies the process of handling the text-based data (the screenplays), extracting the stage directions, and tabulating the word pairs.
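As a rough illustration of the kind of extraction involved, here is a minimal tidytext sketch (my own, not the authors' code), using the two stage directions quoted above as stand-in data:

library(dplyr)
library(tidyr)
library(tidytext)

directions <- data.frame(
  text = c("She snuggles up to him, strokes his back",
           "He straps on a holster under his sealskin cloak"),
  stringsAsFactors = FALSE
)

directions %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%   # split into word pairs
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% c("he", "she")) %>%                      # keep he/she pairs
  count(word1, word2, sort = TRUE)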

You can find the complete essay linked below, and it's well worth checking out to experience the interactive elements.

ThePudding: She Giggles, He Gallops


Caching httr Requests? This means WAR[C]!

Tue, 08/22/2017 - 19:53

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I’ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on github to work with WARC files it’s a tad fragile and improving it would mean reinventing many wheels (i.e. there are longstanding solid implementations of WARC libraries in many other languages that could be tapped vs writing a C++-backed implementation).

One of those implementations is JWAT, a library written in Java (as many WARC use-cases involve working in what would traditionally be called map-reduce environments). It has a small footprint and is structured well-enough that I decided to take it for a spin as a set of R packages that wrap it with rJava. There are two packages since it follows a recommended CRAN model of having one package for the core Java Archive (JAR) files — since they tend to not change as frequently as the functional R package would and they tend to take up a modest amount of disk space — and another for the actual package that does the work. They are:

I’ll exposit on the full package at some later date, but I wanted to post a snippet showing that you may have a use for WARC files that you hadn’t considered before: pairing WARC files with httr web scraping tasks to maintain a local cache of what you’ve scraped.

Web scraping consumes network & compute resources on the server end that you typically don’t own and — in many cases — do not pay for. While there are scraping tasks that need to access the latest possible data, many times tasks involve scraping data that won’t change.

The same principle works for caching the results of API calls, since you may make those calls and use some data, but then realize you wanted to use more data and make the same API calls again. Caching the raw API results can also help with reproducibility, especially if the site you were using goes offline (like the U.S. Government sites that are being taken down by the anti-science folks in the current administration).

To that end I’ve put together the beginning of some “WARC wrappers” for httr verbs that make it seamless to cache scraping or API results as you gather and process them. Let’s work through an example using the U.K. open data portal on crime and policing API.

First, we’ll need some helpers:

library(rJava)
library(jwatjars) # devtools::install_github("hrbrmstr/jwatjars")
library(jwatr)    # devtools::install_github("hrbrmstr/jwatr")
library(httr)
library(jsonlite)
library(tidyverse)

Just doing library(jwatr) would have covered much of that but I wanted to show some of the work R does behind the scenes for you.

Now, we’ll grab some neighbourhood and crime info:

wf <- warc_file("~/Data/wrap-test")

res <- warc_GET(wf, "https://data.police.uk/api/leicestershire/neighbourhoods")

str(jsonlite::fromJSON(content(res, as="text")), 2)
## 'data.frame': 67 obs. of 2 variables:
##  $ id  : chr "NC04" "NC66" "NC67" "NC68" ...
##  $ name: chr "City Centre" "Cultural Quarter" "Riverside" "Clarendon Park" ...

res <- warc_GET(wf, "https://data.police.uk/api/crimes-street/all-crime",
                query = list(lat=52.629729, lng=-1.131592, date="2017-01"))

res <- warc_GET(wf, "https://data.police.uk/api/crimes-at-location",
                query = list(location_id="884227", date="2017-02"))

close_warc_file(wf)

As you can see, the standard httr response object is returned for processing, and the HTTP response itself is being stored away for us as we process it.

file.info("~/Data/wrap-test.warc.gz")$size
## [1] 76020

We can use these results later and, pretty easily, since the WARC file will be read in as a tidy R tibble (fancy data frame):

xdf <- read_warc("~/Data/wrap-test.warc.gz", include_payload = TRUE)

glimpse(xdf)
## Observations: 3
## Variables: 14
## $ target_uri                  "https://data.police.uk/api/leicestershire/neighbourhoods", "https://data.police.uk/api/crimes-street...
## $ ip_address                  "54.76.101.128", "54.76.101.128", "54.76.101.128"
## $ warc_content_type           "application/http; msgtype=response", "application/http; msgtype=response", "application/http; msgtyp...
## $ warc_type                   "response", "response", "response"
## $ content_length              2984, 511564, 688
## $ payload_type                "application/json", "application/json", "application/json"
## $ profile                     NA, NA, NA
## $ date                        2017-08-22, 2017-08-22, 2017-08-22
## $ http_status_code            200, 200, 200
## $ http_protocol_content_type  "application/json", "application/json", "application/json"
## $ http_version                "HTTP/1.1", "HTTP/1.1", "HTTP/1.1"
## $ http_raw_headers            [<48, 54, 54, 50, 2f, 31, 2e, 31, 20, 32, 30, 30, 20, 4f, 4b, 0d, 0a, 61, 63, 63, 65, 73, 73, 2d, 63...
## $ warc_record_id              "", "",...
## $ payload                     [<5b, 7b, 22, 69, 64, 22, 3a, 22, 4e, 43, 30, 34, 22, 2c, 22, 6e, 61, 6d, 65, 22, 3a, 22, 43, 69, 74...

xdf$target_uri
## [1] "https://data.police.uk/api/leicestershire/neighbourhoods"
## [2] "https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date=2017-01"
## [3] "https://data.police.uk/api/crimes-at-location?location_id=884227&date=2017-02"

The URLs are all there, so it will be easier to map the original calls to them.

Now, the payload field is the HTTP response body and there are a few ways we can decode and use it. First, since we know it’s JSON content (that’s what the API returns), we can just decode it:

for (i in 1:nrow(xdf)) {
  res <- jsonlite::fromJSON(readBin(xdf$payload[[i]], "character"))
  print(str(res, 2))
}
## 'data.frame': 67 obs. of 2 variables:
##  $ id  : chr "NC04" "NC66" "NC67" "NC68" ...
##  $ name: chr "City Centre" "Cultural Quarter" "Riverside" "Clarendon Park" ...
## NULL
## 'data.frame': 1318 obs. of 9 variables:
##  $ category        : chr "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
##  $ location_type   : chr "Force" "Force" "Force" "Force" ...
##  $ location        :'data.frame': 1318 obs. of 3 variables:
##   ..$ latitude : chr "52.616961" "52.629963" "52.641646" "52.635184" ...
##   ..$ street   :'data.frame': 1318 obs. of 2 variables:
##   ..$ longitude: chr "-1.120719" "-1.122291" "-1.131486" "-1.135455" ...
##  $ context         : chr "" "" "" "" ...
##  $ outcome_status  :'data.frame': 1318 obs. of 2 variables:
##   ..$ category: chr NA NA NA NA ...
##   ..$ date    : chr NA NA NA NA ...
##  $ persistent_id   : chr "" "" "" "" ...
##  $ id              : int 54163555 54167687 54167689 54168393 54168392 54168391 54168386 54168381 54168158 54168159 ...
##  $ location_subtype: chr "" "" "" "" ...
##  $ month           : chr "2017-01" "2017-01" "2017-01" "2017-01" ...
## NULL
## 'data.frame': 1 obs. of 9 variables:
##  $ category        : chr "violent-crime"
##  $ location_type   : chr "Force"
##  $ location        :'data.frame': 1 obs. of 3 variables:
##   ..$ latitude : chr "52.643950"
##   ..$ street   :'data.frame': 1 obs. of 2 variables:
##   ..$ longitude: chr "-1.143042"
##  $ context         : chr ""
##  $ outcome_status  :'data.frame': 1 obs. of 2 variables:
##   ..$ category: chr "Unable to prosecute suspect"
##   ..$ date    : chr "2017-02"
##  $ persistent_id   : chr "4d83433f3117b3a4d2c80510c69ea188a145bd7e94f3e98924109e70333ff735"
##  $ id              : int 54726925
##  $ location_subtype: chr ""
##  $ month           : chr "2017-02"
## NULL

We can also use a jwatr helper function — payload_content() — which mimics the httr::content() function:

for (i in 1:nrow(xdf)) {
  payload_content(
    xdf$target_uri[i],
    xdf$http_protocol_content_type[i],
    xdf$http_raw_headers[[i]],
    xdf$payload[[i]],
    as = "text"
  ) %>%
    jsonlite::fromJSON() -> res
  print(str(res, 2))
}

The same output is printed, so I’m saving some blog content space by not including it.

Future Work

I kept this example small, but ideally one would write a warcinfo record as the first WARC record to identify the file, and I need to add options and functionality to store a WARC request record as well as a response record. But I wanted to toss this out there to get feedback on the idiom and on what desired functionality should be added.

So, please kick the tyres and file as many issues as you have time or interest to. I’m still designing the full package API and making refinements to existing functions, so there’s plenty of opportunity to tailor this to the more data science-y and reproducibility use cases R folks have.


Some Neat New R Notations

Tue, 08/22/2017 - 15:39

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

The R package seplyr supplies a few neat new coding notations.


An Abacus, which gives us the term “calculus.”

The first notation is an operator called the “named map builder”. This is a cute notation that essentially does the job of stats::setNames(). It allows for code such as the following:

library("seplyr")

names <- c('a', 'b')
names := c('x', 'y')
#>   a   b
#> "x" "y"

This can be very useful when programming in R, as it allows indirection or abstraction on the left-hand side of inline name assignments (unlike c(a = 'x', b = 'y'), where all left-hand-sides are concrete values even if not quoted).

A nifty property of the named map builder is it commutes (in the sense of algebra or category theory) with R‘s “c()” combine/concatenate function. That is: c('a' := 'x', 'b' := 'y') is the same as c('a', 'b') := c('x', 'y'). Roughly this means the two operations play well with each other.

The second notation is an operator called “anonymous function builder“. For technical reasons we use the same “:=” notation for this (and, as is common in R, pick the correct behavior based on runtime types).

The function construction is written as: “variables := { code }” (the braces are required) and the semantics are roughly the same as “function(variables) { code }“. This is derived from some of the work of Konrad Rudolph who noted that most functional languages have a more concise “lambda syntax” than “function(){}” (please see here and here for some details, and be aware the seplyr notation is not as concise as is possible).

This notation allows us to write the squares of 1 through 4 as:

sapply(1:4, x:={x^2})

instead of writing:

sapply(1:4, function(x) x^2)

It is only a few characters of savings, but being able to choose notation can be a big deal. A real victory would be being able to directly use lambda-calculus notation such as “(λx.x^2)“. In the development version of seplyr we are experimenting with the following additional notations:

sapply(1:4, lambda(x)(x^2)) sapply(1:4, λ(x, x^2))

(Both of these currently work in the development version, though we are not sure about submitting source files with non-ASCII characters to CRAN.)


Onboarding visdat, a tool for preliminary visualisation of whole dataframes

Tue, 08/22/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

Take a look at the data

This is a phrase that comes up when you first get a dataset.

It is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?

Starting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.

These growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.

The visdat package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to "get a look at the data".

Making visdat was fun, and it was easy to use. But I couldn't help but think that maybe visdat could be more.

  • I felt like the code was a little sloppy, and that it could be better.
  • I wanted to know whether others found it useful.

What I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.

Too much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci onboarding process provides.

rOpenSci onboarding basics

Onboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with JOSS.

What's in it for the author?

  • Feedback on your package
  • Support from rOpenSci members
  • Maintain ownership of your package
  • Publicity from it being under rOpenSci
  • Contribute something to rOpenSci
  • Potentially a publication

What can rOpenSci do that CRAN cannot?

The rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN 1. Here's what rOpenSci does that CRAN cannot:

  • Assess documentation readability / usability
  • Provide a code review to find weak points / points of improvement
  • Determine whether a package is overlapping with another.

So I submitted visdat to the onboarding process. For me, I did this for three reasons.

  1. So visdat could become a better package
  2. Pending acceptance, I would get a publication in JOSS
  3. I get to contribute back to rOpenSci

Submitting the package was actually quite easy – you go to submit an issue on the onboarding page on GitHub, and it provides a magical template for you to fill out 2, with no submission gotchas – this could be the future 3. Within 2 days of submitting the issue, I had a response from the editor, Noam Ross, and two reviewers assigned, Mara Averick, and Sean Hughes.

I submitted visdat and waited, somewhat apprehensively. What would the reviewers think?

In fact, Mara Averick wrote a post: "So you (don't) think you can review a package" about her experience evaluating visdat as a first-time reviewer.

Getting feedback Unexpected extras from the review

Even before the review started officially, I got some great concrete feedback from Noam Ross, the editor for the visdat submission.

  • Noam used the goodpractice package, to identify bad code patterns and other places to immediately improve upon in a concrete way. This resulted in me:
    • Fixing error prone code such as using 1:length(...), or 1:nrow(...)
    • Improving testing using the visualisation testing software vdiffr
    • Reducing long code lines to improve readability
    • Defining global variables to avoid a NOTE ("no visible binding for global variable")

So before the review even started, visdat is in better shape, with 99% test coverage, and clearance from goodpractice.
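For readers who have not seen it, running those checks is a one-liner; the path below is a placeholder, not a real location:

library(goodpractice)
g <- gp("path/to/visdat")   # placeholder path to a local copy of the package source
g                           # prints advice, e.g. avoid 1:length(...), shorten long lines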

The feedback from reviewers

I received prompt replies from the reviewers, and I got to hear really nice things like "I think visdat is a very worthwhile project and have already started using it in my own work.", and "Having now put it to use in a few of my own projects, I can confidently say that it is an incredibly useful early step in the data analysis workflow. vis_miss(), in particular, is helpful for scoping the task at hand …". In addition to these nice things, there was also great critical feedback from Sean and Mara.

A common thread in both reviews was that the way I initially had visdat set up was to have the first row of the dataset at the bottom left, and the variable names at the bottom. However, this doesn't reflect what a dataframe typically looks like – with the names of the variables at the top, and the first row also at the top. There were also suggestions to add the percentage of missing data in each column.

On the left are the old vis_dat() and vis_miss() plots, and on the right are the new vis_dat() and vis_miss() plots.

Changing this makes the plots make a lot more sense, and read better.

Mara made me aware of the warning and error messages that I had let crop up in the package. This was something I had grown to accept – the plot worked, right? But Mara pointed out that from a user perspective, seeing these warnings and messages can be a negative experience for the user, and something that might stop them from using it – how do they know if their plot is accurate with all these warnings? Are they using it wrong?

Sean gave practical advice on reducing code duplication, explaining how to write a general construction method to prepare the data for the plots. Sean also explained how to write C++ code to improve the speed of vis_guess().

From both reviewers I got nitty-gritty feedback about my writing – places where the documentation was just a bunch of notes I had made, or where I had reversed the order of a statement.

What did I think?

I think that getting feedback in general on your own work can be a bit hard to take sometimes. We get attached to our ideas, we've seen them grow from little thought bubbles all the way to "all growed up" R packages. I was apprehensive about getting feedback on visdat. But the feedback process from rOpenSci was, as Tina Turner put it, "simply the best".

Boiling the onboarding review process down to a few key points, I would say it is transparent, friendly, and thorough.

Having the entire review process on GitHub means that everyone is accountable for what they say, and means that you can track exactly what everyone said about it in one place. No email chain hell with (mis)attached documents, accidental reply-alls or single replies. The whole internet is cc'd in on this discussion.

Being an rOpenSci initiative, the process is incredibly friendly and respectful of everyone involved. Comments are upbeat but also, importantly, thorough, providing constructive feedback.

So what does visdat look like?

library(visdat)
vis_dat(airquality)

This shows us a visual analogue of our data: the variable names are shown at the top, the class of each variable is indicated, and so is where data is missing.

You can focus in on missing data with vis_miss()

vis_miss(airquality)

This shows only whether data is missing or present. Going beyond vis_dat(), it shows the percentage of missing data for each variable as well as the overall amount of missing data. vis_miss() will also indicate when a dataset has no missing data at all, or only a very small percentage.

The future of visdat

There are some really exciting changes coming up for visdat. The first is making a plotly version of all of the figures that provides useful tooltips and interactivity. The second and third changes, to bring in later down the track, are visualising expectations – where the user can search their data for particular things, such as characters like "~", values like -99 or -0, or conditions like "x > 101", and visualise where they occur – and making it easy to visually compare two dataframes of differing size. We also want to work on providing consistent palettes for particular data types. For example, character, numeric, integer, and datetime columns would all have different (and consistently different) colours.

I am very interested to hear how people use visdat in their work, so if you have suggestions or feedback I would love to hear from you! The best way to leave feedback is by filing an issue, or perhaps sending me an email at nicholas [dot] tierney [at] gmail [dot] com.

The future of your R package?

If you have an R package, you should give some serious thought to submitting it to rOpenSci through their onboarding process. There are very clear guidelines on their onboarding GitHub page. If you aren't sure about package fit, you can submit a pre-submission enquiry – the editors are nice and friendly, and a positive experience awaits you!

  1. CRAN is an essential part of what makes the R project successful, and without CRAN, R simply would not be the language that it is today. The tasks performed during rOpenSci onboarding require human hours, and there just isn't enough spare time and energy amongst CRAN maintainers. 

  2. Never used GitHub? Don't worry, creating an account is easy, and the template is all there for you. You provide very straightforward information, and it's all there at once. 

  3. With some journals, the submission process means you aren't always clear what information you need ahead of time. Gotchas include things like "what is the residential address of every co-author", or getting everyone to sign a copyright notice. 


How to Create an Online Choice Simulator

Tue, 08/22/2017 - 07:32

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

What is a choice simulator?

A choice simulator is an online app or an Excel workbook that allows users to specify different scenarios and get predictions. Here is an example of a choice simulator.

Choice simulators have many names: decision support systems, market simulators, preference simulators, desktop simulators, conjoint simulators, and choice model simulators.

How to create a choice simulator

In this post, I show how to create an online choice simulator, with the calculations done in R and the simulator hosted in Displayr.

Step 1: Import the model(s) results

First of all, choice simulators are based on models. So, the first step in building a choice simulator is to obtain the model results that are to be used in the simulator. For example, here I use respondent-level parameters from a latent class model, but there are many other types of data that could have been used (e.g., parameters from a GLM, draws from the posterior distribution, beta draws from a maximum simulated likelihood model).

If practical, it is usually a good idea to have model results at the case level (e.g., respondent level), as the resulting simulator can then easily be weighted and/or filtered automatically. If you have case-level data, the model results should be imported into Displayr as a Data Set. See Introduction to Displayr 2: Getting your data into Displayr for an overview of ways of getting data into Displayr.

The table below shows estimated parameters of respondents from a discrete choice experiment of the market for eggs. You can work your way through the choice simulator example used in this post here (the link will first take you to a login page in Displayr and then to a document that contains the data in the variable set called Individual-Level Parameter Means for Segments 26-Jun-17 9:01:57 AM).

Step 2: Simplify calculations using variable sets

Variable sets are a novel and very useful aspect of Displayr: a variable set is a group of related variables. We can simplify the calculations of a choice simulator by using variable sets, with one variable set for each attribute.

In this step, we group the variables for each attribute into separate variable sets, so that they appear as shown on the right. This is done as follows:

  1. If the variables are already grouped into a variable set, select the variable set, and select Data Manipulation > Split (Variables). In the dataset that I am using, all the variables I need for my calculation are already grouped into a single variable set called Individual-Level Parameter Means for Segments 26-Jun-17 9:01:57 AM, so I click on this and split it.
  2. Next, select the first attribute’s variables. In my example, these are the four variables that start with Weight:, each of which represents the respondent-level parameters for different egg weights. (The first of these contains only 0s, as dummy coding was used.)
  3. Then, go to Data Manipulation > Combine (Variables).
  4. Next set the Label for the new variable set to something appropriate. For reasons that will become clearer below, it is preferable to set it to a single, short word. For example, Weight.
  5. Set the Label field for each of the variables to whatever label you plan to show in the choice simulator. For example, if you want people to be able to choose an egg weight of 55g (about 2 ounces), set the Label to 55g.
  6. Finally, repeat this process for all the attributes. If you have any numeric attributes, then leave these as a single variable, like Price in the example here.
Step 3: Create the controls

In my choice simulator, I have separate columns of controls (i.e., combo boxes) for each of the brands. The fast way to do this is to first create them for the first alternative (column), and then copy and paste them:

  1. Insert > Control (More).
  2. Type the levels, separated by semi-colons, into the Item list. These must exactly match the labels that you entered in the Label fields for the first attribute in point 5 of the previous step. For example: 55g; 60g; 65g; 70g. I recommend using copy and paste, because any typos will be difficult to track down. Where you have a numeric attribute, such as Price in the example, enter the range of values that you wish the user to be able to choose from (e.g., 1.50; 2.00; 2.50; 3.00; 3.50; 4.00; 4.50; 5.00).
  3. Select the Properties tab in the Object Inspector and set the Name of the control to whatever you set as the Label for the corresponding variable set, with the number 1 affixed at the end – for example, Weight.1. (You can use any label, but following this convention will save you time later on.)
  4. Click on the control and select the first level. For example, 55g.
  5. Repeat these steps until you have created controls for each of the attributes, each under each other, as shown above.
  6. Select all the controls that you have created, and then select Home > Copy and Home > Paste, and move the new set of labels to the right of the previous labels. Repeat this for as many sets of alternatives as you wish to include. In my example, there are four alternatives.
  7. Finally, add labels for the brands and attributes: Insert > TextBox (Text and Images).

See also Adding a Combo Box to a Displayr Dashboard for an intro to creating combo boxes.

Step 4: Calculate preference shares
  1. Insert an R Output (Insert > R Output (Analysis)), setting it to Automatic with the appropriate code, and positioning it underneath the first column of combo boxes. Press the Calculate button, and it should calculate the share for the first alternative. If you paste the code below, and everything is set up properly, you will get a value of 25%.
  2. Now, click on the R Output you just created, and copy-and-paste it. Position the new version immediately below the second column of combo boxes.
  3. Modify the very last line of code, replacing [1] with [2], which tells it to show the results of the second alternative.
  4. Repeat steps 2 and 3 for alternatives 3 and 4.

The code below can easily be modified for other models. A few key aspects of the code:

  • It works with four alternatives and is readily modified to deal with different numbers of alternatives.
  • The formulas for the utility of each alternative are expressed as simple mathematical expressions. Because I was careful with the naming of the variable sets and the controls, they are easy to read. If you are using Displayr, you can hover over the various elements of the formula and you will get a preview of their data.
  • The code is already set up to deal with weights. Just click on the R Output that contains the formula and apply a weight (Home > Weight).
  • It is set up to automatically deal with any filters. More about this below.
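For reference, the share calculation in the code below is, for each respondent, the standard multinomial logit rule: share_i = exp(u_i) / sum_j exp(u_j). Here is a standalone illustration for a single respondent with four hypothetical utilities:

u <- c(1.2, 0.4, -0.3, 0.1)  # hypothetical utilities for one respondent's four alternatives
exp(u) / sum(exp(u))         # the four preference shares, which sum to 1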
R Code to paste:

# Computing the utility for each alternative
u1 = Weight[, Weight.1] + Organic[, Organic.1] + Charity[, Charity.1] + Quality[, Quality.1] + Uniformity[, Uniformity.1] + Feed[, Feed.1] + Price * as.numeric(gsub("\\$", "", Price.1))
u2 = Weight[, Weight.2] + Organic[, Organic.2] + Charity[, Charity.2] + Quality[, Quality.2] + Uniformity[, Uniformity.2] + Feed[, Feed.2] + Price * as.numeric(gsub("\\$", "", Price.2))
u3 = Weight[, Weight.3] + Organic[, Organic.3] + Charity[, Charity.3] + Quality[, Quality.3] + Uniformity[, Uniformity.3] + Feed[, Feed.3] + Price * as.numeric(gsub("\\$", "", Price.3))
u4 = Weight[, Weight.4] + Organic[, Organic.4] + Charity[, Charity.4] + Quality[, Quality.4] + Uniformity[, Uniformity.4] + Feed[, Feed.4] + Price * as.numeric(gsub("\\$", "", Price.4))

# Computing preference shares
utilities = as.matrix(cbind(u1, u2, u3, u4))
eutilities = exp(utilities)
shares = prop.table(eutilities, 1)

# Filtering the shares, if a filter is applied.
shares = shares[QFilter, ]

# Filtering the weight variable, if required.
weight = if (is.null(QPopulationWeight)) rep(1, length(u1)) else QPopulationWeight
weight = weight[QFilter]

# Computing shares for the total sample
shares = sweep(shares, 1, weight, "*")
shares = as.matrix(apply(shares, 2, sum))
shares = 100 * prop.table(shares, 2)[1]

Step 5: Make it pretty

If you wish, you can make your choice simulator prettier. The R Outputs and the controls all have formatting options. In my example, I got our designer, Nat, to create the pretty background screen, which she did in Photoshop; I then added it using Insert > Image.

Step 6: Add filters

If you have stored the data as variable sets, you can quickly create filters. Note that the calculations will automatically update when the viewer selects the filters.

Step 7: Share

To share the dashboard, go to the Export tab in the ribbon (at the top of the screen), and click on the black triangle under the Web Page button. Next, check the option for Hide page navigation on exported page and then click Export… and follow the prompts.

Note that the URL for the choice simulator I am using in this example is https://app.displayr.com/Dashboard?id=21043f64-45d0-47af-9797-cd4180805849. This URL is public, but for security reasons it cannot be guessed or found by web search. If, however, you give the URL to someone, they can access the document. Alternatively, if you have an annual Displayr account, you can instead go into Settings for the document (the cog at the top-right of the screen) and press Disable Public URL. This limits access to people who are set up as users for your organization. You can set up people as users in the company's Settings, accessible by clicking on the cog at the top-right of the screen. If you don't see these settings, contact support@displayr.com to buy a license.

Worked example of a choice simulator

You can see the choice simulator in View Mode here (as an end-user will see it), or you can create your own choice simulator here (first log into Displayr and then edit or modify a copy of the document used to create this post).


Introducing routr – Routing of HTTP and WebSocket in R

Tue, 08/22/2017 - 02:00

(This article was first published on Data Imaginist - R posts, and kindly contributed to R-bloggers)

routr is now available on CRAN, and I couldn’t be happier. Its release marks
the completion of an idea that stretches back longer than my attempts to bring
network visualization and ggplot2 together (see this post for reference).
While my PhD was still concerned with proteomics, I began developing shiny-based
GUIs for managing different parts of the proteomics workflow. I soon came
to realize that I was spending an inordinate amount of time battling shiny
itself, because I wanted more than it was meant for. Thus began my idea of
creating an expressive and powerful web server framework for R, in the vein of
express.js and the like, that could be made to do anything. The idea lingered in
my head for a long time and went through several iterations until I finally
released fiery in the late summer of 2016. fiery was never meant to stand
alone, though, and I boldly proclaimed that routr would come next. That didn’t
seem to happen. I spent most of the following year developing tools for
visualization and network analysis while feeling guilty about the
project I’d put on hold. Fortunately, I’ve been able to put in some time to
take up development of the fiery ecosystem once again, so without further
ado…

routr

While I spent some time in the introduction talking about the whole development
path of fiery, I would like to start here by saying that routr is a server-agnostic
tool. Sure, I’ve built it for use with fiery, but I’ve been very
deliberate in making it completely independent of it, except for the code that
is involved in the fiery plugin functionality. So, you’re completely free to
use routr with whatever server framework you wish (e.g. hook it directly to
an httpuv instance). But how does it work? Read on…

The design

routr is basically built up of two different concepts: routes and
route stacks. A route is a collection of handlers attached to specific HTTP
request methods (e.g. GET, POST, PUT) and paths. When a request lands at a route,
one of the handlers is chosen and called, based on the nature of the request. A
route stack is a collection of routes. When a request lands at a route stack, it
is passed through all the routes the stack contains sequentially, potentially
stopping early if one of the handlers signals it. In the following, these two
concepts are discussed in detail.

Routes

In essence, a router is a decision mechanism for redirecting HTTP requests
to the correct handler function based on the request URL. It makes sure that
e.g. requests for http://example.com/info end up in a different handler than
requests for http://example.com/users/thomasp85. This functionality is encapsulated
in the Route class. The basic use is illustrated below:

library(routr)

route <- Route$new()
route$add_handler('get', '/info', function(request, response, keys, ...) {
  response$status <- 200L
  response$body <- list(h1 = 'This is a test server')
  TRUE
})
route$add_handler('get', '/users/thomasp85', function(request, response, keys, ...) {
  response$status <- 200L
  response$body <- list(h1 = 'This is the user information for thomasp85')
  TRUE
})
route

## A route with 2 handlers
## get: /users/thomasp85
##    : /info

Let’s walk through what happened here. First we created a new Route object,
and then we added two handlers to it using the eponymous add_handler() method.
Both handlers respond to the GET method, but they differ in the paths they
listen for. routr uses reqres under the hood, so each handler
is passed a Request and Response pair (we’ll get back to the keys
argument). Lastly, each handler must return either TRUE, indicating that the
next route should be called, or FALSE, indicating that no further routes should be
called. As the request and response objects are R6 objects, any changes to them
persist outside the handler, so there is no need to return them.

Now, consider the situation where I have built my super fancy web service into a
thriving business with millions of users – would I need to add a handler for
every user? No. This is a case for a parameterized path.

route$add_handler('get', '/users/:user_id', function(request, response, keys, ...) {
  response$status <- 200L
  response$body <- list(h1 = paste0('This is the user information for ', keys$user_id))
  TRUE
})
route

## A route with 3 handlers
## get: /users/thomasp85
##    : /users/:user_id
##    : /info

As can be seen, prefixing a path element with : turns it into a variable,
matching anything put in that position and adding it as a named element of the keys
argument. Paths can contain as many variable elements as wanted, in order to
reuse handlers as efficiently as possible.
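For example, a path can carry several parameters at once, each of which shows up as a named entry in keys. The snippet below is a made-up illustration added to a separate Route object (so the example route above stays as shown), following the same add_handler() pattern:

files_route <- Route$new()
files_route$add_handler('get', '/users/:user_id/files/:file_id', function(request, response, keys, ...) {
  response$status <- 200L
  response$body <- list(h1 = paste0('File ', keys$file_id, ' for user ', keys$user_id))
  TRUE
})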

There’s one last piece of path functionality left to discuss: the wildcard. While
a parameterized path element only matches a single element (e.g.
/users/:user_id will match /users/johndoe but not /users/johndoe/settings),
the wildcard matches anything. Let’s try one of these:

route$add_handler('get', '/setting/*', function(request, response, keys, ...) {
  response$status_with_text(403L) # Forbidden
  FALSE
})
route$add_handler('get', '/*', function(request, response, keys, ...) {
  response$status <- 404L
  response$body <- list(h1 = 'We really couldn\'t find your page')
  FALSE
})
route

## A route with 5 handlers
## get: /users/thomasp85
##    : /users/:user_id
##    : /setting/*
##    : /info
##    : /*

Here we add two new handlers: one preventing access to anything under the
/setting location, and one implementing a custom 404 – Not Found page. Both
return FALSE, as they are meant to prevent any further processing.

Now there’s a slight pickle with the current situation. If I ask for
/users/thomasp85 it can match three different handlers: /users/thomasp85,
/users/:user_id, and /*. Which one is chosen? routr decides on the handler
based on path specificity, where handlers are prioritized by the number of
elements in the path (the more the better), the number of parameterized elements
(the fewer the better), and the existence of wildcards (better with none). In the
above case, this means that the /users/thomasp85 handler will be chosen. The handler
priority can always be seen when printing the Route object.

The request method is less complicated than the path. It simply matches the
method used in the request, ignoring case. There’s one special method:
all. This one will match any method, but only if a handler does not exist for
that specific method.
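As a small made-up illustration of that: had we added the handler below to the route above, a GET to /info would still hit its dedicated GET handler, while any other method on /info would fall through to the all handler:

route$add_handler('all', '/info', function(request, response, keys, ...) {
  response$status_with_text(405L) # Method Not Allowed for anything other than GET
  FALSE
})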

Route Stacks

Conceptually, route stacks are much simpler than routes, in that they are just
a sequential collection of routes, with the means to pass requests through them.
Let’s create some additional routes and collect them in a RouteStack:

parser <- Route$new()
parser$add_handler('all', '/*', function(request, response, keys, ...) {
  request$parse(reqres::default_parsers)
})

formatter <- Route$new()
formatter$add_handler('all', '/*', function(request, response, keys, ...) {
  response$format(reqres::default_formatters)
})

router <- RouteStack$new()
router$add_route(parser, 'request_prep')
router$add_route(route, 'app_logic')
router$add_route(formatter, 'response_finish')
router

## A RouteStack containing 3 routes
## 1: request_prep
## 2: app_logic
## 3: response_finish

Now, when our router receives a request, it will first pass it to the parser
route and attempt to parse the body. If that is unsuccessful, it will abort (the
parse() method returns FALSE if it fails); if not, it will pass the request
on to the route we built up in the prior section. If the chosen handler returns
TRUE, the request will then end up in the formatter route, and the response body
will be formatted based on content negotiation with the request. As can be seen,
route stacks are an effective way to extract common functionality into well-defined
handlers.

If you’re using fiery, RouteStack objects are also what get used as
plugins. Whether the router is used for request, header, or message
(WebSocket) events is decided by the attach_to field.

app <- fiery::Fire$new()
app$attach(router)
app

## 🔥 A fiery webserver
## 🔥 💥 💥 💥
## 🔥 Running on: 127.0.0.1:8080
## 🔥 Plugins attached: request_routr
## 🔥 Event handlers added
## 🔥 request: 1

Predefined routes

Lastly, routr comes with a few predefined routes, which I will briefly
mention: The ressource_route maps files on the server to handlers. If you wish
to serve static content in some way, this facilitates it, and takes care of a
lot of HTTP header logic such as caching. It will also automatically serve
compressed files if they exist and the client accepts them:

static_route <- ressource_route('/' = system.file(package = 'routr'))
router$add_route(static_route, 'static', after = 1)
router

## A RouteStack containing 4 routes
## 1: request_prep
## 2: static
## 3: app_logic
## 4: response_finish

Now, you can get the package description file by visiting /DESCRIPTION. If a
file is found, the handler returns FALSE so that the file is simply returned;
if nothing is found, it returns TRUE so that other routes can decide what to
do.

If you wish to limit the size of requests, you can use the sizelimit_route and
e.g. attach it to the header event in a fiery app, so that requests that are
too big will get rejected before the body is fetched.

sizelimit <- sizelimit_route(10 * 1024^2) # 10 MB
reject_router <- RouteStack$new(size = sizelimit)
reject_router$attach_to <- 'header'
app$attach(reject_router)
app

## 🔥 A fiery webserver
## 🔥 💥 💥 💥
## 🔥 Running on: 127.0.0.1:8080
## 🔥 Plugins attached: request_routr
## 🔥                   header_routr
## 🔥 Event handlers added
## 🔥 header: 1
## 🔥 request: 1

Wrapping up

As I started by saying, the release of routr marks a point of maturity for my
fiery ecosystem. I’m extremely happy with this, but it is in no way the end of
development. I will now pivot to working on more specialized plugins, concerned
with areas such as security and scalability, but the main approach to building
server-side logic with fiery is now up and running – I hope you’ll take it for a
spin.


Understanding gender roles in movies with text mining

Tue, 08/22/2017 - 02:00

(This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

I have a new visual essay up at The Pudding today, using text mining to explore how women are portrayed in film.

The R code behind this analysis is publicly available on GitHub.

I was so glad to work with the talented Russell Goldenberg and Amber Thomas on this project, and many thanks to Matt Daniels for inviting me to contribute to The Pudding. I’ve been a big fan of their work for a long time!


Tidier BLS data with the blscrapeR package

Tue, 08/22/2017 - 02:00

(This article was first published on Data Science Riot!, and kindly contributed to R-bloggers)

The recent release of the blscrapeR package brings the “tidyverse” into the fold. Inspired by my recent collaboration with Kyle Walker on his excellent tidycensus package, I have optimized blscrapeR for use within the tidyverse as of the current version 3.0.0.

New things you’ll notice right away include:
  • All data now returned as tibbles.

  • dplyr and purrr are now imported packages, along with magrittr and ggplot2, which were imported from the start.

  • No need to call any packages other than tidyverse and blscrapeR.

Major internal changes
  • Switched from base R to dplyr in instances where performance could be increased.

  • Standard apply functions replaced with purrr map() functions where performance could be increased.
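As a tiny illustration of that second point (my own example, not blscrapeR's actual internals), here is a base apply call next to its purrr equivalent, which is explicit about the return type:

library(purrr)

dates <- c("2016-01-01", "2016-02-01", "2016-03-01")

# Base R
years_base  <- sapply(dates, function(x) as.numeric(format(as.Date(x), "%Y")))

# purrr: map_dbl() always returns a double vector
years_purrr <- map_dbl(dates, ~ as.numeric(format(as.Date(.x), "%Y")))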

install.packages("blscrapeR")

The BLS: More than Unemployment

The American Time Use Survey is one of the BLS’ more interesting data sets. Below is an API query that compares the time Americans spend watching TV on a daily basis compared to the time spent socializing and communicating.

It should be noted, some familiarity with BLS series id numbers is required here. The BLS Data Finder is a nice tool to find series id numbers.

library(blscrapeR) library(tidyverse) tbl <- bls_api(c("TUU10101AA01014236", "TUU10101AA01013951")) %>% spread(seriesID, value) %>% dateCast() %>% rename(watching_tv = TUU10101AA01014236, socializing_communicating = TUU10101AA01013951) tbl ## # A tibble: 3 x 7 ## year period periodName footnotes socializing_communicating watching_tv date ## * ## 1 2014 0.71 2.82 2014-01-01 ## 2 2015 0.68 2.78 2015-01-01 ## 3 2016 0.65 2.73 2016-01-01 Unemployment Rates

The main attraction of the BLS is its monthly employment and unemployment data. Below is an API query and plot of three of the major BLS unemployment rates.

  • U-3: The “official unemployment rate.” Total unemployed, as a percent of the civilian labor force.

  • U-5: Total unemployed, plus discouraged workers, plus all other marginally attached workers, as a percent of the civilian labor force plus all marginally attached workers.

  • U-6: Total unemployed, plus all marginally attached workers, plus total employed part time for economic reasons, as a percent of the civilian labor force plus all marginally attached workers.

library(blscrapeR) library(tidyverse) tbl <- bls_api(c("LNS14000000", "LNS13327708", "LNS13327709")) %>% spread(seriesID, value) %>% dateCast() %>% rename(u3_unemployment = LNS14000000, u5_unemployment = LNS13327708, u6_unemployment = LNS13327709) ggplot(data = tbl, aes(x = date)) + geom_line(aes(y = u3_unemployment, color = "U-3 Unemployment")) + geom_line(aes(y = u5_unemployment, color = "U-5 Unemployment")) + geom_line(aes(y = u6_unemployment, color = "U-6 Unemployment")) + labs(title = "Monthly Unemployment Rates") + ylab("value") + theme(legend.position="top", plot.title = element_text(hjust = 0.5))

For more information and examples, please see the package vignettes.


Free simmer hexagon stickers!

Mon, 08/21/2017 - 20:00

(This article was first published on FishyOperations, and kindly contributed to R-bloggers)


Do you want to get your own simmer hexagon sticker? Just fill in this form and get one sent to you for free.

Check out r-simmer.org or CRAN for more information on simmer, a discrete-event simulation package for R.

