R-bloggers
R news and tutorials contributed by hundreds of R bloggers

No worries! Afterthoughts from UseR 2018

Fri, 07/27/2018 - 16:26

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)


This year the UseR conference took place in Brisbane, Australia. UseR is my favorite conference and this one was my 11th (counting from Dortmund 2008).
Every UseR is unique. Every UseR is great. But my impression is that European UseRs are (on average) more about math, statistics and methodology, while US UseRs are more about big data, data science, technology and tools.

So, how was the one in Australia? Was it more similar to Europe or US?

IMHO – neither of them. 
This one was (for me) about welcoming new users, being open to a diverse community, being open to change, and caring about R culture. Traces of these values were present in most keynotes.

Talking about keynotes: all of them were great, but "Teaching R to New Users" given by Roger Peng was outstanding. I will use the video or the essay as MUST-READ material for students in my R programming classes.

The venue, talks and atmosphere were great as well (thanks to the organizing crew led by Di Cook). Lots of people (including myself) spent time around the hex wall looking for their favorite packages (you can read more about it here). There was an engaging team exercise during the conference dinner (how much does your table know about R?). The poster session was handled on TV screens, so some posters were interactive (Miles McBain had a poster related to R and Virtual Reality, cool).

Last but not least, there was a great mixture of contributed talks and workshops. Everyone could find something for themselves. All too often it was hard to choose between several tempting options (fortunately, talks are recorded).
Here I would like to mention three talks I found inspiring.

"The Minard Paradox" given by Paul Murrell was refreshing.
One may think that nowadays we are so good at data vis, with all these shiny tools and interactive widgets. Yet Paul showed how hard it is to reproduce great works like Minard's map even in cutting-edge software (i.e., R). God is in the details. Watch Paul's talk here.

"Data Preprocessing using Recipes" given by Max Kuhn touched on an important, yet often neglected truth: columns in the source data are not necessarily the final features. Between 'read the data' and 'fit the model' there is an important process of feature engineering. This process needs to be reproducible and based on a well-planned grammar. The recipes package helps here. Find the recipes talk here (the tutorial is also recorded).
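For readers who did not see the talk, here is a minimal sketch of the recipes idea (my own illustration, not Max's code; the data frame train_data, its numeric outcome price and its mix of numeric/categorical predictors are hypothetical):

library(recipes)

rec <- recipe(price ~ ., data = train_data) %>%
  step_log(price) %>%                               # transform the outcome
  step_dummy(all_nominal()) %>%                     # turn factors into indicator columns
  step_center(all_numeric(), -all_outcomes()) %>%   # center the predictors
  step_scale(all_numeric(), -all_outcomes())        # and scale them

prepped  <- prep(rec, training = train_data)        # estimate the preprocessing steps
features <- bake(prepped, new_data = train_data)    # apply them to get model-ready features

The point of the grammar is that exactly the same prepped recipe can later be applied to new data, so the feature engineering is reproducible.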

"Glue strings to data in R" given by James Hester showed a package that does only one thing (glue strings) but does it extremely well. I had not expected 20 minutes of absorbing talk focused solely on gluing strings. Yet, this is my third favourite. Watch it here.
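And since it is easy to show, a tiny illustration of glue (my own example, not from the talk):

library(glue)

name <- "useR!"
city <- "Brisbane"
glue("Welcome to {name} 2018 in {city}.")
#> Welcome to useR! 2018 in Brisbane.

# glue_data() glues strings to the columns of a data frame:
glue_data(mtcars[1:3, ], "A car with {cyl} cylinders does {mpg} mpg.")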

David Smith shared his highlights here; you will find quite a collection of links there.

Videos of the recorded talks, keynotes and tutorials are on the R Consortium YouTube channel.


Weight loss in the U.S. – An analysis of NHANES data with tidyverse

Fri, 07/27/2018 - 16:06

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

According to a paper published in JAMA last year, weight gain is increasing among US adults while the percentage of people trying to lose weight has not changed. The authors used data from the National Health and Nutrition Examination Survey (NHANES) from 1988 to 2014 and calculated the proportion of people who had tried to lose weight during the past 12 months.

Logistic regression analysis was used to estimate the odds of trying to lose weight for different periods from 1988 to 2014. I would like to reproduce these findings with R, given that NHANES data is publicly available.

Libraries and Datasets

Load libraries

library(tidyverse)
library(RNHANES)
library(weights)
library(ggsci)
library(ggthemes)

The authors used data from 1988 to 2014, but since the RNHANES package only provides data starting from 1999, I will include data from 1999 to 2014. Therefore, I cannot fully reproduce the results of the paper published in JAMA.

Load the data using the nhanes_load_data function from the RNHANES package.

d99 = nhanes_load_data("DEMO", "1999-2000") %>%
  select(SEQN, cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR) %>%
  transmute(SEQN = SEQN, wave = cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR) %>%
  left_join(nhanes_load_data("BMX", "1999-2000"), by = "SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI) %>%
  left_join(nhanes_load_data("WHQ", "1999-2000"), by = "SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI, WHQ070)

d01 = nhanes_load_data("DEMO_B", "2001-2002") %>%
  select(SEQN, cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR) %>%
  transmute(SEQN = SEQN, wave = cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR) %>%
  left_join(nhanes_load_data("BMX_B", "2001-2002"), by = "SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI) %>%
  left_join(nhanes_load_data("WHQ_B", "2001-2002"), by = "SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI, WHQ070)

d03 = nhanes_load_data("DEMO_C", "2003-2004") %>%
  select(SEQN, cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR) %>%
  transmute(SEQN = SEQN, wave = cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR) %>%
  left_join(nhanes_load_data("BMX_C", "2003-2004"), by = "SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI) %>%
  left_join(nhanes_load_data("WHQ_C", "2003-2004"), by = "SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI, WHQ070)

d09 = nhanes_load_data("DEMO_F", "2009-2010") %>%
  select(SEQN, cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMIN2, WTINT2YR, WTMEC2YR) %>%
  transmute(SEQN = SEQN, wave = cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC = INDFMIN2, WTINT2YR, WTMEC2YR) %>%
  left_join(nhanes_load_data("BMX_F", "2009-2010"), by = "SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI) %>%
  left_join(nhanes_load_data("WHQ_F", "2009-2010"), by = "SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI, WHQ070)

d11 = nhanes_load_data("DEMO_G", "2011-2012") %>%
  select(SEQN, cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMIN2, WTINT2YR, WTMEC2YR) %>%
  transmute(SEQN = SEQN, wave = cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC = INDFMIN2, WTINT2YR, WTMEC2YR) %>%
  left_join(nhanes_load_data("BMX_G", "2011-2012"), by = "SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI) %>%
  left_join(nhanes_load_data("WHQ_G", "2011-2012"), by = "SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI, WHQ070)

d13 = nhanes_load_data("DEMO_H", "2013-2014") %>%
  select(SEQN, cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMIN2, WTINT2YR, WTMEC2YR) %>%
  transmute(SEQN = SEQN, wave = cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC = INDFMIN2, WTINT2YR, WTMEC2YR) %>%
  left_join(nhanes_load_data("BMX_H", "2013-2014"), by = "SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI) %>%
  left_join(nhanes_load_data("WHQ_H", "2013-2014"), by = "SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI, WHQ070)

The list of variables I will use in the analysis:

  • SEQN: id
  • RIAGENDR: gender
  • RIDAGEYR: age in years
  • BMXBMI: body mass index
  • INDFMINC: family income
  • RIDRETH1: race/ethnicity
  • RIDEXPRG: pregnancy
  • WHQ070: intention to lose weight
  • WTINT2YR: survey weight for the interview (questionnaire) data
  • WTMEC2YR: survey weight for the examination (laboratory) measurements

Merging the datasets, creating new variables, and excluding pregnant women.

dat = rbind(d99, d01, d03, d09, d11, d13) %>%
  mutate(
    race = recode_factor(RIDRETH1,
                         `1` = "Mexican American",
                         `2` = "Hispanic",
                         `3` = "Non-Hispanic, White",
                         `4` = "Non-Hispanic, Black",
                         `5` = "Others"),
    bmi = if_else(BMXBMI >= 30, "Obese", "Overweight"),
    pregnancy = if_else(RIDEXPRG == 1, "Yes", "No", "No"),
    tryweloss = if_else(WHQ070 == 1, "Yes", "No")
  ) %>%
  filter(BMXBMI >= 25, RIDAGEYR >= 20, RIDAGEYR < 60,
         race != "Others", pregnancy == "No", WHQ070 %in% c(1, 2))

I excluded pregnant women from the analysis as they are overweight or obese due to pregnancy and might have no intention to lose weight during pregnancy. The R code below calculates the weighted proportion of overweight and obesity as well as intention to lose weight for the total population.

with(dat, wpct(bmi, weight = WTMEC2YR))
##      Obese Overweight
##  0.4997831  0.5002169

with(dat, wpct(tryweloss, weight = WTINT2YR))
##        No       Yes
## 0.5235062 0.4764938

The proportions of obese and overweight are each 50%, as I excluded normal-weight individuals before. The intention to lose weight in the total population is 48%. Now, I will evaluate the "intention to lose weight" among overweight and obese people in different periods. (I will repeat the same code for each period studied. If you know how to shorten the code below, please let me know. I don't know how to weight the proportions with the tidyverse package.)

with(dat %>% filter(wave == "1999-2000", bmi == "Obese"),
     wpct(tryweloss, weight = WTINT2YR))
##        No       Yes
## 0.4555818 0.5444182

with(dat %>% filter(wave == "1999-2000", bmi == "Overweight"),
     wpct(tryweloss, weight = WTINT2YR))
##        No       Yes
## 0.6000436 0.3999564

In 1999-2000, 54% of obese and 40% of overweight people were trying to lose weight. Using the code above, I calculated the proportions for the other periods and made a table. I will focus only on those trying to lose weight, to see how the trend changes over the years in overweight and obese individuals.
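(One possible way to shorten the repeated code above, added as a sketch and not verified against the numbers below: a weighted proportion is just a weighted mean of an indicator variable, so all cycles could be computed in one pass with dplyr.)

dat %>%
  group_by(wave, bmi) %>%
  summarise(weightloss = 100 * weighted.mean(tryweloss == "Yes", w = WTINT2YR)) %>%
  ungroup()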

trend <- read.table(header = TRUE, text = '
year       bmi        weightloss
1999-2000  Obese      54.44182
1999-2000  Overweight 39.99564
2001-2002  Obese      54.14932
2001-2002  Overweight 38.60543
2003-2004  Obese      53.6699
2003-2004  Overweight 42.97784
2009-2010  Obese      53.62087
2009-2010  Overweight 35.6233
2011-2012  Obese      57.5303
2011-2012  Overweight 41.99084
2013-2014  Obese      58.32048
2013-2014  Overweight 40.19653
')

trend %>%
  ggplot(aes(year, weightloss, fill = bmi, group = bmi)) +
  geom_line(color = "black", size = 0.3) +
  geom_point(colour = "black", pch = 21, size = 3) +
  theme(text = element_text(family = "serif", size = 11),
        legend.position = "bottom") +
  scale_fill_jama() +
  theme_hc() +
  labs(
    title = "Trends of 'trying to lose weight' in the United States",
    caption = "NHANES 1999-2014 survey\ndatascienceplus.com",
    x = "Survey Cycle Years",
    y = "Percentage, %",
    fill = ""
  ) +
  scale_y_continuous(limits = c(30, 80), breaks = seq(30, 80, by = 10))

This gives the following plot:

There is a small increasing trend in trying to lose weight among people with obesity in recent years.

Logistic Regression

A generalized linear model with a Poisson distribution is used to evaluate the odds of intending to lose weight (tryweloss) across the different time periods. In this analysis, several confounders were considered, such as BMI status, gender, age, family income, and race.
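Note that the model below uses tryweloss2, which is not created in the data-preparation code above; presumably it is a numeric 0/1 recoding of tryweloss, along these lines:

# Assumed recoding (not shown in the original code): 1 = tried to lose weight, 0 = did not
dat <- dat %>%
  mutate(tryweloss2 = if_else(tryweloss == "Yes", 1, 0))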

mylogit <- glm(tryweloss2 ~ wave + bmi + RIAGENDR + RIDAGEYR + INDFMINC + race,
               data = dat, family = "poisson")
summary(mylogit)
##
## Call:
## glm(formula = tryweloss2 ~ wave + bmi + RIAGENDR + RIDAGEYR +
##     INDFMINC + race, family = "poisson", data = dat)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.2590  -0.9171  -0.7429   0.6490   1.1270
##
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)             -1.2545203  0.0886399 -14.153  < 2e-16 ***
## wave2001-2002            0.0028610  0.0553019   0.052  0.95874
## wave2003-2004            0.0277646  0.0562594   0.494  0.62165
## wave2009-2010           -0.0434281  0.0519876  -0.835  0.40352
## wave2011-2012            0.0377786  0.0539726   0.700  0.48395
## wave2013-2014            0.0929779  0.0523326   1.777  0.07562 .
## bmiOverweight           -0.3519162  0.0306745 -11.473  < 2e-16 ***
## RIAGENDR                 0.4184619  0.0303948  13.768  < 2e-16 ***
## RIDAGEYR                -0.0028538  0.0013459  -2.120  0.03398 *
## INDFMINC                 0.0008816  0.0010161   0.868  0.38559
## raceHispanic             0.0847514  0.0584763   1.449  0.14725
## raceNon-Hispanic, White  0.1199298  0.0387668   3.094  0.00198 **
## raceNon-Hispanic, Black -0.0016056  0.0436180  -0.037  0.97064
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 7330.3  on 10232  degrees of freedom
## Residual deviance: 6923.2  on 10220  degrees of freedom
##   (88 observations deleted due to missingness)
## AIC: 16273
##
## Number of Fisher Scoring iterations: 5

The results show no significant difference in the odds of intending to lose weight (p > 0.05) when comparing the later periods to 1999-2000. The small sample size of each survey cycle could explain these findings.

However, let me take the exponentials of these coefficients and their confidence intervals.

exp(coef(mylogit))
##             (Intercept)           wave2001-2002           wave2003-2004
##               0.2852126               1.0028651               1.0281537
##           wave2009-2010           wave2011-2012           wave2013-2014
##               0.9575014               1.0385013               1.0974375
##           bmiOverweight                RIAGENDR                RIDAGEYR
##               0.7033391               1.5196224               0.9971503
##                INDFMINC            raceHispanic raceNon-Hispanic, White
##               1.0008820               1.0884464               1.1274177
## raceNon-Hispanic, Black
##               0.9983957

exp(confint(mylogit))
##                             2.5 %    97.5 %
## (Intercept)             0.2395745 0.3391149
## wave2001-2002           0.8999514 1.1178640
## wave2003-2004           0.9208583 1.1481368
## wave2009-2010           0.8650406 1.0606215
## wave2011-2012           0.9344651 1.1546960
## wave2013-2014           0.9907765 1.2164297
## bmiOverweight           0.6622089 0.7468264
## RIAGENDR                1.4318982 1.6130974
## RIDAGEYR                0.9945248 0.9997859
## INDFMINC                0.9988443 1.0028327
## raceHispanic            0.9697129 1.2196128
## raceNon-Hispanic, White 1.0452581 1.2168270
## raceNon-Hispanic, Black 0.9166538 1.0876073

The odds ratio (with 95% confidence interval, 95% CI) of trying to lose weight in the period 2013-2014 is 1.10 (95% CI 0.99-1.22) compared to 1999-2000. The odds ratio (95% CI) of trying to lose weight in overweight compared to obese individuals is significantly lower, 0.70 (95% CI 0.66-0.75), indicating that obese individuals try to lose weight more often than overweight individuals (as we would expect).

My results differ from those of the paper published in JAMA, and there are some differences between this analysis and the article. My reference period is 1999-2000, whereas in the paper it is 1988-1994. Moreover, I excluded pregnant women, whereas the paper kept them in the analysis.


    How to use Covariates to Improve your MaxDiff Model

    Fri, 07/27/2018 - 15:00

    (This article was first published on R – Displayr, and kindly contributed to R-bloggers)

    MaxDiff is a type of best-worst scaling. Respondents are asked to compare all choices in a given set and pick their best and worst (or most and least favorite). For an introduction, check out this great webinar by Tim Bock. In this post, we’ll discuss why you may want to include covariates in the first place and how they can be included in Hierarchical Bayes (HB) MaxDiff. Then we’ll use the approach to examine the qualities voters look for in a U.S. president.

    Why include respondent-specific covariates?

    Advances in computing mean it is easy to include complex respondent-specific covariates in HB MaxDiff models. There are several reasons why we may want to do this in practice.

    1. A standard model, which assumes each respondent’s part-worths are drawn from the same normal distribution, may be too simplistic. Information drawn from additional covariates may improve the estimates of the part-worths. This is especially likely for surveys with fewer questions and therefore less information.
    2. Additionally, when respondents are segmented, we may be worried that the estimates for one segment are biased. Another concern is that HB may shrink the segment means overly close to each other. This is especially problematic if sample sizes vary greatly between segments.
    How to include covariates in the model

    In the usual HB model, we model the part-worths for the ith respondent as βi ~ N(μ, Σ). Note that the mean and covariance parameters μ and Σ do not depend on i and are the same for each respondent in the population. The simplest way to include respondent-specific covariates in the model is to modify μ to be dependent on the respondent’s covariates.

    We do this by modifying the model for the part-worths to βi ~ N(Θxi, Σ), where xi is a vector of known covariate values for the ith respondent and Θ is a matrix of unknown regression coefficients. Each row of Θ is given a multivariate normal prior. The covariance matrix, Σ, is re-expressed into two parts, a correlation matrix and a vector of scales, and each part receives its own prior distribution.
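    Written out in display form (notation only, matching the description above; Ω denotes the correlation matrix and τ the vector of scales, each with its own prior):

    \[
    \beta_i \sim \mathcal{N}(\Theta x_i,\ \Sigma), \qquad
    \Sigma = \operatorname{diag}(\tau)\,\Omega\,\operatorname{diag}(\tau).
    \]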

    Fitting covariates in R and Displayr

    This model can be fit in R, Q and Displayr using the function FitMaxDiff in the R package flipMaxDiff, which you can download from GitHub. The function fits the model using the No-U-Turn sampler implemented in Stan, the state-of-the-art software for fitting Bayesian models. The package allows us to quickly and efficiently estimate our model without having to worry about selecting the tuning parameters that are frequently a major hassle in Bayesian computation and machine learning. The package also provides a number of features for visualizing the results and diagnosing any issues with the model fit.

    Example in Displayr

    The dataset

    In our data set, 315 Americans were asked ten questions about the attributes they look for in a U.S. president. Each question asked the respondents to pick the most and least important attributes from a set of five. The attributes were:

    • Decent/ethical
    • Plain-speaking
    • Healthy
    • Successful in business
    • Good in a crisis
    • Experienced in government
    • Concerned for the welfare of minorities
    • Understands economics
    • Concerned about global warming
    • Concerned about poverty
    • Has served in the military
    • Multilingual
    • Entertaining
    • Male
    • From a traditional American background
    • Christian

    For more information, please see this earlier blog post, which analyzes the same data using HB, but does not consider covariates.

    Fitting your MaxDiff Model

    In Displayr and Q, we can fit a MaxDiff model by selecting Marketing > MaxDiff > Hierarchical Bayes from the menu. See this earlier blog post for a description of the HB controls/inputs and a demo using a different data set. Documentation specific to the Displayr GUI is on the Q wiki. To use the function in R, install the devtools package and then download flipMaxDiff using devtools::install_github("Displayr/flipMaxDiff"). Covariates can be included in the function FitMaxDiff inside flipMaxDiff via the parameters cov.formula and cov.data, which work just like the formula and data parameters in the R functions lm and glm.
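    A rough sketch of such a call is shown below. Only cov.formula and cov.data are taken from the description above; the remaining argument names and objects (the design and the best/worst responses) are placeholders, not the actual FitMaxDiff interface.

    # devtools::install_github("Displayr/flipMaxDiff")
    library(flipMaxDiff)

    fit <- FitMaxDiff(
      design      = maxdiff.design,    # placeholder: the MaxDiff experimental design
      best        = best.choices,      # placeholder: respondents' "best" picks
      worst       = worst.choices,     # placeholder: respondents' "worst" picks
      cov.formula = ~ vote2016,        # respondent-level covariate, as in lm()/glm()
      cov.data    = respondent.data    # data frame holding the covariate
    )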

    We then included a single categorical predictor in the model: responses to the question of who they voted for in the 2016 election. The predictor had the following levels: voted for Clinton, voted for Trump, voted for another candidate, didn’t vote, and don’t know/refused to answer.

    We would expect this predictor to have a very strong correlation with the best and worst choices for each respondent. To compare the models with and without covariates in Displayr, first fit the model without covariates and then copy/paste the created R item.

    To add the covariates, simply click Properties > R CODE in the Object Inspector on the right of the page and add a few lines of R code for the cov.formula and cov.data parameters. See the image carousel below.



    (Captions from the image carousel in the original post:)

    • Respondent parameter estimates and summary statistics for fitting a choice model with a covariate to the presidents data set.
    • Respondent parameter estimates and summary statistics for fitting a choice model with a categorical covariate to the presidents data set.
    • GUI inputs for fitting a choice model with the presidents data set in Displayr.
    • R code for fitting the choice model with fixed covariates in Displayr.


    Checking convergence

    We fit the models using 1000 iterations and eight Markov chains. When conducting an HB analysis, it is vital to check that the algorithm has both converged to and adequately sampled from the posterior distribution. Using the HB diagnostics available in Displayr (see this post for a detailed overview), there appeared to be no issues with convergence for these data. We then assessed the performance of our models by leaving out one or more respondent questions and seeing how well we could predict their choices using the estimated model.

    Results

    If we only hold out one question for prediction and use the other nine questions to fit the models, the effect of the categorical predictor is small. The model with the categorical predictor takes longer to run for the same number of iterations due to the increased number of parameters, and the covariate yields only a modest improvement in out-of-sample prediction accuracy (from 67.0% to 67.3%). We did not gain much from including the predictor because we could already draw substantial information from the nine MaxDiff questions.

    Including fixed covariates becomes much more advantageous when you have fewer MaxDiff questions, as in the extreme example of having only two questions to fit the models. Here we see a larger improvement in out-of-sample prediction accuracy (from 54.5% to 55.0%). We also see a much higher effective sample size per second (from 5.04 to 13.37 for the mean parameters). This means that the algorithm is able to sample much more efficiently with the covariate included. Even more importantly, this saves us time, as we don’t need to use as many iterations to obtain our desired number of effective samples.

    To see how this was done in Displayr, head to this dashboard “HB Choice Model with Fixed Covariates” to view the complete results! Ready to include your own covariates for analysis? Start a free Displayr trial


    Using themes in ggplot2

    Fri, 07/27/2018 - 13:36

    (This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

    As noted elsewhere, sometimes beauty matters. A plot that’s pleasing to the eye will be considered more readily, and thus might be understood more thoroughly. Also, since we at STATWORX often need to summarize and communicate our results, we have come to appreciate how a nice plot can upgrade any presentation.

    So how do you make a plot look good? How do you make it conform to given style guidelines? In ggplot2, the display of all non-data components is controlled by the theme system. Unlike in some other packages, the appearance of plots is edited after all the data-related elements of the plot have been determined. The theme system of ggplot2 allows the manipulation of titles, labels, legends, grid lines and backgrounds. There are various built-in themes available that already have an all-around consistent style, pertaining to every detail of a plot.

    Pre-defined themes

    There are two ways to apply built-in (or otherwise predefined) themes (e.g. theme_grey, theme_bw, theme_linedraw, theme_light, theme_dark, theme_minimal or theme_classic).
    For one, they can be added as an additional layer to individual plots:

    rm(list = ls())
    library(gridExtra)
    library(ggplot2)

    # generating a fictional data set containing hours of sunshine and temperature
    sun_hours   <- sample(seq(from = 1, to = 8, by = 0.1), size = 40, replace = TRUE)
    noise       <- sample(seq(from = 17, to = 24, by = 0.1), size = 40, replace = TRUE)
    temperature <- sun_hours + noise
    df_sun      <- data.frame(sun_hours, temperature)

    # generate the plot base
    base_plot <- ggplot(df_sun) +
      geom_point(aes(x = sun_hours, y = temperature, color = temperature),
                 shape = 6, size = 5, stroke = 2) +
      geom_point(aes(x = sun_hours, y = temperature, color = temperature),
                 shape = 21, size = 3.3, fill = "white", stroke = 2) +
      labs(x = "Hours of Sun", y = "Temperature") +
      scale_color_gradient(high = "firebrick", low = "#ffce00", name = " ") +
      ggtitle("Base Plot")

    # adding predefined themes
    p1 <- base_plot + theme_classic() + ggtitle("Plot with theme_classic()")
    p2 <- base_plot + theme_bw() + ggtitle("Plot with theme_bw()")
    p3 <- base_plot + theme_dark() + ggtitle("Plot with theme_dark()")
    p4 <- base_plot + theme_light() + ggtitle("Plot with theme_light()")

    gridExtra::grid.arrange(p1, p2, p3, p4)

    Alternatively, the default theme that’s automatically added to any plot can be set or retrieved with the functions theme_set() and theme_get().

    # making the classic theme the default
    theme_set(theme_classic())
    base_plot + ggtitle("Plot with theme_set(theme_classic())")

    While predefined themes are very convenient, there’s always the option to (additionally) tweak the appearance of any non-data detail of a plot via the various arguments of theme(). This can be done for a specific plot or for the currently active default theme. The default theme can be updated or partly replaced via theme_update() and theme_replace(), respectively.

    # changing the default theme
    theme_update(legend.position = "none")
    base_plot + ggtitle("Plot with theme_set(theme_classic()) \n& theme_update(legend.position = \"none\")")

    # changing the theme directly applied to the plot
    base_plot +
      theme(legend.position = "bottom") +
      ggtitle("Plot with theme(legend.position = \"bottom\")")

    Element functions

    There is such a wide range of arguments for theme() that not all of them can be discussed here. Therefore, this blog post is far from exhaustive: it only deals with the general principles of the theme system and provides illustrative examples for a few of the available arguments. The appearance of many elements needs to be specified via one of the four element functions: element_blank, element_text, element_line or element_rect.

    • How labels and titles are displayed is controlled by the element_text function. For example, we can make the title of the y-axis bold and increase its size.
    • Borders and backgrounds can be manipulated using element_rect. For example, we can choose the color of the plot’s background.
    • Lines can be defined via the element_line function. For example, we can change the line types of the major and minor grid.
    • Further, with element_blank() it is possible to remove an object completely, without having any space dedicated to the plot element.
    # using element_text, element_rect, element_line, element_blank
    base_plot +
      theme(axis.title.y = element_text(face = "bold", size = 16),
            plot.background = element_rect(fill = "#FED633"),
            panel.grid.major = element_line(linetype = "dashed"),
            panel.grid.minor = element_line(linetype = "dotted"),
            axis.text.y = element_blank(),
            axis.text.x = element_blank()) +
      ggtitle("Plot altered using element functions")

    If we don’t want to change the display of some specific plot elements, but rather of all text, lines, titles or rectangular elements, we can do so by specifying the arguments text, line, rect and title. Specifications passed to these arguments are inherited by all elements of the respective type. This inheritance principle also holds for other 'parent' arguments. 'Parent' arguments are often easy to identify, as their names are used as prefixes for all subordinate arguments.

    # using overreaching arguments #1
    base_plot +
      theme(line = element_line(linetype = "dashed")) +
      ggtitle("Plot with all lines altered by using line")

    # using overreaching arguments #2
    base_plot +
      theme(axis.title = element_text(size = 6)) +  # here axis.title is the parent
      ggtitle("Plot with both axis titles altered by using axis.title")

    Outlook

    Margins, spaces, sizes and orientations of elements are not specified with element functions but have their own sets of possible parameters. For example, the display of legends is controlled by such arguments and specific parameters.

    # using parameters instead of element functions
    base_plot + theme(legend.position = "top")

    Since ggplot2 enables you to manipulate the appearance of the non-data elements of plots in great detail, there is a multitude of arguments. This blog post only tries to give a first impression of the many, many possibilities for designing a plot. Some further engagement with the topic might be advisable, but any time invested in understanding how to style plots surely is well spent. If you want to read more about making pretty plots in ggplot2, check out my other posts on coordinate systems or customizing date and time scales.

     

    References
    • Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer.

     

    About the author

    Lea Waniek

    Lea is a member of the Data Science team and also provides support in the area of statistics.

    The post Using themes in ggplot2 first appeared on STATWORX.


    Cucumber time, food on a 2D plate / plane

    Fri, 07/27/2018 - 10:40

    (This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers)

    Introduction

    It is 35 degrees Celsius outside, and we are in the middle of the ‘slow news season’, in many countries also called cucumber time: a period typified by the appearance of less informative and frivolous news in the media.

    Did you know that 100 g of cucumber contain 0.28 mg of iron and 1.67 g of sugar? You can find all the nutrient values of a cucumber on the USDA food databases.

    Food Data

    There is more data: for many thousands of products you can retrieve nutrient values through an API (you need to register for a free key). So besides the cucumber, I extracted data for different types of food, for example:

    • Beef products
    • Dairy & Egg products
    • Vegetables
    • Fruits
    • etc.

    And as a comparison, I retrieved the nutrient values for some fast food products from McDonald’s and Pizza Hut, just to see if pizza can be classified as a vegetable from a data point of view. The data looks like this:

    I sampled 1500 products, and for each product we have 34 nutrient values.

    Results

    The 34-dimensional data is now compressed/projected onto a two-dimensional plane using UMAP (Uniform Manifold Approximation and Projection). There are Python and R packages for this.
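    The author's actual code is linked below; as a rough idea of what such a call can look like in R (a sketch only; the package choice, object and column names are illustrative, not the code used for the map):

    library(uwot)   # one of the R implementations of UMAP

    nutrients <- scale(as.matrix(food_data[, nutrient_cols]))   # 34 nutrient values per product
    emb <- umap(nutrients, n_neighbors = 15, min_dist = 0.1)    # 2-D embedding
    plot(emb, col = as.integer(factor(food_data$group)), pch = 19,
         xlab = "UMAP 1", ylab = "UMAP 2")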

    An interactive map can be found here, and the R code to retrieve and plot the data here. Cheers, Longhow.

     


    Two new Apache Drill UDFs for Processing UR[IL]s and Internet Domain Names

    Fri, 07/27/2018 - 03:28

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    Continuing the blog’s UDF theme of late, there are two new UDF kids in town:

    • drill-url-tools for slicing & dicing URI/URLs (just going to use ‘URL’ from now on in the post)
    • drill-domain-tools for slicing & dicing internet domain names (IDNs).

    Now, if you’re an Apache Drill fanatic, you’re likely thinking “Hey hrbrmstr: don’t you know that Drill has a parse_url() function already?” My answer is “Sure, but it’s based on java.net.URL which is fundamentally broken.”

    Slicing & dicing URLs and IDNs is a large part of the $DAYJOB and they go together pretty well, hence the joint UDF release.

    Rather than just use boring SQL for an example, we’ll start with some SQL and use R for a decent example of working with the two, new UDFs.

    Counting Lying Lock Icons

    SSL/TLS is all the rage these days, so let’s see how many distinct sites in the GDELT Global Front Page (GFG) data set use port 443 vs port 80 (a good indicator, plus it will help show how the URL tools pick up ports even when they’re not there).

    If you go to the aforementioned URL, it tells us that the most current GFG data set URL can be retrieved by inspecting the contents of this metadata URL.

    There are over a million records in that data set but — as we’ll see — not nearly as many distinct hosts.

    Let’s get the data:

    library(sergeant)
    library(tidyverse)

    read_delim(
      file = "http://data.gdeltproject.org/gdeltv3/gfg/alpha/lastupdate.txt",
      delim = " ",
      col_names = FALSE,
      col_types = "ccc"
    ) -> gfg_update

    dl_path <- file.path("~/Data/gfg_links.tsv.gz")

    if (!file.exists(dl_path)) download.file(gfg_update$X3[1], dl_path)

    Those operations have placed the GFG data set in a place where my local Drill instance can get to them. It's a tab separated file (TSV) which — while not a great data format — is workable with Drill.

    Now we'll setup a SQL query that will parse the URLs and domains, giving us a nice rectangular structure for R & dbplyr. We'll use the second column since a significant percentage of the URLs in column 6 are malformed:

    db <- src_drill()

    tbl(db, "(
    SELECT
      b.host, port, b.rec.hostname AS hostname, b.rec.assigned AS assigned,
      b.rec.tld AS tld, b.rec.subdomain AS subdomain
    FROM
      (SELECT host, port, suffix_extract(host) AS rec     -- break the hostname into components
       FROM
         (SELECT a.rec.host AS host, a.rec.port AS port
          FROM
            (SELECT columns[1] AS url, url_parse(columns[1]) AS rec   -- break the URL into components
             FROM dfs.d.`/gfg_links.tsv.gz`) a
          WHERE a.rec.port IS NOT NULL                    -- filter out URL parsing failures
         )
      ) b
    WHERE b.rec.tld IS NOT NULL                           -- filter out domain parsing failures
    )") -> gfg_df

    gfg_df
    ## # Database: DrillConnection
    ##    hostname  port host              subdomain assigned      tld
    ##  1 www         80 www.eestikirik.ee NA        eestikirik.ee ee
    ##  2 www         80 www.eestikirik.ee NA        eestikirik.ee ee
    ##  3 www         80 www.eestikirik.ee NA        eestikirik.ee ee
    ##  4 www         80 www.eestikirik.ee NA        eestikirik.ee ee
    ##  5 www         80 www.eestikirik.ee NA        eestikirik.ee ee
    ##  6 www         80 www.eestikirik.ee NA        eestikirik.ee ee
    ##  7 www         80 www.eestikirik.ee NA        eestikirik.ee ee
    ##  8 www         80 www.eestikirik.ee NA        eestikirik.ee ee
    ##  9 www         80 www.eestikirik.ee NA        eestikirik.ee ee
    ## 10 www         80 www.eestikirik.ee NA        eestikirik.ee ee
    ## # ... with more rows

    While we could have done it all in SQL, we saved some bits for R:

    distinct(gfg_df, assigned, port) %>%
      count(port) %>%
      collect() -> port_counts

    port_counts
    # A tibble: 2 x 2
       port     n
    1    80 20648
    2   443 22178

    You'd think more news-oriented sites would be HTTPS by default given the current global political climate (though those lock icons are no safety panacea by any stretch of the imagination).

    FIN

    Now, R can do URL & IDN slicing, but Drill can operate at-scale. That is, R's urltools package may be fine for single-node, in-memory ops, but Drill can process billions of URLs when part of a cluster.
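    For comparison, a single-machine version of the same kind of slicing with urltools might look like this (a quick sketch, not from the post):

    library(urltools)

    u <- "http://www.eestikirik.ee/some/path"
    url_parse(u)                # scheme, domain, port, path, parameter, fragment
    suffix_extract(domain(u))   # host, subdomain, domain, suffix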

    I'm not 100% settled on the galimatias library for URL parsing (I need to do some extended testing) and I may add some less-strict IDN slicing & dicing functions as well.

    Kick the tyres & file issues & PRs as necessary.


    Announcing the 1st Bookdown Contest

    Fri, 07/27/2018 - 02:00

    (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

    Since the release of the bookdown package in 2016, there have been a large number of books written and published with bookdown. Currently there are about 200 books (including tutorials and notes) listed on bookdown.org alone! We have also heard about other applications of bookdown based on custom templates (e.g., dissertations).

    As popular as bookdown is becoming, especially with teachers, researchers, and students, we know it can take a lot of time to tailor bookdown to meet the special typesetting requirements of your institution or publisher. As it is today, future graduate students will have to spend many hours reinventing a thesis template, instead of focusing on writing content in R Markdown! Fortunately, we are sure that there are already elegant and reusable bookdown applications, which would greatly benefit future users.

    With that in mind, we are happy to announce the first contest to recognize outstanding bookdown applications!

    Criteria

    There are no hard judging criteria for this contest, but in general, we’d prefer these types of applications:

    • Publicly and freely accessible (both source documents and the output). If the full source and output cannot be shared publicly, we expect at least a full demo that can be shared (the demo could contain only placeholder content).
    • Not tightly tied to a particular output format, which means you should use as few raw LaTeX commands or HTML tags as possible in the body of the book (using the includes options is totally fine, e.g., including custom LaTeX content in the preamble). An exception can be made for dissertations, since they are typically in the PDF format.
    • Has some minimal examples or clear instructions for other users to easily create similar applications.
    • Uses new output formats based on bookdown’s built-in output formats (such as bookdown::html_book or bookdown::pdf_document2).
    • Has creative and elegant styling for HTML and/or PDF output based on either the default templates in bookdown or completely new custom templates.

    We’d also like to see non-English applications, such as books written in CJK (Chinese, Japanese, Korean), right-to-left, or other languages, since there are additional challenges in typesetting with these languages.

    Note that the applications do not have to be technical books or even books at all. They could be novels, diaries, collections of poems/essays, course notes, or data analysis reports.

    Awards

    Honorable Mention Prizes (ten):

    Runner Up Prizes (two): All awards above, plus

    Grand Prize (one): All awards above, with three more signed books related to R Markdown

    The names and work of all winners will be highlighted in a gallery on the bookdown.org website and we will announce them on RStudio’s social platforms, including community.rstudio.com (unless the winner prefers not to be mentioned).

    Of course, the main reward is knowing that you’ve helped future writers!

    Submission

    To participate in this contest, please follow the link http://rstd.io/bookdown-contest to create a new post in RStudio Community (you will be asked to sign up if you don’t have an account). The post title should start with “Bookdown contest submission:“, followed by a short title to describe your application (e.g., “a PhD thesis template for Iowa State”). The post may describe features and highlights of the application, include screenshots and links to live examples and source repositories, and briefly explain key technical details (how the customization or extension was achieved).

    There is no limit on the number of entries one participant can submit. Please submit as many as you wish!

    The deadline for the submission is October 1st, 2018. You are welcome to either submit your existing bookdown applications (even like a PhD thesis you wrote two years ago), or create one in two months! We will announce winners and their submissions in this blog, RStudio Community, and also on Twitter before Oct 15th, 2018.

    I (Yihui) will be the main judge this year. Winners of this year will be invited to serve as judges next year. I’ll consider both the above criteria and the feedback/reaction of other users in the submission posts in RStudio Community (such as the number of likes that a post receives).

    Looking forward to your submissions!


    How to use rquery with Apache Spark on Databricks

    Thu, 07/26/2018 - 19:13

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    A big thank you to Databricks for working with us and sharing:

    rquery: Practical Big Data Transforms for R-Spark Users
    How to use rquery with Apache Spark on Databricks

    rquery on Databricks is a great data science tool.


    Stan Pharmacometrics conference in Paris July 24 2018

    Thu, 07/26/2018 - 07:38

    (This article was first published on Shravan Vasishth's Slog (Statistics blog), and kindly contributed to R-bloggers)

    I just got back from attending this amazing conference in Paris:

    http://www.go-isop.org/stan-for-pharmacometrics—paris-france

    A few people were disturbed/surprised by the fact that I am a linguist (“what are you doing at a pharmacometrics conference?”). I hasten to point out that two of the core developers of Stan are linguists too (Bob Carpenter and Mitzi Morris). People seem to think that all linguists do is correct other people’s comma placements. However, despite my being a total outsider to the conference, the organizers were amazingly welcoming, even allowed me to join the speakers’ dinner, and treated me like a regular guest.

    Here is a quick summary of what I learnt:

    1. Gelman’s talk: The only thing I remember from his talk was the statement that when economists fit multiple regression models and find that one predictor’s formerly significant effect was wiped out by adding another predictor, they think that the new predictor explains the old predictor. Which is pretty funny. Another funny thing was that he had absolutely no slides, and was drawing figures in the air, and apologizing for the low resolution of the figures.

     2. Bob Carpenter gave an inspiring talk on the exciting stuff that’s coming in Stan:

    – Higher Speeds (Stan 2.10 will be 80 times faster with 100 cores)

    – Stan book

    – New functionality (e.g., tuples, multivariate normal RNG)

    – Gaussian process models will soon become tractable

    – Blockless Stan is coming! This will make Stan code look more like JAGS (which is great). Stan will forever remain backward compatible so old code will not break.

    My conclusion was that in the next few years, things will improve a lot in terms of speed and in terms of what one can do.

    3. Torsten and Stan:

    – Torsten seems to be a bunch of functions to do PK/PD modeling with Stan.

    – Bill Gillespie on Torsten and Stan: https://www.metrumrg.com/wp-content/uploads/2018/05/BayesianPmetricsMBSW2018.pdf

    – Free courses on Stan and PK/PD modeling: https://www.metrumrg.com/courses/

    4. Mitzi Morris gave a great talk on disease mapping (accident mapping in NYC) using conditional autoregressive models (CAR). The idea is simple but great: one can model the correlations between neighboring boroughs. A straightforward application is in EEG, modeling data from all electrodes simultaneously, and modeling the decreasing correlation between neighbors. This is low-hanging fruit, esp. with Stan 2.18 coming.

    5. From Bob I learnt that one should never provide free consultation (I am doing that these days), because people don’t value your time then. If you charge them by the hour, this sharpens their focus. But I feel guilty charging people for my time, especially in medicine, where I provide free consulting: I’m a civil servant and already get paid by the state, and I get total freedom to do whatever I like. So it seems only fair that I serve the state in some useful way (other than studying processing differences in subject vs object relative clauses, that is).

    For psycholinguists, there is a lot of stuff in pharmacometrics that will be important for EEG and visual world data: Gaussian process models, PK/PD modeling, spatial+temporal modeling of a signal like EEG. These tools exist today but we are not using them. And Stan makes a lot of this possible now, or very soon.

    Summary: I’m impressed.


    RStudio Connect 1.6.6 – Custom Emails

    Thu, 07/26/2018 - 02:00

    (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

    We are excited to announce RStudio Connect 1.6.6! This release caps a series of improvements to RStudio Connect’s ability to deliver your work to others.

    Custom Email

    The most significant change in RStudio Connect 1.6.6 is the new ability for publishers to customize the emails sent to others when they update their data products. In RStudio Connect, it is already possible to schedule the execution of R Markdown documents and send emails to subscribers notifying them of new versions of content. In this release, publishers can customize whether or not an email is sent, add email attachments, specify the email subject line, and dynamically build beautiful email messages with plots and tables produced by your analysis.

    ````{r}
    email <- blastula::compose_email(
      body = "
      Hello Team,
      Great job! We closed {today()} at {final_sales}.
      {add_ggplot(p, width = 6, height = 6)}
      - Jim
      "
    )

    if (sales > 10000) {
      rmarkdown::output_metadata$set(
        rsc_email_subject     = glue('Sales at {final_sales} for {today()}'),
        rsc_email_body_html   = email$html_str,
        rsc_email_images      = email$images,
        rsc_email_attachments = c('sales_summary.pptx', 'sales_data.xlsx')
      )
    } else {
      rmarkdown::output_metadata$set(rsc_email_suppress_scheduled = TRUE)
    }
    ````

    All customizations are done using code in the underlying R Markdown document. The embedded code provides complete control over the email, but does not impact the result of the rendered report. For example, a report about sales numbers could be set up to only email end users if a critical threshold is reached.

    Full examples are available in the RStudio Connect user guide.

    Other Updates
    • Historical Reports RStudio Connect currently allows users to view previously rendered reports. In RStudio Connect 1.6.6, when users are viewing a report with a history, they can open and share a link directly to the historical versions, or send an email including the historic content.

    • Instrumentation RStudio Connect 1.6.6 will track usage events and record information such as who uses the content, what content was used, and when content was viewed. We don’t provide access to this data yet, but in future releases, this information will be accessible to publishers to help answer questions like, “How many users viewed my application this month?”.

    • The usermanager alter command can now be used to manage whether a user is locked or unlocked. See the admin guide for details and other updates to the usermanager command.

    • User Listing in the Connect Server API The public Connect Server API now includes an endpoint to list user information. See the user guide for details.

    Security & Authentication Changes
    • Removing the “Anyone” Option New configuration options can be used to limit how widely publishers are allowed to distribute their content.

    • The People Tab In certain scenarios, it is undesirable for RStudio Connect viewers to be able to see the profiles of other RStudio Connect users. The Applications.UsersListingMinRole setting can now be used to prevent certain roles from seeing other profiles on the People tab. Users limited in this way will still see other user profiles in the content settings panel, but only for content they can access.

    • LDAP / Active Directory Changes RStudio Connect no longer relies on the distinguished name (DN) of a user. Existing installations will continue working, but administrators should use the new LDAP.UniqueIdAttribute to tell RStudio Connect which LDAP attribute identifies users.

    • A new HTTP.ForceSecure option is available, which sets the Secure flag on RStudio Connect browser cookies. This setting adds support for the Secure flag when RStudio Connect is used behind an HTTPS-terminating proxy. See the existing HTTPS.Permanent setting if you plan to use RStudio Connect to terminate HTTPS.

    Deprecations & Breaking Changes
    • Breaking Change In RStudio Connect 1.6.6, the --force flag in the usermanager alter command has been changed to --force-demoting.

    • Breaking Change All URLs referring to users and groups now use generated IDs in place of IDs that may have contained identifying information. Existing bookmarks to specific user or group pages may need to be updated, and pending account confirmation emails will need to be resent.

    • Applications.EnvironmentBlacklist is deprecated in favor of Applications.ProhibitedEnvironment, and LDAP.WhitelistedLoginGroups is deprecated in favor of LDAP.PermittedLoginGroups. Both settings will be removed in the next release.

    Please review the full release notes.

    Upgrade Planning

    If you use LDAP or Active Directory, please take note of the LDAP changes described above and in the release notes. Aside from the deprecations above, there are no other special considerations, and upgrading should take less than 5 minutes. If you’re upgrading from a release older than v1.6.4, be sure to consider the “Upgrade Planning” notes from the intervening releases, as well.

    If you haven’t yet had a chance to download and try RStudio Connect, we encourage you to do so. RStudio Connect is the best way to share all the work that you do in R (Shiny apps, R Markdown documents, plots, dashboards, Plumber APIs, etc.) with collaborators, colleagues, or customers.

    You can find more details or download a 45-day evaluation of the product at https://www.rstudio.com/products/connect/. Additional resources can be found below.


    To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Explaining Black-Box Machine Learning Models – Code Part 2: Text classification with LIME

    Thu, 07/26/2018 - 02:00

    (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

    This is code that will accompany an article that will appear in a special edition of a German IT magazine. The article is about explaining black-box machine learning models. In that article I’m showcasing three practical examples:

    1. Explaining supervised classification models built on tabular data using caret and the iml package
    2. Explaining image classification models with keras and lime
    3. Explaining text classification models with xgboost and lime

    • The first part has been published here.
    • The second part has been published here.

    Below, you will find the code for the third part: Text classification with lime.

    # data wrangling
    library(tidyverse)
    library(readr)

    # plotting
    library(ggthemes)
    theme_set(theme_minimal())

    # text prep
    library(text2vec)

    # ml
    library(caret)
    library(xgboost)

    # explanation
    library(lime)

    Text classification models

    Here I am using another Kaggle dataset: Women’s E-Commerce Clothing Reviews. The data contains a text review of different items of clothing, as well as some additional information, like rating, division, etc.

    In this example, I will use the review title and text in order to classify whether or not the item was liked. I am creating the response variable from the rating: every item rated with 5 stars is considered “liked” (1), the rest as “not liked” (0). I am also combining review title and text.

    clothing_reviews <- read_csv("/Users/shiringlander/Documents/Github/ix_lime_etc/Womens Clothing E-Commerce Reviews.csv") %>%
      mutate(Liked = as.factor(ifelse(Rating == 5, 1, 0)),
             text = paste(Title, `Review Text`),
             text = gsub("NA", "", text))

    ## Parsed with column specification:
    ## cols(
    ##   X1 = col_integer(),
    ##   `Clothing ID` = col_integer(),
    ##   Age = col_integer(),
    ##   Title = col_character(),
    ##   `Review Text` = col_character(),
    ##   Rating = col_integer(),
    ##   `Recommended IND` = col_integer(),
    ##   `Positive Feedback Count` = col_integer(),
    ##   `Division Name` = col_character(),
    ##   `Department Name` = col_character(),
    ##   `Class Name` = col_character()
    ## )

    glimpse(clothing_reviews)

    ## Observations: 23,486
    ## Variables: 13
    ## $ X1                        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11...
    ## $ `Clothing ID`             767, 1080, 1077, 1049, 847, 1080, 85...
    ## $ Age                       33, 34, 60, 50, 47, 49, 39, 39, 24, ...
    ## $ Title                     NA, NA, "Some major design flaws", "...
    ## $ `Review Text`             "Absolutely wonderful - silky and se...
    ## $ Rating                    4, 5, 3, 5, 5, 2, 5, 4, 5, 5, 3, 5, ...
    ## $ `Recommended IND`         1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, ...
    ## $ `Positive Feedback Count` 0, 4, 0, 0, 6, 4, 1, 4, 0, 0, 14, 2,...
    ## $ `Division Name`           "Initmates", "General", "General", "...
    ## $ `Department Name`         "Intimate", "Dresses", "Dresses", "B...
    ## $ `Class Name`              "Intimates", "Dresses", "Dresses", "...
    ## $ Liked                     0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, ...
    ## $ text                      " Absolutely wonderful - silky and s...

    Whether an item was liked or not will thus be my response variable or label for classification.

    clothing_reviews %>%
      ggplot(aes(x = Liked, fill = Liked)) +
        geom_bar(alpha = 0.8) +
        scale_fill_tableau(palette = "tableau20") +
        guides(fill = FALSE)

    Let’s split the data into train and test sets:

    set.seed(42)
    idx <- createDataPartition(clothing_reviews$Liked, 
                               p = 0.8, 
                               list = FALSE, 
                               times = 1)

    clothing_reviews_train <- clothing_reviews[ idx, ]
    clothing_reviews_test  <- clothing_reviews[-idx, ]

    Let's start simple

    The first text model I’m looking at has been built similarly to the example model in the help for lime::interactive_text_explanations().

    First, we need to prepare the data for modeling: we will need to convert the text to a document term matrix (dtm). There are different ways to do this. One is with the text2vec package.

    “Because of R’s copy-on-modify semantics, it is not easy to iteratively grow a DTM. Thus constructing a DTM, even for a small collections of documents, can be a serious bottleneck for analysts and researchers. It involves reading the whole collection of text documents into RAM and processing it as single vector, which can easily increase memory use by a factor of 2 to 4. The text2vec package solves this problem by providing a better way of constructing a document-term matrix.” https://cran.r-project.org/web/packages/text2vec/vignettes/text-vectorization.html

    Alternatives to text2vec would be tm + SnowballC or you could work with the tidytext package.
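    For comparison, here is a minimal sketch of what the tidytext route could look like (this is only an illustration, not the approach used below; using the X1 column as the document id is an assumption):

    library(dplyr)
    library(tidytext)

    # tokenize the reviews, drop English stop words, count words per review and
    # cast the counts into a document-term matrix
    dtm_tidy <- clothing_reviews_train %>%
      select(X1, text) %>%                     # X1 used here as the document id
      unnest_tokens(word, text) %>%            # one row per word per review
      anti_join(stop_words, by = "word") %>%   # tidytext's built-in stop word list
      count(X1, word) %>%
      cast_dtm(document = X1, term = word, value = n)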

    The itoken() function creates vocabularies (here stemmed words), from which we can create the dtm with the create_dtm() function.

    All preprocessing steps, starting from the raw text, need to be wrapped in a function that can then be passed to the lime::lime() function; this is only necessary if you want to use your model with lime.

    get_matrix <- function(text) {
      it <- itoken(text, progressbar = FALSE)
      create_dtm(it, vectorizer = hash_vectorizer())
    }

    Now, this preprocessing function can be applied to both training and test data.

    dtm_train <- get_matrix(clothing_reviews_train$text)
    str(dtm_train)

    ## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
    ##   ..@ i        : int [1:889012] 304 764 786 788 793 794 1228 2799 2819 3041 ...
    ##   ..@ p        : int [1:262145] 0 0 0 0 0 0 0 0 0 0 ...
    ##   ..@ Dim      : int [1:2] 18789 262144
    ##   ..@ Dimnames :List of 2
    ##   .. ..$ : chr [1:18789] "1" "2" "3" "4" ...
    ##   .. ..$ : NULL
    ##   ..@ x        : num [1:889012] 1 1 2 1 2 1 1 1 1 1 ...
    ##   ..@ factors  : list()

    dtm_test <- get_matrix(clothing_reviews_test$text)
    str(dtm_test)

    ## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
    ##   ..@ i        : int [1:222314] 2793 400 477 622 2818 2997 3000 4500 3524 2496 ...
    ##   ..@ p        : int [1:262145] 0 0 0 0 0 0 0 0 0 0 ...
    ##   ..@ Dim      : int [1:2] 4697 262144
    ##   ..@ Dimnames :List of 2
    ##   .. ..$ : chr [1:4697] "1" "2" "3" "4" ...
    ##   .. ..$ : NULL
    ##   ..@ x        : num [1:222314] 1 1 1 1 1 1 1 1 1 1 ...
    ##   ..@ factors  : list()

    And we use it to train a model with the xgboost package (just as in the example of the lime package).

    xgb_model <- xgb.train(list(max_depth = 7, eta = 0.1, objective = "binary:logistic",
                                eval_metric = "error", nthread = 1),
                           xgb.DMatrix(dtm_train, label = clothing_reviews_train$Liked == "1"),
                           nrounds = 50)

    Let’s try it on the test data and see how it performs:

    pred <- predict(xgb_model, dtm_test)
    confusionMatrix(clothing_reviews_test$Liked, as.factor(round(pred, digits = 0)))

    ## Confusion Matrix and Statistics
    ## 
    ##           Reference
    ## Prediction    0    1
    ##          0 1370  701
    ##          1  421 2205
    ## 
    ##                Accuracy : 0.7611
    ##                  95% CI : (0.7487, 0.7733)
    ##     No Information Rate : 0.6187
    ##     P-Value [Acc > NIR] : < 2.2e-16
    ## 
    ##                   Kappa : 0.5085
    ##  Mcnemar's Test P-Value : < 2.2e-16
    ## 
    ##             Sensitivity : 0.7649
    ##             Specificity : 0.7588
    ##          Pos Pred Value : 0.6615
    ##          Neg Pred Value : 0.8397
    ##              Prevalence : 0.3813
    ##          Detection Rate : 0.2917
    ##    Detection Prevalence : 0.4409
    ##       Balanced Accuracy : 0.7619
    ## 
    ##        'Positive' Class : 0
    ##

    Okay, not a perfect score but good enough for me – right now, I’m more interested in the explanations of the model’s predictions. For this, we need to run the lime() function and give it

    • the text input that was used to construct the model
    • the trained model
    • the preprocessing function
    explainer <- lime(clothing_reviews_train$text, xgb_model, preprocess = get_matrix)

    With this, we could right away call the interactive explainer Shiny app, where we can type any text we want into the field on the left and see the explanation on the right: words that are underlined green support the classification, red words contradict them.

    interactive_text_explanations(explainer)

    What happens in the background of the app can also be done explicitly by calling the explain() function and giving it

    • the test data (here the first four reviews of the test set)
    • the explainer defined with the lime() function
    • the number of labels we want to have explanations for (alternatively, you set the label by name)
    • and the number of features (in this case words) that should be included in the explanations

    We can plot them either with the plot_text_explanations() function, which gives an output like in the Shiny app, or with the regular plot_features() function.

    explanations <- lime::explain(clothing_reviews_test$text[1:4], explainer, n_labels = 1, n_features = 5)
    plot_text_explanations(explanations)

    [plot_text_explanations() output: an interactive HTML widget highlighting, for the first four test reviews, the words that support (green) or contradict (red) each prediction. “Absolutely wonderful…”: label 1 predicted (73.17%), explainer fit 1. “Some major design flaws…”: label 0 (71.45%), explainer fit 0.84. “Flattering shirt…”: label 1 (84.2%), explainer fit 0.98. “Pretty party dress with some issues…”: label 0 (56.99%), explainer fit 0.91.]

    plot_features(explanations)

    As we can see, our explanations contain a lot of stop-words that don’t really make much sense as features in our model. So…

    … let’s try a more complex example

    Okay, our model above works but there are still common words and stop words in our model that LIME picks up on. Ideally, we would want to remove them before modeling and keep only relevant words. This we can accomplish by using additional steps and options in our preprocessing function.

    It is important to know that whatever preprocessing we do with our text corpus, train and test data have to have the same features (i.e. words)! If we were to incorporate all the steps shown below into one function and call it separately on train and test data, we would end up with different words in our dtm and the predict() function wouldn’t work any more. In the simple example above, it works because we have been using the hash_vectorizer().

    Nevertheless, the lime::explain() function expects a preprocessing function that takes a character vector as input.

    How do we go about this? First, we will need to create the vocabulary just from the training data. To reduce the number of words to only the most relevant I am performing the following steps:

    • stem all words
    • remove stop-words
    • prune vocabulary
    • transform into vector space
    stem_tokenizer <- function(x) {
      lapply(word_tokenizer(x), SnowballC::wordStem, language = "en")
    }

    stop_words = tm::stopwords(kind = "en")

    # create pruned vocabulary
    vocab_train <- itoken(clothing_reviews_train$text, 
                          preprocess_function = tolower, 
                          tokenizer = stem_tokenizer,
                          progressbar = FALSE)

    v <- create_vocabulary(vocab_train, stopwords = stop_words)

    pruned_vocab <- prune_vocabulary(v, 
                                     doc_proportion_max = 0.99, 
                                     doc_proportion_min = 0.01)

    vectorizer_train <- vocab_vectorizer(pruned_vocab)

    This vector space can now be added to the preprocessing function, which we can then apply to both train and test data. Here, I am also transforming the word counts to tfidf values.

    # preprocessing function
    create_dtm_mat <- function(text, vectorizer = vectorizer_train) {
      vocab <- itoken(text, 
                      preprocess_function = tolower, 
                      tokenizer = stem_tokenizer,
                      progressbar = FALSE)
      dtm <- create_dtm(vocab, vectorizer = vectorizer)
      tfidf = TfIdf$new()
      fit_transform(dtm, tfidf)
    }

    dtm_train2 <- create_dtm_mat(clothing_reviews_train$text)
    str(dtm_train2)

    ## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
    ##   ..@ i        : int [1:415770] 26 74 169 294 588 693 703 708 727 759 ...
    ##   ..@ p        : int [1:506] 0 189 380 574 765 955 1151 1348 1547 1740 ...
    ##   ..@ Dim      : int [1:2] 18789 505
    ##   ..@ Dimnames :List of 2
    ##   .. ..$ : chr [1:18789] "1" "2" "3" "4" ...
    ##   .. ..$ : chr [1:505] "ad" "sandal" "depend" "often" ...
    ##   ..@ x        : num [1:415770] 0.177 0.135 0.121 0.17 0.131 ...
    ##   ..@ factors  : list()

    dtm_test2 <- create_dtm_mat(clothing_reviews_test$text)
    str(dtm_test2)

    ## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
    ##   ..@ i        : int [1:103487] 228 304 360 406 472 518 522 624 732 784 ...
    ##   ..@ p        : int [1:506] 0 53 113 151 186 216 252 290 323 360 ...
    ##   ..@ Dim      : int [1:2] 4697 505
    ##   ..@ Dimnames :List of 2
    ##   .. ..$ : chr [1:4697] "1" "2" "3" "4" ...
    ##   .. ..$ : chr [1:505] "ad" "sandal" "depend" "often" ...
    ##   ..@ x        : num [1:103487] 0.263 0.131 0.135 0.109 0.179 ...
    ##   ..@ factors  : list()

    And we will train another gradient boosting model:

    xgb_model2 <- xgb.train(params = list(max_depth = 10, eta = 0.2, objective = "binary:logistic",
                                          eval_metric = "error", nthread = 1),
                            data = xgb.DMatrix(dtm_train2, label = clothing_reviews_train$Liked == "1"),
                            nrounds = 500)

    pred2 <- predict(xgb_model2, dtm_test2)
    confusionMatrix(clothing_reviews_test$Liked, as.factor(round(pred2, digits = 0)))

    ## Confusion Matrix and Statistics
    ## 
    ##           Reference
    ## Prediction    0    1
    ##          0 1441  630
    ##          1  426 2200
    ## 
    ##                Accuracy : 0.7752
    ##                  95% CI : (0.763, 0.787)
    ##     No Information Rate : 0.6025
    ##     P-Value [Acc > NIR] : < 2.2e-16
    ## 
    ##                   Kappa : 0.5392
    ##  Mcnemar's Test P-Value : 4.187e-10
    ## 
    ##             Sensitivity : 0.7718
    ##             Specificity : 0.7774
    ##          Pos Pred Value : 0.6958
    ##          Neg Pred Value : 0.8378
    ##              Prevalence : 0.3975
    ##          Detection Rate : 0.3068
    ##    Detection Prevalence : 0.4409
    ##       Balanced Accuracy : 0.7746
    ## 
    ##        'Positive' Class : 0
    ##

    Unfortunately, this didn’t really improve the classification accuracy but let’s look at the explanations again:

    explainer2 <- lime(clothing_reviews_train$text, xgb_model2, preprocess = create_dtm_mat)
    explanations2 <- lime::explain(clothing_reviews_test$text[1:4], explainer2, n_labels = 1, n_features = 4)
    plot_text_explanations(explanations2)

    [plot_text_explanations() output for the second model, again highlighting supporting and contradicting words for the same four test reviews. “Absolutely wonderful…”: label 1 predicted (97.25%), explainer fit 0.94. “Some major design flaws…”: label 0 (94.56%), explainer fit 0.34. “Flattering shirt…”: label 1 (98.61%), explainer fit 0.62. “Pretty party dress with some issues…”: label 0 (98.61%), explainer fit 0.44.]

    The words that get picked up now make much more sense! So, even though making my model more complex didn’t improve “the numbers”, this second model is likely to be much better able to generalize to new reviews because it seems to pick up on words that make intuitive sense.

    That’s why I’m sold on the benefits of adding explainer functions to most machine learning workflows – and why I love the lime package in R!

    sessionInfo() ## R version 3.5.1 (2018-07-02) ## Platform: x86_64-apple-darwin15.6.0 (64-bit) ## Running under: macOS High Sierra 10.13.6 ## ## Matrix products: default ## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib ## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib ## ## locale: ## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] bindrcpp_0.2.2 lime_0.4.0 xgboost_0.71.2 caret_6.0-80 ## [5] lattice_0.20-35 text2vec_0.5.1 ggthemes_3.5.0 forcats_0.3.0 ## [9] stringr_1.3.1 dplyr_0.7.6 purrr_0.2.5 readr_1.1.1 ## [13] tidyr_0.8.1 tibble_1.4.2 ggplot2_3.0.0 tidyverse_1.2.1 ## ## loaded via a namespace (and not attached): ## [1] colorspace_1.3-2 class_7.3-14 rprojroot_1.3-2 ## [4] futile.logger_1.4.3 pls_2.6-0 rstudioapi_0.7 ## [7] DRR_0.0.3 SnowballC_0.5.1 prodlim_2018.04.18 ## [10] lubridate_1.7.4 xml2_1.2.0 codetools_0.2-15 ## [13] splines_3.5.1 mnormt_1.5-5 robustbase_0.93-1 ## [16] knitr_1.20 shinythemes_1.1.1 RcppRoll_0.3.0 ## [19] mlapi_0.1.0 jsonlite_1.5 broom_0.4.5 ## [22] ddalpha_1.3.4 kernlab_0.9-26 sfsmisc_1.1-2 ## [25] shiny_1.1.0 compiler_3.5.1 httr_1.3.1 ## [28] backports_1.1.2 assertthat_0.2.0 Matrix_1.2-14 ## [31] lazyeval_0.2.1 cli_1.0.0 later_0.7.3 ## [34] formatR_1.5 htmltools_0.3.6 tools_3.5.1 ## [37] NLP_0.1-11 gtable_0.2.0 glue_1.2.0 ## [40] reshape2_1.4.3 Rcpp_0.12.17 slam_0.1-43 ## [43] cellranger_1.1.0 nlme_3.1-137 blogdown_0.6 ## [46] iterators_1.0.9 psych_1.8.4 timeDate_3043.102 ## [49] gower_0.1.2 xfun_0.3 rvest_0.3.2 ## [52] mime_0.5 stringdist_0.9.5.1 DEoptimR_1.0-8 ## [55] MASS_7.3-50 scales_0.5.0 ipred_0.9-6 ## [58] hms_0.4.2 promises_1.0.1 parallel_3.5.1 ## [61] lambda.r_1.2.3 yaml_2.1.19 rpart_4.1-13 ## [64] stringi_1.2.3 foreach_1.4.4 e1071_1.6-8 ## [67] lava_1.6.2 geometry_0.3-6 rlang_0.2.1 ## [70] pkgconfig_2.0.1 evaluate_0.10.1 bindr_0.1.1 ## [73] labeling_0.3 recipes_0.1.3 htmlwidgets_1.2 ## [76] CVST_0.2-2 tidyselect_0.2.4 plyr_1.8.4 ## [79] magrittr_1.5 bookdown_0.7 R6_2.2.2 ## [82] magick_1.9 dimRed_0.1.0 pillar_1.2.3 ## [85] haven_1.1.2 foreign_0.8-70 withr_2.1.2 ## [88] survival_2.42-3 abind_1.4-5 nnet_7.3-12 ## [91] modelr_0.1.2 crayon_1.3.4 futile.options_1.0.1 ## [94] rmarkdown_1.10 grid_3.5.1 readxl_1.1.0 ## [97] data.table_1.11.4 ModelMetrics_1.1.0 digest_0.6.15 ## [100] tm_0.7-4 xtable_1.8-2 httpuv_1.4.4.2 ## [103] RcppParallel_4.4.0 stats4_3.5.1 munsell_0.5.0 ## [106] glmnet_2.0-16 magic_1.5-8 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Singularity as a software distribution / deployment tool

    Thu, 07/26/2018 - 02:00

    (This article was first published on mlampros, and kindly contributed to R-bloggers)

    In this blog post, I’ll explain how someone can take advantage of Singularity to make R or Python packages available as an image file to users. This is a necessity if the specific R or Python package is difficult to install across different operating systems, which makes the installation process cumbersome. Lately, I’ve utilized the reticulate package in R (it provides an interface between R and Python) and I realized firsthand how difficult it is, in some cases, to install R and Python packages and make them work nicely together in the same operating system. This blog post by no means presents the full potential of Singularity or containerization tools, such as Docker, but is mainly restricted to package distribution / deployment.

    Singularity can be installed on all 3 operating systems (Linux, Macintosh, Windows); however, the current status (as of July 2018) is that on Macintosh and Windows the user has to set up Vagrant and run Singularity from there (this might change in the near future).

    Singularity on Linux

    In the following lines I’ll make use of an Ubuntu cloud instance (the same steps can be accomplished on an Ubuntu Desktop with some exceptions) to explain how someone can download Singularity image files and run those images on Rstudio server (in case of R) or a Jupyter Notebook (in case of Python). I utilize Amazon Web Services (AWS), specifically an Ubuntu Server 16.04 on a t2.micro instance (1GB memory, 1 core); however, someone can follow the same procedure on Azure or Google Cloud (at least those are the two alternative cloud services I’m aware of) as well. I’ll skip the steps on how someone can set up an Ubuntu cloud instance, as it’s beyond the scope of this blog post (there are certainly many tutorials on the web for this purpose).

    Assuming someone uses the command line console, the first thing to do is to install the system requirements (in case of an Ubuntu Desktop upgrading the system should be skipped). Once the installation of the system requirements is finished the following folder should appear in the home directory,

    singularity

    R language Singularity image files

    My singularity_containers Github repository contains R and Python Singularity Recipes, which are used to build the corresponding containers. My Github repository is connected to my singularity-hub account and once a change is triggered (for instance, a push to my repository) a new / updated container build will be created. An updated build – for instance for the RGF package – can be pulled from singularity-hub in the following way,

    singularity pull --name RGF_r.simg shub://mlampros/singularity_containers:rgf_r

    This code line will create the RGF_r.simg image file in the home directory. One should now make sure that port 8787 is not used by another service / application by using,

    sudo netstat -plnt | fgrep 8787

    If this does not return anything, then one can proceed with,

    singularity run RGF_r.simg

    to run the image. If everything went ok and no errors occurred then by opening a second command line console and typing,

    sudo netstat -plnt | fgrep 8787

    one should observe that port 8787 is opened,

    tcp 0 0 0.0.0.0:8787 0.0.0.0:* LISTEN 23062/rserver

    The final step is to open a web-browser (chrome, firefox etc.) and give,

    • http://Public DNS (IPv4):8787            ( where “Public DNS (IPv4)” is specific to the Cloud instance you launched )

    or

    • http://0.0.0.0:8787            ( in case that someone uses Singularity locally )

    to launch the Rstudio-server and use the RGF package pre-installed with all requirements included (to stop the service use CTRL + C from the command line). I used RGF as an example here because, for me personally, it was somewhat cumbersome to install on my Windows machine.

    The same applies to the other two R singularity recipe files included in my singularity-hub account, i.e. mlampros/singularity_containers:nmslib_r and mlampros/singularity_containers:fuzzywuzzy_r.

    Python language Singularity image files

    The Python Singularity Recipe files which are also included in the same Github repository utilize port 8888 and follow a similar logic with the R files. The only difference is that when a user runs the image the sudo command is required (otherwise it will raise a permission error),

    singularity pull --name RGF_py.simg shub://mlampros/singularity_containers:rgf_python

    sudo singularity run RGF_py.simg

    The latter command will produce the following (example) output,

    The web-browser runs on localhost:8888

    [I 09:56:03.427 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
    [W 09:56:03.779 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
    [I 09:56:03.789 NotebookApp] Serving notebooks from local directory: /root
    [I 09:56:03.790 NotebookApp] The Jupyter Notebook is running at:
    [I 09:56:03.790 NotebookApp] http://(ip-172-31-21-76 or 127.0.0.1):8888/?token=1fc90f01247498dac8d24ac918fe8da57fa46ee9e98eea4f
    [I 09:56:03.790 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [C 09:56:03.790 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token:
        http://(ip-172-31-21-76 or 127.0.0.1):8888/?token=1fc90f01247498dac8d24ac918fe8da57fa46ee9e98eea4f
    .......

    In the same way as before the user should open a web-browser and give either,

    • http://Public DNS (IPv4):8888            ( where “Public DNS (IPv4)” is specific to the Cloud instance you launched )

    or

    • http://127.0.0.1:8888            ( in case that someone uses Singularity locally )

    When someone connects for the first time to the Jupyter notebook then he / she has to give the output token as the password. For instance, based on the previous example output the token password would be 1fc90f01247498dac8d24ac918fe8da57fa46ee9e98eea4f.

    I also included an .ipynb file which can be loaded to the Jupyter notebook to test the rgf_python package.

    The same applies to the other two Python singularity recipe files included in my singularity-hub account, i.e. mlampros/singularity_containers:nmslib_python and mlampros/singularity_containers:fuzzywuzzy_python.

    Final words

    If someone intends to add authentication to the Singularity recipe files then valuable resources can be found in the https://github.com/nickjer/singularity-rstudio Github repository, on which my Rstudio-server recipes are heavily based.

    An updated version of singularity_containers can be found in my Github repository and to report bugs   /   issues please use the following link, https://github.com/mlampros/singularity_containers/issues.

    References :

    • https://github.com/singularityhub
    • https://vsoch.github.io/
    • https://github.com/nickjer/singularity-rstudio
    • https://github.com/nickjer/singularity-r
    • https://bwlewis.github.io/r-and-singularity/

    To leave a comment for the author, please follow the link and comment on their blog: mlampros. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    rOpenSci Educators Collaborative: How Can We Develop a Community of Innovative R Educators?

    Thu, 07/26/2018 - 02:00

    (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

    tl;dr: we propose three calls to action:

    1. Share your curricular materials in the open.
    2. Participate in the rOpenSci Education profile series.
    3. Discuss with us how you want to be involved in rOpenSci Educators’ Collaborative.

    In previous posts in this series, we identified challenges that individual instructors typically face when teaching science with R, and shared characteristics of effective educational resources to help address these challenges. However, the toughest challenges that educators in this area face are human, rather than technological. Our shared experiences highlight the need for a strong community of innovative R educators. However, this community is currently not well-connected or easily discoverable.



    In this final post, we propose a framework for developing an rOpenSci Educators’ Collaborative: a community of practice for people interested and engaged in science education using R. A community of practice is defined as a group of skilled practitioners who share a common passion, and who learn how to do it better by interacting with each other regularly. To develop this kind of community, we need to go beyond shared domains of interest in R and education. Here, we’d like to invite educators to join us in discussing how to put the other ingredients in place.

    First, we need a community, a place to share information and build relationships. It is important to note that while we recognize (and envy!) The Carpentries and the strong education community of practice that has been developed around their workshops for teaching foundational coding and data science skills, we believe there is a place for a community of science educators who teach with R, but who may not identify as someone who teaches coding or programming skills. Currently, one of the main challenges for these educators is the availability and discoverability of course materials. There is an abundance of tutorials and a lack of curricular resources. For example, there are many individual tutorials, focused workshops or bootcamp materials, online courses, and lesson plan modules. All of these are great resources, and can be integrated within a long-form course curriculum. However, while it is helpful to leverage existing materials, adapting these materials in the classroom still requires preparation time. To use them successfully, educators usually need to think carefully about the dependencies of each individual resource (i.e., packages used, assumed prior R knowledge, assumed content domain knowledge), and how it fits into the larger temporal sequence of course materials and learning objectives. It also can be difficult to teach a class or lead a lab based on others’ materials, written in someone else’s voice, and educators typically need to adapt the data or code to better meet their class’ needs or skill level.

    On the other hand, there is a lack of integrated long-form resources, where educators share full-length syllabi along with R-based learning materials. There are many reasons why these materials may not be open: lack of instructor confidence in sharing their teaching materials, lack of knowledge about tools and platforms for public sharing, and lack of support from institutions and departments. We would like to invite educators to begin to share their curricular materials, and offer a few recommended resources:

    Even if they are made open, the problem remains of how to discover their availability; current methods tend to be time-dependent like sharing links in tweets or posts on community forums. If stored in a GitHub repository with the teaching and rstats topics added, one can search GitHub for these topics: currently there are 27 of them!
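    As a rough sketch of how such a topic search could be scripted from R (the query string is an assumption, and the count will of course change over time):

    library(httr)
    library(jsonlite)

    # query the GitHub search API for repositories tagged with both topics
    resp <- GET("https://api.github.com/search/repositories",
                query = list(q = "topic:teaching topic:rstats"))
    found <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

    found$total_count            # number of repositories carrying both topics
    head(found$items$full_name)  # the first few repository names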

    To address this larger challenge, we propose a new rOpenSci Education profile series, which will include interviews with practicing educators to share their own resources, tools, strategies, and experiences in the classroom. We hope that beginning this dialogue about education with educators will be the beginning of an rOpenSci Educators’ Collaborative. We hope that this collaborative will provide a space for new and experienced educators to share what works and what doesn’t. Ideally, in the words of one expert, this community of practice would work to “develop a shared repertoire of resources: experiences, stories, tools, ways of addressing recurring problems.”

    rOpenSciEd – A Community for those who Educate with R https://t.co/PE26T38lTS#runconf18 #rstats

    — rOpenSci (@rOpenSci) May 22, 2018


    To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    New Course: Structural Equation Modeling with lavaan in R

    Wed, 07/25/2018 - 19:47

    (This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

    Here is the course link.

    Course Description

    When working with data, we often want to create models to predict future events, but we also want an even deeper understanding of how our data is connected or structured. In this course, you will explore the connectedness of data using structural equation modeling (SEM) with the R programming language and the lavaan package. SEM will introduce you to latent and manifest variables and how to create measurement models, assess measurement model accuracy, and fix poor fitting models. During the course, you will explore classic SEM datasets, such as the Holzinger and Swineford (1939) and Bollen (1989) datasets. You will also work through a multi-factor model case study using the Wechsler Adult Intelligence Scale. Following this course, you will be able to dive into your data and gain a much deeper understanding of how it all fits together.

    Chapter 1: One-Factor Models (Free)

    In this chapter, you will dive into creating your first structural equation model with lavaan. You will learn important terminology and how to build and run models. You will create a one-factor model of mental test abilities using the classic Holzinger and Swineford (1939) dataset.
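    As a flavour of what such a one-factor model looks like, here is a minimal sketch (not the course’s exact code; the choice of the three visual items as indicators is an assumption):

    library(lavaan)

    # one latent factor (visual) measured by three manifest indicators
    model <- 'visual =~ x1 + x2 + x3'
    fit <- cfa(model, data = HolzingerSwineford1939)   # dataset shipped with lavaan
    summary(fit, fit.measures = TRUE, standardized = TRUE)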

    Chapter 2: Multi-Factor Models

    In this chapter, you will expand your skills in lavaan to creating multi-factor models. We will improve the one-factor models from the last chapter by creating multiple latent variables in the classic Holzinger and Swineford (1939) dataset.

    Chapter 3: Troubleshooting Model Errors and Diagrams

    Structural equation models do not always run smoothly, and in this chapter, you will learn how to troubleshoot Heywood cases, which are common errors. You will also learn how to diagram your model in R using the semPlot library.

    Chapter 4: Full Example and an Extension

    This chapter examines the WAIS-III IQ Scale and its structural properties. You will use your skills from the first three chapters to create various models of the WAIS-III, troubleshoot errors in those models, and create diagrams of the final model.

    Prerequisites

    To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    New Course: Experimental Design in R

    Wed, 07/25/2018 - 19:40

    (This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

    Here is the course link.

    Course Description

    Experimental design is a crucial part of data analysis in any field, whether you work in business, health or tech. If you want to use data to answer a question, you need to design an experiment! In this course you will learn about basic experimental design, including block and factorial designs, and commonly used statistical tests, such as t-tests and ANOVAs. You will use built-in R data and real world datasets including the CDC NHANES survey, SAT Scores from NY Public Schools, and Lending Club Loan Data. Following the course, you will be able to design and analyze your own experiments!
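    As a small taste of the kind of analysis covered, here is a toy sketch on simulated data (not course code):

    set.seed(123)

    # simulate a simple two-group A/B experiment
    ab <- data.frame(
      group   = rep(c("A", "B"), each = 100),
      outcome = c(rnorm(100, mean = 10), rnorm(100, mean = 10.5))
    )

    t.test(outcome ~ group, data = ab)         # two-sample t-test
    summary(aov(outcome ~ group, data = ab))   # the equivalent one-way ANOVA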

    Chapter 1: Introduction to Experimental Design (FREE)

    An introduction to key parts of experimental design plus some power and sample size calculations.

    Chapter 2: Basic Experiments

    Explore the Lending Club dataset plus build and validate basic experiments, including an A/B test.

    Chapter 3: Randomized Complete (& Balanced Incomplete) Block Designs

    Use the NHANES data to build an RCBD and BIBD experiment, including model validation and design tips to make sure the BIBD is valid.

    Chapter 4: Latin Squares, Graeco-Latin Squares, & Factorial experiments

    Evaluate the NYC SAT scores data and deal with its missing values, then evaluate Latin Square, Graeco-Latin Square, and Factorial experiments.

    Prerequisites

    To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    New Project: Visualizing Inequalities in Life Expectancy

    Wed, 07/25/2018 - 19:36

    (This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

    Here is the project link.

    Project Description

    Do women live longer than men? How long? Does it happen everywhere? Is life expectancy increasing? Everywhere? Which is the country with the lowest life expectancy? Which is the one with the highest? In this Project, you will answer all these questions by manipulating and visualizing United Nations life expectancy data using ggplot2. We recommend that you have completed Introduction to the Tidyverse and Chapter 2 of Cleaning Data in R prior to starting this Project.

    The dataset can be found here and contains the average life expectancies of men and women by country (in years). It covers four periods: 1985-1990, 1990-1995, 1995-2000, and 2000-2005.
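    As a rough sketch of the kind of comparison the project builds (the file name and the column names Country.or.Area, Subgroup, Year and Value are assumptions about the UN data, not the project’s actual code):

    library(tidyverse)

    # read a local copy of the UN data (hypothetical file name), keep one period,
    # and put male and female life expectancy side by side
    life_expectancy <- read.csv("UNdata.csv") %>%
      filter(Year == "2000-2005") %>%
      select(Country.or.Area, Subgroup, Value) %>%
      spread(Subgroup, Value)

    ggplot(life_expectancy, aes(x = Male, y = Female)) +
      geom_point() +
      geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
      labs(title = "Life expectancy by gender, 2000-2005")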


    To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Hi Pawel, I’m glad you enjoyed it.

    Wed, 07/25/2018 - 17:03

    (This article was first published on Stories by Matt.0 on Medium, and kindly contributed to R-bloggers)

    Hi Pawel, I’m glad you enjoyed it. I was trying to play around with facet_grid() earlier but I guess I didn’t stumble upon the proper parameters. Your suggestion works perfectly; not only does it keep each grid x-axis width proportional to its length, but it also keeps appropriate space-between-variables. Thank you for sharing that!


    To leave a comment for the author, please follow the link and comment on their blog: Stories by Matt.0 on Medium. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Gender diversity in the film industry

    Wed, 07/25/2018 - 11:09

    (This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

    The year 2017 has completely turned the film industry upside down. The allegations of harassment and sexual assault against Harvey Weinstein have raised the issue of sexism and misogyny in this industry to the eyes of the general public. In addition, it has helped raise awareness of the poor gender diversity and under-representation of women in Hollywood. One of the main problems posed by the low presence of women behind the camera is that this is then reflected in the fictional characters on screen: lots of movies portray women in an incomplete, stereotyped and biased way.

    This post focuses on some key behind-the-camera roles to measure the evolution of gender diversity in the last decade – from 2007 until 2017. The roles I studied were: directors, writers, producers, sound teams, music teams, art teams, makeup teams and costume teams.

    The whole code to reproduce the following results is available on GitHub.

    Data frame creation – Web scraping

    What I needed first was a list which gathered the names connected to film job roles for 50 movies. For each year between 2007 and 2017, I gathered the information about the 50 most profitable movies of the year from the IMDb website.

    As a first step, I built data frames which contained the titles of these movies, their gross profit and their IMDb crew links – which shows the names and roles of the whole movie crew. The following code is aimed at building the corresponding data frame for the 50 most profitable movies of 2017.

    # IMDB TOP US GROSSING 2017: 50 MOST PROFITABLE MOVIES OF 2017 -------------
    url <- "https://www.imdb.com/search/title?release_date=2017-01-01,2017-12-31&sort=boxoffice_gross_us,desc"
    page <- read_html(url)

    # Movies details
    movie_nodes <- html_nodes(page, '.lister-item-header a')
    movie_link <- sapply(html_attrs(movie_nodes), `[[`, 'href')
    movie_link <- paste0("http://www.imdb.com", movie_link)
    movie_crewlink <- gsub("[?]", "fullcredits?", movie_link) # Full crew links
    movie_name <- html_text(movie_nodes)
    movie_year <- rep(2017, 50)
    movie_gross <- html_nodes(page, '.sort-num_votes-visible span:nth-child(5)') %>% 
      html_text()

    # CREATE DATAFRAME: TOP 2017 ----------------------------------------------
    top_2017 <- data.frame(movie_name, movie_year, movie_gross, movie_crewlink, 
                           stringsAsFactors = FALSE)

    Let’s have a look at the top_2017 data frame:

    ## movie_name                               movie_year movie_gross
    ## 1 Star Wars: Episode VIII - The Last Jedi 2017       $620.18M
    ## 2 Beauty and the Beast                    2017       $504.01M
    ## 3 Wonder Woman                            2017       $412.56M
    ## 4 Jumanji: Welcome to the Jungle          2017       $404.26M
    ## 5 Guardians of the Galaxy: Vol. 2         2017       $389.81M
    ## 6 Spider-Man Homecoming                   2017       $334.20M
    ## movie_crewlink
    ## 1 http://www.imdb.com/title/tt2527336/fullcredits?ref_=adv_li_tt
    ## 2 http://www.imdb.com/title/tt2771200/fullcredits?ref_=adv_li_tt
    ## 3 http://www.imdb.com/title/tt0451279/fullcredits?ref_=adv_li_tt
    ## 4 http://www.imdb.com/title/tt2283362/fullcredits?ref_=adv_li_tt
    ## 5 http://www.imdb.com/title/tt3896198/fullcredits?ref_=adv_li_tt
    ## 6 http://www.imdb.com/title/tt2250912/fullcredits?ref_=adv_li_tt

    I adapted the previous code in order to build equivalent data frames for the past 10 years. I then had 11 data frames: top2017, top2016, …, top2007, which gathered the names, years, gross profit and crew links of the 50 most profitable movies of each year.

    I combined these 11 data frames into one data frame called top_movies.
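    Assuming the yearly data frames are named top_2007 to top_2017 as in the code above, one way to stack them into top_movies is:

    # stack the 11 yearly data frames into a single data frame
    top_movies <- dplyr::bind_rows(
      top_2007, top_2008, top_2009, top_2010, top_2011, top_2012,
      top_2013, top_2014, top_2015, top_2016, top_2017
    )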

    List creation – Web scraping

    After that, I had a data frame with 550 rows, and I next needed to build a list which gathered:

    • the years from 2007 to 2017
    • for each year, the names of the top 50 grossing movies corresponding
    • for each movie, the names of the people whose job was included in one of the categories I listed above (director, writer, costume teams)

    In order to build this list, I navigated through all the IMDb full crew web pages stored in our top_movies data frame, and did some web scraping again to gather the information listed above.

    movies_list <- list()

    for (r in seq_len(nrow(top_movies))) {

      # FOCUS ON EACH MOVIE -----------------------------------------------------------------
      movie_name <- top_movies[r, "movie_name"]
      movie_year <- as.character(top_movies[r, "movie_year"])
      page <- read_html(as.character(top_movies[r, "movie_crewlink"]))

      # GATHER THE CREW NAMES FOR THIS MOVIE ------------------------------------------------
      movie_allcrew <- html_nodes(page, '.name , .dataHeaderWithBorder') %>%
        html_text()
      movie_allcrew <- gsub("[\n]", "", movie_allcrew) %>%
        trimws() #Remove white spaces

      # SPLIT THE CREW NAMES BY CATEGORY ----------------------------------------------------
      movie_categories <- html_nodes(page, '.dataHeaderWithBorder') %>%
        html_text()
      movie_categories <- gsub("[\n]", "", movie_categories) %>%
        trimws() #Remove white spaces

      ## MUSIC DEPARTMENT -------------------------------------------------------------------
      movie_music <- c()
      for (i in 1:(length(movie_allcrew)-1)){
        if (grepl("Music by", movie_allcrew[i])){
          j <- 1
          while (! grepl(movie_allcrew[i], movie_categories[j])){
            j <- j+1
          }
          k <- i+1
          while (! grepl(movie_categories[j+1], movie_allcrew[k])){
            movie_music <- c(movie_music, movie_allcrew[k])
            k <- k+1
          }
        }
      }

      for (i in 1:(length(movie_allcrew)-1)){
        if (grepl("Music Department", movie_allcrew[i])){
          j <- 1
          while (! grepl(movie_allcrew[i], movie_categories[j])){
            j <- j+1
          }
          k <- i+1
          while (! grepl(movie_categories[j+1], movie_allcrew[k])){
            movie_music <- c(movie_music, movie_allcrew[k])
            k <- k+1
          }
        }
      }

      if (length(movie_music) == 0){
        movie_music <- c("")
      }

      ## IDEM FOR OTHER CATEGORIES ---------------------------------------------------------

      ## MOVIE_INFO CONTAINS THE MOVIE CREW NAMES ORDERED BY CATEGORY ----------------------
      movie_info <- list()
      movie_info$directors <- movie_directors
      movie_info$writers <- movie_writers
      movie_info$producers <- movie_producers
      movie_info$sound <- movie_sound
      movie_info$music <- movie_music
      movie_info$art <- movie_art
      movie_info$makeup <- movie_makeup
      movie_info$costume <- movie_costume

      ## MOVIES_LIST GATHERS THE INFORMATION FOR EVERY YEAR AND EVERY MOVIE ----------------
      movies_list[[movie_year]][[movie_name]] <- movie_info
    }

    Here are some of the names I collected:

    ## - Star Wars VIII 2017, Director:
    ## Rian Johnson

    ## - Sweeney Todd 2007, Costume team:
    ## Colleen Atwood, Natasha Bailey, Sean Barrett, Emma Brown, Charlotte Child, Charlie Copson, Steve Gell, Liberty Kelly, Colleen Kelsall, Linda Lashley, Rachel Lilley, Cavita Luchmun, Ann Maskrey, Ciara McArdle, Sarah Moore, Jacqueline Mulligan, Adam Roach, Sunny Rowley, Jessica Scott-Reed, Marcia Smith, Sophia Spink, Nancy Thompson, Suzi Turnbull, Dominic Young, Deborah Ambrosino, David Bethell, Mariana Bujoi, Mauricio Carneiro, Sacha Chandisingh, Lisa Robinson

    Gender determination

    All of the names I needed to measure the gender diversity of were now gathered in the list movies_list. Then, I had to determine the gender of almost 275,000 names. This is what the R package GenderizeR does: “The genderizeR package uses genderize.io API to predict gender from first names”. At the moment, the genderize.io database contains 216286 distinct names across 79 countries and 89 languages. The data is collected from social networks from all over the world, which ensures the diversity of origins.

    However, I am aware that determining genders based on names is not an ideal solution: some names are unisex, some people do not recognise themselves as male or female, and some transitioning transgender people still have their former name. But this solution was the only option I had, and as I worked on about 275,000 names, I assumed that the error induced by the cases listed above was not going to have a big impact on my results.

    With this in mind, I used the GenderizeR package and applied its main function on the lists of names I gathered earlier in movies_list. The function genderizeAPI checks if the names tested are included in the genderize.io database and returns:

    • the gender associated with the first name tested
    • the counts of this first name in database
    • the probability of gender given the first name tested.

    The attribute I was interested in was obviously the first one, the gender associated with the first name tested.
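    As a minimal illustration, a single lookup mirrors the call used further below (the API key here is a placeholder):

    library(genderizeR)

    # look up one first name; the returned object contains the response fields
    # described above (gender, counts and probability)
    res <- genderizeAPI(x = "Rian", apikey = "your-api-key")
    res$response$gender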

    The aim was to focus on every category of jobs, and to count the number of males and females by category, film and year. With the script below, here is the information I added to each object movies_list$year$film:

    • the number of male directors
    • the number of female directors
    • the number of male producers
    • the number of female producers
    • the number of males in costume team
    • the number of females in costume team

    The following code shows how I determined the gender of the directors’ names for every film in the movie_list. The code is similar for all the other categories.

    # for each year
    for (y in seq_along(movies_list)){
      # for each movie
      for (i in seq_along(movies_list[[y]])){

        # Genderize directors -----------------------------------------------------
        directors <- movies_list[[y]][[i]]$directors

        if (directors == ""){
          directors_gender <- list()
          directors_gender$male <- 0
          directors_gender$female <- 0
          movies_list[[y]][[i]]$directors_gender <- directors_gender
        } else{
          # Split the firstnames and the lastnames
          # Keep the firstnames
          directors <- strsplit(directors, " ")
          l <- c()
          for (j in seq_along(directors)){
            l <- c(l, directors[[j]][1])
          }
          directors <- l

          movie_directors_male <- 0
          movie_directors_female <- 0

          # Genderize every firstname and count the number of males and females
          for (p in seq_along(directors)){
            directors_gender <- genderizeAPI(x = directors[p], apikey = "233b284134ae754d9fc56717fec4164e")
            gender <- directors_gender$response$gender
            if (length(gender)>0 && gender == "male"){
              movie_directors_male <- movie_directors_male + 1
            }
            if (length(gender)>0 && gender == "female"){
              movie_directors_female <- movie_directors_female + 1
            }
          }

          # Put the number of males and females in movies_list
          directors_gender <- list()
          directors_gender$male <- movie_directors_male
          directors_gender$female <- movie_directors_female
          movies_list[[y]][[i]]$directors_gender <- directors_gender
        }

        # Idem for the 7 other categories -----------------------------------------------------
      }
    }

    Here are some examples of the number of male and female names I collected:

    ## - Star Wars VIII 2017
    ## Number of male directors: 1
    ## Number of female directors: 0
    ## - Sweeney Todd 2007
    ## Number of male in costume team: 9
    ## Number of female in costume team: 20

    Percentages calculation

    Once I had gathered all the gender information listed above, the next step was to calculate percentages by year. I went through the whole list movies_list and created a data frame called percentages, which gathers the percentage of women in each job category for each year.
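    The post does not show this aggregation step, so here is a minimal sketch of how it could be done for the directors category. It assumes, as the movies_list$year$film notation above suggests, that the names of movies_list are the years themselves; the seven other categories would follow the same pattern.

    # Minimal sketch (not the original code): percentage of women among directors per year
    percentages <- data.frame(year = integer(0), women_directors = numeric(0))

    for (y in seq_along(movies_list)){
      males <- 0
      females <- 0
      for (i in seq_along(movies_list[[y]])){
        males <- males + movies_list[[y]][[i]]$directors_gender$male
        females <- females + movies_list[[y]][[i]]$directors_gender$female
      }
      percentages <- rbind(percentages,
                           data.frame(year = as.integer(names(movies_list)[y]),
                                      women_directors = 100 * females / (males + females)))
    }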

    Let’s have a look at the percentages data frame:

    ##    year women_directors women_writers women_producers women_sound
    ## 1  2017        3.571429      9.386282        23.03030    14.17497
    ## 2  2016        3.174603      9.174312        19.04762    14.02918
    ## 3  2015        6.000000     12.432432        21.19914    15.69061
    ## 4  2014        1.785714      8.041958        23.12634    14.89028
    ## 5  2013        1.886792     10.769231        22.86282    13.54005
    ## 6  2012        5.357143     10.227273        24.06542    12.33696
    ## 7  2011        3.846154      9.523810        19.73392    15.08410
    ## 8  2010        0.000000     10.526316        17.40088    16.06700
    ## 9  2009        7.407407     13.157895        21.24711    15.30185
    ## 10 2008        7.547170      9.756098        18.67612    14.70588
    ## 11 2007        3.333333      9.047619        17.42243    16.13904
    ##    year women_music women_art women_makeup women_costume
    ## 1  2017    22.46998  26.87484     68.22204      69.89796
    ## 2  2016    25.84896  25.04481     67.54386      69.44655
    ## 3  2015    20.46163  24.90697     68.83117      70.83333
    ## 4  2014    22.86967  22.31998     67.29508      67.47430
    ## 5  2013    20.46482  22.45546     63.88697      69.79495
    ## 6  2012    21.62819  20.90395     66.95402      68.83539
    ## 7  2011    18.09816  20.22792     70.09482      67.44548
    ## 8  2010    20.90137  22.38199     65.81118      68.72082
    ## 9  2009    19.15734  22.14386     61.15619      70.25948
    ## 10 2008    19.82984  21.80974     60.87768      71.20253
    ## 11 2007    19.64385  20.21891     59.23310      67.36035

    Visualisation – gender diversity in 2017

    I was then able to visualise these percentages. For example, here is the code I used to visualise the gender diversity in 2017.

    library(ggplot2)      # plotting
    library(RColorBrewer) # brewer.pal() colour palettes

    # Formatting our data frame
    percentages_t <- data.frame(t(percentages), stringsAsFactors = FALSE)
    colnames(percentages_t) <- percentages_t[1, ]
    percentages_t <- percentages_t[-1, ]
    rownames(percentages_t) <- c("directors", "writers", "producers", "sound",
                                 "music", "art", "makeup", "costume")

    # Plotting our barplot
    percentages_2017 <- percentages_t$`2017`
    y <- as.matrix(percentages_2017)

    p <- ggplot(percentages_t, aes(x = rownames(percentages_t),
                                   y = percentages_2017,
                                   fill = rownames(percentages_t))) +
      geom_bar(stat = "identity") +
      coord_flip() + # horizontal bar plot
      geom_text(aes(label = format(y, digits = 2)), hjust = -0.1, size = 3.5) + # percentages next to bars
      theme(axis.text.y = element_blank(),
            axis.ticks.y = element_blank(),
            axis.title.y = element_blank(),
            legend.title = element_blank(),
            plot.title = element_text(hjust = 0.5)) + # center the title
      labs(title = "Percentages of women in the film industry in 2017") +
      guides(fill = guide_legend(reverse = TRUE)) + # reverse the order of the legend
      scale_fill_manual(values = brewer.pal(8, "Spectral")) # palette used to fill the bars and legend boxes

    As we can see, in 2017 the behind-the-camera roles of director and writer show the lowest representation of women: less than 10% for writers and less than 4% for directors. This is particularly worrying given that these are key roles which determine how women are portrayed in front of the camera. Some studies have already shown that the more gender-diverse these roles are, the more gender diversity is shown on screen.

    Let’s go back to our barplot. Women are also under-represented in sound teams (14%), music teams (22.5%), producer roles (23%) and art teams (27%). The only jobs which seem open to women are the stereotypically female ones of make-up artist and costume designer, in which almost 70% of the roles are held by women.

    Visualisation – gender diversity evolution through the last decade

    Even if the 2017 results are not encouraging, I wanted to know whether there had been any improvement over the last decade. Here is how I visualised the evolution.

    library(tidyr)     # gather()
    library(dplyr)     # %>% pipe, bind_rows() used later
    library(lubridate) # ymd()

    # From wide to long data frame
    colnames(percentages) <- c("year", "directors", "writers", "producers", "sound",
                               "music", "art", "makeup", "costume")
    percentages_long <- percentages %>% gather(key = category, value = percentage, -year)
    percentages_long$year <- ymd(percentages_long$year, truncated = 2L) # year as date

    # Line plot
    evolution_10 <- ggplot(percentages_long, aes(x = year, y = percentage,
                                                 group = category, colour = category)) +
      geom_line(size = 2) +
      theme(panel.grid.minor.x = element_blank(),
            plot.title = element_text(hjust = 0.5)) + # center the title
      scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
      scale_color_manual(values = brewer.pal(8, "Set1")) +
      labs(title = "Percentages of women in the film industry from 2007 to 2017",
           x = "", y = "Percentages")

    The first thing I noticed is that the representation gap between make-up artists and costume designers on one side and all the other roles on the other has not noticeably narrowed since 2007.

    In addition, no improvement has been achieved in the roles where women are most under-represented: directors, writers and sound-related jobs.

    If we focus on directors, we do not see any clear trend: the figures vary from year to year. In 2010, for example, there were no female directors at all among the 50 most profitable movies, and in no other year does the percentage exceed 7.5%. Interestingly, the best levels of female representation among directors were reached in 2008 and 2009; since then, the percentage of female directors has declined and has never risen above 6%. The percentage of women directors in 2017 is practically the same as in 2007.

    We also notice how flat the lines for writers and sound teams are: women have consistently represented around 10% of writers and 15% of sound teams over the last decade, with no sign of improvement.

    Only a slight improvement of 3-5 percentage points is visible for producers, music teams and art teams; nothing remarkable.

    Visualisation – gender diversity forecasting in 2018

    The last step of the study was to forecast, at a basic level, these percentages for 2018. I used the forecast package, fitting a model to the data collected between 2007 and 2017 and forecasting one year ahead, in order to get this prediction:

    library(forecast) # auto.arima(), forecast()

    # Time series (make sure rows run from 2007 to 2017 before building it)
    percentages <- percentages[order(percentages$year), ]
    ts <- ts(percentages, start = 2007, end = 2017, frequency = 1)

    # Auto forecast directors 2018
    arma_fit_director <- auto.arima(ts[, 2])
    arma_forecast_director <- forecast(arma_fit_director, h = 1)
    dir_2018 <- arma_forecast_director$mean[1] # one-step-ahead value predicted for 2018

    # Idem for writers, producers, sound, music, art, makeup and costume

    # Create a data frame for 2018 fitted values
    percentages_2018 <- data.frame(year = ymd(2018, truncated = 2L),
                                   women_directors = dir_2018,
                                   women_writers = writ_2018,
                                   women_producers = prod_2018,
                                   women_sound = sound_2018,
                                   women_music = music_2018,
                                   women_art = art_2018,
                                   women_makeup = makeup_2018,
                                   women_costume = costu_2018,
                                   stringsAsFactors = FALSE)

    # Values from 2007 to 2017 + 2018 fitted values
    percentages_fitted_2018 <- bind_rows(percentages, percentages_2018)

    # From wide to long data frame
    colnames(percentages_fitted_2018) <- c("year", "directors", "writers", "producers",
                                           "sound", "music", "art", "makeup", "costume")
    percentages_long_f2018 <- percentages_fitted_2018 %>%
      gather(key = category, value = percentage, -year)
    percentages_long_f2018$year <- ymd(percentages_long_f2018$year, truncated = 2L) # year as date

    # Forecast plot for 2018
    forecast_2018 <- ggplot(percentages_long_f2018, aes(x = year, y = percentage,
                                                        group = category, colour = category)) +
      geom_line(size = 2) +
      theme(panel.grid.minor.x = element_blank(),
            plot.title = element_text(hjust = 0.5)) + # center the title
      scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
      scale_color_manual(values = brewer.pal(8, "Set1")) +
      labs(title = "Percentages of women in the film industry from 2007 to 2017\n Fitted values for 2018",
           x = "", y = "Percentages")

    The predicted values I obtained for 2018 are approximately the same as the ones calculated for 2017. However, this is a basic forecast, and it does not take into account the upheaval that shook the film industry in 2017. That upheaval will surely have an impact on gender diversity in the industry, but to what extent? Has the raised awareness been enough to bring about real change?
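    As a side note that is not part of the original analysis, the forecast object computed above also stores 80% and 95% prediction intervals. Printing them is a quick way to see how uncertain a one-step-ahead forecast based on only eleven annual observations really is.

    # Illustrative only: inspect the uncertainty around the directors forecast
    arma_forecast_director       # point forecast for 2018 together with 80% and 95% intervals
    arma_forecast_director$lower # lower bounds of the prediction intervals
    arma_forecast_director$upper # upper bounds of the prediction intervals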

    In any case, I sincerely hope that this forecast is wrong and that steady improvement will be seen over the next few years, so that female characters on cinema screens become more interesting and complex.


    To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.

    The Revamped bookdown.org Website

    Wed, 07/25/2018 - 02:00

    (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

    Since we announced the bookdown package in 2016, a large number of books, reports, notes, and tutorials have been written with this package and published to https://bookdown.org. We were excited to see that! At the same time, however, maintaining the list of books on bookdown.org has become more and more difficult, because I had to update the list manually to filter out books that are only skeletons or not built with bookdown (such as slides). It was not only time-consuming for me, but it also delayed the showcasing of many awesome books.

    Today I’m happy to introduce the revamped bookdown.org website and to let you know how you may contribute your books there or help us improve the website. The full source of the website is hosted in the rstudio/bookdown.org repository on GitHub (special thanks to Christophe Dervieux and TC Zhang for their great help).

    The archive page

    We list all books published to bookdown.org that have substantial content on the Archive page. This page also contains a few books published elsewhere (e.g., Fundamentals of Data Visualization by Claus O. Wilke). The list is automatically generated by scraping the homepages of books. If you see any inaccurate information about your own book on this page, you may need to correct the information in your book source documents (e.g., index.Rmd) and re-publish the book. Then we can scrape your book again to reflect the correct information. You can also contribute links to your books published elsewhere by submitting pull requests on GitHub. Please read the About page for detailed instructions.
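    For authors who need to make such a correction, here is a minimal, hedged sketch of the usual workflow: update the metadata (for example the title, author, and description fields in the YAML header of index.Rmd), re-render the book, and re-publish it. The account name below is a placeholder, and publish_book() assumes you have already connected a bookdown.org publishing account.

    library(bookdown)

    # Re-render the book locally after editing the metadata in index.Rmd
    render_book("index.Rmd")

    # Re-publish to bookdown.org (replace the placeholder account name with yours)
    publish_book(account = "your-account")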

    The homepage

    On the homepage, we feature a small subset of books written in bookdown. These books are typically either published or nearly complete. If you see an interesting or useful book written in bookdown, you may suggest that we add it to the homepage, whether or not you are its author. Again, please see the About page for instructions.

    The tags page

    To make it a little easier for you to find the books that you are interested in, we created a list of tags to classify books on the Tags page. The current classification method is quite rudimentary, however. We only match the tags against the descriptions of books. In the future, we may support custom keywords or tags in the bookdown package, so authors can provide their own tags. You are welcome to submit pull requests to improve the existing tags.

    The authors page

    We also list all books by author on the Authors page. Note that if a book has multiple authors, they are listed together, and the book is not displayed on the individual authors’ cards.

    So they have authored 195 books. Where is yours?

    We are happy (as happy as Colin Fay) to see that it is totally practical to publish books with bookdown and enjoy the simplicity of R Markdown at the same time. For authors, if we missed your excellent book on bookdown.org, please do not hesitate to add it yourself. The best time to write a book was 20 years ago. The second best time is now. We are looking forward to your own book on bookdown.org, and we hope readers will enjoy all these free and open-source books on bookdown.org.


    To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

    rOpenSci Educators Collaborative: What Educational Resources Work Well and Why?

    Wed, 07/25/2018 - 02:00

    (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

    In the first post of this series, we sketched out some of the common challenges faced by educators who teach with R across scientific domains. In this post, we delve into what makes a “good” educational resource for teaching science with R.



    For instructors teaching sciences with R, there are a number of open educational resources that they can reuse, tailor to their own teaching style, or use to inspire them in creating their own materials. Some examples:

    What makes these teaching materials appealing for educators?



    These curricular materials are open and discoverable by other educators. A common belief is that educational resources that are closed are better than any open material can ever be. Otherwise, why would someone pay for educational materials? While there are surely cases of closed educational resources that are excellent, openness is one of the most appealing characteristics of high quality teaching resources for teaching sciences with R. Open materials have no cost for students, allow a community of educators and students to improve and update the materials, and make it possible to reuse and adapt resources. All this makes the whole teaching experience a lot more efficient for the education community at large. In the closed model, where educators do not share their materials openly, they are often working alone (or in small groups) without any feedback or contributions from the teaching community.

    One of the features all of the resources listed above share is that their source materials are hosted in public GitHub repositories. The source materials are typically written in a shareable, editable format such as R Markdown, using tools like R Markdown websites, blogdown, or bookdown. Keeping educational materials up to date is no small task! Having access to these kinds of open-source materials helps alleviate one of the biggest challenges for educators who use R in the classroom: the rapid pace at which packages are introduced and improved. Being able to access cutting-edge teaching materials that are kept current by R community members is far easier than relying on printed textbooks that include stale code and outdated package recommendations.

    These materials are designed to put into practice principles of how we learn. Scholars have rigorous formal training in research, and their research is peer reviewed. When it comes to teaching at a university, however, scholars have little to no formal training in educational practice and research, and formal mechanisms for peer review of teaching are almost nonexistent. In contrast, there have been real innovations in how to teach programming skills like R (e.g., Greg Wilson’s “Teaching Tech Together”), and in how to train instructors to teach these skills (e.g., see The Carpentries Instructor Training). Many of our examples above have been designed with modern pedagogical approaches in mind. For example, ModernDive was reviewed by a cognitive psychologist who specializes in the science of learning. All have been “battle-tested” in real classrooms first, and have been iterated on by the authors based on real student feedback.

    Courses developed around examples based on real data from the domain of interest, and that are meaningful to the learner, tend to be more successful. They are also more successful when they focus on actually practicing the concepts to be learned rather than purely on theory. One useful technique for conveying initially challenging content is to have students practice first in groups of three, then in pairs, and finally work on the materials individually. A complementary scheme is to introduce new concepts in a spiral, revisiting them over the course at increasing levels of difficulty.

    Good course materials teach people to be independent. Courses that focus on step-by-step instructions are usually less helpful in real life than courses that teach concepts in a way that generalizes beyond the classroom. In real life it is more important to know what questions to ask, how to find good answers to those questions, and how to keep one’s knowledge up to date than to follow step-by-step tutorials for a procedure that may become outdated.

    How these learning aims are met, whether as part of a formal term-long course or in a short burst such as a two-day workshop, also shapes what makes good course materials. For a term-long course, lessons that can be woven into existing courses may be more useful, whereas teaching a particular set of R skills in a short burst may benefit from a more formal set of lessons that has been tested in an intensive workshop setting.

    In the next and final post of this series, we will summarize some priority needs and issue a call to action to advance them, in order to further resources for teaching the sciences with R.


    To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.
