### Navigating the R Package Universe

(This article was first published on ** Rstats on Julia Silge**, and kindly contributed to R-bloggers)

Earlier this month, I, along with John Nash, Spencer Graves, and Ludovic Vannoorenberghe, organized a session at useR!2017 focused on discovering, learning about, and evaluating R packages. You can check out the recording of the session.

There are more than 11,000 packages on CRAN, and R users must approach this abundance of packages with effective strategies to find what they need and choose which packages to invest time in learning how to use. Our session centered on this issue, with three themes in our discussion.

UnificationJohn has been interested in working on wrappers, packages that call other, related packages for a common set of tasks. With a unified wrapper package, a user only has to learn one API but then can use many different implementations for a certain task. John has been particularly involved in numerical optimization techniques and presented possibilities there and beyond.

More generally, and as the session revealed in the breakout discussion, there are opportunities to merge either packages or their functionality. The key issues require, however, human cooperation and some give and take in a realm where egos can take precedence over the efficiency of the R ecosystem.

There were also suggestions that can be interpreted as the unification of the presentation of packages. Overlapping with the “guidance” and “search” themes, these ideas seek to provide selective presentations of packages.

GuidanceJulia explored resources that exist to guide users to packages for certains tasks. R users can turn to long-established resources like CRAN Task Views, or newer options under current development such as the packagemetrics package or the CRANsearcher RStudio add-in. Julia organized a survey before useR about how R users learn about R packages that informed our discussion.

SearchSpencer has worked on the issue of searching for R functions in his sos package, and led the last section focused specifically on how users can find what they are looking for. Ludovic spoke during this part of our talk about RDocumentation, how it works, how it was built, and how we as a community can use it and contribute to making it more useful.

Moving forwardAfter the main presentation, we broke out into three smaller sessions focused on these topics for discussion and brainstorming. Both the main session and then our three following breakout sessions were well-attended. We are so happy about the participation from the community we saw, and hope to use people’s enthusiasm and ideas to move forward with some steps that will improve parts of the R ecosystem. In the coming weeks, look for three more blog posts (from me and the other contributors) on these three topics with more details and ideas on projects. Perhaps something will resonate with *you* and you can get involved!

To **leave a comment** for the author, please follow the link and comment on their blog: ** Rstats on Julia Silge**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### June 2017 New Package Picks

(This article was first published on ** R Views**, and kindly contributed to R-bloggers)

Two hundred and thirty-eight new packages were added to CRAN in June. Below are my picks for the “Top 40”, organized into six categories: Biostatistics, Data, Machine Learning, Miscellaneous, Statistics and Utilities. Some packages, including geofacet and secret, already seem to be gaining traction.

BiostatisticsBIGL v1.0.1: Implements response surface methods for drug synergy analysis, including generalized and classical Loewe formulations and the Highest Single Agent methodology. There are vignettes on Methodology and Synergy Analysis.

colorpatch v0.1.2: Provides functions to show color patches for encoding fold changes (e.g., log ratios) and confidence values within a diagram; especially useful for rendering gene expression data and other types of differential experiments. See the vignette.

eesim v0.1.0: Provides functions to create simulated time series of environmental exposures (e.g., temperature, air pollution) and health outcomes for use in power analysis and simulation studies in environmental epidemiology. The vignette gives an overview of the package.

personalized v0.0.2: Provides functions for fitting and validating subgroup identification and personalized medicine models under the general subgroup identification framework of Chen et al. The vignette provides a brief tutorial.

tidygenomics v0.1.0: Provides method to deal with genomic intervals the “tidy way”. The vignette explains how they work.

Dataalfred v0.1.1: Provides direct access to the ALFRED and FRED databases. The vignette gives a brief example.

CityWaterBalance v0.1.0: Provides functions to retrieve data and estimate unmeasured flows of water through an urban network. Data for US cities can be gathered via web services using this package and dependencies. See the vignette for an introduction to the package.

censusapi v0.2.0: Provides a wrapper for the U.S. Census Bureau APIs that returns data frames of census data and metadata. Available data sets include the Decennial Census, American Community Survey, Small Area Health Insurance Estimates, Small Area Income and Poverty Estimates, and Population Estimates and Projections. There is a brief vignette.

dataverse v0.2.0: Provides access to Dataverse version 4 APIs, enabling data search, retrieval, and deposit. There are four vignettes: Introduction, Search and Discovery, Retrieval and Data Archiving.

data.world v1.1.1: Provides high-level tools for working with data.world data sets. There is a Quickstart Guide and a vignette for writing Queries.

SimMultiCorrData v0.1.0: Provides functions to generate continuous, binary, ordinal, and count variables with a specified correlation matrix that can be used to simulate data sets that mimic real-world situations (e.g., clinical data sets, plasmodes). There are several vignettes including an Overall Workflow for Data Simulation and a Comparison to Other Packages.

tidycensus v0.1.2: Provides an integrated R interface to the decennial US Census and American Community Survey APIs, and the US Census Bureau’s geographic boundary files.

ukbtools v0.9.0: Provides tools to work with UK Biobank datasets. The vignette shows how to get started.

wpp2017 v1.0-1: Provides and interface to data sets from the United Nation’s World Population Prospects 2017.

Machine Learningcld3 v1.0: Provides an interface to Google’s experimental Compact Language Detector 3 algorithm, a neural network model for language identification that is the successor of cld2.

datafsm v0.2.0: Implements a method that automatically generates models of dynamic decision-making that both have strong predictive power and are interpretable in human terms. The vignette provides an example.

diceR v0.1.0: Provides functions for cluster analysis using an ensemble clustering framework. The vignette shows some examples.

glmertree v0.1-1: Implements recursive partitioning based on (generalized) linear mixed models (GLMMs) combining lmer() and glmer() from lme4 and lmtree() and glmtree() from partykit. The vignette shows an example.

greta v0.2.0: Lets users write statistical models in R and fit them by MCMC on CPUs and GPUs, using Google TensorFlow. There is a website, a Getting Started Guide, and vignettes providing Examples andTechnical Details.

penaltyLearning v2017.07.11: Implements algorithms from Learning Sparse Penalties for Change-point Detection using Max Margin Interval Regression. There is a vignette.

SentimentAnalysis v1.2-0: Implements functions to perform sentiment analysis of textual data using various existing dictionaries, such as Harvard IV, or finance-specific dictionaries, and create customized dictionaries. The vignette provides an introduction.

Miscellaneousconvexjlr v0.5.1: Provides a high-level wrapper for Julia package Convex.jl, which makes it easy to describe and solve convex optimization problems. There is a very nice vignette that shows how to optimize the parameters for several machine learning models.

interp v1.0-29: Implements bivariate data interpolation on both regular and irregular grids using either linear methods or splines.

pkggraph v0.2.0: Allows users to interactively explore and plot package dependencies for CRAN.

parallelDist v0.1.1: Provides a parallelized alternative to R’s native dist function to calculate distance matrices for continuous, binary, and multi-dimensional input matrices with support for a broad variety of distance functions from the stats, prox and dtw R packages. The vignette offers some results on performance.

StatsanchoredDistR v1.0.3: Supplements the MAD# software that implements the Method of Anchored Distributions for inferring geostatistical parameters. There is a vignette.

bssm v01.1-1: Efficient methods for Bayesian inference of state space models via particle Markov chain Monte Carlo and importance sampling type corrected Markov chain Monte Carlo. There is a vignette on Bayesian Inference of State Space Models and an example of a Logistic Growth Model.

factorMerger v0.3.1 Provides a set of tools to support results of post-hoc testing and enable to extract hierarchical structure of factors. There is an Introduction and vignettes on Cox Regression Factor Merging and Multidimensional Gaussian Merging.

MittagLeffleR v0.1.0: Provides density, distribution, and quantile functions as well as random variate generation for the Mittag-Leffler distribution based on the algorithm by Garrappa. There are short vignettes for the math, distribution functions and random variate generation.

walker v0.2.0: Provides functions for building dynamic Bayesian regression models where the regression coefficients can vary over time as random walks. The vignette shows some examples.

Utilitiescharlatan v0.1.0: Provides functions to make fake data, including addresses, person names, dates, times, colors, coordinates, currencies, DOIs, jobs, phone numbers, ‘DNA’ sequences, doubles and integers from distributions and within a range. The Introduction will get you started.

colordistances v0.8.0: Provides functions to load and display images, selectively mask specified background colors, bin pixels by color, quantitatively measure color similarity among images,and cluster images by object color similarity. There is an Introduction and vignettes on Pixel Binning Methods and Color Distance Metrics.

dbplyr v1.1.0: Implements a dplyr back end for databases that allows working with remote database tables as if they are in-memory data frames. There is an Introduction, a vignette for Adding a new DBI backend and one for SQL translation.

geofacet v0.1.5: Provides geofaciting functionality (the ability to arrange a sequence of plots for different geographical entities into a grid that preserves some geographical orientation) for ggplot2. There is a Package Reference vignette and an Introduction. The package is already getting some traction. This is a user submission:

ggformula v0.4.0: Provides a formula interface to ggplot2. There is a vignette explaining how it works.

gqlr v0.0.1: Provides an implementation of the GraphQL query language created by Facebook for describing data requirements on complex application data models. gqlr should be useful for integrating R computations into production applications that use GraphQL.

later v0.3: Allows users to execute arbitrary R or C functions some time after the current time, after the R execution stack has emptied. The vignette shows how to use later from C++.

secret v1.0.0: Allows sharing sensitive information like passwords, API keys, etc., in R packages, using public key cryptography. There is a vignette.

sessioninfo v1.0.0: Provides functions to query and print information about the current R session. It is similar to utils::sessionInfo(), but includes more information.

webglobe v1.0.2: Provides functions to display geospatial data on an interactive 3D globe. There is a vignette

_____='https://rviews.rstudio.com/2017/07/26/june-2017-new-package-picks/';

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R Views**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Passing two SQL queries to sp_execute_external_script

(This article was first published on ** R – TomazTsql**, and kindly contributed to R-bloggers)

Recently, I got a question on one of my previous blog posts, if there is possibility to pass two queries in same run-time as an argument to external procedure sp_execute_external_script.

Some of the arguments of the procedure sp_execute_external_script are enumerated. This is valid for the inputting dataset and as the name of argument *@input_data_1* suggests, one can easily (and this is valid doubt) think, there can also be @input_data_2 argument, and so on. Unfortunately, this is not true. External procedure can hold only one T-SQL dataset, inserted through this parameter.

There are many reasons for that, one would be the cost of sending several datasets to external process and back, so inadvertently, this forces user to rethink and pre-prepare the dataset (meaning, do all the data munging beforehand), prior to sending it into external procedure.

But there are workarounds on how to pass additional query/queries to sp_execute_external_script. I am not advocating this, and I strongly disagree with such usage, but here it is.

First I will create two small datasets, using T-SQL

USE SQLR; GO DROP TABLE IF EXISTS dataset; GO CREATE TABLE dataset (ID INT IDENTITY(1,1) NOT NULL ,v1 INT ,v2 INT CONSTRAINT pk_dataset PRIMARY KEY (id) ) SET NOCOUNT ON; GO INSERT INTO dataset(v1,v2) SELECT TOP 1 (SELECT TOP 1 number FROM master..spt_values WHERE type IN ('EOB') ORDER BY NEWID()) AS V1 ,(SELECT TOP 1 number FROM master..spt_values WHERE type IN ('EOD') ORDER BY NEWID()) AS v2 FROM master..spt_values GO 50This dataset will be used directly into @input_data_1 argument. The next one will be used through R code:

CREATE TABLE external_dataset (ID INT IDENTITY(1,1) NOT NULL ,v1 INT CONSTRAINT pk_external_dataset PRIMARY KEY (id) ) SET NOCOUNT ON; GO INSERT INTO external_dataset(v1) SELECT TOP 1 (SELECT TOP 1 number FROM master..spt_values WHERE type IN ('EOB') ORDER BY NEWID()) AS V1 FROM master..spt_values GO 50Normally, one would use a single dataset like:

EXEC sp_execute_external_script @language = N'R' ,@script = N'OutputDataSet <- data.frame(MySet);' ,@input_data_1 = N'SELECT TOP 5 v1, v2 FROM dataset;' ,@input_data_1_name = N'MySet' WITH RESULT SETS (( Val1 INT ,Val2 INT ))But by “injecting” the ODBC into R code, we can allow external procedure, to get back to your SQL Server and get additional dataset.

This can be done by following:

EXECUTE AS USER = 'RR'; GO DECLARE @Rscript NVARCHAR(MAX) SET @Rscript = ' library(RODBC) myconn <-odbcDriverConnect("driver={SQL Server}; Server=SICN-KASTRUN;database=SQLR;uid=RR;pwd=Read!2$16") External_source <- sqlQuery(myconn, "SELECT v1 AS v3 FROM external_dataset") close(myconn) Myset <- data.frame(MySet) #Merge both datasets mergeDataSet <- data.frame(cbind(Myset, External_source));' EXEC sp_execute_external_script @language = N'R' ,@script = @Rscript ,@input_data_1 = N'SELECT v1, v2 FROM dataset;' ,@input_data_1_name = N'MySet' ,@output_data_1_name = N'mergeDataSet' WITH RESULT SETS (( Val1 INT ,Val2 INT ,Val3 INT )) REVERT; GOAnd the result will be merged two datasets, in total three columns:

which correspond to two datasets:

-- Check the results! SELECT * FROM dataset SELECT * FROM external_datasetThere are, as already mentioned, several opposing factors to this approach, and I would not recommend this. Some are:

- validating and keeping R code in one place
- performance issues
- additional costs of data transferring
- using ODBC connectors
- installing additional R packages (in my case RODBC package)
- keeping different datasets in one place
- security issues
- additional login/user settings
- firewall inbound/outbound rules setting

This, of course, can also be achieved with *.XDF file formats, if they are stored locally or on server as a files.

As always, code is available at Github.

Happy R-SQLing!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – TomazTsql**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Effect Size Statistics for Anova Tables #rstats

(This article was first published on ** R – Strenge Jacke!**, and kindly contributed to R-bloggers)

My sjstats-package has been updated on CRAN. The past updates introduced new functions for various purposes, e.g. predictive accuracy of regression models or improved support for the marvelous glmmTMB-package. The current update, however, added some ANOVA tools to the package.

In this post, I want to give a short overview of these new functions, which report different effect size measures. These are useful beyond significance tests (p-values), because they estimate the magnitude of effects, independent from sample size. *sjstats* provides following functions:

- eta_sq()
- omega_sq()
- cohens_f()
- anova_stats()

First, we need a sample model:

library(sjstats) # load sample data data(efc) # fit linear model fit <- aov( c12hour ~ as.factor(e42dep) + as.factor(c172code) + c160age, data = efc )All functions accept objects of class aov or anova, so you can also use model fits from the *car-package*, which allows fitting Anova’s with different types of sum of squares. Other objects, like lm, will be coerced to anova internally.

The following functions return the effect size statistic as named numeric vector, using the model’s term names.

Eta SquaredThe eta squared is the proportion of the total variability in the dependent variable that is accounted for by the variation in the independent variable. It is the ratio of the sum of squares for each group level to the total sum of squares. It can be interpreted as percentage of variance accounted for by a variable.

For variables with 1 degree of freedeom (in the numerator), the square root of eta squared is equal to the correlation coefficient r. For variables with more than 1 degree of freedom, eta squared equals R2. This makes eta squared easily interpretable. Furthermore, these effect sizes can easily be converted into effect size measures that can be, for instance, further processed in meta-analyses.

Eta squared can be computed simply with:

eta_sq(fit) #> as.factor(e42dep) as.factor(c172code) c160age #> 0.266114185 0.005399167 0.048441046 Partial Eta SquaredThe partial eta squared value is the ratio of the sum of squares for each group level to the sum of squares for each group level plus the residual sum of squares. It is more difficult to interpret, because its value strongly depends on the variability of the residuals. Partial eta squared values should be reported with caution, and Levine and Hullett (2002) recommend reporting eta or omega squared rather than partial eta squared.

Use the partial-argument to compute partial eta squared values:

eta_sq(fit, partial = TRUE) #> as.factor(e42dep) as.factor(c172code) c160age #> 0.281257128 0.007876882 0.066495448 Omega SquaredWhile eta squared estimates tend to be biased in certain situations, e.g. when the sample size is small or the independent variables have many group levels, omega squared estimates are corrected for this bias.

Omega squared can be simply computed with:

omega_sq(fit) #> as.factor(e42dep) as.factor(c172code) c160age #> 0.263453157 0.003765292 0.047586841 Cohen’s FFinally, cohens_f() computes Cohen’s F effect size for all independent variables in the model:

cohens_f(fit) #> as.factor(e42dep) as.factor(c172code) c160age #> 0.62555427 0.08910342 0.26689334 Complete Statistical Table OutputThe anova_stats() function takes a model input and computes a comprehensive summary, including the above effect size measures, returned as tidy data frame (as tibble, to be exact):

anova_stats(fit) #> # A tibble: 4 x 11 #> term df sumsq meansq statistic p.value etasq partial.etasq omegasq cohens.f power #> #> 1 as.factor(e42dep) 3 577756.33 192585.444 108.786 0.000 0.266 0.281 0.263 0.626 1.00 #> 2 as.factor(c172code) 2 11722.05 5861.024 3.311 0.037 0.005 0.008 0.004 0.089 0.63 #> 3 c160age 1 105169.60 105169.595 59.408 0.000 0.048 0.066 0.048 0.267 1.00 #> 4 Residuals 834 1476436.34 1770.307 NA NA NA NA NA NA NALike the other functions, the input may also be an object of class anova, so you can also use model fits from the car package, which allows fitting Anova’s with different types of sum of squares:

anova_stats(car::Anova(fit, type = 3)) #> # A tibble: 5 x 11 #> term sumsq meansq df statistic p.value etasq partial.etasq omegasq cohens.f power #> #> 1 (Intercept) 26851.070 26851.070 1 15.167 0.000 0.013 0.018 0.012 0.135 0.973 #> 2 as.factor(e42dep) 426461.571 142153.857 3 80.299 0.000 0.209 0.224 0.206 0.537 1.000 #> 3 as.factor(c172code) 7352.049 3676.025 2 2.076 0.126 0.004 0.005 0.002 0.071 0.429 #> 4 c160age 105169.595 105169.595 1 59.408 0.000 0.051 0.066 0.051 0.267 1.000 #> 5 Residuals 1476436.343 1770.307 834 NA NA NA NA NA NA NA ReferencesLevine TR, Hullet CR. Eta Squared, Partial Eta Squared, and Misreporting of Effect Size in Communication Research. Human Communication Research 28(4); 2002: 612-625

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Strenge Jacke!**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R⁶ — General (Attys) Distributions

(This article was first published on ** R – rud.is**, and kindly contributed to R-bloggers)

Matt @stiles is a spiffy data journalist at the @latimes and he posted an interesting chart on U.S. Attorneys General longevity (given that the current US AG is on thin ice):

Only Watergate and the Civil War have prompted shorter tenures as AG (if Sessions were to leave now). A daily viz: https://t.co/aJ4KDsC5kC pic.twitter.com/ZoiEV3MhGp

— Matt Stiles (@stiles) July 25, 2017

I thought it would be neat (since Matt did the data scraping part already) to look at AG tenure distribution by party, while also pointing out where Sessions falls.

Now, while Matt did scrape the data, it’s tucked away into a javascript variable in an iframe on the page that contains his vis.

It’s still easier to get it from there vs re-scrape Wikipedia (like Matt did) thanks to the V8 package by @opencpu.

The following code:

- grabs the vis iframe
- extracts and evaluates the target javascript to get a nice data frame
- performs some factor re-coding (for better grouping and to make it easier to identify Sessions)
- plots the distributions using the beeswarm quasirandom alogrithm

I turned the data into a CSV and stuck it in this gist if folks want to play w/o doing the js scraping.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – rud.is**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### SQL Server 2017 release candidate now available

(This article was first published on ** Revolutions**, and kindly contributed to R-bloggers)

SQL Server 2017, the next major release of the SQL Server database, has been available as a community preview for around 8 months, but now the first full-featured release candidate is available for public preview. For those looking to do data science with data in SQL Server, there are a number of new features compared to SQL Server 2017 which might be of interest:

- SQL Server 2017 will be supported on Linux, specifically on Red Hat, SUSE, and Ubuntu (as well as within a Docker container). You can read some of the backstory on how SQL Server came to Linux here.
- The new SQL Server Machine Learning Services will provide in-database analytics in both the R and — new to SQL Server 2017 — Python languages.
- There are several improvements to the in-database R integration (previously known as "R Services" in SQL Server 2016). These includes a new package management system, real-time in-database scoring of predictive models, and an updated R engine.

SQL Server 2017 Release Candidate 1 is available for download now. For more details on these and other new features in this release, check out the link below.

SQL Server Blog: SQL Server 2017 CTP 2.1 now available

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To **leave a comment** for the author, please follow the link and comment on their blog: ** Revolutions**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-4)

(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

Statistics are often taught in school by and for people who like Mathematics. As a consequence, in those class emphasis is put on leaning equations, solving calculus problems and creating mathematics models instead of building an intuition for probabilistic problems. But, if you read this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistic problems by writing simple R query and how to think in probabilistic terms.

Until now, in this series of exercise sets, we have used only continuous probability distributions, which are functions defined on all the real numbers on a certain interval. As a consequence, random variable who have those distributions can assume an infinity of values. However, a lot of random situations only have a finite amount of possible outcome and using a continuous probability distributions to analyze them is not really useful. In today set, we’ll introduce the concept of discrete probability functions, which can be used in those situations and some examples of problems in which they can be used.

Answers to the exercises are available here.

**Exercise 1**

Just as continuous probability distributions are characterized by a probability density function discrete probability functions are characterized by a probability mass function which gives the probability that a random variable is equal to one value.

The first probability mass function we will use today is the binomial distribution, which is used to simulate n iterations of a random process who can either result in a success, with a probability of p, or a failure, with a probability of (1-p). Basically, if you want to simulate something like a coins flip, the binomial distribution is the tool you need.

Suppose you roll a 20 sided dice 200 times and you want to know the probability to get a 20 exactly five times on your rolls. Use the dbinom(n, size, prob) function to compute this probability.

**Exercise 2**

For the binomial distribution, the individual events are independents, meaning that the probability of realization of two events can be calculated by adding the probability of realization of both event. This principle can be generalize to any number of events. For example, the probability of getting three tails or less when you flip a coins 10 time is equal to the probability of getting 1 tails plus the probability of getting 2 tails plus the probability of getting 3 tails.

Knowing this, use the dbinom() function to compute the probability of getting six correct responses at a test made of 10 questions which have true or false for answer if you answer randomly. Then, use the pbinom() function to compute the cumulative probability function of the binomial distribution in that situation.

**Exercise 3**

Another consequence of the independence of events is that if we know the probability of realization of a set of events we can compute the probability of realization of one of his subset by subtracting the probability of the unwanted event. For example, the probability of getting two or three tails when you flip a coins 10 time is equal to the probability of getting at least 3 tails minus the probability of getting 1 tails.

Knowing this, compute the probability of getting 6 or more correct answer on the test described in the previous exercise.

**Learn more**about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to:

- Work with about different binomial and logistic regression techniques
- Know how to compare regression models and choose the right fit
- And much more

**Exercise 4**

Let’s say that in an experiment a success is defined as getting a 1 if you roll a 20 sided die. Use the barplot() function to represent the probability of getting from 0 to 10 success if you roll the die 10 times. What happened to the barplot if you roll a 10 sided die instead? If you roll a 3 sided die?

**Exercise 5**

Another discrete probability distribution close to the binomial distribution is the Poisson distribution, which give the probability of a number of events to occur during a fixed amount of time if we know the average rate of his occurrence. For example, we could use this distribution to estimate the amount of visitor who goes on a website if we know the average number of visitor per second. In this case, we must assume two things: first that the website has visitor from around the world since the rate of visitor must be constant around the day and two that when a visitor is coming on the site he is not influenced by the last visitor since a process can be expressed by the Poisson distribution if the events are independent from each other.

Use the dpois() function to estimate the probability of having 85 visitors on a website in the next hour if in average 80 individual connect on the site per hour. What is the probability of getting 2000 unique visitors on the website in a day?

**Exercise 6**

Poisson distribution can be also used to compute the probability of an event occurring in an amount of space, as long as the unit of the average rate is compatible with the unit of measure of the space you use. Suppose that a fishing boat catch 1/2 ton of fish when his net goes through 5 squares kilometers of sea. If the boat combed 20 square kilometer, what is the probability that they catch 5 tons of fish?

**Exercise 7**

Until now, we used the Poisson distribution to compute the probability of observing precisely n occurrences of an event. In practice, we are often interested in knowing the probability that an event occur n times or less. To do so we can use the ppois() function to compute the cumulative Poisson distribution. If we are interested in knowing what is the probability of observing strictly more than n occurrences, we can use this function and set the parameter lower to FALSE.

In the situation of exercise 5, what is the probability that the boat caught 5 tons of fish or less? What is the probability that the caught more than 5 tons of fish?

Note that, just as in a binomial experiment, the events in a Poisson process are independant, so you can add or subtract probability of event to compute the probability of a particular set of events.

**Exercise 8**

Draw the Poisson distribution for average rate of 1,3,5 and 10.

**Exercise 9**

The last discrete probability distribution we will use today is the negative binomial distribution which give the probability of observing a certain number of success before observing a fixed number of failures. For example, imagine that a professional football player will retire at the end of the season. This player has scored 495 goals in his career and would really want to meet the 500 goal mark before retiring. If he is set to play 8 games until the end of the season and score one goal every three games in average, we can use the negative binomial distribution to compute the probability that he will meet his goal on his last game, supposing that he won’t score more than one goal per game.

Use the dnbinom() function to compute this probability. In this case, the number of success is 5, the probability of success is 1/3 and the number of failures is 3.

**Exercise 10**

Like for the Poisson distribution, R give us the option to compute the cumulative negative binomial distribution with the function pnbinom(). Again, the lower.tail parameter than give you the option to compute the probability of realizing more than n success if he is set to TRUE.

In the situation of the last exercise, what is the probability that the football player will score at most 5 goals in before the end of his career.

Related exercise sets:- Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-2)
- Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-1)
- Lets Begin with something sample
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Experiment designs for Agriculture

(This article was first published on ** R tutorial for Spatial Statistics**, and kindly contributed to R-bloggers)

Statistical Methods for Bioscience (Wisconsin-Madison): http://www.stat.wisc.edu/courses/st572-larget/Spring2007/
R Packages to create several designs are presented here: https://cran.r-project.org/web/views/ExperimentalDesign.html
A very good tutorial about the use of the package Agricolae can be found here:

https://cran.r-project.org/web/packages/agricolae/vignettes/tutorial.pdf
Complete Randomized Design
This is probably the most common design, and it is generally used when conditions are uniform, so we do not need to account for variations due for example to soil conditions.
In R we can create a simple CRD with the function expand.grid and then with some randomization:
TR.Structure = expand.grid(rep=1:3, Treatment1=c("A","B"), Treatment2=c("A","B","C"))

Data.CRD = TR.Structure[sample(1:nrow(TR.Structure),nrow(TR.Structure)),]

Data.CRD = cbind(PlotN=1:nrow(Data.CRD), Data.CRD[,-1])

write.csv(Data.CRD, "CompleteRandomDesign.csv", row.names=F)

The first line create a basic treatment structure, with rep that identifies the number of replicate, that looks like this:

rep Treatment1 Treatment2

1 1 A A

2 2 A A

3 3 A A

4 1 B A

5 2 B A

6 3 B A

7 1 A B

8 2 A B

9 3 A B

10 1 B B

11 2 B B

12 3 B B

13 1 A C

14 2 A C

15 3 A C

16 1 B C

17 2 B C

18 3 B C

The second line randomizes the whole data.frame to obtain a CRD, then we add with cbind a column at the beginning with an ID for the plot, while also eliminating the columns with rep.

Add Control
To add a Control we need to write two separate lines, one for the treatment structure and the other for the control:
TR.Structure = expand.grid(rep=1:3, Treatment1=c("A","B"), Treatment2=c("A","B","C"))

CR.Structure = expand.grid(rep=1:3, Treatment1=c("Control"), Treatment2=c("Control"))

Data.CCRD = rbind(TR.Structure, CR.Structure)

This will generate the following table:

rep Treatment1 Treatment2

1 1 A A

2 2 A A

3 3 A A

4 1 B A

5 2 B A

6 3 B A

7 1 A B

8 2 A B

9 3 A B

10 1 B B

11 2 B B

12 3 B B

13 1 A C

14 2 A C

15 3 A C

16 1 B C

17 2 B C

18 3 B C

19 1 Control Control

20 2 Control Control

21 3 Control Control

As you can see the control is totally excluded from the rest. Now we just need to randomize, again using the function sample:

Data.CCRD = Data.CCRD[sample(1:nrow(Data.CCRD),nrow(Data.CCRD)),]Data.CCRD = cbind(PlotN=1:nrow(Data.CCRD), Data.CCRD[,-1])

write.csv(Data.CCRD, "CompleteRandomDesign_Control.csv", row.names=F)

Block Design with Control The starting is the same as before. The difference starts when we need to randomize, because in CRD we randomize over the entire table, but with blocks, we need to do it by block. TR.Structure = expand.grid(Treatment1=c("A","B"), Treatment2=c("A","B","C"))

CR.Structure = expand.grid(Treatment1=c("Control"), Treatment2=c("Control"))

Data.CBD = rbind(TR.Structure, CR.Structure)

Block1 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]

Block2 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]

Block3 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]

Data.CBD = rbind(Block1, Block2, Block3)

BlockID = rep(1:nrow(Block1),3)

Data.CBD = cbind(Block = BlockID, Data.CBD)

write.csv(Data.CBD, "BlockDesign_Control.csv", row.names=F)

As you can see from the code above, we’ve created three objects, one for each block, where we used the function sample to randomize.

Other Designs with Agricolae
The package agricolae includes many designs, which I am sure will cover all your needs in terms of setting up field and lab experiments.
We will look at some of them, so first let’s install the package:
install.packages("agricolae")

library(agricolae)

The main syntax for design in agricolae is the following:

design.crd(trt=Trt1, r=3)

The result is the output below:

> design.crd(trt=Trt1, r=3)$parameters

$parameters$design

[1] "crd"

$parameters$trt

[1] "A" "B" "C"

$parameters$r

[1] 3 3 3

$parameters$serie

[1] 2

$parameters$seed

[1] 1572684797

$parameters$kinds

[1] "Super-Duper"

$parameters[[7]]

[1] TRUE

$book

plots r Trt1

1 101 1 A

2 102 1 B

3 103 2 B

4 104 2 A

5 105 1 C

6 106 3 A

7 107 2 C

8 108 3 C

9 109 3 B

As you can see the function takes only one argument for treatments and another for replicates. Therefore, if we need to include a more complex treatment structure we first need to work on them:

Trt1 = c("A","B","C")Trt2 = c("1","2")

Trt3 = c("+","-")

TRT.tmp = as.vector(sapply(Trt1, function(x){paste0(x,Trt2)}))

TRT = as.vector(sapply(TRT.tmp, function(x){paste0(x,Trt3)}))

TRT.Control = c(TRT, rep("Control", 3))

As you can see we have now three treatments, which are merged into unique strings within the function sapply:

> TRT[1] "A1+" "A1-" "A2+" "A2-" "B1+" "B1-" "B2+" "B2-" "C1+" "C1-" "C2+" "C2-"

Then we need to include the control, and then we can use the object TRT.Control with the function design.crd, from which we can directly obtain the data.frame with $book:

> design.crd(trt=TRT.Control, r=3)$bookplots r TRT.Control

1 101 1 A2+

2 102 1 B1+

3 103 1 Control

4 104 1 B2+

5 105 1 A1+

6 106 1 C2+

7 107 2 A2+

8 108 1 C2-

9 109 2 Control

10 110 1 B2-

11 111 3 Control

12 112 1 Control

13 113 2 C2-

14 114 2 Control

15 115 1 C1+

16 116 2 C1+

17 117 2 B2-

18 118 1 C1-

19 119 2 C2+

20 120 3 C2-

21 121 1 A2-

22 122 2 C1-

23 123 2 A1+

24 124 3 C1+

25 125 1 B1-

26 126 3 Control

27 127 3 A1+

28 128 2 B1+

29 129 2 B2+

30 130 3 B2+

31 131 1 A1-

32 132 2 B1-

33 133 2 A2-

34 134 1 Control

35 135 3 C2+

36 136 2 Control

37 137 2 A1-

38 138 3 B1+

39 139 3 Control

40 140 3 A2-

41 141 3 A1-

42 142 3 A2+

43 143 3 B2-

44 144 3 C1-

45 145 3 B1-

Other possible designs are:

#Random Block Designdesign.rcbd(trt=TRT.Control, r=3)$book

#Incomplete Block Design

design.bib(trt=TRT.Control, r=7, k=3)

#Split-Plot Design

design.split(Trt1, Trt2, r=3, design=c("crd"))

#Latin Square

design.lsd(trt=TRT.tmp)$sketch

Others not included above are: Alpha designs, Cyclic designs, Augmented block designs, Graeco – latin square designs, Lattice designs, Strip Plot Designs, Incomplete Latin Square Design

Final Note For repeated measures and crossover designs I think we can create designs simply using again the function expand.grid and including time and subjects, as I did in my previous post about Power Analysis. However, there is also the package Crossover that deals specifically with crossover design and on this page you can find more packages that deal with clinical designs: https://cran.r-project.org/web/views/ClinicalTrials.html var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To **leave a comment** for the author, please follow the link and comment on their blog: ** R tutorial for Spatial Statistics**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Twitter Coverage of the Bioinformatics Open Source Conference 2017

July 21-22 saw the 18th incarnation of the Bioinformatics Open Source Conference, which generally precedes the ISMB meeting. I had the great pleasure of attending BOSC way back in 2003 and delivering a short presentation on Bioperl. I knew almost nothing in those days, but everyone was very kind and appreciative.

My trusty R code for Twitter conference hashtags pulled out 3268 tweets and without further ado here is:

- the Github repository, where you can view the markdown report in the code/R directory
- the published report at RPubs

The ISMB/ECCB meeting wraps today and analysis of Twitter coverage for that meeting will appear here in due course.

Filed under: bioinformatics, meetings, R, statistics Tagged: bosc

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));### The new MODIStsp website (based on pkgdown) is online !

(This article was first published on ** Spatial Processing in R**, and kindly contributed to R-bloggers)

The **MODIStsp** website, which lay abandoned since several months on github pages, recently underwent a major overhaul thanks to **pkgdown**. The new site is now available at http://lbusett.github.io/MODIStsp/

We hope that the revised website will allow to navigate MODIStsp-related material much more easily than either github or the standard CRAN documentation, and will therefore help users in better and more easily exploiting MODIStsp functionality.

The restyling was possible thanks to the very nice “**pkgdown**” R package (http://hadley.github.io/pkgdown), which allows to easily create a static documentation website.

Though pkgdown does already quite a good job in creating a bare-bones website exploiting **just the material available in a standard devtools-based R package file structure**, I was surprised at **how simply the standard website could be tweaked **to obtain a very nice (IMO) final result without needing any particular background on html, css, etc. !

(On that, I plan to soon post a short guide containing a few **tips and tricks** I learnt in the last days for setting up and configuring a pkgdown website, so stay tuned !)

To **leave a comment** for the author, please follow the link and comment on their blog: ** Spatial Processing in R**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### What are people saying about ICCB2017 on Twitter?

(This article was first published on ** Bluecology blog**, and kindly contributed to R-bloggers)

Here’s a brief analysis of tweets from the International Congress for Conservation Biology to date. The conference started on the starting 23rd July I was curious to see what people are talking about there and who is doing the talking on twitter.

As of 25th July I could access 6978 tweets and retweets from the conference, starting on 2017-07-15.

If you are wondering how these stats compare to other conferences, check out my post from ICRS last year.

Who is talkingThere have been about 1500 users on #ICCB2017 so far.

Clearly, the people talking on twitter are a biased selection of people at ICCB2017 and may also include people not there (like me). As usual, a lot people only tweet once or twice.

The users doing the most tweeting are (in order of most to least with number of tweets):

@IbuAnggrek 528, @ICCB2017 310, @AzurigenMCS 199, @WhySharksMatter 179, @rebecca_jarvis 172,

@CarlaWildlife 146, @WILDLABSNET 93, @Society4ConBio 84, @FancyScientist 80,

@davila_federico 79

Here is a table of the most popular tweets, by their RT count:

Rank Name text Retweet count 1 WhySharksMatter Angler gives up chance at world record, releases huge blacktip shark alive. #ICCB2017 #SharkWeek (from 2014) https://t.co/dwmiAeSXQW https://t.co/74SyQ6Uhfk 96 2 eguerreroforero @Brigittelgb en #ICCB2017: A los países megadiversos nos llaman ‘megadispersos’ porq no hemos logrado posicionar políticamente biodiversidad https://t.co/u43dVvcjHO 65 3 HugePossum Australia’s proposed marine parks have never been science based – bad planning for nature and people #ICCB2017 https://t.co/rZXAueuod6 64 4 ICCB2017 #ICCB2017 attendees – there are sloths in El Parque Centenario near the convention center! #UrbanWildlife https://t.co/81YktjU4Mu 59 5 Seasaver UN Patron of Oceans calls for ban on tournaments which kill threatened sharks https://t.co/P5KmUxBte3 #ICCB2017 @TheIGFA #sportfishing 40I like that sloths are right up there at number 4 at the moment. Marine science topics are also pretty popular.

What people are sayingTextual analysis is a bit of a dark art, but here are some of the most popular words appearing in tweets (with the number of tweets they appear in ):

Term Number of tweets iccb 6965 conserv 1064 amp 691 colombia 477 talk 467 biodivers 411 cartagena 380 work 362 will 323 whysharksmatter 315 societyconbio 304 open 303 great 298 can 293 come 292 use 289 conservation 287 join 285 need 248 world 244At this stage the location is an imporant talking point. Also predictably ‘conservation’ and ‘biodiversity’ are also up there. It is interesting that ‘need’ features in a lot of tweets. Perhaps a lot of people saying we ‘need’ to do this or that?

Time of tweetsFinally, it can be illustrative to look at the timing of tweets. This graph is made with times for the time zone local to Cartagena.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** Bluecology blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Analyzing Github pull requests with Neural Embeddings, in R

(This article was first published on ** Revolutions**, and kindly contributed to R-bloggers)

At the useR!2017 conference earlier this month, my colleague Ali Zaidi gave a presentation on using Neural Embeddings to analyze GitHub pull request comments (processed using the tidy text framework). The data analysis was done using R and distributed on Spark, and the resulting neural network trained using the Microsoft Cognitive Toolkit. You can see the slides here, and you can watch the presentation below.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To **leave a comment** for the author, please follow the link and comment on their blog: ** Revolutions**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Are computers needed to teach Data Science?

(This article was first published on ** R Project – Citizen-Statistician**, and kindly contributed to R-bloggers)

One of the many nice things about summer is the time and space it allows for blogging. And, after a very stimulating SRTL conference (Statistics Reasoning, Teaching and Learning) in Rotorua, New Zealand, there’s lots to blog about.

Let’s begin with a provocative posting by fellow SRTL-er Tim Erickson at his excellent blog A Best Case Scenario. I’ve known Tim for quite awhile, and have enjoyed many interesting and challenging discussions. Tim is a creator of curricula *par excellence*, and has first-hand experience in what inspires and motivates students to think deeply about statistics.

The central question here is: Is computation (on a computer) necessary for learning data science? The learners here are beginners in K-12. Tim answers no, and I answer, tentatively, yes. Tim portrays me in his blog as being a bit more steadfast on this position than I really am. In truth the answer is, some; maybe; a little; I don’t know.

My own experience in the topic comes from the Mobilize project , in which we developed the course Introduction to Data Science for students in the Los Angeles Unified School District. (I’m pleased to say that the course is expanding. This summer, five new L.A.-area school districts will begin training teachers to teach this course. )

The course relies heavily on R via Rstudio. Students begin by studying the structure of data, learning to identify cases and variables and to organize unstructured data into a “tidy” format. Next, they learn to “read” tidy datafiles into Rstudio. The course ends with students learning some predictive modeling using Classification and Regression Trees. In between, they study some inference using randomization-based methods.

To be precise, the students don’t learn straight-up R. They work within a package developed by the Mobilize team (primarily James Molyneux, Amelia McNamara, Steve Nolen, Jeroen Ooms, and Hongsuda Tangmunarunkit) called mobilizR, which is based pretty heavily on the mosaic package developed by Randall Pruim, Danny Kaplan and Nick Horton. The idea with these packages is to provide beginners to R with a unified syntax and a set of verbs that relate more directly to the analysts’ goals. The basic structure for (almost) all commands is

*WhatIWantToDo(yvariable~xvariables, dataset)*

For example, to see the average walking distance recorded by a fitbit by day of the week:

> mean(Distance~DOW,data=fitbitdec) Friday Monday Saturday Sunday Thursday Tuesday Wednesday 1.900000 3.690000 2.020909 2.419091 1.432727 3.378182 3.644545The idea is to provide students with a simplified syntax that “bridges the gap” between beginners of R and more advanced users. Hopefully, this frees up some of the cognitive load required to remember and employ R commands so that students can think strategically and statistically about problems they are trying to solve.

The “bridge the gap” terminology comes from Amelia McNamara, who used the term in her PhD dissertation. One of the many really useful ideas Amelia has given us is the notion that the gap needs to be bridged. Much of “traditional” statistics education holds to the idea that statistical concepts are primarily mathematical, and, for most people, it is sufficient to learn enough of the mathematical concepts so that they can react skeptically and critically to others’ analyses. What is exciting about data science in education is that students can do their own analyses. And if students are analyzing data and discovering on their own (instead of just trying to understand others’ findings), then we need to teach them to use software in such a way that they can transition to more professional practices.

And now, dear readers, we get to the heart of the matter. That gap is really hard to bridge. One reason is that we know little to nothing about the terrain. How do students learn coding when applied to data analysis? How does the technology they use mediate that experience? How can it enhance, rather than inhibit, understanding of statistical concepts and the ability to do data analysis intelligently?

In other words, what’s the learning trajectory?

Tim rightly points to CODAP, the Common Online Data Analysis Platform, as one tool that might help bridge the gap by providing students with some powerful data manipulation techniques. And I recently learned about data.world, which seems another attempt to help bridge the gap. But Amelia’s point is that it is not enough to give students the ability to do something; you have to give it to them so that they are prepared to learn the next step. And if the end-point of a statistics education involves coding, then those intermediate steps need to be developing students’ coding skills, as well as their statistical thinking. It’s not sufficient to help studemts learn statistics. They must simultaneously learn computation.

So how do we get there? One important initial step, I believe, is to really examine what the term “computational thinking” means when we apply it to data analysis. And that will be the subject of an upcoming summer blog.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R Project – Citizen-Statistician**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Hacking Highcharter: observations per group in boxplots

(This article was first published on ** R – What You're Doing Is Rather Desperate**, and kindly contributed to R-bloggers)

Highcharts has long been a favourite visualisation library of mine, and I’ve written before about Highcharter, my preferred way to use Highcharts in R.

Highcharter has a nice simple function, hcboxplot(), to generate boxplots. I recently generated some for a project at work and was asked: can we see how many observations make up the distribution for each category? This is a common issue with boxplots and there are a few solutions such as: overlay the box on a jitter plot to get some idea of the number of points, or try a violin plot, or a so-called bee-swarm plot. In Highcharts, I figured there should be a method to get the number of observations, which could then be displayed in a tool-tip on mouse-over.

There wasn’t, so I wrote one like this.

First, you’ll need to install highcharter from Github to make it work with the latest dplyr.

Next, we generate a reproducible dataset using the wakefield package. For some reason, we want to look at age by gender, but only for redheads:

library(dplyr) library(tidyr) library(highcharter) library(wakefield) library(tibble) set.seed(1001) sample_data <- r_data_frame( n = 1000, age(x = 10:90), gender, hair ) %>% filter(hair == "Red") sample_data %>% count(Gender) ## # A tibble: 2 x 2 ## Gender n ## ## 1 Male 62 ## 2 Female 48Giving us 62 male and 48 female redheads. The tibble package is required because later on, our boxplot function calls the function has_name from that package.

The standard hcboxplot function shows us, on mouse-over, the summary data used in the boxplot, as in the image below.

hcboxplot(x = sample_data$Age, var = sample_data$Gender) %>% hc_chart(type = "column")To replace that with number of observations per group, we need to edit the function. In RStudio, View(hcboxplot) will open a tab with the (read-only) code, which can be copy/pasted and edited. Look for the function named get_box_values, which uses the R boxplot.stats function to generated a data frame:

get_box_values <- function(x) { boxplot.stats(x)$stats %>% t() %>% as.data.frame() %>% setNames(c("low", "q1", "median", "q3", "high")) }Edit it to look like this – the new function just adds a column obs with number of observations:

get_box_values <- function(x) { boxplot.stats(x)$stats %>% t() %>% cbind(boxplot.stats(x)$n) %>% as.data.frame() %>% setNames(c("low", "q1", "median", "q3", "high", "obs")) }Save the new function as, for example, my_hcboxplot. Now we can customise the tooltip to use the obs property of the point object:

my_hcboxplot(x = sample_data$Age, var = sample_data$Gender) %>% hc_chart(type = "column") %>% hc_tooltip(pointFormat = 'n = {point.obs}')Voilà.

Filed under: R, statistics

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – What You're Doing Is Rather Desperate**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Random Forests in R

(This article was first published on ** R Programming – DataScience+**, and kindly contributed to R-bloggers)

Ensemble Learning is a type of Supervised Learning Technique in which the basic idea is to generate multiple Models on a training dataset and then simply combining(average) their Output Rules or their Hypothesis \( H_x \) to generate a Strong Model which performs very well and does not overfits and which balances the Bisa-Variance Tradeoff too.

The idea is that instead of producing a single complicated and complex Model which might have a high variance which will lead to Overfitting or might be too simple and have a high bias which leads to Underfitting, we will generate lots of Models by training on Training Set and at the end combine them. Such a technique is Random Forest which is a popular Ensembling technique is used to improve the predictive performance of Decision Trees by reducing the variance in the Trees by averaging them. Decision Trees are considered very simple and easily interpretable as well as understandable Modelling techniques, but a major drawback in them is that they have a poor predictive performance and poor Generalization on Test Set. They are also referred to as Weak Learners which are Learners which always perform better than chance and have an error less than 50 %.

Random ForestsRandom Forests are similar to a famous Ensemble technique called Bagging but have a different tweak in it. In Random Forests the idea is to decorrelate the several trees which are generated on the different bootstrapped samples from training Data.And then we simply reduce the Variance in the Trees by averaging them.

Averaging the Trees helps us to reduce the variance and also improve the Perfomance of Decision Trees on Test Set and eventually avoid Overfitting.

The idea is to build lots of Trees in such a way to make the Correlation between the Trees smaller.

Another major difference is that we only consider a Random subset of predictors \( m \) each time we do a split on training examples.Whereas usually in Trees we find all the predictors while doing a split and choose best amongst them. Typically \( m=\sqrt{p} \) where \(p\) are the number of predictors.

Now it seems crazy to throw away lots of predictors, but it makes sense because the effect of doing so is that each tree uses different predictors to split data at various times.

*So by doing this trick of throwing away Predictors, we have de-correlated the Trees and the resulting average seems a little better.*

loading the required packages

require(randomForest) require(MASS)#Package which contains the Boston housing dataset attach(Boston) set.seed(101) dim(Boston)*## [1] 506 14*Saperating Training and Test Sets #training Sample with 300 observations train=sample(1:nrow(Boston),300) ?Boston #to search on the dataset

We are going to use variable ′medv′ as the Response variable, which is the Median Housing Value. We will fit 500 Trees.

Fitting the Random ForestWe will use all the Predictors in the dataset.

Boston.rf=randomForest(medv ~ . , data = Boston , subset = train) Boston.rf*## ## Call: ## randomForest(formula = medv ~ ., data = Boston, subset = train) ## Type of random forest: regression ## Number of trees: 500 ## No. of variables tried at each split: 4 ## ## Mean of squared residuals: 12.62686 ## % Var explained: 84.74*

The above Mean Squared Error and Variance explained are calculated using *Out of Bag Error* Estimation.In this \(\frac23\) of Training data is used for training and the remaining \( \frac13 \) is used to Validate the Trees. Also, the number of variables randomly selected at each split is 4.

Plotting the Error vs Number of Trees Graph.

plot(Boston.rf)This plot shows the Error and the Number of Trees.We can easily notice that how the Error is dropping as we keep on adding more and more trees and average them.

Now we can compare the Out of Bag Sample Errors and Error on Test setThe above Random Forest model chose Randomly 4 variables to be considered at each split. We could now try all possible 13 predictors which can be found at each split.

oob.err=double(13) test.err=double(13) #mtry is no of Variables randomly chosen at each split for(mtry in 1:13) { rf=randomForest(medv ~ . , data = Boston , subset = train,mtry=mtry,ntree=400) oob.err[mtry] = rf$mse[400] #Error of all Trees fitted pred<-predict(rf,Boston[-train,]) #Predictions on Test Set for each Tree test.err[mtry]= with(Boston[-train,], mean( (medv - pred)^2)) #Mean Squared Test Error cat(mtry," ") #printing the output to the console }*## 1 2 3 4 5 6 7 8 9 10 11 12 13*

Test Error

test.err*## [1] 26.06433 17.70018 16.51951 14.94621 14.51686 14.64315 14.60834 ## [8] 15.12250 14.42441 14.53687 14.89362 14.86470 15.09553*

Out of Bag Error Estimation

oob.err*## [1] 19.95114 13.34894 13.27162 12.44081 12.75080 12.96327 13.54794 ## [8] 13.68273 14.16359 14.52294 14.41576 14.69038 14.72979*

What happens is that we are growing 400 trees for 13 times i.e for all 13 predictors.

Plotting both Test Error and Out of Bag Error matplot(1:mtry , cbind(oob.err,test.err), pch=19 , col=c("red","blue"),type="b",ylab="Mean Squared Error",xlab="Number of Predictors Considered at each Split") legend("topright",legend=c("Out of Bag Error","Test Error"),pch=19, col=c("red","blue"))Now what we observe is that the Red line is the Out of Bag Error Estimates and the Blue Line is the Error calculated on Test Set. Both curves are quite smooth and the error estimates are somewhat correlated too. The Error Tends to be minimized at around \( mtry = 4 \) .

On the Extreme Right Hand Side of the above plot, we considered all possible 13 predictors at each Split which is only Bagging.

ConclusionNow in this article, I gave a simple overview of Random Forests and how they differ from other Ensemble Learning Techniques and also learned how to implement such complex and Strong Modelling Technique in R with a simple package randomForest.Random Forests are a very Nice technique to fit a more Accurate Model by averaging Lots of Decision Trees and reducing the Variance and avoiding Overfitting problem in Trees. Decision Trees themselves are poor performance wise, but when used with Ensembling Techniques like Bagging, Random Forests etc, their predictive performance is improved a lot.Now obviously there are various other packages in R which can be used to implement Random Forests in R.

I hope the tutorial is enough to get you started with implementing Random Forests in R or at least understand the basic idea behind how this amazing Technique works.

Thanks for reading the article and make sure to like and share it.

**
**

**Related Post**

- Network analysis of Game of Thrones
- Structural Changes in Global Warming
- Deep Learning with R
- Unsupervised Learning and Text Mining of Emotion Terms Using R
- Using MCA and variable clustering in R for insights in customer attrition

To **leave a comment** for the author, please follow the link and comment on their blog: ** R Programming – DataScience+**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Stippling and TSP art in R: emulating StippleGen

(This article was first published on ** R – dahtah**, and kindly contributed to R-bloggers)

Stippling is the creation of a pattern simulating varying degrees of solidity or shading by using small dots (Wikipedia).StippleGen is a piece of software that renders images using stipple patterns, which I discovered on Xi’an’s blog a couple days ago.

StippleGen uses an algorithm by Adrian Secord (described here) that turns out to be related to a problem in spatial statistics, specifically how to mess with high-order statistics of point processes while controlling density. The algorithm is a variant of k-means and is extremely easy to implement in R.

library(imager) library(dplyr) library(purrr) stipple <- function(im,nPoints=1e3,gamma=2,nSteps=10) { dens <- (1-im)^gamma xy <- sample(nPix(im),nPoints,replace=TRUE,prob=dens) %>% coord.index(im,.) %>% select(x,y) for (ind in 1:nSteps) { xy <- cvt(xy,dens) plot(im); points(xy,col="red") } xy } plot.stipple <- function(im,out,cex=.25) { g <- imgradient(im,"xy") %>% map(~ interp(.,out)) plot(out,ylim=c(height(im),1),cex=cex,pch=19,axes=FALSE,xlab="",ylab="") } ##Compute Voronoi diagram of point set xy, ##and return center of mass of each cell (with density given by image im) cvt <- function(xy,im) { voronoi(xy,width(im),height(im)) %>% as.data.frame %>% mutate(vim=c(im)) %>% group_by(value) %>% dplyr::summarise(x=weighted.mean(x,w=vim),y=weighted.mean(y,w=vim)) %>% select(x,y) %>% filter(x %inr% c(1,width(im)),y %inr% c(1,height(im))) } ##Compute Voronoi diagram for points xy over image of size (w,h) ##Uses a distance transform followed by watershed voronoi <- function(xy,w,h) { v <- imfill(w,h) ind <- round(xy) %>% index.coord(v,.) v[ind] <- seq_along(ind) d <- distance_transform(v>0,1) watershed(v,-d,fill_lines=FALSE) } #image from original paper im <- load.image("http://dahtah.github.io/imager/images/stippling_leaves.png") out <- stipple(im,1e4,nSteps=5,gamma=1.5) plot.stipple(im,out)TSP art is a variant where you solve a TSP problem to connect all the dots.

library(TSP) ##im is the original image (used only for its dimensions) ##out is the output of the stipple function (dot positions) draw.tsp <- function(im,out) { tour <- out %>% ETSP %>% solve_TSP plot(out[tour,],type="l",ylim=c(height(im),1),axes=FALSE,xlab="",ylab="") } ##Be careful, this is memory-heavy (also, slow) out <- stipple(im,4e3,gamma=1.5) draw.tsp(im,out)I’ve written a more detailed explanation on the imager website, with other variants like stippling with line segments, and a mosaic filter.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – dahtah**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Beneath the canvas

(This article was first published on ** Data Imaginist - R posts**, and kindly contributed to R-bloggers)

Recently a blog

post made its

rounds on the internet describing how it is possible to speed up plot creation

in ggplot2 by first creating a blank canvas and then later adding the plot

elements on top of it. The main takeaway plot is reproduced below:

The blog post is in generally well reasoned and I love how it moves from a

curious observation to an investigation and ends with a solid recommendation.

Alas, I don’t agree with the recommendation (that you shold create a *canvas*

for subsequent use). Most of the misunderstanding in the blog post comes from

the fact that ggplot2 in many ways seems to be fueled by magic and unicorn

blood — what arises when you write ggplot() and hit enter is far from clear. I

would like to spend most of the time on this point so I’m just going to get a

more general point out of the way first.

When looking for ways to optimise your code, one must always ask whether the

code needs optimisation in the first place, and then whether the changes made

successfully makes a meaningful impact. What the plot above shows is that

caching the ggplot() call leads to a *statistically significant* performance

improvement meassured in <10 ms. This means that in order to get a percievable

runtime difference, it would be necessary to generate hundreds of plots, or

thousands of plots to get a meaningful difference. My own rule of thumb is that

you should not give up coding conventions unless there’s a tangible result, and

in this case I don’t see any. Does this mean you should never strive for

millisecond improvements? No, if you expect your piece of code to be called

thousands of times and compounding the effect this would be worthwhile. This is

why you sometimes see code where the square root of a variable is saved in a new

variable rather than being computed on the fly every time. In this case you

should ask yourself whether you mean to generate a 1000 plots with your code in

one go, and if so, whether an additional second is really worth it.

The notion that ggplot() creates a canvas for subsequent calls to add onto is

a sensible one, supported by the ggplot2 API where layers are added to the

initial plot. Further, if we simply write ggplot() and hits enter we get this:

Which sure looks like a blank canvas. This is all magic and unicorns though –

the call to ggplot() doesn’t actually draw or render anything on the device.

In order to understand what is going on, let’s have a look at the code

underneath it all:

So, ggplot() is an S3 generic. As it is dispatching on the data argument, and

that defaults to NULL I’ll take the wild guess and say we’re using the default

method:

Huh, so even if we’re not passing in a data.frame as data we’re ending up with

a call to the data.frame ggplot method (this is actually the reason why you

can write your own fortify methods for custom objects and let ggplot2 work with

them automatically). Just for completeness let’s have a look at a fortified

NULL value:

We get a waiver object, which is an internal ggplot2 approach to saying: “I’ve

got nothing right now but let’s worry about that later” (grossly simplified).

With that out of the way, let’s dive into ggplot.data.frame():

ggplot2:::ggplot.data.frame #> function (data, mapping = aes(), ..., environment = parent.frame()) #> { #> if (!missing(mapping) && !inherits(mapping, "uneval")) { #> stop("Mapping should be created with `aes() or `aes_()`.", #> call. = FALSE) #> } #> p <- structure(list(data = data, layers = list(), scales = scales_list(), #> mapping = mapping, theme = list(), coordinates = coord_cartesian(), #> facet = facet_null(), plot_env = environment), class = c("gg", #> "ggplot")) #> p$labels <- make_labels(mapping) #> set_last_plot(p) #> p #> } #>This is actually a pretty simple piece of code. There are some argument

checks to make sure the mappings are provided in the correct way, but other than

that it is simply constructing a gg object (a ggplot subclass). The

set_last_plot() call makes sure that this new plot object is now retrievable

with the last_plot() function. In the end it simply returns the new plot

object. We can validate this by looking into the return value of ggplot():

We see our waiver data object in the data element. As expected we don’t have

any layers, but (perhaps surprising) we *do* have a coordinate system and a

facet specification. These are the defaults getting added to every plot and in

effect until overwritten by something else (facet_null() is simply a one-panel

plot, cartesian coordinates are a standard coordinate system, so the defaults

are sensible). While there’s a default theme in ggplot2 it is not part of the

plot object in the same way as the other defaults. The reason for this is that

it needs to be possible to change the theme defaults and have these changes

applied to all plot objects already in existence. So, instead of carrying the

full theme around, a plot object only keeps explicit changes to the theme and

then merges these changes into the current default (available with

theme_get()) during plotting.

All in all ggplot() simply creates an adorned list ready for adding stuff

onto (you might call this a virtual canvas but I think this is stretching

it…).

*So how come something pops up on your plotting device when you hit enter?* (for

a fun effect read this while sounding as Carrie from Sex and the City)

This is due to the same reason you get a model summary when hitting enter on a

lm() call etc.: The print() method. The print() method is called

automatically by R every time a variable is queried and, for a ggplot object,

it draws the content of your object on your device. An interesting side-effect

of this is that ggplots are only rendered when explicetly print()ed/plot()ed

within a loop, as only the last return value in a sequence of calls gets its

print method invoked. This also means that the benchmarks in the original

blogposts were only measuring plot object creation, and not actual plot

rendering, as this is never made explecit in the benchmarked function (A point

later emphasized in the original post as well). For fun, let’s see if

doing that changes anything:

So it appears any time difference is hidden by the actual complexity of

rendering the plot. This is sensible as the time scale has now increased

considerably and a difference in 1 ms will not be visible.

Prior to ggplot2 v2 simply plotting the result of ggplot() would result in an

error as the plot had no layers to draw. ggplot2 did not get the ability to draw

layerless plots in v2, but instead it got an invisible default layer

(geom_blank) that gets dropped once new layers are added. This just goes to show

the amount of logic going into plot generation in ggplot2 and why it sometimes

feels magical…

To **leave a comment** for the author, please follow the link and comment on their blog: ** Data Imaginist - R posts**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Runtime vs. Success (Using IMDB)

(This article was first published on ** R – NYC Data Science Academy Blog**, and kindly contributed to R-bloggers)

The content in this blog comes from a shiny application proof of concept using IMDB movie data. To view the application:

- IMDB Movie Data App on Shiny.io (Created in R / Shiny Dashboard)
- Source Code on Github

Years ago, students of film writing were advised to make their feature length movie scripts between 90 and 129 pages. A page is expected to be one minute of runtime, and today these numbers have been revised to a simple target of 110 minutes. Little fun fact connected with this: my professor tells the story of an insider’s view from development and how the decision makers at one of the giants used to pick up a screenplay, weigh it in his hands, and say “too light” or “too heavy.” Then the script was rejected without ever getting a single page read. Are these page length numbers arbitrary, or is there something to these guidelines?

A Walkthrough of The Shiny App And What It Can Tell Us

With the bin size set very small in the above visualizations, we can see movie releases increasing over time, with the rate accelerating much more starting in the 1970’s. Increasing the bin size from 2 to 28, further illustrates how in recent years, there are so many more movies then in the past.

A quick look at the history may shed a little light on what’s going on here. The first motion pictures were created in the 1890’s and were little more than short, black and white moving images. As the novelty of this waned, next came serial shorts from which the term “cliff hanger” originates. When real full-length stories started to evolve, they were expensive and hard to produce. As a practical matter, only small numbers could be made per year. Today, the bar has been raised but technology keeps lowering the cost of what it takes to meet that bar. Independent and ancillary markets have widened the distribution channels and even YouTube is becoming a part of this expanding network. It’s just a matter of time before low budget movies get made on smart phones if they haven’t been already. Nothing earth shattering here but the visualization does help show that the run-away escalation started during the time when “*Star Wars*”, “*Jaws*”, and “*Close Encounters of The Third Kind*” all made their debut. Many see this time as “the beginning of the blockbuster.”

As shown here, the data used in this application is organized into 5 subsets of data:

There is some overlap in the data sets shown on the application’s “Data” tab:

- The 5270 “Kaggle 5000 Project IDs” data set contains the results of merging 5000 movies whose IDs were obtained from a “Kaggle” project with the unique IDs from all the other data sets described above. The “Kaggle Project IDs” included a sample of releases from 1916 up until 2016.
- The other data sets came from Top and Bottom lists on IMDB. Most of the Top 250 English Movies were also in the “Top 250 Movies”. Some of the “Top 250 Indian Movies” are also in the “Top 250 Movies”. The Bottom 100 should only overlap with the “Kaggle” data set. Movies in these lists include titles from 2017.

Click on the “Visualization” tab of the app to obtain basic stats on each of the aforementioned datasets (Min, Max, Mean, Median and Quartiles) for these 4 numeric variables:

- Year Released
- Runtime (in minutes)
- Number of Reviews
- Movie Rating

This tab also provides a line plot using Loess linear regression which we can analyze for the general trend in movie runtimes that we are looking for.

If we start with the plot for “all the data” in our application, outliers are mostly clustered pretty close to the general confidence interval for the model. No outliers outside this range appear after 1990, and only a small number of points barely outside the confidence interval appear from 1960 to 1990.

Since there is a limited outlier effect, the mean seems like a reasonable metric. It is 109.2 minutes. Of more interest to the original question, this plot essentially reflects a 5000+ record sample of what got made. The hills and valleys of the line seem to range between 105 and 120 minutes up through the 1970’s. Then the line becomes a slightly downward sloping trend up through the present with our data points mostly appearing at around the 110 minute mark. Though anecdotal in nature, the original 110 minute recommendation for movie scripts would seem to be supported by the data. The confidence interval around the line though might suggest a range from about 100 to 120 minutes. If our movie gets made, we may then be concerned with what are its chances of making it into the top or bottom? Starting with the Top 250 Movies:

The mean for the Top 250 movies was 130 minutes. The curve trends upwards over all with a range in recent years (1990 to the present) that fell between: 130 and 140 minutes. There are a larger scattering of outliers on this plot, but based on how the outliers mostly cluster not too far from the confidence interval after 1990, the mean still seems reasonable to use. If we think about why these better received movies are often longer than the general trend for what gets made, I’m sure there are many factors involved. For one thing, big name director and producers have the kind of clout to make us sit through longer movies. Consider “*The Lord of The Rings*” saga, which collectively was over 9 hours long with each of 3 movies lasting in the range of 3 hours a piece.

We don’t want to end up in the bottom, so we’ll take a look at this too:

The mean for the Bottom 100 movies was 92.26 minutes. This curve is also showing a distinct upwards trend over all but with a range in recent years (1990 to the present) that was from about 90 minutes to 115 minutes. There are fewer outliers, but there is also less data in this grouping.

In ConclusionNote that for this application about 6000 IMDB records were obtained. In 2017 alone, IMDB recorded 12,193 new releases by July and the year is still young! Further analysis is indicated and performing the analysis with more data would also be helpful. Given the small size of the data, findings here should not be taken as an absolute. Given that, here is a summary of what the data suggests so far:

- The “Sweet Spot” to target for a feature length screen play to get made is about 110 minutes
- If we want that movie to make it into the Top 250 on IMDB, the final movie produces is expected to range from 130 to 140 minutes
- It is also interesting to note that movies in the bottom 100 tend to range between 90 and 115 minutes

With more time and development, a more thorough analysis could be developed from IMDB data.

The Data Collection ProcessThree R markdown scripts (available on Git) were used to gather source data. A full workflow from initial input files to what got used in the application is provided in the third script. As this process was experimentally developed and modified while creating the source data for this project, R markdown was used to step through the process and keep notes on what was going. High level:

- An XML based R library was used to screen scrape data from “hot lists” (Top 250, Bottom 100, etc.) directly off of the IMDB website in July of 2017. This produced ‘csv’ files used as source for all the “Top / Bottom” lists provided in the final application.
- 5000+ IMDB IDs were extracted using a software process (not documented) from a json file posted with a Kaggle project. That project last gathered data in 2016. IDs from steps 1 and 2 were then used with step 3 to extract current data on these IDs.
- An Open API referred to as “IMDB Open API” was used to extract records from the IMDB website one record at a time, with a 3 second delay between each extraction. This was to ensure no harm to the website. This script was designed to save its work “along the way” and provide helpful output messages so that if an error occurred that halted the program. Data collected so far would be available, and the script could then pick up where it left off to finish the job.
- Close to 6000 observations in total (5270 after filtering out of duplicates) w/ about 27 variables were obtained from IMDB.
- Data cleaning of just the 7 variables brought into the Shiny application ensued with the intent to clean other variables in future development iterations. The vision was to develop subset tables with Movie_IDs on each one as a unique identifier, and then join them to previous data when bringing in each new set to expand the app in support of new analysis and features.

The CEO of a company I used to work for would tell stories of how senior management of various IT areas under him would plan and design solutions on a napkin in a bar which formed the basis of software project requirements. For this shiny app, the “napkin” was actually a piece of notebook paper captured as a PDF for your amusement. Only a small piece of this plan has been realized so far.

The post Runtime vs. Success (Using IMDB) appeared first on NYC Data Science Academy Blog.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – NYC Data Science Academy Blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Programming with dplyr by using dplyr

(This article was first published on ** R – Enchufa2**, and kindly contributed to R-bloggers)

The title may seem tautological, but since the arrival of dplyr 0.7.x, there have been some efforts at *using dplyr without actually using it* that I can’t quite understand. The tidyverse has raised passions, for and against it, for some time already. There are excellent alternatives out there, and I myself use them when I find it suitable. But when I choose to use dplyr, I find it most versatile, and I see no advantage in adding yet another layer that complicates things and makes problems even harder to debug.

Take the example of seplyr. It stands for *standard evaluation dplyr*, and enables us to program over dplyr without having “to bring in (or study) any deep-theory or heavy-weight tools such as rlang/tidyeval”. Let’s consider the following interactive pipeline:

Let’s say we want to parametrise the grouping variable and wrap the code above into a re-usable function. Apparently, this is difficult with dplyr. But is it? Not at all: we just need to add one line and a *bang-bang* (!!):

The enquo() function quotes the name we put in our function (homeworld), and the bang-bang unquotes and uses that name instead of var. That’s it. What about seplyr? With seplyr, we *just* have to (and I quote)

- Change dplyr verbs to their matching seplyr “*_se()” adapters.
- Add quote marks around names and expressions.
- Convert sequences of expressions (such as in the summarize()) to explicit vectors by adding the “c()” notation.
- Replace “=” in expressions with “:=”.

This is the result:

library(seplyr) starwars_mean <- function(my_var) { starwars %>% group_by_se(my_var) %>% summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)", "mean_mass" := "mean(mass, na.rm = TRUE)", "count" := "n()")) } starwars_mean("homeworld") ## # A tibble: 49 x 4 ## homeworld mean_height mean_mass count ## ## 1 Alderaan 176.3333 64.0 3 ## 2 Aleen Minor 79.0000 15.0 1 ## 3 Bespin 175.0000 79.0 1 ## 4 Bestine IV 180.0000 110.0 1 ## 5 Cato Neimoidia 191.0000 90.0 1 ## 6 Cerea 198.0000 82.0 1 ## 7 Champala 196.0000 NaN 1 ## 8 Chandrila 150.0000 NaN 1 ## 9 Concord Dawn 183.0000 79.0 1 ## 10 Corellia 175.0000 78.5 2 ## # ... with 39 more rowsBasically, we had to change the entire pipeline. If re-usability was the goal, I think we lost some of it here. But, wait, we are still using non-standard evaluation in the first example. What if we really need to provide the grouping variable as a string? Easy enough, we just need to change enquo() with as.name() to convert the string to a name:

starwars_mean <- function(var) { var <- as.name(var) starwars %>% group_by(!!var) %>% summarise(mean_height = mean(height, na.rm = TRUE), mean_mass = mean(mass, na.rm = TRUE), count = n()) } starwars_mean("homeworld") ## # A tibble: 49 x 4 ## homeworld mean_height mean_mass count ## ## 1 Alderaan 176.3333 64.0 3 ## 2 Aleen Minor 79.0000 15.0 1 ## 3 Bespin 175.0000 79.0 1 ## 4 Bestine IV 180.0000 110.0 1 ## 5 Cato Neimoidia 191.0000 90.0 1 ## 6 Cerea 198.0000 82.0 1 ## 7 Champala 196.0000 NaN 1 ## 8 Chandrila 150.0000 NaN 1 ## 9 Concord Dawn 183.0000 79.0 1 ## 10 Corellia 175.0000 78.5 2 ## # ... with 39 more rowsBut we can do even better if we remember that dplyr provides scoped variants (see ?dplyr::scoped) for most of the verbs. In this case, group_by_at() comes in handy:

starwars_mean <- function(var) { starwars %>% group_by_at(var) %>% summarise(mean_height = mean(height, na.rm = TRUE), mean_mass = mean(mass, na.rm = TRUE), count = n()) } starwars_mean("homeworld") ## # A tibble: 49 x 4 ## homeworld mean_height mean_mass count ## ## 1 Alderaan 176.3333 64.0 3 ## 2 Aleen Minor 79.0000 15.0 1 ## 3 Bespin 175.0000 79.0 1 ## 4 Bestine IV 180.0000 110.0 1 ## 5 Cato Neimoidia 191.0000 90.0 1 ## 6 Cerea 198.0000 82.0 1 ## 7 Champala 196.0000 NaN 1 ## 8 Chandrila 150.0000 NaN 1 ## 9 Concord Dawn 183.0000 79.0 1 ## 10 Corellia 175.0000 78.5 2 ## # ... with 39 more rowsThat’s it: no bang-bang, just strings and only one change to the original code. Let’s dwell on the potential of the scoped variants with a final example. We can make a completely generic re-usable “grouped mean” function using seplyr and R’s paste0() function to build up expressions:

grouped_mean <- function(data, grouping_variables, value_variables) { result_names <- paste0("mean_", value_variables) expressions <- paste0("mean(", value_variables, ", na.rm = TRUE)") data %>% group_by_se(grouping_variables) %>% summarize_se(c(result_names := expressions, "count" := "n()")) } starwars %>% grouped_mean("eye_color", c("mass", "birth_year")) ## # A tibble: 15 x 4 ## eye_color mean_mass mean_birth_year count ## ## 1 black 76.28571 33.00000 10 ## 2 blue 86.51667 67.06923 19 ## 3 blue-gray 77.00000 57.00000 1 ## 4 brown 66.09231 108.96429 21 ## 5 dark NaN NaN 1 ## 6 gold NaN NaN 1 ## 7 green, yellow 159.00000 NaN 1 ## 8 hazel 66.00000 34.50000 3 ## 9 orange 282.33333 231.00000 8 ## 10 pink NaN NaN 1 ## 11 red 81.40000 33.66667 5 ## 12 red, blue NaN NaN 1 ## 13 unknown 31.50000 NaN 3 ## 14 white 48.00000 NaN 1 ## 15 yellow 81.11111 76.38000 11And the same with dplyr’s scoped verbs (note that I’ve added the last rename_at() on a whim, just to get exactly the same output as before, but it is not really necessary):

grouped_mean <- function(data, grouping_variables, value_variables) { data %>% group_by_at(grouping_variables) %>% mutate(count = n()) %>% summarise_at(c(value_variables, "count"), mean, na.rm = TRUE) %>% rename_at(value_variables, funs(paste0("mean_", .))) } starwars %>% grouped_mean("eye_color", c("mass", "birth_year")) ## # A tibble: 15 x 4 ## eye_color mean_mass mean_birth_year count ## ## 1 black 76.28571 33.00000 10 ## 2 blue 86.51667 67.06923 19 ## 3 blue-gray 77.00000 57.00000 1 ## 4 brown 66.09231 108.96429 21 ## 5 dark NaN NaN 1 ## 6 gold NaN NaN 1 ## 7 green, yellow 159.00000 NaN 1 ## 8 hazel 66.00000 34.50000 3 ## 9 orange 282.33333 231.00000 8 ## 10 pink NaN NaN 1 ## 11 red 81.40000 33.66667 5 ## 12 red, blue NaN NaN 1 ## 13 unknown 31.50000 NaN 3 ## 14 white 48.00000 NaN 1 ## 15 yellow 81.11111 76.38000 11Wrapping up, the tidyeval paradigm may seem difficult at a first glance, but don’t miss the wood for the trees: the new version of dplyr is full of tools that will make your life easier, not harder.

*Article originally published in Enchufa2.es: Programming with dplyr by using dplyr*.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Enchufa2**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Data Visualization with googleVis exercises part 8

(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

In the eighth part of our series we are going to learn about the features of some interesting types of charts. More specifically we will talk about Annotation and Sankey charts.

Read the examples below to understand the logic of what we are going to do and then test yous skills with the exercise set we prepared for you. Lets begin!

Answers to the exercises are available here.

**Package**

As you already know, the first thing you have to do is install and load the googleVis package with:

install.packages("googleVis")

library(googleVis)

**NOTE**: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page. All charts require an Internet connection.

**Annotation chart**

It is quite simple to create an Annotation Chart with googleVis. We will use the “Stocks” dataset. You can see the variables of your dataset with head().

Look at the example below to create a simple Annotation Chart:

AnnoC <- gvisAnnotationChart(Stock)

plot(AnnoC)

**Exercise 1**

Create a list named “AnnoC” and pass to it the “Stock” dataset as an annotation chart. **HINT**: Use gvisAnnotationChart().

**Exercise 2**

Plot the the annotation chart. **HINT**: Use plot().

**Set the variables**

As you can see the annotation chart you built is empty so we have to fill it with some information. We will use the variables of ths “Stock” dataset for this purpose like this:

datevar="Date",

numvar="Value",

idvar="Device",

titlevar="Title",

annotationvar="Annotation"

**Learn more**about using GoogleVis in the online course Mastering in Visualization with R programming. In this course you will learn how to:

- Work extensively with the GoogleVis package and its functionality
- Learn what visualizations exist for your specific use case
- And much more

**Exercise 3**

Use the example above to fill your annotation chart with the proper information and plot the chart.

**Dimensions**

You can use height and width to change the dimensions of the annotation chart.

options=list(width=500, height=300)

**Exercise 4**

Set height to 700, width to 500 and plot the chart. **HINT**: Use list().

**Colours**

You can change the colours of the lines with:

options=list(colors="['green','yellow']")

**Exercise 5**

Set the line colours to green and red and plot the chart.

With the fill option you can color the relevant areas and adjust how filled these areas will be. Look at the example:

options=list(colors="['green','yellow']",fill=20)

**Exercise 6**

Set fill to 50 and plot the chart.

**Exercise 7**

Now set fill to 150 to see the difference and plot the chart.

**Sankey Chart**

Before creating a sankey chart we have to create a data frame that will help us as an example, so copy and paste this code to crate the data frame “exp” first:

exp <- data.frame(From=c(rep("A",3), rep("B", 3)),

To=c(rep(c("X", "Y", "Z"),2)),

Weight=c(6,9,7,9,3,1))

As you can see we created a data frame with the variabls “From”, “To” and “Weight”. You can use head() in order to see them.

It is quite simple to create an Sankey Chart with googleVis. We will use the “exp” data frame.

Look at the example below to create a simple Sankey Chart:

SankeyC <- gvisSankey(exp )

plot(SankeyC)

**Exercise 8**

Create a list named “SankeyC” and pass to it the “exp” dataset as a sankey chart. **HINT**: Use gvisSankey().

**Exercise 9**

Plot the the sankey chart. **HINT**: Use plot().

You can change the link colours with:

options=list(sankey="{link: {color: { fill: 'red' } }

**Exercise 10**

Color the links of ths sankey chart green and plot it.

Related exercise sets:- Data Visualization with googleVis exercises part 4
- Data Visualization with googleVis exercises part 2
- Data Visualization with googleVis exercises part 3
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...