### R⁶ — General (Attys) Distributions

(This article was first published on ** R – rud.is**, and kindly contributed to R-bloggers)

Matt @stiles is a spiffy data journalist at the @latimes and he posted an interesting chart on U.S. Attorneys General longevity (given that the current US AG is on thin ice):

Only Watergate and the Civil War have prompted shorter tenures as AG (if Sessions were to leave now). A daily viz: https://t.co/aJ4KDsC5kC pic.twitter.com/ZoiEV3MhGp

— Matt Stiles (@stiles) July 25, 2017

I thought it would be neat (since Matt did the data scraping part already) to look at AG tenure distribution by party, while also pointing out where Sessions falls.

Now, while Matt did scrape the data, it’s tucked away into a javascript variable in an iframe on the page that contains his vis.

It’s still easier to get it from there vs re-scrape Wikipedia (like Matt did) thanks to the V8 package by @opencpu.

The following code:

- grabs the vis iframe
- extracts and evaluates the target javascript to get a nice data frame
- performs some factor re-coding (for better grouping and to make it easier to identify Sessions)
- plots the distributions using the beeswarm quasirandom alogrithm

I turned the data into a CSV and stuck it in this gist if folks want to play w/o doing the js scraping.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – rud.is**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### SQL Server 2017 release candidate now available

(This article was first published on ** Revolutions**, and kindly contributed to R-bloggers)

SQL Server 2017, the next major release of the SQL Server database, has been available as a community preview for around 8 months, but now the first full-featured release candidate is available for public preview. For those looking to do data science with data in SQL Server, there are a number of new features compared to SQL Server 2017 which might be of interest:

- SQL Server 2017 will be supported on Linux, specifically on Red Hat, SUSE, and Ubuntu (as well as within a Docker container). You can read some of the backstory on how SQL Server came to Linux here.
- The new SQL Server Machine Learning Services will provide in-database analytics in both the R and — new to SQL Server 2017 — Python languages.
- There are several improvements to the in-database R integration (previously known as "R Services" in SQL Server 2016). These includes a new package management system, real-time in-database scoring of predictive models, and an updated R engine.

SQL Server 2017 Release Candidate 1 is available for download now. For more details on these and other new features in this release, check out the link below.

SQL Server Blog: SQL Server 2017 CTP 2.1 now available

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To **leave a comment** for the author, please follow the link and comment on their blog: ** Revolutions**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-4)

(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

Statistics are often taught in school by and for people who like Mathematics. As a consequence, in those class emphasis is put on leaning equations, solving calculus problems and creating mathematics models instead of building an intuition for probabilistic problems. But, if you read this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistic problems by writing simple R query and how to think in probabilistic terms.

Until now, in this series of exercise sets, we have used only continuous probability distributions, which are functions defined on all the real numbers on a certain interval. As a consequence, random variable who have those distributions can assume an infinity of values. However, a lot of random situations only have a finite amount of possible outcome and using a continuous probability distributions to analyze them is not really useful. In today set, we’ll introduce the concept of discrete probability functions, which can be used in those situations and some examples of problems in which they can be used.

Answers to the exercises are available here.

**Exercise 1**

Just as continuous probability distributions are characterized by a probability density function discrete probability functions are characterized by a probability mass function which gives the probability that a random variable is equal to one value.

The first probability mass function we will use today is the binomial distribution, which is used to simulate n iterations of a random process who can either result in a success, with a probability of p, or a failure, with a probability of (1-p). Basically, if you want to simulate something like a coins flip, the binomial distribution is the tool you need.

Suppose you roll a 20 sided dice 200 times and you want to know the probability to get a 20 exactly five times on your rolls. Use the dbinom(n, size, prob) function to compute this probability.

**Exercise 2**

For the binomial distribution, the individual events are independents, meaning that the probability of realization of two events can be calculated by adding the probability of realization of both event. This principle can be generalize to any number of events. For example, the probability of getting three tails or less when you flip a coins 10 time is equal to the probability of getting 1 tails plus the probability of getting 2 tails plus the probability of getting 3 tails.

Knowing this, use the dbinom() function to compute the probability of getting six correct responses at a test made of 10 questions which have true or false for answer if you answer randomly. Then, use the pbinom() function to compute the cumulative probability function of the binomial distribution in that situation.

**Exercise 3**

Another consequence of the independence of events is that if we know the probability of realization of a set of events we can compute the probability of realization of one of his subset by subtracting the probability of the unwanted event. For example, the probability of getting two or three tails when you flip a coins 10 time is equal to the probability of getting at least 3 tails minus the probability of getting 1 tails.

Knowing this, compute the probability of getting 6 or more correct answer on the test described in the previous exercise.

**Learn more**about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to:

- Work with about different binomial and logistic regression techniques
- Know how to compare regression models and choose the right fit
- And much more

**Exercise 4**

Let’s say that in an experiment a success is defined as getting a 1 if you roll a 20 sided die. Use the barplot() function to represent the probability of getting from 0 to 10 success if you roll the die 10 times. What happened to the barplot if you roll a 10 sided die instead? If you roll a 3 sided die?

**Exercise 5**

Another discrete probability distribution close to the binomial distribution is the Poisson distribution, which give the probability of a number of events to occur during a fixed amount of time if we know the average rate of his occurrence. For example, we could use this distribution to estimate the amount of visitor who goes on a website if we know the average number of visitor per second. In this case, we must assume two things: first that the website has visitor from around the world since the rate of visitor must be constant around the day and two that when a visitor is coming on the site he is not influenced by the last visitor since a process can be expressed by the Poisson distribution if the events are independent from each other.

Use the dpois() function to estimate the probability of having 85 visitors on a website in the next hour if in average 80 individual connect on the site per hour. What is the probability of getting 2000 unique visitors on the website in a day?

**Exercise 6**

Poisson distribution can be also used to compute the probability of an event occurring in an amount of space, as long as the unit of the average rate is compatible with the unit of measure of the space you use. Suppose that a fishing boat catch 1/2 ton of fish when his net goes through 5 squares kilometers of sea. If the boat combed 20 square kilometer, what is the probability that they catch 5 tons of fish?

**Exercise 7**

Until now, we used the Poisson distribution to compute the probability of observing precisely n occurrences of an event. In practice, we are often interested in knowing the probability that an event occur n times or less. To do so we can use the ppois() function to compute the cumulative Poisson distribution. If we are interested in knowing what is the probability of observing strictly more than n occurrences, we can use this function and set the parameter lower to FALSE.

In the situation of exercise 5, what is the probability that the boat caught 5 tons of fish or less? What is the probability that the caught more than 5 tons of fish?

Note that, just as in a binomial experiment, the events in a Poisson process are independant, so you can add or subtract probability of event to compute the probability of a particular set of events.

**Exercise 8**

Draw the Poisson distribution for average rate of 1,3,5 and 10.

**Exercise 9**

The last discrete probability distribution we will use today is the negative binomial distribution which give the probability of observing a certain number of success before observing a fixed number of failures. For example, imagine that a professional football player will retire at the end of the season. This player has scored 495 goals in his career and would really want to meet the 500 goal mark before retiring. If he is set to play 8 games until the end of the season and score one goal every three games in average, we can use the negative binomial distribution to compute the probability that he will meet his goal on his last game, supposing that he won’t score more than one goal per game.

Use the dnbinom() function to compute this probability. In this case, the number of success is 5, the probability of success is 1/3 and the number of failures is 3.

**Exercise 10**

Like for the Poisson distribution, R give us the option to compute the cumulative negative binomial distribution with the function pnbinom(). Again, the lower.tail parameter than give you the option to compute the probability of realizing more than n success if he is set to TRUE.

In the situation of the last exercise, what is the probability that the football player will score at most 5 goals in before the end of his career.

Related exercise sets:- Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-2)
- Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-1)
- Lets Begin with something sample
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Experiment designs for Agriculture

(This article was first published on ** R tutorial for Spatial Statistics**, and kindly contributed to R-bloggers)

Statistical Methods for Bioscience (Wisconsin-Madison): http://www.stat.wisc.edu/courses/st572-larget/Spring2007/
R Packages to create several designs are presented here: https://cran.r-project.org/web/views/ExperimentalDesign.html
A very good tutorial about the use of the package Agricolae can be found here:

https://cran.r-project.org/web/packages/agricolae/vignettes/tutorial.pdf
Complete Randomized Design
This is probably the most common design, and it is generally used when conditions are uniform, so we do not need to account for variations due for example to soil conditions.
In R we can create a simple CRD with the function expand.grid and then with some randomization:
TR.Structure = expand.grid(rep=1:3, Treatment1=c("A","B"), Treatment2=c("A","B","C"))

Data.CRD = TR.Structure[sample(1:nrow(TR.Structure),nrow(TR.Structure)),]

Data.CRD = cbind(PlotN=1:nrow(Data.CRD), Data.CRD[,-1])

write.csv(Data.CRD, "CompleteRandomDesign.csv", row.names=F)

The first line create a basic treatment structure, with rep that identifies the number of replicate, that looks like this:

rep Treatment1 Treatment2

1 1 A A

2 2 A A

3 3 A A

4 1 B A

5 2 B A

6 3 B A

7 1 A B

8 2 A B

9 3 A B

10 1 B B

11 2 B B

12 3 B B

13 1 A C

14 2 A C

15 3 A C

16 1 B C

17 2 B C

18 3 B C

The second line randomizes the whole data.frame to obtain a CRD, then we add with cbind a column at the beginning with an ID for the plot, while also eliminating the columns with rep.

Add Control
To add a Control we need to write two separate lines, one for the treatment structure and the other for the control:
TR.Structure = expand.grid(rep=1:3, Treatment1=c("A","B"), Treatment2=c("A","B","C"))

CR.Structure = expand.grid(rep=1:3, Treatment1=c("Control"), Treatment2=c("Control"))

Data.CCRD = rbind(TR.Structure, CR.Structure)

This will generate the following table:

rep Treatment1 Treatment2

1 1 A A

2 2 A A

3 3 A A

4 1 B A

5 2 B A

6 3 B A

7 1 A B

8 2 A B

9 3 A B

10 1 B B

11 2 B B

12 3 B B

13 1 A C

14 2 A C

15 3 A C

16 1 B C

17 2 B C

18 3 B C

19 1 Control Control

20 2 Control Control

21 3 Control Control

As you can see the control is totally excluded from the rest. Now we just need to randomize, again using the function sample:

Data.CCRD = Data.CCRD[sample(1:nrow(Data.CCRD),nrow(Data.CCRD)),]Data.CCRD = cbind(PlotN=1:nrow(Data.CCRD), Data.CCRD[,-1])

write.csv(Data.CCRD, "CompleteRandomDesign_Control.csv", row.names=F)

Block Design with Control The starting is the same as before. The difference starts when we need to randomize, because in CRD we randomize over the entire table, but with blocks, we need to do it by block. TR.Structure = expand.grid(Treatment1=c("A","B"), Treatment2=c("A","B","C"))

CR.Structure = expand.grid(Treatment1=c("Control"), Treatment2=c("Control"))

Data.CBD = rbind(TR.Structure, CR.Structure)

Block1 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]

Block2 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]

Block3 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]

Data.CBD = rbind(Block1, Block2, Block3)

BlockID = rep(1:nrow(Block1),3)

Data.CBD = cbind(Block = BlockID, Data.CBD)

write.csv(Data.CBD, "BlockDesign_Control.csv", row.names=F)

As you can see from the code above, we’ve created three objects, one for each block, where we used the function sample to randomize.

Other Designs with Agricolae
The package agricolae includes many designs, which I am sure will cover all your needs in terms of setting up field and lab experiments.
We will look at some of them, so first let’s install the package:
install.packages("agricolae")

library(agricolae)

The main syntax for design in agricolae is the following:

design.crd(trt=Trt1, r=3)

The result is the output below:

> design.crd(trt=Trt1, r=3)$parameters

$parameters$design

[1] "crd"

$parameters$trt

[1] "A" "B" "C"

$parameters$r

[1] 3 3 3

$parameters$serie

[1] 2

$parameters$seed

[1] 1572684797

$parameters$kinds

[1] "Super-Duper"

$parameters[[7]]

[1] TRUE

$book

plots r Trt1

1 101 1 A

2 102 1 B

3 103 2 B

4 104 2 A

5 105 1 C

6 106 3 A

7 107 2 C

8 108 3 C

9 109 3 B

As you can see the function takes only one argument for treatments and another for replicates. Therefore, if we need to include a more complex treatment structure we first need to work on them:

Trt1 = c("A","B","C")Trt2 = c("1","2")

Trt3 = c("+","-")

TRT.tmp = as.vector(sapply(Trt1, function(x){paste0(x,Trt2)}))

TRT = as.vector(sapply(TRT.tmp, function(x){paste0(x,Trt3)}))

TRT.Control = c(TRT, rep("Control", 3))

As you can see we have now three treatments, which are merged into unique strings within the function sapply:

> TRT[1] "A1+" "A1-" "A2+" "A2-" "B1+" "B1-" "B2+" "B2-" "C1+" "C1-" "C2+" "C2-"

Then we need to include the control, and then we can use the object TRT.Control with the function design.crd, from which we can directly obtain the data.frame with $book:

> design.crd(trt=TRT.Control, r=3)$bookplots r TRT.Control

1 101 1 A2+

2 102 1 B1+

3 103 1 Control

4 104 1 B2+

5 105 1 A1+

6 106 1 C2+

7 107 2 A2+

8 108 1 C2-

9 109 2 Control

10 110 1 B2-

11 111 3 Control

12 112 1 Control

13 113 2 C2-

14 114 2 Control

15 115 1 C1+

16 116 2 C1+

17 117 2 B2-

18 118 1 C1-

19 119 2 C2+

20 120 3 C2-

21 121 1 A2-

22 122 2 C1-

23 123 2 A1+

24 124 3 C1+

25 125 1 B1-

26 126 3 Control

27 127 3 A1+

28 128 2 B1+

29 129 2 B2+

30 130 3 B2+

31 131 1 A1-

32 132 2 B1-

33 133 2 A2-

34 134 1 Control

35 135 3 C2+

36 136 2 Control

37 137 2 A1-

38 138 3 B1+

39 139 3 Control

40 140 3 A2-

41 141 3 A1-

42 142 3 A2+

43 143 3 B2-

44 144 3 C1-

45 145 3 B1-

Other possible designs are:

#Random Block Designdesign.rcbd(trt=TRT.Control, r=3)$book

#Incomplete Block Design

design.bib(trt=TRT.Control, r=7, k=3)

#Split-Plot Design

design.split(Trt1, Trt2, r=3, design=c("crd"))

#Latin Square

design.lsd(trt=TRT.tmp)$sketch

Others not included above are: Alpha designs, Cyclic designs, Augmented block designs, Graeco – latin square designs, Lattice designs, Strip Plot Designs, Incomplete Latin Square Design

Final Note For repeated measures and crossover designs I think we can create designs simply using again the function expand.grid and including time and subjects, as I did in my previous post about Power Analysis. However, there is also the package Crossover that deals specifically with crossover design and on this page you can find more packages that deal with clinical designs: https://cran.r-project.org/web/views/ClinicalTrials.html var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To **leave a comment** for the author, please follow the link and comment on their blog: ** R tutorial for Spatial Statistics**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Twitter Coverage of the Bioinformatics Open Source Conference 2017

July 21-22 saw the 18th incarnation of the Bioinformatics Open Source Conference, which generally precedes the ISMB meeting. I had the great pleasure of attending BOSC way back in 2003 and delivering a short presentation on Bioperl. I knew almost nothing in those days, but everyone was very kind and appreciative.

My trusty R code for Twitter conference hashtags pulled out 3268 tweets and without further ado here is:

- the Github repository, where you can view the markdown report in the code/R directory
- the published report at RPubs

The ISMB/ECCB meeting wraps today and analysis of Twitter coverage for that meeting will appear here in due course.

Filed under: bioinformatics, meetings, R, statistics Tagged: bosc

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));### The new MODIStsp website (based on pkgdown) is online !

(This article was first published on ** Spatial Processing in R**, and kindly contributed to R-bloggers)

The **MODIStsp** website, which lay abandoned since several months on github pages, recently underwent a major overhaul thanks to **pkgdown**. The new site is now available at http://lbusett.github.io/MODIStsp/

We hope that the revised website will allow to navigate MODIStsp-related material much more easily than either github or the standard CRAN documentation, and will therefore help users in better and more easily exploiting MODIStsp functionality.

The restyling was possible thanks to the very nice “**pkgdown**” R package (http://hadley.github.io/pkgdown), which allows to easily create a static documentation website.

Though pkgdown does already quite a good job in creating a bare-bones website exploiting **just the material available in a standard devtools-based R package file structure**, I was surprised at **how simply the standard website could be tweaked **to obtain a very nice (IMO) final result without needing any particular background on html, css, etc. !

(On that, I plan to soon post a short guide containing a few **tips and tricks** I learnt in the last days for setting up and configuring a pkgdown website, so stay tuned !)

To **leave a comment** for the author, please follow the link and comment on their blog: ** Spatial Processing in R**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### What are people saying about ICCB2017 on Twitter?

(This article was first published on ** Bluecology blog**, and kindly contributed to R-bloggers)

Here’s a brief analysis of tweets from the International Congress for Conservation Biology to date. The conference started on the starting 23rd July I was curious to see what people are talking about there and who is doing the talking on twitter.

As of 25th July I could access 6978 tweets and retweets from the conference, starting on 2017-07-15.

If you are wondering how these stats compare to other conferences, check out my post from ICRS last year.

Who is talkingThere have been about 1500 users on #ICCB2017 so far.

Clearly, the people talking on twitter are a biased selection of people at ICCB2017 and may also include people not there (like me). As usual, a lot people only tweet once or twice.

The users doing the most tweeting are (in order of most to least with number of tweets):

@IbuAnggrek 528, @ICCB2017 310, @AzurigenMCS 199, @WhySharksMatter 179, @rebecca_jarvis 172,

@CarlaWildlife 146, @WILDLABSNET 93, @Society4ConBio 84, @FancyScientist 80,

@davila_federico 79

Here is a table of the most popular tweets, by their RT count:

Rank Name text Retweet count 1 WhySharksMatter Angler gives up chance at world record, releases huge blacktip shark alive. #ICCB2017 #SharkWeek (from 2014) https://t.co/dwmiAeSXQW https://t.co/74SyQ6Uhfk 96 2 eguerreroforero @Brigittelgb en #ICCB2017: A los países megadiversos nos llaman ‘megadispersos’ porq no hemos logrado posicionar políticamente biodiversidad https://t.co/u43dVvcjHO 65 3 HugePossum Australia’s proposed marine parks have never been science based – bad planning for nature and people #ICCB2017 https://t.co/rZXAueuod6 64 4 ICCB2017 #ICCB2017 attendees – there are sloths in El Parque Centenario near the convention center! #UrbanWildlife https://t.co/81YktjU4Mu 59 5 Seasaver UN Patron of Oceans calls for ban on tournaments which kill threatened sharks https://t.co/P5KmUxBte3 #ICCB2017 @TheIGFA #sportfishing 40I like that sloths are right up there at number 4 at the moment. Marine science topics are also pretty popular.

What people are sayingTextual analysis is a bit of a dark art, but here are some of the most popular words appearing in tweets (with the number of tweets they appear in ):

Term Number of tweets iccb 6965 conserv 1064 amp 691 colombia 477 talk 467 biodivers 411 cartagena 380 work 362 will 323 whysharksmatter 315 societyconbio 304 open 303 great 298 can 293 come 292 use 289 conservation 287 join 285 need 248 world 244At this stage the location is an imporant talking point. Also predictably ‘conservation’ and ‘biodiversity’ are also up there. It is interesting that ‘need’ features in a lot of tweets. Perhaps a lot of people saying we ‘need’ to do this or that?

Time of tweetsFinally, it can be illustrative to look at the timing of tweets. This graph is made with times for the time zone local to Cartagena.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** Bluecology blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Analyzing Github pull requests with Neural Embeddings, in R

(This article was first published on ** Revolutions**, and kindly contributed to R-bloggers)

At the useR!2017 conference earlier this month, my colleague Ali Zaidi gave a presentation on using Neural Embeddings to analyze GitHub pull request comments (processed using the tidy text framework). The data analysis was done using R and distributed on Spark, and the resulting neural network trained using the Microsoft Cognitive Toolkit. You can see the slides here, and you can watch the presentation below.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To **leave a comment** for the author, please follow the link and comment on their blog: ** Revolutions**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Are computers needed to teach Data Science?

(This article was first published on ** R Project – Citizen-Statistician**, and kindly contributed to R-bloggers)

One of the many nice things about summer is the time and space it allows for blogging. And, after a very stimulating SRTL conference (Statistics Reasoning, Teaching and Learning) in Rotorua, New Zealand, there’s lots to blog about.

Let’s begin with a provocative posting by fellow SRTL-er Tim Erickson at his excellent blog A Best Case Scenario. I’ve known Tim for quite awhile, and have enjoyed many interesting and challenging discussions. Tim is a creator of curricula *par excellence*, and has first-hand experience in what inspires and motivates students to think deeply about statistics.

The central question here is: Is computation (on a computer) necessary for learning data science? The learners here are beginners in K-12. Tim answers no, and I answer, tentatively, yes. Tim portrays me in his blog as being a bit more steadfast on this position than I really am. In truth the answer is, some; maybe; a little; I don’t know.

My own experience in the topic comes from the Mobilize project , in which we developed the course Introduction to Data Science for students in the Los Angeles Unified School District. (I’m pleased to say that the course is expanding. This summer, five new L.A.-area school districts will begin training teachers to teach this course. )

The course relies heavily on R via Rstudio. Students begin by studying the structure of data, learning to identify cases and variables and to organize unstructured data into a “tidy” format. Next, they learn to “read” tidy datafiles into Rstudio. The course ends with students learning some predictive modeling using Classification and Regression Trees. In between, they study some inference using randomization-based methods.

To be precise, the students don’t learn straight-up R. They work within a package developed by the Mobilize team (primarily James Molyneux, Amelia McNamara, Steve Nolen, Jeroen Ooms, and Hongsuda Tangmunarunkit) called mobilizR, which is based pretty heavily on the mosaic package developed by Randall Pruim, Danny Kaplan and Nick Horton. The idea with these packages is to provide beginners to R with a unified syntax and a set of verbs that relate more directly to the analysts’ goals. The basic structure for (almost) all commands is

*WhatIWantToDo(yvariable~xvariables, dataset)*

For example, to see the average walking distance recorded by a fitbit by day of the week:

> mean(Distance~DOW,data=fitbitdec) Friday Monday Saturday Sunday Thursday Tuesday Wednesday 1.900000 3.690000 2.020909 2.419091 1.432727 3.378182 3.644545The idea is to provide students with a simplified syntax that “bridges the gap” between beginners of R and more advanced users. Hopefully, this frees up some of the cognitive load required to remember and employ R commands so that students can think strategically and statistically about problems they are trying to solve.

The “bridge the gap” terminology comes from Amelia McNamara, who used the term in her PhD dissertation. One of the many really useful ideas Amelia has given us is the notion that the gap needs to be bridged. Much of “traditional” statistics education holds to the idea that statistical concepts are primarily mathematical, and, for most people, it is sufficient to learn enough of the mathematical concepts so that they can react skeptically and critically to others’ analyses. What is exciting about data science in education is that students can do their own analyses. And if students are analyzing data and discovering on their own (instead of just trying to understand others’ findings), then we need to teach them to use software in such a way that they can transition to more professional practices.

And now, dear readers, we get to the heart of the matter. That gap is really hard to bridge. One reason is that we know little to nothing about the terrain. How do students learn coding when applied to data analysis? How does the technology they use mediate that experience? How can it enhance, rather than inhibit, understanding of statistical concepts and the ability to do data analysis intelligently?

In other words, what’s the learning trajectory?

Tim rightly points to CODAP, the Common Online Data Analysis Platform, as one tool that might help bridge the gap by providing students with some powerful data manipulation techniques. And I recently learned about data.world, which seems another attempt to help bridge the gap. But Amelia’s point is that it is not enough to give students the ability to do something; you have to give it to them so that they are prepared to learn the next step. And if the end-point of a statistics education involves coding, then those intermediate steps need to be developing students’ coding skills, as well as their statistical thinking. It’s not sufficient to help studemts learn statistics. They must simultaneously learn computation.

So how do we get there? One important initial step, I believe, is to really examine what the term “computational thinking” means when we apply it to data analysis. And that will be the subject of an upcoming summer blog.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R Project – Citizen-Statistician**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Hacking Highcharter: observations per group in boxplots

(This article was first published on ** R – What You're Doing Is Rather Desperate**, and kindly contributed to R-bloggers)

Highcharts has long been a favourite visualisation library of mine, and I’ve written before about Highcharter, my preferred way to use Highcharts in R.

Highcharter has a nice simple function, hcboxplot(), to generate boxplots. I recently generated some for a project at work and was asked: can we see how many observations make up the distribution for each category? This is a common issue with boxplots and there are a few solutions such as: overlay the box on a jitter plot to get some idea of the number of points, or try a violin plot, or a so-called bee-swarm plot. In Highcharts, I figured there should be a method to get the number of observations, which could then be displayed in a tool-tip on mouse-over.

There wasn’t, so I wrote one like this.

First, you’ll need to install highcharter from Github to make it work with the latest dplyr.

Next, we generate a reproducible dataset using the wakefield package. For some reason, we want to look at age by gender, but only for redheads:

library(dplyr) library(tidyr) library(highcharter) library(wakefield) library(tibble) set.seed(1001) sample_data <- r_data_frame( n = 1000, age(x = 10:90), gender, hair ) %>% filter(hair == "Red") sample_data %>% count(Gender) ## # A tibble: 2 x 2 ## Gender n ## ## 1 Male 62 ## 2 Female 48Giving us 62 male and 48 female redheads. The tibble package is required because later on, our boxplot function calls the function has_name from that package.

The standard hcboxplot function shows us, on mouse-over, the summary data used in the boxplot, as in the image below.

hcboxplot(x = sample_data$Age, var = sample_data$Gender) %>% hc_chart(type = "column")To replace that with number of observations per group, we need to edit the function. In RStudio, View(hcboxplot) will open a tab with the (read-only) code, which can be copy/pasted and edited. Look for the function named get_box_values, which uses the R boxplot.stats function to generated a data frame:

get_box_values <- function(x) { boxplot.stats(x)$stats %>% t() %>% as.data.frame() %>% setNames(c("low", "q1", "median", "q3", "high")) }Edit it to look like this – the new function just adds a column obs with number of observations:

get_box_values <- function(x) { boxplot.stats(x)$stats %>% t() %>% cbind(boxplot.stats(x)$n) %>% as.data.frame() %>% setNames(c("low", "q1", "median", "q3", "high", "obs")) }Save the new function as, for example, my_hcboxplot. Now we can customise the tooltip to use the obs property of the point object:

my_hcboxplot(x = sample_data$Age, var = sample_data$Gender) %>% hc_chart(type = "column") %>% hc_tooltip(pointFormat = 'n = {point.obs}')Voilà.

Filed under: R, statistics

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – What You're Doing Is Rather Desperate**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Random Forests in R

(This article was first published on ** R Programming – DataScience+**, and kindly contributed to R-bloggers)

Ensemble Learning is a type of Supervised Learning Technique in which the basic idea is to generate multiple Models on a training dataset and then simply combining(average) their Output Rules or their Hypothesis \( H_x \) to generate a Strong Model which performs very well and does not overfits and which balances the Bisa-Variance Tradeoff too.

The idea is that instead of producing a single complicated and complex Model which might have a high variance which will lead to Overfitting or might be too simple and have a high bias which leads to Underfitting, we will generate lots of Models by training on Training Set and at the end combine them. Such a technique is Random Forest which is a popular Ensembling technique is used to improve the predictive performance of Decision Trees by reducing the variance in the Trees by averaging them. Decision Trees are considered very simple and easily interpretable as well as understandable Modelling techniques, but a major drawback in them is that they have a poor predictive performance and poor Generalization on Test Set. They are also referred to as Weak Learners which are Learners which always perform better than chance and have an error less than 50 %.

Random ForestsRandom Forests are similar to a famous Ensemble technique called Bagging but have a different tweak in it. In Random Forests the idea is to decorrelate the several trees which are generated on the different bootstrapped samples from training Data.And then we simply reduce the Variance in the Trees by averaging them.

Averaging the Trees helps us to reduce the variance and also improve the Perfomance of Decision Trees on Test Set and eventually avoid Overfitting.

The idea is to build lots of Trees in such a way to make the Correlation between the Trees smaller.

Another major difference is that we only consider a Random subset of predictors \( m \) each time we do a split on training examples.Whereas usually in Trees we find all the predictors while doing a split and choose best amongst them. Typically \( m=\sqrt{p} \) where \(p\) are the number of predictors.

Now it seems crazy to throw away lots of predictors, but it makes sense because the effect of doing so is that each tree uses different predictors to split data at various times.

*So by doing this trick of throwing away Predictors, we have de-correlated the Trees and the resulting average seems a little better.*

loading the required packages

require(randomForest) require(MASS)#Package which contains the Boston housing dataset attach(Boston) set.seed(101) dim(Boston)*## [1] 506 14*Saperating Training and Test Sets #training Sample with 300 observations train=sample(1:nrow(Boston),300) ?Boston #to search on the dataset

We are going to use variable ′medv′ as the Response variable, which is the Median Housing Value. We will fit 500 Trees.

Fitting the Random ForestWe will use all the Predictors in the dataset.

Boston.rf=randomForest(medv ~ . , data = Boston , subset = train) Boston.rf*## ## Call: ## randomForest(formula = medv ~ ., data = Boston, subset = train) ## Type of random forest: regression ## Number of trees: 500 ## No. of variables tried at each split: 4 ## ## Mean of squared residuals: 12.62686 ## % Var explained: 84.74*

The above Mean Squared Error and Variance explained are calculated using *Out of Bag Error* Estimation.In this \(\frac23\) of Training data is used for training and the remaining \( \frac13 \) is used to Validate the Trees. Also, the number of variables randomly selected at each split is 4.

Plotting the Error vs Number of Trees Graph.

plot(Boston.rf)This plot shows the Error and the Number of Trees.We can easily notice that how the Error is dropping as we keep on adding more and more trees and average them.

Now we can compare the Out of Bag Sample Errors and Error on Test setThe above Random Forest model chose Randomly 4 variables to be considered at each split. We could now try all possible 13 predictors which can be found at each split.

oob.err=double(13) test.err=double(13) #mtry is no of Variables randomly chosen at each split for(mtry in 1:13) { rf=randomForest(medv ~ . , data = Boston , subset = train,mtry=mtry,ntree=400) oob.err[mtry] = rf$mse[400] #Error of all Trees fitted pred<-predict(rf,Boston[-train,]) #Predictions on Test Set for each Tree test.err[mtry]= with(Boston[-train,], mean( (medv - pred)^2)) #Mean Squared Test Error cat(mtry," ") #printing the output to the console }*## 1 2 3 4 5 6 7 8 9 10 11 12 13*

Test Error

test.err*## [1] 26.06433 17.70018 16.51951 14.94621 14.51686 14.64315 14.60834 ## [8] 15.12250 14.42441 14.53687 14.89362 14.86470 15.09553*

Out of Bag Error Estimation

oob.err*## [1] 19.95114 13.34894 13.27162 12.44081 12.75080 12.96327 13.54794 ## [8] 13.68273 14.16359 14.52294 14.41576 14.69038 14.72979*

What happens is that we are growing 400 trees for 13 times i.e for all 13 predictors.

Plotting both Test Error and Out of Bag Error matplot(1:mtry , cbind(oob.err,test.err), pch=19 , col=c("red","blue"),type="b",ylab="Mean Squared Error",xlab="Number of Predictors Considered at each Split") legend("topright",legend=c("Out of Bag Error","Test Error"),pch=19, col=c("red","blue"))Now what we observe is that the Red line is the Out of Bag Error Estimates and the Blue Line is the Error calculated on Test Set. Both curves are quite smooth and the error estimates are somewhat correlated too. The Error Tends to be minimized at around \( mtry = 4 \) .

On the Extreme Right Hand Side of the above plot, we considered all possible 13 predictors at each Split which is only Bagging.

ConclusionNow in this article, I gave a simple overview of Random Forests and how they differ from other Ensemble Learning Techniques and also learned how to implement such complex and Strong Modelling Technique in R with a simple package randomForest.Random Forests are a very Nice technique to fit a more Accurate Model by averaging Lots of Decision Trees and reducing the Variance and avoiding Overfitting problem in Trees. Decision Trees themselves are poor performance wise, but when used with Ensembling Techniques like Bagging, Random Forests etc, their predictive performance is improved a lot.Now obviously there are various other packages in R which can be used to implement Random Forests in R.

I hope the tutorial is enough to get you started with implementing Random Forests in R or at least understand the basic idea behind how this amazing Technique works.

Thanks for reading the article and make sure to like and share it.

**
**

**Related Post**

- Network analysis of Game of Thrones
- Structural Changes in Global Warming
- Deep Learning with R
- Unsupervised Learning and Text Mining of Emotion Terms Using R
- Using MCA and variable clustering in R for insights in customer attrition

To **leave a comment** for the author, please follow the link and comment on their blog: ** R Programming – DataScience+**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Stippling and TSP art in R: emulating StippleGen

(This article was first published on ** R – dahtah**, and kindly contributed to R-bloggers)

Stippling is the creation of a pattern simulating varying degrees of solidity or shading by using small dots (Wikipedia).StippleGen is a piece of software that renders images using stipple patterns, which I discovered on Xi’an’s blog a couple days ago.

StippleGen uses an algorithm by Adrian Secord (described here) that turns out to be related to a problem in spatial statistics, specifically how to mess with high-order statistics of point processes while controlling density. The algorithm is a variant of k-means and is extremely easy to implement in R.

library(imager) library(dplyr) library(purrr) stipple <- function(im,nPoints=1e3,gamma=2,nSteps=10) { dens <- (1-im)^gamma xy <- sample(nPix(im),nPoints,replace=TRUE,prob=dens) %>% coord.index(im,.) %>% select(x,y) for (ind in 1:nSteps) { xy <- cvt(xy,dens) plot(im); points(xy,col="red") } xy } plot.stipple <- function(im,out,cex=.25) { g <- imgradient(im,"xy") %>% map(~ interp(.,out)) plot(out,ylim=c(height(im),1),cex=cex,pch=19,axes=FALSE,xlab="",ylab="") } ##Compute Voronoi diagram of point set xy, ##and return center of mass of each cell (with density given by image im) cvt <- function(xy,im) { voronoi(xy,width(im),height(im)) %>% as.data.frame %>% mutate(vim=c(im)) %>% group_by(value) %>% dplyr::summarise(x=weighted.mean(x,w=vim),y=weighted.mean(y,w=vim)) %>% select(x,y) %>% filter(x %inr% c(1,width(im)),y %inr% c(1,height(im))) } ##Compute Voronoi diagram for points xy over image of size (w,h) ##Uses a distance transform followed by watershed voronoi <- function(xy,w,h) { v <- imfill(w,h) ind <- round(xy) %>% index.coord(v,.) v[ind] <- seq_along(ind) d <- distance_transform(v>0,1) watershed(v,-d,fill_lines=FALSE) } #image from original paper im <- load.image("http://dahtah.github.io/imager/images/stippling_leaves.png") out <- stipple(im,1e4,nSteps=5,gamma=1.5) plot.stipple(im,out)TSP art is a variant where you solve a TSP problem to connect all the dots.

library(TSP) ##im is the original image (used only for its dimensions) ##out is the output of the stipple function (dot positions) draw.tsp <- function(im,out) { tour <- out %>% ETSP %>% solve_TSP plot(out[tour,],type="l",ylim=c(height(im),1),axes=FALSE,xlab="",ylab="") } ##Be careful, this is memory-heavy (also, slow) out <- stipple(im,4e3,gamma=1.5) draw.tsp(im,out)I’ve written a more detailed explanation on the imager website, with other variants like stippling with line segments, and a mosaic filter.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – dahtah**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Beneath the canvas

(This article was first published on ** Data Imaginist - R posts**, and kindly contributed to R-bloggers)

Recently a blog

post made its

rounds on the internet describing how it is possible to speed up plot creation

in ggplot2 by first creating a blank canvas and then later adding the plot

elements on top of it. The main takeaway plot is reproduced below:

The blog post is in generally well reasoned and I love how it moves from a

curious observation to an investigation and ends with a solid recommendation.

Alas, I don’t agree with the recommendation (that you shold create a *canvas*

for subsequent use). Most of the misunderstanding in the blog post comes from

the fact that ggplot2 in many ways seems to be fueled by magic and unicorn

blood — what arises when you write ggplot() and hit enter is far from clear. I

would like to spend most of the time on this point so I’m just going to get a

more general point out of the way first.

When looking for ways to optimise your code, one must always ask whether the

code needs optimisation in the first place, and then whether the changes made

successfully makes a meaningful impact. What the plot above shows is that

caching the ggplot() call leads to a *statistically significant* performance

improvement meassured in <10 ms. This means that in order to get a percievable

runtime difference, it would be necessary to generate hundreds of plots, or

thousands of plots to get a meaningful difference. My own rule of thumb is that

you should not give up coding conventions unless there’s a tangible result, and

in this case I don’t see any. Does this mean you should never strive for

millisecond improvements? No, if you expect your piece of code to be called

thousands of times and compounding the effect this would be worthwhile. This is

why you sometimes see code where the square root of a variable is saved in a new

variable rather than being computed on the fly every time. In this case you

should ask yourself whether you mean to generate a 1000 plots with your code in

one go, and if so, whether an additional second is really worth it.

The notion that ggplot() creates a canvas for subsequent calls to add onto is

a sensible one, supported by the ggplot2 API where layers are added to the

initial plot. Further, if we simply write ggplot() and hits enter we get this:

Which sure looks like a blank canvas. This is all magic and unicorns though –

the call to ggplot() doesn’t actually draw or render anything on the device.

In order to understand what is going on, let’s have a look at the code

underneath it all:

So, ggplot() is an S3 generic. As it is dispatching on the data argument, and

that defaults to NULL I’ll take the wild guess and say we’re using the default

method:

Huh, so even if we’re not passing in a data.frame as data we’re ending up with

a call to the data.frame ggplot method (this is actually the reason why you

can write your own fortify methods for custom objects and let ggplot2 work with

them automatically). Just for completeness let’s have a look at a fortified

NULL value:

We get a waiver object, which is an internal ggplot2 approach to saying: “I’ve

got nothing right now but let’s worry about that later” (grossly simplified).

With that out of the way, let’s dive into ggplot.data.frame():

ggplot2:::ggplot.data.frame #> function (data, mapping = aes(), ..., environment = parent.frame()) #> { #> if (!missing(mapping) && !inherits(mapping, "uneval")) { #> stop("Mapping should be created with `aes() or `aes_()`.", #> call. = FALSE) #> } #> p <- structure(list(data = data, layers = list(), scales = scales_list(), #> mapping = mapping, theme = list(), coordinates = coord_cartesian(), #> facet = facet_null(), plot_env = environment), class = c("gg", #> "ggplot")) #> p$labels <- make_labels(mapping) #> set_last_plot(p) #> p #> } #>This is actually a pretty simple piece of code. There are some argument

checks to make sure the mappings are provided in the correct way, but other than

that it is simply constructing a gg object (a ggplot subclass). The

set_last_plot() call makes sure that this new plot object is now retrievable

with the last_plot() function. In the end it simply returns the new plot

object. We can validate this by looking into the return value of ggplot():

We see our waiver data object in the data element. As expected we don’t have

any layers, but (perhaps surprising) we *do* have a coordinate system and a

facet specification. These are the defaults getting added to every plot and in

effect until overwritten by something else (facet_null() is simply a one-panel

plot, cartesian coordinates are a standard coordinate system, so the defaults

are sensible). While there’s a default theme in ggplot2 it is not part of the

plot object in the same way as the other defaults. The reason for this is that

it needs to be possible to change the theme defaults and have these changes

applied to all plot objects already in existence. So, instead of carrying the

full theme around, a plot object only keeps explicit changes to the theme and

then merges these changes into the current default (available with

theme_get()) during plotting.

All in all ggplot() simply creates an adorned list ready for adding stuff

onto (you might call this a virtual canvas but I think this is stretching

it…).

*So how come something pops up on your plotting device when you hit enter?* (for

a fun effect read this while sounding as Carrie from Sex and the City)

This is due to the same reason you get a model summary when hitting enter on a

lm() call etc.: The print() method. The print() method is called

automatically by R every time a variable is queried and, for a ggplot object,

it draws the content of your object on your device. An interesting side-effect

of this is that ggplots are only rendered when explicetly print()ed/plot()ed

within a loop, as only the last return value in a sequence of calls gets its

print method invoked. This also means that the benchmarks in the original

blogposts were only measuring plot object creation, and not actual plot

rendering, as this is never made explecit in the benchmarked function (A point

later emphasized in the original post as well). For fun, let’s see if

doing that changes anything:

So it appears any time difference is hidden by the actual complexity of

rendering the plot. This is sensible as the time scale has now increased

considerably and a difference in 1 ms will not be visible.

Prior to ggplot2 v2 simply plotting the result of ggplot() would result in an

error as the plot had no layers to draw. ggplot2 did not get the ability to draw

layerless plots in v2, but instead it got an invisible default layer

(geom_blank) that gets dropped once new layers are added. This just goes to show

the amount of logic going into plot generation in ggplot2 and why it sometimes

feels magical…

To **leave a comment** for the author, please follow the link and comment on their blog: ** Data Imaginist - R posts**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Runtime vs. Success (Using IMDB)

(This article was first published on ** R – NYC Data Science Academy Blog**, and kindly contributed to R-bloggers)

The content in this blog comes from a shiny application proof of concept using IMDB movie data. To view the application:

- IMDB Movie Data App on Shiny.io (Created in R / Shiny Dashboard)
- Source Code on Github

Years ago, students of film writing were advised to make their feature length movie scripts between 90 and 129 pages. A page is expected to be one minute of runtime, and today these numbers have been revised to a simple target of 110 minutes. Little fun fact connected with this: my professor tells the story of an insider’s view from development and how the decision makers at one of the giants used to pick up a screenplay, weigh it in his hands, and say “too light” or “too heavy.” Then the script was rejected without ever getting a single page read. Are these page length numbers arbitrary, or is there something to these guidelines?

A Walkthrough of The Shiny App And What It Can Tell Us

With the bin size set very small in the above visualizations, we can see movie releases increasing over time, with the rate accelerating much more starting in the 1970’s. Increasing the bin size from 2 to 28, further illustrates how in recent years, there are so many more movies then in the past.

A quick look at the history may shed a little light on what’s going on here. The first motion pictures were created in the 1890’s and were little more than short, black and white moving images. As the novelty of this waned, next came serial shorts from which the term “cliff hanger” originates. When real full-length stories started to evolve, they were expensive and hard to produce. As a practical matter, only small numbers could be made per year. Today, the bar has been raised but technology keeps lowering the cost of what it takes to meet that bar. Independent and ancillary markets have widened the distribution channels and even YouTube is becoming a part of this expanding network. It’s just a matter of time before low budget movies get made on smart phones if they haven’t been already. Nothing earth shattering here but the visualization does help show that the run-away escalation started during the time when “*Star Wars*”, “*Jaws*”, and “*Close Encounters of The Third Kind*” all made their debut. Many see this time as “the beginning of the blockbuster.”

As shown here, the data used in this application is organized into 5 subsets of data:

There is some overlap in the data sets shown on the application’s “Data” tab:

- The 5270 “Kaggle 5000 Project IDs” data set contains the results of merging 5000 movies whose IDs were obtained from a “Kaggle” project with the unique IDs from all the other data sets described above. The “Kaggle Project IDs” included a sample of releases from 1916 up until 2016.
- The other data sets came from Top and Bottom lists on IMDB. Most of the Top 250 English Movies were also in the “Top 250 Movies”. Some of the “Top 250 Indian Movies” are also in the “Top 250 Movies”. The Bottom 100 should only overlap with the “Kaggle” data set. Movies in these lists include titles from 2017.

Click on the “Visualization” tab of the app to obtain basic stats on each of the aforementioned datasets (Min, Max, Mean, Median and Quartiles) for these 4 numeric variables:

- Year Released
- Runtime (in minutes)
- Number of Reviews
- Movie Rating

This tab also provides a line plot using Loess linear regression which we can analyze for the general trend in movie runtimes that we are looking for.

If we start with the plot for “all the data” in our application, outliers are mostly clustered pretty close to the general confidence interval for the model. No outliers outside this range appear after 1990, and only a small number of points barely outside the confidence interval appear from 1960 to 1990.

Since there is a limited outlier effect, the mean seems like a reasonable metric. It is 109.2 minutes. Of more interest to the original question, this plot essentially reflects a 5000+ record sample of what got made. The hills and valleys of the line seem to range between 105 and 120 minutes up through the 1970’s. Then the line becomes a slightly downward sloping trend up through the present with our data points mostly appearing at around the 110 minute mark. Though anecdotal in nature, the original 110 minute recommendation for movie scripts would seem to be supported by the data. The confidence interval around the line though might suggest a range from about 100 to 120 minutes. If our movie gets made, we may then be concerned with what are its chances of making it into the top or bottom? Starting with the Top 250 Movies:

The mean for the Top 250 movies was 130 minutes. The curve trends upwards over all with a range in recent years (1990 to the present) that fell between: 130 and 140 minutes. There are a larger scattering of outliers on this plot, but based on how the outliers mostly cluster not too far from the confidence interval after 1990, the mean still seems reasonable to use. If we think about why these better received movies are often longer than the general trend for what gets made, I’m sure there are many factors involved. For one thing, big name director and producers have the kind of clout to make us sit through longer movies. Consider “*The Lord of The Rings*” saga, which collectively was over 9 hours long with each of 3 movies lasting in the range of 3 hours a piece.

We don’t want to end up in the bottom, so we’ll take a look at this too:

The mean for the Bottom 100 movies was 92.26 minutes. This curve is also showing a distinct upwards trend over all but with a range in recent years (1990 to the present) that was from about 90 minutes to 115 minutes. There are fewer outliers, but there is also less data in this grouping.

In ConclusionNote that for this application about 6000 IMDB records were obtained. In 2017 alone, IMDB recorded 12,193 new releases by July and the year is still young! Further analysis is indicated and performing the analysis with more data would also be helpful. Given the small size of the data, findings here should not be taken as an absolute. Given that, here is a summary of what the data suggests so far:

- The “Sweet Spot” to target for a feature length screen play to get made is about 110 minutes
- If we want that movie to make it into the Top 250 on IMDB, the final movie produces is expected to range from 130 to 140 minutes
- It is also interesting to note that movies in the bottom 100 tend to range between 90 and 115 minutes

With more time and development, a more thorough analysis could be developed from IMDB data.

The Data Collection ProcessThree R markdown scripts (available on Git) were used to gather source data. A full workflow from initial input files to what got used in the application is provided in the third script. As this process was experimentally developed and modified while creating the source data for this project, R markdown was used to step through the process and keep notes on what was going. High level:

- An XML based R library was used to screen scrape data from “hot lists” (Top 250, Bottom 100, etc.) directly off of the IMDB website in July of 2017. This produced ‘csv’ files used as source for all the “Top / Bottom” lists provided in the final application.
- 5000+ IMDB IDs were extracted using a software process (not documented) from a json file posted with a Kaggle project. That project last gathered data in 2016. IDs from steps 1 and 2 were then used with step 3 to extract current data on these IDs.
- An Open API referred to as “IMDB Open API” was used to extract records from the IMDB website one record at a time, with a 3 second delay between each extraction. This was to ensure no harm to the website. This script was designed to save its work “along the way” and provide helpful output messages so that if an error occurred that halted the program. Data collected so far would be available, and the script could then pick up where it left off to finish the job.
- Close to 6000 observations in total (5270 after filtering out of duplicates) w/ about 27 variables were obtained from IMDB.
- Data cleaning of just the 7 variables brought into the Shiny application ensued with the intent to clean other variables in future development iterations. The vision was to develop subset tables with Movie_IDs on each one as a unique identifier, and then join them to previous data when bringing in each new set to expand the app in support of new analysis and features.

The CEO of a company I used to work for would tell stories of how senior management of various IT areas under him would plan and design solutions on a napkin in a bar which formed the basis of software project requirements. For this shiny app, the “napkin” was actually a piece of notebook paper captured as a PDF for your amusement. Only a small piece of this plan has been realized so far.

The post Runtime vs. Success (Using IMDB) appeared first on NYC Data Science Academy Blog.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – NYC Data Science Academy Blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Programming with dplyr by using dplyr

(This article was first published on ** R – Enchufa2**, and kindly contributed to R-bloggers)

The title may seem tautological, but since the arrival of dplyr 0.7.x, there have been some efforts at *using dplyr without actually using it* that I can’t quite understand. The tidyverse has raised passions, for and against it, for some time already. There are excellent alternatives out there, and I myself use them when I find it suitable. But when I choose to use dplyr, I find it most versatile, and I see no advantage in adding yet another layer that complicates things and makes problems even harder to debug.

Take the example of seplyr. It stands for *standard evaluation dplyr*, and enables us to program over dplyr without having “to bring in (or study) any deep-theory or heavy-weight tools such as rlang/tidyeval”. Let’s consider the following interactive pipeline:

Let’s say we want to parametrise the grouping variable and wrap the code above into a re-usable function. Apparently, this is difficult with dplyr. But is it? Not at all: we just need to add one line and a *bang-bang* (!!):

The enquo() function quotes the name we put in our function (homeworld), and the bang-bang unquotes and uses that name instead of var. That’s it. What about seplyr? With seplyr, we *just* have to (and I quote)

- Change dplyr verbs to their matching seplyr “*_se()” adapters.
- Add quote marks around names and expressions.
- Convert sequences of expressions (such as in the summarize()) to explicit vectors by adding the “c()” notation.
- Replace “=” in expressions with “:=”.

This is the result:

library(seplyr) starwars_mean <- function(my_var) { starwars %>% group_by_se(my_var) %>% summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)", "mean_mass" := "mean(mass, na.rm = TRUE)", "count" := "n()")) } starwars_mean("homeworld") ## # A tibble: 49 x 4 ## homeworld mean_height mean_mass count ## ## 1 Alderaan 176.3333 64.0 3 ## 2 Aleen Minor 79.0000 15.0 1 ## 3 Bespin 175.0000 79.0 1 ## 4 Bestine IV 180.0000 110.0 1 ## 5 Cato Neimoidia 191.0000 90.0 1 ## 6 Cerea 198.0000 82.0 1 ## 7 Champala 196.0000 NaN 1 ## 8 Chandrila 150.0000 NaN 1 ## 9 Concord Dawn 183.0000 79.0 1 ## 10 Corellia 175.0000 78.5 2 ## # ... with 39 more rowsBasically, we had to change the entire pipeline. If re-usability was the goal, I think we lost some of it here. But, wait, we are still using non-standard evaluation in the first example. What if we really need to provide the grouping variable as a string? Easy enough, we just need to change enquo() with as.name() to convert the string to a name:

starwars_mean <- function(var) { var <- as.name(var) starwars %>% group_by(!!var) %>% summarise(mean_height = mean(height, na.rm = TRUE), mean_mass = mean(mass, na.rm = TRUE), count = n()) } starwars_mean("homeworld") ## # A tibble: 49 x 4 ## homeworld mean_height mean_mass count ## ## 1 Alderaan 176.3333 64.0 3 ## 2 Aleen Minor 79.0000 15.0 1 ## 3 Bespin 175.0000 79.0 1 ## 4 Bestine IV 180.0000 110.0 1 ## 5 Cato Neimoidia 191.0000 90.0 1 ## 6 Cerea 198.0000 82.0 1 ## 7 Champala 196.0000 NaN 1 ## 8 Chandrila 150.0000 NaN 1 ## 9 Concord Dawn 183.0000 79.0 1 ## 10 Corellia 175.0000 78.5 2 ## # ... with 39 more rowsBut we can do even better if we remember that dplyr provides scoped variants (see ?dplyr::scoped) for most of the verbs. In this case, group_by_at() comes in handy:

starwars_mean <- function(var) { starwars %>% group_by_at(var) %>% summarise(mean_height = mean(height, na.rm = TRUE), mean_mass = mean(mass, na.rm = TRUE), count = n()) } starwars_mean("homeworld") ## # A tibble: 49 x 4 ## homeworld mean_height mean_mass count ## ## 1 Alderaan 176.3333 64.0 3 ## 2 Aleen Minor 79.0000 15.0 1 ## 3 Bespin 175.0000 79.0 1 ## 4 Bestine IV 180.0000 110.0 1 ## 5 Cato Neimoidia 191.0000 90.0 1 ## 6 Cerea 198.0000 82.0 1 ## 7 Champala 196.0000 NaN 1 ## 8 Chandrila 150.0000 NaN 1 ## 9 Concord Dawn 183.0000 79.0 1 ## 10 Corellia 175.0000 78.5 2 ## # ... with 39 more rowsThat’s it: no bang-bang, just strings and only one change to the original code. Let’s dwell on the potential of the scoped variants with a final example. We can make a completely generic re-usable “grouped mean” function using seplyr and R’s paste0() function to build up expressions:

grouped_mean <- function(data, grouping_variables, value_variables) { result_names <- paste0("mean_", value_variables) expressions <- paste0("mean(", value_variables, ", na.rm = TRUE)") data %>% group_by_se(grouping_variables) %>% summarize_se(c(result_names := expressions, "count" := "n()")) } starwars %>% grouped_mean("eye_color", c("mass", "birth_year")) ## # A tibble: 15 x 4 ## eye_color mean_mass mean_birth_year count ## ## 1 black 76.28571 33.00000 10 ## 2 blue 86.51667 67.06923 19 ## 3 blue-gray 77.00000 57.00000 1 ## 4 brown 66.09231 108.96429 21 ## 5 dark NaN NaN 1 ## 6 gold NaN NaN 1 ## 7 green, yellow 159.00000 NaN 1 ## 8 hazel 66.00000 34.50000 3 ## 9 orange 282.33333 231.00000 8 ## 10 pink NaN NaN 1 ## 11 red 81.40000 33.66667 5 ## 12 red, blue NaN NaN 1 ## 13 unknown 31.50000 NaN 3 ## 14 white 48.00000 NaN 1 ## 15 yellow 81.11111 76.38000 11And the same with dplyr’s scoped verbs (note that I’ve added the last rename_at() on a whim, just to get exactly the same output as before, but it is not really necessary):

grouped_mean <- function(data, grouping_variables, value_variables) { data %>% group_by_at(grouping_variables) %>% mutate(count = n()) %>% summarise_at(c(value_variables, "count"), mean, na.rm = TRUE) %>% rename_at(value_variables, funs(paste0("mean_", .))) } starwars %>% grouped_mean("eye_color", c("mass", "birth_year")) ## # A tibble: 15 x 4 ## eye_color mean_mass mean_birth_year count ## ## 1 black 76.28571 33.00000 10 ## 2 blue 86.51667 67.06923 19 ## 3 blue-gray 77.00000 57.00000 1 ## 4 brown 66.09231 108.96429 21 ## 5 dark NaN NaN 1 ## 6 gold NaN NaN 1 ## 7 green, yellow 159.00000 NaN 1 ## 8 hazel 66.00000 34.50000 3 ## 9 orange 282.33333 231.00000 8 ## 10 pink NaN NaN 1 ## 11 red 81.40000 33.66667 5 ## 12 red, blue NaN NaN 1 ## 13 unknown 31.50000 NaN 3 ## 14 white 48.00000 NaN 1 ## 15 yellow 81.11111 76.38000 11Wrapping up, the tidyeval paradigm may seem difficult at a first glance, but don’t miss the wood for the trees: the new version of dplyr is full of tools that will make your life easier, not harder.

*Article originally published in Enchufa2.es: Programming with dplyr by using dplyr*.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Enchufa2**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Data Visualization with googleVis exercises part 8

(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

In the eighth part of our series we are going to learn about the features of some interesting types of charts. More specifically we will talk about Annotation and Sankey charts.

Read the examples below to understand the logic of what we are going to do and then test yous skills with the exercise set we prepared for you. Lets begin!

Answers to the exercises are available here.

**Package**

As you already know, the first thing you have to do is install and load the googleVis package with:

install.packages("googleVis")

library(googleVis)

**NOTE**: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page. All charts require an Internet connection.

**Annotation chart**

It is quite simple to create an Annotation Chart with googleVis. We will use the “Stocks” dataset. You can see the variables of your dataset with head().

Look at the example below to create a simple Annotation Chart:

AnnoC <- gvisAnnotationChart(Stock)

plot(AnnoC)

**Exercise 1**

Create a list named “AnnoC” and pass to it the “Stock” dataset as an annotation chart. **HINT**: Use gvisAnnotationChart().

**Exercise 2**

Plot the the annotation chart. **HINT**: Use plot().

**Set the variables**

As you can see the annotation chart you built is empty so we have to fill it with some information. We will use the variables of ths “Stock” dataset for this purpose like this:

datevar="Date",

numvar="Value",

idvar="Device",

titlevar="Title",

annotationvar="Annotation"

**Learn more**about using GoogleVis in the online course Mastering in Visualization with R programming. In this course you will learn how to:

- Work extensively with the GoogleVis package and its functionality
- Learn what visualizations exist for your specific use case
- And much more

**Exercise 3**

Use the example above to fill your annotation chart with the proper information and plot the chart.

**Dimensions**

You can use height and width to change the dimensions of the annotation chart.

options=list(width=500, height=300)

**Exercise 4**

Set height to 700, width to 500 and plot the chart. **HINT**: Use list().

**Colours**

You can change the colours of the lines with:

options=list(colors="['green','yellow']")

**Exercise 5**

Set the line colours to green and red and plot the chart.

With the fill option you can color the relevant areas and adjust how filled these areas will be. Look at the example:

options=list(colors="['green','yellow']",fill=20)

**Exercise 6**

Set fill to 50 and plot the chart.

**Exercise 7**

Now set fill to 150 to see the difference and plot the chart.

**Sankey Chart**

Before creating a sankey chart we have to create a data frame that will help us as an example, so copy and paste this code to crate the data frame “exp” first:

exp <- data.frame(From=c(rep("A",3), rep("B", 3)),

To=c(rep(c("X", "Y", "Z"),2)),

Weight=c(6,9,7,9,3,1))

As you can see we created a data frame with the variabls “From”, “To” and “Weight”. You can use head() in order to see them.

It is quite simple to create an Sankey Chart with googleVis. We will use the “exp” data frame.

Look at the example below to create a simple Sankey Chart:

SankeyC <- gvisSankey(exp )

plot(SankeyC)

**Exercise 8**

Create a list named “SankeyC” and pass to it the “exp” dataset as a sankey chart. **HINT**: Use gvisSankey().

**Exercise 9**

Plot the the sankey chart. **HINT**: Use plot().

You can change the link colours with:

options=list(sankey="{link: {color: { fill: 'red' } }

**Exercise 10**

Color the links of ths sankey chart green and plot it.

Related exercise sets:- Data Visualization with googleVis exercises part 4
- Data Visualization with googleVis exercises part 2
- Data Visualization with googleVis exercises part 3
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Version 2.2.1 Released

(This article was first published on ** ggtern: ternary diagrams in R**, and kindly contributed to R-bloggers)

It has been a while since any kind of significant update has been released for the ggetern library, and the other day a minor update was submitted to CRAN, which is now available. There were a few minor bug fixes, and the inclusion of a few new geometries which some of you may find of value.

Newly Approved Geometry: geom_encircleOne of our users commented on the ggtern blog, that the geom_encircle geometry provided by the ggalt package might be handy, and after some investigation, this didn’t need any kind of modification, so it has been added to the list of approved protos, and can be used as follows:

library(ggtern) library(ggalt) #<<< Install this if it isn't installed! data("Fragments") base = ggtern(Fragments,aes(Qm,Qp,Rf+M,fill=GrainSize,shape=GrainSize)) + theme_bw() + theme_legend_position('tr') + geom_encircle(alpha=0.5,size=1) + geom_point() + labs(title = "Example Plot", subtitle = "using geom_encircle") print(base)Pretty simple huh, well this produces the following output:

As you can see (and as expected), points have been circled in an ‘approximate’ manner, sort of like someone grabbing a pen and circling on a piece of paper, perfect for highlighting features without any sort of statistical significance. Of course you will need to install the ggalt package, if it is not already installed.

New Text/Label Geometries:Some other feedback was generously offered, in that it might be convenient if users could apply text or label elements to the plot region, by defining a fractional position in x or y, relative to the viewport, and so, this has been taken this on board and created two new geometries geom_text_viewport and and geom_label_viewport — these are basically the same as the geom_text and geom_label geometries, except, the user is only expected to provide an x and y positional mapping, which are values constrained between the limits [0,1], representing fractional coordinates of the plot viewport, used as follows:

df.vp = data.frame(x = c(.5,1,0,0,1), y = c(.5,1,0,1,0), label = c('Middle','Top Right','Bottom Left','Top Left','Bottom Right')) base2 = base + theme(legend.position = 'right') + geom_mask() + geom_text_viewport(data=df.vp,aes(x=x,y=y,label=label,color=label),inherit.aes = FALSE) + labs(color = "Label", subtitle = "using geom_text_viewport") print(base2) base3 = base + theme(legend.position = 'right') + geom_mask() + geom_label_viewport(data=df.vp,aes(x=x,y=y,label=label,color=label),inherit.aes = FALSE) + labs(color = "Label", subtitle = "using geom_label_viewport") print(base3)The above have the special added benefit with respect to ternary diagrams, insomuch as they reduce the mental overhead of having to think in ternary coordinates, if you want to manually place a label somewhere. Furthermore, both of these new geometries behave similarly on native ggplot2 objects, so, these represent an intuitive way to place labels in a relative manner to the plot region, without having to consider the axis limits.

Finally, many of you have been citing ggtern in the literature, which I am grateful for, please continue to do so, and let me know so that I might add your respective publications to the publications ledger.

I have been thinking alot of the best way to enable piper diagrams, and/or diagrams with cartesian-binary plots perpendicular to the ternary axes, I am hoping to get this functionality happening soon, which will considerably improve the depth of outputs that this package is able to produce.

Some of you have contacted me to produce custom geometries for your respective projects, and as a final plug, this is something I am more than happy to do…

Cheerio,

NH

The post Version 2.2.1 Released appeared first on ggtern: ternary diagrams in R.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** ggtern: ternary diagrams in R**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Tutorial: Using seplyr to Program Over dplyr

(This article was first published on ** R – Win-Vector Blog**, and kindly contributed to R-bloggers)

seplyr is an R package that makes it easy to program over dplyr 0.7.*.

To illustrate this we will work an example.

Suppose you had worked out a dplyr pipeline that performed an analysis you were interested in. For an example we could take something similar to one of the examples from the dplyr 0.7.0 announcement.

suppressPackageStartupMessages(library("dplyr")) packageVersion("dplyr") ## [1] '0.7.2' cat(colnames(starwars), sep='\n') ## name ## height ## mass ## hair_color ## skin_color ## eye_color ## birth_year ## gender ## homeworld ## species ## films ## vehicles ## starships starwars %>% group_by(homeworld) %>% summarise(mean_height = mean(height, na.rm = TRUE), mean_mass = mean(mass, na.rm = TRUE), count = n()) ## # A tibble: 49 x 4 ## homeworld mean_height mean_mass count ## ## 1 Alderaan 176.3333 64.0 3 ## 2 Aleen Minor 79.0000 15.0 1 ## 3 Bespin 175.0000 79.0 1 ## 4 Bestine IV 180.0000 110.0 1 ## 5 Cato Neimoidia 191.0000 90.0 1 ## 6 Cerea 198.0000 82.0 1 ## 7 Champala 196.0000 NaN 1 ## 8 Chandrila 150.0000 NaN 1 ## 9 Concord Dawn 183.0000 79.0 1 ## 10 Corellia 175.0000 78.5 2 ## # ... with 39 more rowsThe above is colloquially called "an interactive script." The name comes from the fact that we use names of variables (such as "homeworld") that would only be known from looking at the data directly in the analysis code. Only somebody interacting with the data could write such a script (hence the name).

It has long been considered a point of discomfort to convert such an interactive dplyr pipeline into a re-usable script or function. That is a script or function that specifies column names in some parametric or re-usable fashion. Roughly it means the names of the data columns are not yet known when we are writing the code (and this is what makes the code re-usable).

This inessential (or conquerable) difficulty is largely a due to the preference for non-standard evaluation interfaces (that is interfaces that capture and inspect un-evaluated expressions from their calling interface) in the design dplyr.

seplyr is a dplyr adapter layer that prefers "slightly clunkier" standard interfaces (or referentially transparent interfaces), which are actually very powerful and can be used to some advantage.

The above description and comparisons can come off as needlessly broad and painfully abstract. Things are much clearer if we move away from theory and return to our practical example.

Let’s translate the above example into a re-usable function in small (easy) stages. First translate the interactive script from dplyr notation into seplyr notation. This step is a pure re-factoring, we are changing the code without changing its observable external behavior.

The translation is mechanical in that it is mostly using seplyr documentation as a lookup table. What you have to do is:

- Change dplyr verbs to their matching seplyr "*_se()" adapters.
- Add quote marks around names and expressions.
- Convert sequences of expressions (such as in the summarize()) to explicit vectors by adding the "c()" notation.
- Replace "=" in expressions with ":=".

Our converted code looks like the following.

# devtools::install_github("WinVector/seplyr") library("seplyr") starwars %>% group_by_se("homeworld") %>% summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)", "mean_mass" := "mean(mass, na.rm = TRUE)", "count" := "n()")) ## # A tibble: 49 x 4 ## homeworld mean_height mean_mass count ## ## 1 Alderaan 176.3333 64.0 3 ## 2 Aleen Minor 79.0000 15.0 1 ## 3 Bespin 175.0000 79.0 1 ## 4 Bestine IV 180.0000 110.0 1 ## 5 Cato Neimoidia 191.0000 90.0 1 ## 6 Cerea 198.0000 82.0 1 ## 7 Champala 196.0000 NaN 1 ## 8 Chandrila 150.0000 NaN 1 ## 9 Concord Dawn 183.0000 79.0 1 ## 10 Corellia 175.0000 78.5 2 ## # ... with 39 more rowsThis code works the same as the original dplyr code. Obviously at this point all we have done is: worked to make the code a bit less pleasant looking. We have yet to see any benefit from this conversion (though we can turn this on its head and say all the original dplyr notation is saving us is from having to write a few quote marks).

The benefit is: this new code can *very easily* be parameterized and wrapped in a re-usable function. In fact it is now simpler to do than to describe.

For example: suppose (as in the original example) we want to create a function that lets us choose the grouping variable? This is now easy, we copy the code into a function and replace the explicit value "homeworld" with a variable:

starwars_mean <- function(my_var) { starwars %>% group_by_se(my_var) %>% summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)", "mean_mass" := "mean(mass, na.rm = TRUE)", "count" := "n()")) } starwars_mean("hair_color") ## # A tibble: 13 x 4 ## hair_color mean_height mean_mass count ## ## 1 auburn 150.0000 NaN 1 ## 2 auburn, grey 180.0000 NaN 1 ## 3 auburn, white 182.0000 77.00000 1 ## 4 black 174.3333 73.05714 13 ## 5 blond 176.6667 80.50000 3 ## 6 blonde 168.0000 55.00000 1 ## 7 brown 175.2667 79.27273 18 ## 8 brown, grey 178.0000 120.00000 1 ## 9 grey 170.0000 75.00000 1 ## 10 none 180.8889 78.51852 37 ## 11 unknown NaN NaN 1 ## 12 white 156.0000 59.66667 4 ## 13 141.6000 314.20000 5In seplyr programming is easy (just replace values with variables). For example we can make a completely generic re-usable "grouped mean" function using R‘s paste() function to build up expressions.

grouped_mean <- function(data, grouping_variables, value_variables) { result_names <- paste0("mean_", value_variables) expressions <- paste0("mean(", value_variables, ", na.rm = TRUE)") calculation <- result_names := expressions print(as.list(calculation)) # print for demonstration data %>% group_by_se(grouping_variables) %>% summarize_se(c(calculation, "count" := "n()")) } starwars %>% grouped_mean(grouping_variables = "eye_color", value_variables = c("mass", "birth_year")) ## $mean_mass ## [1] "mean(mass, na.rm = TRUE)" ## ## $mean_birth_year ## [1] "mean(birth_year, na.rm = TRUE)" ## # A tibble: 15 x 4 ## eye_color mean_mass mean_birth_year count ## ## 1 black 76.28571 33.00000 10 ## 2 blue 86.51667 67.06923 19 ## 3 blue-gray 77.00000 57.00000 1 ## 4 brown 66.09231 108.96429 21 ## 5 dark NaN NaN 1 ## 6 gold NaN NaN 1 ## 7 green, yellow 159.00000 NaN 1 ## 8 hazel 66.00000 34.50000 3 ## 9 orange 282.33333 231.00000 8 ## 10 pink NaN NaN 1 ## 11 red 81.40000 33.66667 5 ## 12 red, blue NaN NaN 1 ## 13 unknown 31.50000 NaN 3 ## 14 white 48.00000 NaN 1 ## 15 yellow 81.11111 76.38000 11The only part that requires more study and practice was messing around with the expressions using paste() (for more details on the string manipulation please try "help(paste)"). Notice also we used the ":=" operator to bind the list of desired result names to the matching calculations (please see "help(named_map_builder)" for more details).

The point is: we did not have to bring in (or study) any deep-theory or heavy-weight tools such as rlang/tidyeval or lazyeval to complete our programming task. Once you are in seplyr notation, changes are very easy. You can separate translating into seplyr notation from the work of designing your wrapper function (breaking your programming work into smaller easier to understand steps).

The seplyr method is simple, easy to teach, and powerful. The package contains a number of worked examples both in help() and vignette(package='seplyr') documentation.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Win-Vector Blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### What’s in our internal chaimagic package at work

(This article was first published on ** Maëlle**, and kindly contributed to R-bloggers)

At my day job I’m a data manager and statistician for an epidemiology project called CHAI lead by Cathryn Tonne. CHAI means “Cardio-vascular health effects of air pollution in Telangana, India” and you can find more about it in our recently published protocol paper . At my institute you could also find the PASTA and TAPAS projects so apparently epidemiologists are good at naming things, or obsessed with food… But back to CHAI! This week Sean Lopp from RStudio wrote an interesting blog post about internal packages. I liked reading it and feeling good because we do have an internal R package for CHAI! In this blog post, I’ll explain what’s in there, in the hope of maybe providing inspiration for your own internal package!

As posted in this tweet, this pic represents the Barcelona contingent of CHAI, a really nice group to work with! We have other colleagues in India obviously, but also in the US.

The birth of chaimagic: help for data wrangling*Note: part of this post was submitted as an abstract for the useR! conference in Brussels, for which I received feedback from my whole team, as well as from François Michonneau, Dirk Schumacher and Jennifer Thompson. Thanks a lot! I got to present a lightning talk about rtimicropem instead.*

Right from the beginning of my time here in October 2015 I had quite a lot of data to clean, which I started doing in R of course. One task in particular was parsing filenames because those were informative in our project, containing for instance a date and a participant ID. I wrote a function for that, batch_parse_filenames, which parses all filenames in a vector and returns a list composed of two data.frames: one with the information contained in the filename (e.g. participant ID, date of sampling) and one with possible parsing mistakes (e.g. a nonexistent ID) and an informative message. It was a very useful function, and I packaged it up together with the data containing the participant IDs for instance. This was the beginning of an internal package!

I decided to call it chaimagic because it sounded cool, and this despite knowing that there’s absolutely no magic in the code I write.

The growth of chaimagic: support for data analysischaimagic evolved with helper functions for analysis, e.g. a function returning the season in India from a date (e.g. monsoon, winter), or functions for common operations on variables from our questionnaire data. chaimagic also got one more contributor.

chaimagic also contains a Shiny dashboard for exploring personal monitoring data that were collected in the project: one selects a participant-day and gets interactive leaflet and rbokeh plots of GPS, air quality, and time-activity data from questionnaires and wearable cameras. I told my boss that the dashboard was brought to us by the biblical Magi but maybe I helped them, who knows.

The maturity of chaimagic: serving scientific communicationNow, our package also supports scientific communication thanks to an RMarkdown template for internal reports which fostered reproducibility of analyses by making the use of RMarkdown even more appealing. Having an RMarkdown template also supports consistent branding, which is discussed in the RStudio blog post.

chaimagic also includes a function returning affiliation information for any possible collaborator we have; and a function returning the agreed-upon acknowledgement sentences to be put in each manuscript. These are pieces of information we can be very happy not to have to go look for in a folder somewhere, we can get them right without leaving R!

Why is chaimagic a really cool package?chaimagic has been a valuable tool for the work of two statisticians/data managers, one GIS technician, one post-doc and one PhD student. Discussing its development and using it was an occasion to share our R joy, thus fostering best practices in scientific computing in our project. Why use Stata or other statistical softwares when you have such a nice internal R package? We found that even if our package served a small team of users, an R package provided a robust infrastructure to ensure that everyone was using the same code and R coding best practices in our team.

So this is very good… but we all know from e.g. Airbnb (see also this post about Airbnb and R) that *stickers* are a very important aspect of an internal R package. I was over the moon when my colleague Carles handed me these fantastic chaimagic stickers! He used the hexSticker package. I’ll leave the team in September after everyone’s vacations, and this piece of art was the goodbye present I received at my goodbye lunch. Doesn’t it make our internal package completely cool now?

To **leave a comment** for the author, please follow the link and comment on their blog: ** Maëlle**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Stan Weekly Roundup, 21 July 2017

(This article was first published on ** R – Statistical Modeling, Causal Inference, and Social Science**, and kindly contributed to R-bloggers)

It was another productive week in Stan land. The big news is that

**Jonathan Auerbach**reports that

A team of Columbia students (mostly Andrew’s, including myself) recently won first place in a competition predicting elementary school enrollment. I heard 192 entered, and there were 5 finalists….Of course, we used Stan (RStan specifically). … Thought it might be Stan news worthy.

I’d say that’s newsworthy. Jon also provided a link to the “challenge” page, a New York City government sponsored “call for innovations”: Enhancing School Zoning Efforts by Predicting Population Change. They took home a US$20K paycheck for their efforts! Stan’s seeing quite a lot of use these days among demographers and others looking to predict forward from time series data. Jonathan’s been very active using government data sets (see his StanCon 2017 presentation with Rob Trangucci, Twelve Cities: Does lowering speed limits save pedestrian lives?). Hopefully they’ll share their code—maybe they already have. I really like to see this combination of government data and careful statistical analysis.

In other big news coming up soon,

**Women in ML**are sponosoring a Stan workshop this coming Saturday to be presented by**Michael Betancourt**and**Mitzi Morris**. A bunch of the Stan gang will be there to help out.

In other news,

**Andrew Gelman**‘s been driving a lot of rethinking of our interfaces because he’s revising his and Jennifer Hill’s regression book (the revision will be two books). Specifically, he’s thinking a lot about workflow and how we can manage model expansion by going from fixed to modeled parameters. Right now, the process is onerous in that you have to move data variables into the parameters block and keep updating their priors. Andrew wants to be able to do this from the outside, but Michael Betancourt and I both expressed a lot of skepticism in terms of it breaking a lot of our fundamental abstractions (like a Stan program defining the density that’s fit!). More to come on this hot topic. Any ideas on how to manage developing a model would be appreciated. This goes back to the very first thing Matt Hoffman, Michael Malecki and I worked on with Andrew when he hired us before we’d conceived Stan. You’d think we’d have better advice on this after all this time. I’ve seen people do everything from use the C++ preprocessor to write elaborate program generation code in R.**Breck Baldwin**has been working on governance and we’re converging on a workable model that we’ll share with everyone soon. The goal’s to make the governance clear and less of a smoke-filled room job by those of us who happen to go to lunch after the weekly meetings.**Jonah Gabry**is taking on the ggplot2-ification of the new regression book and trying to roll everything into a combination of RStanArm and BayesPlot. No word yet if the rest of the tidyverse is to follow. Andrew said, “I’ll see what Jonah comes up with” or something to that effect.**Jonah**has alos been working on priors for multilevel regression and poststratification with**Yajuan Si**(former postdoc of Andrew’s, now at U. Wisconsin); the trick is doing somethign reasonable when you have lots of interactions.**Ben Goodrich**has been working on the next release of RStanArm. It’s beginning to garner a lot of contributions. Remember that the point has been to convert a lot of common point estimation packages to Bayesian versions and supply them with familiar R interfaces. Ben’s particularly been working on survival analysis lately.- Our
**high school student intern**(don’t know if I can mention names online—the rest of our developers are adults!) is working on applying the Cook-Gelman-Rubin metric to evaluating various models. We’re doing much more of this method and it needs a less verbose and more descriptive name! **Mitzi Morris**submitted a pull request for the Stan repo to add line numbers to error messages involving compound declare-and-define statements.- A joint effort by
**Mitzi Morris**,**Dan Simpson**,**Imad Ali**, and**Miguel A Martinez Beneito**has led to convergence of models and fits among BUGS, Stan, and INLA for intrinsic conditional autorgression (ICAR) models. Imad’s building the result in RStanArm and has figured out how to extend the loo (leave one out cross-validation) package to deal with spatial models. Look for it in a Stan package near you soon. Mitzi’s working on the case study, which has been updated in the example-models repo. **Charles Margossian**knocked off a paper on the mixed ODE solver he’s been working on, with a longer paper promised that goes through all the code details. Not sure if that’s on arXiv or not. He’s also been training**Bill Gillespie**to code in C++, which is great news for the project since Charles has to contend with his first year of grad school next year (whereas Bill’s facing a pleasant retirement of tracking down memory leaks). Charles is also working on getting the algebraic and fixed state solvers integrated into Torsten before the fall.**Krzysztof Sakrejda**has a new paper out motivating a custom density he wrote in Stan for tracking dealys in diseseas incidence in a countrywide analysis for Thailand. I’m not sure where he put it.**Yuling Yao**is revising the stacking paper (for a kind of improved model averaging). I believe this is going into the loo package (or is maybe already there). So much going on with Stan I can’t keep up!**Yuling**also revised the simulated tempering paper, which is some custom code on top of Stan to fit models with limited multimodality. There was some discussion about realistic examples with limited multimodality and we hope to have a suite of them to go along with the paper.**Sebastian Weber**,**Bob Carpenter**, and**Micahel Betancourt**, and**Charles Margossian**had a long and productive meeting to design the MPI interface. It’s very tricky trying to find an interface that’ll fit into the Stan language and let you ship off data once and reuse it. I think we’re almost there. The design discussion’s on Discourse.**Rob Trangucci**is finishing the GP paper (after revising the manual chapter) with Dan Simpson, Aki Vehtari, and Michael Betancourt.

I’m sure I’m missing a lot of goings on, especially among people not at our weekly meetings. If you know of things that should be on this list, please let me know.

The post Stan Weekly Roundup, 21 July 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Statistical Modeling, Causal Inference, and Social Science**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...