
Buzzfeed trains an AI to find spy planes

15 hours 1 min ago

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Last year, Buzzfeed broke the story that US law enforcement agencies were using small aircraft to observe points of interest in US cities, thanks to analysis of public flight-records data. The data journalism team no doubt realized that the Flightradar24 data set held many more stories of public interest; the challenge lay in separating routine, day-to-day aircraft traffic from more unusual, covert activity.

So they trained an artificial intelligence model to identify unusual flight paths in the data. The model, implemented in the R programming language, applies a random forest algorithm to identify flight patterns similar to those of covert aircraft identified in their earlier “Spies in the Skies” story. When that model was applied to the almost 20,000 flights in the FlightRadar24 dataset, about 69 planes were flagged as possible surveillance aircraft. Several of those were false positives, but further journalistic inquiry into the provenance of the registrations led to several interesting stories.
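The linked repository contains BuzzFeed's actual code. Purely as an illustration of the general shape of such a model (the flight features and training labels below are hypothetical, not BuzzFeed's), a random forest classifier in R looks roughly like this:

library(randomForest)

# Hypothetical flight-level features plus a label for flights already
# identified as surveillance aircraft (illustrative data only).
set.seed(1)
flights <- data.frame(
  turn_rate = runif(500),
  speed     = rnorm(500, 150, 30),
  altitude  = rnorm(500, 5000, 1000),
  duration  = rexp(500, 1 / 120),
  surveil   = factor(sample(c("no", "yes"), 500, replace = TRUE, prob = c(0.9, 0.1)))
)

# Fit a random forest and score flights by predicted probability of "yes".
rf <- randomForest(surveil ~ ., data = flights, ntree = 500)
head(predict(rf, flights, type = "prob"))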

Using this model, BuzzFeed News identified several surveillance aircraft in action during a four-month period in late 2015. These included a spy plane operated by US Marshals to hunt drug cartels in Mexico; aircraft covertly registered to US Customs and Border Protection patrolling the US-Mexico border; and a US Navy contractor operating planes circling several points over land in the San Francisco Bay Area, ostensibly for harbor porpoise research.

You can learn more about the stories Buzzfeed News uncovered in the flight data here, and for details on the implementation of the AI model in R, follow the link below.

GitHub (BuzzFeed): BuzzFeed News Trained A Computer To Search For Hidden Spy Planes. This Is What We Found.


Working with air quality and meteorological data Exercises (Part-1)

Tue, 08/15/2017 - 18:29

Atmospheric air pollution is one of the most important environmental concerns in many countries around the world, and it is strongly affected by meteorological conditions. Accordingly, in this set of exercises we use the openair package to work with and analyze air quality and meteorological data. This package provides tools to directly import data from air quality measurement networks across the UK, as well as tools to analyse the data and produce reports. In this exercise set we will import and analyze data from the MY1 station, located on Marylebone Road in London, UK.

Answers to the exercises are available here.

Please install and load the package openair before starting the exercises.

Exercise 1
Import the MY1 data for the year 2016 and save it into a dataframe called my1data.
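A minimal sketch of one way to do this, assuming the MY1 site is available through openair's importAURN() interface:

library(openair)

# Import hourly MY1 (Marylebone Road) data for 2016.
my1data <- importAURN(site = "my1", year = 2016)
head(my1data)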

Exercise 2
Get basic statistical summaries of the my1data dataframe.

Exercise 3
Calculate monthly means of the following (a sketch using openair's timeAverage() follows this list):
a. pm10
b. pm2.5
c. nox
d. no
e. o3
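A possible sketch with timeAverage(); the pollutant column names are assumed to match what importAURN() returns for MY1:

library(openair)
library(dplyr)

# Monthly means of selected pollutants; timeAverage() aggregates over the
# `date` column supplied by the openair import.
monthly_means <- my1data %>%
  select(date, pm10, pm2.5, nox, no, o3) %>%
  timeAverage(avg.time = "month", statistic = "mean")
head(monthly_means)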

You can use air quality data and weather patterns in combination with spatial data visualization. Learn more about spatial data in the online course [Intermediate] Spatial Data Analysis with R, QGIS & More. In this course you will learn how to:

  • Work with Spatial data and maps
  • Learn about different tools to develop spatial data next to R
  • And much more

Exercise 4
Calculate daily means of the following (see the sketch after Exercise 5):
a. pm10
b. pm2.5
c. nox
d. no
e. o3

Exercise 5
Calculate daily maxima of:
a. nox
b. no
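The same timeAverage() pattern covers Exercises 4 and 5; a sketch, again assuming the MY1 column names:

library(openair)

# Daily means of the pollutants from Exercise 4.
daily_means <- timeAverage(my1data, avg.time = "day", statistic = "mean")

# Daily maxima of nox and no for Exercise 5.
daily_max <- timeAverage(my1data, avg.time = "day", statistic = "max")
head(daily_max[, c("date", "nox", "no")])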

Related exercise sets:
  1. Data table exercises: keys and subsetting
  2. Density-Based Clustering Exercises
  3. Data wrangling : I/O (Part-1)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

Simple practice: basic maps with the Tidyverse

Tue, 08/15/2017 - 11:27

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

To master data science, you need to practice.

This sounds easy enough, but in reality, many people have no idea how to practice.

A full explanation of how to practice is beyond the scope of this blog post, but I can give you a quick tip here:

You need to master the most important techniques, and then practice those techniques in small scripts until you know them backwards and forwards.

Your goal should be to be able to write code “with your eyes closed.”

With that in mind, I want to give you a few small scripts that you can practice. Here are a few small scripts to create a set of maps. (As you examine these, pay close attention to how simple they are. How many functions and commands do you actually need to know?)

Let’s start with a simple one: create a map of the United States.

# INSTALL PACKAGE: tidyverse
library(tidyverse)

# MAP USA
map_data("usa") %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon()



Three lines of code (four, if you count the library() function).

That’s it. Three lines to make a map of the US.

And if you change just 5 characters (change usa to state) you can generate a map of the states of the United States:

# MAP USA (STATES)
map_data("state") %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon()



If you add an additional line, you can use filter to subset your data and map only a few states.

# MAP CALIFORNIA, NEVADA, OREGON, WASHINGTON
# - to do this, we're using dplyr::filter()
# ... otherwise, it's almost exactly the same as the previous code
map_data("state") %>%
  filter(region %in% c("california", "nevada", "oregon", "washington")) %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon()



What I want to emphasize is how easy this is if you just know how filter works and how the pipe operator (%>%) works.

If you make another simple change, you can create a county level map of those states:

# MAP CALIFORNIA, NEVADA, OREGON, WASHINGTON
# (Counties)
map_data("county") %>%
  filter(region %in% c("california", "nevada", "oregon", "washington")) %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon()



Finally, a few more changes can get you a map of the world:

# MAP WORLD
map_data("world") %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon()



… and then a map of a single country, like Japan:

# MAP JAPAN
map_data("world") %>%
  filter(region == 'Japan') %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon()



I want to point out again that all of these maps were created with very simple variations on our original 3 line script.

It really doesn’t take that much to achieve competence in creating basic maps like this.

Practice small to learn big

You might be asking: why bother practicing these? They’re not terribly useful.

You need to understand that large projects are built from dozens of small snippets of code like these.

Moreover, these little snippets of code are made up of a small set of functions that you can break down and learn one command at a time.

So the path to mastery involves first mastering syntax of small individual commands and functions. After you’ve memorized the syntax of individual functions and commands, little scripts like this give you an opportunity to put those commands together into something a little more complex.

Later, these little scripts can be put together with other code snippets to perform more complicated analyses.

If you can practice (and master) enough small scripts like this, it becomes dramatically easier to execute larger projects quickly and confidently.

For example, by mastering the little scripts above, you put yourself on a path to creating something more like this:



If you want to achieve real fluency, you need to practice small before you practice big. So find little scripts like this, break them down, and drill them until you can type them without hesitation. You’ll thank me later.

(You’re welcome.)

A quick exercise for you

Count up the commands and arguments that we used in the little scripts above.

For the sake of simplicity, don’t count the data features that you might need (like the “region” column, etc), just count the number of functions, commands, and arguments that you need to know.

How many? If you really, really pushed yourself, how long would it take to memorize those commands?

Leave a comment below and tell me.

Sign up now, and discover how to rapidly master data science

To rapidly master data science, you need a plan.

You need to be highly systematic.

Sign up for our email list right now, and you’ll get our “Data Science Crash Course.”

In it you’ll discover:

  • A step-by-step learning plan
  • How to create the essential data visualizations
  • How to perform the essential data wrangling techniques
  • How to get started with machine learning
  • How much math you need to learn (and when to learn it)
  • And more …

SIGN UP NOW

The post Simple practice: basic maps with the Tidyverse appeared first on SHARP SIGHT LABS.


set_na_where(): a nonstandard evaluation use case

Tue, 08/15/2017 - 07:00

(This article was first published on Higher Order Functions, and kindly contributed to R-bloggers)

In this post, I describe a recent case where I used rlang’s
tidy evaluation
system to do some data-cleaning. This example is not particularly involved, but
it demonstrates a basic but powerful idea: that we can capture the
expressions that a user writes, pass them around as data, and make some
:dizzy: magic :sparkles: happen. This technique in R is called
nonstandard evaluation.
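(Not from the post, but as a tiny preview of the mechanic it builds on: rlang lets you capture an expression with quo() and evaluate it later inside a data frame with eval_tidy().)

library(rlang)

# Capture an expression as data without evaluating it...
e <- quo(Sepal.Length > 7)

# ...then evaluate it later inside a data frame of our choosing.
summary(eval_tidy(e, data = iris))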

Strange eyetracking data

Last week, I had to deal with a file with some eyetracking data from a
sequence-learning experiment. The eyetracker records the participant’s gaze
location at a rate of 60 frames per second—except for this weird file which
wrote out ~80 frames each second. In this kind of data, we have one row per
eyetracking sample, and each sample records a timestamp and the gaze location
:eyes: on the computer screen at each timestamp. In this particular dataset, we
have x and y gaze coordinates in pixels (both eyes averaged together,
GazeX and GazeY) or in screen proportions (for each eye in the EyeCoord
columns.)

library(dplyr)
library(ggplot2)
library(rlang)

# the data is bundled with an R package I wrote
# devtools::install_github("tjmahr/fillgaze")
df <- system.file("test-gaze.csv", package = "fillgaze") %>%
  readr::read_csv() %>%
  mutate(Time = Time - min(Time)) %>%
  select(Time:REyeCoordY) %>%
  round(3) %>%
  mutate_at(vars(Time), round, 1) %>%
  mutate_at(vars(GazeX, GazeY), round, 0)

df
#> # A tibble: 14,823 x 8
#>     Time Trial GazeX GazeY LEyeCoordX LEyeCoordY REyeCoordX REyeCoordY
#>  1   0.0     1  1176   643      0.659      0.589      0.566      0.602
#>  2   3.5     1 -1920 -1080     -1.000     -1.000     -1.000     -1.000
#>  3  20.2     1 -1920 -1080     -1.000     -1.000     -1.000     -1.000
#>  4  36.8     1  1184   648      0.664      0.593      0.570      0.606
#>  5  40.0     1  1225   617      0.685      0.564      0.591      0.579
#>  6  56.7     1 -1920 -1080     -1.000     -1.000     -1.000     -1.000
#>  7  73.4     1  1188   641      0.665      0.587      0.572      0.600
#>  8  76.6     1  1204   621      0.674      0.568      0.580      0.582
#>  9  93.3     1 -1920 -1080     -1.000     -1.000     -1.000     -1.000
#> 10 109.9     1  1189   665      0.666      0.609      0.572      0.622
#> # ... with 14,813 more rows

In this particular eyetracking setup, offscreen looks are coded as negative gaze
coordinates, and what’s extra weird here is that every second or third point is
incorrectly placed offscreen. We see that in the frequent -1920 values in
GazeX. Plotting the first few x and y pixel locations shows the
pattern as well.

p <- ggplot(head(df, 40)) +
  aes(x = Time) +
  geom_hline(yintercept = 0, size = 2, color = "white") +
  geom_point(aes(y = GazeX, color = "GazeX")) +
  geom_point(aes(y = GazeY, color = "GazeY")) +
  labs(x = "Time (ms)", y = "Screen location (pixels)", color = "Variable")

p +
  annotate("text", x = 50, y = -200, label = "offscreen", color = "grey20") +
  annotate("text", x = 50, y = 200, label = "onscreen", color = "grey20")

It is physiologically impossible for a person’s gaze to oscillate so quickly and
with such magnitude (the gaze is tracked on a large screen display), so
obviously something weird was going on with the experiment software.

This file motivated me to develop a general purpose package for interpolating
missing data in eyetracking experiments
.
This package was always something I wanted to do, and this file moved it from
the someday list to the today list.

A function to recode values in many columns as NA

The first step in handling this problematic dataset is to convert the offscreen
values into actual missing (NA) values. Because we have several columns of
data, I wanted a succinct way to recode values in multiple columns into NA
values.

First, we sketch out the code we want to write when we’re done.

set_na_where <- function(data, ...) {
  # do things
}

set_na_where(
  data = df,
  GazeX = GazeX < -500 | 2200 < GazeX,
  GazeY = GazeY < -200 | 1200 < GazeY)

That is, after specifying the data, we list off an arbitrary number of column
names, and with each name, we provide a rule to determine whether a value in
that column is offscreen and should be set to NA. For example, we want every
value in GazeX where GazeX < -500 or 2200 < GazeX is TRUE to be replaced
with NA.

Bottling up magic spells

Lines of computer code are magic spells: We say the incantations and things
happen around us. Put more formally, the code contains expressions that are
evaluated in an environment.

hey <- "Hello!"
message(hey)
#> Hello!

exists("x")
#> [1] FALSE

x <- pi ^ 2
exists("x")
#> [1] TRUE

print(x)
#> [1] 9.869604

stop("what are you doing?")
#> Error in eval(expr, envir, enclos): what are you doing?

In our function signature, function(data, ...), the expressions are collected
in the special “dots” argument (...). In normal circumstances, we can view the
contents of the dots by storing them in a list. Consider:

hello_dots <- function(...) {
  str(list(...))
}

hello_dots(x = pi, y = 1:10, z = NA)
#> List of 3
#>  $ x: num 3.14
#>  $ y: int [1:10] 1 2 3 4 5 6 7 8 9 10
#>  $ z: logi NA

But we are not passing in regular data; rather, we are passing expressions that need to
be evaluated in a particular location. Below, the magic words are uttered and we get an error
because they mention things that do not exist in the current environment.

hello_dots(GazeX = GazeX < -500 | 2200 < GazeX)
#> Error in str(list(...)): object 'GazeX' not found

What we need to do is prevent these words from being uttered until the time and
place are right. Nonstandard evaluation is a way of bottling up magic spells
and changing how or where they are cast
—sometimes we even change the magic
words themselves. We bottle up or capture the expressions given by the user by
quoting them. quo() quotes a single expression, and quos() (plural) will
quote a list of expressions. Below, we capture the expressions stored in the
dots :speech_balloon: and then make sure that their names match column names in
the dataframe.

set_na_where <- function(data, ...) {
  dots <- quos(...)
  stopifnot(names(dots) %in% names(data), !anyDuplicated(names(dots)))

  dots
  # more to come
}

spells <- set_na_where(
  data = df,
  GazeX = GazeX < -500 | 2200 < GazeX,
  GazeY = GazeY < -200 | 1200 < GazeY)
spells
#> $GazeX
#>
#> ~GazeX < -500 | 2200 < GazeX
#>
#> $GazeY
#>
#> ~GazeY < -200 | 1200 < GazeY
#>
#> attr(,"class")
#> [1] "quosures"

I call these results spells because they just contain the expressions stored as
data. We can interrogate these results like data. We can query the names of the
stored data, and we can extract values (the quoted expressions).

names(spells)
#> [1] "GazeX" "GazeY"

spells[[1]]
#>
#> ~GazeX < -500 | 2200 < GazeX

Casting spells

We can cast a spell by evaluating an expression. To keep the incantation from
fizzling out, we specify that we want to evaluate the expression inside of the
dataframe. The function eval_tidy(expr, data) lets us do just that: evaluate
an expression expr inside of some data.

# Evaluate the first expression inside of the data
xs_to_set_na <- eval_tidy(spells[[1]], data = df)

# Just the first few bc there are 10000+ values
xs_to_set_na[1:20]
#>  [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
#> [12]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE

In fact, we can evaluate them all at once by applying eval_tidy() on each
listed expression.

to_set_na <- lapply(spells, eval_tidy, data = df)
str(to_set_na)
#> List of 2
#>  $ GazeX: logi [1:14823] FALSE TRUE TRUE FALSE FALSE TRUE ...
#>  $ GazeY: logi [1:14823] FALSE TRUE TRUE FALSE FALSE TRUE ...

Finishing touches

Now, the rest of the function is straightforward. Evaluate each NA-rule on
the named columns, and then set each row where the rule is TRUE to NA.

set_na_where <- function(data, ...) {
  dots <- quos(...)
  stopifnot(names(dots) %in% names(data), !anyDuplicated(names(dots)))

  set_to_na <- lapply(dots, eval_tidy, data = data)

  for (col in names(set_to_na)) {
    data[set_to_na[[col]], col] <- NA
  }

  data
}

results <- set_na_where(
  data = df,
  GazeX = GazeX < -500 | 2200 < GazeX,
  GazeY = GazeY < -200 | 1200 < GazeY)
results
#> # A tibble: 14,823 x 8
#>     Time Trial GazeX GazeY LEyeCoordX LEyeCoordY REyeCoordX REyeCoordY
#>  1   0.0     1  1176   643      0.659      0.589      0.566      0.602
#>  2   3.5     1    NA    NA     -1.000     -1.000     -1.000     -1.000
#>  3  20.2     1    NA    NA     -1.000     -1.000     -1.000     -1.000
#>  4  36.8     1  1184   648      0.664      0.593      0.570      0.606
#>  5  40.0     1  1225   617      0.685      0.564      0.591      0.579
#>  6  56.7     1    NA    NA     -1.000     -1.000     -1.000     -1.000
#>  7  73.4     1  1188   641      0.665      0.587      0.572      0.600
#>  8  76.6     1  1204   621      0.674      0.568      0.580      0.582
#>  9  93.3     1    NA    NA     -1.000     -1.000     -1.000     -1.000
#> 10 109.9     1  1189   665      0.666      0.609      0.572      0.622
#> # ... with 14,813 more rows

Visually, we can see that the offscreen values are no longer plotted. Plus, we
are told that our data now has missing values.

# `plot %+% data`: replace the data in `plot` with `data`
p %+% head(results, 40)
#> Warning: Removed 15 rows containing missing values (geom_point).
#> Warning: Removed 15 rows containing missing values (geom_point).

One of the quirks about some eyetracking data is that during a blink, sometimes
the device will record the x location but not the y location. (I think this
happens because blinks move vertically so the horizontal detail can still be
inferred in a half-closed eye.) This effect shows up in the data when there are
more NA values for the y values than for the x values:

count_na <- function(data, ...) {
  subset <- select(data, ...)
  lapply(subset, function(xs) sum(is.na(xs)))
}

count_na(results, GazeX, GazeY)
#> $GazeX
#> [1] 2808
#>
#> $GazeY
#> [1] 3064

We can equalize these counts by running the function a second time with new rules.

df %>%
  set_na_where(
    GazeX = GazeX < -500 | 2200 < GazeX,
    GazeY = GazeY < -200 | 1200 < GazeY) %>%
  set_na_where(
    GazeX = is.na(GazeY),
    GazeY = is.na(GazeX)) %>%
  count_na(GazeX, GazeY)
#> $GazeX
#> [1] 3069
#>
#> $GazeY
#> [1] 3069

Alternatively, we can do this all at once by using the same NA-filtering rule
on GazeX and GazeY.

df %>%
  set_na_where(
    GazeX = GazeX < -500 | 2200 < GazeX | GazeY < -200 | 1200 < GazeY,
    GazeY = GazeX < -500 | 2200 < GazeX | GazeY < -200 | 1200 < GazeY) %>%
  count_na(GazeX, GazeY)
#> $GazeX
#> [1] 3069
#>
#> $GazeY
#> [1] 3069

These last examples, where we compare different rules, showcase how nonstandard
evaluation lets us write in a very succinct and convenient manner and quickly
iterate over possible rules. Works like magic, indeed.


#9: Compacting your Shared Libraries

Tue, 08/15/2017 - 03:49

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Welcome to the ninth post in the recognisably rancid R randomness series, or R4 for short. Following on the heels of last week’s post, we aim to look into the shared libraries created by R.

We love the R build process. It is robust, cross-platform, reliable and rather predictable. It. Just. Works.

One minor issue, though, which has come up once or twice in the past is the (in)ability to fully control all compilation options. R will always recall CFLAGS, CXXFLAGS, … etc as used when it was compiled. Which often entails the -g flag for debugging which can seriously inflate the size of the generated object code. And once stored in ${RHOME}/etc/Makeconf we cannot on the fly override these values.

But there is always a way. Sometimes even two.

The first is local and can be used via the (personal) ~/.R/Makevars file (about which I will have to say more in another post). But something I have been using quite a bit lately uses the flags for the shared library linker. Given that we can have different code flavours and compilation choices—between C, Fortran and the different C++ standards—one can end up with a few lines. I currently use the following, which uses -Wl, to pass the -S (or --strip-debug) option to the linker (and also reiterates the desire for a shared library, presumably superfluously):

SHLIB_CXXLDFLAGS = -Wl,-S -shared
SHLIB_CXX11LDFLAGS = -Wl,-S -shared
SHLIB_CXX14LDFLAGS = -Wl,-S -shared
SHLIB_FCLDFLAGS = -Wl,-S -shared
SHLIB_LDFLAGS = -Wl,-S -shared

Let’s consider an example: my most recently uploaded package RProtoBuf. Built under a standard 64-bit Linux setup (Ubuntu 17.04, g++ 6.3) and not using the above, we end up with a library containing 12 megabytes (!!) of object code:

edd@brad:~/git/rprotobuf(feature/fewer_warnings)$ ls -lh src/RProtoBuf.so
-rwxr-xr-x 1 edd edd 12M Aug 14 20:22 src/RProtoBuf.so
edd@brad:~/git/rprotobuf(feature/fewer_warnings)$

However, if we use the flags shown above in .R/Makevars, we end up with much less:

edd@brad:~/git/rprotobuf(feature/fewer_warnings)$ ls -lh src/RProtoBuf.so
-rwxr-xr-x 1 edd edd 626K Aug 14 20:29 src/RProtoBuf.so
edd@brad:~/git/rprotobuf(feature/fewer_warnings)$

So we reduced the size from 12mb to 0.6mb, an 18-fold decrease. And the file tool still shows the file as ‘not stripped’ as it still contains the symbols. Only debugging information was removed.

What reduction in size can one expect, generally speaking? I have seen substantial reductions for C++ code, particularly when using templated code. More old-fashioned C code will be less affected. It seems a little difficult to tell—but this method is my new build default as I continually find rather substantial reductions in size (as I tend to work mostly with C++-based packages).

The second option only occurred to me this evening, and complements the first, which is after all only applicable locally via the ~/.R/Makevars file. What if we wanted it to affect each installation of a package? The following addition to its src/Makevars should do:

strippedLib: $(SHLIB)
	if test -e "/usr/bin/strip"; then /usr/bin/strip --strip-debug $(SHLIB); fi

.phony: strippedLib

We declare a new Makefile target strippedLib. By making it dependent on $(SHLIB), we ensure the standard target of this Makefile is built. And by making the target .phony we ensure it will always be executed. It simply tests for the strip tool and invokes it on the library after it has been built. Needless to say we get the same reduction in size. And this scheme may even pass muster with CRAN, but I have not yet tried.

Lastly, an acknowledgement. Everything in this post has benefited from discussion with my former colleague Dan Dillon, who went as far as setting up tooling in his r-stripper repository. What we have here may be simpler, but it would not have happened without what Dan had put together earlier.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


rstudio::conf(2018): Contributed talks, e-posters, and diversity scholarships

Tue, 08/15/2017 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

rstudio::conf, the conference on all things R and RStudio, will take place February 2 and 3, 2018 in San Diego, California, preceded by Training Days on January 31 and February 1. We are pleased to announce that this year’s conference includes contributed talks, e-posters, and diversity scholarships. More information below!

Contributed talks and e-posters

rstudio::conf() is accepting proposals for contributed talks and e-posters for the first time! Contributed talks are 20 minutes long, and will be scheduled alongside talks by RStudio employees and invited speakers. E-posters will be shown during the opening reception on Thursday evening: we’ll provide a big screen, power, and internet; you’ll provide a laptop with an innovative display or demo.

We are particularly interested in talks and e-posters that:

  • Showcase the use of R and RStudio’s tools to solve real problems.

  • Expand the tidyverse to reach new domains and audiences.

  • Communicate using R, whether it’s building on top of RMarkdown, Shiny, ggplot2, or something else altogether.

  • Discuss how to teach R effectively.

To present you’ll also need to register for the conference.

Apply now!

Applications close Sept 15, and you’ll be notified of our decision by Oct 1.

Diversity scholarships

We’re also continuing our tradition of diversity scholarships, and this year we’re doubling the program to twenty recipients. We will support underrepresented minorities in the R community by covering their registration (including workshops), travel, and accommodation.

Apply now!

Applications close Sept 15, and you’ll be notified of our decision by Oct 1.


Reproducibility: A cautionary tale from data journalism

Tue, 08/15/2017 - 00:04

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Timo Grossenbacher, data journalist with Swiss Radio and TV in Zurich, had a bit of a surprise when he attempted to recreate the results of one of the R Markdown scripts published by SRF Data to accompany their data journalism story about vested interests of Swiss members of parliament. Upon re-running the analysis in R last week, Timo was surprised when the results differed from those published in August 2015. There was no change to the R scripts or data in the intervening two-year period, so what caused the results to be different?

Image credit: Timo Grossenbacher

The version of R Timo was using had been updated, but that wasn't the root cause of the problem. What had also changed was the version of the dplyr package used by the script: version 0.5.0 now, versus version 0.4.2 then. For some unknown reason, a change in dplyr in the intervening period caused some data rows (shown in red above) to be deleted during the data preparation process, and so the results changed.

Timo was able to recreate the original results by forcing the script to run with package versions as they existed back in August 2015. This is easy to do with the checkpoint package: just add a line like

library(checkpoint); checkpoint("2015-08-11")

to the top of your R script. We have been taking daily snapshots of every R package on CRAN since September 2014 to address exactly this situation, and the checkpoint package makes it super-easy to find and install all of the packages you need to make your script reproducible, without changing your main R installation or affecting any other projects you may have. (The checkpoint package is available on CRAN, and also included with all editions of Microsoft R.)
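A minimal sketch of such a script header (the snapshot date and the packages loaded afterwards are just placeholders):

# First lines of an analysis script: pin all CRAN packages to the
# versions available on the chosen snapshot date.
library(checkpoint)
checkpoint("2015-08-11")

# Everything attached after this point comes from that snapshot.
library(dplyr)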

I've been including a call to checkpoint at the top of most of my R scripts for several years now, and it's saved me from failing scripts many times. Likewise, Timo has created a structure and process to support truly reproducible data analysis with R, and it advocates using the checkpoint package to manage package versions. You can find a description of the process here: A (truly) reproducible R workflow, and find the template on GitHub.

By the way, SRF Data — the data journalism arm of the national broadcaster in Switzerland — has published some outstanding stories over the past few years, and has even been nominated for Data Journalism Website of the Year. At the useR!2017 conference earlier this year, Timo presented several fascinating insights into the data journalism process at SRF Data, which you can see in his slides and talk (embedded below):

Timo Grossenbacher: This is what happens when you use different package versions, Larry! and A (truly) reproducible R workflow


acs v2.1.1 is now on CRAN

Mon, 08/14/2017 - 19:44

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

A new version of the acs package is now on CRAN. I recommend that all users of choroplethr update to this version. Here is how to do it:

update.packages()
packageVersion('acs')
[1] ‘2.1.1’

As a reminder, after updating the acs package you might need to reinstall your Census API key with ?api.key.install.
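For reference, reinstalling the key is a one-liner (the key string below is a placeholder for your own key):

library(acs)

# Install (or reinstall) your Census API key; acs caches it for future sessions.
api.key.install(key = "YOUR_CENSUS_API_KEY")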

New Performance Issue

Internally, choroplethr uses the acs package to fetch demographic data from the Census Bureau’s API. Unfortunately, this version of the acs package introduces a performance issue (and solution) when fetching data. Here is an example of the problem:

library(choroplethr)

time_demographic_get = function() {
  start.time = Sys.time()
  df = get_state_demographics()
  end.time = Sys.time()
  end.time - start.time
}

time_demographic_get()
# 1.9 minutes

Performance Issue Fix

The fix for this performance issue is simply to call the function ?acs.tables.install. You only need to call this function once. Doing so will dramatically speed up the performance of choroplethr’s various “get_*_demographics” functions:

acs.tables.install()

time_demographic_get()
# 9.4 seconds

A big thank you to Ezra Haber Glenn, the author of the acs package, for his continued work maintaining the package.

The post acs v2.1.1 is now on CRAN appeared first on AriLamstein.com.


Big Data Solutions: A/B t test

Mon, 08/14/2017 - 18:03

(This article was first published on blogR, and kindly contributed to R-bloggers)

@drsimonj here to share my code for using Welch’s t-test to compare group means using summary statistics.

 Motivation

I’ve just started working with A/B tests that use big data. Where once I’d whimsically run t.test(), now my data won’t fit into memory!

I’m sharing my solution here in the hope that it might help others.

 In-memory data

As a baseline, let’s start with an in-memory case by comparing whether automatic and manual cars have different Miles Per Gallon ratings on average (using the mtcars data set).

t.test(mpg ~ am, data = mtcars)
#>
#>  Welch Two Sample t-test
#>
#> data:  mpg by am
#> t = -3.7671, df = 18.332, p-value = 0.001374
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -11.280194  -3.209684
#> sample estimates:
#> mean in group 0 mean in group 1
#>        17.14737        24.39231

Well… that was easy!

 Big Data

The problem with big data is that we can’t pull it into memory and work with R.

Fortunately, we don’t need the raw data to run Welch’s t-test. All we need is the mean, variance, and sample size of each group. So our raw data might have billions of rows, but we only need six numbers.

Here are the numbers we need for the previous example:

library(dplyr)

grp_summary <- mtcars %>%
  group_by(am) %>%
  summarise(
    mpg_mean = mean(mpg),
    mpg_var  = var(mpg),
    n        = n()
  )

grp_summary
#> # A tibble: 2 x 4
#>      am mpg_mean  mpg_var     n
#> 1     0 17.14737 14.69930    19
#> 2     1 24.39231 38.02577    13

This is everything we need to obtain a t value, degrees of freedom, and a p value.

 t value

Here we use the means, variances, and sample sizes to compute Welch’s t:

welch_t <- diff(grp_summary$mpg_mean) / sqrt(sum(grp_summary$mpg_var / grp_summary$n))

cat("Welch's t value of the mean difference is", welch_t)
#> Welch's t value of the mean difference is 3.767123

This is the same value returned by t.test(), apart from the sign (which is unimportant).

 Degrees of Freedom

Here, we use the variances and sample sizes to compute the degrees of freedom, which is estimated by the Welch–Satterthwaite equation:

welch_df <- ((sum(grp_summary$mpg_var / grp_summary$n))^2) /
  sum(grp_summary$mpg_var^2 / (grp_summary$n^2 * (grp_summary$n - 1)))

cat("Degrees of Freedom for Welch's t is", welch_df)
#> Degrees of Freedom for Welch's t is 18.33225

Again, same as t.test().

 p value

We can now calculate the p value thanks to R’s pt(). Assuming we want to conduct a two-tailed test, here’s what we need to do:

welch_p <- 2 * pt(abs(welch_t), welch_df, lower.tail = FALSE)

cat("p-value for Welch's t is", welch_p)
#> p-value for Welch's t is 0.001373638

Same as t.test() again!

 All-in-one Function

Now we know the math, let’s write a function that takes 2-element vectors of means, variances, and sample sizes, and returns the results in a data frame:

welch_t_test <- function(sample_means, sample_vars, sample_ns) {
  t_val <- diff(sample_means) / sqrt(sum(sample_vars / sample_ns))
  df    <- ((sum(sample_vars / sample_ns))^2) /
           sum(sample_vars^2 / (sample_ns^2 * (sample_ns - 1)))
  p_val <- 2 * pt(abs(t_val), df, lower.tail = FALSE)
  data.frame(t_val = t_val, df = df, p_val = p_val)
}

welch_t_test(grp_summary$mpg_mean, grp_summary$mpg_var, grp_summary$n)
#>      t_val       df       p_val
#> 1 3.767123 18.33225 0.001373638

Excellent!

 Back to Big Data

The point of all this was to help me conduct an A/B test with big data. Has it?

Of course! I don’t pull billions of rows from my data base into memory. Instead, I create a table of the summary statistics within my big data ecosystem. These are easy to pull into memory.

How you create this summary table will vary depending on your setup, but here’s a mock Hive/SQL query to demonstrate the idea:

CREATE TABLE summary_tbl AS
SELECT
    group_var
  , AVG(outcome)      AS outcome_mean
  , VARIANCE(outcome) AS outcome_variance
  , COUNT(*)          AS n
FROM raw_tbl
GROUP BY group_var
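Pulling that summary table back into R and feeding it to welch_t_test() might then look something like this (the connection object, table, and column names are assumptions about your particular setup):

library(DBI)

# `con` is assumed to be an existing DBI connection to the database that
# holds summary_tbl (e.g. via odbc or a Hive/Spark backend).
grp_summary <- dbGetQuery(con, "SELECT * FROM summary_tbl")

welch_t_test(grp_summary$outcome_mean,
             grp_summary$outcome_variance,
             grp_summary$n)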

Happy testing!

 Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.


Sending Emails from R Exercises

Mon, 08/14/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

When monitoring a data source, model, or other automated process, it’s convenient to have a method for easily delivering performance metrics and notifying you whenever something is amiss. One option is to use a dashboard; however, this requires active time and effort to grab numbers and catch errors. An alternative approach is to send an email alert on the performance of the process. In this exercise set, we will explore the email approach using the mailR package.

Exercises in this section will be solved using the mailR package as well as basic HTML and CSS. It is recommended to take a look at the mailR documentation before continuing.

Answers to the exercises are available here.

Exercise 1
Let’s begin by sending “Hello World. This is my email!” as the body parameter from yourself to yourself.
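A minimal sketch of the send.mail() call (the addresses, SMTP host, and credentials below are placeholders you would replace with your own):

library(mailR)

send.mail(from = "you@example.com",
          to = "you@example.com",
          body = "Hello World. This is my email!",
          smtp = list(host.name = "smtp.example.com", port = 465,
                      user.name = "you@example.com", passwd = "password",
                      ssl = TRUE),
          authenticate = TRUE,
          send = TRUE)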

Exercise 2
By passing in a vector for the to parameter, you can send the email to multiple recipients. Send the above email to yourself and a friend.

Exercise 3
So far, your emails have had no subject. Send the email from Exercise 1 to yourself with “Email Testing” for the subject parameter.

Exercise 4
With this package, we can take full advantage of CSS when constructing the body of an email. Send the email from the previous exercise from yourself to yourself where “Hello World.” is now red and “This is my email!” is now blue.

Note: make sure that html = TRUE.

Learn more about html functionality and web connection in the online course A complete journey to web analytics using R tool. In this course you will learn how to:

  • Perform a web based analytic question start to end
  • Learn how to import data from different online platforms such as twitter
  • And much more

Exercise 5
If you write a complex email containing images, dynamic elements, etc. as an HTML file, then you can reference this file with the body parameter. Create an HTML file containing “Hello World. This is my email!” called my_email.html. Send this email to yourself.

Exercise 6
Using knitr, you can compile HTML files. Compile the default knitr document that uses the mtcars dataset to an HTML file and email this to yourself.

Exercise 7
Create a new R script called mailr_six.R containing your code from the above exercises and attach that to your email by referencing the file path to mailr_six.R in the attach.files parameter. Send this email from yourself to yourself.

Exercise 8
The attached R script above does not have a description or a name. Add these in the file.descriptions and file.names parameters, respectively. Send the resulting email to yourself.

Exercise 9
Just as with the recipients, you can attach multiple files, descriptions, and names by passing in vectors to the respective parameters. Create a new R script called mailr_eight.R containing your code from the above exercises and attach both mailr_six.R and mailr_eight.R to your email. Send the resulting email to yourself.

Exercise 10
Create a new R script where a random integer called important_number is generated. If important_number is even, then send an email to yourself notifying you that important_number is even.
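One way this last exercise might be sketched (again with placeholder addresses and SMTP settings):

library(mailR)

# Draw a random integer and send a notification only if it is even.
important_number <- sample(1:100, 1)

if (important_number %% 2 == 0) {
  send.mail(from = "you@example.com",
            to = "you@example.com",
            subject = "important_number is even",
            body = paste("important_number is", important_number),
            smtp = list(host.name = "smtp.example.com", port = 465,
                        user.name = "you@example.com", passwd = "password",
                        ssl = TRUE),
            authenticate = TRUE,
            send = TRUE)
}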

Related exercise sets:
  1. Building Shiny App Exercises (part 5)
  2. Building Shiny App exercises part 1
  3. Parallel Computing Exercises: Foreach and DoParallel (Part-2)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

A Stan case study, sort of: The probability my son will be stung by a bumblebee

Mon, 08/14/2017 - 17:30

(This article was first published on Publishable Stuff, and kindly contributed to R-bloggers)


The Stan project for statistical computation has a great collection of curated case studies which anybody can contribute to, maybe even me, I was thinking. But I don’t have time to worry about that right now because I’m on vacation, being on the yearly visit to my old family home in the north of Sweden.

What I do worry about is that my son will be stung by a bumblebee. His name is Torsten, he’s almost two years old, and he loves running around barefoot on the big lawn. Which has its fair share of bumblebees. Maybe I should put shoes on him so he wont step on one, but what are the chances, really.

Well, what are the chances? I guess if I only had

  • Data on the bumblebee density of the lawn.
  • Data on the size of Torsten’s feet and how many steps he takes when running around.
  • A reasonable Bayesian model, maybe implemented in Stan.

I could figure that out. “How hard can it be?”, I thought. And so I made an attempt.

Getting the data

To get some data on bumblebee density I marked out a 1 m² square on a representative part of the lawn. During the course of the day, now and then, I counted up how many bumblebees sat in the square.

Most of the time I saw zero bumblebees, but 1 m² is not that large of an area. Let’s put the data into R:

bumblebees <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0)

During the same day I kept an eye on Torsten, and sometimes when he was engaged in active play I started a 60 s. timer and counted how many steps he took while running around. Here are the counts:

toddler_steps <- c(26, 16, 37, 101, 12, 122, 90, 55, 56, 39, 55, 15, 45, 8)

Finally I needed to know the area of Torsten’s feet. At a moment when he was not running around I put his foot on a piece of graph paper and traced its edge, which made him giggle. I then calculated how many of the 0.5 cm² squares were fully covered and how many were partially covered.

To estimate the area of his foot I took the average of the number of full squares and the number of partial squares, and then converted to m²:

full_squares <- 174
partial_squares <- 251

squares <- (full_squares + partial_squares) / 2
foot_cm2 <- squares / 4
foot_m2 <- foot_cm2 / 100^2
foot_m2

## [1] 0.0053125

Turns out my son’s feet are about 0.0053 m² each.

Specifying the model

The idea I had here was that if you know the average number of steps Torsten takes per minute while playing, and if you know the average number of bumblebees per m², then you could calculate the average area covered by Torsten’s tiny feet while running around, and then finally you could calculate the probability of a bumblebee being in that area. This would give you a probability for how likely Torsten is to be stung by a Bumblebee. However, while we have data on all of the above, we don’t know any of the above for certain.
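As a rough sanity check of that chain of reasoning (my own back-of-envelope, not part of the original post), plugging the sample means straight in already gives a ballpark figure:

# Point estimate: bees per m2 * steps per minute * foot area in m2 * 60 minutes.
mean(bumblebees) * mean(toddler_steps) * foot_m2 * 60
# roughly 1.5 expected stings per hour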

So, to capture some of this uncertainty I thought I would whip together a small Bayesian model and, even though it’s a bit overkill in this case, why not fit it using Stan? (If you are unfamiliar with Bayesian data analysis and Stan I’ve made a short video tutorial you can find here.)

Let’s start our Stan program by declaring what data we have:

data {
  int n_bumblebee_observations;
  int bumblebees[n_bumblebee_observations];
  int n_toddler_steps_observations;
  int toddler_steps[n_toddler_steps_observations];
  real foot_m2;
}

Now we need to decide on a model for the number of bumblebees in a m² and a model for the number of toddler_steps in 60 s. For the bees I’m going with a simple Poisson model, that is, the model assumes there is an average number of bees per m² ($\mu_\text{bees}$) and that this is all there is to know about bees.

For the number of steps I’m going to step it up a notch with a negative binomial model. The negative binomial distribution can be viewed in a number of ways, but one particularly useful way is as an extension of the Poisson distribution where the mean number of steps for each outcome is not fixed, but instead is varying, where this is controlled by the precision parameter $\phi_\text{steps}$. The larger $\phi_\text{steps}$ is the less the mean number of steps for each outcome is going to vary around the overall mean $\mu_\text{steps}$. That is, when $\phi_\text{steps} \rightarrow \infty$ then the negative binomial becomes a Poisson distribution. The point with using the negative binomial is to capture that Torsten’s activity level when playing on the lawn is definitely not constant. Sometimes he can run full speed for minutes, but sometimes he spends a long time being fascinated by the same little piece of rubbish and then he’s not taking many steps.

So we almost have a Bayesian model for our data, we just need to put priors on the parameters $\mu_\text{bees}$, $\mu_\text{steps}$, and $\phi_\text{steps}$. All three parameters are positive and a quick n’ dirty prior when you have positive parameters is a half-Cauchy distribution centered at zero: It’s just like half a normal distribution centered at zero, but with a much fatter tail. It has one free parameter, the scale, which also happens to be the median of the half-Cauchy. Set the scale/median to a good guess for the parameter in question and you are ready to go! If your guess is good then this prior will give the parameter the slightest nudge in the right direction, if your guess is bad then, as long as you have enough data, the fat tail of the half-Cauchy distribution will help you save face.

My guess for $\mu_\text{bees}$ is (4.0 + 4.0) / (10 × 10) = 0.08. I base this on the fact that while crossing the lawn I usually scan an area of about 10×10 m², where I usually see around four bees, and I assume I’m missing half of the bees. So eight bees on a 100 m² lawn gives 0.08 bees per m². My guess for the number of steps per minute is going to be 60. For the precision parameter $\phi_\text{steps}$ I have no clue, but it shouldn’t be too large nor too small, so my guess is going to be 1.0. These guesses become the scales of the half-Cauchy priors in the model below.

With the priors specified we now have all the required parts and here is the resulting Bayesian model:

$$
\begin{align}
\text{bees} &\sim \text{Poisson}(\mu_\text{bees}) \\
\text{steps} &\sim \text{Neg-Binomial}(\mu_\text{steps}, \phi_\text{steps}) \\
\mu_\text{bees} &\sim \text{Half-Cauchy}((4 + 4) / (10 \times 10)) \\
\mu_\text{steps} &\sim \text{Half-Cauchy}(60) \\
\phi_\text{steps} &\sim \text{Half-Cauchy}(1)
\end{align}
$$

And here is the Stan code implementing the model above:

parameters {
  real<lower=0> mu_bees;
  real<lower=0> mu_steps;
  real<lower=0> precision_steps;
}

model {
  # Since we have constrained the parameters to be positive we get implicit
  # half-cauchy distributions even if we declare them to be 'full'-cauchy.
  mu_bees ~ cauchy(0, (4.0 + 4.0) / (10.0 * 10.0) );
  mu_steps ~ cauchy(0, 60);
  precision_steps ~ cauchy(0, 1);

  bumblebees ~ poisson(mu_bees);
  toddler_steps ~ neg_binomial_2(mu_steps, precision_steps);
}

The final step is to predict how many bumblebees Torsten will step on during, say, one hour of active play. We do this in the generated quantities code block in Stan. The code below will step through 60 minutes (a.k.a. one hour) and for each minute: (1) sample pred_steps from the negative binomial, (2) calculate the area (m²) covered by these steps, and (3) sample the number of bees in this area from the Poisson, giving a predicted stings_by_minute. Finally we sum these 60 minutes’ worth of stings into stings_by_hour.

generated quantities {
  int stings_by_minute[60];
  int stings_by_hour;
  int pred_steps;
  real stepped_area;

  for(minute in 1:60) {
    pred_steps = neg_binomial_2_rng(mu_steps, precision_steps);
    stepped_area = pred_steps * foot_m2;
    stings_by_minute[minute] = poisson_rng(mu_bees * stepped_area);
  }

  stings_by_hour = sum(stings_by_minute);
}

Looking at the result

We have data, we have a model, and now all we need to do is fit it. After we’ve put the whole Stan model into bee_model_code as a string we do it like this using the rstan interface:

# First we put all the data into a list
data_list <- list(
  n_bumblebee_observations = length(bumblebees),
  bumblebees = bumblebees,
  n_toddler_steps_observations = length(toddler_steps),
  toddler_steps = toddler_steps,
  foot_m2 = foot_m2
)

# Then we fit the model whose description is in bee_model_code.
library(rstan)
fit <- stan(model_code = bee_model_code, data = data_list)

With the model fitted, let’s take a look at the posterior probability distribution of our parameters.

stan_hist(fit, c("mu_bees", "mu_steps", "precision_steps"))
print(fit, c("mu_bees", "mu_steps", "precision_steps"))

##                  mean se_mean    sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## mu_bees          0.12    0.00  0.05  0.04  0.08  0.11  0.15  0.23  3551    1
## mu_steps        50.88    0.22 11.65 32.56 43.09 49.34 57.13 78.00  2801    1
## precision_steps  1.80    0.01  0.68  0.78  1.32  1.70  2.17  3.41  2706    1

This seems reasonable, but note that the distributions are pretty wide, which means there is a lot of uncertainty! For example, looking at mu_bees it’s credible that there is everything from 0.04 bees/m² to 0.2 bees/m².

Anyway, let’s take a look at what we are really interested in, the predictive distribution over how many stings per hour Torsten will suffer during active play:

# Getting the sample representing the prob. distribution over stings/hour.
stings_by_hour <- as.data.frame(fit)$stings_by_hour

# Now actually calculating the prob of 0 stings, 1 sting, 2 stings, etc.
strings_probability <- prop.table( table(stings_by_hour) )

# Plotin' n' printin'
barplot(strings_probability, col = "yellow",
        main = "Posterior predictive stings per hour",
        xlab = "Number of stings", ylab = "Probability")

round(strings_probability, 2)

## stings_by_hour
##    0    1    2    3    4    5    6    7    8    9   10
## 0.27 0.30 0.22 0.12 0.05 0.02 0.01 0.00 0.00 0.00 0.00

Ok, it seems like it is most probable that Torsten will receive one sting per hour, but we should not be surprised if it’s two or even three stings. I’d better put some shoes on him! The problem is that after a couple of days full of active barefoot play, my son Torsten’s feet look like this:

As you can see his feet are not swollen from all the bee stings they should receive according to the model. Actually, even after a week, he has not gotten a single bee sting! Which is good, I suppose, in a way, but, well, it means that my model is likely pretty crap. How can this be? Well,

  • I should maybe have gotten more and better data. Perhaps the square I monitored for bumblebees didn’t yield data that was really representative of the bumblebee density in general.
  • The assumption that bumblebees always sting when stepped upon might be wrong. Or maybe Torsten is so smart so that he actively avoids them…
  • Maybe the model was too simplistic. I really should have made a hierarchical model incorporating data from multiple squares and several days of running around. Factoring in the effect of the weather, the flower density, and the diet of Torsten also couldn’t hurt.

I guess I could spend some time improving the model. And I guess there is a lesson to be learned here about that it is hard to predict the predictive performance of a model. But all that has to wait. I am on vacation after all.

The full data and code behind this blog post can be found here.


Treating your data: The old school vs tidyverse modern tools

Mon, 08/14/2017 - 14:01

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos

When I first started using R there was no such thing as the tidyverse. Although some of the tidyverse packages were available independently, I learned to treat my data mostly using brute force combining pieces of information I had from several sources. It is very interesting to compare this old school programming with the tidyverse writing using the magrittr package. Even if you want to stay old school, tidyverse is here to stay and it is the first tool taught in many data science courses based on R.

My objective is to show a very simple example comparing the two ways of writing. There are several ways to do what I am going to propose here, but I think this example is enough to capture the main differences between old school code and magrittr plus tidyverse. Magrittr is not new, but it seems to me that it is more popular now because of the tidyverse.

To the example

I am going to generate a very simple dataset where we have two variables indexed by letters. My objective is to sum the two variables only for the rows corresponding to vowels.

set.seed(123)
M = 1000
db1 = data.frame(id = sample(letters, 1000, replace = TRUE),
                 v1 = rnorm(1000), v2 = rnorm(1000))
vowels = c("a", "e", "i", "o", "u")
head(db1)

##   id          v1         v2
## 1  h -0.60189285 -0.8209867
## 2  u -0.99369859 -0.3072572
## 3  k  1.02678506 -0.9020980
## 4  w  0.75106130  0.6270687
## 5  y -1.50916654  1.1203550
## 6  b -0.09514745  2.1272136

The first strategy (old school) to solve this problem is to use aggregate and then some manipulation. First I aggregate the variables to get the sum for each letter, then I select the vowels and use colSums to get the final result.

ag1 = aggregate(. ~ id, data = db1, FUN = sum)
ag1 = ag1[ag1$id %in% vowels, ]
ag1 = colSums(ag1[, -1])
ag1

##        v1        v2 
## 26.656837  6.644839

The second strategy (tidyverse) uses functions from the dplyr package and the forward-pipe operator (%>%) from magrittr. The forward-pipe allows us to do many operations in a single shot to get the final result. We do not need to create the auxiliary objects I created in the previous example. The first two lines do precisely the same as the aggregate. The group_by defines the variable used to create the groups and the summarise tells R the grouping function. In the third line I select only the rows corresponding to vowels and the last summarize sums each variable. As you can see, the results are the same. This approach generates an object called a tibble, which is a special type of data frame from the tidyverse with some different features, like not using factors for strings.

library(tidyverse)
ag2 = group_by(db1, id) %>%
  summarise(v1 = sum(v1), v2 = sum(v2)) %>%
  filter(id %in% vowels) %>%
  summarize(v1 = sum(v1), v2 = sum(v2))
ag2

## # A tibble: 1 x 2
##         v1       v2
##      <dbl>    <dbl>
## 1 26.65684 6.644839

The same thing using merge

Suppose that we want to do the same thing as the previous example but now we are dealing with two data frames: the one from the previous example and a second data frame of characteristics that will tell us which letters are vowels.

aux = rep("consonant",length(letters)) aux[which(letters %in% vowels)] = "vowel" db2 = data.frame(id = letters, type = aux) head(db2) ## id type ## 1 a vowel ## 2 b consonant ## 3 c consonant ## 4 d consonant ## 5 e vowel ## 6 f consonant

The first approach uses merge to combine the two data frames and then sums the observations where type is "vowel".

merge1 = merge(db1, db2, by = "id")
head(merge1)

##   id          v1         v2  type
## 1  a -0.73657823  1.1903106 vowel
## 2  a  0.07987382 -1.1058145 vowel
## 3  a -1.20086933  0.4859824 vowel
## 4  a  0.32040231 -0.6196151 vowel
## 5  a -0.69493683 -1.0387278 vowel
## 6  a  0.15735335  1.6165776 vowel

merge1 = colSums(merge1[merge1[, 4] == "vowel", 2:3])
merge1

##        v1        v2 
## 26.656837  6.644839

The second approach uses the function inner_join from the dplyr package, then it filters the vowel observations and uses summarise to sum them.

merge2 = inner_join(db1, db2, by = "id") %>%
  filter(type == "vowel") %>%
  summarise(v1 = sum(v1), v2 = sum(v2))
merge2

##         v1       v2
## 1 26.65684 6.644839

As you can see, the two ways of writing are very different. Naturally, there is some cost to switching from old school to tidyverse code. However, the second approach makes your code easier to read; it is part of the tidyverse philosophy to write code that can be read by humans. For example, something like this:

x = 1:10
sum(log(sqrt(x)))

## [1] 7.552206

becomes something like this if you use the forward-pipe:

x %>%
  sqrt() %>%
  log() %>%
  sum()

## [1] 7.552206

For more information check out the tidyverse website and the R For Data Science book, which is available for free on-line here.


To leave a comment for the author, please follow the link and comment on their blog: R – insightR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Shinydashboards from right to left (localizing a shinydashboard to Hebrew)

Mon, 08/14/2017 - 12:39

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Post by Adi Sarid (Sarid Institute for Research Services LTD.)

Lately I’ve been working a lot with the shinydashboard library.

Like shiny, it allows any R programmer to harness the power of R and create professional-looking interactive apps. The thing about shinydashboard is that it makes wonderful-looking dashboards.

What I’ve been doing with the dashboards is to create dedicated dashboards for my customers. Since most of my customers speak, read, and write in Hebrew I needed to fit that into my shinydashboard apps (i.e., fully localize the app). See an example for such a localized dashboard I made here.

Making a shinydashboard localized turned out to be simpler than I thought. 

Since the average R programmer doesn't necessarily know and understand CSS, I thought I'd post my solution. This should make any Hebrew or Arabic dashboard work from right to left, including the sidebar and all other objects (though I only tested it in Hebrew).

If you want the short version:

(1) Download the following css file;

(2) Put it in a subfolder “/www” of your shinydashboard app;

(3) In your dashboardBody command (within the ui.R section of your shiny app) add the following code:

dashboardBody(
  tags$head(
    tags$link(rel = "stylesheet", type = "text/css", href = "bootstrap-rtl.css")
  ),
  ...
)
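For context, here is a minimal sketch of where that snippet sits inside a complete ui (the header title, sidebar item and box are placeholders of mine, not taken from the original app):

library(shiny)
library(shinydashboard)

# Minimal RTL-ready dashboard; bootstrap-rtl.css must live in the app's www/ folder
ui <- dashboardPage(
  dashboardHeader(title = "RTL dashboard"),                                    # placeholder title
  dashboardSidebar(sidebarMenu(menuItem("Overview", tabName = "overview"))),   # placeholder item
  dashboardBody(
    tags$head(
      tags$link(rel = "stylesheet", type = "text/css", href = "bootstrap-rtl.css")
    ),
    fluidRow(box(title = "Example box", "Some content"))                       # placeholder content
  )
)

server <- function(input, output) {}
shinyApp(ui, server)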

Here are the few insights and steps which led me to this solution:

Insight #1: any shiny app (dashboard or otherwise) can be customized using CSS. That’s no secret. However, the adaptation to RTL isn’t that simple when you have so many objects, mobile responsiveness to worry about, etc.

Insight #2: shinydashboard is based on the AdminLTE theme, which is based on the bootstrap 3 theme. AdminLTE is a great theme, and even though it doesn't officially support RTL, mmdsharifi provided a solution on his GitHub page. The same goes for bootstrap 3, which has an RTL customization by morteza (also on GitHub).

Insight #3: What I did in order to make this work was to take the bootstrap-rtl.css from morteza, and then concatenate the AdminLTE-rtl.css file by mmdsharifi. Voilà! (simple, isn’t it?)
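In R terms, that concatenation is just a couple of readLines() calls — a minimal sketch, assuming the two files have already been downloaded next to the app (the local file names here are mine):

rtl_bootstrap <- readLines("bootstrap-rtl.css")    # morteza's bootstrap 3 RTL build
rtl_adminlte  <- readLines("AdminLTE-rtl.css")     # mmdsharifi's AdminLTE RTL patch
writeLines(c(rtl_bootstrap, rtl_adminlte), "www/bootstrap-rtl.css")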

Here’s the resulting css file.

Thanks to 0xOri for suggesting and testing insight #3.


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Parse an Online Table into an R Dataframe – Westgard’s Biological Variation Database

Mon, 08/14/2017 - 08:00

(This article was first published on The Lab-R-torian, and kindly contributed to R-bloggers)

Background

From time to time I have wanted to bring an online table into an R dataframe. While in principle the data can be cut and pasted into Excel, sometimes the table is very large and sometimes the columns get goofed up in the process. Fortunately, there are a number of R tools for accomplishing this. I am just going to show one approach using the rvest package. The rvest package also makes it possible to interact with forms on webpages to request specific material which can then be scraped. I think you will see the potential if you look here.

In our (simple) case, we will apply this process to Westgard's desirable assay specifications as shown on his website. The goal is to parse out the biological variation tables, get them into a dataframe and then write them to csv or xlsx.

Reading in the Data

The first thing to do is to load the rvest and httr packages and define an html session with the html_session() function.

library(rvest)
library(httr)
wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("LabRtorian"))

Now looking at the webpage, you can see that there are 8 columns in the tables of interest. So, we will define an empty dataframe with 8 columns.

#define empty table to hold all the content
biotable = data.frame(matrix(NA, 0, 8))

We need to know which part of the document to scrape. This is a little obscure, but following the instructions in this post, we can determine that the xpaths we need are:

/html/body/div[1]/div[3]/div/main/article/div/table[1]

/html/body/div[1]/div[3]/div/main/article/div/table[2]

/html/body/div[1]/div[3]/div/main/article/div/table[3]

etc.

There are 8 such tables in the whole webpage. We can define a character vector for these as such:

xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

Now we make a loop to scrape the 8 tables and, with each iteration of the loop, append the scraped subtable to the main dataframe called biotable using the rbind() function. We have to use the parameter fill = TRUE in the html_table() function because the tables do not always have a uniform number of columns.

for (j in 1:8){
  subtable <- wg %>%
    read_html() %>%
    html_nodes(xpath = xpaths[j]) %>%
    html_table(., fill = TRUE)
  subtable <- subtable[[1]]
  biotable <- rbind(biotable, subtable)
}

Clean Up

Now that we have the raw data out, we can have a quick look at it:

X1   X2                                      X3                X4                    X5                    X6                       X7                       X8
     Analyte                                 Number of Papers  Biological Variation  Biological Variation  Desirable specification  Desirable specification  Desirable specification
     Analyte                                 Number of Papers  CVI                   CVg                   I(%)                     B(%)                     TE(%)
S-   11-Desoxycortisol                       2                 21.3                  31.5                  10.7                     9.5                      27.1
S-   17-Hydroxyprogesterone                  2                 19.6                  50.4                  9.8                      13.5                     29.7
U-   4-hydroxy-3-methoximandelate (VMA)      1                 22.2                  47.0                  11.1                     13.0                     31.3
S-   5' Nucleotidase                         2                 23.2                  19.9                  11.6                     7.6                      26.8
U-   5'-Hydroxyindolacetate, concentration   1                 20.3                  33.2                  10.2                     9.7                      26.5
S-   α1-Acid Glycoprotein                    3                 11.3                  24.9                  5.7                      6.8                      16.2
S-   α1-Antichymotrypsin                     1                 13.5                  18.3                  6.8                      5.7                      16.8
S-   α1-Antitrypsin                          3                 5.9                   16.3                  3.0                      4.3                      9.2

We can see that we need to define column names and we need to get rid of some rows containing extraneous column header information. There are actually 8 such sets of headers to remove.

table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE") names(biotable) <- table.header

Let's now find rows we don't want and remove them.

for.removal <- grep("Analyte", biotable$Analyte)
biotable <- biotable[-for.removal, ]

You will find that the table has missing data which is written as "---". This should now be replaced by NA and the row names should be reset to sequential integers. Also, we will remove all the minus signs after the specimen type. I'm not sure what they add.

biotable[biotable == "---"] <- NA row.names(biotable) <- 1:nrow(biotable) biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

Check it Out

Just having another look at the first 10 rows:

Sample  Analyte                                          NumPapers  CVI   CVG   I     B     TE
S       11-Desoxycortisol                                2          21.3  31.5  10.7  9.5   27.1
S       17-Hydroxyprogesterone                           2          19.6  50.4  9.8   13.5  29.7
U       4-hydroxy-3-methoximandelate (VMA)               1          22.2  47.0  11.1  13.0  31.3
S       5' Nucleotidase                                  2          23.2  19.9  11.6  7.6   26.8
U       5'-Hydroxyindolacetate, concentration            1          20.3  33.2  10.2  9.7   26.5
S       α1-Acid Glycoprotein                             3          11.3  24.9  5.7   6.8   16.2
S       α1-Antichymotrypsin                              1          13.5  18.3  6.8   5.7   16.8
S       α1-Antitrypsin                                   3          5.9   16.3  3.0   4.3   9.2
S       α1-Globulins                                     2          11.4  22.6  5.7   6.3   15.7
U       α1-Microglobulin, concentration, first morning   1          33.0  58.0  16.5  16.7  43.9

Now examining the structure:

str(biotable)

## 'data.frame':    370 obs. of  8 variables:
##  $ Sample   : chr "S" "S" "U" "S" ...
##  $ Analyte  : chr "11-Desoxycortisol" "17-Hydroxyprogesterone" "4-hydroxy-3-methoximandelate (VMA)" "5' Nucleotidase" ...
##  $ NumPapers: chr "2" "2" "1" "2" ...
##  $ CVI      : chr "21.3" "19.6" "22.2" "23.2" ...
##  $ CVG      : chr "31.5" "50.4" "47.0" "19.9" ...
##  $ I        : chr "10.7" "9.8" "11.1" "11.6" ...
##  $ B        : chr "9.5" "13.5" "13.0" "7.6" ...
##  $ TE       : chr "27.1" "29.7" "31.3" "26.8" ...

It's kind-of undesirable to have numbers as characters so…

#convert appropriate columns to numeric
biotable[, 3:8] <- lapply(biotable[3:8], as.numeric)

Write the Data

Using the xlsx package, you can output the table to an Excel file in the current working directory.

library(xlsx)
write.xlsx(biotable, file = "Westgard_Biological_Variation.xlsx", row.names = FALSE)

If you are having trouble getting xlsx to install, then just write as csv.

write.csv(biotable, file = "Westgard_Biological_Variation.csv", row.names = FALSE)

Conclusion

You can now use the same general approach to parse any table you have web access to, no matter how small or big it is. Here is a complete script in one place:

library(httr)
library(rvest)
library(xlsx)

wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("yournamehere"))
xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

#define empty dataframe
biotable = data.frame(matrix(NA, 0, 8))

#loop over the 8 html tables
for (j in 1:8){
  subtable <- wg %>%
    read_html() %>%
    html_nodes(xpath = xpaths[j]) %>%
    html_table(., fill = TRUE)
  subtable <- subtable[[1]]
  biotable <- rbind(biotable, subtable)
}

table.header <- c("Sample", "Analyte", "NumPapers", "CVI", "CVG", "I", "B", "TE")
names(biotable) <- table.header

#remove extraneous rows
for.removal <- grep("Analyte", biotable$Analyte)
biotable <- biotable[-for.removal, ]

#make missing data into NA
biotable[biotable == "---"] <- NA
row.names(biotable) <- 1:nrow(biotable)

#convert appropriate columns to numeric
biotable[, 3:8] <- lapply(biotable[3:8], as.numeric)

#get rid of minus signs in column 1
biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

write.xlsx(biotable, file = "Westgard_Biological_Variation.xlsx", row.names = FALSE)
write.csv(biotable, file = "Westgard_Biological_Variation.csv", row.names = FALSE)

Parting Thought on Tables

You prepare a table before me in the presence of my enemies. You anoint my head with oil; my cup overflows.

(Psalm 23:5)


To leave a comment for the author, please follow the link and comment on their blog: The Lab-R-torian. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Supervised Learning in R: Regression

Mon, 08/14/2017 - 03:55

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

We are very excited to announce a new (paid) Win-Vector LLC video training course: Supervised Learning in R: Regression now available on DataCamp

The course is primarily authored by Dr. Nina Zumel (our chief of course design) with contributions from Dr. John Mount. This course will get you quickly up to speed covering:

  • What is regression? (Hint: it is the art of making good numeric predictions, one of the most important tasks in data science, machine learning, or statistics.)
  • When does it work, and when does it not work?
  • How to move fluidly from basic ordinary least squares to Kaggle-winning methods such as gradient boosted trees.

All of this is demonstrated using R, with many worked examples and exercises.
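As a taste of the "basic ordinary least squares" end of that spectrum, here is a minimal sketch in base R on the built-in mtcars data (my own illustration, not course material):

# Ordinary least squares: predict fuel economy from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)                                          # coefficients, R-squared, residuals
predict(fit, newdata = data.frame(wt = 3, hp = 120))  # numeric prediction for a new car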

We worked very hard to make this course very much worth your time.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

RProtoBuf 0.4.10

Mon, 08/14/2017 - 00:08

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

RProtoBuf provides R bindings for the Google Protocol Buffers ("ProtoBuf") data encoding and serialization library used and released by Google, and deployed fairly widely in numerous projects as a language and operating-system agnostic protocol.

A new release, RProtoBuf 0.4.10, just appeared on CRAN. It is a maintenance release replacing one leftover erroneous use of package= in .Call with the correct PACKAGE= (as requested by CRAN). It also integrates a small robustification in the deserializer when encountering invalid objects; this was both reported and fixed by Jeffrey Shen.
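As a reminder of what the package does, here is a minimal round-trip sketch based on the addressbook.proto example that ships with the package (the message type and field names come from that example; this is not code from the release itself):

library(RProtoBuf)   # message descriptors from the bundled addressbook.proto are available on attach

# Build a message, serialize it to a raw vector, then parse it back
p <- new(tutorial.Person, id = 1, name = "Jeffrey", email = "jeff@example.com")
bytes <- p$serialize(NULL)          # raw vector, ready for a file or socket
q <- tutorial.Person$read(bytes)    # the deserializer hardened in 0.4.10
identical(p$name, q$name)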

Changes in RProtoBuf version 0.4.10 (2017-08-13)
  • More careful operation in deserializer checking for a valid class attribute (Jeffrey Shen in #29 fixing #28)

  • At the request of CRAN, correct one .Call() argument to PACKAGE=; update routine registration accordingly

CRANberries also provides a diff to the previous release. The RProtoBuf page has an older package vignette, a ‘quick’ overview vignette, a unit test summary vignette, and the pre-print for the JSS paper. Questions, comments etc should go to the GitHub issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-6)

Sun, 08/13/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Statistics are often taught in school by and for people who like mathematics. As a consequence, in those classes emphasis is put on learning equations, solving calculus problems and creating mathematical models instead of building an intuition for probabilistic problems. But, if you are reading this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let's learn how we can tackle useful statistical problems by writing simple R queries and how to think in probabilistic terms.

In previous set, we’ve seen how to compute probability based on certain density distributions, how to simulate situations to compute their probability and use that knowledge make decisions in obvious situation. But what is a probability? Is there a more scientific way to make those decisions? What is the P-value xkcd keep talking about? In this exercise set, will learn the answer to most of those question and more!

One simple definition of the probability that an event will occur is that it's the frequency of the observations of this event in a data set divided by the total number of observations in this set. For example, if you have a survey where 2 respondents out of 816 say that they are interested in a potential partner only if they are dressed in an animal costume, you can say that the probability that someone in the population is a furry is about 2/816, or 1/408, or 0.00245…, or 0.245%.
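In R this definition is just a division; a quick illustration with the numbers above:

matches <- 2              # respondents reporting the event
n <- 816                  # total respondents
p_hat <- matches / n      # estimated probability
p_hat                     # about 0.00245
100 * p_hat               # about 0.245%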

Answers to the exercises are available here.

Exercise 1
The average height of males in the USA is about 5 foot 9 inches, with a standard deviation of 2.94 inches. If this measure follows a normal distribution, write a function that takes a sample size as input and computes the probability of having a subject taller than 5 foot 8 and smaller than 5 foot 9 in a sample of that size. Then, set the seed to 42 and compute the probability for a sample size of 200.

Exercise 2
We can deduce a lot from that definition. First, a probability is always a fraction, but since we are usually not used to big numbers and have a hard time doing division in our heads, 3968/17849 is not a really useful way to state a probability. As a consequence, we will usually use a percentage or a real number between 0 and 1 to represent a probability. Why 0 and 1? If an event is not present in the data set, its frequency is 0, so whatever the total number of observations, its probability is 0; and if all the observations are of that event, the fraction is equal to 1. Also, if you think about the example of the furries in the survey, maybe you think there's a chance that there are only two furries in the entire population and they both took the survey, so the probability that an individual is a furry is in reality a lot lower than 0.245%. Or maybe there are a lot more furries in the population and only two were surveyed, which makes the real probability much higher. You are right, token reader! In a survey, we estimate the real probability, and we can never tell the real probability from a small sample (that's why, if you are against the national survey in your country, all the statisticians hate you in silence). However, the bigger the sample size of a survey, the less often those rare occurrences happen.

  1. Compute the probability that an American male is taller than 5 foot 8 and smaller than 5 foot 9 with the pnorm function.
  2. Write a function that draws a sample of subjects from this distribution, computes the probability of observing a male in this height range and computes the percentage difference between that estimate and the real value. Make sure that you can repeat this process for all sample sizes between two values.
  3. Use this function to draw samples of sizes from 1 to 10000 and store the results in a matrix.
  4. Plot the difference between the estimation of the probability and the real value.

This plot shows that the bigger the sample size, the smaller the estimation error, but the difference in error between a sample of size 1000 and one of size 10000 is quite small.

Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to:

  • Work with different binomial and logistic regression techniques
  • Know how to compare regression models and choose the right fit
  • And much more

Exercise 3
We have already seen that probability densities can be used to compute probabilities, but how?

For a standard normal distribution:

  1. Compute the probability that x is smaller or equal to zero, then plot the distribution and draw a vertical line at 0.
  2. Compute the probability that x is greater than zero.
  3. Compute the probability that x is less than -0.25, then plot the distribution and draw a vertical line at -0.25.
  4. Compute the probability that x is smaller than zero and greater than -0.25.

Yeah, the area under the curve of a density function between two points is equal to the probability that a random variable takes a value in this interval. That's why densities are really useful: they help us easily compute the probability of an event by doing calculus. Often we will use the cumulative distribution function (cdf), which is the antiderivative of the density function, to compute directly the probability of an event on an interval. The function pnorm(), for example, computes the value of the cdf between minus infinity and a value x. Note that a cdf returns the probability that a random variable takes a value smaller than or equal to x.
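As a quick illustration of the cdf idea (deliberately using different numbers than the exercises):

# P(-1 < X <= 1) for a standard normal: difference of two cdf values
pnorm(1) - pnorm(-1)                           # about 0.683, the familiar 68% within one sd
# equivalently, integrate the density directly
integrate(dnorm, lower = -1, upper = 1)$value  # same area under the curve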
Exercise 4
For a standard normal distribution, find the values x such that:

  1. 99% of the observations are smaller than x.
  2. 97.5% of the observations are smaller than x.
  3. 95% of the observations are smaller than x.
  4. 99% of the observations are greater than x.
  5. 97.5% of the observations are greater than x.
  6. 95% of the observations are greater than x.

Exercise 5
Since probabilities are often estimated, it is useful to measure how good the estimation is and to report that measure along with the estimate. That's why you often hear surveys reported in the form of "x% of the population, with a y% margin of error, 19 times out of 20". In practice, the size of the survey and the variance of the results are the two most important factors that influence the estimation of a probability. Simulation and bootstrap methods are great ways to find the margin of error of an estimate, as sketched below.
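A generic sketch of the bootstrap recipe on made-up data (not the exercise dataset): resample the observations with replacement many times, recompute the statistic each time, and take the 2.5% and 97.5% quantiles of those estimates as a 95% interval.

set.seed(1)
x <- rexp(500, rate = 0.2)                 # made-up sample standing in for the exercise data
boot_est <- replicate(10000, {
  resample <- sample(x, replace = TRUE)    # resample with replacement
  mean(resample > 5 & resample < 10)       # proportion of values between 5 and 10
})
ci <- quantile(boot_est, c(0.025, 0.975))  # interval with roughly 95% coverage
ci
diff(ci) / 2                               # margin of error: half the interval width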

Load this dataset and use bootstrapping to compute the interval that has a 95% (19/20) chance of containing the real probability of getting a value between 5 and 10. What is the margin of error of this estimate?

This process can be applied to any statistic that is estimated, like a mean, a proportion, etc.

When doing estimation, we can use a statistical test to draw conclusions about our estimate and eventually make decisions based on it. For example, if in a survey we estimate that the average number of miles traveled by car each week by Americans is 361.47, we could be interested to know whether the real average is bigger than 360. To do so, we could start by formulating a null and an alternative hypothesis to test. In our scenario, a null hypothesis would be that the mean is equal to or less than 360. We then follow the steps of the test and, if at the end we cannot support this hypothesis, we conclude that the alternative hypothesis is probably true. In our scenario, that hypothesis would be that the mean is bigger than 360.

Then we choose a percentage of times we could afford to be wrong. This value will determine the range of possible values for which we will accept the null hypothesis and is called the significance level (α).

Then we can use a mathematical formula or a bootstrap method to estimate the probability that a sample from a population where the null hypothesis is true would produce an estimate at least as extreme as 361.47. If this probability is less than the significance level, we reject the null hypothesis and go with the alternative hypothesis. If not, we cannot reject the null hypothesis.

So basically, what we do is look at how often our estimate would happen if the null hypothesis were true and, if it is rare enough for our taste (the significance level), we conclude that it is not a random occurrence but a sign that the null hypothesis is false.
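A generic sketch of that logic on made-up data (the exercises below use their own dataset): force the null hypothesis to be true by shifting the sample, bootstrap the mean, and see how often a value as large as the observed one comes up.

set.seed(2)
miles <- rnorm(300, mean = 361.47, sd = 20)    # made-up survey, not the exercise file
obs_mean <- mean(miles)

# Impose the null hypothesis (true mean = 360) by shifting the sample
shifted <- miles - obs_mean + 360
boot_means <- replicate(10000, mean(sample(shifted, replace = TRUE)))

# One-sided test: how often does a bootstrapped mean reach the observed one?
p_value <- mean(boot_means >= obs_mean)
p_value < 0.05                                 # TRUE means we reject the null at alpha = 5%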
Exercise 6
This dataset represents the survey of the situation above.

  1. Estimate the mean of this dataset.
  2. Use the bootstrap method to find 10000 estimates of the mean from this dataset.
  3. Find the value from this bootstrap sample that is bigger than 5% of all the other values. This value is called the critical value of the test and corresponds to α.
  4. From the data we have, should we conclude that the mean of the population is bigger than 360? What is the significance level of this test?

Exercise 7
We can represent the test visually. Since we reject the null hypothesis only if the percentage of bootstrapped means smaller than 360 is less than 5%, we can simply look at where the fifth percentile lies on the histogram of the bootstrapped means. If it is to the left of the 360 value, we know that more than 5% of the bootstrapped means are smaller than 360 and we don't reject the null hypothesis.

Draw the histogram of the bootstrapped means and draw two vertical lines: one at 360 and one at the fifth percentile.

Exercise 8
There are two ways that a mean can be not equal to a value: when the mean is bigger than the value and when it is smaller than the value. So if we want to test the equality of the mean to a specific value, we must verify whether most of our estimates lie around this value or whether a lot of them are far from it. To do so, we create an interval whose endpoints are our mean and another point at the same distance from the reference value as the mean, but on the other side. Then we can compute the probability of getting an estimate outside this interval. This way, we test that the mean is neither bigger nor smaller than the value, 1-α of the time.

Here’s the steps to test the hypothesis that the mean of the dataset of exercise 6 is equal to 363:

  1. To simulate that our distribution has a mean of 363, shift the dataset so that this value becomes the mean.
  2. Generate 10000 bootstrapped means from this distribution.
  3. Compute the endpoints of the test interval.
  4. Compute the probability that the mean is outside this interval.
  5. What conclusion can we make with an α of 5%?

Exercise 9
Repeat the step of exercise 8, but this time test if the mean is smaller than 363.

This shows that a one-sided test is more powerful than a two-sided test in this situation, since there's less wiggle room between the reference value and the critical region of the test. So if you have prior knowledge that makes you believe that an estimate is bigger or smaller than a value, testing for that would give you more assurance of the validity of your results.

Exercise 10
The p-value of a test is the probability of observing an estimate at least as extreme as the one we made if the null hypothesis is true. This value is often used in scientific reports since it's a concise way to express statistical findings. If we know the p-value of a test and the significance level α, we can deduce the result of the test, since the null hypothesis is rejected when p < α. In other words: you have been using the p-value all this time to draw conclusions!

Load the dataset of exercise 5 and compute the p-value associated to the test that the mean is equal to 13 if α is equal to 5%.

Related exercise sets:
  1. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-2)
  2. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-4)
  3. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-5)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

ggvis Exercises (Part-1)

Sat, 08/12/2017 - 12:56

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

INTRODUCTION

The ggvis package is used to make interactive data visualizations. The fact that it combines shiny's reactive programming model and dplyr's grammar of data transformation makes it a useful tool for data scientists.

This package allows us to implement features like interactivity, but on the other hand every interactive ggvis plot must be connected to a running R session.
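As a minimal taste of the syntax (on the built-in mtcars data rather than the data set used in the exercises):

library(ggvis)

# ggvis() declares the data and the mappings, layer_points() adds the marks
mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points()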

Before proceeding, please follow our short tutorial.

Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R and without looking at the answers. Then check the solutions to check your answers.

Exercise 1

Create a list which will include the variables “Horsepower” and “MPG.city” of the “Cars93” data set. HINT: Use ggvis().

Exercise 2

Use the list you just created to make a scatterplot. HINT: Use layer_points().

Exercise 3

Use %>% to create the scatterplot of Exercise 2.

Learn more about using ggvis in the online course R: Complete Data Visualization Solutions. In this course you will learn how to:

  • Work extensively with the ggvis package and its functionality
  • Learn what visualizations exist for your specific use case
  • And much more

Exercise 4

Use the list you created in Exercise 1 to create a scatterplot and use “Cylinders” as stroke.

Exercise 5

Use the list you created in Exercise 1 to create a scatterplot and use “Cylinders” as fill.

Exercise 6

Use the list you created in Exercise 1 to create a scatterplot and use “EngineSize” as size.

Exercise 7

Use the list you created in Exercise 1 to create a scatterplot and use “Cylinders” as shape.

Exercise 8

Use the list you created in Exercise 1 to create a scatterplot with red color and black stroke.

Exercise 9

Use the list you created in Exercise 1 to create a scatterplot with size set to 300 and opacity to 0.5.

Exercise 10

Use the list you created in Exercise 1 to create a scatterplot with cross as shape.

Related exercise sets:
  1. How to create interactive data visualizations with ggvis
  2. Data science for Doctors: Cluster Analysis Exercises
  3. iPlots exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Is it wine o’clock?

Sat, 08/12/2017 - 02:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

Emojis were again quite en vogue this week on Twitter with Romain François doing some awesome stuff for the emo package, in particular this teeny tiny animated clock. It reminded me of my own emoji animated clock that I had done a while ago for representing time-use data. Time for me to present its genesis!

I’m actually not a Quantified Self person, but at my work time-use data was collected for an epidemiology project: information about people activities and locations throughout one day can help unraveling sources of exposure to air pollution. I’ve therefore spent some time thinking about how to represent such data. In particular, my colleague Margaux directed a fantastic video about our project. We introduced some real data from our project in it, including an animated clock that I made with emoji-coding of indoor-outdoor location. I’ll present code and data for producing a similar clock with my own agenda on some Wednesday evenings of last year.

Loading the time-use data

I logged data from my Wednesday in a rather ok format.

library("magrittr") date <- "2016-03-10" activities <- readr::read_csv2("data/2017-08-12-wineoclock-timeuse.csv", col_types = "ccc") %>% dplyr::mutate(start = lubridate::ymd_hms(paste0(date, start, ":00")), end = lubridate::ymd_hms(paste0(date, end, ":00"))) knitr::kable(activities) start end activity 2016-03-10 00:00:00 2016-03-10 06:30:00 sleeping 2016-03-10 06:30:00 2016-03-10 07:00:00 coffee 2016-03-10 07:00:00 2016-03-10 08:00:00 running 2016-03-10 08:00:00 2016-03-10 09:30:00 train 2016-03-10 09:30:00 2016-03-10 17:00:00 computer 2016-03-10 17:00:00 2016-03-10 18:00:00 train 2016-03-10 18:00:00 2016-03-10 20:00:00 school 2016-03-10 20:00:00 2016-03-10 20:30:00 pizza 2016-03-10 20:30:00 2016-03-10 22:00:00 dancer 2016-03-10 22:00:00 2016-03-10 22:30:00 wine_glass 2016-03-10 22:30:00 2016-03-10 23:59:00 sleeping

When I prepared this visualization ggimage didn’t exist so I knew I’d have to use emojis from emojifont to represent my activities, therefore I directly entered the activities as emojis. My Wednesdays at that time were quite varied, I started the day with breakfast (how original), then some time at the gym, followed by a rather short workday due to my taking a Catalan class at the end of the afternoon. The evening was spent enjoying a quick and cheap pizza dinner with my husband before our salsa class and sometimes treating ourselves to a wine glass at the small and rather shabby-looking bar at the end of our street, whose name really was Small bar.

Making the animated clock

I didn’t have to think about how to draw and animated a clock with ggplot2 because somebody already had: I used code from this gist of Drew Conway’s.

# Generate digital clock face
first_nine <- c('00', '01', '02', '03', '04', '05', '06', '07', '08', '09')
hours <- c(first_nine, as.character(seq(10, 23, 1)))
mins <- c(first_nine, as.character(seq(10, 59, 1)))
time_chars_l <- lapply(hours, function(h) paste(h, ':', mins, sep = ''))
time_chars <- do.call(c, time_chars_l)

# Generate analog clock face
hour_pos <- seq(0, 12, 12/(12*60))[1:720]
min_pos <- seq(0, 12, 12/60)[1:60]
hour_pos <- rep(hour_pos, 2)
all_times <- dplyr::tbl_df(cbind(hour_pos, min_pos)) %>%
  dplyr::mutate(index = time_chars) %>%
  dplyr::mutate(time = lubridate::ymd_hms(paste0(date, " ", index, ":00")))

I then needed to join the all_times table, containing many snapshots of the time, with my time-use table made of intervals (start and end of each activity). Another package born in the meantime, fuzzyjoin, would have allowed me to write more efficient code, and I was so sad when looking at the old code that I re-wrote it.

all_times <- dplyr::mutate(all_times, start = time, end = time)
all_times <- fuzzyjoin::interval_left_join(all_times, activities)
knitr::kable(all_times[1:10,])

hour_pos   min_pos  index  time                 start.x              end.x                start.y     end.y                activity
0.0000000  0.0      00:00  2016-03-10 00:00:00  2016-03-10 00:00:00  2016-03-10 00:00:00  2016-03-10  2016-03-10 06:30:00  sleeping
0.0166667  0.2      00:01  2016-03-10 00:01:00  2016-03-10 00:01:00  2016-03-10 00:01:00  2016-03-10  2016-03-10 06:30:00  sleeping
0.0333333  0.4      00:02  2016-03-10 00:02:00  2016-03-10 00:02:00  2016-03-10 00:02:00  2016-03-10  2016-03-10 06:30:00  sleeping
0.0500000  0.6      00:03  2016-03-10 00:03:00  2016-03-10 00:03:00  2016-03-10 00:03:00  2016-03-10  2016-03-10 06:30:00  sleeping
0.0666667  0.8      00:04  2016-03-10 00:04:00  2016-03-10 00:04:00  2016-03-10 00:04:00  2016-03-10  2016-03-10 06:30:00  sleeping
0.0833333  1.0      00:05  2016-03-10 00:05:00  2016-03-10 00:05:00  2016-03-10 00:05:00  2016-03-10  2016-03-10 06:30:00  sleeping
0.1000000  1.2      00:06  2016-03-10 00:06:00  2016-03-10 00:06:00  2016-03-10 00:06:00  2016-03-10  2016-03-10 06:30:00  sleeping
0.1166667  1.4      00:07  2016-03-10 00:07:00  2016-03-10 00:07:00  2016-03-10 00:07:00  2016-03-10  2016-03-10 06:30:00  sleeping
0.1333333  1.6      00:08  2016-03-10 00:08:00  2016-03-10 00:08:00  2016-03-10 00:08:00  2016-03-10  2016-03-10 06:30:00  sleeping
0.1500000  1.8      00:09  2016-03-10 00:09:00  2016-03-10 00:09:00  2016-03-10 00:09:00  2016-03-10  2016-03-10 06:30:00  sleeping

Then I could continue using the code from the gist, this step generating stuff for the two clock hands.

all_times <- all_times %>%
  dplyr::select(- start.x, - end.x, - start.y, - end.y) %>%
  tidyr::gather(name, time.info, 1:2) %>%
  dplyr::arrange(index) %>%
  dplyr::mutate(hands = ifelse(name == "hour_pos", 0.5, 1))

I decided to only plot a subset of my exciting day.

all_times <- dplyr::filter(all_times,
                           time > lubridate::ymd_hms("2016-03-10 15:00:00"),
                           time < lubridate::ymd_hms("2016-03-10 23:00:00"))

Last but not least, I plotted the clock. Nowadays I’d probably stick to magick for animating the plot but gganimate is awesome too anyway.

library("ggplot2") emojifont::load.emojifont('OpenSansEmoji.ttf') clock <- ggplot(all_times, aes(xmin = time.info, xmax = time.info + 0.03, ymin = 0, ymax = hands, frame = index)) clock <- clock + geom_rect() clock <- clock + theme_bw() clock <- clock + coord_polar() clock <- clock + scale_x_continuous(limits = c(0,12), breaks = 0:11, labels = c(12, 1:11)) clock <- clock + scale_y_continuous(limits=c(0,1.1)) clock <- clock + coord_polar() clock <- clock + geom_text(aes(x = 0, y = 0, label = emojifont::emoji(activity)), family="OpenSansEmoji", size=20) clock <- clock + theme(legend.position="none", axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.title.y = element_blank(), axis.title.x = element_blank(), panel.background = element_blank(), panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), plot.background = element_blank()) animation::ani.options(interval = 0.025) gganimate::gganimate(clock, "clock.gif")

I ended up with a very useful visualization as you can see… at least, I have no trouble finding out when wine o’clock is!


To leave a comment for the author, please follow the link and comment on their blog: Maëlle. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Stan Weekly Roundup, 11 August 2017

Fri, 08/11/2017 - 21:00

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

This week, more Stan!

  • Charles Margossian is the rock star of the week, finishing off the algebraic solver math library fixture and getting it all plumbed through Stan and documented. Now you can solve nonlinear sets of equations and get derivatives via the implicit function theorem, all as part of defining your log density. There is a chapter in the revised manual.

  • The entire Stan Development Team, spearheaded by Ben Goodrich needing fixes for RStan, is about to roll out Stan 2.17 along with the interfaces. Look for that to begin trickling out next week. This will fix some further install and error message reporting issues as well as include the algebraic solver. We are also working on moving things toward Stan 3 behind the scenes. We won’t just keep incrementing 2.x forever!

  • Ben Goodrich fixed the inlining declarations in C++ to allow multiple Stan models to be linked or built in a single translation unit. This will be part of the 2.17 release.

  • Sean Talts is still working on build issues after Travis changed some config and compilers changed everywhere disrupting our continuous integration.

  • Sean is working with Michael Betancourt on the Cook-Gelman-Rubin diagnostic and they have gotten to the bottom of the quantization errors (using only 1000 posterior draws and splitting into too many bins).

  • Imad Ali is looking ahead to spatio-temporal models as he wraps up the spatial models in RStanArm.

  • Yuling Yao and Aki Vehtari finished the stacking paper (like model averaging), adding references, etc.

  • Yuling has also made some updates to the loo package that are coming soon.

  • Andrew Gelman wrote papers (with me and Michael) about R-hat and effective sample size and a paper on how priors can only be understood in the context of the likelihood.
  • Jonah Gabry, Ben and Imad have been working on the paper for priors based on $R^2$.

  • Andrew, Breck Baldwin and I are trying to put together some educational proposals for NSF to teach intro stats, and we’re writing more grants now that it’s grant season. All kinds of interesting things at NSF and NIH with spatial modeling.

  • Jonah Gabry continues the ggplot-ification of the new Gelman and Hill book.

  • Ben Goodrich has been working on multivariate effective sample size calculations.

  • Breck Baldwin has been working with Michael on courses before Stan (and elsewhere).

  • Jonah Gabry has been patching all the packagedown references for our doc and generally cleaning things up with cross-references and organization.

  • Mitzi Morris finished adding a data qualifier to function arguments to allow propagation of data-only-ness as required by our ODE functions and now the algebraic solver.

  • Dootika Vats was visiting Michael; she’s been working on multivariate effective sample sizes, which considers how bad things can be with linear combinations of parameters; scaling is quadratic, so there are practical issues. She has previously worked on efficient and robust effective sample size calculations which look promising for inclusion in Stan.

  • Bob Carpenter has been working on exposing properties of variables in a Stan program in the C++ model at runtime and statically.

The post Stan Weekly Roundup, 11 August 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.


To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
