R-bloggers: R news and tutorials contributed by 750 R bloggers

LondonR – March edition round up

Wed, 03/13/2019 - 11:45

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

My first LondonR took me back to my days at University as UCL hosted us for the evening.

Our first speaker of the night was Mike Smith from Pfizer. Mike had joined us to give a version of the talk he delivered at this year’s rstudio::conf – ‘lazy and easily distracted report writing in R’. While there was a strong focus on his (wonderful) tidyverse-themed t-shirt and his messy kitchen drawer, Mike had some home truths for us – we all get distracted very easily! This is why it’s so important to produce rmarkdown reports that help you to remember exactly what you were doing, not only for future you or for different people, but for presently distracted you!

He also emphasised how vital knowing your audience is, then showed us how easy it is to adapt an rmarkdown report for various audiences by parametrising it. I won’t go into any detail here, but it is definitely something worth looking into if your work (or play) involves producing rmarkdown docs for multiple audiences.
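As a rough illustration (my own sketch, not Mike’s code; the file name and parameter are made up), a parametrised rmarkdown report declares its parameters in the YAML header and can then be rendered once per audience without touching the document itself:

# report.Rmd: the YAML header declares the parameter and a default
---
title: "Monthly report"
params:
  audience: "internal"
---

# inside an R chunk, branch on the parameter
if (params$audience == "internal") {
  sessionInfo()  # extra technical detail shown only to the internal audience
}

# render a different version from the R console
rmarkdown::render("report.Rmd", params = list(audience = "executive"))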

After Mike’s talk, Laurens Geffert from Nielsen Marketing Cloud showed us how to build a supercomputer using the cloudyr project and AWS. Laurens definitely got the message across that R can be made into a very powerful tool very easily. Something we can all relate to is how Laurens’ code has progressed over the years: from base R, to purrr, then on to furrr! Dropping package names like it was going out of fashion, Laurens introduced us to a suite of packages for parallel computing on AWS – aws.ec2, future, remoter and the aforementioned furrr being the stars of the show. He ended his talk with a call to action: the cloudyr project is looking for people to help maintain its AWS packages (if this sounds like something that interests you, check out github.com/cloudyr).

Our last speaker was Mango’s very own Hannah Frick, providing the low-down on all the news from this year’s rstudio::conf. This year’s conference was held in Austin and featured titans of the R community such as Joe Cheng and Hadley Wickham. Hannah didn’t have time to cover all the brilliant talks in a half-hour presentation – so I certainly won’t try to do it here. What I can offer is a link to all of the materials from the conference here; all of the sessions were also recorded and are freely available here for your viewing pleasure.

The night ended with a shameless plug for our annual EARL London conference in September (abstract submissions close on the 31st of March!). If you’re looking for a reason to attend, or more likely convince your boss that you should attend, then look no further than this blog post.

All the information from this event and past LondonR events can be found at londonr.org. We hope to see you all again at the next one on the 15th of May – again at UCL. We’re always looking for speakers, so please get in touch if you’ve got anything to talk about!


An Interesting Subtlety of Statistics: The Hot Hand Fallacy Fallacy

Wed, 03/13/2019 - 10:00

(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

Last week I stumbled across a very interesting recent Econometrica article by Joshua Miller and Adam Sanjurjo. I was really surprised by the statistical result they discovered, and I guess the issue may even have fooled Nobel Prize-winning behavioral economists. Before showing the statistical subtlety, let me briefly explain the Hot Hand Fallacy.

Consider a basketball player who makes 30 throws and whose chance to hit is always 50%, independent of previous hits or misses. The following R code simulates a possible sequence of results (I searched a bit for a nice random seed for the purpose of this post. So this outcome may not be “representative”):

set.seed(62)
x = sample(0:1, 30, replace = TRUE)
x # 0=miss, 1=hit
##  [1] 1 0 1 1 0 0 1 0 1 1 1 1 1 0 0 1 1 1 0 0 0 0 1 0 1 1 0 1 0 1

The term Hot Hand Fallacy is used by psychologists and behavioral economists for the claim that people tend to systematically underestimate how often streaks of consecutive hits or misses can occur for such samples of i.i.d. sequences.

For example, consider the 5 subsequent hits from throws 9 to 13. In real life such streaks could be due to a hot hand, in the sense that the player had a larger hit probability during these throws than on average. Yet, the streak could also just be a random outcome given a constant hit probability. The Hot Hand Fallacy means that one considers such streaks as stronger statistical evidence against a constant hit probability than is statistically appropriate.

In their classic article from 1985, Gilovich, Vallone, and Tversky use data from real basketball throws. They compare the conditional probability of a hit given that either the previous 3 throws were hits or the previous 3 throws were misses.

Let us compute these probabilities for our vector x in several steps:

# Indexes of elements that come directly after
# a streak of k=3 subsequent hits
inds = find.after.run.inds(x, k=3, value=1)
inds
## [1] 12 13 14 19

The function find.after.run.inds is a custom function (see the end of this blog post for the code) that computes the indices of the elements of a vector x that come directly after a streak of k=3 consecutive elements with the specified value. Here we have the 12th throw, which comes after the 3 hits in throws 9, 10 and 11; the 13th throw, which comes after the 3 hits in throws 10, 11 and 12; and so on.

x[inds]
## [1] 1 1 0 0

Directly after all streaks of 3 hits, we find exactly 2 hits and 2 misses.

mean(x[inds])
## [1] 0.5

This means that in our sample, we have a hit probability of 50% in throws that are directly preceded by 3 hits.

We can also compute the conditional hit probability after a streak of three misses in our sample:

# Look at results after a streak of k=3 subsequent misses
inds = find.after.run.inds(x, k=3, value=0)
mean(x[inds])
## [1] 0.5

Again 50%, i.e. there is no difference between the hit probabilities directly after 3 hits or 3 misses.

Looking at several samples of n throws, Gilovich, Vallone, and Tversky also find no large differences in the conditional hit probabilities after streaks of 3 hits or 3 misses. Neither do they find relevant differences for alternative streak lengths. They thus argue that in their data there is no evidence for a hot hand. Believing in a hot hand in their data thus seems to be a fallacy. Sounds quite plausible to me.

Let us now slowly move towards the promised statistical subtlety by performing a systematic Monte Carlo study:

sim.fun = function(n, k, pi=0.5, value=1) {
  # Simulate n iid bernoulli draws
  x = sample(0:1, n, replace = TRUE, prob = c(1-pi, pi))
  # Find those indices of x that come directly
  # after a streak of k elements of the specified value
  inds = find.after.run.inds(x, k, value=value)
  # If no run of at least k subsequent numbers of value exists,
  # return NULL (we will dismiss this observation)
  if (length(inds)==0) return(NULL)
  # Return the share of 1s in x[inds]
  mean(x[inds])
}

# Draw 10000 samples of 30 throws and compute in each sample the
# conditional hit probability given 3 earlier hits
hitprob_after_3hits = unlist(replicate(10000,
  sim.fun(n=30, k=3, pi=0.5, value=1), simplify=FALSE))
head(hitprob_after_3hits)
## [1] 0.5000000 0.5000000 0.0000000 0.2500000 0.7142857 0.0000000

We have now simulated 10000 samples of 30 i.i.d. throws and computed, for each of the 10000 samples, the average probability of a hit in the throws directly after a streak of 3 hits.

Before showing you mean(hitprob_after_3hits), you can make a guess in the quiz embedded in the original post. Given that I already announced an interesting subtlety of statistics, you can of course meta-guess whether the subtlety already enters here, or whether at this point the obvious answer is still the correct one.


OK, let’s take a look at the result:

mean(hitprob_after_3hits)
## [1] 0.3822204

# Approximate 95% confidence interval
# (see function definition in Appendix)
ci(hitprob_after_3hits)
##     lower     upper
## 0.3792378 0.3852031

Wow! I find that result really, really surprising. I would have been pretty sure that, given our constant hit probability of 50%, we would also find across samples an average hit probability of around 50% after streaks of 3 hits.

Yet, in our 10000 samples of 30 throws we find on average a substantially lower hit probability of 38%, with a very tight confidence interval.

To get an intuition for why we estimate a conditional hit probability after 3 hits below 50%, consider samples of only n=5 throws. The following table shows all 6 possible such samples that have a throw after a streak of 3 hits.

Row   Throws   Share of hits after streak of 3 hits
1     11100    0%
2     11101    0%
3     11110    50%
4     11111    100%
5     01110    0%
6     01111    100%
Mean:          41.7%

Assume we have hits in the first 3 throws (rows 1-4). If throw 4 is then a miss (rows 1-2), throw 5 is irrelevant because it is not directly preceded by a streak of 3 hits. So in both rows the share of hits in throws directly after 3 hits is 0%.

If instead throw 4 is a hit (rows 3-4), then throw 5, which is equally likely to be a hit or a miss, is also directly preceded by 3 hits. This means the average share of hits in throws after 3 hits is only 75% in rows 3-4, while it was 0% in rows 1-2. In total, over all 6 rows this leads to a mean of only 41.7%.
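As a quick sanity check (my own addition, not from the original post), we can enumerate all 2^5 possible sequences of 5 throws and average the share over those sequences that actually contain a throw directly after 3 hits, reusing the find.after.run.inds helper defined in the appendix:

all_seqs = expand.grid(rep(list(0:1), 5)) # all 32 possible sequences of 5 throws
shares = apply(all_seqs, 1, function(x) {
  inds = find.after.run.inds(x, k = 3, value = 1)
  if (length(inds) == 0) return(NA) # no throw after 3 hits in this sequence
  mean(x[inds])
})
mean(shares, na.rm = TRUE)
## [1] 0.4166667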

Of course, the true probability of the player making a hit in a throw directly after 3 hits is still 50%, given our i.i.d. data generating process. Our procedure just systematically underestimates this probability. Miller and Sanjurjo call this effect a streak selection bias. It is actually a small-sample bias that vanishes as n goes to infinity. Yet the bias can be quite substantial for small n, as the simulations show.
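To see the bias shrink (again my own addition, reusing sim.fun and find.after.run.inds from above), you can re-run the simulation with longer sequences; the mean estimate moves towards 50% as n grows:

for (n in c(30, 100, 1000)) {
  est = unlist(replicate(2000, sim.fun(n = n, k = 3, pi = 0.5, value = 1),
                         simplify = FALSE))
  cat("n =", n, " mean estimated hit probability after 3 hits:",
      round(mean(est), 3), "\n")
}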

We get the mirror-image result if we use our procedure to estimate the mean hit probability in throws that come directly after 3 misses.

hitprob_after_3misses = unlist(replicate(10000,
  sim.fun(n=30, k=3, pi=0.5, value=0), simplify=FALSE))
mean(hitprob_after_3misses)
## [1] 0.6200019
ci(hitprob_after_3misses)
##     lower     upper
## 0.6170310 0.6229728

We now have an upward bias and estimate that in throws after 3 misses, we find on average a 62% hit probability instead of only 50%.

What if, for some real-life sample, this procedure estimated that the conditional probabilities of a hit after 3 hits and after 3 misses are both roughly 50%? Our simulation studies have shown that if there were indeed a fixed hit probability of 50%, we should rather expect an estimated conditional hit probability of 38% after 3 hits and of 62% after 3 misses. This means that 50% vs 50%, instead of 38% vs 62%, is actually statistical evidence for a hot hand!

Indeed, Miller and Sanjurjo re-estimate the seminal articles on the hot hand effect using an unbiased estimator of the conditional hit probabilities. While the original studies did not find a hot hand effect and thus concluded that there is a Hot Hand Fallacy, Miller and Sanjurjo find substantial hot hand effects. This means that, at least in those studies, there was a “Hot Hand Fallacy” Fallacy.

Of course, showing that some data sets contain a previously unrecognized hot hand effect does not mean that people never fall for the Hot Hand Fallacy. Also, for the case of basketball, it had already been shown before, with different data sets and more control variables, that there is a hot hand effect. Still, it is kind of a cool story: scientists told statistical laymen that they were interpreting a data set wrongly, and more than 30 years later it turns out that, with the correct statistical methods, the laymen were actually right.

You can replicate the more extensive simulations by Miller and Sanjurjo by downloading their supplementary material.

If you want to conveniently search for other interesting economic articles with supplemented code and data for replication, you can also take a look at my Shiny app made for this purpose:

http://econ.mathematik.uni-ulm.de:3200/ejd/

Appendix: Custom R functions used above

# Simple function to compute an approximate 95%
# confidence interval for a sample mean
ci = function(x) {
  n = length(x)
  m = mean(x)
  sd = sd(x)
  c(lower = m - sd/sqrt(n), upper = m + sd/sqrt(n))
}

find.after.run.inds = function(x, k, value=1) {
  runs = find.runs(x)
  # Keep only runs of the specified value
  # that have at least length k
  runs = runs[runs$len >= k & runs$val == value, , drop=FALSE]
  if (NROW(runs) == 0) return(NULL)
  # Index directly after runs of length k
  inds = runs$start + k
  # Runs of length m>k contain m-k+1 runs of length k.
  # Add also all indices of these subruns.
  # The following code is vectorized over rows in runs
  max.len = max(runs$len)
  len = k + 1
  while (len <= max.len) {
    runs = runs[runs$len >= len, , drop=FALSE]
    inds = c(inds, runs$start + len)
    len = len + 1
  }
  # Ignore indices above n and sort for convenience
  inds = sort(inds[inds <= length(x)])
  inds
}

find.runs = function(x) {
  rle_x = rle(x)
  # Compute endpoints of each run
  len = rle_x$lengths
  end = cumsum(len)
  start = c(1, end[-length(end)] + 1)
  data.frame(val = rle_x$values, len = len, start = start, end = end)
}


RStudio Package Manager 1.0.6 – README

Wed, 03/13/2019 - 01:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

The 1.0.6 release of RStudio Package Manager helps R users understand packages.
The primary feature in this release is embedded package READMEs, detailed below.
If you’re new to Package Manager, it is an on-premise product built to give teams and organizations reliable and consistent package management. Download an evaluation today.

View package READMEs in Package Manager

Package READMEs

Many R packages have rich README files that can include:

  • An introduction to the package
  • Examples for key functions
  • Badges to indicate download counts, build status, code coverage, and other metrics
  • Other helpful information, like the package’s hex sticker!

This information can help a new user when they are first introduced to a package, or help an experienced user or admin gauge package quality. Package READMEs distill and supplement the rich information available in vignettes, Description files, and help files.

Starting in version 1.0.6, READMEs are automatically shown alongside the traditional package metadata. Package Manager automatically shows a README for the roughly 12,000 CRAN packages that have one. READMEs are also displayed for internal packages sourced from Git or local files. These READMEs provide an easy way for package authors to document their code for colleagues, publicize new releases and features, and disseminate knowledge to team members.

Deprecations, Breaking Changes, and Security Updates
  • Version 1.0.6 includes a number of updates to Package Manager’s built-in CRAN source. Customers using an internet-connected server do not need to take any action; updates will be applied during the next CRAN sync. Offline, air-gapped customers should follow these instructions to re-fetch the CRAN data immediately after upgrading, and then run the rspm sync command.

Please consult the full release notes.

Upgrade Planning

Please note the breaking changes and deprecations above. Upgrading to 1.0.6 from 1.0.4 will take less than five minutes. There will be a five-to-ten minute delay in the next CRAN sync following the upgrade. If you are upgrading from an earlier version, be sure to consult the release notes for the intermediate releases as well.

Don’t see that perfect feature? Wondering why you should be worried about package management? Email us; our product team is happy to help!


World population growth through time

Wed, 03/13/2019 - 01:00

(This article was first published on Rstats on Jakub Nowosad's website, and kindly contributed to R-bloggers)

A few months ago I made an attempt to visualize the world population changes from 1800 to 2100:

Inspired by @MaxCRoser and @jkottke, I've tried to visualize the world population changes from 1800 to 2100. My new blog post at https://t.co/XpBpkZLO9s describes how this animation was made using #rstats and #OpenData. pic.twitter.com/WI3gj0xUwU

— Jakub Nowosad (@jakub_nowosad) October 9, 2018

This type of visualization is good for showing the ever-changing distribution of the population on a global scale.
It shows that China and India dominated the world population, but also that a large share of the world population lived in Europe in 1800.
At the same time, Australia, the Americas, and Africa were not densely populated.
By 1900, there is noticeable growth in the relative population of North America (especially the USA).
From 1950 onwards, a relative decrease in the population of Europe and a tremendous increase in the population of Africa are visible.
Based on this prediction, it is also possible to see that the relative populations of China and India will decrease in the last decades of this century.

However, while we can see changes in proportion, we cannot see one important dimension – population growth on a global scale.
Recently, Tomasz Stepinski, Pawel Netzel and I published a paper describing landscape change on a global scale between 1992 and 2015 [1], showing how humans have impacted our planet in just 24 years.
It is clear that the population and its growth have an enormous impact on our planet, as they result in more demand for land and natural resources.
Therefore, I’ve updated my code to add this dimension…

Scaled cartogram of the world population changes between 1800 and 2100.

Now, we can see not only the relative distribution of the population for each year but also population growth on a global scale.

Similarly to the previous post, the rest of this blog post will focus on explaining the steps and code used to create the above animation.
It involves downloading the data from two sources, then cleaning, merging, and preprocessing it.
The prepared data can be transformed into cartograms to indicate the spatial relations between countries in any given year, and scaled to indicate the global world population in any given year compared to the predicted population for 2100.

Starting

Let’s start with the packages.
If you are new to R, you may want to read this first, which points to various resources for setting up R for geographic data.
When you have a recent R version and the appropriate packages installed (e.g. by executing devtools::install_github("geocompr/geocompkg")), the packages can be attached as follows:

library(sf)            # spatial data classes
library(rnaturalearth) # world map data
library(readxl)        # reading excel files
library(dplyr)         # data manipulation
library(stringr)       # data manipulation
library(tidyr)         # data manipulation
library(purrr)         # data manipulation
library(cartogram)     # cartogram creation
library(gganimate)     # animation creation

Getting data

To create the maps of the world population for each year we will need two datasets – one containing spatial data of the world’s countries and one non-spatial with information about the annual population in the world’s countries.
The first one can be easily downloaded from the Natural Earth website, for example using the rnaturalearth package:

world_map = ne_countries(returnclass = "sf")

The second one is available from the Gapminder foundation.
Gapminder provides a dataset with population data for all countries and world regions from 1800 to 2100.
We can download and read the dataset using the code below:

if (!dir.exists("data")) dir.create("data")
download.file("http://gapm.io/dl_pop", destfile = "data/pop1800_2100.xlsx")
world_pop = read_xlsx("data/pop1800_2100.xlsx", sheet = 7)

Cleaning

As always when working with multiple datasets – some data cleaning is necessary.
Our world_map dataset has many columns that are irrelevant for cartogram creation, and we do not need the spatial data of Antarctica.
We can also transform our data into a more appropriate projection [2].

world_map = world_map %>%
  filter(str_detect(type, "country|Country")) %>%
  filter(sovereignt != "Antarctica") %>%
  select(sovereignt) %>%
  st_transform(world_map, crs = "+proj=moll")

We need to have a common identifier to combine our spatial and non-spatial datasets, for example, names of the countries.
However, there are inconsistencies between some of the names.
We can fix it manually:

world_pop = world_pop %>%
  mutate(sovereignt = name) %>%
  mutate(sovereignt = replace(sovereignt, sovereignt == "Tanzania", "United Republic of Tanzania")) %>%
  mutate(sovereignt = replace(sovereignt, sovereignt == "United States", "United States of America")) %>%
  mutate(sovereignt = replace(sovereignt, sovereignt == "Congo, Dem. Rep.", "Democratic Republic of the Congo")) %>%
  mutate(sovereignt = replace(sovereignt, sovereignt == "Bahamas", "The Bahamas")) %>%
  mutate(sovereignt = replace(sovereignt, sovereignt == "Serbia", "Republic of Serbia")) %>%
  mutate(sovereignt = replace(sovereignt, sovereignt == "Macedonia, FYR", "Macedonia")) %>%
  mutate(sovereignt = replace(sovereignt, sovereignt == "Slovak Republic", "Slovakia")) %>%
  mutate(sovereignt = replace(sovereignt, sovereignt == "Congo, Rep.", "Republic of Congo")) %>%
  mutate(sovereignt = replace(sovereignt, sovereignt == "Kyrgyz Republic", "Kyrgyzstan")) %>%
  mutate(sovereignt = replace(sovereignt, sovereignt == "Lao", "Laos")) %>%
  mutate(sovereignt = replace(sovereignt, sovereignt == "Cote d'Ivoire", "Ivory Coast")) %>%
  mutate(sovereignt = replace(sovereignt, sovereignt == "Timor-Leste", "East Timor")) %>%
  mutate(sovereignt = replace(sovereignt, sovereignt == "Guinea-Bissau", "Guinea Bissau"))

Preparing

Now we can join our two datasets, remove missing values, and drop unimportant columns.
We should also transform the data into a long format [3], which is accepted by plotting packages (such as tmap or ggplot2).

world_data = left_join(world_map, world_pop, by = "sovereignt") %>%
  na.omit() %>%
  select(-geo, -name, -indicator) %>%
  gather(key = "year", value = "population", `1800.0`:`2100.0`, convert = TRUE)

Additionally, we can calculate the total global population in each year:

world_data = world_data %>%
  group_by(year) %>%
  mutate(total_pop = sum(as.numeric(population), na.rm = TRUE)) %>%
  mutate(title = paste0("Year: ", year, "\nTotal population (billions): ",
                        round(total_pop / 1e9, 2))) %>%
  ungroup()

Subsetting

Now our data contains information about the world population for each year between 1800 and 2100.
It is possible to use it to create cartograms; however, to reduce calculation time and simplify the results, we will only use data for every 25 years:

world_data = world_data %>%
  filter(year %in% seq(1800, 2100, by = 25))

Transforming

With the newly created data, we are able to create our cartograms.
We need to split the data into independent annual datasets, create cartograms based on the population variable, and combine all of the cartograms back into one object.

world_data_carto = world_data %>%
  split(.$year) %>%
  map(cartogram_cont, "population", maxSizeError = 2, threshold = 0.1) %>%
  do.call(rbind, .)

Scaling

The predicted global population is largest for the year 2100, at more than 11 billion human beings.
We can resize the map for each year using this value as our reference point, so that the rest of the maps are proportionally smaller than the one for 2100.
To scale the areas of the other maps, we divide the total population in each year by the maximum value and compute the square root of the result (the square root because scaling coordinates by a factor s scales areas by s²).
The last step of the calculations is to resize the geometries (coordinates) by multiplying them by the scale values [4]:

world_data_scaled = world_data_carto %>%
  mutate(scales = sqrt(total_pop / max(total_pop))) %>%
  group_by(year) %>%
  mutate(geometry = geometry * scales)

For more on the topic, we recommend checking out the Affine transformations section of Geocomputation with R.

Visualizing

Now that we have all of the pieces needed, the animated map can be created using either tmap [5] or gganimate.
It consists of three steps:

  1. Creating maps for each year using a combination of the ggplot(), geom_sf(), and transition_states() functions.
  2. Rendering an animation using animate().
    It allows for changing the size of the resulting image and the speed of the animation – you can control the speed with any two of nframes, fps, and duration.
  3. Saving an animation with anim_save().
worlds_anim = ggplot() +
  geom_sf(data = world_data_scaled, aes(fill = population / 1000000), lwd = 0.1) +
  coord_sf(datum = NA) +
  scale_fill_viridis_c(name = "Population (mln)") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 22),
    plot.background = element_rect(fill = "#f5f5f4", color = NA),
    panel.background = element_rect(fill = "#f5f5f4", color = NA),
    legend.background = element_rect(fill = "#f5f5f4", color = NA),
    legend.position = c(0.1, 0.2),
    legend.title = element_text(size = 18)
  ) +
  labs(title = "{closest_state}") +
  transition_states(title, transition_length = 2, state_length = 6) +
  ease_aes("cubic-in-out")

worlds_animate = animate(worlds_anim, width = 1200, height = 550, duration = 30, fps = 4)

anim_save("worlds_animate.gif", animation = worlds_animate)
  1. You can learn more about it at https://doi.org/10.1016/j.jag.2018.09.013 or by reading its preprint at https://eartharxiv.org/k3rmn/.

  2. You can read more about projections in the Reprojecting geographic data chapter of Geocomputation with R.

  3. This process is called gathering.

  4. Mutating of geometry columns is possible with the recent improvements in the sf package.

  5. Visit the animated maps chapter of Geocomputation with R or my previous blog post to learn how to make animations with tmap.


“X affects Y”. What does that even mean?

Wed, 03/13/2019 - 01:00

(This article was first published on R on Just be-cause, and kindly contributed to R-bloggers)

In my last post I gave an intuitive demonstration of what causal inference is and how it differs from classic ML.
After receiving some feedback, I realized that while the post was easy to digest, some confusion remains.


R 3.5.3 now available

Wed, 03/13/2019 - 00:45

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The R Core Team announced yesterday the release of R 3.5.3, and updated binaries for Windows and Linux are now available (with Mac sure to follow soon). This update fixes three minor bugs (in the functions writeLines, setClassUnion, and stopifnot), but you might want to upgrade just to avoid the "package built under R 3.5.3" warnings you might get for new CRAN packages in the future.
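As a quick aside (my addition, not from the original post), you can check which R version you are currently running from the console:

R.version.string  # e.g. "R version 3.5.2 ..." if you have not upgraded yet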

R release code names typically reference the Peanuts cartoon, but the code name for this release, "Great Truth", is somewhat of a mystery. There may be a clue in the release date, March 11. Anyone got any ideas?

For more details on this latest update to the R language, check out the announcement below. And as always, thanks to the members of the R Core Team for their contributions to all R users with the R project.

R-announce mailing list: R 3.5.3 is released

 


Installing Socviz

Tue, 03/12/2019 - 19:14

(This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers)

I’ve gotten a couple of reports from people having trouble installing the socviz library that’s meant to be used with Data Visualization: A Practical Introduction. As best as I can tell, the difficulties are being caused by GitHub’s rate limits. The symptom is that, after installing the tidyverse and devtools libraries, you try install_github("kjhealy/socviz") and get an error something like this:

Error in utils::download.file(url, path, method = download_method(), quiet = quiet(): cannot open URL https://api.github.com/repos/kjhealy/socviz/tarball/master

If this is the problem you have, this post explains what’s happening and provides a solution. (In fact, several solutions.)

Explanation

When you download and install a package from GitHub via R or RStudio, you use their “API”, or application programming interface. This is just a term for how a website allows applications to interact with it. APIs standardize various requests that applications can make, for example, or provide a set of services in a form that can be easily integrated into an application’s code. This is needed because apps are not like people clicking buttons on a web page, for example. One of the main things APIs do is impose rules about how much and how often applications can interact with them. Again, this is needed because apps can slurp up information much faster than people clicking links. This is called “rate limiting”. By default, if you do not have a (free) GitHub account, your rate limit will be quite low and you will be temporarily cut off.

What you can do

You have three options.

Option 1 Wait until tomorrow and try again. The rate limit will reset, and it should work again. But you will likely keep having this sort of problem if you use GitHub regularly. You may also not want to wait.

Option 2 Download and install the package manually, from my website rather than GitHub. Click on this link: https://kieranhealy.org/files/misc/socviz_0.8.9000.tar.gz

This will download a .tar.gz file to your computer. Open R Studio and choose Tools > Install Packages … In the dialog box that comes up, select “Package Archive File” like this:

Package selection dialog

Then navigate to the file, choose it, and select “Install”. The socviz package should now be available via library(socviz). The downside to this solution is that you won’t be able to get updates to the package easily later on, and you’ll still run into rate limits with other packages.
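If you prefer the console to the dialog, the same manual installation can also be done with install.packages() (a sketch; adjust the path to wherever you saved the file):

# install the downloaded tarball directly from the R console
install.packages("socviz_0.8.9000.tar.gz", repos = NULL, type = "source")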

Option 3 The third option is a tiny bit more involved, but is the best one. Create a user account on GitHub, and then obtain a “Personal Access Token” (PAT) from them. This is a magic token that substantially boosts your rate limit for transfers to and from GitHub and will make your problem go away for this and any other package installations you have. Once you’ve opened an account, there are detailed instructions here about how to obtain and activate your PAT token in R Studio: https://happygitwithr.com/github-pat.html
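Once you have created the token, one common way to make it available to R is to store it in your ~/.Renviron file (the happygitwithr chapter linked above walks through the details); a rough sketch:

# usethis::edit_r_environ() opens ~/.Renviron; add a line like
#   GITHUB_PAT=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# then restart R and check that the token is picked up:
Sys.getenv("GITHUB_PAT")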

You will only have to do this once, and the token will work for any and all package installations you do via GitHub.


MilanoR Meetup: From Data to Insights

Tue, 03/12/2019 - 17:21

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Hi everyone!

Springtime is coming and so is a new R Meetup, do you feel it in the air?

A MilanoR meetup is the ideal occasion to bring together R users in the Milano area to share knowledge and experiences, a free event, open to everybody.

What we’ll talk about

What’s the real value of a data product? Is it meant to just return an output, or can it give us insights, suggestions, a vision? How can we make better data products?

At this meeting we will sail away on a journey: the path that leads every data professional from raw data to meaningful and beautiful data products!

Program

– Welcome presentations

“From numbers to stories” – Pietro Spagnolo

Data can help us write many different kinds of stories: deeper stories to discover something new, and lighter stories to show us something we’d never thought of. As professionals, we need to guide users, easing any complexity they might encounter.

“Share your R development: the example of SmaRP, Smart Retirement Planning” – Francesca Vitalini

Smart Retirement Planning (SmaRP) is an initiative of Mirai Solutions designed to guide people working in Switzerland towards a strategic decision-making process for their retirement. Using SmaRP as an example, we will highlight how to structure a web app in R, how to extend the R Shiny framework for a nicer graphic interface, and how to generate a report through the app.

Some cool networking time

More about our speakers

Pietro is the former Chief Creative Officer at iGenius, the AI company. Previously he worked in Turin and Boston as an interaction designer at Carlo Ratti Associati, as a graphic designer at Urban Center Bologna, and as a teaching assistant in interaction design at IUAV in San Marino. As a designer, he has managed many interdisciplinary projects.

Francesca is a Solutions Consultant for Mirai Solutions, a small data science and data analytics consulting company based in Zürich and specialized in the financial sector. In addition to her role as a consultant, Francesca teaches R and is involved in outreach activities. In her free time, Francesca participates in and organizes events that support women in data science.

What’s the price?

The event is free and open to everybody. For logistical reasons, the meeting is limited to a maximum of 70 participants, and registration is required.

Where do I sign up for the event?

You can sign up on our meetup event page; we can’t wait to meet you!

Where is it?

Mikamai – Via Venini 42, Milano. Mikamai is a coworking space very close to the Pasteur metro station.

The doorbell is labelled Mikamai: when you enter, go straight and cross the inner courtyard of the building; you will face a metal door with a sign that says Mikamai. The office is on the top (and only) floor.

Our sponsor

Quantide (http://www.quantide.com/) is a provider of consulting services and training courses in Data Science and Big Data. It is specialized in R, the open source software for statistical computing. Headquartered in Legnano, near Milan (Italy), Quantide has been supporting customers from several industries around the world for 10 years. Quantide is the founder and supporter of the MilanoR community, since its first edition in 2012.

 


Uber/Lyft Maximization: More Money for The Time

Tue, 03/12/2019 - 16:26

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Motivation

Uber and Lyft, the main ridesharing companies, can make money at a faster rate by filling their cars with passengers at peak times while they are on the road. Typical Uber/Lyft drivers have full-time jobs, are full-time students, or are between jobs. Driving for Uber/Lyft to make some extra cash on the side is ideal; however, you would want to optimize your time. With this in mind, and using the data that is freely available to the public at sfcta.org, I created a Shiny app for rideshare drivers that suggests where they will most likely find passengers, so that they can go from pick-up to drop-off more quickly. The Shiny app is based on the area zones in the Bay Area, the time, and the day of the week.

The Data

The San Francisco Transportation Authority released data on Uber and Lyft usage within the city. The two rideshare companies combined logged more than 150,000 daily rides on typical Fridays in the fall of 2016, counting only pick-ups and drop-offs within the city limits. From the TNCs’ official report and data, I used the pick-ups and drop-offs from the fall of 2016; the data provided GPS coordinates and came in the form of shape files. I also used data from the Transportation Authority for the block-level summary “taz” zones (traffic analysis zones), which divide the city into different area zones.

The App: Visualization and Suggestion

Once I had cleaned and merged the data, I transformed the given shape files of GPS coordinates into a heatmap using Leaflet maps and longitude/latitude points. I added filters so the user can manually select the day of the week and the time for pick-ups or drop-offs. To give drivers a better idea of the times ahead, I added a histogram in the sidebar that shows the number of pick-ups and drop-offs over time based on the user’s filter criteria.
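To make the app structure concrete, here is a minimal sketch of how such an app could be wired together. This is my own illustration, not the author’s actual code: the taz_counts object (an sf data frame with one row per taz polygon and per day/hour/direction combination, holding a trip count n) is a made-up placeholder for the prepared data.

library(shiny)
library(leaflet)
library(dplyr)
library(sf)

ui = fluidPage(
  sidebarLayout(
    sidebarPanel(
      selectInput("day", "Day of week", choices = c("Monday", "Friday", "Saturday")),
      radioButtons("direction", "Direction", choices = c("Pick-ups", "Drop-offs")),
      sliderInput("hour", "Hour of day", min = 0, max = 23, value = 18,
                  animate = animationOptions(interval = 1000)),
      plotOutput("hourly_hist", height = 200)
    ),
    mainPanel(leafletOutput("taz_map", height = 600))
  )
)

server = function(input, output, session) {
  # Keep only the taz polygons matching the selected day, direction and hour
  filtered = reactive({
    filter(taz_counts, day == input$day, direction == input$direction, hour == input$hour)
  })
  # Choropleth of trip counts per taz zone
  output$taz_map = renderLeaflet({
    pal = colorNumeric("YlOrRd", domain = filtered()$n)
    leaflet(filtered()) %>%
      addTiles() %>%
      addPolygons(fillColor = ~pal(n), fillOpacity = 0.7, weight = 0.5, color = "grey") %>%
      addLegend(pal = pal, values = ~n, position = "bottomright")
  })
  # Histogram of trips by hour for the selected day and direction
  output$hourly_hist = renderPlot({
    d = filter(taz_counts, day == input$day, direction == input$direction)
    hist(rep(d$hour, d$n), breaks = 0:24, main = "Trips by hour", xlab = "Hour of day")
  })
}

# shinyApp(ui, server)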

This Shiny app offers both basic exploratory data analysis and a tool to visualize Uber/Lyft pick-ups and drop-offs. Open the app in the hyperlink above and test out the scenarios below.

R Shiny app for pick-ups and drop-offs according to day of week and direction

Above, you have the options for the day of the week and for pick-ups or drop-offs, with the leaflet heat map to the left. The “taz” zones in the data set are the colored areas that you see in the heatmap. Once your criteria are selected, you will see the different taz areas change color, showing how many pick-ups or drop-offs are being made in each specific area. This gives drivers a better understanding of which area of the city would be best to be in for picking up passengers on a given day at a given time.

Uber/Lyft pick-ups and drop-offs according to time of day on given day of week

You also have the time-of-day slider to the right, with a play button that shows how the number of pick-ups and drop-offs changes throughout the selected day in the leaflet heat map. The bottom right of the heat map has a legend designating each color’s value. At the bottom of the time slider, you may select the play button, which shows the driver how the heatmap changes through the selected day, so the driver knows which area to be in later in the day if need be. The histogram above the time slider also shows the user the peak times for that day.

Monday evening pick-ups

This is an example of typical Monday evening pick-ups. There is not too much activity; however, you can see the peak times in the morning commute hours and, more so, in the evening commute hours.

Saturday night pick-ups

Now, this is a Saturday evening, and as you can see there is a lot more activity compared to the earlier example. By looking at the legend, you can see the values have changed to a higher range. The histogram also shows the steady increase in pick-ups throughout the day and evening, as opposed to Monday.

Suppose a full-time college student is working on the side as a driver for Uber/Lyft for some extra cash. Due to the intensity of their studies and their limited free time, they want to optimize their time on the road. With limited time, or only being able to work a couple of days a week, they would want to be on the road at the best possible times and days of the week to earn as much as they can. Makes sense, right? Using this app, that type of user can work out the best day, time, and area to be available in order to optimize their earnings. As you can see below, a user who decides to work on a Saturday night can select that day, choose pick-up, then use the slider bar to watch which areas on the map appear the most red. The “college student” can then plan their work time accordingly.

Selecting day of week and pick-up/drop-off. Select time of day. See areas with the most pick-ups according to the legend.

Future Work

This investigation didn’t take into account the waiting time for particular pick-ups. With drivers able to identify the best times and locations for optimizing their own time, passengers could end up having to wait an unreasonable amount of time. Therefore, there is a lot more work to be done to investigate other transportation choices, such as community bikes or scooters. San Francisco only had data available for the fall season, whereas having data year-round would give users a better understanding of how different seasons affect passenger demand. Considering events occurring in the city that drive rideshare demand would also be ideal. Beyond that, I believe it would be good to implement predictive analytics and suggestions for drivers, such as route and time optimization using machine learning.

The Code

All of the code and data used for this Shiny app can be found on my GitHub page here.


The Persistence of the Old Regime, Again

Tue, 03/12/2019 - 16:02

(This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers)

A few years ago I wrote a post about the stickiness of college and university rankings in the United States. It’s been doing the rounds again, so I thought I’d revisit it and redraw a few of the graphs I made then.

In 1911, Kendric Babcock made an effort to rank US Universities and Colleges. In his report, Babcock divided schools into four Classes, beginning with Class I:

The better sort of school.

And descending all the way to Class IV:

One hardly dares look at the transcripts.

Babcock’s discussion of his methods is admirably brief (the snippet above hints at the one sampling problem that possibly troubled him), so I recommend you read the report yourself.

University reputations are extremely sticky, the conventional wisdom goes. I was interested to see whether Babcock’s report bore that out. I grabbed the US News and World Report National University Rankings for 2014 and made a quick pass through them, coding their 1911 Babcock Class. The question is whether Mr Babcock would be satisfied with how his rankings had held up, were he to return to us from the grave—more than a century of massive educational expansion and alleged disruption notwithstanding.

It turns out that he would be quite pleased with himself.

Here is a plot of the 2014 USNWR National University Rankings, color-coded by Babcock Class. In 2014, USNWR’s highest-ranked school was Princeton, and so it is at the top left of the dotplot. You read down the ranking from there and across the columns.

University rankings and Babcock classifications.

You can get a larger image or a PDF version of the figure if you want a closer look at it.

As you can see, for private universities, especially, the 1911 Babcock Classification tracks prestige in 2014 very well indeed. The top fifteen or so USNWR Universities that were around in 1911 were regarded as Class 1 by Babcock. Class 2 Privates and a few Class 1 stragglers make up the next chunk of the list. The only serious outliers are the Stevens Institute of Technology and the Catholic University of America.

The situation for public universities is also interesting. The Babcock Class 1 Public Schools have not done as well as their private peers. Berkeley (or “The University of California” as was) is the highest-ranked Class I public in 2014, with UVa and Michigan close behind. Babcock sniffily rated UNC a Class II school. I have no comment about that, other than to say he was obviously right. Other great state flagships like Madison, Urbana, Washington, Ohio State, Austin, Minnesota, Purdue, Indiana, Kansas, and Iowa are much lower-ranked today than their Class I designation by Babcock in 1911 would have led you to believe. Conversely, one or two Class 4 publics—notably Georgia Tech—are much higher ranked today than Babcock would have guessed. So rankings are sticky, but only as long as you’re not public.

There are some caveats. First, because I was more or less coding this stuff while eating my lunch, I did not attempt to connect schools which Babcock did rate with their current institutional descendants. So, for example, some technical, liberal arts, or agricultural schools that he classified grew into or were absorbed by major state universities in the 20th century. These are not on the charts above. We are only looking at schools that existed under their current name (more or less—there are one or two exceptions) in 1911 and now.

Second, higher education in the U.S. really has changed a lot since 1911. In particular the postwar expansion of public education introduced many new and excellent public universities, and over the course of the twentieth century even some decent private ones emerged and came to prominence (such as my own, which competes with a nearby Class II school). This biases things in favor of the seeming stability of the rankings, because in the his own data Babcock had the luxury of not having to classify schools that did not yet exist.

We can add schools founded after 1911 (or not ranked by Babcock) back into the chart. Our expectation would be that most of them would not be highly-ranked at present, especially if they are private. And indeed this is what we find.

Babcock’s 1911 Rankings of Public and Private Universities and US News and World Report Rankings for 2014.

You can get a larger image or a PDF version of the figure if you want a closer look at it.

Now the coding includes a category for universities that appear in the USNWR rankings but which are not in Babcock, either because they did not exist at all in 1911, or had not yet taken their present names. The new additions still leave Babcock’s classification looking pretty good. On the private side, Duke, Caltech, and Rice are added to the upper end of the list, but they are the only new entrants that are highly ranked. A number of new private schools appear further down. Meanwhile, on the public side, you can see the appearance of the 20th century schools, most notably the whole California system. The University of California System is an astonishing achievement, when you look at it. It managed to propel five of its campuses into the upper third of the table, where they joined its flagship, Berkeley. But the status ordering that was—take your pick; these data can’t settle the question—observed, intuited, or invented by Babcock a century ago remains remarkably resilient. The old regime persists.


The teachR’s::cheat sheet

Tue, 03/12/2019 - 13:00

(This article was first published on R on Adi Sarid's personal blog, and kindly contributed to R-bloggers)

A few months ago I attended the 2019 rstudio::conf, including the shiny train-the-trainer workshop. It was a two-day workshop and it inspired me in many ways. The first day of the workshop focused on the very basics of teaching (R or anything else), and for me it put the spotlight on things I had never considered before.

One of the important takeaways from the workshop was how to approach educating others: preparing for a course, things you can do during the lessons, and how to self-learn and improve my own teaching methods afterwards.

This led me to create the teachR’s cheat sheet. It outlines the basics of teaching, and I chose to give it an R flavour (in the examples and illustrations within the cheat sheet).

I have contributed it to RStudio’s cheat sheet repo, so you can download it directly from: https://github.com/rstudio/cheatsheets/raw/master/teachR.pdf.

In the cheat sheet you will find three segments:

  1. Preparing a new course / workshop / lesson.
  2. Things you can do during the lesson itself.
  3. Things you should do when the course is completed in order to improve your own teaching methods.

I previously blogged about some of the things learned at the train-the-trainer, and not everything made it to the cheat sheet, so if you’re interested you can read more here.

Here’s an example for some of the things you can find in the cheat sheet.

Designing a new course

The cheat sheet covers the various steps of designing a course, i.e.:

  1. Persona analysis of your learners.
  2. Defining the course’s goals using Bloom’s taxonomy.
  3. Using conceptual maps to grasp what the course should look like and what related terms/materials should appear.
  4. Writing the final exam, the slides, check-ins and faded examples.

Here are some examples relating to 1-2:

Persona analysis

Take a while to understand and characterize your learners: are they novices? Advanced? False experts?

What are the learners’ goals for the course, what prior knowledge can you assume (and what not), and do they have any special needs?

If you end up with too many personas, anticipate trouble – it’s hard to accommodate a diverse crowd, so what are you going to miss out on?

Define goals using Bloom’s taxonomy

Bloom’s taxonomy illustrates the levels of learning new concepts or topics.

The Vanderbilt University Center for Teaching has a nice illustration for it.

Bloom’s Taxonomy

You can visit the Vanderbilt website for a more thorough explanation about the taxonomy, but suffice it to say that “remember” is the most basic form of acquired knowledge, and the highest levels (at the top of the pyramid) are evaluate and create (being able to evaluate someone else’s work, or create your own noval work).

If we translate that to R, “remember” might translate to: “learners will be able to state the main packages in the tidyverse and their purpose”, versus “create”, which in that context would translate to: “learners will be able to contribute to a tidyverse package or create their own tidy package.” You can see that the first is something you can teach an R beginner, but the latter is much more complex and can be mastered by an advanced useR.

Working with Bloom’s taxonomy can help you set your goals for the course and also help you set the expectations with the learners of your course.

During the course

Some tips I learned at the train-the-trainer workshop, for use during the lesson itself.

Sticky notes

At the start of the lesson, give each learner three sticky notes (green, red, and blue).
The learners put them on their computers according to their progress:

  • Green = I’m doing fine / finished the exercise.
  • Red = Something is wrong, I need help.
  • Blue = I need a break.

If you see a lot of greens – try to up the pace. If you see a lot of reds, maybe take it easier.

Check-ins

Try to set a few check-ins every hour, to evaluate the progress and make sure that the learners are “with you”. You can even use some kind of online surveying tool to turn this into a “game”.

After the course

Make sure you debrief properly, and learn from your experience. Use surveys to collect feedback. Also measure the time each chapter really takes you, so you can better estimate the time for each type of lesson.

Conclusion

Teaching can be challenging, but it is also rewarding and fun.

It is important to come well prepared, and this cheat sheet can serve as a checklist of what you need to do:
https://github.com/rstudio/cheatsheets/raw/master/teachR.pdf

Teaching is an iterative process in which you can keep improving each time, if you measure and learn from your mistakes.


To leave a comment for the author, please follow the link and comment on their blog: R on Adi Sarid's personal blog.

Learning R: The Collatz Conjecture

Tue, 03/12/2019 - 10:10

(This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)


In this post we will see that a little bit of simple R code can go a very long way! So let’s get started!

One of the fascinating features of number theory (unlike many other branches of mathematics) is that many statements are easy to make, yet the brightest minds are not able to prove them. The so-called Collatz conjecture (named after the German mathematician Lothar Collatz) is an especially fascinating example:

The Collatz conjecture states that when you start with any positive integer and,

  • if it is even, the next number is one half the previous number and,
  • if it is odd, the next number is three times the previous number plus one

the sequence will always reach one.

It doesn’t get any simpler than that, but no one has been able to prove this – and not for a lack of trying! The great mathematician Paul Erdős said about it: “Mathematics may not be ready for such problems.” You can read more on Wikipedia: Collatz conjecture, and an especially nice film made by a group of students can be watched here: The Collatz Conjecture.

So let us write a little program and try some numbers!

First we need a simple helper function to determine whether a number is even:

is.even <- function(x) {
  if (x %% 2 == 0) TRUE else FALSE
}

is.even(2)
## [1] TRUE

is.even(3)
## [1] FALSE

Normally we wouldn’t use a dot within function names, but R itself (because of its legacy code) is not totally consistent here, and the is-function family (like is.na or is.integer) all use a dot. After that we write a function for the rule itself, making use of the is.even function:

collatz <- function(n) {
  if (is.even(n)) n/2 else 3 * n + 1
}

collatz(6)
## [1] 3

collatz(5)
## [1] 16

To try a number and plot it (like in the Wikipedia article) we could use a while-loop:

n_total <- n <- 27
while (n != 1) {
  n <- collatz(n)
  n_total <- c(n_total, n)
}
n_total
##  [1]   27   82   41  124   62   31   94   47  142   71  214  107  322  161
## [15]  484  242  121  364  182   91  274  137  412  206  103  310  155  466
## [29]  233  700  350  175  526  263  790  395 1186  593 1780  890  445 1336
## [43]  668  334  167  502  251  754  377 1132  566  283  850  425 1276  638
## [57]  319  958  479 1438  719 2158 1079 3238 1619 4858 2429 7288 3644 1822
## [71]  911 2734 1367 4102 2051 6154 3077 9232 4616 2308 1154  577 1732  866
## [85]  433 1300  650  325  976  488  244  122   61  184   92   46   23   70
## [99]   35  106   53  160   80   40   20   10    5   16    8    4    2    1

plot(n_total, type = "l", col = "blue", xlab = "", ylab = "")

As you can see, after a wild ride the sequence finally reaches one as expected. We end with some nerd humour from the cult website xkcd:
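A small extension that is not part of the original post: if you want to compare several starting values at once, you can wrap the loop above in a helper that counts the steps until the sequence reaches one. The function name collatz_steps is just illustrative; it reuses the collatz() rule defined earlier.

# Sketch only: count how many steps a starting value needs to reach 1
collatz_steps <- function(n) {
  steps <- 0
  while (n != 1) {
    n <- collatz(n)     # the rule defined above
    steps <- steps + 1
  }
  steps
}

# Stopping times for the first 30 positive integers
sapply(1:30, collatz_steps)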

Source: xkcd

To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.

DALEX has a new skin! Learn how it was designed at gdansk2019.satRdays

Tue, 03/12/2019 - 09:06

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

DALEX is an R package for visual explanation, exploration, diagnostics and debugging of predictive ML models (aka XAI – eXplainable Artificial Intelligence). It has a bunch of visual explainers for different aspects of predictive models. Some of them are useful during model development, some for fine tuning, model diagnostics or model explanations.

Recently Hanna Dyrcz designed a new beautiful theme for these explainers. It’s implemented in the DALEX::theme_drwhy() function.
Find some teaser plots below. A nice Interpretable Machine Learning story for the Titanic data is presented here.

Hanna is a very talented designer. So I’m super happy that at the next satRdays @ gdansk2019 we will have a joint talk “Machine Learning meets Design. Design meets Machine Learning”.

New plots are available in the GitHub version of DALEX 0.2.8 (please star it if you like or use it – this helps to attract new developers). It will get to CRAN soon (I hope).

Instance level explainers, like Break Down or SHAP

Instance level profiles, like Ceteris Paribus or Partial Dependency

Global explainers, like Variable Importance Plots
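For readers who want to try the new look themselves, here is a minimal sketch. Only explain() and theme_drwhy() are mentioned in the post; the random forest model, the apartments data and the variable_importance() call are assumptions based on the DALEX 0.2.x documentation, so treat this as illustrative rather than as the author’s own code.

# Sketch only: fit a model on the apartments data shipped with DALEX,
# wrap it in an explainer and draw a variable importance plot with the new theme.
library(DALEX)
library(randomForest)

model <- randomForest(m2.price ~ ., data = apartments)

explainer <- explain(model,
                     data  = apartmentsTest[, -1],
                     y     = apartmentsTest$m2.price,
                     label = "random forest")

vi <- variable_importance(explainer)   # assumed 0.2.x API
plot(vi) + theme_drwhy()               # apply the new theme to the ggplot output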

See you at satRdays!


To leave a comment for the author, please follow the link and comment on their blog: English – SmarterPoland.pl.

Binning Data with rbin

Tue, 03/12/2019 - 01:00

(This article was first published on Rsquared Academy Blog, and kindly contributed to R-bloggers)

We are happy to introduce the rbin package, a set of tools for binning/discretization of data, designed with beginner/intermediate R users in mind. It comes with two RStudio addins for interactive binning.

Installation

# Install release version from CRAN
install.packages("rbin")

# Install development version from GitHub
# install.packages("devtools")
devtools::install_github("rsquaredacademy/rbin")

RStudio Addins

rbin includes two RStudio addins for manually binning data. Below
is a demo:

Read on to learn more about the features of rbin, or see the
rbin website for
detailed documentation on using the package.

Introduction

Binning is the process of transforming numerical or continuous data into
categorical data. It is a common data pre-processing step of the model building
process. rbin has the following features:

  • manual binning using shiny app
  • equal length binning method
  • winsorized binning method
  • quantile binning method
  • combine levels of categorical data
  • create dummy variables based on binning method
  • calculates weight of evidence (WOE), entropy and information value (IV)
  • provides summary information about the binning process
Manual Binning

For manual binning, you need to specify the cut points for the bins. rbin
follows the left-closed, right-open interval convention ([0,1) = {x | 0 ≤ x < 1}) for
creating bins. The number of cut points you specify is one less than the number
of bins you want to create, i.e. if you want to create 10 bins, you need to
specify only 9 cut points, as shown in the example below. The accompanying
RStudio addin, rbinAddin(), can be used to iteratively bin the data and to
enforce a monotonically increasing/decreasing trend.
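As a side note that is not part of the original post, the same left-closed, right-open convention can be reproduced with base R's cut() by setting right = FALSE, which is a quick way to sanity-check where a value equal to a cut point ends up:

# Illustration only: with right = FALSE, a value equal to a cut point
# falls into the bin that starts at that cut point, so 29 lands in [29, 31).
x <- c(28, 29, 30, 31)
cut(x, breaks = c(-Inf, 29, 31, Inf), right = FALSE)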

After finalizing the bins, you can use rbin_create() to create the dummy
variables.

Bins

bins <- rbin_manual(mbank, y, age, c(29, 31, 34, 36, 39, 42, 46, 51, 56))
bins
## Binning Summary
## ---------------------------
## Method               Manual
## Response             y
## Predictor            age
## Bins                 10
## Count                4521
## Goods                517
## Bads                 4004
## Entropy              0.5
## Information Value    0.12
##
##
## # A tibble: 10 x 7
##    cut_point bin_count  good   bad      woe         iv entropy
##
##  1 < 29            410    71   339 -0.484   0.0255       0.665
##  2 < 31            313    41   272 -0.155   0.00176      0.560
##  3 < 34            567    55   512  0.184   0.00395      0.459
##  4 < 36            396    45   351  0.00712 0.00000443   0.511
##  5 < 39            519    47   472  0.260   0.00701      0.438
##  6 < 42            431    33   398  0.443   0.0158       0.390
##  7 < 46            449    47   402  0.0993  0.000942     0.484
##  8 < 51            521    40   481  0.440   0.0188       0.391
##  9 < 56            445    49   396  0.0426  0.000176     0.500
## 10 >= 56           470    89   381 -0.593   0.0456       0.700

Plot

plot(bins)

Dummy Variables

bins <- rbin_manual(mbank, y, age, c(29, 31, 34, 36, 39, 42, 46, 51, 56))
rbin_create(mbank, age, bins)
## # A tibble: 4,521 x 26
##      age job           marital education default balance housing loan
##
##  1    34 technician    married tertiary  no          297 yes     no
##  2    49 services      married secondary no          180 yes     yes
##  3    38 admin.        single  secondary no          262 no      no
##  4    47 services      married secondary no          367 yes     no
##  5    51 self-employed single  secondary no         1640 yes     no
##  6    40 unemployed    married secondary no         3382 yes     no
##  7    58 retired       married secondary no         1227 no      no
##  8    32 unemployed    married primary   no          309 yes     no
##  9    46 blue-collar   married secondary no          922 yes     no
## 10    32 services      married tertiary  no            0 no      no
##    contact   day month duration campaign pdays previous poutcome y
##
##  1 cellular   29 jan        375        2    -1        0 unknown  0
##  2 unknown     2 jun        392        3    -1        0 unknown  0
##  3 cellular    3 feb        315        2   180        6 failure  1
##  4 cellular   12 may        309        1   306        4 success  1
##  5 unknown    15 may         67        4    -1        0 unknown  0
##  6 unknown    14 may        125        1    -1        0 unknown  0
##  7 cellular   14 aug        182        2    37        2 failure  0
##  8 telephone  13 may        185        1   370        3 failure  0
##  9 telephone  18 nov        296        2    -1        0 unknown  0
## 10 cellular   21 nov         80        1    -1        0 unknown  0
##    `age_<_31` `age_<_34` `age_<_36` `age_<_39` `age_<_42` `age_<_46`
##
##  1          0          0          1          0          0          0
##  2          0          0          0          0          0          0
##  3          0          0          0          1          0          0
##  4          0          0          0          0          0          0
##  5          0          0          0          0          0          0
##  6          0          0          0          0          1          0
##  7          0          0          0          0          0          0
##  8          0          1          0          0          0          0
##  9          0          0          0          0          0          0
## 10          0          1          0          0          0          0
##    `age_<_51` `age_<_56` `age_>=_56`
##
##  1          0          0           0
##  2          1          0           0
##  3          0          0           0
##  4          1          0           0
##  5          0          1           0
##  6          0          0           0
##  7          0          0           1
##  8          0          0           0
##  9          1          0           0
## 10          0          0           0
## # ... with 4,511 more rows

Factor Binning

You can collapse or combine levels of a factor/categorical variable using
rbin_factor_combine() and then use rbin_factor() to look at weight of
evidence, entropy and information value. After finalizing the bins, you can
use rbin_factor_create() to create the dummy variables. You can use the
RStudio addin, rbinFactorAddin() to interactively combine the levels and
create dummy variables after finalizing the bins.

Combine Levels

upper <- c("secondary", "tertiary")
out <- rbin_factor_combine(mbank, education, upper, "upper")
table(out$education)
##
## primary unknown   upper
##     691     179    3651

out <- rbin_factor_combine(mbank, education, c("secondary", "tertiary"), "upper")
table(out$education)
##
## primary unknown   upper
##     691     179    3651

Bins

bins <- rbin_factor(mbank, y, education)
bins
## Binning Summary
## ---------------------------
## Method               Custom
## Response             y
## Predictor            education
## Levels               4
## Count                4521
## Goods                517
## Bads                 4004
## Entropy              0.51
## Information Value    0.05
##
##
## # A tibble: 4 x 7
##   level     bin_count  good   bad    woe      iv entropy
##
## 1 tertiary       1299   195  1104 -0.313 0.0318    0.610
## 2 secondary      2352   231  2121  0.170 0.0141    0.463
## 3 unknown         179    25   154 -0.229 0.00227   0.583
## 4 primary         691    66   625  0.201 0.00572   0.455

Plot

plot(bins)

Create Bins

upper <- c("secondary", "tertiary")
out <- rbin_factor_combine(mbank, education, upper, "upper")
rbin_factor_create(out, education)
## # A tibble: 4,521 x 19
##      age job           marital default balance housing loan contact
##
##  1    34 technician    married no          297 yes     no   cellular
##  2    49 services      married no          180 yes     yes  unknown
##  3    38 admin.        single  no          262 no      no   cellular
##  4    47 services      married no          367 yes     no   cellular
##  5    51 self-employed single  no         1640 yes     no   unknown
##  6    40 unemployed    married no         3382 yes     no   unknown
##  7    58 retired       married no         1227 no      no   cellular
##  8    32 unemployed    married no          309 yes     no   telephone
##  9    46 blue-collar   married no          922 yes     no   telephone
## 10    32 services      married no            0 no      no   cellular
##      day month duration campaign pdays previous poutcome y education
##
##  1    29 jan        375        2    -1        0 unknown  0 upper
##  2     2 jun        392        3    -1        0 unknown  0 upper
##  3     3 feb        315        2   180        6 failure  1 upper
##  4    12 may        309        1   306        4 success  1 upper
##  5    15 may         67        4    -1        0 unknown  0 upper
##  6    14 may        125        1    -1        0 unknown  0 upper
##  7    14 aug        182        2    37        2 failure  0 upper
##  8    13 may        185        1   370        3 failure  0 primary
##  9    18 nov        296        2    -1        0 unknown  0 upper
## 10    21 nov         80        1    -1        0 unknown  0 upper
##    education_unknown education_upper
##
##  1                 0               1
##  2                 0               1
##  3                 0               1
##  4                 0               1
##  5                 0               1
##  6                 0               1
##  7                 0               1
##  8                 0               0
##  9                 0               1
## 10                 0               1
## # ... with 4,511 more rows

Quantile Binning

Quantile binning aims to bin the data into roughly equal groups using quantiles.

bins <- rbin_quantiles(mbank, y, age, 10)
bins
## Binning Summary
## -----------------------------
## Method               Quantile
## Response             y
## Predictor            age
## Bins                 10
## Count                4521
## Goods                517
## Bads                 4004
## Entropy              0.5
## Information Value    0.12
##
##
## # A tibble: 10 x 7
##    cut_point bin_count  good   bad      woe         iv entropy
##
##  1 < 29            410    71   339 -0.484   0.0255       0.665
##  2 < 31            313    41   272 -0.155   0.00176      0.560
##  3 < 34            567    55   512  0.184   0.00395      0.459
##  4 < 36            396    45   351  0.00712 0.00000443   0.511
##  5 < 39            519    47   472  0.260   0.00701      0.438
##  6 < 42            431    33   398  0.443   0.0158       0.390
##  7 < 46            449    47   402  0.0993  0.000942     0.484
##  8 < 51            521    40   481  0.440   0.0188       0.391
##  9 < 56            445    49   396  0.0426  0.000176     0.500
## 10 >= 56           470    89   381 -0.593   0.0456       0.700

Plot

plot(bins)

Winsorized Binning

Winsorized binning is similar to equal length binning except that both tails
are cut off to obtain a smooth binning result. This technique is often used
to remove outliers during the data pre-processing stage. For Winsorized
binning, the Winsorized statistics are computed first. After the minimum and
maximum have been found, the split points are calculated the same way as in
equal length binning.
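To make that description concrete, here is a small base R sketch (not from the original post) of how Winsorized split points could be derived by hand: trim each tail at the chosen rate, then space the cut points equally between the trimmed minimum and maximum. rbin's internal handling of winsor_rate may differ in detail, so this is only the idea.

# Sketch only: winsorize 5% of each tail of age, then take equal-length cut points.
x <- mbank$age                                  # mbank ships with the rbin package
limits <- quantile(x, probs = c(0.05, 0.95))    # winsorized minimum and maximum
cut_points <- seq(limits[1], limits[2], length.out = 11)[2:10]  # 9 interior cut points for 10 bins
cut_points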

bins <- rbin_winsorize(mbank, y, age, 10, winsor_rate = 0.05)
bins
## Binning Summary
## ------------------------------
## Method               Winsorize
## Response             y
## Predictor            age
## Bins                 10
## Count                4521
## Goods                517
## Bads                 4004
## Entropy              0.51
## Information Value    0.1
##
##
## # A tibble: 10 x 7
##    cut_point bin_count  good   bad    woe       iv entropy
##
##  1 < 30.2          723   112   611 -0.350 0.0224     0.622
##  2 < 33.4          567    55   512  0.184 0.00395    0.459
##  3 < 36.6          573    58   515  0.137 0.00225    0.473
##  4 < 39.8          497    44   453  0.285 0.00798    0.432
##  5 < 43            396    37   359  0.225 0.00408    0.448
##  6 < 46.2          461    43   418  0.227 0.00482    0.447
##  7 < 49.4          281    22   259  0.419 0.00927    0.396
##  8 < 52.6          309    32   277  0.111 0.000811   0.480
##  9 < 55.8          244    25   219  0.123 0.000781   0.477
## 10 >= 55.8         470    89   381 -0.593 0.0456     0.700

Plot

plot(bins)
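Equal length binning is listed among the features above but not demonstrated in this post. A call along the following lines should produce it; the function name rbin_equal_length and its signature are assumptions based on the package's naming scheme, so check the rbin documentation before relying on it.

# Assumed to mirror rbin_quantiles()/rbin_winsorize(); not from the original post.
bins <- rbin_equal_length(mbank, y, age, 10)
bins
plot(bins)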

Learning More

The rbin website includes
comprehensive documentation on using the package, including the following
article that gives a brief introduction to rbin:

Feedback

rbin has been on CRAN for a few months now while we were fixing bugs and
making the API stable. All feedback is welcome. Issues (bugs and feature
requests) can be posted to the GitHub issue tracker.
For help with code or other related questions, feel free to reach out to us
at pkgs@rsquaredacademy.com.


To leave a comment for the author, please follow the link and comment on their blog: Rsquared Academy Blog.

A case where prospective matching may limit bias in a randomized trial

Tue, 03/12/2019 - 01:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

Analysis is important, but study design is paramount. I am involved with the Diabetes Research, Education, and Action for Minorities (DREAM) Initiative, which is, among other things, estimating the effect of a group-based therapy program on weight loss for patients who have been identified as pre-diabetic (which means they have elevated HbA1c levels). The original plan was to randomize patients at a clinic to treatment or control, and then follow up with those assigned to the treatment group to see if they wanted to participate. The primary outcome is going to be measured using medical records, so those randomized to control (which basically means nothing special happens to them) will not need to interact with the researchers in any way.

The concern with this design is that only those patients randomized to the intervention arm of the study have an opportunity to make a choice about participating. In fact, in a pilot study, it was quite difficult to recruit some patients, because the group therapy sessions were frequently provided during working hours. So, even if the groups are balanced after randomization with respect to important (and unimportant characteristics) like age, gender, weight, baseline A1c levels, etc., the patients who actually receive the group therapy might look quite different from the patients who receive treatment as usual. The decision to actually participate in group therapy is not randomized, so it is possible (maybe even likely) that the group getting the therapy is older and more at risk for diabetes (which might make them more motivated to get involved) than those in the control group.

One solution is to analyze the outcomes for everyone randomized, regardless of whether or not they participate (as an intent-to-treat analysis). This estimate would answer the question about how effective the therapy would be in a setting where the intervention is made available; this intent-to-treat estimate does not say how effective the therapy is for the patients who actually choose to receive it. To answer this second question, some sort of as-treated analysis could be used. One analytic solution would be to use an instrumental variable approach. (I wrote about non-compliance in a series of posts starting here.)

However, we decided to address the issue of differential non-participation in the actual design of the study. In particular, we have modified the randomization process with the aim of eliminating any potential bias. The post-hoc IV analysis is essentially a post-hoc matched analysis (it estimates the treatment effect only for the compliers – those randomized to treatment who actually participate in treatment); we hope to construct the groups prospectively to arrive at the same estimate.

The matching strategy

The idea is quite simple. We will generate a list of patients based on a recent pre-diabetes diagnosis. From that list, we will draw a single individual and then find a match from the remaining individuals. The match will be based on factors that the researchers think might be related to the outcome, such as age, gender, and one or two other relevant baseline measures. (If the number of matching characteristics grows too large, matching may turn out to be difficult.) If no match is found, the first individual is removed from the study. If a match is found, the first individual is assigned to the therapy group, and the second to the control group. Now we repeat the process, drawing another individual from the list (which excludes the first pair and any patients who have been unmatched), and finding a match. The process is repeated until everyone on the list has been matched or placed on the unmatched list.

After the pairs have been created, the research study coordinators reach out to the individuals who have been randomized to the therapy group in an effort to recruit participants. If a patient declines, she and her matched pair are removed from the study (i.e. their outcomes will not be included in the final analysis). The researchers will work their way down the list until enough people have been found to participate.

We try to eliminate the bias due to differential dropout by removing the matched patient every time a patient randomized to therapy declines to participate. We are making a key assumption here: the matched patient of someone who agrees to participate would have also agreed to participate. We are also assuming that the matching criteria are sufficient to predict participation. While we will not completely remove bias, it may be the best we can do given the baseline information we have about the patients. It would be ideal if we could ask both members of the pair if they would be willing to participate, and remove them both if one declines. However, in this particular study, this is not feasible.

The matching algorithm

I implemented this algorithm on a sample data set that includes gender, age, and BMI, the three characteristics we want to match on. The data is read directly into an R data.table dsamp. I’ve printed the first six rows:

library(data.table)   # provides fread() and setkey(); not shown in the original post

dsamp <- fread("DataMatchBias/eligList.csv")
setkey(dsamp, ID)

dsamp[1:6]
##    ID female age   BMI
## 1:  1      1  24 27.14
## 2:  2      0  29 31.98
## 3:  3      0  47 25.28
## 4:  4      0  40 24.27
## 5:  5      1  29 30.61
## 6:  6      1  38 25.69

The loop below selects a single record from dsamp and searches for a match. If a match is found, the selected record is added to drand (randomized to therapy) and the match is added to dcntl. If no match is found, the single record is added to dused, and nothing is added to drand or dcntl. Anytime a record is added to any of the three data tables, it is removed from dsamp. This process continues until dsamp has one or no records remaining.

The actual matching is done by a call to function Match from the Matching package. This function is typically used to match a group of exposed to unexposed (or treated to untreated) individuals, often using a propensity score. In this case, we are matching simultaneously on the three columns in dsamp. Ideally, we would want to have exact matches, but this is unrealistic for continuous measures. So, for age and BMI, we set the matching range to be 0.5 standard deviations. (We do match exactly on gender.)

library(Matching)
set.seed(3532)

dsamp[, rx := 0]

dused <- NULL
drand <- NULL
dcntl <- NULL

while (nrow(dsamp) > 1) {

  selectRow <- sample(1:nrow(dsamp), 1)
  dsamp[selectRow, rx := 1]

  myTr <- dsamp[, rx]
  myX <- as.matrix(dsamp[, .(female, age, BMI)])

  match.dt <- Match(Tr = myTr, X = myX, caliper = c(0, 0.50, .50), ties = FALSE)

  if (length(match.dt) == 1) {  # no match

    dused <- rbind(dused, dsamp[selectRow])
    dsamp <- dsamp[-selectRow, ]

  } else {                      # match

    trt <- match.dt$index.treated
    ctl <- match.dt$index.control

    drand <- rbind(drand, dsamp[trt])
    dcntl <- rbind(dcntl, dsamp[ctl])

    dsamp <- dsamp[-c(trt, ctl)]
  }
}

Matching results

Here is a plot of all the pairs that were generated (connected by the blue segments); it also includes the individuals without a match (red circles). We could get shorter line segments if we reduced the caliper values, but we would certainly increase the number of unmatched patients.

The distributions of the matching variables (or at least the means and standard deviations) appear quite close, as we can see by looking at the males and females separately.

Males

##    rx  N mu.age sd.age mu.bmi sd.bmi
## 1:  0 77   44.8   12.4   28.6   3.65
## 2:  1 77   44.6   12.4   28.6   3.71

Females

##    rx  N mu.age sd.age mu.bmi sd.bmi
## 1:  0 94   47.8   11.1   29.7   4.63
## 2:  1 94   47.8   11.3   29.7   4.55

Incorporating the design into the analysis plan

The study – which is formally named Integrated Community-Clinical Linkage Model to Promote Weight Loss among South Asians with Pre-Diabetes – is still in its early stages, so no outcomes have been collected. But when it comes time to analyze the results, the models used to estimate the effect of the intervention will have to take into consideration two important design factors: (1) the fact that the individuals in the treatment and control groups are not independent, because they were assigned to their respective groups in pairs, and (2) the fact that the individuals in the treatment groups will not be independent of each other, since the intervention is group-based, so this is a partially cluster randomized trial. In a future post, I will explore this model in a bit more detail.
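The author defers the modelling details to a future post, but purely as a sketch of how the two design factors might enter a model, one could imagine a mixed-effects specification with a random intercept for the matched pair and another for the therapy cluster. The variable names, the toy data and the lme4 choice below are all assumptions, not the study's actual analysis plan; giving all controls a single pseudo-cluster level is one common device for partially clustered designs, and whether it is appropriate here is exactly the kind of question the promised future post will address.

# Hypothetical sketch only - not the study's analysis plan.
library(lme4)

# Toy outcomes data: 100 matched pairs, treated patients clustered into 10 therapy
# groups, controls collected in a single "none" pseudo-cluster.
set.seed(1)
n_pairs <- 100
outcomes <- data.frame(
  pair = rep(seq_len(n_pairs), each = 2),
  rx   = rep(c(1, 0), times = n_pairs)
)
outcomes$therapy_group <- ifelse(outcomes$rx == 1,
                                 paste0("g", (outcomes$pair - 1) %% 10 + 1),
                                 "none")
outcomes$weight_change <- -1.5 * outcomes$rx + rnorm(2 * n_pairs)

# Random intercepts for the matched pair and for the therapy cluster
fit <- lmer(weight_change ~ rx + (1 | pair) + (1 | therapy_group), data = outcomes)
summary(fit)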

This study is supported by the National Institutes of Health National Institute of Diabetes and Digestive and Kidney Diseases R01DK11048. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.


To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

Statistics Sunday: Scatterplots and Correlations with ggpairs

Mon, 03/11/2019 - 21:35

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

As I conduct some analysis for a content validation study, I wanted to quickly blog about a fun plot I discovered today: ggpairs, which displays scatterplots and correlations in a grid for a set of variables.

To demonstrate, I’ll return to my Facebook dataset, which I used for some of last year’s R analysis demonstrations. You can find the dataset, a minicodebook, and code on importing into R here. Then use the code from this post to compute the following variables: RRS, CESD, Extraversion, Agree, Consc, EmoSt, Openness. These correspond to measures of rumination, depression, and the Big Five personality traits. We could easily request correlations for these 7 variables. But if I wanted scatterplots plus correlations for all 7, I can easily request it with ggpairs then listing out the columns from my dataset I want included on the plot:

library(ggplot2)
library(GGally)  # ggpairs() comes from the GGally package, which builds on ggplot2
ggpairs(Facebook[, c(112, 116, 122:126)])

(Note: I also computed the 3 RRS subscales, which is why the column numbers above skip from 112 (RRS) to 116 (CESD). You might need to adjust the column numbers when you run the analysis yourself.)
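One way to sidestep the column-number bookkeeping mentioned above (not in the original post) is to select the columns by name, assuming the computed variables carry exactly the names listed earlier:

# Equivalent call using column names instead of positions
vars <- c("RRS", "CESD", "Extraversion", "Agree", "Consc", "EmoSt", "Openness")
ggpairs(Facebook[, vars])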

The results look like this:

Since the grid contains one panel for every pair of variables (so it grows with the square of the number of variables), I wouldn’t recommend this type of plot for a large number of variables.


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.

Data Science at AT&T Labs Research

Mon, 03/11/2019 - 17:00

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Noemi Derzsy, a Senior Inventive Scientist at AT&T Labs Research within the Data Science and AI Research organization.

Introduction Noemi Derzsy

Hugo: Hi there, Noemi, and welcome to DataFramed.

Noemi: Hi. Thank you for having me.

Hugo: It’s a real pleasure to have you on the show, and I’m really excited to be talking about your work at AT&T Labs in research at the moment, but before that, I’d like to find out a bit about you. So, on your website, and you sent me a bio, I’m gonna read it out because I love it. You’re a senior inventive scientist at AT&T Labs within the data science and AI research organization, and I love what you say next, that you’re doing lots of science with lots of data.

Noemi: Yes. Well, that is what I do.

What have you been involved in?

Hugo: Exactly. So, you’re working at AT&T now, but you actually have been involved in a lot of other initiatives in the data community. So, I thought maybe you could give me a bit of background, tell me about other things you’ve been involved in.

Noemi: Yeah, for sure. I spent a lot of time before becoming involved in the open source space in academia where I actually didn’t have the opportunity or the bandwidth necessarily to work on open source projects that were actually putting me out there in the open source community, but I started becoming more active in the open source space once I became a NASA Datanaut, and here I started working with NASA’s open source metadata, which is basically information about their over 30,000 datasets that they make publicly available, and that’s how I started getting involved in the data science community and in open source, and recently, I also became the co-organizer of Women In Machine Learning & Data Science meetup in New York, and here we organize meetup events focused on machine learning and data science topics, and our mission is to provide a supportive community that encourages and promotes women and non-binary people in tech.

Hugo: Fantastic, and actually, I’ve recently had Reshama Shaikh on the podcast to talk about a lot of the initiatives at WiMLDS in New York City.

Noemi: Right. She’s my colleague at Women in Machine Learning & Data Science.

Hugo: She’s fantastic.

Noemi: She is.

NASA Datanaut

Hugo: So, I’m also interested in this idea of being a NASA Datanaut. Can you just tell me a bit about this? It sounds really cool to start with. I want to be a Datanaut. So, could you just give me a bit of context around it, on what the program is?

Noemi: Yeah. I think every data scientist whose dream was to become an astronaut but didn’t make it now can become a Datanaut.

Hugo: That’s awesome.

Noemi: Yeah. When the government forced these government agencies to open source some of their datasets, then NASA open sourced over 30,000 datasets, and people don’t know about it, so one way they tried to promote this is with the NASA Datanauts program. So, this is an initiative in which they tried to create this collaborative group every year of individuals who are interested or excited about working with their open source datasets, and we have meetings regularly. We have webinars. People can present what they are working on. They can start collaborations based on NASA’s open source datasets to come up with ideas how they can use them, and then there are data scientists from NASA who are actually presenting what they are doing and telling us how we can get involved in certain projects that they would be interested in seeing results in, but they don’t have the time to work with.

Noemi: Also, the chief knowledge architect from NASA, David Meza, he’s also very involved, and he’s very supportive of this community, so you can always just reach out to him, and he’s going to be very supportive no matter what your question is or what projects you want to work on. So, it’s an amazing community to be a part of. It’s application-based, so every year they launch their application opportunity, and people can apply, and if they get selected, then they can just … Once they become a NASA Datanaut, they will be Datanauts forever.

Hugo: That’s really cool. If our listeners who I’m sure are really excited by the idea of being a NASA Datanaut, at least as much as I am, are interested, we’ll include a link to a few things in the show notes as well.

Noemi: Oh, yeah. Sure. Well, open.nasa.gov is the first place to go to, and then there you can find information.

Hugo: Perfect. So, the other thing that I know that you’re excited about is teaching and pedagogy and data science instruction. Right? And you’ve also run workshops in the wild at conferences and this type of stuff. I think, if I remember correctly, was your interest in network analysis and this type of stuff?

Noemi: Yeah. So, actually, my bachelor’s degree, master’s, and PhD, and then a five-year postdoc all involved network science, so many of my research projects were on understanding complex systems through their network structure. So, I was thinking that this was a good opportunity to have these workshops to show how you can do network analysis, especially because there is this very good NetworkX Python package out there that can enable data scientists to just analyze the data from network point of view very easily.

How did you get into data science?

Hugo: I love NetworkX so much, and we’ve actually got two courses introducing NetworkX on DataCamp taught by Eric Ma, who’s an old friend and collaborator. He’s now a research data scientist at Novartis, but NetworkX is a really great package, and the API is really nice as well, I’ve found. So, we’ve got a few ideas about your background, but I’m just wondering how you got into data science and analytics originally.

Noemi: Yeah. Well, I got into data science long before it started to be called data science, I think this was back in 2006. So, to give a bit of context, I did a bachelor’s degree in physics and computer science, so I had to write a bachelor’s degree thesis on something novel and related to the field, and I didn’t know what that would be. I could either decide to do some physics project or some computer science, but I really wanted to find a method that I can combine these two.

Noemi: So, I actually was very lucky because I had this great quantum physics professor who is the leader in the research space, and he is always interested in physics applications outside the traditional boundaries, and at that point he was working on projects that were focused on understanding complex systems, and even more complex systems with underlying network structure. So, what he was always working on within his projects at the time was to analyze and model these complex systems through some data that he obtained from different sources. So, this was using a lot of computational physics, which I really liked, and also leveraged data analysis. So, I found this topic very exciting.

Hugo: That really explains your interest in networks to this day, why you educate around them, your love of NetworkX, and I supposed also, as we’ll get to, some of your work at AT&T, thinking about networks of individuals in a society and communication between them and that type of stuff.

Noemi: Exactly, and that’s how I actually got to do my first project using social type of data, which was from this Erasmus European scholarship framework, and here I actually built my first network, which I found really exciting. I built a network of European universities where the connections were built by the students who went from one university to the other.

Hugo: Interesting. So, is that a directed graph?

Noemi: Yeah. You can look at it either at the direction, or you can look at it from the undirected way. We actually built both a directed one and the undirected one because if you take into account from which university the student goes to, which university to visit, then it would be a directed one, but if you just want to look at professional connections, for example, I just want to see how universities are interconnected among each other, I don’t really care about the direction. So, then you just look at the undirected version of the network.

Hugo: That’s really cool. I’m wondering about the data collection and data generating process in this case. Did you hand write all these universities down and then figure out using another data set, or did you figure out a way to automate it, or how did that work?

Noemi: So, actually, this was a fairly small dataset. It was only a snapshot from 2003, so they gave us just a small data that we could play with. It contained information… it was basically a matrix version that I received, so each row and column was a university, and then the value was the number of students that went from one to the other. So, that was the data that I got back then.

Hugo: Fantastic. What did you do with it then? What were the takeaways?

Noemi: So, we actually revealed the most interconnected cluster, the subgroup, and this was very interesting because it wasn’t the top universities, but it was actually someone at the conference mentioned when I presented this that it looks like the universities belonging to the cities where the students can have the most best parties. Yeah. What we basically found was that the connections are very much influenced by the professors’ connectivities. So, the professional network of the professors is the one that basically drives these connectivities within the students, despite the fact that they have the opportunity to choose themselves where they go.

Hugo: Right, and actually, that’s … So, I actually did a postdoc, or the first half of my postdoc, in Germany in cell biology, and I do remember a lot of people came through the lab and the institute as a function of professors and researchers and social connections between professors where I worked and professors at other institutes and campuses.

Noemi: Right.

Hugo: So then, if I recall correctly, you worked on something else for your master’s thesis, right?

Noemi: I got so fascinated by this topic that I wanted to continue to do the same thing throughout my master’s degree. So, for that thesis, my advisor, the same advisor who I eventually got the PhD with as well, because I was such a big fan of his type of research and work, so he obtained this Enron open source email communication data. I don’t know if you know about it.

Hugo: I know it well. Yeah.

Noemi: Yeah. Okay. So, I used that email communication data to start analyzing how people communicate with each other, and if we can detect some pattern and build a model, and yeah, the most interesting part, from a physicist’s point of view, was that the communication was basically like an exponential decay, like a particle decay.

Hugo: Oh, interesting.

Noemi: So, yeah. We showed that the later you reply to an email, the less likely it is that you’re going to actually reply to it, and the probability is going to drop exponentially in time.

Hugo: Right. Well, anecdotally, that seems right for me anyway, because there is this … I don’t want to go into this too much, but there is this barrier, right? If I’ve left an email a week to reply, then I’m like, “Oh, no. I’ve gotta actually give a proper reply now,” as opposed to just a few words like, “Hey, received whatever". There is a barrier there. The other thing that I just want to say about the … I’m actually very familiar with the Enron dataset in a slightly strange way. A good friend of mine, and I’ve mentioned this on the podcast once before, a good friend of mine who’s a digital artist and has done a project where you can register to receive the Enron emails daily-

Noemi: Oh, cool.

Hugo: Yeah. I can put a link in the show notes as well. So, I actually, in my inbox every day, I receive one of the Enron mails. I think today’s was Mark from legal thanking someone else for dinner last week or something like that, but it’s actually really odd, and a very intimate dataset as well.

Noemi: Yeah. I actually haven’t followed that to see how the data evolves compared to the fraction of the data that I had back in the day, but actually, that would be a cool thing to followup and see.

Hugo: Absolutely. Let me ask, is the work that you did for your master’s on the Enron dataset, is that on GitHub or out in the public domain at all?

Noemi: No. I didn’t post it back then. Back then, I was doing it in C++, and it was a very long time ago when … In academia, it’s also not very popular-

Hugo: Yeah, okay. No, that makes sense.

Noemi: … to open source things. That’s something that I think academia can work on.

Hugo: It can, and it wasn’t necessarily incentivized back then, and it’s generational in a lot of ways. We’ve discussed that on the show before, but I do think it’s becoming more and more commonplace, particularly, more and more people are learning R and Python as opposed to MATLAB and whatever else there is. Not that MATLAB isn’t great for certain things. I don’t want to say that.

Noemi: Yeah. Well, I always used C++, so from there, for me a natural transition was Python, and now since most of my colleagues here use R, it’s something that I’m dipping my toes into.

Hugo: For sure. When people ask me why I write Python, my first response always is because I love writing Python code. It’s so much fun to write.

Noemi: Yeah.

AT&T Labs Research

Hugo: But today, we’re here to talk about your work at AT&T Labs research, so I thought maybe you could just break down for me what the mission and history of AT&T Labs in general is.

Noemi: I actually represent AT&T Labs Research, so I can talk about AT&T Labs Research mission, because AT&T Labs is a very broad research and development division of AT&T. So, the mission of AT&T Labs Research is to look beyond today’s technology solutions to invent disruptive technologies that meet future needs. This comprises very diverse and fascinating research areas that range from AI, 5G technology to video and media analytics.

Hugo: Great. So, this is stuff that maybe won’t be implemented right now, but thinking very much about, as my Belgian colleagues would call it, future music.

Noemi: Right. Yeah. So, this is the big view and the future goal that we’re working toward.

Hugo: Yep. Okay, great. So, maybe you can set the scene historically as well for us, briefly.

Noemi: Oh, right. Yeah. So, the history of AT&T Labs, if you think about it, AT&T Labs traces its history from AT&T Bell Labs, which is famous for its very rich history in innovation, and as a physicist, I feel particularly honored to be part of the research lab, especially this research lab where several physicists and scientists from other fields as well have been awarded with a total of nine Nobel Prizes for their work done at Bell Labs, which I’m still very amazed by, and just to name a few, the Bell Labs hosted extraordinary scientists like Walter Shewhart and John Tukey who contributed to the fundamentals of statistics, and Claude Shannon, who is the father of information theory, and many, many others. So, for me, it’s an amazing opportunity to be here, and I’m proud of it every day.

Hugo: Oh, that’s really exciting, and I’m a huge fan of Claude Shannon’s work, of course, and Tukey’s really interesting, particularly in … Something we are only getting back to now in kind of the cultural discourse is really thinking about the importance of exploratory data analysis, and the focus in academia in industry for a long time has been on positive results. His focus on actually getting to know your data and all the techniques he developed to do that are incredibly beautiful.

Noemi: Right. One of the updates at AT&T Labs is that we recently opened an AT&T Science & Technology Innovation Center in Middletown, New Jersey, which is a museum that covers the 142 years of inventions that AT&T pioneered. So, you can actually go and check it out.

Hugo: Oh, great. That’s open now, or about to open?

Noemi: So, it opened at the end of last year, so it should be open. Yes.

Hugo: Okay, perfect. So, we’ve discussed briefly how the work at AT&T Labs research, how it thinks about the future of what can happen. I’m just wondering how it relates to the business side of AT&T currently.

Noemi: So, AT&T Labs was founded so it can focus on solving the hardest tech problems that AT&T’s dealing with, and the solutions of these problems translate for AT&T to improvements in customer service or customer care, and many of the projects also result in cost reductions for the company, like network optimizations and improving advertising and so on.

Hugo: Before we get into the work you do, AT&T Labs research, I’d just like a general high-level overview of kind of … Maybe you can tell me a bit about some of the current projects at AT&T Labs in general that you find most interesting.

Noemi: Yeah. Actually, there are so many, and as a new employee, I’m just still observing all the information and all the new projects that I get to learn about from my colleagues-

Hugo: I’m sure.

Noemi: … but to mention a few that I’m actually not involved in, but I find very exciting or important, are … So, one of them is creating new products to make a difference in the media and entertainment space, which also helps us build this partnership with Turner that has recently become a division of AT&T’s WarnerMedia. So, AT&T has a lot of TV data, and most of the time AT&T’s not associated with TV data, but since AT&T owns DirecTV, and now Turner, it’s a lot of TV data that can be used to do critical and fundamental research in the media and entertainment space.

Hugo: Great. That sounds very interesting. Do you know much about what type of tools and techniques are used in doing that, or what type of outcomes they’re trying to achieve?

Noemi: Yeah. So, I think what I can say is that basically, that’s why AT&T launched Xandr, which is their new company focused on advertising. I don’t know if you know about that, after the acquisition of AppNexus.

Hugo: Tell me a bit about it.

Noemi: Last year, AT&T acquired AppNexus, and it has now become Xandr, which is focused on improving advertising in the media and entertainment space.

Hugo: Interesting. I suppose a big part of that now is thinking about targeted advertising, and I think that’s something your colleagues are thinking about as well.

Noemi: Right. So, the goal is … We have a lot of advertising that is just distracting, and the goal is to provide less advertising, but more relevant ones. So, there is a lot of research that is going on around this problem.

Hugo: Great. So, are there any other current projects that you find very exciting?

Noemi: Coming out of this advertising work, there is also a very important project going on that many of my colleagues are currently working on, which is focused on how to combat bias and fairness issues in these targeted advertisements. So, this is somehow related to the first one, but it’s also very important. Then I’m going to mention one that has really nothing to do with this. It’s completely different, but when I first got to AT&T and I found out about it, I was like, “Wow. I didn’t even think that that would be a thing at AT&T.”

Noemi: So, one of the projects is working on creating drones for cell tower inspection. So, this research is basically leveraging AI, machine learning, and video analytics, and their goal is to create this deep learning-based algorithm that is just going to, once you send out the drones, to create these video footages. We are going to analyze this footage to detect tower defects or anomalies, and this will enable automating the tower inspections, and it will make it work faster and more efficient. So, this is one of the things that, oh, I didn’t even think about that, and I found it really cool. I just wanted to mentioned it.

Hugo: That’s amazing. The use of drones and the idea of using essentially deep learning and AI technologies and video analytics, as you say, in drones has so many applications. One I’ve been reading quite a bit about recently is in ag-techs, or agricultural technology, and drones analyzing yields of field crops and that type of stuff, and one of the really cool things is … I like this example that if you’ve got a camera on a drone, and you’re trying to build algorithms or get them trained and tested in realtime, it isn’t as though you can throw the deepest neural net at it, for example, if you’re trying to run it onboard the drone, for example. So, you’re pushing up against a lot of technological constraints there as well.

Noemi: Yeah, and this is so fascinating because for me, drones was never … I would have never connected it to such an important work at AT&T, which is you have to make sure that those towers are working properly, and if there’s any malfunctioning, you need to detect it in time, and this would be really helping that cause.

What are you working on?

Hugo: Exactly. So, this is a nice cross-section of different research projects at AT&T Labs that interest you. So, I’d love to jump in and find out just about some of the projects you’re currently working on.

Noemi: So, my projects are also just as diverse as the ones I mentioned above, which I find really exciting because I have the opportunity to study and to work on very different data types and types of data, and to try to answer very different problems and to tackle with very interesting research projects. So, to mention the first one, which is my favorite, is human mobility characterization from mobile network data. So, human mobility patterns revealed from cellular telephone networks can offer a large-scale glimpse of how humans move in space and time, and how they interact, and I find this project very exciting because I can study the human behavior, which is a phenomenon that as a physicist I’ve become very interested in since undergrad. This project also offers the opportunity to study human behavior through large-scale anonymized customer data and leverage the discoveries to also improve our services.

Hugo: That’s really cool, and as we said before, these types of projects really speak to a lot of your interests that you’ve developed over the past couple of decades as well, and what you’ve worked on.

Noemi: Last year, I was interviewing to find a new job, and I was getting after a point very frustrated that there are so many data science positions out there that I would have to literally throw out all the things I’ve learned in the past and all the research that I’ve done because I wouldn’t be able to leverage it, and I am so happy that I found this job because I can just use everything I studied so far, and it doesn’t go to waste.

Hugo: That’s really cool, and thinking about … I mean, the great thing about a project like this from my outside and naïve perspective is you can view it on so many scales. So, as you said, you can view it as a network of individuals, but you can also view it as including a lot of geospatial data, which is incredibly interesting, or you can view on the individual level as well. So, there are kind of a separation of scales there where you can answer different questions at different points.

Noemi: Yes. I also find it exciting because with this, I’m also learning new tools, and for example, I got to learn about Nanocubes, which is this open source visualization tool for large spatiotemporal datasets, and this was actually created at AT&T Labs, and it’s open source, and it’s amazing because you can use billions of data points and visualize them realtime, and you can also query it, slice and dice your data as you please, and then visualize subsets of it. So, it’s a lot of fun.

Hugo: Yeah. I’ve actually seen several demos of Nanocubes. I’ve never used it myself, but I think it was maybe Simon Urbanek and Chris Volinsky who showed it to me originally, but I can’t be sure. And actually, as you know, I had Chris, who for our listeners is … I think he’s now assistant VP of data science and AI research where you are.

Noemi: Yes.

Hugo: So, I had him on the podcast last year, and we didn’t discuss this in detail, but the first time I ever encountered Chris must have been, I don’t know, five or six years ago at a conference, and he actually spoke about characterizing human mobility, which of course, is the project you just spoke to, and he gave this great talk, which involved seeing when text messages stopped in a downtown neighborhood in … I can’t remember which city it was, but text messages stopped at a certain point, and phone calls started being made, and he realized, or his team realized, that at this point this was when all the nightclubs and bars shut, and people were calling for taxis, this was before Uber and that type of stuff, calling for taxis. So, from the data, you can actually see the emergent behavior of populations. Right?

Noemi: Right, and actually, they even published a paper. This is why I love my work so much, because we also have the opportunity to publish, and he published these findings. Yeah. It’s called Human Mobility Characterization from Cellular Network Data, and it’s a publication, I think, from 2013.

Hugo: Great. That was around the time I saw this talk, actually, so that would make perfect sense. So, we’ll definitely link to that in the show notes as well. But it’s really cool to be able to publish this stuff, to make discoveries about human mobility and characterizing that, as you say, but also, as you said, to leverage these discoveries to improve the services that AT&T provides people.

Noemi: Right, and that links to another project of mine, which I’m working on, and I also find it exciting because I can also use my network science background, and that project is about characterizing our mobile network and analyzing how its topology compares to other reported real social networks out there. This initially just sounds like fun, but it’s also very crucial for us to know what our network topology looks like because it helps us understand how certain dynamical processes progress throughout the network, and this implicitly also helps us improve our services.

Hugo: Great. So, when we’re thinking of this mobile network, generally, for our listeners who don’t necessarily know a lot about network theory, a network or a graph, you’ll have nodes and edges that connect them. So, you can imagine on Twitter that all the nodes are people with Twitter handles, and connections are formed when people follow each other. Those are the edges. Now, I’m wondering what this mobile network looks like. Is it people with cellphones are the nodes? And there’s an edge between them if they call each other or message each other?

Noemi: Right. Yes. This is all anonymized, so the only thing that we’re doing is using at aggregate level to see connectivities like number of connections and so on, so this helps us understand the topological features. Yeah. So, basically, a network, it has elements, and then the elements that are connected by a certain relationship, you can build this edge among them, and then this is how you construct your network. The reason why I find this so fascinating is because networks are everywhere around us, and network scientists are many scientists from different fields because it’s a very interdisciplinary topic. You have protein interaction networks. You have neural networks. You have social networks. You have street networks, so transportation networks, power grids. So, we are living in a very interconnected world, and everything is networks, so for me, that’s why it’s so fascinating to work in network science.

Hugo: Very much so. So, you said two things I just want to kind of tease apart briefly. You talked about how the topology of a network can be crucial to understanding how certain dynamical processes progress throughout the network. I’m just wondering if you could give us insight into what topology in a network actually means, and what type of dynamical processes, for example, you think about.

Noemi: So, topology-wise, you want to see some basic features of the network. You want to see what is the degree distribution of a network. So, that means that I’m trying to see how many nodes I have with this number of connections, how many nodes with that number, and then I’m just building up the distribution. Recent studies have shown that-

Hugo: That is, think about how connected it is. Yeah.

Noemi: Right, and studies in network science have shown that real networks mostly follow this scale-free pattern when it comes to their degree distribution, which means that most nodes have very few connections, whereas you have this small number of hubs, which have an extremely large number of connections, and this is something you can see in Twitter, too. Right? So, you have these very popular people who have hundreds of thousands of followers, but most people will have a very low number of connections. Then the other thing that you want to look at when you’re looking at topology is how clustered the nodes are within the network. So, is the network more homogeneous, or do you see some more densely connected subgroups, like for example, in social networks you will see many densely connected subgroups.
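As a rough illustration of what Noemi describes (toy simulated networks, not AT&T data), the igraph package makes it easy to compare the degree distribution of a preferential-attachment ("scale-free") network with that of a random network:

library(igraph)
set.seed(42)

g_sf  <- sample_pa(10000, directed = FALSE)   # preferential attachment: a few hubs, many low-degree nodes
g_rnd <- sample_gnp(10000, p = 2 / 10000)     # random graph with a similar average degree

plot_dd <- function(g, main) {
  dd   <- degree_distribution(g)
  k    <- seq_along(dd) - 1        # degree_distribution starts at degree 0
  keep <- dd > 0 & k > 0           # drop zeros so the log-log plot is clean
  plot(k[keep], dd[keep], log = "xy", main = main, xlab = "degree k", ylab = "P(k)")
}

par(mfrow = c(1, 2))
plot_dd(g_sf, "Scale-free network")
plot_dd(g_rnd, "Random network")

On the log-log plot the scale-free network shows the heavy tail Noemi mentions, while the random network's degrees cluster tightly around the average.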

Hugo: I was just gonna say, that’s really important because this can give rise to the emergence of filter bubbles and echo chambers. You can imagine politically distinct groups that really communicate within themselves and read particular types of media, but not from the other side, for example.

Noemi: Right, and that’s why it’s very important to understand the structure of your network, because before you start looking at how you can influence people, for example in politics, you first have to see what type of network you have. Is the distribution scale-free? Do you have these hubs, or does everyone have approximately the same number of connections? Because these dynamical processes are going to evolve in the network in a completely different manner based on the structure of the network. To give an example of these dynamical processes, something that I’ve worked on for a very long time is cascading failures, which you can see in power grids. In any type of network where you have information flow, and you can think of anything, like me trying to convince my friend to buy a product, or a power grid transmitting current from one generator to another, you want to see, in case one node fails in the system, how its failure is going to be transmitted further throughout the network.

Noemi: So, one thing that we need to take into account when we’re looking at these networks that transfer information is that each node has an assigned capacity, that is, how much information it can handle, and if one of the nodes fails, it’s going to reallocate its load, the information it was carrying, to its neighboring nodes, and now, if those end up with a higher load than their capacity, they’re also going to fail. This is what they call a cascading failure, or an avalanche of failures, and this is a big problem because in power grids you have mere milliseconds to actually try to mitigate that failure. So, what you’re trying to do is to build a system that is more robust against that, but these failures are also very dependent on the structure of the network. So, at the basis of every network analysis lies the question of what the structure of that network is.
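A toy simulation of the kind of cascade Noemi describes, where all numbers (the capacity margin, the seed failure) are illustrative assumptions rather than anything from the interview:

library(igraph)
set.seed(1)

g        <- sample_gnp(100, 0.05)  # small random network
load     <- degree(g)              # assume each node's initial load equals its degree
capacity <- 1.2 * load             # assume each node tolerates 20% extra load

frontier <- 10                     # seed the cascade by failing node 10
failed   <- frontier

repeat {
  # redistribute the load of newly failed nodes to their surviving neighbors
  for (f in frontier) {
    nbrs <- setdiff(as.integer(neighbors(g, f)), failed)
    if (length(nbrs) > 0) load[nbrs] <- load[nbrs] + load[f] / length(nbrs)
  }
  new_failures <- setdiff(which(load > capacity), failed)
  if (length(new_failures) == 0) break
  failed   <- union(failed, new_failures)
  frontier <- new_failures
}

length(failed)  # total size of the failure cascade

Running the same model on networks with different structures (random versus scale-free, clustered versus homogeneous) gives very different cascade sizes, which is exactly why the topology matters.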

Hugo: As you were talking about dynamic processes propagating through networks, it just sprang to mind that Twitter is used so much by data scientists for thinking about tools and techniques and problem solving and debugging, and I was just wondering whether we could see how data science tools actually propagate through social networks on the internet, which could be a cool project for a listener to do at some point.

Noemi: There was a research project focused on how tools spread out and how popular they become on GitHub.

Hugo: Yeah, that’s very interesting.

Noemi: Yeah. So, that’s related.

Hugo: You’ve told us about two of the projects that you love that you’re working on, and you told us at the start that you’ve got three main projects. So, what’s the third one?

Noemi: Oh, yeah. The third one I love working on, too. That’s also something that is brand new and a very fascinating topic for me. So, many people don’t know that AT&T owns DirecTV, and now also, with the acquisition of Turner, we have even more TV data, which creates tremendous research opportunities for us in the TV advertising space, a space I’m still learning a lot about. I find this very exciting because when I joined AT&T, it didn’t even occur to me during the interview process that I might end up working with TV data. So, it’s really awesome.

Hugo: That’s really cool. This is a relatively new position for you, as you said. So, I’m wondering … We’ve got a lot of listeners out there who are aspiring data scientists and working data scientists, and I’m just wondering what advice you’d give someone who’d be interested in this type of job. What types of skills and tools would they need to have?

Noemi: Sure. We are constantly looking to hire, so I’m very happy to share that information for people who are interested. Since we are a research lab, we are looking for people with PhDs, because we seek candidates who possess domain expertise and have research experience. Of course, we love seeing people with genuine enthusiasm who are excited about new data and know how to get the best out of it, and this requires great technical skills, high integrity in using the data, and being innovative, as implied by our inventive scientist job titles. Last but not least, our research lab is a very collaborative environment.

Noemi: So, you can come up with project ideas or get involved in projects with other team members. So, a critical soft skill that we are looking for is the ability to successfully collaborate with others. AT&T Labs research also promotes academic collaborations. We can publish, as I mentioned, and also, many of my colleagues have ongoing academic collaborations, so being collaborative in our field is a critical skill that we are looking for.

What does the future of data science look like to you?

Hugo: For sure. So, we’ve bounced back and forth between the current data science work that you’re involved in and its impact on the future of AT&T. I’m wondering, more generally, what the future of data science looks like to you?

Noemi: It’s funny that you ask me because I have a … So, yesterday I read a Forbes article saying that data science won’t be around in 2029-

Hugo: Great.

Noemi: … and it’s very funny because I have the opposite opinion. So, in my opinion, data science has been around for a while, even since before it was called data science, and I think it will be around for even longer.

Hugo: Of course, though, in 2029 it will be around, but it may not be called data science anymore. Right?

Noemi: Right. So, even before, it was data analyst, or it had different-

Hugo: Data mining, a lot of-

Noemi: Right, and then there are other people who are working with data that … For example, when I was doing my research as a PhD, what was I doing? I didn’t even know what term to give to it. I was just hearing from traditional physics professors that this is not physics, and I never knew what to call it. So, now we have a term for it, and maybe in the future it’s going to change, but the job itself, the role, I think it’s going to be around for a long time because this field requires a sciencey, innovative mindset, and I think there will be plenty of opportunities in this field in the future. I think the part of data science that changes rapidly is the tools that we make use of, from how to ingest large-scale data to how to evaluate things, interpret predictive models, so this is changing very rapidly, and that’s why data scientists have to constantly keep learning to be able to keep up with the rapid technological advances, but the data scientist role in itself is gonna be around.

Hugo: I think so, and I think we’re also gonna see data skills, data literacy, and data fluency spread across organizations in really interesting ways as well. I mean, something we’re thinking about a lot is what do product managers need to know about data science and statistics? What do VPs of Marketing need to know? What does the C-suite need to know? Do they need to know the basic definitions of metrics for machine learning models, and maybe a bit about class imbalances, for example, right?

Noemi: It’s very cool because the data science fellowship that I participated in last year, Insight Data Science fellowship, they also launched a product management fellowship, which is really awesome.

Hugo: That’s really cool. I’d actually love to know more about that … I’m gonna look into it, because I do think that the relationship between product management and data science is not so much ill-defined as the way we talk about it is, and it’s becoming more and more important. But that’s for another conversation. I’d like to wrap up with a couple of questions. I’d just love to know what one of your favorite data sciencey things to do is, as a technique or methodology, or anything.

Noemi: For me, the data science process as a whole is my favorite because I liked these crime novels or movies-

Hugo: Awesome.

Noemi: … and I always feel like that’s what I’m doing when I get the data. Throughout the EDA, the exploratory data analysis, I feel like a detective finding these puzzle pieces in the data, and then in the modeling part, I’m putting the pieces together to reveal the story. So, for me, it’s like, oh, I feel like a detective in a safe space. I don’t deal with criminals. But to answer your question, one methodology that is one of my several favorites is probably text data vectorization, because it’s so simple, yet I find it so fascinating how easily you can extract features from unstructured text data with this very simple technique, and you can use it for feature extraction, natural language processing, and building models. So, I find it really cool.
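A minimal bag-of-words sketch in base R (with invented example documents) of the kind of text vectorization Noemi mentions:

docs <- c("the data tell a story",
          "a detective reads the data",
          "models put the pieces together")

tokens <- strsplit(tolower(docs), "\\s+")   # split each document into words
vocab  <- sort(unique(unlist(tokens)))      # shared vocabulary across documents

# document-term matrix: one row per document, one column per word, entries are counts
dtm <- t(sapply(tokens, function(words) table(factor(words, levels = vocab))))
rownames(dtm) <- paste0("doc", seq_along(docs))
dtm

Each document becomes a numeric vector of word counts, ready to feed into a model.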

Hugo: That’s awesome, and although I agree that when performing data analysis and doing data science you’re not uncovering criminals, a lot of the code people write and a lot of the process is almost criminal as well. I mean, we’re still establishing best practices. Also, I find it really interesting that unstructured data, text data, and natural language processing are part of your answer to this, because a lot of the techniques, work, and research you’ve done that we’ve discussed today don’t necessarily involve text data. So, that’s kind of cool, to know that this is another interest of yours.

Noemi: Actually, in my analysis of the open NASA dataset collection, I did natural language processing, and I’m also developing a course for Pearson called Natural Language Processing for Hackers, which is gonna be out, hopefully, soon.

Call to Action

Hugo: Okay, great. So, my final question is, do you have a call to action for our listeners out there?

Noemi: Yeah. Check out our website at AT&T Labs. It’s about.att.com/sites/labs, and there’s cool research there; you can learn more about what we do and how you can get involved, because we’re always looking for young talent to join our growing team. I’ve now actually shared with you what type of skills we’re looking for, so yeah, we’re very interested in new talent.

Hugo: Fantastic. We’ll put that link in the show notes as well, and for all of you who do reach out, mention that you heard our conversation on DataFramed as well. But Noemi, I’d just like to thank you so much for coming on the show. It’s been so much fun.

Noemi: Thank you so much for inviting me. It was great being here.


The Future is now: Maschinelles Lernen in R für Fortgeschrittene

Mon, 03/11/2019 - 11:00

(This article was first published on R-Programming – Statistik Service, and kindly contributed to R-bloggers)

Machine learning, also known as artificial intelligence, is based on algorithms that learn from examples and experience without being explicitly programmed. Instead of writing code, you feed data into the algorithm, and it "learns" relationships in the data (hence the term "machine learning"). In recent years, a wide variety of machine learning algorithms and artificial intelligence methods have become increasingly popular. It is therefore becoming ever more important in applied statistics to master the basics of techniques such as support vector machines. In this blog post we cover the general foundations of machine learning and then work through two concrete examples in R. Basic knowledge of R is assumed.

For concrete statistical consulting on machine learning / artificial intelligence with your own data, simply book an appointment for a statistics consultation with us!

Supervised vs. unsupervised machine learning

Machine learning / artificial intelligence is a subfield of data mining. It can be roughly divided into two areas: supervised and unsupervised learning.

Supervised machine learning: predicting the output from the input data

In supervised learning you have input variables (X) and an output variable (Y) (for example, X could be the number of hours spent studying for an exam and Y the resulting grade), and you use an algorithm to "learn" a function f from the input to the output:

Y = f(X)

The goal is to approximate f well enough that, given new input data (X), you can predict the output variable (Y) for those data. Different supervised learning algorithms differ mainly in how f is chosen.

In a sense, we already know the right "answers" Y to the "questions" X. For example, we might already have data on several students, their study effort, and the grades they achieved, and we could use these data as training data. The algorithm then iteratively makes predictions on the training data and is corrected by the "teacher". Learning ends when the algorithm reaches an acceptable level of performance. In our example, an acceptable level of performance would mean that the algorithm can predict the achieved grade from the study effort with high probability.
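As a minimal illustration (with made-up numbers, not data from this article), here is what "learning" f can look like in R for the study-hours example, using a simple linear model:

# Toy data: predict an exam grade (Y) from hours of study (X)
set.seed(1)
hours <- runif(50, 0, 40)                                  # input X: study time in hours
grade <- pmin(100, 40 + 1.5 * hours + rnorm(50, sd = 8))   # output Y: exam score with noise

f_hat <- lm(grade ~ hours)                                 # "learn" f from the training data
predict(f_hat, newdata = data.frame(hours = 25))           # predict Y for a new student

Here the linear model plays the role of f; the supervised methods listed below differ mainly in the family of functions they fit.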

Supervised learning methods can be divided into classification and regression problems. Examples include support vector machines, linear regression, logistic regression, naive Bayes, decision trees, the k-nearest-neighbor algorithm, and neural networks. When people speak of "artificial intelligence", they often mean one of these methods.

Unsupervised machine learning: discovering hidden structure in the data

In unsupervised learning, by contrast, you only have input data (X) and no corresponding output variables. For example, the input data could be people's places of residence; the associated clusters would then be cities. A restaurant chain could use the results of such a cluster analysis for site planning, and cluster analysis could also be used to evaluate surveys and group the answers into corresponding clusters.

The goal of unsupervised learning is to model the underlying structure or distribution of the data, so that the structure tells you more about the data.

This is called unsupervised learning because, unlike in supervised learning, there are no correct answers and there is no teacher. The algorithms are left to their own devices to discover and represent the interesting structure in the data.

Example of supervised learning: support vector machines

A typical representative of supervised learning methods is the support vector machine. This method separates data using high-dimensional separating surfaces. This section describes how to implement support vector machines in R using the kernlab package. As we will see, R makes it easy to visualize these separating surfaces (as long as the data are two-dimensional). A simple theoretical introduction can be found here, for example.

1. Generating and plotting the data

First we generate data that are not linearly separable (there is no straight line in 2D that cleanly splits the data into two classes). We create a two-dimensional matrix with 120 rows (x ∈ R^(120×2)). The first 60 rows are drawn from a normal distribution with mean 0 and the last 60 rows from a normal distribution with mean 3.

library(kernlab) # kernlab provides the SVM functions
library(ggplot2) # ggplot2 for plotting
set.seed(6) # set.seed ensures reproducibility
x = rbind(matrix(rnorm(120), , 2), matrix(rnorm(120, mean = 3), , 2)) # 120x2 matrix: first 60 rows ~ N(0,1), last 60 rows ~ N(3,1)
y = matrix(c(rep(1, 60), rep(-1, 60))) # class labels: 1 for the first 60 points, -1 for the last 60 (reconstructed; the original line was cut off)

To visualize the data we then use the ggplot2 package. This is the most widely used package for quick and flexible data visualization in R.

d = data.frame(x = x, y = y) # create a data frame from the coordinates and the class labels
names(d) <- c("x1", "x2", "y")
qplot(x1, x2, data = d, color = factor(y)) + geom_point(shape = 1) +
  scale_colour_manual(values = c("#0000FF", "#00FF00"), labels = c("1", "-1"))

The code above draws the data on a 2-D grid and colors the points according to their class y.

Clustered input data (X) to be separated by a support vector machine

2. Classifying the data with a linear separating function

As you can easily see, there is no way to separate the data points linearly without making mistakes. If you do so anyway, the algorithm performs a so-called "soft margin" classification, in which misclassified data points are penalized according to their distance from the separating hyperplane. In this way the algorithm still finds the best separating surface.
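Formally, the soft-margin SVM solves the standard optimization problem below, where the slack variables \xi_i measure how far a point lies on the wrong side of its margin and C weights the penalty:

\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad\text{subject to}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0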

The following code fits the support vector machine with a linear kernel:

svp = ksvm(y ~ x1 + x2, data = d, type = "C-svc", C = 1, kernel = "vanilladot")

The parameter C controls the penalty mentioned above. We will see that finding a good value for C is crucial for the generalizability of the SVM. In this example code, C is set to 1.
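One simple way to look for a good C is k-fold cross-validation: kernlab's ksvm() accepts a cross argument, and cross() returns the resulting error. A minimal sketch, with an illustrative grid of C values:

c_grid <- c(0.01, 0.1, 1, 10, 100)            # illustrative candidate values for C
cv_error <- sapply(c_grid, function(c_val) {
  m <- ksvm(y ~ x1 + x2, data = d, type = "C-svc",
            C = c_val, kernel = "vanilladot", cross = 5)
  cross(m)                                    # 5-fold cross-validation error
})
data.frame(C = c_grid, cv_error)

The value of C with the lowest cross-validation error is usually a reasonable choice for new data.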

Here we can now see what the classification looks like:

plot(svp, data = d)

The plot of the resulting SVM is a so-called contour plot, in which the corresponding support vectors are highlighted (bold). The second plot shows the same SVM, this time with C = 100. Here the SVM tries very hard not to misclassify anything, because C sets the penalty very high:

Support vector machine result, linear separating function, C = 1
Support vector machine result, linear separating function, C = 100

3. Classifying the data with an RBF separating function

Another important parameter is kernel. This parameter specifies how the separating surface is constructed. Let's try the so-called RBF kernel:

svp = ksvm(y ~ x1 + x2, data = d, type = "C-svc", C = 1, kernel = "rbfdot")

Support vector machine result, RBF separating function

We see that both the choice of C and the choice of the separating surface (the kernel) have a large influence on the result of the algorithm.

Example of unsupervised learning: k-means clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms (a simple theoretical example can be found here, for example). Below we work through a simple applied example.

Step 1: Generating clustered data

We again generate 200 two-dimensional data points. They are scattered around the values (5, 15) and (15, 5), with normally distributed deviations.

c1 = cbind(rnorm(100, mean=5), rnorm(100, mean=15)) # upper-left cluster around (5, 15)
c2 = cbind(rnorm(100, mean=15), rnorm(100, mean=5)) # lower-right cluster around (15, 5)
data = rbind(c1, c2) # stack both clusters into one 200 x 2 matrix (first rows shown below)
plot(data)

Clustered data to be separated by the algorithm

The first rows of the two clusters look like this:

c1:  data[,1]   data[,2]
     6.370958   16.20097
     4.435302   16.04475
     …          …
c2:  12.99907   4.995379
     15.33378   5.760242
     …          …

Step 2: Implementing the k-means algorithm

For the k-means clustering algorithm, R has a built-in function called kmeans. You only have to pass it the data to be grouped and the number of clusters, which we set to 2 to match the two groups generated above.

cluster = kmeans(data, 2)
cluster

R then performs the clustering. When we call cluster, we get the following output:

K-means clustering with 2 clusters of sizes 100, 100
Cluster means:
5.19192 14.997198
14.70764 4.930212
Clustering vector:
1 1 […] 2 2
Within cluster sum of squares by cluster:
258.1744 160.2776
(between_SS / total_SS = 95.8 %)
Available components:
“cluster” “centers” “totss” “withinss” “tot.withinss” “betweenss” “size” “iter” “ifault”

The output above tells us the following:

  • what was done (clustering with 2 clusters of sizes 100, 100)
  • the centroids of the clusters, i.e. their central points (this is quite close to what we would expect, since the data come from bivariate normal distributions with means (5, 15) and (15, 5)) (cluster means)
  • a vector with 200 entries describing the cluster label assigned to each data point (clustering vector)
  • and the within-cluster sum of squares per cluster, i.e. the sum of squared deviations of each data point from its cluster centroid (within cluster sum of squares by cluster); this quantity can also be used to check the number of clusters, as in the sketch after this list
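A common sanity check is the "elbow" plot: compute the total within-cluster sum of squares for several values of k and look for the point where the curve stops dropping sharply. A minimal sketch (the range of k values is only an illustration):

wss <- sapply(1:6, function(k) kmeans(data, centers = k, nstart = 10)$tot.withinss)  # total within-cluster SS for k = 1..6
plot(1:6, wss, type = "b", xlab = "number of clusters k",
     ylab = "total within-cluster sum of squares")

For these data the curve bends sharply at k = 2, which matches the two clusters we generated.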
Step 3: Visualizing the results

# plot the data points, colored by their assigned cluster
plot(data, col = cluster$cluster)
# add the cluster centers in black
points(cluster$centers, pch = 16, cex = 2)

Result of the cluster analysis

This was a very simple example with two nicely separated clusters of points. In practical applications of artificial intelligence, however, the points are often higher-dimensional and therefore no longer easy to visualize. If you need help with your statistical project, feel free to contact us!

Summary

In this article we gave a first introduction to the basics of machine learning and explained the difference between supervised and unsupervised learning:

  • Supervised learning: all data are labeled and the algorithms learn to predict the output from the input data.
  • Unsupervised learning: the data are unlabeled and the algorithms learn the inherent structure from the input data.

In recent years, many machine learning / artificial intelligence techniques have gained importance in statistics. Examples include support vector machines, linear regression, logistic regression, naive Bayes, decision trees, the k-nearest-neighbor algorithm, neural networks, and k-means clustering. If you need help choosing and carrying out your data mining project, our experts are happy to help with your personal statistics project!

The post The Future is now: Maschinelles Lernen in R für Fortgeschrittene first appeared on Statistik Service.


Ranking places with Google to create maps

Mon, 03/11/2019 - 10:08

(This article was first published on R [english] – NC233, and kindly contributed to R-bloggers)

Today we’re going to use the googleway R package, which allows its users to make requests to the Google Maps Places API. The goal is to create maps of specific places (restaurants, museums, etc.) with information from Google Maps rankings (the number of stars given by other users). I already discussed this in French here, where I ranked swimming pools in Paris. Let’s start by loading the three libraries I’m going to use: googleway, leaflet to create interactive maps, and RColorBrewer for color ranges.

library(googleway)
library(leaflet)
library(RColorBrewer)

First things first: to make API requests to Google, we need an API key; you can ask for one here. We’ll use this key for the rest of the program, so let’s declare a global variable:

api_key <- "YourKeyHereIWontGiveYouMine"

Retrieving Google Data

We’re going to use the google_places function to get a list of places matching a description, called research in my program (for instance: “Restaurant, Paris, France”). The outputs are multiple, and I’m going to store the place ID and the rating. I’ll also store the next-page token returned with the search results; that will be explained later.

gmaps_request <- google_places(search_string = research, language = language, key = api_key)
gmaps_data <- gmaps_request$results

place_id <- gmaps_data$place_id
rating <- gmaps_data$rating
token <- gmaps_request$next_page_token

This function returns up to 20 places associated with the search. If you want more than 20, you need to use the token previously stored in order to ask the Google Places API for the next results, by tweaking the function call this way:

gmaps_request <- google_places(search_string = research, language = language, key = api_key, page_token = token)

There are two caveats with this function. Firstly, the token can be NULL, in which case there are no further results to retrieve; this happens automatically as soon as you reach 60 results. Secondly, the API needs a little time before the next-page token becomes usable (see here); that’s why we’re going to make R wait a few seconds between requests, using Sys.sleep(time). Our complete code is therefore:

gmaps_request <- google_places(search_string = research, language = language, key = api_key)
gmaps_data <- gmaps_request$results

place_id <- gmaps_data$place_id
rating <- gmaps_data$rating
token <- gmaps_request$next_page_token

Sys.sleep(5)
continue <- TRUE

while (continue) {
  gmaps_request <- google_places(search_string = research, language = language,
                                 key = api_key, page_token = token)
  gmaps_data <- gmaps_request$results

  if (!is.null(gmaps_request$next_page_token)) {
    place_id <- c(place_id, gmaps_data$place_id)
    rating <- c(rating, gmaps_data$rating)
    token <- gmaps_request$next_page_token
    Sys.sleep(5)
  } else {
    continue <- FALSE
  }
}

Now we’re going to look up the spatial coordinates of the places we found. To this end, we’re going to use the google_place_details function of the package and retrieve latitude and longitude with these two functions:

get_lat <- function(id, key, language) {
  id <- as.character(id)
  details <- google_place_details(id, language = language, key = key)
  return(details$result$geometry$location$lat)
}

get_lng <- function(id, key, language) {
  id <- as.character(id)
  details <- google_place_details(id, language = language, key = key)
  return(details$result$geometry$location$lng)
}

All these blocks add up to build the complete function:

get_gmaps_data <- function(research, api_key, language) {
  gmaps_request <- google_places(search_string = research, language = language, key = api_key)
  gmaps_data <- gmaps_request$results

  place_id <- gmaps_data$place_id
  rating <- gmaps_data$rating
  token <- gmaps_request$next_page_token

  Sys.sleep(5)
  continue <- TRUE

  while (continue) {
    gmaps_request <- google_places(search_string = research, language = language,
                                   key = api_key, page_token = token)
    gmaps_data <- gmaps_request$results

    if (!is.null(gmaps_request$next_page_token)) {
      place_id <- c(place_id, gmaps_data$place_id)
      rating <- c(rating, gmaps_data$rating)
      token <- gmaps_request$next_page_token
      Sys.sleep(5)
    } else {
      continue <- FALSE
    }
  }

  lat <- sapply(place_id, get_lat, key = api_key, language = language)
  lng <- sapply(place_id, get_lng, key = api_key, language = language)

  return(data.frame(place_id, rating, lat, lng))
}

Map plot

The next part is more classical. We’re going to rank the ratings in the data frame built by the previous function in order to arrange the places into different groups, and each group will be associated with a color on the map. If we want to make number_colors groups with the color scheme color (for instance, “Greens”), we use the following instructions:

color_pal <- brewer.pal(number_colors, color)
pal <- colorFactor(color_pal, domain = seq(1, number_colors))

plot_data <- gmaps_data
plot_data$ranking <- ceiling(order(gmaps_data$rating) * number_colors / nrow(plot_data))

The definitive function just needs the addition of the leaflet call:

show_map <- function(number_colors, gmaps_data, color = "Greens") {
  color_pal <- brewer.pal(number_colors, color)
  pal <- colorFactor(color_pal, domain = seq(1, number_colors))

  plot_data <- gmaps_data
  plot_data$ranking <- ceiling(order(gmaps_data$rating) * number_colors / nrow(plot_data))

  leaflet(plot_data) %>%
    addTiles() %>%
    addCircleMarkers(
      radius = 6,
      fillColor = ~pal(ranking),
      stroke = FALSE,
      fillOpacity = 1
    ) %>%
    addProviderTiles(providers$CartoDB.Positron)
}

Examples

I just need to combine these two functions into one, and then run some food-related examples!

maps_ranking_from_gmaps <- function(research, api_key, language, number_colors = 5, color = "Greens") {
  show_map(number_colors, get_gmaps_data(research, api_key, language), color)
}

maps_ranking_from_gmaps("Macaron, Paris, France", api_key, "fr")
maps_ranking_from_gmaps("Macaron, Montreal, Canada", api_key, "fr")
maps_ranking_from_gmaps("Poutine, Montreal, Canada", api_key, "fr", 5, "Blues")
maps_ranking_from_gmaps("Poutine, Paris, France", api_key, "fr", 5, "Blues")

which returns the following maps:

Macaron in Paris, France

Macaron in Montreal, Canada

Poutine in Montreal, Canada

Poutine in Paris, France (I guess French people are not ready for this)


This is not normal(ised)

Mon, 03/11/2019 - 07:00

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

“Sydney stations where commuters fall through gaps, get stuck in lifts” blares the headline. The story tells us that:

Central Station, the city’s busiest, topped the list last year with about 54 people falling through gaps

Wow! Wait a minute…

Central Station, the city’s busiest

Some poking around in the NSW Transport Open Data portal reveals how many people enter every Sydney train station on a “typical” day in 2016, 2017 and 2018. We could manipulate those numbers in various ways to estimate total, unique passengers for FY 2017-18 but I’m going to argue that the value as-is serves as a proxy variable for “station busyness”.

Grabbing the numbers for 2017:

library(tidyverse)

tibble(station = c("Central", "Circular Quay", "Redfern"),
       falls = c(54, 34, 18),
       entries = c(118960, 27870, 30570)) %>%
  mutate(falls_per_entry = falls / entries) %>%
  select(-entries) %>%
  gather(Variable, Value, -station) %>%
  ggplot(aes(station, Value)) +
  geom_col() +
  facet_wrap(~Variable, scales = "free_y")

Looks like Circular Quay has the bigger problem. Now we have a data story. More tourists? Maybe improve the signage.

Deep in the comment thread, amidst the “only themselves to blame” crowd, one person gets it:
