R-bloggers
R news and tutorials contributed by hundreds of R bloggers

M4 Forecasting Competition

Sun, 11/19/2017 - 01:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

The “M” competitions organized by Spyros Makridakis have had an enormous influence on the field of forecasting. They focused attention on what models produced good forecasts, rather than on the mathematical properties of those models. For that, Spyros deserves congratulations for changing the landscape of forecasting research through this series of competitions.
Makridakis & Hibon (JRSSA, 1979) was the first serious attempt at a large-scale empirical evaluation of forecast methods.


RStudio Keyboard Shortcuts for Pipes

Sat, 11/18/2017 - 19:32

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I have just released some simple RStudio add-ins that are great for creating keyboard shortcuts when working with pipes in R.

You can install the add-ins from here (which also includes both installation instructions and use instructions/examples).
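For readers unfamiliar with how RStudio add-ins work under the hood, here is a minimal, generic sketch of an add-in function that inserts a magrittr pipe at the cursor. This is my own illustration using rstudioapi, not the Win-Vector code, and insert_pipe_addin is a made-up name; once a function like this is exported by an installed package and registered as an add-in, you can bind it to a keystroke via Tools > Modify Keyboard Shortcuts.

# Generic sketch (not the Win-Vector implementation) of an RStudio add-in
# that inserts a pipe at the current cursor position.
insert_pipe_addin <- function() {
  rstudioapi::insertText(text = " %>% ")
}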


Statebins Reimagined

Sat, 11/18/2017 - 16:33

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

A long time ago, in a github repo far, far away, there lived a tiny package, dubbed statebins, that made it possible to create equal-area, square U.S. state cartograms in R. Three years have come and gone and — truth be told — I’ve never been happy with that package. It never felt “right” and that gnawing feeling finally turned into action with a “re-imagining” of the API.

Previously, on statebins…

There were three different functions in the old-style package:

  • one for discrete scales (it automated ‘cuts’)
  • one for continuous scales
  • one for manual scales

It also did some hack-y stuff with grobs to try to get things to look good without putting too much burden on the user.

All that “mostly” worked, but I always ended up doing some painful workaround when I needed more custom charts (not that I have to use this package much given the line of work I’m in).

Downsizing statebins

Now, there’s just one function for making the cartograms — statebins() — and another for applying a base theme — theme_statebins(). The minimalisation has some advantages that we’ll take a look at now, starting with the most basic example (the one on the manual page):

data(USArrests)
USArrests$state <- rownames(USArrests)

statebins(USArrests, value_col="Assault", name = "Assault") +
  theme_statebins(legend_position="right")

Two things should stand out there:

  • you got scale_fill_distiller() for free!
  • labels are dark/light depending on the tile color

Before we go into ^^, it may be helpful to show the new function interface:

statebins(state_data, state_col = "state", value_col = "value",
          dark_label = "black", light_label = "white", font_size = 3,
          state_border_col = "white", state_border_size = 2,
          ggplot2_scale_function = ggplot2::scale_fill_distiller, ...)

You pass in the state name/abbreviation & value columns like the old interface but also specify colors for the dark & light labels (set hex code color with 00 ending alpha values if you don’t want labels but Muricans are pretty daft and generally need the abbreviations on the squares). You can set the font size, too (we’ll do that in a bit) and customize the border color (usually to match the background of the target medium). BUT, you also pass in the ggplot2 scale function you want to use and the named parameters for it (that’s what the ... is for).

So, yes I’ve placed more of a burden on you if you want discrete cuts, but I’ve also made the package way more flexible and made it possible to keep the labels readable without you having to lift an extra coding finger.

The theme()-ing is also moved out to a separate theme function which makes it easier for you to further customize the final output.

But that’s not all!

There are now squares for Puerto Rico, the Virgin Islands and New York City (the latter two were primarily for new features/data in cdcfluview but they are good to have available). Let’s build out a larger example with some of these customizations (we’ll make up some data to do that):

library(statebins)
library(tidyverse)
library(viridis)

data(USArrests)

# make up some data for the example
rownames_to_column(USArrests, "state") %>%
  bind_rows(
    data_frame(
      state = c("Virgin Islands", "Puerto Rico", "New York City"),
      Murder = rep(mean(max(USArrests$Murder),3)),
      Assault = rep(mean(max(USArrests$Assault),3)),
      Rape = rep(mean(max(USArrests$Rape),3)),
      UrbanPop = c(93, 95, 100)
    )
  ) -> us_arrests

statebins(us_arrests, value_col="Assault",
          ggplot2_scale_function = viridis::scale_fill_viridis) +
  labs(title="USArrests + made up data") +
  theme_statebins("right")

Cutting to the chase

I still think it makes more sense to use binned data in these cartograms, and while you no longer get that for “free”, it’s not difficult to do:

adat <- suppressMessages(read_csv("http://www.washingtonpost.com/wp-srv/special/business/states-most-threatened-by-trade/states.csv?cache=1"))

mutate(
  adat,
  share = cut(avgshare94_00, breaks = 4, labels = c("0-1", "1-2", "2-3", "3-4"))
) %>%
  statebins(
    value_col = "share",
    ggplot2_scale_function = scale_fill_brewer,
    name = "Share of workforce with jobs lost or threatened by trade"
  ) +
  labs(title = "1994-2000") +
  theme_statebins()

More manual labor

You can also still use hardcoded colors, but it’s a little more work on your end (but not much!):

election_2012 <- suppressMessages(read_csv("https://raw.githubusercontent.com/hrbrmstr/statebins/master/tmp/election2012.csv"))

mutate(election_2012, value = ifelse(is.na(Obama), "Romney", "Obama")) %>%
  statebins(
    font_size=4, dark_label = "white", light_label = "white",
    ggplot2_scale_function = scale_fill_manual,
    name = "Winner",
    values = c(Romney = "#2166ac", Obama = "#b2182b")
  ) +
  theme_statebins()

BREAKING NEWS: Rounded corners

A Twitter request ended up turning into a new feature this afternoon (after I made this post) => rounded corners:

data(USArrests)
USArrests$state <- rownames(USArrests)

statebins(USArrests, value_col="Assault", name = "Assault", round=TRUE) +
  theme_statebins(legend_position="right")

FIN

It’ll be a while before this hits CRAN and I’m not really planning on keeping the old interface when the submission happens. So, it’ll be on GitHub for a bit to let folks chime in on what additional features you want and whether you really need to keep the deprecated functions around in the package.

So, kick the tyres and don’t hesitate to shoot over some feedback!


Seasonality of plagues by @ellis2013nz

Sat, 11/18/2017 - 12:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Seasonality of plague deaths

In the course of my ramblings through history, I recently came across Mark R. Welford and Brian H. Bossak’s Validation of Inverse Seasonal Peak Mortality in Medieval Plagues, Including the Black Death, in Comparison to Modern Yersinia pestis-Variant Diseases. This article is part of an interesting epidemiological literature on the seasonality of medieval and early modern European plagues and the twentieth century outbreaks of plague caused by the Yersinia pestis bacterium. Y. pestis is widely credited (if that’s the word) for the 6th century Plague of Justinian, the mid 14th century Black Death, and the Third Pandemic from the mid nineteenth to mid twentieth centuries. The Black Death killed between 75 and 200 million people in Eurasia in only a few years; the Third Pandemic killed at least 12 million. These are important diseases to understand!

In their 2009 article, Welford and Bossak added additional evidence and rigour to the argument that the difference in seasonality of peak deaths between the first two suspected bubonic plague pandemics and the Third Pandemic indicates a material difference in the diseases, whether it be a variation in Y. pestis itself, its mode and timing of transmission, or even the presence of an unknown human-to-human virus. This latter explanation would account for the extraordinary speed of spread of the Medieval Black Death, which I understand is surprising for a rodent-borne bacterial disease. These are deep waters; as a newcomer to the topic there’s too much risk of me getting it wrong if I try to paraphrase, so I’ll leave it there and encourage you to read specialists in the area.

Plot polishing

One of Welford and Bossak’s graphics caught my eye. Their original is reproduced at the bottom of this post. It makes the point of the differing seasonalities in the different outbreaks of plague but is pretty cluttered. I set myself the job of improving it as a way of familiarising myself with the issues. My first go differed from theirs only in minimal respects:

  • split the graphic into four facets for each plague agent / era to allow easier comparison
  • direct labels of the lines rather than relying on a hard-to-parse legend
  • added a grey ribbon showing a smoothed overall trend, to give a better idea of the “average” seasonality of the particular era’s outbreak in each facet
  • removed gridlines

All the R code for today’s post, from download to animation, is collected at the end of the post.

I think the grey ribbon of the smoothed average line is the best addition here. It makes it much clearer that there is an average seasonality in the European (medieval and early modern) plagues not apparent in the Indian and Chinese examples.

I wasn’t particularly happy with the logarithm transform of the vertical axis. While this is a great transformation for showing changes in terms of relative difference (eg growth rates), when we’re showing seasonality like this the natural comparison is of the distance of each point from the bottom. It is natural to interpret the space as proportionate to a number, as though there is no transformation. In effect, the transformation is hiding the extent of the seasonality.

I suspect the main reason for the use of the transform at all was to avoid the high numbers for some outbreaks hiding the variation in the smaller ones. An alternative approach is to allow the vertical axes of some of the facets to use varying scales. This is legitimate so long as any comparison of magnitudes is within each facet, and comparisons across facets are of trends and patterns, not magnitude. That is the case here. Here’s how that looks:

I think this is a significant improvement. The concentration of deaths in a few months of each outbreak is now much clearer.

There’s an obvious problem with both the charts so far, which is the clutter of text from the labels. I think they are an improvement on the legend, but they are a bit of a mess. This is in fact a classic problem of how to present longitudinal data in what is often termed a “spaghetti plot”, for obvious reasons. Generally, such charts only work if we don’t particularly care which line is which.

I attempted to address this problem by aggregating the samples to the country level, while keeping different lines for different combinations of country and year. I made a colour palette matched to countries so they would have the same colour over time. I also changed the scale to being the proportion of deaths in each month from the total year’s outbreak. If what we’re after is in fact the proportion of deaths in each month (ie the seasonality) rather than the magnitudes, then let’s show that on the graphic:

This is starting to look better.

Animating

While I think it was worthwhile aggregating the deaths in those small medieval towns and villages to get a less cluttered plot, it does lose some lovely granular detail. Did you know, for instance, that the medieval Welsh town of Abergwillar is completely invisible on the internet other than in discussion of the plague in 1349? One way around this problem of longitudinal data where we have a time series for many cases, themselves of some interest, is to animate through the cases. This also gave me a chance to use Thomas Pedersen’s handy tweenr R package to smooth the transitions between each case study.

Note how I’ve re-used the country colour palette, and used the extra visual space given by spreading facets out over time (in effect) to add some additional contextual information, like the latitude of each location.

Some innocent thoughts on the actual topic

This was a toe in the water for something I’d stumbled across while reading a history book. There’s lots of interesting questions here. Plague is in the news again today. There’s a big outbreak in Madagascar, with 171 deaths between August and mid November 2017 and more than 2,000 infections. And while I have been writing this blog, the Express has reported on North Korea stockpiling plague bacteria for biological warfare.

On the actual topic of whether seasonality of deaths tells us that the medieval Black Death was different from 19th century Chinese and Indian plagues, I’d like to see more data. For example, what were the death rates by month in China in the 14th century, where the outbreak probably began? Presumably not available.

Nineteenth century Bombay was different from a medieval European village in many important respects. Bombay’s winter is in fact comparable in temperature to Europe’s summer, so these peak times for plague deaths perhaps have something in common. On the other hand, Manchuria is deadly, freezing cold in its winter, when pneumonic plague deaths peaked in our data.

Interesting topic.

R code

Here’s the R code that did all that. It’s nothing spectacular. I actually typed the data by hand from the original table, which was only available, as far as I can see, as an image of a table. Any interest is likely to be in the particular niceties of some of the plot polishing, eg using a consistent palette for country colour when colour is actually mapped to the country-year combination in the data.

Although tweenr is often used in combination with the animate R package, I prefer to do what animate does manually: save the individual frames in a location I set myself and call ImageMagick directly via a system() call. I find the animate package adds a layer of problems – particularly on Windows – that sometimes gives me additional headaches. Calling another application via system() works fine.

library(tidyverse)
library(scales)
library(openxlsx)
library(directlabels)
library(forcats)
library(RColorBrewer)
library(stringr)
library(tweenr)

#========download and prepare data=========
download.file("https://github.com/ellisp/ellisp.github.io/raw/master/data/plague-mortality.xlsx",
              destfile = "plague-mortality.xlsx", mode = "wb")

orig <- read.xlsx("plague-mortality.xlsx", sheet = "data")

d1 <- orig %>%
  rename(place = Place, year = Date, agent = Agent, country = Country, latitude = Latitude) %>%
  gather(month, death_rate, -place, -year, -agent, -country, -latitude) %>%
  # put the months' levels in correct order so they work in charts etc
  mutate(month = factor(month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")),
         month_number = as.numeric(month)) %>%
  mutate(place_date = paste(place, year),
         agent = fct_reorder(agent, year)) %>%
  arrange(place, year, month_number)

# define a caption for multiple use:
the_caption <- str_wrap("Graphic by Peter Ellis, reworking data and figures published by Mark R. Welford and Brian H. Bossak 'Validation of Inverse Seasonal Peak Mortality in Medieval Plagues, Including the Black Death, in Comparison to Modern Yersinia pestis-Variant Diseases'", 100)

#============faceted version of original graphic=====================
p1 <- d1 %>%
  ggplot(aes(x = month_number, y = death_rate, colour = place_date)) +
  geom_smooth(aes(group = NULL), se = TRUE, colour = NA, size = 4) +
  geom_line(aes(x = month_number), size = 0.9) +
  scale_y_log10("Number of deaths in a month", limits = c(1, 1000000), label = comma) +
  scale_x_continuous("", breaks = 1:12, labels = levels(d1$month), limits = c(0.5, 12)) +
  facet_wrap(~agent) +
  theme(panel.grid = element_blank()) +
  labs(caption = the_caption) +
  ggtitle("Seasonal deaths by different plague agents",
          "Medieval and early modern Black Death and plague peaked in summer, whereas early 20th Century outbreaks peaked in late winter")

direct.label(p1, list("top.bumpup", fontfamily = "myfont", cex = 0.8)) # good

# version with free y axes as an alternative to the log transform
p1b <- p1 +
  facet_wrap(~agent, scales = "free_y") +
  scale_y_continuous()

direct.label(p1b, list("top.bumpup", fontfamily = "myfont", cex = 0.8))

#==============================y axis as proportion===================
d2 <- d1 %>%
  group_by(country, year, month_number, agent) %>%
  summarise(deaths = sum(death_rate, na.rm = TRUE)) %>%
  mutate(country_date = paste(country, year)) %>%
  group_by(country_date) %>%
  mutate(prop_deaths = deaths / sum(deaths, na.rm = TRUE)) %>%
  # convert the zeroes back to NA to be consistent with original presentation:
  mutate(prop_deaths = ifelse(prop_deaths == 0, NA, prop_deaths))

# Defining a palette. This is a little complex because although colour
# is going to be mapped to country_year, I want each country to have its own
# colour regardless of the year. I'm going to use Set1 of Cynthia Brewer's colours,
# except for the yellow which is invisible against a white background
pal1 <- data_frame(country = unique(d2$country))
pal1$colour <- brewer.pal(nrow(pal1) + 1, "Set1")[-6]

pal2 <- distinct(d2, country, country_date) %>%
  left_join(pal1, by = "country")

pal3 <- pal2$colour
names(pal3) <- pal2$country_date

# draw graphic with colour mapped to country-year combination, but colours come out
# at country level:
p2 <- d2 %>%
  ggplot(aes(x = month_number, y = prop_deaths, colour = country_date)) +
  geom_smooth(aes(group = NULL), se = TRUE, colour = NA, size = 4) +
  # geom_point() +
  geom_line(aes(x = month_number)) +
  scale_color_manual(values = pal3) +
  scale_y_continuous("Proportion of the year's total deaths in each month\n",
                     label = percent, limits = c(0, 0.65)) +
  scale_x_continuous("", breaks = 1:12, labels = levels(d1$month), limits = c(0.5, 12)) +
  facet_wrap(~agent) +
  theme(panel.grid = element_blank()) +
  ggtitle("Seasonal deaths by different plague agents",
          "Medieval and early modern Black Death and plague peaked in summer, whereas early 20th Century outbreaks peaked in late winter") +
  labs(caption = the_caption)

direct.label(p2, list("top.bumpup", fontfamily = "myfont", cex = 0.8))

#============animation===================
d3 <- d1 %>%
  group_by(place_date, country, latitude) %>%
  mutate(prop_deaths = death_rate / sum(death_rate, na.rm = TRUE),
         prop_deaths = ifelse(is.na(prop_deaths), 0, prop_deaths)) %>%
  ungroup() %>%
  mutate(place_date = fct_reorder(place_date, year),
         place_date_id = as.numeric(place_date)) %>%
  arrange(place_date_id, month_number)

place_dates <- levels(d3$place_date)

# change this to a list of data.frames, one for each state
d3_l <- lapply(place_dates, function(x){
  d3 %>%
    filter(place_date == x) %>%
    select(prop_deaths, month, place_date, agent)
})

# use tweenr to create a single data frame combining those 20+ frames in the list,
# with interpolated values for smooth animation:
d3_t <- tween_states(d3_l, tweenlength = 5, statelength = 8,
                     ease = "cubic-in-out", nframes = 600) %>%
  # caution - tween_states loses the information on the ordering of factors
  mutate(month = factor(as.character(month), levels = levels(d1$month)),
         place_date = factor(as.character(place_date), levels = place_dates))

# make a temporary folder to store some thousands of PNG files as individual frames of the animation
unlink("0114-tmp", recursive = TRUE)
dir.create("0114-tmp")

for(i in 1:max(d3_t$.frame)){
  png(paste0("0114-tmp/", i + 10000, ".png"), 2000, 1000, res = 200)

  the_data <- filter(d3_t, .frame == i)
  the_title <- paste("Deaths from", the_data[1, "agent"], "in", the_data[1, "place_date"])

  tmp <- d3 %>%
    filter(place_date == the_data[1, "place_date"]) %>%
    select(place_date, country, latitude) %>%
    distinct()

  the_xlab <- with(tmp, paste0(place_date, ", ", country, ", ", latitude, " degrees North"))

  the_high_point <- the_data %>%
    filter(prop_deaths == max(prop_deaths)) %>%
    slice(1) %>%
    mutate(prop_deaths = prop_deaths + 0.03) %>%
    mutate(lab = agent)

  the_colour <- pal2[pal2$country == tmp$country, ]$colour[1]

  print(
    ggplot(the_data, aes(x = as.numeric(month), y = prop_deaths)) +
      geom_ribbon(aes(ymax = prop_deaths), ymin = 0,
                  fill = the_colour, colour = "grey50", alpha = 0.1) +
      scale_x_continuous(the_xlab, breaks = 1:12, labels = levels(d1$month), limits = c(0.5, 12)) +
      ggtitle(the_title, "Distribution by month of the total deaths that year") +
      theme(panel.grid = element_blank()) +
      geom_text(data = the_high_point, aes(label = lab)) +
      labs(caption = the_caption) +
      scale_y_continuous("Proportion of the year's total deaths in each month\n",
                         label = percent, limits = c(0, 0.68))
  )

  dev.off()

  # counter:
  if (i / 20 == round(i / 20)){cat(i)}
}

# combine all those frames into a single animated gif
pd <- setwd("0114-tmp")
system('magick -loop 0 -delay 8 *.png "0114-plague-deaths.gif"')
setwd(pd)

# clean up
unlink("plague-mortality.xlsx")

The original

As promised, here’s the original Welford and Bossak image. Sadly, they dropped Abergwillar; and they have data for Poona which I couldn’t find in a table in their paper.


Highlights from the Connect(); conference

Fri, 11/17/2017 - 22:44

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Connect();, the annual Microsoft developer conference, is wrapping up now in New York. The conference was the venue for a number of major announcements and talks. Here are some highlights related to data science, machine learning, and artificial intelligence:

Lastly, I wanted to share this video presented at the conference from Stack Overflow. Keep an eye out for R community luminary David Robinson programming in R!

You can find more from the Connect conference, including on-demand replays of the talks and keynotes, at the link below.

Microsoft: Connect(); November 15-17, 2017


padr version 0.4.0 now on CRAN

Fri, 11/17/2017 - 21:30

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

I am happy to share that the latest version of padr just hit CRAN. This new version comprises bug fixes, performance improvements and new functions for formatting datetime variables. But above all, it introduces the custom paradigm that enables you to do asymmetric analysis.

Improvements to existing functions

thicken used to get slowish when the data frame got large. Several adjustments resulted in a considerable performance gain. It is now approximately 6 times faster than in version 0.3.0.

Both pad and thicken used to break with noninformative errors when the datetime variable contained missing values. Both functions now allow for missing values. thicken will leave the record containing the missing datetime value where it is and will enter a missing value in the column added to the data frame. pad will move all the records with missing values in the datetime variable to the end of the data frame and will further ignore them for padding. Both functions throw a warning when the datetime variable has missing values.
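As a small illustration of the behaviour described above (my own sketch, not code from the release notes), padding a data frame whose datetime column contains an NA now warns rather than errors:

# my own sketch of the NA handling described above
library(padr)
df <- data.frame(day = as.Date(c("2017-01-01", NA, "2017-01-04")), x = 1:3)
# pads the days between Jan 1 and Jan 4; the NA row is moved to the end,
# ignored for padding, and a warning is thrown
pad(df, interval = "day")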

New functions

span_date and span_time are wrappers around seq.Date and seq.POSIXt and were previously described in this blog post.
Two new functions are introduced to reformat datetime variables. These can be especially useful for showing the data in a plot or a graph. center_interval shifts the datetime point from the beginning of the interval to its (approximate) center. Bar, line and dot plots, especially, will reflect the data in a more accurate way after centering.

library(tidyverse)
library(padr)

jan_first <- emergency %>%
  filter(as.Date(time_stamp, tz = "EST") == as.Date("2016-01-01")) %>%
  thicken("3 hour") %>%
  count(time_stamp_3_hour)

ggplot(jan_first, aes(time_stamp_3_hour, n)) +
  geom_bar(stat = "identity")

jan_first %>%
  mutate(ts_center = center_interval(time_stamp_3_hour)) %>%
  ggplot(aes(ts_center, n)) +
  geom_bar(stat = "identity")

Secondly, there is the format_interval function, which creates a character representation of the interval by applying strftime to both the start and the end datetime points of each interval.

jan_first %>%
  mutate(hours = format_interval(time_stamp_3_hour, start_format = "%b%d %H",
                                 end_format = "%H", sep = "-")) %>%
  ggplot(aes(hours, n)) +
  geom_bar(stat = "identity")

When using the interval “week”, one might want to start the weeks on a different day than Sunday. In thicken, the start_val should then be specified with a date of the desired weekday. Finding this date by hand is a bit tedious, which is why closest_weekday is introduced.

thicken(coffee, "week", start_val = closest_weekday(coffee$time_stamp, 6))
##            time_stamp amount time_stamp_week
## 1 2016-07-07 09:11:21   3.14      2016-07-02
## 2 2016-07-07 09:46:48   2.98      2016-07-02
## 3 2016-07-09 13:25:17   4.11      2016-07-09
## 4 2016-07-10 10:45:11   3.14      2016-07-09

The custom suite

thicken and pad are highly automated, because they assume all datetime points to be equally spaced. Internally they span vectors of the desired interval. However, it might be useful to have asymmetrical periods between datetime points, especially when there are periods in which the number of observations is consistently lower, such as nightly hours or weekend days. The functions thicken_cust and pad_cust work like the original functions, but the user has to provide their own spanning vector to which the observations are mapped.

Let’s do an analysis of vehicle accidents in the emergency data set. We want to distinguish between morning rush hour (7-10), daytime (10-16), evening rush hour (16-19) and nighttime (19-7). This is the full analysis.

accidents_span <- span_around(emergency$time_stamp, "hour", start_shift = "2 hour") %>%
  subset_span(list(hour = c(7, 10, 16, 19)))

emergency %>%
  filter(title == "Traffic: VEHICLE ACCIDENT -") %>%
  thicken_cust(accidents_span, "hour") %>%
  count(hour) %>%
  pad_cust(accidents_span) %>%
  fill_by_value() %>%
  mutate(day = as.Date(hour, tz = "EST"),
         period = format_interval(hour, start_format = "%H", sep = "-")) %>%
  ggplot(aes(day, n, col = period)) +
  geom_line() +
  facet_wrap(~period)

The helper functions span_around and subset_span are used for building the spanning vector. The first takes the original datetime variable and spans a variable of the requested interval around it. This saves you the manual trouble of finding the min and the max of the variable and determining which points fall before and after them in order to build a variable of the interval. subset_span, subsequently, will only leave the datetime points in the input that meet the criteria given in a list. In the example these are the hours 7, 10, 16, and 19. In total there are eight different datetime parts you can subset on.

The remainder of the analysis is much like a regular padr analysis. Instead of an interval to thicken and pad to, you use the asymmetrically spaced accidents_span variable in both thicken_cust and pad_cust. thicken_cust will then map each observation to a datetime point in the spanned vector. pad_cust will insert rows for each of the datetime points that are in the spanned vector but not in the datetime variable.

Thanks

This release completes the initial plans I had for padr. For the next release I plan on refactoring and further increasing performance. Do you find anything still missing from padr? Did you find a bug or an inconsistency? Please let me know by sending an email or filing an issue on the GitHub page.


Gold-Mining – Week 11 (2017)

Fri, 11/17/2017 - 18:39

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 11 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

The post Gold-Mining – Week 11 (2017) appeared first on Fantasy Football Analytics.


Network analysis video – 7th Scottish QGIS user group

Fri, 11/17/2017 - 13:43

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

I spoke yesterday at the Scottish QGIS user group. This is a vibrant and welcoming community and has attendees who travel long distances to be there. I think this time Prague was the furthest city anybody came from.

My talk centred on network analysis using GRASS, R and QGIS. You can read how to do some of this work here and here. Watch the presentation (and the rest of the day) on Youtube and view the slide deck at the end of this post. A massive thanks to Ross, Tom and team for organising the event.


Teaching to machines: What does learning in machine learning entail?

Fri, 11/17/2017 - 04:40

(This article was first published on Memo's Island, and kindly contributed to R-bloggers)

Preamble

Figure 1: The oldest learning institution in the world, the University of Bologna (Source: Wikipedia).

Machine Learning (ML) is now a de facto skill for every quantitative job, and almost every industry has embraced it, even though the fundamentals of the field are not new at all. However, what does it mean to teach a machine? Unfortunately, even for moderately technical people coming from different backgrounds, the answer to this question is not apparent at first. This sounds like a conceptual and jargon issue, but it lies in the very success of supervised learning algorithms.

What is a machine in machine learning

First of all, a machine here does not mean a machine in the conventional sense, but a computational module or set of instructions. It is called a machine because thermodynamics can be applied to analyse these computational modules: their algorithmic complexity can be mapped to energy consumption and work production; recall the Landauer principle. Charles Bennett has an article on this principle, here.

Learning curve for machines

What about the learning bit? Learning is a natural process for humans, especially for the young. We learn new things naturally, without explicitly thinking about it, with instruction and some training, such as was practised at the University of Bologna a millennium ago (see Figure 1). Usually it takes time for us to learn a new thing, and the process follows a learning curve, a concept due to Hermann Ebbinghaus. Learning in machine learning is rooted in exactly this Ebbinghaus approach. But the question remains: how does this manifest quantitatively in machine learning? An answer to this question is given here.

The concept of a learning curve, and the effect of sample size when training machine learning models, matters to both experienced and novice practitioners. This type of analysis is often omitted when producing supervised learning solutions. With the advent of deep learning architectures (multi-layered neural networks), the concept becomes even more pronounced.

Quantifying a learning curve comes down to measuring performance over increasing experience. If the performance of the machine, or the human, increases with experience, we say that learning is achieved; this is how we distinguish a good learner.
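To make that concrete, here is a small sketch of a learning curve in R (my own illustration, not the author's code): hold out a fixed test set, fit a simple classifier on progressively larger training samples, and plot held-out accuracy against training size.

# My own illustrative sketch of a learning curve; not code from the post.
set.seed(42)
d <- iris[iris$Species != "setosa", ]              # reduce to a two-class problem
d$y <- as.numeric(d$Species == "versicolor")

test_idx  <- sample(nrow(d), 30)                   # fixed held-out test set
test      <- d[test_idx, ]
train_all <- d[-test_idx, ]
train_all <- train_all[sample(nrow(train_all)), ]  # shuffle so small samples contain both classes

sizes <- seq(10, nrow(train_all), by = 10)
acc <- sapply(sizes, function(n) {
  fit  <- glm(y ~ Sepal.Length + Sepal.Width, data = train_all[1:n, ], family = binomial)
  pred <- predict(fit, newdata = test, type = "response") > 0.5
  mean(pred == (test$y == 1))                      # accuracy on the held-out set
})

plot(sizes, acc, type = "b",
     xlab = "Training set size (experience)",
     ylab = "Held-out accuracy (performance)",
     main = "A learning curve: performance over experience")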

On the misconception of unsupervised learning

Figure 2: Donald Hebb,
father of unsupervised
learning.
(Source: Wikipedia)

A common misconception concerns what unsupervised learning means. Clustering, or categorising unlabelled data, is not learning in the Ebbinghaus sense. Goodness-of-fit or cluster validation exercises do not account for an increase in experience; at least, this is not established in the literature, judging from cluster validation techniques (see Jain and Dubes, Algorithms for Clustering Data). Wikipedia defines unsupervised learning as “inferring a function to describe hidden structure from ‘unlabeled’ data”, but this is function inference, not learning.

Then what is unsupervised learning? It originates from Hebbian theory in neuroscience: “cells that fire together wire together”. This implies that unsupervised learning is about how information is encoded, not how it is labelled. One good practical model that could be classified as unsupervised learning is the so-called spiking network model.

Outlook

The concept of learning from a machine learning perspective is summarised in this post. In data science, it is good practice to differentiate between what we call learning and function inference/optimisation. Being attentive to these details helps.


pool package on CRAN

Fri, 11/17/2017 - 01:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

The pool package makes it easier for Shiny developers to connect to databases. Up until now, there wasn’t a clearly good way to do this. As a Shiny app author, if you connect to a database globally (outside of the server function), your connection won’t be robust because all sessions would share that connection (which could leave most users hanging when one of them is using it, or even all of them if the connection breaks). But if you try to connect each time that you need to make a query (e.g. for every reactive you have), your app becomes a lot slower, as it can take in the order of seconds to establish a new connection. The pool package solves this problem by taking care of when to connect and disconnect, allowing you to write performant code that automatically reconnects to the database only when needed.

So, if you are a Shiny app author who needs to connect and interact with databases inside your apps, keep reading because this package was created to make your life easier.

What the pool package does

The pool package adds a new level of abstraction when connecting to a database: instead of directly fetching a connection from the database, you will create an object (called a “pool”) with a reference to that database. The pool holds a number of connections to the database. Some of these may be currently in-use and some of these may be idle, waiting for a new query or statement to request them. Each time you make a query, you are querying the pool, rather than the database. Under the hood, the pool will either give you an idle connection that it previously fetched from the database or, if it has no free connections, fetch one and give it to you. You never have to create or close connections directly: the pool knows when it should grow, shrink or keep steady. You only need to close the pool when you’re done.

Since pool integrates with both DBI and dplyr, there are very few things that will be new to you, if you’re already using either of those packages. Essentially, you shouldn’t feel the difference, with the exception of creating and closing a “Pool” object (as opposed to connecting and disconnecting a “DBIConnection” object). See this copy-pasteable app that uses pool and dplyr to query a MariaDB database (hosted on AWS) inside a Shiny app.

Very briefly, here’s how you’d connect to a database, write a table into it using DBI, query it using dplyr, and finally disconnect (you must have DBI, dplyr and pool installed and loaded in order to be able to run this code):

conn <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(conn, "quakes", quakes)
tbl(conn, "quakes") %>% select(-stations) %>% filter(mag >= 6)
## # Source:   lazy query [?? x 4]
## # Database: sqlite 3.19.3 [:memory:]
##      lat   long depth   mag
##    <dbl>  <dbl> <int> <dbl>
## 1 -20.70 169.92   139   6.1
## 2 -13.64 165.96    50   6.0
## 3 -15.56 167.62   127   6.4
## 4 -12.23 167.02   242   6.0
## 5 -21.59 170.56   165   6.0
dbDisconnect(conn)

And here’s how you’d do the same using pool:

pool <- dbPool(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(pool, "quakes", quakes)
tbl(pool, "quakes") %>% select(-stations) %>% filter(mag >= 6)
## # Source:   lazy query [?? x 4]
## # Database: sqlite 3.19.3 [:memory:]
##      lat   long depth   mag
##    <dbl>  <dbl> <int> <dbl>
## 1 -20.70 169.92   139   6.1
## 2 -13.64 165.96    50   6.0
## 3 -15.56 167.62   127   6.4
## 4 -12.23 167.02   242   6.0
## 5 -21.59 170.56   165   6.0
poolClose(pool)

What problem pool was created to solve

As mentioned before, the goal of the pool package is to abstract away the logic of connection management and the performance cost of fetching a new connection from a remote database. These concerns are especially prominent in interactive contexts, like Shiny apps. (So, while this package is of most practical value to Shiny developers, there is no harm if it is used in other contexts.)

The rest of this post elaborates some more on the specific problems of connection management inside of Shiny, and how pool addresses them.

The connection management spectrum

When you’re connecting to a database, it’s important to manage your connections: when to open them (taking into account that this is a potentially long process for remote databases), how to keep track of them, and when to close them. This is always true, but it becomes especially relevant for Shiny apps, where not following best practices can lead to many slowdowns (from inadvertently opening too many connections) and/or many leaked connections (i.e. forgetting to close connections once you no longer need them). Over time, leaked connections could accumulate and substantially slow down your app, as well as overwhelming the database itself.

Oversimplifying a bit, we can think of connection management in Shiny as a spectrum ranging from the extreme of just having one connection per app (potentially serving several sessions of the app) to the extreme of opening (and closing) one connection for each query you make. Neither of these approaches is great: the former is fast, but not robust, and the reverse is true for the latter.

In particular, opening only one connection per app makes it fast (because, in the whole app, you only fetch one connection) and your code is kept as simple as possible. However:

  • it cannot handle simultaneous requests (e.g. two sessions open, both querying the database at the same time);
  • if the connection breaks at some point (maybe the database server crashed), you won’t get a new connection (you have to exit the app and re-run it);
  • finally, if you are not quite at this extreme, and you use more than one connection per app (but fewer than one connection per query), it can be difficult to keep track of all your connections, since you’ll be opening and closing them in potentially very different places.

While the other extreme of opening (and closing) one connection for each query you make resolves all of these points, it is terribly slow (each time we need to access the database, we first have to fetch a connection), and you need a lot more (boilerplate) code to connect and disconnect the connection within each reactive/function.

If you’d like to see actual code that illustrates these two approaches, check this section of the pool README.

The best of both worlds

The pool package was created so that you don’t have to worry about this at all. Since pool abstracts away the logic of connection management, for the vast majority of cases, you never have to deal with connections directly. Since the pool “knows” when it should have more connections and how to manage them, you have all the advantages of the second approach (one connection per query), without the disadvantages. You are still using one connection per query, but that connection is always fetched and returned to the pool, rather than getting it from the database directly. This is a whole lot faster and more efficient. Finally, the code is kept just as simple as the code in the first approach (only one connection for the entire app), since you don’t have to continuously call dbConnect and dbDisconnect.
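For readers who want a feel for what that looks like inside Shiny, here is a minimal sketch of an app built around a single pool (my own illustration against a temporary SQLite database, not the copy-pasteable MariaDB example app linked earlier):

# My own minimal sketch, not the example app linked above.
library(shiny)
library(pool)
library(DBI)

# one pool per app; a file-backed temporary SQLite DB is shared by all connections
pool <- dbPool(RSQLite::SQLite(), dbname = tempfile(fileext = ".sqlite"))
dbWriteTable(pool, "quakes", quakes)
onStop(function() poolClose(pool))   # close the pool when the app stops

ui <- fluidPage(
  sliderInput("mag", "Minimum magnitude", min = 4, max = 6.5, value = 5, step = 0.1),
  tableOutput("tbl")
)

server <- function(input, output, session) {
  output$tbl <- renderTable({
    # each query checks a connection out of the pool and returns it when done;
    # no dbConnect()/dbDisconnect() calls anywhere in the app
    dbGetQuery(pool, paste("SELECT * FROM quakes WHERE mag >=", input$mag))
  })
}

shinyApp(ui, server)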

Feedback

This package has quietly been around for a year and it’s now finally on CRAN, following lots of changes in the database world (both in DBI and dplyr). All pool-related feedback is welcome. Issues (bugs and feature requests) can be posted to the GitHub tracker. Requests for help with code or other questions can be posted to community.rstudio.com/c/shiny, which I check regularly (they can, of course, also be posted to Stack Overflow, but I’m extremely likely to miss them there).


Six tips for running a successful unconference

Fri, 11/17/2017 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)


Attendees at the May 2017 rOpenSci unconference. Photo credit: Nistara Randhawa

In May 2017, I helped run a wildly successful “unconference” that had a huge positive impact on the community I serve. rOpenSci is a non-profit initiative enabling open and reproducible research by creating technical infrastructure in the form of staff- and community-contributed software tools in the R programming language that lower barriers to working with scientific data sources on the web, and creating social infrastructure through a welcoming and diverse community of software users and developers. Our 4th annual unconference brought together 70 people to hack on projects they dreamed up and to give them opportunities to meet and work together in person. One third of the participants had attended before, and two thirds were first-timers, selected from an open call for applications. We paid all costs up front for anyone who requested this in order to lower barriers to participation.

It’s called an “unconference” because there is no schedule set before the event – participants discuss project ideas online in advance and projects are selected by participant-voting at the start. I’m sharing some tips here on how to do this well for this particular flavour of unconference.

1. Have a code of conduct

Having a code of conduct that the organizers promote in the welcome goes a long way to creating a welcoming and safe environment and preventing violations in the first place.

2. Host online discussion of project ideas before the unconference

Our unconference centered on teams working on programming projects, rather than discussions, so prior to the unconference, we asked all participants to suggest project ideas using an open online system, called GitHub, that allows everyone to see and comment on ideas or just share enthusiastic emoji to show support.

3. Have a pre-unconference video-chat with first-time participants

Our AAAS CEFP training emphasizes the importance of extending a personalized welcome to community members, so I was inspired to make the bold move of talking with more than 40 first-time participants prior to the unconference. I asked each person to complete a short questionnaire to get them to consider the roles they anticipated playing prior to our chat. Frequently, within an hour of a video-chat, without prompting, I would see the person post a new project idea or share their thoughts about someone else’s. Other times, our conversation gave the person an opportunity to say “this idea maybe isn’t relevant but…” and I would help them talk through it, inevitably leading to “oh my gosh this is such a cool idea”. I got a huge return on my investment. People’s questions like “how do you plan to have 20 different projects present their work?” led to better planning on my part. Specific ideas for improvements came from me responding with “well…how would YOU do it?”

Between the emails, slack channel, issues on GitHub, and personal video chats, I felt completely at ease going into the unconf (where I knew next to no one!).

-rOpenSci unconf17 participant

4. Run an effective ice breaker

I adapted the “Human Barometer” ice breaker to enable 70 people to share opinions across all perceived levels of who a participant is and introduce themselves to the entire group within a 1 hour period. Success depended on creating questions that were relevant to the unconference crowd, and on visually keeping track of who had spoken up in order to call on those who had not. Ice breakers and the rOpenSci version of the Human Barometer will be the subject of a future CEFP blog post.

5. Have a plan to capture content

So much work and money go into running a great unconference, you can’t afford to do it without a plan to capture and disseminate stories about the people and the products. Harness the brain work that went into the ideas! I used the concept of content pillars to come up with a plan. Every project group was given a public repository on GitHub to house their code and documentation. In a 2-day unconference with 70 people in 20 projects, how do people present their results?! We told everyone that they had just three minutes to present, and that the only presentation material they could use was their project README (the page of documentation in their code repository). No slides allowed! This kept their focus on great documentation for the project rather than an ephemeral set of pretty slides. Practically speaking, this meant that all presentations were accessible from a single laptop connected to the projector and that to access their presentation, a speaker had only to click on the link to their repo. Where did the essence of this great idea come from? From a pre-unconference chat of course!

In the week following the unconference, we used the projects’ README documentation to create a series of five posts released Monday through Friday, noting every one of the 20 projects with links to their code repositories. To get more in-depth stories about people and projects, I let participants know we were keen to host community-contributed blog posts and that accepted posts would be tweeted to rOpenSci’s >13,000 followers. Immediately after the unconference, we invited selected projects to contribute longer form narrative blog posts and posted these once a week. The series ended with Unconf 2017: The Roads Not Taken, about all the great ideas that were not yet pursued and inviting people to contribute to these.

All of this content is tied together in one blog post to summarize the unconference and link to all staff- and community-contributed posts in the unconference projects series as well as to posts of warm personal and career-focussed reflections by some participants.

6. Care about other people’s success

Community managers do a lot of “emotional work”. In all of this, my #1 key to running a successful unconference is to genuinely and deeply care that participants arrive feeling ready to hit the ground running and leave feeling like they got what they wanted out of the experience. Ultimately, the success of our unconference is more about the human legacy – building people’s capacity as software users and developers, and the in-person relationships they develop within our online community – than it is about the projects.

“I’ve steered clear of ‘hackathons’ because they feel intimidating and the ‘bad’ kind of competitive. But, this….this was something totally different.”

– rOpenSci unconf17 participant

Additional Resources


The City of Chicago uses R to issue beach safety alerts

Thu, 11/16/2017 - 23:27

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Among the many interesting talks I saw at the Domino Data Science Pop-Up in Chicago earlier this week was the presentation by Gene Lynes and Nick Lucius from the City of Chicago. The City of Chicago Tech Plan encourages smart communities and open government, and as part of that initiative the city has undertaken dozens of open-source, open-data projects in areas such as food safety inspections, preventing the spread of West Nile virus, and keeping sidewalks clear of snow.

This talk was on the Clear Water initiative, a project to monitor the water quality of Chicago's many public beaches on Lake Michigan, and to issue safety alerts (or in serious cases, beach closures) when E. coli levels in the water get too high. The problem is that E. coli levels can change rapidly: levels can be normal for weeks, and then spike for a single day. But traditional culture tests take many days to return results, and while rapid DNA-based tests do exist, it's not possible to conduct these tests daily at every beach.

The solution is to build a predictive model, which uses meteorological data and rapid DNA tests for some beaches, combined with historical (culture-based) evaluations of water quality, to predict E. coli levels at all beaches every day. The analysis is performed using R (you can find the R code at this Github repository).

The analysis was developed in conjunction with citizen scientists at Chi Hack Night and statisticians from DePaul University. In 2017, the model was piloted in production in Chicago to issue beach safety alerts and to create a live map of beach water quality. This new R-based model predicted 60 additional occurrences of poor water quality, compared with the process used in prior years.

Still, water quality is hard to predict: comparing against the actual results from the slower culture tests, the model has an accuracy rate of 38%, with fewer than 2% false alarms. (The city plans to use clustering techniques to further improve that number.) That model uses rapid DNA testing at five beaches to predict all beaches along Lake Michigan. A Shiny app (linked below) lets you explore the impact of testing at a different set of beaches, and adjusting the false positive rate, on the overall accuracy of the model.

You can find more details about the City of Chicago Clear Water initiative at the link below.

City of Chicago: Clear Water


Data science courses in R (/python/and others) for $10 at Udemy (Black Friday sale)

Thu, 11/16/2017 - 22:05

Udemy is offering readers of R-bloggers access to its global online learning marketplace for only $10 per course! This deal (a 50%–90% discount) is for hundreds of their courses – including many on R programming, data science, machine learning, etc.

Click here to browse ALL (R and non-R) courses

Advanced R courses: 

Introductory R courses: 

Top data science courses (not in R): 

 


New IBM Plex Sans Support in hrbrthemes + Automating Axis Text Justification

Thu, 11/16/2017 - 15:07

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

IBM has a new set of corporate typefaces — dubbed “Plex” — and has released them with a generous open license. IBM Plex Sans is not too shabby:

(that image was grifted from a Font Squirrel preview page)

The digit glyphs are especially nice for charts and the font itself is just different enough from Helvetica that I felt it was worth including in hrbrthemes. I converted the fonts to TTF (for best compatibility with R ‘stuff’), added them to the package (so you can install them from there), cloned the other ipsum functions into _ps versions and tried to figure out the best fonts for various theme elements. Here’s the result so far:

(I really like those digits.)

I haven’t tried theme_ipsum_ps() on anything but macOS, so if you kick the tyres and encounter problems, please file an issue.

Flushing ticks’ text

Readers with keen eyes will see that the axis text labels have different hjust settings on the ends (vs the 0.5 in the inner ones). I don’t always set them that way, as it is not always appropriate to do so (IMO). When I do, it’s usually been a “figure out the breaks and then hand-craft the theme() components” endeavour. Well, not anymore (at least in simple cases), thanks to the new flush_ticks().

It’s a dead-simple process but it requires knowing the breaks and you pretty much have to build the plot to do that, so the function takes your completed ggplot2 object, introspects it (post-temporary build) and sets the tick text justification in the style seen in the above image. The code below generates that image and demonstrates use of flush_ticks():

library(hrbrthemes)
library(ggplot2)

ggplot(mpg, aes(displ, hwy)) +
  geom_jitter(aes(color=class, fill=class), size=3, shape=21, alpha=1/2) +
  scale_x_continuous(expand=c(0,0), limits=c(1, 8), breaks=1:8) +
  scale_y_continuous(expand=c(0,0), limits=c(10, 50)) +
  scale_color_ipsum() +
  scale_fill_ipsum() +
  facet_wrap(~class, scales="free") +
  labs(
    title="IBM Plex Sans Test",
    subtitle="This is a subtitle to see the how it looks in IBM Plex Sans",
    caption="Source: hrbrthemes & IBM"
  ) +
  theme_ipsum_ps(grid="XY", axis="xy") +
  theme(legend.position="none") -> gg

flush_ticks(gg)

As indicated, this won’t presently work for free scales, but it’s going to save me a few minutes for each plot I would have done the more manual way. And releasing it into the wild will help catch edge cases I haven’t considered yet. I’ve added an option to flush_ticks() to generate theme(axis.text...) statements vs just returning the plot, so you can include the theme() code directly vs having to rely on the function to display the plot. I may even make it an RStudio addin (though folks who would like to do that are encouraged to submit a PR).

I hope both new hrbrthemes features are useful to folks; they should be on CRAN in mid-December, along with an R Markdown set of themes to go with the package.


Visualizing how confounding biases estimates of population-wide (or marginal) average causal effects

Thu, 11/16/2017 - 01:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

When we are trying to assess the effect of an exposure or intervention on an outcome, confounding is an ever-present threat to our ability to draw the proper conclusions. My goal (starting here and continuing in upcoming posts) is to think a bit about how to characterize confounding in a way that makes it possible to literally see why improperly estimating intervention effects might lead to bias.

Confounding, potential outcomes, and causal effects

Typically, we think of a confounder as a factor that influences both exposure and outcome. If we ignore the confounding factor in estimating the effect of an exposure, we can easily over- or underestimate the size of the effect due to the exposure. If sicker patients are more likely than healthier patients to take a particular drug, the relatively poor outcomes of those who took the drug may be due to the initial health status rather than the drug.

A slightly different view of confounding is tied to the more conceptual framework of potential outcomes, which I wrote a bit about earlier. A potential outcome is the outcome we would observe if an individual were subjected to a particular exposure. We may or may not observe the potential outcome – this depends on the actual exposure. (To simplify things here, I will assume we are interested only in two different exposures.) \(Y_0\) and \(Y_1\) represent the potential outcomes for an individual without and with exposure, respectively. We observe \(Y_0\) if the individual is not exposed, and \(Y_1\) if she is.

The causal effect of the exposure for the individual \(i\) can be defined as \(Y_{1i} - Y_{0i}\). If we can observe each individual in both states (with and without the exposure) long enough to measure the outcome \(Y\), we are observing both potential outcomes and can measure the causal effect for each individual. Averaging across all individuals in the sample provides an estimate of the population average causal effect. (Think of a crossover or N-of-1 study.)

Unfortunately, in the real world, it is rarely feasible to expose an individual to multiple conditions. Instead, we use one group as a proxy for the other. For example, the control group represents what would have happened to the exposed group had the exposed group not been exposed. This approach only makes sense if the control group is identical in every way to the exposure group (except for the exposure, of course.)

Our goal is to compare the distribution of outcomes for the control group with the exposed group. We often simplify this comparison by looking at the means of each distribution. The average causal effect (across all individuals) can be written as \(E(Y_1 – Y_0)\), where \(E()\) is the expectation or average. In reality, we cannot directly measure this since only one potential outcome is observed for each individual.

Using the following logic, we might be able to convince ourselves that we can use observed measurements to estimate unobservable average causal effects. First, we can say \(E(Y_1 - Y_0) = E(Y_1) - E(Y_0)\), because expectation is linear. Next, it seems fairly reasonable to say that \(E(Y_1 | A = 1) = E(Y | A = 1)\), where \(A=1\) for exposure, \(A=0\) for control. In words, this states that the average potential outcome of exposure for the exposed group is the same as what we actually observe for the exposed group (this is the consistency assumption in causal inference theory). Along the same lines, \(E(Y_0 | A = 0) = E(Y | A = 0)\). Finally, if we can say that \(E(Y_1) = E(Y_1 | A = 1)\) – the potential outcome of exposure for everyone is equal to the potential outcome of exposure for those exposed – then we can say that \(E(Y_1) = E(Y | A = 1)\) (the potential outcome with exposure for everyone is the same as the observed outcome for the exposed). Similarly, we can make the same argument to conclude that \(E(Y_0) = E(Y | A = 0)\). At the end of this train of logic, we conclude that we can estimate \(E(Y_1 - Y_0)\) using observed data only: \(E(Y | A = 1) - E(Y | A = 0)\).
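Laid out as a chain of equalities, that argument is:

\[
\begin{aligned}
E(Y_1 - Y_0) &= E(Y_1) - E(Y_0) && \text{(linearity of expectation)} \\
&= E(Y_1 | A = 1) - E(Y_0 | A = 0) && \text{(no confounding)} \\
&= E(Y | A = 1) - E(Y | A = 0) && \text{(consistency)}
\end{aligned}
\]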

This nice logic fails if \(E(Y_1) \ne E(Y | A = 1)\) and/or \(E(Y_0) \ne E(Y | A = 0)\). That is, this nice logic fails when there is confounding.

This is all a very long-winded way of saying that confounding arises when the distributions of potential outcomes for the population are different from those distributions for the subgroups we are using for analysis. For example, if the potential outcome under exposure for the population as a whole (\(Y_1\)) differs from the observed outcome for the subgroup that was exposed (\(Y|A=1\)), or the potential outcome without exposure for the entire population (\(Y_0\)) differs from the observed outcome for the subgroup that was not exposed (\(Y|A=0\)), any estimates of population level causal effects using observed data will be biased.

However, if we can find a factor \(L\) (or factors) where

\[
\begin{aligned}
P(Y_1 | L=l) &= P(Y | A = 1 \text{ and } L=l) \\
P(Y_0 | L=l) &= P(Y | A = 0 \text{ and } L=l)
\end{aligned}
\]

both hold for all levels or values of \(L\), we can remove confounding (and get unbiased estimates of the causal effect) by “controlling” for \(L\). In some cases, the causal effect we measure will be conditional on \(L\), sometimes it will be a population-wide average (or marginal) causal effect, and sometimes it will be both.

What confounding looks like …

The easiest way to illustrate the population/subgroup contrast is to generate data from a process that includes confounding. In this first example, the outcome is continuous, and is a function of both the exposure (\(A\)) and a covariate (\(L\)). For each individual, we can generate both potential outcomes \(Y_0\) and \(Y_1\). (Note that both potential outcomes share the same individual level noise term \(e\) – this is not a necessary assumption.) This way, we can “know” the true population, or marginal causal effect of exposure. The observed outcome \(Y\) is determined by the exposure status. For the purposes of plotting a smooth density curve, we generate a very large sample – 2 million.

library(simstudy)

defC <- defData(varname = "e", formula = 0, variance = 2, dist = "normal")
defC <- defData(defC, varname = "L", formula = 0.4, dist = "binary")
defC <- defData(defC, varname = "Y0", formula = "1 + 4*L + e", dist = "nonrandom")
defC <- defData(defC, varname = "Y1", formula = "5 + 4*L + e", dist = "nonrandom")
defC <- defData(defC, varname = "A", formula = "0.3 + 0.3 * L", dist = "binary")
defC <- defData(defC, varname = "Y", formula = "1 + 4*A + 4*L + e", dist = "nonrandom")

set.seed(2017)
dtC <- genData(n = 2000000, defC)

dtC[1:5]
##    id           e L        Y0        Y1 A          Y
## 1:  1  2.02826718 1 7.0282672 11.028267 1 11.0282672
## 2:  2 -0.10930734 0 0.8906927  4.890693 0  0.8906927
## 3:  3  1.04529790 0 2.0452979  6.045298 0  2.0452979
## 4:  4 -2.48704266 1 2.5129573  6.512957 1  6.5129573
## 5:  5 -0.09874778 0 0.9012522  4.901252 0  0.9012522

Feel free to skip over this code – I am just including it in case anyone finds it useful to see how I generated the following series of annotated density curves:

library(ggplot2)

getDensity <- function(vector, weights = NULL) {

  if (!is.vector(vector)) stop("Not a vector!")

  if (is.null(weights)) {
    avg <- mean(vector)
  } else {
    avg <- weighted.mean(vector, weights)
  }

  close <- min(which(avg < density(vector)$x))
  x <- density(vector)$x[close]

  if (is.null(weights)) {
    y = density(vector)$y[close]
  } else {
    y = density(vector, weights = weights)$y[close]
  }

  return(data.table(x = x, y = y))
}

plotDens <- function(dtx, var, xPrefix, title, textL = NULL, weighted = FALSE) {

  dt <- copy(dtx)

  if (weighted) {
    dt[, nIPW := IPW/sum(IPW)]
    dMarginal <- getDensity(dt[, get(var)], weights = dt$nIPW)
  } else {
    dMarginal <- getDensity(dt[, get(var)])
  }

  d0 <- getDensity(dt[L==0, get(var)])
  d1 <- getDensity(dt[L==1, get(var)])

  dline <- rbind(d0, dMarginal, d1)
  brk <- round(dline$x, 1)

  p <- ggplot(aes(x=get(var)), data=dt) +
    geom_density(data=dt[L==0], fill = "#ce682f", alpha = .4) +
    geom_density(data=dt[L==1], fill = "#96ce2f", alpha = .4)

  if (weighted) {
    p <- p + geom_density(aes(weight = nIPW), fill = "#2f46ce", alpha = .8)
  } else p <- p + geom_density(fill = "#2f46ce", alpha = .8)

  p <- p +
    geom_segment(data = dline, aes(x = x, xend = x, y = 0, yend = y),
                 size = .7, color = "white", lty = 3) +
    annotate(geom="text", x = 12.5, y = .24, label = title, size = 5, fontface = 2) +
    scale_x_continuous(limits = c(-2, 15), breaks = brk, name = paste(xPrefix, var)) +
    theme(panel.grid = element_blank(),
          axis.text.x = element_text(size = 12),
          axis.title.x = element_text(size = 13))

  if (!is.null(textL)) {
    p <- p +
      annotate(geom = "text", x = textL[1], y = textL[2], label = "L=0",
               size = 4, fontface = 2) +
      annotate(geom = "text", x = textL[3], y = textL[4], label = "L=1",
               size = 4, fontface = 2) +
      annotate(geom = "text", x = textL[5], y = textL[6],
               label = "Population distribution", size = 4, fontface = 2)
  }

  return(p)
}

library(gridExtra)

grid.arrange(
  plotDens(dtC, "Y0", "Potential outcome", "Full\npopulation", c(1, .24, 5, .22, 2.6, .06)),
  plotDens(dtC[A==0], "Y", "Observed", "Unexposed\nonly"),
  plotDens(dtC, "Y1", "Potential outcome", "Full\npopulation"),
  plotDens(dtC[A==1], "Y", "Observed", "Exposed\nonly"),
  nrow = 2
)

Looking at the various plots, we can see a few interesting things. The density curves on the left represent the entire population. The conditional distributions of the potential outcomes at the population level are all normally distributed, with means that depend on the exposure and covariate \(L\). We can also see that the population-wide distribution of \(Y_0\) and \(Y_1\) (in blue) are non-symmetrically shaped, because they are a mixture of the conditional normal distributions, weighted by the proportion of each level of \(L\). Since the proportions for the top and bottom plots are in fact the population proportion, the population-level density curves for \(Y_0\) and \(Y_1\) are similarly shaped, with less mass on the higher end, because individuals are less likely to have an \(L\) value of 1:

dtC[, .(propLis1 = mean(L))]
##    propLis1
## 1: 0.399822

The shape of the marginal distribution of \(Y_1\) is identical to \(Y_0\) (in this case, because that is the way I generated the data), but shifted to the right by an amount equal to the causal effect. The conditional effect sizes are 4, as is the population or marginal effect size.

The subgroup plots on the right are a different story. In this case, the distributions of \(L\) vary across the exposed and unexposed groups:

dtC[, .(propLis1 = mean(L)), keyby = A]
##    A  propLis1
## 1: 0 0.2757937
## 2: 1 0.5711685

So, even though the distributions of (observed) \(Y\) conditional on \(L\) are identical to their potential outcome counterparts in the whole population – for example, \(P(Y | A=0 \text{ and } L = 1) = P(Y_0 | L = 1)\) – the marginal distributions of \(Y\) are quite different for the exposed and unexposed. For example, \(P(Y | A = 0) \ne P(Y_0)\). This is directly due to the fact that the mixing weights (the proportions of \(L\)) are different for each of the groups. In the unexposed group, about 28% have \(L=1\), but for the exposed group, about 57% do. Using the subgroup data only, the conditional effect sizes are still 4 (comparing mean outcomes \(Y\) within each level of \(L\)). However the difference in means between the marginal distributions of each subgroup is about 5.2 (calculated by 7.3 – 2.1). This is confounding.
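To see where the 5.2 comes from, plug the subgroup proportions of \(L\) into the conditional means implied by the data generating process (\(E(Y_1 | L) = 5 + 4L\) and \(E(Y_0 | L) = 1 + 4L\)):

\[
\begin{aligned}
E(Y | A = 1) &\approx 5 + 4 \times 0.57 \approx 7.3 \\
E(Y | A = 0) &\approx 1 + 4 \times 0.28 \approx 2.1 \\
E(Y | A = 1) - E(Y | A = 0) &\approx 5.2 \ne 4 = E(Y_1 - Y_0)
\end{aligned}
\]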

No confounding

Just to show that when the covariate \(L\) has nothing to do with the probability of exposure, the marginal distributions of the subgroups do in fact look like their population-level potential outcome marginal distributions:

defC <- updateDef(defC, "A", newformula = 0.5) # change data generation
dtC <- genData(n = 2000000, defC)

dtC[, .(propLis1 = mean(L)), keyby = A] # subgroup proportions
##    A  propLis1
## 1: 0 0.4006499
## 2: 1 0.3987437

dtC[, .(propLis1 = mean(L))] # population/marginal props
##    propLis1
## 1: 0.3996975

grid.arrange(
  plotDens(dtC, "Y0", "Potential outcome", "Population", c(1, .24, 5, .22, 2.6, .06)),
  plotDens(dtC[A==0], "Y", "Observed", "Unexposed"),
  plotDens(dtC, "Y1", "Potential outcome", "Population"),
  plotDens(dtC[A==1], "Y", "Observed", "Exposed"),
  nrow = 2
)

Estimation of causal effects (now with confounding)

Generating a smaller data set, we estimate the causal effects using simple calculations and linear regression:

library(broom)

# change back to confounding
defC <- updateDef(defC, "A", newformula = ".3 + .3 * L")
dtC <- genData(2500, defC)

The true average (marginal) causal effect from the average difference in potential outcomes for the entire population:

dtC[, mean(Y1 - Y0)]
## [1] 4

And the true average causal effects conditional on the covariate \(L\):

dtC[, mean(Y1 - Y0), keyby = L]
##    L V1
## 1: 0  4
## 2: 1  4

If we try to estimate the marginal causal effect by using a regression model that does not include \(L\), we run into problems. The estimate of 5.2 we see below is the same biased estimate we saw in the plot above. This model is reporting the differences of the means (across both levels of \(L\)) for the two subgroups, which we know (because we saw) are not the same as the potential outcome distributions in the population due to different proportions of \(L\) in each subgroup:

tidy(lm(Y ~ A, data = dtC))
##          term estimate  std.error statistic       p.value
## 1 (Intercept) 2.027132 0.06012997  33.71251 1.116211e-205
## 2           A 5.241004 0.09386145  55.83766  0.000000e+00

If we estimate a model that conditions on \(L\), the estimates are on target because in the context of normal linear regression without interaction terms, conditional effects are the same as marginal effects (when confounding has been removed; or think of the comparisons being made within the orange groups and green groups in the first set of plots above, not within the purple groups):

tidy(lm(Y ~ A + L , data = dtC))
##          term  estimate  std.error statistic       p.value
## 1 (Intercept) 0.9178849 0.03936553  23.31697 5.809202e-109
## 2           A 4.0968358 0.05835709  70.20288  0.000000e+00
## 3           L 3.9589109 0.05862583  67.52844  0.000000e+00

Inverse probability weighting (IPW)

What follows briefly here is just a sneak preview of IPW (without any real explanation), which is one way to recover the marginal mean using observed data with confounding. For now, I am ignoring the question of why you might be interested in knowing the marginal effect when the conditional effect estimate provides the same information. Suffice it to say that the conditional effect is not always the same as the marginal effect (think of data generating processes that include interactions or non-linear relationships), and sometimes the marginal effect estimate may be the best that we can do, or at least that we can do easily.

If we weight each individual observation by the inverse probability of exposure, we can remove confounding and estimate the marginal effect of exposure on the outcome. Here is a quick simulation example.

After generating the dataset (the same large one we started out with so you can compare) we estimate the probability of exposure \(P(A=1 | L)\), assuming that we know the correct exposure model. This is definitely a questionable assumption, but in this case, we actually do. Once the model has been fit, we assign the predicted probability to each individual based on her value of \(L\).

set.seed(2017)
dtC <- genData(2000000, defC)

exposureModel <- glm(A ~ L, data = dtC, family = "binomial")
tidy(exposureModel)
##          term  estimate   std.error statistic p.value
## 1 (Intercept) -0.847190 0.001991708 -425.3584       0
## 2           L  1.252043 0.003029343  413.3053       0

dtC[, pA := predict(exposureModel, type = "response")]

The IPW is not based exactly on \(P(A=1 | L)\) (which is commonly used in propensity score analysis), but rather on the probability of the actual exposure at each level of \(L\): \(P(A=a | L)\), where \(a \in \{0,1\}\):

# Define two new columns
defC2 <- defDataAdd(varname = "pA_actual",
                    formula = "A * pA + (1-A) * (1-pA)", dist = "nonrandom")
defC2 <- defDataAdd(defC2, varname = "IPW",
                    formula = "1/pA_actual", dist = "nonrandom")

# Add weights
dtC <- addColumns(defC2, dtC)
round(dtC[1:5], 2)
##    id     e L   Y0    Y1 A     Y  pA pA_actual  IPW
## 1:  1  2.03 1 7.03 11.03 1 11.03 0.6       0.6 1.67
## 2:  2 -0.11 0 0.89  4.89 0  0.89 0.3       0.7 1.43
## 3:  3  1.05 0 2.05  6.05 0  2.05 0.3       0.7 1.43
## 4:  4 -2.49 1 2.51  6.51 1  6.51 0.6       0.6 1.67
## 5:  5 -0.10 0 0.90  4.90 0  0.90 0.3       0.7 1.43

To estimate the marginal effect, we use the function lm again, but with weights specified by IPW. The true value of the marginal effect of exposure (based on the population-wide potential outcomes) was 4.0. I know I am repeating myself here, but first I am providing the biased estimate that we get when we ignore covariate \(L\), to convince you that the relationship between exposure and outcome is indeed confounded:

tidy(lm(Y ~ A , data = dtC))
##          term estimate   std.error statistic p.value
## 1 (Intercept) 2.101021 0.002176711  965.2275       0
## 2           A 5.184133 0.003359132 1543.2956       0

And now, with the simple addition of the weights but still not including \(L\) in the model, our weighted estimate of the marginal effect is spot on (but with such a large sample size, this is not so surprising):

tidy(lm(Y ~ A , data = dtC, weights = IPW))
##          term estimate   std.error statistic p.value
## 1 (Intercept) 2.596769 0.002416072  1074.789       0
## 2           A 4.003122 0.003416842  1171.585       0

And finally, here is a plot of the IPW-adjusted density. You might think I am just showing you the plots for the unconfounded data again, but you can see from the code (and I haven’t hidden anything) that I am still using the data set with confounding. In particular, you can see that I am calling the routine plotDens with weights.

grid.arrange(
  plotDens(dtC, "Y0", "Potential outcome", "Population", c(1, .24, 5, .22, 2.6, .06)),
  plotDens(dtC[A==0], "Y", "Observed", "Unexposed", weighted = TRUE),
  plotDens(dtC, "Y1", "Potential outcome", "Population"),
  plotDens(dtC[A==1], "Y", "Observed", "Exposed", weighted = TRUE),
  nrow = 2
)

As I mentioned, I hope to write more on IPW, and marginal structural models, which make good use of this methodology to estimate effects that can be challenging to get a handle on.


Where to live in the US

Thu, 11/16/2017 - 01:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

I was fascinated by this xkcd comic about where to live based on your temperature preferences. I also thought it’d be fun to try to make a similar one from my R session! Since I’m no meteorologist and was a bit unsure of how to define winter and summer, and of their relevance in countries like, say, India which has monsoon, I decided to focus on a single country located in one hemisphere only and big enough to offer some variety… the USA! So, dear American readers, where should you live based on your temperature preferences?

Defining data sources

Weather data

The data for the original xkcd graph comes from weatherbase. I changed sources because 1) I was not patient enough to wait for weatherbase to send me a custom dataset, which I imagine is what the xkcd author did, and 2) I’m the creator of a cool package accessing airport weather data for the whole world, including the US! My package is called “riem” like “R Iowa Environmental Mesonet” (the source of the data, a fantastic website) and “we laugh” in Catalan (at the time I wrote the package I was living in Barcelona and taking 4 hours of Catalan classes a week!). It’s a simple but good package which underwent peer review at rOpenSci onboarding – thank you Brooke!

I based my graph on data from the last winter and the last summer. I reckon that one should average over more years, but nothing important is at stake here, right?

Cities sample

My package has a function for downloading weather data for a given airport based on its ID. For instance, Los Angeles airport is LAX. At that point I just needed a list of cities in the US with their airport code. Indeed, with my package you can get all airport weather networks, one per US state, and all the stations within each network… but that is a lot of airports, with no way to automatically determine how big they are! And to get the city name, since it’d be hard to parse it out of the airport name, I’d have had to resort to geocoding with e.g. this package. A bit complicated for a simple fun graph!

So I went to the Wikipedia page of the busiest airports in the US and ended up getting this dataset from the US Department of Transportation (such a weird open data format, by the way… I basically copy-pasted the first two columns into a spreadsheet!). This is not perfect – getting a list of major cities in every state would be fairer – but hey, reader, I want you to live in a really big city so that I might know where it is.

Ok so let’s get the data

us_airports <- readr::read_csv("data/2017-11-16_wheretoliveus_airports.csv")
knitr::kable(us_airports[1:10,])

Name                                                       Code
---------------------------------------------------------  -----
Atlanta, GA (Hartsfield-Jackson Atlanta International)     ATL
Los Angeles, CA (Los Angeles International)                LAX
Chicago, IL (Chicago O’Hare International)                 ORD
Dallas/Fort Worth, TX (Dallas/Fort Worth International)    DFW
New York, NY (John F. Kennedy International)               JFK
Denver, CO (Denver International)                          DEN
San Francisco, CA (San Francisco International)            SFO
Charlotte, NC (Charlotte Douglas International)            CLT
Las Vegas, NV (McCarran International)                     LAS
Phoenix, AZ (Phoenix Sky Harbor International)             PHX

I first opened my table of 50 airports. Then, I made the call to the Iowa Environment Mesonet.

summer_weather <- purrr::map_df(us_airports$Code, riem::riem_measures,
                                date_start = "2017-06-01",
                                date_end = "2017-08-31")

winter_weather <- purrr::map_df(us_airports$Code, riem::riem_measures,
                                date_start = "2016-12-01",
                                date_end = "2017-02-28")

We then remove the lines that will be useless for further computations (The xkcd graph uses average temperature in the winter and average Humidex in the summer which uses both temperature and dew point).

summer_weather <- dplyr::filter(summer_weather, !is.na(tmpf), !is.na(dwpf))
winter_weather <- dplyr::filter(winter_weather, !is.na(tmpf))

I quickly checked that there was “nearly no missing data”, so I didn’t remove any station or day; but if I were doing this analysis for something serious I’d do more checks, including the time difference between measures, for instance. Note that I end up with only 48 airports: there were no measures for Honolulu, HI (Honolulu International) or San Juan, PR (Luis Munoz Marin International). Too bad!

Calculating the two weather values

I started by converting all temperatures to Celsius degrees because, although I like you, American readers, Fahrenheit degrees do not mean anything to me, which is problematic when checking the plausibility of results, for instance. I was lazy and just used Brooke Anderson’s (yep, the same Brooke who reviewed riem for rOpenSci) weathermetrics package.

summer_weather <- dplyr::mutate(summer_weather,
  tmpc = weathermetrics::convert_temperature(tmpf, old_metric = "f", new_metric = "c"),
  dwpc = weathermetrics::convert_temperature(dwpf, old_metric = "f", new_metric = "c"))

winter_weather <- dplyr::mutate(winter_weather,
  tmpc = weathermetrics::convert_temperature(tmpf, old_metric = "f", new_metric = "c"))

Summer values

Summer values are Humidex values. The author explained Humidex “combines heat and dew point”. Once again I went to Wikipedia and found the formula. I’m not in a mood to fiddle with formula writing on the blog, so please go there if you want to see it. There’s also a package with a function returning the Humidex. I was feeling adventurous and therefore wrote my own function and checked the results against the numbers from Wikipedia. It was wrong at first because I wrote a “+” instead of a “-”… checking one’s code is crucial.
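(For reference, the formula from that Wikipedia page, which the function below implements, is

\[
\text{Humidex} = T_{\text{air}} + 0.5555 \left( 6.11 \, e^{\,5417.7530 \left( \frac{1}{273.16} - \frac{1}{273.15 + T_{\text{dew}}} \right)} - 10 \right)
\]

where \(T_{\text{air}}\) is the air temperature and \(T_{\text{dew}}\) the dew point, both in degrees Celsius.)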

calculate_humidex <- function(temp, dewpoint){
  temp + 0.5555*(6.11 *exp(5417.7530*(1/273.16 - 1/(273.15 + dewpoint))) - 10)
}

calculate_humidex(30, 15)
## [1] 33.969

calculate_humidex(30, 25)
## [1] 42.33841

And then calculating the summer Humidex values by station was quite straightforward…

library("magrittr")

summer_weather <- dplyr::mutate(summer_weather,
                                humidex = calculate_humidex(tmpc, dwpc))

summer_values <- summer_weather %>%
  dplyr::group_by(station) %>%
  dplyr::summarise(summer_humidex = mean(humidex, na.rm = TRUE))

… as it was for winter temperatures.

winter_values <- winter_weather %>%
  dplyr::group_by(station) %>%
  dplyr::summarise(winter_tmpc = mean(tmpc, na.rm = TRUE))

Prepare data for plotting

I first joined winter and summer values.

climates <- dplyr::left_join(winter_values, summer_values, by = "station")

Then I re-added city airport names.

climates <- dplyr::left_join(climates, us_airports, by = c("station" = "Code"))

I only kept the city name.

head(climates$Name)
## [1] "Atlanta, GA (Hartsfield-Jackson Atlanta International)"
## [2] "Austin, TX (Austin - Bergstrom International)"
## [3] "Nashville, TN (Nashville International)"
## [4] "Boston, MA (Logan International)"
## [5] "Baltimore, MD (Baltimore/Washington International Thurgood Marshall)"
## [6] "Cleveland, OH (Cleveland-Hopkins International)"

climates <- dplyr::mutate(climates,
                          city = stringr::str_replace(Name, " \\(.*", ""))

head(climates$city)
## [1] "Atlanta, GA"   "Austin, TX"    "Nashville, TN" "Boston, MA"
## [5] "Baltimore, MD" "Cleveland, OH"

Plotting!

When imitating an xkcd graph, one should use the xkcd package! I had already done that in this post.

library("xkcd")
library("ggplot2")
library("extrafont")
library("ggrepel")

xrange <- range(climates$summer_humidex)
yrange <- range(climates$winter_tmpc)

set.seed(42)

ggplot(climates, aes(summer_humidex, winter_tmpc)) +
  geom_point() +
  geom_text_repel(aes(label = city), family = "xkcd", max.iter = 50000) +
  ggtitle("Where to live based on your temperature preferences",
          subtitle = "Data from airports weather stations, 2016-2017") +
  xlab("Summer heat and humidity via Humidex") +
  ylab("Winter temperature in Celsius degrees") +
  xkcdaxis(xrange = xrange, yrange = yrange) +
  theme_xkcd() +
  theme(text = element_text(size = 16, family = "xkcd"))

So, which city seems attractive to you based on this plot? Or would you use different weather measures, e.g. do you care about rain?


Shiny App for making Pixel Art Models

Thu, 11/16/2017 - 01:00

(This article was first published on blog, and kindly contributed to R-bloggers)

Last weekend, I discovered pixel art. The goal is to reproduce a pixelated drawing. Anyone can do this without any drawing skills, because you just have to reproduce the pixels one by one (on squared paper). Kids and big kids can quickly become addicted to this.

Example

For this pixelated ironman, you need only 3 colors (black, yellow and red). At the beginning I thought this would be really easy and quick. It took me approximately 15 minutes to reproduce this. Children could take more than 1 hour to reproduce this, so it’s nice if you want to keep them busy.

Make your own pixel art models

On the internet, there are lots of models. There are also tutorials on how to make models with Photoshop. Yet, I wanted to make an R package for making pixel art models based on any picture. The pipeline I came up with is the following:

  • read an image with package magick
  • downsize this image for processing
  • use K-means to project colors in a small set of colors
  • downsize the image and project colors
  • plot the pixels and add lines to separate them

I think there may be a lot to improve but from what I currently know about images, it’s the best I could come up with as a first shot.
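To make the pipeline concrete, here is a minimal, hedged sketch of the same idea (not the pixelart package internals); the input file name, the target width and the number of colors are arbitrary choices for illustration:

# Rough sketch of the pipeline above, assuming a local "ironman.png":
# read, downsize, cluster the colors with k-means, then draw the pixel grid.
library(magick)
library(ggplot2)

img   <- image_read("ironman.png")   # hypothetical input image
small <- image_scale(img, "30")      # downsize to ~30 pixels wide

# Pixel values as a 3 x width x height integer array (RGB)
px <- as.integer(image_data(small, channels = "rgb"))
rgbs <- data.frame(r = as.vector(px[1, , ]),
                   g = as.vector(px[2, , ]),
                   b = as.vector(px[3, , ]))

# Project the colors onto a small palette (3 colors here)
km  <- kmeans(rgbs, centers = 3)
pal <- rgb(km$centers[, 1], km$centers[, 2], km$centers[, 3], maxColorValue = 255)

# One tile per pixel, with grid lines separating them
grid_df <- expand.grid(x = seq_len(dim(px)[2]), y = seq_len(dim(px)[3]))
grid_df$col <- pal[km$cluster]

ggplot(grid_df, aes(x, y, fill = col)) +
  geom_tile(color = "grey30") +
  scale_fill_identity() +
  scale_y_reverse() +
  coord_fixed() +
  theme_void()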

I made a package called pixelart, with an associated Shiny App.

# Installation
devtools::install_github("privefl/pixelart")

# Run Shiny App
pixelart::run_app()


Data Wrangling at Scale

Wed, 11/15/2017 - 20:15

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Just wrote a new R article: “Data Wrangling at Scale” (using Dirk Eddelbuettel’s tint template).

Please check it out.


Programming, meh… Let’s Teach How to Write Computational Essays Instead

Wed, 11/15/2017 - 13:11

(This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to R-bloggers)

From Stephen Wolfram, a nice phrase to describe the sorts of things you can create using tools like Jupyter notebooks, Rmd and Mathematica notebooks: computational essays, which complements the “computational narrative” phrase that is also used to describe such documents.

Wolfram’s recent blog post What Is a Computational Essay?, part essay, part computational essay, is primarily a pitch for using Mathematica notebooks and the Wolfram Language. (The Wolfram Language provides computational support plus access to a “fact engine” database that can be used to pull factual information into the coding environment.)

But it also describes nicely some of the generic features of other “generative document” media (Jupyter notebooks, Rmd/knitr) and how to start using them.

There are basically three kinds of things [in a computational essay]. First, ordinary text (here in English). Second, computer input. And third, computer output. And the crucial point is that these three kinds of things all work together to express what’s being communicated.

In Mathematica, the view is something like this:


In Jupyter notebooks:

In its raw form, an RStudio Rmd document source looks something like this:

A computational essay is in effect an intellectual story told through a collaboration between a human author and a computer. …

The ordinary text gives context and motivation. The computer input gives a precise specification of what’s being talked about. And then the computer output delivers facts and results, often in graphical form. It’s a powerful form of exposition that combines computational thinking on the part of the human author with computational knowledge and computational processing from the computer.

When we originally drafted the OU/FutureLearn course Learn to Code for Data Analysis (also available on OpenLearn), we wrote the explanatory text – delivered as HTML but including static code fragments and code outputs – as a notebook, and then “ran” the notebook to generate static HTML (or markdown) that provided the static course content. These notebooks were complemented by actual notebooks that students could work with interactively themselves.

(Actually, we prototyped authoring both the static text, and the elements to be used in the student notebooks, in a single document, from which the static HTML and “live” notebook documents could be generated: Authoring Multiple Docs from a Single IPython Notebook. )

Whilst the notion of the computational essay as a form is really powerful, I think the added distinction between generative and generated documents is also useful. For example, a raw Rmd document or Jupyter notebook is a generative document that can be used to create a document containing text, code, and the output generated from executing the code. A generated document is an HTML, Word, or PDF export from an executed generative document.

Note that the generating code can be omitted from the generated output document, leaving just the text and the code-generated outputs. Code cells can also be collapsed so the code itself is hidden from view but still available for inspection at any time:

Notebooks also allow “reverse closing” of cells—allowing an output cell to be immediately visible, even though the input cell that generated it is initially closed. This kind of hiding of code should generally be avoided in the body of a computational essay, but it’s sometimes useful at the beginning or end of an essay, either to give an indication of what’s coming, or to include something more advanced where you don’t want to go through in detail how it’s made.

Even if notebooks are not used interactively, they can be used to create correct static texts where outputs that are supposed to relate to some fragment of code in the main text actually do so because they are created by the code, rather than being cut and pasted from some other environment.

However, making the generative – as well as generated – documents available means readers can learn by doing, as well as reading:

One feature of the Wolfram Language is that—like with human languages—it’s typically easier to read than to write. And that means that a good way for people to learn what they need to be able to write computational essays is for them first to read a bunch of essays. Perhaps then they can start to modify those essays. Or they can start creating “notes essays”, based on code generated in livecoding or other classroom sessions.

In terms of our own learnings to date about how to use notebooks most effectively as part of a teaching communication (i.e. as learning materials), Wolfram seems to have come to many similar conclusions. For example, try to limit the amount of code in any particular code cell:

In a typical computational essay, each piece of input will usually be quite short (often not more than a line or two). But the point is that such input can communicate a high-level computational thought, in a form that can readily be understood both by the computer and by a human reading the essay.

So what can go wrong? Well, like English prose, code can be unnecessarily complicated, and hard to understand. In a good computational essay, both the ordinary text, and the code, should be as simple and clean as possible. I try to enforce this for myself by saying that each piece of input should be at most one or perhaps two lines long—and that the caption for the input should always be just one line long. If I’m trying to do something where the core of it (perhaps excluding things like display options) takes more than a line of code, then I break it up, explaining each line separately.

It can also be useful to “preview” the output of a particular operation that populates a variable for use in the following expression to help the reader understand what sort of thing that expression is evaluating:

Another important principle as far as I’m concerned is: be explicit. Don’t have some variable that, say, implicitly stores a list of words. Actually show at least part of the list, so people can explicitly see what it’s like.

In many respects, the computational narrative format forces you to construct an argument in a particular way: if a piece of code operates on a particular thing, you need to access, or create, the thing before you can operate on it.

[A]nother thing that helps is that the nature of a computational essay is that it must have a “computational narrative”—a sequence of pieces of code that the computer can execute to do what’s being discussed in the essay. And while one might be able to write an ordinary essay that doesn’t make much sense but still sounds good, one can’t ultimately do something like that in a computational essay. Because in the end the code is the code, and actually has to run and do things.

One of the arguments I’ve been trying to develop in an attempt to persuade some of my colleagues to consider the use of notebooks to support teaching is the notebook nature of them. Several years ago, one of the en vogue ideas being pushed in our learning design discussions was to try to find ways of supporting and encouraging the use of “learning diaries”, where students could reflect on their learning, recording not only things they’d learned but also ways they’d come to learn them. Slightly later, portfolio style assessment became “a thing” to consider.

Wolfram notes something similar from way back when…

The idea of students producing computational essays is something new for modern times, made possible by a whole stack of current technology. But there’s a curious resonance with something from the distant past. You see, if you’d learned a subject like math in the US a couple of hundred years ago, a big thing you’d have done is to create a so-called ciphering book—in which over the course of several years you carefully wrote out the solutions to a range of problems, mixing explanations with calculations. And the idea then was that you kept your ciphering book for the rest of your life, referring to it whenever you needed to solve problems like the ones it included.

Well, now, with computational essays you can do very much the same thing. The problems you can address are vastly more sophisticated and wide-ranging than you could reach with hand calculation. But like with ciphering books, you can write computational essays so they’ll be useful to you in the future—though now you won’t have to imitate calculations by hand; instead you’ll just edit your computational essay notebook and immediately rerun the Wolfram Language inputs in it.

One of the advantages that notebooks have over some other environments in which students learn to code is that structure of the notebook can encourage you to develop a solution to a problem whilst retaining your earlier working.

The earlier working is where you can engage in the minutiae of trying to figure out how to apply particular programming concepts, creating small, playful test examples of the sort of thing you need to use in the task you have actually been set. (I think of this as a “trial driven” software approach rather than a “test driven” one; in a trial, you play with a bit of code in the margins to check that it does the sort of thing you want, or expect, it to do before using it in the main flow of a coding task.)

One of the advantages for students using notebooks is that they can doodle with code fragments to try things out, and keep a record of the history of their own learning, as well as producing working bits of code that might be used for formative or summative assessment, for example.

Another advantage of creating notebooks, which may include recorded fragments of dead ends encountered when trying to solve a particular problem, is that you can refer back to them. And reuse what you learned, or discovered how to do, in them.

And this is one of the great general features of computational essays. When students write them, they’re in effect creating a custom library of computational tools for themselves—that they’ll be in a position to immediately use at any time in the future. It’s far too common for students to write notes in a class, then never refer to them again. Yes, they might run across some situation where the notes would be helpful. But it’s often hard to motivate going back and reading the notes—not least because that’s only the beginning; there’s still the matter of implementing whatever’s in the notes.

Looking at many of the notebooks students have created from scratch to support assessment activities in TM351, it’s evident that many of them are not using them other than as an interactive code editor with history. The documents contain code cells and outputs, with little if any commentary (what comments there are are often just simple inline code comments in a code cell). They are barely computational narratives, let alone computational essays; they’re more of a computational scratchpad containing small code fragments, without context.

This possibly reflects the prior history in terms of code education that students have received, working “out of context” in an interactive Python command line editor, or a traditional IDE, where the idea is to produce standalone files containing complete programmes or applications. Not pieces of code, written a line at a time, in a narrative form, with example output to show the development of a computational argument.

(One argument I’ve heard made against notebooks is that they aren’t appropriate as an environment for writing “real programmes” or “applications”. But that’s not strictly true: Jupyter notebooks can be used to define and run microservices/APIs as well as GUI driven applications.)

However, if you start to see computational narratives as a form of narrative documentation that can be used to support a form of literate programming, then once again the notebook format can come into its own, and draw on styling more common in a text document editor than a programming environment.

(By default, Jupyter notebooks expect you to write text content in markdown or markdown+HTML, but WYSIWYG editors can be added as an extension.)

Use the structured nature of notebooks. Break up computational essays with section headings, again helping to make them easy to skim. I follow the style of having a “caption line” before each input. Don’t worry if this somewhat repeats what a paragraph of text has said; consider the caption something that someone who’s just “looking at the pictures” might read to understand what a picture is of, before they actually dive into the full textual narrative.

As well as allowing you to create documents in which the content is generated interactively – code cells can be changed and re-run, for example – it is also possible to embed interactive components in both generative and generated documents.

On the one hand, it’s quite possible to generate and embed an interactive map or interactive chart that supports popups or zooming in a generated HTML output document.

On the other, Mathematica and Jupyter both support the dynamic creation of interactive widget controls in generative documents that give you control over code elements in the document, such as sliders to change numerical parameters or list boxes to select categorical text items. (In the R world, there is support for embedded shiny apps in Rmd documents.)

These can be useful when creating narratives that encourage exploration (for example, in the sense of explorable explanations), though I seem to recall Michael Blastland expressing concern several years ago about how ineffective interactives could be in data journalism stories.

The technology of Wolfram Notebooks makes it straightforward to put in interactive elements, like Manipulate, [interact/interactive in Jupyter notebooks] into computational essays. And sometimes this is very helpful, and perhaps even essential. But interactive elements shouldn’t be overused. Because whenever there’s an element that requires interaction, this reduces the ability to skim the essay.”

I’ve also thought previously that interactive functions are a useful way of motivating the use of functions in general when teaching introductory programming. For example, An Alternative Way of Motivating the Use of Functions?.

One of the issues in trying to set up student notebooks is how to handle boilerplate code that is required before the student can create, or run, the code you actually want them to explore. In TM351, we preload notebooks with various packages and bits of magic; in my own tinkerings, I’m starting to try to package stuff up so that it can be imported into a notebook in a single line.

Sometimes there’s a fair amount of data—or code—that’s needed to set up a particular computational essay. The cloud is very useful for handling this. Just deploy the data (or code) to the Wolfram Cloud, and set appropriate permissions so it can automatically be read whenever the code in your essay is executed.

As far as opportunities for making increasing use of notebooks as a kind of technology goes, I came to a similar conclusion some time ago to Stephen Wolfram when he writes:

[I]t’s only very recently that I’ve realized just how central computational essays can be to both the way people learn, and the way they communicate facts and ideas. Professionals of the future will routinely deliver results and reports as computational essays. Educators will routinely explain concepts using computational essays. Students will routinely produce computational essays as homework for their classes.

Regarding his final conclusion, I’m a little bit more circumspect:

The modern world of the web has brought us a few new formats for communication—like blogs, and social media, and things like Wikipedia. But all of these still follow the basic concept of text + pictures that’s existed since the beginning of the age of literacy. With computational essays we finally have something new.

In many respects, HTML+Javascript pages have been capable of delivering, and actually have delivered, computationally generated documents for some time. Whether computational notebooks offer some sort of step-change away from that, or actually represent a return to the original read/write imaginings of the web, with portable and computed facts accessed using Linked Data, remains an open question.


Speeding up package installation

Wed, 11/15/2017 - 12:17

(This article was first published on R – Why?, and kindly contributed to R-bloggers)

Can’t be bothered reading, tell me now

A simple one-line tweak can significantly speed up package installation and updates.

See my post at the Jumping Rivers blog (no point duplicating content in two places)
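For flavour, one tweak of this kind (not necessarily the exact one described in the post) is to let install.packages() and update.packages() work on several packages in parallel by setting the Ncpus option, for example in your .Rprofile:

# Speculative example of a one-line speed-up, not necessarily the one in the linked post:
# use several CPU cores when installing or updating multiple packages at once.
options(Ncpus = max(1L, parallel::detectCores() - 1L))

# install.packages() and update.packages() pick this up via getOption("Ncpus")
install.packages(c("dplyr", "ggplot2", "data.table"))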

 

