R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Functional peace of mind

Tue, 11/14/2017 - 01:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

I think what I enjoy the most about functional programming is the peace of mind that comes with it. With functional programming, there’s a lot of stuff you don’t need to think about. You can write functions that are general enough so that they solve a variety of problems. For example, imagine for a second that R does not have the sum() function anymore. If you want to compute the sum of, say, the first 100 integers, you could write a loop that would do that for you:

numbers = 0

for (i in 1:100){
  numbers = numbers + i
}

print(numbers)
## [1] 5050

The problem with this approach is that you cannot reuse any of the code there, even if you put it inside a function. For instance, what if you want to merge 4 datasets together? You would need something like this:

library(dplyr)
data(mtcars)

mtcars1 = mtcars %>% mutate(id = "1")
mtcars2 = mtcars %>% mutate(id = "2")
mtcars3 = mtcars %>% mutate(id = "3")
mtcars4 = mtcars %>% mutate(id = "4")

datasets = list(mtcars1, mtcars2, mtcars3, mtcars4)

temp = datasets[[1]]

for(i in 1:3){
  temp = full_join(temp, datasets[[i+1]])
}

## Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "id")
## Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "id")
## Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "id")

glimpse(temp)

## Observations: 128
## Variables: 12
## $ mpg  21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
## $ id   "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1...

Of course, the logic is very similar to before, but you need to think carefully about the structure holding your elements (which can be numbers, datasets, characters, etc.), as well as be careful about indexing correctly… and depending on the type of objects you are working with, you might need to tweak the code further.

How would a functional programming approach make this easier? Of course, you could use purrr::reduce() to solve these problems. However, since I assumed that sum() does not exist, I will also assume that purrr::reduce() does not exist either and write my own, clumsy implementation. Here’s the code:

my_reduce = function(a_list, a_func, init = NULL, ...){

  if(is.null(init)){
    init = `[[`(a_list, 1)
    a_list = tail(a_list, -1)
  }

  car = `[[`(a_list, 1)
  cdr = tail(a_list, -1)

  init = a_func(init, car, ...)

  if(length(cdr) != 0){
    my_reduce(cdr, a_func, init, ...)
  } else {
    init
  }
}

This can look much more complicated than before, but the idea is quite simple if you know about recursive functions (recursive functions are functions that call themselves). I won’t explain how the function works, because it is not the main point of the article (but if you’re curious, I encourage you to play around with it). The point is that now, I can do the following:

my_reduce(list(1,2,3,4,5), `+`)
## [1] 15

my_reduce(datasets, full_join) %>% glimpse
## Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "id")
## Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "id")
## Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "id")
## Observations: 128
## Variables: 12
## $ mpg  21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
## $ id   "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1...

And if I need to merge another dataset, I don’t need to change anything at all. Plus, because my_reduce() is very general, I can even use it for situations I didn’t write it for in the first place:

my_reduce(list("a", "b", "c", "d", "e"), paste) ## [1] "a b c d e"

Of course, paste() is vectorized, so you could just as well do paste(1, 2, 3, 4, 5), but again, I want to insist on the fact that writing or using such functions allows you to abstract over a lot of things. There is nothing specific to any type of object in my_reduce(), whereas the loops have to be tailored for the kind of object you’re working with. As long as the a_func argument is a binary operator that combines the elements inside a_list, it’s going to work. And I don’t need to think about indexing, about temporary variables, or about the structure that will hold my results.
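As one small extra illustration (mine, not from the original post), the init argument of my_reduce() lets you seed the accumulation with a starting value:

my_reduce(list(1, 2, 3, 4, 5), `+`, init = 100)
## [1] 115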



Come and work with me

Tue, 11/14/2017 - 01:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

I have funding for a new post-doctoral research fellow, on a 2-year contract, to work with me and Professor Kate Smith-Miles on analysing large collections of time series data. We are particularly seeking someone with a PhD in computational statistics or statistical machine learning.
Desirable characteristics:
  • Experience with time series data.
  • Experience with R package development.
  • Familiarity with reproducible research practices (e.g., git, rmarkdown, etc.).
  • A background in machine learning or computational statistics.



2017 rOpenSci ozunconf :: Reflections and the realtime Package

Tue, 11/14/2017 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

This year’s rOpenSci ozunconf was held in Melbourne, bringing together over 45 R enthusiasts from around the country and beyond. As is customary, ideas for projects were discussed in GitHub Issues (41 of them by the time the unconf rolled around!) and there was no shortage of enthusiasm, interesting concepts, and varied experience.

I’ve been to a few unconfs now and I treasure the time I get to spend with new people, new ideas, new backgrounds, new approaches, and new insights. That’s not to take away from the time I get to spend with people I met at previous unconfs; I’ve gained great friendships and started collaborations on side projects with these wonderful people.

When the call for nominations came around this year it was an easy decision. I don’t have employer support to attend these things so I take time off work and pay my own way. This is my networking time, my development time, and my skill-building time. I wasn’t sure what sort of project I’d be interested in but I had no doubts something would come up that sounded interesting.

As it happened, I had been playing around with a bit of code, purely out of interest and hoping to learn how htmlwidgets work. The idea I had was to make a classic graphic equaliser visualisation (like the animated example shown in the original post) using R.

This presents several challenges: how can I get live audio into R, and how fast can I plot the signal? I had doubts about both parts, partly because of the way that R calls tie up the session (for now…) and partly because constructing a ggplot2 object is somewhat slow (in terms of raw audio speeds). I’d heard about htmlwidgets and thought there must be a way to leverage that towards my goal.

I searched for a graphic equaliser javascript library to work with and didn’t find much that aligned with what I had in my head. Eventually I stumbled on p5.js and its examples page, which has an audio-input plot with a live demo. It’s a frequency spectrum, but I figured that’s just a bit of binning away from what I need (the original post shows the example running).

This seemed to be worth a go. I managed to follow enough of this tutorial to have the library called from R. I modified the javascript canvas code to look a little more familiar, and the first iteration of geom_realtime() was born.

This seemed like enough of an idea that I proposed it in the GitHub Issues for the unconf. It got a bit of attention, which was worrying, because I had no idea what to do with this next. Peter Hickey pointed out that Sean Kross had already wrapped some of the p5.js calls into R calls with his p5 package, so this seemed like a great place to start. It’s quite a clever way of doing it too; it involves re-writing the javascript which htmlwidgets calls on each time you want to do something.

Fast forward to the unconf and a decent number of people gathered around a little slip of paper with geom_realtime() written on it. I had to admit to everyone that the ggplot2 aspect of my demo was a sham (it’s surprisingly easy to draw a canvas in just the right shade of grey with white gridlines), but people stayed, and we got to work seeing what else we could do with the idea. We came up with some suggestions for input sources, some different plot types we might like to support, and set about trying to understand what Sean’s package actually did.

As it tends to work out, we had a great mix of people with different experience levels in different aspects of the project; some who knew how to make a package, some who knew how to work with javascript, some who knew how to work with websockets, some who knew about realtime data sources, and some who knew about nearly none of these things (✋ that would be me). If everyone knew every aspect about how to go about an unconf project I suspect the endeavor would be a bit boring. I love these events because I get to learn so much about so many different topics.

I shared my demo script and we deconstructed the pieces. We dug into the inner workings of the p5 package and started determining which parts we could siphon off to meet our own needs. One of the aspects that we wanted to figure out was how to simulate realtime data. This could be useful both for testing, and also in the situation where one might want to ’re-cast’ some time-coded data. We were thankful that Jackson Kwok had done a deep dive into websockets, and pretty soon (surprisingly soon, perhaps; within the first day) we had examples of (albeit constructed) real-time (every 100ms) data streaming from a server and being plotted at speed.

Best of all, running the plot code didn’t tie up the session; it uses a listener written into the javascript so it just waits for input on a particular port.
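For readers curious what the data-serving side of such a setup could look like in R, here is a rough sketch of my own (not the realtime package’s actual code) using the httpuv package: a websocket server that replies to each client message with a freshly simulated value, which a javascript (or R) client could then draw as it arrives. The port number is an arbitrary choice, and runServer() blocks, so it would be run in a separate R process.

library(httpuv)

# minimal websocket data server: every time a client sends a message,
# reply with one freshly simulated "realtime" data point
runServer("127.0.0.1", 8000, list(
  call = function(req) {
    # plain HTTP requests just get a placeholder response
    list(status = 200L,
         headers = list("Content-Type" = "text/plain"),
         body = "websocket endpoint")
  },
  onWSOpen = function(ws) {
    ws$onMessage(function(binary, message) {
      ws$send(as.character(rnorm(1)))
    })
  }
))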

With the core goal well underway, people started branching out into aspects they found most interesting. We had some people work on finding and connecting actual data sources, such as the bitcoin exchange rate

and a live-stream of binary-encoded data from the Australian National University (ANU) Quantum Random Numbers Server

Others formalised the code so that it can be piped into different ‘themes’, and retain the p5 structure for adding more components

These were still toy examples of course, but they highlight what’s possible. They were each constructed using an offshoot of the p5 package whereby the javascript is re-written to include various features each time the plot is generated.

Another route we took was to use the direct javascript binding API with factory functions. This had less flexibility in terms of adding modular components, but meant that the javascript could be modified without worrying so much about how it needed to interact with p5. This resulted in some outstanding features such as side-scrolling and date-time stamps. We also managed to pipe the data off to another thread for additional processing (in R) before being sent to the plot.

The example we ended up with reads the live-feed of Twitter posts under a given hashtag, computes a sentiment analysis on the words with R, and live-plots the result:

Overall I was amazed at the progress we made over just two days. Starting from a silly idea/demo, we built a package which can plot realtime data, and can even serve up some data to be plotted. I have no expectations that this will be the way of the future, but it’s been a fantastic learning experience for me (and hopefully others too). It’s highlighted that there are ways to achieve realtime plots, even if we’ve used a library built for drawing rather than one built for plotting per se.

It’s even inspired offshoots in the form of some R packages; tRainspotting which shows realtime data on New South Wales public transport using leaflet as the canvas

and jsReact which explores the interaction between R and Javascript

The possibilities are truly astounding. My list of ‘things to learn’ has grown significantly since the unconf, and projects are still starting up/continuing to develop. The ggeasy package isn’t related, but it was spawned from another unconf Github Issue idea. Again: ideas and collaborations starting and developing.

I had a great time at the unconf, and I can’t wait until the next one. My hand will be going up to help out, attend, and help start something new.

My thanks and congratulations go out to each of the realtime developers: Richard Beare, Jonathan Carroll, Kim Fitter, Charles Gray, Jeffrey O Hanson, Yan Holtz, Jackson Kwok, Miles McBain and the entire cohort of 2017 rOpenSci ozunconf attendees. In particular, my thanks go to the organisers of such a wonderful event; Nick Tierney, Rob Hyndman, Di Cook, and Miles McBain.



Updated curl package provides additional security for R on Windows

Tue, 11/14/2017 - 00:27

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

There are many R packages that connect to the internet, whether it's to import data (readr), install packages from Github (devtools), connect with cloud services (AzureML), or many other web-connected tasks. There's one R package in particular that provides the underlying connection between R and the Web: curl, by Jeroen Ooms, who is also the new maintainer for R for Windows. (The name comes from curl, a command-line utility and interface library for connecting to web-based services). The curl package provides replacements for the standard url and download.file functions in R with support for encryption, and the package was recently updated to enhance its security, particularly on Windows.

To implement secure communications, the curl package needs to connect with a library that handles the SSL (secure socket layer) encryption. On Linux and Macs, curl has always used the OpenSSL library, which is included on those systems. Windows doesn't have this library (at least, outside of the Subsystem for Linux), so on Windows the curl package included the OpenSSL library and associated certificate. This raises its own set of issues (see the post linked below for details), so version 3.0 of the package instead uses the built-in winSSL library. This means curl uses the same security architecture as other connected applications on Windows.

This shouldn't have any impact on your web connectivity from R now or in the future, except the knowledge that the underlying architecture is more secure. Nonetheless, it's possible to switch back to OpenSSL-based encryption (and this remains the default on Windows 7, which does not include winSSL).
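To check which backend a given installation is using (and, if needed, to ask for OpenSSL instead), something along the following lines should work. Note that the CURL_SSL_BACKEND environment variable is my reading of the post linked below rather than something documented here, so verify it there; it must be set before the curl package is loaded for the first time in the session.

# which SSL backend is this curl build using?
curl::curl_version()$ssl_version

# request the OpenSSL backend instead of winSSL (set before curl is first loaded)
Sys.setenv(CURL_SSL_BACKEND = "openssl")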

Version 3.0 of the curl package is available now on CRAN (though you'll likely never need to load it explicitly — packages that use it do that for you automatically). You can learn more about the changes at the link below. If you'd like to know more about what the curl package can do, this vignette is a great place to start. Many thanks to Jeroen Ooms for this package.

rOpenSci: Changes to Internet Connectivity in R on Windows



normal variates in Metropolis step

Tue, 11/14/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A definitely puzzled participant on X validated, confusing the Normal variate or variable used in the random walk Metropolis-Hastings step with its Normal density… It took some cumulated efforts to point out the distinction. Especially as the originator of the question had a rather strong a priori about his or her background:

“I take issue with your assumption that advice on the Metropolis Algorithm is useless to me because of my ignorance of variates. I am currently taking an experimental course on Bayesian data inference and I’m enjoying it very much, i believe i have a relatively good understanding of the algorithm, but i was unclear about this specific.”

despite pondering the meaning of the call to rnorm(1)… I will keep this question in store to use in class when I teach Metropolis-Hastings in a couple of weeks.
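For anyone puzzled by the same distinction, here is a minimal random-walk Metropolis sketch of my own (not from the post): rnorm() produces the Normal variate that proposes the move, while dnorm() merely evaluates a Normal density inside the acceptance ratio.

# target: standard Normal; proposal: Gaussian random walk
n_iter <- 1e4
x <- numeric(n_iter)
for (t in 2:n_iter) {
  prop  <- x[t - 1] + rnorm(1)             # a Normal variate: one random draw
  ratio <- dnorm(prop) / dnorm(x[t - 1])   # Normal densities: plain numbers
  x[t]  <- if (runif(1) < ratio) prop else x[t - 1]
}
mean(x); sd(x)   # roughly 0 and 1, as expected for the standard Normal target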




Spatial networks – case study St James centre, Edinburgh (2/3)

Mon, 11/13/2017 - 23:46

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

This is part two in a series I’m writing on network analysis. The first part is here. In this section I’m going to cover allocating resources, again using the St James’ development in Edinburgh as an example. Most excitingly (for me), the end of this post covers the impact of changes in resource allocation.

Edinburgh (and surrounds) has more than one shopping centre. Many more. I’ve had a stab at narrowing these down to those that are similar to the St James centre, i.e. they’re big, (generally) covered and may have a cinema. You can see a plot of these below. As you can see the majority are concentrated around the population centre of Edinburgh.

Location of big shopping centres in and around Edinburgh.

As with the previous post I’ve used GRASS GIS for the network analysis, QGIS for cartography and R for some subsequent analysis. I’ve used the Ordnance Survey code-point open and openroads datasets for the analysis and various Ordnance Survey maps for the background.

An allocation map shows how you can split your network to be serviced by different resource centres. I like to think of it as deciding which fire station sends an engine to which road. But this can be extended to any resource with multiple locations: bank branches, libraries, schools, swimming pools. In this case we’re using shopping centres. As always the GRASS manual page contains a full walk through of how to run the analysis. I’ll repeat the steps I took below:

# connect points to network
v.net roads_EH points=shopping_centres out=centres_net op=connect thresh=200

# allocate, specifying range of center cats (easier to catch all):
v.net.alloc centres_net out=centres_alloc center_cats=1-100000 node_layer=2

# Create db table
v.db.addtable map=centres_alloc@shopping_centres

# Join allocation and centre tables
v.db.join map=centres_alloc column=cat other_table=shopping_centres other_column=cat

# Write to shp
v.out.ogr -s input=centres_alloc output=shopping_alloc format=ESRI_Shapefile output_layer=shopping_alloc

The last step isn’t strictly necessary, as QGIS and R can connect directly to the GRASS database, but old habits die hard! We’ve now got a copy of the road network where all roads are tagged with which shopping centre they’re closest to. We can see this below:

Allocation network of EH shopping centres.

A few things stand out for me:

  • Ocean terminal is a massive centre but is closest to few people.
  • Some of the postcodes closest to St James are really far away.
  • The split between Fort Kinnaird and St James is really stark just east of the A702.

If I was a councillor and I coordinated shopping centres in a car free world, I now know where I’d be lobbying for better public transport!

We can also do a similar analysis using the shortest path, as in the previous post. Instead of looking for the shortest path to a single point, we can get GRASS to calculate the distance from each postcode to its nearest shopping centre (note this is using the postcodes_EH file from the previous post):

# connect postcodes to streets as layer 2
v.net --overwrite input=roads_EH points=postcodes_EH output=roads_net1 operation=connect thresh=400 arc_layer=1 node_layer=2

# connect shops to streets as layer 3
v.net --overwrite input=roads_net1 points=shopping_centres output=roads_net2 operation=connect thresh=400 arc_layer=1 node_layer=3

# inspect the result
v.category in=roads_net2 op=report

# shortest paths from postcodes (points in layer 2) to nearest stations (points in layer 3)
v.net.distance --overwrite in=roads_net2 out=pc_2_shops flayer=2 to_layer=3

# Join postcode and distance tables
v.db.join map=postcodes_EH column=cat other_table=pc_2_shops other_column=cat

# Join station and distance tables
v.db.join map=postcodes_EH column=tcat other_table=shopping_centres other_column=cat subset_columns=Centre

# Make a km column
# Really short field name so we can output to shp
v.db.addcolumn map=postcodes_EH columns="dist_al_km double precision"
v.db.update map=postcodes_EH column=dist_al_km qcol="dist/1000"

# Make a st james vs column
# Uses results from the previous blog post
v.db.addcolumn map=postcodes_EH columns="diff_km double precision"
v.db.update map=postcodes_EH column=diff_km qcol="dist_km-dist_al_km"

# Write to shp
v.out.ogr -s input=postcodes_EH output=pc_2_shops format=ESRI_Shapefile output_layer=pc_2_shops

Again we can plot these up in QGIS (below). These are really similar results to the road allocation previously, but give us a little more detail on where the population are, as each postcode is shown. However, the eagle-eyed of you will have noticed we pulled out the distance for each postcode in the code above and then compared it to the distance to St James alone. We can use this for considering the impact of resource allocation.

Closest shopping centre for each EH postcode.

Switching to R, we can interrogate the postcode data further. Using R’s rgdal library we can read in the shp file and generate some summary statistics:

Centre           No. of postcodes closest
Almondvale       4361
Fort Kinnaird    7813
Gyle             3437
Ocean terminal   1321
St James         7088

# Package
install.packages("rgdal")
library(rgdal)

# Read file
postcodes = readOGR("/home/user/dir/dir/network/data/pc_2_shops.shp")

# How many postcodes for each centre?
table(postcodes$Centre)

We can also look at the distribution of distances for each shopping centre using a box and whisker plot. As in the map, we can see that Fort Kinnaird and St James are closest to the most distant postcodes, and that Ocean terminal has a small geographical catchment. The code for this plot is at the end of this post.

We can also repeat the plot from the previous blog post and look at how many postcodes are within walking and cycling distance of their nearest centre. In the previous post I showed the solid line and circle points for the St James centre. We can now compare those results to the impact of people travelling to their closest centre (below). The number of postcodes within walking distance of their nearest centre is nearly double that of St James alone, and the share within cycling distance rises to nearly 50%! Code at the end of the post.

We also now have two curves on the above plot, and the area between them is the distance saved if each postcode travelled to its closest shopping centre instead of the St James.

The total distance is a whopping 123,680 km!

This impact analysis is obviously of real use in these times of reduced public services. My local council, Midlothian, is considering closing all its libraries bar one. What impact would this have on users? How would the road network around the kept library cope? Why have they just been building new libraries? It’s also analysis I really hope the DWP undertook before closing job centres across Glasgow. Hopefully the work of this post helps people investigate these impacts themselves.

# distance saved
# NA value is one postcode too far to be joined to road - oops!
sum(postcodes$diff_km, na.rm=T)

# Boxplot
png("~/dir/dir/network/figures/all-shops_distance_boxplot.png", height=600, width=800)
par(cex=1.5)
boxplot(dist_al_km ~ Centre, postcodes,
        lwd=2, range=0,
        main="Box and whiskers of EH postcodes to their nearest shopping centre",
        ylab="Distance (km)")
dev.off()

# Line plot
# Turn into percentage instead of postcode counts
x = sort(postcodes$dist_km)
x = quantile(x, seq(0, 1, by=0.01))
y = sort(postcodes$dist_al_km)
y = quantile(y, seq(0, 1, by=0.01))

png("~/dir/dir/network/figures/all-shops_postcode-distance.png", height=600, width=800)
par(cex=1.5)
plot(x, type="l",
     main="EH postcode: shortest road distances to EH shopping centres",
     xlab="Percentage of postcodes",
     ylab="Distance (km)",
     lwd=3)
lines(y, lty=2, lwd=3)
points(max(which(x<2)), 2, pch=19, cex=2, col="purple4")
points(max(which(x<5)), 5, pch=19, cex=2, col="darkorange")
points(max(which(y<2)), 2, pch=18, cex=2.5, col="purple4")
points(max(which(y<5)), 5, pch=18, cex=2.5, col="darkorange")
legend("topleft",
       c("St James", "Nearest centre",
         paste0(max(which(x<2)), "% postcodes within 2 km (walking) of St James"),
         paste0(max(which(x<5)), "% postcodes within 5 km (cycling) of St James"),
         paste0(max(which(y<2)), "% postcodes within 2 km (walking) of nearest centre"),
         paste0(max(which(y<5)), "% postcodes within 5 km (cycling) of nearest centre")),
       col=c("black", "black", "purple4", "darkorange", "purple4", "darkorange"),
       pch=c(NA, NA, 19, 19, 18, 18),
       lwd=c(3),
       lty=c(1, 2, NA, NA, NA, NA),
       pt.cex=c(NA, NA, 2, 2, 2.5, 2.5))
dev.off()



SQL Saturday statistics – Web Scraping with R and SQL Server

Mon, 11/13/2017 - 20:08

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

I wanted to check a simple query: How many times has a particular topic been presented and from how many different presenters.

Sounds interesting; tackling it should not be a problem, just note that the final numbers may vary, since there will be some text analysis included.

First of all, some web scraping to get the information from the SQLSaturday web page. With R/Python integration into SQL Server, reading the information from the website is a fairly straightforward task:

EXEC sp_execute_external_script
  @language = N'R'
 ,@script = N'
  library(rvest)
  library(XML)
  library(dplyr)

  #URL to schedule
  url_schedule <- ''http://www.sqlsaturday.com/687/Sessions/Schedule.aspx''

  #Read HTML
  webpage <- read_html(url_schedule)

  # Event schedule
  schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'') # OK

  # Extracting HTML content
  ht <- html_text(schedule_info)

  df <- data.frame(data=ht)

  #create empty DF
  df_res <- data.frame(title=c(), speaker=c())

  for (i in 1:nrow(df)){
    #print(df[i])
    if (i %% 2 != 0) #odd flow
      print(paste0("title is: ", df$data[i]))
    if (i %% 2 == 0) #even flow
      print(paste0("speaker is: ", df$data[i]))
    df_res <- rbind(df_res, data.frame(title=df$data[i], speaker=df$data[i+1]))
  }

  df_res_new = df_res[seq(1, nrow(df_res), 2), ]
  OutputDataSet <- df_res_new'

Python offers the Beautifulsoup library, which will do pretty much the same (or even better) job as the rvest and XML packages combined. Nevertheless, once we have the data from a test page out (in this case I am reading the Slovenian SQLSaturday 2017 schedule, simply because it is awesome), we can “walk through” the whole web page and generate all the needed information.

The SQLSaturday website has every event enumerated, making it very easy to parametrize the web scraping process:

So we will scrape through the last 100 events by simply incrementing the integer of the event; the input parameter will be parsed as:

http://www.sqlsaturday.com/600/Sessions/Schedule.aspx

http://www.sqlsaturday.com/601/Sessions/Schedule.aspx

http://www.sqlsaturday.com/602/Sessions/Schedule.aspx

and so on, regardless of whether each event page exists or not. Results will be returned back to the SQL Server database.
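Outside SQL Server, the same loop can be sketched in plain R (my own illustration, not the post’s code); tryCatch() simply skips event numbers whose schedule page cannot be read:

library(rvest)

get_sessions <- function(event_id) {
  url <- paste0("http://www.sqlsaturday.com/", event_id, "/Sessions/Schedule.aspx")
  ht <- tryCatch({
    page <- read_html(url)
    html_text(html_nodes(page, ".session-schedule-cell-info"))
  }, error = function(e) character(0))   # event page missing or unreachable
  if (length(ht) < 2 || length(ht) %% 2 != 0) return(NULL)
  data.frame(event   = event_id,
             title   = ht[seq(1, length(ht), by = 2)],
             speaker = ht[seq(2, length(ht), by = 2)],
             stringsAsFactors = FALSE)
}

# scrape events 600 to 690 and stack the results
all_sessions <- do.call(rbind, lapply(600:690, get_sessions))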

Creating a stored procedure will do the job:

USE SqlSaturday;
GO

CREATE OR ALTER PROCEDURE GetSessions
 @eventID SMALLINT
AS

DECLARE @URL VARCHAR(500)
SET @URL = 'http://www.sqlsaturday.com/' + CAST(@eventID AS NVARCHAR(5)) + '/Sessions/Schedule.aspx'

PRINT @URL

DECLARE @TEMP TABLE
(
  SqlSatTitle NVARCHAR(500)
 ,SQLSatSpeaker NVARCHAR(200)
)

DECLARE @RCODE NVARCHAR(MAX)
SET @RCODE = N'
  library(rvest)
  library(XML)
  library(dplyr)
  library(httr)
  library(curl)
  library(selectr)

  #URL to schedule
  url_schedule <- "'

DECLARE @RCODE2 NVARCHAR(MAX)
SET @RCODE2 = N'"

  #Read HTML
  webpage <- html_session(url_schedule) %>% read_html()

  # Event schedule
  schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'') # OK

  # Extracting HTML content
  ht <- html_text(schedule_info)

  df <- data.frame(data=ht)

  #create empty DF
  df_res <- data.frame(title=c(), speaker=c())

  for (i in 1:nrow(df)){
    #print(df[i])
    if (i %% 2 != 0) #odd flow
      print(paste0("title is: ", df$data[i]))
    if (i %% 2 == 0) #even flow
      print(paste0("speaker is: ", df$data[i]))
    df_res <- rbind(df_res, data.frame(title=df$data[i], speaker=df$data[i+1]))
  }

  df_res_new = df_res[seq(1, nrow(df_res), 2), ]
  OutputDataSet <- df_res_new
';

DECLARE @FINAL_RCODE NVARCHAR(MAX)
SET @FINAL_RCODE = CONCAT(@RCODE, @URL, @RCODE2)

INSERT INTO @Temp
EXEC sp_execute_external_script
  @language = N'R'
 ,@script = @FINAL_RCODE

INSERT INTO SQLSatSessions (sqlSat, SqlSatTitle, SQLSatSpeaker)
SELECT
  @EventID AS sqlsat
 ,SqlSatTitle
 ,SqlSatSpeaker
FROM @Temp

 

Before you run this, just a little environment setup:

USE [master];
GO

CREATE DATABASE SQLSaturday;
GO

USE SQLSaturday;
GO

CREATE TABLE SQLSatSessions
(
  id SMALLINT IDENTITY(1,1) NOT NULL
 ,SqlSat SMALLINT NOT NULL
 ,SqlSatTitle NVARCHAR(500) NOT NULL
 ,SQLSatSpeaker NVARCHAR(200) NOT NULL
)

 

There you go! Now you can run a stored procedure for a particular event (in this case SQL Saturday Slovenia 2017):

EXECUTE GetSessions @eventID = 687

or you can run this procedure against multiple SQLSaturday events and web scrape data from SQLSaturday.com website instantly.

For Slovenian SQLSaturday, I get the following sessions and speakers list:

Please note that if you are running this code behind a firewall or proxy, some additional configuration for the proxy or firewall might be needed!

So, going back to the original question of how many times Query Store has been presented at SQL Saturdays (from SQLSat600 until SQLSat690), here is the frequency table:

Or presented with a pandas graph:

Query Store is popular, more so than all the R, Python or Azure ML topics, but PowerShell is gaining popularity like crazy. Good work, PowerShell people!
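For completeness, the frequency count itself is easy to reproduce in R. A sketch (assuming the scraped sessions have been pulled back into a data frame called sessions with title and speaker columns, which is my assumption rather than code from the post):

# sessions presenting on Query Store, and how many distinct speakers gave them
qs <- sessions[grepl("query store", tolower(sessions$title)), ]
nrow(qs)                    # number of times the topic was presented
length(unique(qs$speaker))  # number of different presenters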

As always, code is available at Github.

 



Visualizing classifier thresholds

Mon, 11/13/2017 - 17:38

(This article was first published on Rstats – bayesianbiologist, and kindly contributed to R-bloggers)

Lately I’ve been thinking a lot about the connection between prediction models and the decisions that they influence. There is a lot of theory around this, but communicating how the various pieces all fit together with the folks who will use and be impacted by these decisions can be challenging.

One of the important conceptual pieces is the link between the decision threshold (how high does the score need to be to predict positive) and the resulting distribution of outcomes (true positives, false positives, true negatives and false negatives). As a starting point, I’ve built this interactive tool for exploring this.

The idea is to take a validation sample of predictions from a model and experiment with the consequences of varying the decision threshold. The hope is that the user will be able to develop an intuition around the tradeoffs involved by seeing the link to the individual data points involved.
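As a stand-alone illustration of that link (a sketch of mine, not the interactive tool itself), one can tabulate the four outcome types for a few candidate thresholds on a toy validation sample:

# toy validation sample: a score in [0, 1] and the true label
set.seed(1)
score <- runif(1000)
label <- rbinom(1000, size = 1, prob = score)  # higher scores tend to be positive

outcomes <- function(threshold) {
  pred <- as.integer(score >= threshold)       # predict positive above the threshold
  c(threshold = threshold,
    TP = sum(pred == 1 & label == 1),
    FP = sum(pred == 1 & label == 0),
    TN = sum(pred == 0 & label == 0),
    FN = sum(pred == 0 & label == 1))
}

# raising the threshold trades false positives for false negatives
t(sapply(c(0.25, 0.50, 0.75), outcomes))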

Code for this experiment is available here. I hope to continue to build on this with other interactive, visual tools aimed at demystifying the concepts at the interface between predictions and decisions.



New Project: Data Science Instruction at the US Census Bureau!

Mon, 11/13/2017 - 17:00

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

Today I am delighted to announce an exciting new collaboration. I will be working with the US Census Bureau as a Data Science Instructor!

Over the next six months I will be helping Census develop courses on using R to work with Census Data. These courses will be free and open to the public. People familiar with my open source work will realize that this project is right up my alley!

As a start to this project I am trying to gather two pieces of information:

  1. Which packages do R programmers typically use when working with Census data?
  2. What types of analyses do R programmers typically do with Census data?

If you use R to work with Census data, please leave an answer below!




Make memorable plots with memery. v0.3.0 now on CRAN.

Mon, 11/13/2017 - 15:33

Make memorable plots with memery. memery is an R package that generates internet memes including superimposed inset graphs and other atypical features, combining the visual impact of an attention-grabbing meme with graphic results of data analysis. Version 0.3.0 of memery is now on CRAN. The latest development version and a package vignette are available on GitHub.

[original post]

Below is an example interleaving a semi-transparent ggplot2 graph between a meme image backdrop and overlying meme text labels. The meme function will produce basic memes without needing to specify a number of additional arguments, but this is not the main purpose of the package. Adding a plot is then as simple as passing the plot to the inset argument.

memery offers sensible defaults as well as a variety of basic templates for controlling how the meme and graph are spliced together. The example here shows how additional arguments can be specified to further control the content and layout. See the package vignette for a more complete set of examples and description of available features and graph templates.

Please do share your data analyst meme creations. Enjoy!

library(memery)

# Make a graph of some data
library(ggplot2)
x <- seq(0, 2*pi, length.out = 50)
panels <- rep(c("Plot A", "Plot B"), each = 50)
d <- data.frame(x = x, y = sin(x), grp = panels)
txt <- c("Philosoraptor's plots", "I like to make plots",
         "Figure 1. (A) shows a plot and (B) shows another plot.")
p <- ggplot(d, aes(x, y)) +
  geom_line(colour = "cornflowerblue", size = 2) +
  geom_point(colour = "orange", size = 4) +
  facet_wrap(~grp) +
  labs(title = txt[1], subtitle = txt[2], caption = txt[3])

# Meme settings
img <- system.file("philosoraptor.jpg", package = "memery") # image
lab <- c("What to call my R package?", "Hmm... What? raptr is taken!?",
         "Noooooo!!!!") # labels

# label sizes, positions, font families and colors
size <- c(1.8, 1.5, 2.2)
pos <- list(w = rep(0.9, 3), h = rep(0.3, 3),
            x = c(0.45, 0.6, 0.5), y = c(0.95, 0.85, 0.3))
fam <- c("Impact", "serif", "Impact")
col <- list(c("black", "orange", "white"), c("white", "black", "black"))
gbg <- list(fill = "#FF00FF50", col = "#FFFFFF75") # graph background

# Save meme
meme(img, lab, "meme.jpg", size = size, family = fam, col = col[[1]],
     shadow = col[[2]], label_pos = pos, inset = p, inset_bg = gbg, mult = 2)


Update on coordinatized or fluid data

Mon, 11/13/2017 - 01:56

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

We have just released a major update of the cdata R package to CRAN.

If you work with R and data, now is the time to check out the cdata package.

Among the changes in the 0.5.* version of the cdata package:

  • All coordinatized data or fluid data operations are now in the cdata package (no longer split between the cdata and replyr packages).
  • The transforms are now centered on the more general table driven moveValuesToRowsN() and moveValuesToColumnsN() operators (though pivot and un-pivot are now made available as convenient special cases).
  • All the transforms are now implemented in SQL through DBI (no longer using tidyr or dplyr, though we do include examples of using cdata with dplyr).
  • This is (unfortunately) a user visible API change, however adapting to the changed API is deliberately straightforward.

cdata now supplies very general data transforms on both in-memory data.frames and remote or large data systems (PostgreSQL, Spark/Hive, and so on). These transforms include operators such as pivot/un-pivot that were previously not conveniently available for these data sources (for example tidyr does not operate on such data, despite dplyr doing so).

To help with the transition we have updated the existing documentation.

The fluid data document is a bit long, as it covers a lot of concepts quickly. We hope to develop more targeted training material going forward.

In summary: cdata theory and package now allow very concise and powerful transformations of big data using R.



ShinyProxy 1.0.2

Sun, 11/12/2017 - 20:26

(This article was first published on Open Analytics, and kindly contributed to R-bloggers)

ShinyProxy is a novel, open source platform to deploy Shiny apps for the enterprise
or larger organizations. Since our last blog post, ten new
releases of ShinyProxy have seen the light of day, but with the 1.0.2 release it is time
to provide an overview of the lines of development and advances made.

Scalability

ShinyProxy now allows you to run thousands of Shiny apps concurrently on a Docker Swarm cluster.
Moreover, ShinyProxy will automatically detect whether the Docker API URL is a
Docker Engine API or a Swarm cluster API. In other words changing the back-end from
a single Docker host to a Docker Swarm is plug and play.

Single-Sign On

Complex deployments asked for advanced functionality for identity and access management (IAM).
To tackle this we introduced a new authentication mechanism, authentication: keycloak,
which integrates ShinyProxy with Keycloak, RedHat’s open source IAM solution. Features like single-sign on, identity brokering, user federation etc. are now available for ShinyProxy
deployments.

Larger Applications and Networks

Oftentimes Shiny applications will be offered as part of larger applications that are
written in languages other than R. To enable this type of integration, we have introduced
functionality to entirely hide the ShinyProxy user interface elements for seamless embedding
as views in bigger user interfaces.

Next to integration within other user interfaces, the underlying Shiny code may need to interact
with applications that live in specific networks. To make sure the Shiny app containers
have network interfaces configured for the right networks, a new docker-network configuration
parameter has been added to the app-specific configurations. Together with Docker volume mounting
for persistence, and the possibility to pass environment variables to Docker containers,
this gives Shiny developers lots of freedom to develop serious applications. An example configuration is given below. A Shiny app communicates over a dedicated Docker network db-net with a database back-end and configuration information is made available to the Shiny app via environment variables that are
read from a configuration file db.env:

- name: db-enabled-app
  display-name: Shiny App with a Database Persistence Layer
  description: Shiny App connecting with a Database for Persistence
  docker-image: registry.openanalytics.eu/public/db-enabled-app:latest
  docker-network-connections: [ "db-net" ]
  docker-env-file: db.env
  groups: [db-app-users]

Usage Statistics

Gathering usage statistics has been part of ShinyProxy since version 0.6.0, but was so far limited to an InfluxDB back-end. Customers asked us to integrate Shiny applications with MonetDB (and did not want a separate database to store usage statistics), so we developed a MonetDB adapter for version 0.8.4. Configuration has been streamlined with a usage-stats-url, and support for DB credentials is now offered through usage-stats-username and usage-stats-password.

Security

Proper security for ShinyProxy setups of all sizes is highly important and a number
of improvements have been implemented. The ShinyProxy security page
has been extended and extra content has been added on dealing
with sensitive configuration.
On the authentication side LDAPS support has been around for a long time, but since release 1.0.0
we also offer LDAP+StartTLS support out of the box.

Deployment

Following production deployments for customers, we now also offer RPM files for deployment
on CentOS 7 and RHEL 7, besides the .deb packages for Ubuntu and the platform-independent
JAR files.

Further Information

For all these new features, detailed documentation is provided on http://shinyproxy.io and as always community support on this new release is available at

https://support.openanalytics.eu

Don’t hesitate to send in questions or suggestions and have fun with ShinyProxy!



Creating integer64 and nanotime vectors in C++

Sat, 11/11/2017 - 01:00

(This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers)

Motivation: More Precise Timestamps

R has excellent facilities for dealing with both dates and datetime objects.
For datetime objects, the POSIXt time type can be mapped to POSIXct and
its representation of fractional seconds since the January 1, 1970 “epoch” as
well as to the broken-out list representation in POSIXlt. Many add-on
packages use these facilities.

POSIXct uses a double to provide 53 bits of resolution. That is generally
good enough for timestamps down to just above a microsecond, and has served
the R community rather well.
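A quick back-of-the-envelope check in R illustrates that limit (a small aside of mine, not from the original post):

t <- as.numeric(Sys.time())  # seconds since the epoch, stored as a double
(t + 1e-6) - t               # a microsecond step survives (approximately)
(t + 1e-9) - t               # a nanosecond step is rounded away: 0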

But increasingly, time increments are measured in nanoseconds. Other languages use a (signed)
64-bit integer to represent (integer) nanoseconds since the epoch. A bit over a year ago I realized
that we have this in R too—by combining the integer64 type in the
bit64 package by Jens Oehlschlaegel with the
CCTZ-based parser and formatter in my
RcppCCTZ package. And thus the
nanotime package was created.

Leonardo Silvestri then significantly enhanced
nanotime by redoing it as an S4 class.

A simple example:

library(nanotime)
n <- nanotime(42)
n
[1] "1970-01-01T00:00:00.000000042+00:00"

Here we used a single element with value 42, and created a nanotime vector from it—which is
taken to mean 42 nanoseconds since the epoch, or basically almost at January 1, 1970.

Step 1: Large Integer Types

So more recently I had a need to efficiently generate such an integer vector from int64_t data.
Both Leonardo and Dan helped with
initial discussion and tests. One can either use a reinterpret_cast<> or a straight memcpy, as
the key trick in bit64 is to use the underlying 64-bit
double. So we have the space, we just need to ensure we copy the bits rather than their values.
This leads to the following function to create an integer64 vector for use in R at the C++ level:

#include <Rcpp.h>

Rcpp::NumericVector makeInt64(std::vector<int64_t> v) {

    size_t len = v.size();
    Rcpp::NumericVector n(len);         // storage vehicle we return them in

    // transfers values 'keeping bits' but changing type
    // using reinterpret_cast would get us a warning
    std::memcpy(&(n[0]), &(v[0]), len * sizeof(double));

    n.attr("class") = "integer64";
    return n;
}

This uses the standard trick of setting a class attribute to set an S3 class. Now the values in
v will return to R (exactly how is treated below), and R will treat the vector as integer64
object (provided the bit64 package has been loaded).

Step 2: Nanotime

A nanotime vector is created using an internal integer64 vector. So the previous function
almost gets us there. But we need to set the S4 type correctly. So that needed some extra work.
The following function does it:

#include <Rcpp.h>

Rcpp::S4 makeNanotime(std::vector<int64_t> v) {

    size_t len = v.size();
    Rcpp::NumericVector n(len);         // storage vehicle we return them in

    // transfers values 'keeping bits' but changing type
    // using reinterpret_cast would get us a warning
    std::memcpy(&(n[0]), &(v[0]), len * sizeof(double));

    // do what needs to be done for the S4-ness: class, and .S3Class
    // this was based on careful reading of .Internal(inspect(nanotime(c(0,1))))
    Rcpp::CharacterVector cl = Rcpp::CharacterVector::create("nanotime");
    cl.attr("package") = "nanotime";
    n.attr(".S3Class") = "integer64";
    n.attr("class") = cl;
    SET_S4_OBJECT(n);

    return Rcpp::S4(n);
}

This creates a nanotime vector as a proper S4 object.

Step 3: Returning them to R via data.table

The astute reader will have noticed that neither function had an Rcpp::export tag. This is
because of the function argument: int64_t is not representable natively by R, which is why we
need a workaround. Matt Dowle has been very helpful in providing
excellent support for nanotime in data.table
(even after we, ahem, borked it by switching from S3 to S4). This support was of course relatively
straightforward because data.table already had
support for the underlying integer64, and we had the additional formatters etc.

#include <Rcpp.h>

// Enable C++11 via this plugin (Rcpp 0.10.3 or later)
// [[Rcpp::plugins("cpp11")]]

Rcpp::NumericVector makeInt64(std::vector<int64_t> v) {

    size_t len = v.size();
    Rcpp::NumericVector n(len);         // storage vehicle we return them in

    // transfers values 'keeping bits' but changing type
    // using reinterpret_cast would get us a warning
    std::memcpy(&(n[0]), &(v[0]), len * sizeof(double));

    n.attr("class") = "integer64";
    return n;
}

Rcpp::S4 makeNanotime(std::vector<int64_t> v) {

    size_t len = v.size();
    Rcpp::NumericVector n(len);         // storage vehicle we return them in

    // transfers values 'keeping bits' but changing type
    // using reinterpret_cast would get us a warning
    std::memcpy(&(n[0]), &(v[0]), len * sizeof(double));

    // do what needs to be done for the S4-ness: class, and .S3Class
    // this was based on careful reading of .Internal(inspect(nanotime(c(0,1))))
    Rcpp::CharacterVector cl = Rcpp::CharacterVector::create("nanotime");
    cl.attr("package") = "nanotime";
    n.attr(".S3Class") = "integer64";
    n.attr("class") = cl;
    SET_S4_OBJECT(n);

    return Rcpp::S4(n);
}

// [[Rcpp::export]]
Rcpp::DataFrame getDT() {

    std::vector<int64_t> d  = { 1L, 1000L, 1000000L, 1000000000L };
    std::vector<int64_t> ns = { 1510442294123456789L, 1510442295123456789L,
                                1510442296123456789L, 1510442297123456789L };

    Rcpp::DataFrame df = Rcpp::DataFrame::create(Rcpp::Named("int64s") = makeInt64(d),
                                                 Rcpp::Named("nanos")  = makeNanotime(ns));
    df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

    return(df);
}

Example

The following example shows the output from the preceding function:

suppressMessages(library("data.table"))
dt <- getDT()
print(dt)
       int64s                               nanos
1:          1 2017-11-11T23:18:14.123456789+00:00
2:       1000 2017-11-11T23:18:15.123456789+00:00
3:    1000000 2017-11-11T23:18:16.123456789+00:00
4: 1000000000 2017-11-11T23:18:17.123456789+00:00

dt[[1]]
integer64
[1] 1          1000       1000000    1000000000

dt[[2]]
[1] "2017-11-11T23:18:14.123456789+00:00"
[2] "2017-11-11T23:18:15.123456789+00:00"
[3] "2017-11-11T23:18:16.123456789+00:00"
[4] "2017-11-11T23:18:17.123456789+00:00"

diff(dt[[2]]) # here 1e9 nanoseconds between them
integer64
[1] 1000000000 1000000000 1000000000



Stan Roundup, 10 November 2017

Fri, 11/10/2017 - 21:00

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

We’re in the heart of the academic season and there’s a lot going on.

  • James Ramsey reported a critical performance regression bug in Stan 2.17 (this affects the latest CmdStan and PyStan, not the latest RStan). Sean Talts and Daniel Lee diagnosed the underlying problem as being with the change from char* to std::string arguments—you can’t pass a char* and rely on the implicit std::string constructor without paying for memory allocation and copying. The reversion goes back to how things were before, with const char* arguments. Ben Goodrich is working with Sean Talts to cherry-pick the fix for the performance regression that made the 2.17 release very slow for the other interfaces. RStan 2.17 should be out soon, and it will be the last pre-C++11 release. We’ve already opened the C++11 floodgates on our development branches (yoo-hoo!).

  • Quentin F. Gronau, Henrik Singmann, and E. J. Wagenmakers released the bridgesampling package in R. Check out the arXiv paper. It runs with output from Stan and JAGS.

  • Andrew Gelman and Bob Carpenter’s proposal was approved by Coursera for a four-course introductory concentration on Bayesian statistics with Stan: 1. Bayesian Data Analysis (Andrew), 2. Markov Chain Monte Carlo (Bob), 3. Stan (Bob), 4. Multilevel Regression (Andrew). The plan is to finish the first two by late spring and the second two by the end of the summer in time for Fall 2018. Advait Rajagopal, an economics Ph.D. student at the New School, is going to be leading the exercise writing, managing the Coursera platform, and will also TA the first few iterations. We’ve left open the option for us or others to add a prequel and sequel, 0. Probability Theory, and 5. Advanced Modeling in Stan.

  • Dan Simpson is in town and dropped a casual hint that order statistics would clean up the discretization and binning issues that Sean Talts and crew were having with the simulation-based algorithm testing framework (aka the Cook-Gelman-Rubin diagnostics). Lo-and-behold, it works. Michael Betancourt worked through all the math on our (chalk!) board and I think they are now ready to proceed with the paper and recommendations for coding in Stan. As I’ve commented before, one of my favorite parts of working on Stan is watching the progress on this kind of thing from the next desk.

  • Michael Betancourt tweeted about using Andrei Kascha‘s javascript-based vector field visualization tool for visualizing Hamiltonian trajectories and with multiple trajectories, the Hamiltonian flow. Richard McElreath provides a link to visualizations of the fields for light, normal, and heavy-tailed distributions. The Cauchy’s particularly hypnotic, especially with many fewer particles and velocity highlighting.

  • Krzysztof Sakrejda finished the fixes for standalone function generation in C++. This lets you generate a double- and int-only version of a Stan function for inclusion in R (or elsewhere). This will go into RStan 2.18.

  • Sebastian Weber reports that the Annals of Applied Statistics paper, Bayesian aggregation of average data: An application in drug development, was finally formally accepted after two years in process. I think Michael Betancourt, Aki Vehtari, Daniel Lee, and Andrew Gelman are co-authors.

  • Aki Vehtari posted a case study for review on extreme-value analysis and user-defined functions in Stan [forum link — please comment there].

  • Aki Vehtari, Andrew Gelman and Jonah Gabry have made a major revision of the Pareto smoothed importance sampling paper, with an improved algorithm, new Monte Carlo error and convergence rate results, and new experiments with varying sample sizes and different functions. The next loo package release will use the new version.

  • Bob Carpenter (it’s weird writing about myself in the third person) posted a case study for review on Lotka-Volterra predator-prey population dynamics [forum link — please comment there].

  • Sebastian and Sean Talts led us through the MPI design decisions about whether to go with our own MPI map-reduce abstraction or just build the parallel map function we’re going to implement in the Stan language. Pending further review from someone with more MPI experience, the plan’s to implement the function directly, then worry about generalizing when we have more than one function to implement.

  • Matt Hoffman (inventor of the original NUTS algorithm and co-founder of Stan) dropped in on the Stan meeting this week and let us know he’s got an upcoming paper generalizing Hamiltonian Monte Carlo sampling and that his team at Google’s working on probabilistic modeling for Tensorflow.

  • Mitzi Morris, Ben Goodrich, Sean Talts and I sat down and hammered out the services spec for running the generated quantities block of a Stan program over the draws from a previous sample. This will decouple the model fitting process and the posterior predictive inference process (because the generated quantities block generates a ỹ according to p(ỹ | θ), where ỹ is a vector of predictive quantities and θ is the vector of model parameters). Mitzi then finished the coding and testing and it should be merged soon. She and Ben Bales are working on getting it into CmdStan and Ben Goodrich doesn’t think it’ll be hard to add to RStan.

  • Mitzi Morris extended the spatial case study with leave-one-out cross-validation and WAIC comparisons of the simple Poisson model, a heterogeneous random effects model, a spatial random effects model, and a combined heterogeneous and spatial model with two different prior configurations. I’m not sure if she posted the updated version yet (no, because Aki is also in town and suggested checking Pareto khats, which said no).

  • Sean Talts split out some of the longer tests for less frequent application to get distribution testing time down to 1.5 hours to improve flow of pull requests.

  • Sean Talts is taking another one for the team by leading the charge to auto-format the C++ code base and then proceed with pre-commit autoformat hooks. I think we’re almost there after a spirited discussion of readability and our ability to assess it.

  • Sean Talts also added precompiled headers to our unit and integration tests. This is a worthwhile speedup when running lots of tests and part of the order of magnitude speedup Sean’s eked out.

ps. some edits made by Aki

The post Stan Roundup, 10 November 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.


To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science.

.rprofile: Mara Averick

Fri, 11/10/2017 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)


Mara Averick is a non-profit data nerd, NBA stats junkie, and most recently, tidyverse developer advocate at RStudio. She is the voice behind two very popular Twitter accounts, @dataandme and @batpigandme. Mara and I discussed sports analytics, how attending a cool conference can change the approach to your career, and how she uses Twitter as a mechanism for self-imposed forced learning.

KO: What is your name, job title, and how long have you been using R? [Note: This interview took place in May 2017. Mara joined RStudio as their tidyverse developer advocate in November 2017.]

MA: My name is Mara Averick, I do consulting, data science, I just say “data nerd at large” because I’ve seen those Venn diagrams and I’m definitely not a data scientist. I used R in high school for fantasy basketball. I graduated from high school in 2003, and then in college used SPSS, and I didn’t use R for a long time. And then I was working with a company that does grant proposals for non-profits, doing all of the demand- and outcome-analysis and it all was in Excel and I thought, we could do better – R might also be helpful for this. It turns out there’s a package for American Community Survey data in R (acs), so that was how I got back into R.

KO: How did you find out about R when you first started using it in high school?

MA: I honestly don’t remember. I didn’t even use RStudio until two years ago. I think it was probably from other fantasy nerds?

KO: Is there an underground R fantasy basketball culture?

MA: Well R for fantasy football is legit. Fantasy Football Analytics is all R modeling.

KO: That’s awesome – so now, do you work with sports analytics? Or is that your personal project/passion?

MA: A little bit of both, I worked for this startup called Stattleship (@stattleship). Because I’ll get involved with anything if there’s a good pun involved… and so we were doing sports analytics work that kind of ended up shifting more in a marketing direction. I still do consulting with the head data scientist [Tanya Cashorali] for that [at TCB Analytics]. Some of the analysis/consulting will be with companies who are doing either consumer products for sports or data journalism stuff around sports analytics.

KO: How often do you use R now?

MA: Oh, I use R like every day. I use it… I don’t use Word any more. [Laughter] Yeah so one of the things about basketball is that there are times of the year where there are games every day. So that’s been my morning workflow for a while – scraping basketball data.

KO: So you get up every morning and scrape what’s new in Basketball?

MA: Yeah! So I end up in RStudio bright and early (often late, as well).

KO: So is that literally what the first half hour of your day looks like?

MA: No, so incidentally that’s kind of how this Twitter thing got started. My dog has long preceded me on Twitter and the internet at large, he’s kind of an internet famous dog @batpigandme. There’s an application called Buffer which allows you to schedule tweets and facebook page posts, which was most of Batpig’s traffic – facebook page visits from Japan. And so I had this morning routine (started in the winter when I had one of those light things you sit in front of for a certain number of minutes) where I would wake up and schedule batpig posts while I’m sitting there and read emails. And that ended up being a nice morning workflow thing.

I went to a Do Good Data conference, which is a Data Analysts for Social Good (@DA4SG) event, just over two years ago, and everyone there was giving out their twitter handles, and I was like, oh – maybe people who aren’t dogs also use Twitter? [Laughter] So that was how I ended up creating my own account @dataandme independent from Batpig.

KO: What happened after you went to this conference? Was it awesome, did it inspire you?

MA: Yeah so, I was the stats person at the company I was working at. And I didn’t realize there was all this really awesome work being done with really rigorous evaluation that wasn’t necessarily federal grant proposal stuff. So I was really inspired by that and started learning more about what other people were doing, some of it in R, some of it not. I kept in touch with some of the people from that conference. And then NBA Twitter is also a thing it turns out, and NBA, R/Statistics is also a really big thing so that was kind of what pulled me in. And it was really fun. A lot of interesting projects and people that I work with were all through that [Twitter] which still surprises me – that I can read a book and tell the author something and they care? It’s weird.

I like to make arbitrary rules for myself, one of the things is I don’t tweet stuff that I haven’t read.

KO: Everyone loves your twitter account. How do you find and curate the things you end up posting about?

MA: I like to make arbitrary rules for myself, one of the things is I don’t tweet stuff that I haven’t read. I like to learn new things and/or I have to learn new things every day so I basically started scheduling [tweets] as a way to make myself read the things that I want to read and get back to.

KO: Wait, so you schedule a tweet and then you’re like, okay well this is my deadline to read this thing – or I’ll be a liar.

MA: Totally.

KO: Whoa that’s awesome.

MA: I’ve also never not finished a book in my life. It’s one of my rules, I’m really strict about it.

KO: That’s a lot of pressure!

MA: So that was kind of how it started out – especially because I didn’t even know all the stuff I didn’t know. Then, as I’ve used R more and more, there’s stuff that I’ve just happened to read because I don’t know what I’m doing.

KO: The more you learn the more you can learn.

MA: Yeah so now a lot of the stuff [tweets] is stuff I end up reading over the course of the day and then add it [to the queue]. Or it’s just stuff I’ve already read when I feel like being lazy.

KO: Do you have side projects other than the basketball/sports stuff?

MA: I actually majored in science and technology studies, which means I was randomly trained in ethical/legal/social implications of science. So I’m working on some data ethics projects which unfortunately I can’t talk about. And then my big side project for total amusement was this D3.js in Action analysis of Archer which is a cartoon that I watch. But that’s also how I learned really how to use tidytext. So then I ended up doing a technical review for David [Robinson] and Julia’s [Silge] book Text Mining with R: A Tidy Approach. It was super fun. So yeah, I always have a bunch of random side projects going on.

KO: How is your work-life balance?

MA: It’s funny because I like what I do. So I don’t always know where that starts and ends. And I’m really bad at capitalism. It never occurs to me that I should be paid for doing some things. Especially if it involves open data and open source – surely you can’t charge for that? But I read a lot of stuff that’s not R too. I think I’m getting sort of a balance, but I’m not sure.

KO: Switching back to your job-job now. Are you on a team, are you remote, are you in an office, what are the logistics like?

MA: Kind of all of the above. In my old job I was on a team but I was the only person doing anything data related. And I developed some really lazy habits from that – really ugly code and committing terrible stuff to git. But with this NBA project I end up working with a lot of different people (who are also basketball-stat nerds).

KO: Do you work with people who are employed by the actual NBA teams, or just people who are really interested in the subject?

MA: No, so there is an unfortunate attrition of people whom I work with when they get hired by teams – which is not unfortunate, it’s awesome, but then they can no longer do anything with us. So that’s collaborative work but I don’t work on a team anymore.

KO: So you don’t have daily stand-ups or anything.

MA: No, no. I could probably benefit from that, but my goal is never to be 100% remote. After I went to that first data conference, I felt like being around all these people who are so much smarter than I am, and know so much more than I do is intimidating, but I also learned so much. And I learned so many things I was doing, not wrong, but inefficiently. I still learn about 80 things I’m doing inefficiently every day.

My goal right now – stop holding on to all of my projects that are not as done as I want them to be, and will never be done.

KO: Do you have set beginnings and endings to projects? How many projects are you juggling at a given time?

MA: After doing federal grant proposals, it doesn’t feel like anything is a deadline compared to that. They don’t care if your house burned down if it’s not in at the right time. So nothing feels as hard and fast as that. There are certain things like the NBA that —

KO: There are timely things.

MA: Yeah, and then sometimes we’ll just set arbitrary deadlines, just to kind of get out of a cycle of trying to perfect it, which I fall deeply into. Yeah so that’s kind of a little bit of my goal right now – stop holding on to all of my projects that are not as done as I want them to be, and will never be done. With the first iteration of this Archer thing I literally spent three days trying to get this faceted bar chart thing to sort in multiple ways and was super frustrated and then I tweeted something about it and immediately David Robinson responded with precisely what I needed and would have never figured out. So I’m working on doing that more. And also because it’s so helpful to me when other people do that.

KO: How did you get hooked up with Julia and David, just through Twitter?

MA: Yeah! So Julia I’d met at Open Vis Conf, David I’d read his blog about a million lines of bad code – it was open on my iPad for like years because I loved it so much, and still do. And yeah so again as this super random twitter-human that I feel like I am, I do end up meeting and doing things with cool people who are super smart and do really cool things.

KO: It’s impressive how much you post and not just that, but it’s really evident that you care. People can tell that this isn’t just someone who reposts a million things a day.

MA: I mean it’s totally selfish, don’t get me wrong. But I’m super glad that it’s helpful to other people too. It gives me so much anxiety to think that people might think I know how to do all the things that I post, which I don’t, that’s why I had to read them – but even when I read them, sometimes I don’t know. The R community is pretty awesome, at least the parts of it that I know; which is not universally true of any community of any group of scientists. R Twitter is super-super helpful. And that was evident really quickly, at least to me.

My plea to everyone who has a blog is to put their Twitter handle somewhere on it.

KO: What are some of your favorite things on the internet? Blogs, Twitter Accounts, Podcasts…

MA: I have never skipped one of Julia Silge’s blog posts. Her posts are always something that I know I should learn how to do. Both she and D-Rob [David Robinson] know their stuff and they write really well. So those are two blogs and follows that I love. Bob Rudis – almost daily, I can’t believe how quickly he churns stuff out. R-Bloggers is a great way to discover new stuff. Dr. Simon J [Simon Jackson] – I literally think of people by their twitter handles [@drsimonj], and there are so many others.

Every day I’m amazed by all the stuff I didn’t know existed. And also there’s stuff that people wrote three or four years ago. A lot of the data vis stuff I end up finding from weird angles. So those are some of my favorites – I’m sure there are more. Oh! Thomas Lin Pedersen, Data Imaginist is his blog. There are so many good blogs. My plea to everyone who has a blog is to put their twitter handle somewhere on it. I actually try really hard to find attribution stuff. Every now and then I get it really wrong and it’ll be someone who has nothing to do with it but who has the same name. There’s a bikini model who has the same name as someone who I said wrote a thing – which I vetted it too! I was like, well she’s multi-faceted, good for her! And then somebody was like, I don’t think that’s the right one. Oops! I have to say that that’s the one thing that Medium nailed – when you click share it gives you their twitter handle. If you have a blog, put your twitter handle there so I don’t end up attributing it to a bikini model.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

Gold-Mining – Week 10 (2017)

Fri, 11/10/2017 - 00:01

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 10 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

The post Gold-Mining – Week 10 (2017) appeared first on Fantasy Football Analytics.


To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

Recap: EARL Boston 2017

Thu, 11/09/2017 - 23:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

By Emmanuel Awa, Francesca Lazzeri and Jaya Mathew, data scientists at Microsoft

A few of us got to attend the EARL conference in Boston last week, which brought together a group of talented users of R from academia and industry. The conference highlighted various Enterprise Applications of R. Despite being a small conference, the quality of the talks was great, and they showcased various innovative ways of using some of the newer packages available in the R language. Some of the attendees were veteran R users while others were newcomers to R, so there was a mix of proficiency levels.

R currently has a vibrant community of users and there are over 11,000 open source packages. The conference also encouraged women to join their local R-Ladies chapter, with the aim of increasing the participation of women at R conferences and increasing the number of women who contribute R packages to the open source community.

The team from Microsoft got to showcase some of our tools, namely Microsoft ML Server, and our commitment to supporting the open language R. Some of the Microsoft-earned sessions were:

  1. Deep Learning with R – Francesca Lazzeri
  2. Enriching your Customer profile at Scale using R Server – Jaya Mathew, Emmanuel Awa & Robert Alexander
  3. Developing Deep Learning Applications with CNTK – Ali Zaidi

Microsoft was a sponsor at the event and had a booth at the conference, where there was a live demo using the Cognitive Services APIs — namely the Face API — to detect age, gender, and facial expression.

In addition, some of the other interesting talks were:

  1. When and Why to Use Shiny for Commercial Applications – Tanya Cashorali
  2. HR Analytics: Using Machine Learning to Predict Employee Turnover – Matt Dancho
  3. Using R to Automate the Classification of E-commerce Products – Aidan Boland
  4. Leveraging More Data using Data Fusion in R – Michael Conklin

All the slides from the conference will be available at the conference website shortly. For photos from the conference, visit EARL’s twitter page.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Announcing “Introduction to the Tidyverse”, my new DataCamp course

Thu, 11/09/2017 - 19:00

(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

For the last few years I’ve been encouraging a particular approach to R education, particularly teaching the dplyr and ggplot2 packages first and introducing real datasets early on. This week I’m excited to announce the next step: the release of Introduction to the Tidyverse, my new interactive course on the DataCamp platform.

The course is an introduction to the dplyr and ggplot2 packages through an analysis of the Gapminder dataset, enabling students to explore and visualize country statistics over time. It’s designed so that people can take it even if they have no previous experience in R, or if they’ve learned some (like in DataCamp’s free introduction) but aren’t familiar with dplyr, ggplot2, or how they fit together.

I’ve published two DataCamp courses before, Exploratory Data Analysis: Case Study (which makes a great followup to this new one) and Foundations of Probability. But I’m particularly excited about this one because the topic is so important to me. Here I’ll share a bit of my thinking behind the course and why we made the decisions we did.

How “Intro to the Tidyverse” started

In early July I was at the useR 2017 conference in Brussels (where I gave a talk on R’s growth as seen in Stack Overflow data). A lot of the attendees were experienced teachers, and a common theme in my conversations was about whether it made sense to teach tidyverse packages like dplyr and ggplot2 before teaching base R syntax.

.@minebocek agrees: teach tidyverse to beginners first #UseR2017 pic.twitter.com/vxjCjNrDz0

— David Robinson (@drob) July 5, 2017

These conversations encouraged me to publish Teach the tidyverse to beginners that week. But the most notable conversations I had were with Chester Ismay, who had recently joined DataCamp as a Curriculum Lead, and with the rest of their content team (like Nick Carchedi and Richie Cotton). Chester and I have a lot of alignment in our teaching philosophies, and we realized the DataCamp platform offers a great opportunity to try a tidyverse-first course at a large scale.

The months since have been an exciting process of planning, writing, and executing the course. I enjoyed building my first two DataCamp courses, but this was a particularly thrilling experience, because I grew to realize I’d been planning this course for a while, almost subconsciously. In early October I filmed the video in NYC, and the course was released almost four months to the day after Chester and I first had the idea.

The curriculum

I realized while I was writing the “teach tidyverse first” post that while I had taught R to beginners with dplyr/ggplot2 about a dozen times in my career (a mix of graduate courses, seminars, and workshops), I hadn’t shared my curriculum in any standardized way.1 This means the conversation has always been a bit abstract. What exactly do I mean by teaching dplyr first, and when do other programming concepts get introduced along the way?

We put a lot of thought into the ordering of topics. DataCamp courses are divided into four chapters, each containing several videos and about 10-15 exercises.

  1. Data Wrangling. Learn to do three things with a table: filter for particular observations, arrange the observations in a desired order, and mutate to add or change a column. You’ll see how each of these steps lets you answer questions about your data.

  2. Data Visualization. Learn the essential skill of data visualization, using the ggplot2 package. Visualization and manipulation are often intertwined, so you’ll see how the dplyr and ggplot2 packages work closely together to create informative graphs.

  3. Grouping and summarizing. We may be interested in aggregations of the data, such as the average life expectancy of all countries within each year. Here you’ll learn to use the group_by and summarize verbs, which collapse large datasets into manageable summaries (a short sketch of how these verbs combine with ggplot2 follows this list).

  4. Types of visualizations. Learn to create line plots, bar plots, histograms, and boxplots. You’ll see how each plot needs different kinds of data manipulation to prepare for it, and understand the different roles of each of these plot types in data analysis.
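To make the ordering concrete, here is a minimal sketch, not an actual course exercise, of the kind of pipeline the four chapters build toward; it assumes the gapminder, dplyr and ggplot2 packages are installed.

library(gapminder)
library(dplyr)
library(ggplot2)

# filter (chapter 1), group_by/summarize (chapter 3), then visualize (chapters 2 and 4)
gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(meanLifeExp = mean(lifeExp)) %>%
  ggplot(aes(x = continent, y = meanLifeExp)) +
  geom_col()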

This ordering is certainly not the only way to teach R. But I like how it achieves a particular set of goals.

  • It not only introduces dplyr and ggplot2, but shows how they work together. This is the reason we alternated chapters in a dplyr-ggplot2-dplyr-ggplot2 order: to appreciate how filtering, grouping, and summarizing data can feed directly into visualizations. This is one distinction between this course and the existing (excellent) dplyr and ggplot2 courses on DataCamp.
  • Get students doing powerful things quickly. This is a major theme of my tidyverse-first post and a sort of obsession of mine. The first exercise in the course introduces the gapminder dataset, discussing the data before writing a single line of code. And the last chapter in particular teaches students to create four different types of graphs, and shows how once you understand the grammar of graphics you can make a variety of visualizations.
  • Teach an approach that scales to real projects. There are hundreds of important topics students don’t learn in the course, ranging from matrices to lists to loops. But the particular skills they do learn aren’t toy examples or bad habits that need to be unlearned. I do use the functions and graphs taught in the course every day, and Julia Silge and I wrote a book using very similar principles.
  • Beginners don’t need any previous experience in R, or even in programming. We don’t assume someone’s familiar with the basics in advance, even fundamentals such as variable assignment (assignment is introduced at the start of chapter 2; until then exploration is done interactively). It doesn’t hurt to have a course like Introduction to R under one’s belt first, but it’s not mandatory.

Incidentally, the course derives a lot of inspiration from the excellent book R for Data Science (R4DS), by Hadley Wickham and Garrett Grolemund. Most notably, R4DS also uses the gapminder dataset to teach dplyr (thanks to Jenny Bryan’s R package it’s a bit of a modern classic).2 I think the two resources complement each other: some people prefer learning from videos and interactive exercises rather than from books, and vice versa. Books have the advantage of space to go deeper (for instance, we don’t teach select, grouped mutates, or statistical transformations), while courses are useful for having a built-in self-evaluation mechanism. Be sure to check out this page for more resources on learning tidyverse tools.

What’s next

I’m excited about developing my fourth DataCamp course with Chester (continuing my probability curriculum). And I’m particularly interested in seeing how the course is received, and whether people who complete this course continue to succeed in their data science journey.

I have a lot of opinions about R education, but not a lot of data about it, and I’m considering this an experiment to see how the tidyverse-first approach works in a large-scale interactive course. I’m looking forward both to the explicit data that DataCamp can collect, and to hearing feedback from students and other instructors. So I hope to hear what you think!

  1. The last online course I recorded for beginners, back in 2014, takes a very different philosophy than the one I use now, especially in the first chapter. 

  2. One of the differences is that we introduce the first dplyr operations before introducing ggplot2 (because it’s difficult to visualize gapminder data without filtering it first, while R4DS uses a different dataset to teach ggplot2). 


To leave a comment for the author, please follow the link and comment on their blog: Variance Explained.

R live class | R with Database and Big Data | Nov 21-22 Milan

Thu, 11/09/2017 - 16:43

(This article was first published on R blog | Quantide - R training & consulting, and kindly contributed to R-bloggers)

 

R with Database and Big Data is our fifth course of the autumn term. It takes place on November 21-22 in a location close to Milano Lima.
During this course you will see how to connect to databases from R, and how to use dplyr with databases. You will then become familiar with the basic IT infrastructure behind big data, the R toolbox for accessing and manipulating big data structures, the SparkML libraries for out-of-memory data modeling, and ad hoc techniques for big data visualization. The course presents the latest techniques for working with big data within the R environment: manipulating, analyzing and visualizing data structures that exceed a single computer's capacity, in a true R style.
No previous knowledge of big data technology is required, while a basic knowledge of R is necessary.
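As a small taste of the database portion, here is a minimal sketch, not taken from the course materials, assuming the DBI, RSQLite, dplyr and dbplyr packages are installed: the familiar dplyr verbs run unchanged against a database table, and collect() brings the result back into R.

library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")   # an in-memory SQLite database
dbWriteTable(con, "mtcars", mtcars)

tbl(con, "mtcars") %>%                            # a lazy reference to the database table
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()                                       # run the generated SQL and pull the result into R

dbDisconnect(con)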

R with Database and Big Data: Outlines

– Introduction to databases
– Connecting databases through R: ODBC and RSQLite
– Data manipulation with dplyr
– Using dplyr with databases
– Introduction to distributed infrastructure
– Spark and Hadoop
– Sparklyr
– Distributed data manipulation with dplyr
– SparkML

R with Database and Big Data is organized by the R training and consulting company Quantide and is taught in Italian, while all the course materials are in English.

This course is for max 6 attendees.

Location

The course location is 550 m (a 7-minute walk) from Milano Centrale railway station and just 77 m (a 1-minute walk) from the Lima metro station.

Registration

If you want to reserve a seat go to: FAQ, detailed program and tickets.

Other R courses | Autumn term

You can find an overview of all our courses here. Next dates will be:

  • November 29-30: Professional R Programming. Organise, document and test your code: write efficient functions, improve code reproducibility and build R packages. Reserve now!

In case you are a group of people interested in more than one class, write to us at training[at]quantide[dot]com! We can arrange a tailor-made course together, picking the topics that are interesting for your organization and dropping the rest.

The post R live class | R with Database and Big Data | Nov 21-22 Milan appeared first on Quantide – R training & consulting.


To leave a comment for the author, please follow the link and comment on their blog: R blog | Quantide - R training & consulting.

How Happy is Your Country? — Happy Planet Index Visualized

Thu, 11/09/2017 - 15:00

The Happy Planet Index (HPI) is an index of human well-being and environmental impact that was introduced by NEF, a UK-based economic think tank promoting social, economic and environmental justice. It ranks 140 countries according to “what matters most — sustainable wellbeing for all”.

This is how HPI is calculated:

It tells us “how well nations are doing at achieving long, happy, sustainable lives”. The index is weighted to give progressively higher scores to nations with lower ecological footprints.

I downloaded the 2016 dataset from the HPI website. Inspired by “Web Scraping and Applied Clustering Global Happiness and Social Progress Index” written by Dr. Mesfin Gebeyaw, I am interested in finding correlations among happiness, wealth, life expectancy, footprint and so on, and then putting these 140 countries into different clusters according to the above measures. I wonder whether the findings will surprise me.

Note: for those who want to see the results right now, I have created a Tableau story, that can be accessed from here.

Load the packages

library(dplyr)
library(plotly)
library(stringr)
library(cluster)
library(FactoMineR)
library(factoextra)
library(ggplot2)
library(reshape2)
library(ggthemes)
library(NbClust)

Data Preprocessing

library(xlsx)
hpi <- read.xlsx('hpi-data-2016.xlsx', sheetIndex = 5, header = TRUE)

# Remove the unnecessary columns
hpi <- hpi[c(3:14)]

# remove footer
hpi <- hpi[-c(141:158), ]

# rename columns
hpi <- hpi[, c(grep('Country', colnames(hpi)), grep('Region', colnames(hpi)),
               grep('Happy.Planet.Index', colnames(hpi)), grep('Average.Life..Expectancy', colnames(hpi)),
               grep('Happy.Life.Years', colnames(hpi)), grep('Footprint..gha.capita.', colnames(hpi)),
               grep('GDP.capita...PPP.', colnames(hpi)), grep('Inequality.of.Outcomes', colnames(hpi)),
               grep('Average.Wellbeing..0.10.', colnames(hpi)), grep('Inequality.adjusted.Life.Expectancy', colnames(hpi)),
               grep('Inequality.adjusted.Wellbeing', colnames(hpi)), grep('Population', colnames(hpi)))]
names(hpi) <- c('country', 'region', 'hpi_index', 'life_expectancy', 'happy_years', 'footprint',
                'gdp', 'inequality_outcomes', 'wellbeing', 'adj_life_expectancy',
                'adj_wellbeing', 'population')

# change data type
hpi$country <- as.character(hpi$country)
hpi$region <- as.character(hpi$region)

The structure of the data

str(hpi)
'data.frame': 140 obs. of 12 variables:
 $ country            : chr "Afghanistan" "Albania" "Algeria" "Argentina" ...
 $ region             : chr "Middle East and North Africa" "Post-communist" "Middle East and North Africa" "Americas" ...
 $ hpi_index          : num 20.2 36.8 33.3 35.2 25.7 ...
 $ life_expectancy    : num 59.7 77.3 74.3 75.9 74.4 ...
 $ happy_years        : num 12.4 34.4 30.5 40.2 24 ...
 $ footprint          : num 0.79 2.21 2.12 3.14 2.23 9.31 6.06 0.72 5.09 7.44 ...
 $ gdp                : num 691 4247 5584 14357 3566 ...
 $ inequality_outcomes: num 0.427 0.165 0.245 0.164 0.217 ...
 $ wellbeing          : num 3.8 5.5 5.6 6.5 4.3 7.2 7.4 4.7 5.7 6.9 ...
 $ adj_life_expectancy: num 38.3 69.7 60.5 68.3 66.9 ...
 $ adj_wellbeing      : num 3.39 5.1 5.2 6.03 3.75 ...
 $ population         : num 29726803 2900489 37439427 42095224 2978339 ...

The summary

summary(hpi[, 3:12])
   hpi_index     life_expectancy  happy_years      footprint
 Min.   :12.78   Min.   :48.91   Min.   : 8.97   Min.   : 0.610
 1st Qu.:21.21   1st Qu.:65.04   1st Qu.:18.69   1st Qu.: 1.425
 Median :26.29   Median :73.50   Median :29.40   Median : 2.680
 Mean   :26.41   Mean   :70.93   Mean   :30.25   Mean   : 3.258
 3rd Qu.:31.54   3rd Qu.:77.02   3rd Qu.:39.71   3rd Qu.: 4.482
 Max.   :44.71   Max.   :83.57   Max.   :59.32   Max.   :15.820
      gdp           inequality_outcomes   wellbeing     adj_life_expectancy
 Min.   :   244.2   Min.   :0.04322     Min.   :2.867   Min.   :27.32
 1st Qu.:  1628.1   1st Qu.:0.13353     1st Qu.:4.575   1st Qu.:48.21
 Median :  5691.1   Median :0.21174     Median :5.250   Median :63.41
 Mean   : 13911.1   Mean   :0.23291     Mean   :5.408   Mean   :60.34
 3rd Qu.: 15159.1   3rd Qu.:0.32932     3rd Qu.:6.225   3rd Qu.:72.57
 Max.   :105447.1   Max.   :0.50734     Max.   :7.800   Max.   :81.26
 adj_wellbeing     population
 Min.   :2.421   Min.   :2.475e+05
 1st Qu.:4.047   1st Qu.:4.248e+06
 Median :4.816   Median :1.065e+07
 Mean   :4.973   Mean   :4.801e+07
 3rd Qu.:5.704   3rd Qu.:3.343e+07
 Max.   :7.625   Max.   :1.351e+09

ggplot(hpi, aes(x=gdp, y=life_expectancy)) +
  geom_point(aes(size=population, color=region)) +
  coord_trans(x = 'log10') +
  geom_smooth(method = 'loess') +
  ggtitle('Life Expectancy and GDP per Capita in USD log10') +
  theme_classic()

Gives this plot:

After log transformation, the relationship between GDP per capita and life expectancy is clearer and looks relatively strong. These two variables are correlated: the Pearson correlation between them is reasonably high, at approximately 0.62.

cor.test(hpi$gdp, hpi$life_expectancy)

        Pearson's product-moment correlation

data:  hpi$gdp and hpi$life_expectancy
t = 9.3042, df = 138, p-value = 2.766e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5072215 0.7133067
sample estimates:
      cor
0.6208781

ggplot(hpi, aes(x=life_expectancy, y=hpi_index)) +
  geom_point(aes(size=population, color=region)) +
  geom_smooth(method = 'loess') +
  ggtitle('Life Expectancy and Happy Planet Index Score') +
  theme_classic()

Gives this plot:

Many countries in Europe and the Americas end up with middle-to-low HPI scores, probably because of their big carbon footprints, despite their long life expectancy.

ggplot(hpi, aes(x=gdp, y=hpi_index)) +
  geom_point(aes(size=population, color=region)) +
  geom_smooth(method = 'loess') +
  ggtitle('GDP per Capita(log10) and Happy Planet Index Score') +
  coord_trans(x = 'log10')

Gives this plot:

Money can’t buy happiness. The correlation between GDP and Happy Planet Index score is indeed very low, at about 0.11.

cor.test(hpi$gdp, hpi$hpi_index)

        Pearson's product-moment correlation

data:  hpi$gdp and hpi$hpi_index
t = 1.3507, df = 138, p-value = 0.179
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.05267424  0.27492060
sample estimates:
      cor
0.1142272

Scale the data

An important step of meaningful clustering consists of transforming the variables such that they have mean zero and standard deviation one.

hpi[, 3:12] <- scale(hpi[, 3:12])
summary(hpi[, 3:12])
   hpi_index        life_expectancy     happy_years        footprint
 Min.   :-1.86308   Min.   :-2.5153   Min.   :-1.60493   Min.   :-1.1493
 1st Qu.:-0.71120   1st Qu.:-0.6729   1st Qu.:-0.87191   1st Qu.:-0.7955
 Median :-0.01653   Median : 0.2939   Median :-0.06378   Median :-0.2507
 Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000
 3rd Qu.: 0.70106   3rd Qu.: 0.6968   3rd Qu.: 0.71388   3rd Qu.: 0.5317
 Max.   : 2.50110   Max.   : 1.4449   Max.   : 2.19247   Max.   : 5.4532
      gdp           inequality_outcomes    wellbeing       adj_life_expectancy
 Min.   :-0.6921   Min.   :-1.5692      Min.   :-2.2128   Min.   :-2.2192
 1st Qu.:-0.6220   1st Qu.:-0.8222      1st Qu.:-0.7252   1st Qu.:-0.8152
 Median :-0.4163   Median :-0.1751      Median :-0.1374   Median : 0.2060
 Mean   : 0.0000   Mean   : 0.0000      Mean   : 0.0000   Mean   : 0.0000
 3rd Qu.: 0.0632   3rd Qu.: 0.7976      3rd Qu.: 0.7116   3rd Qu.: 0.8221
 Max.   : 4.6356   Max.   : 2.2702      Max.   : 2.0831   Max.   : 1.4059
 adj_wellbeing       population
 Min.   :-2.1491   Min.   :-0.2990
 1st Qu.:-0.7795   1st Qu.:-0.2740
 Median :-0.1317   Median :-0.2339
 Mean   : 0.0000   Mean   : 0.0000
 3rd Qu.: 0.6162   3rd Qu.:-0.0913
 Max.   : 2.2339   Max.   : 8.1562

A simple correlation heatmap

qplot(x=Var1, y=Var2, data=melt(cor(hpi[, 3:12], use="p")), fill=value, geom="tile") +
  scale_fill_gradient2(limits=c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title="Heatmap of Correlation Matrix", x=NULL, y=NULL)

Gives this plot:

Principal Component Analysis (PCA)

PCA is a procedure for identifying a smaller number of uncorrelated variables, called “principal components”, from a large set of data. The goal of principal components analysis is to explain the maximum amount of variance with the minimum number of principal components.

hpi.pca <- PCA(hpi[, 3:12], graph=FALSE)
print(hpi.pca)
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 140 individuals, described by 10 variables
*The results are available in the following objects:

   name               description
1  "$eig"             "eigenvalues"
2  "$var"             "results for the variables"
3  "$var$coord"       "coord. for the variables"
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"
6  "$var$contrib"     "contributions of the variables"
7  "$ind"             "results for the individuals"
8  "$ind$coord"       "coord. for the individuals"
9  "$ind$cos2"        "cos2 for the individuals"
10 "$ind$contrib"     "contributions of the individuals"
11 "$call"            "summary statistics"
12 "$call$centre"     "mean of the variables"
13 "$call$ecart.type" "standard error of the variables"
14 "$call$row.w"      "weights for the individuals"
15 "$call$col.w"      "weights for the variables"

eigenvalues <- hpi.pca$eig
head(eigenvalues)
       eigenvalue percentage of variance cumulative percentage of variance
comp 1 6.66741533             66.6741533                          66.67415
comp 2 1.31161290             13.1161290                          79.79028
comp 3 0.97036077              9.7036077                          89.49389
comp 4 0.70128270              7.0128270                          96.50672
comp 5 0.24150648              2.4150648                          98.92178
comp 6 0.05229306              0.5229306                          99.44471

Interpretation:
* The proportion of variation retained by the principal components was extracted above.
* An eigenvalue measures the amount of variation retained by each PC; the first PC corresponds to the maximum amount of variation in the data set. In this case, the first two principal components are worthy of consideration, because a commonly used criterion for the number of factors to rotate is the eigenvalues-greater-than-one rule proposed by Kaiser (1960), and only the first two components have eigenvalues above one.

fviz_screeplot(hpi.pca, addlabels = TRUE, ylim = c(0, 65))

Gives this plot:

The scree plot shows us which components explain most of the variability in the data. In this case, almost 80% of the variance contained in the data is retained by the first two principal components.

head(hpi.pca$var$contrib)
                        Dim.1       Dim.2       Dim.3      Dim.4       Dim.5
hpi_index            3.571216 50.96354921 5.368971166  2.1864830  5.28431372
life_expectancy     12.275001  2.29815687 0.002516184 18.4965447  0.31797242
happy_years         14.793710  0.01288175 0.027105103  0.7180341  0.03254368
footprint            9.021277 24.71161977 2.982449522  0.4891428  7.62967135
gdp                  9.688265 11.57381062 1.003632002  2.3980025 72.49799232
inequality_outcomes 13.363651  0.30494623 0.010038818  9.7957329  2.97699333

* Variables that are correlated with PC1 and PC2 are the most important in explaining the variability in the data set.
* The contribution of variables was extracted above: The larger the value of the contribution, the more the variable contributes to the component.

fviz_pca_var(hpi.pca, col.var="contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE )

Gives this plot:

This highlights the most important variables in explaining the variations retained by the principal components.

Group countries by wealth, development, carbon emissions, and happiness

When using clustering algorithms, k must be specified. I use the following method to help to find the best k.

number <- NbClust(hpi[, 3:12], distance="euclidean",
                  min.nc=2, max.nc=15, method='ward.D',
                  index='all', alphaBeale = 0.1)

*** : The Hubert index is a graphical method of determining the number of clusters.
      In the plot of Hubert index, we seek a significant knee that corresponds to a
      significant increase of the value of the measure i.e the significant peak in
      Hubert index second differences plot.

*** : The D index is a graphical method of determining the number of clusters.
      In the plot of D index, we seek a significant knee (the significant peak in
      Dindex second differences plot) that corresponds to a significant increase of
      the value of the measure.

*******************************************************************
* Among all indices:
* 4 proposed 2 as the best number of clusters
* 7 proposed 3 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 5 proposed 6 as the best number of clusters
* 3 proposed 10 as the best number of clusters
* 3 proposed 15 as the best number of clusters

                  ***** Conclusion *****

* According to the majority rule, the best number of clusters is 3

I will apply K=3 in the following steps:

set.seed(2017)
pam <- pam(hpi[, 3:12], diss=FALSE, 3, keep.data=TRUE)
fviz_silhouette(pam)
  cluster size ave.sil.width
1       1   43          0.46
2       2   66          0.32
3       3   31          0.37

Number of countries assigned in each cluster

hpi$country[pam$id.med]
[1] "Liberia" "Romania" "Ireland"

This prints out one typical country (the cluster medoid) representing each cluster.

fviz_cluster(pam, stand = FALSE, geom = "point", ellipse.type = "norm")

Gives this plot:

It is always a good idea to look at the cluster results and see how these three clusters were assigned.

A World map of three clusters

hpi['cluster'] <- as.factor(pam$clustering)
map <- map_data("world")
map <- left_join(map, hpi, by = c('region' = 'country'))

ggplot() +
  geom_polygon(data = map, aes(x = long, y = lat, group = group, fill=cluster, color=cluster)) +
  labs(title = "Clustering Happy Planet Index",
       subtitle = "Based on data from:http://happyplanetindex.org/",
       x=NULL, y=NULL) +
  theme_minimal()

Gives this plot:

Summary

The Happy Planet Index has been criticized for weighting the ecological footprint too heavily, and the ecological footprint is itself a controversial concept. In addition, the Happy Planet Index has been misunderstood as a measure of personal “happiness”, when in fact it is a measure of the “happiness” of the planet.

Nevertheless, the Happy Planet Index has been a consideration in the political arena. For us, it is useful because it combines wellbeing and environmental aspects, and it is simple and understandable. Also, it is available online, so we can create a story out of it.

Source code that created this post can be found here. I am happy to hear any feedback or questions.

    Related Post

    1. Exploring, Clustering, and Mapping Toronto’s Crimes
    2. Spring Budget 2017: Circle Visualisation
    3. Qualitative Research in R
    4. Multi-Dimensional Reduction and Visualisation with t-SNE
    5. Comparing Trump and Clinton’s Facebook pages during the US presidential election, 2016
