R news and tutorials contributed by (750) R bloggers
Updated: 5 hours 18 min ago

### Unconf recap 1

Mon, 06/05/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

Following up on Stefanie's recap of unconf 17, we will spend this entire week posting summaries of projects developed at the event. We plan to highlight 4-5 projects each day, with detailed posts from a handful of teams to follow.

skimr

Summary: skimr, a package inspired by Hadley Wickham's precis package, aims to provide summary statistics iteratively and interactively as part of a pipeline. The package provides easily skimmable summary statistics to help you better understand your data and see what is missing.
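As a sketch of the intended workflow (hypothetical usage: `skim()` is assumed to be the package's main verb, and the exact output format may change as the package matures):

```r
# Not run: a sketch of skimr in a pipeline (API may differ)
library(dplyr)
library(skimr)   # devtools::install_github("ropenscilabs/skimr")

mtcars %>%
  filter(cyl > 4) %>%   # any upstream transformation
  skim()                # prints skimmable per-column summary statistics
```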

Team: Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, Julia Lowndes, Shannon Ellis, Elin Waring, Michael Quinn, Hope McLeod

emldown

Summary: emldown is a package for creating a helpful website based on EML metadata. Ecological Metadata Language (EML) is a metadata specification developed by and for the ecology discipline. EML is implemented as a series of XML document types that can be used in a modular and extensible manner to document ecological data.

Team: Maëlle Salmon, Kara Woo, Andrew McDonald, Carl Boettiger

testrmd

Summary: testrmd provides facilities for testing of, and reporting on, R Markdown chunks. When running R Markdown documents as part of a workflow where the data are likely to change with each render, testrmd ensures that each chunk will alert the reader to any problems that arise as a result of subtle changes in the data.

Team: Mara Averick, Christopher Tull, Joe Cheng, Robert Flight

webrockets

Summary: The webrockets package implements a basic websocket listener in R. The implementation draws heavily on @hrbrmstr's wrapper of easywsclient in https://github.com/ropenscilabs/decapitated.

Team: Miles McBain, Alicia Schep, Carl Ganz, Bob Rudis, Henrik Bengtsson, Joe Cheng


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


### Taking Advanced Analytics to the Cloud – Part I: R on AWS

Mon, 06/05/2017 - 05:59

(This article was first published on R – Luke Stanke, and kindly contributed to R-bloggers)


Running R on the cloud isn’t very difficult. This demo shows how to get Rstudio running on Amazon Web Services. To run R on the cloud we need to initiate a machine/computer and install R – that’s all very simple. Where you might get caught up is in the settings and permissions.

Step 1: Create an AWS Account

If you don’t have an AWS account you’ll need to sign up.

Step 2: Create a Role

Once you are signed up and signed into AWS, the first thing you’ll need to do is create a role for your machine. A role provides the necessary permissions for your machine to access the various APIs offered by AWS. You can do this by searching IAM on the services tab. Once you are on the page, click “Roles” on the left menu, then the “Create New Role” button. Then select Amazon EC2 from the AWS Service Roll. There are a number of services that you could run with the instance you will set up – and there are varying permissions, too. For now, just search for and select AmazonS3FullAccess, AmazonEC2FullAccess, RedshiftFullAccess, and AthenaFullAccess. You won’t need any of these right away, but they will be useful when connecting to other services like S3 or another EC2 instance. Note: the photo does not include AthenaFullAccess, but you should include it! From there you’ll be good to go. Depending on your access needs, your policies should be selected accordingly. In fact, if you were doing this properly, you’d want to create policies that are not as over-arching as the full-access options we’ve selected for this demo.

Step 3: Create a Key Pair

Next, you’ll want to create a key pair. This will allow you to securely log into your instance to update your machine and install Rstudio. Go to Services and search EC2 in the search bar. Once the EC2 page loads, click “Key Pairs” under the “Network & Security” section on the left menu bar. From there click “Create Key Pair”. Give your key a name and hit create.
The key pair will download as a .pem file. Do not lose this key! Also: do not share this key!

Step 4: Create an R/Rstudio/Shiny Security Group

As you can see from the steps thus far, it’s all about security. From the EC2 menu, under the “Network & Security” section on the left menu, select “Security Groups”. Click the “Create Security Group” button and a pop-up will appear. Create a security group named “Rstudio/Shiny”. Under description write “Allows port access for R and Shiny”. You can leave the VPC drop-down alone. On the inbound tab – the default – add three new rules. For the first rule, select SSH from the dropdown. The source should be auto-populated with 0.0.0.0/0. This opens up SSH to anywhere in the world – but you’ll need the .pem key to log in. For the second rule, leave it as custom TCP and type 8787 as the port range. This is the port that needs to be opened on the server for you to connect to Rstudio. Under source type 0.0.0.0/0. This means you could log into Rstudio from any IP address. You could also just use your own IP address if you know it. For the third rule, also leave it as custom TCP and type 3838 as the port range. This is the port for the Shiny connection. We’re not going to use it for the immediate demo, but it’ll be very useful in the future. Under source type 0.0.0.0/0 as well.

Step 5: Launch EC2 Instance

Staying in the EC2 section, click “Instances” under the “Instances” section on the left menu. Click the “Launch Instance” button. This will take you to “Step 1: Choose an Amazon Machine Image”. You’ll have a number of tabs to select AMIs from. Stay on the Quick Start tab and select Amazon Linux AMI – it’s the first option. It has a number of pre-built tools that are useful – but it doesn’t have R installed. You’ll do that in a bit. For “Step 2: Choose an Instance Type”, choosing an instance type can be daunting. Here you are essentially specifying the type of computer you want to run.
I typically select from the General purpose, Compute optimized, or Memory optimized options depending on the type of models I’m running and the type of data I am working with. For this example select t2.micro, because this is just a demo of the tool. Click “Next: Configure Instance”. Note: you’ll need a more powerful machine to install packages in Step 8 and beyond. I’d recommend a c2.large machine just to be safe. For “Step 3: Configure Instance”, under “IAM Role” select the role you created earlier – my role was called EC2-S3-Redshift. Note: under Advanced Options you could send a bash script to configure your instance; instead we’ll do it with command-line tools. Click “Next: Add Storage”. For “Step 4: Add Storage” we can stick with the default settings – for now. Click “Next: Add Tags”. For “Step 5: Add Tags”, click “Add Tag”, enter a key of “Name” and a value of “Rstudio Example” – this will give our machine the name “Rstudio Example”. Having a name is important when you have more than one machine running. Click “Next: Configure Security Group”. For “Step 6: Configure Security Group”, under “Assign a Security Group” select “Select an existing security group”. Find and select your “Rstudio/Shiny” security group. If this is your first time doing this, it should be easy to find. If you have multiple security groups like myself, you’ll have to search through the lot. Click “Review and Launch”. On the review screen, make sure you’ve selected the instance type, security group, and IAM role as described above. Click “Launch” – you’ll get a pop-up to select an existing key pair or create a new key pair. Choose an existing key pair and select the key pair you just created. My key is called awesome_key. Acknowledge that you’ve selected the key pair and that you’ll need it to log on. Note: make sure you have the key pair to log into your instance. This should go without saying, but I’m saying it – you need your key pair to log into your machine and set it up! Launch your instance.
Step 6: Log in to your instance

This is where the Windows OS world diverges from the Apple/Unix/Linux/Ubuntu worlds. Windows doesn’t have a built-in terminal like these other systems, so you’ll have to download and set up PuTTY if you don’t have it already. Next you’ll use your terminal to SSH into your newly created instance and set it up. The instance will likely need about 60 seconds to spin up from when you hit launch. After you launch, click on your instance id just to the right of the message saying the instance launch has been initiated. This will take you to your instance on the EC2 dashboard. The dashboard is full of important information – most notably it’ll tell you if your instance is up and running. From the image below you can see my instance state is green, “running”. This tells me it’s ready to be SSH’ed into. You can also see the Public DNS on the right and on the description tab. We’ll need that information to SSH into it. Note: you can also see my IAM Role and key pair name. Click the “Connect” button just to the right of the “Launch Instance” button. This provides you with additional directions on how to connect to your instance.

First, open your terminal and change directories so that you are in the folder that contains your .pem key. If you haven’t moved it out of your downloads folder, it’s probably there – and you should probably move it. Next change the permissions of your key pair to allow you to SSH onto your AWS instance:

```shell
chmod 400 awesome_key.pem
```

Then SSH onto your machine using the following format. You’ll have to replace the key pair with the key pair you’ve created, and change the public DNS address to the address of your machine:

```shell
ssh -i "awesome_key.pem" ec2-user@ec2-54-145-158-106.compute-1.amazonaws.com
```

Once you are logged in, your terminal should look like this:

Step 7: Set up your instance

Once we are logged in we need to update our machine, install a few additional programs, install R, and install Rstudio.
You can do this by running the following commands line-by-line on your EC2 instance:

```shell
# Update the machine
sudo yum -y update

# Install programs that the devtools package needs
sudo yum -y install libcurl-devel openssl-devel # used for devtools

# Install programs that assist API packages
sudo yum -y install libxml2 libxml2-devel

# Install R
sudo su
yum install -y R
```

After running this code you will have 1) updated your machine; 2) installed tools that allow the devtools package to run; 3) installed tools that allow packages like httr and aws.s3 to run; and 4) installed base R. Next you’ll want to install the most recent versions of Rstudio and Shiny. Check here to find the most recent releases of Rstudio and Shiny. Edit the code so that you install the most recent version. Run the install of Rstudio and Shiny:

```shell
# Install RStudio Server - change the version when installing your Rstudio
wget -P /tmp https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-server-rhel-1.0.143-x86_64.rpm
sudo yum install -y --nogpgcheck /tmp/rstudio-server-rhel-1.0.143-x86_64.rpm

# Install shiny and shiny-server - change the version when installing your Shiny Server
R -e "install.packages('shiny', repos='http://cran.rstudio.com/')"
wget https://download3.rstudio.org/centos5.9/x86_64/shiny-server-1.5.3.838-rh5-x86_64.rpm
yum install -y --nogpgcheck shiny-server-1.5.3.838-rh5-x86_64.rpm

# Add user(s)
sudo useradd -m stanke
sudo passwd stanke
```

Finally, add a user and set the password. Once this is done you can terminate your SSH tunnel.

Step 8: Log into your Rstudio instance

Copy the public DNS that was shown on your EC2 page earlier. Paste that public DNS into your browser and add “:8787” after it – in my case “ec2-54-145-158-106.compute-1.amazonaws.com:8787”. Hit enter. Your Rstudio login page should appear. Enter the credentials of your new user. Click “Sign In”.

Step 9: Set up your Rstudio defaults

So technically you’ve done it. You’ve logged into Rstudio on AWS.
But let’s take this another few steps. Let’s set up some defaults so that any time we want to set up an Rstudio instance on AWS we don’t have to go through the hassle above. Let’s install a bunch of packages that you might regularly use. This install might take a while, since you’ll be installing a number of packages onto your machine:

```r
## Install packages if not present
install.packages(
  c(
    "devtools", "sparklyr", "ggplot2", "magrittr", "tidyverse",
    "Lahman", "DBI", "rsparkling", "h2o", "ghit",
    "xml2", "stringr", "data.table", "clipr"
  )
)
```

Let’s also create a new .R file and write a few lines of code. You don’t really need to do this, but it’ll show the power of Step 10 in a few minutes. Here is the R script you can use:

```r
### Get data
data("iris")

### See data structure
head(iris)

### Save iris data locally
write.csv(iris, "iris.csv")
```

After the initial script: a .R file, an iris object, and an iris.csv file.

Step 10: Take an image of your current instance

Setting up Rstudio is a pain – all that time waiting for packages to install, getting your data just right, only to possibly find out you didn’t size your instance correctly. No one wants to deal with that every time they set up R on AWS. This isn’t a problem, as you can take a snapshot of your instance as it is and spin up new instances from the point in time of the snapshot. To take a snapshot, go back to the webpage with your EC2 details. Click the “Actions” button, then go to “Image” and “Create Image”. From there enter an Image Name of “Rstudio AMI Example” and an Image Description of “Rstudio AMI Example”. Click “Create Image” and wait a few minutes.

Step 11: Create an instance from the AMI

Launch a new instance. On “Step 1: Choose an Amazon Machine Image”, click “My AMIs”. Select “Rstudio AMI Example”. Follow Step 5 above for the rest of the setup. However, with this instance tag the Name as “Rstudio AMI”.
If you’ve set everything up correctly, you shouldn’t need to SSH into your instance to configure it. Copy the Public DNS into your browser and add “:8787”. Log in with the username and password created earlier. Once logged in you’ll see that your .R script and .csv file are still saved on the instance – allowing you to quickly jump back into your analyses. In fact, creating an image and then creating multiple instances will allow you to quickly fit models, test which instance type is best for the models/data you are currently working with, and eventually minimize costs. If the data become more complicated or larger in size, you can create a new instance that can accommodate those changes. Sometimes, however, the data become so large that we need a distributed system – Rstudio can sit on top of distributed systems as well – I’ll talk about that in a different post. This approach will get you a great jump-start, though.

To leave a comment for the author, please follow the link and comment on their blog: R – Luke Stanke.
### 2017 Fantasy Football Projections

Mon, 06/05/2017 - 05:10

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

We are releasing our 2017 fantasy football projections in a Shiny webapp built with R. The app allows you to calculate custom rankings/projections for your league based on your league settings. The projections incorporate more sources of projections than any other site, and have been the most accurate projections over the last 5 years. New features of the app this season include the ability to view additional variables (including “per game” projections). You can access the Projections tool here: http://apps.fantasyfootballanalytics.net For instructions on how to use the app, see here. We also have a Draft Optimizer tool (see here). See our articles on how to win your Snake Draft and Auction Draft. We will be updating the projections as the season approaches with more sources of projections. Feel free to add suggestions in the comments! The post 2017 Fantasy Football Projections appeared first on Fantasy Football Analytics.
To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

### More on R and Reproducible Research

Sun, 06/04/2017 - 23:33

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

Readers who found my post last week on our revisit package of interest may wish to check the somewhat related package ropensci/rrrpkg, which Edzer Pebesma tweeted about today.

To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.
### Packaging the TheyWorkForYou API

Sun, 06/04/2017 - 21:19

(This article was first published on R – CONJUGATEPRIOR, and kindly contributed to R-bloggers)

TheyWorkForYou is a great website for keeping up with British politics and one of the many fine things mySociety does to make democracy in the UK more transparent. There’s also an API, accessible via HTTP and wrapped up for a few languages. However, R is not amongst them, so I wrote twfy. If you’re interested in using it (and you’ve got devtools installed) you can install it with

```r
devtools::install_github("conjugateprior/twfy")
```

It was my first proper API package and a bit of a learning experience. If you want to hear more about that, read on.

APIs

First some recap, for those just joining us. The TheyWorkForYou API works with parameterized GETs to URLs with a common base, http://theyworkforyou.com/api/, and different endpoints, depending on what you want. First you sign up for an API key and then you make the calls. For example, if you want a list of UK parliamentary constituencies then your endpoint is getConstituency, which takes either a name or a postcode, plus your API key and an output specification, and returns a structured constituency object.
In a browser, the complete call looks like

https://www.theyworkforyou.com/api/getConstituency?name=Keighley&output=js&key=adsfiddscsdlsdlk

where of course adsfiddscsdlsdlk isn’t really an API key. It just plays one on the web. The server returns a JSON object:

```json
{
  "bbc_constituency_id" : "344",
  "guardian_election_results" : "http://www.guardian.co.uk/politics/constituency/1050/keighley",
  "guardian_id" : "1050",
  "guardian_name" : "Keighley",
  "pa_id" : "338",
  "name" : "Keighley"
}
```

Except that’s only sort of true. The server claims to return a Javascript object, as we can tell from its MIME type: "text/javascript; charset=iso-8859-1". We’ll just treat it like JSON though. Making this call and decoding the result programmatically is straightforward with the right packages:

```r
library(httr)
library(jsonlite)

q <- list(output = "js", key = "adsfiddscsdlsdlk", name = "Keighley")
url <- "https://www.theyworkforyou.com/api/getConstituency"
resp <- GET(url, query = q)       # make the call
json <- fromJSON(content(resp))   # parse the response
```

where we’ve let the GET function deal with all the URL escaping, and just asked fromJSON to do the right thing. By default, jsonlite generally believes the right thing looks like a data.frame, and most of the time that’s fine. So far so straightforward.

Creating a general purpose function to call the API

It’s pretty easy to generalize this code to create a generic API calling function

```r
call_api(endpoint, ...)
```

Inside that function we’ll just grab the arguments as list(...), add our preferred output format and the API key, and hand the whole thing to GET. With call_api in hand it’s possible to make all the API functions look pretty much the same, e.g.

```r
getConstituencies <- function(date = NULL, search = NULL) {
  params <- list(date = date, search = search)
  call_api("getConstituencies", params)
}
```

but you know, why write ‘getConstituencies’ twice?
Let’s be a bit more general, and use the fact that R functions know what their names are, and that all function parameters have a name, to make a completely general function body. The actual function in twfy is

```r
getConstituencies <- function(date = NULL, search = NULL) {
  params <- params_from_call(match.call())
  do.call("call_api", params)
}
```

which has exactly the same body as

```r
getMPInfo <- function(id, fields = NULL) {
  params <- params_from_call(match.call())
  do.call("call_api", params)
}
```

Cute. If the API adds another endpoint, I’ll create a new function with this body, give it the name of the endpoint, and write the help in roxygen above it. So how does this work? Well, inside (any) function, as.list(match.call()) is a list with an unlabeled first element that is the name of the function, and subsequent labeled components that are its arguments. If we call the getConstituencies function above with search="Keigh", that means

```r
[[1]]
getConstituencies

$search
[1] "Keigh"
```

All the package’s params_from_call does is remove the first argument from the list and re-add it (as.character, because it’s actually an R symbol) under the new label endpoint, so that params is

```r
$search
[1] "Keigh"

$endpoint
[1] "getConstituencies"
```

I then use do.call to call call_api with these arguments. This works because call_api is looking for an argument called endpoint and zero or more named arguments, and params gives it one of each.
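The do.call mechanics are plain base R and can be checked with a toy stand-in for call_api (hypothetical body, just echoing what it receives):

```r
call_api <- function(endpoint, ...) {
  # stand-in: report the endpoint and the names of the query parameters
  paste0(endpoint, "?", paste(names(list(...)), collapse = "&"))
}

params <- list(search = "Keigh", endpoint = "getConstituencies")
do.call("call_api", params)
# endpoint is matched by name; search falls into ...
# [1] "getConstituencies?search"
```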

This leads to the question: why even have a separate function for each endpoint offered by the API? There are two answers:

First, an important effect of wrapping an API is to have the documentation near to hand. This requires separate R functions to write the roxygen above.

Speaking of documentation, TheyWorkForYou is a little bit vague about what each of its endpoints returns, so if you’re listening, a pointer to some more documentation would be great.

Second, it is sometimes useful to pre- or post-process the arguments to do.call. Here’s an example of how documentation and pre-processing interact:

```r
getDebates <- function(type = c("commons", "westminsterhall", "lords",
                                "scotland", "northernireland"),
                       date = NULL, search = NULL, person = NULL, gid = NULL,
                       order = c("d", "r"), page = NULL, num = NULL) {
  params <- params_from_call(match.call())
  params$type <- match.arg(type)
  params$order <- match.arg(order)
  do.call("call_api", params)
}
```

The user must specify a legislative body to search with the type argument, and can specify a results ordering with the order argument. The function definition is a good place to put the small number of argument possibilities, not least because they will get picked up by command completion.

In the code above I process the function’s arguments as usual, but then step in and fix the values of type and order using match.arg in the normal way, before making the call.
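Since match.arg is base R, its behaviour here is easy to verify in isolation:

```r
pick_order <- function(order = c("d", "r")) match.arg(order)

pick_order()     # no value supplied: the first choice is the default -> "d"
pick_order("r")  # a supplied value must be one of the choices -> "r"
# pick_order("x") would stop with: 'arg' should be one of "d", "r"
```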

Where did I leave my keys?

Like most APIs, TheyWorkForYou requires a key to use. Here I follow Hadley Wickham’s very useful guidelines (see the links at the end) and store it as an environment variable.

In twfy there’s an internal function that prompts for a key as necessary

```r
get_api_key <- function() {
  key <- Sys.getenv("TWFY_API_KEY")
  if (key == "") {
    key <- ask_for_key()
    if (key != "") {
      Sys.setenv(TWFY_API_KEY = key)
      add_key_to_renviron(key)  # and set up for next time
    } else {
      stop("Hint: you can request an API key from http://theyworkforyou.com/api/key")
    }
  }
  key
}
```

The first time it’s needed, this prompts the user for the key, sets its value in the local environment, and writes a line into the user’s .Renviron file so it’s available in later sessions.
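The post doesn’t show add_key_to_renviron itself, but appending one line to ~/.Renviron is all it needs to do; a minimal sketch (an assumed implementation, not the package’s actual code):

```r
add_key_to_renviron <- function(key) {
  renviron <- file.path(Sys.getenv("HOME"), ".Renviron")
  # append rather than overwrite, so any existing variables survive
  cat(sprintf("TWFY_API_KEY=%s\n", key), file = renviron, append = TRUE)
}
```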

There is a set_api_key, but this is only really needed to reset an existing key.

Testing with keys

If you’re a fan of continuous integration, then the next challenge is to set things up in such a way as not to expose the API key in the server logs or hardcode it into the R source. twfy uses Travis, and for Travis the solution is to set the api key as an environment variable in the repository settings.

By default these variables do not appear in the build logs, and that’s the way we like it.

The current .travis.yml for twfy looks like

```yaml
language: R
sudo: false
cache: packages
before_install:
  - echo "TWFY_API_KEY=${TWFY_API_KEY}" > ~/.Renviron
```

Actually, I’m not sure whether it’s even necessary to drop Travis’s copy of the API key into the .Renviron for it to get picked up by the package functions. It’s possible that R picks up local environment variables more reliably on Ubuntu than on my OS X box. Still, this works. The package builds and no third parties (or forks) see the API key.

Further reading

If you find yourself wrapping an API, I’d thoroughly recommend reading Hadley Wickham’s guidelines.

Somebody always got there first

It turns out that Jack Blumenau saw TheyWorkForYou’s API and thought the same thing. Great minds, and all that. You can see his take on the problem here. He likes XML quite a bit more than me, apparently. In any case, as Chairman Mao once said: “Let a hundred flowers blossom, let a hundred schools of thought contend. And let the winner drive the losers onto a small volcanic island shaped like a sweet potato”. Or something like that.

To leave a comment for the author, please follow the link and comment on their blog: R – CONJUGATEPRIOR.
### Data Visualization with googleVis exercises part 1

Sun, 06/04/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Line, Bar and Column Charts

Hello everybody! This is the first part of an exciting data visualization series R-exercises is going to provide you. Data visualization involves the creation and study of the visual representation of data. The increasing amounts of data created by Internet activity make data visualization a necessary tool for data scientists. One of the best packages that the R language provides for this purpose is googleVis and, guess what, we are going to see its amazing features together. In this first part we are going to cover the basics of three useful types of charts (Line Chart, Bar Chart and Column Chart) before “digging deeper” in the next parts. NOTE: The charts are created locally by your browser. In case they are not displayed at once, press F5 to reload the page. Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let’s begin! Answers to the exercises are available here.

Install the package

Exercise 1

Install and load the package googleVis.

Create a data frame

First of all let’s create an experimental data frame to use for all our plots. This is an example:

```r
dfr <- data.frame(name = c("GRE", "ARG", "BRA"),
                  val1 = c(20, 32, 19),
                  val2 = c(25, 52, 12))
```

Exercise 2

Create a data frame named “df”. Give as variables the “Pts” (Points) and “Rbs” (Rebounds) of three NBA players. Names and values are up to you.
Line Chart

To produce a line chart you can use:

```r
LineC <- gvisLineChart(df)
plot(LineC)
```

Exercise 3

Create a list named “LineC” and pass to it the “df” data frame you just created as a line chart. HINT: Use gvisLineChart().

Exercise 4

Plot the line chart. HINT: Use plot().

Line chart with two axes

Below is an example of this type of line chart:

```r
LineC2 <- gvisLineChart(df, "name", c("val1", "val2"),
                        options = list(
                          series = "[{targetAxisIndex: 0}, {targetAxisIndex: 1}]",
                          vAxes = "[{title:'val1'}, {title:'val2'}]"
                        ))
plot(LineC2)
```

As you can see, you create a list (options) that maps values 1 & 2 to the two axes respectively.

Exercise 5

Create a single-axis line chart that displays only the “Pts” of the “df” data frame.

Exercise 6

Now create a two-axis line chart that displays both “Pts” and “Rbs” of the “df” data frame. HINT: Use list().

Bar Chart

It is quite simple to create a bar chart with googleVis:

```r
BarC <- gvisBarChart(df)
plot(BarC)
```

Exercise 7

Create a list named “BarC” and pass to it the “df” data frame you just created as a bar chart. HINT: Use gvisBarChart().

Exercise 8

Plot the bar chart. HINT: Use plot().

Column Chart

Column charts can be considered vertical bar charts and are quite simple to create too:

```r
ColumnC <- gvisColumnChart(df)
plot(ColumnC)
```

Exercise 9

Create a list named “ColumnC” and pass to it the “df” data frame you just created as a column chart. HINT: Use gvisColumnChart().

Exercise 10

Plot the column chart. HINT: Use plot().

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

### RcppArmadillo 0.7.900.2.0

Sun, 06/04/2017 - 16:21

(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)

The new RcppArmadillo release 0.7.900.2.0 is now on CRAN, and the Debian package was just updated as well. Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 350 other packages on CRAN—an increase of 32 since the last CRAN release of 0.7.800.2.0 in April!

With the 7.900.* series of Armadillo, Conrad has started to more fully utilize OpenMP (also see Wikipedia on OpenMP) for operations that can be parallelized. To use this in your package you need to update its src/Makevars{,.win} file similarly to what the skeleton default now uses:

PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
PKG_LIBS = $(SHLIB_OPENMP_CFLAGS) $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS)

and you may want to enable C++11 while you are at it—though this may pose issues with older-than-ancient RHEL installations which are still (way too) pervasive, so we do not do it by default (yet).

Here, we once again rely on the build infrastructure automagically provided by R itself: if and when OpenMP is available, R will use it via $(SHLIB_OPENMP_CXXFLAGS) etc; see the fine WRE manual for details. That said, some operating systems make this harder than others, and macOS usually takes the crown. See for example this blog post by James for surviving in that environment. I am a little short on details because on Linux these things just work, and have for well over a decade. The rcpp-devel mailing list will be the best place for questions.

Changes in this release relative to the previous CRAN release are as follows:

Changes in RcppArmadillo version 0.7.900.2.0 (2017-06-02)

• Upgraded to Armadillo release 7.900.2 (Evil Banana Republic)
• Expanded clamp() to handle cubes
• Computationally expensive element-wise functions (such as exp(), log(), cos(), etc.) can now be automatically sped up via OpenMP; this requires a C++11/C++14 compiler with OpenMP 3.0+ support for GCC and clang compilers
• One caveat: when using GCC, use of -march=native in conjunction with -fopenmp may lead to speed regressions on recent processors
• Added gcc 7 to supported compiler check (James Balamuta in #128 addressing #126)
• A unit test helper function for rmultinom was corrected (#133)
• OpenMP support was added to the skeleton helper in inline.R

Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc. should go to the rcpp-devel mailing list off the R-Forge page. This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

### Linking R to IQFeed with the QuantTools package

Sun, 06/04/2017 - 09:18

IQFeed provides streaming data services and trading solutions that cover the agricultural, energy and financial marketplaces. It is a well-known and recognized data feed provider geared toward retail users and small institutions. The subscription price starts at around $80/month.

Stanislav Kovalevsky has developed a package called QuantTools. It is an all-in-one package designed to enhance quantitative trading modelling. It allows you to download and organize historical market data from multiple sources like Yahoo, Google, Finam, MOEX and IQFeed. The feature that interests me the most is the ability to link IQFeed to R. I've been using IQFeed for a few years and I'm happy with it (I'm not affiliated with the company in any way). More information can be found here. I've been looking for an integration with R for a while, and here it is. As a result, after running a few tests, I moved my code that was still in Python into R. Just for completeness, here's a link that explains how to download historical data from IQFeed using Python.

QuantTools offers four main functionalities: getting market data, storing/retrieving market data, plotting time series data, and back testing.

• Get Market Data

First make sure that IQFeed is open. You can download either daily or intraday data. The code below downloads daily prices (Open, High, Low, Close) for SPY from 1 January 2016 to 1 June 2016.

## Generic parameters
library(QuantTools)
from = '2016-01-01'
to = '2016-06-01'
symbol = 'SPY'
## Request data
get_iqfeed_data(symbol, from, to)

The same function downloads tick data when the period parameter is set accordingly:

## Generic parameters
library(QuantTools)
from = '2016-05-01'
to = '2016-05-03'
symbol = 'SPY'
## Request data
get_iqfeed_data(symbol, from, to, period = 'tick')

Note the period parameter. It can take any of the following values: tick, 1min, 5min, 10min, 15min, 30min, hour, day, week, month, depending on the frequency you need.

• Store/Retrieve Market Data

QuantTools makes the process of managing and storing tick market data easy. You just set up storage parameters and you are ready to go. The parameters are where to store the data, since what date, and which symbols. At any time you can add more symbols, and if they are not present in storage, QuantTools tries to get the data from the specified start date. The code below will save the data in the following directory: "C:/Users/Arnaud/Documents/Market Data/iqfeed". There is one sub-folder per instrument and the data is saved in .rds files.
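Since the storage format is plain .rds, individual instrument files can also be read back directly with base R. Here is a quick roundtrip sketch with a temporary file (synthetic data, not real IQFeed output):

```r
# saveRDS/readRDS roundtrip: the same serialisation format QuantTools
# uses for its per-instrument storage files.
tick_data <- data.frame(time  = as.POSIXct("2017-06-01 09:30:00", tz = "UTC") + 0:2,
                        price = c(240.10, 240.12, 240.11))
f <- tempfile(fileext = ".rds")
saveRDS(tick_data, f)
identical(readRDS(f), tick_data)  # TRUE: serialisation preserves the object exactly
```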

library(QuantTools)
settings = list(
  iqfeed_storage = paste(path.expand('~'), 'Market Data', 'iqfeed', sep = '/'),
  iqfeed_symbols = c('SPY', 'QQQ'),
  iqfeed_storage_from = format(Sys.Date() - 3)
)
QuantTools_settings(settings)
# Update storage with data from last date available until today
store_iqfeed_data()

You can also store data between specific dates. Replace the last line of code above with one of the lines below:

# Update storage with data from last date available until specified date
store_iqfeed_data(to = format(Sys.Date()))
# Update storage with data between from and to dates
store_iqfeed_data(from = format(Sys.Date() - 3), to = format(Sys.Date()))

Now should you want to get back some of the data you stored, just run something like:

get_iqfeed_data(symbol = 'SPY',
                from = '2017-06-01',
                to = '2017-06-02',
                period = 'tick',
                local = TRUE)

Note that only ticks are supported in local storage, so period must be 'tick'.

• Plot time series data

QuantTools provides the plot_ts function to plot time series data without weekend, holiday and overnight gaps. In the example below, I first retrieve the data stored above, then select the first 100 price observations, and finally draw the chart.

## Retrieve previously stored data
spy = get_iqfeed_data(symbol = 'SPY',
                      from = '2017-06-01',
                      to = '2017-06-02',
                      period = 'tick',
                      local = TRUE)
## Select the first 100 rows
spy_price = spy[, .(time, price)][1:100]
## Plot
plot_ts(spy_price)

Two things to notice: first, spy is a data.table object, hence the syntax above. To get a quick overview of data.table capabilities, have a look at this excellent cheat sheet from DataCamp. Second, the local parameter is TRUE, as the data is retrieved from internal storage.
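To make that data.table syntax concrete without an IQFeed connection, here is a small self-contained sketch using synthetic tick data (the column names simply mirror the example above, they are not actual IQFeed output):

```r
# Self-contained illustration of the data.table selection syntax,
# with synthetic data standing in for IQFeed ticks.
library(data.table)

set.seed(1)
spy <- data.table(time  = 1:500,                              # tick index stand-in
                  price = 240 + cumsum(rnorm(500, 0, 0.01)),  # random-walk prices
                  size  = sample(1:100, 500, replace = TRUE))

# .(time, price) selects two columns; [1:100] then keeps the first 100 rows
spy_price <- spy[, .(time, price)][1:100]

nrow(spy_price)   # 100
names(spy_price)  # "time" "price"
```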

• Back testing

QuantTools allows you to write your own trading strategy using its C++ API. I'm not going to elaborate on this as it is essentially C++ code. You can refer to the Examples section on the QuantTools website.

Overall I find the package extremely useful and well documented. The only missing bit is a live feed between R and IQFeed, which would make the package a real end-to-end solution.


### Quick illustration of Metropolis and Metropolis-in-Gibbs Sampling in R

Sun, 06/04/2017 - 02:00

(This article was first published on R – BioStatMatt, and kindly contributed to R-bloggers)

The code below gives a simple implementation of the Metropolis and Metropolis-in-Gibbs sampling algorithms, which are useful for sampling from probability densities whose normalizing constant is difficult to calculate, that are irregular, or that have high dimension (Metropolis-in-Gibbs).

## Metropolis sampling
## x    - current value of Markov chain (numeric vector)
## targ - target log density function
## prop - function with prototype function(x, ...) that generates
##        a proposal value from a symmetric proposal distribution
library('mvtnorm')
prop_mvnorm <- function(x, ...)
  drop(rmvnorm(1, mean = x, ...))
metropolis <- function(x, targ, prop = prop_mvnorm, ...) {
  xnew <- prop(x)
  lrat <- targ(xnew, ...) - targ(x, ...)
  if (log(runif(1)) < lrat)
    x <- xnew
  return(x)
}

## Metropolis-in-Gibbs sampling
## x    - current value of Markov chain (numeric vector)
## targ - target log density function
## ...  - arguments passed to 'targ'
gibbs <- function(x, targ, ...) {
  for (i in 1:length(x)) {
    ## define full conditional
    targ1 <- function(x1, ...) {
      x[i] <- x1
      targ(x, ...)
    }
    ## sample using Metropolis algorithm
    x[i] <- metropolis(x[i], targ1, ...)
  }
  return(x)
}

The following code produces the figure below, illustrating the two methods on a 'dumbbell' distribution (cf. the R package 'ks' vignette).

### The code below illustrates the use of the functions above
## target 'dumbell' density (c.f., R package 'ks' vignette)
library('ks')
mus <- rbind(c(-2, 2), c(0, 0), c(2, -2))
sigmas <- rbind(diag(2),
                matrix(c(0.8, -0.72, -0.72, 0.8), nrow = 2),
                diag(2))
cwt <- 3/11
props <- c((1 - cwt)/2, cwt, (1 - cwt)/2)
targ_test <- function(x, ...)
  log(dmvnorm.mixt(x, mus = mus, Sigmas = sigmas, props = props))

## proposal covariance -- not defined in the excerpt as published;
## a plausible value is assumed here
vcov_test <- 0.4 * diag(2)

## plot contours of target 'dumbell' density
set.seed(42)
par(mfrow = c(1, 2))
plotmixt(mus = mus, Sigmas = sigmas, props = props,
         xlim = c(-4, 4), ylim = c(-4, 4),
         xlab = expression(x[1]), ylab = expression(x[2]),
         main = "Metropolis-in-Gibbs")

## initialize and sample using Metropolis-in-Gibbs
xcur <- gibbs(c(0, 0), targ_test, sigma = vcov_test)
for (j in 1:2000) {
  xcur <- gibbs(xcur, targ_test, sigma = vcov_test)
  points(xcur[1], xcur[2], pch = 20, col = '#00000055')
}

## plot contours of target 'dumbell' density
plotmixt(mus = mus, Sigmas = sigmas, props = props,
         xlim = c(-4, 4), ylim = c(-4, 4),
         xlab = expression(x[1]), ylab = expression(x[2]),
         main = "Metropolis")

## initialize and sample using Metropolis
xcur <- metropolis(c(0, 0), targ_test, sigma = vcov_test)
for (j in 1:2000) {
  xcur <- metropolis(xcur, targ_test, sigma = vcov_test)
  points(xcur[1], xcur[2], pch = 20, col = '#00000055')
}

The figure illustrates two contrasting properties of the two methods:

1. Metropolis-in-Gibbs samples can get 'stuck' in certain regions of the support, especially when there are multiple modes and/or significant correlation among the random variables. This is not as much a problem for Metropolis sampling.
2. Metropolis sampling can produce fewer unique samples due to the poor approximation of the proposal density to the target density. This occurs more often for high-dimensional target densities.
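Point 2 can be seen numerically even in one dimension. The sketch below (a stripped-down scalar Metropolis sampler with a standard-normal target, not part of the original post) compares the fraction of accepted moves for a well-matched versus a far-too-wide proposal:

```r
# 1-D illustration of point 2: a poorly matched (too wide) proposal
# is rejected most of the time, leaving fewer unique samples.
metropolis1d <- function(x, sd_prop) {
  xnew <- rnorm(1, mean = x, sd = sd_prop)               # symmetric proposal
  lrat <- dnorm(xnew, log = TRUE) - dnorm(x, log = TRUE) # log acceptance ratio
  if (log(runif(1)) < lrat) xnew else x
}

acceptance_rate <- function(sd_prop, n = 5000) {
  x <- numeric(n)
  for (i in 2:n) x[i] <- metropolis1d(x[i - 1], sd_prop)
  mean(diff(x) != 0)  # fraction of moves that were accepted
}

set.seed(1)
acceptance_rate(1)   # moderate proposal: most moves accepted
acceptance_rate(50)  # far-too-wide proposal: mostly rejections
```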

To leave a comment for the author, please follow the link and comment on their blog: R – BioStatMatt.


### datazar

Sun, 06/04/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A few weeks ago and then some, I [as occasional blogger!] got contacted by datazar.com to write a piece on this data-sharing platform. I then went and checked what this was all about, having the vague impression this was a platform where I could store and run R code, besides dropping collective projects, but from what I quickly read, it sounds more like being able to run R scripts from one's machine using data and code stored on datazar.com. But after reading just one more blog entry I finally understood it is also possible to run R, SQL, NotebookJS (and LaTeX) directly on that platform, without downloading code or data to one's machine. Which makes it a definitive plus for this site, as users can experiment with no transfer to their computer, and hence on a larger variety of platforms. While personally I do not [yet?] see how to use it for my research or [limited] teaching, it seems like [yet another] interesting exploration of the positive uses of the Internet to collaborate and communicate on scientific issues! With no opinion offered on the privacy and data protection provided by the site, of course.

Filed under: R, Statistics, University life Tagged: blogging, data privacy, data science, data-sharing, datazar.com, LaTeX, R, R-bloggers


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


### Deep Learning Dude pt 1

Sat, 06/03/2017 - 17:39

(This article was first published on Rblog – Thinkinator, and kindly contributed to R-bloggers)

You’ve probably noticed that Deep Learning is all the rage right now. AlphaGo has beaten the world champion at Go, you can google cat photos and be sure you won’t accidentally get photos of canines, and many other near-miraculous feats: all enabled by Deep Learning with neural nets. (I am thinking of coining the phrase “laminar learning” to add some panache to old-school non-deep learning.)

I do a lot of my work in R, and it turns out that not one but two R packages have recently been released that enable R users to use the famous Python-based deep learning package, Keras: keras and kerasR.

The first thing you’ll need to do is to make sure you have Python 3 on your system. Windows generally doesn’t have Python at all, while Macs and some Linux systems will include Python 2.7. The way I’d recommend to get Python 3 is to download and install Anaconda. (Few people do data science with a base-installed Python: they mostly use Anaconda, which pre-packages a lot of data science tools along with Python. Trying to install all of those tools yourself is … difficult.)

The usual place to install Anaconda is in your home directory. Then, on a Mac or Linux system add ~/anaconda/bin: to the beginning of your PATH environment variable. For example, on my system, I edited the .profile file in my home directory and the beginning of the PATH line looks like export PATH=~/anaconda/bin:/usr/local/bin:, so that when I type python the system will use the one provided by Anaconda.

One of the nice things about Anaconda is that it provides an enhanced package loading system, conda, which is in some ways similar to R's CRAN. But to install Keras, we'll use Python's pip, which is similar to R's devtools and is run from a command line (not from within Python):

pip install keras tensorflow pydot graphviz

should do the trick. The last two packages allow you to plot the model, though the plot is fairly boring.

Of the two packages (keras and kerasR), I’ve started using kerasR because it has some nice tutorials and it’s in CRAN, so that’s what I’ll use here. (keras is installed via devtools.)

In R, install packages kerasR and reticulate from CRAN. To load kerasR requires an extra step beyond the usual library:

reticulate::use_python ("~/anaconda/bin/python")
library (kerasR)

The use_python call tells R where to find the Python 3 that you're using, the one where you also pip-installed keras. If library(kerasR) doesn't think for a moment and then report "successfully loaded keras", something is wrong, and you'll get a bunch of Python error messages which will give you a clue if you examine them carefully. If you forget the use_python, R will look at the base-installed Python and won't be able to import keras. The things that can go wrong are myriad, and I can't really be more specific, unfortunately.

If you got this far, you can now follow the kerasR tutorial. At the moment (03 June), there is an error in the first model of the tutorial (Boston Housing) that causes it to perform poorly. I’ve submitted a trouble ticket, and the code I’d recommend is:

mod <- Sequential ()

mod$add (Dense (units=50, input_shape=13))
mod$add (Activation ("relu"))

mod$add (Dense (units=1)) # mod$add (Activation ("relu"))

keras_compile (mod, loss="mse", optimizer=RMSprop())

X_train <- scale (boston$X_train)
Y_train <- boston$Y_train
X_test <- scale (boston$X_test,
                 center=attr (X_train, "scaled:center"),
                 scale=attr (X_train, "scaled:scale"))
Y_test <- boston$Y_test

keras_fit (mod, X_train, Y_train, batch_size=32, epochs=200, verbose=1, validation_split=0.1)

pred <- keras_predict (mod, X_test)
sd (as.numeric (pred) - Y_test) / sd (Y_test)

plot (Y_test, pred)
abline (0, 1)

Keras works a bit differently from the way R usually works, in that mod$add modifies the model mod directly, in place. Sequential () first creates an empty model; the mod$add calls then add a layer with 50 hidden units and a "relu" activation function, and then add the 1-unit output layer.
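This modify-in-place behaviour comes from the model being a Python object under the hood. A rough R analogy uses an environment (purely illustrative; this is not how kerasR is actually implemented):

```r
# Environments, like Python objects, have reference semantics in R:
# modifying one inside a function changes the original, with no
# reassignment needed -- ordinary R values would be copied instead.
make_model <- function() {
  mod <- new.env()
  mod$layers <- character(0)
  mod$add <- function(layer) mod$layers <- c(mod$layers, layer)
  mod
}

mod <- make_model()
mod$add("dense_50_relu")  # mutates 'mod' in place
mod$add("dense_1")
mod$layers                # "dense_50_relu" "dense_1"
```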

This is pretty much a simple Hello World model with 13 input variables and one hidden layer with 50 units. You could have made the same model in older R neural net packages, but Keras has many advantages.

In the tutorial (as of 03 June), the R scale calls for the X training and testing data were independent and not linked. In this case, I scale the training data and then use the same center and scale for the testing data, just as you would when you deploy a model: training represents the data we already have, while testing represents new data arriving in the future. (This is a pet peeve on my part, and not generally important.)
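The pattern in base R looks like this, as a standalone sketch with random data (independent of the Boston example):

```r
# Standardise using statistics computed on the training data only,
# then apply the *same* centre and scale to the test data.
set.seed(1)
X_train <- matrix(rnorm(40, mean = 10, sd = 3), ncol = 2)
X_test  <- matrix(rnorm(10, mean = 10, sd = 3), ncol = 2)

X_train_s <- scale(X_train)
X_test_s  <- scale(X_test,
                   center = attr(X_train_s, "scaled:center"),
                   scale  = attr(X_train_s, "scaled:scale"))

# Training columns are exactly standardised; test columns only roughly so,
# which is the behaviour you want when test data stands in for the future.
round(colMeans(X_train_s), 10)  # 0 0
```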

More importantly, the tutorial also accidentally applied a second normalization to the data in the prediction step, which would drive it in a different direction from the training data. The version, above, has results that look pretty reasonable:

This example isn’t Deep Learning(tm) and you could’ve done this with older R neural net packages, but it’s just the start of Keras exploration. Follow the kerasR tutorials and the links they recommend. For more details on what the various kerasR functions do, check out the Keras pages. (Remembering that kerasR doesn’t implement everything in Keras itself.)

For a very readable explanation of Deep Learning architectures, first read Neural Network Zoo Prequel, and then Neural Network Zoo, by the Asimov Institute.

One of the advantages of Keras is that it's built on TensorFlow, which takes full advantage of clusters of machines, GPUs, and all those other things that make Google Google. Unfortunately, by "GPU" we mean Nvidia GPUs (i.e. GPUs that support CUDA). My Mac laptop has an AMD graphics chip, so I can't use GPUs, though I can still develop things on my laptop and then someday spin things up on Amazon Web Services and use GPU-based instances.

Filed under: Data Science, R, Rblog, Uncategorized


To leave a comment for the author, please follow the link and comment on their blog: Rblog – Thinkinator.


### Global choropleth maps of military expenditure

Sat, 06/03/2017 - 14:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Global choropleth maps

Today I’m writing about using country-level choropleth maps at a global level, using example data showing nations’ military spending as a percentage of GDP. There are a few specific issues to work through as a result of the global coverage. I’m going to start with my two preferred end results.

First, here’s a Twitter-friendly rotating globe:

There are a few problems with it I didn’t quite finish. If you watch it long enough you’ll see a few times Canada, the USA or Russia flicker out of the frame. More on that below.

Second, here’s an interactive version, suited for a fully featured web browser:

I find this the most useful version, even though I prefer to look at the rotating globe. The ability to hover over a country to get its name and value, and to zoom into a particular region, adds real value to the visualization. This is really the minimum people expect out of what I call micro-interactivity (I’m sure there’s a proper, technical name) – ability to zoom and get useful information when hovering.

Unless you spend a lot of time looking at maps of the Middle East, you probably wouldn’t have identified those two dark patches of high military spending as Oman and Eritrea, without the help of the hovering tooltips.

Caveat for today’s post – I’m not a GIS expert at all, and in fact am barely into the “conscious incompetence” stage. So there may be mistakes in what follows, and there are almost certainly better ways to do some of what I’m doing.

Data on military spend

First I needed some example data. Choropleth maps are appropriate for variables like proportions and growth rates rather than absolute numbers. Military spending as a percentage of GDP was in the news when I started this, because of USA President Trump drawing attention (in somewhat misleading terms) to the pledge by NATO leaders in 2014 to spend 2% of their GDP on the military by 2024 (the 2% was described as an unofficial floor as early as 2006, but only committed to at the leadership level in 2014 in the atmosphere of nervousness after Russia’s actions in the Ukraine). With seven years to the commitment deadline, it is widely reported that only five countries have yet passed the 2% figure, usually listed as the United States, Greece, United Kingdom, Estonia and Poland. Interestingly, the data I’m using suggest the UK is currently just under 2% and France just over, slightly different to some of the media reporting.

Often my first port of call looking for internationally comparable data is the World Development Indicators of the World Bank. While they are the definitive origin of very little data, the World Bank does a great job in collating data from other sources and making it available in their visualisation tool. There’s also Vincent Arel-Bundock’s handy WDI R package which makes it easy to get the data from the Bank into your own analytical tool.

When I started this exercise a couple of weeks ago, I was able to create this choropleth map from the very friendly WDI visualisation tool:

(but interestingly, returning to the site today, I can't reproduce it). Anyway, there are a few things that leave me unhappy with it:

• The data only go to 2015, whereas the data from the source goes up to 2016 in many cases.
• The Mercator projection of the world map – according to xkcd’s brilliant summary of ‘what your favorite map projection says about you’, Mercator says “you’re not really into maps”. There are several problems, but for me it’s all summed up by Greenland looking way too large.
• There’s a great ability to use the slider at the bottom of the map (not in the above screenshot though) to cycle through the years, but the scale of the colour fill recalibrates itself each year so this doesn’t allow a coherent view of change over time.
• The colour scheme isn't particularly good at drawing out differences between countries.

So I went to the source, the Stockholm International Peace Research Institute (SIPRI), who collate data from governments around the world into a Yearbook and publish the data separately – at the time of writing, with data from 1949 to 2016, using the countries that exist in 2016 (ie no East Germany, Soviet Union, etc; data for the Russian Federation, Ukraine, etc begins in 1992 or soon after).

Data exploration

If we plot all the data for all the countries and years at once, it looks like this:

First thing you might notice is the massive spike. That turns out to be Kuwait in 1991, when it spent 17 percent more than its GDP on the military – the kind of thing that happens when you're invaded and become the battleground for what in the USA is referred to as the First Gulf War.

In case you’re wondering, there’s no reason why a country can’t spend more than its GDP on the military (or anything else). In fact, while they are both measured in dollars, GDP is not really the same sort of variable as spend. GDP is the sum of value add of economic activity in the country. If spending is on imported goods and services it can exceed GDP. Similarly, while we see economic indicators like “exports as a percentage of GDP” in use (and it is a good indicator of the importance of trade), there’s no ceiling of that indicator at 100%, and countries like Luxembourg, Singapore, Malta and Ireland do in fact exceed 100%.

Here’s the code to download and import the SIPRI “military as a percentage of GDP” data into R and draw that first graph:

# Load in all the packages we need for the full post
library(cshapes)
library(tidyverse)
library(forcats)
library(scales)
library(openxlsx)
library(directlabels)
library(ggthemes)
library(RColorBrewer)
library(countrycode)
library(maps)
library(viridis)
library(leaflet)
library(rworldmap)
library(sf)
library(maptools)

#------------download------------
tf <- tempfile()
download.file("https://www.sipri.org/sites/default/files/SIPRI-Milex-data-1949-2016.xlsx",
              destfile = tf, mode = "wb")

#---------------import and tidy----------
sipri <- read.xlsx(tf, sheet = "Share of GDP", startRow = 6) %>%
  gather(Year, Value, -Country, -Notes) %>%
  mutate(Value = as.numeric(Value),
         Year = as.numeric(Year)) %>%
  filter(!is.na(Value))

#----------------exploratory chart-----------
sipri %>%
  ggplot(aes(x = Year, y = Value, colour = Country)) +
  geom_line() +
  theme(legend.position = "none") +
  scale_y_continuous("Military expenditure as a percentage of GDP", label = percent) +
  ggtitle("Countries' and regions' expenditure on military as a percentage of GDP") +
  labs(caption = "Source: Stockholm International Peace Research Institute", x = "")

As further exploration, I picked the five so called “Five Eyes” countries. I have a little familiarity with their military history so thought this could be a useful way of checking I understood the data. Here is military spend as a percentage of GDP for those five countries, with major conflicts they were involved in indicated in a timeline.

The dates of the conflicts are just taken from Wikipedia.

The spikes in spending clearly and unsurprisingly relate to conflicts or other known political events, most obviously:

• The Korean war led to a big spike for all four countries that had data available.
• The surge in intensity (of their involvement) in the Vietnam war led to spikes for the USA and Australia.
• There was a blip in UK military spending at the time of the Falklands war, but it did not arrest the overall downwards trend, even more marked since significant military conflict in Northern Ireland finished.
• The early 1980s military boom for the USA was the Reagan administration ratcheting up the Cold War as well as militarily engaging in other countries in ways too numerous to show.
• The wars of the George W. Bush era in Afghanistan and Iraq, attempts to create a formal “War on Terror”, temporarily reversed the secular decline in the economic importance of US military spending.

Code for drawing this chart (I’m not particularly happy with the manual data entry for the conflicts and their dates, but it will do for a one-off thing like this):

wid <- 0.01 # width of annotation bars
wars <- data_frame(
  name = c("Malaysia", "Korea", "Aden", "Vietnam", "Northern Ireland",
           "Falklands", "Gulf", "Afghanistan", "Iraq"),
  start = as.Date(c("15/6/1948", "25/6/1950", "10/12/1963", "1/11/1955", "1/1/1968",
                    "2/4/1982", "2/8/1990", "7/10/2001", "20/3/2003"),
                  format = "%d/%m/%Y"),
  end = as.Date(c("12/7/1960", "27/7/1953", "30/11/1967", "30/4/1975", "30/6/1998",
                  "14/6/1982", "28/2/1991", "28/12/2014", "18/12/2011"),
                format = "%d/%m/%Y"),
  lead = c("UK", "USA", "UK", "USA", "UK", "UK", "USA", "USA", "USA")
) %>%
  mutate(name_seq = n():1,
         ystart = wid * name_seq + 0.12,
         yend = ystart + wid)

countries <- c("Australia", "New Zealand", "USA", "UK", "Canada")
palette <- brewer.pal(5, "Set1")
names(palette) <- countries

p <- sipri %>%
  filter(Country %in% countries) %>%
  mutate(YearDate = as.Date(paste0("30/6/", Year), format = "%d/%m/%Y")) %>%
  ggplot() +
  geom_rect(data = wars, aes(xmin = start, xmax = end,
                             ymin = ystart, ymax = yend, fill = lead),
            alpha = 0.2) +
  geom_text(data = wars, aes(x = end, y = (yend + ystart) / 2, label = name),
            colour = "grey50", hjust = 1, nudge_x = -200, size = 3) +
  geom_line(aes(x = YearDate, y = Value, colour = Country)) +
  scale_y_continuous("Military expenditure as a percentage of GDP",
                     label = percent, breaks = c(0, 5, 10, 15) / 100) +
  ggtitle("Selected countries' expenditure on military as a percentage of GDP",
          "'Five eyes' countries only; periods also shown for conflicts in which they had material deployments") +
  labs(x = "", caption = "Source: Stockholm International Peace Research Institute") +
  scale_colour_manual(values = palette) +
  scale_fill_manual(values = palette, guide = "none") +
  theme_tufte(base_family = "myfont") +
  theme(plot.caption = element_text(colour = "grey50")) +
  annotate("text", x = as.Date("1970/6/30"), y = 0.145, hjust = 0.5,
           colour = "grey50", size = 3,
           label = "Not shown - Grenada, Panama, Balkans,\nLibya, Lebanon, Haiti and numerous smaller...")

direct.label(p)

Different world projections

OK, time for maps. First I wanted to sort out some projections for flat versions of the world. ggplot2 makes this convenient. You can define a single map object, and just add a different version of + coord_map() to it to get a different projection. Here are two which worked quite nicely:

Globular projection

Projection with bilateral symmetry about the Prime Meridian and the equator. Hemisphere is circle, circular arc meridians equally spaced on equator, circular arc parallels equally spaced on 0- and 90-degree meridians:

Orthographic projection

Viewed from infinity, centering on 20 degrees latitude, 20 degrees longitude:

Code for the basic maps

There’s a bit of work to be done to join the military spend data to map data. Most importantly, I need a country code. The countrycode package is a great example of a specialist package that does one thing well: it converts between different country-name and country-code schemes. For my use case, I want everything to be the ISO three-character code (e.g. NZL for New Zealand). Once this is done, all that is required is to create a flat ggplot-ready version of a map shape file and join the two data frames together.
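For instance, here is a minimal sketch of the conversion step (the two-country input vector is just illustrative):

```r
library(countrycode)

# convert country names to ISO 3166-1 alpha-3 codes
countrycode(c("New Zealand", "United States"),
            origin = "country.name", destination = "iso3c")
# [1] "NZL" "USA"
```

Unmatched names return NA with a warning, which is why the Kosovo and Central African Republic codes get patched manually in the code below.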

In the code below I create a single ggplot2 object worldmap, which is an un-projected world choropleth map complete with the information on colour scale (I use viridis inferno), legend, captions, etc. Then I can add whatever projection I want (final two lines of code in the chunk below).

#===============map prep========================
sipri <- sipri %>%
  mutate(iso3c = countrycode(Country, "country.name", destination = "iso3c")) %>%
  # two manual concordances of country code:
  mutate(iso3c = ifelse(Country == "Kosovo", "KOS", iso3c))

# for some reason the tidyverse doesn't work for Central African Republic!
# so we fix it old school:
sipri[sipri$Country == "Central African Rep.", "iso3c"] <- "CAF"

world <- map_data("world") %>%
  mutate(iso3c = countrycode(region, "country.name", destination = "iso3c"))

# data on military for just the latest year for each country
the_data <- sipri %>%
  group_by(Country) %>%
  filter(Year == max(Year))

# using the help at http://ggplot2.tidyverse.org/reference/coord_map.html
world2 <- world %>%
  left_join(the_data, by = "iso3c")

# define a ggplot mapping object
worldmap <- ggplot(world2, aes(x = long, y = lat, group = group, fill = Value)) +
  geom_polygon(colour = "grey75") +
  scale_y_continuous("", breaks = (-2:2) * 30) +
  scale_x_continuous("", breaks = (-4:4) * 45) +
  scale_fill_viridis("", label = percent, option = "inferno", direction = -1) +
  theme_minimal(base_family = "myfont") +
  theme(legend.position = "right", axis.text = element_blank()) +
  ggtitle("Military expenditure as a percentage of GDP",
          "Most recent data shown for each country; mostly this is 2016") +
  labs(caption = "Source: Stockholm International Peace Research Institute")

#-------------single map, nice projection-------------------
# Lots of the projections in coord_map have problems with drawing
# polygons. OK are: globular, harrison
worldmap + coord_map("globular")
worldmap + coord_map("orthographic", orientation = c(20, 20, 0))

Animated choropleth over time

Remembering my reservations about the World Development Indicators site, one of my aims was to have a choropleth map that showed military expenditure as a proportion of GDP for different years, with colour on the same scale.
An animated graphic is the obvious way to do this: … but it has imperfections:

• Because of the few year-country combinations with massive spikes (e.g. Kuwait in 1991), I needed to use a transform on the scale or everything looked the same colour.
• The slowly changing dimension of country existence proves a real problem, with data before the 1990s sadly lacking.
• Data existence in general tends to dominate the visual; so rather than a feel for how military expenditure changes over time, the impression one gets is of more data gradually becoming available over time.

Not really a success. Nonetheless, a nice idea (I think), and here’s the code that does it. It’s basically the same idea as the previous chunk of code, but in a loop that repeats the data filtering and joining for each value of year, and saves a single frame for each printed annual map.

#================animated map over time===============
setwd("C:/temp1") # or whatever folder you want to hold the frames in
years <- unique(sipri$Year)

for(i in years){
  this_year_data <- sipri %>%
    group_by(Country) %>%
    filter(Year == i)

  world2 <- world %>%
    left_join(this_year_data, by = "iso3c")

  worldmap <- ggplot(world2, aes(x = long, y = lat, group = group, fill = Value)) +
    geom_polygon(colour = "grey75") +
    scale_y_continuous("", breaks = (-2:2) * 30) +
    scale_x_continuous("", breaks = (-4:4) * 45) +
    scale_fill_viridis("", label = percent, option = "inferno", direction = -1,
                       limits = range(sipri$Value), trans = "sqrt",
                       breaks = c(1, 10, 40, 100) / 100) +
    theme_minimal(base_family = "myfont") +
    theme(legend.position = "right", axis.text = element_blank()) +
    ggtitle(paste("Military expenditure as a percentage of GDP -", i)) +
    labs(caption = "Source: Stockholm International Peace Research Institute")

  png(paste0(i, ".png"), 12 * 72, 6 * 72)
  print(worldmap + coord_map("harrison", dist = 1, angle = 30, ylim = c(-75, 85)))
  dev.off()
}

# Use ImageMagick to convert all those PNG frames into a single animated GIF
# (requires ImageMagick to be separately installed)
system('magick -loop 0 -delay 40 *.png "0099-military-gdp-time.gif"')

Animated globe

Finally, on to the two “good” versions of the data. Here’s a reminder of the animated globe:

The strategy is to create 360 different snapshots of the globe, each one changed 1 degree in longitude and latitude from the previous. Longitude moves 1 degree each frame in the same direction; latitude meanders from 30 degrees north to 30 degrees south and back again.

I had to leave the ggplot2 universe to do this (but I’m not sure I had to). I couldn’t get my projections of the ggplot2 version of the globe to work with enough values of latitude and longitude as the centre point. Whenever one of the larger countries like Russia, China or Canada was split in two, ggplot2 would draw the polygons in ugly ways. My best effort can be seen in this tweet – which also has some very useful comments and suggestions in response.

In the end I adapted the suggestion of Edzer Pebesma, which is included in a demo in his amazing sf package. I don’t really understand exactly how it successfully clips the polygons, but it works almost entirely well. There were still quite a few frames where countries went missing altogether for the wrong combination of latitude and longitude. Experimenting with different versions of world maps, I found that some were vulnerable to different combinations than others. To reduce the problem, I made each frame of the end animation actually two globes of countries drawn on top of each other – to increase the chance that at least one of the maps draws each country. This reduced the missing-country problem to only three or four frames.
Here’s the code that does that:

#------------sf approach------------------
# this solution came from the demo in the sf package, tweeted about at
# https://twitter.com/edzerpebesma/status/835249468874321920
circ <- function(l = c(-180:180), lon0 = 0, lat0 = 30) {
  deg2rad = pi / 180
  lat = atan(-cos((l - lon0) * deg2rad) / tan(lat0 * deg2rad)) / deg2rad
  xy = if (lat0 == 0) {
    l1 = lon0 - 90
    l2 = lon0 + 90
    rbind(c(l1, -90), c(l2, -90), c(l2, 0), c(l2, 90), c(l1, 90), c(l1, 0), c(l1, -90))
  } else if (lat0 > 0) {
    xy = cbind(lon = l, lat = lat)
    rbind(c(-180, 90), xy, c(180, 90), c(-180, 90))
  } else {
    xy = cbind(lon = l, lat = lat)[length(l):1, ]
    rbind(c(180, -90), xy, c(-180, -90), c(180, -90))
  }
  st_sfc(st_polygon(list(xy)), crs = st_crs(4326))
}

m <- st_make_grid()
m <- st_segmentize(m, 4e5)

data(wrld_simpl)

# Two versions of the world map, joined to the military/GDP data:
w1 <- st_as_sf(countriesLow) %>%
  left_join(the_data, by = c("ISO3" = "iso3c")) %>%
  mutate(fill = pal(Value))

w2 <- st_as_sf(wrld_simpl) %>%
  left_join(the_data, by = c("ISO3" = "iso3c")) %>%
  mutate(fill = pal(Value))

# latitudes will go from 30 north to 30 south and back again:
lats <- rep(c(30:(-30), (-29):29), 3)
# longitudes will start at 180 and go around backwards:
lons <- c(180:(-179))

dir.create("C:/temp2")
setwd("C:/temp2")

# Frame 234 goes missing if using wrld_simpl
# Frame 269 goes missing if using countriesLow or wrld_simpl
# etc
for(i in 1:length(lons)){
  png(paste0(1000 + i, ".png"), 600, 550, res = 100)
  par(mar = rep(0, 4), bg = "black")

  lat <- lats[i]
  lon <- lons[i]

  # define the proj4 string:
  p4s <- paste0("+proj=ortho +lat_0=", lat, " +lon_0=", lon)

  # draw the pale blue globe in space
  blank_globe <- st_transform(m, st_crs(p4s), check = TRUE)
  plot(blank_globe, col = '#e6f2ff', border = 'grey99')

  # create a clipped version of the great circle??
  # I don't really understand what is happening, but it seems to work.
  crc <- circ(lat0 = lat, lon0 = lon)
  w10 <- suppressWarnings(st_intersection(w1, crc))
  w20 <- suppressWarnings(st_intersection(w2, crc))

  # cast and re-project the map itself
  w10 <- st_cast(w10, "MULTIPOLYGON")
  w10 <- st_transform(w10["fill"], st_crs(p4s), check = TRUE)
  w20 <- st_cast(w20, "MULTIPOLYGON")
  w20 <- st_transform(w20["fill"], st_crs(p4s), check = TRUE)

  # draw the map
  plot(w10, col = w10$fill, add = TRUE, border = NA)
  plot(w20, col = w20$fill, add = TRUE, border = "grey75")

  # title and legend
  title("Military expenditure as a percentage of GDP\nNearest year to 2016",
        col.main = "white", font.main = 1, adj = 0)
  title(sub = "Source: Stockholm International Peace Research Institute",
        col.sub = "grey50", adj = 1)

  leg_nums <- seq(from = 4, to = 20, length.out = 6) / 100
  legend("bottomright", legend = paste0(round(leg_nums * 100), "%"),
         pch = 15, adj = 0.1, col = pal(leg_nums), text.col = pal(leg_nums),
         bg = "grey80")

  dev.off()
}

system('magick -loop 0 -delay 7 *.png "0099-military-gdp.gif"')

Interactive leaflet choropleth

Finally, there’s the interactive version (my favourite). To draw this, I used leaflet and went back to sp spatial polygons data frames. The code for this is actually some of the simplest in this post. While most of the examples I’ve seen of leaflet use Mercator projections, it’s possible to re-project your map (so long as you don’t need Google or other map tiles) with a couple of lines of code:

#-------------------------leaflet------------------
shape <- countriesCoarse

# define colour palette
pal <- colorNumeric(
  palette = inferno(10, direction = -1),
  domain = the_data$Value)

# uses the_data defined earlier in the ggplot2 demos
data2 <- shape@data %>%
  left_join(the_data, by = c("ISO_A3" = "iso3c")) %>%
  mutate(tooltip = paste0(ADMIN, " ", Year, ", ", round(Value * 100, 1), "%"))

# EPSG4326 means latitude and longitude
coarse_crs <- leafletCRS(crsClass = "L.CRS.EPSG4326",
                         proj4def = proj4string(countriesCoarse))

shape %>%
  leaflet(options = leafletOptions(crs = coarse_crs)) %>%
  addPolygons(stroke = FALSE, smoothFactor = 0.2, fillOpacity = 1,
              color = ~pal(data2$Value),
              label = data2$tooltip)

Happy mapping!


To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


### RTutor: The effect of the TseTse fly on African Development

Sat, 06/03/2017 - 10:00

(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

What is the causal impact of the TseTse fly on historical and current development in Africa? Marcella Alsan studies this question in her fascinating article ‘The effect of the TseTse fly on African development’ in the American Economic Review, 2015.

As part of her Bachelor Thesis at Ulm University, Vanessa Schöller has generated a very nice RTutor problem set that allows you to replicate the main insights of the paper in an interactive fashion.

Here is a screenshot:

Like in previous RTutor problem sets, you can enter free R code in a web-based shiny app. The code will be automatically checked, and you can get hints on how to proceed. In addition, you are challenged by many multiple-choice quizzes.

To install the problem set locally, first install RTutor as explained here:

https://github.com/skranz/RTutor

and then install the problem set package:

https://github.com/vanessaschoeller/RTutorTseTse

There is also an online version hosted by shinyapps.io that allows you to explore the problem set without any local installation. (The online version is capped at 30 hours total usage time per month, so it may be greyed out when you click on it.)

https://vanessaschoeller.shinyapps.io/RTutorTseTse/

If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at the RTutor GitHub page:

https://github.com/skranz/RTutor


To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts.


### Deferred & Remote Function Execution in R

Sat, 06/03/2017 - 08:31

(This article was first published on R – Random Remarks, and kindly contributed to R-bloggers)

What would you say if you could automatically wrap your R pipeline – which consists of numerous functions and variables – into a single function? What would you say if you could do it repeatedly, with no extra effort regardless of its complexity? Would you store it to document your progress at that particular moment in time? Would you serialize and send it to a remote R session to offload computations? Well, it’s kind-of possible.

defer brings a new capability to R – packaging, as a single R function, a number of inter-connected functions and variables which exist in the current R session. It is something I found useful during fast-prototyping phases of my own work and wanted to share for quite some time now – and at this point I’m interested in seeing if it turns out useful for someone else besides me.

Here’s a simplistic example where defer is given a top-level function h() that requires two other functions (f() and g()), a variable (C) and a library call (base::mean).

C <- 10
f <- function(x) x*x + C
g <- function(y) (f(y) + 10) * f(y+3)
h <- function(z) mean(c(g(z), g(z+1), g(z+2)))

wrapper <- defer(h)
#> Found functions:
#>   g, f
#> variables:
#>   C
#> library calls:
#>   base::mean

rm(C, f, g, h)
wrapper(10)
#> [1] 29688.67

Here is another example where that same wrapper() can be run in a remote R session (via opencpu) without installing or deploying any extra code on the remote server. First, we serialize wrapper as a base64-encoded string.

library(base64enc)
buffer <- rawConnection(raw(0), 'w')
saveRDS(wrapper, buffer)
serialized <- base64enc::base64encode(rawConnectionValue(buffer))

Then, we write that string into a temporary file together with a few lines of code to deserialize and run it in a remote R session.

local_script_path <- tempfile(fileext = ".R")
cat(paste0("base64 <- '", serialized, "'\n",
           "library(base64enc)\n",
           "bytes <- rawConnection(base64enc::base64decode(base64), 'r')\n",
           "deserialized <- readRDS(bytes)\n",
           "deserialized(10)\n"),
    file = local_script_path)

Finally, we upload the file on the opencpu server and run it via base::source.

library(httr)
result <- httr::POST(public_opencpu_url,
                     body = list(file = upload_file(local_script_path)))
content(result, 'text')
#> [1] "$value\n[1] 29688.67\n\n$visible\n[1] TRUE\n\n"

And that seems like a number we’ve seen already, right?

The package is currently available only from GitHub, as it’s not nearly ready for CRAN.


To leave a comment for the author, please follow the link and comment on their blog: R – Random Remarks.


### R Weekly Bulletin Vol – X

Sat, 06/03/2017 - 03:49

This week’s R bulletin will cover topics on grouping data using the ntile function, how to open files automatically, and formatting an Excel sheet using R.

We will also cover functions like the choose function, the sample function, and the runif and rnorm functions. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. Fold selected chunk – Alt+L
2. Unfold selected chunk – Shift+Alt+L
3. Fold all – Alt+0

Problem Solving Ideas

Grouping data using ntile function

The ntile function is part of the dplyr package, and is used for grouping data. The syntax for the function is given by:

ntile(x, n)

Where,
“x” is the vector of values and
“n” is the number of buckets/groups to divide the data into.

Example:

In this example, we first create a data frame from two vectors, one comprising stock symbols and the other their respective prices. We then group the values in the Price column into 2 groups, and the ranks are populated in a new column called “Ntile”. In the last line we select only those values which fall in the 2nd bucket, using the subset function.

library(dplyr)
Ticker = c("PAGEIND", "MRF", "BOSCHLTD", "EICHERMOT", "TIDEWATER")
Price = c(14742, 33922, 24450, 21800, 5519)
data = data.frame(Ticker, Price)
data$Ntile = ntile(data$Price, 2)
print(data)

ranked_data = subset(data, subset = (Ntile == 2))
print(ranked_data)

Automatically open the saved files

If you are saving the output returned upon executing an R script, and also want to open the file after running the code, you can use the shell.exec function. This function opens the specified file using the application specified in the Windows file associations.

A file association associates a file with an application capable of opening that file. More commonly, a file association associates a class of files (usually determined by their filename extension, such as .txt) with a corresponding application (such as a text editor).

The example below illustrates the usage of the function.
shell.exec(filename)

Example:

df = data.frame(Symbols = c("ABAN", "BPCL", "IOC"), Price = c(212, 579, 538))
write.csv(df, "Stocks List.csv")
shell.exec("Stocks List.csv")

Quick format of the Excel sheet for column width

We can format Excel sheets for column width using the commands given below, which use functions from the xlsx package. In the example, loadWorkbook loads the Excel workbook specified by the file name; the autoSizeColumn calls adjust the width of the columns specified in “colIndex” for each of the worksheets; and saveWorkbook saves the workbook again after making the necessary formatting changes.

Example:

library(xlsx)
wb = loadWorkbook(file_name)
sheets = getSheets(wb)
autoSizeColumn(sheets[[1]], colIndex = 1:7)
autoSizeColumn(sheets[[2]], colIndex = 1:5)
saveWorkbook(wb, file_name)

Functions Demystified

choose function

The choose function computes the combination nCr. The syntax for the function is given as:

choose(n,r)

where,
n is the number of elements
r is the number of subset elements

nCr = n!/(r! * (n-r)!)

Examples:

choose(5, 2)

[1] 10

choose(2, 1)

[1] 2
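As a quick cross-check (a minimal sketch using only base R), the factorial formula above gives the same results as choose:

```r
# verify choose() against the formula nCr = n! / (r! * (n - r)!)
nCr <- function(n, r) factorial(n) / (factorial(r) * factorial(n - r))

stopifnot(nCr(5, 2) == choose(5, 2))  # both equal 10
stopifnot(nCr(2, 1) == choose(2, 1))  # both equal 2
```

Note that choose itself is the safer option for large n, since factorial overflows quickly.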

sample function

The sample function randomly selects n items from a given vector. The samples are selected without replacement, which means that the function will not select the same item twice. The syntax for the function is given as:

sample(vector, n)

Example: Consider a vector consisting of yearly revenue growth data for a stock. We select 5 years revenue growth at random using the sample function.

Revenue = c(12, 10.5, 11, 9, 10.75, 11.25, 12.1, 10.5, 9.5, 11.45)
sample(Revenue, 5)

[1] 11.45 12.00 9.50 12.10 10.50

Some statistical processes require sampling with replacement; in such cases, you can specify replace = TRUE in the sample function.

Example:

x = c(1, 3, 5, 7)
sample(x, 7, replace = TRUE)

[1] 7 1 5 3 7 3 5

runif and rnorm functions

The runif function generates uniform random numbers between 0 and 1. The argument of the runif function is the number of random values to be generated.

Example:

# This will generate 7 uniform random numbers between 0 and 1.
runif(7)

[1] 0.6989614 0.5750565 0.6918520 0.3442109 0.5469400 0.7955652 0.5258890

# This will generate 5 uniform random numbers between 2 and 4.
runif(5, min = 2, max = 4)

[1] 2.899836 2.418774 2.906082 3.728974 2.720633

The rnorm function generates random numbers from a normal distribution; it stands for the Normal distribution’s random number generator. The syntax for the function is given as:

rnorm(n, mean, sd)

Example:

# generates 6 numbers from a normal distribution with a mean of 3
# and a standard deviation of 0.25
rnorm(6, 3, 0.25)

[1] 3.588193 3.095924 3.240684 3.061176 2.905392 2.891183

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will cover more useful methods and R functions for our readers.



### (Linear Algebra) Do not scale your matrix

Sat, 06/03/2017 - 02:00

(This article was first published on Florian Privé, and kindly contributed to R-bloggers)

In this post, I will show you that you generally don’t need to explicitly scale a matrix. Maybe you wanted to know more about WHY matrices should be scaled when doing linear algebra. I will recap that at the beginning, but the rest will focus on HOW to not explicitly scale matrices. We will apply our findings to the computation of Principal Component Analysis (PCA), and then to Pearson correlation at the end.

WHY scaling matrices?

Generally, if you don’t center the columns of a matrix before PCA, you end up with the loadings of PC1 being the column means, which is not of much interest.

n <- 100; m <- 10
a <- matrix(0, n, m); a[] <- rnorm(length(a))
a <- sweep(a, 2, 1:m, '+')
colMeans(a)
## [1] 0.9969826 1.9922219 3.0894905 3.9680835 4.9433477 6.0007352
## [7] 6.9524794 8.0898546 8.9506657 10.0354630
pca <- prcomp(a, center = FALSE)
cor(pca$rotation[, 1], colMeans(a))
## [1] 0.9999991

Now, say you have centered the columns of your matrix: do you also need to scale them? That is to say, do they need to have the same norm or standard deviation? PCA consists in finding an orthogonal basis that maximizes the variation in the data you analyze. So, if there is a column with much more variation than the others, it will probably end up being PC1, which is not of much interest.

n <- 100; m <- 10
a <- matrix(0, n, m); a[] <- rnorm(length(a))
a[, 1] <- 100 * a[, 1]
apply(a, 2, sd)
## [1] 90.8562836 0.8949268 0.9514208 1.0437561 0.9995103 1.0475651
## [7] 0.9531466 0.9651383 0.9804789 0.9605121
pca <- prcomp(a, center = TRUE)
pca$rotation[, 1]
## [1] 9.999939e-01 6.982439e-04 -5.037773e-04 -9.553513e-05 8.937869e-04
## [6] -1.893732e-03 -3.053232e-04 1.811950e-03 -1.658014e-03 9.937442e-04

Hope I convinced you WHY it is important to scale matrix columns before doing PCA. I will now show you HOW not to do it explicitly.

Reimplementation of PCA

In this part, I will show you the basic code if you want to reimplement PCA yourself. It will serve as a basis to show you how to do linear algebra on a scaled matrix, without explicitly scaling the matrix.

# True one
n <- 100; m <- 10
a <- matrix(0, n, m); a[] <- rnorm(length(a))
pca <- prcomp(a, center = TRUE, scale. = TRUE)

# DIY
a.scaled <- scale(a, center = TRUE, scale = TRUE)
K <- crossprod(a.scaled)
K.eigs <- eigen(K, symmetric = TRUE)
v <- K.eigs$vectors
PCs <- a.scaled %*% v

# Verif, recall that PCs can be opposites between runs
plot(v, pca$rotation)

all.equal(sqrt(K.eigs$values), sqrt(n - 1) * pca$sdev)
## [1] TRUE
plot(PCs, pca$x)

Linear algebra behind the previous implementation

Suppose $$m < n$$ ($$m$$ is the number of columns and $$n$$ is the number of rows). Let us denote $$\tilde{X}$$ the scaled matrix. A partial singular value decomposition of $$\tilde{X}$$ is $$\tilde{X} \approx U \Delta V^T$$ where $$U$$ is an $$n \times K$$ matrix such that $$U^T U = I_K$$, $$\Delta$$ is a $$K \times K$$ diagonal matrix and $$V$$ is an $$m \times K$$ matrix such that $$V^T V = I_K$$. Taking $$K = m$$, you end up with $$\tilde{X} = U \Delta V^T$$. $$U \Delta$$ are the scores (PCs) of the PCA and $$V$$ are the loadings (rotation coefficients).

$$K = \tilde{X}^T \tilde{X} = (U \Delta V^T)^T \cdot U \Delta V^T = V \Delta U^T U \Delta V^T = V \Delta^2 V^T$$. So, when doing the eigen decomposition of $$K$$, you get $$V$$ and $$\Delta^2$$ because $$K V = V \Delta^2$$. For getting the scores, you then compute $$\tilde{X} V = U \Delta$$. These are exactly the steps implemented above.

Implicit scaling of the matrix

Do you know the matrix formulation of column scaling? $$\tilde{X} = C_n X S$$ where $$C_n = I_n - \frac{1}{n} 1_n 1_n^T$$ is the centering matrix and $$S$$ is an $$m \times m$$ diagonal matrix with the scaling coefficients (typically, $$S_{j,j} = 1 / \text{sd}_j$$).

# Let's verify
sds <- apply(a, 2, sd)
a.scaled2 <- (diag(n) - tcrossprod(rep(1, n)) / n) %*% a %*% diag(1 / sds)
all.equal(a.scaled2, a.scaled, check.attributes = FALSE)
## [1] TRUE

In our previous implementation, we computed $$\tilde{X}^T \tilde{X}$$ and $$\tilde{X} V$$. We are going to compute them again, without explicitly scaling the matrix.

Product

Let us begin with something easy: $$\tilde{X} V = C_n X S V = C_n (X (S V))$$. So, you can compute $$\tilde{X} V$$ without explicitly scaling $$X$$.
Let us verify:

SV <- v / sds
XSV <- a %*% SV
CXSV <- sweep(XSV, 2, colMeans(XSV), '-')
all.equal(CXSV, PCs)
## [1] TRUE

Self cross-product

A little more tricky: $$\tilde{X}^T \tilde{X} = (C_n X S)^T \cdot C_n X S = S^T X^T C_n X S$$ ($$C_n^2 = C_n$$ is intuitive because centering an already centered matrix doesn’t change it). $$\tilde{X}^T \tilde{X} = S^T X^T (I_n - \frac{1}{n} 1_n 1_n^T) X S = S^T (X^T X - X^T (\frac{1}{n} 1_n 1_n^T) X) S = S^T (X^T X - \frac{1}{n} s_X s_X^T) S$$ where $$s_X$$ is the vector of column sums of $$X$$.

Let us verify with a rough implementation:

sx <- colSums(a)
K2 <- (crossprod(a) - tcrossprod(sx) / n) / tcrossprod(sds)
all.equal(K, K2)
## [1] TRUE

Conclusion

We have recalled some steps of the computation of Principal Component Analysis on a scaled matrix. We have seen how to compute the different steps of the implementation without having to explicitly scale the matrix. This “implicit” scaling can be quite useful if you manipulate very large matrices, because you are neither copying the matrix nor making useless computations. In my next post, I will present my new package that uses this trick to make a lightning-fast partial SVD on very large matrices.

Appendix: application to Pearson correlation

Pearson correlation is merely a self cross-product on a centered and normalized (columns with unit norm) matrix. Let us just implement that with our new trick.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix& correlize(NumericMatrix& mat,
                         const NumericVector& shift,
                         const NumericVector& scale) {
  int n = mat.nrow();
  int i, j;
  for (j = 0; j < n; j++) {
    for (i = 0; i < n; i++) {
      // corresponds to "- \frac{1}{n} s_X * s_X^T"
      mat(i, j) -= shift(i) * shift(j);
      // corresponds to "S^T (...) S"
      mat(i, j) /= scale(i) * scale(j);
    }
  }
  return mat;
}

cor3 <- function(mat) {
  sums <- colSums(mat) / sqrt(nrow(mat))
  corr <- crossprod(mat)
  diags <- sqrt(diag(corr) - sums^2)
  correlize(corr, shift = sums, scale = diags)
}

a <- matrix(0, 1000, 1000); a[] <- rnorm(length(a))
all.equal(cor3(a), cor(a))
## [1] TRUE

library(microbenchmark)
microbenchmark(
  cor3(a),
  cor(a),
  times = 20
)
## Unit: milliseconds
##    expr       min        lq      mean    median        uq       max neval
## cor3(a)  39.38138  39.67276  40.68596  40.09893  40.65623  46.46785    20
##  cor(a) 635.74350 637.33605 639.34810 638.09980 639.36110 651.61876    20

To leave a comment for the author, please follow the link and comment on their blog: Florian Privé.

### Hacking the principles of #openscience #workshops

Fri, 06/02/2017 - 20:47

(This article was first published on R – christopher lortie, and kindly contributed to R-bloggers)

In a previous post, I discussed the key elements that really stood out for me in recent workshops associated with open science, data science, and ecology.
Summer workshop season is upon us, and here are some principles to consider that can be used to hack a workshop. These hacks can be applied a priori as an instructor, or in situ as a participant or instructor, by engaging with the context from a pragmatic, problem-solving perspective.

Principles

1. Embrace open pedagogy.
2. Use current best practices from traditional teaching contexts.
3. Be learner centered.
4. Speak less, do more.
5. Solve authentic challenges.

Hacks (for each principle)

1. Prepare learning outcomes for every lesson.
2. Identify solve-a-problem opportunities in advance and be open to ones that emerge organically during the workshop.
3. Use no slide decks. This challenges the instructor to engage more directly with the students and participants in the workshop, and leaves space for students to shape content and narrative to some extent. Decks lock all of us in. They are appropriate for some contexts, such as conference presentations, but workshops can be more fluid and open.
4. Plan pauses. Prepare your lessons with gaps for contributions. Prepare a list of questions to offer up for every lesson, and provide time for discussion of solutions.
5. Use real evidence/data to answer a compelling question (scale can be limited, the approach can be beta as long as an answer is provided, and the challenge can emerge if teaching is open and space is provided for the workshop participants to ideate).

A final hack, which is a more general teaching principle: consider keeping all teaching materials within a single ecosystem that references outwards only as needed. For me, this has become all content prepared in RStudio, knitted to HTML, then pushed to GitHub gh-pages for sharing as a webpage (or site). Participants can then engage with all the ideas and content, including code and data, in one place.
To leave a comment for the author, please follow the link and comment on their blog: R – christopher lortie.

### Teach kids about R with Minecraft

Fri, 06/02/2017 - 17:19

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

As I mentioned earlier this week, I was on a team at the rOpenSci Unconference (with Brooke Anderson, Karl Broman, Gergely Daróczi, and my Microsoft colleagues Mario Inchiosa and Ali Zaidi) to work on a project to interface the R language with Minecraft. The resulting R package, miner, is now available to install from GitHub. The goal of the package is to introduce budding programmers to the R language via their interest in Minecraft, and to that end there's also a book (R Programming with Minecraft) and an associated R package (craft) under development to provide lots of fun examples of manipulating the Minecraft world with R.

Create objects in Minecraft with R functions

If you're a parent you're probably already aware of the Minecraft phenomenon, but if not: it's kinda like the Lego of the digital generation.
Kids (and kids-at-heart) enter a virtual 3-D world composed of cubes representing ground, water, trees and other materials, and use those cubes to build their own structures, which they can then explore with their friends using their in-game avatars. Inspired by the Python-focused book "Learn to Program with Minecraft", Karl had the brilliant idea of building a similar interface around R. The miner package provides just a few simple functions to manipulate the game world: find or move a player's position; add or remove blocks in the world; send a message to all players in the world. The functions are deliberately simple, designed to be used to build more complex tasks. For example, you could write a function to detect when a player is standing in a hole. It would use the getPlayerIds and getPlayerPos functions to find the IDs and locations of all players in the world, and the getBlocks function to check whether the player is surrounded by blocks that are not code 0 (air). The package also provides a couple of "listener" functions: you can detect player chat messages, or when players strike a block with their sword. You can then use these to write functions that react to player actions. For example, you can write a chat-bot to play a game with the player, create an AI to solve an in-game maze, or give the player the powers of Elsa from Frozen to freeze water by walking on it. To get started with the miner package, you'll need to purchase a copy of Minecraft for Windows, Mac or Linux if you don't have one already. (Note: the Windows 10 version from the Microsoft Store isn't compatible with the miner package.) This is what you'll use to explore the world managed by the Minecraft server, which you'll also need to install, along with the RaspberryJuice plugin. You can find setup details in the miner package vignette. We installed the open-source Spigot server on an Ubuntu VM running in Azure; you might find this Dockerfile helpful if you're trying something similar.
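The hole-detection idea could be sketched roughly as follows. This is an illustrative sketch, not code from the miner package itself: it uses the function names mentioned above (getPlayerIds, getPlayerPos, getBlocks), but the exact argument order of getBlocks and the shape of the returned values are assumptions.

```r
# Sketch of the "player standing in a hole" check described above.
# Assumes: getPlayerPos(id) returns c(x, y, z); getBlocks takes the
# corners of a box of coordinates; block type 0 means "air".
library(miner)

standing_in_hole <- function() {
  ids <- getPlayerIds()
  vapply(ids, function(id) {
    p <- getPlayerPos(id)  # assumed to return c(x, y, z)
    # Blocks in a 3x1x3 layer around the player's feet
    around <- getBlocks(p[1] - 1, p[1] + 1,
                        p[2],     p[2],
                        p[3] - 1, p[3] + 1)
    # "In a hole" if no surrounding block is air (block type 0)
    all(around != 0)
  }, logical(1))
}
```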
The miner and craft packages are available to install from GitHub at the links below. The packages are a work in progress: comments, suggestions, pull requests and additional examples for the R Programming with Minecraft book are most welcome!

GitHub (rOpenSci Labs): miner and craft

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

### Weather forecast with regression models – part 1

Fri, 06/02/2017 - 14:21

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

In this tutorial we are going to analyse a weather dataset to produce exploratory analysis and forecast reports based on regression models. We are going to take advantage of a public dataset which is part of the exercise datasets of the "Data Mining and Business Analytics with R" book (Wiley) written by Johannes Ledolter. In doing that, we will use several R packages, the caret package included. The tutorial is going to be split into the following four parts:

1. Introducing the weather dataset and outlining its exploratory analysis
2.
Building logistic regression models for 9am, 3pm and late evening weather forecasts
3. Tuning the previously built models to improve accuracy, and showing ROC plots
4. Making considerations on "at-least-moderate" rainfall scenarios and building additional models to predict further weather variables

R Packages

Overall, we are going to take advantage of the following packages:

```r
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(gmodels))
suppressPackageStartupMessages(library(lattice))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(Kmisc))
suppressPackageStartupMessages(library(ROCR))
suppressPackageStartupMessages(library(corrplot))
```

Exploratory Analysis

The dataset under analysis is publicly available at: jledolter – datamining – data exercises. It contains daily observations from a single weather station. Let us start by loading the data and taking a look at it.
```r
set.seed(1023)
weather_data <- read.csv(url("https://www.biz.uiowa.edu/faculty/jledolter/datamining/weather.csv"),
                         header = TRUE, sep = ",", stringsAsFactors = TRUE)
kable(head(weather_data))
```

|Date |Location | MinTemp| MaxTemp| Rainfall| Evaporation| Sunshine|WindGustDir | WindGustSpeed|WindDir9am |WindDir3pm | WindSpeed9am| WindSpeed3pm| Humidity9am| Humidity3pm| Pressure9am| Pressure3pm| Cloud9am| Cloud3pm| Temp9am| Temp3pm|RainToday | RISK_MM|RainTomorrow |
|:---------|:--------|-------:|-------:|--------:|-----------:|--------:|:-----------|-------------:|:----------|:----------|------------:|------------:|-----------:|-----------:|-----------:|-----------:|--------:|--------:|-------:|-------:|:---------|-------:|:------------|
|11/1/2007 |Canberra | 8.0| 24.3| 0.0| 3.4| 6.3|NW | 30|SW |NW | 6| 20| 68| 29| 1019.7| 1015.0| 7| 7| 14.4| 23.6|No | 3.6|Yes |
|11/2/2007 |Canberra | 14.0| 26.9| 3.6| 4.4| 9.7|ENE | 39|E |W | 4| 17| 80| 36| 1012.4| 1008.4| 5| 3| 17.5| 25.7|Yes | 3.6|Yes |
|11/3/2007 |Canberra | 13.7| 23.4| 3.6| 5.8| 3.3|NW | 85|N |NNE | 6| 6| 82| 69| 1009.5| 1007.2| 8| 7| 15.4| 20.2|Yes | 39.8|Yes |
|11/4/2007 |Canberra | 13.3| 15.5| 39.8| 7.2| 9.1|NW | 54|WNW |W | 30| 24| 62| 56| 1005.5| 1007.0| 2| 7| 13.5| 14.1|Yes | 2.8|Yes |
|11/5/2007 |Canberra | 7.6| 16.1| 2.8| 5.6| 10.6|SSE | 50|SSE |ESE | 20| 28| 68| 49| 1018.3| 1018.5| 7| 7| 11.1| 15.4|Yes | 0.0|No |
|11/6/2007 |Canberra | 6.2| 16.9| 0.0| 5.8| 8.2|SE | 44|SE |E | 20| 24| 70| 57| 1023.8| 1021.7| 7| 5| 10.9| 14.8|No | 0.2|No |

Metrics at a specific Date and Location are given together with the RainTomorrow variable, which indicates whether rain occurred on the next day.
```r
colnames(weather_data)
 [1] "Date"          "Location"      "MinTemp"       "MaxTemp"       "Rainfall"
 [6] "Evaporation"   "Sunshine"      "WindGustDir"   "WindGustSpeed" "WindDir9am"
[11] "WindDir3pm"    "WindSpeed9am"  "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"
[16] "Pressure9am"   "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"
[21] "Temp3pm"       "RainToday"     "RISK_MM"       "RainTomorrow"
```

The description of the variables is the following:

- Date: The date of observation (a date object)
- Location: The common name of the location of the weather station
- MinTemp: The minimum temperature in degrees centigrade
- MaxTemp: The maximum temperature in degrees centigrade
- Rainfall: The amount of rainfall recorded for the day in millimeters
- Evaporation: Class A pan evaporation (in millimeters) during 24 h
- Sunshine: The number of hours of bright sunshine in the day
- WindGustDir: The direction of the strongest wind gust in the 24 h to midnight
- WindGustSpeed: The speed (in kilometers per hour) of the strongest wind gust in the 24 h to midnight
- WindDir9am: The direction of the wind gust at 9 a.m.
- WindDir3pm: The direction of the wind gust at 3 p.m.
- WindSpeed9am: Wind speed (in kilometers per hour) averaged over 10 min before 9 a.m.
- WindSpeed3pm: Wind speed (in kilometers per hour) averaged over 10 min before 3 p.m.
- Humidity9am: Relative humidity (in percent) at 9 a.m.
- Humidity3pm: Relative humidity (in percent) at 3 p.m.
- Pressure9am: Atmospheric pressure (hPa) reduced to mean sea level at 9 a.m.
- Pressure3pm: Atmospheric pressure (hPa) reduced to mean sea level at 3 p.m.
- Cloud9am: Fraction of sky obscured by cloud at 9 a.m. This is measured in "oktas", a unit of eighths: it records how many eighths of the sky are obscured by cloud. A measure of 0 indicates a completely clear sky, while 8 indicates it is completely overcast
- Cloud3pm: Fraction of sky obscured by cloud at 3 p.m.; see Cloud9am for a description of the values
- Temp9am: Temperature (degrees C) at 9 a.m.
- Temp3pm: Temperature (degrees C) at 3 p.m.
- RainToday: Integer 1 if precipitation (in millimeters) in the 24 h to 9 a.m. exceeds 1 mm, otherwise 0
- RISK_MM: The continuous target variable; the amount of rain recorded during the next day
- RainTomorrow: The binary target variable indicating whether or not it rains during the next day

Looking then at the data structure, we discover the dataset includes a mix of numerical and categorical variables.

```r
str(weather_data)
'data.frame':	366 obs. of  24 variables:
 $ Date         : Factor w/ 366 levels "1/1/2008","1/10/2008",..: 63 74 85 87 88 89 90 91 92 64 ...
 $ Location     : Factor w/ 1 level "Canberra": 1 1 1 1 1 1 1 1 1 1 ...
 $ MinTemp      : num  8 14 13.7 13.3 7.6 6.2 6.1 8.3 8.8 8.4 ...
 $ MaxTemp      : num  24.3 26.9 23.4 15.5 16.1 16.9 18.2 17 19.5 22.8 ...
 $ Rainfall     : num  0 3.6 3.6 39.8 2.8 0 0.2 0 0 16.2 ...
 $ Evaporation  : num  3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6 4 5.4 ...
 $ Sunshine     : num  6.3 9.7 3.3 9.1 10.6 8.2 8.4 4.6 4.1 7.7 ...
 $ WindGustDir  : Factor w/ 16 levels "E","ENE","ESE",..: 8 2 8 8 11 10 10 1 9 1 ...
 $ WindGustSpeed: int  30 39 85 54 50 44 43 41 48 31 ...
 $ WindDir9am   : Factor w/ 16 levels "E","ENE","ESE",..: 13 1 4 15 11 10 10 10 1 9 ...
 $ WindDir3pm   : Factor w/ 16 levels "E","ENE","ESE",..: 8 14 6 14 3 1 3 1 2 3 ...
 $ WindSpeed9am : int  6 4 6 30 20 20 19 11 19 7 ...
 $ WindSpeed3pm : int  20 17 6 24 28 24 26 24 17 6 ...
 $ Humidity9am  : int  68 80 82 62 68 70 63 65 70 82 ...
 $ Humidity3pm  : int  29 36 69 56 49 57 47 57 48 32 ...
 $ Pressure9am  : num  1020 1012 1010 1006 1018 ...
 $ Pressure3pm  : num  1015 1008 1007 1007 1018 ...
 $ Cloud9am     : int  7 5 8 2 7 7 4 6 7 7 ...
 $ Cloud3pm     : int  7 3 7 7 7 5 6 7 7 1 ...
 $ Temp9am      : num  14.4 17.5 15.4 13.5 11.1 10.9 12.4 12.1 14.1 13.3 ...
 $ Temp3pm      : num  23.6 25.7 20.2 14.1 15.4 14.8 17.3 15.5 18.9 21.7 ...
 $ RainToday    : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 1 1 1 1 2 ...
 $ RISK_MM      : num  3.6 3.6 39.8 2.8 0 0.2 0 0 16.2 0 ...
 $ RainTomorrow : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 1 1 1 2 1 ...
```
We have 366 records available:

```r
(n <- nrow(weather_data))
[1] 366
```

which span the following timeline:

```r
c(as.character(weather_data$Date[1]), as.character(weather_data$Date[n]))
[1] "11/1/2007"  "10/31/2008"
```

We further notice that RISK_MM relates to the RainTomorrow variable as follows:

```r
all.equal(weather_data$RISK_MM > 1, weather_data$RainTomorrow == "Yes")
[1] TRUE
```

The Rainfall and RainToday variables are equivalent according to the following relationship (as anticipated by the Rainfall description):

```r
all.equal(weather_data$Rainfall > 1, weather_data$RainToday == "Yes")
[1] TRUE
```

To make it more challenging, we decide to take RISK_MM, Rainfall and RainToday out, and determine a new dataset as depicted here:

```r
weather_data2 <- subset(weather_data, select = -c(Date, Location, RISK_MM, Rainfall, RainToday))
colnames(weather_data2)
 [1] "MinTemp"       "MaxTemp"       "Evaporation"   "Sunshine"      "WindGustDir"   "WindGustSpeed"
 [7] "WindDir9am"    "WindDir3pm"    "WindSpeed9am"  "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"
[13] "Pressure9am"   "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"       "Temp3pm"
[19] "RainTomorrow"
(cols_withNa <- apply(weather_data2, 2, function(x) sum(is.na(x))))
      MinTemp       MaxTemp   Evaporation      Sunshine   WindGustDir WindGustSpeed    WindDir9am
            0             0             0             3             3             2            31
   WindDir3pm  WindSpeed9am  WindSpeed3pm   Humidity9am   Humidity3pm   Pressure9am   Pressure3pm
            1             7             0             0             0             0             0
     Cloud9am      Cloud3pm       Temp9am       Temp3pm  RainTomorrow
            0             0             0             0             0
```

Look at the NA counts associated with WindDir9am. If WindDir9am were not a significant predictor for RainTomorrow, we could take that column out and increase the complete-cases count. We determine whether that is possible in the categorical variable analysis. For now, we consider only records without NA values.
```r
weather_data3 <- weather_data2[complete.cases(weather_data2),]
```

Categorical Variable Analysis

```r
factor_vars <- names(which(sapply(weather_data3, class) == "factor"))
factor_vars <- setdiff(factor_vars, "RainTomorrow")
chisq_test_res <- lapply(factor_vars, function(x) {
  chisq.test(weather_data3[,x], weather_data3[, "RainTomorrow"],
             simulate.p.value = TRUE)
})
names(chisq_test_res) <- factor_vars
chisq_test_res
$WindGustDir

	Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)

data:  weather_data3[, x] and weather_data3[, "RainTomorrow"]
X-squared = 35.833, df = NA, p-value = 0.001999

$WindDir9am

	Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)

data:  weather_data3[, x] and weather_data3[, "RainTomorrow"]
X-squared = 23.813, df = NA, p-value = 0.06547

$WindDir3pm

	Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)

data:  weather_data3[, x] and weather_data3[, "RainTomorrow"]
X-squared = 9.403, df = NA, p-value = 0.8636
```

The Chi-squared p-values shown above tell us that RainTomorrow depends on WindGustDir: at the 5% level we reject the null hypothesis that RainTomorrow is independent of WindGustDir (p ≈ 0.002). For WindDir9am (p ≈ 0.065) and WindDir3pm (p ≈ 0.86), on the other hand, we cannot reject the null hypothesis of independence. We therefore expect to take advantage of WindGustDir as a predictor, and not to consider WindDir9am and WindDir3pm for that purpose.
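This decision can also be made mechanically by pulling the simulated p-values out of the test objects. The following sketch assumes the chisq_test_res list built in the previous chunk; the 0.05 threshold is the conventional choice, not something fixed by the tutorial.

```r
# Extract the p-value from each htest object in chisq_test_res and keep
# only the factors significant at the conventional 5% level.
p_values <- sapply(chisq_test_res, function(res) res$p.value)
p_values
#  WindGustDir   WindDir9am   WindDir3pm
#     0.001999      0.06547       0.8636
names(p_values)[p_values < 0.05]
# [1] "WindGustDir"
```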

It is also possible to obtain a visual understanding of the significance of these categorical variables by means of barcharts that take the cross-table row proportions as input.

```r
barchart_res <- lapply(factor_vars, function(x) {
  title <- colnames(weather_data3[,x, drop=FALSE])
  wgd <- CrossTable(weather_data3[,x], weather_data3$RainTomorrow, prop.chisq = F)
  barchart(wgd$prop.row, stack = F,
           auto.key = list(rectangles = TRUE, space = "top", title = title))
})
names(barchart_res) <- factor_vars
barchart_res$WindGustDir
```

Barchart of WindGustDir

```r
barchart_res$WindDir9am
```

Barchart of WindDir9am

```r
barchart_res$WindDir3pm
```

Barchart of WindDir3pm

Since WindDir9am is not a candidate predictor and has more than 30 NA values, we decide to take it out; the same goes for WindDir3pm. As a consequence, we increase the number of NA-free records from 328 to 352.

```r
weather_data4 <- subset(weather_data2, select = -c(WindDir9am, WindDir3pm))
weather_data5 <- weather_data4[complete.cases(weather_data4),]
colnames(weather_data5)
 [1] "MinTemp"       "MaxTemp"       "Evaporation"   "Sunshine"      "WindGustDir"
 [6] "WindGustSpeed" "WindSpeed9am"  "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"
[11] "Pressure9am"   "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"
[16] "Temp3pm"       "RainTomorrow"
```

So we end up with a dataset made up of 16 potential predictors, one of which is categorical (WindGustDir) while 15 are quantitative.

Quantitative Variable Analysis

In this section, we carry on the exploratory analysis of the quantitative variables, starting with a visualization of the correlation relationships among them.

```r
factor_vars <- names(which(sapply(weather_data5, class) == "factor"))
numeric_vars <- setdiff(colnames(weather_data5), factor_vars)
numeric_vars <- setdiff(numeric_vars, "RainTomorrow")
numeric_vars
numeric_vars_mat <- as.matrix(weather_data5[, numeric_vars, drop=FALSE])
numeric_vars_cor <- cor(numeric_vars_mat)
corrplot(numeric_vars_cor)
```

By taking a look at the correlation plot above, we can state that:

- Temp9am is strongly positively correlated with MinTemp
- Temp9am is strongly positively correlated with MaxTemp
- Temp9am is strongly positively correlated with Temp3pm
- Temp3pm is strongly positively correlated with MaxTemp
- Pressure9am is strongly positively correlated with Pressure3pm
- Humidity3pm is strongly negatively correlated with Sunshine
- Cloud9am is strongly negatively correlated with Sunshine
- Cloud3pm is strongly negatively correlated with Sunshine

The pairs plot shows whether any relationship among the variables, such as a linear relationship, is in place.
```r
pairs(weather_data5[,numeric_vars], col=weather_data5$RainTomorrow)
```

Visual analysis reveals a linear relationship among the following variable pairs:

- MinTemp and MaxTemp
- MinTemp and Evaporation
- MaxTemp and Evaporation
- Temp9am and MinTemp
- Temp3pm and MaxTemp
- Pressure9am and Pressure3pm
- Humidity9am and Humidity3pm
- WindSpeed9am and WindGustSpeed
- WindSpeed3pm and WindGustSpeed

Boxplots and density plots give a visual understanding of the significance of a predictor: we look at how much the predictor's value distributions overlap across the levels of the variable to be predicted (RainTomorrow).

```r
gp <- invisible(lapply(numeric_vars, function(x) {
  ggplot(data=weather_data5, aes(x=RainTomorrow, y=eval(parse(text=x)), col=RainTomorrow)) +
    geom_boxplot() + xlab("RainTomorrow") + ylab(x) + ggtitle("") +
    theme(legend.position="none")
}))
grob_plots <- invisible(lapply(chunk(1, length(gp), 4), function(x) {
  marrangeGrob(grobs=lapply(gp[x], ggplotGrob), nrow=2, ncol=2)
}))
grob_plots
```

```r
gp <- invisible(lapply(numeric_vars, function(x) {
  ggplot(data=weather_data5, aes(x=eval(parse(text=x)), col=RainTomorrow)) +
    geom_density() + xlab(x) + ggtitle(paste(x, "density", sep=" "))
}))
grob_plots <- invisible(lapply(chunk(1, length(gp), 4), function(x) {
  marrangeGrob(grobs=lapply(gp[x], ggplotGrob), nrow=2, ncol=2)
}))
grob_plots
```

From all those plots, we can state that Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm and Sunshine are predictors to be considered.

Ultimately, we save the dataset currently under analysis for the continuation of this tutorial.

```r
write.csv(weather_data5, file="weather_data5.csv", row.names=FALSE)
```

(write.csv always uses a comma as separator, so there is no need to pass sep.)

Conclusions

Exploratory analysis has highlighted interesting relationships across variables. These preliminary results give us hints for predictor selection and will be taken into account as this analysis continues.

If you have any questions, please feel free to comment below.


To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.


### The code (and other stuff…)

Fri, 06/02/2017 - 12:56

I’ve received a couple of emails and comments on one of the General Election posts asking me to share the code I’ve used.

In general, I think this is a bit dirty and lots could be done in a more efficient way; effectively, I’m doing this out of my own curiosity, and while I think the model is sensible, it’s probably not “publication-standard” (in terms of annotation etc).

Anyway, I’ve created a (rather plain) GitHub repository, which contains the basic files (including the R script, R functions, basic data and JAGS model). Given time (which I’m not given…), I’d like to add a lot more description and perhaps also write a Stan version of the model code. I could also write a more precise model description; I’ll try to update the material on GitHub.

On another note, the previous posts have been syndicated in a couple of places (here and here), which was nice. And finally, here’s a little update with the latest data. As of today, the model predicts the following seats distribution.

```
                mean        sd 2.5% median 97.5%
Conservative 352.124 3.8760350  345    352   359
Labour       216.615 3.8041091  211    217   224
UKIP           0.000 0.0000000    0      0     0
Lib Dem       12.084 1.8752228    8     12    16
SNP           49.844 1.8240041   45     51    52
Green          0.000 0.0000000    0      0     0
PCY            1.333 0.9513233    0      2     3
Other          0.000 0.0000000    0      0     0
```

Labour are still slowly but surely gaining some ground; I’m not sure the effect of the debate earlier this week (which was deserted by the PM) is visible yet, as only a couple of the polls included were conducted after it.

Another interesting thing (following up on this post) is the analysis of the marginal seats that the model predicts to swing from the 2015 Winners. I’ve updated the plot, which now looks as below.

Now there are 30 constituencies that are predicted to change hands, many still towards the Tories. I am not a political scientist, so I don’t really know all the ins and outs of these, but I think a couple of examples are quite interesting and I would venture some comment…

So, the model doesn’t know about the recent by-elections in Copeland and Stoke-on-Trent South and so still labels these seats as “Labour” (as they were in 2015), although the Tories have actually now taken control of Copeland.

But given the polls and the impact of the EU referendum (both were strong Leave areas, with 60% and 70% of the preference, respectively), and given that the Tories did well in 2015 (36% vs Labour’s 42% in Copeland, and 33% vs Labour’s 39% in Stoke), the model is suggesting that both are likely to switch to the Tories this time around.

In fact, we know that at the time of the by-elections, while Copeland (where the contest was mostly Labour v Tories) did go blue, Stoke didn’t. But there, the main battle was between the Labour and UKIP candidates (UKIP had got 21% in 2015). And the by-election was fought last February, when the Tories’ lead was much more robust than it probably is now.

Another interesting area is Twickenham, historically a constituency leaning to the Lib Dems, which was captured by the Conservatives in 2015. But since then, in another by-election, the Tories have lost another similar area (Richmond Park, with a massive swing) and the model is suggesting that Twickenham could follow suit come next Thursday.

Finally, Clacton was the only seat won by UKIP in 2015, but since then the elected MP (a former Tory turned UKIP) has left the party and is not contesting the seat. This, combined with UKIP’s poor standing in the polls, produces the not surprising outcome that Clacton is predicted to go blue with basically no uncertainty…

These results look reasonable to me, though of course I’m not sure how life will turn out. As many commentators have noted, much may depend on turnout among younger voters. Or other factors. And probably there’ll be another instance of the “Shy Tory effect” (I’ll think about this if I get some time before the final prediction). But the model does seem to make some sense…
