R-bloggers
R news and tutorials contributed by hundreds of R bloggers

solrium 1.0: Working with Solr from R

Wed, 11/08/2017 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Nearly 4 years ago I wrote on this blog about an R package solr for working with the database Solr. Since then we’ve created a refresh of that package in the solrium package. Since solrium first hit CRAN about two years ago, users have raised a number of issues that required breaking changes. Thus, this blog post is about a major version bump in solrium.

What is Solr?

Solr is a “search platform” – a NoSQL database – where data is organized into so-called documents, which are xml/json/etc blobs of text. Documents are nested within either collections or cores (depending on the mode you start Solr in). Solr makes it easy to search for documents, with a huge variety of parameters, and a number of different data formats (json/xml/csv). Solr is similar to Elasticsearch (see our Elasticsearch client elastic) – and was around before it. Solr, in my opinion, is harder to set up than Elasticsearch, but I don’t claim to be an expert on either.

Vignettes

Notable features
  • Added in v1, you can now work with many connection objects to different Solr instances.
  • Methods for the major search functionalities: search, highlight, stats, mlt, group, and facet. In addition, a catch-all function (solr_all) combines all of those.
  • Comprehensive coverage of the Solr HTTP API
  • Can coerce data from Solr API into data.frame’s when possible
Setup

Install solrium

install.packages("solrium")

Or get the development version:

devtools::install_github("ropensci/solrium")
library(solrium)

Initialize a client

A big change in v1 of solrium is solr_connect has been replaced by SolrClient. Now you create an R6 connection object with SolrClient, then you can call methods on that R6 object, OR you can pass the connection object to functions.

By default, SolrClient$new() sets connections details for a Solr instance that’s running on localhost, and on port 8983.

(conn <- SolrClient$new())
#> 
#> host: 127.0.0.1
#> path: 
#> port: 8983
#> scheme: http
#> errors: simple
#> proxy:

On instantiation, it does not check that the Solr instance is up, but merely sets connection details. You can check if the instance is up by doing for example (assuming you have a collection named gettingstarted):

conn$ping("gettingstarted")
#> $responseHeader
#> $responseHeader$zkConnected
#> [1] TRUE
#> 
#> $responseHeader$status
#> [1] 0
#> 
#> $responseHeader$QTime
#> [1] 163
#> 
#> $responseHeader$params
#> $responseHeader$params$q
#> [1] "{!lucene}*:*"
#> 
#> $responseHeader$params$distrib
#> [1] "false"
#> 
#> $responseHeader$params$df
#> [1] "_text_"
#> 
#> $responseHeader$params$rows
#> [1] "10"
#> 
#> $responseHeader$params$wt
#> [1] "json"
#> 
#> $responseHeader$params$echoParams
#> [1] "all"
#> 
#> 
#> 
#> $status
#> [1] "OK"

A good hint when connecting to a publicly exposed Solr instance is that you likely don’t need to specify a port, so a pattern like this should work to connect to a URL like http://foobar.com/search:

SolrClient$new(host = "foobar.com", path = "search", port = NULL)

If the instance uses SSL, simply specify that like:

SolrClient$new(host = "foobar.com", path = "search", port = NULL, scheme = "https")

Query and body parameters

Another big change in the package is that we wanted to make it easy to determine whether your Solr query gets passed as query parameters in a GET request or as body in a POST request. Solr clients in some other languages do this, and it made sense to port over that idea here. Now you pass your key-value pairs to either params or body. If nothing is passed to body, we do a GET request. If something is passed to body we do a POST request, even if there’s also key-value pairs passed to params.

This change does break the interface we had in the old version, but we think it’s worth it.

For example, to do a search you have to pass the collection name and a list of named parameters:

conn$search(name = "gettingstarted", params = list(q = "*:*"))
#> # A tibble: 5 x 5
#>      id   title title_str  `_version_` price
#> 
#> 1    10 adfadsf   adfadsf 1.582913e+18    NA
#> 2    12  though    though 1.582913e+18    NA
#> 3    14 animals   animals 1.582913e+18    NA
#> 4     1                   1.582913e+18   100
#> 5     2                   1.582913e+18   500

You can instead pass the connection object to solr_search:

solr_search(conn, name = "gettingstarted", params = list(q = "*:*"))
#> # A tibble: 5 x 5
#>      id   title title_str  `_version_` price
#> 
#> 1    10 adfadsf   adfadsf 1.582913e+18    NA
#> 2    12  though    though 1.582913e+18    NA
#> 3    14 animals   animals 1.582913e+18    NA
#> 4     1                   1.582913e+18   100
#> 5     2                   1.582913e+18   500
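Forcing a POST instead is just a matter of supplying body rather than params. Below is a minimal sketch, not taken from the post itself: the key names "query" and "limit" follow Solr's JSON Request API and are an assumption, so check ?solr_search for what your solrium/Solr versions expect.

# Hedged sketch: anything passed to body makes solrium issue a POST request;
# the key names ("query", "limit") are assumptions based on Solr's JSON Request API
conn$search(name = "gettingstarted", body = list(query = "*:*", limit = 5))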

And the same pattern applies for the other functions:

  • solr_facet
  • solr_group
  • solr_mlt
  • solr_highlight
  • solr_stats
  • solr_all

New functions for atomic updates

A user requested the ability to do atomic updates – partial updates to documents without having to re-index the entire document.

Two functions were added: update_atomic_json and update_atomic_xml for JSON and XML based updates. Check out their help pages for usage.
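As a rough sketch only (the help pages remain the authoritative reference), an atomic update that sets a single field on an existing document might look something like this; the document id, field name, and argument order are assumptions, while the {"set": ...} wrapper is standard Solr atomic-update JSON:

# Hypothetical sketch -- see ?update_atomic_json for the real interface
body <- '[{"id": "10", "price": {"set": 29.99}}]'
update_atomic_json(conn, body, name = "gettingstarted")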

Search results as attributes

solr_search and solr_all in v1 gain attributes that include numFound, start, and maxScore. That is, you can get to these three values after data is returned. Note that some Solr instances may not return all three values.

For example, let’s use the Public Library of Science Solr search instance at http://api.plos.org/search:

plos <- SolrClient$new(host = "api.plos.org", path = "search", port = NULL)

Search

res <- plos$search(params = list(q = "*:*"))

Get attributes

attr(res, "numFound")
#> [1] 1902279
attr(res, "start")
#> [1] 0
attr(res, "maxScore")
#> [1] 1

Automatically adjust rows parameter

A user highlighted that there’s a performance penalty when asking for too many rows. As a result, some search functions in solrium now automatically adjust the rows parameter to avoid that penalty.
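rows is still an ordinary Solr parameter, though, so you can also cap it yourself rather than relying on the automatic adjustment. A small illustrative example, reusing the PLOS connection from above (the query term is arbitrary):

# Ask only for the rows you actually need
solr_search(plos, params = list(q = "ecology", rows = 5))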

Usage in other packages

I maintain 4 other packages that use solrium: rplos, ritis, rdatacite, and rdryad. If you are interested in using solrium in your package, looking at any of those four packages will give a good sense of how to do it.

Notes solr pkg

The solr package will soon be archived on CRAN. We’ve moved all packages depending on it to solrium. Let me know ASAP if you have any complaints about archiving it on CRAN.

Feedback!

Please do upgrade/install solrium v1 and let us know what you think.


In case you missed it: October 2017 roundup

Wed, 11/08/2017 - 00:04

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from October of particular interest to R users.

A recent survey of competitors on the Kaggle platform reveals that Python (76%) and R (59%) are the preferred tools for building predictive models.

Microsoft's "Team Data Science Process" has been updated with new guidelines on use of the IDEAR framework for R and Python.

Microsoft R Open 3.4.2 is now available for Windows, Mac and Linux.

Using the foreach package to estimate bias of rpart trees via bootstrapping.

Replays of webinars on the Azure Data Science VM, and on document collection analysis with Azure ML Workbench, are now available.

The "officer" package makes it possible to create PowerPoint and Word documents from R, and even include editable R charts.

An online book on statistical machine learning with the MicrosoftML package.

An updated list of major events in the history of the R project, 1992-2016.

An overview of the R manuals, now also available in Bookdown format.

An R-based analysis comparing the speeds of bikes and taxis for trips across New York City.

Vision-based AI techniques used to estimate the population of endangered snow leopards.

ROpenSci interviews David Smith about working in the R community.

A generative neural network, implemented in R, synthesizes startup names and business plans.

Two R-themed crosswords: a cryptic one by Barry Rowlingson, and a standard one from R-Ladies DC.

A tutorial on using Azure Data Lake Analytics with R.

The remarkable growth of R, as seen in StackOverflow traffic data.

Version 1.0.0 of the dplyrXdf package, providing dplyr operations for Microsoft R out-of-memory data files, is now available.

The GPU-enabled Deep Learning Virtual Machine on Azure includes R, Spark, Tensorflow and more.

A comparison of assault death rates in the US and other advanced democracies, generated in R by Kieran Healy.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.


9th MilanoR Meeting: November 20th

Tue, 11/07/2017 - 17:29

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

MilanoR Staff is happy to announce the 9th MilanoR Meeting!

A MilanoR meeting is an occasion to bring together R users in the Milano area to share knowledge and experiences. The meeting is a free event, open to everybody. The meeting consists of some R talks and a free buffet + networking time.

This time we want to focus on a specific topic: data visualization with R. We’ll try to explore the R dataviz world from any corner, checking tools, ideas and applied case studies.

When

Monday, November 20, 2017

from 7:00 to 9:30 pm

Agenda

Intro: Dataviz with R: past, present and future 

by MilanoR

Talk: Automagically build the perfect palette for your plot with paletteR 

by  Andrea Cirillo

Lightning Talks:

  • From modelling to visualization by Stefano Biffani
  • Mapping data with R by Stefano Barberis
  • Interactive plots with Highchart by Nicola Pesenti
  • Shiny in a Data Science app lifecycle by Moreno Riccardi

 

Where

Mikamai
Via Venini 42, Milano

 

Buffet

Our sponsors will provide a free buffet!

 

 

Registration

MilanoR Meeting is a free event, open to all R users and enthusiasts or those who wish to learn more about R.

Places are limited so if you would like to attend the meeting, you must register here!

FAQ

Uhm, this MilanoR meeting looks different from the others… how does it work?

You’re right, this time we decided to focus on a specific topic and just explore it from several angles. So instead of having two long talks, we moved to many, but shorter, ones.

Ideally, the meeting can be divided into three half-hour blocks: from about 19:00 to 19:30 we will have welcome presentations and a first introduction to the R dataviz world.

From about 19:30 to 20:00 Andrea Cirillo will introduce the package paletteR, going deeper into a new tool for dataviz (and showing how everyone can extend the R capabilities in every field).

From 20:00 to 20:30 we’ll see a roundup of examples that complete our exploration of the R data visualization world: from modeling to interactive dashboard, we’ll see how R is used in the “real” world.

Then, as always, free food and chat <3

  So, who are the speakers?

Mariachiara Fortuna is a freelance statistician who works with Quantide and takes care of MilanoR

Andrea Cirillo is a Quant Analyst at Intesa Sanpaolo currently employing R to explore and assess advanced credit risk models.

Stefano Biffani is a biostatistician/geneticist using R in animal science, working with AIA, Associazione Italiana Allevatori

Stefano Barberis is a PhD student in Statistics at the University of Milano-Bicocca. His research focuses on the representation and mapping of geospatial data

Nicola Pesenti works as a data analyst at Istituto Mangiagalli, and previously at SSE Airtricity, Dublin

Moreno Riccardi is Deliverability Specialist & Data Analyst at MailUp

  And the sponsors?

Quantide is a provider of consulting services and training courses about Data Science and Big Data. It is specialized in R, the open source software for statistical computing. Headquartered in Legnano, near Milan (Italy), Quantide has been supporting customers from several industries around the world for 10 years. Quantide has been the founder and supporter of the MilanoR meeting since its first edition in 2012.

Open Search Network is an innovative headhunting company, based in London and focused on Italian data and digital experts (for opportunities in Italy and abroad). OSN is specialised in opportunities for Data Science, Data Engineering and Data Strategy professionals, from middle management to executive level. OSN is the first headhunting company to specialise in the Italian data sector and to organise online competitions for recruitment purposes.

 

I have another question

Please contact us at admin[at]milanor[dot]net

 

The post 9th MilanoR Meeting: November 20th appeared first on MilanoR.


Setting up twitter streamR Service on an Ubuntu server

Tue, 11/07/2017 - 16:56

(This article was first published on R – AmitKohli.com, and kindly contributed to R-bloggers)

I am working on a super-secret project for which I am harvesting a highly confidential source of data: twitter. The idea is to gather a small amount of twitter data, but for a long time… maybe a year. I tried to use the package twitteR, but it can only grab up to a week of tweets… it’s not really good for a set-it-and-forget-it ongoing capture since it requires user-based authentication, which means (I guess) that a machine can’t authenticate for it. Tangibly this means a human needs to start the process every time. So I could run the script weekly, but of course there are days you miss, or you run at different times… plus it’s just plain annoying…

does no #rstats package useTwitter oauth 2.0? I can’t believe it… this means that automated tweet-scrapers in R aren’t possible? pic.twitter.com/eq9dnrbB6L

— amit (@VizMonkey) August 9, 2017

And then I remembered about streamR, which allows exactly for this ongoing scraping. This blog post documents my experience setting this up on my server, using a linux service.

Let’s Go!

(Small meta-note: I’m experimenting with a new blogging style: showing more of my errors and my iterative approach to solving problems in order to counter the perception of the perfect analyst… something a bunch of people have been talking about recently. I was exposed to it by   and  during EARL London, and it’s really got me thinking. Anyway, I do realize that it makes for a messier read full of tangents and dead ends. Love it? Hate it? Please let me know what you think in the comments!)

(metanote 2: The linux bash scripts are available in their own github repo)

So if you don’t have a linux server of your own, follow Dean Attali’s excellent guide to set one up on Digital Ocean… it’s cheap and totally worth it. Obviously, you’ll need to install streamR and also ROAuth. I use other packages in the scripts here; up to you to do it exactly how I do it or not. Also… remember that when you install R packages on Ubuntu, you have to do it as the superuser in linux, not from R (otherwise the package won’t be available for any other user, like shiny. If you don’t know what I’m talking about then you didn’t read Dean Attali’s guide like I said above… why are you still here?). Actually, it’s so annoying to have to remember how to correctly install R packages on linux that I created a little utility for it. Save the following into a file called “Rinstaller.sh”:

 

#!/bin/bash
# Ask the user what package to install
echo what package should I grab?
read varname
echo I assume you mean CRAN, but to use github type "g"
read source
if [ "$source" = "g" ]; then
  echo --------------------------------
  echo Installing $varname from GitHub
  sudo su - -c \\"R -e \"devtools::install_github('$varname')\"\\"
else
  echo --------------------------------
  echo Grabbin $varname from CRAN
  sudo su - -c \\"R -e \"install.packages('$varname', repos='http://cran.$
fi

This script will accept an input (the package name) and then ask whether to install from CRAN or from GitHub. For GitHub, obviously, you need to supply the user account and package name. There! Now we don’t need to remember anything anymore! Oh, make sure you chmod 777 Rinstaller.sh (which lets anyone execute the file) and then run it as ./Rinstaller.sh

 

Anyway, I messed around with streamR for a while and figured out how I wanted to structure the files. I think I want 3 files… one to authenticate, one to capture tweets, and the third to do the supersecret analysis. Here they are:

 

Authenticator

## Auth
library(ROAuth)
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "myKey"
consumerSecret <- "mySecret"
my_oauth <- OAuthFactory$new(consumerKey = consumerKey,
                             consumerSecret = consumerSecret,
                             requestURL = requestURL,
                             accessURL = accessURL,
                             authURL = authURL)
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
save(my_oauth, file = "my_oauth.Rdata")

So we use this file to connect to the Twitter service. You will need to set yourself up with an API key… Go do that here and select “Create new app”. It’s fairly painless.

(small caveat: make sure the my_oauth file is saved in the working directory. You can make sure of it by creating a Project for these three files… tho this is a pain… more on this later).

Tweet-Getter

library(streamR)
library(here)

## Get
load("/srv/shiny-server/SecretFolder/my_oauth.Rdata")
filterStream("/srv/shiny-server/SecretFolder/tweets.json",
             track = "SecretTopic",
             oauth = my_oauth)

OK, so we run the authenticator once, then we can run this file, which just goes out and gathers all tweets related to SecretTopic, and saves them to tweets.json. This works with my stream because it’s a relatively small number of tweets… but be careful, if your topic gets tons of hits, the file can grow VERY quickly. You might be interested in splitting up the output into multiple files. Check this to see how.
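If you do want to split the output, one approach (a sketch of my own, not from the original post) is to use filterStream's timeout argument (in seconds) inside a loop, so that each call writes to a fresh, timestamped file; the folder and naming scheme below are made up:

library(streamR)
load("/srv/shiny-server/SecretFolder/my_oauth.Rdata")
repeat {
  # one file per hour of capture; adjust timeout to taste
  outfile <- file.path("/srv/shiny-server/SecretFolder",
                       paste0("tweets_", format(Sys.time(), "%Y%m%d_%H%M%S"), ".json"))
  filterStream(outfile, track = "SecretTopic", timeout = 3600, oauth = my_oauth)
}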

 

On working directories: it’s super annoying, but it’s typically bad practice to specify a direct path to files in your script; instead, it’s encouraged that you use tech to “know your path”… for example, we can use the here package, or use the Project folder. The problem that arises when running files from some kind of automated cron or scheduler is that it doesn’t know how to read .Rproj files, and therefore doesn’t know what folder to use. I asked this question on the RStudio Community site, which has sparked a large discussion… check it out! Anyway, the last script:

 

Tweet-analyzer

## read
library(streamR)
tweets.df <- parseTweets("tweets.json", verbose = FALSE)

## Do the Secret Stuff

 

Ok, so now we can authenticate, gather tweets, and analyze the resulting file!

 

OK cool! So let’s get the TweetGetter running! As long as it’s running, it will be appending tweets to the json file. We could run it on our laptop, but it’ll stop running when we close our laptop, so that’s a perfect candidate to run on a server. If you don’t know how to get your stuff up into a linux server, I recommend saving your work locally, setting up git, git pushing it up to a private github remote (CAREFUL! This will have your Private Keys so make sure you don’t use a public repo) and then git pulling it into your server.

 

OK, set it up to run on the server

(CAVEAT!! I am not a linux expert… far from it! If anyone sees me doing something boneheaded (like the chmod 777 above), please leave a comment.)

The first time you run the script, it will ask you to authenticate… So I recommend running the Authenticator file from RStudio on the server, which will allow you to grab the auth code and paste it into the Rstudio session. Once you’re done, you should be good to capture tweets on that server. The problem is, if you run the TweetGetter in RStudio… when you close that session, it stops the script.

 

Idea 2: Hrm… let’s try in the shell. So SSH into the server (on windows use Putty), go to the Project folder and type:

Rscript TweetGetter.R

 

It runs, but when I close the SSH session it also stops the script :-\ . I guess that instance is tied to the SSH session? I don’t get it… but whatever, fine.

Idea 3: set a cronjob to run it! In case you don’t know, cron jobs are the schedulers on linux. Run crontab -e to edit the jobs, and crontab -l to view what jobs you have scheduled. To understand the syntax of the crontabs, see this.

 

So the idea would be to start the task on a schedule… that way it’s not my session that started it… although of course, if it’s set on a schedule and the schedule dictates it’s time to start up again but the file is already running, I don’t want it to run twice… hrm…

 

Oh I know! I’ll create a small bash file (like a small executable) that CHECKS if the thingie is running, and if it isn’t then run it, if it is, then don’t do anything! This is what I came up with:

if pgrep -x "Rscript" > /dev/null
then
    echo "Running"
else
    echo "Stopped... restarting"
    Rscript "/srv/shiny-server/SecretFolder/newTweetGetter.R"
fi

WARNING! THIS IS WRONG.

What this is saying is “check if ‘Rscript’ is running on the server (I assumed I didn’t have any OTHER running R process at the time, a valid assumption in this case). If it is, then just say ‘Running’; if it’s not, then say ‘Stopped… restarting’ and re-run the file using Rscript.” Then, we can put the file on the cron job to run hourly… so hourly I will check if the job is running or not, and if it’s stopped, restart it. This is what the cron job looks like:

1 * * * * "/srv/shiny-server/SecretFolder/chek.sh"

In other words, run the file chek.sh during minute 1 of every hour, every day of the month, every month of the year, and every day of the week (ie, every hour :))

OK…. Cool! So I’m good, right? Let me check if the json is getting tweets… hrm… no data in the past 10 minutes or so… has nobody tweeted or is it broken? Hrm2… how does one check the cronjob log file? Oh, there is none… but shouldn’t there be? ::google:: I guess there is supposed to be one… ::think:: Oh, it’s because I’m logged in with a user that doesn’t have admin rights, so when it tries to create a log file in a protected folder, it gets rejected… Well fine! I’ll pipe the output of the run to a file in a folder I know I can write to. (Another option is to set up the cron job as the root admin… ie instead of crontab -e you would say sudo crontab -e… but if there’s one thing I know about linux, it’s that I don’t know linux, and therefore I use admin commands as infrequently as I can get away with). So how do I pipe the run output to a location I can see? Well… google says this is one way:

 

40 * * * * "/srv/shiny-server/SecretFolder/chek.sh" >> /home/amit/SecretTweets.log 2>&1

So what this is doing is running the file just as before, but the >> pushes the results to a log file in my home directory. Just a bit of Linux for you… > recreates the piped output every time (ie overwrites), whereas >> appends to what was already there. The 2>&1 part means ‘grab standard output and errors’… if you wanna read more about why, geek out, but I think you’re basically saying “grab any errors and pipe them to standard output and then grab all standard output”.

OK, so after looking at the output, I saw what was happening… during every crontab run, the chek.sh file made it seem like the newTweetGetter.R wasn’t running… so it would restart it, gather 1 tweet and then time out. What strange behaviour! Am I over some Twitter quota? No, it can’t be, it’s a streaming service, twitter will feed me whatever it wants, I’m not requesting any amount… so it can’t be that.

 

This is where I threw my hands up and asked Richard, my local linux expert, for help.

Enter a very useful command: top. This command, and its slightly cooler version htop (which doesn’t come with Ubuntu by default but is easy to install… sudo apt install htop), quickly showed me that when you call an R file via Rscript, it doesn’t launch a process called Rscript, it launches a process called /usr/lib/R/bin/exec/R --slave --no-restore --file=/srv/shiny-server/SecretFolder/newTweetGetter.R. Which explains why chek.sh didn’t think it was running (when it was)… and why the second run got rejected when it tried to connect to the twitter stream (because the first script was already connected). So this is where Richard said “BTW, you should probably set this up as a service…”. And being who I am and Richard who he is, I said: “ok”. (Although I didn’t give up on the cron… see ENDNOTE 1).

After a bit of playing around, we found that the shiny-server linux service was probably easy enough to manipulate and get functional (guidance here and here), so let’s do it!

Setting up the service:
  1. First of all, go to the folder where all the services live.  cd /etc/systemd/system/
  2. Next, copy the shiny service into your new one, called SECRETtweets.service: sudo cp shiny-server.service SECRETtweets.service
  3. Now edit the contents! sudo nano SECRETtweets.service and copy paste the following code:
[Unit]
Description=SECRETTweets

[Service]
Type=simple
User=amit
ExecStart=/usr/bin/Rscript "/srv/shiny-server/SecretFolder/newTweetGetter.R"
Restart=always
WorkingDirectory=/srv/shiny-server/SecretFolder/
Environment="LANG=en_US.UTF-8"

[Install]
WantedBy=multi-user.target

4. restart the daemon that picks up services? Don’t know why… just do it: sudo systemctl daemon-reload

5. Now start the service!! sudo systemctl start SECRETtweets

 

Now your service is running! You can check the status of it using: systemctl status SECRETtweets.service

Where each part does this:

  • Description is what the thingie does
  • Type says how to run it, and “simple” is the default… but check the documentation if you wanna do something more fancy
  • User defines which user runs the service. This is a bit of extra insurance, in case you installed a package as yourself and not as the superuser (which is the correct way)
  • ExecStart is the command to run
  • Restart by setting this to “always”, if the script ever goes down, it’ll automatically restart and start scraping again! Super cool, no? (WARNING: Not sure about whether this can cause trouble… if twitter is for some reason pissed off and doesn’t want to serve tweets to you anymore, not sure if CONSTANTLY restarting this could get you in trouble. If I get banned, I’ll letchu know… stay tuned)
  • WorkingDirectory This part is where the magic happens. Remember earlier on we were worried and worried about HOW to pass the working directory to the R script? This is how!! Now we don’t have to worry about paths on the server anymore!
  • Environment is the language
  • WantedBy I have no idea what this does and don’t care because it works!

So there you go! This is the way to set up a proper service that you can monitor, and treat properly like any formal linux process! Enjoy!

 

ENDNOTE 1

Ok, it’s true… sometimes a service isn’t the right thing to do: if you have a job that runs for a certain amount of time, finishes, and then needs to run again discretely later, you should set it up as a cron job. So for those cases, here’s the correct script to check whether the script is running, even assigning a working directory.

if ps aux | grep "R_file_you_want_to_check.R" | grep -v grep > /dev/null
then
    echo "Running, all good!"
else
    echo "Not running... will restart:"
    cd /path_to_your_working_directory
    Rscript "R_file_you_want_to_check.R"
fi

 

save that as chek.sh and assign it to the cron with the output to your home path, like:

40 * * * * "/srv/shiny-server/SecretFolder/chek.sh" >> /home/amit/SecretTweets.log 2>&1

 


Metropolis-in-Gibbs Sampling and Runtime Analysis with Profviz

Tue, 11/07/2017 - 15:12

(This article was first published on R – Stable Markets, and kindly contributed to R-bloggers)

First off, here are the previous posts in my Bayesian sampling series: “Bayesian Simple Linear Regression with Gibbs Sampling in R” and “Blocked Gibbs Sampling in R for Bayesian Multiple Linear Regression”. In the first post, I illustrated Gibbs Sampling – an algorithm for getting draws from a posterior when conditional posteriors are known. In the … Continue reading Metropolis-in-Gibbs Sampling and Runtime Analysis with Profviz →


simmer 3.6.4

Tue, 11/07/2017 - 14:08

(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

The fourth update of the 3.6.x release of simmer, the Discrete-Event Simulator for R, is on CRAN. This release patches several bugs regarding resource priority and preemption management when seized amounts greater than one were involved. Check the examples available in the corresponding issues on GitHub (#114, #115, #116) to know if you are affected.

It can be noted that we already broke the intended bi-monthly release cycle, but it is for a good reason, since we are preparing a journal publication. Further details to come.

Minor changes and fixes:
  • Fix preemption in non-saturated multi-server resources when seizing amounts > 1 (#114).
  • Fix queue priority in non-saturated finite-queue resources when seizing amounts > 1 (#115).
  • Fix resource seizing: avoid jumping the queue when there is room in the server but other arrivals are waiting (#116).

Article originally published in Enchufa2.es: simmer 3.6.4.


Drawing 10 Million Points With ggplot: Clifford Attractors

Tue, 11/07/2017 - 08:15

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

For me, mathematics cultivates a perpetual state of wonder about the nature of mind, the limits of thoughts, and our place in this vast cosmos (Clifford A. Pickover – The Math Book: From Pythagoras to the 57th Dimension, 250 Milestones in the History of Mathematics)

I am a big fan of Clifford Pickover and I find inspiration in his books very often. Thanks to him, I discovered the harmonograph and Parrondo’s paradox, among many other mathematical treasures. Apart from being a great teacher, he also invented a family of strange attractors bearing his name. Clifford attractors are defined by these equations:
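Written out explicitly, the update rule (matching the Rcpp implementation further below) is:

x_{n+1} = sin(a*y_n) + c*cos(a*x_n)
y_{n+1} = sin(b*x_n) + d*cos(b*y_n)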

There are infinite attractors, since a, b, c and d are parameters. Given four values (one for each parameter) and a starting point (x0, y0), the previous equations define the exact location of the point at step n, which depends only on its location at step n-1; an attractor can be thought of as the trajectory described by a particle. This plot shows the evolution of a particle starting at (x0, y0)=(0, 0) with parameters a=-1.24458046630025, b=-1.25191834103316, c=-1.81590817030519 and d=-1.90866735205054 over 10 million steps:

Changing parameters is really entertaining. Drawings have a sandy appearance:

From a technical point of view, the challenge is creating a data frame with all locations, since it must have 10 million rows and must be populated sequentially. A very fast way to do it is using the Rcpp package. To render the plot I use ggplot, which works quite well. Here you have the code to play with Clifford attractors if you want:

library(Rcpp)
library(ggplot2)
library(dplyr)

opt = theme(legend.position  = "none",
            panel.background = element_rect(fill="white"),
            axis.ticks       = element_blank(),
            panel.grid       = element_blank(),
            axis.title       = element_blank(),
            axis.text        = element_blank())

cppFunction('DataFrame createTrajectory(int n, double x0, double y0,
                                        double a, double b, double c, double d) {
  // create the columns
  NumericVector x(n);
  NumericVector y(n);
  x[0]=x0;
  y[0]=y0;
  for(int i = 1; i < n; ++i) {
    x[i] = sin(a*y[i-1])+c*cos(a*x[i-1]);
    y[i] = sin(b*x[i-1])+d*cos(b*y[i-1]);
  }
  // return a new data frame
  return DataFrame::create(_["x"]= x, _["y"]= y);
}
')

a=-1.24458046630025
b=-1.25191834103316
c=-1.81590817030519
d=-1.90866735205054

df=createTrajectory(10000000, 0, 0, a, b, c, d)

png("Clifford.png", units="px", width=1600, height=1600, res=300)
ggplot(df, aes(x, y)) + geom_point(color="black", shape=46, alpha=.01) + opt
dev.off()


Plots of Back-Calculated Lengths-At-Age I

Tue, 11/07/2017 - 07:00

(This article was first published on fishR Blog, and kindly contributed to R-bloggers)

Last spring, I posted about my version of a modified age-bias plot. One reader commented on that post via Twitter – “Now that you solved the age-bias plot, how about the ‘best’ display of back-calculated length-at-age data, with VonB growth curve overlay?”. In addition, I recently received a question related to the non-convergence of a hierarchical (mixed) model applied to fitting a von Bertalanffy growth function (VBGF) to back-calculated lengths at age. In exploring that question, I realized that a “good” plot of back-calculated lengths at age was needed to understand why the VBGF may (or may not) fit.

Here I write about my initial attempts to visualize back-calculated lengths at age with what are basically spaghetti plots. Spaghetti plots show individual longitudinal traces for each subject (e.g., one example). Recently “spaghetti plots” were in the news to show modeled paths of hurricanes (e.g., I particularly enjoyed this critique).

This post contains my initial thoughts. Please feel free to send me comments with your ideas or suggestions for improvements.

Data Explanation

In this post, I examine back-calculated lengths (mm) at age for Walleye (Sander vitreus) captured from Lake Mille Lacs, Minnesota in late fall (September-October). [More details are here.] These data were kindly provided by the Minnesota Department of Natural Resources, are available in the FSAdata package, and were used extensively in the “Growth Estimation: Growth Models and Statistical Inference” chapter of the forthcoming “Age and Growth of Fishes: Principles and Techniques” book to be published by the American Fisheries Society. For simplicity of presentation here, these data were reduced to a single year and sex, and several superfluous variables were removed. A “snapshot” of the data file is below.

          ID Est.Age  TL BC.Age BC.Len
2002.10416.F       1 232      1  81.21
2002.10493.F       1 248      1 114.60
2002.10606.F       1 287      1 112.21
 2002.1153.F       1 273      1 116.29
 2002.1154.F       1 244      1 157.55
2002.20742.F      10 592      6 523.52
2002.20742.F      10 592      7 539.77
2002.20742.F      10 592      8 554.83
2002.20742.F      10 592      9 567.56
2002.20742.F      10 592     10 576.20

These fish were captured in late fall such that the observed length includes current year’s growth. However, the observed age does not account for time since the fish’s “birthday.” In other words, the observed age at capture should be a “fractional age” such that it represents completed years of growth plus the fraction of the current year’s growth season completed (i.e., the “current age” should be something like 10.9 rather than 10). An example of this is seen by comparing the observed length at capture (in TL) and the back-calculated length (in BC.Len) to age-1 for the first fish in the data.frame (first line in data shown above).

Some of the plots below require a data.frame where the length and age for the oldest age match in time. In other words, this data.frame should contain the length of the fish on the fish’s last “birthday.” With these data, that length is the back-calculated length at the age (in BC.Age) that matches the age of the fish at the time of capture (in Est.Age). With other data, that length may simply be the length of the fish at the time of capture. An example of this data.frame is below (especially compare the last five lines below to the last five lines in the previous data.frame snippet above).

          ID Est.Age  TL BC.Age BC.Len
2002.10416.F       1 232      1  81.21
2002.10493.F       1 248      1 114.60
2002.10606.F       1 287      1 112.21
 2002.1153.F       1 273      1 116.29
 2002.1154.F       1 244      1 157.55
2002.20381.F      10 568     10 563.73
2002.20483.F      10 594     10 580.81
2002.20511.F      10 628     10 618.89
2002.20688.F      10 620     10 611.08
2002.20742.F      10 592     10 576.20

Finally, in some of the plots below I include the mean back-calculated length at age. An example of this data.frame is below.

fEst.Age BC.Age mnBC.Len
       1      1 111.3464
       2      1 113.4633
       2      2 225.3060
       3      1  98.5338
       3      2 211.4096
      10      6 505.5437
      10      7 538.0147
      10      8 558.6757
      10      9 571.4900
      10     10 579.8577

Plots for Exploratory Data Analysis

When modeling fish growth, I explore the data to make observations about (i) variability in length at each age and (ii) the “shape” of growth (i.e., whether or not there is evidence for a horizontal asymptote or an inflection point). When using repeated-measures data, for example from back-calculated lengths at age, I observe the “shape” of growth for each individual and (iii) identify how the back-calculated lengths at age from older fish compare to the back-calculated lengths at age from younger fish (as major differences could suggest “Lee’s Phenomenon”, substantial changes in growth between year-classes or over time, or problems with the back-calculation model). In this section, I describe two plots (with some augmentations to the first type) that could be useful during this exploratory stage. In the last section, I describe a plot that could be used for publication.

Figure 1 shows longitudinal traces of back-calculated lengths at age for each fish, with separate colors for fish with different observed ages at capture. From this I see variability of approximately 100 mm at each age, individual fish that generally follow the typical shape of a VBGF, and some evidence that back-calculated lengths at earlier ages from “older” fish at capture are somewhat lower than the back-calculated lengths at earlier ages for “younger” fish at capture (this is most evident with the pinkish lines).

Figure 1: Traces of back-calculated lengths at age for each fish. Traces with the same color are fish with the same observed age at capture.

Figure 2 is the same as Figure 1 except that heavy lines have been added for the mean back-calculated lengths at age for fish from each age at capture (Figure 2). Here the evidence that back-calculated lengths at earlier ages from “older” fish at capture are somewhat lower than the back-calculated lengths at earlier ages for “younger” fish at capture is a little more obvious.

Figure 2: Traces of back-calculated lengths at age for each fish with mean back-calculated lengths at age shown by the heavier lines. Traces with the same color are fish with the same observed age at capture.

Figure 3 is the same as Figure 1 but also has points for the length and age of each fish at the last completed year of growth. These points are nearest to the observed lengths and ages at capture (and will be the observed lengths and ages at capture for datasets where the fish were captured prior to when the current season’s growth had commenced) and, thus, most nearly represent the data that would be used to fit a growth model if back-calculations had not been made. With this I observe that most traces of back-calculated lengths at age pass near these points, which suggests that “growth” has not changed dramatically over the time represented in these data and that the model used to back-calculate lengths and ages is not dramatically incorrect.

Figure 3: Traces of back-calculated lengths at age for each fish. Traces with the same color are fish with the same observed age at capture.

The previous spaghetti plots are cluttered because of the number of individual fish. This clutter can be somewhat reduced by creating separate spaghetti plots for each observed age at capture (Figure 4). From this, I observe the clear start of an asymptote at about age 5, an indication of a slight inflection around age 2 (most evident for fish that were older at capture), and that a good portion of the variability in length at early ages may be attributable to fish from different year-classes (i.e., of different observed ages at capture). It is, however, more difficult to see that back-calculated lengths at earlier ages from “older” fish at capture are somewhat lower than the back-calculated lengths at earlier ages for “younger” fish at capture. [Note that I left the facet for age-1 fish in this plot to remind me that there were age-1 fish in these data, even though they do not show a trace. Also, the color here is superfluous and could be removed. I left the color here for comparison with previous figures.]

Figure 4: Traces of back-calculated lengths at age for each fish separated by observed age at capture. Black lines in each facet are the mean back-calculated lengths at age for fish shown in that facet.

Publication Graphic with Model Overlaid

For publication I would include traces for individual fish, but without color-coding by estimated age at capture, and overlay the population-average growth model (i.e., the growth model expressed using the “fixed effects” for each model parameter; Figure 5).

Figure 5: Traces of back-calculated lengths at age for each fish (lighter black lines) with the population-averaged von Bertalanffy growth function (dark black line) overlaid. The equation for the best-fit von Bertalanffy growth function is shown.

R Code

## Required packages
library(FSA)       # for filterD(), headtail()
library(FSAdata)   # for WalleyeML
library(dplyr)     # for %>%, select(), arrange(), unique(), group_by, et al.
library(magrittr)  # for %<>%
library(ggplot2)   # for all of the plotting functions
library(nlme)      # for nlme()

## Custom ggplot2 theme ... the black-and-white theme with slightly
## lighter and dashed grid lines and slightly lighter facet labels.
theme_BC <- function() {
  theme_bw() %+replace%
    theme(
      panel.grid.major=element_line(color="gray95",linetype="dashed"),
      panel.grid.minor=element_line(color="gray95",linetype="dashed"),
      strip.background=element_rect(fill="gray95")
    )
}

## Base ggplot
## Uses main data.frame, BC.Age on x- and BC.Len on y-axis
## group=ID to connect lengths at age for each fish
## Applied custom them, labelled axis, controlled ticks on x-axis
## Removed legends ("guides") related to colors and sizes
p <- ggplot(data=df,aes(x=BC.Age,y=BC.Len,group=ID)) +
  theme_BC() +
  scale_x_continuous("Back-Calculated Age",breaks=seq(0,10,2)) +
  scale_y_continuous("Back-Calculated Length (mm)") +
  guides(color=FALSE,size=FALSE)

## Main spaghetti plot
p + geom_line(aes(color=fEst.Age),alpha=1/8)

## Spaghetti plot augmented with mean back-calcd lengths at age
p + geom_line(aes(color=fEst.Age),alpha=1/8) +
  geom_line(data=df3,aes(x=BC.Age,y=mnBC.Len,group=fEst.Age,color=fEst.Age),size=1)

## Spaghetti plot augmented with last full length at age points
p + geom_line(aes(color=fEst.Age),alpha=1/8) +
  geom_point(data=df2,aes(color=fEst.Age),alpha=1/5,size=0.9)

## Make facet labels for the plot below
lbls <- paste("Age =",levels(df$fEst.Age))
names(lbls) <- levels(df$fEst.Age)

## Spaghetti plot separated by age at capture (with means)
p + geom_line(aes(color=fEst.Age),alpha=1/5) +
  facet_wrap(~fEst.Age,labeller=labeller(fEst.Age=lbls)) +
  geom_line(data=df3,aes(x=BC.Age,y=mnBC.Len,group=NULL),color="black")

## von B equation for the plot
( tmp <- fixef(fitVB) )
lbl <- paste("L==",round(exp(tmp[1]),0),
             "*bgroup('(',1-e^-",round(tmp[2],3),
             "(Age-",round(tmp[3],2),"),')')")

## Base ggplot
## Uses main data.frame, BC.Age on x- and BC.Len on y-axis
## group=ID to connect lengths at age for each fish
## Applied custom them, labelled axis, controlled ticks on x-axis
## Removed legends ("guides") related to colors and sizes
p <- ggplot(data=df,aes(x=BC.Age,y=BC.Len,group=ID)) +
  theme_BC() +
  scale_x_continuous("Back-Calculated Age",breaks=seq(0,10,2)) +
  scale_y_continuous("Back-Calculated Length (mm)") +
  guides(color=FALSE,size=FALSE)

## Add transparent traces for each individual fish
## Add a trace for the overall von B model (in fixed effects)
## Add a label for the overall von B model
p + geom_line(aes(),alpha=1/15) +
  stat_function(data=data.frame(T=seq(1,10,0.1)),aes(x=T,y=NULL,group=NULL),
                fun=vbT,args=list(logLinf=fixef(fitVB)),size=1.1) +
  geom_text(data=data.frame(x=7,y=120),aes(x=x,y=y,group=NULL,
                                           label=lbl),parse=TRUE,size=4)
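The code above refers to two objects that are not defined in this excerpt: the fitted hierarchical VBGF model fitVB and the prediction function vbT. For orientation only, here is a hypothetical sketch of how they might be created with nlme; the log-Linf parameterization, parameter names, and starting values are my assumptions, not the author's actual code.

## Hypothetical sketch (not from the original post): objects assumed by the code above
library(nlme)

## von Bertalanffy prediction function matching the stat_function() call above,
## which passes all three fixed effects in as a single vector named 'logLinf'
vbT <- function(T,logLinf) exp(logLinf[1])*(1-exp(-logLinf[2]*(T-logLinf[3])))

## Mixed-effects VBGF fit to the back-calculated lengths at age, with
## fish-specific (ID) random effects; starting values are rough guesses
fitVB <- nlme(BC.Len~exp(logLinf)*(1-exp(-K*(BC.Age-t0))),
              data=df,
              fixed=logLinf+K+t0~1,
              random=logLinf+K+t0~1|ID,
              start=c(logLinf=log(600),K=0.3,t0=0))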


Practical Machine Learning with R and Python – Part 5

Tue, 11/07/2017 - 02:11

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

This is the 5th and probably penultimate part of my series on ‘Practical Machine Learning with R and Python’. The earlier parts of this series included

1. Practical Machine Learning with R and Python – Part 1 In this initial post, I touch upon univariate, multivariate, polynomial regression and KNN regression in R and Python
2. Practical Machine Learning with R and Python – Part 2 In this post, I discuss Logistic Regression, KNN classification and cross-validation error for both LOOCV and K-Fold in both R and Python
3. Practical Machine Learning with R and Python – Part 3 This post covered ‘feature selection’ in Machine Learning. Specifically I touch on best fit, forward fit, backward fit, ridge (L2 regularization) & lasso (L1 regularization). The post includes equivalent code in R and Python.
4. Practical Machine Learning with R and Python – Part 4 In this part I discussed SVMs, Decision Trees, validation, precision-recall, and ROC curves

This post ‘Practical Machine Learning with R and Python – Part 5’ discusses regression with B-splines, natural splines, smoothing splines, generalized additive models (GAMS), bagging, random forest and boosting

As with my previous posts in this series, this post is largely based on the following 2 MOOC courses

1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Stanford Online
2. Applied Machine Learning in Python, Prof Kevyn Collins-Thompson, University of Michigan, Coursera

You can download this R Markdown file and associated data files from Github at MachineLearning-RandPython-Part5

For this part I have used the data sets from the UCI Machine Learning repository (Communities and Crime, and Auto MPG).

1. Splines

When performing regression (continuous or logistic) between a target variable and a feature (or a set of features), a single polynomial for the entire range of the data set usually does not provide a good fit. Rather, we need to fit separate regression curves to different sections of the data set.

There are several techniques which do this, for example piecewise-constant functions, piecewise-linear functions, piecewise-quadratic/cubic/4th-order polynomial functions, etc. One such set of functions is the cubic splines, which fit cubic polynomials to successive sections of the dataset. The points where the cubic splines join are called ‘knots’.

Since each section has a different cubic spline, there could be discontinuities (or breaks) at these knots. To prevent these discontinuities, ‘natural splines’ and ‘smoothing splines’ ensure that the separate cubic functions have 2nd order continuity at these knots with the adjacent splines. 2nd order continuity implies that the value, 1st order derivative and 2nd order derivative at these knots are equal.

A cubic spline with knots ξ_k, k = 1, 2, 3, …, K is a piece-wise cubic polynomial with continuous derivatives up to order 2 at each knot. We can write it as

y = β_0 + β_1*b_1(x) + β_2*b_2(x) + … + β_{K+3}*b_{K+3}(x) + ε

For each x, the b_k(x) are called ‘basis’ functions, where b_1(x) = x, b_2(x) = x^2, b_3(x) = x^3, and b_{3+k}(x) = (x − ξ_k)^3_+ for k = 1, 2, 3, …, K, where (x − ξ_k)^3_+ equals (x − ξ_k)^3 when x > ξ_k and 0 otherwise. The 1st and 2nd derivatives of cubic splines are continuous at the knots. Hence splines provide a smooth continuous fit to the data by fitting different splines to different sections of the data.

1.1a Fit a 4th degree polynomial – R code

In the code below a non-linear function (a 4th order polynomial) is used to fit the data. Usually when we fit a single polynomial to the entire data set the tails of the fit tend to vary a lot particularly if there are fewer points at the ends. Splines help in reducing this variation at the extremities

library(dplyr)
library(ggplot2)
source('RFunctions-1.R')

# Read the data
df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))

#Select specific columns
df2 <- df1 %>% dplyr::select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
auto <- df2[complete.cases(df2),]

# Fit a 4th degree polynomial
fit=lm(mpg~poly(horsepower,4),data=auto)

#Display a summary of fit
summary(fit)
##
## Call:
## lm(formula = mpg ~ poly(horsepower, 4), data = auto)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.8820  -2.5802  -0.1682   2.2100  16.1434
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)            23.4459     0.2209 106.161   <2e-16 ***
## poly(horsepower, 4)1 -120.1377     4.3727 -27.475   <2e-16 ***
## poly(horsepower, 4)2   44.0895     4.3727  10.083   <2e-16 ***
## poly(horsepower, 4)3   -3.9488     4.3727  -0.903    0.367
## poly(horsepower, 4)4   -5.1878     4.3727  -1.186    0.236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.373 on 387 degrees of freedom
## Multiple R-squared: 0.6893, Adjusted R-squared: 0.6861
## F-statistic: 214.7 on 4 and 387 DF, p-value: < 2.2e-16

#Get the range of horsepower
hp <- range(auto$horsepower)

#Create a sequence to be used for plotting
hpGrid <- seq(hp[1],hp[2],by=10)

#Predict for these values of horsepower. Set Standard error as TRUE
pred=predict(fit,newdata=list(horsepower=hpGrid),se=TRUE)

#Compute bands on either side that is 2xSE
seBands=cbind(pred$fit+2*pred$se.fit,pred$fit-2*pred$se.fit)

#Plot the fit with Standard Error bands
plot(auto$horsepower,auto$mpg,xlim=hp,cex=.5,col="black",xlab="Horsepower",
     ylab="MPG", main="Polynomial of degree 4")
lines(hpGrid,pred$fit,lwd=2,col="blue")
matlines(hpGrid,seBands,lwd=2,col="blue",lty=3)

1.1b Fit a 4th degree polynomial – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

#Read the auto data
autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")

# Select columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]

# Convert all columns to numeric
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')

#Drop NAs
autoDF3=autoDF2.dropna()
autoDF3.shape

X=autoDF3[['horsepower']]
y=autoDF3['mpg']

#Create a polynomial of degree 4
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)

# Fit a polynomial regression line
linreg = LinearRegression().fit(X_poly, y)

# Create a range of values
hpGrid = np.arange(np.min(X),np.max(X),10)
hp=hpGrid.reshape(-1,1)

# Transform to 4th degree
poly = PolynomialFeatures(degree=4)
hp_poly = poly.fit_transform(hp)

#Create a scatter plot
plt.scatter(X,y)

# Fit the prediction
ypred=linreg.predict(hp_poly)
plt.title("Poylnomial of degree 4")
fig2=plt.xlabel("Horsepower")
fig2=plt.ylabel("MPG")

# Draw the regression curve
plt.plot(hp,ypred,c="red")
plt.savefig('fig1.png', bbox_inches='tight')

1.1c Fit a B-Spline – R Code

In the code below a B-spline is fit to the data. The B-spline requires the manual selection of knots

#Splines
library(splines)

# Fit a B-spline to the data. Select knots at 60,75,100,150
fit=lm(mpg~bs(horsepower,df=6,knots=c(60,75,100,150)),data=auto)

# Use the fitted regresion to predict
pred=predict(fit,newdata=list(horsepower=hpGrid),se=T)

# Create a scatter plot
plot(auto$horsepower,auto$mpg,xlim=hp,cex=.5,col="black",xlab="Horsepower",
     ylab="MPG", main="B-Spline with 4 knots")

#Draw lines with 2 Standard Errors on either side
lines(hpGrid,pred$fit,lwd=2)
lines(hpGrid,pred$fit+2*pred$se,lty="dashed")
lines(hpGrid,pred$fit-2*pred$se,lty="dashed")
abline(v=c(60,75,100,150),lty=2,col="darkgreen")

1.1d Fit a Natural Spline – R Code

Here a ‘Natural Spline’ is used to fit the data. The natural spline is constrained beyond the boundary knots, so the ends of the function wag much less than those of a regular spline or a global polynomial. Natural splines do not require the explicit selection of knots.

# There is no need to select the knots here. The smoothness can be specified # by the degrees of freedom 'df' parameter # The natural spline is fit below fit2=lm(mpg~ns(horsepower,df=4),data=auto) pred=predict(fit2,newdata=list(horsepower=hpGrid),se=T) plot(auto$horsepower,auto$mpg,xlim=hp,cex=.5,col="black",xlab="Horsepower", ylab="MPG", main="Natural Splines") lines(hpGrid,pred$fit,lwd=2) lines(hpGrid,pred$fit+2*pred$se,lty="dashed") lines(hpGrid,pred$fit-2*pred$se,lty="dashed")

1.1e Fit a Smoothing Spline – R code

Here a smoothing spline is used. Smoothing splines also do not require the explicit setting of knots. We can change the ‘degrees of freedom (df)’ parameter to get the best fit.

# Smoothing spline has a smoothing parameter, the degrees of freedom # This is too wiggly plot(auto$horsepower,auto$mpg,xlim=hp,cex=.5,col="black",xlab="Horsepower", ylab="MPG", main="Smoothing Splines") # Here df is set to 16. This has a lot of variance fit=smooth.spline(auto$horsepower,auto$mpg,df=16) lines(fit,col="red",lwd=2) # We can use Cross Validation to allow the spline to pick the value of this smoothing parameter. We do not need to set the degrees of freedom 'df' fit=smooth.spline(auto$horsepower,auto$mpg,cv=TRUE) lines(fit,col="blue",lwd=2)

1.1e Splines – Python

There isn’t as much treatment of splines in Python and sklearn. I did find the LSQUnivariateSpline and UnivariateSpline classes in scipy. The LSQUnivariateSpline requires the explicit setting of knots.

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from scipy.interpolate import LSQUnivariateSpline autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1") autoDF.shape autoDF.columns autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']] autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce') auto=autoDF2.dropna() auto=auto[['horsepower','mpg']].sort_values('horsepower') # Set the knots manually knots=[65,75,100,150] # Create an array for X & y X=np.array(auto['horsepower']) y=np.array(auto['mpg']) # Fit an LSQUnivariateSpline s = LSQUnivariateSpline(X,y,knots) #Plot the spline xs = np.linspace(40,230,1000) ys = s(xs) plt.scatter(X, y) plt.plot(xs, ys) plt.savefig('fig2.png', bbox_inches='tight')

1.2 Generalized Additive Models (GAMs)

Generalized Additive Models (GAMs) are a really powerful ML tool.

In GAMs we use a different smooth function for each of the variables. GAMs can give a much better fit since we can choose a suitable function for each feature.
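
Written out (in LaTeX-style notation for clarity), the additive model replaces the single linear predictor with a sum of smooth functions, one per feature, where each f_j can be a smoothing spline, a loess fit, a polynomial or any other smooth function:

y = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p) + \varepsilon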

1.2a Generalized Additive Models (GAMs) – R Code

The plots below show the smoothing spline that is fit for each of the features horsepower, cylinder, displacement, year and acceleration. We can use any smooth function, for example loess, a 4th degree polynomial, etc.

library(gam) # Fit a smoothing spline for horsepower, cylinder, displacement, year and acceleration gam=gam(mpg~s(horsepower,4)+s(cylinder,5)+s(displacement,4)+s(year,4)+s(acceleration,5),data=auto) # Display the summary of the fit. This gives the significance of each of the parameters # Also an ANOVA is given for each of the features summary(gam) ## ## Call: gam(formula = mpg ~ s(horsepower, 4) + s(cylinder, 5) + s(displacement, ## 4) + s(year, 4) + s(acceleration, 5), data = auto) ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -8.3190 -1.4436 -0.0261 1.2279 12.0873 ## ## (Dispersion Parameter for gaussian family taken to be 6.9943) ## ## Null Deviance: 23818.99 on 391 degrees of freedom ## Residual Deviance: 2587.881 on 370 degrees of freedom ## AIC: 1898.282 ## ## Number of Local Scoring Iterations: 3 ## ## Anova for Parametric Effects ## Df Sum Sq Mean Sq F value Pr(>F) ## s(horsepower, 4) 1 15632.8 15632.8 2235.085 < 2.2e-16 *** ## s(cylinder, 5) 1 508.2 508.2 72.666 3.958e-16 *** ## s(displacement, 4) 1 374.3 374.3 53.514 1.606e-12 *** ## s(year, 4) 1 2263.2 2263.2 323.583 < 2.2e-16 *** ## s(acceleration, 5) 1 372.4 372.4 53.246 1.809e-12 *** ## Residuals 370 2587.9 7.0 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Anova for Nonparametric Effects ## Npar Df Npar F Pr(F) ## (Intercept) ## s(horsepower, 4) 3 13.825 1.453e-08 *** ## s(cylinder, 5) 3 17.668 9.712e-11 *** ## s(displacement, 4) 3 44.573 < 2.2e-16 *** ## s(year, 4) 3 23.364 7.183e-14 *** ## s(acceleration, 5) 4 3.848 0.004453 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 par(mfrow=c(2,3)) plot(gam,se=TRUE)

1.2b Generalized Additive Models (GAMs) – Python Code

I did not find an equivalent of GAMs in sklearn in Python. There was an early prototype (2012) on GitHub. It looks like it is still a work in progress or has probably been abandoned.

1.3 Tree based Machine Learning Models

The tree based ensemble methods below are all based on the ‘bootstrapping’ technique. In bootstrapping, given a sample of size N, we create datasets of size N by sampling the original dataset with replacement. Machine Learning models are built on the different bootstrapped samples and their predictions are then averaged.
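
As a small illustration (assuming the auto data frame created in section 1.1 above), a bootstrap sample is just a sample of the same size drawn with replacement:

set.seed(1)
N <- nrow(auto)
bootIdx <- sample(1:N, N, replace = TRUE)    # indices drawn with replacement
bootSample <- auto[bootIdx, ]                # one bootstrap replicate of the data
length(unique(bootIdx))/N                    # roughly 63% of the original rows appear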

Decision Trees as seen above have the tendency to overfit. There are several techniques that help to avoid this namely a) Bagging b) Random Forests c) Boosting

Bagging, Random Forest and Gradient Boosting

Bagging: Bagging, or Bootstrap Aggregation, decreases the variance of predictions by building separate Decision Tree models on the different bootstrap samples and then averaging these models. A hand-rolled sketch is shown below.
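
A minimal sketch of bagging for regression (an illustration, assuming the auto data frame from above and library(tree)): fit a tree to each bootstrap sample and average the predictions.

library(tree)
set.seed(1)
B <- 25
treePreds <- sapply(1:B, function(b) {
  idx <- sample(nrow(auto), replace = TRUE)   # bootstrap sample
  fit <- tree(mpg ~ ., data = auto[idx, ])    # tree grown on this sample
  predict(fit, newdata = auto)                # predictions on the full data
})
baggedPred <- rowMeans(treePreds)             # average over the B trees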

Random Forests: In bagging every tree considers all the variables at every split when minimizing the error, so the different trees end up highly correlated. Random Forests remove this shortcoming by considering only a random subset of the features at each split. Hence the resulting trees are much less correlated, and when these models are averaged the performance is much better (see the sketch after this paragraph).
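
In the randomForest package the size of this random subset is controlled by the mtry argument. The sketch below (an illustration, again assuming the auto data frame from above) contrasts a random forest with bagging, which is just the special case where mtry equals the number of predictors.

library(randomForest)
set.seed(1)
p <- ncol(auto) - 1                                      # number of predictors
rfAuto  <- randomForest(mpg ~ ., data = auto, mtry = 3)  # consider 3 random features per split
bagAuto <- randomForest(mpg ~ ., data = auto, mtry = p)  # mtry = p is plain bagging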

Boosting: Gradient Boosted Decision Trees also use an ensemble of trees, but they don’t build models on random subsets of features at each step. Rather, small and simple trees are built sequentially, and each successive tree tries to reduce the error left by the earlier trees.
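
A rough sketch of the idea for regression (an illustration, assuming the auto data frame from above): start from a constant prediction, repeatedly fit a small tree to the current residuals, and add a shrunken version of it to the ensemble.

library(tree)
lambda <- 0.01                                   # learning rate (shrinkage)
yhat <- rep(mean(auto$mpg), nrow(auto))          # start from the mean
for (b in 1:200) {
  d <- data.frame(res = auto$mpg - yhat,         # current residuals
                  auto[, c("horsepower", "weight", "year")])
  fit <- tree(res ~ ., data = d)                 # small tree fit to the residuals
  yhat <- yhat + lambda * predict(fit, d)        # shrink and add to the ensemble
}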

Out of Bag (OOB) Error: In Bagging and Random Forests, for each bootstrap sample taken from the dataset there will be observations left out. These are known as Out of Bag samples. The accuracy (or error) computed on these OOB samples is known as the OOB error.
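
Since each observation has probability (1 - 1/N)^N ≈ e^(-1) ≈ 0.37 of being left out of a given bootstrap sample, roughly a third of the rows are ‘out of bag’ for every tree. A quick check in R (using N = 392, the size of the auto data):

N <- 392
mean(!(1:N %in% sample(1:N, N, replace = TRUE)))  # fraction of rows left out of one bootstrap sample
(1 - 1/N)^N                                       # theoretical value, about 0.367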

1.31a Decision Trees – R Code

The code below creates a Decision tree with the cancer training data. The summary of the fit is output. Based on the ML model, the predict function is used on test data and a confusion matrix is output.

# Read the cancer data library(tree) library(caret) library(e1071) cancer <- read.csv("cancer.csv",stringsAsFactors = FALSE) cancer <- cancer[,2:32] cancer$target <- as.factor(cancer$target) train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5) train <- cancer[train_idx, ] test <- cancer[-train_idx, ] # Create Decision Tree cancerStatus=tree(target~.,train) summary(cancerStatus) ## ## Classification tree: ## tree(formula = target ~ ., data = train) ## Variables actually used in tree construction: ## [1] "worst.perimeter" "worst.concave.points" "area.error" ## [4] "worst.texture" "mean.texture" "mean.concave.points" ## Number of terminal nodes: 9 ## Residual mean deviance: 0.1218 = 50.8 / 417 ## Misclassification error rate: 0.02347 = 10 / 426 pred <- predict(cancerStatus,newdata=test,type="class") confusionMatrix(pred,test$target) ## Confusion Matrix and Statistics ## ## Reference ## Prediction 0 1 ## 0 49 7 ## 1 8 78 ## ## Accuracy : 0.8944 ## 95% CI : (0.8318, 0.9397) ## No Information Rate : 0.5986 ## P-Value [Acc > NIR] : 4.641e-15 ## ## Kappa : 0.7795 ## Mcnemar's Test P-Value : 1 ## ## Sensitivity : 0.8596 ## Specificity : 0.9176 ## Pos Pred Value : 0.8750 ## Neg Pred Value : 0.9070 ## Prevalence : 0.4014 ## Detection Rate : 0.3451 ## Detection Prevalence : 0.3944 ## Balanced Accuracy : 0.8886 ## ## 'Positive' Class : 0 ## # Plot decision tree with labels plot(cancerStatus) text(cancerStatus,pretty=0)



1.31b Decision Trees – Cross Validation – R Code

We can also perform a Cross Validation on the data to identify the Decision Tree which will give the minimum deviance.

library(tree) cancer <- read.csv("cancer.csv",stringsAsFactors = FALSE) cancer <- cancer[,2:32] cancer$target <- as.factor(cancer$target) train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5) train <- cancer[train_idx, ] test <- cancer[-train_idx, ] # Create Decision Tree cancerStatus=tree(target~.,train) # Execute 10 fold cross validation cvCancer=cv.tree(cancerStatus) plot(cvCancer)

# Plot the deviance vs the tree size plot(cvCancer$size,cvCancer$dev,type='b')

prunedCancer=prune.tree(cancerStatus,best=4) plot(prunedCancer) text(prunedCancer,pretty=0)

pred <- predict(prunedCancer,newdata=test,type="class") confusionMatrix(pred,test$target) ## Confusion Matrix and Statistics ## ## Reference ## Prediction 0 1 ## 0 50 7 ## 1 7 78 ## ## Accuracy : 0.9014 ## 95% CI : (0.8401, 0.945) ## No Information Rate : 0.5986 ## P-Value [Acc > NIR] : 7.988e-16 ## ## Kappa : 0.7948 ## Mcnemar's Test P-Value : 1 ## ## Sensitivity : 0.8772 ## Specificity : 0.9176 ## Pos Pred Value : 0.8772 ## Neg Pred Value : 0.9176 ## Prevalence : 0.4014 ## Detection Rate : 0.3521 ## Detection Prevalence : 0.4014 ## Balanced Accuracy : 0.8974 ## ## 'Positive' Class : 0 ##

1.31c Decision Trees – Python Code

Below is the Python code for creating Decision Trees. The accuracy, precision, recall and F1 score are computed on the test data set.

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.metrics import confusion_matrix from sklearn import tree from sklearn.datasets import load_breast_cancer from sklearn.model_selection import train_test_split from sklearn.tree import DecisionTreeClassifier from sklearn.datasets import make_classification, make_blobs from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score import graphviz cancer = load_breast_cancer() (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True) X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0) clf = DecisionTreeClassifier().fit(X_train, y_train) print('Accuracy of Decision Tree classifier on training set: {:.2f}' .format(clf.score(X_train, y_train))) print('Accuracy of Decision Tree classifier on test set: {:.2f}' .format(clf.score(X_test, y_test))) y_predicted=clf.predict(X_test) confusion = confusion_matrix(y_test, y_predicted) print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_predicted))) print('Precision: {:.2f}'.format(precision_score(y_test, y_predicted))) print('Recall: {:.2f}'.format(recall_score(y_test, y_predicted))) print('F1: {:.2f}'.format(f1_score(y_test, y_predicted))) # Plot the Decision Tree clf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train) dot_data = tree.export_graphviz(clf, out_file=None, feature_names=cancer.feature_names, class_names=cancer.target_names, filled=True, rounded=True, special_characters=True) graph = graphviz.Source(dot_data) graph ## Accuracy of Decision Tree classifier on training set: 1.00 ## Accuracy of Decision Tree classifier on test set: 0.87 ## Accuracy: 0.87 ## Precision: 0.97 ## Recall: 0.82 ## F1: 0.89

1.31d Decision Trees – Cross Validation – Python Code

In the code below 10-fold cross validation is performed for different depths of the tree and the accuracy is computed. The accuracy on the test set seems to plateau when the depth is 8, but it is seen to increase again from 10 to 12. More analysis needs to be done here.

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.datasets import load_breast_cancer from sklearn.tree import DecisionTreeClassifier (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True) from sklearn.cross_validation import train_test_split, KFold def computeCVAccuracy(X,y,folds): foldAcc=[] depth=[1,2,3,4,5,6,7,8,9,10,11,12] for i in depth: accuracy=[] kf = KFold(len(X),n_folds=folds) for train_index, test_index in kf: X_train, X_test = X.iloc[train_index], X.iloc[test_index] y_train, y_test = y.iloc[train_index], y.iloc[test_index] clf = DecisionTreeClassifier(max_depth = i).fit(X_train, y_train) score=clf.score(X_test, y_test) accuracy.append(score) foldAcc.append(np.mean(accuracy)) return(foldAcc) cvAccuracy=computeCVAccuracy(pd.DataFrame(X_cancer),pd.DataFrame(y_cancer),folds=10) df1=pd.DataFrame(cvAccuracy) df1.columns=['cvAccuracy'] df1.index=[1,2,3,4,5,6,7,8,9,10,11,12] df1.plot() plt.title("Decision Tree - 10-fold Cross Validation Accuracy vs Depth of tree") plt.xlabel("Depth of tree") plt.ylabel("Accuracy") plt.savefig('fig3.png', bbox_inches='tight')




1.4a Random Forest – R code

A Random Forest is fit using the Boston data. The summary shows that 4 variables were randomly chosen at each split and the resulting ML model explains 88.72% of the variance (computed on the Out of Bag samples). Also the variable importance is plotted. It can be seen that ‘rooms’ and ‘status’ are the most influential features in the model.

library(randomForest) df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL # Select specific columns Boston <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color", "status","medianValue") # Fit a Random Forest on the Boston training data rfBoston=randomForest(medianValue~.,data=Boston) # Display the summary of the fit. It can be seen that the MSE is 9.52 # and the percentage variance explained is 88.72%. About 4 variables were tried at each # split over a forest of 500 trees # The MSE and percent variance are computed on the Out of Bag samples rfBoston ## ## Call: ## randomForest(formula = medianValue ~ ., data = Boston) ## Type of random forest: regression ## Number of trees: 500 ## No. of variables tried at each split: 4 ## ## Mean of squared residuals: 9.521672 ## % Var explained: 88.72 #List and plot the variable importances importance(rfBoston) ## IncNodePurity ## crimeRate 2602.1550 ## zone 258.8057 ## indus 2599.6635 ## charles 240.2879 ## nox 2748.8485 ## rooms 12011.6178 ## age 1083.3242 ## distances 2432.8962 ## highways 393.5599 ## tax 1348.6987 ## teacherRatio 2841.5151 ## color 731.4387 ## status 12735.4046 varImpPlot(rfBoston)

1.4b Random Forest-OOB and Cross Validation Error – R code

The figure below shows the OOB error and the test error against ‘mtry’. Here mtry indicates the number of randomly chosen features that are considered at each split. The lowest test error occurs when mtry = 8.

library(randomForest) df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL # Select specific columns Boston <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color", "status","medianValue") # Split as training and test sets train_idx <- trainTestSplit(Boston,trainPercent=75,seed=5) train <- Boston[train_idx, ] test <- Boston[-train_idx, ] #Initialize oobError and testError oobError <- NULL testError <- NULL # In the code below the number of variables to consider at each split is increased # from 1 - 13(max features) and the OOB error and the test MSE are computed for(i in 1:13){ fitRF=randomForest(medianValue~.,data=train,mtry=i,ntree=400) oobError[i] <-fitRF$mse[400] pred <- predict(fitRF,newdata=test) testError[i] <- mean((pred-test$medianValue)^2) } # We can see the OOB and Test Error. It can be seen that the Random Forest performs # best with the lowest MSE at mtry=6 matplot(1:13,cbind(testError,oobError),pch=19,col=c("red","blue"), type="b",xlab="mtry(no of variables at each split)", ylab="Mean Squared Error", main="Random Forest - OOB and Test Error") legend("topright",legend=c("Test","OOB"),pch=19,col=c("red","blue"))

1.4c Random Forest – Python code

The Python code for Random Forest regression is shown below. The training and test R-squared scores are computed. The variable importance shows that ‘rooms’ and ‘status’ are the most influential of the variables.

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.ensemble import RandomForestRegressor df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") X=df[['crimeRate','zone', 'indus','charles','nox','rooms', 'age','distances','highways','tax', 'teacherRatio','color','status']] y=df['medianValue'] X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) regr = RandomForestRegressor(max_depth=4, random_state=0) regr.fit(X_train, y_train) print('R-squared score (training): {:.3f}' .format(regr.score(X_train, y_train))) print('R-squared score (test): {:.3f}' .format(regr.score(X_test, y_test))) feature_names=['crimeRate','zone', 'indus','charles','nox','rooms', 'age','distances','highways','tax', 'teacherRatio','color','status'] print(regr.feature_importances_) plt.figure(figsize=(10,6),dpi=80) c_features=X_train.shape[1] plt.barh(np.arange(c_features),regr.feature_importances_) plt.xlabel("Feature importance") plt.ylabel("Feature name") plt.yticks(np.arange(c_features), feature_names) plt.tight_layout() plt.savefig('fig4.png', bbox_inches='tight') ## R-squared score (training): 0.917 ## R-squared score (test): 0.734 ## [ 0.03437382 0. 0.00580335 0. 0.00731004 0.36461548 ## 0.00638577 0.03432173 0.0041244 0.01732328 0.01074148 0.0012638 ## 0.51373683]

1.4d Random Forest – Cross Validation and OOB Error – Python code

As with R, ‘max_features’ determines the number of randomly chosen features the random forest will consider at each split. The plot shows that the MSE is lowest when max_features = 8.

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.ensemble import RandomForestRegressor from sklearn.model_selection import cross_val_score df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") X=df[['crimeRate','zone', 'indus','charles','nox','rooms', 'age','distances','highways','tax', 'teacherRatio','color','status']] y=df['medianValue'] cvError=[] oobError=[] oobMSE=[] for i in range(1,13): regr = RandomForestRegressor(max_depth=4, n_estimators=400,max_features=i,oob_score=True,random_state=0) mse= np.mean(cross_val_score(regr, X, y, cv=5,scoring = 'neg_mean_squared_error')) # Since this is neg_mean_squared_error I have inverted the sign to get MSE cvError.append(-mse) # Fit on all data to compute OOB error regr.fit(X, y) # Record the OOB error for each `max_features=i` setting oob = 1 - regr.oob_score_ oobError.append(oob) # Get the Out of Bag prediction oobPred=regr.oob_prediction_ # Compute the Mean Squared Error between OOB Prediction and target mseOOB=np.mean(np.square(oobPred-y)) oobMSE.append(mseOOB) # Plot the CV Error and OOB MSE against max_features maxFeatures=np.arange(1,13) cvError=pd.DataFrame(cvError,index=maxFeatures) oobMSE=pd.DataFrame(oobMSE,index=maxFeatures) #Plot the two error curves errorDF=pd.concat([cvError,oobMSE],axis=1) errorDF.columns=['CV Error','OOB MSE'] fig8=errorDF.plot() fig8=plt.title('Random forest - CV Error and OOB Error vs max_features') fig8.figure.savefig('fig8.png', bbox_inches='tight') #Plot the OOB Error vs max_features plt.plot(range(1,13),oobError) fig2=plt.title("Random Forest - OOB Error vs max_features (variable no of features)") fig2=plt.xlabel("max_features (variable no of features)") fig2=plt.ylabel("OOB Error") fig2.figure.savefig('fig7.png', bbox_inches='tight')

1.5a Boosting – R code

Here a Gradient Boosted ML model is built with n.trees=5000, a learning rate of 0.01 and an interaction depth of 4. The feature importance plot again shows that rooms and status are the 2 most important features. The MSE vs the number of trees plateaus at around 2000 trees.

library(gbm) # Perform gradient boosting on the Boston data set. The distribution is gaussian since we # are minimizing the MSE. The interaction depth specifies the number of splits boostBoston=gbm(medianValue~.,data=train,distribution="gaussian",n.trees=5000, shrinkage=0.01,interaction.depth=4) #The summary gives the variable importance. The 2 most significant variables are # number of rooms and lower status summary(boostBoston)

## var rel.inf ## rooms rooms 42.2267200 ## status status 27.3024671 ## distances distances 7.9447972 ## crimeRate crimeRate 5.0238827 ## nox nox 4.0616548 ## teacherRatio teacherRatio 3.1991999 ## age age 2.7909772 ## color color 2.3436295 ## tax tax 2.1386213 ## charles charles 1.3799109 ## highways highways 0.7644026 ## indus indus 0.7236082 ## zone zone 0.1001287 # The plots below show how each variable relates to the median value of the home. As # the number of rooms increases the median value increases, and as lower status increases # the median value decreases par(mfrow=c(1,2)) #Plot the relation between the top 2 features and the target plot(boostBoston,i="rooms") plot(boostBoston,i="status")

# Create a sequence of trees between 100-5000 incremented by 50 nTrees=seq(100,5000,by=50) # Predict the values for the test data pred <- predict(boostBoston,newdata=test,n.trees=nTrees) # Compute the mean for each of the MSE for each of the number of trees boostError <- apply((pred-test$medianValue)^2,2,mean) #Plot the MSE vs the number of trees plot(nTrees,boostError,pch=19,col="blue",ylab="Mean Squared Error", main="Boosting Test Error")

1.5b Cross Validation Boosting – R code

Included below is the cross validation error vs the learning rate (shrinkage). The lowest error is when the learning rate = 0.09.

cvError <- NULL s <- c(.001,0.01,0.03,0.05,0.07,0.09,0.1) for(i in seq_along(s)){ cvBoost=gbm(medianValue~.,data=train,distribution="gaussian",n.trees=5000, shrinkage=s[i],interaction.depth=4,cv.folds=5) cvError[i] <- mean(cvBoost$cv.error) } # Create a data frame for plotting a <- rbind(s,cvError) b <- as.data.frame(t(a)) # It can be seen that a shrinkage parameter of 0.05 gives the lowest CV Error ggplot(b,aes(s,cvError)) + geom_point() + geom_line(color="blue") + xlab("Shrinkage") + ylab("Cross Validation Error") + ggtitle("Gradient boosted trees - Cross Validation error vs Shrinkage")

1.5c Boosting – Python code

A gradient boosted ML model in Python is created below. The R-squared score is computed on the training and test data.

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.ensemble import GradientBoostingRegressor df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") X=df[['crimeRate','zone', 'indus','charles','nox','rooms', 'age','distances','highways','tax', 'teacherRatio','color','status']] y=df['medianValue'] X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) regr = GradientBoostingRegressor() regr.fit(X_train, y_train) print('R-squared score (training): {:.3f}' .format(regr.score(X_train, y_train))) print('R-squared score (test): {:.3f}' .format(regr.score(X_test, y_test))) ## R-squared score (training): 0.983 ## R-squared score (test): 0.821

1.5c Cross Validation Boosting – Python code

The cross validation error is computed as the learning rate is varied. The minimum CV error occurs when the learning rate = 0.04.

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.ensemble import RandomForestRegressor from sklearn.ensemble import GradientBoostingRegressor from sklearn.model_selection import cross_val_score df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") X=df[['crimeRate','zone', 'indus','charles','nox','rooms', 'age','distances','highways','tax', 'teacherRatio','color','status']] y=df['medianValue'] cvError=[] learning_rate =[.001,0.01,0.03,0.05,0.07,0.09,0.1] for lr in learning_rate: regr = GradientBoostingRegressor(max_depth=4, n_estimators=400,learning_rate =lr,random_state=0) mse= np.mean(cross_val_score(regr, X, y, cv=10,scoring = 'neg_mean_squared_error')) # Since this is neg_mean_squared_error I have inverted the sign to get MSE cvError.append(-mse) plt.plot(learning_rate,cvError) plt.title("Gradient Boosting - 10-fold CV - Mean Squared Error vs learning rate") plt.xlabel("Learning rate") plt.ylabel("Mean Squared Error") plt.savefig('fig6.png', bbox_inches='tight')

Conclusion

This post covered Splines and Tree based ML models like Bagging, Random Forest and Boosting. Stay tuned for further updates.

You may also like

  1. Re-introducing cricketr! : An R package to analyze performances of cricketer
  2. Designing a Social Web Portal
  3. My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)
  4. My TEDx talk on the “Internet of Things”
  5. Singularity
  6. A closer look at “Robot Horse on a Trot” in Android

To see all posts see Index of posts



RcppQuantuccia 0.0.2

Tue, 11/07/2017 - 01:33

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A first maintenance release of RcppQuantuccia got to CRAN earlier today.

RcppQuantuccia brings the Quantuccia header-only subset / variant of QuantLib to R. At present it mostly offers calendaring, but Quantuccia just got a decent amount of new functions so hopefully we can offer more here too.

This release was motivated by the upcoming Rcpp release which will deprecate the old Date and Datetime vectors in favour of newer ones. So this release of RcppQuantuccia switches to the newer ones.

Other changes are below:

Changes in version 0.0.2 (2017-11-06)
  • Added calendars for Canada, China, Germany, Japan and United Kingdom.

  • Added bespoke and joint calendars.

  • Using new date(time) vectors (#6).

Courtesy of CRANberries, there is also a diffstat report relative to the previous release. More information is on the RcppQuantuccia page. Issues and bugreports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



Data acquisition in R (2/4)

Tue, 11/07/2017 - 01:00

(This article was first published on Ilya Kashnitsky, and kindly contributed to R-bloggers)

R is an incredible tool for reproducible research. In the present series of blog posts I want to show how one can easily acquire data within an R session, documenting every step in a fully reproducible way. There are numerous data acquisition options for R users. Of course, I do not attempt to show all the data possibilities and tend to focus mostly on demographic data. If your prime interest lies outside human population statistics, it’s worth checking the amazing Open Data Task View.

The series consists of four posts:

For each of the data acquisition options I provide a small visualization use case.

Eurostat

The package eurostat has a function search_eurostat to search for the relevant datasets. Sadly, though, this function does not provide the codes of all the datasets that have the expression of interest in the title. For example, the search on the expression life expectancy produces an output with just 2 results, which does not make any sense. Thus, the best strategy is to go to the Eurostat website, find the needed dataset code, and fetch the desired dataset by its code. Note that there is a separate database for subregional level indicators.

I am going to download life expectancy estimates for European countries; the dataset code is demo_mlexpec.

library(tidyverse) library(lubridate) library(eurostat) # download the selected dataset e0 <- get_eurostat("demo_mlexpec")

It can take a while because the dataset is quite big (0.4m obs). If the automated procedure does not work, one can download the data manually via the Bulk Download Service of Eurostat.

Let’s have a look at the remaining life expectancy at age 65, the most common conventional age at retirement, in some European countries, separately for males, females, and total population. Some data preparation steps are needed. First, we only need the life expectancy estimates for those aged 65. Next, we don’t need total population, only males and females separately. Finally, let’s select just a bunch of countries: Germany, France, Italy, Russia, Spain, the UK.

e0 %>% filter(! sex == "T", age == "Y65", geo %in% c("DE", "FR", "IT", "RU", "ES", "UK")) %>% ggplot(aes(x = time %>% year(), y = values, color = sex))+ geom_path()+ facet_wrap(~ geo, ncol = 3)+ labs(y = "Life expectancy at age 65", x = NULL)+ theme_minimal()

World Bank

There are several packages that provide an API to World Bank data. Probably the most elaborate one is the fairly recent wbstats. Its wbsearch function does a great job of searching through the database for the relevant datasets. For example, wbsearch("fertility") produces a dataframe of 339 entries with the codes and names of the relevant indicators.

library(tidyverse) library(wbstats) # search for a dataset of interest wbsearch("fertility") %>% head   indicatorID indicator 2479 SP.DYN.WFRT.Q5 Total wanted fertility rate (births per woman): Q5 (highest) 2480 SP.DYN.WFRT.Q4 Total wanted fertility rate (births per woman): Q4 2481 SP.DYN.WFRT.Q3 Total wanted fertility rate (births per woman): Q3 2482 SP.DYN.WFRT.Q2 Total wanted fertility rate (births per woman): Q2 2483 SP.DYN.WFRT.Q1 Total wanted fertility rate (births per woman): Q1 (lowest) 2484 SP.DYN.WFRT Wanted fertility rate (births per woman)

Let’s have a look at the indicator Lifetime risk of maternal death (%) (code SH.MMR.RISK.ZS). The World Bank provides a variety of country groupings. One of the more curious groupings divides countries based on how far they have advanced along the Demographic Transition path. Below I plot the selected indicator for (1) the countries that have passed the Demographic Transition, (2) the countries that haven’t yet experienced the demographic dividend, and (3) the whole World.

# fetch the selected dataset df_wb <- wb(indicator = "SH.MMR.RISK.ZS", startdate = 2000, enddate = 2015) # have look at the data for one year df_wb %>% filter(date == 2015) %>% View df_wb %>% filter(iso2c %in% c("V4", "V1", "1W")) %>% ggplot(aes(x = date %>% as.numeric(), y = value, color = country))+ geom_path(size = 1)+ scale_color_brewer(NULL, palette = "Dark2")+ labs(x = NULL, y = NULL, title = "Lifetime risk of maternal death (%)")+ theme_minimal()+ theme(panel.grid.minor = element_blank(), legend.position = c(.8, .9))

OECD

Organization for Economic Cooperation and Development provides detailed economic and demographic data on the member countries. There is an R package OECD that streamlines the use of their data in R. The function search_dataset works nicely to browse the available datasets by keywords. Then get_dataset would fetch the chosen dataset. In the example below I grab the data on the duration of unemployment and then plot the data for the male population of EU16, EU28 and the US as heatmaps.

library(tidyverse) library(viridis) library(OECD) # search by keyword search_dataset("unemployment") %>% View # download the selected dataset df_oecd <- get_dataset("AVD_DUR") # turn variable names to lowercase names(df_oecd) <- names(df_oecd) %>% tolower() df_oecd %>% filter(country %in% c("EU16", "EU28", "USA"), sex == "MEN", ! age == "1524") %>% ggplot(aes(obstime, age, fill = obsvalue))+ geom_tile()+ scale_fill_viridis("Months", option = "B")+ scale_x_discrete(breaks = seq(1970, 2015, 5) %>% paste)+ facet_wrap(~ country, ncol = 1)+ labs(x = NULL, y = "Age groups", title = "Average duration of unemployment in months, males")+ theme_minimal()

WID

The World Wealth and Income Database (WID) is a harmonized dataset on income and wealth inequality. The developers of the database provide an R package to access their data, which is only available from GitHub so far.

library(tidyverse) #install.packages("devtools") devtools::install_github("WIDworld/wid-r-tool") library(wid)

The function to acquire data is download_wid(). To specify the arguments, one would have to consult help pages of the package and select desired datasets.

?wid_series_type ?wid_concepts

The following nice example is adapted from the package vignette. It shows the share of wealth that was owned by the richest 1% and 10% of population in France and Great Britain.

df_wid <- download_wid( indicators = "shweal", # Shares of personal wealth areas = c("FR", "GB"), # In France and Great Britain perc = c("p90p100", "p99p100") # Top 10% and top 1% ) df_wid %>% ggplot(aes(x = year, y = value, color = country)) + geom_path()+ labs(title = "Top 1% and top 10% personal wealth shares in France and Great Britain", y = "top share")+ facet_wrap(~ percentile)+ theme_minimal()

Full reproducibility

All the code chunks together can be found in this gist.



Automating Summary of Surveys with RMarkdown

Tue, 11/07/2017 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

This guide shows how to automate the summary of surveys with R and R Markdown using RStudio. This is great for portions of the document that don’t change (e.g., “the survey shows substantial partisan polarization”). The motivation is really twofold: efficiency (maximize the reusability of code, minimize copying and pasting errors) and reproducibility (maximize the number of people and computers that can reproduce findings).

The basic setup is to write an Rmd file that will serve as a template, and then a short R script that loops over each data file (using library(knitr)). The render function then turns the Rmd into documents or slides (typically in PDF, HTML, or docx) by taking file metadata as a parameter.

There are countless ways to summarize a survey in R. This guide shows a few basics with ggplot and questionr, but focuses on the overall workflow (file management, etc.). Following the instructions here, you should be able to reproduce all four reports (and in principle, many more) despite only writing code to clean one survey. Most of the code is displayed in this document, but all code is found in either pewpoliticaltemplate.Rmd or pew_report_generator.R. All code, as well as the outputted documents, can be found here, and details on obtaining the data are found below.

Software

RStudio’s interface with library(rmarkdown) is evolving rapidly. Installing the current version of RStudio is highly recommended, particularly for the previews of the R Markdown code (this doc was created with RStudio 1.1.83). (Here is my install guide, which includes links to tutorials and cheat sheets. For somewhat more advanced survey data cleaning, click here.)

Even if you’ve knit Rmd before, your libraries may not be new enough to create parameterized reports. I recommend installing pacman, which has a convenience function p_load that smooths package installation, loading, and maintenance. I’d recommend p_load particularly if you are collaborating, say, on Dropbox.

install.packages("pacman") p_load(rmarkdown, knitr, foreign, scales, questionr, tidyverse, update = TRUE)

Remember PDF requires LaTeX (install links). By contrast, knitting to docx or HTML does not require LaTeX. Creating pptx is possible in R with library(ReporteRs), but is not discussed here.

The Data

Download the four “political surveys” from Pew Research available here (i.e., January, March, August, and October 2016). You may recall, some politics happened in 2016. (The data is free, provided you take a moment to make an account.)

  • If need be, decompress each zip folder.

Three of my folders have intuitive names (Jan16, Mar16, and Oct16), but one of my folders picked up a lengthy name, http___www.people-press.org_files_datasets_Aug16. Don’t worry about that.

  • Create a new folder, call it, say, automating.

  • Move all four data folders into automating.

Please note that I have no affiliation (past or present) with Pew Research. I simply think that they do great work and they make it relatively hassle-free to get started with meaningful data sets.

The R Notebook (R Markdown) Template

(R Markdown ninjas can skip this section.)

In RStudio, create a new R Notebook and save it as pewpoliticaltemplate.Rmd in the automating folder you just created. This document will likely knit to HTML by default; hold down the Knit button to change it to PDF. Add fields to the header as desired. The sample header below automatically puts today’s date on the document by parsing the expression next to Date: as R code. classoption: landscape may help with wide tables. You can also specify the file that contains your bibliography in several formats, such as BibTex and EndNote (citation details).
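
As a sketch, a header along these lines would do it (the title, author, output format and bibliography file name are placeholders to adapt):

---
title: "Pew Political Survey Summary"
author: "Your Name"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
output: pdf_document
classoption: landscape
bibliography: references.bib
---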

Next add an R code chunk to pewpoliticaltemplate.Rmd to take care of background stuff like formatting. Though setting a working directory would not be needed just to knit the Rmd, the directory must be set by knitr::opts_knit$set(root.dir = '...') to automate document prep. (setwd isn’t needed in the Rmd, but setting the working directory separately in Console is recommended if you’re still editing.)
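
A minimal version of such a setup chunk might look like the following (the root directory is of course a placeholder for your own automating folder):

knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, results = 'asis')
knitr::opts_knit$set(root.dir = "/users/yourname/automating")
library(foreign)     # read.spss
library(tidyverse)   # dplyr, ggplot2, etc.
library(questionr)   # weighted tables
library(scales)
library(knitr)       # kable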

The Play button at the top right gives a preview of the code’s output, which is handy. If some part of the analysis is very lengthy, it only needs to be run once, freeing you to tinker with graphics and the like.

– Now the default settings have been set and you don’t need to worry about suppressing warnings and so on with each code chunk. You can, of course, change them case-by-case as you like.

– Unlike in R, when setting the format options for individual code chunks (as shown above to suppress warnings before the defaults kick in), you do need to type out the words TRUE and FALSE in full.

– Unlike the template, in this doc, I’ve set the defaults to echo = TRUE and tidy = TRUE to display the R code more pleasingly.

– The setting asis = TRUE is very useful for professionally formatted tables (shown below), but is not recommended for raw R output of matrix and tables. To make raw data frames display with kable by default, see here.

The Template

I find it easiest to write a fully working example and then make little changes as needed so that knitr::render() can loop over the data sets. First things first.

survey <- read.spss("Jan16/Jan16 public.sav", to.data.frame = TRUE)

Summary stats can easily be inserted into the text like so:
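
For instance, an inline chunk like the following (a sketch, assuming the survey object created above) drops a summary number straight into the text:

The January 2016 survey contains `r nrow(survey)` respondents and `r ncol(survey)` variables.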

The template contains additional examples with survey weights (lengthier calculations should be done in blocks of code and then their result referenced with that inline style).

Here is a basic plot we might want, which reflects the survey weights. facet_grid() is used to create analogous plots for each party identification. The plot uses the slightly wonky syntax y = (..count..)/sum(..count..) to display the results as percentages rather than counts. Note that some code that cleans the data (mostly shortening labels) is omitted for brevity, but can be found here.

PA <- ggplot(survey) + theme_minimal() PA <- PA + geom_bar(aes(q1, y = (..count..)/sum(..count..), weight = weight, fill = q1)) PA <- PA + facet_grid(party.clean ~ .) + theme(strip.text.y = element_text(angle = 45)) PA <- PA + xlab("") + ylab("Percent of Country") PA <- PA + ggtitle("Presidential Approval: January 2016") PA <- PA + scale_y_continuous(labels = scales::percent) PA

Here is an example of a weighted crosstab. knitr::kable will create a table that’s professional in appearance (when knit as PDF; kable takes the style of an academic journal).

kable(wtd.table(survey$ideo, survey$sex, survey$weight)/nrow(survey), digits = 2) Male Female Very conservative 0.04 0.03 Conservative 0.14 0.13 Moderate 0.20 0.20 Liberal 0.08 0.09 Very liberal 0.03 0.03 DK* 0.02 0.03

Suppose we want to display Presidential Approval, where the first column provides overall approval and subsequent columns are crosstabs for various factors of interest (using the cell weights). I’ve written a convenience function called Xtabs that creates this format, which is common in the survey world.

source("https://raw.githubusercontent.com/rdrr1990/datascience101/master/automating/Xtabs.R") kable(Xtabs(survey, "q1", c("sex", "race"), weight = "cellweight")) Overall Male Female White (nH) Black (nH) Hispanic Other DK* Approve 45.6% 21.3% 24.2% 20.4% 9.48% 10.2% 4.97% 0.443% Disapprove 48.5% 25.5% 23% 39.6% 1.33% 3.95% 2.53% 1.12% Don’t Know (VOL) 5.94% 2.67% 3.27% 3.39% 0.646% 1.14% 0.489% 0.269%

Suppose we want to do many crosstabs. The syntax survey$ideo is widely used for convenience, but survey[["ideo"]] will serve us better since it allows us to work with vectors of variable names ( details from win-vector ). Below, the first two calls to comparisons are identical, but the final one is not because there is no variable “x” in the data frame survey.

identical(survey$ideo, survey[["ideo"]]) [1] TRUE x <- "ideo" identical(survey[[x]], survey[["ideo"]]) [1] TRUE identical(survey[[x]], survey$x) [1] FALSE

So say we want weighted crosstabs for ideology and party ID crossed with all of questions 20 through 29. Here is some code that will do that.

x <- names(survey)[grep("q2[[:digit:]]", names(survey))] x [1] "q20" "q21" "q22a" "q22b" "q22c" "q22d" "q22e" "q22f" "q22g" "q22h" [11] "q22i" "q25" "q26" "q27" "q28" y <- c("ideo", "party") for (i in x) { for (j in y) { cat("\nWeighted proportions for", i, "broken down by", j, "\n") print(kable(wtd.table(survey[[i]], survey[[j]], survey$weight)/nrow(survey), digits = 2)) cat("\n") # break out of table formatting } cat("\\newpage") }

A few notes:

– This code will only work with the asis setting (shown above) that lets knitr interpret the output of print(kable()) as something to render (rather than just Markdown code to display for use elsewhere).

– Ideally one would have a csv or data.frame of the questions, and display them as loop-switched questions. In this case, the questionnaire is in a docx and so library(docxtrackr) may help.

– Rather than a nested loop, one would likely prefer to pick a question, loop over the demographic and ideological categories for the crosstabs, and then insert commentary and overview.

– The outer loop starts a new page after each pass through the inner loop, via cat("\\newpage"), which is specific to rendering as PDF. Extra line breaks \n are needed to break out of the table formatting and keep code and text separate. A different approach to page breaks is needed for docx.

Adapting the Template with Parameters

The next step is to add a parameter with any variables you need. The parameters will be controlled by the R script discussed below. There is, of course, quite a bit of choice as to what is controlled by which file, but often only a handful of parameters are necessary. Add the following to the end of the header of pewpoliticaltemplate.Rmd:

params:
  spssfile: !r 1
  surveywave: !r 2016

That creates variables params$spssfile and params$surveywave that can be controlled externally from other R sessions, and gives them default values of 1 and 2016 respectively. Setting default values smooths debugging by allowing you to continue knitting the Rmd on its own (rather than from the R script we will create in a moment… You can also click on Knit and choose Knit with Parameters to specify particular values).

Now make any changes to Rmd template. For example, in the ggplot code…

PA <- PA + ggtitle(paste("Presidential Approval:", params$surveywave))

Notice we can get a list of all the spss files like so:

dir(pattern = "sav", recursive = TRUE) [1] "http___www.people-press.org_files_datasets_Aug16/Aug16 public.sav" [2] "Jan16/Jan16 public.sav" [3] "March16/March16 public.sav" [4] "Oct16/Oct16 public.sav"

or in this case

dir(pattern = "16 public.sav", recursive = TRUE) [1] "http___www.people-press.org_files_datasets_Aug16/Aug16 public.sav" [2] "Jan16/Jan16 public.sav" [3] "March16/March16 public.sav" [4] "Oct16/Oct16 public.sav"

I recommend making the pattern as specific as possible in case you or your collaborators add other spss files with similar names. To use regular expressions to specify more complicated patterns, see here.

Now back to editing pewpoliticaltemplate.Rmd…
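
The read.spss call at the top of the template can now pick its input via the parameter, along the lines of this sketch:

files <- dir(pattern = "16 public.sav", recursive = TRUE)
survey <- read.spss(files[params$spssfile], to.data.frame = TRUE)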

Knit the file to see how it looks with these default settings; that’s it for this portion.

Automating with knitr

Now create a new R script; mine’s called pew_report_generator.R. It’s just a simple loop that tells which data set to grab, as well as the label to pass to the Rmd. Note that the labels appear in alphabetical rather than chronological order as a function of the way that the Rmd happens to find the files.

library(pacman) p_load(knitr, rmarkdown, sessioninfo) setwd("/users/mohanty/Desktop/pewpolitical/") waves <- c("August 2016", "January 2016", "March 2016", "October 2016") for (i in 1:length(waves)) { render("pewpoliticaltemplate.Rmd", params = list(spssfile = i, surveywave = waves[i]), output_file = paste0("Survey Analysis ", waves[i], ".pdf")) } session <- session_info() save(session, file = paste0("session", format(Sys.time(), "%m%d%Y"), ".Rdata"))

That’s it. Of course, in practice you might write some code on the first survey that doesn’t work for all of them. Pew, for example, seems to have formatted the survey date differently in the last two surveys, which led me to make a few changes. But if the data are formatted fairly consistently, a one-time investment can save massive amounts of time lost to error-prone copying and pasting.

A Little Version Control

The last bit of code is not necessary, but is a convenient way to store which versions of which libraries were actually used on which version of R. If something works now but not in the future, install_version (found in library(devtools)) can be used to install the older version of particular packages.

s <- session_info() s$platform setting value version R version 3.4.2 (2017-09-28) os macOS Sierra 10.12.6 system x86_64, darwin15.6.0 ui X11 language (EN) collate en_US.UTF-8 tz America/Los_Angeles date 2017-11-06 s$packages package * version date source assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0) backports 1.1.1 2017-09-25 CRAN (R 3.4.2) bindr 0.1 2016-11-13 CRAN (R 3.4.0) bindrcpp 0.2 2017-06-17 CRAN (R 3.4.0) broom 0.4.2 2017-02-13 CRAN (R 3.4.0) cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0) clisymbols 1.2.0 2017-05-21 cran (@1.2.0) colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0) digest 0.6.12 2017-01-27 CRAN (R 3.4.0) dplyr * 0.7.4 2017-09-28 cran (@0.7.4) evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1) forcats 0.2.0 2017-01-23 CRAN (R 3.4.0) foreign * 0.8-69 2017-06-22 CRAN (R 3.4.2) formatR 1.5 2017-04-25 CRAN (R 3.4.0) ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.0) glue 1.2.0 2017-10-29 CRAN (R 3.4.2) gtable 0.2.0 2016-02-26 CRAN (R 3.4.0) haven 1.1.0 2017-07-09 CRAN (R 3.4.1) highr 0.6 2016-05-09 CRAN (R 3.4.0) hms 0.3 2016-11-22 CRAN (R 3.4.0) htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0) httpuv 1.3.5 2017-07-04 CRAN (R 3.4.1) httr 1.3.1 2017-08-20 cran (@1.3.1) jsonlite 1.5 2017-06-01 CRAN (R 3.4.0) knitr * 1.17 2017-08-10 CRAN (R 3.4.1) labeling 0.3 2014-08-23 CRAN (R 3.4.0) lattice 0.20-35 2017-03-25 CRAN (R 3.4.2) lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2) lubridate 1.7.0 2017-10-29 CRAN (R 3.4.2) magrittr 1.5 2014-11-22 CRAN (R 3.4.0) mime 0.5 2016-07-07 CRAN (R 3.4.0) miniUI 0.1.1 2016-01-15 CRAN (R 3.4.0) mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0) modelr 0.1.1 2017-07-24 CRAN (R 3.4.1) munsell 0.4.3 2016-02-13 CRAN (R 3.4.0) nlme 3.1-131 2017-02-06 CRAN (R 3.4.2) pacman * 0.4.6 2017-05-14 CRAN (R 3.4.0) pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.0) plyr 1.8.4 2016-06-08 CRAN (R 3.4.0) psych 1.7.8 2017-09-09 CRAN (R 3.4.1) purrr * 0.2.4 2017-10-18 CRAN (R 3.4.2) questionr * 0.6.3 2017-11-06 local R6 2.2.2 2017-06-17 CRAN (R 3.4.0) Rcpp 0.12.13 2017-09-28 cran (@0.12.13) readr * 1.1.1 2017-05-16 CRAN (R 3.4.0) readxl 1.0.0 2017-04-18 CRAN (R 3.4.0) reshape2 1.4.2 2016-10-22 CRAN (R 3.4.0) rlang 0.1.2 2017-08-09 CRAN (R 3.4.1) rmarkdown 1.6 2017-06-15 CRAN (R 3.4.0) rprojroot 1.2 2017-01-16 CRAN (R 3.4.0) rstudioapi 0.7 2017-09-07 cran (@0.7) rvest 0.3.2 2016-06-17 CRAN (R 3.4.0) scales 0.5.0 2017-08-24 cran (@0.5.0) sessioninfo * 1.0.0 2017-06-21 CRAN (R 3.4.1) shiny 1.0.5 2017-08-23 cran (@1.0.5) stringi 1.1.5 2017-04-07 CRAN (R 3.4.0) stringr 1.2.0 2017-02-18 CRAN (R 3.4.0) tibble * 1.3.4 2017-08-22 cran (@1.3.4) tidyr * 0.7.2 2017-10-16 CRAN (R 3.4.2) tidyverse * 1.1.1 2017-01-27 CRAN (R 3.4.0) withr 2.0.0 2017-10-25 Github (jimhester/withr@a43df66) xml2 1.1.1 2017-01-24 CRAN (R 3.4.0) xtable 1.8-2 2016-02-05 CRAN (R 3.4.0) yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)




A history-oriented introduction to R for Excel users

Mon, 11/06/2017 - 17:49

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

While spreadsheets are fine tools for collecting and sharing data, the temptation is often there to also use them for in-depth analysis better suited to reproducible systems like R. Historian Jesse Sadler recently published the useful guide Excel vs R: A Brief Introduction to R, which provides useful advice to data analysts currently using spreadsheets on how to transition to R:

Quantitative research often begins with the humble process of counting. Historical documents are never as plentiful as a historian would wish, but counting words, material objects, court cases, etc. can lead to a better understanding of the sources and the subject under study. When beginning the process of counting, the first instinct is to open a spreadsheet. The end result might be the production of tables and charts created in the very same spreadsheet document. In this post, I want to show why this spreadsheet-centric workflow is problematic and recommend the use of a programming language such as R as an alternative for both analyzing and visualizing data.

The post provides a good overview of the pros and cons of using spreadsheets for data analysis, and then provides a useful guide — aimed at spreadsheet users — to using R for the problematic parts. It includes:

  • Basics of the R command line
  • An overview of the Tidyverse, a suite of R packages for data manipulation
  • Working with data in R: numbers, strings and dates
  • Manipulating data frames by linking operations together with the pipe operator
  • Visualizing data with the ggplot2 package

The guide is built around a worked analysis of an interesting historical data set: the correspondence network from 6,600 letters written to the 16th-century Dutch diplomat Daniel van der Meulen. You can find the complete guide, including a link to download the data for the examples, at the link below.

Jesse Sadler: Excel vs R: A Brief Introduction to R



13 Jobs for R users from around the world (2017-11-06)

Mon, 11/06/2017 - 13:06
To post your R job on the next post

Just visit  this link and post a new R job  to the R community.

You can post a job for  free  (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers:  please follow the links below to learn more and apply for your R job of interest:

Featured Jobs
More New R Jobs
  1. Full-Time
    Data Scientist for H Labs @ Chicago, Illinois, United States Heidrick & Struggles – Posted by Heidrick1
    Chicago Illinois, United States
    2 Nov 2017
  2. Full-Time
    Business Data Analytics Faculty Maryville University – Posted by tlorden
    St. Louis Missouri, United States
    2 Nov 2017
  3. Full-Time
    R programmer Model Answers – Posted by MichaelCheng
    Brisbane City Queensland, Australia
    30 Oct 2017
  4. Freelance
    Data Visualization Expert Nicole Van Herwaarden/IdeaConnection – Posted by NKVanHerwaarden
    Anywhere
    23 Oct 2017
  5. Freelance
    Statistician / Econometrician / R Programmer for Academic Statistical Research Academic Research – Posted by empiricus
    Anywhere
    21 Oct 2017
  6. Full-Time
    Postdoc – Norwegian institute of public health – Dept. Genetics and bioinformatics Norwegian institute of public health – Posted by universitypositions
    Oslo Oslo, Norway
    20 Oct 2017
  7. Full-Time
    Evaluation Data Analyst @ Hadley, Massachusetts, U.S. VentureWell – Posted by djscottnc
    Hadley Massachusetts, United States
    18 Oct 2017
  8. Full-Time
    Portfolio Analyst @ Dallas, Texas, United States HD Vest – Posted by jhickey94
    Dallas Texas, United States
    13 Oct 2017
  9. Freelance
    The R Foundation is looking for a volunteer to organize R’s contributed documentation R: The R Foundation – Posted by Tal Galili
    Anywhere
    13 Oct 2017
  10. Full-Time
    Lead Statistician Genomics and Biomarkers @ New Jersey, U.S. Bayer – Posted by KarenJ
    Hanover New Jersey, United States
    12 Oct 2017
  11. Full-Time
    Data Scientist Edelweiss Business Services – Posted by Aakash Gupta
    Mumbai Maharashtra, India
    20 Sep 2017
  12. Full-Time
    PROGRAMMER/SOFTWARE DEVELOPMENT ENGINEER/COMPUTATIONAL AND MACHINE LEARNING SPECIALIST fchiava
    Cambridge Massachusetts, United States
    10 Oct 2017
  13. Full-Time
    Data Scientist @ London Hiscox – Posted by alan.south
    London England, United Kingdom
    13 Sep 2017

 

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).


rstudio::conf(2018) program now available!

Mon, 11/06/2017 - 01:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

rstudio::conf 2018, the conference on all things R and RStudio, is only a few months away. Now is the time to claim your spot or grab one of the few remaining seats at Training Days!

REGISTER NOW

Whether you’re already registered or still working on it, we’re delighted today to announce the full conference schedule, so that you can plan your days in San Diego. rstudio::conf 2017 takes place January 31-Feb 3 at the Manchester Grand Hyatt, California.

This year we have over 60 talks:

  • Keynotes by Dianne Cook “To the Tidyverse and Beyond: Challenges for the Future in Data Science” and JJ Allaire “Machine Learning with R and TensorFlow”

  • 14 invited talks from outstanding speakers, innovators, and data scientists.

  • 18 contributed talks from the R community on topics like “Branding and automating your work with R Markdown”, “Reinforcement learning in Minecraft with CNTK-R”, and “Training an army of new data scientists”.

  • And 28 talks by RStudio employees on the latest developments in the tidyverse, spark, profiling, Shiny, R Markdown, databases, RStudio Connect, and more!

We also have 11 two-day workshops (for both beginners and experts!) if you want to go deep into a topic. We look forward to seeing you there!

REGISTER NOW


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

Setting fire to deployment: Heroku

Mon, 11/06/2017 - 01:00

(This article was first published on Data Imaginist, and kindly contributed to R-bloggers)

There are many ways to utilize fiery: It can be the engine behind a local running web-app, it can serve a dynamic site running on your own server, or it can serve a more demanding service by running in the cloud. In this post I’m going to focus on one way of achieving the latter, namely by deploying a fiery app to Heroku.

What’s Heroku anyways?

Heroku is a Platform-as-a-Service (PaaS), meaning it is a way of putting your code into production without having to worry too much about managing servers, scaling your platform, and keeping everything running. Heroku builds on the premise of containers, a topic that will be familiar to anyone who has looked at Docker, arguably the best-known container solution at the moment. If the concept of containers is foreign to you, here is a very short introduction:

A container is conceptually a minimal virtual machine that can be run anywhere the correct runtime has been installed. Compared to a regular virtual machine, a container is merely a thin shell around your environment, and most processes are mapped directly to processes on the host system. This means that a container requires far fewer resources and is much quicker to get up and running.

Containers are perfect for services written with R code because it makes scaling easy. R is famously single threaded, and achieving concurrency (the ability to handle multiple requests at the same time) is difficult inside a single running R process. By packaging the R code inside a container it is a simple matter of starting multiple containers if there is need for a higher throughput.

Heroku abstracts most of the container part away so you can focus on the app logic instead, but it has not really been built for R (few services are…). Out of the box Heroku should work with Node, Ruby, Java, PHP, Python, Go, Scala, and Clojure, but it is possible to make anything work by providing your own build instructions…

Before we begin

In order to play along with this tutorial you'll need a Heroku account. An account is free, but as with all free beer on the internet it comes with restrictions: there are limited resources available to you, and you can't map your app to your own domain. Still, for our little tutorial this poses no problems. If you do not wish to register at Heroku, I think you can get the gist of the tutorial by just reading along, but then again – the tutorial is pretty worthless to you if you have no intention of using Heroku.

Once registered there are two different ways to get going: One is installing the command-line tools which we will cover first, while the other is doing it from within the web interface, which we will do afterwards.

The setup

I have prepared a GitHub repository that you are free to fork or download. The repository contains 3 files of note:

  • run.R The source file defining and starting the fiery app. If the app is already defined within a package, it's a simple matter of loading the package and starting the app, but it can also, as with this example, be a complete specification of the app itself. In our example we are reusing the simple prediction service we created in the fiery v1 announcement, with a few small adjustments. The most important change is that the app is now set to run on the 0.0.0.0 host and listen on a port defined by Heroku through the PORT environment variable (a minimal sketch of such a file follows this list).
  • init.R The file defining the R dependencies. This file makes sure that your R environment is ready to run your app. It will usually contain install.packages calls and/or devtools::install_github calls to grab development versions of packages. For our example we are just grabbing the latest versions of fiery and routr from GitHub.
  • Aptfile The file defining system-level dependencies. This will generally be filled with libraries needed by your R packages. In our example, since fiery requires reqres and reqres requires xml2, we need libxml2-dev available. In general it can be hard to predict which libraries are needed, so it is often a matter of trying things out and adding the requested libraries as they pop up. As the name of the file suggests, the libraries will be installed with apt-get, so any library available there is good to go.
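
To make the Heroku-specific bits concrete, here is a minimal sketch of what a run.R along these lines could look like. It is not the repository's actual file: the trivial text response is purely illustrative, and it assumes fiery's event-based API as introduced in the v1 announcement. The part that matters for Heroku is reading PORT and binding to 0.0.0.0.

library(fiery)

# Heroku tells the app which port to listen on through the PORT variable;
# fall back to 8080 when running locally
port <- as.integer(Sys.getenv("PORT", "8080"))

app <- Fire$new(host = "0.0.0.0", port = port)

# purely illustrative handler; the real app serves a prediction service
app$on("request", function(server, request, ...) {
  response <- request$respond()
  response$status <- 200L
  response$type <- "txt"
  response$body <- "Hello from fiery on Heroku"
  response
})

app$ignite(block = TRUE)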

Apart from our files we need a way to tell Heroku how to deal with them. This is the potentially hardest part, but thankfully it has already been solved. Chris Stefano has been iterating over a range of approaches to get R to run on Heroku and provides buildpacks to get us going.

Now, with everything we need ready, we can finally try to get this thing up and running.

Deploying from the command-line

Once you have installed the command line tools according to the instructions for your system (I used Homebrew on my Mac) and forked my repository we are ready to go. From here on there are surprisingly few steps before we get our app up and running.

The first step is to create the app on Heroku. This is done from the command line using the heroku create command. The command takes the name of the app (I've chosen fiery-demo, so you'll have to choose something different). In our case, because we don't use a language that Heroku supports out of the box, we also add the buildpack discussed above using -b.

heroku create fiery-demo -b http://github.com/virtualstaticvoid/heroku-buildpack-r.git#heroku-16
# Creating ⬢ fiery-demo... done
# Setting buildpack to http://github.com/virtualstaticvoid/heroku-buildpack-r.git#heroku-16... done
# https://fiery-demo.herokuapp.com/ | https://git.heroku.com/fiery-demo.git

Apart from setting up the app on Heroku the command will also link the current directory with the Heroku git server. This can be verified by listing the remotes.

git remote
# heroku
# origin

From now on it is a simple matter of pushing to the heroku remote to trigger a redeployment. We can simulate this by creating and committing an empty file.

touch triggerfile
git add .
git commit -am "trigger a deploy"
git push heroku master

With the Git integration and new Terminal pane in RStudio, managing and deploying fiery apps from within a native R development environment is pretty easy. The only small annoyance is that the git integration does not support pushing to multiple different remotes so you’ll have to do the push to the heroku remote manually (this could change in the future of course).

Using the web interface

While using the command line is in no way difficult, Heroku also provides a web interface for setting up your app. In the following I’ll show what it takes to replicate what we did in the command line (more or less):

When you log in to Heroku for the first time you’ll be greeted with the following screen:

(if you already have something up and running you’ll see a list of your apps).

As R is not part of the exclusive club of languages that Heroku understands, we must start from scratch by clicking Create New App. We will now be asked for a name for the app as well as the location of the server it should run on (US or Europe). You generally want your server to be as close to your users as possible, but for our little demo it really doesn't matter.

As you can see, it also lets you add the app to a pipeline. Pipelines are used for orchestrating multiple apps that should work together and are outside the scope of this tutorial (there is nothing R-specific about their use). Once you hit Create App you are taken to the Deploy settings for the app. Scrolling down a bit, you'll find a section where you can specify how to get your code into the app.

I’ll be showing how you can link it up to a GitHub repository, but as you can see it is also possible to link it to e.g. a Dropbox folder (in that case you’ll have to manually redeploy every time you change code in your app).

Once you’ve found the repository containing the app and linked to it you’ll be able to set additional settings, such as automatically redeploying when you push to a branch etc.

If we click on Deploy Branch now we’ll get an error, as Heroku has yet to be told how to deal with R code.

To fix this we’ll need to point Heroku to our buildpack. This is done under the Settings pane, where we will add the same buildpack as we used with the console.

After this we can switch back to the Deploy pane and hit Deploy Branch again and see the deployment start and, hopefully, succeed.

If you have activated Automatic Deploys you'll just have to push to your GitHub branch to update your app, and you can thus, as with the command-line setup, manage your app completely from within your R development environment.

Wrapping up

While the demo app we have used is extremely minimal, the approach we have seen scales to any type of app, whether it is a minimal service API or a full webpage. As your app grows I would encourage you to move as much code as possible into a package so you can document and unit test it, and so that run.R does not grow unwieldy.

Another consideration when growing your application is the limited resources available in the free tier. If you begin to build a more complex app you might run into the limitations and need to shell out some money to get decent performance. As with everything on the internet there is no free beer…

While I may add a clear specification for defining fiery apps in the future and provide buildpacks specifically for that, the approach shown above is general and will continue to work regardless of how fiery continues to evolve.

In a later post I will show how to use Docker in much the same way as we have used Heroku in this post – stay tuned…


To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist.

EARL Presentation on HR Analytics: Using ML to Predict Employee Turnover

Mon, 11/06/2017 - 01:00

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

The EARL Boston 2017 conference was held November 1 – 3 in Boston, Mass. There were some excellent presentations illustrating how R is being embraced in enterprises, especially in the financial and pharmaceutical industries. Matt Dancho, founder of Business Science, presented on using machine learning to predict and explain employee turnover, a hot topic in HR! We’ve uploaded the HR Analytics presentation to YouTube. Check out the presentation, and don’t forget to follow us on social media to stay up on the latest Business Science news, events and information!

EARL Boston 2017 Presentation

If you’re interested in HR Analytics and R applications in business, check out our 30 minute presentation from EARL Boston 2017! We talk about:

  • HR Analytics: Using Machine Learning for Employee Turnover Prediction and Explanation
  • Using H2O for automated machine learning
  • Using LIME for feature importance of black-box (non-linear) models such as neural networks, ensembles, and random forests

The code for the tutorial can be found in our HR Analytics article.
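
To give a rough flavour of the two techniques named in the list above, here is a hedged sketch, not the presentation's actual code. It assumes recent versions of h2o and lime (with lime's built-in support for H2O models), and the attrition_df data frame with an Attrition outcome column is an assumption made purely for illustration.

library(h2o)
library(lime)

h2o.init()

# hypothetical data frame of employee records with an 'Attrition' outcome
attrition_h2o <- as.h2o(attrition_df)
splits <- h2o.splitFrame(attrition_h2o, ratios = 0.8, seed = 123)

# automated machine learning: h2o trains many models and keeps a leaderboard
automl <- h2o.automl(
  y = "Attrition",
  training_frame = splits[[1]],
  max_runtime_secs = 60
)
leader <- automl@leader

# LIME: explain individual predictions of the black-box leader model
train_df <- as.data.frame(splits[[1]])
test_df  <- as.data.frame(splits[[2]])
features <- setdiff(names(train_df), "Attrition")

explainer   <- lime(train_df[, features], leader)
explanation <- explain(test_df[1:4, features], explainer, n_labels = 1, n_features = 4)
plot_features(explanation)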

Download Presentation and Code on GitHub

The slide deck and code from the EARL Boston 2017 presentation can be downloaded from the Business Science GitHub site.

Download the EARL Boston 2017 Presentation Slides!

About Business Science

Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business applications. We help businesses that seek to add this competitive advantage but may not have the resources currently to implement predictive analytics. Business Science works with clients primarily in small to medium size businesses, guiding these organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!

Follow Business Science on Social Media

To leave a comment for the author, please follow the link and comment on their blog: business-science.io - Articles.

pinp 0.0.4: Small tweak

Sun, 11/05/2017 - 17:34

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance release of our pinp package for snazzier one or two column vignettes is now on CRAN as of yesterday.

In version 0.0.3, we disabled the default \pnasbreak command we inherit from the PNAS LaTeX style. That change turns out to have been too drastic. So we reverted yet added a new YAML front-matter option skip_final_break which, if set to TRUE, will skip this break. With a default value of FALSE we maintain prior behaviour.
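
For anyone wondering where that option goes: it is a regular YAML front-matter field in the R Markdown document rendered with pinp. A rough sketch of such a header (not an actual vignette from the package) could look like this:

---
title: "A pinp vignette"
output: pinp::pinp
skip_final_break: TRUE
---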

A screenshot of the package vignette can be seen below. Additional screenshots are at the pinp page.

The NEWS entry for this release follows.

Changes in pinp version 0.0.4 (2017-11-04)
  • Correct NEWS headers from ‘tint’ to ‘pinp’ (#45).

  • New front-matter variable ‘skip_final_break’ skips the \pnasbreak on the final page, which is otherwise back as the default (#47).

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the pinp page. For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Rick and Morty and Tidy Data Principles (Part 3)

Sun, 11/05/2017 - 17:00

(This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers)

Motivation

The first and second parts of this analysis suggested that I did a lot of scraping and processing, and that the resulting data deserves more analysis to be put to good use. In this third and final part I'm also taking a lot of ideas from Julia Silge's blog.

In the GitHub repo of this project you will find not just the processed Rick and Morty subs, but also subs for Archer, Bojack Horseman, Gravity Falls and Stranger Things. Why? In this post I'm gonna compare the different shows.

Note: If some images appear too small on your screen you can open them in a new tab to show them in their original size.

Word Frequencies

Comparing frequencies across different shows can tell us, for example, how similar Gravity Falls is to Rick and Morty. I'll use the subtitles from the different shows that I scraped using the same procedure I used for Rick and Morty.

if (!require("pacman")) install.packages("pacman")
p_load(data.table, tidyr, tidytext, dplyr, ggplot2, viridis, ggstance, stringr, scales)
p_load_gh("dgrtwo/widyr")

rick_and_morty_subs = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/rick_and_morty_subs.csv"))
archer_subs = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/archer_subs.csv"))
bojack_horseman_subs = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/bojack_horseman_subs.csv"))
gravity_falls_subs = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/gravity_falls_subs.csv"))
stranger_things_subs = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/stranger_things_subs.csv"))

rick_and_morty_subs_tidy = rick_and_morty_subs %>% unnest_tokens(word, text) %>% anti_join(stop_words)
archer_subs_tidy = archer_subs %>% unnest_tokens(word, text) %>% anti_join(stop_words)
bojack_horseman_subs_tidy = bojack_horseman_subs %>% unnest_tokens(word, text) %>% anti_join(stop_words)
gravity_falls_subs_tidy = gravity_falls_subs %>% unnest_tokens(word, text) %>% anti_join(stop_words)
stranger_things_subs_tidy = stranger_things_subs %>% unnest_tokens(word, text) %>% anti_join(stop_words)

With this processing we can compare frequencies across different shows. Here’s an example of the top ten words for each show:

bind_cols(
  rick_and_morty_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10),
  archer_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10),
  bojack_horseman_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10),
  gravity_falls_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10),
  stranger_things_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10)
) %>%
  setNames(., c("rm_word","rm_n","a_word","a_n","bh_word","bh_n","gf_word","gf_n","st_word","st_n"))

# A tibble: 10 x 10
   rm_word rm_n a_word  a_n  bh_word bh_n gf_word gf_n st_word st_n
 1 morty   1898 archer  4548 bojack   956 mabel    457 yeah     485
 2 rick    1691 lana    2800 yeah     704 hey      453 hey      318
 3 jerry    646 yeah    1478 hey      575 ha       416 mike     271
 4 yeah     484 cyril   1473 gonna    522 stan     369 sighs    262
 5 gonna    421 malory  1462 time     451 dipper   347 uh       189
 6 summer   409 pam     1300 uh       382 gonna    345 dustin   179
 7 hey      391 god      878 na       373 time     314 lucas    173
 8 uh       331 wait     846 diane    345 yeah     293 gonna    172
 9 time     319 uh       835 todd     339 uh       265 joyce    161
10 beth     301 gonna    748 love     309 guys     244 mom      157

There are common words such as “yeah” for example.

Now I'll combine the frequencies of all the shows and plot the top 50 frequencies to see similarities with Rick and Morty:

tidy_others = bind_rows(
  mutate(archer_subs_tidy, show = "Archer"),
  mutate(bojack_horseman_subs_tidy, show = "Bojack Horseman"),
  mutate(gravity_falls_subs_tidy, show = "Gravity Falls"),
  mutate(stranger_things_subs_tidy, show = "Stranger Things")
)

frequency = tidy_others %>%
  mutate(word = str_extract(word, "[a-z]+")) %>%
  count(show, word) %>%
  rename(other = n) %>%
  inner_join(count(rick_and_morty_subs_tidy, word)) %>%
  rename(rick_and_morty = n) %>%
  mutate(other = other / sum(other),
         rick_and_morty = rick_and_morty / sum(rick_and_morty)) %>%
  ungroup()

frequency_top_50 = frequency %>%
  group_by(show) %>%
  arrange(-other, -rick_and_morty) %>%
  filter(row_number() <= 50)

ggplot(frequency_top_50, aes(x = other, y = rick_and_morty, color = abs(rick_and_morty - other))) +
  geom_abline(color = "gray40") +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.4, height = 0.4) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.5), low = "darkslategray4", high = "gray75") +
  facet_wrap(~show, ncol = 4) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none") +
  labs(title = "Comparing Word Frequencies",
       subtitle = "Word frequencies in Rick and Morty episodes versus other shows'",
       y = "Rick and Morty", x = NULL)

Now the analysis becomes interesting. Archer is a show that is basically about annoying or seducing people, presented in a way that only good writers can pull off, while Gravity Falls is about two kids who spend the summer with their grandpa. Archer doesn't share as many words with Rick and Morty as Gravity Falls does, and while Gravity Falls has about as many occurrences of "yeah" as Rick and Morty, the "summer" it talks about is the season and not Summer, Morty's sister, from Rick and Morty.

These differences, which are only noticeable if you have seen the analysed shows, suggest that we should also explore global measures of lexical variety such as mean word frequency and type-token ratios.

Before going ahead let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between Rick and Morty and the other shows?

cor.test(data = filter(frequency, show == "Archer"), ~ other + rick_and_morty)

	Pearson's product-moment correlation

data:  other and rick_and_morty
t = 63.351, df = 4651, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6648556 0.6957166
sample estimates:
      cor
0.6805879

cor.test(data = filter(frequency, show == "Bojack Horseman"), ~ other + rick_and_morty)

	Pearson's product-moment correlation

data:  other and rick_and_morty
t = 34.09, df = 4053, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4477803 0.4956335
sample estimates:
      cor
0.4720545

cor.test(data = filter(frequency, show == "Gravity Falls"), ~ other + rick_and_morty)

	Pearson's product-moment correlation

data:  other and rick_and_morty
t = 61.296, df = 3396, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7083772 0.7403234
sample estimates:
      cor
0.7247395

cor.test(data = filter(frequency, show == "Stranger Things"), ~ other + rick_and_morty)

	Pearson's product-moment correlation

data:  other and rick_and_morty
t = 22.169, df = 2278, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3868980 0.4544503
sample estimates:
      cor
0.4212582

The correlation test suggests that Rick and Morty and Gravity Falls are the most similar from the considered sample.

The end

My analysis is now complete, but the GitHub repo is open to anyone interested in using it for their own analysis. I covered mostly microanalysis, or the analysis of words as isolated units, while only providing rough bits of analysis beyond words as units, which would deserve more and longer posts.

Those who find this material useful may explore global measures. One option is to read Text Analysis with R for Students of Literature, which I reviewed some time ago.

Interesting topics to explore are hapax richness and keywords in context, which correspond to mesoanalysis, or even going for macroanalysis to do clustering, classification and topic modelling.


To leave a comment for the author, please follow the link and comment on their blog: Pachá (Batteries Included).

You beautiful, naïve, sophisticated newborn series

Sun, 11/05/2017 - 01:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

My husband and I recently started watching the wonderful series “Parks and recreation”, which was recommended to me by my fellow R-Lady Jennifer Thompson in this very convincing thread. The series was even endorsed by other R-Ladies. Jennifer told me the first two seasons are not as good as the following ones, but that it was worth it to make it through them. We actually started enjoying the humor and characters right away!

Then, this week while watching the show, one of the characters did a very basic text analysis that made me feel like imitating him for a blog post – my husband told me it was very Leslie of me to plan something while doing something else which made me very proud. I tested my idea on other Leslie fans, and they seemed to think it was a great idea… and that this post should be the beginning of a series of R-Ladies blog posts about Parks and recreation!

In this two-short-part blog post, I’ll therefore inaugurate this series, what an honor!

Less than stellar first seasons?

Jennifer told me the first two seasons were not the best ones. My brother, who is a cruel man (just kidding), told me this was a very good joke idea: you tell someone a very bad TV series starts getting awesome after a few seasons that they need to watch in order to know the characters… therefore forcing them to lose their time and sanity in front of the screen, ah! Luckily Jennifer wasn't doing that.

I said my husband and I were hooked on the series right from the first episode… what about other people? To answer this question, I downloaded IMDB ratings for the show! I first tried using R packages accessing the OMDB API, but one of them was outdated, and the up-to-date one taught me that the API no longer returns ratings by season… So I scraped the IMDB website using the rvest package, after checking that I was allowed to do so thanks to the robotstxt package.

robotstxt::paths_allowed("http://www.imdb.com/title/tt1266020/eprate")
## [1] TRUE

library("rvest")
imdb_page <- html("http://www.imdb.com/title/tt1266020/eprate")
ratings <- imdb_page %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table() %>%
  .[, 1:4]
knitr::kable(ratings[1:5, ])

#       Episode                  UserRating   UserVotes
7.13    One Last Ride: Part 2    9.7          2,407
7.12    One Last Ride: Part 1    9.6          1,934
7.40    Leslie and Ron           9.6          1,842
6.22    Moving Up: Part 2        9.4          1,273
5.14    Leslie and Ben           9.3          1,205

I just had to wrangle this table a little bit.

names(ratings)[1] <- "number"
ratings <- tidyr::separate(ratings, col = number, into = c("season", "episode"), sep = "\\.")
ratings <- dplyr::mutate(ratings, UserVotes = stringr::str_replace(UserVotes, ",", ""))
ratings <- dplyr::mutate(ratings, UserVotes = as.numeric(UserVotes))
ratings <- dplyr::mutate(ratings, episode = as.numeric(episode))

And then I could at last plot the ratings and see whether the two first seasons were not that loved! Before that I quickly assessed whether there was an ok number of votes for all episodes, and that this did not vary widely over seasons, which was the case.

library("ggplot2")
library("hrbrthemes")
ggplot(ratings, aes(episode, UserRating, color = season)) +
  geom_smooth() +
  geom_point() +
  xlab("IMDB rating") +
  viridis::scale_color_viridis(discrete = TRUE) +
  theme_ipsum(base_size = 20, axis_title_size = 20)

Note that for the first season there are too few points to actually do any smoothing. Looking at the plot, I guess we can say the first season, with its only 6 episodes, was slightly less enjoyed! I am very thankful that the first season was deemed good enough to continue the series though! Besides, I am thankful that it was so simple to answer this crucial question.

Leslie’s wordcloud

Warning: the wordcloud produced here might be a spoiler if you have not watched the first 4 seasons!

In the 10th episode of season 4, Tom produces a wordcloud of Leslie's emails and memos. I cannot really do the same, but was inspired to make a wordcloud of things said in the episodes by downloading all subtitles from the first 4 seasons and using the fantastic subtools package. Note that a wordcloud is probably not the coolest thing one can do with all these text data… but it is a start and I truly hope it inspires more people to use François Keck's package, which is really user-friendly since it can read all the subtitles of a series at once from a folder where each sub-folder is a season. It also supports conversion to formats suitable for text analysis!

The single problem I experienced when doing this work was that I had to download subtitles from the internet and the websites where you find those are not cute places, they’re full of “women to date”…

To make the wordcloud I very simply followed this blog post of François Keck’s. It doesn’t get easier than that… which is great given I have very little time at the moment.

library("subtools")
library("tm")
library("SnowballC")
library("wordcloud")

parks_and_rec <- read.subtitles.serie(dir = paste0(getwd(), "/data/pr/"))
## Read: 4 seasons, 68 episodes

corpus <- tmCorpus(parks_and_rec)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

TDM <- TermDocumentMatrix(corpus)
TDM <- as.matrix(TDM)
vec.season <- c(rep(1, 6), rep(2, 24), rep(3, 16), rep(4, 22))
TDM.season <- t(apply(TDM, 1, function(x) tapply(x, vec.season, sum)))
colnames(TDM.season) <- paste("S", 1:4)

set.seed(1)
comparison.cloud(TDM.season, title.size = 1, max.words = 100, random.order = T)

It doesn’t look anything like Leslie’s wordcloud but it made me happy because I had already forgotten about some words or names from the previous seasons!

Another thing I did with the subtitles was filtering those starting with “Oh Ann” in order to have a look at some of the terrific compliments Leslie made to Ann… And one of them inspired the title of this post, replacing “baby” by “series”!
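
In case you want to replicate that last bit, here is a minimal sketch of the kind of filtering involved. It assumes the subtitles have already been collected into a data frame with one line of dialogue per row in a text column; the subs_df object and its column name are assumptions for illustration, not subtools' actual output format.

library(dplyr)
library(stringr)

# subs_df is a hypothetical data frame of subtitle lines with a 'text' column
oh_ann_lines <- subs_df %>%
  filter(str_detect(text, "^Oh,? Ann"))

oh_ann_lines$text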

The end of this post, the beginning of a series!

Now that I’ve done my homework I’m truly looking forward to reading the other posts. I can’t wait! Stay tuned by following my fellow bloggers/Leslie fans who came out in this Twitter thread.


To leave a comment for the author, please follow the link and comment on their blog: Maëlle.
