R news and tutorials contributed by hundreds of R bloggers
Updated: 4 hours 3 min ago

### 13 Jobs for R users from around the world (2017-11-06)

Mon, 11/06/2017 - 13:06
To post your R job on the next post

Just visit  this link and post a new R job  to the R community.

You can post a job for  free  (and there are also “featured job” options available for extra exposure).

Current R jobs

Featured Jobs
More New R Jobs
1. Full-Time
Data Scientist for H Labs @ Chicago, Illinois, United States Heidrick & Struggles – Posted by Heidrick1
Chicago Illinois, United States
2 Nov 2017
2. Full-Time
Business Data Analytics Faculty Maryville University – Posted by tlorden
St. Louis Missouri, United States
2 Nov 2017
3. Full-Time
R programmer Model Answers – Posted by MichaelCheng
Brisbane City Queensland, Australia
30 Oct 2017
4. Freelance
Data Visualization Expert Nicole Van Herwaarden/IdeaConnection – Posted by NKVanHerwaarden
Anywhere
23 Oct 2017
5. Freelance
Statistician / Econometrician / R Programmer for Academic Statistical Research Academic Research – Posted by empiricus
Anywhere
21 Oct 2017
6. Full-Time
Postdoc – Norwegian institute of public health – Dept. Genetics and bioinformatics Norwegian institute of public health – Posted by universitypositions
Oslo Oslo, Norway
20 Oct 2017
7. Full-Time
Evaluation Data Analyst @ Hadley, Massachusetts, U.S. VentureWell – Posted by djscottnc
18 Oct 2017
8. Full-Time
Portfolio Analyst @ Dallas, Texas, United States HD Vest – Posted by jhickey94
Dallas Texas, United States
13 Oct 2017
9. Freelance
The R Foundation is looking for a volunteer to organize R’s contributed documentation R: The R Foundation – Posted by Tal Galili
Anywhere
13 Oct 2017
10. Full-Time
Lead Statistician Genomics and Biomarkers @ New Jersey, U.S. Bayer – Posted by KarenJ
Hanover New Jersey, United States
12 Oct 2017
11. Full-Time
Data Scientist Edelweiss Business Services – Posted by Aakash Gupta
Mumbai Maharashtra, India
20 Sep 2017
12. Full-Time
PROGRAMMER/SOFTWARE DEVELOPMENT ENGINEER/COMPUTATIONAL AND MACHINE LEARNING SPECIALIST fchiava
Cambridge Massachusetts, United States
10 Oct 2017
13. Full-Time
Data Scientist @ London Hiscox – Posted by alan.south
London England, United Kingdom
13 Sep 2017

In  R-users.com  you can see  all  the R jobs that are currently available.

R-users Resumes

R-users also has a  resume section  which features CVs from over 300 R users. You can  submit your resume  (as a “job seeker”) or  browse the resumes  for free.

(you may also look at  previous R jobs posts ).

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### rstudio::conf(2018) program now available!

Mon, 11/06/2017 - 01:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

rstudio::conf 2018, the conference on all things R and RStudio, is only a few months away. Now is the time to claim your spot or grab one of the few remaining seats at Training Days!

REGISTER NOW

Whether you’re already registered or still working on it, we’re delighted today to announce the full conference schedule, so that you can plan your days in San Diego. rstudio::conf 2017 takes place January 31-Feb 3 at the Manchester Grand Hyatt, California.

This year we have over 60 talks:

• Keynotes by Dianne Cook “To the Tidyverse and Beyond: Challenges for the Future in Data Science” and JJ Allaire “Machine Learning with R and TensorFlow”

• 14 invited talks from outstanding speakers, innovators, and data scientists.

• 18 contributed talks from the R community on topics like “Branding and automating your work with R Markdown“, “Reinforcement learning in Minecraft with CNTK-R”, and “Training an army of new data scientists”.

• And 28 talks by RStudio employees on the latest developments in the
tidyverse,
spark
profiling,
Shiny,
R Markdown,
databases,
RStudio Connect,
and more!

We also have 11 two day workshops (for both beginners and experts!) if you want to go deep into a topic. We look forward to seeing you there!

REGISTER NOW

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Setting fire to deployment: Heroku

Mon, 11/06/2017 - 01:00

(This article was first published on Data Imaginist, and kindly contributed to R-bloggers)

There are many ways to utilize fiery: It can be the engine behind a local running web-app, it can serve a dynamic site running on your own server, or it can serve a more demanding service by running in the cloud. In this post I’m going to focus on one way of achieving the latter, namely by deploying a fiery app to Heroku.

What’s Heroku anyways?

Heroku is a Platform-as-a-Service (PaaS), meaning it is a way of putting your code in production without having to worry too much about managing servers, scaling your platform, and keeping everything running. Heroku builds on the premise of containers, a topic which will be familiar to anyone who has looked at Docker—arguably the most well known container solution at the moment. If the concept of containers are foreign to you I’ll give a very short introduction below:

A container is conceptually a minimal virtual machine that can be run anywhere the correct runtime has been installed. Compared to a regular virtual machine a container is merely a thin shell around your environment and most processes are mapped directly to processes on the host system. This means that a container requires way less ressources to get running and is much quicker to get up and running.

Containers are perfect for services written with R code because it makes scaling easy. R is famously single threaded, and achieving concurrency (the ability to handle multiple requests at the same time) is difficult inside a single running R process. By packaging the R code inside a container it is a simple matter of starting multiple containers if there is need for a higher throughput.

Heroku abstracts most of the container part away so you can focus on the app logic instead, but has not really been build for R (few services are…). Out of the box Heroku should work with Node, Ruby, Java, PHP, Python, Go, Scala, and Clojure, but it is possible to make anything work by providing your own build instructions…

Before we begin

In order to play along with this tutorial you’ll need a Heroku acount. An account is free but as with all free beer on the internet it comes with restrictions: There are limited resources available to you, and you can map your app to your own domain. Still, for our little tutorial there should be no problems. If you do not wish to register at Heroku I think you can get the gist of the tutorial by just reading along, but then again – the tutorial is pretty worthless to you if you have no intention of using Heroku.

Once registered there are two different ways to get going: One is installing the command-line tools which we will cover first, while the other is doing it from within the web interface, which we will do afterwards.

The setup

I have prepared a GitHub repository that you are free to fork or download. The repository contains 3 files of note:

• run.R The source file defining and starting the fiery app. Is the app already defined within a package its a simple matter of loading the package and starting the app, but it can also—as with this example—be a complete specification of the app itself. In our example we are reusing the simple prediction service we created in the fiery v1 announcement with a few small adjustments. The most important change is that the app is now set to run on the 0.0.0.0 host and listen on a port defined by Heroku using the PORT environment variable.
• init.R The file defining the R dependencies. This file is making sure that your R environment is ready to run your app. It will usually contain install.packages calls, and/or devtools::install_github to grab development versions of packages. For our example we are just grabbing the latest versions of fiery and routr from GitHub
• Aptfile The file defining system level dependencies. This will generally be filled with libraries needed for your R packages. In our example, since fiery requires reqres and reqres requires xml2, we need libxml2-dev available. In general it can be hard to predict which libraries are needed so it is often a matter of trying things out and adding the requested libraries as they pop up. As the name of the file suggests, the libraries will be installed with apt-get so any library available there is good to go.

Apart from our files we need a way to tell Heroku how to deal with them. This is the potentially hardest part, but thankfully it has already been solved. Chris Stefano has been iterating over a range of approaches to get R to run on Heroku and provides buildpacks to get us going.

Now, with everything we need ready, we can finally try to get this thing up and running

Deploying from the command-line

Once you have installed the command line tools according to the instructions for your system (I used Homebrew on my Mac) and forked my repository we are ready to go. From here on there are surprisingly few steps before we get our app up and running.

The first step is to create the app on Heroku. This is done from the command line using the heroku create command. The command takes the name of the app (I’ve chosen fiery-demo so you’ll have to chose something different). In our case, because we don’t use a language that Heroku supports out of the box, we also adds the buildpack discussed above using -b.

heroku create fiery-demo -b http://github.com/virtualstaticvoid/heroku-buildpack-r.git#heroku-16 # Creating ⬢ fiery-demo... done # Setting buildpack to http://github.com/virtualstaticvoid/heroku-buildpack-r.git#heroku-16... done # https://fiery-demo.herokuapp.com/ | https://git.heroku.com/fiery-demo.git

Apart from setting up the app on Heroku the command will also link the current directory with the Heroku git server. This can be verified by listing the remotes.

git remote # heroku # origin

From now on it is a simple matter of pushing to the heroku remote to trigger a redeployment. We can simulate this by creating and committing an empty file.

touch triggerfile git add . git commit -am "trigger a deploy" git push heroku master

With the Git integration and new Terminal pane in RStudio, managing and deploying fiery apps from within a native R development environment is pretty easy. The only small annoyance is that the git integration does not support pushing to multiple different remotes so you’ll have to do the push to the heroku remote manually (this could change in the future of course).

Using the web interface

While using the command line is in no way difficult, Heroku also provides a web interface for setting up your app. In the following I’ll show what it takes to replicate what we did in the command line (more or less):

When you log in to Heroku for the first time you’ll be greeted with the following screen:

(if you already have something up and running you’ll see a list of your apps).

As R is not part of the exclusive club of languages that Heroku understands we must start from scratch by clicking Create New App. We will now be asked for a name for the app as well as the location of the server it should run on (US or Europe). You generally want your server to be as close to your users, but for our little demo it really doesn’t matter.

As you can see it also lets you add the app to a pipeline. Pipelines are used for orchestrating multiple different apps that should work together and is outside the scope of this tutorial (there is nothing R-specific about their use). Once you hit Create App you are taken to the Deploy settings for the app. Scrolling a bit down you’ll find a section where you can specify how to get your code into the app.

I’ll be showing how you can link it up to a GitHub repository, but as you can see it is also possible to link it to e.g. a Dropbox folder (in that case you’ll have to manually redeploy every time you change code in your app).

Once you’ve found the repository containing the app and linked to it you’ll be able to set additional settings, such as automatically redeploying when you push to a branch etc.

If we click on Deploy Branch now we’ll get an error, as Heroku has yet to be told how to deal with R code.

To fix this we’ll need to point Heroku to our buildpack. This is done under the Settings pane, where we will add the same buildpack as we used with the console.

After this we can switch back to the Deploy pane and hit Deploy Branch again and see the deployment start and, hopefully, succeed.

If you have activated Automatic Deploys you’ll just have to push to your Github branch to update your app and you can thus, as with the command-line setup, manage your app completely from within your R development environment.

Wrapping up

While the demo app we have used is extremely minimal, the approach we have seen scales to any type of app, whether it is a minimal service API or a full webpage. As your app grows I would encourage you to move as much code into a package so you can document and unit test it, and so the run.R does not grow unwieldy large.

Another consideration when growing your application is the limited resources available in the free tier. If you begin to build a more complex app you might run into the limitations and need to shell out some money to get decent performance. As with everything on the internet there is no free beer…

While it may be possible that I will add a clear specification for defining fiery apps in the future and provide buildpacks for this specifically, the approach shown above is general and continue to work irrelevant of how fiery continues to evolve.

In a later post I will show how to use Docker in much the same way as we have used Heroku in this post – stay tuned…

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### EARL Presentation on HR Analytics: Using ML to Predict Employee Turnover

Mon, 11/06/2017 - 01:00

The EARL Boston 2017 conference was held November 1 – 3 in Boston, Mass. There were some excellent presentations illustrating how R is being embraced in enterprises, especially in the financial and pharmaceutical industries. Matt Dancho, founder of Business Science, presented on using machine learning to predict and explain employee turnover, a hot topic in HR! We’ve uploaded the HR Analytics presentation to YouTube. Check out the presentation, and don’t forget to follow us on social media to stay up on the latest Business Science news, events and information!

EARL Boston 2017 Presentation

If you’re interested in HR Analytics and R applications in business, check out our 30 minute presentation from EARL Boston 2017! We talk about:

• HR Analytics: Using Machine Learning for Employee Turnover Prediction and Explanation
• Using H2O for automated machine learning
• Using LIME for feature importance of black-box (non-linear) models such as neural networks, ensembles, and random forests

The code for the tutorial can be found in our HR Analytics article.

The slide deck and code from the EARL Boston 2017 presentation can be downloaded from the Business Science GitHub site.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### pinp 0.0.4: Small tweak

Sun, 11/05/2017 - 17:34

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance release of our pinp package for snazzier one or two column vignettes is now on CRAN as of yesterday.

In version 0.0.3, we disabled the default \pnasbreak command we inherit from the PNAS LaTeX style. That change turns out to have been too drastic. So we reverted yet added a new YAML front-matter option skip_final_break which, if set to TRUE, will skip this break. With a default value of FALSE we maintain prior behaviour.

A screenshot of the package vignette can be seen below. Additional screenshots of are at the pinp page.

The NEWS entry for this release follows.

Changes in pinp version 0.0.4 (2017-11-04)
• Correct NEWS headers from ‘tint’ to ‘pinp’ (#45).

• New front-matter variables ‘skip_final_break’ skips the \pnasbreak on final page which back as default (#47).

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page. For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Rick and Morty and Tidy Data Principles (Part 3)

Sun, 11/05/2017 - 17:00

(This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers)

Motivation

The first and second part of this analysis gave the idea that I did too much scrapping and processing and that deserves more analysis to use that information well. In this third and final part I’m also taking a lot of ideas from Julia Silge‘s blog.

In the GitHub repo of this project you shall find not just Rick and Morty processed subs, but also for Archer, Bojack Horseman, Gravity Falls and Stranger Things. Why? In post post I’m gonna compare the different shows.

Note: If some images appear too small on your screen you can open them in a new tab to show them in their original size.

Word Frequencies

Comparing frequencies across different shows can tell us how similar Gravity Falls, for example, is similar to Rick and Morty. I’ll use the subtitles from different shows that I scraped using the same procedure I did with Rick and Morty.

if (!require("pacman")) install.packages("pacman") p_load(data.table,tidyr,tidytext,dplyr,ggplot2,viridis,ggstance,stringr,scales) p_load_gh("dgrtwo/widyr") rick_and_morty_subs = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/rick_and_morty_subs.csv")) archer_subs = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/archer_subs.csv")) bojack_horseman_subs = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/bojack_horseman_subs.csv")) gravity_falls_subs = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/gravity_falls_subs.csv")) stranger_things_subs = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/stranger_things_subs.csv")) rick_and_morty_subs_tidy = rick_and_morty_subs %>% unnest_tokens(word,text) %>% anti_join(stop_words) archer_subs_tidy = archer_subs %>% unnest_tokens(word,text) %>% anti_join(stop_words) bojack_horseman_subs_tidy = bojack_horseman_subs %>% unnest_tokens(word,text) %>% anti_join(stop_words) gravity_falls_subs_tidy = gravity_falls_subs %>% unnest_tokens(word,text) %>% anti_join(stop_words) stranger_things_subs_tidy = stranger_things_subs %>% unnest_tokens(word,text) %>% anti_join(stop_words)

With this processing we can compare frequencies across different shows. Here’s an example of the top ten words for each show:

bind_cols(rick_and_morty_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10), archer_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10), bojack_horseman_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10), gravity_falls_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10), stranger_things_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10)) %>% setNames(., c("rm_word","rm_n","a_word","a_n","bh_word","bh_n","gf_word","gf_n","st_word","st_n")) # A tibble: 10 x 10 rm_word rm_n a_word a_n bh_word bh_n gf_word gf_n st_word st_n 1 morty 1898 archer 4548 bojack 956 mabel 457 yeah 485 2 rick 1691 lana 2800 yeah 704 hey 453 hey 318 3 jerry 646 yeah 1478 hey 575 ha 416 mike 271 4 yeah 484 cyril 1473 gonna 522 stan 369 sighs 262 5 gonna 421 malory 1462 time 451 dipper 347 uh 189 6 summer 409 pam 1300 uh 382 gonna 345 dustin 179 7 hey 391 god 878 na 373 time 314 lucas 173 8 uh 331 wait 846 diane 345 yeah 293 gonna 172 9 time 319 uh 835 todd 339 uh 265 joyce 161 10 beth 301 gonna 748 love 309 guys 244 mom 157

There are common words such as “yeah” for example.

Now I’ll combine the frequencies of all the shows and I’ll plot the top 50 frequencies to see similitudes with Rick and Morty:

tidy_others = bind_rows(mutate(archer_subs_tidy, show = "Archer"), mutate(bojack_horseman_subs_tidy, show = "Bojack Horseman"), mutate(gravity_falls_subs_tidy, show = "Gravity Falls"), mutate(stranger_things_subs_tidy, show = "Stranger Things")) frequency = tidy_others %>% mutate(word = str_extract(word, "[a-z]+")) %>% count(show, word) %>% rename(other = n) %>% inner_join(count(rick_and_morty_subs_tidy, word)) %>% rename(rick_and_morty = n) %>% mutate(other = other / sum(other), rick_and_morty = rick_and_morty / sum(rick_and_morty)) %>% ungroup() frequency_top_50 = frequency %>% group_by(show) %>% arrange(-other,-rick_and_morty) %>% filter(row_number() <= 50) ggplot(frequency_top_50, aes(x = other, y = rick_and_morty, color = abs(rick_and_morty - other))) + geom_abline(color = "gray40") + geom_jitter(alpha = 0.1, size = 2.5, width = 0.4, height = 0.4) + geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) + scale_x_log10(labels = percent_format()) + scale_y_log10(labels = percent_format()) + scale_color_gradient(limits = c(0, 0.5), low = "darkslategray4", high = "gray75") + facet_wrap(~show, ncol = 4) + theme_minimal(base_size = 14) + theme(legend.position="none") + labs(title = "Comparing Word Frequencies", subtitle = "Word frequencies in Rick and Morty episodes versus other shows'", y = "Rick and Morty", x = NULL)

Now the analysis becomes interesting. Archer is a show that is basically about annoy or seduce presented in a way that good writers can and Gravity Falls is about two kids who spend summer with their granpa. Archer doesn’t have as many shared words as Gravity Falls and Rick and Morty do, while Gravity Falls has as many “yeah” as Rick and Morty the summer they talk about is the season and not Rick’s sister from Rick and Morty.

What is only noticeable if you have seen the analysed shows suggests that we should explore global measures of lexical variety such as mean word frequency and type-token ratios.

Before going ahead let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between Rick and Morty and the other shows?

cor.test(data = filter(frequency, show == "Archer"), ~ other + rick_and_morty) Pearson's product-moment correlation data: other and rick_and_morty t = 63.351, df = 4651, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.6648556 0.6957166 sample estimates: cor 0.6805879 cor.test(data = filter(frequency, show == "Bojack Horseman"), ~ other + rick_and_morty) Pearson's product-moment correlation data: other and rick_and_morty t = 34.09, df = 4053, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.4477803 0.4956335 sample estimates: cor 0.4720545 cor.test(data = filter(frequency, show == "Gravity Falls"), ~ other + rick_and_morty) Pearson's product-moment correlation data: other and rick_and_morty t = 61.296, df = 3396, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.7083772 0.7403234 sample estimates: cor 0.7247395 cor.test(data = filter(frequency, show == "Stranger Things"), ~ other + rick_and_morty) Pearson's product-moment correlation data: other and rick_and_morty t = 22.169, df = 2278, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.3868980 0.4544503 sample estimates: cor 0.4212582

The correlation test suggests that Rick and Morty and Gravity Falls are the most similar from the considered sample.

The end

My analysis is now complete but the GitHub repo is open to anyone interested in using it for his/her own analysis. I covered mostly microanalysis, or words analysis as isolated units, while providing rusty bits of analysis beyond words as units that would deserve more and longer posts.

Those who find in this a useful material may explore global measures. One option is to read Text Analysis with R for Students of Literature that I’ve reviewed some time ago.

Interesting topics to explore are Hapax richness and keywords in context that correspond to mesoanalysis or even going for macroanalysis to do clustering, classification and topic modelling.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Pachá (Batteries Included). R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### You beautiful, naïve, sophisticated newborn series

Sun, 11/05/2017 - 01:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

My husband and I recently started watching the wonderful series “Parks and recreation” which was recommended to me by my fellow R-Lady Jennifer Thompson in this very convincing thread. The serie was even endorsed by other R-Ladies. Jennifer told me the first two seasons are not as good as the following ones, but that it was worth it to make it through them. We actually started enjoying the humor and characters right away!

Then, this week while watching the show, one of the characters did a very basic text analysis that made me feel like imitating him for a blog post – my husband told me it was very Leslie of me to plan something while doing something else which made me very proud. I tested my idea on other Leslie fans, and they seemed to think it was a great idea… and that this post should be the beginning of a series of R-Ladies blog posts about Parks and recreation!

In this two-short-part blog post, I’ll therefore inaugurate this series, what an honor!

Less than stellar first seasons?

Jennifer told me the first two seasons were not the best ones. My brother who is a cruel man (just kidding) told me this was a very good joke idea: you tell someone a very bad TV series starts getting awesome after a few seasons that you need to watch in order to know the characters… therefore forcing them to loose their time and sanity in front of the screen, ah! Luckily Jennifer wasn’t doing that.

I said my husband and I were hooked on the series right from the first episode… what about other people? To answer this question, I downloaded IMDB ratings for the show! I first tried using R packages accessing the OMDB API, but this one was outdated and this up-to-date one was not but taught me the API no longer returns ratings by season… So I scraped the IMDB website using the rvest package, after checking I was allowed to do so thanks to the robotstxt package.

robotstxt::paths_allowed("http://www.imdb.com/title/tt1266020/eprate") ## [1] TRUE library("rvest") imdb_page <- html("http://www.imdb.com/title/tt1266020/eprate") ratings <- imdb_page %>% html_nodes("table") %>% .[[1]] %>% html_table() %>% .[,1:4] knitr::kable(ratings[1:5,]) # Episode UserRating UserVotes 7.13 One Last Ride: Part 2 9.7 2,407 7.12 One Last Ride: Part 1 9.6 1,934 7.40 Leslie and Ron 9.6 1,842 6.22 Moving Up: Part 2 9.4 1,273 5.14 Leslie and Ben 9.3 1,205

I just had to wrangle this table a little bit.

names(ratings)[1] <- "number" ratings <- tidyr::separate(ratings, col = number, into = c("season", "episode"), sep = "\\.") ratings <- dplyr::mutate(ratings, UserVotes = stringr::str_replace(UserVotes, ",", "")) ratings <- dplyr::mutate(ratings, UserVotes = as.numeric(UserVotes)) ratings <- dplyr::mutate(ratings, episode = as.numeric(episode))

And then I could at last plot the ratings and see whether the two first seasons were not that loved! Before that I quickly assessed whether there was an ok number of votes for all episodes, and that this did not vary widely over seasons, which was the case.

library("ggplot2") library("hrbrthemes") ggplot(ratings, aes(episode, UserRating, color = season)) + geom_smooth()+ geom_point() + xlab("IMDB rating") + viridis::scale_color_viridis(discrete = TRUE) + theme_ipsum(base_size = 20, axis_title_size = 20)

Note that for the first season, there are too few points for actually doing a smoothing. Looking at the plot I guess we can say the first season with its only 6 episodes was slightly less enjoyed! I am very thankful that the first season was deemed good enough to continue the series though! Besides I am thankful that it was so simple to answer this crucial question.

Leslie’s wordcloud

Warning: the wordcloud produced here might be a spoiler if you have not watched the 4 first seasons!

In the 10th episode of season 4, Tom produces a wordcloud of Leslie’s emails and memos. I cannot really do the same, but was inspired to make a worldcloud of things said in the episodes by downloading all subtitles from the 4 first seasons and using the fantastic subtools package. Note that a worldcloud is probably not the coolest thing one can do with all these text data… but it is a start and I truly hope it inspires more people to use François Keck’s package which is really user-friendly since it can read all the subtitles of a serie at once from a folder where each sub-folder is a season. It also supports conversion to formats suitable for text analysis!

The single problem I experienced when doing this work was that I had to download subtitles from the internet and the websites where you find those are not cute places, they’re full of “women to date”…

To make the wordcloud I very simply followed this blog post of François Keck’s. It doesn’t get easier than that… which is great given I have very little time at the moment.

library("subtools") library("tm") library("SnowballC") library("wordcloud") parks_and_rec <- read.subtitles.serie(dir = paste0(getwd(), "/data/pr/")) ## Read: 4 seasons, 68 episodes corpus <- tmCorpus(parks_and_rec) corpus <- tm_map(corpus, content_transformer(tolower)) corpus <- tm_map(corpus, removePunctuation) corpus <- tm_map(corpus, removeNumbers) corpus <- tm_map(corpus, removeWords, stopwords("english")) corpus <- tm_map(corpus, stripWhitespace) TDM <- TermDocumentMatrix(corpus) TDM <- as.matrix(TDM) vec.season <- c(rep(1, 6), rep(2, 24), rep(3, 16), rep(4, 22)) TDM.season <- t(apply(TDM, 1, function(x) tapply(x, vec.season, sum))) colnames(TDM.season) <- paste("S", 1:4) set.seed(1) comparison.cloud(TDM.season, title.size = 1, max.words = 100, random.order = T)

It doesn’t look anything like Leslie’s wordcloud but it made me happy because I had already forgotten about some words or names from the previous seasons!

Another thing I did with the subtitles was filtering those starting with “Oh Ann” in order to have a look at some of the terrific compliments Leslie made to Ann… And one of them inspired the title of this post, replacing “baby” by “series”!

The end of this post, the beginning a series!

Now that I’ve done my homework I’m truly looking forward to reading the other posts. I can’t wait! Stay tuned by following my fellow bloggers/Leslie fans who came out in this Twitter thread.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Maëlle. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### tint 0.0.4: Small enhancements

Sat, 11/04/2017 - 02:02

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance release of the tint package arrived on CRAN earlier today. Its name expands from tint is not tufte as the package offers a fresher take on the Tufte-style for html and pdf presentations.

A screenshot of the pdf variant is below.

This release brings some minor enhancements and polish, mostly learned from having done the related pinp (two-column vignette in the PNAS style) and linl (LaTeX letter) RMarkdown-wrapper packages; see below for details from the NEWS.Rd file.

Changes in tint version 0.0.4 (2017-11-02)
• Skeleton files are also installed as vignettes (#20).

• A reference to the Tufte source file now points to tint (Ben Marwick in #19, later extended to other Rmd files).

• Several spelling and grammar errors were corrected too (#13 and #16 by R. Mark Sharp and Matthew Henderson)

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page.

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Taming exam results in pdf with pdftools

Fri, 11/03/2017 - 23:06

(This article was first published on R – Tales of R, and kindly contributed to R-bloggers)

You can also check this post, written in #blogdown, here: taming-exam-results-with-pdf.

Introduction

There are several ways to mine tables and other content from a pdf, using R. After a lot of trial & error, here’s how I managed to extract global exam results from an international, massive, yearly examination, the EDAIC.

This is my first use case of “pdf mining” with R, and also a fairly simple one. However, more complex and very fine examples of this can be found elsewhere, using both pdftools and tabulizer packages.

As can be seen from the original pdf, exam results are anonymous. They consist on a numeric, 6-digit code and a binary result: “FAIL / PASS”. I was particularly interested into seeing how many of them passed the exam, as some indirect measure of how “hard” it can be.

Mining the table

In this case I preferred pdftools as it allowed me to extract the whole content from the pdf:

install.packages("pdftools") library(pdftools) txt <- pdf_text("EDAIC.pdf") txt[1] class(txt[1]) [1] "EDAIC Part I 2017 Overall Results\n Candidate N° Result\n 107131 FAIL\n 119233 PASS\n 123744 FAIL\n 127988 FAIL\n 133842 PASS\n 135692 PASS\n 140341 FAIL\n 142595 FAIL\n 151479 PASS\n 151632 PASS\n 152787 PASS\n 157691 PASS\n 158867 PASS\n 160211 PASS\n 161970 FAIL\n 162536 PASS\n 163331 PASS\n 164442 FAIL\n 164835 PASS\n 165734 PASS\n 165900 PASS\n 166469 PASS\n 167241 FAIL\n 167740 PASS\n 168151 FAIL\n 168331 PASS\n 168371 FAIL\n 168711 FAIL\n 169786 PASS\n 170721 FAIL\n 170734 FAIL\n 170754 PASS\n 170980 PASS\n 171894 PASS\n 171911 PASS\n 172047 FAIL\n 172128 PASS\n 172255 FAIL\n 172310 PASS\n 172706 PASS\n 173136 FAIL\n 173229 FAIL\n 174336 PASS\n 174360 PASS\n 175177 FAIL\n 175180 FAIL\n 175184 FAIL\nYour candidate number is indicated on your admission document Page 1 of 52\n" [1] "character"

These commands return a lenghty blob of text. Fortunately, there are some \n symbols that signal the new lines in the original document.

We will use these to split the blob into something more approachable, using tidyversal methods…

• Split the blob.
• Transform the resulting list into a character vector with unlist.
• Trim leading white spaces with stringr::str_trim.
library(tidyverse) library(stringr) tx2 <- strsplit(txt, "\n") %>% # divide by carriage returns unlist() %>% str_trim(side = "both") # trim white spaces tx2[1:10] [1] "EDAIC Part I 2017 Overall Results" [2] "Candidate N° Result" [3] "107131 FAIL" [4] "119233 PASS" [5] "123744 FAIL" [6] "127988 FAIL" [7] "133842 PASS" [8] "135692 PASS" [9] "140341 FAIL" [10] "142595 FAIL"
• Remove the very first row.
• Transform into a tibble.
tx3 <- tx2[-1] %>% data_frame() tx3 # A tibble: 2,579 x 1 . <chr> 1 Candidate N° Result 2 107131 FAIL 3 119233 PASS 4 123744 FAIL 5 127988 FAIL 6 133842 PASS 7 135692 PASS 8 140341 FAIL 9 142595 FAIL 10 151479 PASS # ... with 2,569 more rows
• Use tidyr::separate to split each row into two columns.
• Remove all spaces.
tx4 <- separate(tx3, ., c("key", "value"), " ", extra = "merge") %>% mutate(key = gsub('\\s+', '', key)) %>% mutate(value = gsub('\\s+', '', value)) tx4 # A tibble: 2,579 x 2 key value 1 Candidate N°Result 2 107131 FAIL 3 119233 PASS 4 123744 FAIL 5 127988 FAIL 6 133842 PASS 7 135692 PASS 8 140341 FAIL 9 142595 FAIL 10 151479 PASS # ... with 2,569 more rows
• Remove rows that do not represent table elements.
tx5 <- tx4[grep('^[0-9]', tx4[[1]]),] tx5 # A tibble: 2,424 x 2 key value 1 107131 FAIL 2 119233 PASS 3 123744 FAIL 4 127988 FAIL 5 133842 PASS 6 135692 PASS 7 140341 FAIL 8 142595 FAIL 9 151479 PASS 10 151632 PASS # ... with 2,414 more rows Extracting the results

We already have the table! now it’s time to get to the summary:

library(knitr) tx5 %>% group_by(value) %>% summarise (count = n()) %>% mutate(percent = paste( round( (count / sum(count)*100) , 1), "%" )) %>% kable() value count percent FAIL 1017 42 % PASS 1407 58 %

From these results we see that the EDAIC-Part1 exam doesn’t have a particularly high clearance rate. It is currently done by medical specialists, but its dificulty relies in a very broad list of subjects covered, ranging from topics in applied physics, the entire human physiology, pharmacology, clinical medicine and latest guidelines.

Despite being a hard test to pass -and also the exam fee-, it’s becoming increasingly popular among anesthesiologists and critical care specialists that wish to stay up-to date with the current medical knowledge and practice.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Tales of R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### New RStudio cheat sheet: Strings in R

Fri, 11/03/2017 - 19:41

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The RStudio team has created another very useful cheat sheet for R: Working with Strings. This cheat sheet provides an example-laden menu of operations you can perform on strings (character verctors) in R using the stringr package. While base R provides a solid set of string manipulation functions, the stringr package functions are simpler, more consistent (making them easy to use with the pipe operator), and more like the Ruby or Python way of handling string operations.

The back page of the cheat sheet also provides a handy guide to regular expressions a useful (if often frustrating) tool for extracting substrings according to a pattern you define with codes like these:

You can pick up the Working with Strings cheat sheet, and others from the series, at the link below.

RStudio: Cheat Sheets

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### How to identify risky bank loans using C.50 decision trees

Fri, 11/03/2017 - 17:31

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

This tutorial has been taken from Machine Learning with R Second Edition   by Brett Lantz. Use the code  MLR250RB at the checkout to save 50% on the RRP.

Or pick up this title with 4 others for just $50 – get the R-Bloggers bundle. The global financial crisis of 2007-2008 highlighted the importance of transparency and rigor in banking practices. As the availability of credit was limited, banks tightened their lending systems and turned to machine learning to more accurately identify risky loans. Decision trees are widely used in the banking industry due to their high accuracy and ability to formulate a statistical model in plain language. Since government organizations in many countries carefully monitor lending practices, executives must be able to explain why one applicant was rejected for a loan while the others were approved. This information is also useful for customers hoping to determine why their credit rating is unsatisfactory. It is likely that automated credit scoring models are employed to instantly approve credit applications on the telephone and web. In this tutorial, we will develop a simple credit approval model using C5.0 decision trees. We will also see how the results of the model can be tuned to minimize errors that result in a financial loss for the institution. Step 1 – collecting data The idea behind our credit model is to identify factors that are predictive of higher risk of default. Therefore, we need to obtain data on a large number of past bank loans and whether the loan went into default, as well as information on the applicant. Data with these characteristics is available in a dataset donated to the UCI Machine Learning Data Repository by Hans Hofmann of the University of Hamburg. The data set contains information on loans obtained from a credit agency in Germany. The data set presented in this tutorial has been modified slightly from the original in order to eliminate some preprocessing steps. To follow along with the examples, download the credit.csv file from Packt’s website and save it to your R working directory. Simply click here and then click ‘code files’ beneath the cover image. The credit data set includes 1,000 examples on loans, plus a set of numeric and nominal features indicating the characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default. Let’s see whether we can determine any patterns that predict this outcome. Step 2 – exploring and preparing the data As we did previously, we will import data using the read.csv() function. We will ignore the stringsAsFactors option and, therefore, use the default value of TRUE, as the majority of the features in the data are nominal: > credit <- read.csv("credit.csv") The first several lines of output from the str() function are as follows: > str(credit) 'data.frame':1000 obs. of 17 variables:$ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",.. $months_loan_duration: int 6 48 12 ...$ credit_history : Factor w/ 5 levels "critical","good",.. $purpose : Factor w/ 6 levels "business","car",..$ amount : int 1169 5951 2096 ...

We see the expected 1,000 observations and 17 features, which are a combination of factor and integer data types.

Let’s take a look at the table() output for a couple of loan features that seem likely to predict a default. The applicant’s checking and savings account balance are recorded as categorical variables:

> table(credit$checking_balance) < 0 DM > 200 DM 1 - 200 DM unknown 274 63 269 394 > table(credit$savings_balance) < 100 DM > 1000 DM 100 - 500 DM 500 - 1000 DM unknown 603 48 103 63 183

The checking and savings account balance may prove to be important predictors of loan default status. Note that since the loan data was obtained from Germany, the currency is recorded in Deutsche Marks (DM).

Some of the loan’s features are numeric, such as its duration and the amount of credit requested:

> summary(credit$months_loan_duration) Min. 1st Qu. Median Mean 3rd Qu. Max. 4.0 12.0 18.0 20.9 24.0 72.0 > summary(credit$amount) Min. 1st Qu. Median Mean 3rd Qu. Max. 250 1366 2320 3271 3972 18420

The loan amounts ranged from 250 DM to 18,420 DM across terms of 4 to 72 months with a median duration of 18 months and an amount of 2,320 DM.

The default vector indicates whether the loan applicant was unable to meet the agreed payment terms and went into default. A total of 30 percent of the loans in this dataset went into default:

> table(credit$default) no yes 700 300 A high rate of default is undesirable for a bank, because it means that the bank is unlikely to fully recover its investment. If we are successful, our model will identify applicants that are at high risk to default, allowing the bank to refuse credit requests. Data preparation: Creating random training and test data sets In this example, we’lll split our data into two portions: a training dataset to build the decision tree and a test dataset to evaluate the performance of the model on new data. We will use 90 percent of the data for training and 10 percent for testing, which will provide us with 100 records to simulate new applicants. We’ll use a random sample of the credit data for training. A random sample is simply a process that selects a subset of records at random. In R, the sample() function is used to perform random sampling. However, before putting it in action, a common practice is to set a seed value, which causes the randomization process to follow a sequence that can be replicated later on if desired. It may seem that this defeats the purpose of generating random numbers, but there is a good reason for doing it this way. Providing a seed value via the set.seed() function ensures that if the analysis is repeated in the future, an identical result is obtained. The following commands use the sample() function to select 900 values at random out of the sequence of integers from 1 to 1000. Note that the set.seed() function uses the arbitrary value 123. Omitting this seed will cause your training and testing split to differ from those shown in the remainder of this tutorial: > set.seed(123) > train_sample <- sample(1000, 900) As expected, the resulting train_sample object is a vector of 900 random integers: > str(train_sample) int [1:900] 288 788 409 881 937 46 525 887 548 453 ... By using this vector to select rows from the credit data, we can split it into the 90 percent training and 10 percent test datasets we desired. Recall that the dash operator used in the selection of the test records tells R to select records that are not in the specified rows; in other words, the test data includes only the rows that are not in the training sample. > credit_train <- credit[train_sample, ] > credit_test <- credit[-train_sample, ] If all went well, we should have about 30 percent of defaulted loans in each of the datasets: > prop.table(table(credit_train$default)) no yes 0.7033333 0.2966667 > prop.table(table(credit_test$default)) no yes 0.67 0.33 This appears to be a fairly even split, so we can now build our decision tree. Tip: If your results do not match exactly, ensure that you ran the command set.seed(123) immediately prior to creating the train_sample vector. Step 3: Training a model on the data We will use the C5.0 algorithm in the C50 package to train our decision tree model. If you have not done so already, install the package with install.packages(“C50”) and load it to your R session, using library(C50). The following syntax box lists some of the most commonly used commands to build decision trees. Compared to the machine learning approaches we used previously, the C5.0 algorithm offers many more ways to tailor the model to a particular learning problem, but more options are available. Once the C50package has been loaded, the ?C5.0Control command displays the help page for more details on how to finely-tune the algorithm. For the first iteration of our credit approval model, we’ll use the default C5.0 configuration, as shown in the following code. The 17th column in credit_train is the default class variable, so we need to exclude it from the training data frame, but supply it as the target factor vector for classification: > credit_model <- C5.0(credit_train[-17], credit_train$default)

The credit_model object now contains a C5.0 decision tree. We can see some basic data about the tree by typing its name:

> credit_model Call: C5.0.default(x = credit_train[-17], y = credit_train$default) Classification Tree Number of samples: 900 Number of predictors: 16 Tree size: 57 Non-standard options: attempt to group attributes The preceding text shows some simple facts about the tree, including the function call that generated it, the number of features (labeled predictors), and examples (labeled samples) used to grow the tree. Also listed is the tree size of 57, which indicates that the tree is 57 decisions deep—quite a bit larger than the example trees we’ve considered so far! To see the tree’s decisions, we can call the summary() function on the model: > summary(credit_model) This results in the following output: The preceding output shows some of the first branches in the decision tree. The first three lines could be represented in plain language as: If the checking account balance is unknown or greater than 200 DM, then classify as “not likely to default.” Otherwise, if the checking account balance is less than zero DM or between one and 200 DM. And the credit history is perfect or very good, then classify as “likely to default.” The numbers in parentheses indicate the number of examples meeting the criteria for that decision, and the number incorrectly classified by the decision. For instance, on the first line, 412/50 indicates that of the 412 examples reaching the decision, 50 were incorrectly classified as not likely to default. In other words, 50 applicants actually defaulted, in spite of the model’s prediction to the contrary. Tip: Sometimes a tree results in decisions that make little logical sense. For example, why would an applicant whose credit history is very good be likely to default, while those whose checking balance is unknown are not likely to default? Contradictory rules like this occur sometimes. They might reflect a real pattern in the data, or they may be a statistical anomaly. In either case, it is important to investigate such strange decisions to see whether the tree’s logic makes sense for business use. After the tree, the summary(credit_model) output displays a confusion matrix, which is a cross-tabulation that indicates the model’s incorrectly classified records in the training data: Evaluation on training data (900 cases): Decision Tree ---------------- Size Errors 56 133(14.8%) << (a) (b) <-classified as ---- ---- 598 35 (a): class no 98 169 (b): class yes The Errors output notes that the model correctly classified all but 133 of the 900 training instances for an error rate of 14.8 percent. A total of 35 actual no values were incorrectly classified as yes (false positives), while 98 yes values were misclassified as no (false negatives). Decision trees are known for having a tendency to overfit the model to the training data. For this reason, the error rate reported on training data may be overly optimistic, and it is especially important to evaluate decision trees on a test data set. Step 4: Evaluating model performance To apply our decision tree to the test dataset, we use the predict() function, as shown in the following line of code: > credit_pred <- predict(credit_model, credit_test) This creates a vector of predicted class values, which we can compare to the actual class values using the CrossTable() function in the gmodels package. Setting the prop.c and prop.r parameters to FALSE removes the column and row percentages from the table. The remaining percentage (prop.t) indicates the proportion of records in the cell out of the total number of records: > library(gmodels) > CrossTable(credit_test$default, credit_pred, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c('actual default', 'predicted default'))

This results in the following table:

Out of the 100 test loan application records, our model correctly predicted that 59 did not default and 14 did default, resulting in an accuracy of 73 percent and an error rate of 27 percent. This is somewhat worse than its performance on the training data, but not unexpected, given that a model’s performance is often worse on unseen data. Also note that the model only correctly predicted 14 of the 33 actual loan defaults in the test data, or 42 percent. Unfortunately, this type of error is a potentially very costly mistake, as the bank loses money on each default. Let’s see if we can improve the result with a bit more effort.

Step 5: Improving model performance

Our model’s error rate is likely to be too high to deploy it in a real-time credit scoring application. In fact, if the model had predicted “no default” for every test case, it would have been correct 67 percent of the time—a result not much worse than our model’s, but requiring much less effort! Predicting loan defaults from 900 examples seems to be a challenging problem.

Making matters even worse, our model performed especially poorly at identifying applicants who do default on their loans. Luckily, there are a couple of simple ways to adjust the C5.0 algorithm that may help to improve the performance of the model, both overall and for the more costly type of mistakes.

Boosting the accuracy of decision trees

One way the C5.0 algorithm improved upon the C4.5 algorithm was through the addition of adaptive boosting. This is a process in which many decision trees are built and the trees vote on the best class for each example.

Boosting is essentially rooted in the notion that by combining a number of weak performing learners, you can create a team that is much stronger than any of the learners alone. Each of the models has a unique set of strengths and weaknesses and they may be better or worse in solving certain problems. Using a combination of several learners with complementary strengths and weaknesses can therefore dramatically improve the accuracy of a classifier.

The C5.0() function makes it easy to add boosting to our C5.0 decision tree. We simply need to add an additional trials parameter indicating the number of separate decision trees to use in the boosted team. The trials parameter sets an upper limit; the algorithm will stop adding trees if it recognizes that additional trials do not seem to be improving the accuracy. We’ll start with 10 trials, a number that has become the de facto standard, as research suggests that this reduces error rates on test data by about 25%:

> credit_boost10 <- C5.0(credit_train[-17], credit_train$default, trials = 10) While examining the resulting model, we can see that some additional lines have been added, indicating the changes: > credit_boost10 Number of boosting iterations: 10 Average tree size: 47.5 Across the 10 iterations, our tree size shrunk. If you would like, you can see all 10 trees by typing summary(credit_boost10) at the command prompt. It also lists the model’s performance on the training data: > summary(credit_boost10) (a) (b) <-classified as ---- ---- 629 4 (a): class no 30 237 (b): class yes The classifier made 34 mistakes on 900 training examples for an error rate of 3.8 percent. This is quite an improvement over the 13.9 percent training error rate we noted before adding boosting! However, it remains to be seen whether we see a similar improvement on the test data. Let’s take a look: > credit_boost_pred10 <- predict(credit_boost10, credit_test) > CrossTable(credit_test$default, credit_boost_pred10, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c('actual default', 'predicted default'))

The resulting table is as follows:

Here, we reduced the total error rate from 27 percent prior to boosting down to 18 percent in the boosted model. It does not seem like a large gain, but it is in fact larger than the 25 percent reduction we expected. On the other hand, the model is still not doing well at predicting defaults, predicting only 20/33 = 61%correctly. The lack of an even greater improvement may be a function of our relatively small training data set, or it may just be a very difficult problem to solve.

This said, if boosting can be added this easily, why not apply it by default to every decision tree? The reason is twofold. First, if building a decision tree once takes a great deal of computation time, building many trees may be computationally impractical. Secondly, if the training data is very noisy, then boosting might not result in an improvement at all. Still, if greater accuracy is needed, it’s worth giving it a try.

Making mistakes more costlier than others

Giving a loan out to an applicant who is likely to default can be an expensive mistake. One solution to reduce the number of false negatives may be to reject a larger number of borderline applicants, under the assumption that the interest the bank would earn from a risky loan is far outweighed by the massive loss it would incur if the money is not paid back at all.

The C5.0 algorithm allows us to assign a penalty to different types of errors, in order to discourage a tree from making more costly mistakes. The penalties are designated in a cost matrix, which specifies how much costlier each error is, relative to any other prediction.

To begin constructing the cost matrix, we need to start by specifying the dimensions. Since the predicted and actual values can both take two values, yes or no, we need to describe a 2 x 2 matrix, using a list of two vectors, each with two values. At the same time, we’ll also name the matrix dimensions to avoid confusion later on:

> matrix_dimensions <- list(c("no", "yes"), c("no", "yes")) > names(matrix_dimensions) <- c("predicted", "actual")

Examining the new object shows that our dimensions have been set up correctly:

> matrix_dimensions $predicted [1] "no" "yes"$actual [1] "no" "yes"

Next, we need to assign the penalty for the various types of errors by supplying four values to fill the matrix. Since R fills a matrix by filling columns one by one from top to bottom, we need to supply the values in a specific order:

• Predicted no, actual no
• Predicted yes, actual no
• Predicted no, actual yes
• Predicted yes, actual yes

Suppose we believe that a loan default costs the bank four times as much as a missed opportunity. Our penalty values could then be defined as:

> error_cost <- matrix(c(0, 1, 4, 0), nrow = 2, dimnames = matrix_dimensions)

This creates the following matrix:

> error_cost actual predicted no yes no 0 4 yes 1 0

As defined by this matrix, there is no cost assigned when the algorithm classifies a no or yes correctly, but a false negative has a cost of 4 versus a false positive’s cost of 1. To see how this impacts classification, let’s apply it to our decision tree using the costs parameter of the C5.0() function. We’ll otherwise use the same steps as we did earlier:

> credit_cost <- C5.0(credit_train[-17], credit_train$default, costs = error_cost) > credit_cost_pred <- predict(credit_cost, credit_test) > CrossTable(credit_test$default, credit_cost_pred, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c('actual default', 'predicted default'))

This produces the following confusion matrix:

Compared to our boosted model, this version makes more mistakes overall: 37 percent error here versus 18 percent in the boosted case. However, the types of mistakes are very different. Where the previous models incorrectly classified only 42 and 61 percent of defaults correctly, in this model, 79 percent of the actual defaults were predicted to be non-defaults. This trade resulting in a reduction of false negatives at the expense of increasing false positives may be acceptable if our cost estimates were accurate.

This tutorial has been taken from Machine Learning with R Second Edition   by Brett Lantz. Use the code  MLR250RB at the checkout to save 50% on the RRP.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Let X=X in R

Fri, 11/03/2017 - 14:37

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Our article "Let’s Have Some Sympathy For The Part-time R User" includes two points:

• Sometimes you have to write parameterized or re-usable code.
• The methods for doing this should be easy and legible.

The first point feels abstract, until you find yourself wanting to re-use code on new projects. As for the second point: I feel the wrapr package is the easiest, safest, most consistent, and most legible way to achieve maintainable code re-use in R.

In this article we will show how wrapr makes code-rewriting even easier with its new let x=x automation.

Let X=X

There are very important reasons to choose a package that makes things easier. One is debugging:

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?

Brian Kernighan, The Elements of Programming Style, 2nd edition, chapter 2

Let’s take the monster example from "Let’s Have Some Sympathy For The Part-time R User".

The idea was that perhaps one had worked out a complicated (but useful and important) by-hand survey scoring method:

suppressPackageStartupMessages(library("dplyr")) library("wrapr") d <- data.frame( subjectID = c(1, 1, 2, 2), surveyCategory = c( 'withdrawal behavior', 'positive re-framing', 'withdrawal behavior', 'positive re-framing' ), assessmentTotal = c(5, 2, 3, 4), stringsAsFactors = FALSE ) scale <- 0.237 d %>% group_by(subjectID) %>% mutate(probability = exp(assessmentTotal * scale)/ sum(exp(assessmentTotal * scale))) %>% arrange(probability, surveyCategory) %>% mutate(isDiagnosis = row_number() == n()) %>% filter(isDiagnosis) %>% ungroup() %>% select(subjectID, surveyCategory, probability) %>% rename(diagnosis = surveyCategory) %>% arrange(subjectID) ## # A tibble: 2 x 3 ## subjectID diagnosis probability ## ## 1 1 withdrawal behavior 0.6706221 ## 2 2 positive re-framing 0.5589742

The presumption is that the above pipeline is considered reasonable (but long, complicated, and valuable) dplyr, and our goal is to re-use it on new data that may not have the same column names as our original data.

We are making the huge simplifying assumption that you have studied the article and the above example is now familiar.

The question is: what to do when one wants to process the same type of data with different column names? For example:

d <- data.frame( PID = c(1, 1, 2, 2), DIAG = c( 'withdrawal behavior', 'positive re-framing', 'withdrawal behavior', 'positive re-framing' ), AT = c(5, 2, 3, 4), stringsAsFactors = FALSE ) print(d) ## PID DIAG AT ## 1 1 withdrawal behavior 5 ## 2 1 positive re-framing 2 ## 3 2 withdrawal behavior 3 ## 4 2 positive re-framing 4

The new table has the following new column definitions:

subjectID <- "PID" surveyCategory <- "DIAG" assessmentTotal <- "AT" isDiagnosis <- "isD" probability <- "prob" diagnosis <- "label"

We could "reduce to a previously solved problem" by renaming the columns to names we know, doing the work, and then renaming back (which is actually a service that replyr::replyr_apply_f_mapped() supplies).

In "Let’s Have Some Sympathy For The Part-time R User" I advised editing the pipeline to have obvious stand-in names (perhaps in all-capitals) and then using wrapr::let() to perform symbol substitution on the pipeline.

Dr. Nina Zumel has since pointed out to me: if you truly trust the substitution method you can use the original column names and adapt the original calculation pipeline as is (without alteration). Let’s try that:

let( c(subjectID = subjectID, surveyCategory = surveyCategory, assessmentTotal = assessmentTotal, isDiagnosis = isDiagnosis, probability = probability, diagnosis = diagnosis), d %>% group_by(subjectID) %>% mutate(probability = exp(assessmentTotal * scale)/ sum(exp(assessmentTotal * scale))) %>% arrange(probability, surveyCategory) %>% mutate(isDiagnosis = row_number() == n()) %>% filter(isDiagnosis) %>% ungroup() %>% select(subjectID, surveyCategory, probability) %>% rename(diagnosis = surveyCategory) %>% arrange(subjectID)) ## # A tibble: 2 x 3 ## PID label prob ## ## 1 1 withdrawal behavior 0.6706221 ## 2 2 positive re-framing 0.5589742

That works! All we did was: paste the original code into the block and the adapter did all of the work, with no user edits of the code.

It is a bit harder for the user to find which symbols are being replaced, but in some sense they don’t really need to know (it is R‘s job to perform the replacements).

wrapr has a new helper function mapsyms() that automates all of the "let x = x" steps from the above example.

mapsyms() is a simple function that captures variable names and builds a mapping from them to the names they refer to in the current environment. For example we can use it to quickly build the assignment map for the let block, because the earlier assignments such as "subjectID <- "PID"" allow mapsyms() to find the intended re-mappings. This would also be true for other cases, such as re-mapping function arguments to values. Our example becomes:

print(mapsyms(subjectID, surveyCategory, assessmentTotal, isDiagnosis, probability, diagnosis)) ## $subjectID ## [1] "PID" ## ##$surveyCategory ## [1] "DIAG" ## ## $assessmentTotal ## [1] "AT" ## ##$isDiagnosis ## [1] "isD" ## ## $probability ## [1] "prob" ## ##$diagnosis ## [1] "label"

This allows the solution to be re-written and even wrapped into a function in a very legible form with very little effort:

computeRes <- function(d, subjectID, surveyCategory, assessmentTotal, isDiagnosis, probability, diagnosis) { let( mapsyms(subjectID, surveyCategory, assessmentTotal, isDiagnosis, probability, diagnosis), d %>% group_by(subjectID) %>% mutate(probability = exp(assessmentTotal * scale)/ sum(exp(assessmentTotal * scale))) %>% arrange(probability, surveyCategory) %>% mutate(isDiagnosis = row_number() == n()) %>% filter(isDiagnosis) %>% ungroup() %>% select(subjectID, surveyCategory, probability) %>% rename(diagnosis = surveyCategory) %>% arrange(subjectID) ) } computeRes(d, subjectID = "PID", surveyCategory = "DIAG", assessmentTotal = "AT", isDiagnosis = "isD", probability = "prob", diagnosis = "label") ## # A tibble: 2 x 3 ## PID label prob ## ## 1 1 withdrawal behavior 0.6706221 ## 2 2 positive re-framing 0.5589742

The idea is: instead of having to mark what instances of symbols are to be replaced (by quoting or de-quoting indicators), we instead declare what symbols are to be replaced using the mapsyms() helper.

mapsyms() is a stand-alone helper function (just as ":=" is). It works not because it is some exceptional corner-case hard-wired into other functions, but because mapsyms()‘s reasonable semantics happen to synergize with let()‘s reasonable semantics. mapsyms() behaves as a replacement target controller (without needing any cumbersome direct quoting or un-quoting notation!).

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### I, For One, Welcome Our Forthcoming New robots.txt Overlords

Fri, 11/03/2017 - 14:18

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Despite my week-long Twitter consumption sabbatical (helped — in part — by the nigh week-long internet and power outage here in Maine), I still catch useful snippets from folks. My cow-orker @dabdine shunted a tweet by @terrencehart into a Slack channel this morning, and said tweet contained a link to this little gem. Said gem is the text of a very recent ruling from a District Court in Texas and deals with a favourite subject of mine: robots.txt.

The background of the case is that there were two parties who both ran websites for oil and gas professionals that include job postings. One party filed a lawsuit against the other asserting that the they hacked into their system and accessed and used various information in violation of the Computer Fraud and Abuse Act (CFAA), the Stored Wire and Electronic Communications and Transactional Records Access Act (SWECTRA), the Racketeer Influenced and Corrupt Organizations Act (RICO), the Texas Harmful Access by Computer Act (THACA), the Texas Theft Liability Act (TTLA), and the Texas Uniform Trade Secrets Act (TUTS). They also asserted common law claims of misappropriation of confidential information, conversion, trespass to chattels, fraud, breach of fiduciary duty, unfair competition, tortious interference with present and prospective business relationships, civil conspiracy, and aiding and abetting.

The other party filed a motion for dismissal on a number of grounds involving legalese on Terms & Conditions, a rebuttal of CFAA claims and really gnarly legalese around copyrights. There are more than a few paragraphs that make me glad none of my offspring have gone into or have a desire to go into the legal profession. One TLDR here is that T&Cs do, in fact, matter (though that is definitely dependent upon the legal climate where you live or have a case filed against you). We’re going to focus on the DMCA claim which leads us to the robots.txt part.

I shall also preface the rest with “IANAL”, but I don’t think a review of this case requires a law degree.

Command-Shift-R

To refresh memories (or create lasting new ones), robots.txt is a file that is placed at the top of a web site domain (i.e. https://rud.is/robots.txt) that contains robots exclusion standard rules. These rules tell bots (NOTE: if you write a scraper, you’ve written a scraping bot) what they can or cannot scrape and what — if any — delay should be placed between scraping efforts by said bot.

R has two CRAN packages for dealing with these files/rules: robotstxt by Peter Meissner and spiderbar by me. They are not competitors, but are designed much like Reese’s Peanut Butter cups — to go together (though Peter did some wicked good testing and noted a possible error in the underlying C++ library I use that could generate Type I or Type II in certain circumstances) and each has some specialization. I note them now because you don’t have an excuse not to check robots.txt given two CRAN packages being available. Python folks have (at a minimum) robotparser and reppy. Node, Go and other, modern languages all have at least one module/library/package available as well. No. Excuses.

(Y’all are always in a rush, eh?)

This October, 2017 Texas ruling references a 2007 ruling by a District Court in Pennsylvania. I dug in a bit through searchable Federal case law for mentions of robots.txt and there aren’t many U.S. cases that mention this control, though I am amused a small cadre of paralegals had to type robots.txt over-and-over again.

The dismissal request on the grounds that the CFAA did not apply was summarily rejected. Why? The defendant provided proof that they monitor for scraping activity that violates the robots.txt rules and that they use the Windows Firewall (ugh, they use Windows for web serving) to block offending IP addresses when they discover them.

Nuances came out further along in the dismissal text noting that user-interactive viewing of the member profiles on the site was well-within the T&Cs but that the defendant “never authorized [the use of] automated bots to download over 500,000 profiles” nor to have that data used for commercial purposes.

The kicker (for me, anyway) is the last paragraph of the document in the Conclusion where the defendant asserts that:

• the “Internet industry” (I seriously dislike lawyers for wording just like that) has recognized robots.txt as a standard for controlling automated access to resources
• robots.txt has been a valid enforcement mechanism since 1994

The good bit is: -“Whether it actually qualifies in this case will be determined definitively at summary judgment or by a jury.”_ To me, this sounds like a ruling by a jury/judge in favor of robots.txt could mean that it becomes much stronger case law for future misuse claims.

With that in mind:

• Site owners: USE robots.txt, if — for no other reason — to aid legitimate researchers who want to make use of your data for valid scientific purposes, education or to create non-infringing content or analyses that will be a benefit to the public good. You can also use it to legally protect your content (but there are definitely nuances around how you do that).
• Scrapers: Check and obey robots.txt rules. You have no technological excuse not to and not doing so really appears that it could come back to haunt you in the very near future.
FIN

I’ve setup an alert for when future rulings come out for this case and will toss up another post here or on the work-blog (so I can run it by our very-non-skeezy legal team) when it pops up again.

“Best Friends” image by Andy Kelly. Used with permission. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Gold-Mining – Week 9 (2017)

Fri, 11/03/2017 - 04:57

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 9 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

The post Gold-Mining – Week 9 (2017) appeared first on Fantasy Football Analytics.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Fri, 11/03/2017 - 01:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

There are already multiple places where you can get help with R, Shiny, the RStudio IDE, and the tidyverse. There are, however, far fewer resources for R admins: people who work with R in production, in large organizations, and in complex environments. We hope this new category will serve as a useful and friendly place to connect with fellow R admins to discuss the issues they deal with. We expect this category to include:

• Discussions about best practices and ideas

• General questions to fellow admins about RStudio Pro products, designed to ease friction in R administrator workflows

• An exchange of ideas on domain-specific use cases and configurations

If you’re an existing RStudio customer, this forum is a complement to RStudio’s direct support:

• Folks from RStudio will participate, but only lightly moderate topics and discussions.

• RStudio commercial license holders should still feel free to report Pro
product problems to support@rstudio.com.

• If you think a topic needs RStudio support’s attention, please suggest that
the poster contact RStudio support directly. You can also tag @support in a reply.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Problems In Estimating GARCH Parameters in R

Thu, 11/02/2017 - 20:35

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

These days my research focuses on change point detection methods. These are statistical tests and procedures to detect a structural change in a sequence of data. An early example, from quality control, is detecting whether a machine became uncalibrated when producing a widget. There may be some measurement of interest, such as the diameter of a ball bearing, that we observe. The machine produces these widgets in sequence. Under the null hypothesis, the ball bearing’s mean diameter does not change, while under the alternative, at some unkown point in the manufacturing process the machine became uncalibrated and the mean diameter of the ball bearings changed. The test then decides between these two hypotheses.

These types of test matter to economists and financial sector workers as well, particularly for forecasting. Once again we have a sequence of data indexed by time; my preferred example is price of a stock, which people can instantly recognize as a time series given how common time series graphs for stocks are, but there are many more datasets such as a state’s GDP or the unemployment rate. Economists want to forecast these quantities using past data and statistics. One of the assumptions the statistical methods makes is that the series being forecasted is stationary: the data was generated by one process with a single mean, autocorrelation, distribution, etc. This assumption isn’t always tested yet it is critical to successful forecasting. Tests for structural change check this assumption, and if it turns out to be false, the forecaster may need to divide up their dataset when training their models.

I have written about these tests before, introducing the CUSUM statistic, one of the most popular statistics for detecting structural change. My advisor and a former Ph.D. student of his (currently a professor at the University of Waterloo, Greg Rice) developed a new test statistic that better detects structural changes that occur early or late in the dataset (imagine the machine producing widgets became uncalibrated just barely, and only the last dozen of the hundred widgets in the sample were affected). We’re in the process of making revisions requested by a journal to whom we submitted our paper, one of the revisions being a better example application (we initially worked with the wage/productivity data I discussed in the aforementioned blog post; the reviewers complained that these variables are codetermined so its nonsense to regress one on the other, a complaint I disagree with but I won’t plant my flag on to defend).

We were hoping to apply a version of our test to detecting structural change in GARCH models, a common model in financial time series. To my knowledge the “state of the art” R package for GARCH model estimation and inference (along with other work) is fGarch; in particular, the function garchFit() is used for estimating GARCH models from data. When we tried to use this function in our test, though, we were given obviously bad numbers (we had already done simulation studies to know what behavior to expect). The null hypothesis of no change was soundly rejected on simulated sequences where it was true. I never saw the test fail to reject the null hypothesis, even though the null hypothesis was always true. This was the case even for sample sizes of 10,000, hardly a small sample.

We thought the problem might lie with the estimation of the covariance matrix of the parameter estimates, and I painstakingly derived and programmed functions to get this matrix not using numerical differentiation procedures, yet this did not stop the bad behavior. Eventually my advisor and I last Wednesday played with garchFit() and decided that the function is to blame. The behavior of the function on simulated data is so erratic when estimating parameters (not necessarily the covariance matrix as we initially thought, though it’s likely polluted as well) the function is basically useless, to my knowledge.

This function should be well-known and it’s certainly possible that the problem lies with me, not fGarch (or perhaps there’s better packages out there). This strikes me as a function of such importance I should share my findings. In this article I show a series of numerical experiments demonstrating garchFit()‘s pathological behavior.

Basics on GARCH Models

The model is a time series model often used to model the volatility of financial instrument returns, such as the returns from stocks. Let represent the process. This could represent the deviations in the returns of, say, a stock. The model (without a mean parameter) is defined recursively as:

is the conditional standard deviation of the process, also known as the conditional volatility, and is a random process.

People who follow finance1 noticed that returns to financial instruments (such as stocks or mutual funds) exhibit behavior known as volatility clustering. Some periods a financial instrument is relatively docile; there are not dramatic market movements. In others an instrument’s price can fluctuate greatly, and these periods are not one-off single-day movements but can last for a period of time. GARCH models were developed to model volatility clustering.

It is believed by some that even if a stock’s daily movement is essentially unforecastable (a stock is equally likely to over- or under-perform on any given day), the volatility is forecastable. Even for those who don’t have the hubris to believe anything about future returns can be forecasted these models are important. For example if one uses the model to estimate the beta statistic for a stock (where is the stock’s return at time $latex$ is the market return, and is “random noise”), there is a good chance that is not an i.i.d sequence of random numbers (as is commonly assumed in other statistical contexts) but actually a GARCH sequence. The modeller would then want to know the behavior of her estimates in such a situation. Thus GARCH models are considered important. In fact, the volatility clustering behavior I just described is sometimes described as “GARCH behavior”, since it appears frequently and GARCH models are a frequent tool of choice to address them. (The acronym GARCH stands for generalized autoregressive conditional heteroskedasticity, which is statistics-speak for changing, time-dependent volatility.)

can be any random process but a frequent choice is to use a sequence of i.i.d standard Normal random variables. Here is the only source of randomness in the model. In order for a process to have a stationary solution, we must require that ). In this case the process has a long-run variance of .

Estimating GARCH Parameters

The process I wrote down above is an infinite process; the index $latex$ can extend to negative numbers and beyond. Obviously in practice we don’t observe infinite sequences so if we want to work with models in practice we need to consider a similar sequence:

Below is the new sequence’s secret sauce:

We choose an initial value for this sequence (the theoretical sequence described earlier does not have an initial value)! This sequence strongly resembles the theoretical sequence but it is observable in its entirity, and it can be shown that parameters estimated using this sequence closely approximate those of the theoretical, infinite process.

Naturally one of the most important tasks for these processes is estimating their parameters; for the process, these are , , and . A basic approach is to find the quasi-maximum likelihood estimation (QMLE) estimates. Let’s assume that we have $latex$ observations from our process. In QMLE, we work with the condisional distribution of when assuming follows a standard normal distribution (that is, ). We assume that the entire history of the process up to time $latex$ is known; this implies that is known as well (in fact all we needed to know was the values of the process at time , but I digress). In that case we have . Let be the conditional distribution of (so ). The quasi-likelihood equation is then

Like most likelihood methods, rather than optimize the quasi-likelihood function directly, statisticians try to optimize the log-likelihood, , and after some work it’s not hard to see this is equivalent to minimizing

Note that , , and are involved in this quantity through . There is no closed form solution for the parameters that minimize this quantity. This means that numerical optimization techniques must be applied to find the parameters.

It can be shown that the estimators for the parameters , , and , when computed this way, are consistent (meaning that asymptotically they approach their true values, in the sense that they converge in probability) and follow a Gaussian distribution asymptotically.2 These are properties that we associate with the sample mean, and while we might be optimistic that the rate of convergence of these estimators is as good as the rate of convergence of the sample mean, we may expect comparable asymptotic behavior.

Ideally, the parameters should behave like the process illustrated below.

library(ggplot2) x <- rnorm(1000, sd = 1/3) df <- t(sapply(50:1000, function(t) { return(c("mean" = mean(x[1:t]), "mean.se" = sd(x[1:t])/sqrt(t))) })) df <- as.data.frame(df) df$t <- 50:1000 ggplot(df, aes(x = t, y = mean)) + geom_line() + geom_ribbon(aes(x = t, ymin = mean - 2 * mean.se, ymax = mean + 2 * mean.se), color = "grey", alpha = 0.5) + geom_hline(color = "blue", yintercept = 0) + coord_cartesian(ylim = c(-0.5, 0.5)) Behavior of Estimates by fGarch Before continuing let’s generate a sequence. Throughout this article I work with processes where all parameters are equal to 0.2. Notice that for a process the long-run variance will be with this choice. set.seed(110117) library(fGarch) x <- garchSim(garchSpec(model = list("alpha" = 0.2, "beta" = 0.2, "omega" = 0.2)), n.start = 1000, n = 1000) plot(x) Let’s see the parameters that the fGarch function garchFit() uses. args(garchFit) ## function (formula = ~garch(1, 1), data = dem2gbp, init.rec = c("mci", ## "uev"), delta = 2, skew = 1, shape = 4, cond.dist = c("norm", ## "snorm", "ged", "sged", "std", "sstd", "snig", "QMLE"), include.mean = TRUE, ## include.delta = NULL, include.skew = NULL, include.shape = NULL, ## leverage = NULL, trace = TRUE, algorithm = c("nlminb", "lbfgsb", ## "nlminb+nm", "lbfgsb+nm"), hessian = c("ropt", "rcd"), ## control = list(), title = NULL, description = NULL, ...) ## NULL The function provides a few options for distribution to maximize (cond.dist) and algorithm to use for optimization (algorithm). Here I will always choose cond.dist = QMLE, unless otherwise stated, to instruct the function to use QMLE estimators. Here’s a single pass. garchFit(data = x, cond.dist = "QMLE", include.mean = FALSE) ## ## Series Initialization: ## ARMA Model: arma ## Formula Mean: ~ arma(0, 0) ## GARCH Model: garch ## Formula Variance: ~ garch(1, 1) ## ARMA Order: 0 0 ## Max ARMA Order: 0 ## GARCH Order: 1 1 ## Max GARCH Order: 1 ## Maximum Order: 1 ## Conditional Dist: QMLE ## h.start: 2 ## llh.start: 1 ## Length of Series: 1000 ## Recursion Init: mci ## Series Scale: 0.5320977 ## ## Parameter Initialization: ## Initial Parameters:$params ## Limits of Transformations: $U,$V ## Which Parameters are Fixed? $includes ## Parameter Matrix: ## U V params includes ## mu -0.15640604 0.156406 0.0 FALSE ## omega 0.00000100 100.000000 0.1 TRUE ## alpha1 0.00000001 1.000000 0.1 TRUE ## gamma1 -0.99999999 1.000000 0.1 FALSE ## beta1 0.00000001 1.000000 0.8 TRUE ## delta 0.00000000 2.000000 2.0 FALSE ## skew 0.10000000 10.000000 1.0 FALSE ## shape 1.00000000 10.000000 4.0 FALSE ## Index List of Parameters to be Optimized: ## omega alpha1 beta1 ## 2 3 5 ## Persistence: 0.9 ## ## ## --- START OF TRACE --- ## Selected Algorithm: nlminb ## ## R coded nlminb Solver: ## ## 0: 1419.0152: 0.100000 0.100000 0.800000 ## 1: 1418.6616: 0.108486 0.0998447 0.804683 ## 2: 1417.7139: 0.109746 0.0909961 0.800931 ## 3: 1416.7807: 0.124977 0.0795152 0.804400 ## 4: 1416.7215: 0.141355 0.0446605 0.799891 ## 5: 1415.5139: 0.158059 0.0527601 0.794304 ## 6: 1415.2330: 0.166344 0.0561552 0.777108 ## 7: 1415.0415: 0.195230 0.0637737 0.743465 ## 8: 1415.0031: 0.200862 0.0576220 0.740088 ## 9: 1414.9585: 0.205990 0.0671331 0.724721 ## 10: 1414.9298: 0.219985 0.0713468 0.712919 ## 11: 1414.8226: 0.230628 0.0728325 0.697511 ## 12: 1414.4689: 0.325750 0.0940514 0.583114 ## 13: 1413.4560: 0.581449 0.143094 0.281070 ## 14: 1413.2804: 0.659173 0.157127 0.189282 ## 15: 1413.2136: 0.697840 0.155964 0.150319 ## 16: 1413.1467: 0.720870 0.142550 0.137645 ## 17: 1413.1416: 0.726527 0.138146 0.135966 ## 18: 1413.1407: 0.728384 0.137960 0.134768 ## 19: 1413.1392: 0.731725 0.138321 0.132991 ## 20: 1413.1392: 0.731146 0.138558 0.133590 ## 21: 1413.1392: 0.730849 0.138621 0.133850 ## 22: 1413.1392: 0.730826 0.138622 0.133869 ## ## Final Estimate of the Negative LLH: ## LLH: 782.211 norm LLH: 0.782211 ## omega alpha1 beta1 ## 0.2069173 0.1386221 0.1338686 ## ## R-optimhess Difference Approximated Hessian Matrix: ## omega alpha1 beta1 ## omega -8858.897 -1839.6144 -2491.9827 ## alpha1 -1839.614 -782.8005 -531.7393 ## beta1 -2491.983 -531.7393 -729.7246 ## attr(,"time") ## Time difference of 0.04132652 secs ## ## --- END OF TRACE --- ## ## ## Time to Estimate Parameters: ## Time difference of 0.3866439 secs ## ## Title: ## GARCH Modelling ## ## Call: ## garchFit(data = x, cond.dist = "QMLE", include.mean = FALSE) ## ## Mean and Variance Equation: ## data ~ garch(1, 1) ## ## [data = x] ## ## Conditional Distribution: ## QMLE ## ## Coefficient(s): ## omega alpha1 beta1 ## 0.20692 0.13862 0.13387 ## ## Std. Errors: ## robust ## ## Error Analysis: ## Estimate Std. Error t value Pr(>|t|) ## omega 0.20692 0.05102 4.056 5e-05 *** ## alpha1 0.13862 0.04928 2.813 0.00491 ** ## beta1 0.13387 0.18170 0.737 0.46128 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Log Likelihood: ## -782.211 normalized: -0.782211 ## ## Description: ## Thu Nov 2 13:01:14 2017 by user: The parameters are not necessarily near the true parameters. One might initially attribute this to just randomness, but that doesn’t seem to be the case. For example, what fit do I get when I fit the model on the first 500 data points? garchFit(data = x[1:500], cond.dist = "QMLE", include.mean = FALSE) ## ## Series Initialization: ## ARMA Model: arma ## Formula Mean: ~ arma(0, 0) ## GARCH Model: garch ## Formula Variance: ~ garch(1, 1) ## ARMA Order: 0 0 ## Max ARMA Order: 0 ## GARCH Order: 1 1 ## Max GARCH Order: 1 ## Maximum Order: 1 ## Conditional Dist: QMLE ## h.start: 2 ## llh.start: 1 ## Length of Series: 500 ## Recursion Init: mci ## Series Scale: 0.5498649 ## ## Parameter Initialization: ## Initial Parameters:$params ## Limits of Transformations: $U,$V ## Which Parameters are Fixed? $includes ## Parameter Matrix: ## U V params includes ## mu -0.33278068 0.3327807 0.0 FALSE ## omega 0.00000100 100.0000000 0.1 TRUE ## alpha1 0.00000001 1.0000000 0.1 TRUE ## gamma1 -0.99999999 1.0000000 0.1 FALSE ## beta1 0.00000001 1.0000000 0.8 TRUE ## delta 0.00000000 2.0000000 2.0 FALSE ## skew 0.10000000 10.0000000 1.0 FALSE ## shape 1.00000000 10.0000000 4.0 FALSE ## Index List of Parameters to be Optimized: ## omega alpha1 beta1 ## 2 3 5 ## Persistence: 0.9 ## ## ## --- START OF TRACE --- ## Selected Algorithm: nlminb ## ## R coded nlminb Solver: ## ## 0: 706.37230: 0.100000 0.100000 0.800000 ## 1: 706.27437: 0.103977 0.100309 0.801115 ## 2: 706.19091: 0.104824 0.0972295 0.798477 ## 3: 706.03116: 0.112782 0.0950253 0.797812 ## 4: 705.77389: 0.122615 0.0858136 0.788169 ## 5: 705.57316: 0.134608 0.0913105 0.778144 ## 6: 705.43424: 0.140011 0.0967118 0.763442 ## 7: 705.19541: 0.162471 0.102711 0.739827 ## 8: 705.16325: 0.166236 0.0931680 0.737563 ## 9: 705.09943: 0.168962 0.100977 0.731085 ## 10: 704.94924: 0.203874 0.0958205 0.702986 ## 11: 704.78210: 0.223975 0.108606 0.664678 ## 12: 704.67414: 0.250189 0.122959 0.630886 ## 13: 704.60673: 0.276532 0.131788 0.595346 ## 14: 704.52185: 0.335952 0.146435 0.520961 ## 15: 704.47725: 0.396737 0.157920 0.448557 ## 16: 704.46540: 0.442499 0.164111 0.396543 ## 17: 704.46319: 0.440935 0.161566 0.400606 ## 18: 704.46231: 0.442951 0.159225 0.400940 ## 19: 704.46231: 0.443022 0.159284 0.400863 ## 20: 704.46230: 0.443072 0.159363 0.400851 ## 21: 704.46230: 0.443112 0.159367 0.400807 ## ## Final Estimate of the Negative LLH: ## LLH: 405.421 norm LLH: 0.810842 ## omega alpha1 beta1 ## 0.1339755 0.1593669 0.4008074 ## ## R-optimhess Difference Approximated Hessian Matrix: ## omega alpha1 beta1 ## omega -8491.005 -1863.4127 -2488.5700 ## alpha1 -1863.413 -685.6071 -585.4327 ## beta1 -2488.570 -585.4327 -744.1593 ## attr(,"time") ## Time difference of 0.02322888 secs ## ## --- END OF TRACE --- ## ## ## Time to Estimate Parameters: ## Time difference of 0.1387401 secs ## ## Title: ## GARCH Modelling ## ## Call: ## garchFit(data = x[1:500], cond.dist = "QMLE", include.mean = FALSE) ## ## Mean and Variance Equation: ## data ~ garch(1, 1) ## ## [data = x[1:500]] ## ## Conditional Distribution: ## QMLE ## ## Coefficient(s): ## omega alpha1 beta1 ## 0.13398 0.15937 0.40081 ## ## Std. Errors: ## robust ## ## Error Analysis: ## Estimate Std. Error t value Pr(>|t|) ## omega 0.13398 0.11795 1.136 0.2560 ## alpha1 0.15937 0.07849 2.030 0.0423 * ## beta1 0.40081 0.44228 0.906 0.3648 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Log Likelihood: ## -405.421 normalized: -0.810842 ## ## Description: ## Thu Nov 2 13:01:15 2017 by user: Notice that the parameter (listed as beta1) changed dramatically. How about a different cutoff? garchFit(data = x[1:200], cond.dist = "QMLE", include.mean = FALSE) ## ## Series Initialization: ## ARMA Model: arma ## Formula Mean: ~ arma(0, 0) ## GARCH Model: garch ## Formula Variance: ~ garch(1, 1) ## ARMA Order: 0 0 ## Max ARMA Order: 0 ## GARCH Order: 1 1 ## Max GARCH Order: 1 ## Maximum Order: 1 ## Conditional Dist: QMLE ## h.start: 2 ## llh.start: 1 ## Length of Series: 200 ## Recursion Init: mci ## Series Scale: 0.5746839 ## ## Parameter Initialization: ## Initial Parameters:$params ## Limits of Transformations: $U,$V ## Which Parameters are Fixed? $includes ## Parameter Matrix: ## U V params includes ## mu -0.61993813 0.6199381 0.0 FALSE ## omega 0.00000100 100.0000000 0.1 TRUE ## alpha1 0.00000001 1.0000000 0.1 TRUE ## gamma1 -0.99999999 1.0000000 0.1 FALSE ## beta1 0.00000001 1.0000000 0.8 TRUE ## delta 0.00000000 2.0000000 2.0 FALSE ## skew 0.10000000 10.0000000 1.0 FALSE ## shape 1.00000000 10.0000000 4.0 FALSE ## Index List of Parameters to be Optimized: ## omega alpha1 beta1 ## 2 3 5 ## Persistence: 0.9 ## ## ## --- START OF TRACE --- ## Selected Algorithm: nlminb ## ## R coded nlminb Solver: ## ## 0: 280.63354: 0.100000 0.100000 0.800000 ## 1: 280.63302: 0.100315 0.100088 0.800223 ## 2: 280.63262: 0.100695 0.0992822 0.800059 ## 3: 280.63258: 0.102205 0.0983397 0.800404 ## 4: 280.63213: 0.102411 0.0978709 0.799656 ## 5: 280.63200: 0.102368 0.0986702 0.799230 ## 6: 280.63200: 0.101930 0.0984977 0.800005 ## 7: 280.63200: 0.101795 0.0983937 0.799987 ## 8: 280.63197: 0.101876 0.0984197 0.799999 ## 9: 280.63197: 0.102003 0.0983101 0.799965 ## 10: 280.63197: 0.102069 0.0983780 0.799823 ## 11: 280.63197: 0.102097 0.0983703 0.799827 ## 12: 280.63197: 0.102073 0.0983592 0.799850 ## 13: 280.63197: 0.102075 0.0983616 0.799846 ## ## Final Estimate of the Negative LLH: ## LLH: 169.8449 norm LLH: 0.8492246 ## omega alpha1 beta1 ## 0.03371154 0.09836156 0.79984610 ## ## R-optimhess Difference Approximated Hessian Matrix: ## omega alpha1 beta1 ## omega -26914.901 -6696.498 -8183.925 ## alpha1 -6696.498 -2239.695 -2271.547 ## beta1 -8183.925 -2271.547 -2733.098 ## attr(,"time") ## Time difference of 0.02161336 secs ## ## --- END OF TRACE --- ## ## ## Time to Estimate Parameters: ## Time difference of 0.09229803 secs ## ## Title: ## GARCH Modelling ## ## Call: ## garchFit(data = x[1:200], cond.dist = "QMLE", include.mean = FALSE) ## ## Mean and Variance Equation: ## data ~ garch(1, 1) ## ## [data = x[1:200]] ## ## Conditional Distribution: ## QMLE ## ## Coefficient(s): ## omega alpha1 beta1 ## 0.033712 0.098362 0.799846 ## ## Std. Errors: ## robust ## ## Error Analysis: ## Estimate Std. Error t value Pr(>|t|) ## omega 0.03371 0.01470 2.293 0.0218 * ## alpha1 0.09836 0.04560 2.157 0.0310 * ## beta1 0.79985 0.03470 23.052 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Log Likelihood: ## -169.8449 normalized: -0.8492246 ## ## Description: ## Thu Nov 2 13:01:15 2017 by user: For 200 observations is estimated to be enormous with a relatively tiny standard error! Let’s dive deeper into this. I’ve conducted a number of numerical experiments on the University of Utah Mathematics department’s supercomputer. Below is a helper function to extract the coefficients and standard errors for a particular fit by garchFit() (suppressing all of garchFit()‘s output in the process). getFitData <- function(x, cond.dist = "QMLE", include.mean = FALSE, ...) { args <- list(...) args$data = x args$cond.dist = cond.dist args$include.mean = include.mean log <- capture.output({ fit <- do.call(garchFit, args = args) }) res <- coef(fit) res[paste0(names(fit@fit$se.coef), ".se")] <- fit@fit$se.coef return(res) }

The first experiment is to compute the coefficients of this particular series at each possible end point.

(The following code block is not evaluated when this document is knitted; I have saved the results in a Rda file. This will be the case for every code block that involves parallel computation. I performed these computations on the University of Utah mathematics department’s supercomputer, saving the results for here.)

library(doParallel) set.seed(110117) cl <- makeCluster(detectCores() - 1) registerDoParallel(cl) x <- garchSim(garchSpec(model = list(alpha = 0.2, beta = 0.2, omega = 0.2)), n.start = 1000, n = 1000) params <- foreach(t = 50:1000, .combine = rbind, .packages = c("fGarch")) %dopar% { getFitData(x[1:t]) } rownames(params) <- 50:1000

Below I plot these coefficients, along with a region corresponding to double the standard error. This region should roughly correspond to 95% confidence intervals.

params_df <- as.data.frame(params) params_df$t <- as.numeric(rownames(params)) ggplot(params_df) + geom_line(aes(x = t, y = beta1)) + geom_hline(yintercept = 0.2, color = "blue") + geom_ribbon(aes(x = t, ymin = beta1 - 2 * beta1.se, ymax = beta1 + 2 * beta1.se), color = "grey", alpha = 0.5) + ylab(expression(hat(beta))) + scale_y_continuous(breaks = c(0, 0.2, 0.25, 0.5, 1)) + coord_cartesian(ylim = c(0, 1)) This is an alarming picture (but not the most alarming I’ve seen; this is one of the better cases). Notice that the confidence interval fails to capture the true value of up until about 375 data points; these intervals should contain the true value about 95% of the time! This is in addition to the confidence interval being fairly large. Let’s see how the other parameters behave. library(reshape2) library(plyr) library(dplyr) param_reshape <- function(p) { p <- as.data.frame(p) p$t <- as.integer(rownames(p)) pnew <- melt(p, id.vars = "t", variable.name = "parameter") pnew$parameter <- as.character(pnew$parameter) pnew.se <- pnew[grepl("*.se", pnew$parameter), ] pnew.se$parameter <- sub(".se", "", pnew.se$parameter) names(pnew.se)[3] <- "se" pnew <- pnew[!grepl("*.se", pnew$parameter), ] return(join(pnew, pnew.se, by = c("t", "parameter"), type = "inner")) } ggp <- ggplot(param_reshape(params), aes(x = t, y = value)) + geom_line() + geom_ribbon(aes(ymin = value - 2 * se, ymax = value + 2 * se), color = "grey", alpha = 0.5) + geom_hline(yintercept = 0.2, color = "blue") + scale_y_continuous(breaks = c(0, 0.2, 0.25, 0.5, 0.75, 1)) + coord_cartesian(ylim = c(0, 1)) + facet_grid(. ~ parameter) print(ggp + ggtitle("NLMINB Optimization"))

The phenomenon is not limited to . also exhibits undesirable behavior. ( isn’t great either, but much better.)

This behavior isn’t unusual; it’s typical. Below are plots for similar series generated with different seeds.

seeds <- c(103117, 123456, 987654, 101010, 8675309, 81891, 222222, 999999, 110011) experiments1 <- foreach(s = seeds) %do% { set.seed(s) x <- garchSim(garchSpec(model = list(alpha = 0.2, beta = 0.2, omega = 0.2)), n.start = 1000, n = 1000) params <- foreach(t = 50:1000, .combine = rbind, .packages = c("fGarch")) %dopar% { getFitData(x[1:t]) } rownames(params) <- 50:1000 params } names(experiments1) <- seeds experiments1 <- lapply(experiments1, param_reshape) names(experiments1) <- c(103117, 123456, 987654, 101010, 8675309, 81891, 222222, 999999, 110011) experiments1_df <- ldply(experiments1, .id = "seed") head(experiments1_df) ## seed t parameter value se ## 1 103117 50 omega 0.1043139 0.9830089 ## 2 103117 51 omega 0.1037479 4.8441246 ## 3 103117 52 omega 0.1032197 4.6421147 ## 4 103117 53 omega 0.1026722 1.3041128 ## 5 103117 54 omega 0.1020266 0.5334988 ## 6 103117 55 omega 0.2725939 0.6089607 ggplot(experiments1_df, aes(x = t, y = value)) + geom_line() + geom_ribbon(aes(ymin = value - 2 * se, ymax = value + 2 * se), color = "grey", alpha = 0.5) + geom_hline(yintercept = 0.2, color = "blue") + scale_y_continuous(breaks = c(0, 0.2, 0.25, 0.5, 0.75, 1)) + coord_cartesian(ylim = c(0, 1)) + facet_grid(seed ~ parameter) + ggtitle("Successive parameter estimates using NLMINB optimization")

In this plot we see pathologies of other kinds for , especially for seeds 222222 and 999999, where is chronically far below the correct value. For all of these simulations starts much larger than the correct value, near 1, and for the two seeds mentioned earlier jumps from being very high to suddenly very low. (Not shown here are results for seeds 110131 and 110137; they’re even worse!)

The other parameters are not without their own pathologies but the situation does not seem quite so grim. It’s possible the pathologies we do see are tied to estimation of . In fact, if we look at the analagous experiment for the ARCH(1) process (which is a GARCH(1,0) process, equivalent to setting ) we see better behavior.

set.seed(110117) x <- garchSim(garchSpec(model = list(alpha = 0.2, beta = 0.2, omega = 0.2)), n.start = 1000, n = 1000) xarch <- garchSim(garchSpec(model = list(omega = 0.2, alpha = 0.2, beta = 0)), n.start = 1000, n = 1000) params_arch <- foreach(t = 50:1000, .combine = rbind, .packages = c("fGarch")) %dopar% { getFitData(xarch[1:t], formula = ~ garch(1, 0)) } rownames(params_arch) <- 50:1000 print(ggp %+% param_reshape(params_arch) + ggtitle("ARCH(1) Model"))

The pathology appears to be numerical in nature and closely tied to . garchFit(), by default, uses nlminb() (a quasi-Newton method with constraints) for solving the optimization problem, using a numerically-computed gradient. We can choose alternative methods, though; we can use the L-BFGS-B method, and we can spice both with the Nelder-Mead method.

Unfortunately these alternative optimization algorithms don’t do better; they may even do worse.

# lbfgsb algorithm params_lbfgsb <- foreach(t = 50:1000, .combine = rbind, .packages = c("fGarch")) %dopar% { getFitData(x[1:t], algorithm = "lbfgsb") } rownames(params_lbfgsb) <- 50:1000 # nlminb+nm algorithm params_nlminbnm <- foreach(t = 50:1000, .combine = rbind, .packages = c("fGarch")) %dopar% { getFitData(x[1:t], algorithm = "nlminb+nm") } rownames(params_nlminbnm) <- 50:1000 # lbfgsb+nm algorithm params_lbfgsbnm <- foreach(t = 50:1000, .combine = rbind, .packages = c("fGarch")) %dopar% { getFitData(x[1:t], algorithm = "lbfgsb+nm") } rownames(params_lbfgsbnm) <- 50:1000 # cond.dist is norm (default) params_norm <- foreach(t = 50:1000, .combine = rbind, .packages = c("fGarch")) %dopar% { getFitData(x[1:t], cond.dist = "norm") } rownames(params_norm) <- 50:1000 print(ggp %+% param_reshape(params_lbfgsb) + ggtitle("L-BFGS-B Optimization"))

print(ggp %+% param_reshape(params_nlminbnm) + ggtitle("nlminb Optimization with Nelder-Mead"))

print(ggp %+% param_reshape(params_lbfgsbnm) + ggtitle("L-BFGS-B Optimization with Nelder-Mead"))

Admittedly, though, QMLE is not the default estimation method garchFit() uses. The default is the Normal distribution. Unfortunately this is no better.

print(ggp %+% param_reshape(params_norm) + ggtitle("cond.dist = 'norm'"))

On CRAN, fGarch has not seen an update since 2013! It’s possible that fGarch is starting to show its age and newer packages have addressed some of the problems I’ve highlighted here. The package tseries provides a function garch() that also fits models via QMLE, and is much newer than fGarch. It is the only other package I am aware of that fits models.

Unfortunately, garch() doesn’t do much better; in fact, it appears to be much worse. Once again, the problem lies with .

library(tseries) getFitDatagarch <- function(x) { garch(x)$coef } params_tseries <- foreach(t = 50:1000, .combine = rbind, .packages = c("tseries")) %dopar% { getFitDatagarch(x[1:t]) } rownames(params_tseries) <- 50:1000 param_reshape_tseries <- function(p) { p <- as.data.frame(p) p$t <- as.integer(rownames(p)) pnew <- melt(p, id.vars = "t", variable.name = "parameter") pnew$parameter <- as.character(pnew$parameter) return(pnew) } ggplot(param_reshape_tseries(params_tseries), aes(x = t, y = value)) + geom_line() + geom_hline(yintercept = 0.2, color = "blue") + scale_y_continuous(breaks = c(0, 0.2, 0.25, 0.5, 0.75, 1)) + coord_cartesian(ylim = c(0, 1)) + facet_grid(. ~ parameter)

All of these experiments were performed on fixed (yet randomly chosen) sequences. They suggest that especially for sample sizes of less than, say, 300 (possibly larger) distributional guarantees for the estimates of parameters are suspect. What happens when we simulate many processes and look at the distribution of the parameters?

I simulated 10,000 processes of sample sizes 100, 500, and 1000 (using the same parameters as before). Below are the empirical distributions of the parameter estimates.

experiments2 <- foreach(n = c(100, 500, 1000)) %do% { mat <- foreach(i = 1:10000, .combine = rbind, .packages = c("fGarch")) %dopar% { x <- garchSim(garchSpec(model = list(omega = 0.2, alpha = 0.2, beta = 0.2)), n.start = 1000, n = n) getFitData(x) } rownames(mat) <- NULL mat } names(experiments2) <- c(100, 500, 1000) save(params, x, experiments1, xarch, params_arch, params_lbfgsb, params_nlminbnm, params_lbfgsbnm, params_norm, params_tseries, experiments2, file="garchfitexperiments.Rda") param_sim <- lapply(experiments2, function(mat) { df <- as.data.frame(mat) df <- df[c("omega", "alpha1", "beta1")] return(df) }) %>% ldply(.id = "n") param_sim <- param_sim %>% melt(id.vars = "n", variable.name = "parameter") head(param_sim) ## n parameter value ## 1 100 omega 8.015968e-02 ## 2 100 omega 2.493595e-01 ## 3 100 omega 2.300699e-01 ## 4 100 omega 3.674244e-07 ## 5 100 omega 2.697577e-03 ## 6 100 omega 2.071737e-01 ggplot(param_sim, aes(x = value)) + geom_density(fill = "grey", alpha = 0.7) + geom_vline(xintercept = 0.2, color = "blue") + facet_grid(n ~ parameter)

When the sample size is 100, these estimators are far from reliable. and have an unnerving tendency to be almost 0, and can be just about anything. As we saw above, the standard errors reported by garchFit() do not capture this behavior. For larger sample sizes and behave better, but still displays unnerving behavior. Its spread barely changes and it still has a propensity to be far too small.

What bothers me most is that a sample size of 1,000 strikes me as being a large sample size. If one were looking at daily data for, say, stock prices, this sample size roughly corresponds to maybe 4 years of data. This suggests to me that this pathological behavior is affecting GARCH models people are trying to estimate now and use in models.

Conclusion

An article by John C. Nash entitled “On best practice optimization methods in R”, published in the Journal of Statistical Software in September 2014, discussed the need for better optimization practices in R. In particular, he highlighted, among others, the methods garchFit() uses (or at least their R implementation) as outdated. He argues for greater awareness in the community for optimization issues and for greater flexibility in packages, going beyond merely using different algorithms provided by optim().

The issues I highlighted in this article made me more aware of the importance of choice in optimization methods. My initial objective was to write a function for performing statistical tests depending structural change in GARCH models. These tests rely heavily on successive estimation of the parameters of models as I demonstrated here. At minimum my experiments show that the variation in the parameters isn’t being captured adequately by standard errors, but also there’s a potential for unacceptably high instability in parameter estimates. They’re so unstable it would take a miracle for the test to not reject the null hypothesis of no change. After all, just looking at the pictures of simulated data and one might conclude that the alternative hypothesis of structural change is true. Thus every time I have tried to implement our test on data where the null hypothesis was supposedly true, the test unequivocally rejected it with $p$-values of essentially 0.

I have heard people conducting hypothesis testing for detecting structural change in GARCH models so I would not be surprised if the numerical instability I have written about here can be avoided. This is a subject I admittedly know little about and I hope that if someone in the R community has already observed this behavior and knows how to resolve it they let me know in the comments or via e-mail. I may write a retraction and show how to produce stable estimates of the parameters with garchFit(). Perhaps the key lies in the function garchFitControl().

I’ve also thought about writing my own optimization routine tailored to my test. Prof. Nash emphasized in his paper the importance of tailoring optimization routines to the needs of particular problems. I’ve written down the quantity to optimize and I have a formula for the gradient and Hessian matrix of the . Perhaps successive optimizations as required by our test could use the parameters from previous iterations as initial values, helping prevent optimizers from finding distant, locally optimal yet globally suboptimal solutions.

Already though this makes the problem more difficult than I initially thought finding an example for our test would be. I’m planning on tabling detecting structural change in GARCH models for now and using instead an example involving merely linear regression (a much more tractable problem). But I hope to hear others’ input on what I’ve written here.

sessionInfo() ## R version 3.3.3 (2017-03-06) ## Platform: i686-pc-linux-gnu (32-bit) ## Running under: Ubuntu 16.04.2 LTS ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] dplyr_0.7.2 plyr_1.8.4 reshape2_1.4.2 ## [4] fGarch_3010.82.1 fBasics_3011.87 timeSeries_3022.101.2 ## [7] timeDate_3012.100 ggplot2_2.2.1 ## ## loaded via a namespace (and not attached): ## [1] Rcpp_0.12.11 bindr_0.1 knitr_1.17 magrittr_1.5 ## [5] munsell_0.4.3 colorspace_1.3-2 R6_2.2.0 rlang_0.1.2 ## [9] stringr_1.2.0 tools_3.3.3 grid_3.3.3 gtable_0.2.0 ## [13] htmltools_0.3.6 assertthat_0.1 yaml_2.1.14 lazyeval_0.2.0 ## [17] rprojroot_1.2 digest_0.6.12 tibble_1.3.4 bindrcpp_0.2 ## [21] glue_1.1.1 evaluate_0.10 rmarkdown_1.6 labeling_0.3 ## [25] stringi_1.1.5 scales_0.4.1 backports_1.0.5 pkgconfig_2.0.1
1. Bollerslev introduced GARCH models in his 1986 paper entitled “General autoregressive conditional heteroscedasticity”.
2. See the book GARCH Models: Structure, Statistical Inference and Financial Applications, by Christian Francq and Jean-Michel Zakoian

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Role Playing with Probabilities: The Importance of Distributions

Thu, 11/02/2017 - 17:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Jocelyn Barker, Data Scientist at Microsoft

I have a confession to make. I am not just a statistics nerd; I am also a role-playing games geek. I have been playing Dungeons and Dragons (DnD) and its variants since high school. While playing with my friends the other day it occurred to me, DnD may have some lessons to share in my job as a data scientist. Hidden in its dice rolling mechanics is a perfect little experiment for demonstrating at least one reason why practitioners may resist using statistical methods even when we can demonstrate a better average performance than previous methods. It is all about distributions. While our averages may be higher, the distribution of individual data points can be disastrous.

Why Use Role-Playing Games as an Example?

Partially because it means I get to think about one of my hobbies at work. More practically, because consequences of probability distributions can be hard to examine in the real world. How do you quantify the impact of having your driverless car misclassify objects on the road? Games like DnD on the other hand were built around quantifying the impact of decisions. You decide to do something, add up some numbers that represent the difficulty of what you want to do, and then roll dice to add in some randomness. It also means it is a great environment to study how the distribution of the randomness impacts the outcomes.

A Little Background on DnD

One of the core mechanics of playing DnD and related role-playing games involve rolling a 20 sided die (often referred to as a d20). If you want your character to do something like climb a tree, there is some assigned difficulty for it (eg. 10) and if you roll higher than that number, you achieve your goal. If your character is good at that thing, they get to add a skill modifier (eg. 5) to the number they roll making it more likely that they can do what they wanted to do. If the thing you want to do involves another character, things change a little. Instead of having a set difficulty like for climbing a tree, the difficulty is an opposed roll from the other player. So if Character A wants to sneak past Character B, both players roll d20s and Character A adds their “stealth” modifier against Character B’s “perception” modifier. Whoever between them gets a higher number wins with a tie going to the “perceiver”. Ok, I promise, that is all the DnD rules you need to know for this blog post.

Alternative Rolling Mechanics: What’s in a Distribution?

So here is where the stats nerd in me got excited. Some people change the rules of rolling to make different distributions. The default distribution is pretty boring, 20 numbers with equal probability:

One common way people modify this is with the idea of “critical”. The idea is that sometimes people do way better or worse than average. To reflect this, if you roll a 20, instead of adding 20 to your modifier, you add 30. If you roll a 1, you subtract 10 from your modifier.

Another stats nerd must have made up the last distribution. The idea for constructing it is weird, but the behavior is much more Gaussian. It is called 3z8 because you roll 3 eight-sided dice that are numbered 0-7 and sum them up giving a value between 0 and 21. 1-20 act as in the standard rules, but 0 and 21 are now treated like criticals (but at a much lower frequency than before).

The cool thing is these distributions have almost identical expected values (10.5 for d20, 10.45 with criticals, and 10.498 for 3z8), but very different distributions. How do these distributions affect the game? What can we learn from this as statisticians?

Our Case Study: Sneaking Past the Guards

To examine how our distributions affects outcome, we will look at a scenario where a character, who we will call the rogue, wants to sneak past three guards. If any of the guard’s perception is greater than or equal to the rogue’s stealth, we will say the rogue loses the encounter, if they are all lower, the rogue is successful. We can already see the rogue is at a disadvantage; any one of the guards succeeding is a failure for her. We note that assuming all the guards have the same perception modifier, the actual value of the modifier for the guards doesn’t matter, just the difference between their modifier and the modifier of the rogue because the two modifiers are just a scalar adjustment of the value rolled. In other words, it doesn’t matter if the guards are average Joes with a 0 modifier and the rogue is reasonably sneaky with a +5 or if the guards are hyper alert with a +15 and the rogue is a ninja with a +20; the probability of success is the same in the two scenarios.

Computing the Max Roll for the Guards

Lets start off getting a feeling for what the dice rolls will look like. Since the rogue is only rolling one die, her probability distribution looks the same as the distribution of the dice from the previous section. Now, lets consider the guards. In order for the rogue to fail to sneak by, she only needs to be seen by one of the guards. That means we just need to look at the probability that the maximum roll for one of the guards is $$n$$ for $$n \in 1,..,20$$. We will start with our default distribution. The number of ways you can have 3 dice roll a value of $$n$$ or less is $$n^3$$. Therefore the number of ways you can have the max value of the dice be exactly $$n$$ is the number of ways you can roll $$n$$ or less minus the number of ways where all the dice are $$n – 1$$ or less giving us $$n^3 – (n – 1)^3$$ ways to roll a max value of $$n$$. Finally, we can divide by the total number of roll combinations for an 20-sided dice, $$20^3$$, giving us our final probabilities of:

$\frac{n^3 – (n-1)^3}{20^3} \textrm{for} \{n \in 1, …, 20\}$

The only thing that changes when we add criticals to the mix is that now the probabilities previously assigned to 1 get re-assigned to -10 and those assigned to 20 get reassigned to 30 giving us the following distribution.

This means our guards get a critical success ~14% of the time! This will have a big impact on our final distributions.

Finally, lets look at the distribution for the guards using the 3z8 system.

In the previous distributions, the maximum value became the single most likely roll. Because of the the low probability of rolling a 21 in the 3z8 distribution, this distribution still skews right, but peaks at 14. In this distribution, criticals only occur ~0.6% of the time; much less than the previous distribution.

Impact on Outcome

Now that we have looked at the distributions of the rolls for the rogue and the guards, lets see what our final outcomes look like. As previously mentioned, we don’t need to worry about the specific modifiers for the two groups, just the difference between them. Below is a plot showing the relative modifier for the rogue on the x-axis and the probability of success on the y-axis for our three different probability distributions.

We see that for the entire curve, our odds of success goes down when we add criticals and for most of the curve, it goes up for 3z8. Lets think about why. We know the guards are more likely to roll a 20 and less likely to roll a 1 from the distribution we made earlier. This happens about 14% of the time, which is pretty common, and when it happens, the rogue has to have a very high modifier and still roll well to overcome it unless they also roll a 20. On the other hand, with 3z8 system, criticals are far less common and everyone rolls close to average more of the time. The expected value for the rogue is ~10.5, where as it is ~14 for the guards, so when everyone performs close to average, the rogue only needs a small modifier to have a reasonable chance of success.
To illustrate how much of a difference there is between the two, lets consider what would be the minimum modifier needed to have a certain probability of success.

Probability Roll 1d20 With Criticals Roll 3z8 50% 6 7 4 75% 11 13 8 90% 15 22 11 95% 17 27 13

We see from the table that reasonably small modifiers make a big difference in the 3z8 system, where as very large modifiers are needed to have a reasonable chance of success when criticals are added. To give context on just how large this is, when a someone is invisible, this only adds +20 to their stealth checks when they are moving. In other words, in the system with criticals, our rogue could literally be invisible sneaking past a group of not particularly observant gaurds and have a reasonable chance of failing.

The next thing to consider is our rogue may have to make multiple checks to sneak into a place (eg. one to sneak into the courtyard, one to sneak from bush to bush, and then a final one to sneak over the wall). If we look at the results of our rogue making three successful checks in a row, our probabilities change even more.

Probability Roll 1d20 With Criticals Roll 3z8 50% 12 15 9 75% 16 23 11 90% 18 28 14 95% 19 29 15

Making multiple checks exaggerates the differences we saw previously. Part of the reason for the poor performance with the addition of criticals (and for the funny shape of the critical curve) is there is a different cost associated with criticals for the rogue compared to the guards. If the guards roll a 20 or the rogue rolls a 1, when criticals are in play, the guards will almost certainly win, even if the rogue has a much higher modifier. On the other hand, if the guard rolls a 1 or the rogue rolls a 20, there isn’t much difference in outcome between getting that critical and any other low/high roll; play continues to the next round.

How Does This Apply to Data Science?

Many times as data scientists, we think of the predictions we make as discrete data points and when we evaluate our models we use aggregate metrics. It is easy to lose sight that our predictions are samples from a probability distribution, and that aggregate measures can obscure how well our model is really performing. We saw in the example with criticals where big hits and misses can make a huge impact on outcomes, even if the average performance is largely the same. We also saw with the 3z8 system where decreasing the expected value of the roll can actually increase performance by making the “average” outcome more likely.

Does all of this sound contrived to you, like I am trying to force an analogy? Let me make a concrete example from my real life data science job. I am responsible for making the machine learning revenue forecasts for Microsoft. Twice a quarter, I forecast the revenue for all of the products at Microsoft world wide. While these product forecasts do need to be accurate for internal use, the forecasts are also summed up to create segment level forecasts. Microsoft’s segment level forecasts go to Wall Street and having our forecasts fail to meet actuals can be a big problem for the company. We can think about our rogue sneaking past our guards as being an analogy for nailing the segment level forecast. If I succeed for most of the products (our individual guards) but have a critical miss of $1 billion error on one of them, then I have a$1 billion error for the segment and I failed. Also like our rogue, one success doesn’t mean we have won. There is always another quarter and doing well one quarter doesn’t mean Wall Street will cut you some slack the next. Finally, a critical success is less valuable than a critical failure is problematic. Getting the forecasts perfect one quarter will just get your a “good job” and a pat on the back, but a big miss costs the company. In this context, it is easy to see why the finance team doesn’t take the machine learning forecasts as gospel, even with our track record of high accuracy.

So as you evaluate your models, keep our sneaky friend in mind. Rather than just thinking about your average metrics, think about your distribution of errors. Are your errors clustered nicely around the mean or are they scattershot of low and high? What does that mean for your application? Are those really low errors valuable enough to be worth getting the really high ones from time to time? Many times having a reliable model may be more valuable than a less reliable one with higher average performance, so when you evaluate, think distributions, not means.

The charts in this post were all produced using the R language. To see the code behind the charts, take a look at this R Markdown file.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Exploring, Clustering, and Mapping Toronto’s Crimes

Thu, 11/02/2017 - 14:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Motivation

I have had a lot of fun exploring The US cities’ Crime data via their Open Data portals. Because Toronto’s crime data was simply not available.

Not until the summer of this year, Toronto police launch a public safety data portal to increase transparency between the public and officers. I recently have had the chance to explore Toronto’s crimes via Toronto Police Service Public Safety Data Portal. I am particularly interested in Major Crime Indicators (MCI) 2016 which contains a tabular list of 32, 612 reports in 2016 (The only year that the data were made available).

Let’s take a look at the data using R and see if there’s anything interesting.

library(ggplot2) library(ggthemes) library(dplyr) library(viridis) library(tidyr) library(cluster) library(ggmap) library(maps)

After a little bit of exploration, I found that there were many duplicated event_unique_id. Let’s drop them.

toronto <- read.csv('toronto_crime.csv') toronto <- subset(toronto, !duplicated(toronto$event_unique_id)) unique(toronto$occurrenceyear) unique(toronto$reportedyear) > unique(toronto$occurrenceyear) [1] 2016 2015 2014 2012 2010 2013 2011 2003 2007 2008 2006 2002 2001 NA 2005 [16] 2009 2000 > unique(toronto$reportedyear) [1] 2016 Find anything interesting? The occurrence year ranged from 2000 to 2016, but report year is only 2016. This means people came to police to report incidents happened 16 years ago. Let’s have a look how many late reported incidents in our data. year_group <- group_by(toronto, occurrenceyear) crime_by_year <- summarise(year_group, n = n()) crime_by_year # A tibble: 17 x 2 occurrenceyear n 1 2000 2 2 2001 2 3 2002 3 4 2003 2 5 2005 1 6 2006 1 7 2007 5 8 2008 1 9 2009 1 10 2010 10 11 2011 3 12 2012 8 13 2013 21 14 2014 37 15 2015 341 16 2016 27705 17 NA 4 2 incidents occurred in 2000, 2 occurred in 2001 and so on. The vast majority of occurrences happened in 2016. So we are going to keep 2016 only. And I am also removing all the columns we do not need and four rows with missing values. drops <- c("X", "Y", "Index_", "ucr_code", "ucr_ext", "reporteddate", "reportedmonth", "reportedday", "reporteddayofyear", "reporteddayofweek", "reportedhour", "occurrencedayofyear", "reportedyear", "Division", "Hood_ID", "FID") toronto <- toronto[, !(names(toronto) %in% drops)] toronto <- toronto[toronto$occurrenceyear == 2016, ] toronto <- toronto[complete.cases(toronto), ] Explore What are the major crimes in 2016? indicator_group <- group_by(toronto, MCI) crime_by_indicator <- summarise(indicator_group, n=n()) crime_by_indicator <- crime_by_indicator[order(crime_by_indicator$n, decreasing = TRUE),] ggplot(aes(x = reorder(MCI, n), y = n), data = crime_by_indicator) + geom_bar(stat = 'identity', width = 0.5) + geom_text(aes(label = n), stat = 'identity', data = crime_by_indicator, hjust = -0.1, size = 3.5) + coord_flip() + xlab('Major Crime Indicators') + ylab('Number of Occurrences') + ggtitle('Major Crime Indicators Toronto 2016') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold")) Gives this plot: Assault is the most prevalent form of violent crime in Toronto. What is assault? In criminal and civil law, assault is an attempt to initiate harmful or offensive contact with a person, or a threat to do so. What are the different types of assault? Which type is the worst? assault <- toronto[toronto$MCI == 'Assault', ] assault_group <- group_by(assault, offence) assault_by_offence <- summarise(assault_group, n=n()) assault_by_offence <- assault_by_offence[order(assault_by_offence$n, decreasing = TRUE), ] ggplot(aes(x = reorder(offence, n), y = n), data = assault_by_offence) + geom_bar(stat = 'identity', width = 0.6) + geom_text(aes(label = n), stat = 'identity', data = assault_by_offence, hjust = -0.1, size = 3) + coord_flip() + xlab('Types of Assault') + ylab('Number of Occurrences') + ggtitle('Assault Crime Toronto 2016') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold")) Gives this plot: Not much information here, the top assault category is … assault. I eventually learned different types of assault through Attorneys.com. Let’s look at the top offences then offence_group <- group_by(toronto, offence) crime_by_offence <- summarise(offence_group, n=n()) crime_by_offence <- crime_by_offence[order(crime_by_offence$n, decreasing = TRUE), ] ggplot(aes(x = reorder(offence, n), y = n), data = crime_by_offence) + geom_bar(stat = 'identity', width = 0.7) + geom_text(aes(label = n), stat = 'identity', data = crime_by_offence, hjust = -0.1, size = 2) + coord_flip() + xlab('Types of Offence') + ylab('Number of Occurrences') + ggtitle('Offence Types Toronto 2016') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

Assault being the most common offence followed by Break and Enter. According to Wikibooks, the most typical form of break and enter is a break into a commercial or private residence in order to steal property. This indicates that break and enter more likely to occur when there is no one at home or office.

How about crime by time of the day? hour_group <- group_by(toronto, occurrencehour) crime_hour <- summarise(hour_group, n=n()) ggplot(aes(x=occurrencehour, y=n), data = crime_hour) + geom_line(size = 2.5, alpha = 0.7, color = "mediumseagreen", group=1) + geom_point(size = 0.5) + ggtitle('Total Crimes by Hour of Day in Toronto 2016') + ylab('Number of Occurrences') + xlab('Hour(24-hour clock)') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

The worst hour is around the midnight, another peak time is around the noon, then again at around 8pm.

Okay, but what types of crime are most frequent at each hour? hour_crime_group <- group_by(toronto, occurrencehour, MCI) hour_crime <- summarise(hour_crime_group, n=n()) ggplot(aes(x=occurrencehour, y=n, color=MCI), data =hour_crime) + geom_line(size=1.5) + ggtitle('Crime Types by Hour of Day in Toronto 2016') + ylab('Number of Occurrences') + xlab('Hour(24-hour clock)') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

Assaults are the top crimes almost all the time, they happened more often from the late mornings till nights than the early mornings. On the other hand, break and enter happened more often in the mornings and at around the midnight (when no one at home or office) . Robberies and auto thefts are more likely to happen in the late evenings till the nights. They all make sense.

Where are those crimes most likely to occur in Toronto? location_group <- group_by(toronto, Neighbourhood) crime_by_location <- summarise(location_group, n=n()) crime_by_location <- crime_by_location[order(crime_by_location$n, decreasing = TRUE), ] crime_by_location_top20 <- head(crime_by_location, 20) ggplot(aes(x = reorder(Neighbourhood, n), y = n), data = crime_by_location_top20) + geom_bar(stat = 'identity', width = 0.6) + geom_text(aes(label = n), stat = 'identity', data = crime_by_location_top20, hjust = -0.1, size = 3) + coord_flip() + xlab('Neighbourhoods') + ylab('Number of Occurrences') + ggtitle('Neighbourhoods with Most Crimes - Top 20') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold")) Gives this plot: The most dangerous neighbourhood is … Waterfront. The sprawling downtown catch-all includes not only the densely packed condo but also the boozy circus that is the entertainment district. The result: a staggering number of violent crimes and arsons. Let’s find out neighbourhoods vs. top offence types offence_location_group <- group_by(toronto, Neighbourhood, offence) offence_type_by_location <- summarise(offence_location_group, n=n()) offence_type_by_location <- offence_type_by_location[order(offence_type_by_location$n, decreasing = TRUE), ] offence_type_by_location_top20 <- head(offence_type_by_location, 20) ggplot(aes(x = Neighbourhood, y=n, fill = offence), data=offence_type_by_location_top20) + geom_bar(stat = 'identity', position = position_dodge(), width = 0.8) + xlab('Neighbourhood') + ylab('Number of Occurrence') + ggtitle('Offence Type vs. Neighbourhood Toronto 2016') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"), axis.text.x = element_text(angle = 90, hjust = 1, vjust = .4))

Gives this plot:

I did not expect something like this. It is not pretty. However, it did tell us that besides assaults, Church-Yonge Corridor and Waterfront had the most break and enter(Don’t move there!). West Humber-Clairville had the most vehicle stolen crimes(Don’t park your car there!).

Let’s try something different crime_count % group_by(occurrencemonth, MCI) %>% summarise(Total = n()) crime_count$occurrencemonth <- ordered(crime_count$occurrencemonth, levels = c('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')) ggplot(crime_count, aes(occurrencemonth, MCI, fill = Total)) + geom_tile(size = 1, color = "white") + scale_fill_viridis() + geom_text(aes(label=Total), color='white') + ggtitle("Major Crime Indicators by Month 2016") + xlab('Month') + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

Much better!

Assault is the most common crime every month of the year with no exception. It appears that there were a little bit more assault incidents in May than the other months last year.

day_count % group_by(occurrencedayofweek, MCI) %>% summarise(Total = n()) ggplot(day_count, aes(occurrencedayofweek, MCI, fill = Total)) + geom_tile(size = 1, color = "white") + scale_fill_viridis() + geom_text(aes(label=Total), color='white') + ggtitle("Major Crime Indicators by Day of Week 2016") + xlab('Day of Week') + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

Saturdays and Sundays had more assaults than any other days, and had less theft over than any other days. Auto thieves are busy almost every day of the week equally.
I was expecting to find seasonal crime patterns such as temperature changes and daylight hours might be associated with crime throughout the year, or the beginning and end of the school year, are associated with variations in crime throughout the year. But this one-year’s worth of data is not enough to address my above concerns. I hope Toronto Police service will release more data via its open data portal in the soonest future.

Homicide homicide <- read.csv('homicide.csv', stringsAsFactors = F) homicide$Occurrence_Date <- as.Date(homicide$Occurrence_Date) year_group <- group_by(homicide, Occurrence_year, Homicide_Type) homicide_by_year <- summarise(year_group, n=n()) ggplot(aes(x = Occurrence_year, y=n, fill = Homicide_Type), data=homicide_by_year) + geom_bar(stat = 'identity', position = position_dodge(), width = 0.8) + xlab('Year') + ylab('Number of Homicides') + ggtitle('Homicide 2004-2016') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

2005 is Toronto’s “Year of Gun”. Eleven years later in 2016, Toronto was experiencing another spike in gun-related homicide.

homicide$month <- format(as.Date(homicide$Occurrence_Date) , "%B") homicide_count % group_by(Occurrence_year, month) %>% summarise(Total = n()) homicide_count$month <- ordered(homicide_count$month, levels = c('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')) ggplot(homicide_count, aes(Occurrence_year, month, fill = Total)) + geom_tile(size = 1, color = "white") + scale_fill_viridis() + geom_text(aes(label=Total), color='white') + ggtitle("Homicides in Toronto (2004-2016)") + xlab('Year') + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

It is worrisome to see that there is a significant increase in the total number of homicides in Toronto in 2016 compared to 2015. I hope we will have a better 2017. However, when I read that Toronto ranked safest city in North America by the Economist, I felt much safer.

K-Means Clustering

K-Means is one of the most popular “clustering” algorithms. It is the process of partitioning a group of data points into a small number of clusters. As in our crime data, we measure the number of assaults and other indicators, and neighbourhoods with high number of assaults will be grouped together. The goal of K-Means clustering is to assign a cluster to each data point (neighbourhood). Thus we first partition datapoints (neighbourhoods) into k clusters in which each neighbourhood belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
As one of the unsupervised learning algorithms,we use K-Mean to build models that help us understand our data better. It enables us to learn groupings of unlabeled data points.

To do clustering analysis, our data has to look like this: by_groups <- group_by(toronto, MCI, Neighbourhood) groups <- summarise(by_groups, n=n()) groups <- groups[c("Neighbourhood", "MCI", "n")] groups_wide <- spread(groups, key = MCI, value = n)

Gives this table:

The fist column — qualitative data should be removed from the analysis z <- groups_wide[, -c(1,1)] The data cannot have any missing values z <- z[complete.cases(z), ] The data must be scaled for comparison m <- apply(z, 2, mean) s <- apply(z, 2, sd) z <- scale(z, m, s) Determine the number of clusters wss <- (nrow(z)-1) * sum(apply(z, 2, var)) for (i in 2:20) wss[i] <- sum(kmeans(z, centers=i)$withiness) plot(1:20, wss, type='b', xlab='Number of Clusters', ylab='Within groups sum of squares') Gives this plot: This plot shows a very strong elbow, based on the plot, we can say with confidence that we do not need more than two clusters (centroids). Fitting a model kc <- kmeans(z, 2) kc K-means clustering with 2 clusters of sizes 121, 10 Cluster means: Assault Auto Theft Break and Enter Robbery Theft Over 1 -0.193042 -0.135490 -0.2176646 -0.1778607 -0.2288382 2 2.335808 1.639429 2.6337422 2.1521148 2.7689425 Clustering vector: [1] 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [40] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 [79] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 [118] 1 1 1 1 1 1 1 1 1 1 1 1 2 1 Within cluster sum of squares by cluster: [1] 183.3436 170.2395 (between_SS / total_SS = 45.6 %) Available components: [1] "cluster" "centers" "totss" "withinss" "tot.withinss" [6] "betweenss" "size" "iter" "ifault" Interpretation: * First cluster has 121 neighbourhoods, and second cluster has 10 neighbourhoods. * Cluster means: If the ranges of these numbers seem strange, it’s because we standardized the data before performing the cluster analysis. The negative values mean “lower than most” and positive values mean “higher than most”. Thus, cluster 1 are neighbourhoods with low assault, low auto theft, low break and enter, low robbery and low theft over. Cluster 2 are neighbourhoods with high assault, high auto theft, high break and enter, high robbery and high theft over. It is good that these two groups have a significant variance in every variable. It indicates that each variable plays a significant role in categorizing clusters. * Clustering vector: The first, second and third neighbourhoods should all belong to cluster 1, the fourth neighbourhood should belong to cluster 2, and so on. * A measurement that is more relative would be the withinss and betweenss. withinss tells us the sum of the square of the distance from each data point to the cluster center. Lower is better. Betweenss tells us the sum of the squared distance between cluster centers. Ideally we want cluster centers far apart from each other. * Available components is self-explanatory. Plotting the k-Means results z1 <- data.frame(z, kc$cluster) clusplot(z1, kc$cluster, color=TRUE, shade=F, labels=0, lines=0, main='k-Means Cluster Analysis') Gives this plot: It appears that our choice of number of clusters is good, and we have little noise. Hierarchical Clustering For the hierarchical clustering methods, the dendrogram is the main graphical tool for getting insight into a cluster solution. z2 <- data.frame(z) distance <- dist(z2) hc <- hclust(distance) Now that we’ve got a cluster solution. Let’s examine the results. plot(hc, labels = groups_wide$Neighbourhood, main='Cluster Dendrogram', cex=0.65)

Gives this plot:

If we choose any height along the y-axis of the dendrogram, and move across the dendrogram counting the number of lines that we cross, each line represents a cluster. For example, if we look at a height of 10, and move across the x-axis at that height, we’ll cross two lines. That defines a two-cluster solution; by following the line down through all its branches, we can see the names of the neighbourhoods that are included in these two clusters. Looking at the dendrogram for the Toronto’s crimes data, we can see our datapoins are very imbalanced. From the top of the tree, there are two distinct groups; one group consists of brunches with brunches and more brunches, while another group only consists few neighbourhoods, and we can see these neighbourhoods are Toronto’s most dangerous neighbourhoods. However, I want to try many different groupings at once to start investigating.

counts = sapply(2:6,function(ncl)table(cutree(hc,ncl))) names(counts) = 2:6 counts > counts $2 1 2 128 3$3 1 2 3 128 2 1 $4 1 2 3 4 121 7 2 1$5 1 2 3 4 5 121 5 2 2 1 $6 1 2 3 4 5 6 112 5 2 9 2 1 Interpretation: * With 2-cluster solution, we have 128 neighbourhoods in cluster 1, 3 neighbourhoods in cluster 2. * With 3-cluster solution, we have 128 neighbourhoods in cluster 1, 2 neighbourhoods in cluster 2, and 1 neighbourhood in cluster 3. And so on until 6-cluster solution. In practice, we’d like a solution where there aren’t too many clusters with just few observations, because it may make it difficult to interpret our results. For this exercise, I want to stick with 3-cluster solution, see what results I will obtain. member <- cutree(hc, 3) aggregate(z, list(member), mean) aggregate(z, list(member), mean) Group.1 Assault Auto Theft Break and Enter Robbery Theft Over 1 1 -0.09969023 -0.08067526 -0.09425688 -0.09349632 -0.1042508 2 2 5.57040267 0.57693723 4.67333848 4.14398508 5.0024522 3 3 1.61954364 9.17255898 2.71820403 3.67955938 3.3392003 In cluster 1, all the crime indicators are on the negative side, cluster 1 has a significant distinction on each variable compare with cluster 2 and cluster 3. Cluster 2 is higher in most of the crime indicators than cluster 3 except Auto Theft. plot(silhouette(cutree(hc, 3), distance)) Gives this plot: The silhouette width value for cluster 3 is zero. The silhouette plot indicates that we really do not need the third cluster, The vast majority of neighbourhoods belong to the first cluster, and 2-cluster will be our solution. Making a Map of Toronto’s Crimes There are many packages for plotting and manipulating spatial data in R. I am going to use ggmap to produce a simple and easy map of Toronto’ crimes. lat <- toronto$Lat lon <- toronto$Long crimes <- toronto$MCI to_map <- data.frame(crimes, lat, lon) colnames(to_map) <- c('crimes', 'lat', 'lon') sbbox <- make_bbox(lon = toronto$Long, lat = toronto$Lat, f = 0.01) my_map <- get_map(location = sbbox, maptype = "roadmap", scale = 2, color="bw", zoom = 10) ggmap(my_map) + geom_point(data=to_map, aes(x = lon, y = lat, color = "#27AE60"), size = 0.5, alpha = 0.03) + xlab('Longitude') + ylab('Latitude') + ggtitle('Location of Major Crime Indicators Toronto 2016') + guides(color=FALSE)

Gives this map:

It’s clear to see where the major crimes in the city occur. A large concentration in the Waterfront area, South of North York is more peaceful than any other areas. However, point-stacking is not helpful when comparing high-density areas, so let’s optimize this visualization.

ggmap(my_map) + geom_point(data=to_map, aes(x = lon, y = lat, color = "#27AE60"), size = 0.5, alpha = 0.05) + xlab('Longitude') + ylab('Latitude') + ggtitle('Location of Major Crime Indicators Toronto 2016') + guides(color=FALSE) + facet_wrap(~ crimes, nrow = 2)

Gives this map:

This is certainly more interesting and prettier. Some crimes, such as Assaults, and Break and Enter occur all over the city, with concentration in the Waterfront areas. Other crimes, such as Auto Theft has a little more points in the west side than the east side. Robbery and Theft Over primarily have clusters in the Waterfront areas.

Summary

Not many more questions can be answered by looking at the data of Toronto Major crimes Indicators. But that’s okay. There’s certainly other interesting things to do with this data, such as creating a dashboard at MicroStrategy.

As always, all the code can be found on Github. I would be pleased to receive feedback or questions on any of the above.

Related Post

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### shadow text effect in grid and ggplot2 graphics

Thu, 11/02/2017 - 08:30

(This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

After the release of meme package, I received several feedbacks from users.

The most usefule one is the comment on my blog post:

Sercan Kahveci

Greetings Mr. Yu,

I am very happy that this package exists. Thank you for making it! I would like to request a feature, to ensure the package is able to compete with professional meme-creation tools like memegenerator and paint.net. Since memes often use the font Impact, in white and with black outline, I believe the package would be more powerful if it also did that automatically.

Regards,

Sercan Kahveci, MSc

Content creator at Questionable Research Memes on Facebook

The words, ‘compete with professional meme-creation tools’, stimulated me to develop text plotting with background outline effect.

Now this feature is available in meme v>=0.0.7, which can be downloaded from CRAN.

Here is an example:

library(meme) u <- system.file("angry8.jpg", package="meme") meme(u, "code", "all the things!")

To make this shadow text feature available to the R community, I created another package, shadowtext, that creates/draws text grob with background shadow for grid and ggplot2. If you are interesting, please refer to the online vignette.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang YU. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Explore Predictive Maintenance with flexdashboard

Thu, 11/02/2017 - 01:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

I have written the following post about Predictive Maintenance and flexdashboard at my company codecentric’s blog:

Predictive Maintenance is an increasingly popular strategy associated with Industry 4.0; it uses advanced analytics and machine learning to optimize machine costs and output (see Google Trends plot below).
A common use-case for Predictive Maintenance is to proactively monitor machines, so as to predict when a check-up is needed to reduce failure and maximize performance. In contrast to traditional maintenance, where each machine has to undergo regular routine check-ups, Predictive Maintenance can save costs and reduce downtime. A machine learning approach to such a problem would be to analyze machine failure over time to train a supervised classification model that predicts failure. Data from sensors and weather information is often used as features in modeling.

With flexdashboard RStudio provides a great way to create interactive dashboards with R. It is an easy and very fast way to present analyses or create story maps. Here, I have used it to demonstrate different analysis techniques for Predictive Maintenance. It uses Shiny run-time to create interactive content.