R-bloggers: R news and tutorials contributed by hundreds of R bloggers

tint 0.0.4: Small enhancements

Sat, 11/04/2017 - 02:02

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance release of the tint package arrived on CRAN earlier today. Its name expands from tint is not tufte as the package offers a fresher take on the Tufte-style for html and pdf presentations.

A screenshot of the pdf variant is below.

This release brings some minor enhancements and polish, mostly learned from having done the related pinp (two-column vignette in the PNAS style) and linl (LaTeX letter) RMarkdown-wrapper packages; see below for details from the NEWS.Rd file.

Changes in tint version 0.0.4 (2017-11-02)
  • Skeleton files are also installed as vignettes (#20).

  • A reference to the Tufte source file now points to tint (Ben Marwick in #19, later extended to other Rmd files).

  • Several spelling and grammar errors were corrected too (#13 and #16 by R. Mark Sharp and Matthew Henderson).

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page.
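For anyone trying the package, an R Markdown document typically just needs to point its output format at tint in the YAML header, roughly like the sketch below (assuming the tint::tintHtml and tint::tintPdf format names; the skeleton files now installed as vignettes are the canonical starting point):

---
title: "A tint example"
author: "Some Author"
output: tint::tintHtml    # or tint::tintPdf for the pdf variant
---

Rendering the document with rmarkdown::render() then produces the Tufte-style output.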

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



Taming exam results in pdf with pdftools

Fri, 11/03/2017 - 23:06

(This article was first published on R – Tales of R, and kindly contributed to R-bloggers)

You can also check this post, written in #blogdown, here: taming-exam-results-with-pdf.

Introduction

There are several ways to mine tables and other content from a pdf using R. After a lot of trial & error, here's how I managed to extract global exam results from an international, massive, yearly examination, the EDAIC.

This is my first use case of “pdf mining” with R, and also a fairly simple one. However, more complex and very fine examples of this can be found elsewhere, using both pdftools and tabulizer packages.

As can be seen from the original pdf, exam results are anonymous. They consist of a numeric, 6-digit code and a binary result: "FAIL / PASS". I was particularly interested in seeing how many candidates passed the exam, as some indirect measure of how "hard" it can be.

Mining the table

In this case I preferred pdftools as it allowed me to extract the whole content from the pdf:

install.packages("pdftools") library(pdftools) txt <- pdf_text("EDAIC.pdf") txt[1] class(txt[1]) [1] "EDAIC Part I 2017 Overall Results\n Candidate N° Result\n 107131 FAIL\n 119233 PASS\n 123744 FAIL\n 127988 FAIL\n 133842 PASS\n 135692 PASS\n 140341 FAIL\n 142595 FAIL\n 151479 PASS\n 151632 PASS\n 152787 PASS\n 157691 PASS\n 158867 PASS\n 160211 PASS\n 161970 FAIL\n 162536 PASS\n 163331 PASS\n 164442 FAIL\n 164835 PASS\n 165734 PASS\n 165900 PASS\n 166469 PASS\n 167241 FAIL\n 167740 PASS\n 168151 FAIL\n 168331 PASS\n 168371 FAIL\n 168711 FAIL\n 169786 PASS\n 170721 FAIL\n 170734 FAIL\n 170754 PASS\n 170980 PASS\n 171894 PASS\n 171911 PASS\n 172047 FAIL\n 172128 PASS\n 172255 FAIL\n 172310 PASS\n 172706 PASS\n 173136 FAIL\n 173229 FAIL\n 174336 PASS\n 174360 PASS\n 175177 FAIL\n 175180 FAIL\n 175184 FAIL\nYour candidate number is indicated on your admission document Page 1 of 52\n" [1] "character"

These commands return a lengthy blob of text. Fortunately, there are some \n symbols that signal the new lines in the original document.

We will use these to split the blob into something more approachable, using tidyverse methods:

  • Split the blob.
  • Transform the resulting list into a character vector with unlist.
  • Trim leading white spaces with stringr::str_trim.
library(tidyverse)
library(stringr)

tx2 <- strsplit(txt, "\n") %>%   # divide by carriage returns
  unlist() %>%
  str_trim(side = "both")        # trim white spaces

tx2[1:10]
 [1] "EDAIC Part I 2017 Overall Results"
 [2] "Candidate N° Result"
 [3] "107131 FAIL"
 [4] "119233 PASS"
 [5] "123744 FAIL"
 [6] "127988 FAIL"
 [7] "133842 PASS"
 [8] "135692 PASS"
 [9] "140341 FAIL"
[10] "142595 FAIL"
  • Remove the very first row.
  • Transform into a tibble.
tx3 <- tx2[-1] %>%
  data_frame()

tx3
# A tibble: 2,579 x 1
   .
   <chr>
 1 Candidate N° Result
 2 107131 FAIL
 3 119233 PASS
 4 123744 FAIL
 5 127988 FAIL
 6 133842 PASS
 7 135692 PASS
 8 140341 FAIL
 9 142595 FAIL
10 151479 PASS
# ... with 2,569 more rows
  • Use tidyr::separate to split each row into two columns.
  • Remove all spaces.
tx4 <- separate(tx3, ., c("key", "value"), " ", extra = "merge") %>%
  mutate(key = gsub('\\s+', '', key)) %>%
  mutate(value = gsub('\\s+', '', value))

tx4
# A tibble: 2,579 x 2
         key    value
 1 Candidate N°Result
 2    107131     FAIL
 3    119233     PASS
 4    123744     FAIL
 5    127988     FAIL
 6    133842     PASS
 7    135692     PASS
 8    140341     FAIL
 9    142595     FAIL
10    151479     PASS
# ... with 2,569 more rows
  • Remove rows that do not represent table elements.
tx5 <- tx4[grep('^[0-9]', tx4[[1]]),]

tx5
# A tibble: 2,424 x 2
      key value
 1 107131  FAIL
 2 119233  PASS
 3 123744  FAIL
 4 127988  FAIL
 5 133842  PASS
 6 135692  PASS
 7 140341  FAIL
 8 142595  FAIL
 9 151479  PASS
10 151632  PASS
# ... with 2,414 more rows

Extracting the results

We already have the table! Now it's time to get to the summary:

library(knitr)

tx5 %>%
  group_by(value) %>%
  summarise(count = n()) %>%
  mutate(percent = paste(round((count / sum(count) * 100), 1), "%")) %>%
  kable()

value   count   percent
FAIL     1017   42 %
PASS     1407   58 %

From these results we see that the EDAIC Part I exam doesn't have a particularly high pass rate. It is currently taken by medical specialists, but its difficulty lies in the very broad list of subjects covered, ranging from applied physics and the whole of human physiology to pharmacology, clinical medicine and the latest guidelines.

Despite being a hard test to pass (and despite the exam fee), it's becoming increasingly popular among anesthesiologists and critical care specialists who wish to stay up to date with current medical knowledge and practice.

 

 



New RStudio cheat sheet: Strings in R

Fri, 11/03/2017 - 19:41

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The RStudio team has created another very useful cheat sheet for R: Working with Strings. This cheat sheet provides an example-laden menu of operations you can perform on strings (character vectors) in R using the stringr package. While base R provides a solid set of string manipulation functions, the stringr package functions are simpler, more consistent (making them easy to use with the pipe operator), and more like the Ruby or Python way of handling string operations.

The back page of the cheat sheet also provides a handy guide to regular expressions, a useful (if often frustrating) tool for extracting substrings according to a pattern you define with special codes.
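As a quick taste of what the cheat sheet covers, here is a small example of my own (not taken from the sheet) combining stringr functions with a regular expression:

library(stringr)

x <- c("apple pie", "banana split", "cherry tart")

str_detect(x, "an")          # which elements contain "an"?
str_extract(x, "^\\w+")      # the first word of each element
str_replace(x, "\\s", "_")   # replace the first whitespace with an underscore
str_sub(x, 1, 3)             # the first three characters

Each of these has a base R counterpart (grepl(), regmatches(), sub(), substr()), but the stringr versions share a consistent string-first argument order that works naturally with the pipe.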

You can pick up the Working with Strings cheat sheet, and others from the series, at the link below.

RStudio: Cheat Sheets



How to identify risky bank loans using C.50 decision trees

Fri, 11/03/2017 - 17:31

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

This tutorial has been taken from Machine Learning with R, Second Edition by Brett Lantz. Use the code MLR250RB at the checkout to save 50% on the RRP.

Or pick up this title with 4 others for just $50 – get the R-Bloggers bundle.

The global financial crisis of 2007-2008 highlighted the importance of transparency and rigor in banking practices. As the availability of credit was limited, banks tightened their lending systems and turned to machine learning to more accurately identify risky loans.

Decision trees are widely used in the banking industry due to their high accuracy and ability to formulate a statistical model in plain language. Since government organizations in many countries carefully monitor lending practices, executives must be able to explain why one applicant was rejected for a loan while the others were approved. This information is also useful for customers hoping to determine why their credit rating is unsatisfactory.

It is likely that automated credit scoring models are employed to instantly approve credit applications on the telephone and web. In this tutorial, we will develop a simple credit approval model using C5.0 decision trees. We will also see how the results of the model can be tuned to minimize errors that result in a financial loss for the institution.

Step 1 – collecting data

The idea behind our credit model is to identify factors that are predictive of higher risk of default. Therefore, we need to obtain data on a large number of past bank loans and whether the loan went into default, as well as information on the applicant.

Data with these characteristics is available in a dataset donated to the UCI Machine Learning Data Repository by Hans Hofmann of the University of Hamburg. The data set contains information on loans obtained from a credit agency in Germany.

The data set presented in this tutorial has been modified slightly from the original in order to eliminate some preprocessing steps. To follow along with the examples, download the credit.csv file from Packt’s website and save it to your R working directory. Simply click here and then click ‘code files’ beneath the cover image.

The credit data set includes 1,000 examples on loans, plus a set of numeric and nominal features indicating the characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default. Let’s see whether we can determine any patterns that predict this outcome.

Step 2 – exploring and preparing the data

As we did previously, we will import data using the read.csv() function. We will ignore the stringsAsFactors option and, therefore, use the default value of TRUE, as the majority of the features in the data are nominal:

> credit <- read.csv("credit.csv")

The first several lines of output from the str() function are as follows:

> str(credit)
'data.frame': 1000 obs. of 17 variables:
 $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..
 $ months_loan_duration: int 6 48 12 ...
 $ credit_history      : Factor w/ 5 levels "critical","good",..
 $ purpose             : Factor w/ 6 levels "business","car",..
 $ amount              : int 1169 5951 2096 ...

We see the expected 1,000 observations and 17 features, which are a combination of factor and integer data types.

Let’s take a look at the table() output for a couple of loan features that seem likely to predict a default. The applicant’s checking and savings account balance are recorded as categorical variables:

> table(credit$checking_balance)

    < 0 DM   > 200 DM 1 - 200 DM    unknown
       274         63        269        394

> table(credit$savings_balance)

     < 100 DM     > 1000 DM  100 - 500 DM 500 - 1000 DM       unknown
          603            48           103            63           183

The checking and savings account balance may prove to be important predictors of loan default status. Note that since the loan data was obtained from Germany, the currency is recorded in Deutsche Marks (DM).

Some of the loan’s features are numeric, such as its duration and the amount of credit requested:

> summary(credit$months_loan_duration)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    4.0    12.0    18.0    20.9    24.0    72.0

> summary(credit$amount)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    250    1366    2320    3271    3972   18420

The loan amounts ranged from 250 DM to 18,420 DM across terms of 4 to 72 months, with a median duration of 18 months and a median amount of 2,320 DM.

The default vector indicates whether the loan applicant was unable to meet the agreed payment terms and went into default. A total of 30 percent of the loans in this dataset went into default:

> table(credit$default)

 no yes
700 300

A high rate of default is undesirable for a bank, because it means that the bank is unlikely to fully recover its investment. If we are successful, our model will identify applicants that are at high risk to default, allowing the bank to refuse credit requests.

Data preparation: Creating random training and test data sets

In this example, we'll split our data into two portions: a training dataset to build the decision tree and a test dataset to evaluate the performance of the model on new data. We will use 90 percent of the data for training and 10 percent for testing, which will provide us with 100 records to simulate new applicants.

We’ll use a random sample of the credit data for training. A random sample is simply a process that selects a subset of records at random. In R, the sample() function is used to perform random sampling. However, before putting it in action, a common practice is to set a seed value, which causes the randomization process to follow a sequence that can be replicated later on if desired. It may seem that this defeats the purpose of generating random numbers, but there is a good reason for doing it this way. Providing a seed value via the set.seed() function ensures that if the analysis is repeated in the future, an identical result is obtained.

The following commands use the sample() function to select 900 values at random out of the sequence of integers from 1 to 1000. Note that the set.seed() function uses the arbitrary value 123. Omitting this seed will cause your training and testing split to differ from those shown in the remainder of this tutorial:

> set.seed(123)
> train_sample <- sample(1000, 900)

As expected, the resulting train_sample object is a vector of 900 random integers:

> str(train_sample)
 int [1:900] 288 788 409 881 937 46 525 887 548 453 ...

By using this vector to select rows from the credit data, we can split it into the 90 percent training and 10 percent test datasets we desired. Recall that the dash operator used in the selection of the test records tells R to select records that are not in the specified rows; in other words, the test data includes only the rows that are not in the training sample.

> credit_train <- credit[train_sample, ]
> credit_test <- credit[-train_sample, ]

If all went well, we should have about 30 percent of defaulted loans in each of the datasets:

> prop.table(table(credit_train$default))

       no       yes
0.7033333 0.2966667

> prop.table(table(credit_test$default))

  no  yes
0.67 0.33

This appears to be a fairly even split, so we can now build our decision tree.

Tip: If your results do not match exactly, ensure that you ran the command set.seed(123) immediately prior to creating the train_sample vector.

Step 3: Training a model on the data

We will use the C5.0 algorithm in the C50 package to train our decision tree model. If you have not done so already, install the package with install.packages("C50") and load it into your R session with library(C50).

The following syntax box lists some of the most commonly used commands to build decision trees. Compared to the machine learning approaches we used previously, the C5.0 algorithm offers many more ways to tailor the model to a particular learning problem. Once the C50 package has been loaded, the ?C5.0Control command displays a help page with more details on how to fine-tune the algorithm.
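As a rough sketch of the commands in question (my own summary with placeholder names, not the book's syntax box; see ?C5.0 for the authoritative argument list):

# Building the classifier:
#   m <- C5.0(train, class, trials = 1, costs = NULL)
#     train  - data frame of training features
#     class  - factor vector with the class for each row of train
#     trials - optional number of boosting iterations
#     costs  - optional cost matrix for weighting types of errors
#
# Making predictions:
#   p <- predict(m, test, type = "class")   # or type = "prob" for probabilities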

For the first iteration of our credit approval model, we’ll use the default C5.0 configuration, as shown in the following code. The 17th column in credit_train is the default class variable, so we need to exclude it from the training data frame, but supply it as the target factor vector for classification:

> credit_model <- C5.0(credit_train[-17], credit_train$default)

The credit_model object now contains a C5.0 decision tree. We can see some basic data about the tree by typing its name:

> credit_model

Call:
C5.0.default(x = credit_train[-17], y = credit_train$default)

Classification Tree
Number of samples: 900
Number of predictors: 16

Tree size: 57

Non-standard options: attempt to group attributes

The preceding text shows some simple facts about the tree, including the function call that generated it, the number of features (labeled predictors), and examples (labeled samples) used to grow the tree. Also listed is the tree size of 57, which indicates that the tree is 57 decisions deep—quite a bit larger than the example trees we’ve considered so far!

To see the tree’s decisions, we can call the summary() function on the model:

> summary(credit_model)

This results in the following output:

The preceding output shows some of the first branches in the decision tree. The first three lines could be represented in plain language as:

If the checking account balance is unknown or greater than 200 DM, then classify as "not likely to default." Otherwise, if the checking account balance is less than zero DM or between one and 200 DM, and the credit history is perfect or very good, then classify as "likely to default."

The numbers in parentheses indicate the number of examples meeting the criteria for that decision, and the number incorrectly classified by the decision. For instance, on the first line, 412/50 indicates that of the 412 examples reaching the decision, 50 were incorrectly classified as not likely to default. In other words, 50 applicants actually defaulted, in spite of the model’s prediction to the contrary.

Tip: Sometimes a tree results in decisions that make little logical sense. For example, why would an applicant whose credit history is very good be likely to default, while those whose checking balance is unknown are not likely to default? Contradictory rules like this occur sometimes. They might reflect a real pattern in the data, or they may be a statistical anomaly. In either case, it is important to investigate such strange decisions to see whether the tree’s logic makes sense for business use.

After the tree, the summary(credit_model) output displays a confusion matrix, which is a cross-tabulation that indicates the model’s incorrectly classified records in the training data:

Evaluation on training data (900 cases):

            Decision Tree
          ----------------
          Size      Errors
            56  133(14.8%)   <<

           (a)   (b)    <-classified as
          ----  ----
           598    35    (a): class no
            98   169    (b): class yes

The Errors output notes that the model correctly classified all but 133 of the 900 training instances for an error rate of 14.8 percent. A total of 35 actual no values were incorrectly classified as yes (false positives), while 98 yes values were misclassified as no (false negatives).

Decision trees are known for having a tendency to overfit the model to the training data. For this reason, the error rate reported on training data may be overly optimistic, and it is especially important to evaluate decision trees on a test data set.

Step 4: Evaluating model performance

To apply our decision tree to the test dataset, we use the predict() function, as shown in the following line of code:

> credit_pred <- predict(credit_model, credit_test)

This creates a vector of predicted class values, which we can compare to the actual class values using the CrossTable() function in the gmodels package. Setting the prop.c and prop.r parameters to FALSE removes the column and row percentages from the table. The remaining percentage (prop.t) indicates the proportion of records in the cell out of the total number of records:

> library(gmodels)
> CrossTable(credit_test$default, credit_pred,
             prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
             dnn = c('actual default', 'predicted default'))

This results in the following table:

Out of the 100 test loan application records, our model correctly predicted that 59 did not default and 14 did default, resulting in an accuracy of 73 percent and an error rate of 27 percent. This is somewhat worse than its performance on the training data, but not unexpected, given that a model’s performance is often worse on unseen data. Also note that the model only correctly predicted 14 of the 33 actual loan defaults in the test data, or 42 percent. Unfortunately, this type of error is a potentially very costly mistake, as the bank loses money on each default. Let’s see if we can improve the result with a bit more effort.

Step 5: Improving model performance

Our model’s error rate is likely to be too high to deploy it in a real-time credit scoring application. In fact, if the model had predicted “no default” for every test case, it would have been correct 67 percent of the time—a result not much worse than our model’s, but requiring much less effort! Predicting loan defaults from 900 examples seems to be a challenging problem.

Making matters even worse, our model performed especially poorly at identifying applicants who do default on their loans. Luckily, there are a couple of simple ways to adjust the C5.0 algorithm that may help to improve the performance of the model, both overall and for the more costly type of mistakes.

Boosting the accuracy of decision trees

One way the C5.0 algorithm improved upon the C4.5 algorithm was through the addition of adaptive boosting. This is a process in which many decision trees are built and the trees vote on the best class for each example.

Boosting is essentially rooted in the notion that by combining a number of weak performing learners, you can create a team that is much stronger than any of the learners alone. Each of the models has a unique set of strengths and weaknesses and they may be better or worse in solving certain problems. Using a combination of several learners with complementary strengths and weaknesses can therefore dramatically improve the accuracy of a classifier.

The C5.0() function makes it easy to add boosting to our C5.0 decision tree. We simply need to add an additional trials parameter indicating the number of separate decision trees to use in the boosted team. The trials parameter sets an upper limit; the algorithm will stop adding trees if it recognizes that additional trials do not seem to be improving the accuracy. We’ll start with 10 trials, a number that has become the de facto standard, as research suggests that this reduces error rates on test data by about 25%:

> credit_boost10 <- C5.0(credit_train[-17], credit_train$default, trials = 10)

While examining the resulting model, we can see that some additional lines have been added, indicating the changes:

> credit_boost10

Number of boosting iterations: 10
Average tree size: 47.5

Across the 10 iterations, the average tree size shrank. If you would like, you can see all 10 trees by typing summary(credit_boost10) at the command prompt. The summary also lists the model's performance on the training data:

> summary(credit_boost10)

           (a)   (b)    <-classified as
          ----  ----
           629     4    (a): class no
            30   237    (b): class yes

The classifier made 34 mistakes on 900 training examples for an error rate of 3.8 percent. This is quite an improvement over the 14.8 percent training error rate we noted before adding boosting! However, it remains to be seen whether we see a similar improvement on the test data. Let's take a look:

> credit_boost_pred10 <- predict(credit_boost10, credit_test)
> CrossTable(credit_test$default, credit_boost_pred10,
             prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
             dnn = c('actual default', 'predicted default'))

The resulting table is as follows:

Here, we reduced the total error rate from 27 percent prior to boosting down to 18 percent in the boosted model. It does not seem like a large gain, but it is in fact larger than the 25 percent reduction we expected. On the other hand, the model is still not doing well at predicting defaults, getting only 20/33 = 61 percent of them right. The lack of an even greater improvement may be a function of our relatively small training dataset, or it may just be a very difficult problem to solve.

This said, if boosting can be added this easily, why not apply it by default to every decision tree? The reason is twofold. First, if building a decision tree once takes a great deal of computation time, building many trees may be computationally impractical. Secondly, if the training data is very noisy, then boosting might not result in an improvement at all. Still, if greater accuracy is needed, it’s worth giving it a try.

Making some mistakes costlier than others

Giving a loan out to an applicant who is likely to default can be an expensive mistake. One solution to reduce the number of false negatives may be to reject a larger number of borderline applicants, under the assumption that the interest the bank would earn from a risky loan is far outweighed by the massive loss it would incur if the money is not paid back at all.

The C5.0 algorithm allows us to assign a penalty to different types of errors, in order to discourage a tree from making more costly mistakes. The penalties are designated in a cost matrix, which specifies how much costlier each error is, relative to any other prediction.

To begin constructing the cost matrix, we need to start by specifying the dimensions. Since the predicted and actual values can both take two values, yes or no, we need to describe a 2 x 2 matrix, using a list of two vectors, each with two values. At the same time, we’ll also name the matrix dimensions to avoid confusion later on:

> matrix_dimensions <- list(c("no", "yes"), c("no", "yes"))
> names(matrix_dimensions) <- c("predicted", "actual")

Examining the new object shows that our dimensions have been set up correctly:

> matrix_dimensions
$predicted
[1] "no"  "yes"

$actual
[1] "no"  "yes"

Next, we need to assign the penalty for the various types of errors by supplying four values to fill the matrix. Since R fills a matrix by filling columns one by one from top to bottom, we need to supply the values in a specific order:

  • Predicted no, actual no
  • Predicted yes, actual no
  • Predicted no, actual yes
  • Predicted yes, actual yes

Suppose we believe that a loan default costs the bank four times as much as a missed opportunity. Our penalty values could then be defined as:

> error_cost <- matrix(c(0, 1, 4, 0), nrow = 2, dimnames = matrix_dimensions)

This creates the following matrix:

> error_cost
         actual
predicted no yes
      no   0   4
      yes  1   0

As defined by this matrix, there is no cost assigned when the algorithm classifies a no or yes correctly, but a false negative has a cost of 4 versus a false positive’s cost of 1. To see how this impacts classification, let’s apply it to our decision tree using the costs parameter of the C5.0() function. We’ll otherwise use the same steps as we did earlier:

> credit_cost <- C5.0(credit_train[-17], credit_train$default,
                      costs = error_cost)
> credit_cost_pred <- predict(credit_cost, credit_test)
> CrossTable(credit_test$default, credit_cost_pred,
             prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
             dnn = c('actual default', 'predicted default'))

This produces the following confusion matrix:

Compared to our boosted model, this version makes more mistakes overall: 37 percent error here versus 18 percent in the boosted case. However, the types of mistakes are very different. Where the previous models classified only 42 and 61 percent of the actual defaults correctly, in this model 79 percent of the actual defaults were correctly predicted to be defaults. This trade-off, reducing false negatives at the expense of increasing false positives, may be acceptable if our cost estimates were accurate.

This tutorial has been taken from Machine Learning with R, Second Edition by Brett Lantz. Use the code MLR250RB at the checkout to save 50% on the RRP.



Let X=X in R

Fri, 11/03/2017 - 14:37

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Our article "Let’s Have Some Sympathy For The Part-time R User" includes two points:

  • Sometimes you have to write parameterized or re-usable code.
  • The methods for doing this should be easy and legible.

The first point feels abstract, until you find yourself wanting to re-use code on new projects. As for the second point: I feel the wrapr package is the easiest, safest, most consistent, and most legible way to achieve maintainable code re-use in R.

In this article we will show how wrapr makes code-rewriting even easier with its new let x=x automation.


Let X=X

There are very important reasons to choose a package that makes things easier. One is debugging:

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?

Brian Kernighan, The Elements of Programming Style, 2nd edition, chapter 2

Let’s take the monster example from "Let’s Have Some Sympathy For The Part-time R User".

The idea was that perhaps one had worked out a complicated (but useful and important) by-hand survey scoring method:

suppressPackageStartupMessages(library("dplyr")) library("wrapr") d <- data.frame( subjectID = c(1, 1, 2, 2), surveyCategory = c( 'withdrawal behavior', 'positive re-framing', 'withdrawal behavior', 'positive re-framing' ), assessmentTotal = c(5, 2, 3, 4), stringsAsFactors = FALSE ) scale <- 0.237 d %>% group_by(subjectID) %>% mutate(probability = exp(assessmentTotal * scale)/ sum(exp(assessmentTotal * scale))) %>% arrange(probability, surveyCategory) %>% mutate(isDiagnosis = row_number() == n()) %>% filter(isDiagnosis) %>% ungroup() %>% select(subjectID, surveyCategory, probability) %>% rename(diagnosis = surveyCategory) %>% arrange(subjectID) ## # A tibble: 2 x 3 ## subjectID diagnosis probability ## ## 1 1 withdrawal behavior 0.6706221 ## 2 2 positive re-framing 0.5589742

The presumption is that the above pipeline is considered reasonable (but long, complicated, and valuable) dplyr, and our goal is to re-use it on new data that may not have the same column names as our original data.

We are making the huge simplifying assumption that you have studied the article and the above example is now familiar.

The question is: what to do when one wants to process the same type of data with different column names? For example:

d <- data.frame(
  PID = c(1, 1, 2, 2),
  DIAG = c(
    'withdrawal behavior',
    'positive re-framing',
    'withdrawal behavior',
    'positive re-framing'
  ),
  AT = c(5, 2, 3, 4),
  stringsAsFactors = FALSE
)

print(d)
##   PID                DIAG AT
## 1   1 withdrawal behavior  5
## 2   1 positive re-framing  2
## 3   2 withdrawal behavior  3
## 4   2 positive re-framing  4

The new table has the following new column definitions:

subjectID <- "PID" surveyCategory <- "DIAG" assessmentTotal <- "AT" isDiagnosis <- "isD" probability <- "prob" diagnosis <- "label"

We could "reduce to a previously solved problem" by renaming the columns to names we know, doing the work, and then renaming back (which is actually a service that replyr::replyr_apply_f_mapped() supplies).

In "Let’s Have Some Sympathy For The Part-time R User" I advised editing the pipeline to have obvious stand-in names (perhaps in all-capitals) and then using wrapr::let() to perform symbol substitution on the pipeline.

Dr. Nina Zumel has since pointed out to me: if you truly trust the substitution method you can use the original column names and adapt the original calculation pipeline as is (without alteration). Let’s try that:

let(
  c(subjectID = subjectID,
    surveyCategory = surveyCategory,
    assessmentTotal = assessmentTotal,
    isDiagnosis = isDiagnosis,
    probability = probability,
    diagnosis = diagnosis),

  d %>%
    group_by(subjectID) %>%
    mutate(probability = exp(assessmentTotal * scale)/
             sum(exp(assessmentTotal * scale))) %>%
    arrange(probability, surveyCategory) %>%
    mutate(isDiagnosis = row_number() == n()) %>%
    filter(isDiagnosis) %>%
    ungroup() %>%
    select(subjectID, surveyCategory, probability) %>%
    rename(diagnosis = surveyCategory) %>%
    arrange(subjectID))

## # A tibble: 2 x 3
##     PID               label      prob
##
## 1     1 withdrawal behavior 0.6706221
## 2     2 positive re-framing 0.5589742

That works! All we did was: paste the original code into the block and the adapter did all of the work, with no user edits of the code.

It is a bit harder for the user to find which symbols are being replaced, but in some sense they don't really need to know (it is R's job to perform the replacements).

wrapr has a new helper function mapsyms() that automates all of the "let x = x" steps from the above example.




mapsyms() is a simple function that captures variable names and builds a mapping from them to the names they refer to in the current environment. For example we can use it to quickly build the assignment map for the let block, because the earlier assignments such as "subjectID <- "PID"" allow mapsyms() to find the intended re-mappings. This would also be true for other cases, such as re-mapping function arguments to values. Our example becomes:

print(mapsyms(subjectID, surveyCategory, assessmentTotal,
              isDiagnosis, probability, diagnosis))
## $subjectID
## [1] "PID"
##
## $surveyCategory
## [1] "DIAG"
##
## $assessmentTotal
## [1] "AT"
##
## $isDiagnosis
## [1] "isD"
##
## $probability
## [1] "prob"
##
## $diagnosis
## [1] "label"

This allows the solution to be re-written and even wrapped into a function in a very legible form with very little effort:

computeRes <- function(d,
                       subjectID,
                       surveyCategory,
                       assessmentTotal,
                       isDiagnosis,
                       probability,
                       diagnosis) {
  let(
    mapsyms(subjectID, surveyCategory, assessmentTotal,
            isDiagnosis, probability, diagnosis),
    d %>%
      group_by(subjectID) %>%
      mutate(probability = exp(assessmentTotal * scale)/
               sum(exp(assessmentTotal * scale))) %>%
      arrange(probability, surveyCategory) %>%
      mutate(isDiagnosis = row_number() == n()) %>%
      filter(isDiagnosis) %>%
      ungroup() %>%
      select(subjectID, surveyCategory, probability) %>%
      rename(diagnosis = surveyCategory) %>%
      arrange(subjectID)
  )
}

computeRes(d,
           subjectID = "PID",
           surveyCategory = "DIAG",
           assessmentTotal = "AT",
           isDiagnosis = "isD",
           probability = "prob",
           diagnosis = "label")

## # A tibble: 2 x 3
##     PID               label      prob
##
## 1     1 withdrawal behavior 0.6706221
## 2     2 positive re-framing 0.5589742

The idea is: instead of having to mark what instances of symbols are to be replaced (by quoting or de-quoting indicators), we instead declare what symbols are to be replaced using the mapsyms() helper.

mapsyms() is a stand-alone helper function (just as ":=" is). It works not because it is some exceptional corner-case hard-wired into other functions, but because mapsyms()'s reasonable semantics happen to synergize with let()'s reasonable semantics. mapsyms() behaves as a replacement target controller (without needing any cumbersome direct quoting or un-quoting notation!).



I, For One, Welcome Our Forthcoming New robots.txt Overlords

Fri, 11/03/2017 - 14:18

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Despite my week-long Twitter consumption sabbatical (helped — in part — by the nigh week-long internet and power outage here in Maine), I still catch useful snippets from folks. My cow-orker @dabdine shunted a tweet by @terrencehart into a Slack channel this morning, and said tweet contained a link to this little gem. Said gem is the text of a very recent ruling from a District Court in Texas and deals with a favourite subject of mine: robots.txt.

The background of the case is that there were two parties who both ran websites for oil and gas professionals that include job postings. One party filed a lawsuit against the other asserting that they hacked into their system and accessed and used various information in violation of the Computer Fraud and Abuse Act (CFAA), the Stored Wire and Electronic Communications and Transactional Records Access Act (SWECTRA), the Racketeer Influenced and Corrupt Organizations Act (RICO), the Texas Harmful Access by Computer Act (THACA), the Texas Theft Liability Act (TTLA), and the Texas Uniform Trade Secrets Act (TUTS). They also asserted common law claims of misappropriation of confidential information, conversion, trespass to chattels, fraud, breach of fiduciary duty, unfair competition, tortious interference with present and prospective business relationships, civil conspiracy, and aiding and abetting.

The other party filed a motion for dismissal on a number of grounds involving legalese on Terms & Conditions, a rebuttal of CFAA claims and really gnarly legalese around copyrights. There are more than a few paragraphs that make me glad none of my offspring have gone into or have a desire to go into the legal profession. One TLDR here is that T&Cs do, in fact, matter (though that is definitely dependent upon the legal climate where you live or have a case filed against you). We’re going to focus on the DMCA claim which leads us to the robots.txt part.

I shall also preface the rest with “IANAL”, but I don’t think a review of this case requires a law degree.

Command-Shift-R

To refresh memories (or create lasting new ones), robots.txt is a file that is placed at the top of a web site domain (i.e. https://rud.is/robots.txt) that contains robots exclusion standard rules. These rules tell bots (NOTE: if you write a scraper, you’ve written a scraping bot) what they can or cannot scrape and what — if any — delay should be placed between scraping efforts by said bot.

R has two CRAN packages for dealing with these files/rules: robotstxt by Peter Meissner and spiderbar by me. They are not competitors, but are designed much like Reese's Peanut Butter cups — to go together (though Peter did some wicked good testing and noted a possible error in the underlying C++ library I use that could generate Type I or Type II errors in certain circumstances) and each has some specialization. I note them now because you don't have an excuse not to check robots.txt given two CRAN packages being available. Python folks have (at a minimum) robotparser and reppy. Node, Go, and other modern languages all have at least one module/library/package available as well. No. Excuses.
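A minimal sketch of what checking rules from R might look like, using a tiny made-up robots.txt (this assumes the robxp()/can_fetch()/crawl_delays() and paths_allowed() interfaces as I remember them; consult each package's documentation for the authoritative usage):

library(robotstxt)
library(spiderbar)

# A tiny, hypothetical robots.txt to illustrate the rule format
rt_txt <- c(
  "User-agent: *",
  "Crawl-delay: 10",
  "Disallow: /private/"
)

# spiderbar: parse the rules and query them
rt <- robxp(rt_txt)
can_fetch(rt, "/private/data.html", user_agent = "*")  # expect FALSE
can_fetch(rt, "/public/post.html",  user_agent = "*")  # expect TRUE
crawl_delays(rt)

# robotstxt: fetch and check a live site's rules directly
paths_allowed("/tags/r/", domain = "rud.is", bot = "*")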

Your Point?

(Y’all are always in a rush, eh?)

This October, 2017 Texas ruling references a 2007 ruling by a District Court in Pennsylvania. I dug in a bit through searchable Federal case law for mentions of robots.txt and there aren’t many U.S. cases that mention this control, though I am amused a small cadre of paralegals had to type robots.txt over-and-over again.

The dismissal request on the grounds that the CFAA did not apply was summarily rejected. Why? The defendant provided proof that they monitor for scraping activity that violates the robots.txt rules and that they use the Windows Firewall (ugh, they use Windows for web serving) to block offending IP addresses when they discover them.

Nuances came out further along in the dismissal text noting that user-interactive viewing of the member profiles on the site was well-within the T&Cs but that the defendant “never authorized [the use of] automated bots to download over 500,000 profiles” nor to have that data used for commercial purposes.

The kicker (for me, anyway) is the last paragraph of the document in the Conclusion where the defendant asserts that:

  • robots.txt is in fact a bona-fide technological measure to effectively control access to copyright materials
  • the “Internet industry” (I seriously dislike lawyers for wording just like that) has recognized robots.txt as a standard for controlling automated access to resources
  • robots.txt has been a valid enforcement mechanism since 1994

The good bit is: "Whether it actually qualifies in this case will be determined definitively at summary judgment or by a jury." To me, this sounds like a ruling by a jury/judge in favor of robots.txt could mean that it becomes much stronger case law for future misuse claims.

With that in mind:

  • Site owners: USE robots.txt, if — for no other reason — to aid legitimate researchers who want to make use of your data for valid scientific purposes, education or to create non-infringing content or analyses that will be a benefit to the public good. You can also use it to legally protect your content (but there are definitely nuances around how you do that).
  • Scrapers: Check and obey robots.txt rules. You have no technological excuse not to and not doing so really appears that it could come back to haunt you in the very near future.
FIN

I've set up an alert for when future rulings come out for this case and will toss up another post here or on the work-blog (so I can run it by our very-non-skeezy legal team) when it pops up again.

"Best Friends" image by Andy Kelly. Used with permission.


Gold-Mining – Week 9 (2017)

Fri, 11/03/2017 - 04:57

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 9 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

The post Gold-Mining – Week 9 (2017) appeared first on Fantasy Football Analytics.



R-Admins Community

Fri, 11/03/2017 - 01:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Today we’re pleased to announce a new category of community.rstudio.com dedicated to R administrators: https://community.rstudio.com/c/r-admin.

There are already multiple places where you can get help with R, Shiny, the RStudio IDE, and the tidyverse. There are, however, far fewer resources for R admins: people who work with R in production, in large organizations, and in complex environments. We hope this new category will serve as a useful and friendly place to connect with fellow R admins to discuss the issues they deal with. We expect this category to include:

  • Discussions about best practices and ideas

  • General questions to fellow admins about RStudio Pro products, designed to ease friction in R administrator workflows

  • An exchange of ideas on domain-specific use cases and configurations

If you’re an existing RStudio customer, this forum is a complement to RStudio’s direct support:

  • Folks from RStudio will participate, but only lightly moderate topics and discussions.

  • RStudio commercial license holders should still feel free to report Pro
    product problems to support@rstudio.com.

  • If you think a topic needs RStudio support’s attention, please suggest that
    the poster contact RStudio support directly. You can also tag @support in a reply.



Problems In Estimating GARCH Parameters in R

Thu, 11/02/2017 - 20:35

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

These days my research focuses on change point detection methods. These are statistical tests and procedures to detect a structural change in a sequence of data. An early example, from quality control, is detecting whether a machine became uncalibrated when producing a widget. There may be some measurement of interest, such as the diameter of a ball bearing, that we observe. The machine produces these widgets in sequence. Under the null hypothesis, the ball bearing's mean diameter does not change, while under the alternative, at some unknown point in the manufacturing process the machine became uncalibrated and the mean diameter of the ball bearings changed. The test then decides between these two hypotheses.

These types of tests matter to economists and financial sector workers as well, particularly for forecasting. Once again we have a sequence of data indexed by time; my preferred example is the price of a stock, which people can instantly recognize as a time series given how common time series graphs for stocks are, but there are many more datasets, such as a state's GDP or the unemployment rate. Economists want to forecast these quantities using past data and statistics. One of the assumptions the statistical methods make is that the series being forecasted is stationary: the data was generated by one process with a single mean, autocorrelation, distribution, etc. This assumption isn't always tested, yet it is critical to successful forecasting. Tests for structural change check this assumption, and if it turns out to be false, the forecaster may need to divide up their dataset when training their models.

I have written about these tests before, introducing the CUSUM statistic, one of the most popular statistics for detecting structural change. My advisor and a former Ph.D. student of his (currently a professor at the University of Waterloo, Greg Rice) developed a new test statistic that better detects structural changes that occur early or late in the dataset (imagine the machine producing widgets became uncalibrated just barely, and only the last dozen of the hundred widgets in the sample were affected). We're in the process of making revisions requested by a journal to whom we submitted our paper, one of the revisions being a better example application (we initially worked with the wage/productivity data I discussed in the aforementioned blog post; the reviewers complained that these variables are codetermined so it's nonsense to regress one on the other, a complaint I disagree with but won't plant my flag on to defend).
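For readers who have not seen it, here is a bare-bones sketch of the classical CUSUM statistic for a change in mean (my own illustration of the textbook statistic, not the new test described above):

# CUSUM statistic for a single change in mean, assuming roughly iid errors
cusum_stat <- function(x) {
  n <- length(x)
  s <- cumsum(x - mean(x))          # centered partial sums
  max(abs(s)) / (sd(x) * sqrt(n))   # sd(x) is a crude long-run sd estimate
}

set.seed(2017)
x_null <- rnorm(500)                            # no change in mean
x_alt  <- c(rnorm(250), rnorm(250, mean = 1))   # mean shifts halfway through

cusum_stat(x_null)  # typically modest, below the ~1.36 asymptotic 5% critical value
cusum_stat(x_alt)   # large, pointing to a structural change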

We were hoping to apply a version of our test to detecting structural change in GARCH models, a common model in financial time series. To my knowledge the “state of the art” R package for GARCH model estimation and inference (along with other work) is fGarch; in particular, the function garchFit() is used for estimating GARCH models from data. When we tried to use this function in our test, though, we were given obviously bad numbers (we had already done simulation studies to know what behavior to expect). The null hypothesis of no change was soundly rejected on simulated sequences where it was true. I never saw the test fail to reject the null hypothesis, even though the null hypothesis was always true. This was the case even for sample sizes of 10,000, hardly a small sample.

We thought the problem might lie with the estimation of the covariance matrix of the parameter estimates, and I painstakingly derived and programmed functions to get this matrix not using numerical differentiation procedures, yet this did not stop the bad behavior. Eventually my advisor and I last Wednesday played with garchFit() and decided that the function is to blame. The behavior of the function on simulated data is so erratic when estimating parameters (not necessarily the covariance matrix as we initially thought, though it’s likely polluted as well) the function is basically useless, to my knowledge.

This function should be well-known and it's certainly possible that the problem lies with me, not fGarch (or perhaps there are better packages out there). This strikes me as a function of such importance that I should share my findings. In this article I show a series of numerical experiments demonstrating garchFit()'s pathological behavior.

Basics on GARCH Models

The GARCH(1,1) model is a time series model often used to model the volatility of financial instrument returns, such as the returns from stocks. Let $x_t$ represent the process. This could represent the deviations in the returns of, say, a stock. The model (without a mean parameter) is defined recursively as:

$x_t = \sigma_t \epsilon_t$

$\sigma_t^2 = \omega + \alpha x_{t-1}^2 + \beta \sigma_{t-1}^2$

$\sigma_t$ is the conditional standard deviation of the process, also known as the conditional volatility, and $\epsilon_t$ is a random process.

People who follow finance1 noticed that returns to financial instruments (such as stocks or mutual funds) exhibit behavior known as volatility clustering. Some periods a financial instrument is relatively docile; there are not dramatic market movements. In others an instrument’s price can fluctuate greatly, and these periods are not one-off single-day movements but can last for a period of time. GARCH models were developed to model volatility clustering.

It is believed by some that even if a stock's daily movement is essentially unforecastable (a stock is equally likely to over- or under-perform on any given day), the volatility is forecastable. Even for those who don't have the hubris to believe anything about future returns can be forecasted, these models are important. For example, if one uses a regression model to estimate the beta statistic for a stock (regressing the stock's return at time $t$ on the market return plus "random noise"), there is a good chance that the noise term is not an i.i.d. sequence of random numbers (as is commonly assumed in other statistical contexts) but actually a GARCH sequence. The modeller would then want to know the behavior of her estimates in such a situation. Thus GARCH models are considered important. In fact, the volatility clustering behavior I just described is sometimes described as "GARCH behavior", since it appears frequently and GARCH models are a frequent tool of choice to address it. (The acronym GARCH stands for generalized autoregressive conditional heteroskedasticity, which is statistics-speak for changing, time-dependent volatility.)

$\epsilon_t$ can be any random process but a frequent choice is to use a sequence of i.i.d. standard Normal random variables. Here $\epsilon_t$ is the only source of randomness in the model. In order for a GARCH(1,1) process to have a stationary solution, we must require that $\alpha + \beta < 1$. In this case the process has a long-run variance of $\omega / (1 - \alpha - \beta)$.

Estimating GARCH Parameters

The process I wrote down above is an infinite process; the index $t$ can extend to negative numbers and beyond. Obviously in practice we don't observe infinite sequences, so if we want to work with these models in practice we need to consider a similar sequence:

$\tilde{x}_t = \tilde{\sigma}_t \epsilon_t$

$\tilde{\sigma}_t^2 = \omega + \alpha \tilde{x}_{t-1}^2 + \beta \tilde{\sigma}_{t-1}^2, \quad t \geq 1$

Below is the new sequence's secret sauce: the recursion is started from a fixed initial value $\tilde{\sigma}_0^2$.

We choose an initial value for this sequence (the theoretical sequence described earlier does not have an initial value)! This sequence strongly resembles the theoretical sequence but it is observable in its entirety, and it can be shown that parameters estimated using this sequence closely approximate those of the theoretical, infinite process.

Naturally one of the most important tasks for these processes is estimating their parameters; for the GARCH(1,1) process, these are $\omega$, $\alpha$, and $\beta$. A basic approach is to find the quasi-maximum likelihood estimation (QMLE) estimates. Let's assume that we have $T$ observations from our process. In QMLE, we work with the conditional distribution of $x_t$ when assuming $\epsilon_t$ follows a standard normal distribution (that is, $\epsilon_t \sim N(0, 1)$). We assume that the entire history of the process up to time $t$ is known; this implies that $\sigma_t^2$ is known as well (in fact, all we needed to know were the values of the process at time $t - 1$, but I digress). In that case we have $x_t \mid x_{t-1}, x_{t-2}, \ldots \sim N(0, \sigma_t^2)$. Let $f_t$ be the conditional density of $x_t$ (so $f_t(x) = \frac{1}{\sqrt{2 \pi \sigma_t^2}} \exp\left(-\frac{x^2}{2 \sigma_t^2}\right)$). The quasi-likelihood equation is then

$L(\omega, \alpha, \beta) = \prod_{t=1}^{T} f_t(x_t)$

Like most likelihood methods, rather than optimize the quasi-likelihood function directly, statisticians try to optimize the log-likelihood, $\log L(\omega, \alpha, \beta)$, and after some work it's not hard to see this is equivalent to minimizing (up to constants that do not depend on the parameters)

$\frac{1}{2} \sum_{t=1}^{T} \left( \log \sigma_t^2 + \frac{x_t^2}{\sigma_t^2} \right)$

Note that $\omega$, $\alpha$, and $\beta$ are involved in this quantity through $\sigma_t^2$. There is no closed form solution for the parameters that minimize this quantity. This means that numerical optimization techniques must be applied to find the parameters.
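To make that concrete, here is a bare-bones sketch of what minimizing this quantity numerically could look like with stats::optim() (my own illustration, not the routine fGarch uses; the initialization of the volatility recursion and the starting values are arbitrary choices):

# Gaussian quasi-log-likelihood objective (up to constants) for a GARCH(1,1)
garch_qmle_obj <- function(par, x) {
  omega <- par[1]; alpha <- par[2]; beta <- par[3]
  # penalize parameter values outside the stationarity/positivity region
  if (omega <= 0 || alpha < 0 || beta < 0 || alpha + beta >= 1) return(1e10)
  n <- length(x)
  sigma2 <- numeric(n)
  sigma2[1] <- var(x)   # arbitrary initialization of the recursion
  for (t in 2:n) {
    sigma2[t] <- omega + alpha * x[t - 1]^2 + beta * sigma2[t - 1]
  }
  sum(log(sigma2) + x^2 / sigma2) / 2
}

# x: a numeric vector of mean-zero observations (e.g. simulated GARCH data)
qmle_fit <- optim(c(0.1, 0.1, 0.1), garch_qmle_obj, x = as.numeric(x),
                  method = "Nelder-Mead")
qmle_fit$par   # rough estimates of omega, alpha, beta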

It can be shown that the estimators for the parameters $\omega$, $\alpha$, and $\beta$, when computed this way, are consistent (meaning that asymptotically they approach their true values, in the sense that they converge in probability) and follow a Gaussian distribution asymptotically.2 These are properties that we associate with the sample mean, and while we might be optimistic that the rate of convergence of these estimators is as good as the rate of convergence of the sample mean, we may expect comparable asymptotic behavior.

Ideally, the parameters should behave like the process illustrated below.

library(ggplot2) x <- rnorm(1000, sd = 1/3) df <- t(sapply(50:1000, function(t) { return(c("mean" = mean(x[1:t]), "mean.se" = sd(x[1:t])/sqrt(t))) })) df <- as.data.frame(df) df$t <- 50:1000 ggplot(df, aes(x = t, y = mean)) + geom_line() + geom_ribbon(aes(x = t, ymin = mean - 2 * mean.se, ymax = mean + 2 * mean.se), color = "grey", alpha = 0.5) + geom_hline(color = "blue", yintercept = 0) + coord_cartesian(ylim = c(-0.5, 0.5))

Behavior of Estimates by fGarch

Before continuing, let’s generate a GARCH(1,1) sequence. Throughout this article I work with GARCH(1,1) processes where all parameters are equal to 0.2. Notice that for such a process the long-run variance will be \(0.2 / (1 - 0.2 - 0.2) = 1/3\) with this choice.

set.seed(110117) library(fGarch) x <- garchSim(garchSpec(model = list("alpha" = 0.2, "beta" = 0.2, "omega" = 0.2)), n.start = 1000, n = 1000) plot(x)

Let’s see the arguments that the fGarch function garchFit() accepts.

args(garchFit) ## function (formula = ~garch(1, 1), data = dem2gbp, init.rec = c("mci", ## "uev"), delta = 2, skew = 1, shape = 4, cond.dist = c("norm", ## "snorm", "ged", "sged", "std", "sstd", "snig", "QMLE"), include.mean = TRUE, ## include.delta = NULL, include.skew = NULL, include.shape = NULL, ## leverage = NULL, trace = TRUE, algorithm = c("nlminb", "lbfgsb", ## "nlminb+nm", "lbfgsb+nm"), hessian = c("ropt", "rcd"), ## control = list(), title = NULL, description = NULL, ...) ## NULL

The function provides a few options for distribution to maximize (cond.dist) and algorithm to use for optimization (algorithm). Here I will always choose cond.dist = QMLE, unless otherwise stated, to instruct the function to use QMLE estimators.

Here’s a single pass.

garchFit(data = x, cond.dist = "QMLE", include.mean = FALSE) ## ## Series Initialization: ## ARMA Model: arma ## Formula Mean: ~ arma(0, 0) ## GARCH Model: garch ## Formula Variance: ~ garch(1, 1) ## ARMA Order: 0 0 ## Max ARMA Order: 0 ## GARCH Order: 1 1 ## Max GARCH Order: 1 ## Maximum Order: 1 ## Conditional Dist: QMLE ## h.start: 2 ## llh.start: 1 ## Length of Series: 1000 ## Recursion Init: mci ## Series Scale: 0.5320977 ## ## Parameter Initialization: ## Initial Parameters: $params ## Limits of Transformations: $U, $V ## Which Parameters are Fixed? $includes ## Parameter Matrix: ## U V params includes ## mu -0.15640604 0.156406 0.0 FALSE ## omega 0.00000100 100.000000 0.1 TRUE ## alpha1 0.00000001 1.000000 0.1 TRUE ## gamma1 -0.99999999 1.000000 0.1 FALSE ## beta1 0.00000001 1.000000 0.8 TRUE ## delta 0.00000000 2.000000 2.0 FALSE ## skew 0.10000000 10.000000 1.0 FALSE ## shape 1.00000000 10.000000 4.0 FALSE ## Index List of Parameters to be Optimized: ## omega alpha1 beta1 ## 2 3 5 ## Persistence: 0.9 ## ## ## --- START OF TRACE --- ## Selected Algorithm: nlminb ## ## R coded nlminb Solver: ## ## 0: 1419.0152: 0.100000 0.100000 0.800000 ## 1: 1418.6616: 0.108486 0.0998447 0.804683 ## 2: 1417.7139: 0.109746 0.0909961 0.800931 ## 3: 1416.7807: 0.124977 0.0795152 0.804400 ## 4: 1416.7215: 0.141355 0.0446605 0.799891 ## 5: 1415.5139: 0.158059 0.0527601 0.794304 ## 6: 1415.2330: 0.166344 0.0561552 0.777108 ## 7: 1415.0415: 0.195230 0.0637737 0.743465 ## 8: 1415.0031: 0.200862 0.0576220 0.740088 ## 9: 1414.9585: 0.205990 0.0671331 0.724721 ## 10: 1414.9298: 0.219985 0.0713468 0.712919 ## 11: 1414.8226: 0.230628 0.0728325 0.697511 ## 12: 1414.4689: 0.325750 0.0940514 0.583114 ## 13: 1413.4560: 0.581449 0.143094 0.281070 ## 14: 1413.2804: 0.659173 0.157127 0.189282 ## 15: 1413.2136: 0.697840 0.155964 0.150319 ## 16: 1413.1467: 0.720870 0.142550 0.137645 ## 17: 1413.1416: 0.726527 0.138146 0.135966 ## 18: 1413.1407: 0.728384 0.137960 0.134768 ## 19: 1413.1392: 0.731725 0.138321 0.132991 ## 20: 1413.1392: 0.731146 0.138558 0.133590 ## 21: 1413.1392: 0.730849 0.138621 0.133850 ## 22: 1413.1392: 0.730826 0.138622 0.133869 ## ## Final Estimate of the Negative LLH: ## LLH: 782.211 norm LLH: 0.782211 ## omega alpha1 beta1 ## 0.2069173 0.1386221 0.1338686 ## ## R-optimhess Difference Approximated Hessian Matrix: ## omega alpha1 beta1 ## omega -8858.897 -1839.6144 -2491.9827 ## alpha1 -1839.614 -782.8005 -531.7393 ## beta1 -2491.983 -531.7393 -729.7246 ## attr(,"time") ## Time difference of 0.04132652 secs ## ## --- END OF TRACE --- ## ## ## Time to Estimate Parameters: ## Time difference of 0.3866439 secs ## ## Title: ## GARCH Modelling ## ## Call: ## garchFit(data = x, cond.dist = "QMLE", include.mean = FALSE) ## ## Mean and Variance Equation: ## data ~ garch(1, 1) ## ## [data = x] ## ## Conditional Distribution: ## QMLE ## ## Coefficient(s): ## omega alpha1 beta1 ## 0.20692 0.13862 0.13387 ## ## Std. Errors: ## robust ## ## Error Analysis: ## Estimate Std. Error t value Pr(>|t|) ## omega 0.20692 0.05102 4.056 5e-05 *** ## alpha1 0.13862 0.04928 2.813 0.00491 ** ## beta1 0.13387 0.18170 0.737 0.46128 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Log Likelihood: ## -782.211 normalized: -0.782211 ## ## Description: ## Thu Nov 2 13:01:14 2017 by user:

The parameters are not necessarily near the true parameters. One might initially attribute this to just randomness, but that doesn’t seem to be the case.

For example, what fit do I get when I fit the model on the first 500 data points?

garchFit(data = x[1:500], cond.dist = "QMLE", include.mean = FALSE) ## ## Series Initialization: ## ARMA Model: arma ## Formula Mean: ~ arma(0, 0) ## GARCH Model: garch ## Formula Variance: ~ garch(1, 1) ## ARMA Order: 0 0 ## Max ARMA Order: 0 ## GARCH Order: 1 1 ## Max GARCH Order: 1 ## Maximum Order: 1 ## Conditional Dist: QMLE ## h.start: 2 ## llh.start: 1 ## Length of Series: 500 ## Recursion Init: mci ## Series Scale: 0.5498649 ## ## Parameter Initialization: ## Initial Parameters: $params ## Limits of Transformations: $U, $V ## Which Parameters are Fixed? $includes ## Parameter Matrix: ## U V params includes ## mu -0.33278068 0.3327807 0.0 FALSE ## omega 0.00000100 100.0000000 0.1 TRUE ## alpha1 0.00000001 1.0000000 0.1 TRUE ## gamma1 -0.99999999 1.0000000 0.1 FALSE ## beta1 0.00000001 1.0000000 0.8 TRUE ## delta 0.00000000 2.0000000 2.0 FALSE ## skew 0.10000000 10.0000000 1.0 FALSE ## shape 1.00000000 10.0000000 4.0 FALSE ## Index List of Parameters to be Optimized: ## omega alpha1 beta1 ## 2 3 5 ## Persistence: 0.9 ## ## ## --- START OF TRACE --- ## Selected Algorithm: nlminb ## ## R coded nlminb Solver: ## ## 0: 706.37230: 0.100000 0.100000 0.800000 ## 1: 706.27437: 0.103977 0.100309 0.801115 ## 2: 706.19091: 0.104824 0.0972295 0.798477 ## 3: 706.03116: 0.112782 0.0950253 0.797812 ## 4: 705.77389: 0.122615 0.0858136 0.788169 ## 5: 705.57316: 0.134608 0.0913105 0.778144 ## 6: 705.43424: 0.140011 0.0967118 0.763442 ## 7: 705.19541: 0.162471 0.102711 0.739827 ## 8: 705.16325: 0.166236 0.0931680 0.737563 ## 9: 705.09943: 0.168962 0.100977 0.731085 ## 10: 704.94924: 0.203874 0.0958205 0.702986 ## 11: 704.78210: 0.223975 0.108606 0.664678 ## 12: 704.67414: 0.250189 0.122959 0.630886 ## 13: 704.60673: 0.276532 0.131788 0.595346 ## 14: 704.52185: 0.335952 0.146435 0.520961 ## 15: 704.47725: 0.396737 0.157920 0.448557 ## 16: 704.46540: 0.442499 0.164111 0.396543 ## 17: 704.46319: 0.440935 0.161566 0.400606 ## 18: 704.46231: 0.442951 0.159225 0.400940 ## 19: 704.46231: 0.443022 0.159284 0.400863 ## 20: 704.46230: 0.443072 0.159363 0.400851 ## 21: 704.46230: 0.443112 0.159367 0.400807 ## ## Final Estimate of the Negative LLH: ## LLH: 405.421 norm LLH: 0.810842 ## omega alpha1 beta1 ## 0.1339755 0.1593669 0.4008074 ## ## R-optimhess Difference Approximated Hessian Matrix: ## omega alpha1 beta1 ## omega -8491.005 -1863.4127 -2488.5700 ## alpha1 -1863.413 -685.6071 -585.4327 ## beta1 -2488.570 -585.4327 -744.1593 ## attr(,"time") ## Time difference of 0.02322888 secs ## ## --- END OF TRACE --- ## ## ## Time to Estimate Parameters: ## Time difference of 0.1387401 secs ## ## Title: ## GARCH Modelling ## ## Call: ## garchFit(data = x[1:500], cond.dist = "QMLE", include.mean = FALSE) ## ## Mean and Variance Equation: ## data ~ garch(1, 1) ## ## [data = x[1:500]] ## ## Conditional Distribution: ## QMLE ## ## Coefficient(s): ## omega alpha1 beta1 ## 0.13398 0.15937 0.40081 ## ## Std. Errors: ## robust ## ## Error Analysis: ## Estimate Std. Error t value Pr(>|t|) ## omega 0.13398 0.11795 1.136 0.2560 ## alpha1 0.15937 0.07849 2.030 0.0423 * ## beta1 0.40081 0.44228 0.906 0.3648 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Log Likelihood: ## -405.421 normalized: -0.810842 ## ## Description: ## Thu Nov 2 13:01:15 2017 by user:

Notice that the parameter \(\beta\) (listed as beta1) changed dramatically. How about a different cutoff?

garchFit(data = x[1:200], cond.dist = "QMLE", include.mean = FALSE) ## ## Series Initialization: ## ARMA Model: arma ## Formula Mean: ~ arma(0, 0) ## GARCH Model: garch ## Formula Variance: ~ garch(1, 1) ## ARMA Order: 0 0 ## Max ARMA Order: 0 ## GARCH Order: 1 1 ## Max GARCH Order: 1 ## Maximum Order: 1 ## Conditional Dist: QMLE ## h.start: 2 ## llh.start: 1 ## Length of Series: 200 ## Recursion Init: mci ## Series Scale: 0.5746839 ## ## Parameter Initialization: ## Initial Parameters: $params ## Limits of Transformations: $U, $V ## Which Parameters are Fixed? $includes ## Parameter Matrix: ## U V params includes ## mu -0.61993813 0.6199381 0.0 FALSE ## omega 0.00000100 100.0000000 0.1 TRUE ## alpha1 0.00000001 1.0000000 0.1 TRUE ## gamma1 -0.99999999 1.0000000 0.1 FALSE ## beta1 0.00000001 1.0000000 0.8 TRUE ## delta 0.00000000 2.0000000 2.0 FALSE ## skew 0.10000000 10.0000000 1.0 FALSE ## shape 1.00000000 10.0000000 4.0 FALSE ## Index List of Parameters to be Optimized: ## omega alpha1 beta1 ## 2 3 5 ## Persistence: 0.9 ## ## ## --- START OF TRACE --- ## Selected Algorithm: nlminb ## ## R coded nlminb Solver: ## ## 0: 280.63354: 0.100000 0.100000 0.800000 ## 1: 280.63302: 0.100315 0.100088 0.800223 ## 2: 280.63262: 0.100695 0.0992822 0.800059 ## 3: 280.63258: 0.102205 0.0983397 0.800404 ## 4: 280.63213: 0.102411 0.0978709 0.799656 ## 5: 280.63200: 0.102368 0.0986702 0.799230 ## 6: 280.63200: 0.101930 0.0984977 0.800005 ## 7: 280.63200: 0.101795 0.0983937 0.799987 ## 8: 280.63197: 0.101876 0.0984197 0.799999 ## 9: 280.63197: 0.102003 0.0983101 0.799965 ## 10: 280.63197: 0.102069 0.0983780 0.799823 ## 11: 280.63197: 0.102097 0.0983703 0.799827 ## 12: 280.63197: 0.102073 0.0983592 0.799850 ## 13: 280.63197: 0.102075 0.0983616 0.799846 ## ## Final Estimate of the Negative LLH: ## LLH: 169.8449 norm LLH: 0.8492246 ## omega alpha1 beta1 ## 0.03371154 0.09836156 0.79984610 ## ## R-optimhess Difference Approximated Hessian Matrix: ## omega alpha1 beta1 ## omega -26914.901 -6696.498 -8183.925 ## alpha1 -6696.498 -2239.695 -2271.547 ## beta1 -8183.925 -2271.547 -2733.098 ## attr(,"time") ## Time difference of 0.02161336 secs ## ## --- END OF TRACE --- ## ## ## Time to Estimate Parameters: ## Time difference of 0.09229803 secs ## ## Title: ## GARCH Modelling ## ## Call: ## garchFit(data = x[1:200], cond.dist = "QMLE", include.mean = FALSE) ## ## Mean and Variance Equation: ## data ~ garch(1, 1) ## ## [data = x[1:200]] ## ## Conditional Distribution: ## QMLE ## ## Coefficient(s): ## omega alpha1 beta1 ## 0.033712 0.098362 0.799846 ## ## Std. Errors: ## robust ## ## Error Analysis: ## Estimate Std. Error t value Pr(>|t|) ## omega 0.03371 0.01470 2.293 0.0218 * ## alpha1 0.09836 0.04560 2.157 0.0310 * ## beta1 0.79985 0.03470 23.052 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Log Likelihood: ## -169.8449 normalized: -0.8492246 ## ## Description: ## Thu Nov 2 13:01:15 2017 by user:

For 200 observations \(\beta\) is estimated to be enormous, with a relatively tiny standard error!

Let’s dive deeper into this. I’ve conducted a number of numerical experiments on the University of Utah Mathematics department’s supercomputer. Below is a helper function to extract the coefficients and standard errors for a particular fit by garchFit() (suppressing all of garchFit()‘s output in the process).

getFitData <- function(x, cond.dist = "QMLE", include.mean = FALSE, ...) {
  args <- list(...)
  args$data = x
  args$cond.dist = cond.dist
  args$include.mean = include.mean
  log <- capture.output({
    fit <- do.call(garchFit, args = args)
  })
  res <- coef(fit)
  res[paste0(names(fit@fit$se.coef), ".se")] <- fit@fit$se.coef
  return(res)
}

The first experiment is to compute the coefficients of this particular series at each possible end point.

(The following code block is not evaluated when this document is knitted; I have saved the results in an Rda file. This will be the case for every code block that involves parallel computation. I performed these computations on the University of Utah Mathematics department’s supercomputer and saved the results for use here.)

library(doParallel) set.seed(110117) cl <- makeCluster(detectCores() - 1) registerDoParallel(cl) x <- garchSim(garchSpec(model = list(alpha = 0.2, beta = 0.2, omega = 0.2)), n.start = 1000, n = 1000) params <- foreach(t = 50:1000, .combine = rbind, .packages = c("fGarch")) %dopar% { getFitData(x[1:t]) } rownames(params) <- 50:1000

Below I plot these coefficients, along with a region corresponding to double the standard error. This region should roughly correspond to 95% confidence intervals.

params_df <- as.data.frame(params) params_df$t <- as.numeric(rownames(params)) ggplot(params_df) + geom_line(aes(x = t, y = beta1)) + geom_hline(yintercept = 0.2, color = "blue") + geom_ribbon(aes(x = t, ymin = beta1 - 2 * beta1.se, ymax = beta1 + 2 * beta1.se), color = "grey", alpha = 0.5) + ylab(expression(hat(beta))) + scale_y_continuous(breaks = c(0, 0.2, 0.25, 0.5, 1)) + coord_cartesian(ylim = c(0, 1))

This is an alarming picture (but not the most alarming I’ve seen; this is one of the better cases). Notice that the confidence interval fails to capture the true value of \(\beta\) up until about 375 data points; these intervals should contain the true value about 95% of the time! This is in addition to the confidence interval being fairly large.

Let’s see how the other parameters behave.

library(reshape2) library(plyr) library(dplyr) param_reshape <- function(p) { p <- as.data.frame(p) p$t <- as.integer(rownames(p)) pnew <- melt(p, id.vars = "t", variable.name = "parameter") pnew$parameter <- as.character(pnew$parameter) pnew.se <- pnew[grepl("*.se", pnew$parameter), ] pnew.se$parameter <- sub(".se", "", pnew.se$parameter) names(pnew.se)[3] <- "se" pnew <- pnew[!grepl("*.se", pnew$parameter), ] return(join(pnew, pnew.se, by = c("t", "parameter"), type = "inner")) } ggp <- ggplot(param_reshape(params), aes(x = t, y = value)) + geom_line() + geom_ribbon(aes(ymin = value - 2 * se, ymax = value + 2 * se), color = "grey", alpha = 0.5) + geom_hline(yintercept = 0.2, color = "blue") + scale_y_continuous(breaks = c(0, 0.2, 0.25, 0.5, 0.75, 1)) + coord_cartesian(ylim = c(0, 1)) + facet_grid(. ~ parameter) print(ggp + ggtitle("NLMINB Optimization"))

The phenomenon is not limited to \(\beta\); \(\omega\) also exhibits undesirable behavior. (\(\alpha\) isn’t great either, but it is much better.)

This behavior isn’t unusual; it’s typical. Below are plots for similar series generated with different seeds.

seeds <- c(103117, 123456, 987654, 101010, 8675309, 81891, 222222, 999999, 110011) experiments1 <- foreach(s = seeds) %do% { set.seed(s) x <- garchSim(garchSpec(model = list(alpha = 0.2, beta = 0.2, omega = 0.2)), n.start = 1000, n = 1000) params <- foreach(t = 50:1000, .combine = rbind, .packages = c("fGarch")) %dopar% { getFitData(x[1:t]) } rownames(params) <- 50:1000 params } names(experiments1) <- seeds experiments1 <- lapply(experiments1, param_reshape) names(experiments1) <- c(103117, 123456, 987654, 101010, 8675309, 81891, 222222, 999999, 110011) experiments1_df <- ldply(experiments1, .id = "seed") head(experiments1_df) ## seed t parameter value se ## 1 103117 50 omega 0.1043139 0.9830089 ## 2 103117 51 omega 0.1037479 4.8441246 ## 3 103117 52 omega 0.1032197 4.6421147 ## 4 103117 53 omega 0.1026722 1.3041128 ## 5 103117 54 omega 0.1020266 0.5334988 ## 6 103117 55 omega 0.2725939 0.6089607 ggplot(experiments1_df, aes(x = t, y = value)) + geom_line() + geom_ribbon(aes(ymin = value - 2 * se, ymax = value + 2 * se), color = "grey", alpha = 0.5) + geom_hline(yintercept = 0.2, color = "blue") + scale_y_continuous(breaks = c(0, 0.2, 0.25, 0.5, 0.75, 1)) + coord_cartesian(ylim = c(0, 1)) + facet_grid(seed ~ parameter) + ggtitle("Successive parameter estimates using NLMINB optimization")

In this plot we see pathologies of other kinds for \(\beta\), especially for seeds 222222 and 999999, where \(\beta\) is chronically far below the correct value. For all of these simulations \(\beta\) starts much larger than the correct value, near 1, and for the two seeds mentioned earlier it jumps from being very high to suddenly very low. (Not shown here are results for seeds 110131 and 110137; they’re even worse!)

The other parameters are not without their own pathologies, but the situation does not seem quite so grim. It’s possible the pathologies we do see are tied to estimation of \(\beta\). In fact, if we look at the analogous experiment for the ARCH(1) process (which is a GARCH(1,0) process, equivalent to setting \(\beta = 0\)), we see better behavior.

set.seed(110117) x <- garchSim(garchSpec(model = list(alpha = 0.2, beta = 0.2, omega = 0.2)), n.start = 1000, n = 1000) xarch <- garchSim(garchSpec(model = list(omega = 0.2, alpha = 0.2, beta = 0)), n.start = 1000, n = 1000) params_arch <- foreach(t = 50:1000, .combine = rbind, .packages = c("fGarch")) %dopar% { getFitData(xarch[1:t], formula = ~ garch(1, 0)) } rownames(params_arch) <- 50:1000 print(ggp %+% param_reshape(params_arch) + ggtitle("ARCH(1) Model"))

The pathology appears to be numerical in nature and closely tied to the estimation of \(\beta\). garchFit(), by default, uses nlminb() (a quasi-Newton method with constraints) for solving the optimization problem, using a numerically-computed gradient. We can choose alternative methods, though; we can use the L-BFGS-B method, and we can also combine either approach with the Nelder-Mead method.

Unfortunately these alternative optimization algorithms don’t do better; they may even do worse.

# lbfgsb algorithm params_lbfgsb <- foreach(t = 50:1000, .combine = rbind, .packages = c("fGarch")) %dopar% { getFitData(x[1:t], algorithm = "lbfgsb") } rownames(params_lbfgsb) <- 50:1000 # nlminb+nm algorithm params_nlminbnm <- foreach(t = 50:1000, .combine = rbind, .packages = c("fGarch")) %dopar% { getFitData(x[1:t], algorithm = "nlminb+nm") } rownames(params_nlminbnm) <- 50:1000 # lbfgsb+nm algorithm params_lbfgsbnm <- foreach(t = 50:1000, .combine = rbind, .packages = c("fGarch")) %dopar% { getFitData(x[1:t], algorithm = "lbfgsb+nm") } rownames(params_lbfgsbnm) <- 50:1000 # cond.dist is norm (default) params_norm <- foreach(t = 50:1000, .combine = rbind, .packages = c("fGarch")) %dopar% { getFitData(x[1:t], cond.dist = "norm") } rownames(params_norm) <- 50:1000 print(ggp %+% param_reshape(params_lbfgsb) + ggtitle("L-BFGS-B Optimization"))

print(ggp %+% param_reshape(params_nlminbnm) + ggtitle("nlminb Optimization with Nelder-Mead"))

print(ggp %+% param_reshape(params_lbfgsbnm) + ggtitle("L-BFGS-B Optimization with Nelder-Mead"))

Admittedly, though, QMLE is not the default estimation method garchFit() uses. The default is the Normal distribution. Unfortunately this is no better.

print(ggp %+% param_reshape(params_norm) + ggtitle("cond.dist = 'norm'"))

On CRAN, fGarch has not seen an update since 2013! It’s possible that fGarch is starting to show its age and newer packages have addressed some of the problems I’ve highlighted here. The package tseries provides a function garch() that also fits GARCH models via QMLE, and it is much newer than fGarch. It is the only other package I am aware of that fits GARCH models.

Unfortunately, garch() doesn’t do much better; in fact, it appears to be much worse. Once again, the problem lies with \(\beta\).

library(tseries) getFitDatagarch <- function(x) { garch(x)$coef } params_tseries <- foreach(t = 50:1000, .combine = rbind, .packages = c("tseries")) %dopar% { getFitDatagarch(x[1:t]) } rownames(params_tseries) <- 50:1000 param_reshape_tseries <- function(p) { p <- as.data.frame(p) p$t <- as.integer(rownames(p)) pnew <- melt(p, id.vars = "t", variable.name = "parameter") pnew$parameter <- as.character(pnew$parameter) return(pnew) } ggplot(param_reshape_tseries(params_tseries), aes(x = t, y = value)) + geom_line() + geom_hline(yintercept = 0.2, color = "blue") + scale_y_continuous(breaks = c(0, 0.2, 0.25, 0.5, 0.75, 1)) + coord_cartesian(ylim = c(0, 1)) + facet_grid(. ~ parameter)

All of these experiments were performed on fixed (yet randomly chosen) sequences. They suggest that, especially for sample sizes of less than, say, 300 (possibly larger), distributional guarantees for the estimates of GARCH(1,1) parameters are suspect. What happens when we simulate many GARCH(1,1) processes and look at the distribution of the parameter estimates?

I simulated 10,000 GARCH(1,1) processes for each of the sample sizes 100, 500, and 1000 (using the same parameters as before). Below are the empirical distributions of the parameter estimates.

experiments2 <- foreach(n = c(100, 500, 1000)) %do% { mat <- foreach(i = 1:10000, .combine = rbind, .packages = c("fGarch")) %dopar% { x <- garchSim(garchSpec(model = list(omega = 0.2, alpha = 0.2, beta = 0.2)), n.start = 1000, n = n) getFitData(x) } rownames(mat) <- NULL mat } names(experiments2) <- c(100, 500, 1000) save(params, x, experiments1, xarch, params_arch, params_lbfgsb, params_nlminbnm, params_lbfgsbnm, params_norm, params_tseries, experiments2, file="garchfitexperiments.Rda") param_sim <- lapply(experiments2, function(mat) { df <- as.data.frame(mat) df <- df[c("omega", "alpha1", "beta1")] return(df) }) %>% ldply(.id = "n") param_sim <- param_sim %>% melt(id.vars = "n", variable.name = "parameter") head(param_sim) ## n parameter value ## 1 100 omega 8.015968e-02 ## 2 100 omega 2.493595e-01 ## 3 100 omega 2.300699e-01 ## 4 100 omega 3.674244e-07 ## 5 100 omega 2.697577e-03 ## 6 100 omega 2.071737e-01 ggplot(param_sim, aes(x = value)) + geom_density(fill = "grey", alpha = 0.7) + geom_vline(xintercept = 0.2, color = "blue") + facet_grid(n ~ parameter)

When the sample size is 100, these estimators are far from reliable. \(\alpha\) and \(\beta\) have an unnerving tendency to be almost 0, and \(\omega\) can be just about anything. As we saw above, the standard errors reported by garchFit() do not capture this behavior. For larger sample sizes \(\omega\) and \(\alpha\) behave better, but \(\beta\) still displays unnerving behavior. Its spread barely changes and it still has a propensity to be far too small.

What bothers me most is that a sample size of 1,000 strikes me as being a large sample size. If one were looking at daily data for, say, stock prices, this sample size roughly corresponds to about 4 years of data. This suggests to me that this pathological behavior is affecting the GARCH models people are estimating and using right now.

Conclusion

An article by John C. Nash entitled “On best practice optimization methods in R”, published in the Journal of Statistical Software in September 2014, discussed the need for better optimization practices in R. In particular, he highlighted, among others, the methods garchFit() uses (or at least their R implementations) as outdated. He argues for greater awareness in the community of optimization issues and for greater flexibility in packages, going beyond merely using the different algorithms provided by optim().

The issues I highlighted in this article made me more aware of the importance of the choice of optimization method. My initial objective was to write a function for performing statistical tests for detecting structural change in GARCH models. These tests rely heavily on successive estimation of the parameters of GARCH models, as I demonstrated here. At minimum my experiments show that the variation in the parameter estimates isn’t being captured adequately by standard errors, but there is also the potential for unacceptably high instability in the estimates. They’re so unstable it would take a miracle for the test not to reject the null hypothesis of no change. After all, just looking at the pictures of the parameter estimates for simulated data, one might conclude that the alternative hypothesis of structural change is true. Thus every time I have tried to implement our test on data where the null hypothesis was supposedly true, the test unequivocally rejected it with \(p\)-values of essentially 0.

I have heard of people conducting hypothesis tests for detecting structural change in GARCH models, so I would not be surprised if the numerical instability I have written about here can be avoided. This is a subject I admittedly know little about, and I hope that if someone in the R community has already observed this behavior and knows how to resolve it, they will let me know in the comments or via e-mail. I may write a retraction and show how to produce stable estimates of the parameters with garchFit(). Perhaps the key lies in the function garchFitControl().

I’ve also thought about writing my own optimization routine tailored to my test. Prof. Nash emphasized in his paper the importance of tailoring optimization routines to the needs of particular problems. I’ve written down the quantity to optimize, and I have a formula for the gradient and Hessian matrix of the quasi-log-likelihood. Perhaps successive optimizations as required by our test could use the parameters from previous iterations as initial values, helping prevent optimizers from finding distant, locally optimal yet globally suboptimal solutions. (A rough sketch of this warm-starting idea follows.)
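A rough sketch of that warm-starting idea, reusing the hand-rolled QMLE objective from the earlier sketch rather than garchFit() itself; the window spacing, bounds, and starting values are arbitrary choices, and x is assumed to be the simulated GARCH(1,1) series from earlier in the post.

qmle_obj <- function(par, x) {           # same objective as in the earlier sketch
  omega <- par[1]; alpha <- par[2]; beta <- par[3]
  sig2 <- numeric(length(x))
  sig2[1] <- var(x)
  for (t in 2:length(x)) sig2[t] <- omega + alpha * x[t - 1]^2 + beta * sig2[t - 1]
  sum(log(sig2) + x^2 / sig2)
}

start <- c(omega = 0.1, alpha = 0.1, beta = 0.1)
estimates <- NULL
for (t in seq(50, 1000, by = 50)) {
  fit <- nlminb(start, qmle_obj, x = as.numeric(x[1:t]),
                lower = c(1e-6, 0, 0), upper = c(Inf, 1, 1))
  estimates <- rbind(estimates, c(t = t, fit$par))
  start <- fit$par                       # warm start for the next, longer window
}
head(estimates)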

Already, though, this makes the problem more difficult than I initially thought finding an example for our test would be. I’m planning on tabling the detection of structural change in GARCH models for now and using instead an example involving merely linear regression (a much more tractable problem). But I hope to hear others’ input on what I’ve written here.

sessionInfo() ## R version 3.3.3 (2017-03-06) ## Platform: i686-pc-linux-gnu (32-bit) ## Running under: Ubuntu 16.04.2 LTS ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] dplyr_0.7.2 plyr_1.8.4 reshape2_1.4.2 ## [4] fGarch_3010.82.1 fBasics_3011.87 timeSeries_3022.101.2 ## [7] timeDate_3012.100 ggplot2_2.2.1 ## ## loaded via a namespace (and not attached): ## [1] Rcpp_0.12.11 bindr_0.1 knitr_1.17 magrittr_1.5 ## [5] munsell_0.4.3 colorspace_1.3-2 R6_2.2.0 rlang_0.1.2 ## [9] stringr_1.2.0 tools_3.3.3 grid_3.3.3 gtable_0.2.0 ## [13] htmltools_0.3.6 assertthat_0.1 yaml_2.1.14 lazyeval_0.2.0 ## [17] rprojroot_1.2 digest_0.6.12 tibble_1.3.4 bindrcpp_0.2 ## [21] glue_1.1.1 evaluate_0.10 rmarkdown_1.6 labeling_0.3 ## [25] stringi_1.1.5 scales_0.4.1 backports_1.0.5 pkgconfig_2.0.1
  1. Bollerslev introduced GARCH models in his 1986 paper entitled “Generalized autoregressive conditional heteroskedasticity”. 
  2. See the book GARCH Models: Structure, Statistical Inference and Financial Applications, by Christian Francq and Jean-Michel Zakoian.


To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Role Playing with Probabilities: The Importance of Distributions

Thu, 11/02/2017 - 17:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Jocelyn Barker, Data Scientist at Microsoft

I have a confession to make. I am not just a statistics nerd; I am also a role-playing games geek. I have been playing Dungeons and Dragons (DnD) and its variants since high school. While playing with my friends the other day, it occurred to me that DnD may have some lessons to share in my job as a data scientist. Hidden in its dice-rolling mechanics is a perfect little experiment for demonstrating at least one reason why practitioners may resist using statistical methods even when we can demonstrate a better average performance than previous methods. It is all about distributions. While our averages may be higher, the distribution of individual data points can be disastrous.

Why Use Role-Playing Games as an Example?

Partially because it means I get to think about one of my hobbies at work. More practically, because consequences of probability distributions can be hard to examine in the real world. How do you quantify the impact of having your driverless car misclassify objects on the road? Games like DnD on the other hand were built around quantifying the impact of decisions. You decide to do something, add up some numbers that represent the difficulty of what you want to do, and then roll dice to add in some randomness. It also means it is a great environment to study how the distribution of the randomness impacts the outcomes.

A Little Background on DnD

One of the core mechanics of playing DnD and related role-playing games involves rolling a 20-sided die (often referred to as a d20). If you want your character to do something like climb a tree, there is some assigned difficulty for it (e.g. 10), and if you roll higher than that number, you achieve your goal. If your character is good at that thing, they get to add a skill modifier (e.g. 5) to the number they roll, making it more likely that they can do what they wanted to do. If the thing you want to do involves another character, things change a little. Instead of having a set difficulty like for climbing a tree, the difficulty is an opposed roll from the other player. So if Character A wants to sneak past Character B, both players roll d20s and Character A adds their “stealth” modifier against Character B’s “perception” modifier. Whoever between them gets a higher number wins, with a tie going to the “perceiver”. Ok, I promise, that is all the DnD rules you need to know for this blog post.
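As a quick illustration of the opposed-roll mechanic, here is a short simulation in R (the +5 stealth and +2 perception modifiers are arbitrary, made-up values):

set.seed(42)
n <- 100000
stealth    <- sample(1:20, n, replace = TRUE) + 5   # Character A's rolls plus stealth modifier
perception <- sample(1:20, n, replace = TRUE) + 2   # Character B's rolls plus perception modifier
mean(stealth > perception)   # ties go to the perceiver, hence the strict inequality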

Alternative Rolling Mechanics: What’s in a Distribution?

So here is where the stats nerd in me got excited. Some people change the rules of rolling to make different distributions. The default distribution is pretty boring, 20 numbers with equal probability:

One common way people modify this is with the idea of “critical”. The idea is that sometimes people do way better or worse than average. To reflect this, if you roll a 20, instead of adding 20 to your modifier, you add 30. If you roll a 1, you subtract 10 from your modifier.


Another stats nerd must have made up the last distribution. The idea for constructing it is weird, but the behavior is much more Gaussian. It is called 3z8 because you roll 3 eight-sided dice that are numbered 0-7 and sum them up giving a value between 0 and 21. 1-20 act as in the standard rules, but 0 and 21 are now treated like criticals (but at a much lower frequency than before).

The cool thing is these distributions have almost identical expected values (10.5 for d20, 10.45 with criticals, and 10.498 for 3z8), but very different distributions. How do these distributions affect the game? What can we learn from this as statisticians?
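Those expected values are easy to verify directly with a small R check (mapping 0 to -10 and 21 to 30 for the 3z8 criticals is my reading of the rule described above):

mean(1:20)                                # plain d20: 10.5

crit <- c(-10, 2:19, 30)                  # a 1 counts as -10, a 20 counts as 30
mean(crit)                                # with criticals: 10.45

z8   <- 0:7
sums <- rowSums(expand.grid(z8, z8, z8))  # all 512 outcomes of 3z8
sums[sums == 0]  <- -10                   # treat 0 and 21 as criticals
sums[sums == 21] <- 30
mean(sums)                                # 3z8: ~10.498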

Our Case Study: Sneaking Past the Guards

To examine how our distributions affect outcomes, we will look at a scenario where a character, who we will call the rogue, wants to sneak past three guards. If any one guard’s perception is greater than or equal to the rogue’s stealth, we will say the rogue loses the encounter; if they are all lower, the rogue is successful. We can already see the rogue is at a disadvantage; any one of the guards succeeding is a failure for her. We note that, assuming all the guards have the same perception modifier, the actual value of the modifier for the guards doesn’t matter, just the difference between their modifier and the modifier of the rogue, because the two modifiers are just a scalar adjustment of the value rolled. In other words, it doesn’t matter if the guards are average Joes with a 0 modifier and the rogue is reasonably sneaky with a +5, or if the guards are hyper alert with a +15 and the rogue is a ninja with a +20; the probability of success is the same in the two scenarios.
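A quick sanity check of that claim, enumerating every combination of one rogue roll and three guard rolls (the +5/+0 and +20/+15 modifier pairs are just the examples from the paragraph above):

rolls <- expand.grid(r = 1:20, g1 = 1:20, g2 = 1:20, g3 = 1:20)
gmax  <- pmax(rolls$g1, rolls$g2, rolls$g3)
mean(rolls$r + 5  > gmax + 0)    # sneaky rogue vs. average Joes
mean(rolls$r + 20 > gmax + 15)   # ninja vs. hyper-alert guards: identical probability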

Computing the Max Roll for the Guards

Let’s start off by getting a feeling for what the dice rolls will look like. Since the rogue is only rolling one die, her probability distribution looks the same as the distribution of the die from the previous section. Now, let’s consider the guards. In order for the rogue to fail to sneak by, she only needs to be seen by one of the guards. That means we just need to look at the probability that the maximum roll among the guards is \(n\) for \(n \in \{1, \ldots, 20\}\). We will start with our default distribution. The number of ways you can have 3 dice roll a value of \(n\) or less is \(n^3\). Therefore the number of ways you can have the max value of the dice be exactly \(n\) is the number of ways you can roll \(n\) or less minus the number of ways where all the dice are \(n - 1\) or less, giving us \(n^3 - (n - 1)^3\) ways to roll a max value of \(n\). Finally, we can divide by the total number of roll combinations for a 20-sided die, \(20^3\), giving us our final probabilities of:

\[\frac{n^3 - (n-1)^3}{20^3} \quad \textrm{for } n \in \{1, \ldots, 20\}\]
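Plugging the formula into R gives the whole distribution of the guards' best roll, including the roughly 14% chance of a natural 20 mentioned below:

n     <- 1:20
p_max <- (n^3 - (n - 1)^3) / 20^3
sum(p_max)    # sanity check: the probabilities sum to 1
p_max[20]     # P(at least one of the three guards rolls a 20) ~ 0.143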

The only thing that changes when we add criticals to the mix is that the probabilities previously assigned to 1 get re-assigned to -10, and those assigned to 20 get re-assigned to 30, giving us the following distribution.

This means our guards get a critical success ~14% of the time! This will have a big impact on our final distributions.

Finally, let’s look at the distribution for the guards using the 3z8 system.

In the previous distributions, the maximum value became the single most likely roll. Because of the low probability of rolling a 21 in the 3z8 distribution, this distribution still skews right, but it peaks at 14. In this distribution, criticals only occur ~0.6% of the time; much less often than in the previous distribution.

Impact on Outcome

Now that we have looked at the distributions of the rolls for the rogue and the guards, let’s see what our final outcomes look like. As previously mentioned, we don’t need to worry about the specific modifiers for the two groups, just the difference between them. Below is a plot showing the relative modifier for the rogue on the x-axis and the probability of success on the y-axis for our three different probability distributions.

We see that over the entire curve our odds of success go down when we add criticals, and over most of the curve they go up for 3z8. Let’s think about why. We know the guards are more likely to roll a 20 and less likely to roll a 1 from the distribution we made earlier. This happens about 14% of the time, which is pretty common, and when it happens, the rogue has to have a very high modifier and still roll well to overcome it unless they also roll a 20. On the other hand, with the 3z8 system, criticals are far less common and everyone rolls close to average more of the time. The expected value for the rogue is ~10.5, whereas it is ~14 for the guards, so when everyone performs close to average, the rogue only needs a small modifier to have a reasonable chance of success.
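For the plain d20 system, the success curve in the plot above can be computed exactly along these lines (a sketch only; the curves for criticals and 3z8 follow the same pattern with their respective outcome distributions):

p_max   <- ((1:20)^3 - (0:19)^3) / 20^3          # distribution of the guards' best roll
mods    <- -5:25                                  # rogue's modifier relative to the guards'
success <- sapply(mods, function(m)
  sum(sapply(1:20, function(r) p_max %*% (r + m > 1:20))) / 20)
plot(mods, success, type = "l",
     xlab = "Rogue's relative modifier", ylab = "P(sneaking past all three guards)")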
To illustrate how much of a difference there is between the two, let’s consider what would be the minimum modifier needed to have a certain probability of success.

Probability   Roll 1d20   With Criticals   Roll 3z8
50%           6           7                4
75%           11          13               8
90%           15          22               11
95%           17          27               13

We see from the table that reasonably small modifiers make a big difference in the 3z8 system, whereas very large modifiers are needed to have a reasonable chance of success when criticals are added. To give context on just how large this is, when someone is invisible, this only adds +20 to their stealth checks while they are moving. In other words, in the system with criticals, our rogue could literally be invisible while sneaking past a group of not particularly observant guards and still have a reasonable chance of failing.

The next thing to consider is our rogue may have to make multiple checks to sneak into a place (eg. one to sneak into the courtyard, one to sneak from bush to bush, and then a final one to sneak over the wall). If we look at the results of our rogue making three successful checks in a row, our probabilities change even more.

Probability   Roll 1d20   With Criticals   Roll 3z8
50%           12          15               9
75%           16          23               11
90%           18          28               14
95%           19          29               15

Making multiple checks exaggerates the differences we saw previously. Part of the reason for the poor performance with the addition of criticals (and for the funny shape of the critical curve) is that there is a different cost associated with criticals for the rogue compared to the guards. If the guards roll a 20 or the rogue rolls a 1 when criticals are in play, the guards will almost certainly win, even if the rogue has a much higher modifier. On the other hand, if a guard rolls a 1 or the rogue rolls a 20, there isn’t much difference in outcome between getting that critical and any other low/high roll; play continues to the next round.
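The effect of stringing checks together is just exponentiation: with independent rolls, three successes in a row happen with probability p cubed, so the per-check success probability needed for a target overall probability climbs quickly. A one-line check:

target <- c(0.5, 0.75, 0.9, 0.95)
round(data.frame(overall = target, per_check_needed = target^(1/3)), 3)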

How Does This Apply to Data Science?

Many times as data scientists, we think of the predictions we make as discrete data points, and when we evaluate our models we use aggregate metrics. It is easy to lose sight of the fact that our predictions are samples from a probability distribution, and that aggregate measures can obscure how well our model is really performing. We saw in the example with criticals how big hits and misses can make a huge impact on outcomes, even if the average performance is largely the same. We also saw with the 3z8 system how slightly decreasing the expected value of the roll can actually increase performance by making the “average” outcome more likely.

Does all of this sound contrived to you, like I am trying to force an analogy? Let me make a concrete example from my real-life data science job. I am responsible for making the machine learning revenue forecasts for Microsoft. Twice a quarter, I forecast the revenue for all of the products at Microsoft worldwide. While these product forecasts do need to be accurate for internal use, the forecasts are also summed up to create segment-level forecasts. Microsoft’s segment-level forecasts go to Wall Street, and having our forecasts fail to meet actuals can be a big problem for the company. We can think about our rogue sneaking past the guards as an analogy for nailing the segment-level forecast. If I succeed for most of the products (our individual guards) but have a critical miss of a $1 billion error on one of them, then I have a $1 billion error for the segment and I have failed. Also like our rogue, one success doesn’t mean we have won. There is always another quarter, and doing well one quarter doesn’t mean Wall Street will cut you some slack the next. Finally, a critical success is less beneficial than a critical failure is problematic. Getting the forecasts perfect one quarter will just get you a “good job” and a pat on the back, but a big miss costs the company. In this context, it is easy to see why the finance team doesn’t take the machine learning forecasts as gospel, even with our track record of high accuracy.

So as you evaluate your models, keep our sneaky friend in mind. Rather than just thinking about your average metrics, think about your distribution of errors. Are your errors clustered nicely around the mean, or are they a scattershot of low and high? What does that mean for your application? Are those really low errors valuable enough to be worth getting the really high ones from time to time? Many times a reliable model may be more valuable than a less reliable one with higher average performance, so when you evaluate, think distributions, not means.

The charts in this post were all produced using the R language. To see the code behind the charts, take a look at this R Markdown file.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Exploring, Clustering, and Mapping Toronto’s Crimes

Thu, 11/02/2017 - 14:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Motivation

I have had a lot of fun exploring US cities’ crime data via their open data portals, because Toronto’s crime data was simply not available.

Not until the summer of this year did Toronto police launch a public safety data portal to increase transparency between the public and officers. I recently had the chance to explore Toronto’s crimes via the Toronto Police Service Public Safety Data Portal. I am particularly interested in Major Crime Indicators (MCI) 2016, which contains a tabular list of 32,612 reports in 2016 (the only year for which the data were made available).

Let’s take a look at the data using R and see if there’s anything interesting.

library(ggplot2) library(ggthemes) library(dplyr) library(viridis) library(tidyr) library(cluster) library(ggmap) library(maps)

After a little bit of exploration, I found that there were many duplicated event_unique_id. Let’s drop them.

toronto <- read.csv('toronto_crime.csv') toronto <- subset(toronto, !duplicated(toronto$event_unique_id)) unique(toronto$occurrenceyear) unique(toronto$reportedyear) > unique(toronto$occurrenceyear) [1] 2016 2015 2014 2012 2010 2013 2011 2003 2007 2008 2006 2002 2001 NA 2005 [16] 2009 2000 > unique(toronto$reportedyear) [1] 2016

Find anything interesting? The occurrence year ranges from 2000 to 2016, but the reported year is only 2016. This means people came to the police to report incidents that happened up to 16 years ago. Let’s have a look at how many late-reported incidents are in our data.

year_group <- group_by(toronto, occurrenceyear) crime_by_year <- summarise(year_group, n = n()) crime_by_year # A tibble: 17 x 2 occurrenceyear n 1 2000 2 2 2001 2 3 2002 3 4 2003 2 5 2005 1 6 2006 1 7 2007 5 8 2008 1 9 2009 1 10 2010 10 11 2011 3 12 2012 8 13 2013 21 14 2014 37 15 2015 341 16 2016 27705 17 NA 4

2 incidents occurred in 2000, 2 occurred in 2001 and so on. The vast majority of occurrences happened in 2016. So we are going to keep 2016 only. And I am also removing all the columns we do not need and four rows with missing values.

drops <- c("X", "Y", "Index_", "ucr_code", "ucr_ext", "reporteddate", "reportedmonth", "reportedday", "reporteddayofyear", "reporteddayofweek", "reportedhour", "occurrencedayofyear", "reportedyear", "Division", "Hood_ID", "FID") toronto <- toronto[, !(names(toronto) %in% drops)] toronto <- toronto[toronto$occurrenceyear == 2016, ] toronto <- toronto[complete.cases(toronto), ]

Explore

What are the major crimes in 2016?

indicator_group <- group_by(toronto, MCI) crime_by_indicator <- summarise(indicator_group, n=n()) crime_by_indicator <- crime_by_indicator[order(crime_by_indicator$n, decreasing = TRUE),] ggplot(aes(x = reorder(MCI, n), y = n), data = crime_by_indicator) + geom_bar(stat = 'identity', width = 0.5) + geom_text(aes(label = n), stat = 'identity', data = crime_by_indicator, hjust = -0.1, size = 3.5) + coord_flip() + xlab('Major Crime Indicators') + ylab('Number of Occurrences') + ggtitle('Major Crime Indicators Toronto 2016') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

Assault is the most prevalent form of violent crime in Toronto. What is assault? In criminal and civil law, assault is an attempt to initiate harmful or offensive contact with a person, or a threat to do so.

What are the different types of assault? Which type is the worst?

assault <- toronto[toronto$MCI == 'Assault', ] assault_group <- group_by(assault, offence) assault_by_offence <- summarise(assault_group, n=n()) assault_by_offence <- assault_by_offence[order(assault_by_offence$n, decreasing = TRUE), ] ggplot(aes(x = reorder(offence, n), y = n), data = assault_by_offence) + geom_bar(stat = 'identity', width = 0.6) + geom_text(aes(label = n), stat = 'identity', data = assault_by_offence, hjust = -0.1, size = 3) + coord_flip() + xlab('Types of Assault') + ylab('Number of Occurrences') + ggtitle('Assault Crime Toronto 2016') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

Not much information here; the top assault category is … assault. I eventually learned about the different types of assault through Attorneys.com.

Let’s look at the top offences, then.

offence_group <- group_by(toronto, offence) crime_by_offence <- summarise(offence_group, n=n()) crime_by_offence <- crime_by_offence[order(crime_by_offence$n, decreasing = TRUE), ] ggplot(aes(x = reorder(offence, n), y = n), data = crime_by_offence) + geom_bar(stat = 'identity', width = 0.7) + geom_text(aes(label = n), stat = 'identity', data = crime_by_offence, hjust = -0.1, size = 2) + coord_flip() + xlab('Types of Offence') + ylab('Number of Occurrences') + ggtitle('Offence Types Toronto 2016') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

Assault is the most common offence, followed by Break and Enter. According to Wikibooks, the most typical form of break and enter is a break into a commercial or private residence in order to steal property. This indicates that break and enter is more likely to occur when there is no one at home or at the office.

How about crime by time of the day?

hour_group <- group_by(toronto, occurrencehour) crime_hour <- summarise(hour_group, n=n()) ggplot(aes(x=occurrencehour, y=n), data = crime_hour) + geom_line(size = 2.5, alpha = 0.7, color = "mediumseagreen", group=1) + geom_point(size = 0.5) + ggtitle('Total Crimes by Hour of Day in Toronto 2016') + ylab('Number of Occurrences') + xlab('Hour(24-hour clock)') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

The worst hour is around midnight; another peak is around noon, and then again at around 8 pm.

Okay, but what types of crime are most frequent at each hour?

hour_crime_group <- group_by(toronto, occurrencehour, MCI) hour_crime <- summarise(hour_crime_group, n=n()) ggplot(aes(x=occurrencehour, y=n, color=MCI), data =hour_crime) + geom_line(size=1.5) + ggtitle('Crime Types by Hour of Day in Toronto 2016') + ylab('Number of Occurrences') + xlab('Hour(24-hour clock)') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

Assaults are the top crime at almost all hours; they happened more often from the late morning until night than in the early morning. On the other hand, break and enter happened more often in the morning and around midnight (when no one is at home or at the office). Robberies and auto thefts are more likely to happen from the late evening into the night. This all makes sense.

Where are those crimes most likely to occur in Toronto?

location_group <- group_by(toronto, Neighbourhood) crime_by_location <- summarise(location_group, n=n()) crime_by_location <- crime_by_location[order(crime_by_location$n, decreasing = TRUE), ] crime_by_location_top20 <- head(crime_by_location, 20) ggplot(aes(x = reorder(Neighbourhood, n), y = n), data = crime_by_location_top20) + geom_bar(stat = 'identity', width = 0.6) + geom_text(aes(label = n), stat = 'identity', data = crime_by_location_top20, hjust = -0.1, size = 3) + coord_flip() + xlab('Neighbourhoods') + ylab('Number of Occurrences') + ggtitle('Neighbourhoods with Most Crimes - Top 20') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

The most dangerous neighbourhood is … Waterfront. The sprawling downtown catch-all includes not only densely packed condos but also the boozy circus that is the entertainment district. The result: a staggering number of violent crimes and arsons.

Let’s find out neighbourhoods vs. top offence types.

offence_location_group <- group_by(toronto, Neighbourhood, offence) offence_type_by_location <- summarise(offence_location_group, n=n()) offence_type_by_location <- offence_type_by_location[order(offence_type_by_location$n, decreasing = TRUE), ] offence_type_by_location_top20 <- head(offence_type_by_location, 20) ggplot(aes(x = Neighbourhood, y=n, fill = offence), data=offence_type_by_location_top20) + geom_bar(stat = 'identity', position = position_dodge(), width = 0.8) + xlab('Neighbourhood') + ylab('Number of Occurrence') + ggtitle('Offence Type vs. Neighbourhood Toronto 2016') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"), axis.text.x = element_text(angle = 90, hjust = 1, vjust = .4))

Gives this plot:

I did not expect something like this. It is not pretty. However, it did tell us that, besides assaults, Church-Yonge Corridor and Waterfront had the most break and enters (don’t move there!), and West Humber-Clairville had the most stolen-vehicle crimes (don’t park your car there!).

Let’s try something different.

crime_count <- toronto %>% group_by(occurrencemonth, MCI) %>% summarise(Total = n()) crime_count$occurrencemonth <- ordered(crime_count$occurrencemonth, levels = c('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')) ggplot(crime_count, aes(occurrencemonth, MCI, fill = Total)) + geom_tile(size = 1, color = "white") + scale_fill_viridis() + geom_text(aes(label=Total), color='white') + ggtitle("Major Crime Indicators by Month 2016") + xlab('Month') + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

Much better!

Assault is the most common crime every month of the year, without exception. It appears that there were slightly more assault incidents in May than in the other months last year.

day_count <- toronto %>% group_by(occurrencedayofweek, MCI) %>% summarise(Total = n()) ggplot(day_count, aes(occurrencedayofweek, MCI, fill = Total)) + geom_tile(size = 1, color = "white") + scale_fill_viridis() + geom_text(aes(label=Total), color='white') + ggtitle("Major Crime Indicators by Day of Week 2016") + xlab('Day of Week') + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

Saturdays and Sundays had more assaults than any other days, and fewer Theft Over incidents than any other days. Auto thieves are busy almost every day of the week equally.
I was expecting to find seasonal crime patterns, such as temperature changes and daylight hours being associated with crime throughout the year, or the beginning and end of the school year being associated with variations in crime. But this one year’s worth of data is not enough to address those questions. I hope the Toronto Police Service will release more data via its open data portal in the near future.

Homicide

homicide <- read.csv('homicide.csv', stringsAsFactors = F) homicide$Occurrence_Date <- as.Date(homicide$Occurrence_Date) year_group <- group_by(homicide, Occurrence_year, Homicide_Type) homicide_by_year <- summarise(year_group, n=n()) ggplot(aes(x = Occurrence_year, y=n, fill = Homicide_Type), data=homicide_by_year) + geom_bar(stat = 'identity', position = position_dodge(), width = 0.8) + xlab('Year') + ylab('Number of Homicides') + ggtitle('Homicide 2004-2016') + theme_bw() + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

2005 was Toronto’s “Year of the Gun”. Eleven years later, in 2016, Toronto experienced another spike in gun-related homicide.

homicide$month <- format(as.Date(homicide$Occurrence_Date) , "%B") homicide_count <- homicide %>% group_by(Occurrence_year, month) %>% summarise(Total = n()) homicide_count$month <- ordered(homicide_count$month, levels = c('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')) ggplot(homicide_count, aes(Occurrence_year, month, fill = Total)) + geom_tile(size = 1, color = "white") + scale_fill_viridis() + geom_text(aes(label=Total), color='white') + ggtitle("Homicides in Toronto (2004-2016)") + xlab('Year') + theme(plot.title = element_text(size = 16), axis.title = element_text(size = 12, face = "bold"))

Gives this plot:

It is worrisome to see that there was a significant increase in the total number of homicides in Toronto in 2016 compared to 2015. I hope we will have a better 2017. However, when I read that Toronto was ranked the safest city in North America by The Economist, I felt much safer.

K-Means Clustering

K-Means is one of the most popular “clustering” algorithms. It is the process of partitioning a group of data points into a small number of clusters. In our crime data, we measure the number of assaults and other indicators, and neighbourhoods with high numbers of assaults will be grouped together. The goal of K-Means clustering is to assign a cluster to each data point (neighbourhood). Thus we first partition the data points (neighbourhoods) into k clusters in which each neighbourhood belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
As one of the unsupervised learning algorithms, we use K-Means to build models that help us understand our data better. It enables us to learn groupings of unlabeled data points.

To do clustering analysis, our data has to look like this:

by_groups <- group_by(toronto, MCI, Neighbourhood) groups <- summarise(by_groups, n=n()) groups <- groups[c("Neighbourhood", "MCI", "n")] groups_wide <- spread(groups, key = MCI, value = n)

Gives this table:

The first column (the qualitative neighbourhood names) should be removed from the analysis:
z <- groups_wide[, -c(1,1)]
The data cannot have any missing values:
z <- z[complete.cases(z), ]
The data must be scaled for comparison:
m <- apply(z, 2, mean) s <- apply(z, 2, sd) z <- scale(z, m, s)
Determine the number of clusters:
wss <- (nrow(z)-1) * sum(apply(z, 2, var)) for (i in 2:20) wss[i] <- sum(kmeans(z, centers=i)$withinss) plot(1:20, wss, type='b', xlab='Number of Clusters', ylab='Within groups sum of squares')

Gives this plot:

This plot shows a very strong elbow; based on the plot, we can say with confidence that we do not need more than two clusters (centroids).

Fitting a model kc <- kmeans(z, 2) kc K-means clustering with 2 clusters of sizes 121, 10 Cluster means: Assault Auto Theft Break and Enter Robbery Theft Over 1 -0.193042 -0.135490 -0.2176646 -0.1778607 -0.2288382 2 2.335808 1.639429 2.6337422 2.1521148 2.7689425 Clustering vector: [1] 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [40] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 [79] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 [118] 1 1 1 1 1 1 1 1 1 1 1 1 2 1 Within cluster sum of squares by cluster: [1] 183.3436 170.2395 (between_SS / total_SS = 45.6 %) Available components: [1] "cluster" "centers" "totss" "withinss" "tot.withinss" [6] "betweenss" "size" "iter" "ifault"

Interpretation:

* First cluster has 121 neighbourhoods, and second cluster has 10 neighbourhoods.
* Cluster means: If the ranges of these numbers seem strange, it’s because we standardized the data before performing the cluster analysis. The negative values mean “lower than most” and positive values mean “higher than most”. Thus, cluster 1 are neighbourhoods with low assault, low auto theft, low break and enter, low robbery and low theft over. Cluster 2 are neighbourhoods with high assault, high auto theft, high break and enter, high robbery and high theft over. It is good that these two groups have a significant variance in every variable. It indicates that each variable plays a significant role in categorizing clusters.
* Clustering vector: The first, second and third neighbourhoods should all belong to cluster 1, the fourth neighbourhood should belong to cluster 2, and so on.
* More informative, relative measures are withinss and betweenss (a short sketch for extracting them follows this list).
withinss tells us the sum of the squared distances from each data point to its cluster center; lower is better. betweenss tells us the sum of the squared distances between cluster centers; ideally we want cluster centers far apart from each other.
* Available components is self-explanatory.
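
As mentioned above, these quantities can be pulled straight out of the fitted kmeans object; a quick sketch using the kc object fitted above:

kc$withinss               # within-cluster sum of squares, one value per cluster (lower is better)
kc$tot.withinss           # total within-cluster sum of squares
kc$betweenss              # between-cluster sum of squares (we want centers far apart)
kc$betweenss / kc$totss   # the ratio reported above as between_SS / total_SS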

Plotting the k-means results:

library(cluster)   # clusplot() comes from the cluster package
z1 <- data.frame(z, kc$cluster)
clusplot(z1, kc$cluster, color = TRUE, shade = FALSE, labels = 0, lines = 0,
         main = 'k-Means Cluster Analysis')

Gives this plot:

It appears that our choice of number of clusters is good, and we have little noise.

Hierarchical Clustering

For the hierarchical clustering methods, the dendrogram is the main graphical tool for getting insight into a cluster solution.

z2 <- data.frame(z)
distance <- dist(z2)
hc <- hclust(distance)

Now that we’ve got a cluster solution, let’s examine the results.

plot(hc, labels = groups_wide$Neighbourhood, main='Cluster Dendrogram', cex=0.65)

Gives this plot:

If we choose any height along the y-axis of the dendrogram and move across the dendrogram at that height, each line we cross represents a cluster. For example, if we look at a height of 10 and move across the x-axis at that height, we cross two lines. That defines a two-cluster solution; by following each line down through all its branches, we can see the names of the neighbourhoods included in these two clusters. Looking at the dendrogram for Toronto's crime data, we can see our data points are very imbalanced. From the top of the tree, there are two distinct groups: one group consists of branches upon branches, while the other group contains only a few neighbourhoods, and we can see that these are Toronto's most dangerous neighbourhoods. However, I want to try several different groupings at once to start investigating.
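
As a quick illustration of the height-based reading described above, cutree() can also cut the tree at a given height rather than at a given number of clusters. The short sketch below assumes the hc object fitted earlier; following the description above, a cut at height 10 should give the same split as asking for two clusters directly.

member_h <- cutree(hc, h = 10)   # cut the dendrogram at height 10
table(member_h)
table(cutree(hc, k = 2))         # should match the two-cluster cut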

counts = sapply(2:6, function(ncl) table(cutree(hc, ncl)))
names(counts) = 2:6
counts

$`2`
  1   2
128   3

$`3`
  1   2   3
128   2   1

$`4`
  1   2   3   4
121   7   2   1

$`5`
  1   2   3   4   5
121   5   2   2   1

$`6`
  1   2   3   4   5   6
112   5   2   9   2   1

Interpretation:
* With the 2-cluster solution, we have 128 neighbourhoods in cluster 1 and 3 neighbourhoods in cluster 2.
* With the 3-cluster solution, we have 128 neighbourhoods in cluster 1, 2 neighbourhoods in cluster 2, and 1 neighbourhood in cluster 3, and so on up to the 6-cluster solution.

In practice, we’d like a solution where there aren’t too many clusters with just a few observations, because that can make the results difficult to interpret. For this exercise, I will stick with the 3-cluster solution and see what results I obtain.

member <- cutree(hc, 3)
aggregate(z, list(member), mean)

  Group.1     Assault  Auto Theft  Break and Enter     Robbery  Theft Over
1       1 -0.09969023 -0.08067526      -0.09425688 -0.09349632  -0.1042508
2       2  5.57040267  0.57693723       4.67333848  4.14398508   5.0024522
3       3  1.61954364  9.17255898       2.71820403  3.67955938   3.3392003

In cluster 1, all the crime indicators are on the negative side, and cluster 1 is clearly distinct from clusters 2 and 3 on every variable. Cluster 2 is higher than cluster 3 in most of the crime indicators, with the exception of Auto Theft.

plot(silhouette(cutree(hc, 3), distance))

Gives this plot:

The silhouette width value for cluster 3 is zero. The silhouette plot indicates that we really do not need the third cluster: the vast majority of neighbourhoods belong to the first cluster, so the 2-cluster solution will be our choice.

Making a Map of Toronto’s Crimes

There are many packages for plotting and manipulating spatial data in R. I am going to use ggmap to produce a simple and easy map of Toronto's crimes.

lat <- toronto$Lat
lon <- toronto$Long
crimes <- toronto$MCI
to_map <- data.frame(crimes, lat, lon)
colnames(to_map) <- c('crimes', 'lat', 'lon')

sbbox <- make_bbox(lon = toronto$Long, lat = toronto$Lat, f = 0.01)
my_map <- get_map(location = sbbox, maptype = "roadmap", scale = 2, color = "bw", zoom = 10)

ggmap(my_map) +
  geom_point(data = to_map, aes(x = lon, y = lat, color = "#27AE60"), size = 0.5, alpha = 0.03) +
  xlab('Longitude') +
  ylab('Latitude') +
  ggtitle('Location of Major Crime Indicators Toronto 2016') +
  guides(color = FALSE)

Gives this map:

It’s easy to see where the major crimes in the city occur: there is a large concentration in the Waterfront area, while the south of North York is more peaceful than any other area. However, point-stacking is not helpful when comparing high-density areas, so let's optimize this visualization.

ggmap(my_map) +
  geom_point(data = to_map, aes(x = lon, y = lat, color = "#27AE60"), size = 0.5, alpha = 0.05) +
  xlab('Longitude') +
  ylab('Latitude') +
  ggtitle('Location of Major Crime Indicators Toronto 2016') +
  guides(color = FALSE) +
  facet_wrap(~ crimes, nrow = 2)

Gives this map:

This is certainly more interesting and prettier. Some crimes, such as Assault and Break and Enter, occur all over the city, with a concentration in the Waterfront areas. Other crimes, such as Auto Theft, have somewhat more points on the west side than on the east side. Robbery and Theft Over primarily cluster in the Waterfront areas.

Summary

Not many more questions can be answered by looking at the Toronto Major Crime Indicators data alone, but that's okay. There are certainly other interesting things to do with this data, such as creating a dashboard in MicroStrategy.

As always, all the code can be found on Github. I would be pleased to receive feedback or questions on any of the above.

    Related Post

    1. Spring Budget 2017: Circle Visualisation
    2. Qualitative Research in R
    3. Multi-Dimensional Reduction and Visualisation with t-SNE
    4. Comparing Trump and Clinton’s Facebook pages during the US presidential election, 2016
    5. Analyzing Obesity across USA

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    shadow text effect in grid and ggplot2 graphics

    Thu, 11/02/2017 - 08:30

    (This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

    After the release of meme package, I received several feedbacks from users.

    The most useful one is this comment on my blog post:

    Sercan Kahveci

    Greetings Mr. Yu,

    I am very happy that this package exists. Thank you for making it! I would like to request a feature, to ensure the package is able to compete with professional meme-creation tools like memegenerator and paint.net. Since memes often use the font Impact, in white and with black outline, I believe the package would be more powerful if it also did that automatically.

    Regards,

    Sercan Kahveci, MSc

    Content creator at Questionable Research Memes on Facebook
    https://www.facebook.com/QResearchMemes/

    The words, ‘compete with professional meme-creation tools’, stimulated me to develop text plotting with a background outline effect.

    Now this feature is available in meme v>=0.0.7, which can be downloaded from CRAN.

    Here is an example:

    library(meme) u <- system.file("angry8.jpg", package="meme") meme(u, "code", "all the things!")

    To make this shadow text feature available to the R community, I created another package, shadowtext, that creates/draws text grobs with a background shadow for grid and ggplot2. If you are interested, please refer to the online vignette.


    To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang YU. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Explore Predictive Maintenance with flexdashboard

    Thu, 11/02/2017 - 01:00

    (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

    I have written the following post about Predictive Maintenance and flexdashboard at my company codecentric’s blog:

    Predictive Maintenance is an increasingly popular strategy associated with Industry 4.0; it uses advanced analytics and machine learning to optimize machine costs and output (see Google Trends plot below).
    A common use-case for Predictive Maintenance is to proactively monitor machines, so as to predict when a check-up is needed to reduce failure and maximize performance. In contrast to traditional maintenance, where each machine has to undergo regular routine check-ups, Predictive Maintenance can save costs and reduce downtime. A machine learning approach to such a problem would be to analyze machine failure over time to train a supervised classification model that predicts failure. Data from sensors and weather information is often used as features in modeling.

    With flexdashboard RStudio provides a great way to create interactive dashboards with R. It is an easy and very fast way to present analyses or create story maps. Here, I have used it to demonstrate different analysis techniques for Predictive Maintenance. It uses Shiny run-time to create interactive content.
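
    For readers who have not used flexdashboard before, a dashboard is simply an R Markdown document with the flex_dashboard output format; the minimal skeleton below is a generic sketch (not taken from the codecentric post) showing where the Shiny runtime is switched on in the YAML header.

    ---
    title: "Predictive Maintenance Demo"
    output: flexdashboard::flex_dashboard
    runtime: shiny
    ---

    Column
    -----------------------------------------------------------------------

    ### Sensor readings over time

    (an R code chunk producing reactive output, e.g. via shiny::renderPlot(), would go here)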

    Continue reading at https://blog.codecentric.de/en/2017/11/explore-predictive-maintenance-flexdashboard/


    To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Image Convolution in R using Magick

    Thu, 11/02/2017 - 01:00

    (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

    Release 1.4 of the magick package introduces
    a new feature called image convolution that
    was requested by Thomas L. Pedersen. In this post we explain what this is all about.

    Kernel Matrix

    The new image_convolve() function applies a kernel over the image. Kernel convolution means that each pixel value is recalculated using the weighted neighborhood sum defined in the kernel matrix. For example lets look at this simple kernel:

    library(magick)
    kern <- matrix(0, ncol = 3, nrow = 3)
    kern[1, 2] <- 0.25
    kern[2, c(1, 3)] <- 0.25
    kern[3, 2] <- 0.25
    kern

    ##      [,1] [,2] [,3]
    ## [1,] 0.00 0.25 0.00
    ## [2,] 0.25 0.00 0.25
    ## [3,] 0.00 0.25 0.00

    This kernel changes each pixel to the mean of its horizontal and vertical neighboring pixels, which results in a slight blurring effect in the right-hand image below:

    img <- image_read('logo:')
    img_blurred <- image_convolve(img, kern)
    image_append(c(img, img_blurred))
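
    To make the "weighted neighborhood sum" concrete, here is a small numeric check (not part of the original post) that applies the kernel above to a single 3x3 pixel neighborhood by hand:

    patch <- matrix(c(1, 2, 3,
                      4, 5, 6,
                      7, 8, 9), nrow = 3, byrow = TRUE)  # an arbitrary 3x3 patch of pixel values
    sum(kern * patch)  # new value for the center pixel
    ## [1] 5

    The result, 5, is exactly the mean of the four horizontal and vertical neighbours (2, 4, 6 and 8), as described above.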

    Standard Kernels

    Many operations in magick such as blurring, sharpening, and edge detection are
    actually special cases of image convolution. The benefit of explicitly using
    image_convolve() is more control. For example, we can blur an image and then blend
    it together with the original image in one step by mixing a blurring kernel with the
    unit kernel:

    img %>% image_convolve('Gaussian:0x5', scaling = '60,40%')

    The above requires a bit of explanation. ImageMagick defines several common
    standard kernels such as the
    gaussian kernel. Most of the standard kernels take one or more parameters,
    e.g. the example above used a gaussian kernel with 0 radius and 5 sigma.

    In addition, the scaling argument defines the magnitude of the kernel, and possibly
    how much of the original picture should be mixed in. Here we mix 60% of the
    blurring with 40% of the original picture in order to get a diffused lightning effect.

    Edge Detection

    Another area where kernels are of use is in edge detection. A simple example of
    a direction-aware edge detection kernel is the Sobel kernel.
    As can be seen below, vertical edges are detected while horizontals are not.

    img %>% image_convolve('Sobel') %>% image_negate()

    Something less apparent is that the result of the edge detection is truncated.
    Edge detection kernels can result in negative color values which get truncated to zero.
    To combat this it is possible to add a bias to the result. Often you’ll end up with
    scaling the kernel to 50% and adding 50% bias to move the midpoint of the result to 50%
    grey:

    img %>% image_convolve('Sobel', scaling = '50%', bias = '50%')

    Sharpening

    ImageMagick has many more edge detection kernels, some of which are insensitive to
    the direction of the edge. To emulate a classic high-pass filter from photoshop use
    difference of gaussians kernel:

    img %>% image_convolve('DoG:0,0,2') %>% image_negate()

    As with the blurring, the original image can be blended in with the transformed one, effectively sharpening the image along edges.

    img %>% image_convolve('DoG:0,0,2', scaling = '100, 100%')

    The ImageMagick documentation has more examples of convolution with the various available kernels.


    To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Le Monde [last] puzzle [#1026]

    Thu, 11/02/2017 - 00:17

    (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

    The last and final Le Monde puzzle is a bit of a disappointment, to wit:

    A 4×4 table is filled with positive and different integers. A 3×3 table is then deduced by adding four adjacent [i.e. sharing a common corner] entries of the original table. Similarly with a 2×2 table, summing up to a unique integer. What is the minimal value of this integer? And by how much does it increase if all 29 integers in the tables are different?

    For the first question, the resulting integer can be written as the sum of the corner values, plus 3 times the sum of the side values, plus 9 times the sum of the 4 inner values [of the 4×4 table]. Hence, minimising the overall sum means taking the inner values as 1,2,3,4, the side values as 5,…,12, and the corner values as 13,…,16, resulting in a total sum of 352. As checked in this computer code in APL by Jean-Louis:
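
    (The APL code appeared as an image in the original post.) The same check is a one-liner in R, using the weights of 1, 3 and 9 for corner, side and inner values derived above:

    sum(13:16) + 3 * sum(5:12) + 9 * sum(1:4)
    ## [1] 352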

    This configuration does not produce 29 distinct values, but moving one value higher in one corner does: I experimented with different upper bounds on the numbers and 17 always provided the smallest overall sum, 365.

    firz = matrix(0, 3, 3)   # second level
    thirz = matrix(0, 2, 2)  # third level
    solz = Inf               # initial value for the running minimum (not shown in the original post)
    for (t in 1:1e8) {
      flor = matrix(sample(1:17, 16), 4, 4)
      for (i in 1:3) for (j in 1:3)
        firz[i, j] = sum(flor[i:(i + 1), j:(j + 1)])
      for (i in 1:2) for (j in 1:2)
        thirz[i, j] = sum(firz[i:(i + 1), j:(j + 1)])  # last level
      if (length(unique(c(flor, firz, thirz))) == 29)
        solz = min(solz, sum(thirz))
    }

    and a further simulated annealing attempt did not get me anywhere close to this solution.

    Filed under: Books, Kids, R Tagged: competition, Kapla, Le Monde, mathematical puzzle, R, simulated annealing


    To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    R: the least disliked programming language

    Wed, 11/01/2017 - 23:59

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    According to a recent analysis of Stack Overflow "Developer Stories", where programmer candidates list the technologies they would and would not like to work with, R is the least disliked programming language:

    This is probably related to the fact that there's high demand in the job market for fast-growing technologies, which is a disincentive for listing them on the "would not work with" section of an online resume. As author David Robinson notes:

    If you’ve read some of our other posts about the growing and shrinking programming languages, you might notice that the least disliked tags tend to be fast-growing ones. R, Python, Typescript, Go, and Rust are all fast-growing in terms of Stack Overflow activity (we’ve specifically explored Python and R before) and all are among the least polarizing languages.

    Read the complete analysis linked below for more insights, including the "most liked" languages and rivalries between languages (where liking one and disliking the other go hand-in-hand).

    Stack Overflow blog: What are the Most Disliked Programming Languages?


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    A ggplot-based Marimekko/Mosaic plot

    Wed, 11/01/2017 - 18:00

    (This article was first published on That’s so Random, and kindly contributed to R-bloggers)

    One of my first baby steps into the open source world was when I answered this SO question over four years ago. Recently I revisited the post and saw that Z.Lin did a very nice, more modern implementation using dplyr and faceting in ggplot2. I decided to merge her ideas with mine to create a general function that makes MM plots. I also added two features: adding counts, proportions, or percentages as text to the cells, and highlighting cells by a condition.

    For those of you unfamiliar with this type of plot, it graphs the joint distribution of two categorical variables. x is plotted in bins, with the bin widths reflecting its marginal distribution. The fill of the bins is based on y. Each bin is filled by the co-occurrence of its x and y values. When x and y are independent, all the bins are filled (approximately) in the same way. The nice feature of the MM plot is that it shows both the joint distribution and the marginal distributions of x and y.

    ggmm

    To demonstrate the function, I'll take a selection of the emergency data set from the padr package, such that it has three types of incidents in four parts of town. We also do some relabelling for prettier plot labels.

    em_sel <- padr::emergency %>%
      dplyr::filter(
        title %in% c("Traffic: VEHICLE ACCIDENT -",
                     "Traffic: DISABLED VEHICLE -",
                     "Fire: FIRE ALARM"),
        twp %in% c("LOWER MERION", "ABINGTON", "NORRISTOWN", "UPPER MERION")) %>%
      mutate(twp = factor(twp,
                          levels = c("LOWER MERION", "ABINGTON", "NORRISTOWN", "UPPER MERION"),
                          labels = c("Low Mer.", "Abing.", "Norris.", "Upper Mer.")))

    The function takes a data frame and the bare (unquoted) column names of the x and y variables. It will then create a ggplot object. The variables don’t have to be factors or characters, the function coerces them to character.

    ggmm(em_sel, twp, title)

    Now I promised you two additional features. First, adding text to the cells. The add_text argument takes either “n”, to show the absolute counts

    ggmm(em_sel, twp, title, add_text = "n")

    “prop” to show the proportions of each cell with respect to the joint distribution

    ggmm(em_sel, twp, title, add_text = "prop")

    or “perc”, which reflects the percentages of the joint.

    ggmm(em_sel, twp, title, add_text = "perc")

    An argument is provided to control the rounding of the text.
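
    For example, a hypothetical call asking for three decimals instead of the default two (round_text defaults to 2 in the source code shared below):

    ggmm(em_sel, twp, title, add_text = "prop", round_text = 3)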

    Secondly, the alpha_condition argument takes an unevaluated expression in terms of the column names of x and y. The cells for which the expression yields TRUE will be highlighted (or rather the others will be downlighted). This is useful when you want to stress an aspect of the distribution, like a value of y that varies greatly over x.

    ggmm(em_sel, twp, title, alpha_condition = title == "Traffic: DISABLED VEHICLE -")

    I hope you find this function useful. The source code is shared below. It is also in the package accompanying this blog, which you can install by running devtools::install_github("EdwinTh/thatssorandom").

    library(tidyverse)

    ggmm <- function(df, x, y,
                     alpha_condition = 1 == 1,
                     add_text = c(NA, "n", "prop", "perc"),
                     round_text = 2) {
      stopifnot(is.data.frame(df))
      add_text <- match.arg(add_text)
      x_q <- enquo(x)
      y_q <- enquo(y)
      a_q <- enquo(alpha_condition)
      plot_set <- df %>%
        add_alpha_ind(a_q) %>%
        x_cat_y_cat(x_q, y_q) %>%
        add_freqs_col()
      plot_return <- mm_plot(plot_set, x_q, y_q)
      plot_return <- set_alpha(df, plot_return, a_q)
      if (!is.na(add_text)) {
        plot_set$text <- make_text_vec(plot_set, add_text, round_text)
        plot_set$freq <- calculate_coordinates(plot_return)
        text_part <- geom_text(data = plot_set, aes(label = text))
      } else {
        text_part <- NULL
      }
      plot_return + text_part
    }

    add_alpha_ind <- function(df, a_q) {
      df %>% mutate(alpha_ind = !!a_q)
    }

    x_cat_y_cat <- function(df, x_q, y_q) {
      df %>% mutate(x_cat = as.character(!!x_q),
                    y_cat = as.character(!!y_q))
    }

    add_freqs_col <- function(df) {
      stopifnot(all(c('x_cat', 'y_cat', 'alpha_ind') %in% colnames(df)))
      df %>%
        group_by(x_cat, y_cat) %>%
        summarise(comb_cnt = n(),
                  alpha_ind = as.numeric(sum(alpha_ind) > 0)) %>%
        mutate(freq = comb_cnt / sum(comb_cnt),
               y_cnt = sum(comb_cnt)) %>%
        ungroup()
    }

    mm_plot <- function(plot_set, x_q, y_q) {
      plot_set %>%
        ggplot(aes(x_cat, freq, width = y_cnt, fill = y_cat, alpha = alpha_ind)) +
        geom_bar(stat = "identity", position = "fill", color = "black") +
        facet_grid(~x_cat, scales = "free_x", space = "free_x", switch = "x") +
        theme(
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          panel.spacing = unit(0.1, "lines"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          strip.background = element_blank()
        ) +
        guides(alpha = FALSE) +
        labs(fill = quo_name(y_q)) +
        xlab(quo_name(x_q))
    }

    set_alpha <- function(df, plot_return, a_q) {
      if (mutate(df, !!a_q) %>% pull() %>% unique() %>% length() %>% `==`(1)) {
        plot_return + scale_alpha_continuous(range = c(1))
      } else {
        plot_return + scale_alpha_continuous(range = c(.4, 1))
      }
    }

    make_text_vec <- function(plot_set, add_text, round_text) {
      if (add_text == "n") return(get_counts(plot_set))
      text_col <- get_props(plot_set)
      if (add_text == "perc") {
        text_col <- round(text_col * 100, round_text)
        return(paste0(text_col, "%"))
      }
      round(text_col, round_text)
    }

    get_counts <- function(plot_set) {
      plot_set %>% pull(comb_cnt)
    }

    get_props <- function(plot_set) {
      plot_set %>% mutate(text_col = comb_cnt / sum(plot_set$comb_cnt)) %>% pull()
    }

    calculate_coordinates <- function(plot_return) {
      ggplot_build(plot_return)$data[[1]] %>%
        split(.$PANEL) %>%
        map(y_in_the_middle) %>%
        unlist()
    }

    y_in_the_middle <- function(x) {
      y_pos <- c(0, x$y)
      rev(y_pos[-length(y_pos)] + (y_pos %>% diff()) / 2)
    }

    To leave a comment for the author, please follow the link and comment on their blog: That’s so Random. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    RTutor: Wall Street and the Housing Bubble

    Wed, 11/01/2017 - 10:00

    (This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

    It is widely recognized that aggressive securitization of housing loans played a decisive role in creating a house price bubble in the US and the subsequent financial crisis. An open question is to what extent financial experts on securitization were aware of the fragile house price bubble they helped to create.

    Ing-Haw Cheng, Sahil Raina, and Wei Xiong shed some light on this question in their very interesting article Wall Street and the Housing Bubble (American Economic Review, 2014). They have collected a unique data set on private real estate transactions for a sample of 400 securitization agents, and two control groups consisting of lawyers and financial analysts not working in the real estate sector. They use this data set to study whether, in their private investments, the securitization agents seem to have been aware of the bubble and were more successful in the timing of their real estate investments.

    As part of his Bachelor Thesis at Ulm University, Marius Wentz has generated a RTutor problem set that allows you to replicate the main insights of the article in an interactive fashion. You learn about R, econometrics and the housing bubble at the same time.

    Here is a screenshot:

    Like in previous RTutor problem sets, you can enter free R code in a web based shiny app. The code will be automatically checked and you can get hints how to proceed. In addition you are challenged by many multiple choice quizzes.

    To install the problem set locally, first install RTutor as explained here:

    https://github.com/skranz/RTutor

    and then install the problem set package:

    https://github.com/mwentz93/RTutorWallStreet

    There is also an online version hosted by shinyapps.io that allows you to explore the problem set without any local installation. (The online version is capped at 25 hours total usage time per month. So it may be greyed out when you click at it.)

    https://mwentz93.shinyapps.io/RTutorWallStreet/

    If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at the RTutor Github page

    https://github.com/skranz/RTutor

    You can also install RTutor as a docker container:
    https://hub.docker.com/r/skranz/rtutor/


    To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Promises and Closures in R

    Wed, 11/01/2017 - 09:50

    (This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers)

    At the moment I am trying to improve my knowledge of functional programming in R. Luckily there are some explanations on the topic on the web (adv-r and Cartesian Faith). Beginning to (re)discover the usefulness of closures, I remembered some (at first sight) very strange behaviour. Actually it is consistent with the scoping rules of R, but it took a while until I felt I had reached the same level of consistency.

    What is a Promise?

    Every argument you pass to a function is a promise until the moment R evaluates it. Consider a function g with arguments x and y. Let’s leave out one argument in the function call:

    g <- function(x, y) x g(1) ## [1] 1

    R will be forgiving (lazy) until the argument y is actually needed. Until then y exists in the environment of the function call as a ‘name without a value’. Only when R needs to evaluate y is a value searched for. This means that we can pass non-existent objects as arguments to the function g and R won’t care until the argument is needed in the function’s body.

    g(1, nonExistentObject) ## [1] 1

    Have a look at the figure ‘Environment Path 1’. Your workspace is also called the Global Environment and you can access it explicitly using the internal variable .GlobalEnv. There is one variable in my workspace, the function g(x, y). When g is called, a new environment is created in which its body will be evaluated. This is denoted by the solid line. In this new environment of g there exist two variables, x and y. As long as those variables are not needed, no values are bound to those names, only a promise that a value can be found at the time of evaluation. Since x is evaluated, the value 1 is bound to x in the environment of the function g. y, however, is not evaluated, so the promised value of y is never searched for and we can promise anything.

    Environment Path 1.

    The dashed line indicates the direction in which R will try to find objects, meaning that if the function g does not find a variable in its own ‘evaluation environment’, it will continue its search in the global environment. The question of where this dashed line points is really important if you try to understand closures. Just to give you a heads-up: the parent environment (the environment the dashed line points to) of a function’s evaluation environment is always the environment in which the function was created – and not the environment from which the function is called. In the case of g that is the global environment. In the case of a function living in a package it is the package’s namespace.
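
    A tiny, generic example (not from the original post) of that last point, showing that the enclosing environment is where a function is created and not where it is called from:

    a <- 1
    h <- function() a                      # 'a' is not an argument, so R looks it up lexically
    caller <- function() { a <- 100; h() } # this local 'a' is invisible to h
    caller()
    ## [1] 1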

    What is a closure?

    A closure is a function which has an enclosing environment. As far as my understanding of these things goes, by that definition every function can be considered a closure. This suspicion is supported by R’s constant complaint, that I try to subset closures. Anyway, typically the term closure is used for functions which will have a function as return value:

    fClosure <- function(p) function(x) x^p f1 <- fClosure(1) f2 <- fClosure(2) cbind(f1(1:10), f2(1:10)) ## [,1] [,2] ## [1,] 1 1 ## [2,] 2 4 ## [3,] 3 9 ## [4,] 4 16 ## [5,] 5 25 ## [6,] 6 36 ## [7,] 7 49 ## [8,] 8 64 ## [9,] 9 81 ## [10,] 10 100

    Here I created fClosure as a function of p which will return a function of x. Then I assign values to f1 and f2, which are the functions f(x) = x^1 and f(x) = x^2. The reason this works can be seen by looking at the figure ‘Environment Path 2’ with all environments and connections between them.

    Environment Path 2.

    The solid line indicates that f1 is called from the .GlobalEnv; the dashed line the direction in which R will search for values (an exception is the promise, x, which references the .GlobalEnv). The enclosing environment of f1 is the environment in which it was created, which was the environment of the call to fClosure. So f1 has its own environment, which can be seen when you print the function to the console.

    f1 ## function(x) x^p ## <environment: 0x24488e8>

    This environment can even be accessed, to check what is going on inside.

    ls(environment(f1)) ## [1] "p" get("p", envir = environment(f1)) ## [1] 1

    So in the enclosing environment of f1 lives a variable p with value 1. Whenever R searches for a variable which is not part of the argument list, it will first check the environment created when called, then the enclosing environment and then the .GlobalEnv followed by the search path.
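
    Because p lives in that enclosing environment, changing it there changes what f1 returns; a quick sketch (not in the original post):

    assign("p", 3, envir = environment(f1))
    f1(2)   # now computes 2^3
    ## [1] 8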

    Why are those two related?

    When I read about the scoping rules in R I never really understood the implications of the word lazy. It took a couple of hours of utter confusion and experiments with closures before I got it. Consider the case where I want to construct an arbitrary number of functions like in the above example. Copy-pasting calls to fClosure will quickly reach its limits and is more frustrating than coding.

    # Creating f1-f5 and store them in a list
    # This will actually work using lapply in the most recent R version (3.4)
    # I enforce it by using a for-loop instead of lapply...
    # funList <- lapply(1:5, fClosure)
    funList <- list()
    for (i in 1:5) funList[[i]] <- fClosure(i)

    # Call f1-f5 with the argument x = 1:10
    resultList <- lapply(funList, do.call, args = list(x = 1:10))

    # Cbind the results
    do.call(cbind, resultList)

    ##         [,1]   [,2]   [,3]   [,4]   [,5]
    ##  [1,]      1      1      1      1      1
    ##  [2,]     32     32     32     32     32
    ##  [3,]    243    243    243    243    243
    ##  [4,]   1024   1024   1024   1024   1024
    ##  [5,]   3125   3125   3125   3125   3125
    ##  [6,]   7776   7776   7776   7776   7776
    ##  [7,]  16807  16807  16807  16807  16807
    ##  [8,]  32768  32768  32768  32768  32768
    ##  [9,]  59049  59049  59049  59049  59049
    ## [10,] 100000 100000 100000 100000 100000

    Oops, what happened? The resulting matrix looks like every column was created using the same function! Just to be clear, the above code works just fine; it does exactly what it was told to do. In this case I was tricked by the promises in the enclosing environments into assuming that in those enclosing environments there live variables p with values 1 to 5. This is not so. Remember, the arguments of a function are evaluated when they are first needed; until then they are promises. The concept of a promise is surprising because it is one of the very few objects with reference semantics in base R. A promise is just a pointer to a variable name in an environment (the environment from which the function is called) – it does not point to a value! If the value of the variable pointed to changes before the promise is evaluated inside the function, the behaviour of the function will change too. This leads to the question: what is the value of p inside this list of functions?

    sapply(funList, function(fun) get("p", envir = environment(fun))) ## [1] 5 5 5 5 5

    Okay, so in the loop where I created the functions f1 to f5, I did pass the numbers 1 to 5 to the closure; however, they do not get evaluated but point to the iterator, which is 5 at the moment the promises are evaluated. How do we fix this? Evaluate p in the enclosing environment at the moment of assignment. We could just write p in the function's body (not in the function which is returned; it needs to be evaluated in the enclosing environment), but that may be considered bad style because in two weeks' time you will see it as a redundant and useless line of code. Luckily there is a function for this: force forces the evaluation of arguments in the enclosing environment. This means that the variable p is bound to a value at the moment the closure is created.

    # Fix
    fClosure <- function(p) {
      force(p)
      function(x) x^p
    }

    # And again, with a new definition of fClosure:
    for (i in 1:5) funList[[i]] <- fClosure(i)
    resultList <- lapply(funList, do.call, args = list(x = 1:10))
    do.call(cbind, resultList)

    ##       [,1] [,2] [,3]  [,4]   [,5]
    ##  [1,]    1    1    1     1      1
    ##  [2,]    2    4    8    16     32
    ##  [3,]    3    9   27    81    243
    ##  [4,]    4   16   64   256   1024
    ##  [5,]    5   25  125   625   3125
    ##  [6,]    6   36  216  1296   7776
    ##  [7,]    7   49  343  2401  16807
    ##  [8,]    8   64  512  4096  32768
    ##  [9,]    9   81  729  6561  59049
    ## [10,]   10  100 1000 10000 100000

    And that made all the difference.


    To leave a comment for the author, please follow the link and comment on their blog: INWT-Blog-RBloggers. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Thinking about different ways to analyze sub-groups in an RCT

    Wed, 11/01/2017 - 01:00

    (This article was first published on ouR data generation, and kindly contributed to R-bloggers)

    Here’s the scenario: we have an intervention that we think will improve outcomes for a particular population. Furthermore, there are two sub-groups (let’s say defined by which of two medical conditions each person in the population has) and we are interested in knowing if the intervention effect is different for each sub-group.

    And here’s the question: what is the ideal way to set up a study so that we can assess (1) the intervention effects on the group as a whole, but also (2) the sub-group specific intervention effects?

    This is a pretty straightforward, text-book scenario. Sub-group analysis is common in many areas of research, including health services research where I do most of my work. It is definitely an advantage to know ahead of time if you want to do a sub-group analysis, as you would in designing a stratified randomized controlled trial. Much of the criticism of these sub-group analyses arises when they are not pre-specified and conducted in an ad hoc manner after the study data have been collected. The danger there, of course, is that the assumptions underlying the validity of a hypothesis test are violated. (It may not be easy to convince folks to avoid hypothesis testing.) In planning ahead for these analyses, researchers are less likely to be accused of snooping through data in search of findings.

    So, given that you know you want to do these analyses, the primary issue is how they should be structured. In particular, how should the statistical tests be set up so that we can draw reasonable conclusions? In my mind there are a few ways to answer the question.

    Three possible models

    Here are three models that can help us assess the effect of an intervention on an outcome in a population with at least two sub-groups:

    \[ \text{Model 1: } Y_i = \beta_0 + \beta_1 D_i \]

    \[ \text{Model 2: } Y_i = \beta_0^{\prime} + \beta_1^{\prime} D_i + \beta^{\prime}_2 T_i \]

    \[ \text{Model 3: } Y_i = \beta_0^{\prime\prime} + \beta_1^{\prime\prime} D_i +\beta^{\prime\prime}_2 T_i +\beta^{\prime\prime}_3 T_i D_i\]

    where \(Y_i\) is the outcome for subject \(i\), \(T_i\) is an indicator of treatment and equals 1 if the subject received the treatment, and \(D_i\) is an indicator of having the condition that defines the second sub-group. Model 1 assumes that only the medical condition affects the outcome. Model 2 assumes that if the intervention does have an effect, it is the same regardless of sub-group. And Model 3 allows for the possibility that intervention effects might vary between sub-groups.
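
    In R model syntax these correspond directly to the lm() calls used in the simulation further down (dx being a simulated data set with columns y, disease and trt):

    lm1 <- lm(y ~ disease, data = dx)                       # Model 1
    lm2 <- lm(y ~ disease + trt, data = dx)                 # Model 2
    lm3 <- lm(y ~ disease + trt + trt * disease, data = dx) # Model 3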

    1. Main effects

    In the first approach, we would estimate both Model 2 and Model 3, and conduct a hypothesis test using the null hypothesis \(\text{H}_{01}\): \(\beta_2^{\prime} = 0\). In this case we would reject \(\text{H}_{01}\) if the p-value for the estimated value of \(\beta_2^{\prime}\) was less than 0.05. If in fact we do reject \(\text{H}_{01}\) (and conclude that there is an overall main effect), we could then (and only then) proceed to a second hypothesis test of the interaction term in Model 3, testing \(\text{H}_{02}\): \(\beta_3^{\prime\prime} = 0\). In this second test we can also evaluate the test using a cutoff of 0.05, because we only do this test if we reject the first one.
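
    As a sketch, the decision rule of this first approach could be wrapped up as follows (assuming a data set dx like the ones generated in the simulation below; the helper name approach1 is hypothetical):

    approach1 <- function(dx, alpha = 0.05) {
      lm2 <- lm(y ~ disease + trt, data = dx)
      pMain <- coef(summary(lm2))["trt", 4]
      if (pMain >= alpha) return("no effect detected")         # stop after the first test
      lm3 <- lm(y ~ disease + trt + trt * disease, data = dx)
      pInt <- coef(summary(lm3))["disease:trt", 4]
      if (pInt < alpha) "sub-group-specific effects" else "common effect"
    }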

    This is not a path typically taken, for reasons we will see at the end when we explore the relative power of each test under different effect size scenarios.

    2. Interaction effects

    In the second approach, we would also estimate just Models 2 and 3, but would reverse the order of the tests. We would first test for interaction in Model 3: \(\text{H}_{01}\): \(\beta_3^{\prime\prime} = 0\). If we reject \(\text{H}_{01}\) (and conclude that the intervention effects are different across the two sub-groups), we stop there, because we have evidence that the intervention has some sort of effect, and that it is different across the sub-groups. (Of course, we can report the point estimates.) However, if we fail to reject \(\text{H}_{01}\), we would proceed to test the main effect from Model 2. In this case we would test \(\text{H}_{02}\): \(\beta_2^{\prime} = 0\).

    In this approach, we are forced to adjust the size of our tests (and use, for example, 0.025 as a cutoff for both). Here is a little intuition for why. If we use a cutoff of 0.05 for the first test and in fact there is no effect, 5% of the time we will draw the wrong conclusion (by wrongly rejecting \(\text{H}_{01}\)). However, 95% of the time we will correctly fail to reject the (true) null hypothesis in step one, and thus proceed to step two. Of all the times we proceed to the second step (which will be 95% of the time), we will err 5% of the time (again assuming the null is true). So, 95% of the time we will have an additional 5% error due to the second step, for an error rate of 4.75% due to the second test (95% \(\times\) 5%). In total – adding up the errors from steps 1 and 2 – we will draw the wrong conclusion almost 10% of the time. However, if we use a cutoff of 0.025, then we will be wrong 2.5% of the time in step 1, and about 2.4% (97.5% \(\times\) 2.5%) of the time in the second step, for a total error rate of just under 5%.
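
    The arithmetic behind those two overall error rates, as a quick check:

    0.05 + 0.95 * 0.05     # cutoff of 0.05 at both steps
    ## [1] 0.0975
    0.025 + 0.975 * 0.025  # cutoff of 0.025 at both steps
    ## [1] 0.049375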

    In the first approach (looking at the main effect first), we need to make no adjustment, because we only do the second test when we’ve rejected (incorrectly) the null hypothesis. By definition, errors we make in the second step will only occur in cases where we have made an error in the first step. In the first approach where we evaluate main effects first, the errors are nested. In the second, they are not nested but additive.

    3. Global test

    In the third and last approach, we start by comparing Model 3 with Model 1 using a global F-test. In this case, we are asking whether or not a model that includes treatment as a predictor does “better” than a model that only adjusts for sub-group membership. The null hypothesis can crudely be stated as \(\text{H}_{01}\): \(\text{Model }3 = \text{Model }1\). If we reject this hypothesis (and conclude that the intervention does have some sort of effect, either generally or differentially for each sub-group), then we are free to evaluate Models 2 and 3 to see whether there is a varying effect or not.

    Here we can use cutoffs of 0.05 in our hypothesis tests. Again, we only make errors in the second step if we’ve made a mistake in the first step. The errors are nested and not additive.

    Simulating error rates

    This first simulation shows that the error rates of the three approaches are all approximately 5% under the assumption of no intervention effect. That is, given that there is no effect of the intervention on either sub-group (on average), we will draw the wrong conclusion about 5% of the time. In these simulations, the outcome depends only on disease status and not the treatment. Or, in other words, the null hypothesis is in fact true:

    library(simstudy)

    # define the data
    def <- defData(varname = "disease", formula = .5, dist = "binary")

    # outcome depends only on sub-group status, not intervention
    def2 <- defCondition(condition = "disease == 0", formula = 0.0,
                         variance = 1, dist = "normal")
    def2 <- defCondition(def2, condition = "disease == 1", formula = 0.5,
                         variance = 1, dist = "normal")

    set.seed(1987) # the year I graduated from college, in case you are wondering ...

    pvals <- data.table() # store simulation results

    # run 2500 simulations
    for (i in 1:2500) {

      # generate data set
      dx <- genData(400, def)
      dx <- trtAssign(dx, nTrt = 2, balanced = TRUE, strata = "disease", grpName = "trt")
      dx <- addCondition(def2, dx, "y")

      # fit 3 models
      lm1 <- lm(y ~ disease, data = dx)
      lm2 <- lm(y ~ disease + trt, data = dx)
      lm3 <- lm(y ~ disease + trt + trt*disease, data = dx)

      # extract relevant p-values
      cM <- coef(summary(lm2))["trt", 4]
      cI <- coef(summary(lm3))["disease:trt", 4]
      fI <- anova(lm1, lm3)$`Pr(>F)`[2]

      # store the p-values from each iteration
      pvals <- rbind(pvals, data.table(cM, cI, fI))
    }

    pvals

    ##               cM          cI          fI
    ##    1: 0.72272413 0.727465073 0.883669625
    ##    2: 0.20230262 0.243850267 0.224974909
    ##    3: 0.83602639 0.897635326 0.970757254
    ##    4: 0.70949192 0.150259496 0.331072131
    ##    5: 0.85990787 0.449130976 0.739087609
    ##   ---
    ## 2496: 0.76142389 0.000834619 0.003572901
    ## 2497: 0.03942419 0.590363493 0.103971344
    ## 2498: 0.16305568 0.757882365 0.360893205
    ## 2499: 0.81873930 0.004805028 0.018188997
    ## 2500: 0.69122281 0.644801480 0.830958227

    # Approach 1
    pvals[, mEffect := (cM <= 0.05)] # cases where we would reject null
    pvals[, iEffect := (cI <= 0.05)]

    # total error rate
    pvals[, mean(mEffect & iEffect)] + pvals[, mean(mEffect & !iEffect)]

    ## [1] 0.0496

    # Approach 2
    pvals[, iEffect := (cI <= 0.025)]
    pvals[, mEffect := (cM <= 0.025)]

    # total error rate
    pvals[, mean(iEffect)] + pvals[, mean((!iEffect) & mEffect)]

    ## [1] 0.054

    # Approach 3
    pvals[, fEffect := (fI <= 0.05)]
    pvals[, iEffect := (cI <= 0.05)]
    pvals[, mEffect := (cM <= 0.05)]

    # total error rate
    pvals[, mean(fEffect & iEffect)] + pvals[, mean(fEffect & !(iEffect) & mEffect)]

    ## [1] 0.05

    If we use a cutoff of 0.05 for the second approach, we can see that the overall error rate is indeed inflated to close to 10%:

    # Approach 2 - with invalid cutoff
    pvals[, iEffect := (cI <= 0.05)]
    pvals[, mEffect := (cM <= 0.05)]

    # total error rate
    pvals[, mean(iEffect)] + pvals[, mean((!iEffect) & mEffect)]

    ## [1] 0.1028

    Exploring power

    Now that we have established at least three valid testing schemes, we can compare them by assessing the power of the tests. For the uninitiated, power is simply the probability of concluding that there is an effect when in fact there truly is an effect. Power depends on a number of factors, such as sample size, effect size, variation, and importantly for this post, the testing scheme.

    The plot below shows the results of estimating power using a range of assumptions about an intervention’s effect in the two subgroups and the different approaches to testing. (The sample size and variation were fixed across all simulations.) The effect sizes ranged from -0.5 to +0.5. (I have not included the code here, because it is quite similar to what I did to assess the error rates. If anyone wants it, please let me know, and I can post it on github or send it to you.)

    The estimated power reflects the probability that the tests correctly rejected at least one null hypothesis. So, if there was no interaction (say both group effects were +0.5) but there was a main effect, we would be correct if we rejected the hypothesis associated with the main effect. Take a look at the plot:

    What can we glean from this power simulation? Well, it looks like the global test that compares the interaction model with the null model (Approach 3) is the way to go, but just barely when compared to the approach that focuses solely on the interaction model first.

    And, we see clearly that the first approach suffers from a fatal flaw. When the sub-group effects are offsetting, as they are when the effect is -0.5 in subgroup 1 and +0.5 in subgroup 2, we will fail to reject the null that says there is no main effect. As a result, we will never test for interaction and see that in fact the intervention does have an effect on both sub-groups (one positive and one negative). We don’t get to test for interaction, because the rule was designed to keep the error rate at 5% when in fact there is no effect, main or otherwise.

    Of course, things are not totally clear cut. If we are quite certain that the effects are going to be positive for both groups, the second approach is not such a disaster. In fact, if we suspect that one of the sub-group effects will be large, it may be preferable to go with this approach. (Look at the right-hand side of the bottom plot to see this.) But, it is still hard to argue (though please do if you feel so inclined), at least based on the assumptions I used in the simulation, that we should take any approach other than the global test.


    To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
