### The hidden diagnostic plots for the lm object

Thu, 11/14/2019 - 20:07

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers.]

When plotting an lm object in R, one typically sees a 2 by 2 panel of diagnostic plots, much like the one below:

```r
set.seed(1)
x <- matrix(rnorm(200), nrow = 20)
y <- rowSums(x[, 1:3]) + rnorm(20)
lmfit <- lm(y ~ x)
summary(lmfit)

par(mfrow = c(2, 2))
plot(lmfit)
```

This link has an excellent explanation of each of these 4 plots, and I highly recommend giving it a read.

Most R users are familiar with these 4 plots. But did you know that the plot() function for lm objects can actually give you 6 plots? It says so right in the documentation:

We can specify which of the 6 plots we want when calling this function using the which option. By default, we are given plots 1, 2, 3 and 5. Let’s have a look at what plots 4 and 6 are.

Plot 4 is of Cook’s distance vs. observation number (i.e. row number). Cook’s distance is a measure of how influential a given observation is on the linear regression fit, with a value > 1 typically indicating a highly influential point. By plotting this value against row number, we can see if highly influential points exhibit any relationship to their position in the dataset. This is useful for time series data as it can indicate if our fit is disproportionately influenced by data from a particular time period.
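As a quick sketch (my addition, not from the original post), the statistic behind plot 4 can be pulled out directly with cooks.distance(), which makes it easy to flag the rows above the rule-of-thumb cutoff of 1:

```r
# Refit the example model from above, then compute Cook's distance
# for each observation and flag influential rows
set.seed(1)
x <- matrix(rnorm(200), nrow = 20)
y <- rowSums(x[, 1:3]) + rnorm(20)
lmfit <- lm(y ~ x)

d <- cooks.distance(lmfit)   # one distance per observation
which(d > 1)                 # row numbers above the rule-of-thumb cutoff, if any
```

Other cutoffs (such as 4/n) are also used in practice; the > 1 rule is just the most common convention.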

Here is what plot 4 might look like:

```r
plot(lmfit, which = 4)
```

Plot 6 is of Cook’s distance against leverage/(1 – leverage). An observation’s leverage must fall in the interval [0, 1], so plotting against leverage/(1 – leverage) allows the x-axis to span the whole positive real line. The contours on the plot represent points where the absolute value of the standardized residual is the same. On this plot they happen to be straight lines; the documentation says as much, though I haven’t had time to verify it mathematically.

Here is what plot 6 might look like:

```r
plot(lmfit, which = 6)
```
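The straight-line claim can in fact be checked (my addition, not from the post): for a model with p coefficients, Cook's distance satisfies D = (r²/p) · h/(1 − h), where r is the standardized residual and h the leverage, so for fixed |r| it is linear in the plot's x-axis variable. A quick numeric check:

```r
# Refit the example model from above, then verify the identity
# D = r^2 / p * h / (1 - h): linear in h / (1 - h) for fixed
# standardized residual r, which is why plot 6's contours are straight
set.seed(1)
x <- matrix(rnorm(200), nrow = 20)
y <- rowSums(x[, 1:3]) + rnorm(20)
lmfit <- lm(y ~ x)

h <- hatvalues(lmfit)     # leverages
r <- rstandard(lmfit)     # standardized residuals
p <- length(coef(lmfit))  # number of model coefficients
d <- r^2 / p * h / (1 - h)
all.equal(unname(d), unname(cooks.distance(lmfit)))  # TRUE
```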

I’m not too sure how one should interpret this plot. As far as I know, one should take extra notice of points with high leverage and/or high Cook’s distance. So any observation in the top-left, top-right or bottom-right corner should be taken note of. If anyone knows of a better way to interpret this plot, let me know!


Wed, 11/13/2019 - 10:45

[This article was first published on R – David's blog, and kindly contributed to R-bloggers.]

As a data scientist, you will likely be asked one day to automate your analysis and port your models to production environments. When that happens you cross the blurry line between data science and software engineering, and become a machine learning engineer. I’d like to share a few tips on how to make that transition as successful as possible.

Let’s first discuss testing, and let’s assume without loss of generality that you develop your machine learning application in R. Just like any other software system, your application needs to be thoroughly tested before being deployed. But how do you ensure that your application will perform as expected when dealing with real data? If you’ve structured your application as an R package you already know that you can write unit tests that will run automatically when you check your package; but how do you test the application as a whole?

Testing an application as a whole locally, known as integration testing, should normally not be part of your suite of unit tests: integration tests are frequently much slower than unit tests, and would break your regular build-test workflow. (Besides, if you'll forgive some pedantry, an integration test is by definition not a unit test, since it tests the whole system.)

Nevertheless, integration tests serve a useful purpose in your workflow. They’re often the last test you’ll run locally before releasing your application to more ‘serious’ testing. But that doesn’t mean you should write them last; in this article I’m going to argue that writing them first, before even writing a line of code, tends to produce a better, clearer, and more testable design.

Let’s illustrate with an example. You work for a travel agency whose CEO would like a system that predicts the survival probability of its customers, should their cruise ship hit an iceberg. You are told that extensive data on shipwreck survivors has been collected since 1912, when the Titanic sank, and that your application should apply machine learning to a dataset with the same structure as the famous Titanic survivor dataset, though you cannot access the entire dataset directly.

You are relatively free to choose the output format of your application, because it will be picked up downstream by other teams. You decide to start with a CSV file for now.

Well, where to begin? From an earlier analysis (ideally documented in a vignette, a notebook, or another report) you know that a random forest is the best model for this problem. So it would make sense to verify that your application does indeed train a random forest from the training data. How would we know that? The simplest way is to save the trained model along with the other output, and check that it is indeed a random forest, trained with the right number of training samples. And this is definitely something we know how to test.

I’m going to assume you have the basic structure of an R package in place, such as the one created by default by RStudio when you start a new project. We’ll call it app:

```
app
├── DESCRIPTION
├── NAMESPACE
├── R
│   └── hello.R
├── app.Rproj
└── man
    └── hello.Rd
```

Besides the testthat package, which has become the standard library for writing unit tests in R, I recommend using testthis, which facilitates creating and running integration tests:

```r
usethis::use_testthat()
testthis::use_integration_tests()
```

This sets up the basic structure for running both unit tests and integration tests:

```
app
├── DESCRIPTION
├── NAMESPACE
├── R
│   └── hello.R
├── app.Rproj
├── man
│   └── hello.Rd
└── tests
    ├── testthat
    │   └── integration_tests
    └── testthat.R
```

Any test you create under tests/testthat/integration_tests will be run when you call testthis::test_integration(), but not when running your regular unit tests.

I like to begin with a failing test to make sure that everything is set up correctly. Let’s go ahead and write our end-to-end test, calling it tests/testthat/integrations_test-end_to_end.R:

```r
# integrations_test-end_to_end.R
test_that("this test fails", {
  expect_true(FALSE)
})
```

As expected, we can now run the “integration” test and see it fail:

```
> testthis::test_integration()
Loading app
| OK F W S | Context
|  0 1     | test_end-to-end
───────────────────────────────────────────────────
test-test_end-to-end.R:2: failure: this test fails
FALSE isn't true.
───────────────────────────────────────────────────
══ Results ════════════════════════════════════════
OK:       0
Failed:   1
Warnings: 0
Skipped:  0
```

Good. Let’s now write test code as if a model had been serialized, and check its number of training samples.

```r
# integrations_test-end_to_end.R
test_that("a model is output with the right number of training samples", {
  app::main()
})
```

Hold it right there. This test will, of course, not even run because the main() function doesn’t exist. Let’s create R/main.R with the following contents before we continue:

```r
# main.R
main <- function() {}
```

The integration test now runs, so we can complete it. But before we can continue, we will need some training data, which will be the Titanic survival dataset. We don’t want main() to read the data from file, but from an interface that is as similar as possible to the actual environment.

We’ve been told that the application will read from a PostgreSQL relational database. We could certainly have our test code set up a local PostgreSQL instance, and load it with the training dataset; this will sometimes be the right strategy, but in this case I think it is overkill. Instead, we will exploit the fact that most database systems can be abstracted away through ODBC drivers. R code that talks to PostgreSQL through its ODBC driver should work just as well against any other database system. In particular, we’ll use the in-memory SQLite database for integration testing.

So our integration test code will look like this, in pseudo-code:

```r
# integrations_test-end_to_end.R (pseudo-code)
test_that("a model is output with the right number of training samples", {
  # load the Titanic training set
  # copy it to a test database
  app::main()
  # read the serialized model
  # check its number of training samples
})
```

Well, let’s do it. Create the tests/testthat/testdata-raw folder (or just call testthis::use_testdata_raw()), and copy there the train.csv dataset downloaded from https://www.kaggle.com/c/titanic/data, renamed to titanic.csv.

We’ll proceed with turning our pseudo-code into real code, refactoring as we go along:

```r
# integrations_test-end_to_end.R
test_that("a model is output with the right number of training samples", {
  # Load Titanic training set
  titanic <- readr::read_csv("../testdata-raw/titanic.csv")

  # Copy it to SQLite
  con <- odbc::dbConnect(
    odbc::odbc(),
    driver = "SQLite",
    database = "file::memory:?cache=shared"
  )
  dplyr::copy_to(con, titanic, temporary = FALSE)

  # Call main()
  app::main()

  # Read serialized model
  model <- readRDS("output/model.rds")
})
```

Notice the file::memory:?cache=shared URI used to instantiate the in-memory SQLite database. If you simply pass the usual :memory: filename you won’t be able to connect to it from more than one place in your program; but we want the test code to populate the same database that will be accessed by the main application, so we must pass this special file::memory:?cache=shared URI.

Notice also the temporary = FALSE argument to dplyr::copy_to(). This is required for the table to be accessible from another connection than the current one.
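As a quick illustration of the shared-cache behaviour, here is a sketch using the RSQLite package directly rather than the ODBC driver (so the package and calls are my assumption, not the code used in this post): two separate connections to the same URI see the same in-memory database.

```r
# Sketch (assumes the RSQLite package is installed): a table written on
# one connection is visible from a second connection to the same
# shared-cache in-memory database
library(DBI)
con1 <- DBI::dbConnect(RSQLite::SQLite(), "file::memory:?cache=shared")
con2 <- DBI::dbConnect(RSQLite::SQLite(), "file::memory:?cache=shared")
DBI::dbWriteTable(con1, "titanic", data.frame(Survived = c(0, 1)))
DBI::dbReadTable(con2, "titanic")   # the table written on con1 is visible here
DBI::dbDisconnect(con1)
DBI::dbDisconnect(con2)
```

With the plain ":memory:" filename, the second dbConnect() would instead open a fresh, empty database.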

Calling testthis::test_integration() should now fail:

```
> testthis::test_integration()
Loading app
| OK F W S | Context
|  0 1 1   | test_end-to-end
───────────────────────────────────────────────────
test-test_end-to-end.R:14: warning: a model is output with the right number of training samples
cannot open compressed file 'output/model.rds', probable reason 'No such file or directory'

test-test_end-to-end.R:14: error: a model is output with the right number of training samples
cannot open the connection
1: readRDS("output/model.rds") at /Users/dlindelof/Work/app/tests/testthat/integration_tests/test-test_end-to-end.R:14
2: gzfile(file, "rb")
───────────────────────────────────────────────────
══ Results ════════════════════════════════════════
OK:       0
Failed:   1
Warnings: 1
Skipped:  0
```

It fails indeed, so we stop writing test code at this point and fix the failure. It fails because no model has been serialized yet; we don’t want to overengineer at this point, so we fix it by instantiating a fake model and serializing it:

```r
# main.R
main <- function() {
  model <- lm(~ 0)
  if (!dir.exists("output")) dir.create("output")
  saveRDS(model, "output/model.rds")
}
```

The tests pass now, so we keep on going:

```r
# integrations_test-end_to_end.R
test_that("a model is output with the right number of training samples", {
  # Load Titanic training set
  titanic <- readr::read_csv("../testdata-raw/titanic.csv")

  # Copy it to SQLite
  con <- odbc::dbConnect(
    odbc::odbc(),
    driver = "SQLite",
    database = "file::memory:?cache=shared"
  )
  dplyr::copy_to(con, titanic, temporary = FALSE)

  # Call main()
  main()

  # Read serialized model
  model <- readRDS("output/model.rds")

  # Verify number of training samples, i.e. check that the model has
  # been trained with the right number of samples
  expect_equal(length(model$fitted.values), nrow(titanic))
})
```

There are several issues with that test, but it is complete (for now) and, as expected, fails:

```
> testthis::test_integration()
Loading app
| OK F W S | Context
|  0 1     | test_end-to-end
───────────────────────────────────────────────────
test-test_end-to-end.R:16: failure: a model is output with the right number of training samples
length(model$fitted.values) not equal to nrow(titanic).
1/1 mismatches
 0 - 891 == -891
───────────────────────────────────────────────────
══ Results ════════════════════════════════════════
OK:       0
Failed:   1
Warnings: 0
Skipped:  0
```

To pass the test, the main() function needs to open an ODBC connection to SQLite, read the training dataset, train (some) model with it, and serialize the model. But we run into a problem: how does our application know where to find the database it should connect to? And how does it know which driver to load? A reasonable solution is to use the config package with a configuration file. When we run integration tests we’ll use the test settings, and when we run in production we’ll use the default settings.

So here’s a first attempt at config.yml:

```yaml
# config.yml
test:
  driver: "SQLite"
  database: "file::memory:?cache=shared"

default:
  driver: "PostgreSQL"
  database: "TBD"
```

We should now be able to read the titanic dataset from main():

```r
# main.R
main <- function(env = "default") {
  connection_params <- config::get(config = env)
  con <- do.call(odbc::dbConnect, c(odbc::odbc(), connection_params))
  train <- dplyr::tbl(con, "titanic") %>% dplyr::collect()
  model <- lm(~ 0, data = train)
  if (!dir.exists("output")) dir.create("output")
  saveRDS(model, "output/model.rds")
}
```

Running the tests, we see that the dataset gets read without errors by main(), but the number of training samples in the model remains 0:

```
> testthis::test_integration()
Loading app
| OK F W S | Context
|  0 1     | test_end-to-end [0.1 s]
───────────────────────────────────────────────────
test-test_end-to-end.R:16: failure: a model is output with the right number of training samples
length(model$fitted.values) not equal to nrow(titanic).
1/1 mismatches
 0 - 891 == -891
───────────────────────────────────────────────────
══ Results ════════════════════════════════════════
Duration: 0.1 s
OK:       0
Failed:   1
Warnings: 0
Skipped:  0
```

That is simply because we have not specified any target variable in our call to lm(). Let’s fix that:

```r
# main.R
main <- function(env = "default") {
  connection_params <- config::get(config = env)
  con <- do.call(odbc::dbConnect, c(odbc::odbc(), connection_params))
  train <- dplyr::tbl(con, "titanic") %>% dplyr::collect()
  model <- lm(Survived ~ 0, data = train)
  if (!dir.exists("output")) dir.create("output")
  saveRDS(model, "output/model.rds")
}
```

And the tests now pass:

```
> testthis::test_integration()
Loading app
| OK F W S | Context
|  1       | test_end-to-end
══ Results ════════════════════════════════════════
OK:       1
Failed:   0
Warnings: 0
Skipped:  0
```

Let’s now recap what we have achieved. We have an integration test case that verifies that a linear model (or any model, really, provided it exposes a fitted.values attribute) gets trained by our application, using a dataset read over an ODBC connection we provide. But more importantly, we have written an automated test that exercises the whole application, and will keep on doing so as we implement the missing pieces. It doesn’t feel like we have accomplished much; after all, our model is certainly not the best one for the classification task at hand, and we don’t even train it in a sensible manner. But consider this: before you began the project, chances are the customer (or the boss) didn’t really know what they wanted.
Now you are in a position to hold the following conversation:

you: We have decided that the first user story should prove that the application, when run, trains a model on the Titanic dataset. This is what this first integration test demonstrates.

boss: How good is that model?

you: Its accuracy is 62%, if that’s what you mean.

boss: Really? That’s pretty impressive for a first shot. How did you achieve that?

you: It predicts that everybody dies. And 62% of passengers in the training dataset did indeed die.

boss: (shifts uncomfortably) Hm, I see. Well, I guess we should think of alternative metrics. We’d rather avoid false positives, even at the cost of some accuracy. Can you do something about it?

So by getting a woefully incomplete, but working, application out so soon you’ve been able to help your stakeholders understand what really matters to them. Better yet, you are even in a position to write the next item on your backlog as a precise user story:

```
As a customer,
I want to know if I'm going to die on this cruise,
but a death prediction had better be right 90% of the time.
```

Time to write a new integration test and our first unit tests.

To leave a comment for the author, please follow the link and comment on their blog: R – David's blog.

### Durban EDGE DataQuest

Wed, 11/13/2019 - 03:00

[This article was first published on R | datawookie, and kindly contributed to R-bloggers.]
The Durban EDGE (Economic Development and Growth in eThekwini) DataQuest was held at UKZN (Westville Campus) on 13 November 2019. Participants were tasked with creating something interesting and useful with the civic data on the new Durban EDGE Open Data Portal developed by Open Data Durban. These datasets were available:

- EThekwini Water and Sanitation
- Durban Skills Audit 2016
- EThekwini Financial Statistics Survey
- EThekwini Rate Collection and Valuation Roll
- EThekwini Business Licensing
- EThekwini DMOSS – Durban Metropolitan Open Space System
- Rentable Office Data
- EThekwini Labour Force
- EThekwini Building Plans
- Durban Film Sector Data
- KZN Formal Education – Current
- EThekwini Electricity Usage and Access
- EThekwini Ward Maps

None of the participants had prior experience with R, but most had used Excel. I’d hoped to get at least a few of them to try using R. To make this more accessible I introduced them to RStudio Cloud, which is a phenomenal tool for this sort of gig since it requires zero setup on the participants’ machines. I also put together a couple of starter scripts. Let’s take a quick look at them.

Electricity Usage

This script loads the electricity consumption data, does some simple wrangling (mostly just fixing the year column) and then creates a few plots. The first plot shows how the number of (formal) electricity consumers has increased over time. We see that there is a systematic increase in the number of consumers, which makes sense in terms of population growth and urbanisation. How much energy is being consumed? Again there is systematic growth in energy consumption. But something clearly happens in 2007: the introduction of load shedding. With these two pieces of information we can also assess the average power consumed per customer.
Distribution of Drivers’ Licenses

This script merges data from two sources:

- a KML file giving ward boundaries and
- a skills survey.

Although there’s a wealth of informative data in the survey, to keep things simple I used a single Boolean column: whether or not the respondent had a driver’s license. Mashing these two datasets together created the map below: the proportion of people with drivers’ licenses, broken down by ward.

Both of these scripts provide potentially interesting starting points for a deeper analysis. The main motivation for them, though, was simply to show how such an analysis can be done in R.

To leave a comment for the author, please follow the link and comment on their blog: R | datawookie.

### The Colour of Everything

Wed, 11/13/2019 - 01:00

[This article was first published on Data Imaginist, and kindly contributed to R-bloggers.]

I’m happy to announce that farver 2.0 has landed on CRAN. This is a big release comprising a rewrite of much of the internals along with a range of new functions and improvements. Read on to find out what this is all about.
The case for farver

The first version of farver really came out of necessity, as I identified a major performance bottleneck in gganimate related to converting colours into Lab colour space and back when tweening them. This was a result of grDevices::convertColor() not being meant for use with millions of colour values. I built farver to address this very specific need, which in turn made Brodie Gaslam look into speeding up the grDevices function. The bottom line is that, while farver is still the fastest at converting between colour spaces, grDevices is now so fast that I probably wouldn’t have bothered to build farver in the first place had it been like this all along. I find this a prime example of fruitful open source competition and couldn’t be happier that Brodie took it upon himself.

So why a shiny new version? As part of removing compiled code from scales, we decided to adopt farver for colour interpolation, and the code could use a brush-up. I’ve become much more practised at writing compiled code, and further, there were some shortcomings in the original implementation that needed to be addressed if scales (and thus ggplot2) were to depend on it. Besides, I usually work on larger frameworks, and there is a certain joy in picking a niche area that you care about and going ridiculously overboard in tooling, without worrying about whether it benefits anyone other than yourself (ambient is another example of such indulgence).

The new old

The former version of farver was quite limited in functionality. It had two functions, convert_colour() and compare_colour(), that did colour space conversion and colour distance calculations respectively. No outward changes have been made to these functions, but internally a lot has happened. The old version had no input validation, so passing in colours with NA, NaN, Inf, or -Inf would give you some truly weird results back.
Further, the input and output were not capped to the range of the given colour space, so you could in theory end up with negative RGB values if you converted from a colour space with a larger gamut than sRGB. Both of these issues have been rectified in the new version. Any non-finite value in any channel will result in NA in all channels of the output (for conversion) or an NA distance (for comparison).

```r
library(farver)
colours <- cbind(r = c(0, NA, 255), g = c(55, 165, 20), b = c(-Inf, 120, 200))
colours
##        r   g    b
## [1,]   0  55 -Inf
## [2,]  NA 165  120
## [3,] 255  20  200

convert_colour(colours, from = 'rgb', to = 'yxy')
##            y1        x        y2
## [1,]       NA       NA        NA
## [2,]       NA       NA        NA
## [3,] 25.93626 0.385264 0.1924651
```

Further, input is now capped to the channel range (if any) before conversion, and output is capped again before returning the result. The latter means that convert_colour() is only symmetric (ignoring rounding errors) if the colours are within gamut in both colour spaces.

```r
# Equal output because values are capped between 0 and 255
colours <- cbind(r = c(1000, 255), g = 55, b = 120)
convert_colour(colours, 'rgb', 'lab')
##             l        a        b
## [1,] 57.41976 76.10097 12.44826
## [2,] 57.41976 76.10097 12.44826
```

Lastly, a new colour space has been added: CIELch(uv) (hcl in farver), a cousin of CIELch(ab) (lch). Both are polar transformations; the former is based on luv values and the latter on lab. The two colour spaces are often used interchangeably (though they are not equivalent), and as the grDevices::hcl() function is based on the luv space it made sense to provide an equivalent in farver.

The new new

The new functionality mainly revolves around the encoding of colours in text strings. In many programming languages colours can be encoded into strings as #RRGGBB, where each channel is given as hexadecimal digits.
This is also how colours are mostly passed around in R (R also has a list of recognised colour names that can be given as aliases instead of the hex string – see grDevices::colours() for a list). The encoding is convenient as it allows colours to be stored in vectors, and thus in data frame columns or arrays, but it means that if you need to perform operations on a colour you first have to decode the string into channels, potentially convert it into the required colour space, do the manipulation, convert back to sRGB, and encode it into a string again. Encoding and decoding have been supported in grDevices with rgb() and col2rgb() respectively, both of which are pretty fast. col2rgb() has a quirk in that the output has the channels in the rows instead of the columns, contrary to how decoded colours are presented everywhere else:

```r
grDevices::col2rgb(c('#56fec2', 'red'))
##       [,1] [,2]
## red     86  255
## green  254    0
## blue   194    0
```

farver sports two new functions that, besides providing consistency in the output format, also eliminate some steps in the workflow described above:

```r
# Decode strings with decode_colour
colours <- decode_colour(c('#56fec2', 'red'))
colours
##        r   g   b
## [1,]  86 254 194
## [2,] 255   0   0

# Encode with encode_colour
encode_colour(colours)
## [1] "#56FEC2" "#FF0000"
```

Besides the basic use shown above, both functions allow input/output in colour spaces other than sRGB. That means that if you need to manipulate some colour in Lab space, you can decode directly into that space, do the manipulation, and encode directly back.
The functionality is baked into the compiled code, meaning that a lot of memory allocation is spared, making this substantially faster than a grDevices-based workflow:

```r
library(ggplot2)

# Create some random colour strings
colour_strings <- sample(grDevices::colours(), 5000, replace = TRUE)

# Get Lab values from a string
timing <- bench::mark(
  farver = decode_colour(colour_strings, to = 'lab'),
  grDevices = convertColor(t(col2rgb(colour_strings)), 'sRGB', 'Lab', scale.in = 255),
  check = FALSE,
  min_iterations = 100
)
plot(timing, type = 'ridge') +
  theme_minimal() +
  labs(x = NULL, y = NULL)
```

Can we do better than this? If the purpose is simply to manipulate a single channel in a colour encoded as a string, we may forgo the encoding and decoding completely and do it all in compiled code. farver provides a family of functions for doing channel manipulation on string-encoded colours. The channel can be any channel in any colour space supported by farver, and the decoding, manipulation, and encoding are done in one pass. If you have a lot of colours and need to, say, increase darkness, this can save a lot of memory allocation:

```r
# a lot of colours
colour_strings <- sample(grDevices::colours(), 500000, replace = TRUE)

darken <- function(colour, by) {
  colour <- t(col2rgb(colour))
  colour <- convertColor(colour, from = 'sRGB', to = 'Lab', scale.in = 255)
  colour[, 'L'] <- colour[, 'L'] * by
  colour <- convertColor(colour, from = 'Lab', to = 'sRGB')
  rgb(colour)
}

timing <- bench::mark(
  farver = multiply_channel(colour_strings, channel = 'l', value = 1.2, space = 'lab'),
  grDevices = darken(colour_strings, 1.2),
  check = FALSE,
  min_iterations = 100
)
plot(timing, type = 'ridge') +
  theme_minimal() +
  labs(x = NULL, y = NULL)
```

The bottom line

The new release of farver provides invisible improvements to the existing functions and a range of new functionality for working efficiently with string-encoded colours.
You will be using it indirectly after the next release of scales if you are plotting with ggplot2, but you shouldn’t be able to tell. If you somehow end up having to manipulate millions of colours, then farver is still the king of the hill by a large margin when it comes to performance, but I personally believe that it also provides a much cleaner API than any of the alternatives.

To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist.

### Automating update of a fiscal database for the Euro Area

Wed, 11/13/2019 - 00:00

[This article was first published on Macroeconomic Observatory - R, and kindly contributed to R-bloggers.]

Our purpose is to write a program that automatically updates a quarterly fiscal database for the Euro Area. The main difficulty of this exercise is to build long series that go back as far as the 1980s. We use two sources to build the database: the historical database developed in Paredes et al. (2014), which stops in 2013, and the latest Eurostat data. Throughout this article, we explain how we chained each series of PPP with the Eurostat data. Both databases are taken without any seasonal adjustment.
At the end of the post, the chained data series are seasonally adjusted with the seasonal package developed by Sax (2016), which implements the X-13 methodology. To keep the update automated, the recent data points are taken from DBnomics using the rdbnomics package. All the code is written in R, thanks to R Core Team (2016) and RStudio Team (2016). The database will contain the following series:

- Direct taxes
- Indirect taxes
- Social security contributions by employees
- Social security contributions by employers
- Government consumption
- Government investment
- Government transfers
- Government subsidies
- Government compensation of employees
- Unemployment benefits
- Government debt
- Interest payments
- Total revenues
- Total expenditures

Historical data

First we get the historical series built by Paredes et al. (2014). Problem is, the series are not all directly usable: the series of social contributions by contributor do not exist before 1991.

```r
url <- "PPP_raw.xls"
ppp <- read_excel(url, skip = 1)

ppp %<>%
  transmute(period = as.Date(as.yearqtr(`MILL. EURO, RAW DATA, NON-SEAS. ADJUSTED, SMOOTHED ESTIMATES`, format = "%YQ%q")),
            totexp = TOE,           # Total expenditures
            pubcons = GCN,          # General government consumption expenditures
            pubinves = GIN,         # General government investment
            tfs = THN,              # Social payments
            unemp = `of which UNB`, # Unemployment benefits (among social payments)
            salaries = COE,         # Compensation of employees
            subs = SIN,             # Subsidies
            intpay = INP,           # General government interest payments
            totrev = TOR,           # Total revenue
            indirtax = TIN,         # Total indirect taxes
            dirtax = DTX,           # Total direct taxes
            scr = as.numeric(SCR),  # Social contributions by employers
            sce = as.numeric(SCE),  # Social contributions by employees and self-employed
            sct = SCT,              # Total social security contributions
            debt = MAL) %>%         # Euro area general government debt
  filter(!is.na(period))
```

Assuming that the split of social contributions between employers and households remains stable before 1991, we can reconstruct the contributions of employees and employers from the series of total contributions. Using this technique we infer the series of social contributions by contributor before 1991.
# We calculate the ratio of social contributions paid by employers at the first available point of the series
prcent <- transmute(ppp, scr_sct = scr/sct) %>%
  na.omit() %>%
  first() %>%
  as.numeric()

# Using this ratio, we reconstruct the earlier social contributions by contributor
scr_sce_before91 <- filter(ppp, is.na(scr)) %>%
  select(period, sct, scr, sce) %>%
  transmute(period,
            scr = prcent*sct,
            sce = sct - scr) %>%
  gather(var, value, -period)

# We reinject the reconstructed series into the ppp database
ppp %<>%
  select(-sct) %>%
  gather(var, value, -period, na.rm = TRUE) %>%
  bind_rows(scr_sce_before91) %>%
  arrange(var, period)

maxDate <- ppp %>%
  group_by(var) %>%
  summarize(maxdate = max(period)) %>%
  arrange(maxdate)
kable(maxDate)

var      maxdate
debt     2013-10-01
dirtax   2013-10-01
indirtax 2013-10-01
intpay   2013-10-01
pubcons  2013-10-01
pubinves 2013-10-01
salaries 2013-10-01
sce      2013-10-01
scr      2013-10-01
subs     2013-10-01
tfs      2013-10-01
totexp   2013-10-01
totrev   2013-10-01
unemp    2013-10-01

Recent data

The historical series stop in 2013. For the latest points, we use DBnomics to get Eurostat's data. Eurostat's series on social contributions and on unemployment benefits present difficulties as well. We thus download the series from DBnomics in three steps:

1. We take the series on social contributions and treat them to build quarterly series by contributor.
2. We take the series on unemployment benefits and treat them to build quarterly series.
3.
We take the series that do not present any problems.

Special case: social contributions

Download annual data

var_taken <- c('D613', # Annual households' actual social contributions (D613), general government (S13) only
               'D612', # Annual employers' imputed social contributions
               'D611') # Annual employers' actual social contributions (D611), general government (S13) only
url_variables <- paste0(var_taken, collapse = "+")
filter <- paste0('A.MIO_EUR.S13.', url_variables, '.EA19')
df <- rdb("Eurostat", "gov_10a_taxag", mask = filter)
data_1 <- df %>%
  select(period, var = na_item, value) %>%
  spread(var, value) %>%
  mutate(sce = D613 + D612,
         scr = D611) %>%
  select(-D611, -D612, -D613) %>%
  gather(var, value, -period) %>%
  mutate(year = year(period))

The series of actual social contributions present two problems: (i) they are annual, not quarterly; (ii) they are only available up to 2018. We fix this by using two other series: quarterly net total social contributions, and quarterly employers' contributions for the total economy.

From annual to quarterly data

# Quarterly net social contributions, receivable (D61REC), general government (S13) only
df <- rdb("Eurostat", "gov_10q_ggnfa", mask = "Q.MIO_EUR.NSA.S13.D61REC.EA19")
qsct <- df %>%
  transmute(period, var = 'sct', value) %>%
  mutate(year = year(period))
maxtime <- summarize(qsct, maxdate = max(period))

To turn the two annual series of social contributions into quarterly series, we use the quarterly series of total net social contributions to calculate the share of each contributor for each year. Using this share and the quarterly value of the total net social contributions, we can deduce the quarterly value of the net social contributions of each contributor.
# Calculate the annual total of sct
qsct_a <- qsct %>%
  group_by(year) %>%
  summarise(value_a = sum(value))
qsct %<>% left_join(qsct_a, by = "year")

# Convert employees' contributions from annual to quarterly
qsce_uncomplete <- filter(data_1, var == "sce") %>%
  full_join(qsct, by = "year") %>%
  transmute(period = period.y,
            var = var.x,
            value = value.y*value.x/value_a) %>%
  filter(!is.na(value))

# Convert employers' contributions from annual to quarterly
qscr_uncomplete <- filter(data_1, var == "scr") %>%
  full_join(qsct, by = "year") %>%
  transmute(period = period.y,
            var = var.x,
            value = value.y*value.x/value_a) %>%
  filter(!is.na(value))

We plot the built quarterly series against the original annual series to compare them.

plot_treatment <- bind_rows(qscr_uncomplete, qsce_uncomplete) %>%
  mutate(Origin = "Deduced quarterly series",
         value = 4*value) %>% # We multiply by 4 because the plot compares quarterly levels with annual levels
  bind_rows(mutate(data_1, Origin = "Original annual series")) %>%
  mutate(var = ifelse(var == "sce", "Social contribution of households", "Social contribution of employers")) %>%
  select(-year)

ggplot(plot_treatment, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~var, ncol = 2, scales = "free_y") +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(strip.text = element_text(size = 12),
        axis.text = element_text(size = 8)) +
  theme(legend.title = element_blank())

Most recent data

Now that we have the quarterly values, we use the series of total employers' contributions for the total economy, along with the share of each contributor in total contributions, to deduce the latest points of the household and employer series.
# Quarterly employers' social contributions for the total economy
df <- rdb("Eurostat", "namq_10_gdp", mask = "Q.CP_MEUR.NSA.D12.EA19")
qscr_toteco <- df %>%
  transmute(period, var = 'scr', value) %>%
  mutate(year = year(period))

# Using recent data on employers' total contributions, we chain the employers' series forward
qscr <- chain(to_rebase = qscr_toteco,
              basis = qscr_uncomplete,
              date_chain = max(qscr_uncomplete$period),
              is_basis_the_recent_data = FALSE) %>%
  arrange(period)

# Assuming the split between contributors remains constant over time, we deduce the households' series
qsce <- bind_rows(qsce_uncomplete,
                  select(qsct, period, value, var),
                  qscr) %>%
  filter(period <= maxtime$maxdate) %>%
  spread(var, value, fill = 0) %>%
  transmute(period,
            sce = ifelse(period <= max(qsce_uncomplete$period), sce, sct - scr)) %>%
  gather(var, value, -period) %>%
  arrange(period)

The two employers' series differ in level. Indeed, we are interested in employers' social contributions paid to the general government only, not for the total economy. But the patterns of the two series are very similar. So, by chaining them, we take the variations of the total-economy series and apply them to the level of the actual social contributions for the general government only.
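The mechanics of this forward chaining can be sketched on toy numbers (hypothetical values, not taken from the database): we keep the level of the trusted series and extend it using the growth rates of the other one.

```r
# Toy sketch of forward chaining (hypothetical numbers).
# 'old' is the series whose level we trust, known up to period 3;
# 'recent' is the series whose growth rates we borrow beyond that date.
old    <- c(100, 104, 108)
recent <- c(500, 510, 540, 567, 595.35)

g <- recent[4:5] / recent[3:4]          # growth rates after the chaining date
chained <- c(old, old[3] * cumprod(g))  # extend 'old' with those growth rates
chained
# 100.00 104.00 108.00 113.40 119.07
```

The chain() function in the appendix implements the same idea for whole data frames of series.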

plot_treatment <- bind_rows(qscr_uncomplete, qsce_uncomplete) %>%
  mutate(Origin = "Quarterly series") %>%
  bind_rows(mutate(qscr_toteco, Origin = "Original quarterly series (for total economy)"),
            mutate(bind_rows(qsce, qscr), Origin = "Chained series")) %>%
  mutate(var = ifelse(var == "sce", "Contribution of households", "Contribution of employers")) %>%
  select(-year)

ggplot(plot_treatment, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~var, ncol = 2, scales = "free_y") +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(strip.text = element_text(size = 12),
        axis.text = element_text(size = 8)) +
  theme(legend.title = element_blank()) +
  ggtitle("Social contribution forward chaining")

Special case: unemployment benefits

We retrieve government social expenditures and compute each quarter's share of its year's total.

socialexp <- rdb("Eurostat", "gov_10q_ggnfa", mask = "Q.MIO_EUR.NSA.S13.D62PAY.EA19") %>%
  mutate(year = year(period)) %>%
  select(period, value, year) %>%
  group_by(year) %>%
  mutate(sum = sum(value),
         ratio = value/sum) %>%
  ungroup() %>%
  select(-value, -year, -sum)

Then we retrieve the latest annual data on unemployment benefits, put them in a quarterly table and use the previous ratio of quarterly social expenditures to compute quarterly unemployment benefits.

df <- rdb("Eurostat", "gov_10a_exp", mask = "A.MIO_EUR.S13.GF1005.TE.EA19")
recent_unemp <- df %>%
  mutate(year = year(period)) %>%
  select(period, value, year)

recent_unemp_q <- tibble(period = seq(min(recent_unemp$period),
                                      length.out = nrow(recent_unemp)*4,
                                      by = "quarter"),
                         year = year(period)) %>%
  left_join(recent_unemp, by = "year") %>%
  select(-period.y, -year) %>%
  rename(period = period.x)

unemp_q <- recent_unemp_q %>%
  inner_join(socialexp, by = "period") %>%
  mutate(value = value*ratio,
         var = "unemp") %>%
  select(-ratio)

We compare the historical data with the new quarterly series.

data_plot <- ppp %>%
  filter(var == "unemp") %>%
  mutate(var = "From PPP") %>%
  bind_rows(mutate(unemp_q, var = "New measure")) %>%
  filter(year(period) >= 2007)

ggplot(data_plot, aes(period, value, colour = var)) +
  geom_line() +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(strip.text = element_text(size = 12),
        axis.text = element_text(size = 8)) +
  theme(legend.title = element_blank()) +
  ggtitle("Unemployment benefits series comparison")

Chaining recent data with historical data

We now fetch the remaining series from DBnomics; none of them needs any treatment before use. We then add the series of social contributions by contributor to the resulting data frame.
# List of variables that can be taken from the first dataset
var_taken <- c('P3',     # Public consumption
               'P51G',   # Public gross fixed capital formation
               'D62PAY', # Social transfers
               'D1PAY',  # Compensation of employees
               'D3PAY',  # Subsidies
               'D2REC',  # Indirect taxes on production
               'D5REC',  # Direct taxation on income and wealth
               'D41PAY', # Interest payments
               'TE',     # Total expenditures
               'TR')     # Total revenue

# We build the URL, fetch the data and convert it to a data frame
url_variables <- paste0(var_taken, collapse = "+")
filter <- paste0('Q.MIO_EUR.NSA.S13.', url_variables, '.EA19')
data_1 <- rdb("Eurostat", "gov_10q_ggnfa", mask = filter)

# Government consolidated gross debt is in a different dataset, so we make a second call to DBnomics
data_2 <- rdb("Eurostat", "gov_10q_ggdebt", mask = "Q.GD.S13.MIO_EUR.EA19")

# We then bind the two data frames fetched from DBnomics
recent_data <- bind_rows(data_1, data_2) %>%
  transmute(value, period, var = as.factor(na_item))

# Harmonize the DBnomics variable names with those of PPP
var_names <- c('pubcons', 'pubinves', 'tfs', 'salaries', 'subs', 'indirtax',
               'dirtax', 'intpay', 'totexp', 'totrev', 'debt')
recent_data$var <- plyr::mapvalues(recent_data$var, c(var_taken, 'GD'), var_names)

# We include the series of social contributions
var_names <- c(var_names, "sce", "scr", "unemp")
recent_data %<>% bind_rows(qsce, qscr, unemp_q)

We can check the last available date for each series. All that is left to do is to chain the data frame of recent series with the historical database of Paredes et al. (2014). Once the data is chained, we use the seasonal package to remove the seasonal component of each series. Below we present the treatment of each variable, to check graphically that the result is consistent.
chained_data <- recent_data %>%
  chain(basis = .,
        to_rebase = filter(ppp, var %in% var_names),
        date_chain = "2007-01-01") %>%
  arrange(var, period) %>%
  select(period, var, value) %>%
  mutate(Origin = "Chained")

to_deseason <- chained_data %>%
  select(-Origin) %>%
  spread(var, value)

deseasoned <- bind_rows(lapply(unique(chained_data$var),
                               function(var) deseason(var_arrange = var,
                                                      source_df = to_deseason))) %>%
  mutate(Origin = "Final series")

ppp %<>% mutate(Origin = "Historical data")
to_plot <- bind_rows(chained_data, deseasoned, ppp)

Total revenue and expenditures

plot_totrev <- to_plot %>%
  filter(var == "totrev", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "totrev", Origin != "Final series"),
                       ind2 = "1 - Chain series"))

plot_totexp <- to_plot %>%
  filter(var == "totexp", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "totexp", Origin != "Final series"),
                       ind2 = "1 - Chain series"))

p1 <- ggplot(plot_totrev, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  ggtitle("Total revenue") +
  theme(legend.title = element_blank(),
        strip.text = element_text(size = 12),
        plot.title = element_text(size = 16))

p2 <- ggplot(plot_totexp, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  ggtitle("Total expenditures") +
  theme(legend.title = element_blank(),
        strip.text = element_text(size = 12),
        plot.title = element_text(size = 16))

grid.arrange(arrangeGrob(p1, p2, ncol = 2))

Public direct spending

The chained series of public consumption closely resembles the historical series; here our manipulation lets us build a long series with little loss.

The chained series of investment, however, shows a (visually) significant difference in level from the historical one. Chaining still gives a reasonable proxy for public investment, but at some cost.

plot_cons <- to_plot %>%
  filter(var == "pubcons", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "pubcons", Origin != "Final series"),
                       ind2 = "1 - Chain series"))

plot_inves <- to_plot %>%
  filter(var == "pubinves", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "pubinves", Origin != "Final series"),
                       ind2 = "1 - Chain series"))

p1 <- ggplot(plot_cons, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  ggtitle("Public consumption") +
  theme(legend.title = element_blank(),
        strip.text = element_text(size = 12),
        plot.title = element_text(size = 16))

p2 <- ggplot(plot_inves, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  ggtitle("Public Investment") +
  theme(legend.title = element_blank(),
        strip.text = element_text(size = 12),
        plot.title = element_text(size = 16))

grid.arrange(arrangeGrob(p1, p2, ncol = 2))

Specific spending

Both chained series seem consistent with the historical ones; here our manipulation entails little loss.

plot_salaries <- to_plot %>%
  filter(var == "salaries", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "salaries", Origin != "Final series"),
                       ind2 = "1 - Chain series"))

plot_subs <- to_plot %>%
  filter(var == "subs", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "subs", Origin != "Final series"),
                       ind2 = "1 - Chain series"))

p1 <- ggplot(plot_salaries, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  ggtitle("Compensation of employees") +
  theme(legend.title = element_blank(),
        strip.text = element_text(size = 12),
        plot.title = element_text(size = 16))

p2 <- ggplot(plot_subs, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  ggtitle("Subsidies") +
  theme(legend.title = element_blank(),
        strip.text = element_text(size = 12),
        plot.title = element_text(size = 16))

grid.arrange(arrangeGrob(p1, p2, ncol = 2))

Taxes

Here again, both chained series seem consistent with the historical ones, and our manipulation entails little loss.

plot_indir <- to_plot %>%
  filter(var == "indirtax", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "indirtax", Origin != "Final series"),
                       ind2 = "1 - Chain series"))

plot_dir <- to_plot %>%
  filter(var == "dirtax", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "dirtax", Origin != "Final series"),
                       ind2 = "1 - Chain series"))

p1 <- ggplot(plot_indir, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(legend.title = element_blank(),
        strip.text = element_text(size = 12),
        plot.title = element_text(size = 16)) +
  ggtitle("Indirect taxation")

p2 <- ggplot(plot_dir, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(legend.title = element_blank(),
        strip.text = element_text(size = 12),
        plot.title = element_text(size = 16)) +
  ggtitle("Direct taxation")

grid.arrange(arrangeGrob(p1, p2, ncol = 2))

Debt and interest payments

The chained series of general government debt deviates slightly from the historical one, but the deviation is very small and both chained series seem consistent. The seasonality of these two series is weaker, so the final series closely resemble the chained ones.

plot_debt <- to_plot %>%
  filter(var == "debt", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "debt", Origin != "Final series"),
                       ind2 = "1 - Chain series"))

plot_intpay <- to_plot %>%
  filter(var == "intpay", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "intpay", Origin != "Final series"),
                       ind2 = "1 - Chain series"))

p1 <- ggplot(plot_intpay, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(legend.title = element_blank(),
        strip.text = element_text(size = 12),
        plot.title = element_text(size = 16)) +
  ggtitle("Interest payments")

p2 <- ggplot(plot_debt, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(legend.title = element_blank(),
        strip.text = element_text(size = 12),
        plot.title = element_text(size = 16)) +
  ggtitle("General government debt")

grid.arrange(arrangeGrob(p1, p2, ncol = 2))

Total social transfers and unemployment benefits

plot_unemp <- to_plot %>%
  filter(var == "unemp", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "unemp", Origin != "Final series"),
                       ind2 = "1 - Chain series"))

plot_transf <- to_plot %>%
  filter(var == "tfs", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "tfs", Origin != "Final series"),
                       ind2 = "1 - Chain series"))

p1 <- ggplot(plot_transf, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(legend.title = element_blank(),
        strip.text = element_text(size = 12),
        plot.title = element_text(size = 16)) +
  ggtitle("Social transfers")

p2 <- ggplot(plot_unemp, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(legend.title = element_blank(),
        strip.text = element_text(size = 12),
        plot.title = element_text(size = 16)) +
  ggtitle("Unemployment benefits")

grid.arrange(arrangeGrob(p1, p2, ncol = 2))

Building the final database

Comparing the obtained series with PPP

We want to check that the final database we created resembles the seasonally adjusted one of Paredes et al. (2014).

url <- "PPP_seasonal.xls"
pppSA <- read_excel(url, skip = 1)
pppSA %<>%
  transmute(period = as.Date(as.yearqtr(`MILL. EURO, RAW DATA, SEASONALLY ADJUSTED, SMOOTHED ESTIMATES`, format = "%YQ%q")),
            totexp = TOE,           # Total expenditures
            pubcons = GCN,          # General government consumption expenditure
            pubinves = GIN,         # General government investment
            tfs = THN,              # Social payments
            salaries = COE,         # Compensation of employees
            subs = SIN,             # Subsidies
            unemp = `of which UNB`, # Unemployment benefits (among social payments)
            intpay = INP,           # General government interest payments
            totrev = TOR,           # Total revenue
            indirtax = TIN,         # Total indirect taxes
            dirtax = DTX,           # Total direct taxes
            scr = as.numeric(SCR),  # Social contributions by employers
            sce = as.numeric(SCE),  # Social contributions by employees and self-employed
            debt = MAL) %>%         # Euro area general government debt
  filter(!is.na(period))

plot_compare <- gather(pppSA, var, value, -period, convert = TRUE) %>%
  na.omit() %>%
  mutate(Origin = "ppp") %>%
  bind_rows(deseasoned) %>%
  mutate(var = as.factor(var))

xlab_plot <- c('Euro Area government debt', 'Direct taxes', 'Indirect taxes',
               'Interest payments', 'Public consumption', 'Public investment',
               'Compensation of employees', "Households' social contribution",
               "Employers' social contribution", 'Subsidies', 'Social transfers',
               'Total expenditures', 'Total revenues', 'Unemployment benefits')
plot_compare$var <- plyr::mapvalues(plot_compare$var, levels(plot_compare$var), xlab_plot)

ggplot(plot_compare, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~var, ncol = 3, scales = "free_y") +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(legend.title = element_blank(),
        strip.text = element_text(size = 10),
        axis.text = element_text(size = 9)) +
  ggtitle("Fiscal data for the Euro Area")

Final fiscal database for the Euro area

We eventually want to build a database close to that of Paredes et al. (2014). You can download all the raw series here.
EA_Fipu_rawdata <- deseasoned %>%
  select(-Origin) %>%
  spread(var, value)
EA_Fipu_rawdata %>%
  write.csv(file = "EA_Fipu_rawdata.csv", row.names = FALSE)

Then, where needed, the data are normalized by population and prices, using data built to reproduce Smets and Wouters (2003).

sw03 <- read.csv("http://shiny.nomics.world/data/EA_SW_rawdata.csv") %>%
  mutate(period = ymd(period)) %>%
  filter(period >= "1980-01-01")

EA_Fipu_data <- EA_Fipu_rawdata %>%
  inner_join(sw03, by = "period") %>%
  transmute(period = period,
            pubcons_rpc = 100*1e+6*pubcons/(defgdp*pop*1000),
            pubinves_rpc = 100*1e+6*pubinves/(defgdp*pop*1000),
            salaries_rpc = 100*1e+6*salaries/(defgdp*pop*1000),
            subs_rpc = 100*1e+6*subs/(defgdp*pop*1000),
            dirtax_rpc = 100*1e+6*dirtax/(defgdp*pop*1000),
            indirtax_rpc = 100*1e+6*indirtax/(defgdp*pop*1000),
            tfs_rpc = 100*1e+6*tfs/(defgdp*pop*1000),
            sce_rpc = 100*1e+6*sce/(defgdp*pop*1000),
            scr_rpc = 100*1e+6*scr/(defgdp*pop*1000),
            debt_rpc = 100*1e+6*debt/(defgdp*pop*1000))
EA_Fipu_data %>%
  write.csv(file = "EA_Fipu_data.csv", row.names = FALSE)

You can download the ready-to-use (normalized) data for the estimation here.

Appendix

Chaining function

To chain two datasets, we build a chain function whose inputs are two data frames with three standard columns (period, var, value). It returns a data frame of chained values, i.e. the "to rebase" data frame chained onto the "basis" data frame. More specifically, the function:

• computes growth rates from value in the data frame given as the first argument;
• multiplies them by the reference value taken from value in the data frame given as the second argument,
• at the date specified in the third argument.
chain <- function(to_rebase, basis, date_chain = "2000-01-01", is_basis_the_recent_data = TRUE) {
  date_chain <- as.Date(date_chain, "%Y-%m-%d")

  valref <- basis %>%
    filter(period == date_chain) %>%
    transmute(var, value_ref = value)

  if (is_basis_the_recent_data) {
    # Chain is used to update old values to match recent values
    res <- to_rebase %>%
      filter(period <= date_chain) %>%
      arrange(desc(period)) %>%
      group_by(var) %>%
      mutate(growth_rate = c(1, value[-1]/lag(x = value)[-1])) %>%
      full_join(valref, by = c("var")) %>%
      group_by(var) %>%
      transmute(period, value = cumprod(growth_rate)*value_ref) %>%
      ungroup() %>%
      bind_rows(filter(basis, period > date_chain))
  } else {
    # Chain is used to update recent values to match old values
    res <- to_rebase %>%
      filter(period >= date_chain) %>%
      arrange(period) %>%
      group_by(var) %>%
      mutate(growth_rate = c(1, value[-1]/lag(x = value, n = 1)[-1])) %>%
      full_join(valref, by = c("var")) %>%
      group_by(var) %>%
      transmute(period, value = cumprod(growth_rate)*value_ref) %>%
      ungroup() %>%
      bind_rows(filter(basis, period < date_chain))
  }
  return(res)
}

Seasonal adjustment

For the seasonal adjustment, we use a function that turns a series into a time series, applies seas() from the Sax (2016) package, and converts the result back into a data frame.

deseason <- function(source_df = data_large, var_arrange = "pubcons", ...) {
  local_data <- source_df
  detrend <- local_data[var_arrange] %>%
    ts(start = c(1980, 1), frequency = 4) %>%
    seas(na.action = na.exclude)
  res <- as.numeric(final(detrend))
  res <- data_frame(value = res,
                    period = local_data$period,
                    var = var_arrange) %>%
    arrange(period) %>%
    transmute(period, var, value)
}

Bibliography

Joan Paredes, Diego J. Pedregal, and Javier J. Pérez.
Fiscal policy analysis in the euro area: expanding the toolkit.
Journal of Policy Modeling, 36(5):800–823, 2014.
URL: http://www.sciencedirect.com/science/article/pii/S0161893814000702, doi:10.1016/j.jpolmod.2014.07.003.

Christoph Sax.
seasonal: R Interface to X-13-ARIMA-SEATS.
2016.
R package version 1.2.1.
URL: https://CRAN.R-project.org/package=seasonal.

F. Smets and R. Wouters.
An estimated dynamic stochastic general equilibrium model of the euro area.
Journal of the European Economic Association, 2003.

R Core Team.
R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria, 2016.
URL: https://www.R-project.org.

RStudio Team.
RStudio: Integrated Development Environment for R.
RStudio, Inc., Boston, MA, 2016.
URL: http://www.rstudio.com/.


### When Cross-Validation is More Powerful than Regularization

Tue, 11/12/2019 - 20:45

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers].

Regularization is a way of avoiding overfit by restricting the magnitude of model coefficients (or in deep learning, node weights). A simple example of regularization is the use of ridge or lasso regression to fit linear models in the presence of collinear variables or (quasi-)separation. The intuition is that smaller coefficients are less sensitive to idiosyncrasies in the training data, and hence less likely to overfit.

Cross-validation is a way to safely reuse training data in nested model situations. This includes both the case of setting hyperparameters before fitting a model, and the case of fitting models (let’s call them base learners) that are then used as variables in downstream models, as shown in Figure 1. In either situation, using the same data twice can lead to models that are overtuned to idiosyncrasies in the training data, and more likely to overfit.

Figure 1 Properly nesting models with cross-validation

In general, if any stage of your modeling pipeline involves looking at the outcome (we’ll call that a y-aware stage), you cannot directly use the same data in the following stage of the pipeline. If you have enough data, you can use separate data in each stage of the modeling process (for example, one set of data to learn hyperparameters, another set of data to train the model that uses those hyperparameters). Otherwise, you should use cross-validation to reduce the nested model bias.

Cross-validation is relatively computationally expensive; regularization is relatively cheap. Can you mitigate nested model bias by using regularization techniques instead of cross-validation?

The short answer: no, you shouldn’t. But as we’ve written before, demonstrating this is more memorable than simply saying “Don’t do that.”

A simple example

Suppose you have a system with two categorical variables. The variable x_s has 10 levels, and the variable x_n has 100 levels. The outcome y is a function of x_s, but not of x_n (but you, the analyst building the model, don’t know this). Here’s the head of the data.

##     x_s  x_n           y
## 2  s_10 n_72  0.34228110
## 3  s_01 n_09 -0.03805102
## 4  s_03 n_18 -0.92145960
## 9  s_08 n_43  1.77069352
## 10 s_08 n_17  0.51992928
## 11 s_01 n_78  1.04714355

With most modeling techniques, a categorical variable with K levels is equivalent to K or K-1 numerical (indicator or dummy) variables, so this system actually has around 110 variables. In real life situations where a data scientist is working with high-cardinality categorical variables, or with a lot of categorical variables, the number of actual variables can begin to swamp the size of training data, and/or bog down the machine learning algorithm.

One way to deal with these issues is to represent each categorical variable by a single variable model (or base learner), and then use the predictions of those base learners as the inputs to a bigger model. So instead of fitting a model with 110 indicator variables, you can fit a model with two numerical variables. This is a simple example of nested models.

Figure 2 Impact coding as an example of nested model

We refer to this procedure as “impact coding,” and it is one of the data treatments available in the vtreat package, specifically for dealing with high-cardinality categorical variables. But for now, let’s go back to the original problem.

The naive way

For this simple example, you might try representing each variable as the expected value of y - mean(y) in the training data, conditioned on the variable’s level. So the ith “coefficient” of the one-variable model would be given by:

vi = E[y|x = si] − E[y]
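A minimal sketch of this naive coding (on our own simulated data and with our own variable names, not the post's):

```r
# Naive impact coding: per-level mean of y minus the grand mean,
# estimated and applied on the SAME training data (this is what overfits).
set.seed(42)
d <- data.frame(x_s = sample(sprintf("s_%02d", 1:10), 200, replace = TRUE),
                stringsAsFactors = FALSE)
d$y <- as.numeric(factor(d$x_s)) + rnorm(200)    # y really depends on x_s

grand_mean <- mean(d$y)
coeffs <- tapply(d$y, d$x_s, mean) - grand_mean  # one "coefficient" per level
d$vs <- coeffs[d$x_s]                            # look up the code for each row
```

Because the per-level means are computed from the same rows they are then applied to, even a pure-noise variable coded this way will look informative downstream.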

##     x_s      meany      coeff
## 1  s_01  0.7998263  0.8503282
## 2  s_02 -1.3815640 -1.3310621
## 3  s_03 -0.7928449 -0.7423430
## 4  s_04 -0.8245088 -0.7740069
## 5  s_05  0.7547054  0.8052073
## 6  s_06  0.1564710  0.2069728
## 7  s_07 -1.1747557 -1.1242539
## 8  s_08  1.3520153  1.4025171
## 9  s_09  1.5789785  1.6294804
## 10 s_10 -0.7313895 -0.6808876

In other words, whenever the value of x_s is s_01, the one variable model vs returns the value 0.8503282, and so on. If you do this for both variables, you get a training set that looks like this:

##     x_s  x_n           y         vs         vn
## 2  s_10 n_72  0.34228110 -0.6808876 0.64754957
## 3  s_01 n_09 -0.03805102  0.8503282 0.54991135
## 4  s_03 n_18 -0.92145960 -0.7423430 0.01923877
## 9  s_08 n_43  1.77069352  1.4025171 1.90394159
## 10 s_08 n_17  0.51992928  1.4025171 0.26448341
## 11 s_01 n_78  1.04714355  0.8503282 0.70342961

Now fit a linear model for y as a function of vs and vn.

model_raw = lm(y ~ vs + vn, data = dtrain_treated)
summary(model_raw)

## 
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.33068 -0.57106  0.00342  0.52488  2.25472 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.05050    0.05597  -0.902    0.368    
## vs           0.77259    0.05940  13.006   <2e-16 ***
## vn           0.61201    0.06906   8.862   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8761 on 242 degrees of freedom
## Multiple R-squared:  0.6382, Adjusted R-squared:  0.6352 
## F-statistic: 213.5 on 2 and 242 DF,  p-value: < 2.2e-16

Note that this model gives significant coefficients to both vs and vn, even though y is not a function of x_n (or vn). Because you used the same data to fit the one variable base learners and to fit the larger model, you have overfit.

The right way: cross-validation

The correct way to impact code (or to nest models in general) is to use cross-validation techniques. Impact coding with cross-validation is already implemented in vtreat; note the similarity between this diagram and Figure 1 above.

Figure 3 Cross-validated data preparation with vtreat

The training data is used both to fit the base learners (as we did above) and also to create a data frame of cross-validated base learner predictions (called a cross-frame in vtreat). This cross-frame is used to train the overall model. Let’s fit the correct nested model, using vtreat.

library(vtreat)
library(wrapr)

xframeResults = mkCrossFrameNExperiment(dtrain,
                                        qc(x_s, x_n), "y",
                                        codeRestriction = qc(catN),
                                        verbose = FALSE)
# the plan uses the one-variable models to treat data
treatmentPlan = xframeResults$treatments
# the cross-frame
dtrain_treated = xframeResults$crossFrame

head(dtrain_treated)

##     x_s_catN   x_n_catN           y
## 1 -0.6337889 0.91241547  0.34228110
## 2  0.8342227 0.82874089 -0.03805102
## 3 -0.7020597 0.18198634 -0.92145960
## 4  1.3983175 1.99197404  1.77069352
## 5  1.3983175 0.11679580  0.51992928
## 6  0.8342227 0.06421659  1.04714355

variables = setdiff(colnames(dtrain_treated), "y")
model_X = lm(mk_formula("y", variables), data=dtrain_treated)
summary(model_X)

##
## Call:
## lm(formula = mk_formula("y", variables), data = dtrain_treated)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2157 -0.7343  0.0225  0.7483  2.9639
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04169    0.06745  -0.618    0.537
## x_s_catN     0.92968    0.06344  14.656   <2e-16 ***
## x_n_catN     0.10204    0.06654   1.533    0.126
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 242 degrees of freedom
## Multiple R-squared: 0.4753, Adjusted R-squared: 0.471
## F-statistic: 109.6 on 2 and 242 DF, p-value: < 2.2e-16

This model correctly determines that x_n (and its one-variable model x_n_catN) has no effect on the outcome. We can compare the performance of this model to the naive model on holdout data.

                   rmse  rsquared
ypred_naive    1.303778 0.2311538
ypred_crossval 1.093955 0.4587089

The correct model has a much smaller root-mean-squared error and a much larger R-squared than the naive model when applied to new data.
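For intuition, the cross-frame construction can be sketched by hand: each row's impact code comes from one-variable models fit on the other folds only. This is a simplified 2-fold illustration with made-up data, not vtreat's actual algorithm (which is more careful about, e.g., rare levels):

```r
# Hand-rolled cross-frame sketch: each row's impact code comes from
# one-variable models fit on the OTHER fold.
set.seed(25)
n <- 100
d <- data.frame(x = sample(letters[1:5], n, replace = TRUE),
                y = rnorm(n),
                stringsAsFactors = FALSE)

fold <- sample(rep(1:2, length.out = n))   # assign rows to 2 folds
d$x_code <- NA_real_

for (k in 1:2) {
  in_fold <- fold == k
  # fit the one-variable model on the complement of this fold...
  codes <- tapply(d$y[!in_fold], d$x[!in_fold], mean) - mean(d$y[!in_fold])
  # ...and apply it only to this fold's rows
  d$x_code[in_fold] <- as.numeric(codes[d$x[in_fold]])
}

head(d)
```

Because no row's own y value contributes to its code, a noise variable's codes carry no information about y, and the downstream model can correctly ignore it.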

An attempted alternative: regularized models

But cross-validation is so complicated. Can’t we just regularize? As we’ll show in the appendix of this article, for a one-variable model, L2-regularization is simply Laplace smoothing. Again, we’ll represent each “coefficient” of the one-variable model as the Laplace-smoothed value minus the grand mean.

v_i = ( ∑_{j : x_j = s_i} y_j ) / (count_i + λ) − E[y]

where count_i is the frequency of level s_i in the training data, and λ is the smoothing parameter (usually 1). With λ = 1, the first term on the right just adds one to the frequency of the level before taking the “adjusted conditional mean” of y.
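As a standalone illustration of the formula (toy data, not the article's dtrain), the smoothed code only changes the denominator of the per-level mean:

```r
# Laplace-smoothed impact code: per-level sum of y divided by
# (count + lambda), minus the grand mean.  Toy data for illustration.
d <- data.frame(
  x_s = c("a", "a", "b", "b", "b"),
  y   = c(1, 3, 2, 4, 6)
)
lambda <- 1

sum_y   <- tapply(d$y, d$x_s, sum)      # a: 4,  b: 12
count_y <- tapply(d$y, d$x_s, length)   # a: 2,  b: 3
grand   <- mean(d$y)                    # 3.2

v <- sum_y / (count_y + lambda) - grand
v   # a: 4/3 - 3.2,  b: 12/4 - 3.2 = -0.2
```

Levels with few observations are shrunk toward zero (the grand mean), which is the intended regularizing effect.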

Again, let’s show this for the variable x_s.

##    x_s      sum_y count_y   grandmean         vs
## 1  s_01  20.795484      26 -0.05050187  0.8207050
## 2  s_02 -37.302227      27 -0.05050187 -1.2817205
## 3  s_03 -22.199656      28 -0.05050187 -0.7150035
## 4  s_04 -14.016649      17 -0.05050187 -0.7282009
## 5  s_05  19.622340      26 -0.05050187  0.7772552
## 6  s_06   3.129419      20 -0.05050187  0.1995218
## 7  s_07 -35.242672      30 -0.05050187 -1.0863585
## 8  s_08  36.504412      27 -0.05050187  1.3542309
## 9  s_09  33.158549      21 -0.05050187  1.5577086
## 10 s_10 -16.821957      23 -0.05050187 -0.6504130

After applying the one variable models for x_s and x_n to the data, the head of the resulting treated data looks like this:

##    x_s  x_n           y         vs         vn
## 2  s_10 n_72  0.34228110 -0.6504130 0.44853367
## 3  s_01 n_09 -0.03805102  0.8207050 0.42505898
## 4  s_03 n_18 -0.92145960 -0.7150035 0.02370493
## 9  s_08 n_43  1.77069352  1.3542309 1.28612835
## 10 s_08 n_17  0.51992928  1.3542309 0.21098803
## 11 s_01 n_78  1.04714355  0.8207050 0.61015422

Now fit the overall model:

##
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.30354 -0.57688 -0.02224  0.56799  2.25723
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.06665    0.05637  -1.182    0.238
## vs           0.81142    0.06203  13.082  < 2e-16 ***
## vn           0.85393    0.09905   8.621  8.8e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8819 on 242 degrees of freedom
## Multiple R-squared: 0.6334, Adjusted R-squared: 0.6304
## F-statistic: 209.1 on 2 and 242 DF, p-value: < 2.2e-16

Again, both variables look significant. Even with regularization, the model is still overfit. Comparing the performance of the models on holdout data, you see that the regularized model does a little better than the naive model, but not as well as the correctly cross-validated model.

                   rmse  rsquared
ypred_naive    1.303778 0.2311538
ypred_crossval 1.093955 0.4587089
ypred_reg      1.267648 0.2731756

The Moral of the Story

Unfortunately, regularization is not enough to overcome nested model bias. Whenever you apply a y-aware process to your data, you have to use cross-validation methods (or a separate data set) at the next stage of your modeling pipeline.

Appendix: Derivation of Laplace Smoothing as L2-Regularization

Without regularization, the optimal one-variable model for y in terms of a categorical variable with K levels {s_j} is a set of K coefficients v such that

∑_{i=1}^{N} (y_i − v_{j(i)})²

is minimized, where N is the number of data points and j(i) denotes the level taken by observation i. L2-regularization adds a penalty on the magnitude of v, so that the goal is to minimize

∑_{i=1}^{N} (y_i − v_{j(i)})² + λ ∑_{j=1}^{K} v_j²

where λ is a known smoothing hyperparameter, usually set (in this case) to 1.

To minimize the above expression for a single coefficient v_j, take the derivative with respect to v_j and set it to zero:

−2 ∑_{x_i = s_j} (y_i − v_j) + 2 λ v_j = 0
⇒ ∑_{x_i = s_j} y_i = (count_j + λ) v_j

where count_j is the number of times the level s_j appears in the training data. Now solve for v_j:

v_j = ( ∑_{x_i = s_j} y_i ) / (count_j + λ)

This is Laplace smoothing. Note that it is also the one-variable equivalent of ridge regression.
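The equivalence to ridge regression is easy to verify numerically: solving the ridge normal equations on a one-hot encoding of the levels (no intercept) recovers exactly the Laplace-smoothed per-level values. A small check, assuming λ = 1 and made-up data:

```r
# Numerical check that Laplace smoothing equals ridge regression on a
# one-hot encoding of the levels (no intercept), with lambda = 1.
set.seed(42)
x <- sample(c("a", "b", "c"), 30, replace = TRUE)
y <- rnorm(30)
lambda <- 1

# Laplace-smoothed per-level values (before subtracting the grand mean):
smoothed <- tapply(y, x, sum) / (tapply(y, x, length) + lambda)

# Ridge solution v = (X'X + lambda I)^{-1} X'y on the one-hot design:
X <- model.matrix(~ x - 1)
ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)

cbind(smoothed, ridge = as.numeric(ridge))   # the two columns agree
```

The agreement follows because X'X is diagonal with the level counts on the diagonal and X'y holds the per-level sums of y.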


### Logistic Regression in R: A Classification Technique to Predict Credit Card Default

Tue, 11/12/2019 - 20:00

[This article was first published on R Programming - Data Science Blog | AI, ML, big data analytics, and kindly contributed to R-bloggers].

Logistic regression is one of the statistical techniques in machine learning used to build prediction models. It is one of the most popular classification algorithms, mostly used for binary classification problems (problems with two class values; some variants can deal with multiple classes as well), and is applied to a wide range of research and industrial problems. It is therefore essential to have a good grasp of the logistic regression algorithm. In this tutorial, a sneak peek at Data Science Dojo’s hands-on exercises from their 5-day data science bootcamp, you will learn how logistic regression fits a dataset to make predictions, as well as when and why to use it.

In short, logistic regression is used when the dependent variable (target) is categorical. For example:

• To predict whether an email is spam (1) or not spam (0)
• Whether the tumor is malignant (1) or not (0)

It is named ‘logistic regression’ because its underlying technique is much like that of linear regression. However, there are structural differences in how linear and logistic regression operate, and linear regression isn’t suitable for classification problems. This link explains in detail why linear regression isn’t the right approach for classification.

Its name is derived from one of the core functions behind its implementation, called the logistic function or the sigmoid function. It’s an S-shaped curve that can take any real-valued number and map it to a value between 0 and 1, without ever reaching those limits exactly.
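The sigmoid is one line in R; note how it saturates toward 0 and 1 at the extremes:

```r
# The sigmoid maps any real number into the open interval (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(0)                  # 0.5
sigmoid(c(-10, -1, 1, 10))  # close to 0 and 1 at the extremes
curve(sigmoid, from = -6, to = 6)   # the S-shaped curve
```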

The hypothesis function of logistic regression builds on the logistic function g(z):

g(z) = 1 / (1 + e^(−z))

The hypothesis for logistic regression now becomes:

h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

Here θ (theta) is a vector of parameters that our model will calculate to fit our classifier.

After calculations from the above equations, the cost function is now as follows:

J(θ) = −(1/m) ∑_{i=1}^{m} [ y⁽ⁱ⁾ log(h_θ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]

Here m is the number of training examples. Like Linear Regression, we will use gradient descent to minimize our cost function and calculate the vector θ (theta).
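The cost function and a plain gradient-descent loop can be sketched directly. This is a toy illustration on made-up data; note that R's glm(), used later in this tutorial, fits logistic regression by iteratively reweighted least squares rather than gradient descent:

```r
# Toy gradient-descent fit of logistic regression, minimizing the
# cross-entropy cost J(theta).
sigmoid <- function(z) 1 / (1 + exp(-z))

cost <- function(theta, X, y) {
  h <- sigmoid(X %*% theta)
  -mean(y * log(h) + (1 - y) * log(1 - h))
}

set.seed(1)
x1 <- rnorm(50)
y  <- as.numeric(x1 + rnorm(50, sd = 0.5) > 0)  # noisy 0/1 labels
X  <- cbind(1, x1)                              # intercept + one feature

theta <- c(0, 0)
alpha <- 0.1                                    # learning rate
for (i in 1:2000) {
  grad  <- as.numeric(t(X) %*% (sigmoid(X %*% theta) - y)) / length(y)
  theta <- theta - alpha * grad
}

c(initial = cost(c(0, 0), X, y), fitted = cost(theta, X, y))
```

The fitted cost is much lower than the initial cost of log(2) ≈ 0.693, showing the descent is working.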

This tutorial will follow the format below to provide you hands-on practice with Logistic Regression:

1. Importing Libraries
2. Importing Datasets
3. Exploratory Data Analysis
4. Feature Engineering
5. Pre-processing
6. Model Development
7. Prediction
8. Evaluation
THE SCENARIO

In this tutorial, we will be working with the Default of Credit Card Clients data set, which has 30,000 rows and 24 columns. The data set can be used to estimate the probability of default payment by a credit card client from the attributes provided, which cover customer details, past payment information, and bill statements. It is hosted in Data Science Dojo’s repository.

Think of yourself as a lead data scientist employed at a large bank. You have been assigned to predict whether a particular customer will default on their payment next month. The result is an extremely valuable piece of information for the bank when deciding whether to offer credit to a customer, and could massively affect the bank’s revenue. Therefore, your task is critical. You will learn to use logistic regression to solve this problem.

The dataset is a tricky one, with a mix of categorical and continuous variables. You will also get a chance to practice these concepts through short assignments given at the end of a few sub-modules. Feel free to change the parameters in the given methods once you have been through the entire notebook.

1) Importing Libraries

We’ll begin by importing the dependencies we require. The following packages are popular for data wrangling and visualization; we encourage you to have a look at their documentation.

library(knitr)
library(tidyverse)
library(ggplot2)
library(mice)
library(lattice)
library(reshape2)
#install.packages("DataExplorer") if the following package is not available
library(DataExplorer)

2) Importing Datasets

The dataset is available at Data Science Dojo’s repository in the following link. We’ll use head method to view the first few rows.

## Fetch the csv file from the repository
path <- "https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/Default%20of%20Credit%20Card%20Clients/default%20of%20credit%20card%20clients.csv"
data <- read.csv(file = path, header = TRUE)
head(data)

Since the header names are in the first row of the dataset, we’ll use the code below to first assign the headers to be the one from the first row and then delete the first row from the dataset. This way we will get our desired form.

colnames(data) <- as.character(unlist(data[1,]))
data = data[-1, ]
head(data)

To avoid any complications ahead, we’ll rename our target variable “default payment next month” to a name without spaces using the code below.

colnames(data)[colnames(data)=="default payment next month"] <- "default_payment"
head(data)

3) Exploratory Data Analysis

Data exploration is one of the most significant parts of the machine learning process. Clean data ensures a notable increase in the accuracy of our model: no matter how powerful the model is, it cannot function well unless the data we provide has been thoroughly processed. This step will help you visualize your data, find relations between variables, deal with missing values and outliers, and gain a fundamental understanding of each variable we’ll use. It will also enable us to figure out the most important attributes to feed our model and discard those that have no relevance.

We will start with using the dim function to print out the dimensionality of our dataframe.

dim(data)

30000 25

The str method allows us to see the data type of each variable. We’ll transform the columns to the numeric data type, since that will be handier for the functions ahead.

str(data)

'data.frame': 30000 obs. of 25 variables:
 $ ID             : Factor w/ 30001 levels "1","10","100",..: 1 11112 22223 23335 24446 25557 26668 27779 28890 2 ...
 $ LIMIT_BAL      : Factor w/ 82 levels "10000","100000",..: 14 5 81 48 48 48 49 2 7 14 ...
 $ SEX            : Factor w/ 3 levels "1","2","SEX": 2 2 2 2 1 1 1 2 2 1 ...
 $ EDUCATION      : Factor w/ 8 levels "0","1","2","3",..: 3 3 3 3 3 2 2 3 4 4 ...
 $ MARRIAGE       : Factor w/ 5 levels "0","1","2","3",..: 2 3 3 2 2 3 3 3 2 3 ...
 $ AGE            : Factor w/ 57 levels "21","22","23",..: 4 6 14 17 37 17 9 3 8 15 ...
 $ PAY_0          : Factor w/ 12 levels "-1","-2","0",..: 5 1 3 3 1 3 3 3 3 2 ...
 $ PAY_2          : Factor w/ 12 levels "-1","-2","0",..: 5 5 3 3 3 3 3 1 3 2 ...
 $ PAY_3          : Factor w/ 12 levels "-1","-2","0",..: 1 3 3 3 1 3 3 1 5 2 ...
 $ PAY_4          : Factor w/ 12 levels "-1","-2","0",..: 1 3 3 3 3 3 3 3 3 2 ...
 $ PAY_5          : Factor w/ 11 levels "-1","-2","0",..: 2 3 3 3 3 3 3 3 3 1 ...
 $ PAY_6          : Factor w/ 11 levels "-1","-2","0",..: 2 4 3 3 3 3 3 1 3 1 ...
 $ BILL_AMT1      : Factor w/ 22724 levels "-1","-10","-100",..: 13345 10030 10924 15026 21268 18423 12835 1993 1518 307 ...
 $ BILL_AMT2      : Factor w/ 22347 levels "-1","-10","-100",..: 11404 5552 3482 15171 16961 17010 13627 12949 3530 348 ...
 $ BILL_AMT3      : Factor w/ 22027 levels "-1","-10","-100",..: 18440 9759 3105 15397 12421 16866 14184 17258 2072 365 ...
 $ BILL_AMT4      : Factor w/ 21549 levels "-1","-10","-100",..: 378 11833 3620 10318 7717 6809 16081 8147 2129 378 ...
 $ BILL_AMT5      : Factor w/ 21011 levels "-1","-10","-100",..: 385 11971 3950 10407 6477 6841 14580 76 1796 2638 ...
 $ BILL_AMT6      : Factor w/ 20605 levels "-1","-10","-100",..: 415 11339 4234 10458 6345 7002 14057 15748 12215 3230 ...
 $ PAY_AMT1       : Factor w/ 7944 levels "0","1","10","100",..: 1 1 1495 2416 2416 3160 5871 4578 4128 1 ...
 $ PAY_AMT2       : Factor w/ 7900 levels "0","1","10","100",..: 6671 5 1477 2536 4508 2142 4778 6189 1 1 ...
 $ PAY_AMT3       : Factor w/ 7519 levels "0","1","10","100",..: 1 5 5 646 6 6163 4292 1 4731 1 ...
 $ PAY_AMT4       : Factor w/ 6938 levels "0","1","10","100",..: 1 5 5 337 6620 5 2077 5286 5 813 ...
 $ PAY_AMT5       : Factor w/ 6898 levels "0","1","10","100",..: 1 1 5 263 5777 5 950 1502 5 408 ...
 $ PAY_AMT6       : Factor w/ 6940 levels "0","1","10","100",..: 1 2003 4751 5 5796 6293 963 1267 5 1 ...
 $ default_payment: Factor w/ 3 levels "0","1","default payment next month": 2 2 1 1 1 1 1 1 1 1 ...

data[, 1:25] <- sapply(data[, 1:25], as.character)

We have added an intermediate step here, converting the data to character first. We need to use as.character before as.numeric because factors are stored internally as integers, with a lookup table giving the factor level labels; using as.numeric directly would only return those internal integer codes.

data[, 1:25] <- sapply(data[, 1:25], as.numeric)
str(data)

'data.frame': 30000 obs. of 25 variables:
 $ ID             : num 1 2 3 4 5 6 7 8 9 10 ...
 $ LIMIT_BAL      : num 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
 $ SEX            : num 2 2 2 2 1 1 1 2 2 1 ...
 $ EDUCATION      : num 2 2 2 2 2 1 1 2 3 3 ...
 $ MARRIAGE       : num 1 2 2 1 1 2 2 2 1 2 ...
 $ AGE            : num 24 26 34 37 57 37 29 23 28 35 ...
 $ PAY_0          : num 2 -1 0 0 -1 0 0 0 0 -2 ...
 $ PAY_2          : num 2 2 0 0 0 0 0 -1 0 -2 ...
 $ PAY_3          : num -1 0 0 0 -1 0 0 -1 2 -2 ...
 $ PAY_4          : num -1 0 0 0 0 0 0 0 0 -2 ...
 $ PAY_5          : num -2 0 0 0 0 0 0 0 0 -1 ...
 $ PAY_6          : num -2 2 0 0 0 0 0 -1 0 -1 ...
 $ BILL_AMT1      : num 3913 2682 29239 46990 8617 ...
 $ BILL_AMT2      : num 3102 1725 14027 48233 5670 ...
 $ BILL_AMT3      : num 689 2682 13559 49291 35835 ...
 $ BILL_AMT4      : num 0 3272 14331 28314 20940 ...
 $ BILL_AMT5      : num 0 3455 14948 28959 19146 ...
 $ BILL_AMT6      : num 0 3261 15549 29547 19131 ...
 $ PAY_AMT1       : num 0 0 1518 2000 2000 ...
 $ PAY_AMT2       : num 689 1000 1500 2019 36681 ...
 $ PAY_AMT3       : num 0 1000 1000 1200 10000 657 38000 0 432 0 ...
 $ PAY_AMT4       : num 0 1000 1000 1100 9000 ...
 $ PAY_AMT5       : num 0 0 1000 1069 689 ...
 $ PAY_AMT6       : num 0 2000 5000 1000 679 ...
 $ default_payment: num 1 1 0 0 0 0 0 0 0 0 ...

When applied to a data frame, the summary() function is essentially applied to each column, and the results for all columns are shown together. For a continuous (numeric) variable like AGE, it returns a numeric summary: the minimum, quartiles, median, mean, and maximum.

summary(data)

       ID          LIMIT_BAL           SEX          EDUCATION
 Min.   :    1   Min.   :  10000   Min.   :1.000   Min.   :0.000
 1st Qu.: 7501   1st Qu.:  50000   1st Qu.:1.000   1st Qu.:1.000
 Median :15000   Median : 140000   Median :2.000   Median :2.000
 Mean   :15000   Mean   : 167484   Mean   :1.604   Mean   :1.853
 3rd Qu.:22500   3rd Qu.: 240000   3rd Qu.:2.000   3rd Qu.:2.000
 Max.   :30000   Max.   :1000000   Max.   :2.000   Max.   :6.000
    MARRIAGE          AGE            PAY_0             PAY_2
 Min.   :0.000   Min.   :21.00   Min.   :-2.0000   Min.   :-2.0000
 1st Qu.:1.000   1st Qu.:28.00   1st Qu.:-1.0000   1st Qu.:-1.0000
 Median :2.000   Median :34.00   Median : 0.0000   Median : 0.0000
 Mean   :1.552   Mean   :35.49   Mean   :-0.0167   Mean   :-0.1338
 3rd Qu.:2.000   3rd Qu.:41.00   3rd Qu.: 0.0000   3rd Qu.: 0.0000
 Max.   :3.000   Max.   :79.00   Max.   : 8.0000   Max.   : 8.0000
     PAY_3             PAY_4             PAY_5             PAY_6
 Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.0000
 1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000
 Mean   :-0.1662   Mean   :-0.2207   Mean   :-0.2662   Mean   :-0.2911
 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000
 Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.0000
   BILL_AMT1         BILL_AMT2        BILL_AMT3          BILL_AMT4
 Min.   :-165580   Min.   :-69777   Min.   :-157264   Min.   :-170000
 1st Qu.:   3559   1st Qu.:  2985   1st Qu.:   2666   1st Qu.:   2327
 Median :  22382   Median : 21200   Median :  20089   Median :  19052
 Mean   :  51223   Mean   : 49179   Mean   :  47013   Mean   :  43263
 3rd Qu.:  67091   3rd Qu.: 64006   3rd Qu.:  60165   3rd Qu.:  54506
 Max.   : 964511   Max.   :983931   Max.   :1664089   Max.   : 891586
   BILL_AMT5        BILL_AMT6          PAY_AMT1          PAY_AMT2
 Min.   :-81334   Min.   :-339603   Min.   :     0   Min.   :      0
 1st Qu.:  1763   1st Qu.:   1256   1st Qu.:  1000   1st Qu.:    833
 Median : 18105   Median :  17071   Median :  2100   Median :   2009
 Mean   : 40311   Mean   :  38872   Mean   :  5664   Mean   :   5921
 3rd Qu.: 50191   3rd Qu.:  49198   3rd Qu.:  5006   3rd Qu.:   5000
 Max.   :927171   Max.   : 961664   Max.   :873552   Max.   :1684259
    PAY_AMT3          PAY_AMT4         PAY_AMT5           PAY_AMT6
 Min.   :     0   Min.   :     0   Min.   :     0.0   Min.   :     0.0
 1st Qu.:   390   1st Qu.:   296   1st Qu.:   252.5   1st Qu.:   117.8
 Median :  1800   Median :  1500   Median :  1500.0   Median :  1500.0
 Mean   :  5226   Mean   :  4826   Mean   :  4799.4   Mean   :  5215.5
 3rd Qu.:  4505   3rd Qu.:  4013   3rd Qu.:  4031.5   3rd Qu.:  4000.0
 Max.   :896040   Max.   :621000   Max.   :426529.0   Max.   :528666.0
 default_payment
 Min.   :0.0000
 1st Qu.:0.0000
 Median :0.0000
 Mean   :0.2212
 3rd Qu.:0.0000
 Max.   :1.0000

Using the introduce method, we can get basic information about the dataframe, including the number of missing values in each variable.

introduce(data)

As we can observe, there are no missing values in the dataframe.

The information in the summary above gives a sense of the continuous and categorical features in our dataset. However, evaluating these details against the data description shows that categorical variables such as EDUCATION and MARRIAGE have categories beyond those given in the data dictionary. We’ll find these extra categories using the count method.

count(data, vars = EDUCATION)

vars     n
   0    14
   1 10585
   2 14030
   3  4917
   4   123
   5   280
   6    51

The data dictionary defines the following categories for EDUCATION: “Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)”. However, we can also observe 0 along with numbers greater than 4, i.e. 5 and 6. Since we don’t have any further details about it, we can assume 0 to be someone with no education experience and 0 along with 5 & 6 can be placed in others along with 4.

count(data, vars = MARRIAGE)

vars     n
   0    54
   1 13659
   2 15964
   3   323

The data dictionary defines the following categories for MARRIAGE: “Marital status (1 = married; 2 = single; 3 = others)”. Since the category 0 hasn’t been defined anywhere in the data dictionary, we can include it in the ‘others’ category marked as 3.

# recode the undocumented levels into the 'others' categories
data$EDUCATION[data$EDUCATION == 0] <- 4
data$EDUCATION[data$EDUCATION == 5] <- 4
data$EDUCATION[data$EDUCATION == 6] <- 4
data$MARRIAGE[data$MARRIAGE == 0] <- 3
count(data, vars = MARRIAGE)
count(data, vars = EDUCATION)

vars     n
   1 13659
   2 15964
   3   377

vars     n
   1 10585
   2 14030
   3  4917
   4   468

We’ll now move on to multivariate analysis of our variables and draw a correlation heat map using the DataExplorer library. The heatmap will enable us to find the correlation between each pair of variables. We are most interested in the correlation of our predictor attributes with the target attribute default_payment. The color scheme depicts the strength of correlation between two variables.

This is a simple way to quickly find out how much impact a variable has on our final outcome; there are other ways to figure this out as well.

plot_correlation(na.omit(data), maxcat = 5L)

We can observe the weak correlation of AGE, BILL_AMT1, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5 and BILL_AMT6 with our target variable.

Now let’s do a univariate analysis of our variables. We’ll start with the categorical variables and quickly check the frequency distribution of their categories. The code below will produce the required graphs; we’ll first draw the distributions for all PAY variables.

plot_histogram(data)

We can make a few observations from the histograms above: the distributions show that nearly all PAY attributes are right-skewed.

4) Feature Engineering

This step can be more important than the actual model used because a machine learning algorithm only learns from the data we give it, and creating features that are relevant to a task is absolutely crucial.

Analyzing our data above, we noted the extremely weak correlation of some variables with the final target variable. The following are the ones with significantly low correlation values: AGE, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, BILL_AMT6.

#deleting columns
data_new <- select(data, -one_of('ID','AGE', 'BILL_AMT2', 'BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6'))
head(data_new)

5) Pre-processing

Standardization is a transformation that centers the data by removing the mean of each feature and then scales it by dividing (non-constant) features by their standard deviation. After standardization, each feature has mean zero and standard deviation one. It is most suitable for techniques that assume a Gaussian distribution of the input variables and work better with rescaled data, such as linear regression, logistic regression, and linear discriminant analysis. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and prevent the estimator from learning from the other features correctly.

In the code below, we’ll use the scale method to transform our dataset.

data_new[, 1:17] <- scale(data_new[, 1:17]) head(data_new)
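As a sanity check on what scale() does, you can standardize a single vector by hand and compare (a small stand-alone illustration, independent of the credit data):

```r
# scale() subtracts the column mean and divides by the column
# standard deviation.
x <- c(10, 20, 30, 40, 50)

scaled_by_hand <- (x - mean(x)) / sd(x)
scaled_by_R    <- as.numeric(scale(x))

all.equal(scaled_by_hand, scaled_by_R)   # TRUE
mean(scaled_by_R)                        # 0 (up to floating point)
sd(scaled_by_R)                          # 1
```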

The next task is to split the data into training and test sets, since we’ll use the test data to evaluate our model. We’ll use a split ratio of 0.3: 30% of the dataset is reserved for testing, while the rest is used for training. The random sampling also shuffles the rows before splitting.

#create a list of random numbers ranging from 1 to number of rows from actual data
#and take 70% of the data into training data
data2 = sort(sample(nrow(data_new), nrow(data_new)*.7))

#creating training data set by selecting the output row values
train <- data_new[data2,]

#creating test data set by not selecting the output row values
test <- data_new[-data2,]

Let us print the dimensions of all these variables using the dim method. You can notice the 70-30% split.

dim(train) dim(test)

21000 18

9000 18

6) Model Development

We will now move on to the most important step: developing our logistic regression model. With a few lines of code, we’ll create a logistic regression model using the glm function from R’s stats package and assign it to a variable named log.model.

We’ll then train our model on the train data frame, which contains 70% of our dataset. This will be a binary classification model.

## fit a logistic regression model with the training dataset
log.model <- glm(default_payment ~ ., data = train, family = binomial(link = "logit"))
summary(log.model)

Call:
glm(formula = default_payment ~ ., family = binomial(link = "logit"), data = train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.1171  -0.6998  -0.5473  -0.2946   3.4915

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.465097   0.019825 -73.900  < 2e-16 ***
LIMIT_BAL   -0.083475   0.023905  -3.492 0.000480 ***
SEX         -0.082986   0.017717  -4.684 2.81e-06 ***
EDUCATION   -0.059851   0.019178  -3.121 0.001803 **
MARRIAGE    -0.107322   0.018350  -5.849 4.95e-09 ***
PAY_0        0.661918   0.023605  28.041  < 2e-16 ***
PAY_2        0.069704   0.028842   2.417 0.015660 *
PAY_3        0.090691   0.031982   2.836 0.004573 **
PAY_4        0.074336   0.034612   2.148 0.031738 *
PAY_5        0.018469   0.036430   0.507 0.612178
PAY_6        0.006314   0.030235   0.209 0.834584
BILL_AMT1   -0.123582   0.023558  -5.246 1.56e-07 ***
PAY_AMT1    -0.136745   0.037549  -3.642 0.000271 ***
PAY_AMT2    -0.246634   0.056432  -4.370 1.24e-05 ***
PAY_AMT3    -0.014662   0.028012  -0.523 0.600677
PAY_AMT4    -0.087782   0.031484  -2.788 0.005300 **
PAY_AMT5    -0.084533   0.030917  -2.734 0.006254 **
PAY_AMT6    -0.027355   0.025707  -1.064 0.287277
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 22176 on 20999 degrees of freedom
Residual deviance: 19535 on 20982 degrees of freedom
AIC: 19571

Number of Fisher Scoring iterations: 6

7) Prediction

Below we’ll use the predict method to obtain the predictions made by our logistic regression model. We will first print the first 10 rows of our test data set, followed by the predicted probabilities for the corresponding rows, which we can compare against the original labels in the test set.

test[1:10,]

## to predict using logistic regression model, probabilities obtained
log.predictions <- predict(log.model, test, type="response")

## Look at probability output
head(log.predictions, 10)

 2   0.539623162720197
 7   0.232835137994762
10   0.25988780274953
11   0.0556716133560243
15   0.422481223473459
22   0.165384552048511
25   0.0494775267027534
26   0.238225423596718
31   0.248366972046479
37   0.111907725985513

Below we assign labels using the decision rule that if the predicted probability is greater than 0.5, the label is 1, and otherwise 0.

log.prediction.rd <- ifelse(log.predictions > 0.5, 1, 0)
head(log.prediction.rd, 10)

 2  1
 7  0
10  0
11  0
15  0
22  0
25  0
26  0
31  0
37  0
8) Evaluation

We’ll now discuss a few evaluation metrics for measuring the performance of our machine learning model. This part is significant, since these metrics allow us to judge how well the model we developed actually performs.

We will output the confusion matrix. It is a handy presentation of the accuracy of a model with two or more classes.

The table presents the predicted classes along one axis and the actual classes along the other. The cells of the table count the predictions made by the machine learning algorithm.

According to an article the entries in the confusion matrix have the following meaning in the context of our study:

[[a b]
 [c d]]

• a is the number of correct predictions that an instance is negative,
• b is the number of incorrect predictions that an instance is positive,
• c is the number of incorrect predictions that an instance is negative, and
• d is the number of correct predictions that an instance is positive.
table(log.prediction.rd, test[,18]) log.prediction.rd 0 1 0 6832 1517 1 170 481

We’ll compute the accuracy below as the sum of the diagonal of the confusion matrix divided by the total number of predictions.

accuracy <- table(log.prediction.rd, test[,18])
sum(diag(accuracy))/sum(accuracy)

0.812555555555556
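Accuracy alone can flatter an imbalanced problem like default prediction, where the positive class is rare. From the same confusion-matrix counts shown above (rows are predictions, columns are true labels), we can also compute precision and recall:

```r
# Confusion-matrix counts read off the table above.
tn <- 6832   # predicted 0, actually 0
fn <- 1517   # predicted 0, actually 1
fp <- 170    # predicted 1, actually 0
tp <- 481    # predicted 1, actually 1

accuracy  <- (tn + tp) / (tn + fn + fp + tp)   # matches the ~0.813 above
precision <- tp / (tp + fp)  # of predicted defaulters, fraction who defaulted
recall    <- tp / (tp + fn)  # of actual defaulters, fraction we caught

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```

Note the low recall: despite roughly 81% accuracy, the model catches only about a quarter of the actual defaulters, which matters a great deal for the bank's use case.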

Conclusion

This tutorial has given you a brief and concise overview of the logistic regression algorithm and all the steps involved in achieving better results from our model. It has also highlighted a few methods related to exploratory data analysis, pre-processing and evaluation; there are several other methods that we encourage you to explore on our blog or video tutorials.

If you want to take a deeper dive into several data science techniques, join our 5-day hands-on data science bootcamp, preferred by working professionals, where we cover the following topics:

• Fundamentals of Data Mining
• Machine Learning Fundamentals
• Introduction to R
• Introduction to Azure Machine Learning Studio
• Data Exploration, Visualization, and Feature Engineering
• Decision Tree Learning
• Ensemble Methods: Bagging, Boosting, and Random Forest
• Regression: Cost Functions, Gradient Descent, Regularization
• Unsupervised Learning
• Recommendation Systems
• Metrics and Methods for Evaluating Predictive Models
• Introduction to Online Experimentation and A/B Testing
• Fundamentals of Big Data Engineering
• Message Queues and Real-time Analytics
• NoSQL Databases and HBase
• Hack Project: Creating a Real-time IoT Pipeline
• Naive Bayes
• Logistic Regression
• Time Series Forecasting

This post was originally sponsored on What’s The Big Data.


### AzureR updates: AzureStor, AzureVM, AzureGraph, AzureContainers

Tue, 11/12/2019 - 19:30

Some major updates to AzureR packages this week! As well as last week's AzureRMR update, there are changes to AzureStor, AzureVM, AzureGraph and AzureContainers. All of these are live on CRAN.

AzureStor 3.0.0

There are substantial enhancements to multiple-file transfers (up and down). You can supply a vector of pathnames to storage_upload/download as the source and destination arguments. Alternatively if you specify a wildcard source, there is now the ability to recurse through subdirectories; the directory structure in the source will be reproduced at the destination.

There are revamped methods for getting storage properties and (user-defined) metadata for storage objects.

You can now use Azurite with AzureStor, by creating a service-specific endpoint (file_endpoint, blob_endpoint, adls_endpoint) with the Azurite URL. AzureStor will print a warning, but create the endpoint anyway.

For other changes, see the NEWS.md file.

AzureVM 2.1.0

You can now create VM scalesets with attached data disks. In addition, you can specify the disk type (Standard_LRS, StandardSSD_LRS, or Premium_LRS) for the OS disk and, for a Linux Data Science Virtual Machine, the supplied data disk. This enables using VM sizes that don't support Premium storage.

AzureGraph 1.1.0 and AzureContainers 1.1.2

These packages have been updated to use the new Microsoft Graph operations, introduced last week, for managing app passwords. As a security measure, app passwords can no longer be manually specified; instead they are auto-generated on the server using a cryptographically secure PRNG.

In AzureGraph, the az_app$update_password() method is defunct; in its place are add_password() and remove_password(). Similarly, in AzureContainers the aks$update_service_password() and aks$update_aad_password() methods no longer accept a manual password as an input. If you use Graph, and in particular if you use AzureContainers to deploy Azure Kubernetes Service clusters, you should update as soon as possible, as the old versions are incompatible with the new Graph API. If you run into any problems, please open an issue at the GitHub repo in question.

### Azure AI and Machine Learning talk series

Tue, 11/12/2019 - 18:04

[This article was first published on Revolutions, and kindly contributed to R-bloggers].

At last week's Microsoft Ignite conference in Orlando, our team delivered a series of 6 talks about AI and machine learning applications with Azure. The videos from each talk are linked below, and you can watch every talk from the conference online (no registration necessary). Each of our talks also comes with a companion GitHub repository, where you can find all of the code and scripts behind the demonstrations, so you can deploy and run them yourself.
If you'd like to see these talks live, they will also be presented in 31 cities around the world over the next six months, starting with Paris this week. Check the Microsoft Ignite the Tour website for event dates and further information.

AIML10 Making sense of your unstructured data with AI

Tailwind Traders has a lot of legacy data that they'd like their developers to leverage in their apps – from various sources, both structured and unstructured, and including images, forms, and several others. In this session, learn how the team used Azure Cognitive Search to make sense of this data in a short amount of time and with amazing success. We discuss tons of AI concepts, like the ingest-enrich-explore pattern, search skillsets, cognitive skills, natural language processing, computer vision, and beyond.

AIML20 Using pre-built AI to solve business challenges

As a data-driven company, Tailwind Traders understands the importance of using artificial intelligence to improve business processes and delight customers. Before investing in an AI team, their existing developers were able to demonstrate some quick wins using pre-built AI technologies. In this session, we show how you can use Azure Cognitive Services to extract insights from retail data and go into the neural networks behind computer vision. Learn how it works and how to augment the pre-built AI with your own images for custom image recognition applications.

AIML21 Developers guide to AI: A data story

In this theater session, we show the data science process and how to apply it. From exploration of datasets to deployment of services – all applied to an interesting data story. We also take you on a very brief tour of the Azure AI platform.

AIML30 Start building machine learning models faster than you think

Tailwind Traders uses custom machine learning models to fix their inventory issues – without changing their software development life cycle! How? Azure Machine Learning Visual Interface.
In this session, learn the data science process that Tailwind Traders uses and get an introduction to Azure Machine Learning Visual Interface. See how to find, import, and prepare data, select a machine learning algorithm, train and test the model, and deploy a complete model to an API. Get the tips, best practices, and resources you and your development team need to continue your machine learning journey, build your first model, and more.

AIML40 Taking models to the next level with Azure Machine Learning best practices

Tailwind Traders' data science team uses natural language processing (NLP), and recently discovered how to fine-tune and build baseline models with Automated ML. In this session, learn what Automated ML is and why it's so powerful, then dive into how to improve upon baseline models using examples from the NLP best practices repository. We highlight key Azure Machine Learning features and how you can apply them to your organization, including: low-priority compute instances, distributed training with autoscale, hyperparameter optimization, collaboration, logging, and deployment.

AIML50 Machine learning operations: Applying DevOps to data science

Many companies have adopted DevOps practices to improve their software delivery, but these same techniques are rarely applied to machine learning projects. Collaboration between developers and data scientists can be limited, and deploying models to production in a consistent, trustworthy way is often a pipe dream. In this session, learn how Tailwind Traders applied DevOps practices to their machine learning projects using Azure DevOps and Azure Machine Learning Service. We show automated training, scoring, and storage of versioned models, wrap the models in Docker containers, and deploy them to Azure Container Instances or Azure Kubernetes Service. We even collect continuous feedback on model behavior so we know when to retrain.
### My AP Statistics Class First R Programming Assignment Using RStudio

Tue, 11/12/2019 - 15:49

[This article was first published on R – Saturn Science, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

My AP Stats class started their first R programming assignment this week. I gave them the code to type in and play with, which will give them some experience with RStudio and basic function commands. I have a total of six assignments for them to complete over the next few months. All my students have a laptop to use, so there should be fewer issues getting up and running.

The first thing we did was to download R and RStudio. Everyone was able to get RStudio up and running. I then went over some rules for coding and how the assignments will be submitted. Some of the rules I went over were:

• Comment out your code with the # sign. Every function should have a comment.
• Create white space. It's easier for me to read and debug.
• If you are stuck: a) ask a classmate, b) copy/paste the error message into Google, c) ask Mr. Smith.
• Make sure your name and all relevant information are at the top of your code.
• Expect to get stuck sometimes. Google and Stackoverflow are your friends.
• You will learn a lot.

After we all had RStudio working, I showed them how to install the tidyverse package. This is the greatest package ever and lets me teach the students with all kinds of larger datasets. I'll go into more detail in the next lesson on using dplyr to filter and select from a data frame.

For this first assignment, I'm using the data from our book on page 35. Here is the code for the first assignment and the output.

# My name is _________________________________
# This is my first programming assignment for AP Stats and I will copy and see if everything runs properly
# November 11, 2019
# I need to comment out everything using the "#"
# This lesson is from my website at saturnscience.com
# web link here to see details
# http://www.saturnscience.com/category/r/

#################################### Assignment 1---students type in the data ############

# Everything here works for the latest version of R and RStudio
## The general form of a command will look like this: ## note to myself
## myGraph <- ggplot(myData, aes(variable for x axis, variable for y axis)) + geom()
## You can also use =, it's the same as <-
## NOTE: DO NOT make variable names with a space, use one word or two connected with a period "."

## Here I enter the data from page 35
## The "c" function combines the data into a vector

##### Please load dplyr and ggplot2 now. ####

foreigh.born=c(2.8,7.0,15.1,3.8,27.2,10.3,12.9,8.1,18.9,9.2,16.3,5.6,13.8,4.2,3.8,
6.3,2.7,2.9,3.2,12.2,14.1,5.9,6.6,1.8,3.3,1.9,5.6,19.1,5.4,20.1,
10.1,21.6,6.9,2.1,3.6,4.9,9.7,5.1,12.6,4.1,2.2,3.9,15.9,8.3,
3.9,10.1,12.4,1.2,4.4,2.7)

summary(foreigh.born) # Gives the five number summary.
str(foreigh.born) # the str function shows me the type of structure of the data.
fivenum(foreigh.born) # gives the five number summary
mean(foreigh.born) # just shows the mean
head(foreigh.born, n=12) # shows the first 12, pick n. Used with large data files.
tail(foreigh.born) # shows the end of the data.
# You can pick n or leave it alone.

plot(foreigh.born) # this is R's generic scatter plot function and only shows basic information.
# we will use this more later.
hist(foreigh.born) # This is base R's basic histogram function.

# Below is ggplot's better graphing abilities
ggplot() + aes(foreigh.born) + geom_histogram(binwidth = 2.5)

# I change the variable name so I don't confuse it with the prior graphs
foreign.born3=c(2.8,7.0,15.1,3.8,27.2,10.3,12.9,8.1,18.9,9.2,16.3,5.6,13.8,4.2,3.8,
6.3,2.7,2.9,3.2,12.2,14.1,5.9,6.6,1.8,3.3,1.9,5.6,19.1,5.4,20.1,
10.1,21.6,6.9,2.1,3.6,4.9,9.7,5.1,12.6,4.1,2.2,3.9,15.9,8.3,
3.9,10.1,12.4,1.2,4.4,2.7)

# This is a histogram with base R
hist(foreign.born3, breaks = 10, main = "Histogram with Base Graphics", ylim = c(0,15))

# check the structure
str(foreign.born3)

# make sure it's a data frame by changing to a data.frame.
fb3=as.data.frame(foreign.born3)

# I check to see the structure of fb3
str(fb3)

# I use ggplot to make a histogram similar to the book's histogram
ggplot(fb3,aes(x=foreign.born3))+
  geom_histogram(color="black",fill="orange",binwidth = 3)+
  labs(x="Percent of foreign born residents",y="Number of States")+
  geom_density()

# I can add a density curve to the histogram
ggplot(fb3,aes(x=foreign.born3))+
  geom_histogram(aes(y=..density..),color="black",fill="orange",binwidth = 3)+
  labs(x="Percent of foreign born residents",y="Density of States")+
  geom_density(alpha=0.2,fill="#FF6666")

# Same histogram but I just change the colors a bit.
ggplot(fb3, aes(x=foreign.born3)) +
  geom_histogram(aes(y=..density..), binwidth=3, colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666")

# use control-l to clear the console

Some of the output:

> ##### Please load dplyr and ggplot2 now.
####
>
> foreigh.born=c(2.8,7.0,15.1,3.8,27.2,10.3,12.9,8.1,18.9,9.2,16.3,5.6,13.8,4.2,3.8,
+ 6.3,2.7,2.9,3.2,12.2,14.1,5.9,6.6,1.8,3.3,1.9,5.6,19.1,5.4,20.1,
+ 10.1,21.6,6.9,2.1,3.6,4.9,9.7,5.1,12.6,4.1,2.2,3.9,15.9,8.3,
+ 3.9,10.1,12.4,1.2,4.4,2.7)
>
> summary(foreigh.born) # Gives the five number summary.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.200   3.800   6.100   8.316  12.350  27.200
>
> str(foreigh.born) # the str function shows me the type of structure of the data.
 num [1:50] 2.8 7 15.1 3.8 27.2 10.3 12.9 8.1 18.9 9.2 ...
>
> fivenum(foreigh.born) # gives the five number summary
[1]  1.2  3.8  6.1 12.4 27.2
>
> mean(foreigh.born) # just shows the mean
[1] 8.316
>
> head(foreigh.born, n=12) # shows the first 12, pick n. Used with large data files.
 [1]  2.8  7.0 15.1  3.8 27.2 10.3 12.9  8.1 18.9  9.2 16.3  5.6
>
> tail(foreigh.born) # shows the end of the data. You can pick n or leave it alone.
[1]  3.9 10.1 12.4  1.2  4.4  2.7

Here are some of the plots using ggplot2.

We have completed Unit 4 and will start Unit 5 next week. We are where we need to be at this time of the year. At this rate, we'll finish the class on time and have a few weeks to review for the exam in May 2020.

### RcppAnnoy 0.0.14

Tue, 11/12/2019 - 13:13

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers].
(You can report issues about the content on this page here.)

A new minor release of RcppAnnoy is now on CRAN, following the previous 0.0.13 release in September. RcppAnnoy is the Rcpp-based R integration of the nifty Annoy library by Erik Bernhardsson. Annoy is a small and lightweight C++ template header library for very fast approximate nearest neighbours—originally developed to drive the famous Spotify music discovery algorithm.

This release once again allows compilation on older compilers. The 0.0.13 release in September brought very efficient 512-bit AVX instructions to accelerate computations. However, these could not be compiled on older machines, so we caught up once more with upstream to update to conditional code which will fall back to either 128-bit AVX or no AVX, ensuring buildability "everywhere". Detailed changes follow below.

Changes in version 0.0.14 (2019-11-11)

• RcppAnnoy again synchronized with upstream to ensure builds with older compilers without AVX512 instructions (Dirk #53).
• The cleanup script only uses /bin/sh.

Courtesy of CRANberries, there is also a diffstat report for this release. If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions. This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
### dplyr and Oracle database with odbc on Windows

Tue, 11/12/2019 - 13:05

[This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

RStudio makes Oracle accessibility from R easier via odbc and the connections pane1. Personally, I find it's not so easy. As it finally works for me, I will detail some snippets here. After tens of tries, it seems good to share some tricks2. This blog post is also a notepad for me.

Oracle and R configuration is a step where we potentially waste a lot of time. Many things can cause Oracle and R not to work at all:

• it depends on which client is installed (32-bit, 64-bit?)
• whether the odbc driver is correctly installed or not
• you have to dissect tnsnames.ora
• investigate many ORA errors
• maybe try to clean install the Oracle client

Often ROracle is used and it works well; sometimes it doesn't (some oci.dll not found3, etc.). But it doesn't work with dplyr/dbplyr at the moment. After several years with ROracle, I'm happy to have both possibilities for query writing and collecting (SQL, and now dplyr). Here we are:

RStudio connection pane

From the connection pane we take the Oracle odbc driver name; we have two here, for two Oracle client versions. And then: we now have a big component of the connection string.

32b or 64b

If your Oracle client is 32-bit, you have to switch to R 32-bit, otherwise it doesn't work (at least for me).
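As an aside, the pieces of such a connection string can be assembled and eyeballed in plain R before going anywhere near the database. This is only a sketch; the driver name, host, port, and credentials below are placeholders to substitute with your own.

```r
# Assemble an ODBC connection string from its parts.
# All values here are placeholders -- swap in your own driver name,
# host, port, database, and credentials.
make_conn_string <- function(driver, host, port, db, uid, pwd) {
  sprintf("Driver={%s};DBQ=%s:%s/%s;UID=%s;PWD=%s",
          driver, host, port, db, uid, pwd)
}

make_conn_string("Oracle dans OraClient10g_home1",
                 "host", 1521, "db_name", "woo", "hoo")
# "Driver={Oracle dans OraClient10g_home1};DBQ=host:1521/db_name;UID=woo;PWD=hoo"
```

Printing the string first makes it easy to spot a misplaced semicolon or brace before blaming the Oracle client.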
Connection string

Then stackoverflow history helped me4 to structure the entire string:

library(odbc)
library(dplyr)
library(dbplyr)
library(lubridate)

my_oracle <- dbConnect(odbc::odbc(),
  .connection_string = "Driver={Oracle dans OraClient10g_home1};DBQ=host:port/db_name;UID=woo;PWD=hoo",
  timeout = 10)

You will find all this information in tnsnames.ora. The port is probably 1521.

Some dplyr/dbplyr statements

A simple one:

dplyr::tbl(my_oracle, dbplyr::in_schema('SCHEMA_ONE', 'TB_ONE'))
<SQL> SELECT * FROM SCHEMA_ONE.TB_ONE

dplyr and dblinks

If you have another Oracle database with dblinks, it may also work like this:

dplyr::tbl(my_oracle, dbplyr::in_schema('SCHEMA_B', 'TC_TWO@MYDBTWOLINK'))
<SQL> SELECT * FROM SCHEMA_B.TC_TWO@MYDBTWOLINK

List dblinks:

DBI::dbGetQuery(my_oracle, "SELECT * FROM ALL_DB_LINKS")
<SQL> SELECT * FROM ALL_DB_LINKS

Catalog of all columns5:

<SQL> SELECT * FROM ALL_TAB_COLUMNS

Decomposing the connection string

In order to ask for the password, we split the connection parts:

my_oracle <- dbConnect(odbc::odbc(),
  Driver = "Oracle dans OraClient10g_home1",
  DBQ = "host:port/db_name",
  SVC = "DB_SCHEMA", # schema when connection opens
  UID = "woo",
  PWD = "hoo")

And then:

my_oracle <- dbConnect(odbc::odbc(),
  Driver = "Oracle dans OraClient10g_home1",
  DBQ = "host:port/db_name",
  SVC = "DB_SCHEMA",
  UID = rstudioapi::askForPassword('woo (username)'),
  PWD = rstudioapi::askForPassword('Open, Sesame (password)'))

1. RStudio documentation for Oracle connections: https://db.rstudio.com/databases/oracle/
2. see here for a readme in a repo on github: https://github.com/GuillaumePressiat/oracle_odbc_connection_template_for_R
3. see here for ROracle difficulties: https://technology.amis.nl/2017/08/23/r-and-the-oracle-database-using-dplyr-dbplyr-with-roracle-on-windows-10/
4.
how to make a connection string for oracle that includes hostname, instance name, user id, password using system.data.oracleclient? (stackoverflow)
5. for Oracle catalogs, see here: https://docs.oracle.com/pls/db92/db92.catalog_views?remark=homepage

### Teach R to see by Borrowing a Brain

Tue, 11/12/2019 - 09:00

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

It has been an old dream to teach a computer to see, i.e. to hold something in front of a camera and let the computer tell you what it sees. For decades it has been exactly that: a dream – because we as human beings are able to see, we just don't know how we do it, let alone precisely enough to put it into algorithmic form.

Enter machine learning! As we have seen in Understanding the Magic of Neural Networks, we can use neural networks for that. We have to show the network thousands of readily tagged pics (= supervised learning) and after many cycles, the network will have internalized all the important features of all the pictures shown to it. The problem is that it often takes a lot of computing power and time to train a neural network from scratch.
The solution: a pre-trained neural network which you can just use out of the box! In the following we will build a system where you can point your webcam in any direction or hold items in front of it and R will tell you what it sees: a banana, some toilet paper, a sliding door, a bottle of water and so on. Sounds impressive, right?

For the following code to work you first have to go through the following steps:

1. Install Python through the Anaconda distribution: https://www.anaconda.com
2. Install the R interface to Keras (a high-level neural networks API): https://keras.rstudio.com
3. Load the keras package and the pre-trained ResNet-50 neural network (based on https://keras.rstudio.com/reference/application_resnet50.html):

library(keras)
# instantiate the model
resnet50 <- application_resnet50(weights = 'imagenet')

4. Build a function which takes a picture as input and makes a prediction on what can be seen in it:

predict_resnet50 <- function(img_path) {
  # load the image
  img <- image_load(img_path, target_size = c(224, 224))
  x <- image_to_array(img)
  # ensure we have a 4d tensor with a single element in the batch dimension,
  # then preprocess the input for prediction using resnet50
  x <- array_reshape(x, c(1, dim(x)))
  x <- imagenet_preprocess_input(x)
  # make predictions, then decode and print them
  preds <- predict(resnet50, x)
  imagenet_decode_predictions(preds, top = 3)[[1]]
}

5. Start the webcam and set the timer to 2 seconds (how to do that depends on the technical specs of your webcam!), then start taking pics.
6. Let the following code run and put different items in front of the camera… Have fun!

img_path <- "C:/Users/.../Pictures/Camera Roll" # change path appropriately
while (TRUE) {
  files <- list.files(path = img_path, full.names = TRUE)
  img <- files[which.max(file.mtime(files))] # grab latest pic
  cat("\014") # clear console
  print(predict_resnet50(img))
  Sys.sleep(1)
}

7. When done, click the Stop button in RStudio and stop taking pics.
Optional: delete saved pics – you can also do this with the following command:

unlink(paste0(img_path, "/*")) # delete all pics in folder

Here are a few examples of my experiments with my own crappy webcam:

  class_name class_description        score
1  n07753592            banana 9.999869e-01
2  n01945685              slug 5.599981e-06
3  n01924916          flatworm 3.798145e-06

  class_name class_description        score
1  n07749582             lemon 0.9924537539
2  n07860988             dough 0.0062746629
3  n07747607            orange 0.0003545524

  class_name class_description        score
1  n07753275         pineapple 0.9992571473
2  n07760859     custard_apple 0.0002387811
3  n04423845           thimble 0.0001032234

  class_name class_description      score
1  n04548362            wallet 0.51329690
2  n04026417             purse 0.33063501
3  n02840245            binder 0.02906101

  class_name class_description        score
1  n04355933          sunglass 5.837566e-01
2  n04356056        sunglasses 4.157162e-01
3  n02883205           bow_tie 9.142305e-05

So far, all of the pics were on a white background. What happens in a more chaotic setting?

  class_name class_description      score
1  n03691459       loudspeaker 0.62559783
2  n03180011  desktop_computer 0.17671309
3  n03782006           monitor 0.04467739

  class_name class_description      score
1  n03899768             patio 0.65015656
2  n03930313      picket_fence 0.04702349
3  n03495258              harp 0.04476695

  class_name class_description     score
1  n02870880          bookcase 0.5205195
2  n03661043           library 0.3582534
3  n02871525          bookshop 0.1167464

Quite impressive for such a small amount of work, isn't it?

Another way to make use of pre-trained models is to take them as a basis for building new nets that can e.g. recognize things the original net was not able to. You don't have to start from scratch but use e.g. only the lower layers, which hold the needed building blocks, while retraining the higher layers (another possibility would be to add additional layers on top of the pre-trained model). This method is called Transfer Learning; an example would be to reuse a net that is able to differentiate between male and female persons for recognizing their age or their mood.
The main advantage obviously is that you get results much faster this way; one disadvantage may be that a net trained from scratch might yield better results. As so often in the area of machine learning, there is always a trade-off…

Hope this post gave you an even deeper insight into the fascinating area of neural networks, which is still one of the hottest areas of machine learning research.

### An API for @racently

Tue, 11/12/2019 - 03:00

[This article was first published on R | datawookie, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

@racently is a side project that I have been nursing along for a couple of years. It addresses a problem that I have as a runner: my race results are distributed across a variety of web sites. This makes it difficult to create a single view on my running performance (or lack thereof) over time. I suspect that I am not alone in this. Anyway, @racently was built to scratch my personal itch: my running results are now all aggregated in one place.

A few months ago @DanielCunnama suggested that I add the ability to create running groups in @racently. This sounded like a good idea.
It also sounded like a bit of work and TBH I just did not have the time. So I made a counter-suggestion: how about an API, so that he could effectively aggregate the data in any way he wanted? He seemed happy with the idea, so it immediately went onto my backlog. And there it stayed. But @DanielCunnama is a persistent guy (perhaps this is why he's a class runner!) and he pinged me relentlessly about this… until Sunday, when I relented and created the API. And now I'm happy that I did, because it gives me an opportunity to write up a quick post about how these data can be accessed from R.

Profiles on @racently

I'm going to use Gerda Steyn as an example. I hope she doesn't mind. This is what Gerda's profile looks like on @racently. Now there are a couple of things I should point out:

1. This profile is far from complete. Gerda has run a lot more races than that. These are just the ones that we currently have in our database. We're adding more races all the time, but it's a long and arduous process.
2. The result for the 2019 Comrades Marathon was when she won the race!

A view like this can be created for any runner on the system. Most runners in South Africa should have a profile (unless they have explicitly requested that we remove it!).

Pulling Data with the API

Suppose that you wanted to do some analytics on the data. You'd want to pull the data into R or Python. You could scrape the site, but the API makes it a lot easier to access the data.

Load up some helpful packages.

library(glue)
library(dplyr)
library(purrr)
library(httr)

Set up the URL for the API endpoint and the key for Gerda's profile.

URL = "https://www.racently.com/api/athlete/{key}/"
key = "7ef6fbc8-4169-4a98-934e-ff5fa79ba103"

Send a GET request and extract the results from the response object, parsing the JSON into an R list.

response <- glue(URL) %>% GET() %>% content()

Extract some basic information from the response.
response$url
## [1] "http://www.racently.com/api/athlete/7ef6fbc8-4169-4a98-934e-ff5fa79ba103/"
response$name
## [1] "Gerda Steyn"
response$gender
## [1] "F"

Now get the race results. This requires a little more work because of the way that the JSON is structured: an array of licenses, each of which has a nested array of race result objects.

response$license %>%
  map_dfr(function(license) {
    license$result %>%
      map_dfr(as_tibble) %>%
      mutate(
        club = license$club,
        number = license$number,
        date = as.Date(date)
      )
  }) %>%
  arrange(desc(date))

##         date          race distance     time    club number
## 1 2019-06-09      Comrades  86.8 km 05:58:53 Nedbank     NA
## 2 2018-06-10      Comrades  90.2 km 06:15:34 Nedbank   8300
## 3 2018-05-20           RAC  10.0 km 00:35:38 Nedbank   8300
## 4 2018-05-01 Wally Hayward  10.0 km 00:35:35 Nedbank   8300
## 5 2017-06-04      Comrades  86.7 km 06:45:45 Nedbank     NA
## 6 2016-05-29      Comrades  89.2 km 07:08:23 Nedbank     NA
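If purrr is not to hand, the same licenses-to-results flattening can be reproduced in base R. The toy structure below is invented; only its shape (licenses carrying a nested list of results) mirrors the real API response.

```r
# Toy version of the nested licenses -> results structure returned by the API.
# The values are made up; only the shape mirrors the real response.
licenses <- list(
  list(club = "Nedbank", number = 8300,
       result = list(
         list(date = "2018-06-10", race = "Comrades", time = "06:15:34"),
         list(date = "2018-05-20", race = "RAC",      time = "00:35:38")
       )),
  list(club = "Nedbank", number = NA,
       result = list(
         list(date = "2019-06-09", race = "Comrades", time = "05:58:53")
       ))
)

# Flatten: one data frame row per race result, carrying the license-level
# fields (club, number) down onto each result.
rows <- do.call(rbind, lapply(licenses, function(license) {
  do.call(rbind, lapply(license$result, function(res) {
    data.frame(date = as.Date(res$date), race = res$race, time = res$time,
               club = license$club, number = license$number,
               stringsAsFactors = FALSE)
  }))
}))

# Most recent race first, as in the output above.
rows <- rows[order(rows$date, decreasing = TRUE), ]
```

The nested `do.call(rbind, lapply(...))` pattern is the base-R analogue of the two nested `map_dfr()` calls.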

For good measure, let’s throw in the results for @DanielCunnama.

##          date               race distance     time              club number
## 1  2019-09-29          Grape Run  21.1 km 01:27:49 Harfield Harriers   4900
## 2  2019-06-09           Comrades  86.8 km 07:16:21 Harfield Harriers   4900
## 3  2019-02-17     Cape Peninsula  42.2 km 03:08:47 Harfield Harriers   4900
## 4  2019-01-26  Red Hill Marathon  36.0 km 02:52:55 Harfield Harriers   4900
## 5  2019-01-13         Bay to Bay  30.0 km 02:15:55 Harfield Harriers   7935
## 6  2018-11-10          Winelands  42.2 km 02:58:56 Harfield Harriers   7935
## 7  2018-10-14        The Gun Run  21.1 km 01:22:30 Harfield Harriers   7935
## 8  2018-10-07          Grape Run  21.1 km 01:36:46 Harfield Harriers   8358
## 9  2018-09-23 Cape Town Marathon  42.2 km 03:11:52 Harfield Harriers   7935
## 10 2018-09-09         Ommiedraai  10.0 km 00:37:46 Harfield Harriers  11167
## 11 2018-06-10           Comrades  90.2 km 07:19:25 Harfield Harriers   7935
## 12 2018-02-18     Cape Peninsula  42.2 km 03:08:27 Harfield Harriers   7935
## 13 2018-01-14         Bay to Bay  30.0 km 02:11:50 Harfield Harriers   7935
## 14 2017-10-01          Grape Run  21.1 km 01:27:18 Harfield Harriers   7088
## 15 2017-09-17 Cape Town Marathon  42.2 km 02:57:55 Harfield Harriers   7088
## 16 2017-06-04           Comrades  86.7 km 07:46:18 Harfield Harriers   7088
## 17 2016-10-16        The Gun Run  21.1 km 01:19:09 Harfield Harriers     NA
## 18 2016-09-10   Mont-Aux-Sources  50.0 km 05:42:23 Harfield Harriers     NA
## 19 2016-05-29           Comrades  89.2 km 07:22:53 Harfield Harriers     NA
## 20 2016-02-21     Cape Peninsula  42.2 km 03:17:12 Harfield Harriers     NA

Wrapping Up

Let’s digress for a moment to look at a bubble plot showing the number of races on @racently broken down by runner. There are some really prolific runners.

We've currently got just under one million individual race results across over a thousand races. If you have the time and inclination, there's definitely some interesting science to be done using these results. I'd be very keen to collaborate, so just shout if you are interested.

Feel free to grab some data via the API. At the moment you’ll need to search for an athlete on the main website in order to find their API key. I’ll implement some search functionality in the API when I get a chance.

Finally, here’s a talk I gave about @racently at the Bulgaria Web Summit (2017) in Sofia, Bulgaria. A great conference, incidentally. Well worth making the trip to Bulgaria.


### Trying the ckanr Package

Tue, 11/12/2019 - 01:00

[This article was first published on R on Alan Yeung, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

In previous blog posts (Hacking dbplyr for CKAN, Getting Open Data into R from CKAN) I have been exploring how to download data from the NHS Scotland open data platform into R. I've recently discovered that rOpenSci has a package to help with just this, called ckanr, and I wish I'd known about it earlier as it is really pretty handy! It certainly would've saved me some time, but I suppose the positive I can take from this is that some of the functions in ckanr do similar things to the ideas I had, which I guess shows that my ideas are not completely wacky!

How resources are grouped in CKAN

Before we start testing out some code from ckanr, it is important to consider how resources on CKAN are grouped, as this helps to understand the design of some functions within ckanr. (I'll call the individual data items, such as specific CSV files hosted on CKAN, 'resources', though I am not sure this is necessarily the correct term.) Resources can be grouped within 'Datasets', 'Groups', 'Tags' and 'Themes' (and possibly more that I don't yet know about). Of these, it is clear to me that ckanr offers functions for exploring resources by every grouping except themes (although I could be mistaken about this). With this out of the way, let's delve into some code.

Figure: How resources are grouped in CKAN.

Initialising ckanr and exploring groups of resources

There are, of course, many different data portals that are powered by CKAN so the first thing we need to do with the ckanr package is to tell it which URL to use by default with ckanr_setup(). Note that if you are working in a place where you need to use a proxy to connect R to the internet, this can also be set within ckanr_setup() using the proxy argument.

library(tidyverse) library(ckanr) ckanr_setup(url = "https://www.opendata.nhs.scot")

Now we can explore the groupings available on the NHS Scotland CKAN website with group_list(), package_list() and tag_list(); from the Figure above, these correspond to 'Groups', 'Datasets' and 'Tags' respectively. Note that I only show 10 records in each case to keep things concise.

# View available groups and packages/datasets
group_list(as = "table")[1:10]
#  "acute-hospital-activity"  "cancer"
#  "child-health"             "dental-care"
#  "deprivation"              "emergency-care"
#  "general-practice"         "geography"
#  "hospital-activity"        "infection-disease-and-virus-surveillance"

package_list(as = "table")[1:10]
#  "18-weeks-referral-to-treatment"
#  "27-30-month-review-statistics"
#  "alcohol-related-hospital-statistics-scotland"
#  "allied-health-professionals-musculoskeletal-waiting-times"
#  "allied-health-professional-vacancies"
#  "annual-cancer-incidence"
#  "births-in-scottish-hospitals"
#  "cancelled-planned-operations"
#  "cancer-mortality"
#  "cancer-waiting-times"

tag_list(as = "table")$name[1:10]
#  "31 day"                  "62 day"
#  "address"                 "adolescent"
#  "adult"                   "age"
#  "agenda for change"       "agenda for change band"
#  "ahp"                     "ailment"

Being able to see the names related to the different groupings is useful if you want to download data from all resources in a particular group. I'll give an example of doing this later, but first I want to mimic some of the things I did in previous blog posts, this time using ckanr.

Connect to CKAN with dplyr and download from one resource

Let's demonstrate downloading from one resource using the fairly small Patients Referred dataset within Allied Health Professionals – Musculoskeletal Waiting Times, which has resource ID 3988df43-3516-4190-93da-16189db7329a. We start by using src_ckan() to create a connection to CKAN (similar to how you would connect to other, non-CKAN databases). From there, you can download data much as you would with dbplyr, but using a CKAN resource ID instead of a database table name.

ckan <- src_ckan("https://www.opendata.nhs.scot")
res_id <- "3988df43-3516-4190-93da-16189db7329a"

dplyr::tbl(src = ckan$con, from = res_id) %>%
  as_tibble()
# A tibble: 1,115 x 9
#      _id HBT2014 ReferralsPerOne~ _full_text   Specialty NumberOfReferra~ NumberOfReferra~
#  1     1 S08000~            2890. '2015q3':9 ~ All AHP ~             8904 d
#  2     2 S08000~             411. '1267':5 '2~ Chiropod~             1267 ""
#  3     3 S08000~             94.4 '2015q3':3 ~ Occupati~              291 ""
#  4     4 S08000~             178. '178.17':5 ~ Orthotics              549 ""
#  5     5 S08000~            2206. '2015q3':2 ~ Physioth~             6797 ""
#  6     6 S08000~            1530. '1472':7 '1~ All AHP ~             1472 d
#  7     7 S08000~             165. '159':1 '16~ Orthotics              159 ""
#  8     8 S08000~            1365. '1313':2 '1~ Physioth~             1313 ""
#  9     9 S08000~            2562. '2015q3':7 ~ All AHP ~             3212 d
# 10    10 S08000~             197. '197.02':1 ~ Chiropod~              247 ""
# # ... with 1,105 more rows, and 2 more variables: Quarter ,
# #   ReferralsPerOneHundredThousandPopulationQF

The variables look like they've been downloaded in a somewhat random order, and a _full_text variable seems to have appeared, which makes me think that ckanr uses SQL to download the data. This is easy enough to confirm.

getAnywhere(tbl.src_ckan)
function (src, from, ..., name = NULL)
{
    if (is.null(name)) {
        tbl_sql("ckan", src = src, from = sql(from), ...)
    }
    else {
        tbl_sql(subclass = "ckan", src = src,
                from = sql(sprintf("SELECT * FROM \"%s\"", name)))
    }
}

Now let’s try combining this with some basic dplyr functions like select() and filter().

dplyr::tbl(src = ckan$con, from = res_id) %>%
  select(Quarter, HBT2014) %>%
  filter(HBT2014 == "S08000015") %>%
  as_tibble()
# A tibble: 89 x 2
#    Quarter HBT2014
#  1 2015Q3  S08000015
#  2 2015Q3  S08000015
#  3 2015Q3  S08000015
#  4 2015Q3  S08000015
#  5 2015Q3  S08000015
#  6 2015Q4  S08000015
#  7 2015Q4  S08000015
#  8 2015Q4  S08000015
#  9 2015Q4  S08000015
# 10 2015Q4  S08000015
# # ... with 79 more rows
# Warning messages:
# 1: Translator is missing window variants of the following aggregate functions:
# * all
# * any
# * cor
# * cov
# * paste
# * sd
# (the same warning is repeated two more times)

We get a long list of warnings explaining what you cannot do with the SQL translation available in ckanr but otherwise, it works great!

Downloading all resources from a dataset

Often, a dataset on CKAN contains many resources related to the same thing. For example, the Consultant Vacancies dataset (remember, you can see all available 'Datasets' using package_list()) contains different csv files for vacancies at different time points.

cons_vac <- package_show("consultant-vacancies", as = "table")$resources

cons_vac %>% select(name, id)
#                        name                                   id
# 1       Vacancies June 2019 16e27935-325c-471b-89dc-d41c84b3a744
# 2      Vacancies March 2019 ca67b2a4-b2f3-4420-8b77-3771c53b01f4
# 3   Vacancies December 2018 5da80103-4da8-4694-a8b5-2332dfc43e25
# 4  Vacancies September 2018 91d7b780-f2cb-47fb-919f-1c165ed7d301
# 5       Vacancies June 2018 e874f6f4-6cf5-402c-af1d-2d4f26cc669f
# 6      Vacancies March 2018 415c2f86-db7c-4c12-9a64-0cd9cf0d9118

Now if you extract the required resource IDs, you can download all of the datasets with some help from the fantastic purrr package.

id_list <- cons_vac$id

# Download each resource into a list item
cons_vac_list <- map(id_list, ~as_tibble(dplyr::tbl(src = ckan$con, from = .x)))

# Check how many variables are in each resource
map_dbl(cons_vac_list, length)
#  12 12 12 14 15 12
# Not all resources have the same structure

# Check variable names for a couple that differ
map(cons_vac_list[3:4], names)
# [[1]]
#  "WorkforceRegionGrouping" "HB2014"
#  "HB2014QF"                "TotalVacancies"
#  "_full_text"              "Specialty"
#  "VacanciesGreater6Months" "Date"
#  "SpecialtyQF"             "_id"
#  "Establishment"           "StaffInPost"
#
# [[2]]
#  "WorkforceRegionGrouping" "HB2014"
#  "HB2014QF"                "TotalVacancies"
#  "TotalVacanciesQF"        "_full_text"
#  "Specialty"               "EstablishmentQF"
#  "VacanciesGreater6Months" "Date"
#  "SpecialtyQF"             "_id"
#  "Establishment"           "StaffInPost"

# TotalVacanciesQF and EstablishmentQF are not in resource 3 but are in resource 4

# Combine just the first 3 resources, which look like they all have the same structure
bind_rows(cons_vac_list[1:3])
# A tibble: 1,822 x 12
#    WorkforceRegion~ HB2014 HB2014QF TotalVacancies _full_text
#  1 National Bodies~ SB0806 ""                    0 '0':9 '1.4'~
#  2 Scotland         S9200~ d                     0 '0':5 '2019~
#  3 Scotland         S9200~ d                     0 '0':5 '2.1'~
#  4 National Bodies~ SB0807 ""                    0 '0':9 '0.3'~
#  5 North            S0800~ ""                    0 '0':5 '0.1'~
#  6 National Bodies~ SB0808 ""                    0 '0':9 '1.2'~
#  7 East             S0800~ ""                    0 '0':3 '12.1~
#  8 East             S0800~ ""                    0 '0':3 '1.7'~
#  9 National Bodies~ SB0804 ""                    0 '0':1 '1':9~
# 10 National Bodies~ SB0807 ""                    0 '0':9 '0.4'~
# ... with 1,812 more rows, and 7 more variables: Specialty ,
#   VacanciesGreater6Months , Date , SpecialtyQF ,
#   _id , Establishment , StaffInPost

So that is a basic workflow using ckanr alongside functions from purrr to combine related resources into one dataset. I've also shown some ways of checking that those resources have a consistent structure, an essential step for this kind of task. In this case not all of the resources matched, so for simplicity I combined only the three most recent consultant vacancies files. In practice you might want to make all of the data consistent first and then combine everything, but I'll leave that data-wrangling exercise to the interested reader.
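If you do want to harmonise the resources before combining them, one rough base-R sketch is to pad each data frame with whatever columns it is missing before stacking. (This is my own illustration, not code from the post or from ckanr; the helper pad_cols and the toy data frames are invented.)

```r
# Toy stand-ins for two CKAN resources with slightly different column sets
res_a <- data.frame(HB2014 = c("S08000015", "S08000016"),
                    TotalVacancies = c(3, 0),
                    stringsAsFactors = FALSE)
res_b <- data.frame(HB2014 = "S08000017",
                    TotalVacancies = 5,
                    TotalVacanciesQF = "d",
                    stringsAsFactors = FALSE)

# Add any columns present in other resources but missing here, filled with NA,
# then return the columns in a consistent order
pad_cols <- function(df, all_names) {
  for (nm in setdiff(all_names, names(df))) df[[nm]] <- NA
  df[all_names]
}

all_names <- unique(c(names(res_a), names(res_b)))
combined <- do.call(rbind, lapply(list(res_a, res_b), pad_cols, all_names = all_names))

nrow(combined)   # 3 rows, with NA where a resource lacked a column
```

Note that dplyr::bind_rows() already does this kind of NA-filling for mismatched columns; the harder judgement call is whether same-named columns actually mean the same thing across years.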

My final verdict: ckanr is definitely recommended for working with data from CKAN!


### What can we really expect to learn from a pilot study?

Tue, 11/12/2019 - 01:00

I am involved with a very interesting project – the NIA IMPACT Collaboratory – where a primary goal is to fund a large group of pragmatic pilot studies to investigate promising interventions to improve health care and quality of life for people living with Alzheimer’s disease and related dementias. One of my roles on the project team is to advise potential applicants on the development of their proposals. In order to provide helpful advice, it is important that we understand what we should actually expect to learn from a relatively small pilot study of a new intervention.

There is a rich literature on this topic. For example, these papers by Lancaster et al and Leon et al provide nice discussions about how pilot studies should fit into the context of larger randomized trials. The key point made by both groups of authors is that pilot studies are important sources of information about the feasibility of conducting a larger, more informative study: Can the intervention actually be implemented well enough to study it? Will it be possible to recruit and retain patients? How difficult will it be to measure the primary outcome? Indeed, what is the most appropriate outcome to be measuring?

Another thing the authors agree on is that the pilot study is not generally well-equipped to provide an estimate of the treatment effect. Because pilot studies are limited in resources (both time and money), sample sizes tend to be quite small. As a result, any estimate of the treatment effect is going to be quite noisy. If we accept the notion that there is some true underlying treatment effect for a particular intervention and population of interest, the pilot study estimate may very well fall relatively far from that true value. As a result, if we use that effect size estimate (rather than the true value) to estimate sample size requirements for the larger randomized trial, we run a substantial risk of designing an RCT that is too small, which may lead us to miss identifying a true effect. (Likewise, we may end up with a study that is too large, using up precious resources.)

My goal here is to use simulations to see how a small pilot study could potentially lead to poor design decisions with respect to sample size.
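Before getting to the full simulations, a quick base-R sketch (my addition, not the author's code) gives a feel for the noise: it repeatedly simulates a 30-per-arm pilot with a true standardized effect of 0.5 and records the estimated Cohen's d each time.

```r
set.seed(123)

# Simulate 2000 pilot studies, each with 30 subjects per arm, true d = 50/100 = 0.5
est_d <- replicate(2000, {
  y0 <- rnorm(30, mean = 0, sd = 100)   # control arm
  y1 <- rnorm(30, mean = 50, sd = 100)  # intervention arm
  pool.sd <- sqrt((var(y0) + var(y1)) / 2)
  (mean(y1) - mean(y0)) / pool.sd       # estimated Cohen's d for this pilot
})

round(quantile(est_d, c(0.1, 0.5, 0.9)), 2)
```

The middle 80% of these pilot estimates spans a wide interval around the true value of 0.5, which is exactly the problem: any single pilot could easily hand you an effect size far from the truth.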

A small, two-arm pilot study

In these simulations, I will assume a two-arm study (intervention and control) with a true intervention effect $$\Delta = 50$$. The outcome is a continuous measure with a within-arm standard deviation $$\sigma = 100$$. In some fields of research, the effect size would be standardized as $$d = \Delta / \sigma$$. (This is also known as Cohen’s $$d$$.) So, in this case the true standardized effect size $$d=0.5$$.

If we knew the true effect size and variance, we could skip the pilot study and proceed directly to estimate the sample size required for 80% power and Type I error rate $$\alpha = 0.05$$. Using the pwr.t.test function in the pwr library, we specify the treatment effect (as $$d$$), significance level $$\alpha$$, and power to get the number of subjects needed for each study arm. In this case, it would be 64 (for a total of 128):

library(pwr)

pwr.t.test(n = NULL, d = 50/100, sig.level = 0.05, power = 0.80, type = "two.sample")
##
## Two-sample t test power calculation
##
## n = 64
## d = 0.5
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
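As a quick sanity check (my addition), the same sample size can be reproduced with base R's stats::power.t.test(), which takes the raw effect and standard deviation instead of the standardized $$d$$:

```r
# Base-R equivalent of the pwr calculation: Delta = 50, sigma = 100, 80% power
res <- power.t.test(delta = 50, sd = 100, sig.level = 0.05, power = 0.80,
                    type = "two.sample")
ceiling(res$n)  # 64 per group, matching pwr.t.test
```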

If we do not have an estimate of $$d$$ or even of the individual components $$\Delta$$ and $$\sigma$$, we may decide to do a small pilot study. I simulate a single study with 30 subjects in each arm (for a total study sample size of 60). First, I generate the data set (representing this one version of the hypothetical study) with a treatment indicator $$rx$$ and an outcome $$y$$:

library(simstudy)

defd <- defDataAdd(varname = "y", formula = "rx * 50", variance = 100^2)

ss <- 30
set.seed(22821)

dd <- genData(n = ss*2)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(defd, dd)

head(dd)
##    id rx    y
## 1:  1  0 -150
## 2:  2  1   48
## 3:  3  0 -230
## 4:  4  1  116
## 5:  5  1   91
## 6:  6  1  105

Once we have collected the data from the pilot study, we probably would try to get sample size requirements for the larger RCT. The question is, what information can we use to inform $$d$$? We have a couple of options. In the first case, we can estimate both $$\Delta$$ and $$\sigma$$ from the data and use those results directly in power calculations:

lmfit <- lm(y ~ rx, data = dd)

Delta <- coef(lmfit)["rx"]
Delta
## rx
## 78

sd.rx <- dd[rx==1, sd(y)]
sd.ctl <- dd[rx==0, sd(y)]

pool.sd <- sqrt( (sd.rx^2 + sd.ctl^2) / 2 )
pool.sd
##  94

The estimated standard deviation (94) is less than the true value, and the effect size is inflated (78), so that the estimated $$\hat{d}$$ is also too large, close to 0.83. This is going to lead us to recruit fewer participants (24 in each group) than the number we actually require (64 in each group):

pwr.t.test(n = NULL, d = Delta/pool.sd, sig.level = 0.05, power = 0.80, type = "two.sample")
##
## Two-sample t test power calculation
##
## n = 24
## d = 0.83
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group

Alternatively, if we had external information that provided some insight into the true effect size, or, absent that, we use a minimally clinically significant effect size, we might get a better result. In this case, we are quite fortunate to use an effect size of 50. However, we will continue to use the variance estimate from the pilot study. Using this approach, the resulting sample size (56) happens to be much closer to the required value (64):

pwr.t.test(n = NULL, d = 50/pool.sd, sig.level = 0.05, power = 0.80, type = "two.sample")
##
## Two-sample t test power calculation
##
## n = 56
## d = 0.53
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group

Speak truth to power

Now the question becomes: what is the true expected power of an RCT whose sample size was based on the pilot study? To estimate this true power, we use the true effect size and the true variance (i.e. the true $$d$$).

In the first case, where we actually used the true $$d$$ to get the sample size estimate, we just recover the 80% power estimate. No surprise there:

pwr.t.test(n = 64, d = 0.50, sig.level = 0.05, type = "two.sample")$power
##  0.8

In the second case, where we used $$\hat{d} = \hat{\Delta} / \hat{\sigma}$$ to get the sample size $$n=24$$, the true power of the larger RCT would be 40%:

pwr.t.test(n = 24, d = 0.50, sig.level = 0.05, type = "two.sample")$power
##  0.4
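The same base-R cross-check (again my addition) can recover that true power directly from the raw effect and standard deviation:

```r
# True power of a 24-per-arm trial when Delta = 50 and sigma = 100 (d = 0.5)
pow <- power.t.test(n = 24, delta = 50, sd = 100, sig.level = 0.05,
                    type = "two.sample")$power
round(pow, 2)  # close to 0.40
```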

And if we had used $$\hat{d} = 50 / \hat{\sigma}$$ to get the sample size estimate $$n=56$$, the true power would have been 75%:

pwr.t.test(n = 56, d = 0.50, sig.level = 0.05, type = "two.sample")$power
##  0.75

Conservative estimate of standard deviation

While the two papers I cited earlier suggest that it is not appropriate to use effect sizes estimated from a pilot study (more on that in the next and last section), this 1995 paper by R.H. Browne presents the idea that we can use the estimated standard deviation from the pilot study. Or rather, to be conservative, we can use the upper limit of a one-sided confidence interval for the standard deviation estimated from the pilot study.

The confidence interval for the standard deviation is not routinely provided in R. Another paper analyzes one-sided confidence intervals quite generally under different conditions, and provides a formula for the most straightforward case, under assumptions of normality, to estimate the $$\gamma*100\%$$ one-sided confidence interval for $$\sigma^2$$:

$\left( 0, \ \frac{(N-2)s_{pooled}^2}{\chi^2_{N-2;\gamma}} \right)$

where $$\chi^2_{N-2;\gamma}$$ is determined by $$P(\chi^2_{N-2} > \chi^2_{N-2;\gamma}) = \gamma$$. So, if $$\gamma = 0.95$$, we can get a one-sided 95% confidence interval for the standard deviation using that formulation:

gamma <- 0.95
qchi <- qchisq(gamma, df = 2*ss - 2, lower.tail = FALSE)
ucl <- sqrt( ( (2*ss - 2) * pool.sd^2 ) / qchi )
ucl
##  111

The point estimate $$\hat{\sigma}$$ is 94, and the one-sided 95% confidence interval is $$(0, 111)$$. (I'm happy to provide a simulation to demonstrate that this is in fact the case, but won't do it here in the interest of space.)
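Since the author skips the simulation that would verify this interval, here is a minimal base-R sketch (my own, assuming normal outcomes and the same pilot size of 30 per arm) that checks the coverage of the one-sided bound:

```r
set.seed(135)
n_arm <- 30      # pilot subjects per arm
N <- 2 * n_arm   # total pilot size, so the pooled SD has N - 2 degrees of freedom
sigma <- 100     # true standard deviation

covered <- replicate(5000, {
  y0 <- rnorm(n_arm, 0, sigma)
  y1 <- rnorm(n_arm, 50, sigma)
  pool.sd <- sqrt((var(y0) + var(y1)) / 2)
  # upper limit of the one-sided 95% confidence interval for sigma
  ucl <- sqrt((N - 2) * pool.sd^2 / qchisq(0.95, df = N - 2, lower.tail = FALSE))
  sigma <= ucl   # TRUE when the interval (0, ucl) captures the true sigma
})

mean(covered)    # should land close to the nominal 0.95
```

The check works because $$(N-2)s_{pooled}^2/\sigma^2$$ follows a $$\chi^2_{N-2}$$ distribution, so the bound fails only for the smallest 5% of pooled variance estimates.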
If we use $$\hat{\sigma}_{ucl} = 111$$ to estimate the sample size, we get a more conservative sample size requirement (78) than if we used the point estimate $$\hat{\sigma} = 94$$ (where the sample size requirement was 56):

pwr.t.test(n = NULL, d = 50/ucl, sig.level = 0.05, power = 0.80, type = "two.sample")
##
## Two-sample t test power calculation
##
## n = 78
## d = 0.45
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group

Ultimately, using $$\gamma = 0.95$$ might be too conservative, in that it might lead to an excessively large sample size requirement. Browne's paper uses simulation to evaluate a range of $$\gamma$$'s, from 0.5 to 0.9, which I also do in the next section.

Simulation of different approaches

At this point, we need to generate multiple iterations to see how the various approaches perform over repeated pilot studies based on the same data generating process, rather than looking at a single instance as I did in the simulations above. As Browne does in his paper, I would like to evaluate the distribution of power estimates that arise from the various approaches. I compare using an external source or a minimally clinically meaningful effect size to estimate $$\Delta$$ (in the figures below, this is the column labeled 'truth') with using the effect size point estimate from the pilot (labeled pilot). I also compare using a point estimate of $$\sigma$$ from the pilot (where $$\gamma=0$$) with using the upper limit of a one-sided confidence interval defined by $$\gamma$$. In these simulations I compare three levels of $$\gamma$$: $$\gamma \in (0.5, 0.7, 0.9)$$.

In each of the simulations, I assume 30 subjects per arm, and evaluate true effect sizes of 30 and 75. In all cases, the true standard deviation $$\sigma = 100$$, so the true $$d$$ is 0.30 or 0.75. The box plots in the figure represent the distribution of power estimates for the larger RCT under different scenarios.
Each scenario was simulated 5000 times. Ideally, the power estimates should cluster close to 80%, the targeted level of power. In the figure, the percentage next to each box plot reports the percent of simulations with power estimates at or above the target of 80%.

Two things jump out at me. First, using the true effect size in the power calculation gives us a much better chance of designing an RCT with close to 80% power, even when a point estimate is used for $$\hat{\sigma}$$. In Browne's paper, the focus is on the fact that even when using the true effect size, there is a high probability of power falling below 80%. This may be the case, but it may be more important to note that when power is lower than the target, it is likely to fall relatively close to the 80% target. If the researcher is very concerned about falling below that threshold, using a $$\gamma$$ higher than 0.6 or 0.7 might provide an adequate cushion.

Second, using the effect size estimate from the pilot as the basis for an RCT power analysis appears risky. The box plots labeled pilot exhibit much more variation than the 'truth' box plots. As a result, there is a high probability that the true power will fall considerably below 80%. In many other cases, the true power will be unnecessarily large, because those studies have been designed to be bigger than they need to be. The situation improves somewhat with larger pilot studies, as shown below with 60 patients per arm, where the variation is reduced. Still, an argument can be made that using effect sizes from pilot studies is too risky, leading to an under-powered or over-powered study, neither of which is ideal.

A question remains: how best to determine the effect size for the power calculation if using the pilot estimate is risky? I think a principled approach, such as drawing effect size estimates from the existing literature or using clinically meaningful effect sizes, is a much better way to go. The pilot study should instead focus on the other important feasibility issues that can help improve the design of the RCT.

References:

Lancaster, G.A., Dodd, S. and Williamson, P.R., 2004. Design and analysis of pilot studies: recommendations for good practice. Journal of Evaluation in Clinical Practice, 10(2), pp.307-312.

Leon, A.C., Davis, L.L. and Kraemer, H.C., 2011. The role and interpretation of pilot studies in clinical research. Journal of Psychiatric Research, 45(5), pp.626-629.

Browne, R.H., 1995. On the use of a pilot sample for sample size determination. Statistics in Medicine, 14(17), pp.1933-1940.

Cojbasic, V. and Loncar, D., 2011. One-sided confidence intervals for population variances of skewed distributions. Journal of Statistical Planning and Inference, 141(5), pp.1667-1672.

Support: This research is supported by the National Institutes of Health National Institute on Aging U54AG063546. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.
Addendum

Below is the code I used to run the simulations and generate the plots. (Note: the original listing computed the pooled standard deviation but then referenced an undefined variable estsd; here the pooled SD is simply named estsd throughout so the code runs as intended.)

library(simstudy)
library(pwr)
library(data.table)
library(ggplot2)

getPower <- function(ssize, esize, gamma = 0, use.est = FALSE) {

  estring <- paste0("rx * ", esize)
  defd <- defDataAdd(varname = "y", formula = estring, variance = 100^2)

  N <- ssize * 2

  dd <- genData(n = N)
  dd <- trtAssign(dd, grpName = "rx")
  dd <- addColumns(defd, dd)

  lmfit <- lm(y ~ rx, data = dd)

  sd.rx <- dd[rx == 1, sd(y)]
  sd.ctl <- dd[rx == 0, sd(y)]
  estsd <- sqrt((sd.rx^2 + sd.ctl^2) / 2)

  qchi <- qchisq(gamma, df = N - 2, lower.tail = FALSE)
  ucl <- sqrt(((N - 2) * estsd^2) / qchi)

  p.sd <- estsd * (gamma == 0) + ucl * (gamma > 0)
  p.eff <- esize * (use.est == FALSE) + coef(lmfit)["rx"] * (use.est == TRUE)

  if (abs(p.eff / p.sd) < 0.0002) p.eff <- sign(p.eff) * 0.0002 * p.sd

  nstar <- round(pwr.t.test(n = NULL, d = p.eff / p.sd, sig.level = 0.05,
                            power = 0.80, type = "two.sample")$n, 0)

  power <- pwr.t.test(n = nstar, d = esize / 100, sig.level = 0.05,
                      type = "two.sample")

  return(data.table(ssize, esize, gamma, use.est,
                    estsd = estsd, ucl = ucl, nstar, power = power$power,
                    est = coef(lmfit)["rx"],
                    lcl.est = confint(lmfit)["rx", 1],
                    ucl.est = confint(lmfit)["rx", 2]))
}

dres <- data.table()

for (i in c(30, 60)) {
  for (j in c(30, 75)) {
    for (k in c(0, .5, .7)) {
      for (l in c(FALSE, TRUE)) {
        dd <- rbindlist(lapply(1:5000,
                function(x) getPower(ssize = i, esize = j, gamma = k, use.est = l)))
        dres <- rbind(dres, dd)
      }}}}

above80 <- dres[, .(x80 = mean(power >= 0.80)), keyby = .(ssize, esize, gamma, use.est)]
above80[, l80 := scales::percent(x80, accuracy = 1)]

g_labeller <- function(value) {
  paste("\U03B3", "=", value)  # unicode for gamma
}

e_labeller <- function(value) {
  paste("\U0394", "=", value)  # unicode for Delta
}

ggplot(data = dres[ssize == 30],
       aes(x = factor(use.est, labels = c("'truth'", "pilot")), y = power)) +
  geom_hline(yintercept = 0.8, color = "white") +
  geom_boxplot(outlier.shape = NA, fill = "#9ba1cf", width = .4) +
  theme(panel.grid = element_blank(),
        panel.background = element_rect(fill = "grey92"),
        axis.ticks = element_blank(),
        plot.title = element_text(size = 9, face = "bold")) +
  facet_grid(esize ~ gamma,
             labeller = labeller(gamma = g_labeller, esize = e_labeller)) +
  scale_x_discrete(name = "\n source of effect size used for power calculation") +
  scale_y_continuous(limits = c(0, 1), breaks = c(0, .8),
                     name = "distribution of power estimates \n") +
  ggtitle("Distribution of power estimates (n = 30 per treatment arm)") +
  geom_text(data = above80[ssize == 30], aes(label = l80),
            x = rep(c(0.63, 1.59), 6), y = 0.95, size = 2.5)

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

### Using R and H2O Isolation Forest For Data Quality

Mon, 11/11/2019 - 19:56

[This article was first published on R-Analytics, and kindly contributed to R-bloggers].

Introduction:

We will identify anomalous patterns in data. This process is useful not only for finding inconsistencies and errors, but also for finding abnormal data behavior, and it can even help detect cyber attacks on organizations. For more background, see this article: Data Quality and Anomaly Detection Thoughts For Web Analytics.

Before starting, we need the following software installed and working:

– R language installed.
– H2O open source framework.
– Java 8 ( For H2O ). Open JDK: https://github.com/ojdkbuild/contrib_jdk8u-ci/releases
– R studio.

About the data used in this article:

# I am using https://www.kaggle.com/bradklassen/pga-tour-20102018-data
# The version I have is not the most up-to-date version, but any newer version
# may be used.
# The file I am using is a 950 MB csv file with 9,720,530 records, including the header.
#
# One very important thing: instead of getting lost in more than 9 million records,
# we will be looking at just 158 records with anomalies for the analysed variable,
# so it is much easier to inspect the data this way.

Let's start coding:

# Loading libraries
suppressWarnings( suppressMessages( library( h2o ) ) )
# For interactive plotting
suppressWarnings( suppressMessages( library( dygraphs ) ) )
suppressWarnings( suppressMessages( library( dplyr ) ) )
suppressWarnings( suppressMessages( library( DT ) ) )

# Start a single-node instance of H2O using all available processor cores and reserve 5GB of memory
h2oServer = h2o.init( ip = "localhost", port = 54321, max_mem_size = "5g", nthreads = -1 )
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## /tmp/RtmpC1pHJS/h2o_ckassab_started_from_r.out
## /tmp/RtmpC1pHJS/h2o_ckassab_started_from_r.err
##
## Starting H2O JVM and connecting: . Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 seconds 395 milliseconds
## H2O cluster timezone: America/Mexico_City
## H2O data parsing timezone: UTC
## H2O cluster version: 3.26.0.6
## H2O cluster version age: 1 month and 8 days
## H2O cluster name: H2O_started_from_R_ckassab_aat507
## H2O cluster total nodes: 1
## H2O cluster total memory: 4.44 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 3.6.1 (2019-07-05)

h2o.removeAll()    # Removes all data from the h2o cluster, ensuring it is clean.
h2o.no_progress()  # Turn off progress bars for notebook readability

# Setting H2O timezone for proper date data type handling
#h2o.getTimezone() ===>>> UTC
#h2o.listTimezones() # We can see all H2O timezones
h2o.setTimezone("US/Central")
##  "US/Central"

# Note. I am using Ubuntu 19.10, using the /tmp directory.
# Every time I boot my computer, I need to copy the data file again to /tmp.

# Importing data file and setting data types accordingly.
allData = read.csv( "/tmp/PGA_Tour_Golf_Data_2019_Kaggle.csv", sep = ",", header = T )

# When using as.POSIXct, H2O is not importing the data, so we are using as.Date.
allData$Date = as.Date( allData$Date )
allData$Value = as.numeric(allData$Value)

# Convert dataset to H2O format.
allData_hex = as.h2o( allData )

# Build an Isolation Forest model
startTime <- Sys.time()
startTime
##  "2019-11-10 20:10:30 CST"

trainingModel = h2o.isolationForest( training_frame = allData_hex
                                     , sample_rate = 0.1
                                     , max_depth = 32
                                     , ntrees = 100
                                   )
## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Stopping tolerance is ignored for _stopping_rounds=0..

Sys.time()
##  "2019-11-10 20:20:15 CST"
Sys.time() - startTime
## Time difference of 9.756691 mins

# According to the H2O doc:
# http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/if.html
#
# Isolation Forest is similar in principle to Random Forest and is built on
# the basis of decision trees.
# Isolation Forest creates multiple decision trees to isolate observations.
#
# Trees are split randomly. The assumption is that:
#
# IF ONE UNIT'S MEASUREMENTS ARE SIMILAR TO OTHERS',
# IT WILL TAKE MORE RANDOM SPLITS TO ISOLATE IT.
#
# The fewer splits needed, the more likely the unit is to be anomalous.
#
# The average number of splits is then used as a score.

# Calculate score for all data.
startTime <- Sys.time()
startTime
##  "2019-11-10 20:20:15 CST"

score = h2o.predict( trainingModel, allData_hex )
result_pred = as.vector( score$predict )

Sys.time()
##  "2019-11-10 20:23:18 CST"
Sys.time() - startTime
## Time difference of 3.056829 mins

################################################################################
# Setting threshold value for anomaly detection.
################################################################################

# Setting desired threshold percentage.
threshold = .999 # Let's say we want the 0.1% of the data most different from the rest.

# Using this threshold to get score limit to filter data anomalies.
scoreLimit = round( quantile( result_pred, threshold ), 4 )

# Add row score at the beginning of dataset
allData = cbind( RowScore = round( result_pred, 4 ), allData )

# Get data anomalies by filtering all data.
anomalies = allData[ allData$RowScore > scoreLimit, ]

# As we can see in the summary:
summary(anomalies)
##     RowScore          Player.Name        Date
##  Min.   :0.9540  Jonas Blixt   : 231  Min.   :2019-07-07
##  1st Qu.:0.9565  Jordan Spieth : 231  1st Qu.:2019-08-25
##  Median :0.9614  Julian Etulain: 221  Median :2019-08-25
##  Mean   :0.9640  Johnson Wagner: 213  Mean   :2019-08-24
##  3rd Qu.:0.9701  John Chin     : 209  3rd Qu.:2019-08-25
##  Max.   :1.0000  Keegan Bradley: 209  Max.   :2019-08-25
##                  (Other)       :8325
##  Statistic
##  Club Head Speed                : 234
##  Driving Pct. 300-320 (Measured): 193
##  Carry Efficiency               : 163
##  First Tee Early Lowest Round   : 161
##  First Tee Late Lowest Round    : 160
##  GIR Percentage - 100+ yards    : 158
##  (Other)                        :8570
##  Variable
##  First Tee Early Lowest Round - (LOW RND)                : 103
##  First Tee Late Lowest Round - (LOW RND)                 :  96
##  First Tee Late Lowest Round - (ROUNDS)                  :  64
##  Driving Pct. 300-320 (Measured) - (TOTAL DRVS - OVERALL):  61
##  GIR Percentage - 175-200 yards - (%)                    :  61
##  First Tee Early Lowest Round - (ROUNDS)                 :  58
##  (Other)                                                 :9196
##  Value
##  Min.   :  1268
##  1st Qu.: 53058
##  Median : 87088
##  Mean   :111716
##  3rd Qu.:184278
##  Max.   :220583

# The Statistic: GIR Percentage - 100+ yards is one of the most important values.
# Filtering all anomalies within this Statistic value:
statisticFilter = "GIR Percentage - 100+ yards"

specificVar = anomalies %>% filter(Statistic==statisticFilter)

cat( statisticFilter,": ", dim(specificVar) )
## GIR Percentage - 100+ yards : 158

if( dim(specificVar)[1] > 0 ) {

# We want to know the relation between players and the filtered statistic.
# So, in order to get a chart, we assign a code to each player.
# Since factors in R are really integer values, we do this to get the codes:
specificVar$PlayerCode = as.integer(specificVar$Player.Name)

# To sort our dataset we convert the date to numeric
specificVar$DateAsNum = as.numeric( paste0( substr(specificVar$Date,1,4)
                                            , substr(specificVar$Date,6,7)
                                            , substr(specificVar$Date,9,10) ) )
# And sort the data frame.
specificVar = specificVar[order(specificVar$DateAsNum),]
# Set records num using a sequence.
rownames(specificVar) = seq_len(nrow(specificVar))

colNamesFinalTable = c( "PlayerCode", "Player.Name", "Date", "Variable", "Value" )
specificVar = specificVar[, colNamesFinalTable]
specificVar$PlayerCode = as.factor(specificVar$PlayerCode)

# Creating our final dataframe for our chart.
specificVarChartData = data.frame( SeqNum = as.integer( rownames(specificVar) )
, PlayerCode = specificVar$PlayerCode , Value = specificVar$Value
)

AnomaliesGraph = dygraph( specificVarChartData, main = ''
, xlab = paste(statisticFilter,"Anomaly Number."), ylab = "Player Code." ) %>%
dyAxis("y", label = "Player Code.") %>%
dyAxis("y2", label = "Value.", independentTicks = TRUE) %>%
dySeries( name = "PlayerCode", label = "Player Code.", drawPoints = TRUE, pointShape = "dot"
, color = "blue", pointSize = 2 ) %>%
dySeries( name = "Value", label = "Value.", drawPoints = TRUE, pointShape = "dot"
, color = "green", pointSize = 2, axis = 'y2' ) %>%
dyRangeSelector()
dyOptions( AnomaliesGraph, digitsAfterDecimal = 0 )
}
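The as.integer(factor) trick used above to build PlayerCode relies on factors being stored as integer codes of their (alphabetically sorted) levels. A minimal standalone sketch, with made-up player names rather than the real dataset:

```r
# Made-up player names, just to illustrate the factor-code mapping
players <- factor(c("Jim Furyk", "Jon Rahm", "Jim Furyk", "Justin Thomas"))

levels(players)      # levels are sorted: "Jim Furyk" "Jon Rahm" "Justin Thomas"
as.integer(players)  # 1 2 1 3 -- each name maps to the index of its level
```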

Sample chart with the anomalies found:

Sample data table with the anomalies found:

   PlayerCode  Player.Name       Date        Variable                                  Value
1  686         Josh Teater       2019-08-25  GIR Percentage - 100+ yards - (ROUNDS)   198471
2  655         Johnson Wagner    2019-08-25  GIR Percentage - 100+ yards - (ROUNDS)   186658
3  618         Jim Furyk         2019-08-25  GIR Percentage - 100+ yards - (ROUNDS)   198471
4  723         Keegan Bradley    2019-08-25  GIR Percentage - 100+ yards - (ROUNDS)   211362
5  213         Cameron Tringale  2019-08-25  GIR Percentage - 100+ yards - (ROUNDS)   198471
6  712         Justin Thomas     2019-08-25  GIR Percentage - 100+ yards - (ROUNDS)   199671
7  520         Hunter Mahan      2019-08-25  GIR Percentage - 100+ yards - (ROUNDS)   178096
8  587         Jason Day         2019-08-25  GIR Percentage - 100+ yards - (ROUNDS)   189755
9  539         J.J. Henry        2019-08-25  GIR Percentage - 100+ yards - (ROUNDS)   177431
10 657         Jon Rahm          2019-08-25  GIR Percentage - 100+ yards - (ROUNDS)   199671

(Showing 1 to 10 of 158 entries.)

Here is the code on GitHub, including the full functional HTML demo:
https://github.com/LaranIkal/DataQuality

Enjoy it!!!


### Free Training: Mastering Data Structures in R

Mon, 11/11/2019 - 18:00

Next week I will be delivering a free online R training. This is a new course I’ve created called Mastering Data Structures in R. This course is for you if:

• You are new to R, and want a rigorous introduction to R as a programming language
• You know how to analyze data in R, but want to take the next step and learn to program in it too
• You already know R, but find yourself frequently confused by its “idiosyncrasies”

R as a Tool for Data Analysis

When clients reach out to me for training, they normally want help learning R as a tool for Data Analysis. Most of my students are already experts at Excel. They want to learn R because they have heard that it is more powerful than Excel. For these clients I normally propose a two-day course that focuses on the Tidyverse and R Markdown:

• Day 1: learn to use ggplot2 and dplyr on datasets that have already been cleaned.
• Day 2: in the morning, learn to import and clean data using the Tidyverse. In the afternoon, learn to use R Markdown to do reproducible research.

In general, this sort of course helps people learn to use R to do much of what they were already doing in Excel.

R as a Programming Language

The biggest problem with my 2-day workshop is this: while I intend it to be a starting point for learning R, many of my students think that it’s all that they need to know! In fact, though, I only consider it to be a starting point.

For example, when I was working at an online Real Estate company, I needed to analyze our website’s sales lead data. I started by using R as a data analysis tool. I used packages like ggplot2 to explore the ebb and flow of our sales leads over time for each metropolitan area. But I eventually hit a wall. What I really wanted to do was map our data at the ZIP code level, and mash it up with data from the Census Bureau. Of course, no package existed to do this: it’s too specific. But since I knew R as a programming language, I was able to create functions (and eventually a package) to answer the exact question I had.

And this is the level that I want all my students to get to. Every analyst has a unique problem. Learning to use R as a programming language allows you to answer the exact question you have.

Why Data Structures?

Most of my students struggle to learn the “programming language” aspect of R because they never formally studied Computer Science. I decided to address this by creating a course that resembles an introductory Computer Science course but uses R as the language of instruction.

My undergraduate Computer Science curriculum focused on Data Structures and Algorithms. This is why Mastering Data Structures in R provides a rigorous introduction to the basic data structures in R. The course contains dozens of exercises, and will increase your fluency at the console.

I recently gave a pilot version of this course that was well received. To celebrate, I will be running the course online, for free, next week. Here is the syllabus:

• Monday 11/18: Data Types
• Tuesday 11/19: Vectors
• Wednesday 11/20: Factors and Lists
• Friday 11/21: Data Frames

All sessions will start at 10am PT and last approximately 90 minutes.

If you are interested in the course but cannot attend live, then you should still register: I will be sending recordings to everyone who registers.

The post Free Training: Mastering Data Structures in R appeared first on AriLamstein.com.


### Scraping Machinery Parts

Mon, 11/11/2019 - 03:00

I’ve been exploring the feasibility of aggregating data on prices of replacement parts for heavy machinery. There are a number of websites which list this sort of data. I’m focusing on the static sites for the moment.

I’m using R with {rvest} (and a few other Tidyverse packages thrown in for good measure).

library(glue)
library(dplyr)
library(purrr)
library(stringr)
library(rvest)

The data are paginated. Fortunately the URL includes the page number as a GET parameter, so stepping through the pages is simple. I defined a global variable, URL, with a {glue} placeholder for the page number.

This is how I’m looping over the pages. The page number, page, is set to one initially and incremented after each page of results is scraped. When it gets to a page without any results, the loop stops.

page = 1
while (TRUE) {
  items <- read_html(glue(URL)) %>% html_nodes(SELECTOR)

  if (length(items) == 0) break  # Check if gone past last page.

  # Extract data for each item here...

  page <- page + 1  # Advance to next page.
}

That’s the mechanics. Within the loop I then used map_dfr() from {purrr} to iterate over items, delving into each item to extract its name and price.

map_dfr(items, function(item) {
  tibble(
    part = item %>% html_node("p") %>% html_text(),
    price = item %>% html_node(".product-item--price small") %>% html_text()
  )
})

The results from each page are appended to a list and finally concatenated using bind_rows().
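The accumulate-then-combine pattern is independent of the scraping itself. A minimal sketch, with tibbles of invented data standing in for the per-page results:

```r
library(dplyr)

pages <- list()
for (page in 1:3) {
  # In the real scraper this would be the map_dfr() result for one page
  pages[[page]] <- tibble(part = paste0("part-", page), price = page * 100)
}

# One data frame with all pages stacked in order
parts <- bind_rows(pages)
dim(parts)  # 3 rows, 2 columns
```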

> dim(parts)
[1] 987   2

Scraping a single category yields 987 parts. Let’s take a look at the first few.

> head(parts)
  part                                                                          price
  <chr>                                                                         <chr>
1 R986110000 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 240-7732 $1,534.00
2 R986110001 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 213-5268 $1,854.00
3 R986110002 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 266-8034 $1,374.00
4 R986110003 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 296-6728 $1,754.00
5 R986110004 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 136-8869 $1,494.00
6 R986110005 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 255-6805 $1,534.00

That’s looking pretty good already. There’s one final niggle: the data in the price column are strings. Ideally we’d want those to be numeric. But to do that we have to strip off some punctuation. Not a problem thanks to functions from {stringr}.

parts <- parts %>%
  mutate(
    price = str_replace(price, "^\\$", ""),   # Strip off leading "$"
    price = str_replace_all(price, ",", ""),  # Strip out comma separators
    price = as.numeric(price)
  )

Success!

> head(parts)
  part                                                                          price
  <chr>                                                                         <dbl>
1 R986110000 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 240-7732  1534
2 R986110001 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 213-5268  1854
3 R986110002 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 266-8034  1374
4 R986110003 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 296-6728  1754
5 R986110004 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 136-8869  1494
6 R986110005 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 255-6805  1534

The great thing about scripting a scraper is that, provided that the website has not been dramatically restructured, the scraper can be run at any time to gather an updated set of results.


### Statistical uncertainty with R and pdqr

Mon, 11/11/2019 - 01:00

CRAN has accepted my ‘pdqr’ package. Here are important examples of how it can be used to describe and evaluate statistical uncertainty.

Prologue

I am glad to announce that my latest, long-in-the-works R package ‘pdqr’ has been accepted to CRAN. It provides tools for creating, transforming, and summarizing custom random variables with distribution functions (like base R ‘p*()’, ‘d*()’, ‘q*()’, and ‘r*()’ functions). You can read a brief overview in one of my previous posts.

We will need the following setup:

library(pdqr)
library(magrittr)

# For the sake of reproducibility
set.seed(20191111)

Statistical uncertainty

General description

Statistical estimation usually has the following setup. There is a sample (an observed, usually randomly chosen, set of values of measurable quantities) from some general population (the whole set of values of the same measurable quantities). We need to make conclusions about the general population based on the sample. This is done by computing summary values (called statistics) of the sample and making reasonable assumptions (a process usually called inference) about how close these values are to the values that could potentially be computed from the whole general population. Thus, a summary value based on a sample (a sample statistic) is an estimate of the potential summary value based on the general population (the true value).

How can we make inference about the quality of this estimation? This question itself describes statistical uncertainty and can be unfolded into a deep philosophical question about probability, nature, and life in general. Basically, the answer depends on the assumptions about the relation between the sample, the general population, and the statistic.

For me, the most beautiful inferential approach is bootstrap. It has the following key assumption: the process of producing samples from the general population can be simulated by doing random sampling with replacement from the present sample. In other words, we agree (and practice often agrees with us) that random sampling with replacement from the current sample (sometimes called bootstrap sampling) has “close enough” behavior to the “true nature” of how the initial sample was created. Numerical estimation of “how close” is also an interesting problem, but it is a more complicated topic.

Computation with pdqr

The natural way of computing bootstrap quantities is straightforward: produce $$B$$ random bootstrap samples, for each one compute the value of the statistic in question, and summarize the resulting sample of statistic values with a numerical quantity (usually with some center and spread values).
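This recipe is easy to hand-roll in base R; here is a minimal sketch for the sample mean (the sample itself and B = 1000 are arbitrary choices for illustration):

```r
set.seed(42)
x <- rnorm(30, mean = 10)  # stand-in for an observed sample

B <- 1000
# For each bootstrap iteration: resample with replacement, compute the statistic
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

# Summarize the distribution of the statistic with center and spread values
c(center = mean(boot_means), spread = sd(boot_means))
```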

There are many ways of performing bootstrap in R, like boot::boot(), rsample::bootstraps(), and others. In turn, ‘pdqr’ offers its own way of describing and doing bootstrap inference for one-dimensional numeric sample(s):

• Create a random variable (in the form of pdqr-function with new_*() family) based on initial sample. This random variable already describes a general population with “bootstrap assumption”: it will produce values based on initial sample. Type of this variable determines the type of bootstrap:
• Type "discrete" describes ordinary bootstrap. Only values from initial sample can be produced.
• Type "continuous" describes smooth bootstrap. Initial sample is smoothed by doing kernel density estimation with density() function and random variable produces values from distribution with that density.
• Transform created random variable into one that produces statistic values obtained with bootstrap. Sometimes this can be done with basic mathematical operations like +, min, etc. But usually this is done with form_estimate() function: it creates many (10000 by default) bootstrap samples, for each computes statistic value, and creates its own random variable in the form of pdqr-function (class and type are preserved from supplied random variable, but this can be adjusted). It needs at least three arguments:
• f: pdqr-function representing random variable. In described setup it is created as a result of “Create” step.
• stat: statistic function that accepts numeric vector of size sample_size and returns single numeric or logical output.
• sample_size: Size of a sample that each bootstrap draw should produce. In described setup it should be equal to number of elements in initial sample.
• Summarize distribution of statistic. Usually this is point measure of center or spread, or interval.
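For intuition about the two types, the difference can be mimicked in base R: an ordinary bootstrap draw only redraws observed values, while a smooth bootstrap draw perturbs them with kernel noise. Here Gaussian noise with the bandwidth chosen by density() is used as a rough stand-in; this is not ‘pdqr’’s actual implementation, just a sketch of the idea:

```r
set.seed(101)
x <- mtcars$mpg

# Ordinary bootstrap draw: every value comes from the original sample
ordinary_draw <- sample(x, size = length(x), replace = TRUE)
all(ordinary_draw %in% x)  # TRUE

# Smooth bootstrap draw: resampled values get jittered, so new values appear
bw <- density(x)$bw
smooth_draw <- sample(x, size = length(x), replace = TRUE) +
  rnorm(length(x), sd = bw)
```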
Example 1: single numerical estimate

The mean value of the ‘mpg’ variable in the mtcars dataset is 20.090625. However, having statistical uncertainty in mind, we can ask how precise this estimate is. This can, and should, be reformulated as the following question: if we repeat sampling sets of 32 cars from the general population of all cars, how close will their ‘mpg’ sample means be to each other? This can be answered by computing the bootstrap distribution of sample means (the pipe %>% function from the ‘magrittr’ package is used to simplify notation):

# Using ordinary bootstrap
d_mpg_dis_mean <- mtcars$mpg %>%
  new_d(type = "discrete") %>%
  form_estimate(stat = mean, sample_size = nrow(mtcars))

# Spread of this bootstrap distribution describes the precision of estimation:
# bigger values indicate lower precision
summ_sd(d_mpg_dis_mean)
## [1] 1.04067

# This discrete distribution has the following d-function
plot(
  d_mpg_dis_mean,
  main = "Ordinary bootstrap distribution of 'mpg' sample mean"
)

If the modeling assumption about the continuous nature of the ‘mpg’ variable is reasonable (and it seems so), you can use “smooth bootstrap” by changing the type of the initial pdqr-function:

# Using smooth bootstrap with type = "continuous"
d_mpg_con_mean <- mtcars$mpg %>%
  new_d(type = "continuous") %>%
  form_estimate(stat = mean, sample_size = nrow(mtcars))

# Spread is higher in this case because kernel density estimation with
# density() function extends support during creation of pdqr-function on the
# bootstrap step
summ_sd(d_mpg_con_mean)
## [1] 1.153957

plot(
  d_mpg_con_mean,
  main = "Smooth bootstrap distribution of 'mpg' sample mean"
)

One can also do ordinary bootstrap but represent bootstrap distribution of sample mean with continuous random variable:

# Using ordinary bootstrap, but treating sample mean as continuous
d_mpg_con_mean_2 <- mtcars$mpg %>%
  new_d(type = "discrete") %>%
  form_estimate(
    stat = mean, sample_size = nrow(mtcars),
    # Create continuous pdqr-function from bootstrap sample means
    args_new = list(type = "continuous")
  )

summ_sd(d_mpg_con_mean_2)
## [1] 1.063524

plot(
  d_mpg_con_mean_2,
  main = "Ordinary bootstrap distribution of 'mpg' continuous sample mean"
)

In this case, the sample mean has a standard deviation from 1.04067 to 1.1539572 (depending on assumptions about the data generating process).

Example 2: single logical estimate

The share of 4-cylinder cars in mtcars is equal to 0.34375. However, it might happen that we don’t care about the actual value, but only about whether it is bigger than 0.3 or not. In the present data it is bigger, but how sure can we be about that? In other words: if we repeat sampling sets of 32 cars from the general population of all cars, in which part of them will the share of 4-cylinder cars be bigger than 0.3? Here is the way of computing that with ‘pdqr’:

# If statistic returns logical value (indicating presence of some feature in
# sample), output estimate pdqr-function is "boolean": "discrete" type function
# with elements being exactly 0 (indicating FALSE) and 1 (indicating TRUE).
d_cyl_lgl <- mtcars$cyl %>%
  new_d(type = "discrete") %>%
  form_estimate(
    stat = function(x) {mean(x == 4) > 0.3},
    sample_size = nrow(mtcars)
  )

d_cyl_lgl
## Probability mass function of discrete type
## Support: [0, 1] (2 elements, probability of 1: 0.7113)

# To extract certain probability from boolean pdqr-function, use
# summ_prob_*() functions
summ_prob_true(d_cyl_lgl)
## [1] 0.7113
summ_prob_false(d_cyl_lgl)
## [1] 0.2887

In this case, the estimated probability that the share of 4-cylinder cars in the general population is more than 0.3 is 0.7113.
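As a rough sanity check, the same probability can be approximated with a hand-rolled bootstrap. Being a Monte Carlo estimate, it will land near, but not exactly at, the 0.7113 above:

```r
set.seed(20191111)
cyl <- mtcars$cyl

B <- 10000
# TRUE whenever a bootstrap sample has more than 30% 4-cylinder cars
hits <- replicate(B, {
  s <- sample(cyl, size = length(cyl), replace = TRUE)
  mean(s == 4) > 0.3
})
mean(hits)  # roughly 0.71
```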

Example 3: comparison of estimates

In mtcars there are 19 cars with automatic transmission (‘am’ variable is 0) and 13 with manual (‘am’ variable is 1). We might be concerned with the following question: are cars with automatic transmission heavier than cars with manual transmission? This is an example of a question where reformulation is crucial, because different reformulations lead to completely different methodologies. Basically, it is all about dealing with statistical uncertainty and how to measure that one numerical set is bigger than the other.

First, rather verbose, way of expanding this question is this one: if we randomly choose a car with automatic transmission (uniformly on set of all cars with automatic transmission) and a car with manual (uniformly on set of all cars with manual transmission), what is the probability that weight of the first one is bigger than the second one?. With ‘pdqr’ this can be computed straightforwardly by comparing two random variables (which is implemented exactly like the question above; read more here):

# Seems reasonable to treat weight as continuous random variable. Note that this
# means use of kernel density estimation, which can lead to random variable that
# returns negative values. As weight can be only positive, it is a good idea to
# ensure that. Package 'pdqr' has form_resupport() function for that.
d_wt_am0 <- mtcars$wt[mtcars$am == 0] %>%
  new_d(type = "continuous") %>%
  # Ensure that returned values are only positive
  form_resupport(c(0, NA))

d_wt_am1 <- mtcars$wt[mtcars$am == 1] %>%
  new_d(type = "continuous") %>%
  form_resupport(c(0, NA))

# Comparing two pdqr-functions with >= results into boolean pdqr-function
summ_prob_true(d_wt_am0 >= d_wt_am1)
## [1] 0.9209063

So in this case the answer is that the probability of an “automatic” car being heavier than a “manual” one is around 0.921.
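The first formulation also has a direct empirical counterpart on the raw (non-smoothed) weights: compare every “automatic” weight with every “manual” weight and take the share of pairs where the first is at least as big. This sketch computes that exact proportion; it will differ somewhat from the smoothed ‘pdqr’ answer:

```r
wt_am0 <- mtcars$wt[mtcars$am == 0]  # automatic transmission
wt_am1 <- mtcars$wt[mtcars$am == 1]  # manual transmission

# Proportion of (automatic, manual) pairs where the automatic car is heavier
mean(outer(wt_am0, wt_am1, ">="))
```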

The second way of understanding the comparison question is the following: is the average weight of “automatic” cars bigger than that of “manual” ones? This type of question is more widespread in statistical practice. To deal with statistical uncertainty, it should be reformulated: if we repeat sampling (in parallel pairs) sets of 19 “automatic” cars and of 13 “manual” cars, in which part of the pairs will the mean weight of the “automatic” cars be bigger? This question implies creating bootstrap distributions of sample means for “automatic” and “manual” cars and then comparing them:

d_wt_am0_mean <- d_wt_am0 %>%
  form_estimate(stat = mean, sample_size = sum(mtcars$am == 0)) %>%
  # Ensure "positiveness" of random variable
  form_resupport(c(0, NA))

d_wt_am1_mean <- d_wt_am1 %>%
  form_estimate(stat = mean, sample_size = sum(mtcars$am == 1)) %>%
  form_resupport(c(0, NA))

# Comparing two random variables representing sample means
summ_prob_true(d_wt_am0_mean >= d_wt_am1_mean)
## [1] 1

So in this case the answer is that the probability of the mean weight of “automatic” cars being bigger than that of “manual” ones is 1.

The two computations can have decisively different outcomes. If a researcher uses a standard 0.95 rule, the first variant would imply that the conclusion ‘“automatic” cars are heavier than “manual” ones’ isn’t significant, while the second would imply otherwise.

Epilogue
• Basic knowledge about statistical uncertainty is crucial to understand the process of statistical inference.
• One of the most popular methodologies for doing statistical inference is bootstrap. There are at least two kinds of it: ordinary and smooth.
• Package ‘pdqr’ offers extensive functionality for describing and estimating statistical uncertainty. Core functions here are new_*() family, form_estimate(), and comparison operators.
sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
##  LC_CTYPE=ru_UA.UTF-8       LC_NUMERIC=C
##  LC_TIME=ru_UA.UTF-8        LC_COLLATE=ru_UA.UTF-8
##  LC_MONETARY=ru_UA.UTF-8    LC_MESSAGES=ru_UA.UTF-8
##  LC_PAPER=ru_UA.UTF-8       LC_NAME=C
##  LC_ADDRESS=C               LC_TELEPHONE=C
##  LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
##  stats graphics grDevices utils datasets methods base
##
## other attached packages:
##  magrittr_1.5 pdqr_0.2.0
##
## loaded via a namespace (and not attached):
##  Rcpp_1.0.2      bookdown_0.13   crayon_1.3.4    digest_0.6.21
##  evaluate_0.14   blogdown_0.15   pillar_1.4.2    rlang_0.4.0
##  stringi_1.4.3   rmarkdown_1.15  tools_3.6.1     stringr_1.4.0
##  xfun_0.9        yaml_2.2.0      compiler_3.6.1  htmltools_0.3.6
##  knitr_1.25
