
On the Road to 0.8.0 — Some Additional New Features Coming in the sergeant Package

Wed, 01/09/2019 - 20:32

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

It was probably not difficult to discern from my previous Drill-themed post that I’m fairly excited about the Apache Drill 1.15.0 release. I’ve rounded out most of the existing corners for it in preparation for a long-overdue CRAN update and have been concentrating on two helper features: configuring & launching Drill embedded Docker containers and auto-generation of Drill CTAS queries.

Drill Docker Goodness

Starting with version 1.14.0, Apache provides Drill Docker images for use in experimenting/testing/building-off-of. They run Drill in single-node standalone mode, so you're not going to be running this in "production" (unless you have light or just personal workloads). Docker is a great way to get to know Drill if you haven't already played with it, since you don't have to do much except run the Docker image.

I’ve simplified this even more thanks to @rgfitzjohn’s most excellent stevedore package which adds a robust R wrapper to the Docker client without relying on any heavy external dependencies such as reticulate. The new drill_up() function will auto-fetch the latest Drill image and launch a container so you can have a running Drill instance with virtually no effort on your part.

Just running the vanilla image isn't enough, since your goal is likely to do more than work with the built-in cp data source. The default container launch scenario also doesn't hook up any local filesystem paths to the container, so you really can't do much other than cp-oriented queries. Rather than make you figure out the right Docker command-line arguments and manually configure a workspace that points to a local filesystem area in the Drill web admin GUI, the drill_up() function provides a data_dir argument (which defaults to the getwd() of your R session) that auto-wires that path into the container and creates a dfs.d workspace pointing to it for you. Here's a sample execution:

library(sergeant)
library(tidyverse)

dr <- drill_up(data_dir = "~/Data")
## Drill container started. Waiting for the service to become active (this may take up to 30s).
## Drill container ID: f02a11b50e1647e44c4e233799180da3e907c8aa27900f192b5fd72acfa67ec0

You can use dr$stop() to stop the container or use the printed container id to do it from the command line.

We’ll use this containerized Drill instance with the next feature but I need to thank @cboettig for the suggestion to make an auto-downloader-runner-thingy before doing that. (Thank you @cboettig!)

Taking the Tedium out of CTAS

@dseverski, an intrepid R, Drill & sergeant user, noticed some new package behavior with Drill 1.15.0 that ended up spawning a new feature: automatic generation of Drill CTAS statements.

Prior to 1.14.0, sergeant had no way to accurately and precisely tell the data types of the columns coming back, since the REST API didn't provide them (as noted in the previous Drill post). Now, it did rely on the JSON types to create the initial data frames, but it also did something kinda horribad: it ran readr::type_convert() on the result sets. Said operation had the singular benefit of auto-converting CSV/CSVH/TSV/PSV/etc. data to something sane without having to worry about writing lengthy CTAS queries (at the expense of potentially confusing everyone, though that didn't seem to happen).

With 1.15.0, the readr::type_convert() crutch is gone, which results in less-than-helpful things like this when you have delimiter-separated values data:

# using the Drill container we just started above
write_csv(nycflights13::flights, "~/Data/flights.csvh")

con <- src_drill("localhost")

tbl(con, "dfs.d.`flights.csvh`") %>%
  glimpse()
## Observations: ??
## Variables: 19
## Database: DrillConnection
## $ year           "2013", "2013", "2013", "2013", "2013", "2013", "2013", "2013…
## $ month          "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…
## $ day            "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…
## $ dep_time       "517", "533", "542", "544", "554", "554", "555", "557", "557"…
## $ sched_dep_time "515", "529", "540", "545", "600", "558", "600", "600", "600"…
## $ dep_delay      "2", "4", "2", "-1", "-6", "-4", "-5", "-3", "-3", "-2", "-2"…
## $ arr_time       "830", "850", "923", "1004", "812", "740", "913", "709", "838…
## $ sched_arr_time "819", "830", "850", "1022", "837", "728", "854", "723", "846…
## $ arr_delay      "11", "20", "33", "-18", "-25", "12", "19", "-14", "-8", "8",…
## $ carrier        "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "…
## $ flight         "1545", "1714", "1141", "725", "461", "1696", "507", "5708", …
## $ tailnum        "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "…
## $ origin         "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK"…
## $ dest           "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO"…
## $ air_time       "227", "227", "160", "183", "116", "150", "158", "53", "140",…
## $ distance       "1400", "1416", "1089", "1576", "762", "719", "1065", "229", …
## $ hour           "5", "5", "5", "5", "6", "5", "6", "6", "6", "6", "6", "6", "…
## $ minute         "15", "29", "40", "45", "0", "58", "0", "0", "0", "0", "0", "…
## $ time_hour      "2013-01-01T10:00:00Z", "2013-01-01T10:00:00Z", "2013-01-01T1…

So the package does what it finally should have been doing all along. But, as noted, that’s not great if you just wanted to quickly work with a directory of CSV files. In theory, you’re supposed to use Drill’s CREATE TABLE AS then do a bunch of CASTS and TO_s to get proper data types. But who has time for that?

David had a stellar idea: might sergeant be able to automagically create CTAS statements from a query? Yes. Yes, it just might be able to do that, with the new ctas_profile() function.

Let’s pipe the previous tbl() into ctas_profile() and see what we get:

tbl(con, "dfs.d.`flights.csvh`") %>%
  ctas_profile() %>%
  cat()
-- ** Created by ctas_profile() in the R sergeant package, version 0.8.0 **

CREATE TABLE CHANGE____ME AS
SELECT
  CAST(`year` AS DOUBLE) AS `year`,
  CAST(`month` AS DOUBLE) AS `month`,
  CAST(`day` AS DOUBLE) AS `day`,
  CAST(`dep_time` AS DOUBLE) AS `dep_time`,
  CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`,
  CAST(`dep_delay` AS DOUBLE) AS `dep_delay`,
  CAST(`arr_time` AS DOUBLE) AS `arr_time`,
  CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`,
  CAST(`arr_delay` AS DOUBLE) AS `arr_delay`,
  CAST(`carrier` AS VARCHAR) AS `carrier`,
  CAST(`flight` AS DOUBLE) AS `flight`,
  CAST(`tailnum` AS VARCHAR) AS `tailnum`,
  CAST(`origin` AS VARCHAR) AS `origin`,
  CAST(`dest` AS VARCHAR) AS `dest`,
  CAST(`air_time` AS DOUBLE) AS `air_time`,
  CAST(`distance` AS DOUBLE) AS `distance`,
  CAST(`hour` AS DOUBLE) AS `hour`,
  CAST(`minute` AS DOUBLE) AS `minute`,
  TO_TIMESTAMP(`time_hour`, 'FORMATSTRING') AS `time_hour` -- *NOTE* You need to specify the format string. Sample character data is: [2013-01-01T10:00:00Z].
FROM (SELECT * FROM dfs.d.`flights.csvh`)

-- TIMESTAMP and/or DATE columns were detected. Drill's date/time format string reference can be found at:
--
--

There’s a parameter for the new table name which will cause the CHANGE____ME to go away, and when the function finds TIMESTAMP or DATE fields it knows to switch to their TO_ cousins and gives sample data with a reminder that you need to supply a format string (I’ll eventually auto-generate them unless someone PRs it first). And, since nobody but Java programmers remembers Joda format strings (they’re different from what you’re used to), it provides a handy link to them if it detects the presence of those column types.
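For example, something along these lines should fill that in for you (I'm assuming the argument is called new_table_name; check ?ctas_profile for the exact name):

# pass the target table name so CHANGE____ME is replaced in the generated CTAS
tbl(con, "dfs.d.`flights.csvh`") %>%
  ctas_profile(new_table_name = "dfs.tmp.`flights_typed`") %>%
  cat()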

Now, we don’t need to actually create a new table (though converting a bunch of CSVs to Parquet is likely a good idea for performance reasons) to use that output. We can pass most of that new query right to tbl():

tbl(con, sql("
SELECT
  CAST(`year` AS DOUBLE) AS `year`,
  CAST(`month` AS DOUBLE) AS `month`,
  CAST(`day` AS DOUBLE) AS `day`,
  CAST(`dep_time` AS DOUBLE) AS `dep_time`,
  CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`,
  CAST(`dep_delay` AS DOUBLE) AS `dep_delay`,
  CAST(`arr_time` AS DOUBLE) AS `arr_time`,
  CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`,
  CAST(`arr_delay` AS DOUBLE) AS `arr_delay`,
  CAST(`carrier` AS VARCHAR) AS `carrier`,
  CAST(`flight` AS DOUBLE) AS `flight`,
  CAST(`tailnum` AS VARCHAR) AS `tailnum`,
  CAST(`origin` AS VARCHAR) AS `origin`,
  CAST(`dest` AS VARCHAR) AS `dest`,
  CAST(`air_time` AS DOUBLE) AS `air_time`,
  CAST(`distance` AS DOUBLE) AS `distance`,
  CAST(`hour` AS DOUBLE) AS `hour`,
  CAST(`minute` AS DOUBLE) AS `minute`,
  TO_TIMESTAMP(`time_hour`, 'yyyy-MM-dd''T''HH:mm:ssZ') AS `time_hour` -- [2013-01-01T10:00:00Z].
FROM (SELECT * FROM dfs.d.`flights.csvh`)
")) %>%
  glimpse()
## Observations: ??
## Variables: 19
## Database: DrillConnection
## $ year           2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 5…
## $ sched_dep_time 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 6…
## $ dep_delay      2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -…
## $ arr_time       830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, …
## $ sched_arr_time 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, …
## $ arr_delay      11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31,…
## $ carrier        "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "…
## $ flight         1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71,…
## $ tailnum        "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "…
## $ origin         "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK"…
## $ dest           "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO"…
## $ air_time       227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, 34…
## $ distance       1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028, …
## $ hour           5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6…
## $ minute         15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, 0…
## $ time_hour      2013-01-01 10:00:00, 2013-01-01 10:00:00, 2013-01-01 10:00:0…

Ahhhh… Useful data types. (And, see what I mean about that daft format string? Also, WP is mangling the format string so add a comment if you need the actual string.)

FIN

As you can see, questions, suggestions (and PRs!) are welcome and heeded on your social-coding platform of choice (though y’all still seem to be stuck on GH).

NOTE: I’ll be subbing out most install_github() links in READMEs and future blog posts for install_git() counterparts pointing to my sr.ht repos (as I co-locate/migrate them there).

You can play with the new 0.8.0 features via devtools::install_git("https://git.sr.ht/~hrbrmstr/sergeant", ref="0.8.0").


New Yorkers Feeling the Holiday Spirit? Here’s Your Tip

Wed, 01/09/2019 - 17:53

(This article was first published on R – Pivot Billions, and kindly contributed to R-bloggers)

Combining Pivot Billions with R to dive into whether the holiday spirit inspires bigger tips and which parts of New York experience this effect the most.

The holiday season brings with it a degree of cheer and joy that many claim makes people act friendlier towards each other. I wanted to see how this effect translates to action, so I decided to look into tips for New York green taxis both during the holiday season and during the rest of the year. To start, I streamed all of the green taxi data files from the public NYC Taxi and Limousine Commission Trip Record Data for 2017-07-01 to 2018-06-30 (the most recent year of green taxi data) into Pivot Billions and enhanced the data with two new columns: holidayseason and tip_percent.

There were many rows that weren’t relevant to this analysis since cash payments did not have records of tips, so I filtered out cash payments via the payment_type column in Pivot Billions, bringing the total down to ~5 million rows.
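In dplyr terms, the preparation and filtering steps look roughly like this (a sketch only: the holiday window and the exact column logic are assumptions, since the actual derivation was done inside Pivot Billions):

# a rough dplyr equivalent of the Pivot Billions preparation step; `trips` is a
# data frame of raw green-taxi records with TLC column names, and the holiday
# window (December) is an assumption
library(dplyr)
library(lubridate)

trips_prepped <- trips %>%
  filter(payment_type != 2, fare_amount > 0) %>%   # 2 = cash in the TLC codebook (no tips recorded)
  mutate(pickup        = ymd_hms(lpep_pickup_datetime),
         holidayseason = as.integer(month(pickup) == 12),
         tip_percent   = 100 * tip_amount / fare_amount)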

To dive into the data I made use of Pivot Billions’ pivot feature to quickly reorganize all of this filtered data by where the passenger(s) were dropped off (DOLocationID) and whether the trip occurred during the holiday season. My over 9 million original rows of data were now shrunk down to a much more manageable 513-row detailed summary. Downloading this new view of the data from Pivot Billions, I switched my focus to visualizing and analyzing the data in R.

Now that the data was shrunk down to a size R can easily handle, I loaded the Taxi Zone Shapefile and my newly downloaded DOLocationID_holiday_tips.csv file into R. This was a simple process of uploading the shapefile from our data source as well as our Pivot Billions-processed file onto my machine running R and then joining them by setting LocationID equal to DOLocationID.

After quickly defining a new metric called the “Holiday Effect”, which tracks the percentage difference in average tips between the holiday season and the rest of the year, and adding some additional information to make the data informative and explorable, I was left with a very clear and powerful visualization of the green taxi data.
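A minimal sketch of those R steps is shown below (using sf and ggplot2; the shapefile path and the mean_tip_percent column name of the downloaded summary are assumptions, while DOLocationID, holidayseason and LocationID come from the data described above):

# join the Taxi Zone shapefile to the Pivot Billions summary and map the Holiday Effect
library(sf)
library(dplyr)
library(ggplot2)

zones <- st_read("taxi_zones/taxi_zones.shp")        # Taxi Zone Shapefile (adjust the path)
tips  <- read.csv("DOLocationID_holiday_tips.csv")   # Pivot Billions summary (DOLocationID x holidayseason)

# "Holiday Effect": % difference in average tips, holiday season vs. rest of the year
holiday_effect <- tips %>%
  group_by(DOLocationID) %>%
  summarise(effect = 100 * (mean_tip_percent[holidayseason == 1] -
                            mean_tip_percent[holidayseason == 0]) /
                            mean_tip_percent[holidayseason == 0])

zones %>%
  left_join(holiday_effect, by = c("LocationID" = "DOLocationID")) %>%
  ggplot() +
  geom_sf(aes(fill = effect)) +
  scale_fill_gradient2(low = "darkorange", mid = "white", high = "forestgreen", midpoint = 0) +
  labs(fill = "Holiday effect (%)")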

It is immediately clear that there are regions with a much greater occurrence of positive holiday effects (green areas) than negative effects (orange areas) as well as the reverse. Utilizing R’s powerful indexing abilities it’s easy to narrow down the data by location and explore which areas of New York experience the effect the most. It appears that Bronx and Brooklyn experience more negative effects whereas Queens is evenly spread between positive and negative. However, Manhattan and Newark Airport have a much higher proportion of positive effects due to the holiday season.

Though most of New York is being affected by the holidays for better or worse, people going to Manhattan and Newark Airport seem to be feeling the holiday spirit the most.

To create this visualization yourself you can download my R code, DOLocationID_holiday_tips.csv, and the Public Data’s Shapefile. You can also run this code replacing “DOLocationID_holiday_tips.csv” with “PULocationID_holiday_tips.csv” and DOLocationID with PULocationID to view the holiday effect on tips by Pick-Up Location.


Animating Data Transformations: Part II

Wed, 01/09/2019 - 17:40

(This article was first published on R Tutorials – Omni Analytics Group, and kindly contributed to R-bloggers)

In our previous series on Animating Data Transformations, we showed you how to use gganimate to construct an animation which illustrates the process of going between tall and wide representations of data. Today, we will show the same procedure for constructing an animation of the unnest() function.

The unnest() function takes a tibble containing a list column and converts it to a tibble such that each element in the list comprises a single row. Think of it as “unpacking” the list column into a more standard vector column in R.

We will create a toy dataset using the sleep data, and ensure that it contains a list column:

library(tidyverse)
library(gganimate)

sleep_data <- sleep %>%
  mutate(group = rep(1:4, each = 5),
         ID    = rep(1:4, each = 5)) %>%
  select(id = ID, group, data = extra)

sleep_data
# A tibble: 20 x 3
      id group  data
 1     1     1   0.7
 2     1     1  -1.6
 3     1     1  -0.2
 4     1     1  -1.2
 5     1     1  -0.1
...

You can see that we have a 20 row by 3 column tibble in R. Next, we perform the following routine in order to nest the data column into a new nesteddata column:

sleep_nested <- sleep_data %>%
  group_by(group) %>%
  summarise(id = id[1],
            nesteddata = list(data))

sleep_nested
# A tibble: 4 x 3
  group    id nesteddata
1     1     1 <dbl [5]>
2     2     2 <dbl [5]>
3     3     3 <dbl [5]>
4     4     4 <dbl [5]>
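Unnesting is the reverse operation: calling unnest() on the list column expands each element back into its own row, recovering the original 20-row layout.

sleep_nested %>%
  unnest(nesteddata)   # back to a 20 x 3 tibble (group, id, nesteddata)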

Next, we perform a similar routine to the previous blog and combine the two datasets into one dataset which will be used to build the animation:

longDat <- function(x) {
  names(x) %>%
    rbind(x) %>%
    setNames(seq_len(ncol(x))) %>%
    mutate(row = row_number()) %>%
    tidyr::gather(column, value, -row) %>%
    mutate(column = as.integer(column)) %>%
    ungroup() %>%
    arrange(column, row)
}

long_tables <- map(list(sleep_nested, sleep_data), longDat)

nested_table <- long_tables[[1]] %>%
  mutate(tstep = "a",
         value = sapply(value, paste, collapse = ", "))

unnested_table <- long_tables[[2]] %>%
  mutate(tstep = "b")

both_tables <- bind_rows(nested_table, unnested_table)

both_tables$celltype[both_tables$column == 1] <- c("header", rep("id", 4), "header", rep("id", 20))
both_tables$celltype[both_tables$column == 2] <- c("header", rep(1:4, each = 1), "header", rep(1:4, each = 5))
both_tables$celltype[both_tables$column == 3] <- c("header", rep("nesteddata", 4), "header", rep("data", 20))

both_tables
# A tibble: 78 x 5
     row column value tstep celltype
 1     1      1 group a     header
 2     2      1 1     a     id
 3     3      1 2     a     id
 4     4      1 3     a     id
 5     5      1 4     a     id
 6     1      2 id    a     header
 7     2      2 1     a     1
 8     3      2 2     a     2
 9     4      2 3     a     3
10     5      2 4     a     4
# … with 68 more rows

From this, we can produce static versions of the two images which will form the basis for the animation:

base_plot <- ggplot(both_tables, aes(column, -row, fill = celltype)) +
  geom_tile(color = "black") +
  theme_void() +
  scale_fill_manual(values = c("#247ba0", "#70c1b3", "#b2dbbf", "turquoise2",
                               "#ead2ac", "grey60", "mistyrose3", "#ffae63"),
                    name = "",
                    labels = c("Group 1", "Group 2", "Group 3", "Group 4",
                               "Data", "Header", "ID", "Nested Data"))

base_plot + facet_wrap(~tstep)

The static plot that will be used to generate the animation

Finally, we use gganimate to build the final animation!

p1 <- base_plot +
  transition_states(
    states = tstep,
    transition_length = 1,
    state_length = 1
  ) +
  enter_fade() +
  exit_fade() +
  ease_aes('sine-in-out')

p1_animate <- animate(p1, height = 800, width = 600, fps = 20, duration = 10)

anim_save("unnest_animate.gif")

And there you have it! We hope this was helpful both in learning how to produce data transformation animations, and in terms of learning the unnest() operation itself. If you have any requests for more data transformation animations, please let us know, and be on the look out for future posts in this series!


Understanding the maths of Computed Tomography (CT) scans

Wed, 01/09/2019 - 09:34

(This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)


Noseman is having a headache and as an old-school hypochondriac he goes to see his doctor. His doctor is quite worried and makes an appointment with a radiologist for Noseman to get a CT scan.

Modern CT scanner from Siemens

Because Noseman always wants to know how things work he asks the radiologist about the inner workings of a CT scanner.

The basic idea is that X-rays are fired from one side of the scanner to the other. Because different sorts of tissue (like bones, brain cells, cartilage etc.) block different amounts of the X-rays the intensity measured on the other side varies accordingly.

The problem is of course that a single picture cannot give the full details of what is inside the body because it is a combination of different sorts of tissue in the way of the respective X-rays. The solution is to rotate the scanner and combine the different slices.

How, you ask? Good old linear algebra to the rescue!

We start with the initial position and fire X-rays with an intensity of 30 (just a number for illustrative purposes) through the body:

Initial position

As can be seen in the picture, the upper ray goes through areas 1, 2 and 3, and let’s say that an intensity value of 12 is measured on the other side of the scanner:

\[
30 - x_1 - x_2 - x_3 = 12
\]

or

\[
x_1 + x_2 + x_3 = 18
\]

The rest of the formula is found accordingly:

\[
\begin{cases}
x_1 + x_2 + x_3 = 18 \\
x_4 + x_5 + x_6 = 21 \\
x_7 + x_8 + x_9 = 18
\end{cases}
\]

We then rotate the scanner for the first time…

Position after first rotation

…which gives the following formula:

   

And a second rotation…

Position after second rotation

…yields the following formula:

   

Now we are combining all three systems of equations:

   

or written out in full:

   

Here is the data of the matrix for you to download: ct-scan.txt.

We now have 9 equations with 9 unknown variables… which should easily be solvable by R, which can also depict the solution as a gray-scaled image… the actual CT-scan!

A <- read.csv("data/ct-scan.txt")
b <- c(18, 21, 18, 18, 21, 9, 18, 14, 16)
v <- solve(A, b)

matrix(v, ncol = 3, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,]    9    9    0
## [2,]    9    5    7
## [3,]    9    9    0

image(matrix(v, ncol = 3), col = gray(4:0 / 4))

CT of Noseman
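As a quick check (assuming ct-scan.txt really holds the full 9 x 9 coefficient matrix), multiplying it by the solution reproduces the measured intensities:

as.numeric(as.matrix(A) %*% v)   # gives back b = c(18, 21, 18, 18, 21, 9, 18, 14, 16)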

The radiologist looks at the picture… and has good news for Noseman: everything is how it should be! Noseman is relieved and his headache is much better now…

Real CT scans make use of the same basic principles (of course with a lot of additional engineering and maths magic).

Here are real images of CT scans of a human brain…

Source: Wikimedia

… which can be combined into a 3D-animation:

Source: Wikimedia

Isn’t it fascinating how a little bit of maths can save lives!


A deep dive into glmnet: offset

Wed, 01/09/2019 - 08:21

(This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers)

I’m writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.

In this post, we will look at the offset option.

For reference, here is the full signature of the glmnet function:

glmnet(x, y, family = c("gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian"),
       weights, offset = NULL, alpha = 1, nlambda = 100,
       lambda.min.ratio = ifelse(nobs < nvars, 0.01, 1e-04), ...)

offset

According to the official R documentation, offset should be

A vector of length nobs that is included in the linear predictor (a nobs x nc matrix for the “multinomial” family).

Its default value is NULL: in that case, glmnet internally sets the offset to be a vector of zeros having the same length as the response y.

Here is some example code for using the offset option:

set.seed(1)
n <- 50; p <- 10
x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)
offset <- rnorm(n)

# fit model
fit1 <- glmnet(x, y, offset = offset)

If we specify offset in the glmnet call, then when making predictions with the model, we must specify the newoffset option. For example, if we want the predictions fit1 gives us for the training data, not specifying newoffset will give us an error:
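predict(fit1, x, s = 0.1)
# this errors: the model was fit with an offset, so predict() needs `newoffset`
# (the exact error text depends on your glmnet version)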

This is the correct code:

predict(fit1, x, s = 0.1, newoffset = offset)
#                1
# [1,]  0.44691399
# [2,]  0.30013292
# [3,] -1.68825225
# [4,] -0.49655504
# [5,]  1.20180199
# ...

So, what does offset actually do (or mean)? Recall that glmnet is fitting a linear model. More concretely, our data is \((x_1, y_1), \dots, (x_n, y_n)\), where the \(x_i \in \mathbb{R}^p\) are our features for observation \(i\) and \(y_i\) is the response for observation \(i\). For each observation, we are trying to model some variable \(\eta_i\) as a linear combination of the features, i.e. \(\eta_i = \beta_0 + x_i^T \beta\). \(\eta_i\) is a function of \(y_i\); the function depends on the context. For example,

  • For ordinary regression, \(\eta_i = y_i\), i.e. the response itself.
  • For logistic regression, \(\eta_i = \log \frac{P(y_i = 1)}{1 - P(y_i = 1)}\).
  • For Poisson regression, \(\eta_i = \log E[y_i]\).

So, we are trying to find \(\beta_0\) and \(\beta\) so that \(\beta_0 + x_i^T \beta\) is a good estimate for \(\eta_i\). If we have an offset \(a_i\), then we are trying to find \(\beta_0\) and \(\beta\) so that \(a_i + \beta_0 + x_i^T \beta\) is a good estimate for \(\eta_i\).

Why might we want to use offsets? There are two primary reasons for them stated in the documentation:

Useful for the “poisson” family (e.g. log of exposure time), or for refining a model by starting at a current fit.

Let me elaborate. First, offsets are useful for Poisson regression. The official vignette has a little section explaining this; let me explain it through an example.

Imagine that we are trying to predict how many points an NBA basketball player will score per minute based on his physical attributes. If the player’s physical attributes (i.e. the covariates of our model) are denoted by \(x\) and the number of points he scores in a minute is denoted by \(y\), then Poisson regression assumes that

\[
y \sim \text{Poisson}(\mu), \qquad \log \mu = \beta_0 + x^T \beta.
\]

\(\beta_0\) and \(\beta\) are parameters of the model to be determined.

Having described the model, let’s turn to our data. For each player \(i\), we have physical covariates \(x_i\). However, instead of having each player’s points per minute, we have the number of points scored over a certain time period. For example, we might have “player 1 scored 12 points over 30 minutes” instead of “player 1 scored 0.4 points per minute”.

Offsets allow us to use our data as is. In our example above, loosely speaking 12/30 (points per minute) is our estimate for \(\mu\). Hence, 12 (points in 30 minutes) is our estimate for \(30\mu\). In our model, \(\beta_0 + x^T \beta\) is our estimate for \(\log \mu\), and so our estimate for \(\log(30\mu)\) would be \(\log 30 + \beta_0 + x^T \beta\). The term \(\log 30\) is the “offset” needed to get the model prediction for our data as is.

Taking this to the full dataset: if player \(i\) scores \(y_i\) points in \(m_i\) minutes, then our offset would be the vector \((\log m_1, \dots, \log m_n)\), and the response we would feed glmnet is \((y_1, \dots, y_n)\).
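In glmnet code, that setup looks something like this (a sketch with simulated data, not real NBA numbers):

# Poisson regression with log(exposure) as the offset (simulated data)
set.seed(2)
n <- 100; p <- 5
x <- matrix(rnorm(n * p), nrow = n)                        # physical covariates
minutes <- runif(n, 10, 48)                                # minutes played (exposure)
y <- rpois(n, lambda = minutes * exp(0.1 + 0.3 * x[, 1]))  # total points scored
fit_pois <- glmnet(x, y, family = "poisson", offset = log(minutes))
# predictions of expected totals need the same kind of offset
predict(fit_pois, x, s = 0.1, newoffset = log(minutes), type = "response")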

The second reason one might want to use offsets is to improve on an existing model. Continuing the example above: say we have a friend who has trained a model (not necessarily a linear model) to predict \(\log \mu\), but he did not use the player’s physical attributes. We think that we can improve on his predictions by adding physical attributes to the model. One refinement to our friend’s model could be

\[
\log \mu = f + \beta_0 + x^T \beta,
\]

where \(f\) is the prediction of \(\log \mu\) from our friend’s model. In this setting, the offsets are simply our friend’s predictions. For model training, we would provide the first model’s predictions on the training observations as the offset. To get predictions from the refinement on new observations, we would first compute the predictions from the first model, then use them as the newoffset option in the predict call.
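Here is a sketch of that refinement workflow (using a Gaussian response and an lm as the “friend’s model” for simplicity):

# refine an existing model by treating its predictions as an offset
set.seed(3)
n <- 100
z <- rnorm(n)                                  # covariate the friend used
x <- matrix(rnorm(n * 10), nrow = n)           # our physical attributes
y <- 1 + 2 * z + x[, 1] + rnorm(n)
friend_fit  <- lm(y ~ z)                       # the friend's model (could be anything)
friend_pred <- predict(friend_fit)             # its predictions on the training set
refine_fit  <- glmnet(x, y, offset = friend_pred)
# new observations: compute the friend's predictions first, then pass them as newoffset
z_new <- rnorm(5); x_new <- matrix(rnorm(5 * 10), nrow = 5)
new_offset <- predict(friend_fit, newdata = data.frame(z = z_new))
predict(refine_fit, x_new, s = 0.1, newoffset = new_offset)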


Dow Jones Stock Market Index (4/4): Trade Volume GARCH Model

Wed, 01/09/2019 - 04:23

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    Categories

    1. Advanced Modeling

    Tags

    1. Data Visualisation
    2. Linear Regression
    3. R Programming

    This is the final part of this four-part series. In this fourth post, I am going to build an ARMA-GARCH model for the Dow Jones Industrial Average (DJIA) daily trade volume log ratio. You can read the other three parts at the following links: part 1, part 2, and part 3.

    Packages

    The packages used in this post series are listed below.

    suppressPackageStartupMessages(library(lubridate))
    suppressPackageStartupMessages(library(fBasics))
    suppressPackageStartupMessages(library(lmtest))
    suppressPackageStartupMessages(library(urca))
    suppressPackageStartupMessages(library(ggplot2))
    suppressPackageStartupMessages(library(quantmod))
    suppressPackageStartupMessages(library(PerformanceAnalytics))
    suppressPackageStartupMessages(library(rugarch))
    suppressPackageStartupMessages(library(FinTS))
    suppressPackageStartupMessages(library(forecast))
    suppressPackageStartupMessages(library(strucchange))
    suppressPackageStartupMessages(library(TSA))

    Getting Data

    We load the environment saved at the end of part 3.

    load(file='DowEnvironment.RData')

    Structural Breaks in Daily Trade Volume

    In part 2, we highlighted some level shifts occurring within the daily trade volume. Here is the plot we commented on there.

    plot(dj_vol)

    Now we want to identify and date such level shifts. First, we verify that a linear regression with a constant mean is statistically significant.

    lm_dj_vol <- lm(dj_vol ~ 1, data = dj_vol)
    summary(lm_dj_vol)
    ## 
    ## Call:
    ## lm(formula = dj_vol ~ 1, data = dj_vol)
    ## 
    ## Residuals:
    ##        Min         1Q     Median         3Q        Max 
    ## -193537268  -91369768  -25727268   65320232  698562732 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 201947268    2060772      98   <2e-16 ***
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 113200000 on 3019 degrees of freedom

    This linear regression model is statistically significant, hence it makes sense to proceed with the identification of structural breakpoints with respect to a constant level.

    bp_dj_vol <- breakpoints(dj_vol ~ 1, h = 0.1) summary(bp_dj_vol) ## ## Optimal (m+1)-segment partition: ## ## Call: ## breakpoints.formula(formula = dj_vol ~ 1, h = 0.1) ## ## Breakpoints at observation number: ## ## m = 1 2499 ## m = 2 896 2499 ## m = 3 626 1254 2499 ## m = 4 342 644 1254 2499 ## m = 5 342 644 1219 1649 2499 ## m = 6 320 622 924 1251 1649 2499 ## m = 7 320 622 924 1251 1692 2172 2499 ## m = 8 320 622 924 1251 1561 1863 2172 2499 ## ## Corresponding to breakdates: ## ## m = 1 ## m = 2 0.296688741721854 ## m = 3 0.207284768211921 ## m = 4 0.113245033112583 0.213245033112583 ## m = 5 0.113245033112583 0.213245033112583 ## m = 6 0.105960264900662 0.205960264900662 0.305960264900662 ## m = 7 0.105960264900662 0.205960264900662 0.305960264900662 ## m = 8 0.105960264900662 0.205960264900662 0.305960264900662 ## ## m = 1 ## m = 2 ## m = 3 0.41523178807947 ## m = 4 0.41523178807947 ## m = 5 0.40364238410596 0.546026490066225 ## m = 6 0.414238410596027 0.546026490066225 ## m = 7 0.414238410596027 0.560264900662252 ## m = 8 0.414238410596027 0.516887417218543 0.616887417218543 ## ## m = 1 0.827483443708609 ## m = 2 0.827483443708609 ## m = 3 0.827483443708609 ## m = 4 0.827483443708609 ## m = 5 0.827483443708609 ## m = 6 0.827483443708609 ## m = 7 0.719205298013245 0.827483443708609 ## m = 8 0.719205298013245 0.827483443708609 ## ## Fit: ## ## m 0 1 2 3 4 5 6 ## RSS 3.872e+19 2.772e+19 1.740e+19 1.547e+19 1.515e+19 1.490e+19 1.475e+19 ## BIC 1.206e+05 1.196e+05 1.182e+05 1.179e+05 1.178e+05 1.178e+05 1.178e+05 ## ## m 7 8 ## RSS 1.472e+19 1.478e+19 ## BIC 1.178e+05 1.178e+05

    It gives this plot.

    plot(bp_dj_vol)

    The minimum BIC is reached at breaks = 6.

    Here is the plot of the DJIA daily trade volume with level shifts (red line).

    bp_fit <- fitted(bp_dj_vol, breaks = 6)
    bp_fit_xts <- xts(bp_fit, order.by = index(dj_vol))
    bb <- merge(bp_fit_xts, dj_vol)
    colnames(bb) <- c("Level", "DJI.Volume")
    plot(bb, lwd = c(3, 1), col = c("red", "black"))

    Here are the calendar dates when such level shifts occur.

    bd <- breakdates(bp_dj_vol)
    bd <- as.integer(nrow(dj_vol) * bd)
    index(dj_vol)[bd]
    ## [1] "2008-04-10" "2009-06-22" "2010-09-01" "2011-12-16" "2013-07-22"
    ## [6] "2016-12-02"

    See ref. [10] for further details about structural breaks.

    Daily Trade Volume Log Ratio Model

    As shown in part 2, the daily trade volume log ratio:

    \[
    v_{t}\ := ln \frac{V_{t}}{V_{t-1}}
    \]

    is already available within our data environment. We plot it.

    plot(dj_vol_log_ratio)
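    If you are not starting from the saved environment, the same series can be rebuilt from the raw volume in one line (a sketch; dj_vol is the xts volume series from the earlier parts).

    # rebuild the daily trade volume log ratio from the raw volume series
    dj_vol_log_ratio <- na.omit(diff(log(dj_vol)))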

    Outliers Detection

    The Return.clean function in the PerformanceAnalytics package can clean a return time series from outliers. Below we compare the original time series with the outlier-adjusted one.

    dj_vol_log_ratio_outliersadj <- Return.clean(dj_vol_log_ratio, "boudt")
    p <- plot(dj_vol_log_ratio)
    p <- addSeries(dj_vol_log_ratio_outliersadj, col = 'blue', on = 1)
    p

    We will carry on the analysis with the original time series, as a more conservative approach to volatility evaluation.

    Correlation plots

    Below are the autocorrelation and partial autocorrelation plots.

    acf(dj_vol_log_ratio)

    pacf(dj_vol_log_ratio)

    The plots above suggest an ARMA(p,q) model with both p and q greater than zero. We will verify that as the analysis proceeds.

    Unit root tests

    We run the Augmented Dickey-Fuller test as available in the urca package, in its no-trend and no-drift flavor.

    (urdftest_lag = floor(12* (nrow(dj_vol_log_ratio)/100)^0.25)) ## [1] 28 summary(ur.df(dj_vol_log_ratio, type = "none", lags = urdftest_lag, selectlags="BIC")) ## ## ############################################### ## # Augmented Dickey-Fuller Test Unit Root Test # ## ############################################### ## ## Test regression none ## ## ## Call: ## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.51742 -0.13544 -0.02485 0.12006 1.98641 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## z.lag.1 -5.21571 0.25218 -20.682 < 2e-16 *** ## z.diff.lag1 3.59929 0.24637 14.609 < 2e-16 *** ## z.diff.lag2 3.14516 0.23796 13.217 < 2e-16 *** ## z.diff.lag3 2.78234 0.22809 12.198 < 2e-16 *** ## z.diff.lag4 2.41745 0.21700 11.140 < 2e-16 *** ## z.diff.lag5 2.14179 0.20472 10.462 < 2e-16 *** ## z.diff.lag6 1.88354 0.19168 9.826 < 2e-16 *** ## z.diff.lag7 1.66394 0.17781 9.358 < 2e-16 *** ## z.diff.lag8 1.41657 0.16304 8.688 < 2e-16 *** ## z.diff.lag9 1.19354 0.14771 8.080 9.31e-16 *** ## z.diff.lag10 1.00374 0.13207 7.600 3.95e-14 *** ## z.diff.lag11 0.88134 0.11634 7.575 4.75e-14 *** ## z.diff.lag12 0.74946 0.10027 7.474 1.02e-13 *** ## z.diff.lag13 0.60798 0.08399 7.239 5.73e-13 *** ## z.diff.lag14 0.48367 0.06732 7.185 8.46e-13 *** ## z.diff.lag15 0.36278 0.05108 7.102 1.53e-12 *** ## z.diff.lag16 0.21128 0.03470 6.088 1.29e-09 *** ## z.diff.lag17 0.07911 0.01833 4.315 1.65e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2576 on 2972 degrees of freedom ## Multiple R-squared: 0.7476, Adjusted R-squared: 0.746 ## F-statistic: 488.9 on 18 and 2972 DF, p-value: < 2.2e-16 ## ## ## Value of test-statistic is: -20.6822 ## ## Critical values for test statistics: ## 1pct 5pct 10pct ## tau1 -2.58 -1.95 -1.62

    Based on reported test statistics compared with critical values, we reject the null hypothesis of unit root presence. See ref. [6] for further details about the Augmented Dickey-Fuller test.

    ARMA model

    We now determine the ARMA structure of our time series in order to run the ARCH effects test on the resulting residuals. That is in agreement with what is outlined in ref. [4], §4.3.

    We take advantage of the auto.arima() function in the forecast package (ref. [7]) to get an idea of which ARMA model to start with.

    (auto_model_1 <- auto.arima(dj_vol_log_ratio, stepwise=FALSE)) ## Series: dj_vol_log_ratio ## ARIMA(3,0,2) with zero mean ## ## Coefficients: ## ar1 ar2 ar3 ma1 ma2 ## -0.5266 0.3769 0.1459 -0.0896 -0.7812 ## s.e. 0.0970 0.0401 0.0203 0.0968 0.0872 ## ## sigma^2 estimated as 0.0659: log likelihood=-176.48 ## AIC=364.96 AICc=364.99 BIC=401.03 coeftest(auto_model_1) ## ## z test of coefficients: ## ## Estimate Std. Error z value Pr(>|z|) ## ar1 -0.526617 0.096988 -5.4297 5.644e-08 *** ## ar2 0.376906 0.040090 9.4014 < 2.2e-16 *** ## ar3 0.145902 0.020324 7.1787 7.039e-13 *** ## ma1 -0.089611 0.096788 -0.9258 0.3545 ## ma2 -0.781163 0.087159 -8.9626 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    The ma1 coefficient is not significant, hence we try the following ARMA(2,3) model.

    (arma_model_2 <- arima(dj_vol_log_ratio, order=c(2,0,3), include.mean = FALSE)) ## ## Call: ## arima(x = dj_vol_log_ratio, order = c(2, 0, 3), include.mean = FALSE) ## ## Coefficients: ## ar1 ar2 ma1 ma2 ma3 ## -0.1802 0.6441 -0.4351 -0.8604 0.3596 ## s.e. 0.0643 0.0454 0.0681 0.0493 0.0423 ## ## sigma^2 estimated as 0.0658: log likelihood = -176.9, aic = 363.79 coeftest(arma_model_2) ## ## z test of coefficients: ## ## Estimate Std. Error z value Pr(>|z|) ## ar1 -0.180233 0.064315 -2.8023 0.005073 ** ## ar2 0.644104 0.045449 14.1721 < 2.2e-16 *** ## ma1 -0.435122 0.068126 -6.3870 1.692e-10 *** ## ma2 -0.860443 0.049282 -17.4595 < 2.2e-16 *** ## ma3 0.359564 0.042307 8.4990 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    All coefficients are significant and the AIC is lower than that of the first model. We then try ARMA(1,2).

    (arma_model_3 <- arima(dj_vol_log_ratio, order=c(1,0,2), include.mean = FALSE)) ## ## Call: ## arima(x = dj_vol_log_ratio, order = c(1, 0, 2), include.mean = FALSE) ## ## Coefficients: ## ar1 ma1 ma2 ## 0.6956 -1.3183 0.3550 ## s.e. 0.0439 0.0518 0.0453 ## ## sigma^2 estimated as 0.06598: log likelihood = -180.92, aic = 367.84 coeftest(arma_model_3) ## ## z test of coefficients: ## ## Estimate Std. Error z value Pr(>|z|) ## ar1 0.695565 0.043874 15.8537 < 2.2e-16 *** ## ma1 -1.318284 0.051787 -25.4557 < 2.2e-16 *** ## ma2 0.355015 0.045277 7.8409 4.474e-15 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    This model has the highest AIC of the set, although all of its coefficients are significant.

    As a further check, we can also try the eacf() function in the TSA package.

    eacf(dj_vol_log_ratio)
    ## AR/MA
    ##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
    ## 0 x o o x x o o x o o x  o  o  o
    ## 1 x x o x o o o x o o x  o  o  o
    ## 2 x x x x o o o o o o x  o  o  o
    ## 3 x x x x o o o o o o x  o  o  o
    ## 4 x x x x x o o o o o x  o  o  o
    ## 5 x x x x o o o o o o o  o  o  o
    ## 6 x x x x x o x o o o o  o  o  o
    ## 7 x x x x x o o o o o o  o  o  o

    The vertex of the upper-left triangle of "o" values seems to be located somewhere within {(1,2), (2,2), (1,3), (2,3)}, which represents the set of potential (p,q) values according to this output. Note that we prefer parsimonious models, which is why we do not push the AR and MA orders too high.

    We have already verified ARMA models with (p,q) orders in the set {(3,2), (2,3), (1,2)}. Let us try {(2,2), (1,3)}.

    (arma_model_4 <- arima(dj_vol_log_ratio, order=c(2,0,2), include.mean = FALSE)) ## ## Call: ## arima(x = dj_vol_log_ratio, order = c(2, 0, 2), include.mean = FALSE) ## ## Coefficients: ## ar1 ar2 ma1 ma2 ## 0.7174 -0.0096 -1.3395 0.3746 ## s.e. 0.1374 0.0560 0.1361 0.1247 ## ## sigma^2 estimated as 0.06598: log likelihood = -180.9, aic = 369.8 coeftest(arma_model_4) ## ## z test of coefficients: ## ## Estimate Std. Error z value Pr(>|z|) ## ar1 0.7173631 0.1374135 5.2205 1.785e-07 *** ## ar2 -0.0096263 0.0560077 -0.1719 0.863536 ## ma1 -1.3394720 0.1361208 -9.8403 < 2.2e-16 *** ## ma2 0.3746317 0.1247117 3.0040 0.002665 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    The ar2 coefficient is not significant.

    (arma_model_5 <- arima(dj_vol_log_ratio, order=c(1,0,3), include.mean = FALSE)) ## ## Call: ## arima(x = dj_vol_log_ratio, order = c(1, 0, 3), include.mean = FALSE) ## ## Coefficients: ## ar1 ma1 ma2 ma3 ## 0.7031 -1.3253 0.3563 0.0047 ## s.e. 0.0657 0.0684 0.0458 0.0281 ## ## sigma^2 estimated as 0.06598: log likelihood = -180.9, aic = 369.8 coeftest(arma_model_5) ## ## z test of coefficients: ## ## Estimate Std. Error z value Pr(>|z|) ## ar1 0.7030934 0.0656902 10.7032 < 2.2e-16 *** ## ma1 -1.3253176 0.0683526 -19.3894 < 2.2e-16 *** ## ma2 0.3563425 0.0458436 7.7730 7.664e-15 *** ## ma3 0.0047019 0.0280798 0.1674 0.867 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    The ma3 coefficient is not significant.

    In conclusion, we keep both ARMA(1,2) and ARMA(2,3) as tentative mean models. We can now proceed with the ARCH effects tests.

    ARCH effect Test

    If ARCH effects are statistically significant for the residuals of our time series, a GARCH model is required.

    We test the candidate mean model ARMA(2,3).

    resid_dj_vol_log_ratio <- residuals(arma_model_2)
    ArchTest(resid_dj_vol_log_ratio - mean(resid_dj_vol_log_ratio))
    ## 
    ## ARCH LM-test; Null hypothesis: no ARCH effects
    ## 
    ## data: resid_dj_vol_log_ratio - mean(resid_dj_vol_log_ratio)
    ## Chi-squared = 78.359, df = 12, p-value = 8.476e-12

    Based on reported p-value, we reject the null hypothesis of no ARCH effects.

    Let us have a look at the residual correlation plot too.

    par(mfrow=c(1,2))
    acf(resid_dj_vol_log_ratio)
    pacf(resid_dj_vol_log_ratio)

    We test the second candidate mean model, ARMA(1,2).

    resid_dj_vol_log_ratio <- resid(arma_model_3)
    ArchTest(resid_dj_vol_log_ratio - mean(resid_dj_vol_log_ratio))
    ## 
    ## ARCH LM-test; Null hypothesis: no ARCH effects
    ## 
    ## data: resid_dj_vol_log_ratio - mean(resid_dj_vol_log_ratio)
    ## Chi-squared = 74.768, df = 12, p-value = 4.065e-11

    Based on reported p-value, we reject the null hypothesis of no ARCH effects.

    Let us have a look at the residual correlation plots.

    par(mfrow=c(1,2))
    acf(resid_dj_vol_log_ratio)
    pacf(resid_dj_vol_log_ratio)

    To inspect asymmetries within the DJIA volume log ratio, summary statistics and density plot are shown.

    basicStats(dj_vol_log_ratio)
    ##              DJI.Volume
    ## nobs        3019.000000
    ## NAs            0.000000
    ## Minimum       -2.301514
    ## Maximum        2.441882
    ## 1. Quartile   -0.137674
    ## 3. Quartile    0.136788
    ## Mean          -0.000041
    ## Median        -0.004158
    ## Sum           -0.124733
    ## SE Mean        0.005530
    ## LCL Mean      -0.010885
    ## UCL Mean       0.010802
    ## Variance       0.092337
    ## Stdev          0.303869
    ## Skewness      -0.182683
    ## Kurtosis       9.463384

    Negative skewness is reported.

    plot(density(dj_vol_log_ratio))

    As a result, an eGARCH model will be proposed for the daily trade volume log ratio as well.

    We run two fits in order to compare the results of the two candidate mean models, ARMA(1,2) and ARMA(2,3).

    ARMA-GARCH: ARMA(1,2) + eGARCH(1,1)

    garchspec <- ugarchspec(mean.model = list(armaOrder = c(1,2), include.mean = FALSE), variance.model = list(model = "eGARCH", garchOrder = c(1, 1), variance.targeting = FALSE), distribution.model = "sstd") (garchfit <- ugarchfit(data = dj_vol_log_ratio, spec = garchspec, solver='hybrid')) ## ## *---------------------------------* ## * GARCH Model Fit * ## *---------------------------------* ## ## Conditional Variance Dynamics ## ----------------------------------- ## GARCH Model : eGARCH(1,1) ## Mean Model : ARFIMA(1,0,2) ## Distribution : sstd ## ## Optimal Parameters ## ------------------------------------ ## Estimate Std. Error t value Pr(>|t|) ## ar1 0.67731 0.014856 45.5918 0.0e+00 ## ma1 -1.22817 0.000038 -31975.1819 0.0e+00 ## ma2 0.27070 0.000445 608.3525 0.0e+00 ## omega -1.79325 0.207588 -8.6385 0.0e+00 ## alpha1 0.14348 0.032569 4.4053 1.1e-05 ## beta1 0.35819 0.073164 4.8957 1.0e-06 ## gamma1 0.41914 0.042252 9.9199 0.0e+00 ## skew 1.32266 0.031528 41.9518 0.0e+00 ## shape 3.54346 0.221750 15.9795 0.0e+00 ## ## Robust Standard Errors: ## Estimate Std. Error t value Pr(>|t|) ## ar1 0.67731 0.022072 30.6859 0.0e+00 ## ma1 -1.22817 0.000067 -18466.0626 0.0e+00 ## ma2 0.27070 0.000574 471.4391 0.0e+00 ## omega -1.79325 0.233210 -7.6894 0.0e+00 ## alpha1 0.14348 0.030588 4.6906 3.0e-06 ## beta1 0.35819 0.082956 4.3178 1.6e-05 ## gamma1 0.41914 0.046728 8.9698 0.0e+00 ## skew 1.32266 0.037586 35.1902 0.0e+00 ## shape 3.54346 0.238225 14.8744 0.0e+00 ## ## LogLikelihood : 347.9765 ## ## Information Criteria ## ------------------------------------ ## ## Akaike -0.22456 ## Bayes -0.20664 ## Shibata -0.22458 ## Hannan-Quinn -0.21812 ## ## Weighted Ljung-Box Test on Standardized Residuals ## ------------------------------------ ## statistic p-value ## Lag[1] 0.5812 4.459e-01 ## Lag[2*(p+q)+(p+q)-1][8] 8.5925 3.969e-08 ## Lag[4*(p+q)+(p+q)-1][14] 14.1511 4.171e-03 ## d.o.f=3 ## H0 : No serial correlation ## ## Weighted Ljung-Box Test on Standardized Squared Residuals ## ------------------------------------ ## statistic p-value ## Lag[1] 0.4995 0.4797 ## Lag[2*(p+q)+(p+q)-1][5] 1.1855 0.8164 ## Lag[4*(p+q)+(p+q)-1][9] 2.4090 0.8510 ## d.o.f=2 ## ## Weighted ARCH LM Tests ## ------------------------------------ ## Statistic Shape Scale P-Value ## ARCH Lag[3] 0.4215 0.500 2.000 0.5162 ## ARCH Lag[5] 0.5974 1.440 1.667 0.8545 ## ARCH Lag[7] 1.2835 2.315 1.543 0.8636 ## ## Nyblom stability test ## ------------------------------------ ## Joint Statistic: 5.2333 ## Individual Statistics: ## ar1 0.63051 ## ma1 1.18685 ## ma2 1.11562 ## omega 2.10211 ## alpha1 0.08261 ## beta1 2.07607 ## gamma1 0.15883 ## skew 0.33181 ## shape 2.56140 ## ## Asymptotic Critical Values (10% 5% 1%) ## Joint Statistic: 2.1 2.32 2.82 ## Individual Statistic: 0.35 0.47 0.75 ## ## Sign Bias Test ## ------------------------------------ ## t-value prob sig ## Sign Bias 1.600 0.10965 ## Negative Sign Bias 0.602 0.54725 ## Positive Sign Bias 2.540 0.01115 ** ## Joint Effect 6.815 0.07804 * ## ## ## Adjusted Pearson Goodness-of-Fit Test: ## ------------------------------------ ## group statistic p-value(g-1) ## 1 20 20.37 0.3726 ## 2 30 36.82 0.1510 ## 3 40 45.07 0.2328 ## 4 50 52.03 0.3567 ## ## ## Elapsed time : 1.364722

    All coefficients are statistically significant. However, based on the p-values of the Weighted Ljung-Box test on standardized residuals reported above, we reject the null hypothesis of no residual correlation for this model. Hence the ARMA(1,2)+eGARCH(1,1) model is not able to capture all of the structure of our time series.

    ARMA-GARCH: ARMA(2,3) + eGARCH(1,1)

    garchspec <- ugarchspec(mean.model = list(armaOrder = c(2,3), include.mean = FALSE), variance.model = list(model = "eGARCH", garchOrder = c(1, 1), variance.targeting = FALSE), distribution.model = "sstd") (garchfit <- ugarchfit(data = dj_vol_log_ratio, spec = garchspec, solver='hybrid')) ## ## *---------------------------------* ## * GARCH Model Fit * ## *---------------------------------* ## ## Conditional Variance Dynamics ## ----------------------------------- ## GARCH Model : eGARCH(1,1) ## Mean Model : ARFIMA(2,0,3) ## Distribution : sstd ## ## Optimal Parameters ## ------------------------------------ ## Estimate Std. Error t value Pr(>|t|) ## ar1 -0.18607 0.008580 -21.6873 0.0e+00 ## ar2 0.59559 0.004596 129.5884 0.0e+00 ## ma1 -0.35619 0.013512 -26.3608 0.0e+00 ## ma2 -0.83010 0.004689 -177.0331 0.0e+00 ## ma3 0.26277 0.007285 36.0678 0.0e+00 ## omega -1.92262 0.226738 -8.4795 0.0e+00 ## alpha1 0.14382 0.033920 4.2401 2.2e-05 ## beta1 0.31060 0.079441 3.9098 9.2e-05 ## gamma1 0.43137 0.043016 10.0281 0.0e+00 ## skew 1.32282 0.031382 42.1523 0.0e+00 ## shape 3.48939 0.220787 15.8043 0.0e+00 ## ## Robust Standard Errors: ## Estimate Std. Error t value Pr(>|t|) ## ar1 -0.18607 0.023940 -7.7724 0.000000 ## ar2 0.59559 0.022231 26.7906 0.000000 ## ma1 -0.35619 0.024244 -14.6918 0.000000 ## ma2 -0.83010 0.004831 -171.8373 0.000000 ## ma3 0.26277 0.030750 8.5453 0.000000 ## omega -1.92262 0.266462 -7.2154 0.000000 ## alpha1 0.14382 0.032511 4.4239 0.000010 ## beta1 0.31060 0.095329 3.2582 0.001121 ## gamma1 0.43137 0.047092 9.1602 0.000000 ## skew 1.32282 0.037663 35.1225 0.000000 ## shape 3.48939 0.223470 15.6146 0.000000 ## ## LogLikelihood : 356.4994 ## ## Information Criteria ## ------------------------------------ ## ## Akaike -0.22888 ## Bayes -0.20698 ## Shibata -0.22891 ## Hannan-Quinn -0.22101 ## ## Weighted Ljung-Box Test on Standardized Residuals ## ------------------------------------ ## statistic p-value ## Lag[1] 0.7678 0.38091 ## Lag[2*(p+q)+(p+q)-1][14] 7.7336 0.33963 ## Lag[4*(p+q)+(p+q)-1][24] 17.1601 0.04972 ## d.o.f=5 ## H0 : No serial correlation ## ## Weighted Ljung-Box Test on Standardized Squared Residuals ## ------------------------------------ ## statistic p-value ## Lag[1] 0.526 0.4683 ## Lag[2*(p+q)+(p+q)-1][5] 1.677 0.6965 ## Lag[4*(p+q)+(p+q)-1][9] 2.954 0.7666 ## d.o.f=2 ## ## Weighted ARCH LM Tests ## ------------------------------------ ## Statistic Shape Scale P-Value ## ARCH Lag[3] 1.095 0.500 2.000 0.2955 ## ARCH Lag[5] 1.281 1.440 1.667 0.6519 ## ARCH Lag[7] 1.940 2.315 1.543 0.7301 ## ## Nyblom stability test ## ------------------------------------ ## Joint Statistic: 5.3764 ## Individual Statistics: ## ar1 0.12923 ## ar2 0.20878 ## ma1 1.15005 ## ma2 1.15356 ## ma3 0.97487 ## omega 2.04688 ## alpha1 0.09695 ## beta1 2.01026 ## gamma1 0.18039 ## skew 0.38131 ## shape 2.40996 ## ## Asymptotic Critical Values (10% 5% 1%) ## Joint Statistic: 2.49 2.75 3.27 ## Individual Statistic: 0.35 0.47 0.75 ## ## Sign Bias Test ## ------------------------------------ ## t-value prob sig ## Sign Bias 1.4929 0.13556 ## Negative Sign Bias 0.6317 0.52766 ## Positive Sign Bias 2.4505 0.01432 ** ## Joint Effect 6.4063 0.09343 * ## ## ## Adjusted Pearson Goodness-of-Fit Test: ## ------------------------------------ ## group statistic p-value(g-1) ## 1 20 17.92 0.5278 ## 2 30 33.99 0.2395 ## 3 40 44.92 0.2378 ## 4 50 50.28 0.4226 ## ## ## Elapsed time : 1.660402

    All coefficients are statistically significant. No correlation is found within the standardized residuals or the standardized squared residuals. All ARCH effects are properly captured by the model. The Adjusted Pearson goodness-of-fit test does not reject the null hypothesis that the empirical distribution of the standardized residuals and the selected theoretical distribution are the same. However:

    * the Nyblom stability test null hypothesis that the model parameters are constant over time is rejected for some of them

    * the Positive Sign Bias null hypothesis is rejected at the 5% significance level; this kind of test focuses on the effect of large and small positive shocks

    See ref. [11] for an explanation about GARCH model diagnostic.

    Some diagnostic plots are shown.

    par(mfrow=c(2,2))
    plot(garchfit, which=8)
    plot(garchfit, which=9)
    plot(garchfit, which=10)
    plot(garchfit, which=11)

    We show the original DJIA daily trade volume log ratio time series with the mean model fit (red line) and the conditional volatility (blue line).

    par(mfrow=c(1,1))
    cond_volatility <- sigma(garchfit)
    mean_model_fit <- fitted(garchfit)
    p <- plot(dj_vol_log_ratio, col = "grey")
    p <- addSeries(mean_model_fit, col = 'red', on = 1)
    p <- addSeries(cond_volatility, col = 'blue', on = 1)
    p

    Model Equation

    Combining both ARMA(2,3) and eGARCH(1,1) model equations we have:

    \[
    \begin{equation}
    \begin{cases}
    y_{t}\ -\ \phi_{1}y_{t-1}\ -\ \phi_{2}y_{t-2} =\ \phi_{0}\ +\ u_{t}\ +\ \theta_{1}u_{t-1}\ +\ \theta_{2}u_{t-2}\ +\ \theta_{3}u_{t-3}
    \\
    \\
    u_{t}\ =\ \sigma_{t}\epsilon_{t},\ \ \ \ \ \epsilon_{t}=N(0,1)
    \\
    \\
    \ln(\sigma_{t}^2)\ =\ \omega\ + \sum_{j=1}^{q} (\alpha_{j} \epsilon_{t-j}^2\ + \gamma (\epsilon_{t-j} - E|\epsilon_{t-j}|)) +\ \sum_{i=1}^{p} \beta_{i} \ln(\sigma_{t-i}^2)
    \end{cases}
    \end{equation}
    \]

    Using the model resulting coefficients we obtain:

    \[
    \begin{equation}
    \begin{cases}
    y_{t}\ +\ 0.186\ y_{t-1}\ -\ 0.595\ y_{t-2}\ =\ u_{t}\ -\ 0.356\ u_{t-1}\ -\ 0.830\ u_{t-2}\ +\ 0.263\ u_{t-3}
    \\
    \\
    u_{t}\ =\ \sigma_{t}\epsilon_{t},\ \ \ \ \ \epsilon_{t}=N(0,1)
    \\
    \\
    \ln(\sigma_{t}^2)\ =\ -1.922\ +\ 0.144\ \epsilon_{t-1}^2\ +\ 0.431\ (\epsilon_{t-1} - E|\epsilon_{t-1}|)\ +\ 0.310\ \ln(\sigma_{t-1}^2)
    \end{cases}
    \end{equation}
    \]
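    The numeric coefficients plugged in above can be read directly off the fitted object:

    # the estimated parameters used in the equation above
    round(coef(garchfit), 3)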

    Daily Volume Log Ratio Volatility Analysis

    Here is the plot of the conditional volatility resulting from our ARMA(2,3)+eGARCH(1,1) model.

    plot(cond_volatility)

    Line plots of conditional volatility by year are shown.

    par(mfrow=c(6,2))
    pl <- lapply(2007:2018, function(x) {
      plot(cond_volatility[as.character(x)], main = "DJIA Daily Volume Log-ratio conditional volatility")
    })
    pl

    Conditional volatility box-plots by year are shown.

    par(mfrow=c(1,1))
    cond_volatility_df <- xts_to_dataframe(cond_volatility)
    dataframe_boxplot(cond_volatility_df, "Dow Jones Daily Volume Log-ratio conditional volatility 2007-2018")

    If you have any questions, please feel free to comment below.

    References
    1. Dow Jones Industrial Average
    2. Skewness
    3. Kurtosis
    4. An introduction to the analysis of financial data with R, Wiley, Ruey S. Tsay
    5. Time series analysis and its applications, Springer ed., R.H. Shumway, D.S. Stoffer
    6. Applied Econometric Time Series, Wiley, W. Enders, 4th ed.
    7. Forecasting – Principle and Practice, Texts, R.J. Hyndman
    8. Options, Futures and other Derivatives, Pearson ed., J.C. Hull
    9. An introduction to rugarch package
    10. Applied Econometrics with R, Achim Zeileis, Christian Kleiber – Springer Ed.
    11. GARCH modeling: diagnostic tests
    Disclaimer

    Any securities or databases referred to in this post are solely for illustration purposes, and under no regard should the findings presented here be interpreted as investment advice or promotion of any particular security or source.

    Related Post

    1. Dow Jones Stock Market Index (3/4): Log Returns GARCH Model
    2. Leaf Plant Classification: Statistical Learning Model – Part 2
    3. NYC buses: C5.0 classification with R; more than 20 minute delay?
    4. NYC buses: Cubist regression with more predictors
    5. NYC buses: simple Cubist regression

    You did a sentiment analysis with tidytext but you forgot to do dependency parsing to answer WHY is something positive/negative

    Tue, 01/08/2019 - 23:16

    (This article was first published on bnosac :: open analytical helpers - bnosac :: open analytical helpers, and kindly contributed to R-bloggers)

    A small note on the growing list of users of the udpipe R package. In the last month of 2018, we updated the package on CRAN with some notable changes:

    • The default models downloaded with the function udpipe_download_model are now built on Universal Dependencies 2.3 (released on 2018-11-15); a short download-and-annotate sketch follows this list
    • This means udpipe now has models for 60 languages. That’s right! And they provide tokenisation, parts of speech tagging, lemmatisation and dependency parsing built on all of these treebanks: afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, persian-seraji, polish-lfg, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb.
    • Although this was not intended originally, we added a sentiment scoring function in the latest release (version 0.8 on CRAN). Combined with the output of the dependency parsing, this allows you to answer questions like ‘WHAT IS CAUSING A NEGATIVE SENTIMENT’. An example is shown below.
    • If you want to use the udpipe models for commercial purposes, we have some nice extra pretrained models available for you – get in touch if you are looking for this.
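    As a small, hedged sketch of the first point above (this code is not part of the original announcement), downloading one of the Universal Dependencies 2.3 models and annotating a toy sentence could look like this:

    library(udpipe)
    # english-ewt is one of the treebanks listed above; the sentence is a made-up example
    m  <- udpipe_download_model(language = "english-ewt")
    ud <- udpipe_load_model(m$file_model)
    x  <- udpipe_annotate(ud, x = "The room was dirty but the staff was very friendly.")
    x  <- as.data.frame(x)
    head(x[, c("token", "lemma", "upos", "dep_rel")])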
    Below we will showcase the new features of the R package by finding out what is causing a negative sentiment. When I see users of the tidytext sentiment R package, I always wonder whether they do sentiment scoring just for the love of building reports, as the main thing they seem to report is the frequency of occurrence of words which are part of a positive or negative dictionary, while their manager probably asked them: “Yeah, but why is the sentiment negative or positive?”
    You can answer this managerial question using dependency parsing, and that is exactly what udpipe provides (amongst other NLP annotations). Dependency parsing links each word to another word, allowing us to find out which words are linked to negative words, giving you the context of why something is negative and what needs to be improved in your business. Let’s show how to get this easily done in R.

    Below we get a sample of 500 AirBnb customer reviews in French, annotate it with udpipe (using a French model built on top of the Rhapsodie French treebank), and use the new sentiment scoring function txt_sentiment, which is available in the new udpipe release, with an online dictionary of positive / negative terms for French. Next we use the udpipe dependency parsing output by looking at the adjectival modifier ‘amod’ in the dep_rel field and visualise all words which are linked to the negative terms of the dictionary. The result is a graph showing words of the dictionary in red and the words which are linked to them in another colour.

    Full code showing how this is done is shown below.

    library(udpipe)
    library(dplyr)
    library(magrittr)
    data(brussels_reviews, package = "udpipe")
    x <- brussels_reviews %>%
      filter(language == "fr") %>%
      rename(doc_id = id, text = feedback) %>%
      udpipe("french-spoken", trace = 10)
    ##
    ## Get a French sentiment dictionary lexicon with positive/negative terms, negators, amplifiers and deamplifiers
    ##
    load(file("https://github.com/sborms/sentometrics/raw/master/data-raw/FEEL_fr.rda"))
    load(file("https://github.com/sborms/sentometrics/raw/master/data-raw/valence-raw/valShifters.rda"))
    polarity_terms <- rename(FEEL_fr, term = x, polarity = y)
    polarity_negators <- subset(valShifters$valence_fr, t == 1)$x
    polarity_amplifiers <- subset(valShifters$valence_fr, t == 2)$x
    polarity_deamplifiers <- subset(valShifters$valence_fr, t == 3)$x
    ##
    ## Do sentiment analysis based on that open French lexicon
    ##
    sentiments <- txt_sentiment(x, term = "lemma",
                                polarity_terms = polarity_terms,
                                polarity_negators = polarity_negators,
                                polarity_amplifiers = polarity_amplifiers,
                                polarity_deamplifiers = polarity_deamplifiers)
    sentiments <- sentiments$data
    • Nothing fancy happened here above. We use udpipe for NLP annotation (tokenisation, lemmatisation, parts of speech tagging and dependency parsing). The sentiment scoring not only does a join with the sentiment dictionary but also looks for neighbouring words which might change the sentiment.
    • The resulting dataset looks like this

    Now we can answer the question – why is something negative

    This is done by using the dependency relationship output of udpipe to find out which words are linked to negative words from our sentiment dictionary. Users unfamiliar with dependency relationships can have a look at the definitions of possible tags for the dep_rel field in the dependency parsing output. In this case we only take ‘amod’, meaning we are looking for adjectives modifying a noun.

    ## Use cbind_dependencies to add the parent token to which the keyword is linked
    reasons <- sentiments %>%
      cbind_dependencies() %>%
      select(doc_id, lemma, token, upos, sentiment_polarity, token_parent, lemma_parent, upos_parent, dep_rel) %>%
      filter(sentiment_polarity < 0)
    head(reasons)
    • Now, instead of making a plot showing which negative words appear (which tidytext users seem to be so keen on), we can make a plot showing the negative words and the words these negative terms are linked to, indicating the context of the negative term.
    • We select the lemmas of the negative words and the lemmas of their parent words and calculate how many times they occur together
    reasons <- filter(reasons, dep_rel %in% "amod")
    word_cooccurences <- reasons %>%
      group_by(lemma, lemma_parent) %>%
      summarise(cooc = n()) %>%
      arrange(-cooc)
    vertices <- bind_rows(
      data_frame(key = unique(reasons$lemma)) %>% mutate(in_dictionary = if_else(key %in% polarity_terms$term, "in_dictionary", "linked-to")),
      data_frame(key = unique(setdiff(reasons$lemma_parent, reasons$lemma))) %>% mutate(in_dictionary = "linked-to"))
    • The following makes the visualisation using ggraph.
    library(magrittr)
    library(ggraph)
    library(igraph)
    cooc <- head(word_cooccurences, 20)
    set.seed(123456789)
    cooc %>%  
      graph_from_data_frame(vertices = filter(vertices, key %in% c(cooc$lemma, cooc$lemma_parent))) %>%
      ggraph(layout = "fr") +
      geom_edge_link0(aes(edge_alpha = cooc, edge_width = cooc)) +
      geom_node_point(aes(colour = in_dictionary), size = 5) +
      geom_node_text(aes(label = name), vjust = 1.8, col = "darkgreen") +
      ggtitle("Which words are linked to the negative terms") +
      theme_void()

    This generated the image shown above, showing context of negative terms. Now go do this on your own data.

    If you are interested in the techniques shown above, you might also be interested in our recent open-sourced NLP developments:
    • textrank: text summarisation
    • crfsuite: entity recognition, chunking and sequence modelling
    • BTM: biterm topic modelling on short texts (e.g. survey answers / twitter data)
    • ruimtehol: neural text models on top of Starspace (neural models for text categorisation, word/sentence/document embeddings, document recommendation, entity link completion and entity embeddings)
    • udpipe: general NLP package for tokenisation, lemmatisation, parts of speech tagging, morphological annotations, dependency parsing, keyword extraction and NLP flows

    Enjoy!

     



    French Baccalaureate Results

    Tue, 01/08/2019 - 23:15

    (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

    I. Context

    The French Baccalaureate (BAC) is the final exam all French students must pass to graduate from high school. Not only is it necessary to graduate, but a student’s performance on the BAC is the equivalent of one’s performance on the American ACT/SAT for college applications.

    As I am myself a product of the French Lycee (high school) system and more specifically the Scientific BAC, I was hoping to answer questions I always had regarding the test as well as provide insights to future BAC takers. Before delving deeper, it is important to highlight the fact that a “departement” (correctly spelled with an “e” after the “t”) is a French word denoting a region of France which has its own jurisdiction to a certain extent.

    There are 3 general Baccalaureate sections offered to students, for which I analyzed the passing rates:

    S: Scientific

    ES: Economic & Social

    L: Literary

    I sought to create a general platform that french students could use to compare the performance of their school/departement against that of other students.

    Questions I wanted to answer:

    1) With S (Scientific) being the section geared towards hard sciences, I assumed that on average more students would gravitate towards ES (economic & social) which is less quantitative. Is this valid?

    2) In opposition to 1), there is a perpetuating stereotype that “smart” students choose S regardless of whether or not they are interested in science, which would contradict 1) and push more students towards S. Is this valid?

    3) L is considered to have some of the most talented and lazy students, meaning distributions of their passing rates should be more spread out compared to S and ES. Is this valid?

    4) With the option of “re-dos” for students with grades between [8;10] out of 20 and all 210 students in my graduating class having passed their BAC I expected passing rates to be above 95%. Is this valid?

    5) What are the strongest performing and weakest performing Lycees and departements in France? Are there any patterns of strong and weak Lycees being concentrated in certain departements?

    II. The Data Set

    I obtained the Lycee-level data on the French Government’s website. I made several modifications to the data set before building out my shiny app (a small dplyr sketch of these steps follows the list):

    1) Column names that were duplicates were modified for clarity purposes

    2) Created a rank feature for each school

    3) Created a feature to isolate non-technical baccalaureates

    4) Modified the two Corsica departement codes from 2A and 2B to 29 and 30 to allow for mapping departement names over a map

    5) French special characters were substituted for their traditional counterparts
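    A hedged sketch of what those five steps could look like with dplyr is shown below; the data frame bac and every column name used here are hypothetical, since the original preparation code is not reproduced in this post.

    library(dplyr)
    library(stringi)

    bac_clean <- bac %>%
      # 1) give duplicated column names clearer, unique names (hypothetical names)
      rename(passing_rate_s = passing_rate_1, passing_rate_es = passing_rate_2) %>%
      # 2) rank each school by its passing rate
      mutate(rank = min_rank(desc(passing_rate_s))) %>%
      # 3) flag the non-technical baccalaureates (S, ES, L)
      mutate(non_technical = section %in% c("S", "ES", "L")) %>%
      # 4) recode the two Corsica departement codes so they map cleanly
      mutate(departement_code = recode(departement_code, "2A" = "29", "2B" = "30")) %>%
      # 5) replace French special characters with their unaccented counterparts
      mutate(lycee = stri_trans_general(lycee, "Latin-ASCII"))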

    III. Data Analysis

    I built my shiny app to explore Baccalaureate passing rates from 3 main perspectives so that users can answer any potential question they may have:

    1. National: One can examine the passing rates for all three sections individually as well as combined on an interactive map of France. Hovering your cursor over a departement will show the departement’s name as well as its passing rate for the BAC type selected (its passing rate is the average passing rate of all the schools located in that departement). In addition, 3 widgets under the map will show the user what the country’s maximum, average and minimum passing rates are for a given section.

    2. Departement: One can examine passing rates by selected departement for several different key statistics as shown below. All four charts can be combined to build one’s understanding of the key drivers of a departement’s performance.

    3. Lycee: One can finally study the passing rates and student distribution amongst the sections for individual schools via a table. The table offers multiple ways the user can find the information they are looking for:

    1. there is a search bar to find a specific lycee
    2. the user can click on any of the column headers to sort by ascending or descending
    3. the user can control how many schools are displayed at once

    IV. Feedback

    If you have any feedback or questions, please do not hesitate to contact me via LinkedIn.



    Analysis of South African Funds

    Tue, 01/08/2019 - 20:00

    (This article was first published on Digital Age Economist on Digital Age Economist, and kindly contributed to R-bloggers)

    Packages used in this post

    Disclaimer: I am no financial advisor, have never been one, and you should not take any of this analysis as investment advice. These thoughts are my own; please don't mail me about your money strategies/problems. I enjoy numbers, scraping and data analysis, and that is what this post is about. Also, do you really want to trust someone who wrote this post at 2am… no, you don't.

    library(tidyverse)
    library(glue)
    library(rvest)
    library(gridExtra)
    library(ggpubr)
    library(scales)
    library(ggthemes)
    theme_set(theme_economist())

    Why even do this?

    Well, I recently decided to do a quick overview of some of the funds I am invested in as part of a retirement annuity. It's been almost 4 years since I started the RA, and I thought I would just review the fund's performance, its asset composition and the fees that have been charged over the period.

    If you do not know what an RA is and why you would need one, you can think of it as a way to get back at the tax man while investing over a long period. In South Africa certain funds are what are called Regulation 28 compliant, which means any money you put into such a fund is tax-exempt – but you just have to leave it there till you retire, which isn't actually a bad thing thanks to the miracle of compounded growth. I think having an RA is a cornerstone of any good financial planning. So enough of sounding like your parents.

    OK, so at this point in time, you are probably thinking something like this:

    Well, actually you are kind of right to think this. If you invest in a retirement annuity on a regular basis and over the contract period, you should have some money for retirement, in theory. However, the market for funds is huge, and which one do you pick? Financial advisers could give you good advice, but it's also in their interest to invest you in multiple complex funds, as they get management fees in return1.

    The easiest way to get to grips with the cost involved when investing in a fund is what is called the Total Expense Ratio (TER). These are important, because if the fund is costing you 3.5%, the annualised growth of the fund is 8% and inflation is 6%, then you are not gaining the advantage of compounded growth but actually losing 1.5% a year in real terms. An example of a TER is as follows:

    Investopedia gives a good explanation of these fees

    The total expense ratio (TER) is a measure of the total costs associated with managing and operating an investment fund, such as a mutual fund. These costs consist primarily of management fees and additional expenses, such as trading fees, legal fees, auditor fees and other operational expenses. The total cost of the fund is divided by the fund’s total assets to arrive at a percentage amount, which represents the TER, most often referred to as simply “expense ratio”
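    To make the arithmetic from the earlier paragraph concrete (8% annualised growth, a 3.5% TER and 6% inflation), here is a quick check in R:

    # rough real return = nominal growth - TER - inflation
    growth <- 0.08; ter <- 0.035; inflation <- 0.06
    growth - ter - inflation
    ## [1] -0.015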

    So what I set out to do is look at the TERs for South African funds and then compare them with their performance over the last 5 years. To collect the data, I scraped all the information from Fundsdata.co.za, as they have a wide collection of different funds and categories.

    I used rvest and combined it with a bit of purrr magic that got me the results I wanted. The data was a bit dirty so I just cleaned one or two things before starting the analysis.

    main_pg <- read_html("http://www.fundsdata.co.za/navs/default.htm")

    pg_funds <- main_pg %>%
      html_nodes("tr.pagelink") %>%
      html_nodes("a") %>%
      html_text()

    pg_links <- main_pg %>%
      html_nodes("tr.pagelink") %>%
      html_nodes("a") %>%
      html_attr("href")

    all_funds <- tibble(funds = pg_funds,
                        link = glue("http://www.fundsdata.co.za/navs/{pg_links}"))

    link <- "http://www.fundsdata.co.za/navs/ZEGN.htm"

    collect_ter <- possibly(function(link){
      cat(glue("Now collecting {link}"), "\n")
      pg <- read_html(link)
      pg_tbl <- html_table(pg, fill = T)[[1]][, -20]
      x <- pg_tbl %>%
        slice(-c(1:2)) %>%
        set_names(make.names(pg_tbl[2,], unique = T)) %>%
        tbl_df %>%
        slice(-c(n():(n()-3)))
      Sys.sleep(10)
      x
    }, NULL)

    all_funds <- all_funds %>%
      filter(grepl("South Africa", funds)) %>%
      mutate(info = map(link, collect_ter))

    #saveRDS(all_funds, "all_funds.rds")

    After the collection is done, you have a neat tibble that looks like this:

    ## # A tibble: 6 x 3 ## funds link info ## ## 1 South African - Equity - General Funds http://www.fu~

    Next I wanted to calculate the annualised rate of return of each of the funds, just to get an idea of what kind of returns are out there. The data only goes back 5 years, so I am going to work from that.

    The formula for CAGR (compounded annual growth rate) is simply (a one-line R version follows the definitions below):

    \(CAGR = (\frac{F}{P})^{1/n} - 1\)

    Where:

    • F is future value
    • P is present value
    • n is number of years
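    Here is the promised one-line R version of the formula (the numbers in the example call are made up for illustration):

    # fv = F (future value), pv = P (present value); renamed to avoid clashing with R's F/T shorthands
    cagr <- function(fv, pv, n) (fv / pv)^(1 / n) - 1
    cagr(fv = 160, pv = 100, n = 5)   # about 0.0986, i.e. roughly 9.9% a year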

    Once I have the annualised return, I add one or two extra columns of information. I calculate a Sharpe ratio as a measure of risk-adjusted returns (assuming a 5% risk-free rate), and I also adjust the CAGR for inflation2. The website was also kind enough to add an asterisk (*) to the sheet indicating funds which are eligible for the tax break, aka Reg28 compliant.

    all_funds <- readRDS("data/all_funds.rds") annual_ret <- function(info){ info %>% select(Fund.Name, TER, X5.yearsCash.Value, Volatility.Ann...) %>% set_names(c("fund_name", "TER", "cash_value_5yrs", "volatility")) %>% mutate_at(vars(-one_of("fund_name")), as.numeric) %>% mutate( # growth rate annualised_ret = ((cash_value_5yrs/100)^(1/5)-1)* 100, # risk adjusted return risk_adj_cagr = annualised_ret * (1 - volatility), # sharpe ration for risk adjusted return sharpe = (annualised_ret - 5)/volatility, # inflation adjustment inflation_adj_cagr = annualised_ret - 5.5) %>% mutate(reg_28_compliant = grepl("[*]", fund_name)) } all_funds_ret <- all_funds %>% mutate(returns = map(info, annual_ret)) %>% select(funds, returns) %>% unnest %>% arrange(-annualised_ret) %>% filter(!is.na(TER)) all_funds_ret ## # A tibble: 1,182 x 10 ## funds fund_name TER cash_value_5yrs volatility annualised_ret ## ## 1 South Af~ Absa Propert~ 1.72 175. 19.1 11.9 ## 2 South Af~ Long Beach F~ 1.53 167. 14.9 10.9 ## 3 South Af~ Nedgroup Inv~ 1.36 166. 10.7 10.7 ## 4 South Af~ Nedgroup Inv~ 1.93 162. 10.7 10.1 ## 5 South Af~ Momentum Fin~ 1.46 161. 11.1 9.94 ## 6 South Af~ Fairtree Fle~ 0.900 160. 3.08 9.82 ## 7 South Af~ Satrix FINI ~ 0.410 159. 14.1 9.78 ## 8 South Af~ Fairtree Equ~ 1.16 158. 12.7 9.58 ## 9 South Af~ *Pan-African~ 0.770 158. 3.80 9.51 ## 10 South Af~ Sesfikile BC~ 1.28 156. 11.5 9.37 ## # ... with 1,172 more rows, and 4 more variables: risk_adj_cagr , ## # sharpe , inflation_adj_cagr , reg_28_compliant

    Now that we have all the pieces, lets see what the average TER is for funds in South Africa.

    OK doki, so what we see is that, on average, the non-Reg28 funds are cheaper, with the median TER sitting around 1.21%. This means that we will need to choose quite carefully when we decide which fund to invest in.
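    For reference, a minimal sketch of that comparison using the columns created above (the post's own plotting code is not shown):

    library(dplyr)
    all_funds_ret %>%
      group_by(reg_28_compliant) %>%
      summarise(median_TER = median(TER, na.rm = TRUE),
                n_funds = n())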

    Next I wanted to see what the distribution of returns looks like between the Reg28 and non-Reg28 funds3.

    It's nice to see that although the fees are higher for the Reg28 funds, the returns also seem higher, with a bi-modal distribution. This could be something like a sector effect. I also wanted to check whether there was some statistical significance to this, so I conducted a Wilcoxon test. A Wilcoxon test is a non-parametric test of whether the true location shift is equal to 0. Here we reject the null hypothesis and confirm that Reg28 funds do indeed have higher returns.
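    A hedged sketch of that test, assuming the all_funds_ret data frame built above:

    # non-parametric test of a location shift in annualised returns
    # between Reg28 and non-Reg28 funds
    wilcox.test(annualised_ret ~ reg_28_compliant, data = all_funds_ret)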

    At what cost do the higher returns come? Are the Reg28 funds more risky or more risk-averse? To answer this question, let's turn to our Sharpe ratio. Anything above 1 is considered good, and above 2 we are starting to see some really good returns on low-risk investments. There were some outliers, so I removed all values above 3.5 so as to focus on the meat of the data.

    It's by a small margin, but the Reg28 funds seem to be delivering higher returns with a lower risk profile.

    The last bit of analysis I did was to have a look at the inflation adjusted returns over the last 5 years. The market took quite a beating over the last year, mostly due to some of our presidential factors and also because of some “other” presidential factors in the developed world. So lets look at how the different market sectors did over the last 5 years.

    Financial equities did well, along with some interest-bearing funds. This is not completely out of place. Five years can be considered a short investment window. Our equity market is also highly concentrated, with Naspers driving a lot of the growth in recent years through its investment in a company called Tencent in China.

    Conclusion

    This write-up was partly to get a grip on how my own RA is doing amid the sea of other options. I also wanted to get an idea of what a reasonable TER is and whether the higher-TER funds deliver a good return – i.e. are the complex funds delivering higher returns? It would seem so, but even the best funds haven't beaten the inflation trackers. This is basically because of the short period I am looking at, as well as the fact that 2018 sucked for investments in SA (and globally).

    The other part of why I wrote this blog, was because I wanted to explore the ggpubr package. Really enjoy some of the features that it has when it comes to quick reporting and visualisation.

    1. Again, I am not saying that financial advisers are the devil, I am exaggerating for effect. I have a financial adviser

    2. Doing a quick check, in SA the average inflation over the past 5 years was ~5.5%

    3. From here on in I used the ggpubr package quite a lot. Really enjoyed it. Do wish it had NSE though



    AzureR packages now on CRAN

    Tue, 01/08/2019 - 18:00

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    The suite of AzureR packages for interfacing with Azure services from R is now available on CRAN. If you missed the earlier announcements, this means you can now use the install.packages function in R to install these packages, rather than having to install from the Github repositories. Updated versions of these packages will also be posted to CRAN, so you can get the latest versions simply by running update.packages.
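    For example, a first-time install of the whole family and a later refresh from CRAN look like this (package names as listed below):

    # one-off install from CRAN
    install.packages(c("AzureRMR", "AzureVM", "AzureStor", "AzureContainers"))

    # later on, pick up any updated versions posted to CRAN
    update.packages(ask = FALSE)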

    The packages are:

    • AzureVM (CRAN): a package for working with virtual machines and virtual machine clusters. It allows you to easily deploy, manage and delete a VM from R. A number of templates are provided based on the Data Science Virtual Machine, or you can also supply your own template. Introductory vignette.
    • AzureStor: a package for working with storage accounts. In addition to a Resource Manager interface for creating and deleting storage accounts, it also provides a client interface for working with the storage itself. You can upload and download files and blobs, list file shares and containers, list files within a share, and so on. It supports authenticated access to storage via either a key or a shared access signature (SAS). Introductory vignette.
    • AzureContainers: a package for working with containers in Azure: specifically, it provides an interface to Azure Container Instances, Azure Container Registry and Azure Kubernetes Service. You can easily create an image, push it to an ACR repo, and then deploy it as a service to AKS. It also provides lightweight shells to docker, kubectl and helm (assuming these are installed). Vignettes on deploying with the plumber package and with Microsoft ML Server
    • AzureRMR: the base package of the AzureR family for interfacing with the Azure Resource Manager and which can be extended to interface with other services. There's an introductory vignette, and another on extending AzureRMR.

    Click on the first link in each bullet above for a blog post introducing each package. The introductory vignette listed with each package provides further details, and will be up-to-date for the latest version of the package.

    Feedback is welcome on all of these packages, which can be found as part of the cloudyr project on Github.



    A Beautiful 2 by 2 Matrix Identity

    Tue, 01/08/2019 - 17:57

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    While working on a variation of the RcppDynProg algorithm we derived the following beautiful identity of 2 by 2 real matrices:

    The superscript “top” denotes the transpose operation, ||.||^2_2 denotes the sum-of-squares norm, and the single |.| denotes the determinant.

    This is derived from one of the check equations for the Moore–Penrose inverse and we have details of the derivation here, and details of the messy algebra here.



    Where does .Renviron live on Citrix?

    Tue, 01/08/2019 - 05:00

    (This article was first published on R on datawookie, and kindly contributed to R-bloggers)

    At one of my clients I run RStudio under Citrix in order to have access to their data.

    For the most part this works fine. However, every time I visit them I spend the first few minutes of my day installing packages because my environment does not seem to be persisted from one session to the next.

    I finally had a gap and decided to fix the problem.

    Where are the packages being installed?

    Installed packages just spontaneously disappear… That’s weird. Where are they being installed?

    # Install package under Citrix. > install.packages("devtools") Installing package into ‘C:/Users/andrewcol/Documents/R/win-library/3.5’

    It looks like the packages are being installed into a folder on my C: drive. But if I look for that folder it’s not there.

    How does this make sense?

    A bit of investigation revealed that the folder is actually on the C: drive on the Citrix server (rather than my local machine).

    I confirmed this by installing the same package using RStudio running on my local machine.

    # Install package on local. > install.packages("devtools") Installing package into ‘C:/Users/AndrewCol/Documents/R/win-library/3.4’

    Note the difference in the path (andrewcol versus AndrewCol) and R versions.

    Environment

    I know that files on my network H: drive will not disappear between sessions, so this seems like a natural place to install packages. I can make this happen by changing the R_LIBS_USER environment variable.

    # Environment on Citrix. > Sys.getenv("R_LIBS_USER") [1] "C:/Users/andrewcol/Documents/R/win-library/3.5"

    We’ll assign a new value by setting up a .Renviron file. Where should this file live? On my local C: drive, my network H: drive or the Citrix server C: drive?

    # Environment on Citrix. > Sys.getenv("R_USER") [1] "C:/Users/andrewcol/Documents"

    Okay, so it needs to go on the Citrix C: drive. This is really the crux of the entire process, because if you put it on H: or local C: then it will not be picked up by RStudio on Citrix.

    Writing to Citrix C:

    It turns out that it’s remarkably difficult to access C: on the Citrix server. I couldn’t get there using File Explorer. I couldn’t create a new file there using Notepad or RStudio.

    No problem! Write an R script.

    renviron <- file("C:/Users/andrewcol/Documents/.Renviron", "w")
    cat(
      "R_LIBS_USER='H:/myCitrixFiles/Documents/R/win-library/3.5'",
      "R_USER='H:/myCitrixFiles/Documents'",
      file = renviron,
      sep = "\n"
    )
    close(renviron)

    That’ll create the .Renviron file in the right place and insert the required content.

    Quick check that it’s been created.

    setwd("C:/Users/andrewcol/Documents")
    list.files(all.files = TRUE)
    readLines(".Renviron")

    Restart RStudio under Citrix. Packages will be installed to H: and should not mysteriously disappear between sessions.
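    As a quick sanity check after the restart (a hedged sketch, using the paths configured above):

    # confirm the new environment variable is picked up and H: appears in the library path
    Sys.getenv("R_LIBS_USER")
    .libPaths()
    install.packages("devtools")   # should now install under H:/myCitrixFiles/Documents/R/win-library/3.5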



    Dow Jones Stock Market Index (3/4): Log Returns GARCH Model

    Tue, 01/08/2019 - 04:12

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

      Categories

      1. Advanced Modeling

      Tags

      1. Data Visualisation
      2. Linear Regression
      3. R Programming

      In this third post, I am going to build an ARMA-GARCH model for Dow Jones Industrial Average (DJIA) daily log-returns. You can read the first and second part which I published previously.

      Packages

      The packages being used in this post series are herein listed.

      suppressPackageStartupMessages(library(lubridate))
      suppressPackageStartupMessages(library(fBasics))
      suppressPackageStartupMessages(library(lmtest))
      suppressPackageStartupMessages(library(urca))
      suppressPackageStartupMessages(library(ggplot2))
      suppressPackageStartupMessages(library(quantmod))
      suppressPackageStartupMessages(library(PerformanceAnalytics))
      suppressPackageStartupMessages(library(rugarch))
      suppressPackageStartupMessages(library(FinTS))
      suppressPackageStartupMessages(library(forecast))
      suppressPackageStartupMessages(library(strucchange))
      suppressPackageStartupMessages(library(TSA))

      Getting Data

      We upload the environment status as saved at the end of part 2.

      load(file='DowEnvironment.RData')

      This is the plot of DJIA daily log-returns.

      plot(dj_ret)

      Outliers Detection

      The Return.clean function within the PerformanceAnalytics package is able to clean return time series from outliers. Here below we compare the original time series with the outlier-adjusted one.

      dj_ret_outliersadj <- Return.clean(dj_ret, "boudt")
      p <- plot(dj_ret)
      p <- addSeries(dj_ret_outliersadj, col = 'blue', on = 1)
      p

      The rest of the analysis will be carried out on the original time series, as a more conservative approach to volatility evaluation.

      Correlation plots

      Here below are the autocorrelation (ACF) and partial autocorrelation (PACF) plots.

      acf(dj_ret)

      pacf(dj_ret)

      The above correlation plots suggest some ARMA(p,q) model with p and q > 0. That will be verified as the analysis proceeds.

      Unit root tests

      We run the Augmented Dickey-Fuller test as available within the urca package, in its no-trend and no-drift flavor.

      (urdftest_lag = floor(12* (nrow(dj_ret)/100)^0.25)) ## [1] 28 summary(ur.df(dj_ret, type = "none", lags = urdftest_lag, selectlags="BIC")) ## ## ############################################### ## # Augmented Dickey-Fuller Test Unit Root Test # ## ############################################### ## ## Test regression none ## ## ## Call: ## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.081477 -0.004141 0.000762 0.005426 0.098777 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## z.lag.1 -1.16233 0.02699 -43.058 < 2e-16 *** ## z.diff.lag 0.06325 0.01826 3.464 0.000539 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.01157 on 2988 degrees of freedom ## Multiple R-squared: 0.5484, Adjusted R-squared: 0.5481 ## F-statistic: 1814 on 2 and 2988 DF, p-value: < 2.2e-16 ## ## ## Value of test-statistic is: -43.0578 ## ## Critical values for test statistics: ## 1pct 5pct 10pct ## tau1 -2.58 -1.95 -1.62

      Based on the reported test statistic compared with the critical values, we reject the null hypothesis of a unit root. See ref. [6] for further details about the Augmented Dickey-Fuller test.

      ARMA model

      We now determine the ARMA structure of our time series in order to run the ARCH effects test on the resulting residuals. That is in agreement with what is outlined in ref. [4], §4.3.

      The tailing off of the ACF and PACF plots suggests an ARMA(2,2) model (ref. [5], Table 3.1). We take advantage of the auto.arima() function within the forecast package (ref. [7]) to get an idea of where to start.

      auto_model <- auto.arima(dj_ret) summary(auto_model) ## Series: dj_ret ## ARIMA(2,0,4) with zero mean ## ## Coefficients: ## ar1 ar2 ma1 ma2 ma3 ma4 ## 0.4250 -0.8784 -0.5202 0.8705 -0.0335 -0.0769 ## s.e. 0.0376 0.0628 0.0412 0.0672 0.0246 0.0203 ## ## sigma^2 estimated as 0.0001322: log likelihood=9201.19 ## AIC=-18388.38 AICc=-18388.34 BIC=-18346.29 ## ## Training set error measures: ## ME RMSE MAE MPE MAPE MASE ## Training set 0.0002416895 0.01148496 0.007505056 NaN Inf 0.6687536 ## ACF1 ## Training set -0.002537238

      An ARMA(2,4) model is suggested. However, the ma3 coefficient is not significant, as further verified by:

      coeftest(auto_model) ## ## z test of coefficients: ## ## Estimate Std. Error z value Pr(>|z|) ## ar1 0.425015 0.037610 11.3007 < 2.2e-16 *** ## ar2 -0.878356 0.062839 -13.9779 < 2.2e-16 *** ## ma1 -0.520173 0.041217 -12.6204 < 2.2e-16 *** ## ma2 0.870457 0.067211 12.9511 < 2.2e-16 *** ## ma3 -0.033527 0.024641 -1.3606 0.1736335 ## ma4 -0.076882 0.020273 -3.7923 0.0001492 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      Hence we constrain the MA order to q <= 2.

      auto_model2 <- auto.arima(dj_ret, max.q=2) summary(auto_model2) ## Series: dj_ret ## ARIMA(2,0,2) with zero mean ## ## Coefficients: ## ar1 ar2 ma1 ma2 ## -0.5143 -0.4364 0.4212 0.3441 ## s.e. 0.1461 0.1439 0.1512 0.1532 ## ## sigma^2 estimated as 0.0001325: log likelihood=9196.33 ## AIC=-18382.66 AICc=-18382.64 BIC=-18352.6 ## ## Training set error measures: ## ME RMSE MAE MPE MAPE MASE ## Training set 0.0002287171 0.01150361 0.007501925 Inf Inf 0.6684746 ## ACF1 ## Training set -0.002414944

      Now all coefficients are significant.

      coeftest(auto_model2) ## ## z test of coefficients: ## ## Estimate Std. Error z value Pr(>|z|) ## ar1 -0.51428 0.14613 -3.5192 0.0004328 *** ## ar2 -0.43640 0.14392 -3.0322 0.0024276 ** ## ma1 0.42116 0.15121 2.7853 0.0053485 ** ## ma2 0.34414 0.15323 2.2458 0.0247139 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      Further verifications with ARMA(2,1) and ARMA(1,2) result in higher AIC values than ARMA(2,2). Hence ARMA(2,2) is preferable. Here are the results.

      auto_model3 <- auto.arima(dj_ret, max.p=2, max.q=1) summary(auto_model3) ## Series: dj_ret ## ARIMA(2,0,1) with zero mean ## ## Coefficients: ## ar1 ar2 ma1 ## -0.4619 -0.1020 0.3646 ## s.e. 0.1439 0.0204 0.1438 ## ## sigma^2 estimated as 0.0001327: log likelihood=9194.1 ## AIC=-18380.2 AICc=-18380.19 BIC=-18356.15 ## ## Training set error measures: ## ME RMSE MAE MPE MAPE MASE ## Training set 0.0002370597 0.01151213 0.007522059 Inf Inf 0.6702687 ## ACF1 ## Training set 0.0009366271 coeftest(auto_model3) ## ## z test of coefficients: ## ## Estimate Std. Error z value Pr(>|z|) ## ar1 -0.461916 0.143880 -3.2104 0.001325 ** ## ar2 -0.102012 0.020377 -5.0062 5.552e-07 *** ## ma1 0.364628 0.143818 2.5353 0.011234 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      All coefficients are statistically significant.

      auto_model4 <- auto.arima(dj_ret, max.p=1, max.q=2) summary(auto_model4) ## Series: dj_ret ## ARIMA(1,0,2) with zero mean ## ## Coefficients: ## ar1 ma1 ma2 ## -0.4207 0.3259 -0.0954 ## s.e. 0.1488 0.1481 0.0198 ## ## sigma^2 estimated as 0.0001328: log likelihood=9193.01 ## AIC=-18378.02 AICc=-18378 BIC=-18353.96 ## ## Training set error measures: ## ME RMSE MAE MPE MAPE MASE ## Training set 0.0002387398 0.0115163 0.007522913 Inf Inf 0.6703448 ## ACF1 ## Training set -0.001958194 coeftest(auto_model4) ## ## z test of coefficients: ## ## Estimate Std. Error z value Pr(>|z|) ## ar1 -0.420678 0.148818 -2.8268 0.004702 ** ## ma1 0.325918 0.148115 2.2004 0.027776 * ## ma2 -0.095407 0.019848 -4.8070 1.532e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      All coefficients are statistically significant.

      Furthermore, we investigate what eacf() function within the TSA package reports.

      eacf(dj_ret) ## AR/MA ## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 ## 0 x x x o x o o o o o o o o x ## 1 x x o o x o o o o o o o o o ## 2 x o o x x o o o o o o o o o ## 3 x o x o x o o o o o o o o o ## 4 x x x x x o o o o o o o o o ## 5 x x x x x o o x o o o o o o ## 6 x x x x x x o o o o o o o o ## 7 x x x x x o o o o o o o o o

      The vertex of the upper-left triangle of “o” entries seems to be located somewhere within (p,q) = {(1,2), (2,2), (1,3)}, which represents a set of potential candidate (p,q) values according to the eacf function output. Note that we prefer parsimonious models, which is why we do not push the AR and MA orders too far.

      ARMA(1,2) model was already verified. ARMA(2,2) is already a candidate model. Let us verify ARMA(1,3).

      (arima_model5 <- arima(dj_ret, order=c(1,0,3), include.mean = FALSE)) ## ## Call: ## arima(x = dj_ret, order = c(1, 0, 3), include.mean = FALSE) ## ## Coefficients: ## ar1 ma1 ma2 ma3 ## -0.2057 0.1106 -0.0681 0.0338 ## s.e. 0.2012 0.2005 0.0263 0.0215 ## ## sigma^2 estimated as 0.0001325: log likelihood = 9193.97, aic = -18379.94 coeftest(arima_model5) ## ## z test of coefficients: ## ## Estimate Std. Error z value Pr(>|z|) ## ar1 -0.205742 0.201180 -1.0227 0.306461 ## ma1 0.110599 0.200475 0.5517 0.581167 ## ma2 -0.068124 0.026321 -2.5882 0.009647 ** ## ma3 0.033832 0.021495 1.5739 0.115501 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      Only one coefficient is statistically significant.

      In conclusion, we choose ARMA(2,2) as the mean model. We can now proceed with the ARCH effects test.

      ARCH effect test

      Now we can test whether any ARCH effects are present in the residuals of our model. If ARCH effects are statistically significant in the residuals of our time series, a GARCH model is required.

      model_residuals <- residuals(auto_model2) ArchTest(model_residuals - mean(model_residuals)) ## ## ARCH LM-test; Null hypothesis: no ARCH effects ## ## data: model_residuals - mean(model_residuals) ## Chi-squared = 986.82, df = 12, p-value < 2.2e-16

      Based on reported p-value, we reject the null hypothesis of no ARCH effects.

      Let us have a look at the residual correlation plots.

      par(mfrow=c(1,2))
      acf(model_residuals)
      pacf(model_residuals)

      Conditional Volatility

      The conditional mean and variance are defined as:

      \[
      \mu_{t}\ :=\ E(r_{t}|\ F_{t-1}) \\
      \sigma^2_{t}\ :=\ Var(r_{t}|\ F_{t-1})\ =\ E[(r_t-\mu_{t})^2| F_{t-1}]
      \]

      The conditional volatility can be computed as square root of the conditional variance. See ref. [4] for further details.

      eGARCH Model

      Attempts with sGARCH as the variance model did not result in significant coefficients. In contrast, the exponential GARCH (eGARCH) variance model is capable of capturing asymmetries within the volatility shocks.

      To inspect asymmetries within the DJIA log returns, summary statistics and density plot are shown.

      basicStats(dj_ret) ## DJI.Adjusted ## nobs 3019.000000 ## NAs 0.000000 ## Minimum -0.082005 ## Maximum 0.105083 ## 1. Quartile -0.003991 ## 3. Quartile 0.005232 ## Mean 0.000207 ## Median 0.000551 ## Sum 0.625943 ## SE Mean 0.000211 ## LCL Mean -0.000206 ## UCL Mean 0.000621 ## Variance 0.000134 ## Stdev 0.011593 ## Skewness -0.141370 ## Kurtosis 10.200492

      The negative skewness value confirms the presence of asymmetries within the DJIA distribution.

      This gives the density plot.

      plot(density(dj_ret))

      We propose the eGARCH model as the variance model (for the conditional variance). More precisely, we are about to fit an ARMA-GARCH model, with ARMA(2,2) as the mean model and exponential GARCH(1,1) as the variance model. Before doing that, we further show why ARMA(0,0) is not satisfactory within this context.

      ARMA-GARCH: ARMA(0,0) + eGARCH(1,1)

      garchspec <- ugarchspec(mean.model = list(armaOrder = c(0,0), include.mean = TRUE), variance.model = list(model = "eGARCH", garchOrder = c(1, 1)), distribution.model = "sstd") (garchfit <- ugarchfit(data = dj_ret, spec = garchspec)) ## ## *---------------------------------* ## * GARCH Model Fit * ## *---------------------------------* ## ## Conditional Variance Dynamics ## ----------------------------------- ## GARCH Model : eGARCH(1,1) ## Mean Model : ARFIMA(0,0,0) ## Distribution : sstd ## ## Optimal Parameters ## ------------------------------------ ## Estimate Std. Error t value Pr(>|t|) ## mu 0.000303 0.000117 2.5933 0.009506 ## omega -0.291302 0.016580 -17.5699 0.000000 ## alpha1 -0.174456 0.013913 -12.5387 0.000000 ## beta1 0.969255 0.001770 547.6539 0.000000 ## gamma1 0.188918 0.021771 8.6773 0.000000 ## skew 0.870191 0.021763 39.9848 0.000000 ## shape 6.118380 0.750114 8.1566 0.000000 ## ## Robust Standard Errors: ## Estimate Std. Error t value Pr(>|t|) ## mu 0.000303 0.000130 2.3253 0.020055 ## omega -0.291302 0.014819 -19.6569 0.000000 ## alpha1 -0.174456 0.016852 -10.3524 0.000000 ## beta1 0.969255 0.001629 595.0143 0.000000 ## gamma1 0.188918 0.031453 6.0063 0.000000 ## skew 0.870191 0.022733 38.2783 0.000000 ## shape 6.118380 0.834724 7.3298 0.000000 ## ## LogLikelihood : 10138.63 ## ## Information Criteria ## ------------------------------------ ## ## Akaike -6.7119 ## Bayes -6.6980 ## Shibata -6.7119 ## Hannan-Quinn -6.7069 ## ## Weighted Ljung-Box Test on Standardized Residuals ## ------------------------------------ ## statistic p-value ## Lag[1] 5.475 0.01929 ## Lag[2*(p+q)+(p+q)-1][2] 6.011 0.02185 ## Lag[4*(p+q)+(p+q)-1][5] 7.712 0.03472 ## d.o.f=0 ## H0 : No serial correlation ## ## Weighted Ljung-Box Test on Standardized Squared Residuals ## ------------------------------------ ## statistic p-value ## Lag[1] 1.342 0.2467 ## Lag[2*(p+q)+(p+q)-1][5] 2.325 0.5438 ## Lag[4*(p+q)+(p+q)-1][9] 2.971 0.7638 ## d.o.f=2 ## ## Weighted ARCH LM Tests ## ------------------------------------ ## Statistic Shape Scale P-Value ## ARCH Lag[3] 0.3229 0.500 2.000 0.5699 ## ARCH Lag[5] 1.4809 1.440 1.667 0.5973 ## ARCH Lag[7] 1.6994 2.315 1.543 0.7806 ## ## Nyblom stability test ## ------------------------------------ ## Joint Statistic: 4.0468 ## Individual Statistics: ## mu 0.2156 ## omega 1.0830 ## alpha1 0.5748 ## beta1 0.8663 ## gamma1 0.3994 ## skew 0.1044 ## shape 0.4940 ## ## Asymptotic Critical Values (10% 5% 1%) ## Joint Statistic: 1.69 1.9 2.35 ## Individual Statistic: 0.35 0.47 0.75 ## ## Sign Bias Test ## ------------------------------------ ## t-value prob sig ## Sign Bias 1.183 0.23680 ## Negative Sign Bias 2.180 0.02932 ** ## Positive Sign Bias 1.554 0.12022 ## Joint Effect 8.498 0.03677 ** ## ## ## Adjusted Pearson Goodness-of-Fit Test: ## ------------------------------------ ## group statistic p-value(g-1) ## 1 20 37.24 0.00741 ## 2 30 42.92 0.04633 ## 3 40 52.86 0.06831 ## 4 50 65.55 0.05714 ## ## ## Elapsed time : 0.6527421

      All coefficients are statistically significant. However, from the p-values of the Weighted Ljung-Box Test on Standardized Residuals reported above (as part of the overall report), we have confirmation that such a model does not capture all the structure (we reject the null hypothesis of no correlation within the residuals).

      We therefore proceed by specifying ARMA(2,2) as the mean model within our GARCH fit, as shown below.

      ARMA-GARCH: ARMA(2,2) + eGARCH(1,1) garchspec <- ugarchspec(mean.model = list(armaOrder = c(2,2), include.mean = FALSE), variance.model = list(model = "eGARCH", garchOrder = c(1, 1)), distribution.model = "sstd") (garchfit <- ugarchfit(data = dj_ret, spec = garchspec)) ## ## *---------------------------------* ## * GARCH Model Fit * ## *---------------------------------* ## ## Conditional Variance Dynamics ## ----------------------------------- ## GARCH Model : eGARCH(1,1) ## Mean Model : ARFIMA(2,0,2) ## Distribution : sstd ## ## Optimal Parameters ## ------------------------------------ ## Estimate Std. Error t value Pr(>|t|) ## ar1 -0.47642 0.026115 -18.2433 0 ## ar2 -0.57465 0.052469 -10.9523 0 ## ma1 0.42945 0.025846 16.6157 0 ## ma2 0.56258 0.054060 10.4066 0 ## omega -0.31340 0.003497 -89.6286 0 ## alpha1 -0.17372 0.011642 -14.9222 0 ## beta1 0.96598 0.000027 35240.1590 0 ## gamma1 0.18937 0.011893 15.9222 0 ## skew 0.84959 0.020063 42.3469 0 ## shape 5.99161 0.701313 8.5434 0 ## ## Robust Standard Errors: ## Estimate Std. Error t value Pr(>|t|) ## ar1 -0.47642 0.007708 -61.8064 0 ## ar2 -0.57465 0.018561 -30.9608 0 ## ma1 0.42945 0.007927 54.1760 0 ## ma2 0.56258 0.017799 31.6074 0 ## omega -0.31340 0.003263 -96.0543 0 ## alpha1 -0.17372 0.012630 -13.7547 0 ## beta1 0.96598 0.000036 26838.0412 0 ## gamma1 0.18937 0.013003 14.5631 0 ## skew 0.84959 0.020089 42.2911 0 ## shape 5.99161 0.707324 8.4708 0 ## ## LogLikelihood : 10140.27 ## ## Information Criteria ## ------------------------------------ ## ## Akaike -6.7110 ## Bayes -6.6911 ## Shibata -6.7110 ## Hannan-Quinn -6.7039 ## ## Weighted Ljung-Box Test on Standardized Residuals ## ------------------------------------ ## statistic p-value ## Lag[1] 0.03028 0.8619 ## Lag[2*(p+q)+(p+q)-1][11] 5.69916 0.6822 ## Lag[4*(p+q)+(p+q)-1][19] 12.14955 0.1782 ## d.o.f=4 ## H0 : No serial correlation ## ## Weighted Ljung-Box Test on Standardized Squared Residuals ## ------------------------------------ ## statistic p-value ## Lag[1] 1.666 0.1967 ## Lag[2*(p+q)+(p+q)-1][5] 2.815 0.4418 ## Lag[4*(p+q)+(p+q)-1][9] 3.457 0.6818 ## d.o.f=2 ## ## Weighted ARCH LM Tests ## ------------------------------------ ## Statistic Shape Scale P-Value ## ARCH Lag[3] 0.1796 0.500 2.000 0.6717 ## ARCH Lag[5] 1.5392 1.440 1.667 0.5821 ## ARCH Lag[7] 1.6381 2.315 1.543 0.7933 ## ## Nyblom stability test ## ------------------------------------ ## Joint Statistic: 4.4743 ## Individual Statistics: ## ar1 0.07045 ## ar2 0.37070 ## ma1 0.07702 ## ma2 0.39283 ## omega 1.00123 ## alpha1 0.49520 ## beta1 0.79702 ## gamma1 0.51601 ## skew 0.07163 ## shape 0.55625 ## ## Asymptotic Critical Values (10% 5% 1%) ## Joint Statistic: 2.29 2.54 3.05 ## Individual Statistic: 0.35 0.47 0.75 ## ## Sign Bias Test ## ------------------------------------ ## t-value prob sig ## Sign Bias 0.4723 0.63677 ## Negative Sign Bias 1.7969 0.07246 * ## Positive Sign Bias 2.0114 0.04438 ** ## Joint Effect 7.7269 0.05201 * ## ## ## Adjusted Pearson Goodness-of-Fit Test: ## ------------------------------------ ## group statistic p-value(g-1) ## 1 20 46.18 0.0004673 ## 2 30 47.73 0.0156837 ## 3 40 67.07 0.0034331 ## 4 50 65.51 0.0574582 ## ## ## Elapsed time : 0.93679

      All coefficients are statistically significant. No correlation is found within the standardized residuals or the standardized squared residuals. All ARCH effects are properly captured by the model. However:

      * the Nyblom stability test null hypothesis that the model parameters are constant over time is rejected for some of them

      * the Positive Sign Bias null hypothesis is rejected at 5% level of significance; this kind of test focuses on the effect of large and small positive shocks

      * the Adjusted Pearson Goodness-of-fit test null hypothesis that the empirical and theoretical distribution of standardized residuals are identical is rejected

      See ref. [11] for an explanation about GARCH model diagnostic.

      Note: the ARMA(1,2) + eGARCH(1,1) fit also provides significant coefficients, no correlation within the standardized residuals or the standardized squared residuals, and all ARCH effects are properly captured. However, its bias tests are less satisfactory at the 5% level than those of the ARMA(2,2) + eGARCH(1,1) model.
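      A hedged sketch of how that comparison can be run, refitting the alternative specification and comparing information criteria:

      garchspec_12 <- ugarchspec(mean.model = list(armaOrder = c(1, 2), include.mean = FALSE),
                                 variance.model = list(model = "eGARCH", garchOrder = c(1, 1)),
                                 distribution.model = "sstd")
      garchfit_12 <- ugarchfit(data = dj_ret, spec = garchspec_12)
      infocriteria(garchfit_12)   # compare with the ARMA(2,2) + eGARCH(1,1) fit above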

      We show some diagnostic plots further.

      par(mfrow=c(2,2))
      plot(garchfit, which=8)
      plot(garchfit, which=9)
      plot(garchfit, which=10)
      plot(garchfit, which=11)

      We show the original DJIA log-returns time series with the mean model fit (red line) and the conditional volatility (blue line).

      par(mfrow=c(1,1))
      cond_volatility <- sigma(garchfit)
      mean_model_fit <- fitted(garchfit)
      p <- plot(dj_ret, col = "grey")
      p <- addSeries(mean_model_fit, col = 'red', on = 1)
      p <- addSeries(cond_volatility, col = 'blue', on = 1)
      p

      Model Equation

      Combining both ARMA(2,2) and eGARCH models we have:

      \[
      \begin{equation}
      \begin{cases}
      y_{t}\ -\ \phi_{1}y_{t-1}\ -\ \phi_{2}y_{t-2} =\ \phi_{0}\ +\ u_{t}\ +\ \theta_{1}u_{t-1}\ +\ \theta_{2}u_{t-2}
      \\
      \\
      u_{t}\ =\ \sigma_{t}\epsilon_{t},\ \ \ \ \ \epsilon_{t}=N(0,1)
      \\
      \\
      \ln(\sigma_{t}^2)\ =\ \omega\ +\ \sum_{j=1}^{q} \left(\alpha_{j}\ \epsilon_{t-j}\ +\ \gamma_{j} (|\epsilon_{t-j}|\ -\ E|\epsilon_{t-j}|)\right)\ +\ \sum_{i=1}^{p} \beta_{i}\ \ln(\sigma_{t-i}^2)
      \end{cases}
      \end{equation}
      \]

      Substituting the fitted model coefficients, we obtain:

      \[
      \begin{equation}
      \begin{cases}
      y_{t}\ + 0.476\ y_{t-1}\ + 0.575\ y_{t-2} = \ u_{t}\ + 0.429\ u_{t-1}\ + 0.563\ u_{t-2}
      \\
      \\
      u_{t}\ =\ \sigma_{t}\epsilon_{t},\ \ \ \ \ \epsilon_{t}=N(0,1)
      \\
      \\
      \ln(\sigma_{t}^2)\ =\ -0.313\ -\ 0.174\ \epsilon_{t-1}\ +\ 0.189\ (|\epsilon_{t-1}|\ -\ E|\epsilon_{t-1}|)\ +\ 0.966\ \ln(\sigma_{t-1}^2)
      \end{cases}
      \end{equation}
      \]

      Volatility Analysis

      Here is the plot of conditional volatility as resulting from our ARMA(2,2) + eGARCH(1,1) model.

      plot(cond_volatility)

      Line plots of conditional volatility by year are shown.

      par(mfrow=c(6,2))
      pl <- lapply(2007:2018, function(x) {
        plot(cond_volatility[as.character(x)], main = "DJIA Daily Log returns conditional volatility")
      })
      pl

      Conditional volatility box-plots by year are shown.

      par(mfrow=c(1,1))
      cond_volatility_df <- xts_to_dataframe(cond_volatility)
      dataframe_boxplot(cond_volatility_df, "Dow Jones Daily Log-returns conditional Volatility 2007-2018")

      After 2008, the daily volatility basically tends to decrease. In 2017, volatility was lower than in any other year under analysis. In contrast, in 2018 we experienced a remarkable increase in volatility with respect to 2017.
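      One simple way to back that year-by-year reading with numbers is to average the conditional volatility per calendar year (a sketch; cond_volatility is the xts series computed earlier):

      library(xts)
      apply.yearly(cond_volatility, mean)   # mean conditional volatility per year, 2007-2018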

      We save the current environment for further analysis.

      save.image(file='DowEnvironment.RData')

      If you have any questions, please feel free to comment below.

      References
      1. Dow Jones Industrial Average
      2. Skewness
      3. Kurtosis
      4. An introduction to analysis of financial data with R, Wiley, Ruey S. Tsay
      5. Time series analysis and its applications, Springer ed., R.H. Shumway, D.S. Stoffer
      6. Applied Econometric Time Series, Wiley, W. Enders, 4th ed.
      7. Forecasting: Principles and Practice, OTexts, R.J. Hyndman
      8. Options, Futures and other Derivatives, Pearson ed., J.C. Hull
      9. An introduction to rugarch package
      10. Applied Econometrics with R, Achim Zeileis, Christian Kleiber – Springer Ed.
      11. GARCH modeling: diagnostic tests
      Disclaimer

      Any securities or databases referred to in this post are solely for illustration purposes, and under no regard should the findings presented here be interpreted as investment advice or promotion of any particular security or source.



      Continuing to Grow Community Together at ozunconf, 2018

      Tue, 01/08/2019 - 01:00

      (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

      In late November 2018, we ran the third annual rOpenSci ozunconf. This is the sibling rOpenSci unconference, held in Australia. We ran the first ozunconf in Brisbane in 2016, and the second in Melbourne in 2017.

      Photos taken by Ajay from Fotoholics

As usual, before the unconf, we started discussion on GitHub issue threads, and the excitement was building with the number of issues.

      The day before the unconf we ran “Day 0 training” – an afternoon explaining R packages and GitHub. This aimed to show people how to create an R package, set it up with version control with git, then put it on GitHub, and share it with others. The idea behind delivering this course was not necessarily to have people become experts in R package development and GitHub. Instead, it aimed to gently introduce the ideas and concepts of R packages and GitHub, so that people can hit the ground running over the next two days.

This year, Saras Windecker stepped up to the plate and did an incredible job teaching the afternoon course that Roger Peng and I developed last year. We also had a great group of helper instructors, which ended up creating the fantastic teaching ratio of two students for every helper. There was a lot of learning from both sides here, and one comment on writing git commit messages from Adam Gruer really stood out to me. He said that when writing a commit message, it should complete the sentence: “This commit will …”. Adam found this idea in a blog post by Chris Beams. This post by Vicky Lai also provides some good comments on good git practices.

      Day 1 started with a bang, in the beautiful north arts building at the University of Melbourne, and dusted off the tried and tested rOpenSci icebreaker. We then moved on to choose projects. Everyone moved around the room, exploring options, and we settled into groups.


We worked on a variety of projects, ranging from some that we thought might be knocked over in a few hours, like ozbabynames (spoiler alert: it took longer than three hours), to documenting and learning gganimate, creating paper review toolkits, shortening and enriching the process of documenting and reading documentation with tl, automating the creation of vitaes, simplifying the process of recoding free gender text responses with gendercoder, measuring walkability, making bindertools to recreate and reproduce analyses, and even generating synonyms. You can see the complete list of projects here.

      Throughout the conference, Saskia Freytag recorded conversations to capture the event. These were edited into the podcast “Credibly Curious”, in episode 6, “Ozunconf”, run by Saskia and me.

On the second day, Adam Gruer led a fireside chat on how we might continue the experience of the rOpenSci ozunconf after we wrap up. Among the suggestions were starting or joining coding clubs at your university or place of work. Adam also made the point that a reason so many people enjoy the unconf is that freeing up your time and expectations has the surprising effect of making you more productive, and suggested trying to recreate that feeling. Robbie Bonelli had a good way to summarise this:

      We gain personal knowledge that we would never have gained working on our own, even if we never use or work on these things again. Make it about learning something new, this is how to break down the idea of having to get something out of this for your job.

      We closed up the conference with 3-minute presentations from everyone on what they got up to over the past two days.



Looking back, it really is amazing how much can get done over just a few days! We couldn’t have gotten it done without the help of our sponsors, Microsoft Azure, The R Consortium, RStudio, The University of Melbourne, and The Monash Business School.

      It’s been an absolute pleasure to bring people together for another great event. Just three years ago we ran the first ozunconf, and seeing how the community has changed and grown over the past three years has been a real privilege. There is nothing quite like seeing so many people together in the same space, enjoying coding together. Seeing people realise that they can contribute, and become more of what they are is fantastic.

Next year we are looking into moving the ozunconf to a different city in Australia; stay tuned for updates!



      RStudio Server Pro is now available on Microsoft Azure Marketplace

      Tue, 01/08/2019 - 01:00

      (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

      RStudio is excited to announce the availability of its flagship, enterprise-ready, integrated development environment for R in Azure Marketplace.

      RStudio Server Pro for Azure is an on-demand, commercially-licensed integrated development environment (IDE) for R on the Microsoft Azure Cloud. It offers all of the capabilities found in the popular RStudio open source IDE, plus turnkey convenience, enhanced security, the ability to manage multiple R versions and sessions, and more. It comes pre-configured with multiple versions of R, common systems libraries, and the most popular R packages.

      RStudio Server Pro Azure helps you adapt to your unique circumstances. It allows you to choose different Azure computing instances whenever a project requires it, and helps avoid the sometimes complicated processes for procuring on-premises software.

      If the enhanced security, elegant support for multiple R versions and multiple sessions, and commercially licensed and supported features of RStudio Server Pro appeal to you, consider RStudio Server Pro for Azure!

Read the FAQ
Getting Started with RStudio Server Pro for Azure



      Looking back at 2018 and plans for 2019

      Tue, 01/08/2019 - 01:00

      (This article was first published on R on msperlin, and kindly contributed to R-bloggers)

      At the end of every year I plan to write about the highlight of the current year and set plans for the future. First, let’s talk about my work in 2018.

      Highlights of 2018

Research-wise, my scientometrics paper Is predatory publishing a real threat? Evidence from a large database study was featured in many news outlets. Its altmetric page is doing great, with over 1100 downloads, and it sits in the top 5% of all research output measured by altmetric. This is, by far, the most impactful research piece I ever wrote. It’s rewarding to see my work featured in the local and international media.

This year I also released the first version of GetDFPData, an R package for accessing a large database of financial information from B3, the Brazilian exchange. I’m glad to report that many people are using it for their own research. I can see the number of visits in the web interface and the frequent emails I get about the package. The feedback from other researchers has been great but, of course, there are always ways to improve the code. I’ve been constantly developing it over time.

The GetDFPData package also had an impact on my own research. I’ve always been biased towards the topic of capital markets, and now I’m doing research in corporate finance, mostly due to the new access to a large database of corporate events. Currently, I have three paper initiatives analyzing the effect of board formation on the financial performance of Brazilian companies. These will likely be published in 2019 or 2020.

In late 2018 I started my YouTube series padfeR, with video tutorials about using R for Finance and Economics. The idea is to have a greater impact and help those that are starting to use R. So far, all videos are in Portuguese, but I do have plans for doing it in English in the future. Hopefully I’ll find some time in 2019 to start it.

      Overall, 2018 was a great year. I’m always thankful for having the opportunity of working in a job that I love and look forward to work (almost) every single day.

      My blog posts in 2018

In November I changed the technology behind my blog from Jekyll to Hugo. I can’t stress enough how much I’m liking the Academic template built with blogdown and hosted on my own server. It is far easier to write posts and maintain the website.

      First, let’s see how many posts I have so far.

my.blog.folder <- '~/Dropbox/11-My Website/www.msperlin.com-blog/content/post/'
post.files <- list.files(path = my.blog.folder, pattern = '.Rmd')
post.files
## [1] "2017-01-01-First-post.Rmd"
## [2] "2017-01-02-GetHFData.Rmd"
## [3] "2017-01-15-CalculatingBetas.Rmd"
## [4] "2017-01-30-Exams-with-dynamic-content.Rmd"
## [5] "2017-02-13-R-and-Tennis-Players.Rmd"
## [6] "2017-02-16-Writing-a-book.Rmd"
## [7] "2017-03-05-Prophet-and_stock-market.Rmd"
## [8] "2017-05-04-pafdR-is-out.Rmd"
## [9] "2017-05-09-Studying-Pkg-Names.Rmd"
## [10] "2017-05-15-R-Finance.Rmd"
## [11] "2017-08-12-Switching-to-Linux.Rmd"
## [12] "2017-09-14-Brazilian-Yield-Curve.Rmd"
## [13] "2017-12-06-Package-GetDFPData.Rmd"
## [14] "2017-12-13-Serving-shiny-apps-internet.Rmd"
## [15] "2017-12-30-Looking-Back-2017.Rmd"
## [16] "2018-01-22-Update-BatchGetSymbols.Rmd"
## [17] "2018-03-16-Writing_Papers_About_Pkgs.Rmd"
## [18] "2018-04-22-predatory-scientometrics.Rmd"
## [19] "2018-05-12-Investing-Long-Run.Rmd"
## [20] "2018-06-12-padfR-ed2.Rmd"
## [21] "2018-06-29-BenchmarkingSSD.Rmd"
## [22] "2018-10-10-BatchGetSymbols-NewVersion.Rmd"
## [23] "2018-10-11-Update-GetLattesData.Rmd"
## [24] "2018-10-13-NewPackage-PkgsFromFiles.Rmd"
## [25] "2018-10-19-R-and-loops.Rmd"
## [26] "2018-10-20-Linux-and-R.Rmd"
## [27] "2018-11-03-NewBlog.Rmd"
## [28] "2018-11-03-RstudioTricks.Rmd"
## [29] "2019-01-08-Looking-Back-2018.Rmd"

The blog started in January 2017 and, over time, I wrote 29 posts. That feels alright. I’m not feeling forced to write and I do it whenever I feel like I have something to share.

Let’s get more information from the .Rmd files. I’ll write a function read_blog_files and use it for all post files.

read_blog_files <- function(f.in) {
  require(tidyverse)
  my.front.matter <- rmarkdown::yaml_front_matter(f.in)
  df.out <- data_frame(post.title = my.front.matter$title,
                       post.date = lubridate::ymd(my.front.matter$date),
                       post.month = as.Date(format(post.date, '%Y-%m-01')),
                       tags = paste0(my.front.matter$tags, collapse = ';'),
                       categories = paste0(my.front.matter$categories, collapse = ';'),
                       content = paste0(read_lines(f.in), collapse = ' '))
  return(df.out)
}

df.posts <- dplyr::bind_rows(purrr::map(post.files, read_blog_files))
## Loading required package: tidyverse
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  2.0.0     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.

glimpse(df.posts)
## Observations: 29
## Variables: 6
## $ post.title <chr>  "My first post!", "Using R to download high frequency…
## $ post.date  <date> 2017-01-01, 2017-01-02, 2017-01-15, 2017-01-30, 2017…
## $ post.month <date> 2017-01-01, 2017-01-01, 2017-01-01, 2017-01-01, 2017…
## $ tags       <chr>  "about me", "R;GetHFData;B3;market microstructure;hig…
## $ categories <chr>  "about me", "R;GetHFData;B3;market microstructure;hig…
## $ content    <chr>  "--- title: \"My first post!\" subtitle: \"A little b…

      First, we’ll look at the frequency of posts over time.

df.posts.2018 <- df.posts %>%
  filter(post.date > as.Date('2018-01-01'))

print(ggplot(df.posts.2018, aes(x = post.month)) +
        geom_histogram(stat='count') +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        labs(y = 'Number of posts', x = ''))
## Warning: Ignoring unknown parameters: binwidth, bins, pad

      Seems to average about once a month. The blank spaces show that I did not write for a couple of months.

      Checking 2018’s plans

At the end of 2017, my plans for 2018 were:

      Work on the second edition of the portuguese book.

Done! I’m glad to report that the second edition of the book was published in June 2018. It was great to review the book and add several new chapters and sections. As I mentioned in the publication post, this is the largest and longest project I have ever worked on, and it is very satisfying to see it develop over time. Even more satisfying is to receive positive feedback from readers who are using the book to learn to code in R! Many teachers in Economics and Business are also starting to use it in the classroom.

The book will continue to be updated every couple of years. One of the greatest things about R, among many others, is that the language is continually evolving and changing. I have no doubt that there will always be new material to write about.

      Start a portal for financial data in Brazil

Unfortunately this project did not launch. I wrote a couple of R scripts for fetching and saving data automatically on my server, but it never became a webpage. I started to work on other projects and the website was not a priority.

      Plans for 2019

      New edition of “Processing and Analyzing Financial Data with R”

The international version of my book pafdR was published in January 2017. I feel it’s time to update it with the new chapters and structure from the second edition in Portuguese. There are many improvements to the book, with an emphasis on the tidyverse.

      Work on my new book: “Investing For the Long Term” (provisory title)

There is a huge deficit of financial knowledge in Brazil, especially in saving and investing. I’ve been a long-term investor for most of my career as an academic, and I feel there is a lot I can contribute to the topic of financial education by bringing data science into the problem of investing.

The book will be an introduction to investments for the common person in Brazil, with a heavy data-based approach. It will not be about trading strategies or anything related to short-term trading. The idea is to bring data analysis to the common long-term investor, showing how the financial market works and how one can build passive income by constantly buying good financial contracts.

I have no clue if it will be published in 2019. Unlike my previous book, I’m taking my time to write this one. No rush and no deadlines :).

      Solidify my research agenda in Board Composition

As I mentioned before, my research agenda has shifted from capital markets to board composition. This is a very interesting topic with many implications for listed companies. I’m learning a lot from researching these topics.

      Currently, I have four initiatives with different co-authors:

      • Gender and board composition
      • Politics and board composition
      • Professors in the Board of Companies
      • Board description of Brazilian Companies

Hopefully, these will be published in 2019 or 2020.



      Don’t reinvent the wheel: making use of shiny extension packages. Join MünsteR for our next meetup!

      Tue, 01/08/2019 - 01:00

      (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

In our next MünsteR R-user group meetup on Tuesday, February 5th, 2019, titled Don’t reinvent the wheel: making use of shiny extension packages, Suthira Owlarn will introduce the shiny package and show how she used it to build an interactive web app for her sequencing datasets.

      You can RSVP here: http://meetu.ps/e/Gg5th/w54bW/f

Shiny is a popular R package for building interactive web apps and dashboards – no web development knowledge required! Its ease of use and extendability make it great for quick user-friendly presentations of toy projects as well as production-ready data products in various industries.
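As a tiny illustration of that ease of use (a generic hello-world sketch, not code from the talk), a complete shiny app fits in a dozen lines:

library(shiny)

# A complete, runnable shiny app: a slider driving a scatter plot
ui <- fluidPage(
  sliderInput("n", "Number of points", min = 10, max = 500, value = 100),
  plotOutput("scatter")
)
server <- function(input, output) {
  output$scatter <- renderPlot(plot(rnorm(input$n), rnorm(input$n)))
}
shinyApp(ui, server)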

      In our first 2019 meetup, Suthira Owlarn will briefly demo a prototype of BioDEV, a shiny app that addresses a common pain-point in dealing with biological (omics) data sets. It allows interactive and intuitive exploration of user-provided data sets, integration of external data and download of parameterized interactive reports. She will then talk about packages that extend shiny by providing a variety of user interfaces, more advanced user interactions or an application-testing framework. Finally, we will cover some useful tools that aren’t specific to shiny, including an easy way to turn shiny apps into self-contained desktop applications.

      About the speaker:

      Suthira Owlarn is a biologist at the Max Planck Institute in Münster. After finishing her PhD, she decided to learn a little R and immediately fell in love! Soon, she waved goodbye to the bench and is now having fun diving into the world of data analysis, exploring R packages and building shiny apps.



      BH 1.69.0-1 on CRAN

      Tue, 01/08/2019 - 00:26

      (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

      The BH package provides a sizeable portion of the Boost C++ libraries as a set of template headers for use by R. It is quite popular, and frequently used together with Rcpp. The BH CRAN page shows e.g. that it is used by rstan, dplyr as well as a few other packages. The current count of reverse dependencies is at 164.

      Boost releases every four months. The last release we packaged was 1.66 from February—and this BH release gets us to Boost 1.69 released just three or so weeks ago. And as blogged last month, we made a pre-release (following several reverse-depends checks) to allow three packages affected by changes in Boost to adapt. My RcppStreams package was one, and we made an update release 0.1.2 just yesterday. This BH release was also preceded by another reverse-depends check on the weekend.

Since the 1.66 release of BH, CRAN has tightened its policies some more. Pragmas suppressing compiler warnings are now verboten, so I had to disable a few. Expect compilations of packages using Boost, and BH, to be potentially very noisy. Consider adding flags to your local ~/.R/Makeconf and we should add them to the src/Makevars as much as we can (e.g. my ticket #3961 to dplyr). Collecting a few of these on a BH wiki page may not be a bad idea.
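For illustration only (the exact flags depend on your compiler and on which Boost headers a given package pulls in, so treat both the file location and the flags below as assumptions to adapt), an entry along these lines can quiet some common warnings:

# Example additions to ~/.R/Makeconf; adjust to the warnings your compiler actually emits
CXXFLAGS   += -Wno-deprecated-declarations -Wno-unused-variable -Wno-parentheses
CXX11FLAGS += -Wno-deprecated-declarations -Wno-unused-variable -Wno-parentheses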

      A list of our local changes follows. The diff to upstream Boost is in the repo as well.

      Changes in version 1.69.0-1 (2019-01-07)
      • Upgraded to Boost 1.69.0 (plus the few local tweaks)

      • Applied the standard minimal patch with required changes, as well as the newer changeset for diagnostics pragma suppression.

• Following a pre-release in December, maintainers of three packages affected by the 1.66 to 1.69 transition were contacted, and changes were prepared.

      Via CRANberries, there is a diffstat report relative to the previous release.

      Comments and suggestions about BH are welcome via the mailing list or the issue tracker at the GitHub repo.

      This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



      Long-awaited updates to htmlTable

      Mon, 01/07/2019 - 21:51

      (This article was first published on R – G-Forge, and kindly contributed to R-bloggers)

Let’s celebrate 2019 with some updates to my most popular package ever, htmlTable. The image is CC by Thomas Hawk.

      One of the most pleasant surprises has been the popularity of my htmlTable-package with more than 100k downloads per month. This is all thanks to more popular packages relying on it and the web expanding beyond its original boundaries. As a thank you I have taken the time to update and fix some features (the 1.13 release) – enjoy!

      Grouping column headers

Grouping columns is a powerful feature that allows you to quickly generate more interesting tables. Unfortunately, cgroup has previously required matrix input for this feature, which even I as the author sometimes struggled with. Now you can provide a list in a much more user-friendly manner:

rbind(
  `Group A` = c(20, 5, 380, 95),
  `Group B` = c(11, 55, 9, 45)
) %>%
  htmlTable(header = rep(c("No", "%"), times = 2),
            n.cgroup = list(c(2), c(2, 2)),
            cgroup = list('Super', c("First", "Second")))

                     Super
            First          Second
            No    %        No     %
Group A     20    5        380    95
Group B     11    55       9      45

The top-level n.cgroup can be expressed either as the number of cgroup elements it spans below or directly as the number of columns: the exact same table can be created by providing list(c(4), c(2,2)) as the n.cgroup, as shown below.
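For completeness, here is that equivalent call spelled out, using the same data as above:

rbind(
  `Group A` = c(20, 5, 380, 95),
  `Group B` = c(11, 55, 9, 45)
) %>%
  htmlTable(header = rep(c("No", "%"), times = 2),
            n.cgroup = list(c(4), c(2, 2)),
            cgroup = list('Super', c("First", "Second")))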

      Auto-counting tspanners

Even more common than grouping columns is probably grouping data by rows. The htmlTable allows you to do this via rgroup and tspanner. The most common approach is to use rgroup as the first row-grouping element, but with larger tables you frequently want to separate concepts into separate sections. Here’s a more complex example. Counting the rows of each tspanner has previously been a little cumbersome, but now you are able to (1) leave out the last row, and (2) specify the number of rgroups instead of the number of rows. The latter is convenient as the n.tspanner must align with the underlying rgroup. Here’s an example:

rbind(
  `Group A` = c(20, 380),
  `Group B` = c(110, 1230),
  `Group C` = c(2, 56),
  `Group D` = c(17, 33),
  `Group A` = c(40, 360),
  `Group B` = c(230, 1100),
  `Group C` = c(8, 50),
  `Group D` = c(10, 40)
) %>%
  apply(1, function(x) {
    sapply(x, function(count) c(
      txtInt(count),
      sprintf("(%s)", txtRound(count/sum(x) * 100, 1)))) %>%
      c(txtInt(sum(x)), .)
  }) %>%
  t %>%
  htmlTable(header = c("Total", rep(c("No", "(%)"), times = 2)),
            n.cgroup = list(c(1,2,2)),
            cgroup = list(c("", "Cases", "Controls")),
            rgroup = rep(c("Aspirin", "Intermittent compression"), times = 2),
            n.rgroup = rep(2, times = 4),
            tspanner = c("First experiment", "Second experiment"),
            n.tspanner = c(2),
            align = "r",
            caption = "Extremely fake data")

Extremely fake data
                                        Cases            Controls
                             Total      No     (%)       No      (%)
First experiment
  Aspirin
    Group A                  400        20     (5.0)     380     (95.0)
    Group B                  1,340      110    (8.2)     1,230   (91.8)
  Intermittent compression
    Group C                  58         2      (3.4)     56      (96.6)
    Group D                  50         17     (34.0)    33      (66.0)
Second experiment
  Aspirin
    Group A                  400        40     (10.0)    360     (90.0)
    Group B                  1,330      230    (17.3)    1,100   (82.7)
  Intermittent compression
    Group C                  58         8      (13.8)    50      (86.2)
    Group D                  50         10     (20.0)    40      (80.0)

The txtRound now has digits.nonzero argument

      The txtRound is similar to R’s round but returns a string that makes sure that all your values are presented with the same number of digits.

> txtRound(c(1.2, 1.0), digits = 1)
[1] "1.2" "1.0"

Under some circumstances you are close to 0 and you want to retain more digits, hence the new digits.nonzero argument:

> txtRound(c(0.1, 0.01, 0.001), digits = 1, digits.nonzero = 2)
[1] "0.1" "0.01" "0.0"

      New function vector2string

As a package author I think the majority of my code is dedicated to providing good error messages. A common issue is that you have two vectors that don’t match. For this purpose I’ve created vector2string. This makes it easier to write your stop messages. You simply provide any vector and get a string back:

> vector2string(c(1:2, "a random ' string"))
[1] "'1', '2', 'a random \\' string'"

Here’s how I use it in one of my functions:

      ... stop("Invalid size of lists: ", vector2string(lengths), " each element should be be able to evenly divide ", ncol(x))

      Summary

      These are tiny additions and everything should work just as it has in previous versions. Hopefully it will save you some time and make your life easier! Enjoy!



      Marketing analytics with greybox

      Mon, 01/07/2019 - 17:40

      (This article was first published on R – Modern Forecasting, and kindly contributed to R-bloggers)

One of the reasons why I started the greybox package is to use it for marketing research and marketing analytics. The common problem that I face when working with these courses is analysing data measured in different scales. While R handles numeric scales natively, the work with categorical ones is not satisfactory. Yes, I know that there are packages that implement some of the functions, but I wanted to have them in one place without the need to install a lot of packages and satisfy the dependencies. After all, what’s the point in installing a package for Cramer’s V, when it can be calculated with two lines of code? So, here’s a brief explanation of the functions for marketing analytics in greybox.

      I will use mtcars dataset for the examples, but we will transform some of the variables into factors:

mtcarsData <- as.data.frame(mtcars)
mtcarsData$vs <- factor(mtcarsData$vs, levels=c(0,1), labels=c("v","s"))
mtcarsData$am <- factor(mtcarsData$am, levels=c(0,1), labels=c("a","m"))

All the functions discussed in this post are available in greybox starting from v0.4.0. However, I’ve found several bugs since the submission to CRAN, and the most recent version with bugfixes is now available on github.

Analysing the relation between two variables in categorical scales

Cramer’s V

Cramer’s V measures the relation between two variables in categorical scale. It is implemented in the cramer() function. It returns the value in a range of 0 to 1 (1 when the two categorical variables are linearly associated with each other, 0 otherwise), the Chi-Squared statistic from chisq.test(), the respective p-value and the number of degrees of freedom. The tested hypothesis in this case is formulated as:

\[
\begin{matrix}
H_0: V = 0 \text{ (the variables don’t have an association);} \\
H_1: V \neq 0 \text{ (there is an association between the variables).}
\end{matrix}
\]

      Here’s what we get when trying to find the association between the engine and transmission in the mtcars data:

      cramer(mtcarsData$vs, mtcarsData$am)

Cramer's V: 0.1042
Chi^2 statistics = 0.3475, df: 1, p-value: 0.5555

      Judging by this output, the association between these two variables is very low (close to zero) and is not statistically significant.
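For intuition, the value can be reproduced by hand, since \(V = \sqrt{\chi^2 / (n\,(\min(r,c)-1))}\); note that chisq.test() applies Yates’ continuity correction by default for a 2x2 table, which is what matches the statistic above (a sketch, not the greybox internals):

tab <- table(mtcarsData$vs, mtcarsData$am)
chi <- chisq.test(tab)  # Yates' correction is applied for 2x2 tables
V <- sqrt(unname(chi$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
V  # ~0.1042, in line with cramer() above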

Cramer’s V can also be used for data in numerical scales. In general, this might not be the most suitable solution, but it can be useful when you have a small number of values in the data. For example, the variable gear in mtcars is numerical, but it has only three options (3, 4 and 5). Here’s what Cramer’s V tells us in the case of gear and am:

      cramer(mtcarsData$am, mtcarsData$gear)

Cramer's V: 0.809
Chi^2 statistics = 20.9447, df: 2, p-value: 0

As we see, the value is high in this case (0.809), and the null hypothesis is rejected at the 5% level. So we can conclude that there is a relation between the two variables. This does not mean that one variable causes the other; they both might be driven by something else (do more expensive cars have fewer gears but automatic transmission?).

      Plotting categorical variables

      While R allows plotting two categorical variables against each other, the plot is hard to read and is not very helpful (in my opinion):

      plot(table(mtcarsData$am,mtcarsData$gear))

      Default plot of a table

So I have created a function that produces a heat map for two categorical variables. It is called tableplot():

      tableplot(mtcarsData$am,mtcarsData$gear)

      Tableplot for the two categorical variables

It is based on the table() function and uses the frequencies inside the table for the colours:

      table(mtcarsData$am,mtcarsData$gear) / length(mtcarsData$am)

        3       4       5
a 0.46875 0.12500 0.00000
m 0.00000 0.25000 0.15625

The darker sectors mean that there is a higher concentration of values, while the white ones correspond to zeroes. So, in our example, we see that the majority of cars have automatic transmissions with three gears. Furthermore, the plot shows that there is some sort of relation between the two variables: the cars with automatic transmissions have a lower number of gears, while the manual ones have a higher number of gears (something we’ve already noticed in the previous subsection).

      Association between the categorical and numerical variables

While Cramer’s V can also be used for measuring the association between variables in different scales, there are better instruments. For example, some analysts recommend using the intraclass correlation coefficient when measuring the relation between numerical and categorical variables. But there is a simpler option, which involves calculating the coefficient of multiple correlation between the variables. This is implemented in the mcor() function of greybox. The y variable should be numerical, while x can be of any type. What the function then does is expand all the factors, run a regression via the .lm.fit() function, and return the square root of the coefficient of determination. If the variables are linearly related, then the returned value will be close to one. Otherwise it will be closer to zero. The function also returns the F statistic from the regression, the associated p-value and the number of degrees of freedom (the hypothesis is formulated similarly to the cramer() function).

      Here’s how it works:

      mcor(mtcarsData$am,mtcarsData$mpg)

Multiple correlations value: 0.5998
F-statistics = 16.8603, df: 1, df resid: 30, p-value: 3e-04

      In this example, the simple linear regression of mpg from the set of dummies is constructed, and we can conclude that there is a linear relation between the variables, and that this relation is statistically significant.
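To see what that means in practice, the same number can be reproduced with a plain regression (a sketch, not the greybox internals):

fit <- lm(mpg ~ am, data = mtcarsData)  # mpg regressed on the transmission factor
sqrt(summary(fit)$r.squared)            # ~0.5998, matching the mcor() value above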

Association between several variables

Measures of association

When you deal with datasets (i.e. data frames or matrices), you can use the cor() function to calculate the correlation coefficients between the variables in the data. But when you have a mixture of numerical and categorical variables, the situation becomes more difficult, as the correlation does not make sense for the latter. This motivated me to create a function that uses either cor(), cramer() or mcor(), depending on the types of data (see the discussions of cramer() and mcor() above). The function is called association() or assoc() and returns three matrices: the values of the measures of association, their p-values and the types of the functions used between the variables. Here’s an example:

assocValues <- assoc(mtcarsData)
print(assocValues, digits=2)
Associations:
values:
       mpg   cyl  disp    hp  drat    wt  qsec   vs   am gear carb
mpg   1.00  0.86 -0.85 -0.78  0.68 -0.87  0.42 0.66 0.60 0.66 0.67
cyl   0.86  1.00  0.92  0.84  0.70  0.78  0.59 0.82 0.52 0.53 0.62
disp -0.85  0.92  1.00  0.79 -0.71  0.89 -0.43 0.71 0.59 0.77 0.56
hp   -0.78  0.84  0.79  1.00 -0.45  0.66 -0.71 0.72 0.24 0.66 0.79
drat  0.68  0.70 -0.71 -0.45  1.00 -0.71  0.09 0.44 0.71 0.83 0.33
wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 0.55 0.69 0.66 0.61
qsec  0.42  0.59 -0.43 -0.71  0.09 -0.17  1.00 0.74 0.23 0.63 0.67
vs    0.66  0.82  0.71  0.72  0.44  0.55  0.74 1.00 0.10 0.62 0.69
am    0.60  0.52  0.59  0.24  0.71  0.69  0.23 0.10 1.00 0.81 0.44
gear  0.66  0.53  0.77  0.66  0.83  0.66  0.63 0.62 0.81 1.00 0.51
carb  0.67  0.62  0.56  0.79  0.33  0.61  0.67 0.69 0.44 0.51 1.00

p-values:
      mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
mpg  1.00 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.01
cyl  0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.01
disp 0.00 0.00 1.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.07
hp   0.00 0.00 0.00 1.00 0.01 0.00 0.00 0.00 0.18 0.00 0.00
drat 0.00 0.00 0.00 0.01 1.00 0.00 0.62 0.01 0.00 0.00 0.66
wt   0.00 0.00 0.00 0.00 0.00 1.00 0.34 0.00 0.00 0.00 0.02
qsec 0.02 0.00 0.01 0.00 0.62 0.34 1.00 0.00 0.21 0.00 0.01
vs   0.00 0.00 0.00 0.00 0.01 0.00 0.00 1.00 0.56 0.00 0.01
am   0.00 0.01 0.00 0.18 0.00 0.00 0.21 0.56 1.00 0.00 0.28
gear 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.09
carb 0.01 0.01 0.07 0.00 0.66 0.02 0.01 0.01 0.28 0.09 1.00

types:
     mpg    cyl      disp   hp     drat   wt     qsec   vs       am
mpg  "none" "mcor"   "cor"  "cor"  "cor"  "cor"  "cor"  "mcor"   "mcor"
cyl  "mcor" "none"   "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "cramer"
disp "cor"  "mcor"   "none" "cor"  "cor"  "cor"  "cor"  "mcor"   "mcor"
hp   "cor"  "mcor"   "cor"  "none" "cor"  "cor"  "cor"  "mcor"   "mcor"
drat "cor"  "mcor"   "cor"  "cor"  "none" "cor"  "cor"  "mcor"   "mcor"
wt   "cor"  "mcor"   "cor"  "cor"  "cor"  "none" "cor"  "mcor"   "mcor"
qsec "cor"  "mcor"   "cor"  "cor"  "cor"  "cor"  "none" "mcor"   "mcor"
vs   "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "none"   "cramer"
am   "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "none"
gear "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "cramer"
carb "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "cramer"
     gear     carb
mpg  "mcor"   "mcor"
cyl  "cramer" "cramer"
disp "mcor"   "mcor"
hp   "mcor"   "mcor"
drat "mcor"   "mcor"
wt   "mcor"   "mcor"
qsec "mcor"   "mcor"
vs   "cramer" "cramer"
am   "cramer" "cramer"
gear "none"   "cramer"
carb "cramer" "none"

One thing to note is that the function considers numerical variables as categorical when they have up to 10 unique values. This is useful, for example, in the case of the number of gears in the dataset.
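For example, gear qualifies as categorical under that rule:

length(unique(mtcarsData$gear))  # only 3 unique values, so assoc() treats gear as categorical
## [1] 3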

      Plots of association between several variables

Similarly to the problem with cor(), a scatterplot matrix (produced using plot()) is not meaningful in case of a mixture of variables:

      plot(mtcarsData)

      Default scatter plot matrix

It makes sense to use a scatterplot in case of numeric variables, tableplot() in case of categorical ones and boxplot() in case of a mixture. So there is the function spread() in greybox that creates something more meaningful. It uses the same algorithm as the assoc() function, but produces plots instead of calculating measures of association. So gear will be considered as categorical, and the function will produce either boxplot() or tableplot() when plotting it against other variables.

      Here’s an example:

      spread(mtcarsData)

      Spread matrix

This plot demonstrates, for example, that the number of carburetors influences fuel consumption (something that we could not have spotted in the case of plot()). Notice also that the number of gears influences the fuel consumption in a non-linear way as well. So constructing the model with dummy variables for the number of gears might be a reasonable thing to do.

The function also has the parameter log, which will transform all the numerical variables using logarithms; this is handy when you suspect a non-linear relation between the variables. Finally, there is the histograms parameter, which will plot either histograms or barplots on the diagonal.

      spread(mtcarsData, histograms=TRUE, log=TRUE)

      Spread matrix in logs

      The plot demonstrates that the disp has a strong non-linear relation with mpg, and, similarly, drat and hp also influence mpg in a non-linear fashion.

      Regression diagnostics

One of the problems of linear regression that can be diagnosed prior to the model construction is multicollinearity. The conventional way of doing this diagnosis is via calculating the variance inflation factor (VIF) after constructing the model. However, VIF is not easy to interpret, because it lies in \([1, \infty)\). Coefficients of determination from the linear regressions of each explanatory variable on the others are easier to interpret and work with. If such a coefficient is equal to one, then there are some perfectly correlated explanatory variables in the dataset. If it is equal to zero, then they are not linearly related.

There is a function determination() or determ() in greybox that returns the set of coefficients of determination for the explanatory variables. The good thing is that this can be done before constructing any model. In our example, the first column, mpg, is the response variable, so we can diagnose the multicollinearity the following way:

      determination(mtcarsData[,-1])

      cyl      disp        hp      drat        wt      qsec        vs
0.9349544 0.9537470 0.8982917 0.7036703 0.9340582 0.8671619 0.8017720
       am      gear      carb
0.7924392 0.8133441 0.8735577

As we can see from the output above, disp is the variable most linearly related to the other explanatory variables, so including it in the model might cause multicollinearity, which would decrease the efficiency of the parameter estimates.
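Since the usual variance inflation factor is just \(VIF_j = 1 / (1 - R_j^2)\), these values translate directly into VIFs if you prefer the conventional measure (a small sketch, not a greybox function):

r2 <- determination(mtcarsData[, -1])
vif <- 1 / (1 - r2)
round(vif, 1)  # e.g. disp: 1 / (1 - 0.954) is roughly 21.6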


