R news and tutorials contributed by (750) R bloggers
Updated: 3 hours 40 sec ago

### On the Road to 0.8.0 — Some Additional New Features Coming in the sergeant Package

Wed, 01/09/2019 - 20:32

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

It was probably not difficult to discern from my previous Drill-themed post that I’m fairly excited about the Apache Drill 1.15.0 release. I’ve rounded out most of the existing corners for it in preparation for a long-overdue CRAN update and have been concentrating on two helper features: configuring & launching Drill embedded Docker containers and auto-generation of Drill CTAS queries.

Drill Docker Goodness

Starting with version 1.14.0, Apache provides Drill Docker images for use in experimenting/testing/building-off-of. They run Drill in single-node standalone mode, so you're not going to be running this in "production" (unless you have light or just personal workloads). Docker is a great way to get to know Drill if you haven't already played with it since you don't have to do much except run the Docker image.

I’ve simplified this even more thanks to @rgfitzjohn’s most excellent stevedore package which adds a robust R wrapper to the Docker client without relying on any heavy external dependencies such as reticulate. The new drill_up() function will auto-fetch the latest Drill image and launch a container so you can have a running Drill instance with virtually no effort on your part.
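For reference, launching the same image by hand is roughly the following `docker` invocation (a sketch: the image tag follows the Apache Drill Docker docs, and the `~/Data:/data` volume mapping is illustrative, not necessarily what `drill_up()` uses internally):

```sh
# 8047 is Drill's web/REST port; the volume flag mounts a local data dir
docker run -i -t \
  -p 8047:8047 \
  -v ~/Data:/data \
  drill/apache-drill:1.15.0 /bin/bash
```

The point of `drill_up()` is that you never have to remember any of this.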

Just running the vanilla image isn't enough since your goal is likely to do more than work with the built-in cp data source. The default container launch scenario also doesn't hook up any local filesystem paths to the container, so you really can't do much other than cp-oriented queries. Rather than make you figure out the Docker command-line arguments and manually configure a workspace pointing at a local filesystem area in the Drill web admin GUI, the drill_up() function provides a data_dir argument (which defaults to the getwd() of your R session). It auto-wires that path into the container and creates a dfs.d workspace that points to it for you. Here's a sample execution:

```r
library(sergeant)
library(tidyverse)

dr <- drill_up(data_dir = "~/Data")
## Drill container started. Waiting for the service to become active (this may take up to 30s).
## Drill container ID: f02a11b50e1647e44c4e233799180da3e907c8aa27900f192b5fd72acfa67ec0
```

You can use dr$stop() to stop the container or use the printed container id to do it from the command line. We'll use this containerized Drill instance with the next feature, but I need to thank @cboettig for the suggestion to make an auto-downloader-runner-thingy before doing that. (Thank you @cboettig!)

Taking the Tedium out of CTAS

@dseverski, an intrepid R, Drill & sergeant user, noticed some new package behavior with Drill 1.15.0 that ended up spawning a new feature: automatic generation of Drill CTAS statements.

Prior to 1.14.0, sergeant had no way to accurately, precisely tell the data types of the columns coming back since the REST API didn't provide them (as noted in the previous Drill post). It did rely on the JSON types to create the initial data frames, but it also did something *kinda horribad*: it ran readr::type_convert() on the result sets. Said operation had the singular benefit of auto-converting CSV/CSVH/TSV/PSV/etc. data to something sane without having to worry about writing lengthy CTAS queries (at the expense of potentially confusing everyone, though that didn't seem to happen).

With 1.15.0, the readr::type_convert() crutch is gone, which results in less-than-helpful things like this when you have delimiter-separated values data:

```r
# using the Drill container we just started above
write_csv(nycflights13::flights, "~/Data/flights.csvh")

con <- src_drill("localhost")

tbl(con, "dfs.d.flights.csvh") %>%
  glimpse()
## Observations: ??
## Variables: 19
## Database: DrillConnection
## $ year           "2013", "2013", "2013", "2013", "2013", "2013", "2013", "2013…
## $ month          "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…
## $ day            "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…
## $ dep_time       "517", "533", "542", "544", "554", "554", "555", "557", "557"…
## $ sched_dep_time "515", "529", "540", "545", "600", "558", "600", "600", "600"…
## $ dep_delay      "2", "4", "2", "-1", "-6", "-4", "-5", "-3", "-3", "-2", "-2"…
## $ arr_time       "830", "850", "923", "1004", "812", "740", "913", "709", "838…
## $ sched_arr_time "819", "830", "850", "1022", "837", "728", "854", "723", "846…
## $ arr_delay      "11", "20", "33", "-18", "-25", "12", "19", "-14", "-8", "8",…
## $ carrier        "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "…
## $ flight         "1545", "1714", "1141", "725", "461", "1696", "507", "5708", …
## $ tailnum        "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "…
## $ origin         "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK"…
## $ dest           "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO"…
## $ air_time       "227", "227", "160", "183", "116", "150", "158", "53", "140",…
## $ distance       "1400", "1416", "1089", "1576", "762", "719", "1065", "229", …
## $ hour           "5", "5", "5", "5", "6", "5", "6", "6", "6", "6", "6", "6", "…
## $ minute         "15", "29", "40", "45", "0", "58", "0", "0", "0", "0", "0", "…
## $ time_hour      "2013-01-01T10:00:00Z", "2013-01-01T10:00:00Z", "2013-01-01T1…
```

So the package finally does what it should have been doing all along. But, as noted, that's not great if you just want to quickly work with a directory of CSV files. In theory, you're supposed to use Drill's CREATE TABLE AS (CTAS) and then do a bunch of CASTs and TO_s to get proper data types. But who has time for that?
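For context, a hand-rolled CTAS of that sort looks something like this (a sketch only: the `dfs.tmp` workspace and table name are illustrative, and only a few columns are shown):

```sql
-- illustrative hand-written Drill CTAS; workspace/table names are made up
CREATE TABLE dfs.tmp.`/flights_typed` AS
SELECT
  CAST(year AS DOUBLE)  AS year,
  CAST(carrier AS VARCHAR) AS carrier,
  TO_TIMESTAMP(time_hour, 'yyyy-MM-dd''T''HH:mm:ssZ') AS time_hour
FROM dfs.d.`flights.csvh`
```

Multiply that by 19 columns per file and the tedium is obvious.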

David had a stellar idea: might sergeant be able to automagically create CTAS statements from a query? Yes. Yes, it just might be able to do that, with the new ctas_profile() function.

Let’s pipe the previous tbl() into ctas_profile() and see what we get:

```r
tbl(con, "dfs.d.flights.csvh") %>%
  ctas_profile() %>%
  cat()
```

```sql
-- ** Created by ctas_profile() in the R sergeant package, version 0.8.0 **

CREATE TABLE CHANGE____ME AS
SELECT
  CAST(year AS DOUBLE) AS year,
  CAST(month AS DOUBLE) AS month,
  CAST(day AS DOUBLE) AS day,
  CAST(dep_time AS DOUBLE) AS dep_time,
  CAST(sched_dep_time AS DOUBLE) AS sched_dep_time,
  CAST(dep_delay AS DOUBLE) AS dep_delay,
  CAST(arr_time AS DOUBLE) AS arr_time,
  CAST(sched_arr_time AS DOUBLE) AS sched_arr_time,
  CAST(arr_delay AS DOUBLE) AS arr_delay,
  CAST(carrier AS VARCHAR) AS carrier,
  CAST(flight AS DOUBLE) AS flight,
  CAST(tailnum AS VARCHAR) AS tailnum,
  CAST(origin AS VARCHAR) AS origin,
  CAST(dest AS VARCHAR) AS dest,
  CAST(air_time AS DOUBLE) AS air_time,
  CAST(distance AS DOUBLE) AS distance,
  CAST(hour AS DOUBLE) AS hour,
  CAST(minute AS DOUBLE) AS minute,
  TO_TIMESTAMP(time_hour, 'FORMATSTRING') AS time_hour -- *NOTE* You need to specify the format string. Sample character data is: [2013-01-01T10:00:00Z].
FROM (SELECT * FROM dfs.d.flights.csvh)

-- TIMESTAMP and/or DATE columns were detected. Drill's date/time format string reference can be found at:
--
--
```

There's a parameter for the new table name, which will make the CHANGE____ME go away, and when the function finds TIMESTAMP or DATE fields it knows to switch to their TO_ cousins and gives sample data with a reminder that you need to make a format string (I'll eventually auto-generate them unless someone PRs it first). And, since nobody but Java programmers remembers Joda format strings (they're different from what you're used to), it provides a handy link to them if it detects the presence of those column types.
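Supplying the table name up front might look like this (a sketch: I'm assuming the parameter is called new_table_name and the dfs.tmp target is made up — check ?ctas_profile in the installed package for the actual argument name):

```r
# 'new_table_name' is an assumed parameter name; verify with ?ctas_profile
tbl(con, "dfs.d.flights.csvh") %>%
  ctas_profile(new_table_name = "dfs.tmp.`/flights_typed`") %>%
  cat()
```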

Now, we don’t need to actually create a new table (though converting a bunch of CSVs to Parquet is likely a good idea for performance reasons) to use that output. We can pass most of that new query right to tbl():

```r
tbl(con, sql("
SELECT
  CAST(year AS DOUBLE) AS year,
  CAST(month AS DOUBLE) AS month,
  CAST(day AS DOUBLE) AS day,
  CAST(dep_time AS DOUBLE) AS dep_time,
  CAST(sched_dep_time AS DOUBLE) AS sched_dep_time,
  CAST(dep_delay AS DOUBLE) AS dep_delay,
  CAST(arr_time AS DOUBLE) AS arr_time,
  CAST(sched_arr_time AS DOUBLE) AS sched_arr_time,
  CAST(arr_delay AS DOUBLE) AS arr_delay,
  CAST(carrier AS VARCHAR) AS carrier,
  CAST(flight AS DOUBLE) AS flight,
  CAST(tailnum AS VARCHAR) AS tailnum,
  CAST(origin AS VARCHAR) AS origin,
  CAST(dest AS VARCHAR) AS dest,
  CAST(air_time AS DOUBLE) AS air_time,
  CAST(distance AS DOUBLE) AS distance,
  CAST(hour AS DOUBLE) AS hour,
  CAST(minute AS DOUBLE) AS minute,
  TO_TIMESTAMP(time_hour, 'yyyy-MM-dd''T''HH:mm:ssZ') AS time_hour -- [2013-01-01T10:00:00Z].
FROM (SELECT * FROM dfs.d.flights.csvh)
")) %>%
  glimpse()
## Observations: ??
## Variables: 19
## Database: DrillConnection
## $ year           2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 5…
## $ sched_dep_time 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 6…
## $ dep_delay      2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -…
## $ arr_time       830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, …
## $ sched_arr_time 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, …
## $ arr_delay      11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31,…
## $ carrier        "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "…
## $ flight         1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71,…
## $ tailnum        "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "…
## $ origin         "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK"…
## $ dest           "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO"…
## $ air_time       227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, 34…
## $ distance       1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028, …
## $ hour           5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6…
## $ minute         15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, 0…
## $ time_hour      2013-01-01 10:00:00, 2013-01-01 10:00:00, 2013-01-01 10:00:0…
```

Ahhhh… Useful data types. (And, see what I mean about that daft format string? Also, WP is mangling the format string, so add a comment if you need the actual string.)

FIN

As you can see, questions, suggestions (and PRs!) are welcome and heeded on your social-coding platform of choice (though y'all still seem to be stuck on GH).

NOTE: I'll be subbing out most install_github() links in READMEs and future blog posts for install_git() counterparts pointing to my sr.ht repos (as I co-locate/migrate them there). You can play with the new 0.8.0 features via devtools::install_git("https://git.sr.ht/~hrbrmstr/sergeant", ref="0.8.0").

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R NewYorkers Feeling the Holiday Spirit? Here’s Your Tip

Wed, 01/09/2019 - 17:53

(This article was first published on R – Pivot Billions, and kindly contributed to R-bloggers)

Combining Pivot Billions with R to dive into whether the holiday spirit inspires bigger tips and which parts of New York experience this effect the most.
The holiday season brings with it a degree of cheer and joy that many claim makes people act friendlier towards each other. I wanted to see how this effect translates to action, so I decided to look into tips for New York green taxis both during the holiday season and the rest of the year.

To start, I streamed all of the green taxi data files from the public NYC Taxi and Limousine Commission Trip Record Data for 2017-07-01 to 2018-06-30 (the most recent year of green taxi data) into Pivot Billions and enhanced the data with two new columns: holidayseason and tip_percent. Many rows weren't relevant to this analysis since cash payments have no record of tips, so I filtered out cash payments from the payment_type column in Pivot Billions, bringing the total rows to ~5 million.

To dive into the data I made use of Pivot Billions' pivot feature to quickly reorganize all of this filtered data by where the passenger(s) were dropped off (DOLocationID) and whether the trip occurred during the holiday season. My over 9 million original rows of data were now shrunk down to a much more manageable 513-row detailed summary. Downloading this new view of the data from Pivot Billions, I switched my focus to visualizing and analyzing the data in R.

Now that the data was shrunk down to a size R can easily handle, I loaded the Taxi Zone Shapefile and my newly downloaded DOLocationID_holiday_tips.csv file into R. This was a simple process of uploading the shapefile from our data source as well as our Pivot Billions-processed file onto my machine running R and then joining them by setting LocationID equal to DOLocationID. After quickly defining a new metric from our data called "Holiday Effect", which tracks the percentage difference in average tips between the holiday season and the rest of the year, and adding additional information to make the data informative and explorable, I was left with a very clear and powerful visualization of the green taxi data.
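A "percentage difference in average tips" metric of this kind is typically computed as follows (my formulation — the post does not spell out the exact definition used):

$\text{Holiday Effect} = 100 \times \dfrac{\overline{tip\%}_{\,holiday} - \overline{tip\%}_{\,rest}}{\overline{tip\%}_{\,rest}}$

so positive values (green areas below) mean tips run higher during the holiday season for that drop-off zone, and negative values (orange areas) mean they run lower.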
It is immediately clear that there are regions with a much greater occurrence of positive holiday effects (green areas) than negative effects (orange areas), as well as the reverse. Utilizing R's powerful indexing abilities, it's easy to narrow down the data by location and explore which areas of New York experience the effect the most. It appears that the Bronx and Brooklyn experience more negative effects, whereas Queens is evenly spread between positive and negative. However, Manhattan and Newark Airport have a much higher proportion of positive effects due to the holiday season. Though most of New York is affected by the holidays for better or worse, people going to Manhattan and Newark Airport seem to be feeling the holiday spirit the most.

To create this visualization yourself you can download my R code, DOLocationID_holiday_tips.csv, and the Public Data's Shapefile. You can also run this code replacing "DOLocationID_holiday_tips.csv" with "PULocationID_holiday_tips.csv" and DOLocationID with PULocationID to view the holiday effect on tips by Pick-Up Location.
### Animating Data Transformations: Part II

Wed, 01/09/2019 - 17:40

(This article was first published on R Tutorials – Omni Analytics Group, and kindly contributed to R-bloggers)

In our previous series on Animating Data Transformations, we showed you how to use gganimate to construct an animation which illustrates the process of going between tall and wide representations of data. Today, we will show the same procedure for constructing an animation of the unnest() function. The unnest() function takes a tibble containing a list column and converts it to a tibble such that each element in the list comprises a single row. Think of it as "unpacking" the list column into a more standard vector column in R.

We will create a toy dataset using the sleep data, and ensure that it contains a list column:

```r
library(tidyverse)
library(gganimate)

sleep_data <- sleep %>%
  mutate(group = rep(1:4, each = 5),
         ID = rep(1:4, each = 5)) %>%
  select(id = ID, group, data = extra)

sleep_data
# A tibble: 20 x 3
#      id group  data
#  1    1     1   0.7
#  2    1     1  -1.6
#  3    1     1  -0.2
#  4    1     1  -1.2
#  5    1     1  -0.1
# ...
```

You can see that we have a 20 row by 3 column tibble in R.
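Before animating anything, it may help to see nest/unnest on a tiny, self-contained example (a sketch with a made-up two-group tibble, independent of the sleep data used below):

```r
library(dplyr)
library(tidyr)

# a tibble whose `vals` column is a list column: each cell holds a whole vector
nested <- tibble(g = 1:2, vals = list(c(1, 2), c(3, 4, 5)))
nrow(nested)                    # 2 rows, one per group

# unnest() "unpacks" the list column: one row per element of each vector
flat <- nested %>% unnest(vals)
nrow(flat)                      # 5 rows
```

This is exactly the transformation the animation below will visualize, just in miniature.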
Next, we perform the following routine in order to nest the data column into a new nesteddata column:

```r
sleep_nested <- sleep_data %>%
  group_by(group) %>%
  summarise(id = id[1], nesteddata = list(data))

sleep_nested
# A tibble: 4 x 3
#   group    id nesteddata
# 1     1     1  <dbl [5]>
# 2     2     2  <dbl [5]>
# 3     3     3  <dbl [5]>
# 4     4     4  <dbl [5]>
```

Next, we perform a similar routine to the previous blog and combine the two datasets into one dataset which will be used to build the animation:

```r
longDat <- function(x) {
  names(x) %>%
    rbind(x) %>%
    setNames(seq_len(ncol(x))) %>%
    mutate(row = row_number()) %>%
    tidyr::gather(column, value, -row) %>%
    mutate(column = as.integer(column)) %>%
    ungroup() %>%
    arrange(column, row)
}

long_tables <- map(list(sleep_nested, sleep_data), longDat)

nested_table <- long_tables[[1]] %>%
  mutate(tstep = "a",
         value = sapply(value, paste, collapse = ", "))

unnested_table <- long_tables[[2]] %>%
  mutate(tstep = "b")

both_tables <- bind_rows(nested_table, unnested_table)

both_tables$celltype[both_tables$column == 1] <- c("header", rep("id", 4), "header", rep("id", 20))
both_tables$celltype[both_tables$column == 2] <- c("header", rep(1:4, each = 1), "header", rep(1:4, each = 5))
both_tables$celltype[both_tables$column == 3] <- c("header", rep("nesteddata", 4), "header", rep("data", 20))

both_tables
# A tibble: 78 x 5
#      row column value tstep celltype
#  1     1      1 group a     header
#  2     2      1 1     a     id
#  3     3      1 2     a     id
#  4     4      1 3     a     id
#  5     5      1 4     a     id
#  6     1      2 id    a     header
#  7     2      2 1     a     1
#  8     3      2 2     a     2
#  9     4      2 3     a     3
# 10     5      2 4     a     4
# … with 68 more rows
```

From this, we can produce static versions of the two images which will form the basis for the animation:

```r
base_plot <- ggplot(both_tables, aes(column, -row, fill = celltype)) +
  geom_tile(color = "black") +
  theme_void() +
  scale_fill_manual(
    values = c("#247ba0", "#70c1b3", "#b2dbbf", "turquoise2",
               "#ead2ac", "grey60", "mistyrose3", "#ffae63"),
    name = "",
    labels = c("Group 1", "Group 2", "Group 3", "Group 4",
               "Data", "Header", "ID", "Nested Data")
  )

base_plot + facet_wrap(~tstep)
```

The static plot that will be used to generate the animation.

Finally, we use gganimate to build the final animation!

```r
p1 <- base_plot +
  transition_states(
    states = tstep,
    transition_length = 1,
    state_length = 1
  ) +
  enter_fade() +
  exit_fade() +
  ease_aes('sine-in-out')

p1_animate <- animate(p1, height = 800, width = 600, fps = 20, duration = 10)

anim_save("unnest_animate.gif")
```

And there you have it! We hope this was helpful both in learning how to produce data transformation animations, and in terms of learning the unnest() operation itself. If you have any requests for more data transformation animations, please let us know, and be on the lookout for future posts in this series!

The post Animating Data Transformations: Part II appeared first on Omni Analytics Group.

### Understanding the maths of Computed Tomography (CT) scans

Wed, 01/09/2019 - 09:34

(This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)

Noseman is having a headache and, as an old-school hypochondriac, he goes to see his doctor. His doctor is quite worried and makes an appointment with a radiologist for Noseman to get a CT scan.

[Image: Modern CT scanner from Siemens]

Because Noseman always wants to know how things work, he asks the radiologist about the inner workings of a CT scanner.
The basic idea is that X-rays are fired from one side of the scanner to the other. Because different sorts of tissue (like bones, brain cells, cartilage, etc.) block different amounts of the X-rays, the intensity measured on the other side varies accordingly.

The problem is of course that a single picture cannot give the full details of what is inside the body because it is a combination of different sorts of tissue in the way of the respective X-rays. The solution is to rotate the scanner and combine the different slices. How, you ask? Good old linear algebra to the rescue!

We start with the initial position and fire X-rays with an intensity of 30 (just a number for illustrative purposes) through the body.

[Image: Initial position]

As can be seen in the picture, the upper ray goes through areas 1, 2 and 3, and let's say that an intensity value of 12 is measured on the other side of the scanner:

$30 - (x_1 + x_2 + x_3) = 12$, or $x_1 + x_2 + x_3 = 18$.

The rest of the equations are found accordingly. We then rotate the scanner for the first time [Image: Position after first rotation], which gives a second system of equations, and a second rotation [Image: Position after second rotation] yields a third. Now we combine all three systems of equations into one. Here is the data of the matrix for you to download: ct-scan.txt.

We now have 9 equations with 9 unknown variables… which should easily be solvable by R, which can also depict the solution as a gray-scaled image… the actual CT scan!

```r
A <- read.csv("data/ct-scan.txt")
b <- c(18, 21, 18, 18, 21, 9, 18, 14, 16)

# solve the 9x9 linear system A v = b
v <- solve(A, b)
matrix(v, ncol = 3, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,]    9    9    0
## [2,]    9    5    7
## [3,]    9    9    0

# depict the solution as a gray-scale image
image(matrix(v, ncol = 3), col = gray(4:0 / 4))
```

[Image: CT of Noseman]

The radiologist looks at the picture… and has good news for Noseman: everything is how it should be!
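In matrix form (my notation — the post's rendered equations are not reproduced here), the combined system that `solve()` handles is:

$A\,v = b, \qquad A \in \{0,1\}^{9 \times 9},\quad v = (x_1, \dots, x_9)^\top,\quad b = (18, 21, 18, 18, 21, 9, 18, 14, 16)^\top$

where each row of $A$ indicates which of the nine areas a given ray passes through, and each entry of $b$ is the fired intensity (30) minus the intensity measured on the far side for that ray (e.g. $30 - 12 = 18$ for the first ray).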
Noseman is relieved and his headache is much better now…

Real CT scans make use of the same basic principles (of course with a lot of additional engineering and maths magic). Here are real images of CT scans of a human brain (Source: Wikimedia)… which can be combined into a 3D animation (Source: Wikimedia).

Isn't it fascinating how a little bit of maths can save lives!

To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.

### A deep dive into glmnet: offset

Wed, 01/09/2019 - 08:21

(This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers)

I'm writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R's documentation. In this post, we will look at the offset option. For reference, here is the signature of the glmnet function (abridged here after the first several options):

```r
glmnet(x, y, family = c("gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian"),
       weights, offset = NULL, alpha = 1, nlambda = 100,
       lambda.min.ratio = ifelse(nobs < nvars, 0.01, 1e-04), ...)
```

offset

According to the official R documentation, offset should be

> A vector of length nobs that is included in the linear predictor (a nobs x nc matrix for the "multinomial" family).
Its default value is NULL: in that case, glmnet internally sets the offset to be a vector of zeros having the same length as the response y. Here is some example code for using the offset option:

```r
set.seed(1)
n <- 50; p <- 10
x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)
offset <- rnorm(n)

# fit model
fit1 <- glmnet(x, y, offset = offset)
```

If we specify offset in the glmnet call, then when making predictions with the model, we must specify the newoffset option. For example, if we want the predictions fit1 gives us for the training data, not specifying newoffset will give us an error. This is the correct code:

```r
predict(fit1, x, s = 0.1, newoffset = offset)
#                1
#  [1,]  0.44691399
#  [2,]  0.30013292
#  [3,] -1.68825225
#  [4,] -0.49655504
#  [5,]  1.20180199
# ...
```

So, what does offset actually do (or mean)? Recall that glmnet is fitting a linear model. More concretely, our data is $(x_1, y_1), \dots, (x_n, y_n)$, where the $x_i \in \mathbb{R}^p$ are our features for observation $i$ and $y_i$ is the response for observation $i$. For each observation, we are trying to model some variable $\eta$ as a linear combination of the features, i.e. $\eta = \beta_0 + \beta^T x$. $\eta$ is a function of $y$; the function depends on the context. For example,

• For ordinary regression, $\eta = y$, i.e. the response itself.
• For logistic regression, $\eta = \text{logit}(p) = \log \frac{p}{1-p}$, where $p$ is the probability of the positive class.
• For Poisson regression, $\eta = \log \mu$, where $\mu = E[y]$.

So, we are trying to find $\beta_0$ and $\beta$ so that $\beta_0 + \beta^T x$ is a good estimate for $\eta$. If we have an offset $o$, then we are trying to find $\beta_0$ and $\beta$ so that $o + \beta_0 + \beta^T x$ is a good estimate for $\eta$.

Why might we want to use offsets? There are two primary reasons for them stated in the documentation:

> Useful for the "poisson" family (e.g. log of exposure time), or for refining a model by starting at a current fit.

Let me elaborate. First, offsets are useful for Poisson regression. The official vignette has a little section explaining this; let me explain it through an example. Imagine that we are trying to predict how many points an NBA basketball player will score per minute based on his physical attributes. If the player's physical attributes (i.e.
the covariates of our model) are denoted by $x$ and the number of points he scores in a minute is denoted by $y$, then Poisson regression assumes that $y \sim \text{Poisson}(\mu)$ with $\log \mu = \beta_0 + \beta^T x$; $\beta_0$ and $\beta$ are parameters of the model to be determined.

Having described the model, let's turn to our data. For each player $i$, we have physical covariates $x_i$. However, instead of having each player's points per minute, we have the number of points scored over a certain time period. For example, we might have "player 1 scored 12 points over 30 minutes" instead of "player 1 scored 0.4 points per minute".

Offsets allow us to use our data as is. In our example above, loosely speaking, 12/30 (points per minute) is our estimate for $\mu$. Hence, 12 (points in 30 minutes) is our estimate for $30\mu$. In our model, $\beta_0 + \beta^T x$ is our estimate for $\log \mu$, and so our estimate for $\log(30\mu)$ would be $\log 30 + \beta_0 + \beta^T x$. The term $\log 30$ is the "offset" to get the model prediction for our data as is. Taking this to the full dataset: if player $i$ scores $y_i$ points in $m_i$ minutes, then our offset would be the vector $(\log m_1, \dots, \log m_n)$, and the response we would feed glmnet is $(y_1, \dots, y_n)$.

The second reason one might want to use offsets is to improve on an existing model. Continuing the example above: say we have a friend who has trained a model (not necessarily a linear model) to predict $\log \mu$, but he did not use the player's physical attributes. We think that we can improve on his predictions by adding physical attributes to the model. One refinement to our friend's model could be

$\log \mu = f + \beta_0 + \beta^T x,$

where $f$ is the prediction of $\log \mu$ from our friend's model. In this setting, the offsets are simply our friend's predictions. For model training, we would provide the first model's predictions on the training observations as the offset. To get predictions from the refinement on new observations, we would first compute the predictions from the first model, then use them as the newoffset option in the predict call.
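The exposure reasoning above collapses into one line (standard GLM notation, my transcription of the argument):

$\log E[y_i] = \underbrace{\log m_i}_{\text{offset}} + \beta_0 + \beta^T x_i \quad\Longleftrightarrow\quad E\!\left[\frac{y_i}{m_i}\right] = e^{\beta_0 + \beta^T x_i},$

i.e. fixing the coefficient of $\log m_i$ at 1 turns a model for raw counts into a model for rates, which is exactly why the offset must be supplied rather than estimated.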
To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

### Dow Jones Stock Market Index (4/4): Trade Volume GARCH Model

Wed, 01/09/2019 - 04:23

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

This is the final part of the 4-series posts. In this fourth post, I am going to build an ARMA-GARCH model for Dow Jones Industrial Average (DJIA) daily trade volume log ratio. You can read the other three parts in the following links: part 1, part 2, and part 3.

Packages

The packages being used in this post series are herein listed.

```r
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(fBasics))
suppressPackageStartupMessages(library(lmtest))
suppressPackageStartupMessages(library(urca))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(quantmod))
suppressPackageStartupMessages(library(PerformanceAnalytics))
suppressPackageStartupMessages(library(rugarch))
suppressPackageStartupMessages(library(FinTS))
suppressPackageStartupMessages(library(forecast))
suppressPackageStartupMessages(library(strucchange))
suppressPackageStartupMessages(library(TSA))
```

Getting Data

We upload the environment status as saved at the end of part 3.
```r
load(file='DowEnvironment.RData')
```

Structural Breaks in Daily Trade Volume

In part 2, we already highlighted some level shifts occurring within the daily trade volume. Here is the plot we already commented on in part 2.

```r
plot(dj_vol)
```

Now we want to identify and date such level shifts. First, we verify that a linear regression with constant mean is statistically significant.

```r
lm_dj_vol <- lm(dj_vol ~ 1, data = dj_vol)
summary(lm_dj_vol)
##
## Call:
## lm(formula = dj_vol ~ 1, data = dj_vol)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -193537268  -91369768  -25727268   65320232  698562732
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 201947268    2060772      98   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 113200000 on 3019 degrees of freedom
```

The linear regression model is statistically significant, hence it makes sense to proceed with the structural identification of breakpoints with respect to a constant level.
```r
bp_dj_vol <- breakpoints(dj_vol ~ 1, h = 0.1)
summary(bp_dj_vol)
##
##   Optimal (m+1)-segment partition:
##
## Call:
## breakpoints.formula(formula = dj_vol ~ 1, h = 0.1)
##
## Breakpoints at observation number:
##
## m = 1                                     2499
## m = 2   896                               2499
## m = 3   626        1254                   2499
## m = 4   342   644  1254                   2499
## m = 5   342   644  1219  1649             2499
## m = 6   320   622   924  1251  1649       2499
## m = 7   320   622   924  1251  1692  2172 2499
## m = 8   320   622   924  1251  1561  1863 2172 2499
##
## Corresponding to breakdates:
##
## m = 1
## m = 2 0.296688741721854
## m = 3 0.207284768211921
## m = 4 0.113245033112583 0.213245033112583
## m = 5 0.113245033112583 0.213245033112583
## m = 6 0.105960264900662 0.205960264900662 0.305960264900662
## m = 7 0.105960264900662 0.205960264900662 0.305960264900662
## m = 8 0.105960264900662 0.205960264900662 0.305960264900662
##
## m = 1
## m = 2
## m = 3 0.41523178807947
## m = 4 0.41523178807947
## m = 5 0.40364238410596  0.546026490066225
## m = 6 0.414238410596027 0.546026490066225
## m = 7 0.414238410596027 0.560264900662252
## m = 8 0.414238410596027 0.516887417218543 0.616887417218543
##
## m = 1 0.827483443708609
## m = 2 0.827483443708609
## m = 3 0.827483443708609
## m = 4 0.827483443708609
## m = 5 0.827483443708609
## m = 6 0.827483443708609
## m = 7 0.719205298013245 0.827483443708609
## m = 8 0.719205298013245 0.827483443708609
##
## Fit:
##
## m     0         1         2         3         4         5         6
## RSS   3.872e+19 2.772e+19 1.740e+19 1.547e+19 1.515e+19 1.490e+19 1.475e+19
## BIC   1.206e+05 1.196e+05 1.182e+05 1.179e+05 1.178e+05 1.178e+05 1.178e+05
##
## m     7         8
## RSS   1.472e+19 1.478e+19
## BIC   1.178e+05 1.178e+05
```

It gives this plot.

```r
plot(bp_dj_vol)
```

The minimum BIC is reached at breaks = 6. Here is the plot of the DJIA daily trade volume with level shifts (red line).
bp_fit <- fitted(bp_dj_vol, breaks = 6)
bp_fit_xts <- xts(bp_fit, order.by = index(dj_vol))
bb <- merge(bp_fit_xts, dj_vol)
colnames(bb) <- c("Level", "DJI.Volume")
plot(bb, lwd = c(3,1), col = c("red", "black"))

Here are the calendar dates when such level shifts occur.

bd <- breakdates(bp_dj_vol)
bd <- as.integer(nrow(dj_vol)*bd)
index(dj_vol)[bd]

## [1] "2008-04-10" "2009-06-22" "2010-09-01" "2011-12-16" "2013-07-22"
## [6] "2016-12-02"

See ref. [10] for further details about structural breaks.

Daily Trade Volume Log Ratio Model

As shown in part 2, the daily trade volume log ratio:

$v_{t}\ := ln \frac{V_{t}}{V_{t-1}}$

is already available within our data environment. We plot it.

plot(dj_vol_log_ratio)

Outliers Detection

The Return.clean function within the PerformanceAnalytics package is able to clean a return time series from outliers. Here below we compare the original time series with the outliers-adjusted one.

dj_vol_log_ratio_outliersadj <- Return.clean(dj_vol_log_ratio, "boudt")
p <- plot(dj_vol_log_ratio)
p <- addSeries(dj_vol_log_ratio_outliersadj, col = 'blue', on = 1)
p

The analysis will be carried on with the original time series, as a more conservative approach to volatility evaluation.

Correlation plots

Here below are the total and partial correlation plots.

acf(dj_vol_log_ratio)
pacf(dj_vol_log_ratio)

The above plots suggest some ARMA(p,q) model with p and q > 0. This will be verified later in the analysis.

Unit root tests

We run the Augmented Dickey-Fuller test as available within the urca package. The no-trend and no-drift test flavor is run.
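Beyond point estimates of the break locations, strucchange can also quantify their uncertainty. A minimal sketch, assuming the bp_dj_vol object fitted above is still in the workspace:

```r
# Confidence intervals for the estimated break locations
# (strucchange's confint() method returns lower/upper bounds
# in observation numbers for the chosen number of breaks).
library(strucchange)
ci <- confint(bp_dj_vol, breaks = 6)
ci
breakdates(ci)   # maps the interval bounds back to (fractional) break dates
```

The interval widths give a feel for how sharply each level shift is dated; narrow intervals correspond to abrupt, well-identified shifts in trade volume.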
(urdftest_lag = floor(12* (nrow(dj_vol_log_ratio)/100)^0.25))

## [1] 28

summary(ur.df(dj_vol_log_ratio, type = "none", lags = urdftest_lag, selectlags="BIC"))

## ## ############################################### ## # Augmented Dickey-Fuller Test Unit Root Test # ## ############################################### ## ## Test regression none ## ## ## Call: ## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.51742 -0.13544 -0.02485 0.12006 1.98641 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## z.lag.1 -5.21571 0.25218 -20.682 < 2e-16 *** ## z.diff.lag1 3.59929 0.24637 14.609 < 2e-16 *** ## z.diff.lag2 3.14516 0.23796 13.217 < 2e-16 *** ## z.diff.lag3 2.78234 0.22809 12.198 < 2e-16 *** ## z.diff.lag4 2.41745 0.21700 11.140 < 2e-16 *** ## z.diff.lag5 2.14179 0.20472 10.462 < 2e-16 *** ## z.diff.lag6 1.88354 0.19168 9.826 < 2e-16 *** ## z.diff.lag7 1.66394 0.17781 9.358 < 2e-16 *** ## z.diff.lag8 1.41657 0.16304 8.688 < 2e-16 *** ## z.diff.lag9 1.19354 0.14771 8.080 9.31e-16 *** ## z.diff.lag10 1.00374 0.13207 7.600 3.95e-14 *** ## z.diff.lag11 0.88134 0.11634 7.575 4.75e-14 *** ## z.diff.lag12 0.74946 0.10027 7.474 1.02e-13 *** ## z.diff.lag13 0.60798 0.08399 7.239 5.73e-13 *** ## z.diff.lag14 0.48367 0.06732 7.185 8.46e-13 *** ## z.diff.lag15 0.36278 0.05108 7.102 1.53e-12 *** ## z.diff.lag16 0.21128 0.03470 6.088 1.29e-09 *** ## z.diff.lag17 0.07911 0.01833 4.315 1.65e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2576 on 2972 degrees of freedom ## Multiple R-squared: 0.7476, Adjusted R-squared: 0.746 ## F-statistic: 488.9 on 18 and 2972 DF, p-value: < 2.2e-16 ## ## ## Value of test-statistic is: -20.6822 ## ## Critical values for test statistics: ## 1pct 5pct 10pct ## tau1 -2.58 -1.95 -1.62

Based on the reported test statistic compared with the critical values, we reject the null hypothesis of a unit root. See ref. [6] for further details about the Augmented Dickey-Fuller test.

ARMA model

We now determine the ARMA structure of our time series in order to run the ARCH effects test on the resulting residuals. That is in agreement with what is outlined in ref. [4], §4.3.

We take advantage of the auto.arima() function within the forecast package (ref. [7]) to get an idea of which ARMA model to start with.

(auto_model_1 <- auto.arima(dj_vol_log_ratio, stepwise=FALSE))

## Series: dj_vol_log_ratio
## ARIMA(3,0,2) with zero mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2
## -0.5266 0.3769 0.1459 -0.0896 -0.7812
## s.e. 0.0970 0.0401 0.0203 0.0968 0.0872
##
## sigma^2 estimated as 0.0659: log likelihood=-176.48
## AIC=364.96 AICc=364.99 BIC=401.03

coeftest(auto_model_1)

##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.526617 0.096988 -5.4297 5.644e-08 ***
## ar2 0.376906 0.040090 9.4014 < 2.2e-16 ***
## ar3 0.145902 0.020324 7.1787 7.039e-13 ***
## ma1 -0.089611 0.096788 -0.9258 0.3545
## ma2 -0.781163 0.087159 -8.9626 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ma1 coefficient is not significant. Hence we try the following ARMA(2,3) model.

(arma_model_2 <- arima(dj_vol_log_ratio, order=c(2,0,3), include.mean = FALSE))

##
## Call:
## arima(x = dj_vol_log_ratio, order = c(2, 0, 3), include.mean = FALSE)
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3
## -0.1802 0.6441 -0.4351 -0.8604 0.3596
## s.e. 0.0643 0.0454 0.0681 0.0493 0.0423
##
## sigma^2 estimated as 0.0658: log likelihood = -176.9, aic = 363.79

coeftest(arma_model_2)

##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.180233 0.064315 -2.8023 0.005073 **
## ar2 0.644104 0.045449 14.1721 < 2.2e-16 ***
## ma1 -0.435122 0.068126 -6.3870 1.692e-10 ***
## ma2 -0.860443 0.049282 -17.4595 < 2.2e-16 ***
## ma3 0.359564 0.042307 8.4990 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All coefficients are significant and the AIC is lower than that of the first model. We then try ARMA(1,2).

(arma_model_3 <- arima(dj_vol_log_ratio, order=c(1,0,2), include.mean = FALSE))

##
## Call:
## arima(x = dj_vol_log_ratio, order = c(1, 0, 2), include.mean = FALSE)
##
## Coefficients:
## ar1 ma1 ma2
## 0.6956 -1.3183 0.3550
## s.e. 0.0439 0.0518 0.0453
##
## sigma^2 estimated as 0.06598: log likelihood = -180.92, aic = 367.84

coeftest(arma_model_3)

##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.695565 0.043874 15.8537 < 2.2e-16 ***
## ma1 -1.318284 0.051787 -25.4557 < 2.2e-16 ***
## ma2 0.355015 0.045277 7.8409 4.474e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This model has the highest AIC among the set, although all its coefficients are significant.

As further verification, we can also try the eacf() function within the TSA package.

eacf(dj_vol_log_ratio)

## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o o x x o o x o o x o o o
## 1 x x o x o o o x o o x o o o
## 2 x x x x o o o o o o x o o o
## 3 x x x x o o o o o o x o o o
## 4 x x x x x o o o o o x o o o
## 5 x x x x o o o o o o o o o o
## 6 x x x x x o x o o o o o o o
## 7 x x x x x o o o o o o o o o

The upper left triangle with "o" as a vertex seems to be located somewhere within {(1,2),(2,2),(1,3),(2,3)}, which represents the set of potential (p,q) values according to the eacf output. Note that we prefer parsimonious models, which is why we do not push the AR and MA orders too far.

We already verified ARMA models with (p,q) orders within the set {(3,2), (2,3), (1,2)}. Let us now try (2,2) and (1,3).

(arma_model_4 <- arima(dj_vol_log_ratio, order=c(2,0,2), include.mean = FALSE))

##
## Call:
## arima(x = dj_vol_log_ratio, order = c(2, 0, 2), include.mean = FALSE)
##
## Coefficients:
## ar1 ar2 ma1 ma2
## 0.7174 -0.0096 -1.3395 0.3746
## s.e. 0.1374 0.0560 0.1361 0.1247
##
## sigma^2 estimated as 0.06598: log likelihood = -180.9, aic = 369.8

coeftest(arma_model_4)

##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.7173631 0.1374135 5.2205 1.785e-07 ***
## ar2 -0.0096263 0.0560077 -0.1719 0.863536
## ma1 -1.3394720 0.1361208 -9.8403 < 2.2e-16 ***
## ma2 0.3746317 0.1247117 3.0040 0.002665 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ar2 coefficient is not significant.

(arma_model_5 <- arima(dj_vol_log_ratio, order=c(1,0,3), include.mean = FALSE))

##
## Call:
## arima(x = dj_vol_log_ratio, order = c(1, 0, 3), include.mean = FALSE)
##
## Coefficients:
## ar1 ma1 ma2 ma3
## 0.7031 -1.3253 0.3563 0.0047
## s.e. 0.0657 0.0684 0.0458 0.0281
##
## sigma^2 estimated as 0.06598: log likelihood = -180.9, aic = 369.8

coeftest(arma_model_5)

##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.7030934 0.0656902 10.7032 < 2.2e-16 ***
## ma1 -1.3253176 0.0683526 -19.3894 < 2.2e-16 ***
## ma2 0.3563425 0.0458436 7.7730 7.664e-15 ***
## ma3 0.0047019 0.0280798 0.1674 0.867
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ma3 coefficient is not significant.

As a conclusion, we keep both ARMA(1,2) and ARMA(2,3) as tentative mean models. We can now proceed with the ARCH effects tests.
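As a quick cross-check of the comparison above, the AICs of the fitted candidates can be collected side by side; a minimal sketch, assuming the model objects fitted above are still in the workspace:

```r
# Side-by-side AIC comparison of the candidate mean models
# (lower AIC = better trade-off between fit and parsimony).
sapply(list("ARMA(3,2)" = auto_model_1,
            "ARMA(2,3)" = arma_model_2,
            "ARMA(1,2)" = arma_model_3), AIC)
```

The same pattern works with BIC() if a stronger penalty on the number of parameters is preferred.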

ARCH effect Test

If ARCH effects are statistically significant in the residuals of our time series, a GARCH model is required.
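The ArchTest function used below implements Engle's Lagrange Multiplier test. As an illustration of what it computes, here is a hand-rolled sketch (the helper name and the default lag of 12 are our own choices, not part of the original analysis):

```r
# Sketch of the ARCH LM test: regress squared residuals on their own q lags;
# under H0 (no ARCH effects) T * R^2 is asymptotically chi-squared with q d.o.f.
arch_lm_sketch <- function(e, q = 12) {
  e2 <- e^2
  X <- embed(e2, q + 1)        # column 1 = e2[t], columns 2..q+1 = lags 1..q
  fit <- lm(X[, 1] ~ X[, -1])  # auxiliary regression on the lagged squares
  stat <- nrow(X) * summary(fit)$r.squared
  c(statistic = stat,
    p.value = pchisq(stat, df = q, lower.tail = FALSE))
}
```

A large statistic (small p-value) indicates that squared residuals are predictable from their own past, i.e. conditional heteroskedasticity is present.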

We test the candidate mean model ARMA(2,3).

resid_dj_vol_log_ratio <- residuals(arma_model_2)
ArchTest(resid_dj_vol_log_ratio - mean(resid_dj_vol_log_ratio))

##
## ARCH LM-test; Null hypothesis: no ARCH effects
##
## data: resid_dj_vol_log_ratio - mean(resid_dj_vol_log_ratio)
## Chi-squared = 78.359, df = 12, p-value = 8.476e-12

Based on reported p-value, we reject the null hypothesis of no ARCH effects.

Let us have a look at the residual correlation plot too.

par(mfrow=c(1,2))
acf(resid_dj_vol_log_ratio)
pacf(resid_dj_vol_log_ratio)

We test the second candidate mean model, ARMA(1,2).

resid_dj_vol_log_ratio <- resid(arma_model_3)
ArchTest(resid_dj_vol_log_ratio - mean(resid_dj_vol_log_ratio))

##
## ARCH LM-test; Null hypothesis: no ARCH effects
##
## data: resid_dj_vol_log_ratio - mean(resid_dj_vol_log_ratio)
## Chi-squared = 74.768, df = 12, p-value = 4.065e-11

Based on reported p-value, we reject the null hypothesis of no ARCH effects.

Let us have a look at the residual correlation plots.

par(mfrow=c(1,2))
acf(resid_dj_vol_log_ratio)
pacf(resid_dj_vol_log_ratio)

To inspect asymmetries within the DJIA volume log ratio, summary statistics and a density plot are shown.

basicStats(dj_vol_log_ratio)

## DJI.Volume
## nobs 3019.000000
## NAs 0.000000
## Minimum -2.301514
## Maximum 2.441882
## 1. Quartile -0.137674
## 3. Quartile 0.136788
## Mean -0.000041
## Median -0.004158
## Sum -0.124733
## SE Mean 0.005530
## LCL Mean -0.010885
## UCL Mean 0.010802
## Variance 0.092337
## Stdev 0.303869
## Skewness -0.182683
## Kurtosis 9.463384

Negative skewness is reported.

plot(density(dj_vol_log_ratio))

As a result, the eGARCH model will be proposed for the daily trade volume log ratio as well.

We run two fits in order to compare the results of the two candidate mean models, ARMA(1,2) and ARMA(2,3).

ARMA-GARCH: ARMA(1,2) + eGARCH(1,1)

garchspec <- ugarchspec(mean.model = list(armaOrder = c(1,2), include.mean = FALSE),
                        variance.model = list(model = "eGARCH", garchOrder = c(1, 1), variance.targeting = FALSE),
                        distribution.model = "sstd")
(garchfit <- ugarchfit(data = dj_vol_log_ratio, spec = garchspec, solver='hybrid'))

## ## *---------------------------------* ## * GARCH Model Fit * ## *---------------------------------* ## ## Conditional Variance Dynamics ## ----------------------------------- ## GARCH Model : eGARCH(1,1) ## Mean Model : ARFIMA(1,0,2) ## Distribution : sstd ## ## Optimal Parameters ## ------------------------------------ ## Estimate Std. Error t value Pr(>|t|) ## ar1 0.67731 0.014856 45.5918 0.0e+00 ## ma1 -1.22817 0.000038 -31975.1819 0.0e+00 ## ma2 0.27070 0.000445 608.3525 0.0e+00 ## omega -1.79325 0.207588 -8.6385 0.0e+00 ## alpha1 0.14348 0.032569 4.4053 1.1e-05 ## beta1 0.35819 0.073164 4.8957 1.0e-06 ## gamma1 0.41914 0.042252 9.9199 0.0e+00 ## skew 1.32266 0.031528 41.9518 0.0e+00 ## shape 3.54346 0.221750 15.9795 0.0e+00 ## ## Robust Standard Errors: ## Estimate Std.
Error t value Pr(>|t|) ## ar1 0.67731 0.022072 30.6859 0.0e+00 ## ma1 -1.22817 0.000067 -18466.0626 0.0e+00 ## ma2 0.27070 0.000574 471.4391 0.0e+00 ## omega -1.79325 0.233210 -7.6894 0.0e+00 ## alpha1 0.14348 0.030588 4.6906 3.0e-06 ## beta1 0.35819 0.082956 4.3178 1.6e-05 ## gamma1 0.41914 0.046728 8.9698 0.0e+00 ## skew 1.32266 0.037586 35.1902 0.0e+00 ## shape 3.54346 0.238225 14.8744 0.0e+00 ## ## LogLikelihood : 347.9765 ## ## Information Criteria ## ------------------------------------ ## ## Akaike -0.22456 ## Bayes -0.20664 ## Shibata -0.22458 ## Hannan-Quinn -0.21812 ## ## Weighted Ljung-Box Test on Standardized Residuals ## ------------------------------------ ## statistic p-value ## Lag[1] 0.5812 4.459e-01 ## Lag[2*(p+q)+(p+q)-1][8] 8.5925 3.969e-08 ## Lag[4*(p+q)+(p+q)-1][14] 14.1511 4.171e-03 ## d.o.f=3 ## H0 : No serial correlation ## ## Weighted Ljung-Box Test on Standardized Squared Residuals ## ------------------------------------ ## statistic p-value ## Lag[1] 0.4995 0.4797 ## Lag[2*(p+q)+(p+q)-1][5] 1.1855 0.8164 ## Lag[4*(p+q)+(p+q)-1][9] 2.4090 0.8510 ## d.o.f=2 ## ## Weighted ARCH LM Tests ## ------------------------------------ ## Statistic Shape Scale P-Value ## ARCH Lag[3] 0.4215 0.500 2.000 0.5162 ## ARCH Lag[5] 0.5974 1.440 1.667 0.8545 ## ARCH Lag[7] 1.2835 2.315 1.543 0.8636 ## ## Nyblom stability test ## ------------------------------------ ## Joint Statistic: 5.2333 ## Individual Statistics: ## ar1 0.63051 ## ma1 1.18685 ## ma2 1.11562 ## omega 2.10211 ## alpha1 0.08261 ## beta1 2.07607 ## gamma1 0.15883 ## skew 0.33181 ## shape 2.56140 ## ## Asymptotic Critical Values (10% 5% 1%) ## Joint Statistic: 2.1 2.32 2.82 ## Individual Statistic: 0.35 0.47 0.75 ## ## Sign Bias Test ## ------------------------------------ ## t-value prob sig ## Sign Bias 1.600 0.10965 ## Negative Sign Bias 0.602 0.54725 ## Positive Sign Bias 2.540 0.01115 ** ## Joint Effect 6.815 0.07804 * ## ## ## Adjusted Pearson Goodness-of-Fit Test: ## 
------------------------------------ ## group statistic p-value(g-1) ## 1 20 20.37 0.3726 ## 2 30 36.82 0.1510 ## 3 40 45.07 0.2328 ## 4 50 52.03 0.3567 ## ## ## Elapsed time : 1.364722

All coefficients are statistically significant. However, based on the Weighted Ljung-Box test p-values on standardized residuals reported above, we reject the null hypothesis of no residual correlation for this model. Hence the ARMA(1,2)+eGARCH(1,1) model is not able to capture all the structure of our time series.

ARMA-GARCH: ARMA(2,3) + eGARCH(1,1)

garchspec <- ugarchspec(mean.model = list(armaOrder = c(2,3), include.mean = FALSE),
                        variance.model = list(model = "eGARCH", garchOrder = c(1, 1), variance.targeting = FALSE),
                        distribution.model = "sstd")
(garchfit <- ugarchfit(data = dj_vol_log_ratio, spec = garchspec, solver='hybrid'))

## ## *---------------------------------* ## * GARCH Model Fit * ## *---------------------------------* ## ## Conditional Variance Dynamics ## ----------------------------------- ## GARCH Model : eGARCH(1,1) ## Mean Model : ARFIMA(2,0,3) ## Distribution : sstd ## ## Optimal Parameters ## ------------------------------------ ## Estimate Std. Error t value Pr(>|t|) ## ar1 -0.18607 0.008580 -21.6873 0.0e+00 ## ar2 0.59559 0.004596 129.5884 0.0e+00 ## ma1 -0.35619 0.013512 -26.3608 0.0e+00 ## ma2 -0.83010 0.004689 -177.0331 0.0e+00 ## ma3 0.26277 0.007285 36.0678 0.0e+00 ## omega -1.92262 0.226738 -8.4795 0.0e+00 ## alpha1 0.14382 0.033920 4.2401 2.2e-05 ## beta1 0.31060 0.079441 3.9098 9.2e-05 ## gamma1 0.43137 0.043016 10.0281 0.0e+00 ## skew 1.32282 0.031382 42.1523 0.0e+00 ## shape 3.48939 0.220787 15.8043 0.0e+00 ## ## Robust Standard Errors: ## Estimate Std.
Error t value Pr(>|t|) ## ar1 -0.18607 0.023940 -7.7724 0.000000 ## ar2 0.59559 0.022231 26.7906 0.000000 ## ma1 -0.35619 0.024244 -14.6918 0.000000 ## ma2 -0.83010 0.004831 -171.8373 0.000000 ## ma3 0.26277 0.030750 8.5453 0.000000 ## omega -1.92262 0.266462 -7.2154 0.000000 ## alpha1 0.14382 0.032511 4.4239 0.000010 ## beta1 0.31060 0.095329 3.2582 0.001121 ## gamma1 0.43137 0.047092 9.1602 0.000000 ## skew 1.32282 0.037663 35.1225 0.000000 ## shape 3.48939 0.223470 15.6146 0.000000 ## ## LogLikelihood : 356.4994 ## ## Information Criteria ## ------------------------------------ ## ## Akaike -0.22888 ## Bayes -0.20698 ## Shibata -0.22891 ## Hannan-Quinn -0.22101 ## ## Weighted Ljung-Box Test on Standardized Residuals ## ------------------------------------ ## statistic p-value ## Lag[1] 0.7678 0.38091 ## Lag[2*(p+q)+(p+q)-1][14] 7.7336 0.33963 ## Lag[4*(p+q)+(p+q)-1][24] 17.1601 0.04972 ## d.o.f=5 ## H0 : No serial correlation ## ## Weighted Ljung-Box Test on Standardized Squared Residuals ## ------------------------------------ ## statistic p-value ## Lag[1] 0.526 0.4683 ## Lag[2*(p+q)+(p+q)-1][5] 1.677 0.6965 ## Lag[4*(p+q)+(p+q)-1][9] 2.954 0.7666 ## d.o.f=2 ## ## Weighted ARCH LM Tests ## ------------------------------------ ## Statistic Shape Scale P-Value ## ARCH Lag[3] 1.095 0.500 2.000 0.2955 ## ARCH Lag[5] 1.281 1.440 1.667 0.6519 ## ARCH Lag[7] 1.940 2.315 1.543 0.7301 ## ## Nyblom stability test ## ------------------------------------ ## Joint Statistic: 5.3764 ## Individual Statistics: ## ar1 0.12923 ## ar2 0.20878 ## ma1 1.15005 ## ma2 1.15356 ## ma3 0.97487 ## omega 2.04688 ## alpha1 0.09695 ## beta1 2.01026 ## gamma1 0.18039 ## skew 0.38131 ## shape 2.40996 ## ## Asymptotic Critical Values (10% 5% 1%) ## Joint Statistic: 2.49 2.75 3.27 ## Individual Statistic: 0.35 0.47 0.75 ## ## Sign Bias Test ## ------------------------------------ ## t-value prob sig ## Sign Bias 1.4929 0.13556 ## Negative Sign Bias 0.6317 0.52766 ## Positive Sign Bias 2.4505 
0.01432 ** ## Joint Effect 6.4063 0.09343 * ## ## ## Adjusted Pearson Goodness-of-Fit Test: ## ------------------------------------ ## group statistic p-value(g-1) ## 1 20 17.92 0.5278 ## 2 30 33.99 0.2395 ## 3 40 44.92 0.2378 ## 4 50 50.28 0.4226 ## ## ## Elapsed time : 1.660402

All coefficients are statistically significant. No correlation is found within the standardized residuals or the standardized squared residuals. All ARCH effects are properly captured by the model. The Adjusted Pearson goodness-of-fit test does not reject the null hypothesis that the empirical distribution of the standardized residuals matches the selected theoretical distribution. However:

* the Nyblom stability test rejects, for some parameters, the null hypothesis that the model parameters are constant over time

* the Positive Sign Bias null hypothesis is rejected at the 5% significance level; this test focuses on the effect of large and small positive shocks

See ref. [11] for an explanation of GARCH model diagnostics.

Some diagnostic plots are shown.

par(mfrow=c(2,2))
plot(garchfit, which=8)
plot(garchfit, which=9)
plot(garchfit, which=10)
plot(garchfit, which=11)

We show the original DJIA daily trade volume log ratio time series with the mean model fit (red line) and the conditional volatility (blue line).

par(mfrow=c(1,1))
cond_volatility <- sigma(garchfit)
mean_model_fit <- fitted(garchfit)
p <- plot(dj_vol_log_ratio, col = "grey")
p <- addSeries(mean_model_fit, col = 'red', on = 1)
p <- addSeries(cond_volatility, col = 'blue', on = 1)
p

Model Equation

Combining the ARMA(2,3) and eGARCH(1,1) model equations, we have:

$\begin{cases} y_{t}\ -\ \phi_{1}y_{t-1}\ -\ \phi_{2}y_{t-2}\ =\ u_{t}\ +\ \theta_{1}u_{t-1}\ +\ \theta_{2}u_{t-2}\ +\ \theta_{3}u_{t-3} \\ \\ u_{t}\ =\ \sigma_{t}\epsilon_{t},\ \ \ \ \ \epsilon_{t}\ \sim\ \mathrm{i.i.d.}(0,1) \\ \\ \ln(\sigma_{t}^2)\ =\ \omega\ +\ \sum_{j=1}^{q} \left( \alpha_{j} \epsilon_{t-j}\ +\ \gamma_{j} (|\epsilon_{t-j}|\ -\ E|\epsilon_{t-j}|) \right)\ +\ \sum_{i=1}^{p} \beta_{i} \ln(\sigma_{t-i}^2) \end{cases}$

Using the model resulting coefficients we obtain:

$\begin{cases} y_{t}\ +\ 0.186\ y_{t-1}\ -\ 0.595\ y_{t-2}\ =\ u_{t}\ -\ 0.356\ u_{t-1}\ -\ 0.830\ u_{t-2}\ +\ 0.263\ u_{t-3} \\ \\ u_{t}\ =\ \sigma_{t}\epsilon_{t},\ \ \ \ \ \epsilon_{t}\ \sim\ \mathrm{i.i.d.}(0,1) \\ \\ \ln(\sigma_{t}^2)\ =\ -1.922\ +\ 0.144\ \epsilon_{t-1}\ +\ 0.431\ (|\epsilon_{t-1}|\ -\ E|\epsilon_{t-1}|)\ +\ 0.311\ \ln(\sigma_{t-1}^2) \end{cases}$
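To make the variance recursion concrete, a one-step update can be sketched directly from the fitted coefficients. This is an illustration only: the helper name is ours, and E|ε| is approximated by sqrt(2/pi), which holds for a standard normal innovation, whereas the fitted model actually uses a skewed Student-t distribution.

```r
# One-step eGARCH(1,1) variance update, following the recursion as
# implemented in rugarch: ln(sigma2_t) = omega + alpha*eps + gamma*(|eps| - E|eps|) + beta*ln(sigma2_prev).
# Coefficient defaults are the rounded fitted values from the output above.
egarch_step <- function(eps_prev, sigma2_prev,
                        omega = -1.922, alpha = 0.144,
                        gamma = 0.431, beta = 0.311) {
  # sqrt(2/pi) = E|eps| under a standard normal; a simplifying assumption here
  exp(omega + alpha * eps_prev +
      gamma * (abs(eps_prev) - sqrt(2 / pi)) +
      beta * log(sigma2_prev))
}
```

Evaluating egarch_step at eps_prev = 2 versus eps_prev = -2 shows the asymmetry: the alpha term changes sign while the gamma term does not, so positive and negative shocks of equal size produce different next-period variances.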

Daily Volume Log Ratio Volatility Analysis

Here is the plot of the conditional volatility resulting from our ARMA(2,3)+eGARCH(1,1) model.

plot(cond_volatility)

Line plots of conditional volatility by year are shown.

par(mfrow=c(6,2))
pl <- lapply(2007:2018, function(x) {
  plot(cond_volatility[as.character(x)], main = "DJIA Daily Volume Log-ratio conditional volatility")
})
pl

Conditional volatility box-plots by year are shown.

par(mfrow=c(1,1))
cond_volatility_df <- xts_to_dataframe(cond_volatility)
dataframe_boxplot(cond_volatility_df, "Dow Jones Daily Volume Log-ratio conditional volatility 2007-2018")

If you have any questions, please feel free to comment below.

References

Disclaimer

Any securities or databases referred in this post are solely for illustration purposes, and under no regard should the findings presented here be interpreted as investment advice or promotion of any particular security or source.



To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### You did a sentiment analysis with tidytext but you forgot to do dependency parsing to answer WHY is something positive/negative

Tue, 01/08/2019 - 23:16

(This article was first published on bnosac :: open analytical helpers - bnosac :: open analytical helpers, and kindly contributed to R-bloggers)

A small note on the growing list of users of the udpipe R package. In the last month of 2018, we updated the package on CRAN with some noteworthy changes:

• The default models downloaded with the function udpipe_download_model are now built on Universal Dependencies 2.3 (released on 2018-11-15)
• This means udpipe now has models for 60 languages. That’s right! And they provide tokenisation, parts of speech tagging, lemmatisation and dependency parsing built on all of these treebanks: afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, persian-seraji, polish-lfg, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb.
• Although this was not originally intended, we added a sentiment scoring function in the latest release (version 0.8 on CRAN). Combined with the output of the dependency parsing, this allows us to answer questions like 'WHAT IS CAUSING A NEGATIVE SENTIMENT'. An example is shown below.
• If you want to use the udpipe models for commercial purposes, we have some nice extra pretrained models available for you – get in touch if you are looking for this.
Below we will showcase the new features of the R package by finding out what is causing a negative sentiment. Watching users of the tidytext sentiment R package, I have always wondered whether they do sentiment scoring just for the love of building reports, as the main thing they seem to report is the frequency of occurrences of words which are part of a positive or negative dictionary, while their manager probably asked them: "Yeah, but why is the sentiment negative or positive?"
You can answer this managerial question using dependency parsing, and that is exactly what udpipe provides (amongst other NLP annotations). Dependency parsing links each word to another word, allowing us to find out which words are linked to negative words, giving you the context of why something is negative and what needs to be improved in your business. Let's show how to get this easily done in R.

Below we get a sample of 500 AirBnb customer reviews in French, annotate them with udpipe (using a French model built on top of the Rhapsodie French treebank), and use the new sentiment scoring function txt_sentiment, available in the new udpipe release, with an online dictionary of positive/negative terms for French. Next we use the udpipe dependency parsing output by looking at the adjectival modifier 'amod' in the dep_rel output field and visualise all words which are linked to the negative terms of the dictionary. The result is a graph showing words of the dictionary in red and words which are linked to them in another colour.

Full code showing how this is done is shown below.

library(udpipe)
library(dplyr)
library(magrittr)
data(brussels_reviews, package = "udpipe")
x <- brussels_reviews %>%
filter(language == "fr") %>%
rename(doc_id = id, text = feedback) %>%
udpipe("french-spoken", trace = 10)
##
## Get a French sentiment dictionary lexicon with positive/negative terms, negators, amplifiers and deamplifiers
##
polarity_terms <- rename(FEEL_fr, term = x, polarity = y)
polarity_negators <- subset(valShifters$valence_fr, t == 1)$x
polarity_amplifiers <- subset(valShifters$valence_fr, t == 2)$x
polarity_deamplifiers <- subset(valShifters$valence_fr, t == 3)$x
##
## Do sentiment analysis based on that open French lexicon
##
sentiments <- txt_sentiment(x, term = "lemma",
polarity_terms = polarity_terms,
polarity_negators = polarity_negators,
polarity_amplifiers = polarity_amplifiers,
polarity_deamplifiers = polarity_deamplifiers)
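The dependency-parsing step described above (linking 'amod' modifiers to the negative dictionary terms) can be sketched as follows. This is our own minimal sketch, not the author's original plotting code: the polarity coding ("negative") and the column handling are assumptions about the lexicon and the udpipe annotation columns (doc_id, sentence_id, token_id, head_token_id, dep_rel, lemma).

```r
# Sketch: find which words the negative dictionary terms modify.
negative_terms <- subset(polarity_terms, polarity == "negative")$term  # coding assumed

# keep only adjectival modifiers and look up the lemma of their head token
amod  <- subset(x, dep_rel == "amod")
heads <- x[, c("doc_id", "sentence_id", "token_id", "lemma")]
names(heads)[3:4] <- c("head_token_id", "head_lemma")
amod  <- merge(amod, heads, by = c("doc_id", "sentence_id", "head_token_id"))

# most frequent "negative adjective -> modified word" pairs
negatives <- subset(amod, lemma %in% negative_terms)
head(sort(table(paste(negatives$lemma, negatives$head_lemma)), decreasing = TRUE))
```

The resulting pair counts are what the graph in the post visualises: the dictionary word in red and the word it modifies in another colour.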

ACF and PACF plots tailing off suggest an ARMA(2,2) (ref. [5], Table 3.1). We take advantage of the auto.arima() function within the forecast package (ref. [7]) to get an idea of what model to start with.

auto_model <- auto.arima(dj_ret)
summary(auto_model)

## Series: dj_ret ## ARIMA(2,0,4) with zero mean ## ## Coefficients: ## ar1 ar2 ma1 ma2 ma3 ma4 ## 0.4250 -0.8784 -0.5202 0.8705 -0.0335 -0.0769 ## s.e. 0.0376 0.0628 0.0412 0.0672 0.0246 0.0203 ## ## sigma^2 estimated as 0.0001322: log likelihood=9201.19 ## AIC=-18388.38 AICc=-18388.34 BIC=-18346.29 ## ## Training set error measures: ## ME RMSE MAE MPE MAPE MASE ## Training set 0.0002416895 0.01148496 0.007505056 NaN Inf 0.6687536 ## ACF1 ## Training set -0.002537238

An ARMA(2,4) model is suggested. However, the ma3 coefficient is not significant, as further verified by:

coeftest(auto_model)

## ## z test of coefficients: ## ## Estimate Std. Error z value Pr(>|z|) ## ar1 0.425015 0.037610 11.3007 < 2.2e-16 *** ## ar2 -0.878356 0.062839 -13.9779 < 2.2e-16 *** ## ma1 -0.520173 0.041217 -12.6204 < 2.2e-16 *** ## ma2 0.870457 0.067211 12.9511 < 2.2e-16 *** ## ma3 -0.033527 0.024641 -1.3606 0.1736335 ## ma4 -0.076882 0.020273 -3.7923 0.0001492 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hence we constrain the MA order to q <= 2.

auto_model2 <- auto.arima(dj_ret, max.q=2)
summary(auto_model2)

## Series: dj_ret ## ARIMA(2,0,2) with zero mean ## ## Coefficients: ## ar1 ar2 ma1 ma2 ## -0.5143 -0.4364 0.4212 0.3441 ## s.e. 0.1461 0.1439 0.1512 0.1532 ## ## sigma^2 estimated as 0.0001325: log likelihood=9196.33 ## AIC=-18382.66 AICc=-18382.64 BIC=-18352.6 ## ## Training set error measures: ## ME RMSE MAE MPE MAPE MASE ## Training set 0.0002287171 0.01150361 0.007501925 Inf Inf 0.6684746 ## ACF1 ## Training set -0.002414944

Now all coefficients are significant.

coeftest(auto_model2)

## ## z test of coefficients: ## ## Estimate Std.
Error z value Pr(>|z|) ## ar1 -0.51428 0.14613 -3.5192 0.0004328 *** ## ar2 -0.43640 0.14392 -3.0322 0.0024276 ** ## ma1 0.42116 0.15121 2.7853 0.0053485 ** ## ma2 0.34414 0.15323 2.2458 0.0247139 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Further verification with ARMA(2,1) and ARMA(1,2) resulted in higher AIC values than ARMA(2,2); hence ARMA(2,2) is preferable. Here are the results.

auto_model3 <- auto.arima(dj_ret, max.p=2, max.q=1)
summary(auto_model3)

## Series: dj_ret ## ARIMA(2,0,1) with zero mean ## ## Coefficients: ## ar1 ar2 ma1 ## -0.4619 -0.1020 0.3646 ## s.e. 0.1439 0.0204 0.1438 ## ## sigma^2 estimated as 0.0001327: log likelihood=9194.1 ## AIC=-18380.2 AICc=-18380.19 BIC=-18356.15 ## ## Training set error measures: ## ME RMSE MAE MPE MAPE MASE ## Training set 0.0002370597 0.01151213 0.007522059 Inf Inf 0.6702687 ## ACF1 ## Training set 0.0009366271

coeftest(auto_model3)

## ## z test of coefficients: ## ## Estimate Std. Error z value Pr(>|z|) ## ar1 -0.461916 0.143880 -3.2104 0.001325 ** ## ar2 -0.102012 0.020377 -5.0062 5.552e-07 *** ## ma1 0.364628 0.143818 2.5353 0.011234 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All coefficients are statistically significant.

auto_model4 <- auto.arima(dj_ret, max.p=1, max.q=2)
summary(auto_model4)

## Series: dj_ret ## ARIMA(1,0,2) with zero mean ## ## Coefficients: ## ar1 ma1 ma2 ## -0.4207 0.3259 -0.0954 ## s.e. 0.1488 0.1481 0.0198 ## ## sigma^2 estimated as 0.0001328: log likelihood=9193.01 ## AIC=-18378.02 AICc=-18378 BIC=-18353.96 ## ## Training set error measures: ## ME RMSE MAE MPE MAPE MASE ## Training set 0.0002387398 0.0115163 0.007522913 Inf Inf 0.6703448 ## ACF1 ## Training set -0.001958194

coeftest(auto_model4)

## ## z test of coefficients: ## ## Estimate Std. Error z value Pr(>|z|) ## ar1 -0.420678 0.148818 -2.8268 0.004702 ** ## ma1 0.325918 0.148115 2.2004 0.027776 * ## ma2 -0.095407 0.019848 -4.8070 1.532e-06 *** ## --- ## Signif.
codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All coefficients are statistically significant. Furthermore, we investigate what the eacf() function within the TSA package reports.

eacf(dj_ret)

## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x o x o o o o o o o o x
## 1 x x o o x o o o o o o o o o
## 2 x o o x x o o o o o o o o o
## 3 x o x o x o o o o o o o o o
## 4 x x x x x o o o o o o o o o
## 5 x x x x x o o x o o o o o o
## 6 x x x x x x o o o o o o o o
## 7 x x x x x o o o o o o o o o

The upper left triangle with "o" as a vertex seems to be located somewhere within (p,q) = {(1,2),(2,2),(1,3)}, which represents a set of potential candidate (p,q) values according to the eacf output. Note that we prefer parsimonious models, which is why we do not push the AR and MA orders too far. The ARMA(1,2) model was already verified and ARMA(2,2) is already a candidate model. Let us verify ARMA(1,3).

(arima_model5 <- arima(dj_ret, order=c(1,0,3), include.mean = FALSE))

## ## Call: ## arima(x = dj_ret, order = c(1, 0, 3), include.mean = FALSE) ## ## Coefficients: ## ar1 ma1 ma2 ma3 ## -0.2057 0.1106 -0.0681 0.0338 ## s.e. 0.2012 0.2005 0.0263 0.0215 ## ## sigma^2 estimated as 0.0001325: log likelihood = 9193.97, aic = -18379.94

coeftest(arima_model5)

## ## z test of coefficients: ## ## Estimate Std. Error z value Pr(>|z|) ## ar1 -0.205742 0.201180 -1.0227 0.306461 ## ma1 0.110599 0.200475 0.5517 0.581167 ## ma2 -0.068124 0.026321 -2.5882 0.009647 ** ## ma3 0.033832 0.021495 1.5739 0.115501 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Only one coefficient is statistically significant. As a conclusion, we choose ARMA(2,2) as the mean model. We can now proceed with the ARCH effects tests.

ARCH effect test

Now we can test whether any ARCH effects are present in the residuals of our model. If ARCH effects are statistically significant in the residuals of our time series, a GARCH model is required.
model_residuals <- residuals(auto_model2)
ArchTest(model_residuals - mean(model_residuals))

##
## ARCH LM-test; Null hypothesis: no ARCH effects
##
## data: model_residuals - mean(model_residuals)
## Chi-squared = 986.82, df = 12, p-value < 2.2e-16

Based on the reported p-value, we reject the null hypothesis of no ARCH effects. Let us have a look at the residual correlation plots.

par(mfrow=c(1,2))
acf(model_residuals)
pacf(model_residuals)

Conditional Volatility

The conditional mean and variance are defined as:

$\mu_{t}\ :=\ E(r_{t}|\ F_{t-1}) \\ \sigma^2_{t}\ :=\ Var(r_{t}|\ F_{t-1})\ =\ E[(r_t-\mu_{t})^2| F_{t-1}]$

The conditional volatility can be computed as the square root of the conditional variance. See ref. [4] for further details.

eGARCH Model

Attempts with sGARCH as the variance model did not yield significant coefficients. By contrast, the exponential GARCH (eGARCH) variance model is able to capture asymmetries in the volatility shocks. To inspect asymmetries in the DJIA log returns, summary statistics and a density plot are shown.

basicStats(dj_ret)

##             DJI.Adjusted
## nobs         3019.000000
## NAs             0.000000
## Minimum        -0.082005
## Maximum         0.105083
## 1. Quartile    -0.003991
## 3. Quartile     0.005232
## Mean            0.000207
## Median          0.000551
## Sum             0.625943
## SE Mean         0.000211
## LCL Mean       -0.000206
## UCL Mean        0.000621
## Variance        0.000134
## Stdev           0.011593
## Skewness       -0.141370
## Kurtosis       10.200492

The negative skewness value confirms the presence of asymmetries in the DJIA distribution. This gives the density plot.

plot(density(dj_ret))

We propose the eGARCH model as the variance model (for the conditional variance). More precisely, we are about to fit an ARMA-GARCH with ARMA(2,2) as the mean model and exponential GARCH(1,1) as the variance model. Before doing that, we further emphasize that ARMA(0,0) is not satisfactory in this context.
ARMA-GARCH: ARMA(0,0) + eGARCH(1,1)

garchspec <- ugarchspec(mean.model = list(armaOrder = c(0,0), include.mean = TRUE),
                        variance.model = list(model = "eGARCH", garchOrder = c(1, 1)),
                        distribution.model = "sstd")
(garchfit <- ugarchfit(data = dj_ret, spec = garchspec))

##
## *---------------------------------*
## *          GARCH Model Fit        *
## *---------------------------------*
##
## Conditional Variance Dynamics
## -----------------------------------
## GARCH Model : eGARCH(1,1)
## Mean Model : ARFIMA(0,0,0)
## Distribution : sstd
##
## Optimal Parameters
## ------------------------------------
##         Estimate Std. Error  t value Pr(>|t|)
## mu      0.000303   0.000117   2.5933 0.009506
## omega  -0.291302   0.016580 -17.5699 0.000000
## alpha1 -0.174456   0.013913 -12.5387 0.000000
## beta1   0.969255   0.001770 547.6539 0.000000
## gamma1  0.188918   0.021771   8.6773 0.000000
## skew    0.870191   0.021763  39.9848 0.000000
## shape   6.118380   0.750114   8.1566 0.000000
##
## Robust Standard Errors:
##         Estimate Std. Error  t value Pr(>|t|)
## mu      0.000303   0.000130   2.3253 0.020055
## omega  -0.291302   0.014819 -19.6569 0.000000
## alpha1 -0.174456   0.016852 -10.3524 0.000000
## beta1   0.969255   0.001629 595.0143 0.000000
## gamma1  0.188918   0.031453   6.0063 0.000000
## skew    0.870191   0.022733  38.2783 0.000000
## shape   6.118380   0.834724   7.3298 0.000000
##
## LogLikelihood : 10138.63
##
## Information Criteria
## ------------------------------------
##
## Akaike       -6.7119
## Bayes        -6.6980
## Shibata      -6.7119
## Hannan-Quinn -6.7069
##
## Weighted Ljung-Box Test on Standardized Residuals
## ------------------------------------
##                         statistic p-value
## Lag[1]                      5.475 0.01929
## Lag[2*(p+q)+(p+q)-1][2]     6.011 0.02185
## Lag[4*(p+q)+(p+q)-1][5]     7.712 0.03472
## d.o.f=0
## H0 : No serial correlation
##
## Weighted Ljung-Box Test on Standardized Squared Residuals
## ------------------------------------
##                         statistic p-value
## Lag[1]                      1.342  0.2467
## Lag[2*(p+q)+(p+q)-1][5]     2.325  0.5438
## Lag[4*(p+q)+(p+q)-1][9]     2.971  0.7638
## d.o.f=2
##
## Weighted ARCH LM Tests
## ------------------------------------
##             Statistic Shape Scale P-Value
## ARCH Lag[3]    0.3229 0.500 2.000  0.5699
## ARCH Lag[5]    1.4809 1.440 1.667  0.5973
## ARCH Lag[7]    1.6994 2.315 1.543  0.7806
##
## Nyblom stability test
## ------------------------------------
## Joint Statistic: 4.0468
## Individual Statistics:
## mu     0.2156
## omega  1.0830
## alpha1 0.5748
## beta1  0.8663
## gamma1 0.3994
## skew   0.1044
## shape  0.4940
##
## Asymptotic Critical Values (10% 5% 1%)
## Joint Statistic:      1.69 1.9 2.35
## Individual Statistic: 0.35 0.47 0.75
##
## Sign Bias Test
## ------------------------------------
##                    t-value    prob sig
## Sign Bias            1.183 0.23680
## Negative Sign Bias   2.180 0.02932  **
## Positive Sign Bias   1.554 0.12022
## Joint Effect         8.498 0.03677  **
##
##
## Adjusted Pearson Goodness-of-Fit Test:
## ------------------------------------
##   group statistic p-value(g-1)
## 1    20     37.24      0.00741
## 2    30     42.92      0.04633
## 3    40     52.86      0.06831
## 4    50     65.55      0.05714
##
##
## Elapsed time : 0.6527421

All coefficients are statistically significant. However, the p-values reported above for the Weighted Ljung-Box Test on Standardized Residuals confirm that this model does not capture all of the structure: we reject the null hypothesis of no serial correlation in the residuals. We therefore proceed by specifying ARMA(2,2) as the mean model within our GARCH fit, as shown below.
ARMA-GARCH: ARMA(2,2) + eGARCH(1,1)

garchspec <- ugarchspec(mean.model = list(armaOrder = c(2,2), include.mean = FALSE),
                        variance.model = list(model = "eGARCH", garchOrder = c(1, 1)),
                        distribution.model = "sstd")
(garchfit <- ugarchfit(data = dj_ret, spec = garchspec))

##
## *---------------------------------*
## *          GARCH Model Fit        *
## *---------------------------------*
##
## Conditional Variance Dynamics
## -----------------------------------
## GARCH Model : eGARCH(1,1)
## Mean Model : ARFIMA(2,0,2)
## Distribution : sstd
##
## Optimal Parameters
## ------------------------------------
##        Estimate Std. Error    t value Pr(>|t|)
## ar1    -0.47642   0.026115   -18.2433        0
## ar2    -0.57465   0.052469   -10.9523        0
## ma1     0.42945   0.025846    16.6157        0
## ma2     0.56258   0.054060    10.4066        0
## omega  -0.31340   0.003497   -89.6286        0
## alpha1 -0.17372   0.011642   -14.9222        0
## beta1   0.96598   0.000027 35240.1590        0
## gamma1  0.18937   0.011893    15.9222        0
## skew    0.84959   0.020063    42.3469        0
## shape   5.99161   0.701313     8.5434        0
##
## Robust Standard Errors:
##        Estimate Std. Error    t value Pr(>|t|)
## ar1    -0.47642   0.007708   -61.8064        0
## ar2    -0.57465   0.018561   -30.9608        0
## ma1     0.42945   0.007927    54.1760        0
## ma2     0.56258   0.017799    31.6074        0
## omega  -0.31340   0.003263   -96.0543        0
## alpha1 -0.17372   0.012630   -13.7547        0
## beta1   0.96598   0.000036 26838.0412        0
## gamma1  0.18937   0.013003    14.5631        0
## skew    0.84959   0.020089    42.2911        0
## shape   5.99161   0.707324     8.4708        0
##
## LogLikelihood : 10140.27
##
## Information Criteria
## ------------------------------------
##
## Akaike       -6.7110
## Bayes        -6.6911
## Shibata      -6.7110
## Hannan-Quinn -6.7039
##
## Weighted Ljung-Box Test on Standardized Residuals
## ------------------------------------
##                          statistic p-value
## Lag[1]                     0.03028  0.8619
## Lag[2*(p+q)+(p+q)-1][11]   5.69916  0.6822
## Lag[4*(p+q)+(p+q)-1][19]  12.14955  0.1782
## d.o.f=4
## H0 : No serial correlation
##
## Weighted Ljung-Box Test on Standardized Squared Residuals
## ------------------------------------
##                         statistic p-value
## Lag[1]                      1.666  0.1967
## Lag[2*(p+q)+(p+q)-1][5]     2.815  0.4418
## Lag[4*(p+q)+(p+q)-1][9]     3.457  0.6818
## d.o.f=2
##
## Weighted ARCH LM Tests
## ------------------------------------
##             Statistic Shape Scale P-Value
## ARCH Lag[3]    0.1796 0.500 2.000  0.6717
## ARCH Lag[5]    1.5392 1.440 1.667  0.5821
## ARCH Lag[7]    1.6381 2.315 1.543  0.7933
##
## Nyblom stability test
## ------------------------------------
## Joint Statistic: 4.4743
## Individual Statistics:
## ar1    0.07045
## ar2    0.37070
## ma1    0.07702
## ma2    0.39283
## omega  1.00123
## alpha1 0.49520
## beta1  0.79702
## gamma1 0.51601
## skew   0.07163
## shape  0.55625
##
## Asymptotic Critical Values (10% 5% 1%)
## Joint Statistic:      2.29 2.54 3.05
## Individual Statistic: 0.35 0.47 0.75
##
## Sign Bias Test
## ------------------------------------
##                    t-value    prob sig
## Sign Bias           0.4723 0.63677
## Negative Sign Bias  1.7969 0.07246   *
## Positive Sign Bias  2.0114 0.04438  **
## Joint Effect        7.7269 0.05201   *
##
##
## Adjusted Pearson Goodness-of-Fit Test:
## ------------------------------------
##   group statistic p-value(g-1)
## 1    20     46.18    0.0004673
## 2    30     47.73    0.0156837
## 3    40     67.07    0.0034331
## 4    50     65.51    0.0574582
##
##
## Elapsed time : 0.93679

All coefficients are statistically significant. No correlation is found in the standardized residuals or in the standardized squared residuals, and all ARCH effects are properly captured by the model. However:

* the Nyblom stability test null hypothesis that the model parameters are constant over time is rejected for some of them
* the Positive Sign Bias null hypothesis is rejected at the 5% level of significance; this kind of test focuses on the effect of large and small positive shocks
* the Adjusted Pearson Goodness-of-Fit test null hypothesis that the empirical and theoretical distributions of the standardized residuals are identical is rejected

See ref. [11] for an explanation of GARCH model diagnostics.

Note: the ARMA(1,2) + eGARCH(1,1) fit also provides significant coefficients, no correlation in the standardized residuals, no correlation in the standardized squared residuals, and all ARCH effects properly captured. However, its bias tests are less satisfactory at the 5% level than those of the ARMA(2,2) + eGARCH(1,1) model. We show some diagnostic plots below.

par(mfrow=c(2,2))
plot(garchfit, which=8)
plot(garchfit, which=9)
plot(garchfit, which=10)
plot(garchfit, which=11)

We show the original DJIA log-returns time series with the mean model fit (red line) and the conditional volatility (blue line).
par(mfrow=c(1,1))
cond_volatility <- sigma(garchfit)
mean_model_fit <- fitted(garchfit)
p <- plot(dj_ret, col = "grey")
p <- addSeries(mean_model_fit, col = 'red', on = 1)
p <- addSeries(cond_volatility, col = 'blue', on = 1)
p

Model Equation

Combining both the ARMA(2,2) and eGARCH models we have:

$\begin{cases} y_{t}\ -\ \phi_{1}y_{t-1}\ -\ \phi_{2}y_{t-2} =\ \phi_{0}\ +\ u_{t}\ +\ \theta_{1}u_{t-1}\ +\ \theta_{2}u_{t-2} \\ \\ u_{t}\ =\ \sigma_{t}\epsilon_{t},\ \ \ \ \ \epsilon_{t}\sim N(0,1) \\ \\ \ln(\sigma_{t}^2)\ =\ \omega\ +\ \sum_{j=1}^{q} (\alpha_{j} \epsilon_{t-j}\ +\ \gamma_{j} (|\epsilon_{t-j}|\ -\ E|\epsilon_{t-j}|))\ +\ \sum_{i=1}^{p} \beta_{i} \ln(\sigma_{t-i}^2) \end{cases}$

Using the resulting model coefficients, this becomes:

$\begin{cases} y_{t}\ +\ 0.476\ y_{t-1}\ +\ 0.575\ y_{t-2} =\ u_{t}\ +\ 0.429\ u_{t-1}\ +\ 0.563\ u_{t-2} \\ \\ u_{t}\ =\ \sigma_{t}\epsilon_{t},\ \ \ \ \ \epsilon_{t}\sim N(0,1) \\ \\ \ln(\sigma_{t}^2)\ =\ -0.313\ -\ 0.174\ \epsilon_{t-1}\ +\ 0.189\ (|\epsilon_{t-1}|\ -\ E|\epsilon_{t-1}|)\ +\ 0.966\ \ln(\sigma_{t-1}^2) \end{cases}$

Volatility Analysis

Here is the plot of the conditional volatility resulting from our ARMA(2,2) + eGARCH(1,1) model.

plot(cond_volatility)

Line plots of the conditional volatility by year are shown.

par(mfrow=c(6,2))
pl <- lapply(2007:2018, function(x) {
  plot(cond_volatility[as.character(x)], main = "DJIA Daily Log returns conditional volatility")})
pl

Conditional volatility box-plots by year are shown.

par(mfrow=c(1,1))
cond_volatility_df <- xts_to_dataframe(cond_volatility)
dataframe_boxplot(cond_volatility_df, "Dow Jones Daily Log-returns conditional Volatility 2007-2018")

After 2008, the daily volatility basically tends to decrease. In 2017, volatility was lower than in any other year under analysis. In 2018, by contrast, we experienced a remarkable increase in volatility with respect to 2017. We save the current environment for further analysis.
save.image(file='DowEnvironment.RData')

If you have any questions, please feel free to comment below.

References

Disclaimer

Any securities or databases referred to in this post are solely for illustration purposes, and under no regard should the findings presented here be interpreted as investment advice or promotion of any particular security or source.

Related Post

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Continuing to Grow Community Together at ozunconf, 2018

Tue, 01/08/2019 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

In late November 2018, we ran the third annual rOpenSci ozunconf. This is the sibling rOpenSci unconference, held in Australia. We ran the first ozunconf in Brisbane in 2016, and the second in Melbourne in 2017.

Photos taken by Ajay from Fotoholics

As usual, before the unconf, we started discussion on GitHub issue threads, and the excitement was building with the number of issues. The day before the unconf we ran "Day 0 training" – an afternoon explaining R packages and GitHub. This aimed to show people how to create an R package, set it up with version control with git, then put it on GitHub, and share it with others.
The idea behind delivering this course was not necessarily to have people become experts in R package development and GitHub. Instead, it aimed to gently introduce the ideas and concepts of R packages and GitHub, so that people could hit the ground running over the next two days.

This year, Saras Windecker stepped up to the plate and did an incredible job teaching the afternoon course that Roger Peng and I developed last year. We also had a great group of helper instructors, which ended up creating the fantastic teaching ratio of two students for every helper. There was a lot of learning from both sides here, and one comment on writing git commit messages from Adam Gruer really stood out to me. He said when writing a commit message, it should complete the sentence: "This commit will …". Adam found this comment in a blog post by Chris Beams. This post by Vicky Lai also provided some good comments on good git practices.

Day 1 started with a bang, in the beautiful north arts building at the University of Melbourne, and dusted off the tried and tested rOpenSci icebreaker. We then moved on to choose projects. Everyone moved around the room, exploring options, and we settled into groups. We worked on a variety of projects, ranging from some that we thought might be knocked over in a few hours, like ozbabynames (spoiler alert: it took longer than three hours), to documenting and learning gganimate, creating paper review toolkits, shortening and enriching the process of reading documentation with tl, automating the creation of vitaes, simplifying the process of recoding free gender text responses with gendercoder, measuring walkability, making bindertools to recreate and reproduce analysis, and even generating synonyms. You can see the complete list of projects here.

Throughout the conference, Saskia Freytag recorded conversations to capture the event. These were edited into the podcast "Credibly Curious", in episode 6, "Ozunconf", run by Saskia and me.
On the second day, Adam Gruer led a fireside chat on how we might continue the experience of the rOpenSci ozunconf after we wrap up. Among the suggestions were starting or joining coding clubs at your university or place of work. Adam also made the point that a reason so many people enjoy the unconf is that freeing up your time and expectations has the surprising effect of making you more productive, and that we should try to recreate that feeling elsewhere. Robbie Bonelli had a good way to summarise this: we gain personal knowledge that we would never have gained working on our own, even if we never use or work on these things again. Make it about learning something new; this is how to break down the idea of having to get something out of this for your job.

We closed up the conference with 3-minute presentations from everyone on what they got up to over the past two days. Looking back, it really is amazing how much can get done over just a few days! We couldn't have gotten it done without the help of our sponsors, Microsoft Azure, The R Consortium, RStudio, The University of Melbourne, and The Monash Business School.

It's been an absolute pleasure to bring people together for another great event. Just three years ago we ran the first ozunconf, and seeing how the community has changed and grown over the past three years has been a real privilege. There is nothing quite like seeing so many people together in the same space, enjoying coding together. Seeing people realise that they can contribute, and become more of what they are, is fantastic. Next year we are looking into moving the ozunconf to a different city in Australia; stay tuned for updates!
To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

### RStudio Server Pro is now available on Microsoft Azure Marketplace

Tue, 01/08/2019 - 01:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

RStudio is excited to announce the availability of its flagship, enterprise-ready, integrated development environment for R in Azure Marketplace.

RStudio Server Pro for Azure is an on-demand, commercially-licensed integrated development environment (IDE) for R on the Microsoft Azure Cloud. It offers all of the capabilities found in the popular RStudio open source IDE, plus turnkey convenience, enhanced security, the ability to manage multiple R versions and sessions, and more. It comes pre-configured with multiple versions of R, common system libraries, and the most popular R packages.

RStudio Server Pro for Azure helps you adapt to your unique circumstances. It allows you to choose different Azure computing instances whenever a project requires it, and helps avoid the sometimes complicated processes for procuring on-premises software.

If the enhanced security, elegant support for multiple R versions and multiple sessions, and commercially licensed and supported features of RStudio Server Pro appeal to you, consider RStudio Server Pro for Azure!
Read the FAQ

Getting Started with RStudio Server Pro for Azure

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

### Looking back at 2018 and plans for 2019

Tue, 01/08/2019 - 01:00

(This article was first published on R on msperlin, and kindly contributed to R-bloggers)

At the end of every year I plan to write about the highlights of the current year and set plans for the future. First, let's talk about my work in 2018.

Highlights of 2018

Research wise, my scientometrics paper Is predatory publishing a real threat? Evidence from a large database study was featured in many news outlets. Its Altmetric page is doing great, with over 1100 downloads, and the paper is in the top 5% of all research output measured by Altmetric. This is, by far, the most impactful research piece I ever wrote. It's rewarding to see my work featured in the local and international media.

This year I also released the first version of GetDFPData, an R package for accessing a large database of financial information from B3, the Brazilian exchange. I'm glad to report that many people are using it for their own research. I can see the number of visits in the web interface and the frequent emails I get about the package. The feedback from other researchers has been great but, of course, there are always ways to improve the code. I've been constantly developing it over time.
The GetDFPData package also had an impact on my own research. I've always been biased towards the topic of capital markets, and now I'm doing research in corporate finance, mostly due to the new access to a large database of corporate events. Currently, I have three paper initiatives analyzing the effect of board formation on the financial performance of Brazilian companies. These will likely be published in 2019 or 2020.

In late 2018 I started my YouTube series padfeR, with video tutorials about using R for Finance and Economics. The idea is to have a greater impact and help those that are starting to use R. So far, all videos are in Portuguese, but I do have plans for doing it in English in the future. Hopefully I'll find some time in 2019 to start it.

Overall, 2018 was a great year. I'm always thankful for having the opportunity of working in a job that I love and look forward to (almost) every single day.

My blog posts in 2018

In November I changed the technology behind my blog from Jekyll to Hugo. I can't stress enough how much I'm liking the Academic template built with blogdown and hosted on my own server. It is far easier to write posts and maintain the website. First, let's see how many posts I have so far.
my.blog.folder <- '~/Dropbox/11-My Website/www.msperlin.com-blog/content/post/'
post.files <- list.files(path = my.blog.folder, pattern = '.Rmd')
post.files

##  [1] "2017-01-01-First-post.Rmd"
##  [2] "2017-01-02-GetHFData.Rmd"
##  [3] "2017-01-15-CalculatingBetas.Rmd"
##  [4] "2017-01-30-Exams-with-dynamic-content.Rmd"
##  [5] "2017-02-13-R-and-Tennis-Players.Rmd"
##  [6] "2017-02-16-Writing-a-book.Rmd"
##  [7] "2017-03-05-Prophet-and_stock-market.Rmd"
##  [8] "2017-05-04-pafdR-is-out.Rmd"
##  [9] "2017-05-09-Studying-Pkg-Names.Rmd"
## [10] "2017-05-15-R-Finance.Rmd"
## [11] "2017-08-12-Switching-to-Linux.Rmd"
## [12] "2017-09-14-Brazilian-Yield-Curve.Rmd"
## [13] "2017-12-06-Package-GetDFPData.Rmd"
## [14] "2017-12-13-Serving-shiny-apps-internet.Rmd"
## [15] "2017-12-30-Looking-Back-2017.Rmd"
## [16] "2018-01-22-Update-BatchGetSymbols.Rmd"
## [17] "2018-03-16-Writing_Papers_About_Pkgs.Rmd"
## [18] "2018-04-22-predatory-scientometrics.Rmd"
## [19] "2018-05-12-Investing-Long-Run.Rmd"
## [20] "2018-06-12-padfR-ed2.Rmd"
## [21] "2018-06-29-BenchmarkingSSD.Rmd"
## [22] "2018-10-10-BatchGetSymbols-NewVersion.Rmd"
## [23] "2018-10-11-Update-GetLattesData.Rmd"
## [24] "2018-10-13-NewPackage-PkgsFromFiles.Rmd"
## [25] "2018-10-19-R-and-loops.Rmd"
## [26] "2018-10-20-Linux-and-R.Rmd"
## [27] "2018-11-03-NewBlog.Rmd"
## [28] "2018-11-03-RstudioTricks.Rmd"
## [29] "2019-01-08-Looking-Back-2018.Rmd"

The blog started in January 2017 and, over time, I wrote 29 posts. That feels alright. I'm not feeling forced to write, and I do it whenever I feel like I have something to share. Let's get more information from the .Rmd files. I'll write a function, read_blog_files, and use it for all post files.
read_blog_files <- function(f.in) {
  require(tidyverse)
  my.front.matter <- rmarkdown::yaml_front_matter(f.in)
  df.out <- data_frame(post.title = my.front.matter$title,
                       post.date = lubridate::ymd(my.front.matter$date),
                       post.month = as.Date(format(post.date, '%Y-%m-01')),
                       tags = paste0(my.front.matter$tags, collapse = ';'),
                       categories = paste0(my.front.matter$categories, collapse = ';'),
                       content = paste0(read_lines(f.in), collapse = ' '))
  return(df.out)
}

df.posts <- dplyr::bind_rows(purrr::map(post.files, read_blog_files))

## Loading required package: tidyverse
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  2.0.0     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## Warning: data_frame() is deprecated, use tibble().
## This warning is displayed once per session.

glimpse(df.posts)

## Observations: 29
## Variables: 6
## $ post.title "My first post!", "Using R to download high frequency…
## $ post.date  2017-01-01, 2017-01-02, 2017-01-15, 2017-01-30, 2017…
## $ post.month 2017-01-01, 2017-01-01, 2017-01-01, 2017-01-01, 2017…
## $ tags       "about me", "R;GetHFData;B3;market microstructure;hig…
## $ categories "about me", "R;GetHFData;B3;market microstructure;hig…
## $ content    "--- title: \"My first post!\" subtitle: \"A little b…

First, we'll look at the frequency of posts over time.

df.posts.2018 <- df.posts %>%
  filter(post.date > as.Date('2018-01-01'))

print(ggplot(df.posts.2018, aes(x = post.month)) +
  geom_histogram(stat='count') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(y = 'Number of posts', x = ''))

## Warning: Ignoring unknown parameters: binwidth, bins, pad

Seems to average about one post a month. The blank spaces show that I did not write for a couple of months.
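Since the tags were collapsed with ";" when building the post data frame, they can be split back out for a quick tag frequency count. A small base-R sketch (the tag strings below are made-up stand-ins for the real df.posts$tags column):

```r
# Illustrative sketch: count how often each tag appears across posts.
# Stand-in for df.posts$tags, which stores tags collapsed with ";".
tag_strings <- c("R;GetHFData;B3", "R;ggplot2", "about me")
tag_counts <- sort(table(unlist(strsplit(tag_strings, ";"))),
                   decreasing = TRUE)
tag_counts  # "R" is the most frequent tag in this toy example
```

The same two lines applied to the real column give a quick picture of which topics the blog covers most.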
Checking 2018’s plans

At the end of 2017 my plans for 2018 were:

Work on the second edition of the Portuguese book.

Done! I'm glad to report that the second edition of the book was published in June 2018. It was great to review the book and add several new chapters and sections. As I mentioned in the publication post, this is the largest and longest project I have ever worked on, and it is very satisfying to see it develop over time. Even more satisfying is to receive positive feedback from readers that are using the book to learn to code in R! Many teachers in Economics and Business are also starting to use it in the classroom. The book will continue to be updated every couple of years. One of the greatest things about R, among many others, is that the language is continually evolving and changing. I have no doubt that there will always be new material to write about.

Start a portal for financial data in Brazil

Unfortunately this project did not launch. I wrote a couple of R scripts for fetching and saving data automatically on my server, but it never became a webpage. I started to work on other projects and the website was not a priority.

Plans for 2019

New edition of “Processing and Analyzing Financial Data with R”

The international version of my book pafdR was published in January 2017. I feel it's time to update it with the new chapters and structure from the second edition in Portuguese. There are many improvements to the book, with an emphasis on the tidyverse.

Work on my new book: “Investing For the Long Term” (provisional title)

There is a huge deficit of financial knowledge in Brazil, especially in saving and investing. I've been a long term investor for most of my career as an academic, and I feel there is a lot I can contribute to the topic of financial education by bringing data science into the problem of investing. The book will be an introduction to investments for the common person in Brazil, with a heavy data-based approach.
It will not be about trading strategies or anything related to short term trading. The idea is to bring data analysis to the common long term investor, showing how the financial market works and how one can build passive income by constantly buying good financial contracts. I have no clue if it will be published in 2019. Unlike my previous book, I'm taking my time to write this one. No rush and no deadlines :).

Solidify my research agenda in Board Composition

As I mentioned before, my research agenda has shifted from capital markets to board compositions. This is a very interesting topic with many implications for listed companies. I'm learning a lot from researching these topics. Currently, I have four initiatives with different co-authors:

• Gender and board composition
• Politics and board composition
• Professors in the Board of Companies
• Board description of Brazilian Companies

Hopefully, these will be published in 2019 or 2020.

To leave a comment for the author, please follow the link and comment on their blog: R on msperlin.

### Don’t reinvent the wheel: making use of shiny extension packages. Join MünsteR for our next meetup!
Tue, 01/08/2019 - 01:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

In our next MünsteR R-user group meetup on Tuesday, February 5th, 2019, titled Don’t reinvent the wheel: making use of shiny extension packages., Suthira Owlarn will introduce the shiny package and show how she used it to build an interactive web app for her sequencing datasets. You can RSVP here: http://meetu.ps/e/Gg5th/w54bW/f

Shiny is a popular R package for building interactive web apps and dashboards – no web development knowledge required! Its ease of use and extendability make it great for quick user-friendly presentations of toy projects as well as production-ready data products in various industries.

In our first 2019 meetup, Suthira Owlarn will briefly demo a prototype of BioDEV, a shiny app that addresses a common pain-point in dealing with biological (omics) data sets. It allows interactive and intuitive exploration of user-provided data sets, integration of external data and download of parameterized interactive reports. She will then talk about packages that extend shiny by providing a variety of user interfaces, more advanced user interactions or an application-testing framework. Finally, we will cover some useful tools that aren't specific to shiny, including an easy way to turn shiny apps into self-contained desktop applications.

About the speaker: Suthira Owlarn is a biologist at the Max Planck Institute in Münster. After finishing her PhD, she decided to learn a little R and immediately fell in love! Soon, she waved goodbye to the bench and is now having fun diving into the world of data analysis, exploring R packages and building shiny apps.
To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

### BH 1.69.0-1 on CRAN

Tue, 01/08/2019 - 00:26

(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)

The BH package provides a sizeable portion of the Boost C++ libraries as a set of template headers for use by R. It is quite popular, and frequently used together with Rcpp. The BH CRAN page shows e.g. that it is used by rstan and dplyr, as well as a few other packages. The current count of reverse dependencies is at 164.

Boost releases every four months. The last release we packaged was 1.66 from February, and this BH release gets us to Boost 1.69, released just three or so weeks ago. And as blogged last month, we made a pre-release (following several reverse-depends checks) to allow three packages affected by changes in Boost to adapt. My RcppStreams package was one, and we made an update release 0.1.2 just yesterday. This BH release was also preceded by another reverse-depends check on the weekend.

Since the 1.66 release of BH, CRAN tightened policies some more. Pragmas suppressing compiler warnings are now verboten, so I had to disable a few. Expect compilations of packages using Boost, and BH, to be potentially very noisy.
Consider adding flags to your local ~/.R/Makeconf, and we should add them to the src/Makevars as much as we can (e.g. my ticket #3961 to dplyr). Collecting a few of these on a BH wiki page may not be a bad idea. A list of our local changes follows. The diff to upstream Boost is in the repo as well.

Changes in version 1.69.0-1 (2019-01-07)
• Upgraded to Boost 1.69.0 (plus the few local tweaks)
• Applied the standard minimal patch with required changes, as well as the newer changeset for diagnostics pragma suppression.
• Following a pre-release in December, maintainers of three packages affected by the 1.66 to 1.69 transition were contacted, and changes were prepared.

Via CRANberries, there is a diffstat report relative to the previous release. Comments and suggestions about BH are welcome via the mailing list or the issue tracker at the GitHub repo. This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings. To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

### Long-awaited updates to htmlTable

Mon, 01/07/2019 - 21:51 (This article was first published on R – G-Forge, and kindly contributed to R-bloggers) Let's celebrate 2019 with some updates to my most popular package ever, htmlTable.
The image is CC by Thomas Hawk. One of the most pleasant surprises has been the popularity of my htmlTable package, with more than 100k downloads per month. This is all thanks to more popular packages relying on it and the web expanding beyond its original boundaries. As a thank you, I have taken the time to update and fix some features (the 1.13 release) – enjoy!

Grouping column headers

Grouping columns is a powerful feature that allows you to quickly generate more interesting tables. Unfortunately, cgroup has previously required matrix input for this feature, which even I as the author sometimes struggled with. Now you can provide a list in a much more user-friendly manner:

rbind(
  `Group A` = c(20, 5, 380, 95),
  `Group B` = c(11, 55, 9, 45)
) %>%
  htmlTable(header = rep(c("No", "%"), times = 2),
            n.cgroup = list(c(2), c(2, 2)),
            cgroup = list('Super', c("First", "Second")))

                    Super
          First          Second
          No     %       No     %
Group A   20     5       380    95
Group B   11    55         9    45

The n.cgroup can either count the cgroup elements below it, as above, or the exact same table can be created by providing list(c(4), c(2,2)) as the n.cgroup.

Auto-counting tspanners

Even more common than grouping columns is probably grouping data by rows. htmlTable allows you to do this with rgroup and tspanner. The most common approach is using rgroup as the first row-grouping element, but with larger tables you frequently want to separate concepts into separate sections. Here's a more complex example. This has previously been a little cumbersome, due to having to count the rows of each tspanner, but now you are able to (1) leave out the last row, and (2) specify the number of rgroups instead of the number of rows. The latter is convenient, as the n.tspanner must align with the underlying rgroup.
Here’s an example: rbind( Group A = c(20, 380), Group B = c(110, 1230), Group C = c(2, 56), Group D = c(17, 33), Group A = c(40, 360), Group B = c(230, 1100), Group C = c(8, 50), Group D = c(10, 40) ) %>% apply(1, function(x) { sapply(x, function(count) c( txtInt(count), sprintf("(%s)", txtRound(count/sum(x) * 100, 1)))) %>% c(txtInt(sum(x)), .) }) %>% t %>% htmlTable(header = c("Total", rep(c("No", "(%)"), times = 2)), n.cgroup=list(c(1,2,2)), cgroup=list(c("", "Cases", "Controls")), rgroup = rep(c("Aspirin", "Intermittent compression"), times = 2), n.rgroup = rep(2, times = 4), tspanner = c("First experiment", "Second experiment"), n.tspanner = c(2), align = "r", caption = "Extremely fake data") Extremely fake data Cases Controls Total No (%) No (%) First experiment Aspirin Group A 400 20 (5.0) 380 (95.0) Group B 1,340 110 (8.2) 1,230 (91.8) Intermittent compression Group C 58 2 (3.4) 56 (96.6) Group D 50 17 (34.0) 33 (66.0) Second experiment Aspirin Group A 400 40 (10.0) 360 (90.0) Group B 1,330 230 (17.3) 1,100 (82.7) Intermittent compression Group C 58 8 (13.8) 50 (86.2) Group D 50 10 (20.0) 40 (80.0) The txtRound now has digits.nonzero argument The txtRound is similar to R’s round but returns a string that makes sure that all your values are presented with the same number of digits. > txtRound(c(1.2, 1.0), digits = 1) [1] "1.2" "1.0" Under some circumstances you are close to 0 and you want to retain more digits, hence the new digits.nonzero argument: > txtRound(c(0.1, 0.01, 0.001), digits = 1, digits.nonzero = 2) [1] "0.1" "0.01" "0.0" New function vector2string As a package author I think the majority of my code is dedicated towards providing good error message. A common issue is that you have two vectors that don’t match. For this purpose I’ve created the vector2string. This makes it easier to write your stop messages. 
You simply provide any vector and get a string back:

> vector2string(c(1:2, "a random ' string"))
[1] "'1', '2', 'a random \\' string'"

Here's how I use it in one of my functions:

...
stop("Invalid size of lists: ",
     vector2string(lengths),
     " each element should be able to evenly divide ",
     ncol(x))

Summary

These are tiny additions and everything should work just as it has in previous versions. Hopefully it will save you some time and make your life easier! Enjoy! To leave a comment for the author, please follow the link and comment on their blog: R – G-Forge.

### Marketing analytics with greybox

Mon, 01/07/2019 - 17:40 (This article was first published on R – Modern Forecasting, and kindly contributed to R-bloggers) One of the reasons why I started the greybox package was to use it for marketing research and marketing analytics. The common problem that I face when working with these courses is analysing data measured in different scales. While R handles numeric scales natively, the support for categorical ones is not satisfactory. Yes, I know that there are packages that implement some of these functions, but I wanted to have them in one place, without the need to install a lot of packages and satisfy their dependencies. After all, what's the point in installing a package for Cramer's V, when it can be calculated with two lines of code?
So, here’s a brief explanation of the functions for marketing analytics in greybox . I will use mtcars dataset for the examples, but we will transform some of the variables into factors: mtcarsData <- as.data.frame(mtcars) mtcarsDatavs <- factor(mtcarsData$vs, levels=c(0,1), labels=c("v","s")) mtcarsData$am <- factor(mtcarsData$am, levels=c(0,1), labels=c("a","m")) All the functions discussed in this post are available in greybox starting from v0.4.0. However, I’ve found several bugs since the submission to CRAN, and the most recent version with bugfixes is now available on github. Analysing the relation between the two variables in categorical scales Cramer’s V Cramer’s V measures the relation between two variables in categorical scale. It is implemented in the cramer() function. It returns the value in a range of 0 to 1 (1 – when the two categorical variables are linearly associated with each other, 0 – otherwise), Chi-Squared statistics from the chisq.test() , the respective p-value and the number of degrees of freedom. The tested hypothesis in this case is formulated as: \begin{matrix} H_0: V = 0 \text{ (the variables don’t have association);} \\ H_1: V \neq 0 \text{ (there is an association between the variables).} \end{matrix} Here’s what we get when trying to find the association between the engine and transmission in the mtcars data: cramer(mtcarsData$vs, mtcarsData$am) Cramer's V: 0.1042 Chi^2 statistics = 0.3475, df: 1, p-value: 0.5555 Judging by this output, the association between these two variables is very low (close to zero) and is not statistically significant. Cramer’s V can also be used for the data in numerical scales. In general, this might be not the most suitable solution, but this might be useful when you have a small number of values in the data. For example, the variable gear in mtcars is numerical, but it has only three options (3, 4 and 5). 
Here’s what Cramer’s V tells us in the case of gear and am: cramer(mtcarsData$am, mtcarsData$gear) Cramer's V: 0.809 Chi^2 statistics = 20.9447, df: 2, p-value: 0 As we see, the value is high in this case (0.809), and the null hypothesis is rejected on 5% level. So we can conclude that there is a relation between the two variables. This does not mean that one variable causes the other one, but they both might be driven by something else (do more expensive cars have less gears but the automatic transmission?). Plotting categorical variables While R allows plotting two categorical variables against each other, the plot is hard to read and is not very helpful (in my opinion): plot(table(mtcarsData$am,mtcarsData$gear)) Default plot of a table So I have created a function that produces a heat map for two categorical variables. It is called tableplot() : tableplot(mtcarsData$am,mtcarsData$gear) Tableplot for the two categorical variables It is based on table() function and uses the frequencies inside the table for the colours: table(mtcarsData$am,mtcarsData$gear) / length(mtcarsData$am)

        3       4       5
a 0.46875 0.12500 0.00000
m 0.00000 0.25000 0.15625

The darker sectors mean that there is a higher concentration of values, while the white ones correspond to zeroes. So, in our example, we see that the majority of cars have automatic transmissions with three gears. Furthermore, the plot shows that there is some sort of relation between the two variables: the cars with automatic transmissions have fewer gears, while the ones with manual transmissions have more gears (something we've already noticed in the previous subsection).
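As an aside, the author's remark that Cramer's V needs only a couple of lines of base R can be made concrete. Below is a minimal sketch (cramerV is a hypothetical name used for illustration; the package's actual function is greybox::cramer(), which additionally reports the Chi-squared statistic, df and p-value):

```r
# Minimal sketch of Cramer's V from a contingency table, base R only.
# cramerV is a hypothetical name; the package's function is greybox::cramer().
cramerV <- function(x, y) {
  tbl <- table(x, y)
  chi2 <- chisq.test(tbl)$statistic  # Yates-corrected for 2x2 tables
  unname(sqrt(chi2 / (sum(tbl) * (min(dim(tbl)) - 1))))
}

round(cramerV(mtcars$vs, mtcars$am), 4)  # 0.1042, matching the output above
```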

Association between the categorical and numerical variables

While Cramer's V can also be used for measuring the association between variables in different scales, there are better instruments. For example, some analysts recommend using the intraclass correlation coefficient when measuring the relation between numerical and categorical variables. But there is a simpler option, which involves calculating the coefficient of multiple correlation between the variables. This is implemented in the mcor() function of greybox. The y variable should be numerical, while x can be of any type. What the function then does is expand all the factors and run a regression via the .lm.fit() function, returning the square root of the coefficient of determination. If the variables are linearly related, then the returned value will be close to one; otherwise it will be close to zero. The function also returns the F statistic from the regression, the associated p-value and the number of degrees of freedom (the hypothesis is formulated similarly to the cramer() function).

Here's how it works:

mcor(mtcarsData$am, mtcarsData$mpg)

Multiple correlations value: 0.5998
F-statistics = 16.8603, df: 1, df resid: 30, p-value: 3e-04

In this example, a simple linear regression of mpg on the set of dummies is constructed, and we can conclude that there is a linear relation between the variables, and that this relation is statistically significant.
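The mechanics described above can be sketched in a few lines (multcor is a hypothetical name used for illustration; the real greybox::mcor() also computes the F statistic, degrees of freedom and p-value):

```r
# Sketch of the multiple-correlation measure: expand the factor into dummies,
# regress the numeric variable on them, and take the square root of R^2.
multcor <- function(y, x) {
  X <- model.matrix(~ x)                # dummy expansion, incl. intercept
  fit <- .lm.fit(X, y)                  # fast regression, as in greybox
  R2 <- 1 - sum(fit$residuals^2) / sum((y - mean(y))^2)
  sqrt(R2)
}

round(multcor(mtcars$mpg, factor(mtcars$am)), 4)  # 0.5998, as reported above
```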

Association between several variables

Measures of association

When you deal with datasets (i.e. data frames or matrices), you can use the cor() function in order to calculate the correlation coefficients between the variables in the data. But when you have a mixture of numerical and categorical variables, the situation becomes more difficult, as correlation does not make sense for the latter. This motivated me to create a function that uses either cor(), cramer() or mcor(), depending on the types of data (see the discussions of cramer() and mcor() above). The function is called association() or assoc() and returns three matrices: the values of the measures of association, their p-values and the types of the functions used between the variables. Here's an example:

assocValues <- assoc(mtcarsData)
print(assocValues, digits=2)

Associations:
values:
       mpg   cyl  disp    hp  drat    wt  qsec   vs   am gear carb
mpg   1.00  0.86 -0.85 -0.78  0.68 -0.87  0.42 0.66 0.60 0.66 0.67
cyl   0.86  1.00  0.92  0.84  0.70  0.78  0.59 0.82 0.52 0.53 0.62
disp -0.85  0.92  1.00  0.79 -0.71  0.89 -0.43 0.71 0.59 0.77 0.56
hp   -0.78  0.84  0.79  1.00 -0.45  0.66 -0.71 0.72 0.24 0.66 0.79
drat  0.68  0.70 -0.71 -0.45  1.00 -0.71  0.09 0.44 0.71 0.83 0.33
wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 0.55 0.69 0.66 0.61
qsec  0.42  0.59 -0.43 -0.71  0.09 -0.17  1.00 0.74 0.23 0.63 0.67
vs    0.66  0.82  0.71  0.72  0.44  0.55  0.74 1.00 0.10 0.62 0.69
am    0.60  0.52  0.59  0.24  0.71  0.69  0.23 0.10 1.00 0.81 0.44
gear  0.66  0.53  0.77  0.66  0.83  0.66  0.63 0.62 0.81 1.00 0.51
carb  0.67  0.62  0.56  0.79  0.33  0.61  0.67 0.69 0.44 0.51 1.00

p-values:
      mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
mpg  1.00 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.01
cyl  0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.01
disp 0.00 0.00 1.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.07
hp   0.00 0.00 0.00 1.00 0.01 0.00 0.00 0.00 0.18 0.00 0.00
drat 0.00 0.00 0.00 0.01 1.00 0.00 0.62 0.01 0.00 0.00 0.66
wt   0.00 0.00 0.00 0.00 0.00 1.00 0.34 0.00 0.00 0.00 0.02
qsec 0.02 0.00 0.01 0.00 0.62 0.34 1.00 0.00 0.21 0.00 0.01
vs   0.00 0.00 0.00 0.00 0.01 0.00 0.00 1.00 0.56 0.00 0.01
am   0.00 0.01 0.00 0.18 0.00 0.00 0.21 0.56 1.00 0.00 0.28
gear 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.09
carb 0.01 0.01 0.07 0.00 0.66 0.02 0.01 0.01 0.28 0.09 1.00

types:
     mpg    cyl      disp   hp     drat   wt     qsec   vs       am
mpg  "none" "mcor"   "cor"  "cor"  "cor"  "cor"  "cor"  "mcor"   "mcor"
cyl  "mcor" "none"   "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "cramer"
disp "cor"  "mcor"   "none" "cor"  "cor"  "cor"  "cor"  "mcor"   "mcor"
hp   "cor"  "mcor"   "cor"  "none" "cor"  "cor"  "cor"  "mcor"   "mcor"
drat "cor"  "mcor"   "cor"  "cor"  "none" "cor"  "cor"  "mcor"   "mcor"
wt   "cor"  "mcor"   "cor"  "cor"  "cor"  "none" "cor"  "mcor"   "mcor"
qsec "cor"  "mcor"   "cor"  "cor"  "cor"  "cor"  "none" "mcor"   "mcor"
vs   "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "none"   "cramer"
am   "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "none"
gear "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "cramer"
carb "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "cramer"
     gear     carb
mpg  "mcor"   "mcor"
cyl  "cramer" "cramer"
disp "mcor"   "mcor"
hp   "mcor"   "mcor"
drat "mcor"   "mcor"
wt   "mcor"   "mcor"
qsec "mcor"   "mcor"
vs   "cramer" "cramer"
am   "cramer" "cramer"
gear "none"   "cramer"
carb "cramer" "none"

One thing to note is that the function considers numerical variables as categorical when they only have up to 10 unique values. This is useful, for example, in the case of the number of gears in this dataset.
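Based on the description above (and not on the greybox internals), the per-pair dispatch that assoc() performs can be sketched like this; pairMeasure is a hypothetical name used for illustration:

```r
# Which measure to use for a pair of variables, per the rules described above:
# a numeric variable with at most 10 unique values is treated as categorical.
pairMeasure <- function(x, y) {
  isCat <- function(v) is.factor(v) || is.character(v) || length(unique(v)) <= 10
  if (isCat(x) && isCat(y)) {
    "cramer"                    # both categorical
  } else if (!isCat(x) && !isCat(y)) {
    "cor"                       # both numerical
  } else {
    "mcor"                      # mixed scales
  }
}

pairMeasure(mtcars$mpg, mtcars$disp)  # "cor"
pairMeasure(mtcars$cyl, mtcars$gear)  # "cramer" (both have 3 unique values)
pairMeasure(mtcars$mpg, mtcars$gear)  # "mcor"
```

These three calls reproduce the corresponding entries of the types matrix shown above.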

Plots of association between several variables

Similarly to the problem with cor(), a scatterplot matrix (produced using plot()) is not meaningful in the case of a mixture of variables:

plot(mtcarsData)

Default scatter plot matrix

It makes sense to use a scatterplot in the case of numeric variables, tableplot() in the case of categorical ones, and boxplot() in the case of a mixture. So, there is a function in greybox that creates something more meaningful. It uses the same algorithm as the assoc() function, but produces plots instead of calculating measures of association. So, gear will be considered as categorical, and the function will produce either boxplot() or tableplot() when plotting it against other variables.

Here's an example:

This plot demonstrates, for example, that the number of carburetors influences fuel consumption (something that we could not have spotted in the case of plot()). Notice also that the number of gears influences the fuel consumption in a non-linear way as well. So constructing a model with dummy variables for the number of gears might be a reasonable thing to do.

The function also has the parameter log, which transforms all the numerical variables using logarithms; this is handy when you suspect a non-linear relation between the variables. Finally, there is a parameter histogram, which plots either histograms or barplots on the diagonal.

The plot demonstrates that disp has a strong non-linear relation with mpg and, similarly, drat and hp also influence mpg in a non-linear fashion.

Regression diagnostics

One of the problems of linear regression that can be diagnosed prior to the model construction is multicollinearity. The conventional way of doing this diagnostic is via calculating the variance inflation factor (VIF) after constructing the model. However, VIF is not easy to interpret, because it lies in $$[1, \infty)$$. Coefficients of determination from the linear regression models of the explanatory variables are easier to interpret and work with. If such a coefficient is equal to one, then there are some perfectly correlated explanatory variables in the dataset. If it is equal to zero, then they are not linearly related.

There is a function determination() or determ() in greybox that returns the set of coefficients of determination for the explanatory variables. The good thing is that this can be done before constructing any model. In our example, the first column, mpg, is the response variable, so we can diagnose multicollinearity the following way:

determination(mtcarsData[,-1])

      cyl      disp        hp      drat        wt      qsec        vs        am      gear      carb
0.9349544 0.9537470 0.8982917 0.7036703 0.9340582 0.8671619 0.8017720 0.7924392 0.8133441 0.8735577

As we can see from the output above, disp is the most linearly related with the other variables, so including it in the model might cause multicollinearity, which would decrease the efficiency of the parameter estimates.
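The idea can be sketched by regressing each explanatory variable on all the remaining ones and collecting the coefficients of determination (determSketch is a hypothetical name used for illustration; VIF relates to these values as VIF = 1/(1 - R^2)):

```r
# Sketch of the multicollinearity diagnostic described above: for each
# explanatory variable, compute R^2 from a regression on all the others.
determSketch <- function(X) {
  X <- as.data.frame(X)
  setNames(
    sapply(seq_along(X), function(i) {
      summary(lm(X[[i]] ~ ., data = X[-i]))$r.squared
    }),
    names(X)
  )
}

round(determSketch(mtcars[, -1]), 4)  # cyl ~0.935, disp ~0.954, as above
```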


To leave a comment for the author, please follow the link and comment on their blog: R – Modern Forecasting.