R-bloggers: R news and tutorials contributed by 750 R bloggers

Mapping the Prevalence of Alzheimer Disease Mortality in the USA

Sat, 08/18/2018 - 17:26

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Compared with other statistical software (e.g., SAS, Stata, and SPSS), R is the best for data visualization. Therefore, in all posts I have written for DataScience+ I take advantage of R and make plots with ggplot2 to visualize the findings. For example, I previously plotted the percentiles of body mass index in NHANES 2005-2014 and got exactly the same results as the paper published in JAMA.

In this post, I will make a map of the prevalence of Alzheimer disease mortality by state in the USA. The Centers for Disease Control and Prevention provides the data for download, and they have created a beautiful map. I will try to reproduce the same results using several packages in R.

Libraries and Datasets

Load the libraries

library(tidyverse)
library(scales)
library(maps)
library(mapproj)

Download the .CSV file from the Centers for Disease Control and Prevention website (link is above)

dt_ad <- read.csv("~/Downloads/ALZHEIMERS2016.csv")
head(dt_ad)
  STATE RATE DEATHS                                               URL
1    AL 45.0  2,507        /nchs/pressroom/states/alabama/alabama.htm
2    AK 25.8    111          /nchs/pressroom/states/alaska/alaska.htm
3    AZ 35.8  3,082        /nchs/pressroom/states/arizona/arizona.htm
4    AR 41.3  1,475      /nchs/pressroom/states/arkansas/arkansas.htm
5    CA 36.1 15,570  /nchs/pressroom/states/california/california.htm
6    CO 34.7  1,835      /nchs/pressroom/states/colorado/colorado.htm

Load the map data of the U.S. states

dt_states = map_data("state")
head(dt_states)
       long      lat group order  region subregion
1 -87.46201 30.38968     1     1 alabama
2 -87.48493 30.37249     1     2 alabama
3 -87.52503 30.37249     1     3 alabama
4 -87.53076 30.33239     1     4 alabama
5 -87.57087 30.32665     1     5 alabama
6 -87.58806 30.32665     1     6 alabama

Now I have two datasets: one has the rate of mortality from Alzheimer disease, and the other has the variables needed to create maps. I need to merge the two datasets, but I don't have a common variable to merge on. Therefore, I will create a new region variable from the URL variable in the first dataset and use it to merge with the second dataset. For this purpose, I will use the functions separate and gsub. In the end, I will merge with the states dataset by region.

# get the state name from URL
dt_ad2 = dt_ad %>%
  separate(URL, c("a", "b", "c", "d", "region"), sep = "/") %>%
  select(RATE, region)

# removing white space for merging purposes
dt_states2 = dt_states %>%
  mutate(region = gsub(" ", "", region))

# merge
dt_final = left_join(dt_ad2, dt_states2)
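As an aside (not part of the original post), another way to build a common key is to map the two-letter STATE abbreviations onto lower-case state names with base R's state.abb and state.name vectors. This is only a sketch: the derived names keep their spaces, so it would join against dt_states directly rather than dt_states2, and non-state rows such as DC would come through as NA.

# Hypothetical alternative join key derived from the STATE abbreviation
dt_ad_alt <- dt_ad %>%
  mutate(region = tolower(state.name[match(STATE, state.abb)])) %>%
  select(RATE, region)
dt_final_alt <- left_join(dt_ad_alt, dt_states, by = "region")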

Visualization

The dt_final dataset has all the variables I need to make the map.

ggplot(dt_final, aes(x = long, y = lat, group = group, fill = RATE)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(
    name = "Death Rate",
    low = "#fbece3",
    high = "#6f1873",
    guide = "colorbar",
    na.value = "#eeeeee",
    breaks = pretty_breaks(n = 5)) +
  labs(title = "Mortality of Alzheimer Disease in the U.S.", x = "", y = "") +
  coord_map()

In this short post I showed how simple it is to visualize data on a map. I hope you like it, and feel free to post a comment below or send me a message.

    Related Post

    1. Animating the Goals of the World Cup: Comparing the old vs. new gganimate and tweenr API
    2. Machine Learning Results in R: one plot to rule them all! (Part 1 – Classification Models)
    3. Seaborn Categorical Plots in Python
    4. Matplotlib Library Tutorial with Examples – Python
    5. Visualize the World Cup with R! Part 1: Recreating Goals with ggsoccer and ggplot2

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    R:case4base – code profiling with base R

    Sat, 08/18/2018 - 15:00

    (This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

    Introduction

In this summertime post in the case4base series, we will look at useful tools in base R that let us profile our code without installing any extra packages. We will cover simple and easy-to-use speed profiling, more complex profiling of performance and memory and, as always, look at alternatives to base R as well, with a special shout-out to the profiling integration in RStudio.

    Contents
    1. Simple time profiling with system.time
    2. Profile R execution with Rprof
    3. Non-sampling memory use profiling with Rprofmem
    4. Profiling integration within RStudio
    5. Background profiling with base R via an RStudio addin
    6. Alternatives to base R
    7. References
    Simple time profiling with system.time

The base function system.time returns the difference between two proc.time calls, between which it evaluates the expression provided as its argument. The simplest usage can be seen below:

system.time(runif(10^8))
##    user  system elapsed
##   4.660   0.392   5.186

For the purpose of processing the results, we can of course store and examine them in a variable, where we can see that the result is in fact a numeric vector with 5 elements and a proc_time class. It uses summary as its print method via print.proc_time. For most of our purposes, we would be interested in the “elapsed” element of the result, giving us the ‘real’ elapsed time since the process was started:

tm <- system.time(runif(10^8))
str(tm)
## Class 'proc_time'  Named num [1:5] 4.49 0.38 4.97 0 0
##   ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
tm["elapsed"]
## elapsed
##   4.969

    We can also very simply run multiple observations for an expression and investigate the results:

expr <- rep(expression(runif(10^8)), 10L)
tm <- unlist(lapply(expr, function(x) system.time(eval(x))["elapsed"]))
summary(tm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   4.873   4.946   4.980   5.055   5.059   5.432

    With a little tweaking we can also run it in a separate process to not block our R session:

script <- paste(
  'expr <- rep(expression(runif(10^7)), 10L)',
  'tm <- unlist(lapply(expr, function(x) system.time(eval(x))["elapsed"]))',
  'print(summary(tm))',
  sep = ';')

system2(command = 'Rscript', args = c('-e', shQuote(script)), wait = FALSE)

Profile R execution with Rprof

The utils package included in the base R releases contains a very useful pair of functions for profiling R code by sampling it at regular time intervals:

    • Use utils::Rprof() to enable the R profiling, run the code to be profiled and use utils::Rprof(NULL) to disable profiling
    • Afterwards, use utils::summaryRprof() to investigate the results

The most basic usage really is this straightforward:

# Enable profiling
utils::Rprof()

# Run the code to be profiled
x <- lapply(10^(6:7), runif)
y <- lapply(x, summary)
z <- sort(x[[2]])

# Disable profiling
utils::Rprof(NULL)

# Read the profiling results and view
res <- utils::summaryRprof()
res[["by.self"]]

The profiling can be customized with arguments such as filename, which specifies the file to which the results will be written (and which also serves as the off switch if set to NULL or ""), and interval, which governs the time between profiling samples. More can be found in the function’s help.
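For instance, here is a minimal sketch of customising both arguments (the file name and the 0.05-second interval are arbitrary choices, not from the original post):

utils::Rprof(filename = "ProfCustom.out", interval = 0.05)  # sample every 0.05 s instead of 0.02 s
x <- lapply(10^(6:7), runif)
utils::Rprof(NULL)
utils::summaryRprof(filename = "ProfCustom.out")[["by.total"]]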

    Perhaps the most interesting argument is memory.profiling which if set to TRUE will add memory information into the results file:

# Enable profiling with memory profiling
utils::Rprof(filename = "ProfwMemory.out", memory.profiling = TRUE)

# Run the code to be profiled
x <- lapply(10^(6:7), runif)
y <- lapply(x, summary)
z <- sort(x[[2]])

# Disable profiling
utils::Rprof(NULL)

# Read the profiling results and view results in different ways
utils::summaryRprof(filename = "ProfwMemory.out", memory = c("stats"), lines = "show")
utils::summaryRprof(filename = "ProfwMemory.out", memory = c("both"))[["by.self"]]

Non-sampling memory use profiling with Rprofmem

Base R also offers an option to profile memory use (if R is compiled with R_MEMORY_PROFILING defined) using Rprofmem – a pure memory use profiler. Results are written as simple text into a file, from which they can be read back in; however, processing the results may need a bit more polishing here:

# Enable memory profiling
utils::Rprofmem("Rprofmem.out", threshold = 10240)

# Run the code to be profiled
x <- runif(10^5)
y <- runif(10^6)
z <- runif(10^7)

# Disable profiling
utils::Rprofmem(NULL)

# Read the results
readLines("Rprofmem.out")

If our concern is specifically the copying of (large) objects, which negatively impacts the memory requirements of our work, we can use tracemem (provided that R is compiled with --enable-memory-profiling). tracemem(object) marks an object for tracking and prints a stack trace whenever it is duplicated; untracemem(object) untraces the object.
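A minimal sketch of tracemem in action (assuming an R build with memory profiling enabled, as the usual CRAN binaries are):

x <- runif(10^6)
tracemem(x)    # mark x for tracking
y <- x         # no copy yet, x and y share the same memory
y[1] <- 0      # modifying y forces a copy, which tracemem reports
untracemem(x)  # stop tracking x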

    For more details see the references section.

    Profiling integration within RStudio

Even though this does not really adhere to the case4base rules, we still mention the RStudio profiling integration, which is done using the profvis package and, if successful, works really well and provides informative graphical outputs. All we have to do is either select a chunk of code and click Profile -> Profile Selected Line(s), or click Profile -> Start Profiling, run our code, and then Profile -> Stop Profiling. RStudio should then automatically use profvis to produce an interactive output that allows nice exploration of the results:

    RStudio+profvis

    Background profiling with base R via an RStudio addin

We have also created and written about an RStudio addin that lets users profile R code selected in RStudio, with the advantage that the profiling runs asynchronously in a separate process, not blocking the current R session and also not requiring external packages such as profvis. You can read more about it and get it here.

Alternatives to base R

References
    1. Profiling R code for speed at Writing R Extensions
    2. Profiling R code for memory use at Writing R Extensions
    3. system.time help
    4. Memory profiling in R

To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog.

    approximative Laplace

    Sat, 08/18/2018 - 00:18

    (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

I came across this question on X validated that wondered about one of our examples in Monte Carlo Statistical Methods. We included a section on Laplace approximations in the Monte Carlo integration chapter, with a bit of reluctance on my side, as this type of integral approximation does not directly connect to Monte Carlo methods. Even less so in the case of the example, as we aimed at replacing a coverage probability for a Gamma distribution with a formal Laplace approximation. Formal due to the lack of asymptotics, besides the length of the interval (a,b) whose probability is approximated. Hence, on top of the typos, the point of the example is not crystal clear, in that it does not show much more than that the step-function approximation to the function converges as the interval length goes to zero. For instance, using instead a flat approximation produces an approximation that is almost as good:

> xact(5,2,7,9)
[1] 0.1933414
> laplace(5,2,7,9)
[1] 0.1933507
> flat(5,2,7,9)
[1] 0.1953668

    What may be more surprising is the resilience of the approximation as the width of the interval increases:

> xact(5,2,5,11)
[1] 0.53366
> lapl(5,2,5,11)
[1] 0.5354954
> plain(5,2,5,11)
[1] 0.5861004
> quad(5,2,5,11)
[1] 0.434131

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

    Topics and Categories in the Russian Troll Tweets

    Fri, 08/17/2018 - 23:49

    (This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


I decided to return to the analysis I conducted for the IRA tweets dataset. (You can read up on that analysis and R code here.) Specifically, I returned to the LDA results, which looked like they lined up pretty well with the account categories identified by Darren Linvill and Patrick Warren. But with slightly altered code, we can confirm that or see if there’s more to the topics data than meets the eye. (Spoiler alert: There is more than meets the eye.)

    I reran much of the original code – creating the file, removing non-English tweets and URLs, generating the DTM and conducting the 6-topic LDA. For brevity, I’m not including it in this post, but once again, you can see it here.

    I will note that the topics were numbered a bit differently than they were in my previous analysis. Here’s the new plot. The results look very similar to before. (LDA is a variational Bayesian method and there is an element of randomness to it, so the results aren’t a one-to-one match, but they’re very close.)

    top_terms <- tweet_topics %>%
    group_by(topic) %>%
    top_n(15, beta) %>%
    ungroup() %>%
    arrange(topic, -beta)

    top_terms %>%
    mutate(term = reorder(term, beta)) %>%
    ggplot(aes(term, beta, fill = factor(topic))) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~topic, scales = "free") +
    coord_flip()

    Before, when I generated a plot of the LDA results, I asked it to give me the top 15 terms by topic. I’ll use the same code, but instead have it give the top topic for each term.

    word_topic <- tweet_topics %>%
    group_by(term) %>%
    top_n(1, beta) %>%
    ungroup()

    I can then match this dataset up to the original tweetwords dataset, to show which topic each word is most strongly associated with. Because the word variable is known by two different variable names in my datasets, I need to tell R how to match.

    tweetwords <- tweetwords %>%
    left_join(word_topic, by = c("word" = "term"))

    Now we can generate a crosstable, which displays the matchup between LDA topic (1-6) and account category (Commercial, Fearmonger, Hashtag Gamer, Left Troll, News Feed, Right Troll, and Unknown).

    cat_by_topic <- table(tweetwords$account_category, tweetwords$topic)
    cat_by_topic
    ##
    ## 1 2 3 4 5 6
    ## Commercial 38082 34181 49625 952309 57744 19380
    ## Fearmonger 9187 3779 37326 1515 8321 4864
    ## HashtagGamer 117517 103628 183204 31739 669976 81803
    ## LeftTroll 497796 1106698 647045 94485 395972 348725
    ## NewsFeed 2715106 331987 525710 91164 352709 428937
    ## RightTroll 910965 498983 1147854 113829 534146 2420880
    ## Unknown 7622 5198 12808 1497 11282 4605

    This table is a bit hard to read, because it’s frequencies, and the total number of words for each topic and account category differ. But we can solve that problem by asking instead for proportions. I’ll have it generate proportions by column, so we can see the top account category associated with each topic.
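As a quick aside with toy numbers (not from the tweets data, and not in the original post), this is what the two margins of prop.table do, which may make the next two tables easier to read:

m <- matrix(c(2, 8, 6, 4), nrow = 2,
            dimnames = list(c("catA", "catB"), c("topic1", "topic2")))
prop.table(m, 2)  # margin = 2: each column (topic) sums to 1, i.e. the category mix within a topic
prop.table(m, 1)  # margin = 1: each row (category) sums to 1, i.e. the topic mix within a category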

    options(scipen = 999)
prop.table(cat_by_topic, 2) #column percentages - which account category is each topic most associated with
    ##
    ## 1 2 3 4 5
    ## Commercial 0.008863958 0.016398059 0.019060352 0.740210550 0.028443218
    ## Fearmonger 0.002138364 0.001812945 0.014336458 0.001177579 0.004098712
    ## HashtagGamer 0.027353230 0.049714697 0.070366404 0.024670084 0.330013053
    ## LeftTroll 0.115866885 0.530929442 0.248522031 0.073441282 0.195045686
    ## NewsFeed 0.631967460 0.159268087 0.201918749 0.070859936 0.173735438
    ## RightTroll 0.212036008 0.239383071 0.440876611 0.088476982 0.263106667
    ## Unknown 0.001774095 0.002493699 0.004919395 0.001163588 0.005557225
    ##
    ## 6
    ## Commercial 0.005856411
    ## Fearmonger 0.001469844
    ## HashtagGamer 0.024719917
    ## LeftTroll 0.105380646
    ## NewsFeed 0.129619781
    ## RightTroll 0.731561824
    ## Unknown 0.001391578

Category 1 is News Feed, Category 2 Left Troll, Category 4 Commercial, and Category 5 Hashtag Gamer. But look at Categories 3 and 6. For both, the highest percentage is Right Troll. Fearmonger is not most strongly associated with any specific topic. What happens if we instead ask for a proportion table by row, which tells us which topic each account category is most associated with?

prop.table(cat_by_topic, 1) #row percentages - which topic is each account category most associated with
    ##
    ## 1 2 3 4 5
    ## Commercial 0.03307679 0.02968851 0.04310266 0.82714465 0.05015456
    ## Fearmonger 0.14135586 0.05814562 0.57431684 0.02331056 0.12803114
    ## HashtagGamer 0.09893111 0.08723872 0.15422939 0.02671932 0.56401601
    ## LeftTroll 0.16106145 0.35807114 0.20935083 0.03057054 0.12811638
    ## NewsFeed 0.61073827 0.07467744 0.11825366 0.02050651 0.07933866
    ## RightTroll 0.16190164 0.08868197 0.20400284 0.02023031 0.09493132
    ## Unknown 0.17720636 0.12085000 0.29777736 0.03480424 0.26229889
    ##
    ## 6
    ## Commercial 0.01683284
    ## Fearmonger 0.07483998
    ## HashtagGamer 0.06886545
    ## LeftTroll 0.11282966
    ## NewsFeed 0.09648546
    ## RightTroll 0.43025192
    ## Unknown 0.10706315

Based on these results, Fearmonger now seems closest to Category 3 and Right Troll to Category 6. But Right Troll also shows up in Categories 3 (20%) and 1 (16%). Left Trolls show up in these categories at almost the same proportions. It appears, then, that political trolls show strong similarity in topics with Fearmongers (stirring things up) and News Feed (“informing”) trolls. Unknown isn’t the top contributor to any topic, but it aligns with Category 3 (showing elements of Fearmongering) and 5 (showing elements of Hashtag Gaming). Let’s focus in on 5 categories.

    categories <- c("Fearmonger", "HashtagGamer", "LeftTroll", "NewsFeed", "RightTroll")

    politics_fear_hash <- tweetwords %>%
    filter(account_category %in% categories)

    PFH_counts <- politics_fear_hash %>%
    count(account_category, topic, word, sort = TRUE) %>%
    ungroup()

For now, let’s define our topics like this: 1 = News Feed, 2 = Left Troll, 3 = Fearmonger, 4 = Commercial, 5 = Hashtag Gamer, and 6 = Right Troll. We’ll ask R to go through our PFH dataset and tell us when account category and topic match and when they mismatch. Then we can look at these terms.

    PFH_counts$match <- ifelse(PFH_counts$account_category == "NewsFeed" & PFH_counts$topic == 1,PFH_counts$match <- "Match",
    ifelse(PFH_counts$account_category == "LeftTroll" & PFH_counts$topic == 2,PFH_counts$match <- "Match",
    ifelse(PFH_counts$account_category == "Fearmonger" & PFH_counts$topic == 3,PFH_counts$match <- "Match",
    ifelse(PFH_counts$account_category == "HashtagGamer" & PFH_counts$topic == 5,PFH_counts$match <- "Match",
    ifelse(PFH_counts$account_category == "RightTroll" & PFH_counts$topic == 6,PFH_counts$match <- "Match",
    PFH_counts$match <- "NonMatch")))))

    top_PFH <- PFH_counts %>%
    group_by(account_category, match) %>%
    top_n(15, n) %>%
    ungroup() %>%
    arrange(account_category, -n)

    top_PFH %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n, fill = factor(match))) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~account_category, scales = "free") +
    coord_flip()

Red indicates a match and blue indicates a mismatch. So when Fearmongers talk about food poisoning or Koch Farms, it’s a match, but when they talk about Hillary Clinton or the police, it’s a mismatch. Terms like “MAGA” and “CNN” are matches for Right Trolls but “news” and “love” are mismatches. Left Trolls show a match when tweeting about “Black Lives Matter” or “police” but a mismatch when tweeting about “Trump” or “America.” An interesting observation is that Trump is a mismatch for every account category it’s displayed under on the plot. (Now, realdonaldtrump, Trump’s Twitter handle, is a match for Right Trolls.) So where does that term, and associated terms like “Donald”, belong?

    tweetwords %>%
    filter(word %in% c("donald", "trump"))
    ## # A tibble: 157,844 x 7
    ## author publish_date account_category id word topic beta
    ##
    ## 1 10_GOP 10/1/2017 22:43 RightTroll C:/Users/~ trump 3 0.0183
    ## 2 10_GOP 10/1/2017 23:52 RightTroll C:/Users/~ trump 3 0.0183
    ## 3 10_GOP 10/1/2017 2:47 RightTroll C:/Users/~ dona~ 3 0.00236
    ## 4 10_GOP 10/1/2017 2:47 RightTroll C:/Users/~ trump 3 0.0183
    ## 5 10_GOP 10/1/2017 3:47 RightTroll C:/Users/~ trump 3 0.0183
    ## 6 10_GOP 10/10/2017 20:57 RightTroll C:/Users/~ trump 3 0.0183
    ## 7 10_GOP 10/10/2017 23:42 RightTroll C:/Users/~ trump 3 0.0183
    ## 8 10_GOP 10/11/2017 22:14 RightTroll C:/Users/~ trump 3 0.0183
    ## 9 10_GOP 10/11/2017 22:20 RightTroll C:/Users/~ trump 3 0.0183
    ## 10 10_GOP 10/12/2017 0:38 RightTroll C:/Users/~ trump 3 0.0183
    ## # ... with 157,834 more rows

    These terms apparently were sorted into Category 3, which we’ve called Fearmongers. Once again, this highlights the similarity between political trolls and fearmongering trolls in this dataset.


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.

    A Review of James Picerno’s Quantitative Investment Portfolio Analytics in R

    Fri, 08/17/2018 - 18:35

    (This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

    This is a review of James Picerno’s Quantitative Investment Portfolio Analytics in R. Overall, it’s about as fantastic a book as you can get on portfolio optimization until you start getting into corner cases stemming from large amounts of assets.

    Here’s a quick summary of what the book covers:

    1) How to install R.

    2) How to create some rudimentary backtests.

    3) Momentum.

    4) Mean-Variance Optimization.

    5) Factor Analysis

    6) Bootstrapping/Monte-Carlo simulations.

    7) Modeling Tail Risk

    8) Risk Parity/Vol Targeting

    9) Index replication

    10) Estimating impacts of shocks

    11) Plotting in ggplot

    12) Downloading/saving data.

    All in all, the book teaches the reader many fantastic techniques to get started doing some basic portfolio management using asset-class ETFs, and under the assumption of ideal data–that is, that there are few assets with concurrent starting times, that the number of assets is much smaller than the number of observations (I.E. 10 asset class ETFs, 90 day lookback windows, for instance), and other attributes taken for granted to illustrate concepts. I myself have used these concepts time and again (and, in fact, covered some of these topics on this blog, such as volatility targeting, momentum, and mean-variance), but in some of the work projects I’ve done, the trouble begins when the number of assets grows larger than the number of observations, or when assets move in or out of the investable universe (EG a new company has an IPO or a company goes bankrupt/merges/etc.). It also does not go into the PortfolioAnalytics package, developed by Ross Bennett and Brian Peterson. Having recently started to use this package for a real-world problem, it produces some very interesting results and its potential is immense, with the large caveat that you need an immense amount of computing power to generate lots of results for large-scale problems, which renders it impractical for many individual users. A quadratic optimization on a backtest with around 2400 periods and around 500 assets per rebalancing period (days) took about eight hours on a cloud server (when done sequentially to preserve full path dependency).

    However, aside from delving into some somewhat-edge-case appears-more-in-the-professional-world topics, this book is extremely comprehensive. Simply, as far as managing a portfolio of asset-class ETFs (essentially, what the inimitable Adam Butler and crew from ReSolve Asset Management talk about, along with Walter’s fantastic site, AllocateSmartly), this book will impart a lot of knowledge that goes into doing those things. While it won’t make you as comfortable as say, an experienced professional like myself is at writing and analyzing portfolio optimization backtests, it will allow you to do a great deal of your own analysis, and certainly a lot more than anyone using Excel.

    While I won’t rehash what the book covers in this post, what I will say is that it does cover some of the material I’ve posted in years past. And furthermore, rather than spending half the book about topics such as motivations, behavioral biases, and so on, this book goes right into the content that readers should know in order to execute the tasks they desire. Furthermore, the content is presented in a very coherent, English-and-code, matter-of-fact way, as opposed to a bunch of abstract mathematical derivations that treats practical implementation as an afterthought. Essentially, when one buys a cookbook, they don’t get it to read half of it for motivations as to why they should bake their own cake, but on how to do it. And as far as density of how-to, this book delivers in a way I think that other authors should strive to emulate.

    Furthermore, I think that this book should be required reading for any analyst wanting to work in the field. It’s a very digestible “here’s how you do X” type of book. I.E. “here’s a data set, write a backtest based on these momentum rules, use an inverse-variance weighting scheme, do a Fama-French factor analysis on it”.

    In any case, in my opinion, for anyone doing any sort of tactical asset allocation analysis in R, get this book now. For anyone doing any sort of tactical asset allocation analysis in spreadsheets, buy this book sooner than now, and then see the previous sentence. In any case, I’ll certainly be keeping this book on my shelf and referencing it if need be.

    Thanks for reading.

    Note: I am currently contracting but am always on the lookout for full-time positions in New York City. If you know of a position which may benefit from my skills, please let me know. My LinkedIn profile can be found here.


To leave a comment for the author, please follow the link and comment on their blog: R – QuantStrat TradeR.

    How to Create Sankey Diagrams From Tables (Data Frames) Using R

    Fri, 08/17/2018 - 15:00

    (This article was first published on R – Displayr, and kindly contributed to R-bloggers)

    Step 1: Create a Tidy data frame

    The very first step in creating visualizations is to get the data in a useful format. In the case of Sankey diagrams, the trick is to get the data into the tidy data format. This post uses a simple example to make it clear how everything fits together. Below, you can see the R code to create a small data frame. I’ve shown this as a table, followed by the resulting Sankey diagram.

my.data = data.frame(Married = c("Yes", "Yes", "Yes", "No", "No"),
                     Pet = c("Yes", "Yes", "No", "Yes", "No"),
                     Happy = c("Yes", "Yes", "Yes", "Yes", "No"),
                     freq = 5:1)

    A few things to note:

    • The Sankey diagram is in a different order to the data in the table, with “no” appearing before “yes”. Sankey automatically orders the categories to minimize the amount of overlap.
    • Where two rows in the table contain the same information, Sankey automatically combines them. In our table, we can see that the first two rows are the same. Our Sankey diagram has combined them so the flow from Married: Yes to Pet: Yes to Happy: Yes has a weight (width) of 5 + 4 = 9. You can see this value if you hover your mouse over the Sankey diagram.
• The Sankey diagram automatically merges together the nodes (blocks) that have the same text. For example, while we have five rows of data in the example above, we only have two unique values of Pet, which is why only two blocks for pet ownership appear.
•  We can pull apart the blocks by changing the labels, as shown in the data frame and resulting Sankey diagram below (a hypothetical reconstruction of that data frame follows this list). My colleague, Carmen, is working on modifying the code to be able to split these apart without changing the labels.
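The relabelled data frame (my.data.2, used again in Step 3) is not reproduced in this excerpt. A purely hypothetical reconstruction, with labels made unique per column so that blocks no longer merge, might look like this:

# Hypothetical reconstruction of my.data.2: same structure as my.data, but
# with labels made unique per column so identically named nodes don't merge
my.data.2 = data.frame(Married = c("Married: Yes", "Married: Yes", "Married: Yes", "Married: No", "Married: No"),
                       Pet = c("Pet: Yes", "Pet: Yes", "Pet: No", "Pet: Yes", "Pet: No"),
                       Happy = c("Happy: Yes", "Happy: Yes", "Happy: Yes", "Happy: Yes", "Happy: No"),
                       freq = 5:1)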

Step 2: Install the flipPlots package

    The Sankey diagrams I am using in this post, come from our flipPlots package (Displayr/flipPlots). If you don’t know how to install from GitHub, please see how to install packages from GitHub.

    Step 3: Create the Sankey diagram

    We created the first of the Sankey diagrams shown in this post using the code below. Note that the data frame is passed in as the first argument, but the fourth column (the one containing the weight) has been removed. I’ve set link.color to “Source”, which sets the colors that emanate from the same node to be consistent.

library(flipPlots)
SankeyDiagram(my.data[, -4],
              link.color = "Source",
              weights = my.data$freq)

I’ve provided the code for the second Sankey diagram shown in this post below. The only difference from the previous code is that I’ve used label.show.varname = FALSE to prevent the variable names from being shown in the Sankey diagram.

library(flipPlots)
SankeyDiagram(my.data.2[, -4],
              link.color = "Source",
              label.show.varname = FALSE,
              weights = my.data.2$freq)

     

    More complicated sankey diagrams

    If you want to create more complicated Sankey diagrams, which do not easily fit into the structure of a table (data frame), please see Creating Custom Sankey Diagrams Using R.

    Acknowledgements

    The Sankey diagrams are created using a modified version of networkD3, created by Kenton Russell (timelyportfolio/networkD3@feature/responsive). networkD3 is an HTMLwidget version of Mike Bostock’s D3 Sankey diagram code, which is inspired by Tom Counsell’s Sankey library.


To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.

    RcppArmadillo 0.9.100.5.0

    Fri, 08/17/2018 - 14:00

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A new RcppArmadillo release 0.9.100.5.0, based on the new Armadillo release 9.100.5 from earlier today, is now on CRAN and in Debian.

    It once again follows our (and Conrad’s) bi-monthly release schedule. Conrad started with a new 9.100.* series a few days ago. I ran reverse-depends checks and found an issue which he promptly addressed; CRAN found another which he also very promptly addressed. It remains a true pleasure to work with such experienced professionals as Conrad (with whom I finally had a beer around the recent useR! in his home town) and of course the CRAN team whose superb package repository truly is the bedrock of the R community.

    Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use with a syntax deliberately close to a Matlab. RcppArmadillo integrates this library with the R environment and language–and is widely used by (currently) 479 other packages on CRAN.

This release once again brings a number of improvements to the sparse matrix functionality. We also fixed one use case of the OpenMP compiler and linker flags which will likely hit a number of the by now 501 (!!) CRAN packages using RcppArmadillo.

    Changes in RcppArmadillo version 0.9.100.5.0 (2018-08-16)
    • Upgraded to Armadillo release 9.100.4 (Armatus Ad Infinitum)

      • faster handling of symmetric/hermitian positive definite matrices by solve()

      • faster handling of inv_sympd() in compound expressions

      • added .is_symmetric()

      • added .is_hermitian()

      • expanded spsolve() to optionally allow keeping solutions of systems singular to working precision

• new configuration options ARMA_OPTIMISE_SOLVE_BAND and ARMA_OPTIMISE_SOLVE_SYMPD

      • smarter use of the element cache in sparse matrices

• Aligned OpenMP flags in the Makevars{,.win} files used by RcppArmadillo.package.skeleton() so as not to use one C and C++ flag.

    Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

Edited on 2018-08-17 to correct one sentence (thanks, Barry!) and adjust the RcppArmadillo usage count to 501 (!!) as we crossed the threshold of 500 packages overnight.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

    An R package to create NEWS.md files

    Fri, 08/17/2018 - 08:43

    (This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

A while back, I started to create an R package that would help me and my colleagues at STATWORX with our daily work. After writing the DESCRIPTION file, I did not want to do this ever again. I found the desc package, which lets you parse, manipulate and reformat this file via some functions and ergo a script. If you want more info on that, check out their GitHub. However, I was still too lazy, because I had to manually write the NEWS file.

I longed for the ability to edit the NEWS file in the same script I built for the DESCRIPTION file. Since I could not find anything – and was already in a mood for programming – I wrote one myself. In this blog post, I will briefly explain what the package does.

    Usage of newsmd

    The main part of the package is the news object, which is an R6 class object and contains the text for the NEWS.md file. You can add versions, subtitles and bullet points to it via the object's methods.

    Initialise a new object

    To initialise a new object you can use two different ways:

# install.packages("devtools")
devtools::install_github("Dschaykib/newsmd")
library(newsmd)

my_news <- news$new()
my_news <- newsmd()

    The default text contains markdown code and looks like this:

## version 0.0.0.9000

---

### setup

- added NEWS.md creation

Adding the next version

    With add_version you can update the version number.

my_news$add_version("0.0.1")

Adding a new subtitle

With add_subtitle you can add a new subtitle, under which the following bullet points will be placed.

my_news$add_subtitle("Bugfixes")

Adding more bullets

    With add_bullet you can add more bullet points to the latest version and latest subtitle.

my_news$add_bullet(c("this is point 1", "this is point 2"))

Getting the whole text

After these few changes, let's see how the file looks. The get_text method will return each single line of the file. Alternatively, you can just use print(my_news).

my_news$get_text()
 [1] "## version 0.0.1"
 [2] ""
 [3] "---"
 [4] ""
 [5] ""
 [6] "### Bugfixes"
 [7] ""
 [8] "- this is point 1"
 [9] "- this is point 2"
[10] ""
[11] ""
[12] "## version 0.0.0.9000"
[13] ""
[14] "---"
[15] ""
[16] "### setup"
[17] ""
[18] "- added NEWS.md creation"
[19] ""

Writing the NEWS.md file

    At last, with write you can save all your changes into the NEWS.md file.

my_news$write()

Combination with the desc package

    The goal of this package is to update the NEWS.md file in addition to the DESCRIPTION file. To optimize the workflow I suggest the desc package. For example instead of manually defining the version number, you can use desc_bump_version() and get_version() of the desc package:

my_desc$bump_version("patch")
my_news$add_version(my_desc$get_version())
my_news$add_bullet("added automated creation for DESCRIPTION and NEWS.md")
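A minimal sketch of finishing the workflow (assuming, as above, that my_desc is a desc description object and my_news a newsmd object) is to write both updated files back to the package directory:

my_desc$write()   # write the updated DESCRIPTION
my_news$write()   # write the updated NEWS.md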

The full example script I used for my other package can be found here.

    Conclusion

    I hope this little package will be of help to some of you. If you find some bugs or have ideas that would improve the usage and functionality, I would be pleased if you let me know on my Github.

About the author

    Jakob Gepp

Jakob is part of the statistics team and is currently very interested in Hadoop and Big Data. In his spare time, he likes to tinker with old electronic devices and plays hockey.

The post An R package to create NEWS.md files first appeared on STATWORX.


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX.

    Many reports from 1 RMarkdown file

    Fri, 08/17/2018 - 08:04

    (This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

    I was at the EdinbR talk this week by the RStudio community lead – Curtis Kephart. It was really interesting, but I disagree with his suggestion to point and click different parameters when you want to generate multiple reports from the same RMarkdown file. This might be acceptable if you have one or two, but any more and the chance for error and tedium is greatly increased. This blog post shows you how to loop (yes – an actual for loop!) through a variable to generate different reports for each of its unique values.

    First, we need an RMarkdown file (.Rmd). This is largely the same as your usual .Rmd file, and I strongly encourage you to develop it like one. i.e. write your single .Rmd file and convert it into a special use case to be a template. Working like this makes debugging a whole lot easier. Here’s an example of a “normal” .Rmd:

    --- title: "Demographics exploratory analysis" author: "Mike Spencer" date: "11 September 2017" output: pdf_document: toc: yes --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo=F, message=F, results='hide', warning=F, fig.height=9) source("read.R") library(tidyverse) library(RColorBrewer) library(knitr) ``` ## Introduction This file provides a summary of the digital potential in the rural UK survey. In particular it shows the answers for females. The summary had `r nrow(df)` responses, of these `r sum(is.na(df$Eastings))` either had no postcode or the postcode supplied did not match the current list from Ordnance Survey or the UK data service (Northern Irish postcodes). Those postcodes which were matchable could be related to country and their rural-urban classification.

    You can see it’s not far off what you get when you opt to start a new RMarkdown file in RStudio. I’ve abstracted the data reading to a separate file (it has some lengthy factor cleaning and is used in a few different situations), and I’m loading the knitr library so I can make tables with kable().

    The next code chunk shows how the file is adapted to be used as a template for many outputs:

---
params:
  new_title: "My Title!"
title: "`r params$new_title`"
author: "Mike Spencer"
date: "11 September 2017"
output:
  html_document:
    toc: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=F, message=F, results='hide', warning=F, fig.height=9)
library(tidyverse)
library(RColorBrewer)
library(knitr)

df1 = df %>%
  filter(gender==v)
df1 = droplevels(df1)
```

## Introduction

This file provides a summary of the digital potential in the rural UK survey. In particular it shows the answers of `r v`. The summary had `r nrow(df1)` responses, of these `r sum(is.na(df1$Eastings))` either had no postcode or the postcode supplied did not match the current list from Ordnance Survey or the UK data service (Northern Irish postcodes). Those postcodes which were matchable could be related to country and their rural-urban classification.

    Pretty similar, but there are some subtle differences. We’re now passing a title parameter to our .Rmd, our data are already loaded and we subset them to df1. In reality, this second step could happen in the next file – choose your preference for readability (and whether you want to change all the df variables in your .Rmd to df1).

    Finally, we need a separate script to loop through our variable and make some reports!

    library("rmarkdown") source("~/repo/read.R") slices = unique(df$gender) for(v in slices){ render("~/repo/exploratory_template.Rmd", output_file=paste0("~/results/exploratory_", v, ".html"), params=list(new_title=paste("Exploratory analysis -", v))) }

    Note we’re explicitly loading the rmarkdown library here so we can use the render function. We’re also loading our data before our loop, to speed our code up. The object v is passed to the .Rmd file, which is what we use to subset our data.
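A hypothetical variation (not from the original post) is to pass the subsetting value to the template as a parameter as well, instead of relying on the global object v being visible inside the .Rmd. The template's YAML would then need an extra params entry (e.g. gender_value: NA) and its setup chunk would filter on params$gender_value:

# Sketch only: assumes the template declares gender_value under params
for(v in slices){
  render("~/repo/exploratory_template.Rmd",
         output_file = paste0("~/results/exploratory_", v, ".html"),
         params = list(new_title = paste("Exploratory analysis -", v),
                       gender_value = v))
}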

    If you’ve made it this far you should now have the tools to make multiple reports with a lot less effort! Beware that it becomes very easy to make more outputs than anyone could possibly read – with great power comes etc, etc..


To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow.


    Make R speak

    Fri, 08/17/2018 - 00:59

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

Ever wanted to make R talk to you? Now you can, with the mscstts package by John Muschelli. It provides an interface to the Microsoft Cognitive Services Text-to-Speech API (hence the name) in Azure, and you can use it to convert any short piece of text to a playable audio file, rendering it as speech using a number of different voices.

    Before you can generate speech yourself, you'll need a Bing Speech API key. If you don't already have an Azure account, you can generate a free 7-day API key in seconds by visiting the Try Cognitive Services page, selecting "Speech APIs", and clicking "Get API Key" for "Bing Speech":

    No credit card is needed; all you need is a Microsoft, Facebook, LinkedIn or Github account. (If you need permanent keys, you can create an Azure account here which you can use for 5000 free Speech API calls per month, and also get $200 in free credits for use with any Azure service.)

Once you have your key (you'll actually get two, but you can use either one) you will call the ms_synthesize function to convert text up to 1,000 characters or so into mp3 data, which you can then save to a file. (See lines 8-10 in the speakR.R script, below.) Then, you can play the file with any MP3 player on your system. On Windows, the easiest way to do that is to call start on the MP3 file itself, which will use your system default. (You'll need to modify line 11 of the script to work on non-Windows systems.)
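A rough sketch of those steps is below. This is not the author's speakR.R script (which is not reproduced here), and the argument names and return structure of ms_synthesize are assumptions, so check ?mscstts::ms_synthesize for your package version:

library(mscstts)
res <- ms_synthesize(script = "Hello from R!")  # assumes the API key has already been set up
out <- tempfile(fileext = ".mp3")
writeBin(res$content, out)                      # res$content: raw MP3 bytes (assumption)
shell.exec(out)                                 # Windows only: open with the default MP3 player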

    Saving the data to a file and then invoking a player got a bit tedious for me, so I created the say function in the script below to automate the process. Let's see it in action:

Note that you can choose from a number of accents and spoken languages (including British English, Canadian French, Chinese and Japanese), and the gender of the voice (though both female and male voices aren't available for all languages). You can modify volume, pitch, speaking rate, and even the pronunciation of individual words using the SSML standard. (This does mean you can't use characters recognized as SSML in your text, which is why the say function below filters out < > and / first.)

    The mscstts package is available for download now from your favorite CRAN mirror, and you can find the latest development version on Github. Many thanks to John Muschelli for putting this handy package together!


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    Relative risk ratios and odds ratios by @ellis2013nz

    Thu, 08/16/2018 - 20:30

    (This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

    This post tries to explain the difference between odds ratios and relative risk ratios; and how just a few letters in the code fitting a generalized linear model mean the difference between extracting one or the other. There are plenty of other explanations available (for example, here and here), but there is also still plenty of confusion about the differences. I wanted to illustrate the issues with a concrete but simulated example and actual code that could be used as a foundation in the wild.

    Concepts

    Odds ratios and relative risk are commonly used to contrast the prevalence of some indicator (eg disease) in different categories of population. They seem to get particular emphasis in medical and epidemiological literature but are used broadly.

    This diagram demonstrates with some simulated data the core concepts:

    • Tigers have a 1/4 (0.25) probability of being diseased, which is “1 to 3” odds of being diseased.
    • Lions have a 1/2 (0.5) probability of being diseased, which is “1 to 1” or “even” odds.
    • Bears have a 1/10 (0.1) probability of being diseased, which is “1 to 9” odds.

    I find the relative probabilities very intuitive – Tigers have 2.5 times the probability of being diseased as Bears, and Lions have 5 times the probability.

    Odds ratios – not so much. Even though I’m a statistician with a particular interest in games of chance (which form the genealogy of odds), I don’t really intuit how to interpret the Tiger to Bears odds ratio of 3 times the odds of being diseased. In fact, there’s evidence that when scientific papers report odds ratios, they are commonly misinterpreted as being risk ratios (“People of type X are twice as likely to have the disease compared to people of type Y” being a common formulation).

    When the probabilities are very close to zero, then odds ratios are similar to risk ratios. The farther they get from zero, the more they differ, and the more dangerous it is if people misinterpret an odds ratio as a risk ratio.
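To make that concrete, here is a quick numeric check (not from the original post):

odds_ratio <- function(p1, p2) (p2 / (1 - p2)) / (p1 / (1 - p1))
odds_ratio(0.01, 0.02)  # risk ratio 2, odds ratio about 2.02 - very close
odds_ratio(0.40, 0.80)  # risk ratio 2, odds ratio 6 - very different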

    The best argument for reporting odds ratios is that they are symmetric – whether you are reporting on presence or absence, the odds ratio is unaffected. In our example above, if we looked at the probability of not being diseased, the Tigers:Bears odds ratio becomes 3:1 / 9:1 which is still a 3, whereas the relative probability is (3/4) / (9/10) which is 5/6 or 0.833. So while it’s nice and intuitive to say Tigers are 2.5 times as likely as Bears to be diseased, they are 0.833 times as likely as Bears to be not-diseased. This can lead to confusion. The odds ratio is three both ways (or at least, 3 in one direction and 1/3 in the other).

In most situations where we want these statistics, I think relative risk ratios are much better, more intuitive, and easier to explain. I don’t think the asymmetry matters much in most communications, whereas the misunderstanding of odds ratios often does. Frequent misunderstanding by the target audience is a high price to pay for symmetry. And I’m not sure why symmetry is necessary anyway.

    A balanced (in my view) and non-paywalled article on the topic (which has generated many) is this one by Cummings in JAMA Pediatrics.

    Before we go onto to getting these estimates from a fitted model, here’s the R code that created that simulated data and the graphic:

library(tidyverse)
library(scales)
library(viridis)

set.seed(765)
n <- 100000

data <- data_frame(
  animal_ = sample(c("lion", "tiger", "bear"), size = n, replace = TRUE, prob = c(0.2, 0.3, 0.5))
) %>%
  mutate(diseased = case_when(
    animal_ == "lion"  ~ sample(0:1, size = n, replace = TRUE, prob = c(0.5, 0.5)),
    animal_ == "tiger" ~ sample(0:1, size = n, replace = TRUE, prob = c(0.75, 0.25)),
    animal_ == "bear"  ~ sample(0:1, size = n, replace = TRUE, prob = c(0.9, 0.1))
  ))

true_props <- data %>%
  group_by(animal_) %>%
  summarise(diseased = round(mean(diseased), 2)) %>%
  mutate(odds = c("1:9", "1:1", "1:3"),
         lab  = paste0("Probability: ", diseased, "\nOdds: ", odds))

ggplot(data, aes(fill = as.logical(diseased), x = animal_)) +
  geom_bar(position = "fill") +
  geom_text(data = true_props, aes(label = lab, y = diseased), colour = "white") +
  labs(x = "", y = "Proportion", fill = "Diseased or not:") +
  guides(fill = guide_legend(reverse = TRUE)) +
  coord_flip() +
  ggtitle("Simulated data for demonstrating odds and risk ratios",
          "Risk ratio: tiger / bear = 2.5, lion / bear = 5.0\nOdds ratio: tiger / bear = 3.0, lion / bear = 9.0")

Obtaining odds and risk ratios from a generalized linear model

    Putting that aside, how do we get these estimates from a model?

    Odds ratios

    Frankly, I suspect the more material reason for the prevalence of odds ratios is that they fall easily out of the results of a logistic regression (generalized linear model with the canonical logit link function relating the mean of the response to the linear predictor – where the logit function is the logarithm of the odds). For a categorical explanatory variable that has been represented as dummy indicator variables, e to the power of the coefficient for the dummy variable will give an estimate of the odds ratio.

    Here’s an example of extracting estimates of odds ratios from our data:

# Odds ratio:
model_2 <- glm(diseased ~ animal_, family = quasibinomial, data = data)
exp(coef(model_2))[-1]
#  animal_lion animal_tiger
#     9.286664     2.997837

These are the odds of disease for the given animal (Lion or Tiger) relative to the odds at the reference level, which in this case is Bear. So these are estimates of the odds ratios depicted in the original diagram.

    You would get the same point estimate if you used family = binomial, or family = quasi(link = 'logit', variance = "mu(1-mu)"); it’s the logit link that’s important here.

    Note that these estimates are biased; despite being based on large samples of 100,000 animals they haven’t converged on the true odds ratios of 9 and 3. The estimates of the original coefficients are unbiased; but the non-linear transformation of exp(coef(model)) inevitably introduces bias. A statistic that is unbiased on one scale will be biased if you transform it to another scale; such is life.
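That bias is a general property of applying a non-linear function to an unbiased estimator, and is easy to see in a toy simulation that has nothing to do with the model above (my own illustration):

set.seed(123)
mu    <- 2      # true value on the "link" scale
n_sim <- 10000  # number of simulated samples

# each sample mean is unbiased for mu...
sample_means <- replicate(n_sim, mean(rnorm(30, mean = mu, sd = 1)))
mean(sample_means)       # very close to 2

# ...but exp() of the sample mean is systematically larger than exp(mu)
mean(exp(sample_means))  # a little above exp(2)
exp(mu)                  # about 7.389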

    Using binomial versus quasibinomial does make a difference to confidence intervals, but in our current case it’s not material. The confidence intervals at least contain the correct values; although the point estimate is biased (and also not in the centre of the confidence interval, again due to the non-linear transformation), the confidence interval is still valid for the transformed scale:

exp(confint(model_2))[-1, ]
#                 2.5 %   97.5 %
# animal_lion  8.920358 9.669207
# animal_tiger 2.881889 3.118685

Risk ratios

At a minimum, the only change needed to get risk ratios is to change the link function that relates the mean value of the response variable to the linear predictor. For estimates of odds ratios, this is the logit (i.e., the logarithm of the odds of the mean); for estimates of relative risk ratios, it becomes the logarithm itself. We can specify this manually, or just use a built-in family for our generalized linear model for which the logarithm is the canonical link function, and hence the default. The quasipoisson family is a great choice here:

# Risk ratio:
model_1 <- glm(diseased ~ animal_, family = quasipoisson, data = data)
exp(coef(model_1))[-1]
#  animal_lion animal_tiger
#     5.112507     2.504792

    Again, these point estimates are biased because of the non-linear exp() transformation of the unbiased original coefficients. But the confidence intervals contain the correct values (which recall are 5.0 and 2.5):

exp(confint(model_1))[-1, ]
#                 2.5 %   97.5 %
# animal_lion  4.961963 5.268162
# animal_tiger 2.426520 2.585758

If we’d used poisson as our family instead of quasipoisson, the dispersion parameter would be forced to 1.0, instead of the 0.7 estimated by the quasipoisson model. The net effect of that “underdispersion” in the data, in this case, is that the confidence intervals with family = poisson are larger than in the quasipoisson instance. This won’t always apply, but it often will when we are modelling data such as this, which is in fact under-dispersed compared to a pure Poisson distribution (because none of the counts can exceed 1).
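If you want to check the dispersion and its effect on the intervals yourself, something like the sketch below should do it, reusing model_1 and data from above (the 0.7 figure is the post's estimate; your exact numbers depend on the simulation seed):

# estimated dispersion under the quasipoisson fit (it is forced to 1 under poisson)
summary(model_1)$dispersion

# refit with a plain poisson family and compare the intervals
model_1_pois <- glm(diseased ~ animal_, family = poisson, data = data)
exp(confint(model_1_pois))[-1, ]  # wider here, because the data are under-dispersed
exp(confint(model_1))[-1, ]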

    These examples have avoided complications of other explanatory variables, but the point of using a generalized linear model for this (rather than observing proportions directly) is that we can add in other variables and report on relative risks (or odds, if we must) “while controlling for …” those other variables.

The approach also nicely extends to complex surveys; we can use svyglm() from Lumley’s survey package with any of the family specifications above, according to our needs, and get good results.
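Here is a rough sketch of what the extension described in the last two paragraphs might look like. The svy_weight and weight_kg columns are invented purely for illustration (they are not in the simulated data above), so treat this as a pattern rather than a result:

library(survey)
library(dplyr)

# add hypothetical survey weights and an extra covariate to the simulated data
data_svy <- data %>%
  mutate(svy_weight = runif(n(), 0.5, 2),
         weight_kg  = rnorm(n(), mean = 100, sd = 20))

des <- svydesign(ids = ~1, weights = ~svy_weight, data = data_svy)

# relative risk ratios, "controlling for" weight_kg, under the survey design
model_svy <- svyglm(diseased ~ animal_ + weight_kg, design = des, family = quasipoisson)
exp(coef(model_svy))[2:3]   # the animal risk ratios, adjusted for the extra covariate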

    Summary

So, key points:

    • carefully choose which (or both) of odds or risk ratios to use
    • explicitly communicate what you are reporting, and try to counter the likely misunderstandings in your communication, particularly with odds ratios
    • exp(coef(model)) for a glm fit with family = quasibinomial will give odds ratios
    • exp(coef(model)) for a glm fit with family = quasipoisson will give relative risk ratios
    • remember that point estimates from this approach are biased, and like all point estimates give an undue impression of precision; always include a confidence interval or credibility interval as well.


    Great post!

    Thu, 08/16/2018 - 19:49

    (This article was first published on Stories by Matt.0 on Medium, and kindly contributed to R-bloggers)

    Great post!

I wanted to mention that although many previous studies have used the area under the receiver operating characteristic curve (auROC) statistic as a benchmark, it misleads evaluators when the test data are (highly) imbalanced (see: PLOS One, 10(3):e0118432, 2015 & bioRxiv, 2017, doi: 10.1101/142760).

In my field (bioinformatics), ENCODE’s DREAM Challenge recommends reporting both the auROC, which assesses false negative predictions, and the area under the precision-recall curve (auPR), which also assesses false positives (see: ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge, https://synapse.org/encode, 2017. Accessed: 2018-01-31).



    Updates to the sergeant (Apache Drill connector) Package & a look at Apache Drill 1.14.0 release

    Thu, 08/16/2018 - 18:49

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Apache Drill 1.14.0 was recently released, bringing with it many new features and a temporary incompatibility with the current rev of the MapR ODBC drivers. The Drill community expects new ODBC drivers to arrive shortly. The sergeant package is an alternative to ODBC for R users, as it provides a dplyr interface to the REST API, along with a JDBC interface and functions to work directly with the REST API in a more programmatic fashion.

    First-class dplyr-citizen support for the sergeant JDBC interface

I’ve been primarily using the ODBC interface for a while now, since it’s dead simple to use with dplyr (as has been noted in my still-unfinished, short cookbook on wiring up Drill and R). The ODBC incompatibility is pretty severe since it’s at the protobuf level, but sergeant::src_drill() is an easy swap-out and does not have any issues since it works against the REST API. Unfortunately, the query endpoint of the REST API mangles the field order when it returns query results. This really isn’t too painful since it’s easy to add in a select() call after gathering query results to reorder things. However, it’s painful enough that it facilitated rounding out some of the corners of the JDBC interface.
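For anyone who hasn't tried the REST route, it looks roughly like the sketch below. The host name is a placeholder, I'm using the cp.`employee.json` sample that ships with Drill, and the explicit select() is there because of the field-order mangling just mentioned:

library(sergeant)
library(dplyr)

# REST API connection; no ODBC/JDBC drivers required
db <- src_drill("drill.example.com")

# the REST query endpoint may not preserve field order, so select() explicitly
tbl(db, "cp.`employee.json`") %>%
  select(employee_id, full_name, position_title) %>%
  collect()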

sergeant::drill_jdbc() now returns an object to which just enough dplyr classes have been added to enable smooth operation with the tbl() function (without breaking all your other RJDBC usage in the same session). The next blog section will use the new JDBC interface with dplyr as it introduces one of Drill’s new features.

    Query Image Metadata with Apache Drill 1.14.0

There are quite a few R packages for reading image/media metadata. Since that seems to be en vogue, R folks might be interested in Drill’s new image metadata format plugin. Just point Drill at a directory of files and you can use a familiar dplyr interface to get the deets on your family photo inventory.

    You first need to follow the directions at the aforelinked resource and add the following format to the formats: section.

    formats: { "image": { "type": "image", "extensions": [ "jpg", "jpeg", "jpe", "tif", "tiff", "dng", "psd", "png", "bmp", "gif", "ico", "pcx", "wav", "wave", "avi", "webp", "mov", "mp4", "m4a", "m4p", "m4b", "m4r", "m4v", "3gp", "3g2", "eps", "epsf", "epsi", "ai", "arw", "crw", "cr2", "nef", "orf", "raf", "rw2", "rwl", "srw", "x3f" ], "fileSystemMetadata": true, "descriptive": true, "timeZone": null } }

    Note that the configuration snippet on Drill’s site (as of the date-stamp on this post) did not have a , after the ] for the extensions array, so copy this one instead.

    I created a media workspace and set the defaultInputFormat to image. Here’s a naive first look at what you can get back from a simple query to a jpg directory under it (using the new JDBC interface and dplyr):

library(sergeant)
library(tidyverse)

(con <- drill_jdbc("bigd:2181"))
##

tbl(con, "dfs.media.`/jpg/*`") %>%
  glimpse()
## Observations: ??
## Variables: 28
## $ FileSize        "4412686 bytes", "4737696 bytes", "4253912 byt...
## $ FileDateTime    "Thu Aug 16 03:04:16 -04:00 2018", "Thu Aug 16...
## $ Format          "JPEG", "JPEG", "JPEG", "JPEG", "JPEG", "JPEG"...
## $ PixelWidth      "4032", "4032", "4032", "4032", "4032", "4032"...
## $ PixelHeight     "3024", "3024", "3024", "3024", "3024", "3024"...
## $ BitsPerPixel    "24", "24", "24", "24", "24", "24", "24", "24"...
## $ DPIWidth        "72", "72", "72", "72", "72", "72", "72", "72"...
## $ DPIHeight       "72", "72", "72", "72", "72", "72", "72", "72"...
## $ Orientaion      "Unknown (0)", "Unknown (0)", "Unknown (0)", "...
## $ ColorMode       "RGB", "RGB", "RGB", "RGB", "RGB", "RGB", "RGB...
## $ HasAlpha        "false", "false", "false", "false", "false", "...
## $ Duration        "00:00:00", "00:00:00", "00:00:00", "00:00:00"...
## $ VideoCodec      "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ FrameRate       "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioCodec      "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ AudioSampleSize "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioSampleRate "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ JPEG            "{\"CompressionType\":\"Baseline\",\"DataPreci...
## $ JFIF            "{\"Version\":\"1.1\",\"ResolutionUnits\":\"no...
## $ ExifIFD0        "{\"Make\":\"Apple\",\"Model\":\"iPhone 7 Plus...
## $ ExifSubIFD      "{\"ExposureTime\":\"1/2227 sec\",\"FNumber\":...
## $ AppleMakernote  "{\"UnknownTag(0x0001)\":\"5\",\"UnknownTag(0x...
## $ GPS             "{\"GPSLatitudeRef\":\"N\",\"GPSLatitude\":\"4...
## $ XMP             "{\"XMPValueCount\":\"4\",\"Photoshop\":{\"Dat...
## $ Photoshop       "{\"CaptionDigest\":\"48 89 11 77 33 105 192 3...
## $ IPTC            "{\"CodedCharacterSet\":\"UTF-8\",\"Applicatio...
## $ Huffman         "{\"NumberOfTables\":\"4 Huffman tables\"}", "...
## $ FileType        "{\"DetectedFileTypeName\":\"JPEG\",\"Detected...

That’s quite a bit of metadata, but the Drill format plugin page kinda fibs a bit about column types since we see many chrs there. You may be quick to question the sergeant package, but this isn’t using the REST interface, and we can use DBI calls to ask Drill what it’s sending us:

    dbSendQuery(con, "SELECT * FROM dfs.media.`/jpg/*`") %>% dbColumnInfo() ## field.name field.type data.type name ## 1 FileSize CHARACTER VARYING character FileSize ## 2 FileDateTime CHARACTER VARYING character FileDateTime ## 3 Format CHARACTER VARYING character Format ## 4 PixelWidth CHARACTER VARYING character PixelWidth ## 5 PixelHeight CHARACTER VARYING character PixelHeight ## 6 BitsPerPixel CHARACTER VARYING character BitsPerPixel ## 7 DPIWidth CHARACTER VARYING character DPIWidth ## 8 DPIHeight CHARACTER VARYING character DPIHeight ## 9 Orientaion CHARACTER VARYING character Orientaion ## 10 ColorMode CHARACTER VARYING character ColorMode ## 11 HasAlpha CHARACTER VARYING character HasAlpha ## 12 Duration CHARACTER VARYING character Duration ## 13 VideoCodec CHARACTER VARYING character VideoCodec ## 14 FrameRate CHARACTER VARYING character FrameRate ## 15 AudioCodec CHARACTER VARYING character AudioCodec ## 16 AudioSampleSize CHARACTER VARYING character AudioSampleSize ## 17 AudioSampleRate CHARACTER VARYING character AudioSampleRate ## 18 JPEG MAP character JPEG ## 19 JFIF MAP character JFIF ## 20 ExifIFD0 MAP character ExifIFD0 ## 21 ExifSubIFD MAP character ExifSubIFD ## 22 AppleMakernote MAP character AppleMakernote ## 23 GPS MAP character GPS ## 24 XMP MAP character XMP ## 25 Photoshop MAP character Photoshop ## 26 IPTC MAP character IPTC ## 27 Huffman MAP character Huffman ## 28 FileType MAP character FileType

    We can still work with the results, but there’s also a pretty key element missing: the media filename. The reason it’s not in the listing is that filename is an implicit column that we have to ask for. So, we need to modify our query to be something like this:

    tbl(con, sql("SELECT filename AS fname, * FROM dfs.media.`/jpg/*`")) %>% glimpse() ## Observations: ?? ## Variables: 29 ## $ fname "IMG_0778.jpg", "IMG_0802.jpg", "IMG_0793.jpg"... ## $ FileSize "4412686 bytes", "4737696 bytes", "4253912 byt... ## $ FileDateTime "Thu Aug 16 03:04:16 -04:00 2018", "Thu Aug 16... ## $ Format "JPEG", "JPEG", "JPEG", "JPEG", "JPEG", "JPEG"... ## $ PixelWidth "4032", "4032", "4032", "4032", "4032", "4032"... ## $ PixelHeight "3024", "3024", "3024", "3024", "3024", "3024"... ## $ BitsPerPixel "24", "24", "24", "24", "24", "24", "24", "24"... ## $ DPIWidth "72", "72", "72", "72", "72", "72", "72", "72"... ## $ DPIHeight "72", "72", "72", "72", "72", "72", "72", "72"... ## $ Orientaion "Unknown (0)", "Unknown (0)", "Unknown (0)", "... ## $ ColorMode "RGB", "RGB", "RGB", "RGB", "RGB", "RGB", "RGB... ## $ HasAlpha "false", "false", "false", "false", "false", "... ## $ Duration "00:00:00", "00:00:00", "00:00:00", "00:00:00"... ## $ VideoCodec "Unknown", "Unknown", "Unknown", "Unknown", "U... ## $ FrameRate "0", "0", "0", "0", "0", "0", "0", "0", "0", "... ## $ AudioCodec "Unknown", "Unknown", "Unknown", "Unknown", "U... ## $ AudioSampleSize "0", "0", "0", "0", "0", "0", "0", "0", "0", "... ## $ AudioSampleRate "0", "0", "0", "0", "0", "0", "0", "0", "0", "... ## $ JPEG "{\"CompressionType\":\"Baseline\",\"DataPreci... ## $ JFIF "{\"Version\":\"1.1\",\"ResolutionUnits\":\"no... ## $ ExifIFD0 "{\"Make\":\"Apple\",\"Model\":\"iPhone 7 Plus... ## $ ExifSubIFD "{\"ExposureTime\":\"1/2227 sec\",\"FNumber\":... ## $ AppleMakernote "{\"UnknownTag(0x0001)\":\"5\",\"UnknownTag(0x... ## $ GPS "{\"GPSLatitudeRef\":\"N\",\"GPSLatitude\":\"4... ## $ XMP "{\"XMPValueCount\":\"4\",\"Photoshop\":{\"Dat... ## $ Photoshop "{\"CaptionDigest\":\"48 89 11 77 33 105 192 3... ## $ IPTC "{\"CodedCharacterSet\":\"UTF-8\",\"Applicatio... ## $ Huffman "{\"NumberOfTables\":\"4 Huffman tables\"}", "... ## $ FileType "{\"DetectedFileTypeName\":\"JPEG\",\"Detected...

    We could work with the “map” columns with Drill’s SQL, but this is just metadata and even if there are many files, most R folks have sufficient system memory these days to collect it all and work with it locally. There’s nothing stopping you from working on the SQL side, though, and it may be a better choice if you’ll be using this to process huge archives. But, we’ll do this in R and convert a bunch of field types along the way:

from_map <- function(x) { map(x, jsonlite::fromJSON) }

tbl(con, sql("SELECT filename AS fname, * FROM dfs.media.`/jpg/*`")) %>%
  collect() %>%
  mutate_at(
    .vars = vars(
      JPEG, JFIF, ExifSubIFD, AppleMakernote, GPS, XMP,
      Photoshop, IPTC, Huffman, FileType
    ),
    .funs = funs(from_map)
  ) %>%
  mutate_at(
    .vars = vars(
      PixelWidth, PixelHeight, DPIWidth, DPIHeight,
      FrameRate, AudioSampleSize, AudioSampleRate
    ),
    .funs = funs(as.numeric)
  ) %>%
  glimpse() -> media_df
## Observations: 11
## Variables: 29
## $ fname           "IMG_0778.jpg", "IMG_0802.jpg", "IMG_0793.jpg"...
## $ FileSize        "4412686 bytes", "4737696 bytes", "4253912 byt...
## $ FileDateTime    "Thu Aug 16 03:04:16 -04:00 2018", "Thu Aug 16...
## $ Format          "JPEG", "JPEG", "JPEG", "JPEG", "JPEG", "JPEG"...
## $ PixelWidth      4032, 4032, 4032, 4032, 4032, 4032, 3024, 4032...
## $ PixelHeight     3024, 3024, 3024, 3024, 3024, 3024, 4032, 3024...
## $ BitsPerPixel    "24", "24", "24", "24", "24", "24", "24", "24"...
## $ DPIWidth        72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72
## $ DPIHeight       72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72
## $ Orientaion      "Unknown (0)", "Unknown (0)", "Unknown (0)", "...
## $ ColorMode       "RGB", "RGB", "RGB", "RGB", "RGB", "RGB", "RGB...
## $ HasAlpha        "false", "false", "false", "false", "false", "...
## $ Duration        "00:00:00", "00:00:00", "00:00:00", "00:00:00"...
## $ VideoCodec      "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ FrameRate       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ AudioCodec      "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ AudioSampleSize 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ AudioSampleRate 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ JPEG            [["Baseline", "8 bits", "3024 pixels", "4032 ...
## $ JFIF            [["1.1", "none", "72 dots", "72 dots", "0", "...
## $ ExifIFD0        "{\"Make\":\"Apple\",\"Model\":\"iPhone 7 Plus...
## $ ExifSubIFD      [["1/2227 sec", "f/1.8", "Program normal", "2...
## $ AppleMakernote  [["5", "[558 values]", "[104 values]", "1", "...
## $ GPS             [["N", "44° 19' 6.3\"", "W", "-68° 11' 22.39\...
## $ XMP             [["4", ["2017-06-22T14:28:04"], ["2017-06-22T...
## $ Photoshop       [["48 89 11 77 33 105 192 33 170 252 63 34 43...
## $ IPTC            [["UTF-8", "2", "14:28:04", "2017:06:22", "20...
## $ Huffman         [["4 Huffman tables"], ["4 Huffman tables"], ...
## $ FileType        [["JPEG", "Joint Photographic Experts Group",...

    Now, we can do anything with the data, including getting the average file size:

mutate(media_df, FileSize = str_replace(FileSize, " bytes", "") %>% as.numeric()) %>%
  summarise(mean(FileSize))
## # A tibble: 1 x 1
##   `mean(FileSize)`
##              <dbl>
## 1         3878963.

FIN

The enhancements to the JDBC interface have only been given a light workout but seem to be doing well so far. Kick the tyres and file an issue if you have any problems. ODBC users should not have to wait long for new drivers, and src_drill() aficionados can keep chugging along as before without issue.

    For those new to Apache Drill, there’s now an official Docker image for it so you can get up and running without adding too much cruft to your local systems. I may add support for spinning up and managing a Drill container to the sergeant package, so keep your eyes on pushes to the repo.

Also keep an eye on the mini-cookbook, as I’ll be modifying it to account for the new package changes and to introduce additional, new Drill features.



    Slides from R/Pharma

    Thu, 08/16/2018 - 16:18

    (This article was first published on Blog - Applied Predictive Modeling, and kindly contributed to R-bloggers)

My slides from the R/Pharma conference talk on “Modeling in the Tidyverse” are available in PDF format as well as an HTML version.

    (Joe Cheng just killed it in his shiny presentation – see this repo)



    Bio7 2.9 Released

    Thu, 08/16/2018 - 13:58

    (This article was first published on R – Bio7 Website, and kindly contributed to R-bloggers)

    16.08.2018

    A new release of Bio7 is available. The new Bio7 2.9 release comes with a plethora of new R features and bugfixes.

    Screenshot MacOSX with dark theme and opened ‘R ImageJ Analysis’ perspective

    Screenshot ‘R’ perspective Windows and opened R-Shell code completion

    Release Notes:

    General:

    • Based on Eclipse 4.8
    • Improved the dark theme and the layout of the dark theme in many places
    • All editor font colors are now changed automatically to default optimized colors when using the dark theme
    • New ‘Switch Theme’ action available to easily switch to the dark theme or back to the default theme
    • Java updated to 1.8.181
    • New Fullscreen/Hide Menu action available (Main menu “Window” – Key: Ctrl+Shift+4)
    • New Hide and Show main menu actions available (useful to hide added plugin menus again)
    • Added pdf reader option “Okular” to open a R pdf plot with the okular application (used when the SWT browser + open external preference is selected)
    • Added a new Table API method to transfer a matrix array from Java more efficiently
    • Improved the WorldWind search function (using the photon geocoding API)
    • Improved the default font size for Swing components on Retina displays (e.g., ImageJ components)

    R:

    • Updated R for Windows to version 3.5.1
    • Added a new plot option to plot huge plots in a temporary folder and open them in ImageJ virtually (disk resident). The folder will automatically be opened after plotting (plot images have to be deleted manually!)

    ImageJ

    • ImageJ Updated to 1.52f18
    • ImageJ toolbar improved for the dark theme (see screenshots)
    • Added a new ImageJ detach image menu in the ‘Window’ menu of the ImageJ-Canvas view

    R Markdown:

    • Added syntax coloring of R markdown snippets
    • Added new context menu actions in menu “Text” (Find and replace, Toggle Block Selection, To Upper Case, To Lower Case)
    • Added a new context menu action to toggle word wrap, etc.

    R editor:

    • Added a new warning and quickfix for possibly wrong comparisons of NULL, NA, NaN

    • Added a new option in the preferences for the new warning
    • Added new context menu actions in menu “Text” (Find and replace, Toggle Block Selection, To Upper Case, To Lower Case)
    • Improved the display for quickfix suggestion, warning and errors when the font is resized dynamically (especially for MacOSX)
    • The help browser for code completion popups now displays help sites in dark mode if the Bio7 dark theme is enabled.

    ImageJ Macro Editor

    • Added a general rename method to rename selected word occurrences
    • Added new context menu actions in menu “Text” (Find and replace, Toggle Block Selection, To Upper Case, To Lower Case)
    • Added latest macro code completion function definitions
Download and Installation:

Windows:

    Just download the *.zip distribution file from https://bio7.org and unzip it in your preferred location. Bio7 comes bundled with a Java Runtime Environment, R and Rserve distribution and works out of the box.

    Linux:

    Download and extract the installation file from https://bio7.org.
    For Linux you have to install R and Rserve (see Rserve installation below!).

    MacOSX:

    Download and extract the installation file from https://bio7.org.

If you start Bio7, a warning or error can occur because of changes in how Apple treats signatures! To allow Bio7 to start, see these instructions for Yosemite and Sierra:

    OS X Yosemite: Open an app from an unidentified developer

    macOS Sierra: Open an app from an unidentified developer

If you still have problems with Sierra, see this solution!

    In addition for MacOSX you have to install R and Rserve (see below!).

    Linux and MacOSX Rserve (compiled for cooperative mode) installation:

To install Rserve, open the R shell and then execute the menu action “Options -> Install Rserve (coop. mode) for R …” for different R versions. This will download and install Rserve in your default R library location; see the video below (please make sure that your default Linux R library install location has write permissions!). In cooperative mode only one connection at a time is allowed (which is what we want for this desktop application) and all subsequent connections share the same namespace (the default on Windows)!
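For reference, a plain CRAN install and start of Rserve from a regular R session looks like the sketch below. Note that this generic build is not the cooperative-mode build that the Bio7 menu action installs, so Bio7 users should prefer the menu action described above:

# generic Rserve install and start (not the coop-mode build Bio7 recommends)
install.packages("Rserve")
library(Rserve)
Rserve(args = "--no-save")  # starts the Rserve daemon for this R installation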

    " data-embed-play="Play VideoThis video will be embedded from Youtube. The apply.">

    Play Video

    This video will be embedded from Youtube. The
    privacy policies of google apply. Installation of Useful R Packages

    R packages which are useful in combination with Bio7 can easily be installed with the R-Shell context menu “Packages” action:

    R-Shell context menu->Packages->Install_Default_R_Packages

    Bio7 Documentation

For more information about Bio7, please consult the soon-to-be-updated Bio7 User Guide.

    A plethora of Bio7 videotutorials for an introduction can be found on YouTube.



    Linear programming in R

    Thu, 08/16/2018 - 12:22

    (This article was first published on The Beginner Programmer, and kindly contributed to R-bloggers)

    Linear programming is a technique to solve optimization problems whose constraints and outcome are represented by linear relationships.

    Simply put, linear programming allows to solve problems of the following kind:

    • Maximize/minimize $\hat C^T \hat X$
    • Under the constraint $\hat A \hat X \leq \hat B$
    • And the constraint $\hat X \geq 0$

This doesn’t seem like much at a glance, but in practice it is a powerful tool that can be used to make decisions in real-life scenarios.

It is often the case that we need to make decisions based on constraints. Often the invisible and harshest constraint is time, but generally speaking there are a lot of other constraints that we need to take into account. A simple set of examples would be:

    • I want to change my heating system. I want to minimize the cost of the system and the bills, what kind of heating system should I install? A pellet stove? Electric radiators?
    • I want to obtain the maximum profit from the sale of these two products I produce. In what quantities should I produce each product?
    • And so on…

Linear programming can help you with these kinds of decisions, where:

    1. The function you are trying to optimize is a linear combination of the decision variables (this might not always be the case).
    2. The constraints you have are a linear combination of the decision variables.

    An example of linear optimization

    I’m going to implement in R an example of linear optimization that I found in the book “Modeling and Solving Linear Programming with R” by Jose M. Sallan, Oriol Lordan and Vincenc Fernandez.  The example is named “Production of two models of chairs” and can be found at page 57, section 3.5. I’m going to solve only the first point.

    The problem text is the following

    A company produces two models of chairs: 4P and 3P. The model 4P needs 4 legs, 1 seat and 1 back. On the other hand, the model 3P needs 3 legs and 1 seat. The company has a initial stock of 200 legs, 500 seats and 100 backs. If the company needs more legs, seats and backs, it can buy standard wood blocks, whose cost is 80 euro per block. The company can produce 10 seats, 20 legs and 2 backs from a standard wood block. The cost of producing the model 4P is 30 euro/chair, meanwhile the cost of the model 3P is 40 euro/chair. Finally, the company informs that the minimum number of chairs to produce is 1000 units per month. Define a linear programming model, which minimizes the total cost (the production costs of the two chairs, plus the buying of new wood blocks).

    Problem definition

    First, we need to translate the problem in a mathematical way. Let’s define the following variables

    • $x_{4p}$ is the number of 4P chairs to be produced.
    • $x_{3p}$ is the number of 3P chairs to be produced.
    • $x_w$ is the number of wood blocks to be bought.

    Now we can define $\hat X = \begin{pmatrix} x_{4p} \\ x_{3p}  \\  x_w \end{pmatrix}$ as the decision variable vector. Note that it must be $\hat X \geq 0$.

    We would like to minimize the total cost so we must set our objective function as follows

    $$cost(x_{4p}, x_{3p}, x_w) = 30 x_{4p} + 40 x_{3p} + 80 x_w = MIN(cost) $$

    which means that $\hat C = \begin{pmatrix} 30 \\ 40  \\  80 \end{pmatrix}$.

    The constraints can be set in the following way

    1. For the seats $$ x_{4p} + x_{3p} \leq 500 + 10 x_w $$
    2. For the legs $$ 4 x_{4p} + 3 x_{3p} \leq 200 + 20 x_w $$
    3. For the backs $$ x_{4p} \leq 100 + 2 x_w $$
    4. Minimum number of chairs produced $$ x_{4p} + x_{3p} \geq 1000 $$

We can now define the coefficient matrix $A = \begin{pmatrix} 1 & 1 & -10 \\ 4 & 3 & -20 \\ 1 & 0 & -2 \\ -1 & -1 & 0 \end{pmatrix}$ and $B = \begin{pmatrix} 500 \\ 200 \\ 100 \\ -1000 \end{pmatrix}$.

    R implementation and solution

We are now ready to implement this in R. There are many different packages which can solve this kind of problem, but my favourites are lpSolve and lpSolveAPI, which is a kind of API built on top of lpSolve. I will use both.

A nice feature of the lpSolve package is that you can specify the direction of the constraints. Indeed, in our case the last constraint (the minimum number of chairs produced) does not fit in with the mathematical definition of the problem that we gave in the previous paragraph. Here we can either change the signs (and therefore the inequality direction) or specify the inequality direction in lpSolve. I’ll do it the second way.

    We can set that all the variables should be integers by setting the argument “all.int=TRUE” in the lpSolve::lp function. Of course, lpSolve can work with both integers and real numbers.

It’s also particularly important to check the status at the end of the execution: if it is 0, then a solution has been found, but if it is 2 then there is no feasible solution.
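As a self-contained sketch of the lpSolve route described above (my own reconstruction from the problem definition, not the author's exact code; the last constraint is entered with ">=" directly rather than with the flipped signs used in the matrix A), it might look like this:

library(lpSolve)

# objective: minimise 30*x4p + 40*x3p + 80*xw
obj <- c(30, 40, 80)

# constraints (seats, legs, backs, minimum number of chairs)
A   <- matrix(c(1, 1, -10,
                4, 3, -20,
                1, 0,  -2,
                1, 1,   0), nrow = 4, byrow = TRUE)
dir <- c("<=", "<=", "<=", ">=")
b   <- c(500, 200, 100, 1000)

sol <- lp(direction = "min", objective.in = obj,
          const.mat = A, const.dir = dir, const.rhs = b,
          all.int = TRUE)

sol$status    # 0 = a solution was found, 2 = no feasible solution
sol$solution  # x4p, x3p, xw
sol$objval    # minimum total cost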



    Remaking ‘Luminance-gradient-dependent lightness illusion’ with R

    Thu, 08/16/2018 - 11:17

    (This article was first published on (en) The R Task Force, and kindly contributed to R-bloggers)

    A blogpost inspired by a tweet and a YouTube video.

    ‘Luminance-gradient-dependent lightness illusion’

In the last few days, I stumbled upon this tweet:

    A demo of lightness perception pic.twitter.com/BSVpgcuIw1

    — Akiyoshi Kitaoka (@AkiyoshiKitaoka) August 12, 2018

It is a demonstration of how our perception of a color is affected by the lightness surrounding that color.

Also recently, I’ve been watching the useR 2018 YouTube recordings, which contain a video called The Grammar of Animation, presenting {gganimate}, a package by Thomas Lin Pedersen extending the {ggplot2} grammar of graphics to include animation.

    If you want to know more about {gganimate}, feel free to browse the GitHub repo, and watch the YouTube video.

     

    Back to our lightness perception: as you can see, the illusion is made by hand, with a piece of paper, moving the square manually. Let’s recreate this with gganimate.

The R Code

Creating the data.frame

    So, we’ll need three things:

    • A variable stating all the transition states
    • A variable for y min and max, which will stay the same
    • A variable for x min and max, which will be incremented by one at each transition state.
d <- data.frame(
  # x coordinates
  x1 = 1:10,
  x2 = 2:11,
  # y
  y1 = 4,
  y2 = 5,
  # transition time
  t = 1:10
)

Getting the background

    The background is basically a gradient from #000000 to #FFFFFF (read this post for more about hex color notation). Let’s create that object:

library(grid)

g <- rasterGrob(
  # Creating the color gradient
  t(colorRampPalette(c("#000000", "#FFFFFF"))(1000)),
  # Scaling it to fit the graph
  width  = unit(1, "npc"),
  height = unit(1, "npc")
)

Creating the ggplot

    I’ll first create the ggplot object, which is composed of 10 squares, filled with the same grey: "#7E7E7E". I use the theme_nothing() from {ggmap} as an empty theme.

library(ggplot2)

gg <- ggplot() +
  annotation_custom(g, -Inf, Inf, -Inf, Inf) +
  geom_rect(data = d, mapping = aes(xmin = x1, xmax = x2, ymin = y1, ymax = y2),
            color = "black", fill = "#7E7E7E") +
  ylim(c(1, 8)) +
  ggmap::theme_nothing()
gg

Animation is illusion

    Let’s now animate our plot to create the illusion. As I want the move to be linear, I’ll use a combination of transition_time and ease_aes('linear') to make the transition smooth.

library(gganimate)

gg_animated <- gg +
  transition_time(t) +
  ease_aes('linear')

    And tadaa !

    gg_animated

    On this animation, the square appears to be changing color. But it’s not: the fill is always "#7E7E7E".

What’s behind

Luminance and perception of color

    “Every light is a shade, compared to the higher lights, till you come to the sun; and every shade is a light, compared to the deeper shades, till you come to the night.” —John Ruskin, 1879.

    OK, let’s forget R to focus on what is happening here, and quickly talk about perception of luminance (the level of light coming to your eye) and color.

We are here facing a phenomenon known as a “gradient illusion”. The important idea behind this is that every color we perceive is influenced by its surroundings: in other words, we perceive the color as lighter on the left of our image, as it is contrasted with black. The more the square moves toward the white, the darker the grey appears.

How does it work? When a color comes to your eye, you perceive a certain amount of luminance. In other words, you are able to tell if something is dark or light or somewhere in the middle. Our ability to do so is called “lightness constancy”, but these gradient illusions show us one thing: this ability is not perfect, and the “luminance environment” in which the color appears influences how we see it.

    A phenomenon to know

So, as we’ve just seen, perception of color is influenced by its surroundings. When it comes to creating a dataviz, color scales are crucial – even more so now that we know that. Let’s imagine that for some weird reason we have created this plot:

ggplot() +
  annotation_custom(g, -Inf, Inf, -Inf, Inf) +
  geom_col(aes(x = 1, y = 3), fill = "#7E7E7E") +
  geom_col(aes(x = 8, y = 4), fill = "#7E7E7E")

Again, the fill is the same, but one bar seems to be darker than the other, which can trick the reader into thinking the values in the two bars are not the same. Something to consider if there is a chance your plot will be turned to black and white.

Let’s say we are drawing a map. Maps are composed of regions, and can be colored following a specific scale. But there’s a chance that two regions with the very same value on that scale will be surrounded by two opposite colors. For example, two #7E7E7E regions could be surrounded, one by #515151, the other by #aeaeae.

ggplot() +
  geom_rect(aes(xmin = 1, xmax = 4, ymin = 1, ymax = 4), fill = "#515151") +
  geom_rect(aes(xmin = 2, xmax = 3, ymin = 2, ymax = 3), fill = "#7E7E7E") +
  geom_rect(aes(xmin = 4, xmax = 7, ymin = 1, ymax = 4), fill = "#aeaeae") +
  geom_rect(aes(xmin = 5, xmax = 6, ymin = 2, ymax = 3), fill = "#7E7E7E")

    From Lightness Perception and Lightness Illusions

    What to do now?

    • Now that you know this phenomenon, pay attention to it when you create plots
    • Be careful when choosing palettes
    • Try to turn your graph to black and white with colorspace::desaturate
with_palette <- function(palette) {
  x <- y <- seq(-8 * pi, 8 * pi, len = 40)
  r <- sqrt(outer(x^2, y^2, "+"))
  filled.contour(cos(r^2) * exp(-r / (2 * pi)),
    axes = FALSE,
    color.palette = palette,
    asp = 1
  )
}

with_palette(
  purrr:::compose(
    colorspace::desaturate,
    viridis::viridis
  )
)

    From ggplot2 – Welcome viridis !

Further reading:
    • “Lightness Perception and Lightness Illusions”, Edward H. Adelson, 2000 link
    • “Lightness models, gradient illusions, and curl”, Lawrence E. Arend and Robert Goldstein, 1987 link

    The post Remaking ‘Luminance-gradient-dependent lightness illusion’ with R appeared first on (en) The R Task Force.



    How to capitalize on a priori contrasts in linear (mixed) models: A tutorial

    Thu, 08/16/2018 - 08:25

    (This article was first published on Shravan Vasishth's Slog (Statistics blog), and kindly contributed to R-bloggers)

We wrote a short tutorial on contrast coding, covering the common contrast coding scenarios, among them: treatment, Helmert, ANOVA, sum, and sliding (successive differences) contrasts. The target audience is psychologists and linguists, but really it is for anyone doing planned experiments.
     
    The paper has not been submitted anywhere yet. We are keen to get user feedback before we do that. Comments and criticism very welcome. Please post comments on this blog, or email me.
     
    Abstract:

    Factorial experiments in research on memory, language, and in other areas are often analyzed using analysis of variance (ANOVA). However, for experimental factors with more than two levels, the ANOVA omnibus F-test is not informative about the source of a main effect or interaction. This is unfortunate as researchers typically have specific hypotheses about which condition means differ from each other. A priori contrasts (i.e., comparisons planned before the sample means are known) between specific conditions or combinations of conditions are the appropriate way to represent such hypotheses in the statistical model. Many researchers have pointed out that contrasts should be “tested instead of, rather than as a supplement to, the ordinary `omnibus’ F test” (Hayes, 1973, p. 601). In this tutorial, we explain the mathematics underlying different kinds of contrasts (i.e., treatment, sum, repeated, Helmert, and polynomial contrasts), discuss their properties, and demonstrate how they are applied in the R System for Statistical Computing (R Core Team, 2018). In this context, we explain the generalized inverse which is needed to compute the weight coefficients for contrasts that test hypotheses that are not covered by the default set of contrasts. A detailed understanding of contrast coding is crucial for successful and correct specification in linear models (including linear mixed models). Contrasts defined a priori yield far more precise confirmatory tests of experimental hypotheses than standard omnibus F-test.
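As a tiny, generic taste of what the tutorial covers (this is stock base-R behaviour, not code from the paper), the built-in contrast generators and the generalized inverse mentioned in the abstract look like this:

# built-in contrast matrices for a four-level factor
contr.treatment(4)  # treatment (dummy) coding, R's default for unordered factors
contr.sum(4)        # sum coding
contr.helmert(4)    # Helmert coding
contr.poly(4)       # polynomial coding, the default for ordered factors

# the generalized inverse referred to in the abstract is available as MASS::ginv()
MASS::ginv(contr.helmert(4))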

     Full paper: https://arxiv.org/abs/1807.10451



    Dealing with The Problem of Multicollinearity in R

    Thu, 08/16/2018 - 07:25

    (This article was first published on R-posts.com, and kindly contributed to R-bloggers)

    Imagine a situation where you are asked to predict the tourism revenue for a country, let’s say India. In this case, your output or dependent or response variable will be total revenue earned (in USD) in a given year. But, what about independent or predictor variables?

    You have been provided with two sets of predictor variables and you have to choose one of the sets to predict your output. The first set consists of three variables:

    • X1 = Total number of tourists visiting the country
    • X2 = Government spending on tourism marketing
    • X3 = a*X1 + b*X2 + c, where a, b and c are some constants

    The second set also consists of three variables:

    • X1 = Total number of tourists visiting the country
    • X2 = Government spending on tourism marketing
    • X3 = Average currency exchange rate

    Which of the two sets do you think provides us more information in predicting our output?

I am sure you will agree with me that the second set provides more information for predicting the output, because the second set has three variables which are different from each other and each variable provides different information (we can infer this intuitively at this point). Moreover, none of the three variables is directly derived from the other variables in the system. Alternatively, we can also say that none of the variables is a linear combination of the other variables in the system.

In the first set of variables, only two variables provide us with relevant information, while the third variable is nothing but a linear combination of the other two. If we were to develop a model directly without including this variable, our model would have considered this combination and estimated the coefficients accordingly.

Now, this effect in the first set of variables is called multicollinearity: variables in the first set are strongly correlated with each other (if not all, then at least some of them). A model developed using the first set of variables may not provide results as accurate as the second one, because we are missing out on relevant variables/information in the first set. Therefore, it becomes important to study multicollinearity and the techniques to detect and tackle its effect in regression models.

    According to Wikipedia, “Collinearity is a linear association between two explanatory variables. Two variables are perfectly collinear if there is an exact linear relationship between them. For example, X1 and X2 are perfectly collinear if there exist parameters λ0 and λ1 such that, for all observations i, we have

    X2i = λ0 + λ1 * X1i

    Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.”

    We saw an example of exactly what the Wikipedia definition is describing.

Perfect multicollinearity occurs when one independent variable is an exact linear combination of other variables. For example, you already have X and Y as independent variables and you add another variable, Z = a*X + b*Y, to the set of independent variables. Now, this new variable, Z, does not add any significant or different value beyond what X and Y already provide; the model simply absorbs this combination while determining the coefficients.
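A quick way to see what R does when faced with a perfectly collinear predictor like Z (a toy example added here, separate from the tourism illustration): lm() simply returns NA for the redundant coefficient.

set.seed(42)
x <- rnorm(100)
y <- rnorm(100)
z <- 2 * x + 3 * y                 # exact linear combination of x and y
outcome <- 5 * x - y + rnorm(100)

# lm() detects the perfect collinearity and drops z (its coefficient is NA)
coef(lm(outcome ~ x + y + z))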

Multicollinearity may arise from several factors. Inclusion or incorrect use of dummy variables in the system may lead to multicollinearity. Another reason could be the usage of derived variables, i.e., one variable computed from other variables in the system; this is similar to the example we took at the beginning of the article. Yet another reason could be using variables which are similar in nature, which provide similar information, or which have a very high correlation with each other.

Multicollinearity may not pose a problem at an overall level, but it strongly impacts the individual variables and their predictive power. You may not be able to identify which variables in your model are statistically significant. Moreover, you will be working with a set of variables which provide similar information or which are redundant with respect to other variables.

    • It becomes difficult to identify statistically significant variables. Since your model becomes very sensitive to the sample you choose to run it on, different samples may show different statistically significant variables.
    • Because of multicollinearity, regression coefficients cannot be estimated precisely, because the standard errors tend to be very high. The value and even the sign of regression coefficients may change when different samples are chosen from the data.
    • The model becomes very sensitive to the addition or deletion of any independent variable. If you add a variable which is orthogonal to an existing variable, your model may throw up completely different results. Deletion of a variable may also significantly impact the overall results.
    • Confidence intervals tend to become wider, because of which we may not be able to reject the null hypothesis. The null hypothesis states that the true population coefficient is zero.

    Now, moving on to how to detect the presence of multicollinearity in the system.

    There are multiple ways to detect the presence of multicollinearity among the independent or explanatory variables.

    • The first and most rudimentary way is to create a pair-wise correlation plot among the different variables. In most cases, variables will have some degree of correlation with each other, but a high correlation coefficient may be a point of concern for us. It may indicate the presence of multicollinearity among the variables.
    • Large variations in regression coefficients on the addition or deletion of new explanatory or independent variables can indicate the presence of multicollinearity. So can significant changes in the regression coefficients from sample to sample: with different samples, different statistically significant variables may come out.
    • Another method is to use the tolerance or the variance inflation factor (VIF).

VIF = 1 / Tolerance

VIF = 1 / (1 - R²), where R² is obtained by regressing the given explanatory variable on all the other explanatory variables.

A VIF of over 10 indicates that the variable is highly correlated with the other explanatory variables. Usually, a VIF value of less than 4 is considered good for a model. (A minimal VIF computation is sketched just after this list.)

    • The model may have a very high R-squared value, yet most of the coefficients are not statistically significant. This kind of scenario may reflect multicollinearity in the system.
    • The Farrar-Glauber test is one of the statistical tests used to detect multicollinearity. It comprises three further tests. The first, a Chi-square test, examines whether multicollinearity is present in the system. The second, an F-test, determines which regressors or explanatory variables are collinear. The third, a t-test, determines the type or pattern of multicollinearity.
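Before the worked example, here is a minimal sketch of the VIF check from the list above. It uses the car package and the built-in mtcars data purely for illustration (an assumption on my part; the worked example below uses the mctest package instead, which reports VIF as part of its output):

library(car)

# VIF for a model with some obviously related predictors
m <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
vif(m)  # compare the values against the 4 and 10 thresholds mentioned above

The same check can be run on any lm object, including the wage model fitted below.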

    We will now use some of these techniques and try their implementation in R.

We will use the CPS_85_Wages data, which consists of a random sample of 534 persons from the CPS (Current Population Survey). The data provides information on wages and other characteristics of the workers (link: http://lib.stat.cmu.edu/datasets/CPS_85_Wages). You can go through the data details at the link provided.

Using this data, we will predict wages from the other variables in the data.

> data1 = read.csv(file.choose(), header = T)
> head(data1)
  Education South Sex Experience Union  Wage Age Race Occupation Sector Marr
1         8     0   1         21     0  5.10  35    2          6      1    1
2         9     0   1         42     0  4.95  57    3          6      1    1
3        12     0   0          1     0  6.67  19    3          6      1    0
4        12     0   0          4     0  4.00  22    3          6      0    0
5        12     0   0         17     0  7.50  35    3          6      0    1
6        13     0   0          9     1 13.07  28    3          6      0    0
> str(data1)
'data.frame': 534 obs. of  11 variables:
 $ Education : int  8 9 12 12 12 13 10 12 16 12 ...
 $ South     : int  0 0 0 0 0 0 1 0 0 0 ...
 $ Sex       : int  1 1 0 0 0 0 0 0 0 0 ...
 $ Experience: int  21 42 1 4 17 9 27 9 11 9 ...
 $ Union     : int  0 0 0 0 0 1 0 0 0 0 ...
 $ Wage      : num  5.1 4.95 6.67 4 7.5 ...
 $ Age       : int  35 57 19 22 35 28 43 27 33 27 ...
 $ Race      : int  2 3 3 3 3 3 3 3 3 3 ...
 $ Occupation: int  6 6 6 6 6 6 6 6 6 6 ...
 $ Sector    : int  1 1 1 0 0 0 0 0 1 0 ...
 $ Marr      : int  1 1 0 0 1 0 0 0 1 0 ...

The above output shows a sample view of the data and the variables present in it. Now, let’s fit the linear regression model and analyze the results.

> fit_model1 = lm(log(data1$Wage) ~ ., data = data1)
> summary(fit_model1)

Call:
lm(formula = log(data1$Wage) ~ ., data = data1)

Residuals:
     Min       1Q   Median       3Q      Max
-2.16246 -0.29163 -0.00469  0.29981  1.98248

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.078596   0.687514   1.569 0.117291
Education    0.179366   0.110756   1.619 0.105949
South       -0.102360   0.042823  -2.390 0.017187 *
Sex         -0.221997   0.039907  -5.563 4.24e-08 ***
Experience   0.095822   0.110799   0.865 0.387531
Union        0.200483   0.052475   3.821 0.000149 ***
Age         -0.085444   0.110730  -0.772 0.440671
Race         0.050406   0.028531   1.767 0.077865 .
Occupation  -0.007417   0.013109  -0.566 0.571761
Sector       0.091458   0.038736   2.361 0.018589 *
Marr         0.076611   0.041931   1.827 0.068259 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4398 on 523 degrees of freedom
Multiple R-squared:  0.3185, Adjusted R-squared:  0.3054
F-statistic: 24.44 on 10 and 523 DF,  p-value: < 2.2e-16

The linear regression results show that the model is statistically significant, as the F-statistic has a high value and the model p-value is less than 0.05. However, on closer examination we observe that four variables – Education, Experience, Age and Occupation – are not statistically significant, while two variables, Race and Marr (marital status), are significant only at the 10% level. Now, let’s plot the model diagnostics to validate the assumptions of the model.

> plot(fit_model1)

(R prompts “Hit <Return> to see next plot:” for each of the four diagnostic plots, which are shown in the original post.)
    The diagnostic plots also look fine. Let’s investigate further and look at pair-wise correlation among variables.

> library(corrplot)
> cor1 = cor(data1)
> corrplot.mixed(cor1, lower.col = "black", number.cex = .7)

The above correlation plot shows that there is a high correlation between the Experience and Age variables. This might be resulting in multicollinearity in the model.

Now, let’s move a step further and try the Farrar-Glauber test to investigate this further. The ‘mctest’ package provides the Farrar-Glauber test in R.

install.packages("mctest")
library(mctest)

We will first use the omcdiag function in the mctest package. According to the package description, omcdiag (Overall Multicollinearity Diagnostics Measures) computes different overall measures of multicollinearity diagnostics for a matrix of regressors.

> omcdiag(data1[,c(1:5,7:11)], data1$Wage)

Call:
omcdiag(x = data1[, c(1:5, 7:11)], y = data1$Wage)

Overall Multicollinearity Diagnostics

                       MC Results detection
Determinant |X'X|:         0.0001         1
Farrar Chi-Square:      4833.5751         1
Red Indicator:             0.1983         0
Sum of Lambda Inverse: 10068.8439         1
Theil's Method:            1.2263         1
Condition Number:        739.7337         1

1 --> COLLINEARITY is detected by the test
0 --> COLLINEARITY is not detected by the test

The above output shows that multicollinearity is present in the model. Now, let’s go a step further and check the F-test within the Farrar-Glauber test.

> imcdiag(data1[,c(1:5,7:11)], data1$Wage)

Call:
imcdiag(x = data1[, c(1:5, 7:11)], y = data1$Wage)

All Individual Multicollinearity Diagnostics Result

                 VIF    TOL          Wi          Fi Leamer      CVIF Klein
Education   231.1956 0.0043  13402.4982  15106.5849 0.0658  236.4725     1
South         1.0468 0.9553      2.7264      3.0731 0.9774    1.0707     0
Sex           1.0916 0.9161      5.3351      6.0135 0.9571    1.1165     0
Experience 5184.0939 0.0002 301771.2445 340140.5368 0.0139 5302.4188     1
Union         1.1209 0.8922      7.0368      7.9315 0.9445    1.1464     0
Age        4645.6650 0.0002 270422.7164 304806.1391 0.0147 4751.7005     1
Race          1.0371 0.9642      2.1622      2.4372 0.9819    1.0608     0
Occupation    1.2982 0.7703     17.3637     19.5715 0.8777    1.3279     0
Sector        1.1987 0.8343     11.5670     13.0378 0.9134    1.2260     0
Marr          1.0961 0.9123      5.5969      6.3085 0.9551    1.1211     0

1 --> COLLINEARITY is detected by the test
0 --> COLLINEARITY is not detected by the test

Education , South , Experience , Age , Race , Occupation , Sector , Marr , coefficient(s) are non-significant may be due to multicollinearity

R-square of y on all x: 0.2805

* use method argument to check which regressors may be the reason of collinearity
===================================

The above output shows that Education, Experience and Age suffer from multicollinearity; the VIF values for these variables are very high. Finally, let’s examine the pattern of multicollinearity by conducting t-tests for the partial correlation coefficients.

> library(ppcor)
> pcor(data1[,c(1:5,7:11)], method = "pearson")
$estimate
              Education        South          Sex  Experience        Union         Age         Race   Occupation
Education   1.000000000 -0.031750193  0.051510483 -0.99756187 -0.007479144  0.99726160  0.017230877  0.029436911
South      -0.031750193  1.000000000 -0.030152499 -0.02231360 -0.097548621  0.02152507 -0.111197596  0.008430595
Sex         0.051510483 -0.030152499  1.000000000  0.05497703 -0.120087577 -0.05369785  0.020017315 -0.142750864
Experience -0.997561873 -0.022313605  0.054977034  1.00000000 -0.010244447  0.99987574  0.010888486  0.042058560
Union      -0.007479144 -0.097548621 -0.120087577 -0.01024445  1.000000000  0.01223890 -0.107706183  0.212996388
Age         0.997261601  0.021525073 -0.053697851  0.99987574  0.012238897  1.00000000 -0.010803310 -0.044140293
Race        0.017230877 -0.111197596  0.020017315  0.01088849 -0.107706183 -0.01080331  1.000000000  0.057539374
Occupation  0.029436911  0.008430595 -0.142750864  0.04205856  0.212996388 -0.04414029  0.057539374  1.000000000
Sector     -0.021253493 -0.021518760 -0.112146760 -0.01326166 -0.013531482  0.01456575  0.006412099  0.314746868
Marr       -0.040302967  0.030418218  0.004163264 -0.04097664  0.068918496  0.04509033  0.055645964 -0.018580965
                 Sector         Marr
Education  -0.021253493 -0.040302967
South      -0.021518760  0.030418218
Sex        -0.112146760  0.004163264
Experience -0.013261665 -0.040976643
Union      -0.013531482  0.068918496
Age         0.014565751  0.045090327
Race        0.006412099  0.055645964
Occupation  0.314746868 -0.018580965
Sector      1.000000000  0.036495494
Marr        0.036495494  1.000000000

$p.value
           Education      South         Sex Experience        Union       Age       Race   Occupation       Sector
Education  0.0000000 0.46745162 0.238259049  0.0000000 8.641246e-01 0.0000000 0.69337880 5.005235e-01 6.267278e-01
South      0.4674516 0.00000000 0.490162786  0.6096300 2.526916e-02 0.6223281 0.01070652 8.470400e-01 6.224302e-01
Sex        0.2382590 0.49016279 0.000000000  0.2080904 5.822656e-03 0.2188841 0.64692038 1.027137e-03 1.005138e-02
Experience 0.0000000 0.60962999 0.208090393  0.0000000 8.146741e-01 0.0000000 0.80325456 3.356824e-01 7.615531e-01
Union      0.8641246 0.02526916 0.005822656  0.8146741 0.000000e+00 0.7794483 0.01345383 8.220095e-07 7.568528e-01
Age        0.0000000 0.62232811 0.218884070  0.0000000 7.794483e-01 0.0000000 0.80476248 3.122902e-01 7.389200e-01
Race       0.6933788 0.01070652 0.646920379  0.8032546 1.345383e-02 0.8047625 0.00000000 1.876376e-01 8.833600e-01
Occupation 0.5005235 0.84704000 0.001027137  0.3356824 8.220095e-07 0.3122902 0.18763758 0.000000e+00 1.467261e-13
Sector     0.6267278 0.62243025 0.010051378  0.7615531 7.568528e-01 0.7389200 0.88336002 1.467261e-13 0.000000e+00
Marr       0.3562616 0.48634504 0.924111163  0.3482728 1.143954e-01 0.3019796 0.20260170 6.707116e-01 4.035489e-01
                Marr
Education  0.3562616
South      0.4863450
Sex        0.9241112
Experience 0.3482728
Union      0.1143954
Age        0.3019796
Race       0.2026017
Occupation 0.6707116
Sector     0.4035489
Marr       0.0000000

$statistic
              Education      South          Sex   Experience      Union          Age       Race Occupation     Sector
Education     0.0000000 -0.7271618   1.18069629 -327.2105031 -0.1712102  308.6803174  0.3944914  0.6741338 -0.4866246
South        -0.7271618  0.0000000  -0.69053623   -0.5109090 -2.2436907    0.4928456 -2.5613138  0.1929920 -0.4927010
Sex           1.1806963 -0.6905362   0.00000000    1.2603880 -2.7689685   -1.2309760  0.4583091 -3.3015287 -2.5834540
Experience -327.2105031 -0.5109090   1.26038801    0.0000000 -0.2345184 1451.9092015  0.2492636  0.9636171 -0.3036001
Union        -0.1712102 -2.2436907  -2.76896848   -0.2345184  0.0000000    0.2801822 -2.4799336  4.9902208 -0.3097781
Age         308.6803174  0.4928456  -1.23097601 1451.9092015  0.2801822    0.0000000 -0.2473135 -1.0114033  0.3334607
Race          0.3944914 -2.5613138   0.45830912    0.2492636 -2.4799336   -0.2473135  0.0000000  1.3193223  0.1467827
Occupation    0.6741338  0.1929920  -3.30152873    0.9636171  4.9902208   -1.0114033  1.3193223  0.0000000  7.5906763
Sector       -0.4866246 -0.4927010  -2.58345399   -0.3036001 -0.3097781    0.3334607  0.1467827  7.5906763  0.0000000
Marr         -0.9233273  0.6966272   0.09530228   -0.9387867  1.5813765    1.0332156  1.2757711 -0.4254112  0.8359769
                  Marr
Education  -0.92332727
South       0.69662719
Sex         0.09530228
Experience -0.93878671
Union       1.58137652
Age         1.03321563
Race        1.27577106
Occupation -0.42541117
Sector      0.83597695
Marr        0.00000000

$n
[1] 534

$gp
[1] 8

$method
[1] "pearson"

As we saw earlier in the correlation plot, the partial correlations between Age and Experience, Age and Education, and Education and Experience are statistically significant, and a few other pairs are significant as well. Thus, the Farrar-Glauber test helps us identify the variables that are causing multicollinearity in the model.
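Rather than scanning the matrices by eye, the p-value matrix returned by pcor() can also be filtered programmatically. The following is an illustrative sketch only; the object name pc and the 5% cut-off are assumptions, not part of the original code:

library(ppcor)
pc = pcor(data1[, c(1:5, 7:11)], method = "pearson")
# keep each pair once (upper triangle) and list those significant at the 5% level
sig = which(pc$p.value < 0.05 & upper.tri(pc$p.value), arr.ind = TRUE)
data.frame(var1 = rownames(pc$p.value)[sig[, 1]],
           var2 = colnames(pc$p.value)[sig[, 2]],
           partial.cor = round(pc$estimate[sig], 3),
           p.value = signif(pc$p.value[sig], 3))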

There are multiple ways to deal with multicollinearity. You may use ridge regression, principal component regression or partial least squares regression. Alternatively, you can drop the variables that are causing the multicollinearity, for example those with a VIF above 10. In our case, since Age and Experience are highly correlated, you may drop one of them and build the model again. Try rebuilding the model after removing Experience or Age and check whether you get better results. Share your experience in the comments section below.
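As a minimal sketch of that last suggestion (dropping Age rather than Experience is our illustrative choice; removing Experience instead would be just as reasonable):

# refit the wage model without Age, then recheck the variance inflation factors
fit_model2 = lm(log(Wage) ~ . - Age, data = data1)
summary(fit_model2)
library(car)
vif(fit_model2)

With the Age-Experience overlap removed, the remaining VIF values should fall to unremarkable levels and the Education and Experience coefficients should be estimated much more precisely.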

    Author Bio:

This article was contributed by Perceptive Analytics, with Jyothirmayee Thondamallu, Chaitanya Sagar and Saneesh Veetil as contributing authors.

    Perceptive Analytics provides Data Analytics, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Its client roster includes Fortune 500 and NYSE listed companies in the USA and India.


