
RStudio meets MilanoR! June 29th, Milan

Mon, 06/12/2017 - 16:57

Hello R-Users, we have great news! We are going to host Nathan Stephens and Joseph Rickert from RStudio and the R Consortium: they are coming to Milano from the USA just for us, to meet the MilanoR community and talk about the latest news from RStudio and the R Consortium!

Nathan, director of solutions engineering at RStudio, will discuss how the Tidyverse, R Markdown, and Shiny work together with RStudio products to enable better R for Data Science. Joseph, member of the R Consortium board of directors, will tell us all about the R Consortium news and RStudio community support activities.

Given the intersection of topics and interests of R in the Data Science space, the event is organized in collaboration with “Data Science Milan”, so the two communities will also have the chance to meet. Therefore, on the evening of June 29th, we will have two interesting talks by two important guests, plus a free buffet and lots of networking opportunities in the cosy garden of Impact Hub. How can you miss it?

Program
  • Welcome presentations
  • RStudio presents: Publishing and Self-service Analytics with R by Nathan Stephens
  • R Consortium News and RStudio Community Support by Joseph Rickert
  • Free refreshment & networking time
What’s the price?

The event is free and open to everybody. Given how exceptional our guests are, we did our best to host as many people as possible: the meeting is open to a maximum of 200 participants, and registration is required on the Eventbrite event page.

Where is it?

Sala Biliardo, Impact Hub – Via Aosta 4, Milano (M5 Cenisio)

FAQ

I want to know more about the talks..

Publishing and Self-service Analytics with R: At RStudio we think about a model for data science that begins with importing and tidying data, continues with an iterative cycle of transforming, modelling, and visualizing data, and ends with communicating results. We’ve built tools and packages to help you succeed across this cycle:  RStudio and Notebooks for interactive development, the Tidyverse for making data science easier and more effective, and R Markdown, Shiny, and the new RStudio Connect (Pro) for communicating results. In this talk we’ll share the way we think about data science and our toolchain.

R Consortium News and RStudio Community Support: There is a lot happening at R Consortium! We now have fifteen member organizations, including the Gordon and Betty Moore Foundation which joined the Consortium this year as a Platinum member. The R Consortium’s Infrastructure Steering Committee (ISC) is currently supporting more than twenty projects working on various aspects of improving R Community infrastructure. In this talk, Joseph will provide an overview of R Consortium activities, and make suggestions on how R users can get involved. He will also talk about what RStudio is doing to support the R Consortium, and how RStudio directly supports the R Community.

..and the speakers..

Nathan Stephens: Nathan Stephens is director of solutions engineering at RStudio. His background is in applied analytics and consulting. He has experience building data science teams, creating innovative data products, analyzing big data, and architecting analytic platforms. He is a long time user of R and has helped many organizations adopt R as an analytic standard.

Joseph Rickert: Joseph Rickert is RStudio’s “Ambassador at Large” for all things R. He works with the rest of the RStudio team and the R Consortium to promote open source activities, the R language and the R Community. Joseph represents RStudio on the R Consortium board of directors. Joseph came to RStudio from Microsoft /Revolution Analytics where he was a data scientist, blog editor and special programs manager after having spent most of his career working for startups in various technical, marketing and sales positions. Joseph attended Franklin & Marshall college on a Classics scholarship and later earned an M.A. in Humanities and an M.S. in Statistics from the California State University. You can follow him on Twitter as @RStudioJoe.

…and the sponsors..

RStudio – People all over the world are turning to R, an open source statistical language, to make sense of data. Inspired by the innovations of R users in science, education, and industry, RStudio develops free and open tools for R and enterprise-ready professional products for teams to scale and share work.

Quantide – Quantide is a provider of consulting services and training courses in Data Science and Big Data. It specializes in R, the open source software for statistical computing. Headquartered near Milan (Italy), Quantide has been supporting customers from several industries around the world for 10 years. Quantide founded the MilanoR community in 2012 and has supported it ever since.

…and the organizers..

MilanoR – MilanoR is the R user group of the city of Milano. MilanoR has a blog, where we post R tutorials, articles and events; a Facebook page (which includes our event streaming); and a newsletter. MilanoR organizes R talks (the MilanoR meetings) and R-Labs, monthly collaborative R mini-hackathons. All the R-Labs materials are stored on GitHub, while our R-Labs conversations happen on a Slack channel (if you are interested in joining one of these channels, please go here). In a nutshell, we work and talk about R in whatever way comes to mind.

Data Science Milano – Data Science Milano is an independent group of 600+ professionals with the goal of promoting and pioneering knowledge and innovation of the data-driven revolution in the Italian peninsula and beyond. The group encourages international collaboration, sharing and open source tools. Everyone who is involved in data science projects or would like  to undertake this career is invited to join.

Ok, I want to join the meeting

Go to the Eventbrite event page, click the green “Register” button at the top, and welcome!

I have another question

Please contact us at admin@milanor.net

The post RStudio meets MilanoR! June 29th, Milan appeared first on MilanoR.


Facilitating Exploratory Data Visualization: Application to TCGA Genomic Data

Mon, 06/12/2017 - 16:30

(This article was first published on Easy Guides, and kindly contributed to R-bloggers)

In genomic fields, it’s very common to explore the gene expression profile of one or a list of genes involved in a pathway of interest. Here, we present some helper functions in the ggpubr R package to facilitate exploratory data analysis (EDA) for life scientists.

Exploratory Data visualization: Gene Expression Data

Standard graphical techniques used in EDA include:

  • Box plot
  • Violin plot
  • Stripchart
  • Dot plot
  • Histogram and density plots
  • ECDF plot
  • Q-Q plot

All these plots can be created using the ggplot2 R package, which is highly flexible.

However, to customize a ggplot, the syntax might appear opaque to a beginner, and this raises the level of difficulty for researchers with no advanced R programming skills. If you’re not familiar with the ggplot2 system, you can start by reading our Guide to Create Beautiful Graphics in R.
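To make the contrast concrete, here is a minimal sketch, using the built-in ToothGrowth data rather than the TCGA data used later, of the same box plot written first with plain ggplot2 and then with ggpubr:

library(ggplot2)
library(ggpubr)

data("ToothGrowth")
ToothGrowth$dose <- factor(ToothGrowth$dose)

# ggplot2: the plot is assembled layer by layer
ggplot(ToothGrowth, aes(x = dose, y = len, color = dose)) +
  geom_boxplot() +
  theme_classic()

# ggpubr: a single call, with color and palette handled as arguments
ggboxplot(ToothGrowth, x = "dose", y = "len",
          color = "dose", palette = "jco")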

Previously, we described how to Add P-values and Significance Levels to ggplots. In this article, we present the ggpubr package, a wrapper around ggplot2, which provides easy-to-use functions for creating ggplot2-based publication-ready plots. We’ll use the ggpubr functions to visualize gene expression profiles from TCGA genomic data sets.


Prerequisites

ggpubr package

Required R package: ggpubr (version >= 0.1.3).

  • Install from CRAN as follows:

install.packages("ggpubr")

  • Or, install the latest development version from GitHub as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")

  • Load ggpubr:

library(ggpubr)

TCGA data

The Cancer Genome Atlas (TCGA) is a publicly available resource containing clinical and genomic data across 33 cancer types. These data include gene expression, CNV profiling, SNP genotyping, DNA methylation, miRNA profiling, exome sequencing, and other types of data.

The RTCGA R package, by Marcin Kosinski et al., provides a convenient way to access the clinical and genomic data available in TCGA. Each data collection is distributed as a separate package and must be installed (once) individually.

The following R code installs the core RTCGA package as well as the clinical and mRNA gene expression data packages.

# Load the bioconductor installer.
source("https://bioconductor.org/biocLite.R")
# Install the main RTCGA package
biocLite("RTCGA")
# Install the clinical and mRNA gene expression data packages
biocLite("RTCGA.clinical")
biocLite("RTCGA.mRNA")

To see the type of data available for each cancer type, use this:

library(RTCGA)
infoTCGA()

# A tibble: 38 x 13
     Cohort   BCR Clinical    CN  LowP Methylation  mRNA mRNASeq   miR miRSeq  RPPA   MAF rawMAF
1       ACC    92       92    90     0          80     0      79     0     80    46    90      0
2      BLCA   412      412   410   112         412     0     408     0    409   344   130    395
3      BRCA  1098     1097  1089    19        1097   526    1093     0   1078   887   977      0
4      CESC   307      307   295    50         307     0     304     0    307   173   194      0
5      CHOL    51       45    36     0          36     0      36     0     36    30    35      0
6      COAD   460      458   451    69         457   153     457     0    406   360   154    367
7  COADREAD   631      629   616   104         622   222     623     0    549   491   223    489
8      DLBC    58       48    48     0          48     0      48     0     47    33    48      0
9      ESCA   185      185   184    51         185     0     184     0    184   126   185      0
10     FPPP    38       38     0     0           0     0       0     0     23     0     0      0
# ... with 28 more rows

More information about the disease names can be found at: http://gdac.broadinstitute.org/

Gene expression data

The R function expressionsTCGA() [in RTCGA package] can be used to easily extract the expression values of genes of interest in one or multiple cancer types.

In the following R code, we start by extracting the mRNA expression for five genes of interest – GATA3, PTEN, XBP1, ESR1 and MUC1 – from 3 different data sets:

  • Breast invasive carcinoma (BRCA),
  • Ovarian serous cystadenocarcinoma (OV) and
  • Lung squamous cell carcinoma (LUSC)
library(RTCGA)
library(RTCGA.mRNA)
expr <- expressionsTCGA(BRCA.mRNA, OV.mRNA, LUSC.mRNA,
                        extract.cols = c("GATA3", "PTEN", "XBP1", "ESR1", "MUC1"))
expr

# A tibble: 1,305 x 7
            bcr_patient_barcode   dataset     GATA3       PTEN      XBP1       ESR1      MUC1
1  TCGA-A1-A0SD-01A-11R-A115-07 BRCA.mRNA  2.870500  1.3613571  2.983333  3.0842500  1.652125
2  TCGA-A1-A0SE-01A-11R-A084-07 BRCA.mRNA  2.166250  0.4283571  2.550833  2.3860000  3.080250
3  TCGA-A1-A0SH-01A-11R-A084-07 BRCA.mRNA  1.323500  1.3056429  3.020417  0.7912500  2.985250
4  TCGA-A1-A0SJ-01A-11R-A084-07 BRCA.mRNA  1.841625  0.8096429  3.131333  2.4954167 -1.918500
5  TCGA-A1-A0SK-01A-12R-A084-07 BRCA.mRNA -6.025250  0.2508571 -1.451750 -4.8606667 -1.171500
6  TCGA-A1-A0SM-01A-11R-A084-07 BRCA.mRNA  1.804500  1.3107857  4.041083  2.7970000  3.529750
7  TCGA-A1-A0SO-01A-22R-A084-07 BRCA.mRNA -4.879250 -0.2369286 -0.724750 -4.4860833 -1.455750
8  TCGA-A1-A0SP-01A-11R-A084-07 BRCA.mRNA -3.143250 -1.2432143 -1.193083 -1.6274167 -0.986750
9  TCGA-A2-A04N-01A-11R-A115-07 BRCA.mRNA  2.034000  1.2074286  2.278833  4.1155833  0.668000
10 TCGA-A2-A04P-01A-31R-A034-07 BRCA.mRNA -0.293125  0.2883571 -1.605083  0.4731667  0.011500
# ... with 1,295 more rows

To display the number of samples in each data set, type this:

nb_samples <- table(expr$dataset)
nb_samples

BRCA.mRNA LUSC.mRNA   OV.mRNA
      590       154       561

We can simplify data set names by removing the “mRNA” tag. This can be done using the R base function gsub().

expr$dataset <- gsub(pattern = ".mRNA", replacement = "", expr$dataset)

Let’s also simplify the patients’ barcode column. The following R code will change the barcodes into BRCA1, BRCA2, …, OV1, OV2, …, etc.

expr$bcr_patient_barcode <- paste0(expr$dataset, c(1:590, 1:561, 1:154))
expr

# A tibble: 1,305 x 7
   bcr_patient_barcode dataset     GATA3       PTEN      XBP1       ESR1      MUC1
1                BRCA1    BRCA  2.870500  1.3613571  2.983333  3.0842500  1.652125
2                BRCA2    BRCA  2.166250  0.4283571  2.550833  2.3860000  3.080250
3                BRCA3    BRCA  1.323500  1.3056429  3.020417  0.7912500  2.985250
4                BRCA4    BRCA  1.841625  0.8096429  3.131333  2.4954167 -1.918500
5                BRCA5    BRCA -6.025250  0.2508571 -1.451750 -4.8606667 -1.171500
6                BRCA6    BRCA  1.804500  1.3107857  4.041083  2.7970000  3.529750
7                BRCA7    BRCA -4.879250 -0.2369286 -0.724750 -4.4860833 -1.455750
8                BRCA8    BRCA -3.143250 -1.2432143 -1.193083 -1.6274167 -0.986750
9                BRCA9    BRCA  2.034000  1.2074286  2.278833  4.1155833  0.668000
10              BRCA10    BRCA -0.293125  0.2883571 -1.605083  0.4731667  0.011500
# ... with 1,295 more rows

The above (expr) dataset has been saved at https://raw.githubusercontent.com/kassambara/data/master/expr_tcga.txt. This data is required to practice the R code provided in this tutorial.

If you experience issues installing the RTCGA packages, you can simply load the data as follows:

expr <- read.delim("https://raw.githubusercontent.com/kassambara/data/master/expr_tcga.txt",
                   stringsAsFactors = FALSE)

Box plots

(ggplot2 way of creating box plot)

Create a box plot of a gene expression profile, colored by groups (here data set/cancer type):

library(ggpubr)
# GATA3
ggboxplot(expr, x = "dataset", y = "GATA3",
          title = "GATA3", ylab = "Expression",
          color = "dataset", palette = "jco")
# PTEN
ggboxplot(expr, x = "dataset", y = "PTEN",
          title = "PTEN", ylab = "Expression",
          color = "dataset", palette = "jco")

Exploratory Data visualization: Gene Expression Data

Note that the argument palette is used to change color palettes. Allowed values include:

  • “grey” for grey color palettes;
  • brewer palettes, e.g. “RdBu”, “Blues”, …; to view them all, type RColorBrewer::display.brewer.all() in R;
  • or custom color palettes e.g. c(“blue”, “red”) or c(“#00AFBB”, “#E7B800”);
  • and scientific journal palettes from the ggsci R package, e.g.: “npg”, “aaas”, “lancet”, “jco”, “ucscgb”, “uchicago”, “simpsons” and “rickandmorty”.

Instead of repeating the same R code for each gene, you can create a list of plots at once, as follows:

# Create a list of plots
p <- ggboxplot(expr, x = "dataset",
               y = c("GATA3", "PTEN", "XBP1"),
               title = c("GATA3", "PTEN", "XBP1"),
               ylab = "Expression",
               color = "dataset", palette = "jco")
# View GATA3
p$GATA3
# View PTEN
p$PTEN
# View XBP1
p$XBP1

Note that when the argument y contains multiple variables (here, multiple gene names), the arguments title, xlab and ylab can also be character vectors of the same length as y.

To add p-values and significance levels to the boxplots, read our previous article: Add P-values and Significance Levels to ggplots. Briefly, you can do this:

my_comparisons <- list(c("BRCA", "OV"), c("OV", "LUSC"))
ggboxplot(expr, x = "dataset", y = "GATA3",
          title = "GATA3", ylab = "Expression",
          color = "dataset", palette = "jco") +
  stat_compare_means(comparisons = my_comparisons)

Exploratory Data visualization: Gene Expression Data

For each of the genes, you can compare the different groups as follow:

compare_means(c(GATA3, PTEN, XBP1) ~ dataset, data = expr)

# A tibble: 9 x 8
    .y. group1 group2             p         p.adj p.format p.signif   method
1 GATA3   BRCA     OV 1.111768e-177 3.335304e-177  < 2e-16     **** Wilcoxon
2 GATA3   BRCA   LUSC  6.684016e-73  1.336803e-72  < 2e-16     **** Wilcoxon
3 GATA3     OV   LUSC  2.965702e-08  2.965702e-08  3.0e-08     **** Wilcoxon
4  PTEN   BRCA     OV  6.791940e-05  6.791940e-05  6.8e-05     **** Wilcoxon
5  PTEN   BRCA   LUSC  1.042830e-16  3.128489e-16  < 2e-16     **** Wilcoxon
6  PTEN     OV   LUSC  1.280576e-07  2.561153e-07  1.3e-07     **** Wilcoxon
7  XBP1   BRCA     OV 2.551228e-123 7.653685e-123  < 2e-16     **** Wilcoxon
8  XBP1   BRCA   LUSC  1.950162e-42  3.900324e-42  < 2e-16     **** Wilcoxon
9  XBP1     OV   LUSC  4.239570e-11  4.239570e-11  4.2e-11     **** Wilcoxon

If you want to select items (here, cancer types) to display, or to remove a particular item from the plot, use the argument select or remove, as follows:

# Select BRCA and OV cancer types
ggboxplot(expr, x = "dataset", y = "GATA3",
          title = "GATA3", ylab = "Expression",
          color = "dataset", palette = "jco",
          select = c("BRCA", "OV"))
# or remove BRCA
ggboxplot(expr, x = "dataset", y = "GATA3",
          title = "GATA3", ylab = "Expression",
          color = "dataset", palette = "jco",
          remove = "BRCA")

Exploratory Data visualization: Gene Expression Data

To change the order of the data sets on x axis, use the argument order. For example order = c(“LUSC”, “OV”, “BRCA”):

# Order data sets
ggboxplot(expr, x = "dataset", y = "GATA3",
          title = "GATA3", ylab = "Expression",
          color = "dataset", palette = "jco",
          order = c("LUSC", "OV", "BRCA"))

Exploratory Data visualization: Gene Expression Data

To create horizontal plots, use the argument rotate = TRUE:

ggboxplot(expr, x = "dataset", y = "GATA3",
          title = "GATA3", ylab = "Expression",
          color = "dataset", palette = "jco",
          rotate = TRUE)

Exploratory Data visualization: Gene Expression Data

To combine the three gene expression plots into a multi-panel plot, use the argument combine = TRUE:

ggboxplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          combine = TRUE,
          ylab = "Expression",
          color = "dataset", palette = "jco")

Exploratory Data visualization: Gene Expression Data

You can also merge the 3 plots using the argument merge = TRUE or merge = “asis”:

ggboxplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          merge = TRUE,
          ylab = "Expression",
          palette = "jco")

Exploratory Data visualization: Gene Expression Data

In the plot above, it’s easy to visually compare the expression levels of the different genes in each cancer type.

But you might want to put the genes (the y variables) on the x axis, in order to compare their expression levels within each group.

In this situation, the y variables (i.e. genes) become the x tick labels and the x variable (i.e. dataset) becomes the grouping variable. To do this, use the argument merge = "flip".

ggboxplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          merge = "flip",
          ylab = "Expression",
          palette = "jco")

Exploratory Data visualization: Gene Expression Data

You might want to add jittered points to the boxplot. Each point corresponds to an individual observation. To add jittered points, use the argument add = "jitter" as follows. To customize the added elements, specify the argument add.params.

ggboxplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          combine = TRUE,
          color = "dataset", palette = "jco",
          ylab = "Expression",
          add = "jitter",                              # Add jittered points
          add.params = list(size = 0.1, jitter = 0.2)  # Point size and the amount of jittering
          )

Exploratory Data visualization: Gene Expression Data

Note that, when using ggboxplot(), sensible values for the argument add are "jitter" and "dotplot". If you decide to use add = "dotplot", you can adjust dotsize and binwidth when you have a dense dotplot. Read more about binwidth.

You can add and adjust a dotplot as follows:

ggboxplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          combine = TRUE,
          color = "dataset", palette = "jco",
          ylab = "Expression",
          add = "dotplot",                                   # Add dotplot
          add.params = list(binwidth = 0.1, dotsize = 0.3)
          )

Exploratory Data visualization: Gene Expression Data

You might want to label the boxplot by showing the names of samples with the top n highest or lowest values. In this case, you can use the following arguments:

  • label: the name of the column containing point labels.
  • label.select: can be of two formats:
    • a character vector specifying some labels to show.
    • a list containing one or the combination of the following components:
      • top.up and top.down: to display the labels of the top up/down points. For example, label.select = list(top.up = 10, top.down = 4).
      • criteria: to filter, for example, by x and y variable values, use this: label.select = list(criteria = "`y` > 3.9 & `y` < 5 & `x` %in% c('BRCA', 'OV')").

For example:

ggboxplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          combine = TRUE,
          color = "dataset", palette = "jco",
          ylab = "Expression",
          add = "jitter",                                 # Add jittered points
          add.params = list(size = 0.1, jitter = 0.2),    # Point size and the amount of jittering
          label = "bcr_patient_barcode",                  # column containing point labels
          label.select = list(top.up = 2, top.down = 2),  # Select some labels to display
          font.label = list(size = 9, face = "italic"),   # label font
          repel = TRUE                                    # Avoid label text overplotting
          )

Exploratory Data visualization: Gene Expression Data

Complex criteria for labeling can be specified as follows:

label.select.criteria <- list(criteria = "`y` > 3.9 & `x` %in% c('BRCA', 'OV')")
ggboxplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          combine = TRUE,
          color = "dataset", palette = "jco",
          ylab = "Expression",
          label = "bcr_patient_barcode",                 # column containing point labels
          label.select = label.select.criteria,          # Select some labels to display
          font.label = list(size = 9, face = "italic"),  # label font
          repel = TRUE                                   # Avoid label text overplotting
          )

Other types of plots with the same arguments as the function ggboxplot() are available, such as stripcharts and violin plots.

Violin plots

(ggplot2 way of creating violin plot)

The following R code draws violin plots with box plots inside:

ggviolin(expr, x = "dataset",
         y = c("GATA3", "PTEN", "XBP1"),
         combine = TRUE,
         color = "dataset", palette = "jco",
         ylab = "Expression",
         add = "boxplot")

Exploratory Data visualization: Gene Expression Data

Instead of adding a box plot inside the violin plot, you can add the median and interquartile range as follows:

ggviolin(expr, x = "dataset",
         y = c("GATA3", "PTEN", "XBP1"),
         combine = TRUE,
         color = "dataset", palette = "jco",
         ylab = "Expression",
         add = "median_iqr")

Exploratory Data visualization: Gene Expression Data

When using the function ggviolin(), sensible values for the argument add include: “mean”, “mean_se”, “mean_sd”, “mean_ci”, “mean_range”, “median”, “median_iqr”, “median_mad”, “median_range”.

You can also add “jitter” points and “dotplot” inside the violin plot as described previously in the box plot section.
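For instance, a minimal sketch combining a violin outline with jittered points, reusing the expr data frame and the jco palette from above:

ggviolin(expr, x = "dataset", y = "GATA3",
         color = "dataset", palette = "jco",
         ylab = "Expression",
         add = "jitter",                              # jittered points inside the violins
         add.params = list(size = 0.1, jitter = 0.2))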

Stripcharts and dot plots

To draw a stripchart, type this:

ggstripchart(expr, x = "dataset",
             y = c("GATA3", "PTEN", "XBP1"),
             combine = TRUE,
             color = "dataset", palette = "jco",
             size = 0.1, jitter = 0.2,
             ylab = "Expression",
             add = "median_iqr",
             add.params = list(color = "gray"))

Exploratory Data visualization: Gene Expression Data

(ggplot2 way of creating stripcharts)

For a dot plot, use this:

ggdotplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          combine = TRUE,
          color = "dataset", palette = "jco",
          fill = "white", binwidth = 0.1,
          ylab = "Expression",
          add = "median_iqr",
          add.params = list(size = 0.9))

Exploratory Data visualization: Gene Expression Data

(ggplot2 way of creating dot plots)

Density plots

(ggplot2 way of creating density plots)

To visualize the distribution as a density plot, use the function ggdensity() as follows:

# Basic density plot
ggdensity(expr, x = c("GATA3", "PTEN", "XBP1"),
          y = "..density..",
          combine = TRUE,        # Combine the 3 plots
          xlab = "Expression",
          add = "median",        # Add median line
          rug = TRUE             # Add marginal rug
          )

Exploratory Data visualization: Gene Expression Data

# Change color and fill by dataset
ggdensity(expr, x = c("GATA3", "PTEN", "XBP1"),
          y = "..density..",
          combine = TRUE,        # Combine the 3 plots
          xlab = "Expression",
          add = "median",        # Add median line
          rug = TRUE,            # Add marginal rug
          color = "dataset", fill = "dataset",
          palette = "jco"
          )

Exploratory Data visualization: Gene Expression Data

# Merge the 3 plots
# and use y = "..count.." instead of "..density.."
ggdensity(expr, x = c("GATA3", "PTEN", "XBP1"),
          y = "..count..",
          merge = TRUE,          # Merge the 3 plots
          xlab = "Expression",
          add = "median",        # Add median line
          rug = TRUE,            # Add marginal rug
          palette = "jco"        # Change color palette
          )

Exploratory Data visualization: Gene Expression Data

# color and fill by x variables
ggdensity(expr, x = c("GATA3", "PTEN", "XBP1"),
          y = "..count..",
          color = ".x.", fill = ".x.",   # color and fill by x variables
          merge = TRUE,                  # Merge the 3 plots
          xlab = "Expression",
          add = "median",                # Add median line
          rug = TRUE,                    # Add marginal rug
          palette = "jco"                # Change color palette
          )

Exploratory Data visualization: Gene Expression Data

# Facet by "dataset"
ggdensity(expr, x = c("GATA3", "PTEN", "XBP1"),
          y = "..count..",
          color = ".x.", fill = ".x.",
          facet.by = "dataset",          # Split by "dataset" into multi-panel
          merge = TRUE,                  # Merge the 3 plots
          xlab = "Expression",
          add = "median",                # Add median line
          rug = TRUE,                    # Add marginal rug
          palette = "jco"                # Change color palette
          )

Exploratory Data visualization: Gene Expression Data

Histogram plots

(ggplot2 way of creating histogram plots)

To visualize the distribution as a histogram, use the function gghistogram() as follows:

# Basic histogram plot
gghistogram(expr, x = c("GATA3", "PTEN", "XBP1"),
            y = "..density..",
            combine = TRUE,      # Combine the 3 plots
            xlab = "Expression",
            add = "median",      # Add median line
            rug = TRUE           # Add marginal rug
            )

Exploratory Data visualization: Gene Expression Data

# Change color and fill by dataset
gghistogram(expr, x = c("GATA3", "PTEN", "XBP1"),
            y = "..density..",
            combine = TRUE,      # Combine the 3 plots
            xlab = "Expression",
            add = "median",      # Add median line
            rug = TRUE,          # Add marginal rug
            color = "dataset", fill = "dataset",
            palette = "jco"
            )

Exploratory Data visualization: Gene Expression Data

# Merge the 3 plots
# and use y = "..count.." instead of "..density.."
gghistogram(expr, x = c("GATA3", "PTEN", "XBP1"),
            y = "..count..",
            merge = TRUE,        # Merge the 3 plots
            xlab = "Expression",
            add = "median",      # Add median line
            rug = TRUE,          # Add marginal rug
            palette = "jco"      # Change color palette
            )

Exploratory Data visualization: Gene Expression Data

# color and fill by x variables
gghistogram(expr, x = c("GATA3", "PTEN", "XBP1"),
            y = "..count..",
            color = ".x.", fill = ".x.",   # color and fill by x variables
            merge = TRUE,                  # Merge the 3 plots
            xlab = "Expression",
            add = "median",                # Add median line
            rug = TRUE,                    # Add marginal rug
            palette = "jco"                # Change color palette
            )

Exploratory Data visualization: Gene Expression Data

# Facet by "dataset"
gghistogram(expr, x = c("GATA3", "PTEN", "XBP1"),
            y = "..count..",
            color = ".x.", fill = ".x.",
            facet.by = "dataset",          # Split by "dataset" into multi-panel
            merge = TRUE,                  # Merge the 3 plots
            xlab = "Expression",
            add = "median",                # Add median line
            rug = TRUE,                    # Add marginal rug
            palette = "jco"                # Change color palette
            )

Exploratory Data visualization: Gene Expression Data

Empirical cumulative distribution function

(ggplot2 way of creating ECDF plots)

# Basic ECDF plot
ggecdf(expr, x = c("GATA3", "PTEN", "XBP1"),
       combine = TRUE,
       xlab = "Expression",
       ylab = "F(expression)"
       )

Exploratory Data visualization: Gene Expression Data

# Change color by dataset
ggecdf(expr, x = c("GATA3", "PTEN", "XBP1"),
       combine = TRUE,
       xlab = "Expression",
       ylab = "F(expression)",
       color = "dataset", palette = "jco"
       )

Exploratory Data visualization: Gene Expression Data

# Merge the 3 plots and color by x variables
ggecdf(expr, x = c("GATA3", "PTEN", "XBP1"),
       merge = TRUE,
       xlab = "Expression",
       ylab = "F(expression)",
       color = ".x.", palette = "jco"
       )

Exploratory Data visualization: Gene Expression Data

# Merge the 3 plots and color by x variables
# facet by "dataset" into multi-panel
ggecdf(expr, x = c("GATA3", "PTEN", "XBP1"),
       merge = TRUE,
       xlab = "Expression",
       ylab = "F(expression)",
       color = ".x.", palette = "jco",
       facet.by = "dataset"
       )

Exploratory Data visualization: Gene Expression Data

Quantile-Quantile plot

(ggplot2 way of creating QQ plots)

# Basic QQ plot
ggqqplot(expr, x = c("GATA3", "PTEN", "XBP1"),
         combine = TRUE,
         size = 0.5
         )

Exploratory Data visualization: Gene Expression Data

# Change color by dataset
ggqqplot(expr, x = c("GATA3", "PTEN", "XBP1"),
         combine = TRUE,
         color = "dataset", palette = "jco",
         size = 0.5
         )

Exploratory Data visualization: Gene Expression Data

# Merge the 3 plots and color by x variables
ggqqplot(expr, x = c("GATA3", "PTEN", "XBP1"),
         merge = TRUE,
         color = ".x.", palette = "jco"
         )

Exploratory Data visualization: Gene Expression Data

# Merge the 3 plots and color by x variables
# facet by "dataset" into multi-panel
ggqqplot(expr, x = c("GATA3", "PTEN", "XBP1"),
         merge = TRUE,
         size = 0.5,
         color = ".x.", palette = "jco",
         facet.by = "dataset"
         )

Exploratory Data visualization: Gene Expression Data

Infos

This analysis has been performed using R software (ver. 3.3.2) and ggpubr (ver. 0.1.3).



A Data Scientist’s Guide to Predicting Housing Prices in Russia

Mon, 06/12/2017 - 15:17

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Sberbank Russian Housing Market

A Kaggle Competition on Predicting Realty Price in Russia

Written by Haseeb Durrani, Chen Trilnik, and Jack Yip

 

Introduction

In May 2017, Sberbank, Russia’s oldest and largest bank, challenged data scientists on Kaggle to come up with the best machine learning models to estimate housing prices for its customers, who include consumers and developers looking to buy, invest in, or rent properties. This blog post outlines the end-to-end process of how we went about tackling this challenge.

 

About the Data

The datasets provided consist of a training set, a test set, and a file containing historical macroeconomic metrics.

  • The training set contains ~21K realty transactions spanning August 2011 to June 2015, along with information specific to each property. This set also includes the price at which the property was sold.
  • The test set contains ~7K realty transactions spanning from July 2015 to May 2016 along with information specific to the property. This set does not include the price at which the property was transacted.
  • The macroeconomic data spans from January 2010 to October 2016.
  • There are a combined total of ~400 features or predictors.

The train and test datasets span 2011 to 2016.

 

Project Workflow

As with any team-based project, defining the project workflow was vital to our delivering on time as we were given only a two-week timeframe to complete the project. Painting the bigger picture allowed us to have a shared understanding of the moving parts and helped us to budget time appropriately.

Our project workflow can be broken down into three main parts: 1) Data Assessment, 2) Model Preparation, and 3) Model Fitting. This blog post will follow the general flow depicted in the illustration below.

The project workflow was vital to keeping us on track under a two-week constraint.

 

Approach to the Problem

Given a tight deadline, we had to limit the project scope to what we could realistically complete. We agreed that our primary objectives were to learn as much as we could and to apply what we had learned in class. To do so, we decided to do the following:

  • Focus on applying the Multiple Linear Regression and Gradient Boosting models, but also spend some time with XGBoost for learning purposes (XGBoost is a relatively new machine learning method, and it is often the model of choice for winning Kaggle competitions).
  • Spend more time working and learning together as a team instead of working in silos. This meant that each of us had an opportunity to work in each of the domains in the project pipeline. While this is not reflective of the fast-paced environment of the real world, we believe it helped us better learn the data science process.

 

Understanding the Russian Economy (2011 to 2016)

In order to better predict Russian housing prices, we first need to understand how the economic forces impact the Russian economy, as it directly affects the supply and demand in the housing market.

To understand how we could best apply macroeconomic features in our models, we researched Russia’s biggest economic drivers and any events that may have impacted Russia’s economy during the time period of the dataset (i.e. 2011 to 2016).

These factors initiated a “snowball” effect, negatively impacting the following base economic indicators:

  • The Russian ruble
  • The Russian Trading System Index (RTS Index)
  • The Russian trade balance

Labels for the chart: Blue – RTS Index, Green – Global Oil Price, Purple – USD/Ruble Exchange Rate,  Red – Russia’s Trade Balance

 

The poorly performing Russian economy produced a series of changes in Russia’s macroeconomic indicators, including but not limited to:

  • Significant increase in Russia’s Consumer Price Index (CPI)
  • Worsened salary growth and unemployment rate
  • Decline in Gross Domestic Product (GDP)

The change in these indicators led us to conclude that Russia was facing a financial crisis.

The first chart measures CPI, the second chart measures GDP growth, and the third chart measures unemployment rate (blue) and salary growth (red)

 

As we continued to explore the macroeconomic dataset, we were able to further confirm Russia’s financial crisis. We observed that the Russian Central Bank had increased the banking deposit rate by over 100% in 2014 in an attempt to bring stability to the Russian economy and the Russian ruble.

In the chart above, the red data points represent banking deposit rates, and the blue data points represent mortgage rates.

 

Next, we looked for the effect of the Russian Central Bank’s move on the Russian housing market. By raising banking deposit rates, the Russian Central Bank encouraged consumers to keep their savings in the banks and minimized the public’s exposure to financial risk. This was seen clearly in the Russian housing market following the rise in deposit rates. We noticed that rent prices increased, and the growth of commercial investment in real estate dropped at the beginning of 2015.

The first chart above represents the index of rent prices, and the second chart shows the level of investment in fixed assets.

 

This analysis led us to the understanding that the Russian economy was facing a major financial crisis, which significantly impacted demand in the Russian housing market. It also highlights the importance of using macroeconomic features in our machine learning models for predicting Russian housing prices.

The illustration on the left shows the count of transactions by product type. The illustration on the right provides a more detailed view of the same information.

 

Exploratory Data Analysis

The first step to any data science project is simple exploration and visualization of the data. Since the ultimate purpose of this competition is price prediction, it’s a good idea to visualize price trends over the time span of the training data set. The visualization below shows monthly average realty prices over time. We can see that average prices have seen fluctuations between 2011 and 2015 with an overall increase over time. There is, however, a noticeable drop from June to December of 2012.  

It is important to keep in mind that these averaged prices include apartments of different sizes; therefore, a more “standardized” measure of price would be the price per square meter over the same time span. Below we see that the average price per square meter shows fluctuations as well, though the overall trend is quite different from the previous visualization.

The data set starts off with a decline in average price per square meter, which sees somewhat of an improvement from late 2011 to mid 2012. Again we see a price drop from June to December of 2012.
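As a hedged sketch of how such a monthly price-per-square-meter series could be computed (assuming the Kaggle training file with its timestamp, price_doc and full_sq columns; file name and cut-offs are illustrative):

library(dplyr)
library(lubridate)
library(ggplot2)

train <- read.csv("train.csv", stringsAsFactors = FALSE)

monthly_psqm <- train %>%
  filter(full_sq > 0) %>%                              # guard against zero or missing areas
  mutate(month = floor_date(ymd(timestamp), "month"),
         price_psqm = price_doc / full_sq) %>%
  group_by(month) %>%
  summarise(avg_price_psqm = mean(price_psqm, na.rm = TRUE))

ggplot(monthly_psqm, aes(x = month, y = avg_price_psqm)) +
  geom_line() +
  labs(x = "Month", y = "Average price per square meter")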

Some other attributes to take into consideration when evaluating the price of an apartment are the size of the apartment and the size of the building.  

Our intuition tells us that price should be directly related to the size of an apartment, and the boxplot below shows that the median price does go up with apartment size, with the exception of “Large” apartments having a median price slightly lower than apartments of “Medium” size. Factors such as neighborhood might help in explaining this anomaly.

Apartment price as a function of building size shows similar median prices for low-rise, medium, and high-rise buildings. Larger buildings labelled “Sky” with 40 floors or more show a slightly higher median apartment price.

Another interesting variable worth taking a look at is the type of transaction or “product type” mentioned in the data set. This feature of the data categorizes each transaction as being either an Investment or an Owner Occupied purchase. The barplot below gives a breakdown of the transactions for the top sub-areas (by frequency) based on the product type.

This visualization clearly shows that certain areas are heavily owner occupied while other areas are very attractive to investors. If investment leads to increasing prices, this trend will be important for making future predictions.

If we shift our focus to transactions related to the building’s build year based on product type, we see that older buildings generally are involved in investment transactions, possibly due to better deals, while newer constructions are predominantly owner occupied.

Simple exploration and visualization of the data outlines important considerations for training our machine learning model. Our model must consider the factors behind the price dips in 2012 and learn to anticipate similar drops in future predictions. The model must also understand that low prices attract investors, as can be seen with older buildings. However, if investment leads to an increase in price for a particular building, the model must be able to adjust future predictions for that building accordingly. Similarly, the model must be able to predict future prices for a given sub-area that has seen an increase in average price over time as a result of investment.

 

Feature Selection

To simplify the feature selection process, we fitted an XGBoost model containing all of the housing features (~290 features) and called its feature importance function to determine the most important predictors of realty price. XGBoost was very efficient in doing this; it took less than 10 minutes to fit the model.
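The code for this step is not shown in the post; a minimal sketch with the R xgboost package, where the feature matrix X and the price vector y are assumptions about objects prepared earlier, might look like:

library(xgboost)

# X: numeric matrix of the ~290 housing features, y: sale prices (assumed prepared earlier)
dtrain <- xgb.DMatrix(data = X, label = log1p(y))

fit <- xgboost(data = dtrain, nrounds = 100,
               objective = "reg:linear", verbose = 0)

# Rank the features by their contribution (gain) to the fitted model
imp <- xgb.importance(feature_names = colnames(X), model = fit)
head(imp, 20)
xgb.plot.importance(imp[1:20, ])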

In the illustration below, XGBoost sorts each of the housing features by its Fscore, a measure of the variable’s importance in predicting the transaction price. Among the top 20 features identified with XGBoost, a Random Forest was used to further rank their importance. The %IncMSE metric was preferred over IncNodePurity, as our objective was to identify features that could minimize the mean squared error of our model (i.e. better predictability).
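A hedged sketch of how such a %IncMSE ranking could be produced with the randomForest package; the train_top20 data frame, holding the 20 selected features plus price_doc, is an assumption:

library(randomForest)

# train_top20: data frame with the 20 XGBoost-selected features plus price_doc (assumed)
set.seed(42)
rf_fit <- randomForest(price_doc ~ ., data = train_top20,
                       ntree = 500, importance = TRUE)

# %IncMSE: increase in MSE when a feature is permuted; larger values = more important
importance(rf_fit, type = 1)
varImpPlot(rf_fit, type = 1)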

To determine which macroeconomic features were important, we joined the transaction prices in the training set with the macroeconomic data by timestamp and then repeated the process above. The top 45 ranking features are outlined below. In this project, we experimented with different subsets of these features to fit our models.

 

Feature Engineering

Oftentimes it is crucial to generate new variables from existing ones to improve prediction accuracy. Feature engineering can serve the purpose of extrapolating information by splitting variables for model flexibility, or of dimension reduction by combining variables for model simplicity.

The graphic below shows that by averaging our chosen distance-related features into an average-distance feature, we yield a similar price trend as the original individual features.

 

For simpler models, sub areas were limited to the top 50 most frequent, with the rest classified as “other.” The timestamp variable was used to extract date and seasonal information into new variables, such as day, month, year, and season. The apartment size feature mentioned earlier during our exploratory data analysis was actually an engineered feature, derived from the apartment’s living area in square meters. This feature proved very valuable during the imputation of missing values for the number of rooms in an apartment. Similarly, building size was generated using the maximum number of floors in the given building.
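As a hedged sketch of this kind of feature engineering with dplyr and lubridate; the column names (metro_km_walk, park_km, school_km, life_sq) and the cut points are illustrative, not the team's exact choices:

library(dplyr)
library(lubridate)

train <- train %>%
  mutate(
    # Average of several distance-related features (column names are illustrative)
    avg_distance = rowMeans(select(., metro_km_walk, park_km, school_km), na.rm = TRUE),
    # Date parts extracted from the timestamp
    date   = ymd(timestamp),
    year   = year(date),
    month  = month(date),
    season = case_when(month %in% 3:5  ~ "spring",
                       month %in% 6:8  ~ "summer",
                       month %in% 9:11 ~ "autumn",
                       TRUE            ~ "winter"),
    # Apartment size bins derived from living area (cut points are illustrative)
    apt_size = cut(life_sq, breaks = c(-Inf, 30, 60, 100, Inf),
                   labels = c("Small", "Medium", "Large", "XL"))
  )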

 

Data Cleaning

Among the 45 selected features, outliers and missing values were corrected, as neither the multiple linear regression nor the gradient boosting models accept missing values.

Below is a list of the outlier corrections and missing value imputations, followed by a sketch of how such group-wise imputation can be coded.

Outlier Correction

  • Sq-related features: Imputed by mean within sub area
  • Distance-related features: Imputed by mean within sub area

Missing Value Imputation

  • Build Year: Imputed by most common build year in sub area
  • Num Room: Imputed by average number of rooms in similarly sized apartments within the sub area
  • State: Imputed using the build year and sub area  
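As referenced above, a minimal sketch of such group-wise imputation with dplyr; sub_area, build_year and full_sq follow the Kaggle file's naming, and the 250 m² outlier threshold is an assumption:

library(dplyr)

# Most common (modal) value of a vector, ignoring NAs
mode_of <- function(x) {
  x <- x[!is.na(x)]
  as.numeric(names(sort(table(x), decreasing = TRUE))[1])
}

train <- train %>%
  group_by(sub_area) %>%
  mutate(build_year = ifelse(is.na(build_year), mode_of(build_year), build_year),
         full_sq    = ifelse(is.na(full_sq) | full_sq > 250,
                             mean(full_sq[full_sq <= 250], na.rm = TRUE), full_sq)) %>%
  ungroup()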

 

Model Selection

Prior to fitting models, it is imperative to understand their strengths and weaknesses.

Outlined below are some of the pros and cons we have identified, as well as the associated libraries used to implement them.

Multiple Linear Regression (R: lm, Python: statsmodels)

  • Pros: High Interpretability, simple to implement
  • Cons: Does not handle multicollinearity well (especially if we would like to include a high number of features from this dataset)

Gradient Boosting Tree (R: gbm & caret)

  • Pros: High predictive accuracy
  • Cons: Slow to train, high number of parameters

XGBoost (Python: xgboost)

  • Pros: Even higher predictive accuracy, accepts missing values
  • Cons: Higher number of hyperparameters to tune, though relatively fast to train compared to GBM

 

Multiple Linear Regression

Despite knowing that a multiple linear regression model will not work well given the dataset’s complexity and the presence of multicollinearity, we were interested to see its predictive strength.

We began by validating the presence of multicollinearity. In the illustration below, green and red boxes indicate positive and negative correlations between two features. One of the assumptions of a multiple linear regression model is that its features should not be correlated.

Violation of Multicollinearity

We continue to examine other underlying assumptions of a multiple linear regression model. When examining scatterplots among predictors and the target value (i.e. price), it is apparent that many predictors do not share a linear relationship with the response variable. The following plots are one of many sets of plots examined during the process.

 

Violation of Linearity

 

The Residuals vs Fitted plot does not align with the assumption that error terms have the same variance no matter where they appear along the regression line (i.e. red line shows an increasing trend).

Violation of Constant Variance

 

The Normal Q-Q plot shows severe violation of normality (i.e. standardized residuals are highly deviant from the normal line).

 

Violation of Normality

The Scale-Location plot shows a violation of independent errors, as the standardized residuals increase along with the input variables.

Violation of Independent Errors
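These diagnostic panels are the standard output of R's plot() method for a fitted lm object; a minimal sketch, with a placeholder formula and data frame rather than the authors' actual model:

# Fit a linear model on the selected features (formula and data frame are placeholders)
lm_fit <- lm(log(price_doc) ~ full_sq + floor + build_year + kitch_sq + metro_km_walk,
             data = train)

# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(2, 2))
plot(lm_fit)
par(mfrow = c(1, 1))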

Despite observing violations of the underlying assumptions of multiple linear regression models, we proceeded with submitting our predictions to see how well we can score.

Our first model A1 was trained using the top 12 property-specific features identified in the feature selection process. Submitting predictions using this model gave us a Kaggle score of 0.39544 on the public leaderboard (this score is calculated based on the Root Mean Squared Logarithmic Error). Essentially, the lower the score, the more accurate the model’s predictions are. Nevertheless, it is possible to “overfit the public leaderboard”, as the score on the public leaderboard is only representative of the predictions of approximately 35% of the test set.

Models A2 and A3 were fitted by removing features from model A1 that showed VIFs (Variance Inflation Factors) greater than 5. Generally, a VIF greater than 5 indicates that multicollinearity is present among the features used in the model. Removing these features did not improve our Kaggle scores, likely because we also removed features that are important for predicting price.
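A common way to compute these VIFs is the vif() function from the car package; a sketch, assuming a fitted linear model such as the lm_fit object from the earlier sketch:

library(car)

# Variance Inflation Factors for the fitted linear model
vifs <- vif(lm_fit)
vifs

# Features typically flagged for removal: VIF greater than 5
names(vifs)[vifs > 5]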

Models B1, B2, and B3 were trained with the features in A1 with the addition of identified important macro features. These models did not improve our Kaggle scores, likely for reasons mentioned previously.

 

Gradient Boosting Model

The second machine learning method we used was the Gradient Boosting Model (GBM), which is a tree-based machine learning method that is robust in handling multicollinearity. This is important as we believe that many features provided in the dataset are significant predictors of price.

GBMs are trained sequentially: each tree is built using information from the previous trees. To avoid overfitting the model to the training dataset, we used cross-validation with the caret package in R. Knowing that this model handles complexity and multicollinearity well, we added more features to the first model we fitted.
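A hedged sketch of such a cross-validated GBM fit with caret; the train_sel data frame and the tuning grid values are assumptions, not the team's actual settings:

library(caret)
library(gbm)

set.seed(42)

# train_sel: data frame with the selected predictors plus the sale price (assumed prepared earlier)
train_sel$log_price <- log(train_sel$price_doc)
train_sel$price_doc <- NULL

ctrl <- trainControl(method = "cv", number = 5)
gbm_grid <- expand.grid(n.trees = c(200, 500),
                        interaction.depth = c(3, 5),
                        shrinkage = 0.05,
                        n.minobsinnode = 10)

gbm_fit <- train(log_price ~ ., data = train_sel,
                 method = "gbm",
                 trControl = ctrl,
                 tuneGrid = gbm_grid,
                 verbose = FALSE)

gbm_fit$bestTune    # cross-validated choice of tree count and depth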

In Model A1, we chose 47 features from the original training dataset, which gave us a Kaggle score of 0.38056. Next, we wanted to check if reducing the complexity of the model gives better results. We decided to run another model (A2) with only 20 features. This model performed better than Model A1.

In Model A3, we added to the previous model 5 macro features that were chosen using XGBoost feature importance. As expected, the macro features improved our results. This model gave us the best Kaggle score of all the gradient boosting models we trained.

For each of the GBM models, we used cross-validation to tune the parameters (with R’s caret package). On average, the GBM models gave us better predictions than the multiple linear regression models.

 

XGBoost

An extreme form of gradient boosting, XGBoost is the preferred tool for Kaggle competitions due to its predictive accuracy and speed. The quick training speed is a result of allowing residuals to “roll over” into the next tree. Careful selection of hyperparameters such as the learning rate and max depth gave us our best Kaggle prediction score. Fine tuning of the hyperparameters can be achieved through extensive cross-validation; however, caution is recommended when tuning too many parameters, as it can be computationally expensive and time consuming. Below is an example of a simple grid search cross validation.
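As a sketch of what such a grid search might look like in R with xgb.cv; the dtrain object from the earlier feature-importance sketch and the grid values are assumptions, not the team's actual settings:

library(xgboost)

grid <- expand.grid(eta = c(0.05, 0.1), max_depth = c(4, 6))
results <- data.frame()

for (i in seq_len(nrow(grid))) {
  params <- list(objective = "reg:linear",
                 eta = grid$eta[i],
                 max_depth = grid$max_depth[i])
  # 5-fold cross-validation with early stopping on the test RMSE
  cv <- xgb.cv(params = params, data = dtrain,
               nrounds = 300, nfold = 5,
               early_stopping_rounds = 20, verbose = 0)
  best <- cv$evaluation_log[cv$best_iteration, ]
  results <- rbind(results,
                   data.frame(eta = grid$eta[i],
                              max_depth = grid$max_depth[i],
                              rmse = best$test_rmse_mean,
                              nrounds = cv$best_iteration))
}

# Parameter combinations ranked by cross-validated RMSE
results[order(results$rmse), ]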

 

Prediction Scores

The table below outlines the best public scores we have received using each of the different models, along with the features used.

 

Future Directions

This two-week exercise provided us the opportunity to experience first-hand what it is like to work in a team of data scientists. Given the luxury of additional time, we would dedicate more of it to engineering new features to improve our model predictions. This would require us to develop deeper industry knowledge of Russia’s housing economy.

 

Acknowledgement

We thank the entire team of instructors at NYCDSA for their guidance during this two-week project and also our fellow Kagglers who published kernels with their useful insights.

The post A Data Scientist’s Guide to Predicting Housing Prices in Russia appeared first on NYC Data Science Academy Blog.



Single Word Analysis of Early 19th Century Poetry Using tidytext

Mon, 06/12/2017 - 13:14

(This article was first published on Pareto's Playground, and kindly contributed to R-bloggers)

When reading poetry, it feels as if you are reading emotion. A good poet can use words to bring to their reader whichever emotions they choose. To do this, these poets often have to look deep within themselves, using their own lives as fuel. For this reason, you would expect that poets, and as a result their poetry, would be influenced by the events happening around them. If certain events caused the poet’s view of the world to be an unhappy one, you’d expect their poetry to be unhappy, and of course the same for the opposite.

In this analysis I attempt to show this relation between world events and the poets of those times.

Gathering the required data – this being poems and their dates – proved to be a very difficult task. I had initially hoped to be able to show an analysis of the entire 20th century (this may still happen when I find more time and/or sources), but in the end I settled for the years between 1890 – 1925. I managed to collect an overall number of 97 poems (a few of which are excluded from the analysis due to being written in 1888 or 1889) from 6 of the most well known poets of the time. These poets being:

  • William Yeats
  • T. S. Eliot
  • Robert Frost
  • Wallace Stevens
  • E. E. Cummings
  • D. H. Lawrence

I will only be going through the preparation code for Yeats. The rest is similar but not exactly the same, and can be found in my github repository (https://github.com/Laurence-Sonnenberg/single_word_poetry_analysis) if anyone would like to re-create anything I have done. The actual analysis is done using all poets/poems combined.

Firstly I load the packages I’m going to need. I have included a small function that other people might find useful.

load_pkg <- function(packagelist) {
  # Check if any package needs installation:
  PackagesNeedingInstall <- packagelist[!(packagelist %in% installed.packages()[, "Package"])]
  if (length(PackagesNeedingInstall)) {
    cat("\nInstalling necessary packages: ",
        paste0(PackagesNeedingInstall, collapse = ", "), "\n")
    install.packages(PackagesNeedingInstall, dependencies = T,
                     repos = "http://cran-mirror.cs.uu.nl/")
  }
  # load packages into R:
  x <- suppressWarnings(lapply(packagelist, require, character.only = TRUE, quietly = T))
}

pkg_list <- c("tidyverse", "tidytext", "stringr", "rvest", "lubridate", "purrr")
load_pkg(pkg_list)

##
## Installing necessary packages:  tidytext

The website I use for scraping has a page for each poet which lists the poems available; it then has a separate page for each of the poems in the list, along with their dates. I have chosen this website specifically because having each poem dated is a must for my analysis. I run an initial scrape using rvest to get the names, which are then cleaned so they can be used in the URLs to get the poems and dates. Some names are not exactly the same as their URL versions, so they need to be manually changed.

# page url
poemNameUrl <- "http://www.poetry-archive.com/y/yeats_w_b.html"

# scrape for poem names
poemName <- poemNameUrl %>%
  read_html() %>%
  html_nodes("a font") %>%
  html_text()

# clean names
poemName <- poemName[1:50] %>%
  str_replace_all(pattern = "\r", replacement = "") %>%
  str_replace_all(pattern = "\n", replacement = " ") %>%
  str_replace_all(pattern = "[ ]{2}", replacement = "") %>%
  str_replace_all(pattern = "[[:punct:]]", replacement = "") %>%
  tolower()

# hardcode 2 poem names to fit in with used url name
poemName[9] <- "he mourns for the change"
poemName[24] <- "the old men admiring themselves"

# display
head(poemName, 10)

##  [1] "at galway races"                 "the blessed"
##  [3] "the cap and bells"               "the cat and the moon"
##  [5] "down by the salley gardens"      "dream of a blessed spirit"
##  [7] "the falling of the leaves"       "he bids his beloved be at peace"
##  [9] "he mourns for the change"        "he remembers forgotten beauty"

Next I need to scrape the website for the poems and their dates. To do this I use a function that takes a poem name – and again using rvest – scrapes for the given poem and its date, then cleans the text and waits before returning. The wait is so that I don’t send too many requests to the website in quick succession.

# function to scrape for poem and its date using poem names
GetPoemAndDate <- function(poemName) {
  # split string at empty space and unlist the result
  nameVec <- unlist(str_split(poemName, pattern = " "))
  # use result in url
  url <- str_c("http://www.poetry-archive.com/y/", paste(nameVec, collapse = "_"), ".html")
  # scrape for poem and clean
  poem <- url %>%
    read_html() %>%
    html_nodes("dl") %>%
    html_text() %>%
    str_replace_all(pattern = "\r", replacement = "") %>%
    str_replace_all(pattern = "\n", replacement = " ") %>%
    str_replace_all(pattern = "[ ]{2}", replacement = "")
  # scrape for date
  date <- url %>%
    read_html() %>%
    html_nodes("td font") %>%
    html_text()
  # clean dates
  date <- date[3] %>%
    str_extract(pattern = "[0-9]+")
  # pause before function return
  Sys.sleep(runif(1, 0, 1))
  return(list(poem = poem, date = date))
}

I then use the previous function in a for loop which loops through each poem’s name and scrapes for that particular poem and its date. With each iteration I add the data frame of poem and date to a list. Once the for loop has completed, I rbind all the data frames together. A complete data frame of all poems and dates will be needed for the next step.

# get poems and dates
poemDataFrame <- list()
count <- 1
for (name in poemName) {
  # get poem and date for given poem name
  poemNameDate <- GetPoemAndDate(name)
  # create data frame of name and date and add it to list
  poemDataFrame[[count]] <- data.frame(poem = poemNameDate$poem,
                                       date = poemNameDate$date,
                                       stringsAsFactors = FALSE)
  count <- count + 1
}

# rbind all poems and dates
poemDataFrame <- do.call(rbind, poemDataFrame)

I then combine the names from the original scrape, the dates, and the poems into a single data frame so they will be ready for me to work with. I also take a look at all the text, checking for and correcting any errors that could negatively affect my analysis. In this case I have found an error in a date.

# create data frame of names, poems and dates
poemDataFrame <- cbind(poemName, poemDataFrame)

# hardcode single date with error
poemDataFrame$date[40] <- "1916"

After this, we have a dataset of dimension 50×3 with column headings poemName, poem and date.

The next function I have written takes a row index and, using the combined data frame of poems, names and dates along with tidytext, tokenizes the poem found at that index into single words. It then uses anti_join from dplyr, along with the stop_words dataset from tidytext, to create a tibble from which stop words, which would not be helpful in the analysis, have been removed. This list of words for each poem is then used to create a sentiment score. This is done using inner_join and the sentiments dataset filtered on the bing lexicon, again from tidytext. The function calculates the sentiment score for the poem from the number of negative and positive words in the list: negative words get a value of -1 and positive words a value of 1; these values are then summed and the result returned as a fraction of the number of words in that particular poem.

# function to analyse poems and return scores using bing lexicon bingAnalysePoems <- function(i) { # get poem at index i poem <- data_frame(poem = poemDataFrame$poem[i]) # tokenize into words textTokenized <- poem %>% unnest_tokens(word, poem) # load stop words dataset data("stop_words") # anti join on stop words tidyPoem <- textTokenized %>% anti_join(stop_words) # filter on bing lexicon bing <- sentiments %>% filter(lexicon == "bing") %>% select(-score) # join on bing to get whether words are positive or negative poemSentiment <- tidyPoem %>% inner_join(bing) # get score for poem poemSentiment <- poemSentiment %>% mutate(score = ifelse(poemSentiment$sentiment == "positive", 1, -1)) # scale the summed score by the number of words in the poem (times 10) finalScore <- (sum(poemSentiment$score)/length(textTokenized$word))*10 return(finalScore) }

Now, using the previous function and sapply, I get the score for each poem and create a data frame with each poem's score and its date. These scores will be used later to show the change in sentiment for each year.

# get scores scores <- sapply(1:length(poemDataFrame$poem), bingAnalysePoems) # create data frame with data and scores dateAndScore <- data.frame(scores) %>% mutate(date = year(ymd(str_c(poemDataFrame$date, "/01/01")))) # display head(dateAndScore, 10) ## scores date ## 1 -0.20618557 1910 ## 2 -0.03649635 1899 ## 3 0.16528926 1899 ## 4 -0.13157895 1919 ## 5 0.32258065 1921 ## 6 0.31250000 1921 ## 7 -0.42253521 1889 ## 8 -0.73394495 1899 ## 9 -0.23255814 1899 ## 10 -0.44025157 1899

The next function I have written takes a particular sentiment (which in this case can also be called an emotion) from the set of sentiments found in the NRC lexicon of the sentiments dataset. It again creates a tibble of words with the stop words removed and, using semi_join and the sentiments dataset filtered on the NRC lexicon and the given sentiment, returns the number of words relating to that sentiment. These sums will be used to show which emotions were high (or low) in a particular year.

# NRC analysis function nrcAnalysePoems <- function(sent) { # nrcPoem used from global environment textTokenized <- nrcPoem %>% unnest_tokens(word, poem) # load stop words dataset data("stop_words") tidyPoem <- textTokenized %>% anti_join(stop_words) # filter on NRC lexicon and sentiment nrcSentiment <- sentiments %>% filter(lexicon == "nrc", sentiment == sent) # join on sentiment and count words sentimentInPoem <- tidyPoem %>% semi_join(nrcSentiment) %>% count(word, sort = TRUE) # return the sum of the counted words return(sum(sentimentInPoem$n)) }

Using the above function and sapply, I get the sums for each different sentiment in my sentiments list; this is done within a for loop, which loops through each poem, creating a data frame with each row consisting of a sentiment, the sum of words for that sentiment, the name of the poem the sentiment relates to, and the poem's date. I then add this data frame to a list which will be used in my next step.

# list of used sentiments found in the NRC lexicon sentimentsVec <- c("anger", "anticipation", "disgust", "fear", "joy","sadness", "surprise", "trust") # create empty list nrcAnalysisDataFrame <- list() # get the word-count sum for each NRC sentiment, for every poem for (i in 1:length(poemDataFrame$poem)) { # poem at index i - to be used in nrcAnalysePoems function environment nrcPoem <- data_frame(poem = poemDataFrame$poem[i]) # create data frame for each poem of all sentiment sums nrcDataFrame <- as.data.frame(sapply(sentimentsVec, nrcAnalysePoems)) %>% rownames_to_column() %>% select(sentiment = rowname, value = 2) %>% mutate(name = poemDataFrame$poemName[i], date = poemDataFrame$date[i]) # add data frame to list nrcAnalysisDataFrame[[i]] <- nrcDataFrame }

Once the for loop has completed I rbind all the data frames in the list so that I can work with all of them together as a whole.

# rbind list of all NRC sum values for all poems nrcAnalysisDataFrame <- do.call(rbind, nrcAnalysisDataFrame) # display head(nrcAnalysisDataFrame, 10) ## sentiment value name date ## 1 anger 1 at galway races 1910 ## 2 anticipation 4 at galway races 1910 ## 3 disgust 2 at galway races 1910 ## 4 fear 4 at galway races 1910 ## 5 joy 2 at galway races 1910 ## 6 sadness 4 at galway races 1910 ## 7 surprise 2 at galway races 1910 ## 8 trust 2 at galway races 1910 ## 9 anger 0 the blessed 1899 ## 10 anticipation 9 the blessed 1899

Now all the preparation code is done and we are ready to begin analyzing. For ease of use I have created a function for each poet to run all the code shown previously; the functions return both the bing and the NRC data frames, ready to be used. These functions are run but will not be shown here.
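Those per-poet wrapper functions are not reproduced in the post, so the following is only a hypothetical sketch of what one of them might look like; GetPoemDataFrame() and GetNrcDataFrame() are made-up helper names standing in for the scraping and NRC steps shown above.

# hypothetical sketch (not the author's actual code) of a per-poet wrapper
Yeats <- function() {
  # the scoring helpers above read poemDataFrame from the global environment,
  # so the wrapper assigns it there
  poemDataFrame <<- GetPoemDataFrame("yeats")   # hypothetical: scrape, clean, cbind names/poems/dates
  bingDF <- data.frame(scores = sapply(seq_along(poemDataFrame$poem), bingAnalysePoems)) %>%
    mutate(date = year(ymd(str_c(poemDataFrame$date, "/01/01"))))
  nrcDF <- GetNrcDataFrame(poemDataFrame)       # hypothetical: wraps the NRC for loop above
  list(bingDF, nrcDF)                           # element 1 = bing scores, element 2 = NRC sums
}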

Firstly I will need to call each poet’s function and assign the returned data frames to variables so that I can work with them as needed.

yeats <- Yeats() eliot <- Eliot() frost <- Frost() wallace <- Wallace() cummings <- Cummings() lawrence <- Lawrence() poets <- list(yeats, eliot, frost, wallace, cummings, lawrence)

Having collected all the poems from our six poets, we plot the count of poems per year for each poet in our selected sample:
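The plotting code for this count chart is not included in the post; one possible version, assuming each poet's returned list holds the NRC data frame (one row per poem, sentiment, name and date) as its second element, might look like this:

library(purrr)
library(dplyr)
library(ggplot2)

poetNames <- c("yeats", "eliot", "frost", "wallace", "cummings", "lawrence")

poemCounts <- poets %>%
  map(~distinct(.x[[2]], name, date)) %>%          # one row per poem
  map2_df(poetNames, ~mutate(.x, poet = .y)) %>%   # tag each row with its poet
  count(poet, date)

ggplot(poemCounts, aes(x = as.numeric(date), y = n, fill = poet)) +
  geom_col() +
  labs(x = "year", y = "number of poems")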

From this we see that Yeats dominates the earlier period in our sample, while Frost plays a large part in the second half of the sample of text.

Next I will join all the bing scores for all poems using rbind, so that I can work with all the data at once. I then group by date and get a mean score for each year in the complete data frame. This returns a tibble of scores and dates which I can now plot.

# rbind data frames and summarize completedataFramebing <- poets %>% map(~.x %>% .[[1]] %>% data.frame) %>% reduce(rbind) %>% filter(date >= 1890) %>% group_by(date) %>% summarize(score = mean(scores)) # display head(completedataFramebing, 10) ## # A tibble: 10 × 2 ## date score ## ## 1 1893 -0.23988213 ## 2 1899 -0.31855413 ## 3 1903 -0.03143456 ## 4 1910 -0.09117054 ## 5 1915 -0.20194924 ## 6 1916 -0.29473096 ## 7 1917 -0.57692308 ## 8 1918 -0.50458716 ## 9 1919 -0.16981618 ## 10 1920 -0.14955234

I use multiplot so that I can show two different plots side by side.

# plot bar plot p1 <- ggplot(data = completedataFramebing) + geom_bar(aes(x = date, y = score, fill = score), stat = "identity", position = "identity") + ylim(-1, 1) + ylab("mean percentage score") + xlab("year") + theme(legend.key.size = unit(0.5, "cm"), legend.text = element_text(size = 7), legend.position = c(0.87, 0.77)) + ggtitle("Bar Plot of Percentage Scores") + theme(plot.title = element_text(size = 12)) # plot dot and line plot p2 <- ggplot(data = completedataFramebing) + geom_line(aes(x = date, y = score), colour = "blue") + geom_point(aes(x = date, y = score), size = 1.3, colour = "blue") + ylim(-1, 1) + ylab("mean percentage score") + xlab("year") + ggtitle("Dot and Line Plot of Percentage Scores") + theme(plot.title = element_text(size = 12)) # use multiplot function # can be found at http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/ multiplot(p1, p2, cols = 2)
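As an alternative to copying the multiplot() helper from the Cookbook for R link above, the gridExtra package provides a similar side-by-side arrangement:

# drop-in alternative to multiplot(): arrange the two ggplot objects in two columns
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)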

The 0 values in the bar plot are not all actual 0 values, but rather years where no information is available. Looking at the plots we can see a steady decline over the period 1890-1900; then, due to a lack of data, very little can be seen from 1900 to 1915, after which there is again a massive decline between 1915 and 1920 and a small drop between 1920 and 1925.

We can also look at the individual poets, to determine which of them portrayed their sorrow the most through their poems. To do this we plot the average score per poet.
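The per-poet plotting code is also not shown; a possible sketch, again assuming the bing score data frame is the first element of each poet's list and reusing poetNames from above, could be:

meanScores <- poets %>%
  map(~.x[[1]]) %>%
  map2_df(poetNames, ~mutate(.x, poet = .y)) %>%
  group_by(poet) %>%
  summarize(meanScore = mean(scores))

ggplot(meanScores, aes(x = reorder(poet, meanScore), y = meanScore, fill = poet)) +
  geom_col() +
  guides(fill = FALSE) +
  labs(x = "poet", y = "mean score")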

From the plot we see that Lawrence has the most negative mean score, and with poems such as Giorno dei Morti (Day of the Dead) it comes as no surprise!

Finally, let's do the NRC analysis. Firstly, I join together all the poets' NRC data frames using rbind, again so I can work with all the data at once. I then group them by date and sentiment, sum the sentiment values for each poem and use this sum to get percentage values for each sentiment in each year. Once I have done this the data frame is ready to be plotted.

# rbind data frames and summarize yeatsnrc <- poets %>% map(~.x %>% .[[2]]) %>% reduce(rbind) %>% filter(date >= 1890) %>% group_by(date, sentiment) %>% summarise(dateSum = sum(value)) %>% mutate(dateFreqPerc = dateSum/sum(dateSum)) head(yeatsnrc, 10) ## Source: local data frame [10 x 4] ## Groups: date [2] ## ## date sentiment dateSum dateFreqPerc ## ## 1 1893 anger 25 0.10775862 ## 2 1893 anticipation 38 0.16379310 ## 3 1893 disgust 11 0.04741379 ## 4 1893 fear 38 0.16379310 ## 5 1893 joy 42 0.18103448 ## 6 1893 sadness 34 0.14655172 ## 7 1893 surprise 8 0.03448276 ## 8 1893 trust 36 0.15517241 ## 9 1899 anger 17 0.06390977 ## 10 1899 anticipation 46 0.17293233

I use facet_wrap on the date so that I can display all the plots for all the years at once.

# plot multiple NRC barplots using facet wrap ggplot(data = yeatsnrc, aes(x = sentiment, y = dateFreqPerc, fill = sentiment)) + geom_bar(stat = "identity") + scale_fill_brewer(palette = "Spectral") + guides(fill = FALSE) + facet_wrap(~date, ncol = 2) + theme(axis.text.x = element_text( angle = 45, hjust = 1, size = 8)) + theme(axis.title.y = element_text(margin = margin(0, 10, 0, 0))) + ylab("frequency percentage") + ggtitle("Bar Plots for Sentiment Percentages Per Year") + theme(plot.title = element_text(size = 14))

Looking at these plots we can see how sentiment changes through the years of available data. During the early 1890s we see roughly equal percentages of anticipation, fear, joy, sadness and trust, while later we see anticipation, joy and trust. In 1903 joy takes the lead, but in 1910 and 1915 we see anticipation and trust increasing, with fear and joy equal. In 1916 fear and sadness are high, and in 1917 anger, fear, sadness and trust are notably high. From 1918 to 1919 we see a change from negative to positive, with joy taking a big lead in 1919 and again in 1920. 1921 sees sadness again, with the other sentiments coming in at mostly equal percentages, and 1922 shows anticipation, joy and trust.

I am not a historian, but fortunately my sister majored in history, so I asked her to send me a timeline of influential world events between the years this analysis is based on. I have used this timeline to explain the findings.

From both the bing and NRC analyses it can be seen that there was a very apparent negative trend during the years 1915-1920. This is very possibly due to a number of world events, namely World War I (1914-1918), the influenza pandemic which lasted from 1918 to 1919, and the Russian Revolution in 1917. Another negative trend appears at the end of the 19th century; this time it is possibly due to the Boer War, which lasted from 1899 to 1902 and had repercussions across Europe, as well as the Indian famine which killed an estimated one million people during 1899 and 1900.

A few things to be mentioned:

  1. As we all know, the popular artists of any time move with popular trends, so it could be that the poets chosen, being the most popular during the years this analysis is based on, may have been more likely to be influenced by the events around them.
  2. Due to the surprisingly difficult task of finding poetry with enough information, this analysis has been based on only a small number of poems

Taking the above into account, it is clear more work needs to be done; though the findings from this brief analysis do show a possible relation between poetry and the events of the time.




Superstorm Sandy at Barnegat Bay Revisted

Mon, 06/12/2017 - 00:17

(This article was first published on AdventuresInData, and kindly contributed to R-bloggers)

Animations of continuous data in GIF format offer some portability advantages over video files.  A few years ago, shortly after Superstorm Sandy, a colleague and I developed a video of animated water surface elevations from USGS gages in Barnegat Bay, NJ as the eye of the storm approached.  That version used video screen capture and MS Excel VBA – you can see it here.

With the pending 5 year anniversary of Sandy and the R animation package, the time was right to revisit the animation as a GIF.

Sources of data include:
USGS 01409335 Little Egg Inlet near Tuckerton, NJ
USGS 01409146 East Thorofare at Ship Bottom, NJ
USGS 01409110 Barnegat Bay at Waretown, NJ
USGS 01408205 Barnegat Bay at Route 37 bridge near Bay Shore NJ
USGS 01408168 Barnegat Bay at Mantoloking , NJ
USGS 01408043 Point Pleasant Canal at Point Pleasant NJ
USGS 01408050 Manasquan River at Point Pleasant NJ
NOAA National Data Buoy Center Station 44009 (LLNR 168) – DELAWARE BAY 26 NM Southeast of Cape May, NJ

See and download the full length GIF at https://media.giphy.com/media/xUA7beSsKPUPI3V41a/giphy.gif
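The post does not include the R code itself, but a minimal sketch of the general approach might pull one of the gages above with the dataRetrieval package and wrap a plotting loop in animation::saveGIF(). The date window and the instantaneous-value column name below are assumptions, not taken from the original analysis.

library(dataRetrieval)   # USGS web services
library(animation)       # saveGIF() requires ImageMagick to be installed

# gage height (parameter 00065) at Barnegat Bay at Waretown, NJ, around landfall
gage <- readNWISuv(siteNumbers = "01409110",
                   parameterCd = "00065",
                   startDate = "2012-10-27",
                   endDate   = "2012-10-31")

# the instantaneous value column is typically named X_00065_00000;
# check names(gage) if your download differs
saveGIF({
  for (i in seq(10, nrow(gage), by = 25)) {
    plot(gage$dateTime[1:i], gage$X_00065_00000[1:i], type = "l",
         xlim = range(gage$dateTime), ylim = range(gage$X_00065_00000),
         xlab = "", ylab = "gage height (ft)",
         main = "Barnegat Bay at Waretown, NJ")
  }
}, movie.name = "sandy_waretown.gif", interval = 0.1)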



Weather forecast with regression models – part 3

Sun, 06/11/2017 - 23:47

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

In the previous part of this tutorial, we built several models based on logistic regression. One aspect to be considered further is the tuning of the decision threshold, which may help in reaching a better accuracy. We are going to show a procedure to determine an optimal value for this purpose. ROC plots will be introduced as well.

Decision Threshold Tuning

By default, caret uses 0.5 as the decision threshold, i.e. the probability cutoff value. We are going to show how to tune this decision threshold to achieve a better training set accuracy. We then compare how the testing set accuracy changes accordingly.

Tuning Analysis

We load the previously saved training and testing datasets, together with the models we have already determined.

suppressPackageStartupMessages(library(caret)) suppressPackageStartupMessages(library(ROCR)) set.seed(1023) readRDS(file="wf_log_reg_part2.rds")

The following tuning procedure explores a pool of cutoff values, giving back a table of {cutoff, accuracy, specificity} triplets. Inspecting it allows us to identify the cutoff value providing the highest accuracy. In case of a tie in accuracy, we choose the lowest such cutoff value.

glm.tune <- function(glm_model, dataset) { results <- data.frame() for (q in seq(0.2, 0.8, by = 0.02)) { fitted_values <- glm_model$finalModel$fitted.values prediction <- ifelse(fitted_values >= q, "Yes", "No") cm <- confusionMatrix(prediction, dataset$RainTomorrow) accuracy <- cm$overall["Accuracy"] specificity <- cm$byClass["Specificity"] results <- rbind(results, data.frame(cutoff=q, accuracy=accuracy, specificity = specificity)) } rownames(results) <- NULL results }
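Not part of the original post, but if you prefer to pick the cutoff programmatically rather than by inspecting the table, a small helper over the glm.tune() output could be:

# returns the cutoff with the highest accuracy; which.max() keeps the first
# (i.e. lowest) cutoff in case of ties, matching the rule stated above
pick.cutoff <- function(results) {
  results$cutoff[which.max(results$accuracy)]
}

# e.g. pick.cutoff(glm.tune(mod9am_c1_fit, training))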

In the utility function above, we also included the specificity metric. Specificity is defined as TN/(TN+FP); we will discuss some aspects of it in the next part of this series. Let us show an example of a confusion matrix with some metric computations to clarify the terminology.

          Reference
Prediction  No Yes
       No  203  42
       Yes   0   3

Accuracy : 0.8306 = (TP+TN)/(TP+TN+FP+FN) = (203+3)/(203+3+42+0)
Sensitivity : 1.00000 = TP/(TP+FN) = 203/(203+0)
Specificity : 0.06667 = TN/(TN+FP) = 3/(3+42)
Pos Pred Value : 0.82857 = TP/(TP+FP) = 203/(203+42)
Neg Pred Value : 1.00000 = TN/(TN+FN) = 3/(3+0)

Please note that the “No” column is associated with a positive outcome, whilst “Yes” is associated with a negative one. Specificity measures how good we are at predicting {RainTomorrow = “Yes”} when that is actually the case, and this is something that we are going to investigate in the next post of this series.

The logistic regression models to be tuned are:

mod9am_c1_fit: RainTomorrow ~ Cloud9am + Humidity9am + Pressure9am + Temp9am
mod3pm_c1_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm
mod_ev_c2_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDir
mod_ev_c3_fit: RainTomorrow ~ Pressure3pm + Sunshine

In the following analysis, for all models we determine the cutoff value providing the highest accuracy.

glm.tune(mod9am_c1_fit, training) cutoff accuracy specificity 1 0.20 0.7822581 0.73333333 2 0.22 0.7862903 0.71111111 3 0.24 0.7943548 0.68888889 4 0.26 0.8064516 0.64444444 5 0.28 0.8104839 0.64444444 6 0.30 0.8145161 0.60000000 7 0.32 0.8185484 0.57777778 8 0.34 0.8185484 0.51111111 9 0.36 0.8104839 0.46666667 10 0.38 0.8266129 0.46666667 11 0.40 0.8266129 0.40000000 12 0.42 0.8306452 0.40000000 13 0.44 0.8266129 0.37777778 14 0.46 0.8346774 0.35555556 15 0.48 0.8467742 0.33333333 16 0.50 0.8508065 0.31111111 17 0.52 0.8427419 0.26666667 18 0.54 0.8467742 0.26666667 19 0.56 0.8346774 0.20000000 20 0.58 0.8387097 0.20000000 21 0.60 0.8387097 0.20000000 22 0.62 0.8346774 0.17777778 23 0.64 0.8387097 0.17777778 24 0.66 0.8346774 0.15555556 25 0.68 0.8387097 0.15555556 26 0.70 0.8387097 0.13333333 27 0.72 0.8387097 0.13333333 28 0.74 0.8346774 0.11111111 29 0.76 0.8266129 0.06666667 30 0.78 0.8266129 0.06666667 31 0.80 0.8306452 0.06666667

The optimal cutoff value is equal to 0.5. In this case, we find the same cutoff value as the default used by the caret package for logistic regression models.

opt_cutoff <- 0.5 pred_test <- predict(mod9am_c1_fit, testing, type = "prob") prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No") confusionMatrix(prediction, testing$RainTomorrow) Confusion Matrix and Statistics Reference Prediction No Yes No 80 12 Yes 6 7 Accuracy : 0.8286 95% CI : (0.7427, 0.8951) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.4602 Kappa : 0.3405 Mcnemar's Test P-Value : 0.2386 Sensitivity : 0.9302 Specificity : 0.3684 Pos Pred Value : 0.8696 Neg Pred Value : 0.5385 Prevalence : 0.8190 Detection Rate : 0.7619 Detection Prevalence : 0.8762 Balanced Accuracy : 0.6493 'Positive' Class : No

Then, tuning the 3PM model.

glm.tune(mod3pm_c1_fit, training) cutoff accuracy specificity 1 0.20 0.8064516 0.7777778 2 0.22 0.8185484 0.7555556 3 0.24 0.8225806 0.7333333 4 0.26 0.8306452 0.6888889 5 0.28 0.8467742 0.6666667 6 0.30 0.8467742 0.6444444 7 0.32 0.8427419 0.6222222 8 0.34 0.8669355 0.6222222 9 0.36 0.8709677 0.6222222 10 0.38 0.8629032 0.5777778 11 0.40 0.8669355 0.5777778 12 0.42 0.8669355 0.5555556 13 0.44 0.8548387 0.4666667 14 0.46 0.8548387 0.4444444 15 0.48 0.8588710 0.4444444 16 0.50 0.8669355 0.4444444 17 0.52 0.8629032 0.4222222 18 0.54 0.8669355 0.4222222 19 0.56 0.8669355 0.3777778 20 0.58 0.8669355 0.3777778 21 0.60 0.8588710 0.3333333 22 0.62 0.8548387 0.3111111 23 0.64 0.8508065 0.2888889 24 0.66 0.8467742 0.2666667 25 0.68 0.8387097 0.2222222 26 0.70 0.8387097 0.2222222 27 0.72 0.8346774 0.2000000 28 0.74 0.8387097 0.1777778 29 0.76 0.8346774 0.1555556 30 0.78 0.8346774 0.1555556 31 0.80 0.8306452 0.1111111

The optimal cutoff value is equal to 0.36.

opt_cutoff <- 0.36 pred_test <- predict(mod3pm_c1_fit, testing, type="prob") prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No") confusionMatrix(prediction, testing$RainTomorrow) Confusion Matrix and Statistics Reference Prediction No Yes No 81 5 Yes 5 14 Accuracy : 0.9048 95% CI : (0.8318, 0.9534) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.0112 Kappa : 0.6787 Mcnemar's Test P-Value : 1.0000 Sensitivity : 0.9419 Specificity : 0.7368 Pos Pred Value : 0.9419 Neg Pred Value : 0.7368 Prevalence : 0.8190 Detection Rate : 0.7714 Detection Prevalence : 0.8190 Balanced Accuracy : 0.8394 'Positive' Class : No

We improved the test accuracy, which was previously 0.8857 and is now 0.9048 (please take a look at part #2 of this tutorial for the test accuracy with the 0.5 cutoff value). Then, the first evening-hours model.

glm.tune(mod_ev_c2_fit, training) cutoff accuracy specificity 1 0.20 0.8387097 0.8444444 2 0.22 0.8467742 0.8222222 3 0.24 0.8548387 0.8222222 4 0.26 0.8629032 0.8000000 5 0.28 0.8588710 0.7333333 6 0.30 0.8709677 0.7333333 7 0.32 0.8790323 0.7111111 8 0.34 0.8830645 0.6888889 9 0.36 0.8870968 0.6666667 10 0.38 0.8870968 0.6666667 11 0.40 0.8951613 0.6666667 12 0.42 0.8991935 0.6666667 13 0.44 0.8991935 0.6666667 14 0.46 0.8951613 0.6444444 15 0.48 0.8951613 0.6222222 16 0.50 0.8951613 0.6222222 17 0.52 0.8951613 0.6000000 18 0.54 0.8911290 0.5777778 19 0.56 0.8911290 0.5777778 20 0.58 0.8951613 0.5777778 21 0.60 0.8911290 0.5555556 22 0.62 0.8870968 0.5111111 23 0.64 0.8830645 0.4888889 24 0.66 0.8830645 0.4666667 25 0.68 0.8790323 0.4222222 26 0.70 0.8709677 0.3555556 27 0.72 0.8548387 0.2666667 28 0.74 0.8508065 0.2444444 29 0.76 0.8427419 0.1777778 30 0.78 0.8387097 0.1555556 31 0.80 0.8387097 0.1555556

The optimal cutoff value is equal to 0.42.

opt_cutoff <- 0.42 pred_test <- predict(mod_ev_c2_fit, testing, type="prob") prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No") confusionMatrix(prediction, testing$RainTomorrow) Confusion Matrix and Statistics Reference Prediction No Yes No 82 8 Yes 4 11 Accuracy : 0.8857 95% CI : (0.8089, 0.9395) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.04408 Kappa : 0.58 Mcnemar's Test P-Value : 0.38648 Sensitivity : 0.9535 Specificity : 0.5789 Pos Pred Value : 0.9111 Neg Pred Value : 0.7333 Prevalence : 0.8190 Detection Rate : 0.7810 Detection Prevalence : 0.8571 Balanced Accuracy : 0.7662 'Positive' Class : No

We did not improve the accuracy, as the default cutoff value provides the same result. Other metrics changed; in particular, we notice that the sensitivity decreased a little while the positive predicted value improved. Then, the second evening-hours model.

glm.tune(mod_ev_c3_fit, training) cutoff accuracy specificity 1 0.20 0.8225806 0.8000000 2 0.22 0.8145161 0.7333333 3 0.24 0.8064516 0.6666667 4 0.26 0.8145161 0.6666667 5 0.28 0.8145161 0.6222222 6 0.30 0.8185484 0.6222222 7 0.32 0.8266129 0.6000000 8 0.34 0.8185484 0.5555556 9 0.36 0.8306452 0.5333333 10 0.38 0.8467742 0.5333333 11 0.40 0.8467742 0.5111111 12 0.42 0.8427419 0.4666667 13 0.44 0.8508065 0.4666667 14 0.46 0.8467742 0.4222222 15 0.48 0.8467742 0.4000000 16 0.50 0.8508065 0.4000000 17 0.52 0.8548387 0.4000000 18 0.54 0.8508065 0.3777778 19 0.56 0.8669355 0.3777778 20 0.58 0.8669355 0.3555556 21 0.60 0.8629032 0.3333333 22 0.62 0.8588710 0.3111111 23 0.64 0.8548387 0.2888889 24 0.66 0.8508065 0.2666667 25 0.68 0.8467742 0.2444444 26 0.70 0.8387097 0.2000000 27 0.72 0.8387097 0.1777778 28 0.74 0.8346774 0.1555556 29 0.76 0.8306452 0.1333333 30 0.78 0.8306452 0.1333333 31 0.80 0.8306452 0.1333333

The optimal cutoff value is equal to 0.56.

opt_cutoff <- 0.56 pred_test <- predict(mod_ev_c3_fit, testing, type="prob") prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No") confusionMatrix(prediction, testing$RainTomorrow) Confusion Matrix and Statistics Reference Prediction No Yes No 82 8 Yes 4 11 Accuracy : 0.8857 95% CI : (0.8089, 0.9395) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.04408 Kappa : 0.58 Mcnemar's Test P-Value : 0.38648 Sensitivity : 0.9535 Specificity : 0.5789 Pos Pred Value : 0.9111 Neg Pred Value : 0.7333 Prevalence : 0.8190 Detection Rate : 0.7810 Detection Prevalence : 0.8571 Balanced Accuracy : 0.7662 'Positive' Class : No

In this case, we did not improve the accuracy of this model. Other metrics changed; in particular, we notice that the sensitivity is slightly higher than with the default cutoff (0.9419), and the positive predicted value decreased a little (with the default cutoff it was 0.9205).

ROC analysis

To have a visual understanding of the classification performances and compute further metrics, we are going to take advantage of the ROCR package facilities. We plot ROC curves and determine the corresponding AUC value. That is done for each model. The function used to accomplish this task is the following.

glm.perf.plot <- function (prediction, cutoff) { perf <- performance(prediction, measure = "tpr", x.measure = "fpr") par(mfrow=(c(1,2))) plot(perf, col="red") grid() perf <- performance(prediction, measure = "acc", x.measure = "cutoff") plot(perf, col="red") abline(v = cutoff, col="green") grid() auc_res <- performance(prediction, "auc") auc_res@y.values[[1]] }

In the following, each model will be considered, AUC value printed out and ROC plots given.

mod9am_pred_prob <- predict(mod9am_c1_fit, testing, type="prob") mod9am_pred_resp <- prediction(mod9am_pred_prob$Yes, testing$RainTomorrow) glm.perf.plot(mod9am_pred_resp, 0.5) [1] 0.8004896

mod3pm_pred_prob <- predict(mod3pm_c1_fit, testing, type="prob") mod3pm_pred_resp <- prediction(mod3pm_pred_prob$Yes, testing$RainTomorrow) glm.perf.plot(mod3pm_pred_resp, 0.36) [1] 0.9155447

The 3PM model shows the highest AUC value among all the models.

mod_ev_c2_prob <- predict(mod_ev_c2_fit, testing, type="prob") mod_ev_c2_pred_resp <- prediction(mod_ev_c2_prob$Yes , testing$RainTomorrow) glm.perf.plot(mod_ev_c2_pred_resp, 0.42) [1] 0.8390453

mod_ev_c3_prob <- predict(mod_ev_c3_fit, testing, type="prob") mod_ev_c3_pred_resp <- prediction(mod_ev_c3_prob$Yes, testing$RainTomorrow) glm.perf.plot(mod_ev_c3_pred_resp, 0.56) [1] 0.8886169

Conclusions

By tuning the decision threshold, we were able to improve training and testing set accuracy. It turned out the 3PM model achieved a satisfactory accuracy. ROC plots gave us an understanding of how true/false positive rates vary and what is the accuracy obtained by a specific decision threshold value (i.e. cutoff value). Ultimately, AUC values were reported.

If you have any questions, please feel free to comment below.

    Related Post

    1. Weather forecast with regression models – part 2
    2. Weather forecast with regression models – part 1
    3. Weighted Linear Support Vector Machine
    4. Logistic Regression Regularized with Optimization
    5. Analytical and Numerical Solutions to Linear Regression Problems



    Data Visualization with googleVis exercises part 2

    Sun, 06/11/2017 - 18:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    Area, Stepped & Combo charts

    In the second part of our series we are going to meet three more googleVis charts. More specifically these charts are Area Chart, Stepped Area Chart and Combo Chart.

    Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let's begin!

    Answers to the exercises are available here.

    Package & Data frame

    As you already know, the first thing you have to do is install and load the googleVis package with:
    install.packages("googleVis")
    library(googleVis)

    Secondly we will create an experimental data frame which will be used for our charts’ plotting. You can create it with:
    df=data.frame(name=c("James", "Curry", "Harden"),
    Pts=c(20,23,34),
    Rbs=c(13,7,9))

    NOTE: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page.

    Area chart

    It is quite simple to create an area chart with googleVis with:
    AreaC <- gvisAreaChart(df)
    plot(AreaC)

    Exercise 1

    Create a list named “AreaC” and pass to it the “df” data frame you just created as an area chart. HINT: Use gvisAreaChart().

    Exercise 2

    Plot the area chart. HINT: Use plot().

    Stepped Area chart

    Creating a stepped area chart is a little different than the area chart. You have to set the X and Y variables and also make it stacked in order to display the values correctly. Here is an example:
    SteppedAreaC <- gvisSteppedAreaChart(df, xvar="name",
    yvar=c("val1", "val2"),
    options=list(isStacked=TRUE))
    plot(SteppedAreaC)

    Exercise 3

    Create a list named “SteppedAreaC” and pass to it the “df” data frame you just created as a stepped area chart. You should also set the X variable as the players’ names and Y variable as their values. HINT: Use gvisSteppedAreaChart(), xvar and yvar.

    Exercise 4

    Plot the stepped area chart. HINT: Use plot().

    Learn more about using GoogleVis in the online course Mastering in Visualization with R programming. In this course you will learn how to:

    • Work extensively with the GoogleVis package and its functionality
    • Learn what visualizations exist for your specific use case
    • And much more

    Exercise 5

    Now transform your stepped area chart to stacked to correct it and plot it.

    Combo chart

    The next chart we are going to meet is the combination of lines and bars chart known as combo chart. You can produce it like this:
    ComboC <- gvisComboChart(df, xvar="country",
    yvar=c("val1", "val2"),
    options=list(seriesType="bars",
    series='{1: {type:"line"}}'))
    plot(ComboC)

    Exercise 6

    Create a list named “ComboC” and pass to it the “df” data frame you just created as a combo chart. You should also set the X variable as the players’ names and Y variable as their values. HINT: Use gvisComboChart(), xvar and yvar.

    Exercise 7

    Plot the chart. What kind of chart do you see? HINT: Use plot().

    In order to add the bars, we have to set the options as in the example below.
    options=list(seriesType="bars",
    series='{1: {type:"line"}}')

    Exercise 8

    Transform the chart you just created into a combo chart with bars and lines and plot it. HINT: Use list().

    Exercise 9

    In the previous exercise “Pts” are represented by the bars and “Rbs” by the lines. Try to reverse them.

    You can easily transform your combo chart into a column chart or a line chart just by setting series='{1: {type:"line"}}' to 2.

    Exercise 10

    Transform the combo chart into a column chart and then into a line chart.

    Related exercise sets:
    1. Data Visualization with googleVis exercises part 1
    2. Getting started with Plotly: basic Plots
    3. Shiny Application Layouts exercises part 3
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory



    Exit poll for June 2017 election (UK)

    Sun, 06/11/2017 - 16:13

    (This article was first published on R – Let's Look at the Figures, and kindly contributed to R-bloggers)


    It has been a while since I posted anything here, but I can’t resist this one.

    Let me just give three numbers.  The first two are:

    • 314, the number of seats predicted for the largest party (Conservatives) in the UK House of Commons, at 10pm on Thursday (i.e., before even a single vote had been counted), from the exit poll commissioned jointly by broadcasters BBC, ITV and Sky.
    • 318, the actual number of seats that were won by the Conservatives, now that all the votes have been counted.

    That highly accurate prediction changed the whole story on election night: most of the pre-election voting intention polls had predicted a substantial Conservative majority.  (And certainly that’s what Theresa May had expected to achieve when she made the mistake of calling a snap election, 3 years early.)  But the exit poll prediction made it pretty clear that the Conservatives would either not achieve a majority (for which 326 seats would be needed), or at best would be returned with a very small majority such as the one they held before the election.  Media commentary turned quickly to how a government might be formed in the seemingly likely event of a hung Parliament, and what the future might be for Mrs May.  The financial markets moved quite substantially, too, in the moments after 10pm.

    For more details on the exit poll, its history, and the methods used to achieve that kind of predictive accuracy, see Exit Polling Explained.

    The third number I want to mention here is

    • 2.1.0

    That’s the version of R that I had at the time of the 2005 General Election, when I completed the development of a fairly extensive set of R functions to use in connection with the exit poll (which at that time was done for BBC and ITV jointly).  Amazingly (to me!) the code that I wrote back in 2001–2005 still works fine.  My friend and former colleague Jouni Kuha, who stepped in as election-day statistician for the BBC when I gave it up after 2005, told me today that (with some tweaks, I presume!) it all works brilliantly still, as the basis for an extremely high-pressure data analysis on election day/night.  Very pleasing indeed; and strong testimony to the heroic efforts of the R Core Development Team, to keep everything stable with a view to the long term.

    As suggested by that kind tweet reproduced above from the RSS President, David Spiegelhalter: Thursday’s performance was quite a triumph for the practical art and science of Statistics.  [And I think I am allowed to say this, since on this occasion I was not even there!  The credit for Thursday’s work goes to Jouni Kuha, along with John Curtice, Steve Fisher and the rest of the academic team of analysts who worked in the secret exit-poll “bunker” on 8 June.]




    Machine Learning Powered Biological Network Analysis

    Sun, 06/11/2017 - 15:50

    (This article was first published on r-bloggers – Creative Data Solutions, and kindly contributed to R-bloggers)

    Metabolomic network analysis can be used to interpret experimental results within a variety of contexts including: biochemical relationships, structural and spectral similarity and empirical correlation. Machine learning is useful for modeling relationships in the context of pattern recognition, clustering, classification and regression based predictive modeling. The combination of developed metabolomic networks and machine learning based predictive models offer a unique method to visualize empirical relationships while testing key experimental hypotheses. The following presentation focuses on data analysis, visualization, machine learning and network mapping approaches used to create richly mapped metabolomic networks. Learn more at www.createdatasol.com

    The following presentation also shows a sneak peek of a new data analysis and visualization software, DAVe: Data Analysis and Visualization engine. Check out some early features. DAVe is built in R and seeks to support a seamless environment for advanced data analysis and machine learning tasks and biological functional and network analysis. As an aside, building the main site (in progress) was a fun opportunity to experiment with Jekyll, Ruby and embedding slick interactive canvas elements into websites. You can check out all the code here https://github.com/dgrapov/CDS_jekyll_site.

    slides: https://www.slideshare.net/dgrapov/machine-learning-powered-metabolomic-network-analysis




    “smooth” package for R. Common ground. Part I. Prediction intervals

    Sun, 06/11/2017 - 15:23

    (This article was first published on R – Modern Forecasting, and kindly contributed to R-bloggers)

    We have spent the previous six posts discussing the basics of the es() function (underlying models and their implementation). Now it is time to move forward. Starting from this post we will discuss common parameters, shared by all the forecasting functions implemented in smooth. This means that the topics we discuss are applicable not only to es(), but also to ssarima(), ces(), ges() and sma(). However, given that we have only discussed ETS so far, we will use es() in our examples for now.

    And I would like to start this series of general posts with the topic of prediction intervals.

    Prediction intervals for smooth functions

    One of the features of the smooth functions is their ability to produce different types of prediction intervals. Parametric prediction intervals (triggered by intervals="p", intervals="parametric" or intervals=TRUE) are derived analytically only for pure additive and pure multiplicative models and are based on the state-space model discussed in previous posts. In the current smooth version (v2.0.0) only the es() function has multiplicative components; all the other functions are based on additive models. This makes es() “special”. While constructing intervals for pure models (either additive or multiplicative) is relatively easy to do, the mixed models cause pain in the arse (one of the reasons why I don’t like them). So in case of mixed ETS models, we have to use several tricks.

    If the model has multiplicative error, non-multiplicative other components (trend, seasonality) and low variance of the error (smaller than 0.1), then the intervals can be approximated by similar models with additive error term. For example, the intervals for ETS(M,A,N) can be approximated with intervals of ETS(A,A,N), when the variance is low, because the distribution of errors in both models will be similar. In all the other cases we use simulations for prediction intervals construction (via the sim.es() function). In this case the data is generated with preset parameters (including variance) and contains \(h\) observations. This process is repeated 10,000 times, resulting in 10,000 possible trajectories. After that the necessary quantiles of these trajectories for each step ahead are taken using the quantile() function from the stats package and returned as prediction intervals. This cannot be considered as a pure parametric approach, but it is the closest we have.

    The smooth functions also introduce semiparametric and nonparametric prediction intervals. Both of them are based on multiple-steps-ahead (also sometimes called “trace”) forecast errors. These are obtained by producing forecasts for horizons 1 to \(h\) from each observation of the time series. As a result a matrix with \(h\) columns and \(T-h\) rows is produced. In case of semiparametric intervals (called using intervals="sp" or intervals="semiparametric"), variances of forecast errors for each horizon are calculated and then used in order to extract quantiles of either the normal or the log-normal distribution (depending on error type). This way we cover possible violations of the assumptions of homoscedasticity and no autocorrelation in the residuals, but we still assume that each separate observation has some parametric distribution.
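    To make the construction above concrete, here is a generic toy illustration of the semiparametric idea (this is not smooth's internal code): given a matrix of in-sample 1 to h steps ahead forecast errors, estimate a standard deviation per horizon and build normal-quantile bounds around the point forecasts.

    set.seed(42)
    h <- 4
    # toy matrix of trace forecast errors: one column per horizon, variance grows with horizon
    traceErrors <- matrix(rnorm(400 * h, sd = rep(1:h, each = 400)), ncol = h)
    pointForecast <- rep(100, h)             # toy point forecasts
    sigma_h <- apply(traceErrors, 2, sd)     # per-horizon standard deviation
    level <- 0.95
    z <- qnorm(1 - (1 - level) / 2)
    cbind(lower = pointForecast - z * sigma_h,
          forecast = pointForecast,
          upper = pointForecast + z * sigma_h)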

    In case of nonparametric prediction intervals (defined in R via intervals="np" or intervals="nonparametric"), we loosen the assumptions further, dropping the part about the distribution of residuals. In this case quantile regressions are used, as proposed by Taylor and Bunn, 1999. However we use a different form of regression model than the authors do:
    \begin{equation} \label{eq:ssTaylorPIs}
    \hat{e}_{j} = a_0 j ^ {a_{1}},
    \end{equation}
    where \(j = 1, .., h\) is the forecast horizon. This function has an important advantage over the second-order polynomial proposed by the authors: it does not have an extremum (turning point) for \(j>0\), which means that the intervals won’t behave strangely after several observations ahead. Using polynomials for intervals sometimes leads to weird bounds (for example, expanding and then shrinking). On the other hand, the power function allows producing a wide variety of forecast trajectories, which correspond to differently increasing or decreasing bounds of prediction intervals (depending on the values of \(a_0\) and \(a_1\)), without producing any ridiculous trajectories.

    The main problem with the nonparametric intervals produced by smooth is caused by the quantile regressions, which do not behave well on small samples. In order to produce a correct 0.95 quantile, we need to have at least 20 observations, and if we want the 0.99 quantile, then the sample must contain at least 100. In cases when there are not enough observations, the produced intervals can be inaccurate and may not correspond to the nominal level values.

    As a small note, if a user produces only a one-step-ahead forecast, then the semiparametric interval will correspond to the parametric one (because only the variance of the one-step-ahead error is used), and the nonparametric interval is constructed using the quantile() function from the stats package.

    Finally, the width of prediction intervals is regulated by the parameter level, which can be written either as a fraction ( level=0.95 ) or as an integer number less than 100 ( level=95 ). I personally prefer the former, but the latter is needed for consistency with the forecast package functions. By default all the smooth functions produce 95% prediction intervals.

    There are some other features of prediction interval construction for specific intermittent models and cumulative forecasts, but they will be covered in upcoming posts.

    Examples in R

    We will use a time series N1241 as an example and we will estimate model ETS(A,Ad,N). Here’s how we do that:

    ourModel1 <- es(M3$N1241$x, "AAdN", h=8, holdout=TRUE, intervals="p") ourModel2 <- es(M3$N1241$x, "AAdN", h=8, holdout=TRUE, intervals="sp") ourModel3 <- es(M3$N1241$x, "AAdN", h=8, holdout=TRUE, intervals="np")

    The resulting graphs demonstrate some differences in prediction interval widths and shapes:

    Series N1241 from M3, es() forecast, parametric prediction intervals

    Series N1241 from M3, es() forecast, semiparametric prediction intervals

    Series N1241 from M3, es() forecast, nonparametric prediction intervals

    All of them cover the actual values in the holdout, because the intervals are very wide. It is not obvious which of them is the most appropriate for this task. So we can calculate the spread of the intervals and see which of them is wider on average:

    mean(ourModel1$upper-ourModel1$lower) mean(ourModel2$upper-ourModel2$lower) mean(ourModel3$upper-ourModel3$lower)

    Which results in:

    950.4171 955.0831 850.614

    In this specific example, the non-parametric interval appeared to be the narrowest, which is good, given that it adequately covered the values in the holdout sample. However, this doesn’t mean that it is in general superior to the other methods. Selection of the appropriate intervals should be done based on a general understanding of which assumptions are violated. If we didn’t know the actual values in the holdout sample, then we could make a decision based on an analysis of the in-sample residuals in order to get a clue about the violation of any assumptions. This can be done, for example, this way:

    forecast::tsdisplay(ourModel1$residuals) hist(ourModel1$residuals) qqnorm(ourModel3$residuals) qqline(ourModel3$residuals)

    Linear plot and correlation functions of the residuals of the ETS(A,Ad,N) model

    Histogram of the residuals of the ETS(A,Ad,N) model

    Q-Q plot of the residuals of the ETS(A,Ad,N) model

    The first plot shows how residuals change over time and how the autocorrelation and partial autocorrelation functions look for this time series. There is no obvious autocorrelation and no obvious heteroscedasticity in the residuals. This means that we can assume that these conditions are not violated in the model, so there is no need to use semiparametric prediction intervals. However, the second and the third graphs demonstrate that the residuals are not normally distributed (as assumed by the model ETS(A,Ad,N)). This means that parametric prediction intervals may be wrong for this time series. All of this motivates the usage of nonparametric prediction intervals for the series N1241.

    That’s it for today.




    Likelihood calculation for the g-and-k distribution

    Sun, 06/11/2017 - 00:33

    (This article was first published on R – Statisfaction, and kindly contributed to R-bloggers)

     

    Histogram of 1e5 samples from the g-and-k distribution, and overlaid probability density function

     

    Hello,

    An example often used in the ABC literature is the g-and-k distribution (e.g. reference [1] below), which is defined through the inverse of its cumulative distribution function (cdf). It is easy to simulate from such distributions by drawing uniform variables and applying the inverse cdf to them. However, since there is no closed-form formula for the probability density function (pdf) of the g-and-k distribution, the likelihood is often considered intractable. It has been noted in [2] that one can still numerically compute the pdf, by 1) numerically inverting the quantile function to get the cdf, and 2)  numerically differentiating the cdf, using finite differences, for instance. As it happens, this is very easy to implement, and I coded up an R tutorial at:

    github.com/pierrejacob/winference/blob/master/inst/tutorials/tutorial_gandk.pdf

    for anyone interested. This is part of the winference package that goes with our tech report on ABC with the Wasserstein distance  (joint work with Espen Bernton, Mathieu Gerber and Christian Robert, to be updated very soon!). This enables standard MCMC algorithms for the g-and-k example. It is also very easy to compute the likelihood for the multivariate extension of [3], since it only involves a fixed number of one-dimensional numerical inversions and differentiations (as opposed to a multivariate inversion).
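    For illustration only (the linked tutorial is the authoritative version), here is a self-contained sketch of the two-step trick, using uniroot() to invert a standard parameterisation of the g-and-k quantile function (with c = 0.8) and a central finite difference for the derivative; the parameter values in the check are a choice often seen in the ABC literature, not results from the paper.

    # sketch only: numerically invert the quantile function to get the cdf, then
    # differentiate the cdf by finite differences to get the pdf
    gk_quantile <- function(p, A, B, g, k, c = 0.8) {
      # standard g-and-k quantile form; tanh(g*z/2) = (1 - exp(-g*z)) / (1 + exp(-g*z))
      z <- qnorm(p)
      A + B * (1 + c * tanh(g * z / 2)) * (1 + z^2)^k * z
    }

    gk_cdf <- function(x, A, B, g, k) {
      sapply(x, function(xi)
        uniroot(function(p) gk_quantile(p, A, B, g, k) - xi,
                interval = c(1e-10, 1 - 1e-10), tol = 1e-12)$root)
    }

    gk_pdf <- function(x, A, B, g, k, eps = 1e-5) {
      (gk_cdf(x + eps, A, B, g, k) - gk_cdf(x - eps, A, B, g, k)) / (2 * eps)
    }

    # quick visual check against simulated draws (A = 3, B = 1, g = 2, k = 0.5)
    draws <- gk_quantile(runif(1e5), A = 3, B = 1, g = 2, k = 0.5)
    hist(draws, breaks = 200, freq = FALSE)
    curve(gk_pdf(x, A = 3, B = 1, g = 2, k = 0.5), from = 1, to = 20, add = TRUE, col = "red")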

    Surprisingly, most of the papers that present the g-and-k example do not compare their ABC approximations to the posterior; instead, they typically compare the proposed ABC approach to existing ones. Similarly, the so-called Ricker model is commonly used in the ABC literature, and its posterior can be tackled efficiently using particle MCMC methods; as well as the M/G/1 model, which can be tackled either with particle MCMC methods or with tailor-made MCMC approaches such as [4].

    These examples can still have great pedagogical value in ABC papers, but it would perhaps be nice to see more comparisons to the ground truth when it’s available; ground truth here being the actual posterior distribution.

    1. Fearnhead, P. and Prangle, D. (2012) Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B, 74, 419–474.
    2. Rayner, G. D. and MacGillivray, H. L. (2002) Numerical maximum likelihood estimation for the g-and-k and generalized g-and-h distributions. Statistics and Computing, 12, 57–75.
    3. Drovandi, C. C. and Pettitt, A. N. (2011) Likelihood-free Bayesian estimation of multivari- ate quantile distributions. Computational Statistics & Data Analysis, 55, 2541–2556.
    4. Shestopaloff, A. Y. and Neal, R. M. (2014) On Bayesian inference for the M/G/1 queue with efficient MCMC sampling. arXiv preprint arXiv:1401.5548.




    R Weekly Bulletin Vol – XI

    Sat, 06/10/2017 - 20:48

    (This article was first published on R programming, and kindly contributed to R-bloggers)

    This week’s R bulletin will cover topics on how to round to the nearest desired number, converting and comparing dates and how to remove last x characters from an element.

    We will also cover functions like rank, mutate, transmute, and set.seed. Hope you like this R weekly bulletin. Enjoy reading!

    Shortcut Keys

    1. Comment/uncomment current line/selection – Ctrl+Shift+C
    2. Move Lines Up/Down – Alt+Up/Down
    3. Delete Line – Ctrl+D

    Problem Solving Ideas Rounding to the nearest desired number

    Consider a case where you want to round a given number to the nearest 25. This can be done in the following manner:

    round(145/25) * 25

    [1] 150

    floor(145/25) * 25

    [1] 125

    ceiling(145/25) * 25

    [1] 150

    Usage:
    Assume you are calculating a stop loss or take profit for an NSE stock in which the minimum tick is 5 paisa. In such a case, we divide and multiply by 0.05 to achieve the desired outcome.

    Example:

    Price = 566 Stop_loss = 1/100 # without rounding SL = Price * Stop_loss print(SL)

    [1] 5.66

    # with rounding to the nearest 0.05 SL1 = floor((Price * Stop_loss)/0.05) * 0.05 print(SL1)

    [1] 5.65

    How to remove last n characters from every element

    To remove the last n characters we will use the substr function along with the nchar function. The example below illustrates the way to do it.

    Example:

    # In this case, we just want to retain the ticker name which is "TECHM" symbol = "TECHM.EQ-NSE" s = substr(symbol,1,nchar(symbol)-7) print(s)

    [1] “TECHM”

    Converting and Comparing dates in different formats

    When we pull stock data from Google finance the date appears as “YYYYMMDD”, which is not recognized as a date-time object. To convert it into a date-time object we can use the “ymd” function from the lubridate package.

    Example:

    library(lubridate) x = ymd(20160724) print(x)

    [1] “2016-07-24”

    Another data provider gives stock data which has the date-time object in the American format (mm/dd/yyyy). When we read the file, the date-time column is read as a character. We need to convert this into a date-time object. We can convert it using the as.Date function and by specifying the format.

dt = "07/24/2016"
y = as.Date(dt, format = "%m/%d/%Y")
print(y)

[1] "2016-07-24"

# Comparing the two date-time objects (from Google Finance and the data provider) after conversion
identical(x, y)

    [1] TRUE
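As a side note, lubridate also provides an mdy function that parses the American month/day/year format directly, so the explicit format string is not strictly needed. A quick sketch, assuming lubridate is already loaded as above:

# Parse the American-format string directly with lubridate
y2 <- mdy("07/24/2016")
print(y2)
# [1] "2016-07-24"

identical(x, y2)  # should be TRUE, since both are Date objects for 2016-07-24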

    Functions Demystified

    rank function

    The rank function returns the sample ranks of the values in a vector. Ties (i.e., equal values) and
    missing values can be handled in several ways.

rank(x, na.last = TRUE, ties.method = c("average", "first", "random", "max", "min"))

    where,
    x: numeric, complex, character or logical vector
    na.last: for controlling the treatment of NAs. If TRUE, missing values in the data are put last; if FALSE, they are put first; if NA, they are removed; if “keep” they are kept with rank NA
    ties.method: a character string specifying how ties are treated

    Examples:

x <- c(3, 5, 1, -4, NA, Inf, 90, 43)
rank(x)

    [1] 3 4 2 1 8 7 6 5

    rank(x, na.last = FALSE)

    [1] 4 5 3 2 1 8 7 6
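The examples above contain no ties, so ties.method has no visible effect there. A short sketch (not from the original bulletin) of how the different ties.method options treat tied values:

y <- c(10, 20, 20, 30)

rank(y, ties.method = "average")  # [1] 1.0 2.5 2.5 4.0  (ties share the mean rank)
rank(y, ties.method = "first")    # [1] 1 2 3 4          (ties ranked in order of appearance)
rank(y, ties.method = "min")      # [1] 1 2 2 4          (ties get the smallest rank)
rank(y, ties.method = "max")      # [1] 1 3 3 4          (ties get the largest rank)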

    mutate and transmute functions

The mutate and transmute functions are part of the dplyr package. The mutate function computes new variables using the existing variables of a given data frame, and the new variables are added to that data frame. The transmute function, on the other hand, returns a data frame containing only the newly created variables, dropping the existing ones.

    Consider the data frame “df” given in the example below. Suppose we have 5 observations of 1-minute price data for a stock, and we want to create a new variable by subtracting the mean from the 1-minute closing prices. It can be done in the following manner using the mutate function.

    Example:

library(dplyr)

OpenPrice = c(520, 521.35, 521.45, 522.1, 522)
ClosePrice = c(521, 521.1, 522, 522.25, 522.4)
Volume = c(2000, 3500, 1750, 2050, 1300)

df = data.frame(OpenPrice, ClosePrice, Volume)
print(df)

df_new = mutate(df, cpmean_diff = ClosePrice - mean(ClosePrice, na.rm = TRUE))
print(df_new)

# If we want the new variable as a separate data frame, we can use the transmute function instead.
df_new = transmute(df, cpmean_diff = ClosePrice - mean(ClosePrice, na.rm = TRUE))
print(df_new)

    set.seed function

The set.seed function helps generate the same sequence of random numbers every time the program runs. It sets the random number generator to a known state. The function takes a single integer argument; using the same integer reproduces the same initial state.

    Example:

# Initialize the random number generator to a known state and generate five random numbers
set.seed(100)
runif(5)

    [1] 0.30776611 0.25767250 0.55232243 0.05638315 0.46854928

# Reinitialize to the same known state and generate the same five 'random' numbers
set.seed(100)
runif(5)

    [1] 0.30776611 0.25767250 0.55232243 0.05638315 0.46854928

    Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will cover more useful tips, methods, and R functions for our readers.

    Download the PDF Now!



    Density-Based Clustering Exercises

    Sat, 06/10/2017 - 17:50

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)


Density-based clustering is a technique that partitions data into groups with similar characteristics (clusters) without requiring the number of those groups to be specified in advance. In density-based clustering, clusters are defined as dense regions of data points separated by low-density regions. Density is measured by the number of data points within some radius.
    Advantages of density-based clustering:

    • as mentioned above, it does not require a predefined number of clusters,
    • clusters can be of any shape, including non-spherical ones,
    • the technique is able to identify noise data (outliers).

    Disadvantages:

    • density-based clustering fails if there are no density drops between clusters,
    • it is also sensitive to parameters that define density (radius and the minimum number of points); proper parameter setting may require domain knowledge.

There are different methods of density-based clustering. The most popular are DBSCAN (density-based spatial clustering of applications with noise), which assumes constant density of clusters; OPTICS (ordering points to identify the clustering structure), which allows for varying density; and mean-shift.
This set of exercises covers basic techniques for using the DBSCAN method, and allows you to compare its results to those of the k-means clustering algorithm by means of silhouette analysis.
    The set requires the packages dbscan, cluster, and factoextra to be installed. The exercises make use of the iris data set, which is supplied with R, and the wholesale customers data set from the University of California, Irvine (UCI) machine learning repository (download here).
    Answers to the exercises are available here.
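To give an idea of what the workflow looks like before you start, here is a minimal sketch of the DBSCAN steps on the iris data. The eps value of 0.7 is purely illustrative and is not meant as the answer to the exercises:

library(dbscan)
library(factoextra)

# Numeric variables only, standardized
iris_num <- as.data.frame(scale(iris[, -5]))

# Inspect 5-nearest-neighbor distances to pick an epsilon value
kNNdistplot(iris_num, k = 5)
abline(h = 0.7, lty = 2)  # illustrative threshold

# Run DBSCAN and plot the clusters
db <- dbscan(iris_num, eps = 0.7, minPts = 5)
fviz_cluster(db, data = iris_num, geom = "point", ellipse = FALSE)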

    Exercise 1
    Create a new data frame using all but the last variable from the iris data set, which is supplied with R.

    Exercise 2
    Use the scale function to normalize values of all variables in the new data set (with default settings). Ensure that the resulting object is of class data.frame.

    Exercise 3
    Plot the distribution of distances between data points and their fifth nearest neighbors using the kNNdistplot function from the dbscan package.
    Examine the plot and find a tentative threshold at which distances start increasing quickly. On the same plot, draw a horizontal line at the level of the threshold.

    Exercise 4
    Use the dbscan function from the package of the same name to find density-based clusters in the data. Set the size of the epsilon neighborhood at the level of the found threshold, and set the number of minimum points in the epsilon region equal to 5.
    Assign the value returned by the function to an object, and print that object.

    Exercise 5
    Plot the clusters with the fviz_cluster function from the factoextra package. Choose the geometry type to draw only points on the graph, and assign the ellipse parameter value such that an outline around points of each cluster is not drawn.
    (Note that the fviz_cluster function produces a 2-dimensional plot. If the data set contains two variables those variables are used for plotting, if the number of variables is bigger the first two principal components are drawn.)

    Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

    • Delve into various algorithms for classification such as KNN and see how they are applied in R
    • Evaluate k-Means, Connectivity, Distribution, and Density based clustering
    • And much more

    Exercise 6
    Examine the structure of the cluster object obtained in Exercise 4, and find the vector with cluster assignments. Make a copy of the data set, add the vector of cluster assignments to the data set, and print its first few lines.

    Exercise 7
    Now look at what happens if you change the epsilon value.

    1. Plot again the distribution of distances between data points and their fifth nearest neighbors (with the kNNdistplot function, as in Exercise 3). On that plot, draw horizontal lines at levels 1.8, 0.5, and 0.4.
    2. Use the dbscan function to find clusters in the data with the epsilon set at these values (as in Exercise 4).
3. Plot the results (as in Exercise 5, but now set the ellipse parameter value such that an outline around points is drawn).

    Exercise 8
    This exercise shows how the DBSCAN algorithm can be used as a way to detect outliers:

    1. Load the Wholesale customers data set, and delete all variables with the exception of Fresh and Milk. Assign the data set to the customers variable.
    2. Discover clusters using the steps from Exercises 2-5: scale the data, choose an epsilon value, find clusters, and plot them. Set the number of minimum points to 5. Use the db_clusters_customers variable to store the output of the dbscan function.

    Exercise 9
    Compare the results obtained in the previous exercise with the results of the k-means algorithm. First, find clusters using this algorithm:

1. Use the same data set, but get rid of outliers for both variables (here the outliers may be defined as values beyond 2.5 standard deviations from the mean; note that the values are already expressed in units of standard deviation from the mean). Assign the new data set to the customers_core variable.
2. Use the kmeans function to obtain an object with cluster assignments. Set the number of centers equal to 4, and the number of initial random sets (the nstart parameter) equal to 10. Assign the obtained object to the km_clusters_customers variable.
    3. Plot clusters using the fviz_cluster function (as in the previous exercise).

    Exercise 10
    Now compare the results of DBSCAN and k-means using silhouette analysis:

    1. Retrieve a vector of cluster assignments from the db_clusters_customers object.
    2. Calculate distances between data points in the customers data set using the dist function (with the default parameters).
    3. Use the vector and the distances object as inputs into the silhouette function from the cluster package to get a silhouette information object.
    4. Plot that object with the fviz_silhouette function from the factoextra package.
5. Repeat the steps described above for the km_clusters_customers object and the customers_core data set.
    6. Compare two plots and the average silhouette width values.
    Related exercise sets:
    1. Data science for Doctors: Cluster Analysis Exercises
    2. Hierarchical Clustering exercises (beginner)
    3. Building Shiny App exercises part 7
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory

    UK 2017 General Election Results Data

    Sat, 06/10/2017 - 12:53

    (This article was first published on fReigeist » R, and kindly contributed to R-bloggers)

As the reality of a hung parliament starts to sink in, economists, political scientists and commentators will begin their usual routine of “post mortem” analysis of the surprise result of the UK 2017 general election. My co-authors Sascha Becker and Dennis Novy have done a similar exercise studying the EU Referendum last year [see also here] and have worked on the question of whether migration contributed to an erosion of pro-EU sentiment [see also here].

For people wanting to get to work straight away, there are a few things slowing us down. The last constituency, Kensington, was not called until last night and so I don’t expect the UK’s Election Commission to post the final tally of votes across all constituencies anytime before next week. Nevertheless, the crude election results data can be “scraped” from some infographics. This post describes how…

    The Economist’s Infographics

The Economist, among other newspapers, provides a very nice infographic – behind that infographic lies a web service that can be queried and returns JSON-formatted results.

    Each Parliamentary constituency has an identifier code that can be used to query the web service and pull the results. The URL for a request is quite simple:

    http://infographics.economist.com/2017/ukelmap-2017/data/live-results2017/r2017E14000937.json

    This provides the results for the constituency Cambridgeshire, South East. The JSON object looks as follows

    resultCB({"swing": -3.84, "mpn": "Lucy Frazer", "electorate": "86121", "lib": 11958, "id": "E14000937", "name": "Cambridgeshire, South East", "lab": 17443, "con": 33601, "status": "hold", "pa_key": "123", "oth": 0, "region": "East Of England", "win": "con", "turnout": "63002"})

    This piece of Javascript calls a function resultCB that updates one of the views of the infographic.

    In order to convert this to an R data frame, we can use the RJSONIO or jsonlite package functions fromJSON, after having removed the part that calls the function, i.e.

library(jsonlite)

as.data.frame(fromJSON(gsub("\\)$", "", gsub("resultCB\\(", "",
  readLines(con = "http://infographics.economist.com/2017/ukelmap-2017/data/live-results2017/r2017E14000937.json")))))

##          id pa_key oth                       name win status swing   lib
## 1 E14000937    123   0 Cambridgeshire, South East con   hold -3.84 11958
##            region         mpn electorate turnout   lab   con
## 1 East Of England Lucy Frazer      86121   63002 17443 33601

    In order to build a data.frame of all election results, all that is necessary is to loop over the set of constituency codes available. I share the results from this step in the following spreadsheet Data for UK 2017 General Election Results (Economist Infographic).
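A sketch of that loop is shown below. The vector of constituency codes here is hypothetical (only three example identifiers); in practice the full list has to be collected from the infographic first:

library(jsonlite)

# Hypothetical subset of constituency identifiers
codes <- c("E14000937", "E14000938", "E14000939")

base_url <- "http://infographics.economist.com/2017/ukelmap-2017/data/live-results2017/r2017%s.json"

results <- do.call(rbind, lapply(codes, function(code) {
  txt <- readLines(con = sprintf(base_url, code), warn = FALSE)
  as.data.frame(fromJSON(gsub("\\)$", "", gsub("resultCB\\(", "", txt))))
}))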


    Schedule for useR!2017 now available

    Fri, 06/09/2017 - 21:17

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

The full schedule of talks for useR!2017, the global R user conference, has now been posted. The conference will feature 16 tutorials, 6 keynotes, 141 full talks, and 86 lightning talks starting on July 5 in Brussels. That's a lot to fit into 4 days, but I'm especially looking forward to the keynote presentations.

    I'm also pleased to be attending with several of my Microsoft colleagues. You can find our talks below.

    I hope you can attend too! Registration is still open if you'd like to join in. You can find the complete schedule linked below.

    Sched: useR!2017

     


    Big Data Manipulation in R Exercises

    Fri, 06/09/2017 - 17:50

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)


Sometimes it is necessary to download really big csv files to deliver an analysis. When you hit file sizes in gigabytes, it is useful to use R instead of spreadsheets. This set of exercises teaches us how to manipulate this kind of file.

    Answers to the exercises are available here.
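Before diving in, here is a rough sketch of the pattern these exercises build on: read several csv files into a list, filter each one, and bind them together. The file names and the year column (Ref_Date) below are assumptions for illustration only, not the exercise answers:

library(dplyr)

# Hypothetical file names; the exercises use the files listed in Exercise 2
files <- c("00540002-eng", "00540005-eng")

data <- lapply(files, function(f) {
  read.csv(paste0(f, ".csv"), stringsAsFactors = FALSE) %>%
    filter(Ref_Date == 2014)   # column name is an assumption
})

all_data <- bind_rows(data)
write.csv(all_data, "income_2014.csv", row.names = FALSE)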

    Exercise 1
Create a directory canada immigration/Work/Income and put all files related to income there, then load dplyr.
Download the data set from here.

    Exercise 2
    Create a string vector with file names: 00540002-eng, 00540005-eng, 00540007-eng, 00540009-eng, 00540011-eng, 00540013-eng, 00540015-eng, and 00540017-eng.

    Exercise 3
Create a list of data frames and put the data of each file in the corresponding list position. For example, data[[1]] will contain the first file. To reduce the data size, for each data set select only data from 2014.

    Exercise 4
Clean up the first data set in the list (data[[1]]) and exclude records that summarize others, such as “Both sexes”, to avoid double counting when summarizing.

    Exercise 5
Clean up all the other data sets in the list and exclude records in the same way as described in Exercise 4. Then, pile up all the data into a single data set.

    Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

• import data into R in several ways while also being able to identify a suitable import tool
    • use SQL code within R
    • And much more

    Exercise 6
Write the recently created data set to a csv file.

    Exercise 7
Create a directory canada immigration/Work/Income and put all files related to income there, then load dplyr.
Download the data set from here.
Create a string vector with file names: 00540018-eng, 00540019-eng, 00540020-eng, 00540021-eng, 00540022-eng, 00540023-eng, 00540024-eng, and 00540025-eng.
Create a list of data frames and put the data of each file in the corresponding list position. For example, data[[1]] will contain the first file. To reduce the data size, for each data set select only data from 2014.

    Exercise 8
Clean up the first data set in the list (data[[1]]) and exclude records that summarize others, such as “Both sexes”, to avoid double counting when summarizing.

    Exercise 9
Clean up all the other data sets in the list and exclude records in the same way as described in Exercise 8. Then, pile up all the data into a single data set.

    Exercise 10
Write the recently created data set to a csv file.

    Related exercise sets:
    1. Forecasting: ARIMAX Model Exercises (Part-5)
    2. Data wrangling : I/O (Part-1)
    3. Bind exercises
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory

    Managing intermediate results when using R/sparklyr

    Fri, 06/09/2017 - 17:37

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    In our latest “R and big data” article we show how to manage intermediate results in non-trivial Apache Spark workflows using R, sparklyr, dplyr, and replyr.


    Handle management

Many Sparklyr tasks involve the creation of intermediate or temporary tables. This can happen through dplyr::copy_to() and through dplyr::compute(). These handles can represent a reference leak and eat up resources.

To help control handle lifetime, the replyr package supplies record-retaining temporary name generators (and uses the same internally).

    The actual function is pretty simple:

print(replyr::makeTempNameGenerator)
## function(prefix,
##          suffix= NULL) {
##   force(prefix)
##   if((length(prefix)!=1)||(!is.character(prefix))) {
##     stop("repyr::makeTempNameGenerator prefix must be a string")
##   }
##   if(is.null(suffix)) {
##     alphabet <- c(letters, toupper(letters), as.character(0:9))
##     suffix <- paste(base::sample(alphabet, size=20, replace= TRUE),
##                     collapse = '')
##   }
##   count <- 0
##   nameList <- c()
##   function(dumpList=FALSE) {
##     if(dumpList) {
##       v <- nameList
##       nameList <<- c()
##       return(v)
##     }
##     nm <- paste(prefix, suffix, sprintf('%010d',count), sep='_')
##     nameList <<- c(nameList, nm)
##     count <<- count + 1
##     nm
##   }
## }
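Based on the source above, here is a quick standalone illustration of how the generator behaves (the random suffix means the generated names will differ from run to run):

tmpGen <- replyr::makeTempNameGenerator('TMPEX')
tmpGen()                 # e.g. "TMPEX_<random suffix>_0000000000"
tmpGen()                 # e.g. "TMPEX_<random suffix>_0000000001"
tmpGen(dumpList = TRUE)  # returns the names generated so far and clears the record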

For instance, when joining a few tables it can be a good idea to call compute after each join (else the generated SQL can become large and unmanageable). This sort of code looks like the following:

# create example data
names <- paste('table', 1:5, sep='_')
tables <- lapply(names,
                 function(ni) {
                   di <- data.frame(key= 1:3)
                   di[[paste('val',ni,sep='_')]] <- runif(nrow(di))
                   copy_to(sc, di, ni)
                 })

# build our temp name generator
tmpNamGen <- replyr::makeTempNameGenerator('JOINTMP')

# left join the tables in sequence
joined <- tables[[1]]
for(i in seq(2,length(tables))) {
  ti <- tables[[i]]
  if(i<length(tables)) {
    joined <- compute(left_join(joined, ti, by='key'),
                      name= tmpNamGen())
  } else {
    # use non-temp name.
    joined <- compute(left_join(joined, ti, by='key'),
                      name= 'joinres')
  }
}

# clean up temps
temps <- tmpNamGen(dumpList = TRUE)
print(temps)
## [1] "JOINTMP_9lWXvfnkhI2NPRsA1tEh_0000000000"
## [2] "JOINTMP_9lWXvfnkhI2NPRsA1tEh_0000000001"
## [3] "JOINTMP_9lWXvfnkhI2NPRsA1tEh_0000000002"
for(ti in temps) {
  db_drop_table(sc, ti)
}

# show result
print(joined)
## Source:   query [3 x 6]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 3 x 6
##     key val_table_1 val_table_2 val_table_3 val_table_4 val_table_5
##
## 1     1   0.7594355   0.8082776 0.696254059   0.3777300  0.30015615
## 2     2   0.4082232   0.8101691 0.005687125   0.9382002  0.04502867
## 3     3   0.5941884   0.7990701 0.874374779   0.7936563  0.19940400

Careful introduction and management of materialized intermediates can conserve resources (both time and space) and greatly improve outcomes. We feel it is a good practice to set up an explicit temp name manager, pass it through all your Sparklyr transforms, and then clear temps in batches after the results no longer depend on the intermediates.


    Unconf projects 5: mwparser, Gargle, arresteddev

    Fri, 06/09/2017 - 09:00

    (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

    And finally, we end our series of unconf project summaries (day 1, day 2, day 3, day 4).

    mwparser

    Summary: Wikimarkup is the language used on Wikipedia and similar projects, and as such contains a lot of valuable data both for scientists studying collaborative systems and people studying things documented on or in Wikipedia. mwparser parses wikimarkup, allowing a user to filter down to specific types of tags such as links or templates, and then extract components of those tags.

    Team: Oliver Keyes

    Github: https://github.com/Ironholds/mwparser

    Gargle

    Summary: Gargle is a library that provides authentication for Google APIs but without all the agonizing pain. The package provides helper functions (for httr) to support automatic retries, paging, and progress bars for API calls.

    Team: Craig Citro

    Github: https://github.com/r-lib/gargle

    arresteddev

    Summary: This package is designed to help troubleshoot errors that come up during package and analysis development. As of now, the package helps track tracebacks and errors but more functionality is planned for the future.

    Team: Lucy D'Agostino McGowan, Karthik Ram, Miles McBain

    Github: https://github.com/ropenscilabs/arresteddev


    R Interface to Spark

    Fri, 06/09/2017 - 06:30

    (This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

    SparkR

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local")
df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true")

### SUMMARY TABLE WITH SQL
createOrReplaceTempView(df1, "tbl1")
summ <- sql("select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 where month in (1, 3, 5) group by month")
head(summ)
#   month  avg_dep  avg_arr
# 1     1 1347.210 1523.155
# 2     3 1359.500 1509.743
# 3     5 1351.168 1502.685

### SUMMARY TABLE WITH AGG()
grp <- groupBy(filter(df1, "month in (1, 3, 5)"), "month")
summ <- agg(grp, avg_dep = avg(df1$dep_time), avg_arr = avg(df1$arr_time))
head(summ)
#   month  avg_dep  avg_arr
# 1     1 1347.210 1523.155
# 2     3 1359.500 1509.743
# 3     5 1351.168 1502.685

    sparklyr

library(sparklyr)
sc <- spark_connect(master = "local")
df1 <- spark_read_csv(sc, name = "tbl1", path = "nycflights13.csv", header = TRUE, infer_schema = TRUE)

### SUMMARY TABLE WITH SQL
library(DBI)
summ <- dbGetQuery(sc, "select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 where month in (1, 3, 5) group by month")
head(summ)
#   month  avg_dep  avg_arr
# 1     5 1351.168 1502.685
# 2     1 1347.210 1523.155
# 3     3 1359.500 1509.743

### SUMMARY TABLE WITH DPLYR
library(dplyr)
summ <- df1 %>%
  filter(month %in% c(1, 3, 5)) %>%
  group_by(month) %>%
  summarize(avg_dep = mean(dep_time), avg_arr = mean(arr_time))
head(summ)
#   month  avg_dep  avg_arr
#
# 1     5 1351.168 1502.685
# 2     1 1347.210 1523.155
# 3     3 1359.500 1509.743


    Data Science for Business – Time Series Forecasting Part 2: Forecasting with timekit

    Fri, 06/09/2017 - 02:00

    In my last post, I prepared and visually explored time series data.

    Now, I will use this data to test the timekit package for time series forecasting with machine learning.

    Forecasting

    In time series forecasting, we use models to predict future time points based on past observations.

    As mentioned in timekit’s vignette, “as with most machine learning applications, the prediction is only as good as the patterns in the data. Forecasting using this approach may not be suitable when patterns are not present or when the future is highly uncertain (i.e. past is not a suitable predictor of future performance).”

And while this is certainly true, we don’t always have data with a strong regular pattern. And, I would argue, data that has very obvious patterns doesn’t need a complicated model to generate forecasts – we can already guess the future curve just by looking at it. So, if we think of use cases for businesses that want to predict e.g. product sales, forecasting models are especially relevant in cases where we can’t make predictions manually or based on experience.

    The packages I am using are timekit for forecasting, tidyverse for data wrangling and visualization, caret for additional modeling functions, tidyquant for its ggplot theme, broom and modelr for (tidy) modeling.

library(tidyverse)
library(caret)
library(tidyquant)
library(broom)
library(timekit)
library(modelr)

options(na.action = na.warn)

    Training and test data

    My input data is the tibble retail_p_day, that was created in my last post.

    I am splitting this dataset into training (all data points before/on Nov. 1st 2011) and test samples (all data points after Nov. 1st 2011).

retail_p_day <- retail_p_day %>%
  mutate(model = ifelse(day <= "2011-11-01", "train", "test"))

colnames(retail_p_day)[grep("^[0-9]+", colnames(retail_p_day))] <-
  paste0("P_", colnames(retail_p_day)[grep("^[0-9]+", colnames(retail_p_day))])

    Here, I am testing out timekit’s functions with the net income per day as response variable. Because the time series in our data set is relatively short and doesn’t cover multiple years, this forecast will only be able to capture recurring variation in days and weeks. Variations like increased sales before holidays, etc. would need additional data from several years to be accurately forecast.

    As we can see in the plot below, the net income shows variation between days.

retail_p_day %>%
  ggplot(aes(x = day, y = sum_income, color = model)) +
    geom_point(alpha = 0.5) +
    geom_line(alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq()

    Augmenting the time series signature

    With timekit, we can do forecasting with only a time series signature (a series of dates and times) and a corresponding response variable. If we had additional features that could be forecast independently, we could also introduce these into the model, but here, I will only work with the minimal data set.

    A central function of timekit is tk_augment_timeseries_signature(), which adds a number of features based on the properties of our time series signature:

    • index: The index value that was decomposed
    • index.num: The numeric value of the index in seconds. The base is “1970-01-01 00:00:00”.
    • diff: The difference in seconds from the previous numeric index value.
    • year: The year component of the index.
    • half: The half component of the index.
    • quarter: The quarter component of the index.
    • month: The month component of the index with base 1.
    • month.xts: The month component of the index with base 0, which is what xts implements.
    • month.lbl: The month label as an ordered factor beginning with January and ending with December.
    • day: The day component of the index.
    • hour: The hour component of the index.
    • minute: The minute component of the index.
    • second: The second component of the index.
    • hour12: The hour component on a 12 hour scale.
    • am.pm: Morning (AM) = 1, Afternoon (PM) = 2.
    • wday: The day of the week with base 1. Sunday = 1 and Saturday = 7.
    • wday.xts: The day of the week with base 0, which is what xts implements. Sunday = 0 and Saturday = 6.
• wday.lbl: The day of the week label as an ordered factor beginning with Sunday and ending with Saturday.
    • mday: The day of the month.
    • qday: The day of the quarter.
    • yday: The day of the year.
    • mweek: The week of the month.
    • week: The week number of the year (Sunday start).
    • week.iso: The ISO week number of the year (Monday start).
    • week2: The modulus for bi-weekly frequency.
    • week3: The modulus for tri-weekly frequency.
    • week4: The modulus for quad-weekly frequency.
    • mday7: The integer division of day of the month by seven, which returns the first, second, third, … instance the day has appeared in the month. Values begin at 1. For example, the first Saturday in the month has mday7 = 1. The second has mday7 = 2.

Because the diff value is missing for the first row, I am removing this row. We also need to keep in mind that we have an irregular time series, because we never have data on Saturdays. This will affect the modeling and results and we need to account for this later on! Alternatively, it might make sense to compare the results when setting all NAs/Saturdays to 0, assuming that no information means that there was no income on a given day. Or we could impute missing values. Which strategy most accurately represents your data needs to be decided based on a good understanding of the business and how the data was collected.

retail_p_day_aug <- retail_p_day %>%
  rename(date = day) %>%
  select(model, date, sum_income) %>%
  tk_augment_timeseries_signature() %>%
  select(-contains("month"))

retail_p_day_aug <- retail_p_day_aug[complete.cases(retail_p_day_aug), ]

    Preprocessing

    Not all of these augmented features will be informative for our model. For example, we don’t have information about time of day, so features like hour, minute, second, etc. will be irrelevant here.

Let’s look at the column variance for all numeric features and remove those with a variance of 0.

library(matrixStats)

(var <- data.frame(colnames = colnames(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]),
                   colvars = colVars(as.matrix(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]))) %>%
    filter(colvars == 0))
##   colnames colvars
## 1     hour       0
## 2   minute       0
## 3   second       0
## 4   hour12       0
## 5    am.pm       0

retail_p_day_aug <- select(retail_p_day_aug, -one_of(as.character(var$colnames)))

    Next, we want to remove highly correlated features. By plotting them, we can get an idea about which cutoff to set.

library(ggcorrplot)

cor <- cor(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)])
p.cor <- cor_pmat(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)])

ggcorrplot(cor, type = "upper", outline.col = "white", hc.order = TRUE,
           p.mat = p.cor,
           colors = c(palette_light()[1], "white", palette_light()[2]))

    I am going to choose a cutoff of 0.9 for removing features:

cor_cut <- findCorrelation(cor, cutoff = 0.9)
retail_p_day_aug <- select(retail_p_day_aug, -one_of(colnames(cor)[cor_cut]))

    Now, I can split the data into training and test sets:

train <- filter(retail_p_day_aug, model == "train") %>%
  select(-model)
test <- filter(retail_p_day_aug, model == "test")

    Modeling

    To model the time series of the response variable sum_income, I am using a generalized linear model. We could try all kinds of different models and modeling parameters, but for this test I am keeping it simple.

    fit_lm <- glm(sum_income ~ ., data = train)

    We can examine our model e.g. by visualizing:

tidy(fit_lm) %>%
  gather(x, y, estimate:p.value) %>%
  ggplot(aes(x = term, y = y, color = x, fill = x)) +
    facet_wrap(~ x, scales = "free", ncol = 4) +
    geom_bar(stat = "identity", alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    scale_fill_manual(values = palette_light()) +
    theme_tq() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

augment(fit_lm) %>%
  ggplot(aes(x = date, y = .resid)) +
    geom_hline(yintercept = 0, color = "red") +
    geom_point(alpha = 0.5, color = palette_light()[[1]]) +
    geom_smooth() +
    theme_tq()

    With this model, we can now add predictions and residuals for the test data…

pred_test <- test %>%
  add_predictions(fit_lm, "pred_lm") %>%
  add_residuals(fit_lm, "resid_lm")

    … and visualize the residuals.

pred_test %>%
  ggplot(aes(x = date, y = resid_lm)) +
    geom_hline(yintercept = 0, color = "red") +
    geom_point(alpha = 0.5, color = palette_light()[[1]]) +
    geom_smooth() +
    theme_tq()

    We can also compare the predicted with the actual sum income in the test set.

pred_test %>%
  gather(x, y, sum_income, pred_lm) %>%
  ggplot(aes(x = date, y = y, color = x)) +
    geom_point(alpha = 0.5) +
    geom_line(alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq()

    Forecasting

    Once we are satisfied with our model’s performance on the test set, we can use it to forecast future events. To create future time points for modeling, we need to extract the time index (the date column day in our data frame).

# Extract index
idx <- retail_p_day %>%
  tk_index()

    From this index we can generate the future time series.

    Here, we need to beware of a couple of things. Most importantly, we need to account for the irregularity of our data: We never have data for Saturdays and we have a few random missing values in between, as can be seen in the diff column of retail_p_day_aug (1 day difference == 86400 seconds).

retail_p_day_aug %>%
  ggplot(aes(x = date, y = diff)) +
    geom_point(alpha = 0.5, aes(color = as.factor(diff))) +
    geom_line(alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq()

    What dates are these? Let’s filter for dates with more than 1 day between the last recorded day that are not Sundays (as Saturdays are always off-days).

retail_p_day_aug %>%
  select(date, wday.lbl, diff) %>%
  filter(wday.lbl != "Sunday" & diff > 86400) %>%
  mutate(days_missing = diff / 86400 - 1)
## # A tibble: 5 x 4
##         date wday.lbl    diff days_missing
##
## 1 2011-01-04  Tuesday 1036800           11
## 2 2011-04-26  Tuesday  432000            4
## 3 2011-05-03  Tuesday  172800            1
## 4 2011-05-31  Tuesday  172800            1
## 5 2011-08-30  Tuesday  172800            1

retail_p_day_aug %>%
  select(date, wday.lbl, diff) %>%
  filter(wday.lbl == "Sunday" & diff > 172800) %>%
  mutate(days_missing = diff / 86400 - 1)
## # A tibble: 1 x 4
##         date wday.lbl   diff days_missing
##
## 1 2011-05-01   Sunday 259200            2

    Let’s create a list of all missing days:

off_days <- c("2010-12-24", "2010-12-25", "2010-12-26", "2010-12-27", "2010-12-28",
              "2010-12-29", "2010-12-30", "2010-01-01", "2010-01-02", "2010-01-03",
              "2011-04-22", "2011-04-23", "2011-04-24", "2011-04-25", "2011-05-02",
              "2011-05-30", "2011-08-29", "2011-04-29", "2011-04-30") %>%
  ymd()

    Official UK holidays during that time were:

    • 2011:
    • Boxing Day December 26
    • Christmas Day Holiday December 27

    • 2012:
    • New Year’s Day Holiday January 2
    • Good Friday April 6
    • Easter Monday April 9
    • Early May Bank Holiday May 7
    • Spring Bank Holiday June 4
    • Diamond Jubilee Holiday June 5
    • Summer Bank Holiday August 27

    We can account for the missing Saturdays with inspect_weekdays = TRUE.

    Ideally, we would now use skip_values and insert_values to specifically account for days with irregular missing data in our future time series, e.g. by accounting for holidays. Generally, it is very difficult to account for holidays, because they don’t occur with an easy to model rule (e.g. Easter is on the first Sunday after the first full moon in Spring). Unfortunately, in our dataset we have seen that holidays and randomly missing days did not have a big overlap in the past.

    Because not all holidays are missing days and we have more missing days than official holidays, I am using the list of missing days for skipping values – even though this is only a best-guess approach and likely not going to match all days that will be missing in reality during the future time series.

idx_future <- idx %>%
  tk_make_future_timeseries(n_future = 300, inspect_weekdays = TRUE,
                            inspect_months = FALSE, skip_values = off_days)

idx_future %>%
  tk_get_timeseries_signature() %>%
  ggplot(aes(x = index, y = diff)) +
    geom_point(alpha = 0.5, aes(color = as.factor(diff))) +
    geom_line(alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq()

    Then, we can build the data frame for forecasting by using tk_get_timeseries_signature() and renaming the index column to date, so that it matches the features in the model. With this data frame, we can now predict future values and add this to the data frame.

data_future <- idx_future %>%
  tk_get_timeseries_signature() %>%
  rename(date = index)

pred_future <- predict(fit_lm, newdata = data_future)

pred_future <- data_future %>%
  select(date) %>%
  add_column(sum_income = pred_future)

retail_p_day %>%
  select(day, sum_income) %>%
  rename(date = day) %>%
  rbind(pred_future) %>%
  ggplot(aes(x = date, y = sum_income)) +
    scale_x_date() +
    geom_vline(xintercept = as.numeric(max(retail_p_day$day)), color = "red", size = 1) +
    geom_point(alpha = 0.5) +
    geom_line(alpha = 0.5) +
    theme_tq()

    When we evaluate the forecast, we want to account for uncertainty of accuracy, e.g. by accounting for the standard deviation of the test residuals.

test_residuals <- pred_test$resid_lm
test_resid_sd <- sd(test_residuals, na.rm = TRUE)

pred_future <- pred_future %>%
  mutate(
    lo.95 = sum_income - 1.96 * test_resid_sd,
    lo.80 = sum_income - 1.28 * test_resid_sd,
    hi.80 = sum_income + 1.28 * test_resid_sd,
    hi.95 = sum_income + 1.96 * test_resid_sd
  )

    This, we can then plot to show the forecast with confidence intervals:

retail_p_day %>%
  select(day, sum_income) %>%
  rename(date = day) %>%
  ggplot(aes(x = date, y = sum_income)) +
    geom_point(alpha = 0.5) +
    geom_line(alpha = 0.5) +
    geom_ribbon(aes(ymin = lo.95, ymax = hi.95), data = pred_future,
                fill = "#D5DBFF", color = NA, size = 0) +
    geom_ribbon(aes(ymin = lo.80, ymax = hi.80, fill = key), data = pred_future,
                fill = "#596DD5", color = NA, size = 0, alpha = 0.8) +
    geom_point(aes(x = date, y = sum_income), data = pred_future,
               alpha = 0.5, color = palette_light()[[2]]) +
    geom_smooth(aes(x = date, y = sum_income), data = pred_future,
                method = 'loess', color = "white") +
    theme_tq()

Our model predicts that income will follow a curve that is very similar to last year’s, with a drop after Christmas and an increase towards the later months of the year. In and of itself, this sounds reasonable. However, because we only have data from one year, we do not know whether the decline in January/February and the increase towards Christmas is an annually recurring trend or whether the increase we see at the end of 2011 will be independent of seasonality and continue to rise in the future.

    Next time, I’ll compare how Facebook’s prophet will predict the future income.

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] ggcorrplot_0.1.1              matrixStats_0.52.2
##  [3] modelr_0.1.0                  timekit_0.3.0
##  [5] broom_0.4.2                   tidyquant_0.5.1
##  [7] quantmod_0.4-8                TTR_0.23-1
##  [9] PerformanceAnalytics_1.4.3541 xts_0.9-7
## [11] zoo_1.8-0                     lubridate_1.6.0
## [13] caret_6.0-76                  lattice_0.20-35
## [15] dplyr_0.5.0                   purrr_0.2.2.2
## [17] readr_1.1.1                   tidyr_0.6.3
## [19] tibble_1.3.1                  ggplot2_2.2.1
## [21] tidyverse_1.1.1
##
## loaded via a namespace (and not attached):
##  [1] httr_1.2.1         jsonlite_1.5       splines_3.4.0
##  [4] foreach_1.4.3      assertthat_0.2.0   stats4_3.4.0
##  [7] cellranger_1.1.0   yaml_2.1.14        backports_1.0.5
## [10] quantreg_5.33      digest_0.6.12      rvest_0.3.2
## [13] minqa_1.2.4        colorspace_1.3-2   htmltools_0.3.6
## [16] Matrix_1.2-10      plyr_1.8.4         psych_1.7.5
## [19] SparseM_1.77       haven_1.0.0        padr_0.3.0
## [22] scales_0.4.1       lme4_1.1-13        MatrixModels_0.4-1
## [25] mgcv_1.8-17        car_2.1-4          nnet_7.3-12
## [28] lazyeval_0.2.0     pbkrtest_0.4-7     mnormt_1.5-5
## [31] magrittr_1.5       readxl_1.0.0       evaluate_0.10
## [34] nlme_3.1-131       MASS_7.3-47        forcats_0.2.0
## [37] xml2_1.1.1         foreign_0.8-68     tools_3.4.0
## [40] hms_0.3            stringr_1.2.0      munsell_0.4.3
## [43] compiler_3.4.0     rlang_0.1.1        grid_3.4.0
## [46] nloptr_1.0.4       iterators_1.0.8    labeling_0.3
## [49] rmarkdown_1.5      gtable_0.2.0       ModelMetrics_1.1.0
## [52] codetools_0.2-15   DBI_0.6-1          reshape2_1.4.2
## [55] R6_2.2.1           knitr_1.16         rprojroot_1.2
## [58] Quandl_2.8.0       stringi_1.1.5      parallel_3.4.0
## [61] Rcpp_0.12.11
