R bloggers
R news and tutorials contributed by hundreds of R bloggers

fastqcr: An R Package Facilitating Quality Controls of Sequencing Data for Large Numbers of Samples

Tue, 04/11/2017 - 20:54

(This article was first published on Easy Guides, and kindly contributed to R-bloggers)

Introduction

High throughput sequencing data can contain hundreds of millions of sequences (also known as reads).

The raw sequencing reads may contain PCR primers, adaptors, low quality bases, duplicates and other contaminants coming from the experimental protocols. As these may affect the results of downstream analysis, it’s essential to perform some quality control (QC) checks to ensure that the raw data looks good and there are no problems in your data.

The FastQC tool, written by Simon Andrews at the Babraham Institute, is the most widely used tool to perform quality control of high throughput sequence data. To learn more about the FastQC tool, see this Video Tutorial.

For each sample, it produces an HTML report and a 'zip' file, which contains the files fastqc_data.txt and summary.txt.

If you have hundreds of samples, you’re not going to open up each HTML page. You need some way of looking at these data in aggregate.

Therefore, we developed the fastqcr R package, which contains helper functions to easily and automatically parse, aggregate and analyze FastQC reports for large numbers of samples.

Additionally, the fastqcr package provides a convenient solution for building a multi-QC report and a one-sample FastQC report with the result interpretations. The online documentation is available at: http://www.sthda.com/english/rpkgs/fastqcr/.

Examples of QC reports, generated automatically by the fastqcr R package, include:

In this article, we'll demonstrate how to perform quality control of sequencing data. We start by describing how to install and use the FastQC tool. Then, we'll describe the fastqcr R package, which makes it easy to aggregate and analyze FastQC reports for large numbers of samples.

Contents:

Installation and loading fastqcr
  • fastqcr can be installed from CRAN as follows:
install.packages("fastqcr")
  • Or, install the latest version from GitHub:
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/fastqcr")
  • Load fastqcr:
library("fastqcr") Quick Start library(fastqcr) # Aggregating Multiple FastQC Reports into a Data Frame #%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% # Demo QC directory containing zipped FASTQC reports qc.dir <- system.file("fastqc_results", package = "fastqcr") qc <- qc_aggregate(qc.dir) qc # Inspecting QC Problems #%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% # See which modules failed in the most samples qc_fails(qc, "module") # Or, see which samples failed the most qc_fails(qc, "sample") # Building Multi QC Reports #%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% qc_report(qc.dir, result.file = "multi-qc-report" ) # Building One-Sample QC Reports (+ Interpretation) #%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% qc.file <- system.file("fastqc_results", "S1_fastqc.zip", package = "fastqcr") qc_report(qc.file, result.file = "one-sample-report", interpret = TRUE) Main Functions

1) Installing and Running FastQC

  • fastqc_install(): Install the latest version of the FastQC tool on Unix systems (Mac OSX and Linux).

  • fastqc(): Run the FastQC tool from R.

2) Aggregating and Summarizing Multiple FastQC Reports

  • qc <- qc_aggregate(): Aggregate multiple FastQC reports into a data frame.

  • summary(qc): Generates a summary of qc_aggregate.

  • qc_stats(qc): General statistics of FastQC reports.

3) Inspecting Problems

  • qc_fails(qc): Displays samples or modules that failed.

  • qc_warns(qc): Displays samples or modules that warned.

  • qc_problems(qc): Union of qc_fails() and qc_warns(). Displays which samples or modules failed or warned.

4) Importing and Plotting FastQC Reports

  • qc_read(): Read FastQC data into R.

  • qc_plot(qc): Plot FastQC data

5) Building One-Sample and Multi-QC Reports

  • qc_report(): Create an HTML file containing FastQC reports of one or multiple files. Inputs can be either a directory containing multiple FastQC reports or a single sample FastQC report.

6) Others

  • qc_unzip(): Unzip all zipped files in the qc.dir directory.
Installing FastQC from R

You can automatically install the FastQC tool from R as follows:

fastqc_install()

Running FastQC from R

The file formats supported by FastQC include:

  • FASTQ
  • gzip compressed FASTQ

Suppose that your working directory is organized as follows:

  • home
    • Documents
      • FASTQ

where FASTQ is the directory containing the FASTQ files for which you want to perform the quality control check.

To run FastQC from R, type this:

fastqc(fq.dir = "~/Documents/FASTQ",  # FASTQ files directory
       qc.dir = "~/Documents/FASTQC", # Results directory
       threads = 4                    # Number of threads
       )

FastQC Reports

For each sample, FastQC performs a series of tests called analysis modules.

These modules include:

  • Basic Statistics,
  • Per base sequence quality,
  • Per tile sequence quality
  • Per sequence quality scores,
  • Per base sequence content,
  • Per sequence GC content,
  • Per base N content,
  • Sequence Length Distribution,
  • Sequence Duplication Levels,
  • Overrepresented sequences,
  • Adapter Content
  • Kmer content

The interpretation of these modules is provided in the official documentation of the FastQC tool.

Aggregating Reports

Here, we provide an R function qc_aggregate() to walk the FastQC result directory, find all the FASTQC zipped output folders, read the fastqc_data.txt and the summary.txt files, and aggregate the information into a data frame.

The fastqc_data.txt file contains the raw data and statistics while the summary.txt file summarizes which tests have been passed.

In the example below, we’ll use a demo FastQC output directory available in the fastqcr package.

library(fastqcr)

# Demo QC dir
qc.dir <- system.file("fastqc_results", package = "fastqcr")
qc.dir
## [1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/fastqcr/fastqc_results"

# List of files in the directory
list.files(qc.dir)
## [1] "S1_fastqc.zip" "S2_fastqc.zip" "S3_fastqc.zip" "S4_fastqc.zip" "S5_fastqc.zip"

The demo QC directory contains five zipped folders corresponding to the FastQC output for 5 samples.

Aggregating FastQC reports:

qc <- qc_aggregate(qc.dir)
qc

The aggregated report looks like this:

sample  module                        status  tot.seq   seq.length  pct.gc  pct.dup
S4      Per tile sequence quality     PASS    67255341  35-76       49      19.89
S3      Per base sequence quality     PASS    67255341  35-76       49      22.14
S3      Per base N content            PASS    67255341  35-76       49      22.14
S5      Per base sequence content     FAIL    65011962  35-76       48      18.15
S2      Sequence Duplication Levels   PASS    50299587  35-76       48      15.70
S1      Per base sequence content     FAIL    50299587  35-76       48      17.24
S1      Overrepresented sequences     PASS    50299587  35-76       48      17.24
S3      Basic Statistics              PASS    67255341  35-76       49      22.14
S1      Basic Statistics              PASS    50299587  35-76       48      17.24
S4      Overrepresented sequences     PASS    67255341  35-76       49      19.89

Column names:

  • sample: sample names
  • module: fastqc modules
  • status: fastqc module status for each sample
  • tot.seq: total sequences (i.e.: the number of reads)
  • seq.length: sequence length
  • pct.gc: percentage of GC content
  • pct.dup: percentage of duplicate reads

The table shows, for each sample, the names of the tested FastQC modules, the status of each test, as well as some general statistics including the number of reads, the length of reads, the percentage of GC content and the percentage of duplicate reads.

Once you have the aggregated data you can use the dplyr package to easily inspect modules that failed or warned in samples. For example, the following R code shows samples with warnings and/or failures:

library(dplyr)
qc %>%
  select(sample, module, status) %>%
  filter(status %in% c("WARN", "FAIL")) %>%
  arrange(sample)

sample  module                        status
S1      Per base sequence content     FAIL
S1      Per sequence GC content       WARN
S1      Sequence Length Distribution  WARN
S2      Per base sequence content     FAIL
S2      Per sequence GC content       WARN
S2      Sequence Length Distribution  WARN
S3      Per base sequence content     FAIL
S3      Per sequence GC content       FAIL
S3      Sequence Length Distribution  WARN
S4      Per base sequence content     FAIL
S4      Per sequence GC content       FAIL
S4      Sequence Length Distribution  WARN
S5      Per base sequence content     FAIL
S5      Per sequence GC content       WARN
S5      Sequence Length Distribution  WARN

In the next section, we’ll describe some easy-to-use functions, available in the fastqcr package, for analyzing the aggregated data.

Summarizing Reports

We start by presenting a summary and general statistics of the aggregated data.

QC Summary
  • R function: summary()
  • Input data: aggregated data from qc_aggregate()
# Summary of qc
summary(qc)

module                        nb_samples  nb_fail  nb_pass  nb_warn  failed              warned
Adapter Content               5           0        5        0        NA                  NA
Basic Statistics              5           0        5        0        NA                  NA
Kmer Content                  5           0        5        0        NA                  NA
Overrepresented sequences     5           0        5        0        NA                  NA
Per base N content            5           0        5        0        NA                  NA
Per base sequence content     5           5        0        0        S1, S2, S3, S4, S5  NA
Per base sequence quality     5           0        5        0        NA                  NA
Per sequence GC content       5           2        0        3        S3, S4              S1, S2, S5
Per sequence quality scores   5           0        5        0        NA                  NA
Per tile sequence quality     5           0        5        0        NA                  NA
Sequence Duplication Levels   5           0        5        0        NA                  NA
Sequence Length Distribution  5           0        0        5        NA                  S1, S2, S3, S4, S5

Column names:

  • module: fastqc modules
  • nb_samples: the number of samples tested
  • nb_pass, nb_fail, nb_warn: the number of samples that passed, failed and warned, respectively.
  • failed, warned: the name of samples that failed and warned, respectively.

The table shows, for each FastQC module, the number and the name of samples that failed or warned.

General statistics
  • R function: qc_stats()
  • Input data: aggregated data from qc_aggregate()
qc_stats(qc)

sample  pct.dup  pct.gc  tot.seq   seq.length
S1      17.24    48      50299587  35-76
S2      15.70    48      50299587  35-76
S3      22.14    49      67255341  35-76
S4      19.89    49      67255341  35-76
S5      18.15    48      65011962  35-76

Column names:

  • pct.dup: the percentage of duplicate reads,
  • pct.gc: the percentage of GC content,
  • tot.seq: total sequences or the number of reads and
  • seq.length: sequence length or the length of reads.

The table shows, for each sample, some general statistics such as the total number of reads, the length of reads, the percentage of GC content and the percentage of duplicate reads.
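
One way to get a quick visual overview of these per-sample statistics is to chart them; below is a minimal sketch using ggplot2 (assuming ggplot2 is installed, and storing the qc_stats() result in an object we call qc.stats):

library(ggplot2)

qc.stats <- qc_stats(qc)
# Bar chart of the number of reads (in millions) per sample
ggplot(qc.stats, aes(x = sample, y = tot.seq/1e6)) +
  geom_col() +
  labs(x = "Sample", y = "Total reads (millions)")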

Inspecting Problems

Once you’ve got this aggregated data, it’s easy to figure out what (if anything) is wrong with your data.

1) R functions. You can inspect problems per either modules or samples using the following R functions:

  • qc_fails(qc): Displays samples or modules that failed.
  • qc_warns(qc): Displays samples or modules that warned.
  • qc_problems(qc): Union of qc_fails() and qc_warns(). Displays which samples or modules failed or warned.

2) Input data: aggregated data from qc_aggregate()

3) Output data: Returns samples or FastQC modules with failures or warnings. By default, these functions return a compact output format. If you want a stretched format, specify the argument compact = FALSE.

The format and the interpretation of the outputs depend on the additional argument element, whose value is one of c("sample", "module").

  • If element = “sample” (default), results are samples with failed and/or warned modules. The results contain the following columns:
    • sample (sample names),
    • nb_problems (the number of modules with problems),
    • module (the name of modules with problems).
  • If element = “module”, results are modules that failed and/or warned in the most samples. The results contain the following columns:
    • module (the name of module with problems),
    • nb_problems (the number of samples with problems),
    • sample (the name of samples with problems)
Per Module Problems
  • Modules that failed in the most samples:
# See which module failed in the most samples
qc_fails(qc, "module")

module                     nb_problems  sample
Per base sequence content  5            S1, S2, S3, S4, S5
Per sequence GC content    2            S3, S4

For each module, the number of problems (failures) and the names of the samples that failed are shown.

  • Modules that warned in the most samples:
# See which module warned in the most samples
qc_warns(qc, "module")

module                        nb_problems  sample
Sequence Length Distribution  5            S1, S2, S3, S4, S5
Per sequence GC content       3            S1, S2, S5
  • Modules that failed or warned: Union of qc_fails() and qc_warns()
# See which modules failed or warned.
qc_problems(qc, "module")

module                        nb_problems  sample
Per base sequence content     5            S1, S2, S3, S4, S5
Per sequence GC content       5            S1, S2, S3, S4, S5
Sequence Length Distribution  5            S1, S2, S3, S4, S5

The output above is in a compact format. For a stretched format, type this:

qc_problems(qc, "module", compact = FALSE) module nb_problems sample status Per base sequence content 5 S1 FAIL Per base sequence content 5 S2 FAIL Per base sequence content 5 S3 FAIL Per base sequence content 5 S4 FAIL Per base sequence content 5 S5 FAIL Per sequence GC content 5 S3 FAIL Per sequence GC content 5 S4 FAIL Per sequence GC content 5 S1 WARN Per sequence GC content 5 S2 WARN Per sequence GC content 5 S5 WARN Sequence Length Distribution 5 S1 WARN Sequence Length Distribution 5 S2 WARN Sequence Length Distribution 5 S3 WARN Sequence Length Distribution 5 S4 WARN Sequence Length Distribution 5 S5 WARN

In the stretched format, each row corresponds to a unique sample. Additionally, the status of each module is specified.

It’s also possible to display problems for one or more specified modules. For example,

qc_problems(qc, "module", name = "Per sequence GC content") module nb_problems sample status Per sequence GC content 5 S3 FAIL Per sequence GC content 5 S4 FAIL Per sequence GC content 5 S1 WARN Per sequence GC content 5 S2 WARN Per sequence GC content 5 S5 WARN

Note that partial matching of name is allowed. For example, name = "GC content" will match the module "Per sequence GC content".

qc_problems(qc, "module", name = "GC content") Per Sample Problems
  • Samples with one or more failed modules
# See which samples had one or more failed modules
qc_fails(qc, "sample")

sample  nb_problems  module
S3      2            Per base sequence content, Per sequence GC content
S4      2            Per base sequence content, Per sequence GC content
S1      1            Per base sequence content
S2      1            Per base sequence content
S5      1            Per base sequence content

For each sample, the number of problems (failures) and the names of the modules that failed are shown.

  • Samples with failed or warned modules:
# See which samples had one or more module with failure or warning
qc_problems(qc, "sample", compact = FALSE)

sample  nb_problems  module                        status
S1      3            Per base sequence content     FAIL
S1      3            Per sequence GC content       WARN
S1      3            Sequence Length Distribution  WARN
S2      3            Per base sequence content     FAIL
S2      3            Per sequence GC content       WARN
S2      3            Sequence Length Distribution  WARN
S3      3            Per base sequence content     FAIL
S3      3            Per sequence GC content       FAIL
S3      3            Sequence Length Distribution  WARN
S4      3            Per base sequence content     FAIL
S4      3            Per sequence GC content       FAIL
S4      3            Sequence Length Distribution  WARN
S5      3            Per base sequence content     FAIL
S5      3            Per sequence GC content       WARN
S5      3            Sequence Length Distribution  WARN

To specify the name of a sample of interest, type this:

qc_problems(qc, "sample", name = "S1") sample nb_problems module status S1 3 Per base sequence content FAIL S1 3 Per sequence GC content WARN S1 3 Sequence Length Distribution WARN Building an HTML Report

The function qc_report() can be used to build a report of FastQC outputs. It creates an HTML file containing FastQC reports of one or multiple samples.

Inputs can be either a directory containing multiple FastQC reports or a single sample FastQC report.

Create a Multi-QC Report

We’ll build a multi-qc report for the following demo QC directory:

# Demo QC Directory
qc.dir <- system.file("fastqc_results", package = "fastqcr")
qc.dir
## [1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/fastqcr/fastqc_results"

# Build a report
qc_report(qc.dir, result.file = "~/Desktop/multi-qc-result",
          experiment = "Exome sequencing of colon cancer cell lines")

An example of report is available at: fastqcr multi-qc report

Create a One-Sample Report

We’ll build a report for the following demo QC file:

qc.file <- system.file("fastqc_results", "S1_fastqc.zip", package = "fastqcr")
qc.file
## [1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/fastqcr/fastqc_results/S1_fastqc.zip"
  • One-Sample QC report with plot interpretations:
qc_report(qc.file, result.file = "one-sample-report-with-interpretation", interpret = TRUE)

An example of report is available at: One sample QC report with interpretation

  • One-Sample QC report without plot interpretations:
qc_report(qc.file, result.file = "one-sample-report", interpret = FALSE)

An example of report is available at: One sample QC report without interpretation

Importing and Plotting a FastQC QC Report

We’ll visualize the output for sample 1:

# Demo file
qc.file <- system.file("fastqc_results", "S1_fastqc.zip", package = "fastqcr")
qc.file
## [1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/fastqcr/fastqc_results/S1_fastqc.zip"

We start by reading the output using the function qc_read(), which returns a list of tibbles containing the data for specified modules:

# Read all modules
qc <- qc_read(qc.file)

# Elements contained in the qc object
names(qc)
##  [1] "summary"                       "basic_statistics"              "per_base_sequence_quality"    "per_tile_sequence_quality"
##  [5] "per_sequence_quality_scores"   "per_base_sequence_content"     "per_sequence_gc_content"      "per_base_n_content"
##  [9] "sequence_length_distribution"  "sequence_duplication_levels"   "overrepresented_sequences"    "adapter_content"
## [13] "kmer_content"                  "total_deduplicated_percentage"

The function qc_plot() is used to visualize the data of a specified module. Allowed values for the argument modules include one or a combination of:

  • “Summary”,
  • “Basic Statistics”,
  • “Per base sequence quality”,
  • “Per sequence quality scores”,
  • “Per base sequence content”,
  • “Per sequence GC content”,
  • “Per base N content”,
  • “Sequence Length Distribution”,
  • “Sequence Duplication Levels”,
  • “Overrepresented sequences”,
  • “Adapter Content”
qc_plot(qc, "Per sequence GC content") qc_plot(qc, "Per base sequence quality") qc_plot(qc, "Per sequence quality scores") qc_plot(qc, "Per base sequence content") qc_plot(qc, "Sequence duplication levels")


Interpreting FastQC Reports
  • Summary shows a summary of the modules which were tested, and the status of the test results:
    • normal results (PASS),
    • slightly abnormal (WARN: warning)
    • or very unusual (FAIL: failure).

Some experiments may be expected to produce libraries which are biased in particular ways. You should therefore treat the summary evaluations as pointers to where you should concentrate your attention, and understand why your library may not look normal.

qc_plot(qc, "summary") status module sample PASS Basic Statistics S1.fastq PASS Per base sequence quality S1.fastq PASS Per tile sequence quality S1.fastq PASS Per sequence quality scores S1.fastq FAIL Per base sequence content S1.fastq WARN Per sequence GC content S1.fastq PASS Per base N content S1.fastq WARN Sequence Length Distribution S1.fastq PASS Sequence Duplication Levels S1.fastq PASS Overrepresented sequences S1.fastq PASS Adapter Content S1.fastq PASS Kmer Content S1.fastq
  • Basic statistics shows basic data metrics such as:
    • Total sequences: the number of reads (total sequences),
    • Sequence length: the length of reads (minimum - maximum)
    • %GC: GC content
qc_plot(qc, "Basic statistics") Measure Value Filename S1.fastq File type Conventional base calls Encoding Sanger / Illumina 1.9 Total Sequences 50299587 Sequences flagged as poor quality 0 Sequence length 35-76 %GC 48
  • Per base sequence quality plot depicts the quality scores across all bases at each position in the reads. The background color delimits 3 different zones: very good quality (green), reasonable quality (orange) and poor quality (red). A good sample will have qualities all above 28:
qc_plot(qc, "Per base sequence quality")


Problems:

  • warning if the median for any base is less than 25.
  • failure if the median for any base is less than 20.

Common reasons for problems:

  • Degradation of (sequencing chemistry) quality over the duration of long runs. Remedy: Quality trimming.

  • Short loss of quality earlier in the run, which then recovers to produce later good quality sequence. Can be explained by a transient problem with the run (bubbles in the flowcell for example). In these cases trimming is not advisable as it will remove later good sequence, but you might want to consider masking bases during subsequent mapping or assembly.

  • Library with reads of varying length. Warning or error is generated because of very low coverage for a given base range. Before committing to any action, check how many sequences were responsible for triggering an error by looking at the sequence length distribution module results.

  • Per sequence quality scores plot shows the frequencies of quality scores in a sample. It allows you to see if a subset of your sequences have low quality values. If the reads are of good quality, the peak on the plot should be shifted to the right as far as possible (quality > 27).
qc_plot(qc, "Per sequence quality scores")


Problems:

  • warning if the most frequently observed mean quality is below 27 - this equates to a 0.2% error rate.
  • failure if the most frequently observed mean quality is below 20 - this equates to a 1% error rate.

Common reasons for problems:

General loss of quality within a run. Remedy: For long runs this may be alleviated through quality trimming.

  • Per base sequence content shows the four nucleotides’ proportions for each position. In a random library you expect no nucleotide bias and the lines should be almost parallel with each other. In a good sequence composition, the difference between A and T, or G and C is < 10% in any position.
qc_plot(qc, "Per base sequence content")


It's worth noting that some types of library will always produce a biased sequence composition, normally at the start of the read. For example, in RNA-Seq data it is common to see bias at the beginning of the reads. This occurs during RNA-Seq library preparation, when "random" primers are annealed to the start of sequences. These primers are not truly random, which leads to variation at the beginning of the reads. These primers can be removed with an adapter-trimming tool.

Problems:

  • warning if the difference between A and T, or G and C is greater than 10% in any position.

  • failure if the difference between A and T, or G and C is greater than 20% in any position.

Common reasons for problems:

  • Overrepresented sequences: adapter dimers or rRNA

  • Biased selection of random primers for RNA-seq. Nearly all RNA-Seq libraries will fail this module because of this bias, but this is not a problem which can be fixed by processing, and it doesn’t seem to adversely affect the ability to measure expression.

  • Biased composition libraries: Some libraries are inherently biased in their sequence composition. For example, a library treated with sodium bisulphite will have most of its cytosines converted to thymines, meaning that the base composition will be almost devoid of cytosines and will thus trigger an error, despite this being entirely normal for that type of library.

  • Library which has been aggressively adapter trimmed.

  • Per sequence GC content plot displays the GC distribution over all sequences. In a random library you expect a roughly normal GC content distribution. An unusually sharp or shifted distribution could indicate contamination or some systematic bias:
qc_plot(qc, "Per sequence GC content")


You can generate theoretical GC content curves using an R package called fastqcTheoreticalGC written by Mike Love.

  • Per base N content. If a sequencer is unable to make a base call with sufficient confidence then it will normally substitute an N rather than a conventional base call. This module plots out the percentage of base calls at each position for which an N was called.
qc_plot(qc, "Per base N content")


Problems:

  • warning if any position shows an N content of >5%.
  • failure if any position shows an N content of >20%.

Common reasons for problems:

  • General loss of quality.
  • Very biased sequence composition in the library.
  • Sequence length distribution module reports if all sequences have the same length or not. For some sequencing platforms it is entirely normal to have different read lengths so warnings here can be ignored. In many cases this will produce a simple graph showing a peak only at one size. This module will raise an error if any of the sequences have zero length.
qc_plot(qc, "Sequence length distribution")


  • Sequence duplication levels. This module counts the degree of duplication for every sequence in a library and creates a plot showing the relative number of sequences with different degrees of duplication. A high level of duplication is more likely to indicate some kind of enrichment bias (eg PCR over amplification).
qc_plot(qc, "Sequence duplication levels")


Problems:

  • warning if non-unique sequences make up more than 20% of the total.
  • failure if non-unique sequences make up more than 50% of the total.

Common reasons for problems:

  • Technical duplicates arising from PCR artifacts

  • Biological duplicates which are natural collisions where different copies of exactly the same sequence are randomly selected.

In RNA-seq data, duplication levels can even reach 40%. Nevertheless, when analyzing transcriptome sequencing data we should not remove these duplicates, because we cannot tell whether they represent PCR duplicates or high gene expression in our samples.

  • Overrepresented sequences section gives information about primer or adaptor contamination. Finding that a single sequence is very overrepresented in the set either means that it is highly biologically significant, or indicates that the library is contaminated or not as diverse as you expected. This module lists all of the sequences that make up more than 0.1% of the total.
qc_plot(qc, "Overrepresented sequences")


Problems:

  • warning if any sequence is found to represent more than 0.1% of the total.
  • failure if any sequence is found to represent more than 1% of the total.

Common reasons for problems:

Small RNA libraries, where sequences are not subjected to random fragmentation and the same sequence may naturally be present in a significant proportion of the library.

  • Adapter content module checks the presence of read-through adapter sequences. It is useful to know if your library contains a significant amount of adapter in order to be able to assess whether you need to adapter trim or not.
qc_plot(qc, "Adapter content")


Problems:

  • warning if any sequence is present in more than 5% of all reads.
  • failure if any sequence is present in more than 10% of all reads.

A warning or failure means that the sequences will need to be adapter trimmed before proceeding with any downstream analysis.

  • K-mer content
qc_plot(qc, "Kmer content")


Useful Links

Infos

This analysis has been performed using R software (ver. 3.3.2).





Teaching pivot / un-pivot

Tue, 04/11/2017 - 19:48

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Authors: John Mount and Nina Zumel

Introduction

In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot.

One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering“) is easy to explain, as the operation is a function that takes a single row and builds groups of new rows in an obvious manner. We commented that the inverse operation of moving data into rows, or the “widening” operation (often called “pivoting”, “unstacking”, “casting”, or “spreading”) is harder to explain as it takes a specific group of columns and maps them back to a single row. However, if we take extra care and factor the pivot operation into its essential operations we find pivoting can be usefully conceptualized as a simple single row to single row mapping followed by a grouped aggregation.

Please read on for our thoughts on teaching pivoting data.

Background

In data science, data rows are often considered to be instances. Because of this the data scientist needs explicit control over which facts fall into a single row. If we are trying to compute the relative prevalence of birth-names by year broken down by sex, we probably want both sexes in a single row. If we are trying to graph the same data using the R package ggplot2, we may want each year plus sex to determine a different row. Our thesis is that these differences are inessential features of data presentation and should not be confused with properties of the underlying data.

Because we need to move from form to form, we need both terminology to discuss the transforms and tools that implement the transforms.

For example when we were preparing our recent Strata workshop on Spark/R/Sparklyr we started with materials from our RStudio partners and found ourselves puzzled by one bit of code:

birthsYearly <- applicants_tbl %>%
  mutate(male = ifelse(sex == "M", n_all, 0),
         female = ifelse(sex == "F", n_all, 0)) %>%
  group_by(year) %>%
  summarize(Male = sum(male) / 1000000,
            Female = sum(female) / 1000000) %>%
  arrange(year) %>%
  collect

One of your authors (Nina Zumel) found this code much easier to understand once she added a comment indicating intent such as:

# by-hand spread on remote data

And the other author (John Mount) noticed that this implementation of “pivot” or “spread” was a better implementation idea than he had previously been toying with to add “pivot” (or “move values to columns”) capabilities to remote data implementations (databases and Spark).

This two stage version of pivot (widening individual rows and then summarizing by groups) is also a great way to teach data shaping techniques, which we will discuss here.

Teaching moving data to rows

Moving data to rows is easy to teach through examples. Suppose we have the following data frame:

d <- data.frame(
  index = c(1, 2, 3),
  meas1 = c('m1_1', 'm1_2', 'm1_3'),
  meas2 = c('m2_1', 'm2_2', 'm2_3'),
  stringsAsFactors = FALSE)
print(d)
#   index meas1 meas2
# 1     1  m1_1  m2_1
# 2     2  m1_2  m2_2
# 3     3  m1_3  m2_3

We can convert this into a “thin” form with a call such as the following:

library("dplyr") library("cdata") d2 <- moveValuesToRows(d, nameForNewKeyColumn= 'meastype', nameForNewValueColumn= 'meas', columnsToTakeFrom= c('meas1','meas2')) %>% arrange(index) print(d2) # index meastype meas # 1 1 meas1 m1_1 # 2 1 meas2 m2_1 # 3 2 meas1 m1_2 # 4 2 meas2 m2_2 # 5 3 meas1 m1_3 # 6 3 meas2 m2_3

The idea is: intent is documented through the method name and verbose argument bindings. As we mentioned in our earlier article, this transform is easy to teach as you can meaningfully think about it operating on each input row separately:

moveValuesToRows(d[1, , drop = FALSE],
                 nameForNewKeyColumn = 'meastype',
                 nameForNewValueColumn = 'meas',
                 columnsToTakeFrom = c('meas1', 'meas2')) %>%
  arrange(index)
#   index meastype meas
# 1     1    meas1 m1_1
# 2     1    meas2 m2_1

Teaching moving data to columns

As we taught earlier, with the proper pre-conditions, we can consider moving data to columns as an inverse operation to moving data to rows. We can undo the last transform with:

d1p <- d2 %>%
  moveValuesToColumns(columnToTakeKeysFrom = 'meastype',
                      columnToTakeValuesFrom = 'meas',
                      rowKeyColumns = 'index') %>%
  arrange(index)

all.equal(d, d1p)
# [1] TRUE

Teaching moving data to columns at first blush seems harder as the operation as normally presented takes sets of rows as inputs. However, this is not an essential feature of moving data to columns. It is just an optimization or convenience that is so deeply ingrained into implementations it becomes part of the explanations.
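
For example, with the tidyr package the widening is normally written as a single grouped call over the whole thin frame; a sketch (assuming tidyr is installed, and using the d2 frame from above):

library(tidyr)

# one call takes groups of d2 rows back to wide rows
d2 %>%
  spread(key = meastype, value = meas) %>%
  arrange(index)
# should recover the wide form of d (columns index, meas1, meas2)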

Consider the following “incomplete” implementation of moving data to columns from the development version of replyr.

devtools::install_github("WinVector/replyr")
library("replyr")

d1q <- d2 %>%
  replyr_moveValuesToColumns(columnToTakeKeysFrom = 'meastype',
                             columnToTakeValuesFrom = 'meas',
                             rowKeyColumns = 'index',
                             dosummarize = FALSE,
                             fill = '') %>%
  arrange(index)
print(d1q)
#   index meas1 meas2
# 1     1  m1_1
# 2     1        m2_1
# 3     2  m1_2
# 4     2        m2_2
# 5     3  m1_3
# 6     3        m2_3

This notation makes the motion of values to columns obvious: each row from the original data frame produces a single new row in the result data frame that:

  • Has a new column for each possible value seen in "columnToTakeKeysFrom".
  • Populates the column matching the value in “columnToTakeKeysFrom” with the value from “columnToTakeValuesFrom”.
  • Populates other new columns (those taking names from “columnToTakeKeysFrom”) with easy to remove placeholder values.
  • Copies over all other column values.

Once we see this it becomes clear moving values to columns is an operation very much like the expansion of levels in “stats::model.matrix()” or 1-hot encoding (also called “dummy variables” or “indicators”), which place ones in columns instead of arbitrary values.



Dummy or indicator column encoding example from Practical Data Science with R, Zumel, Mount; Manning 2014.

In fact calling model.matrix() gives us a structure very similar to the “d1q” frame:

model.matrix(~ 0 + index + meastype, data = d2)
#   index meastypemeas1 meastypemeas2
# 1     1             1             0
# 2     1             0             1
# 3     2             1             0
# 4     2             0             1
# 5     3             1             0
# 6     3             0             1

The reason we bring this up is that things are easier to learn when they are in a shared, familiar context, and not treated as unique, “remarkable” occurrences.

To finish the conversion back to the original frame “d” we just have to add back in the neglected aggregation (which was intentionally suppressed by the “dosummarize = FALSE” option):

d1recovered <- d1q %>%
  group_by(index) %>%
  summarize_all("max") %>%
  arrange(index)
print(d1recovered)
# # A tibble: 3 × 3
#   index meas1 meas2
#   <dbl> <chr> <chr>
# 1     1  m1_1  m2_1
# 2     2  m1_2  m2_2
# 3     3  m1_3  m2_3

all.equal(d, data.frame(d1recovered))
# [1] TRUE

And we have inverted the operation and recovered "d"! Demonstrating sequences of moving values to columns and moving values to rows is key to building familiarity and trust in these operations. This is why we work through such sequences here and in our previous article (yielding the following strongly connected graph converting between four different scientists' preferred data representations):



Conclusion

The typical explanation of “pivot” for spreadsheet users contains aggregation as an integral part, and the typical explanations and diagrams used by R teachers also include a hidden aggregation (though only in the weaker sense of coalescing rows). Separating row transforms completely from value aggregation/coalescing makes pivoting (or moving values to columns) much more comprehendible and teachable.

We feel showing the notional intermediate form of the "expanded data frame" we introduced here when moving values to columns (the "d1q" frame) greatly improves learnability and comprehension. We also feel one should consistently use the terms "moving values to columns" and "moving values to rows" instead of insisting new students memorize non-informative technical names. Likely the "expanded data frame" is not taught because it is not usually the actual implementation (as it is in fact temporarily wasteful of space).

Appendix

The development version of replyr now implements a move values to columns operation explicitly in terms of this expansion, and we have demonstrated the method working on top of Spark 2.0. This "be temporarily wasteful" strategy is actually compatible with how one designs high-throughput big-data systems, leaning hard on the aphorism:

“The biggest difference between time and space is that you can’t reuse time.”

Merrick Furst


Building Shiny App exercises part 10

Tue, 04/11/2017 - 17:25

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

SHINY DASHBOARD STRUCTURE & APPEARANCE

Finally we have reached the final part of our series. In this part we will see how to further improve the structure and the appearance of our dashboard, according to our preferences, and of course make it more attractive to the user. Last but not least, we will see the simplest and easiest way to deploy it.

Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let's begin!

Answers to the exercises are available here.

infoBox

There is a special kind of box that is used for displaying simple numeric or text values, with an icon. The example code below shows how to generate an infoBox. The first row of infoBoxes uses the default setting of fill=FALSE.

Since the content of an infoBox will usually be dynamic, shinydashboard contains the helper functions infoBoxOutput and renderInfoBox for dynamic content.
#ui.R
library(shinydashboard)

dashboardPage(
dashboardHeader(title = "Info boxes"),
dashboardSidebar(),
dashboardBody(
# infoBoxes with fill=FALSE
fluidRow(
# A static infoBox
infoBox("New Orders", 10 * 2, icon = icon("credit-card")),
# Dynamic infoBoxes
infoBoxOutput("progressBox"),
infoBoxOutput("approvalBox")
)))
#server.R
shinyServer(function(input, output) {
output$progressBox <- renderInfoBox({
infoBox(
"Progress", paste0(25 + input$count, "%"), icon = icon("list"),
color = "purple"
)
})
output$approvalBox <- renderInfoBox({
infoBox(
"Approval", "80%", icon = icon("thumbs-up", lib = "glyphicon"),
color = "yellow"
)
})
})

Exercise 1

Create three infoBox with information icons and color of your choice and put them in the tabItem “dt” under the “DATA-TABLE”.

To fill them with a color follow the example below:
#ui.R
library(shinydashboard)

dashboardPage(
dashboardHeader(title = "Info boxes"),
dashboardSidebar(),
dashboardBody(
# infoBoxes with fill=FALSE
fluidRow(
# A static infoBox
infoBox("New Orders", 10 * 2, icon = icon("credit-card"),fill = TRUE),
# Dynamic infoBoxes
infoBoxOutput("progressBox"),
infoBoxOutput("approvalBox")
)))
#server.R
shinyServer(function(input, output) {
output$progressBox <- renderInfoBox({
infoBox(
"Progress", paste0(25 + input$count, "%"), icon = icon("list"),
color = "purple",fill = TRUE
)
})
output$approvalBox <- renderInfoBox({
infoBox(
"Approval", "80%", icon = icon("thumbs-up", lib = "glyphicon"),
color = "yellow",fill = TRUE
)
})
})

Exercise 2

Fill the infoBox with the color you selected in Exercise 1. HINT: Use fill.

Exercise 3

Now enhance the appearance of your tabItem named "km" by setting height = 450 in the four boxes you have there.

Skins

There are a number of color themes, or skins. The default is blue, but there are also black, purple, green, red, and yellow. You can choose which theme to use with dashboardPage(skin = "blue"), dashboardPage(skin = "black"), and so on.
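
For instance, a minimal UI sketch with a non-default skin might look like this (the title is a placeholder):

#ui.R
library(shinydashboard)
dashboardPage(skin = "purple",
dashboardHeader(title = "Skin example"),
dashboardSidebar(),
dashboardBody()
)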

Exercise 4

Change skin from blue to red.

CSS

You can add custom CSS to your app by adding code in the UI of your app like this:
#ui.R
dashboardPage(
dashboardHeader(title = "Custom font"),
dashboardSidebar(),
dashboardBody(
tags$head(tags$style(HTML('
.main-header .logo {
font-family: "Georgia", Times, "Times New Roman", serif;
font-weight: bold;
font-size: 24px;
}
')))
)
)

Exercise 5

Change the font of your dashboard title by adding CSS code.

Long titles

In some cases, the title that you wish to use won’t fit in the default width in the header bar. You can make the space for the title wider with the titleWidth option. In this example, we’ve increased the width for the title to 450 pixels.
#ui.R
dashboardPage(
dashboardHeader(
title = "Example of a long title that needs more space",
titleWidth = 450
),
dashboardSidebar(),
dashboardBody(

)
)
#server.R
function(input, output) { }

Exercise 6

Set your titleWidth to 400 and then set it back to the default value.

Sidebar width

To change the width of the sidebar, you can use the width option. This example has a wider title and sidebar:

#ui.R
library(shinydashboard)
dashboardPage(
dashboardHeader(
title = "Title and sidebar 350 pixels wide",
titleWidth = 350
),
dashboardSidebar(
width = 350,
sidebarMenu(
menuItem("Menu Item")
)
),
dashboardBody()
)
#server.R
function(input, output) { }

Exercise 7

Set the sidebar width to 400 and then return it to the default.

Icons

Icons are used liberally in shinydashboard. The icons used in shiny and shinydashboard are really just characters from special font sets, and they’re created with Shiny’s icon() function.

To create a calendar icon, you’d call:
icon("calendar")

The icons are from Font-Awesome and Glyphicons. You can see lists of all available icons here:

http://fontawesome.io/icons/
http://getbootstrap.com/components/#glyphicons
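
For instance, a sidebar entry picks up one of these icons through the icon argument of menuItem; a sketch (the menu labels and tabNames below are placeholders):

#ui.R snippet
sidebarMenu(
menuItem("Dashboard", tabName = "dashboard", icon = icon("dashboard")),
menuItem("Data Table", tabName = "dt", icon = icon("table"))
)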

Exercise 8

Change the icon of the three menuItem of your dashboard. Select whatever you like from the two lists above.

Statuses and colors

Many shinydashboard components have a status or color argument.

The status is a property of some Bootstrap classes. It can have values like status="primary", status="success", and others.

The color argument is more straightforward. It can have values like color="red", color="black", and others.

The valid statuses and colors are also listed in ?validStatuses and ?validColors.
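
For example, a box and an infoBox might use these arguments like this (a sketch; titles and content are placeholders):

#ui.R snippet
box(title = "Histogram", status = "success", solidHeader = TRUE,
plotOutput("plot1")
)
infoBox("New Orders", 10 * 2, icon = icon("credit-card"), color = "red")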

Exercise 9

Change the status of the three widget boxes in the tabItem named "km" to "info", "success" and "danger" respectively.

Shinyapps.io

The easiest way to turn your Shiny app into a web page is to use shinyapps.io, RStudio’s hosting service for Shiny apps.

shinyapps.io lets you upload your app straight from your R session to a server hosted by RStudio. You have complete control over your app including server administration tools.

First of all you have to create an account in shinyapps.io.
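
Once the account exists, deploying from the R console can be sketched with the rsconnect package (the account name, token, secret and app directory below are placeholders taken from your shinyapps.io dashboard):

library(rsconnect)
rsconnect::setAccountInfo(name = "your-account",
token = "YOUR_TOKEN",
secret = "YOUR_SECRET")
rsconnect::deployApp("path/to/your/app")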

Exercise 10

Publish your app through Shinyapps.io

Related exercise sets:
  1. Building Shiny App Exercises (part-8)
  2. Building Shiny App exercises part 2
  3. Building Shiny App exercises part 1
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory


RStudio Connect 1.4.6

Tue, 04/11/2017 - 17:20

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re excited to announce the release of RStudio Connect: version 1.4.6. This is an incremental release which features significantly improved startup time and support for server-side Shiny bookmarks.

Improved Startup & Job Listing Time

We now track R process jobs in the database which allows us to list and query jobs much more quickly. This decreases the startup time of the RStudio Connect service — allowing even the busiest of servers to spin up in a matter of seconds. Additionally, operations that involve listing jobs such as viewing process logs for a particular application should be noticeably faster.

Server-Side Shiny Bookmarks

Shiny v0.14 introduced a feature by which users could bookmark the current state of the application by either encoding the state in the URL or saving the state to the server. As of this release, RStudio Connect now supports server-side bookmarking of Shiny applications.

Other notable changes this release:

  • BREAKING: Changed the default for Authorization.DefaultUserRole from publisher to viewer. New users will now be created with a viewer account until promoted. The user roles documentation explains the differences. To restore the previous behavior, set DefaultUserRole = publisher. Because viewer users cannot be added as collaborators on content, this means that in order to add a remote user as a collaborator on content you must first create their account, then promote them to a publisher account. A configuration sketch for this setting appears after this list.
  • Fixed a bug in the previous release that had broken Applications.ViewerOnDemandReports and Applications.ViewerCustomizedReports. These settings are again functional and allow you to manage the capabilities of a viewer of a parameterized report on the server.
  • Tune the number of concurrent processes to use when building R packages. This is controlled with the Server.CompilationConcurrency setting and passed as the value to the make flag -jNUM. The default is to permit four concurrent processes. Decrease this setting in low memory environments.
  • The /etc/rstudio-connect/rstudio-connect.gcfg file is installed with more restrictive permissions.
  • Log file downloads include a more descriptive file name by default. Previously, we used the naming convention <jobId>.log, which resulted in file names like GBFCaiPE6tegbrEM.log. Now, we use the naming convention rstudio-connect.<appId>.<reportId>.<bundleId>.<jobType>.<jobId>.log, which results in file names like rstudio-connect.34.259.15.packrat_restore.GBFCaiPE6tegbrEM.log.
  • Bundle the admin guide and user guide in the product. You can access both from the Documentation tab.
  • Implemented improved, pop-out filtering panel when filtering content, which offers a better experience on small/mobile screens.
  • Improvements to the parameterized report pane when the viewer does not have the authority to render custom versions of the document.
  • Database performance improvements which should improve performance in high-traffic environments.
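
As a sketch of the first item above, the role change would go into the server configuration file (/etc/rstudio-connect/rstudio-connect.gcfg, mentioned below). The [Authorization] section name is our assumption based on the dotted setting name, so check the admin guide before applying it:

; /etc/rstudio-connect/rstudio-connect.gcfg
[Authorization]
DefaultUserRole = publisher

The RStudio Connect service would then need to be restarted for the change to take effect.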

Upgrade Planning: The migration of jobs from disk to the database may take a few minutes. The server will be unavailable during this migration which will be performed the first time RStudio Connect v1.4.6 starts. Even on the busiest of servers we would expect this migration to complete in under 5 minutes.

If you haven’t yet had a chance to download and try RStudio Connect we encourage you to do so. RStudio Connect is the best way to share all the work that you do in R (Shiny apps, R Markdown documents, plots, dashboards, etc.) with collaborators, colleagues, or customers.

You can find more details or download a 45 day evaluation of the product at https://www.rstudio.com/products/connect/. Additional resources can be found below.


Bio7 2.5 for MacOSX Released

Tue, 04/11/2017 - 16:38

(This article was first published on R – Bio7 Website, and kindly contributed to R-bloggers)

11.04.2017

A new release of Bio7 for MacOSX (64-bit) is available.
This release comes with a plethora of new functions for MacOSX and R.


(Screenshot with Bio7 Dark theme enabled and custom font colors of the R editor)

Bio7 2.5 Release Notes Installation:

Download and extract the installation file from http://bio7.org.

If you start Bio7 a warning or error can occur because of changes in how Apple treats signatures! Please see this post for a solution (go to System Preferences > Security & Privacy > General, set 'Allow apps downloaded from' to 'Anywhere', and change it back after you have downloaded and executed this app).

In addition, for MacOSX you have to install R and Rserve.

To install Rserve open the R shell and then execute the menu action "Options->Install Rserve (coop. mode)". This will download and install Rserve in your default R library location, see the video below (please make sure that your default MacOSX R library install location has write permissions! – normally you don't have to worry about this on MacOSX!).

The special version of Rserve can also be downloaded here:
https://bitbucket.org/maustenfeld/bio7-new/downloads

For a manual installation in the R prompt type the following command to install the compiled package (replace with your file path!):

install.packages("/Users/yourName/Downloads/Rserve_1.8-4_Mac_cooperative.tgz", repos = NULL)

The latest Bio7 examples can be downloaded (download and import as *.zip) from Github here!

For more information about the installation and adding  LaTeX, Knitr and rmarkdown support please consult the Bio7 User Guide.


Fast data loading from files to R

Tue, 04/11/2017 - 16:00

(This article was first published on Appsilon Data Science - R language, and kindly contributed to R-bloggers)

Recently we were building a Shiny App in which we had to load data from a very large dataframe. It was directly impacting the app initialization time, so we had to look into different ways of reading data from files to R (in our case customer provided csv files) and identify the best one.

The goal of my post is to compare:

  1. read.csv from utils, which was the standard way of reading csv files to R in RStudio,
  2. read_csv from readr which replaced the former method as a standard way of doing it in RStudio,
  3. load and readRDS from base, and
  4. read_feather from feather and fread from data.table.
Data

First let’s generate some random data

set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
                 replicate(10, stringi::stri_rand_strings(1000, 5)))

and save the files on a disk to evaluate the loading time. Besides the csv format we will also need feather, RDS and Rdata files.

path_csv <- '../assets/data/fast_load/df.csv'
path_feather <- '../assets/data/fast_load/df.feather'
path_rdata <- '../assets/data/fast_load/df.RData'
path_rds <- '../assets/data/fast_load/df.rds'

library(feather)
library(data.table)

write.csv(df, file = path_csv, row.names = F)
write_feather(df, path_feather)
save(df, file = path_rdata)
saveRDS(df, path_rds)

Next let’s check our files sizes:

files <- c('../assets/data/fast_load/df.csv',
           '../assets/data/fast_load/df.feather',
           '../assets/data/fast_load/df.RData',
           '../assets/data/fast_load/df.rds')
info <- file.info(files)
info$size_mb <- info$size/(1024 * 1024)
print(subset(info, select = c("size_mb")))
##                                       size_mb
## ../assets/data/fast_load/df.csv     1780.3005
## ../assets/data/fast_load/df.feather 1145.2881
## ../assets/data/fast_load/df.RData    285.4836
## ../assets/data/fast_load/df.rds      285.4837

As we can see, both the csv and feather files take up much more storage space: the csv file more than 6 times and the feather file more than 4 times as much as the RDS and RData files.

Benchmark

We will use microbenchmark library to compare the reading times of the following methods:

  • utils::read.csv
  • readr::read_csv
  • data.table::fread
  • base::load
  • base::readRDS
  • feather::read_feather

in 10 rounds.

library(microbenchmark)
benchmark <- microbenchmark(readCSV = utils::read.csv(path_csv),
                            readrCSV = readr::read_csv(path_csv, progress = F),
                            fread = data.table::fread(path_csv, showProgress = F),
                            loadRdata = base::load(path_rdata),
                            readRds = base::readRDS(path_rds),
                            readFeather = feather::read_feather(path_feather),
                            times = 10)
print(benchmark, signif = 2)
## Unit: seconds
##        expr   min    lq       mean median    uq   max neval
##     readCSV 200.0 200.0 211.187125  210.0 220.0 240.0    10
##    readrCSV  27.0  28.0  29.770890   29.0  32.0  33.0    10
##       fread  15.0  16.0  17.250016   17.0  17.0  22.0    10
##   loadRdata   4.4   4.7   5.018918    4.8   5.5   5.9    10
##     readRds   4.6   4.7   5.053674    5.1   5.3   5.6    10
## readFeather   1.5   1.8   2.988021    3.4   3.6   4.1    10

And the winner is… feather! However, using feather requires prior conversion of the file to the feather format.
Using load or readRDS can improve performance (second and third place in terms of speed) and has the benefit of storing a smaller/compressed file. In both cases you will have to convert your file to the proper format first.

When it comes to reading from csv format fread significantly beats read_csv and read.csv, and thus is the best option to read a csv file.

In our case we decided to go with the feather file, since conversion from csv to this format is just a one-time job and we didn't have a strict storage space limitation that would make the Rds or RData formats worth considering.

The final workflow was:

  1. reading a csv file provided by our customer using fread,
  2. writing it to feather using write_feather, and
  3. loading a feather file on app initialization using read_feather.

First two tasks were done once and outside of a Shiny App context.
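
In code, that workflow can be sketched roughly as follows (the file paths are placeholders):

library(data.table)
library(feather)

# steps 1 & 2: one-time conversion, run outside the Shiny app
df <- fread("customer_data.csv")
write_feather(df, "customer_data.feather")

# step 3: inside the Shiny app, at initialization
df <- read_feather("customer_data.feather")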

There is also quite an interesting benchmark done by Hadley here on reading complete files into R. Unfortunately, if you use the functions defined in that post, you will end up with a character type object, and you will have to apply string manipulations to obtain a commonly and widely used data frame.


Custom comparison function for sorting data in R

Tue, 04/11/2017 - 09:48

Many languages allow you to use a custom comparison function in sorting. R is not an exception, but it is not entirely straightforward – it requires you to define a new class and overload certain operators. Here is how to do it.

Consider the following example. You have a certain number of paired values, for example

v <- list(a=c(2,1), b=c(1,3), d=c(1,1), e=c(2,3))

The job is to order these pairs in the following way. Given two pairs, p1=(x1, y1) and p2=(x2, y2), p1 < p2 iff one of the following conditions is fulfilled: either x1 < x2 and y1 <= y2, or x1 <= x2 and y1 < y2. The point is that if we draw lines, where one end of the line is at the height x1, and the other end is at the height y1, we want to sort these lines only if they do not cross — at most, only if one of their ends overlaps (but not both, because then the lines would be identical):

On the figure above, left panel, p1 < p2, because one of the ends is below the end of the other line (x1 < x2 and y1=y2). Of course, if y1 < y2 the inequality still holds. On the other hand, the right panel shows a case where we cannot resolve the comparison; the lines cross, so we should treat them as equal.

If now we have a list of such pairs and want to order it, we will have a problem. Here is the thing: the desired order is {d, a, b, e}. The element d=(1,1) is clearly smaller (as defined above) than all the others. However, b=(1,3) is not smaller than a=(2,1), and a is not smaller than b; that means, that a is equal to b, and their order should not be modified.

There is no way to do that with regular tools such as order, especially since x and y may not only be on different scales — they might even be completely different data types! One might be numeric, the other a character string. Or, possibly, a type of prop from Monty Python (with a defined relation stating that a banana is less than a gun). We must use a custom comparator.

For this, we need to notice that the R functions sort and order rely on the function xtfrm. This in turn relies on the methods ==, > and [, defined for a given class. For numeric vectors, for example, these do what you would expect.

Our object v is a list whose elements are pairs of numbers. For this type of data there is no comparison defined, and comparing two pairs of numbers results in a vector of two logical values, which is not what we want.

> v[1] < v[2]
Error in v[1] < v[2] : comparison of these types is not implemented
> v[[1]] < v[[2]]
[1] FALSE TRUE

R, however, is an object-oriented language (even if it does not always feel like it). Comparisons (<, >, ==) are generic functions, and it is possible to define (or redefine) them for any class of objects. So here is the plan: we invent a new class for the object v and define custom comparisons for the elements of this class of objects. Remember that if we define a function whose name consists of a generic (like "plot" or "["), a dot, and the name of a class, we are defining the method for that class:

## make v an object of class "foo"
class(v) <- "foo"

## to use the "extract" ([) method,
## we need to momentarily change the class of x, because
## otherwise we will end up in an endless loop
'[.foo' <- function(x, i) {
  class(x) <- "list"
  x <- x[i]
  class(x) <- "foo"
  x
}

## define ">" as stated above
## the weird syntax results from the fact that a and b
## are lists with one element, this element being a vector
## of a pair of numbers
'>.foo' <- function(a, b) {
  a <- a[[1]]
  b <- b[[1]]
  ifelse((a[1] > b[1] && a[2] >= b[2]) ||
         (a[1] >= b[1] && a[2] > b[2]), TRUE, FALSE)
}

## if we can't find a difference, then there is no difference
'==.foo' <- function(a, b) ifelse(a > b || b > a, FALSE, TRUE)

## we don't need that, but for the sake of completeness...
'<.foo' <- function(a, b) b > a

This will now do exactly what we want:

> v["a"] == v["b"]
[1] TRUE
> v["a"] > v["d"]
[1] TRUE
> sort(v)
$d
[1] 1 1

$a
[1] 2 1

$b
[1] 1 3

$e
[1] 2 3

attr(,"class")
[1] "foo"

Data validation with the assertr package

Tue, 04/11/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

This is cross-posted from Tony’s blog onthelambda.com

Version 2.0 of my data set validation package assertr hit CRAN just this weekend. It has some pretty great improvements over version 1. For those new to the package, what follows is a short and new introduction. For those who are already using assertr, the text below will point out the improvements.

I can (and have) gone on and on about the treachery of messy/bad datasets. Though it's substantially less exciting than… pretty much everything else, I believe (in proportion to the heartache and stress it causes) we don't spend enough time talking about it or building solutions around it. No matter how new and fancy your ML algorithm is, its success is predicated upon a properly sanitized dataset. If you are using bad data, your approach will fail—either flagrantly (best case) or unnoticeably (considerably more probable and considerably more pernicious).

assertr is an R package to help you identify common dataset errors. More specifically, it helps you easily spell out your assumptions about how the data should look and alerts you of any deviation from those assumptions.

I'll return to this point later in the post when we have more background, but I want to be up front about the goals of the package; assertr is not (and can never be) a "one-stop shop" for all of your data validation needs. The specific kinds of checks individuals or teams have to perform on any particular dataset are often far too idiosyncratic to ever be exhaustively addressed by a single package (although the assertive meta-package may come very close!). But all of these checks reuse motifs and follow the same patterns. So, instead, I'm trying to sell assertr as a way of thinking about dataset validations—a set of common dataset validation actions. If we think of these actions as verbs, you could say that assertr attempts to impose a grammar of error checking for datasets.

In my experience, the overwhelming majority of data validation tasks fall into only five different patterns:

  • For every element in a column, you want to make sure it fits certain criteria. Examples of this strain of error checking would be to make sure every element is a valid credit card number, or fits a certain regex pattern, or represents a date between two other dates. assertr calls this verb assert.
  • For every element in a column, you want to make sure certain criteria are met, but the criteria can be decided only after looking at the entire column as a whole. For example, testing whether each element is within n standard deviations of the mean of the elements requires computing on the whole column first in order to know which criteria to check against. assertr calls this verb insist.
  • For every row of a dataset, you want to make sure certain assumptions hold. Examples include ensuring that no row has more than n number of missing values, or that a group of columns are jointly unique and never duplicated. assertr calls this verb assert_rows.
  • For every row of a dataset, you want to make sure certain assumptions hold but the criteria can be decided only after looking at the entire column as a whole. This closely mirrors the distinction between assert and insist, but for entire rows (not individual elements). An example of using this would be checking to make sure that the Mahalanobis distance between each row and all other rows are within n number of standard deviations of the mean distance. assertr calls this verb insist_rows.
  • You want to check some property of the dataset as a whole object. Examples include making sure the dataset has more than n columns, making sure the dataset has some specified column names, etc… assertr calls this last verb verify.

Some of this might sound a little complicated, but I promise this is a worthwhile way to look at dataset validation. Now we can begin with an example of what can be achieved with these verbs. The following example is borrowed from the package vignette and README…

Pretend that, before finding the average miles per gallon for each number of engine cylinders in the mtcars dataset, we wanted to confirm the following dataset assumptions…

  • that it has the columns mpg, vs, and am
  • that the dataset contains more than 10 observations
  • that the column for 'miles per gallon' (mpg) is a positive number
  • that the column for ‘miles per gallon’ (mpg) does not contain a datum that is outside 4 standard deviations from its mean
  • that the am and vs columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only
  • each row contains at most 2 NAs
  • each row is unique jointly between the mpg, am, and wt columns
  • each row's mahalanobis distance is within 10 median absolute deviations of all the distances (for outlier detection)

library(assertr)
library(magrittr)
library(dplyr)

mtcars %>%
  verify(has_all_names("mpg", "vs", "am", "wt")) %>%
  verify(nrow(.) > 10) %>%
  verify(mpg > 0) %>%
  insist(within_n_sds(4), mpg) %>%
  assert(in_set(0, 1), am, vs) %>%
  assert_rows(num_row_NAs, within_bounds(0, 2), everything()) %>%
  assert_rows(col_concat, is_uniq, mpg, am, wt) %>%
  insist_rows(maha_dist, within_n_mads(10), everything()) %>%
  group_by(cyl) %>%
  summarise(avg.mpg = mean(mpg))

Before assertr version 2, the pipeline would immediately terminate at the first failure. Sometimes this is a good thing. However, sometimes we’d like to run a dataset through our entire suite of checks and record all failures. The latest version includes the chain_start and chain_end functions; all assumptions within a chain (below a call to chain_start and above chain_end) will run from beginning to end and accumulate errors along the way. At the end of the chain, a specific action can be taken but the default is to halt execution and display a comprehensive report of what failed including line numbers and the offending datum, where applicable.
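A rough sketch of what such a chain might look like, reusing a few of the checks from the example above (the exact set of assertions is just for illustration):

library(assertr)
library(magrittr)

mtcars %>%
  chain_start() %>%              # start accumulating failures instead of stopping
  verify(nrow(.) > 10) %>%
  verify(mpg > 0) %>%
  assert(in_set(0, 1), am, vs) %>%
  chain_end()                    # report everything that failed (default: halt here)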

Another major improvement since the last version of assertr on CRAN is that assertr errors are now S3 classes (instead of dumb strings). Additionally, the behavior of each assertion statement on success (no error) and failure can now be flexibly customized. For example, you can now tell assertr to just return TRUE and FALSE instead of returning the data passed in or halting execution, respectively. Alternatively, you can instruct assertr to just give a warning instead of throwing a fatal error. For more information on this, see help("success_and_error_functions")
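As a quick illustration (hedging a little on the exact helper names; the help page above is the authoritative reference), the behavior on failure and success is controlled through the error_fun and success_fun arguments:

library(assertr)
library(magrittr)

# warn instead of throwing a fatal error when a check fails
mtcars %>% assert(in_set(0, 1), am, vs, error_fun = just_warn)

# return TRUE/FALSE instead of the data or an error
mtcars %>% verify(nrow(.) > 10,
                  success_fun = success_logical,
                  error_fun = error_logical)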

Beyond these examples

Since the package was initially published on CRAN (almost exactly two years ago), many people have asked me how they should go about using assertr to test a particular assumption (and I'm very happy to help if you have one of your own, dear reader!). In every single one of these cases, I've been able to express it as an incantation using one of these 5 verbs. It also underscored, to me, that creating specialized functions for every need is a pipe dream. There are, however, two good pieces of news.

The first is that there is another package, assertive, that greatly enhances the assertr experience. The predicates (functions that start with is_) from this (meta)package can be used in assertr pipelines just as easily as assertr's own predicates. And assertive has an enormous number of them! Some specialized and exciting examples include is_hex_color, is_ip_address, and is_isbn_code!

The second is that if assertive doesn't have what you're looking for, then with just a little bit of studying the assertr grammar you can whip up your own predicates with relative ease. Using some of these basic constructs and a little effort, I'm confident that the grammar is expressive enough to adapt completely to your needs.
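As a toy illustration of my own (not from the package documentation): a predicate is just a function that returns TRUE or FALSE for the values it is given, so rolling your own can be as simple as this:

library(assertr)
library(magrittr)

# a home-made predicate: is the value a whole number?
is_whole_number <- function(x) !is.na(x) & x == floor(x)

# cyl, gear and carb in mtcars are all whole numbers, so this passes the data through
mtcars %>% assert(is_whole_number, cyl, gear, carb) %>% head()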

If this package interests you, I urge you to read the most recent package vignette here. If you're an assertr old-timer, I point you to this NEWS file that lists the changes from the previous version.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.

Sakura blossoms in Japan

Tue, 04/11/2017 - 07:00

(This article was first published on Opiate for the masses, and kindly contributed to R-bloggers)

Springtime and cherry blossoms

It is spring, and the flowers bloom. One tree in particular symbolises this: the Japanese cherry (sakura) tree. Every year, millions of Japanese enjoy hanami, jointly watching the cherry blossoms, philosophising about the transient nature of beauty, and the inevitability of death.

There are regular “sakura forecasts” on TV, and elsewhere. The Japanese National Tourist Office has its own map and details available on their website as well.

The data and graph section of The Economist recently published a beautiful chart showing 1200 years of data on the cherry blossom – as well as one of their customary great puns: “the elephant in the bloom”. I wondered if I could replicate the graph in R.

I remembered an emoji font I had stumbled across, and that there is a sakura (cherry blossom) emoji. I also felt a bit competitive – with a Japanese author on this blog's author roll, and another author being a certified expert on emojis, I felt that I needed to show off that I, too, can poach in their areas of expertise!

Data

The Economist's article mentions a professor Yasuyuki Aono, who wrote several papers using data on Japanese cherry blossom dates. A quick googling brought up his homepage. He also links to a page detailing the cherry blossom data. Fantastic! A pretty straightforward data grab it is, then.

if (!file.exists("sakura.xls")) {
  # data is available on AONO's university homepage:
  # http://atmenv.envi.osakafu-u.ac.jp/aono/kyophenotemp4/
  aono.dta <- "http://atmenv.envi.osakafu-u.ac.jp/osakafu-content/uploads/sites/251/2015/10/KyotoFullFlower7.xls"
  # make sure to read as binary, because windows is stupid
  download.file(aono.dta, "sakura.xls", mode = "wb")
}

# read in data
dta <- readxl::read_excel("sakura.xls", skip = 25) %>%
  # make "good" column names
  setNames(make.names(names(.), unique = TRUE)) %>%
  mutate(
    # convert the blossom date to R dateformat
    blossom.date = as.Date(paste0(sprintf("%04d", AD), "0", Full.flowering.date),
                           format = "%Y%m%d")
  )

We download the data if not available on disk, do some cleaning of column names, and generate a proper date column of the full flowering date, since this is the data with the most observations.

It’s a fascinating dataset: 1200 years of human records on a yearly event! It’s not often that one can work with these kinds of data. For this, I would like to thank professor Aono. Generating such a historical dataset is a tremendous amount of work (he probably had to find and read through hundreds of historical records), and providing the data on his homepage is not something I see often in academia.

Graphing

The original emoji graphing system I had in mind, emoGG, worked, but I could not change the size of the individual emojis. It turns out, emoGG is deprecated, and I should look at emojifont instead.

emojifont uses geom_text() to draw the emojis onto a ggplot() graph. After a quick install of the font I was ready to go. Two things one needs to keep in mind when working with emojifont: you will need to load the font to start with, and if you’re using RStudio, you will need to initialise a new graphics environment. The standard RStudio environment is not able to preview/show the fonts. Both can be accomplished with these commands:

# need this to load the font
load.emojifont()
# only needed for RStudio
windows()

One other thing that was problematic: emojifont uses a font to populate the graph, with ggplot2::aes(label = "labelname") specifying the emoji. The label needs to be pulled from the original dataset, not generated within the aes() call.

# plot as sakuras
dta %>%
  # include the sakura as emoji character in dataset
  mutate(
    sakura.emoji = emoji("cherry_blossom"),
    # join to a common year for axis label (2000 is a leap year)
    common.date = as.Date(paste0("2000-", format(blossom.date, "%m-%d")), "%Y-%m-%d")
  )

I created the common.date variable to scale the y-axis of the graph to a month-day format.

dta %>%
  # plot in ggplot
  ggplot(aes(x = AD, y = common.date)) +
  # alternatively, with geom_emoji:
  # geom_emoji("cherry_blossom", x=dta$AD, y=dta$Full.flowering.date..DOY., size = 3)+
  # geom_point(size = 0.5, colour = "red")+
  # include emoji as text, h-/vjust to center; strange it needs vjust 0.25 -- why? 975572 BD77A4
  geom_text(aes(label = sakura.emoji, hjust = 0.5, vjust = 0.25),
            family = "EmojiOne", size = 4, colour = "#FF506E") +
  # trend line
  geom_smooth(method = "loess", span = 0.1, fill = "#D2A5C2", colour = "grey", size = 0.5)

To copy the Economist's graph as closely as possible, I manually set the y-axis breaks. The x-axis breaks are more interesting, because the original graph has major ticks every 200 years, labelled, and minor ticks every 100 years in between, which are not labelled. This is unfortunately not possible in ggplot currently, so I had to resort to a workaround.

I create the sequence of years, with ticks every 100 years (so 800, 900, 1000, ...). I then “overwrite” every even element (900, 1100, 1300, ...) of the vector, resulting in a “blank” label for those years. The axis label would thus look like this: 800, "", 1000, "", 1200, ....

scale_y_date(
  name = "Date of cherry-blossom peak-bloom",
  breaks = c("2000-03-27", "2000-04-01", "2000-04-10", "2000-04-20",
             "2000-05-01", "2000-05-04") %>% as.Date(),  # Apr-01
  labels = scales::date_format("%b-%d")
) +
scale_x_continuous(
  limits = c(800, 2020),
  # axis ticks every 200 years
  breaks = seq(800, 2000, 100),
  # no minor axis ticks in ggplot2::theme(): http://stackoverflow.com/questions/14490071/adding-minor-tick-marks-to-the-x-axis-in-ggplot2-with-no-labels
  # length(breaks): 13; replace every even element with empty string, to remove from axis labels
  labels = replace(seq(800, 2000, 100), seq(2, 12, 2), ""),
  name = "Year"
)

All that remains then is some theme() hacking to make the graph pretty.
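The exact settings live in the repository linked below; a rough approximation of that kind of clean-up (my guess, not the actual code) might look like:

last_plot() +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),             # drop minor grid lines
    axis.ticks.x = element_line(colour = "grey50"),
    plot.title = element_text(face = "bold")
  )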

Behold, 1200 years of cherry blossom history!

I'm not super happy with the graph colours; they seem too "light" and pastel, and not really very readable. The Economist's version is in purple; much more readable and much prettier to look at. I did want to use the sakura pink tones, though, and thus the graph colour scheme seems as fragile and transient as the actual sakura blossoms.

Concluding remarks

The code is available on github, as always.

Aono uses the data to predict March temperatures, and in various papers he manages to do this to an astonishing degree of accuracy: 0.1°C.

It's very disheartening to see the blossom dates dramatically moving earlier into the year in the last 50 years. Global warming is to blame, and it always saddens me that people choose to ignore these facts… As a cold-loving northern German, I will need to find a retirement location in northern Norway if things continue the way they do!

Sakura blossoms in Japan was originally published by Kirill Pomogajko at Opiate for the masses on April 11, 2017.

To leave a comment for the author, please follow the link and comment on their blog: Opiate for the masses.

R Plot Function — The Options

Tue, 04/11/2017 - 00:02

(This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers)

R's plot function is probably the most used visualization function in R. It's simple, easy and gets the job done. It's also highly customizable. Adding unnecessary styling and information to a visualization/plot is not really recommended because it can take away from what's being portrayed, but there are times when you just have to. Whether it's for pure aesthetics, to convey multiple things in one plot, or any other reason, here are the options you can use in R's base plot() function.

The Data Points

We’re going to be using the cars dataset that is built in R. To follow along with real code, here’s an interactive R Notebook. Feel free to copy it and play around with the code as you read along.

So if we were to simply plot the dataset using just the data as the only parameter, it’d look like this:

plot(dataset)
Plot using no options.

Dot Style

The default data points are circles with an empty fill. To change the style of the dots (the data points), we use the option ‘pch’:

plot(dataset,pch=19)
Plot using the ‘pch’ option.

The 'pch' option accepts several codes; here is a grid with all the different data point styles (the default is 1):


https://greggilbertlab.sites.ucsc.edu/wp-content/uploads/sites/276/2015/10/symbols.jpg

Data Point Size

To change the size of the data point, we use the ‘cex’ option:

plot(dataset,pch=19,cex=2)
Plot with the data point style and size options used.

Data Point Color

The default color for the data points is black, to change that we use the ‘col’ option:

plot(dataset,pch=19,cex=2,col="pink")
Plot with the data point, size and color options used.

The ‘col’ option takes in both words and integers to identify the color. So you can use words like ‘green’, ‘wheat’, ‘red’ etc… and color codes. To get all the colors, just run colors() and it will return all the colors available. So if you know the location of your favorite color in the return value of the colors() function, you can just run plot(dataset,col=colors()[45]) and you’ll have a plot with your favorite color (or just save the color in a variable somewhere in the document if you want to keep re-using it). There are 657 colors in the colors() function.
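For instance, a tiny sketch of the "save it in a variable" idea (using the same dataset object as above):

my_colour <- colors()[45]                          # look the colour up once...
plot(dataset, pch = 19, cex = 2, col = my_colour)  # ...and reuse it wherever needed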

Axis Labels

If you work in a team, or even if you work alone (but especially if you work with a group), always label your axes. If the dataset you're using has headers or column titles, the plot function will automatically use those as the labels for the plot axes. To add your own, use the xlab and ylab arguments:

plot(dataset, xlab = "Speed (km/h)", ylab = "Distance (km)")
Plot with custom axis labels.

Plot Legend

Plot legends can be called using the legend() function. The first two parameters are the x-position and y-position of the legend on the plot. You call this function right after you plot:

plot(dataset,col="blue",pch=4)
legend(20,110,c("Cars"),col="blue",pch=4)
Plot legend.

You want the legend symbol to match the symbol used in the plot. The legend takes in the same pch codes used in the plot() function for the symbols. In addition, you should of course use the same color for the symbols in the legend and the symbols in the plot. Here are some of the options you can play around with in the legend() function:

legend(xPosition: int, yPosition: int, labels: array, col :int|string, cex: int, pch: int)

These are just what I call the essentials; there is a lot more in the documentation (see below).

And that’s it. Like I said before, there are several other options you can use like regression/trend lines, plot sizing etc… These are just the essentials when you want a little something extra on your visualization. In particular stages of the data analysis process, the less you add to your plots, the better.

Reference and Documentation


R Plot Function — The Options was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: R Language in Datazar Blog on Medium.

8th MilanoR Meeting – Presentations, photos and video!

Mon, 04/10/2017 - 23:45

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Hello R-users <3

The 8th MilanoR Meeting was great! Thank you very much for your interest and participation!

A short recap for those who weren’t there: the meeting was last Wednesday evening at the Microsoft House (a very nice location, thanks @dr_mattia for your support!).

We had two exceptional speakers: Stefano Iacus, member of the R Foundation and founder of Voices from the Blogs, and Romain François, long-time author of the Rcpp package, active in data science at ThinkR, who came all the way from Paris for us.

The event started with my quick presentation about our MilanoR community and our upcoming events: if you want to take a look, here it is.


Welcome Presentation 

by Mariachiara Fortuna

 

Then it was Stefano's turn; he introduced us to the yuima package, a powerful S4 framework for the simulation and inference of several classes of time series. On top of the yuima package, Stefano and his collaborators built a great interactive dashboard, yuimaGUI, which makes data I/O, exploratory data analysis, and model fitting accessible to a wider audience. The yuima package and the related app look very powerful: if you want an introduction to their capabilities, the presentation is broad and deep.

yuimaGUI a Shiny app for the analysis and simulation of financial time series

by Stefano Iacus

 

Then Romain François gave us the chance to look under the R surface, down to the C++ layer. Romain told us about the past and the state of the art of the R-C++ relationship in a surprisingly clear and accessible way. He then shared some insights on how he imagines the future of R and C++ integration and the challenge it represents: you can find more or less everything in the presentation here, or you may follow this link.

R and C++: past, present and future

by Romain François

 

The meeting ended with free refreshments and some networking time – you can see everything in the pictures.

This time we were able to have a Facebook streaming during the meeting, so the full recording is available in our Facebook channel. Thanks @Akiro for the perfect quality!

Meeting full video

 

 

We are grateful to our sponsors, Quantide and Microsoft, who made all of this possible, to Stefano and Romain who shared their knowledge with us, and to all of you who were there. Thank you so much, and we hope to see you soon at R-Lab#2 (today)!

The post 8th MilanoR Meeting – Presentations, photos and video! appeared first on MilanoR.

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.

Linear splines with convenient parametrisations

Mon, 04/10/2017 - 23:38

(This article was first published on R – Brokering Closure, and kindly contributed to R-bloggers)

I have just published a small package lspline on CRAN that implements linear splines using convenient parametrisations such that

  • coefficients are slopes of consecutive segments
  • coefficients capture slope change at consecutive knots

Knot locations can be specified

  • manually (with lspline())
  • at breaks dividing the range of x into q equal-frequency intervals (with qlspline())
  • at breaks dividing the range of x into n equal-width intervals (with elspline())

The implementation follows Greene (2003, chapter 7.5.2).
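In formula terms (my own summary of that parametrisation, so treat the notation as illustrative rather than as the package's documentation): with knots $k_1 < k_2$, the marginal parametrisation fits

y = \beta_0 + \beta_1 x + \beta_2 (x - k_1)_+ + \beta_3 (x - k_2)_+ + \varepsilon, \qquad (u)_+ = \max(u, 0),

so $\beta_2$ and $\beta_3$ are the changes in slope at the knots, while the default (non-marginal) parametrisation re-expresses the same fit so that each coefficient is the slope of one segment.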

The package sources are on GitHub here.

Examples

We will use the following artificial data with knots at x=5 and x=10:

set.seed(666)
n <- 200
d <- data.frame(
  x = scales::rescale(rchisq(n, 6), c(0, 20))
)
d$interval <- findInterval(d$x, c(5, 10), rightmost.closed = TRUE) + 1
d$slope <- c(2, -3, 0)[d$interval]
d$intercept <- c(0, 25, -5)[d$interval]
d$y <- with(d, intercept + slope * x + rnorm(n, 0, 1))

Plotting y against x:

library(ggplot2)
fig <- ggplot(d, aes(x = x, y = y)) +
  geom_point(aes(shape = as.character(slope))) +
  scale_shape_discrete(name = "Slope") +
  theme_bw()
fig

The slopes of the consecutive segments are 2, -3, and 0.

Setting knot locations manually

We can parametrize the spline with slopes of individual segments (default marginal=FALSE):

library(lspline)
m1 <- lm(y ~ lspline(x, c(5, 10)), data = d)
knitr::kable(broom::tidy(m1))

term                     estimate   std.error    statistic     p.value
(Intercept)             0.1343204   0.2148116    0.6252941   0.5325054
lspline(x, c(5, 10))1   1.9435458   0.0597698   32.5171747   0.0000000
lspline(x, c(5, 10))2  -2.9666750   0.0503967  -58.8664832   0.0000000
lspline(x, c(5, 10))3  -0.0335289   0.0518601   -0.6465255   0.5186955

Or parametrize with coefficients measuring the change in slope (with marginal=TRUE):

m2 <- lm(y ~ lspline(x, c(5, 10), marginal = TRUE), data = d)
knitr::kable(broom::tidy(m2))

term                                      estimate   std.error    statistic     p.value
(Intercept)                              0.1343204   0.2148116    0.6252941   0.5325054
lspline(x, c(5, 10), marginal = TRUE)1   1.9435458   0.0597698   32.5171747   0.0000000
lspline(x, c(5, 10), marginal = TRUE)2  -4.9102208   0.0975908  -50.3143597   0.0000000
lspline(x, c(5, 10), marginal = TRUE)3   2.9331462   0.0885445   33.1262479   0.0000000

The coefficients are

  • lspline(x, c(5, 10), marginal = TRUE)1 – the slope of the first segment
  • lspline(x, c(5, 10), marginal = TRUE)2 – the change in slope at knot x = 5; it is changing from 2 to -3, so by -5
  • lspline(x, c(5, 10), marginal = TRUE)3 – the change in slope at knot x = 10; it is changing from -3 to 0, so by 3

The two parametrisations (obviously) give identical predicted values:

all.equal( fitted(m1), fitted(m2) ) ## [1] TRUE

Graphically

fig + geom_smooth(method="lm", formula=formula(m1), se=FALSE) + geom_vline(xintercept = c(5, 10), linetype=2)

Knots at n equal-length intervals

Function elspline() sets the knots at points dividing the range of x into n equal length intervals.

m3 <- lm(y ~ elspline(x, 3), data = d)
knitr::kable(broom::tidy(m3))

term               estimate   std.error    statistic    p.value
(Intercept)       3.5484817   0.4603827     7.707678   0.00e+00
elspline(x, 3)1   0.4652507   0.1010200     4.605529   7.40e-06
elspline(x, 3)2  -2.4908385   0.1167867   -21.328105   0.00e+00
elspline(x, 3)3   0.9475630   0.2328691     4.069080   6.84e-05

Graphically

fig + geom_smooth(aes(group=1), method="lm", formula=formula(m3), se=FALSE, n=200)

Knots at quantiles of x

Function qlspline() sets the knots at points dividing the range of x into q equal-frequency intervals.

m4 <- lm(y ~ qlspline(x, 4), data = d)
knitr::kable(broom::tidy(m4))

term               estimate   std.error    statistic     p.value
(Intercept)       0.0782285   0.3948061     0.198144   0.8431388
qlspline(x, 4)1   2.0398804   0.1802724    11.315548   0.0000000
qlspline(x, 4)2   1.2675186   0.1471270     8.615132   0.0000000
qlspline(x, 4)3  -4.5846478   0.1476810   -31.044273   0.0000000
qlspline(x, 4)4  -0.4965858   0.0572115    -8.679818   0.0000000

Graphically

fig + geom_smooth(method="lm", formula=formula(m4), se=FALSE, n=200)

Installation

Stable version from CRAN or development version from GitHub with

devtools::install_github("mbojan/lspline", build_vignettes = TRUE)

Acknowledgements

Inspired by Stata command mkspline and function ares::lspline from Junger & Ponce de Leon (2011). As such, the implementation follows Greene (2003), chapter 7.5.2.

  • Greene, William H. (2003) Econometric analysis. Pearson Education
  • Junger & Ponce de Leon (2011) ares: Environment air pollution epidemiology: a library for timeseries analysis. R package version 0.7.2 retrieved from CRAN archives.

To leave a comment for the author, please follow the link and comment on their blog: R – Brokering Closure.

Saving input and output with sp_execute_external_script

Mon, 04/10/2017 - 22:06

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

Again I was at the point where I needed to store and save all the R code executed through sp_execute_external_script to an external file.

Soon, you will find out several interesting things. I will start with the following example:

USE [WideWorldImporters];
GO

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = N'
        d <- InputDataSet
        c <- data.frame(Num_V1 = c(1,2,3))
        c
        OutputDataSet <- c'
    ,@input_data_1 = N'SELECT 1 AS Nmbrs_From_R'
WITH RESULT SETS ((Numbers_From_R INT));

The result is a column called "Numbers_From_R" with three rows, taken from the data frame. This is very easy and straightforward.

DMV

By using the dynamic management view sys.dm_exec_query_stats as follows:

SELECT
     QM_ST.[TEXT] AS [Query]
    ,DM_QS.last_execution_time
    ,DM_QS.query_hash
    ,DM_QS.query_plan_hash
FROM
    sys.dm_exec_query_stats AS DM_QS
    CROSS APPLY sys.dm_exec_sql_text(DM_QS.sql_handle) AS QM_ST
ORDER BY
    DM_QS.last_execution_time DESC

Surprisingly I get only the following query returned:

sp_execute_external_script: SELECT 1 AS Nmbrs_From_R

which is far from what was executed in the first place!

EXECUTION PLANS

When using the sys.dm_exec_query_plan dynamic management view to generate the executed query plan, I get a similar result, with no R code and little sign of the SQL query that was passed to the sp_execute_external_script procedure.

Much the same result emerges when showing the actual execution plan in SSMS. Only the XML-UDX operator is shown.

So far, there is very slim possibility of getting any additional information from the query statistics DMV or the execution plan.

SQL SERVER PROFILER

So opening SQL Profiler and running the example sp_execute_external_script code, I was finally able to see the actual R code within profiler:

Upon closer look, we can see that the profiler wraps execution of the external procedure with the command SET STATISTICS XML ON/OFF. So we can store the results from the profiler into a table or trace file and later filter out the R code!

QUERY STORE

Query Store is a very useful new feature of the flagship MSSQL 2016. Storing queries and execution times is needed for later performance analysis. So in this phase, let's just see if we can store external procedure code in the Query Store.

After executing the R external procedure, I run the following query to check the Query Store (QS):

SELECT
     QSQT.query_text_id
    ,QSQT.query_sql_text
    ,QSP.plan_id
FROM
    sys.query_store_plan AS QSP
    JOIN sys.query_store_query AS QSQ
      ON QSP.query_id = QSQ.query_id
    JOIN sys.query_store_query_text AS QSQT
      ON QSQ.query_text_id = QSQT.query_text_id

And the results are – in a way – not surprising at all, since many of the Query Store statistics are based on DMVs. So the result for my external procedure is again not very informative when it comes to extracting the R code:

Something we have already seen a couple of times. And no sign of the execution of the R script. In fact, looking at this, it is hard even to tell that it was passed to the RLaunchpad.exe external program.

SINK

sink() is an R function that stores the output of executed R code in an external file. With either of the two T-SQL snippets below, I am never able to get either the results or the R code itself.

In case of results:

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = N'
        sink("C:\\DataTK\\logRSQLsession3.txt")
        d <- InputDataSet
        c <- data.frame(Num_V1 = c(1,2,3))
        c
        sink()
        OutputDataSet <- c'
    ,@input_data_1 = N'SELECT 1 AS Nmbrs_From_R'
WITH RESULT SETS ((Numbers_From_R INT));

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = N'
        c <- data.frame(Num_V1 = c(1,2,3))
        c
        sink("C:\\DataTK\\logRSQLsession3.txt")'
    ,@input_data_1 = N'SELECT 1 AS Nmbrs_From_R'
WITH RESULT SETS NONE;

In both cases the file is created, but it is just that. Empty file. No content whatsoever.

LOAD

save()/load() will store intermediate results in a file for later analysis, or for semi-aggregated data used in further calculations. So I have tested it as follows:

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = N'
        c <- data.frame(Num_V1 = c(1,2,3))
        c
        save(c, file="C:\\DataTK\\logRSQLsession3.rda")
        #load(file="C:\\DataTK\\logRSQLsession3.rda")'
    ,@input_data_1 = N'SELECT 1 AS Nmbrs_From_R'
WITH RESULT SETS NONE;

-- LOAD RESULTS
EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = N'
        load(file="C:\\DataTK\\logRSQLsession3.rda")
        OutputDataSet <- c'
    ,@input_data_1 = N'SELECT 1 AS Nmbrs_From_R'
WITH RESULT SETS ((Num_V1 INT));

 

EXTENSIBILITY LOG

The Extensibility Log stores information about the session, but it does not store the R code, the R environment, or the data – just session information. Navigate to:

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\LOG\ExtensibilityLog

to check the content and see if there is anything useful for your needs.

Conclusion

We are very limited in terms of exporting executed R code, results, or logs. The same applies to importing any additional code. We have seen that import and source are not working, whereas load() for loading *.rda files is working – at least something. There should be more ways to get into this, especially with Rterm or vanilla R, but the idea was to have everything run comfortably from the SSMS environment.

As you can see, there are few possibilities for storing R code separately or storing R execution logs in external files. But I presume I haven't exhausted all the possibilities, so there should still be some ways to try to do this.

As always, the code is available at Github.

Happy Rrrrr!

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

In case you missed it: March 2017 roundup

Mon, 04/10/2017 - 21:33

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from March of particular interest to R users. 

A tutorial and comparison of the SparkR, sparklyr, rsparkling, and RevoScaleR packages for using R with Spark.

An analysis of Scrabble games between AI players.

The doAzureParallel package, a backend to "foreach" for parallel computations on Azure-based clusters.

The UK government project to automate reporting of official statistics with R.

Data science languages R and Python rank highly in the latest Redmonk popularity rankings.

FiveThirtyEight used R to find clusters of similar subreddits.

RTVS 1.0, which provides R support to Visual Studio, is now available.

The mrsdeploy package (part of Microsoft R Server) facilitates publishing an R function as a web service on Azure.

The Call for Papers for the EARL conferences in London and San Francisco closes on April 14.

Alteryx Designer has been integrated with Microsoft R.

StitchFix, the personal styling service, uses R and Python as part of their data science process.

A review of "Testing R Code", by Richie Cotton.

On the connection between AUC and the Mann-Whitney U-Statistic.

An overview of neural networks and R packages to train them.

Performance benchmarks of rxNeuralNetwork (in the MicrosoftML package).

Updates to the Data Science Virtual Machine for Linux.

A timelapse of the founding of cities since 3500BC.

A set of R scripts and sample data to predict employee attrition.

R 3.3.3 is now available.

The htmlwidgets gallery catalogs interactive web-based charts for R.

R-based analyses to predict the length of a hospital stay.

Scholarships to encourage diversity at useR!2017.

And some general interest stories (not necessarily related to R): simulating the galaxy;  a poor showing at Crufts; why some memes survive better than others;  a digital clock made in Life; and a tongue-in-cheek guide to reading sheet music.

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Forecasting: Time Series Exploration Exercises (Part-1)

Mon, 04/10/2017 - 17:50

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

R provides powerful tools for forecasting time series data such as sales volumes, population sizes, and earthquake frequencies. A number of those tools are also simple enough to be used without mastering sophisticated underlying theories. This set of exercises is the first in a series offering an opportunity to practice using such tools, which include the ARIMA model, exponential smoothing models, and others.
This first set provides training in the exploration of regularly spaced time series data (such as weekly, monthly, or quarterly), which may be useful for selecting a predictive model. The set covers:
– visual inspection of time series,
– estimation of trend and seasonal patterns,
– finding whether a series is stationary (i.e. whether it has a constant mean and variance),
– examination of correlation between lagged values of a time series (autocorrelation).
The exercises make use of functions from the packages forecast, and tseries. Exercises are based on a dataset on retail sales volume by US clothing and clothing accessory stores for 1992-2016 retrieved from FRED, the Federal Reserve Bank of St. Louis database (download here). The data represent monthly sales in millions of dollars.
For other parts of the series follow the tag forecasting
Answers to the exercises are available here

Exercise 1
Read the data from the file sales.csv.

Exercise 2
Transform the data into a time series object of the ts type (indicate that the data is monthly, and the starting period is January 1992).
Print the data.

Exercise 3
Plot the time series. Ensure that the y axis starts from zero.

Exercise 4
Use the gghistogram function from the forecast package to visually inspect the distribution of time series values. Add a kernel density estimate and a normal density function to the plot.

Exercise 5
Use the decompose function to break the series into seasonal, trend, and irregular components (apply multiplicative decomposition).
Plot the decomposed series.

Exercise 6
Explore the structure of the decomposed object, and find seasonal coefficients (multiples). Identify the three months with the greatest coefficients, and the three months with the smallest coefficients. (Note that the coefficients are equal in different years).

Exercise 7
Check whether the time series is trend-stationary (i.e. its mean and variance are constant with respect to a trend) using the function kpss.test from the tseries package. (Note that the null hypothesis of the test is that the series is trend-stationary).

Exercise 8
Use the diff function to create a differenced time series (i.e. a series that includes differences between the values of the original series), and test it for trend stationarity.

Exercise 9
Plot the differenced time series.

Exercise 10
Use the Acf and Pacf functions from the forecast package to explore autocorrelation of the differenced series. Find at which lags correlation between lagged values is statistically significant at 5% level.

Related exercise sets:
  1. Stock Prices Analysis part 3 – Exercises
  2. zoo time series exercises
  3. Correlation and Correlogram Exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Transaction Costs are Not an Afterthought; Transaction Costs in quantstrat

Mon, 04/10/2017 - 17:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

DISCLAIMER: Any losses incurred based on the content of this post are the responsibility of the trader, not the author. The author takes no responsibility for the conduct of others nor offers any guarantees.

Introduction: Efficient Market Hypothesis

Burton Malkiel, in the finance classic A Random Walk Down Wall Street, made the accessible, popular case for the efficient market hypothesis (EMH). One can sum up the EMH as, “the price is always right.” No trader can know more about the market; the market price for an asset, such as a stock, is always correct. This means that trading, which relies on forecasting the future movements of prices, is as profitable as forecasting whether a coin will land heads-up; in short, traders are wasting their time. The best one can do is buy a large portfolio of assets representing the composition of the market and earn the market return rate (about 8.5% a year). Don’t try to pick winners and losers; just pick a low-expense, “dumb” fund, and you’ll do better than any highly-paid mutual fund manager (who isn’t smart enough to be profitable).

The EMH comes in various forms and "strengths". The strong form of the EMH claims that prices cannot be beaten by anyone. Period. All information is contained in the market price; no one has any information that could "correct" the market price. The semi-strong form walks this back by claiming that prices react to new information instantaneously, so no one can consistently beat the market (this walks back the strong form, where new information is irrelevant). The weak form claims that only those traders with insider knowledge can beat the market; public information is of no help.

If true, the EMH condemns much of the finance sector and traders to practicing an elaborate exercise in futility, so naturally it's controversial. Obviously the weak version is the most plausible (if only because the other two versions imply the weak version), and my reading of Burton Malkiel's statements is that he's a believer in either the weak or semi-strong versions, but any version dooms investors to the "boring" market rate (since insider trading is illegal). Many don't like that idea.

There are good reasons to believe any form of the EMH is wrong. First, the EMH, if true, implies stock movement behavior exhibits certain characteristics that can be statistically tested. In particular, "extreme" price movements do not happen. Stock price movements would resemble a Wiener process (good news if you're a mathematician or financial engineer, since these processes have very nice properties), and price changes would follow a bell-curve shape (the Normal distribution) where about 99.7% of price movements land near the average; "extreme" events are almost never seen. But the famed mathematician Benoit Mandelbrot (the one famous for developing the theory of fractals) showed that extreme price movements, rather than being relatively uncommon, happen very frequently; we see price movements that should not be seen within a million years or longer every couple of decades or so. There are other assaults too from behavioral economics and other fields, along with anecdotal evidence such as the wealth of Warren Buffett.

A Random Walk Down Wall Street has been through (at least) eleven editions, so one should not be surprised that Malkiel has heard and addressed some of these issues. For example, instead of arguing against behavioral economics, he co-opts some findings from the theory, such as confirmation bias or the brain's propensity for pattern-seeking, that support his thesis that trying to beat the market is a futile effort (ignoring some of the theory's major criticisms of the EMH).1

As for some of the attacks against the markets’ alleged efficiency, Malkiel has a good defense; even if markets are not really, truly efficient (efficiency meaning “the price is always right”), for the rank-and-file investor, this doesn’t matter. They’re efficient enough for an investor to be unable to game them profitably. Fees and small mistakes will eat away any gains to be had; worse, they may reverse them from an opportunity-cost perspective, if not even a purely monetary one.

Malkiel has a point: index funds can win just by being low cost, thanks to being "dumb" by just matching the composition of the market. Every transaction has a cost. Small gains may fail to make up for the costs in research, and the opportunity to make those gains may disappear as soon as they're spotted. But most damaging of all are the fees.

Fees are Important

Fees and transaction costs are the friction sapping the energy from a trading system until it grinds to a red-hot halt. They’re the house’s cut at a poker table that make the game a losing one for all involved except the house. They’re often mentioned as an afterthought in books; a strategy is presented without accounting for fees, and the author says, “We should account for fees,” and nothing more. This could lead to a neophyte trader believing that fees are more a nuisance than something to be seriously accounted for when backtesting. Nothing could be further from the truth.

Every effort should be made to minimize fees. Opt for the cheapest brokers (don't pay for the advice; the premium paid for what passes for expertise is likely not worth it, even if the advice were good, and it often isn't). Given the choice between a trading system making lots of trades and a trading system making few, opt for few. Remember that a single percentage point difference in returns means a big difference in profitability (thanks to compounding). Heck, people could become rich on Wall Street if only they could dodge the fees!
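To put a rough number on that compounding point (8.5% is the market rate quoted above; the one-percentage-point drag is just illustrative):

# growth of $1 over 30 years at the market rate vs. the same rate minus 1% in fees
horizon <- 30
gross <- (1 + 0.085)^horizon          # no fee drag
net   <- (1 + 0.085 - 0.01)^horizon   # one percentage point lost to fees each year
c(gross = gross, net = net, fraction.lost = 1 - net / gross)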

In my blog post on backtesting with Python, I introduced a problem to readers: adjust my backtesting system to account for transaction costs. Naturally, I never give a problem without solving it myself first (how else do I know it can be done?), and I retried it on the data sets I looked at in the post, adding in a 2% commission.

The result? The strategy that was barely profitable turned into a losing one (let alone losing compared to SPY).

The point was important enough that I felt sorry that I did not post this result before. I still will not post my Python solution; why spoil the problem? But I can still make that point in R (I never made equivalent R problems).

Cleaning Up the Trading System

Readers who read both the Python and R posts on backtesting will notice that they are not exactly equivalent. The Python trading system trades based on signal, while the R system trades based on regime (the R system will reinforce a position so it's always 10% of the portfolio, even after the initial crossing). The Python system trades in batches of 100, while the R system has no such restriction. Here, I redo the R trading system, when trading from a batch of tech stocks, and bring it back into line with the Python trading system.

I load in quantstrat and repeat many of the steps from earlier articles.

if (!require("quantstrat")) {
  install.packages("quantstrat", repos = "http://R-Forge.R-project.org")
  library(quantstrat)
}
if (!require("IKTrading")) {
  if (!require("devtools")) {
    install.packages("devtools")
  }
  library(devtools)
  install_github("IKTrading", username = "IlyaKipnis")
  library(IKTrading)
}

start <- as.Date("2010-01-01")
end <- as.Date("2016-10-01")

rm(list = ls(.blotter), envir = .blotter)  # Clear blotter environment
currency("USD")  # Currency being used
Sys.setenv(TZ = "MDT")  # Allows quantstrat to use timestamps
initDate <- "2009-12-31"  # A date prior to first close price; needed (why?)

# Get new symbols
symbols <- c("AAPL", "MSFT", "GOOG", "FB", "TWTR", "NFLX", "AMZN", "YHOO", "SNY",
             "NTDOY", "IBM", "HPQ")
getSymbols(Symbols = symbols, src = "yahoo", from = start, to = end)

# Quickly define adjusted versions of each of these
`%s%` <- function(x, y) {paste(x, y)}
`%s0%` <- function(x, y) {paste0(x, y)}
for (s in symbols) {
  eval(parse(text = s %s0% "_adj <- adjustOHLC(" %s0% s %s0% ")"))
}
symbols_adj <- paste(symbols, "adj", sep = "_")

stock(symbols_adj, currency = "USD", multiplier = 1)

strategy_st <- portfolio_st <- account_st <- "SMAC-20-50"
rm.strat(portfolio_st)
rm.strat(strategy_st)
initPortf(portfolio_st, symbols = symbols_adj, initDate = initDate, currency = "USD")
initAcct(account_st, portfolios = portfolio_st, initDate = initDate, currency = "USD",
         initEq = 1000000)
initOrders(portfolio_st, store = TRUE)
strategy(strategy_st, store = TRUE)

As mentioned before, the strategy from my earlier article traded based on the current regime. This feature was implemented by using the signal function sigComparison(), which tracks whether one indicator is greater than another (or vice versa). To trade based on a single crossing, use the sigCrossover() function (which is similar to sigComparison() but only generates a signal at the first point of crossing).

For some reason, the quantstrat developers wrote sigCrossover() so it returns either TRUE or NA rather than TRUE or FALSE. This leads to bad behavior for my system, so I’ve written an alternative, sigCrossover2(), that exhibits the latter behavior.

sigCrossover2 <- function(label, data = mktdata, columns,
                          relationship = c("gt", "lt", "eq", "gte", "lte"),
                          offset1 = 0, offset2 = 0) {
  # A wrapper for sigCrossover, exhibiting the same behavior except returning
  # an object containing TRUE/FALSE instead of TRUE/NA
  res <- sigCrossover(label = label, data = data, columns = columns,
                      relationship = relationship,
                      offset1 = offset1, offset2 = offset2)
  res[is.na(res)] <- FALSE
  return(res)
}

Use this in my strategy instead of sigCrossover().

add.indicator(strategy = strategy_st, name = "SMA",
              arguments = list(x = quote(Cl(mktdata)), n = 20),
              label = "fastMA")
## [1] "SMAC-20-50"
add.indicator(strategy = strategy_st, name = "SMA",
              arguments = list(x = quote(Cl(mktdata)), n = 50),
              label = "slowMA")
## [1] "SMAC-20-50"

# Next comes trading signals
add.signal(strategy = strategy_st, name = "sigCrossover2",  # Remember me?
           arguments = list(columns = c("fastMA", "slowMA"),
                            relationship = "gt"),
           label = "bull")
## [1] "SMAC-20-50"
add.signal(strategy = strategy_st, name = "sigCrossover2",
           arguments = list(columns = c("fastMA", "slowMA"),
                            relationship = "lt"),
           label = "bear")
## [1] "SMAC-20-50"

Now the strategy will trade based on a single crossover instead of balancing the portfolio so that 10% of equity is devoted to a single position. I prefer this; the earlier strategy engaged in trades that would not be as profitable as those made at the point of crossover, driving up the number of transactions and thus the resulting fees.

Now, for trading in batches of 100. The original rule for opening a position used the function osMaxDollar() from the package IKTrading to size a position based on a dollar amount. The expression floor(getEndEq(account_st_2, Date = timestamp) * .1), passed to osMaxDollar()'s maxSize and tradeSize parameters, instructed the system to place trades so they did not exceed 10% of the equity in the account. I modify the osMaxDollar() function to accept a batchSize parameter so that we can place trades in batches of 100.

# Based on Ilya Kipnis's osMaxDollar(); lots of recycled code
osMaxDollarBatch <- function(data, timestamp, orderqty, ordertype, orderside,
                             portfolio, symbol, prefer = "Open", tradeSize,
                             maxSize, batchSize = 100, integerQty = TRUE, ...) {
  # An order sizing function that limits position size based on dollar value of
  # the position, optionally controlling for number of batches to purchase
  #
  # Args:
  #   data: ??? (held over from Mr. Kipnis's original osMaxDollar function)
  #   timestamp: The current date being evaluated (some object, like a string,
  #              from which time can be inferred)
  #   orderqty: ??? (held over from Mr. Kipnis's original osMaxDollar function)
  #   ordertype: ??? (held over from Mr. Kipnis's original osMaxDollar function)
  #   orderside: ??? (held over from Mr. Kipnis's original osMaxDollar function)
  #   portfolio: A string representing the portfolio being treated; will be
  #              passed to getPosQty
  #   symbol: A string representing the symbol being traded
  #   prefer: A string that indicates whether the Open or Closing price is
  #           used for determining the price of the asset in backtesting (set
  #           to "Close" to use the closing price)
  #   tradeSize: Numeric, indicating the dollar value to transact (using
  #              negative numbers for selling short)
  #   maxSize: Numeric, indicating the dollar limit to the position (use
  #            negative numbers for the short side)
  #   batchSize: The number of stocks purchased per batch (only applies if
  #              integerQty is TRUE); default value is 100, but setting to 1
  #              effectively nullifies the batchSize
  #   integerQty: A boolean indicating whether or not to truncate to the
  #               nearest integer of contracts/shares/etc.
  #   ...: ??? (held over from Mr. Kipnis's original osMaxDollar function)
  #
  # Returns:
  #   A numeric quantity representing the number of shares to purchase

  pos <- getPosQty(portfolio, symbol, timestamp)
  if (prefer == "Close") {
    price <- as.numeric(Cl(mktdata[timestamp, ]))
  } else {
    price <- as.numeric(Op(mktdata[timestamp, ]))
  }
  posVal <- pos * price
  if (orderside == "short") {
    dollarsToTransact <- max(tradeSize, maxSize - posVal)
    if (dollarsToTransact > 0) {
      dollarsToTransact = 0
    }
  } else {
    dollarsToTransact <- min(tradeSize, maxSize - posVal)
    if (dollarsToTransact < 0) {
      dollarsToTransact = 0
    }
  }
  qty <- dollarsToTransact / price
  if (integerQty) {
    # Controlling for batch size only makes sense if we were working with
    # integer quantities anyway; if we didn't care about being integers,
    # why bother?
    qty <- trunc(qty / batchSize) * batchSize
  }
  return(qty)
}

Now we add trading rules, replacing osMaxDollar() with our new osMaxDollarBatch():
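What follows is a sketch of those rules, reconstructed from the fee-paying version that appears later in this post (the same ruleSignal calls, just without the TxnFees argument); treat the exact arguments as my reconstruction rather than verbatim code.

add.rule(strategy = strategy_st, name = "ruleSignal",
         arguments = list(sigcol = "bull", sigval = TRUE,
                          ordertype = "market", orderside = "long",
                          replace = FALSE, prefer = "Open",
                          osFUN = osMaxDollarBatch,   # the new order sizing function
                          maxSize = quote(floor(getEndEq(account_st, Date = timestamp) * .1)),
                          tradeSize = quote(floor(getEndEq(account_st, Date = timestamp) * .1)),
                          batchSize = 100),
         type = "enter", path.dep = TRUE, label = "buy")

add.rule(strategy = strategy_st, name = "ruleSignal",
         arguments = list(sigcol = "bear", sigval = TRUE,
                          orderqty = "all", ordertype = "market",
                          orderside = "long", replace = FALSE,
                          prefer = "Open"),
         type = "exit", path.dep = TRUE, label = "sell")

# Having set up the strategy, we backtest
applyStrategy(strategy_st, portfolios = portfolio_st)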

Now the updated analytics:

# Now for analytics updatePortf(portfolio_st) ## [1] "SMAC-20-50" dateRange <- time(getPortfolio(portfolio_st)$summary)[-1] updateAcct(account_st, dateRange) ## [1] "SMAC-20-50" updateEndEq(account_st) ## [1] "SMAC-20-50" tStats <- tradeStats(Portfolios = portfolio_st, use="trades", inclZeroDays = FALSE) tStats[, 4:ncol(tStats)] <- round(tStats[, 4:ncol(tStats)], 2) print(data.frame(t(tStats[, -c(1,2)]))) ## AAPL_adj AMZN_adj FB_adj GOOG_adj HPQ_adj ## Num.Txns 39.00 33.00 23.00 35.00 29.00 ## Num.Trades 20.00 17.00 12.00 18.00 15.00 ## Net.Trading.PL 96871.62 99943.99 116383.00 27476.81 51431.03 ## Avg.Trade.PL 4843.58 5879.06 9698.58 1526.49 3428.74 ## Med.Trade.PL 803.16 2755.00 4681.50 -830.50 -947.54 ## Largest.Winner 40299.33 33740.00 74888.01 27149.18 54379.08 ## Largest.Loser -7543.93 -14152.00 -17185.01 -9588.94 -15344.27 ## Gross.Profits 134300.99 140947.00 141620.01 62944.47 121893.03 ## Gross.Losses -37429.37 -41003.01 -25237.01 -35467.66 -70462.01 ## Std.Dev.Trade.PL 12463.46 13985.62 22585.57 8247.78 20283.11 ## Percent.Positive 55.00 64.71 75.00 38.89 40.00 ## Percent.Negative 45.00 35.29 25.00 61.11 60.00 ## Profit.Factor 3.59 3.44 5.61 1.77 1.73 ## Avg.Win.Trade 12209.18 12813.36 15735.56 8992.07 20315.51 ## Med.Win.Trade 7272.58 7277.00 8810.00 7254.76 10230.34 ## Avg.Losing.Trade -4158.82 -6833.84 -8412.34 -3224.33 -7829.11 ## Med.Losing.Trade -2837.23 -5474.50 -4752.00 -3537.02 -8954.58 ## Avg.Daily.PL 4216.84 4519.19 10124.27 1386.81 2571.42 ## Med.Daily.PL 304.94 2241.50 4346.99 -1121.00 -2841.81 ## Std.Dev.Daily.PL 12476.98 13232.69 23637.40 8479.65 20764.83 ## Ann.Sharpe 5.37 5.42 6.80 2.60 1.97 ## Max.Drawdown -24834.98 -44379.01 -42976.02 -26932.62 -53384.75 ## Profit.To.Max.Draw 3.90 2.25 2.71 1.02 0.96 ## Avg.WinLoss.Ratio 2.94 1.87 1.87 2.79 2.59 ## Med.WinLoss.Ratio 2.56 1.33 1.85 2.05 1.14 ## Max.Equity 102428.61 99943.99 120250.00 36354.82 52580.84 ## Min.Equity -11759.00 -3269.00 -14696.00 -21199.34 -53384.75 ## End.Equity 96871.62 99943.99 116383.00 27476.81 51431.03 ## IBM_adj MSFT_adj NFLX_adj NTDOY_adj SNY_adj ## Num.Txns 40.00 39.00 33.00 37.00 44.00 ## Num.Trades 20.00 20.00 17.00 19.00 22.00 ## Net.Trading.PL 10761.37 28739.24 232193.00 19555.00 -25046.79 ## Avg.Trade.PL 538.07 1436.96 13658.41 1029.21 -1138.49 ## Med.Trade.PL -651.11 -2351.12 2500.00 -1440.00 349.03 ## Largest.Winner 20407.71 18337.28 151840.00 43452.01 14393.86 ## Largest.Loser -7512.68 -11360.92 -29741.57 -8685.00 -16256.19 ## Gross.Profits 45828.56 88916.25 295344.86 79742.00 61804.49 ## Gross.Losses -35067.19 -60177.00 -63151.85 -60186.99 -86851.28 ## Std.Dev.Trade.PL 6103.11 8723.81 39673.15 12193.29 8412.38 ## Percent.Positive 40.00 45.00 64.71 42.11 54.55 ## Percent.Negative 60.00 55.00 35.29 57.89 45.45 ## Profit.Factor 1.31 1.48 4.68 1.32 0.71 ## Avg.Win.Trade 5728.57 9879.58 26849.53 9967.75 5150.37 ## Med.Win.Trade 3157.60 9507.66 8307.00 2560.50 3140.38 ## Avg.Losing.Trade -2922.27 -5470.64 -10525.31 -5471.54 -8685.13 ## Med.Losing.Trade -2153.73 -4955.57 -6835.07 -5832.00 -8354.55 ## Avg.Daily.PL 538.07 1135.14 14355.81 226.94 -1138.49 ## Med.Daily.PL -651.11 -2858.40 2847.00 -1512.00 349.03 ## Std.Dev.Daily.PL 6103.11 8854.93 40866.48 12019.71 8412.38 ## Ann.Sharpe 1.40 2.04 5.58 0.30 -2.15 ## Max.Drawdown -37072.32 -30916.29 -54616.13 -50112.01 -36718.77 ## Profit.To.Max.Draw 0.29 0.93 4.25 0.39 -0.68 ## Avg.WinLoss.Ratio 1.96 1.81 2.55 1.82 0.59 ## Med.WinLoss.Ratio 1.47 1.92 1.22 0.44 0.38 ## Max.Equity 30247.68 42996.90 264941.01 40014.00 11471.08 ## Min.Equity 
-6824.64 -17726.78 -3255.72 -32476.99 -32673.80 ## End.Equity 10761.37 28739.24 232193.00 19555.00 -25046.79 ## TWTR_adj YHOO_adj ## Num.Txns 9.00 43.00 ## Num.Trades 5.00 22.00 ## Net.Trading.PL 27051.99 106619.99 ## Avg.Trade.PL 5410.40 4846.36 ## Med.Trade.PL -27.01 -1296.00 ## Largest.Winner 1800.00 54746.00 ## Largest.Loser -10362.00 -7700.00 ## Gross.Profits 43340.99 155671.99 ## Gross.Losses -16289.00 -49051.99 ## Std.Dev.Trade.PL 20764.84 15313.45 ## Percent.Positive 40.00 31.82 ## Percent.Negative 60.00 68.18 ## Profit.Factor 2.66 3.17 ## Avg.Win.Trade 21670.50 22238.86 ## Med.Win.Trade 21670.50 14091.99 ## Avg.Losing.Trade -5429.67 -3270.13 ## Med.Losing.Trade -5900.00 -3105.01 ## Avg.Daily.PL -3622.25 4406.10 ## Med.Daily.PL -2963.50 -1404.00 ## Std.Dev.Daily.PL 5565.94 15548.28 ## Ann.Sharpe -10.33 4.50 ## Max.Drawdown -60115.00 -33162.01 ## Profit.To.Max.Draw 0.45 3.22 ## Avg.WinLoss.Ratio 3.99 6.80 ## Med.WinLoss.Ratio 3.67 4.54 ## Max.Equity 42758.99 110806.00 ## Min.Equity -17356.00 -10732.99 ## End.Equity 27051.99 106619.99 final_acct <- getAccount(account_st) plot(final_acct$summary$End.Eq["2010/2016"], main = "Portfolio Equity")

getSymbols("SPY", from = start, to = end) ## [1] "SPY" # A crude estimate of end portfolio value from buying and holding SPY plot(final_acct$summary$End.Eq["2010/2016"] / 1000000, main = "Portfolio Equity", ylim = c(0.8, 2.5)) lines(SPY$SPY.Adjusted / SPY$SPY.Adjusted[[1]], col = "blue")

Modelling Fees

We’ve now replicated the Python analysis in R (or as close as we will get using quantstrat). Let’s now model our portfolio when a 2% commission is applied to every trade (that is, you pay the broker 2% of the total value of the trade, whether you buy or sell). quantstrat allows you to easily model such a fee. To do so, create a function that computes the fee given the number of shares being traded and the share price. Then pass this to the argument TxnFees when you create the rule (this takes either a function, a function name, or a negative number representing the fee). Be sure to add this to all rules (though it need not be the same fee structure for each rule).

fee <- function(TxnQty, TxnPrice, Symbol) { # A function for computing a transaction fee that is 2% of total value of # transaction # # Args: # TxnQty: Numeric for number of shares being traded # TxnPrice: Numeric for price per share # Symbol: The symbol being traded (not used here, but will be passed) # # Returns: # The fee to be applied return(-0.02 * abs(TxnQty * TxnPrice)) } rm(list = ls(.blotter), envir = .blotter) # Clear blotter environment stock(symbols_adj, currency = "USD", multiplier = 1) rm.strat(portfolio_st) rm.strat(strategy_st) initPortf(portfolio_st, symbols = symbols_adj, initDate = initDate, currency = "USD") initAcct(account_st, portfolios = portfolio_st, initDate = initDate, currency = "USD", initEq = 1000000) initOrders(portfolio_st, store = TRUE) strategy(strategy_st, store = TRUE) add.indicator(strategy = strategy_st, name = "SMA", arguments = list(x = quote(Cl(mktdata)), n = 20), label = "fastMA") add.indicator(strategy = strategy_st, name = "SMA", arguments = list(x = quote(Cl(mktdata)), n = 50), label = "slowMA") add.signal(strategy = strategy_st, name = "sigCrossover2", # Remember me? arguments = list(columns = c("fastMA", "slowMA"), relationship = "gt"), label = "bull") add.signal(strategy = strategy_st, name = "sigCrossover2", arguments = list(columns = c("fastMA", "slowMA"), relationship = "lt"), label = "bear") add.rule(strategy = strategy_st, name = "ruleSignal", arguments = list(sigcol = "bull", sigval = TRUE, ordertype = "market", orderside = "long", replace = FALSE, TxnFees = "fee", # Apply the fee with the trade # If you wanted a flat fee, replace # this with a negative number # representing the fee prefer = "Open", osFUN = osMaxDollarBatch, maxSize = quote(floor(getEndEq(account_st, Date = timestamp) * .1)), tradeSize = quote(floor(getEndEq(account_st, Date = timestamp) * .1)), batchSize = 100), type = "enter", path.dep = TRUE, label = "buy") add.rule(strategy = strategy_st, name = "ruleSignal", arguments = list(sigcol = "bear", sigval = TRUE, orderqty = "all", ordertype = "market", orderside = "long", replace = FALSE, TxnFees = "fee", # Apply the fee with the trade prefer = "Open"), type = "exit", path.dep = TRUE, label = "sell") # Having set up the strategy, we now backtest applyStrategy(strategy_st, portfolios = portfolio_st)

Let’s now look at our portfolio.

# Now for analytics updatePortf(portfolio_st) ## [1] "SMAC-20-50" dateRange <- time(getPortfolio(portfolio_st)$summary)[-1] updateAcct(account_st, dateRange) ## [1] "SMAC-20-50" updateEndEq(account_st) ## [1] "SMAC-20-50" tStats <- tradeStats(Portfolios = portfolio_st, use="trades", inclZeroDays = FALSE) tStats[, 4:ncol(tStats)] <- round(tStats[, 4:ncol(tStats)], 2) print(data.frame(t(tStats[, -c(1,2)]))) ## AAPL_adj AMZN_adj FB_adj GOOG_adj HPQ_adj ## Num.Txns 39.00 33.00 23.00 35.00 29.00 ## Num.Trades 20.00 17.00 12.00 18.00 15.00 ## Net.Trading.PL 20624.18 43224.01 68896.98 -27891.65 -6635.25 ## Avg.Trade.PL 4843.58 5879.06 9698.58 1526.49 3428.74 ## Med.Trade.PL 803.16 2755.00 4681.50 -830.50 -947.54 ## Largest.Winner 37593.97 31266.76 71410.75 24869.77 51318.19 ## Largest.Loser -9426.91 -15583.36 -18874.11 -11201.90 -17051.87 ## Gross.Profits 134300.99 140947.00 141620.01 62944.47 121893.03 ## Gross.Losses -37429.37 -41003.01 -25237.01 -35467.66 -70462.01 ## Std.Dev.Trade.PL 12463.46 13985.62 22585.57 8247.78 20283.11 ## Percent.Positive 55.00 64.71 75.00 38.89 40.00 ## Percent.Negative 45.00 35.29 25.00 61.11 60.00 ## Profit.Factor 3.59 3.44 5.61 1.77 1.73 ## Avg.Win.Trade 12209.18 12813.36 15735.56 8992.07 20315.51 ## Med.Win.Trade 7272.58 7277.00 8810.00 7254.76 10230.34 ## Avg.Losing.Trade -4158.82 -6833.84 -8412.34 -3224.33 -7829.11 ## Med.Losing.Trade -2837.23 -5474.50 -4752.00 -3537.02 -8954.58 ## Avg.Daily.PL 2218.84 2736.55 7953.30 -212.11 542.98 ## Med.Daily.PL -1549.09 541.93 2355.65 -2396.58 -4741.84 ## Std.Dev.Daily.PL 12205.15 12945.81 23168.30 8308.18 20348.24 ## Ann.Sharpe 2.89 3.36 5.45 -0.41 0.42 ## Max.Drawdown -56123.63 -60668.36 -51083.34 -46522.72 -81892.46 ## Profit.To.Max.Draw 0.37 0.71 1.35 -0.60 -0.08 ## Avg.WinLoss.Ratio 2.94 1.87 1.87 2.79 2.59 ## Med.WinLoss.Ratio 2.56 1.33 1.85 2.05 1.14 ## Max.Equity 59846.94 43224.01 80524.92 4063.82 8591.39 ## Min.Equity -15308.91 -18513.81 -24333.09 -42458.90 -81892.46 ## End.Equity 20624.18 43224.01 68896.98 -27891.65 -6635.25 ## IBM_adj MSFT_adj NFLX_adj NTDOY_adj SNY_adj ## Num.Txns 40.00 39.00 33.00 37.00 44.00 ## Num.Trades 20.00 20.00 17.00 19.00 22.00 ## Net.Trading.PL -62316.03 -48328.40 163636.67 -53742.98 -111417.94 ## Avg.Trade.PL 538.07 1436.96 13658.41 1029.21 -1138.49 ## Med.Trade.PL -651.11 -2351.12 2500.00 -1440.00 349.03 ## Largest.Winner 18145.45 15954.84 146848.00 40548.11 12141.77 ## Largest.Loser -9256.52 -13175.01 -31173.09 -10517.40 -17910.22 ## Gross.Profits 45828.56 88916.25 295344.86 79742.00 61804.49 ## Gross.Losses -35067.19 -60177.00 -63151.85 -60186.99 -86851.28 ## Std.Dev.Trade.PL 6103.11 8723.81 39673.15 12193.29 8412.38 ## Percent.Positive 40.00 45.00 64.71 42.11 54.55 ## Percent.Negative 60.00 55.00 35.29 57.89 45.45 ## Profit.Factor 1.31 1.48 4.68 1.32 0.71 ## Avg.Win.Trade 5728.57 9879.58 26849.53 9967.75 5150.37 ## Med.Win.Trade 3157.60 9507.66 8307.00 2560.50 3140.38 ## Avg.Losing.Trade -2922.27 -5470.64 -10525.31 -5471.54 -8685.13 ## Med.Losing.Trade -2153.73 -4955.57 -6835.07 -5832.00 -8354.55 ## Avg.Daily.PL -1294.25 -853.51 12129.90 -1757.68 -3090.09 ## Med.Daily.PL -2449.41 -4777.80 783.80 -3466.08 -1631.09 ## Std.Dev.Daily.PL 5974.29 8677.26 40049.20 11764.24 8256.61 ## Ann.Sharpe -3.44 -1.56 4.81 -2.37 -5.94 ## Max.Drawdown -75697.82 -60653.33 -63238.29 -82945.57 -111417.94 ## Profit.To.Max.Draw -0.82 -0.80 2.59 -0.65 -1.00 ## Avg.WinLoss.Ratio 1.96 1.81 2.55 1.82 0.59 ## Med.WinLoss.Ratio 1.47 1.92 1.22 0.44 0.38 ## Max.Equity 8016.15 3615.01 218329.97 1177.74 0.00 ## 
Min.Equity -67681.68 -57038.32 -5157.36 -81767.83 -111417.94 ## End.Equity -62316.03 -48328.40 163636.67 -53742.98 -111417.94 ## TWTR_adj YHOO_adj ## Num.Txns 9.00 43.00 ## Num.Trades 5.00 22.00 ## Net.Trading.PL 9470.41 19783.95 ## Avg.Trade.PL 5410.40 4846.36 ## Med.Trade.PL -27.01 -1296.00 ## Largest.Winner 0.00 51665.84 ## Largest.Loser -12099.12 -9466.80 ## Gross.Profits 43340.99 155671.99 ## Gross.Losses -16289.00 -49051.99 ## Std.Dev.Trade.PL 20764.84 15313.45 ## Percent.Positive 40.00 31.82 ## Percent.Negative 60.00 68.18 ## Profit.Factor 2.66 3.17 ## Avg.Win.Trade 21670.50 22238.86 ## Med.Win.Trade 21670.50 14091.99 ## Avg.Losing.Trade -5429.67 -3270.13 ## Med.Losing.Trade -5900.00 -3105.01 ## Avg.Daily.PL -5536.07 2341.16 ## Med.Daily.PL -4925.13 -3332.34 ## Std.Dev.Daily.PL 5438.76 15231.86 ## Ann.Sharpe -16.16 2.44 ## Max.Drawdown -71707.02 -68750.59 ## Profit.To.Max.Draw 0.13 0.29 ## Avg.WinLoss.Ratio 3.99 6.80 ## Med.WinLoss.Ratio 3.67 4.54 ## Max.Equity 36769.43 42502.03 ## Min.Equity -34937.58 -55447.71 ## End.Equity 9470.41 19783.95 final_acct <- getAccount(account_st) plot(final_acct$summary$End.Eq["2010/2016"], main = "Portfolio Equity")

Before the fees were accounted for, we made a modest profit, though one worse than if we bought and held SPY. But with this commission structure, we barely break even!

Not surprisingly, our strategy looks even worse when compared to buy-and-hold-SPY.

plot(final_acct$summary$End.Eq["2010/2016"] / 1000000, main = "Portfolio Equity",
     ylim = c(0.8, 2.5))
lines(SPY$SPY.Adjusted / SPY$SPY.Adjusted[[1]], col = "blue")

The 2% commission is very punishing. Since we pay it on both the buy and the sell, a trade needs to gain roughly 4% just to break even; this is harder than it sounds, and it doesn’t even account for losing trades, where the commission only deepens the loss. Those losses add up.
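A quick back-of-the-envelope check in R (the $100,000 position size is an illustrative assumption, not a figure from the backtest):

# Round-trip cost of a 2% per-side commission on an assumed $100,000 position
position <- 100000
buy_fee  <- 0.02 * position
sell_fee <- 0.02 * position          # approximately, ignoring any price change
(buy_fee + sell_fee) / position      # about 0.04, i.e. a roughly 4% hurdle to break even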

Of course if you’re using an online trading platform you may opt for a broker that charges a flat rate, say, $10 per trade. This means that with each trade your profit would need to be at least $10 in order to beat the fee. Also, the larger the trade, the smaller the fee is relative to the trade, making the trade more likely to be profitable.

Let’s suppose a $1,000,000 portfolio pays a $100 fee per trade. I model this below.

rm(list = ls(.blotter), envir = .blotter) # Clear blotter environment stock(symbols_adj, currency = "USD", multiplier = 1) rm.strat(portfolio_st) rm.strat(strategy_st) initPortf(portfolio_st, symbols = symbols_adj, initDate = initDate, currency = "USD") initAcct(account_st, portfolios = portfolio_st, initDate = initDate, currency = "USD", initEq = 1000000) initOrders(portfolio_st, store = TRUE) strategy(strategy_st, store = TRUE) add.indicator(strategy = strategy_st, name = "SMA", arguments = list(x = quote(Cl(mktdata)), n = 20), label = "fastMA") add.indicator(strategy = strategy_st, name = "SMA", arguments = list(x = quote(Cl(mktdata)), n = 50), label = "slowMA") add.signal(strategy = strategy_st, name = "sigCrossover2", # Remember me? arguments = list(columns = c("fastMA", "slowMA"), relationship = "gt"), label = "bull") add.signal(strategy = strategy_st, name = "sigCrossover2", arguments = list(columns = c("fastMA", "slowMA"), relationship = "lt"), label = "bear") add.rule(strategy = strategy_st, name = "ruleSignal", arguments = list(sigcol = "bull", sigval = TRUE, ordertype = "market", orderside = "long", replace = FALSE, TxnFees = -100, # Apply the fee with the trade prefer = "Open", osFUN = osMaxDollarBatch, maxSize = quote(floor(getEndEq(account_st, Date = timestamp) * .1)), tradeSize = quote(floor(getEndEq(account_st, Date = timestamp) * .1)), batchSize = 100), type = "enter", path.dep = TRUE, label = "buy") add.rule(strategy = strategy_st, name = "ruleSignal", arguments = list(sigcol = "bear", sigval = TRUE, orderqty = "all", ordertype = "market", orderside = "long", replace = FALSE, TxnFees = -100, prefer = "Open"), type = "exit", path.dep = TRUE, label = "sell") # Having set up the strategy, we now backtest applyStrategy(strategy_st, portfolios = portfolio_st)

Now for comparison with SPY.

updatePortf(portfolio_st)

## [1] "SMAC-20-50"

dateRange <- time(getPortfolio(portfolio_st)$summary)[-1]
updateAcct(account_st, dateRange)

## [1] "SMAC-20-50"

updateEndEq(account_st)

## [1] "SMAC-20-50"

final_acct <- getAccount(account_st)
plot(final_acct$summary$End.Eq["2010/2016"] / 1000000, main = "Portfolio Equity",
     ylim = c(0.8, 2.5))
lines(SPY$SPY.Adjusted / SPY$SPY.Adjusted[[1]], col = "blue")

Not quite as punishing as the 2% rate, though the final equity is admittedly slightly lower than when no fee was applied at all. And while you may say, “No one would ever go with a broker charging $100 per trade when they can get $10 or $5 per trade,” remember that most people don’t have $1,000,000 accounts (and I’m sure that if you applied a $5 fee to a more modest account, you’d discover that $5 per trade actually makes a big difference: consider that a challenge).
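To see why a flat fee weighs more heavily on small accounts, here is a small comparison of my own (the $10,000 account and the 10%-of-equity trade size are illustrative assumptions):

# Fee as a fraction of a 10%-of-equity trade, for two assumed account sizes
trade_large <- 0.10 * 1000000   # $100,000 trade in the $1,000,000 account
trade_small <- 0.10 * 10000     # $1,000 trade in a hypothetical $10,000 account
100 / trade_large               # $100 fee is 0.1% of the large trade
5   / trade_small               # $5 fee is 0.5% of the small trade, five times as heavy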

Conclusion

Fees are only the beginning of the potential, unmodelled avenues through which a backtest can produce overly optimistic results. There’s slippage, where orders are filled at worse prices than planned, which becomes a serious problem in periods of high volatility. Another problem that backtesting cannot model at all is market impact: prices in a backtest are treated as givens, but in the real world an order moves the price of the asset being bought and can lead other participants to react in unexpected ways. There’s also the propensity for backtesting to overfit the data, so that a strategy appears profitable when it has merely learned the ups and downs of the historic period.

In a practice that relies on slim profits, where the market may only be just barely inefficient, there’s little room for error. Do what you can to accurately model what a trade would actually look like (especially when quantstrat makes doing so easy). Little price differences can lead to big differences in the end.

# Session details sessionInfo() ## R version 3.3.3 (2017-03-06) ## Platform: i686-pc-linux-gnu (32-bit) ## Running under: Ubuntu 16.04.2 LTS ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## attached base packages: ## [1] methods stats graphics grDevices utils datasets base ## ## other attached packages: ## [1] IKTrading_1.0 roxygen2_6.0.1 ## [3] digest_0.6.12 Rcpp_0.12.10 ## [5] quantstrat_0.9.1739 foreach_1.4.4 ## [7] blotter_0.9.1741 PerformanceAnalytics_1.4.4000 ## [9] FinancialInstrument_1.2.0 quantmod_0.4-7 ## [11] TTR_0.23-1 xts_0.9-7 ## [13] zoo_1.7-14 RWordPress_0.2-3 ## [15] optparse_1.3.2 knitr_1.15.1 ## ## loaded via a namespace (and not attached): ## [1] xml2_1.1.1 magrittr_1.5 getopt_1.20.0 lattice_0.20-35 ## [5] R6_2.2.0 highr_0.6 stringr_1.2.0 tools_3.3.3 ## [9] grid_3.3.3 commonmark_1.2 iterators_1.0.8 bitops_1.0-6 ## [13] codetools_0.2-15 RCurl_1.95-4.8 evaluate_0.10 stringi_1.1.3 ## [17] XMLRPC_0.3-0 XML_3.98-1.5

1 The EMH claims that information about past prices will not help predict future price movements, so technical analysis is pure alchemy. However, we need to remember that price data and future prices are being produced by people who do look at past prices. Suppose a technical indicator signals a bullish movement in prices; for example, a slow moving average crossed over a fast moving average. Forget whether there’s any inherent reason this should lead to a bullish price movement; there’s a good chance there isn’t any, and all the narratives explaining the rationale of the indicator are rubbish (as I’m inclined to believe). If a trader sees the signal, and believes that enough traders: (a) see the signal; (b) interpret the signal in the same way; and (c) will act on the signal as it recommends, then that trader should do the same. Enough people believe the asset is undervalued to make it act as if it were undervalued, and the trader should go long. So technical signals could become self-fulfilling prophecies, like J.M. Keynes’ beauty contest.


How and when: ridge regression with glmnet

Mon, 04/10/2017 - 14:25

(This article was first published on blogR, and kindly contributed to R-bloggers)

@drsimonj here to show you how to conduct ridge regression (linear regression with L2 regularization) in R using the glmnet package, and use simulations to demonstrate its relative advantages over ordinary least squares regression.

 Ridge regression

Ridge regression uses L2 regularisation to penalise large parameter estimates when a regression model is being learned. In the context of linear regression, it can be compared to Ordinary Least Squares (OLS). OLS defines the objective by which parameter estimates (intercepts and slopes) are calculated: it minimises the sum of squared residuals. L2 regularisation adds a penalty on the sum of squared coefficients to that objective, which shrinks the estimates and makes them more stable. The outcome is typically a model that fits the training data less well than OLS but generalises better, because it is less sensitive to extreme variance in the data such as outliers.
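In symbols (my own summary of the standard formulation, not notation taken from the original post), ridge regression chooses its coefficients by minimising

$$ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \;+\; \lambda \sum_{j=1}^{p} \beta_j^2, $$

where $\lambda \ge 0$ is the tuning hyperparameter (the lambda argument used below); setting $\lambda = 0$ recovers ordinary least squares.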

 Packages

We’ll make use of the following packages in this post:

library(tidyverse)
library(broom)
library(glmnet)

 Ridge regression with glmnet

The glmnet package provides the functionality for ridge regression via glmnet(). Important things to know:

  • Rather than accepting a formula and data frame, it requires a vector input and matrix of predictors.
  • You must specify alpha = 0 for ridge regression.
  • Ridge regression involves tuning a hyperparameter, lambda. glmnet() will generate default values for you. Alternatively, it is common practice to define your own with the lambda argument (which we’ll do).

Here’s an example using the mtcars data set:

y <- mtcars$hp
x <- mtcars %>% select(mpg, wt, drat) %>% data.matrix()

lambdas <- 10^seq(3, -2, by = -.1)
fit <- glmnet(x, y, alpha = 0, lambda = lambdas)
summary(fit)

#>           Length Class     Mode   
#> a0         51    -none-    numeric
#> beta      153    dgCMatrix S4     
#> df         51    -none-    numeric
#> dim         2    -none-    numeric
#> lambda     51    -none-    numeric
#> dev.ratio  51    -none-    numeric
#> nulldev     1    -none-    numeric
#> npasses     1    -none-    numeric
#> jerr        1    -none-    numeric
#> offset      1    -none-    logical
#> call        5    -none-    call   
#> nobs        1    -none-    numeric

Because, unlike OLS regression done with lm(), ridge regression involves tuning a hyperparameter, lambda, glmnet() runs the model many times for different values of lambda. We can automatically find a value for lambda that is optimal by using cv.glmnet() as follows:

cv_fit <- cv.glmnet(x, y, alpha = 0, lambda = lambdas)

cv.glmnet() uses cross-validation to work out how well each model generalises, which we can visualise as:

plot(cv_fit)

The lowest point in the curve indicates the optimal lambda: the log value of lambda that best minimised the error in cross-validation. We can extract this value as:

opt_lambda <- cv_fit$lambda.min
opt_lambda

#> [1] 3.162278

And we can extract all of the fitted models (like the object returned by glmnet()) via:

fit <- cv_fit$glmnet.fit
summary(fit)

#>           Length Class     Mode   
#> a0         51    -none-    numeric
#> beta      153    dgCMatrix S4     
#> df         51    -none-    numeric
#> dim         2    -none-    numeric
#> lambda     51    -none-    numeric
#> dev.ratio  51    -none-    numeric
#> nulldev     1    -none-    numeric
#> npasses     1    -none-    numeric
#> jerr        1    -none-    numeric
#> offset      1    -none-    logical
#> call        5    -none-    call   
#> nobs        1    -none-    numeric

These are the two things we need to predict new data. For example, predicting values and computing an R2 value for the data we trained on:

y_predicted <- predict(fit, s = opt_lambda, newx = x)

# Sum of Squares Total and Error
sst <- sum(y^2)
sse <- sum((y_predicted - y)^2)

# R squared
rsq <- 1 - sse / sst
rsq

#> [1] 0.9318896

The optimal model has accounted for 93% of the variance in the training data.
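As a point of reference (this comparison is my addition, not part of the original post), one could compute the same uncentered R squared for a plain OLS fit on the same training data:

# Hypothetical check: OLS on the same three predictors, scored with the same
# uncentered R squared formula used above
ols_fit  <- lm(hp ~ mpg + wt + drat, data = mtcars)
ols_pred <- predict(ols_fit)
1 - sum((ols_pred - y)^2) / sum(y^2)

On training data, OLS will score at least as well here; the question the next section explores is how the two approaches compare on new data.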

 Ridge v OLS simulations

By producing more stable parameters than OLS, ridge regression should be less prone to overfitting training data. Ridge regression might, therefore, predict training data less well than OLS, but better generalise to new data. This will particularly be the case when extreme variance in the training data is high, which tends to happen when the sample size is low and/or the number of features is high relative to the number of observations.

Below is a simulation experiment I created to compare the prediction accuracy of ridge regression and OLS on training and test data.

I first set up the functions to run the simulation:

# Compute R^2 from true and predicted values rsquare <- function(true, predicted) { sse <- sum((predicted - true)^2) sst <- sum(true^2) rsq <- 1 - sse / sst # For this post, impose floor... if (rsq < 0) rsq <- 0 return (rsq) } # Train ridge and OLS regression models on simulated data set with `n_train` # observations and a number of features as a proportion to `n_train`, # `p_features`. Return R squared for both models on: # - y values of the training set # - y values of a simualted test data set of `n_test` observations # - The beta coefficients used to simulate the data ols_vs_ridge <- function(n_train, p_features, n_test = 200) { ## Simulate datasets n_features <- floor(n_train * p_features) betas <- rnorm(n_features) x <- matrix(rnorm(n_train * n_features), nrow = n_train) y <- x %*% betas + rnorm(n_train) train <- data.frame(y = y, x) x <- matrix(rnorm(n_test * n_features), nrow = n_test) y <- x %*% betas + rnorm(n_test) test <- data.frame(y = y, x) ## OLS lm_fit <- lm(y ~ ., train) # Match to beta coefficients lm_betas <- tidy(lm_fit) %>% filter(term != "(Intercept)") %>% {.$estimate} lm_betas_rsq <- rsquare(betas, lm_betas) # Fit to training data lm_train_rsq <- glance(lm_fit)$r.squared # Fit to test data lm_test_yhat <- predict(lm_fit, newdata = test) lm_test_rsq <- rsquare(test$y, lm_test_yhat) ## Ridge regression lambda_vals <- 10^seq(3, -2, by = -.1) # Lambda values to search cv_glm_fit <- cv.glmnet(as.matrix(train[,-1]), train$y, alpha = 0, lambda = lambda_vals, nfolds = 5) opt_lambda <- cv_glm_fit$lambda.min # Optimal Lambda glm_fit <- cv_glm_fit$glmnet.fit # Match to beta coefficients glm_betas <- tidy(glm_fit) %>% filter(term != "(Intercept)", lambda == opt_lambda) %>% {.$estimate} glm_betas_rsq <- rsquare(betas, glm_betas) # Fit to training data glm_train_yhat <- predict(glm_fit, s = opt_lambda, newx = as.matrix(train[,-1])) glm_train_rsq <- rsquare(train$y, glm_train_yhat) # Fit to test data glm_test_yhat <- predict(glm_fit, s = opt_lambda, newx = as.matrix(test[,-1])) glm_test_rsq <- rsquare(test$y, glm_test_yhat) data.frame( model = c("OLS", "Ridge"), betas_rsq = c(lm_betas_rsq, glm_betas_rsq), train_rsq = c(lm_train_rsq, glm_train_rsq), test_rsq = c(lm_test_rsq, glm_test_rsq) ) } # Function to run `ols_vs_ridge()` `n_replications` times repeated_comparisons <- function(..., n_replications = 5) { map(seq(n_replications), ~ ols_vs_ridge(...)) %>% map2(seq(.), ~ mutate(.x, replicate = .y)) %>% reduce(rbind) }

Now run the simulations for varying numbers of training data and relative proportions of features (takes some time):

d <- purrr::cross_d(list(
  n_train    = seq(20, 200, 20),
  p_features = seq(.55, .95, .05)
))

d <- d %>% mutate(results = map2(n_train, p_features, repeated_comparisons))

Visualise the results…

For varying numbers of training data (averaging over number of features), how well do both models predict the training and test data?

d %>% 
  unnest() %>% 
  group_by(model, n_train) %>% 
  summarise(train_rsq = mean(train_rsq),
            test_rsq  = mean(test_rsq)) %>% 
  gather(data, rsq, contains("rsq")) %>% 
  mutate(data = gsub("_rsq", "", data)) %>% 
  ggplot(aes(n_train, rsq, color = model)) +
    geom_line() +
    geom_point(size = 4, alpha = .3) +
    facet_wrap(~ data) +
    theme_minimal() +
    labs(x = "Number of training observations", y = "R squared")

As hypothesised, OLS fits the training data better but Ridge regression better generalises to new test data. Further, these effects are more pronounced when the number of training observations is low.

For varying relative proportions of features (averaging over numbers of training data) how well do both models predict the training and test data?

d %>% 
  unnest() %>% 
  group_by(model, p_features) %>% 
  summarise(train_rsq = mean(train_rsq),
            test_rsq  = mean(test_rsq)) %>% 
  gather(data, rsq, contains("rsq")) %>% 
  mutate(data = gsub("_rsq", "", data)) %>% 
  ggplot(aes(p_features, rsq, color = model)) +
    geom_line() +
    geom_point(size = 4, alpha = .3) +
    facet_wrap(~ data) +
    theme_minimal() +
    labs(x = "Number of features as proportion\nof number of observation", y = "R squared")

Again, OLS has performed slightly better on training data, but Ridge better on test data. The effects are more pronounced when the number of features is relatively high compared to the number of training observations.

The following plot helps to visualise the relative advantage (or disadvantage) of Ridge to OLS over the number of observations and features:

d %>% 
  unnest() %>% 
  group_by(model, n_train, p_features) %>% 
  summarise(train_rsq = mean(train_rsq),
            test_rsq  = mean(test_rsq)) %>% 
  group_by(n_train, p_features) %>% 
  summarise(RidgeAdvTrain = train_rsq[model == "Ridge"] - train_rsq[model == "OLS"],
            RidgeAdvTest  = test_rsq[model == "Ridge"]  - test_rsq[model == "OLS"]) %>% 
  gather(data, RidgeAdvantage, contains("RidgeAdv")) %>% 
  mutate(data = gsub("RidgeAdv", "", data)) %>% 
  ggplot(aes(n_train, p_features, fill = RidgeAdvantage)) +
    scale_fill_gradient2(low = "red", high = "green") +
    geom_tile() +
    theme_minimal() +
    facet_wrap(~ data) +
    labs(x = "Number of training observations",
         y = "Number of features as proportion\nof number of observation") +
    ggtitle("Relative R squared advantage of Ridge compared to OLS")

This shows the combined effect: that Ridge regression better transfers to test data when the number of training observations is low and/or the number of features is high relative to the number of training observations. OLS performs slightly better on the training data under similar conditions, indicating that it is more prone to overfitting training data than when ridge regularisation is employed.

 Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.


Test driving Python integration in R, using the ‘reticulate’ package

Mon, 04/10/2017 - 08:01
Introduction

Not so long ago RStudio released the R package ‘reticulate‘, an R interface to Python. Of course, it was already possible to execute Python scripts from within R, but this integration takes it one step further. Imported Python modules, classes and functions can be called inside an R session as if they were native R functions.
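For instance, a minimal sketch of my own (assuming NumPy is installed in whatever Python installation reticulate finds):

library(reticulate)

np <- import("numpy")        # import a Python module into the R session
np$mean(c(1, 4, 9, 16))      # call its functions as if they were R functions
np$linspace(0, 1, 5L)        # integer arguments need the L suffix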

Below you’ll find some screenshot code snippets of using certain Python modules within R via the reticulate package. On my GitHub page you’ll find the R files from which these snippets were taken.

Using python packages

The nice thing about reticulate in RStudio is the support for code completion. When you have imported a python module, RStudio will recognize the methods that are available in the python module:

The clarifai module

Clarifai provides a set of computer vision APIs for image recognition, face detection, extracting tags, etc. There is an official Python module, and there is also an R package by Gaurav Sood, but it exposes less functionality. So I am going to use the Python module in R. The following code snippet shows how easy it is to call Python functions.

The output returned from the clarifai call is a nested list and can be quite intimidating at first sight. To browse through these nested lists and to get a better idea of what is in them, you can use the package listviewer:
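For example (a sketch of my own; res stands in for the nested list returned by the clarifai call):

# Assumption: `res` is the nested list returned by the clarifai tag request
library(listviewer)
jsonedit(res)    # interactive widget for browsing nested lists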

The pattern.nl module

The pattern.nl module contains a fast part-of-speech tagger for Dutch, sentiment analysis, and tools for Dutch verb conjugation and noun singularization & pluralization. At the moment it does not support Python 3. That is not a big deal: I am using Anaconda and created a Python 2.7 environment in which to install pattern.nl.

The nice thing about the reticulate package is that it allows you to choose a specific Python environment to use.
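Something along these lines, where the environment name py27 is a hypothetical placeholder for whatever your Python 2.7 conda environment is called:

library(reticulate)
use_condaenv("py27", required = TRUE)            # hypothetical env with pattern.nl installed
pattern_nl <- import("pattern.nl")
pattern_nl$sentiment("Een geweldig mooi boek!")  # returns polarity and subjectivity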

The pytorch module

pytorch is a Python package that provides tensor computations and deep neural networks. There is no ‘R torch’ equivalent, but we can use reticulate in R. There is an example of training a logistic regression in pytorch; see the code here. It takes just a little rewriting of this code to make it work in R. See the first few lines in the figure below.
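A few illustrative lines of my own (not the code from the figure), assuming pytorch is installed in the selected Python environment:

library(reticulate)
torch <- import("torch")

# Create tensors and do a small computation on them from R
x <- torch$randn(4L, 3L)              # 4 x 3 tensor of random normals
w <- torch$randn(3L, 1L)
y <- torch$sigmoid(torch$mm(x, w))    # matrix multiply plus sigmoid, as in logistic regression
y$shape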

Conclusion

As a data scientist you should know both R and Python; the reticulate package is no excuse for not learning Python! However, reticulate can be very useful if you want to do all of your analysis in the RStudio environment. It works very well.

For example, I have used rvest to scrape some Dutch news texts, then used the Python module pattern.nl for Dutch sentiment analysis, and wrote an R Markdown document to present the results. In that workflow, the reticulate package is a nice way to keep everything in one environment.

Cheers, Longhow

RSiteCatalyst Version 1.4.12 (and 1.4.11) Release Notes

Mon, 04/10/2017 - 02:00

I released version 1.4.12 of RSiteCatalyst before I wrote the release notes for version 1.4.11, so this blog post will treat both releases as one. Users should upgrade directly to version 1.4.12 as the releases are cumulative in nature.

Get* method additions

Two Get* methods were added in this release: GetClickMapReporting and GetPreviousServerCalls, mostly for completeness. Analytics users will likely not need to use these methods, but they are useful for generating documentation.

Bug fixes
  • Fixed GetLogin function, adding selected_ims_group_list parameter to response (caused test suite failure)
  • Fixed issue with ViewProcessingRules where nested rules threw errors (#214)
  • Fixed issue with GetMarketingChannelRules where nested rules threw errors, matches column duplicated values across rows (#180)
  • Added ability to use a segment in QueueDataWarehouse function, which was previously implemented incorrectly (#216)
  • Fixed issue with QueueDataWarehouse not returning the proper number of results when enqueueOnly = FALSE
  • Fixed encoding for QueueDataWarehouse to correctly use UTF-8-BOM (#198)
  • Fixed parser for GetFeeds, to unnest ‘activity’ data frame into discrete columns
  • Fixed issue where message Error in if (!is.null(elements[i, ]$classification) && nchar(elements[i, : missing value where TRUE/FALSE needed displayed when using multiple elements in a Queue* function (#207)
Community Contributions (An Adobe Summit bounce?!)

In the past month, the number of GitHub issues submitted has increased dramatically, a good problem to have!

I encourage all users of the software to continue reporting bugs via GitHub issues, especially if you can provide a working code example. Even better, a fix via pull request will ensure that your bug is addressed in a timely manner and to the benefit of others in the community.

April R Course Finder update: Logistic Regression, New Platforms and Complete Machine Learning

Sun, 04/09/2017 - 23:03

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Last year we launched R Course Finder, an online directory that helps you to find the right R course quickly. With so many R courses available online, we thought it was a good idea to offer a tool that helps people to compare these courses, before they decide where to spend their valuable time and (sometimes) money.

If you haven’t looked at it yet, go to the R Course Finder now by clicking here.

Last month we added 17 courses to the Course Finder. Currently we are at 171 courses on 18 different online platforms, and 3 offline Learning Institutes.

Newly added platforms include:

Highlighted Courses

Learning Path: Real-World Data Mining With R

When working with data, we often have to start with data mining; Learning Path: Real-World Data Mining With R offers a complete learning path for this essential task.

Logistic Regression Workshop using R – Step by Step Modeling

The course Logistic Regression Workshop using R – Step by Step Modeling really piqued our attention, as it seems to be one of the rare cases that prioritise content quality over advertisement. Its relatively short course description does not do justice to the content and in-depth discussion of logistic regression in R.

R: Complete Machine Learning Solutions

We have many different courses in our directory that cover the current hype in Data Science: machine learning. This course, R: Complete Machine Learning Solutions, covers the whole process end to end and leaves no room for interpretation. It can be very useful as a reference for anyone who knows some of these concepts but wants to return and read about them.

Besides these courses, we also added:

Introduction to R: ETH Zurich
Data Science Certification Training – R Programming
Creating Interactive Presentations with Shiny and R
R: Complete Data Visualization Solutions
R Programming from Scratch
Volatility Trading Analysis with R

How you can help to make R Course Finder better
  • If you miss a course that is not included yet, please post a reminder in the comments and we’ll add it.
  • If you miss an important filter or search functionality, please let us know in the comments below.
  • If you already took one of the courses, please let all of us know about your experiences in the review section; an example is available here.

And, last but not least: If you like R Course Finder, please share this announcement with friends and colleagues using the buttons below.

Related exercise sets:
  1. Basic Tree 2 Exercises
  2. Multiple Regression (Part 3) Diagnostics
  3. Intermediate Tree 2
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

