R-bloggers: R news and tutorials contributed by hundreds of R bloggers

March Monthly Round Up

Tue, 04/04/2017 - 10:05

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

March has been a busy month here at DataCamp! We’ve released 7 courses, 4 DataChats, and 8 articles. Luckily, this round-up post will cover the highlights, in case you missed them.

Data Manipulation: Tidy Your Data!

We were very excited to release the third course in our Pandas course series! Merging DataFrames with pandas follows on from Pandas Foundations and Manipulating DataFrames with pandas; all three are taught by Dhavide Aruliah of Continuum Analytics, whom you might already know if you have downloaded the Anaconda Python distribution!

Tip: if you ever need a one-page reference for this Python package, consider our Pandas Basics and Pandas Data Wrangling cheat sheets. Make sure you also don’t miss our Python Exploratory Data Analysis tutorial, in which you’ll learn more about how you can use Pandas with other packages such as matplotlib to explore your data!

Apart from the release of our Pandas course, we also launched Network Analysis in Python (Part 1), by Eric Ma. In this course, you’ll learn how to analyze, visualize, and understand networks by applying the concepts you learn to real-world network data with the NetworkX library.

Eric shows the power of network data in his talk titled “Networks, networks everywhere!”, in which he goes over a real-life example of how networks solved influenza problems. You can check it out in his DataChats episode.

For those of you who are interested in getting into Spark with Python, March also brought a PySpark cheat sheet, which covers the basics of Spark’s main building block, Resilient Distributed Datasets (RDDs). There’s also a beginner’s guide, which will give you a comprehensive overview of how to install PySpark on your computer, how to work with Spark in Jupyter, how RDDs differ from DataFrames and Datasets, and so much more!

Machine Learning Is Fun!

We added the Unsupervised Learning in R course, by Hank Roark, Senior Data Scientist at Boeing, to the curriculum: in contrast to supervised machine learning, you’ll see that this course will teach you how to find patterns in data without trying to make predictions.

DataCamp instructor Nick Carchedi asked Hank about his career in data science, job automation and the importance of communicating your results as a data scientist in episode 14 of DataChats.

Next, we launched the Supervised Learning with scikit-learn course, taught by Andreas Müller. In this course you’ll learn how to use Python to perform supervised learning: you’ll build predictive models, tune their parameters, and tell how well they will perform on unseen data, all while discovering real-world data sets with scikit-learn.

Fun fact: Andreas is one of the core developers of the scikit-learn package!

Do you want to know more about Andreas and his work? In the 15th episode of DataChats, Andreas also gives advice to people starting with data science and answers what the most difficult part of his job is.

Lastly, there’s also a new Machine Learning with the Experts: School Budgets course, taught by Peter Bull, co-founder of DrivenData. You might have already heard about this company: it hosts data science competitions with the goal of saving the world. In this course, you’ll use the scikit-learn skills that you learned from Andreas’ course to tackle a problem related to school district budgeting.

Tip: To prepare for DataCamp’s upcoming Deep Learning in Python course, go and check out Deep Learning with Jupyter Notebooks in the Cloud.

R and Finance: Secure Your Quantitative Finance Job

Last but not least, DataCamp’s R for finance curriculum was enriched with an Introduction to R for Finance course, by Lore Dirick. This course is ideal for those who want to get started with finance in R in an applied way: you’ll learn the essential data structures and you’ll have the chance to apply them directly to financial examples! With this course, you’ll be set to start performing financial analyses in R!

One of the things that this course also covers is correlation, or the relationship between variables: consider taking our R correlation tutorial if you want to find out more.

For those who have already finished the Introduction to Portfolio Analysis in R course, we also released Intermediate Portfolio Analysis in R, taught by Ross Bennett, in which you’ll explore advanced concepts in the portfolio optimization process with the PortfolioAnalytics package.

Did you know that this package has 863 monthly (direct) downloads, according to RDocumentation? You can read more about which techniques and technologies are used to provide you with these insights in our RDocumentation: Scoring and Ranking post.

Hope you enjoyed this month’s content. Let us know what you think in the comments below! Much more to come for April!

The DataCamp Team

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Everybody talks about the weather

Tue, 04/04/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

Everybody talks about the weather, but nobody does anything about it. – Charles Dudley Warner

As a scientist who models plant diseases, I use a lot of weather data. Often this data is not available for areas of interest. Previously, I worked with the International Rice Research Institute (IRRI) and often the countries I was working with did not have weather data available or I was working on a large area covering several countries and needed a single source of data to work from. Other scientists who work with crop biophysical models to model crop yields also have similar weather data needs and may experience similar issues with data availability.

The United States National Oceanic and Atmospheric Administration's (NOAA) National Centers for Environmental Information (NCEI 1) provides several sources of data, one of which is the Global Surface Summary of the Day (GSOD) data. The data are attractive because they are provided at a daily time step, are ground- or buoy-station based, and are freely available, and they span several years, 1929 to current, with data from 1973 to current being the most complete. For more information on the GSOD data, please see the description of the data provided by NCEI.

While the GSOD data are a valuable source of weather data with global coverage, the data files can be cumbersome and difficult to work with for research purposes. So instead of just talking about it, I did something about it. The GSODR package aims to make it easy to find, transfer and format the data you need for use in analysis. The package provides four main functions for facilitating this:

  • get_GSOD() – the main function that will query and transfer files from the FTP server, reformat them and return a data.frame in R or save a file to disk
  • reformat_GSOD() – the workhorse, this function takes individual station files on the local disk and reformats them returning a data.frame in R
  • nearest_stations() – this function returns a data frame containing a list of stations and their metadata that fall within the given radius of a point specified by the user
  • get_station_list() – this function retrieves the most up-to-date list of stations and corresponding metadata

When reformatting data either with get_GSOD() or reformat_GSOD(), all units are converted to International System of Units (SI), e.g., inches to millimetres and Fahrenheit to Celsius. File output can be saved as a Comma Separated Value (CSV) file or as a spatial GeoPackage (GPKG) file (readable by most major GIS software products), summarising each year by station. The output also includes vapour pressure (ea and es) and relative humidity variables calculated by the package from existing data in GSOD.
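
To make the SI conversions mentioned above concrete, here is a minimal sketch using my own helper functions (these are illustrations of the arithmetic, not GSODR's internal code):

```r
# Imperial-to-SI conversions of the kind applied when reformatting GSOD data:
# Fahrenheit to Celsius, and inches to millimetres.
c_from_f <- function(f) (f - 32) * 5 / 9
mm_from_in <- function(inch) inch * 25.4

c_from_f(c(32, 212))  # 0 100
mm_from_in(1)         # 25.4
```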

Documentation is provided for GSODR using pkgdown, which you can find on the GSODR website, or of course in the usual R vignettes and help files.

How Can You Use GSODR?

Recently a colleague contacted me asking if I knew of or had weather data for a time period covering 1960 to 2015 for selected provinces in the Philippines where the International Rice Research Institute (IRRI) has conducted surveys. The IRRI survey loop in Central Luzon is a study that aims to monitor the changes in rice farming in the major rice-producing area of the Philippines, the Central Luzon region, known as the "rice bowl of the Philippines". Data have been collected several times since the 1960s; see the Farm Household Survey Database webpage for the collected data. Using the GSODR package I was able to retrieve weather data from stations within a 100 km radius of the centre of the provinces included in the survey and provide my colleague with a CSV file of weather data from ground-based weather stations.

As an example of how we can use GSODR, I will demonstrate the following:

  • retrieving a spatial object of provincial level data;

  • sub-setting this data for the provinces of interest;

  • merging the polygons into one object;

  • finding the centroid of this resulting polygon;

  • using the centroid of the polygon to find stations within 100km of this point;

  • determining which stations provide data for the specified time-period, 1960-2015; and

  • downloading the station files and creating a single CSV file of the data for analysis.

Retrieve PHL Provincial Data and Select Loop Provinces

As a first step, we'll use the raster package to retrieve data from GADM that will provide the provincial spatial data for the survey area. We will then use this to find the centroid of the area of interest, which will be used to find the nearest stations. Using raster::getData() fetch level 0 (national) and 1 (provincial) data for the Philippines.

install.packages("raster")

library(raster)

RP0 <- raster::getData(country = "Philippines", level = 0)
RP1 <- raster::getData(country = "Philippines", level = 1)

Now we will select the provinces involved in the survey and make a new object called Central_Luzon from the provincial level data, RP1.

Central_Luzon <- RP1[RP1@data$NAME_1 == "Pampanga" |
                     RP1@data$NAME_1 == "Tarlac" |
                     RP1@data$NAME_1 == "Pangasinan" |
                     RP1@data$NAME_1 == "La Union" |
                     RP1@data$NAME_1 == "Nueva Ecija" |
                     RP1@data$NAME_1 == "Bulacan", ]
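
As an aside, the chain of == comparisons can be written more compactly with %in%; a sketch of the equivalent subsetting:

```r
# The survey provinces as a vector, then one %in% test instead of six ==
provinces <- c("Pampanga", "Tarlac", "Pangasinan",
               "La Union", "Nueva Ecija", "Bulacan")
# Central_Luzon <- RP1[RP1@data$NAME_1 %in% provinces, ]

# %in% returns a logical vector marking matching elements:
c("Pampanga", "Manila") %in% provinces  # TRUE FALSE
```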

With a little help from Arnold Salvacion and Scott Chamberlain, we can create a map inset showing where the Central Luzon Loop survey takes place.

First we'll use gSimplify() from rgeos to simplify the map of the Philippines to make the map generation in the next few steps quicker.

RP0 <- rgeos::gSimplify(RP0, tol = 0.05)

library(ggplot2)
library(grid)
library(gridExtra)

# get center coordinates of provinces in Central Luzon
CL_names <- data.frame(coordinates(Central_Luzon))
CL_names$label <- Central_Luzon@data$NAME_1

# Main Map
p1 <- ggplot() +
  geom_polygon(data = Central_Luzon,
               aes(x = long, y = lat, group = group),
               colour = "grey10", fill = "#fff7bc") +
  geom_text(data = CL_names,
            aes(x = X1, y = X2, label = label),
            size = 3, colour = "grey20") +
  theme(axis.text.y = element_text(angle = 90, hjust = 0.5)) +
  ggtitle("Central Luzon Provinces Surveyed") +
  theme_bw() +
  xlab("Longitude") +
  ylab("Latitude") +
  coord_map()
## Regions defined for each Polygons

# Inset
p2 <- ggplot() +
  geom_polygon(data = RP0, aes(long, lat, group = group),
               colour = "grey10", fill = "#fff7bc") +
  coord_equal() +
  theme_bw() +
  labs(x = NULL, y = NULL) +
  geom_rect(aes(xmin = extent(Central_Luzon)[1],
                xmax = extent(Central_Luzon)[2],
                ymin = extent(Central_Luzon)[3],
                ymax = extent(Central_Luzon)[4]),
            alpha = 0, colour = "red", size = 1, linetype = 1) +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.margin = unit(c(0, 0, 0, 0), "mm"))

grid.newpage()
v1 <- viewport(width = 1, height = 1, x = 0.5, y = 0.5)         # plot area for the main map
v2 <- viewport(width = 0.28, height = 0.28, x = 0.67, y = 0.78) # plot area for the inset map
print(p1, vp = v1)
print(p2, vp = v2)

Dissolve Polygons and Find Centroid of Loop Survey Area

Now that we have the provincial data that we want, we will dissolve the polygons that represent the individual provinces in Central Luzon and find the centroid of all of them, which we will use as the central point for querying stations from the GSOD data set.

Central_Luzon <- rgeos::gUnaryUnion(Central_Luzon)
centroid <- rgeos::gCentroid(Central_Luzon)

ggplot() +
  geom_polygon(data = Central_Luzon,
               aes(x = long, y = lat, group = group),
               colour = "grey10", fill = "#fff7bc") +
  geom_point(aes(x = centroid@coords[1], y = centroid@coords[2])) +
  theme(axis.text.y = element_text(angle = 90, hjust = 0.5)) +
  ggtitle("Centre of Provinces Surveyed") +
  theme_bw() +
  xlab("Longitude") +
  ylab("Latitude") +
  coord_map()

Next, make a list of stations that are within this area. First we need to fetch the station metadata, "isd-history.csv", from the FTP server and then check which stations fall within a 100 km radius of the centre of the provinces we're interested in.

library(GSODR)
library(readr)

# Fetch station list from NCEI
station_meta <- read_csv(
  "",
  col_types = "ccccccddddd",
  col_names = c("USAF", "WBAN", "STN_NAME", "CTRY", "STATE", "CALL",
                "LAT", "LON", "ELEV_M", "BEGIN", "END"),
  skip = 1)

station_meta$STNID <- as.character(paste(station_meta$USAF,
                                         station_meta$WBAN,
                                         sep = "-"))

loop_stations <- nearest_stations(LAT = centroid@coords[2],
                                  LON = centroid@coords[1],
                                  distance = 100)

loop_stations <- station_meta[station_meta$STNID %in% loop_stations, ]

loop_stations <- loop_stations[loop_stations$BEGIN <= 19591231 &
                                 loop_stations$END >= 20151231, ]

print(loop_stations[, c(1:2, 3, 7:12)])

## # A tibble: 9 × 9
##     USAF  WBAN          STN_NAME    LAT     LON ELEV_M    BEGIN      END
##    <chr> <chr>             <chr>  <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
## 1 983240 99999  IBA/LUZON ISLAND 15.333 119.967    5.0 19490104 20170315
## 2 983250 99999           DAGUPAN 16.083 120.350    2.0 19450119 20170315
## 3 983270 99999        CLARK INTL 15.186 120.560  147.5 19450214 20170315
## 4 983280 99999            BAGUIO 16.375 120.620 1295.7 19490101 20170315
## 5 984260 41231 SUBIC BAY WEATHER 14.800 120.267   19.0 19450209 20170315
## 6 984290 99999 NINOY AQUINO INTL 14.509 121.020   22.9 19450331 20170315
## 7 984300 99999    SCIENCE GARDEN 14.650 121.050   46.0 19450228 20170315
## 8 984320 99999          AMBULONG 14.083 121.050   11.0 19490205 20170315
## 9 984330 99999             TANAY 14.583 121.367  651.0 19490101 20170315
## # ... with 1 more variables: STNID <chr>

These are all of the stations available within 100 km of the centroid of this area, along with the years for which data are available.
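
nearest_stations() performs the radius search for you; just to illustrate the idea behind it, here is a great-circle (haversine) distance check in base R. This is my own sketch, not GSODR's actual implementation:

```r
# Great-circle distance in kilometres between two lat/lon points (haversine)
haversine_km <- function(lat1, lon1, lat2, lon2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(sqrt(a))
}

# One degree of longitude at the equator is roughly 111.2 km:
haversine_km(0, 0, 0, 1)

# Stations within 100 km of a centroid could then be selected with, e.g.:
# station_meta[haversine_km(station_meta$LAT, station_meta$LON,
#                           centroid_lat, centroid_lon) <= 100, ]
```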

Using get_GSOD() to Fetch the Requested Station Files

This example shows how you could construct a query using the get_GSOD() function. Be aware that it may result in incomplete data and errors if the server stops responding. We've done our best to make GSODR handle these errors, but if this happens, see the following option using the reformat_GSOD() function.
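
When a long download does fail partway through, a simple retry wrapper can make a query more robust. This is plain base R and my own sketch, not a GSODR feature:

```r
# Retry a function up to `tries` times, re-raising the last error on failure.
with_retry <- function(f, tries = 3) {
  last_err <- NULL
  for (i in seq_len(tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_err <- result
  }
  stop(last_err)
}

# e.g. PHL <- with_retry(function() get_GSOD(station = stations, years = 1960:2015))
```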

PHL <- get_GSOD(station = eval(parse(text = loop_stations[, 12])),
                years = 1960:2015)
##
## Checking requested station file for availability on server.
##
## Downloading individual station files.
## Starting data file processing

Another Option, Using reformat_GSOD()

GSODR provides a function for dealing with local files that have been transferred from the server already as well, reformat_GSOD(). If the previous example with get_GSOD() does not work, this is a good alternative that takes a bit more intervention but gives the same results.

Using your FTP client (e.g., FileZilla), log into the NCEI FTP server and navigate to /pub/data/gsod/. Manually downloading the files for each station listed above from 1960 to 2015 is possible, but tedious. An easier solution is to download the annual files found in each yearly directory, "gsod-YYYY.tar", untar them locally, and then use R to list the available files and select only the files for the stations of interest. Lastly, write the data to disk as a CSV file for later use.

years <- 1960:2015

loop_stations <- eval(parse(text = loop_stations[, 12]))

# create the list of file names to select from the local files
loop_stations <- do.call(paste0,
                         c(expand.grid(loop_stations, "-", years, ".op.gz")))

local_files <- list.files(path = "./GSOD", full.names = TRUE, recursive = TRUE)
local_files <- local_files[basename(local_files) %in% loop_stations]

loop_data <- reformat_GSOD(file_list = local_files)
readr::write_csv(loop_data, file = "Loop_Survey_Weather_1960-2015.csv")

The Data Content and Format

The final data returned by either of these methods include the following elements for the years 1960-2015:

  • STNID – Station number (WMO/DATSAV3 number) for the location;

  • WBAN – "Weather Bureau Air Force Navy" number, where applicable (a historical identifier);

  • STN_NAME – Unique text identifier;

  • CTRY – Country in which the station is located;

  • LAT – Latitude. Station dropped in cases where values are < -90 or > 90 degrees or Lat = 0 and Lon = 0;

  • LON – Longitude. Station dropped in cases where values are < -180 or > 180 degrees or Lat = 0 and Lon = 0;

  • ELEV_M – Elevation in metres;

  • ELEV_M_SRTM_90m – Elevation in metres corrected for possible errors, derived from the CGIAR-CSI SRTM 90m database (Jarvis et al. 2008);

  • YEARMODA – Date in YYYY-mm-dd format;

  • YEAR – The year (YYYY);

  • MONTH – The month (mm);

  • DAY – The day (dd);

  • YDAY – Sequential day of year (not in original GSOD);

  • TEMP – Mean daily temperature converted to degrees C to tenths. Missing = NA;

  • TEMP_CNT – Number of observations used in calculating mean daily temperature;

  • DEWP – Mean daily dew point converted to degrees C to tenths. Missing = NA;

  • DEWP_CNT – Number of observations used in calculating mean daily dew point;

  • SLP – Mean sea level pressure in millibars to tenths. Missing = NA;

  • SLP_CNT – Number of observations used in calculating mean sea level pressure;

  • STP – Mean station pressure for the day in millibars to tenths. Missing = NA;

  • STP_CNT – Number of observations used in calculating mean station pressure;

  • VISIB – Mean visibility for the day converted to kilometres to tenths. Missing = NA;

  • VISIB_CNT – Number of observations used in calculating mean daily visibility;

  • WDSP – Mean daily wind speed value converted to metres/second to tenths. Missing = NA;

  • WDSP_CNT – Number of observations used in calculating mean daily wind speed;

  • MXSPD – Maximum sustained wind speed reported for the day converted to metres/second to tenths. Missing = NA;

  • GUST – Maximum wind gust reported for the day converted to metres/second to tenths. Missing = NA;

  • MAX – Maximum temperature reported during the day converted to Celsius to tenths–time of max temp report varies by country and region, so this will sometimes not be the max for the calendar day. Missing = NA;

  • MAX_FLAG – Blank indicates max temp was taken from the explicit max temp report and not from the 'hourly' data. An "*" indicates max temp was derived from the hourly data (i.e., highest hourly or synoptic-reported temperature);

  • MIN – Minimum temperature reported during the day converted to Celsius to tenths–time of min temp report varies by country and region, so this will sometimes not be the min for the calendar day. Missing = NA;

  • MIN_FLAG – Blank indicates min temp was taken from the explicit min temp report and not from the 'hourly' data. An "*" indicates min temp was derived from the hourly data (i.e., lowest hourly or synoptic-reported temperature);

  • PRCP – Total precipitation (rain and/or melted snow) reported during the day converted to millimetres to hundredths; will usually not end with the midnight observation, i.e., may include latter part of previous day. A value of ".00" indicates no measurable precipitation (includes a trace). Missing = NA; Note: Many stations do not report '0' on days with no precipitation– therefore, 'NA' will often appear on these days. For example, a station may only report a 6-hour amount for the period during which rain fell. See the PRCP_FLAG column for the source of the data;


    • A = 1 report of 6-hour precipitation amount;
    • B = Summation of 2 reports of 6-hour precipitation amount;
    • C = Summation of 3 reports of 6-hour precipitation amount;
    • D = Summation of 4 reports of 6-hour precipitation amount;
    • E = 1 report of 12-hour precipitation amount;
    • F = Summation of 2 reports of 12-hour precipitation amount;
    • G = 1 report of 24-hour precipitation amount;
    • H = Station reported '0' as the amount for the day (e.g., from 6-hour reports), but also reported at least one occurrence of precipitation in hourly observations–this could indicate a trace occurred, but should be considered as incomplete data for the day;
    • I = Station did not report any precipitation data for the day and did not report any occurrences of precipitation in its hourly observations–it's still possible that precipitation occurred but was not reported;
  • SNDP – Snow depth in millimetres to tenths. Missing = NA;

  • I_FOG – Indicator for fog, (1 = yes, 0 = no/not reported) for the occurrence during the day;

  • I_RAIN_DRIZZLE – Indicator for rain or drizzle, (1 = yes, 0 = no/not reported) for the occurrence during the day;

  • I_SNOW_ICE – Indicator for snow or ice pellets, (1 = yes, 0 = no/not reported) for the occurrence during the day;

  • I_HAIL – Indicator for hail, (1 = yes, 0 = no/not reported) for the occurrence during the day;

  • I_THUNDER – Indicator for thunder, (1 = yes, 0 = no/not reported) for the occurrence during the day;

  • I_TORNADO_FUNNEL – Indicator for tornado or funnel cloud, (1 = yes, 0 = no/not reported) for the occurrence during the day;

  • ea – Mean daily actual vapour pressure;

  • es – Mean daily saturation vapour pressure;

  • RH – Mean daily relative humidity.
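
Assuming the conventional definitions, the last three variables are related: relative humidity is the ratio of actual to saturation vapour pressure. A one-line sketch of that relationship (my own illustration, not GSODR's code):

```r
# Relative humidity (%) from actual (ea) and saturation (es) vapour pressure,
# assuming the conventional definition RH = 100 * ea / es.
rh_from_vp <- function(ea, es) 100 * ea / es

rh_from_vp(ea = 2.9, es = 3.5)  # ~82.9
```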

Here's what the data look like.

##    WBAN        STNID         STN_NAME CTRY STATE CALL    LAT     LON
## 1 99999 983240-99999 IBA/LUZON ISLAND   RP  <NA> RPUI 15.333 119.967
## 2 99999 983240-99999 IBA/LUZON ISLAND   RP  <NA> RPUI 15.333 119.967
## 3 99999 983240-99999 IBA/LUZON ISLAND   RP  <NA> RPUI 15.333 119.967
## 4 99999 983240-99999 IBA/LUZON ISLAND   RP  <NA> RPUI 15.333 119.967
## 5 99999 983240-99999 IBA/LUZON ISLAND   RP  <NA> RPUI 15.333 119.967
## 6 99999 983240-99999 IBA/LUZON ISLAND   RP  <NA> RPUI 15.333 119.967
##   ELEV_M ELEV_M_SRTM_90m    BEGIN      END YEARMODA YEAR MONTH DAY YDAY
## 1      5               9 19490104 20170315 19730101 1973    01  01    1
## 2      5               9 19490104 20170315 19730102 1973    01  02    2
## 3      5               9 19490104 20170315 19730103 1973    01  03    3
## 4      5               9 19490104 20170315 19730104 1973    01  04    4
## 5      5               9 19490104 20170315 19730105 1973    01  05    5
## 6      5               9 19490104 20170315 19730106 1973    01  06    6
##   TEMP TEMP_CNT DEWP DEWP_CNT    SLP SLP_CNT STP STP_CNT VISIB VISIB_CNT
## 1 26.8        7 23.8        7 1014.1       7  NA       0  34.3         7
## 2 26.6        5 23.2        5 1013.8       5  NA       0  34.0         5
## 3 25.8        6 23.2        6 1013.5       6  NA       0  34.9         6
## 4 28.4        5 25.0        5 1014.5       5  NA       0  38.0         5
## 5 25.8        4 23.5        4 1014.8       4  NA       0    NA         0
## 6 25.0        8 22.9        8 1014.8       8  NA       0  35.6         8
##   WDSP WDSP_CNT MXSPD GUST MAX MAX_FLAG MIN MIN_FLAG PRCP PRCP_FLAG SNDP
## 1  0.4        7   3.0   NA  31        *  24        *    0         I   NA
## 2  0.9        5   4.1   NA  33        *  19        *    0         I   NA
## 3  0.8        6   4.1   NA  30        *  21        *    0         I   NA
## 4  1.0        5   5.1   NA  32        *  23        *    0         I   NA
## 5  0.2        4   1.5   NA  30        *  22        *    0         I   NA
## 6  0.4        8   2.5   NA  31        *  19        *    0         I   NA
##   I_FOG I_RAIN_DRIZZLE I_SNOW_ICE I_HAIL I_THUNDER I_TORNADO_FUNNEL  EA
## 1     0              0          0      0         0                0 2.9
## 2     0              0          0      0         0                0 2.8
## 3     0              0          0      0         0                0 2.8
## 4     0              0          0      0         0                0 3.2
## 5     0              0          0      0         0                0 2.9
## 6     0              0          0      0         0                0 2.8
##    ES   RH
## 1 3.5 82.9
## 2 3.5 80.0
## 3 3.3 84.8
## 4 3.9 82.1
## 5 3.3 87.9
## 6 3.2 87.5

Using the GSODR package and R, I was able to easily retrieve weather data for the requested years covering the area of interest for this survey, and to create a CSV file of the data for analysis with other software.

Using GSOD Data With Climate Data From GSODRdata

If you want to use climate and ecological data with weather data, we've also provided a supplementary data package to go with GSODR, GSODRdata, which provides climate data from five sources in six data frames.

Due to the size of the package, >9Mb, it is only available from GitHub. However, these data frames provide climate and ecological data that corresponds to the GSOD station locations, making it easy to find and work with weather and climate data at the same time. If you're interested you can find some further examples in the GSODR vignettes.


The GSOD data have a wide range of applications, and GSODR makes these data more accessible to scientists who need a global weather data set. Using GSODR means that you can efficiently query for a set of years, for specific stations, or for areas within a given radius. The GSOD data are not perfect; there are many gaps prior to 1973, but in more recent years the data have become more reliable, and more stations are being added.

ggplot(station_meta, aes(x = END)) +
  geom_histogram() +
  xlab("End Date (YYYYMMDD)") +
  ylab("Number of Stations") +
  ggtitle("Count of stations' end date") +
  theme_bw()

How could you use GSODR?

Let us know how you use GSODR in your work. If you find an issue, please file an issue and we'll work to get it corrected as quickly as possible. Also, if you think that you have an idea that might make GSODR better let us know that too.


We're grateful to Jeff Hanson and Dillon Niederhut who took the time to review GSODR as a part of the rOpenSci onboarding process and the related paper published in the Journal of Open Source Software (Sparks et al., 2017). Their suggestions greatly improved this package. Also, thanks to Scott Chamberlain for his editorial comments on this blog post, including spelling corrections to his name.


Adam H Sparks, Tomislav Hengl and Andrew Nelson (2017). GSODR: Global Summary Daily Weather Data in R. The Journal of Open Source Software, 2(10). DOI: 10.21105/joss.00177. URL:

  1. Formerly the National Climatic Data Center (NCDC) 

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.

tidyquant 0.5.0: select, rollapply, and Quandl

Tue, 04/04/2017 - 02:00

(This article was first published on - Articles, and kindly contributed to R-bloggers)

We’ve got some good stuff cooking over at Business Science. Yesterday, we had the fifth official release (0.5.0) of tidyquant to CRAN. The release includes some great new features. First, the Quandl integration is complete, which now enables getting Quandl data in “tidy” format. Second, we have a new mechanism to handle selecting which columns get sent to the mutation functions. The new argument name is… select, and it provides increased flexibility, which we show off in a rollapply example. Finally, we have added several PerformanceAnalytics functions that deal with modifying returns to the mutation functions. In this post, we’ll go over a few of the new features in version 0.5.0.
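
To give a sense of what a rollapply-style mutation computes, here is a plain base-R sketch of a trailing rolling mean (my own illustration of the windowed calculation; in tidyquant the column selection and rolling function are wired up through the new select argument and the mutation functions):

```r
# Trailing rolling mean over a fixed window, NA until the window fills.
# A base-R analogue of the rolling calculations rollapply performs.
roll_mean <- function(x, width) {
  n <- length(x)
  out <- rep(NA_real_, n)
  for (i in width:n) {
    out[i] <- mean(x[(i - width + 1):i])
  }
  out
}

roll_mean(c(10, 20, 30, 40, 50), width = 2)
# NA 15 25 35 45
```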

Quandl integration

We couldn’t wait to talk about the new Quandl integration for tidyquant! Two weeks ago, we went ahead and released a Quandl blog post using the GitHub development version. However, it is worth mentioning again now that it is officially being released. Definitely go check out that post to learn more!

We would like to highlight one feature that that post didn’t cover: data tables. These are data sets from Quandl that aren’t in a time series format, but are more of a treasure trove of other information. Many of them are premium datasets, meaning they cost money, but a few are free or have free samples. One of them is the Fundamentals Condensed datatable from Zacks. According to Quandl,

“Updated daily, this database contains over 70 fundamental indicators, including income statement, balance sheet, cash flow line items and precalculated ratios, for over 17,000 US and Canadian Equities, including 9,000+ delisted stocks.”

Without a premium Quandl account, we won’t be able to access all of it, but Quandl still offers a sample of the data. Let’s see how to grab this. First, we will want to set the Quandl API key linked to our account so that we have more than 50 calls per day.

library(tidyquant)

quandl_api_key("api-key-goes-here")

Now, we can use tq_get(get = "quandl.datatable") with the ZACKS/FC Quandl code. There’s a lot of data here, including mailing addresses, company officers, and numerous fundamental stats!

FC <- tq_get("ZACKS/FC", get = "quandl.datatable")

FC %>%
    slice(1:3) %>%
    knitr::kable()

[Wide table output omitted: the free sample returns three annual (10-K) rows for AAPL, fiscal years 2011-2013, spanning several hundred columns of company metadata (name, exchange, addresses, officers, auditor, shares outstanding), income statement, per-share earnings, balance sheet, and cash flow items.]
Drexler Director 2015-10-28 170910 106606 64304 NA NA NA 4475 NA 10830 NA NA NA 121911 48999 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1156 1156 50155 13118 37037 NA NA NA NA 37037 NA 37037 NA NA 37037 NA 37037 NA 37037 5.7180 NA NA NA 5.7180 5.7180 5.7186 5.6791 NA NA NA 5.6791 5.6791 5.6786 5.8789 6521.634 6477.317 50155 37037 55756 48999 2014-10-27 40546 NA NA 20641 1764 NA NA 3453 NA 6882 73286 28519 11922 16597 NA NA 106215 NA NA NA 5756 NA NA NA NA 5146 133714 207000 NA 22367 NA NA 13856 NA NA NA NA 7435 NA NA NA 43658 16960 2625 NA NA NA NA NA NA 20208 39793 83451 NA 19764 NA 104256 NA NA -471 NA NA 123549 123549 207000 6294.494 NA 117793 2015-10-28 37037 6757 3394 10151 -2172 -973 2340 NA NA 7283 6478 NA 53666 -8165 -911 -496 -24042 NA -24042 -160 -33774 16896 NA 16896 -22330 NA -22330 -10564 -381 -16379 NA NA 3513 10746 14259 2253 -10564 NA NA NA NA NA NA NA NA 6477.320 6521.634 5.7200 5.6800

Quandl is a very extensive data source. Finding the right series can sometimes be a challenge, but the documentation and search function have greatly improved over the past few years. In addition, with quandl_search() users can even find data sets from within the R console. Let's search for WTI crude oil prices within the FRED database.

quandl_search(query = "WTI", database_code = "FRED", silent = TRUE) %>%
    select(-description) %>%
    slice(1:3) %>%
    knitr::kable()

|       id | dataset_code | database_code | name                                                                | refreshed_at             | newest_available_date | oldest_available_date | column_names | frequency | type        | premium | database_id |
|---------:|:-------------|:--------------|:--------------------------------------------------------------------|:-------------------------|:----------------------|:----------------------|:-------------|:----------|:------------|:--------|------------:|
| 34399442 | WTISPLC      | FRED          | Spot Crude Oil Price: West Texas Intermediate (WTI)                 | 2017-03-18T04:39:31.083Z | 2017-02-01            | 1946-01-01            | DATE, VALUE  | monthly   | Time Series | FALSE   |         118 |
|   120621 | ACOILWTICO   | FRED          | Crude Oil Prices: West Texas Intermediate (WTI) – Cushing, Oklahoma | 2016-12-11T19:35:20.310Z | 2015-01-01            | 1986-01-01            | DATE, VALUE  | annual    | Time Series | FALSE   |         118 |
|   120619 | MCOILWTICO   | FRED          | Crude Oil Prices: West Texas Intermediate (WTI) – Cushing, Oklahoma | 2017-03-17T15:46:22.490Z | 2017-02-01            | 1986-01-01            | DATE, VALUE  | monthly   | Time Series | FALSE   |         118 |

Selecting columns: a flexible approach

Previously, when selecting columns with tq_mutate() and tq_transmute(), we used the ohlc_fun argument to select the columns to mutate on. ohlc_fun leveraged the numerous quantmod functions that extract specific columns from OHLC (Open, High, Low, Close) price data, such as Ad() (adjusted) or Cl() (close). While this was great for OHLC data, it was not ideal for datasets in other forms. It also prevented users from performing rolling regressions with custom functions via rollapply(). To fix this, we have replaced ohlc_fun with a new argument, select. The new approach uses dplyr for selecting data frame columns instead of relying on quantmod, and as a result it has a "tidyverse" feel. While this seems like a small change, the following examples show just how much more flexibility it brings.

select vs. ohlc_fun

Here’s a quick example demonstrating the difference in the approaches. Let’s first grab the stock price data for IBM.

ibm <- tq_get("IBM")
ibm
## # A tibble: 2,581 × 7
##          date   open   high    low  close   volume adjusted
##        <date>  <dbl>  <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
## 1  2007-01-03  97.18  98.40  96.26  97.27  9196800 77.73997
## 2  2007-01-04  97.25  98.79  96.88  98.31 10524500 78.57116
## 3  2007-01-05  97.60  97.95  96.91  97.42  7221300 77.85985
## 4  2007-01-08  98.50  99.50  98.35  98.90 10340000 79.04270
## 5  2007-01-09  99.08 100.33  99.07 100.07 11108200 79.97778
## 6  2007-01-10  98.50  99.05  97.93  98.89  8744800 79.03470
## 7  2007-01-11  99.00  99.90  98.50  98.65  8000700 78.84289
## 8  2007-01-12  98.99  99.69  98.50  99.34  6636500 79.39435
## 9  2007-01-16  99.40 100.84  99.30 100.82  9602200 80.57719
## 10 2007-01-17 100.69 100.90  99.90 100.02  8200700 79.93782
## # ... with 2,571 more rows

What if you wanted to calculate the daily returns of the adjusted stock price? Previously, you might have done something like this…

tq_mutate(ibm, ohlc_fun = Ad, mutate_fun = dailyReturn) %>%
    slice(1:5) %>%
    knitr::kable()
## Warning: Argument ohlc_fun is deprecated; please use select instead.

| date       |  open |   high |   low |  close |   volume | adjusted | daily.returns |
|:-----------|------:|-------:|------:|-------:|---------:|---------:|--------------:|
| 2007-01-03 | 97.18 |  98.40 | 96.26 |  97.27 |  9196800 | 77.73997 |     0.0000000 |
| 2007-01-04 | 97.25 |  98.79 | 96.88 |  98.31 | 10524500 | 78.57116 |     0.0106919 |
| 2007-01-05 | 97.60 |  97.95 | 96.91 |  97.42 |  7221300 | 77.85985 |    -0.0090530 |
| 2007-01-08 | 98.50 |  99.50 | 98.35 |  98.90 | 10340000 | 79.04270 |     0.0151920 |
| 2007-01-09 | 99.08 | 100.33 | 99.07 | 100.07 | 11108200 | 79.97778 |     0.0118301 |

At this point, ohlc_fun will still work, but you’ll notice that a warning appears notifying you that we’ve deprecated ohlc_fun. In version 6, we will remove the argument completely. Let’s see the new approach with select.

tq_mutate(ibm, select = adjusted, mutate_fun = dailyReturn) %>%
    slice(1:5) %>%
    knitr::kable()

| date       |  open |   high |   low |  close |   volume | adjusted | daily.returns |
|:-----------|------:|-------:|------:|-------:|---------:|---------:|--------------:|
| 2007-01-03 | 97.18 |  98.40 | 96.26 |  97.27 |  9196800 | 77.73997 |     0.0000000 |
| 2007-01-04 | 97.25 |  98.79 | 96.88 |  98.31 | 10524500 | 78.57116 |     0.0106919 |
| 2007-01-05 | 97.60 |  97.95 | 96.91 |  97.42 |  7221300 | 77.85985 |    -0.0090530 |
| 2007-01-08 | 98.50 |  99.50 | 98.35 |  98.90 | 10340000 | 79.04270 |     0.0151920 |
| 2007-01-09 | 99.08 | 100.33 | 99.07 | 100.07 | 11108200 | 79.97778 |     0.0118301 |

As you can see, the new approach is similar to dplyr::mutate() in that you simply specify the names of the columns to be mutated. You can even select multiple columns with the dplyr shorthand. Here we succinctly select the columns open, high, low, and close and change to a weekly periodicity. Since the periodicity changes, we have to use tq_transmute() to drop the other columns.

tq_transmute(ibm, select = open:close, mutate_fun = to.weekly) %>%
    slice(1:5) %>%
    knitr::kable()

| date       |  open |  high |   low | close |
|:-----------|------:|------:|------:|------:|
| 2007-01-05 | 97.60 | 97.95 | 96.91 | 97.42 |
| 2007-01-12 | 98.99 | 99.69 | 98.50 | 99.34 |
| 2007-01-19 | 95.00 | 96.85 | 94.55 | 96.17 |
| 2007-01-26 | 97.52 | 97.83 | 96.84 | 97.45 |
| 2007-02-02 | 99.10 | 99.73 | 98.88 | 99.17 |

One final note: if you want to mutate the entire dataset, you can leave off the select argument so that it keeps its default value of NULL, which passes the entire data frame to the mutate function.

Working with non-OHLC data

The select argument is more than a surface-level change. Moving away from the OHLC functions allows us to work with non-OHLC data straight from tq_mutate(). This used to be a little awkward: since the columns of non-OHLC data could not be selected with a quantmod function, we had to use tq_mutate_xy() to select them. The intended use of the *_xy() functions is for working with functions that require both an x and a y passed in parallel, so this definitely wasn't best practice. Check out this old example of working with some crude oil prices from FRED and applying a one-period lag.

oil_prices <- tq_get("DCOILWTICO", get = "economic.data")  # FRED economic data
oil_prices
## # A tibble: 2,671 × 2
##          date price
##        <date> <dbl>
## 1  2007-01-01    NA
## 2  2007-01-02 60.77
## 3  2007-01-03 58.31
## 4  2007-01-04 55.65
## 5  2007-01-05 56.29
## 6  2007-01-08 56.08
## 7  2007-01-09 55.65
## 8  2007-01-10 53.95
## 9  2007-01-11 51.91
## 10 2007-01-12 52.96
## # ... with 2,661 more rows

tq_mutate_xy() was the only thing that worked in the old version, but it just felt wrong!

tq_mutate_xy(oil_prices, x = price, y = NULL, mutate_fun = lag.xts, k = 1)
## # A tibble: 2,671 × 3
##          date price lag.xts
##        <date> <dbl>   <dbl>
## 1  2007-01-01    NA      NA
## 2  2007-01-02 60.77      NA
## 3  2007-01-03 58.31   60.77
## 4  2007-01-04 55.65   58.31
## 5  2007-01-05 56.29   55.65
## 6  2007-01-08 56.08   56.29
## 7  2007-01-09 55.65   56.08
## 8  2007-01-10 53.95   55.65
## 9  2007-01-11 51.91   53.95
## 10 2007-01-12 52.96   51.91
## # ... with 2,661 more rows

Enter select. This works much better!

tq_mutate(oil_prices, select = price, mutate_fun = lag.xts, k = 1)
## # A tibble: 2,671 × 3
##          date price lag.xts
##        <date> <dbl>   <dbl>
## 1  2007-01-01    NA      NA
## 2  2007-01-02 60.77      NA
## 3  2007-01-03 58.31   60.77
## 4  2007-01-04 55.65   58.31
## 5  2007-01-05 56.29   55.65
## 6  2007-01-08 56.08   56.29
## 7  2007-01-09 55.65   56.08
## 8  2007-01-10 53.95   55.65
## 9  2007-01-11 51.91   53.95
## 10 2007-01-12 52.96   51.91
## # ... with 2,661 more rows

Getting things rolling with rollapply

Something that we are very excited about with this new version is the ability to perform custom rolling calculations with rollapply. This is a byproduct of the new select argument, and has some pretty cool use cases. In the example below, we will walk through a rolling CAPM analysis of Apple stock.

CAPM stands for Capital Asset Pricing Model, a way to determine the theoretically appropriate expected return of a stock based on its exposure to market risk. Essentially, the CAPM model performs a linear regression with Apple's excess return as the dependent variable and the "market's" excess return as the independent variable. We will pull the risk-free rate and the market return from the Fama French data set. Here is the regression formula for CAPM:

r_{apple} - r_f = \alpha + \beta (r_{mkt} - r_f) + \epsilon

where r_f is the risk-free return and \epsilon is a normally distributed error term. The key quantities of interest are alpha and beta, the coefficients estimated by the regression. Alpha is the excess return of a stock relative to a market index, and beta is a measure of the volatility of a stock in relation to the market. It would be interesting to evaluate the stability of the model, and one way of doing this is to perform a rolling regression to see how alpha and beta change over time.
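
Before rolling, it can help to see a single CAPM fit in isolation. The sketch below uses simulated excess returns with a known "true" alpha and beta (hypothetical numbers, not the Fama French and Apple series pulled below), so you can verify that lm() recovers them:

```r
# A minimal, self-contained CAPM fit on simulated data.
# True alpha = 0.05 and beta = 1.2 are assumptions for illustration.
set.seed(42)
mkt_excess   <- rnorm(260, mean = 0.1, sd = 2)        # weekly market excess returns (%)
stock_excess <- 0.05 + 1.2 * mkt_excess + rnorm(260)  # stock excess returns with noise
fit <- lm(stock_excess ~ mkt_excess)
coef(fit)  # the intercept estimates alpha; the slope estimates beta
```

With 260 observations (the same window length used in the rolling analysis below), the estimates land close to the true values.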
First, we will need two datasets: Fama French factors and Apple stock returns.

Fama French factors

Quandl actually has a cleaned-up series for the Fama French factors going back to 1926! As of this writing, the most recent data point in the weekly series is 2017-02-24, so for reproducibility purposes we will stop the series there.

fama_french <- tq_get("KFRENCH/FACTORS_W", get = "quandl", to = "2017-02-24")
fama_french
## # A tibble: 5,849 × 5
##          date mkt.rf   smb   hml    rf
##        <date>  <dbl> <dbl> <dbl> <dbl>
## 1  2017-02-24   0.48 -1.09 -0.46 0.009
## 2  2017-02-17   1.53 -0.54 -0.39 0.009
## 3  2017-02-10   0.88  0.14 -1.20 0.009
## 4  2017-02-03   0.15  0.41 -0.67 0.009
## 5  2017-01-27   1.15  0.22  0.59 0.009
## 6  2017-01-20  -0.31 -1.13 -0.15 0.009
## 7  2017-01-13   0.09  0.47 -1.04 0.009
## 8  2017-01-06   1.69 -0.58 -1.25 0.009
## 9  2016-12-30  -1.16 -0.11  0.13 0.006
## 10 2016-12-23   0.32  0.17  0.71 0.006
## # ... with 5,839 more rows

We only need two columns for the CAPM analysis, mkt.rf and rf, so let’s select them.

fama_french <- fama_french %>%
    select(date, mkt.rf, rf)
fama_french
## # A tibble: 5,849 × 3
##          date mkt.rf    rf
##        <date>  <dbl> <dbl>
## 1  2017-02-24   0.48 0.009
## 2  2017-02-17   1.53 0.009
## 3  2017-02-10   0.88 0.009
## 4  2017-02-03   0.15 0.009
## 5  2017-01-27   1.15 0.009
## 6  2017-01-20  -0.31 0.009
## 7  2017-01-13   0.09 0.009
## 8  2017-01-06   1.69 0.009
## 9  2016-12-30  -1.16 0.006
## 10 2016-12-23   0.32 0.006
## # ... with 5,839 more rows

Apple weekly stock returns

This one is easy: we use tq_get() to pull the stock prices, and then tq_transmute() for the returns. For this analysis, we will pull 15 years of prices.

apple_prices <- tq_get("AAPL", get = "stock.prices",
                       from = "2002-02-24", to = "2017-02-24")
apple_prices
## # A tibble: 3,778 × 7
##          date  open  high   low close    volume adjusted
##        <date> <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1  2002-02-25 22.85 24.72 22.36 23.81 106712200 1.542406
## 2  2002-02-26 23.91 24.37 23.25 23.67  65032800 1.533337
## 3  2002-02-27 23.94 24.25 20.94 21.96 257539800 1.422564
## 4  2002-02-28 22.15 22.59 21.35 21.70 114234400 1.405721
## 5  2002-03-01 21.93 23.50 21.82 23.45  87248000 1.519086
## 6  2002-03-04 23.26 24.58 22.76 24.29  87064600 1.573501
## 7  2002-03-05 24.15 24.43 23.40 23.53  68675600 1.524268
## 8  2002-03-06 23.48 24.34 22.93 24.07  56551600 1.559249
## 9  2002-03-07 24.06 24.53 23.61 24.38  64562400 1.579331
## 10 2002-03-08 24.74 25.09 24.30 24.66  67443600 1.597469
## # ... with 3,768 more rows

Now, let’s get the weekly returns. We have to use tq_transmute() because the periodicity changes, and we will use the periodReturn() function from quantmod as our mutate_fun. Lastly, we convert the returns to percentages because that is the form the Fama French factors are in.

apple_return <- tq_transmute(apple_prices,
                             select = adjusted,
                             col_rename = "weekly_return",
                             mutate_fun = periodReturn,
                             period = "weekly") %>%
    mutate(weekly_return = weekly_return * 100)
apple_return
## # A tibble: 783 × 2
##          date weekly_return
##        <date>         <dbl>
## 1  2002-03-01    -1.5119236
## 2  2002-03-08     5.1598790
## 3  2002-03-15     1.1759853
## 4  2002-03-22    -3.4468571
## 5  2002-03-28    -1.7434935
## 6  2002-04-05     4.5205327
## 7  2002-04-12     1.2934187
## 8  2002-04-19    -0.3192103
## 9  2002-04-26    -7.8862983
## 10 2002-05-03     2.1729753
## # ... with 773 more rows

Merge the two datasets

Alright, now that we have the two data sets, let's join them together. We want to keep all of apple_return and just match the fama_french data to it; a left_join() will serve that purpose.

joined_data <- left_join(apple_return, fama_french)
joined_data
## # A tibble: 783 × 4
##          date weekly_return mkt.rf    rf
##        <date>         <dbl>  <dbl> <dbl>
## 1  2002-03-01    -1.5119236   3.79 0.033
## 2  2002-03-08     5.1598790   3.15 0.033
## 3  2002-03-15     1.1759853   0.09 0.033
## 4  2002-03-22    -3.4468571  -1.19 0.033
## 5  2002-03-28    -1.7434935  -0.04 0.033
## 6  2002-04-05     4.5205327  -2.15 0.038
## 7  2002-04-12     1.2934187  -0.51 0.038
## 8  2002-04-19    -0.3192103   1.26 0.038
## 9  2002-04-26    -7.8862983  -4.13 0.038
## 10 2002-05-03     2.1729753  -0.13 0.036
## # ... with 773 more rows

Remembering that the left side of the CAPM formula is Apple's return minus the risk-free rate, we calculate that as well.

joined_data <- mutate(joined_data, weekly_ret_rf = weekly_return - rf)
joined_data
## # A tibble: 783 × 5
##          date weekly_return mkt.rf    rf weekly_ret_rf
##        <date>         <dbl>  <dbl> <dbl>         <dbl>
## 1  2002-03-01    -1.5119236   3.79 0.033    -1.5449236
## 2  2002-03-08     5.1598790   3.15 0.033     5.1268790
## 3  2002-03-15     1.1759853   0.09 0.033     1.1429853
## 4  2002-03-22    -3.4468571  -1.19 0.033    -3.4798571
## 5  2002-03-28    -1.7434935  -0.04 0.033    -1.7764935
## 6  2002-04-05     4.5205327  -2.15 0.038     4.4825327
## 7  2002-04-12     1.2934187  -0.51 0.038     1.2554187
## 8  2002-04-19    -0.3192103   1.26 0.038    -0.3572103
## 9  2002-04-26    -7.8862983  -4.13 0.038    -7.9242983
## 10 2002-05-03     2.1729753  -0.13 0.036     2.1369753
## # ... with 773 more rows

Rolling CAPM

We are almost ready for the rolling CAPM. The last thing we need is a function to apply at each iteration of the roll, which here will be a regression. We are really only interested in the coefficients from the regression, so that is what we return. Note that the incoming data_xts will be an xts object, so we need to convert it to a data frame inside the lm() call. The output should be a numeric vector; tq_mutate() will take care of adding the appropriate columns provided the custom function (in our case regr_fun()) is handled in this manner.

regr_fun <- function(data_xts) {
    lm(weekly_ret_rf ~ mkt.rf, data = as_data_frame(data_xts)) %>%
        coef()
}

Now let’s work some magic. We will leave select out of this call to tq_mutate() since we can just pass the entire dataset into rollapply. To have enough data for a good analysis, we set width to 260 for a five-year rolling window (52 weeks × 5) per call. The final argument, by.column, is especially important so that rollapply doesn't try to apply our regression to each column individually.

joined_data <- joined_data %>%
    tq_mutate(mutate_fun = rollapply,
              width = 260,
              FUN = regr_fun,
              by.column = FALSE,
              col_rename = c("alpha", "beta"))
joined_data %>% slice(255:265)
## # A tibble: 11 × 7
##          date weekly_return mkt.rf    rf weekly_ret_rf     alpha
##        <date>         <dbl>  <dbl> <dbl>         <dbl>     <dbl>
## 1  2007-01-12    11.2522154   1.73 0.111    11.1412154        NA
## 2  2007-01-19    -6.4679733  -0.31 0.111    -6.5789733        NA
## 3  2007-01-26    -3.5254340  -0.57 0.111    -3.6364340        NA
## 4  2007-02-02    -0.7378731   1.94 0.096    -0.8338731        NA
## 5  2007-02-09    -1.7463114  -0.65 0.096    -1.8423114        NA
## 6  2007-02-16     1.8734247   1.22 0.096     1.7774247 0.6905407
## 7  2007-02-23     4.9982362  -0.05 0.096     4.9022362 0.7304938
## 8  2007-03-02    -4.1091364  -4.62 0.106    -4.2151364 0.7331868
## 9  2007-03-09     2.9973031   1.00 0.106     2.8913031 0.7353908
## 10 2007-03-16     1.8415416  -1.18 0.106     1.7355416 0.7560794
## 11 2007-03-23     4.3866501   3.48 0.106     4.2806501 0.7626701
## # ... with 1 more variables: beta <dbl>

We had to scroll down to row 260 before finding any values for alpha and beta, but it's more useful to create a plot. Looking at the rolling coefficient for alpha, it seems to be steadily trending downward over time.

filter(joined_data, ! %>%
    ggplot(aes(x = date, y = alpha)) +
    geom_line(size = 1, color = palette_light()[[1]]) +
    geom_smooth() +
    labs(title = "AAPL: 5 year rolling CAPM - alpha",
         x = "",
         subtitle = "Rolling alpha is steadily trending downward") +
    theme_tq(base_size = 18)

The rolling coefficient for beta is just as interesting: it shows a huge drop around 2008, likely related to the financial crisis, but stays consistently near or above 1.

filter(joined_data, ! %>%
    ggplot(aes(x = date, y = beta)) +
    geom_rect(xmin = as.numeric(ymd("2008-01-01")),
              xmax = as.numeric(ymd("2009-03-01")),
              ymin = 0, ymax = 2.0,
              fill = palette_light()[[4]], alpha = 0.01) +
    geom_line(size = 1, color = palette_light()[[1]]) +
    labs(title = "AAPL: 5 year rolling CAPM - beta",
         x = "",
         subtitle = "Huge drop in 2008 following financial crisis") +
    theme_tq(base_size = 18)

Hopefully you can see the power that tidyquant gives to the user by allowing them to perform these rolling calculations with data frames!

As an aside, there are some improvements to this model that could be of interest to some users:

  1. This says nothing about how well the CAPM model fits Apple's return data; it only monitors the stability of the coefficients.

  2. The mkt.rf data pulled from the Fama-French data set is only one part of a more detailed extension of CAPM, the Fama French 3 Factor Model. It would be interesting to evaluate the stability of that model over time as well.

PerformanceAnalytics Mutation Functions

We have added a few new functions from the PerformanceAnalytics package to tq_mutate() and tq_transmute(), which can be reviewed using tq_mutate_fun_options()$PerformanceAnalytics. These differ from the PerformanceAnalytics functions available for use with tq_performance() (refer to tq_performance_fun_options()) in that they deal with transforming returns data rather than calculating performance metrics.

tq_mutate_fun_options()$PerformanceAnalytics
## [1] "Return.annualized"        "Return.annualized.excess"
## [3] "Return.clean"             "Return.cumulative"
## [5] "Return.excess"            "Return.Geltner"
## [7] "zerofill"

We can use these functions to clean and format returns for the FANG data set. The example below uses three progressive applications of tq_transmute() to apply various quant functions to the grouped stock prices from the FANG data set. First, we calculate daily returns using periodReturn() from the quantmod package. Next, we use Return.clean() to clean outliers from the return data; the alpha parameter is the percentage of outliers to be cleaned. Then, Return.excess() calculates the excess returns using a risk-free rate of 3% (divided by 252 for the 252 trading days in one year). Finally, we calculate general statistics on the returns by passing the result to tq_performance(performance_fun = table.Stats).

FANG %>%
    group_by(symbol) %>%
    tq_transmute(adjusted, periodReturn, period = "daily") %>%
    tq_transmute(daily.returns, Return.clean, alpha = 0.05) %>%
    tq_transmute(daily.returns, Return.excess, Rf = 0.03 / 252,
                 col_rename = "formatted.returns") %>%
    tq_performance(Ra = formatted.returns, performance_fun = table.Stats) %>%
    knitr::kable()

| symbol | ArithmeticMean | GeometricMean | Kurtosis | LCLMean(0.95) | Maximum | Median | Minimum | NAs | Observations | Quartile1 | Quartile3 | SEMean | Skewness |  Stdev | UCLMean(0.95) | Variance |
|:-------|---------------:|--------------:|---------:|--------------:|--------:|-------:|--------:|----:|-------------:|----------:|----------:|-------:|---------:|-------:|--------------:|---------:|
| FB     |         0.0015 |        0.0013 |  35.2725 |         2e-04 |  0.2960 |  1e-03 | -0.0695 |   0 |         1008 |   -0.0098 |    0.0119 |  7e-04 |   2.9098 | 0.0220 |        0.0029 |   0.0005 |
| AMZN   |         0.0011 |        0.0009 |   9.6476 |        -1e-04 |  0.1412 |  6e-04 | -0.1101 |   0 |         1008 |   -0.0082 |    0.0111 |  6e-04 |   0.4804 | 0.0194 |        0.0023 |   0.0004 |
| NFLX   |         0.0026 |        0.0021 |  35.9090 |         6e-04 |  0.4221 | -1e-04 | -0.1938 |   0 |         1008 |   -0.0119 |    0.0149 |  1e-03 |   3.0745 | 0.0324 |        0.0046 |   0.0011 |
| GOOG   |         0.0007 |        0.0006 |  22.1306 |        -2e-04 |  0.1604 |  0e+00 | -0.0533 |   0 |         1008 |   -0.0067 |    0.0078 |  5e-04 |   2.2165 | 0.0148 |        0.0017 |   0.0002 |

Getting, cleaning and calculating excess returns is that easy with the additional PerformanceAnalytics mutation functions.

Further Reading

That’s all for now. We hope you enjoy the new release, and be on the lookout for more to come!

  1. Tidyquant Vignettes: The Core Functions in tidyquant vignette has a nice section on the features related to the Quandl integration! The R Quantitative Analysis Package Integrations in tidyquant vignette includes another example of working with rollapply.

  2. R for Data Science: A free book that thoroughly covers the “tidyverse”. A prerequisite for maximizing your abilities with tidyquant.

To leave a comment for the author, please follow the link and comment on their blog: - Articles. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Le Monde puzzle [#1002]

Tue, 04/04/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

For once and only because it is part of this competition, a geometric Le Monde mathematical puzzle:

Given both diagonals of lengths p=105 and q=116, what is the parallelogram with the largest area? and when the perimeter is furthermore constrained to be L=290?

This made me jump right away to the quadrilateral page on Wikipedia, which reminds us that the largest area occurs when the diagonals are orthogonal, in which case it is A=½pq. Only the angle between the diagonals matters. Imposing the perimeter in addition is not solved there, so I wrote an R code looking at all the integer solutions, based on one of the numerous formulae for the area, like

A = \sqrt{(s-a)(s-b)(s-c)(s-d) - abcd\,\cos^2((\alpha+\gamma)/2)}

where s is the half-perimeter, a, b, c, d are the lengths of the four sides, and α, γ are two opposite angles (Bretschneider's formula):

p=105
q=116
s=145
maxur=0  # track the best area found so far (must be initialised before the search)
for (alpha in (1:500)/1000){
 ap=alpha*p;ap2=ap^2;omap=p-ap;omap2=omap^2
 for (beta in (1:999)/1000){
  bq=beta*q;bq2=bq^2;ombq=q-bq;ombq2=ombq^2
  for (teta in (1:9999)*pi/10000){
   d=sqrt(ap2+bq2-2*ap*bq*cos(teta))
   a=sqrt(ap2+ombq2+2*ap*ombq*cos(teta))
   b=sqrt(omap2+ombq2-2*omap*ombq*cos(teta))
   c=sqrt(omap2+bq2+2*omap*bq*cos(teta))
   # the comparison operators were stripped from the original; the test
   # checks the perimeter constraint and whether the new area is a maximum
   if ((abs(a+b+c+d-2*s)<1e-2)&(p*q*sin(teta)>2*maxur)){
    maxur=p*q*sin(teta)/2
    sole=c(a,b,c,d,alpha,beta,teta)}}}}

This code returned an area of 4350, to compare with the optimal 6090 (which is recovered by the above R code when the diagonal lengths are identical and the perimeter is the one of the associated square). (As Jean-Louis Foulley pointed out to me, this area can be found directly by assuming the quadrilateral is a parallelogram.)
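
The unconstrained optimum is easy to verify numerically: the area of a quadrilateral with diagonals p and q crossing at angle θ is pq·sin(θ)/2, which is maximised at θ = π/2. A quick sketch (not from the original post) confirms the 6090 figure:

```r
# Numeric check of the unconstrained optimum A = p*q/2 = 6090:
# sweep the angle between the diagonals and locate the maximum area.
p <- 105; q <- 116
theta <- seq(0.001, pi - 0.001, length.out = 1e4)
area  <- p * q * sin(theta) / 2
max(area)               # approaches p*q/2 = 6090
theta[which.max(area)]  # approaches pi/2
```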

Filed under: Kids, R Tagged: geometry, Le Monde, mathematical puzzle, optimisation, R, Rmpfr

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

The Most Popular Languages for Data Scientists/Engineers

Mon, 04/03/2017 - 23:26

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The results of the 2017 StackOverflow Survey of nearly 65,000 developers were published recently, and include lots of interesting insights about their work, lives and preferences. Among them is a cross-tabulation of the most popular languages amongst the "Data Scientist/Engineer" subset, and the results were … well, surprising:

When thinking about data scientists, it certainly makes sense to see SQL, Python and R in this list. (I've only included the top 10 above.) But it's a real surprise to see JavaScript at the top of the list, and the presence of PHP is just unfathomable to me. I think it goes to show that the "Data Engineer" role is a very different type of job than "Data Scientist". Sadly, it's not clear what the relative proportion of Data Scientists to Data Engineers is in the survey, as it's not broken out elsewhere.

Nonetheless, there were several other interesting tidbits in the survey relevant to data scientists:

  • Overall, Python is the fifth most popular language, used by 32% of respondents. R ranks #15, used by 4.4% of respondents.
  • The top three databases are MySQL (44.3%), SQL Server (30.8%) and SQLite (21.2%).
  • The most popular platforms are Windows (41%), Linux (33%) and Android (28%).
  • AWS is used by 28% of respondents; Microsoft Azure by 11.4%.
  • Python is the "Most Wanted" language (indicated by 21% of respondents as a language they'd like to be using).
  • Oracle is the "Most Dreaded" database (63% of users would rather be using a different database).
  • Visual Studio is the most popular development environment within all developer categories except "Sysadmin/DevOps" (in which case it's vim).
  • In the US, R developers earn $100,000 per year on average; for Python, it's $99,000. "Machine Learning specialist", "Developer with a statistics or mathematics background" and "data scientist" were the three highest-paying job categories in the US, all topping $100,000.
  • Developers consider Elliot Alderson from Mr Robot to be the most realistic portrayal of a programmer in fiction.

You can find a complete analysis of the survey data by following the link below.

StackOverflow: Developer Survey Results 2017




To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

simmer 3.6.1

Mon, 04/03/2017 - 17:26

A new patch release of simmer, the Discrete-Event Simulator for R, is on CRAN. Three months have passed since the last release. The last year was a period of intense development (one release per month). Now, the package has reached some level of maturity, so we intend to extend the release cycle.

In this maintenance release, the replacement operators for trajectories ([<-, [[<-) now work as expected. Also, we have removed previously deprecated plotting capabilities, which are covered and extended by the simmer.plot package.

Last but not least, we have extended the from_to() convenience function with a parameter every, which enables the generation of arrivals in cycles. For instance, let us suppose we want to simulate different patient arrival rates in the morning, evening and night:

library(simmer)
library(simmer.plot)
## Loading required package: ggplot2

set.seed(1234)

# units are hours
# visit time between 10 and 20 minutes
patient <- trajectory() %>%
  seize("doctor", 1) %>%
  timeout(function() runif(1, 10/60, 20/60)) %>%
  release("doctor", 1)

morning <- from_to(start_time = 8, stop_time = 16,
                   dist = function() rexp(1, 60/15),
                   arrive = FALSE, every = 24)

evening <- from_to(start_time = 16, stop_time = 24,
                   dist = function() rexp(1, 60/30),
                   arrive = FALSE, every = 24)

night <- from_to(start_time = 0, stop_time = 8,
                 dist = function() rexp(1, 60/60),
                 arrive = FALSE, every = 24)

env <- simmer() %>%
  add_resource("doctor", 1) %>%
  add_generator("morning", patient, morning) %>%
  add_generator("evening", patient, evening) %>%
  add_generator("night", patient, night) %>%
  run(24 * 5)  # simulate 5 days

breaks <- c(0, cumsum(rep(8, 3 * 5)))

env %>%
  get_mon_arrivals() %>%
  dplyr::mutate(generator = factor(gsub("\\d", "", name))) %>%
  ggplot(aes(start_time, fill = generator)) + xlab("time") +
  stat_bin(breaks = breaks) +
  scale_x_continuous(breaks = breaks)

plot(env, what="resources", metric="usage", "doctor", steps=TRUE) + scale_x_continuous(breaks = breaks)

Minor changes and fixes:
  • Recycle logical indexes when subsetting (2526e75).
  • Implement replacement operators, [<- and [[<- (#88).
  • Provide rep() S3 method for trajectories (7fa515e).
  • Remove plotting functions (bb9656b), deprecated since v3.6.0. The new simmer.plot package (on CRAN) already covers these features among others.
  • Don’t evaluate vignette chunks if Suggests are not installed (e40e5b6).
  • Rewrite DESCRIPTION (3f26516).
  • Add an every parameter to the from_to() convenience function (9d68887).


An Introduction to Stock Market Data Analysis with R (Part 2)

Mon, 04/03/2017 - 17:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

Around September of 2016 I wrote two articles on using Python for accessing, visualizing, and evaluating trading strategies (see part 1 and part 2). These have been my most popular posts, up until I published my article on learning programming languages (featuring my dad’s story as a programmer), and they have been translated into both Russian (at a link that no longer appears to work) and Chinese (here and here). R has excellent packages for analyzing stock data, so I feel there should be a “translation” of the post for using R for stock data analysis.

This post is the second in a two-part series on stock data analysis using R, based on a lecture I gave on the subject for MATH 3900 (Data Science) at the University of Utah. (You can read the first post here.) In these posts, I discuss basics such as obtaining the data from Yahoo! Finance using quantmod, visualizing stock data, moving averages, developing a moving-average crossover strategy, backtesting, and benchmarking. This post, the final one, discusses moving average crossover strategies, backtesting, and benchmarking, and will include practice problems.

NOTE: The information in this post is of a general nature containing information and opinions from the author’s perspective. None of the content of this post should be considered financial advice. Furthermore, any code written here is provided without any form of guarantee. Individuals who choose to use it do so at their own risk.

Trading Strategy

Call an open position a trade that will be terminated in the future when a condition is met. A long position is one in which a profit is made if the financial instrument traded increases in value, and a short position is one in which a profit is made if the financial asset being traded decreases in value. When trading stocks directly, all long positions are bullish and all short positions are bearish. That said, a bullish attitude need not be accompanied by a long position, and a bearish attitude need not be accompanied by a short position (this is particularly true when trading stock options).

Here is an example. Let’s say you buy a stock with the expectation that the stock will increase in value, with a plan to sell the stock at a higher price. This is a long position: you are holding a financial asset for which you will profit if the asset increases in value. Your potential profit is unlimited, and your potential losses are limited by the price of the stock since stock prices never go below zero. On the other hand, if you expect a stock to decrease in value, you may borrow the stock from a brokerage firm and sell it, with the expectation of buying the stock back later at a lower price, thus earning you a profit. This is called shorting a stock, and is a short position, since you will earn a profit if the stock drops in value. The potential profit from shorting a stock is limited by the price of the stock (the best you can do is have the stock become worth nothing; you buy it back for free), while the losses are unlimited, since you could potentially spend an arbitrarily large amount of money to buy the stock back. Thus, a broker will expect an investor to be in a very good financial position before allowing the investor to short a stock.
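To make the asymmetry concrete, here is a tiny R sketch contrasting the profit on a long position with the profit on a short position. All prices are made-up numbers for illustration, not market data:

```r
# Long position: buy first, sell later; profit if the price rises
buy_price  <- 100   # hypothetical purchase price
sell_price <- 120   # hypothetical later sale price
long_profit <- sell_price - buy_price   # gains are unbounded above;
                                        # losses bounded by buy_price

# Short position: sell borrowed shares first, buy them back later;
# profit if the price falls
short_sell_price <- 100  # hypothetical price at which borrowed shares are sold
buy_back_price   <- 80   # hypothetical repurchase price
short_profit <- short_sell_price - buy_back_price  # gains bounded by the sale
                                                   # price; losses unbounded
```

Note how the bounds flip: the long trade can lose at most `buy_price` per share, while the short trade's worst case grows without limit as the stock rises.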

Any trader must have a set of rules that determine how much of her money she is willing to bet on any single trade. For example, a trader may decide that under no circumstances will she risk more than 10% of her portfolio on a trade. Additionally, in any trade, a trader must have an exit strategy, a set of conditions determining when she will exit the position, for either profit or loss. A trader may set a target, which is the minimum profit that will induce the trader to leave the position. Likewise, a trader must have a maximum loss she is willing to tolerate; if potential losses go beyond this amount, the trader will exit the position in order to prevent any further loss (this is usually done by setting a stop-loss order, an order that is triggered to prevent further losses).

We will call a plan that includes trading signals for prompting trades, a rule for deciding how much of the portfolio to risk on any particular strategy, and a complete exit strategy for any trade an overall trading strategy. Our concern now is to design and evaluate trading strategies.

We will suppose that the amount of money in the portfolio involved in any particular trade is a fixed proportion; 10% seems like a good number. We will also say that for any trade, if losses exceed 20% of the value of the trade, we will exit the position. Now we need a means for deciding when to enter a position and when to exit for a profit.
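As a rough illustration of these two rules, here is a minimal R sketch (all figures hypothetical) of how the position size and the stop-loss price for a long trade could be computed:

```r
portfolio_value <- 1000000
trade_fraction  <- 0.10    # commit at most 10% of the portfolio to a trade
max_loss        <- 0.20    # exit if the trade loses 20% of its value

trade_size <- portfolio_value * trade_fraction  # dollars committed to the trade

entry_price <- 50                               # hypothetical entry price
stop_price  <- entry_price * (1 - max_loss)     # stop-loss price for a long trade
```

With these numbers, each trade commits $100,000, and a long entered at $50 would be exited if the price fell to $40.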

Here, I will be demonstrating a moving average crossover strategy. We will use two moving averages, one we consider “fast”, and the other “slow”. The strategy is:

  • Trade the asset when the fast moving average crosses over the slow moving average.
  • Exit the trade when the fast moving average crosses over the slow moving average again.

A long trade will be prompted when the fast moving average crosses from below to above the slow moving average, and the trade will be exited when the fast moving average crosses below the slow moving average later. A short trade will be prompted when the fast moving average crosses below the slow moving average, and the trade will be exited when the fast moving average later crosses above the slow moving average.

We now have a complete strategy. But before we decide we want to use it, we should try to evaluate the quality of the strategy first. The usual means for doing so is backtesting, which is looking at how profitable the strategy is on historical data. For example, looking at the above chart’s performance on Apple stock, if the 20-day moving average is the fast moving average and the 50-day moving average the slow, this strategy does not appear to be very profitable, at least not if you are always taking long positions.

(You may not see all of these packages being loaded in the session; they may get loaded when other packages are loaded.)

First, let’s recreate some of our earlier charts. (I’m going to use different methods for recreating these charts, both for diversity and because some functions we’ll be using later will be introduced this way. The method used is actually more general than what was used in part 1.)

if (!require("TTR")) {
  install.packages("TTR")
  library(TTR)
}
if (!require("quantstrat")) {
  install.packages("quantstrat", repos="")
  library(quantstrat)
}
if (!require("IKTrading")) {
  if (!require("devtools")) {
    install.packages("devtools")
  }
  library(devtools)
  install_github("IKTrading", username = "IlyaKipnis")
  library(IKTrading)
}
library(quantmod)

start <- as.Date("2010-01-01")
end <- as.Date("2016-10-01")

# Let's get Apple stock data; Apple's ticker symbol is AAPL. We use the
# quantmod function getSymbols, and pass a string as a first argument to
# identify the desired ticker symbol, pass "yahoo" to src for Yahoo!
# Finance, and from and to specify date ranges

# The default behavior for getSymbols is to load data directly into the
# global environment, with the object being named after the loaded ticker
# symbol. This feature may become deprecated in the future, but we exploit
# it now.
getSymbols("AAPL", src="yahoo", from = start, to = end)
## [1] "AAPL"

# What is AAPL?
class(AAPL)
## [1] "xts" "zoo"

# Let's see the first few rows
head(AAPL)
##            AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume
## 2010-01-04    213.43    214.50   212.38     214.01   123432400
## 2010-01-05    214.60    215.59   213.25     214.38   150476200
## 2010-01-06    214.38    215.23   210.75     210.97   138040000
## 2010-01-07    211.75    212.00   209.05     210.58   119282800
## 2010-01-08    210.30    212.00   209.06     211.98   111902700
## 2010-01-11    212.80    213.00   208.45     210.11   115557400
##            AAPL.Adjusted
## 2010-01-04      27.72704
## 2010-01-05      27.77498
## 2010-01-06      27.33318
## 2010-01-07      27.28265
## 2010-01-08      27.46403
## 2010-01-11      27.22176

candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white",
            subset = "2016-01-04/")

AAPL_sma_20 <- SMA(
  Cl(AAPL),  # The closing price of AAPL, obtained by quantmod's Cl() function
  n = 20     # The number of days in the moving average window
)
AAPL_sma_50 <- SMA(
  Cl(AAPL),
  n = 50
)
AAPL_sma_200 <- SMA(
  Cl(AAPL),
  n = 200
)

zoomChart("2016")  # Zoom into the year 2016 in the chart
addTA(AAPL_sma_20, on = 1, col = "red")  # on = 1 plots the SMA with price
addTA(AAPL_sma_50, on = 1, col = "blue")
addTA(AAPL_sma_200, on = 1, col = "green")

We will refer to the sign of the difference between the fast and the slow moving average as the regime; that is, if the fast moving average is above the slow moving average, this is a bullish regime (the bulls rule), and a bearish regime (the bears rule) holds when the fast moving average is below the slow moving average. I identify regimes with the following code.

AAPL_trade <- AAPL
AAPL_trade$`20d` <- AAPL_sma_20
AAPL_trade$`50d` <- AAPL_sma_50

regime_val <- sigComparison("", data = AAPL_trade,
                            columns = c("20d", "50d"),
                            relationship = "gt") -
              sigComparison("", data = AAPL_trade,
                            columns = c("20d", "50d"),
                            relationship = "lt")

plot(regime_val["2016"], main = "Regime", ylim = c(-2, 2))

plot(regime_val, main = "Regime", ylim = c(-2, 2))

I could visualize the regime along with the main series with the following code:

candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white",
            subset = "2016-01-04/")
addTA(regime_val, col = "blue", yrange = c(-2, 2))
addLines(h = 0, col = "black", on = 3)
addSMA(n = c(20, 50), on = 1, col = c("red", "blue"))
zoomChart("2016")

candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white",
            subset = "2016-01-04/")
addTA(regime_val, col = "blue", yrange = c(-2, 2))
addLines(h = 0, col = "black", on = 3)
addSMA(n = c(20, 50), on = 1, col = c("red", "blue"))

table(as.vector(regime_val))
##
##  -1   1
## 663 987

The call above indicates that the market was bullish on Apple stock for 987 days, and bearish for 663 days. (If you were following along on the Python version, you may notice there is no 0. This may be due to different behavior by sigComparison().)

Trading signals appear at regime changes. When a bullish regime begins, a buy signal is triggered, and when it ends, a sell signal is triggered. Likewise, when a bearish regime begins, a sell signal is triggered, and when the regime ends, a buy signal is triggered (this is of interest only if you ever will short the stock, or use some derivative like a stock option to bet against the market).

It’s simple to obtain signals. Let r_t indicate the regime at time t, and s_t the signal at time t. Then:

s_t = (r_t − r_{t−1}) / 2

with s_t = −1 indicating “sell”, s_t = 1 indicating “buy”, and s_t = 0 no action. We can obtain signals like so:

sig <- diff(regime_val) / 2
plot(sig, main = "Signal", ylim = c(-2, 2))

table(sig)
## sig
##   -1    0    1
##   19 1611   19

We would buy Apple stock 19 times and sell Apple stock 19 times. If we only go long on Apple stock, only 19 trades will be engaged in over the 6-year period, while if we pivot from a long to a short position every time a long position is terminated, we would engage in 38 trades total. (Bear in mind that trading more frequently isn’t necessarily good; trades are never free.)

You may notice that the system as it currently stands isn’t very robust, since even a fleeting moment when the fast moving average is above the slow moving average triggers a trade, resulting in trades that end immediately (which is bad, if only because realistically every trade is accompanied by a fee that can quickly erode earnings). Additionally, every bullish regime immediately transitions into a bearish regime, and if you were constructing trading systems that allow both bullish and bearish bets, this would lead to the end of one trade immediately triggering a new trade that bets on the market in the opposite direction, which again seems finicky. A better system would require more evidence that the market is moving in some particular direction. But we will not concern ourselves with these details for now.

Let’s now try to identify what the price of the stock is at every buy and every sell.

# The Cl function from quantmod pulls the closing price from the object
# holding a stock's data

# Buy prices
Cl(AAPL)[which(sig == 1)]
##            AAPL.Close
## 2010-06-18     274.07
## 2010-09-20     283.23
## 2011-05-12     346.57
## 2011-07-14     357.77
## 2011-12-28     402.64
## 2012-06-25     570.77
## 2013-05-17     433.26
## 2013-07-31     452.53
## 2013-10-16     501.11
## 2014-03-26     539.78
## 2014-04-25     571.94
## 2014-08-18      99.16
## 2014-10-28     106.74
## 2015-02-05     119.94
## 2015-04-28     130.56
## 2015-10-27     114.55
## 2016-03-11     102.26
## 2016-07-01      95.89
## 2016-07-25      97.34

# Sell prices
Cl(AAPL)[sig == -1]
##            AAPL.Close
## 2010-06-11     253.51
## 2010-07-22     259.02
## 2011-03-31     348.51
## 2011-05-27     337.41
## 2011-11-17     377.41
## 2012-05-09     569.18
## 2012-10-17     644.61
## 2013-06-26     398.07
## 2013-10-03     483.41
## 2014-01-28     506.50
## 2014-04-22     531.70
## 2014-06-11      93.86
## 2014-10-17      97.67
## 2015-01-05     106.25
## 2015-04-16     126.17
## 2015-06-25     127.50
## 2015-12-18     106.03
## 2016-05-05      93.24
## 2016-07-08      96.68

# Since these are of the same dimension, computing profit is easy
as.vector(Cl(AAPL)[sig == 1])[-1] - Cl(AAPL)[sig == -1][-table(sig)[["1"]]]
##             AAPL.Close
## 2010-06-11   29.720012
## 2010-07-22   87.549988
## 2011-03-31    9.259998
## 2011-05-27   65.230011
## 2011-11-17  193.360020
## 2012-05-09 -135.920013
## 2012-10-17 -192.080017
## 2013-06-26  103.040009
## 2013-10-03   56.369995
## 2014-01-28   65.440003
## 2014-04-22 -432.540016
## 2014-06-11   12.879997
## 2014-10-17   22.270004
## 2015-01-05   24.309998
## 2015-04-16  -11.619995
## 2015-06-25  -25.239998
## 2015-12-18  -10.140000
## 2016-05-05    4.099998

Above, we can see that in June of 2014 there was a massive drop in the price of Apple stock, and it looks like our trading system would do badly. But this price drop was not due to a massive shock to Apple, but simply to a stock split. And while dividend payments are not as obvious as a stock split, they may be affecting the performance of our system.

candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white")
addTA(regime_val, col = "blue", yrange = c(-2, 2))
addLines(h = 0, col = "black", on = 3)
addSMA(n = c(20, 50), on = 1, col = c("red", "blue"))
zoomChart("2014-05/2014-07")

We don’t want our trading system to be behaving poorly because of stock splits and dividend payments. How should we handle this? One approach would be to obtain historical stock split and dividend payment data and design a trading system for handling these. This would most realistically represent the behavior of the stock and could be considered the best solution, but it is more complicated. Another solution would be to adjust the prices to account for stock splits and dividend payments.

Yahoo! Finance only provides the adjusted closing price of a stock, but this is all we need to get adjusted opening, high, and low prices. The adjusted close is computed like so:

adjusted price_t = m_t × price_t

where m_t is the multiplier used for the adjustment. Solving for m_t requires only division, and thus we can use the closing price and the adjusted closing price to adjust all prices in the series.
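A minimal sketch of that adjustment, using the first day of AAPL data printed earlier (quantmod's adjustOHLC(), used below, automates this across the whole series):

```r
# First row of the AAPL data shown earlier (2010-01-04)
close_px  <- 214.01
adj_close <- 27.72704
open_px   <- 213.43
high_px   <- 214.50
low_px    <- 212.38

# Recover the adjustment multiplier by division...
m <- adj_close / close_px

# ...then scale the remaining prices by it
adj_open <- open_px * m
adj_high <- high_px * m
adj_low  <- low_px  * m
```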

Let’s go back, adjust the Apple data, and reevaluate our trading system using the adjusted data.

candleChart(adjustOHLC(AAPL), up.col = "black", dn.col = "red", theme = "white")
addLines(h = 0, col = "black", on = 3)
addSMA(n = c(20, 50), on = 1, col = c("red", "blue"))

As you can see, adjusting for dividends and stock splits makes a big difference. We will use this data from now on.

Let’s now create a simulated portfolio of $1,000,000, and see how it would behave, according to the rules we have established. This includes:

  • Investing only 10% of the portfolio in any trade
  • Exiting the position if losses exceed 20% of the value of the trade.

When simulating, bear in mind that:

  • Trades are done in batches of 100 shares (unfortunately, osMaxDollar() will not round to batches of 100, so for the sake of simplicity, this will be ignored).
  • Our stop-loss rule involves placing an order to sell the stock the moment the price drops below the specified level. Thus we need to check whether the lows during this period ever go low enough to trigger the stop-loss. Realistically, unless we buy a put option, we cannot guarantee that we will sell the stock at the price we set at the stop-loss, but we will use this as the selling price anyway for the sake of simplicity.
  • Every trade is accompanied by a commission to the broker, which should be accounted for. I do not do so here.
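As a back-of-the-envelope illustration of why the ignored commissions matter, assume a hypothetical flat $10 commission per transaction (the figure is an assumption, not from the post):

```r
commission <- 10   # hypothetical flat fee per transaction
n_txns     <- 90   # e.g. the 90 AAPL transactions the backtest reports below

total_commission <- commission * n_txns  # total drag on the bottom line, in dollars
```

Even this modest fee shaves $900 off the strategy's profit, which is why frequent trading needs to earn its keep.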

quantmod is good for visualizing stock data, but if we want to start developing and testing strategies, we will need to rely more on other packages:

  • TTR contains functions for computing technical indicators (simple moving averages, or SMAs, are included).
  • blotter is intended to manage portfolios and positions created while developing trading strategies. It provides environments intended to help simplify portfolio tracking tasks.
  • FinancialInstruments is a companion package to blotter and stores meta-data about different financial instruments (stocks, options contracts, etc.).
  • quantstrat may be the workhorse package here; it is used for defining and testing trading strategies.
  • IKTrading contains some additional useful functions for trading (in particular, osMaxDollar()).

quantstrat and its companion packages provide a flexible framework for backtesting that, in principle, should allow any trading strategy to be backtested. The following code must be executed for every testing session.

For the sake of simplicity, I’m not going to force orders to be in batches of 100 shares, nor will I enforce the stop-loss rule mentioned above. It is possible to implement these rules, but doing so is not simple. Perhaps in a later article I will revisit these topics. Additionally, I am allowing the strategy to further enter a position in “bullish” markets. I would not recommend this in the real world, since this will drive up transaction costs for little profit. (Rule of thumb: the fewer transactions, the better.)

rm(list = ls(.blotter), envir = .blotter)  # Clear blotter environment
currency("USD")  # Currency being used
## [1] "USD"
Sys.setenv(TZ = "MDT")  # Allows quantstrat to use timestamps

Next, we declare the stocks we are going to be working with.

AAPL_adj <- adjustOHLC(AAPL)
stock("AAPL_adj", currency = "USD", multiplier = 1)
## [1] "AAPL_adj"

initDate <- "1990-01-01"  # A date prior to first close price; needed (why?)

Now we create a portfolio and account object that will be used in the simulation.

strategy_st <- portfolio_st <- account_st <- "SMAC-20-50"  # Names of objects
rm.strat(portfolio_st)  # Need to remove portfolio from blotter env
rm.strat(strategy_st)   # Ensure no strategy by this name exists either
initPortf(portfolio_st, symbols = "AAPL_adj",  # This is a simple portfolio
                                               # trading AAPL only
          initDate = initDate, currency = "USD")
## [1] "SMAC-20-50"
initAcct(account_st, portfolios = portfolio_st,  # Uses only one portfolio
         initDate = initDate, currency = "USD",
         initEq = 1000000)  # Start with a million dollars
## [1] "SMAC-20-50"
initOrders(portfolio_st, store = TRUE)  # Initialize the order container; will
                                        # contain all orders made by strategy

Now we define the strategy.

strategy(strategy_st, store = TRUE)  # store = TRUE tells function to store in
                                     # the .strategy environment

# Now define trading rules
# Indicators are used to construct signals
add.indicator(strategy = strategy_st, name = "SMA",     # SMA is a function
              arguments = list(x = quote(Cl(mktdata)),  # args of SMA
                               n = 20),
              label = "fastMA")
## [1] "SMAC-20-50"
add.indicator(strategy = strategy_st, name = "SMA",
              arguments = list(x = quote(Cl(mktdata)),
                               n = 50),
              label = "slowMA")
## [1] "SMAC-20-50"

# Next comes trading signals
add.signal(strategy = strategy_st, name = "sigComparison",  # Remember me?
           arguments = list(columns = c("fastMA", "slowMA"),
                            relationship = "gt"),
           label = "bull")
## [1] "SMAC-20-50"
add.signal(strategy = strategy_st, name = "sigComparison",
           arguments = list(columns = c("fastMA", "slowMA"),
                            relationship = "lt"),
           label = "bear")
## [1] "SMAC-20-50"

# Finally, rules that generate trades
add.rule(strategy = strategy_st, name = "ruleSignal",  # Almost always this one
         arguments = list(sigcol = "bull",  # Signal (see above) that triggers
                          sigval = TRUE,
                          ordertype = "market",
                          orderside = "long",
                          replace = FALSE,
                          prefer = "Open",
                          osFUN = osMaxDollar,
                          # The next parameter, which is a parameter passed to
                          # osMaxDollar, will ensure that trades are about 10%
                          # of portfolio equity
                          maxSize = quote(floor(getEndEq(account_st,
                                                         Date = timestamp) * .1)),
                          tradeSize = quote(floor(getEndEq(account_st,
                                                           Date = timestamp) * .1))),
         type = "enter", path.dep = TRUE, label = "buy")
## [1] "SMAC-20-50"
add.rule(strategy = strategy_st, name = "ruleSignal",
         arguments = list(sigcol = "bear",
                          sigval = TRUE,
                          orderqty = "all",
                          ordertype = "market",
                          orderside = "long",
                          replace = FALSE,
                          prefer = "Open"),
         type = "exit", path.dep = TRUE, label = "sell")
## [1] "SMAC-20-50"

With the strategy set up, we now execute it. (I’m allowing the function’s intermediate output to be included; I believe, in this case, it’s nice to see what is going on.)

applyStrategy(strategy_st, portfolios = portfolio_st)
## [1] "2010-03-17 00:00:00 AAPL_adj 3410 @ 29.4145214710173"
## [1] "2010-03-19 00:00:00 AAPL_adj 1 @ 29.4001360824518"
## [1] "2010-03-23 00:00:00 AAPL_adj 56 @ 29.5113078050153"
## [1] "2010-06-14 00:00:00 AAPL_adj -3467 @ 33.4768405519892"
## [1] "2010-06-21 00:00:00 AAPL_adj 2808 @ 36.3188897412688"
## [1] "2010-06-23 00:00:00 AAPL_adj 1 @ 35.9121377755245"
## [1] "2010-06-25 00:00:00 AAPL_adj 12 @ 35.3209704881171"
## [1] "2010-06-28 00:00:00 AAPL_adj 10 @ 34.9115992042882"
## [1] "2010-06-29 00:00:00 AAPL_adj 33 @ 34.5440814243432"
## [1] "2010-06-30 00:00:00 AAPL_adj 30 @ 33.5749295477998"
## [1] "2010-07-01 00:00:00 AAPL_adj 84 @ 33.2597286804071"
## [1] "2010-07-02 00:00:00 AAPL_adj 28 @ 32.761422084999"
## [1] "2010-07-06 00:00:00 AAPL_adj 46 @ 32.8281245169061"
## [1] "2010-07-15 00:00:00 AAPL_adj 13 @ 32.4658384412953"
## [1] "2010-07-16 00:00:00 AAPL_adj 15 @ 33.1132447519415"
## [1] "2010-07-21 00:00:00 AAPL_adj 67 @ 34.6709448593864"
## [1] "2010-07-23 00:00:00 AAPL_adj -3147 @ 33.6246305427903"
## [1] "2010-09-21 00:00:00 AAPL_adj 2769 @ 37.1258627071335"
## [1] "2011-04-01 00:00:00 AAPL_adj -2769 @ 45.9214462525203"
## [1] "2011-05-13 00:00:00 AAPL_adj 2209 @ 45.2086437030919"
## [1] "2011-05-16 00:00:00 AAPL_adj 2 @ 44.3637427445526"
## [1] "2011-05-17 00:00:00 AAPL_adj 43 @ 43.4220589833276"
## [1] "2011-05-18 00:00:00 AAPL_adj 48 @ 44.006688373276"
## [1] "2011-05-24 00:00:00 AAPL_adj 15 @ 43.8798216684994"
## [1] "2011-05-31 00:00:00 AAPL_adj -2317 @ 44.6122447113503"
## [1] "2011-07-15 00:00:00 AAPL_adj 2117 @ 47.2371856911493"
## [1] "2011-08-24 00:00:00 AAPL_adj 5 @ 48.8458929867096"
## [1] "2011-11-18 00:00:00 AAPL_adj -2122 @ 49.5586954053487"
## [1] "2011-12-29 00:00:00 AAPL_adj 1879 @ 52.7604184147789"
## [1] "2011-12-30 00:00:00 AAPL_adj 16 @ 52.7748073346566"
## [1] "2012-05-10 00:00:00 AAPL_adj -1895 @ 75.1489364843129"
## [1] "2012-06-26 00:00:00 AAPL_adj 1324 @ 74.7238700874815"
## [1] "2012-06-27 00:00:00 AAPL_adj 14 @ 75.2038727149497"
## [1] "2012-10-18 00:00:00 AAPL_adj -1338 @ 84.0107133628424"
## [1] "2013-05-20 00:00:00 AAPL_adj 1704 @ 57.702437434572"
## [1] "2013-05-21 00:00:00 AAPL_adj 29 @ 58.5360921156059"
## [1] "2013-06-18 00:00:00 AAPL_adj 1 @ 57.6556788337951"
## [1] "2013-06-20 00:00:00 AAPL_adj 1 @ 56.0177655048158"
## [1] "2013-06-21 00:00:00 AAPL_adj 50 @ 55.9095500863203"
## [1] "2013-06-24 00:00:00 AAPL_adj 3 @ 54.4279450227602"
## [1] "2013-06-25 00:00:00 AAPL_adj 49 @ 54.2008263223713"
## [1] "2013-06-26 00:00:00 AAPL_adj 7 @ 53.9603509990938"
## [1] "2013-06-27 00:00:00 AAPL_adj -1844 @ 53.3391172023021"
## [1] "2013-08-01 00:00:00 AAPL_adj 1645 @ 60.8874187232282"
## [1] "2013-09-18 00:00:00 AAPL_adj 14 @ 62.288630508577"
## [1] "2013-10-07 00:00:00 AAPL_adj -1659 @ 65.4327788398408"
## [1] "2013-10-17 00:00:00 AAPL_adj 1484 @ 67.2375075223774"
## [1] "2013-10-18 00:00:00 AAPL_adj 3 @ 68.0457369974481"
## [1] "2014-01-29 00:00:00 AAPL_adj -1487 @ 68.1670731190817"
## [1] "2014-03-12 00:00:00 AAPL_adj 1372 @ 72.733561996219"
## [1] "2014-03-13 00:00:00 AAPL_adj 2 @ 73.1322624931313"
## [1] "2014-03-17 00:00:00 AAPL_adj 15 @ 71.8068850638596"
## [1] "2014-03-18 00:00:00 AAPL_adj -1389 @ 71.5619513215039"
## [1] "2014-03-25 00:00:00 AAPL_adj 1364 @ 73.6847222604636"
## [1] "2014-03-31 00:00:00 AAPL_adj 1 @ 73.37583725238"
## [1] "2014-04-02 00:00:00 AAPL_adj 1 @ 73.8044711654273"
## [1] "2014-04-08 00:00:00 AAPL_adj 25 @ 71.4653411892903"
## [1] "2014-04-09 00:00:00 AAPL_adj 8 @ 71.1183468222455"
## [1] "2014-04-10 00:00:00 AAPL_adj 7 @ 72.2123947642041"
## [1] "2014-04-14 00:00:00 AAPL_adj 9 @ 71.0176525287248"
## [1] "2014-04-17 00:00:00 AAPL_adj 3 @ 70.7591073193403"
## [1] "2014-04-23 00:00:00 AAPL_adj -1418 @ 71.9919515657196"
## [1] "2014-04-28 00:00:00 AAPL_adj 1301 @ 77.943882953473"
## [1] "2014-10-20 00:00:00 AAPL_adj -1301 @ 94.6439193845976"
## [1] "2014-10-29 00:00:00 AAPL_adj 985 @ 102.662471436688"
## [1] "2015-01-06 00:00:00 AAPL_adj -985 @ 103.001288430376"
## [1] "2015-02-06 00:00:00 AAPL_adj 858 @ 116.491485509158"
## [1] "2015-02-10 00:00:00 AAPL_adj 11 @ 116.637076575269"
## [1] "2015-04-17 00:00:00 AAPL_adj -869 @ 121.858912853907"
## [1] "2015-04-29 00:00:00 AAPL_adj 766 @ 126.333382759857"
## [1] "2015-04-30 00:00:00 AAPL_adj 25 @ 124.858064939017"
## [1] "2015-05-01 00:00:00 AAPL_adj 9 @ 122.392738351109"
## [1] "2015-05-04 00:00:00 AAPL_adj 17 @ 125.69278245721"
## [1] "2015-05-08 00:00:00 AAPL_adj 5 @ 123.469279769953"
## [1] "2015-06-29 00:00:00 AAPL_adj -822 @ 122.280199845824"
## [1] "2015-10-28 00:00:00 AAPL_adj 885 @ 114.482259314017"
## [1] "2015-11-17 00:00:00 AAPL_adj 28 @ 112.995955565041"
## [1] "2015-12-17 00:00:00 AAPL_adj 2 @ 110.144507689672"
## [1] "2015-12-21 00:00:00 AAPL_adj -915 @ 105.483868873907"
## [1] "2016-03-11 00:00:00 AAPL_adj 997 @ 101.073743853996"
## [1] "2016-04-28 00:00:00 AAPL_adj 56 @ 96.496561342483"
## [1] "2016-05-02 00:00:00 AAPL_adj 23 @ 92.8980829110911"
## [1] "2016-05-06 00:00:00 AAPL_adj -1076 @ 92.8669223571517"
## [1] "2016-06-24 00:00:00 AAPL_adj 1047 @ 92.4094018468721"
## [1] "2016-06-27 00:00:00 AAPL_adj 35 @ 92.4989129454683"
## [1] "2016-06-28 00:00:00 AAPL_adj -1082 @ 92.3994537379766"
## [1] "2016-07-01 00:00:00 AAPL_adj 1064 @ 94.9754947544617"
## [1] "2016-07-12 00:00:00 AAPL_adj -1064 @ 96.6464428592831"
## [1] "2016-07-26 00:00:00 AAPL_adj 1023 @ 96.2983306600025"
## [1] "2016-07-27 00:00:00 AAPL_adj 15 @ 103.708186831476"

Here we effectively conclude the study so we can perform some analytics.

updatePortf(portfolio_st)
## [1] "SMAC-20-50"
dateRange <- time(getPortfolio(portfolio_st)$summary)[-1]
updateAcct(portfolio_st, dateRange)
## [1] "SMAC-20-50"
updateEndEq(account_st)
## [1] "SMAC-20-50"

Here’s our first set of useful summary statistics.

tStats <- tradeStats(Portfolios = portfolio_st, use = "trades",
                     inclZeroDays = FALSE)
tStats[, 4:ncol(tStats)] <- round(tStats[, 4:ncol(tStats)], 2)
print(data.frame(t(tStats[, -c(1,2)])))
##                     AAPL_adj
## Num.Txns               90.00
## Num.Trades             21.00
## Net.Trading.PL     112580.62
## Avg.Trade.PL         5360.98
## Med.Trade.PL         1379.84
## Largest.Winner      42426.01
## Largest.Loser       -8386.18
## Gross.Profits      152876.04
## Gross.Losses       -40295.42
## Std.Dev.Trade.PL    12835.99
## Percent.Positive       57.14
## Percent.Negative       42.86
## Profit.Factor           3.79
## Avg.Win.Trade       12739.67
## Med.Win.Trade        9970.11
## Avg.Losing.Trade    -4477.27
## Med.Losing.Trade    -3234.16
## Avg.Daily.PL         4765.18
## Med.Daily.PL          856.79
## Std.Dev.Daily.PL    12868.07
## Ann.Sharpe              5.88
## Max.Drawdown       -27945.73
## Profit.To.Max.Draw      4.03
## Avg.WinLoss.Ratio       2.85
## Med.WinLoss.Ratio       3.08
## Max.Equity         120545.86
## Min.Equity          -1182.21
## End.Equity         112580.62

final_acct <- getAccount(account_st)
plot(final_acct$summary$End.Eq["2010/2016"], main = "Portfolio Equity")

Our portfolio’s value grew by 10% in about six years. Considering that only 10% of the portfolio was ever involved in any single trade, this is not bad performance.

A more realistic portfolio would not be betting 10% of its value on only one stock. A more realistic one would consider investing in multiple stocks. Multiple trades may be ongoing at any given time involving multiple companies, and most of the portfolio will be in stocks, not cash.

Fortunately, quantstrat allows us to do this.

# Now for analytics
updatePortf(portfolio_st_2)
## [1] "SMAC-20-50_v2"
dateRange <- time(getPortfolio(portfolio_st_2)$summary)[-1]
updateAcct(account_st_2, dateRange)
## [1] "SMAC-20-50_v2"
updateEndEq(account_st_2)
## [1] "SMAC-20-50_v2"

tStats2 <- tradeStats(Portfolios = portfolio_st_2, use = "trades",
                      inclZeroDays = FALSE)
tStats2[, 4:ncol(tStats2)] <- round(tStats2[, 4:ncol(tStats2)], 2)
print(data.frame(t(tStats2[, -c(1,2)])))
##                     AAPL_adj  AMZN_adj    FB_adj  GOOG_adj   HPQ_adj
## Num.Txns               90.00    100.00     52.00     82.00    101.00
## Num.Trades             21.00     18.00     12.00     18.00     16.00
## Net.Trading.PL     112580.62 131136.44 123027.65  34936.13  47817.45
## Avg.Trade.PL         5360.98   7285.36  10252.30   1940.90   2988.59
## Med.Trade.PL         1379.84   2734.07   4946.80  -1110.80  -2937.26
## Largest.Winner      42426.01  39293.00  78430.09  31629.27  55409.83
## Largest.Loser       -8386.18 -17567.20 -18374.18 -10477.41 -16000.11
## Gross.Profits      152876.04 190370.65 148742.50  78661.57 129111.69
## Gross.Losses       -40295.42 -59234.21 -25714.85 -43725.44 -81294.24
## Std.Dev.Trade.PL    12835.99  18910.41  23637.58   9807.28  20596.60
## Percent.Positive       57.14     61.11     75.00     38.89     37.50
## Percent.Negative       42.86     38.89     25.00     61.11     62.50
## Profit.Factor           3.79      3.21      5.78      1.80      1.59
## Avg.Win.Trade       12739.67  17306.42  16526.94  11237.37  21518.62
## Med.Win.Trade        9970.11  10904.54   9600.62  10363.86  12287.22
## Avg.Losing.Trade    -4477.27  -8462.03  -8571.62  -3975.04  -8129.42
## Med.Losing.Trade    -3234.16  -7586.30  -3885.64  -4277.02  -8975.12
## Avg.Daily.PL         4765.18   4733.02  10715.22   1745.28   1887.01
## Med.Daily.PL          856.79   2358.72   4733.39  -1206.41  -4857.62
## Std.Dev.Daily.PL    12868.07  15980.17  24734.19  10072.85  20825.92
## Ann.Sharpe              5.88      4.70      6.88      2.75      1.44
## Max.Drawdown       -27945.73 -50062.96 -44728.34 -31255.17 -67890.79
## Profit.To.Max.Draw      4.03      2.62      2.75      1.12      0.70
## Avg.WinLoss.Ratio       2.85      2.05      1.93      2.83      2.65
## Med.WinLoss.Ratio       3.08      1.44      2.47      2.42      1.37
## Max.Equity         120545.86 131136.44 127540.04  48651.04  47817.45
## Min.Equity          -1182.21 -11314.07 -15750.32 -23388.32 -63845.70
## End.Equity         112580.62 131136.44 123027.65  34936.13  47817.45
##                      IBM_adj  MSFT_adj  NFLX_adj NTDOY_adj   SNY_adj
## Num.Txns              115.00    112.00     96.00    109.00    121.00
## Num.Trades             21.00     20.00     18.00     20.00     22.00
## Net.Trading.PL       9126.76  29513.06 306152.14  17217.18 -26278.53
## Avg.Trade.PL          434.61   1475.65  17008.45    860.86  -1194.48
## Med.Trade.PL         -722.94  -2408.39   3954.98  -1467.95    474.75
## Largest.Winner      22193.38  18573.05 177643.74  45165.04  15195.77
## Largest.Loser       -8188.45 -11537.97 -30183.02  -9052.12 -17715.75
## Gross.Profits       50691.47  91471.86 372517.68  85502.47  64843.56
## Gross.Losses       -41564.71 -61958.80 -66365.54 -68285.29 -91122.09
## Std.Dev.Trade.PL     6571.91   8946.29  44740.88  12590.03   8855.56
## Percent.Positive       38.10     45.00     66.67     40.00     54.55
## Percent.Negative       61.90     55.00     33.33     60.00     45.45
## Profit.Factor           1.22      1.48      5.61      1.25      0.71
## Avg.Win.Trade        6336.43  10163.54  31043.14  10687.81   5403.63
## Med.Win.Trade        3593.06  10044.61  14941.58   2718.22   3385.61
## Avg.Losing.Trade    -3197.29  -5632.62 -11060.92  -5690.44  -9112.21
## Med.Losing.Trade    -2368.90  -5111.92  -7694.86  -6391.76  -8601.77
## Avg.Daily.PL          434.61   1160.50  17854.27    -77.22  -1194.48
## Med.Daily.PL         -722.94  -2870.77   5280.37  -1587.30    474.75
## Std.Dev.Daily.PL     6571.91   9076.67  45969.27  12195.79   8855.56
## Ann.Sharpe              1.05      2.03      6.17     -0.10     -2.14
## Max.Drawdown       -42354.92 -32484.71 -57707.47 -55006.73 -38592.68
## Profit.To.Max.Draw      0.22      0.91      5.31      0.31     -0.68
## Avg.WinLoss.Ratio       1.98      1.80      2.81      1.88      0.59
## Med.WinLoss.Ratio       1.52      1.96      1.94      0.43      0.39
## Max.Equity          31766.86  44009.02 340579.29  35295.68  12237.77
## Min.Equity         -10588.07 -18649.24   -646.98 -39765.29 -34367.59
## End.Equity           9126.76  29513.06 306152.14  17217.18 -26278.53
##                     TWTR_adj  YHOO_adj
## Num.Txns               31.00    108.00
## Num.Trades              6.00     22.00
## Net.Trading.PL      17321.55 110711.07
## Avg.Trade.PL         2886.93   5032.32
## Med.Trade.PL        -3051.02  -1332.22
## Largest.Winner       3058.35  55850.27
## Largest.Loser      -11305.68  -8359.86
## Gross.Profits       45892.12 161301.36
## Gross.Losses       -28570.57 -50590.28
## Std.Dev.Trade.PL    20412.79  15701.97
## Percent.Positive       33.33     31.82
## Percent.Negative       66.67     68.18
## Profit.Factor           1.61      3.19
## Avg.Win.Trade       22946.06  23043.05
## Med.Win.Trade       22946.06  14477.80
## Avg.Losing.Trade    -7142.64  -3372.69
## Med.Losing.Trade    -8618.81  -3245.27
## Avg.Daily.PL        -5102.44   4582.54
## Med.Daily.PL        -6074.75  -1445.65
## Std.Dev.Daily.PL     6490.57  15943.85
## Ann.Sharpe            -12.48      4.56
## Max.Drawdown       -61686.41 -34526.52
## Profit.To.Max.Draw      0.28      3.21
## Avg.WinLoss.Ratio       3.21      6.83
## Med.WinLoss.Ratio       2.66      4.46
## Max.Equity          33279.24 115011.39
## Min.Equity         -28407.16 -11616.95
## End.Equity          17321.55 110711.07

final_acct2 <- getAccount(account_st_2)
plot(final_acct2$summary$End.Eq["2010/2016"], main = "Portfolio Equity")

A more realistic portfolio that can invest in any of a list of twelve (tech) stocks has a final growth of about 100%. How good is this? While on the surface not bad, we will see that we could have done better.


Backtesting is only part of evaluating the efficacy of a trading strategy. We would like to benchmark the strategy, or compare it to other available (usually well-known) strategies in order to determine how well we have done.

Whenever you evaluate a trading system, there is one strategy that you should always check, one that beats all but a handful of managed mutual funds and investment managers: buy and hold SPY. The efficient market hypothesis claims that it is all but impossible for anyone to beat the market. Thus, one should always buy an index fund that merely reflects the composition of the market. SPY is an exchange-traded fund (a mutual fund that is traded on the market like a stock) whose value effectively represents the value of the stocks in the S&P 500 stock index. By buying and holding SPY, we are effectively trying to match our returns with the market rather than beat it.

I obtain data on SPY below, and look at the profits for simply buying and holding SPY.

getSymbols("SPY", from = start, to = end)
## [1] "SPY"

# A crude estimate of end portfolio value from buying and holding SPY
1000000 * (SPY$SPY.Adjusted["2016-09-30"][[1]] / SPY$SPY.Adjusted[[1]])
## [1] 2189421

plot(final_acct2$summary$End.Eq["2010/2016"] / 1000000, main = "Portfolio Equity", ylim = c(0.8, 2.5))
lines(SPY$SPY.Adjusted / SPY$SPY.Adjusted[[1]], col = "blue")

Buying and holding SPY beats our trading system, at least as we have currently set it up, and we haven’t even accounted for how expensive our more complex strategy is in terms of fees. Given both the opportunity cost and the expense associated with the active strategy, we should not use it.

What could we do to improve the performance of our system? For starters, we could try diversifying. All the stocks we considered were tech companies, which means that if the tech industry is doing poorly, our portfolio will reflect that. We could try developing a system that can also short stocks or bet bearishly, so we can take advantage of movement in any direction. We could seek means for forecasting how high we expect a stock to move. Whatever we do, though, must beat this benchmark; otherwise there is an opportunity cost associated with our trading system.

Other benchmark strategies exist, and if our trading system beat the “buy and hold SPY” strategy, we may check against them. Some such strategies include:

  • Buy SPY when its closing monthly price is above its ten-month moving average.
  • Buy SPY when its ten-month momentum is positive. (Momentum is the first difference of a moving average process.)
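These benchmark rules are easy to prototype. Below is a minimal sketch of both rules in base R, using synthetic monthly prices as a stand-in; with real data you would substitute monthly SPY closes obtained via quantmod.

```r
set.seed(42)
# Stand-in for ten years of monthly closing prices (synthetic, for illustration only)
monthly_close <- 100 + cumsum(rnorm(120, mean = 0.5))

# Trailing ten-month simple moving average
sma10 <- stats::filter(monthly_close, rep(1/10, 10), sides = 1)

# Rule 1: hold SPY only in months where the close is above its ten-month SMA
invested_sma <- !is.na(sma10) & monthly_close > sma10

# Rule 2: hold SPY when ten-month momentum (close minus close ten months ago) is positive
momentum10 <- c(rep(NA, 10), diff(monthly_close, lag = 10))
invested_mom <- !is.na(momentum10) & momentum10 > 0
```

Backtesting either rule then amounts to applying the monthly return series only in the months flagged as invested.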

(I first read of these strategies here.) The general lesson still holds: don’t use a complex trading system with lots of active trading when a simple strategy involving an index fund without frequent trading beats it. This is actually a very difficult requirement to meet.

As a final note, suppose that your trading system did manage to beat any baseline strategy thrown at it in backtesting. Does backtesting predict future performance? Not at all. Backtesting has a propensity for overfitting, so just because backtesting predicts high growth doesn’t mean that growth will hold in the future.


While this lecture ends on a depressing note, keep in mind that the efficient market hypothesis has many critics. My own opinion is that as trading becomes more algorithmic, beating the market will become more difficult. That said, it may be possible to beat the market, even though mutual funds seem incapable of doing so (bear in mind, though, that part of the reason mutual funds perform so poorly is because of fees, which is not a concern for index funds).

This lecture is very brief, covering only one type of strategy: strategies based on moving averages. Many other trading signals exist and are employed. Additionally, we never discussed in depth shorting stocks, currency trading, or stock options. Stock options, in particular, are a rich subject that offers many different ways to bet on the direction of a stock.

Some resources I referred to when researching this subject include:

A resource I have not thoroughly checked out but looks promising is the ebook Backtesting Strategies with R, by Tim Trice, also discussing quantstrat.

Remember that it is possible (if not common) to lose money in the stock market. It’s also true, though, that it’s difficult to find returns like those of stocks elsewhere, and anyone investing should take their strategy seriously. This lecture is intended to provide a starting point for evaluating stock trading and investments, and, more generally, for analyzing temporal data, and I hope you continue to explore these ideas.

To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

RApiDatetime 0.0.3

Mon, 04/03/2017 - 02:55

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A brown bag bug fix release 0.0.3 of RApiDatetime is now on CRAN.

RApiDatetime provides six entry points for C-level functions of the R API for Date and Datetime calculations. The functions asPOSIXlt and asPOSIXct convert between long and compact datetime representation, formatPOSIXlt and Rstrptime convert to and from character strings, and POSIXlt2D and D2POSIXlt convert between Date and POSIXlt datetime. These six functions are all fairly essential and useful, but not one of them was previously exported by R.

I left two undefined variables in calls in the exported header file; this only became an issue once I actually tried accessing the API from another package, as I am now doing in anytime.

Changes in RApiDatetime version 0.0.3 (2017-04-02)
  • Correct two simple copy-and-paste errors in RApiDatetime.h

  • Also enable registration in useDynLib, and explicitly export known and documented R access functions provided for testing

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the rapidatetime page.

For questions or comments please use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Key Phrase Extraction from Tweets

Sun, 04/02/2017 - 23:41

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

In a previous post of mine published at DataScience+, I analyzed the text of the first presidential debate using relatively simple string manipulation functions to answer some high level questions from the available text.

In this post, we leverage a few other NLP techniques to analyze another text corpus – a collection of tweets. Given a tweet, we would like to extract the key words or phrases which convey the gist of its meaning. For the purpose of this demo, we will extract President Donald Trump’s tweets (~3,000 in total) from Twitter using Twitter’s API.

I will not cover the Twitter data extraction part in this post and will jump directly to the actual analysis (the data extraction code is in Python). However, I have uploaded a csv file with the extracted tweets here. You can download the file to your local machine and load it into your R console if you wish to follow along with the tutorial.

So let’s jump in now to the actual analysis!

How do we extract key phrases from the text

There are various approaches that one can try for this. We will try out one specific approach in this post – POS (Part of Speech) tagging followed by pattern-based chunking and extraction.

Rationale behind this approach

Consider the following tweet from our sample:

After being forced to apologize for its bad and inaccurate coverage of me after winning the election, the FAKE NEWS @nytimes is still lost!

A POS tag output for this text would be as follows:

After/IN being/VBG forced/VBN to/TO apologize/VB for/IN its/PRP$ bad/JJ and/CC inaccurate/JJ coverage/NN of/IN me/PRP after/IN winning/VBG the/DT election/NN ,/, the/DT FAKE/NNP NEWS/NNP @nytimes/NNP is/VBZ still/RB lost/VBN !/

A definition of each of these tags can be found here.

We can make a couple of simple observations from the POS-tagged text:

  • As is obvious, certain tags are more informative than others. For example, noun tags (starting with NN) would carry more information than prepositions or conjunctions. Similarly, if we would like to know “what” is being spoken about, noun words may be more relevant than others.
  • Secondly, chunks of words would carry more meaning than looking at individual words in isolation. For e.g. in the text above the word “coverage” alone does not adequately convey the meaning of the text. However, “inaccurate coverage” very nicely captures one of the key themes in this tweet.
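These two observations can be illustrated with a few lines of base R on the tagged example above, with no openNLP required for this toy check (the word/TAG string is parsed by hand here purely for demonstration):

```r
# A fragment of the tagged tweet from above, in word/TAG form
tagged <- "its/PRP$ bad/JJ and/CC inaccurate/JJ coverage/NN of/IN me/PRP"
toks  <- strsplit(tagged, " ")[[1]]
words <- sub("/[^/]+$", "", toks)   # drop the trailing /TAG
tags  <- sub("^.*/",    "", toks)   # keep only the TAG

# Nouns alone carry the "what" but lose context...
words[grepl("^NN", tags)]           # "coverage"

# ...while an adjective-plus-noun chunk keeps the meaning
idx <- which(tags == "JJ" & c(tags[-1], "") == "NN")
paste(words[idx], words[idx + 1])   # "inaccurate coverage"
```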
How do we implement it

This is where you may face some challenges, if you are trying to implement this in R. Python seems to be more generally preferred for tasks such as this, with many of the POS tagging & chunking examples that you will find online, based on the NLTK library in Python. While NLTK is great to work with, I tried implementing this in R as well using the openNLP package.

This is the flow that we will follow:

Text –> Sentence Annotation –> Word Annotation –> POS Tagging –> Chunking

Given below is an explanation of the approach:

Step 1: POS Tagging

The POS tagging code is pretty much based on the code examples that are given as part of openNLP’s documentation. Here is the code:

# We will try the approach for 1 tweet; finally we will convert this into a function
x <- "the text of the tweet"
x <- as.String(x)

# Before POS tagging, we need to do sentence annotation followed by word annotation
wordAnnotation <- annotate(x, list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator()))

# POS tag the words & extract the "words" from the output
POSAnnotation <- annotate(x, Maxent_POS_Tag_Annotator(), wordAnnotation)
POSwords <- subset(POSAnnotation, type == "word")

# Extract the tags from the words
tags <- sapply(POSwords$features, '[[', "POS")

# Create a data frame with words and tags
tokenizedAndTagged <- data.frame(Tokens = x[POSwords], Tags = tags)

Step 2: Chunking and Extraction

For us to chunk the POS-tagged text, we first have to define what POS pattern we would consider as a chunk. For example, an Adjective-Noun(s) combination (JJ-NN) can be a useful pattern to extract (in the example above this pattern would have given us the “inaccurate coverage” chunk). Similarly, we may wish to chunk and extract proper nouns. So, for example, in this tweet – “Hope you like my nomination of Judge Neil Gorsuch for the United States Supreme Court. He is a good and brilliant man, respected by all.” – chunking and extracting proper nouns (NNP, NNPS) would give us “Judge Neil Gorsuch” and “United States Supreme Court”.

So once we define what POS pattern we consider as a chunk, the next step is to extract the matches.
This is a bit trickier. (In Python’s NLTK, there is a very useful function that helps extract chunks from POS-tagged text using regex-based pattern search. I couldn’t find a similar function in openNLP, so I wrote a simple workaround in R.)

Given below is the code with explanatory comments:

# Define a flag (Tags_mod) for POS tags - set to 1 if the tag is one we are interested in, else 0
# In this case we only want noun and adjective tags (NN, JJ)
# Note that this will also capture variations such as NNP, NNPS etc.
tokenizedAndTagged$Tags_mod = grepl("NN|JJ", tokenizedAndTagged$Tags)

# Initialize a vector to store chunk indexes
chunk = vector()

# Iterate through each word and assign each one to a group:
# - if the word doesn't belong to NN|JJ tags (i.e. Tags_mod flag is 0), assign it to the default group (0)
# - if the ith tag is in NN|JJ (i.e. Tags_mod flag is 1), assign it to group i-1 if the (i-1)th
#   Tags_mod flag is also 1; else assign it to a new group
chunk[1] = as.numeric(tokenizedAndTagged$Tags_mod[1])
for (i in 2:nrow(tokenizedAndTagged)) {
  if (!tokenizedAndTagged$Tags_mod[i]) {
    chunk[i] = 0
  } else if (tokenizedAndTagged$Tags_mod[i] == tokenizedAndTagged$Tags_mod[i-1]) {
    chunk[i] = chunk[i-1]
  } else {
    chunk[i] = max(chunk) + 1
  }
}

Finally, extract the matching patterns:

# Split and chunk words
text_chunk <- split(as.character(tokenizedAndTagged$Tokens), chunk)
tag_pattern <- split(as.character(tokenizedAndTagged$Tags), chunk)
names(text_chunk) <- sapply(tag_pattern, function(x) paste(x, collapse = "-"))

# Extract chunks matching pattern
# We will extract JJ-NN chunks and two or more continuous NN tags
# "NN.-NN" -> the "." in this regex will match all variants of NN: NNP, NNS etc.
res = text_chunk[grepl("JJ-NN|NN.-NN", names(text_chunk))]

The overall function is available here.

Testing the Approach

Testing the approach on a few tweets gave fairly promising results.
I am providing below some sample results.

Sample Tweet 1:

“Jeff Sessions is an honest man. He did not say anything wrong. He could have stated his response more accurately, but it was clearly not….”

Chunking Output:
c(“Jeff Sessions”, “honest man”)

Sample Tweet 2:

Since November 8th, Election Day, the Stock Market has posted $3.2 trillion in GAINS and consumer confidence is at a 15 year high. Jobs!

Chunking Output:
c(“Election Day”, “Stock Market”)

Sample Tweet 3:

Russia talk is FAKE NEWS put out by the Dems, and played up by the media, in order to mask the big election defeat and the illegal leaks

Chunking Output:
c(“Russia talk”, “FAKE NEWS”, “big election defeat”, “illegal leaks”)


As can be seen, this approach seems to work fairly well.
The extracted chunks do convey some of the key themes present in the text.
We can of course try out more complex techniques and alternate approaches, and if time permits, I will try exploring and blogging about a few of them in the future.
However, the intent of this post was to give readers a quick overview of POS tagging and chunking approaches and the usefulness of these techniques for certain NLP tasks.

Hope you find this useful!

    Related Post

    1. Financial time series forecasting – an easy approach
    2. Outlier detection and treatment with R
    3. Implementing Apriori Algorithm in R
    4. R for Publication: Lesson 6, Part 2 – Linear Mixed Effects Models
    5. R for Publication: Lesson 6, Part 1 – Linear Mixed Effects Models

    To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

    Understanding the CANDECOMP/PARAFAC Tensor Decomposition, aka CP; with R code

    Sun, 04/02/2017 - 21:30

    (This article was first published on Alexej's blog, and kindly contributed to R-bloggers)

    A tensor is essentially a multi-dimensional array:

    • a tensor of order one is a vector, which simply is a column of numbers,
    • a tensor of order two is a matrix, which is basically numbers arranged in a rectangle,
    • a tensor of order three looks like numbers arranged in a rectangular box (or a cube, if all modes have the same dimension),
    • an nth order (or n-way) tensor looks like numbers arranged in… an $n$-hyperrectangle or something like that… you get the idea…
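    In R, these orders map directly onto vectors, matrices, and arrays; a quick sketch:

```r
v <- 1:3                              # order-1 tensor: a vector
M <- matrix(1:6, nrow = 2)            # order-2 tensor: a matrix
A <- array(1:24, dim = c(2, 3, 4))    # order-3 tensor: a "box" of numbers

length(dim(A))   # number of modes: 3
A[1, 2, 3]       # one entry, addressed by three indices
```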

    In many applications, data naturally form an n-way tensor with n > 2, rather than a “tidy” table.

    So, what’s a tensor decomposition?

    Well, there are several types of tensor decomposition, but in this blog post I will introduce only the CANDECOMP/PARAFAC decomposition. Following Kolda & Bader (2009) I will refer to it as CP decomposition. But before spelling it out in mathematical terms, let’s start with a simple toy example using the R language.

    Rank-1 approximation to a 3-way tensor (toy example)

    Assume that we observe spatio-temporal1 data that, apart from random noise, correspond to the following three classes:

    As you can see from the above figures, the essential difference between the three cases is that they behave differently with respect to time. The spatial component is just a Gaussian curve, while the temporal component is piecewise constant with a sudden jump at time 50, which differs in magnitude and in direction between the three classes.

    Now assume we have collected samples that correspond to the three classes above (with some added noise). But we don’t know which sample falls in what class, how many classes there are, and how they differ :confused:. In an unsupervised fashion we want to learn the different classes and their differences with respect to time and space.

    We can arrange the samples in a 3-way tensor, sample-by-space-by-time. For simplicity, however, assume that the samples are already grouped according to their class within the tensor (but the algorithm doesn’t know that!), so that the resulting tensor looks like this:

    In R the above tensor (let’s call it X) can be generated with the following lines of code:

    space_index <- seq(-1, 1, l = 100)
    bell_curve <- dnorm(space_index, mean = 0, sd = 0.5)
    case1 <- matrix(rep(bell_curve, 10), 100, 100)
    case2 <- matrix(rep(bell_curve, 10), 100, 100)
    case3 <- matrix(rep(bell_curve, 10), 100, 100)
    case2[ , 51:100] <- case2[ , 51:100] + 0.1
    case3[ , 51:100] <- case3[ , 51:100] - 0.1
    X <- array(NA, dim = c(90, 100, 100))
    for(i in 1:30) {
      X[i, , ] <- case1 + matrix(rnorm(10000, sd = 0.1), 100, 100)
      X[i+30, , ] <- case2 + matrix(rnorm(10000, sd = 0.1), 100, 100)
      X[i+60, , ] <- case3 + matrix(rnorm(10000, sd = 0.1), 100, 100)
    }

    Using the excellent R package rTensor we obtain the CP decomposition with one component per mode of the tensor:

    library(rTensor)
    cp_decomp <- cp(as.tensor(X), num_components = 1)
    str(cp_decomp$U)
    # List of 3
    #  $ : num [1:90, 1] 0.0111 0.0111 0.0111 0.0111 0.0112 ...
    #  $ : num [1:100, 1] -0.00233 -0.00251 -0.00271 -0.00292 -0.00314 ...
    #  $ : num [1:100, 1] -0.00996 -0.00994 -0.00996 -0.00993 -0.00997 ...
    # NULL

    Visualizing the three components, we get the following figures:

    We can clearly see that

    1. The sample-specific component has correctly separated the samples into three groups (in fact, 1:30 correspond to case 1, 31:60 correspond to case 2, 61:90 correspond to case 3).
    2. The spatial component is bell-shaped, just as the input data with respect to the spatial dimension.
    3. The temporal component has correctly picked up a change at time 50, which is where the three classes differ.

    So, what did just happen? :astonished:

    CP decomposition (quick summary of the math behind it)

    The CP decomposition factorizes a tensor into a sum of outer products of vectors. For example, for a 3-way tensor $X$, the CP decomposition can be written as

    X \approx \sum_{r = 1}^R u_r \circ v_r \circ w_r =: \widehat{X},

    where $R>0$ and $u_r$, $v_r$, $w_r$ are vectors of appropriate dimensions, and where the notation “$\circ$” denotes the outer product for tensors, i.e.,

    x_{ijk} \approx \hat{x}_{ijk} = \sum_{r = 1}^R u_{ri} v_{rj} w_{rk}.

    In case of the previous toy example we have that $R = 1$. Thus, the corresponding CP decomposition has the following form:

    I hope that explains, why the components u, v, and w in the toy example look the way they do! :bowtie:

    Now, how do you solve for the components $u_r$, $v_r$, $w_r$ ($r = 1, 2, \ldots, R$)? You need to solve the following optimization problem:

    \min_{\widehat{X}} \lVert X - \widehat{X} \rVert \quad\mathrm{with}\quad \widehat{X} = \sum_{r = 1}^R \lambda_r u_r \circ v_r \circ w_r,

    where $\lVert \cdot \rVert$ is the Frobenius norm.

    The simplest way to do it is via an alternating least squares approach, where we would regard certain components as fixed while solving for others, and then iterate while alternating the components regarded as fixed… For much more rigour and detail see Kolda & Bader (2009) Tensor Decompositions and Applications.
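    The alternating-least-squares idea is simple enough to sketch in base R for the rank-1 case (an illustrative toy, not a replacement for rTensor::cp): with two factor vectors held fixed, each update is a closed-form least-squares fit.

```r
set.seed(1)
# Build a noiseless rank-1 tensor of dimensions 4 x 5 x 6 from known factors
u0 <- rnorm(4); v0 <- rnorm(5); w0 <- rnorm(6)
X <- outer(outer(u0, v0), w0)

# Random starting factors
u <- rnorm(4); v <- rnorm(5); w <- rnorm(6)

for (iter in 1:50) {
  # Each line solves a least-squares problem with the other two factors fixed:
  # u_i = <X[i,,], v o w> / (|v|^2 |w|^2), and analogously for v and w
  u <- apply(X, 1, function(s) sum(s * outer(v, w))) / (sum(v^2) * sum(w^2))
  v <- apply(X, 2, function(s) sum(s * outer(u, w))) / (sum(u^2) * sum(w^2))
  w <- apply(X, 3, function(s) sum(s * outer(u, v))) / (sum(u^2) * sum(v^2))
}

Xhat <- outer(outer(u, v), w)
max(abs(X - Xhat))   # essentially zero: the rank-1 structure is recovered
```

    On noiseless rank-1 data a single sweep already recovers the tensor exactly (up to rescaling of the factors); for noisy or higher-rank problems one iterates until the fit stops improving.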

    A higher-rank approximation via CP (toy example cont.)

    Since in the previous toy example, there are no differentiating features between the three classes, apart from a jump in the temporal component, it makes perfect sense to set $R = 1$ in CP. In order to try out CP with more than one component per mode, I generated data with a more complex structure, and with further differentiation between the three groups with respect to their temporal and spatial makeup:

    1. The three classes still have a bell-shaped component in the spatial mode, but now each class has a different mean.
    2. In the temporal mode the data is shaped like a sine wave, with different scaling per class.
    3. As before, there is a sudden jump in the temporal mode at time 50.

    The three classes in the resulting dataset have the following means.

    As before, we generate a tensor X of dimensions 90 × 100 × 100, with 30 samples per class obscured with random noise.

    We use a CP decomposition in order to obtain a rank-3 approximation to that tensor:

    cp_decomp <- cp(as.tensor(X), num_components = 3, max_iter = 100)

    Here, we increase max_iter to 100, in order to ensure convergence, as can be checked with the conv attribute:

    cp_decomp$conv
    # [1] TRUE

    Since we set num_components = 3, the solution now has three components per mode, organized in a three-column matrix for each mode:

    str(cp_decomp$U)
    # List of 3
    #  $ : num [1:90, 1:3] 0.00131 0.00137 0.00141 0.0014 0.00135 ...
    #  $ : num [1:100, 1:3] 0.000926 0.001345 0.001799 0.002228 0.002755 ...
    #  $ : num [1:100, 1:3] 0.00969 0.0097 0.00974 0.0097 0.00971 ...
    # NULL

    And we can even check the percentage of the Frobenius norm of $X$ explained by the rank-3 approximation $\widehat{X}$:

    cp_decomp$norm_percent
    # [1] 83.13865

    83% isn’t too bad!

    :grinning: Let’s look at a visualization of the obtained components!

    Indeed, we observe that,

    • the sample-specific components $u_r$ clearly distinguish between the three groups of samples (1:30, 31:60, and 61:90),
    • the spatial components $v_r$ clearly pick up the Gaussian shapes with the three different means (at -0.5, 0, and 0.5),
    • the temporal components $w_r$ clearly show the sudden jump at time 50, as well as a sine wave.

    That’s it for now. In the next couple of weeks I am planning to write a couple blog posts on other types of tensor decompositions and tensor regression methods, as I am learning about them.

    1. The two data modes can correspond to many types of measurements, other than space and time. Here, I use space and time for example purposes only because those are very familiar concepts. I am neither suggesting that specifically spatio-temporal data should be analyzed in this way, nor that tensor decomposition is generally a good approach for spatio-temporal data (I actually have no idea). 

    To leave a comment for the author, please follow the link and comment on their blog: Alexej's blog.

    Data science for Doctors: Inferential Statistics Exercises(Part-4)

    Sun, 04/02/2017 - 18:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people that are around medical field to enhance their data science skills.

    We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

    This is the seventh part of the series and it aims to partially cover the subject of inferential statistics.
    Researchers rarely have the capability of testing many patients, or of trying a new treatment on many patients, therefore making inferences from a sample is a necessary skill to have. This is where inferential statistics comes into play.
    In more detail, in this part we will go through hypothesis testing for the F-distribution (F-test) and the Chi-squared distribution (Chi-squared test). If you are not familiar with the mentioned distributions, please go here to acquire the necessary background. An assumption of the t-test (we covered it last time here) is that the two population variances are equal. Such an assumption can serve as the null hypothesis for an F-test. Moreover, sometimes we want to test a hypothesis with respect to more than one probability; this is where the Chi-squared test comes into play.

    Before proceeding, it might be helpful to look over the help pages for sd, var, var.test, and chisq.test.
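    As a warm-up (this is not a solution to the exercises below), here is what the two tests look like on synthetic data; the counts in the chi-squared example are made up for illustration:

```r
set.seed(123)
a <- rnorm(28, mean = 29, sd = 3)
b <- rnorm(23, mean = 29, sd = 6)

# F-test: the null hypothesis is that the two population variances are equal
ft <- var.test(a, b)
ft$estimate   # ratio of the two sample variances
ft$p.value

# Chi-squared goodness-of-fit: do observed counts match hypothesized proportions?
observed <- c(positive = 12, negative = 88)
ct <- chisq.test(observed, p = c(0.10, 0.90))
ct$p.value
```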

    Please run the code below in order to load the data set and transform it into a proper data frame format:

    url <- ""
    data <- read.table(url, fileEncoding="UTF-8", sep=",")
    names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
    colnames(data) <- names
    data <- data[-which(data$mass ==0),]

    Moreover, run the chunk below in order to generate the samples that we will test in this set of exercises.
    f_1 <- rnorm(28,29,3)
    f_2 <- rnorm(23,29,6)

    Answers to the exercises are available here.

    If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

    Exercise 1

    Compute the F-statistic. (test statistic for F-test)

    Exercise 2

    Compute the degrees of freedom for the numerator and denominator.

    Exercise 3

    Apply a two-sided F-test for the two samples

    Exercise 4

    Apply a one-sided F-test for the two samples with the alternative hypothesis to be that the standard deviation of the first sample is smaller than the second.

    Exercise 5

    Retrieve the p-value and the ratio of variances for both tests.

    Exercise 6

    Find the number of patients who show signs of diabetes and those who don’t.

    Exercise 7

    Assume that the hypothesis we made is that 10% of people show signs of diabetes. Is that a valid claim to make? Test it using the chi-squared test.

    Exercise 8

    Suppose that the mass index affects whether the patients show signs of diabetes and we assume that the people who weight more than the average are more likely to have diabetes signs. Make a matrix that contains the true-positives, false-positives, true-negatives, and false-negatives of our hypothesis.
    hint: True-positive: data$class==1 & data$mass >= mean(data$mass)

    Exercise 9

    Test the hypothesis we made at exercise 8 using chi-squared test.

    Exercise 10

    The hypothesis we made at exercise 8 cannot be validated; however, we have noticed that the dataset contains outliers which affect the average. Therefore we make another assumption: that patients who are heavier than the 25% lightest patients are more likely to have signs of diabetes. Test that hypothesis.
    hint: it is similar to the process we did at exercises 8 and 9 but with different criteria.

    Related exercise sets:
    1. Data Science for Doctors – Part 1 : Data Display
    2. Data science for Doctors: Inferential Statistics Exercises (part-2)
    3. Data science for Doctors: Inferential Statistics Exercises (part-3)
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory

    To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

    Using R: Don’t save your workspace

    Sun, 04/02/2017 - 14:17

    (This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers)

    To everyone learning R: Don’t save your workspace.

    When you exit an R session, you’re faced with the question of whether or not to save your workspace. You should almost never answer yes. Saving your workspace creates an image of your current variables and functions, and saves them to a file called ”.RData”. When you re-open R from that working directory, the workspace will be loaded, and all these things will be available to you again. But you don’t want that, so don’t save your workspace.

    Loading a saved workspace turns your R script from a program, where everything happens logically according to the plan that is the code, to something akin to a cardboard box taken down from the attic, full of assorted pages and notebooks that may or may not be what they seem to be. You end up having to put an inordinate trust in your old self. I don’t know about your old selves, dear reader, but if they are anything like mine, don’t save your workspace.

    What should one do instead? One should source the script often, ideally from freshly minted R sessions, to make sure to always be working with a script that runs and does what it’s supposed to. Storing a data frame in the workspace can seem comforting, but what happens the day I overwrite it by mistake? Don’t save your workspace.

    Yes, I’m exaggerating. When using any modern computer system, we rely on saved information and saved state all the time. And yes, every time a computation takes too much time to reproduce, one should write the result to a file and load it as needed. But that should be a deliberate choice, worthy of its own save() and load() calls, and certainly not something one does with simple objects that can be reproduced in the blink of an eye. Put more trust in your script than in your memory, and don’t save your workspace.
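    That deliberate choice can look like the sketch below, using saveRDS()/readRDS(), the single-object cousins of save()/load(); the "expensive" computation and the temporary path are stand-ins for illustration:

```r
# A deliberate, named save of one expensive result - the alternative to .RData
expensive_result <- sum(sqrt(1:1e6))    # stand-in for a genuinely slow computation

path <- tempfile(fileext = ".rds")      # in a real script: a fixed project path
saveRDS(expensive_result, path)

# Later (or in a fresh session): an explicit, visible load
expensive_result <- readRDS(path)
unlink(path)
```

    Because the save and load are explicit lines in the script, a reader can see exactly what state is carried over, which is precisely what a silently restored workspace hides.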

    Postat i:computer stuff, data analysis Tagged: R, RData, workspace

    To leave a comment for the author, please follow the link and comment on their blog: R – On unicorns and genes.

    R Function Call with Ellipsis Trap/Pitfall

    Sun, 04/02/2017 - 10:00

    (This article was first published on, and kindly contributed to R-bloggers)

    Objective of this post: alerting all users to double-check the case and spelling of all function parameters.

    I am a newbie in R and was trying the RSNNS mlp function, and wasted a lot of time due to some typos.

    RSNNS mlp function silently ignores misspelled keywords


    I intentionally misspelled size as hize and maxit as mazit.
    There were no warnings or errors.

    I think that many packages may have the same problem, since package writers do not always validate ellipsis arguments. I made a small spelling mistake and got puzzling results because there was no parameter validation; I expected such eminent packages to be robust and help users recover from typos.

    Let us see what happens with no ellipsis

    > square <- function(num) {
    +   return(num * num)
    + }
    >
    > square(num = 4)
    [1] 16
    > square(numm = 4)
    Error in square(numm = 4) : unused argument (numm = 4)

    # With ellipsis added
    > square <- function(num, ...) {
    +   print(names(list(...)))
    +   return(num * num)
    + }
    >
    > square(num = 3, bla = 4, kla = 9)
    [1] "bla" "kla"
    [1] 9

    As you can see, names(list(...)) gives access to the parameter names.

    Ellipsis function calls exist for flexibility, but package writers should take extra care to throw an “unused argument” error when function parameters are misspelled.
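    One defensive pattern (a sketch of my own, not code taken from RSNNS) is to compare the names arriving through the ellipsis against a whitelist and stop on anything unknown:

```r
# Reject misspelled arguments that arrive via '...'.
strict_args <- function(..., allowed = character(0)) {
  unknown <- setdiff(names(list(...)), allowed)
  if (length(unknown) > 0) {
    stop("unused argument(s): ", paste(unknown, collapse = ", "))
  }
  invisible(TRUE)
}

square <- function(num, ...) {
  strict_args(...)   # this function accepts no extra arguments
  num * num
}

square(num = 4)      # works: 16
# square(numm = 4)   # error: unused argument(s): numm
```

    A package function that legitimately forwards some arguments could list them in `allowed` and still catch everything else.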

    This, to my mind, is a major weakness of the R ecosystem. Most parameters have defaults, and a small case or spelling mistake can lead to really wrong conclusions!

    RSNNS is simply fantastic and is given here only as an example of the ellipsis function call trap. I hope other newbies benefit and learn to avoid the trap of wrong arguments.

    Jayanta Narayan Choudhuri
    Kolkata India

    To leave a comment for the author, please follow the link and comment on their blog.

    Superclassing to R⁶

    Sun, 04/02/2017 - 03:31

    (This article was first published on R –, and kindly contributed to R-bloggers)

    To avoid “branding” confusion with R⁴ I’m superclassing it to R⁶ and encouraging others in the R community to don the moniker and do their own small, focused posts on topics that would help the R community learn things. Feel free to use R⁶ (I’ll figure out an acronym later). Feel free to tag your posts as R⁶ (or r6) and use the moniker as you see fit.

    I’ll eventually tag the 2 current “r4” posts as “r6”.

    Hopefully we can link together a cadre of R⁶ posts into a semi-organized structure that all can benefit from.

    To leave a comment for the author, please follow the link and comment on their blog.

    Dealing with unbalanced data in machine learning

    Sun, 04/02/2017 - 02:00

    (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

    In my last post, where I shared the code that I used to produce an example analysis to go along with my webinar on building meaningful models for disease prediction, I mentioned that it is advised to consider over- or under-sampling when you have unbalanced data sets. Because my focus in this webinar was on evaluating model performance, I did not want to add an additional layer of complexity and therefore did not further discuss how to specifically deal with unbalanced data.

    But because I had gotten a few questions regarding this, I thought it would be worthwhile to explain over- and under-sampling techniques in more detail and show how you can very easily implement them with caret.


    Unbalanced data

    In this context, unbalanced data refers to classification problems where we have unequal instances for different classes. Having unbalanced data is actually very common in general, but it is especially prevalent when working with disease data, where we usually have more healthy control samples than disease cases. Even more extreme unbalance is seen with fraud detection, where e.g. most credit card uses are okay and only very few will be fraudulent. In the example I used for my webinar, a breast cancer dataset, we had about twice as many benign as malignant samples.

    summary(bc_data$classes)
    ##    benign malignant
    ##       458       241

    Why is unbalanced data a problem in machine learning?

    Most machine learning classification algorithms are sensitive to unbalance in the predictor classes. Let’s consider an even more extreme example than our breast cancer dataset: assume we had 10 malignant vs 90 benign samples. A machine learning model that has been trained and tested on such a dataset could now predict “benign” for all samples and still gain a very high accuracy. An unbalanced dataset will bias the prediction model towards the more common class!
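    The 10/90 scenario above can be checked in a couple of lines (an illustration of the point, not code from the original analysis):

```r
# A classifier that always predicts the majority class still scores 90%
# accuracy on a 90/10 split, while detecting zero malignant cases.
actual    <- factor(c(rep("benign", 90), rep("malignant", 10)))
predicted <- factor(rep("benign", 100), levels = levels(actual))

mean(predicted == actual)  # 0.9
```

    Accuracy alone is clearly the wrong yardstick here, which is why the comparison below also looks at sensitivity, specificity, precision, and F1.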

    How to balance data for modeling

    The basic theoretical concepts behind over- and under-sampling are very simple:

    • With under-sampling, we randomly select a subset of samples from the class with more instances to match the number of samples coming from each class. In our example, we would randomly pick 241 out of the 458 benign cases. The main disadvantage of under-sampling is that we lose potentially relevant information from the left-out samples.

    • With over-sampling, we randomly duplicate samples from the class with fewer instances, or we generate additional instances based on the data that we have, so as to match the number of samples in each class. While we avoid losing information with this approach, we also run the risk of overfitting our model, as we are more likely to get the same samples in the training and in the test data, i.e. the test data is no longer independent from the training data. This would lead to an overestimation of our model’s performance and generalizability.

    In reality though, we should not simply perform over- or under-sampling on our training data and then run the model. We need to account for cross-validation and perform over- or under-sampling on each fold independently to get an honest estimate of model performance!
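    To get a quick feel for what naive sampling does to the class counts, caret also ships the helpers downSample() and upSample(); here is a small sketch on mock data (the feature column is made up purely for illustration):

```r
library(caret)

set.seed(42)
y <- factor(c(rep("benign", 458), rep("malignant", 241)))
x <- data.frame(feature = rnorm(length(y)))  # mock predictor

down <- downSample(x, y)  # keeps 241 samples of each class
up   <- upSample(x, y)    # duplicates up to 458 samples of each class
table(down$Class)
table(up$Class)
```

    For honest performance estimates, though, this resampling belongs inside each cross-validation fold, which is exactly what the trainControl sampling option below takes care of.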

    Modeling the original unbalanced data

    Here is the same model I used in my webinar example: I randomly divide the data into training and test sets (stratified by class) and perform Random Forest modeling with 10 x 10 repeated cross-validation. Final model performance is then measured on the test set.

    set.seed(42)
    index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
    train_data <- bc_data[index, ]
    test_data <- bc_data[-index, ]

    set.seed(42)
    model_rf <- caret::train(classes ~ .,
                             data = train_data,
                             method = "rf",
                             preProcess = c("scale", "center"),
                             trControl = trainControl(method = "repeatedcv",
                                                      number = 10,
                                                      repeats = 10,
                                                      verboseIter = FALSE))

    final <- data.frame(actual = test_data$classes,
                        predict(model_rf, newdata = test_data, type = "prob"))
    final$predict <- ifelse(final$benign > 0.5, "benign", "malignant")
    cm_original <- confusionMatrix(final$predict, test_data$classes)


    Luckily, caret makes it very easy to incorporate over- and under-sampling techniques with cross-validation resampling. We can simply add the sampling option to our trainControl and choose down for under- (also called down-) sampling. The rest stays the same as with our original model.

    ctrl <- trainControl(method = "repeatedcv",
                         number = 10,
                         repeats = 10,
                         verboseIter = FALSE,
                         sampling = "down")

    set.seed(42)
    model_rf_under <- caret::train(classes ~ .,
                                   data = train_data,
                                   method = "rf",
                                   preProcess = c("scale", "center"),
                                   trControl = ctrl)

    final_under <- data.frame(actual = test_data$classes,
                              predict(model_rf_under, newdata = test_data, type = "prob"))
    final_under$predict <- ifelse(final_under$benign > 0.5, "benign", "malignant")
    cm_under <- confusionMatrix(final_under$predict, test_data$classes)


    For over- (also called up-) sampling we simply specify sampling = "up".

    ctrl <- trainControl(method = "repeatedcv",
                         number = 10,
                         repeats = 10,
                         verboseIter = FALSE,
                         sampling = "up")

    set.seed(42)
    model_rf_over <- caret::train(classes ~ .,
                                  data = train_data,
                                  method = "rf",
                                  preProcess = c("scale", "center"),
                                  trControl = ctrl)

    final_over <- data.frame(actual = test_data$classes,
                             predict(model_rf_over, newdata = test_data, type = "prob"))
    final_over$predict <- ifelse(final_over$benign > 0.5, "benign", "malignant")
    cm_over <- confusionMatrix(final_over$predict, test_data$classes)


    Besides over- and under-sampling, there are hybrid methods that combine under-sampling with the generation of additional data. Two of the most popular are ROSE and SMOTE.

    From Nicola Lunardon, Giovanna Menardi and Nicola Torelli’s “ROSE: A Package for Binary Imbalanced Learning” (R Journal, 2014, Vol. 6 Issue 1, p. 79): “The ROSE package provides functions to deal with binary classification problems in the presence of imbalanced classes. Artificial balanced samples are generated according to a smoothed bootstrap approach and allow for aiding both the phases of estimation and accuracy evaluation of a binary classifier in the presence of a rare class. Functions that implement more traditional remedies for the class imbalance and different metrics to evaluate accuracy are also provided. These are estimated by holdout, bootstrap, or cross-validation methods.”

    You implement them the same way as before, this time choosing sampling = "rose"…

    ctrl <- trainControl(method = "repeatedcv",
                         number = 10,
                         repeats = 10,
                         verboseIter = FALSE,
                         sampling = "rose")

    set.seed(42)
    model_rf_rose <- caret::train(classes ~ .,
                                  data = train_data,
                                  method = "rf",
                                  preProcess = c("scale", "center"),
                                  trControl = ctrl)

    final_rose <- data.frame(actual = test_data$classes,
                             predict(model_rf_rose, newdata = test_data, type = "prob"))
    final_rose$predict <- ifelse(final_rose$benign > 0.5, "benign", "malignant")
    cm_rose <- confusionMatrix(final_rose$predict, test_data$classes)


    … or by choosing sampling = "smote" in the trainControl settings.

    From Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall and W. Philip Kegelmeyer’s “SMOTE: Synthetic Minority Over-sampling Technique” (Journal of Artificial Intelligence Research, 2002, Vol. 16, pp. 321–357): “This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples.”

    ctrl <- trainControl(method = "repeatedcv",
                         number = 10,
                         repeats = 10,
                         verboseIter = FALSE,
                         sampling = "smote")

    set.seed(42)
    model_rf_smote <- caret::train(classes ~ .,
                                   data = train_data,
                                   method = "rf",
                                   preProcess = c("scale", "center"),
                                   trControl = ctrl)

    final_smote <- data.frame(actual = test_data$classes,
                              predict(model_rf_smote, newdata = test_data, type = "prob"))
    final_smote$predict <- ifelse(final_smote$benign > 0.5, "benign", "malignant")
    cm_smote <- confusionMatrix(final_smote$predict, test_data$classes)


    Now let’s compare the predictions of all these models:

    models <- list(original = model_rf,
                   under = model_rf_under,
                   over = model_rf_over,
                   smote = model_rf_smote,
                   rose = model_rf_rose)

    resampling <- resamples(models)
    bwplot(resampling)

    library(dplyr)

    comparison <- data.frame(model = names(models),
                             Sensitivity = rep(NA, length(models)),
                             Specificity = rep(NA, length(models)),
                             Precision = rep(NA, length(models)),
                             Recall = rep(NA, length(models)),
                             F1 = rep(NA, length(models)))

    for (name in names(models)) {
      model <- get(paste0("cm_", name))
      comparison[comparison$model == name, ] <- filter(comparison, model == name) %>%
        mutate(Sensitivity = model$byClass["Sensitivity"],
               Specificity = model$byClass["Specificity"],
               Precision = model$byClass["Precision"],
               Recall = model$byClass["Recall"],
               F1 = model$byClass["F1"])
    }

    library(tidyr)

    comparison %>%
      gather(x, y, Sensitivity:F1) %>%
      ggplot(aes(x = x, y = y, color = model)) +
      geom_jitter(width = 0.2, alpha = 0.5, size = 3)

    With this small dataset, we can already see how the different techniques can influence model performance. Sensitivity (or recall) describes the proportion of benign cases that have been predicted correctly, while specificity describes the proportion of malignant cases that have been predicted correctly. Precision describes the true positives, i.e. the proportion of benign predictions that actually came from benign samples. F1 is the weighted average of precision and sensitivity/recall.
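    For reference, these metrics follow directly from the confusion-matrix counts; here is a hand computation with "benign" as the positive class (the counts are illustrative, not taken from the models above):

```r
# Hand-computed metrics for a 2x2 confusion matrix, with "benign" as the
# positive class, matching the definitions in the text above.
TP <- 130; FN <- 7; FP <- 8; TN <- 65   # illustrative counts only

sensitivity <- TP / (TP + FN)   # recall: benign cases predicted correctly
specificity <- TN / (TN + FP)   # malignant cases predicted correctly
precision   <- TP / (TP + FP)   # benign predictions that were truly benign
F1 <- 2 * precision * sensitivity / (precision + sensitivity)
```

    These are the same quantities that caret stores in cm$byClass, which the comparison loop above reads out.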

    Here, all four methods improved specificity and precision compared to the original model.
    Under-sampling, over-sampling and ROSE additionally improved precision and the F1 score.

    This post shows a simple example of how to correct for unbalance in datasets for machine learning. For more advanced instructions and potential caveats with these techniques, check out the excellent caret documentation.

    If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.

    sessionInfo()

    ## R version 3.3.3 (2017-03-06)
    ## Platform: x86_64-apple-darwin13.4.0 (64-bit)
    ## Running under: macOS Sierra 10.12.3
    ##
    ## locale:
    ## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
    ##
    ## attached base packages:
    ## [1] stats     graphics  grDevices utils     datasets  methods   base
    ##
    ## other attached packages:
    ## [1] tidyr_0.6.1         dplyr_0.5.0         randomForest_4.6-12
    ## [4] caret_6.0-73        ggplot2_2.2.1       lattice_0.20-34
    ##
    ## loaded via a namespace (and not attached):
    ##  [1] Rcpp_0.12.9        nloptr_1.0.4       plyr_1.8.4
    ##  [4] class_7.3-14       iterators_1.0.8    tools_3.3.3
    ##  [7] digest_0.6.12      lme4_1.1-12        evaluate_0.10
    ## [10] tibble_1.2         gtable_0.2.0       nlme_3.1-131
    ## [13] mgcv_1.8-17        Matrix_1.2-8       foreach_1.4.3
    ## [16] DBI_0.5-1          yaml_2.1.14        parallel_3.3.3
    ## [19] SparseM_1.74       e1071_1.6-8        stringr_1.2.0
    ## [22] knitr_1.15.1       MatrixModels_0.4-1 stats4_3.3.3
    ## [25] rprojroot_1.2      grid_3.3.3         nnet_7.3-12
    ## [28] R6_2.2.0           rmarkdown_1.3      minqa_1.2.4
    ## [31] reshape2_1.4.2     car_2.1-4          magrittr_1.5
    ## [34] backports_1.0.5    scales_0.4.1       codetools_0.2-15
    ## [37] ModelMetrics_1.1.0 htmltools_0.3.5    MASS_7.3-45
    ## [40] splines_3.3.3      assertthat_0.1     pbkrtest_0.4-6
    ## [43] colorspace_1.3-2   labeling_0.3       quantreg_5.29
    ## [46] stringi_1.1.2      lazyeval_0.2.0     munsell_0.4.3

    To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

    Building Shiny App Exercises (part-9)

    Sat, 04/01/2017 - 18:00

    Shiny Dashboard Overview

    In this part we will “dig deeper” to discover the amazing capabilities that a Shiny Dashboard provides.
    Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let’s begin!

    Answers to the exercises are available here.

    The dashboardPage function expects three components: a header, sidebar, and body:

    For more complicated apps, splitting the app into pieces can make it more readable:

    header <- dashboardHeader()

    sidebar <- dashboardSidebar()

    body <- dashboardBody()

    dashboardPage(header, sidebar, body)

    Now we’ll look at each of the three main components of a shinydashboard.


    A header can have a title and dropdown menus. The dropdown menus are generated by the dropdownMenu function. There are three types of menus – messages, notifications, and tasks – and each one must be populated with a corresponding type of item.

    Message menus

    A messageItem contained in a message menu needs values for from and message. You can also control the icon and a notification time string. By default, the icon is a silhouette of a person. The time string can be any text. For example, it could be a relative date/time like “5 minutes”, “today”, or “12:30pm yesterday”, or an absolute time, like “2014-12-01 13:45”.
    dropdownMenu(type = "messages",
      messageItem(
        from = "Sales Dept",
        message = "Sales are steady this month."
      ),
      messageItem(
        from = "New User",
        message = "How do I register?",
        icon = icon("question"),
        time = "13:45"
      ),
      messageItem(
        from = "Support",
        message = "The new server is ready.",
        icon = icon("life-ring"),
        time = "2014-12-01"
      )
    )

    Exercise 1

    Create a dropdownMenu in your dashboardHeader as the example above. Put date, time and generally text of your choice.

    Dynamic content

    In most cases, you’ll want to make the content dynamic. That means that the HTML content is generated on the server side and sent to the client for rendering. In the UI code, you’d use dropdownMenuOutput like this:

    dropdownMenuOutput("messageMenu")

    Exercise 2

    Replace dropdownMenu with dropdownMenuOutput and the three messageItem with messageMenu.

    The next step is to create some messages for this example. The code below does this work for us.

    # Example message data in a data frame
    messageData <- data.frame(
      from = c("Admininstrator", "New User", "Support"),
      message = c(
        "Sales are steady this month.",
        "How do I register?",
        "The new server is ready."
      ),
      stringsAsFactors = FALSE
    )

    Exercise 3

    Put messageData inside your server.r but outside of the shinyServer function.

    And on the server side, you’d generate the entire menu in a renderMenu, like this:
    output$messageMenu <- renderMenu({
      # Code to generate each of the messageItems here, in a list. messageData
      # is a data frame with two columns, 'from' and 'message'.
      # Also add on slider value to the message content, so that messages update.
      msgs <- apply(messageData, 1, function(row) {
        messageItem(
          from = row[["from"]],
          message = paste(row[["message"]], input$slider)
        )
      })

      dropdownMenu(type = "messages", .list = msgs)
    })

    Exercise 4

    Put the code above (output$messageMenu) in the shinyServer function of server.R.

    Hopefully you have understood by now the logic behind the dynamic content of your menu. Now let’s return to the static one in order to describe it a little bit more. So make the proper changes to your code in order to return exactly to the point we were at after Exercise 1.

    Notification menus

    A notificationItem contained in a notification menu contains a text notification. You can also control the icon and the status color. The code below gives an example.

    dropdownMenu(type = "notifications",
      notificationItem(
        text = "20 new users today"
      ),
      notificationItem(
        text = "14 items delivered",
        status = "success"
      ),
      notificationItem(
        text = "Server load at 84%",
        icon = icon("exclamation-triangle"),
        status = "warning"
      )
    )
    Exercise 5

    Create a dropdownMenu for your notifications like the example. Use text of your choice. Be careful of the type and the notificationItem.

    Task menus

    Task items have a progress bar and a text label. You can also specify the color of the bar. Valid colors are listed in ?validColors. Take a look at the example below.
    dropdownMenu(type = "tasks", badgeStatus = "success",
      taskItem(value = 90, color = "green",
        "Documentation"
      ),
      taskItem(value = 17, color = "aqua",
        "Project X"
      ),
      taskItem(value = 75, color = "yellow",
        "Server deployment"
      ),
      taskItem(value = 80, color = "red",
        "Overall project"
      )
    )

    Exercise 6

    Create a dropdownMenu for your tasks like the example above. Use text of your choice and create as many taskItems as you want. Be careful of the type and the taskItem.

    Disabling the header

    If you don’t want to show a header bar, you can disable it with:

    dashboardHeader(disable = TRUE)

    Exercise 7

    Disable the header.

    Now enable it again.


    The body of a dashboard page can contain any regular Shiny content. However, if you’re creating a dashboard you’ll likely want to make something that’s more structured. The basic building block of most dashboards is a box. Boxes in turn can contain any content.


    Boxes are the main building blocks of dashboard pages. A basic box can be created with the box function, and the contents of the box can be (most) any Shiny UI content. We have already created some boxes in part 8, so let’s enhance their appearance a little bit.
    Boxes can have titles and header bar colors with the title and status options. Look at the examples below.

    box(title = "Histogram", status = "primary", solidHeader = TRUE,
        plotOutput("plot2", height = 250)),

    box(
      title = "Inputs", status = "warning",
      "Box content here", br(), "More box content",
      sliderInput("slider", "Slider input:", 1, 100, 50),
      textInput("text", "Text input:")
    )

    Exercise 8

    Give a title of your choice to all the boxes you have created in your dashboard, except for the three widgets’ box.

    Exercise 9

    Change the status of the first three boxes to “primary” and the last three to “warning”.

    Exercise 10

    Transform the headers of your first three boxes into solid headers.

    Related exercise sets:
    1. Building Shiny App exercises part 1
    2. Building Shiny App exercises part 3
    3. Building Shiny App exercises part 2
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory

    Tutorial: Using R for Scalable Data Analytics

    Fri, 03/31/2017 - 21:45

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    At the recent Strata conference in San Jose, several members of the Microsoft Data Science team presented the tutorial Using R for Scalable Data Analytics: Single Machines to Spark Clusters. The materials are all available online, including the presentation slides and hands-on R scripts. You can follow along with the materials at home, using the Data Science Virtual Machine for Linux, which provides all the necessary components like Spark and Microsoft R Server. (If you don't already have an Azure account, you can get $200 credit with the Azure free trial.)

    The tutorial covers many different techniques for training predictive models at scale, and deploying the trained models as predictive engines within production environments. Among the technologies you'll use are Microsoft R Server running on Spark, the SparkR package, the sparklyr package and H2O (via the rsparkling package). It also touches on some non-Spark methods, like the bigmemory and ff packages for R (and various other packages that make use of them), and using the foreach package for coarse-grained parallel computations. You'll also learn how to create prediction engines from these trained models using the mrsdeploy package.

    The tutorial also includes scripts for comparing the performance of these various techniques, both for training the predictive model:

    and for generating predictions from the trained model:

    (The above tests used 4 worker nodes and 1 edge node, all with 16 cores and 112 GB of RAM.)

    You can find the tutorial details, including slides and scripts, at the link below.

    Strata + Hadoop World 2017, San Jose: Using R for scalable data analytics: From single machines to Hadoop Spark clusters



    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    ggedit 0.2.0 is now on CRAN

    Fri, 03/31/2017 - 21:17

    (This article was first published on, and kindly contributed to R-bloggers)

    Jonathan Sidi, Metrum Research Group

    We are pleased to announce the release of the ggedit package on CRAN.

    To install the package you can call the standard R command:

    install.packages('ggedit')
    The source version is still tracked on github, which has been reorganized to be easier to navigate.

    To install the dev version:

    devtools::install_github('metrumresearchgroup/ggedit')

    What is ggedit?

    ggedit is an R package that is used to facilitate ggplot formatting. With ggedit, R users of all experience levels can easily move from creating ggplots to refining aesthetic details, all while maintaining portability for further reproducible research and collaboration.
    ggedit is run from an R console or as a reactive object in any Shiny application. The user inputs a ggplot object or a list of objects. The application populates Bootstrap modals with all of the elements found in each layer, scale, and theme of the ggplot objects. The user can then edit these elements and interact with the plot as changes occur. During editing, a comparison of the script is logged, which can be directly copied and shared. The application output is a nested list containing the edited layers, scales, and themes in both object and script form, so you can apply the edited objects independent of the original plot using regular ggplot2 grammar.
    Why does it matter? ggedit promotes efficient collaboration. You can share your plots with team members to make formatting changes, and they can then send any objects they’ve edited back to you for implementation. No more email chains to change a circle to a triangle!

    Updates in ggedit 0.2.0:
    • The layer modal (popups) elements have been reorganized for less clutter and easier navigation.
    • The S3 method written to plot and compare themes has been removed from the package, but can still be found on the repo, see plot.theme.
    • call from the console: ggedit(p)
    • call from the addin toolbar: highlight script of a plot object on the source editor window of RStudio and run from toolbar.
    • call as part of Shiny: use the Shiny module syntax to call the ggEdit UI elements.
      • server: callModule(ggEdit,'pUI',obj=reactive(p))
      • ui: ggEditUI('pUI')
    • if you have installed the package you can see an example of a Shiny app by executing runApp(system.file('examples/shinyModule.R',package = 'ggedit'))
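    Putting the module calls above together, a minimal Shiny app might look like the following sketch (it assumes the ggEditUI/ggEdit names documented above; p is any ggplot object):

```r
library(shiny)
library(ggplot2)
library(ggedit)

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

ui <- fluidPage(
  ggEditUI("pUI")                      # module UI, id must match the server call
)

server <- function(input, output, session) {
  callModule(ggEdit, "pUI", obj = reactive(p))  # module server
}

# shinyApp(ui, server)  # uncomment to run interactively
```

    The edited layers, scales, and themes come back as the reactive module output, in the list structure described below.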

    ggedit returns a list containing 8 elements, either to the global environment or as a reactive output in Shiny.

    • updatedPlots
      • List containing updated ggplot objects
    • updatedLayers
      • For each plot a list of updated layers (ggproto) objects
      • Portable object
    • updatedLayersElements
      • For each plot a list elements and their values in each layer
      • Can be used to update the new values in the original code
    • updatedLayerCalls
      • For each plot a list of scripts that can be run directly from the console to create a layer
    • updatedThemes
      • For each plot a list of updated theme objects
      • Portable object
      • If the user doesn’t edit the theme updatedThemes will not be returned
    • updatedThemeCalls
      • For each plot a list of scripts that can be run directly from the console to create a theme
    • updatedScales
      • For each plot a list of updated scales (ggproto) objects
      • Portable object
    • updatedScaleCalls
      • For each plot a list of scripts that can be run directly from the console to create a scale
      Short Clip to use ggedit in Shiny

    Jonathan Sidi joined Metrum Research Group in 2016 after working for several years on problems in applied statistics, financial stress testing and economic forecasting in both industrial and academic settings. To learn more about additional open-source software packages developed by Metrum Research Group please visit the Metrum website. Contact: For questions and comments, feel free to email me at: or open an issue for bug fixes or enhancements at github.

    To leave a comment for the author, please follow the link and comment on their blog.

    The 5 Most Effective Ways to Learn R

    Fri, 03/31/2017 - 19:53

    (This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers)

    Whether you’re plotting a simple time series or building a predictive model for the next election, the R programming language’s flexibility will ensure you have all the capabilities you need to get the job done. In this blog we will take a look at five effective tactics for learning this essential data science language, as well as some of the top resources associated with each. These tactics should be used to complement one another on your path to mastering the world’s most powerful statistical language!

    1. Watch Instructive Videos

    We often flock to YouTube when we want to learn how to play a song on the piano, change a tire, or chop an onion, but why should it be any different when learning how to perform calculations using the most popular statistical programming language? LearnR, Google Developers, and MarinStatsLectures are all fantastic YouTube channels with playlists specifically dedicated to the R language.

    2. Read Blogs

    There’s a good chance you came across this article through the R-bloggers website, which curates content from some of the best blogs about R that can be found on the web today. Since there are 750+ blogs that are curated on R-bloggers alone, you shouldn’t have a problem finding an article on the exact topic or use case you’re interested in!

    A few notable R blogs:

    3. Take an Online Course

    As we’ve mentioned in previous blogs, there are a great number of online classes you can take to learn specific technical skills. In many instances, these courses are free or very affordable, with some offering discounts to college students. Why spend thousands of dollars on a university course when you can get as good an understanding, if not better (IMHO), online?

    Some sites that offer great R courses include:

    4. Read Books

    Many times, books are given a bad rap since most programming concepts can be found online, for free. Sure, if you are going to use the book just as a reference, you’d probably be better off saving that money and taking to Google search. However, if you’re a beginner, or someone who wants to learn the fundamentals, working through an entire book at the foundational level will provide a high degree of understanding.

    There is a fantastic list of the best books for R at Data Science Central.

    5. Experiment!

    You can read articles and watch videos all day long, but if you never try it for yourself, you’ll never learn! Datazar is a great place for you to jump right in and experiment with what you’ve learned. You can immediately start by opening the R console or creating a notebook in our cloud-based environment. If you get stuck, you can consult with other users and even work with scripts that have been opened up by others!

    I hope you found this helpful and as always if you would like to share any additional resources, feel free to drop them in the comments below!

    Resources Included in this Article

    The 5 Most Effective Ways to Learn R was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

    To leave a comment for the author, please follow the link and comment on their blog: R Language in Datazar Blog on Medium. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: data science, big data, R jobs, visualization (ggplot2, boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, Hadoop, web scraping), statistics (regression, PCA, time series, trading) and more...

    Easy leave-one-out cross validation with pipelearner

    Fri, 03/31/2017 - 14:10

    (This article was first published on blogR, and kindly contributed to R-bloggers)

    @drsimonj here to show you how to do leave-one-out cross validation using pipelearner.

     Leave-one-out cross validation

    Leave-one-out is a type of cross validation whereby the following is done for each observation in the data:

    • Fit the model on all other observations
    • Use the model to predict the value of the held-out observation

    This means that a model is fitted, and a prediction is made, n times, where n is the number of observations in your data.
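    To make the procedure concrete, here is a minimal base-R sketch of that fit-all-but-one / predict-one loop, using the built-in mtcars data and the same hp ~ . linear model fitted later in this post (the loocv_preds name is just illustrative):

    ```r
    # Minimal base-R sketch of leave-one-out cross validation:
    # for each row i, fit on every other row, then predict row i.
    loocv_preds <- vapply(seq_len(nrow(mtcars)), function(i) {
      fit <- lm(hp ~ ., data = mtcars[-i, ])  # fit on all other observations
      predict(fit, newdata = mtcars[i, ])     # predict the held-out observation
    }, numeric(1))

    length(loocv_preds)  # one prediction per observation: 32
    ```

    pipelearner handles exactly this kind of loop for us, while also keeping the fitted models and train/test splits organized in a tidy tibble.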

     Leave-one-out in pipelearner

    pipelearner is a package for streamlining machine learning pipelines, including cross validation. If you’re new to it, check out blogR for other relevant posts.

    To demonstrate, let’s use regression to predict horsepower (hp) with all other variables in the mtcars data set. Set this up in pipelearner as follows:

    library(pipelearner)

    pl <- pipelearner(mtcars, lm, hp ~ .)

    How cross validation is done is handled by learn_cvpairs(). For leave-one-out, specify k = number of rows:

    pl <- learn_cvpairs(pl, k = nrow(mtcars))

    Finally, learn() the model on all folds:

    pl <- learn(pl)

    This can all be written in a pipeline:

    pl <- pipelearner(mtcars, lm, hp ~ .) %>%
      learn_cvpairs(k = nrow(mtcars)) %>%
      learn()

    pl
    #> # A tibble: 32 × 9
    #>    models cv_pairs train_p      fit target model     params
    #>     <chr>    <chr>   <dbl>   <list>  <chr> <chr>     <list>
    #> 1       1       01       1 <S3: lm>     hp    lm <list [1]>
    #> 2       1       02       1 <S3: lm>     hp    lm <list [1]>
    #> 3       1       03       1 <S3: lm>     hp    lm <list [1]>
    #> 4       1       04       1 <S3: lm>     hp    lm <list [1]>
    #> 5       1       05       1 <S3: lm>     hp    lm <list [1]>
    #> 6       1       06       1 <S3: lm>     hp    lm <list [1]>
    #> 7       1       07       1 <S3: lm>     hp    lm <list [1]>
    #> 8       1       08       1 <S3: lm>     hp    lm <list [1]>
    #> 9       1       09       1 <S3: lm>     hp    lm <list [1]>
    #> 10      1       10       1 <S3: lm>     hp    lm <list [1]>
    #> # ... with 22 more rows, and 2 more variables: train <list>, test <list>

     Evaluating performance

    Performance can be evaluated in many ways depending on your model. We will calculate R2:

    library(tidyverse)

    # Extract true and predicted values of hp for each observation
    pl <- pl %>%
      mutate(true = map2_dbl(test, target, ~as.data.frame(.x)[[.y]]),
             predicted = map2_dbl(fit, test, predict))

    # Summarise results
    results <- pl %>%
      summarise(
        sse = sum((predicted - true)^2),
        sst = sum(true^2)
      ) %>%
      mutate(r_squared = 1 - sse / sst)

    results
    #> # A tibble: 1 × 3
    #>        sse    sst r_squared
    #>      <dbl>  <dbl>     <dbl>
    #> 1 41145.56 834278 0.9506812

    Using leave-one-out cross validation, the regression model obtains an R2 of 0.95 when generalizing to predict horsepower in new data.

    We’ll conclude with a plot of each true data point against its predicted value:

    pl %>%
      ggplot(aes(true, predicted)) +
      geom_point(size = 2) +
      geom_abline(intercept = 0, slope = 1, linetype = 2) +
      theme_minimal() +
      labs(x = "True value", y = "Predicted value") +
      ggtitle("True against predicted values based\non leave-one-out cross validation")

     Sign off

    Thanks for reading and I hope this was useful for you.

    For updates of recent blog posts, follow @drsimonj on Twitter, or email me to get in touch.

    If you’d like the code that produced this blog, check out the blogR GitHub repository.

    To leave a comment for the author, please follow the link and comment on their blog: blogR.