Managing intermediate results when using R/sparklyr
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
In our latest “R and big data” article we show how to manage intermediate results in nontrivial Apache Spark workflows using R, sparklyr, dplyr, and replyr.
Handle management
Many Sparklyr tasks involve creation of intermediate or temporary tables. This can be through dplyr::copy_to() and through dplyr::compute(). These handles can represent a reference leak and eat up resources.
To help control handle lifetime the replyr supplies recordretaining temporary name generators (and uses the same internally).
The actual function is pretty simple:
print(replyr::makeTempNameGenerator) ## function(prefix, ## suffix= NULL) { ## force(prefix) ## if((length(prefix)!=1)(!is.character(prefix))) { ## stop("repyr::makeTempNameGenerator prefix must be a string") ## } ## if(is.null(suffix)) { ## alphabet < c(letters, toupper(letters), as.character(0:9)) ## suffix < paste(base::sample(alphabet, size=20, replace= TRUE), ## collapse = '') ## } ## count < 0 ## nameList < c() ## function(dumpList=FALSE) { ## if(dumpList) { ## v < nameList ## nameList << c() ## return(v) ## } ## nm < paste(prefix, suffix, sprintf('%010d',count), sep='_') ## nameList << c(nameList, nm) ## count << count + 1 ## nm ## } ## } ## ##For instance to join a few tables it is a can be a good idea to call compute after each join (else the generated SQL can become large and unmanageable). This sort of code looks like the following:
# create example data names < paste('table', 1:5, sep='_') tables < lapply(names, function(ni) { di < data.frame(key= 1:3) di[[paste('val',ni,sep='_')]] < runif(nrow(di)) copy_to(sc, di, ni) }) # build our temp name generator tmpNamGen < replyr::makeTempNameGenerator('JOINTMP') # left join the tables in sequence joined < tables[[1]] for(i in seq(2,length(tables))) { ti < tables[[i]] if(i<length(tables)) { joined < compute(left_join(joined, ti, by='key'), name= tmpNamGen()) } else { # use nontemp name. joined < compute(left_join(joined, ti, by='key'), name= 'joinres') } } # clean up temps temps < tmpNamGen(dumpList = TRUE) print(temps) ## [1] "JOINTMP_9lWXvfnkhI2NPRsA1tEh_0000000000" ## [2] "JOINTMP_9lWXvfnkhI2NPRsA1tEh_0000000001" ## [3] "JOINTMP_9lWXvfnkhI2NPRsA1tEh_0000000002" for(ti in temps) { db_drop_table(sc, ti) } # show result print(joined) ## Source: query [3 x 6] ## Database: spark connection master=local[4] app=sparklyr local=TRUE ## ## # A tibble: 3 x 6 ## key val_table_1 val_table_2 val_table_3 val_table_4 val_table_5 ## ## 1 1 0.7594355 0.8082776 0.696254059 0.3777300 0.30015615 ## 2 2 0.4082232 0.8101691 0.005687125 0.9382002 0.04502867 ## 3 3 0.5941884 0.7990701 0.874374779 0.7936563 0.19940400Careful introduction and management of materialized intermediates can conserve resources (both time and space) and greatly improve outcomes. We feel it is a good practice to set up an explicit temp name manager, pass it through all your Sparklyr transforms, and then clear temps in batches after the results no longer depend no the intermediates.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Unconf projects 5: mwparser, Gargle, arresteddev
(This article was first published on rOpenSci Blog, and kindly contributed to Rbloggers)
And finally, we end our series of unconf project summaries (day 1, day 2, day 3, day 4).
mwparserSummary: Wikimarkup is the language used on Wikipedia and similar projects, and as such contains a lot of valuable data both for scientists studying collaborative systems and people studying things documented on or in Wikipedia. mwparser parses wikimarkup, allowing a user to filter down to specific types of tags such as links or templates, and then extract components of those tags.
Team: Oliver Keyes
Github: https://github.com/Ironholds/mwparser
GargleSummary: Gargle is a library that provides authentication for Google APIs but without all the agonizing pain. The package provides helper functions (for httr) to support automatic retries, paging, and progress bars for API calls.
Team: Craig Citro
Github: https://github.com/rlib/gargle
arresteddevSummary: This package is designed to help troubleshoot errors that come up during package and analysis development. As of now, the package helps track tracebacks and errors but more functionality is planned for the future.
Team: Lucy D'Agostino McGowan, Karthik Ram, Miles McBain
Github: https://github.com/ropenscilabs/arresteddev
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));R Interface to Spark
(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to Rbloggers)
SparkR
library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = "")) sc < sparkR.session(master = "local") df1 < read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true") ### SUMMARY TABLE WITH SQL createOrReplaceTempView(df1, "tbl1") summ < sql("select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 where month in (1, 3, 5) group by month") head(summ) # month avg_dep avg_arr # 1 1 1347.210 1523.155 # 2 3 1359.500 1509.743 # 3 5 1351.168 1502.685 ### SUMMARY TABLE WITH AGG() grp < groupBy(filter(df1, "month in (1, 3, 5)"), "month") summ < agg(grp, avg_dep = avg(df1$dep_time), avg_arr = avg(df1$arr_time)) head(summ) # month avg_dep avg_arr # 1 1 1347.210 1523.155 # 2 3 1359.500 1509.743 # 3 5 1351.168 1502.685sparklyr
library(sparklyr) sc < spark_connect(master = "local") df1 < spark_read_csv(sc, name = "tbl1", path = "nycflights13.csv", header = TRUE, infer_schema = TRUE) ### SUMMARY TABLE WITH SQL library(DBI) summ < dbGetQuery(sc, "select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 where month in (1, 3, 5) group by month") head(summ) # month avg_dep avg_arr # 1 5 1351.168 1502.685 # 2 1 1347.210 1523.155 # 3 3 1359.500 1509.743 ### SUMMARY TABLE WITH DPLYR library(dplyr) summ < df1 %>% filter(month %in% c(1, 3, 5)) %>% group_by(month) %>% summarize(avg_dep = mean(dep_time), avg_arr = mean(arr_time)) head(summ) # month avg_dep avg_arr # # 1 5 1351.168 1502.685 # 2 1 1347.210 1523.155 # 3 3 1359.500 1509.743 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Data Science for Business – Time Series Forecasting Part 2: Forecasting with timekit
In my last post, I prepared and visually explored time series data.
Now, I will use this data to test the timekit package for time series forecasting with machine learning.
ForecastingIn time series forecasting, we use models to predict future time points based on past observations.
As mentioned in timekit’s vignette, “as with most machine learning applications, the prediction is only as good as the patterns in the data. Forecasting using this approach may not be suitable when patterns are not present or when the future is highly uncertain (i.e. past is not a suitable predictor of future performance).”
And while this is certainly true, we don’t always have data with a strong regular pattern. And, I would argue, data that has very obvious patterns doesn’t need a complicated model to generate forecasts – we can already guess the future curve just by looking at it. So, if we think of usecases for businesses, who want to predict e.g. product sales, forecasting models are especially relevant in cases where we can’t make predictions manually or based on experience.
The packages I am using are timekit for forecasting, tidyverse for data wrangling and visualization, caret for additional modeling functions, tidyquant for its ggplot theme, broom and modelr for (tidy) modeling.
library(tidyverse) library(caret) library(tidyquant) library(broom) library(timekit) library(modelr) options(na.action = na.warn) Training and test dataMy input data is the tibble retail_p_day, that was created in my last post.
I am splitting this dataset into training (all data points before/on Nov. 1st 2011) and test samples (all data points after Nov. 1st 2011).
retail_p_day < retail_p_day %>% mutate(model = ifelse(day <= "20111101", "train", "test")) colnames(retail_p_day)[grep("^[09]+", colnames(retail_p_day))] < paste0("P_", colnames(retail_p_day)[grep("^[09]+", colnames(retail_p_day))])Here, I am testing out timekit’s functions with the net income per day as response variable. Because the time series in our data set is relatively short and doesn’t cover multiple years, this forecast will only be able to capture recurring variation in days and weeks. Variations like increased sales before holidays, etc. would need additional data from several years to be accurately forecast.
As we can see in the plot below, the net income shows variation between days.
retail_p_day %>% ggplot(aes(x = day, y = sum_income, color = model)) + geom_point(alpha = 0.5) + geom_line(alpha = 0.5) + scale_color_manual(values = palette_light()) + theme_tq() Augmenting the time series signatureWith timekit, we can do forecasting with only a time series signature (a series of dates and times) and a corresponding response variable. If we had additional features that could be forecast independently, we could also introduce these into the model, but here, I will only work with the minimal data set.
A central function of timekit is tk_augment_timeseries_signature(), which adds a number of features based on the properties of our time series signature:
 index: The index value that was decomposed
 index.num: The numeric value of the index in seconds. The base is “19700101 00:00:00”.
 diff: The difference in seconds from the previous numeric index value.
 year: The year component of the index.
 half: The half component of the index.
 quarter: The quarter component of the index.
 month: The month component of the index with base 1.
 month.xts: The month component of the index with base 0, which is what xts implements.
 month.lbl: The month label as an ordered factor beginning with January and ending with December.
 day: The day component of the index.
 hour: The hour component of the index.
 minute: The minute component of the index.
 second: The second component of the index.
 hour12: The hour component on a 12 hour scale.
 am.pm: Morning (AM) = 1, Afternoon (PM) = 2.
 wday: The day of the week with base 1. Sunday = 1 and Saturday = 7.
 wday.xts: The day of the week with base 0, which is what xts implements. Sunday = 0 and Saturday = 6.
 wday.lbl: The day of the week label as an ordered factor begining with Sunday and ending with Saturday.
 mday: The day of the month.
 qday: The day of the quarter.
 yday: The day of the year.
 mweek: The week of the month.
 week: The week number of the year (Sunday start).
 week.iso: The ISO week number of the year (Monday start).
 week2: The modulus for biweekly frequency.
 week3: The modulus for triweekly frequency.
 week4: The modulus for quadweekly frequency.
 mday7: The integer division of day of the month by seven, which returns the first, second, third, … instance the day has appeared in the month. Values begin at 1. For example, the first Saturday in the month has mday7 = 1. The second has mday7 = 2.
Because we have missing data for the first column of diff, I am removing this row. We need to keep in mind too, that we have an irregular time series, because we never have data on Saturdays. This will affect the modeling and results and we need to account for this later on! Alternatively, it might make sense to compare the results when setting all NAs/Saturdays to 0, assuming that no information means that there was no income on a given day. Or we could impute missing values. Which strategy most accurately represents your data needs to be decided based on a good understanding of the business and how the data was collected.
retail_p_day_aug < retail_p_day %>% rename(date = day) %>% select(model, date, sum_income) %>% tk_augment_timeseries_signature() %>% select(contains("month")) retail_p_day_aug < retail_p_day_aug[complete.cases(retail_p_day_aug), ] PreprocessingNot all of these augmented features will be informative for our model. For example, we don’t have information about time of day, so features like hour, minute, second, etc. will be irrelevant here.
Let’s look at column variation for all numeric feature and remove those with a variance of 0.
library(matrixStats) (var < data.frame(colnames = colnames(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]), colvars = colVars(as.matrix(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]))) %>% filter(colvars == 0)) ## colnames colvars ## 1 hour 0 ## 2 minute 0 ## 3 second 0 ## 4 hour12 0 ## 5 am.pm 0 retail_p_day_aug < select(retail_p_day_aug, one_of(as.character(var$colnames)))Next, we want to remove highly correlated features. By plotting them, we can get an idea about which cutoff to set.
library(ggcorrplot) cor < cor(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]) p.cor < cor_pmat(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]) ggcorrplot(cor, type = "upper", outline.col = "white", hc.order = TRUE, p.mat = p.cor, colors = c(palette_light()[1], "white", palette_light()[2]))I am going to choose a cutoff of 0.9 for removing features:
cor_cut < findCorrelation(cor, cutoff=0.9) retail_p_day_aug < select(retail_p_day_aug, one_of(colnames(cor)[cor_cut]))Now, I can split the data into training and test sets:
train < filter(retail_p_day_aug, model == "train") %>% select(model) test < filter(retail_p_day_aug, model == "test") ModelingTo model the time series of the response variable sum_income, I am using a generalized linear model. We could try all kinds of different models and modeling parameters, but for this test I am keeping it simple.
fit_lm < glm(sum_income ~ ., data = train)We can examine our model e.g. by visualizing:
tidy(fit_lm) %>% gather(x, y, estimate:p.value) %>% ggplot(aes(x = term, y = y, color = x, fill = x)) + facet_wrap(~ x, scales = "free", ncol = 4) + geom_bar(stat = "identity", alpha = 0.8) + scale_color_manual(values = palette_light()) + scale_fill_manual(values = palette_light()) + theme_tq() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) augment(fit_lm) %>% ggplot(aes(x = date, y = .resid)) + geom_hline(yintercept = 0, color = "red") + geom_point(alpha = 0.5, color = palette_light()[[1]]) + geom_smooth() + theme_tq()With this model, we can now add predictions and residuals for the test data…
pred_test < test %>% add_predictions(fit_lm, "pred_lm") %>% add_residuals(fit_lm, "resid_lm")… and visualize the residuals.
pred_test %>% ggplot(aes(x = date, y = resid_lm)) + geom_hline(yintercept = 0, color = "red") + geom_point(alpha = 0.5, color = palette_light()[[1]]) + geom_smooth() + theme_tq()We can also compare the predicted with the actual sum income in the test set.
pred_test %>% gather(x, y, sum_income, pred_lm) %>% ggplot(aes(x = date, y = y, color = x)) + geom_point(alpha = 0.5) + geom_line(alpha = 0.5) + scale_color_manual(values = palette_light()) + theme_tq() ForecastingOnce we are satisfied with our model’s performance on the test set, we can use it to forecast future events. To create future time points for modeling, we need to extract the time index (the date column day in our data frame).
# Extract index idx < retail_p_day %>% tk_index()From this index we can generate the future time series.
Here, we need to beware of a couple of things. Most importantly, we need to account for the irregularity of our data: We never have data for Saturdays and we have a few random missing values in between, as can be seen in the diff column of retail_p_day_aug (1 day difference == 86400 seconds).
retail_p_day_aug %>% ggplot(aes(x = date, y = diff)) + geom_point(alpha = 0.5, aes(color = as.factor(diff))) + geom_line(alpha = 0.5) + scale_color_manual(values = palette_light()) + theme_tq()What dates are these? Let’s filter for dates with more than 1 day between the last recorded day that are not Sundays (as Saturdays are always offdays).
retail_p_day_aug %>% select(date, wday.lbl, diff) %>% filter(wday.lbl != "Sunday" & diff > 86400) %>% mutate(days_missing = diff / 86400 1) ## # A tibble: 5 x 4 ## date wday.lbl diff days_missing ## ## 1 20110104 Tuesday 1036800 11 ## 2 20110426 Tuesday 432000 4 ## 3 20110503 Tuesday 172800 1 ## 4 20110531 Tuesday 172800 1 ## 5 20110830 Tuesday 172800 1 retail_p_day_aug %>% select(date, wday.lbl, diff) %>% filter(wday.lbl == "Sunday" & diff > 172800) %>% mutate(days_missing = diff / 86400 1) ## # A tibble: 1 x 4 ## date wday.lbl diff days_missing ## ## 1 20110501 Sunday 259200 2Let’s create a list of all missing days:
off_days < c("20101224", "20101225", "20101226", "20101227", "20101228", "20101229", "20101230", "20100101", "20100102", "20100103", "20110422", "20110423", "20110424", "20110425", "20110502", "20110530", "20110829", "20110429", "20110430") %>% ymd()Official UK holidays during that time were:
 2011:
 Boxing Day December 26

Christmas Day Holiday December 27
 2012:
 New Year’s Day Holiday January 2
 Good Friday April 6
 Easter Monday April 9
 Early May Bank Holiday May 7
 Spring Bank Holiday June 4
 Diamond Jubilee Holiday June 5
 Summer Bank Holiday August 27
We can account for the missing Saturdays with inspect_weekdays = TRUE.
Ideally, we would now use skip_values and insert_values to specifically account for days with irregular missing data in our future time series, e.g. by accounting for holidays. Generally, it is very difficult to account for holidays, because they don’t occur with an easy to model rule (e.g. Easter is on the first Sunday after the first full moon in Spring). Unfortunately, in our dataset we have seen that holidays and randomly missing days did not have a big overlap in the past.
Because not all holidays are missing days and we have more missing days than official holidays, I am using the list of missing days for skipping values – even though this is only a bestguess approach and likely not going to match all days that will be missing in reality during the future time series.
idx_future < idx %>% tk_make_future_timeseries(n_future = 300, inspect_weekdays = TRUE, inspect_months = FALSE, skip_values = off_days) idx_future %>% tk_get_timeseries_signature() %>% ggplot(aes(x = index, y = diff)) + geom_point(alpha = 0.5, aes(color = as.factor(diff))) + geom_line(alpha = 0.5) + scale_color_manual(values = palette_light()) + theme_tq()Then, we can build the data frame for forecasting by using tk_get_timeseries_signature() and renaming the index column to date, so that it matches the features in the model. With this data frame, we can now predict future values and add this to the data frame.
data_future < idx_future %>% tk_get_timeseries_signature() %>% rename(date = index) pred_future < predict(fit_lm, newdata = data_future) pred_future < data_future %>% select(date) %>% add_column(sum_income = pred_future) retail_p_day %>% select(day, sum_income) %>% rename(date = day) %>% rbind(pred_future) %>% ggplot(aes(x = date, y = sum_income)) + scale_x_date() + geom_vline(xintercept = as.numeric(max(retail_p_day$day)), color = "red", size = 1) + geom_point(alpha = 0.5) + geom_line(alpha = 0.5) + theme_tq()When we evaluate the forecast, we want to account for uncertainty of accuracy, e.g. by accounting for the standard deviation of the test residuals.
test_residuals < pred_test$resid_lm test_resid_sd < sd(test_residuals, na.rm = TRUE) pred_future < pred_future %>% mutate( lo.95 = sum_income  1.96 * test_resid_sd, lo.80 = sum_income  1.28 * test_resid_sd, hi.80 = sum_income + 1.28 * test_resid_sd, hi.95 = sum_income + 1.96 * test_resid_sd )This, we can then plot to show the forecast with confidence intervals:
retail_p_day %>% select(day, sum_income) %>% rename(date = day) %>% ggplot(aes(x = date, y = sum_income)) + geom_point(alpha = 0.5) + geom_line(alpha = 0.5) + geom_ribbon(aes(ymin = lo.95, ymax = hi.95), data = pred_future, fill = "#D5DBFF", color = NA, size = 0) + geom_ribbon(aes(ymin = lo.80, ymax = hi.80, fill = key), data = pred_future, fill = "#596DD5", color = NA, size = 0, alpha = 0.8) + geom_point(aes(x = date, y = sum_income), data = pred_future, alpha = 0.5, color = palette_light()[[2]]) + geom_smooth(aes(x = date, y = sum_income), data = pred_future, method = 'loess', color = "white") + theme_tq()Our model predicts that income will follow a curve that is very similar to last year’s with a drop after Christmas and an increase towards the later months of the year. In and off itself, this sounds reasonable. However, because we only have data from one year, we do not know whether the decline in January/February and the increase towards Christmas is an annually recurring trend or whether the increase we see at the end of 2011 will be independent of seasonality and continue to rise in the future.
Next time, I’ll compare how Facebook’s prophet will predict the future income.
sessionInfo() ## R version 3.4.0 (20170421) ## Platform: x86_64w64mingw32/x64 (64bit) ## Running under: Windows 7 x64 (build 7601) Service Pack 1 ## ## Matrix products: default ## ## locale: ## [1] LC_COLLATE=English_United States.1252 ## [2] LC_CTYPE=English_United States.1252 ## [3] LC_MONETARY=English_United States.1252 ## [4] LC_NUMERIC=C ## [5] LC_TIME=English_United States.1252 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] ggcorrplot_0.1.1 matrixStats_0.52.2 ## [3] modelr_0.1.0 timekit_0.3.0 ## [5] broom_0.4.2 tidyquant_0.5.1 ## [7] quantmod_0.48 TTR_0.231 ## [9] PerformanceAnalytics_1.4.3541 xts_0.97 ## [11] zoo_1.80 lubridate_1.6.0 ## [13] caret_6.076 lattice_0.2035 ## [15] dplyr_0.5.0 purrr_0.2.2.2 ## [17] readr_1.1.1 tidyr_0.6.3 ## [19] tibble_1.3.1 ggplot2_2.2.1 ## [21] tidyverse_1.1.1 ## ## loaded via a namespace (and not attached): ## [1] httr_1.2.1 jsonlite_1.5 splines_3.4.0 ## [4] foreach_1.4.3 assertthat_0.2.0 stats4_3.4.0 ## [7] cellranger_1.1.0 yaml_2.1.14 backports_1.0.5 ## [10] quantreg_5.33 digest_0.6.12 rvest_0.3.2 ## [13] minqa_1.2.4 colorspace_1.32 htmltools_0.3.6 ## [16] Matrix_1.210 plyr_1.8.4 psych_1.7.5 ## [19] SparseM_1.77 haven_1.0.0 padr_0.3.0 ## [22] scales_0.4.1 lme4_1.113 MatrixModels_0.41 ## [25] mgcv_1.817 car_2.14 nnet_7.312 ## [28] lazyeval_0.2.0 pbkrtest_0.47 mnormt_1.55 ## [31] magrittr_1.5 readxl_1.0.0 evaluate_0.10 ## [34] nlme_3.1131 MASS_7.347 forcats_0.2.0 ## [37] xml2_1.1.1 foreign_0.868 tools_3.4.0 ## [40] hms_0.3 stringr_1.2.0 munsell_0.4.3 ## [43] compiler_3.4.0 rlang_0.1.1 grid_3.4.0 ## [46] nloptr_1.0.4 iterators_1.0.8 labeling_0.3 ## [49] rmarkdown_1.5 gtable_0.2.0 ModelMetrics_1.1.0 ## [52] codetools_0.215 DBI_0.61 reshape2_1.4.2 ## [55] R6_2.2.1 knitr_1.16 rprojroot_1.2 ## [58] Quandl_2.8.0 stringi_1.1.5 parallel_3.4.0 ## [61] Rcpp_0.12.11 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Run massive parallel R jobs cheaply with updated doAzureParallel package
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
At the EARL conference in San Francisco this week, JS Tan from Microsoft gave an update (PDF slides here) on the doAzureParallel package . As we've noted here before, this package allows you to easily distribute parallel R computations to an Azure cluster. The package was recently updated to support using automaticallyscaling Azure Batch clusters with lowpriority nodes, which can be used at a discount of up to 80% compared to the price of regular highavailability VMs.
JS Tan using doAzureParallel #rstats package to run simulation on a cluster of 20 lowpriority Azure VMs. Total cost: $0.02 #EARLConf2017 pic.twitter.com/Mpl3IUa9zY
— David Smith (@revodavid) June 7, 2017
Using the doAzureParallel package is simple. First, you need to define the cluster you're going to use as a JSON file. (You can see an example on the right.) Here, you'll specify your Azure credentials, the size of the cluster, and the type of nodes (CPUs and memory) to use in the cluster. You can also specify here R packages (from CRAN and/or Github) to be preloaded onto each node, and the maximum number of simultaneous tasks to run on each node (for withinnode parallelism).
New to this update, the poolSize option allows you to specify the number of dedicated (standard) VM nodes to use, in addition to a number of lowpriority nodes to use. Lowpriority nodes can be preempted by the Azure system at any time, but are much cheaper to use. (Even if a node is preempted your parallel computation will be continue; it will just take a little longer with the reduced capacity.) You can even specify a minimum and maximum number of nodes of each class to use, in which case the cluster will automatically scale up and down according to either (your choice) the workload or the time of day (e.g. only expand the lowpriority part of the cluster on weekends, when preemption is less likely).
Once you've defined the parameters of your cluster, all you need to do is declare the cluster as a backend for the foreach package. The body of the foreach loop runs just like a for loop, except that multiple iterations run in parallel on the remote cluster. Here are the key parts of the option price simulation example JS presented at the conference.
This same approach can be used for any "embarrassingly parallel" iteration in R, and you can use any R function or package within the body of the loop. For example, you could use a cluster to reduce the time required for parameter tuning and crossvalidation with the caret package, or speed up data preparation tasks when using the dplyr package.
In addition to support for autoscaling clusters, this update to doAzureParallel also includes a few other new features. You'll also find new utility functions for managing multiple longrunning R jobs, functions to read data from and write data to Azure Blob storage, and the ability to preload data into the cluster by specifying resource files.
The doAzureParallel package is available for download now from Github, under the opensource MIT license. For details on how to use the package, check out the README and the doAzureParallel guide.
Github (Azure): doAzureParallel
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Introduction to Set Theory and Sets with R
(This article was first published on R – Aaron Schlegel, and kindly contributed to Rbloggers)
Sets define a ‘collection’ of objects, or things typically referred to as ‘elements’ or ‘members.’ The concept of sets arises naturally when dealing with any collection of objects, whether it be a group of numbers or anything else. Conceptually, the following examples can be defined as a ‘set’:
 {1, 2, 3, 4}
 {Red, Green, Blue}
 {Cat, Dog}
The first example is the set of the first four natural numbers. The second defines a set of the primary colors while the third example denotes a set of common household pets.
Since its development beginning in the 1870s with Georg Cantor and Richard Dedekind, set theory has become a foundational system of mathematics, and therefore its concepts constantly arise in the study of mathematics and is also an area of active research today.
This post will introduce some of the basic concepts of set theory, specifically the ZermeloFraenkel axiomatic system (more on that later), with R code to demonstrate these concepts.
Set NotationSets can be defined with lowercase, uppercase, script or Greek letters (in addition to subscripts and the like). Using several types of letters helps when dealing with hierarchies. Before diving into set theory, it’s best to define the common notation employed. One benefit of set theory being ubiquitousness in mathematics is learning its notation also helps in the understanding of other mathematical concepts.
 \forall x – for every set x
 \exists x – there exists such a set x that
 \neg – not
 \wedge – and
 \vee – or (one or the other or both)
 \Rightarrow – implies
 \Leftrightarrow – iff, if and only if.
 A \cup B – union of sets A and B
 A \cap B – intersection of sets A and B
 x \in A – x is an element of a set A
 x \notin A – x is not an element of a set A
The \Rightarrow notation for implies can be thought of like an if statement in that it denotes the relation ‘if a then b.’
Set MembershipSet membership is written similar to:
x \in A
Which can be stated as ‘x is an element of the set A.’ If x is not a member of A, we write:
x \notin A
The symbol \in to denote set membership originated with Giuseppe Peano (Enderton, pp. 26).
Which is read ‘x is not an element of the set A.’ We can write an R function to implement the concept of set membership.
iselement < function(x, A) { if (x %in% A) { return(TRUE) } return(FALSE) }Let’s use our simple function to test if there exists some members in the set, A = \{3, 5, 7, 11\}.
A < c(3, 5, 7, 11) eles < c(3, 5, 9, 10, (5 + 6)) for (i in 1:length(eles)) { print(iselement(i, A)) } ## [1] FALSE ## [1] FALSE ## [1] TRUE ## [1] FALSE ## [1] TRUESet membership leads into one of the first axioms of set theory under the ZermeloFraenkel system, the Principle of Extensionality.
Principle of ExtensionalityThe principle of extensionality states if two sets have the same members, they are equal. The formal definition of the principle of extensionality can be stated more concisely using the notation given above:
\forall A \forall B (\forall x (x \in A \Leftrightarrow x \in B) \Rightarrow A = B)
Stated less concisely but still using set notation:
If two sets A and B are such that for every element (member) x:
x \in A \qquad iff \qquad x \in B Then A = B.
We can express this axiom through an R function to test for set equality.
isequalset < function(a, b) { a < unique(a) b < unique(b) an < length(a) if (an > length(b)) { return(FALSE) } for (i in 1:an) { if (!(a[i]) %in% b) { return(FALSE) } } return(TRUE) }We can now put the principle of extensionality in action with our R function!
# original set A to compare A < c(3, 5, 7, 11) # define some sets to test for equality B < c(5, 7, 11, 3) C < c(3, 4, 6, 5) D < c(3, 5, 7, 11, 13) E < c(11, 7, 5, 3) G < c(3, 5, 5, 7, 7, 11) # collect sets into a list to iterate sets < list(B, C, D, E, G) # using the isequalset() function, print the results of the equality tests. for (i in sets) { print(isequalset(i, A)) } ## [1] TRUE ## [1] FALSE ## [1] FALSE ## [1] TRUE ## [1] TRUE Empty Sets and SingletonsSo far we have only investigated sets with two or more members. The empty set, denoted \varnothing, is defined as a set containing no elements and occurs surprisingly frequently in settheoretic operations despite is seemingly straightforward and simple nature.
The empty set axiom, states the existence of an empty set concisely:
\exists B \forall x \qquad x \notin B Which can also be stated as ‘there is a set having no members.’
A set {\varnothing} can be formed whose only member is \varnothing. It is important to note {\varnothing} \neq \varnothing because \varnothing \in {\varnothing} but \varnothing \notin \varnothing. One can conceptually think of {\varnothing} as a container with nothing in it.
A singleton is a set with exactly one element, denoted typically by {a}. A nonempty set is, therefore, a set with one or more element. Thus a singleton is also nonempty. We can define another quick function to test if a given set is empty, a singleton or a nonempty set.
typeofset < function(a) { if (length(a) == 0) { return('empty set') } else if (length(a) == 1) { return('singleton') } else if (length(a) > 1) { return('nonempty set') } else { stop('could not determine type of set') } } A < c() B < c(0) C < c(1, 2) D < list(c()) set_types < list(A, B, C, D) for (i in set_types) { print(typeofset(i)) } ## [1] "empty set" ## [1] "singleton" ## [1] "nonempty set" ## [1] "singleton"Note D is defined as a singleton because the set contains one element, the empty set \varnothing.
SummaryThis post introduced some of the basic concepts of axiomatic set theory using the ZermeloFraenkel axioms by exploring the idea of set, set membership and some particular cases of sets such as the empty set and singletons. Set notation that will be used throughout not just settheoretic applications but throughout mathematics was also introduced.
ReferencesBarile, Margherita. “Singleton Set.” From MathWorld–A Wolfram Web Resource, created by Eric W. Weisstein. http://mathworld.wolfram.com/SingletonSet.html
Enderton, H. (1977). Elements of set theory (1st ed.). New York: Academic Press.
Weisstein, Eric W. “Empty Set.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/EmptySet.html
Weisstein, Eric W. “Nonempty Set.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/NonemptySet.html
The post Introduction to Set Theory and Sets with R appeared first on Aaron Schlegel.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Aaron Schlegel. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Campaign Response Testing no longer published on Udemy
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
Our free video course Campaign Response Testing is no longer published on Udemy. It remains available for free on YouTube with all source code available from GitHub. I’ll try to correct bad links as I find them.
Please read on for the reasons.
Udemy recently unilaterally instituted a new policy on free courses: “When a free course has a Recent Review Rating less than 4.1 and is flagged with a ‘high degree of confidence’ the course will be hidden from Udemy’s search.”
Campaign Response Testing is a free course with an alltime average rating of 4.14 and a recent rating of 3.85. We have kept the code up to date and answered student questions (there was no backlog of student questions).
Obviously others should have opinion of our public work (which may or may not be good). And we are keeping the course up for free with nosign up necessary on YouTube (as we in no way want to hurt or inconvenience students).
But I do have an issue with Udemy’s new system: it is completely vulnerable to griefers and spammers. Accounts with “no skin in the game” (having paid nothing, possibly not even have watched the course, and without having to leave any review text in addition to the “high degree of confidence” star rating) can belittle free courses at will and in bulk. Currently there is no effective dispute mechanism (other than superficial palliatives such as: “answer student questions”), and like all lopsided “fights against semianonymous trolls” it would be a pointless effort anyway. I understand the desire to increase the quality of free content (it is Udemy’s “introduction”) but I feel the introduced system would obviously be unworkable even before it was introduced.
So with hopefully no harm to our subscribers I am unpublishing the course. As is the case with Udemy policy anybody already subscribed should continue to have access to the course through Udemy. And as I have said: we have long sense ported the entire free course to YouTube and Github for free and loginfree sharing.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Neural networks Exercises (Part1)
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
Neural network have become a corner stone of machine learning in the last decade. Created in the late 1940s with the intention to create computer programs who mimics the way neurons process information, those kinds of algorithm have long been believe to be only an academic curiosity, deprived of practical use since they require a lot of processing power and other machine learning algorithm outperform them. However since the mid 2000s, the creation of new neural network types and techniques, couple with the increase availability of fast computers made the neural network a powerful tool that every data analysts or programmer must know.
In this series of articles, we’ll see how to fit a neural network with R, we’ll learn the core concepts we need to know to well apply those algorithms and how to evaluate if our model is appropriate to use in production. This set of exercises is an introduction to neural networks where we’ll use them to create two simple regression and clustering model. Doing so, we’ll use a lot of basic concepts we’ll explore further in future sets. If you want more informations about neural network, your can see this page.
Answers to the exercises are available here.
Exercise 1
We’ll start by creating the data set on which we want to do a simple regression. Set the seed to 42, generate 200 random points between 10 and 10 and store them in a vector named X. Then, create a vector named Y containing the value of sin(x). Neural network are a lot more flexible than most regression algorithms and can fit complex function with ease. The biggest challenge is to find the appropriate network function appropriate to the situation.
Exercise 2
A network function is made of three components: the network of neurons, the weight of each connection between neuron and the activation function of each neuron. For this example, we’ll use a feedforward neural network and the logistic activation which are the defaults for the package nnet. We take one number as input of our neural network and we want one number as the output so the size of the input and output layer are both of one. For the hidden layer, we’ll start with three neurons. It’s good practice to randomize the initial weights, so create a vector of 10 random values, picked in the interval [1,1].
Exercise 3
Neural networks have a strong tendency of overfitting your data, meaning they become really good at describing the relationship between the values in your data set, but are not effective with data that wasn’t used to train your model. As a consequence, we need to crossvalidate our model. Set the seed to 42, then create a training set containing 75% of the values in your initial data set and a test set containing the rest of your data.
Exercise 4
Load the nnet package and use the function of the same name to create your model. Pass your weights via the Wts argument and set the maxit argument to 50. We want to fit a function which can have for output multiple possible values. To do so, set the linout argument to true. Finally, take the time to look at the structure of your model.
 Work with Deep Learning networks and related packagse in R
 Create Natural Language Processing models
 And much more
Exercise 5
Predict the output for the test set and compute the RMSE of your predictions. Plot the function sin(x) and then plot your predictions.
Exercise 6
The number of neurons in the hidden layer, as well as the number of hidden layer used, has a great influence on the effectiveness of your model. Repeat the exercises three to five, but this time use a hidden layer with seven neurons and initiate randomly 22 weights.
Exercise 7
Now let us use neural networks to solve a clustering problems, so let’s load the iris data set! It is good practice to normalize your input data to uniformize the behavior of your model over different range of value and have a faster training. Normalize each factor so that they have a mean of zero and a standard deviation of 1, then create your train and test set.
Exercise 8
Use the nnet() and use a hidden layer of ten neurons to create your model. We want to fit a function which have a finite amount of value as output. To do so, set the code>linout argument to true. Look at the structure of your model. With clustering problem, the output is usually a factor that is coded as multiple dummy variables, instead of a single numeric value. As a consequence, the output layer have as one less neuron than the number of levels of the output factor.
Exercise 9
Make prediction with the values of the test set.
Exercise 10
Create the confusion table of your prediction and compute the accuracy of the model.
 Evaluate your model with R Exercises
 Getting started with Plotly: basic Plots
 Network Analysis Part 3 Exercises
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Introducing the MonteCarlo Package
(This article was first published on R – first differences, and kindly contributed to Rbloggers)
My first R package has been released on CRAN recently. It is named MonteCarlo and aims to make simulation studies as easy as possible – including parallelization and the generation of tables.
What Are Simulation Studies Good For?Monte Carlo simulations are an essential tool in statistics and related disciplines. They are routinely used to determine distributional properties, where no analytical results are available. A typical example is to study the finite sample properties of a new statistical procedure. Finite sample properties are often unknown, since theoretical results usually rely on asymptotic theory that only applies to infinite sample sizes. This may be a good approximation in practice if the sample size is large, but it may also be bad if the sample is small or if the rate of convergence is very slow.
Simulation studies are also extremely useful to compare the performance of alternative procedures for the same task. Imagine you want to test whether your sample comes from a Gaussian distribution and you have to decide whether to use the JarqueBera test or the Kolmogorov Smirnov test. Both appear to be legitimate choices. A simulation study that is tailored so that it reflects the situation at hand might uncover that one of the procedures is much more powerful than the other.
Finally, small scale simulation studies are an essential tool for statistical programming. Testing is an essential part of programming and software development. Usually, if one writes a function, it is good practice to let it run on some test cases. This is easy, if the outcome is deterministic. But if your function is a statistical test or an estimator, the output depends on the sample and is stochastic. Therefore, the only way to test whether the implementation is correct, is to generate a large number of samples and to see whether it has the expected statistical properties. For example one might test whether the mean squared error of an estimator converges to zero as the sample size increases, or whether the test has the correct alpha error.
Therefore, writing Monte Carlo simulations is an everyday task in many areas of statistics. This comes with considereable effort. It is not unusual that the required lines of code to produce a simulation study are a multiple of that needed to implement the procedure of interest. As a consequence of that they are also one of the main sources for errors. On top of this, the large computational effort often requires parallelization which brings additional complications and programming efforts. This efforts can often be prohibitive – especially for less advanced users.
The MonteCarlo PackageThe MonteCarlo package streamlines the process described above. It allows to create simulation studies and to summarize their results in LaTeX tables quickly and easily.
There are only two main functions in the package:
 MonteCarlo() runs a simulation study for a user defined parameter grid. It handles the generation of loops over these parameter grid and parallelizes the computation on a user specified number of CPU units.
 MakeTable() creates LaTeX tables from the output of MonteCarlo(). It stacks high dimensional output arrays into tables with a user specified ordering of rows and columns.
To run a simulation study, the user has to nest both – the generation of a sample and the calculation of the desired statistics from this sample – in a single function. This function is passed to MonteCarlo(). No additional programming is required. This approach is not only very versatile – it is also very intuitive. The user formulates his experiment as if he/she was only interested in a single draw.
The aim of this approach is to allow the user full control and flexibility with regard to the design of the Monte Carlo experiment, but to take away all the effort of setting up the technical part of the simulation.
A Simple Example: The ttestSuppose we want to evaluate the performance of a standard ttest for the hypothesis that the mean is equal to zero. We are interested to see how the size and power of the test change with the sample size (n), the distance from the null hypothesis (loc for location) and the standard deviation of the distribution (scale). The sample is generated from a normal distribution.
To conduct this analysis, we proceed as follows. First, we load the MonteCarlo package.
library(MonteCarlo)Then define the following function.
######################################### ## Example: ttest # Define function that generates data and applies the method of interest ttest<function(n,loc,scale){ # generate sample: sample<rnorm(n, loc, scale) # calculate test statistic: stat<sqrt(n)*mean(sample)/sd(sample) # get test decision: decision1.96 # return result: return(list("decision"=decision)) }As discussed above, ttest() is formulated in a way as if we only want to generate a single test decision. The arguments of the function are the parameters we are interested in. ttest() carries out 4 steps:
 Draw a sample of n observations from a normal distribution with mean loc and standard deviation scale.
 Calculate the tstatistic.
 Determine the test decision.
 Return the desired result in form of a list.
We then define the combinations of parameters that we are interested in and collect them in a list. The elements of the lists must have the same names as the parameters for which we want to supply grids.
# define parameter grid: n_grid<c(50,100,250,500) loc_grid<seq(0,1,0.2) scale_grid<c(1,2) # collect parameter grids in list: param_list=list("n"=n_grid, "loc"=loc_grid, "scale"=scale_grid)To run the simulation, the function ttest() and the parameter grid (param_list) are passed to MonteCarlo(), together with the desired number of Monte Carlo repetitions (nrep=1000).
# run simulation: MC_result<MonteCarlo(func=ttest, nrep=1000, param_list=param_list)There is no further coding required. All the mechanics of the Monte Carlo experiment are handled by the MonteCarlo() function.
Calling summary produces a short information on the simulation.
summary(MC_result) ## Simulation of function: ## ## function(n,loc,scale){ ## ## # generate sample: ## sample<rnorm(n, loc, scale) ## ## # calculate test statistic: ## stat<sqrt(n)*mean(sample)/sd(sample) ## ## # get test decision: ## decision1.96 ## ## # return result: ## return(list("decision"=decision)) ## } ## ## Required time: 13.38 secs for nrep = 1000 repetitions on 1 CPUs ## ## Parameter grid: ## ## n : 50 100 250 500 ## loc : 0 0.2 0.4 0.6 0.8 1 ## scale : 1 2 ## ## ## 1 output arrays of dimensions: 4 6 2 1000As one can see from the summary, the simulation results are stored in an array of dimension c(4,6,2,1000), where the Monte Carlo repetitions are collected in the last dimension of the array.
To summarize the results in a reasonable way and to include them as a table in a paper or report, we have to represent them in a matrix. This is handled by the MakeTable() function that stacks the submatrices collected in the array in the rows and columns of a matrix and prints the result in the form of code to generate a LaTeX table.
To determine in which order the results are stacked in rows and columns, we supply the function arguments rows and cols. These are vectors of the names of the parameters in the order in which we want them to appear in the table (sorted from the inside to the outside).
# generate table: MakeTable(output=MC_result, rows="n", cols=c("loc","scale"), digits=2, include_meta=FALSE) ## \begin{table}[h] ## \centering ## \resizebox{ 1 \textwidth}{!}{% ## \begin{tabular}{ rrrrrrrrrrrrrrr } ## \hline\hline\\\\ ## scale && \multicolumn{ 6 }{c}{ 1 } & & \multicolumn{ 6 }{c}{ 2 } \\ ## n/loc & & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 & & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 \\ ## & & & & & & & & & & & & & & \\ ## 50 & & 0.05 & 0.30 & 0.83 & 0.98 & 1.00 & 1.00 & & 0.05 & 0.10 & 0.28 & 0.55 & 0.79 & 0.94 \\ ## 100 & & 0.05 & 0.51 & 0.98 & 1.00 & 1.00 & 1.00 & & 0.07 & 0.16 & 0.53 & 0.84 & 0.98 & 1.00 \\ ## 250 & & 0.05 & 0.89 & 1.00 & 1.00 & 1.00 & 1.00 & & 0.05 & 0.35 & 0.90 & 1.00 & 1.00 & 1.00 \\ ## 500 & & 0.05 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 & & 0.06 & 0.58 & 1.00 & 1.00 & 1.00 & 1.00 \\ ## \\ ## \\ ## \hline\hline ## \end{tabular}% ## } ## \caption{ decision } ## \end{table}To change the ordering, just change the vectors rows and cols.
# generate table: MakeTable(output=MC_result, rows=c("n","scale"), cols="loc", digits=2, include_meta=FALSE) ## \begin{table}[h] ## \centering ## \resizebox{ 1 \textwidth}{!}{% ## \begin{tabular}{ rrrrrrrrr } ## \hline\hline\\\\ ## scale & n/loc & & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 \\ ## & & & & & & & & \\ ## \multirow{ 4 }{*}{ 1 } & 50 & & 0.05 & 0.30 & 0.83 & 0.98 & 1.00 & 1.00 \\ ## & 100 & & 0.05 & 0.51 & 0.98 & 1.00 & 1.00 & 1.00 \\ ## & 250 & & 0.05 & 0.89 & 1.00 & 1.00 & 1.00 & 1.00 \\ ## & 500 & & 0.05 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 \\ ## & & & & & & & & \\ ## \multirow{ 4 }{*}{ 2 } & 50 & & 0.05 & 0.10 & 0.28 & 0.55 & 0.79 & 0.94 \\ ## & 100 & & 0.07 & 0.16 & 0.53 & 0.84 & 0.98 & 1.00 \\ ## & 250 & & 0.05 & 0.35 & 0.90 & 1.00 & 1.00 & 1.00 \\ ## & 500 & & 0.06 & 0.58 & 1.00 & 1.00 & 1.00 & 1.00 \\ ## \\ ## \\ ## \hline\hline ## \end{tabular}% ## } ## \caption{ decision } ## \end{table}Now we can simply copy the code and add it to our paper, report or presentation. That is all. Only make sure that the package multirow is included in the header of the .tex file.
Parallelised SimulationIf the procedure you are interested in is not so fast or you need a large number of replications to produce very accurate results, you might want to use parallelized computation on multiple cores of your computer (or cluster). To achive this, simply specify the number of CPUs by supplying a value for the argument ncpus of MonteCarlo as shown below. Of course you should actually have at least the specified number of units.
# run simulation: MC_result<MonteCarlo(func=ttest, nrep=1000, param_list=param_list, ncpus=4)This automatically sets up a snow cluster, including the export of functions and the loading of packages. The user does not have to take care of anything.
Further InformationThis is an introduction to get you up and running with the MonteCarlo package as quickly as possible. Therefore, I only included a short example. However, the functions MonteCarlo() and particularly MakeTable() provide much more functionality. This is described in more detail in the package vignette, that also provides additional examples.
Filed under: R
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – first differences. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Words growing or shrinking in Hacker News titles: a tidy analysis
(This article was first published on Variance Explained, and kindly contributed to Rbloggers)
In May, some friends and I built Tagger News, a realtime automatic classifier of Hacker News articles based on their text (see here for more about how we built it). This process started me down some interesting paths, particularly analyzing trends in titles. By finding words that became more or less common in Hacker News titles over time, we can see what topics were gaining or losing interest in the tech world.
I thought I’d share some of these analyses here. As with most of my text analysis posts, I’ll be analyzing it with the tidytext package written by me and Julia Silge. (Check out our soontobereleased book on the subject, Text Mining with R!)
Processing one million Hacker News articlesThe titles (and dates, and links) of Hacker News articles are helpfully stored on Google Bigquery. We start with a dataset of a million Hacker News article titles, which covers about 3 and a half years of posts. I downloaded it on 20170604 with the following query:
SELECT id, score, time, title, url FROM [bigquerypublicdata:hacker_news.full] WHERE deleted IS null AND dead IS null AND type = 'story' ORDER BY time DESC LIMIT 1000000I’m hosting the dataset so that you can read it yourself (note that it’s a moderately large and compressed file so it may take a minute to read) as follows:
library(tidyverse) library(lubridate) stories < read_csv("http://varianceexplained.org/files/stories_1000000.csv.gz") %>% mutate(time = as.POSIXct(time, origin = "19700101"), month = round_date(time, "month"))Whenever we read in a text dataset, the first step is usually to tokenize it (split it up into words) and remove stopwords (like “the” or “and”). As described here, we can do this with tidytext’s unnest_tokens function.
library(tidyverse) library(tidytext) library(stringr) title_words < stories %>% arrange(desc(score)) %>% distinct(title, .keep_all = TRUE) %>% unnest_tokens(word, title, drop = FALSE) %>% distinct(id, word, .keep_all = TRUE) %>% anti_join(stop_words, by = "word") %>% filter(str_detect(word, "[^\\d]")) %>% group_by(word) %>% mutate(word_total = n()) %>% ungroup()This creates a pretty large data frame (4443083 rows) with one row per word per post. For example, we could use this to find and visualize the most common words in Hacker News posts during this 3.5 year period.
word_counts < title_words %>% count(word, sort = TRUE) word_counts %>% head(25) %>% mutate(word = reorder(word, n)) %>% ggplot(aes(word, n)) + geom_col(fill = "lightblue") + scale_y_continuous(labels = comma_format()) + coord_flip() + labs(title = "Most common words in Hacker News titles", subtitle = "Among the last million stories; stop words removed", y = "# of uses")The top result, “hn”, comes from the common constructions of “Show HN” and “Ask HN”, which are often prepended to a title. Among the other words, there’s nothing we wouldn’t have expected from following Hacker News. There some notable technology companies like Google, Apple, and Facebook, along with common topics in tech discussions like “app”, “data” and “startup”.
Change over timeWhat words and topics have become more frequent, or less frequent, over time? These could give us a sense of the changing technology ecosystem, and let us predict what topics will continue to grow in relevance.
To achieve this, we’ll first count the occurrences of words in titles by month.
stories_per_month < stories %>% group_by(month) %>% summarize(month_total = n()) word_month_counts < title_words %>% filter(word_total >= 1000) %>% count(word, month) %>% complete(word, month, fill = list(n = 0)) %>% inner_join(stories_per_month, by = "month") %>% mutate(percent = n / month_total) %>% mutate(year = year(month) + yday(month) / 365) word_month_counts ## # A tibble: 33,480 x 6 ## word month n month_total percent year ## ## 1 2.0 20131001 42 20542 0.002044592 2013.751 ## 2 2.0 20131101 55 23402 0.002350226 2013.836 ## 3 2.0 20131201 26 21945 0.001184780 2013.918 ## 4 2.0 20140101 25 19186 0.001303033 2014.003 ## 5 2.0 20140201 31 22295 0.001390446 2014.088 ## 6 2.0 20140301 37 21118 0.001752060 2014.164 ## 7 2.0 20140401 33 23673 0.001393993 2014.249 ## 8 2.0 20140501 28 21394 0.001308778 2014.332 ## 9 2.0 20140601 47 19265 0.002439657 2014.416 ## 10 2.0 20140701 25 19829 0.001260780 2014.499 ## # ... with 33,470 more rowsWe can then use my broom package to fit a model (logistic regression) to examine whether the frequency of each word is increasing or decreasing over time. Every term will then have a growth rate (as an exponential term) associated with it.
library(broom) mod < ~ glm(cbind(n, month_total  n) ~ year, ., family = "binomial") slopes < word_month_counts %>% nest(word) %>% mutate(model = map(data, mod)) %>% unnest(map(model, tidy)) %>% filter(term == "year") %>% arrange(desc(estimate)) slopes ## # A tibble: 744 x 6 ## word term estimate std.error statistic p.value ## ## 1 trump year 1.7570144 0.04052662 43.35458 0.000000e+00 ## 2 ai year 0.7830541 0.01756776 44.57335 0.000000e+00 ## 3 blockchain year 0.6557110 0.02682345 24.44544 5.626574e132 ## 4 neural year 0.6270933 0.02617919 23.95388 8.418336e127 ## 5 react year 0.6027489 0.01628292 37.01725 6.045233e300 ## 6 vr year 0.5260498 0.02247906 23.40177 4.099556e121 ## 7 bot year 0.5178669 0.02600166 19.91669 2.916460e88 ## 8 iot year 0.5076088 0.02514613 20.18636 1.290270e90 ## 9 microservices year 0.4933223 0.03180060 15.51299 2.833910e54 ## 10 slack year 0.4718030 0.02287605 20.62432 1.660491e94 ## # ... with 734 more rowsWe can now ask: what terms have been increasing in frequency in Hacker News titles?
It makes sense that “Trump” is the word that’s growing most quickly. While it shows a moderate growth after Trump announced his candidacy in mid2015, it shows a sharp increase around the time of the 2016 election (though it hasn’t been quite as dominant in the months since his inauguration). The next fastest growing terms include “AI” and “blockchain”, both topics that have shown a recent surge of interest. (We can also see “machine learning”, “AWS”, and “bot” among the growing terms).
What words have been decreasing in frequency in Hacker News titles?
This shows a few topics in which interest has died out since 2013, including Google Glass and Gmail. Discussion of Edward Snowden and the NSA was fresh news around 20132014, so it makes sense that it’s declined since. There are also some notable technologies that people write about less, such as AngularJS, HTML5, and Ruby on Rails. It’s interesting to compare these to Stack Overflow Trends during that time period, in which AngularJS has been growing in terms of Stack Overflow questions asked (HTML5 and Rails have leveled off). It’s possible that discussion on HN is a “leading indicator”, showing a surge articles when a technology first gets popular.
(I don’t currently have a guess for why “million” and “billion” had sudden dropoffs in 2014. Is it some artifact of the Hacker News policy, with the word becoming edited or deleted in newer posts? Or is it a real change in what the site discusses?)
Interestingly, the conversation around “bitcoin” peaked around 20132014 (with a recent uptick due to a surge in Bitcoin’s price), while discussion of the blockchain has been growing in the years since. Comparing them on the same axes makes in clear that the blockchain has roughly “caught up to” bitcoin in terms of general interest:
Words that passed their peakWe can learn a lot by examining words , but we’re misses an important piece of the puzzle: there are some words that grew in interest for part of this time, then “passed their peak.” This is more complicated to explore quantitatively, but one approach is to find words where the ratio of the peak in the trend to the average is especially high. To reduce the effect of random noise, we use a cubic spline to smooth each trend before determining the peak.
library(splines) mod2 < ~ glm(cbind(n, month_total  n) ~ ns(year, 4), ., family = "binomial") # Fit a cubic spline to each shape spline_predictions < word_month_counts %>% mutate(year = as.integer(as.Date(month)) / 365) %>% nest(word) %>% mutate(model = map(data, mod2)) %>% unnest(map2(model, data, augment, type.predict = "response")) # Find the terms with the highest peak / average ratio peak_per_month < spline_predictions %>% group_by(word) %>% mutate(average = mean(.fitted)) %>% top_n(1, .fitted) %>% ungroup() %>% mutate(ratio = .fitted / average) %>% filter(month != min(month), month != max(month)) %>% top_n(16, ratio)Here are 16 terms that had strong peaks at various point in the last 3.5 years.
peak_per_month %>% select(word, peak = month) %>% inner_join(spline_predictions, by = "word") %>% mutate(word = reorder(word, peak)) %>% ggplot(aes(month, percent)) + geom_line(aes(color = word), show.legend = FALSE) + geom_line(aes(y = .fitted), lty = 2) + facet_wrap(~ word, scales = "free_y") + scale_y_continuous(labels = percent_format()) + expand_limits(y = 0) + labs(x = "Year", y = "Percent of titles containing this term", title = "16 words that peaked then declined in Hacker News titles", subtitle = "Spline fit (df = 4) shown.\nSelected based on the peak of the spline divided by the overall average; ordered by peak month.")We can see a peak of discussion around net neutrality in 2015 (though it’s shown a recent resurgence of interest). You can spot the introduction of Swift and the Apple Watch, then a moderate decline in discussion around them, and there are sudden jolts of discussion around Oculus in 2014 (with Facebook’s purchase of the company) and the FBI in 2016 (news around Clinton’s email server). The graph suggests the discussion around Slack, bots, Tesla, and virtual reality may have leveled off (though in some cases it may be too early to tell).
What’s nextI have some more analyses of Hacker News posts I’ll be sharing in the future, including analyses of word correlations, examination of what words and topics tend to get upvoted, and some improvements on the Tagger News classification algorithm. The series will also give me the chance to highlight more techniques from Text Mining with R, including word networks and topic modeling.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Variance Explained. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Testdriving Microsoft Cognitive Toolkit in R using reticulate
(This article was first published on R snippets, and kindly contributed to Rbloggers)
Recently new tools for data science pop up constantly, so it is hard to keep up with the changes and choose those that promise to be useful in the long run.
Recently two new solutions were announced that look very promising: reticulate package for R and Microsoft Cognitive Toolkit 2.0 (CNTK). Together with Wit Jakuczun we have decided to test drive them on Azure Data Science Virtual Machine.
CNTK is a very promising deep learning toolkit, backed by Microsoft. Unfortunately it provides bindings to Python and not to R. Therefore it is an ideal situation to test reticulate package, which provides R interface to Python. We wanted to test CNTK on GPUs, so we have decided to use Azure Data Science Virtual Machine using NV6 Instance with Tesla M60 GPU.
For our testdrive we have used classic MNIST dataset. You can find all the source codes and detailed results of the test on GitHub. The short conclusion is that all components worked and played together excellently:
 on CNTK we have build MLP network model, which has reported 2.3% classification error, without any special tuning, which is a pretty decent result;
 calling Python library from R using reticulate was simple and stable;
 Calculations using GPU give a significant speedup and configuration of Azure DSVM was really simple.
To leave a comment for the author, please follow the link and comment on their blog: R snippets. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Deep Learning with R
(This article was first published on R Programming – DataScience+, and kindly contributed to Rbloggers)
For R users, there hasn’t been a production grade solution for deep learning (sorry MXNET). This post introduces the Keras interface for R and how it can be used to perform image classification. The post ends by providing some code snippets that show Keras is intuitive and powerful.
TensorflowLast January, Tensorflow for R was released, which provided access to the Tensorflow API from R. However, for most R users, the interface was not very R like.
Take a look at this code chunk for training a model:
cross_entropy < tf$reduce_mean(tf$reduce_sum(y_ * tf$log(y_conv), reduction_indices=1L)) train_step < tf$train$AdamOptimizer(1e4)$minimize(cross_entropy) correct_prediction < tf$equal(tf$argmax(y_conv, 1L), tf$argmax(y_, 1L)) accuracy < tf$reduce_mean(tf$cast(correct_prediction, tf$float32)) sess$run(tf$global_variables_initializer()) for (i in 1:20000) { batch < mnist$train$next_batch(50L) if (i %% 100 == 0) { train_accuracy < accuracy$eval(feed_dict = dict( x = batch[[1]], y_ = batch[[2]], keep_prob = 1.0)) cat(sprintf("step %d, training accuracy %g\n", i, train_accuracy)) } train_step$run(feed_dict = dict( x = batch[[1]], y_ = batch[[2]], keep_prob = 0.5)) } test_accuracy % fit( x = train_x, y = train_y, epochs=epochs, batch_size=batch_size, validation_data=valid) Image Classification with KerasSo if you are still with me, let me show you how to build deep learning models using R, Keras, and Tensorflow together. You will find a Github repo that contains the code and data you will need. Included is an R notebook that walks through building an image classifier (telling cat from dog), but can easily be generalized to other images. The walk through includes advanced methods that are commonly used for production deep learning work including:
 augmenting data
 using the bottleneck features of a pretrained network
 finetuning the top layers of a pretrained network
 saving weights for your models
The R interface to Keras truly makes it easy to build deep learning models in R. Here are some code snippets to illustrate how intuitive and useful Keras for R is:
To load picture from a folder:
train_generator < flow_images_from_directory(train_directory, generator = image_data_generator(), target_size = c(img_width, img_height), color_mode = "rgb", class_mode = "binary", batch_size = batch_size, shuffle = TRUE, seed = 123)To define a simple convolutional neural network:
model % layer_conv_2d(filter = 32, kernel_size = c(3,3), input_shape = c(img_width, img_height, 3)) %>% layer_activation("relu") %>% layer_max_pooling_2d(pool_size = c(2,2)) %>% layer_conv_2d(filter = 32, kernel_size = c(3,3)) %>% layer_activation("relu") %>% layer_max_pooling_2d(pool_size = c(2,2)) %>% layer_conv_2d(filter = 64, kernel_size = c(3,3)) %>% layer_activation("relu") %>% layer_max_pooling_2d(pool_size = c(2,2)) %>% layer_flatten() %>% layer_dense(64) %>% layer_activation("relu") %>% layer_dropout(0.5) %>% layer_dense(1) %>% layer_activation("sigmoid")To augment data:
augment < image_data_generator(rescale=1./255, shear_range=0.2, zoom_range=0.2, horizontal_flip=TRUE)To load a pretrained network:
model_vgg < application_vgg16(include_top = FALSE, weights = "imagenet")To save model weights:
save_model_weights_hdf5(model_ft, 'finetuning_30epochs_vggR.h5', overwrite = TRUE)I believe the Keras for R interface will make it much easier for R users and the R community to build and refine deep learning models with R. This means you don't have to force everyone to use python to build, refine, and test your models. I really think this will open up deep learning to a wider audience that was a bit apprehensive on using Python. So for now, give it a spin!
Grab my github repo, fire up RStudio (or your IDE of choice), and go build a simple classifier using Keras.
Related Post
 Unsupervised Learning and Text Mining of Emotion Terms Using R
 Using MCA and variable clustering in R for insights in customer attrition
 Web Scraping and Applied Clustering Global Happiness and Social Progress Index
 Key Phrase Extraction from Tweets
 Financial time series forecasting – an easy approach
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Add Pvalues and Significance Levels to ggplots
(This article was first published on Easy Guides, and kindly contributed to Rbloggers)
In this article, we’ll describe how to easily i) compare means of two or multiple groups; ii) and to automatically add pvalues and significance levels to a ggplot (such as box plots, dot plots, bar plots and line plots …).
Contents:
 Prerequisites
 Methods for comparing means
 R functions to add pvalues
 Compare two independent groups
 Compare two paired samples
 Compare more than two groups
 Multiple grouping variables
 Other plot types
 Infos
Required R package: ggpubr (version >= 0.1.3), for ggplot2based publication ready plots.
 Install from CRAN as follow:
 Or, install the latest developmental version from GitHub as follow:
 Load ggpubr:
Official documentation of ggpubr is available at: http://www.sthda.com/english/rpkgs/ggpubr
Demo data setsData: ToothGrowth data sets.
data("ToothGrowth") head(ToothGrowth) len supp dose 1 4.2 VC 0.5 2 11.5 VC 0.5 3 7.3 VC 0.5 4 5.8 VC 0.5 5 6.4 VC 0.5 6 10.0 VC 0.5 Methods for comparing meansThe standard methods to compare the means of two or more groups in R, have been largely described at: comparing means in R.
The most common methods for comparing means include:
Methods R function Description Ttest t.test() Compare two groups (parametric) Wilcoxon test wilcox.test() Compare two groups (nonparametric) ANOVA aov() or anova() Compare multiple groups (parametric) KruskalWallis kruskal.test() Compare multiple groups (nonparametric)A practical guide to compute and interpret the results of each of these methods are provided at the following links:
 Comparing onesample mean to a standard known mean:
 Comparing the means of two independent groups:
 Comparing the means of paired samples:

Comparing the means of more than two groups
 Analysis of variance (ANOVA, parametric):
 KruskalWallis Test in R (non parametric alternative to oneway ANOVA)
Here we present two new R functions in the ggpubr package:
 compare_means(): easy to use solution to performs one and multiple mean comparisons.
 stat_compare_means(): easy to use solution to automatically add pvalues and significance levels to a ggplot.
As we’ll show in the next sections, it has multiple useful options compared to the standard R functions.
The simplified format is as follow:
compare_means(formula, data, method = "wilcox.test", paired = FALSE, group.by = NULL, ref.group = NULL, ...) formula: a formula of the form x ~ group, where x is a numeric variable and group is a factor with one or multiple levels. For example, formula = TP53 ~ cancer_group. It’s also possible to perform the test for multiple response variables at the same time. For example, formula = c(TP53, PTEN) ~ cancer_group.

data: a data.frame containing the variables in the formula.

method: the type of test. Default is “wilcox.test”. Allowed values include:
 “t.test” (parametric) and “wilcox.test”” (nonparametric). Perform comparison between two groups of samples. If the grouping variable contains more than two levels, then a pairwise comparison is performed.
 “anova” (parametric) and “kruskal.test” (nonparametric). Perform oneway ANOVA test comparing multiple groups.

paired: a logical indicating whether you want a paired test. Used only in t.test and in wilcox.test.

group.by: variables used to group the data set before applying the test. When specified the mean comparisons will be performed in each subset of the data formed by the different levels of the group.by variables.

ref.group: a character string specifying the reference group. If specified, for a given grouping variable, each of the group levels will be compared to the reference group (i.e. control group). ref.group can be also “.all.”. In this case, each of the grouping variable levels is compared to all (i.e. basemean).
This function extends ggplot2 for adding mean comparison pvalues to a ggplot, such as box blots, dot plots, bar plots and line plots.
The simplified format is as follow:
stat_compare_means(mapping = NULL, comparisons = NULL hide.ns = FALSE, label = NULL, label.x = NULL, label.y = NULL, ...)
mapping: Set of aesthetic mappings created by aes().

comparisons: A list of length2 vectors. The entries in the vector are either the names of 2 values on the xaxis or the 2 integers that correspond to the index of the groups of interest, to be compared.

hide.ns: logical value. If TRUE, hide ns symbol when displaying significance levels.

label: character string specifying label type. Allowed values include “p.signif” (shows the significance levels), “p.format” (shows the formatted p value).

label.x,label.y: numeric values. coordinates (in data units) to be used for absolute positioning of the label. If too short they will be recycled.

…: other arguments passed to the function compare_means() such as method, paired, ref.group.
Perform the test:
compare_means(len ~ supp, data = ToothGrowth) # A tibble: 1 x 8 .y. group1 group2 p p.adj p.format p.signif method 1 len OJ VC 0.06449067 0.06449067 0.064 ns WilcoxonBy default method = “wilcox.test” (nonparametric test). You can also specify method = “t.test” for a parametric ttest.
Returned value is a data frame with the following columns:
 .y.: the y variable used in the test.
 p: the pvalue
 p.adj: the adjusted pvalue. Default value for p.adjust.method = “holm”
 p.format: the formatted pvalue
 p.signif: the significance level.
 method: the statistical test used to compare groups.
Create a box plot with pvalues:
p < ggboxplot(ToothGrowth, x = "supp", y = "len", color = "supp", palette = "jco", add = "jitter") # Add pvalue p + stat_compare_means() # Change method p + stat_compare_means(method = "t.test")Note that, the pvalue label position can be adjusted using the arguments: label.x, label.y, hjust and vjust.
The default pvalue label displayed is obtained by concatenating the method and the p columns of the returned data frame by the function compare_means(). You can specify other combinations using the aes() function.
For example,
 aes(label = ..p.format..) or aes(label = paste0(“p =”, ..p.format..)): display only the formatted pvalue (without the method name)
 aes(label = ..p.signif..): display only the significance level.
 aes(label = paste0(..method.., “\n”, “p =”, ..p.format..)): Use line break (“\n”) between the method name and the pvalue.
As an illustration, type this:
p + stat_compare_means( aes(label = ..p.signif..), label.x = 1.5, label.y = 40)If you prefer, it’s also possible to specify the argument label as a character vector:
p + stat_compare_means( label = "p.signif", label.x = 1.5, label.y = 40) Compare two paired samplesPerform the test:
compare_means(len ~ supp, data = ToothGrowth, paired = TRUE) # A tibble: 1 x 8 .y. group1 group2 p p.adj p.format p.signif method 1 len OJ VC 0.004312554 0.004312554 0.0043 ** WilcoxonVisualize paired data using the ggpaired() function:
ggpaired(ToothGrowth, x = "supp", y = "len", color = "supp", line.color = "gray", line.size = 0.4, palette = "jco")+ stat_compare_means(paired = TRUE) Compare more than two groups Global test:
Plot with global pvalue:
# Default method = "kruskal.test" for multiple groups ggboxplot(ToothGrowth, x = "dose", y = "len", color = "dose", palette = "jco")+ stat_compare_means() # Change method to anova ggboxplot(ToothGrowth, x = "dose", y = "len", color = "dose", palette = "jco")+ stat_compare_means(method = "anova") Pairwise comparisons. If the grouping variable contains more than two levels, then pairwise tests will be performed automatically. The default method is “wilcox.test”. You can change this to “t.test”.
If you want to specify the precise y location of bars, use the argument label.y:
ggboxplot(ToothGrowth, x = "dose", y = "len", color = "dose", palette = "jco")+ stat_compare_means(comparisons = my_comparisons, label.y = c(29, 35, 40))+ stat_compare_means(label.y = 45)(Adding bars, connecting compared groups, has been facilitated by the ggsignif R package )
 Multiple pairwise tests against a reference group:
 Multiple pairwise tests against all (basemean):
A typical situation, where pairwise comparisons against “all” can be useful, is illustrated here using the myeloma data set from the survminer package.
We’ll plot the expression profile of the DEPDC1 gene according to the patients’ molecular groups. We want to know if there is any difference between groups. If yes, where the difference is?
To answer to this question, you can perform a pairwise comparison between all the 7 groups. This will lead to a lot of comparisons between all possible combinations. If you have many groups, as here, it might be difficult to interpret.
Another easy solution is to compare each of the seven groups against “all” (i.e. basemean). When the test is significant, then you can conclude that DEPDC1 is significantly overexpressed or downexpressed in a group xxx compared to all.
# Load myeloma data from survminer package if(!require(survminer)) install.packages("survminer") data("myeloma", package = "survminer") # Perform the test compare_means(DEPDC1 ~ molecular_group, data = myeloma, ref.group = ".all.", method = "t.test") # A tibble: 7 x 8 .y. group1 group2 p p.adj p.format p.signif method 1 DEPDC1 .all. Cyclin D1 1.496896e01 4.490687e01 0.14969 ns Ttest 2 DEPDC1 .all. Cyclin D2 5.231428e01 1.000000e+00 0.52314 ns Ttest 3 DEPDC1 .all. Hyperdiploid 2.815333e04 1.689200e03 0.00028 *** Ttest 4 DEPDC1 .all. Low bone disease 5.083985e03 2.541992e02 0.00508 ** Ttest 5 DEPDC1 .all. MAF 8.610664e02 3.444265e01 0.08611 ns Ttest 6 DEPDC1 .all. MMSET 5.762908e01 1.000000e+00 0.57629 ns Ttest 7 DEPDC1 .all. Proliferation 1.241416e09 8.689910e09 1.2e09 **** Ttest # Visualize the expression profile ggboxplot(myeloma, x = "molecular_group", y = "DEPDC1", color = "molecular_group", add = "jitter", legend = "none") + rotate_x_text(angle = 45)+ geom_hline(yintercept = mean(myeloma$DEPDC1), linetype = 2)+ # Add horizontal line at base mean stat_compare_means(method = "anova", label.y = 1600)+ # Add global annova pvalue stat_compare_means(label = "p.signif", method = "t.test", ref.group = ".all.") # Pairwise comparison against allFrom the plot above, we can conclude that DEPDC1 is significantly overexpressed in proliferation group and, it’s significantly downexpressed in Hyperdiploid and Low bone disease compared to all.
Note that, if you want to hide the ns symbol, specify the argument hide.ns = TRUE.
# Visualize the expression profile ggboxplot(myeloma, x = "molecular_group", y = "DEPDC1", color = "molecular_group", add = "jitter", legend = "none") + rotate_x_text(angle = 45)+ geom_hline(yintercept = mean(myeloma$DEPDC1), linetype = 2)+ # Add horizontal line at base mean stat_compare_means(method = "anova", label.y = 1600)+ # Add global annova pvalue stat_compare_means(label = "p.signif", method = "t.test", ref.group = ".all.", hide.ns = TRUE) # Pairwise comparison against all Multiple grouping variables Two independent sample comparisons after grouping the data by another variable:
Perform the test:
compare_means(len ~ supp, data = ToothGrowth, group.by = "dose") # A tibble: 3 x 9 dose .y. group1 group2 p p.adj p.format p.signif method 1 0.5 len OJ VC 0.023186427 0.04637285 0.023 * Wilcoxon 2 1.0 len OJ VC 0.004030367 0.01209110 0.004 ** Wilcoxon 3 2.0 len OJ VC 1.000000000 1.00000000 1.000 ns WilcoxonIn the example above, for each level of the variable “dose”, we compare the means of the variable “len” in the different groups formed by the grouping variable “supp”.
Visualize (1/2). Create a multipanel box plots facetted by group (here, “dose”):
# Box plot facetted by "dose" p < ggboxplot(ToothGrowth, x = "supp", y = "len", color = "supp", palette = "jco", add = "jitter", facet.by = "dose", short.panel.labs = FALSE) # Use only p.format as label. Remove method name. p + stat_compare_means(label = "p.format") # Or use significance symbol as label p + stat_compare_means(label = "p.signif", label.x = 1.5)To hide the ‘ns’ symbol, use the argument hide.ns = TRUE.
Visualize (2/2). Create one single panel with all box plots. Plot y = “len” by x = “dose” and color by “supp”:
p < ggboxplot(ToothGrowth, x = "dose", y = "len", color = "supp", palette = "jco", add = "jitter") p + stat_compare_means(aes(group = supp)) # Show only pvalue p + stat_compare_means(aes(group = supp), label = "p.format") # Use significance symbol as label p + stat_compare_means(aes(group = supp), label = "p.signif") Paired sample comparisons after grouping the data by another variable:
Perform the test:
compare_means(len ~ supp, data = ToothGrowth, group.by = "dose", paired = TRUE) # A tibble: 3 x 9 dose .y. group1 group2 p p.adj p.format p.signif method 1 0.5 len OJ VC 0.03296938 0.06593876 0.033 * Wilcoxon 2 1.0 len OJ VC 0.01905889 0.05717667 0.019 * Wilcoxon 3 2.0 len OJ VC 1.00000000 1.00000000 1.000 ns WilcoxonVisualize. Create a multipanel box plots facetted by group (here, “dose”):
# Box plot facetted by "dose" p < ggpaired(ToothGrowth, x = "supp", y = "len", color = "supp", palette = "jco", line.color = "gray", line.size = 0.4, facet.by = "dose", short.panel.labs = FALSE) # Use only p.format as label. Remove method name. p + stat_compare_means(label = "p.format", paired = TRUE) Other plot types Bar and line plots (one grouping variable):
 Bar and line plots (two grouping variables):
This analysis has been performed using R software (ver. 3.3.2) and ggpubr (ver. 0.1.3).
jQuery(document).ready(function () { jQuery('#rdoc h1').addClass('wiki_paragraph1'); jQuery('#rdoc h2').addClass('wiki_paragraph2'); jQuery('#rdoc h3').addClass('wiki_paragraph3'); jQuery('#rdoc h4').addClass('wiki_paragraph4'); });//add phpboost class to header
.content{padding:0px;}
To leave a comment for the author, please follow the link and comment on their blog: Easy Guides. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));UK R Courses
(This article was first published on R – Why?, and kindly contributed to Rbloggers)
Over the next few months we’re running a number of R, Stan and Scala courses around the UK.
June (London) Mon Jun 26 – Introduction to R
 Tue Jun 27 – Statistical Modelling with R
 Wed Jun 28 – Programming with R
 Thu Jun 29 (2day course) – Advanced R Programming
 Thu Aug 10 – Programming with R
 Fri Aug 11 – Building an R Package
 Mon Sep 11 – Introduction to R
 Tue Sep 12 Statistical Modelling with R
 Wed Sep 13 Programming with R
 Thu Sep 14 Advanced Graphics with R
 Fri Sep 15 Automated Reporting (first steps towards Shiny)
 Mon Sep 18 (5day course) – Bioconductor
 Wed Oct 25 – Advanced Graphics with R
 Thu Oct 26 (2day course) – Advanced R Programming
 Mon Nov 06 (1day course) – Efficient R Programming
 Tue Nov 07 (1day course) – R for Big Data
 Thu Dec 07 (2day course) – Introduction to Bayesian Inference using RStan (Newcastle)
 Mon Dec 11 (3day course) – Scala for Statistical Computing and Data Science (London)
See the website for course descriptions. Any questions, feel free to contact me: colin@jumpingrivers.com
On site courses available on request.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Why?. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Unconf projects 4: cityquant, notary, packagemetrics, pegax
(This article was first published on rOpenSci Blog, and kindly contributed to Rbloggers)
Continuing our series of blog posts (day 1, day 2, day 3) this week about unconf 17.
cityquantSummary: The goal with the cityquant project was to build a digital dashboard for sustainable cities.
They also had a "spinoff" project called selfquant to get data from a quantified self google sheets template to keep track of weekly performance in various categories.
Team: Reka Solymosi, Ben Best, Chelsea Ursaner, Tim Phan, Jasmine Dumas
Github: https://github.com/ropenscilabs/cityquant
notarySummary: notary is actually two things:
notary: An R package for signing and verification of R packages. It has functions for installing and verifying packages, validating GitHub releases, sourcing files with verification, and more.
rsecuritypractices: A bookdown book targeting users, developers, and admins on R security best practices.
Team: Stephanie Locke, Oliver Keyes, Rich FitzJohn, Bob Rudis, Joroen Ooms
Github: https://github.com/ropenscilabs/notary / https://github.com/ropenscilabs/rsecuritypractices
packagemetricsSummary: packagemetrics is a package for helping you choose which package to use. Their tool collects metrics including CRAN downloads, GitHub stars, whether it's tidyverse compatible, whether it has tests and vignettes, number of contributors, and more!
This project combined two ideas from our brainstorming stage: Avoiding redundant / overlapping packages and A framework for reproducible tables.
Team: Erin Grand, Sam Firke, Hannah Frick, Becca Krouse, Lori Shepherd
Github: https://github.com/ropenscilabs/packagemetrics
pegaxSummary: pegax is a very alpha client for parsing taxonomic names. Taxonomic names are things such as Homo sapiens (human beings) wikispecies, or Ursus americanus (american black bear) wikispecies, or Balaenoptera musculus (blue whale) wikispecies. Taxonomic names can be hard to parse – and thus something called Parsing Expression Grammar (PEG) can be employed to help. We were lucky that Oliver Keyes just started an R package for PEGs in R called piton – which is now used in pegax to parse taxonomic names.
Team:
Scott Chamberlain (with help from Oliver Keyes)
Github: https://github.com/ropenscilabs/pegax
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Tic Tac Toe Part 3: The Minimax Algorithm
(This article was first published on The Devil is in the Data, and kindly contributed to Rbloggers)
In two previous posts, I presented code to teach R to play the trivial game of Tic Tac Toe. I thought this was an unbeatable algorithm. Alas, a comment from Alberto shattered my pride as he was able to beat my code.
The reason for the demise of my code was that I didn’t implement a full minimax algorithm, but instead looked only two moves ahead. I thought the common strategy rules (start in the centre and occupy the corners) would make the program unbeatable. When I simulated the game by instructing the computer to play against itself, Alberto’s strategy never arose because the code forces the centre field. Alberto’s code shows that you need to look at least three moves ahead for a perfect game. He has been so kind to share his code and gave me permission to publish it.
Alberto recreated two functions, for completeness I have added the complete working code that merges his improvements with my earlier work. The first two functions are identical to the previous post. These functions draw the game board and process the human player’s move by waiting for a mouse click.
# Draw the game board draw.board < function(game) { xo < c("X", " ", "O") # Symbols par(mar = rep(1,4)) plot.new() plot.window(xlim = c(0,30), ylim = c(0,30)) abline(h = c(10, 20), col="darkgrey", lwd = 4) abline(v = c(10, 20), col="darkgrey", lwd = 4) text(rep(c(5, 15, 25), 3), c(rep(25, 3), rep(15,3), rep(5, 3)), xo[game + 2], cex = 4) # Identify location of any three in a row square < t(matrix(game, nrow = 3)) hor < abs(rowSums(square)) if (any(hor == 3)) hor < (4  which(hor == 3)) * 10  5 else hor < 0 ver < abs(colSums(square)) if (any(ver == 3)) ver < which(ver == 3) * 10  5 else ver < 0 diag1 < sum(diag(square)) diag2 < sum(diag(t(apply(square, 2, rev)))) # Draw winning lines if (all(hor > 0)) for (i in hor) lines(c(0, 30), rep(i, 2), lwd = 10, col="red") if (all(ver > 0)) for (i in ver) lines(rep(i, 2), c(0, 30), lwd = 10, col="red") if (abs(diag1) == 3) lines(c(2, 28), c(28, 2), lwd = 10, col = "red") if (abs(diag2) == 3) lines(c(2, 28), c(2, 28), lwd = 10, col = "red") } # Human player enters a move move.human < function(game) { text(4, 0, "Click on screen to move", col = "grey", cex=.7) empty < which(game == 0) move < 0 while (!move %in% empty) { coords < locator(n = 1) # add lines coords$x < floor(abs(coords$x) / 10) + 1 coords$y < floor(abs(coords$y) / 10) + 1 move < coords$x + 3 * (3  coords$y) } return (move) }Alberto rewrote the functions that analyse the board and determine the move of the computer. The ganador (Spanish for winning) function assesses the board condition by assigning 10 or + 10 for a winning game and 0 for any other situation.
ganador < function(juego, player) { game < matrix(juego, nrow = 3, byrow = T) hor < rowSums(game) ver < colSums(game) diag < c(sum(diag(game)), sum(diag(apply(game, 1, rev)))) if (3 %in% c(hor, ver, diag)) return(10) if (3 %in% c(hor, ver, diag)) return(10) else return(0) }The next function is the actual minimax algorithm. If the computer starts then the first move ( options to assess) takes a little while. The commented lines can be used to force a corner and make the games faster by forcing a random corner.
The minimax function returns a list with the move and its valuation through the ganador function. The function works recursively until it has filled the board and retains the best scoring move using the minimax method. To avoid the computer always playing the same move in the same situation random variables are added.
minimax < function(juego, player) { free < which(juego == 0) if(length(free) == 1) { juego[free] < player return(list(move = free, U = ganador(juego, player))) } poss.results < rep(0, 9) for(i in free) { game < juego game[i] < player poss.results[i] < ganador(game, player) } mm < ifelse(player == 1, "which.min", "which.max") if(any(poss.results == (player * 10))) { move < do.call(mm, list(poss.results)) return(list(move = move, U = poss.results[move])) } for(i in free) { game < juego game[i] < player poss.results[i] < minimax(game, player)$U } random < runif(9, 0, 0.1) poss.results[free] < 100 * player poss.results < poss.results + (player * random) move < do.call(mm, list(poss.results)) return(list(move = move, U = poss.results[move])) }This final function stitches everything together and lets you play the game. Simply paste all functions in your R console and run them to play a game. The tic.tac.toe function can take two parameters, “human” and/or “computer”. The order of the parameters determines who starts the game.
# Main game engine tic.tac.toe < function(player1 = "human", player2 = "computer") { game < rep(0, 9) # Empty board winner < FALSE # Define winner player < 1 # First player players < c(player1, player2) draw.board(game) while (0 %in% game & !winner) { # Keep playing until win or full board if (players[(player + 3) %% 3] == "human") # Human player move < move.human(game) else { # Computer player move < minimax(game, player) move < move$move } game[move] < player # Change board draw.board(game) winner < max(eval.game(game, 1), abs(eval.game(game, 1))) == 6 # Winner, winner, chicken dinner? player < player # Change player } } tic.tac.toe()This is my last word on Tic Tac Toe but now that the minimax conundrum is solved I could start working on other similar games such as Connect Four, Draughts or even the royal game of Chess.
The post Tic Tac Toe Part 3: The Minimax Algorithm appeared first on The Devil is in the Data.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));The Ghost Printer Behind Toplevel R Expressions
(This article was first published on English Blog on Yihui Xie  谢益辉, and kindly contributed to Rbloggers)
For any developers who have ever written an S3 method for the print() function, they probably know what a toplevel R expression means, but this is a very confusing concept to nondevelopers. I have to explain this every now and then, so I decided to write a short post about it.
Yesterday I saw a Github issue in the rmarkdown repository, and you can see that there are still users confused by the fact that ggplot2 plots are not rendered in certain cases. I have seen similar questions perhaps hundreds of times. Such questions have been answered in the R FAQ 7.22 “Why do lattice/trellis graphics not work?”, but the answer didn’t explain the root reason in detail.
A toplevel R expression is usually implicitly printed. Both words can cause confusion: printing is implicit so that you probably don’t consciously know it, and printing means the print() function is called on the object returned by the expression. For example, when you type 1 + 1 in the R console, and press Enter/Return, what actually happens is print(2), where 2 is the value returned by 1 + 1.
The reason that you create a ggplot() object in your R console and it shows up after you press Enter is that ggplot2 has defined a print.ggplot method on such objects. ggplot() does not actually create the plot – it only creates a ggplot object. It is the print method that actually creates the plot in a graphical device (as a sideeffect).
Expressions nested in other expressions are not toplevel expressions. For example, in a for loop, ggplot objects or HTML widgets or knitr::kable() cannot be displayed because they are not printed, unless you explicitly print() them.
Toplevel R expressions are not always printed. They are not printed if they are marked as invisible. For example, if you run invisible(1 + 1) in the R console, you won’t see any value printed.
Many R functions return invisible values wrapped in the invisible() function. I want to show a few functions that you may have never recognized: if, (, and assignment operators such as = and < are all functions. Let’s start from the assignment operator:
library(ggplot2) p = ggplot(mpg, aes(cty, hwy)) + geom_point()This creates a ggplot object but you do not see the plot even though this is a toplevel expression, and that is because the assignment operator = returns the value invisibly. If you type p in the console later, you will see the plot since p is implicitly printed.
You can force a toplevel expression to be visible (hence printed) by wrapping it in (), e.g., (invisible(1 + 1)) will be printed, and so will the ggplot object below:
library(ggplot2) (p = ggplot(mpg, aes(cty, hwy)) + geom_point())The if function returns the value of a last expression depending on the condition. For example, if (TRUE) A else B returns A, and if (FALSE) {A} else {B; C} returns C. With that in mind, you will understand why the ggplot object is not displayed in the if statement below:
library(ggplot2) if (TRUE) { ggplot(mpg, aes(cty, hwy)) + geom_point() 1 + 1 # the value of this expression is returned (and printed) }Note for, while, and repeat loops always return an invisible NULL, so nothing will be printed automatically, including the last expression in the loop, e.g.,
library(ggplot2) for (i in 1:10) { ggplot(mpg, aes(cty, hwy)) + geom_point() }However, base R graphics is a different animal. All base R graphics functions will create plots no matter if they are called in toplevel R expressions. For example, this will create 10 plots:
for (i in 1:10) { plot(rnorm(100)) }WAT?
WAT!
Lastly, there is a function withVisible() that can tell you whether an expression returns a visible value or not, e.g.,
withVisible(1 + 1) # $value # [1] 2 # # $visible # [1] TRUE withVisible({x = 1 + 1}) # assignment in {} # $value # [1] 2 # # $visible # [1] FALSEI hope you could remember the “ghost printer” the next time you run any R code in your R console or R Markdown or any knitr documents.
location.href/*Tal, could u pls not modify my script? It is not cool. Thanks!*/='https://yihui.name/en/2017/06/toplevelrexpressions/';
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: English Blog on Yihui Xie  谢益辉. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Non gaussian timeseries, let’s handle it with score driven models!
(This article was first published on R – insightR, and kindly contributed to Rbloggers)
By Henrique Helfer
MotivationUntil very recently, only a very limited classes of feasible non Gaussian time series models were available. For example, one could use extensions of state space models to non Gaussian environments (see, for example, Durbin and Koopman (2012)), but extensive Monte Carlo simulation is required to numerically evaluate the conditional densities that define the estimation process of such models.
The high technicalities involved in implementing these algorithms and its accompanying computational cost have not helped its widespread use by practitioners. On the other hand, different attempts to extend ARMA type models with conditional non Gaussian distributions have been more successful. For example, the use of GARCH type models to deal with heavy tailed distributions in finance (Engle and Bollerslev (1986)), the Autoregressive Conditional Duration (ACD) model of Engle and Russell (1998) to tackle asymmetric distributions in time duration and the Poisson count models of Davis et al (2003) for the modelling of discrete events in time. But, so far, these extensions have lacked an unifying framework that would allow the specification, estimation and forecasting of a model based on an arbitrary non Gaussian distribution. The recently proposed Generalized Autoregressive Scores (GAS) models by Creal et al (2008, 2013), or dynamic conditional score (DCS) from Harvey (2013), offer an unifying framework to derive and estimate time series model with any conditional nonGaussian distribution, either discrete or continuous, univariate or multivariate. More specifically, in GAS models, conditional on past observations, a proper probability model, possibly non Gaussian, is chosen for the response variable at time . Then, by construction, time varying parameters can be accommodated according to an updating mechanism that uses the score as its driving force. The use of the score for updating timevarying parameters is intuitive given that it defines the steepest ascent direction for improving the model’s local fit in terms of the likelihood or density at time , given the current parameter position. In such an updating mechanism information from the whole density is used to track the evolution of time varying parameters.
Of course, in this post I will briefly explain the estimation framework of such models for our community, however I deeply encourage you our fellow “insighteR” to pay a visit to gasmodel.com, where you can find a whole section devoted to GAS models papers and see for yourself the diversity of applications.
Score driven modelsFirst of all, one should choose an specific distribution which support accommodates the range of values assumed by the time series of interest ,
where is the time varying parameter vector, while makes reference to the fixed parameter vector that will be estimated by maximum likelihood and collects all relevant information up to time .
Next, to fully specify a GAS model, one has to choose which parameters of the distribution will evolve in time and those that will be fixed. The time varying parameters will then follow a GAS(p,q) updating mechanism given by:
where .
To complete the description of the the updating mechanism of GAS models it is necessary to define the score given by,
where is the observation density function and is a scaling matrix with appropriate dimensions. Different choices of results in different GAS models: means that it will be use the inverse of Fisher Information matrix, the pseudoinverse of Fisher Information matrix the identity matrix.
ParametrizationIf we assume for instance the intensity of a poisson density, 0" title="\lambda_{t}>0" class="latex" />, it is natural to choose . From this, the previous Equations should also be reparametrized with regard to . Considering a monotonically increasing mapping function , then . Taking , the poisson GAS specification updating mechanism, will be
ApplicationTo elucidate a potential application, in this section we will generate and model a cont time series using GAS framework. More specifically, we will adopt a poisson density with its intensity parameter being updated trough a GAS mechanism. Also we will fix as scaling matrix. Following Table 1 from Creal et al (2008), the score is given by,
First of all, let’s create a function which simulates a cont time series from a poisson model with its intensity parameter following a GAS mechanism.
################################## #### GENERATING DGP ################################## A = 0.101 B = 0.395 w = 0.183 GAS_POISSON_GENERATE = function (w,A,B,size){ y = f = s = NULL ############################################### # Initial conditions for the recursion: ############################################### t=1 f[1] = 1 y[1] = rpois(1,f[1]) ############################################### # GAS mechanism: ############################################### for (t in 1:size){ s[t] = (y[t]exp(f[t]))*(exp(f[t])) f[t+1] = w + A * s[t] + B * f[t] y[t+1] = rpois(1,exp(f[t+1])) } return(list("y"=y,"f"=f)) }
In the sequel, one can simulate the values of a cont timeseries using the previous function. For instance, let us simulate a time series of size 1000 with , and .
set.seed(201930) serie.sim = GAS_POISSON_GENERATE(w,A,B,1000) par(mfrow=c(2,1)) ts.plot(serie.sim$y,ylab = "y") ts.plot(exp(serie.sim$f),ylab = "lambda")To estimate the values of , we will use Maximum likelihood as described in the original paper of Creal et al. (2013). To do this in R, we need to create a function which will maximize some quantity, i.e., loglikelihood.
################################## #### ESTIMATING ML ################################## GAS_POISSON = function (q, y=y){ A = q[1] B = q[2] w = q[3] ##### n = length(y) s = NULL f = NULL ##### f[1] = 0 for (t in 1 : n){ s[t] = (y[t]exp(f[t]))*(exp(f[t])) f[t+1] = w + A * s[t] + B * f[t] } # Here we have the loglikelihood. sum = 0 for (t in 2:length(y)){ sum = sum + dpois(y[t], lambda = exp(f[t]), log = TRUE) } return(sum/n) }With the Maximum likelihood function in hands, we can do the parameter optimization, using as initial condition for the parameters in a uniform between 0 and 0.1.
f = NULL; s= NULL y = serie.sim$y set.seed(8456) result = optim(runif(3,0,0.1), y=y,GAS_POISSON , method="BFGS", hessian=TRUE, control=list(fnscale=1)) param = result$parThe standard errors can be calculated using the Hessian inverse evaluated at the optimum point, i.e., under some mild conditions, the maximum likelihood estimator of is consistent and converges in distribution to
hessian = solve(as.matrix((length(y))*result$hessian)) inv.hessian = hessian stderrors = array(0,length(param)) for (t in 1:length(param)){stderrors[t] = sqrt(hessian[t,t])} estim = cbind(param,stderrors) estim ## param stderrors ## [1,] 0.07543154 0.01999293 ## [2,] 0.56880833 0.11422324 ## [3,] 0.20280978 0.05132407
Regarding the forecasting, similar to GARH models, the steps ahead distribution conditional on observations up to time , , is only analytical for , when it coincides with the chosen probability model. However, for 1" title="k>1" class="latex" /> it has to be evaluated using Monte Carlo simulation.
But first, let us create a function to calculate the series of timevarying parameters with our estimated parameters, .
GAS_POISSON_CALCULATE = function (par,y){ A = par[1] ; B = par[2] ; w=par[3] f = s = NULL f[1] = 0 n = length(y) ##### for (t in 1 : n){ s[t] = (y[t]exp(f[t]))*(exp(f[t])) f[t+1] = w + A * s[t] + B * f[t] } ##### return(list("y"=y,"f"=f)) } # Here we save the time varying parameter series model.values = GAS_POISSON_CALCULATE(result$par,y)Now we can use the last point of series to obtain an “initial condition” for the forecasting simulation. Let us create a function to calculate our predictions with confidence intervals.
GAS_POISSON_FORECATING = function(f.mod,steps.ahead,par){ f = f.mod[length(f.mod)] ################################################## n.sim=2000 # Number of Monte Carlo Simulations dens.pred = matrix(NA,steps.ahead,n.sim) # Matrix of simulated values with dimension steps ahead x MC simulations f.prev = NULL # simulated series of timevarying parameter f for each MC s.prev = NULL # simulated series of score s # Estimated parameters from theta A = par[1] ; B = par[2] ; w=par[3] ################################################## for(j in 1:n.sim){ #f.prev[1] = f[length(y)+1] f.prev[1] = f for (t in 1:steps.ahead){ dens.pred[t,j] = rpois(1, lambda = exp(f.prev[t])) # generate a random poisson value with intensity parameter modeled by GAS s.prev[t] = (dens.pred[t,j]exp(f.prev[t]))*(exp(f.prev[t])) # calculate the score f.prev[t+1] = w + A*s.prev[t] + B*f.prev[t] # update lambda } f.prev = NULL s.prev = NULL } ################################################## # Here we calculate the expected value of the predictive density and calculate the confidence intervals forecasting = NULL CI.sup = NULL CI.inf = NULL for(i in 1:steps.ahead){ forecasting[i]=mean(dens.pred[i,]) CI.sup[i] = quantile(dens.pred[i,],probs=0.975) CI.inf[i] = quantile(dens.pred[i,],probs=0.025) } return(list("forecasting"=forecasting,"CI.sup" = CI.sup, "CI.inf" = CI.inf)) }With the forecasting function already defined, one should only use as input the last value of the series , the number of steps ahead and .
forecast = GAS_POISSON_FORECATING(model.values$f,5,result$par) final.pred = cbind(forecast$forecasting,forecast$CI.sup,forecast$CI.inf) colnames(final.pred) = c("Mean", "UB", "LB") print(final.pred) ## Mean UB LB ## [1,] 1.0320 3 0 ## [2,] 1.2320 4 0 ## [3,] 1.1270 4 0 ## [4,] 1.1615 4 0 ## [5,] 1.1125 3 0 ReferencesCreal, D., Koopman, S. J., & Lucas, A. (2008). A general framework for observation driven timevarying parameter models, Tinbergen Institute,Tech. Rep.
Creal, D., Koopman, S. J., & Lucas, A. (2013). Generalized autoregressive score models with applications. Journal of Applied Econometrics, 28(5), 777795.
Davis, R. A., Dunsmuir, W. T., & Streett, S. B. (2003). Observationdriven models for Poisson counts. Biometrika, 777790.
Durbin, J., & Koopman, S. J. (2012). Time series analysis by state space methods (Vol. 38). OUP Oxford.
Engle, R. F., & Bollerslev, T. (1986). Modelling the persistence of conditional variances. Econometric reviews, 5(1), 150.
Engle, R. F., & Russell, J. R. (1998). Autoregressive conditional duration: a new model for irregularly spaced transaction data. Econometrica, 11271162.
Harvey, A. C. (2013). Dynamic models for volatility and heavy tails: with applications to financial and economic time series (Vol. 52). Cambridge University Press.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – insightR. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));How to create dotdensity maps in R
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
Choropleths are a common approach to visualizing data on geographic maps. But choropleths — by design or necessity — aggregate individual data points into a single geographic region (like a country or census tract), which is all shaded a single colour. This can introduce interpretability issues (are we seeing changes in the variable of interest, or just population density?) and can fail to express the richness of the underlying data.
For an alternative approach, take a look at the recent Culture of Insight blog post which provides a tutorial on creating dotdensity maps in R. The chart below is based on UK Census data. Each point represents 10 London residents, with the colour representing one of five ethnic categories. Now, the UK census only reports ethnic ratios on a boroughbyborough basis, so the approach here is to simulate the individual resident data (which is not available) by randomly distributing points across the borough following the reported distribution. In a way, this is suggesting a level of precision which isn't available in the source data, but it does provide a visualization of London's ethnic diversity that isn't confounded with the underlying population distribution.
Follow the link below to the detailed blog post, which includes R code (in both base and ggplot2 graphics) for creating density dotcharts like these. Also be sure to check out the zoomable version of the chart at the top of the page, which used Microsoft's Deep Zoom Composer in conjunction with OpenSeadragon to provide the zooming capability.
Culture of Insight: Building Dot Density Maps with UK Census Data in R
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));More on safe substitution in R
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
Let’s worry a bit about substitution in R. Substitution is very powerful, which means it can be both used and misused. However, that does not mean every use is unsafe or a mistake.
From Advanced R : substitute:
We can confirm the above code performs no substitution:
a < 1 b < 2 substitute(a + b + z) ## a + b + zAnd it appears the effect is that substitute is designed to not take values from the global environment. So, as we see below, it isn’t so much what environment we are running in that changes substitute’s behavior, it is what environment the values are bound to that changes things.
(function() { a < 1 substitute(a + b + z, environment()) })() ## 1 + b + zWe can in fact find many simple variations of substitute that work conveniently.
substitute(a + b + z, list(a=1, b=2)) ## 1 + 2 + z substitute(a + b + z, as.list(environment())) ## 1 + 2 + zOften R‘s documentation is a bit terse (or even incomplete) and functions (confusingly) change behavior based on type of arguments and context. I say: always try a few variations to see if some simple alteration can make "baseR" work for you before giving up and delegating everything to an addon package.
However, we in fact found could not use substitute() to implement wrapr::let() effects (that is remapping nonstandard interfaces to parametric interfaces). There were some avoidable difficulties regarding quoting and unquoting of expressions. But the killing issue was: substitute() apparently does not remap lefthand sides:
# function that print all of its arguments (including bindings) f < function(...) { args < match.call() print(paste("f() call is:", capture.output(str(args)))) } # set up some global variables X < 2 B < 5 # try it f(X=7, Y=X) ## [1] "f() call is: language f(X = 7, Y = X)" # use substitute to capture an expression captured < substitute(f(X=7, Y=X)) # print the captured expression print(captured) ## f(X = 7, Y = X) # evaluate the captured expression eval(captured) ## [1] "f() call is: language f(X = 7, Y = X)" # notice above by the time we get into the function # the function arguments have taken there value first # from explicit argument assignment (X=7) and then from # the calling environment (Y=X goes to 2). # now try to use substitute() to remap values xform1 < substitute(captured, list(X= as.name('B'))) # doesn't look good in printing print(xform1) ## captured # and substitutions did not happen as the variables we # are trying to alter are not free in the word "captured" # (they are in the expression the name captured is referring to) eval(xform1) ## f(X = 7, Y = X) # can almost fix that by calling substitute on the value # of captured (not the word "captured") with do.call() subs < do.call(substitute, list(captured, list(X= as.name('B')))) print(subs) ## f(X = 7, Y = B) eval(subs) ## [1] "f() call is: language f(X = 7, Y = B)" # notice however, only right hand side was remapped # we saw "f(X = 7, Y = B)", not "f(B = 7, Y = B)" # for some packages (such as dplyr) remapping # lefthand sides is important # this is why wrapr::let() exists wrapr::let( c(X= 'B'), f(X=7, Y=X) ) ## [1] "f() call is: language f(B = 7, Y = B)"Remapping left hand sides is an important capability when trying to program over dplyr:
suppressPackageStartupMessages(library("dplyr")) d < data.frame(x = 1:3) mapping < c(OLDCOL= 'x', NEWCOL= 'y') wrapr::let( mapping, d %>% mutate(NEWCOL = OLDCOL*OLDCOL) ) ## x y ## 1 1 1 ## 2 2 4 ## 3 3 9wrapr::let() is based on string substitution. This is considered risky. Consider help(substitute, package='base')
Note
substitute works on a purely lexical basis. There is no guarantee that the resulting expression makes any sense.
And that is why wrapr::let() takes a large number of precautions and vets user input before performing any substitution.
The idea is: wrapr::let() is more specialized than substitute() so in addition to attempting extra effects (remapping left hand sides) it can introduce a lot of checks to ensure safe invariants.
And that is a bit of my point: when moving to a package look for specificity and safety in addition to "extra power." That is how wrapr::let() is designed and whey wrapr::let() is a safe and effective package to add to your production workflows.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));