
Rolling Origins and Fama French

Wed, 12/26/2018 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Today, we continue our work on sampling so that we can run models on subsets of our data and then test the accuracy of the models on data not included in those subsets. In the machine learning prediction world, these two data sets are often called training data and testing data, but we’re not going to do any machine learning prediction today. We’ll stay with our good ol’ Fama French regression models for the reasons explained last time: the goal is to explore a new method of sampling our data, and I prefer to do that in the context of a familiar model and data set. In 2019, we’ll start expanding our horizons to different models and data, but for today it’s more of a tools exploration.

Loyal readers of this blog (if that’s you, thank you!) will remember that we undertook a similar project in the previous post, when we used k-fold cross-validation. Today, we will use rolling origin sampling of the data, which differs from k-fold cross-validation in that with rolling origin we explicitly sample based on the dates of our observations. With rolling origin, our first sample starts on the first day, our second sample starts on the second day, our third sample starts on the third day. Or, we could have the second sample start on the twentieth day, or we could have it start again on the first day and just include an extra day. Either way, we are aware of the order of our data when sampling. With k-fold cross-validation, we were randomly shuffling and then selecting observations. This distinction can be particularly important for economic time series, where we believe that the order of our observations is important.
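To make the windowing concrete, here is a minimal sketch on a tiny made-up series. This toy example is not from the original post; the real data and parameters come below.

library(rsample)
library(tibble)

toy <- tibble(day = 1:10, return = rnorm(10))

toy_splits <- rolling_origin(toy, initial = 5, assess = 2, cumulative = FALSE)

# first split: analysis = days 1-5, assessment = days 6-7
analysis(toy_splits$splits[[1]])$day
assessment(toy_splits$splits[[1]])$day

# second split: the window slides forward one day (analysis = days 2-6)
analysis(toy_splits$splits[[2]])$day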

Let’s get to it.

First, we need our data and, as usual, we’ll import data for daily prices of 5 ETFs, convert them to returns (have a look here for a refresher on that code flow), then import the Fama French 5-factor data and join it to our 5 ETF returns data. Here’s the code to make that happen (this code was covered in detail in this post):

library(tidyverse)
library(tidyquant)
library(rsample)
library(broom)
library(timetk)

symbols <- c("SPY", "EFA", "IJS", "EEM", "AGG")

# The prices object will hold our daily price data.
prices <- getSymbols(symbols,
                     src = 'yahoo',
                     from = "2012-12-31",
                     to = "2017-12-31",
                     auto.assign = TRUE,
                     warnings = FALSE) %>%
  map(~Ad(get(.))) %>%
  reduce(merge) %>%
  `colnames<-`(symbols)

asset_returns_long <- prices %>%
  tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
  gather(asset, returns, -date) %>%
  group_by(asset) %>%
  mutate(returns = (log(returns) - log(lag(returns)))) %>%
  na.omit()

factors_data_address <- "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Global_5_Factors_Daily_CSV.zip"

factors_csv_name <- "Global_5_Factors_Daily.csv"

temp <- tempfile()

download.file(
  # location of file to be downloaded
  factors_data_address,
  # where we want R to store that file
  temp,
  quiet = TRUE)

Global_5_Factors <-
  read_csv(unz(temp, factors_csv_name), skip = 6) %>%
  rename(date = X1, MKT = `Mkt-RF`) %>%
  mutate(date = ymd(parse_date_time(date, "%Y%m%d"))) %>%
  mutate_if(is.numeric, funs(. / 100)) %>%
  select(-RF)

data_joined_tidy <- asset_returns_long %>%
  left_join(Global_5_Factors, by = "date") %>%
  na.omit()

After running that code, we have an object called data_joined_tidy. It holds daily returns for 5 ETFs and the Fama French factors. Here’s a look at the first row for each ETF.

data_joined_tidy %>%
  slice(1)

# A tibble: 5 x 8
# Groups: asset [5]
  date       asset  returns    MKT     SMB    HML     RMW     CMA
1 2013-01-02 AGG   -0.00117 0.0199 -0.0043 0.0028 -0.0028 -0.0023
2 2013-01-02 EEM    0.0194  0.0199 -0.0043 0.0028 -0.0028 -0.0023
3 2013-01-02 EFA    0.0154  0.0199 -0.0043 0.0028 -0.0028 -0.0023
4 2013-01-02 IJS    0.0271  0.0199 -0.0043 0.0028 -0.0028 -0.0023
5 2013-01-02 SPY    0.0253  0.0199 -0.0043 0.0028 -0.0028 -0.0023

For today, let’s work with just the daily returns of EEM.

eem_for_roll <- data_joined_tidy %>% filter(asset == "EEM")

We are going to regress those EEM daily returns on the Fama French factors and need a way to measure the accuracy of our various factor models. Previously, we measured our models by looking at the adjusted R-squared when the models were run on the entirety of the data. Last time, we used k-fold cross-validation to split the data into a bunch of subsets, then ran our model on the subsets and measured the accuracy against the holdouts from the subsets. Now, let’s use the rolling_origin() function from rsample to split our data into analysis and assessment sets in a time-aware way, then calculate RMSEs.

The rolling_origin() function needs a few arguments set. We first set data to be eem_for_roll. Then, we assign initial to be 50 – this tells the function that the size of our first sample is 50 days; our first chunk of analysis data will be the first 50 days of EEM returns. Next, we assign assess to be 10 – this tells the function that our assessment data is the 10 days of EEM returns following those first 50 days. Finally, we set cumulative to be FALSE – this tells the function that each of our splits contains 50 days of analysis data. The first split is the first 50 days, starting on day 1 and ending on day 50. The next split starts on day 2 and ends on day 51. If we were to set cumulative = TRUE, the first split would be 50 days, the next split 51 days, the next split 52 days, and so on – the analysis days would accumulate. For now, we will leave it at cumulative = FALSE.

For that reason, we will append _sliding to the name of our object because the start of our window will slide each time.
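For contrast, an expanding-window (cumulative = TRUE) version would change only that one argument. This sketch is illustrative and is not used anywhere below:

# not used below - just to show the cumulative (expanding window) variant
roll_eem_expanding <- rolling_origin(
  data       = eem_for_roll,
  initial    = 50,
  assess     = 10,
  cumulative = TRUE
)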

library(rsample)

roll_eem_sliding <- rolling_origin(
  data       = eem_for_roll,
  initial    = 50,
  assess     = 10,
  cumulative = FALSE
)

Look at an individual split.

one_eem_split <- roll_eem_sliding$splits[[1]]

one_eem_split

<50/10/1259>

That 50 is telling us there are 50 days, or rows, in the analysis set; that 10 is telling us that there are 10 rows in our assessment data – we’ll see how well our model predicts the returns over the 10 days following the last observation in our analysis data.

Here is the analysis subset of that split.

analysis(one_eem_split) %>%
  head()

# A tibble: 6 x 8
# Groups: asset [1]
  date       asset  returns      MKT      SMB      HML     RMW     CMA
1 2013-01-02 EEM    0.0194   0.0199  -0.0043  0.0028  -0.0028 -0.0023
2 2013-01-03 EEM   -0.00710 -0.0021   0.00120 0.000600 0.0008  0.0013
3 2013-01-04 EEM    0.00200  0.0064   0.0011  0.0056  -0.0043  0.0036
4 2013-01-07 EEM   -0.00759 -0.0014   0.00580 0       -0.0015  0.0001
5 2013-01-08 EEM   -0.00900 -0.0027   0.0018 -0.00120 -0.0002  0.00120
6 2013-01-09 EEM    0.00428  0.0036   0.000300 0.0025 -0.0028  0.001

And the assessment subset – this is 10 rows.

assessment(one_eem_split)

# A tibble: 10 x 8
# Groups: asset [1]
   date       asset   returns     MKT     SMB     HML     RMW      CMA
 1 2013-03-15 EEM   -0.00908   0.0025  0.006  -0.0001  0.0018  -0.0001
 2 2013-03-18 EEM   -0.0113   -0.0093  0.0021 -0.0039  0.0039  -0.001
 3 2013-03-19 EEM   -0.00712  -0.003   0.0015 -0.0023  0.0017   0.0028
 4 2013-03-20 EEM    0.00570   0.0052 -0.003   0.0023 -0.002    0.002
 5 2013-03-21 EEM   -0.0102   -0.0037  0.0085 -0.0007  0.00120  0.0008
 6 2013-03-22 EEM    0.00382   0.0033 -0.0042 -0.0023  0.0022  -0.0007
 7 2013-03-25 EEM   -0.000954 -0.0037  0.0036 -0.0032  0.0027   0.000600
 8 2013-03-26 EEM    0.0142    0.0039 -0.0032 -0.0017  0.00120 -0.0017
 9 2013-03-27 EEM    0.00376  -0.0016  0.0022 -0.0004 -0.0002  -0.0008
10 2013-03-28 EEM    0.00211   0.0033 -0.0022 -0.0031  0.000600 0.000600

By way of comparison, here’s what the k-fold cross-validated data would look like.

cved_eem <- vfold_cv(eem_for_roll, v = 5)

assessment(cved_eem$splits[[1]]) %>%
  head()

# A tibble: 6 x 8
# Groups: asset [1]
  date       asset  returns       MKT     SMB      HML     RMW     CMA
1 2013-01-04 EEM    0.00200  0.0064    0.0011  0.0056   -0.0043  0.0036
2 2013-01-14 EEM    0.00426 -0.000600  0.0002  0.0013   -0.001   0.0011
3 2013-01-31 EEM    0.00204 -0.0021    0.0038  0.000300 -0.0007 -0.0001
4 2013-02-04 EEM   -0.0133  -0.0107    0.0059 -0.0027    0.0023  0.0004
5 2013-02-05 EEM    0.00114  0.0034   -0.0054 -0.0002    0.0008 -0.002
6 2013-02-06 EEM   -0.00114  0.0019    0.0037  0.0022   -0.0018  0.0028

Notice how the first date is not necessarily the first date in our data. In fact, if you run that code chunk a few times, the first date will be randomly selected. For me, it varied between January 2nd and January 15th.

Back to our rolling_origin data, we know that split 1 begins on day 1 and ends on day 50:

analysis(roll_eem_sliding$splits[[1]]) %>%
  slice(c(1, 50))

# A tibble: 2 x 8
# Groups: asset [1]
  date       asset returns    MKT     SMB    HML     RMW     CMA
1 2013-01-02 EEM   0.0194  0.0199 -0.0043 0.0028 -0.0028 -0.0023
2 2013-03-14 EEM   0.00418 0.0078  0.0021 0.0016 -0.002   0.0026

And we know split 2 begins on day 2 and ends on day 51:

analysis(roll_eem_sliding$splits[[2]]) %>%
  slice(c(1, 50))

# A tibble: 2 x 8
# Groups: asset [1]
  date       asset  returns     MKT     SMB      HML    RMW     CMA
1 2013-01-03 EEM   -0.00710 -0.0021 0.00120 0.000600 0.0008  0.0013
2 2013-03-15 EEM   -0.00908  0.0025 0.006  -0.0001   0.0018 -0.0001

Now, we can start using our data to fit and then assess our model. The code flow is very similar to our previous post, but I’ll go ahead and run through it anyway.

First, let’s create a function that takes a split as an argument, fits a 3-factor Fama French model and calculates the root mean squared error. We will use the rmse() function from yardstick to measure our model. RMSE stands for root mean-squared error: the square root of the mean of the squared differences between our fitted values and the actual values in the assessment data. A lower RMSE is better!
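As a quick, illustrative sanity check on that definition (these numbers are made up and not part of the original post), the yardstick result matches the formula computed by hand:

# hypothetical fitted and actual values, just to illustrate the RMSE formula
actual <- c(0.010, -0.005, 0.002)
fitted <- c(0.008, -0.001, 0.004)

sqrt(mean((fitted - actual)^2))

# should match
yardstick::rmse_vec(truth = actual, estimate = fitted)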

library(yardstick)

ff_three_rmse <- function(split){
  analysis_set <- analysis(split)

  ff_model <- lm(returns ~ MKT + SMB + HML, data = analysis_set)

  holdout <- assessment(split)

  rmse <- ff_model %>%
    augment(newdata = holdout) %>%
    rmse(returns, .fitted) %>%
    pull(.estimate)
}

Let’s pass our one split object one_eem_split to the function.

ff_three_rmse(one_eem_split) %>%
  print()

[1] 0.005311276

Now, we can use mutate() and map_dbl() to pass all of our splits held in roll_eem_sliding to the function. This will return an rmse for this model when applied to each of our rolling origin splits. This takes a few seconds to run.

rmse_one_model <- roll_eem_sliding %>%
  mutate(model = map_dbl(
    .x = splits,
    .f = ~(ff_three_rmse(.x))))

rmse_one_model %>%
  head()

# A tibble: 6 x 3
  splits id          model
1 Slice0001       0.00531
2 Slice0002       0.00621
3 Slice0003       0.00642
4 Slice0004       0.00647
5 Slice0005       0.00654
6 Slice0006       0.00649

This is the same as we did last time. Now, let’s get functional. We will define three models that we wish to test via our rolling origin splits and RMSE, then pass those models to one function.

First, we use as.formula() to define our three models.

mod_form_1 <- as.formula(returns ~ MKT)
mod_form_2 <- as.formula(returns ~ MKT + SMB + HML)
mod_form_3 <- as.formula(returns ~ MKT + SMB + HML + RMW + CMA)

We write one function that takes split as an argument, same as above, but also takes formula as an argument, so we can pass it different models. This gives us the flexibility to more easily define new models and pass them to map so I’ll append _flex to the name of this function.

ff_rolling_flex <- function(split, formula) {
  split_for_data <- analysis(split)

  ff_model <- lm(formula, data = split_for_data)

  holdout <- assessment(split)

  rmse <- ff_model %>%
    augment(newdata = holdout) %>%
    rmse(returns, .fitted) %>%
    pull(.estimate)
}

rmse_model_1_flex <- roll_eem_sliding %>%
  mutate(model_1_rmse = map_dbl(roll_eem_sliding$splits,
                                ff_rolling_flex,
                                formula = mod_form_1))

rmse_model_1_flex %>%
  head()

# A tibble: 6 x 3
  splits id        model_1_rmse
1 Slice0001            0.00601
2 Slice0002            0.00548
3 Slice0003            0.00547
4 Slice0004            0.00556
5 Slice0005            0.00544
6 Slice0006            0.00539

Again, same as last time, we can run all of our models using the mutate() and map_dbl() combination. Warning: this one takes a bit of time and compute to run – proceed with caution if you’re on a desktop.

rmse_3_models <- roll_eem_sliding %>%
  mutate(model_1_rmse = map_dbl(roll_eem_sliding$splits, ff_rolling_flex, formula = mod_form_1),
         model_2_rmse = map_dbl(roll_eem_sliding$splits, ff_rolling_flex, formula = mod_form_2),
         model_3_rmse = map_dbl(roll_eem_sliding$splits, ff_rolling_flex, formula = mod_form_3))

rmse_3_models %>%
  head()

# A tibble: 6 x 5
  splits id        model_1_rmse model_2_rmse model_3_rmse
1 Slice0001              0.00601      0.00531      0.00496
2 Slice0002              0.00548      0.00621      0.00596
3 Slice0003              0.00547      0.00642      0.00617
4 Slice0004              0.00556      0.00647      0.00615
5 Slice0005              0.00544      0.00654      0.00613
6 Slice0006              0.00539      0.00649      0.00600

Alright, we have our RMSE, from each of our 3 models, as applied to each of our splits.

Thus far, our substantive flow is very similar to our k-fold cross-validation work. But now, we can take advantage of the time-aware nature of our splits.

Let’s visualize how our RMSE has changed over different time-aware splits, for our various models. Remember, we know the exact start and end date for our analysis and assessment sets, so we can extract the date of, say the first observation in the assessment data and assign it to the split. We can consider this the date of that model run.

First, let’s create a function to extract the first date of each of our assessment sets.

get_start_date <- function(x) min(assessment(x)$date)

Here’s how that works on our one split object.

get_start_date(one_eem_split)

[1] "2013-03-15"

That’s the first date in the assessment data:

assessment(one_eem_split) %>%
  select(date) %>%
  slice(1)

# A tibble: 1 x 2
# Groups: asset [1]
  asset date
1 EEM   2013-03-15

We want to add a column to our rmse_3_models object called start_date. We’ll use our usual mutate() and then map() flow to apply the get_start_date() function to each of our splits, but we’ll need to pipe the result to reduce(c) to coerce the result to a date. map() returns a list by default and we want a vector of dates.

rmse_3_models_with_start_date <- rmse_3_models %>%
  mutate(start_date = map(roll_eem_sliding$splits, get_start_date) %>%
           reduce(c)) %>%
  select(start_date, everything())

rmse_3_models_with_start_date %>%
  head()

# A tibble: 6 x 6
  start_date splits id    model_1_rmse model_2_rmse model_3_rmse
1 2013-03-15 Slice…             0.00601      0.00531      0.00496
2 2013-03-18 Slice…             0.00548      0.00621      0.00596
3 2013-03-19 Slice…             0.00547      0.00642      0.00617
4 2013-03-20 Slice…             0.00556      0.00647      0.00615
5 2013-03-21 Slice…             0.00544      0.00654      0.00613
6 2013-03-22 Slice…             0.00539      0.00649      0.00600

We can head to ggplot for some visualizing. I’d like to plot all of my RMSEs in different colors, and the best way to do that is to gather() this data to tidy format, with a column called model and a column called value. It’s necessary to coerce to a data frame first, using as.data.frame().

rmse_3_models_with_start_date %>%
  as.data.frame() %>%
  select(-splits, -id) %>%
  gather(model, value, -start_date) %>%
  head()

  start_date        model       value
1 2013-03-15 model_1_rmse 0.006011868
2 2013-03-18 model_1_rmse 0.005483091
3 2013-03-19 model_1_rmse 0.005470834
4 2013-03-20 model_1_rmse 0.005557170
5 2013-03-21 model_1_rmse 0.005439921
6 2013-03-22 model_1_rmse 0.005391862

Next, we can use some of our familiar ggplot methods to plot our RMSEs over time, and see if we notice this model degrading or improving in different periods.

rmse_3_models_with_start_date %>%
  as.data.frame() %>%
  select(-splits, -id) %>%
  gather(model, value, -start_date) %>%
  ggplot(aes(x = start_date, y = value, color = model)) +
  geom_point() +
  facet_wrap(~model, nrow = 2) +
  scale_x_date(breaks = scales::pretty_breaks(n = 10)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

All of the models have a jump in RMSE, meaning they performed worse, around the end of 2017. We aren’t focused on the theory of these models today but if we were, this would be a good place to start investigating. This is just the beginning of our exploration of rolling_origin, but I love how it opens the door for ways to think about visualizing our models.
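If we wanted a single headline number per model – something the original post does not compute, so treat this as an optional sketch – we could also average each model’s RMSE across all of the splits:

# possible summary across splits: one average RMSE per model
rmse_3_models_with_start_date %>%
  as.data.frame() %>%
  select(-splits, -id) %>%
  gather(model, value, -start_date) %>%
  group_by(model) %>%
  summarise(mean_rmse = mean(value))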

And finally, those same public service announcements from last time are still live, so I’ll mention them one more time.

Thanks to everyone who has checked out my new book! The price just got lowered for the holidays. See on Amazon or on the CRC homepage (okay, that was more of an announcement about my book).

Applications are open for the Battlefin alternative data contest and RStudio is one of the tools you can use to analyze the data. Check it out here. They’ll announce 25 finalists in January who will get to compete for a cash prize and connect with some quant hedge funds. Go get’em!

A special thanks to Bruce Fox who suggested we might want to expand on the Monte Carlo simulation in the book to take account of different distributions implied by historical returns and different market regimes that might arise. Today’s rolling origin framework will also lay the foundation, I hope, for implementing some of Bruce’s ideas in January.

Thanks for reading and see you next time.


Optimism corrected bootstrapping: a problematic method

Tue, 12/25/2018 - 10:09

(This article was first published on R – intobioinformatics, and kindly contributed to R-bloggers)

There are lots of ways to assess how predictive a model is while correcting for overfitting. In Caret, the main methods I use are leave-one-out cross-validation, for when we have relatively few samples, and k-fold cross-validation when we have more. There is also another method, called ‘optimism corrected bootstrapping’, which attempts to save statistical power by first getting the overfitted result, then trying to correct that result by bootstrapping the data to estimate the degree of optimism. A few simple tests in Caret can demonstrate this method is bunk.

The test is very straightforward: just add random variables drawn from a normal distribution to the ground-truth iris labels. We should find our AUC (area under the ROC curve) is about 0.5. Yet optimism corrected bootstrapping gives a positive result regardless of whether the predictors are just noise or not. Let’s just run that test:

This is called a sensitivity analysis, for the uninitiated: I am just increasing the number of random noise features (z) and binding them to the real labels in an iterative manner.

library(caret)

allresults <- matrix(ncol = 2, nrow = 200)
i <- 0

for (z in seq(10, 2000, 10)){

  i <- i + 1

  # select only two species
  iris <- subset(iris, iris$Species == 'versicolor' | iris$Species == 'virginica')
  iris$Species <- droplevels(iris$Species)

  # generate random data
  test <- matrix(rnorm(100*z, mean = 0, sd = 1), nrow = 100, ncol = z, byrow = TRUE)

  # bind random data
  iris <- cbind(iris, test)

  # remove real data
  iris <- iris[,-1:-4]

  # cross validation
  ctrl <- trainControl(method = 'cv',
                       summaryFunction = twoClassSummary,
                       classProbs = T,
                       savePredictions = T,
                       verboseIter = T)
  fit3 <- train(as.formula( paste( 'Species', '~', '.' ) ),
                data = iris,
                method = "glmnet",
                # preProc=c("center", "scale")
                trControl = ctrl,
                metric = "ROC")
  allresults[i,1] <- max(fit3$results$ROC)

  # optimism corrected bootstrapping
  ctrl <- trainControl(method = 'optimism_boot',
                       summaryFunction = twoClassSummary,
                       classProbs = T,
                       savePredictions = T,
                       verboseIter = T)
  fit4 <- train(as.formula( paste( 'Species', '~', '.' ) ),
                data = iris,
                method = "glmnet",
                # preProc=c("center", "scale")
                trControl = ctrl,
                metric = "ROC")
  allresults[i,2] <- max(fit4$results$ROC)

  rm(iris)
}

df <- data.frame(allresults)
colnames(df) <- c('cross_validation', 'optimism_corrected_boot')
df2 <- reshape2::melt(df)
df2$N <- c(seq(10, 2000, 10), seq(10, 2000, 10))

# do the plot
p1 <- ggplot(df2, aes(x = N, y = value, group = variable)) +
  geom_line(aes(colour = variable))

png('bias_in_optimism_corrected_bootstrapping.png',
    height = 15, width = 27, units = 'cm', res = 900, type = 'cairo')
print(p1)
dev.off()

So there we have it: a method that is being used in publications as we speak, but it is horribly biased. This is science for you, full of misinformation and lies. Be warned, there is a darkness out there, R readers, not just in Washington.

N (x axis) refers to the number of features here (which are just noise). AUC (y axis) is about 0.5 when we have no predictive power.

We can see cross-validation performs as expected, though it is a bit unstable; it is better to use repeated cross-validation, I think.
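For readers who want to try that, here is a hedged sketch of what repeated cross-validation looks like in caret – the number and repeats values are just common choices, not taken from this post:

# repeated k-fold CV in caret: 10 folds, repeated 5 times (illustrative values)
library(caret)

ctrl_rep <- trainControl(method = 'repeatedcv',
                         number = 10,
                         repeats = 5,
                         summaryFunction = twoClassSummary,
                         classProbs = TRUE,
                         savePredictions = TRUE)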

My recommendations for Rbloggers readers who don’t know much about supervised machine learning, but want to use it on their data in R:

  • Don’t start by implementing cross-validation loops yourself and calculating AUC etc.; Caret makes life a lot easier and it is bug-tested by thousands of people around the world, like myself.
  • Stick with tried and tested mainstream approaches when in doubt. Like LOOCV or repeated k fold CV.
  • If implementing your own method, or if you are skeptical of the reliability of the method you’re using, use rnorm to perform a negative control to test your statistical method for bias (see the sketch after this list). This is a good check to perform, as is a positive control.
  • Make sure your data is in the correct order.
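As an illustration of that negative-control idea – a sketch with made-up sizes and labels, not code from the post – generate pure-noise predictors for real two-class labels and check that the reported AUC sits near 0.5:

# negative control: noise-only predictors, real labels; expect AUC close to 0.5
library(caret)

set.seed(1)
y <- factor(rep(c("a", "b"), each = 50))
x <- as.data.frame(matrix(rnorm(100 * 20), nrow = 100, ncol = 20))

ctrl <- trainControl(method = 'cv',
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)

fit <- train(x = x, y = y, method = "glmnet", metric = "ROC", trControl = ctrl)
max(fit$results$ROC)  # should be roughly 0.5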



Pivot Billions and Deep Learning enhanced trading models achieve 100% net profit

Mon, 12/24/2018 - 20:01

(This article was first published on R – Pivot Billions, and kindly contributed to R-bloggers)

Deep Learning has revolutionized the fields of image classification, personal assistance, competitive board game play, and many more. However, the financial currency markets have been surprisingly stagnant. In our efforts to create a profitable and accurate trading model, we came upon the question: what if financial currency data could be represented as an image? The answer: it can!

There are many ways to reshape currency data into an image. However, each requires a great deal of processing power and research. We powered our analysis with Pivot Billions, which allowed us to load and analyze our data quickly and to reshape it into an image using their custom module. While we could have reshaped the data to have the last X ticks for each tick data point or last Y minutes for each minute of the data, we already had working models from our initial research without deep learning.

See prior posts:

Therefore we decided to take the signals from one of our models and enhance them with the power of deep learning and AI.

By incorporating Pivot Billions into our Keras workflow, we were immediately able to prepare the data so that each of our model’s signals had, in its row, the last 100 minutes prior to that signal, and to set up our training and testing datasets. We then easily loaded this data into Keras and worked on developing our deep learning model. Many times we needed to modify which features’ 100-minute histories were being stored and incorporated into our deep learning, so we made full use of Pivot Billions’ speed to decrease our iteration turnaround time. After many iterations, we were able to determine a deep learning model that learned our model’s weaknesses and strengths and accurately predicted the profitable signals.
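The post itself contains no code, so purely to illustrate the “last 100 minutes before each signal” reshaping idea, here is a minimal sketch in R. The column names, the window length and the objects minute_data and signals are all invented for the example; this is not the Pivot Billions workflow:

# illustrative only: collect the previous k closes for each signal row
library(purrr)
library(tibble)

k <- 100  # minutes of history per signal (assumed)

# minute_data: one row per minute with a close column (hypothetical)
# signals:     integer row indices where the base model fired (hypothetical)
signals_ok <- signals[signals > k]  # keep only signals with a full history window

lagged_rows <- map_dfr(signals_ok, function(i) {
  tibble(signal_row = i,
         history    = list(minute_data$close[(i - k):(i - 1)]))
})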

Using this enhanced model we could achieve amazing and much more stable profit throughout our data. We took raw signals that looked like this:

which were profitable but highly volatile due to periods of noisy and underperforming trades, and turned them into this:

The periods of drawdown are greatly reduced and our profitability is much more stable, allowing us to achieve nearly 100% profit in less than 7 months. We’ll continue our research so look forward to another blog post featuring even better AI-enhanced models!



Dreaming of a white Christmas – with ggmap in R

Mon, 12/24/2018 - 12:00

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

With the holidays approaching, one of the most discussed questions at STATWORX was whether we’ll have a white Christmas or not. And what better way to get our hopes up, than by taking a look at the DWD Climate Data Center’s historic data on the snow depth on the past ten Christmas Eves?

But how best to visualize spatial data? Unlike most data types, spatial data usually calls for a very particular visualization, namely data points overlaying a map. In this way, areal data is automatically contextualized by the geographic information intuitively conveyed by a map.

The basic functionality of ggplot2 doesn’t offer the possibility to do so, but there is a package akin to ggplot2 that does: ggmap. ggmap was written by David Kahle and Hadley Wickham and combines the building blocks of ggplot2 – the grammar of graphics – with the static maps of Google Maps, OpenStreetMap, Stamen Maps or CloudMade Maps. And with all that, ggmap allows us to make really fancy visualizations:

Above-average snow depth on Christmas Eve (2008-2017)

The original functionalities of ggmap used to be somewhat more general, broad and “barrier-free”, but since those good old days – aka 2013 – some of the map suppliers have changed their terms of use as well as the mechanics of their APIs. At the moment, the service of Stamen Maps seems to be the most stable, while also being easily accessible – e.g. without registering for an API that requires one to provide payment information. Therefore, we’re going to focus on Stamen Maps.


First things first: the map

Conveniently, ggmap employs the same theoretical framework and general syntax as ggplot2. However, ggmap requires one additional step: before we can start plotting, we have to download a map as a backdrop for our visualization. This is done with get_stamenmap(), get_cloudmademap(), get_googlemap(), get_openstreetmap() or the more general get_map(). We’re going to use get_stamenmap().

To determine the depicted map cutout, the left, bottom, right and top coordinates of a bounding box have to be supplied to the argument bbox.
Conveniently, there is no need to know the exact latitudes and longitudes of every bounding box of interest: the function geocode_OSM() from the package tmaptools returns, whenever possible, the coordinates of a search query consisting of an address, zip code and/or the name of a city or country.

library(scales)
library(tidyverse)
library(tmaptools)
library(ggimage)
library(ggmap)

# get the bounding box
geocode_OSM("Germany")$bbox

     xmin      ymin      xmax      ymax
 5.866315 47.270111 15.041932 55.099161

The zoom level can be set via the zoom argument and can range between 0 (least detailed) and 18 (most detailed, quick disclaimer: this can take a very long time). The zoom level determines the resolution of the image as well as the amount of displayed annotations.

Depending on whether we want to highlight roads, political or administrative boundaries or bodies of water and land different styles of maps excel. The maptype argument allows to choose from different ready-made styles: "terrain", "terrain-background", "terrain-labels", "terrain-lines", "toner", "toner-2010", "toner-2011", "toner-background", "toner-hybrid", "toner-labels", "toner-lines", "toner-lite" or "watercolor".

Some further, very handy arguments of get_stamenmap() are crop, force and color:
As implied by the name, color defines whether a map should be in black-and-white ("bw") or when possible in color ("color").

Under the hood get_stamenmap() downloads map tiles, which are joined to the complete map. If the map tiles should be cropped so as to only depict the specified bounding box, the crop argument can be set to TRUE.

Once a map of a given location has been downloaded, it will not be downloaded again – even if arguments changing the style of the map have been altered – unless the force argument is set to TRUE.

When we’ve obtained the map of the right location and style, we can store the “map image” in an object or simply pass it along to ggmap() to plot it. The labels, ticks etc. of axes can be controlled as usual.

# getting map
plot_map_z7 <- get_stamenmap(as.numeric(geocode_OSM("Germany")$bbox),
                             zoom = 7,
                             force = TRUE,
                             maptype = "terrain")

# saving plotted map alone
plot1 <- ggmap(plot_map_z7) +
  theme(axis.title = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank())

# getting map
plot_map_z5 <- get_stamenmap(as.numeric(geocode_OSM("Germany")$bbox),
                             zoom = 5,
                             force = TRUE,
                             maptype = "terrain")

# saving plotted map alone
plot2 <- ggmap(plot_map_z5) +
  theme(axis.title = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank())

# plotting maps together
plot <- gridExtra::grid.arrange(plot1, plot2, nrow = 1)

Example for maptype = „terrain“ with zoom = 7 (left) vs. zoom = 5 (right).

Business as usual: layering geoms on top

We can then layer any ggplot2 geom we’d like on top of that map, with the only requirement being that the variables mapped to the axes are within the same numeric range as the latitudes and longitudes of the depicted map. We can also use many extension packages building on ggplot2. For example, we can use the very handy package ggimage by Guangchuang Yu to make our plots extra festive:

# aggregating data per coordinate
df_snow_agg <- df_snow %>%
  dplyr::mutate(LATITUDE = plyr::round_any(LATITUDE, accuracy = 1),
                LONGITUDE = plyr::round_any(LONGITUDE, accuracy = 1)) %>%
  dplyr::group_by(LATITUDE, LONGITUDE) %>%
  dplyr::summarise(WERT = mean(WERT, na.rm = TRUE))

# cutting into equal intervals
df_snow_agg$snow <- as.numeric(cut(df_snow_agg$WERT, 12))

# setting below average snow depths to 0
df_snow_agg <- df_snow_agg %>%
  mutate(snow = ifelse(WERT <= mean(df_snow_agg$WERT), 0, snow))

# adding directory of image to use
df_snow_agg$image <- "Snowflake_white.png"

# getting map
plot_map <- get_stamenmap(as.numeric(geocode_OSM("Germany")$bbox),
                          zoom = 7,
                          force = TRUE,
                          maptype = "terrain")

# plotting map + aggregated snow data with ggimage
plot <- ggmap(plot_map) +
  geom_image(data = df_snow_agg,
             aes(x = LONGITUDE, y = LATITUDE, image = image,
                 size = I(snow/20))) + # rescaling to get valid size values
  theme(axis.title = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank())

Above-average snow depth on Christmas Eve (2008-2017)

We can get creative, playing with all sorts of geoms and map styles. Especially visualization strategies to prevent overplotting can be quite handy:

For example, a facetted scatter plot with alpha blending:

# getting map
plot_map <- get_stamenmap(as.numeric(geocode_OSM("Germany")$bbox),
                          zoom = 7,
                          force = TRUE,
                          maptype = "watercolor")

# facets per year, scatter plot with alpha blending
plot <- ggmap(plot_map) +
  geom_point(data = df_snow,
             aes(x = LONGITUDE, y = LATITUDE, alpha = WERT, size = WERT),
             color = "white") +
  facet_wrap(. ~ DATE, nrow = 2) +
  scale_alpha_continuous(range = c(0.0, 0.9)) +
  theme(axis.title = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank(),
        legend.position = "none",
        strip.background = element_rect(fill = "white"))

Snow depth on Christmas Eve per year (2008-2017)

Or density plots with either geom_line() or geom_polygon():

# getting map
plot_map <- get_stamenmap(as.numeric(geocode_OSM("Germany")$bbox),
                          zoom = 7,
                          crop = FALSE,
                          force = TRUE,
                          maptype = "toner-background")

# density plot: lines
plot1 <- ggmap(plot_map) +
  geom_density2d(data = df_snow_dens,
                 aes(x = LONGITUDE, y = LATITUDE, color = ..level..),
                 size = 0.7) +
  scale_color_continuous(low = "#17775A", high = "white",
                         name = "",
                         breaks = c(0.01, 0.06),
                         labels = c("little snow", "much snow")) +
  theme(axis.title = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank(),
        legend.position = "left")

# density plot: polygon
plot2 <- ggmap(plot_map) +
  stat_density_2d(data = df_snow_dens,
                  aes(x = LONGITUDE, y = LATITUDE, alpha = ..level..),
                  fill = "#CFECFF",
                  geom = "polygon") +
  scale_alpha(name = "",
              breaks = c(0.012, 0.024, 0.036, 0.048, 0.06),
              labels = c("little snow", "", "", "", "much snow")) +
  theme(axis.title = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank())

plot <- gridExtra::grid.arrange(plot1, plot2, nrow = 1)

ggsave(plot = plot, filename = "density.png", width = 5.5, height = 3)

Average snow depth on Christmas Eve (2008-2017)

Our plots give a first taste of what can be done with ggmap. However, based on the historic data, there won’t be a white Christmas for most of us… Against all odds, I still hope that outside there’s a winter wonderland waiting for you right now.

From everybody here at STATWORX happy holidays and a happy new year!

About the author

Lea Waniek

I am a data scientist at STATWORX. Apart from machine learning, I love to play around with RMarkdown and ggplot2, making data science beautiful inside and out.

ABOUT US

STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. 


The post Dreaming of a white Christmas – with ggmap in R first appeared on STATWORX.


The Need for Speed Part 2: C++ vs. Fortran vs. C

Mon, 12/24/2018 - 04:22

(This article was first published on R – Strange Attractors, and kindly contributed to R-bloggers)

Searching for Speed

In my previous post, I described the method I use for compiling Fortran (or C) into an R package using the .Call interface. This post will compare the speed of various implementations of the layer loss cost function.

The Function

Often, insurance or reinsurance is bought in stratified horizontal layers. For example, an auto policy with a $300,000 limit may have its middle third insured. This means that the primary insurer would pay up to the first $100,000 of loss, a reinsurer would pay the next $100,000, and the primary insurer would pay the last $100,000. In actuarial terms, we say that the reinsurer is covering $100K excess of $100K. The reinsured losses attach at $100K and are limited to a total of $100K paid loss. If \(X_i\) are the random variables representing the losses, \(L\) is the limit of the policy, and \(A\) the attachment, then the layer loss cost, LLC, is:

$$\mathrm{LLC} = \sum_i \max\bigl(0,\ \min(X_i - A,\ L)\bigr)$$

In R, this is extremely easy to write:

LLC_r <- function(x, l, a) sum(pmax(0, pmin(x - a, l)))

The Question

The two main languages which comprise the compiled code in R are C/C++ and Fortran. I’m not sure why, but I have a soft spot in my heart for Fortran, specifically the free-form, actually readable versions of Fortran from F90 on. For the packages I maintain on CRAN, I’ve found that, more often than not, Fortran will outperform C++ by a small margin – although not always. This was an opportunity to directly compare the performance of C++ and Fortran.

Parallelization Basics

Speed improvements often come from explicit parallelization of the compiled code. In parallel computing, vectorization means using the multiple SIMD lanes in modern chips (SSE/AVX/AVX2). The loop is only performed in one thread but it can be chunked inside the CPU. Consider a one-lane highway approaching a toll plaza; the speed limit remains the same but just at the tollbooth four cars can be processed at the same time. Parallelization means taking advantage of multiple cores. This is akin to splitting the one lane into several lanes, each one with its own speed limit and approach to its own toll-plaza. Sometimes single-thread vectorization is called “fine-grain parallelism” and multi-thread parallelism is called “coarse-grain parallelism”. Both can be combined on today’s modern SIMD CPUs with multiple cores.

In theory, either of these should be faster than single-thread serial calculation. In practice, however, there is a certain amount of overhead involved in setting up either kind of parallelism. Therefore, for small loads, it is not uncommon to find the serial versions to be the fastest. Even for larger, or huge, loads, it is not always the case that the most parallelism is worth the most overhead. In each language tested, the code will test coarse-grained, fine-grained, and combined parallelism.

R-specific hurdle

The standard way to parallelize C/C++/Fortran is by using OpenMP. However, R’s mechanics officially don’t allow OpenMP to be used by multiple languages in the same package. While it may be forced, the forcing violates R norms and relies on some default linker flags being the same across languages. Even now, December 2018, the Fortran mechanics for OpenMP are changing. So to test the languages, I wrote two small packages—one exclusively C++ using Rcpp and one using the C/Fortran interface. In the end, I did violate those norms to show some pure C calls which were put into the Fortran package. While this will work for now and for this post, it is not good practice and should be avoided.

Approaches

R

In base R, the pure translation would be the nested use of pmin and pmax. There is one variant: the documentation for the two functions also describes a .int version of both, which are “faster internal versions only used when all arguments are atomic vectors and there are no classes.” So for R, there are two options to be checked:

LLC_r1 <- function(x, l, a) sum(pmax(0, pmin(x - a, l)))
LLC_r2 <- function(x, l, a) sum(pmax.int(0, pmin.int(x - a, l)))

C++/Rcpp

Serial Loop

Unlike R, the easiest way to calculate this function in C++ is to use a for loop. There may be more elegant ways using STL, but those are more difficult to implement (and the LLC function isn’t unary anyway). In R, loops are deprecated in favor of vectorized functions and the apply family. The reason is that the looping is handled by the underlying C/C++/Fortran and not at the R level. Here we are living in the C++ basement, so well-formed loops are both good and fast. The archetypal loop implementation for the layer loss cost function in Rcpp would be:

// [[Rcpp::export]]
double LLC_cpp(NumericVector X, double L, double A) {
  int n = X.size();
  double LLC = 0.0;
  for (int i = 0; i < n; ++i) {
    LLC += fmax(0.0, fmin(X[i] - A, L));
  }
  return(LLC);
}

Not only is this implementation fast, a sufficiently good and aggressive compiler can also attempt autovectorization for even more speed (-O3 will usually attempt this). However, explicit vectorization and parallelization for C/C++ and Fortran will be applied using OpenMP. There are other ways to impose parallelism, like the Intel TBB framework behind RcppParallel. In that lies a possible addendum or two to this post.

Parallel Loops

Three additional flavors of the C++ code were tested:

  • Only using SIMD vectorization: LLC_cpps
  • Only using explicit parallization: LLC_cppl
  • Using both: LLC_cppls

The entire .cpp file is posted below to show the includes and other Rcpp necessities which weren’t covered in the first post.

#include <Rcpp.h>
#include <cmath>
#include <omp.h>

using namespace Rcpp;

//[[Rcpp::plugins(openmp)]]

// [[Rcpp::export]]
double LLC_cpp(NumericVector X, double L, double A) {
  int n = X.size();
  double LLC = 0.0;
  for (int i = 0; i < n; ++i) {
    LLC += fmax(0.0, fmin(X[i] - A, L));
  }
  return(LLC);
}

// [[Rcpp::export]]
double LLC_cpps(NumericVector X, double L, double A) {
  int n = X.size();
  double LLC = 0.0;
#pragma omp simd reduction(+: LLC)
  for (int i = 0; i < n; ++i) {
    LLC += fmax(0.0, fmin(X[i] - A, L));
  }
  return(LLC);
}

// [[Rcpp::export]]
double LLC_cppl(NumericVector X, double L, double A) {
  int n = X.size();
  double LLC = 0.0;
#pragma omp parallel for schedule(static), reduction(+:LLC)
  for (int i = 0; i < n; ++i) {
    LLC += fmax(0.0, fmin(X[i] - A, L));
  }
  return(LLC);
}

// [[Rcpp::export]]
double LLC_cppls(NumericVector X, double L, double A) {
  int n = X.size();
  double LLC = 0.0;
#pragma omp parallel for simd schedule(static), reduction(+:LLC)
  for (int i = 0; i < n; ++i) {
    LLC += fmax(0.0, fmin(X[i] - A, L));
  }
  return(LLC);
}

The reduction code word lets the compiler know which variable is the sum accumulator to which the separate threads or vectors need to return their work. The schedule keyword tells the compiler how to split up the work into separate threads. The static option causes the work to be split into evenly-sized chunks. In OpenMP 4.5, there is a special scheduling option for loops with SIMD, but the current Rtools for Windows is based on GCC 4.9.3, which is limited to OpenMP 4.0.

Using OpenMP requires some special handling in the Makevars[.win] file. The entire package including the C++ code, Makevars, and other goodies can be found at: https://github.com/aadler/CppLangComp

Fortran

Direct

Unlike C/C++, arrays are first-class citizens in modern Fortran (F90+). Therefore, the function can be written without explicit looping. Fortran knows x is an array, and will behave accordingly. The function below is named llc_fd for “Fortran – Direct”:

subroutine llc_fd(x, n, l, a, llc) bind(C, name = "llc_fd_")
  real(kind = c_double), intent(in) :: l, a
  integer(kind = c_int), intent(in), value :: n
  real(kind = c_double), intent(in), dimension(n) :: x
  real(kind = c_double), intent(out) :: llc

  llc = sum(max(0.0_c_double, min(x - a, l)))
end subroutine llc_fd

Another element (pun intended) of Fortran is the concept of elemental functions. These are functions which take scalar inputs, return scalar outputs, and have no side effects. For example, the Fortran intrinsic log function takes a scalar input, returns the log of that value as output, and does nothing else. The benefit of creating an elemental function is that the Fortran compiler recognizes the potential for automatically vectorizing/parallelizing the function when applied to an array. So a second direct approach would be to make the calculation of the LLC an elemental function and apply that to the vector. The function below is named llc_fe for “Fortran – Elemental”:

elemental function llce(x, l, a) result(y)
  real(kind = c_double), intent(in) :: x, l, a
  real(kind = c_double) :: y

  y = max(0.0_c_double, min(x - a, l))
end function llce

subroutine llc_fe(x, n, l, a, llc) bind(C, name = "llc_fe_")
  real(kind = c_double), intent(in) :: l, a
  integer(kind = c_int), intent(in), value :: n
  real(kind = c_double), intent(in), dimension(n) :: x
  real(kind = c_double), intent(out) :: llc

  llc = sum(llce(x, l, a))
end subroutine llc_fe

In Fortran, max and min are themselves elemental functions, so the “direct” implementation is actually elemental as well. Thus, it will probably be a tick faster, as it doesn’t need the intermediary llce function.

Serial Loop

Loops in Fortran tend to be very fast, especially when using -O3, as the compilers unroll/autovectorize pretty decently. The function below is named llc_f as it’s the basic Fortran implementation most comparable to its Rcpp/C++ and C counterparts:

subroutine llc_f(x, n, l, a, llc) bind(C, name = "llc_f_")
  real(kind = c_double), intent(in) :: l, a
  integer(kind = c_int), intent(in), value :: n
  real(kind = c_double), intent(in), dimension(n) :: x
  real(kind = c_double), intent(out) :: llc
  integer :: i

  llc = 0.0_c_double
  do i = 1, n
    llc = llc + max(0.0_c_double, min(x(i) - a, l))
  end do
end subroutine llc_f

Parallel Loops

Similar to the Rcpp/C++ versions, there are three parallelized loop versions: SIMD only, parallelization only, and both. The parallel functions follow similar naming to the Rcpp versions above. To see the entire .f95 file, see the source file on github:

subroutine llc_fs(x, n, l, a, llc) bind(C, name = "llc_fs_")
  real(kind = c_double), intent(in) :: l, a
  integer(kind = c_int), intent(in), value :: n
  real(kind = c_double), intent(in), dimension(n) :: x
  real(kind = c_double), intent(out) :: llc
  integer :: i

  llc = 0.0_c_double
  !$omp do simd reduction(+:llc)
  do i = 1, n
    llc = llc + max(0.0_c_double, min(x(i) - a, l))
  end do
  !$omp end do simd
end subroutine llc_fs

subroutine llc_fl(x, n, l, a, llc) bind(C, name = "llc_fl_")
  real(kind = c_double), intent(in) :: l, a
  integer(kind = c_int), intent(in), value :: n
  real(kind = c_double), intent(in), dimension(n) :: x
  real(kind = c_double), intent(out) :: llc
  integer :: i

  llc = 0.0_c_double
  !$omp parallel do reduction(+:llc), schedule(static), private(i)
  do i = 1, n
    llc = llc + max(0.0_c_double, min(x(i) - a, l))
  end do
  !$omp end parallel do
end subroutine llc_fl

subroutine llc_fls(x, n, l, a, llc) bind(C, name = "llc_fls_")
  real(kind = c_double), intent(in) :: l, a
  integer(kind = c_int), intent(in), value :: n
  real(kind = c_double), intent(in), dimension(n) :: x
  real(kind = c_double), intent(out) :: llc
  integer :: i

  llc = 0.0_c_double
  !$omp parallel do simd reduction(+:llc), schedule(static)
  do i = 1, n
    llc = llc + max(0.0_c_double, min(x(i) - a, l))
  end do
  !$omp end parallel do simd
end subroutine llc_fls

C

Now that we know how to fold in compiled code, we can write pure(ish) C functions and compare them to both the Fortran and Rcpp/C++. The same four versions as the Rcpp functions can be written directly in C and are named c_llc_c, c_llc_cs, c_llc_cl, c_llc_cls describing their parallelization. Per Hadley Wickham’s advice, it is faster to move the singleton SEXPs into local constant variables, and to declare a pointer to the vector SEXP, than it is to call them directly inside the loop. To see the entire .c file with the includes, the Fortran interfaces, and the R registration functions, please see the source file on github:

extern SEXP c_llc_c(SEXP x, SEXP l, SEXP a){
  const int n = LENGTH(x);
  const double ll = asReal(l);
  const double aa = asReal(a);
  double *px = REAL(x);
  double llc = 0.0;
  for (int i = 0; i < n; ++i) {
    llc += fmax(0.0, fmin(px[i] - aa, ll));
  }
  SEXP ret = PROTECT(allocVector(REALSXP, 1));
  REAL(ret)[0] = llc;
  UNPROTECT(1);
  return(ret);
}

extern SEXP c_llc_cs(SEXP x, SEXP l, SEXP a){
  const int n = LENGTH(x);
  const double ll = asReal(l);
  const double aa = asReal(a);
  double *px = REAL(x);
  double llc = 0.0;
#pragma omp simd reduction(+:llc)
  for (int i = 0; i < n; ++i) {
    llc += fmax(0.0, fmin(px[i] - aa, ll));
  }
  SEXP ret = PROTECT(allocVector(REALSXP, 1));
  REAL(ret)[0] = llc;
  UNPROTECT(1);
  return(ret);
}

extern SEXP c_llc_cl(SEXP x, SEXP l, SEXP a){
  const int n = LENGTH(x);
  const double ll = asReal(l);
  const double aa = asReal(a);
  double *px = REAL(x);
  double llc = 0.0;
#pragma omp parallel for schedule(static), reduction(+:llc)
  for (int i = 0; i < n; ++i) {
    llc += fmax(0.0, fmin(px[i] - aa, ll));
  }
  SEXP ret = PROTECT(allocVector(REALSXP, 1));
  REAL(ret)[0] = llc;
  UNPROTECT(1);
  return(ret);
}

extern SEXP c_llc_cls(SEXP x, SEXP l, SEXP a){
  const int n = LENGTH(x);
  const double ll = asReal(l);
  const double aa = asReal(a);
  double *px = REAL(x);
  double llc = 0.0;
#pragma omp parallel for simd schedule(static), reduction(+:llc)
  for (int i = 0; i < n; ++i) {
    llc += fmax(0.0, fmin(px[i] - aa, ll));
  }
  SEXP ret = PROTECT(allocVector(REALSXP, 1));
  REAL(ret)[0] = llc;
  UNPROTECT(1);
  return(ret);
}

Necessary Evils

The .R file is straightforward, especially after the previous post. The necessary evil is the Makevars in which I violated my own warning about having OpenMP called by two languages in the same package. The entire package, which is not CRAN-worthy, can be found on github.

Tests

There are sixteen implementations of the LLC function between the two packages. These were tested with 1000 replications of calculating the layer loss cost for a $2M excess of $1M layer for sequences of 1, 10, 100, 1000, 10K, 100K, and 1M generated losses. Timing was done using the microbenchmark package.

The timing code can be found as Timings.R in the inst/Timings directory of the package. In that script, the last commented-out call will create a nice table of all the timings. As that would have overwhelmed this article, it isn’t posted here.
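Purely to illustrate the shape of those calls – the real script is Timings.R in the repository, and the loss-generating line here is invented for the sketch – timing a single vector length might look like this:

# illustrative microbenchmark call for one vector length (see Timings.R for the real code)
library(microbenchmark)

x <- rexp(1e5, rate = 1 / 5e5)  # made-up loss severities
L <- 2e6                        # $2M limit
A <- 1e6                        # excess of $1M attachment

microbenchmark(
  base_R = LLC_r(x, L, A),
  rcpp   = LLC_cpp(x, L, A),
  times  = 1000
)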

Results Synopsis

For the layer loss cost function, Fortran consistently outperforms its equivalent C and C++ routines. For vectors up to around length 1000, the Fortran serial loop, direct, and elemental versions are approximately the same. Between 1000 and almost 100K, the Fortran serial loop is the fastest. Shortly before 100K, the overhead of the parallelization is overtaken by the speed of using multiple cores. The parallelization overhead is significant enough that even base R outperforms the compiled code up to almost 10K elements. Also, using gcc/gfortran 4.9.3, there is no real benefit to using SIMD: serial SIMD is equivalent to serial loops and parallel SIMD is equivalent to parallel loops. Lastly, the .int versions of pmin and pmax are faster up to vectors of about length 10K, after which they are about the same as the regular versions. Select the graph below for a larger version.

In closing, significant speed can be achieved in R by using compiled code. For this function, Fortran outperformed C and Rcpp, but that is not always the case. Hopefully, after these two posts, it will be easier for the interested reader to experiment, especially when needing to squeeze out every drop of speed.

Equivalently-sized Vector Comparisons

Below are a series of box and ridge plots comparing the distribution of the 1000 results across language and method. There are no ridge plots for lengths less than 1000 as ggridges returned an error for those. All graphs can be selected for a larger view.

Box Plots

Ridge Plots

SessionInfo

These timings were performed on an i7-8700K overclocked to 5.03Ghz, ASUS ROG MAXIMUS X HERO Z370 motherboard, 32Gb RAM, 960 Pro NVME hard drive, NVIDIA GTX 1080Ti.

R version 3.5.0 Patched (2018-05-16 r74732) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) Matrix products: default locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] data.table_1.11.8 ggridges_0.5.1 scales_1.0.0 ggplot2_3.1.0 microbenchmark_1.4-6 CppLangComp_0.0.1 RFortLangComp_0.0.1 loaded via a namespace (and not attached): [1] Rcpp_1.0.0 rstudioapi_0.8 bindr_0.1.1 magrittr_1.5 tidyselect_0.2.5 munsell_0.5.0 colorspace_1.3-2 R6_2.3.0 rlang_0.3.0.1 [10] plyr_1.8.4 dplyr_0.7.8 tools_3.5.0 grid_3.5.0 gtable_0.2.0 withr_2.1.2 yaml_2.2.0 lazyeval_0.2.1 assertthat_0.2.0 [19] tibble_1.4.2 crayon_1.3.4 bindrcpp_0.2.2 purrr_0.2.5 glue_1.3.0 compiler_3.5.0 pillar_1.3.0 pkgconfig_2.0.2

The post The Need for Speed Part 2: C++ vs. Fortran vs. C appeared first on Strange Attractors.


Objects types and some useful R functions for beginners

Mon, 12/24/2018 - 01:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)


This blog post is an excerpt of my ebook Modern R with the tidyverse that you can read for
free here. This is taken from Chapter 2, which explains
the different R objects you can manipulate as well as some functions to get you started.

Objects, types and useful R functions to get started

All objects in R have a given type. You already know most of them, as these types are also used
in mathematics. Integers, floating point numbers, or floats, matrices, etc, are all objects you
are already familiar with. But R has other, maybe lesser known data types (that you can find in a
lot of other programming languages) that you need to become familiar with. But first, we need to
learn how to assign a value to a variable. This can be done in two ways:

a <- 3

or

a = 3

in very practical terms, there is no difference between the two. I prefer using <- for assigning
values to variables and reserve = for passing arguments to functions, for example:

spam <- mean(x = c(1,2,3))

I think this is less confusing than:

spam = mean(x = c(1,2,3))

but as I explained above you can use whatever you feel most comfortable with.

The numeric class

To define single numbers, you can do the following:

a <- 3

The class() function allows you to check the class of an object:

class(a) ## [1] "numeric"

Decimals are defined with the character .:

a <- 3.14

R also supports integers. If you find yourself in a situation where you explicitly need an integer
and not a floating point number, you can use the following:

a <- as.integer(3) class(a) ## [1] "integer"

The as.integer() function is very useful, because it converts its argument into an integer. There
is a whole family of as.*() functions. To convert a into a floating point number again:

class(as.numeric(a)) ## [1] "numeric"

There is also is.numeric() which tests whether a number is of the numeric class:

is.numeric(a) ## [1] TRUE

These functions are very useful: there is one for each of the supported types in R. Later, we are going
to learn about the {purrr} package, which is a very powerful package for functional programming. This
package includes further such functions.
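A few more members of that family, just to illustrate the pattern (these simple calls are not in the original excerpt):

is.character("spam")
## [1] TRUE

is.logical(3)
## [1] FALSE

as.logical(0)
## [1] FALSE

as.logical("TRUE")
## [1] TRUE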

The character class

Use " " to define characters (called strings in other programming languages):

a <- "this is a string" class(a) ## [1] "character"

To convert something to a character you can use the as.character() function:

a <- 4.392 class(a) ## [1] "numeric" class(as.character(a)) ## [1] "character"

It is also possible to convert a character to a numeric:

a <- "4.392" class(a) ## [1] "character" class(as.numeric(a)) ## [1] "numeric"

But this only works if it makes sense:

a <- "this won't work, chief" class(a) ## [1] "character" as.numeric(a) ## Warning: NAs introduced by coercion ## [1] NA

A very nice package to work with characters is {stringr}, which is also part of the {tidyverse}.

The factor class

Factors look like characters, but are very different. They are the representation of categorical
variables. A {tidyverse} package to work with factors is {forcats}. You would rarely use
factor variables outside of datasets, so for now, it is enough to know that this class exists.
We are going to learn more about factor variables in Chapter 4, by using the {forcats} package.
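Just to illustrate (a minimal example of my own, not from the chapter), a factor stores the distinct categories of a variable as levels:

eye_colour <- factor(c("blue", "brown", "blue", "green"))
class(eye_colour) ## [1] "factor"
levels(eye_colour) ## [1] "blue"  "brown" "green"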

The Date class

Dates also look like characters, but are very different too:

as.Date("2019/03/19") ## [1] "2019-03-19"
class(as.Date("2019/03/19")) ## [1] "Date"

Manipulating dates and time can be tricky, but thankfully there’s a {tidyverse} package for that,
called {lubridate}. We are going to go over this package in Chapter 4.

The logical class

This class is the result of logical comparisons, for example, if you type:

4 > 3 ## [1] TRUE

R returns TRUE, which is an object of class logical:

k <- 4 > 3
class(k) ## [1] "logical"

In other programming languages, logicals are often called bools. A logical variable can only have
two values, either TRUE or FALSE. You can test the truthiness of a variable with isTRUE():

k <- 4 > 3
isTRUE(k) ## [1] TRUE

How can you test if a variable is false? There is no isFALSE() function (at least not without loading a package containing such a function), but there is a way to do it:

k <- 4 > 3
!isTRUE(k) ## [1] FALSE

The ! operator indicates negation, so the above expression could be translated as “is k not TRUE?”. There are other such operators, namely &, &&, |, ||. & means “and” and | stands for “or”. You might be wondering what the difference between & and && is, or between | and ||. & and | work on vectors, doing pairwise comparisons:

one <- c(TRUE, FALSE, TRUE, FALSE)
two <- c(FALSE, TRUE, TRUE, TRUE)
one & two ## [1] FALSE FALSE TRUE FALSE

Compare this to the && operator:

one <- c(TRUE, FALSE, TRUE, FALSE)
two <- c(FALSE, TRUE, TRUE, TRUE)
one && two ## [1] FALSE

The && and || operators only compare the first element of the vectors and stop as soon as the return value can be safely determined. This is called short-circuiting. Consider the following:

one <- c(TRUE, FALSE, TRUE, FALSE)
two <- c(FALSE, TRUE, TRUE, TRUE)
three <- c(TRUE, TRUE, FALSE, FALSE)
one && two && three ## [1] FALSE
one || two || three ## [1] TRUE

The || operator stops as soon as it evaluates to TRUE, whereas && stops as soon as it evaluates to FALSE. Personally, I rarely use || or && because I get confused. I find using | or & in combination with the all() or any() functions much more useful:

one <- c(TRUE, FALSE, TRUE, FALSE)
two <- c(FALSE, TRUE, TRUE, TRUE)
any(one & two) ## [1] TRUE
all(one & two) ## [1] FALSE

any() checks whether any of the vector’s elements are TRUE and all() checks if all elements of the vector are
TRUE.

As a final note, you should know that it is possible to use T for TRUE and F for FALSE, but I would advise against doing this, because it is not very explicit.
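One concrete reason (my own illustration, not from the chapter): unlike TRUE and FALSE, T and F are ordinary variables and can be silently overwritten:

T <- 0
T ## [1] 0
# TRUE <- 0 would throw an error, but redefining T is accepted without complaint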

Vectors and matrices

You can create a vector in different ways. But first of all, it is important to understand that a
vector in most programming languages is nothing more than a list of things. These things can be
numbers (either integers or floats), strings, or even other vectors. A vector in R can only contain elements of one
single type. This is not the case for a list, which is much more flexible. We will talk about lists shortly, but
let’s first focus on vectors and matrices.

The c() function

A very important function that allows you to build a vector is c():

a <- c(1,2,3,4,5)

This creates a vector with elements 1, 2, 3, 4, 5. If you check its class:

class(a) ## [1] "numeric"

This can be confusing: you were probably expecting a to be of class vector or something similar. This is not the case if you use c() to create the vector, because c() doesn’t build a vector in the mathematical sense, but a so-called atomic vector.
Checking its dimension:

dim(a) ## NULL

returns NULL because an atomic vector doesn’t have a dimension.
If you want to create a true vector, you need to use cbind() or rbind().

But before continuing, be aware that atomic vectors can only contain elements of the same type:

c(1, 2, "3") ## [1] "1" "2" "3"

because “3” is a character, all the other values get implicitly converted to characters. You have
to be very careful about this, and if you use atomic vectors in your programming, you have to make
absolutely sure that no characters or logicals or whatever else are going to convert your atomic
vector to something you were not expecting.

cbind() and rbind()

You can create a true vector with cbind():

a <- cbind(1, 2, 3, 4, 5)

Check its class now:

class(a) ## [1] "matrix"

This is exactly what we expected. Let’s check its dimension:

dim(a) ## [1] 1 5

This returns the dimension of a using the LICO notation (number of LInes first, the number of COlumns).

It is also possible to bind vectors together to create a matrix.

b <- cbind(6,7,8,9,10)

Now let’s put vector a and b into a matrix called matrix_c using rbind().
rbind() functions the same way as cbind() but glues the vectors together by rows and not by columns.

matrix_c <- rbind(a,b)
print(matrix_c)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10

The matrix class

R also has support for matrices. For example, you can create a matrix of dimension (5,5) filled
with 0’s with the matrix() function:

matrix_a <- matrix(0, nrow = 5, ncol = 5)

If you want to create the following matrix:

\[
B = \left(
\begin{array}{ccc}
2 & 4 & 3 \\
1 & 5 & 7
\end{array} \right)
\]

you would do it like this:

B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)

The option byrow = TRUE means that the rows of the matrix will be filled first.

You can access individual elements of matrix_a like so:

matrix_a[2, 3] ## [1] 0

and R returns its value, 0. We can assign a new value to this element if we want. Try:

matrix_a[2, 3] <- 7

and now take a look at matrix_a again.

print(matrix_a) ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0 0 0 0 0 ## [2,] 0 0 7 0 0 ## [3,] 0 0 0 0 0 ## [4,] 0 0 0 0 0 ## [5,] 0 0 0 0 0

Recall our vector b:

b <- cbind(6,7,8,9,10)

To access its third element, you can simply write:

b[3] ## [1] 8

I have heard many people praising R for being a matrix based language. Matrices are indeed useful,
and statisticians are used to working with them. However, I very rarely use matrices in my
day to day work, and prefer an approach based on data frames (which will be discussed below). This
is because working with data frames makes it easier to use R’s advanced functional programming
language capabilities, and this is where R really shines in my opinion. Working with matrices
almost automatically implies using loops and all the iterative programming techniques, à la Fortran,
which I personally believe are ill-suited for interactive statistical programming (as discussed in
the introduction).

The list class

The list class is a very flexible class, and thus, very useful. You can put anything inside a list,
such as numbers:

list1 <- list(3, 2)

or vectors constructed with c():

list2 <- list(c(1, 2), c(3, 4))

you can also put objects of different classes in the same list:

list3 <- list(3, c(1, 2), "lists are amazing!")

and of course create list of lists:

my_lists <- list(list1, list2, list3)

To check the contents of a list, you can use the structure function str():

str(my_lists) ## List of 3 ## $ :List of 2 ## ..$ : num 3 ## ..$ : num 2 ## $ :List of 2 ## ..$ : num [1:2] 1 2 ## ..$ : num [1:2] 3 4 ## $ :List of 3 ## ..$ : num 3 ## ..$ : num [1:2] 1 2 ## ..$ : chr "lists are amazing!"

or you can use RStudio’s Environment pane:

You can also create named lists:

list4 <- list("a" = 2, "b" = 8, "c" = "this is a named list")

and you can access the elements in two ways:

list4[[1]] ## [1] 2

or, for named lists:

list4$c ## [1] "this is a named list"

Lists are used extensively because they are so flexible. You can build lists of datasets and apply
functions to all the datasets at once, build lists of models, lists of plots, etc… In the later
chapters we are going to learn all about them. Lists are central objects in a functional programming
workflow for interactive statistical analysis.
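As a small taste of that workflow (an assumed example of my own, not from the chapter; mtcars and iris are built-in datasets), here is a list of datasets with the same function applied to every element:

datasets <- list(cars = mtcars, flowers = iris)
lapply(datasets, nrow)
## $cars
## [1] 32
##
## $flowers
## [1] 150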

The data.frame and tibble classes

In the next chapter we are going to learn how to import datasets into R. Once you import data, the
resulting object is either a data.frame or a tibble depending on which package you used to
import the data. tibbles extend data.frames so if you know about data.frame objects already,
working with tibbles will be very easy. tibbles have a better print() method, and some other
niceties.
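As a quick illustration of the relationship between the two classes (my own example, not from the chapter):

library(tibble)
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(df)
class(df) ## [1] "data.frame"
class(tbl) ## [1] "tbl_df" "tbl" "data.frame"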

However, I want to stress that these objects are central to R and are thus very important; they are
actually special cases of lists, discussed above. There are different ways to print a data.frame or
a tibble if you wish to inspect it. You can use View(my_data) to show the my_data data.frame
in the View pane of RStudio:

You can also use the str() function:

str(my_data)

And if you need to access an individual column, you can use the $ sign, same as for a list:

my_data$col1

Formulas

We will learn more about formulas later, but because it is an important object, it is useful if you
already know about them early on. A formula is defined in the following way:

my_formula <- ~x
class(my_formula) ## [1] "formula"

Formula objects are defined using the ~ symbol. Formulas are useful to define statistical models,
for example for a linear regression:

lm(y ~ x)

or also to define anonymous functions, but more on this later.

Models

A statistical model is an object like any other in R:

data(mtcars)
my_model <- lm(mpg ~ hp, mtcars)
class(my_model) ## [1] "lm"

my_model is an object of class lm. You can apply different functions to a model object:

summary(my_model) ## ## Call: ## lm(formula = mpg ~ hp, data = mtcars) ## ## Residuals: ## Min 1Q Median 3Q Max ## -5.7121 -2.1122 -0.8854 1.5819 8.2360 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 30.09886 1.63392 18.421 < 2e-16 *** ## hp -0.06823 0.01012 -6.742 1.79e-07 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.863 on 30 degrees of freedom ## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892 ## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07

This class will be explored in later chapters.

NULL, NA and NaN

The NULL, NA and NaN classes are pretty special. NULL is returned when the result of a function is undetermined.
For example, consider list4:

list4 ## $a ## [1] 2 ## ## $b ## [1] 8 ## ## $c ## [1] "this is a named list"

if you try to access an element that does not exist, such as d, you will get NULL back:

list4$d ## NULL

NaN means “Not a Number” and is returned when a function returns something that is not a number:

sqrt(-1) ## Warning in sqrt(-1): NaNs produced ## [1] NaN

or:

0/0 ## [1] NaN

Basically, numbers that cannot be represented as floating point numbers are NaN.

Finally, there’s NA which is closely related to NaN but is used for missing values. NA stands for Not Available. There are
several types of NAs:

  • NA_integer_
  • NA_real_
  • NA_complex_
  • NA_character_

but these are in principle only used when you need to program your own functions and need to explicitly test for the missingness of, say,
a character value.

To test whether a value is NA, use the is.na() function.
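For example (a one-line illustration of my own), is.na() works element-wise on vectors:

is.na(c(1, NA, 3)) ## [1] FALSE  TRUE FALSE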

Useful functions to get you started

This section will list several basic R functions that are very useful and should be part of your toolbox.

Sequences

There are several functions that create sequences, seq(), seq_along() and rep(). rep() is easy enough:

rep(1, 10) ## [1] 1 1 1 1 1 1 1 1 1 1

This simply repeats 1 10 times. You can repeat other objects too:

rep("HAHA", 10) ## [1] "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" "HAHA"

To create a sequence, things are not as straightforward. There is seq():

seq(1, 10) ## [1] 1 2 3 4 5 6 7 8 9 10
seq(70, 80) ## [1] 70 71 72 73 74 75 76 77 78 79 80

It is also possible to provide a by argument:

seq(1, 10, by = 2) ## [1] 1 3 5 7 9

seq_along() behaves similarly, but returns a sequence from 1 to the length of the object passed to it. So if you pass list4 to seq_along(), it will return a sequence from 1 to 3:

seq_along(list4) ## [1] 1 2 3

which is also true for seq() actually:

seq(list4) ## [1] 1 2 3

but these two functions behave differently for arguments of length equal to 1:

seq(10) ## [1] 1 2 3 4 5 6 7 8 9 10
seq_along(10) ## [1] 1

So be quite careful about that. I would advise you not to use seq(), but only seq_along() and seq_len(). seq_len() only takes arguments of length 1:

seq_len(10) ## [1] 1 2 3 4 5 6 7 8 9 10
seq_along(10) ## [1] 1

The problem with seq() is that it is unpredictable; depending on its input, it either treats the argument as the end point of a sequence or as an object to count along. When programming, it is better to have functions that are stricter and fail when confronted with special cases, instead of silently returning some result. This is a bit of a recurrent issue with R, and the functions from the {tidyverse} mitigate this issue by being stricter than their base R counterparts. For example, consider the ifelse() function from base R:

ifelse(3 > 5, 1, "this is false") ## [1] "this is false"

and compare it to {dplyr}’s implementation, if_else():

if_else(3 > 5, 1, "this is false")
Error: `false` must be type double, not character
Call `rlang::last_error()` to see a backtrace

if_else() fails because the return value when FALSE is not a double (a real number) but a character. This might seem unnecessarily
strict, but at least it is predictable. This makes debugging easier when used inside functions. In Chapter 8 we are going to learn how
to write our own functions, and being strict makes programming easier.

Basic string manipulation

For now, we have not closely studied character objects; we have only learned how to define them. Later, in Chapter 5, we will learn about the {stringr} package, which provides useful functions to work with strings. However, there are several base R functions that are very useful and that you might want to know about nonetheless, such as paste() and paste0():

paste("Hello", "amigo") ## [1] "Hello amigo"

but you can also change the separator if needed:

paste("Hello", "amigo", sep = "--") ## [1] "Hello--amigo"

paste0() is the same as paste() but does not have any sep argument:

paste0("Hello", "amigo") ## [1] "Helloamigo"

If you provide a vector of characters, you can also use the collapse argument, which places whatever you provide for collapse between the elements of the vector:

paste0(c("Joseph", "Mary", "Jesus"), collapse = ", and ") ## [1] "Joseph, and Mary, and Jesus"

To change the case of characters, you can use toupper() and tolower():

tolower("HAHAHAHAH") ## [1] "hahahahah"
toupper("hueuehuehuheuhe") ## [1] "HUEUEHUEHUHEUHE"

Mathematical functions

Finally, there are the classical mathematical functions that you know and love:

  • sqrt()
  • exp()
  • log()
  • abs()
  • sin(), cos(), tan(), and others
  • sum(), cumsum(), prod(), cumprod()
  • max(), min()

and many others…

Hope you enjoyed! If you found this blog post useful, you might want to follow
me on twitter for blog post updates and
buy me an espresso or paypal.me.




Custom JavaScript, CSS and HTML in Shiny

Sun, 12/23/2018 - 23:28

(This article was first published on ListenData, and kindly contributed to R-bloggers)

In this tutorial, I will cover how to include custom JavaScript, CSS and HTML code in your R shiny app. By including them, you can make a very powerful professional web app using R.

First, let’s understand the basics of a webpage

In general, a web page contains the following sections:

  1. Content (Header, Paragraph, Footer, Listing)
  2. Font style, color, background, border
  3. Images and Videos
  4. Popups, widgets, special effects etc.
HTML, CSS and JavaScript
These three web programming languages in conjunction take care of everything a webpage contains, from text to special effects:
  1. HTML determines the content and structure of a page (header, paragraph, footer etc.)
  2. CSS controls how webpage would look like (color, font type, border etc.)
  3. JavaScript decides advanced behaviors such as pop-up, animation etc.
Fundamentals of Webpage

One of the most common web development terms you should know is rendering: the act of putting together a web page for presentation.

Shiny Dashboard Syntax

In this section, I am using the syntax of the shinydashboard library. The structure of shinydashboard syntax is similar to the shiny library: both require ui and server components, but the functions are totally different. Refer to the code below, and make sure to install the library before running the following program.

# Load Library
library(shinydashboard)

# User Interface
ui =
dashboardPage(
dashboardHeader(title = "Blank Shiny App"),
dashboardSidebar(),
dashboardBody()
)

# Server
server = function(input, output) { }

# Run App
runApp(list(ui = ui, server = server), launch.browser =T)

Example : Create Animation Effect

The program below generates an animation in the web page. To test it, you can check out this link. When the user hits the “Click Me” button, it triggers the demojs() JavaScript function, which starts the animation. It’s a very basic animation; you can edit the code and make it as complex as you want.

HTML

Click Me

CSS

#sampleanimation {
width: 50px;
height: 50px;
position: absolute;
background-color: blue;
}

#myContainer {
width: 400px;
height: 400px;
position: relative;
background: black;
}
JS

function demojs() {
var elem = document.getElementById('sampleanimation');
var position = 0;
var id = setInterval(frame, 10);
function frame() {
if (position == 350) {
clearInterval(id);
} else {
position++;
elem.style.top = position + 'px';
elem.style.left = position + 'px';
}
}
}

There are several ways to include custom JavaScript and CSS codes in Shiny. Some of the common ones are listed below with detailed explanation –

Method I : Use tags to insert HTML, CSS and JS Code in Shiny

HTML

tags$body(HTML("Your HTML Code"))

CSS

tags$head(HTML("<style>
Your CSS Code
</style>"))

OR

CSS code can also be defined using tags$style. 

tags$head(tags$style(HTML(" Your CSS Code ")))

JS

tags$head(HTML("<script>
Your JS Code
</script>"))

OR

JS code can be described with tags$script.

tags$head(tags$script(HTML(" Your JS Code ")))

Code specified in tags$head will be included and executed under <head>. Similarly, tags$body can be used to make Shiny run code within <body>.

tags$head vs. tags$body

In general, JavaScript and CSS files are defined inside <head>. Things which we want to display in the body section of the webpage should be defined within <body>.
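As a quick illustration of the two placements, here is a minimal sketch of my own (not taken from the original post; the CSS rule, heading text and console message are made up for the example):

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("h2 { color: steelblue; }"))),  # ends up inside <head>
  h2("This heading is styled by the CSS above"),             # rendered inside <body>
  tags$script(HTML("console.log('page loaded');"))           # JS can also be placed in the body
)

server <- function(input, output) { }

# runApp(list(ui = ui, server = server))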

Animation Code in Shiny

library(shinydashboard)

# User Interface
ui <- dashboardPage(
  dashboardHeader(title = "Basic Use of JS and CSS"),
  dashboardSidebar(),
  dashboardBody(

    # JavaScript Code
    tags$head(HTML("<script>
      function demojs() {
        var elem = document.getElementById('sampleanimation');
        var position = 0;
        var id = setInterval(frame, 10);
        function frame() {
          if (position == 350) {
            clearInterval(id);
          } else {
            position++;
            elem.style.top = position + 'px';
            elem.style.left = position + 'px';
          }
        }
      }
    </script>")),

    # CSS Code
    tags$head(HTML("<style>
      #sampleanimation {
        width: 50px;
        height: 50px;
        position: absolute;
        background-color: blue;
      }
    </style>")),

    # HTML Code
    box(tags$body(HTML("
      Click Me
    ")), height = 400)
  ))

# Server
server = function(input, output) { }

# Run App
runApp(list(ui = ui, server = server), launch.browser = T)

Important Note

In JS, CSS and HTML code, make sure to replace double quotation marks with single quotation marks inside Shiny’s HTML(" ") function, as a double quotation mark would otherwise be read as closing the string.

Method II : Call JavaScript and CSS files in Shiny

You can use the includeScript() and includeCSS() functions to refer to JS and CSS code from files saved in your local directory. You can save the files anywhere and pass their locations to the functions.

How to create JS and CSS files manually

Open Notepad, paste the JS code and save it with a .js file extension and the file type "All files" (not text document). Similarly, you can create the CSS file using a .css file extension.

library(shinydashboard)

# User Interface
ui <- dashboardPage(
  dashboardHeader(title = "Basic Use of JS and CSS"),
  dashboardSidebar(),
  dashboardBody(

    # Call JavaScript and CSS Code from file
    tags$head(
      includeScript("C:\\Users\\DELL\\Documents\\animate.js"),
      includeCSS("C:\\Users\\DELL\\Documents\\animation.css")
    ),

    # HTML Code
    box(tags$body(HTML("
      Click Me
    ")), height = 400)
  ))

# Server
server = function(input, output) { }

# Run App
runApp(list(ui = ui, server = server), launch.browser = T)

When to use Method 2?

When you want to include a big (lengthy) chunk of JS / CSS code, use Method 2. Method 1 should be used for small code snippets, as RStudio does not support syntax coloring and error-checking of JS / CSS code embedded in R strings. It also makes the code unnecessarily lengthy, which makes it difficult to maintain.

Method III : Add JS and CSS files under www directory

Step 1 : 
Create an app using the shinyApp() function and save it as app.R. Refer to the code below.

library(shinydashboard)

app <- shinyApp(
  ui <- dashboardPage(
    dashboardHeader(title = "Basic Use of JS"),
    dashboardSidebar(),
    dashboardBody(

      # JavaScript and CSS Code
      tags$head(tags$script(src = 'animate.js')),
      tags$head(tags$link(rel = "stylesheet", type = "text/css", href = "animation.css")),

      # HTML Code
      box(tags$body(HTML("
        Click Me
      ")), height = 400)
    )),
  server = function(input, output) { }
)

Step 2 :
Create a folder named www in your app directory (where your app.R file is stored) and save the .js and .css files under that folder. Refer to the folder structure below.

├── app.R
└── www
└── animate.js
└── animation.css

Step 3 :
Run the runApp() function, specifying the path of the app directory.

runApp(appDir = "C:/Users/DELL/Documents", launch.browser = T)

About Author:

Deepanshu founded ListenData with a simple objective – Make analytics easy to understand and follow. He has over 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains.

Let's Get Connected: LinkedIn



Certifiably Gone Phishing

Sun, 12/23/2018 - 23:28

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Phishing is [still] the primary way attackers either commit a primary criminal act (i.e. phish a target to, say, install ransomware) or is the initial vehicle used to gain a foothold in an organization so they can perform other criminal operations to achieve some goal. As such, security teams, vendors and active members of the cybersecurity community work diligently to neutralize phishing campaigns as quickly as possible.

One popular community tool/resource in this pursuit is PhishTank which is a collaborative clearing house for data and information about phishing on the Internet. Also, PhishTank provides an open API for developers and researchers to integrate anti-phishing data into their applications at no charge.

While the PhishTank API is useful for real-time anti-phishing operations the data is also useful for security researchers as we work to understand the ebb, flow and evolution of these attacks. One avenue of research is to track the various features associated with phishing campaigns which include (amongst many other elements) network (internet) location of the phishing site, industry being targeted, domain names being used, what type of sites are being cloned/copied and a feature we’ll be looking at in this post: what percentage of new phishing sites use SSL encryption and — of these — which type of SSL certificates are “en vogue”.

Phishing sites are increasingly using and relying on SSL certificates because we in the information security industry spent a decade instructing the general internet surfing population to trust sites with the green lock icon near the location bar. Initially, phishers worked to compromise existing, encryption-enabled web properties to install phishing sites/pages since they could leech off of the “trusted” status of the associated SSL certificates. However, the advent of services like Let’s Encrypt has made it possible for attackers to set up their own phishing domains that look legitimate to current-generation internet browsers and prey upon the decades-old “trust the lock icon” mantra that most internet users still believe. We’ll table that path of discussion (since it’s fraught with peril if you don’t support the internet-do-gooder-consequences-be-darned cabal’s personal agendas) and just focus on how to work with PhishTank data in R and take a look at the most prevalent SSL certs used in the past week (you can extend the provided example to go back as far as you like provided the phishing sites are still online).

Accessing PhishTank From R

You can use the aquarium package [GL|GH] to gain access to the data provided by PhishTank’s API (you need to sign up for access and put your API key into the PHISHTANK_API_KEY environment variable, which is best done via your ~/.Renviron file).

Let’s setup all the packages we’ll need and cache a current copy of the PhishTank data. The package forces you to utilize your own caching strategy since it doesn’t make sense for it to decide that for you. I’d suggest either using the time-stamped approach below or using some type of database system (or, say, Apache Drill) to actually manage the data.

Here are the packages we’ll need:

library(psl)       # git[la|hu]b/hrbrmstr/psl
library(curlparse) # git[la|hu]b/hrbrmstr/curlparse
library(aquarium)  # git[la|hu]b/hrbrmstr/aquarium
library(gt)        # github/rstudio/gt
library(furrr)
library(stringi)
library(openssl)
library(tidyverse)

NOTE: The psl and curlparse packages are optional. Windows users will find it difficult to get them working and it may be easier to review the functions provided by the urlparse package and substitute equivalents for the domain() and apex_domain() functions used below. Now, we get a copy of the current PhishTank dataset & cache it:

if (!file.exists("~/Data/2018-12-23-fishtank.rds")) {
  xdf <- pt_read_db()
  saveRDS(xdf, "~/Data/2018-12-23-fishtank.rds")
} else {
  xdf <- readRDS("~/Data/2018-12-23-fishtank.rds")
}

Let’s take a look:

glimpse(xdf) ## Observations: 16,446 ## Variables: 9 ## $ phish_id "5884184", "5884138", "5884136", "5884135", ... ## $ url "http://internetbanking-bancointer.com.br/lo... ## $ phish_detail_url "http://www.phishtank.com/phish_detail.php?p... ## $ submission_time 2018-12-22 20:45:09, 2018-12-22 18:40:24, 2... ## $ verified "yes", "yes", "yes", "yes", "yes", "yes", "y... ## $ verification_time 2018-12-22 20:45:52, 2018-12-22 21:26:49, 2... ## $ online "yes", "yes", "yes", "yes", "yes", "yes", "y... ## $ details [<209.132.252.7, 209.132.252.0/24, 7296 468... ## $ target "Other", "Other", "Other", "PayPal", "Other"...

The data is really straightforward. We have unique ids for each site/campaign and the URL of the site, along with a URL to extra descriptive info PhishTank has on the site/campaign. We also know when the site was submitted/discovered and other details, such as the network/internet space the site is in:

glimpse(xdf$details[1]) ## List of 1 ## $ :'data.frame': 1 obs. of 6 variables: ## ..$ ip_address : chr "209.132.252.7" ## ..$ cidr_block : chr "209.132.252.0/24" ## ..$ announcing_network: chr "7296 468" ## ..$ rir : chr "arin" ## ..$ country : chr "US" ## ..$ detail_time : chr "2018-12-23T01:46:16+00:00"

We’re going to focus on recent phishing sites (in this case, ones that are less than a week old) and those that use SSL certificates:

filter(xdf, verified == "yes") %>%
  filter(online == "yes") %>%
  mutate(diff = as.numeric(difftime(Sys.Date(), verification_time), "days")) %>%
  filter(diff <= 7) %>%
  { all_ct <<- nrow(.) ; . } %>%
  filter(grepl("^https", url)) %>%
  { ssl_ct <<- nrow(.) ; . } %>%
  mutate(
    domain = domain(url),
    apex = apex_domain(domain)
  ) -> recent

Let’s see how many are using SSL:

(ssl_ct)
## [1] 383
(pct_ssl <- ssl_ct / all_ct)
## [1] 0.2919207

This percentage is lower than a recent “50% of all phishing sites use encryption” statistic going around of late. There are many reasons for the difference:

  • PhishTank doesn’t have all phishing sites in it
  • We just looked at a week of examples
  • Some sites were offline at the time of access attempt
  • Diverse attacker groups with varying degrees of competence engage in phishing attacks

Despite the 20-point deviation, 30% is still a decent percentage, and a green lock icon is still a valued prize, so we shall pursue our investigation.

Now we need to retrieve all those certs. This can be a slow operation, so we’ll grab them in parallel. It’s also quite possible that the “online” status in the data frame glimpse above is inaccurate (sites can go offline quickly), so we’ll catch certificate request failures with safely() and cache the results:

cert_dl <- purrr::safely(openssl::download_ssl_cert)

plan(multiprocess)

if (!file.exists("~/Data/recent.rds")) {
  recent <- mutate(recent, cert = future_map(domain, cert_dl))
  saveRDS(recent, "~/Data/recent.rds")
} else {
  recent <- readRDS("~/Data/recent.rds")
}

Let’s see how many request failures we had:

(failed <- sum(map_lgl(recent$cert, ~is.null(.x$result))))
## [1] 25
(failed / nrow(recent))
## [1] 0.06527415

As noted in the introduction to the blog post, when attackers want to use SSL for the lock icon ruse they can either try to piggyback off of legitimate domains or rely on Let’s Encrypt to help them commit crimes. Let’s see what the top “apex” domains (https://help.github.com/articles/about-supported-custom-domains/#apex-domains) were in use in the past week:

count(recent, apex, sort = TRUE) ## # A tibble: 255 x 2 ## apex n ## ## 1 000webhostapp.com 42 ## 2 google.com 17 ## 3 umbler.net 8 ## 4 sharepoint.com 6 ## 5 com-fl.cz 5 ## 6 lbcpzonasegurabeta-viabcp.com 4 ## 7 windows.net 4 ## 8 ashaaudio.net 3 ## 9 brijprints.com 3 ## 10 portaleisp.com 3 ## # ... with 245 more rows

We can see that a large hosting provider (000webhostapp.com) bore a decent number of these sites, but Google Sites (which is what the full domain represented by the google.com apex domain here is usually pointing to), Microsoft SharePoint (sharepoint.com) and Microsoft forums (windows.net) are in active use as well (which is smart given the pervasive trust associated with those properties). There are 241 distinct apex domains in this 1-week set, so what is the SSL cert diversity across these pages/campaigns?

We ultimately used openssl::download_ssl_cert to retrieve the SSL certs of each site that was online, so let’s get the issuer and intermediary certs from them and look at the prevalence of each. We’ll extract the fields from the issuer component returned by openssl::download_ssl_cert then just do some basic maths:

filter(recent, map_lgl(cert, ~!is.null(.x$result))) %>%
  mutate(issuers = map(cert, ~map_chr(.x$result, ~.x$issuer))) %>%
  mutate(
    inter = map_chr(issuers, ~.x[1]), # the order is not guaranteed here but the goal of the exercise is
    root = map_chr(issuers, ~.x[2])   # to get you working with the data vs build a 100% complete solution
  ) %>%
  mutate(
    inter = stri_replace_all_regex(inter, ",([[:alpha:]])+=", ";;;$1=") %>%
      stri_split_fixed(";;;") %>% # there are parsers for the cert info fields but this hack is quick and works
      map(stri_split_fixed, "=", 2, simplify = TRUE) %>%
      map(~setNames(as.list(.x[,2]), .x[,1])) %>%
      map(bind_cols),
    root = stri_replace_all_regex(root, ",([[:alpha:]])+=", ";;;$1=") %>%
      stri_split_fixed(";;;") %>%
      map(stri_split_fixed, "=", 2, simplify = TRUE) %>%
      map(~setNames(as.list(.x[,2]), .x[,1])) %>%
      map(bind_cols)
  ) -> recent

Let’s take a look at roots:

unnest(recent, root) %>%
  distinct(phish_id, apex, CN) %>%
  count(CN, sort = TRUE) %>%
  mutate(pct = n/sum(n)) %>%
  gt::gt() %>%
  gt::fmt_number("n", decimals = 0) %>%
  gt::fmt_percent("pct")

CN | n | pct
DST Root CA X3 | 96 | 26.82%
COMODO RSA Certification Authority | 93 | 25.98%
DigiCert Global Root G2 | 45 | 12.57%
Baltimore CyberTrust Root | 30 | 8.38%
GlobalSign | 27 | 7.54%
DigiCert Global Root CA | 15 | 4.19%
Go Daddy Root Certificate Authority – G2 | 14 | 3.91%
COMODO ECC Certification Authority | 11 | 3.07%
Actalis Authentication Root CA | 9 | 2.51%
GlobalSign Root CA | 4 | 1.12%
Amazon Root CA 1 | 3 | 0.84%
Let’s Encrypt Authority X3 | 3 | 0.84%
AddTrust External CA Root | 2 | 0.56%
DigiCert High Assurance EV Root CA | 2 | 0.56%
USERTrust RSA Certification Authority | 2 | 0.56%
GeoTrust Global CA | 1 | 0.28%
SecureTrust CA | 1 | 0.28%

DST Root CA X3 is (wait for it) Let’s Encrypt! Comodo is not far behind, and indeed surpasses LE if we combine the extra-special “enhanced” versions they provide; it’s important to read the comments near the lines of code above that make assumptions about the order of the returned issuer information. Now, let’s take a look at intermediaries:

unnest(recent, inter) %>%
  distinct(phish_id, apex, CN) %>%
  count(CN, sort = TRUE) %>%
  mutate(pct = n/sum(n)) %>%
  gt::gt() %>%
  gt::fmt_number("n", decimals = 0) %>%
  gt::fmt_percent("pct")

CN | n | pct
Let’s Encrypt Authority X3 | 99 | 27.65%
cPanel, Inc. Certification Authority | 75 | 20.95%
RapidSSL TLS RSA CA G1 | 45 | 12.57%
Google Internet Authority G3 | 24 | 6.70%
COMODO RSA Domain Validation Secure Server CA | 20 | 5.59%
CloudFlare Inc ECC CA-2 | 18 | 5.03%
Go Daddy Secure Certificate Authority – G2 | 14 | 3.91%
COMODO ECC Domain Validation Secure Server CA 2 | 11 | 3.07%
Actalis Domain Validation Server CA G1 | 9 | 2.51%
RapidSSL RSA CA 2018 | 9 | 2.51%
Microsoft IT TLS CA 1 | 6 | 1.68%
Microsoft IT TLS CA 5 | 6 | 1.68%
DigiCert SHA2 Secure Server CA | 5 | 1.40%
Amazon | 3 | 0.84%
GlobalSign CloudSSL CA – SHA256 – G3 | 2 | 0.56%
GTS CA 1O1 | 2 | 0.56%
AlphaSSL CA – SHA256 – G2 | 1 | 0.28%
DigiCert SHA2 Extended Validation Server CA | 1 | 0.28%
DigiCert SHA2 High Assurance Server CA | 1 | 0.28%
Don Dominio / MrDomain RSA DV CA | 1 | 0.28%
GlobalSign Extended Validation CA – SHA256 – G3 | 1 | 0.28%
GlobalSign Organization Validation CA – SHA256 – G2 | 1 | 0.28%
RapidSSL SHA256 CA | 1 | 0.28%
TrustAsia TLS RSA CA | 1 | 0.28%
USERTrust RSA Domain Validation Secure Server CA | 1 | 0.28%
NA | 1 | 0.28%

LE is number one again! But, it’s important to note that these issuer CommonNames can roll up into a single issuing organization given just how messed up integrity and encryption capability is when it comes to web site certs, so the raw results could do with a bit of post-processing for a more complete picture (an exercise left to intrepid readers).
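For readers who take up that exercise, one possible (and deliberately rough) sketch of such a roll-up follows; the regex-to-organization mapping is my own assumption, illustrative only and far from complete:

# Illustrative only: collapse issuer CommonNames into broad organisations with
# a few hand-written regexes; extend or correct the patterns to suit your own data.
roll_up_cn <- function(cn) {
  dplyr::case_when(
    stringr::str_detect(cn, stringr::regex("let.s encrypt|dst root", ignore_case = TRUE)) ~ "Let's Encrypt / IdenTrust",
    stringr::str_detect(cn, stringr::regex("comodo|usertrust", ignore_case = TRUE)) ~ "Comodo",
    stringr::str_detect(cn, stringr::regex("digicert|rapidssl|geotrust", ignore_case = TRUE)) ~ "DigiCert",
    stringr::str_detect(cn, stringr::regex("globalsign|alphassl", ignore_case = TRUE)) ~ "GlobalSign",
    TRUE ~ cn
  )
}

# e.g. unnest(recent, root) %>% mutate(org = roll_up_cn(CN)) %>% count(org, sort = TRUE)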

FIN

There are tons of avenues to explore with this data, so I hope this post has whetted your collective appetites sufficiently for you to dig into it, especially if you have some down-time coming.

Let me also take this opportunity to reissue guidance I and many others have uttered this holiday season: be super careful about what you click on, which sites you even just visit, and just how much you really trust the site, provider and entity behind any form you are about to enter your personal information and credit card info into.



ShinyProxy Christmas Release

Sun, 12/23/2018 - 12:05

(This article was first published on Open Analytics, and kindly contributed to R-bloggers)

ShinyProxy is a novel, open source platform to deploy Shiny apps for the enterprise
or larger organizations. Since our previous blog post, five releases have taken place, so it is time to provide the
‘state of affairs’ before venturing into the New Year.

Kerberos and Co

To some Kerberos is a multi-headed dog that guards the gates of the Underworld. To others
it is enterprise technology that protects corporate networks and offers single sign-on for network services. ShinyProxy already supports a broad range of authentication and authorization technologies (OpenId Connect, Keycloak, LDAP, social login etc.), but with Kerberos we can support even more organizations to manage access to their Shiny apps, both using the Active Directory variant and the Kerberos implementation one finds on GNU/Linux and other unix variants. Similarly to OpenId Connect where the OIDC access token for a user is made available to the Shiny app, the Kerberos credentials cache of a user is automatically mounted inside the app container so the app can use service tickets for accessing other services (backend-principals).

Another new authentication method, webservice, was added to allow use of ShinyProxy with custom web services that handle authentication via an HTTP POST call returning a session id and user information. For the existing methods a few refinements were made, e.g. allowing specific attributes to be used as the user’s name for both Keycloak (proxy.keycloak.name-attribute) and OpenId based authentication (proxy.openid.username-attribute).

Docker Back-Ends

As you may know ShinyProxy can be scaled from a single machine (plain Docker host) to data center scale with Docker Swarm and Kubernetes clusters. For Docker Swarm a fully worked out example has been added on our ShinyProxy Config Examples repository on Github.
In the last release the Kubernetes back-end also gained support to select specific nodes
for the Shiny apps using the proxy.kubernetes.node-selector field.

It is now also possible to mount user-specific docker volumes in your apps (and many other goodies) thanks to the support of Spring expression language (or SpEL) in the application.yml configuration file:

proxy:
  ...
  specs:
  - id: santaclaus
    display-name: The most Mighty and Loyal Friend of Children
    container-cmd: ["R", "-e", "sleigh::run_rudolph()"]
    container-image: openanalytics/christmas
    access-groups: [statisticians, reindeers]
    container-volumes: [ "/home/#{proxy.userId}/gifts:/var/gifts" ]

Miscellaneous improvements

In terms of integration, we added an example of how to embed Shiny apps into arbitrary web sites or portal sites as part of the configuration examples. In terms of security, a new option proxy.bind-address now allows you to control the bind address for ShinyProxy. Styling and customizing ShinyProxy has been possible for a long time (see a.o. this blog post), but we have now added a simple way to set the favicon so you can give your ShinyProxy instances a personal touch.

Full release notes can be found on the downloads page and updated documentation on https://shinyproxy.io. As always, community support for this new release is available at

https://support.openanalytics.eu

Merry Christmas and have fun with ShinyProxy!



The Bear is Here

Sat, 12/22/2018 - 23:46

(This article was first published on R – Quintuitive, and kindly contributed to R-bloggers)

October and December have been devastating for stocks. It wasn’t until Friday though that we officially reached the depths of a bear market.

There are different definitions; the most common is a 20% pullback in an index. As readers of this blog are aware, I follow a slightly different definition, based on Jack Schannep’s work. By this definition, a bear market is official when two of the three major indexes (Dow Jones Industrial, S&P 500 and Dow Jones Transportation) reach a 16% decline from their most recent highs. For the mathematically inclined, the 16% is not random: it takes about a 19% gain off the bottom of a 16% decline to get back to the same level. Thus, a bear market is official at a 16% decline, and a bull market at a 19% rise.
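For the record, the arithmetic behind that symmetry (a two-line check of my own, not part of the original post):

# a 16% decline needs roughly a 19% gain to get back to the starting level
1 / (1 - 0.16) - 1
## [1] 0.1904762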

Friday was significant on these metrics. Here is the situation on the S&P 500:

require(quantmod)
require(PerformanceAnalytics)

sp = getSymbols("^GSPC", from="1900-01-01", auto.assign=F)
sp.ret = ROC(Cl(sp), type="discrete")
table.Drawdowns(sp.ret["2008/"], top=5)

This has the following output:

The current pullback is the first 16+% correction since 2008. The story is the same on the Dow Jones Industrial:

And slightly different on the Dow Jones Transportation:

This is certainly the most cherished bear market in US history, but that won’t make it less painful or less damaging. A recession does not always follow a bear market, but does quite often. How often – check Jack Schannep’s book. There are a few interesting statistics, like the average bear market decline and duration. Let’s take the S&P 500 (history from 1950) and do the math:

dd = table.Drawdowns(sp.ret, top=20)
print(dd)

# A visual inspection shows that we have 11 previous bear markets
mean(head(dd, 11)[,'Depth'])
# [1] -0.3288091 - 33%
median(head(dd, 11)[,'Depth'])
# [1] -0.2797 - 28%
max(head(dd, 11)[,'Depth'])
# [1] -0.1934 - 19%

A gloomy picture. The average bear market is 33%, the median bear market – 28% and the smallest – 19%. In other words, if history repeats itself, there is more pain to come. Significantly more on average.

The last piece of the puzzle (my puzzle; Jack Schannep has a lot more) is to check the market for oversold signs. For this, Schannep has proposed a simple, yet powerful, indicator to forecast market bottoms. It was extremely precise in the 2011 pullback. Here is the code:

capitulation = function(spx=NULL, dji=NULL, nyse=NULL) {
   require(quantmod)

   spx = if(is.null(spx)) na.fill(Ad(getSymbols("^GSPC", from="1900-01-01", auto.assign=F)), "extend") else spx
   dji = if(is.null(dji)) na.fill(Ad(getSymbols("^DJI", from="1900-01-01", auto.assign=F)), "extend") else dji
   nyse = if(is.null(nyse)) na.fill(Ad(getSymbols("^NYA", from="1900-01-01", auto.assign=F)), "extend") else nyse

   spx.macd = MACD(spx, nFast=1, nSlow=50, maType=EMA)
   dji.macd = MACD(dji, nFast=1, nSlow=50, maType=EMA)
   nyse.macd = MACD(nyse, nFast=1, nSlow=50, maType=EMA)

   merged = merge(spx.macd[,1], dji.macd[,1], nyse.macd[,1], spx, dji, nyse, all=FALSE)
   merged = cbind(ifelse(merged[,1] <= -10 & merged[,2] <= -10 & merged[,3] <= -10, 1, 0), merged)
   colnames(merged) = c("ind", "spx.ind", "dji.ind", "nyse.ind", "spx.close", "dji.close", "nyse.close")

   capitulation = merged[merged[,1] == 1, 2:NCOL(merged)]

   return(list(capitulation=capitulation, details=merged))
}

cap = capitulation()
tail(cap$capitulation)
#              spx.ind   dji.ind  nyse.ind spx.close dji.close nyse.close
# 2009-03-05 -16.48942 -16.15225 -16.95556    682.55   6594.44    4267.60
# 2009-03-06 -15.84705 -15.21571 -16.07971    683.38   6626.94    4284.49
# 2009-03-09 -16.14169 -15.70102 -16.65649    676.53   6547.05    4226.31
# 2009-03-10 -10.42353 -10.43552 -10.87757    719.60   6926.49    4499.38
# 2011-08-08 -13.47900 -11.42400 -14.88095   1119.46  10809.85    6895.97
# 2011-08-10 -12.61128 -11.47501 -11.57217   1120.76  10719.94    7101.24

The above table shows that there is no extreme selling in the current decline. All in all – more trouble ahead.




Simulating Persian Monarchs gameplay by @ellis2013nz

Sat, 12/22/2018 - 14:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

‘I can teach you in a minute…’

In a recent post I simulated some simple dice games and promised (or threatened) that this was the first of a series of posts about games of combined luck and skill. The main aim of that post was to show how even simple probabilistic games can become complicated with tweaks to the rules, but I also mentioned a key concept: that “any game of chance can be converted to a complex game of skill by adding gambling”.

Now I want to explore this last idea further with the fictional game Persian Monarchs. As far as I can tell, this game was invented by P. G. Wodehouse for his classic 1939 comedic novel Uncle Fred in the Springtime, surely a front runner for some of the best humorous writing of the twentieth century. Wodehouse was in true mid-season form with this, his first full-length novel featuring “Uncle Fred” (Frederick Altamont Cornwallis Twistleton, 5th Earl of Ickenham), before much of his work became tired and formulaic in the post-WWII years.

The plot of the novel is complex and not really to the point for a statistics blog, so do yourself a favour and read it for yourself in your own time. But for our purposes, it’s important to know that several of the key plot twists revolve around substantial sums of money gambled on Persian Monarchs, described by one character thus:

‘I could teach it you in a minute. In its essentials it is not unlike Blind Hooky. Here’s the way it goes. You cut a card, if you see what I mean, and the other fellow cuts a card, if you follow me. Then if the card you’ve cut is higher than the card the other fellow has cut, you win. While, conversely, if the card the other fellow’s cut is higher than the card you’ve cut, he wins.’

He shot an anxious glance at Mr Pott, as if wondering if he had been too abstruse. But Mr Pott appeared to have followed him perfectly.

‘I think I see the idea,’ he said. ‘Anyway, I’ll pick it up as I go along.’

Claude Eustace ‘Mustard’ Pott – the “Mr Pott” in the quotation above – is actually a Persian Monarchs player of considerable accomplishment, in addition to being a former silver ring bookie and Shakespearean actor, and current private detective and part-time card hustler. So having a not-too-bright but rich and friendly representative of the upper classes explain his own favourite game to him is as close to heaven as he gets.

It’s not clear from the novel whether Mr Pott (or, later, the Duke of Dunstable, who also shows himself to be a master of cardplay) wins at Persian Monarchs through superior skill or through outright cheating. There are indications both ways. For my purposes, I’m going to put aside the question of cheating and consider how it’s possible for skill to shine through in such a simple game.

It all comes down to the addition of choices about offering and accepting wagers. As the gambling rules around Persian Monarchs are not described by Wodehouse, I’m going to assume that it’s something like the following:

  • each player contributes one counter (or other gambling unit) at the beginning of a hand to a combined stake, before receiving their card;
  • after seeing their card, the non-dealer has the option of raising the wager by an extra amount, up to some specified limit (10 in my simulations below), or by zero if they wish. The dealer then has to either cover the increased wager, or decline on the spot, in which case the non-dealer wins the existing stake in full regardless of who has the higher card;
  • if the dealer has covered the extra wager, whoever has the higher card wins and collects the entire stake.

I’m also going to assume that once a hand is finished, those cards are put aside and the next hand proceeds with a reduced pack; until the pack is finished at which point the cards are all brought together and reshuffled.

To skip to the chase, here are the results of simulations of three different strategies at Persian Monarchs:

  • choosing to offer or accept extra wagers completely at random;
  • offering or accepting wagers partly at random (in order to avoid being predictable and easy to bluff) but with probabilities aimed at delivering optimal results in the long run, taking into account which cards have so far appeared;
  • as above, but without bothering to count the cards that have already appeared and been discarded, so basing choices on general probabilities about one’s card relative to a complete pack.

At the end of a game that goes through two whole packs of cards, the optimal strategy outperforms the non-card-counting strategy by between 3.8 and 5.3 counters, and the random strategy by 35 to 37 counters. So skill counts big time. Even the marginal advantage given by counting the cards and adjusting one’s strategy accordingly makes a 4% difference in your rate of return.

‘Optimal’ strategy?

For what I’ve defined as the optimal strategy (without either proof or adequate demonstration by simulation), the non-dealer’s wager-offering algorithm is as follows:

  • First choose whether to offer an increased wager or not. Do this at random, with a ‘probability of offering more’ equal to the estimated probability that the card I’ve got is higher than a random card from the pack.
  • Having decided to offer an increased wager, pick a random number between 1 and 10 and increase the wager by that amount.

The decision of whether to accept or not is taken similarly.
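
As a minimal sketch of that offering rule in isolation (my own illustration, not the full implementation, which comes in pm_hand below; offer_wager is a name I have made up here):

# A sketch of the non-dealer's offering rule, separate from the full pm_hand() below.
# Assumes known_pack is a data frame with an `id` column ordered by card precedence,
# and my_card is a single row drawn from it (both names are mine).
offer_wager <- function(my_card, known_pack, max_wager = 10) {
  # estimated chance that my card beats a random card from the remaining pack
  p_win <- sum(known_pack$id < my_card$id) / (nrow(known_pack) - 1)
  # offer an increase with probability equal to that estimated chance
  offer <- rbinom(n = 1, size = 1, prob = p_win) == 1
  # if offering, the size of the increase is chosen uniformly at random
  if (offer) sample(1:max_wager, size = 1) else 0
}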

The trick in constructing a decision rule here is that we want to avoid being predictable to our opponent. For example, if we only offered an increased wager when we held one of the top 50% of cards in the pack, we would rule out any bluffing, and once our opponent picks this up they can use it to their advantage. Similarly, if we increase the wager by more depending on our confidence, we are giving away information. So we want a compromise between strict determination and a bit of randomness always being present (even if I draw the Queen of spades I might decline an increased wager, or with the 3 of clubs I might offer to increase the stake, albeit with low probability), with the opponent never quite sure where we’re coming from – let’s call it the ‘madman strategy of Persian Monarchs’.

Writing a card game in R

How did I implement all this?

First, I defined a few convenience objects and functions.

  • pack is a data frame of 52 rows with the value and suit of all the cards in a standard pack of cards, with precedence in order as per Bridge;
  • pm_draw is a function that, given an existing (possibly incomplete) pack of cards, returns a drawn card and the remaining pack as two elements of a list:
library(tidyverse) library(scales) library(parallel) #' Define a starting full pack of cards. In sequence as per Bridge ie Ace of Spades highest, Two of Clubs lowest. pack <- data_frame( id = 1:52, suit = rep(c("clubs", "diamonds", "hearts", "spades"), each = 13), value = rep(c(2:10, "Jack", "Queen", "King", "Ace"), 4) ) %>% mutate(label = paste(value, "of", suit)) full_pack <- pack #' For a given pack (not necessarily a full one), pull out one card, and return #' that card and the remaining pack (as a list with two elements) #' #' @param pack a dataframe with such as that defined by `pack`, with at least #' a column named `id` pm_draw <- function(pack){ drawn <- sample_n(pack, size = 1) remaining_pack <- pack[pack$id != drawn$id, ] return(list(drawn = drawn, remaining_pack = remaining_pack)) }

Then the workhorse of the whole project comes in the next function, pm_hand, which plays a single hand of Persian Monarchs with two players. The wagering strategy can be "auto" (the best strategy I could think of), "ask" (a human is asked to make each wager decision), "random", or "non-card-counter". In itself, this function is not much use, but it abstracts the work of a single hand of dealing, wagering, accepting and checking the result away from what is going to become the main loop of the game, to be defined later.

#' Play a hand of Persian Monarchs #' #' @param p1_counters number of counters owned by player 1 (non-dealer) at beginning of hand #' @param p2_counters number of coutners owned by player 2 (dealer) at beginning of hand #' @known_pack pack at the beginning of the hand, 'known' by the players if they have been counting cards. #' @p1_method method for player 1 deciding whether to offer an increased wager #' @p2_method method for player 2 deciding whether to accept an offered increased wager #' @verbose Whether or not to print out extra information to console #' @p1_lab Label for player 1, used in messages to console #' @p2_lab Label for player 2, used in messages to console #' @max_wager Maximum additional wager to be offered by player 1 #' @details Player 2 is the dealer and Player 1 gets the first choice - whether or not to offer an increased wager #' on top of the original wager of 1 counter, and if so how much to make that additional wager. Player 2 then gets #' to accept and cover that wager, of to fold and concede the current wager of 1 counter to Player 1. pm_hand <- function(p1_counters, p2_counters, known_pack, p1_method = c("auto", "ask", "random", "non-card-counter"), p2_method = c("auto", "ask", "random", "non-card-counter"), verbose = FALSE, p1_lab = "Non-dealer", p2_lab = "Dealer", max_wager = 10){ # convenience function for printing a message to the console but substituting the correct label # for p1 or p2 (player 1 or player 2): pm_cat <- function(txt){ txt <- gsub("p1", p1_lab, txt) txt <- gsub("p2", p2_lab, txt) if(verbose){ cat(txt) } } p1_method <- match.arg(p1_method) p2_method <- match.arg(p2_method) # Both players start by putting in one counter each just to start to play p1_counters <- p1_counters - 1 p2_counters <- p2_counters - 1 wagered <- 2 # Draw a card each tmp1 <- pm_draw(known_pack) tmp2 <- pm_draw(tmp1$remaining_pack) p1 <- tmp1$drawn p2 <- tmp2$drawn # The probabilities as they appear to the players: p1_chance = sum(known_pack$id < p1$id) / (nrow(known_pack) - 1) p2_chance = sum(known_pack$id < p2$id) / (nrow(known_pack) - 1) # Decision - will p1 wager more? if(p1_method %in% c("auto", "random", "non-card-counter")){ # If p1 is being rational, it's chance of increasing the wager comes from it's appreciation # of the probability of winning this hand (more is 1 if increasing the wager, 0 otherwise). # But if the method is random, then they choose at random chance_more <- case_when( p1_method == "auto" ~ p1_chance, p1_method == "random" ~ runif(1, 0, 1), p1_method == "non-card-counter" ~ sum(full_pack$id < p1$id) / 51 ) more <- rbinom(prob = chance_more, size = 1, n = 1) extra <- sample(1:max_wager, size = 1) if(more == 0){ pm_cat("p1 does not add to the initial wager.\n") } } else { pm_cat(paste0("You drew the ", p1$label, ". How much extra will you bet this hand?\n(maximum is ", max_wager, ", 0 is acceptable)")) extra <- -1 while(!extra %in% 0:max_wager){ extra <- as.numeric(readLines(n = 1)) } more <- as.integer(extra > 0) } if(more == 1){ # Decision - will p2 try to accept the increased wager? if(p2_method %in% c("auto", "non-card-counter")){ p2_chance_enh <- case_when( p2_method == "auto" ~ p2_chance, p2_method == "non-card-counter" ~ sum(full_pack$id < p2$id) / 51 ) p2_expected <- p2_chance_enh * (wagered + extra * 2) - (1 - p2_chance_enh) * (wagered + extra * 2) accept = p2_expected > (-wagered) } else { if(p2_method == "ask"){ pm_cat(paste0("You drew the ", p2$label, ". The other player raises the wager by ", extra, ". Do you accept? 
(Y/n)\n")) input <- "" while(!tolower(input) %in% c("y", "n")){ input <- readLines(n = 1) } accept = (tolower(input) == "y") } else { # random acceptance: accept = as.logical(runif(1, 0 , 1) > 0.5) } } if(accept & p2_counters >= extra){ # the bet is on: p1_counters <- p1_counters - extra p2_counters <- p2_counters - extra wagered <- wagered + extra * 2 pm_cat("p2 accepts the added wager.\n") } else { # or else, p2 gives up: p1_counters <- p1_counters + wagered wagered <- 0 if(verbose){ pm_cat("p2 did not accept the added wager.\n") } } } # Who won? if(p1$id > p2$id){ p1_counters <- p1_counters + wagered if(verbose){ pm_cat(paste0(p1$label, " beats ", p2$label, "; p1 wins.\n")) } } else { p2_counters <- p2_counters + wagered if(verbose){ pm_cat(paste0(p1$label, " loses to ", p2$label, "; p2 wins.\n")) } } known_pack <- tmp2$remaining_pack if(nrow(known_pack) < 2){ pm_cat("Getting a fresh pack of cards.\n") known_pack <- pack } return(list( p1_counters = p1_counters, p2_counters = p2_counters, known_pack = known_pack)) }

Having defined the functionality of a single hand, I then wrote a function for the main cycle of a game with a user-specified number of rounds. This is basically a wrapper around pm_hand, making it user-friendly for a human player, with sufficient messages to the console.

#-----------------------Human versus computer--------------------- #' Play a human v computer game of Persian Monarchs #' #' @param human number of chips human starts with #' @param computer number of chips computer starts with #' @param number of rounds to play (both players get a turn in a round) persian_monarchs_hvc <- function(human = 100, computer = 100, rounds = 10){ start_pack <- data_frame( id = 1:52, suit = rep(c("clubs", "diamonds", "hearts", "spades"), each = 13), value = rep(c(2:10, "Jack", "Queen", "King", "Ace"), 4) ) %>% mutate(label = paste(value, "of", suit)) known_pack <- start_pack for(i in 1:rounds){ # Computer deals: if(i != 1) { cat("Press enter for next hand.\n") readLines(n = 1) } message("Computer's deal.") cp <- pm_hand(human, computer, known_pack, p1_method = "ask", verbose = TRUE, p1_lab = "Human", p2_lab = "Computer") human <- cp$p1_counters computer <- cp$p2_counters known_pack <- cp$known_pack cat(paste0("You have ", human, " counters.\n")) cat(paste0("There are ", nrow(known_pack), " cards left in the pack.\n")) # Human deals: cat("Press enter for next hand.\n") readLines(n = 1) message("Your deal.") cp <- pm_hand(computer, human, known_pack, p2_method = "ask", verbose = TRUE, p1_lab = "Computer", p2_lab = "Human") human <- cp$p2_counters computer <- cp$p1_counters known_pack <- cp$known_pack cat(paste0("You have ", human, " counters.\n")) cat(paste0("There are ", nrow(known_pack), " cards left in the pack.\n")) } } # Run an interactive game against the computer for three rounds persian_monarchs_hvc(rounds = 3)

This lets us play the game interactively in the R console, with a typical three round game shown in the screenshot below:

As you can see, at the end of the three rounds I’d lost a total of one counter to the computer, playing a more or less ad hoc but sensible strategy of increasing the wager by substantial amounts if I drew a spade, less so for hearts, and minimal or nothing for the minor suits (and a comparable strategy for acceptance).

Simulating card game results

It was fun to write a computer game in R, but this particular one isn’t really fascinating enough to play thousands of rounds to explore optimal strategy. To do that we want a function that lets a computer play against itself, so I adapted the persian_monarchs_hvc function to persian_monarchs_cvc (computer versus computer) as below. We can specify one of the three automated strategies for each player.

#-------------------------computer versus computer------------------- #' Play a computer v computer game of Persian Monarchs #' #' @param c1 number of chips Computer One starts with #' @param c2 number of chips Computer Two starts with #' @param c1_method one of "auto" (best choice given expected value) and "random" (coin flip) for #' deciding whether Computer One offers or accepts an increased wager #' @param c2_method one of "auto" (best choice given expected value) and "random" (coin flip) for #' deciding whether Computer Two offers or accepts an increased wager #' @param number of rounds to play (both players get a turn as dealer in a round, so a round is two hands) #' @param verbose whether or not to print results to the screen persian_monarchs_cvc <- function(c1 = 100, c2 = 100, c1_method = c("auto", "random", "non-card-counter"), c2_method = c("auto", "random", "non-card-counter"), rounds = 10, verbose = TRUE){ c1_method <- match.arg(c1_method) c2_method <- match.arg(c2_method) start_pack <- data_frame( id = 1:52, suit = rep(c("clubs", "diamonds", "hearts", "spades"), each = 13), value = rep(c(2:10, "Jack", "Queen", "King", "Ace"), 4) ) %>% mutate(label = paste(value, "of", suit)) known_pack <- start_pack for(i in 1:rounds){ # c2 deals: if(verbose){message("Computer Two's deal.")} cp <- pm_hand(c1, c2, known_pack, p1_method = c1_method, p2_method = c2_method, verbose = verbose, p1_lab = "Computer One", p2_lab = "Computer Two") c1 <- cp$p1_counters c2 <- cp$p2_counters known_pack <- cp$known_pack if(verbose){ cat(paste0("Computer One has ", c1, " counters.\n")) cat(paste0("There are ", nrow(known_pack), " cards left in the pack.\n")) } # c1 deals: if(verbose){ message("Computer One's deal.") } cp <- pm_hand(c2, c1, known_pack, p1_method = c2_method, p2_method = c1_method, verbose = verbose, p1_lab = "Computer Two", p2_lab = "Computer One") c1 <- cp$p2_counters c2 <- cp$p1_counters known_pack <- cp$known_pack if(verbose){ cat(paste0("Computer One has ", c1, " counters.\n")) cat(paste0("There are ", nrow(known_pack), " cards left in the pack.\n")) } } return(data_frame(c1 = c1, c2 = c2)) } # examples: persian_monarchs_cvc(rounds = 3, verbose = TRUE) persian_monarchs_cvc(rounds = 26, c1_method = "auto", c2_method = "auto", verbose = FALSE) persian_monarchs_cvc(rounds = 26, c1_method = "auto", c2_method = "random", verbose = FALSE)

Now that we have this function, it’s a simple matter of playing many thousands of games between computer opponents with differing strategies. This sort of thing usually needs parallel processing (i.e. running simulations on as many processors as are available at once, rather than just one at a time) to happen in a reasonable period of time. These days, R has plenty of packages to enable this, even on Windows.

The code below does the simulations (and modelling and presentation of results) of what I’ve called the “optimal” strategy versus an identical strategy, a “no card counting” alternative, and making wager decisions at random.

#--------------------multiple simulations using parallel processing-------------------- # cluster with seven processors: cl <- makeCluster(7) # load the functionality we need onto the cluster: clusterEvalQ(cl, {library(tidyverse)}) clusterExport(cl, c("persian_monarchs_cvc", "full_pack", "pack", "pm_hand", "pm_draw")) # number of simulations to do in each run. Takes about 90 seconds per run to do 10,000. n <- 10000 # Run 1 - with both players using the optimal strategy results_auto <- parLapply(cl, 1:n, function(x){ persian_monarchs_cvc(rounds = 26, c1_method = "auto", c2_method = "auto", verbose = FALSE) }) results_auto_df <- do.call("rbind", results_auto) %>% mutate(c2_method = "Best strategy") # Run 2 - computer two plays ok but without counting cards left in the pack results_ncc <- parLapply(cl, 1:n, function(x){ persian_monarchs_cvc(rounds = 26, c1_method = "auto", c2_method = "non-card-counter", verbose = FALSE) }) results_ncc_df <- do.call("rbind", results_ncc) %>% mutate(c2_method = "OK strategy but no card counting") # Run 3 - computer two makes and accepts bets at random results_random <- parLapply(cl, 1:n, function(x){ persian_monarchs_cvc(rounds = 26, c1_method = "auto", c2_method = "random", verbose = FALSE) }) results_random_df <- do.call("rbind", results_random) %>% mutate(c2_method = "Random") stopCluster(cl) results_combined <- results_auto_df %>% rbind(results_random_df) %>% rbind(results_ncc_df) p <- results_combined %>% mutate(c2_method = fct_reorder(c2_method, c2)) %>% ggplot(aes(x = c2, fill = c2_method, colour = c2_method)) + geom_density(alpha = 0.5) + #facet_wrap(~c2_method) + labs(fill = "Wagering decisions:", x = "Counters left at end of game", colour = "Wagering decisions:", y = paste("Density of results from", format(n, big.mark = ","), "simulations")) + ggtitle("Results of two automated wagering strategies for Persian Monarchs", "Counters left (having started with 100) after 26 rounds (two full packs) against a\nsound strategy") print(p) mod <- lm(c2 ~ c2_method, data = results_combined) confint(mod) Reflection

Astute readers will have noticed that the strategy I’ve sometimes called “optimal” is anything but. In fact, I’ve assumed away the chance of learning anything from the opponent’s own behaviour, whether a priori (for instance the reasonable assumption that they are more likely to accept or offer increased wagers with better cards) or after observing them for many rounds. So if we were seeking to create a Persian Monarchs master computer player, the algorithms above would be a sort of starting base case, to which we would need to add the ability to learn all sorts of tactics and strategies such as reading the opponent’s tendencies, swapping strategies oneself to confuse the opponent, bluffs and counter bluffs.

But that’s enough for now. In 2019 I’ll be extending this idea to some reflections on variants of Snakes and Ladders, and then to a classic 1970s text-based computer game that’s the granddaddy of Civilization and its turn-based strategic world-building ilk.



Day 22 – little helper get_files

Sat, 12/22/2018 - 08:00

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

We at STATWORX work a lot with R and we often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. At first, there was no plan to make a package, but I soon realised that it would be much easier to share and improve those functions if they were within a package. Up until the 24th of December I will present one function each day from helfRlein. So, on the 22nd day of Christmas my true love gave to me…

What can it do?

This little helper does the same thing as the "Find in files" search within RStudio. It returns a vector with all files in a given folder that contain the search pattern. In your daily workflow, you would usually use the shortcut key SHIFT+CTRL+F. With get_files() you can use this functionality within your scripts.
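
For readers curious how such a helper might work under the hood, here is a rough sketch of my own; the real get_files() in helfRlein may use different argument names, defaults or internals, so treat this purely as an illustration.

# A rough sketch of a "find in files" helper; argument names are my own guesses
# and may differ from the actual get_files() in helfRlein.
find_in_files <- function(folder = ".", pattern, file_pattern = "\\.R$") {
  files <- list.files(folder, pattern = file_pattern,
                      recursive = TRUE, full.names = TRUE)
  hits <- vapply(files,
                 function(f) any(grepl(pattern, readLines(f, warn = FALSE))),
                 logical(1))
  files[hits]
}

# e.g. all R scripts in the current project that mention "dplyr":
# find_in_files(".", pattern = "dplyr")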

Overview

To see all the other functions you can either check out our GitHub or you can read about them here.

Have a merry advent season!

About the author

Jakob Gepp

Numbers have always been my passion, and as a data scientist and statistician at STATWORX I can fulfill my nerdy needs. I am also responsible for our blog. So if you have any questions or suggestions, just send me an email!

ABOUT US

STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. 


The post Day 22 – little helper get_files first appeared on STATWORX.



Bubble Packed Chart with R using packcircles package

Sat, 12/22/2018 - 01:00

(This article was first published on R on Chi's Impe[r]fect Blog, and kindly contributed to R-bloggers)

Tableau has a chart type called “Packed Bubble Chart”. While I haven’t really used packed bubble charts much, I’ve always thought they are fun and beautiful. I wanted to try creating the same kind of chart using R, and I came across a package called packcircles.

Reading the vignettes was really helpful to figure out how to use the package!!
Check out the introduction vignette and the Progressive packing vignette to get started on using this package.

Creating Bubble Packed Abstract Art…

I didn’t really have a data set handy to use for this type of chart, so I decided to play around with an image. Initially I wanted to create the kind of images used to test colour blindness, but I wasn’t sure what colour palettes should be used to create these images.

Since I currently love Memoji on my iPhone, I decided I’d use a Memoji image as the base to create abstract art.

Steps
  • Step 1. Import the image and convert it to a data frame, so you can extract the colour values (RGB)
  • Step 2. Generate the circle packing layout using the circleProgressiveLayout function. The resulting data frame contains the center point of each circle (x,y) and its radius.
  • Step 3. Convert the x & y coordinates from the data frame in Step 2 so that you can figure out what colour to fill each circle with, i.e. the data frames from Step 1 and Step 2 need to be joined, so you need to adjust the scaling.
  • Step 4. Create a data frame using the circleLayoutVertices function for plotting with ggplot2. The resulting data frame now has a specified number of points per individual circle so that you can use geom_path or geom_polygon to draw them.
  • Step 5. Join the data from Step 4 with the colour values from Step 3, so that you can use geom_polygon with a fill value to colour the circles!
library(tidyverse) ## I need tidyverse for everything :) library(imager) ## to create data frame from image library(scales) ## rescale function is so handy! library(packcircles) ## making circle packing easy! ## Step 1 im <- load.image("https://farm5.staticflickr.com/4868/45503751845_948f121563_z.jpg") #memoji2 ## if you want to take a look at image.. :) #plot(im) ## Convert Image into Data Frame im.df.colour <- im %>% as.data.frame(wide="c") %>% ## so that rgb value is in separate column. rename(im_x=x,im_y=y) %>% mutate(hex=rgb(c.1,c.2,c.3)) ## Step 2 using circleProgressiveLayout function. ## Generate circle packing layout using rbeta distribution as size of circles pack_layout <- circleProgressiveLayout(rbeta(2000,1,2), sizetype='area') %>% ## Step 3 - I want to figure out what colour to use, so I want layout & image df to have same scaling. mutate(im_x=floor(rescale(x,to=range(im.df.colour$im_x))), im_y=floor(rescale(y,to=range(im.df.colour$im_y))), ## also generate id, so i can join the data frame easily later! id=row_number()) %>% inner_join(im.df.colour %>% select(im_x,im_y,hex), by=c("im_x","im_y")) ## Step 4 ## Using the layout above create data frame using circleLayoutVertices function so that you can plot circle using ggplot2 data_gg <- circleLayoutVertices(pack_layout) %>% inner_join(pack_layout %>% select(id,hex), by=c("id")) ## Step 5 data_gg %>% ggplot(aes(x=x,y=y,group=id)) + geom_polygon(aes(fill=hex)) + scale_fill_identity() + coord_equal() + scale_y_reverse() + ## you need to reverse y-axis theme_void()

Just a few more of these…

A square image works best with the approach above, but it’s also fun turning logos & other images into these!

Below are just a few more things I’ve experimented with using the circle packing technique…

Experimenting with RGB Colour… ## Generate layout from 500 uniformly distributed number 0 to 1 as area. pack_layout1 <- circleProgressiveLayout(runif(n=500), sizetype='area') ## I want to colour each circle with different rgb value, so I'll append data pack_layout1 <- pack_layout1 %>% mutate(hex_r=rgb(1,rescale(x),rescale(y),rescale(radius)), hex_g=rgb(rescale(x),1,rescale(y),rescale(radius)), hex_b=rgb(rescale(x),rescale(y),1,rescale(radius)), id = row_number()) ## append id, so you can join this table later. ## pack_layout1 contains data where center of circle should be placed with its radius. ## Now generate data so that you can actually draw circle using ggplot2 data_gg1 <- circleLayoutVertices(pack_layout1, npoints=25) ## notice now you have for each circle, you have 25 x and y coordinates to draw circle! ## Since the colour I want to use for each circle is retained in pack_layout1 data frame, I want to combine the info. Also I want to create 3 sets of different colouring. I want to make long table. data_gg1 <- data_gg1 %>% inner_join(pack_layout1 %>% select(-x,-y), by=c("id")) ## I want to create 3 different coloured variations, so convert above table to long format. data_gg_long <- data_gg1 %>% gather(key="colour_group",value="hex",hex_r:hex_b) %>% mutate(colour_group = factor(colour_group,levels=c("hex_r","hex_g","hex_b"), labels=c("keeping red value constant\nmore green to the right - more blue to the top\nsmaller circle has more transparency", "keeping green value constant\nmore red to the right - more blue to the top\nsmaller circle has more transparency", "keeping blue value constant\nmore red to the right - more green to the top\nsmaller circle has more transparency"))) ## Now the fun part! data_gg_long %>% ggplot(aes(x=x,y=y)) + geom_polygon(aes(group=id),fill="#ffffff") + ## first draw all circle white. geom_polygon(aes(group=id, fill=hex)) + ## then colour with value with some transparency coord_equal() + theme_void() + scale_fill_identity() + scale_y_reverse() + facet_wrap(~colour_group) + theme(plot.background=element_rect(fill="#000000de"), strip.text=element_text(family="Roboto Condensed", color="#ffffffde"))

Drawing Smaller Circles Inside of Circles ## Instead of using uniform distribution, used beta distribution this time! pack_layout2 <- circleProgressiveLayout(rbeta(1000,1,1), sizetype='area') ## This time I want to fill circle using hue value... pack_layout2 <- pack_layout2 %>% mutate(r = sqrt(x^2 + y^2), ## calculate distance from 0,0 coordinate angle_t = atan2(y,x), ## The arc-tangent of two arguments atan2(y, x) returns the angle between the x-axis and the vector from the origin to (x, y) angle = rescale(angle_t, from=c(-pi,pi)), ## convert theta value to value betwwen 0 and 1 hex = hsv(h=angle, s=rescale(r), v=0.8), id = row_number()) ## use circleLayoutVertices function to generate data frame for ggplot2 & bring colour info. data_gg2 <- circleLayoutVertices(pack_layout2,npoints=25) %>% inner_join(pack_layout2 %>% select(-x,-y), by=c("id")) ## Now create data for inner circles!! But I'm sampling so that NOT all circle has inner circles! I want to pick more bigger circles than smaller circle, so using raidus as weight to sample. data_gg2_1 <- circleLayoutVertices(pack_layout2 %>% sample_n(800, weight=radius) %>% mutate(radius=0.7*radius), npoints=25) ## I want to draw smaller circle, so shrink the radius data_gg2_2 <- circleLayoutVertices(pack_layout2 %>% sample_n(700,weight=radius) %>% mutate(radius=0.5*radius),npoints=25) data_gg2_3 <- circleLayoutVertices(pack_layout2 %>% sample_n(900,weight=radius) %>% mutate(radius=0.3*radius),npoints=25) ## Draw Black and White Version bw <-data_gg2 %>% ggplot(aes(x=x,y=y, group=id)) + geom_path(data=data_gg2, size=0.5, color="#00000090") + geom_path(data=data_gg2_1,size=1, color="#00000090") + geom_path(data=data_gg2_2,size=0.5, color="#00000090") + geom_path(data=data_gg2_3,size=0.5, color="#00000090") + scale_fill_identity() + scale_color_identity() + theme_void() + coord_fixed() ## Draw colourful version hue <-data_gg2 %>% ggplot(aes(x=x,y=y, group=id)) + geom_polygon(aes(fill=hex)) + geom_path(data=data_gg2, size=0.5, color="#ffffff90") + geom_path(data=data_gg2_1,size=1, color="#ffffff90") + geom_path(data=data_gg2_2,size=0.5, color="#ffffff90") + geom_path(data=data_gg2_3,size=0.5, color="#ffffff90") + scale_fill_identity() + scale_color_identity() + theme_void() + coord_fixed() library(patchwork) bw + hue



The Riddler: Santa Needs Some Help With Math

Sat, 12/22/2018 - 01:00

(This article was first published on R on R(e)Thinking, and kindly contributed to R-bloggers)

For the last Riddler of the year I attempt to solve both the Express and Classic Riddlers!

Riddler Express: How Long Will it Take Santa to Place his Reindeer?

We need to find out how long it will take Santa to place his reindeer in the correct sled positions if he proceeds at random. It takes a minute to harness a reindeer, and the reindeer will grunt to indicate it’s in the right position.

Thinking in terms of a probability tree, if we consider the first position in the sled, the expected time that it will take to place the reindeer correctly looks like this:

The expected time that it will take for the first reindeer to be placed correctly is the sum over all the paths leading to the green circles. For example:

$$
E[T_{8}] = \left(\frac{1}{8} \cdot 1\right) + \left(\frac{7}{8} \cdot \frac{1}{7} \cdot 2\right) + \left(\frac{7}{8} \cdot \frac{6}{7} \cdot \frac{1}{6} \cdot 3\right) + \cdots
$$

This process can be repeated for each of the eight deer, with the tree getting one branch shorter for each deer that has been correctly placed. The expected time to place all eight deer will then be the sum of the expected times for each deer:

$$
E[T]=\sum_{i=1}^{8}E[T_{i}]
$$

To generalize this problem to a sled with N deer, I coded the expected time in R as follows:

raindeer <- function(N) { if (N==2) return(3/2) if (N==1) return(1) n <- N t <- 2 T <- (1/n) while (n>2) { T <- T + ( prod(seq(N-1,2)[1:(t-1)]) / prod(seq(N,3)[1:(t-1)]) ) * ((1/(n-1))*t) t <- t+1 n <- n-1 } T <- T + ( prod(seq(N-1,2)[1:(t-2)]) / prod(seq(N,3)[1:(t-2)]) ) * ((1/(n))*t) T } sled <- function(N) sum(sapply(1:N, raindeer)) sled(8)

[1] 22

So it will take Santa about 22 minutes to get his sled ready to go.
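
As a quick sanity check on that analytic answer, here is a small Monte Carlo simulation of my own (not part of the original Riddler solution). It assumes Santa tries the remaining reindeer in a random order without repeating any, at one minute per attempt:

# Simulate the total harnessing time for one sled of n_deer reindeer.
sim_sled_time <- function(n_deer = 8) {
  # for each position, the number of one-minute attempts needed is uniform
  # on 1..(number of deer still unplaced), since no deer is tried twice
  attempts <- sapply(n_deer:1, function(remaining) sample(remaining, 1))
  sum(attempts)
}

set.seed(2018)
mean(replicate(10000, sim_sled_time()))  # comes out close to the analytic 22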



Re-creating a Voronoi-Style Map with R

Sat, 12/22/2018 - 01:00

(This article was first published on r on Tony ElHabr, and kindly contributed to R-bloggers)

Introduction

I’ve written some “tutorial”-like content recently—see
here,
here, and
here—but
I’ve been lacking ideas for “original” content since then. With that said,
I thought it would be fun to try to re-create something with R. (Not too long ago
I saw that
Andrew Heiss did something akin to this with
Charles Minard’s well-known visualization of Napoleon’s 1812 Russian campaign.)

The focus of my re-creation here is the price contour map shown on the
front page of the website for the Electric Reliability Council of Texas
(the independent system operator of electric power flow
for about 90 percent of Texas residents, as well as the employer of yours truly).

Without going too much into the details of what kind of information
and insight the map is supposed to provide 1, it’s sufficient for the
reader to simply understand the following:

  • “cooler” colors (i.e. blue and green)
    generally indicate regions of electric energy stability—where electric power
    generation is sufficiently meeting electric demand at a given point
    in time;
  • and “warmer” colors (i.e. orange and red) generally represent the
    opposite—regions where there is some “stress” at a given point in time
    due to either the existence or threat of power imbalance.

Anyways, aside from my familiarity with the map (just from visiting
the website so often), I thought that it would
be interesting to try to re-create it because it presents some “challenges”
that I had not really tackled before (even in my previous endeavors
with {ggplot2} and maps)—namely,
(1) manipulating KML data
and (2) creating Voronoi
shapes. 2

Re-creation Process

Unfortunately, the code needed to re-produce the map is not exactly trivial,
so I’m going to leave details-hungry readers
a link to the repository
to explore the techniques that I used on their own time. 3 To summarize the process,
here’s what I did:

  1. I downloaded the price data CSVs for a specific settlement
    interval—7:15 AM interval on November 15, 2018—from public links
    on ERCOT’s website. 4

    Additionally, I downloaded the “static” KML data upon which the map is built.

    Finally, I downloaded “dynamic” KML data representing the price points
    themselves (for both the Real-Time Market (RTM) and the Day-Ahead Market (DAM).

    I should note that with this step and the others that follow,
    I repeated the actions for both data sets—those representing the
    RTM and DAM prices and points.

  2. With the KML files for the price points, I extracted the
    longitude and latitude value pairs from the points defined in the KML
    using the {tidykml} package.
    Additionally, I converted the legend (embedded in the KML) into a vector
    to be used for the final recreated plot(s) themselves. 5

  3. With the CSVs with the price data, I did a bit of “data munging” to convert
    the data for the two markets into an equivalent, easy-to-work-with format, and
    I joined this with the coordinate data from the previous step.

  4. With the KML files representing the map’s static data, I created
    a SpatialPolygonsDataFrame encompassing both the Texas state and county boundaries,
    as well as the ERCOT region boundaries (which is separate).

    This was probably the most difficult part for me to figure out. Nonetheless,
    I managed it with the help of the R community’s packages geared towards working with spatial
    data—most notably, the {sp}
    and {sf} packages—as well as some great
    documentation and tutorials. I found
    this one
    particularly helpful, although one may consider its techniques “outdated” since it doesn’t
    utilize {sf} for interpolation. 6

  5. Next was the “heart” of the process—the interpolation to create the Voronoi
    shapes. 7 This may be the part that is most interesting to the reader,
    since a similar process could be applied to one’s own project (see the sketch just after this list). 8

  6. Finally, I used {ggplot2} to create the visualizations themselves.
    Notably, one must be careful with the order of calls to
    geom_polygon(), geom_sf(), and coord_sf() in order to get
    the ordering of layers correct.
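
For readers who just want the flavour of the Voronoi step without digging through the repository, here is a minimal, self-contained sketch using {sf} and made-up points. It is not the code used for the actual ERCOT map, which also handles projections, clipping to the ERCOT boundary, and joining the real price data:

library(sf)

# made-up points standing in for settlement price locations
set.seed(1)
pts <- st_as_sf(
  data.frame(x = runif(20), y = runif(20), price = rnorm(20, 25, 5)),
  coords = c("x", "y")
)

# Voronoi polygons around the points, clipped to their bounding box
voro <- st_voronoi(st_union(pts))
voro <- st_collection_extract(voro, "POLYGON")
voro_sf <- st_intersection(st_sf(geometry = voro), st_as_sfc(st_bbox(pts)))

# attach each point's attributes (e.g. price) to the polygon containing it
voro_sf <- st_join(voro_sf, pts)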

As for other details about the implementation, this process provided a perfect
example of how/why functions can be so useful—it was really nice to just call
a custom function twice to repeat each step for
the two data sets, which certainly saved me some time and effort.

The Maps

So what does the final product look like? Below, I’ve placed my re-creations
and the actual image side by side (with the help of the
{magick} package).

Not bad, right? OK, so the sizing isn’t perfect, but I would say it’s close enough.

To take things one step further, I tried “subtracting” the values between the two
re-created maps. There is some value to doing this—to visualize the discrepancies
between prices in the two markets (which, hypothetically, should be equal).

Because this map displays information that is inherently different from that
of the stand-alone RTM and DAM maps, I used a different color
palette—specifically, the “E” palette that comes with the
awesome {viridis} package,
which is conveniently made available in
{ggplot2} ever since the v3 release.

Final Thoughts

I really enjoyed this challenge to myself. I can certainly see myself doing something
similar in the future, perhaps with a focus with which I’m less familiar (to give
myself an even greater challenge). 9

  1. Check out the pop-up help menu for a bit more information about the map itself and this link for more information about Locational Marginal Prices (LMPs).
    ^
  2. If you’ve never heard of Voronoi diagrams, I would highly suggest reading a bit about them—the math behind them is fairly straightforward, and the results are quite elegant.
    ^
  3. I wouldn’t normally use the if ([... exists]) then [do ...] else [...] technique heavily like I do in the “main” script in the repository (because I don’t think it’s exactly a “best practice” and I’d rather use functions to achieve the same thing), but this idiom fit what I was trying to achieve at the time.
    ^
  4. Note that the map updates every 5 minutes for Real-Time Market (RTM) Locational Marginal Prices (LMPs) and every 15 minutes for RTM Settlement Point Prices (SPPs). Additionally, the Day-Ahead Market (DAM) LMP and SPP maps are updated once per day, with filters allowing the user to browse prices at a specified hour.
    ^
  5. There was some serious “hard-coding” needed to do this.
    ^
  6. The {tidyverse} set of packages, as well as {rvest} and {xml2} were also instrumental in this process.
    ^
  7. As with the previous step, R’s spatial packages were extremely helpful here.
    ^
  8. The rest of the steps may be too project-specific to be “generalizable”.
    ^
  9. Just brainstorming, I really enjoyed Kirk Goldsberry’s piece post-2018-midterm elections (mostly for the informative and “clean” visuals). The maps are practically begging to be re-created!
    ^


Can imputing model labels using the model’s predictions improve its performance?

Fri, 12/21/2018 - 15:11

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

In some scenarios a data scientist may want to train a model for which there exists an abundance of observations, but only a small fraction of them is labeled, making the sample size available to train the model rather small. Although there’s plenty of literature on the subject (e.g. “active learning”, “semi-supervised learning”, etc.), one may be tempted (maybe due to fast-approaching deadlines) to train a model with the labelled data and use it to impute the missing labels.
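
For concreteness, here is a bare-bones sketch of the procedure being questioned (a toy example of my own, not the author's simulation code), assuming the imputed labels are then used to enlarge the training set:

library(tidyverse)

set.seed(1)
n   <- 1000
dat <- tibble(x = rnorm(n), y = rbinom(n, 1, plogis(2 * x)))

labelled   <- dat %>% slice(1:100)                 # only 10% of the labels are observed
unlabelled <- dat %>% slice(101:n) %>% select(-y)

# 1. train on the small labelled subset
fit_small <- glm(y ~ x, family = binomial, data = labelled)

# 2. impute the missing labels with the model's own predictions
unlabelled$y <- as.integer(predict(fit_small, newdata = unlabelled,
                                   type = "response") > 0.5)

# 3. re-train on the "augmented" data set
fit_big <- glm(y ~ x, family = binomial, data = bind_rows(labelled, unlabelled))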

While to some the above suggestion might seem simply incorrect, I have encountered such suggestions on several occasions and had a hard time refuting them. To make sure it wasn’t just the type of places I work at, I went and asked for opinions in 2 Israeli (sorry, non-Hebrew readers) machine-learning-oriented Facebook groups: Machine & Deep learning Israel and the Statistics and probability group. While many referred me to methods discussed in the literature, almost no one indicated that the proposed method was utterly wrong.

I decided to perform a simulation study to get a definitive answer once and for all. If you’re interested in the results, see my analysis on Github.



Gold-Mining Week 16 (2018)

Fri, 12/21/2018 - 14:28

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 16 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

The post Gold-Mining Week 16 (2018) appeared first on Fantasy Football Analytics.



Blogdown – shortcode for radix-like Bibtex

Fri, 12/21/2018 - 01:00

(This article was first published on R on YIHAN WU, and kindly contributed to R-bloggers)

In the spirit of trying out new things in Hugo since my last post on modifying the RSS feed for this website, I attempted to implement the citation feature from the new radix package by RStudio.

Essentially, I tried using a custom hugo shortcode to replicate the text and BibTex citation at the bottom of the page when rendered by radix.

Custom shortcodes should live in the layouts/shortcodes directory as an example_shortcode.html file, and then the HTML code can be injected into the page through a shortcode such as

{{< example_shortcode >}}

A lot of this code comes directly from the hugo documentation on shortcodes and the hugo community forums.

The entire shortcode is below:

Citation:

For attribution, please cite this work as

{{$.Page.Params.author}}. ({{dateFormat "2006-01-02" $.Page.Params.date }}). "{{$.Page.Params.title}}". Retrieved from {{ $.Page.Permalink }}.

BibTex citation

@misc{{{$name_string := replace $.Page.Params.author " " "_"}}{{$name_string}}{{dateFormat "2006-01-02" $.Page.Params.date }}, title = {{$title_str := $.Page.Params.title}}{{printf "{%s}" $title_str}}, howpublished = {{$url_str := $.Page.Permalink}}{{printf "{\\url{%s}}" $url_str}}, year = {{$year_str := dateFormat "2006" $.Page.Params.date }}{{printf "{%s}" $year_str}} }

The first two lines of the shortcode are simple HTML tags, and the next line {{$.Page.Params.author}}. ({{dateFormat "2006-01-02" $.Page.Params.date }}). "{{$.Page.Params.title}}". Retrieved from {{ $.Page.Permalink }}. takes in the parameter values from the blog post: the author, the date, and the post title.

Date can be formatted by specifying the format using dateFormat.

The main difficulty came with getting the BibTex citation to show up properly. First, BibTex works best with curly brackets, but double curly brackets are used to access and run Hugo code and functions. I could not find any way of escaping the curly brackets used in BibTex formatting. Three curly brackets in a row ({{{) would throw errors, so I first used printf to insert the string between the curly brackets.

In this line of code, {{$title_str := $.Page.Params.title}}{{printf "{%s}" $title_str}}, I first assigned the parameter to a string variable and then printed the string with printf. But this method did not work for the first and last curly brace. The simpler solution is to just use HTML special characters, which I only realized after examining the hugo documentation site’s page source.

So really, the shortcode could be cleaner if I just used HTML special characters instead of a printf string insert, but I am unfamiliar enough with HTML character codes that the Go code is more understandable to me than trying to figure out what each HTML character actually is.

And here is the end result.

Citation:

For attribution, please cite this work as

Yihan Wu. (2018-12-21). "Blogdown - shortcode for radix-like Bibtex". Retrieved from https://www.yihanwu.ca/post/blogdown-shortcode-generation-for-bibtex/.

BibTex citation

@misc{Yihan_Wu2018-12-21, title = {Blogdown - shortcode for radix-like Bibtex}, howpublished = {\url{https://www.yihanwu.ca/post/blogdown-shortcode-generation-for-bibtex/}}, year = {2018} }

I haven’t looked up what the difference is between a partial and a shortcode in hugo, but I imagine it would not be hard to add this as a partial and have an automatic citation set up for every blog post/page.



November 2018: “Top 40” New Packages

Fri, 12/21/2018 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Having absorbed an average of 181 new packages each month over the last 28 months, CRAN is still growing at a pretty amazing rate. The following plot shows the number of new packages since I started keeping track in August 2016.

This November, 171 new packages stuck to CRAN. Here is my selection for the “Top 40” organized into the categories: Computational Methods, Data, Finance, Machine Learning, Marketing Analytics, Science, Statistics, Utilities and Visualization.

Computational Methods

mixsqp v0.1-79: Provides optimization algorithms (Kim et al. (2012)) based on sequential quadratic programming (SQP) for maximum likelihood estimation of the mixture proportions in a finite mixture model where the component densities are known. The vignette shows how to use the package.

polylabelr v0.1.0: Implements a wrapper around the C++ library polylabel from Mapbox, providing an efficient routine for finding the approximate pole of inaccessibility of a polygon. See README.

Riembase v0.2.1: Implements a number of algorithms to estimate fundamental statistics including Fréchet mean and geometric median for manifold-valued data. See Bhattacharya and Bhattacharya (2012) if you are interested in statistics on manifolds, and Absil et al (2007) for information on the computational aspects of optimization on matrix manifolds.

SolveRationalMatrixEquation v0.1.0: Provides functions to find the symmetric positive definite solution X such that X = Q + L (X inv) L^T given a symmetric positive definite matrix Q and a non-singular matrix L. See Benner et al. (2007) for the details and the vignette for an example.

Data

metsyn v0.1.2: Provides an interface to the Meteo France Synop data API. This is meteorological data recorded every 3 hours at 62 French meteorological stations.

neonUtilities v1.0.1: Provides an interface to the National Ecological Observatory NEON API. For more information, see the README file.

phenocamapi v0.1.2: Allows users to obtain phenological time-series and site metadata from the PhenoCam network. There is a Getting Started Guide and a vignette with examples.

rdbnomics v0.4.3: Provides access to hundreds of millions data series from DBnomics API. See the vignette for examples.

restez v1.0.0: Allows users to download large sections of GenBank and generate a local SQL-based database. A user can then query this database using restez functions.

Finance

crseEventStudy v1.0: Implements the Dutta et al. (2018) standardized test for abnormal returns in long-horizon event studies to improve the power and robustness of the tests described in Kothari/Warner (2007).

psymonitor v0.0.1: Provides functions to apply the real-time monitoring strategy proposed by Phillips, Shi and Yu (2015) (and here) to test for “bubbles”. There is a vignette on detecting crises and another on monitoring bubbles.

Machine Learning

pivmet v0.1.0: Provides a collection of pivotal algorithms for relabeling the MCMC chains in order to cope with the label switching problem in Bayesian mixture models. Functions also initialize the centers of the classical k-means algorithm in order to obtain a better clustering solution. There is a vignette on K-means clustering and another on Label Switching.

RDFTensor v1.0: Implements tensor factorization techniques suitable for sparse, binary and three-mode RDF tensors. See Nickel et al. (2012), Lee and Seung, Papalexakis et al. and Chi and T. G. Kolda (2012) for details.

rfviz v1.0.0: Provides an interactive data visualization and exploration toolkit that implements Breiman and Cutler’s original Java-based random forest visualization tools. It includes both supervised and unsupervised classification and regression algorithms. The Berkeley website describes the original implementation.

rJST v1.0: Provides functions to estimate the Joint Sentiment Topic model as described by Lin and He (2009) and Lin et al. (2012). See the Introduction for details.

Marketing Analytics

Marketmatching v1.1.1: Enables users to find the best control markets using time series matching and analyze the impact of an intervention. Uses the dtw package to do the matching and the CausalImpact package to analyze the causal impact. See the vignette for an example.

Science

EpiSignalDetection v0.1.1: Provides functions to detect possible outbreaks using infectious disease surveillance data at the European Union / European Economic Area or country level. See Salmon et al. (2016) for a description of the automatic detection tools and the vignette for an overview of the package.

memnet v0.1.0: Implements network science tools to facilitate research into human (semantic) memory including several methods to infer networks from verbal fluency data, various network growth models, diverse random walk processes, and tools to analyze and visualize networks. See Wulff et al. (2018) and the vignette for an introduction.

phylocomr v0.1.2: Implements an interface to Phylocom, a library for analysis of phylogenetic community structure and character evolution. See the vignette for an introduction to the package.

plinkQC v0.2.0: Facilitates genotype quality control for genetic association studies as described by Anderson et al. (2010). There are vignettes on Ancestry Estimation, Processing 1000 Genomes, HapMap III Data and Genotype Quality Control.

Statistics

BivRec v1.0.0: Implements a collection of non-parametric and semiparametric methods to analyze alternating recurrent event data. See README for examples.

cusum v0.1.0: Provides functions for constructing and evaluating CUSUM charts and RA-CUSUM charts with focus on false signal probability. The vignette offers an example.

dabestr v0.1.0: Offers an alternative to significance testing using bootstrap methods and estimation plots. See Ho et al (2018). There is a vignette on Bootstrap Confidence Intervals, another on Statistical Visualizations, and a third on creating Estimation Plots.

deckgl v0.1.8: Implements an interface to deck.gl, a WebGL-powered open-source JavaScript framework for visual exploratory data analysis of large data sets and supports basemaps from mapbox. There are fourteen brief vignettes, each devoted to a different plot layer, but look here for a brief overview.

LindleyPowerSeries v0.1.0: Provides functions to compute the probability density function, the cumulative distribution function, the hazard rate function, the quantile function and random generation for Lindley Power Series distributions. See Nadarajah and Si (2018) for details.

modi v0.1.0: Implements algorithms that take sample designs into account to detect multivariate outliers. See Bill and Hulliger (2016) for details and the vignette for an introduction.

MPTmultiverse v0.1: Provides a function to examine the multiverse of possible modeling choices. See the paper by Steegen et al. (2016) and the vignette for an overview of the package.

pterrace v1.0: Provides functions to plot the persistence terrace, a summary graphic for topological data analysis that helps to determine the number of significant topological features. See Moon et al. (2018).

randcorr v1.0: Implements the algorithm by Pourahmadi and Wang (2015) for generating a random p x p correlation matrix by representing the correlation matrix using Cholesky factorization and hyperspherical coordinates. See Makalic and Schmidt (2018) for the sampling process used.

SMFilter v1.0.3: Provides filtering algorithms for the state space models on the Stiefel manifold as well as the corresponding sampling algorithms for uniform, vector Langevin-Bingham and matrix Langevin-Bingham distributions on the Stiefel manifold. See the vignette.

Utilities

IRkernel v0.8.14: Implements a native R kernel for Jupyter Notebook. See README for information on how to use the package.

lobstr v1.0.0: Provides a set of tools for inspecting and understanding R data structures, inspired by str(). See README for information on the included functions.
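
A couple of quick illustrations of the kind of thing it can do (my own examples, not from the package announcement):

library(lobstr)

obj_size(mtcars)     # memory footprint of an object
ast(f(x, y + 1))     # abstract syntax tree of an unevaluated expression

x <- 1:1e6
y <- x
ref(x, y)            # shows that x and y point to the same underlying vector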

parsnip v0.0.1: Implements a common interface allowing users to specify a model without having to remember the different argument names across different functions or computational engines. The vignette goes over the basics.

pkgsearch v2.0.1: Allows users to search CRAN R packages using the METACRAN search server.

stevedore v0.9.0: Implements an interface to the Docker API. There is an Introduction and a vignette with Examples.

vctrs v0.1.0: Defines new notions of prototype and size that are used to provide tools for consistent and well-founded type-coercion and size-recycling. There are vignettes on S3 vectors, Type and Size Stability and Prototypes.

vtree v0.1.4: Provides a function for drawing variable tree plots that display information about hierarchical subsets of a data frame defined by the values of categorical variables. The vignette offers an introduction.

zipR v0.1.0: Implements the Python zip() function in R. See the vignette.

Visualization

countcolors v0.9.0: Contains functions to count colors within color range(s) in images, and provides a masked version of the image with targeted pixels changed to a different color. Output includes the locations of the pixels in the images, and the proportion of the image within the target color range with optional background masking. There is an Introduction and an Example.

coveffectsplot v0.0.1: Provides forest plots to visualize covariate effects from either the command line or an interactive Shiny application. There is an Introduction.



Using the tidyverse for more than data manipulation: estimating pi with Monte Carlo methods

Fri, 12/21/2018 - 01:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)


This blog post is an excerpt of my ebook Modern R with the tidyverse that you can read for
free here. This is taken from Chapter 5, which presents
the {tidyverse} packages and how to use them to compute descriptive statistics and manipulate data.
In the text below, I show how you can use the {tidyverse} functions and principles for the
estimation of \(\pi\) using Monte Carlo simulation.

Going beyond descriptive statistics and data manipulation

The {tidyverse} collection of packages can do much more than data manipulation and
descriptive statistics. Using the principles we have covered and the functions you now know,
you can go much further; for instance, a few {tidyverse} functions are enough to run Monte Carlo
simulations, such as estimating \(\pi\).

Draw the part of the unit circle that lies inside the unit square; the ratio of its area to the
area of the square is \(\pi/4\). Now shoot K arrows at the square; roughly \(K*\pi/4\) of them should fall
inside the circle. So if you shoot N arrows at the square and M of them fall inside the circle, you have
the relationship \(M \approx N*\pi/4\). You can thus estimate \(\pi\) as \(\pi \approx 4*M/N\).
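For example, with made-up numbers just to illustrate the arithmetic: if you shoot \(N = 1000\) arrows and \(M = 785\) of them land inside the circle, the estimate is \(4*785/1000 = 3.14\).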

The more arrows N you throw at the square, the better approximation of \(\pi\) you’ll have. Let’s
try to do this with a tidy Monte Carlo simulation. First, let’s randomly pick some points inside
the unit square:

library(tidyverse)
library(brotools)

n <- 5000

set.seed(2019)
points <- tibble("x" = runif(n), "y" = runif(n))

Now, to know if a point is inside the unit circle, we need to check whether \(x^2 + y^2 < 1\). Let's
add a new column to the points tibble, called inside, equal to 1 if the point is inside the
unit circle and 0 if not:

points <- points %>%
  mutate(inside = map2_dbl(.x = x, .y = y,
                           ~ifelse(.x**2 + .y**2 < 1, 1, 0))) %>%
  rowid_to_column("N")

Let’s take a look at points:

points
## # A tibble: 5,000 x 4
##        N       x      y inside
##    <int>   <dbl>  <dbl>  <dbl>
##  1     1 0.770   0.984       0
##  2     2 0.713   0.0107      1
##  3     3 0.303   0.133       1
##  4     4 0.618   0.0378      1
##  5     5 0.0505  0.677       1
##  6     6 0.0432  0.0846      1
##  7     7 0.820   0.727       0
##  8     8 0.00961 0.0758      1
##  9     9 0.102   0.373       1
## 10    10 0.609   0.676       1
## # ... with 4,990 more rows

The rowid_to_column() function, from the {tibble} package, adds a new column to the data frame
with an id going from 1 to the number of rows in the data frame. Now I can compute the estimate
of \(\pi\) at each row by taking the cumulative sum of the 1's in the inside column and dividing
it by the current value of the N column:

points <- points %>% mutate(estimate = 4*cumsum(inside)/N)

cumsum(inside) is the M from the formula. Now, we can finish by plotting the result:

ggplot(points) +
  geom_line(aes(y = estimate, x = N), colour = "#82518c") +
  geom_hline(yintercept = pi) +
  theme_blog()

In Chapter 6, we are going to learn all about {ggplot2}.

As the number of tries grows, the estimation of \(\pi\) gets better.
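To read off the final estimate rather than eyeballing the plot, a quick sketch reusing the points tibble from above:

# the last row contains the estimate based on all n points
points %>%
  slice(n()) %>%
  pull(estimate)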

Using a data frame to hold our simulated points and the results makes it very easy
to avoid loops, and thus to write code that is more concise and easier to follow.
If you studied a quantitative field at university, you might have done a similar exercise at the
time, very likely by defining a matrix to hold your points and an empty vector to record whether each
point was inside the unit circle. You would then have written a loop to compute whether a point was
inside the unit circle, saved the result in the previously defined vector, and finally computed the
estimate of \(\pi\). Again, I take this opportunity to stress that there is nothing
wrong with this approach per se, but R with the {tidyverse} is better suited to a workflow
where lists or data frames are the central objects and where the analyst operates over them
with functional programming techniques.
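For comparison, a rough sketch of that loop-based approach might look like this:

set.seed(2019)
n <- 5000
points_mat <- matrix(runif(2 * n), ncol = 2)  # column 1: x, column 2: y
inside <- numeric(n)                          # empty vector filled in the loop

for (i in seq_len(n)) {
  inside[i] <- ifelse(points_mat[i, 1]^2 + points_mat[i, 2]^2 < 1, 1, 0)
}

4 * sum(inside) / n  # estimate of pi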

Hope you enjoyed! If you found this blog post useful, you might want to follow
me on twitter for blog post updates and
buy me an espresso or paypal.me.



To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.
