R news and tutorials contributed by hundreds of R bloggers
Updated: 10 hours 14 min ago

### Weather forecast with regression models – part 3

Sun, 06/11/2017 - 23:47

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

In the previous part of this tutorial, we build several models based on logistic regression. One aspect to be further considered is the decision threshold tuning that may help in reaching a better accuracy. We are going to show a procedure able to determine an optimal value for such purpose. ROC plots will be introduced as well.

Decision Threshold Tuning

As default, caret uses 0.5 as threshold decision value to be used as probability cutoff value. We are going to show how to tune such decision threshold to achieve a better training set accuracy. We then compare how the testing set accuracy changes accordingly.

Tuning Analysis

We load previously saved training, testing datasets together with the models we have already determined.

The following tuning procedure explores a pool of cutoff values giving back a table reporting {cutoff, accuracy, specificity} triplets. Its inspection allows to identify the cutoff value providing with the highest accuracy. In case of a tie in accuracy value, we choose the lowest cutoff value of such.

glm.tune <- function(glm_model, dataset) { results <- data.frame() for (q in seq(0.2, 0.8, by = 0.02)) { fitted_values <- glm_model$finalModel$fitted.values prediction = q, "Yes", "No") cm <- confusionMatrix(prediction, dataset$RainTomorrow) accuracy <- cm$overall["Accuracy"] specificity <- cm$byClass["Specificity"] results <- rbind(results, data.frame(cutoff=q, accuracy=accuracy, specificity = specificity)) } rownames(results) <- NULL results } In the utility function above, we included also the specificity metrics. Specificity is defined as TN/(TN+FP) and in the next part of this series we will discuss some aspects of. Let us show an example of confusion matrix with some metrics computation to clarify the terminology. Reference Prediction No Yes No 203 42 Yes 0 3 Accuracy : 0.8306 = (TP+TN)/(TP+TN+FP+FN) = (203+3)/(203+3+42+0) Sensitivity : 1.00000 = TP/(TP+FN) = 203/(203+0) Specificity : 0.06667 = TN/(TN+FP) = 3/(3+42) Pos Pred Value : 0.82857 = TP/(TP+FP) = 203/(203+42) Neg Pred Value : 1.00000 = TN/(TN+FN) = 3/(3+0) Please note that the “No” column is associated to a positive outcome, whilst “Yes” to a negative one. Specificity measure how good we are in predict that {RainTomorrow = “No”} and actually it is and this is something that we are going to investigate in next post of this series. The logistic regression models to be tuned are: mod9am_c1_fit: RainTomorrow ~ Cloud9am + Humidity9am + Pressure9am + Temp9am mod3pm_c1_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm mod_ev_c2_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDir mod_ev_c3_fit: RainTomorrow ~ Pressure3pm + Sunshine In the following analysis, for all models we determine the cutoff value providing with the highest accuracy. glm.tune(mod9am_c1_fit, training) cutoff accuracy specificity 1 0.20 0.7822581 0.73333333 2 0.22 0.7862903 0.71111111 3 0.24 0.7943548 0.68888889 4 0.26 0.8064516 0.64444444 5 0.28 0.8104839 0.64444444 6 0.30 0.8145161 0.60000000 7 0.32 0.8185484 0.57777778 8 0.34 0.8185484 0.51111111 9 0.36 0.8104839 0.46666667 10 0.38 0.8266129 0.46666667 11 0.40 0.8266129 0.40000000 12 0.42 0.8306452 0.40000000 13 0.44 0.8266129 0.37777778 14 0.46 0.8346774 0.35555556 15 0.48 0.8467742 0.33333333 16 0.50 0.8508065 0.31111111 17 0.52 0.8427419 0.26666667 18 0.54 0.8467742 0.26666667 19 0.56 0.8346774 0.20000000 20 0.58 0.8387097 0.20000000 21 0.60 0.8387097 0.20000000 22 0.62 0.8346774 0.17777778 23 0.64 0.8387097 0.17777778 24 0.66 0.8346774 0.15555556 25 0.68 0.8387097 0.15555556 26 0.70 0.8387097 0.13333333 27 0.72 0.8387097 0.13333333 28 0.74 0.8346774 0.11111111 29 0.76 0.8266129 0.06666667 30 0.78 0.8266129 0.06666667 31 0.80 0.8306452 0.06666667 The optimal cutoff value is equal to 0.5. In this case, we find the same cutoff value as the default the caret package takes advantage of for logistic regression models. opt_cutoff &lt 0.5;- pred_test <- predict(mod9am_c1_fit, testing, type = "prob") prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No") confusionMatrix(prediction, testing$RainTomorrow) Confusion Matrix and Statistics Reference Prediction No Yes No 80 12 Yes 6 7 Accuracy : 0.8286 95% CI : (0.7427, 0.8951) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.4602 Kappa : 0.3405 Mcnemar's Test P-Value : 0.2386 Sensitivity : 0.9302 Specificity : 0.3684 Pos Pred Value : 0.8696 Neg Pred Value : 0.5385 Prevalence : 0.8190 Detection Rate : 0.7619 Detection Prevalence : 0.8762 Balanced Accuracy : 0.6493 'Positive' Class : No Then, tuning the 3PM model. glm.tune(mod3pm_c1_fit, training) cutoff accuracy specificity 1 0.20 0.8064516 0.7777778 2 0.22 0.8185484 0.7555556 3 0.24 0.8225806 0.7333333 4 0.26 0.8306452 0.6888889 5 0.28 0.8467742 0.6666667 6 0.30 0.8467742 0.6444444 7 0.32 0.8427419 0.6222222 8 0.34 0.8669355 0.6222222 9 0.36 0.8709677 0.6222222 10 0.38 0.8629032 0.5777778 11 0.40 0.8669355 0.5777778 12 0.42 0.8669355 0.5555556 13 0.44 0.8548387 0.4666667 14 0.46 0.8548387 0.4444444 15 0.48 0.8588710 0.4444444 16 0.50 0.8669355 0.4444444 17 0.52 0.8629032 0.4222222 18 0.54 0.8669355 0.4222222 19 0.56 0.8669355 0.3777778 20 0.58 0.8669355 0.3777778 21 0.60 0.8588710 0.3333333 22 0.62 0.8548387 0.3111111 23 0.64 0.8508065 0.2888889 24 0.66 0.8467742 0.2666667 25 0.68 0.8387097 0.2222222 26 0.70 0.8387097 0.2222222 27 0.72 0.8346774 0.2000000 28 0.74 0.8387097 0.1777778 29 0.76 0.8346774 0.1555556 30 0.78 0.8346774 0.1555556 31 0.80 0.8306452 0.1111111 The optimal cutoff value is equal to 0.36. opt_cutoff <- 0.36 pred_test <- predict(mod3pm_c1_fit, testing, type="prob") prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No") confusionMatrix(prediction, testing$RainTomorrow) Confusion Matrix and Statistics Reference Prediction No Yes No 81 5 Yes 5 14 Accuracy : 0.9048 95% CI : (0.8318, 0.9534) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.0112 Kappa : 0.6787 Mcnemar's Test P-Value : 1.0000 Sensitivity : 0.9419 Specificity : 0.7368 Pos Pred Value : 0.9419 Neg Pred Value : 0.7368 Prevalence : 0.8190 Detection Rate : 0.7714 Detection Prevalence : 0.8190 Balanced Accuracy : 0.8394 'Positive' Class : No We improved the test accuracy, previously resulting as equal to 0.8857 and now equal to 0.9048 (please take a look at part #2 of this tutorial to know about the test accuracy with 0.5 cutoff value). Then, the first evening hours model. glm.tune(mod_ev_c2_fit, training) cutoff accuracy specificity 1 0.20 0.8387097 0.8444444 2 0.22 0.8467742 0.8222222 3 0.24 0.8548387 0.8222222 4 0.26 0.8629032 0.8000000 5 0.28 0.8588710 0.7333333 6 0.30 0.8709677 0.7333333 7 0.32 0.8790323 0.7111111 8 0.34 0.8830645 0.6888889 9 0.36 0.8870968 0.6666667 10 0.38 0.8870968 0.6666667 11 0.40 0.8951613 0.6666667 12 0.42 0.8991935 0.6666667 13 0.44 0.8991935 0.6666667 14 0.46 0.8951613 0.6444444 15 0.48 0.8951613 0.6222222 16 0.50 0.8951613 0.6222222 17 0.52 0.8951613 0.6000000 18 0.54 0.8911290 0.5777778 19 0.56 0.8911290 0.5777778 20 0.58 0.8951613 0.5777778 21 0.60 0.8911290 0.5555556 22 0.62 0.8870968 0.5111111 23 0.64 0.8830645 0.4888889 24 0.66 0.8830645 0.4666667 25 0.68 0.8790323 0.4222222 26 0.70 0.8709677 0.3555556 27 0.72 0.8548387 0.2666667 28 0.74 0.8508065 0.2444444 29 0.76 0.8427419 0.1777778 30 0.78 0.8387097 0.1555556 31 0.80 0.8387097 0.1555556 The optimal cutoff value is equal to 0.42. opt_cutoff <- 0.42 pred_test <- predict(mod_ev_c2_fit, testing, type="prob") prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No") confusionMatrix(prediction, testing$RainTomorrow) Confusion Matrix and Statistics Reference Prediction No Yes No 82 8 Yes 4 11 Accuracy : 0.8857 95% CI : (0.8089, 0.9395) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.04408 Kappa : 0.58 Mcnemar's Test P-Value : 0.38648 Sensitivity : 0.9535 Specificity : 0.5789 Pos Pred Value : 0.9111 Neg Pred Value : 0.7333 Prevalence : 0.8190 Detection Rate : 0.7810 Detection Prevalence : 0.8571 Balanced Accuracy : 0.7662 'Positive' Class : No We did not improve the accuracy as the default cutoff value provides with the same. Other metrics changed and in particular we notice that the sensitivity decreased a little bit while the positive predicted value improved. Then the second evening hours model. glm.tune(mod_ev_c3_fit, training) cutoff accuracy specificity 1 0.20 0.8225806 0.8000000 2 0.22 0.8145161 0.7333333 3 0.24 0.8064516 0.6666667 4 0.26 0.8145161 0.6666667 5 0.28 0.8145161 0.6222222 6 0.30 0.8185484 0.6222222 7 0.32 0.8266129 0.6000000 8 0.34 0.8185484 0.5555556 9 0.36 0.8306452 0.5333333 10 0.38 0.8467742 0.5333333 11 0.40 0.8467742 0.5111111 12 0.42 0.8427419 0.4666667 13 0.44 0.8508065 0.4666667 14 0.46 0.8467742 0.4222222 15 0.48 0.8467742 0.4000000 16 0.50 0.8508065 0.4000000 17 0.52 0.8548387 0.4000000 18 0.54 0.8508065 0.3777778 19 0.56 0.8669355 0.3777778 20 0.58 0.8669355 0.3555556 21 0.60 0.8629032 0.3333333 22 0.62 0.8588710 0.3111111 23 0.64 0.8548387 0.2888889 24 0.66 0.8508065 0.2666667 25 0.68 0.8467742 0.2444444 26 0.70 0.8387097 0.2000000 27 0.72 0.8387097 0.1777778 28 0.74 0.8346774 0.1555556 29 0.76 0.8306452 0.1333333 30 0.78 0.8306452 0.1333333 31 0.80 0.8306452 0.1333333 The optimal cutoff value is equal to 0.56. opt_cutoff <- 0.56 pred_test <- predict(mod_ev_c3_fit, testing, type="prob") prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No") confusionMatrix(prediction, testing$RainTomorrow) Confusion Matrix and Statistics Reference Prediction No Yes No 82 8 Yes 4 11 Accuracy : 0.8857 95% CI : (0.8089, 0.9395) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.04408 Kappa : 0.58 Mcnemar's Test P-Value : 0.38648 Sensitivity : 0.9535 Specificity : 0.5789 Pos Pred Value : 0.9111 Neg Pred Value : 0.7333 Prevalence : 0.8190 Detection Rate : 0.7810 Detection Prevalence : 0.8571 Balanced Accuracy : 0.7662 'Positive' Class : No In this case, we did not improved the accuracy of this model. Other metrics changed and in particular we notice that the sensitivity is slightly higher than the default cutoff driven value (0.9419), and the positive predicted value decreased a little bit (for default cutoff was 0.9205). ROC analysis To have a visual understanding of the classification performances and compute further metrics, we are going to take advantage of the ROCr package facilities. We plot ROC curves and determine the corresponding AUC value. That is going to be done for each model. The function used to accomplish with this task is the following. glm.perf.plot <- function (prediction, cutoff) { perf <- performance(prediction, measure = "tpr", x.measure = "fpr") par(mfrow=(c(1,2))) plot(perf, col="red") grid() perf <- performance(prediction, measure = "acc", x.measure = "cutoff") plot(perf, col="red") abline(v = cutoff, col="green") grid() auc_res <- performance(prediction, "auc") auc_res@y.values[[1]] } In the following, each model will be considered, AUC value printed out and ROC plots given. mod9am_pred_prob <- predict(mod9am_c1_fit, testing, type="prob") mod9am_pred_resp <- prediction(mod9am_pred_prob$Yes, testing$RainTomorrow) glm.perf.plot(mod9am_pred_resp, 0.5) [1] 0.8004896 mod3pm_pred_prob <- predict(mod3pm_c1_fit, testing, type="prob") mod3pm_pred_resp <- prediction(mod3pm_pred_prob$Yes, testing$RainTomorrow) glm.perf.plot(mod3pm_pred_resp, 0.36) [1] 0.9155447 The 3PM model shows the highest AUC value among all the models. mod_ev_c2_prob <- predict(mod_ev_c2_fit, testing, type="prob") mod_ev_c2_pred_resp <- prediction(mod_ev_c2_prob$Yes , testing$RainTomorrow) glm.perf.plot(mod_ev_c2_pred_resp, 0.42) [1] 0.8390453 mod_ev_c3_prob <- predict(mod_ev_c3_fit, testing, type="prob") mod_ev_c3_pred_resp <- prediction(mod_ev_c3_prob$Yes, testing$RainTomorrow) glm.perf.plot(mod_ev_c3_pred_resp, 0.56) [1] 0.8886169 Conclusions By tuning the decision threshold, we were able to improve training and testing set accuracy. It turned out the 3PM model achieved a satisfactory accuracy. ROC plots gave us an understanding of how true/false positive rates vary and what is the accuracy obtained by a specific decision threshold value (i.e. cutoff value). Ultimately, AUC values were reported. If you have any questions, please feel free to comment below. Related Post var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Data Visualization with googleVis exercises part 2 Sun, 06/11/2017 - 18:00 (This article was first published on R-exercises, and kindly contributed to R-bloggers) Area, Stepped & Combo charts In the second part of our series we are going to meet three more googleVis charts. More specifically these charts are Area Chart, Stepped Area Chart and Combo Chart. Read the examples below to understand the logic of what we are going to do and then test yous skills with the exercise set we prepared for you. Lets begin! Answers to the exercises are available here. Package & Data frame As you already know, the first thing you have to do is install and load the googleVis package with: install.packages("googleVis") library(googleVis) Secondly we will create an experimental data frame which will be used for our charts’ plotting. You can create it with: df=data.frame(name=c("James", "Curry", "Harden"), Pts=c(20,23,34), Rbs=c(13,7,9)) NOTE: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page. Area chart It is quite simple to create an area chart with googleVis with: AreaC <- gvisBarChart(df) plot(AreaC) Exercise 1 Create a list named “AreaC” and pass to it the “df” data frame you just created as an area chart. HINT: Use gvisAreaChart(). Exercise 2 Plot the area chart. HINT: Use plot(). Stepped Area chart Creating a stepped area chart is a little different than the area chart. You have to set the X and Y variables and also make it stacked in order to display the values correctly. Here is an example: SteppedAreaC <- gvisSteppedAreaChart(df, xvar="name", yvar=c("val1", "val2"), options=list(isStacked=TRUE)) plot(SteppedAreaC) Exercise 3 Create a list named “SteppedAreaC” and pass to it the “df” data frame you just created as a stepped area chart. You should also set the X variable as the players’ names and Y variable as their values. HINT: Use gvisSteppedAreaChart(), xvar and yvar. Exercise 4 Plot the stepped area chart. HINT: Use plot(). Learn more about using GoogleVis in the online course Mastering in Visualization with R programming. In this course you will learn how to: • Work extensively with the GoogleVis package and its functionality • Learn what visualizations exist for your specific use case • And much more Exercise 5 Now transform your stepped area chart to stacked to correct it and plot it. Combo chart The next chart we are going to meet is the combination of lines and bars chart known as combo chart. You can produce it like this: ComboC <- gvisComboChart(df, xvar="country", yvar=c("val1", "val2"), options=list(seriesType="bars", series='{1: {type:"line"}}')) plot(ComboC) Exercise 6 Create a list named “ComboC” and pass to it the “df” data frame you just created as a combo chart. You should also set the X variable as the players’ names and Y variable as their values. HINT: Use gvisComboChart(), xvar and yvar. Exercise 7 Plot the chart. What kind of chart do you see? HINT: Use plot(). In order to add the bars we have to set it as the example below. options=list(seriesType="bars", series='{1: {type:"line"}}') Exercise 8 Transform the chart you just created into a combo chart with bars and lines and plot it. HINT: Use list(). Exercise 9 In the previous exercise “Pts” are represented by the bars and “Rbs” by the lines. Try to reverse them. You can easily transform your combo chart into a column chart or a line chart just be setting series='{1: {type:"line"}}' to 2 Exercise 10 Transform the combo chart into a column chart and then into a line chart. Related exercise sets: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Exit poll for June 2017 election (UK) Sun, 06/11/2017 - 16:13 (This article was first published on R – Let's Look at the Figures, and kindly contributed to R-bloggers) It has been a while since I posted anything here, but I can’t resist this one. Let me just give three numbers. The first two are: • 314, the number of seats predicted for the largest party (Conservatives) in the UK House of Commons, at 10pm in Thursday (i.e., before even a single vote had been counted) from the exit poll commissioned jointly by broadcasters BBC, ITV and Sky. • 318, the actual number of seats that were won by the Conservatives, now that all the votes have been counted. That highly accurate prediction changed the whole story on election night: most of the pre-election voting intention polls had predicted a substantial Conservative majority. (And certainly that’s what Theresa May had expected to achieve when she made the mistake of calling a snap election, 3 years early.) But the exit poll prediction made it pretty clear that the Conservatives would either not achieve a majority (for which 326 seats would be needed), or at best would be returned with a very small majority such as the one they held before the election. Media commentary turned quickly to how a government might be formed in the seemingly likely event of a hung Parliament, and what the future might be for Mrs May. The financial markets moved quite substantially, too, in the moments after 10pm. For more details on the exit poll, its history, and the methods used to achieve that kind of predictive accuracy, see Exit Polling Explained. The third number I want to mention here is • 2.1.0 That’s the version of R that I had at the time of the 2005 General Election, when I completed the development of a fairly extensive set of R functions to use in connection with the exit poll (which at that time was done for BBC and ITV jointly). Amazingly (to me!) the code that I wrote back in 2001–2005 still works fine. My friend and former colleague Jouni Kuha, who stepped in as election-day statistician for the BBC when I gave it up after 2005, told me today that (with some tweaks, I presume!) it all works brilliantly still, as the basis for an extremely high-pressure data analysis on election day/night. Very pleasing indeed; and strong testimony to the heroic efforts of the R Core Development Team, to keep everything stable with a view to the long term. As suggested by that kind tweet reproduced above from the RSS President, David Spiegelhalter: Thursday’s performance was quite a triumph for the practical art and science of Statistics. [And I think I am allowed to say this, since on this occasion I was not even there! The credit for Thursday’s work goes to Jouni Kuha, along with John Curtice, Steve Fisher and the rest of the academic team of analysts who worked in the secret exit-poll “bunker” on 8 June.] var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Let's Look at the Figures. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Machine Learning Powered Biological Network Analysis Sun, 06/11/2017 - 15:50 (This article was first published on r-bloggers – Creative Data Solutions, and kindly contributed to R-bloggers) Metabolomic network analysis can be used to interpret experimental results within a variety of contexts including: biochemical relationships, structural and spectral similarity and empirical correlation. Machine learning is useful for modeling relationships in the context of pattern recognition, clustering, classification and regression based predictive modeling. The combination of developed metabolomic networks and machine learning based predictive models offer a unique method to visualize empirical relationships while testing key experimental hypotheses. The following presentation focuses on data analysis, visualization, machine learning and network mapping approaches used to create richly mapped metabolomic networks. Learn more at www.createdatasol.com The following presentation also shows a sneak peak of a new data analysis visualization software, DAVe: Data Analysis and Visualization engine. Check out some early features. DAVe is built in R and seeks to support a seamless environment for advanced data analysis and machine learning tasks and biological functional and network analysis. As an aside, building the main site (in progress) was a fun opportunity to experiment with Jekyll, Ruby and embedding slick interactive canvas elements into websites. You can checkout all the code here https://github.com/dgrapov/CDS_jekyll_site. slides: https://www.slideshare.net/dgrapov/machine-learning-powered-metabolomic-network-analysis var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – Creative Data Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### “smooth” package for R. Common ground. Part I. Prediction intervals Sun, 06/11/2017 - 15:23 (This article was first published on R – Modern Forecasting, and kindly contributed to R-bloggers) We have spent previous six posts discussing basics of es() function (underlying models and their implementation). Now it is time to move forward. Starting from this post we will discuss common parameters, shared by all the forecasting functions implemented in smooth. This means that the topics that we discuss are not only applicable to es() , but also to ssarima() , ces() , ges() and sma() . However, taking that we have only discussed ETS so far, we will use es() in our examples for now. And I would like to start this series of general posts from the topic of prediction intervals. Prediction intervals for smooth functions One of the features of smooth functions is their ability to produce different types of prediction intervals. Parametric prediction intervals (triggered by intervals="p" , intervals="parametric" or intervals=TRUE ) are derived analytically only for pure additive and pure multiplicative models and are based on the state-space model discussed in previous posts. In the current smooth version (v2.0.0) only es() function has multiplicative components, all the other functions are based on additive models. This makes es() “special”. While constructing intervals for pure models (either additive or multiplicative) is relatively easy to do, the mixed models cause pain in the arse (one of the reasons why I don’t like them). So in case of mixed ETS models, we have to use several tricks. If the model has multiplicative error, non-multiplicative other components (trend, seasonality) and low variance of the error (smaller than 0.1), then the intervals can be approximated by similar models with additive error term. For example, the intervals for ETS(M,A,N) can be approximated with intervals of ETS(A,A,N), when the variance is low, because the distribution of errors in both models will be similar. In all the other cases we use simulations for prediction intervals construction (via sim.es() function). In this case the data is generated with preset parameters (including variance) and contains $$h$$ observations. This process is repeated 10,000 times, resulting in 10,000 possible trajectories. After that the necessary quantiles of these trajectories for each step ahead are taken using quantile() function from stats package and returned as prediction intervals. This cannot be considered as a pure parametric approach, but it is the closest we have. smooth functions also introduce semiparametric and nonparametric prediction intervals. Both of them are based on multiple steps ahead (also sometimes called as “trace”) forecast errors. These are obtained via producing forecasts for horizon 1 to $$h$$ from each observation of time series. As a result a matrix with $$h$$ columns and $$T-h$$ rows is produced. In case of semi-parametric intervals (called using intervals="sp" or intervals="semiparametric" ), variances of forecast errors for each horizon are calculated and then used in order to extract quantiles of either normal or log-normal distribution (depending on error type). This way we cover possible violation of assumptions of homoscedasticity and no autocorrelation in residuals, but we still assume that each separate observation has some parametric distribution. In case of non-parametric prediction intervals (defined in R via intervals="np" or intervals="nonparametric" ), we loosen assumptions further, dropping part about distribution of residuals. In this case quantile regressions are used as proposed by Taylor and Bunn, 1999. However we use a different form of regression model than the authors do: \label{eq:ssTaylorPIs} \hat{e}_{j} = a_0 j ^ {a_{1}}, where $$j = 1, .., h$$ is forecast horizon. This function has an important advantage over the proposed by the authors second order polynomial: it does not have extremum (turning point) for $$j>0$$, which means that the intervals won’t behave strangely after several observations ahead. Using polynomials for intervals sometimes leads to weird bounds (for example, expanding and then shrinking). On the other hand, power function allows producing wide variety of forecast trajectories, which correspond to differently increasing or decreasing bounds of prediction intervals (depending on values of $$a_0$$ and $$a_1$$), without producing any ridiculous trajectories. The main problem with nonparametric intervals produced by smooth is caused by quantile regressions, which do not behave well on small samples. In order to produce correct 0.95 quantile, we need to have at least 20 observations, and if we want 0.99 quantile, then the sample must contain at least 100. In the cases, when there is not enough observations, the produced intervals can be inaccurate and may not correspond to the nominal level values. As a small note, if a user produces only one-step-ahead forecast, then semiparametric interval will correspond to parametric one (because only the variance of the one-step-ahead error is used), and the nonparametric interval is constructed using quantile() function from stats package. Finally, the width of prediction intervals is regulated by parameter level , which can be written either as a fraction number ( level=0.95 ) or as an integer number, less than 100 ( level=95 ). I personally prefer former, but the latter is needed for the consistency with forecast package functions. By default all the smooth functions produce 95% prediction intervals. There are some other features of prediction interval construction for specific intermittent models and cumulative forecasts, but they will be covered in upcoming posts. Examples in R We will use a time series N1241 as an example and we will estimate model ETS(A,Ad,N). Here’s how we do that: ourModel1 <- es(M3$N1241$x, "AAdN", h=8, holdout=TRUE, intervals="p") ourModel2 <- es(M3$N1241$x, "AAdN", h=8, holdout=TRUE, intervals="sp") ourModel3 <- es(M3$N1241$x, "AAdN", h=8, holdout=TRUE, intervals="np") The resulting graphs demonstrate some differences in prediction intervals widths and shapes: Series N1241 from M3, es() forecast, parametric prediction intervals Series N1241 from M3, es() forecast, semiparametric prediction intervals Series N1241 from M3, es() forecast, nonparametric prediction intervals All of them cover actual values in the holdout, because the intervals are very wide. It is not obvious, which of them is the most appropriate for this task. So we can calculate the spread of intervals and see, which of them is on average wider: mean(ourModel1$upper-ourModel1$lower) mean(ourModel2$upper-ourModel2$lower) mean(ourModel3$upper-ourModel3$lower) Which results in: 950.4171 955.0831 850.614 In this specific example, the non-parametric interval appeared to be the narrowest, which is good, taking that it adequately covered values in the holdout sample. However, this doesn’t mean that it is in general superior to the other methods. Selection of the appropriate intervals should be done based on the general understanding of the violated assumptions. If we didn’t know the actual values in the holdout sample, then we could make a decision based on the analysis of the in-sample residuals in order to get a clue about the violation of any assumptions. This can be done, for example, this way: forecast::tsdisplay(ourModel1$residuals) hist(ourModel1$residuals) qqnorm(ourModel3$residuals) qqline(ourModel3$residuals) Linear plot and correlation functions of the residuals of the ETS(A,Ad,N) model Histogram of the residuals of the ETS(A,Ad,N) model Q-Q plot of the residuals of the ETS(A,Ad,N) model The first plot shows how residuals change over time and how the autocorrelation and partial autocorrelation functions look for this time series. There is no obvious autocorrelation and no obvious heteroscedasticity in the residuals. This means that we can assume that these conditions are not violated in the model, so there is no need to use semiparametric prediction intervals. However, the second and the third graphs demonstrate that the residuals are not normally distributed (as assumed by the model ETS(A,Ad,N)). This means that parametric prediction intervals may be wrong for this time series. All of this motivates the usage of nonparametric prediction intervals for the series N1241. That’s it for today. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Modern Forecasting. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Likelihood calculation for the g-and-k distribution Sun, 06/11/2017 - 00:33 (This article was first published on R – Statisfaction, and kindly contributed to R-bloggers) Histogram of 1e5 samples from the g-and-k distribution, and overlaid probability density function Hello, An example often used in the ABC literature is the g-and-k distribution (e.g. reference [1] below), which is defined through the inverse of its cumulative distribution function (cdf). It is easy to simulate from such distributions by drawing uniform variables and applying the inverse cdf to them. However, since there is no closed-form formula for the probability density function (pdf) of the g-and-k distribution, the likelihood is often considered intractable. It has been noted in [2] that one can still numerically compute the pdf, by 1) numerically inverting the quantile function to get the cdf, and 2) numerically differentiating the cdf, using finite differences, for instance. As it happens, this is very easy to implement, and I coded up an R tutorial at: github.com/pierrejacob/winference/blob/master/inst/tutorials/tutorial_gandk.pdf for anyone interested. This is part of the winference package that goes with our tech report on ABC with the Wasserstein distance (joint work with Espen Bernton, Mathieu Gerber and Christian Robert, to be updated very soon!). This enables standard MCMC algorithms for the g-and-k example. It is also very easy to compute the likelihood for the multivariate extension of [3], since it only involves a fixed number of one-dimensional numerical inversions and differentiations (as opposed to a multivariate inversion). Surprisingly, most of the papers that present the g-and-k example do not compare their ABC approximations to the posterior; instead, they typically compare the proposed ABC approach to existing ones. Similarly, the so-called Ricker model is commonly used in the ABC literature, and its posterior can be tackled efficiently using particle MCMC methods; as well as the M/G/1 model, which can be tackled either with particle MCMC methods or with tailor-made MCMC approaches such as [4]. These examples can still have great pedagogical value in ABC papers, but it would perhaps be nice to see more comparisons to the ground truth when it’s available; ground truth here being the actual posterior distribution. 1. Fearnhead, P. and Prangle, D. (2012) Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B, 74, 419–474. 2. Rayner, G. D. and MacGillivray, H. L. (2002) Numerical maximum likelihood estimation for the g-and-k and generalized g-and-h distributions. Statistics and Computing, 12, 57–75. 3. Drovandi, C. C. and Pettitt, A. N. (2011) Likelihood-free Bayesian estimation of multivari- ate quantile distributions. Computational Statistics & Data Analysis, 55, 2541–2556. 4. Shestopaloff, A. Y. and Neal, R. M. (2014) On Bayesian inference for the M/G/1 queue with efficient MCMC sampling. arXiv preprint arXiv:1401.5548. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Statisfaction. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### R Weekly Bulletin Vol – XI Sat, 06/10/2017 - 20:48 (This article was first published on R programming, and kindly contributed to R-bloggers) This week’s R bulletin will cover topics on how to round to the nearest desired number, converting and comparing dates and how to remove last x characters from an element. We will also cover functions like rank, mutate, transmute, and set.seed. Click To TweetHope you like this R weekly bulletin. Enjoy reading! Shortcut Keys 1. Comment/uncomment current line/selection – Ctrl+Shift+C 2. Move Lines Up/Down – Alt+Up/Down 3. Delete Line – Ctrl+D Problem Solving Ideas Rounding to the nearest desired number Consider a case where you want to round a given number to the nearest 25. This can be done in the following manner: round(145/25) * 25 [1] 150 floor(145/25) * 25 [1] 125 ceiling(145/25) * 25 [1] 150 Usage: Assume if you are calculating a stop loss or take profit for an NSE stock in which the minimum tick is 5 paisa. In such case, we will divide and multiply by 0.05 to achieve the desired outcome. Example: Price = 566 Stop_loss = 1/100 # without rounding SL = Price * Stop_loss print(SL) [1] 5.66 # with rounding to the nearest 0.05 SL1 = floor((Price * Stop_loss)/0.05) * 0.05 print(SL1) [1] 5.65 How to remove last n characters from every element To remove the last n characters we will use the substr function along with the nchr function. The example below illustrates the way to do it. Example: # In this case, we just want to retain the ticker name which is "TECHM" symbol = "TECHM.EQ-NSE" s = substr(symbol,1,nchar(symbol)-7) print(s) [1] “TECHM” Converting and Comparing dates in different formats When we pull stock data from Google finance the date appears as “YYYYMMDD”, which is not recognized as a date-time object. To convert it into a date-time object we can use the “ymd” function from the lubridate package. Example: library(lubridate) x = ymd(20160724) print(x) [1] “2016-07-24” Another data provider gives stock data which has the date-time object in the American format (mm/dd/yyyy). When we read the file, the date-time column is read as a character. We need to convert this into a date-time object. We can convert it using the as.Date function and by specifying the format. dt = "07/24/2016" y = as.Date(dt, format = "%m/%d/%Y") print(y) [1] “2016-07-24” # Comparing the two date-time objects (from Google Finance and the data provider) after conversion identical(x, y) [1] TRUE Functions Demystified rank function The rank function returns the sample ranks of the values in a vector. Ties (i.e., equal values) and missing values can be handled in several ways. rank(x, na.last = TRUE, ties.method = c(“average”, “first”, “random”, “max”, “min”)) where, x: numeric, complex, character or logical vector na.last: for controlling the treatment of NAs. If TRUE, missing values in the data are put last; if FALSE, they are put first; if NA, they are removed; if “keep” they are kept with rank NA ties.method: a character string specifying how ties are treated Examples: x <- c(3, 5, 1, -4, NA, Inf, 90, 43) rank(x) [1] 3 4 2 1 8 7 6 5 rank(x, na.last = FALSE) [1] 4 5 3 2 1 8 7 6 mutate and transmute functions The mutate and transmute functions are part of the dplyr package. The mutate function computes new variables using the existing variables of a given data frame. The new variables are added to the existing data frame. On the other hand, the transmute function creates these new variables as a separate data frame. Consider the data frame “df” given in the example below. Suppose we have 5 observations of 1-minute price data for a stock, and we want to create a new variable by subtracting the mean from the 1-minute closing prices. It can be done in the following manner using the mutate function. Example: library(dplyr) OpenPrice = c(520, 521.35, 521.45, 522.1, 522) ClosePrice = c(521, 521.1, 522, 522.25, 522.4) Volume = c(2000, 3500, 1750, 2050, 1300) df = data.frame(OpenPrice, ClosePrice, Volume) print(df) df_new = mutate(df, cpmean_diff = ClosePrice - mean(ClosePrice, na.rm = TRUE)) print(df_new) # If we want the new variable as a separate data frame, we can use the transmute function instead. df_new = transmute(df, cpmean_diff = ClosePrice - mean(ClosePrice, na.rm = TRUE)) print(df_new) set.seed function The set.seed function helps generate the same sequence of random numbers every time the program runs. It sets the random number generator to a known state. The function takes a single argument which is an integer. One needs to use the same positive integer in order to get the same initial state. Example: # Initialize the random number generator to a known state and generate five random numbers set.seed(100) runif(5) [1] 0.30776611 0.25767250 0.55232243 0.05638315 0.46854928 # Reinitialize to the same known state and generate the same five 'random' numbers set.seed(100) runif(5) [1] 0.30776611 0.25767250 0.55232243 0.05638315 0.46854928 Next Step We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers. The post R Weekly Bulletin Vol – XI appeared first on . var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Density-Based Clustering Exercises Sat, 06/10/2017 - 17:50 (This article was first published on R-exercises, and kindly contributed to R-bloggers) Density-based clustering is a technique that allows to partition data into groups with similar characteristics (clusters) but does not require specifying the number of those groups in advance. In density-based clustering, clusters are defined as dense regions of data points separated by low-density regions. Density is measured by the number of data points within some radius. Advantages of density-based clustering: • as mentioned above, it does not require a predefined number of clusters, • clusters can be of any shape, including non-spherical ones, • the technique is able to identify noise data (outliers). Disadvantages: • density-based clustering fails if there are no density drops between clusters, • it is also sensitive to parameters that define density (radius and the minimum number of points); proper parameter setting may require domain knowledge. There are different methods of density-based clustering. The most popular are DBSCAN (density-based spatial clustering of applications with noise), which assumes constant density of clusters, OPTICS (ordering points to identify the clustering structure), which allows for varying density, and “mean-shift”. This set of exercises covers basic techniques for using the DBSCAN method, and allows to compare its result to the results of the k-means clustering algorithm by means of the silhouette analysis. The set requires the packages dbscan, cluster, and factoextra to be installed. The exercises make use of the iris data set, which is supplied with R, and the wholesale customers data set from the University of California, Irvine (UCI) machine learning repository (download here). Answers to the exercises are available here. Exercise 1 Create a new data frame using all but the last variable from the iris data set, which is supplied with R. Exercise 2 Use the scale function to normalize values of all variables in the new data set (with default settings). Ensure that the resulting object is of class data.frame. Exercise 3 Plot the distribution of distances between data points and their fifth nearest neighbors using the kNNdistplot function from the dbscan package. Examine the plot and find a tentative threshold at which distances start increasing quickly. On the same plot, draw a horizontal line at the level of the threshold. Exercise 4 Use the dbscan function from the package of the same name to find density-based clusters in the data. Set the size of the epsilon neighborhood at the level of the found threshold, and set the number of minimum points in the epsilon region equal to 5. Assign the value returned by the function to an object, and print that object. Exercise 5 Plot the clusters with the fviz_cluster function from the factoextra package. Choose the geometry type to draw only points on the graph, and assign the ellipse parameter value such that an outline around points of each cluster is not drawn. (Note that the fviz_cluster function produces a 2-dimensional plot. If the data set contains two variables those variables are used for plotting, if the number of variables is bigger the first two principal components are drawn.) Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to: • Delve into various algorithms for classification such as KNN and see how they are applied in R • Evaluate k-Means, Connectivity, Distribution, and Density based clustering • And much more Exercise 6 Examine the structure of the cluster object obtained in Exercise 4, and find the vector with cluster assignments. Make a copy of the data set, add the vector of cluster assignments to the data set, and print its first few lines. Exercise 7 Now look at what happens if you change the epsilon value. 1. Plot again the distribution of distances between data points and their fifth nearest neighbors (with the kNNdistplot function, as in Exercise 3). On that plot, draw horizontal lines at levels 1.8, 0.5, and 0.4. 2. Use the dbscan function to find clusters in the data with the epsilon set at these values (as in Exercise 4). 3. Plot the results (as in the Exercise 5, but now set the ellipse parameter value such that an outline around points is drawn). Exercise 8 This exercise shows how the DBSCAN algorithm can be used as a way to detect outliers: 1. Load the Wholesale customers data set, and delete all variables with the exception of Fresh and Milk. Assign the data set to the customers variable. 2. Discover clusters using the steps from Exercises 2-5: scale the data, choose an epsilon value, find clusters, and plot them. Set the number of minimum points to 5. Use the db_clusters_customers variable to store the output of the dbscan function. Exercise 9 Compare the results obtained in the previous exercise with the results of the k-means algorithm. First, find clusters using this algorithm: 1. Use the same data set, but get rid of outliers for both variables (here the outliers may be defined as values beyond 2.5 standard deviations from the mean; note that the values are already expressed in unit of standard deviation about the mean). Assign the new data set to the customers_core variable. 2. Use kmeans function to obtain an object with cluster assignments. Set the number of centers equal to 4, and the number of initial random sets (the nstart parameter) equal to 10. Assign the obtained object to the variable km_clusters_customers variable. 3. Plot clusters using the fviz_cluster function (as in the previous exercise). Exercise 10 Now compare the results of DBSCAN and k-means using silhouette analysis: 1. Retrieve a vector of cluster assignments from the db_clusters_customers object. 2. Calculate distances between data points in the customers data set using the dist function (with the default parameters). 3. Use the vector and the distances object as inputs into the silhouette function from the cluster package to get a silhouette information object. 4. Plot that object with the fviz_silhouette function from the factoextra package. 5. Repeat the steps described above for the km_clusters_customers object and the customers_core data sets. 6. Compare two plots and the average silhouette width values. Related exercise sets: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### UK 2017 General Election Results Data Sat, 06/10/2017 - 12:53 (This article was first published on fReigeist » R, and kindly contributed to R-bloggers) As the reality of a hung parliament starts to sink in, economists, political scientists and commentators will begin their usual routine of “post mortem” analysis of the surprise result of the UK 2017 general election. My co-authors Sascha Becker and Dennis Novy have done a similar exercise studying the EU Referendum last year [see also here] and have worked on the question whether migration contributed to an erosion of pro EU sentiment [see also here]. For people wanting to get to work straight away, there are a few things slowing us down. The last constituency, Kensington, was not called until last night and so I dont expect the UK’s Election Commission to post the final tally of votes across all constituencies anytime before next week. Nevertheless, the crude election results data can be “scraped” from some infographics. This post describes how… The Economist’s Infographics The Economist, among other newspapers, provides a very nice infographic – behind that info graphic lies a web service that can be queried using JSON formed requests. Each Parliamentary constituency has an identifier code that can be used to query the web service and pull the results. The URL for a request is quite simple: http://infographics.economist.com/2017/ukelmap-2017/data/live-results2017/r2017E14000937.json This provides the results for the constituency Cambridgeshire, South East. The JSON object looks as follows resultCB({"swing": -3.84, "mpn": "Lucy Frazer", "electorate": "86121", "lib": 11958, "id": "E14000937", "name": "Cambridgeshire, South East", "lab": 17443, "con": 33601, "status": "hold", "pa_key": "123", "oth": 0, "region": "East Of England", "win": "con", "turnout": "63002"}) This piece of Javascript calls a function resultCB that updates one of the views of the infographic. In order to convert this to an R data frame, we can use the RJSONIO or jsonlite package functions fromJSON, after having removed the part that calls the function, i.e. library(jsonlite) as.data.frame(fromJSON(gsub("\\)$","",gsub("resultCB\\(","",readLines(con="http://infographics.economist.com/2017/ukelmap-2017/data/live-results2017/r2017E14000937.json"))))) ## id pa_key oth name win status swing lib ## 1 E14000937 123 0 Cambridgeshire, South East con hold -3.84 11958 ## region mpn electorate turnout lab con ## 1 East Of England Lucy Frazer 86121 63002 17443 33601

In order to build a data.frame of all election results, all that is necessary is to loop over the set of constituency codes available. I share the results from this step in the following spreadsheet Data for UK 2017 General Election Results (Economist Infographic).

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: fReigeist » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Schedule for useR!2017 now available

Fri, 06/09/2017 - 21:17

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The full schedule of talks for useR!2017, the global R user conference, has now been posted. The conference will feature 16 tutorials, 6 keynotes, 141 full talks, and 86 lightning talks starting on July 5 in Brussels. That's a lot to fir into 4 days, but I'm especially looking forward to the keynote presentations:

I'm also pleased to be attending with several of my Microsoft colleagues. You can find our talks below.

I hope you can attend too! Registration is still open if you'd like to join in. You can find the complete schedule linked below.

Sched: useR!2017

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Big Data Manipulation in R Exercises

Fri, 06/09/2017 - 17:50

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Some times it is necessary to download really big csv files to deliver some analysis. When you hit file sizes in Gigabytes it is useful to use R instead of spreadsheets. This exercise teaches us to manipulate this kind of files.

Answers to the exercises are available here.

Exercise 1
Create a directory canada immigration/Work/Income and put all files related to income then load dplyr.

Exercise 2
Create a string vector with file names: 00540002-eng, 00540005-eng, 00540007-eng, 00540009-eng, 00540011-eng, 00540013-eng, 00540015-eng, and 00540017-eng.

Exercise 3
Create a list of data frames and put the data of each file in list position. For example, data[[1]] will contain the first file. To reduce this data size, for each data set select only data from 2014.

Exercise 4
Clean up the first data sets in the list (data[[1]]) and exclude registers that summarizes other like: “Both sexes” to avoid double operations while summarizing.

Exercise 5
Clean up all other data sets in the list and exclude registers the same way discribed at exercise 4. Then, pile up all data in a sigle data set.

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

• import data into R in several ways while also beeing able to identify a suitable import tool
• use SQL code within R
• And much more

Exercise 6
Write a csv file with the recent create data set.

Exercise 7
Create a directory canada immigration/Work/Income and put all files related to income then load dplyr.
Create a string vector with file names: 00540018-eng, 00540019-eng, 00540020-eng, 00540021-eng, 00540022-eng, 00540023-eng, 00540024-eng, and 00540025-eng.
Create a list of data frames and put the data of each file in list position. For example, data[[1]] will contain the first file. To reduce this data size, for each data set select only data from 2014.

Exercise 8
Clean up the first data sets in the list (data[[1]]) and exclude registers that summarizes other like: “Both sexes” to avoid double operations while summarizing.

Exercise 9
Clean up all other data sets in the list and exclude registers the same way discribed at exercise 8. Then, pile up all data in a sigle data set.

Exercise 10
Write a csv file with the recent create data set.

Related exercise sets: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Managing intermediate results when using R/sparklyr

Fri, 06/09/2017 - 17:37

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

In our latest “R and big data” article we show how to manage intermediate results in non-trivial Apache Spark workflows using R, sparklyr, dplyr, and replyr.

Handle management

Many Sparklyr tasks involve creation of intermediate or temporary tables. This can be through dplyr::copy_to() and through dplyr::compute(). These handles can represent a reference leak and eat up resources.

To help control handle lifetime the replyr supplies record-retaining temporary name generators (and uses the same internally).

The actual function is pretty simple:

print(replyr::makeTempNameGenerator) ## function(prefix, ## suffix= NULL) { ## force(prefix) ## if((length(prefix)!=1)||(!is.character(prefix))) { ## stop("repyr::makeTempNameGenerator prefix must be a string") ## } ## if(is.null(suffix)) { ## alphabet <- c(letters, toupper(letters), as.character(0:9)) ## suffix <- paste(base::sample(alphabet, size=20, replace= TRUE), ## collapse = '') ## } ## count <- 0 ## nameList <- c() ## function(dumpList=FALSE) { ## if(dumpList) { ## v <- nameList ## nameList <<- c() ## return(v) ## } ## nm <- paste(prefix, suffix, sprintf('%010d',count), sep='_') ## nameList <<- c(nameList, nm) ## count <<- count + 1 ## nm ## } ## } ## ##

For instance to join a few tables it is a can be a good idea to call compute after each join (else the generated SQL can become large and unmanageable). This sort of code looks like the following:

# create example data names <- paste('table', 1:5, sep='_') tables <- lapply(names, function(ni) { di <- data.frame(key= 1:3) di[[paste('val',ni,sep='_')]] <- runif(nrow(di)) copy_to(sc, di, ni) }) # build our temp name generator tmpNamGen <- replyr::makeTempNameGenerator('JOINTMP') # left join the tables in sequence joined <- tables[[1]] for(i in seq(2,length(tables))) { ti <- tables[[i]] if(i<length(tables)) { joined <- compute(left_join(joined, ti, by='key'), name= tmpNamGen()) } else { # use non-temp name. joined <- compute(left_join(joined, ti, by='key'), name= 'joinres') } } # clean up temps temps <- tmpNamGen(dumpList = TRUE) print(temps) ## [1] "JOINTMP_9lWXvfnkhI2NPRsA1tEh_0000000000" ## [2] "JOINTMP_9lWXvfnkhI2NPRsA1tEh_0000000001" ## [3] "JOINTMP_9lWXvfnkhI2NPRsA1tEh_0000000002" for(ti in temps) { db_drop_table(sc, ti) } # show result print(joined) ## Source: query [3 x 6] ## Database: spark connection master=local[4] app=sparklyr local=TRUE ## ## # A tibble: 3 x 6 ## key val_table_1 val_table_2 val_table_3 val_table_4 val_table_5 ## ## 1 1 0.7594355 0.8082776 0.696254059 0.3777300 0.30015615 ## 2 2 0.4082232 0.8101691 0.005687125 0.9382002 0.04502867 ## 3 3 0.5941884 0.7990701 0.874374779 0.7936563 0.19940400

Careful introduction and management of materialized intermediates can conserve resources (both time and space) and greatly improve outcomes. We feel it is a good practice to set up an explicit temp name manager, pass it through all your Sparklyr transforms, and then clear temps in batches after the results no longer depend no the intermediates.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Unconf projects 5: mwparser, Gargle, arresteddev

Fri, 06/09/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

And finally, we end our series of unconf project summaries (day 1, day 2, day 3, day 4).

mwparser

Summary: Wikimarkup is the language used on Wikipedia and similar projects, and as such contains a lot of valuable data both for scientists studying collaborative systems and people studying things documented on or in Wikipedia. mwparser parses wikimarkup, allowing a user to filter down to specific types of tags such as links or templates, and then extract components of those tags.

Team: Oliver Keyes

Gargle

Summary: Gargle is a library that provides authentication for Google APIs but without all the agonizing pain. The package provides helper functions (for httr) to support automatic retries, paging, and progress bars for API calls.

Team: Craig Citro

arresteddev

Summary: This package is designed to help troubleshoot errors that come up during package and analysis development. As of now, the package helps track tracebacks and errors but more functionality is planned for the future.

Team: Lucy D'Agostino McGowan, Karthik Ram, Miles McBain

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### R Interface to Spark

Fri, 06/09/2017 - 06:30

SparkR

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = "")) sc <- sparkR.session(master = "local") df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true") ### SUMMARY TABLE WITH SQL createOrReplaceTempView(df1, "tbl1") summ <- sql("select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 where month in (1, 3, 5) group by month") head(summ) # month avg_dep avg_arr # 1 1 1347.210 1523.155 # 2 3 1359.500 1509.743 # 3 5 1351.168 1502.685 ### SUMMARY TABLE WITH AGG() grp <- groupBy(filter(df1, "month in (1, 3, 5)"), "month") summ <- agg(grp, avg_dep = avg(df1$dep_time), avg_arr = avg(df1$arr_time)) head(summ) # month avg_dep avg_arr # 1 1 1347.210 1523.155 # 2 3 1359.500 1509.743 # 3 5 1351.168 1502.685

sparklyr

library(sparklyr) sc <- spark_connect(master = "local") df1 <- spark_read_csv(sc, name = "tbl1", path = "nycflights13.csv", header = TRUE, infer_schema = TRUE) ### SUMMARY TABLE WITH SQL library(DBI) summ <- dbGetQuery(sc, "select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 where month in (1, 3, 5) group by month") head(summ) # month avg_dep avg_arr # 1 5 1351.168 1502.685 # 2 1 1347.210 1523.155 # 3 3 1359.500 1509.743 ### SUMMARY TABLE WITH DPLYR library(dplyr) summ <- df1 %>% filter(month %in% c(1, 3, 5)) %>% group_by(month) %>% summarize(avg_dep = mean(dep_time), avg_arr = mean(arr_time)) head(summ) # month avg_dep avg_arr # # 1 5 1351.168 1502.685 # 2 1 1347.210 1523.155 # 3 3 1359.500 1509.743

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Data Science for Business – Time Series Forecasting Part 2: Forecasting with timekit

Fri, 06/09/2017 - 02:00

In my last post, I prepared and visually explored time series data.

Now, I will use this data to test the timekit package for time series forecasting with machine learning.

Forecasting

In time series forecasting, we use models to predict future time points based on past observations.

As mentioned in timekit’s vignette, “as with most machine learning applications, the prediction is only as good as the patterns in the data. Forecasting using this approach may not be suitable when patterns are not present or when the future is highly uncertain (i.e. past is not a suitable predictor of future performance).”

And while this is certainly true, we don’t always have data with a strong regular pattern. And, I would argue, data that has very obvious patterns doesn’t need a complicated model to generate forecasts – we can already guess the future curve just by looking at it. So, if we think of use-cases for businesses, who want to predict e.g. product sales, forecasting models are especially relevant in cases where we can’t make predictions manually or based on experience.

The packages I am using are timekit for forecasting, tidyverse for data wrangling and visualization, caret for additional modeling functions, tidyquant for its ggplot theme, broom and modelr for (tidy) modeling.

library(tidyverse) library(caret) library(tidyquant) library(broom) library(timekit) library(modelr) options(na.action = na.warn)

Training and test data

My input data is the tibble retail_p_day, that was created in my last post.

I am splitting this dataset into training (all data points before/on Nov. 1st 2011) and test samples (all data points after Nov. 1st 2011).

retail_p_day <- retail_p_day %>% mutate(model = ifelse(day <= "2011-11-01", "train", "test")) colnames(retail_p_day)[grep("^[0-9]+", colnames(retail_p_day))] <- paste0("P_", colnames(retail_p_day)[grep("^[0-9]+", colnames(retail_p_day))])

Here, I am testing out timekit’s functions with the net income per day as response variable. Because the time series in our data set is relatively short and doesn’t cover multiple years, this forecast will only be able to capture recurring variation in days and weeks. Variations like increased sales before holidays, etc. would need additional data from several years to be accurately forecast.

As we can see in the plot below, the net income shows variation between days.

retail_p_day %>% ggplot(aes(x = day, y = sum_income, color = model)) + geom_point(alpha = 0.5) + geom_line(alpha = 0.5) + scale_color_manual(values = palette_light()) + theme_tq()

Augmenting the time series signature

With timekit, we can do forecasting with only a time series signature (a series of dates and times) and a corresponding response variable. If we had additional features that could be forecast independently, we could also introduce these into the model, but here, I will only work with the minimal data set.

A central function of timekit is tk_augment_timeseries_signature(), which adds a number of features based on the properties of our time series signature:

• index: The index value that was decomposed
• index.num: The numeric value of the index in seconds. The base is “1970-01-01 00:00:00”.
• diff: The difference in seconds from the previous numeric index value.
• year: The year component of the index.
• half: The half component of the index.
• quarter: The quarter component of the index.
• month: The month component of the index with base 1.
• month.xts: The month component of the index with base 0, which is what xts implements.
• month.lbl: The month label as an ordered factor beginning with January and ending with December.
• day: The day component of the index.
• hour: The hour component of the index.
• minute: The minute component of the index.
• second: The second component of the index.
• hour12: The hour component on a 12 hour scale.
• am.pm: Morning (AM) = 1, Afternoon (PM) = 2.
• wday: The day of the week with base 1. Sunday = 1 and Saturday = 7.
• wday.xts: The day of the week with base 0, which is what xts implements. Sunday = 0 and Saturday = 6.
• wday.lbl: The day of the week label as an ordered factor begining with Sunday and ending with Saturday.
• mday: The day of the month.
• qday: The day of the quarter.
• yday: The day of the year.
• mweek: The week of the month.
• week: The week number of the year (Sunday start).
• week.iso: The ISO week number of the year (Monday start).
• week2: The modulus for bi-weekly frequency.
• week3: The modulus for tri-weekly frequency.
• week4: The modulus for quad-weekly frequency.
• mday7: The integer division of day of the month by seven, which returns the first, second, third, … instance the day has appeared in the month. Values begin at 1. For example, the first Saturday in the month has mday7 = 1. The second has mday7 = 2.

Because we have missing data for the first column of diff, I am removing this row. We need to keep in mind too, that we have an irregular time series, because we never have data on Saturdays. This will affect the modeling and results and we need to account for this later on! Alternatively, it might make sense to compare the results when setting all NAs/Saturdays to 0, assuming that no information means that there was no income on a given day. Or we could impute missing values. Which strategy most accurately represents your data needs to be decided based on a good understanding of the business and how the data was collected.

retail_p_day_aug <- retail_p_day %>% rename(date = day) %>% select(model, date, sum_income) %>% tk_augment_timeseries_signature() %>% select(-contains("month")) retail_p_day_aug <- retail_p_day_aug[complete.cases(retail_p_day_aug), ]

Preprocessing

Not all of these augmented features will be informative for our model. For example, we don’t have information about time of day, so features like hour, minute, second, etc. will be irrelevant here.

Let’s look at column variation for all numeric feature and remove those with a variance of 0.

library(matrixStats) (var <- data.frame(colnames = colnames(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]), colvars = colVars(as.matrix(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]))) %>% filter(colvars == 0)) ## colnames colvars ## 1 hour 0 ## 2 minute 0 ## 3 second 0 ## 4 hour12 0 ## 5 am.pm 0 retail_p_day_aug <- select(retail_p_day_aug, -one_of(as.character(var$colnames))) Next, we want to remove highly correlated features. By plotting them, we can get an idea about which cutoff to set. library(ggcorrplot) cor <- cor(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]) p.cor <- cor_pmat(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]) ggcorrplot(cor, type = "upper", outline.col = "white", hc.order = TRUE, p.mat = p.cor, colors = c(palette_light()[1], "white", palette_light()[2])) I am going to choose a cutoff of 0.9 for removing features: cor_cut <- findCorrelation(cor, cutoff=0.9) retail_p_day_aug <- select(retail_p_day_aug, -one_of(colnames(cor)[cor_cut])) Now, I can split the data into training and test sets: train <- filter(retail_p_day_aug, model == "train") %>% select(-model) test <- filter(retail_p_day_aug, model == "test") Modeling To model the time series of the response variable sum_income, I am using a generalized linear model. We could try all kinds of different models and modeling parameters, but for this test I am keeping it simple. fit_lm <- glm(sum_income ~ ., data = train) We can examine our model e.g. by visualizing: tidy(fit_lm) %>% gather(x, y, estimate:p.value) %>% ggplot(aes(x = term, y = y, color = x, fill = x)) + facet_wrap(~ x, scales = "free", ncol = 4) + geom_bar(stat = "identity", alpha = 0.8) + scale_color_manual(values = palette_light()) + scale_fill_manual(values = palette_light()) + theme_tq() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) augment(fit_lm) %>% ggplot(aes(x = date, y = .resid)) + geom_hline(yintercept = 0, color = "red") + geom_point(alpha = 0.5, color = palette_light()[[1]]) + geom_smooth() + theme_tq() With this model, we can now add predictions and residuals for the test data… pred_test <- test %>% add_predictions(fit_lm, "pred_lm") %>% add_residuals(fit_lm, "resid_lm") … and visualize the residuals. pred_test %>% ggplot(aes(x = date, y = resid_lm)) + geom_hline(yintercept = 0, color = "red") + geom_point(alpha = 0.5, color = palette_light()[[1]]) + geom_smooth() + theme_tq() We can also compare the predicted with the actual sum income in the test set. pred_test %>% gather(x, y, sum_income, pred_lm) %>% ggplot(aes(x = date, y = y, color = x)) + geom_point(alpha = 0.5) + geom_line(alpha = 0.5) + scale_color_manual(values = palette_light()) + theme_tq() Forecasting Once we are satisfied with our model’s performance on the test set, we can use it to forecast future events. To create future time points for modeling, we need to extract the time index (the date column day in our data frame). # Extract index idx <- retail_p_day %>% tk_index() From this index we can generate the future time series. Here, we need to beware of a couple of things. Most importantly, we need to account for the irregularity of our data: We never have data for Saturdays and we have a few random missing values in between, as can be seen in the diff column of retail_p_day_aug (1 day difference == 86400 seconds). retail_p_day_aug %>% ggplot(aes(x = date, y = diff)) + geom_point(alpha = 0.5, aes(color = as.factor(diff))) + geom_line(alpha = 0.5) + scale_color_manual(values = palette_light()) + theme_tq() What dates are these? Let’s filter for dates with more than 1 day between the last recorded day that are not Sundays (as Saturdays are always off-days). retail_p_day_aug %>% select(date, wday.lbl, diff) %>% filter(wday.lbl != "Sunday" & diff > 86400) %>% mutate(days_missing = diff / 86400 -1) ## # A tibble: 5 x 4 ## date wday.lbl diff days_missing ## ## 1 2011-01-04 Tuesday 1036800 11 ## 2 2011-04-26 Tuesday 432000 4 ## 3 2011-05-03 Tuesday 172800 1 ## 4 2011-05-31 Tuesday 172800 1 ## 5 2011-08-30 Tuesday 172800 1 retail_p_day_aug %>% select(date, wday.lbl, diff) %>% filter(wday.lbl == "Sunday" & diff > 172800) %>% mutate(days_missing = diff / 86400 -1) ## # A tibble: 1 x 4 ## date wday.lbl diff days_missing ## ## 1 2011-05-01 Sunday 259200 2 Let’s create a list of all missing days: off_days <- c("2010-12-24", "2010-12-25", "2010-12-26", "2010-12-27", "2010-12-28", "2010-12-29", "2010-12-30", "2010-01-01", "2010-01-02", "2010-01-03", "2011-04-22", "2011-04-23", "2011-04-24", "2011-04-25", "2011-05-02", "2011-05-30", "2011-08-29", "2011-04-29", "2011-04-30") %>% ymd() Official UK holidays during that time were: • 2011: • Boxing Day December 26 • Christmas Day Holiday December 27 • 2012: • New Year’s Day Holiday January 2 • Good Friday April 6 • Easter Monday April 9 • Early May Bank Holiday May 7 • Spring Bank Holiday June 4 • Diamond Jubilee Holiday June 5 • Summer Bank Holiday August 27 We can account for the missing Saturdays with inspect_weekdays = TRUE. Ideally, we would now use skip_values and insert_values to specifically account for days with irregular missing data in our future time series, e.g. by accounting for holidays. Generally, it is very difficult to account for holidays, because they don’t occur with an easy to model rule (e.g. Easter is on the first Sunday after the first full moon in Spring). Unfortunately, in our dataset we have seen that holidays and randomly missing days did not have a big overlap in the past. Because not all holidays are missing days and we have more missing days than official holidays, I am using the list of missing days for skipping values – even though this is only a best-guess approach and likely not going to match all days that will be missing in reality during the future time series. idx_future <- idx %>% tk_make_future_timeseries(n_future = 300, inspect_weekdays = TRUE, inspect_months = FALSE, skip_values = off_days) idx_future %>% tk_get_timeseries_signature() %>% ggplot(aes(x = index, y = diff)) + geom_point(alpha = 0.5, aes(color = as.factor(diff))) + geom_line(alpha = 0.5) + scale_color_manual(values = palette_light()) + theme_tq() Then, we can build the data frame for forecasting by using tk_get_timeseries_signature() and renaming the index column to date, so that it matches the features in the model. With this data frame, we can now predict future values and add this to the data frame. data_future <- idx_future %>% tk_get_timeseries_signature() %>% rename(date = index) pred_future <- predict(fit_lm, newdata = data_future) pred_future <- data_future %>% select(date) %>% add_column(sum_income = pred_future) retail_p_day %>% select(day, sum_income) %>% rename(date = day) %>% rbind(pred_future) %>% ggplot(aes(x = date, y = sum_income)) + scale_x_date() + geom_vline(xintercept = as.numeric(max(retail_p_day$day)), color = "red", size = 1) + geom_point(alpha = 0.5) + geom_line(alpha = 0.5) + theme_tq()

When we evaluate the forecast, we want to account for uncertainty of accuracy, e.g. by accounting for the standard deviation of the test residuals.

test_residuals <- pred_test$resid_lm test_resid_sd <- sd(test_residuals, na.rm = TRUE) pred_future <- pred_future %>% mutate( lo.95 = sum_income - 1.96 * test_resid_sd, lo.80 = sum_income - 1.28 * test_resid_sd, hi.80 = sum_income + 1.28 * test_resid_sd, hi.95 = sum_income + 1.96 * test_resid_sd ) This, we can then plot to show the forecast with confidence intervals: retail_p_day %>% select(day, sum_income) %>% rename(date = day) %>% ggplot(aes(x = date, y = sum_income)) + geom_point(alpha = 0.5) + geom_line(alpha = 0.5) + geom_ribbon(aes(ymin = lo.95, ymax = hi.95), data = pred_future, fill = "#D5DBFF", color = NA, size = 0) + geom_ribbon(aes(ymin = lo.80, ymax = hi.80, fill = key), data = pred_future, fill = "#596DD5", color = NA, size = 0, alpha = 0.8) + geom_point(aes(x = date, y = sum_income), data = pred_future, alpha = 0.5, color = palette_light()[[2]]) + geom_smooth(aes(x = date, y = sum_income), data = pred_future, method = 'loess', color = "white") + theme_tq() Our model predicts that income will follow a curve that is very similar to last year’s with a drop after Christmas and an increase towards the later months of the year. In and off itself, this sounds reasonable. However, because we only have data from one year, we do not know whether the decline in January/February and the increase towards Christmas is an annually recurring trend or whether the increase we see at the end of 2011 will be independent of seasonality and continue to rise in the future. Next time, I’ll compare how Facebook’s prophet will predict the future income. sessionInfo() ## R version 3.4.0 (2017-04-21) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 7 x64 (build 7601) Service Pack 1 ## ## Matrix products: default ## ## locale: ## [1] LC_COLLATE=English_United States.1252 ## [2] LC_CTYPE=English_United States.1252 ## [3] LC_MONETARY=English_United States.1252 ## [4] LC_NUMERIC=C ## [5] LC_TIME=English_United States.1252 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] ggcorrplot_0.1.1 matrixStats_0.52.2 ## [3] modelr_0.1.0 timekit_0.3.0 ## [5] broom_0.4.2 tidyquant_0.5.1 ## [7] quantmod_0.4-8 TTR_0.23-1 ## [9] PerformanceAnalytics_1.4.3541 xts_0.9-7 ## [11] zoo_1.8-0 lubridate_1.6.0 ## [13] caret_6.0-76 lattice_0.20-35 ## [15] dplyr_0.5.0 purrr_0.2.2.2 ## [17] readr_1.1.1 tidyr_0.6.3 ## [19] tibble_1.3.1 ggplot2_2.2.1 ## [21] tidyverse_1.1.1 ## ## loaded via a namespace (and not attached): ## [1] httr_1.2.1 jsonlite_1.5 splines_3.4.0 ## [4] foreach_1.4.3 assertthat_0.2.0 stats4_3.4.0 ## [7] cellranger_1.1.0 yaml_2.1.14 backports_1.0.5 ## [10] quantreg_5.33 digest_0.6.12 rvest_0.3.2 ## [13] minqa_1.2.4 colorspace_1.3-2 htmltools_0.3.6 ## [16] Matrix_1.2-10 plyr_1.8.4 psych_1.7.5 ## [19] SparseM_1.77 haven_1.0.0 padr_0.3.0 ## [22] scales_0.4.1 lme4_1.1-13 MatrixModels_0.4-1 ## [25] mgcv_1.8-17 car_2.1-4 nnet_7.3-12 ## [28] lazyeval_0.2.0 pbkrtest_0.4-7 mnormt_1.5-5 ## [31] magrittr_1.5 readxl_1.0.0 evaluate_0.10 ## [34] nlme_3.1-131 MASS_7.3-47 forcats_0.2.0 ## [37] xml2_1.1.1 foreign_0.8-68 tools_3.4.0 ## [40] hms_0.3 stringr_1.2.0 munsell_0.4.3 ## [43] compiler_3.4.0 rlang_0.1.1 grid_3.4.0 ## [46] nloptr_1.0.4 iterators_1.0.8 labeling_0.3 ## [49] rmarkdown_1.5 gtable_0.2.0 ModelMetrics_1.1.0 ## [52] codetools_0.2-15 DBI_0.6-1 reshape2_1.4.2 ## [55] R6_2.2.1 knitr_1.16 rprojroot_1.2 ## [58] Quandl_2.8.0 stringi_1.1.5 parallel_3.4.0 ## [61] Rcpp_0.12.11 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Run massive parallel R jobs cheaply with updated doAzureParallel package Fri, 06/09/2017 - 00:41 (This article was first published on Revolutions, and kindly contributed to R-bloggers) At the EARL conference in San Francisco this week, JS Tan from Microsoft gave an update (PDF slides here) on the doAzureParallel package . As we've noted here before, this package allows you to easily distribute parallel R computations to an Azure cluster. The package was recently updated to support using automatically-scaling Azure Batch clusters with low-priority nodes, which can be used at a discount of up to 80% compared to the price of regular high-availability VMs. JS Tan using doAzureParallel #rstats package to run simulation on a cluster of 20 low-priority Azure VMs. Total cost:$0.02 #EARLConf2017 pic.twitter.com/Mpl3IUa9zY

— David Smith (@revodavid) June 7, 2017

Using the doAzureParallel package is simple. First, you need to define the cluster you're going to use as a JSON file. (You can see an example on the right.) Here, you'll specify your Azure credentials, the size of the cluster, and the type of nodes (CPUs and memory) to use in the cluster. You can also specify here R packages (from CRAN and/or Github) to be pre-loaded onto each node, and the maximum number of simultaneous tasks to run on each node (for within-node parallelism).

New to this update, the poolSize option allows you to specify the number of dedicated (standard) VM nodes to use, in addition to a number of low-priority nodes to use. Low-priority nodes can be pre-empted by the Azure system at any time, but are much cheaper to use. (Even if a node is pre-empted your parallel computation will be continue; it will just take a little longer with the reduced capacity.) You can even specify a minimum and maximum number of nodes of each class to use, in which case the cluster will automatically scale up and down according to either (your choice) the workload or the time of day (e.g. only expand the low-priority part of the cluster on weekends, when pre-emption is less likely).

Once you've defined the parameters of your cluster, all you need to do is declare the cluster as a backend for the foreach package. The body of the foreach loop runs just like a for loop, except that multiple iterations run in parallel on the remote cluster. Here are the key parts of the option price simulation example JS presented at the conference.

This same approach can be used for any "embarrassingly parallel" iteration in R, and you can use any R function or package within the body of the loop. For example, you could use a cluster to reduce the time required for parameter tuning and cross-validation with the caret package, or speed up data preparation tasks when using the dplyr package.

In addition to support for auto-scaling clusters, this update to doAzureParallel also includes a few other new features. You'll also find new utility functions for managing multiple long-running R jobs, functions to read data from and write data to Azure Blob storage, and the ability to pre-load data into the cluster by specifying resource files.

The doAzureParallel package is available for download now from Github, under the open-source MIT license. For details on how to use the package, check out the README and the doAzureParallel guide.

Github (Azure): doAzureParallel

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Introduction to Set Theory and Sets with R

Thu, 06/08/2017 - 22:00

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

Sets define a ‘collection’ of objects, or things typically referred to as ‘elements’ or ‘members.’ The concept of sets arises naturally when dealing with any collection of objects, whether it be a group of numbers or anything else. Conceptually, the following examples can be defined as a ‘set’:

• {1, 2, 3, 4}
• {Red, Green, Blue}
• {Cat, Dog}

The first example is the set of the first four natural numbers. The second defines a set of the primary colors while the third example denotes a set of common household pets.

Since its development beginning in the 1870s with Georg Cantor and Richard Dedekind, set theory has become a foundational system of mathematics, and therefore its concepts constantly arise in the study of mathematics and is also an area of active research today.

This post will introduce some of the basic concepts of set theory, specifically the Zermelo-Fraenkel axiomatic system (more on that later), with R code to demonstrate these concepts.

Set Notation

Sets can be defined with lowercase, uppercase, script or Greek letters (in addition to subscripts and the like). Using several types of letters helps when dealing with hierarchies. Before diving into set theory, it’s best to define the common notation employed. One benefit of set theory being ubiquitousness in mathematics is learning its notation also helps in the understanding of other mathematical concepts.

• \forall x – for every set x
• \exists x – there exists such a set x that
• \neg – not
• \wedge – and
• \vee – or (one or the other or both)
• \Rightarrow – implies
• \Leftrightarrow – iff, if and only if.
• A \cup B – union of sets A and B
• A \cap B – intersection of sets A and B
• x \in A – x is an element of a set A
• x \notin A – x is not an element of a set A

The \Rightarrow notation for implies can be thought of like an if statement in that it denotes the relation ‘if a then b.’

Set Membership

Set membership is written similar to:

x \in A

Which can be stated as ‘x is an element of the set A.’ If x is not a member of A, we write:

x \notin A

The symbol \in to denote set membership originated with Giuseppe Peano (Enderton, pp. 26).

Which is read ‘x is not an element of the set A.’ We can write an R function to implement the concept of set membership.

iselement <- function(x, A) { if (x %in% A) { return(TRUE) } return(FALSE) }

Let’s use our simple function to test if there exists some members in the set, A = \{3, 5, 7, 11\}.

A <- c(3, 5, 7, 11) eles <- c(3, 5, 9, 10, (5 + 6)) for (i in 1:length(eles)) { print(iselement(i, A)) } ## [1] FALSE ## [1] FALSE ## [1] TRUE ## [1] FALSE ## [1] TRUE

Set membership leads into one of the first axioms of set theory under the Zermelo-Fraenkel system, the Principle of Extensionality.

Principle of Extensionality

The principle of extensionality states if two sets have the same members, they are equal. The formal definition of the principle of extensionality can be stated more concisely using the notation given above:

\forall A \forall B (\forall x (x \in A \Leftrightarrow x \in B) \Rightarrow A = B)

Stated less concisely but still using set notation:

If two sets A and B are such that for every element (member) x:

x \in A \qquad iff \qquad x \in B Then A = B.

We can express this axiom through an R function to test for set equality.

isequalset <- function(a, b) { a <- unique(a) b <- unique(b) an <- length(a) if (an > length(b)) { return(FALSE) } for (i in 1:an) { if (!(a[i]) %in% b) { return(FALSE) } } return(TRUE) }

We can now put the principle of extensionality in action with our R function!

# original set A to compare A <- c(3, 5, 7, 11) # define some sets to test for equality B <- c(5, 7, 11, 3) C <- c(3, 4, 6, 5) D <- c(3, 5, 7, 11, 13) E <- c(11, 7, 5, 3) G <- c(3, 5, 5, 7, 7, 11) # collect sets into a list to iterate sets <- list(B, C, D, E, G) # using the isequalset() function, print the results of the equality tests. for (i in sets) { print(isequalset(i, A)) } ## [1] TRUE ## [1] FALSE ## [1] FALSE ## [1] TRUE ## [1] TRUE Empty Sets and Singletons

So far we have only investigated sets with two or more members. The empty set, denoted \varnothing, is defined as a set containing no elements and occurs surprisingly frequently in set-theoretic operations despite is seemingly straightforward and simple nature.

The empty set axiom, states the existence of an empty set concisely:

\exists B \forall x \qquad x \notin B Which can also be stated as ‘there is a set having no members.’

A set {\varnothing} can be formed whose only member is \varnothing. It is important to note {\varnothing} \neq \varnothing because \varnothing \in {\varnothing} but \varnothing \notin \varnothing. One can conceptually think of {\varnothing} as a container with nothing in it.

A singleton is a set with exactly one element, denoted typically by {a}. A nonempty set is, therefore, a set with one or more element. Thus a singleton is also nonempty. We can define another quick function to test if a given set is empty, a singleton or a nonempty set.

typeofset <- function(a) { if (length(a) == 0) { return('empty set') } else if (length(a) == 1) { return('singleton') } else if (length(a) > 1) { return('nonempty set') } else { stop('could not determine type of set') } } A <- c() B <- c(0) C <- c(1, 2) D <- list(c()) set_types <- list(A, B, C, D) for (i in set_types) { print(typeofset(i)) } ## [1] "empty set" ## [1] "singleton" ## [1] "nonempty set" ## [1] "singleton"

Note D is defined as a singleton because the set contains one element, the empty set \varnothing.

Summary

This post introduced some of the basic concepts of axiomatic set theory using the Zermelo-Fraenkel axioms by exploring the idea of set, set membership and some particular cases of sets such as the empty set and singletons. Set notation that will be used throughout not just set-theoretic applications but throughout mathematics was also introduced.

References

Barile, Margherita. “Singleton Set.” From MathWorld–A Wolfram Web Resource, created by Eric W. Weisstein. http://mathworld.wolfram.com/SingletonSet.html

Enderton, H. (1977). Elements of set theory (1st ed.). New York: Academic Press.

Weisstein, Eric W. “Empty Set.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/EmptySet.html

Weisstein, Eric W. “Nonempty Set.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/NonemptySet.html

The post Introduction to Set Theory and Sets with R appeared first on Aaron Schlegel.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Aaron Schlegel. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Campaign Response Testing no longer published on Udemy

Thu, 06/08/2017 - 20:33

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Our free video course Campaign Response Testing is no longer published on Udemy. It remains available for free on YouTube with all source code available from GitHub. I’ll try to correct bad links as I find them.

Udemy recently unilaterally instituted a new policy on free courses: “When a free course has a Recent Review Rating less than 4.1 and is flagged with a ‘high degree of confidence’ the course will be hidden from Udemy’s search.”

Campaign Response Testing is a free course with an all-time average rating of 4.14 and a recent rating of 3.85. We have kept the code up to date and answered student questions (there was no backlog of student questions).

Obviously others should have opinion of our public work (which may or may not be good). And we are keeping the course up for free with no-sign up necessary on YouTube (as we in no way want to hurt or inconvenience students).

But I do have an issue with Udemy’s new system: it is completely vulnerable to griefers and spammers. Accounts with “no skin in the game” (having paid nothing, possibly not even have watched the course, and without having to leave any review text in addition to the “high degree of confidence” star rating) can belittle free courses at will and in bulk. Currently there is no effective dispute mechanism (other than superficial palliatives such as: “answer student questions”), and like all lop-sided “fights against semi-anonymous trolls” it would be a pointless effort anyway. I understand the desire to increase the quality of free content (it is Udemy’s “introduction”) but I feel the introduced system would obviously be unworkable even before it was introduced.

So with hopefully no harm to our subscribers I am un-publishing the course. As is the case with Udemy policy anybody already subscribed should continue to have access to the course through Udemy. And as I have said: we have long sense ported the entire free course to YouTube and Github for free and login-free sharing.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Neural networks Exercises (Part-1)

Thu, 06/08/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Source: Wikipedia

Neural network have become a corner stone of machine learning in the last decade. Created in the late 1940s with the intention to create computer programs who mimics the way neurons process information, those kinds of algorithm have long been believe to be only an academic curiosity, deprived of practical use since they require a lot of processing power and other machine learning algorithm outperform them. However since the mid 2000s, the creation of new neural network types and techniques, couple with the increase availability of fast computers made the neural network a powerful tool that every data analysts or programmer must know.

In this series of articles, we’ll see how to fit a neural network with R, we’ll learn the core concepts we need to know to well apply those algorithms and how to evaluate if our model is appropriate to use in production. This set of exercises is an introduction to neural networks where we’ll use them to create two simple regression and clustering model. Doing so, we’ll use a lot of basic concepts we’ll explore further in future sets. If you want more informations about neural network, your can see this page.

Answers to the exercises are available here.

Exercise 1
We’ll start by creating the data set on which we want to do a simple regression. Set the seed to 42, generate 200 random points between -10 and 10 and store them in a vector named X. Then, create a vector named Y containing the value of sin(x). Neural network are a lot more flexible than most regression algorithms and can fit complex function with ease. The biggest challenge is to find the appropriate network function appropriate to the situation.

Exercise 2
A network function is made of three components: the network of neurons, the weight of each connection between neuron and the activation function of each neuron. For this example, we’ll use a feed-forward neural network and the logistic activation which are the defaults for the package nnet. We take one number as input of our neural network and we want one number as the output so the size of the input and output layer are both of one. For the hidden layer, we’ll start with three neurons. It’s good practice to randomize the initial weights, so create a vector of 10 random values, picked in the interval [-1,1].

Exercise 3
Neural networks have a strong tendency of overfitting your data, meaning they become really good at describing the relationship between the values in your data set, but are not effective with data that wasn’t used to train your model. As a consequence, we need to cross-validate our model. Set the seed to 42, then create a training set containing 75% of the values in your initial data set and a test set containing the rest of your data.

Exercise 4
Load the nnet package and use the function of the same name to create your model. Pass your weights via the Wts argument and set the maxit argument to 50. We want to fit a function which can have for output multiple possible values. To do so, set the linout argument to true. Finally, take the time to look at the structure of your model.

Learn more about neural networks in the online course Machine Learning A-Z™: Hands-On Python & R In Data Science. In this course you will learn how to:

• Work with Deep Learning networks and related packagse in R
• Create Natural Language Processing models
• And much more

Exercise 5
Predict the output for the test set and compute the RMSE of your predictions. Plot the function sin(x) and then plot your predictions.

Exercise 6
The number of neurons in the hidden layer, as well as the number of hidden layer used, has a great influence on the effectiveness of your model. Repeat the exercises three to five, but this time use a hidden layer with seven neurons and initiate randomly 22 weights.

Exercise 7
Now let us use neural networks to solve a clustering problems, so let’s load the iris data set! It is good practice to normalize your input data to uniformize the behavior of your model over different range of value and have a faster training. Normalize each factor so that they have a mean of zero and a standard deviation of 1, then create your train and test set.

Exercise 8
Use the nnet() and use a hidden layer of ten neurons to create your model. We want to fit a function which have a finite amount of value as output. To do so, set the code>linout argument to true. Look at the structure of your model. With clustering problem, the output is usually a factor that is coded as multiple dummy variables, instead of a single numeric value. As a consequence, the output layer have as one less neuron than the number of levels of the output factor.

Exercise 9
Make prediction with the values of the test set.

Exercise 10
Create the confusion table of your prediction and compute the accuracy of the model.

Related exercise sets: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Introducing the MonteCarlo Package

Thu, 06/08/2017 - 17:26

(This article was first published on R – first differences, and kindly contributed to R-bloggers)

My first R package has been released on CRAN recently. It is named MonteCarlo and aims to make simulation studies as easy as possible – including parallelization and the generation of tables.

What Are Simulation Studies Good For?

Monte Carlo simulations are an essential tool in statistics and related disciplines. They are routinely used to determine distributional properties, where no analytical results are available. A typical example is to study the finite sample properties of a new statistical procedure. Finite sample properties are often unknown, since theoretical results usually rely on asymptotic theory that only applies to infinite sample sizes. This may be a good approximation in practice if the sample size is large, but it may also be bad if the sample is small or if the rate of convergence is very slow.

Simulation studies are also extremely useful to compare the performance of alternative procedures for the same task. Imagine you want to test whether your sample comes from a Gaussian distribution and you have to decide whether to use the Jarque-Bera test or the Kolmogorov Smirnov test. Both appear to be legitimate choices. A simulation study that is tailored so that it reflects the situation at hand might uncover that one of the procedures is much more powerful than the other.

Finally, small scale simulation studies are an essential tool for statistical programming. Testing is an essential part of programming and software development. Usually, if one writes a function, it is good practice to let it run on some test cases. This is easy, if the outcome is deterministic. But if your function is a statistical test or an estimator, the output depends on the sample and is stochastic. Therefore, the only way to test whether the implementation is correct, is to generate a large number of samples and to see whether it has the expected statistical properties. For example one might test whether the mean squared error of an estimator converges to zero as the sample size increases, or whether the test has the correct alpha error.

Therefore, writing Monte Carlo simulations is an everyday task in many areas of statistics. This comes with considereable effort. It is not unusual that the required lines of code to produce a simulation study are a multiple of that needed to implement the procedure of interest. As a consequence of that they are also one of the main sources for errors. On top of this, the large computational effort often requires parallelization which brings additional complications and programming efforts. This efforts can often be prohibitive – especially for less advanced users.

The MonteCarlo Package

The MonteCarlo package streamlines the process described above. It allows to create simulation studies and to summarize their results in LaTeX tables quickly and easily.

There are only two main functions in the package:

1. MonteCarlo() runs a simulation study for a user defined parameter grid. It handles the generation of loops over these parameter grid and parallelizes the computation on a user specified number of CPU units.
2. MakeTable() creates LaTeX tables from the output of MonteCarlo(). It stacks high dimensional output arrays into tables with a user specified ordering of rows and columns.

To run a simulation study, the user has to nest both – the generation of a sample and the calculation of the desired statistics from this sample – in a single function. This function is passed to MonteCarlo(). No additional programming is required. This approach is not only very versatile – it is also very intuitive. The user formulates his experiment as if he/she was only interested in a single draw.

The aim of this approach is to allow the user full control and flexibility with regard to the design of the Monte Carlo experiment, but to take away all the effort of setting up the technical part of the simulation.

A Simple Example: The t-test

Suppose we want to evaluate the performance of a standard t-test for the hypothesis that the mean is equal to zero. We are interested to see how the size and power of the test change with the sample size (n), the distance from the null hypothesis (loc for location) and the standard deviation of the distribution (scale). The sample is generated from a normal distribution.

To conduct this analysis, we proceed as follows. First, we load the MonteCarlo package.

library(MonteCarlo)

Then define the following function.

######################################### ## Example: t-test # Define function that generates data and applies the method of interest ttest<-function(n,loc,scale){ # generate sample: sample<-rnorm(n, loc, scale) # calculate test statistic: stat<-sqrt(n)*mean(sample)/sd(sample) # get test decision: decision1.96 # return result: return(list("decision"=decision)) }

As discussed above, ttest() is formulated in a way as if we only want to generate a single test decision. The arguments of the function are the parameters we are interested in. ttest() carries out 4 steps:

1. Draw a sample of n observations from a normal distribution with mean loc and standard deviation scale.
2. Calculate the t-statistic.
3. Determine the test decision.
4. Return the desired result in form of a list.

We then define the combinations of parameters that we are interested in and collect them in a list. The elements of the lists must have the same names as the parameters for which we want to supply grids.

# define parameter grid: n_grid<-c(50,100,250,500) loc_grid<-seq(0,1,0.2) scale_grid<-c(1,2) # collect parameter grids in list: param_list=list("n"=n_grid, "loc"=loc_grid, "scale"=scale_grid)

To run the simulation, the function ttest() and the parameter grid (param_list) are passed to MonteCarlo(), together with the desired number of Monte Carlo repetitions (nrep=1000).

# run simulation: MC_result<-MonteCarlo(func=ttest, nrep=1000, param_list=param_list)

There is no further coding required. All the mechanics of the Monte Carlo experiment are handled by the MonteCarlo() function.

Calling summary produces a short information on the simulation.

summary(MC_result) ## Simulation of function: ## ## function(n,loc,scale){ ## ## # generate sample: ## sample<-rnorm(n, loc, scale) ## ## # calculate test statistic: ## stat<-sqrt(n)*mean(sample)/sd(sample) ## ## # get test decision: ## decision1.96 ## ## # return result: ## return(list("decision"=decision)) ## } ## ## Required time: 13.38 secs for nrep = 1000 repetitions on 1 CPUs ## ## Parameter grid: ## ## n : 50 100 250 500 ## loc : 0 0.2 0.4 0.6 0.8 1 ## scale : 1 2 ## ## ## 1 output arrays of dimensions: 4 6 2 1000

As one can see from the summary, the simulation results are stored in an array of dimension c(4,6,2,1000), where the Monte Carlo repetitions are collected in the last dimension of the array.

To summarize the results in a reasonable way and to include them as a table in a paper or report, we have to represent them in a matrix. This is handled by the MakeTable() function that stacks the submatrices collected in the array in the rows and columns of a matrix and prints the result in the form of code to generate a LaTeX table.

To determine in which order the results are stacked in rows and columns, we supply the function arguments rows and cols. These are vectors of the names of the parameters in the order in which we want them to appear in the table (sorted from the inside to the outside).

# generate table: MakeTable(output=MC_result, rows="n", cols=c("loc","scale"), digits=2, include_meta=FALSE) ## \begin{table}[h] ## \centering ## \resizebox{ 1 \textwidth}{!}{% ## \begin{tabular}{ rrrrrrrrrrrrrrr } ## \hline\hline\\\\ ## scale && \multicolumn{ 6 }{c}{ 1 } & & \multicolumn{ 6 }{c}{ 2 } \\ ## n/loc & & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 & & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 \\ ## & & & & & & & & & & & & & & \\ ## 50 & & 0.05 & 0.30 & 0.83 & 0.98 & 1.00 & 1.00 & & 0.05 & 0.10 & 0.28 & 0.55 & 0.79 & 0.94 \\ ## 100 & & 0.05 & 0.51 & 0.98 & 1.00 & 1.00 & 1.00 & & 0.07 & 0.16 & 0.53 & 0.84 & 0.98 & 1.00 \\ ## 250 & & 0.05 & 0.89 & 1.00 & 1.00 & 1.00 & 1.00 & & 0.05 & 0.35 & 0.90 & 1.00 & 1.00 & 1.00 \\ ## 500 & & 0.05 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 & & 0.06 & 0.58 & 1.00 & 1.00 & 1.00 & 1.00 \\ ## \\ ## \\ ## \hline\hline ## \end{tabular}% ## } ## \caption{ decision } ## \end{table}

To change the ordering, just change the vectors rows and cols.

# generate table: MakeTable(output=MC_result, rows=c("n","scale"), cols="loc", digits=2, include_meta=FALSE) ## \begin{table}[h] ## \centering ## \resizebox{ 1 \textwidth}{!}{% ## \begin{tabular}{ rrrrrrrrr } ## \hline\hline\\\\ ## scale & n/loc & & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 \\ ## & & & & & & & & \\ ## \multirow{ 4 }{*}{ 1 } & 50 & & 0.05 & 0.30 & 0.83 & 0.98 & 1.00 & 1.00 \\ ## & 100 & & 0.05 & 0.51 & 0.98 & 1.00 & 1.00 & 1.00 \\ ## & 250 & & 0.05 & 0.89 & 1.00 & 1.00 & 1.00 & 1.00 \\ ## & 500 & & 0.05 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 \\ ## & & & & & & & & \\ ## \multirow{ 4 }{*}{ 2 } & 50 & & 0.05 & 0.10 & 0.28 & 0.55 & 0.79 & 0.94 \\ ## & 100 & & 0.07 & 0.16 & 0.53 & 0.84 & 0.98 & 1.00 \\ ## & 250 & & 0.05 & 0.35 & 0.90 & 1.00 & 1.00 & 1.00 \\ ## & 500 & & 0.06 & 0.58 & 1.00 & 1.00 & 1.00 & 1.00 \\ ## \\ ## \\ ## \hline\hline ## \end{tabular}% ## } ## \caption{ decision } ## \end{table}

Now we can simply copy the code and add it to our paper, report or presentation. That is all. Only make sure that the package multirow is included in the header of the .tex file.

Parallelised Simulation

If the procedure you are interested in is not so fast or you need a large number of replications to produce very accurate results, you might want to use parallelized computation on multiple cores of your computer (or cluster). To achive this, simply specify the number of CPUs by supplying a value for the argument ncpus of MonteCarlo as shown below. Of course you should actually have at least the specified number of units.

# run simulation: MC_result<-MonteCarlo(func=ttest, nrep=1000, param_list=param_list, ncpus=4)

This automatically sets up a snow cluster, including the export of functions and the loading of packages. The user does not have to take care of anything.

Further Information

This is an introduction to get you up and running with the MonteCarlo package as quickly as possible. Therefore, I only included a short example. However, the functions MonteCarlo() and particularly MakeTable() provide much more functionality. This is described in more detail in the package vignette, that also provides additional examples.

Filed under: R

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – first differences. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));