R news and tutorials contributed by (750) R bloggers
Updated: 3 hours 34 min ago

### Joining Tables in SparkR

Tue, 06/13/2017 - 07:27

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = "")) sc <- sparkR.session(master = "local") df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true") grp1 <- groupBy(filter(df1, "month in (1, 2, 3)"), "month") sum1 <- withColumnRenamed(agg(grp1, min_dep = min(df1$dep_delay)), "month", "month1") grp2 <- groupBy(filter(df1, "month in (2, 3, 4)"), "month") sum2 <- withColumnRenamed(agg(grp2, max_dep = max(df1$dep_delay)), "month", "month2") # INNER JOIN showDF(merge(sum1, sum2, by.x = "month1", by.y = "month2", all = FALSE)) showDF(join(sum1, sum2, sum1$month1 == sum2$month2, "inner")) #+------+-------+------+-------+ #|month1|min_dep|month2|max_dep| #+------+-------+------+-------+ #| 3| -25| 3| 911| #| 2| -33| 2| 853| #+------+-------+------+-------+ # LEFT JOIN showDF(merge(sum1, sum2, by.x = "month1", by.y = "month2", all.x = TRUE)) showDF(join(sum1, sum2, sum1$month1 == sum2$month2, "left")) #+------+-------+------+-------+ #|month1|min_dep|month2|max_dep| #+------+-------+------+-------+ #| 1| -30| null| null| #| 3| -25| 3| 911| #| 2| -33| 2| 853| #+------+-------+------+-------+ # RIGHT JOIN showDF(merge(sum1, sum2, by.x = "month1", by.y = "month2", all.y = TRUE)) showDF(join(sum1, sum2, sum1$month1 == sum2$month2, "right")) #+------+-------+------+-------+ #|month1|min_dep|month2|max_dep| #+------+-------+------+-------+ #| 3| -25| 3| 911| #| null| null| 4| 960| #| 2| -33| 2| 853| #+------+-------+------+-------+ # FULL JOIN showDF(merge(sum1, sum2, by.x = "month1", by.y = "month2", all = TRUE)) showDF(join(sum1, sum2, sum1$month1 == sum2$month2, "full")) #+------+-------+------+-------+ #|month1|min_dep|month2|max_dep| #+------+-------+------+-------+ #| 1| -30| null| null| #| 3| -25| 3| 911| #| null| null| 4| 960| #| 2| -33| 2| 853| #+------+-------+------+-------+

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### RcppMsgPack 0.1.1

Tue, 06/13/2017 - 04:24

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new package! Or at least new on CRAN as the very initial version 0.1.0 had been available via the ghrr drat for over a year. But now we have version 0.1.1 to announce as a CRAN package.

RcppMspPack provides R with MessagePack header files for use via C++ (or C, if you must) packages such as RcppRedis.

MessagePack itself is an efficient binary serialization format. It lets you exchange data among multiple languages like JSON. But it is faster and smaller. Small integers are encoded into a single byte, and typical short strings require only one extra byte in addition to the strings themselves.

MessagePack is used by Redis and many other projects, and has bindings to just about any language.

To use this package, simply add it to the LinkingTo: field in the DESCRIPTION field of your R package—and the R package infrastructure tools will then know how to set include flags correctly on all architectures supported by R.

More information may be on the RcppMsgPack page. Issues and bugreports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Data Science for Business – Time Series Forecasting Part 3: Forecasting with Facebook’s Prophet

Tue, 06/13/2017 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

In my last two posts (Part 1 and Part 2), I explored time series forecasting with the timekit package.

In this post, I want to compare how Facebook’s prophet performs on the same dataset.

Predicting future events/sales/etc. isn’t trivial for a number of reasons and different algorithms use different approaches to handle these problems. Time series data does not behave like a regular numeric vector, because months don’t have the same number of days, weekends and holidays differ between years, etc. Because of this, we often have to deal with multiple layers of seasonality (i.e. weekly, monthly, yearly, irregular holidays, etc.). Regularly missing days, like weekends, are easier to incorporate into time series models than irregularly missing days.

Timekit uses a time series signature for modeling, which we used as features to build our model of choice (e.g. a linear model). This model was then used for predicting future dates.

Prophet is Facebook’s time series forecasting algorithm that was just recently released as open source software with an implementation in R.

Prophet is a procedure for forecasting time series data. It is based on an additive model where non-linear trends are fit with yearly and weekly seasonality, plus holidays. It works best with daily periodicity data with at least one year of historical data. Prophet is robust to missing data, shifts in the trend, and large outliers.”

(I am not going to discuss forecast and ARIMA or other models because they are quite well established with lots and lots of excellent tutorials out there.)

Training and Test data

I am using the same training and test intervals as in my last post using timekit.

Just as with timekit, prophet starts with a data frame that consists of a date column and the respective response variable for each date.

library(prophet) library(tidyverse) library(tidyquant) retail_p_day <- retail_p_day %>% mutate(model = ifelse(day <= "2011-11-01", "train", "test")) train <- filter(retail_p_day, model == "train") %>% select(day, sum_income) %>% rename(ds = day, y = sum_income) test <- filter(retail_p_day, model == "test") %>% select(day, sum_income) %>% rename(ds = day)

Model building

In contrast to timekit, we do not “manually” augment the time series signature in prophet, we can directly feed our input data to the prophet() function (check the function help for details on optional parameters).

To make it comparable, I am feeding the same list of irregularly missing days to the prophet() function. As discussed in the last post, I chose not to use a list of holidays because the holidays in the observation period poorly matched the days that were actually missing.

off_days <- data.frame(ds = as.Date(c("2010-12-24", "2010-12-25", "2010-12-26", "2010-12-27", "2010-12-28", "2010-12-29", "2010-12-30", "2010-01-01", "2010-01-02", "2010-01-03", "2011-04-22", "2011-04-23", "2011-04-24", "2011-04-25", "2011-05-02", "2011-05-30", "2011-08-29", "2011-04-29", "2011-04-30"))) %>% mutate(holiday = paste0("off_day_", seq_along(1:length(ds)))) prophet_model_test <- prophet(train, growth = "linear", # growth curve trend n.changepoints = 100, # Prophet automatically detects changes in trends by selecting changepoints from the data yearly.seasonality = FALSE, # yearly seasonal component using Fourier series weekly.seasonality = TRUE, # weekly seasonal component using dummy variables holidays = off_days) ## Initial log joint probability = -8.3297 ## Optimization terminated normally: ## Convergence detected: relative gradient magnitude is below tolerance

Predicting test data

With our model, we can now predict on the test data and compare the predictions with the actual values.

forecast_test <- predict(prophet_model_test, test)

Just as with timekit, I want to have a look at the residuals. Compared to timekit, the residuals actually look almost identical…

forecast_test %>% mutate(resid = sum_income - yhat) %>% ggplot(aes(x = ds, y = resid)) + geom_hline(yintercept = 0, color = "red") + geom_point(alpha = 0.5, color = palette_light()[[1]]) + geom_smooth() + theme_tq()

… As does the comparison plot. So, here it seems that prophet built a model that is basically identical to the linear model I used with timekit.

forecast_test %>% gather(x, y, sum_income, yhat) %>% ggplot(aes(x = ds, y = y, color = x)) + geom_point(alpha = 0.5) + geom_line(alpha = 0.5) + scale_color_manual(values = palette_light()) + theme_tq()

Predicting future sales

Now, let’s see whether the future predictions will be identical as well.

Just like with timekit, I am using a future time series of 300 days. Here, we see a slight difference in how we generate the future time series: with timekit I could use the entire index of observed dates, together with the list of missing days, while prophet uses the forecasting model that was generated for comparing the test data, i.e. it only considers the dates from the training set. We could built a new model with the entire dataset but this would then be different to how I approached the modeling with timekit.

future <- make_future_dataframe(prophet_model_test, periods = 300) forecast <- predict(prophet_model_test, future) plot(prophet_model_test, forecast) + theme_tq()

Interestingly, prophet’s forecast is distinctly different from timekit’s, despite identical performance on test samples! While timekit predicted a drop at the beginning of the year (similar to the training period), prophet predicts a steady increase in the future. It looks like timekit put more weight on the overall pattern during the training period, while prophet seems to put more weight on the last months, which showed a rise in net income.

sessionInfo() ## R version 3.4.0 (2017-04-21) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 7 x64 (build 7601) Service Pack 1 ## ## Matrix products: default ## ## locale: ## [1] LC_COLLATE=English_United States.1252 ## [2] LC_CTYPE=English_United States.1252 ## [3] LC_MONETARY=English_United States.1252 ## [4] LC_NUMERIC=C ## [5] LC_TIME=English_United States.1252 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] tidyquant_0.5.1 quantmod_0.4-8 ## [3] TTR_0.23-1 PerformanceAnalytics_1.4.3541 ## [5] xts_0.9-7 zoo_1.8-0 ## [7] lubridate_1.6.0 dplyr_0.5.0 ## [9] purrr_0.2.2.2 readr_1.1.1 ## [11] tidyr_0.6.3 tibble_1.3.1 ## [13] ggplot2_2.2.1 tidyverse_1.1.1 ## [15] prophet_0.1.1 Rcpp_0.12.11 ## ## loaded via a namespace (and not attached): ## [1] rstan_2.15.1 reshape2_1.4.2 haven_1.0.0 ## [4] lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6 ## [7] stats4_3.4.0 yaml_2.1.14 rlang_0.1.1 ## [10] foreign_0.8-68 DBI_0.6-1 modelr_0.1.0 ## [13] readxl_1.0.0 plyr_1.8.4 stringr_1.2.0 ## [16] Quandl_2.8.0 munsell_0.4.3 gtable_0.2.0 ## [19] cellranger_1.1.0 rvest_0.3.2 codetools_0.2-15 ## [22] psych_1.7.5 evaluate_0.10 labeling_0.3 ## [25] inline_0.3.14 knitr_1.16 forcats_0.2.0 ## [28] parallel_3.4.0 broom_0.4.2 scales_0.4.1 ## [31] backports_1.0.5 StanHeaders_2.15.0-1 jsonlite_1.5 ## [34] gridExtra_2.2.1 mnormt_1.5-5 hms_0.3 ## [37] digest_0.6.12 stringi_1.1.5 grid_3.4.0 ## [40] rprojroot_1.2 tools_3.4.0 magrittr_1.5 ## [43] lazyeval_0.2.0 xml2_1.1.1 extraDistr_1.8.5 ## [46] assertthat_0.2.0 rmarkdown_1.5 httr_1.2.1 ## [49] R6_2.2.1 nlme_3.1-131 compiler_3.4.0 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### thinning a Markov chain, statistically

Tue, 06/13/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

Art Owen has arXived a new version of his thinning MCMC paper, where he studies how thinning or subsampling can improve computing time in MCMC chains. I remember quite well the message set by Mark Berliner and Steve MacEachern in an early 1990’s paper that subsampling was always increasing the variance of the resulting estimators. We actually have this result in our Monte Carlo Statistical Methods book. Now, there are other perspectives on this, as for instance cases when thinning can be hard-wired by simulating directly a k-step move, delaying rejection or acceptance, prefetching, or simulating directly the accepted values as in our vanilla Rao-Blackwellisation approach. Here, Art considers the case when there is a cost θ of computing a transform of the simulation [when the transition cost a unit] and when those transforms are positively correlated with correlation ρ. Somewhat unsurprisingly, when θ is large enough, thinning becomes worth implementing. But requires extra computations in evaluating the correlation ρ and the cost θ, which is rarely comparable with the cost of computing the likelihood itself, a requirement for the Metropolis-Hastings or Hamiltonian Monte Carlo step(s).  Subsampling while keeping the right target (which is a hard constraint!) should thus have a much more effective impact on computing budgets.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### The magic trick that highlights significant results on any table

Mon, 06/12/2017 - 22:29

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

This post describes the single biggest time saving technique that I know about for highlighting significant results on a table. The table below, which shows the results of a MANOVA, illustrates the trick. The coloring and bold fonts on the table quickly draw the eye to the two key patterns in the results: people aged 50 or more have lower averages, and the result is strongest for Intel. This is revealed by the coral-colored shading and the bolding.

The same principles can be applied to any table containing numbers. The table below shows people’s attitude towards Microsoft by age. The attitude is split into three categories: Detractors, Neutrals, and Promoters. What story does it tell? If you are diligent you can work it out. But, it is hard work to do so because it requires you to compare and contrast 15 different numbers on the table.

Now look at the table below. The colors and arrow tell us people aged 50 or more are less likely to be Promoters than people in the other age groups. Sure we could get to this insight by reading the first table and doing a whole lot of significance tests. But, the simple use of coloring and an arrow makes the key result much easier to find.

Formatting the table to emphasize differences in residuals is the basic principle used in both examples illustrated above.

A simple approach to computing residuals

How do we work out which cells are interesting automatically? To illustrate this, I will walk through the calculations used in the table showing attitude towards Microsoft by age, returning to more complex analyses later in this post.

We start with a table showing the total percentages (i.e., the number of people in each cell of the table, divided by the total sample size). With this table, our cell showing Promoters aged 50 or more shows a value of 3%. Is this result interesting?

To answer this question we should look at the TOTALs. We can see that 27% is the TOTAL percentage for 50 or more. So, all else being equal we should expect that 27% of Promoters will be aged 50 or more. As 21% of the respondents are Promoters, we would then expect, all else being equal, that the percentage that are Promoters aged 50 or more will be 27% * 21% = 6%. The actual percentage shown on the table is only 3%, so we can conclude that the proportion of people that are Promoters and aged 50 or more is roughly half what we would expect if there was no relationship between age and attitude.

The 6% value that I computed above for Promoters aged 50 or more is called the expected value. It is an estimate of the percentage of people in the sample who would be Promoters aged 50 or more, in the event that age and attitude are unrelated. The table below shows the expected values for all the cells in the table. That is, the values computed by multiplying together all the row and column totals for each cell. It is showing 5.6% rather than 6% for the Promoters aged 50 or more as I have used all the decimal places in the data when performing the calculations.

This next table shows the beginning of the magic. Now I have subtracted the expected values from the actual (observed) percentages . The resulting table shows the residuals (residuals = observed – expected values). Cells close to 0% tend to be uninteresting.

Using statistical significance to save more time: standardized residuals

The residuals are difficult to correctly interpret. They have two problems. First, how big is “big”? Second, and less obviously, it is often not appropriate to compare them. We can readily see both of these problems by looking at the residual of -2.6 for Neutral and 30 to 49. This is the biggest residual on the table. It is bigger than the residual for Promoters aged 50 or more. However, the expected value for this the Neutrals aged 30 to 49 is much closer to 50% than it is for the Promoters aged 50 or more. If you have studied your introductory stats, you will recall that this means that there is more uncertainty about the estimate (i.e., a higher sampling error).

There is a simple solution to this problem: scale the residuals so that they show statistical significance. This is straightforward to do (to see the underlying R code, click here to go into Displayr and click on the tables). These standardized residuals are z-scores.  This means that values of more than 1.96, or less than -1.96, are significant at the 0.05 level. The only cell in this table that is significant at the 0.05 level is Promoters in the 50 or more age group. This tells us that this is the first cell we should look at when exploring the table.

Formatting

The standardized residuals provide two types of information that allow us to quickly see patterns on a table. First, we have the standardized residuals themselves. On the table below, negative residuals are shaded in coral and positive values in blue, with the degree of shading proportional to the values. The second bit of information is the statistical significance of the cells (e.g., whether the z-scores are significant at the 0.05 level). This has been shown via bolding.

The table below uses the standardized residuals in a slightly different way:

• Arrow lengths are proportional to to the standardized residuals.
• Arrows only appear when the residual is significant.
• The color of the fonts and the arrows indicates whether the residuals are positive or negative.
• The order of the rows and columns follows the values of the standardized residuals (for Google and Fun).

The net effect of these formatting choices is that it becomes relatively easy to extract key conclusions from the table. For example, Apple does very well on Stylish, but less so on the price-related dimensions.

Techniques for other types of data

The “magic” consists of two separate things: (1) emphasizing differences in residuals and (2) statistical significance. There are lots of different ways to achieve these outcomes.

Computing residuals

The basic method for calculating the residuals illustrated above is technically only “correct” for tables with mutually exclusive categorical variables in the rows and columns. However, it is widely used with pretty much all types of data and does a good job. However, there are no shortage of alternative approaches. For example, in the MANOVA example the residuals are t-statistics. The table immediately above uses a log-linear model to compute the expected values.

Computing statistical significance

The method to use to compute statistical significance depends both on the data and the assumptions you want to make. Provided that you can compute the standard error for every cell on the table, you can compute a z- or t-statistic by dividing the residual by the standard error. You can of course get more rigorous. In the MANOVA example above, for example, I have computed robust standard errors and applied the false discovery rate correction.

Software

For this post, I calculated and formatted the standardized residuals and the MANOVA using R (the formatting uses the formattable package). Click here to go into Displayr, then click on the MANOVA output table in Displayr and you will see the R CODE.

To create the two tables that show standardized residuals with arrows and font colors, I used Displayr, which automatically computes standardized residuals on all tables, and formats them in this way.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Interfacing with APIs using R: the basics

Mon, 06/12/2017 - 22:18

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

While R (and its package ecosystem) provides a wealth of functions for querying and analyzing data, in our cloud-enabled world there's now a plethora of online services with APIs you can use to augment R's capabilities. Many of these APIs use a RESTful interface, which means you will typically send/receive data encoded in the JSON format using HTTP commands.

Fortunately, as Steph Locke explains in her most recent R Quick tip, the process is pretty simple using R:

1. Obtain an authentication key for using the service
2. Find the URL of the API service you wish to use
3. Convert your input data to JSON format using toJSON in the jsonlite package
4. Send your data to the API service using the POST function in the httr package. Include your API key using the add_headers function
5. Extract your results from the API response using the fromJSON function

Steph provides an example of detecting text language using the Microsoft Cognitive Services Text Analytics API, but that basic recipe should work for most APIs. (API keys for other Microsoft Cognitive Services APIs for vision, speech, language, knowledge and search are also available for free use with your Github account.) There many APIs available for use around the web, and this list of public APIs maintained by Abhishek Banthia is also a useful resource.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Workshop on Monetizing R Packages

Mon, 06/12/2017 - 21:35

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

Last week I gave a talk at the San Francisco EARL Conference about monetizing R packages. The talk was well received, so this Thursday at 9am PT I will be giving the same presentation as a live workshop.

This workshop covers my initial foray into monetizing choroplethr, which occurred in 2015. You will learn:

1. Why I decided to monetize choroplethr.
2. The strategy I used.
3. Three tactics that can help you monetize your own R package.

The workshop is free to attend. There will be a live Q&A at the end.

The post Workshop on Monetizing R Packages appeared first on AriLamstein.com.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Clustering

Mon, 06/12/2017 - 20:53

(This article was first published on R – Real Data, and kindly contributed to R-bloggers)

Hello, everyone! I’ve been meaning to get a new blog post out for the
past couple of weeks. During that time I’ve been messing around with
clustering. Clustering, or cluster analysis, is a method of data mining
that groups similar observations together. Classification and clustering
are quite alike, but clustering is more concerned with exploration than
an end result.

Note: This post is far from an exhaustive look at all clustering has to
offer. Check out this guide
for more. I am reading Data Mining by Aggarwal presently, which is very informative.

data("iris") head(iris) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa

For simplicity, we’ll use the iris dataset. We’re going to try to use
the numeric data to correctly group the observations by species. There
are 50 of each species in the dataset, so ideally we would end up with
three clusters of 50.

library(ggplot2) ggplot() + geom_point(aes(iris$Sepal.Length, iris$Sepal.Width, col = iris$Species)) As you can see, there is already some groupings present. Let’s use hierarcical clustering first. iris2 <- iris[,c(1:4)] #not going to use the Species column. medians <- apply(iris2, 2, median) mads <- apply(iris2,2,mad) iris3 <- scale(iris2, center = medians, scale = mads) dist <- dist(iris3) hclust <- hclust(dist, method = 'median') #there are several methods for hclust cut <- cutree(hclust, 3) table(cut) ## cut ## 1 2 3 ## 49 34 67 iris <- cbind(iris, cut) iris$cut <- factor(iris$cut) levels(iris$cut) <- c('setosa', 'versicolor', 'virginica') err <- iris$Species == iris$cut table(err) ## err ## FALSE TRUE ## 38 112 ggplot() + geom_point(aes(iris2$Sepal.Length, iris2$Sepal.Width, col = iris$cut)) Nice groupings here, but it looks like the algorithm has some trouble determining between versicolor and virginica. If we used this information to classify the observations, we’d get an error rate of about .25. Let’s try another clustering technique: DBSCAN. library(dbscan) db <- dbscan(iris3, eps = 1, minPts = 20) table(db$cluster) ## ## 0 1 2 ## 5 48 97 iris2 <- cbind(iris2, db$cluster) iris2$cluster <- factor(iris2$db$cluster) ggplot() + geom_point(aes(iris2$Sepal.Length, iris2$Sepal.Width, col = iris2$cluster)) DBSCAN classifies points into three different categories: core, border, and noise points on the basis of density. Thus, the versicolor/ virginica cluster is treated as one group. Since our data is not structured in such a way that density is meaningful, DBSCAN is probably not a wise choice here. Let’s look at one last algo: the ROCK. No, not that ROCK. library(cba) distm <- as.matrix(dist) rock <- rockCluster(distm, 3, theta = .02) ## Clustering: ## computing distances ... ## computing links ... ## computing clusters ... iris$rock <- rock$cl levels(iris$rock) <- c('setosa', 'versicolor', 'virginica') ggplot() + geom_point(aes(iris2$Sepal.Length, iris2$Sepal.Width, col = rock$cl)) err <- (iris$Species == iris$rock) table(err) ## err ## FALSE TRUE ## 24 126 While it may not look like it, the ROCK does the best job at determining clusters in this data – the error rate dropped to 16%. Typically this method is reserved for categorical data, but since we used dist it shouldn't cause any problems. I have been working on a project using some of these (and similar) data mining procedures to explore spatial data and search for distinct groups. While clustering the iris data may not be all that meaningful, I think it is illustrative of the power of clustering. I have yet to try higher-dimension clustering techniques, which might be even better at determining Species. Have any comments? Questions? Suggestions for future posts? I am always happy to hear from readers; please contact me! Happy clustering, Kiefer var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Real Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### LASSO regression in R exercises Mon, 06/12/2017 - 18:00 (This article was first published on R-exercises, and kindly contributed to R-bloggers) Lease Absolute Shrinkage and Selection Operator (LASSO) performs regularization and variable selection on a given model. Depending on the size of the penalty term, LASSO shrinks less relevant predictors to (possibly) zero. Thus, it enables us to consider a more parsimonious model. In this exercise set we will use the glmnet package (package description: here) to implement LASSO regression in R. Answers to the exercises are available here. Exercise 1 Load the lars package and the diabetes dataset (Efron, Hastie, Johnstone and Tibshirani (2003) “Least Angle Regression” Annals of Statistics). This has patient level data on the progression of diabetes. Next, load the glmnet package that will be used to implement LASSO. Exercise 2 The dataset has three matrices x, x2 and y. While x has a smaller set of independent variables, x2 contains the full set with quadratic and interaction terms. y is the dependent variable which is a quantitative measure of the progression of diabetes. It is a good idea to visually inspect the relationship of each of the predictors with the dependent variable. Generate separate scatterplots with the line of best fit for all the predictors in x with y on the vertical axis. Use a loop to automate the process. Exercise 3 Regress y on the predictors in x using OLS. We will use this result as benchmark for comparison. Exercise 4 Use the glmnet function to plot the path of each of x’s variable coefficients against the L1 norm of the beta vector. This graph indicates at which stage each coefficient shrinks to zero. Learn more about the glmnet package in the online course Regression Machine Learning with R. In this course you will learn how to: • Avoid model over-fitting using cross-validation for optimal parameter selection • Explore maximum margin methods such as best penalty of error term support vector machines with linear and non-linear kernels. • And much more Exercise 5 Use the cv.glmnet function to get the cross validation curve and the value of lambda that minimizes the mean cross validation error. Exercise 6 Using the minimum value of lambda from the previous exercise, get the estimated beta matrix. Note that some coefficients have been shrunk to zero. This indicates which predictors are important in explaining the variation in y. Exercise 7 To get a more parsimonious model we can use a higher value of lambda that is within one standard error of the minimum. Use this value of lambda to get the beta coefficients. Note that more coefficients are now shrunk to zero. Exercise 8 As mentioned earlier, x2 contains a wider variety of predictors. Using OLS, regress y on x2 and evaluate results. Exercise 9 Repeat exercise-4 for the new model. Exercise 10 Repeat exercises 5 and 6 for the new model and see which coefficients are shrunk to zero. This is an effective way to narrow down on important predictors when there are many candidates. Related exercise sets: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### RStudio meets MilanoR! June 29th, Milan Mon, 06/12/2017 - 16:57 Hello R-Users, we have a great news! We are going to host Nathan Stephens and Joseph Rickert from RStudio and R Consortium: they are coming to Milano just for us (from USA) to meet the MilanoR community and talk about the latest news from RStudio and R Consortium! Nathan, director of solutions engineering at RStudio, will discuss how the Tidyverse, R Markdown, and Shiny work together with RStudio products to enable better R for Data Science. Joseph, member of the R Consortium board of directors, will tell us all about the R Consortium news and RStudio community support activities. Given the intersection of topics and interests of R in the Data Science space, the event is organized in collaboration with “Data Science Milan, so the two communities will also have the chance to meet. Therefore, on June 29th, in the evening, we will have two interesting talks by two important guests, plus a free buffet & lots of networking opportunities in the cosy garden of Impact Hub. How can you miss it? Program • Welcome presentations • RStudio presents: Publishing and Self-service Analytics with R by Nathan Stephens • R Consortium News and RStudio Community Support by Joseph Rickert • Free refreshment & networking time What’s the price? The event is free and open to everybody. Given the exceptionality of our guests, we did our best to host as many people as possible. So, the meeting is open to max 200 participants, but registration is needed on the Eventbrite event page. Where is it? Sala Biliardo, Impact Hub – Via Aosta 4, Milano (M5 Cenisio) —— FAQ I want to know more about the talks.. Publishing and Self-service Analytics with R: At RStudio we think about a model for data science that begins with importing and tidying data, continues with an iterative cycle of transforming, modelling, and visualizing data, and ends with communicating results. We’ve built tools and packages to help you succeed across this cycle: RStudio and Notebooks for interactive development, the Tidyverse for making data science easier and more effective, and R Markdown, Shiny, and the new RStudio Connect (Pro) for communicating results. In this talk we’ll share the way we think about data science and our toolchain. R Consortium News and RStudio Community Support: There is a lot happening at R Consortium! We now have fifteen member organizations, including the Gordon and Betty Moore Foundation which joined the Consortium this year as a Platinum member. The R Consortium’s Infrastructure Steering Committee (ISC) is currently supporting more than twenty projects working on various aspects of improving R Community infrastructure. In this talk, Joseph will provide an overview of R Consortium activities, and make suggestions on how R users can get involved. He will also talk about what RStudio is doing to support the R Consortium, and how RStudio directly supports the R Community. ..and the speakers.. Nathan Stephens: Nathan Stephens is director of solutions engineering at RStudio. His background is in applied analytics and consulting. He has experience building data science teams, creating innovative data products, analyzing big data, and architecting analytic platforms. He is a long time user of R and has helped many organizations adopt R as an analytic standard. Joseph Rickert: Joseph Rickert is RStudio’s “Ambassador at Large” for all things R. He works with the rest of the RStudio team and the R Consortium to promote open source activities, the R language and the R Community. Joseph represents RStudio on the R Consortium board of directors. Joseph came to RStudio from Microsoft /Revolution Analytics where he was a data scientist, blog editor and special programs manager after having spent most of his career working for startups in various technical, marketing and sales positions. Joseph attended Franklin & Marshall college on a Classics scholarship and later earned an M.A. in Humanities and an M.S. in Statistics from the California State University. You can follow him on Twitter as @RStudioJoe. …and the sponsors.. RStudioPeople all over the world are turning to R, an open source statistical language, to make sense of data. Inspired by the innovations of R users in science, education, and industry, RStudio develops free and open tools for R and enterprise-ready professional products for teams to scale and share work. Quantide is a provider of consulting services and training courses about Data Science and Big Data. It’s specialized in R, the open source software for statistical computing. Headquartered near Milan (Italy), Quantide has been supporting for 10 years customers from several industries around the world. Quantide is the founder and a supporter of the Milano R community, since its start in 2012. …and the organizers.. MilanoR – MilanoR is the R user group of the city of Milano. MilanoR has a blog, where we post R tutorials, articles and events; a Facebook page (that includes our events streaming); a newsletter. MilanoR organizes R talks (the MilanoR meetings) and R-Labs, monthly collaborative R mini-hackathon. All the R-Labs materials are stored on Github, while our R-Labs conversations are on a Slack channel (if you are interested in joining one of these channels, please go here). In a nutshell, we work and talk about R in whatever way has come to our minds. Data Science Milano – Data Science Milano is an independent group of 600+ professionals with the goal of promoting and pioneering knowledge and innovation of the data-driven revolution in the Italian peninsula and beyond. The group encourages international collaboration, sharing and open source tools. Everyone who is involved in data science projects or would like to undertake this career is invited to join. Ok, I want to join the meeting Go to the event Eventbrite page, click the green “Register” button at the top, and welcome! I have another question Please contact us at admin@milanor.net The post RStudio meets MilanoR! June 29th, Milan appeared first on MilanoR. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Facilitating Exploratory Data Visualization: Application to TCGA Genomic Data Mon, 06/12/2017 - 16:30 (This article was first published on Easy Guides, and kindly contributed to R-bloggers) In genomic fields, it’s very common to explore the gene expression profile of one or a list of genes involved in a pathway of interest. Here, we present some helper functions in the ggpubr R package to facilitate exploratory data analysis (EDA) for life scientists. Exploratory Data visualization: Gene Expression Data Standard graphical techniques used in EDA, include: • Box plot • Violin plot • Stripchart • Dot plot • Histogram and density plots • ECDF plot • Q-Q plot All these plots can be created using the ggplot2 R package, which is highly flexible. However, to customize a ggplot, the syntax might appear opaque for a beginner and this raises the level of difficulty for researchers with no advanced R programming skills. If you’re not familiar with ggplot2 system, you can start by reading our Guide to Create Beautiful Graphics in R. Previously, we described how to Add P-values and Significance Levels to ggplots. In this article, we present the ggpubr package, a wrapper around ggplot2, which provides some easy-to-use functions for creating ‘ggplot2’- based publication ready plots. We’ll use the ggpubr functions to visualize gene expression profile from TCGA genomic data sets. Contents: Prerequisites ggpubr package Required R package: ggpubr (version >= 0.1.3). • Install from CRAN as follow: install.packages("ggpubr") • Or, install the latest developmental version from GitHub as follow: if(!require(devtools)) install.packages("devtools") devtools::install_github("kassambara/ggpubr") • Load ggpubr: library(ggpubr) TCGA data The Cancer Genome Atlas (TCGA) data is a publicly available data containing clinical and genomic data across 33 cancer types. These data include gene expression, CNV profiling, SNP genotyping, DNA methylation, miRNA profiling, exome sequencing, and other types of data. The RTCGA R package, by Marcin Marcin Kosinski et al., provides a convenient solution to access to clinical and genomic data available in TCGA. Each of the data packages is a separate package, and must be installed (once) individually. The following R code installs the core RTCGA package as well as the clinical and mRNA gene expression data packages. # Load the bioconductor installer. source("https://bioconductor.org/biocLite.R") # Install the main RTCGA package biocLite("RTCGA") # Install the clinical and mRNA gene expression data packages biocLite("RTCGA.clinical") biocLite("RTCGA.mRNA") To see the type of data available for each cancer type, use this: library(RTCGA) infoTCGA() # A tibble: 38 x 13 Cohort BCR Clinical CN LowP Methylation mRNA mRNASeq miR miRSeq RPPA MAF rawMAF * 1 ACC 92 92 90 0 80 0 79 0 80 46 90 0 2 BLCA 412 412 410 112 412 0 408 0 409 344 130 395 3 BRCA 1098 1097 1089 19 1097 526 1093 0 1078 887 977 0 4 CESC 307 307 295 50 307 0 304 0 307 173 194 0 5 CHOL 51 45 36 0 36 0 36 0 36 30 35 0 6 COAD 460 458 451 69 457 153 457 0 406 360 154 367 7 COADREAD 631 629 616 104 622 222 623 0 549 491 223 489 8 DLBC 58 48 48 0 48 0 48 0 47 33 48 0 9 ESCA 185 185 184 51 185 0 184 0 184 126 185 0 10 FPPP 38 38 0 0 0 0 0 0 23 0 0 0 # ... with 28 more rows More information about the disease names can be found at: http://gdac.broadinstitute.org/ Gene expression data The R function expressionsTCGA() [in RTCGA package] can be used to easily extract the expression values of genes of interest in one or multiple cancer types. In the following R code, we start by extracting the mRNA expression for five genes of interest – GATA3, PTEN, XBP1, ESR1 and MUC1 – from 3 different data sets: • Breast invasive carcinoma (BRCA), • Ovarian serous cystadenocarcinoma (OV) and • Lung squamous cell carcinoma (LUSC) library(RTCGA) library(RTCGA.mRNA) expr <- expressionsTCGA(BRCA.mRNA, OV.mRNA, LUSC.mRNA, extract.cols = c("GATA3", "PTEN", "XBP1","ESR1", "MUC1")) expr # A tibble: 1,305 x 7 bcr_patient_barcode dataset GATA3 PTEN XBP1 ESR1 MUC1 1 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.mRNA 2.870500 1.3613571 2.983333 3.0842500 1.652125 2 TCGA-A1-A0SE-01A-11R-A084-07 BRCA.mRNA 2.166250 0.4283571 2.550833 2.3860000 3.080250 3 TCGA-A1-A0SH-01A-11R-A084-07 BRCA.mRNA 1.323500 1.3056429 3.020417 0.7912500 2.985250 4 TCGA-A1-A0SJ-01A-11R-A084-07 BRCA.mRNA 1.841625 0.8096429 3.131333 2.4954167 -1.918500 5 TCGA-A1-A0SK-01A-12R-A084-07 BRCA.mRNA -6.025250 0.2508571 -1.451750 -4.8606667 -1.171500 6 TCGA-A1-A0SM-01A-11R-A084-07 BRCA.mRNA 1.804500 1.3107857 4.041083 2.7970000 3.529750 7 TCGA-A1-A0SO-01A-22R-A084-07 BRCA.mRNA -4.879250 -0.2369286 -0.724750 -4.4860833 -1.455750 8 TCGA-A1-A0SP-01A-11R-A084-07 BRCA.mRNA -3.143250 -1.2432143 -1.193083 -1.6274167 -0.986750 9 TCGA-A2-A04N-01A-11R-A115-07 BRCA.mRNA 2.034000 1.2074286 2.278833 4.1155833 0.668000 10 TCGA-A2-A04P-01A-31R-A034-07 BRCA.mRNA -0.293125 0.2883571 -1.605083 0.4731667 0.011500 # ... with 1,295 more rows To display the number of sample in each data set, type this: nb_samples <- table(expr$dataset) nb_samples BRCA.mRNA LUSC.mRNA OV.mRNA 590 154 561

We can simplify data set names by removing the “mRNA” tag. This can be done using the R base function gsub().

expr$dataset <- gsub(pattern = ".mRNA", replacement = "", expr$dataset)

Let’s simplify also the patients’ barcode column. The following R code will change the barcodes into BRCA1, BRCA2, …, OV1, OV2, …., etc

expr$bcr_patient_barcode <- paste0(expr$dataset, c(1:590, 1:561, 1:154)) expr # A tibble: 1,305 x 7 bcr_patient_barcode dataset GATA3 PTEN XBP1 ESR1 MUC1 1 BRCA1 BRCA 2.870500 1.3613571 2.983333 3.0842500 1.652125 2 BRCA2 BRCA 2.166250 0.4283571 2.550833 2.3860000 3.080250 3 BRCA3 BRCA 1.323500 1.3056429 3.020417 0.7912500 2.985250 4 BRCA4 BRCA 1.841625 0.8096429 3.131333 2.4954167 -1.918500 5 BRCA5 BRCA -6.025250 0.2508571 -1.451750 -4.8606667 -1.171500 6 BRCA6 BRCA 1.804500 1.3107857 4.041083 2.7970000 3.529750 7 BRCA7 BRCA -4.879250 -0.2369286 -0.724750 -4.4860833 -1.455750 8 BRCA8 BRCA -3.143250 -1.2432143 -1.193083 -1.6274167 -0.986750 9 BRCA9 BRCA 2.034000 1.2074286 2.278833 4.1155833 0.668000 10 BRCA10 BRCA -0.293125 0.2883571 -1.605083 0.4731667 0.011500 # ... with 1,295 more rows

The above (expr) dataset has been saved at https://raw.githubusercontent.com/kassambara/data/master/expr_tcga.txt. This data is required to practice the R code provided in this tutotial.

If you experience some issues in installing the RTCGA packages, You can simply load the data as follow:

expr <- read.delim("https://raw.githubusercontent.com/kassambara/data/master/expr_tcga.txt", stringsAsFactors = FALSE) Box plots

Create a box plot of a gene expression profile, colored by groups (here data set/cancer type):

library(ggpubr) # GATA3 ggboxplot(expr, x = "dataset", y = "GATA3", title = "GATA3", ylab = "Expression", color = "dataset", palette = "jco") # PTEN ggboxplot(expr, x = "dataset", y = "PTEN", title = "PTEN", ylab = "Expression", color = "dataset", palette = "jco")

Exploratory Data visualization: Gene Expression Data

Note that, the argument palette is used to change color palettes. Allowed values include:

• “grey” for grey color palettes;
• brewer palettes e.g. “RdBu”, “Blues”, …;. To view all, type this in R: RColorBrewer::display.brewer.all() or click here to see all brewer palettes;
• or custom color palettes e.g. c(“blue”, “red”) or c(“#00AFBB”, “#E7B800”);
• and scientific journal palettes from the ggsci R package, e.g.: “npg”, “aaas”, “lancet”, “jco”, “ucscgb”, “uchicago”, “simpsons” and “rickandmorty”.

Instead of repeating the same R code for each gene, you can create a list of plots at once, as follow:

# Create a list of plots p <- ggboxplot(expr, x = "dataset", y = c("GATA3", "PTEN", "XBP1"), title = c("GATA3", "PTEN", "XBP1"), ylab = "Expression", color = "dataset", palette = "jco") # View GATA3 p$GATA3 # View PTEN p$PTEN # View XBP1 pXBP1 Note that, when the argument y contains multiple variables (here multiple gene names), then the arguments title, xlab and ylab can be also a character vector of same length as y. To add p-values and significance levels to the boxplots, read our previous article: Add P-values and Significance Levels to ggplots. Briefly, you can do this: my_comparisons <- list(c("BRCA", "OV"), c("OV", "LUSC")) ggboxplot(expr, x = "dataset", y = "GATA3", title = "GATA3", ylab = "Expression", color = "dataset", palette = "jco")+ stat_compare_means(comparisons = my_comparisons) Exploratory Data visualization: Gene Expression Data For each of the genes, you can compare the different groups as follow: compare_means(c(GATA3, PTEN, XBP1) ~ dataset, data = expr) # A tibble: 9 x 8 .y. group1 group2 p p.adj p.format p.signif method 1 GATA3 BRCA OV 1.111768e-177 3.335304e-177 < 2e-16 **** Wilcoxon 2 GATA3 BRCA LUSC 6.684016e-73 1.336803e-72 < 2e-16 **** Wilcoxon 3 GATA3 OV LUSC 2.965702e-08 2.965702e-08 3.0e-08 **** Wilcoxon 4 PTEN BRCA OV 6.791940e-05 6.791940e-05 6.8e-05 **** Wilcoxon 5 PTEN BRCA LUSC 1.042830e-16 3.128489e-16 < 2e-16 **** Wilcoxon 6 PTEN OV LUSC 1.280576e-07 2.561153e-07 1.3e-07 **** Wilcoxon 7 XBP1 BRCA OV 2.551228e-123 7.653685e-123 < 2e-16 **** Wilcoxon 8 XBP1 BRCA LUSC 1.950162e-42 3.900324e-42 < 2e-16 **** Wilcoxon 9 XBP1 OV LUSC 4.239570e-11 4.239570e-11 4.2e-11 **** Wilcoxon If you want to select items (here cancer types) to display or to remove a particular item from the plot, use the argument select or remove, as follow: # Select BRCA and OV cancer types ggboxplot(expr, x = "dataset", y = "GATA3", title = "GATA3", ylab = "Expression", color = "dataset", palette = "jco", select = c("BRCA", "OV")) # or remove BRCA ggboxplot(expr, x = "dataset", y = "GATA3", title = "GATA3", ylab = "Expression", color = "dataset", palette = "jco", remove = "BRCA") Exploratory Data visualization: Gene Expression Data To change the order of the data sets on x axis, use the argument order. For example order = c(“LUSC”, “OV”, “BRCA”): # Order data sets ggboxplot(expr, x = "dataset", y = "GATA3", title = "GATA3", ylab = "Expression", color = "dataset", palette = "jco", order = c("LUSC", "OV", "BRCA")) Exploratory Data visualization: Gene Expression Data To create horizontal plots, use the argument rotate = TRUE: ggboxplot(expr, x = "dataset", y = "GATA3", title = "GATA3", ylab = "Expression", color = "dataset", palette = "jco", rotate = TRUE) Exploratory Data visualization: Gene Expression Data To combine the three gene expression plots into a multi-panel plot, use the argument combine = TRUE: ggboxplot(expr, x = "dataset", y = c("GATA3", "PTEN", "XBP1"), combine = TRUE, ylab = "Expression", color = "dataset", palette = "jco") Exploratory Data visualization: Gene Expression Data You can also merge the 3 plots using the argument merge = TRUE or merge = “asis”: ggboxplot(expr, x = "dataset", y = c("GATA3", "PTEN", "XBP1"), merge = TRUE, ylab = "Expression", palette = "jco") Exploratory Data visualization: Gene Expression Data In the plot above, It’s easy to visually compare the expression level of the different genes in each cancer type. But you might want to put genes (y variables) on x axis, in order to compare the expression level in the different cell subpopulations. In this situation, the y variables (i.e.: genes) become x tick labels and the x variable (i.e.: dataset) becomes the grouping variable. To do this, use the argument merge = “flip”. ggboxplot(expr, x = "dataset", y = c("GATA3", "PTEN", "XBP1"), merge = "flip", ylab = "Expression", palette = "jco") Exploratory Data visualization: Gene Expression Data You might want to add jittered points on the boxplot. Each point correspond to individual observations. To add jittered points, use the argument add = “jitter” as follow. To customize the added elements, specify the argument add.params. ggboxplot(expr, x = "dataset", y = c("GATA3", "PTEN", "XBP1"), combine = TRUE, color = "dataset", palette = "jco", ylab = "Expression", add = "jitter", # Add jittered points add.params = list(size = 0.1, jitter = 0.2) # Point size and the amount of jittering ) Exploratory Data visualization: Gene Expression Data Note that, when using ggboxplot() sensible values for the argument add are one of c(“jitter”, “dotplot”). If you decide to use add = “dotplot”, you can adjust dotsize and binwidth wen you have a strong dense dotplot. Read more about binwidth. You can add and adjust a dotplot as follow: ggboxplot(expr, x = "dataset", y = c("GATA3", "PTEN", "XBP1"), combine = TRUE, color = "dataset", palette = "jco", ylab = "Expression", add = "dotplot", # Add dotplot add.params = list(binwidth = 0.1, dotsize = 0.3) ) Exploratory Data visualization: Gene Expression Data You might want to label the boxplot by showing the names of samples with the top n highest or lowest values. In this case, you can use the following arguments: • label: the name of the column containing point labels. • label.select: can be of two formats: • a character vector specifying some labels to show. • a list containing one or the combination of the following components: • top.up and top.down: to display the labels of the top up/down points. For example, label.select = list(top.up = 10, top.down = 4). • criteria: to filter, for example, by x and y variables values, use this: label.select = list(criteria = “y > 3.9 & y < 5 & x %in% c(‘BRCA’, ‘OV’)”). For example: ggboxplot(expr, x = "dataset", y = c("GATA3", "PTEN", "XBP1"), combine = TRUE, color = "dataset", palette = "jco", ylab = "Expression", add = "jitter", # Add jittered points add.params = list(size = 0.1, jitter = 0.2), # Point size and the amount of jittering label = "bcr_patient_barcode", # column containing point labels label.select = list(top.up = 2, top.down = 2),# Select some labels to display font.label = list(size = 9, face = "italic"), # label font repel = TRUE # Avoid label text overplotting ) Exploratory Data visualization: Gene Expression Data A complex criteria for labeling can be specified as follow: label.select.criteria <- list(criteria = "y > 3.9 & x %in% c('BRCA', 'OV')") ggboxplot(expr, x = "dataset", y = c("GATA3", "PTEN", "XBP1"), combine = TRUE, color = "dataset", palette = "jco", ylab = "Expression", label = "bcr_patient_barcode", # column containing point labels label.select = label.select.criteria, # Select some labels to display font.label = list(size = 9, face = "italic"), # label font repel = TRUE # Avoid label text overplotting ) Other types of plots, with the same arguments as the function ggboxplot(), are available, such as stripchart and violin plots. Violin plots The following R code draws violin plots with box plots inside: ggviolin(expr, x = "dataset", y = c("GATA3", "PTEN", "XBP1"), combine = TRUE, color = "dataset", palette = "jco", ylab = "Expression", add = "boxplot") Exploratory Data visualization: Gene Expression Data Instead of adding a box plot inside the violin plot, you can add the median + interquantile range as follow: ggviolin(expr, x = "dataset", y = c("GATA3", "PTEN", "XBP1"), combine = TRUE, color = "dataset", palette = "jco", ylab = "Expression", add = "median_iqr") Exploratory Data visualization: Gene Expression Data When using the function ggviolin(), sensible values for the argument add include: “mean”, “mean_se”, “mean_sd”, “mean_ci”, “mean_range”, “median”, “median_iqr”, “median_mad”, “median_range”. You can also add “jitter” points and “dotplot” inside the violin plot as described previously in the box plot section. Stripcharts and dot plots To draw a stripchart, type this: ggstripchart(expr, x = "dataset", y = c("GATA3", "PTEN", "XBP1"), combine = TRUE, color = "dataset", palette = "jco", size = 0.1, jitter = 0.2, ylab = "Expression", add = "median_iqr", add.params = list(color = "gray")) Exploratory Data visualization: Gene Expression Data For a dot plot, use this: ggdotplot(expr, x = "dataset", y = c("GATA3", "PTEN", "XBP1"), combine = TRUE, color = "dataset", palette = "jco", fill = "white", binwidth = 0.1, ylab = "Expression", add = "median_iqr", add.params = list(size = 0.9)) Exploratory Data visualization: Gene Expression Data Density plots To visualize the distribution as a density plot, use the function ggdensity() as follow: # Basic density plot ggdensity(expr, x = c("GATA3", "PTEN", "XBP1"), y = "..density..", combine = TRUE, # Combine the 3 plots xlab = "Expression", add = "median", # Add median line. rug = TRUE # Add marginal rug ) Exploratory Data visualization: Gene Expression Data # Change color and fill by dataset ggdensity(expr, x = c("GATA3", "PTEN", "XBP1"), y = "..density..", combine = TRUE, # Combine the 3 plots xlab = "Expression", add = "median", # Add median line. rug = TRUE, # Add marginal rug color = "dataset", fill = "dataset", palette = "jco" ) Exploratory Data visualization: Gene Expression Data # Merge the 3 plots # and use y = "..count.." instead of "..density.." ggdensity(expr, x = c("GATA3", "PTEN", "XBP1"), y = "..count..", merge = TRUE, # Merge the 3 plots xlab = "Expression", add = "median", # Add median line. rug = TRUE , # Add marginal rug palette = "jco" # Change color palette ) Exploratory Data visualization: Gene Expression Data # color and fill by x variables ggdensity(expr, x = c("GATA3", "PTEN", "XBP1"), y = "..count..", color = ".x.", fill = ".x.", # color and fill by x variables merge = TRUE, # Merge the 3 plots xlab = "Expression", add = "median", # Add median line. rug = TRUE , # Add marginal rug palette = "jco" # Change color palette ) Exploratory Data visualization: Gene Expression Data # Facet by "dataset" ggdensity(expr, x = c("GATA3", "PTEN", "XBP1"), y = "..count..", color = ".x.", fill = ".x.", facet.by = "dataset", # Split by "dataset" into multi-panel merge = TRUE, # Merge the 3 plots xlab = "Expression", add = "median", # Add median line. rug = TRUE , # Add marginal rug palette = "jco" # Change color palette ) Exploratory Data visualization: Gene Expression Data Histogram plots To visualize the distribution as a histogram plot, use the function gghistogram() as follow: # Basic histogram plot gghistogram(expr, x = c("GATA3", "PTEN", "XBP1"), y = "..density..", combine = TRUE, # Combine the 3 plots xlab = "Expression", add = "median", # Add median line. rug = TRUE # Add marginal rug ) Exploratory Data visualization: Gene Expression Data # Change color and fill by dataset gghistogram(expr, x = c("GATA3", "PTEN", "XBP1"), y = "..density..", combine = TRUE, # Combine the 3 plots xlab = "Expression", add = "median", # Add median line. rug = TRUE, # Add marginal rug color = "dataset", fill = "dataset", palette = "jco" ) Exploratory Data visualization: Gene Expression Data # Merge the 3 plots # and use y = "..count.." instead of "..density.." gghistogram(expr, x = c("GATA3", "PTEN", "XBP1"), y = "..count..", merge = TRUE, # Merge the 3 plots xlab = "Expression", add = "median", # Add median line. rug = TRUE , # Add marginal rug palette = "jco" # Change color palette ) Exploratory Data visualization: Gene Expression Data # color and fill by x variables gghistogram(expr, x = c("GATA3", "PTEN", "XBP1"), y = "..count..", color = ".x.", fill = ".x.", # color and fill by x variables merge = TRUE, # Merge the 3 plots xlab = "Expression", add = "median", # Add median line. rug = TRUE , # Add marginal rug palette = "jco" # Change color palette ) Exploratory Data visualization: Gene Expression Data # Facet by "dataset" gghistogram(expr, x = c("GATA3", "PTEN", "XBP1"), y = "..count..", color = ".x.", fill = ".x.", facet.by = "dataset", # Split by "dataset" into multi-panel merge = TRUE, # Merge the 3 plots xlab = "Expression", add = "median", # Add median line. rug = TRUE , # Add marginal rug palette = "jco" # Change color palette ) Exploratory Data visualization: Gene Expression Data Empirical cumulative density function # Basic ECDF plot ggecdf(expr, x = c("GATA3", "PTEN", "XBP1"), combine = TRUE, xlab = "Expression", ylab = "F(expression)" ) Exploratory Data visualization: Gene Expression Data # Change color by dataset ggecdf(expr, x = c("GATA3", "PTEN", "XBP1"), combine = TRUE, xlab = "Expression", ylab = "F(expression)", color = "dataset", palette = "jco" ) Exploratory Data visualization: Gene Expression Data # Merge the 3 plots and color by x variables ggecdf(expr, x = c("GATA3", "PTEN", "XBP1"), merge = TRUE, xlab = "Expression", ylab = "F(expression)", color = ".x.", palette = "jco" ) Exploratory Data visualization: Gene Expression Data # Merge the 3 plots and color by x variables # facet by "dataset" into multi-panel ggecdf(expr, x = c("GATA3", "PTEN", "XBP1"), merge = TRUE, xlab = "Expression", ylab = "F(expression)", color = ".x.", palette = "jco", facet.by = "dataset" ) Exploratory Data visualization: Gene Expression Data Quantile - Quantile plot # Basic ECDF plot ggqqplot(expr, x = c("GATA3", "PTEN", "XBP1"), combine = TRUE, size = 0.5 ) Exploratory Data visualization: Gene Expression Data # Change color by dataset ggqqplot(expr, x = c("GATA3", "PTEN", "XBP1"), combine = TRUE, color = "dataset", palette = "jco", size = 0.5 ) Exploratory Data visualization: Gene Expression Data # Merge the 3 plots and color by x variables ggqqplot(expr, x = c("GATA3", "PTEN", "XBP1"), merge = TRUE, color = ".x.", palette = "jco" ) Exploratory Data visualization: Gene Expression Data # Merge the 3 plots and color by x variables # facet by "dataset" into multi-panel ggqqplot(expr, x = c("GATA3", "PTEN", "XBP1"), merge = TRUE, size = 0.5, color = ".x.", palette = "jco", facet.by = "dataset" ) Exploratory Data visualization: Gene Expression Data Infos This analysis has been performed using R software (ver. 3.3.2) and ggpubr (ver. 0.1.3). jQuery(document).ready(function () { jQuery('#rdoc h1').addClass('wiki_paragraph1'); jQuery('#rdoc h2').addClass('wiki_paragraph2'); jQuery('#rdoc h3').addClass('wiki_paragraph3'); jQuery('#rdoc h4').addClass('wiki_paragraph4'); });//add phpboost class to header .content{padding:0px;} var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Easy Guides. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### A Data Scientist’s Guide to Predicting Housing Prices in Russia Mon, 06/12/2017 - 15:17 (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers) Sberbank Russian Housing Market A Kaggle Competition on Predicting Realty Price in Russia Written by Haseeb Durrani, Chen Trilnik, and Jack Yip Introduction In May 2017, Sberbank, Russia’s oldest and largest bank, challenged data scientists on Kaggle to come up with the best machine learning models to estimate housing prices for its customers, which includes consumers and developers looking to buy, invest in, or rent properties. This blog post outlines the end-to-end process of how we went about tackling this challenge. About the Data The datasets provided consist of a training set, a test set, and a file containing historical macroeconomic metrics. • The training set contains ~21K realty transactions spanning from August 2011 to June 2015 along with information specific to the property. This set also includes the price at which the property was sold for. • The test set contains ~7K realty transactions spanning from July 2015 to May 2016 along with information specific to the property. This set does not include the price at which the property was transacted. • The macroeconomic data spans from January 2010 to October 2016. • There are a combined total of ~400 features or predictors. The train and test datasets spans from 2011 to 2016. Project Workflow As with any team-based project, defining the project workflow was vital to our delivering on time as we were given only a two-week timeframe to complete the project. Painting the bigger picture allowed us to have a shared understanding of the moving parts and helped us to budget time appropriately. Our project workflow can be broken down into three main parts: 1) Data Assessment, 2) Model Preparation, and 3) Model Fitting. This blog post will follow the general flow depicted in the illustration below. The project workflow was vital to keeping us on track under a two-week constraint. Approach to the Problem Given a tight deadline, we had to limit the project scope to what we can realistically complete. We agreed that our primary objectives are to learn as much as we can and to apply what we have learned in class. To do so, we decided to do the following: • Focus on applying the Multiple Linear Regression and Gradient Boosting model, but to also spend some time with XGBoost for learning purposes (XGBoost is a relatively new machine learning method, and it is oftentimes the model of choice to win Kaggle competitions). • Spend more time working and learning together as a team instead of working in silos. This meant that each of us has an opportunity to work in each of the domains in the project pipeline. While this is not reflective of the fast-paced environment of the real world, we believe that it can help us better learn the data science process. Understanding the Russian Economy (2011 to 2016) In order to better predict Russian housing prices, we first need to understand how the economic forces impact the Russian economy, as it directly affects the supply and demand in the housing market. To understand how we can best apply macroeconomic features into our models, we researched for Russia’s biggest economic drivers and also any events that may have impacted Russia’s economy during the time period of the dataset (i.e. 2011 to 2016). Our findings show that: These two factors initiated a “snowball.” negatively impacting the following base economic indicators: • The Russian ruble • The Russian Trading System Index (RTS Index) • The Russian trade balance Labels for the chart: Blue – RTS Index, Green – Global Oil Price, Purple – USD/Ruble Exchange Rate, Red – Russia’s Trade Balance The poorly performing Russian economy manifested a series of changes in Russia’s macroeconomic indicators, including but not limited to: • Significant increase in Russia’s Consumer Price Index (CPI) • Worsened salary growth and unemployment rate • Decline in Gross Domestic Product (GDP) The change in these indicators, led us to the understanding that Russia is facing a financial crisis. The first chart measures CPI, the second chart measures GDP growth, and the third chart measures unemployment rate (blue) and salary growth (red) As we continued to explore the macroeconomic dataset, we were able to further confirm Russia’s financial crisis. We observed that the Russian Central Bank had increased the banking deposit rate by over 100% in 2014 in an attempt to bring stability to the Russian economy and the Russian ruble. In the chart above, the red data points represent banking deposit rates, and the blue data points represent mortgage rates. Next, we looked for the effect of the Russian Central Bank’s move on the Russian housing market. When raising banking deposit rates, the Russian Central Bank encouraged consumers to keep their savings in the banks and minimize the public’s’ exposure to financial risk. That was seen clearly in the Russian housing market following the raise in deposit rates. We noticed that rent prices increased, and the growth of commercial investment in real estate dropped in the beginning of 2015. The first chart above represents the index of rent prices, and the second chart shows the level of investment in fixed assets. This analysis led us to the understanding that the Russian economy was facing a major financial crisis, which significantly impacted the demand in the Russian housing market. This highlights the importance of the usage of the macroeconomic features in our machine learning model to predict the Russian housing market. The illustration on the left shows the count of transactions by product type. The illustration on the right provides a more detailed view of the same information. Exploratory Data Analysis The first step to any data science project is simple exploration and visualization of the data. Since the ultimate purpose of this competition is price prediction, it’s a good idea to visualize price trends over the time span of the training data set. The visualization below shows monthly average realty prices over time. We can see that average prices have seen fluctuations between 2011 and 2015 with an overall increase over time. There is, however, a noticeable drop from June to December of 2012. It is important to keep in mind that these averaged prices include apartments of different sizes; therefore, a more “standardized” measure of price would be the price per square meter over the same time span. Below we see that the average price per square meter shows fluctuations as well, though the overall trend is quite different from the previous visualization. The data set starts off with a decline in average price per square meter, which sees somewhat of an improvement from late 2011 to mid 2012. Again we see a price drop from June to December of 2012. Some other attributes to take into consideration when evaluating the price of an apartment are the size of the apartment and the size of the building. Our intuition tells us that price should be directly related to the size of an apartment, and the the boxplot below shows that the median price does go up relative to apartment size, with the exception of “Large” apartments having a median price slightly lower than apartments of “Medium” size. Factors such as neighborhood might help in explaining this anomaly. Apartment price as a function of building size shows similar median prices for low-rise, medium, and high-rise buildings. Larger buildings labelled “Sky” with 40 floors or more show a slightly higher median apartment price. Another interesting variable worth taking a look at is the type of transaction or “product type” mentioned in the data set. This feature of the data categorizes each transaction as being either an Investment or an Owner Occupied purchase. The barplot below gives a breakdown of the transactions for the top sub-areas (by frequency) based on the product type. This visualization clearly shows that certain areas are heavily owner occupied while other areas are very attractive to investors. If investment leads to increasing prices, this trend will be important for making future predictions. If we shift our focus to transactions related to the building’s build year based on product type, we see that older buildings generally are involved in investment transactions, possibly due to better deals, while newer constructions are predominantly owner occupied. Simple exploration and visualization of the data outlines important considerations while training our machine learning model. Our model must consider the factors behind the price dips in 2012 and learn to anticipate similar drops in future predictions. The model must also understand that low prices attract investors, as can be seen with older buildings. However, if investment leads to an increase in price for a particular building, the model must be able to adjust future predictions for that building accordingly. Similarly. the model must be able to predict future prices for a given sub-area that has seen an increase in average price over time as a result of investment. Feature Selection To simplify the feature selection process, we fitted an XGBoost model containing all of the housing features (~290 features) and called its feature importance function to determine the most important predictors of realty price. XGBoost was very efficient in doing this; it took less than 10 minutes to fit the model. In the illustration below, XGBoost sorts each of the housing features by its Fscore, which is a measure of variable importance in its predictive power of the transaction price. Among the top 20 features identified with XGBoost, a Random Forest was used to further rank their importance. The %IncMSE metric was preferred over IncNodePurity as our objective was to identify features that can minimize the mean squared errors of our model (i.e. better predictability). In order to determine which macroeconomic features were important, we joined the transaction prices in the training set with the macroeconomics by the timestamp and then repeated the process above. The top 45 ranking features are outlined below. In this project, we have used experimented with different subsets of these features to fit our models. Feature Engineering Often times it is crucial to generate new variables from existing ones to improve prediction accuracy. Feature engineering can serve the purpose of extrapolating information by splitting variables for model flexibility or dimension reduction by combining variables for model simplicity. The graphic below shows that by averaging our chosen distance-related features into an average-distance feature, we yield a similar price trend as the original individual features. For simpler models sub areas were limited to the top 50 most frequent with the rest classified as “other.” The timestamp variable was used to extract date and seasonal information into new variables, such as day, month, year, and season. The apartment size feature mentioned earlier during our exploratory data analysis was actually an engineered featured using the living square meter area of the apartment to determine apartment size. This feature served very valuable during the imputation of missing values for number of rooms in an apartment. Similarly building size was generated using the maximum number of floors in the given building. Data Cleaning Among 45 features selected, outliers and missing values were corrected as both the multiple linear regression and gradient boosting models do not accept missing values. Below is a list of the outlier corrections and missing value imputations. Outlier Correction • Sq-related features: Imputed by mean within sub area • Distance-related features: Imputed by mean within sub area Missing Value Imputation • Build Year: Imputed by most common build year in sub area • Num Room: Imputed by average number of rooms in similarly sized apartments within the sub area • State: Imputed using the build year and sub area Model Selection Prior to fitting models, it is imperative to understand their strengths and weaknesses. Outlined below are some of the pros and cons we have identified, as well as the associated libraries used to implement them. Multiple Linear Regression (R: lm, Python: statsmodel) • Pros: High Interpretability, simple to implement • Cons: Does not handle multicollinearity well (especially if we would like to include a high number of features from this dataset) Gradient Boosting Tree (R: gbm & caret) • Pros: High predictive accuracy • Cons: Slow to train, high number of parameters XGBoost (Python: xgboost) • Pros: Even higher predictive accuracy, accepts missing values • Cons: Relatively fast to train compared to GBM, but higher number of hyperparameters Multiple Linear Regression Despite knowing that a multiple linear regression model will not work well given the dataset’s complexity and the presence of multicollinearity, we were interested to see its predictive strength. We began by validating the presence of multicollinearity. In the illustration below, green and red boxes indicate positive and negative correlations between two features. One of the assumptions of a multiple linear regression model is that its features should not be correlated. Violation of Multicollinearity We continue to examine other underlying assumptions of a multiple linear regression model. When examining scatterplots among predictors and the target value (i.e. price), it is apparent that many predictors do not share a linear relationship with the response variable. The following plots are one of many sets of plots examined during the process. Violation of Linearity The Residuals vs Fitted plot does not align with the assumption that error terms have the same variance no matter where they appear along the regression line (i.e. red line shows an increasing trend). Violation of Constant Variance The Normal Q-Q plot shows severe violation of normality (i.e. standardized residuals are highly deviant from the normal line). Violation of Normality The Scale-Location plot shows a violation of independent errors as it is observed that the standardized residuals are increasing along with increases in input variables Violation of Independent Errors Despite observing violations of the underlying assumptions of multiple linear regression models, we proceeded with submitting our predictions to see how well we can score. Our first model A1 was trained using the top 12 property-specific features identified in the feature selection process. Submitting predictions using this model gave us a Kaggle score of 0.39544 on the public leaderboard (this score is calculated based on the Root Mean Squared Logarithmic Error). Essentially, the lower the score, the more accurate the model’s predictions are. Nevertheless, it is possible to “overfit the public leaderboard”, as the score on the public leaderboard is only representative of the predictions of approximately 35% of the test set. Models A2 and A3 were fitted by removing features from model A1 that showed VIFs (Variance Inflation Factors) greater than 5. Generally, features with VIFs greater than 5 indicate that there is multicollinearity present among the features used in the model. Removing these did not help to improve our Kaggle scores, as it is likely that we have removed features that are also important to the prediction of price. Models B1, B2, and B3 were trained with the features in A1 with the addition of identified important macro features. These models did not improve our Kaggle scores, likely for reasons mentioned previously. Gradient Boosting Model The second machine learning method we used was the Gradient Boosting Model (GBM), which is a tree-based machine learning method that is robust in handling multicollinearity. This is important as we believe that many features provided in the dataset are significant predictors of price. GBMs are trained using the bootstrapping method; each tree is generated using information from previous trees. To avoid overfitting the model to the training dataset, we used cross-validation with Caret package in R. Knowing that this model can handle complexity and multicollinearity well, we added more features for the first model we fitted. In Model A1, we chose 47 features from the original training dataset, which gave us a Kaggle score of 0.38056. Next, we wanted to check if reducing the complexity of the model gives better results. We decided to run another model (A2) with only 20 features. This model performed better than Model A1. In Model A3, we added to the previous model, 5 macro features that were chosen using XGBoost feature importance. As expected, the macro features improved our results. The model gave us the best Kaggle score of all the gradient boosting models we trained. For each of the GBM models, we used cross-validation to tune the parameters (with R’s Caret package). GBM models on average provided us better prediction results compared with those using the multiple linear regression models. XGBoost An extreme form of gradient boosting, XGBoost, is the preferred tool for Kaggle competitions due to it’s accuracy of predictions and speed. The quick speed in training a model is a result of allowing residuals to “roll-over” into the next tree. Careful selection of hyperparameters such as learning rate and max-depth, gave us our best Kaggle prediction score. Fine tuning of the hyperparameters can be achieved through extensive cross-validation; however, caution is recommended when fitting too many parameters as it can be computationally expensive and time consuming. Below is an example of a simple grid search cross validation. Prediction Scores The table below outlines the best public scores we have received using each of the different models, along with the features used. Future Directions This two-week exercise provided us the opportunity to experience first-hand what it is like to be working in a team of data scientists. If given the luxury of additional time, we would dedicate more to engineering new features to improve our model predictions. This would require us to develop deeper industry knowledge of Russia’s housing economy. Acknowledgement We thank the entire team of instructors at NYCDSA for their guidance during this two-week project and also our fellow Kagglers who published kernels with their useful insights. The post A Data Scientist’s Guide to Predicting Housing Prices in Russia appeared first on NYC Data Science Academy Blog. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Single Word Analysis of Early 19th Century Poetry Using tidytext Mon, 06/12/2017 - 13:14 (This article was first published on Pareto's Playground, and kindly contributed to R-bloggers) When reading poetry, it feels as if you are reading emotion. A good poet can use words to bring to their reader whichever emotions they choose. To do this, these poets often have to look deep within themselves, using their own lives as fuel. For this reason you would expect that poets and as a result, their poetry, would be influenced by the events happening around them. If certain events caused the poet’s view of the world to be an unhappy one, you’d expect their poetry to be unhappy, and of course the same for the opposite. In this analysis I attempt to show this relation between world events and the poets of those times. Gathering the required data – this being poems and their dates – proved to be a very difficult task. I had initially hoped to be able to show an analysis of the entire 20th century (this may still happen when I find more time and/or sources), but in the end I settled for the years between 1890 – 1925. I managed to collect an overall number of 97 poems (a few of which are excluded from the analysis due to being written in 1888 or 1889) from 6 of the most well known poets of the time. These poets being: • William Yeats • T. S. Eliot • Robert Frost • Wallace Stevens • E. E. Cummings • D. H. Lawrence I will only be going through the preparation code for Yeats. The rest is similar but not exactly the same, and can be found in my github repository (https://github.com/Laurence-Sonnenberg/single_word_poetry_analysis) if anyone would like to re-create anything I have done. The actual analysis is done using all poets/poems combined. Firstly I load the packages I’m going to need. I have included a small function that other people might find useful. load_pkg <- function(packagelist) { # Check if any package needs installation: PackagesNeedingInstall <- packagelist[!(packagelist %in% installed.packages()[,"Package"])] if(length(PackagesNeedingInstall)) { cat("\nInstalling necesarry packages: ", paste0(PackagesNeedingInstall,collapse = ", "), "\n") install.packages(PackagesNeedingInstall, dependencies = T, repos = "http://cran-mirror.cs.uu.nl/") } # load packages into R: x <- suppressWarnings(lapply(packagelist, require, character.only = TRUE, quietly = T)) } pkg_list <- c("tidyverse", "tidytext", "stringr", "rvest", "lubridate", "purrr") load_pkg(pkg_list) ## ## Installing necesarry packages: tidytext The website I use for scraping has a page for each poet which lists the poems available – then it has a seperate page for each of the poems in the list, along with their dates. I have chosen this website specifically because having each poem dated is a must for my analysis. I run an initial scrape using rvest to get the names which are then cleaned so they can be used in the urls to get the poems and dates. Some names are not exactly the same as their url versions, so they need to be manually changed. # page url poemNameUrl <- "http://www.poetry-archive.com/y/yeats_w_b.html" # scrape for poem names poemName <- poemNameUrl %>% read_html() %>% html_nodes("a font") %>% html_text() # clean names poemName <- poemName[1:50] %>% str_replace_all(pattern = "\r", replacement = "") %>% str_replace_all(pattern = "\n", replacement = " ") %>% str_replace_all(pattern = "[ ]{2}", replacement = "") %>% str_replace_all(pattern = "[[:punct:]]", replacement = "") %>% tolower() # hardcode 2 poem names to fit in with used url name poemName[9] <- "he mourns for the change" poemName[24] <- "the old men admiring themselves" # display head(poemName, 10) ## [1] "at galway races" "the blessed" ## [3] "the cap and bells" "the cat and the moon" ## [5] "down by the salley gardens" "dream of a blessed spirit" ## [7] "the falling of the leaves" "he bids his beloved be at peace" ## [9] "he mourns for the change" "he remembers forgotten beauty" Next I need to scrape the website for the poems and their dates. To do this I use a function that takes a poem name – and again using rvest – scrapes for the given poem and its date, then cleans the text and waits before returning. The wait is so that I don’t send too many requests to the website in quick succession. # function to scrape for poem and its date using poem names GetPoemAndDate <- function(poemName) { # split string at empty space and unlist the result nameVec <- unlist(str_split(poemName, pattern = " ")) # use result in url url <- str_c("http://www.poetry-archive.com/y/", paste(nameVec, collapse = "_"), ".html") # scrape for poem and clean poem <- url %>% read_html() %>% html_nodes("dl") %>% html_text() %>% str_replace_all(pattern = "\r", replacement = "") %>% str_replace_all(pattern = "\n", replacement = " ") %>% str_replace_all(pattern = "[ ]{2}", replacement = "") # scrape for date date <- url %>% read_html() %>% html_nodes("td font") %>% html_text() # clean dates date <- date[3] %>% str_extract(pattern = "[0-9]+") # pause before function return Sys.sleep(runif(1, 0, 1)) return(list(poem = poem, date = date)) } I then use the previous function in a for loop which loops through each poems name and scrapes for that particular poem and its date. With each loop I also add the data frame of poem and date to a list. Once the for loop has completed I rbind all the data frames together. A complete data frame of all poems and dates will be needed for the next step. # get poems and dates poemDataFrame <- list() count <- 1 for (name in poemName) { # get poem and date for given poem name poemNameDate <- GetPoemAndDate(name) # create data frame of name and date and add it to list poemDataFrame[[count]] <- data.frame(poem = poemNameDatepoem, date = poemNameDate$date, stringsAsFactors = FALSE) count <- count + 1 } # rbind all poems and dates poemDataFrame <- do.call(rbind, poemDataFrame) I then combine the names from the original scrape, the dates, and the poems into a single data frame so they will be ready for me to work with. I also take a look at all the text, checking for and correcting any errors that could negatively affect my analysis. In this case I have found an error in a date. # create data frame of names, poems and dates poemDataFrame <- cbind(poemName, poemDataFrame) # hardcode single date with error poemDataFrame$date[40] <- "1916"

After this, we have a dataset of dimension 50×3 with column headings poemName, poem and date.

The next function I have written takes a row index, and using the combined data frame of poems, names and dates, along with tidytext, tokenizes the poem found at that index into single words. It then uses anti_join from dplyr, along with the stop_words dataset from tidytext, to create a tibble – this tibble will now be missing a list of stop words which will not be helpful in the analysis. This list of words for each poem is then used to create a sentiment score. This is done using inner_join and the sentiment dataset filtered on the bing lexicon, again from tidytext. The function calculates the sentiment score value for the poem depending on the number of negative and positive words in the list – negative words get a value of -1 and positive get a value of 1, these values are then summed and the result returned as a percentage of the number of words in that particular poem.

# function to analyse poems and return scores using bing lexicon bingAnalysePoems <- function(i) { # get poem a index i poem <- data_frame(poem = poemDataFrame$poem[i]) # tokenize into words textTokenized <- poem %>% unnest_tokens(word, poem) # load stop words dataset data("stop_words") # anti join on stop words tidyPoem <- textTokenized %>% anti_join(stop_words) # filter on bing lexicon bing <- sentiments %>% filter(lexicon == "bing") %>% select(-score) # join on bing to get whether words are positive or negative poemSentiment <- tidyPoem %>% inner_join(bing) # get score for poem poemSentiment <- poemSentiment %>% mutate(score = ifelse(poemSentiment$sentiment == "positive", 1, -1)) # get score as a percentage of total words in poem finalScore <- (sum(poemSentiment$score)/length(textTokenized$word))*10 return(finalScore) }

Now, using the previous function and sapply I get the score for each poem and create a data frame with the poems score and its date. These scores will be used later to show the change in sentiment for each year.

# get scores scores <- sapply(1:length(poemDataFrame$poem), bingAnalysePoems) # create data frame with data and scores dateAndScore <- data.frame(scores) %>% mutate(date = year(ymd(str_c(poemDataFrame$date, "/01/01")))) # display head(dateAndScore, 10) ## scores date ## 1 -0.20618557 1910 ## 2 -0.03649635 1899 ## 3 0.16528926 1899 ## 4 -0.13157895 1919 ## 5 0.32258065 1921 ## 6 0.31250000 1921 ## 7 -0.42253521 1889 ## 8 -0.73394495 1899 ## 9 -0.23255814 1899 ## 10 -0.44025157 1899

The next function I have written takes a particular sentiment, which in this case can also be called an emotion, from a number of sentiments found in the NRC lexicon of the sentiment dataset; it then again creates a tibble of words not including the stop words, and using semi_join and the sentiment dataset filtered on the NRC lexicon and the particular sentiment given, returns the sum of the number of words relating to that sentiment. These sums will be used to show which emotions were high (or low) in a particular year.

# NRC analysis function nrcAnalysePoems <- function(sent) { # nrcPoem used from global environment textTokenized <- nrcPoem %>% unnest_tokens(word, poem) # load stop words dataset data("stop_words") tidyPoem <- textTokenized %>% anti_join(stop_words) # filter on NRC lexicon and sentiment nrcSentiment <- sentiments %>% filter(lexicon == "nrc", sentiment == sent) # join on sentiment and count words sentimentInPoem <- tidyPoem %>% semi_join(nrcSentiment) %>% count(word, sort = TRUE) # return the sum of the counted words return(sum(sentimentInPoem$n)) } Using the above function and sapply, I get the sums for each different sentiment in my sentiments list; this is done within a for loop, which loops through each poem, creating a data frame with each row consisting of a sentiment, the sum of words for that sentiment, the name of the poem the sentiment relates to, and the poems date. I then add this data frame to a list which will be used in my next step. # list of used setiments found in NRC lexicon sentimentsVec <- c("anger", "anticipation", "disgust", "fear", "joy","sadness", "surprise", "trust") # create empty list nrcAnalysisDataFrame <- list() # get a frequency percetage for each sentiment found in the NRC lexicon for (i in 1:length(poemDataFrame$poem)) { # poem at index i - to be used in nrcAnalysePoems function environment nrcPoem <- data_frame(poem = poemDataFrame$poem[i]) # create data frame for each poem of all sentiment sums nrcDataFrame <- as.data.frame(sapply(sentimentsVec, nrcAnalysePoems)) %>% rownames_to_column() %>% select(sentiment = rowname, value = 2) %>% mutate(name = poemDataFrame$poemName[i], date = poemDataFrame$date[i]) # add data frame to list nrcAnalysisDataFrame[[i]] <- nrcDataFrame } Once the for loop has completed I rbind all the data frames in the list so that I can work with all of them together as a whole. # rbind list of all NRC sum values for all poems nrcAnalysisDataFrame <- do.call(rbind, nrcAnalysisDataFrame) # display head(nrcAnalysisDataFrame, 10) ## sentiment value name date ## 1 anger 1 at galway races 1910 ## 2 anticipation 4 at galway races 1910 ## 3 disgust 2 at galway races 1910 ## 4 fear 4 at galway races 1910 ## 5 joy 2 at galway races 1910 ## 6 sadness 4 at galway races 1910 ## 7 surprise 2 at galway races 1910 ## 8 trust 2 at galway races 1910 ## 9 anger 0 the blessed 1899 ## 10 anticipation 9 the blessed 1899 Now all the preparation code is done and we are ready to begin analyzing. For ease of use I have created a function for each poet to run all the code shown previously; the functions return both the bing and the NRC data frames, ready to be used. These functions are run but will not be shown here. Firstly I will need to call each poet’s function and assign the returned data frames to variables so that I can work with them as needed. yeats <- Yeats() eliot <- Eliot() frost <- Frost() wallace <- Wallace() cummings <- Cummings() lawrence <- Lawrence() poets <- list(yeats, eliot, frost, wallace, cummings, lawrence) Having collected all the poems from our five poets, we plot the count of poems per year for each of our selected sample: From this we see that Yeats dominates the earlier period in our sample, while Frost plays a large part in the second half of the sample of text. Next I will join all the bing scores for all poems using rbind, I do this so that I can work with all the data at once. I then group by date and get a mean score for each year in the complete data frame. This returns a tibble of scores and dates which I can now plot. # rbind data frames and summarize completedataFramebing <- poets %>% map(~.x %>% .[[1]] %>% data.frame) %>% reduce(rbind) %>% filter(date >= 1890) %>% group_by(date) %>% summarize(score = mean(scores)) # display head(completedataFramebing, 10) ## # A tibble: 10 × 2 ## date score ## ## 1 1893 -0.23988213 ## 2 1899 -0.31855413 ## 3 1903 -0.03143456 ## 4 1910 -0.09117054 ## 5 1915 -0.20194924 ## 6 1916 -0.29473096 ## 7 1917 -0.57692308 ## 8 1918 -0.50458716 ## 9 1919 -0.16981618 ## 10 1920 -0.14955234 I use multiplot so that I can show two different plots side by side. # plot bar plot p1 <- ggplot(data = completedataFramebing) + geom_bar(aes(x = date, y = score, fill = score), stat = "identity", position = "identity") + ylim(-1, 1) + ylab("mean percentage score") + xlab("year") + theme(legend.key.size = unit(0.5, "cm"), legend.text = element_text(size = 7), legend.position = c(0.87, 0.77)) + ggtitle("Bar Plot of Percentage Scores") + theme(plot.title = element_text(size = 12)) # plot dot and line plot p2 <- ggplot(data = completedataFramebing) + geom_line(aes(x = date, y = score), colour = "blue") + geom_point(aes(x = date, y = score), size = 1.3, colour = "blue") + ylim(-1, 1) + ylab("mean percentage score") + xlab("year") + ggtitle("Dot and Line Plot of Percentage Scores") + theme(plot.title = element_text(size = 12)) # use multiplot function # can be found at http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/ multiplot(p1, p2, cols = 2) The 0 values in the bar plot are not all actual 0 values but rather years where no information is available. When looking at the plots we can see a steady decline between the period 1890 – 1900, then due to a lack of data very little can be seen during the period from 1900 – 1915, after which there is again a massive decline between 1915 and 1920 and again a small drop between 1920 and 1925. We can also look at the individuals poets, to determine who of them portrayed their sorrow the most through their poems. To do this we plot the average score per poet. From the plot we see that Lawrence has the most negative mean score and with poems such as giorno dei morti or day of the dead, it comes as no suprise! Finally, let’s do the NRC analysis. Firstly, I join together all the poet’s NRC data frames using rbind, again so I can work with all the data at once. I then group them by date and sentiment, sum the sentiment values for each poem and use this sum to get percentage values for each sentiment in each year. Once I have done this the data frame is ready to be plotted. # rbind data frames and summarize yeatsnrc <- poets %>% map(~.x %>% .[[2]]) %>% reduce(rbind) %>% filter(date >= 1890) %>% group_by(date, sentiment) %>% summarise(dateSum = sum(value)) %>% mutate(dateFreqPerc = dateSum/sum(dateSum)) head(yeatsnrc, 10) ## Source: local data frame [10 x 4] ## Groups: date [2] ## ## date sentiment dateSum dateFreqPerc ## ## 1 1893 anger 25 0.10775862 ## 2 1893 anticipation 38 0.16379310 ## 3 1893 disgust 11 0.04741379 ## 4 1893 fear 38 0.16379310 ## 5 1893 joy 42 0.18103448 ## 6 1893 sadness 34 0.14655172 ## 7 1893 surprise 8 0.03448276 ## 8 1893 trust 36 0.15517241 ## 9 1899 anger 17 0.06390977 ## 10 1899 anticipation 46 0.17293233 I use facet_wrap on the date so that I can display all the plots for all the years at once. # plot multiple NRC barplots using facet wrap ggplot(data = yeatsnrc, aes(x = sentiment, y = dateFreqPerc, fill = sentiment)) + geom_bar(stat = "identity") + scale_fill_brewer(palette = "Spectral") + guides(fill = FALSE) + facet_wrap(~date, ncol = 2) + theme(axis.text.x = element_text( angle = 45, hjust = 1, size = 8)) + theme(axis.title.y = element_text(margin = margin(0, 10, 0, 0))) + ylab("frequency percentage") + ggtitle("Bar Plots for Sentiment Percentages Per Year") + theme(plot.title = element_text(size = 14)) Looking at these plots we can see how sentiment changes through the years of available data. During the early 1890s we see equal percentages of anticipation, fear, joy, sadness and trust, while later we see anticipation, joy and trust. In 1903 joy takes the lead, but in 1910 and 1915 we see anticipation and trust increasing with fear and joy equal. In 1916 fear and sadness are high and 1917 anger, fear, sadness and trust are notibly high. From 1918 – 1919 we see a change from negative to positive with joy taking a big lead in 1919 and again in 1920. 1921 sees sadness again with others coming in with mostly equal percentages, and 1922 anticipation, joy and trust. I am not a historian but fortunately my sister majored in history, so I asked her to send me a timeline of influencial world events between the years this analysis is based on. I have used this timeline to explain the findings. From both the bing and NRC analysis it can be seen that there was a very apparent negative trend during the years from 1915 – 1920, this is very possibly due to a number of world events, namely; World War I during the years from 1914 – 1918, the influenza pandemic which lasted from 1918 – 1919, and the Russian Revolution in 1917. Another negative trend seems to appear at the end of the 19th Century, this time it is possibly due to the Boer War, which lasted from 1899 – 1902 and had repurcussions across Europe, as well as the Indian famine which killed an estimated one million people during the years 1899 and 1900. A few things to be mentioned: 1. As we all know the pop artists of the time will always move with popular trends, so it could be that the poets chosen, being the most popular during the years this analysis is based on, may have been more likely to be influenced by the events around them 2. Due to the surprisingly difficult task of finding poetry with enough information, this analysis has been based on only a small number of poems Taking the above into account, it is clear more work needs to be done; though the findings from this brief analysis do show a possible relation between poetry and the events of the time. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Pareto's Playground. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Superstorm Sandy at Barnegat Bay Revisted Mon, 06/12/2017 - 00:17 (This article was first published on AdventuresInData, and kindly contributed to R-bloggers) Animations of continuous data in GIF format offer some portability advantages over video files. A few years ago, shortly after Superstorm Sandy, a colleague and I developed a video of animated water surface elevations from USGS gages in Barnegat Bay, NJ as the eye of the storm approached. That version used video screen capture and MS Excel VBA – you can see it here. With the pending 5 year anniversary of Sandy and the R animation package, the time was right to revisit the animation as a GIF. Sources of data include: USGS 01409335 Little Egg Inlet near Tuckerton, NJ USGS 01409146 East Thorofare at Ship Bottom, NJ USGS 01409110 Barnegat Bay at Waretown, NJ USGS 01408205 Barnegat Bay at Route 37 bridge near Bay Shore NJ USGS 01408168 Barnegat Bay at Mantoloking , NJ USGS 01408043 Point Pleasant Canal at Point Pleasant NJ USGS 01408050 Manasquan River at Point Pleasant NJ NOAA National Data Buoy Center Station 44009 (LLNR 168) – DELAWARE BAY 26 NM Southeast of Cape May, NJ See and download the full length GIF at https://media.giphy.com/media/xUA7beSsKPUPI3V41a/giphy.gif var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: AdventuresInData. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Weather forecast with regression models – part 3 Sun, 06/11/2017 - 23:47 (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers) In the previous part of this tutorial, we build several models based on logistic regression. One aspect to be further considered is the decision threshold tuning that may help in reaching a better accuracy. We are going to show a procedure able to determine an optimal value for such purpose. ROC plots will be introduced as well. Decision Threshold Tuning As default, caret uses 0.5 as threshold decision value to be used as probability cutoff value. We are going to show how to tune such decision threshold to achieve a better training set accuracy. We then compare how the testing set accuracy changes accordingly. Tuning Analysis We load previously saved training, testing datasets together with the models we have already determined. suppressPackageStartupMessages(library(caret)) suppressPackageStartupMessages(library(ROCR)) set.seed(1023) readRDS(file="wf_log_reg_part2.rds") The following tuning procedure explores a pool of cutoff values giving back a table reporting {cutoff, accuracy, specificity} triplets. Its inspection allows to identify the cutoff value providing with the highest accuracy. In case of a tie in accuracy value, we choose the lowest cutoff value of such. glm.tune <- function(glm_model, dataset) { results <- data.frame() for (q in seq(0.2, 0.8, by = 0.02)) { fitted_values <- glm_model$finalModel$fitted.values prediction = q, "Yes", "No") cm <- confusionMatrix(prediction, dataset$RainTomorrow) accuracy <- cm$overall["Accuracy"] specificity <- cm$byClass["Specificity"] results <- rbind(results, data.frame(cutoff=q, accuracy=accuracy, specificity = specificity)) } rownames(results) <- NULL results }

In the utility function above, we included also the specificity metrics. Specificity is defined as TN/(TN+FP) and in the next part of this series we will discuss some aspects of. Let us show an example of confusion matrix with some metrics computation to clarify the terminology.

Reference Prediction No Yes No 203 42 Yes 0 3 Accuracy : 0.8306 = (TP+TN)/(TP+TN+FP+FN) = (203+3)/(203+3+42+0) Sensitivity : 1.00000 = TP/(TP+FN) = 203/(203+0) Specificity : 0.06667 = TN/(TN+FP) = 3/(3+42) Pos Pred Value : 0.82857 = TP/(TP+FP) = 203/(203+42) Neg Pred Value : 1.00000 = TN/(TN+FN) = 3/(3+0)

Please note that the “No” column is associated to a positive outcome, whilst “Yes” to a negative one. Specificity measure how good we are in predict that {RainTomorrow = “No”} and actually it is and this is something that we are going to investigate in next post of this series.

The logistic regression models to be tuned are:

mod9am_c1_fit: RainTomorrow ~ Cloud9am + Humidity9am + Pressure9am + Temp9am mod3pm_c1_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm mod_ev_c2_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDir mod_ev_c3_fit: RainTomorrow ~ Pressure3pm + Sunshine

In the following analysis, for all models we determine the cutoff value providing with the highest accuracy.

glm.tune(mod9am_c1_fit, training) cutoff accuracy specificity 1 0.20 0.7822581 0.73333333 2 0.22 0.7862903 0.71111111 3 0.24 0.7943548 0.68888889 4 0.26 0.8064516 0.64444444 5 0.28 0.8104839 0.64444444 6 0.30 0.8145161 0.60000000 7 0.32 0.8185484 0.57777778 8 0.34 0.8185484 0.51111111 9 0.36 0.8104839 0.46666667 10 0.38 0.8266129 0.46666667 11 0.40 0.8266129 0.40000000 12 0.42 0.8306452 0.40000000 13 0.44 0.8266129 0.37777778 14 0.46 0.8346774 0.35555556 15 0.48 0.8467742 0.33333333 16 0.50 0.8508065 0.31111111 17 0.52 0.8427419 0.26666667 18 0.54 0.8467742 0.26666667 19 0.56 0.8346774 0.20000000 20 0.58 0.8387097 0.20000000 21 0.60 0.8387097 0.20000000 22 0.62 0.8346774 0.17777778 23 0.64 0.8387097 0.17777778 24 0.66 0.8346774 0.15555556 25 0.68 0.8387097 0.15555556 26 0.70 0.8387097 0.13333333 27 0.72 0.8387097 0.13333333 28 0.74 0.8346774 0.11111111 29 0.76 0.8266129 0.06666667 30 0.78 0.8266129 0.06666667 31 0.80 0.8306452 0.06666667

The optimal cutoff value is equal to 0.5. In this case, we find the same cutoff value as the default the caret package takes advantage of for logistic regression models.

opt_cutoff &lt 0.5;- pred_test <- predict(mod9am_c1_fit, testing, type = "prob") prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No") confusionMatrix(prediction, testing$RainTomorrow) Confusion Matrix and Statistics Reference Prediction No Yes No 80 12 Yes 6 7 Accuracy : 0.8286 95% CI : (0.7427, 0.8951) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.4602 Kappa : 0.3405 Mcnemar's Test P-Value : 0.2386 Sensitivity : 0.9302 Specificity : 0.3684 Pos Pred Value : 0.8696 Neg Pred Value : 0.5385 Prevalence : 0.8190 Detection Rate : 0.7619 Detection Prevalence : 0.8762 Balanced Accuracy : 0.6493 'Positive' Class : No

Then, tuning the 3PM model.

glm.tune(mod3pm_c1_fit, training) cutoff accuracy specificity 1 0.20 0.8064516 0.7777778 2 0.22 0.8185484 0.7555556 3 0.24 0.8225806 0.7333333 4 0.26 0.8306452 0.6888889 5 0.28 0.8467742 0.6666667 6 0.30 0.8467742 0.6444444 7 0.32 0.8427419 0.6222222 8 0.34 0.8669355 0.6222222 9 0.36 0.8709677 0.6222222 10 0.38 0.8629032 0.5777778 11 0.40 0.8669355 0.5777778 12 0.42 0.8669355 0.5555556 13 0.44 0.8548387 0.4666667 14 0.46 0.8548387 0.4444444 15 0.48 0.8588710 0.4444444 16 0.50 0.8669355 0.4444444 17 0.52 0.8629032 0.4222222 18 0.54 0.8669355 0.4222222 19 0.56 0.8669355 0.3777778 20 0.58 0.8669355 0.3777778 21 0.60 0.8588710 0.3333333 22 0.62 0.8548387 0.3111111 23 0.64 0.8508065 0.2888889 24 0.66 0.8467742 0.2666667 25 0.68 0.8387097 0.2222222 26 0.70 0.8387097 0.2222222 27 0.72 0.8346774 0.2000000 28 0.74 0.8387097 0.1777778 29 0.76 0.8346774 0.1555556 30 0.78 0.8346774 0.1555556 31 0.80 0.8306452 0.1111111

The optimal cutoff value is equal to 0.36.

opt_cutoff <- 0.36 pred_test <- predict(mod3pm_c1_fit, testing, type="prob") prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No") confusionMatrix(prediction, testing$RainTomorrow) Confusion Matrix and Statistics Reference Prediction No Yes No 81 5 Yes 5 14 Accuracy : 0.9048 95% CI : (0.8318, 0.9534) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.0112 Kappa : 0.6787 Mcnemar's Test P-Value : 1.0000 Sensitivity : 0.9419 Specificity : 0.7368 Pos Pred Value : 0.9419 Neg Pred Value : 0.7368 Prevalence : 0.8190 Detection Rate : 0.7714 Detection Prevalence : 0.8190 Balanced Accuracy : 0.8394 'Positive' Class : No

We improved the test accuracy, previously resulting as equal to 0.8857 and now equal to 0.9048 (please take a look at part #2 of this tutorial to know about the test accuracy with 0.5 cutoff value). Then, the first evening hours model.

glm.tune(mod_ev_c2_fit, training) cutoff accuracy specificity 1 0.20 0.8387097 0.8444444 2 0.22 0.8467742 0.8222222 3 0.24 0.8548387 0.8222222 4 0.26 0.8629032 0.8000000 5 0.28 0.8588710 0.7333333 6 0.30 0.8709677 0.7333333 7 0.32 0.8790323 0.7111111 8 0.34 0.8830645 0.6888889 9 0.36 0.8870968 0.6666667 10 0.38 0.8870968 0.6666667 11 0.40 0.8951613 0.6666667 12 0.42 0.8991935 0.6666667 13 0.44 0.8991935 0.6666667 14 0.46 0.8951613 0.6444444 15 0.48 0.8951613 0.6222222 16 0.50 0.8951613 0.6222222 17 0.52 0.8951613 0.6000000 18 0.54 0.8911290 0.5777778 19 0.56 0.8911290 0.5777778 20 0.58 0.8951613 0.5777778 21 0.60 0.8911290 0.5555556 22 0.62 0.8870968 0.5111111 23 0.64 0.8830645 0.4888889 24 0.66 0.8830645 0.4666667 25 0.68 0.8790323 0.4222222 26 0.70 0.8709677 0.3555556 27 0.72 0.8548387 0.2666667 28 0.74 0.8508065 0.2444444 29 0.76 0.8427419 0.1777778 30 0.78 0.8387097 0.1555556 31 0.80 0.8387097 0.1555556

The optimal cutoff value is equal to 0.42.

opt_cutoff <- 0.42 pred_test <- predict(mod_ev_c2_fit, testing, type="prob") prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No") confusionMatrix(prediction, testing$RainTomorrow) Confusion Matrix and Statistics Reference Prediction No Yes No 82 8 Yes 4 11 Accuracy : 0.8857 95% CI : (0.8089, 0.9395) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.04408 Kappa : 0.58 Mcnemar's Test P-Value : 0.38648 Sensitivity : 0.9535 Specificity : 0.5789 Pos Pred Value : 0.9111 Neg Pred Value : 0.7333 Prevalence : 0.8190 Detection Rate : 0.7810 Detection Prevalence : 0.8571 Balanced Accuracy : 0.7662 'Positive' Class : No

We did not improve the accuracy as the default cutoff value provides with the same. Other metrics changed and in particular we notice that the sensitivity decreased a little bit while the positive predicted value improved. Then the second evening hours model.

glm.tune(mod_ev_c3_fit, training) cutoff accuracy specificity 1 0.20 0.8225806 0.8000000 2 0.22 0.8145161 0.7333333 3 0.24 0.8064516 0.6666667 4 0.26 0.8145161 0.6666667 5 0.28 0.8145161 0.6222222 6 0.30 0.8185484 0.6222222 7 0.32 0.8266129 0.6000000 8 0.34 0.8185484 0.5555556 9 0.36 0.8306452 0.5333333 10 0.38 0.8467742 0.5333333 11 0.40 0.8467742 0.5111111 12 0.42 0.8427419 0.4666667 13 0.44 0.8508065 0.4666667 14 0.46 0.8467742 0.4222222 15 0.48 0.8467742 0.4000000 16 0.50 0.8508065 0.4000000 17 0.52 0.8548387 0.4000000 18 0.54 0.8508065 0.3777778 19 0.56 0.8669355 0.3777778 20 0.58 0.8669355 0.3555556 21 0.60 0.8629032 0.3333333 22 0.62 0.8588710 0.3111111 23 0.64 0.8548387 0.2888889 24 0.66 0.8508065 0.2666667 25 0.68 0.8467742 0.2444444 26 0.70 0.8387097 0.2000000 27 0.72 0.8387097 0.1777778 28 0.74 0.8346774 0.1555556 29 0.76 0.8306452 0.1333333 30 0.78 0.8306452 0.1333333 31 0.80 0.8306452 0.1333333

The optimal cutoff value is equal to 0.56.

opt_cutoff <- 0.56 pred_test <- predict(mod_ev_c3_fit, testing, type="prob") prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No") confusionMatrix(prediction, testing$RainTomorrow) Confusion Matrix and Statistics Reference Prediction No Yes No 82 8 Yes 4 11 Accuracy : 0.8857 95% CI : (0.8089, 0.9395) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.04408 Kappa : 0.58 Mcnemar's Test P-Value : 0.38648 Sensitivity : 0.9535 Specificity : 0.5789 Pos Pred Value : 0.9111 Neg Pred Value : 0.7333 Prevalence : 0.8190 Detection Rate : 0.7810 Detection Prevalence : 0.8571 Balanced Accuracy : 0.7662 'Positive' Class : No

In this case, we did not improved the accuracy of this model. Other metrics changed and in particular we notice that the sensitivity is slightly higher than the default cutoff driven value (0.9419), and the positive predicted value decreased a little bit (for default cutoff was 0.9205).

ROC analysis

To have a visual understanding of the classification performances and compute further metrics, we are going to take advantage of the ROCr package facilities. We plot ROC curves and determine the corresponding AUC value. That is going to be done for each model. The function used to accomplish with this task is the following.

glm.perf.plot <- function (prediction, cutoff) { perf <- performance(prediction, measure = "tpr", x.measure = "fpr") par(mfrow=(c(1,2))) plot(perf, col="red") grid() perf <- performance(prediction, measure = "acc", x.measure = "cutoff") plot(perf, col="red") abline(v = cutoff, col="green") grid() auc_res <- performance(prediction, "auc") auc_res@y.values[[1]] }

In the following, each model will be considered, AUC value printed out and ROC plots given.

mod9am_pred_prob <- predict(mod9am_c1_fit, testing, type="prob") mod9am_pred_resp <- prediction(mod9am_pred_prob$Yes, testing$RainTomorrow) glm.perf.plot(mod9am_pred_resp, 0.5) [1] 0.8004896

mod3pm_pred_prob <- predict(mod3pm_c1_fit, testing, type="prob") mod3pm_pred_resp <- prediction(mod3pm_pred_prob$Yes, testing$RainTomorrow) glm.perf.plot(mod3pm_pred_resp, 0.36) [1] 0.9155447

The 3PM model shows the highest AUC value among all the models.

mod_ev_c2_prob <- predict(mod_ev_c2_fit, testing, type="prob") mod_ev_c2_pred_resp <- prediction(mod_ev_c2_prob$Yes , testing$RainTomorrow) glm.perf.plot(mod_ev_c2_pred_resp, 0.42) [1] 0.8390453

mod_ev_c3_prob <- predict(mod_ev_c3_fit, testing, type="prob") mod_ev_c3_pred_resp <- prediction(mod_ev_c3_prob$Yes, testing$RainTomorrow) glm.perf.plot(mod_ev_c3_pred_resp, 0.56) [1] 0.8886169

Conclusions

By tuning the decision threshold, we were able to improve training and testing set accuracy. It turned out the 3PM model achieved a satisfactory accuracy. ROC plots gave us an understanding of how true/false positive rates vary and what is the accuracy obtained by a specific decision threshold value (i.e. cutoff value). Ultimately, AUC values were reported.

If you have any questions, please feel free to comment below.

Related Post

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Data Visualization with googleVis exercises part 2

Sun, 06/11/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Area, Stepped & Combo charts

In the second part of our series we are going to meet three more googleVis charts. More specifically these charts are Area Chart, Stepped Area Chart and Combo Chart.

Read the examples below to understand the logic of what we are going to do and then test yous skills with the exercise set we prepared for you. Lets begin!

Answers to the exercises are available here.

Package & Data frame

As you already know, the first thing you have to do is install and load the googleVis package with:

Secondly we will create an experimental data frame which will be used for our charts’ plotting. You can create it with:
df=data.frame(name=c("James", "Curry", "Harden"),
Pts=c(20,23,34),
Rbs=c(13,7,9))

NOTE: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page.

Area chart

It is quite simple to create an area chart with googleVis with:
AreaC <- gvisBarChart(df)
plot(AreaC)

Exercise 1

Create a list named “AreaC” and pass to it the “df” data frame you just created as an area chart. HINT: Use gvisAreaChart().

Exercise 2

Plot the area chart. HINT: Use plot().

Stepped Area chart

Creating a stepped area chart is a little different than the area chart. You have to set the X and Y variables and also make it stacked in order to display the values correctly. Here is an example:
SteppedAreaC <- gvisSteppedAreaChart(df, xvar="name",
yvar=c("val1", "val2"),
options=list(isStacked=TRUE))
plot(SteppedAreaC)

Exercise 3

Create a list named “SteppedAreaC” and pass to it the “df” data frame you just created as a stepped area chart. You should also set the X variable as the players’ names and Y variable as their values. HINT: Use gvisSteppedAreaChart(), xvar and yvar.

Exercise 4

Plot the stepped area chart. HINT: Use plot().

Learn more about using GoogleVis in the online course Mastering in Visualization with R programming. In this course you will learn how to:

• Work extensively with the GoogleVis package and its functionality
• Learn what visualizations exist for your specific use case
• And much more

Exercise 5

Now transform your stepped area chart to stacked to correct it and plot it.

Combo chart

The next chart we are going to meet is the combination of lines and bars chart known as combo chart. You can produce it like this:
ComboC <- gvisComboChart(df, xvar="country",
yvar=c("val1", "val2"),
options=list(seriesType="bars",
series='{1: {type:"line"}}'))
plot(ComboC)

Exercise 6

Create a list named “ComboC” and pass to it the “df” data frame you just created as a combo chart. You should also set the X variable as the players’ names and Y variable as their values. HINT: Use gvisComboChart(), xvar and yvar.

Exercise 7

Plot the chart. What kind of chart do you see? HINT: Use plot().

In order to add the bars we have to set it as the example below.
options=list(seriesType="bars",
series='{1: {type:"line"}}')

Exercise 8

Transform the chart you just created into a combo chart with bars and lines and plot it. HINT: Use list().

Exercise 9

In the previous exercise “Pts” are represented by the bars and “Rbs” by the lines. Try to reverse them.

You can easily transform your combo chart into a column chart or a line chart just be setting series='{1: {type:"line"}}' to 2

Exercise 10

Transform the combo chart into a column chart and then into a line chart.

Related exercise sets: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Exit poll for June 2017 election (UK)

Sun, 06/11/2017 - 16:13

(This article was first published on R – Let's Look at the Figures, and kindly contributed to R-bloggers)

It has been a while since I posted anything here, but I can’t resist this one.

Let me just give three numbers.  The first two are:

• 314, the number of seats predicted for the largest party (Conservatives) in the UK House of Commons, at 10pm in Thursday (i.e., before even a single vote had been counted) from the exit poll commissioned jointly by broadcasters BBC, ITV and Sky.
• 318, the actual number of seats that were won by the Conservatives, now that all the votes have been counted.

That highly accurate prediction changed the whole story on election night: most of the pre-election voting intention polls had predicted a substantial Conservative majority.  (And certainly that’s what Theresa May had expected to achieve when she made the mistake of calling a snap election, 3 years early.)  But the exit poll prediction made it pretty clear that the Conservatives would either not achieve a majority (for which 326 seats would be needed), or at best would be returned with a very small majority such as the one they held before the election.  Media commentary turned quickly to how a government might be formed in the seemingly likely event of a hung Parliament, and what the future might be for Mrs May.  The financial markets moved quite substantially, too, in the moments after 10pm.

For more details on the exit poll, its history, and the methods used to achieve that kind of predictive accuracy, see Exit Polling Explained.

The third number I want to mention here is

• 2.1.0

That’s the version of R that I had at the time of the 2005 General Election, when I completed the development of a fairly extensive set of R functions to use in connection with the exit poll (which at that time was done for BBC and ITV jointly).  Amazingly (to me!) the code that I wrote back in 2001–2005 still works fine.  My friend and former colleague Jouni Kuha, who stepped in as election-day statistician for the BBC when I gave it up after 2005, told me today that (with some tweaks, I presume!) it all works brilliantly still, as the basis for an extremely high-pressure data analysis on election day/night.  Very pleasing indeed; and strong testimony to the heroic efforts of the R Core Development Team, to keep everything stable with a view to the long term.

As suggested by that kind tweet reproduced above from the RSS President, David Spiegelhalter: Thursday’s performance was quite a triumph for the practical art and science of Statistics.  [And I think I am allowed to say this, since on this occasion I was not even there!  The credit for Thursday’s work goes to Jouni Kuha, along with John Curtice, Steve Fisher and the rest of the academic team of analysts who worked in the secret exit-poll “bunker” on 8 June.]

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Let's Look at the Figures. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Machine Learning Powered Biological Network Analysis

Sun, 06/11/2017 - 15:50

(This article was first published on r-bloggers – Creative Data Solutions, and kindly contributed to R-bloggers)

Metabolomic network analysis can be used to interpret experimental results within a variety of contexts including: biochemical relationships, structural and spectral similarity and empirical correlation. Machine learning is useful for modeling relationships in the context of pattern recognition, clustering, classification and regression based predictive modeling. The combination of developed metabolomic networks and machine learning based predictive models offer a unique method to visualize empirical relationships while testing key experimental hypotheses. The following presentation focuses on data analysis, visualization, machine learning and network mapping approaches used to create richly mapped metabolomic networks. Learn more at www.createdatasol.com

The following presentation also shows a sneak peak of a new data analysis visualization software, DAVe: Data Analysis and Visualization engine. Check out some early features. DAVe is built in R and seeks to support a seamless environment for advanced data analysis and machine learning tasks and biological functional and network analysis. As an aside, building the main site (in progress)  was a fun opportunity to experiment with Jekyll, Ruby and embedding slick interactive canvas elements into websites. You can checkout all the code here https://github.com/dgrapov/CDS_jekyll_site.

slides: https://www.slideshare.net/dgrapov/machine-learning-powered-metabolomic-network-analysis

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – Creative Data Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### “smooth” package for R. Common ground. Part I. Prediction intervals

Sun, 06/11/2017 - 15:23

(This article was first published on R – Modern Forecasting, and kindly contributed to R-bloggers)

We have spent previous six posts discussing basics of es()

function (underlying models and their implementation). Now it is time to move forward. Starting from this post we will discuss common parameters, shared by all the forecasting functions implemented in smooth. This means that the topics that we discuss are not only applicable to es()

, but also to ssarima()

, ces()

, ges()

and sma()

. However, taking that we have only discussed ETS so far, we will use es()

in our examples for now.

And I would like to start this series of general posts from the topic of prediction intervals.

Prediction intervals for smooth functions

One of the features of smooth

functions is their ability to produce different types of prediction intervals. Parametric prediction intervals (triggered by intervals="p"

, intervals="parametric"

or intervals=TRUE

) are derived analytically only for pure additive and pure multiplicative models and are based on the state-space model discussed in previous posts. In the current smooth

version (v2.0.0) only es()

function has multiplicative components, all the other functions are based on additive models. This makes es()

“special”. While constructing intervals for pure models (either additive or multiplicative) is relatively easy to do, the mixed models cause pain in the arse (one of the reasons why I don’t like them). So in case of mixed ETS models, we have to use several tricks.

If the model has multiplicative error, non-multiplicative other components (trend, seasonality) and low variance of the error (smaller than 0.1), then the intervals can be approximated by similar models with additive error term. For example, the intervals for ETS(M,A,N) can be approximated with intervals of ETS(A,A,N), when the variance is low, because the distribution of errors in both models will be similar. In all the other cases we use simulations for prediction intervals construction (via sim.es()

function). In this case the data is generated with preset parameters (including variance) and contains $$h$$ observations. This process is repeated 10,000 times, resulting in 10,000 possible trajectories. After that the necessary quantiles of these trajectories for each step ahead are taken using quantile()

function from stats

package and returned as prediction intervals. This cannot be considered as a pure parametric approach, but it is the closest we have.

smooth

functions also introduce semiparametric and nonparametric prediction intervals. Both of them are based on multiple steps ahead (also sometimes called as “trace”) forecast errors. These are obtained via producing forecasts for horizon 1 to $$h$$ from each observation of time series. As a result a matrix with $$h$$ columns and $$T-h$$ rows is produced. In case of semi-parametric intervals (called using intervals="sp"

or intervals="semiparametric"

), variances of forecast errors for each horizon are calculated and then used in order to extract quantiles of either normal or log-normal distribution (depending on error type). This way we cover possible violation of assumptions of homoscedasticity and no autocorrelation in residuals, but we still assume that each separate observation has some parametric distribution.

In case of non-parametric prediction intervals (defined in R via intervals="np"

or intervals="nonparametric"

), we loosen assumptions further, dropping part about distribution of residuals. In this case quantile regressions are used as proposed by Taylor and Bunn, 1999. However we use a different form of regression model than the authors do:
\label{eq:ssTaylorPIs}
\hat{e}_{j} = a_0 j ^ {a_{1}},

where $$j = 1, .., h$$ is forecast horizon. This function has an important advantage over the proposed by the authors second order polynomial: it does not have extremum (turning point) for $$j>0$$, which means that the intervals won’t behave strangely after several observations ahead. Using polynomials for intervals sometimes leads to weird bounds (for example, expanding and then shrinking). On the other hand, power function allows producing wide variety of forecast trajectories, which correspond to differently increasing or decreasing bounds of prediction intervals (depending on values of $$a_0$$ and $$a_1$$), without producing any ridiculous trajectories.

The main problem with nonparametric intervals produced by smooth

is caused by quantile regressions, which do not behave well on small samples. In order to produce correct 0.95 quantile, we need to have at least 20 observations, and if we want 0.99 quantile, then the sample must contain at least 100. In the cases, when there is not enough observations, the produced intervals can be inaccurate and may not correspond to the nominal level values.

As a small note, if a user produces only one-step-ahead forecast, then semiparametric interval will correspond to parametric one (because only the variance of the one-step-ahead error is used), and the nonparametric interval is constructed using quantile()

function from stats

package.

Finally, the width of prediction intervals is regulated by parameter level

, which can be written either as a fraction number ( level=0.95

) or as an integer number, less than 100 ( level=95

). I personally prefer former, but the latter is needed for the consistency with forecast

package functions. By default all the smooth

functions produce 95% prediction intervals.

There are some other features of prediction interval construction for specific intermittent models and cumulative forecasts, but they will be covered in upcoming posts.

Examples in R

We will use a time series N1241 as an example and we will estimate model ETS(A,Ad,N). Here’s how we do that:

ourModel1 <- es(M3$N1241$x, "AAdN", h=8, holdout=TRUE, intervals="p") ourModel2 <- es(M3$N1241$x, "AAdN", h=8, holdout=TRUE, intervals="sp") ourModel3 <- es(M3$N1241$x, "AAdN", h=8, holdout=TRUE, intervals="np")

The resulting graphs demonstrate some differences in prediction intervals widths and shapes:

Series N1241 from M3, es() forecast, parametric prediction intervals

Series N1241 from M3, es() forecast, semiparametric prediction intervals

Series N1241 from M3, es() forecast, nonparametric prediction intervals

All of them cover actual values in the holdout, because the intervals are very wide. It is not obvious, which of them is the most appropriate for this task. So we can calculate the spread of intervals and see, which of them is on average wider:

mean(ourModel1$upper-ourModel1$lower) mean(ourModel2$upper-ourModel2$lower) mean(ourModel3$upper-ourModel3$lower)

Which results in:

950.4171 955.0831 850.614

In this specific example, the non-parametric interval appeared to be the narrowest, which is good, taking that it adequately covered values in the holdout sample. However, this doesn’t mean that it is in general superior to the other methods. Selection of the appropriate intervals should be done based on the general understanding of the violated assumptions. If we didn’t know the actual values in the holdout sample, then we could make a decision based on the analysis of the in-sample residuals in order to get a clue about the violation of any assumptions. This can be done, for example, this way:

forecast::tsdisplay(ourModel1$residuals) hist(ourModel1$residuals) qqnorm(ourModel3$residuals) qqline(ourModel3$residuals)

Linear plot and correlation functions of the residuals of the ETS(A,Ad,N) model

Histogram of the residuals of the ETS(A,Ad,N) model

Q-Q plot of the residuals of the ETS(A,Ad,N) model

The first plot shows how residuals change over time and how the autocorrelation and partial autocorrelation functions look for this time series. There is no obvious autocorrelation and no obvious heteroscedasticity in the residuals. This means that we can assume that these conditions are not violated in the model, so there is no need to use semiparametric prediction intervals. However, the second and the third graphs demonstrate that the residuals are not normally distributed (as assumed by the model ETS(A,Ad,N)). This means that parametric prediction intervals may be wrong for this time series. All of this motivates the usage of nonparametric prediction intervals for the series N1241.

That’s it for today.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Modern Forecasting. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Likelihood calculation for the g-and-k distribution

Sun, 06/11/2017 - 00:33

(This article was first published on R – Statisfaction, and kindly contributed to R-bloggers)

Histogram of 1e5 samples from the g-and-k distribution, and overlaid probability density function

Hello,

An example often used in the ABC literature is the g-and-k distribution (e.g. reference [1] below), which is defined through the inverse of its cumulative distribution function (cdf). It is easy to simulate from such distributions by drawing uniform variables and applying the inverse cdf to them. However, since there is no closed-form formula for the probability density function (pdf) of the g-and-k distribution, the likelihood is often considered intractable. It has been noted in [2] that one can still numerically compute the pdf, by 1) numerically inverting the quantile function to get the cdf, and 2)  numerically differentiating the cdf, using finite differences, for instance. As it happens, this is very easy to implement, and I coded up an R tutorial at:

github.com/pierrejacob/winference/blob/master/inst/tutorials/tutorial_gandk.pdf

for anyone interested. This is part of the winference package that goes with our tech report on ABC with the Wasserstein distance  (joint work with Espen Bernton, Mathieu Gerber and Christian Robert, to be updated very soon!). This enables standard MCMC algorithms for the g-and-k example. It is also very easy to compute the likelihood for the multivariate extension of [3], since it only involves a fixed number of one-dimensional numerical inversions and differentiations (as opposed to a multivariate inversion).

Surprisingly, most of the papers that present the g-and-k example do not compare their ABC approximations to the posterior; instead, they typically compare the proposed ABC approach to existing ones. Similarly, the so-called Ricker model is commonly used in the ABC literature, and its posterior can be tackled efficiently using particle MCMC methods; as well as the M/G/1 model, which can be tackled either with particle MCMC methods or with tailor-made MCMC approaches such as [4].

These examples can still have great pedagogical value in ABC papers, but it would perhaps be nice to see more comparisons to the ground truth when it’s available; ground truth here being the actual posterior distribution.

1. Fearnhead, P. and Prangle, D. (2012) Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B, 74, 419–474.
2. Rayner, G. D. and MacGillivray, H. L. (2002) Numerical maximum likelihood estimation for the g-and-k and generalized g-and-h distributions. Statistics and Computing, 12, 57–75.
3. Drovandi, C. C. and Pettitt, A. N. (2011) Likelihood-free Bayesian estimation of multivari- ate quantile distributions. Computational Statistics & Data Analysis, 55, 2541–2556.
4. Shestopaloff, A. Y. and Neal, R. M. (2014) On Bayesian inference for the M/G/1 queue with efficient MCMC sampling. arXiv preprint arXiv:1401.5548.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Statisfaction. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));