R news and tutorials contributed by hundreds of R bloggers

### Leading indicators of economic growth by @ellis2013nz

Thu, 08/09/2018 - 20:30

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

The debate about business confidence as a possible leading indicator of economic growth reminded me of a question that has long been on my back-burner – what data make the best leading indicators (say, a quarter of a year ahead) of economic growth? There are a number of indicators that get a lot of attention in the media and from economic pundits of course, but I wanted to take a data-driven approach and see what actual evidence there is for the various candidate variables.

To cut to the chase, this graphic summarises my results:

The previous quarter's GDP growth is, unsurprisingly, a strong indicator of the coming quarter's. Food prices are strongly negatively related to economic growth (in New Zealand anyway – with the strong agriculture sector here, food prices may be serving as a proxy for something else, though exactly what I'm not sure, although I've got some ideas). Other indicators with fair evidence of value include visitor arrivals, car registrations, business confidence (the OECD measure, which David Hood points out is apparently derived from the NZIER's survey, not the ANZ's), and electronic card transactions.

One surprise for me is the complete absence of predictive power of merchandise goods exports for growth in GDP. Any suggestions as to why are welcome in the comments!

These confidence intervals come from what I believe to be a pretty robust frequentist modelling approach that combines multiple imputation, bootstrap and elastic net regularization. I have in the back of my head the idea of a follow-up project that uses Bayesian methods to provide actual forecasts of the coming GDP, but I may not get around to it (or if I do, it will probably be for Victorian or Australian data seeing as I’ll be living there from October and more motivated for forecasts in that space).

Data

To address the question, I started with the Stats NZ data release calendar to identify official statistics that are available on a monthly basis and published within a month of the reference period. I counted seven such collections:

• Electronic card transactions
• Food price index
• Transport vehicle registrations
• International travel and migration
• Building consents issued
• Overseas merchandise trade
• Livestock slaughtering statistics

To these seven, I also added the:

• Reserve Bank’s Trade Weighted Index of the New Zealand dollar compared to a basket of currencies
• OECD business confidence series (I used this rather than the ANZ or NZIER original series as the OECD version is available freely for reproducibility purposes)

Some of these datasets are available programmatically, and some required manual steps to download from Infoshare. Slightly more detail on this is documented in the GitHub repo I set up for the project, as are all the scripts that import and combine the data. If you're interested in that, fork the entire repository, as it's a combined whole and needs to be run as such, via the integrate.R script in the root directory of the project.

Here are the original time series, plus seasonally adjusted versions of all of them except the trade weighted index (currencies are so random that I thought it made little sense to seasonally adjust that one; a possible subject of future investigation):

I did my own seasonal adjustment because even when seasonally adjusted series are available in some of the Stats NZ collections, typically it goes back only to the 1990s or 2000s. I wanted data at least all the way back to 1987 when the quarterly GDP series starts; time series are difficult enough to analyse as it is without restricting ourselves to a decade or so of data (this is also why I went back to original data series rather than using the Treasury “leading economic indicator” Excel collections, as they cover only a relatively recent period).

Apart from business confidence (which by its nature is restricted to within a constant range), the original data are all clearly non-stationary, so I converted them to growth rates, which gives us the following chart summarising my candidate leading economic indicators:
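As a toy illustration of that transformation (simulated data, not the post's actual indicators), converting a non-stationary level series into quarter-on-quarter growth rates might look like this:

```r
# Toy sketch (simulated data): converting a non-stationary level series
# into a quarter-on-quarter growth rate to make it stationary.
set.seed(42)
level <- ts(cumsum(rnorm(40, mean = 2)), start = c(2008, 1), frequency = 4)

# growth rate: proportional change from the previous quarter
growth <- level / stats::lag(level, -1) - 1

# diff(log(.)) is a common alternative that behaves similarly for small changes
growth_log <- diff(log(level))
```

Either form removes the trend; the rest of the analysis works with growth rates like these rather than the raw levels.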

Both seasonal adjustment and stationarity are important with time series so we can focus on the underlying relationships and avoid spurious correlations from the structural (growth and episodicity) elements of the series. An alternative approach would be to model the seasonality and growth directly in a full blown Bayesian model, but even in that case I’d want to look at de-seasoned, stationary versions for the exploratory stage.
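For completeness, here is a minimal sketch of the seasonal adjustment idea using base R's stl() on simulated data. The post itself uses X-13-style adjustment via the seasonal package, so treat this purely as an illustration of the concept:

```r
# Simulated quarterly series: a fixed seasonal pattern plus noise
set.seed(123)
seasonal_pattern <- rep(c(0.02, -0.01, 0.03, -0.04), 10)
x <- ts(seasonal_pattern + rnorm(40, sd = 0.005), frequency = 4, start = c(2008, 1))

# loess-based decomposition; subtracting the seasonal component de-seasons the series
decomp <- stl(x, s.window = "periodic")
x_sa <- x - decomp$time.series[, "seasonal"]
```

The adjusted series has much less variance than the original because the regular quarterly pattern has been removed.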

Now that we have de-seasoned, stationary versions of the data, we can look at which ones are correlated one-on-one with economic growth for the quarter. Remember, the idea here is to use data that are available fairly promptly as leading indicators of economic growth in the same quarter, because GDP growth is reported only quarterly, is hard to estimate, and hence is published at a material lag. Here are the bivariate correlations:

I’ve sequenced the variables in order of strength of relationship with GDP growth in the quarter (code). We can see that lagged GDP growth, electronic card transactions growth, and OECD business confidence have the three strongest positive correlations with GDP growth. Food price growth has a strong negative correlation (ie high growth in food prices is a predictor of low growth in GDP).

The lower diagonal in that pairs plot shows connected scatter plots with a single-predictor linear model. This turned out to be pleasantly easy to do with GGally::ggpairs and a simple user-defined function. Here’s the code for all three of the exploratory charts.
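The post links to the real code; as a hedged sketch of the technique (the function name and dataset here are my own illustration, not the post's), a user-defined panel function for GGally::ggpairs might look like:

```r
library(GGally)    # provides ggpairs
library(ggplot2)

# Lower-triangle panel: connected scatter plot plus a single-predictor linear fit
connected_lm_panel <- function(data, mapping, ...){
  ggplot(data = data, mapping = mapping) +
    geom_path(alpha = 0.3) +
    geom_point(size = 0.8) +
    geom_smooth(method = "lm", se = FALSE, colour = "red")
}

# Illustrated on a built-in dataset rather than the post's indicators
p <- ggpairs(swiss[, 1:4], lower = list(continuous = connected_lm_panel))
```

Passing the function via `lower = list(continuous = ...)` is all ggpairs needs to use it for every continuous-vs-continuous panel in the lower triangle.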

Minimalist regression (poor)

My starting point was a fairly naive out-of-the-box linear regression, just chucking all the data into lm().

```r
load("data/ind_data.rda")
summary(ind_data_wide)

# 109 cases where even gdp_growth is missing. No point in using them:
ind_data_wide2 <- subset(ind_data_wide, !is.na(gdp_growth_lag))

# naive approach *will* fit a plausible model, but the NA handling cuts the data
# down to the lowest common denominator: rows with data available on all variables
mod <- lm(gdp_growth ~ ., data = ind_data_wide2)
stargazer(mod, type = "html")
```

This gives me this result:

Dependent variable: gdp_growth

| Variable | Coefficient (SE) |
|---|---|
| gdp_growth_lag | 0.009 (0.121) |
| ect_growth | 0.335*** (0.114) |
| bc_sa | 0.0001 (0.001) |
| cars_growth | 0.022 (0.016) |
| com_vcl_growth | 0.003 (0.009) |
| iva_growth | 0.046** (0.020) |
| bci_growth | 0.001 (0.008) |
| twi_growth | -0.004 (0.020) |
| lst_growth | -0.002 (0.008) |
| goods_growth | 0.005 (0.012) |
| fpi_growth | -0.236*** (0.080) |
| yr_num | 0.0001 (0.0002) |
| Constant | -0.177 (0.354) |

Observations: 61; R² 0.566; Adjusted R² 0.457; Residual Std. Error 0.004 (df = 48); F Statistic 5.215*** (df = 12; 48). Note: *p<0.1; **p<0.05; ***p<0.01.

There are two big problems with this approach:

• By default, the model is fit to the minimal subset of the data that is complete for all variables. Electronic card transactions is the shortest series, going back only to 2003 once converted to a growth rate, so all the information in the other variables before that time has been thrown away.
• A large amount of collinearity is inevitable in this sort of data collection. While collinearity doesn't bias the estimated coefficients, it can greatly increase their variance, adding a lot of noise to our estimated effect sizes. I think this is why we get the implausible result in this naive model of lagged GDP growth not being useful as an explanatory variable.

… and one small problem, which is that other than including the lagged value of GDP growth as an explanatory variable, I’m ignoring the time series nature of the data. More on that later.
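The variance-inflation point can be demonstrated with a toy simulation (my own illustration, not the blog's data): adding a near-duplicate predictor leaves the coefficient roughly unbiased but blows up its standard error.

```r
# Two nearly collinear predictors; only x1 truly drives y
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # almost a copy of x1
y  <- 1 + 2 * x1 + rnorm(n)

# standard error of x1's coefficient, with and without the collinear x2
se_alone     <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_collinear <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
# se_collinear is many times larger than se_alone, although both fits are unbiased
```

Regularization, used later in the post, is one standard response to exactly this problem.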

Multiple imputation (better)

We could address the problem of the short electronic card transactions series either by eliminating that series from the data altogether (which would be a shame as it's obviously useful), or by imputing its previous values. Conceptually this is a little dubious – what exactly are we saying growth in electronic card transactions in 1988 would mean? – but statistically it's fine. The imputed values will capture the relationships between the various variables from 2003 onwards and avoid creating new artefacts, so in effect we will find the relationship between electronic cards and the others for the years we have the data, while still using the full 1987+ data for all other variables.

The appropriate way to do this is to create multiple complete data sets with different imputed values, fit our model to each, and pool the results. The mice package makes this spectacularly easy:

```r
ind_mi <- mice(ind_data_wide2, print = FALSE)

mod_mi <- with(ind_mi,
               lm(gdp_growth ~ yr_num + gdp_growth_lag + bc_sa + bci_growth +
                    ect_growth + fpi_growth + iva_growth + goods_growth +
                    cars_growth + com_vcl_growth + twi_growth + lst_growth))

summary(pool(mod_mi)) %>% kable(digits = 4)
```

| | estimate | std.error | statistic | df | p.value |
|---|---|---|---|---|---|
| (Intercept) | -0.2889 | 0.1712 | -1.6878 | 54.8951 | 0.0944 |
| yr_num | 0.0001 | 0.0001 | 1.5128 | 59.0142 | 0.1333 |
| gdp_growth_lag | 0.2182 | 0.0846 | 2.5785 | 79.8727 | 0.0113 |
| bc_sa | 0.0005 | 0.0004 | 1.3342 | 100.3535 | 0.1850 |
| bci_growth | 0.0061 | 0.0069 | 0.8820 | 106.1382 | 0.3798 |
| ect_growth | 0.2059 | 0.1308 | 1.5741 | 15.6965 | 0.1184 |
| fpi_growth | -0.1900 | 0.0566 | -3.3599 | 105.0452 | 0.0011 |
| iva_growth | 0.0257 | 0.0193 | 1.3335 | 82.9585 | 0.1852 |
| goods_growth | 0.0008 | 0.0125 | 0.0628 | 97.7269 | 0.9500 |
| cars_growth | 0.0172 | 0.0071 | 2.4403 | 104.3330 | 0.0163 |
| com_vcl_growth | 0.0050 | 0.0077 | 0.6501 | 102.0428 | 0.5171 |
| twi_growth | 0.0150 | 0.0192 | 0.7807 | 103.6387 | 0.4367 |
| lst_growth | 0.0024 | 0.0061 | 0.3999 | 104.7962 | 0.6900 |

Compared to the naive model on the cut-down dataset, we now have a much more plausible strong relationship with lagged GDP growth (which didn't show up as significant at all in the first model). Food prices and car/stationwagon registrations are the other two variables with "significant" evidence at this stage. The electronic card transactions (ect_growth) effect is much smaller now – an unsurprising result of the "blurring" introduced by the imputation, and of the fact that the other variables have long series of actual values with which to assert their relationships with the response, GDP growth.

So, we’ve dealt with the missing data, but not yet the multi-collinearity.

Multiple imputation, elastic net regularization, and bootstrap (best)

My preferred approach to reducing the random noise in effect sizes from having collinear explanatory variables is elastic net regularization, which I’ve written about a number of times in this blog.

Regularization can be combined with multiple imputation if we use a bootstrap. For each bootstrap re-sample, we perform a single imputation of missing variables, using just the data in the re-sample. This adds a little to the computational complexity, but this dataset isn’t large, and the more valid inferences possible make it worthwhile. In general, as much of the data processing stage as possible should be repeated in each bootstrap re-sample, for the re-sampling process to work as intended, ie as a mimic of “what would happen if we could repeat this experiment 999 times”, the fundamental principle of frequentist inference.

Bootstrap of a time series normally needs special methods. I’m not necessarily happy with the shortcut I’ve taken here, which is to say that I can treat the data as cross-sectional other than by including the lagged GDP growth variable. However, with all the variables in their stationary state, and with the residual series not showing signs of autocorrelation (more on this later), I think it’s fair enough in a pragmatic way.

I was able to cannibalise code from my previous posts with minimal amendments.

My first step is to estimate a good value for the alpha hyper-parameter of the elastic net. This is a value between 0 and 1 that sets the mix of the two penalty types (ridge and lasso) used to shrink coefficients towards zero, countering the way collinearity adds random noise to their estimates. I do this by fitting models over a grid of alpha values, using the cross-validation built into the glmnet package to find good values of lambda, the hyper-parameter that governs how much shrinkage there is.

```r
# I like Michael's answer at
# https://stats.stackexchange.com/questions/46719/multiple-imputation-and-model-selection:
# use in combination with bootstrap.
cv_results <- data_frame(lambda = numeric(), alpha = numeric(),
                         mcve = numeric(), imputation = numeric())

set.seed(765)
for(imp in 1:5){
  the_data <- mice::complete(ind_mi, imp)
  X <- as.matrix(select(the_data, -gdp_growth))
  Y <- the_data$gdp_growth
  alphas <- seq(from = 0, to = 1, length.out = 21)
  for(i in alphas){
    cvfit <- cv.glmnet(as.matrix(X), Y, alpha = i)
    tmp <- data_frame(lambda = cvfit$lambda, alpha = i,
                      mcve = cvfit$cvm, imputation = imp)
    cv_results <- rbind(cv_results, tmp)
  }
}

cv_res_grp <- cv_results %>%
  group_by(lambda, alpha) %>%
  summarise(mcve = mean(mcve)) %>%
  ungroup() %>%
  arrange(mcve)

cv_res_grp %>%
  mutate(ordered_alpha = as.factor(round(alpha, 3)),
         ordered_alpha = fct_reorder(ordered_alpha, mcve, .fun = min)) %>%
  ggplot(aes(x = lambda, y = sqrt(mcve), colour = ordered_alpha)) +
  geom_line(size = 2) +
  scale_x_log10(label = comma) +
  scale_colour_viridis("alpha", guide = guide_legend(reverse = TRUE), discrete = TRUE) +
  ggtitle("Cross-validation to select hyper parameters in elastic net regression") +
  scale_y_continuous("Square root of mean cross validation error", label = comma) +
  theme(legend.position = "right")

chosen_alpha <- arrange(cv_res_grp, mcve)[1, ]$alpha
```

This gives me a graphic like the following:

We can see that low values of mean error are achievable with many combinations of alpha and lambda, which is reassuring: the choice isn't critical. I store the best value of alpha for use in the actual bootstrapped model fitting to follow.

Here's the code that does the actual bootstrapped, multiply-imputed, elastic net regularized fitting of my model. It has four steps:

• define a function elastic() that takes a resampled set of data, performs a single imputation on it, determines a good value of lambda, fits the model and returns the values of the coefficients
• use the boot::boot function to run that function on 999 different samples with replacement of the original data
• tidy up the resulting percentile confidence intervals into a neat data frame
• draw a nice graphic showing the confidence intervals

Most of the code below is actually in the third and fourth of those steps:

```r
# Define a function that creates a single imputed set of data from a
# bootstrapped resample, then fits elastic net regression to it
elastic <- function(data, i){
  # create a single complete imputed set of data:
  the_data <- mice::complete(mice(data[i, ], m = 1, print = FALSE), 1)
  X <- as.matrix(dplyr::select(the_data, -gdp_growth))
  Y <- the_data$gdp_growth

  # We're going to scale the data so the results are in terms of standard
  # deviations of the variable in question, to make coefficients more
  # comparable. Note that this scaling takes place *within* the bootstrap,
  # as it is part of the analytical process we are simulating by resampling.
  X <- scale(X)

  # work out the best lambda for this dataset using cross validation:
  lambda <- cv.glmnet(as.matrix(X), Y, alpha = chosen_alpha)$lambda.min

  # fit the model:
  mod_glmnet <- glmnet(X, Y, alpha = chosen_alpha, lambda = lambda)

  # return the results:
  as.vector(coef(mod_glmnet))
}

# Now we run the bootstrap, using the function above.
set.seed(321)
elastic_bt <- boot(data = ind_data_wide2, statistic = elastic, R = 999)

# process results into a tidy data frame
boot_coefs <- data_frame(
  variable = character(),
  lower = numeric(),
  upper = numeric(),
  point = numeric()
)

var_names <- c("Intercept", names(ind_data_wide_names)[-2])

for(i in 1:length(var_names)){
  x <- boot.ci(elastic_bt, type = "perc", index = i)
  boot_coefs <- rbind(boot_coefs,
                      data_frame(variable = var_names[i],
                                 lower = x$percent[4],
                                 upper = x$percent[5],
                                 point = x$t0))
}

# draw graphic
boot_coefs %>%
  filter(!variable %in% c("yr_num", "Intercept")) %>%
  # next line is there in case we do still want the time trend in the graphic.
  # But it's never significant, and it's hard to explain.
  mutate(variable = ifelse(variable == "yr_num", "Linear\ntime trend", variable)) %>%
  mutate(variable = fct_reorder(variable, lower)) %>%
  ggplot(aes(y = variable)) +
  geom_vline(xintercept = 0, colour = "black") +
  geom_segment(aes(yend = variable, x = lower, xend = upper),
               size = 3, colour = "steelblue", alpha = 0.5) +
  geom_point(aes(x = point)) +
  scale_x_continuous(label = percent) +
  labs(x = "Estimated impact in percentage points on coming quarter's GDP growth, of change in one standard deviation in the explanatory variable",
       y = "",
       caption = "95% confidence intervals based on bootstrapped elastic net regularized regression, with electronic card transactions imputed differently for each bootstrap resample. Analysis by http://freerangestats.info") +
  ggtitle("Previous quarter's economic growth (+ve) and food prices (-ve) most useful as leading indicators of this quarter's New Zealand economic growth",
          str_wrap("Variables considered are official statistics available from Stats NZ every month, within a month; plus the OECD business confidence measure (which is based on NZIER's Quarterly Survey of Business Opinion); and the trade weighted index for currency published by RBNZ. Data goes back to 1987.", 80))
```

And I'm sticking to that (the same graphic I started the post with) as my best results.

Time series method

I've mentioned a couple of times my nervousness about how I've inadequately treated the time series nature of these data. Here's a quick look at the residuals of one version of my model (ie one set of imputed values). It reassures me that there is no evidence of the residuals being autocorrelated, but I still think I'm cutting corners:

```r
lambda <- cv.glmnet(as.matrix(X), Y, alpha = chosen_alpha)$lambda.min
mod_glmnet <- glmnet(X, Y, alpha = chosen_alpha, lambda = lambda)
residuals <- ts(Y - predict(mod_glmnet, newx = X), start = c(1987, 4), frequency = 4)

# Fail to reject the null hypothesis of no autocorrelation (which means it is
# probably ok to have used the above method, pretending the data are
# cross-sectional other than by including the lagged GDP)
Box.test(residuals, lag = 4, type = "Ljung-Box")

par(family = "myfont", font.main = 1, bty = "l")
tsdisplay(residuals,
          main = "Time series of residuals from regression with imputation and regularization")
```

As an alternative, I used forecast::auto.arima() to model an imputed set of the data explicitly as a time series. This gets me some results that differ slightly from the above, but are perhaps broadly compatible:

| Coefficient | 2.5 % | 97.5 % |
|---|---|---|
| ar1 | 0.7812 | 1.0434 |
| ma1 | -1.8363 | -1.5284 |
| ma2 | 0.6152 | 0.9077 |
| sar1 | 0.0547 | 0.5393 |
| sar2 | -0.4780 | -0.0796 |
| sma1 | -1.0269 | -0.7013 |
| ect_growth | 0.0003 | 0.0018 |
| bc_sa | 0.0009 | 0.0036 |
| cars_growth | -0.0001 | 0.0017 |
| com_vcl_growth | -0.0007 | 0.0010 |
| iva_growth | -0.0003 | 0.0017 |
| bci_growth | -0.0010 | 0.0004 |
| twi_growth | -0.0002 | 0.0013 |
| lst_growth | -0.0001 | 0.0016 |
| goods_growth | -0.0011 | 0.0008 |
| fpi_growth | -0.0023 | -0.0003 |

The key difference is that business confidence (bc_sa) and, to a lesser extent, electronic card transactions (ect_growth) are more strongly significant in this version. Food price (fpi_growth) is still strongly negative, and other confidence intervals all encompass zero. Note that lagged GDP growth doesn’t appear in this model explicitly – it comes in automatically as the ar1 (auto-regression at lag 1) term of the algorithmically-chosen ARIMA model.

The results are similar enough that I decide I’m ok with the approach I’ve taken, so I don’t invest in what would be quite an effort (but eminently possible with time) to introduce multiple imputation and regularization into the time series model.

Here’s the code for that bit of time series modelling:

```r
set.seed(234)
ind_imp <- mice::complete(mice(ind_data_wide2, m = 1, print = FALSE), 1)
Y <- ts(ind_imp$gdp_growth, start = c(1987, 4), frequency = 4)
X <- scale(select(ind_imp, -yr_num, -gdp_growth, -gdp_growth_lag))
mod <- auto.arima(Y, xreg = X)

confint(mod) %>% tidy %>% kable(digits = 4)
```

All the analysis code, as well as the data downloading and grooming code (which is the bulk of it…), is available in the GitHub repository for this project. The code excerpts above aren't self-sufficient without the rest of the project, which has a folder structure that doesn't lend itself to representation in a simple blog post; you need to clone the project to get it to work. For the record, here are the R packages used:

```r
library(tidyverse)
library(lubridate)
library(scales)
library(seasonal)
library(GGally)
library(openxlsx)
library(mice)
library(forecast)
library(glmnet)
library(viridis)
library(boot)
library(broom)
library(stargazer)
library(knitr)
```

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

### New Course: Fundamentals of Bayesian Data Analysis in R

Thu, 08/09/2018 - 20:06

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Here is the course link.
Course Description

Bayesian data analysis is an approach to statistical modeling and machine learning that is becoming more and more popular. It provides a uniform framework to build problem-specific models that can be used for both statistical inference and prediction. This course will introduce you to Bayesian data analysis: what it is, how it works, and why it is a useful tool to have in your data science toolbox.

• Chapter 1: What is Bayesian Data Analysis? (Free) This chapter will introduce you to Bayesian data analysis and give you a feel for how it works.
• Chapter 2: How does Bayesian inference work? In this chapter we will take a detailed look at the foundations of Bayesian inference.
• Chapter 3: Why use Bayesian Data Analysis? This chapter will show you four reasons why Bayesian data analysis is a useful tool to have in your data science tool belt.
• Chapter 4: Bayesian inference with Bayes' theorem. Learn what Bayes' theorem is all about and how to use it for statistical inference.
• Chapter 5: More parameters, more data, and more Bayes. Learn about using the Normal distribution to analyze continuous data and try out a tool for practical Bayesian analysis in R.

Prerequisite

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming.
### Proteomics Data Analysis (1/3): Data Acquisition and Cleaning

Thu, 08/09/2018 - 14:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

The analysis of DNA and RNA, the blueprint of life and its carbon copy, has become a staple in the burgeoning field of molecular biology. An emerging and exciting area of study that adds another dimension to our understanding of cellular biology is proteomics. The use of mass spectrometry has enabled the identification and quantification of thousands of proteins in a single sample run. In this tutorial series, I will break down the steps of processing a high-throughput proteomics data set derived from mass spectrometry analysis as follows:

• Data acquisition and cleaning
• Data filtering and missing value imputation
• Data visualization and interpretation

Source of Proteomics Data

To obtain a sample data set, I combed through a proteomics data repository called PRIDE and found an interesting study on drug resistance in breast cancer cell lines. I downloaded the raw files, which are the output from mass spectrometry analysis, and processed them using a software package called MaxQuant to map the spectral data to protein sequences. A total of six raw files, corresponding to three replicates each of two cell lines (one resistant, one non-resistant), were used. There are numerous other ways to process mass spectrometry data (e.g. Mascot, SEQUEST, ProteinProspector), so the final data table of protein abundance measurements will vary based on the approach. The starting point for this tutorial is the MaxQuant ProteinGroups output file, which can be downloaded here.

Data Acquisition

The first step is to read the tab-separated data file into R.

```r
raw = read.delim("proteinGroups.txt", stringsAsFactors = FALSE, colClasses = "character")
```

Our raw data is an enormous 1787-by-79 data frame, with proteins arranged in rows and their information in columns.
The primary columns of interest are those containing intensity measurements, which reflect protein abundance.

```r
grep("^LFQ.intensity", names(raw), value = TRUE)
## [1] "LFQ.intensity.Parental_bR1"  "LFQ.intensity.Parental_bR2"
## [3] "LFQ.intensity.Parental_bR3"  "LFQ.intensity.Resistant_bR1"
## [5] "LFQ.intensity.Resistant_bR2" "LFQ.intensity.Resistant_bR3"
```

Again, we have a total of six samples. The Parental samples contain intensity data from the breast cancer cell line SKBR3, while the Resistant samples come from a drug-resistant cell line derived by culturing the parentals in the presence of an EGFR inhibitor. For more information regarding the study, please see the original publication.

Data Clean Up

Remove False Hits

The next step after data acquisition is to clean and organize our data. The first order of business is to remove the false hits, including contaminants, reverse proteins, and proteins identified by site. These are annotated with a "+" under the columns Potential.contaminant, Reverse, and Only.identified.by.site. We filter the data frame by keeping rows without a "+" annotation in any of the three columns.

```r
library(dplyr)
df = dplyr::filter(raw, Potential.contaminant != "+") %>%
  dplyr::filter(Reverse != "+") %>%
  dplyr::filter(Only.identified.by.site != "+")
```

Often there is a column indicating the confidence of the protein identification. In our case that is Q.value, which represents the probability that the protein is a false hit. A typical cutoff is 0.01. Fortunately, MaxQuant takes care of this operation and all Q values are below the threshold.

```r
summary(as.numeric(df$Q.value))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## 0.0000000 0.0000000 0.0000000 0.0007193 0.0012759 0.0091429
```

Define Protein and Gene IDs

A quick look at the Protein.IDs and Fasta.headers columns tells us that the protein IDs, protein names, and gene IDs are not clearly displayed.

```r
head(df$Protein.IDs)
## [1] "sp|A0AV96|RBM47_HUMAN" "sp|A0FGR8|ESYT2_HUMAN" "sp|A1L0T0|ILVBL_HUMAN"
## [4] "sp|A4D1S0|KLRG2_HUMAN" "sp|A5YKK6|CNOT1_HUMAN" "sp|A6NDG6|PGP_HUMAN"

head(df$Fasta.headers)
## [1] ">sp|A0AV96|RBM47_HUMAN RNA-binding protein 47 OS=Homo sapiens GN=RBM47 PE=1 SV=2"
## [2] ">sp|A0FGR8|ESYT2_HUMAN Extended synaptotagmin-2 OS=Homo sapiens GN=ESYT2 PE=1 SV=1"
## [3] ">sp|A1L0T0|ILVBL_HUMAN Acetolactate synthase-like protein OS=Homo sapiens GN=ILVBL PE=1 SV=2"
## [4] ">sp|A4D1S0|KLRG2_HUMAN Killer cell lectin-like receptor subfamily G member 2 OS=Homo sapiens GN=KLRG2 PE=1 SV=3"
## [5] ">sp|A5YKK6|CNOT1_HUMAN CCR4-NOT transcription complex subunit 1 OS=Homo sapiens GN=CNOT1 PE=1 SV=2"
## [6] ">sp|A6NDG6|PGP_HUMAN Glycerol-3-phosphate phosphatase OS=Homo sapiens GN=PGP PE=1 SV=1"
```

We will use regular expressions to extract the protein names into a column named Protein.name, UniProt protein IDs into Protein, and gene IDs into Gene. Note that some rows may have multiple IDs separated by semicolons. In those instances, we will isolate the first one available.

```r
# Isolate the first ID
df$Protein.IDs = sub(";.*", "", df$Protein.IDs)
df$Fasta.headers = sub(";.*", "", df$Fasta.headers)

# Extract protein name
regex = regexpr("(?<=_HUMAN.).*(?=.OS)", df$Fasta.headers, perl = TRUE)
df$Protein.name = regmatches(df$Fasta.headers, regex)

# Extract UniProt ID
regex = regexpr("(?<=\\|).*(?=\\|)", df$Protein.IDs, perl = TRUE)
df$Protein = regmatches(df$Protein.IDs, regex)

# Extract gene ID
regex = regexpr("((?<=\\|[[:alnum:]]{6}\\|).*(?=_HUMAN)|(?<=\\|[[:alnum:]]{10}\\|).*(?=_HUMAN))",
                df$Protein.IDs, perl = TRUE)
df$Gene = regmatches(df$Protein.IDs, regex)
```

Transform Intensity Columns

Because of how we called the function that read in the data table, all columns are cast as the character data type. It will benefit our analysis to convert the intensity columns to the proper numeric data type.

```r
intensity.names = grep("^LFQ.intensity", names(raw), value = TRUE)
df[intensity.names] = sapply(df[intensity.names], as.numeric)
```

Now let us examine the distribution of protein intensities in a sample. Below is a histogram of the Parental_bR1 protein intensities. The distribution is clearly skewed to the right, with a few highly abundant proteins. To normalize the distribution, it is common practice to log2-transform the intensity data.

```r
LOG.names = sub("^LFQ.intensity", "LOG2", intensity.names)  # rename intensity columns
df[LOG.names] = log2(df[intensity.names])
```

Let's check out the transformed distribution of Parental_bR1 (much better!):

Summary

This is the first of three tutorials on proteomics data analysis. I have outlined the steps to read and clean a typical mass spectrometry-based proteomics data set. In the next tutorial, we will examine the data in greater detail. In doing so, we will find that only a handful of proteins are quantified across all samples. In other words, proteins are often picked up in one sample but not in the others. This is known as the missing value problem.
Stick around to explore techniques for filtering proteins based on the number of valid values and filling in the missing values using data imputation.

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

### EARL London interviews – Catherine Gamble, Marks and Spencer

Thu, 08/09/2018 - 11:05

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

For today's interview, Ruth Thomson, Practice Lead for Strategic Advice, spoke to Catherine Gamble, Data Scientist at Marks and Spencer. Catherine is presenting "Using R to Drive Revenue for your Online Business" at EARL London, and we got the chance to get a preview of the use case she'll be presenting.

Thanks Catherine for this interview.

What was the business need or opportunity that led to this project?

As an online retailer, we know that the actions we take, for example any changes we make to our website, have an impact on our financial results. However, when multiple changes are being made or campaigns are being run at the same time, it can be hard to separate which action led to the desired result. From a strategy and planning perspective, we knew it would be valuable to be able to predict the direct impact of any actions we took, before we made them.

How did you go about solving this problem?
I developed a predictive model to explore the relationships between action and result. As a result, I was able to identify which actions would have an impact on our KPIs.

What value did your project deliver?

We now have clear insight which feeds into our strategic decision making. As a result, we have had a positive impact on our KPIs and there has been a positive financial impact.

What would you say were the elements that made your project a success?

Support from the team – one of the key drivers of success in this project was the time I was given to explore different techniques and models and to learn.

Curiosity – this project came about because I was curious about the patterns in the data and wanted to explore some questions around things we were seeing.

What other businesses do you think would benefit from this use case?

Any online retailer that runs multiple sales, marketing and development events and campaigns at the same time. It would also be useful for businesses that have a sales funnel and want to explore how the actions they take impact the results.

To hear Catherine's full talk and others like it, join us at EARL London this September!

To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.
### Beyond Basic R – Plotting with ggplot2 and Multiple Plots in One Figure

Thu, 08/09/2018 - 02:00

(This article was first published on The USGS OWI blog, and kindly contributed to R-bloggers)

R can create almost any plot imaginable, and as with most things in R, if you don't know where to start, try Google. The Introduction to R curriculum summarizes some of the most used plots, but cannot begin to expose people to the breadth of plot options that exist. There are existing resources that are great references for plotting in R, both in base R and in ggplot2. In the Introduction to R class, we have switched to teaching ggplot2 because it works nicely with other tidyverse packages (dplyr, tidyr) and can create interesting and powerful graphics with little code.

While ggplot2 has many useful features, this blog post will explore how to create figures with multiple ggplot2 plots. You may have already heard of ways to put multiple R plots into a single figure – specifying the mfrow or mfcol arguments to par, split.screen, and layout are all ways to do this. However, there are other methods that are optimized for ggplot2 plots.

Multiple plots in one figure using ggplot2 and facets

When you are creating multiple plots and they share axes, you should consider using facet functions from ggplot2 (facet_grid, facet_wrap). You write your ggplot2 code as if you were putting all of the data onto one plot, and then you use one of the faceting functions to specify how to slice up the graph.

Let's start by considering a set of graphs with a common x axis. You have a data.frame with four columns: Date, site_no, parameter, and value. You want three different plots in the same figure – a timeseries for each of the parameters, with different colored symbols for the different sites. Sounds like a lot, but facets can make this very simple. First, set up your ggplot code as if you aren't faceting.
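As a quick point of comparison, the base-graphics route mentioned above (par with mfrow) looks roughly like this. This is a minimal sketch using the built-in mtcars data, not part of this post's workflow:

```r
# Three side-by-side base-R plots: mfrow = c(rows, cols) fills the grid row-wise
old_par <- par(mfrow = c(1, 3))

plot(mtcars$wt,   mtcars$mpg, main = "mpg vs wt")
plot(mtcars$hp,   mtcars$mpg, main = "mpg vs hp")
plot(mtcars$disp, mtcars$mpg, main = "mpg vs disp")

par(old_par)  # restore the previous layout
```

This works, but every panel is a separate plot call, which is exactly the bookkeeping that ggplot2's faceting takes care of for you.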
We will download USGS water data for use in this example from the USGS National Water Information System (NWIS) using the dataRetrieval package (you can learn more about dataRetrieval in this curriculum). Three USGS gage sites in Wisconsin were chosen because they have data for all three water quality parameters (flow, total suspended solids, and inorganic nitrogen) we are using in this example.

library(dataRetrieval)
library(dplyr) # for rename & select
library(tidyr) # for gather
library(ggplot2)

# Get the data by giving site numbers and parameter codes
# 00060 = stream flow, 00530 = total suspended solids, 00631 = concentration of inorganic nitrogen
wi_daily_wq <- readNWISdv(siteNumbers = c("05430175", "05427880", "05427927"),
                          parameterCd = c("00060", "00530", "00631"),
                          startDate = "2017-08-01", endDate = "2017-08-31")

# Clean up data to have human-readable names + move data into long format
wi_daily_wq <- renameNWISColumns(wi_daily_wq, p00530 = "TSS", p00631 = "InorganicN") %>%
  select(-ends_with("_cd")) %>%
  gather(key = "parameter", value = "value", -site_no, -Date)

# Setup plot without facets
p <- ggplot(data = wi_daily_wq, aes(x = Date, y = value)) +
  geom_point(aes(color = site_no)) +
  theme_bw()

# Now, we can look at the plot and see how it looks before we facet
# Obviously, the scales are off because we are plotting flow with concentrations
p

Now, we know that we can't keep these different parameters on the same plot. We could have written code to filter the data frame to the appropriate values and make a plot for each of them, but we can also take advantage of facet_grid. Since the resulting three plots that we want will all share an x axis (Date), we can imagine slicing up the figure in the vertical direction so that the x axis remains intact but we end up with three different y axes. We can do this using facet_grid and a formula syntax, y ~ x.
So, if you want to divide the figure along the y axis, the first entry in the formula is the variable in the data that decides which plot the data goes into. You can use a . if you do not want to divide the plot in the other direction.

# Add vertical facets, aka divide the plot up vertically since they share an x axis
p + facet_grid(parameter ~ .)

The result is a figure divided along the y axis based on the unique values of the parameter column in the data.frame. So, we have three plots in one figure. They still all share the same axes, which works for the x axis but not for the y axes. We can change that by letting the y axes scale freely to the data that appears just on that facet. Add the argument scales to facet_grid and specify that they should be "free" rather than the default "fixed".

# Add vertical facets, but scale only the y axes freely
p + facet_grid(parameter ~ ., scales = "free_y")

From here, there might be a few things you want to change about how it's labelling the facets. We would probably want the y axis labels to say the parameter and units on the left side. So, we can adjust how the facets are labeled and styled to become our y axis labels.

p + facet_grid(parameter ~ ., scales = "free_y",
               switch = "y", # flip the facet labels along the y axis from the right side to the left
               labeller = as_labeller( # redefine the text that shows up for the facets
                 c(Flow = "Flow, cfs", InorganicN = "Inorganic N, mg/L", TSS = "TSS, mg/L"))) +
  ylab(NULL) + # remove the word "values"
  theme(strip.background = element_blank(), # remove the background
        strip.placement = "outside") # put labels to the left of the axis text

There are still other things you can do with facets, such as using space = "free". The Cookbook for R facet examples have even more to explore!
Using cowplot to create multiple plots in one figure

When you are creating multiple plots and they do not share axes or do not fit into the facet framework, you could use the packages cowplot or patchwork (very new!), or the grid.arrange function from gridExtra. In this blog post, we will show how to use cowplot, but you can explore the features of patchwork here. The package called cowplot has nice wrapper functions for ggplot2 plots to have shared legends, put plots into a grid, annotate plots, and more. Below is some code that shows how to use some of these helpful cowplot functions to create a figure that has three plots and a shared title.

Just as in the previous example, we will download USGS water data from the USGS NWIS using the dataRetrieval package (find out more about dataRetrieval in this curriculum). This USGS gage site on the Yahara River in Wisconsin was chosen because it has data for all three water quality parameters (flow, total suspended solids, and inorganic nitrogen) we are using in this example.

library(dataRetrieval)
library(dplyr) # for rename
library(tidyr) # for gather
library(ggplot2)
library(cowplot)

# Get the data
yahara_daily_wq <- readNWISdv(siteNumbers = "05430175",
                              parameterCd = c("00060", "00530", "00631"),
                              startDate = "2017-08-01", endDate = "2017-08-31")

# Clean up data to have human-readable names
yahara_daily_wq <- renameNWISColumns(yahara_daily_wq, p00530 = "TSS", p00631 = "InorganicN")

# Create the three different plots
flow_timeseries <- ggplot(yahara_daily_wq, aes(x=Date, y=Flow)) +
  geom_point() + theme_bw()

yahara_daily_wq_long <- gather(yahara_daily_wq, Nutrient, Nutrient_va, TSS, InorganicN)
nutrient_boxplot <- ggplot(yahara_daily_wq_long, aes(x=Nutrient, y=Nutrient_va)) +
  geom_boxplot() + theme_bw()

tss_flow_plot <- ggplot(yahara_daily_wq, aes(x=Flow, y=TSS)) +
  geom_point() + theme_bw()

# Create Flow timeseries plot that spans the grid by making one plot_grid
# and then nest it inside of a second.
# Also, include a title at the top for the whole figure.
title <- ggdraw() + draw_label("Conditions for site 05430175", fontface='bold')

bottom_row <- plot_grid(nutrient_boxplot, tss_flow_plot, ncol = 2, labels = "AUTO")
plot_grid(title, bottom_row, flow_timeseries, nrow = 3,
          labels = c("", "", "C"), rel_heights = c(0.2, 1, 1))

To leave a comment for the author, please follow the link and comment on their blog: The USGS OWI blog.

### Highcharting Jobs Friday

Thu, 08/09/2018 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Today, in honor of last week's jobs report from the Bureau of Labor Statistics (BLS), we will visualize jobs data with ggplot2 and then, more extensively, with highcharter. Our aim is to explore highcharter and its similarity with ggplot and to create some nice interactive visualizations. In the process, we will cover how to import BLS data from FRED and then wrangle it for visualization. We won't do any modeling or statistical analysis today, though it wouldn't be hard to extend this script into a forecasting exercise. One nice thing about today's code flow is that it can be refreshed and updated on each BLS release date. Let's get to it!

We will source our data from FRED and will use the tq_get() function from tidyquant, which enables us to import many data series at once in tidy, tibble format.
We want to get total employment numbers, ADP estimates, and the sector-by-sector numbers that make up total employment. Let's start by creating a tibble to hold the FRED codes and more intuitive names for each data series.

library(tidyverse)
library(tidyquant)

codes_names_tbl <- tribble(
  ~ symbol,   ~ better_names,
  "NPPTTL",   "ADP Estimate",
  "PAYEMS",   "Nonfarm Employment",
  "USCONS",   "Construction",
  "USTRADE",  "Retail/Trade",
  "USPBS",    "Prof/Bus Serv",
  "MANEMP",   "Manufact",
  "USFIRE",   "Financial",
  "USMINE",   "Mining",
  "USEHS",    "Health Care",
  "USWTRADE", "Wholesale Trade",
  "USTPU",    "Transportation",
  "USINFO",   "Info Sys",
  "USLAH",    "Leisure",
  "USGOVT",   "Gov",
  "USSERV",   "Other Services"
)

Now we pass the symbol column to tq_get().

fred_empl_data <- tq_get(codes_names_tbl$symbol,
                         get = "economic.data",
                         from = "2007-01-01")

We have our data but look at the symbol column.

fred_empl_data %>%
  group_by(symbol) %>%
  slice(1)

# A tibble: 15 x 3
# Groups: symbol [15]
   symbol   date        price
 1 MANEMP   2007-01-01  14008
 2 NPPTTL   2007-01-01 115437.
 3 PAYEMS   2007-01-01 137497
 4 USCONS   2007-01-01   7725
 5 USEHS    2007-01-01  18415
 6 USFIRE   2007-01-01   8389
 7 USGOVT   2007-01-01  22095
 8 USINFO   2007-01-01   3029
 9 USLAH    2007-01-01  13338
10 USMINE   2007-01-01    706
11 USPBS    2007-01-01  17834
12 USSERV   2007-01-01   5467
13 USTPU    2007-01-01  26491
14 USTRADE  2007-01-01  15443.
15 USWTRADE 2007-01-01   5969.

The symbols are the FRED codes, which are unrecognizable unless you have memorized how those codes map to more intuitive names. Let’s replace them with the better_names column of codes_names_tbl. We will do this with a left_join(). (This explains why I labeled our original column as symbol – it makes the left_join() easier.) Special thanks to Jenny Bryan for pointing out this code flow!

fred_empl_data %>%
  left_join(codes_names_tbl, by = "symbol") %>%
  select(better_names, everything(), -symbol) %>%
  group_by(better_names) %>%
  slice(1)

# A tibble: 15 x 3
# Groups: better_names [15]
   better_names       date        price
 1 ADP Estimate       2007-01-01 115437.
 2 Construction       2007-01-01   7725
 3 Financial          2007-01-01   8389
 4 Gov                2007-01-01  22095
 5 Health Care        2007-01-01  18415
 6 Info Sys           2007-01-01   3029
 7 Leisure            2007-01-01  13338
 8 Manufact           2007-01-01  14008
 9 Mining             2007-01-01    706
10 Nonfarm Employment 2007-01-01 137497
11 Other Services     2007-01-01   5467
12 Prof/Bus Serv      2007-01-01  17834
13 Retail/Trade       2007-01-01  15443.
14 Transportation     2007-01-01  26491
15 Wholesale Trade    2007-01-01   5969.

That looks much better, but we now have a column called price, which holds the monthly employment observations, and a column called better_names, which holds the more intuitive group names. Let's change those column names to employees and sector.

fred_empl_data <- fred_empl_data %>%
  left_join(codes_names_tbl, by = "symbol") %>%
  select(better_names, everything(), -symbol) %>%
  rename(employees = price, sector = better_names)

head(fred_empl_data)

# A tibble: 6 x 3
  sector       date       employees
1 ADP Estimate 2007-01-01   115437.
2 ADP Estimate 2007-02-01   115527.
3 ADP Estimate 2007-03-01   115647
4 ADP Estimate 2007-04-01   115754.
5 ADP Estimate 2007-05-01   115809.
6 ADP Estimate 2007-06-01   115831.

fred_empl_data has the names and organization we want, but it still has the raw number of employees per month. We want to visualize the month-to-month change in jobs numbers, which means we need to perform a calculation on our data and store it in a new column. We use mutate() to create the new column and calculate the monthly change with employees - lag(employees, 1). We are not doing any annualizing or seasonality work here; it's a simple subtraction. For yearly change, it would be employees - lag(employees, 12).

empl_monthly_change <- fred_empl_data %>%
  group_by(sector) %>%
  mutate(monthly_change = employees - lag(employees, 1)) %>%
  na.omit()

Our final data object empl_monthly_change is tidy, has intuitive names in the group column, and has the monthly change that we wish to visualize. Let’s build some charts.

We will start at the top and use ggplot to visualize how total non-farm employment (Sorry farmers. Your jobs don’t count, I guess) has changed since 2007. We want an end-user to quickly glance at the chart and find the months with positive jobs growth and negative jobs growth. That means we want months with positive jobs growth to be one color, and those with negative jobs growth to be another color. There is more than one way to accomplish this, but I like to create new columns and then add geoms based on those columns. (Check out this post by Freddie Mac’s Len Kiefer for another way to accomplish this by nesting ifelse statements in ggplot's aesthetics. In fact, if you like data visualization, check out all the stuff that Len writes.)

Let’s walk through how to create columns for shading by positive or negative jobs growth. First, we are looking at total employment here, so we call filter(sector == "Nonfarm Employment") to get only total employment.

Next, we create two new columns with mutate(). The first is called col_pos and is formed by if_else(monthly_change > 0, monthly_change,...). That logic is creating a column that holds the value of monthly change if monthly change is positive, else it holds NA. We then create another column called col_neg using the same logic.

empl_monthly_change %>%
  filter(sector == "Nonfarm Employment") %>%
  mutate(col_pos = if_else(monthly_change > 0, monthly_change, as.numeric(NA)),
         col_neg = if_else(monthly_change < 0, monthly_change, as.numeric(NA))) %>%
  dplyr::select(sector, date, col_pos, col_neg) %>%
  head()

# A tibble: 6 x 4
# Groups: sector [1]
  sector             date       col_pos col_neg
1 Nonfarm Employment 2007-02-01      85      NA
2 Nonfarm Employment 2007-03-01     214      NA
3 Nonfarm Employment 2007-04-01      59      NA
4 Nonfarm Employment 2007-05-01     153      NA
5 Nonfarm Employment 2007-06-01      77      NA
6 Nonfarm Employment 2007-07-01      NA     -30

Have a quick look at the col_pos and col_neg columns and make sure they look right: col_pos should have only positive and NA values, and col_neg should have only negative and NA values.

Now we can visualize our monthly changes with ggplot, adding a separate geom for those new columns.

empl_monthly_change %>%
  filter(sector == "Nonfarm Employment") %>%
  mutate(col_pos = if_else(monthly_change > 0, monthly_change, as.numeric(NA)),
         col_neg = if_else(monthly_change < 0, monthly_change, as.numeric(NA))) %>%
  ggplot(aes(x = date)) +
  geom_col(aes(y = col_neg), alpha = .85, fill = "pink", color = "pink") +
  geom_col(aes(y = col_pos), alpha = .85, fill = "lightgreen", color = "lightgreen") +
  ylab("Monthly Change (thousands)") +
  labs(title = "Monthly Private Employment Change",
       subtitle = "total empl, since 2008",
       caption = "inspired by @lenkiefer") +
  scale_x_date(breaks = scales::pretty_breaks(n = 10)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0))

That plot is nice, but it’s static! Hover on it and you’ll see what I mean.

Let’s head to highcharter and create an interactive chart that responds when we hover on it. By way of brief background, highcharter is an R hook into the fantastic highcharts JavaScript library. It’s free for personal use but a license is required for commercial use.

One nice feature of highcharter is that we can use very similar aesthetic logic to what we used for ggplot. It's not identical, but it's similar and lets us work with tidy data.

Before we get to the highcharter logic, we will add one column to our tibble to hold the color scheme for our positive and negative monthly changes. Notice how this is different from the ggplot flow above where we create one column to hold our positive changes for coloring and one column to hold our negative changes for coloring.

I want to color positive changes light blue and negative changes pink, and put the hex codes for those colors directly in the new column. The hex code for light blue is "#6495ed" and for pink is "#ffe6ea". Thus, we use ifelse to create a column called color_of_bars that holds "#6495ed" (light blue) when monthly_change is positive and "#ffe6ea" (pink) when it is negative.

total_employ_hc <- empl_monthly_change %>%
  filter(sector == "Nonfarm Employment") %>%
  mutate(color_of_bars = ifelse(monthly_change > 0, "#6495ed", "#ffe6ea"))

head(total_employ_hc)

# A tibble: 6 x 5
# Groups: sector [1]
  sector             date       employees monthly_change color_of_bars
1 Nonfarm Employment 2007-02-01    137582             85 #6495ed
2 Nonfarm Employment 2007-03-01    137796            214 #6495ed
3 Nonfarm Employment 2007-04-01    137855             59 #6495ed
4 Nonfarm Employment 2007-05-01    138008            153 #6495ed
5 Nonfarm Employment 2007-06-01    138085             77 #6495ed
6 Nonfarm Employment 2007-07-01    138055            -30 #ffe6ea

Now we are ready to start the highcharter flow.

We start by calling hchart to pass in our data object. Note the similarity to ggplot where we started with ggplot.

Now, instead of waiting for a call to geom_col, we set type = "column" to let hchart know that we are building a column chart. Next, we use hcaes(x = date, y = monthly_change, color = color_of_bars) to specify our aesthetics. Notice how we can control the colors of the bars from values in the color_of_bars column.

We also supply a name = "monthly change" because we want monthly change to appear when a user hovers on the chart. That wasn’t a consideration with ggplot.

library(highcharter)

hchart(total_employ_hc,
       type = "column",
       pointWidth = 5,
       hcaes(x = date, y = monthly_change, color = color_of_bars),
       name = "monthly change") %>%
  hc_title(text = "Monthly Employment Change") %>%
  hc_xAxis(type = "datetime") %>%
  hc_yAxis(title = list(text = "monthly change (thousands)")) %>%
  hc_exporting(enabled = TRUE)

Let’s stay in the highcharter world and visualize how each sector changed in the most recent month, which is July of 2018.

First, we isolate the most recent month by filtering on the last date. We also don’t want the ADP Estimate and filter that out as well.

empl_monthly_change %>%
  filter(date == (last(date))) %>%
  filter(sector != "ADP Estimate")

# A tibble: 14 x 4
# Groups: sector [14]
   sector             date       employees monthly_change
 1 Nonfarm Employment 2018-07-01    149128          157
 2 Construction       2018-07-01      7242           19
 3 Retail/Trade       2018-07-01     15944            7.1
 4 Prof/Bus Serv      2018-07-01     21019           51
 5 Manufact           2018-07-01     12751           37
 6 Financial          2018-07-01      8568           -5
 7 Mining             2018-07-01       735           -4
 8 Health Care        2018-07-01     23662           22
 9 Wholesale Trade    2018-07-01      5982.          12.3
10 Transportation     2018-07-01     27801           15
11 Info Sys           2018-07-01      2772            0
12 Leisure            2018-07-01     16371           40
13 Gov                2018-07-01     22334          -13
14 Other Services     2018-07-01      5873           -5

That filtered flow has the data we want, but we have two more tasks. First, we want to arrange this data so that it goes from smallest to largest. If we did not do this, our chart would still “work”, but the column heights would not progress from lowest to highest.

Second, we need to create another column to hold colors for negative and positive values, with the same ifelse() logic as we used before.

emp_by_sector_recent_month <- empl_monthly_change %>%
  filter(date == (last(date))) %>%
  filter(sector != "ADP Estimate") %>%
  arrange(monthly_change) %>%
  mutate(color_of_bars = if_else(monthly_change > 0, "#6495ed", "#ffe6ea"))

Now we pass that object to hchart, set type = "column", and choose our hcaes values. We want to label the x-axis with the different sectors and do that with hc_xAxis(categories = emp_by_sector_recent_month$sector). last_month <- lubridate::month(last(empl_monthly_change$date), label = TRUE, abbr = FALSE) hchart(emp_by_sector_recent_month, type = "column", pointWidth = 20, hcaes(x = sector, y = monthly_change, color = color_of_bars), showInLegend = FALSE) %>% hc_title(text = paste(last_month, "Employment Change", sep = " ")) %>% hc_xAxis(categories = emp_by_sector_recent_month$sector) %>% hc_yAxis(title = list(text = "Monthly Change (thousands)")) {"x":{"hc_opts":{"title":{"text":"July Employment Change"},"yAxis":{"title":{"text":"Monthly Change (thousands)"},"type":"linear"},"credits":{"enabled":false},"exporting":{"enabled":false},"plotOptions":{"series":{"turboThreshold":0,"showInLegend":false,"marker":{"enabled":true}},"treemap":{"layoutAlgorithm":"squarified"},"bubble":{"minSize":5,"maxSize":25},"scatter":{"marker":{"symbol":"circle"}}},"annotationsOptions":{"enabledButtons":false},"tooltip":{"delayForDisplay":10},"series":[{"group":"group","data":[{"sector":"Gov","date":"2018-07-01","employees":22334,"monthly_change":-13,"color_of_bars":"#ffe6ea","y":-13,"color":"#ffe6ea","name":"Gov"},{"sector":"Financial","date":"2018-07-01","employees":8568,"monthly_change":-5,"color_of_bars":"#ffe6ea","y":-5,"color":"#ffe6ea","name":"Financial"},{"sector":"Other Services","date":"2018-07-01","employees":5873,"monthly_change":-5,"color_of_bars":"#ffe6ea","y":-5,"color":"#ffe6ea","name":"Other Services"},{"sector":"Mining","date":"2018-07-01","employees":735,"monthly_change":-4,"color_of_bars":"#ffe6ea","y":-4,"color":"#ffe6ea","name":"Mining"},{"sector":"Info Sys","date":"2018-07-01","employees":2772,"monthly_change":0,"color_of_bars":"#ffe6ea","y":0,"color":"#ffe6ea","name":"Info 
Sys"},{"sector":"Retail/Trade","date":"2018-07-01","employees":15944,"monthly_change":7.10000000000036,"color_of_bars":"#6495ed","y":7.10000000000036,"color":"#6495ed","name":"Retail/Trade"},{"sector":"Wholesale Trade","date":"2018-07-01","employees":5981.5,"monthly_change":12.3000000000002,"color_of_bars":"#6495ed","y":12.3000000000002,"color":"#6495ed","name":"Wholesale Trade"},{"sector":"Transportation","date":"2018-07-01","employees":27801,"monthly_change":15,"color_of_bars":"#6495ed","y":15,"color":"#6495ed","name":"Transportation"},{"sector":"Construction","date":"2018-07-01","employees":7242,"monthly_change":19,"color_of_bars":"#6495ed","y":19,"color":"#6495ed","name":"Construction"},{"sector":"Health Care","date":"2018-07-01","employees":23662,"monthly_change":22,"color_of_bars":"#6495ed","y":22,"color":"#6495ed","name":"Health Care"},{"sector":"Manufact","date":"2018-07-01","employees":12751,"monthly_change":37,"color_of_bars":"#6495ed","y":37,"color":"#6495ed","name":"Manufact"},{"sector":"Leisure","date":"2018-07-01","employees":16371,"monthly_change":40,"color_of_bars":"#6495ed","y":40,"color":"#6495ed","name":"Leisure"},{"sector":"Prof/Bus Serv","date":"2018-07-01","employees":21019,"monthly_change":51,"color_of_bars":"#6495ed","y":51,"color":"#6495ed","name":"Prof/Bus Serv"},{"sector":"Nonfarm Employment","date":"2018-07-01","employees":149128,"monthly_change":157,"color_of_bars":"#6495ed","y":157,"color":"#6495ed","name":"Nonfarm Employment"}],"type":"column","pointWidth":20,"showInLegend":false}],"xAxis":{"type":"category","title":{"text":"sector"},"categories":["Gov","Financial","Other Services","Mining","Info Sys","Retail/Trade","Wholesale Trade","Transportation","Construction","Health Care","Manufact","Leisure","Prof/Bus Serv","Nonfarm Employment"]}},"theme":{"chart":{"backgroundColor":"transparent"}},"conf_opts":{"global":{"Date":null,"VMLRadialGradientURL":"http 
=//code.highcharts.com/list(version)/gfx/vml-radial-gradient.png","canvasToolsURL":"http =//code.highcharts.com/list(version)/modules/canvas-tools.js","getTimezoneOffset":null,"timezoneOffset":0,"useUTC":true},"lang":{"contextButtonTitle":"Chart context menu","decimalPoint":".","downloadJPEG":"Download JPEG image","downloadPDF":"Download PDF document","downloadPNG":"Download PNG image","downloadSVG":"Download SVG vector image","drillUpText":"Back to {series.name}","invalidDate":null,"loading":"Loading...","months":["January","February","March","April","May","June","July","August","September","October","November","December"],"noData":"No data to display","numericSymbols":["k","M","G","T","P","E"],"printChart":"Print chart","resetZoom":"Reset zoom","resetZoomTitle":"Reset zoom level 1:1","shortMonths":["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"thousandsSep":" ","weekdays":["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]}},"type":"chart","fonts":[],"debug":false},"evals":[],"jsHooks":[]} Finally, let’s compare the ADP Estimates to the actual Nonfarm payroll numbers since 2017. We start with filtering again. adp_bls_hc <- empl_monthly_change %>% filter(sector == "ADP Estimate" | sector == "Nonfarm Employment") %>% filter(date >= "2017-01-01") We create a column to hold different colors, but our logic is not whether a reading is positive or negative. We want to color the ADP and BLS reports differently. adp_bls_hc <- adp_bls_hc %>% mutate(color_of_bars = ifelse(sector == "ADP Estimate", "#ffb3b3", "#4d94ff")) head(adp_bls_hc) # A tibble: 6 x 5 # Groups: sector [1] sector date employees monthly_change color_of_bars 1 ADP Estimate 2017-01-01 123253. 245. #ffb3b3 2 ADP Estimate 2017-02-01 123533. 280. #ffb3b3 3 ADP Estimate 2017-03-01 123655 122. #ffb3b3 4 ADP Estimate 2017-04-01 123810. 155. #ffb3b3 5 ADP Estimate 2017-05-01 124012. 202. #ffb3b3 6 ADP Estimate 2017-06-01 124166. 154. 
#ffb3b3 tail(adp_bls_hc) # A tibble: 6 x 5 # Groups: sector [1] sector date employees monthly_change color_of_bars 1 Nonfarm Employment 2018-02-01 148125 324 #4d94ff 2 Nonfarm Employment 2018-03-01 148280 155 #4d94ff 3 Nonfarm Employment 2018-04-01 148455 175 #4d94ff 4 Nonfarm Employment 2018-05-01 148723 268 #4d94ff 5 Nonfarm Employment 2018-06-01 148971 248 #4d94ff 6 Nonfarm Employment 2018-07-01 149128 157 #4d94ff And now we pass that object to our familiar hchart flow. hchart(adp_bls_hc, type = 'column', hcaes(y = monthly_change, x = date, group = sector, color = color_of_bars), showInLegend = FALSE ) %>% hc_title(text = "ADP v. BLS") %>% hc_xAxis(type = "datetime") %>% hc_yAxis(title = list(text = "monthly change (thousands)")) %>% hc_add_theme(hc_theme_flat()) %>% hc_exporting(enabled = TRUE) {"x":{"hc_opts":{"title":{"text":"ADP v. BLS"},"yAxis":{"title":{"text":"monthly change (thousands)"},"type":"linear"},"credits":{"enabled":false},"exporting":{"enabled":true},"plotOptions":{"series":{"turboThreshold":0,"showInLegend":true,"marker":{"enabled":true}},"treemap":{"layoutAlgorithm":"squarified"},"bubble":{"minSize":5,"maxSize":25},"scatter":{"marker":{"symbol":"circle"}}},"annotationsOptions":{"enabledButtons":false},"tooltip":{"delayForDisplay":10},"series":[{"name":"ADP Estimate","data":[{"sector":"ADP Estimate","date":"2017-01-01","employees":123252.682,"monthly_change":245.051999999996,"color_of_bars":"#ffb3b3","x":1483228800000,"y":245.051999999996,"color":"#ffb3b3"},{"sector":"ADP Estimate","date":"2017-02-01","employees":123532.62,"monthly_change":279.937999999995,"color_of_bars":"#ffb3b3","x":1485907200000,"y":279.937999999995,"color":"#ffb3b3"},{"sector":"ADP Estimate","date":"2017-03-01","employees":123655,"monthly_change":122.380000000005,"color_of_bars":"#ffb3b3","x":1488326400000,"y":122.380000000005,"color":"#ffb3b3"},{"sector":"ADP 
That’s all for today. Try revisiting this script on September 7th, when the next BLS jobs data is released, and see if any new visualizations or code flows come to mind. See you next time and happy coding!

To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Le Monde puzzle [#1063]

Thu, 08/09/2018 - 00:18

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A simple (summertime?!) arithmetic Le Monde mathematical puzzle:

1. A “powerful integer” is one whose prime divisors all occur with multiplicity at least 2. Are there two powerful integers in a row, i.e. such that both n and n+1 are powerful?
2. Are there odd integers n such that n² – 1 is a powerful integer?

The first question can be solved by brute force.
Here is R code that leads to the solution:

```
# requires the numbers package for primeFactors()
isperfz <- function(n){
  divz <- primeFactors(n)
  facz <- unique(divz)
  ordz <- rep(0, length(facz))
  for (i in 1:length(facz)) ordz[i] <- sum(divz == facz[i])
  return(min(ordz) > 1)}

lesperf <- NULL
for (t in 4:1e5)
  if (isperfz(t)) lesperf <- c(lesperf, t)
twinz <- lesperf[diff(lesperf) == 1]
```

with solutions 8, 288, 675, 9800, 12167. The second puzzle means rerunning the code only on the integers n² – 1…

```
[1] 8
[1] 288
[1] 675
[1] 9800
[1] 235224
[1] 332928
[1] 1825200
[1] 11309768
```

except that I cannot exceed n² = 10⁸. (The Le Monde puzzles will now stop for a month, just like about everything in France!, and then a new challenge will take place. Stay tuned.)

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

### In case you missed it: July 2018 roundup

Wed, 08/08/2018 - 20:53

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from July of particular interest to R users.

- A program to validate quality and security for R packages: the Linux Foundation's CII Best Practices Badge Program.
- R scripts to generate images in the style of famous artworks, like Mondrian's.
- A 6-minute video tour of the AI and Machine Learning services in Azure, including R.
- The July roundup of AI, Machine Learning and Data Science news.
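If the numbers package is not to hand, the same multiplicity check can be written in base R with plain trial division and pointed straight at the n² – 1 values for odd n. A quick sketch: `is_powerful` is my own helper, and the bound of 999 on n is deliberately small so it runs in a blink (the original post went much further, up to n² = 10⁸).

```r
# Base-R check that every prime factor of n occurs with multiplicity >= 2
is_powerful <- function(n) {
  d <- 2
  while (d * d <= n) {
    if (n %% d == 0) {
      mult <- 0
      while (n %% d == 0) { n <- n / d; mult <- mult + 1 }
      if (mult < 2) return(FALSE)
    }
    d <- d + 1
  }
  n == 1  # any leftover factor here would be a prime with multiplicity 1
}

# Second puzzle: odd n such that n^2 - 1 is powerful
odd_n <- seq(3, 999, by = 2)
hits  <- odd_n[sapply(odd_n, function(n) is_powerful(n^2 - 1))]
hits  # starts 3, 17, 99, ... (n^2 - 1 = 8, 288, 9800, ...)
```

The first hits reproduce the powerful values 8, 288 and 9800 from the output above; the even-n case n = 26 explains the 675 that also appears there.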
- An R package for tiling hexagon-shaped images, used to create a striking banner of hex stickers for useR!2018.
- Video and R scripts from a workshop on creating an app to detect images of hotdogs.
- Microsoft has released a number of open data sets produced from its research programs.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

### Hotels vs Airbnb – Barcelona case study (proof of concept)

Wed, 08/08/2018 - 20:43

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Click here to access the map

1 – Background and motivation

During the last few years, the sharing economy has become more and more ubiquitous, from taxi-riding applications to DIY and so on. The real estate market is no exception, and we have seen more of these examples in the hotels and hospitality sector. Among them, Airbnb, a company based in San Francisco that operates an online marketplace and hospitality service, has had the biggest impact.
The main objective of this case study is to assess the impact of Airbnb by comparing hotel prices with the accommodation offered by Airbnb. An in-depth analysis would be preferable, but given the complexity of the data variables and the range of possible outcomes, we present an initial analysis of the pricing competition between Airbnb and the hotels within the city of Barcelona. We used two main datasets in this case study: hotel location and pricing data, and Airbnb listings. The hotels’ datasets were obtained by scraping the Tripadvisor.com website (more details on this shortly), and the Airbnb datasets were downloaded from the site insideairbnb.com.

2 – Scraping the TripAdvisor hotels dataset

Scraping has become more and more popular recently due to the amount of data present on the world wide web today. Scraping TripAdvisor allowed us to extract a complete list of the hotels in Barcelona (518 in total) together with price information (for 418 of them). The prices shown are expressed in USD for a room for two people booked for the night of the 22nd of October 2018. We chose this date mainly because it falls in the low season in Barcelona, which allows a more direct comparison with Airbnb prices.

For scraping, we used a Python environment and two main scraping libraries: Selenium and Scrapy. Selenium was needed because of the number of choices the TripAdvisor website allows for user selections; it let us accurately filter the data to show hotels available on the chosen date, the 22nd of October as previously mentioned. One of the downsides of Selenium is processing speed, so to speed up the scraping process we decided to eliminate any graphical content. With Selenium, we were able to get a complete list of hotels in Barcelona with their prices and a link to each hotel’s specific TripAdvisor page.
This link is what allows our Scrapy crawler to get the remaining information we were looking for: the location and star rating of each of the hotels. Scrapy was then used to speed up the gathering of information for each hotel, namely the full address, reviews, ranking and star rating. In the images below we can see both the exported hotels list and the hotel details.

3 – Hotels data exploration

Using cartoframes (CARTO’s Python environment for data scientists) we can easily see that the hotels are mainly concentrated in the centre of Barcelona, with higher booking prices in this area as well.

4 – Price comparison

Initially, we plotted the distribution of prices for both hotels and Airbnb listings to see the big picture of how they compare. This initial analysis already yields some first conclusions: hotels are in general more expensive than Airbnb locations.
For a more detailed comparison, we used the simple yet effective machine learning algorithm of nearest neighbours. For this purpose, we used R and the package nngeo, which computes K-nearest-neighbour calculations taking into account the geographic attributes of the locations considered. We computed the K = 20 nearest neighbours, limiting our search to locations within a radius of 500 m. Both the selected Airbnb locations and the lines connecting them to each hotel are represented on the map. To compare like for like, only Airbnb listings with a ‘Real Bed’ and accommodation for between 1 and 3 people were considered. Finally, we computed the average price of the 20 locations nearest to each target hotel and the difference from the actual hotel price. This is the main dataset presented in this study, shown on the interactive map built with CARTO.
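The neighbour search just described can be sketched in a few lines with nngeo. This is a toy illustration, not the study's actual code: the layer and column names (hotels, airbnb, price) are hypothetical, the coordinates are made up, and k is reduced to 2 so the example is self-contained.

```r
library(sf)
library(nngeo)

# Toy projected point layers standing in for the real data (hypothetical names)
hotels <- st_as_sf(data.frame(x = c(0, 400), y = c(0, 0), price = c(150, 90)),
                   coords = c("x", "y"))
airbnb <- st_as_sf(data.frame(x = c(50, 100, 420, 900), y = c(0, 0, 0, 0),
                              price = c(80, 60, 70, 200)),
                   coords = c("x", "y"))

# K nearest Airbnb neighbours per hotel, capped at a 500 m search radius
nn <- st_nn(hotels, airbnb, k = 2, maxdist = 500, returnDist = TRUE)

# Average neighbour price, and the gap to each hotel's own price
avg_nn_price <- sapply(nn$nn, function(idx) mean(airbnb$price[idx]))
price_gap    <- hotels$price - avg_nn_price
```

The study's version uses k = 20, real coordinates and the 500 m cap in geographic units; the price gap per hotel is what drives the colours on the interactive map.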
Some of the main conclusions that can be extracted from the analysis are: • Budget hotels (1 and 2 stars) are the less affected by Airbnb listings with just a difference of$14 dollars per night among them.
• Hotels more centrally located show more fierce competitions with their Airbnb neighbours, with less central hotels less affected.
• Luxury hotels are the ones that present biggest differences, but as mentioned previously we should further work on the Airbnb data to make like for like comparisons in this segment of the market.
• Finally, there are a few hotels that show a cheaper price than their Airbnb neighbours, so look for those!

In conclusion, this type of analysis shows how individual hotels are affected by Airbnb competition, and it could also be used to decide where to locate a new hotel.


### The Hawkes Hand

Wed, 08/08/2018 - 20:09

(This article was first published on rstats – MikeJackTzen, and kindly contributed to R-bloggers)


Thanks to Luke Bornn & Tim Swartz, I had the opportunity to present the Hawkes Hand at CASSIS18. I learned a lot, had a hell of a time & met some great people #RuffRuff. My Hawkes Hand slides are here.

“The Hot Hand is real. If not, why do players even warm up before the game starts?” – Me

In the 80s, GVT (Gilovich, Vallone and Tversky) looked at the hot hand through a difference of conditional probabilities that depends on the analyst explicitly conditioning on shooting-streak sequences. When the sequence is a string of 3 makes, this is like “NBA Jam,” a video game from the 90s.

Are there alternative ways to frame the hot hand other than conditioning on sequential coin-flip orderings (H is a make and T is a miss)? If you observed an HHH sequence you might naturally say that the player was hot. But what if I told you that the made shots were evenly spread across 48 minutes? If the player attempted and made only those 3 shots, at the cumulative minute marks of 11:30, 23:30, and 35:30, would you still say that the player was hot in the 4th quarter? Alternatively, even a sequence HHT that contains a miss could, if observed within a short time frame (3 minutes), very well provide information towards a player heating up. Temporal information is important when contextualizing shooting streaks of makes or misses, information NOT used by the GVT approach.
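The GVT-style quantity discussed above can be sketched in a few lines: estimate the make probability immediately following a streak of makes and compare it with the player's overall rate. A toy illustration with a made-up shot sequence, not GVT's actual data or estimator:

```r
# Difference between P(make | previous `streak` makes) and the overall make rate
streak_gap <- function(shots, streak = 3) {
  n   <- length(shots)
  # positions i where shots i..(i+streak-1) are all makes
  idx <- which(sapply(seq_len(n - streak),
                      function(i) all(shots[i:(i + streak - 1)] == 1)))
  after <- shots[idx + streak]   # outcomes that immediately follow such a streak
  mean(after) - mean(shots)      # positive values hint at a "hot hand"
}

shots <- c(1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1)
streak_gap(shots)                # here: 0.5 - 0.75 = -0.25
```

Note that this statistic sees only the ordering of makes and misses; the minute marks in the example above are invisible to it, which is exactly the critique being made.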

A professional NBA player is expected to have world-class ability in judging their own latent psycho-bio-mechanical state, and then acting upon it. Seeing a player attempt many shots in a short cluster of time would suggest that the player is in a self-excited shooting state, perhaps self-confident in his shot-making ability. Instead of the GVT approach, we use a Hawkes model, a way to model temporal clustering in point processes. Specifically, we use the Epidemic Type Aftershock Sequences (ETAS) model.

ETAS is a special case of the Hawkes model in which earthquakes of different magnitudes make different contributions to productivity. A large-magnitude earthquake might generate more aftershocks than an earthquake of smaller magnitude. In the Hawkes Hand context, we let the ‘local field goal percentage’ be the ‘magnitude’ of a made shot. A made shot with a higher localized field goal percentage may generate more aftershock-makes than a made shot with a lower field goal percentage.

Our ETAS approach gives us 2 dimensions of hotness (hawkesness):

1. The productivity parameter in the ETAS model summarizes the average number of aftershock-makes generated by a parent make.
2. The logistic regression parameter links the historical intensity of makes to the local field goal percentage (using a geometric random variable to preserve information about the misses between two makes).
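As a toy illustration of the self-excitation idea (not the fitted model from the talk; mu, alpha and beta below are made-up values), the conditional intensity of a Hawkes process with an exponential kernel is lambda(t) = mu + sum over past events t_i &lt; t of alpha * exp(-beta * (t - t_i)), and it can be computed directly:

```r
# Conditional intensity of a Hawkes process with an exponential kernel:
# lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i))
hawkes_intensity <- function(t, event_times, mu = 0.1, alpha = 0.5, beta = 1.0) {
  past <- event_times[event_times < t]
  mu + sum(alpha * exp(-beta * (t - past)))
}

makes <- c(2, 2.5, 3, 10)        # minutes at which shots were made (toy data)
hawkes_intensity(3.1, makes)     # just after a cluster of makes: intensity is high
hawkes_intensity(9.9, makes)     # after a long gap: intensity has decayed toward mu
```

The productivity of a parent event in this parameterization is alpha / beta, the expected number of direct aftershock-makes it triggers; ETAS additionally lets that productivity depend on the event's 'magnitude'.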

For those reading at home, try to match the 3 games of varying hotness to:

Shuffle {Klay Thompson, Tracy McGrady, Kobe Bryant}

Further, these 2 dimensions of hawkesness suggest that there are a handful of games in the 2005-2006 season where Kobe was hotter than he was for his 81 point game on January 22, 2006 (in red below).

If you were an NBA player, how would you react if the league office got rid of pre-game warmups?

PS, we had another presentation at #CASSIS18 where Aaron talked about relational networks in basketball. Find out more about CoordiNet and Dr. Aaron at https://www.aaronjdanielson.com/

PPS, we used #rstats for all of our analysis and graphics. From the presentations to the posters, many of the presenters used R. Talking with analytics staff (from the NBA, MLB, and the NFL) a lot of them use R in their internal systems.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: rstats – MikeJackTzen.

### Leave-one-out subset for your ggplot smoother

Wed, 08/08/2018 - 13:15

(This article was first published on R – Nathan Chaney, and kindly contributed to R-bloggers)

I have a dashboard at work that plots the number of contracts my office handles over time, and I wanted to add a trendline to show growth. However, a trendline on the entire dataset skews low because our fiscal year just restarted. It took a little trial and error to figure out how to exclude the most recent year’s count so the trendline is more accurate for this purpose, so I thought I’d share my solution in a short post.

Here’s the code with a dummy dataset, with both the original trendline and the leave-one-out version:

```
library(dplyr)
library(ggplot2)

x <- data.frame(
  x = seq(1:10),
  y = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 7)
)

x %>%
  arrange(desc(x)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_col() +
  geom_smooth(method = "lm", se = F, color = "red") +
  geom_smooth(method = "lm", data = function(x) { slice(x, -1) },
              se = F, fullrange = TRUE)
```

The magic happens by defining an anonymous function as the data argument for the geom_smooth function. I’m using the slice function to drop the first element of the data frame provided to geom_smooth.

There are two other important tips here. First, be sure to sort the data in the tidyverse pipe before passing it to ggplot; for my purpose, this was in descending date order because I wanted to drop the most recent entry. The other tip is an aesthetic one: use the fullrange = TRUE argument to plot the trendline into the current date column, providing a rough prediction for the current date period.

NOTE: I’ve seen some commentary that the decision to omit elements as described here should be explicitly disclosed. However, I think it’s fairly apparent in this specific application that the extension of the smoother is predictive in nature, rather than a true representation of the last data element. I could probably shade or apply an alpha to the most recent data point using a similar technique to make this even more clear, but the purpose here was to demonstrate the leave-one-out technique.
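The note above mentions shading the most recent data point to make the omission clearer. One way to realise that, building on the same dummy dataset; the alpha mapping and the 0.4 value are my own choices, not from the original post:

```r
library(dplyr)
library(ggplot2)

x <- data.frame(x = 1:10, y = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 7))

p <- x %>%
  arrange(desc(x)) %>%
  mutate(current = row_number() == 1) %>%        # flag the in-progress period
  ggplot(aes(x = x, y = y)) +
  geom_col(aes(alpha = !current)) +              # fade the most recent bar
  scale_alpha_manual(values = c(0.4, 1), guide = "none") +
  geom_smooth(method = "lm", data = function(d) slice(d, -1),
              se = FALSE, fullrange = TRUE)

p
```

The smoother still uses the leave-one-out anonymous-function trick, so the faded bar signals visually that the trendline treats that period as a prediction rather than an observation.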

What do you think of this solution?


To leave a comment for the author, please follow the link and comment on their blog: R – Nathan Chaney.

### MODIStsp approved on rOpenSci!

Wed, 08/08/2018 - 02:00

(This article was first published on Posts on Lorenzo Busetto Website & Blog, and kindly contributed to R-bloggers)

We are happy to report that our MODIStsp package for automatic preprocessing of MODIS time series has recently been approved for inclusion in the rOpenSci ecosystem of R packages for reproducible science!

We wish to thank reviewers Leah Wasser and Jeffrey Hanson for providing really valuable insights during the onboarding review process. We think their contribution really helped in improving the package!

Please also note that the MODIStsp website has been migrated, and is now available at http://ropensci.github.io/MODIStsp/


To leave a comment for the author, please follow the link and comment on their blog: Posts on Lorenzo Busetto Website & Blog.

### phylotaR: Retrieve Orthologous Sequences from GenBank

Wed, 08/08/2018 - 02:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

In this technote I will outline what phylotaR was developed for, how to install it and how to run it with some simple examples.

What is phylotaR?

In any phylogenetic analysis it is important to identify sequences that share the same orthology – homologous sequences separated by speciation events. This is often performed by simply searching an online sequence repository using sequence labels. Relying solely on sequence labels, however, can miss sequences that have either not been labelled, have unanticipated names or have been mislabelled.

The phylotaR package, like its earlier inspiration PhyLoTa, identifies sequences of shared orthology not by name matching but through the use of an alignment search tool, BLAST. As a result, a user of phylotaR is able to download more relevant sequence data for their phylogenetic analysis than they otherwise would.

From the taxon of interest the phylotaR pipeline traverses down the taxonomy and downloads entire subsets of sequences representing different nodes in the taxonomy. For taxa with too many sequences (e.g. Homo sapiens, Arabidopsis thaliana), a random subset is instead downloaded. For each of these subsets of downloaded sequences all-vs-all BLAST is then performed. The BLAST results of these subsets are parsed to identify all the independent, homologous clusters of sequences as determined using E-values and coverages. A secondary clustering step performs BLAST again to identify higher-level taxonomic clusters, above the level at which the all-vs-all BLAST searches were performed. This pipeline is automated and all a user needs to supply is a taxonomic group of interest from which the pipeline will begin its traverse. The pipeline is broken down into four stages: retrieve taxonomic information (‘taxise’), download sequences, identify clusters and identify clusters among the clusters.

Figure 1. The stages of phylotaR pipeline: taxise, download, cluster and cluster^2. Note, ‘taxise’ is the name of a stage and does not relate to the package taxize.

After a pipeline has completed, the package provides a series of tools for interacting with the results to identify potential orthologous clusters suitable for phylogenetic analysis. These tools include methods for filtering based on sequence lengths, sequence annotations, taxonomy, etc., as well as visualisations to, for example, check the relative abundance of sequences by taxon.

For more information on how the pipeline works, please see the open-access scientific article: phylotaR: An Automated Pipeline for Retrieving Orthologous DNA Sequences from GenBank in R

Installation

phylotaR is available from CRAN.

install.packages('phylotaR')

Alternatively, the latest development version can be downloaded from phylotaR’s GitHub page.

devtools::install_github(repo = 'ropensci/phylotaR', build_vignettes = TRUE)

BLAST+

In addition to the R package you will also need to have installed BLAST+ – a local copy of the well-known BLAST software on your machine. Unfortunately this can be a little complex as its installation depends on your operating system (Windows, Mac or Linux). Fortunately, NCBI provides detailed installation instructions for each flavour of operating system: NCBI installation instructions.

Once BLAST+ is installed you will need to record the BLAST+ file system path where the executable programs are located. This should be something like C:\users\tao\desktop\blast-2.2.29+\bin\ on a Windows machine or /usr/local/ncbi/blast/bin/ on a Unix system.

Quick guide

Taxonomic ID

After installation, a phylotaR pipeline can be launched with the setup() and run() functions. Before launching a run, a user must first look up the NCBI taxonomic ID for their chosen taxonomic group. To find the ID of your taxon, search NCBI taxonomy using the scientific name or an English common name. For example, upon searching for the primate genus ‘Aotus’ we should come across this page which states the taxonomic ID as 9504.

Because names are not unique phylotaR does not allow names to be used instead of taxonomic IDs. Aotus is a good example as there is also a plant genus of the same name. If we were to search NCBI with just the name then we would be downloading a mixture of primate and plant DNA. Such is the importance of IDs. (Alternatively, you could automate this step yourself using the taxize package.)

Running

With the package installed, BLAST+ available and ID to hand, we can run our pipeline.

# load phylotaR
library(phylotaR)
# create a folder to hold the cached data and results
wd <- 'aotus'
dir.create(wd)
# set up the folder and check that BLAST+ is available
setup(wd = wd, txid = 9504, ncbi_dr = '/usr/local/ncbi/blast/bin/')
# run the pipeline
run(wd = wd)
# ^^ running should take about 5 minutes to complete

For more details on running the pipeline, such as changing the parameters or understanding the results, see the vignette, vignette('phylotaR') or visit the website.

Timings

The time it takes for a pipeline to complete depends on the number of taxonomic groups a taxon contains and the number of sequences it represents. Below are some examples of the time taken for various taxonomic groups.

Taxon | N. taxa | N. sequences | Time taken (mins.)
--- | --- | --- | ---
Anisoptera ("dragonflies") | 1175 | 11432 | 72
Acipenseridae ("sturgeons") | 51 | 2407 | 13
Tinamiformes ("tinamous") | 25 | 251 | 2.7
Aotus ("night/owl monkeys") | 13 | 1499 | 3.9
Bromeliaceae ("bromeliads") | 1171 | 9833 | 66
Cycadidae ("cycads") | 353 | 8331 | 37
Eutardigrada ("water bears") | 261 | 960 | 14
Kazachstania (a group of yeast) | 40 | 623 | 23
Platyrrhini ("new world monkeys") | 212 | 12731 | 60

Reviewing the results

After a pipeline has completed, the wd will contain all the results files. The information contained within these files can be read into the R environment using the read_phylota() function. It will generate a Phylota object that contains information on all the identified ‘clusters’ – i.e. groups of orthologous sequences. A user can interact with the object to filter out unwanted sequences, clusters or taxonomic groups. The package comes with completed results to play with. In the example below, we demonstrate how to generate a presence/absence matrix of all the genera in the top ten clusters identified for cycads (taxonomic ID 1445963).

library(phylotaR)
data(cycads)
# drop all but the first ten clusters
cycads <- drop_clstrs(cycads, cycads@cids[1:10])
# get genus-level taxonomic names
genus_txids <- get_txids(cycads, txids = cycads@txids, rnk = 'genus')
genus_txids <- unique(genus_txids)
# drop missing
genus_txids <- genus_txids[genus_txids != '']
genus_nms <- get_tx_slot(cycads, genus_txids, slt_nm = 'scnm')
# make alphabetical for plotting
genus_nms <- sort(genus_nms, decreasing = TRUE)
# generate geom object
p <- plot_phylota_pa(phylota = cycads, cids = cycads@cids, txids = genus_txids,
                     txnms = genus_nms)
# plot
print(p)

Figure 2. The presence (dark block) or absence (light block) for different cycad genera across the top ten clusters in the example dataset.

In this next example, we create a treemap showing the differences in the number of sequences and clusters identified between genera in tinamous (taxonomic ID 8802). (For the uninitiated, tinamous are semi-flightless birds found in South America and members of the ratites – the group that also comprises ostriches and kiwis.)

library(phylotaR)
data(tinamous)
# plot taxa, size by n. sequences, fill by n. clusters
txids <- get_txids(tinamous, txids = tinamous@txids, rnk = 'genus')
txids <- txids[txids != '']
txids <- unique(txids)
txnms <- get_tx_slot(tinamous, txids, slt_nm = 'scnm')
p <- plot_phylota_treemap(phylota = tinamous, txids = txids, txnms = txnms,
                          area = 'nsq', fill = 'ncl')
print(p)

Figure 3. Relative number of sequences and clusters per tinamous genus. The larger the size of the box, the more sequences are represented for the genus. The lighter the blue colour, the more clusters are represented for the genus.

Through interacting with the Phylota object and using the various functions for manipulating it, a user can extract the specific orthologous sequences of interest and write out the sequences in fasta format with the write_sqs() function.

# get sequences for a cluster and write them out
data('aotus')
# take the first cluster ID
cid <- aotus@cids[[1]]
# extract its sequence IDs from the Phylota object
sids <- aotus[[cid]]@sids
# design sequence names for the definition line
sid_txids <- get_sq_slot(phylota = aotus, sid = sids, slt_nm = 'txid')
sid_spnms <- get_tx_slot(phylota = aotus, txid = sid_txids, slt_nm = 'scnm')
sq_nms <- paste0(sid_spnms, ' | ', sids)
# write out
write_sqs(phylota = aotus, outfile = file.path(tempdir(), 'test.fasta'),
          sq_nm = sq_nms, sid = sids)

For more information and examples on manipulating the Phylota object, see the phylotaR website.

Future

We have many ideas for improving phylotaR, such as making use of the BLAST API – instead of relying on users installing BLAST+ on their machine – and allowing users to introduce their own non-GenBank sequences. Please see the contributing page for a complete and current list of development options. Fork requests are welcome!

Alternatively, if you have any ideas of your own for new features, then please open a new issue.

Acknowledgements

Big thanks to Zebulun Arendsee, Naupaka Zimmerman and Scott Chamberlain for reviewing/editing the package for ROpenSci.

Bennett, D., Hettling, H., Silvestro, D., Zizka, A., Bacon, C., Faurby, S., … Antonelli, A. (2018). phylotaR: An Automated Pipeline for Retrieving Orthologous DNA Sequences from GenBank in R. Life, 8(2), 20. DOI:10.3390/life8020020

Sanderson, M. J., Boss, D., Chen, D., Cranston, K. A., & Wehe, A. (2008). The PhyLoTA Browser: Processing GenBank for molecular phylogenetics research. Systematic Biology, 57(3), 335–346. DOI:10.1080/10635150802158688


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Extending the OpenImageR package with Gabor feature extraction

Wed, 08/08/2018 - 02:00

(This article was first published on mlampros, and kindly contributed to R-bloggers)

This blog post illustrates the new functionality of the OpenImageR package (Gabor feature extraction). Gabor features have been used extensively in image analysis and processing (character and face recognition). Dennis Gabor (Nobel Prize winner, electrical engineer and physicist) used the following wording, which I think is worth mentioning in this blog post: “You can’t predict the future, but you can invent it.” (source)

In the following lines I’ll describe the GaborFeatureExtract R6 class which includes the following methods,

GaborFeatureExtract
• gabor_filter_bank()
• gabor_feature_extraction()
• gabor_feature_engine()
• plot_gabor()

These methods are based on the Matlab code Gabor Feature Extraction from the paper: M. Haghighat, S. Zonouz, M. Abdel-Mottaleb, “CloudID: Trustworthy cloud-based and cross-enterprise biometric identification,” Expert Systems with Applications, vol. 42, no. 21, pp. 7905-7916, 2015, http://dx.doi.org/10.1016/j.eswa.2015.06.025. The initial Matlab code was modified (I added the Magnitude feature and the gabor_feature_engine() method) and parallelized using Rcpp wherever possible.

Gabor Features

I came across Gabor features last month when I had to process images and needed an additional function besides the already existing HoG features. I did the regular search on CRAN (the Comprehensive R Archive Network) but couldn’t find anything related to Gabor feature extraction (as of August 2018), so I decided to port the Matlab code to R. There are many resources on the web for anyone who wants to deepen their knowledge of the subject, and I’ll list some that I found useful at the end of the blog post (References). I’ll explain the methods of the GaborFeatureExtract R6 class using an image found in a Stackoverflow question.

gabor_filter_bank

The gabor_filter_bank method “… generates a custom-sized Gabor filter bank. It creates a UxV cell array, whose elements are MxN matrices; each matrix being a 2-D Gabor filter” according to the author of the Matlab code. The following code chunk shows how this works in R,

library(OpenImageR)

init_gb = GaborFeatureExtract$new()

#------------------
# gabor-filter-bank
#------------------

gb_f = init_gb$gabor_filter_bank(scales = 5, orientations = 8, gabor_rows = 39,
                                 gabor_columns = 39, plot_data = TRUE)

#-----------------------------------------------
# plot of the real part of the gabor-filter-bank
#-----------------------------------------------

plt_f = init_gb$plot_gabor(real_matrices = gb_f$gabor_real, margin_btw_plots = 0.65,
                           thresholding = FALSE)

For the gabor_filter_bank I use 5 scales and 8 orientations to build filters of size 39 x 39. The output of the method is a list of length 3,

str(gb_f)

List of 3
 $ gaborArray     :List of 40
  ..$ : cplx [1:39, 1:39] -1.58e-12-0.00i 0.00-5.02e-12i 1.50e-11-0.00i ...
  ..$ : cplx [1:39, 1:39] 4.86e-08-3.96e-08i 1.02e-07+4.63e-08i 6.31e-09+1.93e-07i ...
  ..$ : cplx [1:39, 1:39] 6.24e-06-6.24e-06i 1.18e-05+0.00i 1.10e-05+1.10e-05i ...
  ..$ : cplx [1:39, 1:39] -6.69e-05-3.18e-05i -4.63e-05-7.20e-05i -1.60e-06-9.81e-05i ...
  ..$ : cplx [1:39, 1:39] 1.40e-04+5.81e-05i 1.15e-04+1.15e-04i 6.68e-05+1.61e-04i ...
  .......
 $ gabor_imaginary:List of 40
  ..$ : num [1:39, 1:39] -4.65e-27 -5.02e-12 -1.10e-26 4.21e-11 -2.99e-25 ...
  ..$ : num [1:39, 1:39] -3.96e-08 4.63e-08 1.93e-07 1.53e-07 -3.04e-07 ...
  ..$ : num [1:39, 1:39] -6.24e-06 4.84e-20 1.10e-05 2.01e-05 1.81e-05 ...
  ..$ : num [1:39, 1:39] -3.18e-05 -7.20e-05 -9.81e-05 -9.58e-05 -5.78e-05 ...
  ..$ : num [1:39, 1:39] 5.81e-05 1.15e-04 1.61e-04 1.86e-04 1.83e-04 ...
  .......
 $ gabor_real     :List of 40
  ..$ : num [1:39, 1:39] -1.58e-12 5.54e-27 1.50e-11 -4.12e-26 -1.11e-10 ...
  ..$ : num [1:39, 1:39] 4.86e-08 1.02e-07 6.31e-09 -2.85e-07 -4.28e-07 ...
  ..$ : num [1:39, 1:39] 6.24e-06 1.18e-05 1.10e-05 -8.11e-20 -1.81e-05 ...
  ..$ : num [1:39, 1:39] -6.69e-05 -4.63e-05 -1.60e-06 5.73e-05 1.12e-04 ...
  ..$ : num [1:39, 1:39] 1.40e-04 1.15e-04 6.68e-05 -3.77e-19 -7.57e-05 ...
  .......

The first sublist (gaborArray) consists of 40 matrices (5 scales x 8 orientations) of type complex, where each matrix is of dimension 39 x 39 (a gabor filter). The second sublist (gabor_imaginary) is the imaginary part (numeric), whereas the third sublist is the real part (gabor_real). The real part (numeric) is used to plot the gabor filters.
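To pull a single filter out of the flat 40-element list you need to know how scale and orientation map to the list index. The package documentation is the authority here; the helper below is my own sketch, assuming orientation varies fastest within each scale (verify against the plotted bank before relying on it):

```r
# hypothetical helper: map (scale, orientation) to an index in a flat
# list of scales x orientations filters, assuming orientation varies
# fastest within each scale
filter_index <- function(scale, orientation, n_orientations = 8) {
  (scale - 1) * n_orientations + orientation
}

# under that assumption, the real part of the filter for scale 3,
# orientation 5 would be gb_f$gabor_real[[filter_index(3, 5)]]
filter_index(3, 5)   # index 21 of 40
```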

The documentation includes more details for each of the parameters used.

gabor_feature_extraction

The gabor_feature_extraction method extracts the Gabor features of an image. This method is modified in comparison to the initial Matlab code to give users the option to downsample the image or to normalize the features. Moreover, I added the Magnitude feature because, according to the literature, it improves predictive performance.

Based on the previously mentioned car.png image,

# read image
#-----------
pth_im = system.file("tmp_images", "car.png", package = "OpenImageR")
im = readImage(pth_im) * 255

# gabor-feature-extract
#----------------------
gb_im = init_gb$gabor_feature_extraction(image = im, scales = 5, orientations = 8,
                                         downsample_gabor = FALSE, downsample_rows = NULL,
                                         downsample_cols = NULL, gabor_rows = 39,
                                         gabor_columns = 39, plot_data = TRUE,
                                         normalize_features = FALSE, threads = 3)

This function again returns a list of length 3,

str(gb_im)

List of 3
 $ gaborFeatures           :List of 2
  ..$ magnitude      : num [1, 1:206670] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ energy_aptitude: num [1, 1:80] 115823 27566 33289 20074 26918 ...
 $ gabor_features_imaginary:List of 5
  ..$ : num [1:166, 1:249, 1:8] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ : num [1:166, 1:249, 1:8] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ : num [1:166, 1:249, 1:8] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ : num [1:166, 1:249, 1:8] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ : num [1:166, 1:249, 1:8] 0 0 0 0 0 0 0 0 0 0 ...
 $ gabor_features_real     :List of 5
  ..$ : num [1:166, 1:249, 1:8] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ : num [1:166, 1:249, 1:8] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ : num [1:166, 1:249, 1:8] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ : num [1:166, 1:249, 1:8] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ : num [1:166, 1:249, 1:8] 0 0 0 0 0 0 0 0 0 0 ...

where gaborFeatures are the extracted features, with the number of rows equal to nrow(im) x ncol(im) (or 166 x 249) and the number of columns equal to scales x orientations (or 5 x 8). The second and third sublists are the imaginary and real parts of the image resulting from the convolution with the gabor filters. The following code chunk allows the user to plot the different scales and orientations of the image,

plt_im = init_gb$plot_gabor(real_matrices = gb_im$gabor_features_real, margin_btw_plots = 0.65, thresholding = FALSE)

By thresholding the gb_im$gabor_features_real object (scales, orientations) to [0,1] the images can be visually explored,

# thresholding parameter is set to TRUE
#--------------------------------------
plt_im_thresh = init_gb$plot_gabor(real_matrices = gb_im$gabor_features_real,
                                   margin_btw_plots = 0.65, thresholding = TRUE)

gabor_feature_engine

The gabor_feature_engine method is an extension of the initial Matlab code and allows the user to extract gabor features from multiple images. This method works in the same way as the HOG_apply method: it takes a matrix of images – such as the mnist data set – and after processing returns the features. The following example illustrates how to use the gabor_feature_engine method with the mnist data set,

# for instance, downloading the mnist data on a Linux OS
#-------------------------------------------------------
system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/mnist.zip")
# or just navigate to "https://github.com/mlampros/DataSets/blob/master/mnist.zip" and click the download button
#---------------------------------------------------------------------------------

mnist <- read.table(unz("mnist.zip", "mnist.csv"), nrows = 70000, header = T,
                    quote = "\"", sep = ",")
mnist = as.matrix(mnist)
x = mnist[, -ncol(mnist)]
y = mnist[, ncol(mnist)] + 1

dat = init_gb$gabor_feature_engine(img_data = x, img_nrow = 28, img_ncol = 28,
                                   scales = 4, orientations = 8, gabor_rows = 13,
                                   gabor_columns = 13, downsample_gabor = FALSE,
                                   downsample_rows = NULL, downsample_cols = NULL,
                                   normalize_features = FALSE, threads = 6, verbose = TRUE)

Time to complete : 4.111672 mins

> str(dat)
List of 2
 $ magnitude      : num [1:70000, 1:3136] 0 0 0 0 0 0 0 0 0 0 ...
 $ energy_aptitude: num [1:70000, 1:64] 2682 2576 1399 1178 2240 ...

The dat object is a list of length 2. The first sublist corresponds to magnitude whereas the second sublist to local-energy and mean-aptitude. The first thing to do before calculating the accuracy for the mnist data is to reduce the dimensionality of the magnitude feature (I’ll use the irlba package for this purpose),

svd_irlb = irlba::irlba(as.matrix(dat$magnitude), nv = 100, nu = 100, verbose = T)
new_x = as.matrix(dat$magnitude) %*% svd_irlb$v

and I’ll create a centered-scaled matrix of the reduced magnitude and the energy-aptitude data,

dat = ClusterR::center_scale(cbind(new_x, dat$energy_aptitude))
dim(dat)
[1] 70000 164

then I’ll utilize the nmslibR library (approximate method ‘hnsw’) to calculate the accuracy for the mnist data,

M = 30
efC = 100
num_threads = 6
index_params = list('M' = M, 'indexThreadQty' = num_threads, 'efConstruction' = efC,
                    'post' = 0, 'skip_optimized_index' = 1)
efS = 100
query_time_params = list('efSearch' = efS)

fit_gab = nmslibR::KernelKnnCV_nmslib(dat, y, k = 20, folds = 4, h = 1,
                                      weights_function = 'biweight_triangular_tricube_MULT',
                                      Levels = sort(unique(y)), Index_Params = index_params,
                                      Time_Params = query_time_params, space = "cosinesimil",
                                      space_params = NULL, method = "hnsw",
                                      data_type = "DENSE_VECTOR", dtype = "FLOAT",
                                      index_filepath = NULL, print_progress = T,
                                      num_threads = 6, seed_num = 1)

time to complete : 30.82252 secs

acc_fit_gab = unlist(lapply(1:length(fit_gab$preds), function(x) {
  mean(y[fit_gab$folds[[x]]] == max.col(fit_gab$preds[[x]], ties.method = 'random'))
}))

mean(acc_fit_gab)
[1] 0.9752714

I’ll do the same using the HOG_apply function,

# hog-features for the mnist data
#--------------------------------
hog = OpenImageR::HOG_apply(x, cells = 6, orientations = 9, rows = 28,
                            columns = 28, threads = 6)

fit_hog = nmslibR::KernelKnnCV_nmslib(hog, y, k = 20, folds = 4, h = 1,
                                      weights_function = 'biweight_triangular_tricube_MULT',
                                      Levels = sort(unique(y)), Index_Params = index_params,
                                      Time_Params = query_time_params, space = "cosinesimil",
                                      space_params = NULL, method = "hnsw",
                                      data_type = "DENSE_VECTOR", dtype = "FLOAT",
                                      index_filepath = NULL, print_progress = T,
                                      num_threads = 6, seed_num = 1)

time to complete : 41.00309 secs

acc_fit_hog = unlist(lapply(1:length(fit_hog$preds), function(x) {
  mean(y[fit_hog$folds[[x]]] == max.col(fit_hog$preds[[x]], ties.method = 'random'))
}))

mean(acc_fit_hog)
[1] 0.9787429

By averaging the predictions from the gabor and the HoG features, the mean accuracy increases to 98.34%, which is very close to the score of the HoG + brute-force method of KernelKnn (98.4%):

acc_fit_gab_hog = unlist(lapply(1:length(fit_gab$preds), function(x) {
  mean(y[fit_gab$folds[[x]]] == max.col((fit_gab$preds[[x]] + fit_hog$preds[[x]]) / 2,
                                        ties.method = 'random'))
}))

mean(acc_fit_gab_hog)
[1] 0.9833857

References:

The updated version of the OpenImageR package can be found in my GitHub repository; to report bugs or issues, please use the following link: https://github.com/mlampros/OpenImageR/issues.


To leave a comment for the author, please follow the link and comment on their blog: mlampros.

### IEEE Language Rankings 2018

Tue, 08/07/2018 - 23:03

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Python retains its top spot in the fifth annual IEEE Spectrum top programming language rankings, and also gains a designation as an "embedded language". Data science language R remains the only domain-specific language in the top 10 (where it is listed as an "enterprise language") and drops one place compared to its 2017 ranking to take the #7 spot.

Looking at other data-oriented languages, Matlab is at #11 (up 3 places), SQL at #24 (down 1), Julia at #32 (down 1) and SAS at #40 (down 3). Click the screenshot below for an interactive version of the chart, where you can also explore the top 50 rankings.

The IEEE Spectrum rankings are based on search, social media, and job listing trends, GitHub repositories, and mentions in journal articles. You can find details on the ranking methodology here, and discussion of the trends behind the 2018 rankings at the link below.

IEEE Spectrum: The 2018 Top Programming Languages


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

### Statistics Sunday: Highlighting a Subset of Data in ggplot2

Tue, 08/07/2018 - 19:55

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


Highlighting Specific Cases in ggplot2

Here’s my belated Statistics Sunday post, using a cool technique I just learned about: gghighlight. This R package works with ggplot2 to highlight a subset of data. To demonstrate, I’ll use a dataset I analyzed for a previous post about my 2017 reading habits. [Side note: My reading goal for this year is 60 books, and I’m already at 43! I may have to increase my goal at some point.]

setwd("~/R")
library(tidyverse)
## Warning: Duplicated column names deduplicated: 'Author' => 'Author_1' [13]
## Parsed with column specification:
## cols(
## .default = col_integer(),
## Title = col_character(),
## Author = col_character(),
## G_Rating = col_double(),
## Started = col_character(),
## Finished = col_character()
## )
## See spec(...) for full column specifications.

One analysis I conducted with this dataset was to look at the correlation between book length (number of pages) and read time (number of days it took to read the book). We can also generate a scatterplot to visualize this relationship.

cor.test(books$Pages, books$Read_Time)
##
## Pearson's product-moment correlation
##
## data: books$Pages and books$Read_Time
## t = 3.1396, df = 51, p-value = 0.002812
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1482981 0.6067498
## sample estimates:
## cor
## 0.4024597
scatter <- ggplot(books, aes(Pages, Read_Time)) +
geom_point(size = 3) +
theme_classic() +
labs(title = "Relationship Between Reading Time and Page Length") +
xlab("Number of Pages") +
theme(legend.position="none",plot.title=element_text(hjust=0.5))

There’s a significant positive correlation here, meaning longer books take more days to read. It’s a moderate correlation, and there are certainly other variables that may explain why a book took longer to read. For instance, nonfiction books may take longer. Books read in October or November (while I was gearing up for and participating in NaNoWriMo, respectively) may also take longer, since I had less spare time to read. I can conduct regressions and other analyses to examine which variables impact read time, but one of the most important parts of sharing results is creating good data visualizations. How can I show the impact these other variables have on read time in an understandable and visually appealing way?
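The kind of regression mentioned above can be sketched as follows (with simulated data standing in for the real reading log; the Nonfiction indicator column is hypothetical, since the actual dataset may encode genre differently):

```r
# simulated stand-in for the reading-log data frame
set.seed(42)
books_sim <- data.frame(Pages = round(runif(53, 150, 700)),
                        Nonfiction = rbinom(53, 1, 0.3))
books_sim$Read_Time <- 2 + 0.01 * books_sim$Pages +
  4 * books_sim$Nonfiction + rnorm(53, sd = 2)

# regress read time on page count plus a genre indicator
fit <- lm(Read_Time ~ Pages + Nonfiction, data = books_sim)
summary(fit)$coefficients
```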

gghighlight will let me draw attention to different parts of the plot. For example, I can ask gghighlight to draw attention to books that took longer than a certain amount of time to read, and I can even ask it to label those books.

library(gghighlight)
scatter + gghighlight(Read_Time > 14) +
geom_label(aes(label = Title),
hjust = 1,
vjust = 1,
fill = "blue",
color = "white",
alpha = 0.5)

Here, the gghighlight function identifies the subset (books that took more than 2 weeks to read) and labels those books with the Title variable. Three of the four books with long read time values are non-fiction, and one was read for a course I took, so reading followed a set schedule. But the fourth is a fiction book, which took over 20 days to read. Let’s see how month impacts reading time, by highlighting books read in November. To do that, I’ll need to alter my dataset somewhat. The dataset contains a starting date and finish date, which were read in as characters. I need to convert those to dates and pull out the month variable to create my indicator.

library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
books$Started <- mdy(books$Started)
books$Start_Month <- month(books$Started)
books$Month <- ifelse(books$Start_Month == 11, 1, 0)

scatter + gghighlight(books$Month == 1) +
geom_label(aes(label = Title), hjust = 1, vjust = 1, fill = "blue", color = "white", alpha = 0.5)

The book with the longest read time was, in fact, read during November, when I was spending most of my time writing.


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.

### Meta-packages, nails in CRAN’s coffin

Tue, 08/07/2018 - 18:43

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Derek Jones recently discussed a possible future for the R ecosystem in “StatsModels: the first nail in R’s coffin”.

This got me thinking on the future of CRAN (which I consider vital to R, and vital in distributing our work) in the era of super-popular meta-packages. Meta-packages are convenient, but they have a profoundly negative impact on the packages they exclude.

For example: tidyverse advertises a popular R universe where the vital package data.table never existed.

And now tidymodels is shaping up to be a popular universe where our own package vtreat never existed, except possibly as a footnote to embed.

Users currently (with some luck) discover packages like ours and then (because they trust CRAN) feel able to try them. With popular walled gardens that becomes much less likely. It is one thing for a standard package to duplicate another package (that is actually hard to avoid, and it is how work legitimately competes); it is quite another for a big-brand meta-package to pre-pick winners (and losers).

All I can say is: please give vtreat a chance and a try. It is a package for preparing messy real-world data for predictive modeling. In addition to re-coding high cardinality categorical variables (into what we call effect-codes after Cohen, or impact-codes), it deals with missing values, can be parallelized, can be run on databases, and has years of production experience baked in.
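For readers who have not seen it, here is a minimal sketch of the basic vtreat workflow on a toy numeric-outcome problem (the data is made up; the package vignettes describe the recommended patterns, such as cross-frames, in full):

```r
library(vtreat)

# toy data: a categorical predictor with missing values and a numeric outcome
set.seed(1)
d <- data.frame(x = sample(c(letters[1:8], NA), 100, replace = TRUE),
                y = rnorm(100))

# design a treatment plan from the data, then prepare a clean,
# all-numeric frame safe to hand to a modeling function
plan <- designTreatmentsN(d, varlist = 'x', outcomename = 'y')
d_treated <- prepare(plan, d)
str(d_treated)
```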


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

### Notes from the 2018 APDU Conference

Tue, 08/07/2018 - 18:30

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

I recently attended the 2018 Association of Public Data Users (APDU) conference. This was my second time attending the conference, and I enjoyed learning more about how other people use federal statistics. Some of my favorite presentations were:

• Julia Lane on her experience teaching Big Data to social scientists and government workers.
• The Bureau of Economic Analysis (BEA) on their R package.
• Aaron Klein on the value of policymakers getting realtime data.
Julia Lane and the Coleridge Initiative

As an “R guy” I’m normally the odd man out at APDU. While many attendees do work with data, they tend to use GUI-based tools such as Excel and Tableau. I’ve often wondered if any of the attendees would benefit from learning programming language-based data tools such as R or Python.

It turns out that Julia is the author of an entire book on this topic: Big Data and Social Science! She is also a director of the Coleridge Initiative, which provides Python-based data analysis training to government workers and social scientists.

Julia spoke about her experience with these projects, and the results seemed very positive!

The Bureau of Economic Analysis (BEA) has an R package

While I mostly blog about data that the Census Bureau publishes, APDU is a great reminder of how many other statistical agencies there are. An example is the Bureau of Economic Analysis (BEA) which, among other things, calculates Gross Domestic Product (GDP) statistics.

BEA was a sponsor of the conference, and I got to chat with one of the people running their booth. I was surprised to learn that BEA has published their own R package: bea.R. This is the first time that I have heard of a government agency developing their own R package!

The person I spoke with mentioned that BEA’s Chief Innovation Officer, Jeff Chen, is a big proponent of R. You can learn more about the BEA here.

I think that it would be interesting to extend Choroplethr to work with data from the BEA.

Aaron Klein on Policymakers Getting Realtime Data

Aaron Klein, a former Treasury official, spoke about the value of policy makers getting realtime data. Aaron worked in Treasury during the foreclosure crisis, and spoke about the challenges policymakers faced in responding to it. One issue was quantifying the impact that foreclosures and abandoned homes have on broader communities.

He recently wrote a research paper that attempted to answer this question: Understanding the True Costs of Abandoned Properties: How Maintenance Can Make a Difference. One statistic from the talk left a big impression on me: vacant residential buildings account for one of every 14 residential building fires in America. When you consider that only a small portion of residential homes are vacant, this statistic is truly startling.

Having data like this at the start of the foreclosure crisis might have improved how policymakers responded to it.

The post Notes from the 2018 APDU Conference appeared first on AriLamstein.com.


To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.

### greybox 0.3.0 – what’s new

Tue, 08/07/2018 - 18:06

(This article was first published on R – Modern Forecasting, and kindly contributed to R-bloggers)

Three months have passed since the initial release of `greybox` on CRAN. I would not say that the package develops like crazy, but there have been some changes since May. Let's have a look. We start by loading both `greybox` and `smooth`:

```r
library(greybox)
library(smooth)
```

Rolling Origin

First of all, the `ro()` function now has its own class and works with the `plot()` function, so that you can have a visual representation of the results. Here's an example:

```r
x <- rnorm(100, 100, 10)
ourCall <- "es(data, h=h, intervals=TRUE)"
ourValue <- c("forecast", "lower", "upper")
ourRO <- ro(x, h=20, origins=5, ourCall, ourValue, co=TRUE)
plot(ourRO)
```

Example of the plot of rolling origin function

Each point on the produced graph corresponds to an origin, and the straight lines correspond to the forecasts. Given that we asked for point forecasts and for the lower and upper bounds of the prediction interval, we have three respective lines. By plotting the results of a rolling origin experiment, we can see whether the model is stable or not. Just compare the previous graph with the one produced from a call to Holt's model:

```r
ourCall <- "es(data, model='AAN', h=h, intervals=TRUE)"
ourRO <- ro(x, h=20, origins=5, ourCall, ourValue, co=TRUE)
plot(ourRO)
```

Example of the plot of rolling origin function with ETS(A,A,N)

Holt's model is not suitable for this time series, so its forecasts are less stable than those of the automatically selected model in the previous case (which is ETS(A,N,N)).
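For readers who want the mechanics rather than the package internals: a rolling-origin evaluation simply moves the forecast origin forward one observation at a time and records the forecasts against the matching actuals. Here is a minimal, language-agnostic sketch in Python; the function, its indexing scheme and the naive forecaster are my own illustration, not `ro()`'s actual implementation or interface:

```python
import numpy as np

def rolling_origin(series, h, origins, forecaster):
    """Produce h-step forecasts from several consecutive origins.

    forecaster(train, h) must return an array of h point forecasts.
    Returns (origins x h) arrays of forecasts and matching actuals.
    """
    n = len(series)
    fc = np.full((origins, h), np.nan)
    actual = np.full((origins, h), np.nan)
    for i in range(origins):
        # the origin moves forward one observation at a time
        cutoff = n - h - origins + 1 + i
        fc[i] = forecaster(series[:cutoff], h)
        actual[i] = series[cutoff:cutoff + h]
    return fc, actual

# naive forecaster: repeat the last observed value h times
naive = lambda train, h: np.repeat(train[-1], h)

y = np.arange(30, dtype=float)
fc, actual = rolling_origin(y, h=3, origins=4, forecaster=naive)
print(fc[0])       # forecasts from the first origin
print(actual[-1])  # actuals for the last origin
```

Comparing the rows of `fc` and `actual` across origins is exactly what the plot of an `ro()` object lets you do by eye.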

Once again, there is a vignette with examples for the `ro()` function; have a look if you want to know more.

Advanced Linear Model

Yes, there is the "Generalised Linear Model" in R, which implements Poisson, Gamma, Binomial and other regressions. Yes, there are smaller packages implementing models with more exotic distributions. But I needed several regression models with the Laplace distribution, the Folded Normal distribution, the Chi-Squared distribution and one new, mysterious distribution currently called the "S distribution". I needed them in one place and in one format: properly estimated using likelihoods, returning confidence intervals and information criteria, and able to produce forecasts. I also wanted them to work similarly to `lm()`, so that the learning curve would not be too steep. So, here it is: the `alm()` function. It works quite similarly to `lm()`:

```r
xreg <- cbind(rfnorm(100, 1, 10), rnorm(100, 50, 5))
xreg <- cbind(100 + 0.5*xreg[,1] - 0.75*xreg[,2] + rlaplace(100, 0, 3), xreg, rnorm(100, 300, 10))
colnames(xreg) <- c("y", "x1", "x2", "Noise")
inSample <- xreg[1:80, ]
outSample <- xreg[-c(1:80), ]
ourModel <- alm(y ~ x1 + x2, inSample, distribution="laplace")
summary(ourModel)
```

Here’s the output of the summary:

```
Distribution used in the estimation: Laplace
Coefficients:
            Estimate Std. Error Lower 2.5% Upper 97.5%
(Intercept) 95.85207    0.36746   95.12022    96.58392
x1           0.59618    0.02479    0.54681     0.64554
x2          -0.67865    0.00622   -0.69103    -0.66626
ICs:
     AIC     AICc      BIC     BICc 
474.2453 474.7786 483.7734 484.9419
```
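As an aside on what "properly estimated using likelihoods" means in the Laplace case: maximising a Laplace likelihood in a regression is equivalent to minimising the sum of absolute residuals (least absolute deviations), with the scale estimated as the mean absolute residual. A sketch of this principle in Python, using data-generating values that mirror the example above; this is an illustration of the idea, not `alm()`'s implementation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(1, 10, n)
x2 = rng.normal(50, 5, n)
# same structure as the example: y = 100 + 0.5*x1 - 0.75*x2 + Laplace noise
y = 100 + 0.5 * x1 - 0.75 * x2 + rng.laplace(0, 3, n)
X = np.column_stack([np.ones(n), x1, x2])

# Laplace MLE for the coefficients = least absolute deviations (LAD)
lad_loss = lambda beta: np.abs(y - X @ beta).sum()

# start from the OLS solution and refine with a derivative-free method
beta0 = np.linalg.lstsq(X, y, rcond=None)[0]
res = minimize(lad_loss, beta0, method="Nelder-Mead")
beta_hat = res.x
scale_hat = np.mean(np.abs(y - X @ beta_hat))  # MLE of the Laplace scale
```

With 200 observations, `beta_hat` lands close to the true (100, 0.5, -0.75) and `scale_hat` close to 3, just as the `alm()` summary recovers its coefficients.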

And here’s the respective plot of the forecast:

```r
plot(forecast(ourModel, outSample))
```

Forecast from lm with Laplace distribution

Prediction intervals are currently missing from the function, but they will be added in upcoming releases.

The likelihood approach allows comparing models with different distributions using information criteria. Here is, for example, the model we get if we assume the S distribution (which has fatter tails than Laplace):

```r
summary(alm(y ~ x1 + x2, inSample, distribution="s"))
```
```
Distribution used in the estimation: S
Coefficients:
            Estimate Std. Error Lower 2.5% Upper 97.5%
(Intercept) 95.61244    0.23386   95.14666    96.07821
x1           0.56144    0.00721    0.54708     0.57581
x2          -0.66867    0.00302   -0.67470    -0.66265
ICs:
     AIC     AICc      BIC     BICc 
482.9358 483.4692 492.4639 493.6325
```

As you can see, the information criteria for the S distribution are higher than for the Laplace one, so we can conclude that the Laplace model is the better of the two in terms of ICs.
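The reported criteria follow the standard definitions, and the printed values are consistent with k = 4 estimated parameters (three coefficients plus the Laplace scale) and n = 80 in-sample observations. A quick check in Python, backing the log-likelihood out of the reported AIC:

```python
import math

def aic(loglik, k):
    # Akaike information criterion: 2k - 2*log-likelihood
    return 2 * k - 2 * loglik

def aicc(loglik, k, n):
    # standard small-sample correction to AIC
    return aic(loglik, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(loglik, k, n):
    # Bayesian information criterion
    return k * math.log(n) - 2 * loglik

# reconstruct the Laplace model's ICs: 3 coefficients + 1 scale = 4 parameters,
# 80 in-sample observations, log-likelihood backed out of the reported AIC
loglik = (2 * 4 - 474.2453) / 2        # -233.12265
print(round(aic(loglik, 4), 4))        # 474.2453
print(round(aicc(loglik, 4, 80), 4))   # 474.7786
print(round(bic(loglik, 4, 80), 4))    # 483.7734
```

The AICc and BIC computed this way reproduce the summary output above exactly, which confirms the parameter count.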

Note that at the moment the AICc and BICc values are not guaranteed to be correct for non-normal models (at the very least their derivation needs to be double-checked, which I haven't done yet), so don't rely on them too much.

I intend to add several other distributions that either are not available in R or are, from my point of view, implemented unsatisfactorily. The function is written in a quite flexible way, so this should not be difficult to do. If you have any preferences, please add them on github, here.

I also want to implement mixture distributions, so that the things discussed in the paper on the intermittent state-space model can also be done using pure regression.

Finally, now that I have `alm()`, we can select between regression models with different distributions (with the `stepwise()` function) or even combine them using AIC weights (hello, `lmCombine()`!). Yes, I know that it sounds crazy (think of the pool of models in this case), but this should be fun!

Regression for Multiple Comparison

One of the reasons why I needed `alm()` is that I wanted to implement a parametric analogue of the `nemenyi()` test. This should be handy when you have large samples (because the asymptotic properties start kicking in), and it should be more powerful than non-parametric tests. So, long story short, I implemented the test in the `rmc()` function. It allows, for example, comparing different forecasting methods based on some known error measures. If we can assume that the forecast error is distributed normally, then we can use this test to see whether the forecasts are biased and how they compare with each other. On large samples it should be enough to have at least a symmetric distribution.

The test is based on a simple linear model with a dummy variable for each provided method (1 if the error corresponds to the method and 0 otherwise). Here's an example of how it works:

```r
ourData <- cbind(rnorm(100, 0, 10), rnorm(100, -2, 5), rnorm(100, 2, 6), rlaplace(100, 1, 5))
colnames(ourData) <- c("Method A", "Method B", "Method C", "Method D")
rmc(ourData, distribution="norm", level=0.95)
```

By default the function produces a graph in the MCB (Multiple Comparison with the Best) style:

RMC example with normal distribution

Don't forget that we are dealing with errors here, so closer to zero is better. Judging by the graph, the least biased method is A, and Method B is negatively biased (its bounds lie below zero).

If we compare the results of the test with the mean values, we will see similar ranking:

```r
apply(ourData, 2, mean)
```
```
  Method A   Method B   Method C   Method D 
-0.5523353 -2.4700249  1.8401501  1.1383028
```

This also reflects how the data were generated. Notice that Method D was generated from a Laplace distribution with mean 1, yet the test still gave the correct answer, because the Laplace distribution is symmetric and the sample size is large enough. But the main point of the test is that we get confidence intervals for each parameter, so we can see whether the differences between the methods are significant: if the intervals intersect, they are not.
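To see why the dummy-variable regression reproduces the ranking of the means: regressing the stacked errors on one indicator per method (without an intercept) yields exactly the per-method mean errors, each with the same pooled standard error, from which the confidence intervals follow. A minimal numpy sketch of this construction (my own illustration of the idea, not `rmc()`'s code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# three methods' forecast errors, generated as in the example above
errors = np.column_stack([
    rng.normal(0, 10, n),   # Method A
    rng.normal(-2, 5, n),   # Method B
    rng.normal(2, 6, n),    # Method C
])
k = errors.shape[1]

# stack the errors into one vector and build one dummy column per method
y = errors.flatten(order="F")
X = np.kron(np.eye(k), np.ones((n, 1)))  # block of ones per method, no intercept

beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # equals the column means exactly
resid = y - X @ beta
sigma2 = resid @ resid / (len(y) - k)         # pooled residual variance
se = np.sqrt(sigma2 / n)                      # same s.e. for every dummy
ci = np.column_stack([beta - 1.96 * se, beta + 1.96 * se])
# non-intersecting rows of ci indicate significantly different biases
```

Checking rows of `ci` for intersections is precisely the comparison that the MCB-style plot shows graphically.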

The function also calculates the relative importance of the model (based on AIC), comparing it with a simple model of the type y = const + e. In my example, the constructed model explains the variance quite well, so the importance is 0.999.

Similarly, we can compare absolute errors, but we would need to use Folded Normal distribution in this case:

```r
rmc(abs(ourData), distribution="fnorm", level=0.95)
```

RMC example with folded normal distribution

Once again, just to compare and make sense of the plot, here are the mean absolute values for the methods:

```r
apply(abs(ourData), 2, mean)
```
```
Method A Method B Method C Method D 
7.703774 4.569836 4.840496 5.569483
```

In order to compare the methods using squared errors, we may use the Chi-Squared distribution, but some preliminary transformations are in order. Specifically, only the basic (central) Chi-Squared distribution is implemented in `rmc()`, so we need to centre our data (the noncentral Chi-Squared distribution would be the next thing to implement):

```r
newData <- ourData - matrix(apply(ourData, 2, mean), nrow(ourData), ncol(ourData), byrow=TRUE)
```

This way we remove the bias and, in a sense, assess accuracy only. In order to get to the conventional Chi-Squared distribution, we would also need to standardise the data (divide it by the standard deviation), but this would make everything similar (in our case, several Chi-Squared distributions with h=1 degree of freedom). Instead, I suggest not doing that and using the new values as they are, implying that each method has a Chi-Squared distribution with $$h \times \sigma^2$$ degrees of freedom, where $$h$$ is the forecasting horizon and $$\sigma^2$$ is the variance of the method's error. Then the differences in $$\sigma^2$$ between the methods will be captured by the test:

```r
rmc(newData^2, distribution="chisq", level=0.95)
```

RMC example with Chi-Squared distribution

You may notice that we get results very similar to those with the Folded Normal distribution. This is because the data is well behaved and does not violate the assumptions of normality.

Overall, in this example we see that although Method A is the least biased of all, it is also considered the least accurate one due to its higher variance (its parameter is significantly higher than those of the other methods).

We can also produce plots with vertical lines that connect the models belonging to the same group (no statistical difference, i.e. the respective intervals intersect). Here's the example for the normal distribution:

```r
rmc(ourData, distribution="norm", level=0.95, style="lines")
```

RMC example with normal distribution, lines plot

If you want to tune the plot, you can always save the model and then replot it with the desired parameters:

```r
ourModel <- rmc(ourData, distribution="norm", level=0.95)
plot(ourModel, xlab="Models", ylab="Errors")
```

Finally, it is worth pointing out that, on large samples, the regression on ranks gives results similar to the nemenyi test:

```r
rmc(t(apply(abs(ourData), 1, rank)), distribution="norm", level=0.95)
```

RMC example with normal distribution, ranked data

This function is currently under development; I decided to show it here just to give an impression of the ongoing research. I still don't have good solutions for many popular error measures, but the plan is to develop the respective elements of the function.

Also, keep in mind that at the moment the function can only be applied to the errors for a specific horizon (for example, 5 steps ahead). If you feed it averaged multiple-steps-ahead forecast errors (e.g. MAE or MSE), you might get unreliable results, because the respective distribution might be complicated and poorly approximated by the existing ones. Having said that, the larger the dataset you use, the more consistent the results will be, even with the Normal distribution (say hi to the Central Limit Theorem).

What else?

Several methods have been moved from `smooth` to `greybox`. These include:

• pointLik() – returns point likelihoods, discussed in our research with Nikos;
• pAIC, pBIC, pAICc, pBICc – point values of the respective information criteria, from the same research;
• nParam() – returns the number of estimated parameters in the model (+ variance);
• errorType() – returns the type of error used in the model (Additive / Multiplicative).

Furthermore, as you might have already noticed, I’ve implemented several distribution functions:

• Folded normal distribution;
• Laplace distribution;
• S distribution.

Finally, there is also a function called `lmDynamic()`, which uses pAIC in order to produce dynamic linear regression models. But that deserves a separate post.

That’s it for now. See you in greybox 0.4.0!


To leave a comment for the author, please follow the link and comment on their blog: R – Modern Forecasting.

### New from RStudio: Package Manager

Tue, 08/07/2018 - 17:29

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

One of the few remaining hurdles when working with R in the enterprise is consistent access to CRAN. Desktop-class systems will often have unrestricted access, while server systems might not have any access at all.

This inconsistency often stems from security concerns about allowing servers access to the internet. There have been many different approaches to solving this problem, with some organisations reluctantly allowing outbound access to CRAN, some rolling their own internal CRAN-like repositories, and others installing a fixed set of packages and leaving it at that.

Fortunately, this problem may now be a thing of the past. Yesterday RStudio announced a new software tool called “Package Manager” that provides a single, on-premise, CRAN-like interface giving access to CRAN, your organisation’s own internal packages, or a combination of the two, all in a unified system.

RStudio Package Manager (RSPM) removes the need for IT teams to whitelist external access to CRAN from all of their R servers. Now, just a single system requires external access to a carefully managed CRAN mirror built specifically for this purpose. Internal systems can now connect to this single, internal package repository instead of to ad-hoc mirrors. Desktop and laptop users can connect to it too, providing a unified package management experience.

Fig 2. RStudio Package Manager simplifies CRAN access and reduces risk

Further, RSPM can be used to publish internal packages as well, and even supports hosting multiple repositories, which can be useful for different groups within the business.

Mango have been using RSPM and providing feedback on it since the earliest private beta stage, and have already provided support around it to a small number of other beta customers. That, combined with our long R heritage and deep roots in the R and enterprise ecosystems, means we’re well placed to help others on their enterprise R journey.

To schedule a call to discuss the options, contact sales@mango-solutions.com.


To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.