Track changes in data with the lumberjack %>>%
(This article was first published on R – Mark van der Loo, and kindly contributed to Rbloggers)
So you are using this pipeline to have data treated by different functions in R. For example, you may be imputing some missing values using the simputation package. Let us first load the only realistic dataset in R
> data(retailers, package="validate") > head(retailers, 3) size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat 1 sc0 0.02 75 NA NA 1130 NA 18915 20045 NA 2 sc3 0.14 9 1607 NA 1607 131 1544 63 NA 3 sc3 0.14 NA 6886 33 6919 324 6493 426 NAThis data is dirty with missings and full of errors. Let us do some imputations with simputation.
> out < retailers %>% + impute_lm(other.rev ~ turnover) %>% + impute_median(other.rev ~ size) > > head(out,3) size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat 1 sc0 0.02 75 NA 6114.775 1130 NA 18915 20045 NA 2 sc3 0.14 9 1607 5427.113 1607 131 1544 63 NA 3 sc3 0.14 NA 6886 33.000 6919 324 6493 426 NA >Ok, cool, we know all that. But what if you’d like to know what value was imputed with which method? That’s where the lumberjack comes in.
The lumberjack operator is a `pipe'[1] operator that allows you to track changes in data.
> library(lumberjack) > retailers$id < seq_len(nrow(retailers)) > out < retailers %>>% + start_log(log=cellwise$new(key="id")) %>>% + impute_lm(other.rev ~ turnover) %>>% + impute_median(other.rev ~ size) %>>% + dump_log(stop=TRUE) Dumped a log at cellwise.csv > > read.csv("cellwise.csv") %>>% dplyr::arrange(key) %>>% head(3) step time expression key variable old new 1 2 20170623 21:11:05 CEST impute_median(other.rev ~ size) 1 other.rev NA 6114.775 2 1 20170623 21:11:05 CEST impute_lm(other.rev ~ turnover) 2 other.rev NA 5427.113 3 1 20170623 21:11:05 CEST impute_lm(other.rev ~ turnover) 6 other.rev NA 6341.683 >So, to track changes we only need to switch from %>% to %>>% and add the start_log() and dump_log() function calls in the data pipeline. (to be sure: it works with any function, not only with simputation). The package is on CRAN now, and please see the introductory vignette for more examples and ways to customize it.
There are many ways to track changes in data. That is why the lumberjack is completely extensible. The package comes with a few loggers, but users or package authors are invited to write their own. Please see the extending lumberjack vignette for instructions.
If this post got you interested, please install the package using
install.packages('lumberjack')You can get started with the introductory vignette or even just use the lumberjack operator %>>% as a (close) replacement of the %>% operator.
As always, I am open to suggestions and comments. Either through the packages github page.
Also, I will be talking at useR2017 about the simputation package, but I will sneak in a bit of lumberjack as well :p.
And finally, here’s a picture of a lumberjack smoking a pipe.
[1] It really should be called a function composition operator, but potetoes/potatoes.
Markdown with by wpgfm var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Mark van der Loo. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
The R community is one of R’s best features
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
R is incredible software for statistics and data science. But while the bits and bytes of software are an essential component of its usefulness, software needs a community to be successful. And that's an area where R really shines, as Shannon Ellis explains in this lovely ROpenSci blog post. For software, a thriving community offers developers, expertise, collaborators, writers and documentation, testers, agitators (to keep the community and software on track!), and so much more. Shannon provides links where you can find all of this in the R community:
 #rstats hashtag — a responsive, welcoming, and inclusive community of R users to interact with on Twitter
 RLadies — a worldwide organization focused on promoting gender diversity within the R community, with more than 30 local chapters
 Local R meetup groups — a google search may show that there's one in your area! If not, maybe consider starting one! Facetoface meetups for users of all levels are incredibly valuable
 Rweekly — an incredible weekly recap of all things R
 Rbloggers — an awesome resource to find posts from many different bloggers using R
 DataCarpentry and Software Carpentry — a resource of openly available lessons that promote and model reproducible research
 Stack Overflow — chances are your R question has already been answered here (with additional resources for people looking for jobs)
I'll add a couple of others as well:
 R Conferences — The annual useR! conference is the major community event of the year, but there are many smaller communityled events on various topics.
 Github — there's a fantastic community of R developers on Github. There's no directory, but the list of trending R developers is a good place to start.
 The R Consortium — proposing or getting involved with an R Consortium project is a great way to get involved with the community
As I've said before, the R community is one of the greatest assets of R, and is an essential component of what makes R useful, easy, and fun to use. And you couldn't find a nicer and more welcoming group of people to be a part of.
To learn more about the R community, be sure to check out Shannon's blog post linked below.
ROpenSci Blog: Hey! You there! You are welcome here
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Logarithmic Scale Explained with U.S. Trade Balance
(This article was first published on novyden, and kindly contributed to Rbloggers)
Skewed data prevail in real life. Unless you observe trivial or near constant processes data is skewed one way or another due to outliers, long tails, errors or something else. Such effects create problems in visualizations when a few data elements are much larger than the rest. Consider U.S. 2016 merchandise trade partner balances data set where each point is a country with 2 features: U.S. imports and exports against it: Suppose we decided to visualize top 30 U.S trading partners using bubble chart, which simply is a 2D scatter plot with the third dimension expressed through point size. Then U.S. trade partners become disks with imports and exports for xy coordinates and trade balance (abs(export – import)) for size: China, Canada, and Mexico run far larger balances compared to the other 27 countries which causes most data points to collapse into crowded lower left corner. One way to “solve” this problem is to eliminate 3 mentioned outliers from the picture:While this plot does look better it no longer serves its original purpose of displaying all top trading partners. And undesirable effect of outliers though reduced still presents itself with new ones: Japan, Germany, and U.K. So let us bring all countries back into the mix by trying logarithmic scale. Quick refresher from algebra. Log function (in this example log base 10 but the same applies to natural log or log base 2) is commonly used to transform positive real numbers. All because of its property of mapping multiplicative relationships into additive ones. Indeed, given numbers A, B, and C such that
`A*B=C and A,B,C > 0`applying log results in additive relationship:
`log(A) + log(B) = log(C)` For example, let A=100, B=1000, and C=100000 then`100 * 1000 = 100000`
so that after transformation it becomes `log(100) + log(1000) = log(100000)` or `2 + 3 = 5`
Observe this on 1D plane:
Logarithmic scale is simply a log transformation applied to all feature’s values before plotting them. In our example we used it on both trading partners’ features – imports and exports which gives bubble chart new look:
The same data displayed on logarithmic scale appear almost uniform but not to forget the farther away points from 0 the more orders of magnitude they are apart on actual scale (observe this by scrolling back to the original plot). The main advantage of using log scale in this plot is ability of observing relationships between all top 30 countries without loosing the whole picture and avoiding collapsing smaller points together.
For more detailed discussion of logarithmic scale refer to When Should I Use Logarithmic Scales in My Charts and Graphs? Oh, and how about that trade deficit with China?
This is a repost from the original blog on LinkedIn.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' };
(function(d, t) {
var s = d.createElement(t); s.type = 'text/javascript'; s.async = true;
s.src = '//cdn.viglink.com/api/vglnk.js';
var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r);
}(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: novyden. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Working With SPSS© Data in R
(This article was first published on Pachá (Batteries Included), and kindly contributed to Rbloggers)
IntroductionI was in need of importing SPSS© data for work. There are some options but I’ve used both foreign and haven R packages. I prefer haven because it integrates better with R’s tidyverse and started using it in detriment of foreign when I verified it behaves well with factors and solves the deprecated factors labels in newer R versions.
The DataFor this post I found Diego Portales University National Survey. It consist in a publicly available survey applied since 2005 and applied at nationwide level to ask people about their trust in institutions (e.g. government, police, firefighters, etc) and what its their option on samesex marriage, restricting spaces to smoke, and more.
Importing Data #devtools::install_github("ropenscilabs/skimr") # Exploratory Data Analysis tools library(ggplot2) library(dplyr) library(sjlabelled) library(skimr) library(readr) # Import foreign statistical formats library(haven) # Data url = "http://encuesta.udp.cl/descargas/banco%20de%20datos/2015/Encuesta%20Nacional%20UDP%202015.sav" sav = "20170624_working_with_spss_data_in_r/udp_national_survey_2015.sav" if(!file.exists(sav)){download.file(url,sav)} survey = read_sav(sav) Exploring dataTo explore the data consider the survey is in spanish. So, “fecha” means date, “edad” means age, and sexo means “sex”.
# How many surveys do I have by day? daily = survey %>% mutate(Fecha = as.Date(Fecha, "%d%m%Y")) %>% rename(date = Fecha) %>% group_by(date) %>% summarise(n = n()) ggplot(daily, aes(date, n)) + geom_line() # How is the age distributed? summary(survey$Edad_Entrevistado) Min. 1st Qu. Median Mean 3rd Qu. Max. 18.00 32.00 48.00 47.92 61.00 89.00 age = survey %>% mutate(as.integer(Edad_Entrevistado)) %>% rename(age = Edad_Entrevistado) %>% group_by(age) %>% summarise(n = n()) ggplot(age, aes(age, n)) + geom_line() # How is the sex distributed? survey %>% rename(sex_id = Sexo_Entrevistado) %>% group_by(sex_id) %>% summarise(n = n()) # A tibble: 2 x 2 sex_id n 1 1 651 2 2 651 Exploring labelsIn the last tibble we have no idea what is 1 and 2.
survey %>% select(Sexo_Entrevistado) %>% rename(sex_id = Sexo_Entrevistado) %>% distinct() %>% mutate(sex = as_factor(sex_id)) # A tibble: 2 x 2 sex_id sex 1 2 Mujer 2 1 HombreThe last column (in spanish) shows us that in this survey “1 = Male” and “2 = Female”.
I could run
survey %>% rename(sex = Sexo_Entrevistado) %>% mutate(sex = as.integer(sex)) %>% mutate(sex = recode(sex, `1` = "Male", `2` = "Female")) %>% group_by(sex) %>% summarise(n = n()) # A tibble: 2 x 2 sex n 1 Female 651 2 Male 651The column names are labelled as well. Here sjlabelled helps if I want to know for example what “P12” means. But instead of just translating labels I’ll describe the complete dataset.
Describing the dataset valid_replies = survey %>% mutate_if(is.labelled,as.numeric) %>% skim() %>% filter(stat=="complete") %>% mutate(description = get_label(survey)) %>% select(var,description,everything()) %>% select(c(stat,level,type)) %>% rename(pcent_valid = value) %>% mutate(pcent_valid = paste0(100*round(pcent_valid / nrow(survey),2),'%')) histograms = survey %>% mutate_if(is.labelled,as.numeric) %>% skim() %>% filter(stat=="hist") %>% select(var,level) %>% rename(histogram = level) survey_description = valid_replies %>% left_join(histograms) %>% write_csv("20170624_working_with_spss_data_in_r/survey_description.csv") survey_description # A tibble: 203 x 4 var description pcent_valid histogram 1 PONDERADOR Ponderador 100% ▂▇▇▅▅▃▁▁▁▁ 2 Folio Folio 100% ▇▇▇▇▇▇▇▇▇▇ 3 Región Región 100% ▁▁▂▁▂▁▁▁▇▁ 4 Comuna Comuna 100% ▁▁▂▁▁▂▁▁▇▁ 5 Fecha Fecha entrevista 100% 6 Sexo_Encuestador Sexo Entrevistador 91% ▂▁▁▁▁▁▁▁▁▇ 7 GSE GSE Visual 100% ▁▁▂▁▇▁▁▆▁▁ 8 Sexo_Entrevistado Sexo Entrevistado 100% ▇▁▁▁▁▁▁▁▁▇ 9 Edad_Entrevistado Edad Entrevistado 100% ▇▆▅▆▇▇▅▃▃▂ 10 Hora_Inicio Hora Inicio Medición 100% # ... with 193 more rowsExploring the last tibble there are interesting questions. For example, P12 refers to “Apoyo a la democracia” that is Do you support democracy?.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Pachá (Batteries Included). Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Statespace modelling of the Australian 2007 federal election
(This article was first published on Peter's stats stuff  R, and kindly contributed to Rbloggers)
Pooling the polls with Bayesian statisticsIn an important 2005 article in the Australian Journal of Political Science, Simon Jackman set out a statisticallybased approach to pooling polls in an election campaign. He describes the sensible intuitive approach of modelling a latent, unobserved voting intention (unobserved except on the day of the actual election) and treats each poll as a random observation based on that latent state space. Uncertainty associated with each measurement comes from sample size and bias coming from the average effect of the firm conducting the poll, as well as of course uncertainty about the state of the unobserved voting intention. This approach allows house effects and the latent state space to be estimated simultaneously, quantifies the uncertainty associated with both, and in general gives a much more satisfying method of pooling polls than any kind of weighted average.
Jackman gives a worked example of the approach in his excellent book Bayesian Analysis for the Social Sciences, using voting intention for the Australian Labor Party (ALP) in the 2007 Australian federal election for data. He provides JAGS code for fitting the model, but notes that with over 1,000 parameters to estimate (most of those parameters are the estimated voting intention for each day between the 2004 and 2007 elections) it is painfully slow to fit in general purpose MCMCbased Bayesian tools such as WinBUGS or JAGS – several days of CPU time on a fast computer in 2009. Jackman estimated his model with Gibbs sampling implemented directly in R.
Down the track, I want to implement Jackman’s method of polling aggregation myself, to estimate latent voting intention for New Zealand to provide an alternative method for my election forecasts. I set myself the familiarisation task of reproducing his results for the Australian 2007 election. New Zealand’s elections are a little complex to model because of the multiple parties in the proportional representation system, so I wanted to use a general Bayesian tool for the purpose to simplify my model specification when I came to it. I use Stan because its Hamiltonian Monte Carlo method of exploring the parameter space works well when there are many parameters – as in this case, with well over 1,000 parameters to estimate.
Stan describes itself as “a stateoftheart platform for statistical modeling and highperformance statistical computation. Thousands of users rely on Stan for statistical modeling, data analysis, and prediction in the social, biological, and physical sciences, engineering, and business.” It lets the programmer specify a complex statistical model, and given a set of data will return a range of parameter estimates that were most likely to produce the observed data. Stan isn’t something you use as an endtoend workbench – it’s assumed that data manipulation and presentation is done with another tool such as R, Matlab or Python. Stan focuses on doing one thing well – using Hamiltonian Monte Carlo to estimate complex statistical models, potentially with many thousands of hierarchical parameters, with arbitrarily set prior distributions.
Caveat! – I’m fairly new to Stan and I’m pretty sure my Stan programs that follow aren’t best practice, even though I am confident they work. Use at your own risk!
Basic approach – estimated voting intention in the absence of pollsI approached the problem in stages, gradually making my model more realistic. First, I set myself the task of modelling latent firstpreference support for the ALP in the absence of polling data. If all we had were the 2004 and 2007 election results, where might we have thought ALP support went between those two points? Here’s my results:
For this first analysis, I specified that support for the ALP had to be a random walk that changed by a normally distributed variable with standard deviation of 0.25 percentage points for each daily change. Why 0.25? Just because Jim Savage used it in his rough application of this approach to the US Presidential election in 2016. I’ll be relaxing this assumption later.
Here’s the R code that sets up the session, brings in the data from Jackman’s pscl R package, and defines a graphics function that I’ll be using for each model I create.
library(tidyverse) library(scales) library(pscl) library(forcats) library(rstan) rstan_options(auto_write = TRUE) options(mc.cores = 7) #=========2004 election to 2007 election============== data(AustralianElectionPolling) data(AustralianElections) days_between_elections < as.integer(diff(as.Date(c("20041009", "20071124")))) + 1 #' Function to plot time series extracted from a stan fit of latent state space model of 2007 Australian election plot_results < function(stan_m){ if(class(stan_m) != "stanfit"){ stop("stan_m must be an object of class stanfit, with parameters mu representing latent vote at a point in time") } ex < as.data.frame(rstan::extract(stan_m, "mu")) names(ex) < 1:d1$n_days p < ex %>% gather(day, value) %>% mutate(day = as.numeric(day), day = as.Date(day, origin = "20041008")) %>% group_by(day) %>% summarise(mean = mean(value), upper = quantile(value, 0.975), lower = quantile(value, 0.025)) %>% ggplot(aes(x = day)) + labs(x = "Shaded region shows a pointwise 95% credible interval.", y = "Voting intention for the ALP (%)", caption = "Source: Jackman's pscl R package; analysis at https://ellisp.github.io") + geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3) + geom_line(aes(y = mean)) + scale_y_continuous(breaks = 31:54, sec.axis = dup_axis(name = "")) + theme(panel.grid.minor = element_blank()) return(p) }Here’s the Stan program that specifies this super simple model of changing ALP support from 2004 to 2007:
// ozpolls1.stan data { int n_days; // number of days real mu_start; // value at starting election real mu_finish; // value at final election } parameters { real mu[n_days]; // underlying state of vote intention } model { // state model mu[1] ~ normal(mu_start, 0.01); for (i in 2:n_days) mu[i] ~ normal(mu[i  1], 0.25); // measurement model // 1. Election result mu[n_days] ~ normal(mu_finish, 0.01); }And here’s the R code that calls that Stan program and draws the resulting summary graphic. Stan works by compiling a program in C++ that is based on the statistical model specified in the *.stan file. Then the C++ program zooms around the highdimensional parameter space, moving slower around the combinations of parameters that seem more likely given the data and the specified prior distributions. It can use multiple processors on your machine and works super fast given the complexity of what it’s doing.
#no polls inbetween the elections d1 < list(mu_start = 37.64, mu_finish = 43.38, n_days = days_between_elections) # returns some warnings first time it compiles; see # http://mcstan.org/misc/warnings.html suggests most compiler # warnings can be just ignored. system.time({ stan_mod1 < stan(file = 'ozpolls1.stan', data = d1, control = list(max_treedepth = 20)) }) # 1800 seconds plot_results(stan_mod1) + ggtitle("Voting intention for the ALP between the 2004 and 2007 Australian elections", "Latent variable estimated with no use of polling data") Adding in one polling firmNext I wanted to add a single polling firm. I chose Nielsen’s 42 polls because Jackman found they had a fairly low bias, which removed one complication for me as I built up my familiarity with the approach. Here’s the result:
That model was specified in Stan as set out below. The Stan program is more complex now; I’ve had to specify how many polls I have (y_n), the values for each poll (y_values), and the days since the last election each poll was taken (y_days). This way I only have to specify 42 measurement errors as part of the probability model – other implementations I’ve seen of this approach ask for an estimate of measurement error for each poll on each day, treating the days with no polls as missing values to be estimated. That obviously adds a huge computational load I wanted to avoid.
In this program, I haven’t yet added in the notion of a house effect for Nielsen. Each measurement Nielsen made is assumed to have been an unbiased one. Again, I’ll be relaxing this later. The state model is also the same as before ie standard deviation of the day to day innovations is still hard coded as 0.25 percentage points.
// ozpolls2.stan data { int n_days; // number of days real mu_start; // value at starting election real mu_finish; // value at final election int y_n; // number of polls real y_values[y_n]; // actual values in polls int y_days[y_n]; // the number of days since starting election each poll was taken real y_se[y_n]; } parameters { real mu[n_days]; // underlying state of vote intention } model { // state model mu[1] ~ normal(mu_start, 0.01); for (i in 2:n_days) mu[i] ~ normal(mu[i  1], 0.25); // measurement model // 1. Election result mu[n_days] ~ normal(mu_finish, 0.01); // 2. Polls for(t in 1:y_n) y_values[t] ~ normal(mu[y_days[t]], y_se[t]); }Here’s the R code to prepare the data and pass it to Stan. Interestingly, fitting this model is noticeably faster than the one with no polling data at all. My intuition for this is that now the state space is constrained to being reasonably close to some actually observed measurements, it’s an easier job for Stan to know where is good to explore.
#AC Nielson ac < AustralianElectionPolling %>% filter(org == "Nielsen") %>% mutate(MidDate = startDate + (endDate  startDate) / 2, MidDateNum = as.integer(MidDate  as.Date("20041008")), # ie number of days since first election; last election (9 October 2004) is day 1 p = ALP / 100, se_alp = sqrt(p * (1 p) / sampleSize) * 100) d2 < list( mu_start = 37.64, mu_finish = 43.38, n_days = days_between_elections, y_values = ac$ALP, y_days = ac$MidDateNum, y_n = nrow(ac), y_se = ac$se_alp ) system.time({ stan_mod2 < stan(file = 'ozpolls2.stan', data = d2, control = list(max_treedepth = 20)) }) # 510 seconds plot_results(stan_mod2) + geom_point(data = ac, aes(x = MidDate, y = ALP)) + ggtitle("Voting intention for the ALP between the 2004 and 2007 Australian elections", "Latent variable estimated with use of just one firm's polling data (Nielsen)") Including all five polling housesFinally, the complete model replicating Jackman’s work:
As well as adding the other four sets of polls, I’ve introduced five house effects that need to be estimated (ie the bias for each polling firm/mode); and I’ve told Stan to estimate the standard deviation of the daytoday innovations in the latent support for ALP rather than hardcoding it as 0.25. Jackman specified a uniform prior on [0, 1] for that parameter, but I found this led to lots of estimation problems for Stan. The Stan developers give some great practical advice on this sort of issue and I adapted some of that to specify the prior distribution for the standard deviation of day to day innovation as N(0.5, 0.5), constrained to be positive.
Here’s the Stan program:
// ozpolls3.stan data { int n_days; // number of days real mu_start; // value at starting election real mu_finish; // value at final election // change the below into 5 matrixes with 3 columns each for values, days, standard error int y1_n; // number of polls int y2_n; int y3_n; int y4_n; int y5_n; real y1_values[y1_n]; // actual values in polls real y2_values[y2_n]; real y3_values[y3_n]; real y4_values[y4_n]; real y5_values[y5_n]; int y1_days[y1_n]; // the number of days since starting election each poll was taken int y2_days[y2_n]; int y3_days[y3_n]; int y4_days[y4_n]; int y5_days[y5_n]; real y1_se[y1_n]; // the standard errors of the polls real y2_se[y2_n]; real y3_se[y3_n]; real y4_se[y4_n]; real y5_se[y5_n]; } parameters { real mu[n_days]; // underlying state of vote intention real d[5]; // polling effects real sigma; // sd of innovations } model { // state model mu[1] ~ normal(mu_start, 0.01); // starting point // Jackman used uniform(0, 1) for sigma, but this seems to be the cause // of a lot of problems with the estimation process. // https://github.com/standev/stan/wiki/PriorChoiceRecommendations // recommends not using a uniform, but constraining sigma to be positive // and using an open ended prior instead. So: sigma ~ normal(0.5, 0.5); // prior for innovation sd. for (i in 2:n_days) mu[i] ~ normal(mu[i  1], sigma); // measurement model // 1. Election result mu[n_days] ~ normal(mu_finish, 0.01); // 2. Polls d ~ normal(0, 7.5); // ie a fairly loose prior for house effects for(t in 1:y1_n) y1_values[t] ~ normal(mu[y1_days[t]] + d[1], y1_se[t]); for(t in 1:y2_n) y2_values[t] ~ normal(mu[y2_days[t]] + d[2], y2_se[t]); for(t in 1:y3_n) y3_values[t] ~ normal(mu[y3_days[t]] + d[3], y3_se[t]); for(t in 1:y4_n) y4_values[t] ~ normal(mu[y4_days[t]] + d[4], y4_se[t]); for(t in 1:y5_n) y5_values[t] ~ normal(mu[y5_days[t]] + d[5], y5_se[t]); }Building the fact there are 5 polling firms (or firmmode combinations, as Morgan is in there twice) directly into the program must be bad practice, but seeing as there are different numbers of polls taken by each firm and on different days I couldn’t work out a better way to do it. Stan doesn’t support ragged arrays, or objects like R’s lists, or (I think) convenient subsetting of tables, which would be the three ways I’d normally try to do that in another language. So I settled for the approach above, even though it has some ugly bits of repetition.
Here’s the R code that sorts the data and passes it to Stan
#all 5 polls all_polls < AustralianElectionPolling %>% mutate(MidDate = startDate + (endDate  startDate) / 2, MidDateNum = as.integer(MidDate  as.Date("20041008")), # ie number of days since starting election p = ALP / 100, se_alp = sqrt(p * (1 p) / sampleSize) * 100, org = fct_reorder(org, ALP)) poll_orgs < as.character(unique(all_polls$org)) p1 < filter(all_polls, org == poll_orgs[[1]]) p2 < filter(all_polls, org == poll_orgs[[2]]) p3 < filter(all_polls, org == poll_orgs[[3]]) p4 < filter(all_polls, org == poll_orgs[[4]]) p5 < filter(all_polls, org == poll_orgs[[5]]) d3 < list( mu_start = 37.64, mu_finish = 43.38, n_days = days_between_elections, y1_values = p1$ALP, y1_days = p1$MidDateNum, y1_n = nrow(p1), y1_se = p1$se_alp, y2_values = p2$ALP, y2_days = p2$MidDateNum, y2_n = nrow(p2), y2_se = p2$se_alp, y3_values = p3$ALP, y3_days = p3$MidDateNum, y3_n = nrow(p3), y3_se = p3$se_alp, y4_values = p4$ALP, y4_days = p4$MidDateNum, y4_n = nrow(p4), y4_se = p4$se_alp, y5_values = p5$ALP, y5_days = p5$MidDateNum, y5_n = nrow(p5), y5_se = p5$se_alp ) system.time({ stan_mod3 < stan(file = 'ozpolls3.stan', data = d3, control = list(max_treedepth = 15, adapt_delta = 0.8), iter = 4000) }) # about 600 seconds plot_results(stan_mod3) + geom_point(data = all_polls, aes(x = MidDate, y = ALP, colour = org), size = 2) + geom_line(aes(y = mean)) + labs(colour = "") + ggtitle("Voting intention for the ALP between the 2004 and 2007 Australian elections", "Latent variable estimated with use of all major firms' polling data") Estimates of polling house effectsHere’s the house effects estimated by me with Stan, compared to those in Jackman’s 2009 book:
Basically we got the same results – certainly close enough anyway. Jackman writes:
“The largest effect is for the facetoface polls conducted by Morgan; the point estimate of the house effect is 2.7 percentage points, which is very large relative to the classical sampling error accompanhying these polls.”
Interestingly, Morgan’s phone polls did much better.
Here’s the code that did that comparison:
house_effects < summary(stan_mod3, pars = "d")$summary %>% as.data.frame() %>% round(2) %>% mutate(org = poll_orgs, source = "Stan") %>% dplyr::select(org, mean, `2.5%`, `97.5%`, source) jackman < data_frame( org = c("Galaxy", "Morgan, F2F", "Newspoll", "Nielsen", "Morgan, Phone"), mean = c(1.2, 2.7, 1.2, 0.9, 0.8), `2.5%` = c(3.1, 1, 0.5, 0.8, 1), `97.5%` = c(0.6, 4.3, 2.8, 2.5, 2.3), source = "Jackman" ) d < rbind(house_effects, jackman) %>% mutate(org = fct_reorder(org, mean), ypos = as.numeric(org) + 0.1  0.2 * (source == "Stan")) d %>% ggplot(aes(y = ypos, colour = source)) + geom_segment(aes(yend = ypos, x = `2.5%`, xend = `97.5%`)) + geom_point(aes(x = mean)) + scale_y_continuous(breaks = 1:5, labels = levels(d$org), minor_breaks = NULL) + theme(panel.grid.major.y = element_blank(), legend.position = "right") + labs(x = "House effect for polling firms, 95% credibility intervals\n(percentage points overestimate of ALP vote)", y = "", colour = "", caption = "Source: Jackman's pscl R package; analysis at https://ellisp.github.io") + ggtitle("Polling 'house effects' in the leadup to the 2007 Australian election", "Basically the same results in new analysis with Stan as in the original Jackman (2009)")So there we go – state space modelling of voting intention, with variable house effects, in the Australian 2007 federal election.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff  R. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Hey! You there! You are welcome here
(This article was first published on rOpenSci Blog, and kindly contributed to Rbloggers)
What's that? You've heard of R? You use R? You develop in R? You know someone else who's mentioned R? Oh, you're breathing? Well, in that case, welcome! Come join the R community!
We recently had a group discussion at rOpenSci's #runconf17 in Los Angeles, CA about the R community. I initially opened the issue on GitHub. After this issue was wellreceived (check out the emojilove below!), we realized people were keen to talk about this and decided to have an optional and informal discussion in person.
To get the discussion started I posed two general questions and then just let discussion fly. I prompted the group with the following:
 The R community is such an asset. How do we make sure that everyone knows about it and feels both welcome and comfortable?
 What are other languages/communities doing that we're not? How could we adopt their good ideas?
The discussion focused primarily on the first point, and I have to say the group's answers…were awesome. Take a look!
How to find the communityEveryone seemed to be in agreement that (1) the community is one of R's biggest strengths and (2) a lot within the R community happens on twitter. During discussion, Julia Lowndes mentioned she joined twitter because she heard that people asked and answered questions about R there, and others echoed this sentiment. Simply, the R community is not just for 'power users' or developers. It's a place for users and people interested in learning more about R. So, if you want to get involved in the community and you are not already, consider getting a twitter account and check out the #rstats hashtag. We expect you'll be surprised by how responsive, welcoming, and inclusive the community is.
In addition to twitter, there are many resources available within the R community where you can learn more about all things R. Below is a brief list of resources mentioned during our discussion that had helped us feel more included in the community. Feel free to suggest more!
 RLadies – – a worldwide organization focused on promoting gender diversity within the R community, with more than 30 local chapters
 Local R meetup groups – a google search may show that there's one in your area! If not, maybe consider starting one! Facetoface meetups for users of all levels are incredibly valuable
 Rweekly – an incredible weekly recap of all things R
 Rbloggers – an awesome resource to find posts from many different bloggers using R
 DataCarpentry and Software Carpentry – a resource of openly available lessons that promote and model reproducible research
 Stack Overflow – chances are your R question has already been answered here (with additional resources for people looking for jobs)
No community is perfect, and being willing to consider our shortcomings and think about ways in which we can improve is so important. The group came up with a lot of great suggestions, including many I had not previously thought of personally.
Alice Daish did a great job capturing the conversation and allowing for more discussion online:
Join the lunchtime #runconf17 discussion about the #rstats communities – what do we need to do to improve? pic.twitter.com/ztbXxNfqU7
— Alice Data (@alice_data) May 26, 2017
To summarize here:
 Take the time to welcome new people. A simple hello can go a long way!
 Reach out to people we may be missing: high school students, people of different backgrounds, individuals in other countries, etc.
 Avoid the word "just" when helping others. "Here's one way of thinking about that" >> "Just do it this way"
 Include people whose primary language is not English in the conversation! Consider tweeting & retweeting in your own language to extend the community. This helps include others and spread knowledge!
 Be involved in open projects. If you chose to turn down an opportunity that is not open, do your best to explain why being involved in open projects is important to you.
 David Smith recently suggested getting #rbeginners to take off as a hashtag – a great way to direct newer members' attention to tips and resources!
 Be conscious of your tone. When in doubt, check out tone checker.
 If you see someone being belittling in their answers, consider reaching out to the person who is behaving inappropriately. There was some agreement that reaching out privately may be more effective as a first approach than calling them out in public.Strong arguments against that strategy and in favor of a public response from Oliver Keyes can be found here.
 Also, it's often easier to defend on behalf of someone else than it is on one's own behalf. Keep that in mind if you see negative things happening, and consider defending on someone else's behalf.
 Having a code of conduct is important. rOpenSci has one, and we like it a whole lot.
And, when times get tough, look to your community. Get out there. Be active. Communicate with one another. Tim Phan brilliantly summarized the importance of action and community in this thread:
Dear #runconf17, bye for now!Thanks to the organizers for all you do. Here's an incoming tweet storm on R, community, and open science 1/6 pic.twitter.com/7DpkceOUC8
— Timothy Phan (@timothy_phan) May 26, 2017
Thank youThank you to all who participated in this conversation and all who contribute to the community to make R such a fun language in which to work and develop! Thank you to rOpenSci for hosting and giving us all the opportunity to get to know one another and work together. I'm excited to see where this community goes moving forward!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
May New Package Picks
(This article was first published on R Views, and kindly contributed to Rbloggers)
Two hundred and twentynine new packages were submitted to CRAN in May. Here are my picks for the “Top 40”, organized into five categories: Data, Data Science and Machine Learning, Education, Miscellaneous, Statistics and Utilities.
Dataangstroms v0.0.1: Provides helper functions for working with Regional Ocean Modeling System (ROMS) output.
bikedata v0.0.1: Download and aggregate data from public bicycle systems from around the world. There is a vignette.
datasauRushttps://CRAN.Rproject.org/package=datasauRus v0.1.2: The Datasaurus Dozen is a set of datasets that have the same summary statistics, despite having radically different distributions. As well as being an engaging variant on the Anscombe’s Quartet, the data is generated in a novel way through a simulated annealing process. Look here for details, and in the vignette for examples.
dwapi v0.1.1: Provides a set of wrapper functions for data.world’s REST API. There is a quickstart guide.
HURDAT v0.1.0: Provides datasets from the Hurricane Research Division’s Hurricane ReAnalysis Project, giving details for most known hurricanes and tropical storms for the Atlantic and northeastern Pacific ocean (northwestern hemisphere). The vignette describes the datasets.
neurohcp v0.6: Implements an interface to the Human Connectome Project. The vignette shows how it works.
osmdata v0.0.3: Provides functions to download and import of OpenStreetMap data as ‘sf’ or ‘sp’ objects. There is an Introduction and a vignette describing Translation to Simple Features.
parlitools v0.0.4: Provides various tools for analyzing UK political data, including creating political cartograms and retrieving data. There is an Introduction, and vignettes on the British Election Study, Mapping Local Suthorities, and Using Cartograms.
rerddap v0.4.2: Implements an R client to NOAA’s ERDDDAP data servers. There is an Introduction.
soilcarbon v1.0.0: Provides tools for analyzing the Soil Carbon Database created by Powell Center Working Group. The vignette launches a local Shiny App.
suncalc v0.1: Implements an R interface to the ‘suncalc.js’ library, part of the SunCalc.net’s project for calculating sun position, sunlight phases, moon position and lunar phase for the given location and time.
Data Science and Machine LearningEventStudy v0.3.1: Provides an interface to the EventStudy API. There is an Introduction, and vignettes on Preparing EventStudy, parameters, and the RStudio Addin.
kmcudaR v1.0.0: Provides a fast, dropin replacement for the classic Kmeans algorithm based on Yingyang KMeans. Look here for details.
openEBGM v0.1.0: Provides an implementation of DuMouchel’s Bayesian data mining method for the market basket problem. There is an Introduction, and vignettes for Processing Raw Data, Hyperparameter Estimation, Empirical Bayes Metrics, and Objects and Class Functions.
spacyr v0.9.0: Provides a wrapper for the Python spaCy Natural Language Processing library. Look here for help with installation and use.
Educationlearnr v0.9: Provides functions to create interactive tutorials for learning about R and R packages using R Markdown, using a combination of narrative, figures, videos, exercises, and quizzes. Look here to get started.
olsrr v0.2.0: Provides tools for teaching and learning ordinary least squares regression. There is an Introduction and vignettes on Heteroscedascitity, Measures of Influence, Collinearity Diagnostics, Residual Diagnostics and Variable Selection Methods.
rODE v0.99.4: Contains functions to show students how an ODE solver is made and how classes can be effective for constructing equations that describe natural phenomena. Have a look at the free book Computer Simulations in Physics. There are several vignettes providing brief examples, including one on the Pendulum and another on Planets.
Miscelaneousatlantistools v0.4.2: Provides access to the Atlantis framework for endtoend marine ecosystem modelling. There is a package demo and vignettes for model preprocessing, model calibration, species calibration, and model comparison.
phylodyn v0.9.0: Provides statistical tools for reconstructing population size from genetic sequence data. There are several vignettes including a Coalescent simulation of genealogies and a case study using New York Influenza data.
StatisticsadaptiveGPCA v0.1: Implements the adaptive gPCA algorithm described in Fukuyama. The vignette shows an example using data stored in a phyloseq object.
BayesNetBP v1.2.1: Implements belief propagation methods for Bayesian Networks based on the paper by Cowell. There is a function to invoke a Shiny App.
RPEXE.RPEXT v0.0.1: Implements the likelihood ration test and backward elimination procedure for the reduced piecewise exponential survival analysis technique described in described in Han et al. 2012 and 2016. The vignette provides examples.
sfdct v0.0.3: Provides functions to construct a constrained ‘Delaunay’ triangulation from simple features objects. There is a vignette.
simglm v0.5.0: Provides functions to simulate linear and generalized linear models with up to three levels of nesting. There is an Introduction and vignettes for simulating GLMs and Missing Data performing Power Analysis and dealing with Unbalanced Data.
Utilitiescheckarg v0.1.0: Provides utility functions that allow checking the basic validity of a function argument or any other value, including generating an error and assigning a default in a single line of code.
CodeDepends v0.53: Provides tools for analyzing R expressions or blocks of code and determining the dependencies between them. The vignette shows how to use them.
desctable v0.1.0: Provides functions to create descriptive and comparative tables that are ready to be saved as csv, or piped to DT::datatable() or pander::pander() to integrate into reports. There is a vignette to get you started.
lifelogr v0.1.0: Provides a framework for combining selfdata from multiple sources, including fitbit and Apple Health. There is a general introduction as well as an introduction for visualization functions.
processx v2.0.0: Portable tools to run system processes in the background.
printr v0.1: Extends knitr generic function knit_print() to automatically print objects using an appropriate format such as Markdown or LaTeX. The vignette provides an introduction.
RHPCBenchmark v0.1.0: Provides microbenchmarks for determining the runtime performance of aspects of the R programming environment, and packages that are relevant to highperformance computation. There is an Introduction.
rlang v0.1.1: Provides a toolbox of functions for working with base types, core R features like the condition system, and core ‘Tidyverse’ features like tidy evaluation. The vignette explains R’s capabilities for creating Domain Specific Languages.
readtext v0.50: Provides functions for importing and handling text files and formatted text files with additional metadata, including ‘.csv’, ‘.tab’, ‘.json’, ‘.xml’, ‘.pdf’, ‘.doc’, ‘.docx’, ‘.xls’, ‘.xlsx’ and other file types. There is a vignette
tangram v0.2.6: Provides an extensible formula system to implements a grammar of tables for creating productionquality tables using a threestep process that involves a formula parser, statistical content generation from data, and rendering. There is a vignette introducing the Grammar, a Global Style for Rmd, and duplicating SAS PROC Tabulate.
tatoo v1.0.6: Provides functions to combine data.frames and to add metadata that can be used for printing and xlsx export. The vignette shows some examples.
VisualizationsContourFunctions v0.1.0: Provides functions for making contour plots. A vignette introduces the package.
mbgraphic v1.0.0: Implements a twostep process for describing univariate and bivariate behavior similar to the cognostics measures proposed by Paul and John Tuke. First, measures describing variables are computed and then plots are selected. The vignette describes the details.
polypoly v0.0.2: Provides tools for reshaping, plotting, and manipulating matrices of orthogonal polynomials. The vignette provides an overview.
RJSplot v2.1: Provides functions to create interactive graphs with ‘R’. It joins the data analysis power of R and the visualization libraries of JavaScript in one package There is a tutorial.
_____='https://rviews.rstudio.com/2017/06/23/maynewpackagepicks/';
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R Views. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Set Theory Arbitrary Union and Intersection Operations with R
(This article was first published on R – Aaron Schlegel, and kindly contributed to Rbloggers)
Part 3 of 3 in the series Set TheoryThe union and intersection set operations were introduced in a previous post using two sets, and . These set operations can be generalized to accept any number of sets.
Arbitrary Set Unions OperationConsider a set of infinitely many sets:
It would be very tedious and unnecessary to repeat the union statement repeatedly for any nontrivial amount of sets, for example, the first few unions would be written as:
Thus a more general operation for performing unions is needed. This operation is denoted by the symbol. For example, the set above and the desired unions of the member sets can be generalized to the following using the new notation:
We can then state the following definition: For a set , the union of is defined by:
For example, consider the three sets:
The union of the three sets is written as:
Recalling our union axiom from a previous post, the union axiom states for two sets and , there is a set whose members consist entirely of those belonging to sets or , or both. More formally, the union axiom is stated as:
As we are now dealing with an arbitrary amount of sets, we need an updated version of the union axiom to account for the change.
Restating the union axiom:
For any set , there exists a set whose members are the same elements of the elements of . Stated more formally:
The definition of can be stated as:
For example, we can demonstrate the updated axiom with the union of four sets :
We can implement the set operation for an arbitrary amount of sets by expanding upon the function we wrote previously.
set.unions < function(a, b, ...) { u < a for (i in 1:length(b)) { if (!(b[i] %in% u)) { u < append(u, b[i]) } } s < list(...) for (i in s) { for (j in i) { if (!(j %in% u)) { u < append(u, j) } } } return(u) }Perform the set union operation of four sets:
a < c(1,2,3) b < c(3,4,5) c < c(1,4,6) d < c(2,5,7) set.unions(a, b, c, d) ## [1] 1 2 3 4 5 6 7 Intersections of an Arbitrary Number of SetsThe intersection set operation can also be generalized to any number of sets. Consider the previous set containing an infinite number of sets.
As before, writing out all the intersections would be tedious and not elegant. The intersection can instead be written as:
As before in our previous example of set intersections, there is no need for a separate axiom for intersections, unlike unions. Instead, we can state the following theorem, for a nonempty set , a set exists that such for any element :
Consider the following four sets:
The intersection of the sets is written as:
We can write another function to implement the set intersection operation given any number of sets.
set.intersections < function(a, b, ...) { intersect < vector() for (i in a) { if (i %in% b) { intersect < append(intersect, i) } } s < list(...) for (i in s) { for (j in i) { if (j %in% intersect) { intersect < append(intersect, j) } } } intersect < unique(intersect) return(intersect) }Perform set intersections of the four sets specified earlier.
a < c(1,2,3,5) b < c(1,3,5) c < c(1,4,5,3) d < c(2,5,1,3) set.intersections(a, b, c, d) ## [1] 1 3 5 ReferencesEnderton, H. (1977). Elements of set theory (1st ed.). New York: Academic Press.
The post Set Theory Arbitrary Union and Intersection Operations with R appeared first on Aaron Schlegel.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Aaron Schlegel. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RTutor: Emission Certificates and Green Innovation
(This article was first published on Economics and R  R posts, and kindly contributed to Rbloggers)
Which policy instruments should we use to costeffectively reduce greenhouse gas emissions? For a given technological level there are many economic arguments in favour of tradeable emission certificates or a carbon tax: they generate static efficiency by inducing emission reductions in those sectors and for those technologies where it is most cost effective.
Specialized subsidies, like the originally extremely high subsidies on solar energy in Germany and other countries are often much more costly. Yet, we have seen a tremendous cost reduction for photovoltaics, which may have not been achieved on such a scale without those subsidies. And maybe in a world, where the current president of a major polluting country seems not to care much about the risks of climate change, the development of cheap green technology that even absent goverment support can costeffectively substitute fossil fuels, is the most decisive factor to fight climate change.
Yet, the impact of different policy measures on innovation of green technology is very hard to assess. Are focused subsidies or mandates the best way, or can also emission trading or carbon taxes considerably boost innovation of green technologies? That is a tough quantitative question, but we can try to get at least some evidence.
In their article Environmental Policy and Directed Technological Change: Evidence from the European carbon market, Review of Economic and Statistics (2016), Raphael Calel and Antoine Dechezlepretre study the impact of the EU carbon emission trading system on patent activities of the regulated firms. By matching them with unregulated firms, they estimate that the emission trading has increased the innovation activities for low carbon technologies of the regulated firms by 10%.
As part of his Master Thesis at Ulm University, Arthur Schäfer has generated an RTutor problem set that allows you to replicate the main insights of the paper in an interactive fashion.
Here is screenshoot:
Like in previous RTutor problem sets, you can enter free R code in a web based shiny app. The code will be automatically checked and you can get hints how to procceed. In addition you are challenged by many multiple choice quizzes.
To install the problem set the problem set locally, first install RTutor as explained here:
https://github.com/skranz/RTutor
and then install the problem set package:
https://github.com/ArthurS90/RTutorEmissionTrading
There is also an online version hosted by shinyapps.io that allows you explore the problem set without any local installation. (The online version is capped at 30 hours total usage time per month. So it may be greyed out when you click at it.)
https://arthurs90.shinyapps.io/RTutorEmissionTrading/
If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at the RTutor Github page
https://github.com/skranz/RTutor
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Economics and R  R posts. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Face Recognition in R
(This article was first published on RProjects – Stoltzmaniac, and kindly contributed to Rbloggers)
Face Recognition in ROpenCV is an incredibly powerful tool to have in your toolbox. I have had a lot of success using it in Python but very little success in R. I haven’t done too much other than searching Google but it seems as if “imager” and “videoplayR” provide a lot of the functionality but not all of it.
I have never actually called Python functions from R before. Initially, I tried the “rPython” library – that has a lot of advantages, but was completely unnecessary for me so system() worked absolutely fine. While this example is extremely simple, it should help to illustrate how easy it is to utilize the power of Python from within R. I need to give credit to Harrison Kinsley for all of his efforts and work at PythonProgramming.net – I used a lot of his code and ideas for this post (especially the Python portion).
Using videoplayR I created a function which would take a picture with my webcam and save it as “originalWebcamShot.png”
Note: saving images and then loading them isn’t very efficient but works in this case and is extremely easy to implement. It saves us from passing variables, functions, objects, and/or methods between R and Python in this case.
I’ll trace my steps backward through this post (I think it’s easier to understand what’s going on in this case).
The main.R file: Calls my userdefined function
 Turns on the camera
 Takes a picture
 Saves it as “originalWebcamShot.png”
 Runs the Python script
 Loads the previously saved image
 Loads the Haar Cascade algorithms
 Detects faces and eyes
 Draws colored rectangles around them
 Saves the new image as “modifiedWebcamShot.png”
 Reads new image into R
 Displays both images
The userdefined function:
 Function inputs
 rollFrames is the number of pictures to take (allows the camera to adjust)
 showImage gives the option to display the image
 saveImageToWD saves the image generated to the current working directory
 Turns the webcam on
 Takes pictures (number of rollFrames)
 Uses basic logic to determine to show images and/or save them
 Returns the image
The Python script:
 Loads the algorithms from xml files
 Loads the image from “originalWebcamShot.png”
 Converts the image to grayscale
 Runs the facial detection algorithm
 Runs the eye detection algorithm (within the face)
 Draws rectangles around the face and eyes (different colors)
 Saves the new image as “modifiedWebcamShot.png”
The Python code was entirely based off of Harrison Kinsley’s work:
 @sentdex Twitter  YouTube
 Website: PythonProgramming.net
Thanks for reading. As always, the code is on my GitHub
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: RProjects – Stoltzmaniac. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Online portfolio allocation with a very simple algorithm
(This article was first published on R – insightR, and kindly contributed to Rbloggers)
By Yuri Resende
Today we will use an online convex optimization technique to build a very simple algorithm for portfolio allocation. Of course this is just an illustrative post and we are going to make some simplifying assumptions. The objective is to point out an interesting direction to approach the problem of portfolio allocation.
The algorithm used here is the Online Gradient Descendent (OGD) and we are going to compare the performance of the portfolio with the Uniform Constant Rebalanced Portfolio and the Dow Jones Industrial Average index. You can skip directly to Implementation and Example if you already know what an online algorithm is.
For those who don’t know what Online Convex Optimization is…From now on, we will say that represents a point in dimension , where is the number of possible stocks to invest. Each of it’s coordinates represent the amount invested in the respective stock. For example, suppose that and . Then, our portfolio at time $latex t$ are composed of from stock one, from stock two, and from stock three.
Online Convex Optimization is based on the idea of minimizing the socalled Regret function, which is basically an idea of how far from a good player you are (some kind of optimal player that you want to be). In the setup of portfolio management, the Regret function is defined as:
The first part on the right hand side is the cumulative log return of the portfolio , and the second part, is the cumulative log return of our portfolio until time , which we have played strategy . What is interesting is that is a very special case of portfolio. It represents the Best Constant Rebalanced Portfolio (CRP)…it means that if someone could see the future and knew upfront all the returns of all the stocks and had to play a fixed portfolio proportion at time zero, would be his choice. In the worst case scenario, the best CRP is equal to the best stock in the period, and the best CRP is the portfolio that we want to track.
The OGD algorithm is quite simple, we are going to update our position every week based on the gradient of the function in the currently strategy. Here it follows:
 First play
 For 1" title="t > 1" class="latex" />,
This algorithm is proved to have of order if . There is another similar algorithm which takes second order information and is quite similar to an online version of the Newton’s method. The name, not surprisingly is Online Newton Step. See Agrawal et al (2006) for further details.
Implementation and ExampleIn our setup the set of possible portfolios will not allow us to borrow money or short selling stocks, so every position should be bigger then zero and the sum of all positions should be one. This set is called simplex and define polytope. Basically it’s a triangle in higher dimension.
library(quantmod) symbols &amp;amp;amp;amp;amp;lt; c('MMM', 'AXP', 'AAPL', 'BA', 'CAT', 'CVX', 'CSCO', 'KO', 'DIS', 'DD', 'XOM', 'GE', 'GS', 'HD', 'IBM', 'INTC', 'JNJ', 'JPM', 'MCD', 'MRK', 'MSFT', 'NKE', 'PFE', 'PG', 'TRV', 'UTX', 'UNH', 'VZ', 'V', 'WMT') for (i in 1:length(symbols)) getSymbols(symbols[i], from = '20070103', to = '20170621') # Building weekly returns for each of the stocks data &amp;amp;amp;amp;amp;lt;sapply(symbols, function(x) Ad(to.weekly(get(x)))) data &amp;amp;amp;amp;amp;lt; Reduce(cbind, data) data_returns &amp;amp;amp;amp;amp;lt; apply(data, 2, function(x) diff(log(x))) #log returns colnames(data_returns) &amp;amp;amp;amp;amp;lt; symbols data_returns[is.na(data_returns)] &amp;amp;amp;amp;amp;lt; 0 # VISA hasnt negotiations between 2007 and 2008The OGD algorithm is implemented next. It uses the quadprogXT package to solve the projection onto the viable set. We are going to test two different values for to check the algorithm sensibility.
library(quadprogXT) OGD &amp;amp;amp;amp;amp;lt; function(base, eta) { # Gradient of Regret Function gradient = function(b, p, r) b + r/(p%*%r) # Projection onto viable Set proj = function(p) { Dmat = diag(length(p)) Amat = cbind(diag(rep(1, length(p))), 1) bvec = c(rep(0, length(p)), 1) fit = solveQPXT(Dmat = Dmat, dvec = p, Amat = Amat, bvec = bvec) return(fit$solution) } T = nrow(base) N = ncol(base) r = as.matrix(base) + 1 # this is because the algo doesnt work directly with log returns p = matrix(0, nrow = N, ncol = T); p[,1] = 1/N # initial portfolio b = matrix(0, nrow = N, ncol = T); b[,1] = 0 for (i in 2:T) { b[,i] = gradient(b[,i1], p[,i1], r[i1,]) # calculating gradient p.aux = p[,i1] + eta*b[,i] # what we would like to play p[,i] = proj(p.aux) # projection in the viable set } return(list('portfolio' = p,'gradient' = b)) } # testing two etas portfolio1 &amp;amp;amp;amp;amp;lt; OGD(base = data_returns, eta = 1/100) portfolio2 &amp;amp;amp;amp;amp;lt; OGD(base = data_returns, eta = 1/1000)Lets build a small function to calculate the compound return of portfolios taking their log return as input.
compound_return &amp;amp;amp;amp;amp;lt; function(portfolio, r) { return_OGD &amp;amp;amp;amp;amp;lt; c(); return_OGD[1] = portfolio$portfolio[,1]%*%r[1,] portfolio_value &amp;amp;amp;amp;amp;lt; c(); portfolio_value[1] = 1 + portfolio$portfolio[,1]%*%r[1,] for (i in 2:nrow(r)) { return_OGD[i] = portfolio$portfolio[,i]%*%r[i,] portfolio_value[i] = portfolio_value[i1]*(1 + return_OGD[i]) } return(list('value' = portfolio_value, 'return' = return_OGD)) }Now let’s check the performance of the OGD algorithm with both and values for . We are going to compare the cumulative return first with the Uniform Constant Rebalanced Portfolio and the DJIA index. Them we will check the compound returns of DJIA stocks individually to see if the algorithm tracked the best stocks like we want them to do.
# Dow Jones getSymbols('^DJI', src = 'yahoo', from = '20070103', to = '20170621') ## [1] "DJI" DJIA_returns = as.numeric(cumprod(weeklyReturn(DJI) + 1)) # Individual stocks stocks_returns = apply(data_returns + 1, 2, cumprod) # Our portfolios portfolio_value1 &amp;amp;amp;amp;amp;lt; compound_return(portfolio1, data_returns) portfolio_value2 &amp;amp;amp;amp;amp;lt; compound_return(portfolio2, data_returns)In the figures below we can see the performance of the strategies that were implemented. Both portfolios had a surprisingly perform, much better than DJIA and UCRP portfolio.
Indeed, if is relatively big, the algorithm quickly found Apple as the best stock, and start pursuing it. If is relative small, them the algorithm takes more time to adapt, but this makes the portfolio more stable.
## &amp;amp;amp;amp;amp;lt;ScaleContinuousPosition&amp;amp;amp;amp;amp;gt; ## Range: ## Limits: 0  1To finish the comparison between our portfolios and DJIA, lets do a very simple risk analysis with just empirical quantiles using historical results. For a proper risk analysis we should model the conditional volatility of the portfolio, but this will be discussed in another post.
risk_analysis &amp;amp;amp;amp;amp;lt; function(return) { report = matrix(0, ncol = 2, nrow = 2) report[1,] = quantile(return, probs = c(0.01, 0.05)) report[2,] = c(mean(return[which(return &amp;amp;amp;amp;amp;lt;= report[1,1])]), mean(return[which(return &amp;amp;amp;amp;amp;lt;= report[1,2])])) rownames(report) = c('VaR', 'CVaR') colnames(report) = c('1%', '5%') return(round(report, 3)) } report1 &amp;amp;amp;amp;amp;lt; risk_analysis(portfolio_value1$return) report2 &amp;amp;amp;amp;amp;lt; risk_analysis(portfolio_value2$return) report_DJIA &amp;amp;amp;amp;amp;lt; risk_analysis(weeklyReturn(DJI)) Portfolio1 eta = 0.01 1% 5% VaR 0.091 0.047 CVaR 0.119 0.074 Portfolio2 eta = 0.001 1% 5% VaR 0.057 0.039 CVaR 0.085 0.055 DJIA index 1% 5% VaR 0.060 0.040 CVaR 0.084 0.055It’s quite evident that setting a big , like we did in portfolio1, may have caused problems with VaR and Conditional VaR since Apple is a volatile stock. For portfolio2, where is small the portfolio takes time to change, the historical VaR and CVaR are very similar to what we already observed in DJIA, but the returns were much better.
ReferencesAgarwal, Amit, et al. “Algorithms for portfolio management based on the newton method.” Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – insightR. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Data wrangling : Reshaping
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
Data wrangling is a task of great importance in data analysis. Data wrangling, is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a timeconsuming process which is estimated to take about 6080% of analyst’s time. In this series we will go through this process. It will be a brief series with goal to craft the reader’s skills on the data wrangling task. This is the second part of this series and it aims to cover the reshaping of data used to turn them into a tidy form. By tidy form, we mean that each feature forms a column and each observation forms a row.
Before proceeding, it might be helpful to look over the help pages for the spread, gather, unite, separate, replace_na, fill, extract_numeric.
Moreover please load the following libraries.
install.packages("magrittr")
library(magrittr)
install.packages("tidyr")
library(tidyr)
Please run the code below in order to load the data set:
data < airquality[4:6]
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Print out the structure of the data frame.
Exercise 2
Let’s turn the data frame in a wider form, from above and and turn the Month variable into column headings and spread the Temp values across the months they are related to.
Exercise 3
Turn the wide (exercise 2) data frame into its initial format using the gather function, specify the columns you would like to gather by index number.
Exercise 4
Turn the wide (exercise 2) data frame into its initial format using the gather function, specify the columns you would like to gather by column name.
Learn more about Data PreProcessing in the online course R Data PreProcessing & Data Management – Shape your Data!. In this course you will learn how to: import data into R in several ways while also beeing able to identify a suitable import tool
 use SQL code within R
 And much more
Exercise 5
Turn the wide (exercise 2) data frame into its initial format using the gather function, specify the columns by using remaining column names(the ones you don’t use for gathering).
Exercise 6
Unite the variables Day and Month to a new feature named Date with the format %d%m .
Exercise 7
Create the data frame at its previous format (exercise 6). Separate the variable you have created before (Date) to Day, Month.
Exercise 8
Replace the missing values (NA) with 'Unknown'.
Exercise 9
Run the script below, so that you make a new feature year.
back2long_na$year < rep(NA, nrow(back2long_na))
back2long_na$year[1] < '2015'
back2long_na$year[as.integer(nrow(back2long_na)/3)] < '2016'
back2long_na$year[as.integer(2*nrow(back2long_na)/3)] < '2017'
You have noticed, that the new column has many values. Fill the NAs with the nonmissing value write above it. (eg.the NA’s that are below the ‘2016’ and ‘2017’ value assign it to ‘2016’.
Hint: use the fill function.
Exercise 10
Extract the numeric values from the Temp feature.
Hint: extract_numeric, this is a very important function when the variable we apply the function on is a character with ‘noise’, for example ‘$40’ and you want to transform it to 40.
Related exercise sets: Data Shape Transformation With Reshape()
 Building Shiny App exercises part 10
 Reshape 2 Exercises
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
nanotime 0.2.0
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A new version of the nanotime package for working with nanosecond timestamps just arrived on CRAN.
nanotime uses the RcppCCTZ package for (efficient) high(er) resolution time parsing and formatting up to nanosecond resolution, and the bit64 package for the actual integer64 arithmetic.
Thanks to a metric ton of work by Leonardo Silvestri, the package now uses S4 classes internally allowing for greater consistency of operations on nanotime objects.
Changes in version 0.2.0 (20170622)
Rewritten in S4 to provide more robust operations (#17 by Leonardo)

Ensure tz="" is treated as unset (Leonardo in #20)

Added format and tz arguments to nanotime, format, print (#22 by Leonardo and Dirk)

Ensure printing respect options()$max.print, ensure names are kept with vector (#23 by Leonardo)

Correct summary() by defining names< (Leonardo in #25 fixing #24)

Report error on operations that are meaningful for type; handled NA, NaN, Inf, Inf correctly (Leonardo in #27 fixing #26)
We also have a diff to the previous version thanks to CRANberries. More details and examples are at the nanotime page; code, issue tickets etc at the GitHub repository.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Can we predict flu deaths with Machine Learning and R?
(This article was first published on R Programming – DataScience+, and kindly contributed to Rbloggers)
Among the many R packages, there is the outbreaks package. It contains datasets on epidemics, on of which is from the 2013 outbreak of influenza A H7N9 in China, as analysed by Kucharski et al (2014). I will be using their data as an example to test whether we can use Machine Learning algorithms for predicting disease outcome.
To do so, I selected and extracted features from the raw data, including age, days between onset and outcome, gender, whether the patients were hospitalised, etc. Missing values were imputed and different model algorithms were used to predict outcome (death or recovery). The prediction accuracy, sensitivity and specificity. The thus prepared dataset was devided into training and testing subsets. The test subset contained all cases with an unknown outcome. Before I applied the models to the test data, I further split the training data into validation subsets.
The tested modeling algorithms were similarly successful at predicting the outcomes of the validation data. To decide on final classifications, I compared predictions from all models and defined the outcome “Death” or “Recovery” as a function of all models, whereas classifications with a low prediction probability were flagged as “uncertain”. Accounting for this uncertainty led to a 100% correct classification of the validation test set.
The training cases with unknown outcome were then classified based on the same algorithms. From 57 unknown cases, 14 were classified as “Recovery”, 10 as “Death” and 33 as uncertain.
The dataThe dataset contains case ID, date of onset, date of hospitalisation, date of outcome, gender, age, province and of course the outcome: Death or Recovery. I can already see that there are a couple of missing values in the data, which I will deal with later.
# install and load package if (!require("outbreaks")) install.packages("outbreaks") library(outbreaks) fluH7N9.china.2013_backup < fluH7N9.china.2013 # back up original dataset in case something goes awry along the way # convert ? to NAs fluH7N9.china.2013$age[which(fluH7N9.china.2013$age == "?")] < NA # create a new column with case ID fluH7N9.china.2013$case.ID < paste("case", fluH7N9.china.2013$case.ID, sep = "_") head(fluH7N9.china.2013) ## case.ID date.of.onset date.of.hospitalisation date.of.outcome outcome gender age province ## 1 case_1 20130219 20130304 Death m 87 Shanghai ## 2 case_2 20130227 20130303 20130310 Death m 27 Shanghai ## 3 case_3 20130309 20130319 20130409 Death f 35 Anhui ## 4 case_4 20130319 20130327 f 45 Jiangsu ## 5 case_5 20130319 20130330 20130515 Recover f 48 Jiangsu ## 6 case_6 20130321 20130328 20130426 Death f 32 JiangsuBefore I start preparing the data for Machine Learning, I want to get an idea of the distribution of the data points and their different variables by plotting. Most provinces have only a handful of cases, so I am combining them into the category “other” and keep only Jiangsu, Shanghai and Zhejian and separate provinces.
# gather for plotting with ggplot2 library(tidyr) fluH7N9.china.2013_gather < fluH7N9.china.2013 %>% gather(Group, Date, date.of.onset:date.of.outcome) # rearrange group order fluH7N9.china.2013_gather$Group < factor(fluH7N9.china.2013_gather$Group, levels = c("date.of.onset", "date.of.hospitalisation", "date.of.outcome")) # rename groups library(plyr) fluH7N9.china.2013_gather$Group < mapvalues(fluH7N9.china.2013_gather$Group, from = c("date.of.onset", "date.of.hospitalisation", "date.of.outcome"), to = c("Date of onset", "Date of hospitalisation", "Date of outcome")) # renaming provinces fluH7N9.china.2013_gather$province < mapvalues(fluH7N9.china.2013_gather$province, from = c("Anhui", "Beijing", "Fujian", "Guangdong", "Hebei", "Henan", "Hunan", "Jiangxi", "Shandong", "Taiwan"), to = rep("Other", 10)) # add a level for unknown gender levels(fluH7N9.china.2013_gather$gender) < c(levels(fluH7N9.china.2013_gather$gender), "unknown") fluH7N9.china.2013_gather$gender[is.na(fluH7N9.china.2013_gather$gender)] < "unknown" # rearrange province order so that Other is the last fluH7N9.china.2013_gather$province < factor(fluH7N9.china.2013_gather$province, levels = c("Jiangsu", "Shanghai", "Zhejiang", "Other")) # convert age to numeric fluH7N9.china.2013_gather$age < as.numeric(as.character(fluH7N9.china.2013_gather$age)) library(ggplot2) my_theme < function(base_size = 12, base_family = "sans"){ theme_minimal(base_size = base_size, base_family = base_family) + theme( axis.text = element_text(size = 12), axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5), axis.title = element_text(size = 14), panel.grid.major = element_line(color = "grey"), panel.grid.minor = element_blank(), panel.background = element_rect(fill = "aliceblue"), strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1), strip.text = element_text(face = "bold", size = 12, color = "black"), legend.position = "bottom", legend.justification = "top", legend.box = "horizontal", legend.box.background = element_rect(colour = "grey50"), legend.background = element_blank(), panel.border = element_rect(color = "grey", fill = NA, size = 0.5) ) } # plotting raw data ggplot(data = fluH7N9.china.2013_gather, aes(x = Date, y = age, fill = outcome)) + stat_density2d(aes(alpha = ..level..), geom = "polygon") + geom_jitter(aes(color = outcome, shape = gender), size = 1.5) + geom_rug(aes(color = outcome)) + labs( fill = "Outcome", color = "Outcome", alpha = "Level", shape = "Gender", x = "Date in 2013", y = "Age", title = "2013 Influenza A H7N9 cases in China", subtitle = "Dataset from 'outbreaks' package (Kucharski et al. 2014)", caption = "" ) + facet_grid(Group ~ province) + my_theme() + scale_shape_manual(values = c(15, 16, 17)) + scale_color_brewer(palette="Set1", na.value = "grey50") + scale_fill_brewer(palette="Set1")This plot shows the dates of onset, hospitalisation and outcome (if known) of each data point. Outcome is marked by color and age shown on the yaxis. Gender is marked by point shape. The density distribution of date by age for the cases seems to indicate that older people died more frequently in the Jiangsu and Zhejiang province than in Shanghai and in other provinces. When we look at the distribution of points along the time axis, it suggests that their might be a positive correlation between the likelihood of death and an early onset or early outcome.
I also want to know how many cases there are for each gender and province and compare the genders’ age distribution.
fluH7N9.china.2013_gather_2 < fluH7N9.china.2013_gather[, 4] %>% gather(group_2, value, gender:province) fluH7N9.china.2013_gather_2$value < mapvalues(fluH7N9.china.2013_gather_2$value, from = c("m", "f", "unknown", "Other"), to = c("Male", "Female", "Unknown gender", "Other province")) fluH7N9.china.2013_gather_2$value < factor(fluH7N9.china.2013_gather_2$value, levels = c("Female", "Male", "Unknown gender", "Jiangsu", "Shanghai", "Zhejiang", "Other province")) p1 < ggplot(data = fluH7N9.china.2013_gather_2, aes(x = value, fill = outcome, color = outcome)) + geom_bar(position = "dodge", alpha = 0.7, size = 1) + my_theme() + scale_fill_brewer(palette="Set1", na.value = "grey50") + scale_color_brewer(palette="Set1", na.value = "grey50") + labs( color = "", fill = "", x = "", y = "Count", title = "2013 Influenza A H7N9 cases in China", subtitle = "Gender and Province numbers of flu cases", caption = "" ) p2 < ggplot(data = fluH7N9.china.2013_gather, aes(x = age, fill = outcome, color = outcome)) + geom_density(alpha = 0.3, size = 1) + geom_rug() + scale_color_brewer(palette="Set1", na.value = "grey50") + scale_fill_brewer(palette="Set1", na.value = "grey50") + my_theme() + labs( color = "", fill = "", x = "Age", y = "Density", title = "", subtitle = "Age distribution of flu cases", caption = "" ) library(gridExtra) library(grid) grid.arrange(p1, p2, ncol = 2)In the dataset, there are more male than female cases and correspondingly, we see more deaths, recoveries and unknown outcomes in men than in women. This is potentially a problem later on for modeling because the inherent likelihoods for outcome are not directly comparable between the sexes. Most unknown outcomes were recorded in Zhejiang. Similarly to gender, we don’t have an equal distribution of data points across provinces either. When we look at the age distribution it is obvious that people who died tended to be slightly older than those who recovered. The density curve of unknown outcomes is more similar to that of death than of recovery, suggesting that among these people there might have been more deaths than recoveries. And lastly, I want to plot how many days passed between onset, hospitalisation and outcome for each case.
ggplot(data = fluH7N9.china.2013_gather, aes(x = Date, y = age, color = outcome)) + geom_point(aes(shape = gender), size = 1.5, alpha = 0.6) + geom_path(aes(group = case.ID)) + facet_wrap( ~ province, ncol = 2) + my_theme() + scale_shape_manual(values = c(15, 16, 17)) + scale_color_brewer(palette="Set1", na.value = "grey50") + scale_fill_brewer(palette="Set1") + labs( color = "Outcome", shape = "Gender", x = "Date in 2013", y = "Age", title = "2013 Influenza A H7N9 cases in China", subtitle = "Dataset from 'outbreaks' package (Kucharski et al. 2014)", caption = "\nTime from onset of flu to outcome." )Gives this plot which shows that there are many missing values in the dates, so it is hard to draw a general conclusion.
In Machine Learningspeak features are the variables used for model training. Using the right features dramatically influences the accuracy of the model.
Because we don’t have many features, I am keeping age as it is, but I am also generating new features:
 from the date information I am calculating the days between onset and outcome and between onset and hospitalisation
 I am converting gender into numeric values with 1 for female and 0 for male
 similarly, I am converting provinces to binary classifiers (yes == 1, no == 0) for Shanghai, Zhejiang, Jiangsu and other provinces
 the same binary classification is given for whether a case was hospitalised, and whether they had an early onset or early outcome (earlier than the median date)
When looking at the dataset I created for modeling, it is obvious that we have quite a few missing values. The missing values from the outcome column are what I want to predict but for the rest I would either have to remove the entire row from the data or impute the missing information. I decided to try the latter with the mice package.
# impute missing data library(mice) dataset_impute < mice(dataset[, 1], print = FALSE) # recombine imputed data frame with the outcome column dataset_complete < merge(dataset[, 1, drop = FALSE], mice::complete(dataset_impute, 1), by = "row.names", all = TRUE) rownames(dataset_complete) < dataset_complete$Row.names dataset_complete < dataset_complete[, 1] Test, train and validation datasetsFor building the model, I am separating the imputed data frame into training and test data. Test data are the 57 cases with unknown outcome.
summary(dataset$outcome) ## Death Recover NA's ## 32 47 57The training data will be further devided for validation of the models: 70% of the training data will be kept for model building and the remaining 30% will be used for model testing. I am using the caret package for modeling.
train_index < which(is.na(dataset_complete$outcome)) train_data < dataset_complete[train_index, ] test_data < dataset_complete[train_index, 1] library(caret) set.seed(27) val_index < createDataPartition(train_data$outcome, p = 0.7, list=FALSE) val_train_data < train_data[val_index, ] val_test_data < train_data[val_index, ] val_train_X < val_train_data[,1] val_test_X < val_test_data[,1] Decision treesTo get an idea about how each feature contributes to the prediction of the outcome, I first built a decision tree based on the training data using rpart and rattle.
library(rpart) library(rattle) library(rpart.plot) library(RColorBrewer) set.seed(27) fit < rpart(outcome ~ ., data = train_data, method = "class", control = rpart.control(xval = 10, minbucket = 2, cp = 0), parms = list(split = "information")) fancyRpartPlot(fit)This randomly generated decision tree shows that cases with an early outcome were most likely to die when they were 68 or older, when they also had an early onset and when they were sick for fewer than 13 days. If a person was not among the first cases and was younger than 52, they had a good chance of recovering, but if they were 82 or older, they were more likely to die from the flu.
Feature ImportanceNot all of the features I created will be equally important to the model. The decision tree already gave me an idea of which features might be most important but I also want to estimate feature importance using a Random Forest approach with repeated cross validation.
# prepare training scheme control < trainControl(method = "repeatedcv", number = 10, repeats = 10) # train the model set.seed(27) model < train(outcome ~ ., data = train_data, method = "rf", preProcess = NULL, trControl = control) # estimate variable importance importance < varImp(model, scale=TRUE) # prepare for plotting importance_df_1 < importance$importance importance_df_1$group < rownames(importance_df_1) importance_df_1$group < mapvalues(importance_df_1$group, from = c("age", "hospital2", "gender_f2", "province_Jiangsu2", "province_Shanghai2", "province_Zhejiang2", "province_other2", "days_onset_to_outcome", "days_onset_to_hospital", "early_onset2", "early_outcome2" ), to = c("Age", "Hospital", "Female", "Jiangsu", "Shanghai", "Zhejiang", "Other province", "Days onset to outcome", "Days onset to hospital", "Early onset", "Early outcome" )) f = importance_df_1[order(importance_df_1$Overall, decreasing = FALSE), "group"] importance_df_2 < importance_df_1 importance_df_2$Overall < 0 importance_df < rbind(importance_df_1, importance_df_2) # setting factor levels importance_df < within(importance_df, group < factor(group, levels = f)) importance_df_1 < within(importance_df_1, group < factor(group, levels = f)) ggplot() + geom_point(data = importance_df_1, aes(x = Overall, y = group, color = group), size = 2) + geom_path(data = importance_df, aes(x = Overall, y = group, color = group, group = group), size = 1) + scale_color_manual(values = rep(brewer.pal(1, "Set1")[1], 11)) + my_theme() + theme(legend.position = "none", axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5)) + labs( x = "Importance", y = "", title = "2013 Influenza A H7N9 cases in China", subtitle = "Scaled feature importance", caption = "\nDetermined with Random Forest and repeated cross validation (10 repeats, 10 times)" ) Feature PlotBefore I start actually building models, I want to check whether the distribution of feature values is comparable between training, validation and test datasets.
# prepare for plotting dataset_complete_gather < dataset_complete %>% mutate(set = ifelse(rownames(dataset_complete) %in% rownames(test_data), "Test Data", ifelse(rownames(dataset_complete) %in% rownames(val_train_data), "Validation Train Data", ifelse(rownames(dataset_complete) %in% rownames(val_test_data), "Validation Test Data", "NA"))), case_ID = rownames(.)) %>% gather(group, value, age:early_outcome) dataset_complete_gather$group < mapvalues(dataset_complete_gather$group, from = c("age", "hospital", "gender_f", "province_Jiangsu", "province_Shanghai", "province_Zhejiang", "province_other", "days_onset_to_outcome", "days_onset_to_hospital", "early_onset", "early_outcome" ), to = c("Age", "Hospital", "Female", "Jiangsu", "Shanghai", "Zhejiang", "Other province", "Days onset to outcome", "Days onset to hospital", "Early onset", "Early outcome" )) ggplot(data = dataset_complete_gather, aes(x = as.numeric(value), fill = outcome, color = outcome)) + geom_density(alpha = 0.2) + geom_rug() + scale_color_brewer(palette="Set1", na.value = "grey50") + scale_fill_brewer(palette="Set1", na.value = "grey50") + my_theme() + facet_wrap(set ~ group, ncol = 11, scales = "free") + labs( x = "Value", y = "Density", title = "2013 Influenza A H7N9 cases in China", subtitle = "Features for classifying outcome", caption = "\nDensity distribution of all features used for classification of flu outcome." ) ggplot(subset(dataset_complete_gather, group == "Age"  group == "Days onset to hospital"  group == "Days onset to outcome"), aes(x=outcome, y=as.numeric(value), fill=set)) + geom_boxplot() + my_theme() + scale_fill_brewer(palette="Set1", type = "div ") + facet_wrap( ~ group, ncol = 3, scales = "free") + labs( fill = "", x = "Outcome", y = "Value", title = "2013 Influenza A H7N9 cases in China", subtitle = "Features for classifying outcome", caption = "\nBoxplot of the features age, days from onset to hospitalisation and days from onset to outcome." )Gives this plot which luckily, the distributions looks reasonably similar between the validation and test data, except for a few outliers.
Before I try to predict the outcome of the unknown cases, I am testing the models’ accuracy with the validation datasets on a couple of algorithms. I have chosen only a few more well known algorithms, but caret implements many more. I have chose to not do any preprocessing because I was worried that the different data distributions with continuous variables (e.g. age) and binary variables (i.e. 0, 1 classification of e.g. hospitalisation) would lead to problems.
Random ForestRandom Forests predictions are based on the generation of multiple classification trees. This model classified 14 out of 23 cases correctly.
set.seed(27) model_rf < caret::train(outcome ~ ., data = val_train_data, method = "rf", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) confusionMatrix(predict(model_rf, val_test_data[, 1]), val_test_data$outcome) ElasticNet Regularized Generalized Linear ModelsLasso or elastic net regularization for generalized linear model regression are based on linear regression models and is useful when we have feature correlation in our model. This model classified 13 out of 23 cases correctly.
set.seed(27) model_glmnet < caret::train(outcome ~ ., data = val_train_data, method = "glmnet", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) confusionMatrix(predict(model_glmnet, val_test_data[, 1]), val_test_data$outcome) kNearest Neighbors set.seed(27) model_kknn < caret::train(outcome ~ ., data = val_train_data, method = "kknn", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) confusionMatrix(predict(model_kknn, val_test_data[, 1]), val_test_data$outcome) Penalized Discriminant Analysis set.seed(27) model_pda < caret::train(outcome ~ ., data = val_train_data, method = "pda", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) confusionMatrix(predict(model_pda, val_test_data[, 1]), val_test_data$outcome) Stabilized Linear Discriminant AnalysisStabilized Linear Discriminant Analysis is designed for highdimensional data and correlated covariables. This model classified 15 out of 23 cases correctly.
set.seed(27) model_slda < caret::train(outcome ~ ., data = val_train_data, method = "slda", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) confusionMatrix(predict(model_slda, val_test_data[, 1]), val_test_data$outcome) Nearest Shrunken CentroidsNearest Shrunken Centroids computes a standardized centroid for each class and shrinks each centroid toward the overall centroid for all classes. This model classified 15 out of 23 cases correctly.
set.seed(27) model_pam < caret::train(outcome ~ ., data = val_train_data, method = "pam", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) confusionMatrix(predict(model_pam, val_test_data[, 1]), val_test_data$outcome) Single C5.0 TreeC5.0 is another treebased modeling algorithm. This model classified 15 out of 23 cases correctly.
set.seed(27) model_C5.0Tree < caret::train(outcome ~ ., data = val_train_data, method = "C5.0Tree", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) confusionMatrix(predict(model_C5.0Tree, val_test_data[, 1]), val_test_data$outcome) Partial Least SquaresPartial least squares regression combined principal component analysis and multiple regression and is useful for modeling with correlated features. This model classified 15 out of 23 cases correctly.
set.seed(27) model_pls < caret::train(outcome ~ ., data = val_train_data, method = "pls", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) confusionMatrix(predict(model_pls, val_test_data[, 1]), val_test_data$outcome) Comparing accuracy of modelsAll models were similarly accurate.
# Create a list of models models < list(rf = model_rf, glmnet = model_glmnet, kknn = model_kknn, pda = model_pda, slda = model_slda, pam = model_pam, C5.0Tree = model_C5.0Tree, pls = model_pls) # Resample the models resample_results < resamples(models) # Generate a summary summary(resample_results, metric = c("Kappa", "Accuracy")) bwplot(resample_results , metric = c("Kappa","Accuracy")) Combined results of predicting validation test samplesTo compare the predictions from all models, I summed up the prediction probabilities for Death and Recovery from all models and calculated the log2 of the ratio between the summed probabilities for Recovery by the summed probabilities for Death. All cases with a log2 ratio bigger than 1.5 were defined as Recover, cases with a log2 ratio below 1.5 were defined as Death, and the remaining cases were defined as uncertain.
results < data.frame(randomForest = predict(model_rf, newdata = val_test_data[, 1], type="prob"), glmnet = predict(model_glmnet, newdata = val_test_data[, 1], type="prob"), kknn = predict(model_kknn, newdata = val_test_data[, 1], type="prob"), pda = predict(model_pda, newdata = val_test_data[, 1], type="prob"), slda = predict(model_slda, newdata = val_test_data[, 1], type="prob"), pam = predict(model_pam, newdata = val_test_data[, 1], type="prob"), C5.0Tree = predict(model_C5.0Tree, newdata = val_test_data[, 1], type="prob"), pls = predict(model_pls, newdata = val_test_data[, 1], type="prob")) results$sum_Death < rowSums(results[, grep("Death", colnames(results))]) results$sum_Recover < rowSums(results[, grep("Recover", colnames(results))]) results$log2_ratio < log2(results$sum_Recover/results$sum_Death) results$true_outcome < val_test_data$outcome results$pred_outcome < ifelse(results$log2_ratio > 1.5, "Recover", ifelse(results$log2_ratio < 1.5, "Death", "uncertain")) results$prediction < ifelse(results$pred_outcome == results$true_outcome, "CORRECT", ifelse(results$pred_outcome == "uncertain", "uncertain", "wrong"))All predictions based on all models were either correct or uncertain.
Predicting unknown outcomesThe above models will now be used to predict the outcome of cases with unknown fate.
set.seed(27) model_rf < caret::train(outcome ~ ., data = train_data, method = "rf", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) model_glmnet < caret::train(outcome ~ ., data = train_data, method = "glmnet", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) model_kknn < caret::train(outcome ~ ., data = train_data, method = "kknn", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) model_pda < caret::train(outcome ~ ., data = train_data, method = "pda", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) model_slda < caret::train(outcome ~ ., data = train_data, method = "slda", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) model_pam < caret::train(outcome ~ ., data = train_data, method = "pam", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) model_C5.0Tree < caret::train(outcome ~ ., data = train_data, method = "C5.0Tree", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) model_pls < caret::train(outcome ~ ., data = train_data, method = "pls", preProcess = NULL, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE)) models < list(rf = model_rf, glmnet = model_glmnet, kknn = model_kknn, pda = model_pda, slda = model_slda, pam = model_pam, C5.0Tree = model_C5.0Tree, pls = model_pls) # Resample the models resample_results < resamples(models) bwplot(resample_results , metric = c("Kappa","Accuracy"))Here again, the accuracy is similar in all models. The final results are calculate as described above.
results < data.frame(randomForest = predict(model_rf, newdata = test_data, type="prob"), glmnet = predict(model_glmnet, newdata = test_data, type="prob"), kknn = predict(model_kknn, newdata = test_data, type="prob"), pda = predict(model_pda, newdata = test_data, type="prob"), slda = predict(model_slda, newdata = test_data, type="prob"), pam = predict(model_pam, newdata = test_data, type="prob"), C5.0Tree = predict(model_C5.0Tree, newdata = test_data, type="prob"), pls = predict(model_pls, newdata = test_data, type="prob")) results$sum_Death < rowSums(results[, grep("Death", colnames(results))]) results$sum_Recover < rowSums(results[, grep("Recover", colnames(results))]) results$log2_ratio < log2(results$sum_Recover/results$sum_Death) results$predicted_outcome < ifelse(results$log2_ratio > 1.5, "Recover", ifelse(results$log2_ratio < 1.5, "Death", "uncertain"))From 57 cases, 14 were defined as Recover, 10 as Death and 33 as uncertain.
results_combined < merge(results[, c(1:16)], fluH7N9.china.2013[which(fluH7N9.china.2013$case.ID %in% rownames(results)), ], by.x = "row.names", by.y = "case.ID") results_combined < results_combined[, c(2, 3, 8, 9)] results_combined_gather < results_combined %>% gather(group_dates, date, date.of.onset:date.of.hospitalisation) results_combined_gather$group_dates < factor(results_combined_gather$group_dates, levels = c("date.of.onset", "date.of.hospitalisation")) results_combined_gather$group_dates < mapvalues(results_combined_gather$group_dates, from = c("date.of.onset", "date.of.hospitalisation"), to = c("Date of onset", "Date of hospitalisation")) results_combined_gather$gender < mapvalues(results_combined_gather$gender, from = c("f", "m"), to = c("Female", "Male")) levels(results_combined_gather$gender) < c(levels(results_combined_gather$gender), "unknown") results_combined_gather$gender[is.na(results_combined_gather$gender)] < "unknown" results_combined_gather$age < as.numeric(as.character(results_combined_gather$age)) ggplot(data = results_combined_gather, aes(x = date, y = log2_ratio, color = predicted_outcome)) + geom_jitter(aes(size = age), alpha = 0.3) + geom_rug() + facet_grid(gender ~ group_dates) + labs( color = "Predicted outcome", size = "Age", x = "Date in 2013", y = "log2 ratio of prediction Recover vs Death", title = "2013 Influenza A H7N9 cases in China", subtitle = "Predicted outcome", caption = "" ) + my_theme() + scale_color_brewer(palette="Set1") + scale_fill_brewer(palette="Set1")The comparison of date of onset, data of hospitalisation, gender and age with predicted outcome shows that predicted deaths were associated with older age than predicted Recoveries. Date of onset does not show an obvious bias in either direction.
ConclusionsThis dataset posed a couple of difficulties to begin with, like unequal distribution of data points across variables and missing data. This makes the modeling inherently prone to flaws. However, real life data isn’t perfect either, so I went ahead and tested the modeling success anyway. By accounting for uncertain classification with low predictions probability, the validation data could be classified accurately. However, for a more accurate model, these few cases don’t give enough information to reliably predict the outcome. More cases, more information (i.e. more features) and fewer missing data would improve the modeling outcome. Also, this example is only applicable for this specific case of flu. In order to be able to draw more general conclusions about flu outcome, other cases and additional information, for example on medical parameters like preexisting medical conditions, disase parameters, demographic information, etc. would be necessary. All in all, this dataset served as a nice example of the possibilities (and pitfalls) of machine learning applications and showcases a basic workflow for building prediction models with R.
If you have questions feel free to post a comment.
Related Post
 Graphical Presentation of Missing Data; VIM Package
 Creating Graphs with Python and GooPyCharts
 Visualizing Tennis Grand Slam Winners Performances
 Plotting Data Online via Plotly and Python
 Using PostgreSQL and shiny with a dynamic leaflet map: monitoring trash cans
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
All the fake data that’s fit to print
charlatan makes fake data.
Excited to annonunce a new package called charlatan. While perusing
packages from other programming languages, I saw a neat Python library
called faker.
charlatan is inspired from and ports many things from Python's
https://github.com/joke2k/faker library. In turn, faker was inspired from
PHP's faker,
Perl's Faker, and
Ruby's faker. It appears that the PHP
library was the original – nice work PHP.
What could you do with this package? Here's some use cases:
 Students in a classroom setting learning any task that needs a dataset.
 People doing simulations/modeling that need some fake data
 Generate fake dataset of users for a database before actual users exist
 Complete missing spots in a dataset
 Generate fake data to replace sensitive real data with before public release
 Create a random set of colors for visualization
 Generate random coordinates for a map
 Get a set of randomly generated DOIs (Digial Object Identifiers) to
assign to fake scholarly artifacts  Generate fake taxonomic names for a biological dataset
 Get a set of fake sequences to use to test code/software that uses
sequence data
 Language support: A huge selling point of charlatan is language support.
Of course for some data types (numbers), languages don't come into play, but
for many they do. That means you can create fake datasets specific to a
language, or a dataset with a mix of languages, etc. For the variables
in this package, we have not yet ported over all languages for those
variables that Python's faker has.  Lite weight: We've tried to make this package as lite as possible so
that it's just generally easy to install, but also can be used in
other packages or workflows while bringing along as little baggage
as possible.  Reviewed: it's been reviewed! See reviews by Brooke Anderson and
Tristan Mahr, and handling editor Noam Ross  R specific features such as methods to create data.frame's (so the
user doesn’t have to do the extra step of putting vectors together)
We have not ported every variable, or every language yet in those variables.
We have added some variables to charlatan that are not in faker (e.g.,
taxonomy, gene sequences). Check out the issues
to follow progress.
 ch_generate: generate a data.frame with fake data
 fraudster: single interface to all fake data methods
 High level interfaces: There are high level functions prefixed with
ch_ that wrap low level interfaces, and are meant to be easier
to use and provide easy way to make many instances of a thing.  Low level interfaces: All of these are R6 objects that a user can
initialize and then call methods on the them.
Check out the package vignette to get started.
setupInstall charlatan
install.packages("charlatan")Or get the development version:
devtools::install_github("ropensci/charlatan") library(charlatan) Examples high level interfacefraudster is an interface for all fake data variables (and locales):
x < fraudster() x$job() #> [1] "Textile designer" x$name() #> [1] "Cris JohnstonTremblay" x$job() #> [1] "Database administrator" x$color_name() #> [1] "SaddleBrown"If you want to set locale, do so like fraudster(locale = "{locale}")
locale supportThe locales that are supported vary by data variable. We're adding more
locales through time, so do check in from time to time – or even better,
send a pull request adding support for the locale you want for the
variable(s) you want.
As an example, you can set locale for job data to any number of supported
locales.
For jobs:
ch_job(locale = "en_US", n = 3) #> [1] "Charity officer" "Financial adviser" "Buyer, industrial" ch_job(locale = "fr_FR", n = 3) #> [1] "Illustrateur" "Guichetier" #> [3] "Responsable d'ordonnancement" ch_job(locale = "hr_HR", n = 3) #> [1] "Pomoćnik strojovođe" #> [2] "Pećar" #> [3] "Konzervator – restaurator savjetnik" ch_job(locale = "uk_UA", n = 3) #> [1] "Фрілансер" "Астрофізик" "Доцент" ch_job(locale = "zh_TW", n = 3) #> [1] "包裝設計" "空調冷凍技術人員" "鍋爐操作技術人員"For colors:
ch_color_name(locale = "en_US", n = 3) #> [1] "DarkMagenta" "Navy" "LightGray" ch_color_name(locale = "uk_UA", n = 3) #> [1] "Синій ВПС" "Темнозелений хакі" "Берлінська лазур"charlatan will tell you when a locale is not supported
ch_job(locale = "cv_MN") #> Error: cv_MN not in set of available locales generate a datasetch_generate() helps you create data.frame's with whatever variables
you want that charlatan supports. Then you're ready to use the
data.frame immediately in whatever your application is.
By default, you get back a certain set of variables. Right now, that is:
name, job, and phone_number.
You can select just the variables you want:
ch_generate('job', 'phone_number', n = 30) #> # A tibble: 30 x 2 #> job phone_number #> #> 1 Call centre manager 16707153079x9104 #> 2 Nurse, learning disability 15027813386x33524 #> 3 Network engineer 16920893060 #> 4 Industrial buyer 15178558517 #> 5 Database administrator (999)4749975x89650 #> 6 Operations geologist 06150655769 #> 7 Engineer, land 3600433630x592 #> 8 Pension scheme manager (374)4296821 #> 9 Personnel officer 11895743348x338 #> 10 Editor, film/video 16981351664 #> # ... with 20 more rows Data typesA sampling of the data types available in charlatan:
person name
ch_name() #> [1] "Jefferey WestO'Connell" ch_name(10) #> [1] "Dylon Hintz" "Dr. Billy Willms DDS" "Captain Bednar III" #> [4] "Carli Torp" "Price Strosin III" "Grady Mayert" #> [7] "Nat HermanKuvalis" "Noelle Funk" "Dr. Jaycie Herzog MD" #> [10] "Ms. Andrea Zemlak"phone number
ch_phone_number() #> [1] "643.993.1958" ch_phone_number(10) #> [1] "+06(6)6080789632" "05108334280" "4471269775" #> [4] "+96(7)2112213020" "4954251506" "12103723188x514" #> [7] "(300)9515115" "680.567.5321" "19478054758x8167" #> [10] "8889985511x554"job
ch_job() #> [1] "Scientist, water quality" ch_job(10) #> [1] "Engineer, production" #> [2] "Architect" #> [3] "Exhibitions officer, museum/gallery" #> [4] "Patent attorney" #> [5] "Surveyor, minerals" #> [6] "Electronics engineer" #> [7] "Secondary school teacher" #> [8] "Intelligence analyst" #> [9] "Nutritional therapist" #> [10] "Information officer" Messy dataReal data is messy! charlatan makes it easy to create
messy data. This is still in the early stages so is not available
across most data types and languages, but we're working on it.
For example, create messy names:
ch_name(50, messy = TRUE) #> [1] "Mr. Vernell Hoppe Jr." "Annika Considine d.d.s." #> [3] "Dr. Jose Kunde DDS" "Karol LeuschkeRunte" #> [5] "Kayleen KutchHintz" "Jahir Green" #> [7] "Stuart Emmerich" "Hillard Schaden" #> [9] "Mr. Caden Braun" "Willie Ebert" #> [11] "Meg Abbott PhD" "Dr Rahn Huel" #> [13] "Kristina Crooks d.d.s." "Lizbeth Hansen" #> [15] "Mrs. Peyton Kuhn" "Hayley Bernier" #> [17] "Dr. Lavon Schimmel d.d.s." "Iridian Murray" #> [19] "Cary Romaguera" "Tristan Windler" #> [21] "Marlana Schroeder md" "Mr. Treyton Nitzsche" #> [23] "Hilmer NitzscheGlover" "Marius Dietrich md" #> [25] "Len Mertz" "Mrs Adyson Wunsch DVM" #> [27] "Dr. Clytie Feest DDS" "Mr. Wong Lebsack I" #> [29] "Arland Kessler" "Mrs Billy O'Connell m.d." #> [31] "Stephen Gerlach" "Jolette Lueilwitz" #> [33] "Mrs Torie Green d.d.s." "Mona Denesik" #> [35] "Mitchell Auer" "Miss. Fae Price m.d." #> [37] "Todd Lehner" "Elva Lesch" #> [39] "Miss. Gustie Rempel DVM" "Lexie ParisianStark" #> [41] "Beaulah CreminRice" "Parrish Schinner" #> [43] "Latrell Beier" "Garry Wolff Sr" #> [45] "Bernhard Vandervort" "Stevie Johnston" #> [47] "Dawson Gaylord" "Ivie Labadie" #> [49] "Ronal Parker" "Mr Willy O'Conner Sr."Right now only suffixes and prefixes for names in en_US locale
are supported. Notice above some variation in prefixes and suffixes.
We have lots ot do still. Some of those things include:
 Locales: For existing data variables in the package, we need to fill in
locales for which Python's faker has the data, but we need to port it over
still.  Data variables: there's more we can port over from Python's faker.
In addition, we may find inspiration from faker libraries in other
programming languages.  Messy data: we want to make messy data support more available throughout
the package. Watch issue #41.  If you have ideas for potential data variables, issue #11 is a good place for those.
Or open a new issue, either way.  One reviewer brought up whether data should be within bounds of reality (
see issue #40). The first
question for me is should we do this – if the answer is yes or at least sometimes,
then we can explore how. It's not yet clear if it's the right thing to do.
RcppCCTZ 0.2.3 (and 0.2.2)
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A new minor version 0.2.3 of RcppCCTZ is now on CRAN.
RcppCCTZ uses Rcpp to bring CCTZ to R. CCTZ is a C++ library for translating between absolute and civil times using the rules of a time zone. In fact, it is two libraries. One for dealing with civil time: humanreadable dates and times, and one for converting between between absolute and civil times via time zones. The RcppCCTZ page has a few usage examples and details.
This version ensures that we set the TZDIR environment variable correctly on the old dreaded OS that does not come with proper timezone information—an issue which had come up while preparing the next (and awesome, trust me) release of nanotime. It also appears that I failed to blog about 0.2.2, another maintenance release, so changes for both are summarised next.
Changes in version 0.2.3 (20170619)
On Windows, the TZDIR environment variable is now set in .onLoad()

Replaced init.c with registration code inside of RcppExports.cpp thanks to Rcpp 0.12.11.

Synchronized with upstream CCTZ

The time_point object is instantiated explicitly for nanosecond use which appears to be required on macOS
We also have a diff to the previous version thanks to CRANberries. More details are at the RcppCCTZ page; code, issue tickets etc at the GitHub repository.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Large integers in R: Fibonacci number with 1000 digits, Euler Problem 25
(This article was first published on The Devil is in the Data, and kindly contributed to Rbloggers)
Euler Problem 25 takes us back to the Fibonacci sequence and the problems related to working with very large integers.
The Fibonacci sequence follows a simple mathematical rule but it can create things of great beauty. This pattern occurs quite often in nature, like to nautilus shell shown in the image. The video by Arthur Benjamin at the end of this post illustrates some of the magic of this sequence.
Large Integers in RBy default, numbers with more than 7 digits are shown in scientific notation in R, which reduces the accuracy of the calculation. You can change the precision of large integers with the options function but R struggles with integers with more than 22 digits. This example illustrates this issue.
factorial(24) [1] 6.204484e+23 > options(digits=22) > factorial(24) [1] 620448401733239409999872However, finding a number of 1000 digits is a problem with using special functions.
Euler Problem 25 DefinitionThe Fibonacci sequence is defined by the recurrence relation:
, where and . The 12th term, , is the first term to contain three digits.
What is the index of the first term in the Fibonacci sequence to contain 1000 digits?
Proposed SolutionsThe first solution uses the GMP library to manage very large integers. This library also contains a function to generate Fibonacci numbers. This solution cycles through the Fibonacci sequence until it finds a number with 1000 digits.
library(gmp) # GNU Multiple Precision Arithmetic Library n < 1 fib < 1 while (nchar(as.character(fib)) < 1000) { fib < fibnum(n) # Determine next Fibonacci number n < n + 1 } answer < n print(answer)This is a very fast solution but my aim is to solve the first 100 Project Euler problems using only baseR code. The big.add function I developed to solve Euler Problem 13.
t < proc.time() fib < 1 # First Fibonaci number cur < 1 # Current number in sequence pre < 1 # Previous number in sequence index < 2 while (nchar(fib) < 1000) { fib < big.add(cur, pre) # Determine next Fibonacci number pre < cur cur < fib index < index + 1 } answer < index print(answer)This code is much slower than the GMP library but is was fun to develop.
The Magic of the Fibonacci NumbersThe post Large integers in R: Fibonacci number with 1000 digits, Euler Problem 25 appeared first on The Devil is in the Data.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Updated Data Science Virtual Machine for Windows: GPUenabled with Docker support
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
The Windows edition of the Data Science Virtual Machine (DSVM), the allinone virtual machine image with a widecollection of opensource and Microsoft data science tools, has been updated to the Windows Server 2016 platform. This update brings builtin support for Docker containers and GPUbased deep learning.
GPUbased Deep Learning. While prior editions of the DSVM could access GPUbased capabilities by installing additional components, everything is now configured and ready at launch. The DSVM now includes GPUenabled builds of popular deep learning frameworks including CNTK, Tensorflow, and MXNET. It also includes Microsoft R Server 9.1, and several machinelearning functions in the MicrosoftML package can also take advantage of GPUs. Note that you will need to use an Nseries Azure instance to benefit from GPU acceleration, but all of the tools in the DSVM will also work on regular CPUbased instances as well.
Docker Containers. Windows Server 2016 includes Windows Containers, and the DSVM also includes the Docker Engine. This means you can easily run Windows containers from Docker Hub or the Azure Container registry. This is limited to Windowsbased containers right now, but you can use Linuxbased containers on the Data Science Virtual Machine for Ubuntu Linux, which likewise supports GPUenabled deep learning.
It's easy to launch a Windows 2016 DSVM on Azure, and the documentation will help you get started using it. (An Azure account is required, but free Azure credits are available to help get you started.) You can find more information about this update at the link below.
Cortana Intelligence and Machine Learning Blog: Introducing the new Data Science Virtual Machine on Windows Server 2016
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Neural networks Exercises (Part3)
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
Neural network have become a corner stone of machine learning in the last decade. Created in the late 1940s with the intention to create computer programs who mimics the way neurons process information, those kinds of algorithm have long been believe to be only an academic curiosity, deprived of practical use since they require a lot of processing power and other machine learning algorithm outperform them. However since the mid 2000s, the creation of new neural network types and techniques, couple with the increase availability of fast computers made the neural network a powerful tool that every data analysts or programmer must know.
In this series of articles, we’ll see how to fit a neural network with R, we’ll learn the core concepts we need to know to well apply those algorithms and how to evaluate if our model is appropriate to use in production. In the last exercises sets, we have seen how to implement a feedforward neural network in R. That kind of neural network is quite useful to match a single input value to a specific output value, either a dependent variable in regression problems or a class in clustering problems. However sometime, a sequence of input can give a lot more of information to the network than a single value. For example, if you want to train a neural network to predict which letter will come next in a word based on which letters have been typed, making prediction based on the last letter entered can give good results, but if all the previous letter are used for making the predictions the results should be better since the arrangement of previous letter can give important information about the rest of the word.
In today’s exercise set, we will see a type of neural network that is design to make use of the information made available by using sequence of inputs. Those ”recurrent neural networks” do so by using a hidden state at time t1 that influence the calculation of the weight at time t. For more information about this type of neural network, you can read this article which is a good introduction on the subject.
Answers to the exercises are available here.
Exercise 1
We will start by using a recurrent neural network to predict the values of a time series. Load the tsEuStockMarkets dataset from the dataset package and save the first 1400 observations from the “DAX” time series as your working dataset.
Exercise 2
Process the dataset so he can be used in a neural network.
Exercise 3
Create two matrix containing 10 sequences of 140 observations from the previous dataset. The first one must be made of the original observations and will be the input of our neural network. The second one will be the output and since we want to predict the value of the stock market at time t+1 based on the value at time t, this matrix will be the same as the first one were all the elements are shifted from one position. Make sure that each sequence are coded as a row of each matrix.
Exercise 4
Set the seed to 42 and choose randomly eight sequences to train your model and two sequences that will be used for validation later. Once it’s done, load the rnn package and use the trainr() function to train a recurrent neural network on the training dataset. For now, use a learning rate of 0.01, one hidden layer of one neuron and 500 epoch.
Exercise 5
Use the function predictr to make prediction on all the 10 sequences of your original data matrix, then plot the real values and the predicted value on the same graph. Also draw the plot of the prediction on the test set and the real value of your dataset.
Exercise 6
The last model seems to underestimate the stock values that are higher than 0.5. Repeat the step of exercise 3 and 4 but this time use 10 hidden layers. Once it’s done calculate the RMSE of your predictions. This will be the baseline model for the rest of this exercise set.
 Work with Deep Learning networks and related packages in R
 Create Natural Language Processing models
 And much more
Exercise 7
One interesting method often used to accelerate the training of a neural network is the “Nesterov momentum”. This procedure is based on the fact that while trying to find the weights that minimize the cost function of your neural network, optimization algorithm like gradient descend “zigzag” around a straight path to the minimum value. By adding a momentum matrix, which keeps track of the general direction of the gradient, to the gradient we can minimize the deviation from this optimal path and speeding the convergence of the algorithm. You can see this video for more information about this concept.
Repeat the last exercise, but this time use 250 epochs and a momentum of 0.7.
Exercise 8
As special type of recurrent neural network trained by backpropagation through time is called the Long ShortTerm Memory (LSTM) network. This type of recurrent neural network is quite useful in a deep learning context, since this method is robust again the vanishing gradient problem. We will see both of those concepts more in detail in a future exercise set, but for now you can read about it here.
The trainr() function give us the ability to train a LSTM network by setting the network_type parameter to “lstm”. Use this algorithm with 500 epochs and 20 neuron in the hidden layer to predict the value of your time series.
Exercise 9
When working with a recurrent neural network it is important to choose an input sequence length that give the algorithm the maximum information possible without adding useless noise to the input. Until now we used 10 sequences of 140 observations. Train a recurrent neural network on 28 sequences of 50 observations, make prediction and compute the RMSE to see if this encoding had an effect on your predictions.
Exercise 10
Try to use all of the 1860 observation in the “DAX” time series to train and test a recurrent neural network. Then post the setting you used for your model and why you choose them in the comments.
 Neural networks Exercises (Part2)
 Neural networks Exercises (Part1)
 Forecasting for small business Exercises (Part4)
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Call for Help: R/Shiny Developer
(This article was first published on R – Fantasy Football Analytics, and kindly contributed to Rbloggers)
Dear Fantasy Football Analytics Community,
Four years ago, we released web apps to help people make better decisions in fantasy football based on the wisdom of the crowd. Over the past four years, the community response has been incredibly supportive, and we continually improved the apps in response to user feedback. The community also contributed directly to the project, with a number of users making additions and edits to our public source R scripts on our GitHub repo and our ffanalytics R package. In sum, we provide web apps built by the people, for the people.
This brings me to our call for help. As we try to keep up with feature requests from the community and the impressive increase in users, we are looking for additional help. We are looking for an R/Shiny developer to help develop our web apps (apps.fantasyfootballanalytics.net). We are looking to help optimize and streamline the apps as well as add features to our popular platform. It would also be preferable for the developer to have some knowledge of American Football and fantasy football.
Crucial skills:
 R package knowledge
 Experience developing/testing R code/packages
 Familiarity with Shiny
Nicetohave skills (but not required):
 Knowledge of American Football and fantasy football
To apply, please email the following to jobs@fantasyfootballanalytics.net:
 letter of interest with a brief description of relevant skills
 resume/CV
 how much time you expect to be able to contribute
Sincerely,
Isaac Petersen
The post Call for Help: R/Shiny Developer appeared first on Fantasy Football Analytics.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...