### New package polypoly (helper functions for orthogonal polynomials)

(This article was first published on ** Higher Order Functions**, and kindly contributed to R-bloggers)

Last week, I released a new package called polypoly to CRAN. It wraps up

some common tasks for dealing with orthogonal polynomials into a single package.

The README shows off the main

functionality, as well as the neat “logo” I made for the package.

In this post, I use the package on some word recognition data.

I primarily use orthogonal polynomials to model data from eyetracking

experiments where growth curves describe how the probability of looking at a

image changes as the image is named. The analysis technique, including

orthogonal polynomials and mixed effects models of eyetracking data, are

described in Mirman’s 2014 book.

In our 2015 paper, toddlers saw

two images on a computer screen. The objects in the images started with

different consonants: for example, *duck* and *ball*. The toddlers heard

sentences like “find the ball”, and we measured how their gaze location onscreen

changed in response to speech. This setup is a pretty standard procedure for

studying spoken word recognition.

We manipulated the vowel in the word *the*. In the *facilitating* condition, the

vowel has acoustic information (via anticipatory coarticulation) which would

allow an adult listener to predict the upcoming consonant. In the *neutral*

condition, the vowel provides no cues about the upcoming consonant. The

scientific question is whether these kiddos can take advantage of these acoustic

cues during word recognition.

Here’s how the data look, both in R and in a plot.

library(ggplot2) library(dplyr) # The data d #> # A tibble: 986 x 6 #> Subj Condition Time ToDistractor ToTarget Proportion #> #> 1 1 facilitating 200 9 9 0.5000000 #> 2 1 facilitating 250 9 10 0.5263158 #> 3 1 facilitating 300 6 12 0.6666667 #> 4 1 facilitating 350 6 12 0.6666667 #> 5 1 facilitating 400 6 12 0.6666667 #> 6 1 facilitating 450 6 12 0.6666667 #> 7 1 facilitating 500 6 12 0.6666667 #> 8 1 facilitating 550 6 12 0.6666667 #> 9 1 facilitating 600 4 12 0.7500000 #> 10 1 facilitating 650 3 15 0.8333333 #> # ... with 976 more rows # Helper dataframe of where to put condition labels on the next plot df_labs <- data_frame( Time = c(650, 800), Proportion = c(.775, .625), Condition = c("facilitating", "neutral")) p <- ggplot(d) + aes(x = Time, y = Proportion, color = Condition) + geom_hline(yintercept = .5, size = 2, color = "white") + stat_summary(fun.data = mean_se) + geom_text(aes(label = Condition), data = df_labs, size = 6) + labs(x = "Time after noun onset [ms]", y = "Proportion looks to named image", caption = "Mean ± SE. N = 29 children.") + guides(color = "none") pEarly on, children look equal amounts to both images on average (.5), and the

proportion of looks to the named image increase as the word unfolds. In the

facilitating condition, that rise happens earlier.

We fit a mixed-effects logistic regression model to estimate how the probability

of looking to the named image changes over time, across conditions, and within

children. We use cubic orthogonal polynomials to represent Time. For each time

point, we have three predictors available to us: Time1,

Time2, and Time3. (Plus, there’s a constant “intercept”

term.) Our model’s growth curve will be a weighted combination of these polynomial

curves. The code below shows off about half the functionality of the package

:bowtie::

I think people sometimes describe the contributions of these curves to the

overall growth curve as *trends*: “A negative linear trend”, “a significant

quadratic trend”, etc. I like that word because it makes the terminology a

little less intimidating.

Why do we use orthogonal polynomial terms? First, note that simple polynomials

*x*, *x*2 and *x*3 are correlated. Orthogonal ones are not

correlated. (Hence, the name.)

Adding new correlated predictors to a model is a problem. The parameter

estimates will change as different predictors are added. Here we simulate some

fake data, and fit three models with 1-, 2- and 3-degree raw polynomials.

As expected, the estimates for the effects change from model to model:

models %>% lapply(broom::tidy) %>% bind_rows(.id = "model") %>% select(model:estimate) %>% mutate(estimate = round(estimate, 2)) #> model term estimate #> 1 m1 (Intercept) 75.69 #> 2 m1 x 72.91 #> 3 m2 (Intercept) -23.91 #> 4 m2 x 122.72 #> 5 m2 I(x^2) -4.53 #> 6 m3 (Intercept) 1.15 #> 7 m3 x 100.48 #> 8 m3 I(x^2) 0.29 #> 9 m3 I(x^3) -0.29But with orthogonal polynomials, the parameter estimates don’t change from model

to model.

That’s probably the simplest reason why orthogonal polynomials are preferred. (I

can’t remember any others right now.)

Before fitting the model, I use poly_add_columns() to add polynomial terms as

columns to the dataframe. (For speed here, I use a simplified random effects

structure, estimating growth curve parameters for each Child x Condition

combination.)

We can confirm that the model captures the overall shape of the growth curves.

# The lines here are not quite the overall average, but the averages of 29 # individual fits (for each participant). That's why the caption is a little # weird. p + stat_summary(aes(y = fitted(m)), fun.y = mean, geom = "line") + labs(caption = "Line: Average of model-fitted values. Points: Mean ± SE.")We can inspect the model summary as well.

arm::display(m) #> glmer(formula = cbind(ToTarget, ToDistractor) ~ (ot1 + ot2 + #> ot3) * Condition + (ot1 + ot2 + ot3 | Subj:Condition), data = d, #> family = binomial) #> coef.est coef.se #> (Intercept) 0.47 0.10 #> ot1 1.57 0.28 #> ot2 0.45 0.11 #> ot3 -0.34 0.09 #> Conditionfacilitating 0.23 0.14 #> ot1:Conditionfacilitating 0.45 0.39 #> ot2:Conditionfacilitating -0.44 0.16 #> ot3:Conditionfacilitating 0.11 0.13 #> #> Error terms: #> Groups Name Std.Dev. Corr #> Subj:Condition (Intercept) 0.53 #> ot1 1.46 0.23 #> ot2 0.52 -0.05 0.31 #> ot3 0.39 -0.08 -0.64 0.09 #> Residual 1.00 #> --- #> number of obs: 986, groups: Subj:Condition, 58 #> AIC = 4788.2, DIC = -3961.1 #> deviance = 395.6The model summary indicates a significant Condition x Time2

interaction, but really, only the intercept and Time1 can ever be

interpreted directly. To understand the model fit, we visualize how each of the

polynomial terms are weighted.

Here we create a matrix of the polynomial terms plus a column of ones for the

intercept.

To compute the weighted values, we multiply by a diagonal matrix of the

coefficients.

Then, we can use the poly_melt() function to get a dataframe from each

weighted matrix and then plot each of the effects.

Visually, the quadratic effect on the neutral curve pulls down the values during

the center (when the curves are most different) and pushes the values in the

tails upwards (when the curves are closest). Although only the quadratic effect

is nominally significant, the constant and linear terms suggest other smaller

effects but they are too noisy to pin down.

It’s worth noting that the predictors and weights discussed above are on the

log-odds/logit scale used inside of the model, instead of the proportion scale

used in the plots of the data and model fits. Basically, these weighted values

are summed together and then squeezed into the range [0, 1] with a nonlinear

transformation. For these data, the two scales produce similar looking growth

curves, but you can notice that the right end of the curves are pinched slightly

closer together in the probability-scale plot:

To **leave a comment** for the author, please follow the link and comment on their blog: ** Higher Order Functions**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Mixed models for ANOVA designs with one observation per unit of observation and cell of the design

(This article was first published on ** Computational Psychology - Henrik Singmann**, and kindly contributed to R-bloggers)

Together with David Kellen I am currently working on an introductory chapter to mixed models for a book edited by Dan Spieler and Eric Schumacher (the current version can be found here). The goal is to provide a theoretical and practical introduction that is targeted mainly at experimental psychologists, neuroscientists, and others working with experimental designs and human data. The practical part focuses obviously on R, specifically on lme4 and afex.

One part of the chapter was supposed to deal with designs that cannot be estimated with the *maximal random effects structure justified by the design* because there is only one observation per participant and cell of the design. Such designs are the classical repeated-measures ANOVA design as ANOVA cannot deal with replicates at the cell levels (i.e., those are usually aggregated to yield one observation per cell and unit of observation). Based on my previous thoughts that turned out to be wrong we wrote the following:

Random Effects Structures for Traditional ANOVA Designs

The estimation of the maximal model is not possible when there is only one observation per participant and cell of a repeated-measures design. These designs are typically analyzed using a repeated-measures ANOVA. Currently, there are no clear guidelines on how to proceed in such situations, but we will try to provide some advice. If there is only a single random effects grouping factor, for example participants, we feel that instead of a mixed model, it is appropriate to use a standard repeated-measures ANOVA that addresses sphericity violations via the Greenhouse-Geisser correction.

One alternative strategy that employs mixed models and that we \emph{do not recommend} consists of using the random-intercept only model or removing the random slopes for the highest within-subject interaction. The resulting model assumes invariance of the omitted random effects across participants. If this assumption is violated such a model produces results that cannot be trusted . […]

Fortunately, we asked Jake Westfall to take a look at the chapter and Jake responded:

I don’t think I agree with this. In the situation you describe, where we have a single random factor in a balanced ANOVA-like design with 1 observation per unit per cell, personally I am a proponent of the omit-the-the-highest-level-random-interaction approach. In this kind of design, the random slopes for the highest-level interaction are perfectly confounded with the trial-level error term (in more technical language, the model is only identifiable up to the sum of these two variance components), which is what causes the identifiability problems when one tries to estimate the full maximal model there. (You know all of this of course.) So two equivalent ways to make the model identifiable are to (1) omit the error term, i.e., force the residual variance to be 0, or (2) omit the random slopes for the highest-level interaction. Both of these approaches should (AFAIK) result in a statistically equivalent model, but lme4 does not provide an easy way to do (1), so I generally recommend (2). The important point here is that the standard errors should still be correct in either case — because these two variance components are confounded, omitting e.g. the random interaction slopes simply causes that omitted variance component to be implicitly added to the residual variance, where it is still incorporated into the standard errors of the fixed effects in the appropriate way (because the standard error of the fixed interaction looks roughly like sqrt[(var_error + var_interaction)/n_subjects]). I think one could pretty easily put together a little simulation that would demonstrate this.

Hmm, that sounds very reasonable, but can my intuition on the random effects structure and mixed models really be that wrong? To investigate this I followed Jake’s advise and coded a short simulation that tested this and as it turns out, Jake is right and I was wrong.

In the simulation we will simulate a simple one-factor repeated-measures design with one factor with three levels. Importantly, each unit of observation will only have one observation per factor level. We will then fit this simulated data with both repeated-measures ANOVA and random-intercept only mixed and compare their *p*-values. Note again that for such a design we cannot estimate random slopes for the condition effect.

First, we need a few packages and set some parameters for our simulation:

require(afex) set_sum_contrasts() # for orthogonal sum-to-zero contrasts require(MASS) NSIM <- 1e4 # number of simulated data sets NPAR <- 30 # number of participants per cell NCELLS <- 3 # number of cells (i.e., groups)Now we need to generate the data. For this I employed an approach that is clearly not the most parsimonious, but most clearly follows the formulation of a mixed model that has random variability in the condition effect and on top of this residual variance (i.e., the two confounded factors).

We first create a bare bone data.frame with participant id and condition column and a corresponding model.matrix. Then we create the three random parameters (i.e., intercept and the two parameters for the three conditions) using a zero-centered multivarite normal with specified variance-covariance matrix. We then loop over the participant and estimate the predictions deriving from the three random effects parameters. Only after this we add uncorrelated residual variance to the observations for each simulated data set.

dat <- expand.grid(condition = factor(letters[seq_len(NCELLS)]), id = factor(seq_len(NPAR))) head(dat) # condition id # 1 a 1 # 2 b 1 # 3 c 1 # 4 a 2 # 5 b 2 # 6 c 2 mm <- model.matrix(~condition, dat) head(mm) # (Intercept) condition1 condition2 # 1 1 1 0 # 2 1 0 1 # 3 1 -1 -1 # 4 1 1 0 # 5 1 0 1 # 6 1 -1 -1 Sigma_c_1 <- matrix(0.6, NCELLS,NCELLS) diag(Sigma_c_1) <- 1 d_c_1 <- replicate(NSIM, mvrnorm(NPAR, rep(0, NCELLS), Sigma_c_1), simplify = FALSE) gen_dat <- vector("list", NSIM) for(i in seq_len(NSIM)) { gen_dat[[i]] <- dat gen_dat[[i]]$dv <- NA_real_ for (j in seq_len(NPAR)) { gen_dat[[i]][(j-1)*3+(1:3),"dv"] <- mm[1:3,] %*% d_c_1[[i]][j,] } gen_dat[[i]]$dv <- gen_dat[[i]]$dv+rnorm(nrow(mm), 0, 1) }Now we only need a function that estimates the ANOVA and mixed model for each data set and returns the p-value and loop over it.

## functions returning p-value for ANOVA and mixed model within_anova <- function(data) { suppressWarnings(suppressMessages( a <- aov_ez(id = "id", dv = "dv", data, within = "condition", return = "univariate", anova_table = list(es = "none")) )) c(without = a[["univariate.tests"]][2,6], gg = a[["pval.adjustments"]][1,2], hf = a[["pval.adjustments"]][1,4]) } within_mixed <- function(data) { suppressWarnings( m <- mixed(dv~condition+(1|id),data, progress = FALSE) ) c(mixed=anova(m)$`Pr(>F)`) } p_c1_within <- vapply(gen_dat, within_anova, rep(0.0, 3)) m_c1_within <- vapply(gen_dat, within_mixed, 0.0)The following graph shows the results (GG is the results using the Greenhouse-Geisser adjustment for sphericity violations).

ylim <- c(0, 700) par(mfrow = c(1,3)) hist(p_c1_within[1,], breaks = 20, main = "ANOVA (default)", xlab = "p-value", ylim=ylim) hist(p_c1_within[2,], breaks = 20, main = "ANOVA (GG)", xlab = "p-value", ylim=ylim) hist(m_c1_within, breaks = 20, main = "Random-Intercept Model", xlab = "p-value", ylim=ylim)What these graph clearly shows is that the p-value distribution for the standard repeated-measures ANOVA and the random-intercept mixed model is virtually identical. This clearly shows that my intuition was wrong and Jake was right.

We also see that for ANOVA and mixed model the rate of significant findings with *p* < .05 is slightly above the nominal level. More specifically:

These additional results indicate that maybe one also needs to adjust the degrees of freedom for mixed models for violations of sphericity. But this is not the topic of today’s post.

To sum this up, this simulation shows that removing the highest-order random slope seems to be the right decision if one wants to use a mixed model for a design with one observation per cell of the design and participant, but wants to implement the ‘maximal random effects structure’.

One more thing to note. Ben Bolker raised the same issue and pointed us to one of his example analyses of the starling data that is relevant to the current question. We are very grateful that Jake and Ben took the time to go through our chapter!

You can also download the RMarkdown file of the simulation.

References 881472{STGUHGIT};{STGUHGIT},{M2F33N9W}

apa

default

ASC

no

499

http://singmann.org/wp-content/plugins/zotpress/

To **leave a comment** for the author, please follow the link and comment on their blog: ** Computational Psychology - Henrik Singmann**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Instrumental Variables in R exercises (Part-3)

(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

This is the third part of the series on Instrumental Variables. For other parts of the series follow the tag instrumental variables.

In this exercise set we will use Generalized Method of Moments (GMM) estimation technique using the examples from part-1 and part-2.

Recall that GMM estimation relies on the relevant moment conditions. For OLS we assume that predictors are uncorrelated with the error term. Similarly, IV estimation implies that the instrument(s) is uncorrelated with the error term.

Answers to the exercises are available here.

**Exercise 1**

Load the AER package (package description: here) and the PSID1976 dataset. This has data regarding labor force participation of married women.

Define a new data-frame that has data for all married women that were employed. As we did in part-2, this data-frame will be used for the remaining exercises.

Next, load the ‘gmm’ package (package description: here).

**Exercise 2**

We will start with a simple example. Regress log(wage) on education using the usual OLS technique.

Next, use the gmm function to estimate the same model using OLS’s moment conditions. Match your result and comment on the standard errors.

**Exercise 3**

Estimate the return to education for the model from Exercise-2 using feducation as the IV. Use both ivreg and gmm functions and compare results.

**Exercise 4**

Regress log(wage) on education, experience and experience^2 using the usual OLS technique.

Next, use the gmm function to estimate the same model using OLS’s moment conditions. Match your result.

**Exercise 5**

Estimate the return to education for the model from Exercise-4 using feducation as the IV. Use both ivreg and gmm functions and compare results.

**Exercise 6**

We will now use the over-identified case. Estimate the return to education for the model from Exercise-2 using feducation and meducation as IVs. Use both ivreg and gmm functions and compare results.

**Exercise 7**

Estimate the return to education for the model from Exercise-4 using feducation and meducation as IVs. Use both ivreg and gmm functions and compare results.

**Exercise 8**

The test of over-identifying restrictions can be obtained by the J-test (Sargan test). It is displayed with the summary and specTest functions. Do the over-identified moment conditions match the data well?

**Exercise 9**

Iterated estimation might offer some advantages over the default two-step method in some cases. Estimate the model in Exercise-7 using the iterative estimation technique.

**Exercise 10**

Use the plot function to get the graph of log(wage) and fitted values for the model in Exercise-7.

- Instrumental Variables in R exercises (Part-1)
- Instrumental Variables in R exercises (Part-2)
- Forecasting: ARIMAX Model Exercises (Part-5)
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Update to GetHFData (Version 1.3)

(This article was first published on ** R and Finance**, and kindly contributed to R-bloggers)

–

I just posted a major update to GetHFData. Version 1.3 of GetHFData makes it possible to download and aggregate order data from Bovespa. The data comprises buy and sell orders sent by market operators. Tabular data includes type of orders (buy or sell, new/update/cancel/..), date/time of submission, priority time, prices, order quantity, among many other information.

**Be aware that these are very large files.** One day of buy and sell orders in the equity market is around 100 MB zipped and close to 1 GB unzipped. If you computer is not suited to store this data in its memory, **it will crash**.

Here’s an example of usage that will download and aggregate order data for all option contracts related to Petrobras (PETR):

library(GetHFData) first.time <- '10:00:00' last.time <- '17:00:00' first.date <- '2015-08-18' last.date <- '2015-08-18' type.output <- 'agg' # aggregates data agg.diff <- '5 min' # interval for aggregation my.assets <- 'PETR' # all options related to Petrobras (partial matching) type.matching <- 'partial' # finds tickers from my.assets using partial matching type.market = 'options' # option market type.data <- 'orders' # order data df.out <- ghfd_get_HF_data(my.assets =my.assets, type.data= type.data, type.matching = type.matching, type.market = type.market, first.date = first.date, last.date = last.date, first.time = first.time, last.time = last.time, type.output = type.output, agg.diff = agg.diff)Minor updates:

- Fixed link to paper and citation
- Now it is possible to use partial matching (e.g. use PETR for all stocks or options related to Petrobras)
- implement option for only downloading files (this is helpful if you are dealing with order data and will process the files in other R session or software)
- muted message “Using ‘,’ as decimal …” from readr

I’m also now using Github for development. This will make it easier to handle bug reports and version control. Here’s is the link. The new version is already available in Github. Give it a try:

devtools::install_github('msperlin/GetHFData')The update should be in CRAN shortly.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R and Finance**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Manipulate Biological Data Using Biostrings Package Exercises(Part 2)

(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

Bioinformatics is an amalgamation Biology and Computer science.Biological Data is manipulated using Computers and Computer software’s in Bioinformatics. Biological Data includes DNA; RNA & Proteins. DNA & RNA is made of Nucleotide which are our genetic material in which we are coded.Our Structure and Functions are done by protein, which are build of Amino acids

In this exercise we try correlate the relation between DNA, RNA & Protein.

Conversion of DNA to RNA is known as Transcription. DNA/RNA to protein is known as Translation.

Here we also discuss Sequence Alignment Techniques. Sequence Alignment is comparing the similarity between the sequences to check how much the DNA,RNA or Protein are related to each other.

Three are three types of Sequence Alignment

1. Global Alignment

2. Local Alignment

3. Over lap Alignment

In the exercises below we cover how we can Manipulate Biological Data using Biostrings package in Bioconductor.

Install Packages

Biostrings

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

Create a DNA String and find out the complement of the DNA

**Exercise 2**

Create a RNA String and find out the complement of the RNA

**Exercise 3**

Create a DNA string and find the reverse complement of the same.

**Exercise 4**

Create a RNA string and find the reverse complement of the same.

**Exercise 5**

Create a DNA string and translate the same into Amino Acids using Standard Genetic codon and print the three letter codon of the amino acids

**Exercise 6**

Create a DNA string and translate the same into Amino Acids using Standard Genetic codon and print the three letter codon of the amino acids

**Exercise 7**

Create two DNA Strings and align the sequence using Global Alignment technique and print the score of the alignment

**Exercise 8**

Create two DNA Strings and align the sequence using Global Alignment technique and print the score of the alignment after specifying your own score for match and mismatch among the sequence

**Exercise 9**

Create two DNA Strings and align the sequence using Local Alignment technique and print the score of the alignment

**Exercise 10**

Create two DNA Strings and align the sequence using Overlap Alignment technique and print the score of the alignment

Related exercise sets:- Bioinformatics Tutorial with Exercises in R (part 1)
- Manipulate Biological Data Using Biostrings Package Exercises (Part 1)
- Accessing and Manipulating Biological Databases Exercises (Part-3)
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### 5 ways to measure running time of R code

A reviewer asked me to report detailed running times for all (so many :scream:) performed computations in one of my papers, and so I spent a Saturday morning figuring out my favorite way to benchmark R code. This is a quick summary of the options I found to be available.

A quick online search revealed at least three R packages for benchmarking R code (rbenchmark, microbenchmark, and tictoc). Additionally, base R provides at least two methods to measure the running time of R code (Sys.time and system.time). In the following I briefly go through the syntax of using each of the five option, and present my conclusions at the end.

1. Using Sys.timeThe run time of a chunk of code can be measured by taking the difference between the time at the start and at the end of the code chunk. Simple yet flexible :sunglasses:.

sleep_for_a_minute <- function() { Sys.sleep(60) } start_time <- Sys.time() sleep_for_a_minute() end_time <- Sys.time() end_time - start_time # Time difference of 1.000327 mins 2. Library tictocThe functions tic and toc are used in the same manner for benchmarking as the just demonstrated Sys.time. However tictoc adds a lot more convenience to the whole.

The most recent development1 version of tictoc can be installed from github:

devtools::install_github("collectivemedia/tictoc")One can time a single code chunk:

library(tictoc) tic("sleeping") print("falling asleep...") sleep_for_a_minute() print("...waking up") toc() # [1] "falling asleep..." # [1] "...waking up" # sleeping: 60.026 sec elapsedOr nest multiple timers:

tic("total") tic("data generation") X <- matrix(rnorm(50000*1000), 50000, 1000) b <- sample(1:1000, 1000) y <- runif(1) + X %*% b + rnorm(50000) toc() tic("model fitting") model <- lm(y ~ X) toc() toc() # data generation: 3.792 sec elapsed # model fitting: 39.278 sec elapsed # total: 43.071 sec elapsed 3. Using system.timeOne can time the evaluation of an R expression using system.time. For example, we can use it to measure the execution time of the function sleep_for_a_minute (defined above) as follows.

system.time({ sleep_for_a_minute() }) # user system elapsed # 0.004 0.000 60.051**But what exactly are the reported times user, system, and elapsed?** :confused:

Well, clearly elapsed is the wall clock time taken to execute the function sleep_for_a_minute, plus some benchmarking code wrapping it (that’s why it took slightly more than a minute to run I guess).

As for user and system times, William Dunlap has posted a great explanation to the r-help mailing list:

“User CPU time” gives the CPU time spent by the current process (i.e., the current R session) and “system CPU time” gives the CPU time spent by the kernel (the operating system) on behalf of the current process. The operating system is used for things like opening files, doing input or output, starting other processes, and looking at the system clock: operations that involve resources that many processes must share. Different operating systems will have different things done by the operating system.

:grinning:

4. Library rbenchmarkThe documentation to the function benchmark from the rbenchmark R package describes it as “a simple wrapper around system.time”. However it adds a lot of convenience compared to bare system.time calls. For example it requires just one benchmark call to time multiple replications of multiple expressions. Additionally the returned results are conveniently organized in a data frame.

I installed the development1 version of the rbenchmark package from github:

devtools::install_github("eddelbuettel/rbenchmark")For example purposes, let’s compare the time required to compute linear regression coefficients using three alternative computational procedures:

- lm,
- the Moore-Penrose pseudoinverse,
- the Moore-Penrose pseudoinverse but without explicit matrix inverses.

Here, the meaning of elapsed, user.self, and sys.self is the same as described above in the section about system.time, and relative is simply the time ratio with the fastest test. Interestingly lm is by far the slowest here.

5. Library microbenchmarkThe most recent development version of microbenchmark can be installed from github:

devtools::install_github("olafmersmann/microbenchmarkCore") devtools::install_github("olafmersmann/microbenchmark")Much like benchmark from the package rbenchmark, the function microbenchmark can be used to compare running times of multiple R code chunks. But it offers a great deal of convenience and additional functionality.

I find that one particularly nice feature of microbenchmark is the ability to automatically check the results of the benchmarked expressions with a user-specified function. This is demonstrated below, where we again compare three methods computing the coefficient vector of a linear model.

library(microbenchmark) set.seed(2017) n <- 10000 p <- 100 X <- matrix(rnorm(n*p), n, p) y <- X %*% rnorm(p) + rnorm(100) check_for_equal_coefs <- function(values) { tol <- 1e-12 max_error <- max(c(abs(values[[1]] - values[[2]]), abs(values[[2]] - values[[3]]), abs(values[[1]] - values[[3]]))) max_error < tol } mbm <- microbenchmark("lm" = { b <- lm(y ~ X + 0)$coef }, "pseudoinverse" = { b <- solve(t(X) %*% X) %*% t(X) %*% y }, "linear system" = { b <- solve(t(X) %*% X, t(X) %*% y) }, check = check_for_equal_coefs) mbm # Unit: milliseconds # expr min lq mean median uq max neval cld # lm 96.12717 124.43298 150.72674 135.12729 188.32154 236.4910 100 c # pseudoinverse 26.61816 28.81151 53.32246 30.69587 80.61303 145.0489 100 b # linear system 16.70331 18.58778 35.14599 19.48467 22.69537 138.6660 100 aWe used the function argument check to check for equality (up to a maximal error of 1e-12) of the results returned by the three methods. If the results weren’t equal, microbenchmark would return an error message.

Another great feature is the integration with ggplot2 for plotting microbenchmark results.

library(ggplot2) autoplot(mbm) ConclusionThe given demonstration of the different benchmarking functions is surely not exhaustive. Nevertheless I made some conclusions for my personal benchmarking needs:

- The Sys.time approach as well as the tictoc package can be used for timing (potentially nested) steps of a complicated algorithm (that’s often my use case). However, tictoc is more convenient, and (most importantly) foolproof.
- We saw that microbenchmark returns other types of measurements than benchmark, and I think that in most situations the microbenchmark measurements are of a higher practical significance :stuck_out_tongue:.
- To my knowledge microbenchmark is the only benchmarking package that has visualizations built in :+1:.

For these reasons I will go with microbenchmark and tictoc. :bowtie:

### An Interesting Study: Exploring Mental Health Conditions in the Tech Workplace

(This article was first published on ** R – NYC Data Science Academy Blog**, and kindly contributed to R-bloggers)

Contributed by Bo Lian. This blog post is the first class project on Exploratory Visualization & Shiny. Please check the Shiny App here.

Background and MotivationMental illness is a global health problem. It has critically high prevalence, yielding severe health outcomes. One in four adults suffers from a diagnosable mental illness in any given year (National Alliance on Mental Health, 2013). In the tech industry, 50% of individuals have sought treatment for mental illness, according to Finkler, 2015.

This important health issue warrants further investigation. This study, based on the 2014 Mental Health in Tech Survey from Open Sourcing Mental Illness, performs a complete data visualization and statistical analyses among the mental illness and various factors in order to answer following questions:

- What are the strongest groups of predictors of mental health illness in the workplace?
- What might be the causes of the high mental illnesses prevalence in the tech industry other than some common thoughts?
- How is the representativeness of the survey, is other factors involved: geographic locations, ages, etc.?

The data used in the study is the 2014 Mental Health in Tech Survey from Open Sourcing Mental Illness. With 1259 responses, it is considered the largest survey on mental health in the tech industry.

The dataset contains 27 factors that could be segmented into 5 categories, each of which can be performed for Chi-squared test against the response variable of the treatment (if one has sought treatment for a mental health condition).

1. **Demographics** : age, gender, country, state etc.

2. **Mental Health Condition** : treatment, work interference, family history etc.

3. **Employment Background** : tech, remote, employee number, size etc.

4. O**rganizational Policies on Mental Health** : benefit, wellness program etc.

5. **Openness about Mental Health** : self awareness, coworkers, supervisors, openness to discuss etc.

Data Visualization and Analysis

The data is drawn from employees from 46 countries, among which US employees account for 59.2%. Most of those hail from tech heavy states like California, New York, Washington, etc.

The median age of the surveyees is 31. The distribution of ages is right skewed which is expected as tech industry tends to have younger employees.

From the box-plots, there is no statistically significant difference of ages between the treatment and the non treatment groups.

Below are 6 of the data analysis examples by Chi-Square Testing with the scores and P values. As expected, mental illness is strongly dependent on the family history, work interference. If the companies offer mental-illness care options, it also enhances the mental illness treatment rate, so as gender difference. What is interesting is that the study shows that the data shows that what people may assume correlates with mental illness is not a factor at all. For example, either remote work or tech type company is a factor does not correlate with t mental illness so as the supervisor’s attitude.

Results and Conclusions:

**Significant Factors of the Chi-Square tests**

They include: gender, family history, work interference, benefits, care options, mental health interview etc.

**Insignificant Factors of the Chi-Square tests**

Identifiers like self_employed, remote work, number of employees, tech or not, supervisors etc. , did not prove significant.

Among the results, the family history, work interference, gender difference are mostly likely to be the main causal factors. If a company has provided better mental health care services, the employees might be more likely to seek treatment or raise their awareness, different genders might tend to seek treatment differently as well. Interestingly, none of the true work conditions has any statistically significant impact on the mental illness of the employees. Human are tougher than they think! One thing to consider is that as the tech industry is primarily located at the US west or east coast or developed countries as shown in the map, the high living cost, competitive atmosphere, and high population density might contribute greatly to the prevalence of mental illness rather than the working conditions themselves.

Keep exploring on

- More data visualization
- Data complexity and subjectivity, location specificity
- Integrating ongoing survey data from 2016
- Interaction among factors
- Predictions with multiple variates

The post An Interesting Study: Exploring Mental Health Conditions in the Tech Workplace appeared first on NYC Data Science Academy Blog.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – NYC Data Science Academy Blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Oakland Real Estate – Full EDA

(This article was first published on ** R Analysis – Analyst at Large**, and kindly contributed to R-bloggers)

Living in the Bay Area has led me to think more and more about real estate (and how amazingly expensive it is here…) I’ve signed up for trackers on Zillow and Redfin, but the data analyst in me always wants to dive deeper, to look back historically, to quantify, to visualize the trends, etc… With that in mind, here is my first view at Oakland real estate prices over the past decade. I’ll only be looking at multi-tenant units (duplexes, triplexes, etc.)

The first plot is simply looking at the number of sales each month:

You can clearly see a strong uptick in the number of units sold from 2003 to 2005 and the following steep decline in sales bottoming out during the financial crisis in 2008. Interestingly, sales pick up again very quickly in 2009 and 2010 (a time when I expected to see low sales figures) before stabilizing at the current rate of ~30 properties sold per month.

The next plot shows the price distribution for multi-tenant buildings sold in Oakland:

Here we can see prices rising pretty steadily right up until 2008 when the financial crisis hit (peaking with median prices around $650K). The prices are rising in 2006 and 2007 even while the number of sales is dropping rapidly. After prices drop to ~$250K we can see that they stay low from 2009 – 2013 (even though sales picked up dramatically in 2009). This suggests that a number of investors recognized the low prices and bought up a lot of the available properties. Starting in 2013 prices begin rising and they have continued to present where we are looking at record high median prices of ~$750K.

**Does month of year matter?**

I’ve started doing some basic analysis on Oakland real estate prices over the past decade (multi-tenant buildings only). There’s still a lot to unpack here, but I’m only able to investigate this 30 minutes at a time (new dad life), so I’m breaking the analysis into small, manageable pieces. The first one I wanted to explore quickly is: does month of year matter? I’ve often heard that summer is the busy season for buying a house because families with children try not to move during the school year. Does this rule also apply to multi-tenant buildings as well (which tend to be purchased by investors)?

I’ve collected the sales over the past decade and group by month of year. We can see that summer months (May, June, and July) do see more sales than other months. Interestingly, December also tends to see a large number of sales. Maybe people have more time over the holidays to check out investment properties? Not sure what else could be driving this weird spike in sales in December – any ideas?

Given that the most sales occur in summer months, I wanted to see if this has any impact on sale price.

There doesn’t seem to be much of a relationship at all between month of year and sale price. I had expected to see higher prices in the summer months under the assumption that demand is higher during that period, but maybe supply is also higher during that time (and they rise in roughly equal proportion). This is something that I could theoretically test (since I have the list date as well as the sale date for each property), but I think there are more interesting topics to explore… It’s enough for me to know that month of year doesn’t appear to be correlated with sale price!

**Are multi-tenant houses going above or below asking price?**

For many people one of the most difficult aspects of the house-buying process is deciding what to bid. I decided to take a look at the relationship between list price and sale price.

Since 2000 the average difference between list and sale price has generally been less than $25K. We can see that in 2004-2006 multi-tenant houses in Oakland were generally selling for an average of $15K – $25K above the asking price. In financial crisis in 2008, we can see houses going for an average of $25K less than asking. From 2010-2013, list price and sale price averaged close to $0K difference. Starting in 2013 we start to see houses selling for more than asking with multi-tenant houses in Oakland now averaging $50K more than asking! I know that housing prices have increased over time, so my next step was to look at these as percentage of asking price (attempting to control for inflation in housing values over the past two decades).

The shape of this plot matches the plot showing absolute dollar figures, but it is helpful to see percentages. The most recent data shows that Oakland multi-tenant houses sell at an average of 7.5% premium over asking price.

**How are days on market and sales price vs. list price premium related?**

I’ve often heard that the longer a house is on the market, the lower the premium vs. asking price. This seems as good a dataset as any to test this theory out!

This first plot just shows the relationship between days on market (x-axis) and difference between sale price and list price (y-axis). This plot isn’t particularly helpful, but it does show me a couple of things. I see a handful of outliers (houses on the market for 1000+ days or selling $1M below asking price). There also seems to be a fairly large cluster of houses selling significantly above asking price between 7-30 days after going on the market. Next step was just to bucket these days on market:

We can see that there does appear to be an inverse relationship between days on market and sales price – list price premium. Roughly 75% of houses sold in less than 30 days were sold above asking price (with median premiums of ~$10K). On the other hand, roughly 75% of houses on the market for 60+ days were sold below asking price (with median discounts of ~$10K). Next steps here would be converting to percent of list price and then reviewing these trends over time, but that will have to wait for another day!

R code here:

To **leave a comment** for the author, please follow the link and comment on their blog: ** R Analysis – Analyst at Large**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Testing the Hierarchical Risk Parity algorithm

(This article was first published on ** R – QuantStrat TradeR**, and kindly contributed to R-bloggers)

This post will be a modified backtest of the Adaptive Asset Allocation backtest from AllocateSmartly, using the Hierarchical Risk Parity algorithm from last post, because Adam Butler was eager to see my results. On a whole, as Adam Butler had told me he had seen, HRP does not generate outperformance when applied to a small, carefully-constructed, diversified-by-selection universe of asset classes, as opposed to a universe of hundreds or even several thousand assets, where its theoretically superior properties result in it being a superior algorithm.

First off, I would like to thank one Matthew Barry, for helping me modify my HRP algorithm so as to not use the global environment for recursion. You can find his github here.

Here is the modified HRP code.

covMat <- read.csv('cov.csv', header = FALSE) corMat <- read.csv('corMat.csv', header = FALSE) clustOrder <- hclust(dist(corMat), method = 'single')$order getIVP <- function(covMat) { invDiag <- 1/diag(as.matrix(covMat)) weights <- invDiag/sum(invDiag) return(weights) } getClusterVar <- function(covMat, cItems) { covMatSlice <- covMat[cItems, cItems] weights <- getIVP(covMatSlice) cVar <- t(weights) %*% as.matrix(covMatSlice) %*% weights return(cVar) } getRecBipart <- function(covMat, sortIx) { w <- rep(1,ncol(covMat)) w <- recurFun(w, covMat, sortIx) return(w) } recurFun <- function(w, covMat, sortIx) { subIdx <- 1:trunc(length(sortIx)/2) cItems0 <- sortIx[subIdx] cItems1 <- sortIx[-subIdx] cVar0 <- getClusterVar(covMat, cItems0) cVar1 <- getClusterVar(covMat, cItems1) alpha <- 1 - cVar0/(cVar0 + cVar1) # scoping mechanics using w as a free parameter w[cItems0] <- w[cItems0] * alpha w[cItems1] <- w[cItems1] * (1-alpha) if(length(cItems0) > 1) { w <- recurFun(w, covMat, cItems0) } if(length(cItems1) > 1) { w <- recurFun(w, covMat, cItems1) } return(w) } out <- getRecBipart(covMat, clustOrder) outWith covMat and corMat being from the last post. In fact, this function can be further modified by encapsulating the clustering order within the getRecBipart function, but in the interest of keeping the code as similar to Marcos Lopez de Prado’s code as I could, I’ll leave this here.

Anyhow, the backtest will follow. One thing I will mention is that I’m using Quandl’s EOD database, as Yahoo has really screwed up their financial database (I.E. some sector SPDRs have broken data, dividends not adjusted, etc.). While this database is a $50/month subscription, I believe free users can access it up to 150 times in 60 days, so that should be enough to run backtests from this blog, so long as you save your downloaded time series for later use by using write.zoo.

This code needs the tseries library for the portfolio.optim function for the minimum variance portfolio (Dr. Kris Boudt has a course on this at datacamp), and the other standard packages.

A helper function for this backtest (and really, any other momentum rotation backtest) is the appendMissingAssets function, which simply adds on assets not selected to the final weighting and re-orders the weights by the original ordering.

require(tseries) require(PerformanceAnalytics) require(quantmod) require(Quandl) Quandl.api_key("YOUR_AUTHENTICATION_HERE") # not displaying my own api key, sorry # function to append missing (I.E. assets not selected) asset names and sort into original order appendMissingAssets <- function(wts, allAssetNames, wtsDate) { absentAssets <- allAssetNames[!allAssetNames %in% names(wts)] absentWts <- rep(0, length(absentAssets)) names(absentWts) <- absentAssets wts <- c(wts, absentWts) wts <- xts(t(wts), order.by=wtsDate) wts <- wts[,allAssetNames] return(wts) }Next, we make the call to Quandl to get our data.

symbols <- c("SPY", "VGK", "EWJ", "EEM", "VNQ", "RWX", "IEF", "TLT", "DBC", "GLD") rets <- list() for(i in 1:length(symbols)) { # quandl command to download from EOD database. Free users should use write.zoo in this loop. returns <- Return.calculate(Quandl(paste0("EOD/", symbols[i]), start_date="1990-12-31", type = "xts")$Adj_Close) colnames(returns) <- symbols[i] rets[[i]] <- returns } rets <- na.omit(do.call(cbind, rets))While Josh Ulrich fixed quantmod to actually get Yahoo data after Yahoo broke the API, the problem is that the Yahoo data is now garbage as well, and I’m not sure how much Josh Ulrich can do about that. I really hope some other provider can step up and provide free, usable EOD data so that I don’t have to worry about readers not being able to replicate the backtest, as my policy for this blog is that readers should be able to replicate the backtests so they don’t just nod and take my word for it. If you are or know of such a provider, please leave a comment so that I can let the blog readers know all about you.

Next, we initialize the settings for the backtest.

invVolWts <- list() minVolWts <- list() hrpWts <- list() ep <- endpoints(rets, on = "months") nMonths = 6 # month lookback (6 as per parameters from allocateSmartly) nVol = 20 # day lookback for volatility (20 ibid)While the AAA backtest actually uses a 126 day lookback instead of a 6 month lookback, as it trades at the end of every month, that’s effectively a 6 month lookback, give or take a few days out of 126, but the code is less complex this way.

Next, we have our actual backtest.

for(i in 1:(length(ep)-nMonths)) { # get returns subset and compute absolute momentum retSubset <- rets[c(ep[i]:ep[(i+nMonths)]),] retSubset <- retSubset[-1,] moms <- Return.cumulative(retSubset) # select top performing assets and subset returns for them highRankAssets <- rank(moms) >= 6 # top 5 assets posReturnAssets <- moms > 0 # positive momentum assets selectedAssets <- highRankAssets & posReturnAssets # intersection of the above selectedSubset <- retSubset[,selectedAssets] # subset returns slice if(sum(selectedAssets)==0) { # if no qualifying assets, zero weight for period wts <- xts(t(rep(0, ncol(retSubset))), order.by=last(index(retSubset))) colnames(wts) <- colnames(retSubset) invVolWts[[i]] <- minVolWts[[i]] <- hrpWts[[i]] <- wts } else if (sum(selectedAssets)==1) { # if one qualifying asset, invest fully into it wts <- xts(t(rep(0, ncol(retSubset))), order.by=last(index(retSubset))) colnames(wts) <- colnames(retSubset) wts[, which(selectedAssets==1)] <- 1 invVolWts[[i]] <- minVolWts[[i]] <- hrpWts[[i]] <- wts } else { # otherwise, use weighting algorithms cors <- cor(selectedSubset) # correlation volSubset <- tail(selectedSubset, nVol) # 20 day volatility vols <- StdDev(volSubset) covs <- t(vols) %*% vols * cors # minimum volatility using portfolio.optim from tseries minVolRets <- t(matrix(rep(1, sum(selectedAssets)))) minVolWt <- portfolio.optim(x=minVolRets, covmat = covs)$pw names(minVolWt) <- colnames(covs) minVolWt <- appendMissingAssets(minVolWt, colnames(retSubset), last(index(retSubset))) minVolWts[[i]] <- minVolWt # inverse volatility weights invVols <- 1/vols invVolWt <- invVols/sum(invVols) invNames <- colnames(invVolWt) invVolWt <- as.numeric(invVolWt) names(invVolWt) <- invNames invVolWt <- appendMissingAssets(invVolWt, colnames(retSubset), last(index(retSubset))) invVolWts[[i]] <- invVolWt # hrp weights clustOrder <- hclust(dist(cors), method = 'single')$order hrpWt <- getRecBipart(covs, clustOrder) names(hrpWt) <- colnames(covs) hrpWt <- appendMissingAssets(hrpWt, colnames(retSubset), last(index(retSubset))) hrpWts[[i]] <- hrpWt } }In a few sentences, this is what happens:

The algorithm takes a subset of the returns (the past six months at every month), and computes absolute momentum. It then ranks the ten absolute momentum calculations, and selects the intersection of the top 5, and those with a return greater than zero (so, a dual momentum calculation).

If no assets qualify, the algorithm invests in nothing. If there’s only one asset that qualifies, the algorithm invests in that one asset. If there are two or more qualifying assets, the algorithm computes a covariance matrix using 20 day volatility multiplied with a 126 day correlation matrix (that is, sd_20′ %*% sd_20 * (elementwise) cor_126. It then computes normalized inverse volatility weights using the volatility from the past 20 days, a minimum variance portfolio with the portfolio.optim function, and lastly, the hierarchical risk parity weights using the HRP code above from Marcos Lopez de Prado’s paper.

Lastly, the program puts together all of the weights, and adds a cash investment for any period without any investments.

invVolWts <- round(do.call(rbind, invVolWts), 3) # round for readability minVolWts <- round(do.call(rbind, minVolWts), 3) hrpWts <- round(do.call(rbind, hrpWts), 3) # allocate to cash if no allocation made due to all negative momentum assets invVolWts$cash <- 0; invVolWts$cash <- 1-rowSums(invVolWts) hrpWts$cash <- 0; hrpWts$cash <- 1-rowSums(hrpWts) minVolWts$cash <- 0; minVolWts$cash <- 1-rowSums(minVolWts) # cash value will be zero rets$cash <- 0 # compute backtest returns invVolRets <- Return.portfolio(R = rets, weights = invVolWts) minVolRets <- Return.portfolio(R = rets, weights = minVolWts) hrpRets <- Return.portfolio(R = rets, weights = hrpWts)Here are the results:

compare <- cbind(invVolRets, minVolRets, hrpRets) colnames(compare) <- c("invVol", "minVol", "HRP") charts.PerformanceSummary(compare) rbind(table.AnnualizedReturns(compare), maxDrawdown(compare), CalmarRatio(compare)) invVol minVol HRP Annualized Return 0.0872000 0.0724000 0.0792000 Annualized Std Dev 0.1208000 0.1025000 0.1136000 Annualized Sharpe (Rf=0%) 0.7221000 0.7067000 0.6968000 Worst Drawdown 0.1548801 0.1411368 0.1593287 Calmar Ratio 0.5629882 0.5131956 0.4968234In short, in the context of a small, carefully-selected and allegedly diversified (I’ll let Adam Butler speak for that one) universe dominated by the process of which assets to invest in as opposed to how much, the theoretical upsides of an algorithm which simultaneously exploits a covariance structure without needing to invert a covariance matrix can be lost.

However, this test (albeit from 2007 onwards, thanks to ETF inception dates combined with lookback burn-in) confirms what Adam Butler himself told me, which is that HRP hasn’t impressed him, and from this backtest, I can see why. However, in the context of dual momentum rank selection, I’m not convinced that any weighting scheme will realize much better performance than any other.

Thanks for reading.

NOTE: I am always interested in networking and hearing about full-time opportunities related to my skill set. My linkedIn profile can be found here.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – QuantStrat TradeR**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Canada Immigration: Where to settle in? Exercises

(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

Many people around the globe would like to immigrate to Canada as a Skilled Worker. These candidates must prove language proficiency in French and English, at least 2 years of working experience after graduation, and more. But, many immigrants that arrive in canada face unemployment rates sometimes even higher than in their original countries. So, the choice of the province to settle in is very important for those wishing to have economic success. With these exercises we will use R to analyze some immigration open data from Canadian government.

Answers to the exercises are available here.

**Exercise 1**

Download and read into R a data set from Labour force survey estimates (LFS), by immigrant status, age group, Canada, regions, provinces and Montreal, Toronto, Vancouver census metropolitan areas, 3-month moving average, unadjusted for seasonality.. Then take a look at it using head.

**Exercise 2**

Load libraries to manipulate data like dplyr. Turn Ref_Date into a Date type variable. (Tip: use as.Date)

**Exercise 3**

Transform the variable “Value” to a numeric format.

**Learn more**about Data manipulation in the online course R: Complete Data Analysis Solutions. In this course you will learn how to:

- Learn indepth how to work with dplyr
- Get a full introduction to the data.table package
- And much more

**Exercise 4**

Create a numeric vector that contains this column indices 1,2,4,5,6, and 9. And create a new data frame to store this data.

**Exercise 5**

Create a text vector that contains the province names. Create a new data frame to store only lines with valid province names.

**Exercise 6**

We are interested in comparing unemployment rate between people born in canada and recent immigrants. Exclude lines related to other kinds of status.

**Exercise 7**

Skilled worker immigrants usually need to have a university degree and at least 2 year of professional experience. So, exclude lines in the “agegroup” variable with “15 years and over”, and remove this column.

**Exercise 8**

Take a look at the summary information of the unemployment rate.

**Exercise 9**

Use the summarize this data grouping then by status and province. Please, take the mean of the unemployment rate as the table content.

**Exercise 10**

Use qplot from ggplot2 to create a plot and find the best province in terms of difference at unemployment rate between local people and recent immigrants.

- Cross Tabulation with Xtabs exercises
- As.Date() Exercises
- Building Shiny App Exercises (part-9)
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Who is the caretaker? Evidence-based probability estimation with the bnlearn package

(This article was first published on ** Revolutions**, and kindly contributed to R-bloggers)

*by Juan M. Lavista Ferres , Senior Director of Data Science at Microsoft *

In what was one of the most viral episodes of 2017, political science Professor Robert E Kelly was live on BBC World News talking about the South Korean president being forced out of office when both his kids decided to take an easy path to fame by showing up in their dad’s interview.

The video immediately went viral, and the BBC reported that within five days more than 100 million people from all over the world had watched it. Many people around the globe via Facebook, Twitter and reporters from reliable sources like Time.com thought the woman that went after the children was her nanny, when in fact, the woman in the video was Robert’s wife, Jung-a Kim, who is Korean.

The confusion over this episode caused a second viral wave calling out that people that thought she was the nanny should feel bad for being stereotypical.

We decided to embrace the uncertainty and take a data science based approach to estimating the chances that the person was the nanny or the mother of the child, based on the evidence people had from watching the news.

@David_Waddell What would that mean, please? Re-broadcasting it on BBC TV, or just here on Twitter? Is this kinda thing that goes 'viral' and gets weird?

— Robert E Kelly (@Robert_E_Kelly) March 10, 2017

What evidence did viewers of the video have?- the person is American Caucasian
- the person is professional
- there are two kids
- the caretaker is Asian

We then look for probability values for these statistics. (Given that Professor Kelly is American, all statistics are based on US data.)

- Probability (Asian Wife | Caucasian Husband) = 1% [Married couples in the United States in 2010]
- Probability of (Household has Nanny | husband is professional) = 3.5% [The Three Faces of Work-Family Conflict, page 9, Figure 3]
- Probability of (Asian | Nanny) = 6% [Caregiver Statistics: Demographics]
- Probability of (Stay at home mom) = 14% and Probability of (Stay at home mom | Asian Wife) = 30% [Stay-at-Home Mothers by Demographic Group ]

We define the following Bayesian network using the bnlearn package for R. We create the network using the model2network function and then we input the conditional probability tables (CPTs) that we know at each node.

library(bnlearn) set.seed(3) net <- model2network("[HusbandDemographics][HusbandIsProfessional][NannyDemographics][WifeDemographics|HusbandDemographics][StayAtHomeMom|HusbandIsProfessional:WifeDemographics][HouseholdHasNanny|StayAtHomeMom:HusbandIsProfessional][Caretaker|StayAtHomeMom:HouseholdHasNanny][CaretakerEthnicity|WifeDemographics:Caretaker:NannyDemographics]") plot(net)The last step is to fit the parameters of the Bayesian network conditional on its structure, the bn.fit function runs the EM algorithm to learn CPT for all different nodes in the above graph.

yn <- c("yes", "no") ca <- c("caucacian","other") ao <- c("asian","other") nw <- c("nanny","wife") cptHusbandDemographics <- matrix(c(0.85, 0.15), ncol=2, dimnames=list(NULL, ca)) #[1] cptHusbandIsProfessional <- matrix(c(0.81, 0.19), ncol=2, dimnames=list(NULL, yn)) #[2] cptNannyDemographics <- matrix(c(0.06, 0.94), ncol=2, dimnames=list(NULL, ao)) # [3] cptWifeDemographics <- matrix(c(0.01, 0.99, 0.33, 0.67), ncol=2, dimnames=list("WifeDemographics"=ao, "HusbandDemographics"=ca)) #[1] cptStayAtHomeMom <- c(0.3, 0.7, 0.14, 0.86, 0.125, 0.875, 0.125, 0.875) #[4] dim(cptStayAtHomeMom) <- c(2, 2, 2) dimnames(cptStayAtHomeMom) <- list("StayAtHomeMom"=yn, "WifeDemographics"=ao, "HusbandIsProfessional"=yn) cptHouseholdHasNanny <- c(0.01, 0.99, 0.035, 0.965, 0.00134, 0.99866, 0.00134, 0.99866) #[5] dim(cptHouseholdHasNanny) <- c(2, 2, 2) dimnames(cptHouseholdHasNanny) <- list("HouseholdHasNanny"=yn, "StayAtHomeMom"=yn, "HusbandIsProfessional"=yn) cptCaretaker <- c(0.5, 0.5, 0.999, 0.001, 0.01, 0.99, 0.001, 0.999) dim(cptCaretaker) <- c(2, 2, 2) dimnames(cptCaretaker) <- list("Caretaker"=nw, "StayAtHomeMom"=yn, "HouseholdHasNanny"=yn) cptCaretakerEthnicity <- c(0.99, 0.01, 0.99, 0.01, 0.99, 0.01, 0.01, 0.99, 0.01,0.99,0.99,0.01,0.01,0.99,0.01,0.99) dim(cptCaretakerEthnicity) <- c(2, 2, 2,2) dimnames(cptCaretakerEthnicity) <- list("CaretakerEthnicity"=ao,"Caretaker"=nw, "WifeDemographics"=ao ,"NannyDemographics"=ao) net.disc <- custom.fit(net, dist=list(HusbandDemographics=cptHusbandDemographics, HusbandIsProfessional=cptHusbandIsProfessional, WifeDemographics=cptWifeDemographics, StayAtHomeMom=cptStayAtHomeMom, HouseholdHasNanny=cptHouseholdHasNanny, Caretaker=cptCaretaker, NannyDemographics=cptNannyDemographics,CaretakerEthnicity=cptCaretakerEthnicity))Once we have the model, we can query the network using cpquery to estimate the probability of the events and calculate the probability that the person is the nanny or the wife based on the evidence we have (husband is Caucasian and professional, caretaker is Asian). Based on this evidence the output is that the probability that she is the wife is **90%** vs. 10% that she is the nanny.

In conclusion, if you thought the woman in the video was the nanny, you may need to review your priors!

The bnlearn package is available on CRAN. You can find the R code behind this post here on GitHub or here as a Jupyter Notebook.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Revolutions**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Managing Spark data handles in R

(This article was first published on ** R – Win-Vector Blog**, and kindly contributed to R-bloggers)

When working with big data with R (say, using Spark and sparklyr) we have found it very convenient to keep data handles in a neat list or data_frame.

Please read on for our handy hints on keeping your data handles neat.

When using R to work over a big data system (such as Spark) much of your work is over "data handles" and not actual data (data handles are objects that control access to remote data).

Data handles are a lot like sockets or file-handles in that they can not be safely serialized and restored (i.e., you can not save them into a .RDS file and then restore them into another session). This means when you are starting or re-starting a project you must "ready" all of your data references. Your projects will be much easier to manage and document if you load your references using the methods we show below.

Let’s set-up our example Spark cluster:

library("sparklyr") #packageVersion('sparklyr') suppressPackageStartupMessages(library("dplyr")) #packageVersion('dplyr') suppressPackageStartupMessages(library("tidyr")) # Please see the following video for installation help # https://youtu.be/qnINvPqcRvE # spark_install(version = "2.0.2") # set up a local "practice" Spark instance sc <- spark_connect(master = "local", version = "2.0.2") #print(sc)Data is much easier to manage than code, and much easier to compute over. So the more information you can keep as pure data the better off you will be. In this case we are loading the chosen names and paths of parquet data we wish to work with from an external file that is easy for the user to edit.

# Read user's specification of files and paths. userSpecification <- read.csv('tableCollection.csv', header = TRUE, strip.white = TRUE, stringsAsFactors = FALSE) print(userSpecification) ## tableName tablePath ## 1 data_01 data_01 ## 2 data_02 data_02 ## 3 data_03 data_03We can now read these parquet files (usually stored in Hadoop) into our Spark environment as follows.

readParquets <- function(userSpecification) { userSpecification <- as_data_frame(userSpecification) userSpecification$handle <- lapply( seq_len(nrow(userSpecification)), function(i) { spark_read_parquet(sc, name = userSpecification$tableName[[i]], path = userSpecification$tablePath[[i]]) } ) userSpecification } tableCollection <- readParquets(userSpecification) print(tableCollection) ## # A tibble: 3 x 3 ## tableName tablePath handle ## ## 1 data_01 data_01 ## 2 data_02 data_02 ## 3 data_03 data_03A data.frame is a great place to keep what you know about your Spark handles in one place. Let’s add some details to our Spark handles.

addDetails <- function(tableCollection) { tableCollection <- as_data_frame(tableCollection) # get the references tableCollection$handle <- lapply(tableCollection$tableName, function(tableNamei) { dplyr::tbl(sc, tableNamei) }) # and tableNames to handles for convenience # and printing names(tableCollection$handle) <- tableCollection$tableName # add in some details (note: nrow can be expensive) tableCollection$nrow <- vapply(tableCollection$handle, nrow, numeric(1)) tableCollection$ncol <- vapply(tableCollection$handle, ncol, numeric(1)) tableCollection } tableCollection <- addDetails(userSpecification) # convenient printing print(tableCollection) ## # A tibble: 3 x 5 ## tableName tablePath handle nrow ncol ## ## 1 data_01 data_01 10 1 ## 2 data_02 data_02 10 2 ## 3 data_03 data_03 10 3 # look at the top of each table (also forces # evaluation!). lapply(tableCollection$handle, head) ## $data_01 ## Source: query [6 x 1] ## Database: spark connection master=local[4] app=sparklyr local=TRUE ## ## # A tibble: 6 x 1 ## a_01 ## ## 1 0.8274947 ## 2 0.2876151 ## 3 0.6638404 ## 4 0.1918336 ## 5 0.9111187 ## 6 0.8802026 ## ## $data_02 ## Source: query [6 x 2] ## Database: spark connection master=local[4] app=sparklyr local=TRUE ## ## # A tibble: 6 x 2 ## a_02 b_02 ## ## 1 0.3937457 0.34936496 ## 2 0.0195079 0.74376380 ## 3 0.9760512 0.00261368 ## 4 0.4388773 0.70325800 ## 5 0.9747534 0.40327283 ## 6 0.6054003 0.53224218 ## ## $data_03 ## Source: query [6 x 3] ## Database: spark connection master=local[4] app=sparklyr local=TRUE ## ## # A tibble: 6 x 3 ## a_03 b_03 c_03 ## ## 1 0.59512263 0.2615939 0.592753768 ## 2 0.72292799 0.7287428 0.003926143 ## 3 0.51846687 0.3641869 0.874463146 ## 4 0.01174093 0.9648346 0.177722575 ## 5 0.86250126 0.3891915 0.857614579 ## 6 0.33082723 0.2633013 0.233822140A particularly slick trick is to expand the columns column into a taller table that allows us to quickly identify which columns are in which tables.

columnDictionary <- function(tableCollection) { tableCollection$columns <- lapply(tableCollection$handle, colnames) columnMap <- tableCollection %>% select(tableName, columns) %>% unnest(columns) columnMap } columnMap <- columnDictionary(tableCollection) print(columnMap) ## # A tibble: 6 x 2 ## tableName columns ## ## 1 data_01 a_01 ## 2 data_02 a_02 ## 3 data_02 b_02 ## 4 data_03 a_03 ## 5 data_03 b_03 ## 6 data_03 c_03The idea is: place all of the above functions into a shared script or package, and then use them to organize loading your Spark data references. With this practice you will have much less "spaghetti code", better document intent, and have a versatile workflow.

The principles we are using include:

- Keep configuration out of code (i.e., maintain the file list in a spreadsheet). This makes working with others much easier.
- Treat configuration as data (i.e., make sure the configuration is a nice regular table so that you can use R tools such as tidyr::unnest() to work with it).

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Win-Vector Blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Versioning R model objects in SQL Server

(This article was first published on ** R – Locke Data**, and kindly contributed to R-bloggers)

If you build a model and never update it you’re missing a trick. Behaviours change so your model will tend to perform worse over time. You’ve got to regularly refresh it, whether that’s adjusting the existing model to fit the latest data (recalibration) or building a whole new model (retraining), but this means you’ve got new versions of your model that you have to handle. You need to think about your methodology for versioning R model objects, ideally before you lose any versions.

You could store models with ye olde YYYYMMDD style of versioning but that means regularly changing your code to use the latest model version. I’m too lazy for that!

If we’re storing our R model objects in SQL Server then we can utilise another SQL Server capability, temporal tables, to take the pain out of versioning and make it super simple.

Temporal tables will track changes automatically so you would overwrite the previous model with the new one and it would keep a copy of the old one automagically in a history table. You get to always use the latest version via the main table but you can then write temporal queries to extract any version of the model that’s ever been implemented. Super neat!

For some of you, if you’re not interested in the technical details you can drop off now with the knowledge that you can store your models in a non-destructive but easy to use way in SQL Server if you need to.

If you want to see how it’s done, read on!

The technical infoBelow is a working example of how to do versioning R model objects in SQL Server:

- define a versioned model table
- write a model into the model table
- create a new model and overwrite the old
- create a prediction using the latest model on the fly

Note this doesn’t tell you what changed, just that something did change. To identify model changes you will need to load up the models and compare their coefficients or other attributes to identify what exactly changed.

SQL objects A temporal tableA normal table for storing a model might look like

CREATE TABLE [companyModels] ( [id] int NOT NULL PRIMARY KEY CLUSTERED , [name] varchar(200) NOT NULL , [modelObj] varbinary(max) , CONSTRAINT unique_modelname UNIQUE ([name]))If we’re turning it into a temporal table we need to add some extra columns, but we won’t worry about these extra columns day-to-day.

CREATE TABLE [companyModels] ( [id] int NOT NULL PRIMARY KEY CLUSTERED , [name] varchar(200) NOT NULL , [modelObj] varbinary(max) , [ValidFrom] datetime2 (2) GENERATED ALWAYS AS ROW START , [ValidTo] datetime2 (2) GENERATED ALWAYS AS ROW END , PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo) , CONSTRAINT unique_modelname UNIQUE ([name])) WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.companyModelsHistory)); A stored procedureAs we have the ID and Valid* columns in our table, we can’t use RODBCs simple table functions and the ID column doesn’t play nicely with RevoScaleR’s rxWriteObject() function as that wants to insert a NULL. It’s not all bad though because we can get around it by using a stored procedure.

This stored procedure will perform an INSERT if it can’t find a model by the given name, and will perform an UPDATE if it does find a match.

CREATE PROCEDURE modelUpsert @modelname varchar(200) , @modelobj varbinary(max) AS WITH MySource as ( select @modelname as modelName, @modelobj as modelObj ) MERGE companymodels AS MyTarget USING MySource ON MySource.modelname = MyTarget.modelname WHEN MATCHED THEN UPDATE SET modelObj = MySource.modelObj WHEN NOT MATCHED THEN INSERT ( modelname, modelObj ) VALUES ( MySource.modelname, MySource.modelObj ); R Build a modelWe need a model to save!

To be able to store the model in the database we need to use the serialize() function to turn it into some raw character thingies and then combine them together with paste0() so they go in the same row.

myModelV1<-lm(column1~column2, someData) preppedDF<-data.frame(modelname="myModel", modelobj=paste0(serialize(myModelV1,NULL) ,collapse = "") ) Call the stored procedureWe need RODBC and RODBCext for executing our stored procedure in our database.

library(RODBC) library(RODBCext) dbstring<-'Driver={ODBC Driver 13 for SQL Server};Server=XXX;Database=XXX;Uid=XXX;Pwd=XXX' dbconn<-RODBC::odbcDriverConnect(dbstring) RODBCext::sqlExecute(dbconn, "exec modelUpsert @modelname=? , @modelobj=?", data = preppedDF)This will now have our model in our database table.

RODBC::sqlQuery(dbconn, "select * from companymodels") # 1 row RODBC::sqlQuery(dbconn, "select * from companymodelshistory") # 0 rowYou should get one row in our main table and no rows in ouor history table.

Rinse and repeatIf we make a change to our model and then push the new model with the same name, we’ll still get one row in our core table, but now we’ll get a row in our history table that contains our v1 model.

myModelV2<-lm(column1~column2, someData[-1,]) preppedDF<-data.frame(modelname="myModel", modelobj=paste0(serialize(myModelV2,NULL) ,collapse = "") ) RODBCext::sqlExecute(dbconn, "exec modelUpsert @modelname=? , @modelobj=?", data = preppedDF) RODBC::sqlQuery(dbconn, "select * from companymodels") # 1 row RODBC::sqlQuery(dbconn, "select * from companymodelshistory") # 1 row Using our model in SQLIf we want to use our model for predictions in SQL, we can now retrieve it from the table along with some input data and get back our input data plus a prediction.

DECLARE @mymodel VARBINARY(MAX)=(SELECT modelobj FROM companymodels WHERE modelname='mymodel' ); EXEC sp_execute_external_script @language = N'R', @script = N' OutputDataSet<-cbind(InputDataSet, predict(unserialize(as.raw(model)), InputDataSet) ) ', @input_data_1 = N'SELECT 42 as column2', @params = N'@model varbinary(max)', @model = @mymodelThe post Versioning R model objects in SQL Server appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Locke Data**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### RQGIS release 1.0.0

(This article was first published on ** R – jannesm**, and kindly contributed to R-bloggers)

Today we are proud to announce a major release of the RQGIS package providing an interface between R and QGIS. We have completeley rewritten RQGIS by now using reticulate to establish a tunnel to the Python QGIS API. To make RQGIS even more user-friendly, we have implemented, among others, following features:

- set_env now caches its output, so you have to call it only once
- find_algorithms now supports regular expressions
- run_qgis now accepts R named arguments as input (see example below). The load_output argument is now logical. If TRUE all specified output files will be loaded into R.
- RQGIS now supports simple features

For a detailed overview of all changes, please refer to the release news.

Let’s use RQGIS for calculating the slope and the aspect of a digital elevation model. First of all, we have to attach RQGIS and a digital elevation model (dem) to the global environment:

library("RQGIS") data("dem", package = "RQGIS")Next, we need to find all paths necessary to run successfully the Python QGIS API. Please note, that the output of set_env will be cached and reused in subsequent function calls. Note also, that set_env is always faster when you indicate the path to your QGIS installation, in my case e.g., set_env(root = "C:/OSGeo4W64"). Secondly, we open a QGIS Python application with open_app.

set_env() open_app()Now, that we have established a tunnel to the QGIS Python API, we are ready for some geoprocessing. First, we need a geoalgorithm, that computes the slope and the aspect of a digital elevation model. To find the name of such a geoalgorithm, we use a regular expression which searches for the terms slope and aspect in all available QGIS geoalgorithms.

find_algorithms("slope(.+)?aspect") ## [1] "r.slope.aspect - Generates raster layers of slope, aspect, curvatures and partial derivatives from a elevation raster layer.--->grass7:r.slope.aspect" ## [2] "Slope, aspect, curvature----------------------------->saga:slopeaspectcurvature" ## [3] "r.slope.aspect - Generates raster layers of slope, aspect, curvatures and partial derivatives from a elevation raster layer.--->grass:r.slope.aspect"We will use grass7:r.slope.aspect. To retrieve its function parameters and arguments you can run get_usage("grass7:r.slope.aspect"). Use get_args_man to collect the parameter-argument list including all default values:

params <- get_args_man("grass7:r.slope.aspect")As with previous RQGIS releases, you can still modify the parameter-argument list and submit it to run_qgis:

params$elevation <- dem params$slope <- "slope.tif" params$aspect <- "aspect.tif" run_qgis(alg = "grass7:r.slope.aspect", params = params)But now you can also use R named arguments in run_qgis, i.e. you can specify the geoalgorithms parameters directly in run_qgis (adapted from package rgrass7). Here, we specify the input- and the output-rasters. For all other parameters, default values will automatically be used. For more information on the R named arguments, please refer also to the documentation by running ?run_qgis and/or ?pass_args.

out <- run_qgis(alg = "grass7:r.slope.aspect", elevation = dem, slope = "slope.tif", aspect = "aspect.tif", load_output = TRUE) ## $slope ## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\RtmpmsSSA4/slope.tif" ## ## $aspect ## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\RtmpmsSSA4/aspect.tif" ## ## $dxy ## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\processing5bb46293bfb243f092a57ce9cf50348b\\ac7b8544e8194dd1a1b8710e6091f1f3\\dxy.tif" ## ## $dxx ## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\processing5bb46293bfb243f092a57ce9cf50348b\\1576d9dc93434a578b3aeb16bedb17a2\\dxx.tif" ## ## $tcurvature ## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\processing5bb46293bfb243f092a57ce9cf50348b\\afea27676cc049faaa8526a486f13f70\\tcurvature.tif" ## ## $dx ## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\processing5bb46293bfb243f092a57ce9cf50348b\\2d71fd26b1aa4868a5cdfd0d7ad47a0c\\dx.tif" ## ## $dy ## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\processing5bb46293bfb243f092a57ce9cf50348b\\458f38f6c71947d3a37532e4bc6a6a53\\dy.tif" ## ## $pcurvature ## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\processing5bb46293bfb243f092a57ce9cf50348b\\80ad6fa1843d4d3a92ed0b4c6a39dcfa\\pcurvature.tif" ## ## $dyy ## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\processing5bb46293bfb243f092a57ce9cf50348b\\6c52d235c8614a719954f1f744e3fef1\\dyy.tif"Setting load_output to TRUE automatically loads the resulting QGIS output back into R. Since we have specified two ouput files, the output will be loaded into R as a list (here named out) with two elements: a slope and an aspect raster. However, in the background, QGIS calculates all terrain attributes and derivatives provided by grass7:r.slope.aspect, and saves them to a temporary location. Of course, if you wish you can still access these layers from there.

Before running RQGIS, you need to install third-party software. We wrote a package vignette, which guides you through the installation process of QGIS, GRASS and SAGA on various platforms. To access the vignette, please run:

vignette("install_guide", package = "RQGIS")For more information on RQGIS and examples how to use RQGIS, please refer to its github page and my blog.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – jannesm**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### rtdists 0.7-2: response time distributions now with Rcpp and faster

It took us quite a while but we have finally released a new version of rtdists to CRAN which provides a few significant improvements. As a reminder, rtdists

[p]rovides response time distributions (density/PDF, distribution function/CDF, quantile function, and random generation): (a) Ratcliff diffusion model based on C code by Andreas and Jochen Voss and (b) linear ballistic accumulator (LBA) with different distributions underlying the drift rate.

The main reason it took us relatively long to push the new version was that we had a problem with the C code for the diffusion model that we needed to sort out first. Specifically, the CDF (i.e., pdiffusion) in versions prior to 0.5-2 did not produce correct results in many cases (one consequence of this is that the model predictions given in the previous blog post are wrong). As a temporary fix, we resorted to the correct but slow numerical integration of the PDF (i.e., ddiffusion) to obtain the CDF in version 0.5-2 and later. Importantly, it appears as if the error was not present in fastdm which is the source of the C code we use. Matthew Gretton carefully investigated the original C code, changed it such that it connects to R via Rcpp, and realized that there are two different variants of the CDF, a fast variant and a precise variant. Up to this point we had only used the fast variant and, as it turns out, this was responsible for our incorrect results. We now per default use the precise variant (which only seems to be marginally slower) as it produces the correct results for all cases we have tested (and we have tested quite a few).

In addition to a few more minor changes (see NEWS for full list), we made two more noteworthy changes. First, all diffusion functions as well as rLBA received a major performance update, mainly in situations with trial-wise parameters. Now it should almost always be fastest to call the diffusion functions (e.g., ddiffusion) only once with vectorized parameters instead of calling it several times for different sets of parameters. The speed up with the new version depends on the number of unique parameter sets, but even with only a few different sets the speed up should be clearly noticeable. For completely trial-wise parameters the speed-up should be quite dramatic.

Second, I also updated the vignette which now uses the tidyverse in, I believe, a somewhat more reasonable manner. Specifically, it now is built on nested data (via tidyr::nest) and purrr::map instead of relying heavily on dplyr::do. The problem I had with dplyr::do is that it often leads to somewhat ugly syntax. The changes in the vignette are mainly due to me reading Chapter 25 in the great R for Data Science book by Wickham and Gorlemund. However, I still prefer lattice over ggplot2.

Example AnalysisTo show the now correct behavior of the diffusion CDF let me repeat the example from the last post. As a reminder, we somewhat randomly pick one participant from the speed_acc data set and fit both diffusion model and LBA to the data.

require(rtdists) # Exp. 1; Wagenmakers, Ratcliff, Gomez, & McKoon (2008, JML) data(speed_acc) # remove excluded trials: speed_acc <- droplevels(speed_acc[!speed_acc$censor,]) # create numeric response variable where 1 is an error and 2 a correct response: speed_acc$corr <- with(speed_acc, as.numeric(stim_cat == response))+1 # select data from participant 11, accuracy condition, non-word trials only p11 <- speed_acc[speed_acc$id == 11 & speed_acc$condition == "accuracy" & speed_acc$stim_cat == "nonword",] prop.table(table(p11$corr)) # 1 2 # 0.04166667 0.95833333 ll_lba <- function(pars, rt, response) { d <- dLBA(rt = rt, response = response, A = pars["A"], b = pars["A"]+pars["b"], t0 = pars["t0"], mean_v = pars[c("v1", "v2")], sd_v = c(1, pars["sv"]), silent=TRUE) if (any(d == 0)) return(1e6) else return(-sum(log(d))) } start <- c(runif(3, 0.5, 3), runif(2, 0, 0.2), runif(1)) names(start) <- c("A", "v1", "v2", "b", "t0", "sv") p11_norm <- nlminb(start, ll_lba, lower = c(0, -Inf, 0, 0, 0, 0), rt=p11$rt, response=p11$corr) p11_norm[1:3] # $par # A v1 v2 b t0 sv # 0.1182940 -2.7409230 1.0449963 0.4513604 0.1243441 0.2609968 # # $objective # [1] -211.4202 # # $convergence # [1] 0 ll_diffusion <- function(pars, rt, response) { densities <- ddiffusion(rt, response=response, a=pars["a"], v=pars["v"], t0=pars["t0"], sz=pars["sz"], st0=pars["st0"], sv=pars["sv"]) if (any(densities == 0)) return(1e6) return(-sum(log(densities))) } start <- c(runif(2, 0.5, 3), 0.1, runif(3, 0, 0.5)) names(start) <- c("a", "v", "t0", "sz", "st0", "sv") p11_diff <- nlminb(start, ll_diffusion, lower = 0, rt=p11$rt, response=p11$corr) p11_diff[1:3] # $par # a v t0 sz st0 sv # 1.3206011 3.2727202 0.3385602 0.4621645 0.2017950 1.0551706 # # $objective # [1] -207.5487 # # $convergence # [1] 0As is common, we pass the negative summed log-likelihood to the optimization algorithm (here trusty nlminb) and hence lower values of objective indicate a better fit. Results show that the LBA provides a somewhat better account. The interesting question is whether this somewhat better fit translates into a visibly better fit when comparing observed and predicted quantiles.

# quantiles: q <- c(0.1, 0.3, 0.5, 0.7, 0.9) ## observed data: (p11_q_c <- quantile(p11[p11$corr == 2, "rt"], probs = q)) # 10% 30% 50% 70% 90% # 0.4900 0.5557 0.6060 0.6773 0.8231 (p11_q_e <- quantile(p11[p11$corr == 1, "rt"], probs = q)) # 10% 30% 50% 70% 90% # 0.4908 0.5391 0.5905 0.6413 1.0653 ### LBA: # predicted error rate (pred_prop_correct_lba <- pLBA(Inf, 2, A = p11_norm$par["A"], b = p11_norm$par["A"]+p11_norm$par["b"], t0 = p11_norm$par["t0"], mean_v = c(p11_norm$par["v1"], p11_norm$par["v2"]), sd_v = c(1, p11_norm$par["sv"]))) # [1] 0.9581342 (pred_correct_lba <- qLBA(q*pred_prop_correct_lba, response = 2, A = p11_norm$par["A"], b = p11_norm$par["A"]+p11_norm$par["b"], t0 = p11_norm$par["t0"], mean_v = c(p11_norm$par["v1"], p11_norm$par["v2"]), sd_v = c(1, p11_norm$par["sv"]))) # [1] 0.4871710 0.5510265 0.6081855 0.6809796 0.8301286 (pred_error_lba <- qLBA(q*(1-pred_prop_correct_lba), response = 1, A = p11_norm$par["A"], b = p11_norm$par["A"]+p11_norm$par["b"], t0 = p11_norm$par["t0"], mean_v = c(p11_norm$par["v1"], p11_norm$par["v2"]), sd_v = c(1, p11_norm$par["sv"]))) # [1] 0.4684374 0.5529575 0.6273737 0.7233961 0.9314820 ### diffusion: # same result as when using Inf, but faster: (pred_prop_correct_diffusion <- pdiffusion(rt = 20, response = "upper", a=p11_diff$par["a"], v=p11_diff$par["v"], t0=p11_diff$par["t0"], sz=p11_diff$par["sz"], st0=p11_diff$par["st0"], sv=p11_diff$par["sv"])) # [1] 0.964723 (pred_correct_diffusion <- qdiffusion(q*pred_prop_correct_diffusion, response = "upper", a=p11_diff$par["a"], v=p11_diff$par["v"], t0=p11_diff$par["t0"], sz=p11_diff$par["sz"], st0=p11_diff$par["st0"], sv=p11_diff$par["sv"])) # [1] 0.4748271 0.5489903 0.6081182 0.6821927 0.8444566 (pred_error_diffusion <- qdiffusion(q*(1-pred_prop_correct_diffusion), response = "lower", a=p11_diff$par["a"], v=p11_diff$par["v"], t0=p11_diff$par["t0"], sz=p11_diff$par["sz"], st0=p11_diff$par["st0"], sv=p11_diff$par["sv"])) # [1] 0.4776565 0.5598018 0.6305120 0.7336275 0.9770047 ### plot predictions par(mfrow=c(1,2), cex=1.2) plot(p11_q_c, q*prop.table(table(p11$corr))[2], pch = 2, ylim=c(0, 1), xlim = c(0.4, 1.3), ylab = "Cumulative Probability", xlab = "Response Time (sec)", main = "LBA") points(p11_q_e, q*prop.table(table(p11$corr))[1], pch = 2) lines(pred_correct_lba, q*pred_prop_correct_lba, type = "b") lines(pred_error_lba, q*(1-pred_prop_correct_lba), type = "b") legend("right", legend = c("data", "predictions"), pch = c(2, 1), lty = c(0, 1)) plot(p11_q_c, q*prop.table(table(p11$corr))[2], pch = 2, ylim=c(0, 1), xlim = c(0.4, 1.3), ylab = "Cumulative Probability", xlab = "Response Time (sec)", main = "Diffusion") points(p11_q_e, q*prop.table(table(p11$corr))[1], pch = 2) lines(pred_correct_diffusion, q*pred_prop_correct_diffusion, type = "b") lines(pred_error_diffusion, q*(1-pred_prop_correct_diffusion), type = "b")The fit plot compares observed quantiles (as triangles) with predicted quantiles (circles connected by lines). Here we decided to plot the 10%, 30%, 50%, 70% and 90% quantiles. In each plot, the x-axis shows RTs and the y-axis cumulative probabilities. From this it follows that the upper line and points correspond to the correct trials (which are common) and the lower line and points to the incorrect trials (which are uncommon). For both models the fit looks pretty good especially for the correct trials. However, it appears the LBA does a slightly better job in predicting very fast and slow trials here, which may be responsible for the better fit in terms of summed log-likelihood. In contrast, the diffusion model seems somewhat better in predicting the long tail of the error trials.

Checking the CDFFinally, we can also check whether the analytical CDF does in fact correspond to the empirical CDF of the data. For this we can compare the analytical CDF function pdiffusion to the empirical CDF obtained from random deviates. One thing one needs to be careful about is that pdiffusion provides the ‘defective’ CDF that only approaches one if one adds the CDF for both response boundaries. Consequently, to compare the empirical CDF for one response with the analytical CDF, we need to scale the latter to also go from 0 to 1 (simply by dividing it by its maximal value). Here we will use the parameters values obtained in the previous fit.

rand_rts <- rdiffusion(1e5, a=p11_diff$par["a"], v=p11_diff$par["v"], t0=p11_diff$par["t0"], sz=p11_diff$par["sz"], st0=p11_diff$par["st0"], sv=p11_diff$par["sv"]) plot(ecdf(rand_rts[rand_rts$response == "upper","rt"])) normalised_pdiffusion = function(rt,...) pdiffusion(rt,...)/pdiffusion(rt=Inf,...) curve(normalised_pdiffusion(x, response = "upper", a=p11_diff$par["a"], v=p11_diff$par["v"], t0=p11_diff$par["t0"], sz=p11_diff$par["sz"], st0=p11_diff$par["st0"], sv=p11_diff$par["sv"]), add=TRUE, col = "yellow", lty = 2)This figure shows that the analytical CDF (in yellow) lies perfectly on top the empirical CDF (in black). If it does not for you, you still use an old version of rtdists. We have also added a series of unit tests to rtdists that compare the empirical CDF to the analytical CDF (using ks.test) for a variety of parameter values to catch if such a problem ever occurs again.

References 881472{QQ2QRTIF};{4MC596NZ};5FS47DBR

apa

default

ASC

no

475

http://singmann.org/wp-content/plugins/zotpress/

### MapReduce in Two Modern Paintings

(This article was first published on ** novyden**, and kindly contributed to R-bloggers)

Two years ago we had a rare family outing to the Dallas Museum of Art (my son is teenager and he’s into sport after all). It had an excellent exhibition of modern art and DMA allowed taking pictures. Two hours and dozen of pictures later my weekend was over but thanks to Google Photos I just stumbled upon those pictures again. Suddenly, I realized that two paintings I captured make up an illustration of one of the most important concepts in big data.

There are multiple papers, tutorials and web pages about MapReduce and to truly understand and use it one should study at least a few thoroughly. And there are many illustrations of MapReduce structure and architecture out there.

But the power of art can express more with less with just two paintings. First, we have work by Erró *Foodscape*, 1964:

*Foodscape*, 1964
Oil on canvas

It illustrates variety, richness, potential of insight if consumed properly, and of course, scale. The painting is boundless emphasizing no end to the table surface in all 4 directions. If we zoom in (you can find better quality image here) it contains many types of food and drinks, packaging, presentations, varying in colors, texture and origin. All these represent big data so much better than any kind of flowchart diagram.

The 2d and final painting is by Wayne Thiebaud *Salads, Sandwiches, and Desserts*, 1962:

*Salads, Sandwiches, and Desserts*, 1962
Oil on canva

Should we think of how MapReduce works this seemingly infinite table (also fittingly resembling conveyor line) looks like a result of split-apply-combine executed on *Foodscape* items. Indeed, each vertical group is combination of the same type of finished and plated food ready to serve and combined into variably sized sets (find better quality image here).

And again, I’d like to remind of importance of taking your kids to museum.

To **leave a comment** for the author, please follow the link and comment on their blog: ** novyden**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Regular Expression & Treemaps to Visualize Emergency Department Visits

(This article was first published on ** R – Incidental Ideas**, and kindly contributed to R-bloggers)

It’s been a while since my last post on some TB WHO data. A lot has happened since then, including the opportunity to attend the Open Data Science Conference (ODSC) East held in Boston, MA. Over a two day period I had the opportunity to listen to a number of leaders in various industries and fields. It… Continue Reading →

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Incidental Ideas**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Shiny Application Layouts Exercises (Part-10)

(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

**Shiny Application Layouts – Shiny Themes**

In the last part of the series we will check out which themes are available in the shinythemes package. More specifically we will create a demo app with a selector from which you can choose the theme you want.

This part can be useful for you in two ways.

First of all, you can see different ways to enhance the appearance and the utility of your shiny app.

Secondly you can make a revision on what you learnt in “Building Shiny App” series and in this series, as we will build basic shiny staff in order to present it in the proper way.

Read the examples below to understand the logic of what we are going to do and then test yous skills with the exercise set we prepared for you. Lets begin!

Answers to the exercises are available here.

**Create the App.**

In order to see how themes affect the application components ,we have seen until now, we need to create them.

As we are going to use tags here, a good idea is to use taglList() in order to create our app. TagList() is ideal for users who wish to create their own sets of tags.

Let’s see an example of the skeleton of our application and then create our own step by step before applying to it the theme selector.

#ui.R

tagList(

navbarPage(

"Title",

tabPanel("Navbar 1",

sidebarPanel(

sliderInput("slider","Slider input:", 1, 100, 50),

tags$h5("ActionButton:"),

actionButton("action", "Action")

),

mainPanel(

tabsetPanel(

tabPanel("Table",

h4("Table"),

tableOutput("table")),

tabPanel("VText",

h4("Verbatim Text"),

verbatimTextOutput("vtxt")),

tabPanel("Header",

h1("Header 1"),

h2("Header 2"))

)

)

),

tabPanel("Navbar 2")

))

#server.R

function(input, output) {

output$vtxt <- renderText({

paste(input$slider)

})

output$table <- renderTable({

iris

})

}

**Exercise 1**

Create a UI using tag List with the form of a Navbar Page and name it “Themes”. **HINT**: Use tagList() and navbarPage().

**Exercise 2**

Your Navbar Page should have two tab Panels named “Navbar 1” and “Navbar 2”. **HINT**: Use tabPanel().

**Exercise 3**

In “Navbar 1” add sidebar and main panel. **HINT**: Use sidebarPanel() and mainPanel().

**Exercise 4**

Create three tab panels inside the main panel. Name them “Table”, “Text” and “Header” respectively. **HINT**: Use tabsetPanel() and tabPanel().

**Learn more**about Shiny in the online course R Shiny Interactive Web Apps – Next Level Data Visualization. In this course you will learn how to create advanced Shiny web apps; embed video, pdfs and images; add focus and zooming tools; and many other functionalities (30 lectures, 3hrs.).

**Exercise 5**

In the tab panel “Table” add a table of the iris dataset. Name it “Iris”. **HINT** : Use tableOutput() and renderTable({}).

**Exercise 6**

In the tab panel “Text” add verbatim Text nad name it “Vtext”. **HINT**: Use verbatimTextOutput().

**Exercise 7**

Add a slider and an actionbutton in the sidebar. Connect the slider with the “Text” tab panel. Use tags to name the actionbutton. **HINT**: Use sliderInput(),actionButton(), renderText() and tags.

**Exercise 8**

In the tab panel “Header” add two headers with size h1 and h2 respectively.

**Shinythemes**

**Exercise 9**

Install and load the package shinythemes.

**Exercise 10**

Place themeSelector() inside your tagList() and then use different shiny themes to see how they affect all the components you created so far.

Related exercise sets:- Building Shiny App exercises part 4
- Shiny Application Layouts exercises part 3
- Building Shiny App exercises part 1
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory

**leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### CI for difference between independent R square coefficients

(This article was first published on ** R code – Serious Stats**, and kindly contributed to R-bloggers)

In an earlier blog post I provided R code for a CI of a difference in R square for dependent and non-dependent correlations. This was based on a paper by Zou (2007). That paper also provides a method for calculating the CI of a difference in independent R square coefficients based on the limits of the CI for a single R square coefficient. I’ve also been experimenting with knittr and unfortunately haven’t yet worked out how to merge the R markdown output with my blog template, so I’m linking to RPubs for convenience.

You can find the function and a few more details here.

Filed under: comparing correlations, confidence intervals, R code, serious stats Tagged: confidence intervals, R, R code, software, statistics

To **leave a comment** for the author, please follow the link and comment on their blog: ** R code – Serious Stats**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Love is all around: Popular words in pop hits

(This article was first published on ** Revolutions**, and kindly contributed to R-bloggers)

Data scientist Giora Simchoni recently published a fantastic analysis of the history of pop songs on the Billboard Hot 100 using the R language. Giora used the rvest package in R to scrape data from the Ultimate Music Database site for the 350,000 chart entries (and 35,000 unique songs) since 1940, and used those data to create and visualize several measures of song popularity over time.

A novel measure that Giora calculates is "area under the song curve": the sum of all the ranks above 100 for every week the song is in the Hot 100. By that measure, the most popular (and also longest-charting) song of all time is Radioactive by Imagine Dragons:

It's turns out that calculating this "song integral" is pretty simple in R when you use the tidyverse:

calculateSongIntegral <- function(positions) { sum(100 - positions) } billboard %>% filter(EntryDate >= date_decimal(1960)) %>% group_by(Artist, Title) %>% summarise(positions = list(ThisWeekPosition)) %>% mutate(integral = map_dbl(positions, calculateSongIntegral)) %>% group_by(Artist, Title) %>% tally(integral) %>% arrange(-n)Another fascinating chart included in Giora's post is this analysis of the most frequent words to appear in song titles, by decade. He used the tidytext package to extract individual words from song titles and then rank them by frequency of use:

So it seems as though Love Is All Around (#41, October 1994) after all! For more analysis of the Billboard Hot 100 data, including Top-10 rankings for various measures of song popularity and the associated R code, check out Giora's post linked below.

Sex, Drugs and Data: Billboard Bananas

To **leave a comment** for the author, please follow the link and comment on their blog: ** Revolutions**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...