R bloggers
R news and tutorials contributed by hundreds of R bloggers

UK government using R to modernize reporting of official statistics

Wed, 03/29/2017 - 02:34

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Like all governments, the UK government is responsible for producing reports of official statistics on an ongoing basis. That process has traditionally been a highly manual one: extract data from government systems, load it into a mainframe statistical analysis tool and run models and forecasts, extract the results to a spreadsheet to prepare data for presentation, and ultimately combine it all in a manual document editing tool to produce the final report. The process in the UK looks much like this today:

Matt Upson, a Data Scientist at the UK Government Digital Service, is looking to modernize this process with a reproducible analytical pipeline. This new process, based on the UK Government's Technology Service Manual for new IT deployments, aims to simplify the process by using R — the open-source programming language for statistical analysis — to automate the data extraction, analysis, and document generation tasks.

The development of the new process is underway now, with the first target being a report on culture and sporting impacts on the UK economy:

Towards the end of 2016 we embarked on a project with a team in the Department for Culture, Media, and Sport (DCMS) who are responsible for the production of the Economic Estimates for DCMS Sectors Statistical First Release (SFR). Currently this publication is produced with a mix of manual and semi-manual processes. Our aim was to see if we could speed up production of the SFR, whilst maintaining the high standard of the publication and QA.

Central to the process is automating the report with Rmarkdown, a system for programmatically laying out a document while encapsulating all of the R code for data preparation, analysis, tabulation and charting into a collaborative document. As a comprehensive language, R has the functionality to handle all of these tasks natively, without the need to bring other systems into the chain. And it's an ideal tool for the statistical data analysis required for these reports, which must follow the guidance of the Aqua Book, the UK government's manual for producing quality analysis. (The Aqua Book is well worth reading for any data scientist, with excellent guidance on designing studies in the face of uncertain data.) 
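To give a flavour of what such an automated report can look like, here is a minimal R Markdown sketch; the file name, column names and data source are hypothetical stand-ins rather than anything from the actual SFR pipeline. Knitting it regenerates the table and chart directly from the data:

---
title: "Economic Estimates for DCMS Sectors (illustrative)"
output: html_document
---

```{r setup, include=FALSE}
library(dplyr)
library(ggplot2)
# hypothetical extract standing in for the real data source
sectors <- read.csv("dcms_sectors.csv")
```

The gross value added (GVA) of each sector is summarised below.

```{r gva-table}
sectors %>%
  group_by(sector) %>%
  summarise(gva = sum(gva)) %>%
  knitr::kable()
```

```{r gva-chart}
ggplot(sectors, aes(x = year, y = gva, colour = sector)) +
  geom_line()
```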

Naturally, the government has specific standards for the presentation of data, and the R-based process needs to be able to reproduce that look in the final report. The GDS team has produced a package for R, called govstyle, which standardizes the look of data charts while upgrading them to more modern design principles. For example, this chart from a report on truancy in UK schools:

has been upgraded and reproduced in R to look like this:
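As a rough sketch of how govstyle slots into a ggplot2 workflow (this assumes the package's theme_gov() function and its GitHub home at ukgovdatascience/govstyle, and uses made-up truancy figures purely for illustration):

# devtools::install_github("ukgovdatascience/govstyle")
library(ggplot2)
library(govstyle)

# made-up figures standing in for the truancy data
truancy <- data.frame(
  year = 2010:2016,
  rate = c(6.1, 5.8, 5.2, 4.6, 4.5, 4.4, 4.3)
)

ggplot(truancy, aes(x = year, y = rate)) +
  geom_line() +
  theme_gov()   # applies the GDS chart styling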

An automated workflow allows for the inclusion of modern devops processes, too. Dependency management for R packages is handled with packrat. Test-driven development comes courtesy of the testthat package, and automated testing (including data verification and code-coverage analysis) is provided by Travis CI. Source code control and collaboration are provided by GitHub, which is also where the entire process is documented as an R package: the eesectors package.
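To make the tooling concrete, here is a minimal sketch of how those pieces are usually wired together for an R package; the test file and its contents are illustrative, not taken from eesectors itself:

# dependency snapshotting with packrat (run once inside the package project)
packrat::init()

# tests/testthat/test-data-checks.R : an illustrative data-verification test
library(testthat)

test_that("GVA figures are positive and complete", {
  gva <- c(2.1, 3.4, 5.6)   # stand-in for values loaded from the extract
  expect_true(all(gva > 0))
  expect_false(anyNA(gva))
})

# A minimal .travis.yml for an R package is just:
#   language: r
#   cache: packages
# with covr (covr::codecov()) commonly added for code-coverage reporting.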

For more on the UK government's use of R for official statistics reporting, see the post below from the UK Government Digital Service. 

Data at GDS: Reproducible Analytical Pipeline


shinyHeatmaply – a shiny app for creating interactive cluster heatmaps

Wed, 03/29/2017 - 02:23

(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers)

My friend Jonathan Sidi and I (Tal Galili) are pleased to announce the release of shinyHeatmaply (0.1.0): a new Shiny application (and Shiny gadget) for creating interactive cluster heatmaps. shinyHeatmaply is based on the heatmaply R package, which strives to make it as easy as possible to create interactive cluster heatmaps.

The app introduces functionality for saving a self-contained copy of the htmlwidget to disk as an HTML file, with your data and the specifications you set from the UI, so it can be embedded in web pages, blog posts, and online appendices for academic publications.
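Outside the app, the same idea of saving a standalone copy can be reproduced in a couple of lines; a minimal sketch using heatmaply and htmlwidgets (the file name is arbitrary):

library(heatmaply)
library(htmlwidgets)

# build an interactive cluster heatmap and write it out as one self-contained HTML file
w <- heatmaply(mtcars, scale = "column")
saveWidget(w, file = "mtcars_heatmap.html", selfcontained = TRUE)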

You can see some of shinyHeatmaply's capabilities in the following 40-second video:

 

Installing shinyHeatmaply

From CRAN:

install.packages('shinyHeatmaply')

From github:

devtools::install_github('yonicd/shinyHeatmaply')

Running the app/gadget

The application has a built-in import interface which currently supports csv, txt, tab, xls, xlsx, rd, and rda files. You can start the app using:

library(shiny)
library(heatmaply)
# If you didn't get shinyHeatmaply yet, you can run it through github:
# runGitHub("yonicd/shinyHeatmaply", subdir = 'inst/shinyapp')
# or just use your locally installed package:
library(shinyHeatmaply)
runApp(system.file("shinyapp", package = "shinyHeatmaply"))

The gadget is called from the R console and accepts input arguments. The object defined as the input to the shinyHeatmaply gadget is a data.frame or a list of data.frames. You can start it using the following code:

library(shinyHeatmaply)

# single data.frame
data(mtcars)
launch_heatmaply(mtcars)

# list
data(iris)
launch_heatmaply(list('Example1' = mtcars, 'Example2' = iris))

You can see an example of a saved shinyHeatmaply output here.

We would love to get your feedback!

For issue reports or feature requests, please visit the GitHub repo.

Post credit: shinyHeatmaply was made possible thanks to the dedication of Jonathan Sidi, and builds on recent features added to heatmaply by Alan O'Callaghan. I am very grateful to them both. It would also not have been possible without the amazing work of the RStudio team on Shiny applications, and of Carson Sievert on plotly. And lastly, thanks to my adviser Yoav Benjamini for his support and advice.


Gradient Descent

Wed, 03/29/2017 - 02:00

(This article was first published on --Jean Arreola--, and kindly contributed to R-bloggers)

Trying gradient descent for linear regression

The best way to learn an algorithm is to code it. So here it is: my take on the gradient descent algorithm for simple linear regression.

First, we fit a simple linear model with lm for comparison with gradient descent values.

# Load libraries
library(dplyr)
library(highcharter)

# Scaling length variables from the iris dataset.
iris_demo <- iris[, c("Sepal.Length", "Petal.Length")] %>%
  mutate(sepal_length = as.numeric(scale(Sepal.Length)),
         petal_length = as.numeric(scale(Petal.Length))) %>%
  select(sepal_length, petal_length)

# Fit a simple linear model to compare coefficients.
regression <- lm(iris_demo$petal_length ~ iris_demo$sepal_length)
coef(regression)

##            (Intercept) iris_demo$sepal_length
##           4.643867e-16           8.717538e-01

iris_demo_reg <- iris_demo
iris_demo_reg$reg <- predict(regression, iris_demo)

# Plot the model with highcharter
highchart() %>%
  hc_add_series(data = iris_demo_reg, type = "scatter",
                hcaes(x = sepal_length, y = petal_length),
                name = "Sepal Length VS Petal Length") %>%
  hc_add_series(data = iris_demo_reg, type = "line",
                hcaes(x = sepal_length, y = reg),
                name = "Linear Regression") %>%
  hc_title(text = "Linear Regression")


We will try to arrive at the same coefficients, this time using gradient descent.

library(tidyr)
set.seed(135)  # To reproduce results

# Auxiliary function
# y = mx + b
reg <- function(m, b, x) return(m * x + b)

# Starting point
b <- runif(1)
m <- runif(1)

# Gradient descent function
gradient_desc <- function(b, m, data, learning_rate = 0.01){  # Small steps
  # Column names = Code easier to understand
  colnames(data) <- c("x", "y")
  # Values for first iteration
  b_iter <- 0
  m_iter <- 0
  n <- nrow(data)
  # Compute the gradient for the Mean Squared Error function
  for(i in 1:n){
    # Partial derivative for b
    b_iter <- b_iter + (-2/n) * (data$y[i] - ((m * data$x[i]) + b))
    # Partial derivative for m
    m_iter <- m_iter + (-2/n) * data$x[i] * (data$y[i] - ((m * data$x[i]) + b))
  }
  # Move in the OPPOSITE direction of the derivative
  new_b <- b - (learning_rate * b_iter)
  new_m <- m - (learning_rate * m_iter)
  # Replace values and return
  new <- list(new_b, new_m)
  return(new)
}

# I need to store some values to make the motion plot
vect_m <- m
vect_b <- b

# Iterate to obtain better parameters
for(i in 1:1000){
  if(i %in% c(1, 100, 250, 500)){
    # I keep some values in the iteration for the plot
    vect_m <- c(vect_m, m)
    vect_b <- c(vect_b, b)
  }
  x <- gradient_desc(b, m, iris_demo)
  b <- x[[1]]
  m <- x[[2]]
}

print(paste0("m = ", m))

## [1] "m = 0.871753774273602"

print(paste0("b = ", b))

## [1] "b = 5.52239677041512e-10"

The difference in the coefficients is minimal.

We can see how the iterations work in the next plot:

# Compute new values
iris_demo$preit   <- reg(vect_m[1], vect_b[1], iris_demo$sepal_length)
iris_demo$it1     <- reg(vect_m[2], vect_b[2], iris_demo$sepal_length)
iris_demo$it100   <- reg(vect_m[3], vect_b[3], iris_demo$sepal_length)
iris_demo$it250   <- reg(vect_m[4], vect_b[4], iris_demo$sepal_length)
iris_demo$it500   <- reg(vect_m[5], vect_b[5], iris_demo$sepal_length)
iris_demo$finalit <- reg(m, b, iris_demo$sepal_length)

iris_gathered <- iris_demo %>%
  gather(key = gr, value = val, preit:finalit) %>%
  select(-petal_length) %>%
  distinct()

iris_start <- iris_gathered %>%
  filter(gr == "preit")

iris_seq <- iris_gathered %>%
  group_by(sepal_length) %>%
  do(sequence = list_parse(select(., y = val)))

iris_data <- left_join(iris_start, iris_seq)

# Motion Plot
irhc2 <- highchart() %>%
  hc_add_series(data = iris_data, type = "line",
                hcaes(x = sepal_length, y = val),
                name = "Gradient Descent") %>%
  hc_motion(enabled = TRUE, series = 0, startIndex = 0,
            labels = c("Iteration 1", "Iteration 100", "Iteration 250",
                       "Iteration 500", "Final Iteration")) %>%
  hc_add_series(data = iris_demo_reg, type = "scatter",
                hcaes(x = sepal_length, y = petal_length),
                name = "Sepal Length VS Petal Length") %>%
  hc_title(text = "Gradient Descent Iterations")

irhc2


Maybe in a future post we can try a multivariate regression model!


A fast method to add annotations to a plot

Wed, 03/29/2017 - 02:00

(This article was first published on Bluecology blog, and kindly contributed to R-bloggers)


Making professional looking plots in R can be fiddly. One task that I often spend ages doing is manually finding coordinates to add labels.
Wouldn’t it be nice if you could just send the coordinates directly to your text editor?

I did some searching and found on stackoverflow that you can send R objects to the clipboard. So here is my solution using that trick.

Adding text to the right position on a plot can be a real hassle. Here I show how to use a simple function to click on a figure and put coordinates onto your clipboard.

You can get R to send data directly to the clipboard using the pipe() command. Below is a little function I wrote that takes coordinates from locator() and sends them to your clipboard. Then you can just hit cmd-v to paste them into your text editor (NB: I understand this may need some slight modifications to work on Linux or Windows; I use OSX):

loccopy <- function(n, digits = 2){
  data <- locator(n)
  data <- round(cbind(data$x, data$y), digits)
  clip <- pipe("pbcopy", "w")
  write.table(data, file = clip, col.names = F, row.names = F)
  close(clip)
}
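For readers not on OSX, here is a hedged cross-platform variant of the same idea; the Windows branch uses utils::writeClipboard() and the Linux branch assumes the xclip utility is installed:

loccopy2 <- function(n, digits = 2){
  data <- locator(n)
  out <- paste(round(data$x, digits), round(data$y, digits))
  sysname <- Sys.info()[["sysname"]]
  if (sysname == "Darwin") {            # OSX
    clip <- pipe("pbcopy", "w")
    writeLines(out, clip)
    close(clip)
  } else if (sysname == "Windows") {    # Windows
    utils::writeClipboard(out)
  } else {                              # Linux/BSD, requires xclip
    clip <- pipe("xclip -selection clipboard", "w")
    writeLines(out, clip)
    close(clip)
  }
}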

Let’s test it out:

set.seed(42)
plot(runif(100))
loccopy(1)

Now hit cmd-v (or equivalent on your OS).

69.23 0.84

Let’s add a label using our fast method for coordinates:

text(69.23, 0.84, "Unusual data point", pos =4, offset = 0)

The pos = 4 and offset = 0 arguments ensure that the text goes directly to the right of our coordinates.

That’s it. Hope it helps speed up your workflow.


Understanding Linear SVM with R

Wed, 03/29/2017 - 01:01

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Linear Support Vector Machine, or linear SVM (as it is often abbreviated), is a supervised classifier generally used in bi-classification problems, that is, problem settings where there are two classes. Of course it can be extended to multi-class problems. In this work, we will build a mathematical understanding of linear SVM along with R code to understand its critical components. Taking the liberty to assume that some readers are new to machine learning, let us first find out what a supervised classifier is supposed to do. The term supervised comes from the concept of supervised learning: you provide some examples along with their labels to your classifier, and the classifier then tries to learn important features from the provided examples. For example, consider the following data set. You will provide a part of this data to your linear SVM and tune the parameters such that your SVM can act as a discriminatory function separating the ham messages from the spam messages. So, let us add the following R code to our task.

sms_data <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
head(sms_data)

  type
1  ham
2  ham
3 spam
4  ham
5  ham
6 spam
  text
1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
2 Ok lar... Joking wif u oni...
3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
4 U dun say so early hor... U c already then say...
5 Nah I don't think he goes to usf, he lives around here though
6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

What Parameters should be tuned

Linear SVM tries to find a separating hyper-plane between two classes with the maximum gap in between. A hyper-plane in \(d\) dimensions is a set of points \(x \in \mathbb R^{d}\) satisfying the equation
$$w^{T}x+b=0$$
Let us denote \(h(x)=w^{T}x+b\).
Here \(w \) is a \(d\)-dimensional weight vector while \(b\) is a scalar denoting the bias. Both \(w\) and \(b\) are unknown to you. You will try to find and tune both \(w\) and \(b\) such that \(h(x)\) can separate the spams from hams as accurately as possible. I am drawing an arbitrary line to demonstrate a separating hyper-plane using the following code.

plot(c(-1,10), c(-1,10), type = "n", xlab = "x", ylab = "y", asp = 1)
abline(a = 12, b = -5, col = 3)
text(1, 5, "h(x)=0", col = 2, adj = c(-.1, -.1))
text(2, 3, "h(x)>0, Ham", col = 2, adj = c(-.1, -.1))
text(0, 3, "h(x)<0, Spam", col = 2, adj = c(-.1, -.1))

This is the plot we get:

Notice, by the way, that linear SVM comes with decision rules. Let \(y\) be an indicator variable which takes the value -1 if the message is spam and 1 if it is ham. Then, the decision rules are
$$y=-1 \iff h(x) < 0$$
$$y=+1 \iff h(x) > 0$$
Notice that for any point \(x\) in your data set, \(h(x)\neq 0\), as we want the function \(w^{T}x+b\) to clearly distinguish between spam and ham. This brings us to an important condition.

Important Condition

Linear SVM assumes that the two classes are linearly separable, that is, a hyper-plane can separate the two classes and the data points from the two classes do not get mixed up. Of course this is not an ideal assumption, and we will discuss later how linear SVM handles the case of non-linear separability. But for a reader with some experience, here I pose a question: linear SVM creates a discriminant function, but so does LDA. Yet, both are different classifiers. Why? (Hint: LDA is based on Bayes' theorem, while linear SVM is based on the concept of margin. In the case of LDA, one has to make an assumption on the distribution of the data per class. If you are a newbie, please ignore the question. We will discuss this point in detail in some other post.)

Properties of \(w\)

Consider two points \(a_{1}\) and \(a_{2}\) lying on the hyper-plane defined by \(w^{T}x+b=0\). Thus, \(h(a_{1})=w^{T}a_{1}+b=0\) and \(h(a_{2})=w^{T}a_{2}+b=0\). Therefore,
$$w^{T}(a_{1}-a_{2})=0$$
Notice that the vector \(a_{1}-a_{2}\) lies in the separating hyper-plane defined by \(w^{T}x+b=0\). Thus, \(w\) is orthogonal to the hyper-plane.

Here is the plot:

The Role of \(w\)

Consider a point \(x_{i}\) and let \(x_{p}\) be the orthogonal projection of \(x_{i}\) onto the hyper-plane. Assume \(h(x_{i})>0\).
\(\frac{w}{\vert\vert w\vert\vert}\) is a unit vector in the direction of \(w\).

Here is the plot:

Let \(r\) be the offset between \(x_{i}\) and \(x_{p}\). Since \(x_{i}\) is projected onto the hyper-plane, there is some loss of information, which is given by
$$r=x_{i}-x_{p}$$

Notice that \(r\) is parallel to the unit vector \(\frac{w}{\vert\vert w\vert\vert}\) and therefore \(r\) can be written as some multiple of \(\frac{w}{\vert\vert w\vert\vert}\). Thus, \(r=c\times \frac{w}{\vert\vert w\vert\vert}\), for some scalar value \(c\). It can be shown that
$$c=\frac{h(x_{i})}{\vert\vert w\vert\vert}$$

Actually, \(c\times \frac{w}{\vert\vert w\vert\vert}\) represents the orthogonal offset of the point \(x_{i}\) from the hyper-plane, and a distance has to be positive. The unit vector \(\frac{w}{\vert\vert w\vert\vert}\) cannot be negative, but \(c=\frac{h(x_{i})}{\vert\vert w\vert\vert}\) can be: before starting the equations we assumed that \(h(x_{i})\) is positive, but that is not the general case, so in general \(c\) can also be negative. Therefore we multiply \(c\) by \(y\) (remember your \(y\) from the decision rules). That is
$$c=y\times \frac{h(x_{i})}{\vert\vert w\vert\vert}$$
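A tiny numerical illustration of this quantity: for an arbitrary (made-up) \(w\) and \(b\), the helper below returns \(y\times h(x)/\vert\vert w\vert\vert\), which is positive whenever the point sits on the side of the hyper-plane that its label predicts:

# signed orthogonal distance of a point x from the hyper-plane w'x + b = 0
geom_distance <- function(x, y, w, b) {
  h <- sum(w * x) + b          # h(x) = w'x + b
  y * h / sqrt(sum(w^2))       # y * h(x) / ||w||
}

w <- c(2, -1)                  # arbitrary weight vector
b <- 0.5                       # arbitrary bias
geom_distance(x = c(1, 3), y = -1, w = w, b = b)   # positive: correctly classified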

The Concept of Margin

Linear SVM is based on the concept of margin which is defined as the smallest value of \(c\) for all the \(n\) points in your training data set. That is
$$\delta^{\star}=min\{c~\forall~points~in~the~training~data~set\}$$. All points \(x^{\star}\) which are lying at an orthogonal distance of \(\delta^{\star}\) from the hyper-plane \(w^{T}x+b=0\) are called support vectors.

Scaling Down the Margin

Let \(\delta^{\star}=\frac{y^{\star}\times h(x^{\star})}{\vert\vert w\vert\vert}\), where \(x^{\star}\) is a support vector. We want \(\delta^{\star} = \frac{1}{\vert\vert w\vert\vert}\). Clearly, we need to rescale \(h\) by some value \(s\) such that \(s\times y^{\star}\times h(x^{\star})=1\). Therefore, \(s=\frac{1}{y^{\star}\times h(x^{\star})}\).

Objective Criteria for Linear SVM

After scaling down, we have the following constraint for SVM:
$$y\times (w^{T}x+b)\geq 1$$
Now consider two support vectors, one denoted by \(a_{1}\) on the positive side, and the other, \(a_{2}\), on the negative side of the hyper-plane. For the time being let us not use \(y\). Then we have \(w^{T}a_{1}+b= 1\) and \(w^{T}a_{2}+b= -1\). Thus \(w^{T}(a_{1}-a_{2})=2\), and the orthogonal separation/gap between the support vectors on either side is \(\frac{w^{T}(a_{1}-a_{2})}{\vert\vert w\vert\vert}=\frac{2}{\vert\vert w\vert\vert}\). We want to maximize this gap \(\frac{2}{\vert\vert w\vert\vert}\), and for that purpose we need to minimize \(\vert\vert w\vert\vert\); equivalently, we can minimize \(\frac{\vert\vert w\vert\vert^{2}}{2}\). Formally, we want to solve
$$Minimize_{w,b}\frac{\vert\vert w\vert\vert^{2}}{2}$$
subject to the constraint
$$y_{i}(w^{T}x_{i}+b)\geq 1~\forall ~x_{i}~in~the~Training~Data~set$$

For the linearly inseparable case, we introduce slack penalties \(\xi_{i}\geq 0\) into the objective function and our constraints.
$$Minimize_{w,b}(\frac{\vert\vert w\vert\vert^{2}}{2}+ C\sum_{i=1}^{n}\xi_{i})$$
subject to the constraints
$$(1)~y_{i}(w^{T}x_{i}+b)\geq 1-\xi_{i}~\forall ~x_{i}~in ~Training~Data~set~of~size~n$$ and
$$(2)~\xi_{i}\geq 0~\forall ~n~points~in~the~Training~Data~set$$

\(C\) is a user-defined parameter.
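Before the text-mining example, here is a self-contained toy run of a linear SVM with e1071 on two simulated, (almost) linearly separable classes, just to show where the user-defined cost \(C\) enters the call:

library(e1071)

set.seed(1)
# two simulated 2-d classes
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
toy <- data.frame(x1 = x[, 1], x2 = x[, 2],
                  y = factor(rep(c(-1, 1), each = 20)))

# 'cost' is the C of the objective above: it weights the slack penalties xi_i
toy_svm <- svm(y ~ ., data = toy, kernel = "linear", cost = 1, scale = FALSE)
summary(toy_svm)
plot(toy_svm, toy)   # support vectors are drawn as crosses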

In case you wish to see how \(\xi_{i}\) is used as a penalty, please consider the three figures below. You can also choose to skip them if the above details are okay for you.

Try the Code with R

Let us now start our coding. The needed code is here.

library(tm)
# setwd("path/to/the/directory")   # set this to the folder holding sms_spam.csv
print(getwd())

sms_data <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
print(head(sms_data))
sms_data$type <- factor(sms_data$type)

simply_text <- Corpus(VectorSource(sms_data$text))
cleaned_corpus <- tm_map(simply_text, content_transformer(tolower))  # content_transformer keeps the corpus structure in recent tm versions
cleaned_corpus <- tm_map(cleaned_corpus, removeNumbers)
cleaned_corpus <- tm_map(cleaned_corpus, removeWords, stopwords())
#cleaned_corpus <- tm_map(cleaned_corpus, removePunctuation)
#cleaned_corpus <- tm_map(cleaned_corpus, stripWhitespace)
sms_dtm <- DocumentTermMatrix(cleaned_corpus)

# split the corpus into training and test sets
#sms_train <- sms_dtm[1:4000, ]
#sms_test <- sms_dtm[4001:5574, ]
sms_train <- cleaned_corpus[1:4000]
sms_test <- cleaned_corpus[4001:5574]

freq_term <- findFreqTerms(sms_dtm, lowfreq = 2)
sms_freq_train <- DocumentTermMatrix(sms_train, list(dictionary = freq_term))
sms_freq_test <- DocumentTermMatrix(sms_test, list(dictionary = freq_term))

y <- sms_data$type
y_train <- y[1:4000]
y_test <- y[4001:5574]

library(e1071)
tuned_svm <- tune(svm, train.x = sms_freq_train, train.y = y_train, kernel = "linear",
                  ranges = list(cost = 10^(-2:2), gamma = c(0.1, 0.25, 0.5, 0.75, 1, 2)))
print(tuned_svm)

adtm.m <- as.matrix(sms_freq_train)
adtm.df <- as.data.frame(adtm.m)
svm_good_model <- svm(y_train ~ ., data = adtm.df, kernel = "linear",
                      cost = tuned_svm$best.parameters$cost,
                      gamma = tuned_svm$best.parameters$gamma)

adtm.m_test <- as.matrix(sms_freq_test)
adtm.df_test <- as.data.frame(adtm.m_test)
y_pred <- predict(svm_good_model, adtm.df_test)
table(y_test, y_pred)

library(gmodels)
CrossTable(y_pred, y_test, prop.chisq = FALSE)

  type
1  ham
2  ham
3 spam
4  ham
5  ham
6 spam
  text
1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
2 Ok lar... Joking wif u oni...
3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
4 U dun say so early hor... U c already then say...
5 Nah I don't think he goes to usf, he lives around here though
6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

Parameter tuning of ‘svm’:

- sampling method: 10-fold cross validation

- best parameters:
 cost gamma
    1   0.1

- best performance: 0.01825

   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  1574

             | y_test
      y_pred |       ham |      spam | Row Total |
-------------|-----------|-----------|-----------|
         ham |      1295 |       191 |      1486 |
             |     0.871 |     0.129 |     0.944 |
             |     0.952 |     0.897 |           |
             |     0.823 |     0.121 |           |
-------------|-----------|-----------|-----------|
        spam |        66 |        22 |        88 |
             |     0.750 |     0.250 |     0.056 |
             |     0.048 |     0.103 |           |
             |     0.042 |     0.014 |           |
-------------|-----------|-----------|-----------|
Column Total |      1361 |       213 |      1574 |
             |     0.865 |     0.135 |           |
-------------|-----------|-----------|-----------|

In the above example we are not using the fully cleaned corpus (the punctuation and whitespace steps are commented out). We see that the precision of our classifier for ham is 0.871 and its recall is 0.95. For spam messages, however, the precision is 0.250, while the recall is just 0.103. We found that using the fully cleaned corpus degrades the performance of the SVM even more.
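As a quick sketch of where those numbers come from, precision and recall can be read straight off the confusion matrix returned by table(y_test, y_pred):

conf <- table(y_test, y_pred)   # rows = truth, columns = prediction

precision_spam <- conf["spam", "spam"] / sum(conf[, "spam"])   # 22 / 88   = 0.250
recall_spam    <- conf["spam", "spam"] / sum(conf["spam", ])   # 22 / 213  = 0.103
precision_ham  <- conf["ham",  "ham"]  / sum(conf[, "ham"])    # 1295/1486 = 0.871
recall_ham     <- conf["ham",  "ham"]  / sum(conf["ham",  ])   # 1295/1361 = 0.952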

    Related Post

    1. How to add a background image to ggplot2 graphs
    2. Streamline your analyses linking R to SAS: the workfloweR experiment
    3. R Programming – Pitfalls to avoid (Part 1)
    4. Eclipse – an alternative to RStudio – part 2
    5. Eclipse – an alternative to RStudio – part 1


    RStudio Connect 1.4.4.1

    Tue, 03/28/2017 - 20:42

    (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

    We’re excited to announce the release of RStudio Connect: version 1.4.4.1. This release includes the ability to manage different versions of your work on RStudio Connect.

    Rollback / Roll Forward
    The most notable feature of this release is the ability to “roll back” to a previously deployed version of your work or “roll forward” to a more recent version.

    You can also download a particular version, perhaps as a starting place for a new report or application, and delete old versions that you want to remove from the server.

    Other important features allow you to:

    • Specify the number of versions to retain. You can alter the Applications.BundleRetentionLimit setting to specify how many versions of your applications you want to keep on disk. By default, we retain all bundles eternally.
    • Limit the number of scheduled reports that will be run concurrently using the Applications.ScheduleConcurrency setting. This setting will help ensure that your server isn’t overwhelmed by too many reports all scheduled to run at the same time of day. The default is set to 2. (A configuration sketch for both settings follows this list.)
    • Create a printable view of your content with a new “Print” menu option.
    • Notify users of unsaved changes before they take an action in parameterized reports.
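    A hedged sketch of how those two settings might appear in the server's rstudio-connect.gcfg configuration file; the section layout is inferred from the dotted setting names, and the values are purely illustrative:

; /etc/rstudio-connect/rstudio-connect.gcfg (illustrative values only)
[Applications]
BundleRetentionLimit = 5
ScheduleConcurrency = 2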

    The release also includes numerous security and stability improvements.

    If you haven’t yet had a chance to download and try RStudio Connect we encourage you to do so. RStudio Connect is the best way to share all the work that you do in R (Shiny apps, R Markdown documents, plots, dashboards, etc.) with collaborators, colleagues, or customers.

    You can find more details or download a 45-day evaluation of the product at https://www.rstudio.com/products/connect/. Additional resources can be found below.


    nanotime 0.1.2

    Tue, 03/28/2017 - 13:32

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A new minor version of the nanotime package for working with nanosecond timestamps arrived yesterday on CRAN.

    nanotime uses the RcppCCTZ package for (efficient) high(er) resolution time parsing and formatting up to nanosecond resolution, and the bit64 package for the actual integer64 arithmetic.
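    A minimal usage sketch (the timestamp is arbitrary and the calls follow the package's documented interface):

library(nanotime)

# parse an ISO-8601 timestamp down to the nanosecond
x <- nanotime("2017-03-28T13:32:00.000000001+00:00")

# arithmetic is carried out on integer nanoseconds via bit64::integer64
x + 1          # one nanosecond later
format(x)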

    This release just arranges things neatly before Leonardo Silvestri and I may shake things up with a possible shift to doing it all in S4 as we may need the added rigour for nanotime object operations for use in his ztsdb project.

    Changes in version 0.1.2 (2017-03-27)
    • The as.integer64 function is now exported as well.

    We also have a diff to the previous version thanks to CRANberries. More details and examples are at the nanotime page; code, issue tickets etc at the GitHub repository.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    Introducing Boardgame Recommendation website built with Shiny

    Tue, 03/28/2017 - 05:28

    My family and I love boardgames. We are also on the lookout for new ones that would fit with the ones we already like to play. So I decided to create a boardgame recommendation website:

    https://larrydag.shinyapps.io/boardgame_reco/

    I built it using R. The recommendation engine uses a very simple collaborative filtering algorithm based on correlation scores from other boardgame players' collection lists. The collections are gathered using the API from BoardgameGeek.com. It is very much in a beta phase, as I just wanted to get something working.
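    The underlying idea is simple enough to sketch in a few lines of R; the toy user-by-game matrix below is made up, and games are scored by how strongly their ownership pattern correlates with the games already in a collection (an illustration only, not the site's actual code):

set.seed(7)
# toy user-by-game ownership matrix: 1 = in collection, 0 = not
ratings <- matrix(sample(c(0, 1), 60, replace = TRUE), nrow = 10,
                  dimnames = list(paste0("user", 1:10), paste0("game", 1:6)))

my_collection <- c("game1", "game3")

# item-item correlations between games, based on co-ownership
item_cor <- cor(ratings)

# average correlation of every other game with the games already owned
scores <- rowMeans(item_cor[, my_collection, drop = FALSE])
sort(scores[setdiff(names(scores), my_collection)], decreasing = TRUE)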

    I also wanted another project to build in Shiny.  I really like how easy it is to publish R projects with Shiny.

    Some of the features include:

    • Ability to enter your own collection
    • Get recommendations based on your collection
    • Amazon link to buy a recommended boardgame
    It's a work in progress. There is much to clean up and to make more presentable. Please take a look and offer comments to help improve the website.

    Le Monde puzzle [#1000…1025]

    Tue, 03/28/2017 - 00:17

    (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

    Le Monde mathematical puzzle launched a competition to celebrate its 1000th puzzle! A fairly long-term competition as it runs over the 25 coming puzzles (and hence weeks). Starting with puzzle #1001. Here is the 1000th puzzle, not part of the competition:

    Alice & Bob spend five (identical) vouchers in five different shops, each time buying the maximum number of items to get close to the voucher value. In these five shops, they buy sofas at 421 euros each, beds at 347 euros each, kitchen appliances at 289 euros each, tables at 251 euros each and bikes at 211 euros each, respectively. Once the buying frenzy is over, they realise that within a single shop, they would have spent exactly four vouchers for the same products. What is the value of a voucher?
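    One possible reading of the puzzle (in each shop they spend floor(V/price) x price, and the grand total happens to equal exactly four vouchers) lends itself to a brute-force search in R; the reading and the search range are my own assumptions:

prices <- c(421, 347, 289, 251, 211)

# amount spent in one shop with a voucher worth V
spend <- function(V, price) (V %/% price) * price

# voucher values for which the five-shop total comes to exactly 4 vouchers
candidates <- Filter(function(V) sum(sapply(prices, spend, V = V)) == 4 * V,
                     seq(421, 2000))
candidates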

    Filed under: Kids, R Tagged: Alice and Bob, arithmetics, competition, Le Monde, mathematical puzzle, R, Tangente


    How to Reach More People with your Next Data Science Talk

    Mon, 03/27/2017 - 18:00

    (This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

    Speaking publicly about your data science projects is one of the best things you can do for your portfolio.

    Writing your presentation will help you refine what your project is, who it is for, and how it relates to other work in the area.

    Giving your presentation at a live, in-person event will make a large impact on the audience.

    That being said, public speaking has its drawbacks. For example:

    1. Time to prepare. While your talk might take just 30 minutes to deliver, you could easily spend weeks writing and rehearsing it.
    2. Cost. Some talks, such as those at local meetups, might be free to attend. But some will require significant travel. That costs money, even if your conference ticket is free.
    3. Limited upside. When you deliver a conference talk, your upside is limited in two ways:
      1. Reach. Only the people in the room at the time of your talk can hear your message.
      2. Interaction. Your interaction with the audience is normally limited to a few minutes of Q&A, and whatever conversations you have with people during the rest of the conference.

    In my experience, this third pain point is the most significant. After all, if you had reached 10x the number of people with your last talk, then you probably would have been willing to spend more time writing it. And you probably would have been willing to travel further to give it.

    Imagine if your talk wasn’t limited by the conference itself

    For a moment, try to ignore the limits that the conferences place on your talk.

    Imagine if members of the audience could ask you questions days, or even weeks, after you gave your talk. Wouldn’t that be a game changer?

    And what about all the people who care about your topic but couldn’t make it to the conference? Wouldn’t it be great if they could hear your talk as well?

    All of this is possible

    I’ve recently worked out a system to solve these problems in my own public speaking events. My talks now reach more people than ever before, and lead to more interactions than ever before.

    Bonus: A checklist for making the most out of your data science talks! Click Here

    Before giving you the exact system I use, let me tell you the technology that you’ll need in order to implement it. Note that some of the links below are affiliate links. This means that, at no additional cost to you, I may get a commission if you wind up purchasing after clicking these links:

    1. A self-hosted WordPress website. I use BlueHost.
    2. An Email Service so you can follow up with people. I use Drip.
    3. A Webinar Provider. I use Zoom.
    4. A Landing Page Builder. I use Thrive.
    Step 1: Let the Audience Download your Slides

    Before the conference, register a URL such as <ConferenceName>Slides.com. The URL should be easy for the audience to remember. Set the URL to redirect to a separate page on your website.

    This page should be a “landing page”. This means that the only Call-to-Action (CTA) on the page should be for the reader to enter their email address so that you can send them the slides.

    I recommend that the last slide in your deck contain just this URL. When the slide appears, say “If you would like to download the slide deck I used for this talk, visit this URL and enter your email address. I will then email you the slides.”

    Set up an automated email campaign just for this opt-in. The first email should fire immediately and send them the slides. The second email should fire a few days later, and ask people if they enjoyed the slides or have any questions.

    This tip alone should multiply the chances you have to interact with your audience after the talk ends.

    Step 2: Host a Post-Conference Webinar

    The sad fact is this: most people who would enjoy your talk won’t be in the audience. No conference can get even a fraction of the people who are interested in a topic. The best way around this is to host a live webinar where you go through the exact same slides as your presentation.

    The week after your presentation, publish a blog post where you announce that you’ll be giving the same talk, but this time as a live webinar. Drive traffic to the registration page by announcing it on LinkedIn, Facebook and Twitter.

    The biggest impact your talk will have is the in-person presentation at the conference. The second biggest impact will be the live webinar.

    Step 3: Use the Webinar Recording

    After giving the webinar, you will now have a recording of the presentation. This is great because (again) most people who are interested in your talk will not have been at the conference, and will not have attended your live webinar.

    You can host the recording on your own website or on YouTube. When you get a new subscriber, point them to the recording. You can also periodically drive traffic to the recording on social media.

    Step 4: Post Your Slides Online

    A week after your webinar I recommend posting your slides online. I also recommend creating a content upgrade so that readers can download the slides. A “content upgrade” simply means that readers enter their email address, and that you then email them the slides.

    This step will probably have the highest volume (i.e. the most people will interact with your content in this form), but the least impact (i.e. people will not get as much out of interacting with your talk in this form).

    Closing Notes

    As a data scientist with an online portfolio, conference talks are probably the most valuable type of content you create. Without additional effort, though, that value will be limited to people who are in the audience at the time of your presentation.

    With just a little bit of work, though, you can overcome that limitation, have more follow-up with your audience, and reach more people.

    Bonus: A checklist for making the most out of your data science talks! Click Here

    The post How to Reach More People with your Next Data Science Talk appeared first on AriLamstein.com.


    Data science for Doctors: Inferential Statistics Exercises (part-3)

    Mon, 03/27/2017 - 18:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)


    Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people around the medical field to enhance their data science skills.

    We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

    This is the sixth part of the series, and it aims to partially cover the subject of inferential statistics.
    Researchers rarely have the capability of testing many patients, or of experimenting with a new treatment on many patients; therefore, making inferences from a sample is a necessary skill to have. This is where inferential statistics comes into play.
    In more detail, in this part we will go through hypothesis testing for Student's t-distribution (the Student's t-test), which may be the most used test you will need to apply, since in most cases the standard deviation σ of the population is not known. We will cover the one-sample t-test and the two-sample t-test (both with equal and unequal variance). If you are not aware of what the mentioned distributions are, please go here to acquire the necessary background.

    Before proceeding, it might be helpful to look over the help pages for the t.test.

    Please run the code below in order to load the data set and transform it into a proper data frame format:

    url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
    data <- read.table(url, fileEncoding="UTF-8", sep=",")
    names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
    colnames(data) <- names
    data <- data[-which(data$mass ==0),]
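    Before the exercises, here is a quick illustrative look at the t.test() interface mentioned above, run on the mass column with an arbitrary null value of 30 (the exercises themselves use the simulated diet samples instead):

# one-sample t-test: is the mean mass different from an (arbitrary) 30?
tt <- t.test(data$mass, mu = 30)
tt

# the pieces the exercises ask for can be pulled out individually
tt$statistic   # t-value
tt$p.value     # p-value
tt$conf.int    # 95% confidence interval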

    Answers to the exercises are available here.

    If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

    Exercise 1

    Suppose that we take a sample of 25 candidates who tried a diet and had an average weight of 29 after the experiment (generate 25 normally distributed samples with mean 29 and standard deviation 4).
    Find the t-value.

    Exercise 2

    Find the p-value.

    Exercise 3

    Find the 95% confidence interval.

    Exercise 4

    Apply a t-test with the null hypothesis that the true mean of the sample is equal to the mean of the population, at the 5% significance level.

    Exercise 5

    Apply a t-test with the null hypothesis that the true mean of the sample is equal to the mean of the population, against the alternative that the true mean is less than the mean of the population, at the 5% significance level.

    Exercise 6

    Suppose that we want to compare the current diet with another one. We assume that we test a different diet on a sample of 27 with an average mass of 31 (generate normally distributed samples with mean 31 and standard deviation 5). Test whether the two diets are significantly different.
    Note that the two distributions have different variances.
    Hint: This is a two-sample hypothesis test with different variances.

    Exercise 7

    Test whether the first diet is more efficient than the second.

    Exercise 8

    Assume that the second diet has the same variance as the first one. Is it significantly different?

    Exercise 9

    Assume that the second diet has the same variance as the first one. Is it significantly better?

    Exercise 10

    Suppose that you take a sample of 27 with an average mass of 29, and after the diet the average mass is 28 (generate the samples with rnorm(27, average, 4)). Are they significantly different?
    Hint: Paired-sample t-test.

    Related exercise sets:
    1. Data science for Doctors: Inferential Statistics Exercises (part-2)
    2. Data Science for Doctors – Part 4 : Inferential Statistics (1/5)
    3. Data Science for Doctors – Part 2 : Descriptive Statistics
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory


    An Introduction to Stock Market Data Analysis with R (Part 1)

    Mon, 03/27/2017 - 17:00

    (This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

    Around September of 2016 I wrote two articles on using Python for accessing, visualizing, and evaluating trading strategies (see part 1 and part 2). These have been my most popular posts, up until I published my article on learning programming languages (featuring my dad’s story as a programmer), and has been translated into both Russian (which used to be on backtest.ru at a link that now appears to no longer work) and Chinese (here and here). R has excellent packages for analyzing stock data, so I feel there should be a “translation” of the post for using R for stock data analysis.

    This post is the first in a two-part series on stock data analysis using R, based on a lecture I gave on the subject for MATH 3900 (Data Science) at the University of Utah. In these posts, I will discuss basics such as obtaining the data from Yahoo! Finance using quantmod, visualizing stock data, moving averages, developing a moving-average crossover strategy, backtesting, and benchmarking. The final post will include practice problems. This first post discusses topics up to introducing moving averages.

    NOTE: The information in this post is of a general nature containing information and opinions from the author’s perspective. None of the content of this post should be considered financial advice. Furthermore, any code written here is provided without any form of guarantee. Individuals who choose to use it do so at their own risk.

    Introduction

    Advanced mathematics and statistics have been present in finance for some time. Prior to the 1980s, banking and finance were well known for being “boring”; investment banking was distinct from commercial banking and the primary role of the industry was handling “simple” (at least in comparison to today) financial instruments, such as loans. Deregulation under the Reagan administration, coupled with an influx of mathematical talent, transformed the industry from the “boring” business of banking to what it is today, and since then, finance has joined the other sciences as a motivation for mathematical research and advancement. For example, one of the biggest recent achievements of mathematics was the derivation of the Black-Scholes formula, which facilitated the pricing of stock options (a contract giving the holder the right to purchase or sell a stock at a particular price to the issuer of the option). That said, bad statistical models, including the Black-Scholes formula, hold part of the blame for the 2008 financial crisis.

    In recent years, computer science has joined advanced mathematics in revolutionizing finance and trading, the practice of buying and selling financial assets for the purpose of making a profit. Trading has become dominated by computers: algorithms are responsible for making rapid, split-second trading decisions faster than humans could (so rapidly that the speed at which light travels is a limitation when designing systems). Additionally, machine learning and data mining techniques are growing in popularity in the financial sector, and will likely continue to do so. In fact, a large part of algorithmic trading is high-frequency trading (HFT). While algorithms may outperform humans, the technology is still new and playing an increasing role in a famously turbulent, high-stakes arena. HFT was responsible for phenomena such as the 2010 flash crash and a 2013 flash crash prompted by a hacked Associated Press tweet about an attack on the White House.

    My articles, however, will not be about how to crash the stock market with bad mathematical models or trading algorithms. Instead, I intend to provide you with basic tools for handling and analyzing stock market data with R. We will be using stock data as a first exposure to time series data, which is data considered dependent on the time it was observed (other examples of time series include temperature data, demand for energy on a power grid, Internet server load, and many, many others). I will also discuss moving averages, how to construct trading strategies using moving averages, how to formulate exit strategies upon entering a position, and how to evaluate a strategy with backtesting.

    DISCLAIMER: THIS IS NOT FINANCIAL ADVICE!!! Furthermore, I have ZERO experience as a trader (a lot of this knowledge comes from a one-semester course on stock trading I took at Salt Lake Community College)! This is purely introductory knowledge, not enough to make a living trading stocks. People can and do lose money trading stocks, and you do so at your own risk!

    Getting and Visualizing Stock Data

    Getting Data from Yahoo! Finance with quantmod

    Before we analyze stock data, we need to get it into some workable format. Stock data can be obtained from Yahoo! Finance, Google Finance, or a number of other sources, and the quantmod package provides easy access to Yahoo! Finance and Google Finance data, along with other sources. In fact, quantmod provides a number of useful features for financial modelling, and we will be seeing those features throughout these articles. In this lecture, we will get our data from Yahoo! Finance.

# Get quantmod
if (!require("quantmod")) {
  install.packages("quantmod")
  library(quantmod)
}

start <- as.Date("2016-01-01")
end <- as.Date("2016-10-01")

# Let's get Apple stock data; Apple's ticker symbol is AAPL. We use the
# quantmod function getSymbols, and pass a string as a first argument to
# identify the desired ticker symbol, pass 'yahoo' to src for Yahoo!
# Finance, and from and to specify date ranges

# The default behavior for getSymbols is to load data directly into the
# global environment, with the object being named after the loaded ticker
# symbol. This feature may become deprecated in the future, but we exploit
# it now.

getSymbols("AAPL", src = "yahoo", from = start, to = end)

## As of 0.4-0, 'getSymbols' uses env=parent.frame() and
## auto.assign=TRUE by default.
##
## This behavior will be phased out in 0.5-0 when the call will
## default to use auto.assign=FALSE. getOption("getSymbols.env") and
## getOptions("getSymbols.auto.assign") are now checked for alternate defaults
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for more details.

## [1] "AAPL"

# What is AAPL?
class(AAPL)

## [1] "xts" "zoo"

# Let's see the first few rows
head(AAPL)

##            AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume
## 2016-01-04    102.61    105.37   102.00     105.35    67649400
## 2016-01-05    105.75    105.85   102.41     102.71    55791000
## 2016-01-06    100.56    102.37    99.87     100.70    68457400
## 2016-01-07     98.68    100.13    96.43      96.45    81094400
## 2016-01-08     98.55     99.11    96.76      96.96    70798000
## 2016-01-11     98.97     99.06    97.34      98.53    49739400
##            AAPL.Adjusted
## 2016-01-04     102.61218
## 2016-01-05     100.04079
## 2016-01-06      98.08303
## 2016-01-07      93.94347
## 2016-01-08      94.44022
## 2016-01-11      95.96942

    Let’s briefly discuss this. getSymbols() created in the global environment an object called AAPL (named automatically after the ticker symbol of the security retrieved) that is of the xts class (which is also a zoo-class object). xts objects (provided in the xts package) are seen as improved versions of the ts object for storing time series data. They allow for time-based indexing and provide custom attributes, along with allowing multiple (presumably related) time series with the same time index to be stored in the same object. (Here is a vignette describing xts objects.) The different series are the columns of the object, with the name of the associated security (here, AAPL) being prefixed to the corresponding series.

    Yahoo! Finance provides six series with each security. Open is the price of the stock at the beginning of the trading day (it need not be the closing price of the previous trading day), high is the highest price of the stock on that trading day, low the lowest price of the stock on that trading day, and close the price of the stock at closing time. Volume indicates how many shares were traded. Adjusted close (abbreviated as “adjusted” by getSymbols()) is the closing price of the stock that adjusts the price of the stock for corporate actions. While stock prices are considered to be set mostly by traders, stock splits (when the company makes each extant stock worth two and halves the price) and dividends (payout of company profits per share) also affect the price of a stock and should be accounted for.
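    To see those corporate actions in the data, one illustrative check is to compare the adjusted close with the raw close for the AAPL object created above; steps in the ratio line up with adjustment events such as dividends:

# ratio of adjusted close to raw close; steps in this series mark adjustment events
adj_ratio <- AAPL[, "AAPL.Adjusted"] / AAPL[, "AAPL.Close"]
plot(adj_ratio, main = "AAPL adjusted/close ratio")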

    Visualizing Stock Data

    Now that we have stock data we would like to visualize it. I first use base R plotting to visualize the series.

    plot(AAPL[, "AAPL.Close"], main = "AAPL")

    A line chart is fine, but there are at least four variables involved for each date (open, high, low, and close), and we would like to have some visual way to see all four variables that does not require plotting four separate lines. Financial data is often plotted with a Japanese candlestick plot, so named because it was first created by 18th-century Japanese rice traders. Use the function candleChart() from quantmod to create such a chart.

    candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white")

    With a candlestick chart, a black candlestick indicates a day where the closing price was higher than the open (a gain), while a red candlestick indicates a day where the open was higher than the close (a loss). The wicks indicate the high and the low, and the body the open and close (hue is used to determine which end of the body is the open and which the close). Candlestick charts are popular in finance and some strategies in technical analysis use them to make trading decisions, depending on the shape, color, and position of the candles. I will not cover such strategies today.

    (Notice that the volume is tracked as a bar chart on the lower pane as well, with the same colors as the corresponding candlesticks. Some traders like to see how many shares are being traded; this can be important in trading.)

    We may wish to plot multiple financial instruments together; we may want to compare stocks, compare them to the market, or look at other securities such as exchange-traded funds (ETFs). Later, we will also want to see how to plot a financial instrument against some indicator, like a moving average. For this you would rather use a line chart than a candlestick chart. (How would you plot multiple candlestick charts on top of one another without cluttering the chart?)

    Below, I get stock data for some other tech companies and plot their adjusted close together.

# Let's get data for Microsoft (MSFT) and Google (GOOG) (actually, Google is
# held by a holding company called Alphabet, Inc., which is the company
# traded on the exchange and uses the ticker symbol GOOG).
getSymbols(c("MSFT", "GOOG"), src = "yahoo", from = start, to = end)

## [1] "MSFT" "GOOG"

# Create an xts object (xts is loaded with quantmod) that contains closing
# prices for AAPL, MSFT, and GOOG
stocks <- as.xts(data.frame(AAPL = AAPL[, "AAPL.Close"],
                            MSFT = MSFT[, "MSFT.Close"],
                            GOOG = GOOG[, "GOOG.Close"]))
head(stocks)

##            AAPL.Close MSFT.Close GOOG.Close
## 2016-01-04     105.35      54.80     741.84
## 2016-01-05     102.71      55.05     742.58
## 2016-01-06     100.70      54.05     743.62
## 2016-01-07      96.45      52.17     726.39
## 2016-01-08      96.96      52.33     714.47
## 2016-01-11      98.53      52.30     716.03

# Create a plot showing all series as lines; must use as.zoo to use the zoo
# method for plot, which allows for multiple series to be plotted on same
# plot
plot(as.zoo(stocks), screens = 1, lty = 1:3, xlab = "Date", ylab = "Price")
legend("right", c("AAPL", "MSFT", "GOOG"), lty = 1:3, cex = 0.5)

    What’s wrong with this chart? While absolute price is important (pricey stocks are difficult to purchase, which affects not only their volatility but your ability to trade that stock), when trading, we are more concerned about the relative change of an asset rather than its absolute price. Google’s stocks are much more expensive than Apple’s or Microsoft’s, and this difference makes Apple’s and Microsoft’s stocks appear much less volatile than they truly are (that is, their price appears to not deviate much).

    One solution would be to use two different scales when plotting the data; one scale will be used by Apple and Microsoft stocks, and the other by Google.

plot(as.zoo(stocks[, c("AAPL.Close", "MSFT.Close")]), screens = 1, lty = 1:2,
     xlab = "Date", ylab = "Price")
par(new = TRUE)
plot(as.zoo(stocks[, "GOOG.Close"]), screens = 1, lty = 3, xaxt = "n", yaxt = "n",
     xlab = "", ylab = "")
axis(4)
mtext("Price", side = 4, line = 3)
legend("topleft", c("AAPL (left)", "MSFT (left)", "GOOG"), lty = 1:3, cex = 0.5)

    Not only is this solution difficult to implement well, it is seen as a bad visualization method; it can lead to confusion and misinterpretation, and cannot be read easily.

    A “better” solution, though, would be to plot the information we actually want: the stock’s returns. This involves transforming the data into something more useful for our purposes. There are multiple transformations we could apply.

One transformation would be to consider the stock’s return since the beginning of the period of interest. In other words, we plot:

$\text{return}_{t,0} = \frac{\text{price}_t}{\text{price}_0}$

    This will require transforming the data in the stocks object, which I do next.

# Get me my beloved pipe operator!
if (!require("magrittr")) {
    install.packages("magrittr")
    library(magrittr)
}

## Loading required package: magrittr

# Divide each day's closing prices by the closing prices on the first day of
# the period, so that every series starts at 1
stock_return <- apply(stocks, 1, function(x) {x / stocks[1, ]}) %>% t %>% as.xts
head(stock_return)

##            AAPL.Close MSFT.Close GOOG.Close
## 2016-01-04  1.0000000  1.0000000  1.0000000
## 2016-01-05  0.9749407  1.0045620  1.0009975
## 2016-01-06  0.9558614  0.9863139  1.0023994
## 2016-01-07  0.9155197  0.9520073  0.9791734
## 2016-01-08  0.9203607  0.9549271  0.9631052
## 2016-01-11  0.9352634  0.9543796  0.9652081

plot(as.zoo(stock_return), screens = 1, lty = 1:3, xlab = "Date",
     ylab = "Return")
legend("topleft", c("AAPL", "MSFT", "GOOG"), lty = 1:3, cex = 0.5)

    This is a much more useful plot. We can now see how profitable each stock was since the beginning of the period. Furthermore, we see that these stocks are highly correlated; they generally move in the same direction, a fact that was difficult to see in the other charts.
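A quick way to back up that observation on correlation, not part of the original post, is to compute the correlation matrix of the return series directly:

# Correlation matrix of the cumulative return series; values near 1 mean the
# stocks have tended to move together over the period (coredata() strips the
# xts index so cor() sees a plain numeric matrix)
round(cor(coredata(stock_return)), 2)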

Alternatively, we could plot the change of each stock per day. One way to do so would be to plot the percentage increase of a stock when comparing day $t$ to day $t - 1$, with the formula:

$\text{increase}_t = \frac{\text{price}_t - \text{price}_{t-1}}{\text{price}_t}$

But change could be thought of differently as:

$\text{increase}_t = \frac{\text{price}_t - \text{price}_{t-1}}{\text{price}_{t-1}}$

These formulas are not the same and can lead to differing conclusions, but there is another way to model the growth of a stock: with log differences.

$\text{change}_t = \log(\text{price}_t) - \log(\text{price}_{t-1})$

(Here, $\log$ is the natural log, and our definition does not depend as strongly on whether we use $\text{price}_t$ or $\text{price}_{t-1}$ in the denominator.) The advantage of using log differences is that this difference can be interpreted as the percentage change in a stock but does not depend on the denominator of a fraction.

    We can obtain and plot the log differences of the data in stocks as follows:

# Daily log differences: take logs, then difference each series
stock_change <- stocks %>% log %>% diff
head(stock_change)

##              AAPL.Close    MSFT.Close   GOOG.Close
## 2016-01-04           NA            NA           NA
## 2016-01-05 -0.025378648  0.0045516693  0.000997009
## 2016-01-06 -0.019763704 -0.0183323194  0.001399513
## 2016-01-07 -0.043121062 -0.0354019469 -0.023443064
## 2016-01-08  0.005273804  0.0030622799 -0.016546113
## 2016-01-11  0.016062548 -0.0005735067  0.002181138

plot(as.zoo(stock_change), screens = 1, lty = 1:3, xlab = "Date",
     ylab = "Log Difference")
legend("topleft", c("AAPL", "MSFT", "GOOG"), lty = 1:3, cex = 0.5)

Which transformation do you prefer? Looking at returns since the beginning of the period makes the overall trend of the securities in question much more apparent. Changes between days, though, are what more advanced methods actually consider when modelling the behavior of a stock, so they should not be ignored.
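As an aside (this check is mine, not the author's), log differences have the convenient property of being additive: summing each stock's daily log differences over the period and exponentiating recovers the overall growth factor plotted earlier, tying the two transformations together.

# Daily log differences are additive: their sum, exponentiated, is the total
# growth factor over the period, which should match the final row of
# stock_return (na.rm = TRUE drops the leading NA row)
exp(colSums(stock_change, na.rm = TRUE))
stock_return[nrow(stock_return), ]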

    Moving Averages

    Charts are very useful. In fact, some traders base their strategies almost entirely off charts (these are the “technicians”, since trading strategies based off finding patterns in charts is a part of the trading doctrine known as technical analysis). Let’s now consider how we can find trends in stocks.

A $q$-day moving average is, for a series $x_t$ and a point in time $t$, the average of the past $q$ days: that is, if $MA^q_t$ denotes a moving average process, then:

$MA^q_t = \frac{1}{q} \sum_{i=0}^{q-1} x_{t-i}$

Moving averages smooth a series and help identify trends. The larger $q$ is, the less responsive a moving average process is to short-term fluctuations in the series $x_t$. The idea is that moving average processes help identify trends from “noise”. Fast moving averages have a smaller $q$ and more closely follow the stock, while slow moving averages have a larger $q$, resulting in them responding less to the fluctuations of the stock and being more stable.
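Before letting quantmod draw moving averages for us, it may help to see that a moving average is just another series we can compute, for example with the SMA() function from the TTR package (attached alongside quantmod). This snippet is mine, not part of the original post:

# 20-day simple moving average of AAPL's closing price, via TTR::SMA()
sma20 <- SMA(Cl(AAPL), n = 20)
head(sma20, 25)  # the first 19 values are NA: the 20-day window is not yet full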

    quantmod allows for easily adding moving averages to charts, via the addSMA() function.

candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white")
addSMA(n = 20)

Notice how late the rolling average begins: it cannot be computed until 20 days have passed. This limitation becomes more severe for longer moving averages. Because I would like to be able to compute 200-day moving averages, I’m going to extend the range of AAPL data we fetch. That said, we will still largely focus on 2016.

start = as.Date("2010-01-01")
getSymbols(c("AAPL", "MSFT", "GOOG"), src = "yahoo", from = start, to = end)

## [1] "AAPL" "MSFT" "GOOG"

# The subset argument allows specifying the date range to view in the chart.
# This uses xts style subsetting. Here, I'm using the idiom
# 'YYYY-MM-DD/YYYY-MM-DD', where the date on the left-hand side of the / is
# the start date, and the date on the right-hand side is the end date. If
# either is left blank, either the earliest date or latest date in the
# series is used (as appropriate). This method can be used for any xts
# object, say, AAPL
candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white",
            subset = "2016-01-04/")
addSMA(n = 20)

    You will notice that a moving average is much smoother than the actual stock data. Additionally, it’s a stubborn indicator; a stock needs to be above or below the moving average line in order for the line to change direction. Thus, crossing a moving average signals a possible change in trend, and should draw attention.

    Traders are usually interested in multiple moving averages, such as the 20-day, 50-day, and 200-day moving averages. It’s easy to examine multiple moving averages at once.

candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white",
            subset = "2016-01-04/")
addSMA(n = c(20, 50, 200))

The 20-day moving average is the most sensitive to local changes, and the 200-day moving average the least. Here, the 200-day moving average indicates an overall bearish trend: the stock is trending downward over time. The 20-day moving average is at times bearish and at other times bullish (bullish meaning an upward swing is expected). You can also see that the crossing of moving average lines indicates a change in trend. These crossings are what we can use as trading signals, or indications that a financial security is changing direction and a profitable trade might be made.
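To make the idea of a crossing concrete, here is a small sketch of my own (not the strategy developed in the follow-up post) that flags the days on which the 20-day moving average crosses the 50-day moving average:

# Fast and slow simple moving averages of the closing price
fast <- SMA(Cl(AAPL), n = 20)
slow <- SMA(Cl(AAPL), n = 50)

# regime is +1 when the fast average sits above the slow one, -1 when below;
# a day-to-day change in regime marks a crossover: +2 is a bullish cross,
# -2 a bearish cross
regime <- sign(fast - slow)
signal <- diff(regime)
signal[!is.na(signal) & signal != 0]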

    Visit next week to read about how to design and test a trading strategy using moving averages.

# Package/system information
sessionInfo()

## R version 3.3.3 (2017-03-06)
## Platform: i686-pc-linux-gnu (32-bit)
## Running under: Ubuntu 15.10
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base
##
## other attached packages:
## [1] magrittr_1.5     quantmod_0.4-7   TTR_0.23-1       xts_0.9-7
## [5] zoo_1.7-14       RWordPress_0.2-3 optparse_1.3.2   knitr_1.15.1
##
## loaded via a namespace (and not attached):
##  [1] lattice_0.20-34 XML_3.98-1.5    bitops_1.0-6    grid_3.3.3
##  [5] formatR_1.4     evaluate_0.10   highr_0.6       stringi_1.1.3
##  [9] getopt_1.20.0   tools_3.3.3     stringr_1.2.0   RCurl_1.95-4.8
## [13] XMLRPC_0.3-0

To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website.

    Analyzing Accupedo step count data in R: Part 2 – Adding weather data

    Mon, 03/27/2017 - 16:30

    (This article was first published on Method Matters, and kindly contributed to R-bloggers)

    In my last set of posts, I wrote about analyzing data from the Accupedo step counter app I have on my phone. In this post, I’ll talk about some additional analysis I’ve done by merging the step counter data with weather data from another source.

The website www.wunderground.com offers freely available weather data for most parts of the world. According to their website, their data are recorded by a crowd-sourced team of weather enthusiasts who report the weather conditions from personal stations at their homes. Pretty cool project, and it was really easy to search for my city and download the data in a .csv format. In this post, we will focus on the Mean Temperature (for the day) value, recorded in degrees Celsius.

    Data Preparation

    The step count data here are aggregated to the daily level, as I didn’t have more granular weather information. In all honesty, it probably makes more sense to analyze the data at the day level (as I’ve done); it would likely be hard to see hourly evolution in the weather matching hourly evolution in my step counts. My movements are pretty constrained during certain hours of the day (e.g. when I’m at work), and so examining that granular of a relationship is unlikely to be very illuminating.

    I was able to match all of the dates that I had step counts for with the daily weather information from www.wunderground.com, and merged the weather data into the main step count dataset described in the previous post. I then created the analysis dataset, a day-level version of the walking data aggregated across the two variables I had created previously – day and day of week (essentially this aggregates to the day-level, while keeping a column describing the day of the week – e.g. Monday, Tuesday, etc.).
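The merge step itself isn't shown in the post; a minimal sketch of what it could look like, with hypothetical object and column names (weather.csv, walking, and the Date column are my assumptions, not the author's code), is:

# Hypothetical sketch of the merge step (file, object, and column names are
# assumptions): read the daily weather export and join it to the step data
# by calendar date
weather <- read.csv("weather.csv", stringsAsFactors = FALSE)
weather$date <- as.Date(weather$Date)        # assumed date column in the csv
walking$date <- as.Date(walking$date_time)   # calendar date of each step record
walking_weather <- merge(walking, weather, by = "date", all.x = TRUE)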

    # load the required packages
    library(plyr); library(dplyr)
    library(lubridate)
    library(ggplot2)

    # make the variables to aggregate across
    # the master data set with steps and temperature is called walking_weather
    walking_weather$oneday <- cut(walking_weather$date_time, breaks = "1 day")
    walking_weather$dow <- wday(walking_weather$date_time, label = TRUE)

    # aggregate by day, day of week
    # max of step count; mean of temperature
    ww_per_day <- walking_weather %>% group_by(oneday, dow) %>%
    summarise(steps = max(steps), temp = mean(Mean_TemperatureC))

    As before, when aggregating I kept the maximum step count per day (as these counts are cumulative), and computed the average temperature in degrees Celsius per day (because there are multiple observations per day in the non-aggregated data, taking the average across days simply returns the daily average temperature from the daily weather data).

    The resulting dataset looks like this:

    Data Visualization and Analysis

    As a first step, let’s make histograms of the temperature and step count variables:

    # histograms of temperature and step count at a daily level
    # using the base plotting system
    hist(ww_per_day$temp, col ='darkblue',
    xlab = 'Daily Average Temperature (Celsius)',
    main = 'Histogram of Temperature')

    hist(ww_per_day$steps, col ='darkgreen',
    xlab = 'Daily Total Step Count',
    main = 'Histogram of Step Count')

    The temperatures have ranged from -4 degrees to 27 degrees Celsius, with a mean of 11.66 (SD = 5.83). 

     

    The daily step counts have ranged from 6 to 65,337, with a mean of 18,294.04 (SD = 7,808.36).
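For completeness, these summary figures can be computed straight from the aggregated data; a quick snippet of my own along those lines:

# descriptive statistics for daily temperature and step count
range(ww_per_day$temp); mean(ww_per_day$temp); sd(ww_per_day$temp)
range(ww_per_day$steps); mean(ww_per_day$steps); sd(ww_per_day$steps)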

    In order to visualize the relationship between my daily step count and the daily average temperature, I made a simple scatterplot, with a linear model line superimposed on the graph (95% confidence intervals shown in light grey):

    # plot of steps by temperature
    p = ggplot(data = ww_per_day, aes(x = temp, y = steps)) +
    geom_point() +
    coord_cartesian(ylim = c(0, 25000)) +
    geom_smooth(method="lm") +
    theme(legend.title=element_blank()) +
    labs(x = "Temperature (Celsius)", y = "Steps" )
    p

    There is a slight positive relationship between temperature and step count, such that on warmer days I tend to walk more. A simple linear regression will estimate the size of this relationship:

    summary(lm(steps~temp , data = ww_per_day ))

    The estimates given:

            Estimate Std. Error t value Pr(>|t|)
(Intercept) 17277.29     681.76  25.342  < 2e-16 ***
temp            87.21      52.31   1.667    0.096 .

    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 7798 on 653 degrees of freedom
    Multiple R-squared: 0.004238, Adjusted R-squared: 0.002713
    F-statistic: 2.779 on 1 and 653 DF, p-value: 0.09597

The model estimates that I walk an additional 87.21 steps for each one-degree increase in the daily average temperature. However, the p-value associated with this estimate (0.096) falls short of conventional statistical significance; at best it suggests a marginal trend.

    In a previous post we saw that my walking patterns tend to be different during the weekdays and the weekend. I was therefore curious to examine the relationship between temperature and step count across these two different types of days.

    I generated the week/weekend variable (as in the previous post) and the plot with the following code:

    # generate the weekday vs. weekend variable
    ww_per_day$week_weekend[ww_per_day$dow == 'Sun'] <- 'Weekend'
    ww_per_day$week_weekend[ww_per_day$dow == 'Sat'] <- 'Weekend'
    ww_per_day$week_weekend[ww_per_day$dow == 'Mon'] <- 'Weekday'
    ww_per_day$week_weekend[ww_per_day$dow == 'Tues'] <- 'Weekday'
    ww_per_day$week_weekend[ww_per_day$dow == 'Wed'] <- 'Weekday'
    ww_per_day$week_weekend[ww_per_day$dow == 'Thurs'] <- 'Weekday'
    ww_per_day$week_weekend[ww_per_day$dow == 'Fri'] <- 'Weekday'

    # plot the relationship between temperature and step count
    # with separate colors and lines for weekday vs. weekend
    p = ggplot(data = ww_per_day, aes(x = temp, y = steps, color=week_weekend)) +
    geom_point() +
    coord_cartesian(ylim = c(0, 25000)) +
    geom_smooth(method="lm") +
    theme(legend.title=element_blank()) +
    labs(x = "Temperature (Celsius)", y = "Steps" )
    p

    The relationship between temperature and step count does indeed appear to be different on weekdays vs. the weekend:

    The general pattern is that there’s not much of a relationship between temperature and step count during the week (after all, I have to walk to work whether it’s warm or cold), but on the weekends I walk quite a bit more when it’s warmer out.

    Notice how the confidence intervals get wider at the extremes of the x-axis. These are regions where there are simply less data (by definition, extreme temperatures are rare). This pattern is especially noticeable for the weekend days (there are far fewer weekend days than there are week days, and so we have even less data here at the margins).

    Although I’m in general not particularly interested in null-hypothesis significance testing of regression coefficients in this type of analysis, I went through the formality of testing the time period (weekday vs. weekend) by temperature interaction.

    summary(lm(steps~temp + as.factor(week_weekend)+ temp:as.factor(week_weekend),
    data = ww_per_day ))


                                     Estimate Std. Error t value Pr(>|t|)
(Intercept)                          19178.76     774.45   24.76  < 2e-16 ***
temp                                    23.46      59.52    0.39     0.69
as.factor(week_weekend)Weekend       -6982.66    1486.25   -4.70     0.00 ***
temp:as.factor(week_weekend)Weekend    246.76     113.76    2.17     0.03 *
---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 7560 on 651 degrees of freedom
    Multiple R-squared: 0.06682, Adjusted R-squared: 0.06252
    F-statistic: 15.54 on 3 and 651 DF, p-value: 9.005e-10

    The results show that the interaction term is statistically significant; we already understood the shape and meaning of the interaction with the above plot.

    More interesting for our purposes is the model summary. The adjusted multiple R squared is about 6%, much greater than the .3% from the model above (where we only include temperature as a predictor), but not very high overall. The take-home message is that time period (weekday vs. weekend) and temperature have clear relationships with step count, but together and in combination they don’t explain a huge amount of variance in the amount of steps I walk each day.

    I wondered about the importance of the interaction term, actually. Interactions are widely sought after in certain academic contexts, but I enjoy the simplicity and ease-of-interpretation of main effects models for applied problems which require interpretable solutions (e.g. “white box” models). Bluntly put: how important is the interaction term in this context? One might argue that, although the p-value for the interaction term passes the threshold for statistical significance, knowing that the interaction term is (statistically) significantly different from zero doesn’t bring us much further in our understanding of the problem at hand, after accounting for the main effects of time period and temperature.

    Model Comparison

    In order to more concretely investigate the importance of the time period by temperature interaction term, I computed the model predicting step count with just the main effects of time period and temperature (lm_1 below), and also the model with the main effects and their interaction term (lm_2 below). I then performed a model comparison to see if adding the interaction term results in a statistically significant reduction in the residual sum of squares. In other words, is the difference between the predictive performance of the first and second model greater than zero?

    # main effects model
    lm_1 <- lm(steps~temp + as.factor(week_weekend), data = ww_per_day )
    summary(lm_1)

    # main effects + interaction model
    lm_2 <- lm(steps~temp + as.factor(week_weekend)+ temp:as.factor(week_weekend),
    data = ww_per_day )
    summary(lm_2)

    # model comparison
    anova(lm_1,lm_2)

    Which gives the following result:
    table.tableizer-table { font-size: 12px; border: 1px solid #CCC; font-family: Arial, Helvetica, sans-serif; } .tableizer-table td { padding: 4px; margin: 3px; border: 1px solid #CCC; } .tableizer-table th { background-color: #104E8B; color: #FFF; font-weight: bold; }

Model Res.Df      RSS Df Sum of Sq    F Pr(>F)
1        652 3.75e+10
2        651 3.72e+10  1 268916758 4.70   0.03 *
---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    This comparison suggests that adding the interaction term results in a statistically significant decrease in the residual sum of squares. In other words, we can better predict my step count if we include the interaction term.

    Conclusion

    To sum up, in this post we took the daily total step counts from my Accupedo step counter app and added data about the average temperature on each day that steps were recorded. We used scatterplots to explore and understand the relationship between these two variables, both overall and according to the time period (e.g. weekday vs. weekend).

    We also used linear regression to investigate the relationship between daily temperature and daily step count. When predicting step count from temperature, we saw that there was a non-statistically significant trend such that I walk more on warmer days. When adding time period as a categorical predictor in an interactive regression model, we saw a statistically significant interaction between temperature and time period, such that the relationship between temperature and step count was stronger on weekends vs. weekdays. In other words, I walk more when it’s warmer out, but only during the weekend.

    Caveat: Null Hypothesis Significance Testing (NHST) in Exploratory Data Analysis

    We used NHST and p-values to interpret regression coefficients and to compare the predictive performance of the main effects vs. interaction regression models. I’m somewhat ambivalent about the use of p-values in this context, honestly. I’ve been very influenced by Andrew Gelman’s work, particularly his wonderful blog, and am very conscious of the limitations of p-values in exploratory contexts like this one. If I had to choose the most interesting test of statistical significance in this post, I would choose the model comparison: it puts an explicit focus on the relative performance of competing (nested) models, rather than simply using significance testing to determine whether individual regression coefficients are significantly different from zero.

    The statistical approach we’ve seen in this post conforms very much to the way I was trained in the social sciences. In this tradition, a strong focus is put on NHST (via p-values), and less frequently on interpreting model R square and performing formal model comparisons. In the work I do now as an applied data analyst (or “data scientist”), I often take an approach that borrows heavily from machine learning (though these ideas are also present in traditional statistical thinking): evaluating models via predictive accuracy on a test dataset that was not used in data modeling.

    Coming Up Next

    In the next set of posts, we will analyze a different dataset using an approach that leans more heavily towards this machine-learning perspective. We’ll also change analytical software, using Python/Pandas and the scikit-learn library to analyze the data. Stay tuned!

To leave a comment for the author, please follow the link and comment on their blog: Method Matters.

    #0: Introducing R^4

    Mon, 03/27/2017 - 15:10

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

So I had been toying with the idea of getting back to the blog and more regularly writing / posting little tips and tricks. I even started taking some notes, but because perfect is always the enemy of the good, it never quite materialized.

    But the relatively broad discussion spawned by last week’s short rant on Suggests != Depends made a few things clear. There appears to be an audience. It doesn’t have to be long. And it doesn’t have to be too polished.

    So with that, let’s get the blogging back from micro-blogging.

This note forms post zero of what will be a new segment I call R4, which is shorthand for relatively random R rambling.

    Stay tuned.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

    Introducing brotools

    Mon, 03/27/2017 - 09:23

I’m happy to announce my first R package, called brotools. This is a package that contains functions that are specific to my needs but that you might also find useful. I blogged about some of these functions, so if you follow my blog you might already be familiar with some of them. It is not on CRAN and might very well never be. The code is hosted on bitbucket and you can install the package with

    devtools::install_bitbucket("b-rodrigues/brotools")

Hope you’ll find brotools useful!

    Fitting Bayesian Linear Mixed Models for continuous and binary data using Stan: A quick tutorial

    Mon, 03/27/2017 - 08:48

    (This article was first published on Shravan Vasishth's Slog (Statistics blog), and kindly contributed to R-bloggers)

    I want to give a quick tutorial on fitting Linear Mixed Models (hierarchical models) with a full variance-covariance matrix for random effects (what Barr et al 2013 call a maximal model) using Stan.

    For a longer version of this tutorial, see: Sorensen, Hohenstein, Vasishth, 2016.

Prerequisites: You need to have R installed (RStudio is optional but convenient), along with the rstan package. See here. I am also assuming you have fit lmer models like these before:
    lmer(log(rt) ~ 1+RCType+dist+int+(1+RCType+dist+int|subj) + (1+RCType+dist+int|item), dat)

    If you don’t know what the above code means, first read chapter 4 of my lecture notes.
The code and data format needed to fit LMMs in Stan

The data

    I assume you have a 2×2 repeated measures design with some continuous measure like reading time (rt) data and want to do a main effects and interaction contrast coding. Let’s say your main effects are RCType and dist, and the interaction is coded as int. All these contrast codings are $\pm 1$. If you don’t know what contrast coding is, see these notes and read section 4.3 (although it’s best to read the whole chapter). I am using an excerpt of an example data-set from Husain et al. 2014.
    "subj" "item" "rt""RCType" "dist" "int"
    1 14 438 -1 -1 1
    1 16 531 1 -1 -1
    1 15 422 1 1 1
    1 18 1000 -1 -1 1
    ...

    Assume that these data are stored in R as a data-frame with name rDat.
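If the data live in a plain text file shaped like the excerpt above, a minimal sketch for reading them in (the file name is an assumption) might be the following; note that subj and item should be factors, since the data preparation below relies on nlevels() and as.integer():

# a sketch, assuming the excerpt is saved in a file called "data.txt"
rDat <- read.table("data.txt", header = TRUE)
rDat$subj <- factor(rDat$subj)
rDat$item <- factor(rDat$item)
str(rDat)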
    The Stan code

    Copy the following Stan code into a text file and save it as the file matrixModel.stan. For continuous data like reading times or EEG, you never need to touch this file again. You will only ever specify the design matrix X and the structure of the data. The rest is all taken care of.
    data {
    int N; //no trials
    int P; //no fixefs
    int J; //no subjects
    int n_u; //no subj ranefs
    int K; //no items
    int n_w; //no item ranefs
    int subj[N]; //subject indicator
    int item[N]; //item indicator
    row_vector[P] X[N]; //fixef design matrix
    row_vector[n_u] Z_u[N]; //subj ranef design matrix
    row_vector[n_w] Z_w[N]; //item ranef design matrix
    vector[N] rt; //reading time
    }

    parameters {
    vector[P] beta; //fixef coefs
    cholesky_factor_corr[n_u] L_u; //cholesky factor of subj ranef corr matrix
    cholesky_factor_corr[n_w] L_w; //cholesky factor of item ranef corr matrix
    vector[n_u] sigma_u; //subj ranef std
    vector[n_w] sigma_w; //item ranef std
    real sigma_e; //residual std
    vector[n_u] z_u[J]; //spherical subj ranef
    vector[n_w] z_w[K]; //spherical item ranef
    }

    transformed parameters {
    vector[n_u] u[J]; //subj ranefs
    vector[n_w] w[K]; //item ranefs
    {
    matrix[n_u,n_u] Sigma_u; //subj ranef cov matrix
    matrix[n_w,n_w] Sigma_w; //item ranef cov matrix
    Sigma_u = diag_pre_multiply(sigma_u,L_u);
    Sigma_w = diag_pre_multiply(sigma_w,L_w);
    for(j in 1:J)
    u[j] = Sigma_u * z_u[j];
    for(k in 1:K)
    w[k] = Sigma_w * z_w[k];
    }
    }

    model {
    //priors
    L_u ~ lkj_corr_cholesky(2.0);
    L_w ~ lkj_corr_cholesky(2.0);
    for (j in 1:J)
    z_u[j] ~ normal(0,1);
    for (k in 1:K)
    z_w[k] ~ normal(0,1);
    //likelihood
    for (i in 1:N)
    rt[i] ~ lognormal(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]], sigma_e);
    }
    Define the design matrix

    Since we want to test the main effects coded as the columns RCType, dist, and int, our design matrix will look like this:
    # Make design matrix
    X <- unname(model.matrix(~ 1 + RCType + dist + int, rDat))
    attr(X, "assign") <- NULL
    Prepare data for Stan

    Stan expects the data in a list form, not as a data frame (unlike lmer). So we set it up as follows:
    # Make Stan data
    stanDat <- list(N = nrow(X),
    P = ncol(X),
    n_u = ncol(X),
    n_w = ncol(X),
    X = X,
    Z_u = X,
    Z_w = X,
    J = nlevels(rDat$subj),
    K = nlevels(rDat$item),
    rt = rDat$rt,
    subj = as.integer(rDat$subj),
    item = as.integer(rDat$item))
    Load library rstan and fit Stan model
    library(rstan)
    rstan_options(auto_write = TRUE)
    options(mc.cores = parallel::detectCores())

    # Fit the model
    matrixFit <- stan(file = "matrixModel.stan", data = stanDat,
    iter = 2000, chains = 4)
    Examine posteriors
    print(matrixFit)

    This print output is overly verbose. I wrote a simple function to get the essential information quickly.
stan_results <- function(m, params = paramnames){
  # extract posterior samples for the requested parameters
  # (use 'params' here, not the global default 'paramnames')
  m_extr <- extract(m, pars = params)
  par_names <- names(m_extr)
  means <- lapply(m_extr, mean)
  quantiles <- lapply(m_extr,
                      function(x) quantile(x, probs = c(0.025, 0.975)))
  means <- data.frame(means)
  quants <- data.frame(quantiles)
  summry <- t(rbind(means, quants))
  colnames(summry) <- c("mean", "lower", "upper")
  summry
}

    For example, if I want to see only the posteriors of the four beta parameters, I can write:
    stan_results(matrixFit, params=c("beta[1]","beta[2]","beta[3]","beta[4]"))

    For more details, such as interpreting the results and computing things like Bayes Factors, see Nicenboim and Vasishth 2016.
    FAQ: What if I don’t want to fit a lognormal?

    In the Stan code above, I assume a lognormal function for the reading times:
    rt[i] ~ lognormal(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]], sigma_e);

    If this upsets you deeply and you want to use a normal distribution (and in fact, for EEG data this makes sense), go right ahead and change the lognormal to normal:
    rt[i] ~ normal(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]], sigma_e);
    FAQ: What if I my dependent measure is binary (0,1) responses?

    Use this Stan code instead of the one shown above. Here, I assume that you have a column called response in the data, which has 0,1 values. These are the trial level binary responses.
    data {
    int N; //no trials
    int P; //no fixefs
    int J; //no subjects
    int n_u; //no subj ranefs
    int K; //no items
    int n_w; //no item ranefs
    int subj[N]; //subject indicator
    int item[N]; //item indicator
    row_vector[P] X[N]; //fixef design matrix
    row_vector[n_u] Z_u[N]; //subj ranef design matrix
    row_vector[n_w] Z_w[N]; //item ranef design matrix
    int response[N]; //response
    }

    parameters {
    vector[P] beta; //fixef coefs
    cholesky_factor_corr[n_u] L_u; //cholesky factor of subj ranef corr matrix
    cholesky_factor_corr[n_w] L_w; //cholesky factor of item ranef corr matrix
    vector[n_u] sigma_u; //subj ranef std
    vector[n_w] sigma_w; //item ranef std
    vector[n_u] z_u[J]; //spherical subj ranef
    vector[n_w] z_w[K]; //spherical item ranef
    }

    transformed parameters {
    vector[n_u] u[J]; //subj ranefs
    vector[n_w] w[K]; //item ranefs
    {
    matrix[n_u,n_u] Sigma_u; //subj ranef cov matrix
    matrix[n_w,n_w] Sigma_w; //item ranef cov matrix
    Sigma_u = diag_pre_multiply(sigma_u,L_u);
    Sigma_w = diag_pre_multiply(sigma_w,L_w);
    for(j in 1:J)
    u[j] = Sigma_u * z_u[j];
    for(k in 1:K)
    w[k] = Sigma_w * z_w[k];
    }
    }

    model {
    //priors
    beta ~ cauchy(0,2.5);
    sigma_u ~ cauchy(0,2.5);
    sigma_w ~ cauchy(0,2.5);
    L_u ~ lkj_corr_cholesky(2.0);
    L_w ~ lkj_corr_cholesky(2.0);
    for (j in 1:J)
    z_u[j] ~ normal(0,1);
    for (k in 1:K)
    z_w[k] ~ normal(0,1);
    //likelihood
    for (i in 1:N)
    response[i] ~ bernoulli_logit(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]]);
    }
    For reproducible example code

    See here.

To leave a comment for the author, please follow the link and comment on their blog: Shravan Vasishth's Slog (Statistics blog).

    Updated Shiny app

    Mon, 03/27/2017 - 03:58

    (This article was first published on Bayes Ball, and kindly contributed to R-bloggers)

    A short post to alert the world that my modest Shiny application, showing Major League Baseball run scoring trends since 1901, has been updated to include the 2016 season. The application can be found here:
    https://monkmanmh.shinyapps.io/MLBrunscoring_shiny/.

Beyond the underlying data, the update also removed some of the processing that was happening inside the application and moved it into a pre-processing stage. This processing needs to happen only once, and is not related to the reactivity of the application. This improves the speed of the application: in addition to reducing the processing, it also shrinks the size of the data table loaded into the application.
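One simple way to structure such a split, sketched here purely as an illustration rather than the app's actual code, is to save the fully processed table once and have the app just read it back:

# one-off pre-processing script (object and file names are illustrative only)
# ... build the aggregated table `teamtable` here ...
saveRDS(teamtable, "MLBrunscoring_shiny/teamtable.rds")

# near the top of the app (outside any reactive code), so it runs once per
# session rather than on every reactive update
teamtable <- readRDS("teamtable.rds")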

The third set of changes was a consequence of the updates to the Shiny and ggplot2 packages in the two years that have passed since I built the app. In Shiny, the “format” argument of the sliderInput widget was deprecated. And in ggplot2, it was a change in the quotes around the “method” specification in stat_smooth(). A little thing that took a few minutes to debug! Next up will be some formatting changes, and a different approach to one of the visualizations.

     -30-

To leave a comment for the author, please follow the link and comment on their blog: Bayes Ball.

    Le Monde puzzle [#1001]

    Mon, 03/27/2017 - 00:17

    (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

    After a long lag (due to my missing the free copies distributed at Paris-Dauphine!), here is a Sudoku-like Le Monde mathematical puzzle:

    A grid of size (n,n) holds integer values such that any entry larger than 1 is the sum of one term in the same column and one term in the same row. What is the maximal possible value observed in such a grid when n=3,4?

    This can be solved in R by a random exploration of such possible grids in a simulated annealing spirit:

mat=matrix(1,N,N)
goal=1
targ=function(mat){ #check constraints
  d=0
  for (i in (1:(N*N))[mat>1]){
    r=(i-1)%%N+1;c=(i-1)%/%N+1
    d=d+(min(abs(mat[i]-outer(mat[-r,c],mat[r,-c],"+")))>0)}
  return(d)}
cur=0
for (t in 1:1e6){
  i=sample(1:(N*N),1);prop=mat
  prop[i]=sample(1:(2*goal),1)
  d=targ(prop)
  if (10*log(runif(1))/t<cur-d){
    mat=prop;cur=d}
  if ((d==0)&(max(prop)>goal)){
    goal=max(prop);maxx=prop}}

    returning a value of 8 for n=3 and 37 for n=4. However, the method is quite myopic and I tried instead a random filling of the grid, using each time the maximum possible sum for empty cells:

goal=1
for (v in 1:1e6){
  mat=matrix(0,N,N) #one 1 per row/col
  for (i in 1:N) mat[i,sample(1:N,1)]=1
  for (i in 1:N) if (max(mat[,i])==0) mat[sample(1:N,1),i]=1
  while (min(mat)==0){
    parm=sample(1:(N*N)) #random order
    for (i in parm[mat[parm]==0]){
      r=(i-1)%%N+1;c=(i-1)%/%N+1
      if ((max(mat[-r,c])>0)&(max(mat[r,-c])>0)){
        mat[i]=max(mat[-r,c])+max(mat[r,-c])
        break()}}}
  if (goal<max(mat)){
    goal=max(mat);maxx=mat}}

    which recovered a maximum of 8 for n=3, but reached 48 for n=4. And 211 for n=5, 647 for n=6… For instance, here is the solution for n=4:

[1,]  1  5 11 10
[2,]  2  4  1  5
[3,] 48  2 24  1
[4,] 24  1 22 11

    While the update in the above is random and associated with the first term in the permutation, it may be preferable to favour the largest possible term at each iteration, which I tried as

while (min(mat)==0){
  parm=sample(1:(N*N))
  val=0*parm
  for (i in parm[mat[parm]==0]){
    r=(i-1)%%N+1;c=(i-1)%/%N+1
    if ((max(mat[-r,c])>0)&(max(mat[r,-c])>0)){
      val[i]=max(mat[-r,c])+max(mat[r,-c])}
  }
  #largest term
  i=order(-val)[1];mat[i]=val[i]}

    For n=4, I did not recover the maximal value 48, but achieved larger values for n=5 (264) and n=6 (2256).

As an aside, the R code sometimes led to a strange error message linked with the function sample(): too large a bound in the range produces the following

> sample(1:1e10,1)
Error: cannot allocate vector of size 74.5 Gb

    meaning that 1:1e10 first creates a vector for all the possible values. The alternative

> sample.int(1e10,1)
[1] 7572058778

    works, however. And only breaks down for 10¹².


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

    RcppTOML 0.1.2

    Mon, 03/27/2017 - 00:12

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A new release of RcppTOML is now on CRAN. This release fixes a few parsing issues for less frequently-used inputs: vectors of boolean or date(time) types, as well as table array input.

    RcppTOML brings TOML to R. TOML is a file format that is most suitable for configurations, as it is meant to be edited by humans but read by computers. It emphasizes strong readability for humans while at the same time supporting strong typing as well as immediate and clear error reports. On small typos you get parse errors, rather than silently corrupted garbage. Much preferable to any and all of XML, JSON or YAML — though sadly these may be too ubiquitous now. TOML is making good inroads with newer and more flexible projects such as the Hugo static blog compiler, or the Cargo system of Crates (aka "packages") for the Rust language.
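For anyone who has not tried the package, parsing a configuration is a one-line affair with parseTOML(); here is a small self-contained illustration of my own (the TOML content is made up):

library(RcppTOML)

# write a tiny TOML configuration to a temporary file and parse it back
# into a named R list
toml <- '
title = "example"

[database]
ports   = [ 8001, 8001, 8002 ]
enabled = true
'
tf <- tempfile(fileext = ".toml")
writeLines(toml, tf)
parseTOML(tf)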

    Changes in version 0.1.2 (2017-03-26)
    • Dates and Datetimes in arrays in the input now preserve their types instead of converting to numeric vectors (#13)

    • Boolean vectors are also correctly handled (#14)

    • TableArray types are now stored as lists in a single named list (#15)

    • The README.md file was expanded with an example and screenshot.

    • Added file init.c with calls to R_registerRoutines() and R_useDynamicSymbols(); also use .registration=TRUE in useDynLib in NAMESPACE

    • Two example files were updated.

    Courtesy of CRANberries, there is a diffstat report for this release.

More information is on the RcppTOML page. Issues and bugreports should go to the GitHub issue tracker.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

    Economics chapter added to “Empirical software engineering using R”

    Mon, 03/27/2017 - 00:04

    (This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers)

    The Economics chapter of my Empirical software engineering book has been added to the draft pdf (download here).

This is a slim chapter; it might grow a bit, but I suspect not by a huge amount. Reasons include lots of interesting data being confidential and me not having spent a lot of time on this topic over the years (so my stash of accumulated data is tiny). Also, a significant chunk of the economics data I have is used to discuss issues in the Ecosystems and Projects chapters; perhaps some of this material will migrate back once these chapters are finalized.

    You might argue that Economics is more important than Human cognitive characteristics and should have appeared before it (in chapter order). I would argue that hedonism by those involved in producing software is the important factor that pushes (financial) economics into second place (still waiting for data to argue my case in print).

    Some of the cognitive characteristics data I have been waiting for arrived, and has been added to this chapter (some still to be added).

    As always, if you know of any interesting software engineering data, please tell me.

I am after a front cover. A woodcut of alchemists concocting a potion appeals, perhaps with various software references discreetly included, or something astronomy-related (the obvious candidate has already been used). The related modern stuff I have seen does not appeal. Suggestions welcome.

    Ecosystems next.

To leave a comment for the author, please follow the link and comment on their blog: The Shape of Code » R.
