### Make ggplot2 purrr

(This article was first published on ** Econometrics and Free Software**, and kindly contributed to R-bloggers)

I’ll be honest: the title is a bit misleading. I will not use purrr that much in this blog post. Actually, I will use one single purrr function, at the very end. I use dplyr much more. However *Make ggplot2 purrr* sounds better than *Make ggplot dplyr* or whatever the verb for dplyr would be.

Also, this blog post was inspired by a stackoverflow question and in particular one of the answers. So I don’t bring anything new to the table, but I found this stackoverflow answer so useful and so underrated (only 16 upvotes as I’m writing this!) that I wanted to write something about it.

Basically the idea of this blog post is to show how to create graphs using ggplot2, but by grouping by a factor variable beforehand. To illustrate this idea, let’s use the data from the Penn World Tables 9.0. The easiest way to get this data is to install the package called pwt9 with:

install.packages("pwt9")and then load the data with:

data("pwt9.0")Now, let’s load the needed packages. I am also using ggthemes which makes themeing your ggplots very easy. I’ll be making Tufte-style plots.

library(ggplot2) library(ggthemes) library(dplyr) library(purrr) library(pwt9)First let’s select a list of countries:

country_list <- c("France", "Germany", "United States of America", "Luxembourg", "Switzerland", "Greece") small_pwt <- pwt9.0 %>% filter(country %in% country_list)Let’s us also order the countries in the data frame as I have written them in country_list:

small_pwt <- small_pwt %>% mutate(country = factor(country, levels = country_list, ordered = TRUE))You might be wondering why this is important. At the end of the article, we are going to save the plots to disk. If we do not re-order the countries inside the data frame as in country_list, the name of the files will not correspond to the correct plots!

Now when you want to plot the same variable by countries, say avh (*Average annual hours worked by persons engaged*), the usual way to do this is with one of facet_wrap() or facet_grid():

As you can see, for this particular example, facet_grid() is not very useful, but do notice its argument, country~., which is different from facet_wrap()’s argument. This way, I get the graphs stacked horizontally. If I had used facet_grid(~country) the graphs would be side by side and completely unreadable.

Now, let’s go to the meat of this post: what if you would like to have one single graph for each country? You’d probably think of using dplyr::group_by() to form the groups and then the graphs. This is the way to go, but you also have to use dplyr::do(). This is because as far as I understand, ggplot2 is not dplyr-aware, and using an arbitrary function with groups is only possible with dplyr::do().

plots <- small_pwt %>% group_by(country) %>% do(plot = ggplot(data = .) + theme_tufte() + geom_line(aes(y = avh, x = year)) + ggtitle(unique(.$country)) + ylab("Year") + xlab("Average annual hours worked by persons engaged"))If you know dplyr at least a little bit, the above lines should be easy for you to understand. But notice how we get the title of the graphs, with ggtitle(unique(.$country)), which was actually the point of the stackoverflow question. What might be surprising though, is the object that is created by this code. Let’s take a look at plots:

print(plots) ## Source: local data frame [6 x 2] ## Groups: <by row> ## ## # A tibble: 6 × 2 ## country plot ## * <ord> <list> ## 1 France <S3: gg> ## 2 Germany <S3: gg> ## 3 United States of America <S3: gg> ## 4 Luxembourg <S3: gg> ## 5 Switzerland <S3: gg> ## 6 Greece <S3: gg>As dplyr::do()’s documentation tells us, the return values get stored inside a list. And this is exactly what we get back; a list of plots! Lists are a very flexible and useful class, and you cannot spell *list* without purrr (at least not when you’re a neRd).

Here are the final lines that use purrr::map2() to save all these plots at once inside your working directory:

file_names <- paste0(country_list, ".pdf") map2(file_names, plots$plot, ggsave)As I said before, if you do not re-order the countries inside the data frame, the names of the files and the plots will not match. Try running all the code without re-ordering, you’ll see!

I hope you found this post useful. You can follow me on twitter for blog updates.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Econometrics and Free Software**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### New verison of GWmodel on CRAN

As you noticed, there are 2.0-x versions of GWmodel available on CRAN recently. In cotrast to the 1.x-x versions, there are a number of new features, as follows:

1) Some of the more computationally intensive functions have been re-coded with C++ by linking to Rcpp and RcppArmadillo. Their computational efficiency has been greatly improved.

2) In order to make the function names more understandable, some of the funcitons have been re-named:

montecarlo.gwss ---> gwss.montecarlo

montecarlo.gwpca.1 ---> gwpca.montecarlo.1

montecarlo.gwpca.2 ---> gwpca.montecarlo.2

model.selection.gwr ---> gwr.model.selection

model.view.gwr ---> gwr.model.view

model.sort.gwr ---> gwr.model.sort

montecarlo.gwr ---> gwr.montecarlo

writeGWR ---> gwr.write

writeGWR.shp ---> gwr.write.shp

glyph.plot ---> gwpca.glyph.plot

check.components ---> gwpca.check.components

mink.approach ---> gwr.mink.approach

matrixview ---> gwr.mink.matrixview

gwr.generalised ---> ggwr.basic (A reported bug has been fixed with it, and now it is fine with GWmodel_2.0-2 or GWmodel_2.0-3)

Note that we will keep the accesses to to the old names, to ensure the example code work well in the two published vignettes Lu et a. (2014) and Gollini et al. (2015):

Gollini I, Lu B, Charlton M, Brunsdon C, Harris P (2015) GWmodel: an R Package for exploring Spatial Heterogeneity using Geographically Weighted Models. Journal of Statistical Software, 63(17):1-50, http://www.jstatsoft.org/v63/i17/

Lu B, Harris P, Charlton M, Brunsdon C (2014) The GWmodel R Package: further topics for exploring 17(2): 85-101,http://www.tandfonline.com/doi/abs/10.1080/10095020.2014.917453

3) New features and functions are included:

bw.gw.average: Select bandwidth for GW averages;

gwr.mink.pval: Select the values of p for the Minkovski approach for GWR (see details in Lu et al. 2016: Lu, B, Charlton, M, Brunsdon, C & Harris, P(2016). The Minkowski approach for choosing the distance metric in Geographically Weighted Regression. International Journal of Geographical Information Science, 30(2): 351-368. )

For an adaptive kernel, the bandwidth (the number of nearest neighbours) can be specified as a larger number than the number of observations, and in this case the real value for calculating the weights is {the maximum distance between regression location and all the observations} * (bandwidth/number of observations)

4) Bugs repaired for gwr.basic, gw.dist and ggwr.basic (Please update or install the latest version: GWmodel_2.0-3)

Do feel free to make any suggestions/comments on the new versions. Many thanks.

Best regards,

GWmodel Developement Team

_______________________________________________

R-sig-Geo mailing list

[hidden email]

https://stat.ethz.ch/mailman/listinfo/r-sig-geo

### UK government using R to modernize reporting of official statistics

(This article was first published on ** Revolutions**, and kindly contributed to R-bloggers)

Like all governments, the UK government is responsible for producing reports of official statistics on an ongoing basis. That process has traditionally been a highly manual one: extract data from government systems, load it into a mainframe statistical analysis tool and run models and forecasts, extract the results to a spreadsheet to prepare data for presentation, and ultimately combine it all in a manual document editing tool to produce the final report. The process in the UK looks much like this today:

Matt Upson, a Data Scientist at the UK Government Digital Service, is looking to modernize this process with a reproducible analytical pipeline. This new process, based on the UK Government's Technology Service Manual for new IT deployments, aims to simplify the process by using R — the open-source programming language for statistical analysis — to automate the data extraction, analysis, and document generation tasks.

The development of the new process is underway now, with the first target being a report on culture and sporting impacts on the UK economy:

Towards the end of 2016 we embarked on a project with a team in the Department for Culture, Media, and Sport (DCMS) who are responsible for the production of the Economic Estimates for DCMS Sectors Statistical First Release (SFR). Currently this publication is produced with a mix of manual and semi-manual processes. Our aim was to see if we could speed up production of the SFR, whilst maintaining the high standard of the publication and QA.

Central to the process is automating the report with Rmarkdown, a system for programmatically laying out a document while encapsulating all of the R code for data preparation, analysis, tabulation and charting into a collaborative document. As a comprehensive language, R has the functionality to handle all of these tasks natively, without the need to bring other systems into the chain. And it's an ideal tool for the statistical data analysis required for these reports, which must follow the guidance of the Aqua Book, the UK government's manual for producing quality analysis. (The Aqua Book is well worth reading for any data scientist, with excellent guidance on designing studies in the face of uncertain data.)

Naturally, the government has specific standards for the presentation of data, and the R-based process needs to be able to reproduce that look in the final report. The GDS team has produced a package for R, called govstyle, which standardizes the look of data charts while upgrading them to more modern design principles. For example this chart from a report on truancy in UK schools:

has been upgraded and reproduced in R to look like this:

An automated workflow allows for the inclusion of modern devops processes, too. Dependency management for R packages is handled with packrat. Test-driven development comes courtesy of the testthat package, and automated testing (including data verification and code coverage analysis) is provided by Travis CI. And source code control and collaboration is provided by Github, which is also where the entire process is documented as R package: the eesectors package.

For more on the UK government's use of R for official statistics reporting, see the post below from the UK Government Digital Service.

Data at GDS: Reproducible Analytical Pipeline

To **leave a comment** for the author, please follow the link and comment on their blog: ** Revolutions**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### shinyHeatmaply – a shiny app for creating interactive cluster heatmaps

(This article was first published on ** R – R-statistics blog**, and kindly contributed to R-bloggers)

My friend Jonathan Sidi and I (Tal Galili) are pleased to announce the release of shinyHeatmaply (0.1.0): a new Shiny application (and Shiny gadget) for creating interactive cluster heatmaps. shinyHeatmaply is based on the heatmaply R package which strives to make it easy as possible to create interactive cluster heatmaps.

The app introduces a functionality that saves to disk a self contained copy of the htmlwidget as an html file with your data and specifications you set from the UI, so it can be embedded in webpages, blogposts and online web appendices for academic publications.

You can see some of shinyHeatmaply‘s capabilities in the following 40 seconds video:

Installing shinyHeatmaply

From CRAN:

install_packages('shinyHeatmaply')From github:

devtools::install_github('yonicd/shinyHeatmaply') Running the app/gadgetThe application has an import interface as part of the application which currently supports csv, txt, tab, xls, xlsx, rd, rda. You can start the app using:

library(shiny) library(heatmaply) # If you didn't get shinyHeatmaply yet, you can run it through github: # runGitHub("yonicd/shinyHeatmaply",subdir = 'inst/shinyapp') # or just use your locally installed package: library(shinyHeatmaply) runApp(system.file("shinyapp", package = "shinyHeatmaply"))The gadget is called from the R console and accepts input arguments. The object defined as the input to the shinyHeatmaply gadget is a data.frame or a list of data.frames. You can start it using the following code:

library(shinyHeatmaply) #single data.frame data(mtcars) launch_heatmaply(mtcars) #list data(iris) launch_heatmaply(list('Example1'=mtcars,'Example2'=iris))You can see an example of a saved shinyHeatmaply output **here**. Or view the following iframe:

**For issue reports or feature requests, please visit the GitHub repo.**

**Post post credit:** shinyHeatmaply was made thanks to the dedication of Jonathan Sidi, and based on recent features added to heatmaply by Alan O’Callaghan. I am very grateful to them both. This could also not be made possible by the amazing work of the RStudio’s team on Shiny applications, and of Carson Sievert on plotly. And lastly, to my adviser Yoav Benjamini for his support and advices.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – R-statistics blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Gradient Descent

(This article was first published on ** --Jean Arreola--**, and kindly contributed to R-bloggers)

The best way to learn an algorith is to code it. So here it is, my take on Gradient Descent Algorithm for simple linear regression.

First, we fit a simple linear model with lm for comparison with gradient descent values.

#Load libraries library(dplyr) library(highcharter) #Scaling length variables from iris dataset. iris_demo <- iris[,c("Sepal.Length","Petal.Length")] %>% mutate(sepal_length = as.numeric(scale(Sepal.Length)), petal_length = as.numeric(scale(Petal.Length))) %>% select(sepal_length,petal_length) #Fit a simple linear model to compare coefficients. regression <- lm(iris_demo$petal_length~iris_demo$sepal_length) coef(regression) ## (Intercept) iris_demo$sepal_length ## 4.643867e-16 8.717538e-01 iris_demo_reg <- iris_demo iris_demo_reg$reg <- predict(regression,iris_demo) #Plot the model with highcharter highchart() %>% hc_add_series(data = iris_demo_reg, type = "scatter", hcaes(x = sepal_length, y = petal_length), name = "Sepal Length VS Petal Length") %>% hc_add_series(data = iris_demo_reg, type = "line", hcaes(x = sepal_length, y = reg), name = "Linear Regression") %>% hc_title(text = "Linear Regression")We will try to acomplish the same coefficients, this time using Gradient Descent.

library(tidyr) set.seed(135) #To reproduce results #Auxiliary function # y = mx + b reg <- function(m,b,x) return(m * x + b) #Starting point b <- runif(1) m <- runif(1) #Gradient descent function gradient_desc <- function(b, m, data, learning_rate = 0.01){ # Small steps # Column names = Code easier to understand colnames(data) <- c("x","y") #Values for first iteration b_iter <- 0 m_iter <- 0 n <- nrow(data) # Compute the gradient for Mean Squared Error function for(i in 1:n){ # Partial derivative for b b_iter <- b_iter + (-2/n) * (data$y[i] - ((m * data$x[i]) + b)) # Partial derivative for m m_iter <- m_iter + (-2/n) * data$x[i] * (data$y[i] - ((m * data$x[i]) + b)) } # Move to the OPPOSITE direction of the derivative new_b <- b - (learning_rate * b_iter) new_m <- m - (learning_rate * m_iter) # Replace values and return new <- list(new_b,new_m) return(new) } # I need to store some values to make the motion plot vect_m <- m vect_b <- b # Iterate to obtain better parameters for(i in 1:1000){ if(i %in% c(1,100,250,500)){ # I keep some values in the iteration for the plot vect_m <- c(vect_m,m) vect_b <- c(vect_b,b) } x <- gradient_desc(b,m,iris_demo) b <- x[[1]] m <- x[[2]] } print(paste0("m = ", m)) ## [1] "m = 0.871753774273602" print(paste0("b = ", b)) ## [1] "b = 5.52239677041512e-10"The difference in the coefficients is minimal.

We can see how the iterations work in the next plot:

#Compute new values iris_demo$preit <- reg(vect_m[1],vect_b[1],iris_demo$sepal_length) iris_demo$it1 <- reg(vect_m[2],vect_b[2],iris_demo$sepal_length) iris_demo$it100 <- reg(vect_m[3],vect_b[3],iris_demo$sepal_length) iris_demo$it250 <- reg(vect_m[4],vect_b[4],iris_demo$sepal_length) iris_demo$it500 <- reg(vect_m[5],vect_b[5],iris_demo$sepal_length) iris_demo$finalit <- reg(m,b,iris_demo$sepal_length) iris_gathered <- iris_demo %>% gather(key = gr, value = val, preit:finalit) %>% select(-petal_length) %>% distinct() iris_start <- iris_gathered %>% filter(gr == "preit") iris_seq <- iris_gathered %>% group_by(sepal_length) %>% do(sequence = list_parse(select(., y = val))) iris_data <- left_join(iris_start, iris_seq) #Motion Plot irhc2 <- highchart() %>% hc_add_series(data = iris_data, type = "line", hcaes(x = sepal_length, y = val), name = "Gradient Descent") %>% hc_motion(enabled = TRUE, series = 0, startIndex = 0, labels = c("Iteration 1","Iteration 100","Iteration 250","Iteration 500","Final Iteration")) %>% hc_add_series(data = iris_demo_reg, type = "scatter", hcaes(x = sepal_length, y = petal_length), name = "Sepal Length VS Petal Length") %>% hc_title(text = "Gradient Descent Iterations") irhc2Maybe in a future post we can try a multivariate regression model!

To **leave a comment** for the author, please follow the link and comment on their blog: ** --Jean Arreola--**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### A fast method to add annotations to a plot

(This article was first published on ** Bluecology blog**, and kindly contributed to R-bloggers)

Making professional looking plots in R can be fiddly. One task that I often spend ages doing is manually finding coordinates to add labels.

Wouldn’t it be nice if you could just send the coordinates directly to your text editor?

I did some searching and found on stackoverflow that you can send R objects to the clipboard. So here is my solution using that trick.

Adding text to the right position on a plot can be a real hassle. Here I show how to use a simple function to click on a figurea and put coordinates onto your clipboard.

You can get R to send data directly to the clipboard using the pipe command. Below is a little function I wrote that takes coordinates from locator() and sends them to your clipboard. Then you can just hit cmd-v to paste them into your text editor (nb I understand this may need some slight modifications to work on linux or windows, I use OSX):

loccopy <- function(n, digits = 2){ data <- locator(n) data <- round(cbind(data$x, data$y), digits) clip <- pipe("pbcopy", "w") write.table(data, file = clip, col.names = F, row.names = F) close(clip) }Let’s test it out:

set.seed(42) plot(runif(100)) loccopy(1)Now hit cmd-v (or equivalent on your OS).

69.23 0.84Let’s add a label using our fast method for coordinates:

text(69.23, 0.84, "Unusual data point", pos =4, offset = 0)The pos=4 and offset=0 ensures that the text goes directly to the right of our coordinates.

That’s it. Hope it helps speed up your workflow.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Bluecology blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Understanding Linear SVM with R

(This article was first published on ** DataScience+**, and kindly contributed to R-bloggers)

Linear Support Vector Machine or linear-SVM(as it is often abbreviated), is a supervised classifier, generally used in bi-classification problem, that is the problem setting, where there are two classes. Of course it can be extended to multi-class problem. In this work, we will take a mathematical understanding of linear SVM along with R code to understand the critical components of SVM. Taking the liberty to assume that some readers are new to machine learning, let us first find out what the supervised classifier is supposed to do. The term * supervised * comes from the concept of * supervised learning*. In case of supervised learning, you provide some examples along with their labels to your classifier and the classifier then tries to learn important features from the provided examples. For example, consider the following data set. You, will provide a part of this data to your linear SVM and tune the parameters such that your SVM can can act as a discriminatory function separating the ham messages from the spam messages. So, let us add the following R-code to our task.

*type 1 ham 2 ham 3 spam 4 ham 5 ham 6 spam text 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... 2 Ok lar... Joking wif u oni... 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's 4 U dun say so early hor... U c already then say... 5 Nah I don't think he goes to usf, he lives around here though 6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, Â£1.50 to rcv*What Parameters should be tuned

Linear SVM tries to find a separating hyper-plane between two classes with maximum gap in-between. A hyper-plane in \(d\)- dimension is a set of points \(x \in \mathbb R^{d}\) satisfying the equation

$$w^{T}x+b=0$$

Let us denote \(h(x)=w^{T}(x)+b\)

Here \(w \) is a \(d\)-dimensional weight vector while \(b\) is a scalar denoting the bias. Both \(w\) and \(b\) are unknown to you. You will try to find and tune both \(w\) and \(b\) such that \(h(x)\) can separate the spams from hams as accurately as possible. I am drawing an arbitrary line to demonstrate a separating hyper-plane using the following code.

Oh, btw, did you notice that Linear SVM has decision rules. Let \(y\) be an indicator variable which takes the value -1 if the message is spam and 1 if it is a ham. Then, the decision rules are

$$y=-1~iff~h(x)~is~less~than~0$$

$$y=1~iff~h(x)~is~greater~than~0$$

Notice that for any point \(x\) in your data set, \(h(x)\neq 0\) as we want the function \(w^{T}x+b\) to clearly distinguish between spam and ham. This brings us to a important condition.

Linear SVM assumes that the two classes are linearly separable that is a hyper-plane can separate out the two classes and the data points from the two classes do not get mixed up. Of course this is not an ideal assumption and how we will discuss it later how linear SVM works out the case of non-linear separability. But for a reader with some experience here I pose a question which is like this Linear SVM creates a discriminant function but so does LDA. Yet, both are different classifiers. Why ? (Hint: LDA is based on Bayes Theorem while Linear SVM is based on the concept of margin. In case of LDA, one has to make an assumption on the distribution of the data per class. For a newbie, please ignore the question. We will discuss this point in details in some other post.)

Properties of \(w\)Consider two points \(a_{1}\) and \(a_{2}\) lying on the hyper-plane defined by \(w^{T}x+b\). Thus, \(h(a_{1})=w^{T}a_{1}+b=0\) and \(h(a_{2})=w^{T}a_{2}+b=0\). Therefore,

$$w^{T}(a_{1}-a_{2})=0$$.

Notice that the vector \(a_{1}-a_{2}\) is lying on the separating hyper-plane defined by \(w^{T}x+b\). Thus, \(w\) is orthogonal to the hyper-plane.

Consider a point \(x_{i}\) and let \(x_{p}\) be the orthogonal projection of \(x_{i}\) onto the hyper-plane. Assume $h(x_{i})>0$ for \(x\).

\(\frac{w}{\vert\vert w\vert\vert}\) is a unit vector in the direction of \(w\).

Let \(r\) be the distance between \(x_{i}\) and \(x_{p}\). Since \(x_{i}\) is projected on the hyper-plane, there is some loss of information which is calculated as

$$(r=x_{i}-x_{p})$$.

Notice that \(r\) is parallel to the unit vector \(\frac{w}{\vert\vert w\vert\vert}\) and therefore \(r\) can be written as some multiple of \(\frac{w}{\vert\vert w\vert\vert}\). Thus, \(r=c\times \frac{w}{\vert\vert w\vert\vert}\), for some scalar value \(c\). It can be shown that

$$c=\frac{h(x_{i})}{\vert\vert w\vert\vert}$$

Actually, \(c\times \frac{w}{\vert\vert w\vert\vert}\) represents the orthogonal distance of the point \(x_{i}\) from the hyper-plane. Distance has to be positive. \(\frac{w}{\vert\vert w\vert\vert}\) cannot be negative. However, \(c=\frac{h(x)}{\vert\vert w\vert\vert}\). Though, before starting the equations we assumed that \(h(x_{i})\) is positive, but that is not a general case. In general \(c\) can be negative also. Therefore we multiply \(y\) with \(c\). Remember your \(y\) from the decision rules. That is

$$c=y\times \frac{h(x_{i})}{\vert\vert w\vert\vert}$$

Linear SVM is based on the concept of margin which is defined as the smallest value of \(c\) for all the \(n\) points in your training data set. That is

$$\delta^{\star}=min\{c~\forall~points~in~the~training~data~set\}$$. All points \(x^{\star}\) which are lying at an orthogonal distance of \(\delta^{\star}\) from the hyper-plane \(w^{T}x+b=0\) are called support vectors.

Let \(\delta^{\star}=\frac{y^{\star}\times h(x^{\star})}{\vert\vert w\vert\vert}\), where \(x^{\star}\) is a support vector. We want \(\delta^{\star} = \frac{1}{\vert\vert w\vert\vert}\). Clearly, we want to multiply some value \(s\) such that \(s\times y^{\star}\times h(x^{\star})=1\). Therefore, \(s=\frac{1}{y^{\star}\times h(x^{\star})}\).

Objective Criteria for Linear SVMAfter scaling down, we have the following constraint for SVM:

$$y\times (w^{T}x+b)\geq \frac{1}{\vert\vert w\vert\vert}$$.

Now consider two support vectors, one denoted by \(a_{1}\) on the positive side, while the other \(a_{2}\) on the negative side of the hyper-plane. For the time being let us not use \(y\). Then, we have \(w^{T}a_{1}+b= \frac{1}{\vert\vert w\vert\vert}\) and \(w^{T}a_{2}+b= \frac{-1}{\vert\vert w\vert\vert}\).Thus, \(w^{T}(a_{1}-a_{2})=\frac{2}{\vert\vert w\vert\vert}\).\(w^{T}(a_{1}-a_{2})\) depicts the separation/gap between the support vectors on the either side. We want to maximize this gap \(\frac{2}{\vert\vert w\vert\vert}\). For the purpose, we need to minimize \(\vert\vert w\vert\vert\). Instead,we can minimize \(\frac{\vert\vert w\vert\vert^{2}}{2}\). Formally, we want to solve

$$Minimize_{w,b}\frac{\vert\vert w\vert\vert^{2}}{2}$$

subject to the constraint

$$y_{i}(w^{T}x_{i}+b)\geq \frac{1}{\vert \vert w\vert\vert}~\forall ~x_{i}~in ~Training~Data~set$$.

For linearly inseparable case, we introduce some penalty \(0\leq \xi_{i} \leq 1\) in the objective function and our constraint.

$$Minimize_{w,b}(\frac{\vert\vert w\vert\vert^{2}}{2}+ C\sum_{i=1}^{n}\xi_{i})$$

subject to the constraints

$$(1)~y_{i}(w^{T}x_{i}+b)\geq 1-\xi_{i}~\forall ~x_{i}~in ~Training~Data~set~of~size~n$$ and

$$(2)~\xi_{i}\geq 0~\forall ~n~points~in~the~Training~Data~set$$

\(C\) is a user defined parameter.

In case you wish to see how \(\xi_{i}\) is used as penalty, please consider the the three figures below. You can also choose to skip them if the above details are okay for you.

Let us now start our coding. The needed code is here.

library(tm) setwd(path to the directory) print(getwd()) sms_data<-read.csv("sms_spam.csv",stringsAsFactors = FALSE) print(head(sms_data)) sms_data$type<-factor(sms_data$type) simply_text<-Corpus(VectorSource(sms_data$text)) cleaned_corpus<-tm_map(simply_text, tolower) cleaned_corpus<-tm_map(cleaned_corpus,removeNumbers) cleaned_corpus<-tm_map(cleaned_corpus,removeWords,stopwords()) #cleaned_corpus<-tm_map(cleaned_corpus,removePunctuation) #cleaned_corpus<-tm_map(cleaned_corpus,stripWhitespace) sms_dtm<-DocumentTermMatrix(cleaned_corpus) sms_train<-sms_dtm[1:4000] sms_test<-sms_dtm[4001:5574] #sms_train<-cleaned_corpus[1:4000] #sms_test<-cleaned_corpus[4001:5574] freq_term=(findFreqTerms(sms_dtm,lowfreq=2)) #sms_freq_train<-DocumentTermMatrix(sms_train, list(dictionary=freq_term)) #sms_freq_test<-DocumentTermMatrix(sms_test,list(dictionary=freq_term)) y<-sms_data$type y_train<-y[1:4000] y_test<-y[4001:5574] library(e1071) tuned_svm<-tune(svm, train.x=sms_freq_train, train.y = y_train,kernel="linear", range=list(cost=10^(-2:2), gamma=c(0.1, 0.25,0.5,0.75,1,2)) ) print(tuned_svm) adtm.m<-as.matrix(sms_freq_train) adtm.df<-as.data.frame(adtm.m) svm_good_model<-svm(y_train~., data=adtm.df, kernel="linear",cost=tuned_svm$best.parameters$cost, gamma=tuned_svm$best.parameters$gamma) adtm.m_test<-as.matrix(sms_freq_test) adtm.df_test=as.data.frame(adtm.m_test) y_pred<-predict(svm_good_model,adtm.df_test) table(y_test,y_pred) library(gmodels) CrossTable(y_pred,y_test,prop.chisq = FALSE)*type 1 ham 2 ham 3 spam 4 ham 5 ham 6 spam text 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... 2 Ok lar... Joking wif u oni... 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's 4 U dun say so early hor... U c already then say... 5 Nah I don't think he goes to usf, he lives around here though 6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, Â£1.50 to rcv Parameter tuning of ‘svm’: - sampling method: 10-fold cross validation - best parameters: cost gamma 1 0.1 - best performance: 0.01825 Cell Contents |-------------------------| | N | | N / Row Total | | N / Col Total | | N / Table Total | |-------------------------| Total Observations in Table: 1574 | y_test y_pred | ham | spam | Row Total | -------------|-----------|-----------|-----------| ham | 1295 | 191 | 1486 | | 0.871 | 0.129 | 0.944 | | 0.952 | 0.897 | | | 0.823 | 0.121 | | -------------|-----------|-----------|-----------| spam | 66 | 22 | 88 | | 0.750 | 0.250 | 0.056 | | 0.048 | 0.103 | | | 0.042 | 0.014 | | -------------|-----------|-----------|-----------| Column Total | 1361 | 213 | 1574 | | 0.865 | 0.135 | | -------------|-----------|-----------|-----------|*

In the above example we are not using the cleaned corpus. We see that the accuracy/precision of our classifier for ham is 0.871 and recall is 0.95. However for Spam messages, the precision is 0.250, while recall is just 0.103. We found that the usage of cleaned corpus degrades the performance for the SVM more.

**
**

**Related Post**

- How to add a background image to ggplot2 graphs
- Streamline your analyses linking R to SAS: the workfloweR experiment
- R Programming – Pitfalls to avoid (Part 1)
- Eclipse – an alternative to RStudio – part 2
- Eclipse – an alternative to RStudio – part 1

To **leave a comment** for the author, please follow the link and comment on their blog: ** DataScience+**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### RStudio Connect 1.4.4.1

(This article was first published on ** RStudio Blog**, and kindly contributed to R-bloggers)

We’re excited to announce the release of RStudio Connect: version 1.4.4.1. This release includes the ability to manage different versions of your work on RStudio Connect.

**Rollback / Roll Forward**

The most notable feature of this release is the ability to “rollback” to a previously deployed version of your work or “roll forward” to a more recent version of your work.

You can also download a particular version, perhaps as a starting place for a new report or application, and delete old versions that you want to remove from the server.

Other important features allow you to:

- Specify the number of versions to retain. You can alter the setting Applications.BundleRetentionLimit to specify how many versions of your applications you want to keep on disk. By default, we retain all bundles eternally.
- Limit the number of scheduled reports that will be run concurrently using the Applications.ScheduleConcurrency setting. This setting will help ensure that your server isn’t overwhelmed by too many reports all scheduled to run at the same time of day. The default is set to 2.
- Create a printable view of your content with a new “Print” menu option.
- Notify users of unsaved changes before they take an action in parameterized reports.

The release also includes numerous security and stability improvements.

If you haven’t yet had a chance to download and try RStudio Connect we encourage you to do so. RStudio Connect is the best way to share all the work that you do in R (Shiny apps, R Markdown documents, plots, dashboards, etc.) with collaborators, colleagues, or customers.

You can find more details or download a 45 day evaluation of the product at https://www.rstudio.com/products/connect/. Additional resources can be found below.

- RStudio Connect home page & downloads
- RStudio Connect Admin Guide
- What IT needs to know about RStudio Connect
- Detailed news and changes between each version
- Pricing
- An online preview of RStudio Connect

To **leave a comment** for the author, please follow the link and comment on their blog: ** RStudio Blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Re: AVHRR NDVI3g time series tool?

tools for working with these data. The last time I wrestled with the GIMMs

data for research was 12 years ago and boy was it a pain processing them

all. This is great!

On 3/27/17, 11:47 PM, "R-sig-Geo on behalf of Florian Detsch"

<[hidden email] on behalf of

[hidden email]> wrote:

>Yeah, well... you can use the gimms package, but you won't get around

>downloading all the NetCDF files (eg using downloadGimms()). Then,

>rasterizeGimms() allows you to specify a spatial extent (ie your spatial

>points' bounding box) in order to automatically create image subsets,

>which significantly reduces computation time and amount of memory needed.

>

>Note that rasterizeGimms() lets you clip images and perform quality

>control in one go. Just have a look at the corresponding GitBook for

>further details:

>https://envin-marburg.gitbooks.io/introducing-the-r-gimms-package/content/

>.

>

>Best,

>Florian

>

>

>

>On 28.03.2017 00:29, Ahmadou Dicko wrote:

>> You can also use the gimms R package and the downloadGimms function.

>>

>> https://cran.r-project.org/package=gimms

>>

>> Best

>>

>> On Mon, Mar 27, 2017 at 10:00 PM, Andy Bunn <[hidden email]> wrote:

>>

>>> Huh. That is very cool. It doesn't look like the NDVI data are in

>>>ERDDAP.

>>> They are all here in a really easy form: https://ecocast.arc.nasa.gov/

>>> data/pub/gimms/3g.v1/

>>>

>>> I'll dig some more.

>>>

>>> Thanks.

>>>

>>> From: Michael Sumner <[hidden email]<mailto:[hidden email]>>

>>> Date: Monday, March 27, 2017 at 2:03 PM

>>> To: Andy Bunn <[hidden email]<mailto:[hidden email]>>, R-sig-Geo

>>><

>>> [hidden email]<mailto:[hidden email]>>

>>> Subject: Re: [R-sig-Geo] AVHRR NDVI3g time series tool?

>>>

>>>

>>> Maybe xtractomatic?

>>>

>>> On Tue, Mar 28, 2017, 07:55 Andy Bunn <[hidden email]<mailto:Andy

>>> .[hidden email]>> wrote:

>>> Hi all, shot in the dark here. But is there a prefabricated tool for

>>> extracting the biweekly time series from AVHRR NDVI3g for a particular

>>> set of points? I can download all the data and process it myself but

>>> wondering if there is a package that would make this easy. E.g., what

>>>I'd

>>> like would be to have the biweekly NDVI time series from 1981 to 2015

>>>for

>>> these points without processing all the netcdf files:

>>>

>>>

>>> require(sp)

>>> pts <- data.frame(ID="NOCA","MORA","OLYM",

>>> lat=c(48.776, 48.879,47.802),

>>> long=c(-121.299,-121.726,-123.604))

>>> coordinates(pts) <- ~lat+long

>>>

>>>

>>> Can such a thing be easily done?

>>>

>>> Thanks in advance for advice, Andy

>>>

>>> _______________________________________________

>>> R-sig-Geo mailing list

>>> [hidden email]<mailto:[hidden email]>

>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo

>>> --

>>> Dr. Michael Sumner

>>> Software and Database Engineer

>>> Australian Antarctic Division

>>> 203 Channel Highway

>>> Kingston Tasmania 7050 Australia

>>>

>>>

>>> [[alternative HTML version deleted]]

>>>

>>> _______________________________________________

>>> R-sig-Geo mailing list

>>> [hidden email]

>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo

>>>

>>

>>

>

>--

>Dr. Florian Detsch

>Environmental Informatics

>Department of Geography

>Philipps-Universität Marburg

>Deutschhausstraße 12

>35032 (parcel post: 35037) Marburg, Germany

>

>Phone: +49 (0) 6421 28-25323

>Web:

>http://www.uni-marburg.de/fb19/fachgebiete/umweltinformatik/detschf/index.

>html

>

>_______________________________________________

>R-sig-Geo mailing list

>[hidden email]

>https://stat.ethz.ch/mailman/listinfo/r-sig-geo

_______________________________________________

R-sig-Geo mailing list

[hidden email]

https://stat.ethz.ch/mailman/listinfo/r-sig-geo

### nanotime 0.1.2

(This article was first published on ** Thinking inside the box **, and kindly contributed to R-bloggers)

A new minor version of the nanotime package for working with nanosecond timestamps arrived yesterday on CRAN.

nanotime uses the RcppCCTZ package for (efficient) high(er) resolution time parsing and formatting up to nanosecond resolution, and the bit64 package for the actual integer64 arithmetic.

This release just arranges things neatly before Leonardo Silvestri and I may shake things up with a possible shift to doing it all in S4 as we may need the added rigour for nanotime object operations for use in his ztsdb project.

Changes in version 0.1.2 (2017-03-27)- The as.integer64 function is now exported as well.

We also have a diff to the previous version thanks to CRANberries. More details and examples are at the nanotime page; code, issue tickets etc at the GitHub repository.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Thinking inside the box **.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Introducing Boardgame Recommendation website built with Shiny

My family and I love boardgames. We are also on the lookout for new ones that would fit with the ones we already like to play. So I decided to create a boardgame recommendation website

https://larrydag.shinyapps.io/boardgame_reco/

I built it using R. The recommendation engine uses a very simple collaborative filtering algortihm based on correlation scores from other boardgame players collection lists. The collections are gathered using the API from BoardgameGeek.com. It is very much in a beta project phase as I just wanted to get something built to get working.

I also wanted another project to build in Shiny. I really like how easy it is to publish R projects with Shiny.

Some of the features include:

- Ability to enter your own collection
- Get recommendation on your collection
- Amazon link to buy boardgame that is recommended

### Re: AVHRR NDVI3g time series tool?

downloading all the NetCDF files (eg using downloadGimms()). Then,

rasterizeGimms() allows you to specify a spatial extent (ie your spatial

points' bounding box) in order to automatically create image subsets,

which significantly reduces computation time and amount of memory needed.

Note that rasterizeGimms() lets you clip images and perform quality

control in one go. Just have a look at the corresponding GitBook for

further details:

https://envin-marburg.gitbooks.io/introducing-the-r-gimms-package/content/.

Best,

Florian

On 28.03.2017 00:29, Ahmadou Dicko wrote:

> You can also use the gimms R package and the downloadGimms function.

>

> https://cran.r-project.org/package=gimms

>

> Best

>

> On Mon, Mar 27, 2017 at 10:00 PM, Andy Bunn <[hidden email]> wrote:

>

>> Huh. That is very cool. It doesn't look like the NDVI data are in ERDDAP.

>> They are all here in a really easy form: https://ecocast.arc.nasa.gov/

>> data/pub/gimms/3g.v1/

>>

>> I'll dig some more.

>>

>> Thanks.

>>

>> From: Michael Sumner <[hidden email]<mailto:[hidden email]>>

>> Date: Monday, March 27, 2017 at 2:03 PM

>> To: Andy Bunn <[hidden email]<mailto:[hidden email]>>, R-sig-Geo <

>> [hidden email]<mailto:[hidden email]>>

>> Subject: Re: [R-sig-Geo] AVHRR NDVI3g time series tool?

>>

>>

>> Maybe xtractomatic?

>>

>> On Tue, Mar 28, 2017, 07:55 Andy Bunn <[hidden email]<mailto:Andy

>> .[hidden email]>> wrote:

>> Hi all, shot in the dark here. But is there a prefabricated tool for

>> extracting the biweekly time series from AVHRR NDVI3g for a particular

>> set of points? I can download all the data and process it myself but

>> wondering if there is a package that would make this easy. E.g., what I'd

>> like would be to have the biweekly NDVI time series from 1981 to 2015 for

>> these points without processing all the netcdf files:

>>

>>

>> require(sp)

>> pts <- data.frame(ID="NOCA","MORA","OLYM",

>> lat=c(48.776, 48.879,47.802),

>> long=c(-121.299,-121.726,-123.604))

>> coordinates(pts) <- ~lat+long

>>

>>

>> Can such a thing be easily done?

>>

>> Thanks in advance for advice, Andy

>>

>> _______________________________________________

>> R-sig-Geo mailing list

>> [hidden email]<mailto:[hidden email]>

>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo

>> --

>> Dr. Michael Sumner

>> Software and Database Engineer

>> Australian Antarctic Division

>> 203 Channel Highway

>> Kingston Tasmania 7050 Australia

>>

>>

>> [[alternative HTML version deleted]]

>>

>> _______________________________________________

>> R-sig-Geo mailing list

>> [hidden email]

>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo

>>

>

>

--

Dr. Florian Detsch

Environmental Informatics

Department of Geography

Philipps-Universität Marburg

Deutschhausstraße 12

35032 (parcel post: 35037) Marburg, Germany

Phone: +49 (0) 6421 28-25323

Web: http://www.uni-marburg.de/fb19/fachgebiete/umweltinformatik/detschf/index.html

_______________________________________________

R-sig-Geo mailing list

[hidden email]

https://stat.ethz.ch/mailman/listinfo/r-sig-geo

### Le Monde puzzle [#1000…1025]

(This article was first published on ** R – Xi'an's Og**, and kindly contributed to R-bloggers)

Le Monde mathematical puzzle launched a competition to celebrate its 1000th puzzle! A fairly long-term competition as it runs over the 25 coming puzzles (and hence weeks). Starting with puzzle #1001. Here is the 1000th puzzle, not part of the competition:

*Alice & Bob spend five (identical) vouchers in five different shops, each time buying the maximum number of items to get close to the voucher value. In these five shops, they buy sofas at 421 euros each, beds at 347 euros each, kitchen appliances at 289 euros each, tables at 251 euros each and bikes at 211 euros each, respectively. Once the buying frenzy is over, they realise that within a single shop, they would have spent exactly four vouchers for the same products. What is the value of a voucher?*

Filed under: Kids, R Tagged: Alice and Bob, arithmetics, competition, Le Monde, mathematical puzzle, R, Tangente

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Xi'an's Og**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### How to Reach More People with your Next Data Science Talk

(This article was first published on ** R – AriLamstein.com**, and kindly contributed to R-bloggers)

Speaking publicly about your data science projects is one of the best things you can do for your portfolio.

*Writing* your presentation will help you refine what your project is, who it is for, and how it relates to other work in the area.

*Giving* your presentation at a live, in-person event will make a large impact on the audience.

That being said, public speaking has its drawbacks. For example:

**Time to prepare.**While your talk might take just 30 minutes to deliver, you could easily spend weeks writing and rehearsing it.**Cost.**Some talks, such as those at local meetups, might be free to attend. But some will require significant travel. That costs money, even if your conference ticket is free.**Limited upside***.*When you deliver a conference talk, your upside is limited in two ways:**Reach**. Only the people in the room at the time of your talk can hear your message.**Interaction**. Your interaction with the audience is normally limited to a few minutes of Q&A, and whatever conversations you have with people during the rest of the conference.

In my experience, this third pain point is the most significant. After all, if you had reached 10x the number of people with your last talk, then you probably would have been willing to spend more time writing it. And you probably would have been willing to travel further to give it.

Imagine if your talk wasn’t limited by the conference itselfFor a moment, try to ignore the limits that the conferences place on your talk.

Imagine if members of the audience could ask you questions days, or even weeks, after you gave your talk. Wouldn’t that be a game changer?

And what about all the people who care about your topic but couldn’t make it to the conference? Wouldn’t it be great if they could hear your talk as well?

All of this is possibleI’ve recently worked out a system to solve these problems in my own public speaking events. My talks now reach more people than ever before, and lead to more interactions than ever before.

**Bonus:**A checklist for making the most out of your data science talks! Click Here

Before giving you the exact system I use, let me tell you the technology that you’ll need in order to implement it. Note that some of the links below are affiliate links. This means that, at no additional cost to you, I may get a commission if you wind up purchasing after clicking these links:

- A self-hosted WordPress website. I use BlueHost.
- An Email Service so you can follow up with people. I use Drip.
- A Webinar Provider. I use Zoom.
- A Landing Page Builder. I use Thrive.

Before the conference, take out a url such as <ConferenceName>Slides.com. The URL should be easy for the audience to remember. Set the URL to redirect to a separate page on your website.

This page should a “landing page”. This means that the only Call-to-Action (CTA) on the page should be for the reader to enter their email address so that you can send them the slides.

I recommend that the last slide in your deck contain just this URL. When the slide appears, say “If you would like to download the slide deck I used for this talk, visit this URL and enter your email address. I will then email you the slides.”

Set up an automated email campaign just for this opt-in. The first email should fire immediately and send them the slides. The second email should fire a few days later, and ask people if they enjoyed the slides or have any questions.

This tip alone should multiply the chances you have to interact with your audience after the talk ends.

Step 2: Host a Post-Conference WebinarThe sad fact is this: most people who would enjoy your talk won’t be in the audience. No conference can get even a fraction of the people who are interested in a topic. The best way around this is to host a live webinar where you go through the same exact same slides as your presentation.

The week after your presentation, publish a blog post where you announce that you’ll be giving the same talk, but this time as a live webinar. Drive traffic to the registration page by announcing it on LinkedIn, Facebook and Twitter.

The biggest impact your talk will have is the in-person presentation at the conference. The second biggest impact will be the live webinar.

Step 3: Use the Webinar RecordingAfter giving the webinar, you will now have a recording of the presentation. This is great because (again) most people who are interested in your talk will not have been at the conference, and will not have attended your live webinar.

You can host the recording on your own website or on youtube. When you get a new subscriber, point them to the recording. You can also periodically drive traffic to the recording on social media.

Step 4: Post Your Slides OnlineA week after your webinar I recommend posting your slides online. I also recommend creating a content upgrade so that readers can download the slides. A “content upgrade” simply means that readers enter their email address, and that you then email them the slides.

This step will probably have the highest volume (i.e. the most people will interact with your content in this form), but the least impact (i.e. people will not get as much out of interacting with your talk in this form).

Closing NotesAs a data scientist with an online portfolio, conference talks are probably the most valuable type of content you create. Without additional effort, though, that value will be *limited to people who are in the audience at the time of your presentation.*

With just a little bit of work, though, you can overcome that limitation and have more followup with your audience and reach more people.

**Bonus:**A checklist for making the most out of your data science talks! Click Here

The post How to Reach More People with your Next Data Science Talk appeared first on AriLamstein.com.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – AriLamstein.com**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Data science for Doctors: Inferential Statistics Exercises (part-3)

(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people that are around medical field to enhance their data science skills.

We will work with a health related database the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the sixth part of the series and it aims to cover partially the subject of Inferential statistics.

Researchers rarely have the capability of testing many patients,or experimenting a new treatment to many patients, therefore making inferences out of a sample is a necessary skill to have. This is where inferential statistics comes into play.

In more detail, in this part we will go through the hypothesis testing for Student’s t-distribution (Student’s t-test), which may be the most used test you will need to apply, since in most cases the standard deviation σ of the population is not known. We will cover the one-sample t-test and two-sample t-test(both with equal and unequal variance). If you are not aware of what are the mentioned distributions please go here to acquire the necessary background.

Before proceeding, it might be helpful to look over the help pages for the t.test.

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"

data <- read.table(url, fileEncoding="UTF-8", sep=",")

names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')

colnames(data) <- names

data <- data[-which(data$mass ==0),]

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Suppose that we take a sample of 25 candidates that tried a diet and they had a average weight of 29 (generate 25 normal distributed samples with mean 29 and standard deviation 4) after the experiment.

Find the t-value.

Exercise 2

Find the p-value.

Exercise 3

Find the 95% confidence interval.

Exercise 4

Apply t-test with Null Hypothesis that the true mean of the sample is equal to the mean of the sample with 5% confidence level.

Exercise 5

Apply t-test with Null Hypothesis that the true mean of the sample is equal to the mean of the population and the alternative that the true mean is less than the mean of the population with 5% confidence level.

Exercise 6

Suppose that we want to compare the current diet with another one. We assume that we test a different diet to a sample of 27 with mass average of 31(generate normal distributed samples with mean 31 and standard deviation of 5). Test whether the two diets are significantly different.

Note that the two distributions have different variances.

hint: This is a two sample hypothesis testing with different variances.

Exercise 7

Test whether the the first diet is more efficient than the second.

Exercise 8

Assume that the second diet has the same variance as the first one. Is it significant different?

Exercise 9

Assume that the second diet has the same variance as the first one. Is it significantly better?

Exercise 10

Suppose that you take a sample of 27 with average mass of 29, and after the diet the average mass is 28(generate the sampled with rnorm(27,average,4)). Are they significant different?

hint: Paired Sample T-Test.

- Data science for Doctors: Inferential Statistics Exercises (part-2)
- Data Science for Doctors – Part 4 : Inferential Statistics (1/5)
- Data Science for Doctors – Part 2 : Descriptive Statistics
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Re: AVHRR NDVI3g time series tool?

https://cran.r-project.org/package=gimms

Best

On Mon, Mar 27, 2017 at 10:00 PM, Andy Bunn <[hidden email]> wrote:

> Huh. That is very cool. It doesn't look like the NDVI data are in ERDDAP.

> They are all here in a really easy form: https://ecocast.arc.nasa.gov/

> data/pub/gimms/3g.v1/

>

> I'll dig some more.

>

> Thanks.

>

> From: Michael Sumner <[hidden email]<mailto:[hidden email]>>

> Date: Monday, March 27, 2017 at 2:03 PM

> To: Andy Bunn <[hidden email]<mailto:[hidden email]>>, R-sig-Geo <

> [hidden email]<mailto:[hidden email]>>

> Subject: Re: [R-sig-Geo] AVHRR NDVI3g time series tool?

>

>

> Maybe xtractomatic?

>

> On Tue, Mar 28, 2017, 07:55 Andy Bunn <[hidden email]<mailto:Andy

> .[hidden email]>> wrote:

> Hi all, shot in the dark here. But is there a prefabricated tool for

> extracting the biweekly time series from AVHRR NDVI3g for a particular

> set of points? I can download all the data and process it myself but

> wondering if there is a package that would make this easy. E.g., what I'd

> like would be to have the biweekly NDVI time series from 1981 to 2015 for

> these points without processing all the netcdf files:

>

>

> require(sp)

> pts <- data.frame(ID="NOCA","MORA","OLYM",

> lat=c(48.776, 48.879,47.802),

> long=c(-121.299,-121.726,-123.604))

> coordinates(pts) <- ~lat+long

>

>

> Can such a thing be easily done?

>

> Thanks in advance for advice, Andy

>

> _______________________________________________

> R-sig-Geo mailing list

> [hidden email]<mailto:[hidden email]>

> https://stat.ethz.ch/mailman/listinfo/r-sig-geo

> --

> Dr. Michael Sumner

> Software and Database Engineer

> Australian Antarctic Division

> 203 Channel Highway

> Kingston Tasmania 7050 Australia

>

>

> [[alternative HTML version deleted]]

>

> _______________________________________________

> R-sig-Geo mailing list

> [hidden email]

> https://stat.ethz.ch/mailman/listinfo/r-sig-geo

>

--

Ahmadou H. DICKO, PhD

Data Lab Manager

Humanitarian Data Exchange (HDX) - humdata.org

OCHA ROWCA regional office

VDN Sacre Coeur III, Villa 9364 BP 16 922 Fann

Dakar, Senegal

Phone: (+221) 33 869 85 36

Mobile: (+221) 77 123 81 69

Email: [hidden email]

Skype: dicko.ahmadou.h

Twitter : @dickoah

Gitlab: gitlab/dickoa

Github: github/dickoa

[[alternative HTML version deleted]]

_______________________________________________

R-sig-Geo mailing list

[hidden email]

https://stat.ethz.ch/mailman/listinfo/r-sig-geo

### Re: AVHRR NDVI3g time series tool?

I'll dig some more.

Thanks.

From: Michael Sumner <[hidden email]<mailto:[hidden email]>>

Date: Monday, March 27, 2017 at 2:03 PM

To: Andy Bunn <[hidden email]<mailto:[hidden email]>>, R-sig-Geo <[hidden email]<mailto:[hidden email]>>

Subject: Re: [R-sig-Geo] AVHRR NDVI3g time series tool?

Maybe xtractomatic?

On Tue, Mar 28, 2017, 07:55 Andy Bunn <[hidden email]<mailto:[hidden email]>> wrote:

Hi all, shot in the dark here. But is there a prefabricated tool for

extracting the biweekly time series from AVHRR NDVI3g for a particular

set of points? I can download all the data and process it myself but

wondering if there is a package that would make this easy. E.g., what I'd

like would be to have the biweekly NDVI time series from 1981 to 2015 for

these points without processing all the netcdf files:

require(sp)

pts <- data.frame(ID="NOCA","MORA","OLYM",

lat=c(48.776, 48.879,47.802),

long=c(-121.299,-121.726,-123.604))

coordinates(pts) <- ~lat+long

Can such a thing be easily done?

Thanks in advance for advice, Andy

_______________________________________________

R-sig-Geo mailing list

[hidden email]<mailto:[hidden email]>

https://stat.ethz.ch/mailman/listinfo/r-sig-geo

--

Dr. Michael Sumner

Software and Database Engineer

Australian Antarctic Division

203 Channel Highway

Kingston Tasmania 7050 Australia

[[alternative HTML version deleted]]

_______________________________________________

R-sig-Geo mailing list

[hidden email]

https://stat.ethz.ch/mailman/listinfo/r-sig-geo

### An Introduction to Stock Market Data Analysis with R (Part 1)

(This article was first published on ** R – Curtis Miller's Personal Website**, and kindly contributed to R-bloggers)

*Around September of 2016 I wrote two articles on using Python for accessing, visualizing, and evaluating trading strategies (see part 1 and part 2). These have been my most popular posts, up until I published my article on learning programming languages (featuring my dad’s story as a programmer), and has been translated into both Russian (which used to be on backtest.ru at a link that now appears to no longer work) and Chinese (here and here). R has excellent packages for analyzing stock data, so I feel there should be a “translation” of the post for using R for stock data analysis.*

*This post is the first in a two-part series on stock data analysis using R, based on a lecture I gave on the subject for MATH 3900 (Data Science) at the University of Utah. In these posts, I will discuss basics such as obtaining the data from Yahoo! Finance using pandas, visualizing stock data, moving averages, developing a moving-average crossover strategy, backtesting, and benchmarking. The final post will include practice problems. This first post discusses topics up to introducing moving averages.*

**NOTE: The information in this post is of a general nature containing information and opinions from the author’s perspective. None of the content of this post should be considered financial advice. Furthermore, any code written here is provided without any form of guarantee. Individuals who choose to use it do so at their own risk.**

Advanced mathematics and statistics have been present in finance for some time. Prior to the 1980s, banking and finance were well-known for being “boring”; investment banking was distinct from commercial banking and the primary role of the industry was handling “simple” (at least in comparison to today) financial instruments, such as loans. Deregulation under the Regan administration, coupled with an influx of mathematical talent, transformed the industry from the “boring” business of banking to what it is today, and since then, finance has joined the other sciences as a motivation for mathematical research and advancement. For example one of the biggest recent achievements of mathematics was the derivation of the Black-Scholes formula, which facilitated the pricing of stock options (a contract giving the holder the right to purchase or sell a stock at a particular price to the issuer of the option). That said, bad statistical models, including the Black-Scholes formula, hold part of the blame for the 2008 financial crisis.

In recent years, computer science has joined advanced mathematics in revolutionizing finance and **trading**, the practice of buying and selling of financial assets for the purpose of making a profit. In recent years, trading has become dominated by computers; algorithms are responsible for making rapid split-second trading decisions faster than humans could make (so rapidly, the speed at which light travels is a limitation when designing systems). Additionally, machine learning and data mining techniques are growing in popularity in the financial sector, and likely will continue to do so. In fact, a large part of algorithmic trading is **high-frequency trading (HFT)**. While algorithms may outperform humans, the technology is still new and playing an increasing role in a famously turbulent, high-stakes arena. HFT was responsible for phenomena such as the 2010 flash crash and a 2013 flash crash prompted by a hacked Associated Press tweet about an attack on the White House.

My articles, however, will not be about how to crash the stock market with bad mathematical models or trading algorithms. Instead, I intend to provide you with basic tools for handling and analyzing stock market data with R. We will be using stock data as a first exposure to **time series data**, which is data considered dependent on the time it was observed (other examples of time series include temperature data, demand for energy on a power grid, Internet server load, and many, many others). I will also discuss moving averages, how to construct trading strategies using moving averages, how to formulate exit strategies upon entering a position, and how to evaluate a strategy with backtesting.

**DISCLAIMER: THIS IS NOT FINANCIAL ADVICE!!! Furthermore, I have ZERO experience as a trader (a lot of this knowledge comes from a one-semester course on stock trading I took at Salt Lake Community College)! This is purely introductory knowledge, not enough to make a living trading stocks. People can and do lose money trading stocks, and you do so at your own risk!**

Before we analyze stock data, we need to get it into some workable format. Stock data can be obtained from Yahoo! Finance, Google Finance, or a number of other sources, and the **quantmod** package provides easy access to Yahoo! Finance and Google Finance data, along with other sources. In fact, **quantmod** provides a number of useful features for financial modelling, and we will be seeing those features throughout these articles. In this lecture, we will get our data from Yahoo! Finance.

Let’s briefly discuss this. getSymbols() created in the global environment an object called AAPL (named automatically after the ticker symbol of the security retrieved) that is of the xts class (which is also a zoo-class object). xts objects (provided in the **xts** package) are seen as improved versions of the ts object for storing time series data. They allow for time-based indexing and provide custom attributes, along with allowing multiple (presumably related) time series with the same time index to be stored in the same object. (Here is a vignette describing xts objects.) The different series are the columns of the object, with the name of the associated security (here, AAPL) being prefixed to the corresponding series.

Yahoo! Finance provides six series with each security. **Open** is the price of the stock at the beginning of the trading day (it need not be the closing price of the previous trading day), **high** is the highest price of the stock on that trading day, **low** the lowest price of the stock on that trading day, and **close** the price of the stock at closing time. **Volume** indicates how many stocks were traded. **Adjusted close** (abreviated as “adjusted” by getSymbols()) is the closing price of the stock that adjusts the price of the stock for corporate actions. While stock prices are considered to be set mostly by traders, **stock splits** (when the company makes each extant stock worth two and halves the price) and **dividends** (payout of company profits per share) also affect the price of a stock and should be accounted for.

Now that we have stock data we would like to visualize it. I first use base R plotting to visualize the series.

plot(AAPL[, "AAPL.Close"], main = "AAPL")A linechart is fine, but there are at least four variables involved for each date (open, high, low, and close), and we would like to have some visual way to see all four variables that does not require plotting four separate lines. Financial data is often plotted with a **Japanese candlestick plot**, so named because it was first created by 18th century Japanese rice traders. Use the function candleChart() from **quantmod** to create such a chart.

With a candlestick chart, a black candlestick indicates a day where the closing price was higher than the open (a gain), while a red candlestick indicates a day where the open was higher than the close (a loss). The wicks indicate the high and the low, and the body the open and close (hue is used to determine which end of the body is the open and which the close). Candlestick charts are popular in finance and some strategies in technical analysis use them to make trading decisions, depending on the shape, color, and position of the candles. I will not cover such strategies today.

(Notice that the volume is tracked as a bar chart on the lower pane as well, with the same colors as the corresponding candlesticks. Some traders like to see how many shares are being traded; this can be important in trading.)

We may wish to plot multiple financial instruments together; we may want to compare stocks, compare them to the market, or look at other securities such as exchange-traded funds (ETFs). Later, we will also want to see how to plot a financial instrument against some indicator, like a moving average. For this you would rather use a line chart than a candlestick chart. (How would you plot multiple candlestick charts on top of one another without cluttering the chart?)

Below, I get stock data for some other tech companies and plot their adjusted close together.

# Let's get data for Microsoft (MSFT) and Google (GOOG) (actually, Google is # held by a holding company called Alphabet, Inc., which is the company # traded on the exchange and uses the ticker symbol GOOG). getSymbols(c("MSFT", "GOOG"), src = "yahoo", from = start, to = end) ## [1] "MSFT" "GOOG" # Create an xts object (xts is loaded with quantmod) that contains closing # prices for AAPL, MSFT, and GOOG stocks <- as.xts(data.frame(AAPL = AAPL[, "AAPL.Close"], MSFT = MSFT[, "MSFT.Close"], GOOG = GOOG[, "GOOG.Close"])) head(stocks) ## AAPL.Close MSFT.Close GOOG.Close ## 2016-01-04 105.35 54.80 741.84 ## 2016-01-05 102.71 55.05 742.58 ## 2016-01-06 100.70 54.05 743.62 ## 2016-01-07 96.45 52.17 726.39 ## 2016-01-08 96.96 52.33 714.47 ## 2016-01-11 98.53 52.30 716.03 # Create a plot showing all series as lines; must use as.zoo to use the zoo # method for plot, which allows for multiple series to be plotted on same # plot plot(as.zoo(stocks), screens = 1, lty = 1:3, xlab = "Date", ylab = "Price") legend("right", c("AAPL", "MSFT", "GOOG"), lty = 1:3, cex = 0.5)What’s wrong with this chart? While absolute price is important (pricey stocks are difficult to purchase, which affects not only their volatility but *your* ability to trade that stock), when trading, we are more concerned about the relative change of an asset rather than its absolute price. Google’s stocks are much more expensive than Apple’s or Microsoft’s, and this difference makes Apple’s and Microsoft’s stocks appear much less volatile than they truly are (that is, their price appears to not deviate much).

One solution would be to use two different scales when plotting the data; one scale will be used by Apple and Microsoft stocks, and the other by Google.

plot(as.zoo(stocks[, c("AAPL.Close", "MSFT.Close")]), screens = 1, lty = 1:2, xlab = "Date", ylab = "Price") par(new = TRUE) plot(as.zoo(stocks[, "GOOG.Close"]), screens = 1, lty = 3, xaxt = "n", yaxt = "n", xlab = "", ylab = "") axis(4) mtext("Price", side = 4, line = 3) legend("topleft", c("AAPL (left)", "MSFT (left)", "GOOG"), lty = 1:3, cex = 0.5)Not only is this solution difficult to implement well, it is seen as a bad visualization method; it can lead to confusion and misinterpretation, and cannot be read easily.

A “better” solution, though, would be to plot the information we actually want: the stock’s returns. This involves transforming the data into something more useful for our purposes. There are multiple transformations we could apply.

One transformation would be to consider the stock’s return since the beginning of the period of interest. In other words, we plot:

This will require transforming the data in the stocks object, which I do next.

# Get me my beloved pipe operator! if (!require("magrittr")) { install.packages("magrittr") library(magrittr) } ## Loading required package: magrittr stock_return % t %>% as.xts head(stock_return) ## AAPL.Close MSFT.Close GOOG.Close ## 2016-01-04 1.0000000 1.0000000 1.0000000 ## 2016-01-05 0.9749407 1.0045620 1.0009975 ## 2016-01-06 0.9558614 0.9863139 1.0023994 ## 2016-01-07 0.9155197 0.9520073 0.9791734 ## 2016-01-08 0.9203607 0.9549271 0.9631052 ## 2016-01-11 0.9352634 0.9543796 0.9652081 plot(as.zoo(stock_return), screens = 1, lty = 1:3, xlab = "Date", ylab = "Return") legend("topleft", c("AAPL", "MSFT", "GOOG"), lty = 1:3, cex = 0.5)This is a much more useful plot. We can now see how profitable each stock was since the beginning of the period. Furthermore, we see that these stocks are highly correlated; they generally move in the same direction, a fact that was difficult to see in the other charts.

Alternatively, we could plot the change of each stock per day. One way to do so would be to plot the percentage increase of a stock when comparing day to day , with the formula:

But change could be thought of differently as:

These formulas are not the same and can lead to differing conclusions, but there is another way to model the growth of a stock: with log differences.

(Here, is the natural log, and our definition does not depend as strongly on whether we use or .) The advantage of using log differences is that this difference can be interpreted as the percentage change in a stock but does not depend on the denominator of a fraction.

We can obtain and plot the log differences of the data in stocks as follows:

stock_change % log %>% diff head(stock_change) ## AAPL.Close MSFT.Close GOOG.Close ## 2016-01-04 NA NA NA ## 2016-01-05 -0.025378648 0.0045516693 0.000997009 ## 2016-01-06 -0.019763704 -0.0183323194 0.001399513 ## 2016-01-07 -0.043121062 -0.0354019469 -0.023443064 ## 2016-01-08 0.005273804 0.0030622799 -0.016546113 ## 2016-01-11 0.016062548 -0.0005735067 0.002181138 plot(as.zoo(stock_change), screens = 1, lty = 1:3, xlab = "Date", ylab = "Log Difference") legend("topleft", c("AAPL", "MSFT", "GOOG"), lty = 1:3, cex = 0.5)Which transformation do you prefer? Looking at returns since the beginning of the period make the overall trend of the securities in question much more apparent. Changes between days, though, are what more advanced methods actually consider when modelling the behavior of a stock. so they should not be ignored.

Moving AveragesCharts are very useful. In fact, some traders base their strategies almost entirely off charts (these are the “technicians”, since trading strategies based off finding patterns in charts is a part of the trading doctrine known as **technical analysis**). Let’s now consider how we can find trends in stocks.

A **-day moving average** is, for a series and a point in time , the average of the past days: that is, if denotes a moving average process, then:

Moving averages smooth a series and helps identify trends. The larger is, the less responsive a moving average process is to short-term fluctuations in the series . The idea is that moving average processes help identify trends from “noise”. **Fast** moving averages have smaller and more closely follow the stock, while **slow** moving averages have larger , resulting in them responding less to the fluctuations of the stock and being more stable.

**quantmod** allows for easily adding moving averages to charts, via the addSMA() function.

Notice how late the rolling average begins. It cannot be computed until 20 days have passed. This limitation becomes more severe for longer moving averages. Because I would like to be able to compute 200-day moving averages, I’m going to extend out how much AAPL data we have. That said, we will still largely focus on 2016.

start = as.Date("2010-01-01") getSymbols(c("AAPL", "MSFT", "GOOG"), src = "yahoo", from = start, to = end) ## [1] "AAPL" "MSFT" "GOOG" # The subset argument allows specifying the date range to view in the chart. # This uses xts style subsetting. Here, I'm using the idiom # 'YYYY-MM-DD/YYYY-MM-DD', where the date on the left-hand side of the / is # the start date, and the date on the right-hand side is the end date. If # either is left blank, either the earliest date or latest date in the # series is used (as appropriate). This method can be used for any xts # object, say, AAPL candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white", subset = "2016-01-04/") addSMA(n = 20)You will notice that a moving average is much smoother than the actual stock data. Additionally, it’s a stubborn indicator; a stock needs to be above or below the moving average line in order for the line to change direction. Thus, crossing a moving average signals a possible change in trend, and should draw attention.

Traders are usually interested in multiple moving averages, such as the 20-day, 50-day, and 200-day moving averages. It’s easy to examine multiple moving averages at once.

candleChart(AAPL, up.col = "black", dn.col = "red", theme = "white", subset = "2016-01-04/") addSMA(n = c(20, 50, 200))The 20-day moving average is the most sensitive to local changes, and the 200-day moving average the least. Here, the 200-day moving average indicates an overall **bearish** trend: the stock is trending downward over time. The 20-day moving average is at times bearish and at other times **bullish**, where a positive swing is expected. You can also see that the crossing of moving average lines indicate changes in trend. These crossings are what we can use as **trading signals**, or indications that a financial security is changing direction and a profitable trade might be made.

*Visit next week to read about how to design and test a trading strategy using moving averages.*

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Curtis Miller's Personal Website**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Analyzing Accupedo step count data in R: Part 2 – Adding weather data

(This article was first published on ** Method Matters**, and kindly contributed to R-bloggers)

In my last set of posts, I wrote about analyzing data from the Accupedo step counter app I have on my phone. In this post, I’ll talk about some additional analysis I’ve done by merging the step counter data with weather data from another source.

The website www.wunderground.com has freely available weather data available for most parts of the world. According to their website, their data are recorded by a crowd-sourced team of weather enthusiasts who report the weather conditions from personal stations at their homes. Pretty cool project, and it was really easy to search for my city and download the data in a .csv format. In this post, we will focus on the Mean Temperature (for the day) value, recorded in degrees Celsius.

**Data Preparation**

The step count data here are aggregated to the daily level, as I didn’t have more granular weather information. In all honesty, it probably makes more sense to analyze the data at the day level (as I’ve done); it would likely be hard to see hourly evolution in the weather matching hourly evolution in my step counts. My movements are pretty constrained during certain hours of the day (e.g. when I’m at work), and so examining that granular of a relationship is unlikely to be very illuminating.

I was able to match all of the dates that I had step counts for with the daily weather information from www.wunderground.com, and merged the weather data into the main step count dataset described in the previous post. I then created the analysis dataset, a day-level version of the walking data aggregated across the two variables I had created previously – day and day of week (essentially this aggregates to the day-level, while keeping a column describing the day of the week – e.g. Monday, Tuesday, etc.).

# load the required packageslibrary(plyr); library(dplyr)

library(lubridate)

library(ggplot2)

# make the variables to aggregate across

# the master data set with steps and temperature is called walking_weather

walking_weather$oneday <- cut(walking_weather$date_time, breaks = "1 day")

walking_weather$dow <- wday(walking_weather$date_time, label = TRUE)

# aggregate by day, day of week

# max of step count; mean of temperature

ww_per_day <- walking_weather %>% group_by(oneday, dow) %>%

summarise(steps = max(steps), temp = mean(Mean_TemperatureC))

As before, when aggregating I kept the maximum step count per day (as these counts are cumulative), and computed the average temperature in degrees Celsius per day (because there are multiple observations per day in the non-aggregated data, taking the average across days simply returns the daily average temperature from the daily weather data).

The resulting dataset looks like this:

**Data Visualization and Analysis**

As a first step, let’s make histograms of the temperature and step count variables:

# histograms of temperature and step count at a daily level# using the base plotting system

hist(ww_per_day$temp, col ='darkblue',

xlab = 'Daily Average Temperature (Celsius)',

main = 'Histogram of Temperature')

hist(ww_per_day$steps, col ='darkgreen',

xlab = 'Daily Total Step Count',

main = 'Histogram of Step Count')

The temperatures have ranged from -4 degrees to 27 degrees Celsius, with a mean of 11.66 (SD = 5.83).

The daily step counts have ranged from 6 to 65,337, with a mean of 18,294.04 (SD = 7,808.36).

In order to visualize the relationship between my daily step count and the daily average temperature, I made a simple scatterplot, with a linear model line superimposed on the graph (95% confidence intervals shown in light grey):

# plot of steps by temperaturep = ggplot(data = ww_per_day, aes(x = temp, y = steps)) +

geom_point() +

coord_cartesian(ylim = c(0, 25000)) +

geom_smooth(method="lm") +

theme(legend.title=element_blank()) +

labs(x = "Temperature (Celsius)", y = "Steps" )

p

There is a slight positive relationship between temperature and step count, such that on warmer days I tend to walk more. A simple linear regression will estimate the size of this relationship:

summary(lm(steps~temp , data = ww_per_day ))The estimates given:

table.tableizer-table { font-size: 12px; border: 1px solid #CCC; font-family: Arial, Helvetica, sans-serif; } .tableizer-table td { padding: 4px; margin: 3px; border: 1px solid #CCC; } .tableizer-table th { background-color: #104E8B; color: #FFF; font-weight: bold; } Estimate Std. Error t value Pr(>|t|) Starz! (Intercept) 17277.29 681.76 25.342 < 2e-16 *** temp 87.21 52.31 1.667 0.096 .
---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7798 on 653 degrees of freedom

Multiple R-squared: 0.004238, Adjusted R-squared: 0.002713

F-statistic: 2.779 on 1 and 653 DF, p-value: 0.09597

The model estimates that I walk an additional 87.21 steps for each degree increase in the daily average temperature. However, the p-value associated with this estimate shows only a nonsignificant trend toward significance.

In a previous post we saw that my walking patterns tend to be different during the weekdays and the weekend. I was therefore curious to examine the relationship between temperature and step count across these two different types of days.

I generated the week/weekend variable (as in the previous post) and the plot with the following code:

# generate the weekday vs. weekend variableww_per_day$week_weekend[ww_per_day$dow == 'Sun'] <- 'Weekend'

ww_per_day$week_weekend[ww_per_day$dow == 'Sat'] <- 'Weekend'

ww_per_day$week_weekend[ww_per_day$dow == 'Mon'] <- 'Weekday'

ww_per_day$week_weekend[ww_per_day$dow == 'Tues'] <- 'Weekday'

ww_per_day$week_weekend[ww_per_day$dow == 'Wed'] <- 'Weekday'

ww_per_day$week_weekend[ww_per_day$dow == 'Thurs'] <- 'Weekday'

ww_per_day$week_weekend[ww_per_day$dow == 'Fri'] <- 'Weekday'

# plot the relationship between temperature and step count

# with separate colors and lines for weekday vs. weekend

p = ggplot(data = ww_per_day, aes(x = temp, y = steps, color=week_weekend)) +

geom_point() +

coord_cartesian(ylim = c(0, 25000)) +

geom_smooth(method="lm") +

theme(legend.title=element_blank()) +

labs(x = "Temperature (Celsius)", y = "Steps" )

p

The relationship between temperature and step count does indeed appear to be different on weekdays vs. the weekend:

The general pattern is that there’s not much of a relationship between temperature and step count during the week (after all, I have to walk to work whether it’s warm or cold), but on the weekends I walk quite a bit more when it’s warmer out.

Notice how the confidence intervals get wider at the extremes of the x-axis. These are regions where there are simply less data (by definition, extreme temperatures are rare). This pattern is especially noticeable for the weekend days (there are far fewer weekend days than there are week days, and so we have even less data here at the margins).

Although I’m in general not particularly interested in null-hypothesis significance testing of regression coefficients in this type of analysis, I went through the formality of testing the time period (weekday vs. weekend) by temperature interaction.

summary(lm(steps~temp + as.factor(week_weekend)+ temp:as.factor(week_weekend),data = ww_per_day ))

table.tableizer-table { font-size: 12px; border: 1px solid #CCC; font-family: Arial, Helvetica, sans-serif; } .tableizer-table td { padding: 4px; margin: 3px; border: 1px solid #CCC; } .tableizer-table th { background-color: #104E8B; color: #FFF; font-weight: bold; }

Estimate
Std. Error
t value
Pr(>|t|)
Starz!
(Intercept)
19178.76
774.45
24.76
< 2e-16
***
temp
23.46
59.52
0.39
0.69
as.factor(week_weekend)Weekend
-6982.66
1486.25
-4.70
0.00
***
temp:as.factor(week_weekend)Weekend
246.76
113.76
2.17
0.03
*
---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7560 on 651 degrees of freedom

Multiple R-squared: 0.06682, Adjusted R-squared: 0.06252

F-statistic: 15.54 on 3 and 651 DF, p-value: 9.005e-10

The results show that the interaction term is statistically significant; we already understood the shape and meaning of the interaction with the above plot.

More interesting for our purposes is the model summary. The adjusted multiple R squared is about 6%, much greater than the .3% from the model above (where we only include temperature as a predictor), but not very high overall. The take-home message is that time period (weekday vs. weekend) and temperature have clear relationships with step count, but together and in combination they don’t explain a huge amount of variance in the amount of steps I walk each day.

I wondered about the importance of the interaction term, actually. Interactions are widely sought after in certain academic contexts, but I enjoy the simplicity and ease-of-interpretation of main effects models for applied problems which require interpretable solutions (e.g. “white box” models). Bluntly put: how important is the interaction term in this context? One might argue that, although the p-value for the interaction term passes the threshold for statistical significance, knowing that the interaction term is (statistically) significantly different from zero doesn’t bring us much further in our understanding of the problem at hand, after accounting for the main effects of time period and temperature.

**Model Comparison**

In order to more concretely investigate the importance of the time period by temperature interaction term, I computed the model predicting step count with just the main effects of time period and temperature (lm_1 below), and also the model with the main effects and their interaction term (lm_2 below). I then performed a model comparison to see if adding the interaction term results in a statistically significant reduction in the residual sum of squares. In other words, is the difference between the predictive performance of the first and second model greater than zero?

# main effects modellm_1 <- lm(steps~temp + as.factor(week_weekend), data = ww_per_day )

summary(lm_1)

# main effects + interaction model

lm_2 <- lm(steps~temp + as.factor(week_weekend)+ temp:as.factor(week_weekend),

data = ww_per_day )

summary(lm_2)

# model comparison

anova(lm_1,lm_2)

Which gives the following result:

table.tableizer-table { font-size: 12px; border: 1px solid #CCC; font-family: Arial, Helvetica, sans-serif; } .tableizer-table td { padding: 4px; margin: 3px; border: 1px solid #CCC; } .tableizer-table th { background-color: #104E8B; color: #FFF; font-weight: bold; }

Model
Res.Df
RSS
Df
Sum of Sq
F
Pr(>F)
1
652
3.75E+10
2
651
3.72E+10
1
268916758
4.70
0.03 *
---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This comparison suggests that adding the interaction term results in a statistically significant decrease in the residual sum of squares. In other words, we can better predict my step count if we include the interaction term.

**Conclusion**

To sum up, in this post we took the daily total step counts from my Accupedo step counter app and added data about the average temperature on each day that steps were recorded. We used scatterplots to explore and understand the relationship between these two variables, both overall and according to the time period (e.g. weekday vs. weekend).

We also used linear regression to investigate the relationship between daily temperature and daily step count. When predicting step count from temperature, we saw that there was a non-statistically significant trend such that I walk more on warmer days. When adding time period as a categorical predictor in an interactive regression model, we saw a statistically significant interaction between temperature and time period, such that the relationship between temperature and step count was stronger on weekends vs. weekdays. In other words, I walk more when it’s warmer out, but only during the weekend.

*Caveat: Null Hypothesis Significance Testing (NHST) in Exploratory Data Analysis *

We used NHST and *p*-values to interpret regression coefficients and to compare the predictive performance of the main effects vs. interaction regression models. I’m somewhat ambivalent about the use of *p*-values in this context, honestly. I’ve been very influenced by Andrew Gelman’s work, particularly his wonderful blog, and am very conscious of the limitations of *p*-values in exploratory contexts like this one. If I had to choose the most interesting test of statistical significance in this post, I would choose the model comparison: it puts an explicit focus on the relative performance of competing (nested) models, rather than simply using significance testing to determine whether individual regression coefficients are significantly different from zero.

The statistical approach we’ve seen in this post conforms very much to the way I was trained in the social sciences. In this tradition, a strong focus is put on NHST (via *p*-values), and less frequently on interpreting model R square and performing formal model comparisons. In the work I do now as an applied data analyst (or “data scientist”), I often take an approach that borrows heavily from machine learning (though these ideas are also present in traditional statistical thinking): evaluating models via predictive accuracy on a test dataset that was not used in data modeling.

*Coming Up Next *

In the next set of posts, we will analyze a different dataset using an approach that leans more heavily towards this machine-learning perspective. We’ll also change analytical software, using Python/Pandas and the scikit-learn library to analyze the data. Stay tuned!

To **leave a comment** for the author, please follow the link and comment on their blog: ** Method Matters**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Re: AVHRR NDVI3g time series tool?

On Tue, Mar 28, 2017, 07:55 Andy Bunn <[hidden email]> wrote:

> Hi all, shot in the dark here. But is there a prefabricated tool for

> extracting the biweekly time series from AVHRR NDVI3g for a particular

> set of points? I can download all the data and process it myself but

> wondering if there is a package that would make this easy. E.g., what I'd

> like would be to have the biweekly NDVI time series from 1981 to 2015 for

> these points without processing all the netcdf files:

>

>

> require(sp)

> pts <- data.frame(ID="NOCA","MORA","OLYM",

> lat=c(48.776, 48.879,47.802),

> long=c(-121.299,-121.726,-123.604))

> coordinates(pts) <- ~lat+long

>

>

> Can such a thing be easily done?

>

> Thanks in advance for advice, Andy

>

> _______________________________________________

> R-sig-Geo mailing list

> [hidden email]

> https://stat.ethz.ch/mailman/listinfo/r-sig-geo

> --

Dr. Michael Sumner

Software and Database Engineer

Australian Antarctic Division

203 Channel Highway

Kingston Tasmania 7050 Australia

[[alternative HTML version deleted]]

_______________________________________________

R-sig-Geo mailing list

[hidden email]

https://stat.ethz.ch/mailman/listinfo/r-sig-geo