# GEOSTAT courses

Cpanel

## GEOSTAT News

Stay informed on our latest news!

## GEOSTAT software

List of FOSS software used in this course and installation instructions. Follow these instructions to prepare and customize the software before the beginning of the course.

## Literature used

List of books / lecture notes used in this course. See also: CRAN Task View: Analysis of Spatial Data.

## Events

« February 2017 »
MonTueWedThuFriSatSun
12345
6789101112
13141516171819
20212223242526
2728

## R-bloggers

R news and tutorials contributed by (750) R bloggers
Updated: 5 min 31 sec ago

### Analytical and Numerical Solutions to Linear Regression Problems

18 February 2017 - 2:52pm

This exercise focuses on linear regression with both analytical (normal equation) and numerical (gradient descent) methods. We will start with linear regression with one variable. From this part of the exercise, we will create plots that help to visualize how gradient descent gets the coefficient of the predictor and the intercept. In the second part, we will implement linear regression with multiple variables.

Linear regression with one variable

In this part of this exercise, we will implement linear regression with one variable to predict profits for a food truck. Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities and you have data for profits and populations from cities. We would like to use this data to help us select which city to expand to next.
The file ex1data1.txt contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss. The first column refers to the population size in 10,000s and the second column refers to the profit in $10,000s. library(ggplot2) library(data.table) library(magrittr) library(caret) library(fields) library(plot3D) Load the data and display first 6 observations ex1data1 <- fread("ex1data1.txt",col.names=c("population","profit")) head(ex1data1) population profit 1: 6.1101 17.5920 2: 5.5277 9.1302 3: 8.5186 13.6620 4: 7.0032 11.8540 5: 5.8598 6.8233 6: 8.3829 11.8860 Plotting the Data Before starting on any task, it is often useful to understand the data by visualizing it. For this dataset, we can use a scatter plot to visualize the data, since it has only two properties to plot (profit and population). Many other problems that we encounter in real life are multi-dimensional and can’t be plotted on a 2-d plot. ex1data1%>%ggplot(aes(x=population, y=profit))+ geom_point(color="blue",size=4,alpha=0.5)+ ylab('Profit in$10,000s')+ xlab('Population of City in 10,000s')+ggtitle ('Figure 1: Scatter plot of training data')+ theme(plot.title = element_text(size = 16,colour="red"))

Here it is the plot:

In this part, we will fit linear regression parameters θ to our dataset using gradient descent.
To implement gradient descent, we need to calculate the cost, which is given by:

Computing the cost J(θ)

As we perform gradient descent to learn minimize the cost function J(θ), it is helpful to monitor the convergence by computing the cost. In this section, we will implement a function to calculate J(θ) so we can check the convergence of our gradient descent implementation.
But remember, before we go any further we need to add the θ intercept term.

X=cbind(1,ex1data1$population) y=ex1data1$profit head(X) [,1] [,2] [1,] 1 6.1101 [2,] 1 5.5277 [3,] 1 8.5186 [4,] 1 7.0032 [5,] 1 5.8598 [6,] 1 8.3829

The function below calcuates cost based on the equation given above.

computeCost=function(X,y,theta){ z=((X%*%theta)-y)^2 return(sum(z)/(2*nrow(X))) }

Now, we can calculate the initial cost by initilizating the initial parameters to 0.

Next, we will implement gradient descent by calling the computeCost function above. A good way to verify that gradient descent is working correctly is to look at the value of J(θ) and check that it is decreasing with each step. Our value of J(θ) should never increase, and should converge to a steady value by the end of the algorithm.
The gradient descent function below returns the cost in every iteration and the optimal parameters for the number of iterations and learning rate alpha specified.

gradientDescent=function(X, y, theta, alpha, iters){ gd=list() cost=rep(0,iters) for(k in 1:iters){ z=rep(0,ncol(X)) for(i in 1:ncol(X)){ for(j in 1:nrow(X)){ z[i]=z[i]+(((X[j,]%*%theta)-y[j])*X[j,i]) } } theta= theta-((alpha/nrow(X))*z) cost[k]=computeCost(X,y,theta) } gd$theta= theta gd$cost=cost gd }

Now, let’s use the gradientDescent function to find the parameters and we have to make sure that our cost never increases. Let’s use 1500 iterations and a learning rate alpha of 0.04 for now. Later, we will see the effect of these values in our application.

iterations = 1500 alpha = 0.01 theta= matrix(rep(0, ncol(X))) gradientDescent_results=gradientDescent(X,y,theta,alpha,iterations) theta=gradientDescent_results$theta theta [,1] [1,] -3.630291 [2,] 1.166362 Ploting the cost function as a function of the number of iterations We expect that the cost should be decreasing or at least should not be at all increasing if our implementaion is correct. Let’s plot the cost fucntion as a function of number of iterations and make sure our implemenation makes sense. data.frame(Cost=gradientDescent_results$cost,Iterations=1:iterations)%>% ggplot(aes(x=Iterations,y=Cost))+geom_line(color="blue")+ ggtitle("Cost as a function of number of iteration")+ theme(plot.title = element_text(size = 16,colour="red"))

Here it is the plot:

As the plot above shows, the cost decrases with number of iterations and gets almost close to convergence with 1500 iterations.

Plot the linear fit

Now, since we know the parameters (slope and intercept), we can plot the linear fit on the scatter plot.

ex1data1%>%ggplot(aes(x=population, y=profit))+ geom_point(color="blue",size=3,alpha=0.5)+ ylab('Profit in $10,000s')+ xlab('Population of City in 10,000s')+ ggtitle ('Figure 1: Scatter plot of training data') + geom_abline(intercept = theta[1], slope = theta[2],col="red",show.legend=TRUE)+ theme(plot.title = element_text(size = 16,colour="red"))+ annotate("text", x = 12, y = 20, label = paste0("Profit = ",round(theta[1],4),"+",round(theta[2],4)," * Population")) Here it is the plot: Visualizing J(θ) To understand the cost function J(θ) better, let’s now plot the cost over a 2-dimensional grid of θ0 and θ1 values. The global minimum is the optimal point for θ0 and θ1, and each step of gradient descent moves closer to this point. Intercept=seq(from=-10,to=10,length=100) Slope=seq(from=-1,to=4,length=100) # initialize cost values to a matrix of 0's Cost = matrix(0,length(Intercept), length(Slope)); for(i in 1:length(Intercept)){ for(j in 1:length(Slope)){ t = c(Intercept[i],Slope[j]) Cost[i,j]= computeCost(X, y, t) } } persp3D(Intercept,Slope,Cost,theta=-45, phi=25, expand=0.75,lighting = TRUE, ticktype="detailed", xlab="Intercept", ylab="Slope", zlab="",axes=TRUE, main="Surface") image.plot(Intercept,Slope,Cost, main="Contour, showing minimum") contour(Intercept,Slope,Cost, add = TRUE,n=30,labels='') points(theta[1],theta[2],col='red',pch=4,lwd=6) Here it is the plot: Normal Equation Since linear regression has closed-form solution, we can solve it analytically and it is called normal equation. It is given by the formula below. we do not need to iterate or choose learning curve. However, we need to calculate inverse of a matrix , which make it slow if the number of records is very large. Gradient descent is applicable to other machine learning techniques as well. Further, gradient descent method is more appropriate even to linear regression when the number of observations is very large. θ=(XTX)−1XTy theta2=solve((t(X)%*%X))%*%t(X)%*%y theta2 [,1] [1,] -3.895781 [2,] 1.193034 There is very small difference between the parameters we got from normal equation and using gradient descent. Let’s increase the number of iteration and see if they get closer to each other. I increased the number of iterations from 1500 to 15000. iterations = 15000 alpha = 0.01 theta= matrix(rep(0, ncol(X))) gradientDescent_results=gradientDescent(X,y,theta,alpha,iterations) theta=gradientDescent_results$theta theta [,1] [1,] -3.895781 [2,] 1.193034

As you can see, now the results from normal equation and gradient descent are the same.

Using caret package

By the way, we can use packages develoed by experts in the field and perform our machine learning tasks. There are many machine learning packages in R for differnt types of machine learning tasks. To verify that we get the same results, let’s use the caret package, which is among the most commonly used machine learning packages in R.

my_lm <- train(profit~population, data=ex1data1,method = "lm") my_lm$finalModel$coefficients (Intercept) -3.89578087831187 population 1.1930336441896 Linear regression with multiple variables

In this part of the exercise, we will implement linear regression with multiple variables to predict the prices of houses. Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices. The file ex1data2.txt contains a training set of housing prices in Port- land, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.

ex1data2 <- fread("ex1data2.txt",col.names=c("size","bedrooms","price")) head(ex1data2) size bedrooms price 1: 2104 3 399900 2: 1600 3 329900 3: 2400 3 369000 4: 1416 2 232000 5: 3000 4 539900 6: 1985 4 299900 Feature Normalization

House sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, first performing feature scaling can make gradient descent converge much more quickly.

ex1data2=as.data.frame(ex1data2) for(i in 1:(ncol(ex1data2)-1)){ ex1data2[,i]=(ex1data2[,i]-mean(ex1data2[,i]))/sd(ex1data2[,i]) } Gradient Descent

Previously, we implemented gradient descent on a univariate regression problem. The only difference now is that there is one more feature in the matrix X. The hypothesis function and the batch gradient descent update rule remain unchanged.

X=cbind(1,ex1data2$size,ex1data2$bedrooms) y=ex1data2$price head(X) iterations = 6000 alpha = 0.01 theta= matrix(rep(0, ncol(X))) gradientDescent_results=gradientDescent(X,y,theta,alpha,iterations) theta=gradientDescent_results$theta theta [,1] [,2] [,3] [1,] 1 0.13000987 -0.2236752 [2,] 1 -0.50418984 -0.2236752 [3,] 1 0.50247636 -0.2236752 [4,] 1 -0.73572306 -1.5377669 [5,] 1 1.25747602 1.0904165 [6,] 1 -0.01973173 1.0904165 [,1] [1,] 340412.660 [2,] 110631.050 [3,] -6649.474 Normal Equation theta2=solve((t(X)%*%X))%*%t(X)%*%y theta2 [,1] [1,] 340412.660 [2,] 110631.050 [3,] -6649.474 Using caret package ex1data2 <- fread("ex1data2.txt",col.names=c("size","bedrooms","price")) my_lm <- train(price~size+bedrooms, data=ex1data2,method = "lm", preProcess = c("center","scale")) my_lm$finalModel$coefficients (Intercept) 340412.659574468 size 110631.050278846 bedrooms -6649.4742708198 Summary

In this post, we saw how to implement numerical and analytical solutions to linear regression problems using R. We also used caret -the famous R machine learning package- to verify our results. The data sets are from the Coursera machine learning course offered by Andrew Ng. The course is offered with Matlab/Octave. I am doing the exercises in that course with R. You can get the code from this Github repository.

Related Post

### Putting It All Together

18 February 2017 - 11:03am

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

The kind folks over at @RStudio gave a nod to my recently CRAN-released epidata package in their January data package roundup and I thought it might be useful to give it one more showcase using the recent CRAN update to ggalt and the new hrbrthemes (github-only for now) packages.

Labor force participation rate

The U.S. labor force participation rate (LFPR) is an oft-overlooked and under- or mis-reported economic indicator. I’ll borrow the definition from Investopedia:

The participation rate is a measure of the active portion of an economy’s labor force. It refers to the number of people who are either employed or are actively looking for work. During an economic recession, many workers often get discouraged and stop looking for employment, resulting in a decrease in the participation rate.

Population age distributions and other factors are necessary to honestly interpret this statistic. Parties in power usually dismiss/ignore this statistic outright and their opponents tend to wholly embrace it for criticism (it’s an easy target if you’re naive). “Yay” partisan democracy.

Since the LFPR is has nuances when looked at categorically, let’s take a look at it by attained education level to see how that particular view has changed over time (at least since the gov-quants have been tracking it).

We can easily grab this data with epidata::get_labor_force_participation_rate()(and, we’ll setup some library() calls while we’re at it:

library(epidata) library(hrbrthemes) # devtools::install_github("hrbrmstr/hrbrthemes") library(ggalt) library(tidyverse) library(stringi) part_rate <- get_labor_force_participation_rate("e") glimpse(part_rate) ## Observations: 457 ## Variables: 7 ## $date <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-01, 1979-04-01, 1979-05-01... ##$ all <dbl> 0.634, 0.634, 0.635, 0.636, 0.636, 0.637, 0.637, 0.637, 0.638, 0.638, 0... ## $less_than_hs <dbl> 0.474, 0.475, 0.475, 0.475, 0.475, 0.474, 0.474, 0.473, 0.473, 0.473, 0... ##$ high_school <dbl> 0.690, 0.691, 0.692, 0.692, 0.693, 0.693, 0.694, 0.694, 0.695, 0.696, 0... ## $some_college <dbl> 0.709, 0.710, 0.711, 0.712, 0.712, 0.713, 0.712, 0.712, 0.712, 0.712, 0... ##$ bachelor's_degree <dbl> 0.771, 0.772, 0.772, 0.773, 0.772, 0.772, 0.772, 0.772, 0.772, 0.773, 0... ## $advanced_degree <dbl> 0.847, 0.847, 0.848, 0.848, 0.848, 0.848, 0.847, 0.847, 0.848, 0.848, 0... One of the easiest things to do is to use ggplot2 to make a faceted line chart by attained education level. But, let’s change the labels so they are a bit easier on the eyes in the facets and switch the facet order from alphabetical to something more useful: gather(part_rate, category, rate, -date) %>% mutate(category=stri_replace_all_fixed(category, "_", " "), category=stri_trans_totitle(category), category=stri_replace_last_regex(category, "Hs$", "High School"), category=factor(category, levels=c("Advanced Degree", "Bachelor's Degree", "Some College", "High School", "Less Than High School", "All"))) -> part_rate

Now, we’ll make a simple line chart, tweaking the aesthetics just a bit:

ggplot(part_rate) + geom_line(aes(date, rate, group=category)) + scale_y_percent(limits=c(0.3, 0.9)) + facet_wrap(~category, scales="free") + labs(x=paste(format(range(part_rate$date), "%Y-%b"), collapse=" to "), y="Participation rate (%)", title="U.S. Labor Force Participation Rate", caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") + theme_ipsum_rc(grid="XY", axis="XY") The “All” view is interesting in that the LFPR has held fairly “steady” between 60% & 70%. Those individual and fractional percentage points actually translate to real humans, so the “minor” fluctuations do matter. It’s also interesting to see the direct contrast between the starting historical rate and current rate (you could also do the same with min/max rates, etc.) We can use a “dumbbell” chart to compare the 1978 value to today’s value, but we’ll need to reshape the data a bit first: group_by(part_rate, category) %>% arrange(date) %>% slice(c(1, n())) %>% spread(date, rate) %>% ungroup() %>% filter(category != "All") %>% mutate(category=factor(category, levels=rev(levels(category)))) -> rate_range filter(part_rate, category=="Advanced Degree") %>% arrange(date) %>% slice(c(1, n())) %>% mutate(lab=lubridate::year(date)) -> lab_df (We’ll be using the extra data frame to add labels the chart.) Now, we can compare the various ranges, once again tweaking aesthetics a bit: ggplot(rate_range) + geom_dumbbell(aes(y=category, x=1978-12-01, xend=2016-12-01), size=3, color="#e3e2e1", colour_x = "#5b8124", colour_xend = "#bad744", dot_guide=TRUE, dot_guide_size=0.25) + geom_text(data=lab_df, aes(x=rate, y=5.25, label=lab), vjust=0) + scale_x_percent(limits=c(0.375, 0.9)) + labs(x=NULL, y=NULL, title=sprintf("U.S. Labor Force Participation Rate %s-Present", lab_df$lab[1]), caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") + theme_ipsum_rc(grid="X")

Fin

One takeaway from both these charts is that it’s probably important to take education level into account when talking about the labor force participation rate. The get_labor_force_participation_rate() function — along with most other functions in the epidata package — also has options to factor the data by sex, race and age, so you can pull in all those views to get a more nuanced & informed understanding of this economic health indicator.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### cricketr and yorkr books – Paperback now in Amazon

18 February 2017 - 7:34am

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

My books
– Cricket Analytics with cricketr
– Beaten by sheer pace!: Cricket analytics with yorkr
are now available on Amazon in both Paperback and Kindle versions

The cricketr and yorkr packages are written in R, and both are available in CRAN. The books contain details on how to use these R packages to analyze performance of cricketers.

cricketr is based on data from ESPN Cricinfo Statsguru, and can analyze Test, ODI and T20 batsmen & bowlers. yorkr is based on data from Cricsheet, and can analyze ODI, T20 and IPL. yorkr can analyze batsmen, bowlers, matches and teams.

Cricket Analytics with cricketr
You can access the paperback at Cricket analytics with cricketr

Beaten by sheer pace! Cricket Analytics with yorkr
You can buy the paperback from Amazon at Beaten by sheer pace: Cricket analytics with yorkr

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Accessing MSSQL Server with R (RSQLServer with dplyr)

18 February 2017 - 5:44am

(This article was first published on r-bloggers – verenahaunschmid, and kindly contributed to R-bloggers)

Recently I have been starting to use dplyr for handling my data in R. It makes everything a lot smoother! My previous workflow – running an SQL query, storing the results as CSV, loading it in RStudio – is now history. With dplyr you can directly query data from many different databases in a very convenient way. Unfortunately Microsoft SQL Server is not directly supported but by using the package RSQLServer it can be done like with any other database. In this blog post I’ll explain how I installed everything on my Windows 7 machine to access MSSQL Server with R, since it was not as straight forward as one might think.

All you need … theoretically

The package RSQLServer is not available on CRAN anymore but it can be installed from the github repo imanuelcostigan/RSQLServer.

 # install.packages('devtools') devtools::install_github('imanuelcostigan/RSQLServer') 

If this works you’re lucky and already have all the necessary things installed. If not, follow the steps below

Errors I encountered

On my path from finding out about RSQLServer to actually using it I had to fix a few things. Searching on the internet and finding most of the answers on stackoverflow I was able to piece together everything I needed. To save everyone else who tries this some time I collected everything I found out.

Fixing the installation

The first error occured when I tried to install the package as shown above.

from URL https://api.github.com/repos/imanuelcostigan/RSQLServer/zipball/master
Installing RSQLServer
Error: running command ‘”C:/PROGRA~1/R/R-33~1.2/bin/x64/R” –no-site-file –no-environ –no-save –no-restore –quiet CMD config CC’ had status 1

To resolve this:

1. Find out which R version you have, if you don’t know, type R.version inside an R session
• Although I had R version 3.3 and downloaded the correct R version for it, it tried to install it in a directory called “34”. To make sure it will later find it, I changed this to “33” to match my R version. I am not sure whether it made a difference.
• Make sure to check the option where it asks if you want to add it to the path. This makes a difference.
3. Make sure you have the correct Java version and/or architecture installed. I didn’t have Java installed for x86 which also caused the installation to fail.
4. Restart your computer or at least log out and in again.

After that I was able to install the package using the following command.

 devtools::install_github('imanuelcostigan/RSQLServer') 

Getting integrated (NTLM) authentication to work

Finally accessing MSSQL Server with R

I was working with two different databases. One was remote and with that it worked immediately. The other one was a local installation on my computer.

For the local one I got an error message. Note: I don’t think it had anything to do with the databases being remote/local, just with the settings I used for installing my database!

Error in rJava::.jcall(driver@jdrv, “Ljava/sql/Connection;”, “connect”, :
java.sql.SQLException: Network error IOException: Connection refused: connect

To resolve this, you need to open the SQL Server configuration manager. I have the SQL Server configuration manager and the SQL Server configuration manager (2016) installed. For some settings it doesn’t matter which one you use but for this one I had to use the older one otherwise my database (SQL Server 2012) wouldn’t show up.

Make sure to enable TCP/IP. Restart you SQL Server instance. Now it should be working!

Use the configuration manager to enable TCP/IP.

All you need to connect (with Windows authentication) now is the following code:

 library(dplyr) conn <- RSQLServer::src_sqlserver("servername", database = "dbname")

Limitations

Currently dplyr cannot deal with schema names in code like this:

 conn %>% tbl("schema.table") 

In this case you have to use

 conn %>% tbl(sql("SELECT * FROM schema.table")) 

But this issue is getting addressed in rstats-db/DBI/issues/24 (dplyr uses DBI).

What now?
Query all the data!

Now that you have a connection to your SQL Server it's time to query all the data.

Check out the RStudio: Work with big data webinar or the dplyr databases vignette.

Have fun!

The post Accessing MSSQL Server with R (RSQLServer with dplyr) appeared first on verenahaunschmid.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – verenahaunschmid. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### RPushbullet 0.3.1

17 February 2017 - 9:17pm

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new release 0.3.1 of the RPushbullet package, following the recent 0.3.0 release is now on CRAN. RPushbullet is interfacing the neat Pushbullet service for inter-device messaging, communication, and more. It lets you easily send alerts like the one to the to your browser, phone, tablet, … — or all at once.

This release owes once again a lot to Seth Wenchel who helped to update and extend a number of features. We fixed one more small bug stemming from the RJSONIO to jsonlite transition, and added a few more helpers. We also enabled Travis testing and with it covr-based coverage analysis using pretty much the same setup I described in this recent blog post.

Changes in version 0.3.1 (2017-02-17)
• The target device designation was corrected (#39).

• Three new (unexported) helper functions test the validity of the api key, device and channel (Seth in #41).

• The summary method for the pbDevices class was corrected (Seth in #43).

• New helper functions pbValidateConf, pbGetUser, pbGetChannelInfo were added (Seth in #44 closing #40).

• New classes pbUser and pbChannelInfo were added (Seth in #44).

• Travis CI tests (and covr coverage analysis) are now enabled via an encrypted config file (#45).

Courtesy of CRANberries, there is also a diffstat report for this release.

More details about the package are at the RPushbullet webpage and the RPushbullet GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

17 February 2017 - 7:00pm

(This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers)

Armadillo is very versatile C++ library for linear algebra, brough
to R via the RcppArmadillo package. It
has proven to be very useful and popular, and is (as of February 2017) used by well over 300 CRAN
packages as indicated by the reverse depends / linking-to on its
CRAN page. Several earlier posts here on the
Rcpp Gallery also demonstrate different usage patterns.

Armadillo has a core focus on dense matrices, but continues to
expand its capabilities for sparse matrices. Basic operation are supported directly via the
templated header files, along with calls back into the default (dense) LAPACK and BLAS libraries we
can access easily as R uses them.

Armadillo also supports dedicated sparse solvers via the
the SuperLU library. Many Linux distributions ship
it, see e.g. the Debian package page and the
Ubuntu package page; there is also a
homebrew recipe for OS X /
macOS (or other systems using brew). As of this writing, the version in the current Ubuntu
release is behind the version Debian. But it is the 5.2.* version that is in Debian that is also
required with current Armadillo versions 7.700.* so we prepared ‘backports’ via
Dirk’s PPA repo.)

Recently, a GitHub issue ticket asked how to
use SuperLU along with
repeats the main answer, which was spread out over multiple posts, in a single article.

In a nutshell, two things need to happen:

1. One needs to define the required variable ARMA_USE_SUPERLU which has to be done before the
Armadillo headers are included. One possibility (shown below) is a #define statement right in the
code file.

2. One needs to tell the linker to use the SuperLU
library. This step is of course not perfectly portable, and does require that the library be
installed. (A genuinely portable solution would either need to test for presence of SuperLU, or include its
sources. Both aspects are beyond the scope of this post._

Sys.setenv("PKG_LIBS"="-lsuperlu")

// Important: this definition ensures Armadillo enables SuperLU #define ARMA_USE_SUPERLU 1 #include <RcppArmadillo.h> // [[Rcpp::depends(RcppArmadillo)]] using namespace arma; // [[Rcpp::export]] void superLuDemo() { sp_mat A = sprandu<sp_mat>(1000, 1000, 0.1); vec b = randu<vec>(1000); mat B = randu<mat>(1000, 5); vec x = spsolve(A, b); // solve one system mat X = spsolve(A, B); // solve several systems bool status = spsolve(x, A, b); // use default solver if (status == false) { Rcpp::Rcout << "no solution" << endl; } spsolve(x, A, b, "lapack"); // use LAPACK solver spsolve(x, A, b, "superlu"); // use SuperLU solver Rcpp::Rcout << "Done.\n"; }

This code snippet defines a function superLuDemo() which we can call from R:

superLuDemo() Done.

As the data generated here is random, we did not bother printing the (dense) result vector of
length 1000.

To leave a comment for the author, please follow the link and comment on their blog: Rcpp Gallery. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### A plot against the CatterPlots complot

17 February 2017 - 7:00pm

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

In these terrible times, we R people have more important subjects to debate/care about than ggplot2 vs. base R graphics (isn’t even worth discussing anyway, ggplot2 is clearly the best alternative). Or so I thought until I saw CatterPlots trending on Twitter this week and even being featured on Revolutions blog. It was cool because plots with cats are cool, but looking more closely at the syntax of CatterPlots, I couldn’t but realize it was probably a complot to make us all like base R graphics syntax again! So let me show you how to make a cute plot with the awesome ggplot2 extension emojifont.

The plot I’m going to re-make in this post is an old one already that I designed for celebrating my PhD thesis submission in March last year. I decided to compare the gestation time of my thesis, 1246 days, to the gestation time of some common mammals.

Getting data

I’m a bit ashamed because I forgot to write down where I took my data from but I’m pretty sure it was from reading Wikipedia pages and taking a more or less random numbers in the gestation time range indicated for each animals. I also chose animals I could find an emoji for. As you can imagine, it was quite an agreeable data collection. Here is what I end up with:

library("dplyr") library("forcats") gestation <- readr::read_csv2("data/2017-02-18-complot.csv")%>% arrange(gestation) %>% mutate(animal = fct_rev(fct_inorder(animal))) knitr::kable(gestation)  animal gestation label color mouse 19 mouse grey dog 61 dog grey cat 64 cat grey wolf 68 wolf grey tiger 105 tiger grey pig 112 pig grey sheep 144 sheep grey bear 220 bear grey human 280 baby grey whale 590 whale grey elephant 617 elephant grey my thesis 1246 closed_book gold levels(gestation$animal)  ## [1] "my thesis" "elephant" "whale" "human" "bear" ## [6] "sheep" "pig" "tiger" "wolf" "cat" ## [11] "dog" "mouse"  I used forcats::fct_inorder and forcats::fct_rev because in such a post not using forcats would probably have been a deadly sin (and I don’t have 7 lives), but also because I needed the levels to be ordered from longest to shortest gestation time, otherwise the plot would have looked ugly. Making the plot At the time I used gganimate which I still love, but in the meantime I also fell in love with magick in particular since reading Bob’s post. So I decided to re-write my code with magick. On this blog I already published an article featuring both emojifont and gganimate. library("ggplot2") library("emojifont") load.emojifont('OpenSansEmoji.ttf') library("magick") plot_one_gestation <- function(gestation_time, gestation){ now_data <- filter_(gestation, ~ gestation <= gestation_time) p <- ggplot(now_data) p <- p + geom_col(aes(x = animal, y = gestation, fill = color)) p <- p + scale_fill_manual(values = c("grey" = "grey30", "gold" = "darkgoldenrod1")) p <- p + geom_text(aes(x = animal, y = gestation + 45, label = emoji(label)), family="OpenSansEmoji", size=30) p <- p + theme(axis.text.y=element_blank(), axis.ticks=element_blank(), text = element_text(size=40), legend.position="none") p <- p + ggtitle(gestation_time) p <- p + scale_x_discrete(limits = levels(gestation$animal)) p <- p + ylim(c(0, max(gestation$gestation) + 50)) p <- p + coord_flip() p <- p + xlab("Animal") p <- p + ylab("Gestation in days") outfil <- paste0("figs/animals_", gestation_time, ".png") ggsave(outfil, p, width=5, height=5) outfil }  I wanted the last image to be repeated 4 times so that everyone might have time to ponder over my accomplishment. I was celebrating and boasting with this graph. library("purrr") c(unique(gestation$gestation), rep(max(gestationgestation), 3))%>% map(plot_one_gestation, gestation = gestation) %>% map(image_read) %>% image_join() %>% image_animate(fps=1) %>% image_write("gestation.gif")  Aren’t these animal emojis adorable? At least as adorable as CatterPlotscats? You can also bring emojis to ggplot2 with emoGG (check out Lucy’s recent post) so you don’t need to ever leave the comfort of ggplot2 for the sake of “c[au]t[e]?ness”. Disclaimer: I actually starred CatterPlots on Github because I’m an open-minded cat person (married to a cat allergic person!). To leave a comment for the author, please follow the link and comment on their blog: Maëlle. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Optimization and Operations Research in R 17 February 2017 - 4:24pm (This article was first published on R Blog, and kindly contributed to R-bloggers) Authors: Stefan Feuerriegel and Joscha Märkle-Huß R is widely taught in business courses and, hence, known by most data scientists with business background. However, when it comes to optimization and Operations Research, many other languages are used. Especially for optimization, solutions range from Microsoft Excel solvers to modeling environments such as Matlab and GAMS. Most of these are non-free and require students to learn yet another language. Because of this, we propose to use R in optimization problems of Operations Research, since R is open source, comes for free and is widely known. Furthermore, R provides a multitude of numerical optimization packages that are readily available. At the same time, R is widely used in industry, making it a suitable and skillful tool to lever the potential of numerical optimization. The materials starts with a review of numerical and linear algebra basics for optimization. Here, participants learn how to derive a problem statement that is compatible with solving algorithms. This is followed by an overview on problem classessuch as one and multi-dimensional problems. Starting with linear and quadratic algorithms, we also cover convex optimization, followed by non-linear approaches such as gradient based (gradient descent methods), Hessian based (Newton and quasi-Newton methods) and non-gradient based (Nelder-Mead). We finally demonstrate the potent capabilities of R for Operations Research: we show how to solve optimization problems in industry and business, as well as illustrate the use in methods for statistics and data mining (e.g. quantile regression). All examples are supported by appropriate visualizations. Slides Homeworks To leave a comment for the author, please follow the link and comment on their blog: R Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Catterplots: Plots with cats 17 February 2017 - 1:00pm (This article was first published on Revolutions, and kindly contributed to R-bloggers) As a devotee of Tufte, I'm generally against chartjunk. Graphical elements that obscure interpretation of the data occasionally have a useful role to play, but more often than not that role is to entertain the expense of enlightenment, or worse, to actively mislead. So it's with mixed feelings that I refer you to catterplots, an R package by David Gibbs to create scatterplots using outlines of cats as the symbols. But hey, it's the Friday before a long weekend, so here's a poorly annotated catterplot of households with dogs vs households with cats in the 48 continental US states (plus DC): The data behind this chart I found at data.world, which looks to be an interesting new resource for data sets. (Registration is required to download data, however.) The code behind the chart is below. (If you want a variant featuring Nyan Cat, try cat=11.) dogscats <- catplot(dogs, cats, size=0.1, catcolor=c(0.7,0.7,0.7,1), cat=3, xlab="Households with dogs", ylab="Households with cats", main="Household Pet Ownership in US States") The catterplot package is available for on GitHub at the link below. GitHub (GibbsDavidl): Catterplots To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### January New Data Packages 17 February 2017 - 12:00pm (This article was first published on RStudio, and kindly contributed to R-bloggers) by Joseph Rickert As forecast, the number of R packages hosted on CRAN exceeded 10,000 in January. Dirk Eddelbuettel, who tracks what’s happening on CRAN with his CRANberries site, called hurricaneexposure the 10,000th package in a tweet on January 27th. hurricaneexposure was one of two hundred and six new packages that arrived on CRAN in January. Approximately 10% of these packages have to do with providing access to data in by some means or another. Some packages contain the data sets, some provide wrappers to APIs, and at least one package provides code to scrape data from a site. The following 17 packages are picks for data-related packages for January 2017. I will select packages in other categories in a follow-up post. Data Packages • elevatr v0.1.1: Provides access to several databases that provide elevation data, including Mapzen Elevation Service, Mapzen Terrain Service, Amazon Terrain Tiles, and the USGS Elevation Point Query Service. There is a vignette. • epidata v0.1.0: Provides tools to retrieve data from the Economic Policy Institute. The README shows how to use the package. • europop v0.3: Contains a data set giving the populations of all European cities with at least 10,000 inhabitants during the period 1500-1800. • fivethirtyeight v0.1.0: Provides the data, code, and interactive visualizations behind FiveThirtyEight Stories. There is a vignette that provides an example of a data analysis, and a list of data sets that are included. • getCRUCLdata v.1.1: Provides functions that automate downloading and importing climatology data from University of East Anglia Climate Research Unit (CRU). There is a vignette to get you started. • hurricaneexposure v0.0.1: Allows users to create time series of tropical storm exposure histories for chosen counties for a number of hazard metrics (wind, rain, distance from the storm, etc.). The vignette provides an overview. • metScanR v0.0.1: Provides functions for mapping and gathering meteorological data from various US surface networks: COOP, USCRN, USRCRN, AL-USRCRN, ASOS, AWOS, SNOTEL, SNOTELLITE, SCAN, SNOW, and NEON. • mglR v0.1.0: Provides tools to download and organize large-scale, publicly available genomic studies on a candidate gene scale. The vignette shows how to use the package. • nzpullover v0.0.2: Contains data sets of driving offences and fines in New Zealand between 2009 and 2016, originally published by the New Zealand Police. • owmr v0.7.2: Provides a wrapper for the OpenWeatherMap API. • PeriodicTable v0.1.1: Contains a data set of the properties of chemical elements. • pwt9 v9.0-0: Contains the Penn World Table 9, which provides information on relative levels of income, output, inputs, and productivity for 182 countries between 1950 and 2014. • rdwd v0.7.0: Provides functions to obtain climate data from the German Weather Service, Deutscher Wetterdienst, (DWD). There is a vignette on Weather Stations and another showing how to use the package. • rwars v1.0.0: Provides functions to retrieve and reformat data from the ‘Star Wars’ API SWAPI. The vignette shows how to use the package. • wikidataQueryServiceR v0.1.0: Provides an API Client Library for Wikidata Query Service, which provides a way for tools to query Wikidata via SPARQL. See the README for how to use it. • wikilake v0.1: Provides functions to scrape metadata about lakes from Wikipedia. The vignette fetches data from Michigan lakes. • worrms v0.1.0: Provides a client for the World Register of Marine Species. The vignette shows how to use the package. To leave a comment for the author, please follow the link and comment on their blog: RStudio. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Naive Bayes Classification in R (Part 2) 17 February 2017 - 9:31am (This article was first published on Environmental Science and Data Analytics, and kindly contributed to R-bloggers) Following on from Part 1 of this two-part post, I would now like to explain how the Naive Bayes classifier works before applying it to a classification problem involving breast cancer data. The dataset is sourced from Matjaz Zwitter and Milan Soklic from the Institute of Oncology, University Medical Center in Ljubljana, Slovenia (formerly Yugoslavia) and the attributes are as follows: age: a series of ranges from 20-29 to 70-79 menopause: whether a patient was pre- or post-menopausal upon diagnosis tumor.size: the largest diameter (mm) of excised tumor inv.nodes: the number of axillary lymph nodes which contained metastatic breast cancer node.caps: whether metastatic cancer was contained by the lymph node capsule deg.malign: the histological grade of the tumor (1-3 with 3 = highly abnormal cells) breast: which breast the cancer occurred in breast.quad: region of the breast cancer occurred in (four quadrants with nipple = central) irradiat: whether the patient underwent radiation therapy Some preprocessing of these data was required as there were some NAs (9 in total). I imputed predicted values using separate Naive Bayes classifiers. The objective here is to attempt to predict, using these attributes, with relatively high accuracy whether a recurrence of breast cancer is likely to occur in patients who were previously diagnosed and treated for the disease. We can pursue this objective by using the Naive Bayes classification method. Naive Bayes’ Classification Below is the Naive Bayes’ Theorem: P(A | B) = P(A) * P(B | A) / P(B) Which can be derived from the general multiplication formula for AND events: P(A and B) = P(A) * P(B | A) P(B | A) = P(A and B) / P(A) P(B | A) = P(B) * P(A | B) / P(A) If I replace the letters with meaningful words as I have been adopting throughout, the Naive Bayes formula becomes: P(outcome | evidence) = P(outcome) * P(evidence | outcome) / P(evidence) It is with this formula that the Naive Bayes classifier calculates conditional probabilities for a class outcome given prior information or evidence (our attributes in this case). The reason it is termed “naive” is because we assume independence between attributes when in reality they may be dependent in some way. For the breast cancer dataset we will be working with, some attributes are clearly dependent such as age and menopause status while some may or may not be dependent such as histological grade and tumor size. This assumption allows us to calculate the probability of the evidence by multiplying the individual probabilities of each piece of evidence occurring together using the simple multiplication rule for independent AND events. Another point to note is that this naivety results in probabilities that are not entirely mathematically correct but they are a good approximation and adequate for the purposes of classification. Indeed, the Naive Bayes classifier has proven to be highly effective and is commonly deployed in email spam filters. Calculating Conditional Probabilities Conditional probabilities are fundamental to the working of the Naive Bayes formula. Tables of conditional probabilities must be created in order to obtain values to use in the Naive Bayes algorithm. The R package e1071 contains a very nice function for creating a Naive Bayes model: library(e1071) model <- naiveBayes(class ~ ., data = breast_cancer) class(model) summary(model) print(model) The model has class “naiveBayes” and the summary tells us that the model provides a-priori probabilities of no-recurrence and recurrence events as well as conditional probability tables across all attributes. To examine the conditional probability tables just print the model. One of our tasks for this assignment was to create code which would give us the same conditional probabilities as those output by the naiveBayes() function. I went about this in the following way: tbl_list <- sapply(breast_cancer[-10], table, breast_cancer[ , 10]) tbl_list <- lapply(tbl_list, t) cond_probs <- sapply(tbl_list, function(x) { apply(x, 1, function(x) { x / sum(x) }) }) cond_probs <- lapply(cond_probs, t) print(cond_probs) The first line of code uses the sapply function to loop over all attribute variables in the dataset and create tables against the class attribute. I then used the lapply function to transpose all tables in the list so the rows represented the class attribute. To calculate conditional probabilities for each element in the tables, I used sapply, lapply and anonymous functions. I had to transpose the output in order to get the same structure as the naiveBayes model output. Finally, I printed out my calculated conditional probabilities and compared them with the naiveBayes output to validate the calculations. Applying the Naive Bayes’ Classifier So I’ve explained (hopefully reasonably well) how the Naive Bayes classifier works based on the fundamental rules of probability. Now it’s time to apply the model to the data. This is easily done in R by using the predict() function. preds <- predict(model, newdata = breast_cancer) You will see that I have trained the model using the entire dataset and then made predictions on the same dataset. In our assignment we were asked to train the model and test it on the dataset, treating the dataset as an unlabeled test set. This is unconventional as the training set and test set are then identical but I believe the assignment was intended to just test our understanding of how the method works. In practice, one would use a training set for the model to learn from and a test set to assess model accuracy. If one outcome class is more abundant in the dataset, as is the case with the breast cancer data (no-recurrence: 201, recurrence: 85), the data is unbalanced. This is okay for a generative Naive Bayes model as you want your model to learn from real-world events and to capture the truth. Manipulating the data to achieve less skew would be dangerous. Applying the model to the data gives the following confusion matrix from which a model accuracy of 75% can be calculated: conf_matrix <- table(preds, breast_cancerclass)

This post has only scraped the surface of classification methods in machine learning but has been a useful revision for myself and perhaps it may help others new to the Naive Bayes classifier. Please feel free to comment and correct any errors that may be present.

Featured image By Dennis Hill from The OC, So. Cal. – misc 24, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=54704175

To leave a comment for the author, please follow the link and comment on their blog: Environmental Science and Data Analytics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Naive Bayes Classification in R (Part 1)

17 February 2017 - 9:30am

(This article was first published on Environmental Science and Data Analytics, and kindly contributed to R-bloggers)

Introduction

A very useful machine learning method which, for its simplicity, is incredibly successful in many real world applications is the Naive Bayes classifier. I am currently taking a machine learning module as part of my data science college course and this week’s practical work involved a classification problem using the Naive Bayes method. I thought that I’d write a two-part piece about the project and the concepts of probability in order to consolidate my own understanding as well to share my method with others who may find it useful. If the reader wishes to get straight into the Bayes classification problem, please see Part 2 here.

The dataset used in the classification problem is titled “Breast cancer data” and is sourced from Matjaz Zwitter and Milan Soklic from the Institute of Oncology, University Medical Center in Ljubljana, Slovenia (formerly Yugoslavia). It is an interesting dataset which contains categorical data on two classes with nine attributes. The dataset can be found here. The two outcome classes are “no-recurrence events” and “recurrence events” which refer to whether breast cancer returned to a patient or not having previously being diagnosed  with the disease and treated for it.

Our goal was to use these data to build a Naive Bayes classification model which could predict with good accuracy whether a patient diagnosed with breast cancer is likely to experience a recurrence of the disease based on the attributes. First, I will briefly discuss the basic concepts of probability before moving on to the classification problem in Part 2.

Fundamentals of Probability

Probability boils down to working out proportions and performing mathematical operations on them such as addition, multiplication and division. The fact that Naive Bayes classification has been so successful given this simple foundation is truly remarkable. There are a number of rules and concepts involved in probability calculations which one should be aware of in order to understand how the Naive Bayes classifier works. We will discuss these now but first I want to explain independent events, dependent events and mutual exclusiveness.

Independent events are those which, upon occurrence, do not affect the probability of another event occurring. An example would be Brazil winning the football World Cup and a hurricane forming over the Atlantic Ocean in September. Dependent events are those which, upon occurrence, affect the probability of another event occurring (i.e. they are linked in some way). An example would be getting a cold and going to work the following day. The probability of you going to work can be affected by the occurrence of a cold as you don’t feel very well and would like to not infect your colleagues. When two or more events are mutually exclusive, they cannot occur simultaneously. What is the probability of rolling a 6 and a 3 on a single dice roll? Zero. It is impossible. These two outcomes are mutually exclusive.

To calculate the probability of a single event occurring, sometimes called the observational probability, you just take the number of times the event occurred and divide it by the total number of processes which occurred that could have lead to that event. For example, let’s say I sprint 100 metres fifty times over the course of a month. I want to know the probability of my time being below 10 seconds based on the previous sprints. If I ran sub-10 second 100 metres on 27 occasions, then the observational probability of me running a sub-10 second 100 metres is simply:

P(<10 second 100 m) = (27 / 50) = 0.54 (54%)

Therefore, I can expect to sprint 100 metres in under 10 seconds about half of the time based on records of the fifty completed sprints. Please note that the P() wrapper is used to denote the probability of some outcome and is used throughout these posts.

Multiplication Rules for AND Events

When calculating the probability of two or more events occurring simultaneously, we first consider whether the events are independent or dependent. If they are independent we can use the simple multiplication rule:

P(outcome 1 AND outcome 2) = P(outcome 1) * P(outcome 2)

If I were to calculate the probability of Brazil winning the football World Cup and a hurricane forming over the Atlantic Ocean in September, I would use this simple multiplication rule. The two events are independent as the occurrence of one does not affect the other’s chance of occurring. If Brazil have a probability of winning the World Cup of 41% and the probability of a hurricane over the Atlantic Ocean in September is 91%, then we calculate the probability of both occurring:

P(Brazil AND Hurricane) = P(Brazil) * P(Hurricane)

= (0.41) * (0.91)

= 0.3731 (37%)

If two or more events are dependent, however, we must use the general multiplication rule. This formula is always valid, in both cases of independent and dependent events, as we will see.

P(outcome 1 AND outcome 2) = P(outcome 1) * P(outcome 2 | outcome 1)

P(outcome 2 | outcome 1) refers to the conditional probability of outcome 2 occurring given outcome 1 has already occurred. One can immediately see how this formula incorporates dependence between the events. If the events were independent, then the conditional probability is irrelevant as one outcome does not influence the chance of the other and P(outcome 2 | outcome 1) is simply P(outcome 2).  The formula just becomes the simple multiplication rule already described.

We can apply the general multiplication rule to a deck of cards example. What would be the probability of drawing a King of any suit and an Ace of any suit from a 52-card deck with just two draws? There are 4 Kings and 4 Aces in a standard deck so we know that drawing an Ace immediately affects our chances of drawing a King as there are fewer cards in the deck after the first draw. We can use the general multiplication formula as follows:

P(Ace AND King) = P(Ace) * P(King | Ace)

= (4 / 52) * (4 / 51)

= 0.006033183 (0.6%)

Obviously, if two events are mutually exclusive and cannot occur simultaneously, they are disjoint and the multiplication rules cannot be applied. The dice roll example describes such a scenario. The best we can do in such a case is state the probability of either outcome occurring. This brings us to the the simple addition rule.

When calculating the probability of either one event or the other occurring, we use the addition rule. When the outcomes are mutually exclusive, we use the simple addition formula:

P(outcome 1 OR outcome 2) = P(outcome 1) + P(outcome 2)

Applied to the dice roll example, what is the probability of rolling a 6 or a 3? Both outcomes cannot occur simultaneously. The probability of rolling a 6 is (1 / 6) and the same can be said for rolling a 3. Therefore,

P(6 OR 3) = (1 / 6) + (1 / 6) = 0.33 (33%)

However, if the events are not mutually exclusive and can occur simultaneously, we must use the following general addition formula which is always valid in both cases of mutual exclusiveness and non-mutual exclusiveness.

P(outcome 1 OR outcome 2) = P(outcome 1) + P(outcome 2) – P(outcome 1 AND outcome 2)

Applied to the World Cup and hurricane example, this would mean calculating the probabilities of both outcomes as well as the probability of both occurring simultaneously.

P(Brazil) + P(Hurricane) – P(Brazil AND Hurricane)

We know that the outcomes are independent, the occurrence of one  does not affect the probability of the other occurring, and we can therefore use the simple multiplication rule for two simultaneous events to calculate P(Brazil AND Hurricane).

P(Brazil AND Hurricane) = (0.41) * (0.91)

= 0.3731 (37%)

Finally, we can plug this probability into the main formula to get our answer:

P(Brazil OR Hurricane) = (0.91) + (0.41) – (0.3731) = 0.9469 (95%)

This makes sense right? Brazil are indeed an awesome football team. However, what drives this probability up is the fact that September is one of the months in the Atlantic hurricane season and you are almost guaranteed to see a hurricane event.

Summary

That was a brief introduction to probability and the rules associated with it. I hope it is clear and easy to follow. The important rules to remember are the general forms of the multiplication rule and the addition rule as they are valid in all cases. Using them in the above examples in place of the simple rules still yields the same results. As such, it may be more useful to memorise these general forms. Here they are once more:

P(outcome 1 AND outcome 2) = P(outcome 1) * P(outcome 2 | outcome 1)

P(outcome 1 OR outcome 2) = P(outcome 1) + P(outcome 2) – P(outcome 1 AND outcome 2)

I have chosen to use “outcome 1” and “outcome 2” instead of the usual “A” and “B” format because I believe it easier to understand Naive Bayes when replacing the letters with meaningful words. Here is how the general rules are typically written:

P(A and B) = P(A) * P(B | A)

P(A or B) = P(A) + P(B) – P(A and B)

In Part 2 I describe how to implement the Naive Bayes classifier in R and explain how it works based on the fundamentals of probability outlined here.

Featured image By Dennis Hill from The OC, So. Cal. – misc 24, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=54704175

To leave a comment for the author, please follow the link and comment on their blog: Environmental Science and Data Analytics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### How to perform PCA with R?

17 February 2017 - 9:14am

(This article was first published on François Husson, and kindly contributed to R-bloggers)

This post shows how to perform PCA with R and the package FactoMineR.

If you want to learn more on methods such as PCA, you can enroll in this MOOC (everyting is free): MOOC on Exploratory Multivariate Data Analysis

Dataset

Here is a wine dataset, with 10 wines and 27 sensory attributes (like sweetness, bitterness, fruity odor, and so on), 2 preference variables, and a qualitative variable corresponding to the wine labels (there are 2 labels, Sauvignon and Vouvray). The values in the data table correspond to the average score given by several judges for the same wine and descriptive variable. The aim of doing PCA here is to characterize the wines according to their sensory characteristics.

Performing PCA … with additional information

Here are the lines of code used. Note that we use the information given by the qualitative variable.

wine

library(FactoMineR)

### PCA with supplementary variables
res

### Print the main results
summary(res)

Two graphs are given by default, one for the individuals, one for the quantitative variables.

But is is interesting to consider the qualitative variable to better understand the differences between wines. Wines are colored according to their label.

## Drawing wines according to the label plot(res,habillage="Label")

Interpretation

The graph of the individuals shows, for instance, that S Michaud and S Trotignon are very “close”. It means that the scores for S Michaud and S Trotignon are approximately the same, whatever the variable. In the same way, Aub Marigny and Font Coteaux are wines with similar sensory scores for the 27 attributes. On the other hand, Font Brulés and S Trotignon have very different sensory profiles, because the first principal component, representing the main axis of variability between wines, separates them strongly.

The variables astringency, visual intensity, mushroom odor and candied fruit odor, found to the right, have correlations close to 1 with the first dimension. Since the correlation with the 1st dimension is close to 1, the values of these variables move in the same direction as the coordinates in the 1st dimensions. Wines with a small value in the 1st dimension have low values for these variables, and wines with large values in the 1st dimension have high values for these variables. Thus, the wines that are to the right of the plot have high (and positive) values in the 1st dimension and thus have high values for these variables. With the same logic, wines that are to the left have a small value in the 1st dimension, and thus low values for these variables.

For the variables passionfruit odor, citrus odor and freshness, everything is the other way around. The correlation with the 1st dimension is close to -1, and thus the values move in the opposite direction. Wines with a low value in the 1st dimension have low coordinate values, and thus have high values for these variables, and wines with large values in the 1st dimension have small values for these variables.

Overall, we see that the first dimension splits apart wines that are considered fruity and flowery (on the left) from wines that are woody or with vegetal odors. And this is the main source of variability.

So then, how can we interpret the 2nd dimension, the vertical axis? At the top, wines have large values on the vertical axis. Since the correlation coefficients between the 2nd dimension and variables such acidity or bitterness are close to 1, it means that wines at the top take large values for these variables. And wines at the bottom have small values in the 2nd dimension, and thus small values for these variables. For sweetness, the correlation coefficient is close to -1, so wines that have a small value in the 2nd dimension are sweet, while wines that have large values are not.

Overall, the 2nd dimension separates the wines at the top, acidic and bitter, from sweet wines at the bottom.

To leave a comment for the author, please follow the link and comment on their blog: François Husson. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Moving largish data from R to H2O – spam detection with Enron emails

17 February 2017 - 6:00am

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Moving around sparse matrices of text data – the limitations of as.h2o

This post is the resolution of a challenge I first wrote about in late 2016, moving large sparse data from an R environment onto an H2O cluster for machine learning purposes. In that post, I experimented with functionality recently added by the H2O team to their supporting R package, the ability for as.h2o() to interpret a sparse Matrix object from R and convert it to an H2O frame. The Matrix and as.h2o method is ok for medium sized data but broke down on my hardware with a larger dataset – a bags of words from New York Times articles with 300,000 rows and 102,000 columns. Cell entries are the number of times a particular word is used in the document represented by a row and are mostly empty, so my 12GB laptop has no problem managing the data in a sparse format like Matrix from the Matrix package or a simple triplet matrix from the slam package. I’m not sure what as.h2o does under the hood in converting from Matrix to an H2O frame, but it’s too much for my laptop.

My motivation for this is that I want to use R for convenient pre-processing of textual data using the tidytext approach; but H2O for high powered machine learning. tidytext makes it easy to create a sparse matrix with cast_dtm or cast_sparse, but uploading this to H2O can be a challenge.

How to write from R into SVMLight format

After some to-and-fro on Stack Overflow, the best advice was to export the sparse matrix from R into a SVMLight/LIBSVM format text file, then read it into H2O with h2o.importFile(..., parse_type = "SVMLight"). This turned the problem from an difficult and possibly intractable memory managment challenge into a difficult and possibly intractable data formatting and file writing challenge – how to efficiently write files in SVMLight format.

SVMLight format combines a data matrix with some modelling information ie the response value of a model, or “label” as it is (slightly oddly, I think) often called in this world. Instead of a more conventional row-based sparse matrix format which might convey information in row-column-value triples, it uses label-column:value indicators. It looks like this:

1 10:3.4 123:0.5 34567:0.231 0.2 22:1 456:0.3 

That example is equivalent to two rows of a sparse matrix with at least 34,567 columns. The first row has 1 as the response value, 3.4 in the 10th column of explanatory variables, and 0.231 in the 34,567th column; the second row has 0.2 as the response value, 1 in the 22nd column, and so on.

Writing data from R into this format is a known problem discussed in this Q&A on Stack Overflow. Unfortunately, the top rated answer to that question, e1071::write.svm is reported as being slow, and also it is integrated into a workflow that requires you to first fit a Support Vector Machine model to the data, a step I wanted to avoid. That Q&A led me to a GitHub repo by zygmuntz that had a (also slow) solution for writing dense matrices into SVMLight format, but that didn’t help me as my data were too large for R to hold in dense format. So I wrote my own version for taking simplet triplet matrices and writing SVMLight format. My first version depended on nested paste statements that were applyd to each row of the data and was still too slow at scale, but with the help of yet another Stack Overflow interaction and some data table wizardry by @Roland this was able to reduce the expected time writing my 300,000 by 83,000 New York Times matrix (having removed stop words) from several centuries to two minutes.

I haven’t turned this into a package – it would seem better to find an existing package to add it to than create a package just for this one function, any ideas appreciated. The functions are available on GitHub but in case I end up moving them, here they are in full. One function creates a big character vector; the second writes that to file. This means multiple copies of the data need to be held in R and hence creates memory limitations, but is much much faster than writing it one line at a time (seconds rather than years in my test cases).

library(data.table) library(slam) # Convert a simple triplet matrix to svm format #' @author Peter Ellis #' @return a character vector of length n = nrow(stm) calc_stm_svm <- function(stm, y){ # returns a character vector of length y ready for writing in svm format if(!"simple_triplet_matrix" %in% class(stm)){ stop("stm must be a simple triple matrix") } if(!is.vector(y) | nrow(stm) != length(y)){ stop("y should be a vector of length equal to number of rows of stm") } n <- length(y) # data.table solution thanks to @roland at http://stackoverflow.com/questions/41477700/optimising-sapply-or-for-paste-to-efficiently-transform-sparse-triplet-m/41478999#41478999 stm2 <- data.table(i = stm$i, j = stm$j, v = stm$v) res <- stm2[, .(i, jv = paste(j, v, sep = ":")) ][order(i), .(res = paste(jv, collapse = " ")), by = i][["res"]] out <- paste(y, res) return(out) } #' @param stm a simple triplet matrix (class exported slam) of features (ie explanatory variables) #' @param y a vector of labels. If not provided, a dummy of 1s is provided #' @param file file to write to. #' @author Peter Ellis write_stm_svm <- function(stm, y = rep(1, nrow(stm)), file){ out <- calc_stm_svm(stm, y) writeLines(out, con = file) } Example application – spam detection with the Enron emails Although I’d used the New York Times bags of words from the UCI machine learning dataset repository for testing the scaling up of this approach, I actually didn’t have anything I wanted to analyse that data for in H2O. So casting around for an example use case I decided on using the Enron email collection for spam detection, first analysed in a 2006 conference paper by V. Metsis, I. Androutsopoulos and G. Paliouras. As well as providing one of the more sensational corporate scandals of recent times, the Enron case has blessed data scientists with one of the largest published sets of emails collected from their natural habitat. The original authors classified the emails as spam or ham and saved these pre-processed data for future use and reproducibility. I’m not terribly knowledgeable (or interested) in spam detection, so please take the analysis below as a crude and naive example only. Data First the data need to be downloaded and unzipped. The files are stored as 6 Tape ARchive files library(tidyverse) library(tidytext) library(tm) library(testthat) library(data.table) library(h2o) library(stringr) library(forcats) #===========download files============= # Files described at http://csmining.org/index.php/enron-spam-datasets.html baseurl <- "http://csmining.org/index.php/enron-spam-datasets.html?file=tl_files/Project_Datasets/Enron-Spam%20datasets/Preprocessed/enron" filenames <- paste0(baseurl, 1:6, ".tar.tar") for(i in 1:6){ message("Downloading ", i) dfile <- paste0("enron", i, ".tar.tar") download.file(filenames[i], destfile = dfile, mode = "wb") message("Un-archiving ", i) untar(dfile) } This creates six folders with the names enron1, enron2 etc; each with a spam and a ham subfolder containing numerous text files. The files look like this example piece of ham (ie non-spam; a legitimate email), chosen at random: Subject: re : creditmanager net meeting aidan , yes , this will work for us . vince " aidan mc nulty " on 12 / 16 / 99 08 : 36 : 14 am to : vince j kaminski / hou / ect @ ect cc : subject : creditmanager net meeting vincent , i cannot rearrange my schedule for tomorrow so i would like to confirm that we will have a net - meeting of creditmanager on friday 7 th of january at 9 . 30 your time . regards aidan mc nulty 212 981 7422  The pre-processing has removed duplicates, emails sent to themselves, some of the headers, etc. Importing the data into R and making tidy data frames of documents and word counts is made easy by Silge and Robinson’s tidytext package which I never tire of saying is a game changer for convenient analysis of text by statisticians: #=============import to R================ # Adapting the example at http://tidytextmining.com/usenet.html folders <- paste0("enron", 1:6) spam <- data_frame(file = dir(paste0(folders, "/spam"), full.names = TRUE)) %>% mutate(text = map(file, readLines, encoding = "Latin-1")) %>% transmute(id = basename(file), text) %>% unnest(text) %>% mutate(SPAM = "spam") ham <- data_frame(file = dir(paste0(folders, "/ham"), full.names = TRUE)) %>% mutate(text = map(file, readLines, encoding = "Latin-1")) %>% transmute(id = basename(file), text) %>% unnest(text) %>% mutate(SPAM = "ham") enron_raw <- rbind(spam, ham) #============error checks - failing!=================== # Check against the counts provided at http://csmining.org/index.php/data.html # Turns out some 3013 spam messages have gone missing # should be 20170 spam messages and 16545 ham messages: # returns an error expect_equal(length(unique(enron_raw$id)), 20170 + 16545) enron_raw %>% select(id, SPAM) %>% distinct() %>% summarise(spam_count = sum(SPAM == "spam"), ham_count = sum(SPAM == "ham")) # For my purposes I decide not to worry about this. #=================further processing================== enron <- enron_raw %>% # will just remove the "Subject:" and "Subject :" and treat subject words like any other mutate(text = gsub("^Subject *: *", "", text), text = gsub("<U.....>", "", text, fixed = FALSE)) enron_words <- enron %>% unnest_tokens(word, text) %>% select(-SPAM)

As well as basic word counts, I wanted to experiment with other characteristics of emails such as number of words, number and proportion of of stopwords (frequently used words like “and” and “the”). I create a traditional data frame with a row for each email, identified by id, and columns indicating whether it is SPAM and those other characteristics of interest.

# First I'm creating a summary, dense data frame with some numeric info on each document enron_sum1 <- enron %>% mutate(number_characters = nchar(text)) %>% group_by(id, SPAM) %>% summarise(number_characters = sum(number_characters)) enron_sum2 <- enron_words %>% group_by(id) %>% summarise(number_words = length(word)) enron_sum3 <- enron_words %>% anti_join(stop_words, by = "word") %>% group_by(id) %>% summarise(number_nonstop_words = length(word)) enron_sum_total <- enron_sum1 %>% left_join(enron_sum2, by = "id") %>% left_join(enron_sum3, by = "id") %>% mutate(number_stop_words = number_words - number_nonstop_words, proportion_stop_words = number_stop_words / number_words) %>% select(-number_nonstop_words) enron_sum_total Source: local data frame [33,702 x 6] Groups: id [33,702] id SPAM number_characters number_words number_stop_words <chr> <chr> <int> <int> <int> 1 0001.1999-12-10.farmer.ham.txt ham 28 4 0 2 0001.1999-12-10.kaminski.ham.txt ham 24 4 3 3 0001.2000-01-17.beck.ham.txt ham 3486 559 248 4 0001.2000-06-06.lokay.ham.txt ham 3603 536 207 5 0001.2001-02-07.kitchen.ham.txt ham 322 48 18 6 0001.2001-04-02.williams.ham.txt ham 1011 202 133 7 0002.1999-12-13.farmer.ham.txt ham 4194 432 118 8 0002.2001-02-07.kitchen.ham.txt ham 385 64 40 9 0002.2001-05-25.SA_and_HP.spam.txt spam 990 170 80 10 0002.2003-12-18.GP.spam.txt spam 1064 175 63 

I next make my sparse matrix as a document term matrix (which is a special case of a simplet triplet matrix from the slam package), with a column for each word (having first limited myself to interesting words)

used_words <- enron_words %>% # knock out stop words: anti_join(stop_words, by = "word") %>% # knock out numerals, and words with only 2 or 1 letters: mutate(word = gsub("[0-9]", "", word), wordlength = nchar(word)) %>% filter(wordlength > 2) %>% group_by(word) %>% summarise(count = length(word)) %>% ungroup() %>% # knock out words used less than 10 times: filter(count >= 10) enron_dtm <- enron_words %>% right_join(used_words, by = "word") %>% cast_dtm(id, word, count) # we need a version of the dense data in the same order as the document-term-matrix, to do a sort of # manual join of the sparse matrix with the dense one later in H2O. rows <- data_frame(id = rownames(enron_dtm)) enron_dense <- left_join(rows, enron_sum_total, by = "id") expect_equal(nrow(enron_dtm), nrow(enron_dense)) expect_equal(rownames(enron_dtm), enron_dense$id) Now we can load our two datasets onto an H2O cluster for analysis: #================import to h2o and join up there============ h2o.init(nthreads = -1, max_mem_size = "8G") # Load up the dense matrix with counts of stopwords etc: enron_dense_h2o <- as.h2o(enron_dense) # Load up the sparse matrix with columns for each word: thefile <- tempfile() write_stm_svm(enron_dtm, file = thefile) enron_sparse_h2o <- h2o.uploadFile(thefile, parse_type = "SVMLight") unlink(thefile) # Number of rows should be equal: expect_equal(nrow(enron_sparse_h2o), nrow(enron_dtm)) # Number of columns should be 1 extra in H2O, dummy variable of labels (1) added by write_stm_svm: expect_equal(ncol(enron_sparse_h2o), ncol(enron_dtm) + 1) # First column should be the dummy labels = all one expect_equal(mean(enron_sparse_h2o[ , 1]), 1) enron_fulldata <- h2o.cbind(enron_sparse_h2o, enron_dense_h2o) head(colnames(enron_fulldata), 10) # Convert the target variable to a factor so h2o.glm and other modelling functions # know what to do with it: enron_fulldata[ , "SPAM"] <- as.factor(enron_fulldata[ , "SPAM"]) I now have an H2O frame with 33602 rows and 26592 columns; most of the columns representing words and the cells being counts; but some columns representing other variables such as number of stopwords. Analysis To give H2O a workout, I decided to fit four different types of models trying to understand which emails were ham and which spam: • generalized linear model, with elastic net regularization to help cope with the large number of explanatory variables • random forest • gradient boosting machine • neural network I split the data into training, validation and testing subsets; with the idea that the validation set would be used for choosing tuning parameters, and the testing set used as a final comparison of the predictive power of the final models. As things turned out, I didn’t have patience to do much in the way of tuning. This probably counted against the latter three of my four models, because I’m pretty confident better performance would be possible with more careful choice of some of the meta parameters. Here’s the eventual results from my not-particularly-tuned models: The humble generalized linear model (GLM) performs pretty well; outperformed clearly only by the neural network. The GLM has a big advantage in interpretability too. Here are the most important variables for the GLM in predicting spam (NEG means a higher count of the word means less likely to be spam)  names coefficients sign word 1 C9729 1.1635213 NEG enron 2 C25535 0.6023054 NEG vince 3 C1996 0.5990230 NEG attached 4 C15413 0.4524011 NEG louise 5 C19891 0.3905246 NEG questions 6 C11478 0.2993239 NEG gas 7 proportion_stop_words 0.2935112 NEG <NA> 8 C12866 0.2774074 NEG hourahead 9 C7268 0.2600452 NEG daren 10 C16257 0.2497282 NEG meter 11 C12878 0.2439315 NEG houston 12 C21441 0.2345008 NEG sally 13 C16106 0.2179897 NEG meeting 14 C12894 0.1965571 NEG hpl 15 C16618 0.1909332 NEG monday 16 C11270 0.1873195 NEG friday 17 C8553 0.1704185 NEG doc 18 C7386 0.1673093 NEG deal 19 C21617 0.1636310 NEG schedule 20 C4185 0.1510748 NEG california 21 C12921 0.3695006 POS http 22 C16624 0.2132104 POS money 23 C22597 0.2034031 POS software 24 C15074 0.1957970 POS life 25 C5394 0.1922659 POS click 26 C17683 0.1915608 POS online 27 C25462 0.1703109 POS viagra 28 C16094 0.1605989 POS meds 29 C21547 0.1583438 POS save 30 C9483 0.1498732 POS email  So, who knew, emails containing the words “money”, “software”, “life”, “click”, “online”, “viagra” and “meds” are (or at least were in the time of Enron – things may have changed) more likely to be spam. Here’s the code for the analysis all together: #=====================analysis in H2O============= #-----------prep------------------------ # names of the explanatory variables - all the columns in the sparse matrix (which are individual words) # except the first one which is the dummy "labels" created just for the SVMLight format. And the # four summary variables in the dense dataframe: xnames <- c(colnames(enron_sparse_h2o)[-1], "number_characters", "number_words", "number_stop_words", "proportion_stop_words") # A lookup table of the column names that refer to words to the words they actually mean. # Useful when we look at variable performance down the track. wordcols <- data_frame( variable = colnames(enron_sparse_h2o), word = c("", colnames(enron_dtm)) ) # The slower models (ie apart from glm) take ages for cross-validation so # we'll settle for single-split validation enron_split <- h2o.splitFrame(enron_fulldata, ratios = c(0.6, 0.2)) dim(enron_split[[1]]) #-------------------GLM--------------------- # Binomial GLM, with elastic net regularization. This is the minimal baseline sort of model. # 200 seconds system.time({ mod.glm <- h2o.glm(x = xnames, y = "SPAM", training_frame = enron_split[[1]], validation_frame = enron_split[[2]], family = "binomial", alpha = 0.5, lambda_search = TRUE) }) h2o.varimp(mod.glm) %>% slice(1:30) %>% left_join(wordcols, by = c("names" = "variable")) %>% arrange(sign, desc(coefficients)) h2o.performance(mod.glm, valid = TRUE) # 131 / 6640 #--------------------Random Forest------------------- # with ntrees = 50, nfolds = 10, max_depth = 20, took 25 minutes to get # 5% through so I stopped it and went back to having a single validation frame. # Can view progress by going to http://127.0.0.1:54321/flow/index.html # 1340 seconds system.time({ mod.rf <- h2o.randomForest(x = xnames, y = "SPAM", training_frame = enron_split[[1]], validation_frame = enron_split[[2]], ntrees = 500, max_depth = 20, stopping_tolerance = 0.0001, stopping_rounds = 3, score_tree_interval = 25) }) # note from watching progress in flow, scoring every score_tree_interval models takes quite a lot of time. # so I pushed out score_tree_interval. On current settings it will score # the latest model against the validation frame every 25 trees, and stop if it hasn't improved since 3*25 trees ago. # If score_tree_interval is a small number, it stops growing trees too quickly h2o.performance(mod.rf, valid = TRUE) # 172/6640 error rate #-------------------gradient boosting machine----------------- # 1240 seconds system.time({ mod.gbm <- h2o.gbm(x = xnames, y = "SPAM", training_frame = enron_split[[1]], validation_frame = enron_split[[2]], ntrees = 100, max_depth = 5, stopping_tolerance = 0.0001, stopping_rounds = 3, score_tree_interval = 5) }) h2o.performance(mod.gbm, valid = TRUE) # 240/6640 error rate #-------------------artificial neural network------------------------ # 900 seconds; much faster when sparse = TRUE system.time({ mod.dl <- h2o.deeplearning(x = xnames, y = "SPAM", training_frame = enron_split[[1]], validation_frame = enron_split[[2]], hidden = c(200, 200), stopping_tolerance = 0.0001, stopping_rounds = 5, sparse = TRUE) }) h2o.performance(mod.dl, valid = TRUE) # 82/6640 #-----------------Naive Bayes - doesn't work-------------------- # Conditional probabilities won't fit in the driver node's memory (20.99 GB > 6.06GB) mod.nb <- h2o.naiveBayes(x = xnames, y = "SPAM", training_frame = enron_split[[1]], validation_frame = enron_split[[2]]) #==============presentation of results========================== perf <- function(newdata){ return(c( glm = as.character(h2o.confusionMatrix(mod.glm, newdata = newdata)$Rate[1]), rf = as.character(h2o.confusionMatrix(mod.rf, newdata = newdata)$Rate[1]), gbm = as.character(h2o.confusionMatrix(mod.gbm, newdata = newdata)$Rate[1]), dl = as.character(h2o.confusionMatrix(mod.dl, newdata = newdata)$Rate[1]) )) } # this isn't the most efficient computing wise, because the performance on the training # and validation sets could be extracted more directly but it is quick and easy to code: perfs <- lapply(enron_split, perf) perfsdf <- data_frame(value = unlist(perfs), model = rep(c("Generalized Linear Model", "Random Forest", "Gradient Boosting Machine", "Deep Learning"), 3), dataset = rep(c("Training", "Validation", "Testing"), each = 4), abbrev = rep(names(perfs[[1]]), 3)) %>% mutate(errors = as.numeric(str_sub(str_extract(value, "=[0-9]+"), start = 2)), denom = as.numeric(str_sub(str_extract(value, "/[0-9]+"), start = 2)), errorrate = errors / denom, model = factor(model, levels = c("Deep Learning", "Generalized Linear Model", "Random Forest", "Gradient Boosting Machine")), dataset = factor(dataset, levels = c("Training", "Validation", "Testing"))) perfsdf %>% ggplot(aes(x = errorrate, y = dataset, label = abbrev, colour = model)) + geom_path(aes(group = abbrev), alpha = 0.5, linetype = 2) + geom_text() + scale_colour_brewer(palette = "Set1") + scale_x_continuous("Error rate", label = percent) + labs(y = "", colour = "") + ggtitle("Error rates detecting ham from spam", "Enron email dataset, four different statistical learning methods") To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Exploratory Multivariate Data Analysis with R- enroll now in the MOOC 17 February 2017 - 2:48am (This article was first published on François Husson, and kindly contributed to R-bloggers) Exploratory multivariate data analysis is studied and has been taught in a “French-way” for a long time in France. You can enroll in a MOOC (completely free) on Exploratory Multivariate Data Analysis. The MOOC will start the 27th of February. This MOOC focuses on 4 essential and basic methods, those with the largest potential in terms of applications: principal component analysis (PCA) when variables are quantitative, correspondence analysis (CA) and multiple correspondence analysis (MCA) when variables are categorical and clustering. This course is application-oriented and many examples and numerous exercises are done with FactoMineR (a package of the free R software) will make the participant efficient and reliable face to data analysis. To leave a comment for the author, please follow the link and comment on their blog: François Husson. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### The Steep Slide of NFL Draft Salaries 16 February 2017 - 10:35pm (This article was first published on R – Jesse Piburn, and kindly contributed to R-bloggers) Some friends and I got into a conversation about rookies in the NFL and how much their salaries were. We eventually started guessing how much more the first overall pick makes compared to, Mr. Irrelevant, the last pick of the NFL draft. It’s a pretty steep drop from number 1 to number 256 (7 rounds of 32 teams, plus a maximum of 32 compensatory selections), but it turns out most of that slide happens in the first 2 rounds. Below is the chart I came up with. Here is a higher res link Here is the R code to reproduce the chart above, including the function to download the data from spotrac, which is a great site by the way. #################################################################### # Create nfl draft salary chart # date: 2/16/2017 # author: Jesse Piburn # #################################################################### # gtable and grid used to add the source text at the bottom ----- library(ggplot2) library(ggthemes) library(gtable) library(grid) library(dplyr) getDraftContracts <- function(year) { x <- rvest::html(paste0("http://www.spotrac.com/nfl/draft/", year)) x <- rvest ::html_table(x) # each round is its own table ----- x <- lapply(1:length(x), FUN = function(i) { df <- x[[i]] df$Round <- i df$Cond_Pick <- grepl(" \$$C\$$", df$Pick) df$Pick <- as.numeric(gsub(" \$$C\$$", "", df$Pick)) df$Yrs <- as.numeric(df$Yrs) df }) df <- dplyr::bind_rows(x) df$Total Value <- as.numeric(gsub("\\$|,", "", df$Total Value)) df$Signing Bonus <- as.numeric(gsub("\\$|,", "", df$Signing Bonus)) cap_index <- which(names(df) == paste(year, "Cap")) df[, cap_index] <- as.numeric(gsub("\\$|,", "", df[, cap_index])) names(df)[cap_index] <- "Rookie Cap Hit" df$Yearly Avg <- df$Total Value/ df$Yrs df$Season <- year df } yrs <- 2011:2016 df_list <- lapply(yrs, getDraftContracts) df <- dplyr::bind_rows(df_list) df$Season <- factor(df$Season) plot_title <- "Average Annual Salary of Rookie Contract" plot_subtitle <- "Due to compensatory picks rounds 3 through 7 will have varying numbers of selections per season" p1 <- ggplot(df, aes(x = Pick, y = Yearly Avg, colour = Season, group = Season)) + geom_line(size = .8) + theme_fivethirtyeight() + ylab("Avg Yearly Value of Rookie Contract") + xlab("Pick") + scale_x_continuous(breaks = seq.default(1, 225, 32), labels =c(paste("Round", 1:3), paste("Pick", seq.default(1, 225, 32)[4:8]))) + scale_y_continuous(breaks = seq.default(1000000, 7000000, 1000000), labels = paste0("$", 1:7, " mil"), expand = c(0,100000)) + labs(title = plot_title, subtitle = plot_subtitle) + guides(colour = guide_legend(nrow = 1)) # outliers ----- label_df <- df %>% filter(Player %in% c("Kyle Rudolph", "Colin McCarthy", "Julius Thomas", "Theo Riddick", "Braxton Miller")) label_df$label <- paste0(label_df$Player," (", label_df$Pos, ")") p1 <- p1 + geom_text(data = label_df, aes(x = Pick, y = Yearly Avg, colour = Season, label = label), fontface = "bold", show.legend = FALSE, nudge_x = c(-25, 15, -25, 20, -5), nudge_y = c(-60000, 120000, 0, 150000, 120000)) # turn into grob and add the bottom text ----- p1 <- ggplotGrob(p1) subtext <- textGrob("Source: spotrac", gp = gpar(fontsize = 12, fontface = "italic", fontfamily = "sans", col = "#696969"), just = "left", x = unit(0.01, "npc")) p1 <- gtable_add_rows(p1, heights = unit(0, "mm"), pos = -1) p1 <- gtable_add_grob(p1, grobTree(subtext), t = nrow(p1), l = 1, b = nrow(p1)-1, r = 7) grid.draw(p1) ggsave(filename = "plots/rookie salaries.png", plot = p1, width = 8*1.2, height = 6*1.2, dpi = 600, units = "in") To leave a comment for the author, please follow the link and comment on their blog: R – Jesse Piburn. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### littler 0.3.2 16 February 2017 - 8:20pm (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers) The third release of littler as a CRAN package is now available, following in the now more than ten-year history as a package started by Jeff in the summer of 2006, and joined by me a few weeks later. littler is the first command-line interface for R and predates Rscript. It is still faster, and in my very biased eyes better as it allows for piping as well shebang scripting via #!, uses command-line arguments more consistently and still starts faster. It prefers to live on Linux and Unix, has its difficulties on OS X due to yet-another-braindeadedness there (who ever thought case-insensitive filesystems where a good idea?) and simply does not exist on Windows (yet — the build system could be extended — see RInside for an existence proof, and volunteers welcome!). This release brings several new examples script to run package checks, use the extraordinary R Hub, download RStudio daily builds, and more — see below for details. No internals were changed. The NEWS file entry is below. Changes in littler version 0.3.2 (2017-02-14) • Changes in examples • New scripts getRStudioServer.r and getRStudioDesktop.r to download daily packages, currently defaults to Ubuntu amd64 • New script c4c.r calling rhub::check_for_cran() • New script rd2md.r to convert Rd to markdown. • New script build.r to create a source tarball. • The installGitHub.r script now use package remotes (PR #44, #46) Courtesy of CRANberries, there is a comparison to the previous release. Full details for the littler release are provided as usual at the ChangeLog page. The code is available via the GitHub repo, from tarballs off my littler page and the local directory here — and now of course all from its CRAN page and via install.packages("littler"). Binary packages are available directly in Debian as well as soon via Ubuntu binaries at CRAN thanks to the tireless Michael Rutter. Comments and suggestions are welcome at the GitHub repo. This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings. To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### A knapsack riddle [#2]? 16 February 2017 - 6:17pm (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers) Still about this allocation riddle of the past week, and still with my confusion about the phrasing of the puzzle, when looking at a probabilistic interpretation of the game, rather than for a given adversary’s y, the problem turns out to search for the maximum of where the Y’s are Binomial B(100,p). Given those p’s, this function of x is available in closed form and can thus maximised by a simulated annealing procedure, coded as utility=function(x,p){ ute=2*pbinom(x[1]-1,100,prob=p[1])+ dbinom(x[1],100,p[1]) for (i in 2:10) ute=ute+2*i*pbinom(x[i]-1,100,prob=p[i])+ i*dbinom(x[i],100,p[i]) return(ute)} #basic term in utility baz=function(i,x,p){ return(i*dbinom(x[i],100,p[i])+ i*dbinom(x[i]-1,100,p[i]))} #relies on a given or estimated p x=rmultinom(n=1,siz=100,prob=p) maxloz=loss=0 newloss=losref=utility(x,p) #random search T=1e3 Te=1e2 baza=rep(0,10) t=1 while ((t<T)||(newloss>loss)){ loss=newloss i=sample(1:10,1,prob=(10:1)*(x>0)) #moving all other counters by one xp=x+1;xp[i]=x[i] #corresponding utility change for (j in 1:10) baza[j]=baz(j,xp,p) proz=exp(log(t)*(baza-baza[i])/Te) #soft annealing move j=sample(1:10,1,prob=proz) if (i!=j){ x[i]=x[i]-1;x[j]=x[j]+1} newloss=loss+baza[j]-baza[i] if (newloss>maxloz){ maxloz=newloss;argz=x} t=t+1 if ((t>T-10)&(newloss<losref)){ t=1;loss=0 x=rmultinom(n=1,siz=100,prob=p) newloss=losref=utility(x,p)}} which seems to work, albeit not always returning the same utility. For instance, > p=cy/sum(cy) > utility(argz,p) [1] 78.02 > utility(cy,p) [1] 57.89 or > p=sy/sum(sy) > utility(argz,p) [1] 82.04 > utility(sy,p) [1] 57.78 Of course, this does not answer the question as intended and reworking the code to that purpose is not worth the time! Filed under: Kids, R, Statistics Tagged: integer programming, knapsack problem, mathematical puzzle, optimisation To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Factor Analysis with the Principal Component Method Part Two 16 February 2017 - 3:30pm (This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers) In the first post on factor analysis, we examined computing the estimated covariance matrix S of the rootstock data and proceeded to find two factors that fit most of the variance of the data using the principal component method. However, the variables in the data are not on the same scale of measurement, which can cause variables with comparatively large variances to dominate the diagonal of the covariance matrix and the resulting factors. The correlation matrix, therefore, makes more intuitive sense to employ in factor analysis. In fact, as we saw previously, most packages available in R default to using the correlation matrix when performing factor analysis. There are several benefits to using R over S, not only that it scales non-commensurate variables, but it is also easier to calculate the factors as the matrix does not need to be decomposed and estimated like S. Factor Analysis with the Correlation Matrix Similar to factor analysis with the covariance matrix, we estimate \Lambda which is p \times m where D is a diagonal matrix of the m largest eigenvalues of R, and C is a matrix of the corresponding eigenvectors as columns. \hat{\Lambda} = CD^{1/2} = (\sqrt{\theta_1}c_1, \sqrt{\theta_2}c_2, \cdots, \sqrt{\theta_m}c_m) Where \theta_1, \theta_2, \cdots, \theta_m are the largest eigenvalues of R. Thus the correlation matrix R does not require decomposition, and we can proceed directly to finding the eigenvalues and eigenvectors of R. Load the rootstock data and name the columns. From the previous post: The rootstock data contains growth measurements of six different apple tree rootstocks from 1918 to 1934 (Andrews and Herzberg 1985, pp. 357-360) and were obtained from the companion FTP site of the book Methods of Multivariate Analysis by Alvin Rencher. The data contains four dependent variables as follows: • trunk girth at four years (mm \times 100) • extension growth at four years (m) • trunk girth at 15 years (mm \times 100) • weight of tree above ground at 15 years (lb \times 1000) root <- read.table('ROOT.DAT', col.names = c('Tree.Number', 'Trunk.Girth.4.Years', 'Ext.Growth.4.Years', 'Trunk.Girth.15.Years', 'Weight.Above.Ground.15.Years')) Compute the correlation matrix of the data. R <- cor(root[,2:5]) round(R, 2) ## Trunk.Girth.4.Years Ext.Growth.4.Years ## Trunk.Girth.4.Years 1.00 0.88 ## Ext.Growth.4.Years 0.88 1.00 ## Trunk.Girth.15.Years 0.44 0.52 ## Weight.Above.Ground.15.Years 0.33 0.45 ## Trunk.Girth.15.Years ## Trunk.Girth.4.Years 0.44 ## Ext.Growth.4.Years 0.52 ## Trunk.Girth.15.Years 1.00 ## Weight.Above.Ground.15.Years 0.95 ## Weight.Above.Ground.15.Years ## Trunk.Girth.4.Years 0.33 ## Ext.Growth.4.Years 0.45 ## Trunk.Girth.15.Years 0.95 ## Weight.Above.Ground.15.Years 1.00 Then find the eigenvalues and eigenvectors of R. r.eigen <- eigen(R) r.eigen ##$values ## [1] 2.78462702 1.05412174 0.11733950 0.04391174 ## ## $vectors ## [,1] [,2] [,3] [,4] ## [1,] 0.4713465 0.5600120 0.6431731 0.2248274 ## [2,] 0.5089667 0.4544775 -0.7142114 -0.1559013 ## [3,] 0.5243109 -0.4431448 0.2413716 -0.6859012 ## [4,] 0.4938456 -0.5324091 -0.1340527 0.6743048 We can check the proportion of each eigenvalue respective to the total sum of the eigenvalues. \frac{\sum^p_{i=1} \hat{\lambda}^2_{ij}}{tr(R)} = \frac{\theta_j}{p} Where p is the number of variables. The quick and dirty loop below finds the proportion of the total for each eigenvalue and the cumulative proportion. cumulative.proportion <- 0 prop <- c() cumulative <- c() for (i in r.eigen$values) { proportion <- i / dim(root[,2:5])[2] cumulative.proportion <- cumulative.proportion + proportion prop <- append(prop, proportion) cumulative <- append(cumulative, cumulative.proportion) } data.frame(cbind(prop, cumulative)) ## prop cumulative ## 1 0.69615676 0.6961568 ## 2 0.26353043 0.9596872 ## 3 0.02933488 0.9890221 ## 4 0.01097793 1.0000000

As in the case of the covariance matrix, the first two factors account for nearly all of the sample variance and thus can proceed with m = 2 factors.

The eigenvectors corresponding to the two largest eigenvalues are multiplied by the square roots of their respective eigenvalues as seen earlier to obtain the factor loadings.

factors <- t(t(r.eigen$vectors[,1:2]) * sqrt(r.eigen$values[1:2])) round(factors, 2) ## [,1] [,2] ## [1,] 0.79 0.57 ## [2,] 0.85 0.47 ## [3,] 0.87 -0.45 ## [4,] 0.82 -0.55

Computing the communality remains the same as in the covariance setting.

h2 <- rowSums(factors^2)

The specific variance when factoring R is 1 – \hat{h}^2_i.

u2 <- 1 - h2

According to the documentation of the principal() function (called by ?principal), there is another statistic called complexity, which is the number of factors on which a variable has moderate or high loadings (Rencher, 2002 pp. 431), that is found by:

\frac{(\sum^m_{i=1} \hat{\lambda}^2_i)^2}{\sum^m_{i=1} \hat{\lambda}_i^4}

In the most simple structure, the complexity of all the variables is 1. The complexity of the variables is reduced by performing rotation which will be seen later.

com <- rowSums(factors^2)^2 / rowSums(factors^4) com ## [1] 1.831343 1.553265 1.503984 1.737242 mean(com) ## [1] 1.656459

As seen in the previous post, the principal() function from the psych package performs factor analysis with the principal component method.

library(psych)

Since we are using R instead of S, the covar argument remains FALSE by default. No rotation is done for now, so the rotate argument is set to none.

root.fa <- principal(root[,2:5], nfactors = 2, rotate = 'none') root.fa ## Principal Components Analysis ## Call: principal(r = root[, 2:5], nfactors = 2, rotate = "none") ## Standardized loadings (pattern matrix) based upon correlation matrix ## PC1 PC2 h2 u2 com ## Trunk.Girth.4.Years 0.79 0.57 0.95 0.051 1.8 ## Ext.Growth.4.Years 0.85 0.47 0.94 0.061 1.6 ## Trunk.Girth.15.Years 0.87 -0.45 0.97 0.027 1.5 ## Weight.Above.Ground.15.Years 0.82 -0.55 0.98 0.022 1.7 ## ## PC1 PC2 ## SS loadings 2.78 1.05 ## Proportion Var 0.70 0.26 ## Cumulative Var 0.70 0.96 ## Proportion Explained 0.73 0.27 ## Cumulative Proportion 0.73 1.00 ## ## Mean item complexity = 1.7 ## Test of the hypothesis that 2 components are sufficient. ## ## The root mean square of the residuals (RMSR) is 0.03 ## with the empirical chi square 0.39 with prob < NA ## ## Fit based upon off diagonal values = 1

The output of the principal() function agrees with our calculations.

Factor Rotation with Varimax Rotation

Rotation moves the axes of the loadings to produce a more simplified structure of the factors to improve interpretation. Therefore the goal of rotation is to find an interpretable pattern of the loadings where variables are clustered into groups corresponding to the factors. We will see that a successful rotation yields a complexity closer to 1, which denotes the variables load highly on only one factor.

One of the most common approaches to rotation is varimax rotation, which is a type of orthogonal rotation (axes remain perpendicular). The varimax technique seeks loadings that maximize the variance of the squared loadings in each column of the rotated matrix \hat{\Lambda}*.

The varimax() function is used to find the rotated factor loadings. For those interested, the R code for the varimax() function can be found here.

factors.v <- varimax(factors)\$loadings round(factors.v, 2) ## ## Loadings: ## [,1] [,2] ## [1,] 0.16 0.96 ## [2,] 0.28 0.93 ## [3,] 0.94 0.29 ## [4,] 0.97 0.19 ## ## [,1] [,2] ## SS loadings 1.928 1.907 ## Proportion Var 0.482 0.477 ## Cumulative Var 0.482 0.959

The varimax rotation was rather successful in finding a rotation that simplified the complexity of the variables. The first two variables now load highly on the second factor while the remaining two variables load primarily on the first factor.

Since we used an orthogonal rotation technique, the communalities will not change.

h2.v <- rowSums(factors.v^2) h2.v ## [1] 0.9492403 0.9390781 0.9725050 0.9779253 h2 ## [1] 0.9492403 0.9390781 0.9725050 0.9779253

Thus the specific variances will also be unchanged.

u2.v <- 1 - h2.v u2.v ## [1] 0.05075965 0.06092192 0.02749496 0.02207470 u2 ## [1] 0.05075965 0.06092192 0.02749496 0.02207470

As stated previously, the complexity of the variables on the rotated factors should be closer to 1 compared to the non-rotated complexity.

com.v <- rowSums(factors.v^2)^2 / rowSums(factors.v^4) com.v ## [1] 1.054355 1.179631 1.185165 1.074226 mean(com.v) ## [1] 1.123344

The complexity is rather close to 1 which provides us further acknowledgment the factors are now in a more simplified structure.

Setting the rotation argument to varimax in the principal()` function outputs the rotated factors and corresponding statistics.

root.fa2 <- principal(root[,2:5], nfactors = 2, rotation = 'varimax') root.fa2 ## Principal Components Analysis ## Call: principal(r = root[, 2:5], nfactors = 2, rotation = "varimax") ## Standardized loadings (pattern matrix) based upon correlation matrix ## RC1 RC2 h2 u2 com ## Trunk.Girth.4.Years 0.16 0.96 0.95 0.051 1.1 ## Ext.Growth.4.Years 0.28 0.93 0.94 0.061 1.2 ## Trunk.Girth.15.Years 0.94 0.29 0.97 0.027 1.2 ## Weight.Above.Ground.15.Years 0.97 0.19 0.98 0.022 1.1 ## ## RC1 RC2 ## SS loadings 1.94 1.90 ## Proportion Var 0.48 0.48 ## Cumulative Var 0.48 0.96 ## Proportion Explained 0.50 0.50 ## Cumulative Proportion 0.50 1.00 ## ## Mean item complexity = 1.1 ## Test of the hypothesis that 2 components are sufficient. ## ## The root mean square of the residuals (RMSR) is 0.03 ## with the empirical chi square 0.39 with prob < NA ## ## Fit based upon off diagonal values = 1 Interpretation of Factors

The factor analysis performed on the rootstock data yielded two latent variables that fit and explain the variance of the data quite sufficiently. We see both variables relating to measurements at four years load heavily on factor 2 while the 15-year measurements load mainly on the first factor. Thus we could designate names for the factors, or latent variables, such as ‘15 years growth’ and ‘4 years growth’, respectively. There isn’t any standard way of ‘naming’ factors as the interpretation can vary widely between each case. In this example, the factors make intuitive sense based on how they load on the variables; however, factors resulting from a factor analysis may not always make logic sense to the original data. If the resulting factors do not seem logical, changes to the approach such as adjusting the number of factors or the threshold of the loadings deemed important, or even a different method of rotation can be done to improve interpretation.

References

Rencher, A. (2002). Methods of Multivariate Analysis (2nd ed.). Brigham Young University: John Wiley & Sons, Inc.

https://en.wikipedia.org/wiki/Talk:Varimax_rotation

https://en.wikipedia.org/wiki/Varimax_rotation

http://web.stanford.edu/class/psych253/tutorials/FactorAnalysis.html

The post Factor Analysis with the Principal Component Method Part Two appeared first on Aaron Schlegel.

To leave a comment for the author, please follow the link and comment on their blog: R – Aaron Schlegel. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Should I Learn R or Python? … It Doesn’t Matter

16 February 2017 - 2:43pm

(This article was first published on R-Projects – Stoltzmaniac, and kindly contributed to R-bloggers)

Should I learn R or Python for data science?

I am asked this question regularly, both online and in person. There is a simple answer: it doesn’t matter. There are pros and cons to both which have been written about extensively so I reinvent the wheel by making a list here (do a quick search in Google and you’ll find tens of thousands of relevant results).

The fact is, you’re asking the wrong question.

You should be asking yourself, ‘what am I looking to accomplish?’

Once you answer this question, everything else should fall into place nicely. All of the articles you look up will certainly point you in the right direction.

Here are some examples:

• I want to know how diverse my stock portfolio is, so I will build a statistically based stock portfolio
• I want to know about what my competitors are doing online, so I will scrape the web for information about business competitors
• I want to know what my local government is spending money on, so I will find the data and do some analysis
If you are learning a language to put on your resume and get a job you’re doing something wrong!

There is no shortage of information on in-demand skills found in online job listings. You may look up trends to see what’s more popular, but results vary based off of how you look for the information. Consider the following charts and ask yourself, what conclusions could I draw that would lead me to any sort of decision?

trends.embed.renderExploreWidget("TIMESERIES", {"comparisonItem":[{"keyword":"python programming","geo":"","time":"all"},{"keyword":"r programming","geo":"","time":"all"},{"keyword":"matlab programming","geo":"","time":"all"}],"category":0,"property":""}, {"exploreQuery":"date=all&q=python%20programming,r%20programming,matlab%20programming","guestPath":"https://trends.google.com:443/trends/embed/"});

trends.embed.renderExploreWidget("TIMESERIES", {"comparisonItem":[{"keyword":"learn python","geo":"","time":"all"},{"keyword":"learn r","geo":"","time":"all"},{"keyword":"learn matlab","geo":"","time":"all"}],"category":0,"property":""}, {"exploreQuery":"date=all&q=learn%20python,learn%20r,learn%20matlab"});

trends.embed.renderExploreWidget("TIMESERIES", {"comparisonItem":[{"keyword":"python tutorial","geo":"","time":"all"},{"keyword":"r tutorial","geo":"","time":"all"},{"keyword":"matlab tutorial","geo":"","time":"all"}],"category":0,"property":""}, {"exploreQuery":"date=all&q=python%20tutorial,r%20tutorial,matlab%20tutorial"});

From this basic web search data, you might determine that python is the most popular language (side note, these are all dwarfed by C++ and others). Does that mean you should learn python first? Maybe, but probably not. Learning the most popular language could potentially have a negative impact because there may be less scarcity in the market for that skill. The reality is, you have already narrowed your search down to the most popular languages in data science and that is what’s important. No matter which choice you make, you will be fine! There may be a job or two which requires proficiency in one language or the other, but you will always be able to find something interesting with either language under your belt.

It’s time for the good stuff!

The only way to make the decision for yourself is to find something you’re interested in and get started building something. When you care about what you’re doing, you will become great whatever you choose! Here are few keys to success:

1. Set aside time to learn every single day (and learn how to learn)
2. Read a lot! Books, blogs, articles and studies encourage you to think about new ways to do things
3. Work with open source data
4. Create a GitHub account
5. Use real data and make something that matters to you
It’s time for you to get out there and do something that matters.

Every project you complete will give you a sense of pride. Humans thrive from a sense of accomplishment. Once your hobby turns into a skill, you will have no problem finding a job, learning a new language, creating a strong sense of self-worth and making something you’re proud of!

Have fun!

To leave a comment for the author, please follow the link and comment on their blog: R-Projects – Stoltzmaniac. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

You are here

## Who's new

• m.blangiardo
• cesar.izq
• matthias.templ
• Clara Romero
• zmunch