Analytical and Numerical Solutions to Linear Regression Problems
This exercise focuses on linear regression with both analytical (normal equation) and numerical (gradient descent) methods. We will start with linear regression with one variable. From this part of the exercise, we will create plots that help to visualize how gradient descent gets the coefficient of the predictor and the intercept. In the second part, we will implement linear regression with multiple variables.
Linear regression with one variable

In this part of the exercise, we will implement linear regression with one variable to predict profits for a food truck. Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities and you have data for profits and populations from those cities. We would like to use this data to help us select which city to expand to next.
The file ex1data1.txt contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss. The first column refers to the population size in 10,000s and the second column refers to the profit in $10,000s.
Load the data and display the first 6 observations:
ex1data1 <- fread("ex1data1.txt", col.names = c("population", "profit"))
head(ex1data1)

   population  profit
1:     6.1101 17.5920
2:     5.5277  9.1302
3:     8.5186 13.6620
4:     7.0032 11.8540
5:     5.8598  6.8233
6:     8.3829 11.8860

Plotting the Data

Before starting on any task, it is often useful to understand the data by visualizing it. For this dataset, we can use a scatter plot, since it has only two properties to plot (profit and population). Many other problems that we encounter in real life are multidimensional and can't be plotted on a 2D plot.
ex1data1 %>% ggplot(aes(x = population, y = profit)) +
  geom_point(color = "blue", size = 4, alpha = 0.5) +
  ylab('Profit in $10,000s') +
  xlab('Population of City in 10,000s') +
  ggtitle('Figure 1: Scatter plot of training data') +
  theme(plot.title = element_text(size = 16, colour = "red"))

Gradient Descent

In this part, we will fit the linear regression parameters θ to our dataset using gradient descent.
To implement gradient descent, we need to calculate the cost, which is given by:

J(θ) = (1 / 2m) Σᵢ₌₁..ₘ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²,   where hθ(x) = θᵀx and m is the number of training examples.
As we perform gradient descent to minimize the cost function J(θ), it is helpful to monitor the convergence by computing the cost. In this section, we will implement a function to calculate J(θ) so we can check the convergence of our gradient descent implementation.
But remember, before we go any further we need to add the intercept term θ₀, i.e., prepend a column of ones to X.
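Concretely, adding the intercept term means prepending a column of ones to the predictor matrix. A minimal sketch, using a small stand-in for the first rows of ex1data1 shown above:

```r
# Stand-in for the first rows of ex1data1 (population, profit)
ex1data1 <- data.frame(population = c(6.1101, 5.5277, 8.5186),
                       profit     = c(17.5920, 9.1302, 13.6620))

# Prepend a column of ones: the first model parameter then acts
# as the intercept theta_0
X <- cbind(1, ex1data1$population)
y <- ex1data1$profit
```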
The function below calculates the cost based on the equation given above.
computeCost = function(X, y, theta) {
  z = ((X %*% theta) - y)^2
  return(sum(z) / (2 * nrow(X)))
}

Now, we can calculate the initial cost by initializing the parameters to 0.
theta = matrix(rep(0, ncol(X)))
round(computeCost(X, y, theta), 2)

32.07

Gradient descent

Next, we will implement gradient descent, calling the computeCost function above. A good way to verify that gradient descent is working correctly is to look at the value of J(θ) and check that it is decreasing with each step. The value of J(θ) should never increase, and should converge to a steady value by the end of the algorithm.
The gradient descent function below returns the cost in every iteration and the optimal parameters for the specified number of iterations and learning rate alpha.
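The definition of gradientDescent itself did not make it into this post; a sketch consistent with how it is called here (it must return the per-iteration cost and the final theta) might look like this:

```r
# Batch gradient descent for linear regression.
# Returns a list with the final parameters ($theta) and the cost
# recorded at every iteration ($cost).
gradientDescent <- function(X, y, theta, alpha, iterations) {
  m <- nrow(X)
  cost <- numeric(iterations)
  for (i in seq_len(iterations)) {
    error <- X %*% theta - y                           # residuals for current theta
    theta <- theta - (alpha / m) * (t(X) %*% error)    # simultaneous update of all parameters
    cost[i] <- sum((X %*% theta - y)^2) / (2 * m)      # cost after the update
  }
  list(theta = theta, cost = cost)
}
```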
Now, let's use the gradientDescent function to find the parameters, making sure that our cost never increases. Let's use 1500 iterations and a learning rate alpha of 0.01 for now. Later, we will see the effect of these values on our application.
iterations = 1500
alpha = 0.01
theta = matrix(rep(0, ncol(X)))
gradientDescent_results = gradientDescent(X, y, theta, alpha, iterations)
theta = gradientDescent_results$theta
theta

          [,1]
[1,] -3.630291
[2,]  1.166362

Plotting the cost function as a function of the number of iterations

We expect that the cost should be decreasing, or at least never increasing, if our implementation is correct. Let's plot the cost function as a function of the number of iterations and make sure our implementation makes sense.
data.frame(Cost = gradientDescent_results$cost, Iterations = 1:iterations) %>%
  ggplot(aes(x = Iterations, y = Cost)) + geom_line(color = "blue") +
  ggtitle("Cost as a function of number of iterations") +
  theme(plot.title = element_text(size = 16, colour = "red"))

Here is the plot:
As the plot above shows, the cost decreases with the number of iterations and gets close to convergence within 1500 iterations.
Now, since we know the parameters (slope and intercept), we can plot the linear fit on the scatter plot.
ex1data1 %>% ggplot(aes(x = population, y = profit)) +
  geom_point(color = "blue", size = 3, alpha = 0.5) +
  ylab('Profit in $10,000s') +
  xlab('Population of City in 10,000s') +
  ggtitle('Figure 2: Training data with linear fit') +
  geom_abline(intercept = theta[1], slope = theta[2], col = "red", show.legend = TRUE) +
  theme(plot.title = element_text(size = 16, colour = "red")) +
  annotate("text", x = 12, y = 20,
           label = paste0("Profit = ", round(theta[1], 4), " + ", round(theta[2], 4), " * Population"))

Visualizing J(θ)

To understand the cost function J(θ) better, let's now plot the cost over a two-dimensional grid of θ₀ and θ₁ values. The global minimum is the optimal point for θ₀ and θ₁, and each step of gradient descent moves closer to this point.
Intercept = seq(from = -10, to = 10, length = 100)
Slope = seq(from = -1, to = 4, length = 100)

# initialize cost values to a matrix of 0's
Cost = matrix(0, length(Intercept), length(Slope))

for (i in 1:length(Intercept)) {
  for (j in 1:length(Slope)) {
    t = c(Intercept[i], Slope[j])
    Cost[i, j] = computeCost(X, y, t)
  }
}

persp3D(Intercept, Slope, Cost, theta = 45, phi = 25, expand = 0.75, lighting = TRUE,
        ticktype = "detailed", xlab = "Intercept", ylab = "Slope", zlab = "",
        axes = TRUE, main = "Surface")

image.plot(Intercept, Slope, Cost, main = "Contour, showing minimum")
contour(Intercept, Slope, Cost, add = TRUE, n = 30, labels = '')
points(theta[1], theta[2], col = 'red', pch = 4, lwd = 6)

Normal Equation

Since linear regression has a closed-form solution, we can also solve it analytically; this approach is called the normal equation and is given by the formula below. With it, we do not need to iterate or choose a learning rate. However, we do need to compute the inverse of a matrix, which makes it slow when the number of features is very large. Gradient descent is applicable to other machine learning techniques as well, and it is the more appropriate method, even for linear regression, when the number of observations is very large.
θ = (XᵀX)⁻¹ Xᵀ y
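In R, the normal equation is a one-liner with solve(), which computes the matrix inverse. Here it is applied to a tiny made-up dataset lying exactly on the line y = 1 + 2x (in the post itself, X and y come from ex1data1):

```r
# Normal equation: theta = (X'X)^{-1} X'y
X <- cbind(1, c(1, 2, 3, 4))   # intercept column plus one predictor
y <- c(3, 5, 7, 9)             # exactly y = 1 + 2x
theta_ne <- solve(t(X) %*% X) %*% t(X) %*% y
# theta_ne recovers intercept 1 and slope 2
```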
There is a very small difference between the parameters we got from the normal equation and from gradient descent. Let's increase the number of iterations and see if they get closer to each other. I increased the number of iterations from 1500 to 15,000.
iterations = 15000
alpha = 0.01
theta = matrix(rep(0, ncol(X)))
gradientDescent_results = gradientDescent(X, y, theta, alpha, iterations)
theta = gradientDescent_results$theta
theta

          [,1]
[1,] -3.895781
[2,]  1.193034

As you can see, the results from the normal equation and gradient descent are now the same.
Using the caret package

By the way, we can also use packages developed by experts in the field to perform our machine learning tasks. There are many machine learning packages in R for different types of machine learning tasks. To verify that we get the same results, let's use the caret package, which is among the most commonly used machine learning packages in R.
my_lm <- train(profit ~ population, data = ex1data1, method = "lm")
my_lm$finalModel$coefficients

      (Intercept)        population
-3.89578087831187   1.1930336441896

Linear regression with multiple variables

In this part of the exercise, we will implement linear regression with multiple variables to predict the prices of houses. Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recently sold houses and build a model of housing prices. The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.
ex1data2 <- fread("ex1data2.txt", col.names = c("size", "bedrooms", "price"))
head(ex1data2)

   size bedrooms  price
1: 2104        3 399900
2: 1600        3 329900
3: 2400        3 369000
4: 1416        2 232000
5: 3000        4 539900
6: 1985        4 299900

Feature Normalization

House sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, performing feature scaling first can make gradient descent converge much more quickly.
ex1data2 = as.data.frame(ex1data2)
for (i in 1:(ncol(ex1data2) - 1)) {
  ex1data2[, i] = (ex1data2[, i] - mean(ex1data2[, i])) / sd(ex1data2[, i])
}

Gradient Descent

Previously, we implemented gradient descent on a univariate regression problem. The only difference now is that there is one more feature in the matrix X. The hypothesis function and the batch gradient descent update rule remain unchanged.
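As a sanity check on the normalization loop above: base R's scale() performs the same standardization, subtracting each column's mean and dividing by its standard deviation. A sketch on a small stand-in for ex1data2:

```r
# Stand-in for the first rows of ex1data2 (size, bedrooms, price)
ex1data2 <- data.frame(size     = c(2104, 1600, 2400, 1416),
                       bedrooms = c(3, 3, 3, 2),
                       price    = c(399900, 329900, 369000, 232000))

# scale() centers and scales each column: (x - mean(x)) / sd(x)
ex1data2[, 1:2] <- scale(ex1data2[, 1:2])
```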
X = cbind(1, ex1data2$size, ex1data2$bedrooms)
y = ex1data2$price
head(X)

     [,1]        [,2]       [,3]
[1,]    1  0.13000987 -0.2236752
[2,]    1 -0.50418984 -0.2236752
[3,]    1  0.50247636 -0.2236752
[4,]    1 -0.73572306 -1.5377669
[5,]    1  1.25747602  1.0904165
[6,]    1 -0.01973173  1.0904165

iterations = 6000
alpha = 0.01
theta = matrix(rep(0, ncol(X)))
gradientDescent_results = gradientDescent(X, y, theta, alpha, iterations)
theta = gradientDescent_results$theta
theta

           [,1]
[1,] 340412.660
[2,] 110631.050
[3,]  -6649.474

Normal Equation

theta2 = solve(t(X) %*% X) %*% t(X) %*% y
theta2

           [,1]
[1,] 340412.660
[2,] 110631.050
[3,]  -6649.474

Using the caret package

ex1data2 <- fread("ex1data2.txt", col.names = c("size", "bedrooms", "price"))
my_lm <- train(price ~ size + bedrooms, data = ex1data2, method = "lm",
               preProcess = c("center", "scale"))
my_lm$finalModel$coefficients

     (Intercept)              size          bedrooms
340412.659574468  110631.050278846  -6649.4742708198

Summary

In this post, we saw how to implement numerical and analytical solutions to linear regression problems using R. We also used caret,
the famous R machine learning package, to verify our results. The datasets are from the Coursera machine learning course offered by Andrew Ng. The course is offered with Matlab/Octave, and I am doing the exercises in that course with R. You can get the code from this GitHub repository.
Putting It All Together
(This article was first published on R – rud.is, and kindly contributed to R-bloggers)
The kind folks over at @RStudio gave a nod to my recently CRAN-released epidata package in their January data package roundup, and I thought it might be useful to give it one more showcase using the recent CRAN update to ggalt and the new hrbrthemes (GitHub-only for now) packages.
The U.S. labor force participation rate (LFPR) is an oft-overlooked and under- or mis-reported economic indicator. I'll borrow the definition from Investopedia:
The participation rate is a measure of the active portion of an economy's labor force. It refers to the number of people who are either employed or are actively looking for work. During an economic recession, many workers often get discouraged and stop looking for employment, resulting in a decrease in the participation rate.

Population age distributions and other factors are necessary to honestly interpret this statistic. Parties in power usually dismiss/ignore this statistic outright and their opponents tend to wholly embrace it for criticism (it's an easy target if you're naive). "Yay" partisan democracy.
Since the LFPR has nuances when looked at categorically, let's take a look at it by attained education level to see how that particular view has changed over time (at least since the gov-quants have been tracking it).
We can easily grab this data with epidata::get_labor_force_participation_rate() (and we'll set up some library() calls while we're at it):
library(epidata)
library(hrbrthemes) # devtools::install_github("hrbrmstr/hrbrthemes")
library(ggalt)
library(tidyverse)
library(stringi)
part_rate <- get_labor_force_participation_rate("e")
glimpse(part_rate)
## Observations: 457
## Variables: 7
## $ date <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-01, 1979-04-01, 1979-05-01...
## $ all <dbl> 0.634, 0.634, 0.635, 0.636, 0.636, 0.637, 0.637, 0.637, 0.638, 0.638, 0...
## $ less_than_hs <dbl> 0.474, 0.475, 0.475, 0.475, 0.475, 0.474, 0.474, 0.473, 0.473, 0.473, 0...
## $ high_school <dbl> 0.690, 0.691, 0.692, 0.692, 0.693, 0.693, 0.694, 0.694, 0.695, 0.696, 0...
## $ some_college <dbl> 0.709, 0.710, 0.711, 0.712, 0.712, 0.713, 0.712, 0.712, 0.712, 0.712, 0...
## $ bachelor's_degree <dbl> 0.771, 0.772, 0.772, 0.773, 0.772, 0.772, 0.772, 0.772, 0.772, 0.773, 0...
## $ advanced_degree <dbl> 0.847, 0.847, 0.848, 0.848, 0.848, 0.848, 0.847, 0.847, 0.848, 0.848, 0...
One of the easiest things to do is to use ggplot2 to make a faceted line chart by attained education level. But let's change the labels so they are a bit easier on the eyes in the facets, and switch the facet order from alphabetical to something more useful:
gather(part_rate, category, rate, -date) %>%
mutate(category=stri_replace_all_fixed(category, "_", " "),
category=stri_trans_totitle(category),
category=stri_replace_last_regex(category, "Hs$", "High School"),
category=factor(category, levels=c("Advanced Degree", "Bachelor's Degree", "Some College",
"High School", "Less Than High School", "All"))) -> part_rate
Now, we’ll make a simple line chart, tweaking the aesthetics just a bit:
ggplot(part_rate) +
geom_line(aes(date, rate, group=category)) +
scale_y_percent(limits=c(0.3, 0.9)) +
facet_wrap(~category, scales="free") +
labs(x=paste(format(range(part_rate$date), "%Y-%b"), collapse=" to "),
y="Participation rate (%)",
title="U.S. Labor Force Participation Rate",
caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
theme_ipsum_rc(grid="XY", axis="XY")
The “All” view is interesting in that the LFPR has held fairly “steady” between 60% & 70%. Those individual and fractional percentage points actually translate to real humans, so the “minor” fluctuations do matter.
It's also interesting to see the direct contrast between the starting historical rate and the current rate (you could also do the same with min/max rates, etc.). We can use a "dumbbell" chart to compare the 1978 value to today's value, but we'll need to reshape the data a bit first:
group_by(part_rate, category) %>%
arrange(date) %>%
slice(c(1, n())) %>%
spread(date, rate) %>%
ungroup() %>%
filter(category != "All") %>%
mutate(category=factor(category, levels=rev(levels(category)))) -> rate_range
filter(part_rate, category=="Advanced Degree") %>%
arrange(date) %>%
slice(c(1, n())) %>%
mutate(lab=lubridate::year(date)) -> lab_df
(We'll be using the extra data frame to add labels to the chart.)
Now, we can compare the various ranges, once again tweaking aesthetics a bit:
ggplot(rate_range) +
geom_dumbbell(aes(y=category, x=`1978-12-01`, xend=`2016-12-01`),
size=3, color="#e3e2e1",
colour_x = "#5b8124", colour_xend = "#bad744",
dot_guide=TRUE, dot_guide_size=0.25) +
geom_text(data=lab_df, aes(x=rate, y=5.25, label=lab), vjust=0) +
scale_x_percent(limits=c(0.375, 0.9)) +
labs(x=NULL, y=NULL,
title=sprintf("U.S. Labor Force Participation Rate, %s-Present", lab_df$lab[1]),
caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
theme_ipsum_rc(grid="X")
Fin
One takeaway from both these charts is that it's probably important to take education level into account when talking about the labor force participation rate. The get_labor_force_participation_rate() function, along with most other functions in the epidata package, also has options to factor the data by sex, race and age, so you can pull in all those views to get a more nuanced & informed understanding of this economic health indicator.
To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
cricketr and yorkr books – Paperback now in Amazon
(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)
My books
– Cricket Analytics with cricketr
– Beaten by sheer pace!: Cricket analytics with yorkr
are now available on Amazon in both Paperback and Kindle versions.
The cricketr and yorkr packages are written in R, and both are available in CRAN. The books contain details on how to use these R packages to analyze performance of cricketers.
cricketr is based on data from ESPN Cricinfo Statsguru, and can analyze Test, ODI and T20 batsmen & bowlers. yorkr is based on data from Cricsheet, and can analyze ODI, T20 and IPL. yorkr can analyze batsmen, bowlers, matches and teams.
Cricket Analytics with cricketr
You can access the paperback at Cricket analytics with cricketr
Beaten by sheer pace! Cricket Analytics with yorkr
You can buy the paperback from Amazon at Beaten by sheer pace: Cricket analytics with yorkr
Order your copy today! Hope you have a great time reading!
To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….
Accessing MSSQL Server with R (RSQLServer with dplyr)
(This article was first published on rbloggers – verenahaunschmid, and kindly contributed to R-bloggers)
Recently I have started to use dplyr for handling my data in R. It makes everything a lot smoother! My previous workflow – running an SQL query, storing the results as CSV, loading it in RStudio – is now history. With dplyr you can directly query data from many different databases in a very convenient way. Unfortunately, Microsoft SQL Server is not directly supported, but by using the package RSQLServer it can be queried like any other database. In this blog post I'll explain how I installed everything on my Windows 7 machine to access MSSQL Server with R, since it was not as straightforward as one might think.
All you need … theoretically

The package RSQLServer is not available on CRAN anymore, but it can be installed from the GitHub repo imanuelcostigan/RSQLServer.
# install.packages('devtools')
devtools::install_github('imanuelcostigan/RSQLServer')
If this works, you're lucky and already have all the necessary things installed. If not, follow the steps below.
Errors I encountered

On my path from finding out about RSQLServer to actually using it, I had to fix a few things. Searching the internet and finding most of the answers on Stack Overflow, I was able to piece together everything I needed. To save everyone else who tries this some time, I collected everything I found out.

Fixing the installation

The first error occurred when I tried to install the package as shown above.
Downloading GitHub repo imanuelcostigan/RSQLServer@master
from URL https://api.github.com/repos/imanuelcostigan/RSQLServer/zipball/master
Installing RSQLServer
Downloading GitHub repo hadley/dplyr@63d4a9f5
from URL https://api.github.com/repos/hadley/dplyr/zipball/63d4a9f5
Error: running command '"C:/PROGRA~1/R/R-33~1.2/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD config CC' had status 1
To resolve this:

- Find out which R version you have; if you don't know, type R.version inside an R session.
- Download and install the correct version of Rtools.
- Although I had R version 3.3 and downloaded the correct Rtools version for it, the installer tried to use a directory called "34". To make sure it would be found later, I changed this to "33" to match my R version. I am not sure whether it made a difference.
- Make sure to check the option where it asks if you want to add it to the path. This makes a difference.
- Make sure you have the correct Java version and/or architecture installed. I didn't have Java installed for x86, which also caused the installation to fail.
- Restart your computer, or at least log out and in again.
After that I was able to install the package using the following command.
devtools::install_github('imanuelcostigan/RSQLServer')
Download the jTDS drivers and follow these instructions.
Finally accessing MSSQL Server with R

I was working with two different databases. One was remote, and with that it worked immediately. The other one was a local installation on my computer.

For the local one I got an error message. Note: I don't think it had anything to do with the databases being remote/local, just with the settings I used when installing my database!
Error in rJava::.jcall(driver@jdrv, “Ljava/sql/Connection;”, “connect”, :
java.sql.SQLException: Network error IOException: Connection refused: connect
To resolve this, you need to open the SQL Server configuration manager. I have the SQL Server configuration manager and the SQL Server configuration manager (2016) installed. For some settings it doesn’t matter which one you use but for this one I had to use the older one otherwise my database (SQL Server 2012) wouldn’t show up.
Make sure to enable TCP/IP. Restart your SQL Server instance. Now it should be working!
All you need to connect (with Windows authentication) now is the following code:
library(dplyr)
conn <- RSQLServer::src_sqlserver("servername", database = "dbname")
Currently dplyr cannot deal with schema names in code like this:
conn %>% tbl("schema.table")
In this case you have to use
conn %>% tbl(sql("SELECT * FROM schema.table"))
But this issue is being addressed in rstats-db/DBI/issues/24 (dplyr uses DBI).
What now? Query all the data!
Now that you have a connection to your SQL Server it's time to query all the data.
Check out the RStudio: Work with big data webinar or the dplyr databases vignette.
Have fun!
Useful links

- imanuelcostigan/RSQLServer
 Rtools
 stackoverflow.com: I/O Error: SSO Failed: Native SSPI library not loaded
 stackoverflow.com: JDBC connection failed, error: TCP/IP connection to host failed
The post Accessing MSSQL Server with R (RSQLServer with dplyr) appeared first on verenahaunschmid.
To leave a comment for the author, please follow the link and comment on their blog: rbloggers – verenahaunschmid.
RPushbullet 0.3.1
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
A new release 0.3.1 of the RPushbullet package, following the recent 0.3.0 release, is now on CRAN. RPushbullet interfaces the neat Pushbullet service for inter-device messaging, communication, and more. It lets you easily send alerts to your browser, phone, tablet, … or all at once.
This release owes once again a lot to Seth Wenchel, who helped to update and extend a number of features. We fixed one more small bug stemming from the RJSONIO to jsonlite transition, and added a few more helpers. We also enabled Travis testing and, with it, covr-based coverage analysis using pretty much the same setup I described in this recent blog post.
Changes in version 0.3.1 (2017-02-17)

- The target device designation was corrected (#39).
- Three new (unexported) helper functions test the validity of the api key, device and channel (Seth in #41).
- The summary method for the pbDevices class was corrected (Seth in #43).
- New helper functions pbValidateConf, pbGetUser, pbGetChannelInfo were added (Seth in #44, closing #40).
- New classes pbUser and pbChannelInfo were added (Seth in #44).
- Travis CI tests (and covr coverage analysis) are now enabled via an encrypted config file (#45).
Courtesy of CRANberries, there is also a diffstat report for this release.
More details about the package are at the RPushbullet webpage and the RPushbullet GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.
Using Armadillo with SuperLU
(This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers)
Armadillo is a very versatile C++ library for linear algebra, brought to R via the RcppArmadillo package. It has proven to be very useful and popular, and is (as of February 2017) used by well over 300 CRAN packages, as indicated by the reverse depends / linking-to entries on its CRAN page. Several earlier posts here on the Rcpp Gallery also demonstrate different usage patterns.
Armadillo has a core focus on dense matrices, but continues to expand its capabilities for sparse matrices. Basic operations are supported directly via the templated header files, along with calls back into the default (dense) LAPACK and BLAS libraries, which we can access easily as R uses them.
Armadillo also supports dedicated sparse solvers via the SuperLU package. However, this requires access to the SuperLU library. Many Linux distributions ship it, see e.g. the Debian package page and the Ubuntu package page; there is also a homebrew recipe for OS X / macOS (or other systems using brew). As of this writing, the version in the current Ubuntu release is behind the version in Debian. But it is the 5.2.* version that is in Debian that is also required with current Armadillo versions 7.700.*, so we prepared 'backports' via Dirk's PPA repo.
Recently, a GitHub issue ticket asked how to use SuperLU along with RcppArmadillo. This post essentially repeats the main answer, which was spread out over multiple posts, in a single article.
In a nutshell, two things need to happen:

1. One needs to define the required variable ARMA_USE_SUPERLU, which has to be done before the Armadillo headers are included. One possibility (shown below) is a #define statement right in the code file.

2. One needs to tell the linker to use the SuperLU library. This step is of course not perfectly portable, and does require that the library be installed. (A genuinely portable solution would either need to test for the presence of SuperLU, or include its sources. Both aspects are beyond the scope of this post.)
Sys.setenv("PKG_LIBS"="-lsuperlu")
Now that R knows about this library, we can present the code requiring it:
// Important: this definition ensures Armadillo enables SuperLU
#define ARMA_USE_SUPERLU 1
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace arma;
// [[Rcpp::export]]
void superLuDemo() {
sp_mat A = sprandu<sp_mat>(1000, 1000, 0.1);
vec b = randu<vec>(1000);
mat B = randu<mat>(1000, 5);
vec x = spsolve(A, b); // solve one system
mat X = spsolve(A, B); // solve several systems
bool status = spsolve(x, A, b); // use default solver
if (status == false) { Rcpp::Rcout << "no solution" << endl; }
spsolve(x, A, b, "lapack"); // use LAPACK solver
spsolve(x, A, b, "superlu"); // use SuperLU solver
Rcpp::Rcout << "Done.\n";
}
This code snippet defines a function superLuDemo() which we can call from R:
superLuDemo()
Done.
As the data generated here is random, we did not bother printing the (dense) result vector of length 1000.
To leave a comment for the author, please follow the link and comment on their blog: Rcpp Gallery.
A plot against the CatterPlots complot
(This article was first published on Maëlle, and kindly contributed to R-bloggers)
In these terrible times, we R people have more important subjects to debate/care about than ggplot2 vs. base R graphics (it isn't even worth discussing anyway, ggplot2 is clearly the best alternative). Or so I thought until I saw CatterPlots trending on Twitter this week and even being featured on the Revolutions blog. It was cool because plots with cats are cool, but looking more closely at the syntax of CatterPlots, I couldn't help but realize it was probably a complot to make us all like base R graphics syntax again! So let me show you how to make a cute plot with the awesome ggplot2 extension emojifont.
The plot I’m going to remake in this post is an old one already that I designed for celebrating my PhD thesis submission in March last year. I decided to compare the gestation time of my thesis, 1246 days, to the gestation time of some common mammals.
Getting data

I'm a bit ashamed because I forgot to write down where I took my data from, but I'm pretty sure it was from reading Wikipedia pages and taking more or less random numbers in the gestation time range indicated for each animal. I also chose animals I could find an emoji for. As you can imagine, it was quite an agreeable data collection. Here is what I ended up with:
library("dplyr")
library("forcats")
gestation <- readr::read_csv2("data/2017-02-18-complot.csv") %>%
  arrange(gestation) %>%
  mutate(animal = fct_rev(fct_inorder(animal)))
knitr::kable(gestation)
| animal    | gestation | label       | color |
|-----------|-----------|-------------|-------|
| mouse     | 19        | mouse       | grey  |
| dog       | 61        | dog         | grey  |
| cat       | 64        | cat         | grey  |
| wolf      | 68        | wolf        | grey  |
| tiger     | 105       | tiger       | grey  |
| pig       | 112       | pig         | grey  |
| sheep     | 144       | sheep       | grey  |
| bear      | 220       | bear        | grey  |
| human     | 280       | baby        | grey  |
| whale     | 590       | whale       | grey  |
| elephant  | 617       | elephant    | grey  |
| my thesis | 1246      | closed_book | gold  |
levels(gestation$animal)
## [1] "my thesis" "elephant" "whale" "human" "bear"
## [6] "sheep" "pig" "tiger" "wolf" "cat"
## [11] "dog" "mouse"
I used forcats::fct_inorder and forcats::fct_rev because in such a post not using forcats would probably have been a deadly sin (and I don't have 7 lives), but also because I needed the levels to be ordered from longest to shortest gestation time, otherwise the plot would have looked ugly.
At the time I used gganimate, which I still love, but in the meantime I also fell in love with magick, in particular since reading Bob's post. So I decided to rewrite my code with magick. On this blog I already published an article featuring both emojifont and gganimate.
library("ggplot2")
library("emojifont")
load.emojifont('OpenSansEmoji.ttf')
library("magick")
plot_one_gestation <- function(gestation_time, gestation){
  now_data <- filter_(gestation, ~ gestation <= gestation_time)
  p <- ggplot(now_data)
  p <- p + geom_col(aes(x = animal,
                        y = gestation,
                        fill = color))
  p <- p + scale_fill_manual(values = c("grey" = "grey30",
                                        "gold" = "darkgoldenrod1"))
  p <- p + geom_text(aes(x = animal,
                         y = gestation + 45,
                         label = emoji(label)),
                     family = "OpenSansEmoji", size = 30)
  p <- p + theme(axis.text.y = element_blank(),
                 axis.ticks = element_blank(),
                 text = element_text(size = 40),
                 legend.position = "none")
  p <- p + ggtitle(gestation_time)
  p <- p + scale_x_discrete(limits = levels(gestation$animal))
  p <- p + ylim(c(0, max(gestation$gestation) + 50))
  p <- p + coord_flip()
  p <- p + xlab("Animal")
  p <- p + ylab("Gestation in days")
  outfil <- paste0("figs/animals_", gestation_time, ".png")
  ggsave(outfil, p, width = 5, height = 5)
  outfil
}
I wanted the last image to be repeated 4 times so that everyone might have time to ponder over my accomplishment. I was celebrating and boasting with this graph.
library("purrr")
c(unique(gestation$gestation),
  rep(max(gestation$gestation), 3)) %>%
  map(plot_one_gestation, gestation = gestation) %>%
  map(image_read) %>%
  image_join() %>%
  image_animate(fps = 1) %>%
  image_write("gestation.gif")
Aren’t these animal emojis adorable? At least as adorable as CatterPlots cats? You can also bring emojis to ggplot2 with emoGG (check out Lucy’s recent post) so you don’t need to ever leave the comfort of ggplot2 for the sake of “c[au]t[e]?ness”.
Disclaimer: I actually starred CatterPlots on GitHub because I’m an open-minded cat person (married to a cat-allergic person!).
To leave a comment for the author, please follow the link and comment on their blog: Maëlle. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Optimization and Operations Research in R
(This article was first published on R Blog, and kindly contributed to R-bloggers)
Authors: Stefan Feuerriegel and Joscha Märkle-Huß
R is widely taught in business courses and, hence, known by most data scientists with a business background. However, when it comes to optimization and Operations Research, many other languages are used. Especially for optimization, solutions range from Microsoft Excel solvers to modeling environments such as Matlab and GAMS. Most of these are non-free and require students to learn yet another language. Because of this, we propose to use R in optimization problems of Operations Research, since R is open source, comes for free and is widely known. Furthermore, R provides a multitude of numerical optimization packages that are readily available. At the same time, R is widely used in industry, making it a suitable tool to leverage the potential of numerical optimization.
The material starts with a review of numerical and linear algebra basics for optimization. Here, participants learn how to derive a problem statement that is compatible with solving algorithms. This is followed by an overview of problem classes, such as one- and multi-dimensional problems. Starting with linear and quadratic algorithms, we also cover convex optimization, followed by nonlinear approaches such as gradient-based (gradient descent methods), Hessian-based (Newton and quasi-Newton methods) and non-gradient-based (Nelder-Mead) methods. We finally demonstrate the potent capabilities of R for Operations Research: we show how to solve optimization problems in industry and business, as well as illustrate the use in methods for statistics and data mining (e.g. quantile regression). All examples are supported by appropriate visualizations.
Slides
1 – Motivation
2 – Introduction to R
3 – Advanced R
4 – Numerical Analysis
Homeworks
1 – Homework
2 – Homework
3 – Homework
4 – Homework
To leave a comment for the author, please follow the link and comment on their blog: R Blog.
Catterplots: Plots with cats
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
As a devotee of Tufte, I'm generally against chartjunk. Graphical elements that obscure interpretation of the data occasionally have a useful role to play, but more often than not that role is to entertain at the expense of enlightenment, or worse, to actively mislead. So it's with mixed feelings that I refer you to catterplots, an R package by David Gibbs to create scatterplots using outlines of cats as the symbols.
But hey, it's the Friday before a long weekend, so here's a poorly annotated catterplot of households with dogs vs households with cats in the 48 continental US states (plus DC):
I found the data behind this chart at data.world, which looks to be an interesting new resource for data sets. (Registration is required to download data, however.) The code behind the chart is below. (If you want a variant featuring Nyan Cat, try cat=11.)
The catterplot package is available on GitHub at the link below.
GitHub (GibbsDavidl): Catterplots
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
January New Data Packages
(This article was first published on RStudio, and kindly contributed to R-bloggers)
by Joseph Rickert
As forecast, the number of R packages hosted on CRAN exceeded 10,000 in January. Dirk Eddelbuettel, who tracks what’s happening on CRAN with his CRANberries site, called hurricaneexposure the 10,000th package in a tweet on January 27th.
hurricaneexposure was one of two hundred and six new packages that arrived on CRAN in January. Approximately 10% of these packages have to do with providing access to data by some means or another. Some packages contain the data sets, some provide wrappers to APIs, and at least one package provides code to scrape data from a site. The following 17 packages are my picks for data-related packages for January 2017. I will select packages in other categories in a follow-up post.
Data Packages
elevatr v0.1.1: Provides access to several databases that provide elevation data, including Mapzen Elevation Service, Mapzen Terrain Service, Amazon Terrain Tiles, and the USGS Elevation Point Query Service. There is a vignette.
epidata v0.1.0: Provides tools to retrieve data from the Economic Policy Institute. The README shows how to use the package.
europop v0.3: Contains a data set giving the populations of all European cities with at least 10,000 inhabitants during the period 1500-1800.
fivethirtyeight v0.1.0: Provides the data, code, and interactive visualizations behind FiveThirtyEight Stories. There is a vignette that provides an example of a data analysis, and a list of data sets that are included.
getCRUCLdata v1.1: Provides functions that automate downloading and importing climatology data from the University of East Anglia Climate Research Unit (CRU). There is a vignette to get you started.
hurricaneexposure v0.0.1: Allows users to create time series of tropical storm exposure histories for chosen counties for a number of hazard metrics (wind, rain, distance from the storm, etc.). The vignette provides an overview.
metScanR v0.0.1: Provides functions for mapping and gathering meteorological data from various US surface networks: COOP, USCRN, USRCRN, AL-USRCRN, ASOS, AWOS, SNOTEL, SNOTEL-LITE, SCAN, SNOW, and NEON.
mglR v0.1.0: Provides tools to download and organize large-scale, publicly available genomic studies on a candidate gene scale. The vignette shows how to use the package.
nzpullover v0.0.2: Contains data sets of driving offences and fines in New Zealand between 2009 and 2016, originally published by the New Zealand Police.
owmr v0.7.2: Provides a wrapper for the OpenWeatherMap API.
PeriodicTable v0.1.1: Contains a data set of the properties of chemical elements.
pwt9 v9.00: Contains the Penn World Table 9, which provides information on relative levels of income, output, inputs, and productivity for 182 countries between 1950 and 2014.
rdwd v0.7.0: Provides functions to obtain climate data from the German Weather Service, Deutscher Wetterdienst (DWD). There is a vignette on weather stations and another showing how to use the package.
rwars v1.0.0: Provides functions to retrieve and reformat data from the ‘Star Wars’ API SWAPI. The vignette shows how to use the package.
wikidataQueryServiceR v0.1.0: Provides an API client library for the Wikidata Query Service, which provides a way for tools to query Wikidata via SPARQL. See the README for how to use it.
wikilake v0.1: Provides functions to scrape metadata about lakes from Wikipedia. The vignette fetches data from Michigan lakes.
worrms v0.1.0: Provides a client for the World Register of Marine Species. The vignette shows how to use the package.
To leave a comment for the author, please follow the link and comment on their blog: RStudio.
Naive Bayes Classification in R (Part 2)
(This article was first published on Environmental Science and Data Analytics, and kindly contributed to R-bloggers)
Following on from Part 1 of this two-part post, I would now like to explain how the Naive Bayes classifier works before applying it to a classification problem involving breast cancer data. The dataset is sourced from Matjaz Zwitter and Milan Soklic from the Institute of Oncology, University Medical Center in Ljubljana, Slovenia (formerly Yugoslavia) and the attributes are as follows:
age: a series of ranges from 20-29 to 70-79
menopause: whether a patient was pre- or post-menopausal upon diagnosis
tumor.size: the largest diameter (mm) of the excised tumor
inv.nodes: the number of axillary lymph nodes which contained metastatic breast cancer
node.caps: whether metastatic cancer was contained by the lymph node capsule
deg.malign: the histological grade of the tumor (1-3, with 3 = highly abnormal cells)
breast: which breast the cancer occurred in
breast.quad: region of the breast the cancer occurred in (four quadrants, with nipple = central)
irradiat: whether the patient underwent radiation therapy
Some preprocessing of these data was required as there were some NAs (9 in total). I imputed predicted values using separate Naive Bayes classifiers. The objective here is to attempt to predict, using these attributes, with relatively high accuracy whether a recurrence of breast cancer is likely to occur in patients who were previously diagnosed and treated for the disease. We can pursue this objective by using the Naive Bayes classification method.
Naive Bayes’ Classification
Below is the Naive Bayes’ Theorem:
P(A | B) = P(A) * P(B | A) / P(B)
Which can be derived from the general multiplication formula for AND events:
P(A and B) = P(A) * P(B | A)
P(B | A) = P(A and B) / P(A)
P(B | A) = P(B) * P(A | B) / P(A)
If I replace the letters with meaningful words as I have been adopting throughout, the Naive Bayes formula becomes:
P(outcome | evidence) = P(outcome) * P(evidence | outcome) / P(evidence)
It is with this formula that the Naive Bayes classifier calculates conditional probabilities for a class outcome given prior information or evidence (our attributes in this case). The reason it is termed “naive” is because we assume independence between attributes when in reality they may be dependent in some way. For the breast cancer dataset we will be working with, some attributes are clearly dependent such as age and menopause status while some may or may not be dependent such as histological grade and tumor size.
This assumption allows us to calculate the probability of the evidence by multiplying the individual probabilities of each piece of evidence occurring together using the simple multiplication rule for independent AND events. Another point to note is that this naivety results in probabilities that are not entirely mathematically correct but they are a good approximation and adequate for the purposes of classification. Indeed, the Naive Bayes classifier has proven to be highly effective and is commonly deployed in email spam filters.
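As a sketch of this naive multiplication (the numbers here are made up for illustration, not taken from the breast cancer data), the posterior for an outcome given two independent pieces of evidence looks like this in R:

```r
# Hypothetical probabilities, assumed independent given the outcome
# (the "naive" step): multiply the per-attribute conditionals together.
p_outcome           <- 0.3          # prior P(outcome)
p_evid_given_out    <- c(0.6, 0.2)  # P(evidence_i | outcome), one per attribute
p_evidence          <- 0.25         # P(evidence), the normalising term

posterior <- p_outcome * prod(p_evid_given_out) / p_evidence
posterior
#> [1] 0.144
```

The prod() call is the simple multiplication rule for independent AND events applied across all attributes at once.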
Calculating Conditional Probabilities
Conditional probabilities are fundamental to the working of the Naive Bayes formula. Tables of conditional probabilities must be created in order to obtain values to use in the Naive Bayes algorithm. The R package e1071 contains a very nice function for creating a Naive Bayes model:

library(e1071)
model <- naiveBayes(class ~ ., data = breast_cancer)
class(model)
summary(model)
print(model)

The model has class “naiveBayes” and the summary tells us that the model provides a priori probabilities of no-recurrence and recurrence events as well as conditional probability tables across all attributes. To examine the conditional probability tables, just print the model.
One of our tasks for this assignment was to create code which would give us the same conditional probabilities as those output by the naiveBayes() function. I went about this in the following way:
tbl_list <- sapply(breast_cancer[-10], table, breast_cancer[, 10])
tbl_list <- lapply(tbl_list, t)
cond_probs <- sapply(tbl_list, function(x) {
  apply(x, 1, function(x) {
    x / sum(x)
  })
})
cond_probs <- lapply(cond_probs, t)
print(cond_probs)

The first line of code uses the sapply function to loop over all attribute variables in the dataset and create tables against the class attribute. I then used the lapply function to transpose all tables in the list so the rows represented the class attribute.
To calculate conditional probabilities for each element in the tables, I used sapply, lapply and anonymous functions. I had to transpose the output in order to get the same structure as the naiveBayes model output. Finally, I printed out my calculated conditional probabilities and compared them with the naiveBayes output to validate the calculations.
Applying the Naive Bayes’ Classifier
So I’ve explained (hopefully reasonably well) how the Naive Bayes classifier works based on the fundamental rules of probability. Now it’s time to apply the model to the data. This is easily done in R by using the predict() function.

preds <- predict(model, newdata = breast_cancer)

You will see that I have trained the model using the entire dataset and then made predictions on the same dataset. In our assignment we were asked to train the model and test it on the dataset, treating the dataset as an unlabeled test set. This is unconventional, as the training set and test set are then identical, but I believe the assignment was intended to just test our understanding of how the method works. In practice, one would use a training set for the model to learn from and a test set to assess model accuracy.
If one outcome class is more abundant in the dataset, as is the case with the breast cancer data (no-recurrence: 201, recurrence: 85), the data is unbalanced. This is okay for a generative Naive Bayes model, as you want your model to learn from real-world events and to capture the truth. Manipulating the data to achieve less skew would be dangerous.
Applying the model to the data gives the following confusion matrix from which a model accuracy of 75% can be calculated:
conf_matrix <- table(preds, breast_cancer$class)

This post has only scraped the surface of classification methods in machine learning, but it has been a useful revision for myself and perhaps it may help others new to the Naive Bayes classifier. Please feel free to comment and correct any errors that may be present.
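The 75% accuracy quoted above is simply the proportion of the confusion matrix that lies on its diagonal. As a sketch with a hypothetical 2x2 matrix (the cell counts below are illustrative, chosen only so the 286 observations split roughly as in the text):

```r
# Hypothetical confusion matrix: predictions in rows, true classes in columns
conf_matrix <- matrix(c(180, 21, 50, 35), nrow = 2,
                      dimnames = list(pred  = c("no-recurrence", "recurrence"),
                                      truth = c("no-recurrence", "recurrence")))

# Accuracy = correctly classified (diagonal) / total observations
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
round(accuracy, 2)
#> [1] 0.75
```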
Featured image By Dennis Hill from The OC, So. Cal. – misc 24, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=54704175
To leave a comment for the author, please follow the link and comment on their blog: Environmental Science and Data Analytics.
Naive Bayes Classification in R (Part 1)
(This article was first published on Environmental Science and Data Analytics, and kindly contributed to R-bloggers)
Introduction
A very useful machine learning method which, for its simplicity, is incredibly successful in many real-world applications is the Naive Bayes classifier. I am currently taking a machine learning module as part of my data science college course and this week’s practical work involved a classification problem using the Naive Bayes method. I thought that I’d write a two-part piece about the project and the concepts of probability in order to consolidate my own understanding as well as to share my method with others who may find it useful. If the reader wishes to get straight into the Bayes classification problem, please see Part 2 here.
The dataset used in the classification problem is titled “Breast cancer data” and is sourced from Matjaz Zwitter and Milan Soklic from the Institute of Oncology, University Medical Center in Ljubljana, Slovenia (formerly Yugoslavia). It is an interesting dataset which contains categorical data on two classes with nine attributes. The dataset can be found here. The two outcome classes are “no-recurrence events” and “recurrence events”, which refer to whether breast cancer returned to a patient or not, having previously been diagnosed with the disease and treated for it.
Our goal was to use these data to build a Naive Bayes classification model which could predict with good accuracy whether a patient diagnosed with breast cancer is likely to experience a recurrence of the disease based on the attributes. First, I will briefly discuss the basic concepts of probability before moving on to the classification problem in Part 2.
Fundamentals of Probability
Probability boils down to working out proportions and performing mathematical operations on them such as addition, multiplication and division. The fact that Naive Bayes classification has been so successful given this simple foundation is truly remarkable. There are a number of rules and concepts involved in probability calculations which one should be aware of in order to understand how the Naive Bayes classifier works. We will discuss these now, but first I want to explain independent events, dependent events and mutual exclusiveness.
Independent events are those which, upon occurrence, do not affect the probability of another event occurring. An example would be Brazil winning the football World Cup and a hurricane forming over the Atlantic Ocean in September. Dependent events are those which, upon occurrence, affect the probability of another event occurring (i.e. they are linked in some way). An example would be getting a cold and going to work the following day. The probability of you going to work can be affected by the occurrence of a cold as you don’t feel very well and would like to not infect your colleagues. When two or more events are mutually exclusive, they cannot occur simultaneously. What is the probability of rolling a 6 and a 3 on a single dice roll? Zero. It is impossible. These two outcomes are mutually exclusive.
To calculate the probability of a single event occurring, sometimes called the observational probability, you just take the number of times the event occurred and divide it by the total number of processes which occurred that could have led to that event. For example, let’s say I sprint 100 metres fifty times over the course of a month. I want to know the probability of my time being below 10 seconds based on the previous sprints. If I ran a sub-10-second 100 metres on 27 occasions, then the observational probability of me running a sub-10-second 100 metres is simply:
P(<10 second 100 m) = (27 / 50) = 0.54 (54%)
Therefore, I can expect to sprint 100 metres in under 10 seconds about half of the time based on records of the fifty completed sprints. Please note that the P() wrapper is used to denote the probability of some outcome and is used throughout these posts.
Multiplication Rules for AND Events
When calculating the probability of two or more events occurring simultaneously, we first consider whether the events are independent or dependent. If they are independent we can use the simple multiplication rule:
P(outcome 1 AND outcome 2) = P(outcome 1) * P(outcome 2)
If I were to calculate the probability of Brazil winning the football World Cup and a hurricane forming over the Atlantic Ocean in September, I would use this simple multiplication rule. The two events are independent as the occurrence of one does not affect the other’s chance of occurring. If Brazil have a probability of winning the World Cup of 41% and the probability of a hurricane over the Atlantic Ocean in September is 91%, then we calculate the probability of both occurring:
P(Brazil AND Hurricane) = P(Brazil) * P(Hurricane)
= (0.41) * (0.91)
= 0.3731 (37%)
If two or more events are dependent, however, we must use the general multiplication rule. This formula is always valid, in both cases of independent and dependent events, as we will see.
P(outcome 1 AND outcome 2) = P(outcome 1) * P(outcome 2 | outcome 1)
P(outcome 2 | outcome 1) refers to the conditional probability of outcome 2 occurring given outcome 1 has already occurred. One can immediately see how this formula incorporates dependence between the events. If the events were independent, then the conditional probability is irrelevant, as one outcome does not influence the chance of the other, and P(outcome 2 | outcome 1) is simply P(outcome 2). The formula just becomes the simple multiplication rule already described.
We can apply the general multiplication rule to a deck of cards example. What would be the probability of drawing a King of any suit and an Ace of any suit from a 52-card deck with just two draws? There are 4 Kings and 4 Aces in a standard deck, so we know that drawing an Ace immediately affects our chances of drawing a King as there are fewer cards in the deck after the first draw. We can use the general multiplication formula as follows:
P(Ace AND King) = P(Ace) * P(King | Ace)
= (4 / 52) * (4 / 51)
= 0.006033183 (0.6%)
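The card arithmetic can be checked directly in R:

```r
# P(Ace on draw 1) * P(King on draw 2 | Ace on draw 1)
p_ace_then_king <- (4 / 52) * (4 / 51)
round(p_ace_then_king, 9)
#> [1] 0.006033183
```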
Obviously, if two events are mutually exclusive and cannot occur simultaneously, they are disjoint and the multiplication rules cannot be applied. The dice roll example describes such a scenario. The best we can do in such a case is state the probability of either outcome occurring. This brings us to the simple addition rule.
When calculating the probability of either one event or the other occurring, we use the addition rule. When the outcomes are mutually exclusive, we use the simple addition formula:
P(outcome 1 OR outcome 2) = P(outcome 1) + P(outcome 2)
Applied to the dice roll example, what is the probability of rolling a 6 or a 3? Both outcomes cannot occur simultaneously. The probability of rolling a 6 is (1 / 6) and the same can be said for rolling a 3. Therefore,
P(6 OR 3) = (1 / 6) + (1 / 6) = 0.33 (33%)
However, if the events are not mutually exclusive and can occur simultaneously, we must use the following general addition formula, which is always valid in both cases of mutual exclusiveness and non-mutual exclusiveness.
P(outcome 1 OR outcome 2) = P(outcome 1) + P(outcome 2) – P(outcome 1 AND outcome 2)
Applied to the World Cup and hurricane example, this would mean calculating the probabilities of both outcomes as well as the probability of both occurring simultaneously.
P(Brazil) + P(Hurricane) – P(Brazil AND Hurricane)
We know that the outcomes are independent, the occurrence of one does not affect the probability of the other occurring, and we can therefore use the simple multiplication rule for two simultaneous events to calculate P(Brazil AND Hurricane).
P(Brazil AND Hurricane) = (0.41) * (0.91)
= 0.3731 (37%)
Finally, we can plug this probability into the main formula to get our answer:
P(Brazil OR Hurricane) = (0.91) + (0.41) – (0.3731) = 0.9469 (95%)
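A quick R check of the addition-rule arithmetic:

```r
p_brazil    <- 0.41
p_hurricane <- 0.91

# Independent events, so the AND term uses the simple multiplication rule
p_both   <- p_brazil * p_hurricane
p_either <- p_brazil + p_hurricane - p_both
p_either
#> [1] 0.9469
```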
This makes sense right? Brazil are indeed an awesome football team. However, what drives this probability up is the fact that September is one of the months in the Atlantic hurricane season and you are almost guaranteed to see a hurricane event.
Summary
That was a brief introduction to probability and the rules associated with it. I hope it is clear and easy to follow. The important rules to remember are the general forms of the multiplication rule and the addition rule, as they are valid in all cases. Using them in the above examples in place of the simple rules still yields the same results. As such, it may be more useful to memorise these general forms. Here they are once more:
P(outcome 1 AND outcome 2) = P(outcome 1) * P(outcome 2 | outcome 1)
P(outcome 1 OR outcome 2) = P(outcome 1) + P(outcome 2) – P(outcome 1 AND outcome 2)
I have chosen to use “outcome 1” and “outcome 2” instead of the usual “A” and “B” format because I believe it is easier to understand Naive Bayes when replacing the letters with meaningful words. Here is how the general rules are typically written:
P(A and B) = P(A) * P(B | A)
P(A or B) = P(A) + P(B) – P(A and B)
In Part 2 I describe how to implement the Naive Bayes classifier in R and explain how it works based on the fundamentals of probability outlined here.
Featured image By Dennis Hill from The OC, So. Cal. – misc 24, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=54704175
To leave a comment for the author, please follow the link and comment on their blog: Environmental Science and Data Analytics.
How to perform PCA with R?
(This article was first published on François Husson, and kindly contributed to R-bloggers)
This post shows how to perform PCA with R and the package FactoMineR.
If you want to learn more on methods such as PCA, you can enroll in this MOOC (everything is free): MOOC on Exploratory Multivariate Data Analysis
Dataset
Here is a wine dataset, with 10 wines and 27 sensory attributes (like sweetness, bitterness, fruity odor, and so on), 2 preference variables, and a qualitative variable corresponding to the wine labels (there are 2 labels, Sauvignon and Vouvray). The values in the data table correspond to the average score given by several judges for the same wine and descriptive variable. The aim of doing PCA here is to characterize the wines according to their sensory characteristics.
Performing PCA … with additional information
Here are the lines of code used. Note that we use the information given by the qualitative variable.
### Read data
wine
### Loading FactoMineR
library(FactoMineR)
### PCA with supplementary variables
res
### Print the main results
summary(res)
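The two bare lines above ("wine" and "res") lost their right-hand sides in extraction. Under the assumptions that the qualitative label sits in column 1 and the two preference variables are the last two columns of the 30-column table (the file name and column indices below are guesses, not from the original post), the calls would look something like:

```r
### Read data (file name is a placeholder)
wine <- read.table("wine.txt", header = TRUE, sep = "\t", row.names = 1)

### Loading FactoMineR
library(FactoMineR)

### PCA with supplementary variables: the qualitative label (column 1)
### and the 2 preference variables (assumed columns 29:30) are supplementary,
### so only the 27 sensory attributes build the dimensions
res <- PCA(wine, quali.sup = 1, quanti.sup = 29:30)

### Print the main results
summary(res)
```

Declaring variables as supplementary means they are projected onto the dimensions afterwards without influencing how the dimensions are built.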
Two graphs are given by default, one for the individuals, one for the quantitative variables.
But it is interesting to consider the qualitative variable to better understand the differences between wines. Wines are colored according to their label.

## Drawing wines according to the label
plot(res, habillage = "Label")

Interpretation
The graph of the individuals shows, for instance, that S Michaud and S Trotignon are very “close”. It means that the scores for S Michaud and S Trotignon are approximately the same, whatever the variable. In the same way, Aub Marigny and Font Coteaux are wines with similar sensory scores for the 27 attributes. On the other hand, Font Brulés and S Trotignon have very different sensory profiles, because the first principal component, representing the main axis of variability between wines, separates them strongly.
The variables astringency, visual intensity, mushroom odor and candied fruit odor, found to the right, have correlations close to 1 with the first dimension. Since the correlation with the 1st dimension is close to 1, the values of these variables move in the same direction as the coordinates in the 1st dimension. Wines with a small value in the 1st dimension have low values for these variables, and wines with large values in the 1st dimension have high values for these variables. Thus, the wines that are to the right of the plot have high (and positive) values in the 1st dimension and thus have high values for these variables. With the same logic, wines that are to the left have a small value in the 1st dimension, and thus low values for these variables.
For the variables passionfruit odor, citrus odor and freshness, everything is the other way around. The correlation with the 1st dimension is close to -1, and thus the values move in the opposite direction. Wines with a low value in the 1st dimension have high values for these variables, and wines with large values in the 1st dimension have small values for these variables.
Overall, we see that the first dimension splits apart wines that are considered fruity and flowery (on the left) from wines that are woody or with vegetal odors. And this is the main source of variability.
So then, how can we interpret the 2nd dimension, the vertical axis? At the top, wines have large values on the vertical axis. Since the correlation coefficients between the 2nd dimension and variables such as acidity or bitterness are close to 1, it means that wines at the top take large values for these variables. And wines at the bottom have small values in the 2nd dimension, and thus small values for these variables. For sweetness, the correlation coefficient is close to -1, so wines that have a small value in the 2nd dimension are sweet, while wines that have large values are not.
Overall, the 2nd dimension separates the wines at the top, acidic and bitter, from sweet wines at the bottom.
To leave a comment for the author, please follow the link and comment on their blog: François Husson.
Moving largish data from R to H2O – spam detection with Enron emails
(This article was first published on Peter's stats stuff – R, and kindly contributed to R-bloggers)
Moving around sparse matrices of text data – the limitations of as.h2o
This post is the resolution of a challenge I first wrote about in late 2016: moving large sparse data from an R environment onto an H2O cluster for machine learning purposes. In that post, I experimented with functionality recently added by the H2O team to their supporting R package – the ability for as.h2o() to interpret a sparse Matrix object from R and convert it to an H2O frame. The Matrix and as.h2o method is OK for medium-sized data but broke down on my hardware with a larger dataset – a bag of words from New York Times articles with 300,000 rows and 102,000 columns. Cell entries are the number of times a particular word is used in the document represented by a row and are mostly empty, so my 12GB laptop has no problem managing the data in a sparse format like Matrix from the Matrix package or a simple triplet matrix from the slam package. I’m not sure what as.h2o does under the hood in converting from Matrix to an H2O frame, but it’s too much for my laptop.
My motivation for this is that I want to use R for convenient preprocessing of textual data using the tidytext approach, but H2O for high-powered machine learning. tidytext makes it easy to create a sparse matrix with cast_dtm or cast_sparse, but uploading this to H2O can be a challenge.
After some to-and-fro on Stack Overflow, the best advice was to export the sparse matrix from R into an SVMLight/LIBSVM-format text file, then read it into H2O with h2o.importFile(..., parse_type = "SVMLight"). This turned the problem from a difficult and possibly intractable memory-management challenge into a difficult and possibly intractable data-formatting and file-writing challenge: how to efficiently write files in SVMLight format.
SVMLight format combines a data matrix with some modelling information, i.e. the response value of a model, or "label" as it is (slightly oddly, I think) often called in this world. Instead of a more conventional row-based sparse matrix format, which might convey information in row-column-value triples, it uses label column:value indicators. It looks like this:
1 10:3.4 123:0.5 34567:0.231
0.2 22:1 456:0.3
That example is equivalent to two rows of a sparse matrix with at least 34,567 columns. The first row has 1 as the response value, 3.4 in the 10th column of explanatory variables, and 0.231 in the 34,567th column; the second row has 0.2 as the response value, 1 in the 22nd column, and so on.
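To make the format concrete, here is a small sketch of my own (a hypothetical helper, not part of the original post) that parses such a line back into a label plus (column, value) pairs using base R only:

```r
# Hypothetical helper, for illustration only: split an SVMLight line into
# its response value ("label") and its column:value pairs.
parse_svmlight_line <- function(line) {
  parts <- strsplit(line, " ", fixed = TRUE)[[1]]
  pairs <- strsplit(parts[-1], ":", fixed = TRUE)
  list(label = as.numeric(parts[1]),
       j = as.integer(vapply(pairs, `[`, character(1), 1)),  # column indices
       v = as.numeric(vapply(pairs, `[`, character(1), 2)))  # cell values
}

parse_svmlight_line("1 10:3.4 123:0.5 34567:0.231")
```

Run on the first example line above, this returns label 1, columns 10, 123 and 34567, and values 3.4, 0.5 and 0.231.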
Writing data from R into this format is a known problem, discussed in this Q&A on Stack Overflow. Unfortunately, the top-rated answer to that question, e1071::write.svm, is reported as being slow, and it is also integrated into a workflow that requires you to first fit a Support Vector Machine model to the data, a step I wanted to avoid. That Q&A led me to a GitHub repo by zygmuntz that had an (also slow) solution for writing dense matrices into SVMLight format, but that didn't help me, as my data were too large for R to hold in dense format. So I wrote my own version for taking simple triplet matrices and writing SVMLight format. My first version depended on nested paste statements applied to each row of the data and was still too slow at scale, but with the help of yet another Stack Overflow interaction and some data.table wizardry by @Roland, the expected time for writing my 300,000 by 83,000 New York Times matrix (having removed stop words) dropped from several centuries to two minutes.
I haven't turned this into a package – it would seem better to find an existing package to add it to than to create a package just for this one function; any ideas appreciated. The functions are available on GitHub, but in case I end up moving them, here they are in full. One function creates a big character vector; the second writes that to file. This means multiple copies of the data need to be held in R, and hence creates memory limitations, but it is much, much faster than writing one line at a time (seconds rather than years in my test cases).
library(data.table)
library(slam)

# Convert a simple triplet matrix to svm format
#' @author Peter Ellis
#' @return a character vector of length n = nrow(stm)
calc_stm_svm <- function(stm, y){
   # returns a character vector of length y ready for writing in svm format
   if(!"simple_triplet_matrix" %in% class(stm)){
      stop("stm must be a simple triplet matrix")
   }
   if(!is.vector(y) | nrow(stm) != length(y)){
      stop("y should be a vector of length equal to number of rows of stm")
   }
   n <- length(y)
   # data.table solution thanks to @Roland at http://stackoverflow.com/questions/41477700/optimisingsapplyorforpastetoefficientlytransformsparsetripletm/41478999#41478999
   stm2 <- data.table(i = stm$i, j = stm$j, v = stm$v)
   res <- stm2[, .(i, jv = paste(j, v, sep = ":"))
               ][order(i), .(res = paste(jv, collapse = " ")), by = i][["res"]]
   out <- paste(y, res)
   return(out)
}

#' @param stm a simple triplet matrix (class exported by slam) of features (ie explanatory variables)
#' @param y a vector of labels. If not provided, a dummy of 1s is provided
#' @param file file to write to.
#' @author Peter Ellis
write_stm_svm <- function(stm, y = rep(1, nrow(stm)), file){
   out <- calc_stm_svm(stm, y)
   writeLines(out, con = file)
}
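For readers without data.table handy, the core triplet-to-SVMLight transformation can be sketched in a few lines of base R. This is a toy rendering of the same idea (my own hypothetical helper, not from the post, and far slower at scale), working directly on the i, j, v vectors of a sparse matrix:

```r
# Toy base-R version of the triplet-to-SVMLight idea (illustration only).
# i, j, v are the row, column and value vectors of a sparse matrix; y the labels.
# Assumes every row has at least one non-zero entry.
stm_to_svmlight <- function(i, j, v, y) {
  jv <- paste(j, v, sep = ":")                             # "column:value" tokens
  body <- vapply(split(jv, i), paste, character(1), collapse = " ")
  paste(y, body)                                           # prepend the labels
}

stm_to_svmlight(i = c(1, 1, 2), j = c(10, 123, 22),
                v = c(3.4, 0.5, 1), y = c(1, 0.2))
```

This reproduces the leading tokens of the two example SVMLight rows shown earlier.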
Example application – spam detection with the Enron emails
Although I’d used the New York Times bags of words from the UCI machine learning dataset repository for testing the scaling up of this approach, I actually didn’t have anything I wanted to analyse that data for in H2O. So casting around for an example use case I decided on using the Enron email collection for spam detection, first analysed in a 2006 conference paper by V. Metsis, I. Androutsopoulos and G. Paliouras. As well as providing one of the more sensational corporate scandals of recent times, the Enron case has blessed data scientists with one of the largest published sets of emails collected from their natural habitat.
The original authors classified the emails as spam or ham and saved these preprocessed data for future use and reproducibility. I’m not terribly knowledgeable (or interested) in spam detection, so please take the analysis below as a crude and naive example only.
Data
First the data need to be downloaded and unzipped. The files are stored as six Tape ARchive files.
library(tidyverse)
library(tidytext)
library(tm)
library(testthat)
library(data.table)
library(h2o)
library(stringr)
library(forcats)
#===========download files=============
# Files described at http://csmining.org/index.php/enronspamdatasets.html
baseurl <- "http://csmining.org/index.php/enronspamdatasets.html?file=tl_files/Project_Datasets/EnronSpam%20datasets/Preprocessed/enron"
filenames <- paste0(baseurl, 1:6, ".tar.tar")

for(i in 1:6){
   message("Downloading ", i)
   dfile <- paste0("enron", i, ".tar.tar")
   download.file(filenames[i], destfile = dfile, mode = "wb")
   message("Unarchiving ", i)
   untar(dfile)
}
This creates six folders with the names enron1, enron2, etc., each with a spam and a ham sub-folder containing numerous text files. The files look like this example piece of ham (i.e. non-spam; a legitimate email), chosen at random:
Subject: re : creditmanager net meeting
aidan ,
yes , this will work for us .
vince
" aidan mc nulty " on 12 / 16 / 99 08 : 36 : 14 am
to : vince j kaminski / hou / ect @ ect
cc :
subject : creditmanager net meeting
vincent , i cannot rearrange my schedule for tomorrow so i would like to
confirm that we will have a net  meeting of creditmanager on friday 7 th of
january at 9 . 30 your time .
regards
aidan mc nulty
212 981 7422
The preprocessing has removed duplicates, emails sent to themselves, some of the headers, etc.
Importing the data into R and making tidy data frames of documents and word counts is made easy by Silge and Robinson's tidytext package, which I never tire of saying is a game-changer for convenient analysis of text by statisticians:
#=============import to R================
# Adapting the example at http://tidytextmining.com/usenet.html
folders <- paste0("enron", 1:6)

spam <- data_frame(file = dir(paste0(folders, "/spam"), full.names = TRUE)) %>%
   mutate(text = map(file, readLines, encoding = "Latin1")) %>%
   transmute(id = basename(file), text) %>%
   unnest(text) %>%
   mutate(SPAM = "spam")

ham <- data_frame(file = dir(paste0(folders, "/ham"), full.names = TRUE)) %>%
   mutate(text = map(file, readLines, encoding = "Latin1")) %>%
   transmute(id = basename(file), text) %>%
   unnest(text) %>%
   mutate(SPAM = "ham")

enron_raw <- rbind(spam, ham)
#============error checks - failing!===================
# Check against the counts provided at http://csmining.org/index.php/data.html
# Turns out some 3013 spam messages have gone missing
# should be 20170 spam messages and 16545 ham messages:

# returns an error
expect_equal(length(unique(enron_raw$id)), 20170 + 16545)

enron_raw %>%
   select(id, SPAM) %>%
   distinct() %>%
   summarise(spam_count = sum(SPAM == "spam"), ham_count = sum(SPAM == "ham"))
# For my purposes I decide not to worry about this.
#=================further processing==================
enron <- enron_raw %>%
   # will just remove the "Subject:" and "Subject :" and treat subject words like any other
   mutate(text = gsub("^Subject *: *", "", text),
          text = gsub("<U.....>", "", text, fixed = FALSE))

enron_words <- enron %>%
   unnest_tokens(word, text) %>%
   select(-SPAM)
As well as basic word counts, I wanted to experiment with other characteristics of emails, such as the number of words and the number and proportion of stop words (frequently used words like "and" and "the"). I create a traditional data frame with a row for each email, identified by id, and columns indicating whether it is SPAM plus those other characteristics of interest.
# First I'm creating a summary, dense data frame with some numeric info on each document
enron_sum1 <- enron %>%
   mutate(number_characters = nchar(text)) %>%
   group_by(id, SPAM) %>%
   summarise(number_characters = sum(number_characters))

enron_sum2 <- enron_words %>%
   group_by(id) %>%
   summarise(number_words = length(word))

enron_sum3 <- enron_words %>%
   anti_join(stop_words, by = "word") %>%
   group_by(id) %>%
   summarise(number_nonstop_words = length(word))

enron_sum_total <- enron_sum1 %>%
   left_join(enron_sum2, by = "id") %>%
   left_join(enron_sum3, by = "id") %>%
   mutate(number_stop_words = number_words - number_nonstop_words,
          proportion_stop_words = number_stop_words / number_words) %>%
   select(-number_nonstop_words)

enron_sum_total
Source: local data frame [33,702 x 6]
Groups: id [33,702]
id SPAM number_characters number_words number_stop_words
<chr> <chr> <int> <int> <int>
1 0001.19991210.farmer.ham.txt ham 28 4 0
2 0001.19991210.kaminski.ham.txt ham 24 4 3
3 0001.20000117.beck.ham.txt ham 3486 559 248
4 0001.20000606.lokay.ham.txt ham 3603 536 207
5 0001.20010207.kitchen.ham.txt ham 322 48 18
6 0001.20010402.williams.ham.txt ham 1011 202 133
7 0002.19991213.farmer.ham.txt ham 4194 432 118
8 0002.20010207.kitchen.ham.txt ham 385 64 40
9 0002.20010525.SA_and_HP.spam.txt spam 990 170 80
10 0002.20031218.GP.spam.txt spam 1064 175 63
I next make my sparse matrix as a document-term matrix (which is a special case of a simple triplet matrix from the slam package), with a column for each word (having first limited myself to interesting words):
used_words <- enron_words %>%
   # knock out stop words:
   anti_join(stop_words, by = "word") %>%
   # knock out numerals, and words with only 2 or 1 letters:
   mutate(word = gsub("[0-9]", "", word),
          wordlength = nchar(word)) %>%
   filter(wordlength > 2) %>%
   group_by(word) %>%
   summarise(count = length(word)) %>%
   ungroup() %>%
   # knock out words used less than 10 times:
   filter(count >= 10)

enron_dtm <- enron_words %>%
   right_join(used_words, by = "word") %>%
   cast_dtm(id, word, count)

# we need a version of the dense data in the same order as the document-term matrix, to do a sort of
# manual join of the sparse matrix with the dense one later in H2O.
rows <- data_frame(id = rownames(enron_dtm))
enron_dense <- left_join(rows, enron_sum_total, by = "id")
expect_equal(nrow(enron_dtm), nrow(enron_dense))
expect_equal(rownames(enron_dtm), enron_dense$id)
Now we can load our two datasets onto an H2O cluster for analysis:
#================import to h2o and join up there============
h2o.init(nthreads = -1, max_mem_size = "8G")

# Load up the dense matrix with counts of stopwords etc:
enron_dense_h2o <- as.h2o(enron_dense)

# Load up the sparse matrix with columns for each word:
thefile <- tempfile()
write_stm_svm(enron_dtm, file = thefile)
enron_sparse_h2o <- h2o.uploadFile(thefile, parse_type = "SVMLight")
unlink(thefile)

# Number of rows should be equal:
expect_equal(nrow(enron_sparse_h2o), nrow(enron_dtm))
# Number of columns should be 1 extra in H2O, dummy variable of labels (1) added by write_stm_svm:
expect_equal(ncol(enron_sparse_h2o), ncol(enron_dtm) + 1)
# First column should be the dummy labels = all one
expect_equal(mean(enron_sparse_h2o[ , 1]), 1)

enron_fulldata <- h2o.cbind(enron_sparse_h2o, enron_dense_h2o)
head(colnames(enron_fulldata), 10)

# Convert the target variable to a factor so h2o.glm and other modelling functions
# know what to do with it:
enron_fulldata[ , "SPAM"] <- as.factor(enron_fulldata[ , "SPAM"])
I now have an H2O frame with 33,602 rows and 26,592 columns; most of the columns represent words, with the cells being counts, but some columns represent other variables such as the number of stop words.
Analysis
To give H2O a workout, I decided to fit four different types of models trying to understand which emails were ham and which spam:

- generalized linear model, with elastic net regularization to help cope with the large number of explanatory variables
- random forest
- gradient boosting machine
- neural network
I split the data into training, validation and testing subsets, with the idea that the validation set would be used for choosing tuning parameters, and the testing set used as a final comparison of the predictive power of the final models. As things turned out, I didn't have the patience to do much in the way of tuning. This probably counted against the latter three of my four models, because I'm pretty confident better performance would be possible with more careful choice of some of the meta-parameters. Here are the eventual results from my not-particularly-tuned models:
The humble generalized linear model (GLM) performs pretty well, outperformed clearly only by the neural network. The GLM has a big advantage in interpretability too. Here are the most important variables for the GLM in predicting spam (NEG means a higher count of the word makes the email less likely to be spam):
names coefficients sign word
1 C9729 1.1635213 NEG enron
2 C25535 0.6023054 NEG vince
3 C1996 0.5990230 NEG attached
4 C15413 0.4524011 NEG louise
5 C19891 0.3905246 NEG questions
6 C11478 0.2993239 NEG gas
7 proportion_stop_words 0.2935112 NEG <NA>
8 C12866 0.2774074 NEG hourahead
9 C7268 0.2600452 NEG daren
10 C16257 0.2497282 NEG meter
11 C12878 0.2439315 NEG houston
12 C21441 0.2345008 NEG sally
13 C16106 0.2179897 NEG meeting
14 C12894 0.1965571 NEG hpl
15 C16618 0.1909332 NEG monday
16 C11270 0.1873195 NEG friday
17 C8553 0.1704185 NEG doc
18 C7386 0.1673093 NEG deal
19 C21617 0.1636310 NEG schedule
20 C4185 0.1510748 NEG california
21 C12921 0.3695006 POS http
22 C16624 0.2132104 POS money
23 C22597 0.2034031 POS software
24 C15074 0.1957970 POS life
25 C5394 0.1922659 POS click
26 C17683 0.1915608 POS online
27 C25462 0.1703109 POS viagra
28 C16094 0.1605989 POS meds
29 C21547 0.1583438 POS save
30 C9483 0.1498732 POS email
So, who knew, emails containing the words “money”, “software”, “life”, “click”, “online”, “viagra” and “meds” are (or at least were in the time of Enron – things may have changed) more likely to be spam.
Here’s the code for the analysis all together:
#=====================analysis in H2O=============
#prep
# names of the explanatory variables - all the columns in the sparse matrix (which are individual words)
# except the first one which is the dummy "labels" created just for the SVMLight format. And the
# four summary variables in the dense dataframe:
xnames <- c(colnames(enron_sparse_h2o)[-1],
            "number_characters", "number_words", "number_stop_words", "proportion_stop_words")

# A lookup table of the column names that refer to words to the words they actually mean.
# Useful when we look at variable performance down the track.
wordcols <- data_frame(
   variable = colnames(enron_sparse_h2o),
   word = c("", colnames(enron_dtm))
)

# The slower models (ie apart from glm) take ages for cross-validation so
# we'll settle for single-split validation
enron_split <- h2o.splitFrame(enron_fulldata, ratios = c(0.6, 0.2))
dim(enron_split[[1]])
#GLM
# Binomial GLM, with elastic net regularization. This is the minimal baseline sort of model.
# 200 seconds
system.time({
   mod.glm <- h2o.glm(x = xnames, y = "SPAM", training_frame = enron_split[[1]],
                      validation_frame = enron_split[[2]],
                      family = "binomial",
                      alpha = 0.5, lambda_search = TRUE)
})

h2o.varimp(mod.glm) %>%
   slice(1:30) %>%
   left_join(wordcols, by = c("names" = "variable")) %>%
   arrange(sign, desc(coefficients))

h2o.performance(mod.glm, valid = TRUE) # 131 / 6640

#Random Forest
# with ntrees = 50, nfolds = 10, max_depth = 20, took 25 minutes to get
# 5% through so I stopped it and went back to having a single validation frame.
# Can view progress by going to http://127.0.0.1:54321/flow/index.html
# 1340 seconds
system.time({
   mod.rf <- h2o.randomForest(x = xnames, y = "SPAM", training_frame = enron_split[[1]],
                              validation_frame = enron_split[[2]],
                              ntrees = 500, max_depth = 20,
                              stopping_tolerance = 0.0001, stopping_rounds = 3, score_tree_interval = 25)
})
# note from watching progress in flow, scoring every score_tree_interval models takes quite a lot of time.
# so I pushed out score_tree_interval. On current settings it will score
# the latest model against the validation frame every 25 trees, and stop if it hasn't improved since 3*25 trees ago.
# If score_tree_interval is a small number, it stops growing trees too quickly
h2o.performance(mod.rf, valid = TRUE) # 172/6640 error rate

#gradient boosting machine
# 1240 seconds
system.time({
   mod.gbm <- h2o.gbm(x = xnames, y = "SPAM", training_frame = enron_split[[1]],
                      validation_frame = enron_split[[2]],
                      ntrees = 100, max_depth = 5,
                      stopping_tolerance = 0.0001, stopping_rounds = 3, score_tree_interval = 5)
})
h2o.performance(mod.gbm, valid = TRUE) # 240/6640 error rate

#artificial neural network
# 900 seconds; much faster when sparse = TRUE
system.time({
   mod.dl <- h2o.deeplearning(x = xnames, y = "SPAM", training_frame = enron_split[[1]],
                              validation_frame = enron_split[[2]],
                              hidden = c(200, 200),
                              stopping_tolerance = 0.0001, stopping_rounds = 5, sparse = TRUE)
})
h2o.performance(mod.dl, valid = TRUE) # 82/6640

#Naive Bayes - doesn't work
# Conditional probabilities won't fit in the driver node's memory (20.99 GB > 6.06 GB)
mod.nb <- h2o.naiveBayes(x = xnames, y = "SPAM", training_frame = enron_split[[1]],
                         validation_frame = enron_split[[2]])
#==============presentation of results==========================
perf <- function(newdata){
   return(c(
      glm = as.character(h2o.confusionMatrix(mod.glm, newdata = newdata)$Rate[1]),
      rf = as.character(h2o.confusionMatrix(mod.rf, newdata = newdata)$Rate[1]),
      gbm = as.character(h2o.confusionMatrix(mod.gbm, newdata = newdata)$Rate[1]),
      dl = as.character(h2o.confusionMatrix(mod.dl, newdata = newdata)$Rate[1])
   ))
}

# this isn't the most efficient computing-wise, because the performance on the training
# and validation sets could be extracted more directly, but it is quick and easy to code:
perfs <- lapply(enron_split, perf)

perfsdf <- data_frame(value = unlist(perfs),
                      model = rep(c("Generalized Linear Model", "Random Forest", "Gradient Boosting Machine", "Deep Learning"), 3),
                      dataset = rep(c("Training", "Validation", "Testing"), each = 4),
                      abbrev = rep(names(perfs[[1]]), 3)) %>%
   mutate(errors = as.numeric(str_sub(str_extract(value, "=[0-9]+"), start = 2)),
          denom = as.numeric(str_sub(str_extract(value, "/[0-9]+"), start = 2)),
          errorrate = errors / denom,
          model = factor(model, levels = c("Deep Learning", "Generalized Linear Model",
                                           "Random Forest", "Gradient Boosting Machine")),
          dataset = factor(dataset, levels = c("Training", "Validation", "Testing")))

perfsdf %>%
   ggplot(aes(x = errorrate, y = dataset, label = abbrev, colour = model)) +
   geom_path(aes(group = abbrev), alpha = 0.5, linetype = 2) +
   geom_text() +
   scale_colour_brewer(palette = "Set1") +
   scale_x_continuous("Error rate", label = scales::percent) +
   labs(y = "", colour = "") +
   ggtitle("Error rates detecting ham from spam",
           "Enron email dataset, four different statistical learning methods")
Exploratory Multivariate Data Analysis with R – enroll now in the MOOC
(This article was first published on François Husson, and kindly contributed to R-bloggers)
Exploratory multivariate data analysis has been studied and taught in a "French way" for a long time in France. You can enroll in a completely free MOOC on Exploratory Multivariate Data Analysis. The MOOC will start on the 27th of February.
This MOOC focuses on four essential and basic methods, those with the largest potential in terms of applications: principal component analysis (PCA) when variables are quantitative, correspondence analysis (CA) and multiple correspondence analysis (MCA) when variables are categorical, and clustering.
This course is application-oriented; its many examples and numerous exercises, done with FactoMineR (a package for the free R software), will make participants efficient and confident when faced with data analysis.
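As a small taste of what such a course covers, here is a minimal PCA run. I use base R's prcomp below so the sketch is self-contained; the MOOC itself works with FactoMineR::PCA, which additionally reports correlation circles, variable contributions and so on:

```r
# Minimal PCA sketch on simulated data (the MOOC uses FactoMineR::PCA instead).
set.seed(42)
X <- matrix(rnorm(100 * 4), ncol = 4)     # 100 observations, 4 quantitative variables
res <- prcomp(X, scale. = TRUE)           # centre and scale, then rotate
prop_var <- res$sdev^2 / sum(res$sdev^2)  # share of variance per principal component
round(prop_var, 2)
```

The shares of variance sum to one across components; FactoMineR presents the same quantities, along with graphics, via its summary and plot methods.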
The Steep Slide of NFL Draft Salaries
(This article was first published on R – Jesse Piburn, and kindly contributed to R-bloggers)
Some friends and I got into a conversation about rookies in the NFL and how big their salaries were. We eventually started guessing how much more the first overall pick makes compared to Mr. Irrelevant, the last pick of the NFL draft. It's a pretty steep drop from number 1 to number 256 (7 rounds of 32 teams, plus a maximum of 32 compensatory selections), but it turns out most of that slide happens in the first 2 rounds. Below is the chart I came up with; here is a higher-res link.
Here is the R code to reproduce the chart above, including the function to download the data from spotrac, which is a great site by the way.
####################################################################
# Create nfl draft salary chart
# date: 2/16/2017
# author: Jesse Piburn
####################################################################

# gtable and grid used to add the source text at the bottom
library(ggplot2)
library(ggthemes)
library(gtable)
library(grid)
library(dplyr)

getDraftContracts <- function(year) {
   x <- rvest::html(paste0("http://www.spotrac.com/nfl/draft/", year))
   x <- rvest::html_table(x)

   # each round is its own table
   x <- lapply(1:length(x), FUN = function(i) {
      df <- x[[i]]
      df$Round <- i
      df$Cond_Pick <- grepl(" \\(C\\)", df$Pick)
      df$Pick <- as.numeric(gsub(" \\(C\\)", "", df$Pick))
      df$Yrs <- as.numeric(df$Yrs)
      df
   })
   df <- dplyr::bind_rows(x)

   df$`Total Value` <- as.numeric(gsub("\\$|,", "", df$`Total Value`))
   df$`Signing Bonus` <- as.numeric(gsub("\\$|,", "", df$`Signing Bonus`))
   cap_index <- which(names(df) == paste(year, "Cap"))
   df[, cap_index] <- as.numeric(gsub("\\$|,", "", df[, cap_index]))
   names(df)[cap_index] <- "Rookie Cap Hit"
   df$`Yearly Avg` <- df$`Total Value` / df$Yrs
   df$Season <- year
   df
}

yrs <- 2011:2016
df_list <- lapply(yrs, getDraftContracts)
df <- dplyr::bind_rows(df_list)
df$Season <- factor(df$Season)

plot_title <- "Average Annual Salary of Rookie Contract"
plot_subtitle <- "Due to compensatory picks rounds 3 through 7 will have varying numbers of selections per season"

p1 <- ggplot(df, aes(x = Pick, y = `Yearly Avg`, colour = Season, group = Season)) +
   geom_line(size = .8) +
   theme_fivethirtyeight() +
   ylab("Avg Yearly Value of Rookie Contract") +
   xlab("Pick") +
   scale_x_continuous(breaks = seq.default(1, 225, 32),
                      labels = c(paste("Round", 1:3), paste("Pick", seq.default(1, 225, 32)[4:8]))) +
   scale_y_continuous(breaks = seq.default(1000000, 7000000, 1000000),
                      labels = paste0("$", 1:7, " mil"), expand = c(0, 100000)) +
   labs(title = plot_title, subtitle = plot_subtitle) +
   guides(colour = guide_legend(nrow = 1))

# outliers
label_df <- df %>%
   filter(Player %in% c("Kyle Rudolph", "Colin McCarthy", "Julius Thomas",
                        "Theo Riddick", "Braxton Miller"))
label_df$label <- paste0(label_df$Player, " (", label_df$Pos, ")")

p1 <- p1 + geom_text(data = label_df,
                     aes(x = Pick, y = `Yearly Avg`, colour = Season, label = label),
                     fontface = "bold", show.legend = FALSE,
                     nudge_x = c(25, 15, 25, 20, 5),
                     nudge_y = c(60000, 120000, 0, 150000, 120000))

# turn into grob and add the bottom text
p1 <- ggplotGrob(p1)
subtext <- textGrob("Source: spotrac",
                    gp = gpar(fontsize = 12, fontface = "italic",
                              fontfamily = "sans", col = "#696969"),
                    just = "left", x = unit(0.01, "npc"))
p1 <- gtable_add_rows(p1, heights = unit(0, "mm"), pos = -1)
p1 <- gtable_add_grob(p1, grobTree(subtext), t = nrow(p1), l = 1, b = nrow(p1) - 1, r = 7)
grid.draw(p1)

ggsave(filename = "plots/rookie salaries.png", plot = p1, width = 8 * 1.2,
       height = 6 * 1.2, dpi = 600, units = "in")
littler 0.3.2
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
The third release of littler as a CRAN package is now available, following in the now more than ten-year history of a package started by Jeff in the summer of 2006, and joined by me a few weeks later.
littler is the first command-line interface for R and predates Rscript. It is still faster, and in my very biased eyes better, as it allows for piping as well as shebang scripting via #!, uses command-line arguments more consistently and still starts faster. It prefers to live on Linux and Unix, has its difficulties on OS X due to yet-another-brain-deadedness there (who ever thought case-insensitive filesystems were a good idea?) and simply does not exist on Windows (yet – the build system could be extended – see RInside for an existence proof, and volunteers welcome!).
This release brings several new example scripts to run package checks, use the extraordinary R Hub, download RStudio daily builds, and more – see below for details. No internals were changed.
The NEWS file entry is below.

Changes in examples:

- New scripts getRStudioServer.r and getRStudioDesktop.r to download daily packages, currently defaults to Ubuntu amd64
- New script c4c.r calling rhub::check_for_cran()
- New script rd2md.r to convert Rd to markdown.
- New script build.r to create a source tarball.
- The installGitHub.r script now uses package remotes (PR #44, #46)
Courtesy of CRANberries, there is a comparison to the previous release. Full details for the littler release are provided as usual at the ChangeLog page. The code is available via the GitHub repo, from tarballs off my littler page and the local directory here – and now of course all from its CRAN page and via install.packages("littler"). Binary packages are available directly in Debian, and soon via Ubuntu binaries at CRAN thanks to the tireless Michael Rutter.
Comments and suggestions are welcome at the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.