R news and tutorials contributed by (750) R bloggers

Applications of R at EARL San Francisco 2017

Fri, 06/16/2017 - 22:33

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The Mango team held their first instance of the EARL conference series in San Francisco last month, and it was a fantastic showcase of real-world applications of R. This was a smaller version of the EARL conferences in London and Boston, but with that came the opportunity to interact with R users from industry in a more intimate setting. Hopefully Mango will return to the venue again next year, and if so I'll definitely be back!

As always with EARL events, the program featured many interesting presentations of how R is used to implement a data-driven (or data-informed) policy at companies around the world. With a dual-track program I couldn't attend all of the talks, but here are some of the applications that caught my interest:

  • Ricardo Bion (AirBnB): A keynote with an update on data science practice at AirBnB: training for everybody, the Knowledge Repository, trends in Python and R usage, and even a version of the AirBnB app implemented in Shiny!
  • Hilary Parker (Stitchfix): A wonderful keynote on bringing principles of engineering (and in particular blameless post-mortems) to data science (slides)
  • Prakhar Mehrotra (Uber): Using R to forecast demand and usage (sadly no slides, but here's R charting Uber traffic data)
  • David Croushore (Pandora): Using R and Shiny to monitor and forecast demand by users for different music services (slides).
  • David Bishop (Hitachi Solutions): Using R in the clean energy industry, with a nice demo of a Power BI dashboard to predict wind turbine anomalies, and even visualize the turbines with HoloLens (slides)
  • Tyler Cole (Amgen): Using R packages and Microsoft R Server for clinical trial submissions to the FDA (slides)
  • Luke Fostvedt (Pfizer): How R and R packages are used at various stages of the drug development process at Pfizer (slides)
  • Gabriel Becker (Genentech): Processes and R packages used at Genentech to manage the various tradeoffs for reproducibility in a multi-collaborator environment (slides)
  • Shad Thomas (Glass Box Research): Using R and Shiny for segmentation analysis in market research (slides)
  • Aaron Hamming (ProAssurance): Using R to combat the opioid epidemic by identifying suspect prescribers (slides)
  • Eduardo Ariño de la Rubia (Domino Data Lab): Using R from APIs including Rook, Rapache, Plumbr, OpenCPU and Domino (slides forthcoming)
  • Madhura Raju (Microsoft): Using R to detect distributed denial-of-service attacks (slides)
  • Slides from my own presentation, Predicting patient length-of-stay in hospitals, are available too.

There are many more interesting presentations to browse at the EARL San Francisco website. Just click on the presenter name to see the talk description and a link to slides (where available).

EARL Conference: San Francisco 2017


To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

qualtRics 2.0 now available from CRAN

Fri, 06/16/2017 - 21:37

(This article was first published on Jasper Ginn's blog, and kindly contributed to R-bloggers)

qualtRics version 2.0 is now available from CRAN. This version contains several additional features and simplifies the process of importing survey exports from Qualtrics. Below, I outline the most important changes. You can check the full changelog here.

qualtRics configuration file

In previous versions, users needed to register their API key by calling the registerApiKey() function to avoid having to pass it to each function. However, they still had to pass the root url to each function, which was annoying and inconsistent.

In this version, the registerApiKey() function has been replaced by registerOptions(), which stores both the API key and the root url in environment variables. It also stores several other options (e.g. whether to print verbose logs to the R console). You can find more information about this here.

This version also supports the use of a configuration file called ‘.qualtRics.yml’ that the user can store in the working directory of an R project. If a config file is present in the working directory, then it will be loaded when the user loads the qualtRics library, eliminating the need to register credentials at all. You can find more information about this here.
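
As a quick illustration, here is a hedged sketch of both approaches; the argument and field names (api_token, root_url) follow the package documentation of the time and may differ between versions:

library(qualtRics)

# Register credentials once per session (the token and url below are placeholders)
registerOptions(api_token = "<YOUR_API_TOKEN>",
                root_url = "https://yourdatacenter.qualtrics.com")

# Or, equivalently, put a '.qualtRics.yml' file in the project working directory,
# which is picked up automatically when the package is loaded:
#   api_token: <YOUR_API_TOKEN>
#   root_url: https://yourdatacenter.qualtrics.com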

getSurvey() supports all available parameters

The getSurvey() function now supports all parameters that the Qualtrics API provides. Concretely, I’ve added the following parameters since version 1.0:

  1. seenUnansweredRecode: Recode seen but unanswered questions with a string value.
  2. limit: Maximum number of responses exported. Defaults to NULL (all responses).
  3. useLocalTime: Use local timezone to determine response date values.
  4. includedQuestionIds: Export only specified questions.

You can use a new function called getSurveyQuestions() to retrieve a data frame of question IDs and labels, which you can in turn pass to the getSurvey() function to export only a subset of your questions.
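
For example, a hedged sketch of exporting only a subset of questions (the survey ID and question IDs below are placeholders, and argument names follow the package documentation of the time):

# Assumes credentials were registered with registerOptions() or a config file
id <- "SV_xxxxxxxxxxxxxxx"                       # placeholder survey ID
questions <- getSurveyQuestions(surveyID = id)   # data frame of question IDs and labels
mydata <- getSurvey(surveyID = id,
                    seenUnansweredRecode = "NA",
                    includedQuestionIds = c("QID1", "QID2"))  # placeholder question IDs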

getSurveys() retrieves > 100 surveys

In previous versions of qualtRics, getSurveys() only retrieved 100 results. It now fetches all surveys. If you manage many surveys, this does mean that it could take some time for the function to complete.
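
A minimal sketch (the id column name is taken from the package's documented output and is an assumption here):

surveys <- getSurveys()   # now fetches all surveys, not just the first 100
nrow(surveys)             # how many surveys you can access
head(surveys$id)          # survey IDs that can be passed to getSurvey()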

If you notice any issues with the package, or if you would like to suggest a feature, please leave a comment here.


To leave a comment for the author, please follow the link and comment on their blog: Jasper Ginn's blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Stats NZ encouraging active sharing for microdata access projects

Fri, 06/16/2017 - 14:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

The Integrated Data Infrastructure (IDI) is an amazing research tool

One of New Zealand’s most important data assets is the integrated data infrastructure:

“The Integrated Data Infrastructure (IDI) is a large research database containing microdata about people and households. Data is from a range of government agencies, Statistics NZ surveys including the 2013 Census, and non-government organisations. The IDI holds over 166 billion facts, taking up 1.22 terabytes of space – and is continually growing. Researchers use the IDI to answer complex questions to improve outcomes for New Zealanders.”

Quote from Stats NZ

The data already linked to the IDI “spine” is impressive:

Security and access is closely controlled

The IDI can be accessed only from a small number of very secure datalabs with frosted windows and heavy duty secure doors. We have one of these facilities at my workplace, the Social Investment Unit (to become the Social Investment Agency from 1 July 2017). Note/reminder: as always, I am writing this blog in my own capacity, not representing the views of my employer.

User credentials and permissions are closely guarded by Stats NZ and need to be linked to clearly defined public-good (ie non-commercial) research projects. All output is checked by Stats NZ against strict rules for confidentialisation (eg random rounding; cell suppression in specified situations; no individual values even in scatter charts; etc) before it can be taken out of the datalab and shown to anyone. All this is quite right – even though the IDI does not contain names and addresses for individuals (they are stripped out after matching, before analysts get to use the data) and there are strict rules against trying to reverse the confidentialisation, it does contain very sensitive data and must be closely protected to ensure continued social permission for this amazing research tool.

More can be done to build capability, including by sharing code

However, while the security restrictions are necessary and acceptable, other limitations on its use can be overcome – and Stats NZ and others are working to do so. The biggest such limitation is capability – skills and knowledge to use the data well.

The IDI is a largish and complex SQL Server datamart, with more than 40 origin systems from many different agencies (mostly but not all of them government) outside of Stats NZ’s direct quality control. Using the data requires not just a good research question and methodology, but good coding skills in SQL plus at least one of R, SAS or Stata. Familiarity with the many data dictionaries and metadata of varied quality is also crucial, and in fact is probably the main bottleneck in expanding IDI usage.

A typical workflow includes a data grooming stage of hundreds of lines of SQL code joining many database tables together in meticulously controlled ways before analysis even starts. That’s why one of the objectives at my work has been to create assets like the “Social Investment Analytical Layer”, a repository of code that creates analysis-ready tables and views with a standardised structure (each row a combination of an individual and an “event” like attending school, paying tax, or receiving a benefit) to help lower the barrier for new analysts.

We publish that and other code with a GPL-3 open source license on GitHub and have plans for more. Although the target audience of analysts with access to the IDI is small, documenting and tidying code to publication standard and sharing it is important to help that group of people grow in size. Code sharing until recently has been ad hoc and very much dependent on “who you know”, although there is a Wiki in the IDI environment where key snippets of data grooming code are shared. We see the act of publishing on GitHub as taking this to the next step – amongst other things it really forces us to document and debug our code very thoroughly!

Stats NZ is showing leadership on open data / open science

Given the IDI is an expensive taxpayer-created asset, my personal view is that all code and output should be public. So I was very excited a few weeks back when Liz MacPherson, the Government Statistician (a statutory position responsible for New Zealand’s official statistics system, and also the Chief Executive of Stats NZ) sent out an email to researchers “encouraging active sharing for microdata access projects” including the IDI.

I obtained Liz’s permission to reproduce the full text of her email of 29 May 2017:

Subject: Stats NZ: Encouraging active sharing for microdata access projects

Stats NZ is all about unleashing the power of data to change lives. New Zealand has set a clear direction to increase the availability of data to improve decision making, service design and data driven innovation.
To support this direction, Stats NZ is moving to an open data / open science environment, where research findings, code and tables are shared by default.

We are taking steps to encourage active sharing for microdata access projects. All projects that are approved for access to microdata will be asked to share their findings, code and tables (once confidentiality checked).

The benefits of sharing code and project findings to facilitate reproducible research have been identified internationally, particularly among research communities looking to grow access to shared tools and resources. To assist you with making code open there is guidance available on how to license and use repositories like GitHub in the New Zealand Government Open Access and Licensing (NZGOAL) Framework – Software Extension.

Stats NZ is proud to now be leading the Open Data programme of work, and as such wants to ensure that all tables in published research reports are made available as separate open data alongside the reports for others to work with and build on.
There are many benefits to sharing: helping to build communities of interest where researchers can work collaboratively, avoid duplication, and build capability. Sharing also provides transparency for the use that is made of data collected through surveys and administrative sources, and accountability for tax-payer dollars that are spent on building datasets and databases like the IDI.

One of the key things Stats NZ checks for when assessing microdata access applications is that there is an intention to make the research findings available publicly. This is why the application form asks about intended project outputs and plans for dissemination.

While I have been really pleased to see the use that is made of the IDI wiki and Meetadata, the sharing of findings, code and tables is still disproportionately small in relation to the number of projects each year. This is why I am taking steps to encourage sharing. Over time, this will become an expectation.

More information will be communicated by the Integrated Data team shortly by email, at the IDI Forums and on Meetadata.

In the meantime, if you have any suggestions about how to make this process work for all, or if you have any questions, please contact ….., Senior Manager, Integrated Data ……

Warm regards

Liz

Liz MacPherson

Government Statistician/Chief Executive

Stats NZ Tatauranga Aotearoa

This is great news, and a big step in the right direction. It really is impressive for the Government Statistician to be promoting reproducibility and open access to code, outputs and research products. I look forward to seeing more microdata analytical code sharing, and continued mutual support for raising standards and sharing experience.

Now to get version control software into the datalab environment please! (and yes, there is a Stats NZ project in flight to fix this critical gap too). My views on the critical importance of version control are set out in this earlier post.


To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

A “poor man’s video analyzer”…

Fri, 06/16/2017 - 11:30

(This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers)

Introduction

Not so long ago there was a nice Dataiku meetup with Pierre Gutierrez talking about transfer learning. RStudio recently released the keras package, an R interface to work with keras for deep learning and transfer learning. Both events inspired me to do some experiments at my work here at RTL and explore how usable it is for us. I would like to share the slides of the presentation that I gave internally at RTL; you can find them on SlideShare.

As a side effect, another experiment that I would like to share is the “poor man’s video analyzer”. There are several vendors now that offer APIs to analyze videos; see for example the one that Microsoft offers. With just a few lines of R code I came up with a Shiny app that is a very cheap imitation.

Set up of the R Shiny app

To run the Shiny app, a few things are needed. Make sure that ffmpeg is installed; it is used to extract images from a video. TensorFlow and keras need to be installed as well. The extracted images from the video are passed through a pre-trained VGG16 network so that each image is tagged.
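
This is not the author's app code, but a minimal sketch of that pipeline with the keras R package; the video filename, output folder and frame rate are assumptions:

library(keras)

# Extract one frame per second with ffmpeg (assumes ffmpeg is on the PATH)
dir.create("frames", showWarnings = FALSE)
system("ffmpeg -i myvideo.mp4 -vf fps=1 frames/img%04d.jpg")

# Pre-trained VGG16 with the ImageNet classification head
model <- application_vgg16(weights = "imagenet")

tag_image <- function(path) {
  img <- image_load(path, target_size = c(224, 224))
  x <- image_to_array(img)
  x <- array_reshape(x, c(1, dim(x)))
  x <- imagenet_preprocess_input(x)
  preds <- model %>% predict(x)
  imagenet_decode_predictions(preds, top = 3)[[1]]   # top 3 ImageNet tags for the frame
}

tags <- lapply(list.files("frames", full.names = TRUE), tag_image)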

After this tagging, a data table will appear with the images and their tags. That’s it! I am sure there are better visualizations than a data table to show a lot of images. If you have a better idea, just adjust my Shiny app on GitHub….

Using the app, some screen shots

There is a simple interface: specify the number of frames per second that you want to analyse, and then upload a video. Many formats are supported (by ffmpeg), like *.mp4, *.mpeg and *.mov.

Click on video images to start the analysis process. This can take a few minutes; when it is finished you will see a data table with the extracted images and their tags from VGG16.

Click on ‘info on extracted classes’ to see an overview of the classes. You will see a bar chart of the tags that were found, and the output of ffmpeg, which shows some info on the video.

If you have code to improve the data table output into a fancier visualization, just go to my GitHub. For those who want to play around, take a look at a live video analyzer Shiny app here.

A Shiny app version using miniUI will be a better fit for small mobile screens.

Cheers, Longhow


To leave a comment for the author, please follow the link and comment on their blog: R – Longhow Lam's Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Using Partial Least Squares to conduct relative importance analysis in Displayr

Fri, 06/16/2017 - 03:40

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

Partial Least Squares (PLS) is a popular method for relative importance analysis in fields where the data typically includes more predictors than observations. Relative importance analysis is a general term applied to any technique used for estimating the importance of predictor variables in a regression model. The output is a set of scores which enable the predictor variables to be ranked based upon how strongly each influences the outcome variable.

There are a number of different approaches to calculating relative importance analysis including Relative Weights and Shapley Regression as described here and here. In this blog post I briefly describe an alternative method – Partial Least Squares. Because it effectively compresses the data before regression, PLS is particularly useful when the number of predictor variables is more than the number of observations.

Partial Least Squares

PLS is a dimension reduction technique with some similarity to principal component analysis. The predictor variables are mapped to a smaller set of variables and within that smaller space we perform a regression against the outcome variable.  In contrast to principal component analysis where the dimension reduction ignores the outcome variable, the PLS procedure aims to choose new mapped variables that maximally explain the outcome variable.

Loading the example data

First I’ll add some data with Insert > Data Set > URL and paste in this link:

http://wiki.q-researchsoftware.com/images/6/69/Stacked_Cola_Brand_Associations.sav

Dragging Brand preference onto the page from the Data tree on the left produces a table showing the breakdown of the respondents by category. This includes a Don’t Know category that doesn’t fit in the ordered scale from Love to Hate. To remove Don’t Know, I click on Brand preference in the Data tree on the left and then click on Value Attributes. Changing Missing Values for the Don’t Know category to Exclude from analyses produces the table below.

Creating the PLS model

Partial least squares is easy to run with a few lines of code. Select Insert > R Output and enter the following snippet of code into the R CODE box:

dat = data.frame(pref, Q5r0, Q5r1, Q5r2, Q5r3, Q5r4, Q5r5, Q5r6, Q5r7, Q5r8, Q5r9,
                 Q5r10, Q5r11, Q5r12, Q5r13, Q5r14, Q5r15, Q5r16, Q5r17, Q5r18, Q5r19,
                 Q5r20, Q5r21, Q5r22, Q5r23, Q5r24, Q5r25, Q5r26, Q5r27, Q5r29, Q5r28,
                 Q5r30, Q5r31, Q5r32, Q5r33)
library(pls)
library(flipFormat)
library(flipTransformations)
dat = AsNumeric(ProcessQVariables(dat), binary = FALSE, remove.first = FALSE)
pls.model = plsr(pref ~ ., data = dat, validation = "CV")

The first line selects pref as the outcome variable (strength of preference for a brand) and then adds 34 predictor variables, each indicating whether the respondent perceives the brand to have a particular characteristic. These variables can be dragged across from the Data tree on the left.

Next, the three libraries containing useful functions are loaded. The package pls contains the function to estimate the PLS model, and our own publicly available packages, flipFormat and flipTransformations, are included for functions that help us transform and tidy the data. Since the R pls package requires numeric inputs, I convert the variables from categorical to numeric.

In the final line above the plsr function does the work and creates pls.model.

Automatically Selecting the Dimensions

The following few lines recreate the model having found the optimal number of dimensions,

# Find the number of dimensions with lowest cross validation error
cv = RMSEP(pls.model)
best.dims = which.min(cv$val[estimate = "adjCV", , ]) - 1

# Rerun the model
pls.model = plsr(pref ~ ., data = dat, ncomp = best.dims)

Producing the Output

Finally, we extract the useful information and format the output,

coefficients = coef(pls.model)
sum.coef = sum(sapply(coefficients, abs))
coefficients = coefficients * 100 / sum.coef
names(coefficients) = TidyLabels(Labels(dat)[-1])
coefficients = sort(coefficients, decreasing = TRUE)

The regression coefficients are normalized so their absolute sum is 100. The labels are added and the result is sorted.

The results below show that Reliable and Fun are positive predictors of preference, Unconventional and Sleepy are negative predictors and Tough has little relevance.

TRY IT OUT
You can perform this analysis for yourself in Displayr.


To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Some R User Group News

Fri, 06/16/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

This week, members of the Bay Area useR Group (BARUG) celebrated the group’s one hundred and first meetup with beer, pizza and three outstanding presentations at the cancer diagnostics company GRAIL. Pete Mohanty began the evening with the talk Did “Communities in Crisis” Elect Trump?: An Analysis with Kernel Regularized Least Squares. Not only is the Political Science compelling, but the underlying Data Science is top shelf. The bigKRLS package that Pete and his collaborator Robert Shaffer wrote to support their research uses parallel processing and external memory techniques to make the computationally intensive Regularized Least Squares algorithm suitable for fairly large data sets. The following graph from Pete’s presentation gives some idea of the algorithm’s running time.

In the second talk, long-time BARUG contributor Earl Hubbell described the production workflow that supports GRAIL’s scientific work: “Rmarkdown, tidy data principles, and the RStudio ecosystem serve as one foundation for [GRAIL’s] reproducible research.” In the evening’s third talk, Ashley Semanskee described how the Kaiser Family Foundation is using R to automate the annual production of its Employer Health Benefits Survey.

Also recently:


Note that members of the Portland group also helped to organize the first-ever CascadiaRConf conference.

Finally, I would like to mention that the recently reformed Austin R User Group is thinking big. They are planning Data Day Texas 2018! and have already assembled an impressive list of speakers.



To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

An easy way to accidentally inflate reported R-squared in linear regression models

Thu, 06/15/2017 - 23:43

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Here is an absolutely horrible way to confuse yourself and get an inflated reported R-squared on a simple linear regression model in R.

We have written about this before, but we found a new twist on the problem (interactions with categorical variable encoding) which we would like to call out here.

First let’s set up our problem with a data set where the quantity to be predicted (y) has no real relation to the independent variable (x). We will first build our example data:

library("sigr") library("broom") set.seed(23255) d <- data.frame(y= runif(100), x= ifelse(runif(100)>=0.5, 'a', 'b'), stringsAsFactors = FALSE)

Now let’s build a model and look at the summary statistics returned as part of the model fitting process:

m1 <- lm(y~x, data=d)
t(broom::glance(m1))
##                        [,1]
## r.squared       0.002177326
## adj.r.squared  -0.008004538
## sigma           0.302851476
## statistic       0.213843593
## p.value         0.644796456
## df              2.000000000
## logLik        -21.432440763
## AIC            48.864881526
## BIC            56.680392084
## deviance        8.988463618
## df.residual    98.000000000
d$pred1 <- predict(m1, newdata = d)

I strongly prefer to directly calculate the model performance statistics off the predictions (it lets us easily compare different modeling methods), so let’s do that as well:

cat(render(sigr::wrapFTest(d, 'pred1', 'y'), format='markdown'))

F Test summary: (R2=0.0022, F(1,98)=0.21, p=n.s.).

So far so good. Let’s now remove the "intercept term" by adding the "0+" notation to the fitting command.

m2 <- lm(y~0+x, data=d)
t(broom::glance(m2))
##                       [,1]
## r.squared     7.524811e-01
## adj.r.squared 7.474297e-01
## sigma         3.028515e-01
## statistic     1.489647e+02
## p.value       1.935559e-30
## df            2.000000e+00
## logLik       -2.143244e+01
## AIC           4.886488e+01
## BIC           5.668039e+01
## deviance      8.988464e+00
## df.residual   9.800000e+01
d$pred2 <- predict(m2, newdata = d)

Uh oh. That appeared to vastly improve the reported R-squared and the significance ("p.value")!

That does not make sense: anything m2 can do, m1 can also do. In fact the two models make essentially identical predictions, which we confirm below:

sum((d$pred1 - d$y)^2)
## [1] 8.988464
sum((d$pred2 - d$y)^2)
## [1] 8.988464
max(abs(d$pred1 - d$pred2))
## [1] 4.440892e-16
head(d)
##             y x     pred1     pred2
## 1 0.007509118 b 0.5098853 0.5098853
## 2 0.980353615 a 0.5380361 0.5380361
## 3 0.055880927 b 0.5098853 0.5098853
## 4 0.993814410 a 0.5380361 0.5380361
## 5 0.636617385 b 0.5098853 0.5098853
## 6 0.154032277 a 0.5380361 0.5380361

Let’s double check the fit quality of the predictions.

cat(render(sigr::wrapFTest(d, 'pred2', 'y'), format='markdown'))

F Test summary: (R2=0.0022, F(1,98)=0.21, p=n.s.).

Ah. The prediction fit quality is identical to the first time (as one would expect). This is yet another reason to directly calculate model fit quality from the predictions: it isolates you from any foibles of how the modeling software does it.

The answer to our puzzles of "what went wrong" is something we have written about before here.

Roughly what is going on is:

If the fit formula sent to lm() has no intercept (triggered by the "0+" notation), then summary.lm() changes how it computes r.squared as follows (from help(summary.lm)):

r.squared R^2, the ‘fraction of variance explained by the model’, R^2 = 1 - Sum(R[i]^2) / Sum((y[i]- y*)^2), where y* is the mean of y[i] if there is an intercept and zero otherwise.

This is pretty bad.

Then to add insult to injury the "0+" notation also changes how R encodes the categorical variable x.

Compare:

summary(m1)
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.53739 -0.23265 -0.02039  0.27247  0.47111
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.53804    0.04515  11.918   <2e-16 ***
## xb          -0.02815    0.06088  -0.462    0.645
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3029 on 98 degrees of freedom
## Multiple R-squared:  0.002177, Adjusted R-squared:  -0.008005
## F-statistic: 0.2138 on 1 and 98 DF,  p-value: 0.6448
summary(m2)
##
## Call:
## lm(formula = y ~ 0 + x, data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.53739 -0.23265 -0.02039  0.27247  0.47111
##
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)
## xa  0.53804    0.04515   11.92   <2e-16 ***
## xb  0.50989    0.04084   12.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3029 on 98 degrees of freedom
## Multiple R-squared:  0.7525, Adjusted R-squared:  0.7474
## F-statistic:   149 on 2 and 98 DF,  p-value: < 2.2e-16

Notice that the second model directly encodes both levels of x. This means that if m1 has pred1 = a + b * (x=='b'), we can reproduce this model (intercept and all) as m2: pred2 = a * (x=='a') + (a+b) * (x=='b'). I.e., the invariant (x=='a') + (x=='b') == 1 means m2 can imitate the model with the intercept term.
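
This equivalence is easy to verify numerically from the two fits (a small check added here, not part of the original post):

# m2's 'xa' coefficient plays the role of m1's intercept, and the difference
# between m2's two level coefficients recovers m1's 'xb' slope
all.equal(unname(coef(m2)["xa"]), unname(coef(m1)["(Intercept)"]))
all.equal(unname(coef(m2)["xb"] - coef(m2)["xa"]), unname(coef(m1)["xb"]))
# both comparisons should return TRUE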

The presumed (and I think weak) justification for summary.lm() switching the model quality assessment method is something along the lines that mean(y) may not be in the model’s concept space, and this might lead to reporting a negative R-squared. I don’t have any problem with negative R-squared; it can be taken to mean you did worse than the unconditional average. However, even if you accept the (no-warning) scoring method switch, that argument doesn’t apply here: m2 can imitate having an intercept, so it isn’t unfair to check if it is better than using only the intercept.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Finer Monotonic Binning Based on Isotonic Regression

Thu, 06/15/2017 - 23:24

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

In my earlier post (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package/), I wrote a monobin() function based on the smbinning package by Herman Jopia to improve the monotonic binning algorithm. The function works well and provides robust binning outcomes. However, there are a couple of potential drawbacks due to the coarse binning. First of all, the derived Information Value for each binned variable tends to be low. Secondly, the binned variable might not be granular enough to reflect the nature of the data.

In light of the aforementioned, I drafted an improved function isobin() based on the isotonic regression (https://en.wikipedia.org/wiki/Isotonic_regression), as shown below.

isobin <- function(data, y, x) {
  d1 <- data[c(y, x)]
  d2 <- d1[!is.na(d1[x]), ]
  c <- cor(d2[, 2], d2[, 1], method = "spearman", use = "complete.obs")
  reg <- isoreg(d2[, 2], c / abs(c) * d2[, 1])
  k <- knots(as.stepfun(reg))
  sm1 <- smbinning.custom(d1, y, x, k)
  c1 <- subset(sm1$ivtable, subset = CntGood * CntBad > 0, select = Cutpoint)
  c2 <- suppressWarnings(as.numeric(unlist(strsplit(c1$Cutpoint, " "))))
  c3 <- c2[!is.na(c2)]
  return(smbinning.custom(d1, y, x, c3[-length(c3)]))
}
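
A hedged usage sketch (the data frame and column names below are placeholders; the smbinning package must be loaded since isobin() calls smbinning.custom()):

library(smbinning)

# df: a data frame with a binary 0/1 outcome column and a numeric predictor
out <- isobin(df, "bad", "ltv")        # placeholder column names
out$ivtable                            # a binning table like the ones shown below
smbinning.plot(out, option = "WoE")    # optional: visualize WoE by bin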

Compared with the legacy monobin(), the isobin() function is able to significantly increase the binning granularity as well as moderately improve the Information Value.

LTV Binning with isobin() Function

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV
1 <= 46 81 78 3 81 78 3 0.0139 0.9630 0.0370 26.0000 3.2581 1.9021 0.0272
2 <= 71 312 284 28 393 362 31 0.0535 0.9103 0.0897 10.1429 2.3168 0.9608 0.0363
3 <= 72 22 20 2 415 382 33 0.0038 0.9091 0.0909 10.0000 2.3026 0.9466 0.0025
4 <= 73 27 24 3 442 406 36 0.0046 0.8889 0.1111 8.0000 2.0794 0.7235 0.0019
5 <= 81 303 268 35 745 674 71 0.0519 0.8845 0.1155 7.6571 2.0356 0.6797 0.0194
6 <= 83 139 122 17 884 796 88 0.0238 0.8777 0.1223 7.1765 1.9708 0.6149 0.0074
7 <= 90 631 546 85 1515 1342 173 0.1081 0.8653 0.1347 6.4235 1.8600 0.5040 0.0235
8 <= 94 529 440 89 2044 1782 262 0.0906 0.8318 0.1682 4.9438 1.5981 0.2422 0.0049
9 <= 95 145 119 26 2189 1901 288 0.0248 0.8207 0.1793 4.5769 1.5210 0.1651 0.0006
10 <= 100 907 709 198 3096 2610 486 0.1554 0.7817 0.2183 3.5808 1.2756 -0.0804 0.0010
11 <= 101 195 151 44 3291 2761 530 0.0334 0.7744 0.2256 3.4318 1.2331 -0.1229 0.0005
12 <= 110 1217 934 283 4508 3695 813 0.2085 0.7675 0.2325 3.3004 1.1940 -0.1619 0.0057
13 <= 112 208 158 50 4716 3853 863 0.0356 0.7596 0.2404 3.1600 1.1506 -0.2054 0.0016
14 <= 115 253 183 70 4969 4036 933 0.0433 0.7233 0.2767 2.6143 0.9610 -0.3950 0.0075
15 <= 136 774 548 226 5743 4584 1159 0.1326 0.7080 0.2920 2.4248 0.8857 -0.4702 0.0333
16 <= 138 27 18 9 5770 4602 1168 0.0046 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0024
17 > 138 66 39 27 5836 4641 1195 0.0113 0.5909 0.4091 1.4444 0.3677 -0.9882 0.0140
18 Missing 1 0 1 5837 4641 1196 0.0002 0.0000 1.0000 0.0000 -Inf -Inf Inf
19 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.1897

LTV Binning with monobin() Function

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV
1 <= 85 1025 916 109 1025 916 109 0.1756 0.8937 0.1063 8.4037 2.1287 0.7727 0.0821
2 <= 94 1019 866 153 2044 1782 262 0.1746 0.8499 0.1501 5.6601 1.7334 0.3775 0.0221
3 <= 100 1052 828 224 3096 2610 486 0.1802 0.7871 0.2129 3.6964 1.3074 -0.0486 0.0004
4 <= 105 808 618 190 3904 3228 676 0.1384 0.7649 0.2351 3.2526 1.1795 -0.1765 0.0045
5 <= 114 985 748 237 4889 3976 913 0.1688 0.7594 0.2406 3.1561 1.1493 -0.2066 0.0076
6 > 114 947 665 282 5836 4641 1195 0.1622 0.7022 0.2978 2.3582 0.8579 -0.4981 0.0461
7 Missing 1 0 1 5837 4641 1196 0.0002 0.0000 1.0000 0.0000 -Inf -Inf Inf
8 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.1628

Bureau_Score Binning with isobin() Function

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV
1 <= 491 4 1 3 4 1 3 0.0007 0.2500 0.7500 0.3333 -1.0986 -2.4546 0.0056
2 <= 532 24 9 15 28 10 18 0.0041 0.3750 0.6250 0.6000 -0.5108 -1.8668 0.0198
3 <= 559 51 24 27 79 34 45 0.0087 0.4706 0.5294 0.8889 -0.1178 -1.4737 0.0256
4 <= 560 2 1 1 81 35 46 0.0003 0.5000 0.5000 1.0000 0.0000 -1.3559 0.0008
5 <= 572 34 17 17 115 52 63 0.0058 0.5000 0.5000 1.0000 0.0000 -1.3559 0.0143
6 <= 602 153 84 69 268 136 132 0.0262 0.5490 0.4510 1.2174 0.1967 -1.1592 0.0459
7 <= 605 56 31 25 324 167 157 0.0096 0.5536 0.4464 1.2400 0.2151 -1.1408 0.0162
8 <= 606 14 8 6 338 175 163 0.0024 0.5714 0.4286 1.3333 0.2877 -1.0683 0.0035
9 <= 607 17 10 7 355 185 170 0.0029 0.5882 0.4118 1.4286 0.3567 -0.9993 0.0037
10 <= 632 437 261 176 792 446 346 0.0749 0.5973 0.4027 1.4830 0.3940 -0.9619 0.0875
11 <= 639 150 95 55 942 541 401 0.0257 0.6333 0.3667 1.7273 0.5465 -0.8094 0.0207
12 <= 653 451 300 151 1393 841 552 0.0773 0.6652 0.3348 1.9868 0.6865 -0.6694 0.0412
13 <= 662 295 213 82 1688 1054 634 0.0505 0.7220 0.2780 2.5976 0.9546 -0.4014 0.0091
14 <= 665 100 77 23 1788 1131 657 0.0171 0.7700 0.2300 3.3478 1.2083 -0.1476 0.0004
15 <= 667 57 44 13 1845 1175 670 0.0098 0.7719 0.2281 3.3846 1.2192 -0.1367 0.0002
16 <= 677 381 300 81 2226 1475 751 0.0653 0.7874 0.2126 3.7037 1.3093 -0.0466 0.0001
17 <= 679 66 53 13 2292 1528 764 0.0113 0.8030 0.1970 4.0769 1.4053 0.0494 0.0000
18 <= 683 160 129 31 2452 1657 795 0.0274 0.8062 0.1938 4.1613 1.4258 0.0699 0.0001
19 <= 689 203 164 39 2655 1821 834 0.0348 0.8079 0.1921 4.2051 1.4363 0.0804 0.0002
20 <= 699 304 249 55 2959 2070 889 0.0521 0.8191 0.1809 4.5273 1.5101 0.1542 0.0012
21 <= 707 312 268 44 3271 2338 933 0.0535 0.8590 0.1410 6.0909 1.8068 0.4509 0.0094
22 <= 717 368 318 50 3639 2656 983 0.0630 0.8641 0.1359 6.3600 1.8500 0.4941 0.0132
23 <= 721 134 119 15 3773 2775 998 0.0230 0.8881 0.1119 7.9333 2.0711 0.7151 0.0094
24 <= 723 49 44 5 3822 2819 1003 0.0084 0.8980 0.1020 8.8000 2.1748 0.8188 0.0043
25 <= 739 425 394 31 4247 3213 1034 0.0728 0.9271 0.0729 12.7097 2.5424 1.1864 0.0700
26 <= 746 166 154 12 4413 3367 1046 0.0284 0.9277 0.0723 12.8333 2.5520 1.1961 0.0277
27 <= 756 234 218 16 4647 3585 1062 0.0401 0.9316 0.0684 13.6250 2.6119 1.2560 0.0422
28 <= 761 110 104 6 4757 3689 1068 0.0188 0.9455 0.0545 17.3333 2.8526 1.4967 0.0260
29 <= 763 46 44 2 4803 3733 1070 0.0079 0.9565 0.0435 22.0000 3.0910 1.7351 0.0135
30 <= 767 96 92 4 4899 3825 1074 0.0164 0.9583 0.0417 23.0000 3.1355 1.7795 0.0293
31 <= 772 77 74 3 4976 3899 1077 0.0132 0.9610 0.0390 24.6667 3.2055 1.8495 0.0249
32 <= 787 269 260 9 5245 4159 1086 0.0461 0.9665 0.0335 28.8889 3.3635 2.0075 0.0974
33 <= 794 95 93 2 5340 4252 1088 0.0163 0.9789 0.0211 46.5000 3.8395 2.4835 0.0456
34 > 794 182 179 3 5522 4431 1091 0.0312 0.9835 0.0165 59.6667 4.0888 2.7328 0.0985
35 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282
36 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.8357

Bureau_Score Binning with monobin() Function

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV
1 <= 617 513 284 229 513 284 229 0.0879 0.5536 0.4464 1.2402 0.2153 -1.1407 0.1486
2 <= 642 515 317 198 1028 601 427 0.0882 0.6155 0.3845 1.6010 0.4706 -0.8853 0.0861
3 <= 657 512 349 163 1540 950 590 0.0877 0.6816 0.3184 2.1411 0.7613 -0.5946 0.0363
4 <= 672 487 371 116 2027 1321 706 0.0834 0.7618 0.2382 3.1983 1.1626 -0.1933 0.0033
5 <= 685 494 396 98 2521 1717 804 0.0846 0.8016 0.1984 4.0408 1.3964 0.0405 0.0001
6 <= 701 521 428 93 3042 2145 897 0.0893 0.8215 0.1785 4.6022 1.5265 0.1706 0.0025
7 <= 714 487 418 69 3529 2563 966 0.0834 0.8583 0.1417 6.0580 1.8014 0.4454 0.0144
8 <= 730 489 441 48 4018 3004 1014 0.0838 0.9018 0.0982 9.1875 2.2178 0.8619 0.0473
9 <= 751 513 476 37 4531 3480 1051 0.0879 0.9279 0.0721 12.8649 2.5545 1.1986 0.0859
10 <= 775 492 465 27 5023 3945 1078 0.0843 0.9451 0.0549 17.2222 2.8462 1.4903 0.1157
11 > 775 499 486 13 5522 4431 1091 0.0855 0.9739 0.0261 37.3846 3.6213 2.2653 0.2126
12 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282
13 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.7810


To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Set Operations Unions and Intersections in R

Thu, 06/15/2017 - 23:00

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

Part 2 of 2 in the series Set Theory

The set operations of unions and intersections should ring a bell for those who’ve worked with relational databases and Venn diagrams. The ‘union’ of two sets A and B is the set that comprises all members of A or B (or both).

One of the most natural ways to visualize set unions and intersections is using Venn diagrams.

The Venn diagram on the left visualizes a set union while the Venn diagram on the right visually represents a set intersection operation.

Set Unions

The union of two sets A and B is denoted as:

\large{A \cup B}

The union axiom states for two sets A and B, there is a set whose members consist entirely of those belonging to sets A or B, or both. More formally, the union axiom is stated as:

\large{\forall a \space \forall b \space \exists B \space \forall x (x \in B \Leftrightarrow x \in a \space \vee \space x \in b)}

For example, for two sets A and B:

\large{A = \{3, 5, 7, 11 \} \qquad B = \{3, 5, 13, 17 \}}

The union of the two sets is:

\large{A \cup B = \{3, 5, 7, 11 \} \cup \{3, 5, 13, 17 \} = \{3, 5, 7, 11, 13, 17 \}}

We can define a simple function in R that implements the set union operation. Base R also provides a union() function that performs the same operation and is recommended for practical use.

set.union <- function(a, b) {
  u <- a
  for (i in 1:length(b)) {
    if (!(b[i] %in% u)) {
      u <- append(u, b[i])
    }
  }
  return(u)
}

Using our function to perform a union operation of the two sets as above.

a <- c(3, 5, 7, 11)
b <- c(3, 5, 13, 17)
set.union(a, b)
## [1] 3 5 7 11 13 17

Set Intersections

The intersection of two sets A and B is the set that comprises the elements that are both members of the two sets. Set intersection is denoted as:

\large{A \cap B}

Interestingly, there is no axiom of intersection, unlike for the set union operation. The concept of set intersection arises from a different axiom, the axiom schema of specification, which asserts the existence of a subset of a set given a certain condition. Defining this condition (also known as a sentence) as \sigma(x), the axiom of specification (subset) is stated as:

\large{\forall A \space \exists B \space \forall x (x \in B \Leftrightarrow x \in A \wedge \sigma(x))}

Put another way, the axiom states that for a set A and a condition (sentence) \sigma defining a subset of A, that subset does indeed exist. This axiom leads us to the definition of set intersections without needing to state any additional axioms. Using the subset axiom as a basis, we can define the existence of the set intersection operation. Given two sets a and b:

\large{\forall a \space \forall b \exists B \space \forall x (x \in B \Leftrightarrow x \in a \space \wedge \space x \in b)}

Stated plainly, given sets a and b, there exists a set B that contains the members existing in both sets.

For example, using the previous sets defined earlier:

\large{A = \{3, 5, 7, 11 \} \qquad B = \{3, 5, 13, 17 \}}

The intersection of the two sets is:

\large{A \cap B = \{3, 5, 7, 11 \} \cap \{3, 5, 13, 17 \} = \{3, 5 \}}

We can also define a straightforward function to implement the set intersection operation. Base R also features a function intersect() that performs the set intersection operation.

set.intersection <- function(a, b) {
  intersect <- vector()
  for (i in 1:length(a)) {
    if (a[i] %in% b) {
      intersect <- append(intersect, a[i])
    }
  }
  return(intersect)
}

Then we use the function to perform the set intersection operation on two sets.

a <- c(3, 5, 7, 11, 13, 20, 30)
b <- c(3, 5, 13, 17, 7, 10)
set.intersection(a, b)
## [1] 3 5 7 13

Subsets

The concept of a subset of a set was introduced when we developed the set intersection operation. A set A is said to be a subset of B, written as A \subset B, if all the elements of A are also elements of B. Therefore, all sets are subsets of themselves, and the empty set \varnothing is a subset of every set.

We can write a simple function to test whether a set a is a subset of b.

issubset <- function(a, b) {
  for (i in 1:length(a)) {
    if (!(a[i] %in% b)) {
      return(FALSE)
    }
  }
  return(TRUE)
}

The union of two sets a and b has, by definition, both a and b as subsets, making it a good test case for our function.

a <- c(3, 5, 7, 11)
b <- c(3, 5, 13, 17)
c <- set.union(a, b)
c
## [1] 3 5 7 11 13 17
print(issubset(a, c))
## [1] TRUE
print(issubset(b, c))
## [1] TRUE
print(issubset(c(3, 5, 7, 4), a))
## [1] FALSE

Summary

This post introduced the common set operations of unions and intersections and the axioms asserting those operations, as well as the definition of a subset of a set, which arises naturally from the results of unions and intersections.

References

Axiom schema of specification. (2017, May 27). In Wikipedia, The Free Encyclopedia. From https://en.wikipedia.org/w/index.php?title=Axiom_schema_of_specification&oldid=782595557

Axiom of union. (2017, May 27). In Wikipedia, The Free Encyclopedia. From https://en.wikipedia.org/w/index.php?title=Axiom_of_union&oldid=782595523

Enderton, H. (1977). Elements of set theory (1st ed.). New York: Academic Press.

The post Set Operations Unions and Intersections in R appeared first on Aaron Schlegel.


To leave a comment for the author, please follow the link and comment on their blog: R – Aaron Schlegel. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Introducing Our Instructor Pages!

Thu, 06/15/2017 - 21:05

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

At DataCamp, we are proud to have the best instructors in the data science community teaching courses on our site. Today, close to 50 instructors have one or more DataCamp courses live on the platform, and many more have a course in development (so stay tuned for that!).  

Until now, it was not possible to browse courses by your preferred instructors, but with our new instructor pages, we are changing that.

We built these instructor pages to make it easier for you to discover what other courses are taught by your favorite instructor, and to find new instructors that are teaching topics you are passionate about. Every instructor page also has a small biography allowing you to better understand the person behind the course.  

Curious? Check out the instructor profile pages of Garrett Grolemund (RStudio), Dhavide Aruliah (Continuum Analytics), and 50 others on our instructor’s overview page.


To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Data Manipulation with Data Table -Part 1

Thu, 06/15/2017 - 21:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


In the exercises below we cover some useful features of data.table. data.table is an R package for fast manipulation of large data frames. Please see the data.table vignette before trying the solutions. This first set is intended for beginners of the data.table package and does not cover the set keywords or joins of data.table, which will be covered in the next set. Load the data.table library in your R session before starting the exercises.
Answers to the exercises are available here.
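
If the DT[i, j, by] form used throughout these exercises is new to you, here is a minimal (non-exercise) example of the pattern:

library(data.table)

dt <- as.data.table(iris)
# i filters rows, j computes on columns, by groups; .N counts rows per group
dt[Petal.Length > 1.5, .(mean_petal = mean(Petal.Length), n = .N), by = Species]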

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Load the iris dataset, make it a data.table and name it iris_dt. Print the mean of Petal.Length from iris_dt, grouping by the first letter of Species.

Exercise 2
Load the diamonds dataset from the ggplot2 package as dt (a data.table). Find the mean price for each group of cut and color.
Exercise 3
Load the diamonds dataset from the ggplot2 package as dt. Now group the dataset by price per carat and print the top 5 groups in terms of count per group. Don't use head; use chaining in data.table to achieve this.

Exercise 4
Use the already loaded diamonds dataset and print the last two carat values of each cut.

Exercise 5
In the same dataset, find the median of the columns x, y and z per cut. Use data.table’s methods to achieve this.

Exercise 6
Load the airquality dataset as a data.table. Now find the logarithm of the wind rate for each month, for days greater than 15.
Exercise 7
In the same dataset, for all the odd rows, update the Temp column by adding 10.

Exercise 8
data.table comes with a powerful feature of updating columns by reference, as you have seen in the last exercise. It is even possible to update/create multiple columns. Now, to test that in the airquality data.table that you created previously, add 10 to Solar.R and Wind.

Exercise 9
Now you have a fairly good idea of how easy it is to create multiple columns; it is even possible to delete multiple columns using the same idea. In this exercise, use the same airquality data.table that you created previously from airquality and delete Solar.R, Wind and Temp using a single expression.
Exercise 10
Load the airquality dataset as a data.table again. Create two columns, a and b, which indicate the temperature on the Celsius and Kelvin scales. Write an expression to achieve this.
Celsius = (Temp - 32) * 5/9
Kelvin = Celsius + 273.15

Related exercise sets:
  1. Shiny Application Layouts Exercises (Part-2)
  2. Descriptive Analytics-Part 5: Data Visualisation (Categorical variables)
  3. Summary Statistics With Aggregate()
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Retailers: Here’s What Your Employees Are Saying About You – A Project on Web Scraping Indeed.com

Thu, 06/15/2017 - 20:56

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

To view the original source code, you can visit my Github page here. If you are interested in learning more about my background, you can visit my LinkedIn page here.

About This Project

This is the second of four projects for the NYC Data Science Academy. In this project, students are required to collect their datasets using a method called webscraping. Essentially, webscraping is the process of collecting information (i.e. texts) from a series of websites into structured databases such as .csv files. Webscraping enables analyses to be done on data that is not already provided in a structured format.

In this analysis, we will scrape employee reviews for the 8 retail companies that made it to Indeed’s 50 Best Corporate Employers in the US list. See below for additional details regarding each of the 8 retail firms under discussion. The rank and overall score correspond to the employer’s rank and overall score as they appear on Indeed’s website.

Objectives

Given a two-week timeframe, the scope of the analysis was limited to these three primary objectives:

  • To understand when employees are more likely to submit positive and negative reviews
  • To understand the sentiments and emotions associated with positive and negative reviews (i.e. Sentiment Analysis)
  • To understand the topics of positive and negative reviews (i.e. Topic Modeling)

Web Scraping with Scrapy

Scrapy (a Python framework) was used for collecting employee reviews for each of the retailers on the 50 Best Corporate Employers List. Each review scraped contains the following elements:

  • Company
  • Industry
  • Job Title
  • Date
  • Month
  • Location
  • Rating
  • Header
  • Comment (Main Review)
  • Pro
  • Con

As seen in the illustration below, not every review contains a pro and a con, as those are optional fields for the reviewer. To see an example of a real review, you can visit Indeed’s Walt Disney Parks and Resorts review page here.

Employee Reviews At a Glance

In total, we have scraped over 20,000 reviews. The number of reviews scraped for each firm varies significantly across the board. This is likely due to the difference in the number of employees hired by each firm across the nation.

Employee reviews for these eight retail firms tend to peak around the beginning of the year (January through April). This is expected, as we know that retail is a cyclical industry in which the majority of sales happen during the holiday season (November through January). Because employees are oftentimes hired only for the duration of this period, it makes sense that the majority of the reviews come in around the beginning of the year.

Among the eight retailers on the 50 Best Corporate Employers list, we see that the review ratings skew toward a score of 5, which is the highest score a review can get.

There are no significant differences in average rating across retailers.

There are no significant differences in average rating across months.

When Are Employees At Retail Firms More Likely to Submit Positive and Negative Reviews?

Average Employee Rating by Month

When we consider reviews of all ratings (i.e. reviews with ratings 1 to 5), retailers receive the highest average ratings during the months of October, November, December, and January. Note that the ratings have been scaled to 0 to 1 to allow for comparison across firms and months.

Most Negative Reviews By Month (Reviews with Ratings Below 3)

If we only consider the count of negative reviews (i.e. reviews with ratings 1 or 2), retailers on average receive the most negative reviews during the months of October, February, March, and April (i.e. these months scored higher across retailers).

Most Positive Reviews By Month (Reviews with Ratings Above 3)

If we only consider the count of positive reviews (i.e. reviews with ratings 4 or 5), retailers on average receive the most positive reviews during the months of January, February, and April (i.e. these months scored higher across retailers).

Summary Observations

In summary, while the highest average ratings concentrate toward the end of the year, the highest count of both positive and negative reviews are given during the beginning of the year. This aligns with our understanding that the retail industry is cyclical. In other words, employees are oftentimes hired specifically to aid sales around the holiday season (i.e. Nov to Jan). When their work concludes around the beginning of the year, that’s when we can expect employees to write the majority of the reviews (i.e. both positive and negative).

Understanding Employees Reviews Using Sentiment Analysis

A sentiment analysis was done using R’s Syuzhet library to visualize the emotions across employee reviews. In the illustration below, each graph represents the intensity of an emotion from January to December on a positive scale (i.e. 0 and above). For example, in Build-A-Bear Workshop’s ‘positive’ graph, we can see that there is an uptick in positive sentiment toward March and November, and each of those months scored approximately a 6. This is a high score relative to the scores recorded across all other emotions.
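As a rough illustration of the approach (with made-up review snippets rather than the scraped data), syuzhet's get_nrc_sentiment() returns a score per emotion for each text, which can then be aggregated by month or by firm:

library(syuzhet)

# toy review snippets (hypothetical)
reviews <- c("Great team and fantastic benefits, I loved working here",
             "Terrible management, long hours and low pay")

# one row per text, one column per NRC emotion plus positive/negative
nrc <- get_nrc_sentiment(reviews)
nrc

# aggregate emotion intensity across the reviews
colSums(nrc)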

Generally, the sentiments observed align with our understanding that the majority of the employee reviews are positive, as we are narrowly focused on analyzing employee reviews from the 8 best corporate retail employers according to Indeed’s ranking. As an aside, it is interesting to see that Build-A-Bear Workshop and Trader Joe’s recorded more volatility in their scoring across the negative emotions (i.e. negative, sadness, fear, disgust, anger).

Using Topic Modeling to Identify Patterns Among Positive and Negative Employee Reviews

Topic modeling is a form of unsupervised machine learning, and it can help us identify patterns by clustering similar items into groups. In this analysis, pyLDAvis, a Python library, was used to analyze the groups among positive and negative reviews (using pros and cons of the employee reviews). Essentially, the library takes a list of documents as inputs (i.e. a list of reviews) and attempts to group the reviews based on common combinations of keywords appearing in each of these reviews.

The drawback of using topic modeling, as with other clustering techniques, is that the user has to specify the number of groups to cluster the documents into. This presents a catch-22, as the initial objective is to understand the number of topics or groups among your inputs. Given this hurdle, the user must use their business judgment and run multiple trials with different numbers of topics to determine which results are most interpretable. More sophisticated clustering libraries aid the user by automatically selecting the best number of topics, but this should only be used as a guide to begin the analysis.
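The analysis in the post was run with Python's pyLDAvis, but the same idea can be sketched in R with the tm and topicmodels packages; the four toy "reviews" below are invented purely to show the mechanics, and k = 2 is an arbitrary choice.

library(tm)
library(topicmodels)

docs <- c("great benefits and great pay",
          "poor management and long hours",
          "fun coworkers and a great discount",
          "low pay and poor management")

# build a document-term matrix and fit an LDA model with 2 topics
dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))
lda_fit <- LDA(dtm, k = 2, control = list(seed = 1))

terms(lda_fit, 3)   # top 3 terms per topic
topics(lda_fit)     # most likely topic for each document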

In the illustration below, we can see that words like ‘great’ and ‘benefit,’ among many other words, are oftentimes seen together among pro reviews.

Observed Topics Among Pros and Cons in Employee Reviews

Two topic modeling analyses were conducted using pro reviews and con reviews. Below are the most common combinations of words appearing in each of the groups for each analysis. It is not a surprise that the groups overlap in the common words identified among them, as we have chosen a relatively high number of topics. Nevertheless, the results provide a useful start for understanding what employees are thinking.

Visualizing Employee Reviews Using Word Clouds

R’s Wordcloud2 library was used to further aid the visualization of what employees are saying about these firms. Essentially, the library takes a list of words along with their frequencies, and produces a word cloud based on those inputs. The higher the frequency a word is associated with, the larger it appears in the word cloud.
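A minimal sketch of the call: wordcloud2() takes a data frame with a word column and a frequency column (the demoFreq data set bundled with the package has exactly that shape), so a frequency table built with table() works too; the toy word vector below is invented.

library(wordcloud2)

# the bundled example data: one word per row plus its frequency
head(demoFreq)
wordcloud2(demoFreq, size = 0.7)

# or build a frequency table from a vector of words (toy example)
words <- c("great", "great", "pay", "hours", "management", "great", "pay")
freqs <- as.data.frame(table(words), stringsAsFactors = FALSE)
names(freqs) <- c("word", "freq")
wordcloud2(freqs)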

Word Cloud – Positive Reviews

The positive reviews word cloud was created using only the pros from each of the reviews. (Note: Each review contains the main review, a pro, and a con.) Some of the common words we see among pros are ‘great,’ ‘work,’ ‘job,’ ‘company,’ and ‘customers.’

Word Cloud – Negative Reviews

Similarly, a word cloud was created using only the cons from each of the reviews. We can see that common words appearing include work, management, hours, employees, and pay.

Future Directions

Beyond the scope of this project, below are other interesting areas that are worth looking into if additional time allows.

  • Evaluate the credibility of Indeed’s Top 50 Ranking corporate employers by sampling comparable employers that did not make it to the list
  • Better understand the rationale why positive and negative reviews are concentrated in the months they were observed to be in, and evaluate whether this behavior is prevalent across other industries
  • Perform statistical methods to evaluate the relationships among variables

The post Retailers: Here’s What Your Employees Are Saying About You – A Project on Web Scraping Indeed.com appeared first on NYC Data Science Academy Blog.


To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Demo: Real-Time Predictions with Microsoft R Server

Thu, 06/15/2017 - 19:59

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

At the R/Finance conference last month, I demonstrated how to operationalize models developed in Microsoft R Server as web services using the mrsdeploy package. Then, I used that deployed model to generate predictions for loan delinquency, using a Python script as the client. (You can see slides here, and a video of the presentation below.)

With Microsoft R Server 9.1, there are now two ways to operationalize models as a Web service or as a SQL Server stored procedure:

  • Flexible Operationalization: Deploy any R script or function.
  • Real-Time Operationalization: Deploy model objects generated by specific functions in Microsoft R, which generates predictions much more quickly by bypassing the R interpreter.

In the demo, which begins at the 10:00 mark in the video below, you can see a comparison of using the two types of deployment. Ultimately, I was able to generate predictions from a random forest at a rate of 1M predictions per second, with three Python clients simultaneously drawing responses from the server (an Azure GS5 instance running the Windows Data Science VM).

If you'd like to try out this capability yourself, you can find the R and Python scripts used in the demo at this Github repository. The lending club data is available here, and the script used to featurize the data is here.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Neural networks Exercises (Part-2)

Thu, 06/15/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Source: Wikipedia

Neural networks have become a cornerstone of machine learning in the last decade. Created in the late 1940s with the intention of building computer programs that mimic the way neurons process information, these kinds of algorithms were long believed to be only an academic curiosity, deprived of practical use, since they require a lot of processing power and other machine learning algorithms outperformed them. However, since the mid 2000s, the creation of new neural network types and techniques, coupled with the increased availability of fast computers, has made the neural network a powerful tool that every data analyst or programmer must know.

In this series of articles, we’ll see how to fit a neural network with R, learn the core concepts we need in order to apply those algorithms well, and learn how to evaluate whether our model is appropriate for use in production. Today, we’ll practice using the nnet and neuralnet packages to create a feedforward neural network, which we introduced in the last set of exercises. In this type of neural network, all the neurons from the input layer are linked to the neurons in the hidden layer, and all of those neurons are linked to the output layer, as seen in this image. Since there is no cycle in this network, information flows in one direction from the input layer to the hidden layers to the output layer. For more information about these types of neural networks, you can read this page.

Answers to the exercises are available here.

Exercise 1
We’ll start by practicing what we’ve seen in the last set of exercises. Load the MASS package and the biopsy dataset, then prepare your data to be fed to a neural network.

Exercise 2
We’ll use the nnet() function from the package of the same name to do a logistic regression on the biopsy data set using a feedforward neural network. If you remember the last set of exercises, you know that we have to choose the number of neurons in the hidden layer of our feedforward neural network. There is no rule or equation that can tell us the optimal number of neurons to use, so the best way to find a good model is to run a bunch of cross-validations of our model with different numbers of neurons in the hidden layer and choose the one that fits the data best. A good range to test with this process is between one neuron and the number of input variables.

Write a function that takes a training data set, a test data set and a range of integers corresponding to the numbers of neurons to try as parameters. For each possible number of neurons in the hidden layer, the function should train a neural network made with nnet(), make predictions on the test set and return the accuracy of the predictions.
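One possible shape for such a function is sketched below; it assumes the prepared data sets contain a factor response column named class (adjust the name to however you prepared the biopsy data), and the 200-iteration limit is an arbitrary choice.

library(nnet)

tune_hidden <- function(train, test, sizes) {
  acc <- sapply(sizes, function(s) {
    fit  <- nnet(class ~ ., data = train, size = s, maxit = 200, trace = FALSE)
    pred <- predict(fit, test, type = "class")
    mean(pred == test$class)          # accuracy on the test set
  })
  data.frame(size = sizes, accuracy = acc)
}

# usage sketch: tune_hidden(train_set, test_set, 1:9)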

Exercise 3
Use your function on your data set and plot the results. What should the number of neurons in the hidden layer of your feedforward neural network be?

Exercise 4
The nnet() function is easy to use, but it doesn’t give us many options to customize our neural network. As a consequence, it’s a good package to use if you have to build a quick model to test a hypothesis, but for more complex models the neuralnet package is a lot more powerful. Documentation for this package can be found here.

Use the neuralnet() function with the default parameters and the number of neurons in the hidden layer set to the answer from the last exercise. Note that this function can only handle numeric values and cannot deal with factors. Then use the compute() function to make predictions on the values of the test set and compute the accuracy of your model.
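A minimal sketch of the two calls, assuming train_num and test_num are all-numeric data frames with a 0/1 response named class and predictors V1 to V3 (hypothetical names; older versions of neuralnet also require the predictors to be spelled out in the formula):

library(neuralnet)

nn <- neuralnet(class ~ V1 + V2 + V3, data = train_num,
                hidden = 5, linear.output = FALSE)

# compute() returns the fitted values for new data in $net.result
pred <- compute(nn, test_num[, c("V1", "V2", "V3")])$net.result
acc  <- mean((pred > 0.5) == test_num$class)
acc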

Learn more about neural networks in the online course Machine Learning A-Z™: Hands-On Python & R In Data Science. In this course you will learn how to:

  • Work with Deep Learning networks and related packages in R
  • Create Natural Language Processing models
  • And much more

Exercise 5
The nnet() function uses the BFGS algorithm by default to adjust the values of the weights until the output values of our model are close to the values of our data set. The neuralnet package gives us the option to use more efficient algorithms to compute those values, which results in faster processing times and overall better estimation. For example, by default this function uses resilient backpropagation with weight backtracking.

Use the neuralnet() function with the parameter algorithm set to ‘rprop-‘, which stands for resilient backpropagation without weight backtracking. Then test your model and print the accuracy.

Exercise 6
Two other algorithms can be used with the neuralnet() function: 'sag' and 'slr'. These two strings tell the function to use the globally convergent algorithm (grprop) and to modify the learning rate associated with either the smallest absolute gradient (sag) or the smallest learning rate (slr). When using these algorithms, it can be useful to pass a vector or list containing the lowest and highest limits for the learning rate to the learningrate.limit parameter.

Again, use the neuralnet() function twice, once with parameter algorithm set to 'sag' and then to 'slr'. In both cases set the learningrate.limit parameter to c(0.1,1) and change the stepmax parameter to 1e+06.

Exercise 7
The learning rate determines how much the backpropagation can affect the weights at each iteration. A high learning rate means that, during the training of the neural network, each iteration can strongly change the values of the weights or, to put it another way, the algorithm learns a lot from each observation in your data set. This means that outliers could easily affect your weights and make your algorithm diverge from the path toward the ideal weights for your problem. A small learning rate means that the algorithm learns less from each observation in your data set, so your neural network is less affected by outliers, but it also means that you will need more observations to build a good model.

Use the neuralnet() function with the parameter algorithm set to ‘rprop+’ twice: once with the learningrate parameter set to 0.01 and another time with the learningrate parameter set to 1. Notice the difference in running time between the two cases.

Exercise 8
The neuralnet package gives us the ability to make a visual representation of the neural network you built. Use the plot() function to visualize one of the neural networks from the last exercise.

Exercise 9
Until now, we’ve used feedforward neural networks with a single hidden layer of neurons, but we could use more. In fact, state-of-the-art neural networks often use on the order of a hundred hidden layers to model complex behavior. For basic regression problems or even basic digit recognition problems, one layer is enough, but if you want to use more, you can do so with the neuralnet() function by passing a vector of integers to the hidden parameter, representing the number of neurons in each layer.

Create a feedforward neural network with three hidden layers of nine neurons each and use it on your data.

Exercise 10
Plot the feedforward neural network from the last exercise.

Related exercise sets:
  1. Neural networks Exercises (Part-1)
  2. Evaluate your model with R Exercises
  3. Model Evaluation Exercises 1
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Sampling weights and multilevel modeling in R

Thu, 06/15/2017 - 15:26

(This article was first published on Data Literacy - The blog of Andrés Gutiérrez, and kindly contributed to R-bloggers)

So many things have been said about weighting, but from my personal view of statistical inference processes, you do have to weight. From a single statistic to a complex model, you have to weight, because the probability measure that induces the variation of the sample comes from an (almost always) complex sampling design that you should not ignore. Weighting is a complex issue that has been discussed by several authors in recent years. Social researchers have not found consensus about the appropriateness of weighting when it comes to fitting statistical models. Angrist and Pischke (2009, p. 91) claim that “few things are as confusing to applied researchers as the role of sample weights. Even now, 20 years post-Ph.D., we read the section of the Stata manual on weighting with some dismay.”

Anyway, despite the fact that researchers have not reached consensus on when to weight, the reality is that you have to be careful when doing so. For example, when it comes to estimating totals, means or proportions, you can use the inverse of the selection probability as the weight, and it looks like every social researcher agrees to weight in order to estimate this kind of descriptive statistic. The rationale behind this practice is that every unit belonging to the sample represents itself and many others that were not selected in the sample.

Now, when using weights to estimate model parameters, you have to keep in mind the nature of the sampling design. For example, when it comes to estimating multilevel parameters, you have to take into account not only the final sampling unit weights but also the primary sampling unit weights. For example, let’s assume that you have a sample of students selected from a national frame of schools. Then we have two sets of weights, the first one regarding schools (notice that a selected school represents itself as well as others not in the sample) and the second one regarding students.

Now, let’s assume that in the finite population we have 10,000 students and 40 schools. For the sake of the example, let’s consider that you have selected 500 students allocated in 8 schools and, for simplicity, that simple random sampling is used to select the students (I know, this kind of sampling design is barely used). Think about it: if you take into account only the students’ weights to fit your multilevel model, you will find that you are estimating parameters with an expanded sample that represents 10,000 students allocated in a sample of just eight schools. So any conclusion you state will be wrong. For example, when performing a simple analysis of variance, the percentage of variance explained by the schools will be extremely low, because you are expanding the sample of students but not the sample of schools. Now, if you take into account both sets of weights (students and schools), you will find yourself fitting a model with expanded samples that represent 10,000 students and 40 schools (which is good).

Unfortunately, as far as I know, the R ecosystem lacks a package that performs this kind of design-based inference for fitting multilevel models. So, right now, we can unbiasedly estimate the model parameters, but when it comes to estimating standard errors (from a design-based perspective) we need to use other computational resources and techniques such as bootstrapping or the jackknife. Because of the dependency among sampled observation units, most applied statistical methods, which assume independence, cannot be used to analyze this kind of data directly. Inaccurate standard errors may be produced if no adjustment is made when analyzing complex survey data.

Now, when it comes to educational studies (based on large-scale assessment tests), we can distinguish (at least) four sets of weights: total student weight, student house weight, student senate weight and school weight. The TIMSS team claims that the total student weight is appropriate for single-level, student-level analyses. The student house weight, also called the normalized weight, is used when analyses are sensitive to sample size; it is essentially a linear transformation of the total student weight so that the sum of the weights is equal to the sample size. The student senate weight is used when analyses involve more than one country, because it is the total student weight scaled in such a way that all students’ senate weights sum to 500 (or 1000) in each country. The school weight should be used when analyzing school-level data, as it is the inverse of the probability of selection for the selected school.

R workshop

We will use the student house weight to fit a multilevel model. As stated before, the sum of these weights is equal to the sample size. For the R workshop, we will use PISA 2012 data (available on the OECD website). I have filtered the data to the Colombian case and saved it to be directly compatible with R (available here). Let’s load the data into R.

rm(list = ls())
library(dplyr)
library(ggplot2)
library(lme4)

setwd("/your working directory")
load("PisaCol.RData")

head(names(PisaCol))
summary(PisaCol$STRATUM)

Now, we create an object containing the student house weights and summarize some results based on that set of weights. Notice that the total student weights are stored in the column W_FSTUWT of the PISA database. I remind you that I am working with the first plausible value of the mathematics test, and that score will be defined as our (dependent) variable of interest for the modeling.

n <- nrow(PisaCol)
PisaCol$W_HOUSEWHT <- n * PisaCol$W_FSTUWT / sum(PisaCol$W_FSTUWT)

PisaCol %>%
group_by(STRATUM) %>%
summarise(avg1 = weighted.mean(PV1MATH, w = W_HOUSEWHT),
avg2 = weighted.mean(PV2MATH, w = W_HOUSEWHT))

We use the lmer function of the lme4 package to obtain the estimates of the model coefficients in the null model (where the schools define the grouping structure).

##################
### Null model ###
##################

HLM0 <- lmer(PV1MATH ~ (1 | SCHOOLID), data = PisaCol,
weights = W_HOUSEWHT)
coef(HLM0)
summary(HLM0)

# 62.81% of the variance is due to students
# 37.19% of the variance is due to schools
100 * 3569 / (3569 + 2113)

Now, as you may know, the PISA index of economic, social and cultural status (ESCS) has a strong relationship with student achievement, so it is a good idea to control for this variable in a more refined model.

#################
### ESCS model ###
#################

HLM1 <- lmer(PV1MATH ~ ESCS + (1 + ESCS | SCHOOLID), data = PisaCol,
weights = W_HOUSEWHT)
coef(HLM1)
summary(HLM1)

# After controlling for ESCS, 34.58% of the variance is due to schools
100 * (96.12 + 1697.36) / (3392.58 + 96.12 + 1697.36)

So then, in summary: we have 3569 units of within-school variance (63%); after controlling for ESCS, that figure drops to 3392 units (student background explains 5% of that variation). We have 2113 units (37%) of between-school variance; after controlling for ESCS, that figure drops to 1793 (student background explains 15% of that variation). The following code makes a graph that summarizes the relationship of student achievement with ESCS.

ggplot(data = PisaCol, aes(x = ESCS, y = PV1MATH, size = W_HOUSEWHT)) +
theme_minimal() + geom_point() + theme(legend.position="none")

ggplot(data = PisaCol, aes(x = ESCS, y = PV1MATH, size = W_HOUSEWHT)) +
geom_point(aes(colour = SCHOOLID)) + theme(legend.position="none")

To leave a comment for the author, please follow the link and comment on their blog: Data Literacy - The blog of Andrés Gutiérrez. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Lexicographic Permutations: Euler Problem 24

Thu, 06/15/2017 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Euler Problem 24 asks us to develop lexicographic permutations, which are ordered arrangements of objects in lexicographic order. Tushar Roy of Coding Made Simple has shared a great introduction on how to generate lexicographic permutations.

Euler Problem 24 Definition

A permutation is an ordered arrangement of objects. For example, 3124 is one possible permutation of the digits 1, 2, 3 and 4. If all of the permutations are listed numerically or alphabetically, we call it lexicographic order. The lexicographic permutations of 0, 1 and 2 are:

012 021 102 120 201 210

What is the millionth lexicographic permutation of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9?

Brute Force Solution

The digits 0 to 9 have 10! = 3,628,800 permutations (including those that start with 0). Most of these permutations are, however, not in lexicographic order. A brute-force way to solve the problem is to determine the next lexicographic permutation of a number string and repeat this one million times.

nextPerm <- function(a) {
  # Find longest non-increasing suffix
  i <- length(a)
  while (i > 1 && a[i - 1] >= a[i]) i <- i - 1
  # i is the head index of the suffix
  # Are we at the last permutation?
  if (i <= 1) return(NA)
  # a[i - 1] is the pivot
  # Find rightmost element that exceeds the pivot
  j <- length(a)
  while (a[j] <= a[i - 1]) j <- j - 1
  # Swap pivot with j
  temp <- a[i - 1]
  a[i - 1] <- a[j]
  a[j] <- temp
  # Reverse the suffix
  a[i:length(a)] <- rev(a[i:length(a)])
  return(a)
}

numbers <- 0:9
for (i in 1:(1E6 - 1)) numbers <- nextPerm(numbers)
answer <- numbers
print(answer)

This code takes the following steps:

  1. Find the largest index i such that a[i - 1] < a[i].
    1. If no such index exists, then this is already the last permutation.
  2. Find the largest index j such that a[j] > a[i - 1].
  3. Swap a[i - 1] and a[j].
  4. Reverse the suffix starting at a[i].
Combinatorics

A more efficient solution is to use combinatorics, thanks to MathBlog. The last nine digits can be ordered in 9! = 362,880 ways, so the first 362,880 permutations start with a 0. By extending this thought, it follows that the millionth permutation must start with a 2.

From this rule, it follows that the 725,761st permutation is 2013456789. We now need 1,000,000 - 725,761 = 274,239 more lexicographic permutations.

We can repeat this logic to find the next digit. The last 8 digits can be ordered in 8! = 40,320 ways. Since floor(274,239 / 40,320) = 6, the second digit is the digit at index 6 (counting from zero) of the remaining numbers 0, 1, 3, 4, 5, 6, 7, 8, 9, which is 7.

This process is repeated until all digits have been used.

numbers <- 0:9
n <- length(numbers)
answer <- vector(length = 10)
remain <- 1E6 - 1
for (i in 1:n) {
  j <- floor(remain / factorial(n - i))
  answer[i] <- numbers[j + 1]
  remain <- remain %% factorial(n - i)
  numbers <- numbers[-(j + 1)]
}
answer <- paste(answer, collapse = "")
print(answer)

R blogger Tony’s Bubble Universe created a generalised function to solve this problem a few years ago.

The post Lexicographic Permutations: Euler Problem 24 appeared first on The Devil is in the Data.


To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Studying disease with R: RECON, The R Epidemics Consortium

Wed, 06/14/2017 - 23:41

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

For almost a year now, a collection of researchers from around the world has been collaborating to develop the next generation of analysis tools for disease outbreak response using R. The R Epidemics Consortium (RECON) creates R packages for handling, visualizing, and analyzing outbreak data using cutting-edge statistical methods, along with general-purpose tools for data cleaning, versioning, encryption, and system infrastructure. Like ROpenSci, the Epidemics Consortium is focused on developing efficient, reliable, and accessible open-source tools, but with a focus on epidemiology as opposed to science generally.

The Epidemics Consortium has already created several useful resources for epidemiology.

There are also a large number of additional packages under development.

RECON welcomes new members, particularly experienced R developers and public health officers specialized in outbreak response. You can find information on how to join here, and general information about the R Epidemics Consortium at the link below.

RECON: The R Epidemics Consortium (via Maëlle Salmon)

 


To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Exploratory Factor Analysis – Exercises

Wed, 06/14/2017 - 18:10

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

This set of exercises is about exploratory factor analysis. We shall use some basic features of the psych package. For a quick introduction to exploratory factor analysis and the psych package, we recommend this short “how to” guide.

You can download the dataset here. The data is fictitious.

Answers to the exercises are available here.

If you have a different solution, feel free to post it.

Exercise 1

Load the data, then install and load the packages psych and GPArotation, which we will use in the following exercises. Describe the data.

Exercise 2

Using the parallel analysis, determine the number of factors.

Exercise 3

Determine the number of factors using Very Simple Structure method.

Exercise 4

Based on a normality test, is the Maximum Likelihood factoring method appropriate, or is OLS/minres better? (Tip: the Maximum Likelihood method requires a normal distribution.)

Exercise 5

Using oblimin rotation, 5 factors and the factoring method from the previous exercise, find the factor solution. Print the loadings table with a cut-off at 0.3.
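A minimal sketch of the call, assuming the data set from Exercise 1 is loaded as a data frame named dat and the minres method was retained in Exercise 4 (swap fm = "ml" if your normality check supports it):

library(psych)
library(GPArotation)

fit <- fa(dat, nfactors = 5, rotate = "oblimin", fm = "minres")
print(fit$loadings, cutoff = 0.3)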

Exercise 6

Plot factors loadings.

Exercise 7

Plot structure diagram.

Exercise 8

Find the higher-order factor model with five factors plus general factor.

Exercise 9

Find the bifactor solution.

Exercise 10

Reduce the number of dimensions using hierarchical clustering analysis.

Related exercise sets:
  1. Repeated measures ANOVA in R Exercises
  2. Data science for Doctors: Cluster Analysis Exercises
  3. Factor exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


How to draw connecting routes on map with R and great circles

Wed, 06/14/2017 - 16:25

(This article was first published on Blog – The R graph Gallery, and kindly contributed to R-bloggers)

This post explains how to draw connection lines between several locations on a map, using R. The method proposed here relies on the gcIntermediate function from the geosphere package. Instead of drawing straight lines, it draws the shortest routes, using great circles. Special care is given to situations where cities are very far apart and the shortest connection therefore passes behind the map.

First we need to load 3 libraries. maps allows us to draw the background map, and geosphere provides the gcIntermediate function.

library(tidyverse)
library(maps)
library(geosphere)

 

1- Draw an empty map

This is easily done using the ‘maps’ package. You can see plenty of other maps made with R in the map section of the R graph gallery.

par(mar=c(0,0,0,0))
map('world', col="#f2f2f2", fill=TRUE, bg="white", lwd=0.05, mar=rep(0,4), border=0, ylim=c(-80,80))

2- Add 3 cities

First we create a data frame with the coordinates of Buenos Aires, Melbourne and Paris

Buenos_aires=c(-58,-34)
Paris=c(2,49)
Melbourne=c(145,-38)
data=rbind(Buenos_aires, Paris, Melbourne) %>% as.data.frame()
colnames(data)=c("long","lat")

Then add it to the map using the ‘points’ function:

points(x=data$long, y=data$lat, col="slateblue", cex=3, pch=20)

4- Show connection between them

Now we can connect the cities by drawing the shortest route between them. This is done using great circles, which gives a better visualization than straight lines. This technique was proposed by Nathan Yau on FlowingData.

# Connection between Buenos Aires and Paris
inter <- gcIntermediate(Paris, Buenos_aires, n=50, addStartEnd=TRUE, breakAtDateLine=F)
lines(inter, col="slateblue", lwd=2)

# Between Paris and Melbourne
inter <- gcIntermediate(Melbourne, Paris, n=50, addStartEnd=TRUE, breakAtDateLine=F)
lines(inter, col="slateblue", lwd=2)


5 – Correcting gcIntermediate

If we use the same method between Melbourne and Buenos Aires, we get this disappointing result:

What happens is that gcIntermediate follows the shortest path, which means it goes East from Australia until the date line, breaks the line and comes back heading East from the Pacific to South America. Because we do not want to see the horizontal line, we need to plot this connection in 2 steps.

To do so we can use the following function, which breaks the line into 2 sections when the distance between 2 points is greater than 180 degrees.

plot_my_connection=function(dep_lon, dep_lat, arr_lon, arr_lat, ...){
  inter <- gcIntermediate(c(dep_lon, dep_lat), c(arr_lon, arr_lat), n=50, addStartEnd=TRUE, breakAtDateLine=F)
  inter=data.frame(inter)
  diff_of_lon=abs(dep_lon) + abs(arr_lon)
  if(diff_of_lon > 180){
    lines(subset(inter, lon>=0), ...)
    lines(subset(inter, lon<0), ...)
  }else{
    lines(inter, ...)
  }
}

Let’s try it!

map('world', col="#f2f2f2", fill=TRUE, bg="white", lwd=0.05, mar=rep(0,4), border=0, ylim=c(-80,80))
points(x=data$long, y=data$lat, col="slateblue", cex=3, pch=20)
plot_my_connection(Paris[1], Paris[2], Melbourne[1], Melbourne[2], col="slateblue", lwd=2)
plot_my_connection(Buenos_aires[1], Buenos_aires[2], Melbourne[1], Melbourne[2], col="slateblue", lwd=2)
plot_my_connection(Buenos_aires[1], Buenos_aires[2], Paris[1], Paris[2], col="slateblue", lwd=2)

6 – Apply it to several pairs of cities

Let’s consider 8 cities:

data=rbind(
  Buenos_aires=c(-58,-34),
  Paris=c(2,49),
  Melbourne=c(145,-38),
  Saint.Petersburg=c(30.32, 59.93),
  Abidjan=c(-4.03, 5.33),
  Montreal=c(-73.57, 45.52),
  Nairobi=c(36.82, -1.29),
  Salvador=c(-38.5, -12.97)
) %>% as.data.frame()
colnames(data)=c("long","lat")

We can generate all pairs of coordinates

all_pairs=cbind(t(combn(data$long, 2)), t(combn(data$lat, 2))) %>% as.data.frame()
colnames(all_pairs)=c("long1","long2","lat1","lat2")

And plot every connection:

# background map
par(mar=c(0,0,0,0))
map('world', col="#f2f2f2", fill=TRUE, bg="white", lwd=0.05, mar=rep(0,4), border=0, ylim=c(-80,80))

# add every connection:
for(i in 1:nrow(all_pairs)){
  plot_my_connection(all_pairs$long1[i], all_pairs$lat1[i], all_pairs$long2[i], all_pairs$lat2[i], col="skyblue", lwd=1)
}

# add points and names of cities
points(x=data$long, y=data$lat, col="slateblue", cex=2, pch=20)
text(rownames(data), x=data$long, y=data$lat, col="slateblue", cex=1, pos=4)

7 – An alternative using the greatCircle function

This is the method proposed by the Simply Statistics blog to draw a Twitter connection map. The idea is to calculate the whole great circle and keep only the part that stays in front of the map, never going behind it.

# A function that keeps the good part of the great circle, by Jeff Leek:
getGreatCircle = function(userLL, relationLL){
  tmpCircle = greatCircle(userLL, relationLL, n=200)
  start = which.min(abs(tmpCircle[,1] - data.frame(userLL)[1,1]))
  end = which.min(abs(tmpCircle[,1] - relationLL[1]))
  greatC = tmpCircle[start:end,]
  return(greatC)
}

# map 3 connections:
map('world', col="#f2f2f2", fill=TRUE, bg="white", lwd=0.05, mar=rep(0,4), border=0, ylim=c(-80,80))
great=getGreatCircle(Paris, Melbourne)
lines(great, col="skyblue", lwd=2)
great=getGreatCircle(Buenos_aires, Melbourne)
lines(great, col="skyblue", lwd=2)
great=getGreatCircle(Paris, Buenos_aires)
lines(great, col="skyblue", lwd=2)
points(x=data$long, y=data$lat, col="slateblue", cex=3, pch=20)
text(rownames(data), x=data$long, y=data$lat, col="slateblue", cex=1, pos=4)

Note that the R graph gallery offers lots of other examples of maps made with R. You can follow the gallery on Twitter or on Facebook to be aware of recent updates.



To leave a comment for the author, please follow the link and comment on their blog: Blog – The R graph Gallery. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


When the LASSO fails???

Wed, 06/14/2017 - 16:20

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos

When the LASSO fails?

The LASSO has two important uses: the first is forecasting and the second is variable selection. We are going to talk about the second. The variable selection objective is to recover the correct set of variables that generate the data, or at least the best approximation given the candidate variables. The LASSO has attracted a lot of attention lately because it allows us to estimate a linear regression with thousands of variables and have the model select the right ones for us. However, what many people ignore is when the LASSO fails.

Like any model, the LASSO relies on assumptions in order to work. The first is sparsity, i.e. only a small number of variables may actually be relevant. If this assumption does not hold, there is no hope of using the LASSO for variable selection. Another assumption is that the irrepresentable condition must hold; this condition may look very technical, but it only says that the relevant variables may not be too correlated with the irrelevant ones.

Suppose your candidate variables are represented by the matrix X, where each column is a variable and each row is an observation. We can calculate the covariance matrix Σ = Cov(X), which is a symmetric matrix. This matrix may be broken into four pieces:

Σ = [ Σ_{11}  Σ_{12} ; Σ_{21}  Σ_{22} ]

The first piece, Σ_{11}, shows the covariances between only the important variables, Σ_{22} is the covariance matrix of the irrelevant variables, and Σ_{12} and Σ_{21} show the covariances between relevant and irrelevant variables. With that in mind, the irrepresentable condition is:

| Σ_{21} Σ_{11}^{-1} sign(β_1) | < 1

The inequality above must hold for all elements. sign(β_1) is 1 for positive values of β_1, -1 for negative values and 0 if β_1 = 0.

Example

For this example we are going to generate two covariance matrices, one that satisfies the irrepresentable condition and one that violates it. Our design will be very simple: only 10 candidate variables, of which five are relevant.

library(mvtnorm)
library(corrplot)
library(glmnet)
library(clusterGeneration)

k = 10              # = Number of Candidate Variables
p = 5               # = Number of Relevant Variables
N = 500             # = Number of observations
betas = (-1)^(1:p)  # = Values for beta
set.seed(12345)     # = Seed for replication

sigma1 = genPositiveDefMat(k, "unifcorrmat")$Sigma  # = Sigma1 violates the irc
sigma2 = sigma1                                     # = Sigma2 satisfies the irc
sigma2[(p+1):k, 1:p] = 0
sigma2[1:p, (p+1):k] = 0

# = Verify the irrepresentable condition
irc1 = sort(abs(sigma1[(p+1):k, 1:p] %*% solve(sigma1[1:p, 1:p]) %*% sign(betas)))
irc2 = sort(abs(sigma2[(p+1):k, 1:p] %*% solve(sigma2[1:p, 1:p]) %*% sign(betas)))
c(max(irc1), max(irc2))

## [1] 3.222599 0.000000

# = Have a look at the correlation matrices
par(mfrow=c(1,2))
corrplot(cov2cor(sigma1))
corrplot(cov2cor(sigma2))

As you can see, irc1 violates the irrepresentable condition and irc2 does not. The correlation matrix that satisfies the irrepresentable condition is block diagonal, and the relevant variables have no correlation with the irrelevant ones. This is an extreme case: you may have a small correlation and still satisfy the condition.

Now let us check how the LASSO works for both covariance matrices. First we need to understand what the regularization path is. The LASSO objective function penalizes the size of the coefficients, and this penalization is controlled by a hyper-parameter λ. There is an exact λ that is just big enough to shrink all variables to zero, and for any value smaller than that λ some variable will be included. As we decrease λ, more variables are included until we have a model with all variables (or the biggest identified model when we have more variables than observations). This path between the model with no variables and the model with all variables is the regularization path. The code below generates data from multivariate normal distributions for the covariance matrix that violates the irrepresentable condition and the covariance matrix that satisfies it. Then I estimate the regularization path for both cases and summarize the information in plots.

X1 = rmvnorm(N, sigma = sigma1)  # = Variables for the design that violates the IRC = #
X2 = rmvnorm(N, sigma = sigma2)  # = Variables for the design that satisfies the IRC = #
e = rnorm(N)                     # = Error = #
y1 = X1[, 1:p] %*% betas + e     # = Generate y for design 1 = #
y2 = X2[, 1:p] %*% betas + e     # = Generate y for design 2 = #

lasso1 = glmnet(X1, y1, nlambda = 100)  # = Estimation for design 1 = #
lasso2 = glmnet(X2, y2, nlambda = 100)  # = Estimation for design 2 = #

## == Regularization path == ##
par(mfrow=c(1,2))
l1 = log(lasso1$lambda)
matplot(as.matrix(l1), t(coef(lasso1)[-1,]), type="l", lty=1, col=c(rep(1,9),2),
        ylab="coef", xlab="log(lambda)", main="Violates IRC")
l2 = log(lasso2$lambda)
matplot(as.matrix(l2), t(coef(lasso2)[-1,]), type="l", lty=1, col=c(rep(1,9),2),
        ylab="coef", xlab="log(lambda)", main="Satisfies IRC")

The plot on the left shows the results when the irrepresentable condition is violated and the plot on the right shows the case when it is satisfied. The five black lines that slowly converge to zero are the five relevant variables and the red line is an irrelevant variable. As you can see, when the IRC is satisfied all relevant variables shrink to zero very fast as we increase lambda. However, when the IRC is violated, one irrelevant variable starts with a very small coefficient that slowly increases before decreasing to zero at the very end of the path. This variable is selected through the entire path; it is virtually impossible to recover the correct set of variables in this case unless you apply a different penalty to each variable. This is precisely what the adaptive LASSO does. Does that mean that the adaLASSO is free from the irrepresentable condition??? The answer is: partially. The adaptive LASSO requires a less restrictive condition called the weighted irrepresentable condition, which is much easier to satisfy. The two plots below show the regularization path for the LASSO and the adaLASSO in the case where the IRC is violated. As you can see, the adaLASSO selects the correct set of variables along the entire path.

lasso1.1 = cv.glmnet(X1, y1)
w. = (abs(coef(lasso1.1)[-1]) + 1/N)^(-1)
adalasso1 = glmnet(X1, y1, penalty.factor = w.)

par(mfrow=c(1,2))
l1 = log(lasso1$lambda)
matplot(as.matrix(l1), t(coef(lasso1)[-1,]), type="l", lty=1, col=c(rep(1,9),2),
        ylab="coef", xlab="log(lambda)", main="LASSO")
l2 = log(adalasso1$lambda)
matplot(as.matrix(l2), t(coef(adalasso1)[-1,]), type="l", lty=1, col=c(rep(1,9),2),
        ylab="coef", xlab="log(lambda)", main="adaLASSO")

The biggest problem is that the irrepresentable condition and its less restrictive weighted version are not testable in the real world, because we would need the population covariance matrix and the true betas that generate the data. The solution is to study your data as much as possible to at least have an idea of the situation.

Some articles on the topic

Zhao, Peng, and Bin Yu. “On model selection consistency of Lasso.” Journal of Machine learning research 7.Nov (2006): 2541-2563.

Meinshausen, Nicolai, and Bin Yu. “Lasso-type recovery of sparse representations for high-dimensional data.” The Annals of Statistics (2009): 246-270.


To leave a comment for the author, please follow the link and comment on their blog: R – insightR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

