Subscribe to R bloggers feed R bloggers
R news and tutorials contributed by hundreds of R bloggers
Updated: 1 hour 33 min ago

linl 0.0.1: linl is not Letter

Sun, 10/22/2017 - 23:06

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Aaron Wolen and I are pleased to announce the availability of the initial 0.0.1 release of our new linl package on the CRAN network. It provides a simple-yet-powerful Markdown—and RMarkdown—wrapper the venerable LaTeX letter class. Aaron had done the legwork in the underlying pandoc-letter repository upon which we build via proper rmarkdown integration.

The package also includes a LaTeX trick or two: optional header and signature files, nicer font, better size, saner default geometry and more. See the following screenshot which shows the package vignette—itself a simple letter—along with (most of) its source:

The initial (short) NEWS entry follows:

Changes in tint version 0.0.1 (2017-10-17)
  • Initial CRAN release

The date is a little off; it took a little longer than usual for the good folks at CRAN to process the initial submission. We expect future releases to be more timely.

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Markets Performance after Election: Day 239

Sun, 10/22/2017 - 00:53

(This article was first published on R – Quintuitive, and kindly contributed to R-bloggers)

When I wrote the original post, I wasn’t planning on writing a follow-up. Certainly not the week after. But what a difference a week can make in a dynamic system like the US stock market.

While re-running the computations testing the latest version of RStudio, I noticed something surprising – President Trump’s rally has advanced to 2nd place!

A mere week ago, that seemed unthinkable. Something abnormal must have happened. Two things happened. First, the current stock market advanced another 2%. Nothing to brag about just another positive vote of confidence in the economy’s direction.

The more impactful reason behind this sudden switch became clear when I took a look at the President H.W. Bush’s rally. Even from the above chart, it’s clear the rally lost significant amount of steam within a day, or two max. Is this a problem with data? A bit of forensics:

election.date = "1988-11-08" require(quantmod) dj = getSymbols("^DJI", from="1900-01-01", auto.assign=F) id = findInterval(as.Date(election.date), index(dj)) tail(ROC(Cl(dj[id:(id+239)]), type="discrete", na.pad=F))

And the truth reveals itself:

Date Return 1989-10-12 -0.49% 1989-10-13 -6.91% 1989-10-16 3.43% 1989-10-17 -0.70% 1989-10-18 0.19% 1989-10-19 1.50%

Low and behold, no problem with the data, just a 7% drop on October 13th 1989. A quick search reveals that this was indeed Friday the 13th mini-crash! Friday, the 13th … mini-crash … What a coincidence!

I will keep it brief and wrap up this post here. There were a few improvements and changes I did to the R code used to perform these analysis – the Gist contains all of them.

It’s all optimism in the air, judging by the market behavior at least.

The post Markets Performance after Election: Day 239 appeared first on Quintuitive.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Quintuitive. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Rick and Morty and Tidy Data Principles (Part 2)

Sat, 10/21/2017 - 18:00

(This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers)

Motivation

The first part left an open door to analyze Rick and Morty contents using tf-idf, bag-of-words or some other NLP techniques. Here I’m also taking a lot of ideas from Julia Silge‘s blog.

Note: If some images appear too small on your screen you can open them in a new tab to show them in their original size.

Term Frequency

The most basic measure in natural language processing is obviously to just count words. This is a crude way of knowing what a document is about. The problem with counting words, however, is that there are some words (called stopwords) that are always too common, like “the” or “that”. So to create a more meaningful representation what people usually do is to compare the word counts observed in a document with that of a larger body of text.

Tf-idf is the frequency of a term adjusted for how rarely it is used. It is intended to measure how important a word is to a document in a collection (or corpus) of documents.

The inverse document frequency for any given term is defined as:

$$idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$

We can use tidy data principles to approach tf-idf analysis and use consistent, effective tools to quantify how important various terms are in a document that is part of a collection.

What do Rick and Morty say?

Let’s start by looking at Rick and Morty dialogues and examine first term frequency, then tf-idf. I’ll analyze this removing stopwords beforehand.

if (!require("pacman")) install.packages("pacman") p_load(data.table,tidyr,stringr,tidytext,dplyr,janitor,ggplot2,viridis,ggstance,igraph) p_load_gh("thomasp85/ggraph","dgrtwo/widyr") rick_and_morty_subs = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/rick_and_morty_subs.csv")) rick_and_morty_subs_tidy = rick_and_morty_subs %>% unnest_tokens(word,text) %>% anti_join(stop_words) %>% count(season, word, sort = TRUE) total_words <- rick_and_morty_subs_tidy %>% group_by(season) %>% summarize(total = sum(n)) season_words <- left_join(rick_and_morty_subs_tidy, total_words) season_words # A tibble: 11,933 x 4 season word n total 1 S01 morty 1283 19381 2 S01 rick 1132 19381 3 S01 jerry 475 19381 4 S03 morty 331 13008 5 S03 rick 251 13008 6 S02 rick 242 12829 7 S02 morty 228 12829 8 S01 beth 224 19381 9 S01 summer 215 19381 10 S01 yeah 209 19381 # ... with 11,923 more rows

Let’s look at the distribution of n/total for each season, the number of times a word appears in a season divided by the total number of terms (words) in that season. This is term frequency!

ggplot(season_words, aes(n/total, fill = season)) + geom_histogram(alpha = 0.8, show.legend = FALSE) + xlim(0, 0.001) + labs(title = "Term Frequency Distribution in Rick and Morty' Seasons", y = "Count") + facet_wrap(~season, nrow = 3, scales = "free_y") + theme_minimal(base_size = 13) + scale_fill_viridis(end = 0.85, discrete=TRUE) + theme(strip.text=element_text(hjust=0)) + theme(strip.text = element_text(face = "italic"))

There are very long tails to the right for these dialogues because of the extremely common words. These plots exhibit similar distributions for each season, with many words that occur rarely and fewer words that occur frequently. The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents, in this case, the group of Rick and Morty’ seasons as a whole. Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common. Let’s do that now.

season_words <- season_words %>% bind_tf_idf(word, season, n) season_words # A tibble: 11,933 x 7 season word n total tf idf tf_idf 1 S01 morty 1283 19381 0.06619885 0 0 2 S01 rick 1132 19381 0.05840772 0 0 3 S01 jerry 475 19381 0.02450854 0 0 4 S03 morty 331 13008 0.02544588 0 0 5 S03 rick 251 13008 0.01929582 0 0 6 S02 rick 242 12829 0.01886351 0 0 7 S02 morty 228 12829 0.01777223 0 0 8 S01 beth 224 19381 0.01155771 0 0 9 S01 summer 215 19381 0.01109334 0 0 10 S01 yeah 209 19381 0.01078376 0 0 # ... with 11,923 more rows

Notice that idf and thus tf-idf are zero for the extremely common words after removing stopwords. These are all words that appear all the time on every chapter, so the idf term (which will then be the natural log of 1) is zero, and “Rick” and “Morty” are examples of this. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how this approach decreases the weight for common words. The inverse document frequency will be a higher number for words that occur in fewer of the documents in the collection. Let’s look at terms with high tf-idf.

season_words %>% select(-total) %>% arrange(desc(tf_idf)) # A tibble: 11,933 x 6 season word n tf idf tf_idf 1 S03 pickle 43 0.003305658 1.098612 0.003631637 2 S02 unity 32 0.002494349 1.098612 0.002740322 3 S01 meeseeks 44 0.002270265 1.098612 0.002494141 4 S03 vindicators 26 0.001998770 1.098612 0.002195873 5 S02 purge 25 0.001948710 1.098612 0.002140877 6 S01 flu 32 0.001651102 1.098612 0.001813921 7 S01 crystals 30 0.001547908 1.098612 0.001700550 8 S03 tommy 20 0.001537515 1.098612 0.001689133 9 S02 deer 19 0.001481020 1.098612 0.001627066 10 S03 noob 18 0.001383764 1.098612 0.001520220 # ... with 11,923 more rows

Curious about “pickle”? You’d better watch Picle Rick episode if you don’t get why “pickle” is the highest tf-idf ranked term. “Vindicator” is another term that is concentrated in one episode where Vindicators appear. There’s even an episode where flu is a part of the central problem and Rick has to use his mind to try to solve a flu of of control because of his inventions.

Some of the values for idf are the same for different terms because there are 6 documents in this corpus and we are seeing the numerical value for ln(6/1), ln(6/2), etc. Let’s look at a visualization for these high tf-idf words.

plot_tfidf <- season_words %>% arrange(desc(tf_idf)) %>% mutate(word = factor(word, levels = rev(unique(word)))) ggplot(plot_tfidf[1:20,], aes(tf_idf, word, fill = season, alpha = tf_idf)) + geom_barh(stat = "identity") + labs(title = "Highest tf-idf words in Rick and Morty' Seasons", y = NULL, x = "tf-idf") + theme_minimal(base_size = 13) + scale_alpha_continuous(range = c(0.6, 1), guide = FALSE) + scale_x_continuous(expand=c(0,0)) + scale_fill_viridis(end = 0.85, discrete=TRUE) + theme(legend.title=element_blank()) + theme(legend.justification=c(1,0), legend.position=c(1,0))

Let’s look at the seasons individually.

plot_tfidf <- plot_tfidf %>% group_by(season) %>% top_n(15) %>% ungroup() ggplot(plot_tfidf, aes(tf_idf, word, fill = season, alpha = tf_idf)) + geom_barh(stat = "identity", show.legend = FALSE) + labs(title = "Highest tf-idf words in Rick and Morty' Seasons", y = NULL, x = "tf-idf") + facet_wrap(~season, nrow = 3, scales = "free") + theme_minimal(base_size = 13) + scale_alpha_continuous(range = c(0.6, 1)) + scale_x_continuous(expand=c(0,0)) + scale_fill_viridis(end = 0.85, discrete=TRUE) + theme(strip.text=element_text(hjust=0)) + theme(strip.text = element_text(face = "italic"))

if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var align = "center", indent = "0em", linebreak = "false";

if (false) { align = (screen.width < 768) ? "left" : align; indent = (screen.width < 768) ? "0em" : indent; linebreak = (screen.width < 768) ? 'true' : linebreak; } var mathjaxscript = document.createElement('script'); var location_protocol = (false) ? 'https' : document.location.protocol; if (location_protocol !== 'http' && location_protocol !== 'https') location_protocol = 'https:'; mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '../mathjax/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? "innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: '"+ align +"'," + " displayIndent: '"+ indent +"'," + " showMathMenu: true," + " messageStyle: 'normal'," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," + " linebreaks: { automatic: "+ linebreak +", width: '90% container' }," + " }, " + "}); " + "if ('default' !== 'default') {" + "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" + "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" + "VARIANT['normal'].fonts.unshift('MathJax_default');" + "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" + "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" + "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" + "});" + "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" + "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" + "VARIANT['normal'].fonts.unshift('MathJax_default');" + "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" + "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" + "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" + "});" + "}"; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); }

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Pachá (Batteries Included). R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

RTutor: Pollution Reduction by Wind Energy

Fri, 10/20/2017 - 23:00

(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

How much pollution reduction do we get from a MWh electricity produced by wind or solar power? In principle, this corresponds to the avoided pollution of a fossil fuel plant that reduces its output due to the higher production of renewable electricity. Which fossil fuel plant reduces its output and to which degree is not straightforward, however. It depends on the flexibility of the power plants, the predictability of the wind supply and possible network congestion.

Kevin Novan studies the pollution reduction of wind energy in his very interesting article “Valuing the Wind: Renewable Energy Policies and Air Pollution Avoided” (AEJ, Economic Policy 2015) using detailed time series data from Texas.

As part of her Master Thesis at Ulm University, Anna Sophie Barann has generated a nice RTutor problem set that allows you to replicate the main insights of the article in an interactive fashion. You learn about R, econometrics and the electricity sector at the same time.

Here is screenshoot:

Like in previous RTutor problem sets, you can enter free R code in a web based shiny app. The code will be automatically checked and you can get hints how to procceed. In addition you are challenged by many multiple choice quizzes.

To install the problem set the problem set locally, first install RTutor as explained here:

https://github.com/skranz/RTutor

and then install the problem set package:

https://github.com/asbara/RTutorPollutionReductions

There is also an online version hosted by shinyapps.io that allows you explore the problem set without any local installation. (The online version is capped at 25 hours total usage time per month. So it may be greyed out when you click at it.)

https://asbara.shinyapps.io/RTutorPollutionReductions/

If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at the RTutor Github page

https://github.com/skranz/RTutor

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

An Updated History of R

Fri, 10/20/2017 - 20:56

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Here's a refresher on the history of the R project:

  • 1992: R development begins as a research project in Auckland, NZ by Robert Gentleman and Ross Ihaka 
  • 1993: First binary versions of R published at Statlib [see update, below]
  • 1995: R first distributed as open-source software, under GPL2 license
  • 1997: R core group formed
  • 1997: CRAN founded (by Kurt Hornik and Fritz Leisch)
  • 1999: The R website, r-project.org, founded
  • 2000: R 1.0.0 released (February 29) 
  • 2001: R News founded (later to become the R Journal)
  • 2003: R Foundation founded
  • 2004: First UseR! conference (in Vienna)
  • 2004: R 2.0.0 released
  • 2009: First edition of the R Journal
  • 2013: R 3.0.0 released
  • 2015: R Consortium founded, with R Foundation participation
  • 2016: New R logo adoptedt

I've added some additional dates gleaned from the r-announce mailing list archives and a 1998 paper on the history of R written by co-founder Ross-Ihaka.

According to the paper, "R began as an experiment in trying to use the methods of Lisp implementors to build a small testbed which could be used to trial some ideas on how a statistical environment might be built." It all stared when

… Robert Gentleman and I became colleagues at The University of Auckland. We both had an interest in statistical computing and saw a common need for a better software environment in our Macintosh teaching laboratory. We saw no suitable commercial environment and we began to experiment to see what might be involved in developing one ourselves.

The paper provides fascinating insights into the beginnings of the R project, and the similarities and differences between it and the S language that preceded it. It's also interesting to see the future goals of the R project as envisioned back in 1998: "to produce a fee implementation of something 'close to' version 3 of the S language"; "development of an integrated user interface"; to get substantial use out of R for statistical work and teaching". I think it's fair to say that in all those areas, especially the latter, the R project has succeeded beyond measure.

Ross Ihaka: R : Past and Future History (PDF) (via Jesse Maegan)

Update: The first public announcement about R appears to have been a post by Ross Ihaka to the s-news mailing list on August 4, 1993. This pre-dates the online archives, but I reproduce it below (with some minor formatting) for the record. (Grateful thanks to reader SK for providing this email from personal archives.)

Date: Wed, 4 Aug 93 14:01:31 NZS
From: Ross Ihaka
To: S-news@utstat.toronto.edu
Subject: Re: Is  S  available for a Macintosh personal computer?

Joseph B Kruskal writes:

If anyone knows of an S available for a Macintosh computer,
I would be pleased to hear about it. 

About a year ago Robert Gentleman and I considered the problem of obtaining decent statistical software for our undergraduate Macintosh lab.  After considering the options, we decided that the most satisfactory alternative was to write our own.  We started by writing a small lisp interpreter.  Next we expanded its data structures with atomic vector types and altered its evaluation semantics to include lazy evaluation of closure arguments and argument binding by tag as well as order.  Finally we added some syntactic sugar to make it look somewhat like S.  We call the result "R".

R is not ready for release at this point, but we are committed to having it ready by March next year (when we start teaching with it).

Because there is likely to be some interest in it we are going to put some (SPARC/SGI/Macintosh) binaries on Statlib later this week so people can give it a test drive.

I'll send out a short note when the binaries are available.

THIS IS NOT AN OFFICIAL RELEASE.  PLEASE DON'T BUG US ABOUT IT.  WE ARE FULLY OCCUPIED WITH TEACHING AND ADMINISTRATIVE WORK AT PRESENT.

WHEN THE SOURCE CODE IS STABLE, WE WILL MAKE IT AVAILABLE THROUGH STATLIB.

        (Robert will be at the ASA meetings in S.F. next week ...)

Some Notes About R
------------------

1. We have tried to make R small. We use mark/sweep garbage collection and reference counting to keep memory demands low. Our primary target platform is the Macintosh LC-II with 2-4Mb of memory.

2. A side effect of 1) is that the interpreter seems to be fairly fast, particularly at looping and array mutation.

3. We are trying to make R portable. We have used ANSI C and f2c (or native unix f77) with as little use of host operating system features as possible. At present we have verified portability to Unix, DOS, MacOS and we expect to be able to port easily to any system with an ANSI C Compiler.

4. R should look familiar to S users, but some of the semantics are closer to those of Scheme.  For example, We have abandoned the idea of call-frames in favour of true lexical scoping because it provides a better way of retaining state from function call to function call.

Ross Ihaka

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Install Useful Eclipse Plugins in Bio7 for R, Data Science and Programming

Fri, 10/20/2017 - 17:12

(This article was first published on R – Bio7 Website, and kindly contributed to R-bloggers)

20.10.2017

Beside a massive of amount of R packages and ImageJ plugins Bio7 can be extended with Eclipse plugins useful for data science and programming.

Some of them could also be very useful for R related developments (e.g., to develop R packages or distribute Shiny apps).

Installation of Eclipse Plugins

One  way to install Eclipse plugins is by using the Update Manager in the help menu (Help->Install New Software).

Thus the first step to install a plugin is to open the Update Manager and selecting the category “Eclipse Repository – http://download.eclipse.org/releases/neon” (see screenshot below!)

Another handy way of searching and installing Eclipse plugins is by using the Marketplace Client plugin which must be installed firstly with the Update Manager.

Search for the Marketplace Client (e.g., “marketplace”) plugin to filter, select and install the Marketplace Client.

The following plugins can be installed using either the Marketplace Client or the Update Manager.

Database Remote Files and Shells

EGit

https://marketplace.eclipse.org/content/egit-git-team-provider
The powerful Eclipse Git plugin to connect Bio7 with Git repositories.

TM-Terminal

http://marketplace.eclipse.org/content/tcf-terminals
A true native terminal with SSH,Telnet and Serial line support. Bio7 comes already with a pseudo-terminal but lacks the support of a pty connection and ANSI escape sequences (but can be controlled from a Bio7 API interface). The terminal can also be installed with the Update-manager searching for “tm” (see screenshot below).

Remote System Explorer

https://marketplace.eclipse.org/content/remote-system-explorer-ssh-telnet-ftp-and-dstore-protocols
A plugin to connect Bio7 with SSH, Telnet, FTP and DStore protocols.This plugin work very well with a remote connection and makes it easy to transfer data from a local workspace to a remote server. Remote files can be opened, edited and stored with the available Bio7 editors (e.g., a *.html file with the JavaFX editor – see screenshot below).
Apropos, please use an OpenSSH key for, e.g. a SFTPconnection!). It works great together with the TM-Terminal plugin.

DBWeaver

https://marketplace.eclipse.org/content/dbeaver
“DBeaver is free universal SQL client/database tool for developers and database administrators. It can work with any database server which has JDBC driver.” (source: website DBWeaver)
DBWeaver is a great database tool and plugin for Eclipse and can also be installed into Bio7 to control and edit R database connections.

Scripting Editors with Bio7 interpret support

The following three editors can be used within the current Bio7 process to execute scripts using the Bio7 interface. It is important however to create a language project first to profit from the advanced editor features.

Bio7 will recognize the opened editor type and interpret the current opened script with the embedded script interpreter of Bio7 when using the dedicated toolbar action.

Please use the context menu action “Open with” to open the scripts with the installed editors instead of the default Bio7 editors (Groovy, Jython/Python editors are available by default in Bio7).

Here a video installing and using the JavaScript editor as an general example.

JSDT

https://eclipse.org/webtools/jsdt/
JavaScript editor and development tools. Install with the Update-manager (search for JavaScript – see video above).

PyDev Editor

https://marketplace.eclipse.org/content/pydev-python-ide-eclipse
A very powerful Python editor which can be used to execute Bio7 Jython and Python scripts instead of the Bio7 default editor.
In addition Bio7 can use the PyDev editor to execute scripts running on the Bio7 Java classpath. If you open Jython/Python scripts with the PyDev editor the default Bio7 action will be visible to execute the scripts on the Bio7 classpath.

Groovy Editor

https://github.com/groovy/groovy-eclipse/wiki
(Add site: http://dist.springsource.org/snapshot/GRECLIPSE/e4.7/ to the Update Manager)
A more powerful Groovy editor. Can also be used with the Bio7 classpath if scripts are opened with this editor instead.

General programming editors and extensions

Eclipse CDT

https://marketplace.eclipse.org/content/eclipse-cdt-cc-development-tooling
Powerful C/C++ editor and development tooling.

Arduino C++ IDE:

https://marketplace.eclipse.org/content/arduino-c-ide
A great Arduino extension of the C++ editor of Eclipse. See this video for an introduction. Bio7 embedds some Java libs to communicate with an Arduino (see Bio7 examples).

Photran (comes with PTP)

https://eclipse.org/ptp/
http://www.eclipse.org/photran/
An Integrated Development Environment and Refactoring Tool for Fortran

Julia Editor

https://github.com/JuliaComputing/JuliaDT
A Julia editor in an early stage. But very promising.

Bash Editor

https://marketplace.eclipse.org/content/bash-editor
A bash editor for bash scripts. Perfect for the use with the above linked TM-Terminal and Remote Systems Explorer.

Jeeeyul’s Eclipse Themes

https://marketplace.eclipse.org/content/jeeeyuls-eclipse-themes
To install more themes into Bio7.

pde-tools

https://github.com/jeeeyul/pde-tools#crazy-outline
A plugin which adds clipboard history support and a crazy outline view.

WindowBuilder

https://marketplace.eclipse.org/content/windowbuilder
A very powerful builder for Graphical User Interfaces for SWT, etc.
In Bio7 a wizard is available to create a template for the builder (File->New->Bio7 Java Projects->Bio7 Java Project (WindowBuilder template)).
You can then open this template with the WindowBuilder to create interfaces (e.g., to create R interfaces using the new Bio7 R API). The created interface can be compiled dynamically with Bio7 (compiling the main class) and executed instantly in the current Bio7 process (see screenshot below)

More plugins can be found with the Eclipse Marketplace Client. It depends however how useful the plugins are in the context of Bio7.

 

 

 

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Bio7 Website. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Practical Machine Learning with R and Python – Part 3

Fri, 10/20/2017 - 15:27

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

In this post ‘Practical Machine Learning with R and Python – Part 3’,  I discuss ‘Feature Selection’ methods. This post is a continuation of my 2 earlier posts

  1. Practical Machine Learning with R and Python – Part 1
  2. Practical Machine Learning with R and Python – Part 2

While applying Machine Learning techniques, the data set will usually include a large number of predictors for a target variable. It is quite likely, that not all the predictors or feature variables will have an impact on the output. Hence it is becomes necessary to choose only those features which influence the output variable thus simplifying  to a reduced feature set on which to train the ML model on. The techniques that are used are the following

  • Best fit
  • Forward fit
  • Backward fit
  • Ridge Regression or L2 regularization
  • Lasso or L1 regularization

This post includes the equivalent ML code in R and Python.

All these methods remove those features which do not sufficiently influence the output. As in my previous 2 posts on “Practical Machine Learning with R and Python’, this post is largely based on the topics in the following 2 MOOC courses
1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibesherani, Online Stanford
2. Applied Machine Learning in Python Prof Kevyn-Collin Thomson, University Of Michigan, Coursera

You can download this R Markdown file and the associated data from Github – Machine Learning-RandPython-Part3. 

1.1 Best Fit

For a dataset with features f1,f2,f3…fn, the ‘Best fit’ approach, chooses all possible combinations of features and creates separate ML models for each of the different combinations. The best fit algotithm then uses some filtering criteria based on Adj Rsquared, Cp, BIC or AIC to pick out the best model among all models.

Since the Best Fit approach searches the entire solution space it is computationally infeasible. The number of models that have to be searched increase exponentially as the number of predictors increase. For ‘p’ predictors a total of ML models have to be searched. This can be shown as follows

There are ways to choose single feature ML models among ‘n’ features, ways to choose 2 feature models among ‘n’ models and so on, or

= Total number of models in Best Fit.  Since from Binomial theorem we have

When x=1 in the equation (1) above, this becomes

Hence there are models to search amongst in Best Fit. For 10 features this is or ~1000 models and for 40 features this becomes which almost 1 trillion. Usually there are datasets with 1000 or maybe even 100000 features and Best fit becomes computationally infeasible.

Anyways I have included the Best Fit approach as I use the Boston crime datasets which is available both the MASS package in R and Sklearn in Python and it has 13 features. Even this small feature set takes a bit of time since the Best fit needs to search among ~  models

Initially I perform a simple Linear Regression Fit to estimate the features that are statistically insignificant. By looking at the p-values of the features it can be seen that ‘indus’ and ‘age’ features have high p-values and are not significant

1.1a Linear Regression – R code source('RFunctions-1.R') #Read the Boston crime data df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL # Rename the columns names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") # Select specific columns df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") dim(df1) ## [1] 506 14 # Linear Regression fit fit <- lm(cost~. ,data=df1) summary(fit) ## ## Call: ## lm(formula = cost ~ ., data = df1) ## ## Residuals: ## Min 1Q Median 3Q Max ## -15.595 -2.730 -0.518 1.777 26.199 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 *** ## crimeRate -1.080e-01 3.286e-02 -3.287 0.001087 ** ## zone 4.642e-02 1.373e-02 3.382 0.000778 *** ## indus 2.056e-02 6.150e-02 0.334 0.738288 ## charles 2.687e+00 8.616e-01 3.118 0.001925 ** ## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 *** ## rooms 3.810e+00 4.179e-01 9.116 < 2e-16 *** ## age 6.922e-04 1.321e-02 0.052 0.958229 ## distances -1.476e+00 1.995e-01 -7.398 6.01e-13 *** ## highways 3.060e-01 6.635e-02 4.613 5.07e-06 *** ## tax -1.233e-02 3.760e-03 -3.280 0.001112 ** ## teacherRatio -9.527e-01 1.308e-01 -7.283 1.31e-12 *** ## color 9.312e-03 2.686e-03 3.467 0.000573 *** ## status -5.248e-01 5.072e-02 -10.347 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 4.745 on 492 degrees of freedom ## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338 ## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16

Next we apply the different feature selection models to automatically remove features that are not significant below

1.1a Best Fit – R code

The Best Fit requires the ‘leaps’ R package

library(leaps) source('RFunctions-1.R') #Read the Boston crime data df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL # Rename the columns names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") # Select specific columns df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") # Perform a best fit bestFit=regsubsets(cost~.,df1,nvmax=13) # Generate a summary of the fit bfSummary=summary(bestFit) # Plot the Residual Sum of Squares vs number of variables plot(bfSummary$rss,xlab="Number of Variables",ylab="RSS",type="l",main="Best fit RSS vs No of features") # Get the index of the minimum value a=which.min(bfSummary$rss) # Mark this in red points(a,bfSummary$rss[a],col="red",cex=2,pch=20)

The plot below shows that the Best fit occurs with all 13 features included. Notice that there is no significant change in RSS from 11 features onward.

# Plot the CP statistic vs Number of variables plot(bfSummary$cp,xlab="Number of Variables",ylab="Cp",type='l',main="Best fit Cp vs No of features") # Find the lowest CP value b=which.min(bfSummary$cp) # Mark this in red points(b,bfSummary$cp[b],col="red",cex=2,pch=20)

Based on Cp metric the best fit occurs at 11 features as seen below. The values of the coefficients are also included below

# Display the set of features which provide the best fit coef(bestFit,b) ## (Intercept) crimeRate zone charles nox ## 36.341145004 -0.108413345 0.045844929 2.718716303 -17.376023429 ## rooms distances highways tax teacherRatio ## 3.801578840 -1.492711460 0.299608454 -0.011777973 -0.946524570 ## color status ## 0.009290845 -0.522553457 # Plot the BIC value plot(bfSummary$bic,xlab="Number of Variables",ylab="BIC",type='l',main="Best fit BIC vs No of Features") # Find and mark the min value c=which.min(bfSummary$bic) points(c,bfSummary$bic[c],col="red",cex=2,pch=20)

# R has some other good plots for best fit plot(bestFit,scale="r2",main="Rsquared vs No Features")

R has the following set of really nice visualizations. The plot below shows the Rsquared for a set of predictor variables. It can be seen when Rsquared starts at 0.74- indus, charles and age have not been included. 

plot(bestFit,scale="Cp",main="Cp vs NoFeatures")

The Cp plot below for value shows indus, charles and age as not included in the Best fit

plot(bestFit,scale="bic",main="BIC vs Features")

1.1b Best fit (Exhaustive Search ) – Python code

The Python package for performing a Best Fit is the Exhaustive Feature Selector EFS.

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS # Read the Boston crime data df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") #Rename the columns df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status","cost"] # Set X and y X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status"]] y=df['cost'] # Perform an Exhaustive Search. The EFS and SFS packages use 'neg_mean_squared_error'. The 'mean_squared_error' seems to have been deprecated. I think this is just the MSE with the a negative sign. lr = LinearRegression() efs1 = EFS(lr, min_features=1, max_features=13, scoring='neg_mean_squared_error', print_progress=True, cv=5) # Create a efs fit efs1 = efs1.fit(X.as_matrix(), y.as_matrix()) print('Best negtive mean squared error: %.2f' % efs1.best_score_) ## Print the IDX of the best features print('Best subset:', efs1.best_idx_) Features: 8191/8191Best negtive mean squared error: -28.92 ## ('Best subset:', (0, 1, 4, 6, 7, 8, 9, 10, 11, 12))

The indices for the best subset are shown above.

1.2 Forward fit

Forward fit is a greedy algorithm that tries to optimize the feature selected, by minimizing the selection criteria (adj Rsqaured, Cp, AIC or BIC) at every step. For a dataset with features f1,f2,f3…fn, the forward fit starts with the NULL set. It then pick the ML model with a single feature from n features which has the highest adj Rsquared, or minimum Cp, BIC or some such criteria. After picking the 1 feature from n which satisfies the criteria the most, the next feature from the remaining n-1 features is chosen. When the 2 feature model which satisfies the selection criteria the best is chosen, another feature from the remaining n-2 features are added and so on. The forward fit is a sub-optimal algorithm. There is no guarantee that the final list of features chosen will be the best among the lot. The computation required for this is of  which is of the order of . Though forward fit is a sub optimal solution it is far more computationally efficient than best fit

1.2a Forward fit – R code

Forward fit in R determines that 11 features are required for the best fit. The features are shown below

library(leaps) # Read the data df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL # Rename the columns names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") # Select columns df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") #Split as training and test train_idx <- trainTestSplit(df1,trainPercent=75,seed=5) train <- df1[train_idx, ] test <- df1[-train_idx, ] # Find the best forward fit fitFwd=regsubsets(cost~.,data=train,nvmax=13,method="forward") # Compute the MSE valErrors=rep(NA,13) test.mat=model.matrix(cost~.,data=test) for(i in 1:13){ coefi=coef(fitFwd,id=i) pred=test.mat[,names(coefi)]%*%coefi valErrors[i]=mean((test$cost-pred)^2) } # Plot the Residual Sum of Squares plot(valErrors,xlab="Number of Variables",ylab="Validation Error",type="l",main="Forward fit RSS vs No of features") # Gives the index of the minimum value a<-which.min(valErrors) print(a) ## [1] 11 # Highlight the smallest value points(c,valErrors[a],col="blue",cex=2,pch=20)

Forward fit R selects 11 predictors as the best ML model to predict the ‘cost’ output variable. The values for these 11 predictors are included below

#Print the 11 ccoefficients coefi=coef(fitFwd,id=i) coefi ## (Intercept) crimeRate zone indus charles ## 2.397179e+01 -1.026463e-01 3.118923e-02 1.154235e-04 3.512922e+00 ## nox rooms age distances highways ## -1.511123e+01 4.945078e+00 -1.513220e-02 -1.307017e+00 2.712534e-01 ## tax teacherRatio color status ## -1.330709e-02 -8.182683e-01 1.143835e-02 -3.750928e-01 1.2b Forward fit with Cross Validation – R code

The Python package SFS includes N Fold Cross Validation errors for forward and backward fit so I decided to add this code to R. This is not available in the ‘leaps’ R package, however the implementation is quite simple. Another implementation is also available at Statistical Learning, Prof Trevor Hastie & Prof Robert Tibesherani, Online Stanford 2.

library(dplyr) df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") # Select columns df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") set.seed(6) # Set max number of features nvmax<-13 cvError <- NULL # Loop through each features for(i in 1:nvmax){ # Set no of folds noFolds=5 # Create the rows which fall into different folds from 1..noFolds folds = sample(1:noFolds, nrow(df1), replace=TRUE) cv<-0 # Loop through the folds for(j in 1:noFolds){ # The training is all rows for which the row is != j (k-1 folds -> training) train <- df1[folds!=j,] # The rows which have j as the index become the test set test <- df1[folds==j,] # Create a forward fitting model for this fitFwd=regsubsets(cost~.,data=train,nvmax=13,method="forward") # Select the number of features and get the feature coefficients coefi=coef(fitFwd,id=i) #Get the value of the test data test.mat=model.matrix(cost~.,data=test) # Multiply the tes data with teh fitted coefficients to get the predicted value # pred = b0 + b1x1+b2x2... b13x13 pred=test.mat[,names(coefi)]%*%coefi # Compute mean squared error rss=mean((test$cost - pred)^2) # Add all the Cross Validation errors cv=cv+rss } # Compute the average of MSE for K folds for number of features 'i' cvError[i]=cv/noFolds } a <- seq(1,13) d <- as.data.frame(t(rbind(a,cvError))) names(d) <- c("Features","CVError") #Plot the CV Error vs No of Features ggplot(d,aes(x=Features,y=CVError),color="blue") + geom_point() + geom_line(color="blue") + xlab("No of features") + ylab("Cross Validation Error") + ggtitle("Forward Selection - Cross Valdation Error vs No of Features")

Forward fit with 5 fold cross validation indicates that all 13 features are required

# This gives the index of the minimum value a=which.min(cvError) print(a) ## [1] 13 #Print the 13 coefficients of these features coefi=coef(fitFwd,id=a) coefi ## (Intercept) crimeRate zone indus charles ## 36.650645380 -0.107980979 0.056237669 0.027016678 4.270631466 ## nox rooms age distances highways ## -19.000715500 3.714720418 0.019952654 -1.472533973 0.326758004 ## tax teacherRatio color status ## -0.011380750 -0.972862622 0.009549938 -0.582159093 1.2c Forward fit – Python code

The Backward Fit in Python uses the Sequential feature selection (SFS) package (SFS)(https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/)

Note: The Cross validation error for SFS in Sklearn is negative, possibly because it computes the ‘neg_mean_squared_error’. The earlier ‘mean_squared_error’ in the package seems to have been deprecated. I have taken the -ve of this neg_mean_squared_error. I think this would give mean_squared_error.

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.datasets import load_boston from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs import matplotlib.pyplot as plt from mlxtend.feature_selection import SequentialFeatureSelector as SFS from sklearn.linear_model import LinearRegression df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") #Rename the columns df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status","cost"] X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status"]] y=df['cost'] lr = LinearRegression() # Create a forward fit model sfs = SFS(lr, k_features=(1,13), forward=True, # Forward fit floating=False, scoring='neg_mean_squared_error', cv=5) # Fit this on the data sfs = sfs.fit(X.as_matrix(), y.as_matrix()) # Get all the details of the forward fits a=sfs.get_metric_dict() n=[] o=[] # Compute the mean cross validation scores for i in np.arange(1,13): n.append(-np.mean(a[i]['cv_scores'])) m=np.arange(1,13) # Get the index of the minimum CV score # Plot the CV scores vs the number of features fig1=plt.plot(m,n) fig1=plt.title('Mean CV Scores vs No of features') fig1.figure.savefig('fig1.png', bbox_inches='tight') print(pd.DataFrame.from_dict(sfs.get_metric_dict(confidence_interval=0.90)).T) idx = np.argmin(n) print "No of features=",idx #Get the features indices for the best forward fit and convert to list b=list(a[idx]['feature_idx']) print(b) # Index the column names. # Features from forward fit print("Features selected in forward fit") print(X.columns[b]) ## avg_score ci_bound cv_scores \ ## 1 -42.6185 19.0465 [-23.5582499971, -41.8215743748, -73.993608929... ## 2 -36.0651 16.3184 [-18.002498199, -40.1507894517, -56.5286659068... ## 3 -34.1001 20.87 [-9.43012884381, -25.9584955394, -36.184188174... ## 4 -33.7681 20.1638 [-8.86076528781, -28.650217633, -35.7246353855... ## 5 -33.6392 20.5271 [-8.90807628524, -28.0684679108, -35.827463022... ## 6 -33.6276 19.0859 [-9.549485942, -30.9724602876, -32.6689523347,... ## 7 -32.4082 19.1455 [-10.0177149635, -28.3780298492, -30.926917231... ## 8 -32.3697 18.533 [-11.1431684243, -27.5765510172, -31.168994094... ## 9 -32.4016 21.5561 [-10.8972555995, -25.739780653, -30.1837430353... ## 10 -32.8504 22.6508 [-12.3909282079, -22.1533250755, -33.385407342... ## 11 -34.1065 24.7019 [-12.6429253721, -22.1676650245, -33.956999528... ## 12 -35.5814 25.693 [-12.7303397453, -25.0145323483, -34.211898373... ## 13 -37.1318 23.2657 [-12.4603005692, -26.0486211062, -33.074137979... ## ## feature_idx std_dev std_err ## 1 (12,) 18.9042 9.45212 ## 2 (10, 12) 16.1965 8.09826 ## 3 (10, 12, 5) 20.7142 10.3571 ## 4 (10, 3, 12, 5) 20.0132 10.0066 ## 5 (0, 10, 3, 12, 5) 20.3738 10.1869 ## 6 (0, 3, 5, 7, 10, 12) 18.9433 9.47167 ## 7 (0, 2, 3, 5, 7, 10, 12) 19.0026 9.50128 ## 8 (0, 1, 2, 3, 5, 7, 10, 12) 18.3946 9.19731 ## 9 (0, 1, 2, 3, 5, 7, 10, 11, 12) 21.3952 10.6976 ## 10 (0, 1, 2, 3, 4, 5, 7, 10, 11, 12) 22.4816 11.2408 ## 11 (0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12) 24.5175 12.2587 ## 12 (0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12) 25.5012 12.7506 ## 13 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) 23.0919 11.546 ## No of features= 7 ## [0, 2, 3, 5, 7, 10, 12] ## ################################################################################# ## Features selected in forward fit ## Index([u'crimeRate', u'indus', u'chasRiver', u'rooms', u'distances', ## u'teacherRatio', u'status'], ## dtype='object')

The table above shows the average score, 10 fold CV errors, the features included at every step, std. deviation and std. error

The above plot indicates that 8 features provide the lowest Mean CV error

1.3 Backward Fit

Backward fit belongs to the class of greedy algorithms which tries to optimize the feature set, by dropping a feature at every stage which results in the worst performance for a given criteria of Adj RSquared, Cp, BIC or AIC. For a dataset with features f1,f2,f3…fn, the backward fit starts with the all the features f1,f2.. fn to begin with. It then pick the ML model with a n-1 features by dropping the feature,, for e.g., the inclusion of which results in the worst performance in adj Rsquared, or minimum Cp, BIC or some such criteria. At every step 1 feature is dopped. There is no guarantee that the final list of features chosen will be the best among the lot. The computation required for this is of which is of the order of . Though backward fit is a sub optimal solution it is far more computationally efficient than best fit

1.3a Backward fit – R code library(dplyr) # Read the data df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL # Rename the columns names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") # Select columns df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") set.seed(6) # Set max number of features nvmax<-13 cvError <- NULL # Loop through each features for(i in 1:nvmax){ # Set no of folds noFolds=5 # Create the rows which fall into different folds from 1..noFolds folds = sample(1:noFolds, nrow(df1), replace=TRUE) cv<-0 for(j in 1:noFolds){ # The training is all rows for which the row is != j train <- df1[folds!=j,] # The rows which have j as the index become the test set test <- df1[folds==j,] # Create a backward fitting model for this fitFwd=regsubsets(cost~.,data=train,nvmax=13,method="backward") # Select the number of features and get the feature coefficients coefi=coef(fitFwd,id=i) #Get the value of the test data test.mat=model.matrix(cost~.,data=test) # Multiply the tes data with teh fitted coefficients to get the predicted value # pred = b0 + b1x1+b2x2... b13x13 pred=test.mat[,names(coefi)]%*%coefi # Compute mean squared error rss=mean((test$cost - pred)^2) # Add the Residual sum of square cv=cv+rss } # Compute the average of MSE for K folds for number of features 'i' cvError[i]=cv/noFolds } a <- seq(1,13) d <- as.data.frame(t(rbind(a,cvError))) names(d) <- c("Features","CVError") # Plot the Cross Validation Error vs Number of features ggplot(d,aes(x=Features,y=CVError),color="blue") + geom_point() + geom_line(color="blue") + xlab("No of features") + ylab("Cross Validation Error") + ggtitle("Backward Selection - Cross Valdation Error vs No of Features")

# This gives the index of the minimum value a=which.min(cvError) print(a) ## [1] 13 #Print the 13 coefficients of these features coefi=coef(fitFwd,id=a) coefi ## (Intercept) crimeRate zone indus charles ## 36.650645380 -0.107980979 0.056237669 0.027016678 4.270631466 ## nox rooms age distances highways ## -19.000715500 3.714720418 0.019952654 -1.472533973 0.326758004 ## tax teacherRatio color status ## -0.011380750 -0.972862622 0.009549938 -0.582159093

Backward selection in R also indicates the 13 features and the corresponding coefficients as providing the best fit

1.3b Backward fit – Python code

The Backward Fit in Python uses the Sequential feature selection (SFS) package (SFS)(https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/)

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs import matplotlib.pyplot as plt from mlxtend.feature_selection import SequentialFeatureSelector as SFS from sklearn.linear_model import LinearRegression # Read the data df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") #Rename the columns df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status","cost"] X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status"]] y=df['cost'] lr = LinearRegression() # Create the SFS model sfs = SFS(lr, k_features=(1,13), forward=False, # Backward floating=False, scoring='neg_mean_squared_error', cv=5) # Fit the model sfs = sfs.fit(X.as_matrix(), y.as_matrix()) a=sfs.get_metric_dict() n=[] o=[] # Compute the mean of the validation scores for i in np.arange(1,13): n.append(-np.mean(a[i]['cv_scores'])) m=np.arange(1,13) # Plot the Validation scores vs number of features fig2=plt.plot(m,n) fig2=plt.title('Mean CV Scores vs No of features') fig2.figure.savefig('fig2.png', bbox_inches='tight') print(pd.DataFrame.from_dict(sfs.get_metric_dict(confidence_interval=0.90)).T) # Get the index of minimum cross validation error idx = np.argmin(n) print "No of features=",idx #Get the features indices for the best forward fit and convert to list b=list(a[idx]['feature_idx']) # Index the column names. # Features from backward fit print("Features selected in bacward fit") print(X.columns[b]) ## avg_score ci_bound cv_scores \ ## 1 -42.6185 19.0465 [-23.5582499971, -41.8215743748, -73.993608929... ## 2 -36.0651 16.3184 [-18.002498199, -40.1507894517, -56.5286659068... ## 3 -35.4992 13.9619 [-17.2329292677, -44.4178648308, -51.633177846... ## 4 -33.463 12.4081 [-20.6415333292, -37.3247852146, -47.479302977... ## 5 -33.1038 10.6156 [-20.2872309863, -34.6367078466, -45.931870352... ## 6 -32.0638 10.0933 [-19.4463829372, -33.460638577, -42.726257249,... ## 7 -30.7133 9.23881 [-19.4425181917, -31.1742902259, -40.531266671... ## 8 -29.7432 9.84468 [-19.445277268, -30.0641187173, -40.2561247122... ## 9 -29.0878 9.45027 [-19.3545569877, -30.094768669, -39.7506036377... ## 10 -28.9225 9.39697 [-18.562171585, -29.968504938, -39.9586835965,... ## 11 -29.4301 10.8831 [-18.3346152225, -30.3312847532, -45.065432793... ## 12 -30.4589 11.1486 [-18.493389527, -35.0290639374, -45.1558231765... ## 13 -37.1318 23.2657 [-12.4603005692, -26.0486211062, -33.074137979... ## ## feature_idx std_dev std_err ## 1 (12,) 18.9042 9.45212 ## 2 (10, 12) 16.1965 8.09826 ## 3 (10, 12, 7) 13.8576 6.92881 ## 4 (12, 10, 4, 7) 12.3154 6.15772 ## 5 (4, 7, 8, 10, 12) 10.5363 5.26816 ## 6 (4, 7, 8, 9, 10, 12) 10.0179 5.00896 ## 7 (1, 4, 7, 8, 9, 10, 12) 9.16981 4.58491 ## 8 (1, 4, 7, 8, 9, 10, 11, 12) 9.77116 4.88558 ## 9 (0, 1, 4, 7, 8, 9, 10, 11, 12) 9.37969 4.68985 ## 10 (0, 1, 4, 6, 7, 8, 9, 10, 11, 12) 9.3268 4.6634 ## 11 (0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12) 10.8018 5.40092 ## 12 (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12) 11.0653 5.53265 ## 13 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) 23.0919 11.546 ## No of features= 9 ## Features selected in bacward fit ## Index([u'crimeRate', u'zone', u'NO2', u'distances', u'idxHighways', u'taxRate', ## u'teacherRatio', u'color', u'status'], ## dtype='object')

The table above shows the average score, 10 fold CV errors, the features included at every step, std. deviation and std. error

Backward fit in Python indicate that 10 features provide the best fit

1.3c Sequential Floating Forward Selection (SFFS) – Python code

The Sequential Feature search also includes ‘floating’ variants which include or exclude features conditionally, once they were excluded or included. The SFFS can conditionally include features which were excluded from the previous step, if it results in a better fit. This option will tend to a better solution, than plain simple SFS. These variants are included below

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.datasets import load_boston from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs import matplotlib.pyplot as plt from mlxtend.feature_selection import SequentialFeatureSelector as SFS from sklearn.linear_model import LinearRegression df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") #Rename the columns df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status","cost"] X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status"]] y=df['cost'] lr = LinearRegression() # Create the floating forward search sffs = SFS(lr, k_features=(1,13), forward=True, # Forward floating=True, #Floating scoring='neg_mean_squared_error', cv=5) # Fit a model sffs = sffs.fit(X.as_matrix(), y.as_matrix()) a=sffs.get_metric_dict() n=[] o=[] # Compute mean validation scores for i in np.arange(1,13): n.append(-np.mean(a[i]['cv_scores'])) m=np.arange(1,13) # Plot the cross validation score vs number of features fig3=plt.plot(m,n) fig3=plt.title('SFFS:Mean CV Scores vs No of features') fig3.figure.savefig('fig3.png', bbox_inches='tight') print(pd.DataFrame.from_dict(sffs.get_metric_dict(confidence_interval=0.90)).T) # Get the index of the minimum CV score idx = np.argmin(n) print "No of features=",idx #Get the features indices for the best forward floating fit and convert to list b=list(a[idx]['feature_idx']) print(b) print("#################################################################################") # Index the column names. # Features from forward fit print("Features selected in forward fit") print(X.columns[b]) ## avg_score ci_bound cv_scores \ ## 1 -42.6185 19.0465 [-23.5582499971, -41.8215743748, -73.993608929... ## 2 -36.0651 16.3184 [-18.002498199, -40.1507894517, -56.5286659068... ## 3 -34.1001 20.87 [-9.43012884381, -25.9584955394, -36.184188174... ## 4 -33.7681 20.1638 [-8.86076528781, -28.650217633, -35.7246353855... ## 5 -33.6392 20.5271 [-8.90807628524, -28.0684679108, -35.827463022... ## 6 -33.6276 19.0859 [-9.549485942, -30.9724602876, -32.6689523347,... ## 7 -32.1834 12.1001 [-17.9491036167, -39.6479234651, -45.470227740... ## 8 -32.0908 11.8179 [-17.4389015788, -41.2453629843, -44.247557798... ## 9 -31.0671 10.1581 [-17.2689542913, -37.4379370429, -41.366372300... ## 10 -28.9225 9.39697 [-18.562171585, -29.968504938, -39.9586835965,... ## 11 -29.4301 10.8831 [-18.3346152225, -30.3312847532, -45.065432793... ## 12 -30.4589 11.1486 [-18.493389527, -35.0290639374, -45.1558231765... ## 13 -37.1318 23.2657 [-12.4603005692, -26.0486211062, -33.074137979... ## ## feature_idx std_dev std_err ## 1 (12,) 18.9042 9.45212 ## 2 (10, 12) 16.1965 8.09826 ## 3 (10, 12, 5) 20.7142 10.3571 ## 4 (10, 3, 12, 5) 20.0132 10.0066 ## 5 (0, 10, 3, 12, 5) 20.3738 10.1869 ## 6 (0, 3, 5, 7, 10, 12) 18.9433 9.47167 ## 7 (0, 1, 2, 3, 7, 10, 12) 12.0097 6.00487 ## 8 (0, 1, 2, 3, 7, 8, 10, 12) 11.7297 5.86484 ## 9 (0, 1, 2, 3, 7, 8, 9, 10, 12) 10.0822 5.04111 ## 10 (0, 1, 4, 6, 7, 8, 9, 10, 11, 12) 9.3268 4.6634 ## 11 (0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12) 10.8018 5.40092 ## 12 (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12) 11.0653 5.53265 ## 13 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) 23.0919 11.546 ## No of features= 9 ## [0, 1, 2, 3, 7, 8, 9, 10, 12] ## ################################################################################# ## Features selected in forward fit ## Index([u'crimeRate', u'zone', u'indus', u'chasRiver', u'distances', ## u'idxHighways', u'taxRate', u'teacherRatio', u'status'], ## dtype='object')

The table above shows the average score, 10 fold CV errors, the features included at every step, std. deviation and std. error

SFFS provides the best fit with 10 predictors

1.3d Sequential Floating Backward Selection (SFBS) – Python code

The SFBS is an extension of the SBS. Here features that are excluded at any stage can be conditionally included if the resulting feature set gives a better fit.

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.datasets import load_boston from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs import matplotlib.pyplot as plt from mlxtend.feature_selection import SequentialFeatureSelector as SFS from sklearn.linear_model import LinearRegression df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") #Rename the columns df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status","cost"] X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status"]] y=df['cost'] lr = LinearRegression() sffs = SFS(lr, k_features=(1,13), forward=False, # Backward floating=True, # Floating scoring='neg_mean_squared_error', cv=5) sffs = sffs.fit(X.as_matrix(), y.as_matrix()) a=sffs.get_metric_dict() n=[] o=[] # Compute the mean cross validation score for i in np.arange(1,13): n.append(-np.mean(a[i]['cv_scores'])) m=np.arange(1,13) fig4=plt.plot(m,n) fig4=plt.title('SFBS: Mean CV Scores vs No of features') fig4.figure.savefig('fig4.png', bbox_inches='tight') print(pd.DataFrame.from_dict(sffs.get_metric_dict(confidence_interval=0.90)).T) # Get the index of the minimum CV score idx = np.argmin(n) print "No of features=",idx #Get the features indices for the best backward floating fit and convert to list b=list(a[idx]['feature_idx']) print(b) print("#################################################################################") # Index the column names. # Features from forward fit print("Features selected in backward floating fit") print(X.columns[b]) ## avg_score ci_bound cv_scores \ ## 1 -42.6185 19.0465 [-23.5582499971, -41.8215743748, -73.993608929... ## 2 -36.0651 16.3184 [-18.002498199, -40.1507894517, -56.5286659068... ## 3 -34.1001 20.87 [-9.43012884381, -25.9584955394, -36.184188174... ## 4 -33.463 12.4081 [-20.6415333292, -37.3247852146, -47.479302977... ## 5 -32.3699 11.2725 [-20.8771078371, -34.9825657934, -45.813447203... ## 6 -31.6742 11.2458 [-20.3082500364, -33.2288990522, -45.535507868... ## 7 -30.7133 9.23881 [-19.4425181917, -31.1742902259, -40.531266671... ## 8 -29.7432 9.84468 [-19.445277268, -30.0641187173, -40.2561247122... ## 9 -29.0878 9.45027 [-19.3545569877, -30.094768669, -39.7506036377... ## 10 -28.9225 9.39697 [-18.562171585, -29.968504938, -39.9586835965,... ## 11 -29.4301 10.8831 [-18.3346152225, -30.3312847532, -45.065432793... ## 12 -30.4589 11.1486 [-18.493389527, -35.0290639374, -45.1558231765... ## 13 -37.1318 23.2657 [-12.4603005692, -26.0486211062, -33.074137979... ## ## feature_idx std_dev std_err ## 1 (12,) 18.9042 9.45212 ## 2 (10, 12) 16.1965 8.09826 ## 3 (10, 12, 5) 20.7142 10.3571 ## 4 (4, 10, 7, 12) 12.3154 6.15772 ## 5 (12, 10, 4, 1, 7) 11.1883 5.59417 ## 6 (4, 7, 8, 10, 11, 12) 11.1618 5.58088 ## 7 (1, 4, 7, 8, 9, 10, 12) 9.16981 4.58491 ## 8 (1, 4, 7, 8, 9, 10, 11, 12) 9.77116 4.88558 ## 9 (0, 1, 4, 7, 8, 9, 10, 11, 12) 9.37969 4.68985 ## 10 (0, 1, 4, 6, 7, 8, 9, 10, 11, 12) 9.3268 4.6634 ## 11 (0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12) 10.8018 5.40092 ## 12 (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12) 11.0653 5.53265 ## 13 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) 23.0919 11.546 ## No of features= 9 ## [0, 1, 4, 7, 8, 9, 10, 11, 12] ## ################################################################################# ## Features selected in backward floating fit ## Index([u'crimeRate', u'zone', u'NO2', u'distances', u'idxHighways', u'taxRate', ## u'teacherRatio', u'color', u'status'], ## dtype='object')

The table above shows the average score, 10 fold CV errors, the features included at every step, std. deviation and std. error

SFBS indicates that 10 features are needed for the best fit

1.4 Ridge regression

In Linear Regression the Residual Sum of Squares (RSS) is given as


Ridge regularization =

where is the regularization or tuning parameter. Increasing increases the penalty on the coefficients thus shrinking them. However in Ridge Regression features that do not influence the target variable will shrink closer to zero but never become zero except for very large values of

Ridge regression in R requires the ‘glmnet’ package

1.4a Ridge Regression – R code library(glmnet) library(dplyr) # Read the data df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL #Rename the columns names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") # Select specific columns df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") # Set X and y as matrices X=as.matrix(df1[,1:13]) y=df1$cost # Fit a Ridge model fitRidge <-glmnet(X,y,alpha=0) #Plot the model where the coefficient shrinkage is plotted vs log lambda plot(fitRidge,xvar="lambda",label=TRUE,main= "Ridge regression coefficient shrikage vs log lambda")

The plot below shows how the 13 coefficients for the 13 predictors vary when lambda is increased. The x-axis includes log (lambda). We can see that increasing lambda from to  significantly shrinks the coefficients. We can draw a vertical line from the x-axis and read the values of the 13 coefficients. Some of them will be close to zero

# Compute the cross validation error cvRidge=cv.glmnet(X,y,alpha=0) #Plot the cross validation error plot(cvRidge, main="Ridge regression Cross Validation Error (10 fold)")

This gives the 10 fold Cross Validation  Error with respect to log (lambda) As lambda increase the MSE increases

1.4a Ridge Regression – Python code

The coefficient shrinkage for Python can be plotted like R using Least Angle Regression model a.k.a. LARS package. This is included below

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") #Rename the columns df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status","cost"] X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status"]] y=df['cost'] from sklearn.preprocessing import MinMaxScaler scaler = MinMaxScaler() from sklearn.linear_model import Ridge X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) # Scale the X_train and X_test X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) # Fit a ridge regression with alpha=20 linridge = Ridge(alpha=20.0).fit(X_train_scaled, y_train) # Print the training R squared print('R-squared score (training): {:.3f}' .format(linridge.score(X_train_scaled, y_train))) # Print the test Rsquared print('R-squared score (test): {:.3f}' .format(linridge.score(X_test_scaled, y_test))) print('Number of non-zero features: {}' .format(np.sum(linridge.coef_ != 0))) trainingRsquared=[] testRsquared=[] # Plot the effect of alpha on the test Rsquared print('Ridge regression: effect of alpha regularization parameter\n') # Choose a list of alpha values for this_alpha in [0.001,.01,.1,0, 1, 10, 20, 50, 100, 1000]: linridge = Ridge(alpha = this_alpha).fit(X_train_scaled, y_train) # Compute training rsquared r2_train = linridge.score(X_train_scaled, y_train) # Compute test rsqaured r2_test = linridge.score(X_test_scaled, y_test) num_coeff_bigger = np.sum(abs(linridge.coef_) > 1.0) trainingRsquared.append(r2_train) testRsquared.append(r2_test) # Create a dataframe alpha=[0.001,.01,.1,0, 1, 10, 20, 50, 100, 1000] trainingRsquared=pd.DataFrame(trainingRsquared,index=alpha) testRsquared=pd.DataFrame(testRsquared,index=alpha) # Plot training and test R squared as a function of alpha df3=pd.concat([trainingRsquared,testRsquared],axis=1) df3.columns=['trainingRsquared','testRsquared'] fig5=df3.plot() fig5=plt.title('Ridge training and test squared error vs Alpha') fig5.figure.savefig('fig5.png', bbox_inches='tight') # Plot the coefficient shrinage using the LARS package from sklearn import linear_model # ############################################################################# # Compute paths n_alphas = 200 alphas = np.logspace(0, 8, n_alphas) coefs = [] for a in alphas: ridge = linear_model.Ridge(alpha=a, fit_intercept=False) ridge.fit(X_train_scaled, y_train) coefs.append(ridge.coef_) # ############################################################################# # Display results ax = plt.gca() fig6=ax.plot(alphas, coefs) fig6=ax.set_xscale('log') fig6=ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis fig6=plt.xlabel('alpha') fig6=plt.ylabel('weights') fig6=plt.title('Ridge coefficients as a function of the regularization') fig6=plt.axis('tight') plt.savefig('fig6.png', bbox_inches='tight') ## R-squared score (training): 0.620 ## R-squared score (test): 0.438 ## Number of non-zero features: 13 ## Ridge regression: effect of alpha regularization parameter

The plot below shows the training and test error when increasing the tuning or regularization parameter ‘alpha’

For Python the coefficient shrinkage with LARS must be viewed from right to left, where you have increasing alpha. As alpha increases the coefficients shrink to 0.

1.5 Lasso regularization

The Lasso is another form of regularization, also known as L1 regularization. Unlike the Ridge Regression where the coefficients of features which do not influence the target tend to zero, in the lasso regualrization the coefficients become 0. The general form of Lasso is as follows

1.5a Lasso regularization – R code library(glmnet) library(dplyr) df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") # Set X and y as matrices X=as.matrix(df1[,1:13]) y=df1$cost # Fit the lasso model fitLasso <- glmnet(X,y) # Plot the coefficient shrinkage as a function of log(lambda) plot(fitLasso,xvar="lambda",label=TRUE,main="Lasso regularization - Coefficient shrinkage vs log lambda")

The plot below shows that in L1 regularization the coefficients actually become zero with increasing lambda

# Compute the cross validation error (10 fold) cvLasso=cv.glmnet(X,y,alpha=0) # Plot the cross validation error plot(cvLasso)

This gives the MSE for the lasso model

1.5 b Lasso regularization – Python code import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import Lasso from sklearn.preprocessing import MinMaxScaler from sklearn import linear_model scaler = MinMaxScaler() df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") #Rename the columns df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status","cost"] X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status"]] y=df['cost'] X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) linlasso = Lasso(alpha=0.1, max_iter = 10).fit(X_train_scaled, y_train) print('Non-zero features: {}' .format(np.sum(linlasso.coef_ != 0))) print('R-squared score (training): {:.3f}' .format(linlasso.score(X_train_scaled, y_train))) print('R-squared score (test): {:.3f}\n' .format(linlasso.score(X_test_scaled, y_test))) print('Features with non-zero weight (sorted by absolute magnitude):') for e in sorted (list(zip(list(X), linlasso.coef_)), key = lambda e: -abs(e[1])): if e[1] != 0: print('\t{}, {:.3f}'.format(e[0], e[1])) print('Lasso regression: effect of alpha regularization\n\ parameter on number of features kept in final model\n') trainingRsquared=[] testRsquared=[] #for alpha in [0.01,0.05,0.1, 1, 2, 3, 5, 10, 20, 50]: for alpha in [0.01,0.07,0.05, 0.1, 1,2, 3, 5, 10]: linlasso = Lasso(alpha, max_iter = 10000).fit(X_train_scaled, y_train) r2_train = linlasso.score(X_train_scaled, y_train) r2_test = linlasso.score(X_test_scaled, y_test) trainingRsquared.append(r2_train) testRsquared.append(r2_test) alpha=[0.01,0.07,0.05, 0.1, 1,2, 3, 5, 10] #alpha=[0.01,0.05,0.1, 1, 2, 3, 5, 10, 20, 50] trainingRsquared=pd.DataFrame(trainingRsquared,index=alpha) testRsquared=pd.DataFrame(testRsquared,index=alpha) df3=pd.concat([trainingRsquared,testRsquared],axis=1) df3.columns=['trainingRsquared','testRsquared'] fig7=df3.plot() fig7=plt.title('LASSO training and test squared error vs Alpha') fig7.figure.savefig('fig7.png', bbox_inches='tight') ## Non-zero features: 7 ## R-squared score (training): 0.726 ## R-squared score (test): 0.561 ## ## Features with non-zero weight (sorted by absolute magnitude): ## status, -18.361 ## rooms, 18.232 ## teacherRatio, -8.628 ## taxRate, -2.045 ## color, 1.888 ## chasRiver, 1.670 ## distances, -0.529 ## Lasso regression: effect of alpha regularization ## parameter on number of features kept in final model ## ## Computing regularization path using the LARS ... ## .C:\Users\Ganesh\ANACON~1\lib\site-packages\sklearn\linear_model\coordinate_descent.py:484: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems. ## ConvergenceWarning)

The plot below gives the training and test R squared error

1.5c Lasso coefficient shrinkage – Python code

To plot the coefficient shrinkage for Lasso the Least Angle Regression model a.k.a. LARS package. This is shown below

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import Lasso from sklearn.preprocessing import MinMaxScaler from sklearn import linear_model scaler = MinMaxScaler() df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") #Rename the columns df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status","cost"] X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status"]] y=df['cost'] X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) print("Computing regularization path using the LARS ...") alphas, _, coefs = linear_model.lars_path(X_train_scaled, y_train, method='lasso', verbose=True) xx = np.sum(np.abs(coefs.T), axis=1) xx /= xx[-1] fig8=plt.plot(xx, coefs.T) ymin, ymax = plt.ylim() fig8=plt.vlines(xx, ymin, ymax, linestyle='dashed') fig8=plt.xlabel('|coef| / max|coef|') fig8=plt.ylabel('Coefficients') fig8=plt.title('LASSO Path - Coefficient Shrinkage vs L1') fig8=plt.axis('tight') plt.savefig('fig8.png', bbox_inches='tight') This plot show the coefficient shrinkage for lasso. This 3rd part of the series covers the main ‘feature selection’ methods. I hope these posts serve as a quick and useful reference to ML code both for R and Python! Stay tuned for further updates to this series! Watch this space!

 

You may also like

1. Natural language processing: What would Shakespeare say?
2. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
3. GooglyPlus: yorkr analyzes IPL players, teams, matches with plots and tables
4. My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)
5. Experiments with deblurring using OpenCV
6. R vs Python: Different similarities and similar differences

To see all posts see Index of posts

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Qualitative Research in R

Fri, 10/20/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

In the last two posts, I’ve focused purely on statistical topics – one-way ANOVA and dealing with multicollinearity in R. In this post, I’ll deviate from the pure statistical topics and will try to highlight some aspects of qualitative research. More specifically, I’ll show you the procedure of analyzing text mining and visualizing the text analysis using word cloud.
Some of typical usage of the text mining are mentioned below:
• Marketing managers often use the text mining approach to study the needs and complaints of their customers;
• Politicians and journalists also use effectively the text mining to critically analyze the lectures delivered by the opposition leaders;
• Social media experts uses this technique to collect, analyze and share user posts, comments etc.;
• Social science researchers use text mining approach for analysing the qualitative data.

What is Text mining?

It is method which enables us to highlight the most frequently used keywords in a paragraph of texts or compilation of several text documents.

What is Word Cloud?

It is the visual representation of text data, especially the keywords in the text documents.
R has very simple and straightforward approaches for text mining and creating word clouds.

The text mining package “(tm)” will be used for mining the text and the word cloud generator package (wordcloud) will be used for visualizing the keywords as a word cloud.

Preparing the Text Documents:

As the starting point of qualitative research, you need to create the text file. Here I’ve used the lecture delivered by great Indian Hindu monk Swami Vivekananda at the first World’s Parliament of Religions held from 11 to 27 September 1893. Only two lecture notes – opening and closing address, will be used.
Both the lectures are saved in text file (chicago).

Loading the Required Packages: library("tm") library("SnowballC") library("wordcloud") library("RColorBrewer") Load the text: Importing the text file:

The text file (chicago) is imported using the following code in R.

The R code for leading the text is given below:

text <- readLines(file.choose()) Build the data as a corpus:

The ‘text’ object will now be loaded as ‘Corpora’ which are collections of documents containing (natural language) text. The Corpus() function from text mining(tm) package will be used for this purpose.
The R code for building the corpus is given below:

docs <- Corpus(VectorSource(text))

Next use the function inspect() under the tm package to display detailed information of the text document.
The R code for inspecting the text is given below:

inspect(docs) The output is not, however, produced here due to space constraint Text transformation:

After inspecting the text document (corpora), it is required to perform some text transformation for replacing special characters from the text. To do this, use the ‘tm_map()’ function.
The R code for transformation of the text is given below:

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x)) docs <- tm_map(docs, toSpace, "/") docs <- tm_map(docs, toSpace, "@") docs <- tm_map(docs, toSpace, "\\|") Text Cleaning:

After removing the special characters from the text, it is now the time to remove the to remove unnecessary white space, to convert the text to lower case, to remove common stopwords like ‘the’, “we”. This is required as the The information value of ‘stopwords’ is near zero due to the fact that they are so common in a language. For doing this exercise, the same ‘tm_map()’ function will be used.
The R code for cleaning the text along with the short self-explanation is given below:

# Convert the text to lower case docs <- tm_map(docs, content_transformer(tolower)) # Remove numbers docs <- tm_map(docs, removeNumbers) # Remove english common stopwords docs <- tm_map(docs, removeWords, stopwords("english")) # Remove your own stop word # specify your stopwords as a character vector docs <- tm_map(docs, removeWords, c("I", "my")) # Remove punctuations docs <- tm_map(docs, removePunctuation) # Eliminate extra white spaces docs <- tm_map(docs, stripWhitespace) Building a document matrix:

Document matrix is the frequency distribution of the words used in the given text. I hope that readers will easily understand this frequency distribution of words.
The R function TermDocumentMatrix() from the text mining package ‘tm’ will be used for building this frequency table for words in the given text.
The R code is given below:

docs_matrix <- TermDocumentMatrix(docs) m <- as.matrix(docs_matrix) v <- sort(rowSums(m),decreasing=TRUE) d <- data.frame(word = names(v),freq=v) Have a look at the document matrix for the top ten keywords: head(d, 10) word freq religions religions 7 world world 6 earth earth 6 become become 6 hindu hindu 5 religion religion 5 thanks thanks 5 different different 4 men men 4 proud proud 4 Generate the Word cloud:

Finally, the frequency table of the words (document matrix) will be visualized graphically by plotting in a word cloud with the help of the following R code.

wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))

You can also use barplot to plot the frequencies of the keywords using the following R code:

barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word, col ="lightblue", main ="Most commonly used words", ylab = "Word frequencies", xlab="Keywords")

Conclusion:

The above word cloud clearly shows that “religions”, “earth”, “world”, “hindu”, “one” etc. are the most important words in the lecture delivered by Swamiji in Chicago World’s Parliament of Religions.

    Related Post

    1. Multi-Dimensional Reduction and Visualisation with t-SNE
    2. Comparing Trump and Clinton’s Facebook pages during the US presidential election, 2016
    3. Analyzing Obesity across USA
    4. Can we predict flu deaths with Machine Learning and R?
    5. Graphical Presentation of Missing Data; VIM Package
    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Gold-Mining – Week 7 (2017)

    Fri, 10/20/2017 - 03:54

    (This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

    Week76 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

    The post Gold-Mining – Week 7 (2017) appeared first on Fantasy Football Analytics.

    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    County-Level Choropleth in Plotly and R

    Thu, 10/19/2017 - 21:23

    (This article was first published on R – Modern Data, and kindly contributed to R-bloggers)

    At Plotly, we are commonly asked about thematic maps — especially county-level choropleth maps. This style of map provides a visual illustration of variation across a geographic area. Some pertinent uses are population density, economic measurements, crime statistics, and election results. With Plotly, there are multiple ways to bring county-level choropleths to life. For example,

    using ggplotly

    or native plot_ly

    or plot_mapbox

    . . .

    For now, we will look at how to create the US electoral map using the ggplotly method. For more information concerning the plot_ly and plot_mapbox examples, checkout our R documentation library.

     . . .

    Before we get started, we have to retrieve the data. The 2016 US election data can be sourced from Kaggle. For the map data, county and state boundaries, we can us the Maps package via ggplot2, which is loaded with Plotly package. Thus, we can load the Plotly library and read in the election and map data. Additionally, we will make use of the dplyr package.

    #install.packages(“plotly”) library(plotly) library(dplyr) county_df <- map_data("county") state_df <- map_data("state") df <- read.csv("https://raw.githubusercontent.com/bcdunbar/datasets/master/votes.csv")

    As we are using multiple datasets, we will need to ensure that county names match so that we are able to plot all the counties, parishes, etc.. Thus, we make sure county names are lower case and use gsub to remove any unnecessary words (such as “county” or “parish”) and characters. We do this to both datasets.

    df <- subset(df, select = c(Clinton, Trump, county_name)) df$county_name %<>% gsub(" county", "", .) %>% gsub(" parish", "", .) %>% gsub(" ", "", .) %>% gsub("[.]", "", .) county_df$subregion <- gsub(" ", "", county_df$subregion)

    Next, we want to correct the vote measurement and then determine who won which county by checking which candidate had the greater amount of votes.

    df$Clinton <- df$Clinton*100 df$Trump <- df$Trump*100 for (i in 1:length(df[,1])) { if (df$Clinton[i] > df$Trump[i]) { df$win[i] = 'Clinton' } else { df$win[i] = 'Trump' } }

    Rename for consistency, then join the two datasets using dplyr and remove any duplicates.

    names(df) <- c("clinton", "trump", "subregion", "win") choropleth <- inner_join(county_df, df, by = "subregion") choropleth <- choropleth[!duplicated(choropleth$order), ]

    . . .

    As previously illustrated, there are multiple ways to visualize county-level choropleths in Plotly but we will focus on using ggplot2 and converting it to a Plotly object with ggplotly. Here, call ggplot using longitude, latitude, and group, and geom_polygon to color counties by the winning candidate. Additionally, we utilize a third dataset (state boundaries) in the second geom_polygon to add a familiar shape to our map. Finally, remove the axes, grid lines, and background to set the theme.

    p <- ggplot(choropleth, aes(long, lat, group = group)) + geom_polygon(aes(fill = win), colour = alpha("white", 1/2), size = 0.1) + geom_polygon(data = state_df, colour = "white", fill = NA) + scale_fill_manual(values = c('blue','red')) + theme_void()

    . . .

    Now using the magic of ggplotly, we can convert this static plot into an interactive Plotly object. Furthermore, we have the option of extending the Plotly object by adding layout attributes and, or, modifying existing, or default, attributes by utilizing the style function.

    p <- ggplotly(p, tooltip = 'text') %>% layout( hovermode = 'x', margin = list( t = 20, b = 20, l = 20, r = 20), legend = list( orientation = 'h', x = 0.5, y = 1.01, xanchor = 'center')) # use style to modify layer p <- style(p, hoverinfo = 'none', traces = c(3)) # use plotly_build to modify layer p <- plotly_build(p) str(p$x$layout$annotations) # check annotations p$x$layout$annotations = NULL # remove annotation

    Once you’re happy with the final product, and have added your Plotly credentials, you can publish it to your Plotly account.

    Sys.setenv("plotly_username" = " ... ") Sys.setenv("plotly_api_key" = " ... ") api_create(p, filename = 'US 2016 Election')

    . . .

    Want More?

    If you like this, checkout the full code here and more examples via our R documentation library.

    If you’re looking for something more in-depth, then consider attending the 2 day R/Shiny Master Class in NYC November 18–19, where you’ll cover topics such as extending ggplot2, animations, linked brushing, and Shiny, whilst learning to build “stats apps” that have rich and interactive controls.

     

    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: R – Modern Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    First steps with MRF smooths

    Thu, 10/19/2017 - 20:00

    (This article was first published on From the Bottom of the Heap - R, and kindly contributed to R-bloggers)

    One of the specialist smoother types in the mgcv package is the Markov Random Field (MFR) smooth. This smoother essentially allows you to model spatial data with an intrinsic Gaussian Markov random field (GMRF). GRMFs are often used for spatial data measured over discrete spatial regions. MRFs are quite flexible as you can think about them as representing an undirected graph whose nodes are your samples and the connections between the nodes are specified via a neighbourhood structure. I’ve become interested in using these MRF smooths to include information about relationships between species. However, these smooths are not widely documented in the smoothing literature so working out how best to use them to do what we want has been a little tricky once you move beyond the typical spatial examples. As a result I’ve been fiddling with these smooths, fitting them to some spatial data I came across in a tutorial Regional Smoothing in R from The Pudding. In this post I take a quick look at how to use the MRF smooth in mgcv to model a discrete spatial data set from the US Census Bureau.

    In that tutorial, the example data are taken from the US Census Bureau via a shapefile prepared by the author. After a little munging — quite a few steps are missing from the tutorial — I managed to get data from the shapefile that matched what was used in the tutorial. The data are on county level percentages of US adults whose highest level of education attainment is a high school diploma. The raw data are shown in the figure below


    To follow along, you’ll need to download the example shapefile provided by the author of the post on The Pudding. The shapefile(s) are in a ZIP, which I extracted into the working directory; the code below assumes this.

    This post will make use of the following set of package; load them now, as shown below, and install any that you may be missing

    library('rgdal') library('proj4') library('spdep') library('mgcv') library('ggplot2') library('dplyr') library('viridis')

    Assuming you have extracted the shapefile, we load it into R using readOGR()

    shp <- readOGR('.', 'us_county_hs_only')

    and do some data munging

    ## select only mainland US counties states <- c(1,4:6, 8:13, 16:42, 44:51, 53:56) shp <- shp[shp$STATEFP %in% sprintf('%02i', states), ] df <- droplevels(as(shp, 'data.frame')) ## project data aea.proj <- "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96" shp <- spTransform(shp, CRS(aea.proj)) # project to Albers shpf <- fortify(shp, region = 'GEOID') ## Need a proportion for fitting df <- transform(df, hsd = hs_pct / 100)

    The shapefile contains US Census Bureau data for all US counties, including many that are far from the continental USA. The tutorial from The Pudding doesn’t go into how they removed, or how they drew a map without these additional counties. For our purposes they may cause complications when we try to model them using the MRF smooth. I’m sure the modelling approach can handle data like this, but as I wanted to achieve something that followed the tutorial I’ve removed everything not linked to the continental US landmass, including (I’m sorry!), Alaska and Hawaii — my ggplot and mapping skills aren’t yet good enough to move Alaska and Hawaii to the bottom left of such maps.

    The data were projected using the Albers equal area projection and subsequently passed to the fortify() method from ggplot2 to get a version of the county polygons suitable for plotting with that package.

    Finally, I created a new variable hsd which is just the variable hs_pct divided by 100. This creates a proportion that we’ll need for model fitting as you’ll see shortly.

    Before we can model these data with gam(), we need to create the supporting information that gam() will use to create the MRF smooth penalty. The penalty matrix in an MRF smooth is based on the neighbourhood structure of the observations. There are three ways to pass this information to gam()

    1. as a list of polygons (not SpatialPolygons, I believe)
    2. as a list containing the neighbourhood structure, or
    3. the raw penalty matrix itself.

    Options 1 and 3 aren’t easily doable as far I can see — gam() isn’t expecting the sort of object we created when we imported the shapefile and nobody want’s to build a penalty matrix by hand! Thankfully option 2, the neighbourhood structure is relatively easy to create. For that I use the poly2nb() function from the spdep package. This function takes a shapefile and works out which regions are neighbours of any other region by virtue of them sharing a border. To make sure everything matches up nicely in the way gam() wants this list, we specify that the region IDs should be the GEOIDs from the original data set (the GEOID uniquely identifies each county) and we have to set the names attribute on the neighbouthood list to match these unique IDs

    nb <- poly2nb(shp, row.names = df$GEOID) names(nb) <- attr(nb, "region.id")

    The result of the previous chunk is a list whose names map on to the levels of the GEOID factor. The values in each element of nb index the elements of nb that are neighbours of the current element

    str(nb[1:6]) List of 6 $ 19107: int [1:6] 1417 1464 1632 2277 2278 2851 $ 19189: int [1:6] 551 1414 2151 2452 2846 2849 $ 20093: int [1:7] 5 557 1064 1142 1437 1441 2978 $ 20123: int [1:5] 1469 1565 2648 2966 2977 $ 20187: int [1:7] 3 554 1142 1441 1620 2142 2238 $ 21005: int [1:7] 582 583 953 954 1770 1861 2169

    With that done we can now fit the GAM. Fitting this is going to take a wee while (over 3 hours for the full rank MRF, using 6 threads, on a reasonably powerful 3-year old workstation with dual 4-core Xeon processors). To specify an MRF smooth we use the bs argument to the s() function, setting it to bs = ‘mrf’. The neighbourhood list is passed via the xt argument, which takes a list as a value; here we specify a component nb which takes our neighbourhood list nb. The final set-up variable to consider is whether to fit a full rank MRF, where a coefficient for each county will be estimated, or a reduced rank MRF, wherein the MRF is represented using fewer coefficients and counties are mapped to the smaller set of coefficients. The rank of the MRF smooth is set using the k argument. The default is to fit a full rank MRF, whilst setting k < NROW(data) will result ins a reduced-rank MRF being etimated.

    The full rank MRF model is estimated using

    ctrl <- gam.control(nthreads = 6) # use 6 parallel threads, reduce if fewer physical CPU cores m1 <- gam(hsd ~ s(GEOID, bs = 'mrf', xt = list(nb = nb)), # define MRF smooth data = df, method = 'REML', # fast version of REML smoothness selection control = ctrl, family = betar()) # fit a beta regression

    As the response is a proportion, the fitted GAM uses the beta distribution as the conditional distribution of the response. The default link in the logit, just as it is in for the binomial distribution, and insures that fitted values on the scale of the linear predictor are mapped onto the allowed range for proportions of 0–1.

    The final model uses in the region of 1700 effective degrees of freedom. This is the smoothness penalty at work; rather than 3108 individual coefficients, the smoothness invoked to try to arrange for neighbouring counties to have similar coefficients has shrunk away almost half of the complexity implied by the full rank MRF.

    summary(m1) Family: Beta regression(179.532) Link function: logit Formula: hsd ~ s(GEOID, bs = "mrf", xt = list(nb = nb)) Parametric coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.63806 0.00283 -225.5 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Approximate significance of smooth terms: edf Ref.df Chi.sq p-value s(GEOID) 1732 3107 9382 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 R-sq.(adj) = 0.769 Deviance explained = 89.7% -REML = -4544 Scale est. = 1 n = 3108

    Whilst the penalty enforces smoothness, further smoothness can be enforced by fitting a reduced rank MARF. In the next code block I fit models with k = 300 and k = 30 respectively, which imply considerable smoothing relative to the full rank model.

    ## rank 300 MRF m2 <- gam(hsd ~ s(GEOID, bs = 'mrf', k = 300, xt = list(nb = nb)), data = df, method = 'REML', control = ctrl, family = betar()) ## rank 30 MRF m3 <- gam(hsd ~ s(GEOID, bs = 'mrf', k = 30, xt = list(nb = nb)), data = df, method = 'REML', control = ctrl, family = betar())

    To visualise the different fits we need to generate predicted values on the response scale for each county and add this data to the county data df

    df <- transform(df, mrfFull = predict(m1, type = 'response'), mrfRrank300 = predict(m2, type = 'response'), mrfRrank30 = predict(m3, type = 'response'))

    Before we can plot these fitted values we need to merge df with the fortified shapefile

    ## merge data with fortified shapefile mdata <- left_join(shpf, df, by = c('id' = 'GEOID')) Warning: Column `id`/`GEOID` joining character vector and factor, coercing into character vector

    To facilitate plotting with ggplot2 I begin by creating some fixed plot components, like the theme, scale, and labels

    theme_map <- function(...) { theme_minimal() + theme(..., axis.line = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.ticks = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank(), panel.border = element_blank()) } myTheme <- theme_map(legend.position = 'bottom') myScale <- scale_fill_viridis(name = '%', option = 'plasma', limits = c(0.1, 0.55), labels = function(x) x * 100, guide = guide_colorbar(direction = "horizontal", barheight = unit(2, units = "mm"), barwidth = unit(75, units = "mm"), title.position = 'left', title.hjust = 0.5, label.hjust = 0.5)) myLabs <- labs(x = NULL, y = NULL, title = 'US Adult Education', subtitle = '% of adults where high school diploma is highest level education', caption = 'Source: US Census Bureau')

    I took many of these settings from Timo Grossenbacher’s excellent post on mapping regional demographic data in Switzerland.

    Now we can plot the fitted proportions. Note that whilst we plot proportions, the colour bar labels are in percentages in keeping with the original data (see the definition for my_scale to see how this was achieved).

    Fitted values from the full rank MRF are shown below

    ggplot(mdata, aes(x = long, y = lat, group = group)) + geom_polygon(aes(fill = mrfFull)) + geom_path(col = 'black', alpha = 0.5, size = 0.1) + coord_equal() + myTheme + myScale + myLabs

    This model expains about 90% of the deviance in the original data. Whilst some smoothing is evident, the fitted values show a considerable about of non-spatial variation. This is most likely due to not including important covariates, such as country average income, which might explain some of the finer scale structure; neighbouring counties with quite different proportions. A more considered analysis would include these and other relevant predictors alongside the MRF.

    Smoother surfaces can be achieved via the reduced rank MRFs. First the rank 300 MRF

    ggplot(mdata, aes(x = long, y = lat, group = group)) + geom_polygon(aes(fill = mrfRrank300)) + geom_path(col = 'black', alpha = 0.5, size = 0.1) + coord_equal() + myTheme + myScale + myLabs

    and next the rank 30 MRF

    ggplot(mdata, aes(x = long, y = lat, group = group)) + geom_polygon(aes(fill = mrfRrank30)) + geom_path(col = 'black', alpha = 0.5, size = 0.1) + coord_equal() + myTheme + myScale + myLabs

    As can be clearly seen from the plots, the degree of smoothness can be controlled effectively via the k argument.

    In a future post I’ll take a closer look at using MRFs alongside other covariates as part of a model complex spatial modeling exercise.

    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: From the Bottom of the Heap - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Quickly Install R on Ubuntu 17.10

    Thu, 10/19/2017 - 18:00

    (This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers)

    Motivation

    On a previous post I explained how to install R on Ubuntu 16.04 without further complications.

    Now here are the equivalent steps on Ubuntu 17.10. The presented script also installs Java and common packages as I wrote this for a fresh install.

    What do you need to do this?

    Administrator (sudo) access to be able to install packages.

    How do you do this?

    This script will install everything.

    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: Pachá (Batteries Included). R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    An AI pitches startup ideas

    Fri, 10/13/2017 - 22:47

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Take a look at this list of 13 hot startups, from a list compiled by Alex Bresler. Perhaps one of them is the next Juicero?

    FAR ATHERA: A CLINICAL AI PLATFORM THAT CAN BE ACCESSED ON DEMAND.

    ZAPSY: TRY-AT-HOME SERVICE FOR CONSUMER ELECTRONICS.

    MADESS: ON-DEMAND ACCESS TO CLEAN WATER.

    DEERG: AI RADIOLOGIST IN A HOME

    SPER: THE FASTEST, EASIEST WAY TO BUY A HOME WITHOUT THE USER HAVING TO WEAR ANYTHING.

    PLILUO: VENMO FOR B2B SAAS.

    LANTR: WE HELP DOCTORS COLLECT 2X MORE ON CANDLES AND KEROSENE.

    ABS: WE PROVIDE FULL-SERVICE SUPPORT FOR SUBLIME, VIM, AND EMACS.

    INSTABLE DUGIT: GITHUB FOR YOUR LOVED ONES.

    CREDITAY: BY REPLACING MECHANICAL PARTS WITH AIR, WE ELIMINATE INSTALLATION COMPLEXITY AND MAINTENANCE HEADACHES, LEADING TO SIGNIFICANT EFFICIENCY GAINS IN PRODUCTION COSTS AND HARVESTING TIME.

    CREDITANO: WE BUILD SOFTWARE TO ENABLE HIGH FUNCTIONALITY BIONICS FOR TREATING HEALTH CONDITIONS TO BE SOFTWARE ENGINEERS FOR FREE IN EXCHANGE FOR A FULL STACK SOLUTION THAT DELIVERS WORKFLOW AND EMBEDS ANALYTICS WITHIN THE ELECTRONIC MEDICAL RECORD.

    PATT: CONNECTED TOYS THAT ENABLE TWO-WAY VOICE MESSAGING BETWEEN CHILDREN AND THEIR LOVED ONES CAN KNOW THEY ARE GONE, SO THEY GET FREE WATER REFILLS FROM OUR POOL OF VETTED ENGINEERING AND PROGRAMMING FOR KIDS 6-12 WITH A SINGLE MESSAGE.

    GITS: WE NOW WANT TO BRING BOXING AS AN ALTERNATIVE TO PAGERS.

    As you've probably guessed by now, these are not real startups. (Though given Juicero, you'd be forgiven if you didn't.) Alex Bresler created a generating neural network (GNN) from YCombinator's archive of startup names and descriptions, and the GNN created almost 5000 new ones. (You can download an Excel file with all of them here, to save load on Alex's server.) I picked a few by hand to create the list above, but there was no shortage of funny ones (and plenty of all-too-real ones as well).

    Alex used his fundmanageR package (available on Github) to download the YCombinator database into R.  The Generational Neural Network was trained using the keras package for R on these data, using Tensorflow as the back end.  You can find all the details on how the list was generated, including the R code, at Alex's blog linked below.

    ASBC LLC: Juice, Err, Ohh…

    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    R for Business Analysts, NYC Data Science Academy in-person training | November 6 @NYC

    Fri, 10/13/2017 - 20:23

    (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

    R is a powerful language used widely for business analytics. More companies are seeking business analysts with knowledge of R.

    In this R for Business Analysts course, students will learn how data science is done in the wild, with a focus on data acquisition, cleaning, and aggregation, exploratory data analysis and visualization, feature engineering, and model creation and validation. Students use the R statistical programming language to work through real-world examples that illustrate these concepts. Concurrently, students learn some of the statistical and mathematical foundations that power the data-scientific approach to problem solving.

    Classes will be given in a lab setting, with student exercises mixed with lectures. Students should bring a laptop to class. There will be a modest amount of homework after each class. Due to the focused nature of this course, there will be no individual class projects but the instructors will be available to help students who are applying R to their own work outside of class.

    Designed and taught by Brennan Lodge, Team Lead at Bloomberg. Watch his interview here.

    Learn More and Sign Up

    Details Who Is This Course For?

    This course is for anyone with a basic understanding of data analysis techniques and business analysts interested in improving their ability to tackle problems involving multi-dimensional data in a systematic, principled way. A familiarity with the R programming language is helpful, but unnecessary if the pre-work for the course is completed (more on that below).

    Prerequisites Students should have some experience with programming and have some familiarity with basic statistical and linear algebraic concepts such as mean, median, mode, standard deviation, correlation, and the difference between a vector and a matrix. In R, it will be helpful to know basic data structures such as data frames and how to use R Studio.Students should complete the following pre-work (approximately 2 hours) before the first day of class:
    • R Programming – https://www.rstudio.com/online-learning/#R
    • R Studio Essentials Programming 1: Writing Code https://www.rstudio.com/resources/webinars/rstudio-essentials-webinar-series-part-1/
    Outcomes

    Upon completing the course, students have:

    • An understanding of data science business problems solvable using R and an ability to articulate those business use cases from a statistical perspective.
    • The ability to create data visualization output with Rmarkdown files and Shiny Applications.
    • Familiarity with the R data science ecosystem, strategizing and the various tools a business analyst can use to continue developing as a data scientist.
    Syllabus

    Unit 1: Data Science and R Intro

    • Big Data
    • Data Science
    • Roles in Data Science
    • Use Cases
    • Data’isms
    • Class Format overview
    • R Background
    • R Intro
    • R Studio

    Unit 2: Visualize

    • Rules of the road with data viz
    • Chart junk
    • Chart terminology
    • Clean chart
    • Scaling data
    • Data Viz framework
    • Code plotting

    Unit 3: R Markdown

    • Presenting your work
    • R markdown file structure
    • Code chunks
    • Generating a R markdown file
    • Rmarkdown Exercise

    Unit 4: Shiny

    • Shiny structure
    • Reactive output
    • Widgets
    • Rendering Output
    • Stock example
    • Hands-on challenge

    Unit 5: Data Analysis

    • How to begin your data journey?
    • The human factor
    • Business Understanding
    • Dplyr
    • EDA – Exploratory Data Analysis
    • Data Anomalies
    • Data Statistics
    • Key Business Analysis Takeaways
    • Diamond data set exercise
    • Hands on challenge with Bank Marketing

    Unit 6: Introduction to Regression

    • Regression Definition
    • Examples of regression
    • Formulize the formula
    • Plotting
    • Statistical definitions involved
    • mtcars regression example
    • Business use case with regression

    Unit 7: Introduction to Machine Learning

    • ML Concept
    • Types of ML
    • CRISP Model
    • Modeling
    • Evaluation
    • Titanic Example
    • Decision Trees
    • Feature Engineering

    Unit 8: Strategy

    • Data Driven Decision Making
    • Data Science Strategy
    • Strategy Fails
    • Macroeconomic strategy
    • Adapting
    • Data Science Project
    • Data Impact
    • Project guide
    • Opportunities for improvement
    • Big Box Store Strategic Exercise

    Seats are filling up fast! Sign up now.

    If you have any questions about our course or the application process, please do not hesitate to reach out to us via email.

    The post R for Business Analysts, NYC Data Science Academy in-person training | November 6 @NYC appeared first on NYC Data Science Academy Blog.

    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Uber assignment with lpSolve

    Fri, 10/13/2017 - 17:13

    (This article was first published on R – insightR, and kindly contributed to R-bloggers)

    By Yuri Fonseca

    In this post we are going to make an Uber assignment simulation and calculate some metrics of waiting time through simulation.

    Setting

    Suppose we live in a 100×100 block city where each block takes 1 minute to cross by car. Drivers can pick up passengers only on corners, and passengers must call Uber on corners. Inferior-left corner is (1,1) and superior-right corner is (100,100).

    Functions

    In order to calculate the average waiting time of a passenger, we need a couple of functions. The first function calculates the distance of a specific passenger and a specific driver, the second function creates the passengers, the third creates the cars and finally, the last function builds the distance matrix between cars and passengers in order to assign them in a optimal way.

    Since our city is a perfect square, the distance between a passenger and a car is simple the distance in the x-axis plus the distance in the y-axis. Moreover, signs don’t matter. In this simple example we just need to know the initial position of the driver, initial position of the car and final destination. Each of these positions is a 2×1 vector representation in the city map.

    Observation: distance matrix

    We are going to use an linear programming solver in order to allocate optimally cars to passangers. This allocation problem just work with square matrix. So, if we have more cars than passengers, we need to create fictionary passengers (zero distances) in order to the solver converge. Note that we are going to do just a single round of allocation, so the number of cars needs to be bigger or equal than the number of passengers asking for a UBER driver.

    create_passenger = function(id){ initial.position = sample(50, 2, replace = TRUE) final.destination = sample(50, 2, replace = TRUE) return(list('number' = id, 'initial' = initial.position, 'final' = final.destination)) } create_car = function(id){ initial.position = sample(50, 2, replace = TRUE) return(list('number' = id, 'position' = initial.position)) } distance = function(x,y){ sum(abs(x-y)) } distance.matrix = function(cars, passengers){ d.matrix = matrix(0, nrow = length(cars), ncol = length(cars)) for (i in 1:length(cars)){ for (j in 1:length(passengers)){ d.matrix[i,j] = distance(cars[[i]]$position, passengers[[j]]$initial) } } return(d.matrix) } Example MAP

    Let’s check an example of 10 passengers and 10 cars:

    library(lpSolve) library(ggplot2) set.seed(20) passengers = lapply(seq(1:10), create_passenger) cars = lapply(seq(1:10), create_car) d.matrix = distance.matrix(cars, passengers) opt.allocation = lp.assign(d.matrix) passengers.points = sapply(passengers, function(x) x$initial) cars.points = sapply(cars, function(x) x$position) points = t(cbind(passengers.points, cars.points)) assignments = apply(opt.allocation$solution, 1, which.max) #checking the assignment for each car df1 = data.frame('x.axis' = points[,1], 'y.axis' = points[,2], 'id' = c(rep('Passenger',10), rep('Car',10))) # df.assign = data.frame('x' = cars.points[1,], # 'y' = cars.points[2,], # 'xend' = passengers.points[1,assignments], # 'yend' = passengers.points[2,assignments]) df.assign1 = data.frame('x' = cars.points[1,], 'y' = cars.points[2,], 'xend' = passengers.points[1,assignments], 'yend' = cars.points[2,]) df.assign2 = data.frame('x' = passengers.points[1,assignments], 'y' = cars.points[2,], 'xend' = passengers.points[1,assignments], 'yend' = passengers.points[2,assignments]) ggplot(df1, aes(x.axis,y.axis)) + geom_point(aes(color = id, group = id), size = 3) + # car and passengers geom_segment(aes(x = x, y = y, xend = xend, yend = yend), data = df.assign1) + geom_segment(aes(x = x, y = y, xend = xend, yend = yend), data = df.assign2, arrow = arrow(length = unit(0.02, "npc"), type = 'closed')) + scale_x_continuous(minor_breaks = seq(1, 50, 1)) + scale_y_continuous(minor_breaks = seq(1, 50, 1)) + ggtitle('Optimal Allocation')

    Monte Carlo simulation

    Now the waiting time… Suppose that we have 10 passengers asking for cars in the city, we are going to see how the waiting time changes with the numbers of cars. As example, we are going to change the numbers of cars from 10 up to 30 cars, with 500 hundred Monte Carlo simulation.

    simulations = function(N, MC) { ncars = N times = matrix(0, nrow = MC, ncol = N) for (i in 1:MC){ passengers = lapply(seq(1:10), create_passenger) cars = lapply(seq(1:ncars), create_car) d.matrix = distance.matrix(cars, passengers) opt.allocation = lp.assign(d.matrix) times[i,] = colSums(opt.allocation$solution*opt.allocation$costs) # waiting time for each passenger } return(times) } results = lapply(seq(10,30,2), simulations, MC = 500) # MC = 500 just to save some time

    Now it is possible to check how the numbers of cars affect the mean waiting time and the 10% and 90% confidence level for the waiting time.

    df2 = data.frame('WaitingTime' = sapply(results, mean), 'LB' = sapply(results, quantile, probs = 0.10), 'UB' = sapply(results, quantile, probs = 0.90), 'Cars' = seq(10,30,2)) ggplot(df2, aes(x = Cars)) + geom_line(aes(y = WaitingTime), lwd = 1.2) + geom_ribbon(aes(ymin = LB, ymax = UB), fill = 'blue', alpha = .3)

    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: R – insightR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    .rprofile: David Smith

    Fri, 10/13/2017 - 09:00

    (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

    In this occasional series, we interview someone using a loosely defined set of interview questions for the purpose of “Demystifying the Creative/Development Process of the R Community”. This interview was conducted and prepared by Kelly O’Briant as part of an rOpenSci unconf17 project.

    David Smith is a Blogger and Community Lead at Microsoft. I had the chance to interview David last May at rOpenSci unconf17. We spoke about his career, the process of working remote within a team, community development/outreach and his personal methods for discovering great content to share and write about.

    KO: What is your name, job title, and how long have you been using R?

    DS: My name is David Smith. I work at Microsoft and my self-imposed title is ‘R Community Lead’. I’ve been working with R specifically for about 10 years, but I’d been working with S since the early 90s.

    KO: How did you transition into using R?

    DS: I was using S for a long, long time, and I worked for the company that commercialized S, where I was a project manager for S-PLUS. And I got out of that company, and then worked for a startup in a different industry for a couple of years. But while I was there, the people that founded Revolution Analytics approached me because they were setting up a company to build a commercial business around R, and reached out to me because of my connections with the S community.

    KO: So you came to Microsoft through Revolution?

    DS: That’s correct. I was with Revolution Analytics and then Microsoft bought that company, so I’ve been with Microsoft since then.

    KO: How has that transition gone, is there a Revolution team inside of Microsoft, or has it become more spread out?

    DS: It’s been more spread out. It got split up into the engineering people and the marketing people, sales people all got distributed around. When I first went to Microsoft I started off in the engineering group doing product management. But I was also doing the community role, and it just wasn’t a very good fit just time-wise, between doing community stuff and doing code or product management. So then I switched to a different group called the Ecosystem team, and so now I’m 100% focused on community within a group that’s focused on the ecosystem in general.

    The one piece of advice I could give anyone starting out in their careers is – write what you do, write it in public, and make it so that other people can reproduce what you did.

    KO: What does it mean to be 100% community focused, do you do a lot of training?

    DS: I don’t do a lot of training myself, but I work with a lot of other people on the team who do training. We’re focused mainly on building up the ecosystem of people that ultimately add value to the products that Microsoft has. And we’re specifically involved in the products that Microsoft has that now incorporate R by building up the value of the R ecosystem.

    KO: What does your day-to-day look like, are you in an office, do you work remote?

    DS: I work from home. I had moved from Seattle where Microsoft is headquartered to Chicago a couple of months before the acquisition happened, so I wasn’t about to move back to Seattle. But they let me work from home in Chicago, which has worked out great because most of my job is communicating with the community at large. So I do the Revolutions Blog, which I’ve been writing for eight or nine years now, writing stories about people using R, and applications of R packages. All as a way of communicating to the wider world the cool stuff that people can do with R, and also to the R community occasionally, what kind of things you can do with R in the Microsoft environment.

    KO: Have you always been a writer or interested in writing and communications?

    DS: No, no. I have a mathematics and computer science degree. I’m not trained as a writer. But it’s actually been useful having the perspective of statistics and mathematics and programming, and to bring that to a broader audience through writing. I’ve learned a lot about the whole writing and blogging and journalism process through that, but I’m certainly not trained in that way.

    KO: How does your Ecosystems team at Microsoft function and collaborate?

    DS: Unlike many teams at Microsoft, our team is very distributed. We have people working remotely from Denver, I’m in Chicago, Seattle, we’re all kind of distributed all around the place. So we meet virtually through Skype, have video meetings once a week and communicate a lot online.

    KO: What kind of tools are you using?

    DS: Traditionally, as in Microsoft, mainly email and Skype for the meetings. I set up an internal team focused around community more broadly around Microsoft and we use Microsoft Teams for that,
    which is a little bit like Slack. But a lot of the stuff that I do is more out in the open, so I use a lot of Twitter and Github for the code that I point to and stuff like that.

    KO: How do you manage your Twitter?

    DS: Twitter I do manually in real-time. I don’t do a lot of scheduling except for @RLangTip
    which is a feed of daily R tips. And for that I do scheduling through Tweetdeck on the web.

    KO: How many Twitter accounts are you managing?

    DS: I run @revodavid which is my personal twitter account, and @RLangTip which is R language tips. I tweet for @R_Forwards which is the diversity community for R, @RConsortium, the R Consortium, so quite a few.

    KO: How long has this been a core part of your work day?

    DS: The community thing as a focus, maybe five or six years? My career path for a long time was in product management. So I managed S-PLUS as a product for a long time, I managed another product at a different startup, and then I came to Revolution and I did a combination of engineering and product management. But in the last 18 months I’ve been 100% in the community space.

    KO: How did you get into product management to begin with?

    DS: That’s a good question that I’m not sure I know the answer to. I started off my first job after university — I actually left university specifically to become a support engineer for S-PLUS. When I took on that role, they didn’t really have product management yet at that company, and so when they were looking for somebody to basically steer S-PLUS as a product, it was a good fit for me and an opportunity to move to the States. I took that on and I kind of just learned product management as I did it. I went to a few sort of training/seminar type things, but I didn’t study it.

    KO: Sure. It seems like something that people just kind of get saddled with sometimes?

    DS: Exactly. It’s a discipline that doesn’t really have discipline. But for the various companies I’ve worked for, mostly startups, they all seem to have very different perspectives on what product management is and what the role of a product manager is.

    KO: Yeah, I know what you mean. Are you happy to have sort of moved away from that?

    DS: I am in the sense of — it was different being in a startup where being a product manager was more like being the shepherd of an entire product ecosystem, whereas in a big company the product manager is a lot more focused and inherently so, a lot more narrow. I happen to prefer the bigger picture I guess.

    Honestly, I kind of focus from the point of view of what interests me personally. Which doesn’t sound very community oriented at all… but it’s an exercise in empathy.

    KO: What’s your process for deciding what things you talk about and bring to the community?

    DS: Honestly, I kind of focus from the point of view of what interests me personally. Which doesn’t sound very community oriented at all… but it’s an exercise in empathy. If I can write about something, or find a topic that I might find is interesting or exciting and I can communicate that with other people, I’m motivated to write about it and I hope that people are then motivated to learn about it. Kind of the antithesis of this is when I worked in marketing for a while; a lot of that style of writing was the bane of my existence because you’re producing these documents that literally are designed for nobody to read, in this language that nobody engages with. I much prefer blogging and tweeting because it’s much more directly for people.

    KO: What have some of your most popular or successful engagements been about? Feel free to interpret ‘successful' in any way.

    DS: Well, from the point of view of what has been the most rewarding part of my job, is finding under-recognized or just these really cool things that people have done that just haven’t had a lot of exposure. And I’ve got a fairly big audience and a fairly wide reach, and it’s always fun for me to find things that people have done that maybe haven’t been seen. And it’s not my work, but I can — you know — take an eight page document that somebody’s written that has really cool things in it and just pull out various things. There is so much very cool stuff that people have done, half of the battle is getting it out there.

    KO: What are some of your favorite sources for discovering cool things on the internet?

    DS: There are channels on Reddit that I get a lot of material from, like /r/dataisbeautiful and things like that. It’s hard to say particular accounts on twitter, but I’ve spent a lot of time following people where I’ve read one of their blog posts and I find their twitter account, and they have just a few followers, I’ll follow them, and then over time it amounts to some good stuff. I have twitter open all day, every day. I don’t read everything on my feed every day, but I certainly keep it open.

    KO: How much of your day is just spent exploring?

    DS: A lot of it. I spend about half of any given day reading. It takes a long time, but every now and then you find this really cool stuff.

    It’s one thing to be able to do really cool stuff in R or any other language, but until you can distill that down into something that other people consume, it’s going to be hard to sell yourself.

    KO: Do you have any last nuggets of wisdom for people starting out their careers in R?

    DS: For people starting out their careers, I think one of the most important skills to learn is that communication skill. It’s one thing to be able to do really cool stuff in R or any other language, but until you can distill that down into something that other people consume, it’s going to be hard to sell yourself. And it’s also going to be hard to be valuable. A lot of the people I’ve watched evolve in the community are people who have begun very early in their careers, blogging about what they do. The one piece of advice I could give anyone starting out in their careers is – write what you do, write it in public, and make it so that other people can reproduce what you did.

    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Practical Machine Learning with R and Python – Part 2

    Fri, 10/13/2017 - 05:57

    (This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

    In this 2nd part of the series “Practical Machine Learning with R and Python – Part 2”, I continue where I left off in my first post Practical Machine Learning with R and Python – Part 2. In this post I cover the some classification algorithmns and cross validation. Specifically I touch
    -Logistic Regression
    -K Nearest Neighbors (KNN) classification
    -Leave out one Cross Validation (LOOCV)
    -K Fold Cross Validation
    in both R and Python.

    As in my initial post the algorithms are based on the following courses.

    You can download this R Markdown file along with the data from Github. I hope these posts can be used as a quick reference in R and Python and Machine Learning.I have tried to include the coolest part of either course in this post.

    The following classification problem is based on Logistic Regression. The data is an included data set in Scikit-Learn, which I have saved as csv and use it also for R. The fit of a classification Machine Learning Model depends on how correctly classifies the data. There are several measures of testing a model’s classification performance. They are

    Accuracy = TP + TN / (TP + TN + FP + FN) – Fraction of all classes correctly classified
    Precision = TP / (TP + FP) – Fraction of correctly classified positives among those classified as positive
    Recall = TP / (TP + FN) Also known as sensitivity, or True Positive Rate (True positive) – Fraction of correctly classified as positive among all positives in the data
    F1 = 2 * Precision * Recall / (Precision + Recall)

    1a. Logistic Regression – R code

    The caret and e1071 package is required for using the confusionMatrix call

    source("RFunctions.R") library(dplyr) library(caret) library(e1071) # Read the data (from sklearn) cancer <- read.csv("cancer.csv") # Rename the target variable names(cancer) <- c(seq(1,30),"output") # Split as training and test sets train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5) train <- cancer[train_idx, ] test <- cancer[-train_idx, ] # Fit a generalized linear logistic model, fit=glm(output~.,family=binomial,data=train,control = list(maxit = 50)) # Predict the output from the model a=predict(fit,newdata=train,type="response") # Set response >0.5 as 1 and <=0.5 as 0 b=ifelse(a>0.5,1,0) # Compute the confusion matrix for training data confusionMatrix(b,train$output) ## Confusion Matrix and Statistics ## ## Reference ## Prediction 0 1 ## 0 154 0 ## 1 0 272 ## ## Accuracy : 1 ## 95% CI : (0.9914, 1) ## No Information Rate : 0.6385 ## P-Value [Acc > NIR] : < 2.2e-16 ## ## Kappa : 1 ## Mcnemar's Test P-Value : NA ## ## Sensitivity : 1.0000 ## Specificity : 1.0000 ## Pos Pred Value : 1.0000 ## Neg Pred Value : 1.0000 ## Prevalence : 0.3615 ## Detection Rate : 0.3615 ## Detection Prevalence : 0.3615 ## Balanced Accuracy : 1.0000 ## ## 'Positive' Class : 0 ## m=predict(fit,newdata=test,type="response") n=ifelse(m>0.5,1,0) # Compute the confusion matrix for test output confusionMatrix(n,test$output) ## Confusion Matrix and Statistics ## ## Reference ## Prediction 0 1 ## 0 52 4 ## 1 5 81 ## ## Accuracy : 0.9366 ## 95% CI : (0.8831, 0.9706) ## No Information Rate : 0.5986 ## P-Value [Acc > NIR] : <2e-16 ## ## Kappa : 0.8677 ## Mcnemar's Test P-Value : 1 ## ## Sensitivity : 0.9123 ## Specificity : 0.9529 ## Pos Pred Value : 0.9286 ## Neg Pred Value : 0.9419 ## Prevalence : 0.4014 ## Detection Rate : 0.3662 ## Detection Prevalence : 0.3944 ## Balanced Accuracy : 0.9326 ## ## 'Positive' Class : 0 ## 1b. Logistic Regression – Python code import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression os.chdir("C:\\Users\\Ganesh\\RandPython") from sklearn.datasets import make_classification, make_blobs from sklearn.metrics import confusion_matrix from matplotlib.colors import ListedColormap from sklearn.datasets import load_breast_cancer # Load the cancer data (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True) X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0) # Call the Logisitic Regression function clf = LogisticRegression().fit(X_train, y_train) fig, subaxes = plt.subplots(1, 1, figsize=(7, 5)) # Fit a model clf = LogisticRegression().fit(X_train, y_train) # Compute and print the Accuray scores print('Accuracy of Logistic regression classifier on training set: {:.2f}' .format(clf.score(X_train, y_train))) print('Accuracy of Logistic regression classifier on test set: {:.2f}' .format(clf.score(X_test, y_test))) y_predicted=clf.predict(X_test) # Compute and print confusion matrix confusion = confusion_matrix(y_test, y_predicted) from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_predicted))) print('Precision: {:.2f}'.format(precision_score(y_test, y_predicted))) print('Recall: {:.2f}'.format(recall_score(y_test, y_predicted))) print('F1: {:.2f}'.format(f1_score(y_test, y_predicted))) ## Accuracy of Logistic regression classifier on training set: 0.96 ## Accuracy of Logistic regression classifier on test set: 0.96 ## Accuracy: 0.96 ## Precision: 0.99 ## Recall: 0.94 ## F1: 0.97 2. Dummy variables

    The following R and Python code show how dummy variables are handled in R and Python. Dummy variables are categorival variables which have to be converted into appropriate values before using them in Machine Learning Model For e.g. if we had currency as ‘dollar’, ‘rupee’ and ‘yen’ then the dummy variable will convert this as
    dollar 0 0 0
    rupee 0 0 1
    yen 0 1 0

    2a. Logistic Regression with dummy variables- R code # Load the dummies library library(dummies) df <- read.csv("adult1.csv",stringsAsFactors = FALSE,na.strings = c(""," "," ?")) # Remove rows which have NA df1 <- df[complete.cases(df),] dim(df1) ## [1] 30161 16 # Select specific columns adult <- df1 %>% dplyr::select(age,occupation,education,educationNum,capitalGain, capital.loss,hours.per.week,native.country,salary) # Set the dummy data with appropriate values adult1 <- dummy.data.frame(adult, sep = ".") #Split as training and test train_idx <- trainTestSplit(adult1,trainPercent=75,seed=1111) train <- adult1[train_idx, ] test <- adult1[-train_idx, ] # Fit a binomial logistic regression fit=glm(salary~.,family=binomial,data=train) # Predict response a=predict(fit,newdata=train,type="response") # If response >0.5 then it is a 1 and 0 otherwise b=ifelse(a>0.5,1,0) confusionMatrix(b,train$salary) ## Confusion Matrix and Statistics ## ## Reference ## Prediction 0 1 ## 0 16065 3145 ## 1 968 2442 ## ## Accuracy : 0.8182 ## 95% CI : (0.8131, 0.8232) ## No Information Rate : 0.753 ## P-Value [Acc > NIR] : < 2.2e-16 ## ## Kappa : 0.4375 ## Mcnemar's Test P-Value : < 2.2e-16 ## ## Sensitivity : 0.9432 ## Specificity : 0.4371 ## Pos Pred Value : 0.8363 ## Neg Pred Value : 0.7161 ## Prevalence : 0.7530 ## Detection Rate : 0.7102 ## Detection Prevalence : 0.8492 ## Balanced Accuracy : 0.6901 ## ## 'Positive' Class : 0 ## # Compute and display confusion matrix m=predict(fit,newdata=test,type="response") ## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = ## ifelse(type == : prediction from a rank-deficient fit may be misleading n=ifelse(m>0.5,1,0) confusionMatrix(n,test$salary) ## Confusion Matrix and Statistics ## ## Reference ## Prediction 0 1 ## 0 5263 1099 ## 1 357 822 ## ## Accuracy : 0.8069 ## 95% CI : (0.7978, 0.8158) ## No Information Rate : 0.7453 ## P-Value [Acc > NIR] : < 2.2e-16 ## ## Kappa : 0.4174 ## Mcnemar's Test P-Value : < 2.2e-16 ## ## Sensitivity : 0.9365 ## Specificity : 0.4279 ## Pos Pred Value : 0.8273 ## Neg Pred Value : 0.6972 ## Prevalence : 0.7453 ## Detection Rate : 0.6979 ## Detection Prevalence : 0.8437 ## Balanced Accuracy : 0.6822 ## ## 'Positive' Class : 0 ## 2b. Logistic Regression with dummy variables- Python code

    Pandas has a get_dummies function for handling dummies

    import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression from sklearn.metrics import confusion_matrix from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score # Read data df =pd.read_csv("adult1.csv",encoding="ISO-8859-1",na_values=[""," "," ?"]) # Drop rows with NA df1=df.dropna() print(df1.shape) # Select specific columns adult = df1[['age','occupation','education','educationNum','capitalGain','capital-loss', 'hours-per-week','native-country','salary']] X=adult[['age','occupation','education','educationNum','capitalGain','capital-loss', 'hours-per-week','native-country']] # Set approporiate values for dummy variables X_adult=pd.get_dummies(X,columns=['occupation','education','native-country']) y=adult['salary'] X_adult_train, X_adult_test, y_train, y_test = train_test_split(X_adult, y, random_state = 0) clf = LogisticRegression().fit(X_adult_train, y_train) # Compute and display Accuracy and Confusion matrix print('Accuracy of Logistic regression classifier on training set: {:.2f}' .format(clf.score(X_adult_train, y_train))) print('Accuracy of Logistic regression classifier on test set: {:.2f}' .format(clf.score(X_adult_test, y_test))) y_predicted=clf.predict(X_adult_test) confusion = confusion_matrix(y_test, y_predicted) print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_predicted))) print('Precision: {:.2f}'.format(precision_score(y_test, y_predicted))) print('Recall: {:.2f}'.format(recall_score(y_test, y_predicted))) print('F1: {:.2f}'.format(f1_score(y_test, y_predicted))) ## (30161, 16) ## Accuracy of Logistic regression classifier on training set: 0.82 ## Accuracy of Logistic regression classifier on test set: 0.81 ## Accuracy: 0.81 ## Precision: 0.68 ## Recall: 0.41 ## F1: 0.51 3a – K Nearest Neighbors Classification – R code

    The Adult data set is taken from UCI Machine Learning Repository

    source("RFunctions.R") df <- read.csv("adult1.csv",stringsAsFactors = FALSE,na.strings = c(""," "," ?")) # Remove rows which have NA df1 <- df[complete.cases(df),] dim(df1) ## [1] 30161 16 # Select specific columns adult <- df1 %>% dplyr::select(age,occupation,education,educationNum,capitalGain, capital.loss,hours.per.week,native.country,salary) # Set dummy variables adult1 <- dummy.data.frame(adult, sep = ".") #Split train and test as required by KNN classsification model train_idx <- trainTestSplit(adult1,trainPercent=75,seed=1111) train <- adult1[train_idx, ] test <- adult1[-train_idx, ] train.X <- train[,1:76] train.y <- train[,77] test.X <- test[,1:76] test.y <- test[,77] # Fit a model for 1,3,5,10 and 15 neighbors cMat <- NULL neighbors <-c(1,3,5,10,15) for(i in seq_along(neighbors)){ fit =knn(train.X,test.X,train.y,k=i) table(fit,test.y) a<-confusionMatrix(fit,test.y) cMat[i] <- a$overall[1] print(a$overall[1]) } ## Accuracy ## 0.7835831 ## Accuracy ## 0.8162047 ## Accuracy ## 0.8089113 ## Accuracy ## 0.8209787 ## Accuracy ## 0.8184591 #Plot the Accuracy for each of the KNN models df <- data.frame(neighbors,Accuracy=cMat) ggplot(df,aes(x=neighbors,y=Accuracy)) + geom_point() +geom_line(color="blue") + xlab("Number of neighbors") + ylab("Accuracy") + ggtitle("KNN regression - Accuracy vs Number of Neighors (Unnormalized)")

    3b – K Nearest Neighbors Classification – Python code import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.metrics import confusion_matrix from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import MinMaxScaler # Read data df =pd.read_csv("adult1.csv",encoding="ISO-8859-1",na_values=[""," "," ?"]) df1=df.dropna() print(df1.shape) # Select specific columns adult = df1[['age','occupation','education','educationNum','capitalGain','capital-loss', 'hours-per-week','native-country','salary']] X=adult[['age','occupation','education','educationNum','capitalGain','capital-loss', 'hours-per-week','native-country']] #Set values for dummy variables X_adult=pd.get_dummies(X,columns=['occupation','education','native-country']) y=adult['salary'] X_adult_train, X_adult_test, y_train, y_test = train_test_split(X_adult, y, random_state = 0) # KNN classification in Python requires the data to be scaled. # Scale the data scaler = MinMaxScaler() X_train_scaled = scaler.fit_transform(X_adult_train) # Apply scaling to test set also X_test_scaled = scaler.transform(X_adult_test) # Compute the KNN model for 1,3,5,10 & 15 neighbors accuracy=[] neighbors=[1,3,5,10,15] for i in neighbors: knn = KNeighborsClassifier(n_neighbors = i) knn.fit(X_train_scaled, y_train) accuracy.append(knn.score(X_test_scaled, y_test)) print('Accuracy test score: {:.3f}' .format(knn.score(X_test_scaled, y_test))) # Plot the models with the Accuracy attained for each of these models fig1=plt.plot(neighbors,accuracy) fig1=plt.title("KNN regression - Accuracy vs Number of neighbors") fig1=plt.xlabel("Neighbors") fig1=plt.ylabel("Accuracy") fig1.figure.savefig('foo1.png', bbox_inches='tight') ## (30161, 16) ## Accuracy test score: 0.749 ## Accuracy test score: 0.779 ## Accuracy test score: 0.793 ## Accuracy test score: 0.804 ## Accuracy test score: 0.803

    Output image:

    4 MPG vs Horsepower

    The following scatter plot shows the non-linear relation between mpg and horsepower. This will be used as the data input for computing K Fold Cross Validation Error

    4a MPG vs Horsepower scatter plot – R Code df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI df1 <- as.data.frame(sapply(df,as.numeric)) df2 <- df1 %>% dplyr::select(cylinder,displacement, horsepower,weight, acceleration, year,mpg) df3 <- df2[complete.cases(df2),] ggplot(df3,aes(x=horsepower,y=mpg)) + geom_point() + xlab("Horsepower") + ylab("Miles Per gallon") + ggtitle("Miles per Gallon vs Hosrsepower")

    4b MPG vs Horsepower scatter plot – Python Code import numpy as np import pandas as pd import os import matplotlib.pyplot as plt autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1") autoDF.shape autoDF.columns autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']] autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce') autoDF3=autoDF2.dropna() autoDF3.shape #X=autoDF3[['cylinder','displacement','horsepower','weight']] X=autoDF3[['horsepower']] y=autoDF3['mpg'] fig11=plt.scatter(X,y) fig11=plt.title("KNN regression - Accuracy vs Number of neighbors") fig11=plt.xlabel("Neighbors") fig11=plt.ylabel("Accuracy") fig11.figure.savefig('foo11.png', bbox_inches='tight') 5 K Fold Cross Validation

    K Fold Cross Validation is a technique in which the data set is divided into K Folds or K partitions. The Machine Learning model is trained on K-1 folds and tested on the Kth fold i.e.
    we will have K-1 folds for training data and 1 for testing the ML model. Since we can partition this as or K choose 1, there will be K such partitions. The K Fold Cross
    Validation estimates the average validation error that we can expect on a new unseen test data.

    The formula for K Fold Cross validation is as follows


    and

    and

    where is the number of elements in partition ‘K’ and N is the total number of elements


    Leave Out one Cross Validation (LOOCV) is a special case of K Fold Cross Validation where N-1 data points are used to train the model and 1 data point is used to test the model. There are N such paritions of N-1 & 1 that are possible. The mean error is measured The Cross Valifation Error for LOOCV is


    where is the diagonal hat matrix

    see [Statistical Learning]

    The above formula is also included in this blog post

    It took me a day and a half to implement the K Fold Cross Validation formula. I think it is correct. In any case do let me know if you think it is off

    5a. Leave out one cross validation (LOOCV) – R Code

    R uses the package ‘boot’ for performing Cross Validation error computation

    library(boot) library(reshape2) # Read data df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI df1 <- as.data.frame(sapply(df,as.numeric)) # Select complete cases df2 <- df1 %>% dplyr::select(cylinder,displacement, horsepower,weight, acceleration, year,mpg) df3 <- df2[complete.cases(df2),] set.seed(17) cv.error=rep(0,10) # For polynomials 1,2,3... 10 fit a LOOCV model for (i in 1:10){ glm.fit=glm(mpg~poly(horsepower,i),data=df3) cv.error[i]=cv.glm(df3,glm.fit)$delta[1] } cv.error ## [1] 24.23151 19.24821 19.33498 19.42443 19.03321 18.97864 18.83305 ## [8] 18.96115 19.06863 19.49093 # Create and display a plot folds <- seq(1,10) df <- data.frame(folds,cvError=cv.error) ggplot(df,aes(x=folds,y=cvError)) + geom_point() +geom_line(color="blue") + xlab("Degree of Polynomial") + ylab("Cross Validation Error") + ggtitle("Leave one out Cross Validation - Cross Validation Error vs Degree of Polynomial")

    5b. Leave out one cross validation (LOOCV) – Python Code

    In Python there is no available function to compute Cross Validation error and we have to compute the above formula. I have done this after several hours. I think it is now in reasonable shape. Do let me know if you think otherwise. For LOOCV I use the K Fold Cross Validation with K=N

    import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.linear_model import LinearRegression from sklearn.cross_validation import train_test_split, KFold from sklearn.preprocessing import PolynomialFeatures from sklearn.metrics import mean_squared_error # Read data autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1") autoDF.shape autoDF.columns autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']] autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce') # Remove rows with NAs autoDF3=autoDF2.dropna() autoDF3.shape X=autoDF3[['horsepower']] y=autoDF3['mpg'] # For polynomial degree 1,2,3... 10 def computeCVError(X,y,folds): deg=[] mse=[] degree1=[1,2,3,4,5,6,7,8,9,10] nK=len(X)/float(folds) xval_err=0 # For degree 'j' for j in degree1: # Split as 'folds' kf = KFold(len(X),n_folds=folds) for train_index, test_index in kf: # Create the appropriate train and test partitions from the fold index X_train, X_test = X.iloc[train_index], X.iloc[test_index] y_train, y_test = y.iloc[train_index], y.iloc[test_index] # For the polynomial degree 'j' poly = PolynomialFeatures(degree=j) # Transform the X_train and X_test X_train_poly = poly.fit_transform(X_train) X_test_poly = poly.fit_transform(X_test) # Fit a model on the transformed data linreg = LinearRegression().fit(X_train_poly, y_train) # Compute yhat or ypred y_pred = linreg.predict(X_test_poly) # Compute MSE * n_K/N test_mse = mean_squared_error(y_test, y_pred)*float(len(X_train))/float(len(X)) # Add the test_mse for this partition of the data mse.append(test_mse) # Compute the mean of all folds for degree 'j' deg.append(np.mean(mse)) return(deg) df=pd.DataFrame() print(len(X)) # Call the function once. For LOOCV K=N. hence len(X) is passed as number of folds cvError=computeCVError(X,y,len(X)) # Create and plot LOOCV df=pd.DataFrame(cvError) fig3=df.plot() fig3=plt.title("Leave one out Cross Validation - Cross Validation Error vs Degree of Polynomial") fig3=plt.xlabel("Degree of Polynomial") fig3=plt.ylabel("Cross validation Error") fig3.figure.savefig('foo3.png', bbox_inches='tight')

     

    6a K Fold Cross Validation – R code

    Here K Fold Cross Validation is done for 4, 5 and 10 folds using the R package boot and the glm package

    library(boot) library(reshape2) set.seed(17) #Read data df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI df1 <- as.data.frame(sapply(df,as.numeric)) df2 <- df1 %>% dplyr::select(cylinder,displacement, horsepower,weight, acceleration, year,mpg) df3 <- df2[complete.cases(df2),] a=matrix(rep(0,30),nrow=3,ncol=10) set.seed(17) # Set the folds as 4,5 and 10 folds<-c(4,5,10) for(i in seq_along(folds)){ cv.error.10=rep(0,10) for (j in 1:10){ # Fit a generalized linear model glm.fit=glm(mpg~poly(horsepower,j),data=df3) # Compute K Fold Validation error a[i,j]=cv.glm(df3,glm.fit,K=folds[i])$delta[1] } } # Create and display the K Fold Cross Validation Error b <- t(a) df <- data.frame(b) df1 <- cbind(seq(1,10),df) names(df1) <- c("PolynomialDegree","4-fold","5-fold","10-fold") df2 <- melt(df1,id="PolynomialDegree") ggplot(df2) + geom_line(aes(x=PolynomialDegree, y=value, colour=variable),size=2) + xlab("Degree of Polynomial") + ylab("Cross Validation Error") + ggtitle("K Fold Cross Validation - Cross Validation Error vs Degree of Polynomial")

    6b. K Fold Cross Validation – Python code

    The implementation of K-Fold Cross Validation Error has to be implemented and I have done this below. There is a small discrepancy in the shapes of the curves with the R plot above. Not sure why!

    import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.linear_model import LinearRegression from sklearn.cross_validation import train_test_split, KFold from sklearn.preprocessing import PolynomialFeatures from sklearn.metrics import mean_squared_error # Read data autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1") autoDF.shape autoDF.columns autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']] autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce') # Drop NA rows autoDF3=autoDF2.dropna() autoDF3.shape #X=autoDF3[['cylinder','displacement','horsepower','weight']] X=autoDF3[['horsepower']] y=autoDF3['mpg'] # Create Cross Validation function def computeCVError(X,y,folds): deg=[] mse=[] # For degree 1,2,3,..10 degree1=[1,2,3,4,5,6,7,8,9,10] nK=len(X)/float(folds) xval_err=0 for j in degree1: # Split the data into 'folds' kf = KFold(len(X),n_folds=folds) for train_index, test_index in kf: # Partition the data acccording the fold indices generated X_train, X_test = X.iloc[train_index], X.iloc[test_index] y_train, y_test = y.iloc[train_index], y.iloc[test_index] # Scale the X_train and X_test as per the polynomial degree 'j' poly = PolynomialFeatures(degree=j) X_train_poly = poly.fit_transform(X_train) X_test_poly = poly.fit_transform(X_test) # Fit a polynomial regression linreg = LinearRegression().fit(X_train_poly, y_train) # Compute yhat or ypred y_pred = linreg.predict(X_test_poly) # Compute MSE *(nK/N) test_mse = mean_squared_error(y_test, y_pred)*float(len(X_train))/float(len(X)) # Append to list for different folds mse.append(test_mse) # Compute the mean for poylnomial 'j' deg.append(np.mean(mse)) return(deg) # Create and display a plot of K -Folds df=pd.DataFrame() for folds in [4,5,10]: cvError=computeCVError(X,y,folds) #print(cvError) df1=pd.DataFrame(cvError) df=pd.concat([df,df1],axis=1) #print(cvError) df.columns=['4-fold','5-fold','10-fold'] df=df.reindex([1,2,3,4,5,6,7,8,9,10]) df fig2=df.plot() fig2=plt.title("K Fold Cross Validation - Cross Validation Error vs Degree of Polynomial") fig2=plt.xlabel("Degree of Polynomial") fig2=plt.ylabel("Cross validation Error") fig2.figure.savefig('foo2.png', bbox_inches='tight')

    output

    This concludes this 2nd part of this series. I will look into model tuning and model selection in R and Python in the coming parts. Comments, suggestions and corrections are welcome!
    To be continued….
    Watch this space!

    Also see

    1. Design Principles of Scalable, Distributed Systems
    2. Re-introducing cricketr! : An R package to analyze performances of cricketers
    3. Spicing up a IBM Bluemix cloud app with MongoDB and NodeExpress
    4. Using Linear Programming (LP) for optimizing bowling change or batting lineup in T20 cricket
    5. Simulating an Edge Shape in Android

    To see all posts see Index of posts

    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    GitHub Streak: Round Four

    Fri, 10/13/2017 - 04:45

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    Three years ago I referenced the Seinfeld Streak used in an earlier post of regular updates to to the Rcpp Gallery:

    This is sometimes called Jerry Seinfeld’s secret to productivity: Just keep at it. Don’t break the streak.

    and showed the first chart of GitHub streaking

    And two year ago a first follow-up appeared in this post:

    And a year ago we had a followup last year

    And as it October 12 again, here is the new one:

    Again, special thanks go to Alessandro Pezzè for the Chrome add-on GithubOriginalStreak.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    The Bold & Beautiful Character Similarities using Word Embeddings

    Thu, 10/12/2017 - 22:32

    (This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers)

    Introduction

    I often see advertisement for The Bold and The Beautiful, I have never watched a single episode of the series. Still, even as a data scientist you might be wondering how these beautiful ladies and gentlemen from the show are related to each other. I do not have the time to watch all these episodes to find out, so I am going to use word embeddings on recaps instead…

    Calculating word embeddings

    First, we need some data, from the first few google hits I got to the site soap central. Recaps can be found from the show that date back to 1997. Then, I used a little bit of rvest code to scrape the daily recaps into an R data set.

    Word embedding is a technique to transform a word onto a vector of numbers, there are several approaches to do this. I have used the so-called Global Vector word embedding. See here for details, it makes use of word co-occurrences that are determined from a (large) collection of documents and there is fast implementation in the R text2vec package.

    Once words are transformed to vectors, you can calculate distances (similarities) between the words, for a specific word you can calculate the top 10 closest words for example. More over linguistic regularities can be determined, for example:

    amsterdam - netherlands + germany

    would result in a vector that would be close to the vector for berlin.

    Results for The B&B recaps

    It takes about an hour on my laptop to determine the word vectors (length 250) from 3645 B&B recaps (15 seasons). After removing some common stop words, I have 10.293 unique words, text2vec puts the embeddings in a matrix (10.293 by 250).

    Lets take the lovely steffy,

    the ten closest words are:

    from to value 1 steffy steffy 1.0000000 2 steffy liam 0.8236346 3 steffy hope 0.7904697 4 steffy said 0.7846245 5 steffy wyatt 0.7665321 6 steffy bill 0.6978901 7 steffy asked 0.6879022 8 steffy quinn 0.6781523 9 steffy agreed 0.6563833 10 steffy rick 0.6506576

    Lets take take the vector steffyliam, the closest words we get are

    death furious lastly excused frustration onset 0.2237339 0.2006695 0.1963466 0.1958089 0.1950601 0.1937230

     and for bill – anger we get

    liam katie wyatt steffy quinn said 0.5550065 0.4845969 0.4829327 0.4645065 0.4491479 0.4201712

    The following figure shows some other B&B characters and their closest matches.

     

    If you want to see the top n characters for other B&B characters use my little shiny app. The R code for scraping B&B recaps, calculating glove word-embeddings and a small shiny app can be found on my Git Hub.

    Conclusion

    This is a Mickey Mouse use case, but it might be handy if you are in the train and hear people next to you talking about the B&B, you can join their conversation. Especially if you have had a look at my B&B shiny app……

    Cheers, Longhow

    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: R – Longhow Lam's Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    A cRyptic crossword with an R twist

    Thu, 10/12/2017 - 21:30

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Last week's R-themed crossword from R-Ladies DC was popular, so here's another R-related crossword, this time by Barry Rowlingson and published on page 39 of the June 2003 issue of R-news (now known as the R Journal). Unlike the last crossword, this one follows the conventions of a British cryptic crossword: the grid is symmetrical, and eschews 4×4 blocks of white or black squares. Most importantly, the clues are in the cryptic style: rather than being a direct definition, cryptic clues pair wordplay (homonyms, anagrams, etc) with a hidden definition. (Wikipedia has a good introduction to the types of clues you're likely to find.) Cryptic crosswords can be frustrating for the uninitiated, but are fun and rewarding once you get to into it. 

    In fact, if you're unfamiliar with cryptic crosswords, this one is a great place to start. Not only are many (but not all) of the answers related in some way to R, Barry has helpfully provided the answers along with an explanation of how the cryptic clue was formed. There's no shame in peeking, at least for a few, to help you get your legs with the cryptic style.

    Again, if you're stuck you can find the solution here

    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Pages