R news and tutorials contributed by (750) R bloggers
Updated: 5 hours 5 min ago

R and Data Science activities in London, June 27th – 29th

Wed, 06/07/2017 - 11:27

(This article was first published on R – Locke Data, and kindly contributed to R-bloggers)

Locke Data will be up to some shenanigans of various stripes in the big smoke. We hope to see you at some of them!

June 26th — Monday Introduction to R (Newcastle)

I won’t be in London for this but I’ll be doing a day of Introduction to R in Newcastle. This is supporting the local user groups and costs up to £90 for the whole day.


Intro to R in Newcastle, June 26th

June 27th — Tuesday Data Science Executive Briefing

I’ll be teaming up with onezeero to host an Executive Briefing, offering senior management the opportunity to get a buzzword-free and FUD-free overview of what data science is, ways it can help organisations, and what sorts of things would help ensure they can implement data science with minimal risk.


If this sounds like something you want to attend, or get some senior people at your organisation along to, send Dominic an email at dom@onezeero.co.uk.

Data Science Technologies event

In the evening, I’ll be talking R and SQL Server for real-time predictions at a new meetup being hosted by Winton called Data Science Technologies.

If you’re interested, you can register on the Meetup.


Data Science Technologies

https://www.jumpingrivers.com/

June 28th — Wednesday R and Microsoft training day

I’ll be delivering a training day helping people put R into production using Microsoft products.

Working with Jumping Rivers, there’s a week of training available in London.

On my R with Microsoft training day we’ll look at how we can use R Server to cope with large data volumes and parallel computations, then we’ll look at embedding R in various services to enable applications and users to make use of our R script.

The day is very practical, with exercises along the way to ensure you come out of the training with hands-on experience. As a result, you’ll need to bring along a laptop, and you should already be able to write R code at least at a beginner level.

Find out more about the course and register on JumpingRivers.com


Jumping Rivers Training

June 29th — Thursday Chat over coffee

Fancy grabbing some coffee? Help me recover from some hard work by grabbing a coffee with me. Up to 4 people can book the same slot so that I can introduce folks – but if you want to talk one on one, let me know. Go Coffee!!


Coffee time!

PowerBI User Group

In the evening, I’ll be delivering a session on how R can solve PowerBI pain points at the London PowerBI User Group. If this is of interest, you can RSVP on the meetup.


London PowerBI User Group

The post R and Data Science activities in London, June 27th – 29th appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.


To leave a comment for the author, please follow the link and comment on their blog: R – Locke Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


14 Jobs for R users (2017-05-06) – from all over the world

Wed, 06/07/2017 - 10:20
To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs
  1. Part-Time
    Big Data Discovery Automation in Digital Health (5-10 hours per week)
    MedStar Institutes for Innovation – Posted by Praxiteles
    Anywhere
    22 May 2017
  2. Full-Time
    Data Scientist / Analytics Consultant
    Hedera Insights – Posted by HederaInsights
    Antwerpen Vlaanderen, Belgium
    18 May 2017
  3. Full-Time
    Data Scientist (m/f)
    Meritocracy – Posted by arianna@meritocracy
    Hamburg Hamburg, Germany
    11 May 2017
  4. Full-Time
    Data Scientist
    One Acre Fund – Posted by OAF
    Nairobi Nairobi County, Kenya
    9 May 2017
  5. Full-Time
    Back-End Developer – Sacramento Kings (Sacramento, CA)
    Sacramento Kings – Posted by khavey
    Sacramento California, United States
    9 May 2017
  6. Full-Time
    Data Analyst, Chicago Violence Reduction Strategy
    National Network for Safe Communities (NNSC) – Posted by nnsc
    Chicago Illinois, United States
    5 May 2017
More New Jobs
  1. Full-Time
    PhD opportunity @ Barcelona / Moorepark Teagasc (Ireland), Universitat Autònoma de Barcelona (Spain) and CEDRIC (Centre d’études et de recherche en informatique et communications) of CNAM (Paris) – Posted by emmanuel.serrano.ferron
    Anywhere
    25 May 2017
  2. Full-Time
    Research Software Engineer @ Bailrigg, England Lancaster University – Posted by killick
    Bailrigg England, United Kingdom
    24 May 2017
  3. Full-Time
    Senior Data Scientist @ Minneapolis, Minnesota, U.S. General Mills – Posted by dreyco676
    Minneapolis Minnesota, United States
    23 May 2017
  4. Part-Time
    Big Data Discovery Automation in Digital Health (5-10 hours per week) MedStar Institutes for Innovation – Posted by Praxiteles
    Anywhere
    22 May 2017
  5. Full-Time
    Data Scientist / Analytics Consultant Hedera Insights – Posted by HederaInsights
    Antwerpen Vlaanderen, Belgium
    18 May 2017
  6. Full-Time
    Data Scientists for ArcelorMittal @ Avilés, Principado de Asturias, Spain ArcelorMittal – Posted by JuanM
    Avilés Principado de Asturias, Spain
    12 May 2017
  7. Full-Time
    Data Scientist (m/f) Meritocracy – Posted by arianna@meritocracy
    Hamburg Hamburg, Germany
    11 May 2017
  8. Full-Time
    Data Scientist Prospera Technologies – Posted by Prospera Technologies
    Tel Aviv-Yafo Tel Aviv District, Israel
    11 May 2017
  9. Full-Time
    Data Scientist One Acre Fund – Posted by OAF
    Nairobi Nairobi County, Kenya
    9 May 2017
  10. Full-Time
    Back-End Developer – Sacramento Kings (Sacramento, CA) Sacramento Kings – Posted by khavey
    Sacramento California, United States
    9 May 2017
  11. Full-Time
    Data Analyst, Chicago Violence Reduction Strategy National Network for Safe Communities (NNSC) – Posted by nnsc
    Chicago Illinois, United States
    5 May 2017
  12. Full-Time
    Transportation Market Research Analyst @ Arlington, Virginia, U.S. RSG – Posted by patricia.holland@rsginc.com
    Arlington Virginia, United States
    4 May 2017
  13. Full-Time
    Data Manager @ Los Angeles, California, U.S. Virtual Pediatric Systems, LLC – Posted by gsotocampos
    Los Angeles California, United States
    3 May 2017
  14. Full-Time
    Sr. Manager, Analytics @ Minneapolis, Minnesota, U.S. Korn Ferry – Posted by jone1087
    Minneapolis Minnesota, United States
    3 May 2017

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).


Unconf projects 3: available, miner, rcheatsheet, ponyexpress

Wed, 06/07/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

Continuing our series of blog posts (day 1, day 2) this week about unconf 17.

available

Summary: Ever have trouble naming your software package? Find a great name and realize it's already taken on CRAN, or further along in development on GitHub? The available package makes it easy to check for valid, available names, and also checks various sources for any unintended meanings. The package can also suggest names based on the description and title of your package.

Team: Jim Hester, Carl Ganz, Gábor Csárdi, Molly Lewis, Rachel Tatman

Github: https://github.com/ropenscilabs/available
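To make this concrete, here is a minimal usage sketch; the package name being checked is invented for illustration:

```r
# install.packages("available")  # if not already installed
library(available)

# Check whether a hypothetical package name is a valid R package name,
# whether it is free on CRAN, Bioconductor and GitHub, and scan common
# sources for unintended meanings
available("quickplotr")
```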

miner

Summary:
The miner package provides a nice interface to the Minecraft API and allows users to manipulate the Minecraft world from a few simple functions. The package is designed with the intention of encouraging new users to learn R by writing scripts to automate activities inside Minecraft.

Team: Brooke Anderson, Karl Broman, Gergely Daróczi, Mario Inchiosa, David Smith, Ali Zaidi

Github: https://github.com/ROpenSciLabs/miner

rcheatsheet

Summary: RStudio pioneered the idea of having beautiful cheatsheets for R packages. Creating them, however, is quite time-consuming, and they require effort to maintain as packages evolve. The rcheatsheet package makes it easy for any package author to quickly generate beautiful cheatsheets from a simple table stored in a Google sheet or a local table. This short slide deck provides a quick overview.

Team: Diana Ly, Ramnath Vaidyanathan

Github: https://github.com/ropenscilabs/rcheatsheet

ponyexpress

Summary:
Need to write custom emails to a large group of people? ponyexpress makes it possible to create custom emails based on various fields of a data frame (final grades for a class, abstract acceptances for a workshop, project assignments for a hackathon, or just about anything else) and quickly send them from your Gmail account.

The package also provides custom templates to make it easy to write rich HTML emails.

Team: Karthik Ram, Lucy D'Agostino McGowan

Github: https://github.com/ropenscilabs/ponyexpress


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.


In case you missed it: May 2017 roundup

Tue, 06/06/2017 - 21:33

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from May of particular interest to R users.

Many interesting presentations recorded at the R/Finance 2017 conference in Chicago are now available to watch.

A review of some of the R packages and projects implemented at the 2017 ROpenSci Unconference.

An example of applying Bayesian Learning with the "bnlearn" package to challenge stereotypical assumptions.

Data from the Billboard Hot 100 chart used to find the most popular words in the titles of pop hits.

Microsoft R Open 3.4.0 is now available for Windows, Mac and Linux.

How to use the "tweenr" package to create smooth transitions in data animations. 

A preview of some of the companies and R applications to be presented at the EARL conference in San Francisco.

The AzureDSVM package makes it easy to spawn and manage clusters of the Azure Data Science Virtual Machine.

An online course on spatial data analysis in R, from the Consumer Data Research Centre in the UK.

Video and slides of Lixun Zhang's presentation "R in Financial Services: Challenges and Opportunity" from the New York R Conference.

Visual Studio 2017 now features built-in support for both R and Python development.

Quantifying the home-field advantage in English Premier League football.

Using the new CRAN_package_db function to analyze data about CRAN packages.

Stack Overflow Trends tracks the trajectory of questions about R and Python.

A recorded webinar on using Microsoft R to predict length of stay in hospitals.

The new "Real-Time Scoring" capability in Microsoft R Server creates a service to generate predictions from certain models in milliseconds.

"Technical Foundations of Informatics" is an open course guide on data analysis and visualization with R with a modern slant.

The Datasaurus Dozen generalizes Anscombe's Quartet with a process to create datasets of any shape with (nearly) the same summary statistics.

CareerCast ranks Statistician as the best job to have in 2017.

You can now use Microsoft R within Alteryx Designer.

How to clean messy data in Excel by providing just a few examples of transformations.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Package “SentimentAnalysis” released on CRAN

Tue, 06/06/2017 - 21:09

(This article was first published on R Blog, and kindly contributed to R-bloggers)

Authors: Stefan Feuerriegel, Nicolas Pröllochs

This report introduces sentiment analysis in R and shows how to use our package “SentimentAnalysis”.

What is sentiment analysis?

Sentiment analysis is a research branch located at the heart of natural language processing (NLP), computational linguistics and text mining. It refers to any measures by which subjective information is extracted from textual documents. In other words, it extracts the polarity of the expressed opinion in a range spanning from positive to negative. As a result, one may also refer to sentiment analysis as opinion mining.

What is the package “SentimentAnalysis” about?

Our package “SentimentAnalysis” performs a sentiment analysis of textual contents in R. This implementation utilizes various existing dictionaries, such as QDAP or Loughran-McDonald. Furthermore, it can also create customized dictionaries. The latter uses LASSO regularization as a statistical approach to select relevant terms based on an exogenous response variable.
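As a rough sketch of the dictionary-generation route, generateDictionary() is the package's function for deriving a customized dictionary; the toy corpus and response below are invented for illustration:

```r
library(SentimentAnalysis)

# Toy corpus with an exogenous numeric response (e.g. stock returns)
documents <- c("This is good", "More good news", "This is bad",
               "Really bad news", "Neither here nor there")
response <- c(1, 1, -1, -1, 0)

# Derive a customized dictionary via LASSO regularization
dict <- generateDictionary(documents, response)
summary(dict)
```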

How to use the package “SentimentAnalysis”?

This simple example shows how to perform a sentiment analysis of a single string. The result is a two-level factor with levels “positive” and “negative.”

# Analyze a single string to obtain a binary response (positive / negative)
sentiment <- analyzeSentiment("Yeah, this was a great soccer game of the German team!")
convertToBinaryResponse(sentiment)$SentimentGI
#> [1] positive
#> Levels: negative positive

The following shows a larger example:

# Create a vector of strings
documents <- c("Wow, I really like the new light sabers!",
               "That book was excellent.",
               "R is a fantastic language.",
               "The service in this restaurant was miserable.",
               "This is neither positive or negative.",
               "The waiter forget about my a dessert -- what a poor service!")

# Analyze sentiment
sentiment <- analyzeSentiment(documents)

# Extract dictionary-based sentiment according to the Harvard-IV dictionary
sentiment$SentimentGI
#> [1] 0.3333333 0.5000000 0.5000000 -0.6666667 0.0000000 -0.6000000

# View sentiment direction (i.e. positive, neutral and negative)
convertToDirection(sentiment$SentimentGI)
#> [1] positive positive positive negative neutral negative
#> Levels: negative neutral positive

response <- c(+1, +1, +1, -1, 0, -1)
compareToResponse(sentiment, response)

For more information, check out the following resources.

Download: https://cran.r-project.org/package=SentimentAnalysis
Vignette: https://cran.r-project.org/web/packages/SentimentAnalysis/vignettes/SentimentAnalysis.html


To leave a comment for the author, please follow the link and comment on their blog: R Blog.


Manipulate Biological Data Using Biostrings Package Exercises (Part 3)

Tue, 06/06/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


Bioinformatics is an amalgamation of biology and computer science: biological data is manipulated using computers and computer software. Biological data includes DNA, RNA and proteins. DNA and RNA are made of nucleotides, which encode our genetic material, while our structure and functions are carried out by proteins, which are built of amino acids.
In these exercises we compare DNA, RNA and amino acid sequences to find out their relationships.
Comparison is done using sequence alignment (sequence comparison) techniques.
There are two types of sequence alignment:
1. Pairwise alignment
2. Multiple sequence alignment

Pairwise alignment refers to a comparison between two sequences, whereas multiple sequence alignment refers to comparing more than two sequences.

In the exercises below we cover how we can do pairwise alignment using Biostrings package in Bioconductor.
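Before diving in, here is a minimal sketch of what a pairwise alignment call looks like; the sequences are invented and this is not a solution to the exercises:

```r
library(Biostrings)

s1 <- DNAString("ACGTACGTTGCA")
s2 <- DNAString("ACGTCGTTGC")

# Global alignment (Needleman-Wunsch style); type may also be
# "local" (Smith-Waterman) or "overlap"
aln <- pairwiseAlignment(s1, s2, type = "global")
score(aln)
```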

Install Packages
Biostrings

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Create two DNA strings and do pairwise alignment using local, global and overlap alignment techniques and print the score.

Exercise 2

Create two DNA strings and do pairwise alignment and write the alignment to an .aln file.

Exercise 3

Create two Amino acid strings and do pairwise alignment.

Exercise 4

Create two Amino acid strings and do pairwise alignment using BLOSUM62 substitution matrix.

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

  • import data into R in several ways while also being able to identify a suitable import tool
  • use SQL code within R
  • And much more

Exercise 5

Create two Amino acid strings and do pairwise alignment using BLOSUM100 substitution matrix.

Exercise 6

Create two Amino acid strings and do pairwise alignment using PAM250 substitution matrix.

Exercise 7

Compare between BLOSUM62 substitution matrix of R and that of the NCBI Database using any two amino acid of your choice.

Exercise 8

Do pairwise alignment using the Needleman-Wunsch alignment algorithm and print the score, suppressing any warnings.

Exercise 9

Create two DNA Strings, translate them to amino acids, and do pairwise alignment between the amino acid sequences.

Exercise 10

Create two RNA Strings, translate them to amino acids, and do pairwise alignment between the amino acid sequences.

Related exercise sets:
  1. Manipulate Biological Data Using Biostrings Package Exercises (Part 2)
  2. Bioinformatics Tutorial with Exercises in R (part 1)
  3. Accessing and Manipulating Biological Databases Exercises (Part-3)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


Web data acquisition: from database to dataframe for data analysis and visualization (Part 4)

Tue, 06/06/2017 - 11:54

The previous post described how the deeply nested JSON data on flights were parsed and stored in an R-friendly database structure. However, looking into the data, the information is not yet ready for statistical analysis and visualization, and some further processing is necessary before extracting insights and producing nice plots.

In the parsed batch, the redundant structure of the data is clearly visible, with the flight id repeated for each segment of each flight. This is also confirmed by the following simple check, as the rows of the dataframe outnumber the unique counts of the elements in the id column.

dim(data_items)
[1] 397 15
length(unique(data_items$id))
201 # real time changes of data could produce different results

This implies that the information of each segment of each flight has to be aggregated and merged in a dataset as single observations of a statistical analysis between, for example, price and distance. First, a unique primary key for each observation has to be used as reference variable to uniquely identify each element of the dataset.

library(plyr)   # sql like functions
library(readr)  # parse numbers from strings

data_items <- data.frame(data_items)

# id (primary key)
data <- data.frame(unique(data_items$id))
colnames(data) <- c('id')

# n° of segments
n_segment <- aggregate(data_items['segment.id'], by = data_items['id'], length)
data <- join(data, n_segment, by = 'id', type = 'left', match = 'first') # sql left join

# mileage
mileage <- aggregate(data_items['segment.leg.mileage'], by = data_items['id'], sum)
data <- join(data, mileage, by = 'id', type = 'left', match = 'first') # sql left join

# price
price <- data.frame('id' = data_items$id, 'price' = parse_number(data_items$saleTotal))
data <- join(data, price, by = 'id', type = 'left', match = 'first') # sql left join

# dataframe
colnames(data) <- c('id', 'segment', 'mileage', 'price')
head(data)

The aggregation of mileage and price using the unique primary key allows us to set up a dataframe ready for statistical analysis and data visualization. Current data tells us that there is a maximum of three segments in the connection between FCO and LHR, with a minimum price of around EUR 122 and a median around EUR 600.

# descriptive statistics
summary(data)

# histogram price & distance
g1 <- ggplot(data, aes(x = price)) +
  geom_histogram(bins = 50) +
  ylab("Distribution of the Price (EUR)") +
  xlab("Price (EUR)")
g2 <- ggplot(data, aes(x = mileage)) +
  geom_histogram(bins = 50) +
  ylab("Distribution of the Distance") +
  xlab("Distance (miles)")
grid.arrange(g1, g2)

# price - distance relationship
s0 <- ggplot(data = data, aes(x = mileage, y = price)) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  geom_point() +
  labs(x = "Distance in miles", y = "Price (EUR)")
s0 <- ggMarginal(s0, type = "histogram", binwidth = 30)
s0

Of course, plenty of other analysis and graphical representations using flights features are possible given the large set of variables available in QPX Express API and the availability of data in real time.

To conclude the 4-step (flight) trip from data acquisition to data analysis, let's recap the most important concepts described in each of the posts:

1) Client-Server connection

2) POST request in R

3) Data parsing and structuring

4) Data analysis and visualization

That's all folks!

#R #rstats #maRche #json #curl #qpxexpress #Rbloggers

This post is also shared in www.r-bloggers.com


Unconf projects 2: checkers, gramr, data-packages, exploRingJSON

Tue, 06/06/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

Following up on Stefanie's recap of unconf 17, we will spend the rest of this week summarizing projects developed at the event. We plan to highlight 4-5 projects each day, with detailed posts from a handful of teams to follow.

checkers

Summary: checkers is a framework for reviewing analysis projects. It provides automated checks for best practices, using extensions on the goodpractice package.

In addition, checkers includes a descriptive guide for best practices.
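For orientation, the underlying goodpractice package that checkers extends can be run on its own like this; the package path below is hypothetical:

```r
library(goodpractice)

# Run the standard best-practice checks on a package source directory
# (the path is a made-up example)
checks <- gp("path/to/mypackage")
checks
```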

Team: Laura DeCicco, Noam Ross, Alice Daish, Molly Lewis, Nistara Randhawa, Jennifer Thompson, Nicholas Tierney

Github: https://github.com/ropenscilabs/checkers

gramr

Summary: gramr is a package that wraps the command line tool write-good to provide grammar checking functions for Rmd or md documents. It can be used as an RStudio Addin, or from the console or command line by supplying an Rmd or md filename.

p.s. Two of the three (Ben/Gordon) were remote participants.

Team: Jasmine Dumas, Ben Marwick, Gordon Shotwell

Github: https://github.com/ropenscilabs/gramr

data-packages

Summary: The data-packages team created a suite of outputs, including a flexdashboard with a searchable table to query dataset metadata.

In addition, the team created a best practices document for data packages, a Twitter bot @rstatsdata that tweets about datasets in R packages on CRAN, and they have plans for future work.

Team: Jakub Nowosad, Andy Teucher, Joseph Stachelek, Richie Cotton, Claudia Vitolo

Github: https://github.com/ropenscilabs/data-packages

exploRingJSON

Summary: The exploRingJSON project includes a survey of R packages that deal with JSON data, a package (https://github.com/sctyner/JSOmetaN) to get high level metadata for JSON data, and a Shiny app to explore JSON.

Team:

Sam Tyner, Kelly O'Briant, Katherine Scranton

Github: https://github.com/ropenscilabs/exploRingJSON


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.


The Many-Faced Future

Tue, 06/06/2017 - 08:21

(This article was first published on jottR, and kindly contributed to R-bloggers)

The future package defines the Future API, which is a unified, generic, friendly API for parallel processing. The Future API follows the principle of write code once and run anywhere – the developer chooses what to parallelize and the user how and where.

The nature of a future is such that it lends itself to be used with several of the existing map-reduce frameworks already available in R. In this post, I’ll give an example of how to apply a function over a set of elements concurrently using plain sequential R, the parallel package, the future package alone, as well as future in combination with the foreach, plyr, and purrr packages.
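As a tiny standalone sketch of the Future API itself, independent of any map-reduce framework (the computations here are arbitrary):

```r
library(future)
plan(multisession)  # resolve futures in parallel on the local machine

# Explicit future: create it, do other work, then collect the value
f <- future({ sum(1:10) })
value(f)

# Implicit future via the %<-% assignment operator
x %<-% { mean(rnorm(100)) }
x
```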


You can choose your own future and what you want to do with it.

Example: Multiple Julia sets

The Julia package provides the JuliaImage() function for generating a Julia set for a given set of start parameters (centre, L, C), where centre specifies the center point in the complex plane, L specifies the width and height of the square region around this location, and C is a complex coefficient controlling the “shape” of the generated Julia set. For example, to generate one of the above Julia set images (1000-by-1000 pixels), you can use:

library("Julia")
set <- JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = -0.4 + 0.6i)
plot_julia(set)

with

plot_julia <- function(img, col = topo.colors(16)) {
  par(mar = c(0, 0, 0, 0))
  image(img, col = col, axes = FALSE)
}

For the purpose of illustrating how to calculate different Julia sets in parallel, I will use the same (centre, L) = (0 + 0i, 3.5) region as above with the following ten complex coefficients (from Julia set):

Cs <- list(
a = -0.618 + 0i,
b = -0.4 + 0.6i,
c = 0.285 + 0i,
d = 0.285 + 0.01i,
e = 0.45 + 0.1428i,
f = -0.70176 - 0.3842i,
g = 0.835 - 0.2321i,
h = -0.8 + 0.156i,
i = -0.7269 + 0.1889i,
j = 0 - 0.8i
)

Now we’re ready to see how we can use futures in combination with different map-reduce implementations in R to generate these ten sets in parallel. Note that all approaches will generate exactly the same ten Julia sets, so feel free to pick your favorite approach.
Sequential

To process the above ten regions sequentially, we can use the lapply() function:

library("Julia")
sets <- lapply(Cs, function(C) {
JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C)
})
Parallel

library("parallel")
ncores <- future::availableCores() ## a friendly version of detectCores()
cl <- makeCluster(ncores)

clusterEvalQ(cl, library("Julia"))
sets <- parLapply(cl, Cs, function(C) {
JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C)
})
Futures (in parallel)

library("future")
plan(multisession) ## defaults to availableCores() workers

library("Julia")
sets <- future_lapply(Cs, function(C) {
JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C)
})

We could also have used the more explicit setup plan(cluster, workers = makeCluster(availableCores())), which is identical to plan(multisession).
Futures with foreach

library("doFuture")
registerDoFuture() ## tells foreach futures should be used
plan(multisession) ## specifies what type of futures

library("Julia")
sets <- foreach(C = Cs) %dopar% {
JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C)
}

Note that I didn’t pass .packages = "Julia" to foreach() because the doFuture backend will do that automatically for us – that’s one of the treats of using futures. If we would have used doParallel::registerDoParallel(cl) or similar, we would have had to worry about that.
Futures with plyr

The plyr package will utilize foreach internally if we pass .parallel = TRUE. Because of this, we can use plyr::llply() to parallelize via futures as follows:

library("plyr")
library("doFuture")
registerDoFuture() ## tells foreach futures should be used
plan(multisession) ## specifies what type of futures

library("Julia")
sets <- llply(Cs, function(C) {
JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C)
}, .parallel = TRUE)

For the same reason as above, here too we don’t have to worry about global variables and making sure needed packages are attached; that’s all handled by the future package.
Futures with purrr (= furrr)

As a final example, here is how you can use futures to parallelize your purrr::map() calls:

library("purrr")
library("future")
plan(multisession)

library("Julia")
sets <- Cs %>%
map(~ future(JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = .x))) %>%
values

Comment: This latter approach will not perform load balancing (“scheduling”) across backend workers; that’s a feature that ideally would be taken care of by purrr itself. However, I have some ideas for future versions of future (pun…) that may achieve this without having to modify the purrr package.
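Load balancing here just means spreading tasks so that no worker sits idle while another is overloaded. A minimal sketch of static round-robin scheduling, in Python for illustration only (this is not how future or purrr implement it):

```python
def round_robin_chunks(tasks, n_workers):
    """Deal tasks out like cards so each worker gets a near-equal share."""
    return [tasks[i::n_workers] for i in range(n_workers)]

# ten tasks dealt to three workers; each worker then processes its own chunk
chunks = round_robin_chunks(list(range(10)), 3)
```

Static chunking like this balances task *counts*; when task durations vary widely (as Julia set computations can), dynamic scheduling, where idle workers pull the next task from a shared queue, balances better.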
Got compute?

If you have access to one or more machines with R installed (e.g. a local or remote cluster, or a Google Compute Engine cluster), and you’ve got direct SSH access to those machines, you can have those machines calculate the above Julia sets for you; just change the future plan, e.g.

plan(cluster, workers = c("machine1", "machine2", "machine3.remote.org"))

If you have access to a high-performance compute (HPC) cluster with an HPC scheduler (e.g. Slurm, TORQUE / PBS, LSF, or SGE), you can harness its power by switching to:

library("future.batchtools")
plan(batchtools_sge)

For more details, see the vignettes of the future.batchtools and batchtools packages.

Happy futuring!



To leave a comment for the author, please follow the link and comment on their blog: jottR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


anytime 0.3.0

Tue, 06/06/2017 - 03:05

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version of the anytime package is now on CRAN. It marks the eleventh release since the inaugural version late last summer.

anytime is a very focused package aiming to do just one thing really well: to convert anything in integer, numeric, character, factor, ordered, … format to either POSIXct or Date objects — and to do so without requiring a format string. See the anytime page, or the GitHub README.md for a few examples.

This release brings a little more consistency to how numeric and integer arguments are handled. Previously, we were overly eager in accepting something such as 20170605 (i.e., today) as numeric or integer input to both anytime() and anydate(). That is well-intentioned, but ultimately foolish: we relied on heuristic cutoffs to determine whether input was "meant to be" a date or a time offset, and there lies madness. We now differentiate whether we were called via anytime() (in which case numeric data is a seconds offset to the epoch, just like as.POSIXct()) or anydate() (in which case it is a days offset to the (date) epoch, just like as.Date()). The previous behaviour can be restored via options, both function-local and global. And of course, there is no change for all other (and more common) input formats, notably character or factor. A full list of changes follows.

Changes in anytime version 0.3.0 (2017-06-05)
  • Numeric input is now always an offset to epoch, with anytime() using seconds, and anydate() using dates. (#65 fixing #63).

  • Old behaviour can be re-enabled with a function-local option, which also supports a global setting via getOption("anytimeOldHeuristic")

  • RStudio versions 1.1.129 or later can run all functions without fear of crashing due to a change in their use of Boost.

  • Replaced init.c with registration code inside of RcppExports.cpp thanks to Rcpp 0.12.11.
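The new numeric semantics mirror the distinction base R draws between as.POSIXct() (seconds since the epoch) and as.Date() (days since the epoch). A rough Python analogue, with hypothetical helpers named for illustration only (this is not the anytime API):

```python
from datetime import date, datetime, timedelta, timezone

def any_time(x):
    """Treat numeric input as a seconds offset from the epoch (as.POSIXct() semantics)."""
    return datetime.fromtimestamp(float(x), tz=timezone.utc)

def any_date(x):
    """Treat numeric input as a days offset from the epoch (as.Date() semantics)."""
    return date(1970, 1, 1) + timedelta(days=int(x))
```

So any_time(86400) is the first second of 1970-01-02, while any_date(86400) would be a date tens of thousands of years out; keeping the two interpretations tied to the function called, rather than to a magnitude heuristic, is exactly the consistency this release adds.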

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the anytime page.

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.


Who Survives Riddler Nation?

Tue, 06/06/2017 - 02:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

Introduction

Last week, I published an article on learning to fight in the Battle for Riddler Nation. Here’s a refresher of the rules:

In a distant, war-torn land, there are 10 castles. There are two warlords: you and your archenemy. Each castle has its own strategic value for a would-be conqueror. Specifically, the castles are worth 1, 2, 3, …, 9, and 10 victory points. You and your enemy each have 100 soldiers to distribute, any way you like, to fight at any of the 10 castles. Whoever sends more soldiers to a given castle conquers that castle and wins its victory points. If you each send the same number of troops, you split the points. You don’t know what distribution of forces your enemy has chosen until the battles begin. Whoever wins the most points wins the war.

Submit a plan distributing your 100 soldiers among the 10 castles. Once I receive all your battle plans, I’ll adjudicate all the possible one-on-one match-ups. Whoever wins the most wars wins the battle royale and is crowned king or queen of Riddler Nation!
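To make the scoring concrete, here is a minimal Python sketch of adjudicating one match-up under these rules (an illustrative helper, not Mr. Roeder's adjudicator):

```python
CASTLE_POINTS = range(1, 11)  # castles are worth 1..10 victory points

def score_war(mine, theirs):
    """Return (my_points, their_points) for two 10-castle troop allocations."""
    pts_me = pts_them = 0.0
    for pts, a, b in zip(CASTLE_POINTS, mine, theirs):
        if a > b:
            pts_me += pts            # more soldiers: win the castle outright
        elif b > a:
            pts_them += pts
        else:
            pts_me += pts / 2        # equal soldiers: split the points
            pts_them += pts / 2
    return pts_me, pts_them

# all-in on castle 10 versus an even spread of 10 soldiers per castle
result = score_war([0] * 9 + [100], [10] * 10)
```

With these inputs the all-in warlord takes castle 10 for 10 points but forfeits castles 1 through 9, worth 45 points in total, and so loses the war 10 to 45.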

This battle took place in two rounds. In the first round, I submitted the solution (1, 1, 1, 1, 1, 1, 23, 23, 24, 24) using intuition, and did not do particularly well. In the second round, I used agent-based modelling and an evolutionary model to get a solution. The strategy (3, 3, 13, 2, 4, 1, 27, 30, 10, 7) performed best in the last round of the simulated 100-years war of Riddler Nation, so I submitted it to the contest.

The results of the second round are now out, and although Mr. Roeder enjoyed my article enough to give it a shout-out (sweet, my name is on FiveThirtyEight, which is among my favorite news sites!), I did not win. Now, this is not shocking; with 931 participants, and game theory essentially reducing every game to either domination by one player or an optimal gamble (the latter being true here), the chances of my coming out on top are slim. With that in mind, I cannot be disappointed. In fact, I have some reasons to feel good; my strategy beats the top performers, who seemed to expect others would play the optimal strategy from the first round (Mr. Roeder’s statistics revealed a downward shift in troop allocations to lower-valued castles) and thus chose a strategy to beat it. Perhaps my strategy was ahead of its time? Who knows.

Mr. Roeder has published the strategies his readers submitted for the second round, and I will be rebuilding a strategy from that data using Mesa and agent-based modeling again, but we will be doing more.

First, let’s see how well my strategy did in the last round, because I want to know.

Second, I’m going to explore the idea that my method generates a mixture of strategies that reflects the game-theoretic optimal strategy. (See the conclusion of my previous post to see the definition of this strategy and its properties.) In some sense, this allows me to determine if I lost the last round because of bad luck or because other players were choosing superior strategies.

Third, I want to cluster the strategies so I can see what “species” of strategies people chose. I use the excellent R package mclust for this clustering.

Fourth and finally, I visualize the evolution of strategies as my simulation plays out, using ggplot2, dimensionality reduction, and gganimate.

Results of the Last Battle for Riddler Nation

The winning strategy of the last round, submitted by Vince Vatter, was (0, 1, 2, 16, 21, 3, 2, 1, 32, 22), with an official record of 751 wins, 175 losses, and 5 ties. Naturally, the top-performing strategies look similar. This should not be surprising; winning strategies exploit common vulnerabilities among submissions.

I’ve downloaded the submitted strategies for the second round (I already have the first round’s strategies). Let’s load them in and start analyzing them.

import pandas as pd
from pandas import Series, DataFrame
import numpy as np
from numpy.random import poisson, uniform
import mesa, mesa.time, mesa.datacollection
import pymysql, sqlalchemy
import random

# Code from the last post
def gen_strat(obj):
    """Generating a Series consisting of castle troop assignments

    args:
        obj: An iterable (NumPy array, Series, list, tuple, etc.) containing troop assignment

    return:
        Series; The generated strategy
    """
    srs = Series(obj).apply(int)
    if (len(srs) != 10) or (srs.sum() != 100) or ((srs < 0).values.any()):
        raise ValueError("A valid strategy consists of ten integers between 0 and 100 that sum to 100.")
    srs.index = pd.Index(np.arange(10) + 1)
    return srs

def check_strat(strat):
    """Checks that strat is a valid strategy, like that created by gen_strat

    args:
        strat: Series; An object to check

    return:
        bool; True if valid
    """
    if str(type(strat)) == "<class 'pandas.core.series.Series'>":
        # All the other conditions that must hold
        if len(strat) == 10 and strat.sum() == 100 and (strat >= 0).values.all() and \
           (strat == strat.apply(int)).values.all() and (strat.index.values == (np.arange(10) + 1)).all():
            return True
    return False

def battle_strat(strat1, strat2):
    """Determine which strategy between two wins

    args:
        strat1: Series; generated by gen_strat (or obeying its rules)
        strat2: Series; generated by gen_strat (or obeying its rules)

    return:
        dict; with the following fields:
        * "winner": int; 1 for strat1, -1 for strat2, 0 for tie
        * "pts": list; Item 0 is strat1's points, Item 1 strat2's points
    """
    if not check_strat(strat1) or not check_strat(strat2):
        raise ValueError("Both strat1 and strat2 must be valid strategies")
    castle_margins = strat1 - strat2
    win1 = (castle_margins > 0).apply(int)
    win2 = (castle_margins < 0).apply(int)
    tie = (castle_margins == 0).apply(int)
    castle_points = Series(strat1.index, index=strat1.index)
    tie_pts = (tie * castle_points).sum() / 2
    pts1 = (win1 * castle_points).sum() + tie_pts
    pts2 = (win2 * castle_points).sum() + tie_pts
    pts_list = [pts1, pts2]
    if pts1 > pts2:
        return {"winner": 1, "pts": pts_list}
    elif pts1 < pts2:
        return {"winner": -1, "pts": pts_list}
    else:
        return {"winner": 0, "pts": pts_list}

def battle_strat_df(df1, df2):
    """Adjudicate all battles to see who wins

    args:
        df1: DataFrame; A data frame of warlords' strategies
        df2: DataFrame; A data frame of warlords' strategies

    return:
        DataFrame; Contains columns for Win, Tie, and Lose, the performance of
        the warlords of df1 against those of df2
    """
    df1_mat = df1.values
    df2_mat = df2.values
    score_mat = np.ones(df2_mat.shape, dtype=np.int8)
    res = list()
    for i in range(df1_mat.shape[0]):
        vec = df1_mat[i, np.newaxis, :]
        margin = vec - df2_mat
        df1_castles = (margin > 0) * (np.arange(10) + 1)
        df2_castles = (margin < 0) * (np.arange(10) + 1)
        tie_castles = (margin == 0) * (np.arange(10) + 1)
        tie_scores = tie_castles.sum(axis=1) / 2
        df1_scores = df1_castles.sum(axis=1) + tie_scores
        df2_scores = df2_castles.sum(axis=1) + tie_scores
        res.append({"Win": (df1_scores > df2_scores).sum(),
                    "Tie": (df1_scores == df2_scores).sum(),
                    "Losses": (df1_scores < df2_scores).sum()})
    return DataFrame(res, index=df1.index)

###########################
## Definition of the ABM ##
###########################

class Warlord(mesa.Agent):
    """A warlord of Riddler Nation, who starts with a war doctrine, provinces, and fights other warlords"""

    def __init__(self, model, unique_id, doctrine, provinces, max_provinces, mutate_prob, mutate_amount):
        """Create a warlord

        args:
            model: mesa.Model; The model to which the agent belongs
            unique_id: int; Identifies the agent
            doctrine: Series; Describes the doctrine, and must be like one generated by gen_strat()
            provinces: int; The number of provinces the warlord starts with
            max_provinces: int; The maximum number of provinces a warlord can have before a split
            mutate_prob: float; A number between 0 and 1 that represents the probability of a mutation in offspring
            mutate_amount: float; A number greater than 0 representing average number of mutations in doctrine of offspring

        return:
            Warlord
        """
        self.model = model
        if not check_strat(doctrine):
            raise ValueError("doctrine must be a valid strategy")
        self.doctrine = doctrine
        self.unique_id = int(unique_id)
        self.provinces = max(0, int(provinces))             # provinces at least 0
        self.max_provinces = max(2, int(max_provinces))     # max_provinces at least 2
        self.mutate_prob = max(min(mutate_prob, 1), 0)      # between 0 and 1
        self.mutate_amount = max(mutate_amount, 0)
        self.opponent = None    # In future, will be the id of this warlord's opponent

    def step(self):
        """In a step, find a new opponent to battle if not assigned"""
        if self.opponent is None:
            other = self.model.get_random_warlord()
            self.opponent = other
            other.opponent = self

    def advance(self):
        """Fight, and handle death or split of provinces"""
        if self.opponent is not None:
            # Resolve battle
            other = self.opponent
            res = battle_strat(self.doctrine, other.doctrine)["winner"]
            self.provinces += res
            other.provinces -= res
            # Clear opponents
            self.opponent = None
            other.opponent = None
        # Resolve possible death
        if self.provinces <= 0:
            self.model.schedule.remove(self)
        # If too many provinces, split, create new warlord
        if self.provinces >= self.max_provinces:
            son_doctrine = self.doctrine
            son_mutate_prob = self.mutate_prob
            son_mutate_amount = self.mutate_amount
            son_provinces = int(self.provinces / 2)
            # Is there a mutation?
            if uniform() < self.mutate_prob:
                # Yes! Apply the mutation
                son_mutate_prob = (self.mutate_prob + 1) / 2
                son_mutate_amount = self.mutate_amount * 2
                # Change doctrine
                mutations = min(poisson(self.mutate_amount), 100)
                for _ in range(mutations):
                    son_doctrine[random.choice(son_doctrine.index)] += 1    # Add 1 to a random castle
                    son_doctrine[random.choice(son_doctrine[son_doctrine > 0].index)] -= 1  # Subtract 1 from a random
                                                                                            # castle that has at least
                                                                                            # one soldier
            # Create the son
            son = Warlord(model = self.model,
                          unique_id = self.model.next_id(),
                          doctrine = son_doctrine,
                          provinces = son_provinces,
                          max_provinces = self.max_provinces,
                          mutate_prob = son_mutate_prob,
                          mutate_amount = son_mutate_amount)
            self.model.schedule.add(son)
            # Changes to the warlord
            self.mutate_prob /= 2
            self.mutate_amount /= 2
            self.provinces = self.max_provinces - son_provinces

class RiddlerBattle(mesa.Model):
    """The Model that handles the simulation of the Battle for Riddler Nation"""

    def __init__(self, doctrines, N=None, max_provinces=6, mutate_prob=0.9, mutate_amount=10, debug=False):
        """Create an instance of the Battle for Riddler Nation

        args:
            doctrines: DataFrame; Contains all the doctrines for the warlords to create
            N: int; The number of warlords to create; if None, the number of rows of doctrines
            max_provinces: int; The maximum number of provinces each warlord can have
            mutate_prob: float; Each warlord's initial mutation probability
            mutate_amount: float; Each warlord's initial rate of number of mutations
            debug: bool; Debug mode

        return:
            RiddlerBattle
        """
        if N is None:
            N = doctrines.shape[0]
        if debug:
            strat = {"Doctrine": lambda w: w.doctrine.values, "Provinces": lambda w: w.provinces}
        else:
            strat = {"Doctrine": lambda w: w.doctrine.values}
        self.schedule = mesa.time.SimultaneousActivation(self)    # Battles are determined simultaneously
        self._max_id = 0    # The highest id currently used by the model
        self.dc = mesa.datacollection.DataCollector(agent_reporters=strat)    # Data collection
        self.create_warlords(doctrines, N, max_provinces, mutate_prob, mutate_amount)
        self._gen_warlord_list()

    def create_warlords(self, doctrines, N, max_provinces, mutate_prob, mutate_amount):
        """Populate the model with warlords

        args:
            doctrines: DataFrame; Contains all the doctrines for the warlords to create
            N: int; The number of warlords to create
            max_provinces: int; The maximum number of provinces each warlord can have
            mutate_prob: float; Each warlord's initial mutation probability
            mutate_amount: float; Each warlord's initial rate of number of mutations
        """
        i = 0
        init_provinces = int(max_provinces / 2)
        for _ in range(N):
            w = Warlord(model = self,
                        unique_id = self.next_id(),
                        doctrine = doctrines.iloc[i, :],
                        provinces = init_provinces,
                        max_provinces = max_provinces,
                        mutate_prob = mutate_prob,
                        mutate_amount = mutate_amount)
            self.schedule.add(w)
            i += 1
            if (i >= doctrines.shape[0]):
                i = 0    # We've run out of available doctrines, so start at the beginning of the DataFrame

    def _gen_warlord_list(self):
        """Generate a list of warlords used for assigning opponents"""
        self._warlord_list = [w for w in self.schedule.agents]

    def get_random_warlord(self):
        """Pull a random warlord without an opponent"""
        i = random.choice([i for i in range(len(self._warlord_list))])
        return self._warlord_list.pop(i)

    def next_id(self):
        """Get the next valid id for the model, and update the max id of the model

        return:
            int; Can be used as an id
        """
        self._max_id += 1
        return self._max_id

    def step(self):
        """Take a step"""
        self.dc.collect(self)
        self.schedule.step()
        self._gen_warlord_list()    # Reset the list of available warlords

    def run_model(self, steps):
        """Run the model

        args:
            steps: int; The number of steps to run the model
        """
        steps = int(steps)
        for _ in range(steps):
            self.step()

# Read in data, and create a DataFrame containing the strategies from both rounds
rs1 = pd.read_csv("castle-solutions.csv", encoding='latin-1').iloc[:, 0:10].applymap(int)
rs1.columns = pd.Index(np.arange(10) + 1)
rs1 = rs1.loc[rs1.apply(check_strat, axis=1), :]    # Use only valid strategies
rs1.index = pd.Index(np.arange(rs1.shape[0]))

rs2 = pd.read_csv("castle-solutions-2.csv", encoding='latin-1').iloc[:, 0:10].applymap(int)
rs2.columns = rs1.columns.copy()
rs2 = rs2.loc[rs2.apply(check_strat, axis=1), :]    # Use only valid strategies
rs2.index = pd.Index(np.arange(rs2.shape[0]))

# Combine into one DataFrame, with a multi-index for distinguishing participants
riddler_strats = pd.concat([rs1, rs2])
riddler_strats.index = pd.MultiIndex(levels=[[1, 2],
                                             list(set(rs1.index.values.tolist()).union(rs2.index.values.tolist()))],
                                     labels=[([0] * rs1.shape[0]) + ([1] * rs2.shape[0]),
                                             rs1.index.values.tolist() + rs2.index.values.tolist()],
                                     names=["round", "idx"])
riddler_strats.head()

             1   2   3   4  5  6  7   8   9  10
round idx
1     0    100   0   0   0  0  0  0   0   0   0
      1     52   2   2   2  2  2  2  12  12  12
      2     26  26  26  16  1  1  1   1   1   1
      3     25   0   0   0  0  0  0  25  25  25
      4     25   0   0   0  0  0  0  25  25  25

del rs1, rs2    # No longer needed

Let’s see how my strategy fared.

# See how my strategy did in the second round
myStrat = gen_strat((3, 3, 13, 2, 4, 1, 27, 30, 10, 7))
myStrat_vs_r2 = battle_strat_df(DataFrame([myStrat]), riddler_strats.loc[2])
myStrat_vs_r2

   Losses  Tie  Win
0     358   10  534

myStrat_vs_r2.Win / myStrat_vs_r2.values.sum()

0    0.592018
Name: Win, dtype: float64

You may recall that with my first strategy I was winning 48% of my battles, and now I won 59% of battles, an improvement.

Let’s now get an overall ranking for the round 2 strategies.

r2_res = battle_strat_df(riddler_strats.loc[2], riddler_strats.loc[2])
r2_res.sort_values("Win", ascending=False).head()

     Losses  Tie  Win
idx
0       169    6  727
1       179    9  714
2       182    8  712
4       193    6  703
3       194   10  698

This is an exact match to Mr. Roeder’s results (there’s one more tie than there should be, since a strategy always ties against itself), which is good news. (It shows I’m not crazy!)

How did second-round strategies fare against the first?

r2_vs_r1 = battle_strat_df(riddler_strats.loc[2], riddler_strats.loc[1])
r2_vs_r1.sort_values("Win", ascending=False).head()

     Losses  Tie   Win
idx
299     107    4  1238
400     107    4  1238
300     107    4  1238
297     107    4  1238
379     108    3  1238

I would also like to summarize the strategies in round 2. One way is to take the mean of the troop allocations.

riddler_strats.loc[2].mean().round(0)

1      3.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     17.0
9     14.0
10    12.0
dtype: float64

# How did my strategy fare against the mean?
battle_strat(myStrat, gen_strat(riddler_strats.loc[2].mean().round(0)))

{'pts': [18.5, 36.5], 'winner': -1}

Strange… my strategy does not beat the mean strategy.

Mixed Strategy

Unfortunately, I did not save the data I generated from the last post; it’s gone forever. Below, I recreate the simulation I developed on the data from the first Battle for Riddler Nation. Furthermore, because I want to use this data later, I save the simulations in a MySQL database.

%time rb1 = RiddlerBattle(riddler_strats.loc[(1), :])

CPU times: user 3.42 s, sys: 32 ms, total: 3.45 s
Wall time: 3.56 s

%time rb1.run_model(100)

CPU times: user 17min 44s, sys: 1.83 s, total: 17min 46s
Wall time: 18min 9s

warlords1 = rb1.dc.get_agent_vars_dataframe()
warlords1 = DataFrame([w[1]["Doctrine"] for w in warlords1.iterrows()],
                      columns=np.arange(10) + 1, index=warlords1.index)

def pymysql_sqlalchemy_stringgen(user, passwd, host, dbname):
    """Generate a connection string for use with SQLAlchemy for MySQL and PyMySQL connections

    Args:
        user (str): The username of the connecting user
        passwd (str): The user's password
        host (str): The host for where the database is located
        dbname (str): The name of the database to connect with

    Returns:
        (str) A SQLAlchemy connection string suitable for use with create_engine()

    Additional options for the connection are not supported with this function.
    """
    return "mysql+pymysql://" + user + ":" + passwd + "@" + host + "/" + dbname

conn = sqlalchemy.create_engine(pymysql_sqlalchemy_stringgen("root", pswd, "localhost", "riddlerbattles")).connect()

try:
    conn.execute("DROP TABLE round1simulation;")
except:
    pass

make_table = """CREATE TABLE `round1simulation` (
    `step` int(4),
    `agentid` int(8),
    `castle_1` int(8), `castle_2` int(8), `castle_3` int(8), `castle_4` int(8),
    `castle_5` int(8), `castle_6` int(8), `castle_7` int(8), `castle_8` int(8),
    `castle_9` int(8), `castle_10` int(8),
    PRIMARY KEY (`step`,`agentid`)
);"""

conn.execute(make_table)

<sqlalchemy.engine.result.ResultProxy at 0xa979946c>

# Put the data in the table
temp_table = warlords1.copy()
temp_table.columns = pd.Index(["castle_" + str(i) for i in temp_table.columns])
temp_table.columns

Index(['castle_1', 'castle_2', 'castle_3', 'castle_4', 'castle_5', 'castle_6',
       'castle_7', 'castle_8', 'castle_9', 'castle_10'],
      dtype='object')

temp_table.to_sql("round1simulation", con=conn, if_exists='append')
del temp_table    # Not needed any longer

As mentioned in my previous post, the optimal strategy is a random strategy. No choice of troop allocation that always win against other allocations exists. Instead, a player employing optimal play randomly allocates troops to castles, with some probability distribution. The question is: what is that probability distribution?

I hypothesize that the distribution of pure strategies used by the warlords generated by the ABM is close to whatever the optimal distribution is. In particular, random warlords picked from the final population of warlords produced by the ABM tie or beat random strategies picked by players at least half of the time.

I test this hypothesis below.

def random_battle(df1, df2):
    """Pick a random warlord from df1 and have him battle a random warlord from df2

    Args:
        df1: DataFrame; A data frame of warlords' strategies
        df2: DataFrame; A data frame of warlords' strategies

    Returns:
        int; The result of the battle, with 1 for victory for the warlord from df1,
        -1 for defeat, 0 for tie
    """
    return battle_strat(gen_strat(df1.sample(n=1).values[0]),
                        gen_strat(df2.sample(n=1).values[0]))['winner']

warlords1.head()

                1   2   3   4  5  6  7   8   9  10
Step AgentID
0    1        100   0   0   0  0  0  0   0   0   0
     2         52   2   2   2  2  2  2  12  12  12
     3         26  26  26  16  1  1  1   1   1   1
     4         24   0   0   0  0  2  0  22  27  25
     5         26   0   0   0  0  1  0  24  25  24

# Do 1000 random battles, picking from my warlords found with the ABM and those
# from the second round of the Battle for Riddler Nation
rand1 = Series([random_battle(warlords1.loc[99], riddler_strats.loc[2]) for _ in range(1000)])
rand1.head()

0   -1
1    1
2   -1
3   -1
4   -1
dtype: int64

rand1.value_counts()

-1    508
 1    469
 0     23
dtype: int64

# Winning proportion
rand1.value_counts().loc[1] / rand1.value_counts().sum()

0.46899999999999997

That is not good news for my hypothesis. The strategies found via the ABM lose against human opponents more frequently than they win. This should not be happening if my belief were correct.

Perhaps ABMs cannot be used to solve this problem after all, or at least not with my approach. Then again, maybe I need a longer "training" period; that is, 100 rounds may not be enough to learn a strategy. Perhaps 1,000 or 3,000 are needed (though completing that many rounds would take a long time), or I need to seed the system with more agents.

Strategy Clusters

I believe that humans are going to be prone to choosing strategies that follow similar themes. Perhaps some players believe in a balanced approach to all castles, some emphasize high-point castles, others low-point castles, and so on. Here, I use cluster analysis to examine this hypothesis.

I will be switching from Python to R from this point on, but there are a few things I want to do in Python first: submit the strategies from the second round to my database, and generate strategies via the ABM when using strategies from both rounds of the Battle for Riddler Nation as the initial seed.

# Submit round 2 strategies
try:
    conn.execute("DROP TABLE round2strats;")
except:
    pass

make_table = """CREATE TABLE `round2strats` (
    `idx` int(4),
    `castle_1` int(8), `castle_2` int(8), `castle_3` int(8), `castle_4` int(8),
    `castle_5` int(8), `castle_6` int(8), `castle_7` int(8), `castle_8` int(8),
    `castle_9` int(8), `castle_10` int(8),
    PRIMARY KEY (`idx`)
);"""

conn.execute(make_table)

temp_table = riddler_strats.loc[2].copy()
temp_table.columns = pd.Index(["castle_" + str(i) for i in temp_table.columns])
temp_table.to_sql("round2strats", con=conn, if_exists='append')
del temp_table

# Generate combined strategies ABM results
%time rb2 = RiddlerBattle(DataFrame(riddler_strats.values, columns=np.arange(10) + 1))

CPU times: user 5.86 s, sys: 12 ms, total: 5.88 s
Wall time: 6.12 s

%time rb2.run_model(100)

CPU times: user 29min 44s, sys: 1.77 s, total: 29min 46s
Wall time: 30min 17s

warlords2 = rb2.dc.get_agent_vars_dataframe()
warlords2 = DataFrame([w[1]["Doctrine"] for w in warlords2.iterrows()],
                      columns=np.arange(10) + 1, index=warlords2.index)

# FYI: I'm aware this is bad database design, and I don't care.
try:
    conn.execute("DROP TABLE round2simulation;")
except:
    pass

make_table = """CREATE TABLE `round2simulation` (
    `step` int(4),
    `agentid` int(8),
    `castle_1` int(8), `castle_2` int(8), `castle_3` int(8), `castle_4` int(8),
    `castle_5` int(8), `castle_6` int(8), `castle_7` int(8), `castle_8` int(8),
    `castle_9` int(8), `castle_10` int(8),
    PRIMARY KEY (`step`,`agentid`)
);"""

conn.execute(make_table)

temp_table = warlords2.copy()
temp_table.columns = pd.Index(["castle_" + str(i) for i in temp_table.columns])
temp_table.to_sql("round2simulation", con=conn, if_exists='append')
del temp_table

# We're done in Python now; only remaining business is closing the MySQL connection
conn.close()

Now we can start using R.

# Loading packages
pckgs_cran <- c("dplyr", "ggplot2", "mclust", "devtools", "RMySQL")
for (pckg in pckgs_cran) {
  if (!require(pckg, character.only = TRUE)) {
    install.packages(pckg)
    require(pckg, character.only = TRUE)
  }
}

## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
## Loading required package: ggplot2
## Loading required package: mclust
## Package 'mclust' version 5.3
## Type 'citation("mclust")' for citing this R package in publications.
## Loading required package: devtools
## Loading required package: RMySQL
## Loading required package: DBI

if (!require("gganimate")) {
  devtools::install_github("dgrtwo/gganimate")
  library(gganimate)
}

## Loading required package: gganimate

`%s%` <- function(x, y) paste(x, y)
`%s0%` <- function(x, y) paste0(x, y)

Now let's connect to the database so we can work with the data.

db <- src_mysql("riddlerbattles", host = "127.0.0.1", user = "root", password = pswd)
castle_str <- paste0("castle_", 1:10)
round1_strats <- tbl(db, sql("SELECT" %s% paste(castle_str, collapse = ", ") %s%
                             "FROM round1simulation WHERE step = 0"))
round2_strats <- tbl(db, sql("SELECT" %s% paste(castle_str, collapse = ", ") %s%
                             "FROM round2strats"))
round1_sim <- tbl(db, sql("SELECT step," %s% paste(castle_str, collapse = ", ") %s%
                          "FROM round1simulation"))
round2_sim <- tbl(db, sql("SELECT step," %s% paste(castle_str, collapse = ", ") %s%
                          "FROM round2simulation"))

First, I embed the strategies from round 1 into a 2D space, purely for the sake of visualization. I do this by computing a distance matrix, then using classical multidimensional scaling to find points in a 2D plane that approximately preserve those distances. Here, Manhattan distance seems like the best way to measure the distance between strategies: it is proportional to the number of troops that need to be moved to turn one strategy into another.
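To make the "troops moved" interpretation concrete, here is a small stand-alone check in Python (the two strategy vectors and the helper names are made up for illustration): because both allocations total 100 troops, the Manhattan distance is exactly twice the number of troops that must be relocated, since every moved troop leaves a surplus at one castle and fills a deficit at another.

```python
def manhattan(a, b):
    """L1 (Manhattan) distance between two castle allocations."""
    return sum(abs(x - y) for x, y in zip(a, b))

def troops_moved(a, b):
    """Number of troops that must change castles to turn allocation a into b.

    Each relocated troop contributes 1 to the surplus side and 1 to the
    deficit side of the L1 distance, hence the factor of two.
    """
    return sum(max(x - y, 0) for x, y in zip(a, b))

# Two hypothetical 100-troop strategies over 10 castles
s1 = [0, 0, 0, 5, 5, 10, 15, 20, 20, 25]
s2 = [5, 5, 5, 5, 5, 10, 10, 15, 20, 20]

assert sum(s1) == sum(s2) == 100
assert manhattan(s1, s2) == 2 * troops_moved(s1, s2)
```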

round1_strats_mat <- round1_strats %>% as.data.frame %>% as.matrix
round2_strats_mat <- round2_strats %>% as.data.frame %>% as.matrix

strats_mds <- rbind(round1_strats_mat, round2_strats_mat) %>%
  dist("manhattan") %>%
  cmdscale

round1_strats_mds <- strats_mds[1:nrow(round1_strats_mat), ]
round2_strats_mds <- strats_mds[-(1:nrow(round1_strats_mat)), ]

round1_strats_plot <- data.frame("x" = round1_strats_mds[, 1], "y" = round1_strats_mds[, 2])
round2_strats_plot <- data.frame("x" = round2_strats_mds[, 1], "y" = round2_strats_mds[, 2])

ggplot(round1_strats_plot, aes(x = x, y = y)) + geom_point() + theme_bw()

ggplot(round2_strats_plot, aes(x = x, y = y)) + geom_point() + theme_bw()

Clusters definitely exist in the first round. In my opinion, at least five, maybe six clusters exist (I want to consider stragglers a separate cluster). In the second round, I observe more uniformity; my guess is there are between one and four clusters, but most likely one.

I want to know how performance in the Battle is distributed. For this I'll need an adjudicator, which I can write in R much as I wrote one in Python with NumPy.

# Interesting: this is barely slower than the Python original
battle_strat_df <- function(df1, df2) {
  # Adjudicate all battles to see who wins
  #
  # args:
  #   df1: matrix; A matrix of warlords' strategies
  #   df2: matrix; A matrix of warlords' strategies
  # return:
  #   matrix; Contains columns for Win, Tie, and Losses, the performance
  #           of the warlords of df1 against those of df2

  score_mat <- matrix(rep(1:10, times = nrow(df2)), nrow = nrow(df2), byrow = TRUE)
  res <- sapply(1:nrow(df1), function(i) {
    vec <- df1[i, ]
    margin <- matrix(rep(vec, times = nrow(df2)), nrow = nrow(df2), byrow = TRUE) - df2
    df1_castles <- (margin > 0) * score_mat
    df2_castles <- (margin < 0) * score_mat
    tie_castles <- (margin == 0) * score_mat
    tie_scores <- rowSums(tie_castles) / 2
    df1_scores <- rowSums(df1_castles) + tie_scores
    df2_scores <- rowSums(df2_castles) + tie_scores
    return(c("Win" = sum(df1_scores > df2_scores),
             "Tie" = sum(df1_scores == df2_scores),
             "Losses" = sum(df1_scores < df2_scores)))
  })
  return(t(res))
}

round1_res <- battle_strat_df(round1_strats_mat, round1_strats_mat)
round1_strats_plot$win <- (round1_res[, "Win"] / nrow(round1_strats_mat))
round2_res <- battle_strat_df(round2_strats_mat, round2_strats_mat)
round2_strats_plot$win <- (round2_res[, "Win"] / nrow(round2_strats_mat))

ggplot(round1_strats_plot, aes(x = x, y = y, color = win)) + geom_point() +
  scale_color_gradient(low = "dark blue", high = "orange") + theme_bw()

ggplot(round2_strats_plot, aes(x = x, y = y, color=win)) + geom_point() + scale_color_gradient(low = "dark blue", high = "orange") + theme_bw()

Judging from these charts, strategies near the periphery don't do very well, while those near the center tend to do better. A peripheral strategy probably resembles (100, 0, 0, 0, 0, 0, 0, 0, 0, 0), while the winning strategies look, in some sense, "balanced"; with that in mind, these pictures are not too shocking. Comparing this with the first chart for the first round, one may guess that some clusters consist of well-performing strategies, while other clusters consist of poorer performers.

Let's see if we can use this embedding and mclust to determine what the clusters are. mclust is an R package for model-based clustering; that is, it applies the EM-algorithm to find a clustering scheme that works well for data that was drawn from Normally distributed clusters. If you're familiar with k-means clustering, model-based clustering via the EM-algorithm and a Normal model is a much more sophisticated version of the same thing.
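To make the EM idea concrete, here is a stdlib-Python sketch of the simplest case it handles: a one-dimensional mixture of two Gaussians. This is only the skeleton of the algorithm; mclust itself handles multivariate data, many covariance structures, and model selection, and the function and variable names below are my own invention.

```python
import math
import random

def norm_pdf(x, mu, sd):
    """Density of a Normal(mu, sd) distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def em_two_gaussians(data, iters=50):
    """Fit a two-component 1D Gaussian mixture by the EM algorithm."""
    mu = [min(data), max(data)]        # crude initialization at the extremes
    sd = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        resp = []
        for x in data:
            p0 = pi[0] * norm_pdf(x, mu[0], sd[0])
            p1 = pi[1] * norm_pdf(x, mu[1], sd[1])
            total = p0 + p1
            resp.append(p1 / total if total > 0 else 0.5)
        # M-step: update weights, means, and standard deviations
        n1 = sum(resp)
        n0 = len(data) - n1
        pi = [n0 / len(data), n1 / len(data)]
        mu = [sum((1 - r) * x for r, x in zip(resp, data)) / n0,
              sum(r * x for r, x in zip(resp, data)) / n1]
        sd = [max(1e-6, math.sqrt(sum((1 - r) * (x - mu[0]) ** 2
                                      for r, x in zip(resp, data)) / n0)),
              max(1e-6, math.sqrt(sum(r * (x - mu[1]) ** 2
                                      for r, x in zip(resp, data)) / n1))]
    return mu, sd, pi

random.seed(42)
data = ([random.gauss(0, 1) for _ in range(200)] +
        [random.gauss(10, 1) for _ in range(200)])
mu, sd, pi = em_two_gaussians(data)
```

With two well-separated clusters the recovered means land near 0 and 10; k-means would find the same centers here, but EM additionally estimates the mixing weights and per-component spreads.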

mclust's principal function, Mclust(), takes automated steps to find a good clustering scheme. I simply run it out of the box to find clusters for both data sets.

clust1 <- Mclust(round1_strats_mds, G = 1:6)
summary(clust1)

## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 5 components:
##
##  log.likelihood    n df       BIC       ICL
##       -13023.54 1349 25 -26227.25 -26471.11
##
## Clustering table:
##   1   2   3   4   5
##  75 575 123 310 266

round1_strats_plot$clust <- as.factor(clust1$classification)

ggplot(round1_strats_plot, aes(x = x, y = y, color = clust)) + geom_point() +
  scale_color_brewer(palette = "Dark2") + theme_bw()

clust2 <- Mclust(round2_strats_mds, G = 1:4)
summary(clust2)

## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVI (diagonal, varying volume and shape) model with 3 components:
##
##  log.likelihood   n df       BIC       ICL
##       -8741.765 902 14 -17578.79 -17678.37
##
## Clustering table:
##   1   2   3
## 735  25 142

round2_strats_plot$clust <- as.factor(clust2$classification)

ggplot(round2_strats_plot, aes(x = x, y = y, color = clust)) + geom_point() +
  scale_color_brewer(palette = "Dark2") + theme_bw()

I would say reasonable clusterings were found both times, and they largely affirm what I believed: there are five good clusters in the first round, but by the second round players had started to converge on variations of one strategy (mclust found three clusters there, but they're not very convincing).

Let's characterize the clusters even more.

clust_strats1 <- t(sapply(unique(clust1$classification),
                          function(c) colMeans(round1_strats_mat[which(
                            clust1$classification == c), ])))
clust_strats1 %>% round(digits = 0)

##      castle_1 castle_2 castle_3 castle_4 castle_5 castle_6 castle_7
## [1,]        3        3        4        6        7       10       12
## [2,]        4        7        8       10        9       10        6
## [3,]        7        8        9       13       17       20       24
## [4,]        1        2        2        5        7       12       17
## [5,]        2        4        4        8       13       18       23
##      castle_8 castle_9 castle_10
## [1,]       17       19        19
## [2,]        3       12        31
## [3,]        1        1         1
## [4,]       25       27         1
## [5,]       27        1         1

clust_strats1 %>% heatmap(Colv = NA)

clust_wins1 <- sapply(unique(clust1$classification), function(c) {
  res <- battle_strat_df(round1_strats_mat[which(clust1$classification == c), ],
                         round1_strats_mat)
  return(mean(res[, "Win"] / nrow(round1_strats_mat)))
})
names(clust_wins1) <- 1:5
clust_wins1

##         1         2         3         4         5
## 0.3497831 0.3156116 0.3306334 0.3121045 0.3097644

clust_strats2 <- t(sapply(unique(clust2$classification),
                          function(c) colMeans(round2_strats_mat[which(
                            clust2$classification == c), ])))
clust_strats2 %>% round(digits = 0)

##      castle_1 castle_2 castle_3 castle_4 castle_5 castle_6 castle_7
## [1,]        3        4        6        8       10       13       13
## [2,]        7        1        0        0        0        0        0
## [3,]        2        4        5        7       11       10       24
##      castle_8 castle_9 castle_10
## [1,]       14       15        13
## [2,]       31       30        29
## [3,]       30        4         4

clust_strats2 %>% heatmap(Colv = NA)

clust_wins2 <- sapply(unique(clust2$classification), function(c) {
  res <- battle_strat_df(round2_strats_mat[which(clust2$classification == c), ],
                         round2_strats_mat)
  return(mean(res[, "Win"] / nrow(round2_strats_mat)))
})
names(clust_wins2) <- 1:3
clust_wins2

##         1         2         3
## 0.3464018 0.1466962 0.3247166

In the first round, we can see that there were five doctrines for troop allocation. Doctrines 1 and 2 bet heavily on high-value castles, and the other three bet on low-value castles to varying degrees, with the most successful of the three betting almost exclusively on low-value castles, ignoring those with a value greater than 7. The other two were less successful the more they bet on high-value castles.

In the second round, people tended to spread their troop allocations out more. Most gravitated towards an even distribution of troops among castles with values of at least 6. The least successful group, which was also the smallest, continued to bet heavily on high-value castles, and a subset bet slightly more heavily on lower-value castles.

Strategy Evolution

Finally, I want to look at how the strategies evolved during the simulation.

In spirit, I repeat the multidimensional scaling, clustering, and plotting I did above, but clusters are found on data that extends across time (I hope this shows, say, a speciation process, or perhaps homogenization), and the plot is animated.

This is all done below for the first simulation, using only the strategies submitted in the first round. Unfortunately, the classical multidimensional scaling and model-based clustering methods that I used before won't work here; they're too computationally taxing, and they simply fail and crash R. Since the matrix of strategies is large, I'm not surprised, but I am disappointed. I had to resort to PCA, keeping the first two principal components, and to k-means instead of model-based clustering.
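The PCA fallback amounts to projecting each 10-dimensional strategy onto its first two principal components. A minimal numpy sketch of that projection via the SVD (with random data standing in for the simulation matrix, and a hypothetical helper name):

```python
import numpy as np

def first_two_pcs(X):
    """Project the rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)              # center the columns
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                 # scores on PC1 and PC2

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))           # stand-in for the strategy matrix
scores = first_two_pcs(X)

# Singular values are sorted, so PC1 explains at least as much variance as PC2
assert scores[:, 0].var() >= scores[:, 1].var()
```

Unlike classical MDS, this never forms an n-by-n distance matrix, which is why it stays feasible for the large simulation output.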

round1_sims_mat <- round1_sim %>% as.data.frame %>% as.matrix
round1_sims_mds <- cbind(princomp(round1_sims_mat[, 2:11], scores = TRUE)$scores[, 1:2],
                         round1_sims_mat[, "step"])
colnames(round1_sims_mds) <- c("x", "y", "step")
sims_clust1 <- kmeans(round1_sims_mds[, 1:2], centers = 10)$cluster

round1_sims_plot <- data.frame("x" = round1_sims_mds[, "x"],
                               "y" = round1_sims_mds[, "y"],
                               "step" = round1_sims_mds[, "step"],
                               "clust" = as.factor(sims_clust1))

p <- ggplot(round1_sims_plot, aes(x = x, y = y, color = clust, frame = step)) +
  geom_point() + theme_bw()
gganimate(p, "out.gif", interval = .2)

I was disappointed by the result. I expected to see exploration of the feature space, with species being born and going extinct, but all I see is a mass population die-off as the algorithm converges on a small handful of survivors. Flimsy strategy choices are weeded out, and more and more weak choices die off as the simulation proceeds, but there's no reason to think the algorithm allows individuals with good characteristics (that is, a good choice of doctrine) that were not in the initial set of strategies to be born and challenge the incumbents.

If my strategy is not performing well, this is likely the explanation: not enough exploration of the space of possible strategies. This suggests allowing for more radical evolution in the ABM than is currently employed.

Here is the simulation on a larger starting strategy set, including the strategies from both round 1 and round 2. The result is basically the same.

round2_sims_mat <- round2_sim %>% as.data.frame %>% as.matrix
round2_sims_mds <- cbind(princomp(round2_sims_mat[, 2:11], scores = TRUE)$scores[, 1:2],
                         round2_sims_mat[, "step"])
colnames(round2_sims_mds) <- c("x", "y", "step")
sims_clust2 <- kmeans(round2_sims_mds[, 1:2], centers = 10)$cluster

round2_sims_plot <- data.frame("x" = round2_sims_mds[, "x"],
                               "y" = round2_sims_mds[, "y"],
                               "step" = round2_sims_mds[, "step"],
                               "clust" = as.factor(sims_clust2))

p <- ggplot(round2_sims_plot, aes(x = x, y = y, color = clust, frame = step)) +
  geom_point() + theme_bw()
gganimate(p, "out2.gif", interval = .2)

Conclusion

There's a lot to learn from these data sets. I want to see what words appear in players' description of their strategies' motivation, allowing me to discover thought patterns in strategy clusters. I also want to see if a neural network could be used to learn good strategies.

That said, the ABM method appears to need tweaking. The winning strategy's author said he used genetic algorithms to find his solution, a commonly recurring theme. I thought I was using a genetic algorithm, but it seems that although the weak die off and the strong reproduce, there is not enough mutation in my model to allow a good strategy to be found. Maybe this could be fixed by parameter tuning. I don't know. We will need to see.

Still, this has been a fun project, and I have learned a lot in the process. Perhaps next time I will reign supreme over Riddler Nation.

  1. As mentioned in my first article, I was not able to replicate Mr. Roeder’s records, and I did not find the same winning strategies as he did. I do not know where the source of the error is, but I looked over my code a lot and cannot find errors in it. 


To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Freedman’s paradox

Tue, 06/06/2017 - 02:00

(This article was first published on Alexej's blog, and kindly contributed to R-bloggers)

Recently I came across the classical 1983 paper A note on screening regression equations by David Freedman. Freedman shows in an impressive way the dangers of data reuse in statistical analyses. The potentially dangerous scenarios include those where the results of one statistical procedure performed on the data are fed into another procedure performed on the same data. As a concrete example Freedman considers the practice of performing variable selection first, and then fitting another model using only the identified variables on the same data that was used to identify them in the first place. Because of the unexpectedly high severity of the problem this phenomenon became known as “Freedman’s paradox”. Moreover, in his paper Freedman derives asymptotic estimates for the resulting errors.

The 1983 paper presents a simulation with only 10 repetitions. But in the present day it is very easy (both in terms of computational time and implementation difficulty) to reproduce the simulation with many more repetitions (even my phone’s computational power is probably higher than that of the high performance computer that Freedman used in the 80’s). We also have more convenient ways to visualize the results than in the 80’s. So let’s do it.

I am going to use a few R packages (most notably the package broom to fit and analyze many many linear models in a single step).

library(dplyr)
library(broom)
library(ggplot2)
library(tidyr)

set.seed(20170605)

The considered data structure is the following:

  • A matrix of predictors with 100 rows and 50 columns is generated with independent standard normal entries.
  • The response variable is generated independently of the model matrix (also from the standard normal distribution), i.e., the true answer is that there is no relationship between predictors and response.

Instead of Freedman’s 10 repetitions we perform 1000. So let’s generate all 1000 datasets at once as stacked in a large data frame:

n_row <- 100
# n_col is set to 51 because the 51st column will serve as y
n_col <- 51
n_rep <- 1000

# a stack of matrices for all n_rep repetitions is generated...
X <- matrix(rnorm(n_rep * n_row * n_col), n_rep * n_row, n_col)
colnames(X) <- paste0("X", 1:n_col)

# ...and then transformed to a data frame with a repetition number column
X_df <- as_data_frame(X) %>%
  mutate(repetition = rep(1:n_rep, each = n_row))

The data are analyzed with two successive linear models, the second (illegally) reusing the results of the first.

The first model fit.
After the 1000 ordinary linear models are fit to the data, we record for each of them the R squared, the F test statistic with corresponding p-value, and the t test statistics with p-values for the individual regression coefficients.

Using functions from the broom package we can fit and extract information from all 1000 models at once.

# all models can be fit at once...
models_df = X_df %>%
  group_by(repetition) %>%
  do(full_model = lm(X51 ~ . + 0, data = select(., -repetition)))

# ...then the results are extracted
model_coefs <- tidy(models_df, full_model)
model_statistics <- glance(models_df, full_model)
model_statistics$data_reuse <- rep(FALSE, nrow(model_statistics))

The second model fit.
For each one of the first 1000 models, the corresponding second linear model is fit using only those variables which have p-values significant at the 25% level in the first model.
That is, the second model uses the first model for variable selection.

This gives us 1000 reduced re-fitted linear models. We record the same model statistics (R squared, F, and t tests) as for the first group of models.

reduced_models <- list()
for (i in 1:n_rep) {
  full_data <- X_df %>% filter(repetition == i)
  significant_coefs <- model_coefs %>% filter(repetition == i & p.value < 0.25)
  reduced_data <- select(full_data,
                         one_of(unlist(significant_coefs[ , "term"])), X51)
  reduced_models[[i]] <- lm(X51 ~ . + 0, data = reduced_data)

  tmp_df <- glance(reduced_models[[i]])
  tmp_df$repetition <- i
  tmp_df$data_reuse <- TRUE
  model_statistics <- bind_rows(model_statistics, tmp_df)
}
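For readers who prefer to see the whole screening procedure in one place, here is a compact single-repetition sketch in Python with numpy. It is not the post's R code: the helper name is made up, and selection uses an approximate |t| > 1.15 cutoff (roughly the two-sided 25% level at these degrees of freedom) instead of exact p-values, to avoid needing a t-distribution CDF.

```python
import numpy as np

rng = np.random.default_rng(20170605)
n, p = 100, 50
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                  # generated independently of X: no true signal

def ols(X, y):
    """No-intercept OLS: coefficients, t-statistics, and (uncentered) R squared."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    r2 = 1.0 - (resid @ resid) / (y @ y)  # TSS about zero for a no-intercept fit
    return beta, beta / se, r2

_, t_full, r2_full = ols(X, y)
keep = np.abs(t_full) > 1.15            # screening step: approx. 25% two-sided level
_, t_reduced, r2_reduced = ols(X[:, keep], y)

# Screening can only discard columns, so R squared cannot increase...
assert r2_reduced <= r2_full + 1e-9
# ...yet the surviving t-statistics now look deceptively "significant".
```

The paradox is visible in a single run: every kept coefficient already passed a significance screen on the same data, so the reduced model's test statistics are biased towards significance even though y is pure noise.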

Finally let’s look at the results. The figure shows the distributions of the considered model statistics across the 1000 repetitions for model fits with and without data reuse (the code producing this figure is given at the bottom of this post):

Well, the R squared statistic shows a moderate change between models with or without data reuse (average of 0.3093018 vs. 0.5001641). The F test statistic however grows immensely to an average of 3.2806118 (from 1.0480097), and the p-values fall after data reuse to an average of 0.0112216 (from 0.5017696), below the widely used (but arbitrary) 5% significance level.

Obviously the model with data reuse is highly misleading here, because in fact there are absolutely no relationships between the predictor variables and the response (as per the data generation procedure).

In fact, Freedman derived asymptotic estimates for the magnitudes of change in the considered model statistics, and they indeed match the above observations. However I’m too lazy to summarize them here. So I refer to the primary source.

This code generates the above figure:

model_statistics %>%
  select(r.squared, p.value, statistic, repetition, data_reuse) %>%
  mutate(data_reuse = ifelse(data_reuse, "With Data Reuse", "Without Data Reuse")) %>%
  mutate(data_reuse = factor(data_reuse,
                             levels = c("Without Data Reuse", "With Data Reuse"),
                             ordered = TRUE)) %>%
  rename("F-statistic" = statistic, "p-value" = p.value, "R squared" = r.squared) %>%
  gather(stat, value, -repetition, -data_reuse) %>%
  ggplot(aes(x = stat, y = value)) +
    geom_violin(aes(fill = stat), scale = "width", draw_quantiles = c(0.25, 0.5, 0.75)) +
    geom_hline(yintercept = 0.05, linetype = 2, size = 0.3) +
    facet_wrap(~data_reuse) +
    theme_linedraw() +
    scale_y_continuous(breaks = c(0.05, 2, 4, 6)) +
    ggtitle(paste(n_rep, "repetitions of an LM fit with", n_row, "rows,", n_col, "columns"))

To leave a comment for the author, please follow the link and comment on their blog: Alexej's blog.


The Case Against Seasonal Unit Roots

Mon, 06/05/2017 - 19:26

(This article was first published on R – first differences, and kindly contributed to R-bloggers)

There are several ways to model seasonality in a time series. Traditionally, trend-cycle decomposition such as the Holt-Winters procedure has been very popular. Also, until today applied researchers often try to account for seasonality by using seasonal dummy variables. But of course, in a stochastic process it seems unreasonable to assume that seasonal effects are purely deterministic. Therefore, in a time series context seasonal extensions of the classical ARMA model are very popular. One of these extensions is the seasonal unit root model

(1 − L^S) y_t = u_t,

where L is the usual lag operator, S is the period length of the seasonality (such as S = 4 or S = 12 for a yearly cycle in quarterly or monthly data), and u_t is some short run component such as an innovation term or a SARMA(p,q)-(P,Q) model.

I have always been puzzled by the popularity of this process. Probably it is due to its obvious conceptual simplicity. It also seems to be a natural extension of the usual non-seasonal integrated ARIMA model. However, the model generates a feature that we will hardly ever observe in an actual time series: as time progresses, the difference between consecutive values of y_t becomes infinitely large.

To see this consider the following example. To generate seasonal unit root processes we first define a function that generates seasonal sums.

seasonal_sum <- function(data, S) {
  out <- data
  for (t in (S + 1):length(data)) {
    out[t] <- data[t] + out[t - S]
  }
  out
}
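For comparison, the same seasonal cumulative sum written as stdlib Python (a direct translation; note that R's 1-based indexing becomes 0-based here):

```python
def seasonal_sum(data, S):
    """Cumulative sum at seasonal lag S: out[t] = data[t] + out[t - S]."""
    out = list(data)
    for t in range(S, len(data)):
        out[t] = data[t] + out[t - S]
    return out

# With S = 2 each value accumulates every second observation before it:
# [1, 2, 3, 4, 5, 6] -> [1, 2, 4, 6, 9, 12]
```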

We then generate a sample of 250 observations from this process and look at its plot and its autocorrelation function. We choose a period of S = 12, so that the example resembles a yearly cycle in monthly data.

series <- seasonal_sum(rnorm(250), S = 12)
acf(series)

ts.plot(series, ylab="series", xlab="t")

From the autocorrelation function (ACF) it can be seen that there is a pronounced seasonal behavior, with a spike in the ACF at each lag that is an integer multiple of S. However, the plot of the series shows a curious behavior. As t increases, the difference between two consecutive observations increases. This behavior becomes even more pronounced if we increase the sample size to 2500.

ts.plot(seasonal_sum(rnorm(2500),S=12), ylab="series")

To understand this feature, consider the usual unit root model y_t = y_{t−1} + ε_t with an innovation ε_t of variance σ². This can be expressed as the sum over all past innovations:

y_t = ε_1 + ε_2 + … + ε_t.

From this representation it is easy to show that the variance of the process is given by

Var(y_t) = t σ²,

so that the variance becomes infinite as t approaches infinity. This is a property that seems to apply to many economic and financial time series and is therefore completely reasonable.

Now, the seasonal unit root model can be expressed in a similar way, but with an important twist. To see this, denote the sth innovation in the mth repetition of the cycle of length S by ε_{s,m}. This means that if you have monthly observations, the innovation in the first January in the sample is ε_{1,1} and the innovation in the second January in the sample is ε_{1,2}. By the same principle the innovation in the mth December in the sample would be ε_{12,m}. Therefore, any observation y_t, with t = S(m−1) + s for some s and m, can be represented as

y_{s,m} = ε_{s,1} + ε_{s,2} + … + ε_{s,m}.

The important thing to note here is that for two consecutive observations within the mth repetition of the cycle we have y_{s,m} = ε_{s,1} + … + ε_{s,m} and y_{s+1,m} = ε_{s+1,1} + … + ε_{s+1,m}. Since ε_{s,·} and ε_{s+1,·} are independent streams of random numbers, this means that y_{s,m} and y_{s+1,m} are independent random walks! Consequently, the difference of the process is given by

y_{s+1,m} − y_{s,m} = (ε_{s+1,1} − ε_{s,1}) + … + (ε_{s+1,m} − ε_{s,m}),

so that

Var(y_{s+1,m} − y_{s,m}) = 2 m σ².

Since 2mσ² goes to infinity as m goes to infinity, so does the variance of the changes. Has anybody ever seen a series that exhibits such a feature? Of course, in reality we would expect the innovations not to be i.i.d. but to show some kind of dependence structure, so that the random walks are not completely independent anymore. However, if the dependence is weak, such as that of an ARMA process, they are still asymptotically independent for large lags. Therefore, the same issue arises, as can be seen from the example below.
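A quick stdlib-Python Monte Carlo check of this point (helper names are my own): simulate many independent seasonal random walks and compare the variance of the one-step change y_t − y_{t−1} early and late in the sample. By the formula above it should grow roughly linearly with the number of completed cycles m.

```python
import random

random.seed(1)

def seasonal_walk(T, S):
    """Seasonal random walk: y[t] = y[t - S] + eps[t], with N(0, 1) innovations."""
    out = [0.0] * T
    for t in range(T):
        eps = random.gauss(0, 1)
        out[t] = eps if t < S else eps + out[t - S]
    return out

S, n_cycles, reps = 12, 100, 500
early, late = [], []
for _ in range(reps):
    y = seasonal_walk(S * n_cycles, S)
    early.append(y[S * 5] - y[S * 5 - 1])            # change after ~5 cycles
    late.append(y[S * n_cycles - 1] - y[S * n_cycles - 2])  # change after ~100 cycles

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# The variance of consecutive changes keeps growing with the cycle count
assert var(late) > 5 * var(early)
```

Theory predicts variances of roughly 11 and 200 here (the two neighbouring observations are independent sums of about 5-6 and about 100 innovations respectively), and the empirical values land close to that.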

sarima_sim <- function(T, S, arma_model) {
  arma_series <- arima.sim(n = T, model = arma_model)
  seasonal_sum(data = arma_series, S = S)
}

sarima_series <- sarima_sim(T = 250, S = 12, arma_model = list(ar = c(0.5, 0.3)))
acf(sarima_series)

ts.plot(sarima_series, ylab="series")

ts.plot(sarima_sim(T=2500, S=12, arma_model=list(ar=c(0.5,0.3))), ylab="series")

So what is the conclusion from all this? The seasonal unit root process seems to be ill suited to model most behavior that we observe in practice. However, it is well known that it often generates a good fit. Especially in shorter time series the drawbacks of the seasonal unit root process do not have to become visible. Nevertheless, I think it is fair to say that one could envision a more satisfactory model. One avenue that seems very useful in this context is that of seasonal long memory processes that are able to combine some persistence in the cyclical fluctuations with a finite variance.

Another important conclusion is that we have to be careful with seemingly direct extensions of standard models such as the ARIMA. The fact that the ARIMA is extremely successful in modelling the non-seasonal component, does not necessarily mean that the SARIMA is a good model for the seasonal applications that we have in mind, too.

Filed under: Allgemein, R


To leave a comment for the author, please follow the link and comment on their blog: R – first differences.


Quantile Regression in R exercises

Mon, 06/05/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

The standard OLS (Ordinary Least Squares) model explains the relationship between independent variables and the conditional mean of the dependent variable. In contrast, quantile regression models this relationship for different quantiles of the dependent variable.
In this exercise set we will use the quantreg package (package description: here) to implement quantile regression in R.
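The idea underlying quantile regression is the check (or "pinball") loss: the constant c minimizing the summed check loss over a sample is the corresponding sample quantile, and quantile regression generalizes this by letting c be a linear function of covariates (quantreg's rq solves that general problem by linear programming). A stdlib-Python illustration of the scalar case, with made-up helper names:

```python
def pinball(u, tau):
    """Check loss: tau * u for a positive residual, (tau - 1) * u for a negative one."""
    return tau * u if u >= 0 else (tau - 1) * u

def best_constant(y, tau):
    """Constant minimizing total check loss.

    The objective is piecewise linear in c, so a minimizer always sits at a
    data point; a search over the data points suffices.
    """
    return min(y, key=lambda c: sum(pinball(yi - c, tau) for yi in y))

y = [1, 2, 3, 4, 100]
assert best_constant(y, 0.5) == 3    # the median, unaffected by the outlier
assert best_constant(y, 0.25) == 2   # roughly the first quartile
```

This also shows why quantile regression is robust to outliers in the response: moving the 100 to 1000 changes neither fitted constant, whereas the OLS mean would shift dramatically.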

Answers to the exercises are available here.

Exercise 1
Load the quantreg package and the barro dataset (Barro and Lee, 1994). This has data on GDP growth rates for various countries.
Next, summarize the data.

Exercise 2
The dependent variable is y.net (Annual change per capita GDP). The remaining variables will be used to explain y.net. It is easier to combine variables using cbind before applying regression techniques. Combine variables so that we can write Y ~ X.

Exercise 3
Regress y.net on the independent variables using OLS. We will use this result as benchmark for comparison.

Exercise 4
Using the rq function, estimate the model at the median y.net. Compare results from exercise-3.

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:

  • Avoid model over-fitting using cross-validation for optimal parameter selection
  • Explore maximum margin methods such as best penalty of error term support vector machines with linear and non-linear kernels.
  • And much more

Exercise 5
Estimate the model for the first and third quartiles and compare results.

Exercise 6
Using a single command estimate the model for 10 equally spaced deciles of y.net.

Exercise 7
quantreg package also offers shrinkage estimators to determine which variables play the most important role in predicting y.net. Estimate the model with LASSO based quantile regression at the median level with lambda=0.5.

Exercise 8
Quantile plots are most useful for interpreting results. To do that we need to define the sequence of percentiles. Use the seq function to define the sequence of percentiles from 5% to 95% with a jump of 5%.

Exercise 9
Use the result from exercise-8 to plot the graphs. Note that the red line is the OLS estimate bounded by the dotted lines which represent confidence intervals.

Exercise 10
Using results from exercise-5, test whether coefficients are significantly different for the first and third quartile based regressions.

Related exercise sets:
  1. Forecasting: Multivariate Regression Exercises (Part-4)
  2. Instrumental Variables in R exercises (Part-3)
  3. Forecasting: ARIMAX Model Exercises (Part-5)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


There is usually more than one way in R

Mon, 06/05/2017 - 17:38

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”):

There should be one– and preferably only one –obvious way to do it.

Frankly, in R (especially once you add many packages) there is usually more than one way. As an example we will talk about the common R functions str(), head(), and the tibble package's glimpse().

tibble::glimpse()

Consider the important task inspecting a data.frame to see column types and a few example values. The dplyr/tibble/tidyverse way of doing this is as follows:

library("tibble")
glimpse(mtcars)

Observations: 32
Variables: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10....
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6, 167.6, 275.8, 275.8, 27...
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65,...
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, 3.07, 3.07, 2.93, 3.0...
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440, 3.440, 4.070, 3.730, 3....
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30, 18.90, 17.40, 17.60, 18...
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 4
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 2, 2, 4, 6, 8, 2

utils::str()

A common “base-R” way (actually from the utils package) to examine the data is:

str(mtcars)
'data.frame': 32 obs. of 11 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num 160 160 108 258 360 ...
 $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num 16.5 17 18.6 19.4 17 ...
 $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num 4 4 1 1 2 1 4 2 2 4 ...

However, both of the above summaries have unfortunately obscured an important fact about the mtcars data.frame: the car names! This is because mtcars stores this important key as row-names instead of as a column. Even base::summary() will hide this from the analyst.
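One remedy is to promote the row names to a regular column before inspecting. A minimal sketch using the tibble package already loaded above:

```r
library(tibble)

# rownames_to_column() copies the row names into a proper column,
# so glimpse(), str(), and summary() will show the car names too
mtcars2 <- rownames_to_column(mtcars, var = "car")
head(mtcars2[, 1:4])
```

After this, the car names survive any of the inspection functions discussed here.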

utils::head()

The base-R command head() (again from the utils package) provides a good way to examine the first few rows of data:

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We are missing the size of the table and the column types, but those are easy to get with dim(mtcars) and stack(vapply(mtcars, class, character(1))). And we can get something like the “columns on the side” presentation as follows:
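Spelled out, those two base-R helpers look like this:

```r
dim(mtcars)    # number of rows and columns: 32 11

# one row per column, pairing each column's class with its name
stack(vapply(mtcars, class, character(1)))
```

stack() turns the named character vector of classes into a small two-column data.frame, which is easy to read for wide tables.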

t(head(mtcars))
     Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant
mpg      21.00        21.000      22.80         21.400             18.70   18.10
cyl       6.00         6.000       4.00          6.000              8.00    6.00
disp    160.00       160.000     108.00        258.000            360.00  225.00
hp      110.00       110.000      93.00        110.000            175.00  105.00
drat      3.90         3.900       3.85          3.080              3.15    2.76
wt        2.62         2.875       2.32          3.215              3.44    3.46
qsec     16.46        17.020      18.61         19.440             17.02   20.22
vs        0.00         0.000       1.00          1.000              0.00    1.00
am        1.00         1.000       1.00          0.000              0.00    0.00
gear      4.00         4.000       4.00          3.000              3.00    3.00
carb      4.00         4.000       1.00          1.000              2.00    1.00

Also, head() is usually re-adapted (through R’s S3 object system) to work with remote data sources.

library("sparklyr")
sc <- sparklyr::spark_connect(version='2.0.2', master = "local")
dRemote <- copy_to(sc, mtcars)

head(dRemote)
# Source: query [6 x 11]
# Database: spark connection master=local[4] app=sparklyr local=TRUE
#
# A tibble: 6 x 11
#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
# 1  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
# 2  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
# 3  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1
# 4  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1
# 5  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2
# 6  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1

glimpse(dRemote)
# Observations: 32
# Variables: 11
# Error in if (width[i] <= max_width[i]) next :
#   missing value where TRUE/FALSE needed

broom::glance(dRemote)
# Error: glance doesn't know how to deal with data of class tbl_sparktbl_sqltbl_lazytbl

Conclusion

R often has more than one way to perform nearly the same task. When working in R, consider trying a few functions to see which one best fits your needs. Also be aware that base-R (R with the standard packages) often already has powerful capabilities.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


ANOVA in R: afex may be the solution you are looking for

Mon, 06/05/2017 - 17:14

(This article was first published on Computational Psychology - Henrik Singmann, and kindly contributed to R-bloggers)

Prelude: When you start with R and try to estimate a standard ANOVA, which is relatively simple in commercial software like SPSS, R kind of sucks. Especially for unbalanced designs or designs with repeated measures, replicating the results from such software in base R may require considerable effort. For a newcomer (and even an old-timer) this can be somewhat off-putting. After I had gained experience developing my first package and was once again struggling with ANOVA in R, I had enough and decided to develop afex. If you know this feeling, afex is also for you.

A new version of afex (0.18-0) was accepted on CRAN a few days ago. This version only fixes a small bug that was introduced in the last version: aov_ez did not work with more than one covariate (thanks to tkerwin for reporting this bug).

I want to use this opportunity to introduce one of the main functionalities of afex. It provides a set of functions that make calculating ANOVAs easy. In the default settings, afex automatically uses appropriate orthogonal contrasts for factors, transforms numerical variables into factors, uses so-called Type III sums of squares, and allows for any number of factors, including repeated-measures (or within-subjects) factors and mixed/split-plot designs. Together this guarantees that the ANOVA results correspond to the results obtained from commercial statistical packages such as SPSS or SAS. On top of this, the ANOVA object returned by afex (of class afex_aov) can be directly used for follow-up or post-hoc tests/contrasts using the lsmeans package.

Example Data

Let me illustrate how to calculate an ANOVA with a simple example. We use data courtesy of Andrew Heathcote and colleagues. The data are lexical decision and word naming latencies for 300 words and 300 nonwords from 45 participants. Here we only look at three factors:

  • task is a between-subjects (or independent-samples) factor: 25 participants worked on the lexical decision task (lexdec; i.e., participants had to make a binary decision whether the presented string was a word or a nonword) and 20 participants on the naming task (naming; i.e., participants had to say the presented string out loud).
  • stimulus is a repeated-measures or within-subjects factor that codes whether a presented string was a word or nonword.
  • length is also a repeated-measures factor that gives the number of characters of the presented strings with three levels: 3, 4, and 5.

The dependent variable is the response latency, or response time, for each presented string. More specifically, as is common in the literature, we analyze the log of the response times, log_rt. After excluding erroneous responses, each participant responded to between 135 and 150 words and between 124 and 150 nonwords. To use these data in an ANOVA, one needs to aggregate the data such that only one observation per participant and cell of the design (i.e., combination of all factors) remains. As we will see, afex does this automatically for us (this is one of the features I blatantly stole from ez).

library(afex)
data("fhch2010")          # load data (comes with afex)
mean(!fhch2010$correct)   # error rate
# [1] 0.01981546
fhch <- droplevels(fhch2010[ fhch2010$correct,])  # remove errors
str(fhch2010)             # structure of the data
# 'data.frame': 13222 obs. of 10 variables:
#  $ id       : Factor w/ 45 levels "N1","N12","N13",..: 1 1 1 1 1 1 1 1 ...
#  $ task     : Factor w/ 2 levels "naming","lexdec": 1 1 1 1 1 1 1 1 1 1 ...
#  $ stimulus : Factor w/ 2 levels "word","nonword": 1 1 1 2 2 1 2 2 1 2 ...
#  $ density  : Factor w/ 2 levels "low","high": 2 1 1 2 1 2 1 1 1 1 ...
#  $ frequency: Factor w/ 2 levels "low","high": 1 2 2 2 2 2 1 2 1 2 ...
#  $ length   : Factor w/ 3 levels "4","5","6": 3 3 2 2 1 1 3 2 1 3 ...
#  $ item     : Factor w/ 600 levels "abide","acts",..: 363 121 ...
#  $ rt       : num 1.091 0.876 0.71 1.21 0.843 ...
#  $ log_rt   : num 0.0871 -0.1324 -0.3425 0.1906 -0.1708 ...
#  $ correct  : logi TRUE TRUE TRUE TRUE TRUE TRUE ...

We first load the data and remove the roughly 2% errors. The structure of the data.frame (obtained via str()) shows us that the data has a few more factors than discussed here. To specify our ANOVA we first use the function aov_car(), which works very similarly to base R aov(), but, like all afex functions, uses car::Anova() (read as function Anova() from package car) as the backend for calculating the ANOVA.

Specifying an ANOVA

(a1 <- aov_car(log_rt ~ task*length*stimulus + Error(id/(length*stimulus)), fhch))
# Contrasts set to contr.sum for the following variables: task
# Anova Table (Type 3 tests)
#
# Response: log_rt
#                 Effect          df  MSE          F   ges p.value
# 1                 task       1, 43 0.23  13.38 ***   .22   .0007
# 2               length 1.83, 78.64 0.00  18.55 ***  .008  <.0001
# 3          task:length 1.83, 78.64 0.00       1.02 .0004     .36
# 4             stimulus       1, 43 0.01 173.25 ***   .17  <.0001
# 5        task:stimulus       1, 43 0.01  87.56 ***   .10  <.0001
# 6      length:stimulus 1.70, 72.97 0.00       1.91 .0007     .16
# 7 task:length:stimulus 1.70, 72.97 0.00       1.21 .0005     .30
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1
#
# Sphericity correction method: GG
# Warning message:
# More than one observation per cell, aggregating the data using mean (i.e, fun_aggregate = mean)!

The printed output is an ANOVA table that could basically be copied to a manuscript as is. One sees the terms in column Effect, the degrees of freedom (df), the mean-squared error (MSE; I would probably remove this column in a manuscript), the F-value (F, which also contains the significance stars), and the p-value (p.value). The only somewhat uncommon column is ges, which provides generalized eta-squared, ‘the recommended effect size statistic for repeated-measures designs’. The standard output also reports Greenhouse-Geisser (GG) corrected df for repeated-measures factors with more than two levels (to account for possible violations of sphericity). Note that these corrected df are not integers.

We can also see a warning notifying us that afex has detected that each participant and cell of the design provides more than one observation, which are then automatically aggregated using mean. The warning serves to notify the user in case this was not intended (i.e., when there should be only one observation per participant and cell of the design). The warning can be suppressed by specifying fun_aggregate = mean explicitly in the call to aov_car.
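To make the aggregation explicit (and silence the warning), one can either pass fun_aggregate = mean to aov_car or pre-aggregate by hand with base R. A sketch using the fhch data and formula from above:

```r
# what afex does internally: one mean log_rt per participant and design cell
fhch_agg <- aggregate(log_rt ~ id + task + stimulus + length, data = fhch, FUN = mean)

# equivalent, with the aggregation made explicit in the afex call
a1 <- aov_car(log_rt ~ task*length*stimulus + Error(id/(length*stimulus)),
              fhch, fun_aggregate = mean)
```

Both routes produce the same ANOVA table; the explicit version simply documents the aggregation step in the code.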

The formula passed to aov_car basically needs to be the same as for standard aov with a few differences:

  • It must have an Error term specifying the column containing the participant (or unit of observation) identifier (e.g., minimally +Error(id)). This is necessary to allow the automatic aggregation even in designs without repeated-measures factors.
  • Repeated-measures factors only need to be defined in the Error term and do not need to be enclosed by parentheses. Consequently, the following call produces the same ANOVA: aov_car(log_rt ~ task + Error(id/length*stimulus), fhch)

     

In addition to aov_car, afex provides two further functions for calculating ANOVAs. These functions produce the same output but differ in how the ANOVA is specified.

  • aov_ez allows the ANOVA specification not via a formula but via character vectors (and is similar to ez::ezANOVA()): aov_ez(id = "id", dv = "log_rt", fhch, between = "task", within = c("length", "stimulus"))
  • aov_4 requires a formula for which the id and repeated-measures factors need to be specified as in lme4::lmer() (with the same simplification that repeated-measures factors only need to be specified in the random part): aov_4(log_rt ~ task + (length*stimulus|id), fhch) aov_4(log_rt ~ task*length*stimulus + (length*stimulus|id), fhch)
Follow-up Tests

A common requirement after the omnibus test provided by the ANOVA is some sort of follow-up analysis. For this purpose, afex is fully integrated with lsmeans.

For example, assume we are interested in the significant task:stimulus interaction. As a first step we might want to investigate the marginal means of these two factors:

lsmeans(a1, c("stimulus","task"))
# NOTE: Results may be misleading due to involvement in interactions
#  stimulus task        lsmean         SE    df    lower.CL    upper.CL
#  word     naming -0.34111656 0.04250050 48.46 -0.42654877 -0.25568435
#  nonword  naming -0.02687619 0.04250050 48.46 -0.11230839  0.05855602
#  word     lexdec  0.00331642 0.04224522 47.37 -0.08165241  0.08828525
#  nonword  lexdec  0.05640801 0.04224522 47.37 -0.02856083  0.14137684
#
# Results are averaged over the levels of: length
# Confidence level used: 0.95

From this we can see that naming trials seem to be generally slower (as a reminder, the dv is log-transformed RT in seconds, so values below 0 correspond to RTs between 0 and 1). It also appears that the difference between word and nonword trials is larger in the naming task than in the lexdec task. We test this with the following code, using a few different lsmeans functions. We first use lsmeans again, but this time with task as the conditioning variable specified in by. Then we use pairs() to obtain all pairwise comparisons within each conditioning stratum (i.e., level of task). This already provides us with the correct tests, but does not control the family-wise error rate across both tests. To get that, we simply update() the returned results and remove the conditioning by setting by=NULL. In the call to update we can already specify the method for error control, and we specify 'holm' because it is uniformly more powerful than Bonferroni.

# set up conditional marginal means:
(ls1 <- lsmeans(a1, c("stimulus"), by="task"))
# task = naming:
#  stimulus      lsmean         SE    df    lower.CL    upper.CL
#  word     -0.34111656 0.04250050 48.46 -0.42654877 -0.25568435
#  nonword  -0.02687619 0.04250050 48.46 -0.11230839  0.05855602
#
# task = lexdec:
#  stimulus      lsmean         SE    df    lower.CL    upper.CL
#  word      0.00331642 0.04224522 47.37 -0.08165241  0.08828525
#  nonword   0.05640801 0.04224522 47.37 -0.02856083  0.14137684
#
# Results are averaged over the levels of: length
# Confidence level used: 0.95

update(pairs(ls1), by=NULL, adjust = "holm")
#  contrast       task      estimate         SE df t.ratio p.value
#  word - nonword naming -0.31424037 0.02080113 43 -15.107  <.0001
#  word - nonword lexdec -0.05309159 0.01860509 43  -2.854  0.0066
#
# Results are averaged over the levels of: length
# P value adjustment: holm method for 2 tests

Hmm. These results show that the stimulus effect is independently significant in both task conditions. Obviously, the difference between them must then also be significant, right?

pairs(update(pairs(ls1), by=NULL))
#  contrast                             estimate         SE df t.ratio p.value
#  wrd-nnwrd,naming - wrd-nnwrd,lexdec -0.2611488 0.02790764 43  -9.358  <.0001

They obviously are. As a reminder, the interaction tests exactly this: the difference of the differences. And we can actually recover the F-value of the interaction using lsmeans alone by invoking yet another of its functions, test(..., joint=TRUE).

test(pairs(update(pairs(ls1), by=NULL)), joint=TRUE)
#  df1 df2      F p.value
#    1  43 87.565  <.0001

These last two examples were perhaps not particularly interesting from a statistical point of view, but they show an important ability of lsmeans. Any set of estimated marginal means produced by lsmeans, including any sort of (custom) contrasts, can be used again for further tests or for calculating new sets of marginal means. And with test() we can even obtain joint F-tests over several parameters using joint=TRUE. lsmeans is extremely powerful and one of my most frequently used packages; it basically performs all tests following an omnibus test (and in its latest version it directly interfaces with rstanarm, so it can now also be used for a lot of Bayesian stuff, but that is the topic of another blog post).

Finally, lsmeans can also be used directly for plotting by invoking lsmip:

lsmip(a1, task ~ stimulus)

Note that lsmip does not add error bars to the estimated marginal means, but only plots the point estimates. There are mainly two reasons for this. First, as soon as repeated-measures factors are involved, it is difficult to decide which error bars to plot. Standard error bars based on the standard error of the mean are not appropriate for within-subjects comparisons. For those, one would need to use within-subject intervals (see also here or here). Especially for plots such as the current one with both independent-samples and repeated-measures factors (i.e., mixed within-between designs or split-plot designs), no error bar will allow comparisons across both dimensions. Second, only ‘if the SE [i.e., standard error] of the mean is exactly 1/2 the SE of the difference of two means — which is almost never the case — it would be appropriate to use overlapping confidence intervals to test comparisons of means’ (lsmeans author Russell Lenth; the link provides an alternative).
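Lenth's point is easy to verify numerically: for two independent means with equal standard error s, the SE of their difference is sqrt(2)*s (about 1.41*s), not 2*s. A quick sketch with an illustrative value:

```r
s <- 0.042                  # SE of each mean (illustrative value only)
se_diff <- sqrt(s^2 + s^2)  # SE of the difference of two independent means
se_diff / s                 # sqrt(2), about 1.414, not 2
```

So the SE of a single mean is roughly 0.71 times, not half, the SE of the difference, which is why overlap of per-mean error bars is a misleading test of the difference.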

We can also use lsmeans in combination with lattice to plot the results on the unconstrained scale (i.e., after back-transforming the data from the log scale to the original response-time scale in seconds). The plot is not shown here.

lsm1 <- summary(lsmeans(a1, c("stimulus","task")))
lsm1$lsmean <- exp(lsm1$lsmean)
require(lattice)
xyplot(lsmean ~ stimulus, lsm1, group = task, type = "b",
       auto.key = list(space = "right"))

 

Summary
  • afex provides a set of functions that make specifying standard ANOVAs for an arbitrary number of between-subjects (i.e., independent-sample) or within-subjects (i.e., repeated-measures) factors easy: aov_car(), aov_ez(), and aov_4().
  • In its default settings, the afex ANOVA functions replicate the results of commercial statistical packages such as SPSS or SAS (using orthogonal contrasts and Type III sums of squares).
  • Fitted ANOVA models can be passed to lsmeans for follow-up tests, custom contrast tests, and plotting.
  • For specific questions visit the new afex support forum: afex.singmann.science (I think we just need someone to ask the first ANOVA question to get the ball rolling).
  • For more examples see the vignette or here (blog post by Ulf Mertens) or download the full example R script used here.

As a caveat, let me end this post with some cautionary remarks from Douglas Bates (fortunes::fortune(184)) who explains why ANOVA in R is supposed to not be the same as in other software packages (i.e., he justifies why it ‘sucks’):

You must realize that R is written by experts in statistics and statistical computing who, despite popular opinion, do not believe that everything in SAS and SPSS is worth copying. Some things done in such packages, which trace their roots back to the days of punched cards and magnetic tape when fitting a single linear model may take several days because your first 5 attempts failed due to syntax errors in the JCL or the SAS code, still reflect the approach of “give me every possible statistic that could be calculated from this model, whether or not it makes sense”. The approach taken in R is different. The underlying assumption is that the useR is thinking about the analysis while doing it.
— Douglas Bates (in reply to the suggestion to include type III sums of squares and lsmeans in base R to make it more similar to SAS or SPSS)
R-help (March 2007)

Maybe he is right, but maybe what I have described here is useful to some degree.



To leave a comment for the author, please follow the link and comment on their blog: Computational Psychology - Henrik Singmann.


R⁶ — Scraping Images To PDFs

Mon, 06/05/2017 - 16:34

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I’ve been doing intermittent prep work for a follow-up to an earlier post on store closings and came across this CNN Money “article” on it. Said “article” is a deliberately obfuscated or lazily crafted series of GIF images that contain all the Radio Shack impending store closings. It’s the most comprehensive list I’ve found, but the format is terrible and there’s no easy, in-browser way to download them all.

CNN has ToS that prevent automated data gathering from CNN-proper. But they used Adobe Document Cloud for these images, which has no similar restrictions from a quick glance at its ToS. That means you get an R⁶ post on how to grab the 38 individual images and combine them into one PDF. I did this all with the hope of OCRing the text, which has not panned out too well, since the image quality and font were likely deliberately set to make it hard to do precisely what I’m trying to do.

If you work through the example, you’ll get a feel for:

  • use sprintf() to take a template and build a vector of URLs
  • use dplyr progress bars
  • customize httr verb options to ensure you can get to content
  • use purrr to iterate through a process of turning raw image bytes into image content (via magick) and turn a list of images into a PDF
library(httr)
library(magick)
library(tidyverse)

url_template <- "https://assets.documentcloud.org/documents/1657793/pages/radioshack-convert-p%s-large.gif"

pb <- progress_estimated(38)
sprintf(url_template, 1:38) %>%
  map(~{
    pb$tick()$print()
    GET(url = .x,
        add_headers(
          accept = "image/webp,image/apng,image/*,*/*;q=0.8",
          referer = "http://money.cnn.com/interactive/technology/radio-shack-closure-list/index.html",
          authority = "assets.documentcloud.org"))
  }) -> store_list_pages

map(store_list_pages, content) %>%
  map(image_read) %>%
  reduce(image_join) %>%
  image_write("combined_pages.pdf", format = "pdf")

I figured out the Document Cloud links and necessary httr::GET() options by using Chrome Developer Tools and my curlconverter package.

If any academic-y folks have a summer intern (or other willing test subject) with a free hour and would be willing to have them transcribe this list and stick it on GitHub, you’d have my eternal thanks.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.


14 Jobs for R users (2017-05-06) – from all over the world

Mon, 06/05/2017 - 16:30
To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs
More New Jobs
  1. Full-Time
    PhD opportunity @ Barcelona / Moorepark Teagasc (Ireland), Universitat Autònoma de Barcelona (Spain) and CEDRIC (Centre d’études et de recherche en informatique et communications) of CNAM (Paris) – Posted by emmanuel.serrano.ferron
    Anywhere
    25 May 2017
  2. Full-Time
    Research Software Engineer @ Bailrigg, England Lancaster University – Posted by killick
    Bailrigg, England, United Kingdom
    24 May 2017
  3. Full-Time
    Senior Data Scientist @ Minneapolis, Minnesota, U.S. General Mills – Posted by dreyco676
    Minneapolis, Minnesota, United States
    23 May 2017
  4. Part-Time
    Big Data Discovery Automation in Digital Health (5-10 hours per week) MedStar Institutes for Innovation – Posted by Praxiteles
    Anywhere
    22 May 2017
  5. Full-Time
    Data Scientist / Analytics Consultant Hedera Insights – Posted by HederaInsights
    Antwerpen, Vlaanderen, Belgium
    18 May 2017
  6. Full-Time
    Data Scientists for ArcelorMittal @ Avilés, Principado de Asturias, Spain ArcelorMittal – Posted by JuanM
    Avilés, Principado de Asturias, Spain
    12 May 2017
  7. Full-Time
    Data Scientist (m/f) Meritocracy – Posted by arianna@meritocracy
    Hamburg, Germany
    11 May 2017
  8. Full-Time
    Data Scientist Prospera Technologies – Posted by Prospera Technologies
    Tel Aviv-Yafo, Tel Aviv District, Israel
    11 May 2017
  9. Full-Time
    Data Scientist One Acre Fund – Posted by OAF
    Nairobi, Nairobi County, Kenya
    9 May 2017
  10. Full-Time
    Back-End Developer – Sacramento Kings (Sacramento, CA) Sacramento Kings – Posted by khavey
    Sacramento, California, United States
    9 May 2017
  11. Full-Time
    Data Analyst, Chicago Violence Reduction Strategy National Network for Safe Communities (NNSC) – Posted by nnsc
    Chicago, Illinois, United States
    5 May 2017
  12. Full-Time
    Transportation Market Research Analyst @ Arlington, Virginia, U.S. RSG – Posted by patricia.holland@rsginc.com
    Arlington, Virginia, United States
    4 May 2017
  13. Full-Time
    Data Manager @ Los Angeles, California, U.S. Virtual Pediatric Systems, LLC – Posted by gsotocampos
    Los Angeles, California, United States
    3 May 2017
  14. Full-Time
    Sr. Manager, Analytics @ Minneapolis, Minnesota, U.S. Korn Ferry – Posted by jone1087
    Minneapolis, Minnesota, United States
    3 May 2017

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).


Weather forecast with regression models – part 2

Mon, 06/05/2017 - 14:27

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

In the first part, I introduced the weather dataset and outlined its exploratory analysis. In this second part of the tutorial, we are going to build multiple logistic regression models for weather forecasting. Specifically, we intend to produce the following forecasts:

  • tomorrow’s weather forecast at 9am of the current day
  • tomorrow’s weather forecast at 3pm of the current day
  • tomorrow’s weather forecast at late evening time of the current day

For each of above tasks, a specific subset of variables shall be available, and precisely:

  • 9am: MinTemp, WindSpeed9am, Humidity9am, Pressure9am, Cloud9am, Temp9am
  • 3pm: (9am variables) + Humidity3pm, Pressure3pm, Cloud3pm, Temp3pm, MaxTemp
  • evening: (3pm variables) + Evaporation, Sunshine, WindGustDir, WindGustSpeed

We suppose MinTemp is already available at 9am, as the overnight minimum temperature is known by then. We suppose MaxTemp is already available at 3pm, as it is determined during the central hours of the day. Further, we suppose Sunshine, Evaporation, WindGustDir and WindGustSpeed are final only by late evening. The other variables are explicitly bound to a specific time of day.
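The cumulative predictor sets listed above can be written down directly as character vectors (a sketch; the names are the weather_data5 column names used throughout this post):

```r
predictors_9am <- c("MinTemp", "WindSpeed9am", "Humidity9am",
                    "Pressure9am", "Cloud9am", "Temp9am")
predictors_3pm <- c(predictors_9am,
                    "Humidity3pm", "Pressure3pm", "Cloud3pm", "Temp3pm", "MaxTemp")
predictors_evening <- c(predictors_3pm,
                        "Evaporation", "Sunshine", "WindGustDir", "WindGustSpeed")
```

Building each later set on the earlier one makes the "information available by that time of day" assumption explicit in the code.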

suppressPackageStartupMessages(library(caret))
set.seed(1023)
weather_data5 <- read.csv("weather_data5.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
colnames(weather_data5)
 [1] "MinTemp"       "MaxTemp"       "Evaporation"   "Sunshine"      "WindGustDir"
 [6] "WindGustSpeed" "WindSpeed9am"  "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"
[11] "Pressure9am"   "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"
[16] "Temp3pm"       "RainTomorrow"
nrow(weather_data5)
[1] 353
sum(weather_data5["RainTomorrow"] == "Yes")
[1] 64
sum(weather_data5["RainTomorrow"] == "No")
[1] 289

We are going to split the original dataset in a training dataset (70% of original data) and a testing dataset (30% remaining).

train_rec <- createDataPartition(weather_data5$RainTomorrow, p = 0.7, list = FALSE)
training <- weather_data5[train_rec,]
testing <- weather_data5[-train_rec,]

We check the balance of RainTomorrow Yes/No fractions in the training and testing datasets.

sum(training["RainTomorrow"] == "Yes")/sum(training["RainTomorrow"] == "No")
[1] 0.2216749
sum(testing["RainTomorrow"] == "Yes")/sum(testing["RainTomorrow"] == "No")
[1] 0.2209302

The dataset is slightly unbalanced with respect to the label to be predicted. We will check later whether some remedy is needed or the achieved accuracy is satisfactory as is.

9AM Forecast Model

For all models, we are going to take advantage of a train-control directive made available by the caret package which prescribes repeated k-fold cross-validation. The k-fold cross-validation method involves splitting the training dataset into k subsets. Each subset is held out in turn while the model is trained on the remaining subsets. This process continues until an accuracy is determined for each instance in the training dataset, and an overall accuracy estimate is provided. Splitting the data into k folds can be repeated a number of times; this is called repeated k-fold cross-validation. The final model accuracy is taken as the mean over the repeats.

trControl <- trainControl(method = "repeatedcv", repeats = 5, number = 10, verboseIter = FALSE)

The trainControl result is passed as a parameter to the caret train() function. As a further input to train(), the tuneLength parameter indicates the number of different values to try for each algorithm tuning parameter. In the case of the “glm” method, however, there are no tuning parameters to drive, and hence it will be omitted.
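For a method that does have tuning parameters, tuneLength drives the size of the search grid. An illustrative sketch (hypothetical, not part of this analysis) with k-nearest neighbours:

```r
# with method = "knn", tuneLength = 5 tries 5 candidate values of k
# and keeps the one with the best cross-validated accuracy
knn_fit <- train(RainTomorrow ~ ., data = training, method = "knn",
                 trControl = trControl, tuneLength = 5)
```

With "glm" the same call simply fits one model per resample, since there is nothing to tune.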

By taking into account the results from exploratory analysis, we herein compare two models for 9AM prediction. The first one is so determined and evaluated.

predictors_9am_c1 <- c("Cloud9am", "Humidity9am", "Pressure9am", "Temp9am")
formula_9am_c1 <- as.formula(paste("RainTomorrow", paste(predictors_9am_c1, collapse="+"), sep="~"))
mod9am_c1_fit <- train(formula_9am_c1, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod9am_c1_fit$results$Accuracy
[1] 0.8383538

The resulting accuracy shown above is the one determined by the repeated k-fold cross-validation explained earlier. The obtained value is not bad considering that only data available at 9AM is used.

(summary_rep <- summary(mod9am_c1_fit$finalModel))

Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7667  -0.5437  -0.3384  -0.1874   2.7742  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 96.01611   33.57702   2.860 0.004242 ** 
Cloud9am     0.17180    0.07763   2.213 0.026893 *  
Humidity9am  0.06769    0.02043   3.313 0.000922 ***
Pressure9am -0.10286    0.03283  -3.133 0.001729 ** 
Temp9am      0.10595    0.04545   2.331 0.019744 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 177.34  on 243  degrees of freedom
AIC: 187.34

Number of Fisher Scoring iterations: 5

From the summary above, all predictors are significant for the logistic regression model. We can test the null hypothesis that the logistic model represents an adequate fit for the data by computing the following p-value.

1 - pchisq(summary_rep$deviance, summary_rep$df.residual) [1] 0.9994587

We further investigate whether any predictor can be dropped by taking advantage of the drop1() function.

drop1(mod9am_c1_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Cloud9am + Humidity9am + Pressure9am + Temp9am
            Df Deviance    AIC     LRT  Pr(>Chi)    
<none>           177.34 187.34                      
Cloud9am     1   182.48 190.48  5.1414 0.0233623 *  
Humidity9am  1   190.19 198.19 12.8502 0.0003374 ***
Pressure9am  1   187.60 195.60 10.2668 0.0013544 ** 
Temp9am      1   183.13 191.13  5.7914 0.0161050 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We can evaluate a second model where Temp9am is replaced by MinTemp. We do not keep both, as they are correlated (see the exploratory analysis in part #1).

predictors_9am_c2 <- c("Cloud9am", "Humidity9am", "Pressure9am", "MinTemp")
formula_9am_c2 <- as.formula(paste("RainTomorrow", paste(predictors_9am_c2, collapse="+"), sep="~"))
mod9am_c2_fit <- train(formula_9am_c2, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod9am_c2_fit$results$Accuracy
[1] 0.8366462

(summary_rep <- summary(mod9am_c2_fit$finalModel))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7233  -0.5446  -0.3510  -0.1994   2.7237  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)   
(Intercept) 103.00407   33.57102   3.068  0.00215 **
Cloud9am      0.16379    0.08014   2.044  0.04096 * 
Humidity9am   0.05462    0.01855   2.945  0.00323 **
Pressure9am  -0.10792    0.03298  -3.272  0.00107 **
MinTemp       0.06750    0.04091   1.650  0.09896 . 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.9  on 247  degrees of freedom
Residual deviance: 180.3  on 243  degrees of freedom
AIC: 190.3

Number of Fisher Scoring iterations: 5

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual) [1] 0.9990409

This second model shows a similar accuracy value; however, the MinTemp p-value shows that this predictor is not significant. Further, the explained deviance is slightly less than that of the first model. As a conclusion, we select the first model for 9AM prediction purposes and go on to calculate the accuracy on the test set.

mod9am_pred <- predict(mod9am_c1_fit, testing)
confusionMatrix(mod9am_pred, testing[,"RainTomorrow"])
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  80  12
       Yes  6   7
                                          
               Accuracy : 0.8286          
                 95% CI : (0.7427, 0.8951)
    No Information Rate : 0.819           
    P-Value [Acc > NIR] : 0.4602          
                                          
                  Kappa : 0.3405          
 Mcnemar's Test P-Value : 0.2386          
                                          
            Sensitivity : 0.9302          
            Specificity : 0.3684          
         Pos Pred Value : 0.8696          
         Neg Pred Value : 0.5385          
             Prevalence : 0.8190          
         Detection Rate : 0.7619          
   Detection Prevalence : 0.8762          
      Balanced Accuracy : 0.6493          
                                          
       'Positive' Class : No              

The confusion matrix shows encouraging results: not a bad test accuracy for a 9AM weather forecast. We then go on to build the 3PM prediction model.

3PM Forecast Model

We are going to use what we expect to be reliable predictors at 3PM. We have already seen that they are not strongly correlated with each other and are not linearly dependent.

predictors_3pm_c1 <- c("Cloud3pm", "Humidity3pm", "Pressure3pm", "Temp3pm")
formula_3pm_c1 <- as.formula(paste("RainTomorrow", paste(predictors_3pm_c1, collapse="+"), sep="~"))
mod3pm_c1_fit <- train(formula_3pm_c1, data = training, method = "glm", family = "binomial", trControl = trControl, metric = 'Accuracy')
mod3pm_c1_fit$results$Accuracy
[1] 0.8582333

This model shows an acceptable accuracy as measured on the training set.

(summary_rep <- summary(mod3pm_c1_fit$finalModel))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3065  -0.5114  -0.2792  -0.1340   2.6641  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) 122.56229   35.20777   3.481 0.000499 ***
Cloud3pm      0.27754    0.09866   2.813 0.004906 ** 
Humidity3pm   0.06745    0.01817   3.711 0.000206 ***
Pressure3pm  -0.12885    0.03443  -3.743 0.000182 ***
Temp3pm       0.10877    0.04560   2.385 0.017071 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 159.64  on 243  degrees of freedom
AIC: 169.64

Number of Fisher Scoring iterations: 6

All predictors are reported as significant for the model. Let us further verify whether any predictor can be dropped.

drop1(mod3pm_c1_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm
            Df Deviance    AIC     LRT  Pr(>Chi)    
<none>           159.64 169.64                      
Cloud3pm     1   168.23 176.23  8.5915  0.003377 ** 
Humidity3pm  1   175.91 183.91 16.2675 5.500e-05 ***
Pressure3pm  1   175.60 183.60 15.9551 6.486e-05 ***
Temp3pm      1   165.77 173.77  6.1329  0.013269 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual) [1] 0.9999913

We go on with the computation of the test set accuracy.

mod3pm_pred <- predict(mod3pm_c1_fit, testing)
confusionMatrix(mod3pm_pred, testing[,"RainTomorrow"])
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  82   8
       Yes  4  11
                                          
               Accuracy : 0.8857          
                 95% CI : (0.8089, 0.9395)
    No Information Rate : 0.819           
    P-Value [Acc > NIR] : 0.04408         
                                          
                  Kappa : 0.58            
 Mcnemar's Test P-Value : 0.38648         
                                          
            Sensitivity : 0.9535          
            Specificity : 0.5789          
         Pos Pred Value : 0.9111          
         Neg Pred Value : 0.7333          
             Prevalence : 0.8190          
         Detection Rate : 0.7810          
   Detection Prevalence : 0.8571          
      Balanced Accuracy : 0.7662          
                                          
       'Positive' Class : No              

The test set prediction accuracy is quite satisfactory.

Evening Forecast Model

As a first evening forecast model, we introduce the Sunshine variable and at the same time take Cloud3pm and Humidity3pm out, as they are strongly correlated with Sunshine.

predictors_evening_c1 <- c("Pressure3pm", "Temp3pm", "Sunshine")
formula_evening_c1 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c1, collapse="+"), sep="~"))
mod_ev_c1_fit <- train(formula_evening_c1, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c1_fit$results$Accuracy
[1] 0.8466974

The training set based accuracy is acceptable.

(summary_rep <- summary(mod_ev_c1_fit$finalModel))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9404  -0.4532  -0.2876  -0.1350   2.4664  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) 174.43458   37.12201   4.699 2.61e-06 ***
Pressure3pm  -0.17175    0.03635  -4.726 2.29e-06 ***
Temp3pm       0.06506    0.03763   1.729   0.0838 .  
Sunshine     -0.41519    0.07036  -5.901 3.61e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 156.93  on 244  degrees of freedom
AIC: 164.93

Number of Fisher Scoring iterations: 6

The Temp3pm predictor is reported as not significant and can be dropped, as confirmed below.

drop1(mod_ev_c1_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Pressure3pm + Temp3pm + Sunshine
            Df Deviance    AIC    LRT  Pr(>Chi)    
<none>           156.93 164.93                     
Pressure3pm  1   185.64 191.64 28.712 8.400e-08 ***
Temp3pm      1   160.03 166.03  3.105   0.07805 .  
Sunshine     1   203.03 209.03 46.098 1.125e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual) [1] 0.9999968

As a second tentative model, we take advantage of the 3PM model predictors together with WindGustDir and WindGustSpeed.

predictors_evening_c2 <- c(predictors_3pm_c1, "WindGustDir", "WindGustSpeed")
formula_evening_c2 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c2, collapse="+"), sep="~"))
mod_ev_c2_fit <- train(formula_evening_c2, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c2_fit$results$Accuracy
[1] 0.8681179

This results in an improved training set accuracy.

(summary_rep <- summary(mod_ev_c2_fit$finalModel))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.44962  -0.42289  -0.20929  -0.04328   2.76204  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     9.928e+01  4.863e+01   2.042 0.041193 *  
Cloud3pm        2.569e-01  1.081e-01   2.376 0.017481 *  
Humidity3pm     8.441e-02  2.208e-02   3.822 0.000132 ***
Pressure3pm    -1.088e-01  4.695e-02  -2.318 0.020462 *  
Temp3pm         1.382e-01  5.540e-02   2.495 0.012596 *  
WindGustDirENE -1.064e+00  1.418e+00  -0.750 0.453254    
WindGustDirESE  1.998e+00  1.088e+00   1.837 0.066235 .  
WindGustDirN   -2.731e-03  1.759e+00  -0.002 0.998761    
WindGustDirNE   1.773e+00  1.260e+00   1.407 0.159364    
WindGustDirNNE -1.619e+01  2.884e+03  -0.006 0.995520    
WindGustDirNNW  1.859e+00  1.021e+00   1.821 0.068672 .  
WindGustDirNW   1.011e+00  1.009e+00   1.002 0.316338    
WindGustDirS    1.752e+00  1.248e+00   1.404 0.160394    
WindGustDirSE  -1.549e+01  2.075e+03  -0.007 0.994043    
WindGustDirSSE -1.051e-01  1.636e+00  -0.064 0.948777    
WindGustDirSSW -1.451e+01  6.523e+03  -0.002 0.998225    
WindGustDirSW   2.002e+01  6.523e+03   0.003 0.997551    
WindGustDirW    1.108e+00  1.183e+00   0.936 0.349051    
WindGustDirWNW  1.325e-01  1.269e+00   0.104 0.916848    
WindGustDirWSW -1.574e+01  4.367e+03  -0.004 0.997124    
WindGustSpeed   1.674e-02  2.065e-02   0.811 0.417420    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 135.61  on 227  degrees of freedom
AIC: 177.61

Number of Fisher Scoring iterations: 17

The WindGustSpeed predictor and most WindGustDir levels are reported as not significant. Also, the AIC value is considerably increased.

drop1(mod_ev_c2_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE + 
    WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE + 
    WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE + 
    WindGustDirSSE + WindGustDirSSW + WindGustDirSW + WindGustDirW + 
    WindGustDirWNW + WindGustDirWSW + WindGustSpeed
               Df Deviance    AIC     LRT  Pr(>Chi)    
<none>              135.61 177.61                      
Cloud3pm        1   141.64 181.64  6.0340   0.01403 *  
Humidity3pm     1   154.49 194.49 18.8817 1.391e-05 ***
Pressure3pm     1   141.38 181.38  5.7692   0.01631 *  
Temp3pm         1   142.21 182.21  6.5992   0.01020 *  
WindGustDirENE  1   136.20 176.20  0.5921   0.44159    
WindGustDirESE  1   139.40 179.40  3.7883   0.05161 .  
WindGustDirN    1   135.61 175.61  0.0000   0.99876    
WindGustDirNE   1   137.55 177.55  1.9412   0.16354    
WindGustDirNNE  1   136.40 176.40  0.7859   0.37535    
WindGustDirNNW  1   139.43 179.43  3.8160   0.05077 .  
WindGustDirNW   1   136.69 176.69  1.0855   0.29747    
WindGustDirS    1   137.64 177.64  2.0293   0.15430    
WindGustDirSE   1   136.36 176.36  0.7458   0.38782    
WindGustDirSSE  1   135.61 175.61  0.0041   0.94866    
WindGustDirSSW  1   135.64 175.64  0.0339   0.85384    
WindGustDirSW   1   138.38 178.38  2.7693   0.09609 .  
WindGustDirW    1   136.52 176.52  0.9106   0.33996    
WindGustDirWNW  1   135.62 175.62  0.0109   0.91689    
WindGustDirWSW  1   135.85 175.85  0.2414   0.62318    
WindGustSpeed   1   136.27 176.27  0.6633   0.41541    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

WindGustDir shows borderline p-values for some specific directions, whereas WindGustSpeed is not significant and should be dropped from the model. Hence, we redefine the model after taking WindGustSpeed off.

predictors_evening_c2 <- c(predictors_3pm_c1, "WindGustDir")
formula_evening_c2 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c2, collapse="+"), sep="~"))
mod_ev_c2_fit <- train(formula_evening_c2, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c2_fit$results$Accuracy
[1] 0.8574256

(summary_rep <- summary(mod_ev_c2_fit$finalModel))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.45082  -0.40624  -0.21271  -0.04667   2.71193  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     118.76549   42.77047   2.777 0.005490 ** 
Cloud3pm          0.25566    0.10794   2.369 0.017859 *  
Humidity3pm       0.08052    0.02126   3.787 0.000152 ***
Pressure3pm      -0.12701    0.04172  -3.044 0.002333 ** 
Temp3pm           0.13035    0.05441   2.396 0.016592 *  
WindGustDirENE   -1.13954    1.43581  -0.794 0.427398    
WindGustDirESE    2.03313    1.10027   1.848 0.064624 .  
WindGustDirN     -0.00212    1.68548  -0.001 0.998996    
WindGustDirNE     1.59121    1.25530   1.268 0.204943    
WindGustDirNNE  -16.31932 2891.85796  -0.006 0.995497    
WindGustDirNNW    1.87617    1.03532   1.812 0.069962 .  
WindGustDirNW     1.09105    1.01928   1.070 0.284433    
WindGustDirS      1.93346    1.24036   1.559 0.119046    
WindGustDirSE   -15.50996 2073.57766  -0.007 0.994032    
WindGustDirSSE   -0.09029    1.63401  -0.055 0.955934    
WindGustDirSSW  -14.34617 6522.63868  -0.002 0.998245    
WindGustDirSW    19.99670 6522.63868   0.003 0.997554    
WindGustDirW      1.16834    1.18736   0.984 0.325127    
WindGustDirWNW    0.16961    1.28341   0.132 0.894858    
WindGustDirWSW  -15.71816 4349.40572  -0.004 0.997117    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 136.27  on 228  degrees of freedom
AIC: 176.27

Number of Fisher Scoring iterations: 17

drop1(mod_ev_c2_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE + 
    WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE + 
    WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE + 
    WindGustDirSSE + WindGustDirSSW + WindGustDirSW + WindGustDirW + 
    WindGustDirWNW + WindGustDirWSW
               Df Deviance    AIC     LRT  Pr(>Chi)    
<none>              136.27 176.27                      
Cloud3pm        1   142.24 180.24  5.9653  0.014590 *  
Humidity3pm     1   154.50 192.50 18.2255 1.962e-05 ***
Pressure3pm     1   146.76 184.76 10.4852  0.001203 ** 
Temp3pm         1   142.33 180.33  6.0611  0.013819 *  
WindGustDirENE  1   136.93 174.93  0.6612  0.416126    
WindGustDirESE  1   140.13 178.13  3.8568  0.049545 *  
WindGustDirN    1   136.27 174.27  0.0000  0.998996    
WindGustDirNE   1   137.87 175.87  1.5970  0.206332    
WindGustDirNNE  1   137.14 175.14  0.8698  0.351012    
WindGustDirNNW  1   140.09 178.09  3.8159  0.050769 .  
WindGustDirNW   1   137.52 175.52  1.2477  0.263994    
WindGustDirS    1   138.78 176.78  2.5032  0.113617    
WindGustDirSE   1   137.03 175.03  0.7541  0.385179    
WindGustDirSSE  1   136.28 174.28  0.0031  0.955862    
WindGustDirSSW  1   136.30 174.30  0.0290  0.864850    
WindGustDirSW   1   139.00 177.00  2.7314  0.098393 .  
WindGustDirW    1   137.28 175.28  1.0100  0.314905    
WindGustDirWNW  1   136.29 174.29  0.0174  0.894921    
WindGustDirWSW  1   136.51 174.51  0.2373  0.626164    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

WindGustDirESE is reported as significant, hence including WindGustDir was a right choice, and we accept this model as a candidate. The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual) [1] 0.9999998
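The many WindGustDir rows in the summaries above stem from R's dummy coding of factors: a factor with n levels expands into n-1 indicator columns, one per non-reference level, which is why drop1() tests each direction separately. A toy sketch (the factor values here are illustrative, not taken from the weather data):

```r
# Toy example of dummy coding: the reference level ("E", first in
# alphabetical order) is absorbed into the intercept, and each remaining
# level gets its own 0/1 indicator column.
dirs <- factor(c("E", "N", "NW", "N"))
model.matrix(~ dirs)

# Each row has a 1 in the column matching its level ("dirsN", "dirsNW"),
# and the "E" observation is all zeros apart from the intercept.
```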

To investigate a final third choice, we gather a smaller set of predictors: Pressure3pm and Sunshine.

predictors_evening_c3 <- c("Pressure3pm", "Sunshine")
formula_evening_c3 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c3, collapse="+"), sep="~"))
mod_ev_c3_fit <- train(formula_evening_c3, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c3_fit$results$Accuracy
[1] 0.8484513

The training set based accuracy has an acceptable value.

(summary_rep <- summary(mod_ev_c3_fit$finalModel))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1123  -0.4602  -0.2961  -0.1562   2.3997  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) 192.97146   36.09640   5.346 8.99e-08 ***
Pressure3pm  -0.18913    0.03545  -5.336 9.52e-08 ***
Sunshine     -0.35973    0.06025  -5.971 2.36e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 160.03  on 245  degrees of freedom
AIC: 166.03

Number of Fisher Scoring iterations: 6

All predictors are reported as significant.

drop1(mod_ev_c3_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Pressure3pm + Sunshine
            Df Deviance    AIC    LRT  Pr(>Chi)    
<none>           160.03 166.03                     
Pressure3pm  1   199.36 203.36 39.324 3.591e-10 ***
Sunshine     1   206.09 210.09 46.060 1.147e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual) [1] 0.9999938

We compare the last two models by running an ANOVA analysis (analysis of deviance) on them, to check whether the lower residual deviance of the first model is significant or not.

anova(mod_ev_c2_fit$finalModel, mod_ev_c3_fit$finalModel, test="Chisq")
Analysis of Deviance Table

Model 1: .outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE + 
    WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE + 
    WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE + 
    WindGustDirSSE + WindGustDirSSW + WindGustDirSW + WindGustDirW + 
    WindGustDirWNW + WindGustDirWSW
Model 2: .outcome ~ Pressure3pm + Sunshine
  Resid. Df Resid. Dev  Df Deviance Pr(>Chi)
1       228     136.27                      
2       245     160.03 -17   -23.76   0.1261

Based on the p-value, there is no significant difference between them. We then keep both models: the first because it provides better accuracy than the second, and the second for its simplicity. Let us evaluate the test accuracy for both of them.

modevening_pred <- predict(mod_ev_c2_fit, testing)
confusionMatrix(modevening_pred, testing[,"RainTomorrow"])
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  83   9
       Yes  3  10
                                          
               Accuracy : 0.8857          
                 95% CI : (0.8089, 0.9395)
    No Information Rate : 0.819           
    P-Value [Acc > NIR] : 0.04408         
                                          
                  Kappa : 0.5604          
 Mcnemar's Test P-Value : 0.14891         
                                          
            Sensitivity : 0.9651          
            Specificity : 0.5263          
         Pos Pred Value : 0.9022          
         Neg Pred Value : 0.7692          
             Prevalence : 0.8190          
         Detection Rate : 0.7905          
   Detection Prevalence : 0.8762          
      Balanced Accuracy : 0.7457          
                                          
       'Positive' Class : No              

Good test accuracy, with high sensitivity and positive predictive value. We then test the second evening forecast candidate model.

modevening_pred <- predict(mod_ev_c3_fit, testing)
confusionMatrix(modevening_pred, testing[,"RainTomorrow"])
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  81   7
       Yes  5  12
                                          
               Accuracy : 0.8857          
                 95% CI : (0.8089, 0.9395)
    No Information Rate : 0.819           
    P-Value [Acc > NIR] : 0.04408         
                                          
                  Kappa : 0.598           
 Mcnemar's Test P-Value : 0.77283         
                                          
            Sensitivity : 0.9419          
            Specificity : 0.6316          
         Pos Pred Value : 0.9205          
         Neg Pred Value : 0.7059          
             Prevalence : 0.8190          
         Detection Rate : 0.7714          
   Detection Prevalence : 0.8381          
      Balanced Accuracy : 0.7867          
                                          
       'Positive' Class : No              

The same accuracy, with higher specificity and slightly lower sensitivity than the previous model. The positive predictive value is also improved with respect to the previous model.

Summing up, our final choices are:

mod9am_c1_fit: RainTomorrow ~ Cloud9am + Humidity9am + Pressure9am + Temp9am
mod3pm_c1_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm
mod_ev_c2_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDir
mod_ev_c3_fit: RainTomorrow ~ Pressure3pm + Sunshine
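As a usage sketch, any of these fitted caret models can be queried for class probabilities rather than hard labels (here illustrated on the testing set; `type = "prob"` is the standard caret interface for classification models):

```r
# Probability of RainTomorrow = Yes/No according to the simplest evening
# model, for the first few test records.
head(predict(mod_ev_c3_fit, newdata = testing, type = "prob"))

# Without type = "prob", predict() returns the hard Yes/No class
# at the default 0.5 threshold.
head(predict(mod_ev_c3_fit, newdata = testing))
```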

We save the training record indexes and the training and testing datasets, together with all the chosen models, for further analysis.

saveRDS(list(weather_data5, train_rec, training, testing, mod9am_c1_fit, mod9am_c2_fit, mod3pm_c1_fit, mod_ev_c2_fit, mod_ev_c3_fit), file="wf_log_reg_part2.rds")

Conclusions

We built models for weather forecasting at different hours of the day. We explored training accuracy and predictor significance to determine the best models, then computed the test accuracy and collected the results. Prediction accuracy was shown to be quite satisfactory, especially for the 3PM and evening models. In the next part of this tutorial, we will tune these models to improve their accuracy where possible.

If you have any questions, please feel free to comment below.

    Related Post

    1. Weather forecast with regression models – part 1
    2. Weighted Linear Support Vector Machine
    3. Logistic Regression Regularized with Optimization
    4. Analytical and Numerical Solutions to Linear Regression Problems
    5. How to create a loop to run multiple regression models

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    Shiny app to explore ggplot2

    Mon, 06/05/2017 - 10:09

    (This article was first published on R-posts.com, and kindly contributed to R-bloggers)

    Do you struggle with learning ggplot2? Do you have problems with understanding what aesthetics actually do and how manipulating them change your plots?

    Here is the solution! Explore 33 ggplot2 geoms on one website!

    I created this ggplot2 explorer to help all R learners understand how to plot beautiful/useful charts using the most popular visualization package, ggplot2. It won't teach you how to write code, but it will definitely show you what ggplot2 geoms look like and how manipulating their arguments changes the visualization.

    You can find my app here

    And you can find code on my github


    To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

