R news and tutorials contributed by (750) R bloggers

Working with R

Thu, 10/05/2017 - 11:29

(This article was first published on R on Locke Data Blog, and kindly contributed to R-bloggers)

I’ve been pretty quiet on the blog front recently. That’s because I overhauled my site, migrating it to Hugo (the foundation of blogdown). Not content with adding just one extra thing to my usual workload, I also wrote a book!
I’m a big book fan, and I’m especially a Kindle Unlimited fan (all the books you can read for £8 per month, heck yeah!) so I wanted to make books that I could publish and see on Kindle Unlimited.



Is udpipe your new NLP processor for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing?

Thu, 10/05/2017 - 10:45

(This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers)

If you work on natural language processing in a day-to-day setting that involves statistical engineering, at some point you need to run your text through a number of text mining procedures. The following steps are ones you must carry out before you can get useful information about your text:

  • Tokenisation (splitting the full text into words/terms)
  • Parts of Speech (POS) tagging (assigning each word a syntactic tag such as verb/noun/adverb/number/…)
  • Lemmatisation (replacing each term with its lemma, so that “are” becomes the verb “be”; more information: https://en.wikipedia.org/wiki/Lemmatisation)
  • Dependency Parsing (finding relationships between words, namely between “head” words and the words which modify those heads, allowing you to link words which may be far apart in the raw text yet influence each other)

If you want to do this in R, there aren’t many tools available. In fact, there are none which

  1. do this for multiple languages
  2. do not depend on external software dependencies (java/python)
  3. allow you to train your own parsing & tagging models.

The exception is the R package udpipe (https://github.com/bnosac/udpipe, https://CRAN.R-project.org/package=udpipe), which satisfies all 3 criteria.

If you are interested in doing the annotation, pre-trained models are available for 50 languages (see ?udpipe_download_model for details). Let’s show how this works on some Dutch text and what you get out of it.

library(udpipe)
dl <- udpipe_download_model(language = "dutch")
dl

language                                                                      file_model
   dutch C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dutch-ud-2.0-170801.udpipe

udmodel_dutch <- udpipe_load_model(file = "dutch-ud-2.0-170801.udpipe")
x <- udpipe_annotate(udmodel_dutch,
                     x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.")
x <- as.data.frame(x)
x

The result of this is a dataset in which the text has been split into paragraphs, sentences and words. Each word is replaced by its lemma (ging > ga, nam > neem), you get the universal and the detailed parts of speech tags as well as the morphological features of each word, and through head_token_id you can see which words influence other words in the text, together with the dependency relationship between those words.

Going from that dataset to meaningful visualisations like the one below is then just a matter of a few lines of code. The following visualisation shows the co-occurrence of nouns in customer feedback on Airbnb apartment stays in Brussels (open data available at http://insideairbnb.com/get-the-data.html).
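
As a rough indication of how such a plot can be built, here is a hedged sketch. It assumes udpipe's cooccurrence() helper and the igraph/ggraph packages are available, and it is not the exact code behind the figure above.

library(udpipe)
library(igraph)
library(ggraph)
library(ggplot2)

# x is the annotated data.frame from above; keep only the nouns
nouns <- subset(x, upos == "NOUN")

# count how often two noun lemmas occur within the same sentence
cooc <- cooccurrence(nouns, term = "lemma",
                     group = c("doc_id", "paragraph_id", "sentence_id"))
head(cooc)

# plot the strongest co-occurrences as a network
wordnetwork <- graph_from_data_frame(head(cooc, 30))
ggraph(wordnetwork, layout = "fr") +
  geom_edge_link(aes(width = cooc), edge_colour = "grey70") +
  geom_node_text(aes(label = name), size = 4) +
  theme_void() +
  labs(title = "Noun co-occurrences in the annotated text")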

In a follow-up post, we’ll show how to train your own tagging models.

If you like this type of analysis or if you are interested in text mining with R, we have 3 upcoming courses planned on text mining. Feel free to register at the following links.

    • 18-19/10/2017: Statistical machine learning with R. Leuven (Belgium). Subscribe here
    • 08+10/11/2017: Text mining with R. Leuven (Belgium). Subscribe here
    • 27-28/11/2017: Text mining with R. Brussels (Belgium). http://di-academy.com/bootcamp + send mail to training@di-academy.com
    • 19-20/12/2017: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
    • 20-21/02/2018: Advanced R programming. Leuven (Belgium). Subscribe here
    • 08-09/03/2018: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
    • 22-23/03/2018: Text Mining with R. Leuven (Belgium). Subscribe here

For business questions on text mining, feel free to contact BNOSAC by sending us a mail here.



Writing academic articles using R Markdown and LaTeX

Thu, 10/05/2017 - 01:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

One of my favourite activities in R is using Markdown to create business reports. I export most of my work to MS Word to communicate analytical results to my colleagues. For my academic work and eBooks, I prefer LaTeX to produce great typography. This article explains how to write academic articles and essays by combining R Markdown and LaTeX. The article is formatted in accordance with the APA (American Psychological Association) requirements.

To illustrate the principles of using R Markdown and LaTeX, I recycled an essay about problems with body image that I wrote for a psychology course many years ago. You can find the completed paper and all necessary files on my GitHub repository.

Body Image

Body image describes the way we feel about the shape of our body. The literature on this topic demonstrates that many people, especially young women, struggle with their body image. A negative body image has been strongly associated with eating disorders. Psychologists measure body image using a special scale, shown in the image below.

My paper measures the current and ideal body shape of the subject and the body shape of the most attractive other sex. The results confirm previous research which found that body dissatisfaction is significantly higher for women than for men. The research also found a mild positive correlation between age and ideal body shape for women and between age and the female body shape found most attractive by men. You can read the full paper on my personal website.

Body shape measurement scale.

R Markdown and LaTeX

The source file for this essay uses Sweave to integrate R code with LaTeX. The first two code chunks create a table to summarise the respondents using the xtable package. This package creates LaTeX or HTML tables from data generated by R code.

The first lines of the code read and prepare the data, while the second set of lines creates a table in LaTeX code. The code chunk uses results=tex to ensure the output is interpreted as LaTeX. This approach is used in most of the other chunks. The image is created within the document, saved as a PDF file and then included back into the document as a figure with an appropriate label and caption.

<<results=tex>>=
body <- read.csv("body_image.csv")

# Respondent characteristics
body$Cohort <- cut(body$Age, c(0, 15, 30, 50, 99),
                   labels = c("<16", "16--30", "31--50", ">50"))
body$Date <- as.Date(body$Date)
body$Current_Ideal <- body$Current - body$Ideal

library(xtable)
respondents <- addmargins(table(body$Gender, body$Cohort))
xtable(respondents,
       caption = "Age profile of survey participants",
       label = "gender-age", digits = 0)
@

Configuration

I created this file in RStudio, using the Sweave and knitr functionality. To knit the file for this paper you will need to install the apa6 and ccicons packages in your LaTeX distribution. The apa6 package provides macros to format papers in accordance with the requirements of the American Psychological Association.
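
If you prefer to compile from the R console rather than the RStudio button, a minimal sketch is below; the file name is hypothetical and stands in for the source file in the GitHub repository.

library(knitr)
# run the R chunks, then call LaTeX to produce the PDF
knit2pdf("body_image.Rnw")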

The post Writing academic articles using R Markdown and LaTeX appeared first on The Devil is in the Data.



Introducing the Deep Learning Virtual Machine on Azure

Wed, 10/04/2017 - 21:33

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

A new member has just joined the family of Data Science Virtual Machines on Azure: The Deep Learning Virtual Machine. Like other DSVMs in the family, the Deep Learning VM is a pre-configured environment with all the tools you need for data science and AI development pre-installed. The Deep Learning VM is designed specifically for GPU-enabled instances, and comes with a complete suite of deep learning frameworks including Tensorflow, PyTorch, MXNet, Caffe2 and CNTK. It also comes with example scripts and data sets to get you started on deep learning and AI problems.

The DLVM, along with all the DSVMs, also provides a complete suite of data science tools including R, Python, Spark, and much more.

There have also been some updates and additions to the tools provided across the entire DSVM family.

All Data Science Virtual Machines, including the Deep Learning Virtual Machine, are available as Windows and Ubuntu Linux instances, and are free of any software charges: pay only for the infrastructure charge according to the power and size of the instance you choose. An Azure account is required, but you can get started with $200 in free Azure credits here.

Microsoft Azure: Data Science Virtual Machines

 



Time Series Analysis in R Part 3: Getting Data from Quandl

Wed, 10/04/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

This is part 3 of a multi-part guide on working with time series data in R. You can find the previous parts here: Part 1, Part 2.

Generated data like that used in Parts 1 and 2 is great for the sake of example, but not very interesting to work with. So let’s get some real-world data that we can work with for the rest of this tutorial. There are countless sources of time series data that we can use, including some that are already included in R and its packages. We’ll use some of this data in examples, but I’d also like to expand our horizons a bit.

Quandl has a great warehouse of financial and economic data, some of which is free. We can use the Quandl R package to obtain data using the API. If you do not have the package installed in R, you can do so using:

install.packages('Quandl')

You can browse the site for a series of interest and get its API code. Below is an example of using the Quandl R package to get housing price index data. This data originally comes from the Yale Department of Economics and is featured in Robert Shiller’s book “Irrational Exuberance”. We use the Quandl function and pass it the code of the series we want. We also specify “ts” for the type argument so that the data is imported as an R ts object. We can also specify start and end dates for the series. This particular data series goes all the way back to 1890. That is far more than we need so I specify that I want data starting in January of 1990. I do not supply a value for the end_date argument because I want the most recent data available. You can find this data on the web here.

library(Quandl)

hpidata <- Quandl("YALE/NHPI", type = "ts", start_date = "1990-01-01")
plot.ts(hpidata, main = "Robert Shiller's Nominal Home Price Index")

Gives this plot:

While we are here, let’s grab some additional data series for later use. Below, I get data on US GDP and US personal income, and the University of Michigan Consumer Survey on selling conditions for houses. Again I obtained the relevant codes by browsing the Quandl website. The data are located on the web here, here, and here.

gdpdata <- Quandl("FRED/GDP", type="ts", start_date="1990-01-01") pidata <- Quandl("FRED/PINCOME", type="ts", start_date="1990-01-01") umdata <- Quandl("UMICH/SOC43", type="ts")[, 1] plot.ts(cbind(gdpdata, pidata), main="US GPD and Personal Income, billions $")

Gives this plot:

plot.ts(umdata, main = "University of Michigan Consumer Survey, Selling Conditions for Houses")

Gives this plot:

The Quandl API also has some basic options for data preprocessing. The US GDP data is in quarterly frequency, but assume we want annual data. We can use the collapse argument to collapse the data to a lower frequency. Here we convert the data to annual as we import it.

gdpdata_ann <- Quandl("FRED/GDP", type = "ts", start_date = "1990-01-01", collapse = "annual")
frequency(gdpdata_ann)
[1] 1

We can also transform our data on the fly as it is imported. The Quandl function has an argument transform that allows us to specify the type of data transformation we want to perform. There are five options – "diff", "rdiff", "normalize", "cumul" and "rdiff_from". Specifying the transform argument as "diff" returns the simple difference, "rdiff" yields the percentage change, "normalize" gives an index where each value is divided by the first period's value and multiplied by 100, "cumul" gives the cumulative sum, and "rdiff_from" gives each value as the percent difference between itself and the last value in the series. For more details on these transformations, check the API documentation here.

For example, here we get the data in percent change form:

gdpdata_pc <- Quandl("FRED/GDP", type = "ts", start_date = "1990-01-01", transform = "rdiff")
plot.ts(gdpdata_pc * 100, ylab = "% change", main = "US Gross Domestic Product, % change")

Gives this plot:

You can find additional documentation on using the Quandl R package here. I’d also encourage you to check out the vast amount of free data that is available on the site. The API allows a maximum of 50 calls per day from anonymous users. You can sign up for an account and get your own API key, which will allow you to make as many calls to the API as you like (within reason of course).
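
If you do sign up for a key, you can register it once per session before making any calls. A minimal sketch is below; the key string is a placeholder, not a real key.

library(Quandl)
Quandl.api_key("YOUR_API_KEY")  # placeholder; use the key from your Quandl account page
gdpdata <- Quandl("FRED/GDP", type = "ts", start_date = "1990-01-01")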

In Part 4, we will discuss visualization of time series data. We’ll go beyond the base R plotting functions we’ve used up until now and learn to create better-looking and more functional plots.

    Related Post

    1. Pulling Data Out of Census Spreadsheets Using R
    2. Extracting Tables from PDFs in R using the Tabulizer Package
    3. Extract Twitter Data Automatically using Scheduler R package
    4. An Introduction to Time Series with JSON Data
    5. Get Your Data into R: Import Data from SPSS, Stata, SAS, CSV or TXT


    nzelect 0.4.0 on CRAN with results from 2002 to 2014 and polls up to September 2017 by @ellis2013nz

    Wed, 10/04/2017 - 13:00

    (This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

    More nzelect New Zealand election data on CRAN

    Version 0.4.0 of my nzelect R package is now on CRAN. The key changes from version 0.3.0 are:

    • election results by voting place are now available back to 2002 (was just 2014)
    • polling place locations, presented as consistent values of latitude and longitude, now available back to 2008 (was just 2014)
    • voting intention polls are now complete up to the 2017 general election (previously stopped about six months ago on CRAN, although the GitHub version was always kept up to date)
    • a few minor bug fixes, e.g. allocate_seats now takes integer arguments, not just numeric.

    The definitive source of New Zealand election statistics is the Electoral Commission. If there are any discrepancies between their results and those in nzelect, it’s a bug, so please file an issue on GitHub. The voting intention polls come from Wikipedia.

    Example – special and early votes

    Currently, while we wait for the counting of the “special” votes from the 2017 election, there’s renewed interest in the differences between special votes and those counted on election night. Special votes are those cast by anyone voting outside their electorate, enrolled in the month before the election, or in one of a few other categories. Mostly, these are people who are on the move or who enrolled very recently, and such people obviously differ from the run-of-the-mill voter.

    Here’s a graphic created with the nzelect R package that shows how the “special votes” in the past have been disproportionately important for Greens and Labour:

    Here’s the code to create that graphic:

    # get the latest version from CRAN:
    install.packages("nzelect")

    library(nzelect)
    library(tidyverse)
    library(ggplot2)
    library(scales)
    library(ggrepel)
    library(forcats)

    # palette of colours for the next couple of charts:
    palette <- c(parties_v, Other = "pink2", `Informal Votes` = "grey")

    # special votes:
    nzge %>%
      filter(voting_type == "Party") %>%
      mutate(party = fct_lump(party, 5)) %>%
      mutate(dummy = grepl("special", voting_place, ignore.case = TRUE)) %>%
      group_by(electorate, party, election_year) %>%
      summarise(prop_before = sum(votes[dummy]) / sum(votes),
                total_votes = sum(votes)) %>%
      ungroup() %>%
      mutate(party = gsub(" Party", "", party),
             party = gsub("ACT New Zealand", "ACT", party),
             party = gsub("New Zealand First", "NZ First", party)) %>%
      mutate(party = fct_reorder(party, prop_before)) %>%
      ggplot(aes(x = prop_before, y = party, size = total_votes, colour = party)) +
      facet_wrap(~election_year) +
      geom_point(alpha = 0.1) +
      ggtitle("'Special' votes proportion by party, electorate and year",
              "Each point represents the proportion of a party's vote in each electorate that came from special votes") +
      labs(caption = "Source: www.electionresults.govt.nz, collated in the nzelect R package",
           y = "") +
      scale_size_area("Total party votes", label = comma) +
      scale_x_continuous("\nPercentage of party's votes that were 'special'", label = percent) +
      scale_colour_manual(values = palette, guide = FALSE)

    Special votes are sometimes confused with advance voting in general. While many special votes are advance votes, the relationship is far from one to one. We see this particularly acutely by comparing the previous graphic to one that is identical except that it identifies all advance votes (those with the phrase “BEFORE” in the Electoral Commission’s description of polling place):

    While New Zealand First are the party that gains least proportionately from special votes, they gain the most from advance votes, although the difference between parties is fairly marginal. New Zealand First voters are noticeably more likely to be in an older age bracket than the voters for other parties. My speculation on their disproportionate share of advance voting is that it is related to that, although I’m not an expert in that area and am interested in alternative views.

    This second graphic also shows nicely just how much advance voting is becoming a feature of the electoral landscape. Unlike the proportion of votes that are “special”, which has been fairly stable, the proportion of votes that are cast in advance has increased very substantially over the past decade, and increased further at the 2017 election (for which final results come out on Saturday).

    Here’s the code for the second graphic; it’s basically the same as the previous chunk of code, except filtering on a different character string in the voting place name:

    nzge %>%
      filter(voting_type == "Party") %>%
      mutate(party = fct_lump(party, 5)) %>%
      mutate(dummy = grepl("before", voting_place, ignore.case = TRUE)) %>%
      group_by(electorate, party, election_year) %>%
      summarise(prop_before = sum(votes[dummy]) / sum(votes),
                total_votes = sum(votes)) %>%
      ungroup() %>%
      mutate(party = gsub(" Party", "", party),
             party = gsub("ACT New Zealand", "ACT", party),
             party = gsub("New Zealand First", "NZ First", party)) %>%
      mutate(party = fct_reorder(party, prop_before)) %>%
      ggplot(aes(x = prop_before, y = party, size = total_votes, colour = party)) +
      facet_wrap(~election_year) +
      geom_point(alpha = 0.1) +
      ggtitle("'Before' votes proportion by party, electorate and year",
              "Each point represents the proportion of a party's vote in each electorate that were cast before election day") +
      labs(caption = "Source: www.electionresults.govt.nz, collated in the nzelect R package",
           y = "") +
      scale_size_area("Total party votes", label = comma) +
      scale_x_continuous("\nPercentage of party's votes that were before election day", label = percent) +
      scale_colour_manual(values = palette, guide = FALSE)

    ‘Party’ compared to ‘Candidate’ vote

    Looking for something else to showcase, I thought it might be interesting to pool all five elections for which I have the detailed results and compare the party vote (i.e. the proportional representation choice out of the two votes New Zealanders get) to the candidate vote (i.e. the representative member choice). Here’s a graphic that does just that:

    We see that New Zealand First and the Greens are the two parties that are most noticeably above the diagonal line indicating equality between party and candidate votes. This isn’t a surprise – these are minority parties that appeal to (different) demographic and issues-based communities that are dispersed across the country, and generally have little chance of winning individual electorates. Hence the practice of voters is often to split their votes. This is all perfectly fine and is exactly how mixed-member proportional voting systems are meant to work.

    Here’s the code that produced the scatter plot:

    nzge %>%
      group_by(voting_type, party) %>%
      summarise(votes = sum(votes)) %>%
      spread(voting_type, votes) %>%
      ggplot(aes(x = Candidate, y = Party, label = party)) +
      geom_abline(intercept = 0, slope = 1, colour = "grey50") +
      geom_point() +
      geom_text_repel(colour = "steelblue") +
      scale_x_log10("Total 'candidate' votes", label = comma, breaks = c(1, 10, 100, 1000) * 1000) +
      scale_y_log10("Total 'party' votes", label = comma, breaks = c(1, 10, 100, 1000) * 1000) +
      ggtitle("Lots of political parties: total votes from 2002 to 2014",
              "New Zealand general elections") +
      labs(caption = "Source: www.electionresults.govt.nz, collated in the nzelect R package") +
      coord_equal()

    What next?

    Obviously, the next big thing for nzelect is to get the 2017 results in once they are announced on Saturday. I should be able to do this for the GitHub version by early next week. I would have delayed the CRAN release until the 2017 results were available, but unfortunately I had a bug in some of the examples in my help files that stopped them working after 23 September 2017, so I had to rush a fix and the latest enhancements to CRAN to avoid problems with the CRAN maintainers (whose policies I fully endorse, and whom I thank, by the way).

    My other plans for nzelect over the next months to a year include:

    • reliable point locations for voting places back to 2002
    • identify consistent/duplicate voting places over time to make it easier to analyse comparative change by micro location
    • add detailed election results for the 1999 and 1996 elections (these are saved under a different naming convention to those from 2002 onwards, which is why they need a bit more work)
    • add high level election results for prior to 1994

    The source code for cleaning the election data and packaging it into nzelect is on GitHub. The package itself is on CRAN and installable in the usual way.



    RProtoBuf 0.4.11

    Wed, 10/04/2017 - 02:25

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    RProtoBuf provides R bindings for the Google Protocol Buffers ("ProtoBuf") data encoding and serialization library used and released by Google, and deployed fairly widely in numerous projects as a language and operating-system agnostic protocol.
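
    For readers who have not used the package before, here is a small, hedged usage sketch. It is based on the addressbook.proto example that ships with RProtoBuf (loaded when the package attaches) and is not taken from this release note.

    library(RProtoBuf)
    p <- new(tutorial.Person, name = "Ada", id = 1)  # build a message from the bundled proto definition
    raw_bytes <- p$serialize(NULL)                   # serialize the message to a raw vector
    p2 <- tutorial.Person$read(raw_bytes)            # parse the bytes back into a message
    p2$name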

    A new release, RProtoBuf 0.4.11, appeared on CRAN earlier today. Not unlike the other recent releases, it is mostly a maintenance release which switches two of the vignettes over to using the pinp package and its template for vignettes.

    Changes in RProtoBuf version 0.4.11 (2017-10-03)
    • The RProtoBuf-intro and RProtoBuf-quickref vignettes were converted to Rmarkdown using the templates and style file from the pinp package.

    • A few minor internal upgrades

    CRANberries also provides a diff to the previous release. The RProtoBuf page has copies of the (older) package vignette, the ‘quick’ overview vignette, a unit test summary vignette, and the pre-print for the JSS paper. Questions, comments etc should go to the GitHub issue tracker off the GitHub repo.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



    Create Powerpoint presentations from R with the OfficeR package

    Tue, 10/03/2017 - 23:30

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    For many of us data scientists, whatever the tools we use to conduct research or perform an analysis, our superiors are going to want the results as a Microsoft Office document. Most likely it's a Word document or a PowerPoint presentation, and it probably has to follow the corporate branding guidelines to boot. The OfficeR package, by David Gohel, addresses this problem by allowing you to take a Word or PowerPoint template and programmatically insert text, tables and charts generated by R into the template to create a complete document. (The OfficeR package also represents a leap forward from the similar ReporteRs package: it's faster, and no longer has a dependency on a Java installation.)

    At his blog, Len Kiefer takes the OfficeR package through its paces, demonstrating how to create a PowerPoint deck using R. The process is pretty simple:

    • Create a template PowerPoint presentation to host the slides. You can use the Slide Master mode in PowerPoint to customize the style of the slides you will create using R, and you can use all of PowerPoint's features to control layout, select fonts and colors, and include images like logos. 
    • For each slide you wish to create, you can either reference a template slide included in your base presentation, or add a new slide based on one of your custom layouts.
    • For each slide, you will reference one or more placeholders (regions where content goes) in the layout using their unique names. You can then use R functions to fill them with text, hyperlinks, tables or images. The formatting of each will be controlled by the formatting you specified within PowerPoint. A sketch of this workflow follows the list.
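
    Here is a minimal, hedged sketch of that workflow. It assumes the current API of the package (which is on CRAN as officer; placeholder functions have been renamed across versions) and a hypothetical branded template called template.pptx; it is not code from Len's post.

    library(officer)

    deck <- read_pptx("template.pptx")   # hypothetical corporate template
    deck <- add_slide(deck, layout = "Title and Content", master = "Office Theme")
    deck <- ph_with(deck, value = "Quarterly results",
                    location = ph_location_type(type = "title"))
    deck <- ph_with(deck, value = head(mtcars),      # a data.frame becomes a table
                    location = ph_location_type(type = "body"))
    print(deck, target = "quarterly_results.pptx")   # write the finished deck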

    Len used this process to convert the content of a blog post on housing inventory into the PowerPoint deck you see embedded below.

    You can find details of how to create Microsoft Office documents in the OfficeR vignette on PowerPoint documents, and there's a similar guide for Word documents, too. Or just take a look at the blog post linked above for a worked example.

    Len Kiefer: Crafting a PowerPoint Presentation with R



    R live class – Statistics for Data Science

    Tue, 10/03/2017 - 17:52

    (This article was first published on R blog | Quantide - R training & consulting, and kindly contributed to R-bloggers)

     

    Statistics for Data Science is our second course of the autumn term. It takes place on October 11-12 in Milano Lima.

    In this two-day course you will learn how to develop a wide variety of linear and generalized linear models with R. The course follows a step-by-step approach: starting from the simplest linear regression we will add complexity to the model up to the most sophisticated GLM, looking at the statistical theory along with its R implementation. Supported by plenty of examples, the course will give you a wide overview of the R capabilities to model and to make predictions.

    Statistics for Data Science Outlines
    • t-test, ANOVA
    • Linear, Polynomial and Multiple Regression
    • More Complex Linear Models
    • Generalized Linear Models
    • Logistic Regression
    • Poisson Dep. Var. Regression
    • Gamma Dep. Var. Regression
    • Check of models assumptions
    • Brief outlines of GAM, Mixed Models, Neural Networks and Tree-based Modelling

     

    Statistics for Data Science is organized by the R training and consulting company Quantide and is taught in Italian, while all the course materials are in English.

    The course is for max 6 attendees.

    Location

    The course location is 550 m (a 7-minute walk) from Milano Centrale station and just 77 m (a 1-minute walk) from the Lima subway station.

    Registration

    If you want to reserve a seat go to: FAQ, detailed program and tickets.

    Other R courses | Autumn term

    You can find an overview of all our courses here. Next dates will be:

    • October 25-26: Machine Learning with R. Find patterns in large data sets using the R tools for Dimensionality Reduction, Clustering, Classification and Prediction. Reserve now!
    • November 7-8: Data Visualization and Dashboard with R. Show the story behind your data: create beautiful effective visualizations and interactive Shiny dashboards. Reserve now!
    • November 21-22: R with Database and Big Data. From databases to distributed infrastructure, master the R techniques to handle and query Big Data. Reserve now!
    • November 29-30: Professional R Programming. Organise, document and test your code: write efficient functions, improve the code reproducibility and build R packages. Reserve now!

    In case you are a group of people interested in more than one class, write to us at training[at]quantide[dot]com! Together we can arrange a tailor-made course, picking all the topics that are of interest to your organization and dropping the rest.

    The post R live class – Statistics for Data Science appeared first on Quantide – R training & consulting.



    Dashboard Design: 8 Types of Online Dashboards

    Tue, 10/03/2017 - 13:31

    (This article was first published on R – Displayr, and kindly contributed to R-bloggers)

    What type of online dashboard will work best for your data? This post reviews eight types of online dashboards to assist you in choosing the right approach for your next dashboard. Note that there may well be more than eight types of dashboard; I am sure I will miss a few. If so, please tell me in the comments section of this post.

    KPI Online Dashboards

    The classic dashboards are designed to report key performance indicators (KPIs). Think of the dashboard of a car or the cockpit of an airplane. The KPI dashboard is all about dials and numbers. Typically, these dashboards are live and show the latest numbers. In a business context, they typically show trend data as well.

    A very simple example of a KPI dashboard is below. Such dashboards can, of course, be huge, with lots of pages crammed with numbers and charts looking at all manner of operational and strategic data.

    Click the image for an interactive version

    Geographic Online Dashboards

    The most attractive dashboards are often geographic. The example below was created by Iaroslava Mizai in Tableau. Due to people being inspired by such dashboards, I imagine that a lot of money has been spent on Tableau licenses.

    While visually attractive, such dashboards tend to make up a tiny proportion of the dashboards in widespread use. Outside of sales, geography, and demography, few people spend much time exploring geographic data.

    Click the image for an interactive version

    Catalog Online Dashboards

    A catalog online dashboard is based around a menu, from which the viewer can select the results they are interested in. It is a much more general type of dashboard, used for displaying data of any kind rather than geography, and you can use any variable to cut the data. For example, the Catalog Dashboard below gives the viewer a choice of country to investigate.

    Click the image for an interactive version

    The dashboard below has the same basic idea, except the user navigates by clicking the control box on the right side of the heading. In this example of a brand health dashboard, the control box is currently set to IGA (but you could click on it to change it to another supermarket).

     

    Click the image for an interactive version

    The PowerPoint Alternative Dashboard

    A story dashboard consists of a series of pages specifically ordered for the reader. This type of online dashboard is used as a powerful alternative to PowerPoint, with the additional benefits of being interactive, updatable and live. Typically, a user navigates through such a dashboard either using navigation buttons (i.e., forward and backward) or using the navigation bar on the left, as shown in the online dashboard example below.

    Drill-Down Online Dashboards

    A drill-down is an online dashboard (or dashboard control) where the viewer can “drill” into the data to get more information. The whole dashboard is organized in a hierarchical fashion.

    There are five common ways of facilitating drill-downs in dashboards: zoom, graphical filtering, control-based drill-downs, filtered drill-downs, and hyperlinks with landing pages. The choice of which to use is partly technological and partly related to the structure of the data.

    1. Zoom

    Zooming is perhaps the most widely used technique for permitting users to drill-down. The user can typically achieve the zoom via mouse, touch, + buttons, and draggers. For example, the earlier Microsoft KPI dashboard permitted the viewer to change the time series window by dragging on the bottom of each chart.

    While zooming is the most aesthetically pleasing way of drilling into data, it is also the least general. This approach to dashboarding only works when there is a strong and obvious ordering of the data. This is typically only the case with geographic and time series data, although sometimes data is forced into a hierarchy to make zooming possible. This is the case in the Zooming example below, which shows survival rates for the Titanic (double-click on it to zoom).

    Unless you are writing everything from scratch, the ability to add zoom to a dashboard will depend on the components being used (i.e. whether the components support zoom).

    Click the image for an interactive version

    2. Graphical filtering

    Graphical filtering allows the user to explore data by clicking on graphical elements of the dashboard. For example, in this QLik dashboard, I clicked on the Ontario pie slice (on the right of the screen) and all the other elements on the page automatically updated to show data relating to Ontario.

    Graphical filtering is cool. However, it requires both highly structured data and quite a bit of time spent figuring out how to design and implement the user interface, and these dashboards are also the most challenging to build. The most amazing examples tend to be bespoke websites created by data journalists (e.g., http://www.poppyfield.org/). The most straightforward way of creating such dashboards with graphical filtering tends to be using business intelligence tools, like Qlik and Tableau. Typically, there is a lot of effort required to structure the data up front; you then get the graphical filtering “for free”.  If you are more the DIY type, wanting to build your own dashboards and pay nothing, RStudio’s Shiny is probably the most straightforward option.

    Click the image for an interactive version

    3. Control-based drill-downs

    A quicker and easier way of implementing drill-downs is to give the user controls that they can use to select data. From a user interface perspective, the appearance is essentially the same as with the Supermarket Brand Health dashboard shown a few dashboards above: the user chooses from the available options (or uses sliders, radio buttons, etc.), as in the sketch below.
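
    To make the idea concrete, here is a hedged, self-contained Shiny sketch of a control-based drill-down (my own illustration, not from any of the dashboards above): a drop-down selects a subset of the data and every output on the page updates.

    library(shiny)
    library(ggplot2)

    ui <- fluidPage(
      selectInput("cyl", "Cylinders:", choices = sort(unique(mtcars$cyl))),
      plotOutput("mpg_plot"),
      tableOutput("summary_tbl")
    )

    server <- function(input, output) {
      drilled <- reactive(subset(mtcars, cyl == input$cyl))  # the drill-down filter

      output$mpg_plot <- renderPlot(
        ggplot(drilled(), aes(wt, mpg)) + geom_point()
      )
      output$summary_tbl <- renderTable(
        data.frame(cars = nrow(drilled()), mean_mpg = mean(drilled()$mpg))
      )
    }

    shinyApp(ui, server)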

    4. Filtered drill-downs

    When drilling-down involves restricting the data to a subset of the observations (e.g., to a subset of respondents in a survey), users can zoom in using filtering tools. For example, you can filter the Supermarket Brand Health dashboard by various demographic groups. While using filters to zoom is the least sexy of the ways of permitting users to drill into data, it is usually the most straightforward to implement. Furthermore, it is also a lot more general than any of the other styles of drill-downs considered so far. For example, the picture below illustrates drilling into the data of women aged 35 or more (using the Filters drop-down menu on the top right corner).

    Click the image for an interactive version

    5. Hyperlink drill-downs

    The most general approach for creating drill-downs is to link together multiple pages with hyperlinks. While all of the other approaches involve some aspect of filtering, hyperlinks enable the user to drill into qualitatively different data. Typically, there is a landing page that contains a summary of key data, and the user clicks on the data of interest to drill down and get more information. In the example of a hyperlinked dashboard below, the landing page shows the performance of different departments in a supermarket. The viewer clicks on the result for a department (e.g. CHECK OUT), which takes them to a screen showing more detailed results.

    Click the image for an interactive version

     

    Interactive Infographic Dashboard

    Infographic dashboards present viewers with a series of closely related charts, text, and images. Here is an example of an interactive infographic on Gamers, where the user can change the country at the top and the dashboard automatically updates.

    Click the image for an interactive version

    Visual Confections Dashboard

    A visual confection is an online dashboard that layers multiple visual elements on top of one another, whereas a series of related visualizations is an infographic. The dashboard below overlays time series information with exercise and diet information.

    Click the image for an interactive version

    Simulator Dashboards

    The final type of dashboard that I can think of is a simulator. The simulator dashboard example below is from a latent class logit choice model of the egg market. The user can select different properties for each of the competitors and the dashboard predicts market share.

    Click the image for an interactive version

     Create your own Online Dashboards

    I have mentioned a few specific apps for creating online dashboards, including Tableau, QLik, and Shiny. All the other online dashboards in this post used R from within Displayr (you can even just use Displayr to see the underlying R code for each online dashboard). To explore or replicate the Displayr dashboards, just follow the links below for Edit mode for each respective dashboard, and then click on each of the visual elements.

    Microsoft KPI

    Overview: A one-page dashboard showing stock price and Google Trends data for Microsoft.
    Interesting features: Automatically updated every 24 hours, pulling in data from Yahoo Finance and Google Trends.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the dashboard.

     

    Europe and Immigration

    Overview: Attitudes of Europeans to Immigration
    Interesting features: Based on 213,308 survey responses collected over 13 years. Custom navigation via images and hyperlinks.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the online dashboard.

     

    Supermarket Brand Health

    Overview: Usage and attitudes towards supermarkets
    Interesting features: Uses a control (combo box) to update the calculations for the chosen supermarket brand.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the online dashboard.

     

    Supermarket Department NPS

    Overview: Performance by department of supermarkets.
    Interesting features: Color-coding of circles based on underlying data (they change when the data is filtered using the Filters menu in the top right). Custom navigation, whereby the user clicks on the circle for a department and gets more information about that department.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the dashboard.

     

    Blood Glucose Confection

    Overview: Blood glucose measurements and food diary.
    Interesting features: The fully automated underlying charts that integrate data from a wearable blood glucose implant and a food diary. See Layered Data Visualizations Using R, Plotly, and Displayr for more about this dashboard.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the online dashboards.

     

    Interactive infographic

    Overview: An infographic that updates based on the viewer’s selection of country.
    Interesting features: Based on an infographic created in Canva. The data is pasted in from a spreadsheet  (i.e., no hookup to a database).
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the dashboard.

     

     

    Presidential MaxDiff

    Overview: A story-style dashboard showing an analysis of what Americans desire in their Commander-in-Chief.
    Interesting features: A revised data file can be used to automatically update the visualizations, text, and the underlying analysis (a MaxDiff model), i.e., it is an automated report.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the online dashboards.

     

    Choice Simulator

    Overview: A decision-support system
    Interesting features: The simulator is hooked up directly to an underlying latent class model. See How to Create an Online Choice Simulator for more about this dashboard.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the dashboard.

     

     



    googleLanguageR – Analysing language through the Google Cloud Machine Learning APIs

    Tue, 10/03/2017 - 09:00

    (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

    One of the greatest assets human beings possess is the power of speech and language, from which almost all our other accomplishments flow. To be able to analyse communication offers us a chance to gain a greater understanding of one another.

    To help you with this, googleLanguageR is an R package that allows you to perform speech-to-text transcription, neural net translation and natural language processing via the Google Cloud machine learning services.

    An introduction to the package is below, but you can find out more details at the googleLanguageR website.

    Google's bet

    Google predicts that machine learning is to be a fundamental feature of business, and so they are looking to become the infrastructure that makes machine learning possible. Metaphorically speaking: If machine learning is electricity, then Google wants to be the pylons carrying it around the country.

    Google may not be the only company with such ambitions, but one advantage Google has is the amount of data it possesses. Twenty years of web crawling has given it an unprecedented corpus to train its models on. In addition, its recent moves into voice and video give it one of the biggest audio and speech datasets, all of which have been used to help create machine learning applications within its products such as search and Gmail. Further investment in machine learning is shown by Google's purchase of Deepmind, a UK-based A.I. research firm that was recently in the news for defeating the top Go champion with its neural-network-trained Go bot. Google has also taken an open-source route with the creation and publication of Tensorflow, a leading machine learning framework.

    Whilst you can create your own machine learning models, for those users who don't have the expertise, data or time to do so, Google also offers an increasing range of pre-trained machine learning APIs, such as image and video recognition or job search. googleLanguageR wraps the subset of those machine learning APIs that are language flavoured – Cloud Speech, Translation and Natural Language.

    Since their outputs are complementary and can be fed into one another's inputs, all three of the APIs are included in one package. For example, you can transcribe a recording of someone speaking in Danish, translate that to English, then identify how positive or negative the writer felt about its content (sentiment analysis) and identify the most important concepts and objects within the content (entity analysis).
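
    As a rough illustration of that chain, here is a hedged sketch using the package's three main functions; the audio file name is hypothetical and the exact shape of each return value may differ slightly between package versions.

    library(googleLanguageR)

    danish_audio <- "speech_in_danish.wav"                       # hypothetical recording
    speech  <- gl_speech(danish_audio, languageCode = "da-DK")   # transcribe the Danish audio
    text_en <- gl_translate(speech$transcript, target = "en")    # translate the transcript to English
    nlp     <- gl_nlp(text_en$translatedText)                    # sentiment and entity analysis

    nlp$documentSentiment
    nlp$entities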

    Motivations

    Fake news

    One reason why I started looking at this area was the growth of 'fake news' and its effect on political discourse on social media. I wondered if there was some way to put metrics on how much a news story fuelled one's own bias within one's own filter bubble. The entity API provides a way to perform entity and sentiment analysis at scale on tweets, and by then comparing the preferences of different users and news sources, the hope is to be able to judge how much they are in agreement with your own biases, views and trusted reputation sources.

    Make your own Alexa

    Another motivating application is the growth of voice commands, which will become the primary way of interfacing with technology. Already, Google reports that up to 20% of search in its app is via voice search. I'd like to be able to say "R, print me out that report for client X". A Shiny app that records your voice, uploads the recording to the API and then parses the returned text into actions gives you a chance to create your very own Alexa-like infrastructure.

    The voice activated internet connected speaker, Amazon's Alexa – image from www.amazon.co.uk

    Translate everything

    Finally, I live and work in Denmark. As Danish is spoken by fewer than 6 million people, applications that work in English may not be available in Danish very quickly, if at all. The API's translation service is the one that made the news in 2016 for "inventing its own language", and it offers much better English-to-Danish translations than the free web version, which may make services available in Denmark sooner.

    Using the library

    To use these APIs within R, you first need to do a one-time setup to create a Google Project, add a credit card and authenticate which is detailed on the package website.

    After that, you feed in the R objects you want to operate upon. The rOpenSci review helped to ensure that this can scale up easily, so that you can feed in large character vectors which the library will parse and rate limit as required. The functions also work within tidyverse pipe syntax.

    Speech-to-text

    The Cloud Speech API is exposed via the gl_speech function.

    It supports multiple audio formats and languages, and you can either feed in a sub-60-second audio file directly or perform asynchronous requests for longer audio files.

    Example code:

    library(googleLanguageR)

    my_audio <- "my_audio_file.wav"
    gl_speech(my_audio)
    # A tibble: 1 x 3
    #   transcript confidence words
    # *
    # 1  Hello Mum  0.9227779

    Translation

    The Cloud Translation API lets you translate text via gl_translate

    As you are charged per character, one tip here if you are working with lots of different languages is to perform detection of language offline first using another rOpenSci package, cld2. That way you can avoid charges for text that is already in your target language i.e. English.

    library(googleLanguageR)
    library(cld2)
    library(purrr)

    my_text <- c("Katten sidder på måtten", "The cat sat on the mat")

    ## offline detect language via cld2
    detected <- map_chr(my_text, detect_language)
    # [1] "DANISH"  "ENGLISH"

    ## get non-English text
    translate_me <- my_text[detected != "ENGLISH"]

    ## translate
    gl_translate(translate_me)
    ## A tibble: 1 x 3
    #                  translatedText detectedSourceLanguage                    text
    # *
    # 1 The cat is sitting on the mat                     da Katten sidder på måtten

    Natural Language Processing

    The Natural Language API reveals the structure and meaning of text, accessible via the gl_nlp function.

    It returns several analyses:

    • Entity analysis – finds named entities (currently proper names and common nouns) in the text along with entity types, salience, mentions for each entity, and other properties. If possible, will also return metadata about that entity such as a Wikipedia URL.
    • Syntax – analyzes the syntax of the text and provides sentence boundaries and tokenization along with part of speech tags, dependency trees, and other properties.
    • Sentiment – the overall sentiment of the text, represented by a magnitude [0, +inf] and score between -1.0 (negative sentiment) and 1.0 (positive sentiment)

    These are all useful for getting an understanding of the meaning of a sentence, and this API potentially has the greatest number of applications of the three featured. With entity analysis, auto-categorisation of text is possible; the syntax output lets you pull out nouns and verbs for parsing into other actions; and the sentiment analysis allows you to get a feeling for the emotion within text.

    A demonstration is below which gives an idea of what output you can generate:

    library(googleLanguageR)

    quote <- "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
    nlp <- gl_nlp(quote)
    str(nlp)
    #List of 6
    # $ sentences :List of 1
    # ..$ :'data.frame': 1 obs. of 4 variables:
    # .. ..$ content : chr "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
    # .. ..$ beginOffset: int 0
    # .. ..$ magnitude : num 0.6
    # .. ..$ score : num -0.6
    # $ tokens :List of 1
    # ..$ :'data.frame': 20 obs. of 17 variables:
    # .. ..$ content : chr [1:20] "Two" "things" "are" "infinite" ...
    # .. ..$ beginOffset : int [1:20] 0 4 11 15 23 25 29 38 42 48 ...
    # .. ..$ tag : chr [1:20] "NUM" "NOUN" "VERB" "ADJ" ...
    # .. ..$ aspect : chr [1:20] "ASPECT_UNKNOWN" "ASPECT_UNKNOWN" "ASPECT_UNKNOWN" "ASPECT_UNKNOWN" ...
    # .. ..$ case : chr [1:20] "CASE_UNKNOWN" "CASE_UNKNOWN" "CASE_UNKNOWN" "CASE_UNKNOWN" ...
    # .. ..$ form : chr [1:20] "FORM_UNKNOWN" "FORM_UNKNOWN" "FORM_UNKNOWN" "FORM_UNKNOWN" ...
    # .. ..$ gender : chr [1:20] "GENDER_UNKNOWN" "GENDER_UNKNOWN" "GENDER_UNKNOWN" "GENDER_UNKNOWN" ...
    # .. ..$ mood : chr [1:20] "MOOD_UNKNOWN" "MOOD_UNKNOWN" "INDICATIVE" "MOOD_UNKNOWN" ...
    # .. ..$ number : chr [1:20] "NUMBER_UNKNOWN" "PLURAL" "NUMBER_UNKNOWN" "NUMBER_UNKNOWN" ...
    # .. ..$ person : chr [1:20] "PERSON_UNKNOWN" "PERSON_UNKNOWN" "PERSON_UNKNOWN" "PERSON_UNKNOWN" ...
    # .. ..$ proper : chr [1:20] "PROPER_UNKNOWN" "PROPER_UNKNOWN" "PROPER_UNKNOWN" "PROPER_UNKNOWN" ...
    # .. ..$ reciprocity : chr [1:20] "RECIPROCITY_UNKNOWN" "RECIPROCITY_UNKNOWN" "RECIPROCITY_UNKNOWN" "RECIPROCITY_UNKNOWN" ...
    # .. ..$ tense : chr [1:20] "TENSE_UNKNOWN" "TENSE_UNKNOWN" "PRESENT" "TENSE_UNKNOWN" ...
    # .. ..$ voice : chr [1:20] "VOICE_UNKNOWN" "VOICE_UNKNOWN" "VOICE_UNKNOWN" "VOICE_UNKNOWN" ...
    # .. ..$ headTokenIndex: int [1:20] 1 2 2 2 2 6 2 6 9 6 ...
    # .. ..$ label : chr [1:20] "NUM" "NSUBJ" "ROOT" "ACOMP" ...
    # .. ..$ value : chr [1:20] "Two" "thing" "be" "infinite" ...
    # $ entities :List of 1
    # ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 6 obs. of 9 variables:
    # .. ..$ name : chr [1:6] "human stupidity" "things" "universe" "universe" ...
    # .. ..$ type : chr [1:6] "OTHER" "OTHER" "OTHER" "OTHER" ...
    # .. ..$ salience : num [1:6] 0.1662 0.4771 0.2652 0.2652 0.0915 ...
    # .. ..$ mid : Factor w/ 0 levels: NA NA NA NA NA NA
    # .. ..$ wikipedia_url: Factor w/ 0 levels: NA NA NA NA NA NA
    # .. ..$ magnitude : num [1:6] NA NA NA NA NA NA
    # .. ..$ score : num [1:6] NA NA NA NA NA NA
    # .. ..$ beginOffset : int [1:6] 42 4 29 86 29 86
    # .. ..$ mention_type : chr [1:6] "COMMON" "COMMON" "COMMON" "COMMON" ...
    # $ language : chr "en"
    # $ text : chr "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
    # $ documentSentiment:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 2 variables:
    # ..$ magnitude: num 0.6
    # ..$ score : num -0.6

    Acknowledgements

    This package is 10 times better due to the efforts of the rOpenSci reviewers Neal Richardson and Julia Gustavsen, who have whipped the documentation, outputs and test cases into the form they are today in 0.1.0. Many thanks to them.

    Hopefully, this is just the beginning and the package can be further improved by its users – if you do give the package a try and find a potential improvement, raise an issue on GitHub and we can try to implement it. I'm excited to see what users can do with these powerful tools.


    To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Mapping ecosystems of software development

    Tue, 10/03/2017 - 02:00

    (This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

I have a new post on the Stack Overflow blog today about the complex, interrelated ecosystems of software development. On the data team at Stack Overflow, we spend a lot of time and energy thinking about tech ecosystems and how technologies are related to each other. One way to get at this idea of relationships between technologies is tag correlations: how often technology tags at Stack Overflow appear together relative to how often they appear separately. One place we see developers using tags at Stack Overflow is on their Developer Stories. If we are interested in how technologies are connected and how they are used together, developers’ own descriptions of their work and careers are a great place to get that.
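As a rough illustration of what a tag correlation computation can look like (my own sketch, not the post's code; user_tags is a hypothetical long data frame with one row per user and tag, not the actual Stack Overflow data):

library(dplyr)
library(widyr)

# correlation (phi coefficient) between tags, based on whether they co-occur for the same user
tag_correlations <- user_tags %>%
  pairwise_cor(tag, user_id, sort = TRUE)

head(tag_correlations)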

    I released the data for this network structure as a dataset on Kaggle so you can explore it for yourself! For example, the post for Stack Overflow includes an interactive visualization created using the networkD3 package but we can create other kinds of visualizations using the ggraph package. Either way, trusty igraph comes into play.

library(readr)
library(igraph)
library(ggraph)

stack_network <- graph_from_data_frame(read_csv("stack_network_links.csv"),
                                       vertices = read_csv("stack_network_nodes.csv"))

set.seed(2017)
ggraph(stack_network, layout = "fr") +
  geom_edge_link(alpha = 0.2, aes(width = value)) +
  geom_node_point(aes(color = as.factor(group), size = 10 * nodesize)) +
  geom_node_text(aes(label = name), family = "RobotoCondensed-Regular", repel = TRUE) +
  theme_graph(base_family = "RobotoCondensed-Regular") +
  theme(plot.title = element_text(family="Roboto-Bold"),
        legend.position="none") +
  labs(title = "Stack Overflow Tag Network",
       subtitle = "Tags correlated on Developer Stories")

    We have explored these kinds of network structures using all kinds of data sources at Stack Overflow, from Q&A to traffic, and although we see similar relationships across all of them, we really like Developer Stories as a data source for this particular question. Let me know if you have any comments or questions!


    To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Comparing assault death rates in the US to other advanced democracies

    Mon, 10/02/2017 - 23:28

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

In an effort to provide context to the frequent mass shootings in the United States, Kieran Healy (Associate Professor of Sociology at Duke University) created this updated chart comparing assault death rates in the US to those of 23 other advanced democracies. The chart shows the rate (per 100,000 citizens) of deaths caused by assaults (stabbings, gunshots, etc. by a third party). Assaults are used rather than gun deaths specifically, as that's the only statistic for which readily comparable data are available. The data come from the OECD Health Status Database through 2015, the most recent complete year available.

    The goal of this chart is to "set the U.S. in some kind of longitudinal context with broadly comparable countries", and to that end OECD countries Estonia and Mexico are not included. (Estonia suffered a spike of violence in the mid-90's, and Mexico has been embroiled in drug violence for decades. See the chart with Estonia and Mexico included here.) Healy provides a helpful FAQ justifying this decision and other issues related to the data and their presentation.

Healy used the R language (and, specifically, the ggplot2 graphics package) to create this chart, and the source code is available on Github.
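For readers who want a sense of how such a small-multiples rate chart is typically built with ggplot2, here is a minimal hedged sketch; it is not Healy's actual code, and the oecd_assaults data frame with country, year and rate columns is hypothetical.

library(ggplot2)

# oecd_assaults: hypothetical data frame with columns country, year, rate (deaths per 100,000)
ggplot(oecd_assaults, aes(x = year, y = rate)) +
  geom_line() +
  facet_wrap(~ country) +
  labs(x = NULL, y = "Assault deaths per 100,000 population")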

For more context around this chart, follow the link below, and also see his prior renderings and commentary related to the same data through 2013 and through 2010.

    Kieran Healy: Assault deaths to 2015


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Processing Rmarkdown documents with Eclipse and StatET

    Mon, 10/02/2017 - 23:06

    (This article was first published on R-project – lukemiller.org, and kindly contributed to R-bloggers)

    Processing R markdown (Rmd) documents with Eclipse/StatET external tools requires a different setup than processing ‘regular’ knitr documents (Rnw). I was having problems getting the whole rmarkdown -> pandoc workflow working on Eclipse, but the following fix seems to have resolved it, and I can generate Word or HTML documents from a single .Rmd file with a YAML header (see image below).

    A .Rmd document with YAML header set to produce a Microsoft Word .docx output file.

    For starters, I open the Run > External Tools > External Tools Configurations window. The Rmarkdown document is considered a type of Wikitext, so create a new Wikitext + R Document Processing entry. To do that, hit the little blank-page-with-a-plus icon found in the upper left corner of the window, above where it says “type filter text”.

    Using the Auto (YAML) single-step preset.

    Enter a name for this configuration in the Name field (where it says knitr_RMD_yaml in my image above). I then used the Load Preset/Example menu in the upper right to choose the “Auto (YAML) using RMarkdown, single-step” option. This processes your .Rmd document directly using the rmarkdown::render() function, skipping over a separate knitr::knit() step, but the output should look the same.
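Under the hood, the single-step preset essentially calls rmarkdown::render() on the active document, something roughly like the following (a hedged approximation; the file name is a placeholder):

library(rmarkdown)
render("my_document.Rmd")  # output format (Word, HTML, ...) is taken from the YAML header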

    Next go to the “2) Produce Output” tab (you are skipping over the “1) R-Weave” tab because you chose the 1-step process). By default the entry in the File field here was causing errors for me. The change is pictured below, so that the File entry reads "${file_name_base:${source_file_path}}.${out_file_ext}". This change allowed my setup to actually find the .Rmd and output .md files successfully, so that the .md document could then be passed on to pandoc.

Modify the File entry from the default so that it instead reads the same as what’s pictured here.

    This all assumes that you’ve previously downloaded and installed pandoc so that it can be found on the Windows system $PATH.
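If you are unsure whether pandoc is actually visible, a quick check from the R console looks like this (assuming the rmarkdown package is installed):

Sys.which("pandoc")            # returns the path if pandoc is on the system PATH
rmarkdown::pandoc_available()  # TRUE if rmarkdown can find a usable pandoc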


    To leave a comment for the author, please follow the link and comment on their blog: R-project – lukemiller.org. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Marketing Multi-Channel Attribution model based on Sales Funnel with R

    Thu, 09/28/2017 - 06:00

    (This article was first published on R language – AnalyzeCore – data is beautiful, data is a story, and kindly contributed to R-bloggers)

This is the last post in the series of articles about using Multi-Channel Attribution in marketing. In the previous two articles (part 1 and part 2), we reviewed a simple and powerful approach based on Markov chains that allows you to effectively attribute marketing channels.

    In this article, we will review another fascinating approach that marries heuristic and probabilistic methods. Again, the core idea is straightforward and effective.

Sales Funnel

Usually, companies have some kind of idea of how their clients move along the user journey from first visiting a website to closing a purchase. This sequence of steps is called a Sales (purchasing or conversion) Funnel. Classically, the Sales Funnel includes at least four steps:
    • Awareness – the customer becomes aware of the existence of a product or service (“I didn’t know there was an app for that”),
    • Interest – actively expressing an interest in a product group (“I like how your app does X”),
    • Desire – aspiring to a particular brand or product (“Think I might buy a yearly membership”),
    • Action – taking the next step towards purchasing the chosen product (“Where do I enter payment details?”).

For an e-commerce site, we can come up with one or more conditions (events/actions) that serve as evidence of passing each step of a Sales Funnel.

    For some extra information about Sales Funnel, you can take a look at my (rather ugly) approach of Sales Funnel visualization with R.

    Companies, naturally, lose some share of visitors on each following step of a Sales Funnel as it gets narrower. That’s why it looks like a string of bottlenecks. We can calculate a probability of transition from the previous step to the next one based on recorded history of transitions. On the other hand, customer journeys are sequences of sessions (visits) and these sessions are attributed to different marketing channels.

Therefore, we can link marketing channels with the probability of a customer passing through each step of a Sales Funnel. And here is the core idea of the concept: the probability of moving through each “bottleneck” represents the value of the marketing channel which leads a customer through it. The higher the probability of passing a “neck”, the lower the value of the channel that provided the transition; and vice versa, the lower the probability, the higher the value of the marketing channel in question.
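As a toy illustration of this inversion (my own numbers, not from the data set used below): if only 20% of visitors view two pages, that step is hard to pass and carries a lot of value; if 90% of add-to-cart sessions convert to a purchase, that step is easy and carries little.

# hypothetical transition probabilities for three funnel steps
step_probs <- c(two_pages = 0.20, product_page = 0.25, purchase = 0.90)

step_importance <- 1 - step_probs        # harder steps are worth more
step_importance / sum(step_importance)   # weighted so the importances sum to 1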

    Let’s study the concept with the following example. First off, we’ll define the Sales Funnel and a set of conditions which will register as customer passing through each step of the Funnel.

    • 0 step (necessary condition) – customer visits a site for the first time
    • 1st step (awareness) – visits two site’s pages
    • 2nd step (interest) – reviews a product page
    • 3rd step (desire) –  adds a product to the shopping cart
    • 4th step (action) – completes purchase

    Second, we need to extract the data that includes sessions where corresponding events occurred. We’ll simulate this data with the following code:

library(tidyverse)
library(purrrlyr)
library(reshape2)

##### simulating the "real" data #####
set.seed(454)
df_raw <- data.frame(customer_id = paste0('id', sample(c(1:5000), replace = TRUE)),
                     date = as.POSIXct(rbeta(10000, 0.7, 10) * 10000000, origin = '2017-01-01', tz = "UTC"),
                     channel = paste0('channel_', sample(c(0:7), 10000, replace = TRUE,
                                                         prob = c(0.2, 0.12, 0.03, 0.07, 0.15, 0.25, 0.1, 0.08))),
                     site_visit = 1) %>%
  mutate(two_pages_visit = sample(c(0, 1), 10000, replace = TRUE, prob = c(0.8, 0.2)),
         product_page_visit = ifelse(two_pages_visit == 1,
                                     sample(c(0, 1), length(two_pages_visit[which(two_pages_visit == 1)]),
                                            replace = TRUE, prob = c(0.75, 0.25)),
                                     0),
         add_to_cart = ifelse(product_page_visit == 1,
                              sample(c(0, 1), length(product_page_visit[which(product_page_visit == 1)]),
                                     replace = TRUE, prob = c(0.1, 0.9)),
                              0),
         purchase = ifelse(add_to_cart == 1,
                           sample(c(0, 1), length(add_to_cart[which(add_to_cart == 1)]),
                                  replace = TRUE, prob = c(0.02, 0.98)),
                           0)) %>%
  dmap_at(c('customer_id', 'channel'), as.character) %>%
  arrange(date) %>%
  mutate(session_id = row_number()) %>%
  arrange(customer_id, session_id)

df_raw <- melt(df_raw, id.vars = c('customer_id', 'date', 'channel', 'session_id'),
               value.name = 'trigger', variable.name = 'event') %>%
  filter(trigger == 1) %>%
  select(-trigger) %>%
  arrange(customer_id, date)

    And the data sample looks like:

    Next up, the data needs to be preprocessed. For example, it would be useful to replace NA/direct channel with the previous one or separate first-time purchasers from current customers, or even create different Sales Funnels based on new and current customers, segments, locations and so on. I will omit this step but you can find some ideas on preprocessing in my previous blogpost.
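As a hedged sketch of one such preprocessing step (not part of the original code), a last-known-channel fill could look roughly like this; the assumption that channel_0 plays the role of the NA/direct channel is purely illustrative:

library(dplyr)
library(tidyr)

df_raw_clean <- df_raw %>%
  group_by(customer_id) %>%
  arrange(date) %>%
  mutate(channel = ifelse(channel == 'channel_0', NA, channel)) %>%  # treat channel_0 as direct (assumption)
  fill(channel, .direction = "down") %>%                             # replace with the previous known channel
  ungroup()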

    The important thing about this approach is that we only have to attribute the initial marketing channel, one that led the customer through their first step. For instance, a customer initially reviews a product page (step 2, interest) and is brought by channel_1. That means any future product page visits from other channels won’t be attributed until the customer makes a purchase and starts a new Sales Funnel journey.

    Therefore, we will filter records for each customer and save the first unique event of each step of the Sales Funnel using the following code:

### removing not first events ###
df_customers <- df_raw %>%
  group_by(customer_id, event) %>%
  filter(date == min(date)) %>%
  ungroup()

Note that in this way we assume that all customers were first-time buyers; therefore, every subsequent purchase event will be removed by the above code.

Now, we can use the obtained data frame to compute the Sales Funnel’s transition probabilities, the importance of each Sales Funnel step, and their weighted importance. According to the method, the higher the probability, the lower the value of the channel. Therefore, we will calculate the importance of each step as 1 minus its transition probability. After that, we need to weight the importances because their sum will be higher than 1. We will do these calculations with the following code:

### Sales Funnel probabilities ###
sf_probs <- df_customers %>%
  group_by(event) %>%
  summarise(customers_on_step = n()) %>%
  ungroup() %>%
  mutate(sf_probs = round(customers_on_step / customers_on_step[event == 'site_visit'], 3),
         sf_probs_step = round(customers_on_step / lag(customers_on_step), 3),
         sf_probs_step = ifelse(is.na(sf_probs_step) == TRUE, 1, sf_probs_step),
         sf_importance = 1 - sf_probs_step,
         sf_importance_weighted = sf_importance / sum(sf_importance))

A hint: it can be a good idea to compute the Sales Funnel probabilities over a limited prior period, for example 1-3 months. The reason is that the flow of customers, or the capacity of the “necks”, can vary due to changes on a company’s site or in its marketing campaigns and so on. Therefore, you can analyze the dynamics of the Sales Funnel’s transition probabilities in order to find the appropriate time period.

    I can’t publish a blogpost without visualization. This time I suggest another approach for the Sales Funnel visualization that represents all customer journeys through the Sales Funnel with the following code:

### Sales Funnel visualization ###
df_customers_plot <- df_customers %>%
  group_by(event) %>%
  arrange(channel) %>%
  mutate(pl = row_number()) %>%
  ungroup() %>%
  mutate(pl_new = case_when(
    event == 'two_pages_visit' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'two_pages_visit'])) / 2),
    event == 'product_page_visit' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'product_page_visit'])) / 2),
    event == 'add_to_cart' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'add_to_cart'])) / 2),
    event == 'purchase' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'purchase'])) / 2),
    TRUE ~ 0),
    pl = pl + pl_new)

df_customers_plot$event <- factor(df_customers_plot$event,
                                  levels = c('purchase', 'add_to_cart', 'product_page_visit',
                                             'two_pages_visit', 'site_visit'))

# color palette
cols <- c('#4e79a7', '#f28e2b', '#e15759', '#76b7b2', '#59a14f',
          '#edc948', '#b07aa1', '#ff9da7', '#9c755f', '#bab0ac')

ggplot(df_customers_plot, aes(x = event, y = pl)) +
  theme_minimal() +
  scale_colour_manual(values = cols) +
  coord_flip() +
  geom_line(aes(group = customer_id, color = as.factor(channel)), size = 0.05) +
  geom_text(data = sf_probs, aes(x = event, y = 1, label = paste0(sf_probs*100, '%')),
            size = 4, fontface = 'bold') +
  guides(color = guide_legend(override.aes = list(size = 2))) +
  theme(legend.position = 'bottom',
        legend.direction = "horizontal",
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(size = 20, face = "bold", vjust = 2, color = 'black', lineheight = 0.8),
        axis.title.y = element_text(size = 16, face = "bold"),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_text(size = 8, angle = 90, hjust = 0.5, vjust = 0.5, face = "plain")) +
  ggtitle("Sales Funnel visualization - all customers journeys")

    Ok, seems we now have everything to make final calculations. In the following code, we will remove all users that didn’t make a purchase. Then, we’ll link weighted importances of the Sales Funnel steps with sessions by event and, at last, summarize them.

### computing attribution ###
df_attrib <- df_customers %>%
  # removing customers without purchase
  group_by(customer_id) %>%
  filter(any(as.character(event) == 'purchase')) %>%
  ungroup() %>%
  # joining step's importances
  left_join(., sf_probs %>% select(event, sf_importance_weighted), by = 'event') %>%
  group_by(channel) %>%
  summarise(tot_attribution = sum(sf_importance_weighted)) %>%
  ungroup()

As a result, we’ve obtained the number of conversions distributed across marketing channels:

    In the same way you can distribute the revenue by channels.

    At the end of the article, I want to share OWOX company’s blog where you can read more about the approach: Funnel Based Attribution Model.

In addition, OWOX provides an automated system for Marketing Multi-Channel Attribution based on BigQuery. Therefore, if you are not familiar with R or don’t have a suitable data warehouse, I recommend testing their service.


    The post Marketing Multi-Channel Attribution model based on Sales Funnel with R appeared first on AnalyzeCore – data is beautiful, data is a story.


    To leave a comment for the author, please follow the link and comment on their blog: R language – AnalyzeCore – data is beautiful, data is a story. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    RcppZiggurat 0.1.4

    Thu, 09/28/2017 - 04:06

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance release of RcppZiggurat is now on the CRAN network for R. It switches the vignette to our new pinp package and its two-column pdf default.

    The RcppZiggurat package updates the code for the Ziggurat generator which provides very fast draws from a Normal distribution. The package provides a simple C++ wrapper class for the generator improving on the very basic macros, and permits comparison among several existing Ziggurat implementations. This can be seen in the figure where Ziggurat from this package dominates accessing the implementations from the GSL, QuantLib and Gretl—all of which are still way faster than the default Normal generator in R (which is of course of higher code complexity).
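As a quick illustrative sketch (not from the release announcement), drawing standard normals with the package looks roughly like this:

library(RcppZiggurat)
zsetseed(42)   # seed the Ziggurat generator
zrnorm(5)      # five standard normal draws, typically much faster than rnorm() for large n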

    The NEWS file entry below lists all changes.

    Changes in version 0.1.4 (2017-07-27)
    • The vignette now uses the pinp package in two-column mode.

    • Dynamic symbol registration is now enabled.

    Courtesy of CRANberries, there is also a diffstat report for the most recent release. More information is on the RcppZiggurat page.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Featurizing images: the shallow end of deep learning

    Thu, 09/28/2017 - 00:09

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)


    by Bob Horton and Vanja Paunic, Microsoft AI and Research Data Group

Training deep learning models from scratch requires large data sets and significant computational resources. Using pre-trained deep neural network models to extract relevant features from images allows us to build classifiers using standard machine learning approaches that work well for relatively small data sets. In this context, a deep learning solution can be thought of as incorporating layers that compute features, followed by layers that map these features to outcomes; here we’ll just map the features to outcomes ourselves.

We explore an example of using a pre-trained deep learning image classifier to generate features for use with traditional machine learning approaches to address a problem the original model was never trained on (see the blog post “Image featurization with a pre-trained deep neural network model” for other examples). This approach allows us to quickly and easily create a custom classifier for a specific specialized task, using only a relatively small training set. We use the image featurization abilities of Microsoft R Server 9.1 (MRS) to create a classifier for different types of knots in lumber. These images were made publicly available from the laboratory of Prof. Dr. Olli Silven, University of Oulu, Finland, in 1995. Please note that we are using this problem as an academic example of an image classification task with clear industrial implications, but we are not really trying to raise the bar in this well-established field.

    We characterize the performance of the machine learning model and describe how it might fit into the framework of a lumber grading system. Knowing the strengths and weaknesses of the classifier, we discuss how it could be used to triage additional image data for labeling by human experts, so that the system can be iteratively improved.

    The pre-trained deep learning models used here are optional components that can be installed alongside Microsoft R Server 9.1; directions are here.

    Lumber grading

In the sawmill industry lumber grading is an important step of the manufacturing process. Improved grading accuracy and better control of quality variation in production leads directly to improved profits. Grading has traditionally been done by visual inspection, in which a (human) grader marks each piece of lumber as it leaves the mill, according to factors like size, category, and position of knots, cracks, species of tree, etc. Visual inspection is often an error-prone and laborious task. Certain defect classes may be difficult to distinguish, even for a human expert. To that end, a number of automated lumber grading systems have been developed which aim to improve the accuracy and the efficiency of lumber grading [2-7].

    Loading and visualizing data

    Let us start by downloading the data.

DATA_DIR <- file.path(getwd(), 'data')

if(!dir.exists(DATA_DIR)){
  dir.create(DATA_DIR)
  knots_url <- 'http://www.ee.oulu.fi/research/imag/knots/KNOTS/knots.zip'
  names_url <- 'http://www.ee.oulu.fi/research/imag/knots/KNOTS/names.txt'
  download.file(knots_url, destfile = file.path(DATA_DIR, 'knots.zip'))
  download.file(names_url, destfile = file.path(DATA_DIR, 'names.txt'))
  unzip(file.path(DATA_DIR, 'knots.zip'), exdir = file.path(DATA_DIR, 'knot_images'))
}

    Let’s now load the data from the downloaded files and look at some of those files.

knot_info <- read.delim(file.path(DATA_DIR, "names.txt"), sep=" ", header=FALSE,
                        stringsAsFactors=FALSE)[1:2]
names(knot_info) <- c("file", "knot_class")
knot_info$path <- file.path(DATA_DIR, "knot_images", knot_info$file)
names(knot_info)
## [1] "file"       "knot_class" "path"

    We’ll be trying to predict knot_class, and here are the counts of the categories:

table(knot_info$knot_class)
##
## decayed_knot     dry_knot    edge_knot encased_knot    horn_knot
##           14           69           65           29           35
##    leaf_knot   sound_knot
##           47          179

    Four of these labels relate to how the knot is integrated into the structure of the surrounding wood; these are the descriptions from the README file:

    • sound: “A knot grown firmly into the surrounding wood material and does not contain any bark or signs of decay. The color may be very close to the color of sound wood.”

    • dry: “A firm or partially firm knot, and has not taken part to the vital processes of growing wood, and does not contain any bark or signs of decay. The color is usually darker than the color of sound wood, and a thin dark ring or a partial ring surrounds the knot.”

    • encased: “A knot surrounded totally or partially by a bark ring. Compared to dry knot, the ring around the knot is thicker.”

    • decayed: “A knot containing decay. Decay is difficult to recognize, as it usually affects only the strength of the knot.”

    Edge, horn and leaf knots are related to the orientation of the knot relative to the cutting plane, the position of the knot on the board relative to the edge, or both. These attributes are of different character than the ones related to structural integration. Theoretically, you could have an encased horn knot, or a decayed leaf knot, etc., though there do not happen to be any in this dataset. Unless otherwise specified, we have to assume the knot is sound, cross-cut, and not near an edge. This means that including these other labels makes the knot_class column ‘untidy’, in that it refers to more than one different kind of characteristic. We could try to split out distributed attributes for orientation and position, as well as for structural integration, but to keep this example simple we’ll just filter the data to keep only the sound, dry, encased, and decayed knots that are cross-cut and not near an edge. (It turns out that you can use featurization to fit image classification models that recognize position and orientation quite well, but we’ll leave that as an exercise for the reader.)

knot_classes_to_keep <- c("sound_knot", "dry_knot", "encased_knot", "decayed_knot")
knot_info <- knot_info[knot_info$knot_class %in% knot_classes_to_keep,]
knot_info$knot_class <- factor(knot_info$knot_class, levels=knot_classes_to_keep)

    We kept the knot_class column as character data while we were deciding which classes to keep, but now we can make that a factor.

    Here are a few examples of images from each of the four classes:

    library("pixmap") set.seed(1) samples_per_category <- 3 op <- par(mfrow=c(1, samples_per_category), oma=c(2,2,2,2), no.readonly = TRUE) for (kc in levels(knot_info$knot_class)){ kc_files <- knot_info[knot_info$knot_class == kc, "path"] kc_examples <- sample(kc_files, samples_per_category) for (kc_ex in kc_examples){ pnm <- read.pnm(kc_ex) plot(pnm, xlab=gsub(".*(knot[0-9]+).*", "\\1", kc_ex)) mtext(text=gsub("_", " ", kc), side=3, line=0, outer=TRUE, cex=1.7) } }

par(op)

Featurizing images

    Here we use the rxFeaturize function from Microsoft R Server, which allows us to perform a number of transformations on the knot images in order to produce numerical features. We first resize the images to fit the dimensions required by the pre-trained deep neural model we will use, then extract the pixels to form a numerical data set, then run that data set through a DNN pre-trained model. The result of the image featurization is a numeric vector (“feature vector”) that represents key characteristics of that image.

    Image featurization here is accomplished by using a deep neural network (DNN) model that has already been pre-trained by using millions of images. Currently, MRS supports four types of DNNs – three ResNet models (18, 50, 101)[1] and AlexNet [8].

knot_data_df <- rxFeaturize(data = knot_info,
                            mlTransforms = list(loadImage(vars = list(Image = "path")),
                                                resizeImage(vars = list(Features = "Image"),
                                                            width = 224, height = 224,
                                                            resizingOption = "IsoPad"),
                                                extractPixels(vars = "Features"),
                                                featurizeImage(var = "Features",
                                                               dnnModel = "resnet101")),
                            mlTransformVars = c("path", "knot_class"))
## Elapsed time: 00:02:59.8362547

    We have chosen the “resnet101” DNN model, which is 101 layers deep; the other ResNet options (18 or 50 layers) generate features more quickly (well under 2 minutes each for this dataset, as opposed to several minutes for ResNet-101), but we found that the features from 101 layers work better for this example.

    We have placed the features in a dataframe, which lets us use any R algorithm to build a classifier. Alternatively, we could have saved them directly to an XDF file (the native file format for Microsoft R Server, suitable for large datasets that would not fit in memory, or that you want to distribute on HDFS), or generated them dynamically when training the model (see examples in the earlier blog post).

    Since the featurization process takes a while, let’s save the results up to this point. We’ll put them in a CSV file so that, if you are so inspired, you can open it in Excel and admire the 2048 numerical features the deep neural net model has created to describe each image. Once we have the features, training models on this data will be relatively fast.

write.csv(knot_data_df, "knot_data_df.csv", row.names=FALSE)

Select training and test data

    Now that we have extracted numeric features from each image, we can use traditional machine learning algorithms to train a classifier.

in_training_set <- sample(c(TRUE, FALSE), nrow(knot_data_df), replace=TRUE, prob=c(2/3, 1/3))
in_test_set <- !in_training_set

    We put about two thirds of the examples (203) into the training set and the rest (88) in the test set.

    Fit Random Forest model

We will use the popular randomForest package, since it handles both “wide” data (with more features than cases) and multiclass outcomes.

library(randomForest)

form <- formula(paste("knot_class",
                      paste(grep("Feature", names(knot_data_df), value=TRUE), collapse=" + "),
                      sep=" ~ "))

training_set <- knot_data_df[in_training_set,]
test_set <- knot_data_df[in_test_set,]

fit_rf <- randomForest(form, training_set)

pred_rf <- as.data.frame(predict(fit_rf, test_set, type="prob"))
pred_rf$knot_class <- knot_data_df[in_test_set, "knot_class"]
pred_rf$pred_class <- predict(fit_rf, knot_data_df[in_test_set, ], type="response")

# Accuracy
with(pred_rf, sum(pred_class == knot_class)/length(knot_class))
## [1] 0.8068182

Let’s see how the classifier did for all four classes. Here is the confusion matrix, as well as a mosaic plot showing predicted class (pred_class) against the actual class (knot_class).

with(pred_rf, table(knot_class, pred_class))
##               pred_class
## knot_class     sound_knot dry_knot encased_knot decayed_knot
##   sound_knot           51        0            0            0
##   dry_knot              3       19            0            0
##   encased_knot          6        5            1            0
##   decayed_knot          2        1            0            0

mycolors <- c("yellow3", "yellow2", "thistle3", "thistle2")
mosaicplot(table(pred_rf[,c("knot_class", "pred_class")]),
           col = mycolors, las = 1, cex.axis = .55, main = NULL,
           xlab = "Actual Class", ylab = "Predicted Class")

    It looks like the classifier performs really well on the sound knots. On the other hand, all of the decayed knots in the test set are misclassified. This is a small class, so let’s look at those misclassified samples.

library(dplyr)

pred_rf$path <- test_set$path

# misclassified knots
misclassified <- pred_rf %>% filter(knot_class != pred_class)

# number of misclassified decayed knots
num_example <- sum(misclassified$knot_class == "decayed_knot")

op <- par(mfrow = c(1, num_example), mar=c(2,1,0,1), no.readonly = TRUE)
for (i in which(misclassified$knot_class == "decayed_knot")){
  example <- misclassified[i, ]
  pnm <- read.pnm(example$path)
  plot(pnm)
  mtext(text=sprintf("predicted as\n%s", example$pred_class), side=1, line=-2)
}
mtext(text='Misclassified Decayed Knots', side = 3, outer = TRUE, line = -3, cex=1.2)

    par(op)

    We were warned about the decayed knots in the README file; they are more difficult to visually classify. Interestingly, the ones classified as sound actually do appear to be well-integrated into the surrounding wood; they also happen to look somewhat rotten. Also, the decayed knot classified as dry does have a dark border, and looks like a dry knot. These knots appear to be two decayed sound knots and a decayed dry knot; in other words, maybe decay should be represented as a separate attribute that is independent of the structural integration of the knot. We may have found an issue with the labels, rather than a problem with the classifier.

    Let’s look at the performance of the classifier more closely. Using the scores returned by the classifier, we can plot an ROC curve for each of the individual classes.

library(pROC)

plot_target_roc <- function(target_class, outcome, test_data, multiclass_predictions){
  is_target <- test_data[[outcome]] == target_class
  roc_obj <- roc(is_target, multiclass_predictions[[target_class]])
  plot(roc_obj, print.auc=TRUE, main=target_class)
  text(x=0.2, y=0.2, labels=sprintf("total cases: %d", sum(is_target)), pos=3)
}

op <- par(mfrow=c(2,2), oma=c(2,2,2,2), # mar=c(0.1 + c(7, 4, 4, 2)),
          no.readonly = TRUE)
for (knot_category in levels(test_set$knot_class)){
  plot_target_roc(knot_category, "knot_class", test_set, pred_rf)
}
mtext(text="ROC curves for individual classes", side=3, line=0, outer=TRUE, cex=1.5)

    par(op)

    These ROC curves show that the classifier scores can be used to provide significant enrichment for any of the classes. With sound knots, for example, the score can unambiguously identify a large majority of the cases that are not of that class. Another way to look at this is to consider the ranges of each classification score for each actual class:

library(tidyr)
library(ggplot2)

pred_rf %>%
  select(-path, -pred_class) %>%
  gather(prediction, score, -knot_class) %>%
  ggplot(aes(x=knot_class, y=score, col=prediction)) +
  geom_boxplot()

    Note that the range of the sound knot and dry knot scores tend to be higher for all four classes. But when you consider the scores for a given prediction class across all the actual classes, the scores tend to be higher when they match than when they don’t. For example, even though the decayed knot score never goes very high, it tends to be higher for the decayed knots than for other classes. Here’s a boxplot of just the decayed_knot scores for all four actual classes:

    boxplot(decayed_knot ~ knot_class, pred_rf, xlab="actual class", ylab="decayed_knot score", main="decayed_knot score for all classes")

Even though the multi-class classifier did not correctly identify any of the three decayed knots in the test set, this does not mean that it is useless for finding decayed knots. In fact, the decayed knots had higher scores for the decayed predictor than most of the other knots did (as shown in the boxplots above), and using this score the model can correctly determine that almost 80% of the knots are not decayed. This means that this classifier could be used to screen for knots that are more likely to be decayed, or alternatively, you could use these scores to isolate a collection of cases that are unlikely to be any of the kinds of knots your classifier is good at recognizing. This ability to focus on things we are not good at might be helpful if we needed to search through a large collection of images to find more of the kinds of knots for which we need more training data. We could show those selected knots to expert human labelers, so they wouldn’t need to label all the examples in the entire database. This would help us get started with an active learning process, where we use a model to help develop a better model.
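A hedged sketch of what such a screening step could look like with the objects built above (the 0.2 cut-off is an arbitrary illustration, not a tuned threshold):

# rank test cases by their decayed_knot score and send the most suspicious ones for expert review
review_queue <- pred_rf[order(-pred_rf$decayed_knot), c("path", "knot_class", "decayed_knot")]
head(review_queue, 10)

# or triage with a simple score threshold
to_review <- subset(pred_rf, decayed_knot > 0.2)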

    Here we’ve shown an open-source random forest model trained on features that have been explicitly written to a dataframe. The model classes in the MicrosoftML package can also generate these features dynamically, so that the trained model can accept a file name as input and automatically generate the features on demand during scoring. For examples see the earlier blog post. In general, these data sets are “wide”, in that they have a large number of features; in this example, they have far more columns of features than rows of cases. This means that you need to select algorithms that can handle wide data, like elastic net regularized linear models, or, as we’ve seen here, random forests. Many of the MicrosoftML algorithms are well suited for wide datasets.
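For example, an elastic net alternative to the random forest could be sketched with the open-source glmnet package (my own illustration, not part of the post; alpha = 0.5 is an arbitrary mix of ridge and lasso):

library(glmnet)

x_train <- as.matrix(training_set[, grep("Feature", names(training_set))])
x_test  <- as.matrix(test_set[, grep("Feature", names(test_set))])

# multinomial elastic net with cross-validated lambda
fit_glmnet  <- cv.glmnet(x_train, training_set$knot_class, family = "multinomial", alpha = 0.5)
pred_glmnet <- predict(fit_glmnet, newx = x_test, s = "lambda.min", type = "class")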

    Discussion

    Image featurization allows us to make effective use of relatively small sets of labeled images that would not be sufficient to train a deep network from scratch. This is because we can re-use the lower-level features that the pre-trained model had learned for more general image classification tasks.

    This custom classifier was constructed quite quickly; although the featurizer required several minutes to run, after the features were generated all training was done on relatively small models and data sets (a few hundred cases with a few thousand features), where training a given model takes on the order of seconds, and tuning hyperparameters is relatively straightforward. Industrial applications commonly require specialized classification or scoring models that can be used to address specific technical or regulatory concerns, so having a rapid and adaptable approach to generate such models should have considerable practical utility.

    Featurization is the low-hanging fruit of transfer learning. More general transfer learning allows the specialization of deeper and deeper layers in a pre-trained general model, but the deeper you want to go, the more labelled training data you will generally need. We often find ourselves awash in data, but limited by the availability of quality labels. Having a partially effective classifier can let you bootstrap an active learning approach, where a crude model is used to triage images, classifying the obvious cases directly and referring only those with uncertain scores to expert labelers. The larger set of labeled images is then used to build a better model, which can do more effective triage, and so on, leading to iterative model improvement. Companies like CrowdFlower use these kinds of approaches to optimize data labeling and model building, but there are many use cases where general crowdsourcing may not be adequate (such as when the labeling must be done by experts), so having a straightforward way to bootstrap that process could be quite useful.

    The labels we have may not be tidy, that is, they may not refer to distinct characteristics. In this case, the decayed knot does not seem to really be an alternative to “encased”, “dry”, or “sound” knots; rather, it seems that any of these categories of knots might possibly be decayed. In many applications, it is not always obvious what a properly distributed outcome labeling should include. This is one reason that a quick and easy approach is valuable; it can help you to clarify these questions with iterative solutions that can help frame the discussion with domain experts. In a subsequent iteration we may want to consider “decayed” as a separate attribute from “sound”, “dry” or “encased”.

Image featurization gives us a simple way to build domain-specific image classifiers using relatively small training sets. In the common use case where data is plentiful but good labels are not, this provides a rapid and straightforward first step toward getting the labeled cases we need to build more sophisticated models; with a few iterations, we might even learn to recognize decayed knots. Seriously, wood that knot be cool?

    References
1. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. (2016) Deep Residual Learning for Image Recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778

    2. Kauppinen H. and Silven O.: A Color Vision Approach for Grading Lumber. In Theory & Applications of Image Processing II – Selected papers from the 9th Scandinavian Conference on Image Analysis (ed. G. Borgefors), World Scientific, pp. 367-379 1995.

    3. Silven O. and Kauppinen H.: Recent Developments in Wood Inspection. (1996) International Journal on Pattern Recognition and Artificial Intelligence, IJPRAI, pp. 83-95, 1996.

    4. Rinnhofer, Alfred & Jakob, Gerhard & Deutschl, Edwin & Benesova, Wanda & Andreu, Jean-Philippe & Parziale, Geppy & Niel, Albert. (2006). A multi-sensor system for texture based high-speed hardwood lumber inspection. Proceedings of SPIE – The International Society for Optical Engineering. 5672. 34-43. 10.1117/12.588199.

    5. Irene Yu-Hua Gu, Henrik Andersson, Raul Vicen. (2010) Wood defect classification based on image analysis and support vector machines. Wood Science and Technology 44:4, 693-704. Online publication date: 1-Nov-2010.

    6. Kauppinen H, Silven O & Piirainen T. (1999) Self-organizing map based user interface for visual surface inspection. Proc. 11th Scandinavian Conference on Image Analysis (SCIA’99), June 7-11, Kangerlussuaq, Greenland, 801-808.

    7. Kauppinen H, Rautio H & Silven O (1999). Nonsegmenting defect detection and SOM-based classification for surface inspection using color vision. Proc. EUROPTO Conf. on Polarization and Color Techniques in Industrial Inspection (SPIE Vol. 3826), June 17-18, Munich, Germany, 270-280.

    8. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. (2012) ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    New Course! Supervised Learning in R: Classification

    Wed, 09/27/2017 - 16:08

    (This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Hi there! We're proud to launch our latest R & machine learning course, Supervised Learning in R: Classification, by Brett Lantz!

    This beginner-level introduction to machine learning covers four of the most common classification algorithms. You will come away with a basic understanding of how each algorithm approaches a learning task, as well as learn the R functions needed to apply these tools to your own work.

    Take me to chapter 1!

    Supervised Learning in R: Classification features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you an expert in machine learning in R!

    What you’ll learn:

    Chapter 1: k-Nearest Neighbors (kNN)

    This chapter will introduce classification while working through the application of kNN to self-driving vehicles.

    Chapter 2: Naive Bayes

    Naive Bayes uses principles from the field of statistics to make predictions. This chapter will introduce the basics of Bayesian methods.

    Chapter 3: Logistic Regression

Logistic regression involves fitting a curve to numeric data to make predictions about binary events.

    Chapter 4: Classification Trees

Classification trees use flowchart-like structures to make decisions. Because humans can readily understand these tree structures, classification trees are useful when transparency is needed.
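To give a flavour of the kind of code involved, a minimal classification tree on R's built-in iris data (my own example, not course material) looks like this:

library(rpart)

tree_fit <- rpart(Species ~ ., data = iris, method = "class")  # fit a classification tree
predict(tree_fit, head(iris), type = "class")                  # predicted class labels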

    Start your path to mastering ML in R with Supervised Learning in R: Classification!


    To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Oneway ANOVA Explanation and Example in R; Part 1

    Wed, 09/27/2017 - 15:00

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

This tutorial was inspired by this post published at DataScience+ by Bidyut Ghosh. Special thanks also to Dani Navarro, The University of New South Wales (Sydney), for the book Learning Statistics with R (hereafter simply LSR) and the lsr package available through CRAN. I highly recommend it.

Let’s load the required R packages

library(ggplot2)
library(lsr)
library(psych)
library(car)
library(tidyverse)
library(dunn.test)
library(BayesFactor)
library(scales)
library(knitr)
library(kableExtra)
options(width = 130)
options(knitr.table.format = "html")

The Oneway Analysis of Variance (ANOVA)

    The Oneway ANOVA is a statistical technique that allows us to compare mean differences of one outcome (dependent) variable across two or more groups (levels) of one independent variable (factor). If there are only two levels (e.g. Male/Female) of the independent (predictor) variable the results are analogous to Student’s t-test. It is also true that ANOVA is a special case of the GLM or regression models so as the number of levels increases it might make more sense to try one of those approaches. ANOVA also allows for comparisons of mean differences across multiple factors (Factorial or Nway ANOVA) which we won’t address here.
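To make the GLM connection concrete, here is a tiny hedged sketch (simulated data, not the tyre data used below) showing that aov() and lm() fit the same model:

set.seed(42)
toy <- data.frame(g = gl(3, 20, labels = c("a", "b", "c")),
                  y = rnorm(60, mean = rep(c(10, 12, 11), each = 20)))
summary(aov(y ~ g, data = toy))  # classic ANOVA table
anova(lm(y ~ g, data = toy))     # the identical F test from the regression framing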

    Our scenario and data

    Professor Ghosh’s original scenario can be summarized this way. Imagine that you are interested in understanding whether knowing the brand of car tire can help you predict whether you will get more or less mileage before you need to replace them. We’ll draw what is hopefully a random sample of 60 tires from four different manufacturers and use the mean mileage by brand to help inform our thinking. While we expect variation across our sample we’re interested in whether the differences between the tire brands (the groups) is significantly different than what we would expect in random variation within the groups.

    Our research or testable hypothesis is then described \[\mu_{Apollo} \ne \mu_{Bridgestone} \ne \mu_{CEAT} \ne \mu_{Falken}\] as at least one of the tire brand populations is different than the other three. Our null is basically “nope, brand doesn’t matter in predicting tire mileage – all brands are the same”.
    He provides the following data set with 60 observations. I’ve chosen to download directly but you could of course park the file locally and work from there.

Column    Contains                  Type
Brands    What brand tyre           factor
Mileage   Tyre life in thousands    num

tyre <- read.csv("https://datascienceplus.com/wp-content/uploads/2017/08/tyre.csv")
# tyre <- read.csv("tyre.csv") # if you have it on your local hard drive
str(tyre)
## 'data.frame': 60 obs. of 2 variables:
## $ Brands : Factor w/ 4 levels "Apollo","Bridgestone",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Mileage: num 33 36.4 32.8 37.6 36.3 ...
summary(tyre)
##         Brands      Mileage
##  Apollo     :15   Min.   :27.88
##  Bridgestone:15   1st Qu.:32.69
##  CEAT       :15   Median :34.84
##  Falken     :15   Mean   :34.74
##                   3rd Qu.:36.77
##                   Max.   :41.05
head(tyre)
##   Brands Mileage
## 1 Apollo  32.998
## 2 Apollo  36.435
## 3 Apollo  32.777
## 4 Apollo  37.637
## 5 Apollo  36.304
## 6 Apollo  35.915

View(tyre) # if you use RStudio this is a nice way to see the data in spreadsheet format
    The data set contains what we expected. The dependent variable Mileage is numeric and the independent variable Brand is of type factor. R is usually adept at coercing a chr string or an integer as the independent variable but I find it best to explicitly make it a factor when you’re working on ANOVAs.

    Let’s graph and describe the basics. First a simple boxplot of all 60 data points along with a summary using the describe command from the package psych. Then in reverse order lets describe describeBy and boxplot breaking it down by group (in our case tire brand).

boxplot(tyre$Mileage, horizontal = TRUE,
        main="Mileage distribution across all brands", col = "blue")

describe(tyre) # the * behind Brands reminds us it's a factor and some of the numbers are nonsensical
##         vars  n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
## Brands*    1 60  2.50 1.13   2.50    2.50 1.48  1.00  4.00  3.00  0.00    -1.41 0.15
## Mileage    2 60 34.74 2.98  34.84   34.76 3.09 27.88 41.05 13.17 -0.11    -0.44 0.38

describeBy(tyre$Mileage, group = tyre$Brand, mat = TRUE, digits = 2)
##     item      group1 vars  n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
## X11    1      Apollo    1 15 34.80 2.22  34.84   34.85 2.37 30.62 38.33  7.71 -0.18    -1.24 0.57
## X12    2 Bridgestone    1 15 31.78 2.20  32.00   31.83 1.65 27.88 35.01  7.13 -0.29    -1.05 0.57
## X13    3        CEAT    1 15 34.76 2.53  34.78   34.61 2.03 30.43 41.05 10.62  0.64     0.33 0.65
## X14    4      Falken    1 15 37.62 1.70  37.38   37.65 1.18 34.31 40.66  6.35  0.13    -0.69 0.44

boxplot(tyre$Mileage~tyre$Brands,
        main="Boxplot comparing Mileage of Four Brands of Tyre",
        col= rainbow(4), horizontal = TRUE)

Let’s format the table describeBy generates to make it a little nicer, using kable and the kableExtra package. Luckily describeBy generates a dataframe when mat=TRUE, and after that we can select which columns to publish (dropping some of the less used) as well as change the column labels as desired.

describeBy(tyre$Mileage, group = tyre$Brand, mat = TRUE) %>% # create dataframe
  select(Brand=group1, N=n, Mean=mean, SD=sd, Median=median, Min=min, Max=max,
         Skew=skew, Kurtosis=kurtosis, SEM=se) %>%
  kable(align=c("lrrrrrrrr"), digits=2, row.names = FALSE,
        caption="Tire Mileage Brand Descriptive Statistics") %>%
  kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)

Tire Mileage Brand Descriptive Statistics

Brand         N   Mean    SD  Median    Min    Max   Skew  Kurtosis   SEM
Apollo       15  34.80  2.22   34.84  30.62  38.33  -0.18     -1.24  0.57
Bridgestone  15  31.78  2.20   32.00  27.88  35.01  -0.29     -1.05  0.57
CEAT         15  34.76  2.53   34.78  30.43  41.05   0.64      0.33  0.65
Falken       15  37.62  1.70   37.38  34.31  40.66   0.13     -0.69  0.44

    Certainly much nicer looking and I only scratched the surface of the options available. We can certainly look at the numbers and learn a lot. But let’s see if we can also improve our plotting to be more informative.
    The more I use ggplot2 the more I love the ability to use it to customize the presentation of the data to optimize understanding! The next plot might be accused of being a little “busy” but essentially answers our Oneway ANOVA question in one picture (note that I have stayed with the original decision to set \(\alpha\) = 0.01 significance level (99% confidence intervals)).

ggplot(tyre, aes(reorder(Brands, Mileage), Mileage, fill=Brands)) +
# ggplot(tyre, aes(Brands, Mileage, fill=Brands)) + # if you want to leave them alphabetic
  geom_jitter(colour = "dark gray", width=.1) +
  stat_boxplot(geom ='errorbar', width = 0.4) +
  geom_boxplot() +
  labs(title="Boxplot, dotplot and SEM plot of mileage for four brands of tyres",
       x = "Brands (sorted)",
       y = "Mileage (in thousands)",
       subtitle ="Gray dots=sample data points, Black dot=outlier, Blue dot=mean, Red=99% confidence interval",
       caption = "Data from https://datascienceplus.com/one-way-anova-in-r/") +
  guides(fill=FALSE) +
  stat_summary(fun.data = "mean_cl_normal", colour = "red", size = 1.5,
               fun.args = list(conf.int=.99)) +
  stat_summary(geom="point", fun.y=mean, color="blue") +
  theme_bw()

    By simple visual inspection it certainly appears that we have evidence of the effect of tire brand on mileage. There is one outlier for the CEAT brand but little cause for concern. Means and medians are close together so no major concerns about skewness. Different brands have differing amounts of variability but nothing shocking visually.

    Oneway ANOVA Test & Results

So the heart of this post is to actually execute the Oneway ANOVA in R. There are several ways to do so, but let’s start with the simplest: aov from base R. While it’s possible to wrap the command in a summary or print statement, I recommend you always save the results out to an R object, in this case tyres.aov. It’s almost inevitable that further analysis will ensue, and the tyres.aov object has a wealth of useful information. If you’re new to R, a couple of quick notes: the dependent variable goes to the left of the tilde and our independent or predictor variable to the right. aov is not limited to Oneway ANOVA, so adding additional factors is possible.
    As I mentioned earlier ANOVA is a specialized case of the GLM and therefore the list object returned tyres.aov is actually of both aov and lm class. The names command will give you some sense of all the information contained in the list object. We’ll access some of this later as we continue to analyze our data. The summary command gives us the key ANOVA data we need and produces a classic ANOVA table. If you’re unfamiliar with them and want to know more especially where the numbers come from I recommend a good introductory stats text. As noted earlier I recommend Learning Statistics with R LSR see Table 14-1 on page 432.

tyres.aov <- aov(Mileage ~ Brands, tyre)
class(tyres.aov)
## [1] "aov" "lm"
typeof(tyres.aov)
## [1] "list"
names(tyres.aov)
##  [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"
##  [8] "df.residual"   "contrasts"     "xlevels"       "call"          "terms"         "model"
summary(tyres.aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## Brands       3  256.3   85.43   17.94 2.78e-08 ***
## Residuals   56  266.6    4.76
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    We can reject the null hypothesis at the \(\alpha\) = 0.01 significance level (99% confidence). The F statistic is calculated as \[F = \frac{MS_{between}}{MS_{within}}\] and the table gives us the precise p value and the common asterisks to show “success”.
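Plugging in the mean squares from the table above shows where the reported value comes from: \[F = \frac{MS_{between}}{MS_{within}} = \frac{85.43}{4.76} \approx 17.9,\] which matches the table's F value of 17.94 (the small difference is just rounding in the displayed mean squares).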
In published results format that probably looks like "a Oneway ANOVA showed a significant effect for brand on tire mileage, F(3,56)=17.94, p<.01". In other words, we can reject the null hypothesis that these data came from tire brand populations where the average mileage life was the same! Stated as a prediction, brand helps predict mileage life.
    That’s exciting news, but leaves us with some other unanswered questions.
The data provide support for the hypothesis that the means aren't all equal – that's called the omnibus test. We have support for rejecting \[\mu_{Apollo} = \mu_{Bridgestone} = \mu_{CEAT} = \mu_{Falken}\] but at this point we can't state with any authority which specific pairs are different; all we can say is that at least one is different! Looking at the graph we made earlier we can make a good guess, but let's do better than that. How can we use confidence intervals to help us understand whether the data are indicating simple random variation or whether the underlying populations are different? We just need to compute the confidence interval for each brand's mean and then see which brand means lie inside or outside the confidence interval of the others. We would expect that if we ran our experiment 100 times with our sample size numbers for each brand, the mileage mean would lie inside the upper and lower limits of our confidence interval 99 times (with \(\alpha\) = 0.01) out of those 100 times. If our data show a mean outside the confidence interval, that is evidence of a statistically significant difference for that specific pairing.
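Purely as a sketch of that idea (dplyr is an assumption here; it isn't used elsewhere in this post), the per-brand 99% t-based intervals can be computed directly:

library(dplyr)

tyre %>%
  group_by(Brands) %>%
  summarise(N     = n(),
            Mean  = mean(Mileage),
            # 99% two-sided interval, so use the 0.995 quantile of the t distribution
            Lower = mean(Mileage) - qt(0.995, df = n() - 1) * sd(Mileage) / sqrt(n()),
            Upper = mean(Mileage) + qt(0.995, df = n() - 1) * sd(Mileage) / sqrt(n()))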
But we don't have to rely on our graph or on eyeballing overlapping intervals; we can be more precise and test it in a very controlled fashion.

    A priori and post hoc comparisons

We could just take mileage and brands and run all the possible t tests. There would be 6 of them: Apollo -v- Bridgestone, Apollo -v- CEAT, Apollo -v- Falken, Bridgestone -v- CEAT, Bridgestone -v- Falken, and CEAT -v- Falken. Base R provides pairwise.t.test; feed it the data and it rapidly makes all the relevant comparisons. lsr provides a helper function, posthocPairwiseT, that lets you simply feed it the aov object and do the same.
The "answers" appear to support our read of the graph. All of the possible pairs seem to be different other than Apollo -v- CEAT, which is what the graph shows. The significance levels R spits out are all much smaller than p<.01. Break out the champagne and start the victory dance.

pairwise.t.test(tyre$Mileage, tyre$Brands, p.adjust.method = "none")
##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  tyre$Mileage and tyre$Brands
##
##             Apollo  Bridgestone CEAT
## Bridgestone 0.00037 -           -
## CEAT        0.96221 0.00043     -
## Falken      0.00080 9.7e-10     0.00069
##
## P value adjustment method: none

# unfortunately pairwise.t.test doesn't accept formula style or an aov object
# lsr library to the rescue
posthocPairwiseT(tyres.aov, p.adjust.method = "none") # equivalent, just easier to use the aov object
##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  Mileage and Brands
##
##             Apollo  Bridgestone CEAT
## Bridgestone 0.00037 -           -
## CEAT        0.96221 0.00043     -
## Falken      0.00080 9.7e-10     0.00069
##
## P value adjustment method: none

But that would be wrong and here's why. Assuming we want to have 99% confidence again, across all six unique pairings, we are "cheating" if we don't adjust the rejection region (and our confidence intervals) and just run the test six times. It's analogous to rolling the die six times instead of once. The more simultaneous tests we run, the more likely we are to find a difference even though none exists. We need to adjust our thinking and our confidence to account for the fact that we are making multiple comparisons (a.k.a. simultaneous comparisons). Our confidence intervals must be made wider (more conservative) to compensate. Thank goodness the tools exist to do this for us. As a matter of fact there is no one single way to make the adjustment… there are many.
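To see why this matters, a quick back-of-the-envelope calculation: if each of the six tests is run at \(\alpha\) = 0.01 and, purely for illustration, we treat them as independent, the chance of at least one false positive somewhere in the family is \[1 - (1 - 0.01)^6 \approx 0.059,\] almost six times the error rate we thought we were buying.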
One starting position is that it makes a difference whether you have specified (hypothesized) some specific relationships a priori (in advance) or whether you're exploring post hoc (after the fact, also called "fishing"). The traditional position is that a priori comparisons grant you more latitude and less need to be conservative. The only thing that is certain is that some adjustment is necessary. In his original post Professor Ghosh applied one of the classical choices for making an adjustment, Tukey's Honestly Significant Difference (HSD), https://en.wikipedia.org/wiki/Tukey%27s_range_test. Let's reproduce his work first as two tables at two confidence levels.

TukeyHSD(tyres.aov, conf.level = 0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##
## Fit: aov(formula = Mileage ~ Brands, data = tyre)
##
## $Brands
##                           diff        lwr       upr     p adj
## Bridgestone-Apollo -3.01900000 -5.1288190 -0.909181 0.0020527
## CEAT-Apollo        -0.03792661 -2.1477456  2.071892 0.9999608
## Falken-Apollo       2.82553333  0.7157143  4.935352 0.0043198
## CEAT-Bridgestone    2.98107339  0.8712544  5.090892 0.0023806
## Falken-Bridgestone  5.84453333  3.7347143  7.954352 0.0000000
## Falken-CEAT         2.86345994  0.7536409  4.973279 0.0037424

TukeyHSD(tyres.aov, conf.level = 0.99)
##   Tukey multiple comparisons of means
##     99% family-wise confidence level
##
## Fit: aov(formula = Mileage ~ Brands, data = tyre)
##
## $Brands
##                           diff        lwr        upr     p adj
## Bridgestone-Apollo -3.01900000 -5.6155816 -0.4224184 0.0020527
## CEAT-Apollo        -0.03792661 -2.6345082  2.5586550 0.9999608
## Falken-Apollo       2.82553333  0.2289517  5.4221149 0.0043198
## CEAT-Bridgestone    2.98107339  0.3844918  5.5776550 0.0023806
## Falken-Bridgestone  5.84453333  3.2479517  8.4411149 0.0000000
## Falken-CEAT         2.86345994  0.2668783  5.4600415 0.0037424

A lot of output there but not too difficult to understand. We can see the 6 pairings we have been tracking listed in the first column. The diff column is the difference between the means of the two brands listed; so the mean for Bridgestone is about 3,019 miles less than the mean for Apollo (remember the data are in thousands of miles). The lwr and upr columns show the lower and upper CI limits. Notice they change between the two different confidence levels we've run, whereas the mean difference and exact p value do not. So the good news here is that even with our more conservative Tukey HSD test we have empirical support for 5 out of the 6 possible differences.

    Now let’s graph just the .99 CI version.

par()$oma # current margins
## [1] 0 0 0 0
par(oma = c(0, 5, 0, 0)) # adjust the margins because the factor names are long
plot(TukeyHSD(tyres.aov, conf.level = 0.99), las = 1, col = "red")
par(oma = c(0, 0, 0, 0)) # put the margins back

If you're a visual learner, as I am, this helps. We're looking at the differences in means amongst the pairs of brands. Zero on the x axis means no difference at all, and the red horizontals denote the 99% confidence intervals; the only interval that crosses zero is CEAT-Apollo, the one pairing we can't call different.
Finally, as I mentioned earlier, there are many different ways (tests) for adjusting. Tukey HSD is very common and is easy to access and graph. But two others worth noting are the Bonferroni and its successor the Holm. Let's go back to our earlier use of the pairwise.t.test. We'll use it again (as well as the lsr wrapper function posthocPairwiseT). You can use the built-in R help for p.adjust to see all the methods available. I recommend holm as a general position but know your options.
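For the curious, base R keeps the list of supported adjustment methods in the p.adjust.methods vector:

p.adjust.methods
## [1] "holm" "hochberg" "hommel" "bonferroni" "BH" "BY" "fdr" "none"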

pairwise.t.test(tyre$Mileage, tyre$Brands, p.adjust.method = "bonferroni")
##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  tyre$Mileage and tyre$Brands
##
##             Apollo Bridgestone CEAT
## Bridgestone 0.0022 -           -
## CEAT        1.0000 0.0026      -
## Falken      0.0048 5.8e-09     0.0041
##
## P value adjustment method: bonferroni

pairwise.t.test(tyre$Mileage, tyre$Brands, p.adjust.method = "holm")
##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  tyre$Mileage and tyre$Brands
##
##             Apollo Bridgestone CEAT
## Bridgestone 0.0019 -           -
## CEAT        0.9622 0.0019      -
## Falken      0.0021 5.8e-09     0.0021
##
## P value adjustment method: holm

posthocPairwiseT(tyres.aov) # default is Holm
##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  Mileage and Brands
##
##             Apollo Bridgestone CEAT
## Bridgestone 0.0019 -           -
## CEAT        0.9622 0.0019      -
## Falken      0.0021 5.8e-09     0.0021
##
## P value adjustment method: holm

Happily, given our data, we get the same overall answer with very slightly different numbers. As it turns out we have very strong effect sizes and the tests don't change our overall answers. Wait, what's an effect size, you ask? That's our next question, which we will cover in the second part.

      Related Post

      1. One-way ANOVA in R
      2. Cubic and Smoothing Splines in R
      3. Chi-Squared Test – The Purpose, The Math, When and How to Implement?
      4. Missing Value Treatment
      5. R for Publication by Page Piccinini

      To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      Churn Prediction with Automatic ML

      Wed, 09/27/2017 - 10:01

      (This article was first published on MLJAR Blog Category: R, and kindly contributed to R-bloggers)

Sometimes we don't even realize how common machine learning (ML) is in our daily lives. Various "intelligent" algorithms help us find the most relevant facts (Google), suggest what movie to watch (Netflix), or influence our shopping decisions (Amazon). The biggest international companies quickly recognized the potential of machine learning and turned it into business solutions.

Nowadays it's not only big companies that are able to use ML. Imagine a not-so-abstract situation in which a company tries to predict customer behavior based on some personal data. Just a few years ago, the best strategy to solve this problem would have been to hire a good data science team. Now, thanks to ML's growing popularity, it is within reach even for small start-ups.

Today, I would like to present a demo of how to solve a difficult business problem with ML. We will take advantage of the mljar.com service and its R API. With just a few lines of code we will be able to achieve very good results.

A typical problem for companies operating on a contractual basis (like Internet or phone providers) is whether a customer will decide to stay for the next period or churn. It is crucial for a company to focus on customers who are at risk of churning in order to prevent it.

In this example we will help a telecom company predict which customers are likely to renew their contract and which are not. We will base the model on data from the past. The dataset is publicly available and can be downloaded, for example, here: https://github.com/WLOGSolutions/telco-customer-churn-in-r-and-h2o/tree/master/data.

      It contains descriptors, which can be divided into three different categories:

• demographic data: gender, age, education, etc.
      • account information: number of complaints, average call duration, etc.
• services used: Internet, voice, etc.

Finally, we also have a column with two labels, churn and no churn, which is the target we want to predict.

Now that we have the problem set up and understand our data, we can move on to the code. mljar.com has both an R and a Python API, but this time we focus on the former.

First of all, we need to load the necessary libraries.

If you are missing one of them, you can always install it with the install.packages command.
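A minimal sketch of that setup, assuming the two packages used later in this walkthrough (mljar for the API calls and pROC for the threshold search):

# install.packages(c("mljar", "pROC"))  # uncomment if they are not installed yet
library(mljar)
library(pROC)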

The following lines read and clean the dataset. In three steps we get rid of irrelevant columns (time), keep only complete records, and remove duplicated rows.
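A sketch of those three steps; the file name and the exact column to drop are assumptions, so adjust them to the downloaded dataset:

churn <- read.csv("churn_data.csv", stringsAsFactors = FALSE)

churn <- churn[, setdiff(names(churn), "time")]  # 1. drop the irrelevant time column
churn <- churn[complete.cases(churn), ]          # 2. keep only complete records
churn <- churn[!duplicated(churn), ]             # 3. remove duplicated rows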

It is important to validate our final ML model before publishing, so we split the churn data into training and test sets in a 7:3 proportion.
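One simple way to do the 7:3 split in base R (the seed value is an arbitrary choice for reproducibility):

set.seed(123)
train_idx <- sample(seq_len(nrow(churn)), size = floor(0.7 * nrow(churn)))

train <- churn[train_idx, ]   # 70% for training
test  <- churn[-train_idx, ]  # 30% held out for testing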

      For convenience, let’s assign features and labels to separate variables:
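For example, assuming the label column is called churn (the actual column name in the dataset may differ):

x_train <- train[, setdiff(names(train), "churn")]
y_train <- train$churn
x_test  <- test[, setdiff(names(test), "churn")]
y_test  <- test$churn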

      And that’s it! We are ready to use mljar to train our ML models.

(Before that, make sure that you have a mljar account and that your mljar token is set. For more information take a look at: https://github.com/mljar/mljar-api-R/.)
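Setting the token via an environment variable is the approach described in the mljar-api-R README; treat the variable name below as an assumption and check the README if in doubt:

# the MLJAR_TOKEN variable name follows the mljar-api-R README
Sys.setenv(MLJAR_TOKEN = "your_mljar_token_here")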

The main advantage of mljar is that it frees the user from thinking about model selection. We can use various different models and compare their performance. In this case, as a performance measure we choose Area Under Curve (https://en.wikipedia.org/wiki/Receiver_operating_characteristic).

We use the following ML algorithms: logistic regression, xgboost, light gradient boosting, extra trees, regularized greedy forest and k-nearest neighbors. Moreover, in R, training all these models is as simple as this:
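A sketch of that training call; the argument names and algorithm codes below are my reading of the mljar-api-R documentation rather than the original post's code, so verify them against the package help:

# argument names and algorithm codes are assumptions -- check ?mljar_fit
model <- mljar_fit(x_train, y_train,
                   proj_title = "Churn",                 # project name shown in the web UI
                   exp_title  = "churn prediction demo",
                   algorithms = c("logreg", "xgb", "lgb", # logistic regression, xgboost, lightgbm
                                  "etc", "rgfc", "knn"),  # extra trees, regularized greedy forest, k-NN
                   metric     = "auc")                    # Area Under Curve, as chosen above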

After running this, when you log in to the mljar.com web service and select the "Churn" project, you should see a training data preview with some basic statistics. It is very helpful when you are considering which features to use. In this example we decided to take advantage of all of them, but that's not always the best strategy.

With a little bit of patience (how much depends on your mljar subscription), we get our final model with the best results. All models are also listed at mljar.com in the Results card.

The mljar_fit function returns the best model, which in this case is the ensemble average (ensembleavg).

Ensemble average means that the final model is a combination of various models, which together usually perform better than any individual model.

Another useful option of the mljar web service is hidden in the "Feature Analysis" card. If you select a model there, you will see a graphical illustration of feature importance. For instance, for the extra trees algorithm we have:

The graph leads to the conclusion that age, unpaid invoice balance and monthly billed amounts are the most important customer descriptors, whereas the number of calls or the use of some extra services has almost no impact on churning.

To predict labels on the test set, we use the mljar_predict command.
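A sketch of the prediction step, again with the exact signature taken as an assumption from the mljar-api-R README (model, new data, project title):

y_pr <- mljar_predict(model, x_test, "Churn")  # predicted churn probabilities for the test set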

Note that the variable y_pr contains probabilities for our churn/no-churn labels. With the help of the pROC package we can easily find out which threshold is best in our case:
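One possible way to do that with pROC, assuming y_pr holds a single column of churn probabilities and y_test the true labels:

library(pROC)

roc_obj        <- roc(response = y_test, predictor = y_pr[[1]])
best_threshold <- coords(roc_obj, "best", ret = "threshold")[[1]]  # Youden-optimal cut-off
best_threshold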

In the data frame y_prc our classification results are encoded as logical values, so let's transform them into a more readable form:
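For instance (the object and column names here are assumptions following the text above):

y_prc    <- y_pr[[1]] > best_threshold         # logical: TRUE means predicted churn
y_labels <- ifelse(y_prc, "churn", "no churn") # human-readable labels
head(y_labels)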

      Finally, we can evaluate our model’s prediction with different measures:
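A simple evaluation sketch: a confusion matrix plus accuracy, assuming y_test uses the same "churn" / "no churn" labels as y_labels:

conf_mat <- table(Predicted = y_labels, Actual = y_test)
conf_mat

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)  # proportion of correct predictions
accuracy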

A similar analysis has been performed using another R library for machine learning, H2O (see https://github.com/WLOGSolutions/telco-customer-churn-in-r-and-h2o). You can see a comparison of results in the table below:

      I hope that I convinced you that some complex business problems can be solved easily with machine-learning algorithms using tools like mljar.

(Worth knowing: you can repeat all the steps of the above analysis with no coding at all, using only the mljar web service. Try it out at mljar.com. If you would like to run the code, it is here.)

      For more interesting R resources please check r-bloggers.


      To leave a comment for the author, please follow the link and comment on their blog: MLJAR Blog Category: R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
