## Error message

• Deprecated function: ini_set(): Use of mbstring.http_input is deprecated in include_once() (line 654 of /home/spatiala/public_html/geostat-course.org/sites/default/settings.php).
• Deprecated function: ini_set(): Use of mbstring.http_output is deprecated in include_once() (line 655 of /home/spatiala/public_html/geostat-course.org/sites/default/settings.php).
R news and tutorials contributed by hundreds of R bloggers
Updated: 9 hours 3 min ago

### EARL London 2017 – That’s a wrap!

Wed, 09/20/2017 - 18:04

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Beth Ashlee, Data Scientist

After a successful first-ever EARL San Francisco in June, it was time to head back to the birth place EARL – London. With more abstracts submitted than ever before, the conference was made up of 54 fantastic talks and 5 key notes from an impressive selection of industries. With so many talks to pick from we thought we would summarise a few of my favourites!

Day 1 highlights:

After brilliant keynotes from Tom Smith (ONS) and Rstudio’s Jenny Bryan in session 1, Derek Norton and Neera Talbert from Microsoft took us through the Microsoft process of moving a company from SAS to R in session 2. They explained that with the aim of shrinking the ‘SAS footprint’, it’s important to think about the drivers behind a company leaving SAS as well as considering the impact to end users. Their approach focused on converting program logic rather than specific code.

After lunch, Luisa Pires discussed the digital change occurring within the Bank of England. She highlighted the key selling points behind choosing R as a platform and the process behind organizing their journey. They first ran divisional R training, before progressing through to produce a data science training programme to enable the use of R as a project tool.

Finishing up the day, Jessica Peterka-Bonetta gave a fascinating talk on sentiment analysis when considering the use of emojis. She demonstrated that even though adding emojis into your sentiment analysis can add complexity to your process, in the right context they can add real value to tracking the sentiment of a trend. It was an engaging talk which prompted some interesting audience questions such as – “What about combinations of emojis; how would they effect sentiment?”.

After a Pimms reception it was all aboard the Symphony Cruiser for a tour of the River Thames. On board we enjoyed food, drinks and live music (which resulted in some impromptu dancing, but what happens at a Conference, stays at the Conference!).

Day 2 highlights:

Despite a ‘lively’ evening, the EARL kicked off in full swing on Thursday morning. There were three fantastic keynotes, including Hilary Parker – her talk filled with analogies and movie references to describe her reproducible work flow methods – definitely something I could relate to!

The morning session included a talk from Mike Smith from Pfizer. Mike showed us his use of Shiny as a tool for determining wait times when submitting large jobs to Pfizer’s high performance compute grid. Mike used real time results to visualise whether it was beneficial to submit a large job at the current time or wait until later. He outlined some of the frustrations of changing data sources in a such a large company and his reluctance to admit he was ‘data science-ing’.

After lunch, my colleague Adnan Fiaz gave an interesting talk on the data pipeline, comparing it to the process involved in oil pipelines. He spoke of the methods and actions that must be taken before being able to process your data and generate results. The comparison showed surprising similarities between the two processes, and clarified the method that we must take at Mango to ensure we can safely, efficiently and aptly analyse data.

The final session of day 2 finished on a high, with a packed room gathering to hear from RStudio’s Joe Cheng. Joe highlighted the exciting new asynchronous programming feature that will be available in the next release of Shiny. This new feature is set to revolutionise the responsiveness of Shiny applications, allowing users to overcome the restrictions of R’s single threaded natured.

I get so much out of EARL each year and this year was no different; I just wish I had a timeturner to get to all the presentations!

On behalf of the rest of the Mango team, a massive thank you to all our speakers and attendees, we hope you enjoyed EARL as much as we did!

All slides we have permission to publish are available under the Speakers section on the EARL conference website.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Preview: ALTREP promises to bring major performance improvements to R

Wed, 09/20/2017 - 16:55

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Changes are coming to the internals of the R engine which promise to improve performance and reduce memory use, with dramatic impacts in some circumstances. The changes were first proposed by Gabe Becker at the DSC Conference in 2016 (and updated in 2017), and the implementation by Luke Tierney and Gabe Becker is now making its way into the development branches of the R sources.

ALTREP is an alternative way of representing R objects internally. (The details are described in this draft vignette, and there are a few more examples in this presentation.) It's easiest to illustrate by example.

Today, a vector of numbers in R is represented as a contiguous block of memory in RAM. In other words, if you create the sequence of a million integers 1:1000000, R creates a block of memory 4Mb in size (4 bytes per number) to store each number in the sequence. But what if we could use a more efficient representation just for sequences like this? WIth ALTREP, a sequence like this is instead represented by just its start and end values, which takes up almost no memory at all. That means, in the future you'll be able to write loops like this:

for (i in 1:1e10) do_something()

without getting errors like this:

Error: cannot allocate vector of size 74.5 Gb

ALTREP has the potential to make many other operations faster or even instantaneous, even on very large objects. Here are a few examples of functions that could be sped up:

• is.na(x) — ALTREP will keep track of whether a vector includes NAs or not, so that R no longer has to inspect the entire vector
• sort(x) — ALTREP will keep track of whether a vector is already sorted, and sorting will be instantaneous in this case
• x < 5 — knowing that the vector is already sorted, ALTREP can find the breakpoint very quickly (in O(log n) time), and return a "sequence" logical vector that consumes basically no memory
• match(y,x) — if ALTREP knows that x is already sorted, matching is also much faster
• as.character(numeric_vector) — ALTREP can defer converting a numeric vector to characters until the character representation is actually needed

That last benefit will likely have a large impact on the handling of data frames, which carry around a column of character row labels which start out as a numeric sequence. Development builds are already demonstrating a huge performance gain in the linear regression function lm() as a result:

> n <- 10000000 > x <- rnorm(n) > y <- rnorm(n) # With R 3.4 > system.time(lm(y ~ x)) user system elapsed 9.225 0.703 9.929 # With ALTREP > system.time(lm(y ~ x)) user system elapsed 1.886 0.610 2.496

The ALTREP framework is designed to be extensible, so that package authors can define their own alternative representations of standard R objects. For example, an R vector could be represented as a distributed object as in a system like Spark, while still behaving like an ordinary R vector to the R engine. An example package, simplemmap, illustrates this concept by defining an implementation of vectors as memory-mapped objects that live on disk instead of in RAM.

There's no definite date yet when ALTREP will be in an official R release, and my guess is that there's likely to be an extensive testing period to shake out any bugs caused by changing the internal representation of R objects. But the fact that the implementation is already making its way into the R sources is hugely promising, and I look forward to testing out the real-world impacts. You can read more about the current state of ALTREP in the draft vignette by Luke Tierney, Gabe Becker and Tomas Kalibera linked below.

R-Project.org: ALTREP: Alternative Representations for R Objects

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### pinp 0.0.2: Onwards

Wed, 09/20/2017 - 15:00

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A first update 0.0.2 of the pinp package arrived on CRAN just a few days after the initial release.

We added a new vignette for the package (see below), extended a few nice features, and smoothed a few corners.

The NEWS entry for this release follows.

Changes in tint version 0.0.2 (2017-09-20)
• The YAML segment can be used to select font size, one-or-two column mode, one-or-two side mode, linenumbering and watermarks (#21 and #26 addressing #25)

• If pinp.cls or jss.bst are not present, they are copied in ((#27 addressing #23)

• Output is now in shaded framed boxen too (#29 addressing #28)

• Endmatter material is placed in template.tex (#31 addressing #30)

• Expanded documentation of YAML options in skeleton.Rmd and clarified available one-column option (#32).

• Section numbering can now be turned on and off (#34)

• The default bibliography style was changed to jss.bst.

• A short explanatory vignette was added.

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page. For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Major update of D3partitionR: Interactive viz’ of nested data with R and D3.js

Wed, 09/20/2017 - 10:00

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

D3partitionR is an R package to visualize interactively nested and hierarchical data using D3.js and HTML widget. These last few weeks I’ve been working on a major D3partitionR update which is now available on GitHub. As soon as enough feedbacks are collected, the package will be on uploaded on the CRAN. Until then, you can install it using devtools

library(devtools) install_github("AntoineGuillot2/D3partitionR")

Here is a quick overview of the possibilities using the Titanic data:

A major update

This update is a major update from the previous version which will break code from 0.3.1

New functionalities
• Additional data for nodes: Additional data can be added for some given nodes. For instance, if a comment or a link needs to be shown in the tooltip or label of some nodes, they can be added through the add_nodes_data function

• Variable selection and computation, now, you can provide a variable for:
• sizing (i.e. the size of each node)
• color, any variable from your data.frame or from the nodes data can be used as a color.
• label, any variable from your data.frame or from the nodes data can be used as a label.
• tooltip, you can provide several variables to be displayed in the tooltip.
• aggregation function, when numerical variables are provided, you can choose the aggregation function you want.
• Coloring: The color scale can now be continuous. For instance, you can use the mean survival rate to the Titanic accident in each node, this make it easy to visualise quickly women in 1st class are more likely to survive than men in 3rd class.

Treemap to show the survival rate to the Titanic accident
• Label: Labels providing the showing the node’s names (or any other variable) can now be added to the plot.
• Breadcrumb: To avoid overlapping, the width of each breadcrumb is now variable and dependant on the length of the word

• Legend: By default, the legend now shows all the modalities/levels that are in the plot. To avoid wrapping, enabling the zoom_subset option will only shows the modalities in the direct children of the zoomed root.
API and backend change
• Easy data preprocessing, The data preparation was tedious in the previous versions. Now, you just need to aggregate your data.frame at the right level, the data.frame can directly be used in the D3partitionR functions to avoid to deal with nesting a data.frame which can be pretty complicated.
• The R API is greatly improved, D3partitionR are now S3 objects with a clearly named list of function to add data and to modify the chart appearance and parameters. Using pipes now makes D3partitionR syntax looks gg-like
##Treemap D3partitionR()%>% add_data(data_plot,count = 'N',steps=c('Sex','Embarked','Pclass','Survived'))%>% set_chart_type('treemap')%>% plot() ##Circle treemap D3partitionR()%>% add_data(data_plot,count = 'N',steps=c('Sex','Embarked','Pclass','Survived'))%>% set_chart_type('circle_treemap')%>% plot()

Style consistency among the different type of chart. Now, it’s easy to switch from a treemap to a circle treemap or a sunburst and keep consistent styling policy.

Update to d3.js V4 and modularization. Each type of charts now has its own file and function. This function draws the chart at its root level with labels and colors, it returns a zoom function. The on-click actions (such as the breadcrumb update or the legend update) and the hover action (tooltips) are defined in a ‘global’ function.

Hence, adding new visualizations will be easy, the drawing and zooming script will just need to be adapted to this previous template.

What’s next

Thanks to the several feedbacks that will be collected during next week, a stable release version should soon be on the CRAN. I will also post more ressources on D3partitionR with use cases and example of Shiny Applications build on it.

The post Major update of D3partitionR: Interactive viz’ of nested data with R and D3.js appeared first on Enhance Data Science.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Enhance Data Science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Regression Analysis — What You Should’ve Been Taught But Weren’t, and Were Taught But Shouldn’t Have Been

Wed, 09/20/2017 - 09:55

The above title was the title of my talk this evening at our Bay Area R Users Group. I had been asked to talk about my new book, and I presented four of the myths that are dispelled in the book.

Hadley also gave an interesting talk, “An introduction to tidy evaluation,” involving some library functions that are aimed at writing clearer, more readable R. The talk came complete with audience participation, very engaging and informative.

The venue was GRAIL, a highly-impressive startup. We will be hearing a lot more about this company, I am sure.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### 12 Visualizations to Show a Single Number

Wed, 09/20/2017 - 08:53

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

Infographics, dashboards, and reports often need to highlight or visualize a single number. But how do you highlight a single number so that it has an impact and looks good? It can be a big challenge to make a lonely, single number look great. In this post, I show 12 different ways of representing a single number. Most of these visualizations have been created automatically using R.

When to create a visualization to represent a single number

There are a number of situations in which it can be advantageous to create a visualization to represent a single number:

• To communicate with less numerate viewers/readers;
• Infographics and dashboards commonly use one important number;
• To attract the attention of distracted or busy viewers/readers;
• To add some humanity or “color”, to create an emotional connection;
• Or to increase the redundancy of the presentation (see Improve the Quality of Data Visualizations Using Redundancy).

Option 1: Standard text formatting: font, size, style

Sometimes the plain text is the best option. Make fonts big and simple so they stand out.

669 people died

Option 2: Informative formatting

Colors, bolding, emphasis, and other formatting options can be used to draw attention to figures and to communicate additional information. For example, the red font could draw attention to low results. Example: Sales declined by 23%.

You can do this in a sentence or in a visualization, such as in the bar chart below, where color is used to encode statistical testing results.

And you could also use informative formatting via a colored background for numbers, as in the visual confection below. In this instance, traffic-light coloring indicates the relative levels of performance of different departments in a supermarket.

Option 3: Pie charts

Proportions are often illustrated using pie charts with varying degrees of success. They can be particularly elegant for displaying single numbers that are proportions.

Option 4: Donut charts

Similarly with donut charts. It’s just a matter of taste.

Option 5: Portion of an image

The two-category pie chart and donut chart are special cases of a more general strategy, which is to show a portion of an image.

Option 6:  Overlaid images

A twist on showing a portion of an image is to proportionally color an image.

A common, but misleading, criticism of overlaid image visualizations and all the pictograph type of visualizations is that they are imprecise at best, and innumerate at worst. The three visualizations above have all been created to illustrate this point. The one on the left is not too bad. The President Trump approval visualization can readily be criticized in that the actual area shaded in blue is less than 37%. This is due to the greater amount of whitespace over the shoulders. Particularly problematic is the age visualization. This implicitly compares a number, 37, against some unknown and silly benchmark implied by the top of the image.

While such criticisms are technically correct, they are misleading. Consider the “worst” of the visualizations, which shows the average age. The purpose of the visualization is simply to communicate to the viewer that the average age figure is in some way low. How low? This is communicated by the number at the base. If the actual correct number is shown, there is little prospect of the viewer being misled. However, showing a number without the visualization runs the risk that the viewer fails to notice the point at all. This leads to a much higher error.

Furthermore, there are many contexts in which precision is not even important. How often is half a glass of wine actually precisely half a glass?

Option 7:  Countable pictographs

While all pictographs have a bad reputation, in the case of the countable pictograph, it is quite undeserved. Countable pictographs achieve redundancy and thus likely improve the accuracy with which the underlying data is understood by the viewer.

Option 8: Uncountable pictographs

The goal of the uncountable pictograph is to suggest, in a graphical way, “big”. It is often most useful when next to a comparable countable pictograph.

Option 9: Dials, gauges, and thermometers

This data is from a study by the Pew Research Center. Go to the original report for a better visualization, which uses a thermometer.

Option 10: Physical representations

This photo, from The Guardian, shows 888,246 ceramic red poppies installed around the Tower of London. Each poppy represents a British or Colonial serviceman killed in World War I. It is the same basic idea as the uncountable pictograph but on a much grander scale.

Option 11: Artwork/graphic design

Option 12: Evocative collages

And finally the most common of all approaches in marketing reporting is to find an image or images that in some way represent, or evoke, something relevant to the number.

Software

You can only create the visualizations in option 2 easily in Displayr. The visualizations in options 3 through 8 were all created using the open-source GitHub R packages Displayr/rhtmlPictographs and Displayr/flipPictographs, which were created by my colleagues Kyle and Carmen.

Everyone can access the Displayr document used to create these visualizations here. To modify the visualizations with your own data, click on each visualization and either change the Inputs or modify the underlying R code (Properties > R CODE).

Acknowledgments

Google’s gauge inspired the gauge in this post. I haven’t been able to find any copyright information for the beautiful Coca-cola photo. So, if you know where it is from, please tell me!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### From Biology to Industry. A Blogger’s Journey to Data Science.

Wed, 09/20/2017 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

Today, I have given a webinar for the Applied Epidemiology Didactic of the University of Wisconsin – Madison titled “From Biology to Industry. A Blogger’s Journey to Data Science.”

I talked about how blogging about R and Data Science helped me become a Data Scientist. I also gave a short introduction to Machine Learning, Big Data and Neural Networks.

My slides can be found here: https://www.slideshare.net/ShirinGlander/from-biology-to-industry-a-bloggers-journey-to-data-science

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### A simstudy update provides an excuse to talk a little bit about latent class regression and the EM algorithm

Wed, 09/20/2017 - 02:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

I was just going to make a quick announcement to let folks know that I’ve updated the simstudy package to version 0.1.4 (now available on CRAN) to include functions that allow conversion of columns to factors, creation of dummy variables, and most importantly, specification of outcomes that are more flexibly conditional on previously defined variables. But, as I was coming up with an example that might illustrate the added conditional functionality, I found myself playing with package flexmix, which uses an Expectation-Maximization (EM) algorithm to estimate latent classes and fit regression models. So, in the end, this turned into a bit more than a brief service announcement.

Defining data conditionally

Of course, simstudy has always enabled conditional distributions based on sequentially defined variables. That is really the whole point of simstudy. But, what if I wanted to specify completely different families of distributions or very different regression curves based on different individual characteristics? With the previous version of simstudy, it was not really easy to do. Now, with the addition of two key functions, defCondition and addCondition the process is much improved. defCondition is analogous to the function defData, in that this new function provides an easy way to specify conditional definitions (as does defReadCond, which is analogous to defRead). addCondition is used to actually add the data column, just as addColumns adds columns.

It is probably easiest to see in action:

library(simstudy) # Define baseline data set def <- defData(varname="x", dist="normal", formula=0, variance=9) def <- defData(def, varname = "group", formula = "0.2;0.5;0.3", dist = "categorical") # Generate data set.seed(111) dt <- genData(1000, def) # Convert group to factor - new function dt <- genFactor(dt, "group", replace = TRUE) dt

defCondition is the same as defData, except that instead of specifying a variable name, we need to specify a condition that is based on a pre-defined field:

defC <- defCondition(condition = "fgroup == 1", formula = "5 + 2*x", variance = 4, dist = "normal") defC <- defCondition(defC, condition = "fgroup == 2", formula = 4, variance = 3, dist="normal") defC <- defCondition(defC, condition = "fgroup == 3", formula = "3 - 2*x", variance = 2, dist="normal") defC ## condition formula variance dist link ## 1: fgroup == 1 5 + 2*x 4 normal identity ## 2: fgroup == 2 4 3 normal identity ## 3: fgroup == 3 3 - 2*x 2 normal identity

A subsequent call to addCondition generates a data table with the new variable, in this case $$y$$:

dt <- addCondition(defC, dt, "y") dt ## id y x fgroup ## 1: 1 5.3036869 0.7056621 2 ## 2: 2 2.1521853 -0.9922076 2 ## 3: 3 4.7422359 -0.9348715 3 ## 4: 4 16.1814232 -6.9070370 3 ## 5: 5 4.3958893 -0.5126281 3 ## --- ## 996: 996 -0.8115245 -2.7092396 1 ## 997: 997 1.9946074 0.7126094 2 ## 998: 998 11.8384871 2.3895135 1 ## 999: 999 3.3569664 0.8123200 1 ## 1000: 1000 3.4662074 -0.4653198 3

In this example, I’ve partitioned the data into three subsets, each of which has a very different linear relationship between variables $$x$$ and $$y$$, and different variation. In this particular case, all relationships are linear with normally distributed noise, but this is absolutely not required.

Here is what the data look like:

library(ggplot2) mycolors <- c("#555bd4","#d4555b","#d4ce55") ggplot(data = dt, aes(x = x, y = y, group = fgroup)) + geom_point(aes(color = fgroup), size = 1, alpha = .4) + geom_smooth(aes(color = fgroup), se = FALSE, method = "lm") + scale_color_manual(name = "Cluster", values = mycolors) + scale_x_continuous(limits = c(-10,10), breaks = c(-10, -5, 0, 5, 10)) + theme(panel.grid = element_blank(), panel.background = element_rect(fill = "grey96", color = "grey80"))

Latent class regression models

Suppose we come across the same data set, but are not privy to the group classification, and we are still interested in the relationship between $$x$$ and $$y$$. This is what the data set would look like – not as user-friendly:

rawp <- ggplot(data = dt, aes(x = x, y = y, group = fgroup)) + geom_point(color = "grey75", size = .5) + scale_x_continuous(limits = c(-10,10), breaks = c(-10, -5, 0, 5, 10)) + theme(panel.grid = element_blank(), panel.background = element_rect(fill = "grey96", color = "grey80")) rawp

We might see from the plot, or we might have some subject-matter knowledge that suggests there are are several sub-clusters within the data, each of which appears to have a different relationship between $$x$$ and $$y$$. (Obviously, we know this is the case, since we generated the data.) The question is, how can we estimate the regression lines if we don’t know the class membership? That is where the EM algorithm comes into play.

The EM algorithm, very, very briefly

The EM algorithm handles model parameter estimation in the context of incomplete or missing data. In the example I’ve been discussing here, the subgroups or cluster membership are the missing data. There is an extensive literature on EM methods (starting with this article by Dempster, Laird & Rubin), and I am barely even touching the surface, let alone scratching it.

The missing data (cluster probabilities) are estimated in the Expectation- or E-step. The unknown model parameters (intercept, slope, and variance) for each of the clusters is estimated in the Maximization- or M-step, which in this case assumes the data come from a linear process with normally distributed noise – both the linear coefficients and variation around the line are conditional on cluster membership. The process is iterative. First, the E-step, which is based on some starting model parameters at first and then updated with the most recent parameter estimates from the prior M-step. Second, the M-step is based on estimates of the maximum likelihood of all the data (including the ‘missing’ data estimated in the prior E-step). We iterate back and forth until the parameter estimates in the M-step reach a steady state, or the overal likelihood estimate becomes stable.

The strength or usefulness of the EM method is that the likelihood of the full data (both observed data – $$x$$’s and $$y$$’s – and unobserved data – cluster probabilities) is much easier to write down and estimate than the likelihood of the observed data only ($$x$$’s and $$y$$’s). Think of the first plot above with the structure given by the colors compared to the second plot in grey without the structure. The first seems so much more manageable than the second – if only we knew the underlying structure defined by the clusters. The EM algorithm builds the underlying structure so that the maximum likelihood estimation problem becomes much easier.

Here is a little more detail on what the EM algorithm is estimating in our application. (See this for the much more detail.) First, we estimate the probability of membership in cluster $$j$$ for our linear regression model with three clusters:

$P_i(j|x_i, y_i, \pi_j, \alpha_{j0}, \alpha_{j1}, \sigma_j) = p_{ij}= \frac{\pi_jf(y_i|x_i, \alpha_{j0}, \alpha_{j1}, \sigma_j))}{\sum_{k=1}^3 \pi_k f(y_i|x_i, \alpha_{k0}, \alpha_{k1}, \sigma_k))},$ where $$\alpha_{j0}$$ and $$\alpha_{j1}$$ are the intercept and slope for cluster $$j$$, and $$\sigma_j$$ is the standard deviation for cluster $$j$$. $$\pi_j$$ is the probability of any individual being in cluster $$j$$, and is estimated by taking and average of the $$p_{ij}$$’s across all individuals. Finally, $$f(.|.)$$ is the density from the normal distribution $$N(\alpha_{j0} + \alpha_{j1}x, \sigma_j^2)$$.

Second, we maximize each of the three cluster-specific log-likelihoods, where each individual is weighted by its probability of cluster membership (which is $$P_i(j)$$, estimated in the E-step). In particular, we are maximizing the cluster-specific likelihood with respect to the three unknown parameters $$\alpha_{j0}$$, $$\alpha_{j1}$$, and $$\sigma_j$$:

$\sum_{n=1}^N \hat{p}_{nk} \text{log} (f(y_n|x_n,\alpha_{j0},\alpha_{j1},\sigma_j)$ In R, the flexmix package has implemented an EM algorithm to estimate latent class regression models. The package documentation provides a really nice, accessible description of the two-step procedure, with much more detail than I have provided here. I encourage you to check it out.

Iterating slowly through the EM algorithm

Here is a slow-motion version of the EM estimation process. I show the parameter estimates (visually) at the early stages of estimation, checking in after every three steps. In addition, I highlight two individuals and show the estimated probabilities of cluster membership. At the beginning, there is little differentiation between the regression lines for each cluster. However, by the 10th iteration the parameter estimates for the regression lines are looking pretty similar to the original plot.

library(flexmix) selectIDs <- c(508, 775) # select two individuals ps <- list() count <- 0 p.ij <- data.table() # keep track of estimated probs pi.j <- data.table() # keep track of average probs for (i in seq(1,10, by=3)) { count <- count + 1 set.seed(5) # fit model up to "i" iterations - either 1, 4, 7, or 10 exMax <- flexmix(y ~ x, data = dt, k = 3, control = list(iter.max = i) ) p.ij <- rbind(p.ij, data.table(i, selectIDs, posterior(exMax)[selectIDs,])) pi.j <- rbind(pi.j, data.table(i, t(apply(posterior(exMax), 2, mean)))) dp <- as.data.table(t(parameters(exMax))) setnames(dp, c("int","slope", "sigma")) # flexmix rearranges columns/clusters dp[, grp := c(3, 1, 2)] setkey(dp, grp) # create plot for each iteration ps[[count]] <- rawp + geom_abline(data = dp, aes(intercept = int, slope = slope, color=factor(grp)), size = 1) + geom_point(data = dt[id %in% selectIDs], color = "black") + scale_color_manual(values = mycolors) + ggtitle(paste("Iteration", i)) + theme(legend.position = "none", plot.title = element_text(size = 9)) } library(gridExtra) grid.arrange(ps[[1]], ps[[2]], ps[[3]], ps[[4]], nrow = 1)

For the two individuals, we can see the probabilities converging to a level of certainty/uncertainty. The individual with ID #775 lies right on the regression line for cluster 3, far from the other lines, and the algorithm quickly assigns a probability of 100% to cluster 3 (its actual cluster). The cluster assignment is less certain for ID #508, which lies between the two regression lines for clusters 1 and 2.

# actual cluster membership dt[id %in% selectIDs, .(id, fgroup)] ## id fgroup ## 1: 508 2 ## 2: 775 3 setkey(p.ij, selectIDs, i) p.ij[, .(selectIDs, i, C1 = round(V2, 2), C2 = round(V3,2), C3 = round(V1,2))] ## selectIDs i C1 C2 C3 ## 1: 508 1 0.32 0.36 0.32 ## 2: 508 4 0.29 0.44 0.27 ## 3: 508 7 0.25 0.65 0.10 ## 4: 508 10 0.24 0.76 0.00 ## 5: 775 1 0.35 0.28 0.37 ## 6: 775 4 0.33 0.14 0.53 ## 7: 775 7 0.11 0.01 0.88 ## 8: 775 10 0.00 0.00 1.00

In addition, we can see how the estimate of overall group membership (for all individuals) changes through the iterations. The algorithm starts by assigning equal probability to each cluster (1/3) and slowly moves towards the actual distribution used to generate the data (20%, 50%, and 30%).

pi.j[, .(i, C1 = round(V2, 2), C2 = round(V3,2), C3 = round(V1,2))] ## i C1 C2 C3 ## 1: 1 0.33 0.34 0.33 ## 2: 4 0.31 0.34 0.35 ## 3: 7 0.25 0.39 0.36 ## 4: 10 0.23 0.44 0.33 Final estimation of linear models

The final estimation is shown below, and we can see that the parameters have largely converged to the values used to generate the data.

# Estimation until convergence set.seed(5) ex1 <- flexmix(y ~ x, data = dt, k = 3) # paramter estimates data.table(parameters(ex1))[, .(param = c("int", "slope", "sd"), C1 = round(Comp.2, 2), C2 = round(Comp.3, 2), C3 = round(Comp.1, 2))] ## param C1 C2 C3 ## 1: int 5.18 3.94 3.00 ## 2: slope 1.97 -0.03 -1.99 ## 3: sd 2.07 1.83 1.55 # estimates of cluster probabilities round(apply(posterior(ex1), 2, mean), 2)[c(2,3,1)] ## [1] 0.19 0.51 0.30 # estimates of individual probabilities data.table(posterior(exMax)[selectIDs,])[,.(selectIDs, C1 = round(V2, 2), C2 = round(V3, 2), C3 = round(V1, 2))] ## selectIDs C1 C2 C3 ## 1: 508 0.24 0.76 0 ## 2: 775 0.00 0.00 1 How do we know the relationship is linear?

In reality, there is no reason to assume that the relationship between $$x$$ and $$y$$ is simply linear. We might want to look at other possibilities, such as a quadratic relationship. So, we use flexmix to estimate an expanded model, and then we plot the fitted lines on the original data:

ex2 <- flexmix(y ~ x + I(x^2), data = dt, k = 3) dp <- as.data.table(t(parameters(ex2))) setnames(dp, c("int","slope", "slope2", "sigma")) dp[, grp := c(1,2,3)] x <- c(seq(-10,10, by =.1)) dp1 <- data.table(grp = 1, x, dp[1, int + slope*x + slope2*(x^2)]) dp2 <- data.table(grp = 2, x, dp[2, int + slope*x + slope2*(x^2)]) dp3 <- data.table(grp = 3, x, dp[3, int + slope*x + slope2*(x^2)]) dp <- rbind(dp1, dp2, dp3) rawp + geom_line(data=dp, aes(x=x, y=V3, group = grp, color = factor(grp)), size = 1) + scale_color_manual(values = mycolors) + theme(legend.position = "none")

And even though the parameter estimates appear to be reasonable, we would want to compare the simple linear model with the quadratic model, which we can use with something like the BIC. We see that the linear model is a better fit (lower BIC value) – not surprising since this is how we generated the data.

summary(refit(ex2)) ## $Comp.1 ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.440736 0.309576 4.6539 3.257e-06 *** ## x -0.405118 0.048808 -8.3003 < 2.2e-16 *** ## I(x^2) -0.246075 0.012162 -20.2337 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ##$Comp.2 ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 6.955542 0.289914 23.9918 < 2.2e-16 *** ## x 0.305995 0.049584 6.1712 6.777e-10 *** ## I(x^2) 0.263160 0.014150 18.5983 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## $Comp.3 ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 3.9061090 0.1489738 26.2201 < 2e-16 *** ## x -0.0681887 0.0277366 -2.4584 0.01395 * ## I(x^2) 0.0113305 0.0060884 1.8610 0.06274 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # Comparison of the two models BIC(ex1) ## [1] 5187.862 BIC(ex2) ## [1] 5316.034 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Visualizing dataset to apply machine learning-exercises Fri, 09/08/2017 - 18:00 (This article was first published on R-exercises, and kindly contributed to R-bloggers) INTRODUCTION Dear reader, If you are a newbie in the world of machine learning, then this tutorial is exactly what you need in order to introduce yourself to this exciting new part of the data science world. This post includes a full machine learning project that will guide you step by step to create a “template,” which you can use later on other datasets. Before proceeding, please follow our short tutorial. Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R and without looking at the answers. Then see solutions to check your answers. Exercise 1 Create a variable “x” and attach to it the input attributes of the “iris” dataset. HINT: Use columns 1 to 4. Exercise 2 Create a variable “y” and attach to it the output attribute of the “iris” dataset. HINT: Use column 5. Exercise 3 Create a whisker plot (boxplot) for the variable of the first column of the “iris” dataset. HINT: Use boxplot(). Exercise 4 Now create a whisker plot for each one of the four input variables of the “iris” dataset in one image. HINT: Use par(). Learn more about machine learning in the online course Beginner to Advanced Guide on Machine Learning with R Tool. In this course you will learn how to: • Create a machine learning algorithm from a beginner point of view • Quickly dive into more advanced methods in an accessible pace and with more explanations • And much more This course shows a complete workflow start to finish. It is a great introduction and fallback when you have some experience. Exercise 5 Create a barplot to breakdown your output attribute. HINT: Use plot(). Exercise 6 Create a scatterplot matrix of the “iris” dataset using the “x” and “y” variables. HINT: Use featurePlot(). Exercise 7 Create a scatterplot matrix with ellipses around each separated group. HINT: Use plot="ellipse". Exercise 8 Create box and whisker plots of each input variable again, but this time broken down into separated plots for each class. HINT: Use plot="box". Exercise 9 Create a list named “scales” that includes the “x” and “y” variables and set relation to “free” for both of them. HINT: Use list() Exercise 10 Create a density plot matrix for each attribute by class value. HINT: Use featurePlot(). Related exercise sets: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Hurricane Harvey’s rains, visualized in R by USGS Fri, 09/08/2017 - 17:36 (This article was first published on Revolutions, and kindly contributed to R-bloggers) On August 26 Hurricane Harvey became the largest hurricane to make landfall in the United States in over 20 years. (That record may yet be broken by Irma, now bearing down on the Florida peninsula.) Harvey's rains brought major flooding to Houston and other coastal areas in the Gulf of Mexico. You can see the rainfall generated by Harvey across Texas and Louisiana in this animation from the US Geological Survey of county-by-county precipitation as the storm makes its way across land. Watch #Harvey move thru SE #Texas spiking rainfall rates in each county (blue colors) Interactive version: https://t.co/qFrnyq3Sbm pic.twitter.com/md0hiUs9Bb — USGS Coastal Change (@USGSCoastChange) September 6, 2017 Interestingly, the heaviest rains appear to fall somewhat away from the eye of the storm, marked in orange on the map. The animation features Harvey's geographic track, along with a choropleth of hourly rainfall totals and a joyplot of river gage flow rates, and was created using the R language. You can find the data and R code on Github, which makes use of the USGS's own vizlab package which facilitates the rapid assembly of web-ready visualizations. You can find more information about the animation, including the web-based interactive version, at the link below. USGS Vizlab: Hurricane Harvey's Water Footprint (via Laura DeCicco) var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Experiences as a first time rOpenSci package reviewer Fri, 09/08/2017 - 09:00 (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers) It all started January 26th this year when I signed up to volunteer as a reviewer for R packages submitted to rOpenSci. My main motivation for wanting to volunteer was to learn something new and to contribute to the R open source community. If you are wondering why the people behind rOpenSci are doing this, you can read How rOpenSci uses Code Review to Promote Reproducible Science. Three months later I was contacted by Maelle Salmon asking whether I was interested in reviewing the R package patentsview for rOpenSci. And yes, I was! To be honest I was a little bit thrilled. The packages are submitted for review to rOpenSci via an issue to their GitHub repository and also the reviews happen there. So you can check out all previous package submissions and reviews. With all the information you get from rOpenSci and also the help from the editor it is straightforward to do the package review. Before I started I read the reviewer guides (links below) and checked out a few of the existing reviews. I installed the package patentsview from GitHub and also downloaded the source code so I could check out how it was implemented. I started by testing core functionality of the package by running the examples that were mentioned in the README of the package. I think this is a good starting point because you get a feeling of what the author wants to achieve with the package. Later on I came up with my own queries (side note: this R package interacts with an API from which you can query patents). During the process I used to switch between writing queries like a normal user of the package would do and checking the code. When I saw something in the code that wasn't quite clear to me or looked wrong I went back to writing new queries to check whether the behavior of the methods was as expected. With this approach I was able to give feedback to the package author which led to the inclusion of an additional unit test, a helper function that makes the package easier to use, clarification of an error message and an improved documentation. You can find the review I did here. There are several R packages that helped me get started with my review, e.g. devtools and goodpractice. These packages can also help you when you start writing your own packages. An example for a very useful method is devtools::spell_check(), which performs a spell check on the package description and on manual pages. At the beginning I had an issue with goodpractice::gp() but Maelle Salmon (the editor) helped me resolve it. In the rest of this article you can read what I gained personally from doing a review. Contributing to the open source community When people think about contributing to the open source community, the first thought is about creating a new R package or contributing to one of the major packages out there. But not everyone has the resources (e.g. time) to do so. You also don't have awesome ideas every other day which can immediately be implemented into new R packages to be used by the community. Besides contributing with code there are also lots of other things than can be useful for other R users, for example writing blog posts about problems you solved, speaking at meetups or reviewing code to help improve it. What I like much about reviewing code is that people see things differently and have other experiences. As a reviewer, you see a new package from the user's perspective which can be hard for the programmer themselves. Having someone else review your code helps finding things that are missing because they seem obvious to the package author or detect code pieces that require more testing. I had a great feeling when I finished the review, since I had helped improve an already amazing R package a little bit more. Reviewing helps improve your own coding style When I write R code I usually try to do it in the best way possible. Google's R Style Guide is a good start to get used to coding best practice in R and I also enjoyed reading Programming Best Practices Tidbits . So normally when I think some piece of code can be improved (with respect to speed, readability or memory usage) I check online whether I can find a better solution. Often you just don't think something can be improved because you always did it in a certain way or the last time you checked there was no better solution. This is when it helps to follow other people's code. I do this by reading their blogs, following many R users on Twitter and checking their GitHub account. Reviewing an R package also helped me a great deal with getting new ideas because I checked each function a lot more carefully than when I read blog posts. In my opinion, good code does not only use the best package for each problem but also the small details are well implemented. One thing I used to do wrong for a long time was filling of data.frames until I found a better (much faster) solution on stackoverflow. And with respect to this you can learn a lot from someone else's code. What I found really cool in the package I reviewed was the usage of small helper functions (see utils.R). Functions like paste0_stop and paste0_message make the rest of the code a lot easier to read. Good start for writing your own packages When reviewing an R package, you check the code like a really observant user. I noticed many things that you usually don't care about when using an R package, like comments, how helpful the documentation and the examples are, and also how well unit tests cover the code. I think that reviewing a few good packages can prepare you very well for writing your own packages. Do you want to contribute to rOpenSci yourself? If I motivated you to become an rOpenSci reviewer, please sign up! Here is a list of useful things if you want to become an rOpenSci reviewer like me. If you are generally interested in either submitting or reviewing an R package, I would like to invite you to the Community Call on rOpenSci software review and onboarding. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### The writexl package: zero dependency xlsx writer for R Fri, 09/08/2017 - 09:00 (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers) We have started working on a new rOpenSci package called writexl. This package wraps the very powerful libxlsxwriter library which allows for exporting data to Microsoft Excel format. The major benefit of writexl over other packages is that it is completely written in C and has absolutely zero dependencies. No Java, Perl or Rtools are required. Getting Started The write_xlsx function writes a data frame to an xlsx file. You can test that data roundtrips properly by reading it back using the readxl package. Columns containing dates and factors get automatically coerced to character strings. library(writexl) library(readxl) write_xlsx(iris, "iris.xlsx") # read it back out <- read_xlsx("iris.xlsx") You can also give it a named list of data frames, in which case each data frame becomes a sheet in the xlsx file: write_xlsx(list(iris = iris, cars = cars, mtcars = mtcars), "mydata.xlsx") Performance is good too; in our benchmarks writexl is about twice as fast as openxlsx: library(microbenchmark) library(nycflights13) microbenchmark( writexl = writexl::write_xlsx(flights, tempfile()), openxlsx = openxlsx::write.xlsx(flights, tempfile()), times = 5 ) ## Unit: seconds ## expr min lq mean median uq max neval ## writexl 8.884712 8.904431 9.103419 8.965643 9.041565 9.720743 5 ## openxlsx 17.166818 18.072527 19.171003 18.669805 18.756661 23.189206 5 Roadmap The initial version of writexl implements the most important functionality for R users: exporting data frames. However the underlying libxlsxwriter library actually provides far more sophisticated functionality such as custom formatting, writing complex objects, formulas, etc. Most of this probably won't be useful to R users. But if you have a well defined use case for exposing some specific features from the library in writexl, open an issue on Github and we'll look into it! var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Infographic-style charts using the R waffle package Fri, 09/08/2017 - 07:54 (This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers) Infographics. I’ve seen good examples. I’ve seen more bad examples. In general, I prefer a good chart to an infographic. That said, there’s a “genre” of infographic that I do think is useful, which I’ll call “if X were 100 Y”. A good example: if the world were 100 people. That method of showing proportions has been called a waffle chart and for extra “infographic-i-ness”, the squares can be replaced by icons. You want to do this using R? Of course you do. Here’s how. There’s not much more here than you’ll find at the Github home of the R packages, waffle and extrafont. I’ve just made it a little more step-by-step. 1. Install the R packages You need waffle to create waffle charts and extrafont to use icons in the charts. install.packages(c("waffle", "extrafont")) library(waffle) library(extrafont) 2. Install Font Awesome The icons that waffle can use are supplied by Font Awesome. Download the zip file by clicking the Download link at their website. You may want to check out the paid option later but for now, just select the free download. Unzip the file and navigate to the fonts directory. To install the font, use the file named fontawesome-webfont.ttf. For Windows or Mac OS X users, installation is as simple as double-clicking the file and choosing “Install”. I haven’t done this under Linux for a while, it may even be the same procedure these days. 3. Import and register fonts In R, fonts have to be imported to the extrafont database. You need to do this whenever a new font is installed that you want to use in R. font_import() # check that Font Awesome is imported fonts()[grep("Awesome", fonts())] # [1] "FontAwesome" If you run font_import() in a session, you need to register the fonts with the R output device. Otherwise, this occurs when the package is loaded. If you’re using the Windows OS and want to plot to the screen in RStudio, an additional argument is required: # this should be fine for Mac OSX loadfonts() # use this if things look odd in RStudio under Windows loadfonts(device = "win") 4. Create charts Now we are ready to waffle. First, your basic waffle chart. The default colours are Colour Brewer “Set2”. waffle(c(50, 30, 15, 5), rows = 5, title = "Your basic waffle chart") Next, using icons. You can browse or search for available icon names on this page. You may need to fiddle with the size. waffle(c(50, 30, 15, 5), rows = 5, use_glyph = "child", glyph_size = 6, title = "Look I made an infographic using R!") You can use the iron() function to append waffle charts, which might be useful in comparisons. Here’s some made-up data involving some aspect of cars in two unnamed countries: iron( waffle(c(no = 80, yes = 20), rows = 5, use_glyph = "car", glyph_size = 6, colors = c("#c7d4b6", "#a3aabd"), title = "Country A"), waffle(c(no = 70, yes = 30), rows = 5, use_glyph = "car", glyph_size = 6, colors = c("#c7d4b6", "#a3aabd"), title = "Country B") ) Conclusion That’s it, more or less. Are the icons any more effective than the squares? In certain settings, perhaps. You be the judge. Filed under: R, statistics Tagged: packages, rstats, visualization, waffle var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – What You're Doing Is Rather Desperate. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Gold-Mining – Week 1 (2017) Fri, 09/08/2017 - 01:28 (This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers) Edit: A version posted at 7:30pm and removed at 8:30pm (eastern time zone) had last years figures. This one now points to the correct database. Football is Back! But I’m still shaking off the rust… Bare with me, we will try to get it up tonight. The post Gold-Mining – Week 1 (2017) appeared first on Fantasy Football Analytics. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Naive Principal Component Analysis (using R) Thu, 09/07/2017 - 22:46 (This article was first published on R-posts.com, and kindly contributed to R-bloggers) Post from Pablo Bernabeu’s blog. Principal Component Analysis (PCA) is a technique used to find the core components that underlie different variables. It comes in very useful whenever doubts arise about the true origin of three or more variables. There are two main methods for performing a PCA: naive or less naive. In the naive method, you first check some conditions in your data which will determine the essentials of the analysis. In the less-naive method, you set the those yourself, based on whatever prior information or purposes you had. I will tackle the naive method, mainly by following the guidelines in Field, Miles, and Field (2012), with updated code where necessary. This lecture material was also useful. The ‘naive’ approach is characterized by a first stage that checks whether the PCA should actually be performed with your current variables, or if some should be removed. The variables that are accepted are taken to a second stage which identifies the number of principal components that seem to underlie your set of variables. I ascribe these to the ‘naive’ or formal approach because either or both could potentially be skipped in exceptional circumstances, where the purpose is not scientific, or where enough information exists in advance. STAGE 1. Determine whether PCA is appropriate at all, considering the variables • Variables should be inter-correlated enough but not too much. Field et al. (2012) provide some thresholds, suggesting that no variable should have many correlations below .30, or any correlation at all above .90. Thus, in the example here, variable Q06 should probably be excluded from the PCA. • Bartlett’s test, on the nature of the intercorrelations, should be significant. Significance suggests that the variables are not an ‘identity matrix’ in which correlations are a sampling error. • KMO (Kaiser-Meyer-Olkin), a measure of sampling adequacy based on common variance (so similar purpose as Bartlett’s). As Field et al. review, ‘values between .5 and .7 are mediocre, values between .7 and .8 are good, values between .8 and .9 are great and values above .9 are superb’ (p. 761). There’s a general score as well as one per variable. The general one will often be good, whereas the individual scores may more likely fail. Any variable with a score below .5 should probably be removed, and the test should be run again. • Determinant: A formula about multicollinearity. The result should preferably fall below .00001. Note that some of these tests are run on the dataframe and others on a correlation matrix of the data, as distinguished below. # Necessary libraries library(ltm) library(lattice) library(psych) library(car) library(pastecs) library(scales) library(ggplot2) library(arules) library(plyr) library(Rmisc) library(GPArotation) library(gdata) library(MASS) library(qpcR) library(dplyr) library(gtools) library(Hmisc) # Select only your variables of interest for the PCA dataset = mydata[, c('select_var1','select_var1', 'select_var2','select_var3','select_var4', 'select_var5','select_var6','select_var7')] # Create matrix: some tests will require it data_matrix = cor(dataset, use = 'complete.obs') # See intercorrelations round(data_matrix, 2) # Bartlett's cortest.bartlett(dataset) # KMO (Kaiser-Meyer-Olkin) KMO(data_matrix) # Determinant det(data_matrix) STAGE 2. Identify number of components (aka factors) In this stage, principal components (formally called ‘factors’ at this stage) are identified among the set of variables. • The identification is done through a basic, ‘unrotated’ PCA. The number of components set a priori must equal the number of variables that are being tested. # Start off with unrotated PCA pc1 = psych::principal(dataset, nfactors = length(dataset), rotate="none") pc1 Below, an example result: ## Principal Components Analysis ## Call: psych::principal(r = eng_prop, nfactors = 3, rotate = "none") ## Standardized loadings (pattern matrix) based upon correlation matrix ## PC1 PC2 PC3 h2 u2 com ## Aud_eng -0.89 0.13 0.44 1 -2.2e-16 1.5 ## Hap_eng 0.64 0.75 0.15 1 1.1e-16 2.0 ## Vis_eng 0.81 -0.46 0.36 1 -4.4e-16 2.0 ## ## PC1 PC2 PC3 ## SS loadings 1.87 0.79 0.34 ## Proportion Var 0.62 0.26 0.11 ## Cumulative Var 0.62 0.89 1.00 ## Proportion Explained 0.62 0.26 0.11 ## Cumulative Proportion 0.62 0.89 1.00 ## ## Mean item complexity = 1.9 ## Test of the hypothesis that 3 components are sufficient. ## ## The root mean square of the residuals (RMSR) is 0 ## with the empirical chi square 0 with prob < NA ## ## Fit based upon off diagonal values = 1 Among the columns, there are first the correlations between variables and components, followed by a column (h2) with the ‘communalities‘. If less factors than variables had been selected, communality values would be below 1. Then there is the uniqueness column (u2): uniqueness is equal to 1 minus the communality. Next is ‘com’, which reflects the complexity with which a variable relates to the principal components. Those components are precisely found below. The first row contains the sums of squared loadings, or eigenvalues, namely, the total variance explained by each linear component. This value corresponds to the number of units explained out of all possible factors (which were three in the above example). The rows below all cut from the same cloth. Proportion var = variance explained over a total of 1. This is the result of dividing the eigenvalue by the number of components. Multiply by 100 and you get the percentage of total variance explained, which becomes useful. In the example, 99% of the variance has been explained. Aside from the meddling maths, we should actually expect 100% there because the number of factors equaled the number of variables. Cumulative var: variance added consecutively up to the last component. Proportion explained: variance explained over what has actually been explained (only when variables = factors is this the same as Proportion var). Cumulative proportion: the actually explained variance added consecutively up to the last component. Two sources will determine the number of components to select for the next stage: • Kaiser’s criterion: components with SS loadings > 1. In our example, only PC1. A more lenient alternative is Joliffe’s criterion, SS loadings > .7. • Scree plot: the number of points after point of inflexion. For this plot, call: plot(pc1$values, type = 'b')

Imagine a straight line from the first point on the right. Once this line bends considerably, count the points after the bend and up to the last point on the left. The number of points is the number of components to select. The example here is probably the most complicated (two components were finally chosen), but normally it’s not difficult.

Based on both criteria, go ahead and select the definitive number of components.

STAGE 3.  Run definitive PCA

Run a very similar command as you did before, but now with a more advanced method. The first PCA, a heuristic one, worked essentially on the inter-correlations. The definitive PCA, in contrast, will implement a prior shuffling known as ‘rotation’, to ensure that the result is robust enough (just like cards are shuffled). Explained variance is captured better this way. The go-to rotation method is the orthogonal, or ‘varimax’ (though others may be considered too).

# Now with varimax rotation, Kaiser-normalized # by default: pc2 = psych::principal(dataset, nfactors=2, rotate = "varimax", scores = TRUE) pc2 pc2$loadings # Healthcheck pc2$residual pc2$fit pc2$communality

We would want:

• Less than half of residuals with absolute values > 0.05
• Model fit > .9
• All communalities > .7

If any of this fails, consider changing the number of factors. Next, the rotated components that have been ‘extracted’ from the core of the set of variables can be added to the dataset. This would enable the use of these components as new variables that might prove powerful and useful (as in this research).

dataset = cbind(dataset, pc2$scores) summary(dataset$RC1, dataset$RC2) STAGE 4. Determine ascription of each variable to components Check the main summary by just calling pc2, and see how each variable correlates with the rotated components. This is essential because it reveals how variables load on each component, or in other words, to which component a variable belongs. For instance, the table shown here belongs to a study about meaning of words. These results suggest that the visual and haptic modalities of words are quite related, whereas the auditory modality is relatively unique. When the analysis works out well, a cut-off point of r = .8 may be applied for considering a variable as part of a component. STAGE 5. Enjoy the plot The plot is perhaps the coolest part about PCA. It really makes an awesome illustration of the power of data analysis. ggplot(dataset, aes(RC1, RC2, label = as.character(main_eng))) + aes (x = RC1, y = RC2, by = main_eng) + stat_density2d(color = "gray87")+ geom_text(size = 7) + ggtitle ('English properties') + theme_bw() + theme( plot.background = element_blank() ,panel.grid.major = element_blank() ,panel.grid.minor = element_blank() ,panel.border = element_blank() ) + theme(axis.line = element_line(color = 'black')) + theme(axis.title.x = element_text(colour = 'black', size = 23, margin=margin(15,15,15,15)), axis.title.y = element_text(colour = 'black', size = 23, margin=margin(15,15,15,15)), axis.text.x = element_text(size=16), axis.text.y = element_text(size=16)) + labs(x = "", y = "Varimax-rotated Principal Component 2") + theme(plot.title = element_text(hjust = 0.5, size = 32, face = "bold", margin=margin(15,15,15,15))) Below is an example combining PCA plots with code similar to the above. These plots illustrate something further with regard to the relationships among modalities. In property words, the different modalities spread out more clearly than they do in concept words. This makes sense because in language, properties define concepts (see more). An example of this code is use is available here (with data here). References Field, A. P., Miles, J., & Field, Z. (2012). Discovering statistics using R. London: Sage. Feel free to comment below or on the original post. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### In case you missed it: August 2017 roundup Thu, 09/07/2017 - 22:18 (This article was first published on Revolutions, and kindly contributed to R-bloggers) In case you missed them, here are some articles from August of particular interest to R users. Using the featurizeText function in the MicrosoftML package to extract ngrams from unstructured text. A joyplot visualizes the probabilities associated with phrases like "highly likely" and "little chance" by a sample of 46 Redditors. Two examples of creating 3-D animations in R: a stereo cube, and the Charleston in motion capture data. A tutorial on creating thematic maps in R, from ComputerWorld. Some tips on using R to query data in Power BI. Using the Rcpp package to calculate a membership matrix for fuzzy k-means clustering. A reimagining of Minard's chart of Napoleon's march on Russia. Rankings of gender roles for men and women in film, from an analysis of scripts with the tidytext package. Several talks at the upcoming Ignite conference feature R. Norm Matloff's keynote at UseR!2017 discussed various obstacles to performance in parallel programming. Peter Dalgaard marks the 20th anniversary of the formation of the R Core Group. Using the rxFeaturize function in the MicrosoftML package to find images similar to a reference image. Buzzfeed used R to identify possible spy planes in Flightradar24 data. Timo Grossenbacher offers a helpful workflow for implementing reproducible data analysis in R. A preview of version 0.10.0 of the dplyrXdf package, featuring support for the tidyeval framework on out-of-memory Xdf data files. A tutorial on using CNTK via the keras package for forecasting Neils Berglund explains how to use the sqlrutils package to publish an R function as a SQL Server stored procedure. Tomas Kalibera's presentation at UseR!2017 includes useful guidance on getting the most out of R's byte compiler. The kadinsky package makes accidental aRt, deliberately. Angus Taylor's UseR!2017 presentation uses MXNET to categorize product reviews on Amazon. Various solutions for the energy, retail and shipping industries, with data and Microsoft R code. Talks on new database interfaces from UseR!2017: Jim Hester on the ODBC package, and Kirill Müller on the DBI package. And some general interest stories (not necessarily related to R): As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Readability Redux Tue, 09/05/2017 - 02:19 (This article was first published on R – rud.is, and kindly contributed to R-bloggers) I recently posted about using a Python module to convert HTML to usable text. Since then, a new package has hit CRAN dubbed htm2txt that is 100% R and uses regular expressions to strip tags from text. I gave it a spin so folks could compare some basic output, but you should definitely give htm2txt a try on your own conversion needs since each method produces different results. On my macOS systems, the htm2txt calls ended up invoking XQuartz (the X11 environment on macOS) and they felt kind of sluggish (base R regular expressions don’t have a “compile” feature and can be sluggish compared to other types of regular expression computations). I decided to spend some of Labor Day (in the U.S.) laboring (not for long, though) on a (currently small) rJava-based R package dubbed jericho which builds upon work created by Martin Jericho which is used in at-scale initiatives like the Internet Archive. Yes, I’m trading Java for Python, but the combination of Java+R has been around for much longer and there are many solved problems in Java-space that don’t need to be re-invented (if you do know a header-only, cross-platform, C++ HTML-to-text library, definitely leave a comment). Is it worth it to get rJava up and running to use jericho vs htm2txt? Let’s take a look: library(jericho) # devtools::install_github("hrbrmstr/jericho") library(microbenchmark) library(htm2txt) library(tidyverse) c( "https://medium.com/starts-with-a-bang/science-knows-if-a-nation-is-testing-nuclear-bombs-ec5db88f4526", "https://en.wikipedia.org/wiki/Timeline_of_antisemitism", "http://www.healthsecuritysolutions.com/2017/09/04/watch-out-more-ransomware-attacks-incoming/" ) -> urls map_chr(urls, ~paste0(read_lines(.x), collapse="\n")) -> sites_html microbenchmark( jericho_txt = { a <- html_to_text(sites_html[1]) }, jericho_render = { a <- render_html_to_text(sites_html[1]) }, htm2txt = { a <- htm2txt(sites_html[1]) }, times = 10 ) -> mb1 # microbenchmark( # jericho_txt = { # a <- html_to_text(sites_html[2]) # }, # jericho_render = { # a <- render_html_to_text(sites_html[2]) # }, # htm2txt = { # a <- htm2txt(sites_html[2]) # }, # times = 10 # ) -> mb2 microbenchmark( jericho_txt = { a <- html_to_text(sites_html[3]) }, jericho_render = { a <- render_html_to_text(sites_html[3]) }, htm2txt = { a <- htm2txt(sites_html[3]) }, times = 10 ) -> mb3 The second benchmark is commented out because I really didn’t have time wait for it to complete (FWIW jericho goes fast in that test). Here’s what the other two look like: mb1 ## Unit: milliseconds ## expr min lq mean median uq max neval ## jericho_txt 4.121872 4.294953 4.567241 4.405356 4.734923 5.621142 10 ## jericho_render 5.446296 5.564006 5.927956 5.719971 6.357465 6.785791 10 ## htm2txt 1014.858678 1021.575316 1035.342729 1029.154451 1042.642065 1082.340132 10 mb3 ## Unit: milliseconds ## expr min lq mean median uq max neval ## jericho_txt 2.641352 2.814318 3.297543 3.034445 3.488639 5.437411 10 ## jericho_render 3.034765 3.143431 4.708136 3.746157 5.953550 8.931072 10 ## htm2txt 417.429658 437.493406 446.907140 445.622242 451.443907 484.563958 10 You should run the conversion functions on your own systems to compare the results (they’re somewhat large to incorporate here). I’m fairly certain they do a comparable — if not better — job of extracting clean, pertinent text. I need to separate the package into two (one for the base JAR and the other for the conversion functions) and add some more tests before a CRAN submission, but I think this would be a good addition to the budding arsenal of HTML-to-text conversion options in R. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Analyzing Google Trends Data in R Mon, 09/04/2017 - 13:32 (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers) Google Trends shows the changes in the popularity of search terms over a given time (i.e., number of hits over time). It can be used to find search terms with growing or decreasing popularity or to review periodic variations from the past such as seasonality. Google Trends search data can be added to other analyses, manipulated and explored in more detail in R. This post describes how you can use R to download data from Google Trends, and then include it in a chart or other analysis. We’ll discuss first how you can get overall (global) data on a search term (query), how to plot it as a simple line chart, and then how to can break the data down by geographical region. The first example I will look at is the rise and fall of the Blu-ray. Analyzing Google Trends in R I have never bought a Blu-ray disc and probably never will. In my world, technology moved from DVDs to streaming without the need for a high definition physical medium. I still see them in some shops, but it feels as though they are declining. Using Google Trends we can find out when interest in Blu-rays peaked. The following R code retrieves the global search history since 2004 for Blu-ray. library(gtrendsR) library(reshape2) google.trends = gtrends(c("blu-ray"), gprop = "web", time = "all")[[1]] google.trends = dcast(google.trends, date ~ keyword + geo, value.var = "hits") rownames(google.trends) = google.trends$date google.trends$date = NULL The first argument to the gtrends function is a list of up to 5 search terms. In this case, we have just one item. The second argument gprop is the medium searched on and can be any of web, news, images or youtube. The third argument time can be any of now 1-d, now 7-d, today 1-m, today 3-m, today 12-m, today+5-y or all (which means since 2004). A final possibility for time is to specify a custom date range e.g. 2010-12-31 2011-06-30. Note that I am using gtrendsR version 1.9.9.0. This version improves upon the CRAN version 1.3.5 (as of August 2017) by not requiring a login. You may see a warning if your timezone is not set – this can be avoided by adding the following line of code: Sys.setenv(TZ = "UTC") After retrieving the data from Google Trends, I format it into a table with dates for the row names and search terms along the columns. The table below shows the result of running this code. Plotting Google Trends data: Identifying seasonality and trends Plotting the Google Trends data as an R chart we can draw two conclusions. First, interest peaked around the end of 2008. Second, there is a strong seasonal effect, with significant spikes around Christmas every year. Note that results are relative to the total number of searches at each time point, with the maximum being 100. We cannot infer anything about the volume of Google searches. But we can say that as a proportion of all searches Blu-ray was about half as frequent in June 2008 compared to December 2008. An explanation about Google Trend methodology is here. Google Trends by geographic region Next, I will illustrate the use of country codes. To do so I will find the search history for skiing in Canada and New Zealand. I use the same code as previously, except modifying the gtrends line as below. google.trends = gtrends(c("skiing"), geo = c("CA", "NZ"), gprop = "web", time = "2010-06-30 2017-06-30")[[1]] The new argument to gtrends is geo, which allows the users to specify geographic codes to narrow the search region. The awkward part about geographical codes is that they are not always obvious. Country codes consist of two letters, for example, CA and NZ in this case. We could also use region codes such as US-CA for California. I find the easiest way to get these codes is to use this Wikipedia page. An alternative way to find all the region-level codes for a given country is to use the following snippet of R code. In this case, it retrieves all the regions of Italy (IT). library(gtrendsR) geo.codes = sort(unique(countries[substr(countries$sub_code, 1, 2) == "IT", ]$sub_code)) Plotting the ski data below, we note the contrast between northern and southern hemisphere winters. It is also relatively more popular in Canada than New Zealand. The 2014 winter Olympics causes a notable spike in both countries but particularly Canada. Create your own analysis In this post I have shown how to import data from Google Trends using the R package gtrendsR. Anyone can click on this link to explore the examples used in this post or create your own analysis. Related Post var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Calculating Marginal Effects Exercises Mon, 09/04/2017 - 08:54 (This article was first published on R-exercises, and kindly contributed to R-bloggers) A common experience for those in the social sciences migrating to R from SPSS or STATA is that some procedures that happened at the click of a button will now require more work or are too obscured by the unfamiliar language to see how to accomplish. One such procedure that I’ve experienced is when calculating the marginal effects of a generalized linear model. In this exercise set, we will explore calculating marginal effects for linear, logistic, and probit regression models in R. Exercises in this section will be solved using the Margins and mfx packages. It is recommended to take a look at the concise and excellent documentation for these packages before continuing. Answers to the exercises are available here. Exercise 1 Load the mtcars dataset. Build a linear regression of mpg on wt, qsec, am, and hp. Exercise 2 Print the coefficients from the linear model in the previous exercise. Exercise 3 Using Margins package find marginal effects. Exercise 4 Verify that you receive the same results from Exercises 2 and 3. Why do these marginal effects match the coefficients found when printing the linear model object? Exercise 5 Using the mtcars dataset, built a linear regression similar to Exercise 1 except include an interaction term with am and hp. Find the marginal effects for this regression. Exercise 6 Using your favorite dataset (mine is field.goals from the nutshell package), construct a logistic regression. Exercise 7 Explain why marginal effects for a logit model more complex than for a linear model? Exercise 8 For the next two exercises, you may use either package. Calculate the marginal effects with respect to the mean. Exercise 9 Calculate the average marginal effects. Exercise 10 If these marginal effects are different, explain why they are different. Related exercise sets: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Thou shalt not compare numeric values (except when it works) Mon, 09/04/2017 - 07:32 (This article was first published on R – Irregularly Scheduled Programming, and kindly contributed to R-bloggers) This was just going to be a few Tweets but it ended up being a bit of a rollercoaster of learning for me, and I haven’t blogged in far too long, so I’m writing it up quickly as a ‘hey look at that’ example for newcomers. I’ve been working on the ‘merging data’ part of my book and, as I do when I’m writing this stuff, I had a play around with some examples to see if there was anything funky going on if a reader was to try something slightly different. I’ve been using dplyr for the examples after being thoroughly convinced on Twitter to do so. It’s going well. Mostly. ## if you haven't already done so, load dplyr suppressPackageStartupMessages(library(dplyr)) My example involved joining together two tibbles containing text values. Nothing too surprising. I wondered though; do numbers behave the way I expect? Now, a big rule in programming is ‘thou shalt not compare numbers’, and it holds especially true when numbers aren’t exactly integers. This is because representing non-integers is hard, and what you see on the screen isn’t always what the computer sees internally. Thou shalt not compare numbers. If I had a tibble where the column I would use to join had integers dataA <- tribble( ~X, ~Y, 0L, 100L, 1L, 101L, 2L, 102L, 3L, 103L ) dataA ## # A tibble: 4 x 2 ## X Y ## ## 1 0 100 ## 2 1 101 ## 3 2 102 ## 4 3 103 and another tibble with numeric in that column dataB <- tribble( ~X, ~Z, 0, 1000L, 1, 1001L, 2, 1002L, 3, 1003L ) dataB ## # A tibble: 4 x 2 ## X Z ## ## 1 0 1000 ## 2 1 1001 ## 3 2 1002 ## 4 3 1003 would they still join? full_join(dataA, dataB) ## Joining, by = "X" ## # A tibble: 4 x 3 ## X Y Z ## ## 1 0 100 1000 ## 2 1 101 1001 ## 3 2 102 1002 ## 4 3 103 1003 Okay, sure. R treats these as close enough to join. I mean, maybe it shouldn’t, but we’ll work with what we have. R doesn’t always think these are equal identical(0L, 0) ## [1] FALSE identical(2L, 2) ## [1] FALSE though sometimes it does 0L == 0 ## [1] TRUE 2L == 2 ## [1] TRUE (== coerces types before comparing). Well, what if one of these just ‘looks like’ the other value (can be coerced to the same?) dataC <- tribble( ~X, ~Z, "0", 100L, "1", 101L, "2", 102L, "3", 103L ) dataC ## # A tibble: 4 x 2 ## X Z ## ## 1 0 100 ## 2 1 101 ## 3 2 102 ## 4 3 103 full_join(dataA, dataC) ## Joining, by = "X" ## Error in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y, check_na_matches(na_matches)): Can't join on 'X' x 'X' because of incompatible types (character / integer) That’s probably wise. Of course, R is perfectly happy with things like "2":"5" ## [1] 2 3 4 5 and == thinks that’s fine "0" == 0L ## [1] TRUE "2" == 2L ## [1] TRUE but who am I to argue? Anyway, how far apart can those integers and numerics be before they aren’t able to be joined? What if we shift the ‘numeric in name only’ values away from the integers just a teensy bit? .Machine$double.eps is the built-in value for ‘the tiniest number you can produce’. On this system it’s 2.22044610^{-16}.

dataBeps <- tribble( ~X, ~Z, 0 + .Machine$double.eps, 1000L, 1 + .Machine$double.eps, 1001L, 2 + .Machine$double.eps, 1002L, 3 + .Machine$double.eps, 1003L ) dataBeps ## # A tibble: 4 x 2 ## X Z ## ## 1 2.220446e-16 1000 ## 2 1.000000e+00 1001 ## 3 2.000000e+00 1002 ## 4 3.000000e+00 1003 full_join(dataA, dataBeps) ## Joining, by = "X" ## # A tibble: 6 x 3 ## X Y Z ## ## 1 0.000000e+00 100 NA ## 2 1.000000e+00 101 NA ## 3 2.000000e+00 102 1002 ## 4 3.000000e+00 103 1003 ## 5 2.220446e-16 NA 1000 ## 6 1.000000e+00 NA 1001

Well, that’s… weirder. The values offset from 2 and 3 joined fine, but the 0 and 1 each got multiple copies since R thinks they’re different. What if we offset a little further?

dataB2eps <- tribble( ~X, ~Z, 0 + 2*.Machine$double.eps, 1000L, 1 + 2*.Machine$double.eps, 1001L, 2 + 2*.Machine$double.eps, 1002L, 3 + 2*.Machine$double.eps, 1003L ) dataB2eps ## # A tibble: 4 x 2 ## X Z ## ## 1 4.440892e-16 1000 ## 2 1.000000e+00 1001 ## 3 2.000000e+00 1002 ## 4 3.000000e+00 1003 full_join(dataA, dataB2eps) ## Joining, by = "X" ## # A tibble: 8 x 3 ## X Y Z ## ## 1 0.000000e+00 100 NA ## 2 1.000000e+00 101 NA ## 3 2.000000e+00 102 NA ## 4 3.000000e+00 103 NA ## 5 4.440892e-16 NA 1000 ## 6 1.000000e+00 NA 1001 ## 7 2.000000e+00 NA 1002 ## 8 3.000000e+00 NA 1003

That’s what I’d expect. So, what’s going on? Why does R think those numbers are the same? Let’s check with a minimal example: For each of the values 0:4, let’s compare that integer with the same offset by .Machine$double.eps suppressPackageStartupMessages(library(purrr)) ## for the 'thou shalt not for-loop' crowd map_lgl(0:4, ~ as.integer(.x) == as.integer(.x) + .Machine$double.eps) ## [1] FALSE FALSE TRUE TRUE TRUE

And there we have it. Some sort of relative difference tolerance maybe? In any case, the general rule to live by is to never compare floats. Add this to the list of reasons why.

For what it’s worth, I’m sure this is hardly a surprising detail to the dplyr team. They’ve dealt with things like this for a long time and I’m sure it was much worse before those changes.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Irregularly Scheduled Programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...