R news and tutorials contributed by hundreds of R bloggers
Updated: 3 hours 38 min ago

### Association rules using FPGrowth in Spark MLlib through SparklyR

Thu, 11/23/2017 - 20:55

(This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers)

Introduction

Market Basket Analysis or association rules mining can be a very useful technique to gain insights in transactional data sets, and it can be useful for product recommendation. The classical example is data in a supermarket. For each customer we know what the individual products (items) are that he has bought. With association rules mining we can identify items that are frequently bought together. Other use cases for MBA could be web click data, log files, and even questionnaires.

In R there is a package arules to calculate association rules, it makes use of the so-called Apriori algorithm. For data sets that are not too big, calculating rules with arules in R (on a laptop) is not a problem. But when you have very huge data sets, you need to do something else, you can:

• use more computing power (or cluster of computing nodes).
• use another algorithm, for example FP Growth, which is more scalable. See this blog for some details on Apriori vs. FP Growth.

Or do both of the above points by using FPGrowth in Spark MLlib on a cluster. And the nice thing is: you can stay in your familiar R Studio environment!

Spark MLlib and sparklyr Example Data set

We use the example groceries transactions data in the arules package. It is not a big data set and you would definitely not need more than a laptop, but it is much more realistic than the example given in the Spark MLlib documentation :-).

Preparing the data

I am a fan of sparklyr It offers a good R interface to Spark and MLlib. You can use dplyr syntax to prepare data on Spark, it exposes many of the MLlib machine learning algorithms in a uniform way. Moreover, it is nicely integrated into the RStudio environment offering the user views on Spark data and a way to manage the Spark connection.

First connect to spark and read in the groceries transactional data, and upload the data to Spark. I am just using a local spark install on my Ubuntu laptop.

###### sparklyr code to perform FPGrowth algorithm ############ library(sparklyr) library(dplyr) #### spark connect ######################################### sc <- spark_connect(master = "local") #### first create some dummy data ########################### transactions = readRDS("transactions.RDs") #### upload to spark #########################################   trx_tbl  = copy_to(sc, transactions, overwrite = TRUE)

For demonstration purposes, data is copied in this example from the local R session to Spark. For large data sets this is not feasible anymore, in that case data can come from hive tables (on the cluster).

The figure above shows the products purchased by the first four customers in Spark in an RStudio grid. Although transactional systems will often output the data in this structure, it is not what the FPGrowth model in MLlib expects. It expects the data aggregated by id (customer) and the products inside an array. So there is one more preparation step.

# data needs to be aggregated by id, the items need to be in a list trx_agg = trx_tbl %>% group_by(id) %>% summarise( items = collect_list(item) )

The figure above shows the aggregated data, customer 12, has a list of 9 items that he has purchased.

Running the FPGrowth algorithm

We can now run the FPGrowth algorithm, but there is one more thing. Sparklyr does not expose the FPGrowth algorithm (yet), there is no R interface to the FPGrowth algorithm. Luckily, sparklyr allows the user to invoke the underlying Scala methods in Spark. We can define an new object with invoke_new

uid = sparklyr:::random_string("fpgrowth_")   jobj = invoke_new(sc, "org.apache.spark.ml.fpm.FPGrowth", uid)

Now jobj is an object of class FPGrowth in Spark.

jobj class org.apache.spark.ml.fpm.FPGrowth fpgrowth_d4d41f71f3e0

And by looking at the Scala documentation of FPGrowth we see that there are more methods that you can use. We need to use the function invoke, to specify which column contains the list of items, to specify the minimum confidence and to specify the minimum support.

jobj %>%      invoke("setItemsCol", "items") %>%     invoke("setMinConfidence", 0.03) %>%     invoke("setMinSupport", 0.01)  %>%     invoke("fit", spark_dataframe(trx_agg))

By invoking fit, the FPGrowth algorithm is fitted and an FPGrowthModel object is returned where we can invoke associationRules to get the calculated rules in a spark data frame

rules = FPGmodel %>% invoke("associationRules")

The rules in the spark data frame consists of an antecedent column (the left hand side of the rule), a consequent column (the right hand side of the rule) and a column with the confidence of the rule. Note that the antecedent and consequent are lists of items! If needed we can split these lists and collect them to R for plotting for further analysis.

The invoke statements and rules extractions statements can of course be wrapped inside functions to make it more reusable. So given the aggregated transactions in a spark table trx_agg, you can get something like:

GroceryRules =  ml_fpgrowth(   trx_agg ) %>%   ml_fpgrowth_extract_rules() plot_rules(GroceryRules) Conclusion

The complete R script can be found on my GitHub. If arules in R on your laptop is not workable anymore because of the size of your data, consider FPGrowth in Spark through sparklyr.

cheers, Longhow

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Longhow Lam's Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R live class | Professional R Programming | Nov 29-30 Milan

Thu, 11/23/2017 - 19:13

(This article was first published on R blog | Quantide - R training & consulting, and kindly contributed to R-bloggers)

Professional R Programming is the sixth and last course of the autumn term. It takes place in November 29-30 in a location close to Milano Lima.
If you have a solid R knowledge and want to boost your programming skills, this course is made for you.
This course will give you an inner perspective of R working mechanisms, as well as tools for addressing your code’s issues and to make it more efficient. Once these concepts are established, you will learn how to create R packages and use them as the fundamental unit of reproducible R code.

Professional R Programming: Outlines

– Base Programming: environments, functions and loops
– Functionals in base R
– The purrr package
– Code style and clarity
– Profiling
– Parallel computation
– Testing and debugging
– R packages

Professional R Programming is organized by the R training and consulting company Quantide and is taught in Italian, while all the course materials are in English.

This course is for max 6 attendees.

Location

The course location is 550 mt. (7 minutes on walk) from Milano central station and just 77 mt. (1 minute on walk) from Lima subway station.

Registration

If you want to reserve a seat go to: FAQ, detailed program and tickets.

Other R courses | Autumn term

Sadly, this is the last course of the autumn term. Our next R classes’ session will be in Spring! Stay in touch for more updates.

In case you are a group of people interested in more than one class, write us at training[at]quantide[dot]com! We can arrange together a tailor-made course, picking all the topics that are interesting for your organization and dropping the rest.

The post R live class | Professional R Programming | Nov 29-30 Milan appeared first on Quantide – R training & consulting.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R blog | Quantide - R training & consulting. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### EARL Boston round up

Thu, 11/23/2017 - 17:14

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Now we’ve recovered from over indulging in Boston’s culinary delights, we’re ready to share our highlights from this year’s EARL Boston Conference.

Day 1 highlights

Stack Overflow’s David Robinson kicked off the Conference, using Stack Overflow data to perform all sorts of interesting analyses. Highlights included trends in questions mentioning specific R packages over time, leading to the identification of rival R packages. We found that R is the least disliked language (because it’s the best obviously!); although David cautioned that often people who haven’t used R before haven’t heard of it either.

Richie Cotton’s talk on how DataCamp is a ‘data-inspired’ organisation was particularly entertaining and he was a really engaging speaker. It was also great to hear from Emily Riederer about tidycf; she shared a really good example of the type of data-driven revolution taking place in many financial institutions.

We also enjoyed Patrick Turgeon’s presentation on Quantitative Trading with R. His presentation portrayed quantitative trading as a scientific problem to be investigated using diverse sources of information, open source code and hard work. Free from jargon, Patrick demonstrated that placing bets on the markets does not have to be some mysterious art, but an analytic puzzle like any other.

A brilliant first day was rounded out with an evening reception overlooking the Charles River, where we enjoyed drinks and a chance to catch up with everyone at the Conference. It was a great opportunity to chat with all the attendees to find out what talks they enjoyed and what ones they wanted to catch on day two.

Day 2 highlights

Mara Averick got things moving on day two with a witty and humble keynote talk on the importance of good communication in data science. She may have also confessed to getting R banned in her fantasy basketball league. From having to argue with the internet that Krieger’s most common distinctive phrase is “yep yep”, to always having the correct word for any situation, she gave a fantastic presentation; a key skill for any data scientist (even if she says she isn’t one!).

Keeping up the theme of great communication in data science, Ali Zaidi gave a really clear rundown of what deep learning is, and how existing models can be reused to make pattern recognition practicably applicable even with modest hardware.

Other highlights included both Alex Albright’s and Monika Wahi’s talks. Alex showed us lots of enjoyable small analyses and suggested ideas for finding fun datasets and encouraged us all to share our findings when experimenting. Monika Wahi discussed why SAS still dominates in healthcare and how we can convert people to R. She talked about how R is easier and nicer to read and showed us equivalent code in SAS and R to illustrate her point.

It was tough, but we picked just a few of the many highlights from both days at EARL. Please tweet @EARLConf any of your EARL highlights so we can share them – we would love to see what people enjoyed the most.

We’d like to thank all of our attendees for joining us, our fantastic speakers, and our generous sponsors for making EARL Boston the success it has been.

To hear the latest news about EARL Conferences sign up to our mailing list, and we’ll let you know first when tickets are on sale.

You can now find speaker slides (where available) on the EARL website – just click on the speaker profile and download the file.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Happy Thanksgiving!

Thu, 11/23/2017 - 17:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Today is Thanksgiving Day here in the US, so we're taking the rest of the week off to enjoy the time with family.

Even if you don't celebrate Thanksgiving, today is still an excellent day to give thanks to the volunteers who have contributed to the R project and its ecosystem. In particular, give thanks to the R Core Group, whose tireless dedication — in several cases over a period of more than 20 years — was and remains critical to the success and societal contributions of the language we use and love: R. You can contribute financially by becoming a Supporting Member of the R Foundation.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Handling ‘Happy’ vs ‘Not Happy’: Better sentiment analysis with sentimentr in R

Thu, 11/23/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Sentiment Analysis is one of the most obvious things Data Analysts with unlabelled Text data (with no score or no rating) end up doing in an attempt to extract some insights out of it and the same Sentiment analysis is also one of the potential research areas for any NLP (Natural Language Processing) enthusiasts.

For an analyst, the same sentiment analysis is a pain in the neck because most of the primitive packages/libraries handling sentiment analysis perform a simple dictionary lookup and calculate a final composite score based on the number of occurrences of positive and negative words. But that often ends up in a lot of false positives, with a very obvious case being ‘happy’ vs ‘not happy’ – Negations, in general Valence Shifters.

Consider this sentence: ‘I am not very happy’. Any Primitive Sentiment Analysis Algorithm would just flag this sentence positive because of the word ‘happy’ that apparently would appear in the positive dictionary. But reading this sentence we know this is not a positive sentence.

While we could build our own way to handle these negations, there are couple of new R-packages that could do this with ease. One such package is sentimentr developed by Tyler Rinker.

Installing the package

sentimentr can be installed from CRAN or the development version can be installed from github.

install.packages('sentimentr') #or library(devtools) install_github('trinker/sentimentr') Why sentimentr?

The author of the package himself explaining what does sentimentr do that other packages don’t and why does it matter?

“sentimentr attempts to take into account valence shifters (i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners), and adversative conjunctions) while maintaining speed. Simply put, sentimentr is an augmented dictionary lookup. The next questions address why it matters.”

Sentiment Scoring:

sentimentr offers sentiment analysis with two functions: 1. sentiment_by() 2. sentiment()

Aggregated (Averaged) Sentiment Score for a given text with sentiment_by

sentiment_by('I am not very happy', by = NULL) element_id sentence_id word_count sentiment 1: 1 1 5 -0.06708204

But this might not help much when we have multiple sentences with different polarity, hence sentence-level scoring with sentiment would help here.

sentiment('I am not very happy. He is very happy') element_id sentence_id word_count sentiment 1: 1 1 5 -0.06708204 2: 1 2 4 0.67500000

Both the functions return a dataframe with four columns:

1. element_id – ID / Serial Number of the given text
2. sentence_id – ID / Serial Number of the sentence and this is equal to element_id in case of sentiment_by
3. word_count – Number of words in the given sentence
4. sentiment – Sentiment Score of the given sentence

Extract Sentiment Keywords

The extract_sentiment_terms() function helps us extract the keywords – both positive and negative that was part of the sentiment score calculation. sentimentr also supports pipe operator %>% which makes it easier to write multiple lines of code with less assignment and also cleaner code.

'My life has become terrible since I met you and lost money' %>% extract_sentiment_terms() element_id sentence_id negative positive 1: 1 1 terrible,lost money Sentiment Highlighting:

And finally, the highight() function coupled with sentiment_by() that gives a html output with parts of sentences nicely highlighted with green and red color to show its polarity. Trust me, This might seem trivial but it really helps while making Presentations to share the results, discuss False positives and to identify the room for improvements in the accuracy.

'My life has become terrible since I met you and lost money. But I still have got a little hope left in me' %>% sentiment_by(by = NULL) %>% highlight()

Output Screenshot:

Try using sentimentr for your sentiment analysis and text analytics project and do share your feedback in comments. Complete code used here is available on my github.

Related Post

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Arbitrary Data Transforms Using cdata

Wed, 11/22/2017 - 17:36

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

We have been writing a lot on higher-order data transforms lately:

What I want to do now is "write a bit more, so I finally feel I have been concise."

The cdata R package supplies general data transform operators.

• The whole system is based on two primitives or operators cdata::moveValuesToRowsD() and cdata::moveValuesToColumnsD().
• These operators have pivot, un-pivot, one-hot encode, transpose, moving multiple rows and columns, and many other transforms as simple special cases.
• It is easy to write many different operations in terms of the cdata primitives.
• These operators can work-in memory or at big data scale (with databases and Apache Spark; for big data we use the cdata::moveValuesToRowsN() and cdata::moveValuesToColumnsN() variants).
• The transforms are controlled by a control table that itself is a diagram of or picture of the transform.

We will end with a quick example, centered on pivoting/un-pivoting values to/from more than one column at the same time.

Suppose we had some sales data supplied as the following table:

SalesPerson Period BookingsWest BookingsEast a 2017Q1 100 175 a 2017Q2 110 180 b 2017Q1 250 0 b 2017Q2 245 0

Suppose we are interested in adding a derived column: which region the salesperson made most of their bookings in.

library("cdata") ## Loading required package: wrapr library("seplyr") d <- d %.>% dplyr::mutate(., BestRegion = ifelse(BookingsWest > BookingsEast, "West", ifelse(BookingsEast > BookingsWest, "East", "Both")))

Our notional goal is (as part of a larger data processing plan) to reformat the data a thin/tall table or a RDF-triple like form. Further suppose we wanted to copy the derived column into every row of the transformed table (perhaps to make some other step involving this value easy).

We can use cdata::moveValuesToRowsD() to do this quickly and easily.

First we design what is called a transform control table.

cT1 <- data.frame(Region = c("West", "East"), Bookings = c("BookingsWest", "BookingsEast"), BestRegion = c("BestRegion", "BestRegion"), stringsAsFactors = FALSE) print(cT1) ## Region Bookings BestRegion ## 1 West BookingsWest BestRegion ## 2 East BookingsEast BestRegion

In a control table:

• The column names specify new columns that will be formed by cdata::moveValuesToRowsD().
• The values specify where to take values from.

This control table is called "non trivial" as it does not correspond to a simple pivot/un-pivot (those tables all have two columns). The control table is a picture of of the mapping we want to perform.

An interesting fact is cdata::moveValuesToColumnsD(cT1, cT1, keyColumns = NULL) is a picture of the control table as a one-row table (and this one row table can be mapped back to the original control table by cdata::moveValuesToRowsD(), these two operators work roughly as inverses of each other; though cdata::moveValuesToRowsD() operates on rows and cdata::moveValuesToColumnsD() operates on groups of rows specified by the keying columns).

The mnemonic is:

• cdata::moveValuesToColumnsD() converts arbitrary grouped blocks of rows that look like the control table into many columns.
• cdata::moveValuesToRowsD() converts each row into row blocks that have the same shape as the control table.

Because pivot and un-pivot are fairly common needs cdata also supplies functions that pre-populate the controls tables for these operations (buildPivotControlTableD() and buildUnPivotControlTable()).

To design any transform you draw out the control table and then apply one of these operators (you can pretty much move from any block structure to any block structure by chaining two or more of these steps).

We can now use the control table to supply the same transform for each row.

d %.>% dplyr::mutate(., Quarter = substr(Period,5,6), Year = as.numeric(substr(Period,1,4))) %.>% dplyr::select(., -Period) %.>% moveValuesToRowsD(., controlTable = cT1, columnsToCopy = c('SalesPerson', 'Year', 'Quarter')) %.>% arrange_se(., c('SalesPerson', 'Year', 'Quarter', 'Region')) %.>% knitr::kable(.) SalesPerson Year Quarter Region Bookings BestRegion a 2017 Q1 East 175 East a 2017 Q1 West 100 East a 2017 Q2 East 180 East a 2017 Q2 West 110 East b 2017 Q1 East 0 West b 2017 Q1 West 250 West b 2017 Q2 East 0 West b 2017 Q2 West 245 West

Notice we were able to easily copy the extra BestRegion values into all the correct rows.

It can be hard to figure out how to specify such a transformation in terms of pivots and un-pivots. However, as we have said: by drawing control tables one can easily design and manage fairly arbitrary data transform sequences (often stepping through either a denormalized intermediate where all values per-instance are in a single row, or a thin intermediate like the triple-like structure we just moved into).

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Mapping “world cities” in R

Wed, 11/22/2017 - 13:00

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

Here at Sharp Sight, we make a lot of maps.

There are a few reasons for this.

First, good maps are typically ‘information dense.’ You can get a lot of information at a glance from a good map. They are good visualization tools for finding and communicating insights.

Second, it’s extremely easy to get data that you can use to make a map. From a variety of sources, you’ll find data about cities, states, counties, and countries. If you know how to retrieve this data and wrangle it into shape, it will be easy to find data that you can use to make a map.

Finally, map making is just good practice. To create a map like the one we’re about to make, you’ll typically need to use a variety of data wrangling and data visualization tools. Maps make for excellent practice for intermediate data scientists who have already mastered some of the basics.

With that in mind, this week we’ll make a map of “world cities.” This set of cities has been identified by the Globalization and World Cities (GaWC) Research Network as being highly connected and influential in the world economy.

We’re going to initially create a very basic map, but we’ll also create a small multiple version of the map (broken out by GaWC ranking).

Let’s get started.

First, we’ll load the packages that we’ll need.

#============== # LOAD PACKAGES #============== library(tidyverse) library(ggmap) library(forcats)

Next, we’ll input the cities by hard coding them as data frames. To be clear, there is more than one way to do this (e.g., we could scrape the data), but there isn’t that much data here, so doing this manually is acceptable.

#=================== # INPUT ALPHA CITIES #=================== df_alpha_plus_plus <- tibble(city = c('London','New York')) df_alpha_plus <- tibble(city = c('Singapore', 'Hong Kong', 'Paris', 'Beijing' ,'Tokyo', 'Dubai', 'Shanghai')) df_alpha <- tibble(city = c('Sydney', 'São Paulo', 'Milan', 'Chicago' ,'Mexico City', 'Mumbai', 'Moscow', 'Frankfurt' ,'Madrid', 'Warsaw', 'Johannesburg', 'Toronto' ,'Seoul', 'Istanbul', 'Kuala Lumpur', 'Jakarta' ,'Amsterdam', 'Brussels', 'Los Angeles')) df_alpha_minus <- tibble(city = c('Dublin', 'Melbourne', 'Washington', 'New Delhi' ,'Bangkok', 'Zurich', 'Vienna', 'Taipei' ,'Buenos Aires', 'Stockholm', 'San Francisco' ,'Guangzhou', 'Manila', 'Bogotá', 'Miami', 'Luxembourg' ,'Riyadh', 'Santiago', 'Barcelona', 'Tel Aviv', 'Lisbon'))

Now, we’ll create a new variable called rating. This will contain the global city rating.

Notice that this is a very straightforward use of dplyr::mutate(), one of the tidyverse functions you should definitely master.

#======================= # ADD GLOBAL CITY RATING #======================= df_alpha_plus_plus <- df_alpha_plus_plus %>% mutate(rating = 'alpha++') df_alpha_plus <- df_alpha_plus %>% mutate(rating = 'alpha+') df_alpha <- df_alpha %>% mutate(rating = 'alpha') df_alpha_minus <- df_alpha_minus %>% mutate(rating = 'alpha-')

Next, we’ll combine the different data frames into one using rbind().

#====================================== # COMBINE DATAFRAMES INTO ONE DATAFRAME #====================================== alpha_cities <- rbind(df_alpha_plus_plus ,df_alpha_plus ,df_alpha ,df_alpha_minus )

Now that the data are combined into a single data frame, we’ll get the longitude and latitude using geocode().

#======== # GEOCODE #======== latlong <- geocode(alpha_cities$city) #======== # INSPECT #======== alpha_cities latlong Once we have the longitude and latitude data, we need to combine it with the original data in the alpha_cities data frame. To do this, we will use cbind(). #============================ # BIND LAT/LONG TO CITY NAMES #============================ alpha_cities <- cbind(alpha_cities, latlong) %>% rename(long = lon) alpha_cities #names(alpha_cities) Now we have the data that we need, but we’ll need to clean things up a little. In the visualization we’ll make, we will need to use the faceting technique from ggplot2. When we do this, we’ll facet on the rating variable, but we will need the levels of that variable to be ordered properly (otherwise the facets will be out of order). To reorder the factor levels of rating, we will use fct_relevel(). #================================================ # REORDER LEVELS OF GLOBAL CITY RATING # - the global city ratings should be ordered # i.e., alpha++, then alpha+ .... # - to do this, we'll use forecats::fct_relevel() #================================================ alpha_cities <- mutate(alpha_cities, rating = fct_relevel(rating, 'alpha++','alpha+','alpha','alpha-')) levels(alpha_cities$rating)

Because we will be building a map, we’ll need to retrive a map of the world. We can get a world map by using map_data(“world”).

#============== # GET WORLD MAP #============== map_world <- map_data("world")

Ok. We basically have everything we need. Now we will make a simple first draft.

#================ # FIRST DRAFT MAP #================ ggplot() + geom_polygon(data = map_world, aes(x = long, y = lat, group = group)) + geom_point(data = alpha_cities, aes(x = long, y = lat), color = 'red')

… and now we’ll use the faceting technique to break out our plot using the rating variable.

#========================== # CREATE SMALL MULTIPLE MAP #========================== ggplot() + geom_polygon(data = map_world, aes(x = long, y = lat, group = group)) + geom_point(data = alpha_cities, aes(x = long, y = lat), color = 'red') + #facet_grid(. ~ rating) #facet_grid(rating ~ .) facet_wrap(~ rating)

Once again, this is a good example of an intermediate-level project that you could do to practice your data wrangling and data visualization skills.

Having said that, before you attempt to do something like this yourself, I highly recommend that you first master the individual tools that we used here (i.e., the tools from ggplot2, dplyr, and the tidyverse).

To master data science, you need to master the essential tools.

And to make rapid progress, you need to know what to learn, what not to learn, and you need to know how to practice what you learn.

Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.

You’ll learn:

• What data science tools you should learn (and what not to learn)
• How to practice those tools
• How to put those tools together to execute analyses and machine learning projects
• … and more

The post Mapping “world cities” in R appeared first on SHARP SIGHT LABS.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Tips for A/B Testing with R

Wed, 11/22/2017 - 10:19

(This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers)

Which layout of an advertisement leads to more clicks? Would a different color or position of the purchase button lead to a higher conversion rate? Does a special offer really attract more customers – and which of two phrasings would be better?

For a long time, people have trusted their gut feeling to answer these questions. Today all these questions could be answered by conducting an A/B test. For this purpose, visitors of a website are randomly assigned to one of two groups between which the target metric (i.e. click-through rate, conversion rate…) can then be compared. Due to this randomization, the groups do not systematically differ in all other relevant dimensions. This means: If your target metric takes a significantly higher value in one group, you can be quite sure that it is because of your treatment and not because of any other variable.

In comparison to other methods, conducting an A/B test does not require extensive statistical knowledge. Nevertheless, some caveats have to be taken into account.

When making a statistical decision, there are two possible errors (see also table 1): A Type I error means that we observe a significant result although there is no real difference between our groups. A Type II error means that we do not observe a significant result although there is in fact a difference. The Type I error can be controlled and set to a fixed number in advance, e.g., at 5%, often denoted as α or the significance level. The Type II error in contrast cannot be controlled directly. It decreases with the sample size and the magnitude of the actual effect. When, for example, one of the designs performs way better than the other one, it’s more likely that the difference is actually detected by the test in comparison to a situation where there is only a small difference with respect to the target metric. Therefore, the required sample size can be computed in advance, given α and the minimum effect size you want to be able to detect (statistical power analysis). Knowing the average traffic on the website you can get a rough idea of the time you have to wait for the test to complete. Setting the rule for the end of the test in advance is often called “fixed-horizon testing”.

Table 1: Overview over possible errors and correct decisions in statistical tests Effect really exists No Yes Statistical test is significant No True negative Type II error (false negative) Yes Type I error (false positive) True positive

Statistical tests generally provide the p-value which reflects the probability of obtaining the observed result (or an even more extreme one) just by chance, given that there is no effect. If the p-value is smaller than α, the result is denoted as “significant”.

The following code simulates some data and plots the course of the p-value during the test. (For the first samples which are still very small R returns a warning that the chi square approximation may be incorrect.)

library(timeDate) library(ggplot2) # Choose parameters: pA <- 0.05 # True click through rate for group A pB <- 0.08 # True click through rate for group B nA <- 500 # Number of cases for group A nB <- 500 # Number of cases for group B alpha <- 0.05 # Significance level # Simulate data: data <- data.frame(group = rep(c("A", "B"), c(nA, nB)), timestamp = sample(seq(as.timeDate('2016-06-02'), as.timeDate('2016-06-09'), by = 1), nA+nB), clickedTrue = as.factor(c(rbinom(n = nA, size = 1, prob = pA), rbinom(n = nB, size = 1, prob = pB)))) # Order data by timestamp data <- data[order(data$GMT.x..i..), ] levels(data$clickedTrue) <- c("0", "1") # Compute current p-values after every observation: pValues <- c() index <- c() for (i in 50:dim(data)[1]){ presentData <- table(data$group[1:i], data$clickedTrue[1:i]) if (all(rowSums(presentData) > 0)){ pValues <- c(pValues, prop.test(presentData)$p.value) index <- c(index, i)} } results <- data.frame(index = index, pValue = pValues) # Plot the p-values: ggplot(results, aes(x = index, y = pValue)) + geom_line() + geom_hline(aes(yintercept = alpha)) + scale_y_continuous(name = "p-value", limits = c(0,1)) + scale_x_continuous(name = "Observed data points") + theme(text = element_text(size=20)) The figure below shows an example with 500 observations and true rates of 5% in both groups, i.e., no actual difference. You can see that the p-value nevertheless crosses the threshold several times, but finally takes a very high value. By stopping this test early, it would have been very likely to draw a wrong conclusion. Example for the course of a p-value for two groups with no actual difference The following code shows you how to test the difference between two rates in R, e.g., click-through rates or conversion rates. You can apply the code to your own data by replacing the URL to the example data with your file path. To test the difference between two proportions, you can use the function prop.test which is equivalent to Pearson’s chi-squared test. For small samples you should use Fisher’s exact test instead. Prop.test returns a p-value and a confidence interval for the difference between the two rates. The interpretation of a 95% confidence interval is as follows: When conducting such an analysis many times, then 95% of the displayed confidence intervals would contain the true difference. Afterwards you can also take a look at the fluctuations of the p-value during the tests by using the code from above. library(readr) # Specify file path: dataPath <- "https://www.inwt-statistics.de/files/INWT/downloads/exampleDataABtest.csv" # Read data data <- read_csv(file = dataPath) # Inspect structure of the data str(data) ## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 3 variables: ##$ group : chr "A" "A" "A" "B" ... ## $time : POSIXct, format: "2016-06-02 02:17:53" "2016-06-02 03:03:54" ... ##$ clickedTrue: int 0 0 1 0 0 0 0 0 0 0 ... # Change the column names names(data) <- c("group", "time", "clickedTrue") # Change type of group to factor data$group <- as.factor(data$group) # Change type of click through variable to factor data$clickedTrue <- as.factor(data$clickedTrue) levels(data$clickedTrue) <- c("0", "1") # Compute frequencies and conduct test for proportions # (Frequency table with successes in the first column) freqTable <- table(data$group, data$clickedTrue)[, c(2,1)] # print frequency table freqTable ## ## 1 0 ## A 20 480 ## B 40 460 # Conduct significance test prop.test(freqTable, conf.level = .95) ## ## 2-sample test for equality of proportions with continuity ## correction ## ## data: freqTable ## X-squared = 6.4007, df = 1, p-value = 0.01141 ## alternative hypothesis: two.sided ## 95 percent confidence interval: ## -0.071334055 -0.008665945 ## sample estimates: ## prop 1 prop 2 ## 0.04 0.08 There are some more pitfalls, but most of them can easily be avoided. First, as a counterpart of stopping your test early because of a significant result, you could gather more data after the planned end of the test because the results have not yet become significant. This would likewise lead to an α inflation. A second, similar problem arises when running several tests at once: The probability to achieve a false-positive result would then be α for each of the tests. The overall probability that at least one of the results is false-positive is much larger. So always keep in mind that some of the significant results may have been caused by chance. Third, you can also get into trouble when you reach the required sample size very fast and stop the test after a few hours already. You should always consider that the behavior of the users in this specific time slot might not be representative for the general case. To avoid this, you should plan the duration of the test so that it covers at least 24 hours or a week when customers are behaving different at the weekend than on a typical work day. A fourth caveat concerns a rather moral issue: When users discover they are part of an experiment and suffer from disadvantages as a result, they might rightly become angry. (This problem will probably not arise due to a different-colored button, but maybe because of different prices or special offers.) If you are willing to invest some more time, you may want to learn about techniques to avoid α inflation when conducting multiple tests or stopping your test as soon as the p-value crosses a certain threshold. In addition, there are techniques to include previous knowledge in your computations with Bayesian approaches. The latter is especially useful when you have rather small samples, but previous knowledge about the values that your target metric usually takes. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: INWT-Blog-RBloggers. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### October 2017 New Packages Wed, 11/22/2017 - 01:00 (This article was first published on R Views, and kindly contributed to R-bloggers) Of the 182 new packages that made it to CRAN in October, here are my picks for the “Top 40”. They are organized into eight categories: Engineering, Machine Learning, Numerical Methods, Science, Statistics, Time Series, Utilities and Visualizations. Engineering is a new category, and its appearance may be an early signal for the expansion of R into a new domain. The Science category is well-represented this month. I think this is the result of the continuing trend for working scientists to wrap their specialized analyses into R packages. Engineering FlowRegEnvCost 0.1.1: Calculates the daily environmental costs of river-flow regulation by dams based on García de Jalon et al. (2017). rroad v0.0.4: Computes and visualizes the International Roughness Index (IRI) given a longitudinal road profile for a single road segment, or for a sequence of segments with a fixed length. For details on The International Road Roughness Experiment establishing a correlation and a calibration standard for measurements, see the World Bank technical paper. The vignette shows an example of a Road Condition Analysis. The following scaleogram was produced from a continuous wavelet transform of a 3D accelerometer signal. Machine Learning detrendr v0.1.0: Implements a method based on an algorithm by Nolan et al. (2017) for detrending images affected by bleaching. See the vignette MlBayesOpt v0.3.3: Provides a framework for using Bayesian optimization (see Shahriari et al. to tune hyperparameters for support vector machine, random forest, and extreme gradient boosting models. The vignette shows how to set things up. rerf v1.0: Implements an algorithm, Random Forester (RerF), developed by Tomita (2016), which is similar to the Random Combination (Forest-RC) algorithm developed by Breiman (2001). Both algorithms form splits using linear combinations of coordinates. Numerical Methods episode v1.0.0: Provides statistical tools for inferring unknown parameters in continuous time processes governed by ordinary differential equations (ODE). See the Introduction. KGode v1.0.1: Implements the kernel ridge regression and the gradient matching algorithm proposed in Niu et al. (2016), and the warping algorithm proposed in Niu et al. (2017) for improving parameter estimation in ODEs. Science adjclust v0.5.2: Implements a constrained version of hierarchical agglomerative clustering, in which each observation is associated with a position, and only adjacent clusters can be merged. The algorithm, which is time- and memory-efficient, is described in Alia Dehman (2015). There are vignettes on Clustering Hi-C Contact Maps, Implementation Notes, and Inferring Linkage Disequilibrium blocks from Genotypes. hsdar v0.6.0: Provides functions for transforming reflectance spectra, calculating vegetation indices and red edge parameters, and spectral resampling for hyperspectral remote sensing and simulation. The Introduction offers several examples. mapfuser v0.1.2: Constructs consensus genetic maps with LPmerge (See Endelman and Plomion (2014)) and models the relationship between physical distance and genetic distance using thin-plate regression splines (see Wood (2003)). The vignette explains how to use the package. mortAAR v1.0.0: Provides functions for the analysis of archaeological mortality data See Chamberlain (2006). There is a vignette on Lifetables and an Extended Discussion. skyscapeR v0.2.2: Provides a tool set for data reduction, visualization and analysis in skyscape archaeology, archaeoastronomy and cultural astronomy. The vignette shows how to use the package. Statistics BayesRS v0.1.2: Fits hierarchical linear Bayesian models, samples from the posterior distributions of model parameters in JAGS, and computes Bayes factors for group parameters of interest with the Savage-Dickey density ratio ([See Wetzels et al.(2009). There is an Introduction. CatPredi v1.1: Allows users to categorize a continuous predictor variable in a logistic or a Cox proportional hazards regression setting, by maximizing the discriminative ability of the model. See Barrio et al. (2015) and Barrio et al. (2017). CovTools v0.2.1: Provides a collection of geometric and inferential tools for convenient analysis of covariance structures. For an introduction to covariance in multivariate statistical analysis, see Schervish (1987). genlogis v0.5.0: Provides basic distribution functions for a generalized logistic distribution proposed by Rathie and Swamee (2006). emmeans v0.9.1: Provides functions to obtain estimated marginal means (EMMs) for many linear, generalized linear, and mixed models, and computes contrasts or linear functions of EMMs, trends, and comparisons of slopes. There are twelve vignettes including The Basics, Comparisons and Contrasts, Confidence Intervals and Tests, Interaction Analysis, and Working with Messy Data. ESTER v0.1.0: Provides an implementation of sequential testing that uses evidence ratios computed from the Akaike weights of a set of models. For details see Burnham & Anderson (2004). There is a vignette. FarmTest v1.0.0: Provides functions to perform robust multiple testing for means in the presence of latent factors. It uses Huber’s loss function to estimate distribution parameters and accounts for strong dependence among coordinates via an approximate factor model. See Zhou et al.(2017) for details. There is a vignette to get you started. miic v0.1: Implements an information-theoretic method which learns a large class of causal or non-causal graphical models from purely observational data, while including the effects of unobserved latent variables, commonly found in many datasets. For more information see Verny et al. (2017). modcmfitr v0.1.0: Fits a modified version of the Connor-Mosimann distribution ( Connor & Mosimann (1969)), a Connor-Mosimann distribution, or a Dirichlet distribution to elicited quantiles of a multinomial distribution. See the vignette for details. pense v1.0.8: Provides a robust penalized elastic net S and MM estimator for linear regression as described in Freue et al. (2017). paramtest v0.1.0: Enables running simulations or other functions while easily varying parameters from one iteration to the next. The vignette shows how to run a power simulation. rENA v0.1.0: Implements functions to perform epistemic network analysis ENA, a novel method for identifying and quantifying connections among elements in coded data, and representing them in dynamic network models, which illustrate the structure of connections and measure the strength of association among elements in a network. rma.exact v0.1.0: Provides functions to compute an exact CI for the population mean under a random-effects model. For details, see Michael, Thronton, Xie, and Tian (2017). Time Series carfima v1.0.1: Provides a toolbox to fit a continuous-time, fractionally integrated ARMA process CARFIMA on univariate and irregularly spaced time-series data using a general-order CARFIMA(p, H, q) model for p>q as specified in Tsai and Chan (2005). colorednoise v0.0.1: Provides tools for simulating populations with white noise (no temporal autocorrelation), red noise (positive temporal autocorrelation), and blue noise (negative temporal autocorrelation) based on work by Ruokolainen et al. (2009). The vignette describes colored noise. nnfor v0.9: Provides functions to facilitate automatic time-series modelling with neural networks. Look here for some help getting started. Utilities hdf5r v1.0.0: Provides an object-oriented wrapper for the HDF5 API using R6 classes. The vignette shows how to use the package. geoops v0.1.2: Provides tools for working with the GeoJSON geospatial data interchange format. There is an Introduction. linl v0.0.2: Adds a LaTeX Letter class to rmarkdown, using the pandoc-letter template adapted for use with markdown. See the vignette and README for details. oshka v0.1.2: Expands quoted language by recursively replacing any symbol that points to quoted language with the language itself. There is an Introduction and a vignette on Non Standard Evaluation Functions. rcreds v0.6.6: Provides functions to read and write credentials to and from an encrypted file. The vignette describes how to use the package. RMariaDB v1.0-2: Implements a DBI-compliant interface to MariaDB and MySQL databases. securitytxt v0.1.0: Provides tools to identify and parse security.txt files to enable the analysis and adoption of the Web Security Policies draft standard. usethis v1.1.0: Automates package and project setup tasks, including setting up unit testing, test coverage, continuous integration, Git, GitHub, licenses, Rcpp, RStudio projects, and more that would otherwise be performed manually. README provides examples. xltabr v0.1.1: Provides functions to produce nicely formatted cross tabulations to Excel using [openxlsx]((https://cran.r-project.org/package=openxlsx), which has been developed to help automate the process of publishing Official Statistics. Look here for documentation. Visualizations iheatmapr v0.4.2: Provides a system for making complex, interactive heatmaps. Look at the webpage for examples. otvPlots v0.2.0: Provides functions to automate the visualization of variable distributions over time, and compute time-aggregated summary statistics for large datasets. See the README for an introduction. _____='https://rviews.rstudio.com/2017/11/22/october-2017-new-packages/'; var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Le Monde puzzle [#1029] Wed, 11/22/2017 - 00:17 (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers) A convoluted counting Le Monde mathematical puzzle: A film theatre has a waiting room and several projection rooms. With four films on display. A first set of 600 spectators enters the waiting room and vote for their favourite film. The most popular film is projected to the spectators who voted for it and the remaining spectators stay in the waiting room. They are joined by a new set of 600 spectators, who then also vote for their favourite film. The selected film (which may be the same as the first one) is then shown to those who vote for it and the remaining spectators stay in the waiting room. This pattern is repeated for a total number of 10 votes, after which the remaining spectators leave. What are the maximal possible numbers of waiting spectators and of spectators in a projection room? A first attempt by random sampling does not produce extreme enough events to reach those maxima: wm=rm=600 #waiting and watching for (v in 1:V){ film=rep(0,4) #votes on each fiLm for (t in 1:9){ film=film+rmultinom(1,600,rep(1,4)) rm=max(rm,max(film)) film[order(film)[4]]=0 wm=max(wm,sum(film)+600)} rm=max(rm,max(film)+600)} where the last line adds the last batch of arriving spectators to the largest group of waiting ones. This code only returns 1605 for the maximal number of waiting spectators. And 1155 for the maximal number in a projection room. Compared with the even separation of the first 600 into four groups of 150… I thus looked for an alternative deterministic allocation: wm=rm=0 film=rep(0,4) for (t in 1:9){ size=sum(film)+600 film=c(rep(ceiling(size/4),3),size-3*ceiling(size/4)) film[order(film)[4]]=0 rm=max(rm,max(film)+600) wm=max(wm,sum(film)+600)} which tries to preserve as many waiting spectators as possible for the last round (and always considers the scenario of all newcomers backing the largest waiting group for the next film). The outcome of this sequence moves up to 1155 for the largest projection audience and 2264 for the largest waiting group. I however wonder if splitting into two groups in the final round(s) could even increase the size of the last projection. And indeed halving the last batch into two groups leads to 1709 spectators in the final projection. With uncertainties about the validity of the split towards ancient spectators keeping their vote fixed! (I did not think long enough about this puzzle to turn it into a more mathematical problem…) While in Warwick, I reconsidered the problem from a dynamic programming perspective, always keeping the notion that it was optimal to allocate the votes evenly between some of the films (from 1 to 4). Using the recursive R code optiz=function(votz,t){ if (t==9){ return(sort(votz)[3]+600) }else{ goal=optiz(sort(votz)+c(0,0,600,-max(votz)),t+1) goal=rep(goal,4) for (i in 2:4){ film=sort(votz);film[4]=0;film=sort(film) size=sum(film[(4-i+1):4])+600 film[(4-i+1):4]=ceiling(size/i) while (sum(film[(4-i+1):4])>size) film[4]=film[4]-1 goal[i]=optiz(sort(film),t+1)} return(max(goal))}} led to a maximal audience size of 1619. [Which is also the answer provided by Le Monde] Filed under: Books, Kids, R Tagged: combinatorics, Le Monde, R var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### 4-Day Workshop in Toronto: Learn R, tidyverse, and Shiny with Dean Attali and Kirill Muller Tue, 11/21/2017 - 21:00 (This article was first published on Dean Attali's R Blog, and kindly contributed to R-bloggers) Kirill Muller and I are excited to announce a 4-day R workshop in Toronto on January 23-26, 2018! To register or for more information click here Readers of this blog can receive a 10% discount by using promo code TIDYSHINYTEN (limited-time offer). If you know someone who might benefit from the workshop, you can receive 10% of the ticket price using this affiliate link (we will pay you using PayPal). The course will take place in the Hilton Hotel in Toronto, Canada over 4 full days. It is intended for beginners and experienced R users alike. For any questions, feel free to contact Dean at dean@attalitech.com var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Dean Attali's R Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### xray: The R Package to Have X Ray Vision on your Datasets Tue, 11/21/2017 - 19:00 (This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers) This package lets you analyze the variables of a dataset, to evaluate how is the shape of your data. Consider this the first step when you have your data for modeling, you can use this package to analyze all variables and check if there is anything weird worth transforming or even avoiding the variable altogether. Installation You can install xray from github with: # install.packages("devtools") devtools::install_github("sicarul/xray") Usage Anomaly detection xray::anomalies analyzes all your columns for anomalies, whether they are NAs, Zeroes, Infinite, etc, and warns you if it detects variables with at least 80% of rows with those anomalies. It also warns you when all rows have the same value. Example: data(longley) badLongley=longley badLongley$GNP=NA xray::anomalies(badLongley) #> Warning in xray::anomalies(badLongley): Found 1 possible problematic variables: #> GNP #> $variables #> Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct #> 1 GNP 16 16 1 0 0 0 0 0 0 1 #> 2 GNP.deflator 16 0 0 0 0 0 0 0 0 16 #> 3 Unemployed 16 0 0 0 0 0 0 0 0 16 #> 4 Armed.Forces 16 0 0 0 0 0 0 0 0 16 #> 5 Population 16 0 0 0 0 0 0 0 0 16 #> 6 Year 16 0 0 0 0 0 0 0 0 16 #> 7 Employed 16 0 0 0 0 0 0 0 0 16 #> type anomalous_percent #> 1 Logical 1 #> 2 Numeric 0 #> 3 Numeric 0 #> 4 Numeric 0 #> 5 Numeric 0 #> 6 Integer 0 #> 7 Numeric 0 #> #>$problem_variables #> Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct #> 1 GNP 16 16 1 0 0 0 0 0 0 1 #> type anomalous_percent #> 1 Logical 1 #> problems #> 1 Anomalies present in 100% of the rows. Less than 2 distinct values. Distributions

xray::distributions tries to analyze the distribution of your variables, so you can understand how each variable is statistically structured. It also returns a percentiles table of numeric variables as a result, which can inform you of the shape of the data.

distrLongley=longley distrLongley$testCategorical=c(rep('One',7), rep('Two', 9)) xray::distributions(distrLongley) #> Variable p_1 p_10 p_25 p_50 p_75 p_90 #> 1 GNP.deflator 83.78 88.35 94.525 100.6 111.25 114.95 #> 2 GNP 237.8537 258.74 317.881 381.427 454.0855 510.387 #> 3 Unemployed 187.93 201.55 234.825 314.35 384.25 434.4 #> 4 Armed.Forces 147.61 160.3 229.8 271.75 306.075 344.85 #> 5 Population 107.7616 109.2025 111.7885 116.8035 122.304 126.61 #> 6 Year 1947.15 1948.5 1950.75 1954.5 1958.25 1960.5 #> 7 Employed 60.1938 60.7225 62.7125 65.504 68.2905 69.4475 #> p_99 #> 1 116.72 #> 2 549.3859 #> 3 478.725 #> 4 358.695 #> 5 129.7466 #> 6 1961.85 #> 7 70.403 Distributions along a time axis xray::timebased also investigates into your distributions, but shows you the change over time, so if there is any change in the distribution over time (For example a variable stops or starts being collected) you can easily visualize it. dateLongley=longley dateLongley$Year=as.Date(paste0(dateLongley$Year,'-01-01')) dateLongley$Data='Original' ndateLongley=dateLongley ndateLongley$GNP=dateLongley$GNP+10 ndateLongley$Data='Offseted' xray::timebased(rbind(dateLongley, ndateLongley), 'Year') #> [1] “7 charts have been generated.” If you are just starting to learn about Data Science or want to explore further, i encourage you to check this cool book made by my buddy Pablo Casas: The Data Science Live Book var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R - Data Science Heroes Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### βCEA Tue, 11/21/2017 - 17:32 (This article was first published on Gianluca Baio's blog, and kindly contributed to R-bloggers) Recently, I’ve been doing a lot of work on the beta version of BCEA (I was after all born in Agrigento$-$in the picture to the left$-$, which is a Greek city, so a beta version sounds about right…). The new version is only available as a beta-release from our GitHub repository – usual ways to install it are through the devtools package. There aren’t very many changes from the current CRAN version, although the one thing I did change is kind of big. In fact, I’ve embedded the web-app functionalities within the package. So, it is now possible to launch the web-app from the current R session using the new function BCEAweb. This takes as arguments three inputs: a matrix e containing$S$simulations for the measures of effectiveness computed for the$T$interventions; a matrix c containing the simulations for the measures of costs; and a data frame or matrix containing simulations for the model parameters. In fact, none of the inputs is required and the user can actually launch an empty web-app, in which the inputs can be uploaded, say, from a spreadsheet (there are in fact other formats available). I think the web-app facility is not necessary when you’ve gone through the trouble of actually installing the R package and you’re obviously using it from R. But it’s helpful, nonetheless, for example in terms of producing some standard output (perhaps even more than the actual package$-$which I think is more flexible) and of reporting, with the cool facility based on pandoc. This means there are a few more packages “suggested” on installation and potentially a longer compilation time for the package$-$but nothing major. The new version is under testing but I may be able to release it on CRAN soon-ish… And there are other cool things we’re playing around (the links here give all the details!). var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Gianluca Baio's blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Scale up your parallel R workloads with containers and doAzureParallel Tue, 11/21/2017 - 17:30 (This article was first published on Revolutions, and kindly contributed to R-bloggers) by JS Tan (Program Manager, Microsoft) The R language is by and far the most popular statistical language, and has seen massive adoption in both academia and industry. In our new data-centric economy, the models and algorithms that data scientists build in R are not just being used for research and experimentation. They are now also being deployed into production environments, and directly into products themselves. However, taking your workload in R and deploying it at production capacity, and at scale, is no trivial matter. Because of R's rich and robust package ecosystem, and the many versions of R, reproducing the environment of your local machine in a production setting can be challenging. Let alone ensuring your model's reproducibility! This is why using containers is extremely important when it comes to operationalizing your R workloads. I'm happy to announce that the doAzureParallel package, powered by Azure Batch, now supports fully containerized deployments. With this migration, doAzureParallel will not only help you scale out your workloads, but will also do it in a completely containerized fashion, letting your bypass the complexities of dealing with inconsistent environments. Now that doAzureParallel runs on containers, we can ensure a consistent immutable runtime while handling custom R versions, environments, and packages. By default, the container used in doAzureParallel is the 'rocker/tidyverse:latest' container that is developed and maintained as part of the rocker project. For most cases, and especially for beginners, this image will contain most of what is needed. However, as users become more experienced or have more complex deployment requirements, they may want to change the Docker image that is used, or even build their own. doAzureParallel supports both those options, giving you flexibility (without any compromise on reliability). Configuring the Docker image is easy. Once you know which Docker image you want to use, you can simply specify its location in the cluster configuration and doAzureParallel will just know to use it when provisioning subsequent clusters. More details on configuring your Docker container settings with doAzureParallel are included in the documentation. With this release, we hope to unblock many users who are looking to take their R models, and scale it up in the cloud. To get started with doAzureParallel, visit our Github page. Please give it a try and let us know if you have questions, feedback, or suggestions, or via email at razurebatch@microsoft.com. Github (Azure): doAzureParallel var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### dplyr and the design effect in survey samples Tue, 11/21/2017 - 13:59 (This article was first published on Data Literacy - The blog of Andrés Gutiérrez, and kindly contributed to R-bloggers) For those guys like me who are not such R geeks, this trick could be of interest. The package dplyr can be very useful when it comes to data manipulation and you can extract valuable information from a data frame. For example, when using if you want to count how many humans have a particular hair color, you can run the following piece of code: library(dplyr) starwars %>% filter(species == "Human") %>% group_by(hair_color) %>% summarise(n = n()) hair_color n auburn 1 auburn, grey 1 auburn, white 1 black 8 blond 3 brown 14 brown, grey 1 grey 1 none 3 white 2 As a result the former query gives you a data frame and you can use it to make another query. For example, if you want to know the average number of individuals in the data frame you can use the summarise twice: library(dplyr) starwars %>% filter(species == "Human") %>% group_by(hair_color) %>% summarise(n = n()) %>% summarise(x.b = mean(n)) x.b 3.5 Now, turning our attention to statistics, it is known that, when dealing with sample surveys, one measure of interest is the design effect defined as$Deff \approx 1 + (\bar{m} – 1)\rho$where$\bar{m}$is the average cluster size and$\rho$is the intraclass correlation coefficient. If you are dealing with survey data and you want to figure out the value of$\bar{m}$and$\rho$, you can use dplyr. Let’s use the Lucy data of the samplesize4surveys package to show how you can do it. library(samplesize4surveys) data(Lucy) m <- Lucy %>% group_by(Zone) %>% summarise(n = n()) %>% summarise(m = mean(n)) rho <- ICC(y = Lucy$Taxes, cl = Lucy$Zone)$ICC

DEFF <- 1 + (as.integer(m) - 1) * rho
DEFF
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Data Literacy - The blog of Andrés Gutiérrez. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Practical Machine Learning with R and Python – Part 6

Tue, 11/21/2017 - 12:46

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

Introduction

This is the final and concluding part of my series on ‘Practical Machine Learning with R and Python’. In this series I included the implementations of the most common Machine Learning algorithms in R and Python. The algorithms implemented were

1. Practical Machine Learning with R and Python – Part 1 In this initial post, I touch upon regression of a continuous target variable. Specifically I touch upon Univariate, Multivariate, Polynomial regression and KNN regression in both R and Python
2. Practical Machine Learning with R and Python – Part 2 In this post, I discuss Logistic Regression, KNN classification and Cross Validation error for both LOOCV and K-Fold in both R and Python
3. Practical Machine Learning with R and Python – Part 3 This 3rd part included feature selection in Machine Learning. Specifically I touch best fit, forward fit, backward fit, ridge(L2 regularization) & lasso (L1 regularization). The post includes equivalent code in R and Python.
4. Practical Machine Learning with R and Python – Part 4 In this part I discussed SVMs, Decision Trees, Validation, Precision-Recall, AUC and ROC curves
5. Practical Machine Learning with R and Python – Part 5  In this penultimate part, I touch upon B-splines, natural splines, smoothing spline, Generalized Additive Models(GAMs), Decision Trees, Random Forests and Gradient Boosted Treess.

In this last part I cover Unsupervised Learning. Specifically I cover the implementations of Principal Component Analysis (PCA). K-Means and Heirarchical Clustering. You can download this R Markdown file from Github at MachineLearning-RandPython-Part6

1.1a Principal Component Analysis (PCA) – R code

Principal Component Analysis is used to reduce the dimensionality of the input. In the code below 8 x 8 pixel of handwritten digits is reduced into its principal components. Then a scatter plot of the first 2 principal components give a very good visial representation of the data

library(dplyr) library(ggplot2) #Note: This example is adapted from an the example in the book Python Datascience handbook by # Jake VanderPlas (https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html) # Read the digits data (From sklearn datasets) digits= read.csv("digits.csv") # Create a digits classes target variable digitClasses <- factor(digits$X0.000000000000000000e.00.29) #Invoke the Principal Componsent analysis on columns 1-64 digitsPCA=prcomp(digits[,1:64]) # Create a dataframe of PCA df <- data.frame(digitsPCA$x) # Bind the digit classes df1 <- cbind(df,digitClasses) # Plot only the first 2 Principal components as a scatter plot. This plot uses only the # first 2 principal components ggplot(df1,aes(x=PC1,y=PC2,col=digitClasses)) + geom_point() + ggtitle("Top 2 Principal Components") 1.1 b Variance explained vs no principal components – R code

In the code below the variance explained vs the number of principal components is plotted. It can be seen that with 20 Principal components almost 90% of the variance is explained by this reduced dimensional model.

# Read the digits data (from sklearn datasets) digits= read.csv("digits.csv") # Digits target digitClasses <- factor(digits$X0.000000000000000000e.00.29) digitsPCA=prcomp(digits[,1:64]) # Get the Standard Deviation sd=digitsPCA$sdev # Compute the variance digitsVar=digitsPCA$sdev^2 #Compute the percent variance explained percentVarExp=digitsVar/sum(digitsVar) # Plot the percent variance exlained as a function of the number of principal components #plot(cumsum(percentVarExp), xlab="Principal Component", # ylab="Cumulative Proportion of Variance Explained", # main="Principal Components vs % Variance explained",ylim=c(0,1),type='l',lwd=2, # col="blue") 1.1c Principal Component Analysis (PCA) – Python code import numpy as np from sklearn.decomposition import PCA from sklearn import decomposition from sklearn import datasets import matplotlib.pyplot as plt from sklearn.datasets import load_digits # Load the digits data digits = load_digits() # Select only the first 2 principal components pca = PCA(2) # project from 64 to 2 dimensions #Compute the first 2 PCA projected = pca.fit_transform(digits.data) # Plot a scatter plot of the first 2 principal components plt.scatter(projected[:, 0], projected[:, 1], c=digits.target, edgecolor='none', alpha=0.5, cmap=plt.cm.get_cmap('spectral', 10)) plt.xlabel('PCA 1') plt.ylabel('PCA 2') plt.colorbar(); plt.title("Top 2 Principal Components") plt.savefig('fig1.png', bbox_inches='tight') 1.1 b Variance vs no principal components – Python code import numpy as np from sklearn.decomposition import PCA from sklearn import decomposition from sklearn import datasets import matplotlib.pyplot as plt from sklearn.datasets import load_digits digits = load_digits() # Select all 64 principal components pca = PCA(64) # project from 64 to 2 dimensions projected = pca.fit_transform(digits.data) # Obtain the explained variance for each principal component varianceExp= pca.explained_variance_ratio_ # Compute the total sum of variance totVarExp=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100) # Plot the variance explained as a function of the number of principal components plt.plot(totVarExp) plt.xlabel('No of principal components') plt.ylabel('% variance explained') plt.title('No of Principal Components vs Total Variance explained') plt.savefig('fig2.png', bbox_inches='tight') 1.2a K-Means – R code In the code first the scatter plot of the first 2 Principal Components of the handwritten digits is plotted as a scatter plot. Over this plot 10 centroids of the 10 different clusters corresponding the 10 diferent digits is plotted over the original scatter plot. library(ggplot2) # Read the digits data digits= read.csv("digits.csv") # Create digit classes target variable digitClasses <- factor(digits$X0.000000000000000000e.00.29) # Compute the Principal COmponents digitsPCA=prcomp(digits[,1:64]) # Create a data frame of Principal components and the digit classes df <- data.frame(digitsPCA$x) df1 <- cbind(df,digitClasses) # Pick only the first 2 principal components a<- df[,1:2] # Compute K Means of 10 clusters and allow for 1000 iterations k<-kmeans(a,10,1000) # Create a dataframe of the centroids of the clusters df2<-data.frame(k$centers) #Plot the first 2 principal components with the K Means centroids ggplot(df1,aes(x=PC1,y=PC2,col=digitClasses)) + geom_point() + geom_point(data=df2,aes(x=PC1,y=PC2),col="black",size = 4) + ggtitle("Top 2 Principal Components with KMeans clustering") 1.2b K-Means – Python code

The centroids of the 10 different handwritten digits is plotted over the scatter plot of the first 2 principal components.

import numpy as np from sklearn.decomposition import PCA from sklearn import decomposition from sklearn import datasets import matplotlib.pyplot as plt from sklearn.datasets import load_digits from sklearn.cluster import KMeans digits = load_digits() # Select only the 1st 2 principal components pca = PCA(2) # project from 64 to 2 dimensions projected = pca.fit_transform(digits.data) # Create 10 different clusters kmeans = KMeans(n_clusters=10) # Compute the clusters kmeans.fit(projected) y_kmeans = kmeans.predict(projected) # Get the cluster centroids centers = kmeans.cluster_centers_ centers #Create a scatter plot of the first 2 principal components plt.scatter(projected[:, 0], projected[:, 1], c=digits.target, edgecolor='none', alpha=0.5, cmap=plt.cm.get_cmap('spectral', 10)) plt.xlabel('PCA 1') plt.ylabel('PCA 2') plt.colorbar(); # Overlay the centroids on the scatter plot plt.scatter(centers[:, 0], centers[:, 1], c='darkblue', s=100) plt.savefig('fig3.png', bbox_inches='tight') 1.3a Heirarchical clusters – R code

Herirachical clusters is another type of unsupervised learning. It successively joins the closest pair of objects (points or clusters) in succession based on some ‘distance’ metric. In this type of clustering we do not have choose the number of centroids. We can cut the created dendrogram mat an appropriate height to get a desired and reasonable number of clusters These are the following ‘distance’ metrics used while combining successive objects

• Ward
• Complete
• Single
• Average
• Centroid
# Read the IRIS dataset iris <- datasets::iris iris2 <- iris[,-5] species <- iris[,5] #Compute the distance matrix d_iris <- dist(iris2) # Use the 'average' method to for the clsuters hc_iris <- hclust(d_iris, method = "average") # Plot the clusters plot(hc_iris) # Cut tree into 3 groups sub_grp <- cutree(hc_iris, k = 3) # Number of members in each cluster table(sub_grp) ## sub_grp ## 1 2 3 ## 50 64 36 # Draw rectangles around the clusters rect.hclust(hc_iris, k = 3, border = 2:5) 1.3a Heirarchical clusters – Python code from sklearn.datasets import load_iris import matplotlib.pyplot as plt from scipy.cluster.hierarchy import dendrogram, linkage # Load the IRIS data set iris = load_iris() # Generate the linkage matrix using the average method Z = linkage(iris.data, 'average') #Plot the dendrogram #dendrogram(Z) #plt.xlabel('Data') #plt.ylabel('Distance') #plt.suptitle('Samples clustering', fontweight='bold', fontsize=14); #plt.savefig('fig4.png', bbox_inches='tight') Conclusion

This is the last and concluding part of my series on Practical Machine Learning with R and Python. These parallel implementations of R and Python can be used as a quick reference while working on a large project. A person who is adept in one of the languages R or Python, can quickly absorb code in the other language.

Hope you find this series useful!

More interesting things to come. Watch this space!

References

1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibesherani, Online Stanford
2. Applied Machine Learning in Python Prof Kevyn-Collin Thomson, University Of Michigan, Coursera

To see all posts see ‘Index of posts

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Using an R ‘template’ package to enhance reproducible research or the ‘R package syndrome’

Tue, 11/21/2017 - 11:00
Motivation

Have you ever had the feeling that the creation of your data analysis report(s) resulted in looking up, copy-pasting and reuse of code from previous analyses?
This approach is time consuming and prone to errors. If you frequently analyze similar data(-types), e.g. from a standardized analysis workflow or different experiments on the same platform, the automation of your report creation via an R ‘template’ package might be a very useful and time-saving step. It also allows to focus on the important part of the analysis (i.e. the experiment or analysis-specific part).
Imagine that you need to analyze tens or hundreds of runs of data in the same format, making use of an R ‘template’ package can save you hours, days or even weeks. On the go, reports can be adjusted, errors corrected and extensions added without much effort.

A bit of history

The reproducibility of an analysis workflow remains one complex challenge for R data scientists. The creation of dynamic and reproducible analysis report in R started to spread with the creation of Sweave [1] by Friederich Leisch, enabling the creation of mixed R and LaTeX reports. This was further expanded with the knitr package [2] by Yihui Xie in 2012. The concept of child documents and parametrized (rmarkdown) reports is another step towards this goal [3]. Our approach integrates the efforts towards reproducible research by means of report templates integrated within an R package structure, namely the ‘R template’ package.

Advantages of an R ‘template’ package

Integrating report templates in an R package structure enables to take advantage of the functionalities of an R package. R functions and report templates, wrapping the entire analysis workflow can be contained in the same code structure.
Bug fixes, extensions of an analysis can be tracked via the package versioning. Because your entire analysis workflow is contained within a unique container, the code for the creation of the analysis report can easily be exchanged with collaborators and a new version installed smoothly in a standard manner.
Successive modifications of the reports can easily be applied to previous analysis reports.

How do you start the development and/or writing of your own R ‘templates’?

Distinguish the data(experiment)-specific (e.g. different models, visualization) from the frequently used analysis parts. The latter one contains the parts of the analysis that are consistent throughout the different analyses or experiments (i.e. the code that you were copy-pasting from previous analyses). Think of e.g. the reporting of results from a standard analysis or modeling technique like linear discriminant analysis. The modeling function, visualization and results formatting is the same irrespective of the specific experiment at hand. The experiment-specific part consists of e.g. the input data, model or analysis parameters, etc.
The code which is consistent throughout the different analyses can be captured in R functions or the equivalent R ‘template’ documents for reporting. The R ‘template’ documents can indeed be seen as the equivalent of an R function for reporting purposes, integrated within an R package, with input parameters and potentially some default values (i.e. the most common value).
Each part of an analysis report can be contained in separated modular child document(s), integrated within an R package structure.
This can be either one ‘template’ document or several (nested) ‘template’ documents, depending on personal preference regarding structuring (e.g. one document per analysis technique) as well as complexity of the analysis at hand. An advantage of using a more complex nested approach is that it allows to include or exclude several parts of the analysis depending on requests of the user and/or the data at hand and/or the findings within the analysis.
This so called template document(s) can be run from outside the package, and for data/parameters specific of a certain analysis/experiment.

Our suggested approach

The approach that we recommend makes use of a main ‘master’ ‘template’ which calls on its turn the (different) child document(s) contained in the R ‘template’ package (see figure). This allows the developers to easily extend the ‘template’ without running into issues with previous reports.

It is advisable and user-friendly to create a start template which mentions the required and optional input parameters necessary for the downstream analysis. A dedicated function can be created to extract the template(s) path (after the package installation), which can be used to render/run the document.

To enhance the approach:

• progress messages can be implemented in the template to follow the report creation
• default parameters can be set at the start of each template
• session information can be included at the end of the main template to track the versions of the R package and its dependencies.
How to use an R ‘template’ package?

The template(s) contained in the R package can be either called directly in R. There is the option of specifying the necessary parameters for the report directly inside the document or to pass them via the render call which provides a cleaner R environment. The latter option might be quite complicated, depending on the analysis at hand.
The report creation can also be combined with a Shiny user interface. This has the advantage that even non-(expert) R-users can easily create a reproducible report without specifying anything in R at all. (Although it might be that your report does not lend itself to a shiny interface.) Suppose your data are in a standardized format (e.g. several runs of data with the same type of information). In this case a shiny application can be developed with a list of possible runs, some user-specified additional parameters and a button to create and/or download the report. In this way even users without any R knowledge can create their own standardized report in a straightforward way from the same R template package.

To summarize, using an R template package makes the approach less prone to errors, you save time and the reports are consistent across different analyses. It is easy to keep track of changes, extensions and bug fixes, via appropriate package versioning which ensures the reproducibility of an entire analysis workflow. The combination of an R ‘template’ package with a shiny application creates opportunities for non(-expert) R users to create their own report.

Is the development of an R ‘template’ worthwhile the effort?

The development of such template reports might seem cumbersome and time-consuming; and the use of the R package structure for such purposes overkill, a.k.a the ‘R package syndrome’.
However a lot of time and analysis reproducibility can be gained when using this package afterwards, which makes the effort worthwhile.

All the tools are already available thanks to the open R community, so do-it-yourself!

Our presentation at the UseR2017 conference is available here. An example of an R ‘template’ package with a shiny interface is available here. Feel free to contact us for more information: Laure Cougnaud (laure.cougnaud.at.openanalytics.eu), Kirsten Van Hoorde (kirsten.vanhoorde.at.openanalytics.eu).

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### How to Build a Geographic Dashboard with Real-Time Data

Tue, 11/21/2017 - 03:08

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

In this post, I show how to build an interactive geographic dashboard using Displayr, Plotly and R. It is particularly fascinating in that it tracks the real-time position of military aircraft. To do so, I am first going to bring in some data from two separate sources (regions based on the size of military air-force, and then the real-time tracking of aircraft positions). The dashboard displays the dynamic data in two ways: region shading (to indicate the size of the military force per country) and point markers (for the aircraft positions). I then build a map to neatly and attractively display all this data.

Reading tabular data from the web

I am going to shade each country of the map according to the size of its military air-force. To do this I need a list of aircraft per country. Fortunately, the ever useful Wikipedia has just what I need (here). The following code reads in the data and cleans it to produce a neat table.

library(httr) library(XML) # Extract table via URL url &amp;amp;lt;- "https://en.wikipedia.org/wiki/List_of_countries_by_level_of_military_equipment#List" r &amp;amp;lt;- GET(url) airforces &amp;amp;lt;- readHTMLTable(doc = content(r, "text"))[[2]] # Clean required columns airforces &amp;amp;lt;- airforces[-1, c("Country[note 1]", "Military aircraft[note 3]")] colnames(airforces) &amp;amp;lt;- c("Country", "MilitaryAircraft") remove.bracket.content &amp;amp;lt;- function(s) { return(gsub("\$.+\$", "", s)) } airforces &amp;amp;lt;- data.frame(apply(airforces, 2, remove.bracket.content)) airforces$MilitaryAircraft &amp;amp;lt;- as.numeric(gsub(",", "", airforces$MilitaryAircraft)) airforces Pooling real-time data from across the globe

Compared to the above, the second source of data is more dynamic. I am using ADS-B, which pools real-time flight tracking information from across the globe. Not all military operations are top secret. Apparently, some aircraft operated by the armed forces broadcast their positions publicly.

To collate this information, I construct a URL to retrieve a JSON object with information about military aircraft (JSON is flexible text format used for data exchange). Then I parse the JSON to create a data.frame.

library(jsonlite)&amp;amp;lt;/pre&amp;amp;gt; url &amp;amp;lt;- "http://public-api.adsbexchange.com/VirtualRadar/AircraftList.json?" url &amp;amp;lt;- paste0(url, "fMilQ=TRUE") positions &amp;amp;lt;- fromJSON(url)$acList if (length(positions) != 0) { positions &amp;amp;lt;- positions[positions$Type != "TEST", ] positions &amp;amp;lt;- positions[!is.na(positions$Lat), ] } positions Shading countries of a map The following code produces a plotly map of the world. The countries are shaded according to the size of the air-force, with the scale shown on a legend. In plotly terminology, each layer of the map is known as a trace. library(plotly) library(flipFormat) # specify map area and projection g &amp;amp;lt;- list(scope = "world", showframe = FALSE, showcoastlines = TRUE, projection = list(type = 'mercator'), lonaxis = list(range = c(-140, 179)), lataxis = list(range = c(-55, 70)), resolution = 50) # higher resolution # shade countries by airforce size p &amp;amp;lt;- plot_geo(airforces) %&amp;amp;gt;% add_trace(data = airforces, name = "Airforce", z = ~MilitaryAircraft, color = ~MilitaryAircraft, colors = 'Blues', locations = ~Country, marker = list(line = list(color = toRGB("grey"), width = 0.5)), showscale = TRUE, locationmode = "country names", colorbar = list(title = 'Airforce', separatethousands = TRUE)) %&amp;amp;gt;% config(displayModeBar = F) %&amp;amp;gt;% layout(geo = g, margin = list(l=0, r=0, t=0, b=0, pad=0), paper_bgcolor = 'transparent') Adding markers for the aircraft Finally, I add markers showing the airplane locations as another trace. I use a different color for those with speed less than 200 knots and altitude less than 2000 feet. The hover text contains more detail about the aircraft. aircolors = rep("airborne", nrow(positions)) # create a vector with the status of each aeroplane aircolors[positions$Spd &amp;amp;lt; 200 &amp;amp;amp; positions$Alt &amp;amp;lt; 2000] &amp;amp;lt;- "ground/approach" hovertext = paste0("Operator:", positions$Op, "\nModel:", positions$Mdl, "\nAltitide(ft):", sapply(positions$Alt, FormatAsReal)) hoverinfo = rep("all", nrow(positions)) p = add_trace(p, data = positions, x = positions$Long, y = positions$Lat, color = aircolors, hovertext = hovertext, showlegend = FALSE)

The final result is shown below. Since broadcasting of position is purely voluntary, I suspect that the Top Gun flights are not shown!

While the map above shows the required information, it can easily be made more useful and appealing. Displayr allows me to add a control to switch between regions, text and image annotations, and a background. Here is a link to the finished dashboard, with a screenshot shown below.

Thinking about it again, if you can’t see any military aircraft at all on the radar, then you should really worry!

Try it yourself

You can explore the dashboards used in this post within Displayr. All the code is embedded within the pages (look on the right-hand panel after you click on an object). Click here to see the copy of the working document (without background and finishing touches). Click here open a copy of the finished document that includes the background, control box, and other finishing touches.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R charts in a Tweet

Mon, 11/20/2017 - 21:25

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Twitter recently doubled the maximum length of a tweet to 280 characters, and while all users now have access to longer tweets, few have taken advantage of the opportunity. Bob Rudis used the rtweet package to analyze tweets sent with the #rstats hashtag since 280-char tweets were introduced, and most still kept below the old 280-character limit. The actual percentage differed by the Twitter client being used; I’ve reproduced the charts for the top 10 clients below. (Click the chart for even more clients, and details of the analysis including the R code that generated it.)

That being said, some have embraced the new freedom with gusto, not least my Microsoft colleague Paige Bailey who demonstrated you can fit some pretty nice R charts — and even animations! — into just 280 characters:

df <- expand.grid(x = 1:10, y=1:10)
df$angle <- runif(100, 0, 2*pi) df$speed <- runif(100, 0, sqrt(0.1 * df$x)) ggplot(df, aes(x, y)) + geom_point() + geom_spoke(aes(angle = angle, radius = speed)) y’all twitterpeople give me 280 characters? yr just gonna get code samples pic.twitter.com/hyGEE2DxGy — @DynamicWebPaige (@DynamicWebPaige) November 8, 2017 x <- getURL(“https://t.co/ivZZvodbNK“) b <- read.csv(text=x) c <- get_map(location=c(-122.080954,36.971709), maptype=”terrain”, source=”google”, zoom=14) ggmap(c) + geom_path(data=b, aes(color=elevation), size=3)+ scale_color_gradientn(colours=rainbow(7), breaks=seq(25, 200, 25)) pic.twitter.com/7WdQLR56uZ — @DynamicWebPaige (@DynamicWebPaige) November 11, 2017 library(dygraphs) lungDeaths <- cbind(mdeaths, fdeaths) dygraph(lungDeaths) %>% dySeries(“mdeaths”, label = “Male”) %>% dySeries(“fdeaths”, label = “Female”) %>% dyOptions(stackedGraph = TRUE) %>% dyRangeSelector(height = 20) I ❤️ R’s interactive visualizations SO MUCH pic.twitter.com/LevjElly3L — @DynamicWebPaige (@DynamicWebPaige) November 16, 2017 library(plot3D) par(mar = c(2, 2, 2, 2)) par(mfrow = c(1, 1)) x <- seq(0, 2*pi,length.out=50) y <- seq(0, pi,length.out=50) M <- mesh(x, y) surf3D(x = (3+2*cos(M$x)) * cos(M$y), y = (3+2*cos(M$x)) * sin(M$y), z = 2 * sin(M$x),
colkey=FALSE,

— @DynamicWebPaige (@DynamicWebPaige) November 17, 2017

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Kaggle’s Advanced Regression Competition: Predicting Housing Prices in Ames, Iowa

Mon, 11/20/2017 - 19:31

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Introduction

Kaggle.com is a website designed for data scientists and data enthusiasts to connect and compete with each other. It is an open community that hosts forums and competitions in the wide field of data. In each Kaggle competition, competitors are given a training data set, which is used to train their models, and a test data set, used to test their models. Kagglers can then submit their predictions to view how well their score (e.g., accuracy or error) compares to others’.

As a team, we joined the House Prices: Advanced Regression Techniques Kaggle challenge to test our model building and machine learning skills. For this competition, we were tasked with predicting housing prices of residences in Ames, Iowa. Our training data set included 1460 houses (i.e., observations) accompanied by 79 attributes (i.e., features, variables, or predictors) and the sales price for each house. Our testing set included 1459 houses with the same 79 attributes, but sales price was not included as this was our target variable.

To view our code (split between R and Python) and our project presentation slides for this project see our shared GitHub repository.

Understanding the Data

Of the 79 variables provided, 51 were categorical and 28 were continuous.

Our first step was to combine these data sets into a single set both to account for the total missing values and to fully understand all the classes for each categorical variable. That is, there might be missing values or different class types in the test set that are not in the training set.

Processing the Data

Response Variable

As our response variable, Sale Price, is continuous, we will be utilizing regression models. One assumption of linear regression models is that the error between the observed and expected values (i.e., the residuals) should be normally distributed. Violations of this assumption often stem from a skewed response variable. Sale Price has a right skew, so we log + 1 transform it to normalize its distribution.

Missing Data

Machine learning algorithms do not handle missing values very well, so we must obtain an understanding of the missing values in our data to determine the best way to handle them. We find that 34 of the predictor variables have values that are interpreted by R and Python as missing (i.e., “NA” and “NaN”). Below we describe examples of some of the ways we treated these missing data.

1) NA/NaN is actually a class:

In many instances, what R and Python interpret as a missing value is actually a class of the variable. For example, Pool Quality is comprised of 5 classes: Excellent, Good, Fair, Typical, and NA. The NA class describes houses that do not have a pool, but our coding languages interpret houses of NA class as a missing value instead of a class of the Pool Quality variable.

Our solution was to impute most of the NA/NaN values to a value of “None.”

2) Not every NA/NaN corresponds to a missing attribute:

While we found that most NA/NaN values corresponded to an actual class for different variables, some NA/NaN values actually represented missing data. For example, we find that three houses with NA/NaN values for Pool Quality, also have a non-zero value for the variable, Pool Area (square footage of pool). These three houses likely have a pool, but its quality was not assessed or input into the data set.

Our solution was to first calculate mean Pool Area for each class of Pool Quality, then impute the missing Pool Quality classes based on how close that house’s Pool Area was to the mean Pool Areas for each Pool Quality class. For example, the first row in the below picture on the left has a Pool Area of 368 square feet. The average Pool Area for houses with Excellent pool quality (Ex) is about 360 square feet (picture on the right). Therefore, we imputed this house to have a Pool Quality of Excellent.

3) Domain knowledge:

Some variables had a moderate amount of missingness. For example, about 17% of the houses were missing the continuous variable, Lot Frontage, the linear feet of street connected to the property. Intuitively, attributes related to the size of a house are likely important factors regarding the price of the house. Therefore, dropping these variables seems ill-advised.

Our solution was based on the assumption that houses in the same neighborhood likely have similar features. Thus, we imputed the missing Lot Frontage values based on the median Lot Frontage for the neighborhood in which the house with missing value was located.

4) Imputing with mode:

Most variables have some intuitive relationship to other variables, and imputation can be based on these related features. But some missing values are found in variables with no apparent relation to others. For example, the Electrical variable, which describes the electrical system, was missing for a single observation.

Our solution was to simply find the most common class for this categorical variable and impute for this missing value.

Ordinal Categories

For linear (but not tree-based) models, categorical variables must be treated as continuous. There are two types of categorical features: ordinal, where there is an inherent order to the classes (e.g., Excellent is greater than Good, which is greater than Fair), and nominal, where there is no obvious order (e.g., red, green, and blue).

Our solution for ordinal variables was to simply assign the classes a number corresponding to their relative ranks. For example, Kitchen Quality has five classes: Excellent, Good, Typical, Fair, and Poor, which we encoded (i.e., converted) to the numbers 5, 4, 3, 2, and 1, respectively.

Nominal Categories

The ranking of nominal categories is not appropriate as there is no actual rank among the classes.

Our solution was to one-hot encode these variables, which creates a new variable for each class with values of zero (not present) or one (present).

Outliers

An outlier can be defined with a quantitative (i.e., statistical) or qualitative definition. We opted for the qualitative version when looking for outliers: observations that are abnormally far from other values. Viewing the relationship between Above Ground Living Area and Sale Price, we noticed some very large areas for very low prices.

Our solution was to remove these observations as we thought they fit our chosen definition of an outlier, and because they might increase our models’ errors.

Skewness

While there are few assumptions regarding the independent variables of regression models, often transforming skewed variables to a normal distribution can improve model performance.

Our solution was to log + 1 transform several of the predictors.

Near Zero Predictors

Predictors with very low variance offer little predictive power to models.

Our solution was to find the ratio of the second most frequent value to the most frequent value for each predictor, and to remove variables where this ratio was less than 0.05. This roughly translates to dropping variables where 95% or more of the values are the same.

Feature Engineering

Feature (variable or predictor) engineering is one of the most important steps in model creation. Often there is valuable information “hidden” in the predictors that is only revealed when manipulating these features in some way. Below are just some examples of the features we created:

• Remodeled (categorical): Yes or No if Year Built is different from Year Remodeled; If the year the house was remodeled is different from the year it was built, the remodeling likely increases property value
• Seasonality (categorical): Combined Month Sold with Year Sold; While more houses were sold during summer months, this likely varies across years, especially during the time period these houses were sold, which coincides with the housing crash (2006-2010)
• New House (categorical): Yes or No if Year Sold is equal to Year Built; If a house was sold the same year it was built, we might expect it was in high demand and might have a higher Sale Price
• Total Area (continuous): Sum of all variables that describe the area of different sections of a house; There are many variables that pertain to the square footage of different aspects of each house, we might expect that the total square footage has a strong influence on Sale Price

Models and Results

Now that we have prepared our data set, we can begin training our models and use them to predict Sale Price.

We trained and tested dozens of versions of the models described below with different combinations of engineered features and processed variables. The information in the table represents our best results for each model. The table explains the pros and cons for each model type, the optimal hyperparameters found through either grid search or Bayesian optimization, our test score, and the score we received from Kaggle. Our scores the root mean square error (RMSE) of our predictions, which is a metric for describing the difference between the observed values and our predicted values for Sale Price; scores closer to zero are better.

For brevity, we will not describe the details of the different models. However, see the following links for more information about how each model is used to create predictions: random forest, gradient boost, XGBoost, elastic net regularization for regression.

Below are plots summarizing variables that contribute most to the respective model’s prediction of Sale Price.

For most models, predictors related to square footage (Area), quality (different Quality measures), and age (Year Built) have the strongest impact on each model’s predictions.

There is no visualization for our best model, which was an ensemble of four other models. The predictions for this ensembled model are calculated by averaging the predictions from the separate models (two linear regression models and two tree-based models). The idea is that each model’s predictions include error both above and below the real values, and the averaged predictions of the best models might have less overall error than any one single model.

One note is that tree-based models (random forest, gradient boosting, and XGBoosting) cannot provide an idea of positive or negative influence for each variable on Sale Price, rather they can only provide an idea of how important that variable is to the models’ predictions overall. In contrast, linear models can provide information about which variables positively and negatively influence Sale Price. For the figure immediately above, the strongest predictor, residency in a Commercial Zone, is actually negatively related to Sale Price.

Conclusions

The objective of this Kaggle competition was to build models to predict housing prices of different residences in Ames, IA. Our best model resulted in an RMSE of 0.1071, which translates to an error of about \$9000 (or about 5%) for the average-priced house.

While this error is quite low, the interpretability of our model is poor. Each model found within our ensembled model varies with respect to the variables that are most important to predicting Sale Price. The best way to interpret our ensemble is to look for shared variables among its constituent models. The variables seen as most important or as strongest predictors through our models were those related to square footage, the age and condition of the home, the neighborhood where the house was located, the city zone where the house was located, and the year the house was sold.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));