
Shiny server series part 3: adding SSL encryption

Sun, 04/23/2017 - 19:16

(This article was first published on Jasper Ginn's blog, and kindly contributed to R-bloggers)

This guide is part of a series on setting up your own private server running Shiny apps. There are many guides with great advice on how to set up an R Shiny server and related software. I try to make a comprehensive guide based in part on these resources as well as my own experience. I always aim to properly attribute information to its respective sources. If you notice an issue, please contact me.

Part 1 of this series is available here
Part 2 of this series is available here

In part 1 and part 2 of this series, we set up an Ubuntu 16.04 server to host Shiny applications. Thus far, we have configured Shiny Server to listen on ports 3838 (for public apps) and 4949 (for private apps). In this part, we will set up SSL encryption on the server for additional security.

Resources used for this part

This guide is largely based on this tutorial.

Adding SSL encryption to your server

We’re going to use certbot and Let’s Encrypt to set up the SSL certificate.

Log into your server and switch to the Shiny user:

# Log into the server
ssh root@<your-VPS-ip>
# Switch to shiny user
su shiny

Go to the sbin folder on your server and download certbot-auto:

cd /usr/local/sbin
sudo wget https://dl.eff.org/certbot-auto

Make the script executable:

sudo chmod a+x /usr/local/sbin/certbot-auto

Now, open up the nginx configuration:

sudo nano /etc/nginx/sites-available/default

Take note of the root location, shown in the image below surrounded by the blue box. For the remainder of this tutorial, I’ll assume that your root location is /var/www/html. If it is not, make sure to substitute your root location for mine when executing the commands below.

Then, add the contents below to the nginx configuration (surrounded by the red box in the image)

location ~ /.well-known {
    allow all;
}

Restart nginx:

sudo service nginx restart

Take your root location and your domain name (with www. and without it) and fill them in for the webroot path and domain name placeholders in the command below. Don’t forget to change <extension> to your extension (e.g. .nl, .com, .eu). Then, execute this command:

certbot-auto certonly -a webroot --webroot-path=/var/www/html -d <your-domain-name>.<extension> -d www.<your-domain-name>.<extension>

Next, we generate a strong Diffie–Hellman group for extra security:

sudo openssl dhparam -out /etc/ssl/certs/dhparam.pem 2048

Let’s Encrypt certificates expire after 90 days, so it’s a good idea to renew our certificate regularly. We’ll set up a cron job that does this every week. Access cron by executing the following:

sudo crontab -e

Add the following lines:

30 2 * * 1 /usr/local/sbin/certbot-auto renew >> /var/log/le-renew.log
35 2 * * 1 /etc/init.d/nginx reload

Hit control+X, then Y and Enter, and your changes will be saved. Congratulations, you have now successfully set up an SSL certificate on your server! Note that SSL encryption itself is not yet operational; we’ll take care of that in the next part, when we add user authentication to our private Shiny server using Auth0.


Experimental Design Exercises

Sun, 04/23/2017 - 18:50

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In this set of exercises we shall follow the practice of conducting an experimental study. The researcher wants to see whether working out influences body mass. Three groups of subjects with similar eating and exercise habits were included in the experiment. Each group was subjected to a different set of exercises. Body mass was measured before and after the workout program. The focus of the research is the difference in body mass between groups, measured after working out. In order to examine these effects, we shall use the paired t test, the t test for independent samples, one-way and two-way analysis of variance, and analysis of covariance.

You can download the dataset here. The data is fictitious.

Answers to the exercises are available here.

If you have a different solution, feel free to post it.

Exercise 1

Load the data. Calculate descriptive statistics and test the normality of both the initial and final measurements, for the whole sample and for each group.

Exercise 2

Is there an effect of the exercises, and what is the size of that effect for each group? (Tip: You should use the paired t test.)
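A minimal sketch of how this might look in R, assuming the data are in a data frame called body with columns group, initial and final (these names are illustrative, not taken from the downloadable file):

# paired t test of final vs. initial body mass within one group,
# plus a simple effect size (mean difference / SD of the differences)
grp <- subset(body, group == "A")
t.test(grp$final, grp$initial, paired = TRUE)
mean(grp$final - grp$initial) / sd(grp$final - grp$initial)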

Exercise 3

Is the variance of body mass at the final measurement the same for each of the three groups? (Tip: Use Levene’s test for homogeneity of variances.)
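A sketch with the same illustrative column names as above (leveneTest() comes from the car package):

# Levene's test for homogeneity of variances of the final measurement across groups
library(car)
leveneTest(final ~ group, data = body)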

Exercise 4

Is there a difference between groups at the final measurement, and what is the effect size? (Tip: Use one-way ANOVA.)
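Again with the illustrative column names, a one-way ANOVA plus eta-squared as an effect size (for a one-way design, eta-squared equals the model R-squared):

fit_aov <- aov(final ~ group, data = body)
summary(fit_aov)
summary(lm(final ~ group, data = body))$r.squared  # eta-squared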

Learn more about statistics for your experimental design in the online course Learn By Example: Statistics and Data Science in R. In this course you will learn how to:

  • Work through regression problems
  • Use different statistical tests and interpret them
  • And much more

Exercise 5

Between which groups does the difference in body mass appear after the workout program? (Tip: Conduct a post-hoc test.)

Exercise 6

What is the impact of age and the workout program on body mass at the final measurement? (Tip: Use a two-way between-groups ANOVA.)
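A sketch, assuming an additional factor column for age group (here called age_group, an illustrative name):

# two-way between-groups ANOVA with an interaction term
summary(aov(final ~ group * age_group, data = body))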

Exercise 7

What is the origin of the effect of the workout program between subjects of different ages? (Tip: You should conduct a post-hoc test.)

Exercise 8

Is there a linear relationship between initial and final measurement of body mass for each group?

Exercise 9

Is there a significant difference in body mass on final measurement between groups, while controlling for initial measurement?
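A sketch of the ANCOVA with the same illustrative column names as above:

# analysis of covariance: group differences at the final measurement,
# controlling for the initial measurement as a covariate
summary(aov(final ~ initial + group, data = body))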

Exercise 10

How much of the variance is explained by the independent variable? How much of the variance is explained by the covariate?

Related exercise sets:
  1. Repeated measures ANOVA in R Exercises
  2. One Way Analysis of Variance Exercises
  3. Data science for Doctors: Inferential Statistics Exercises (part-2)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory


Using R: a function that adds multiple ggplot2 layers

Sun, 04/23/2017 - 11:29

(This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers)

Another interesting thing that an R course participant identified: Sometimes one wants to make a function that returns multiple layers to be added to a ggplot2 plot. One could think that just adding them and returning would work, but it doesn’t. I think it has to do with how + is evaluated. There are a few workarounds that achieve similar results and may save typing.

First, some data to play with: this is a built-in dataset of chickens growing:

library(ggplot2)
data(ChickWeight)
diet1 <- subset(ChickWeight, Diet == 1)
diet2 <- subset(ChickWeight, Diet == 2)

This is just an example that shows the phenomenon. The first two functions will work, but combining them won’t.

add_line <- function(df) {
  geom_line(aes(x = Time, y = weight, group = Chick), data = df)
}

add_points <- function(df) {
  geom_point(aes(x = Time, y = weight), data = df)
}

add_line_points <- function(df) {
  add_line(df) + add_points(df)
}

## works
(plot1 <- ggplot() + add_line(diet1) + add_points(diet1))

## won't work: non-numeric argument to binary operator
try((plot2 <- ggplot() + add_line_points(diet1)))

First, you can get the same result by putting the mappings and data in the ggplot() call. This will only work if all layers plot the same data, but that covers many cases:

## bypasses the issue by putting mappings in ggplot()
(plot3 <- ggplot(aes(x = Time, y = weight, group = Chick), data = diet1) +
   geom_line() + geom_point())

One way is to write a function that takes the plot object as input and returns a modified version of it. If we use the pipe operator %>%, found in the magrittr package, it even gets a ggplot2-like feel:

## bypasses the issue and gives a similar feel with pipes
library(magrittr)
add_line_points2 <- function(plot, df, ...) {
  plot +
    geom_line(aes(x = Time, y = weight, group = Chick), ..., data = df) +
    geom_point(aes(x = Time, y = weight), ..., data = df)
}
(plot4 <- ggplot() %>%
   add_line_points2(diet1) %>%
   add_line_points2(diet2, colour = "red"))
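Another workaround, which is not in the original post but works in recent versions of ggplot2: a function can simply return the layers as a list, and ggplot2 will add them all at once.

## returning a list of layers (my addition, not from the original post)
add_line_points_list <- function(df) {
  list(geom_line(aes(x = Time, y = weight, group = Chick), data = df),
       geom_point(aes(x = Time, y = weight), data = df))
}
(plot4b <- ggplot() + add_line_points_list(diet1))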

Finally, in many cases, one can stick all the data in a combined data frame, and avoid building up the plot from different data frames altogether.

## plot the whole dataset at once
(plot5 <- ggplot(aes(x = Time, y = weight, group = Chick, colour = Diet),
                 data = ChickWeight) +
   geom_line() + geom_point())

Okay, maybe that plot is a bit too busy to be good. But note how the difference between plotting a single diet and plotting all diets at the same time is just one more mapping in aes(). No looping or custom functions required.

I hope that was of some use.



A classical analysis (Radio Swiss classic program)

Sun, 04/23/2017 - 02:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

I am not a classical music expert at all, but I happen to have friends who are, and am even married to someone playing the cello (and the ukulele!). I appreciate listening to such music from time to time, in particular Baroque music. A friend made me discover Radio Swiss classic, an online radio playing classical music all day and all night long, with a quite nice variety, and very little speaking between pieces, with no ads (thank you, funders of the radio!). Besides, the voices telling me which piece has just been played are really soothing, so Radio Swiss classic is a good one in my opinion.

Today, instead of anxiously waiting for the results of the French presidential elections, I decided to download the program of the radio in the last years and have a quick look at it, since after all, the website says that the radio aims at relaxing people.

Scraping the program

My web scraping became a bit more elegant because I followed the advice of EP alias expersso, who by the way should really start blogging. I downloaded programs from September 2008 onwards because that’s when I met the friend who told me about Radio Swiss Classic.

dates <- seq(from = lubridate::ymd("2008-09-01"),
             to = lubridate::ymd("2017-04-22"),
             by = "1 day")
base_url <- "http://www.radioswissclassic.ch/en/music-programme/search/"

get_one_day_program <- function(date, base_url){
  # in order to see progress
  message(date)
  # build URL
  date_as_string <- as.character(date)
  date_as_string <- stringr::str_replace_all(date_as_string, "-", "")
  url <- paste0(base_url, date_as_string)
  # read page
  page <- try(xml2::read_html(url), silent = TRUE)
  if(is(page, "try-error")){
    message("horribly wrong")
    closeAllConnections()
    return(NULL)
  }else{
    # find all times, artists and pieces
    times <- xml2::xml_text(xml2::xml_find_all(page, xpath="//span[@class='time hidden-xs']//text()"))
    artists <- xml2::xml_text(xml2::xml_find_all(page, xpath="//span[@class='titletag']//text()"))
    pieces <- xml2::xml_text(xml2::xml_find_all(page, xpath="//span[@class='artist']//text()"))
    # the last artist and piece are the current ones
    artists <- artists[1:(length(artists) - 1)]
    pieces <- pieces[1:(length(pieces) - 1)]
    # get a timedate from each time
    timedates <- paste(as.character(date), times)
    timedates <- lubridate::ymd_hm(timedates)
    timedates <- lubridate::force_tz(timedates, tz = "Europe/Zurich")
    # format the output
    program <- tibble::tibble(time = timedates,
                              artist = artists,
                              piece = pieces)
    return(program)
  }
}

programs <- purrr::map(dates, get_one_day_program, base_url = base_url)
programs <- dplyr::bind_rows(programs)
save(programs, file = "data/radioswissclassic_programs.RData")

There were some days without any program on the website, for which the website said something was horribly wrong with the server.

load("data/radioswissclassic_programs.RData") wegot <- length(unique(lubridate::as_date(programs$time))) wewanted <- length(seq(from = lubridate::ymd("2008-09-01"), to = lubridate::ymd("2017-04-22"), by = "1 day"))

However, I got a program for approximately 96% of the days.

Who are the most popular composers?

library("magrittr")
table(programs$artist) %>%
  broom::tidy() %>%
  dplyr::arrange(desc(Freq)) %>%
  head(n = 20) %>%
  knitr::kable()

Var1                            Freq
Wolfgang Amadeus Mozart        37823
Ludwig van Beethoven           20936
Joseph Haydn                   18140
Franz Schubert                 15596
Antonio Vivaldi                14947
Johann Sebastian Bach          12003
Felix Mendelssohn-Bartholdy    11541
Antonin Dvorak                 10265
Gioachino Rossini               9591
Frédéric Chopin                 8470
Piotr Iljitsch Tchaikowsky      8092
Georg Friedrich Händel          7935
Tomaso Albinoni                 6175
Gaetano Donizetti               5945
Giuseppe Verdi                  5639
Johannes Brahms                 5526
Johann Nepomuk Hummel           5439
Camille Saint-Saëns             5395
Luigi Boccherini                5130
Johann Christian Bach           4976

I’ll have to admit that I don’t even know all the composers in this table, but they’re actually all famous according to my live-in classical music expert. Radio Swiss Classic allows listeners to rate pieces, so the most popular ones are programmed more often, and I guess the person making the programs also tends to pick famous composers quite often.

library("ggplot2") library("hrbrthemes") table(programs$artist) %>% broom::tidy() %>% ggplot() + geom_histogram(aes(Freq)) + scale_x_log10() + theme_ipsum(base_size = 14)

Interestingly, but not that surprisingly given the popularity of, say, Mozart, the distribution of occurrences by composer seems to be log-normal.
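A quick, informal way to check that impression (my addition, reusing the play counts from the table above) is a normal quantile-quantile plot of the logged counts:

# if the counts are roughly log-normal, the points should fall close to the line
counts <- as.numeric(table(programs$artist))
qqnorm(log(counts))
qqline(log(counts))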

How long are pieces?

On the website of Radio Swiss Classic it is stated that pieces are longer in the evening than during the day, which I wanted to check. Because the program of the radio was not corrected for time changes (i.e. on 25-hour days there are only 24 hours of music according to the online program), I shall only look at pieces whose duration is shorter than 60 minutes, which also solves the issue of the missing days.

programs <- dplyr::arrange(programs, time)
programs <- dplyr::mutate(programs, duration = difftime(lead(time, 1), time, units = "min"))
programs <- dplyr::mutate(programs, duration = ifelse(duration > 60, NA, duration))
programs <- dplyr::mutate(programs, hour = as.factor(lubridate::hour(time)))

programs %>%
  ggplot() +
  geom_boxplot(aes(hour, duration)) +
  theme_ipsum(base_size = 14)

I don’t find the difference between day and night that striking, maybe I could try to define day and night to have a prettier figure (but I won’t do any test, I soon need to go watch TV).

programs %>% dplyr::mutate(night = (lubridate::hour(time) <= 4 | lubridate::hour(time) >= 20)) %>% ggplot() + geom_boxplot(aes(night, duration))+ theme_ipsum(base_size = 14)

Conclusion

The website also states that the pieces are more lively in the morning, but I have no data to which to match the titles of the pieces to investigate that claim. Well I have not even looked for such data. Another extension that I would find interesting would be to match each composer’s name to a style and then see how often each style is played. Now I’ll stop relaxing and go stuff my face with food in front of the election results!


Explaining complex machine learning models with LIME

Sun, 04/23/2017 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

The classification decisions made by machine learning models are usually difficult – if not impossible – to understand by our human brains. The complexity of some of the most accurate classifiers, like neural networks, is what makes them perform so well – often with better results than achieved by humans. But it also makes them inherently hard to explain, especially to non-data scientists.

Especially if we aim to develop machine learning models for medical diagnostics, high accuracies on test samples might not be enough to sell them to clinicians. Doctors and patients alike will be less inclined to trust a decision made by a model that they don’t understand.

Therefore, we would like to be able to explain in concrete terms why a model classified a case with a certain label, e.g. why one breast mass sample was classified as “malignant” and not as “benign”.

Local Interpretable Model-Agnostic Explanations (LIME) is an attempt to make these complex models at least partly understandable. The method has been published in

“Why Should I Trust You?” Explaining the Predictions of Any Classifier. By Marco Tulio Ribeiro, Sameer Singh and Carlos Guestrin from the University of Washington in Seattle

lime is able to explain all models for which we can obtain prediction probabilities (in R, that is every model that works with predict(type = "prob")). It makes use of the fact that linear models are easy to explain because they are based on linear relationships between features and class labels: The complex model function is approximated by locally fitting linear models to permutations of the original training set.

For each permutation, a linear model is fitted with weights chosen so that misclassifying instances that are more similar to the original data is penalized more heavily (positive weights support a decision, negative weights contradict it). This gives an approximation of how much (and in which way) each feature contributed to a decision made by the model.
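To make that procedure concrete, here is a toy sketch of the principle in plain R. This is my own illustration, not the lime package's internal code; the function and argument names are made up, and it assumes a purely numeric feature set and a black-box function that returns the probability of the class being explained.

# toy sketch of the LIME idea (illustrative only):
# x0: one observation (a one-row numeric data frame) to be explained
# X_train: the training data used to generate permutations
# predict_prob: black-box function returning P(class of interest) for new data
lime_sketch <- function(x0, X_train, predict_prob, n_perm = 1000, kernel_width = 0.75) {
  # 1. permute: resample each feature from its empirical distribution
  perms <- as.data.frame(lapply(X_train, sample, size = n_perm, replace = TRUE))
  # 2. query the black box on the permutations
  y_hat <- predict_prob(perms)
  # 3. weight permutations by their proximity to x0 (Gaussian kernel)
  dist2 <- rowSums(sweep(as.matrix(perms), 2, unlist(x0))^2)
  w <- exp(-dist2 / kernel_width^2)
  # 4. fit a weighted linear model as the interpretable local surrogate;
  #    its coefficients play the role of the feature weights lime reports
  lm(y_hat ~ ., data = perms, weights = w)
}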

The code for lime was originally made available for Python, but the awesome Thomas Lin Pedersen has already created an implementation in R. It is not on CRAN (yet, I assume), but you can install it via GitHub:

devtools::install_github("thomasp85/lime")

The data I am using is the World Happiness Data from my last post. So, let’s train a neural network on this data to predict three classes of the happiness scores: low, medium and high.

load("data_15_16.RData") # configure multicore library(doParallel) cl <- makeCluster(detectCores()) registerDoParallel(cl) library(caret) set.seed(42) index <- createDataPartition(data_15_16$Happiness.Score.l, p = 0.7, list = FALSE) train_data <- data_15_16[index, ] test_data <- data_15_16[-index, ] set.seed(42) model_mlp <- caret::train(Happiness.Score.l ~ ., data = train_data, method = "mlp", trControl = trainControl(method = "repeatedcv", number = 10, repeats = 5, verboseIter = FALSE))

The explain function

The central function of lime is lime(). It creates the function that is used in the next step to explain the model’s predictions.

We can give a couple of options. Check the help ?lime for details, but the most important to think about are:

  • Should continuous features be binned? And if so, into how many bins?

Here, I am keeping the default bin_continuous = TRUE but specify 5 instead of 4 (the default) bins with n_bins = 5.

library(lime)
explain <- lime(train_data, model_mlp,
                bin_continuous = TRUE,
                n_bins = 5,
                n_permutations = 1000)

Now, let’s look at how the model is explained. Here, I am not going to look at all test cases but I’m randomly choosing three cases with correct predictions and three with wrong predictions.

pred <- data.frame(sample_id = 1:nrow(test_data),
                   predict(model_mlp, test_data, type = "prob"),
                   actual = test_data$Happiness.Score.l)
pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)]
pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong")

Note that we need to give our test-set data table row names containing the sample names or IDs, so that they are displayed in the headers of the explanatory plots below.

library(tidyverse)
pred_cor <- filter(pred, correct == "correct")
pred_wrong <- filter(pred, correct == "wrong")

test_data_cor <- test_data %>%
  mutate(sample_id = 1:nrow(test_data)) %>%
  filter(sample_id %in% pred_cor$sample_id) %>%
  sample_n(size = 3) %>%
  remove_rownames() %>%
  tibble::column_to_rownames(var = "sample_id") %>%
  select(-Happiness.Score.l)

test_data_wrong <- test_data %>%
  mutate(sample_id = 1:nrow(test_data)) %>%
  filter(sample_id %in% pred_wrong$sample_id) %>%
  sample_n(size = 3) %>%
  remove_rownames() %>%
  tibble::column_to_rownames(var = "sample_id") %>%
  select(-Happiness.Score.l)

The explain function from above can now be used with our test samples. Further options we can specify are:

  • How many features do we want to use in the explanatory function?

Let’s say we have a big training set with 100 features. Looking at all features and trying to understand them all could be more confusing than helpful. And very often, a handful of very important features will be enough to predict test samples with a reasonable accuracy (see also my last post on OneR). So, we can choose how many features we want to look at with the n_features option.

  • How do we want to choose these features?

Next, we specify how we want this subset of features to be found. The default, auto, uses forward selection if we choose n_features <= 6 and uses the features with the highest weights otherwise. We can also directly choose feature_select = "forward_selection", feature_select = "highest_weights" or feature_select = "lasso_path". Again, check ?lime for details.

In our example dataset, we only have 7 features and I want to look at the top 5.

I also want to have explanation for all three class labels in the response variable (low, medium and high happiness), so I am choosing n_labels = 3.

explanation_cor <- explain(test_data_cor, n_labels = 3, n_features = 5)
explanation_wrong <- explain(test_data_wrong, n_labels = 3, n_features = 5)

It will return a tidy tibble object that we can plot with plot_features():

plot_features(explanation_cor, ncol = 3)

plot_features(explanation_wrong, ncol = 3)

The information in the output tibble is described in the help function ?lime and can be viewed with

tibble::glimpse(explanation_cor)

So, what does this tell us, now? Let’s look at case 22 (the first row of our plot for correctly predicted classes): This sample has been correctly predicted to come from the medium happiness group because it

  • has a dystopia value between 2.03 & 2.32,
  • a trust/government corruption score below 0.05,
  • a GDP/economy score between 1.06 and 1.23 and
  • a life expectancy score between 0.59 and 0.7.

From the explanation for the label “high” we can also see that this case has a family score bigger than 1.12, which is more representative of high happiness samples.

pred %>% filter(sample_id == 22) ## sample_id low medium high actual prediction correct ## 1 22 0.02906327 0.847562 0.07429938 medium medium correct

The explanatory function named dystopia as the most strongly supporting feature for this prediction. Dystopia is “an imaginary country that has the world’s least-happy people. The purpose in establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables […]”

The explanatory plot tells us for each feature and class label in which range of values a representative data point would fall. If the data point does fall in that range, this gets counted as support for the prediction; if it does not, it gets scored as contradictory. For case 22 and the feature dystopia, the data point 2.27 falls within the range for medium happiness (between 2.03 and 2.32) with a high weight.

When we look at where this case falls on the range of values for this feature, we can see that it is indeed very close to the median of the medium training cases and further away from the medians of the high and low training cases. The other supportive features show us the same trend.

train_data %>% gather(x, y, Economy..GDP.per.Capita.:Dystopia.Residual) %>% ggplot(aes(x = Happiness.Score.l, y = y)) + geom_boxplot(alpha = 0.8, color = "grey") + geom_point(data = gather(test_data[22, ], x, y, Economy..GDP.per.Capita.:Dystopia.Residual), color = "red", size = 3) + facet_wrap(~ x, scales = "free", ncol = 4)

An overview over the top 5 explanatory features for case 22 is stored in:

as.data.frame(explanation_cor[1:9]) %>% filter(case == "22") ## case label label_prob model_r2 model_intercept ## 1 22 medium 0.84756196 0.05004205 0.5033729 ## 2 22 medium 0.84756196 0.05004205 0.5033729 ## 3 22 medium 0.84756196 0.05004205 0.5033729 ## 4 22 medium 0.84756196 0.05004205 0.5033729 ## 5 22 medium 0.84756196 0.05004205 0.5033729 ## 6 22 high 0.07429938 0.06265119 0.2293890 ## 7 22 high 0.07429938 0.06265119 0.2293890 ## 8 22 high 0.07429938 0.06265119 0.2293890 ## 9 22 high 0.07429938 0.06265119 0.2293890 ## 10 22 high 0.07429938 0.06265119 0.2293890 ## 11 22 low 0.02906327 0.07469729 0.3528088 ## 12 22 low 0.02906327 0.07469729 0.3528088 ## 13 22 low 0.02906327 0.07469729 0.3528088 ## 14 22 low 0.02906327 0.07469729 0.3528088 ## 15 22 low 0.02906327 0.07469729 0.3528088 ## feature feature_value feature_weight ## 1 Dystopia.Residual 2.27394 0.14690100 ## 2 Trust..Government.Corruption. 0.03005 0.06308598 ## 3 Economy..GDP.per.Capita. 1.13764 0.02944832 ## 4 Health..Life.Expectancy. 0.66926 0.02477567 ## 5 Generosity 0.00199 -0.01326503 ## 6 Family 1.23617 0.13629781 ## 7 Generosity 0.00199 -0.07514534 ## 8 Trust..Government.Corruption. 0.03005 -0.07574480 ## 9 Dystopia.Residual 2.27394 -0.07687559 ## 10 Economy..GDP.per.Capita. 1.13764 0.07167086 ## 11 Family 1.23617 -0.14932931 ## 12 Economy..GDP.per.Capita. 1.13764 -0.12738346 ## 13 Generosity 0.00199 0.09730858 ## 14 Dystopia.Residual 2.27394 -0.07464384 ## 15 Trust..Government.Corruption. 0.03005 0.06220305 ## feature_desc ## 1 2.025072 < Dystopia.Residual <= 2.320632 ## 2 Trust..Government.Corruption. <= 0.051198 ## 3 1.064792 < Economy..GDP.per.Capita. <= 1.275004 ## 4 0.591822 < Health..Life.Expectancy. <= 0.701046 ## 5 Generosity <= 0.123528 ## 6 1.119156 < Family ## 7 Generosity <= 0.123528 ## 8 Trust..Government.Corruption. <= 0.051198 ## 9 2.025072 < Dystopia.Residual <= 2.320632 ## 10 1.064792 < Economy..GDP.per.Capita. <= 1.275004 ## 11 1.119156 < Family ## 12 1.064792 < Economy..GDP.per.Capita. <= 1.275004 ## 13 Generosity <= 0.123528 ## 14 2.025072 < Dystopia.Residual <= 2.320632 ## 15 Trust..Government.Corruption. <= 0.051198

In a similar way, we can explore why some predictions were wrong.

If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.

sessionInfo() ## R version 3.3.3 (2017-03-06) ## Platform: x86_64-apple-darwin13.4.0 (64-bit) ## Running under: macOS Sierra 10.12.3 ## ## locale: ## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 ## ## attached base packages: ## [1] parallel stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] dplyr_0.5.0 purrr_0.2.2 readr_1.1.0 ## [4] tidyr_0.6.1 tibble_1.3.0 tidyverse_1.1.1 ## [7] RSNNS_0.4-9 Rcpp_0.12.10 lime_0.1.0 ## [10] caret_6.0-73 ggplot2_2.2.1 lattice_0.20-35 ## [13] doParallel_1.0.10 iterators_1.0.8 foreach_1.4.3 ## ## loaded via a namespace (and not attached): ## [1] lubridate_1.6.0 assertthat_0.2.0 glmnet_2.0-5 ## [4] rprojroot_1.2 digest_0.6.12 psych_1.7.3.21 ## [7] R6_2.2.0 plyr_1.8.4 backports_1.0.5 ## [10] MatrixModels_0.4-1 stats4_3.3.3 evaluate_0.10 ## [13] httr_1.2.1 hrbrthemes_0.1.0 lazyeval_0.2.0 ## [16] readxl_0.1.1 minqa_1.2.4 SparseM_1.76 ## [19] extrafontdb_1.0 car_2.1-4 nloptr_1.0.4 ## [22] Matrix_1.2-8 rmarkdown_1.4 labeling_0.3 ## [25] splines_3.3.3 lme4_1.1-12 extrafont_0.17 ## [28] stringr_1.2.0 foreign_0.8-67 munsell_0.4.3 ## [31] hunspell_2.3 broom_0.4.2 modelr_0.1.0 ## [34] mnormt_1.5-5 mgcv_1.8-17 htmltools_0.3.5 ## [37] nnet_7.3-12 codetools_0.2-15 MASS_7.3-45 ## [40] ModelMetrics_1.1.0 grid_3.3.3 nlme_3.1-131 ## [43] jsonlite_1.4 Rttf2pt1_1.3.4 gtable_0.2.0 ## [46] DBI_0.6-1 magrittr_1.5 scales_0.4.1 ## [49] stringi_1.1.5 reshape2_1.4.2 xml2_1.1.1 ## [52] tools_3.3.3 forcats_0.2.0 hms_0.3 ## [55] pbkrtest_0.4-7 yaml_2.1.14 colorspace_1.3-2 ## [58] rvest_0.3.2 knitr_1.15.1 haven_1.0.0 ## [61] quantreg_5.29


Does money buy happiness after all? Machine Learning with One Rule

Sun, 04/23/2017 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

This week, I am exploring Holger K. von Jouanne-Diedrich’s OneR package for machine learning. I am running an example analysis on world happiness data and compare the results with other machine learning models (decision trees, random forest, gradient boosting trees and neural nets).

Conclusions

All in all, based on this example, I would confirm that OneR models do indeed produce sufficiently accurate models for setting a good baseline. OneR was definitely faster than random forest, gradient boosting and neural nets. However, the latter were more complex models and included cross-validation.

If you prefer a model that is very simple and easy to understand, OneR is a very good way to go. You could also use it as a starting point for developing more complex models with improved accuracy.

When looking at feature importance across models, the feature OneR chose – Economy/GDP per capita – was confirmed by random forest, gradient boosting trees and neural networks as being the most important feature. This is in itself an interesting conclusion! Of course, this correlation does not tell us that there is a direct causal relationship between money and happiness, but we can say that a country’s economy is the best individual predictor for how happy people tend to be.

OneR

OneR has been developed for the purpose of creating machine learning models that are easy to interpret and understand, while still being as accurate as possible. It is based on the one rule classification algorithm from Holte (1993), which is basically a decision tree cut at the first level.

R.C. Holte (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning. 11:63-91.

While the original algorithm has difficulties handling missing values and numeric data, the package provides enhanced functionality to handle those cases better, e.g. by introducing a separate class for NA values and the optbin() function to find optimal splitting points for each feature. The main function of the package is OneR(), which finds an optimal split for each feature and then uses only the single feature with the highest training accuracy for classification.
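To make the one-rule idea concrete, here is a toy sketch of the principle. This is my own illustration with a made-up function name, not the package's implementation, and it assumes all features have already been converted to categorical:

# for each feature, map every level to its most frequent class;
# keep the single feature whose rule table has the highest training accuracy
one_rule_sketch <- function(data, target) {
  y <- data[[target]]
  feats <- setdiff(names(data), target)
  accuracy <- sapply(feats, function(f) {
    rules <- tapply(y, data[[f]], function(cl) names(which.max(table(cl))))
    mean(rules[as.character(data[[f]])] == y)
  })
  best <- names(which.max(accuracy))
  list(feature = best,
       accuracy = max(accuracy),
       rules = tapply(y, data[[best]], function(cl) names(which.max(table(cl)))))
}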

I installed the latest stable version of the OneR package from CRAN.

library(OneR)

The dataset

I am using the World Happiness Report 2016 from Kaggle.

library(tidyverse)
data_16 <- read.table("world-happiness/2016.csv", sep = ",", header = TRUE)
data_15 <- read.table("world-happiness/2015.csv", sep = ",", header = TRUE)

In the 2016 data, upper and lower confidence intervals are given for the happiness score, while in the 2015 data we have standard errors. Because I want to combine data from the two years, I am using only the columns that are present in both datasets.

common_feats <- colnames(data_16)[which(colnames(data_16) %in% colnames(data_15))]

# features and response variable for modeling
feats <- setdiff(common_feats, c("Country", "Happiness.Rank", "Happiness.Score"))
response <- "Happiness.Score"

# combine data from 2015 and 2016
data_15_16 <- rbind(select(data_15, one_of(c(feats, response))),
                    select(data_16, one_of(c(feats, response))))

The response variable, the happiness score, is on a numeric scale. OneR could also perform regression, but here I want to compare classification tasks. For classifying happiness, I create three bins for low, medium and high values of the happiness score. In order to avoid having to deal with unbalanced data, I am using the bin() function from OneR with method = "content". For plotting the cut points, I am extracting the numbers from the default level names.

data_15_16$Happiness.Score.l <- bin(data_15_16$Happiness.Score, nbins = 3, method = "content")

intervals <- paste(levels(data_15_16$Happiness.Score.l), collapse = " ")
intervals <- gsub("\\(|]", "", intervals)
intervals <- gsub(",", " ", intervals)
intervals <- as.numeric(unique(strsplit(intervals, " ")[[1]]))

data_15_16 %>%
  ggplot() +
  geom_density(aes(x = Happiness.Score), color = "blue", fill = "blue", alpha = 0.4) +
  geom_vline(xintercept = intervals[2]) +
  geom_vline(xintercept = intervals[3])

Now I am removing the original happiness score column from the data for modeling and renaming the factor levels of the response variable.

data_15_16 <- select(data_15_16, -Happiness.Score) %>%
  mutate(Happiness.Score.l = plyr::revalue(Happiness.Score.l,
                                           c("(2.83,4.79]" = "low",
                                             "(4.79,5.89]" = "medium",
                                             "(5.89,7.59]" = "high")))

Because there are only 9 features in this small dataset, I want to explore them all individually before modeling. First, I am plotting the only categorical variable: Region.

This plot shows that there are a few regions with very strong biases in happiness: people in Western Europe, Australia, New Zealand, North America, Latin America and the Caribbean tend to be in the high happiness group, while people in sub-Saharan Africa and Southern Asia tend to be the least happy.

data_15_16 %>% ggplot(aes(x = Region, fill = Happiness.Score.l)) + geom_bar(position = "dodge", alpha = 0.7) + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), plot.margin = unit(c(0, 0, 0, 1.5), "cm")) + scale_fill_brewer(palette = "Set1")

The remaining quantitative variables show happiness biases to varying degrees: e.g. low health and life expectancy is strongly biased towards low happiness, while economic factors, family and freedom show a bias in the same direction, albeit not as strongly.

data_15_16 %>% gather(x, y, Economy..GDP.per.Capita.:Dystopia.Residual) %>% ggplot(aes(x = y, fill = Happiness.Score.l)) + geom_histogram(alpha = 0.7) + facet_wrap(~ x, scales = "free", ncol = 4) + scale_fill_brewer(palette = "Set1")

While OneR could also handle categorical data, in this example, I only want to consider the quantitative features to show the differences between OneR and other machine learning algorithms.

data_15_16 <- select(data_15_16, -Region)

Modeling

The algorithms I will compare to OneR will be run via the caret package.

# configure multicore
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

library(caret)

I will also use caret’s createDataPartition() function to partition the data into training (70%) and test sets (30%).

set.seed(42)
index <- createDataPartition(data_15_16$Happiness.Score.l, p = 0.7, list = FALSE)
train_data <- data_15_16[index, ]
test_data <- data_15_16[-index, ]

OneR

OneR only accepts categorical features. Because we have numerical features, we need to convert them to factors by splitting them into appropriate bins. While the original OneR algorithm splits the values into ever smaller factors, this has been changed in this R implementation with the aim of preventing overfitting. We can either split the data into a pre-defined number of buckets (by length, content or cluster) or we can use the optbin() function to obtain the optimal number of factors from pairwise logistic regression or information gain.

# default method length
data_1 <- bin(train_data, nbins = 5, method = "length")

# method content
data_2 <- bin(train_data, nbins = 5, method = "content")

# method cluster
data_3 <- bin(train_data, nbins = 3, method = "cluster")

# optimal bin number logistic regression
data_4 <- optbin(formula = Happiness.Score.l ~., data = train_data, method = "logreg")

# optimal bin number information gain
data_5 <- optbin(formula = Happiness.Score.l ~., data = train_data, method = "infogain")

This is what the data looks like following discretization:

  • Default method

  • 5 bins with method = "content"

  • 3 bins with method = "cluster"

  • optimal bin number according to logistic regression

  • optimal bin number according to information gain

Model building

Now I am running the OneR models. During model building, the feature with the highest accuracy is chosen, and the decision rules and accuracies of the top 7 features are printed. Unfortunately, this information is not saved in the model object; that would have been nice in order to compare the importance of features across models later on.

Here, all five models achieved highest prediction accuracy with the feature Economy GDP per capita.

for (i in 1:5) { data <- get(paste0("data_", i)) print(model <- OneR(formula = Happiness.Score.l ~., data = data, verbose = TRUE)) assign(paste0("model_", i), model) } ## ## Attribute Accuracy ## 1 * Economy..GDP.per.Capita. 63.96% ## 2 Health..Life.Expectancy. 59.91% ## 3 Family 57.21% ## 4 Dystopia.Residual 51.8% ## 5 Freedom 49.55% ## 6 Trust..Government.Corruption. 45.5% ## 7 Generosity 41.89% ## --- ## Chosen attribute due to accuracy ## and ties method (if applicable): '*' ## ## ## Call: ## OneR(data = data, formula = Happiness.Score.l ~ ., verbose = TRUE) ## ## Rules: ## If Economy..GDP.per.Capita. = (-0.00182,0.365] then Happiness.Score.l = low ## If Economy..GDP.per.Capita. = (0.365,0.73] then Happiness.Score.l = low ## If Economy..GDP.per.Capita. = (0.73,1.09] then Happiness.Score.l = medium ## If Economy..GDP.per.Capita. = (1.09,1.46] then Happiness.Score.l = high ## If Economy..GDP.per.Capita. = (1.46,1.83] then Happiness.Score.l = high ## ## Accuracy: ## 142 of 222 instances classified correctly (63.96%) ## ## ## Attribute Accuracy ## 1 * Economy..GDP.per.Capita. 64.41% ## 2 Health..Life.Expectancy. 60.81% ## 3 Family 59.91% ## 4 Trust..Government.Corruption. 55.41% ## 5 Freedom 53.15% ## 5 Dystopia.Residual 53.15% ## 7 Generosity 41.44% ## --- ## Chosen attribute due to accuracy ## and ties method (if applicable): '*' ## ## ## Call: ## OneR(data = data, formula = Happiness.Score.l ~ ., verbose = TRUE) ## ## Rules: ## If Economy..GDP.per.Capita. = (-0.00182,0.548] then Happiness.Score.l = low ## If Economy..GDP.per.Capita. = (0.548,0.877] then Happiness.Score.l = low ## If Economy..GDP.per.Capita. = (0.877,1.06] then Happiness.Score.l = medium ## If Economy..GDP.per.Capita. = (1.06,1.28] then Happiness.Score.l = medium ## If Economy..GDP.per.Capita. = (1.28,1.83] then Happiness.Score.l = high ## ## Accuracy: ## 143 of 222 instances classified correctly (64.41%) ## ## ## Attribute Accuracy ## 1 * Economy..GDP.per.Capita. 63.51% ## 2 Health..Life.Expectancy. 62.16% ## 3 Family 54.5% ## 4 Freedom 50.45% ## 4 Dystopia.Residual 50.45% ## 6 Trust..Government.Corruption. 43.24% ## 7 Generosity 36.49% ## --- ## Chosen attribute due to accuracy ## and ties method (if applicable): '*' ## ## ## Call: ## OneR(data = data, formula = Happiness.Score.l ~ ., verbose = TRUE) ## ## Rules: ## If Economy..GDP.per.Capita. = (-0.00182,0.602] then Happiness.Score.l = low ## If Economy..GDP.per.Capita. = (0.602,1.1] then Happiness.Score.l = medium ## If Economy..GDP.per.Capita. = (1.1,1.83] then Happiness.Score.l = high ## ## Accuracy: ## 141 of 222 instances classified correctly (63.51%) ## ## ## Attribute Accuracy ## 1 * Economy..GDP.per.Capita. 63.96% ## 2 Health..Life.Expectancy. 62.16% ## 3 Family 58.56% ## 4 Freedom 51.35% ## 5 Dystopia.Residual 50.9% ## 6 Trust..Government.Corruption. 46.4% ## 7 Generosity 40.09% ## --- ## Chosen attribute due to accuracy ## and ties method (if applicable): '*' ## ## ## Call: ## OneR(data = data, formula = Happiness.Score.l ~ ., verbose = TRUE) ## ## Rules: ## If Economy..GDP.per.Capita. = (-0.00182,0.754] then Happiness.Score.l = low ## If Economy..GDP.per.Capita. = (0.754,1.12] then Happiness.Score.l = medium ## If Economy..GDP.per.Capita. = (1.12,1.83] then Happiness.Score.l = high ## ## Accuracy: ## 142 of 222 instances classified correctly (63.96%) ## ## ## Attribute Accuracy ## 1 * Economy..GDP.per.Capita. 67.12% ## 2 Health..Life.Expectancy. 65.77% ## 3 Family 61.71% ## 4 Trust..Government.Corruption. 
56.31% ## 5 Dystopia.Residual 55.41% ## 6 Freedom 50.9% ## 7 Generosity 43.69% ## --- ## Chosen attribute due to accuracy ## and ties method (if applicable): '*' ## ## ## Call: ## OneR(data = data, formula = Happiness.Score.l ~ ., verbose = TRUE) ## ## Rules: ## If Economy..GDP.per.Capita. = (-0.00182,0.68] then Happiness.Score.l = low ## If Economy..GDP.per.Capita. = (0.68,1.24] then Happiness.Score.l = medium ## If Economy..GDP.per.Capita. = (1.24,1.83] then Happiness.Score.l = high ## ## Accuracy: ## 149 of 222 instances classified correctly (67.12%)

Model evaluation

The function eval_model() prints confusion matrices for absolute and relative predictions, as well as accuracy, error and error rate reduction. For comparison with other models, it would have been convenient to be able to extract these performance metrics directly from the eval_model object, instead of only the confusion matrix and values of correct/all instances and having to re-calculate performance metrics again manually.

for (i in 1:5) { model <- get(paste0("model_", i)) eval_model(predict(model, test_data), test_data$Happiness.Score.l) } ## ## Confusion matrix (absolute): ## Actual ## Prediction high low medium Sum ## high 23 0 11 34 ## low 1 26 10 37 ## medium 7 5 10 22 ## Sum 31 31 31 93 ## ## Confusion matrix (relative): ## Actual ## Prediction high low medium Sum ## high 0.25 0.00 0.12 0.37 ## low 0.01 0.28 0.11 0.40 ## medium 0.08 0.05 0.11 0.24 ## Sum 0.33 0.33 0.33 1.00 ## ## Accuracy: ## 0.6344 (59/93) ## ## Error rate: ## 0.3656 (34/93) ## ## Error rate reduction (vs. base rate): ## 0.4516 (p-value = 2.855e-09) ## ## ## Confusion matrix (absolute): ## Actual ## Prediction high low medium Sum ## high 19 0 1 20 ## low 3 28 14 45 ## medium 9 3 16 28 ## Sum 31 31 31 93 ## ## Confusion matrix (relative): ## Actual ## Prediction high low medium Sum ## high 0.20 0.00 0.01 0.22 ## low 0.03 0.30 0.15 0.48 ## medium 0.10 0.03 0.17 0.30 ## Sum 0.33 0.33 0.33 1.00 ## ## Accuracy: ## 0.6774 (63/93) ## ## Error rate: ## 0.3226 (30/93) ## ## Error rate reduction (vs. base rate): ## 0.5161 (p-value = 1.303e-11) ## ## ## Confusion matrix (absolute): ## Actual ## Prediction high low medium Sum ## high 23 0 11 34 ## low 0 25 7 32 ## medium 8 6 13 27 ## Sum 31 31 31 93 ## ## Confusion matrix (relative): ## Actual ## Prediction high low medium Sum ## high 0.25 0.00 0.12 0.37 ## low 0.00 0.27 0.08 0.34 ## medium 0.09 0.06 0.14 0.29 ## Sum 0.33 0.33 0.33 1.00 ## ## Accuracy: ## 0.6559 (61/93) ## ## Error rate: ## 0.3441 (32/93) ## ## Error rate reduction (vs. base rate): ## 0.4839 (p-value = 2.116e-10) ## ## ## Confusion matrix (absolute): ## Actual ## Prediction high low medium Sum ## high 23 0 11 34 ## low 2 26 11 39 ## medium 6 5 9 20 ## Sum 31 31 31 93 ## ## Confusion matrix (relative): ## Actual ## Prediction high low medium Sum ## high 0.25 0.00 0.12 0.37 ## low 0.02 0.28 0.12 0.42 ## medium 0.06 0.05 0.10 0.22 ## Sum 0.33 0.33 0.33 1.00 ## ## Accuracy: ## 0.6237 (58/93) ## ## Error rate: ## 0.3763 (35/93) ## ## Error rate reduction (vs. base rate): ## 0.4355 (p-value = 9.799e-09) ## ## ## Confusion matrix (absolute): ## Actual ## Prediction high low medium Sum ## high 21 0 3 24 ## low 0 26 8 34 ## medium 10 5 20 35 ## Sum 31 31 31 93 ## ## Confusion matrix (relative): ## Actual ## Prediction high low medium Sum ## high 0.23 0.00 0.03 0.26 ## low 0.00 0.28 0.09 0.37 ## medium 0.11 0.05 0.22 0.38 ## Sum 0.33 0.33 0.33 1.00 ## ## Accuracy: ## 0.7204 (67/93) ## ## Error rate: ## 0.2796 (26/93) ## ## Error rate reduction (vs. base rate): ## 0.5806 (p-value = 2.761e-14)
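Since eval_model() only prints its results, the overall accuracy can also be recomputed by hand from the class predictions. A minimal sketch for the first model from above (my addition, not from the original post):

# confusion matrix and overall accuracy for model_1
preds_1 <- predict(model_1, test_data)
cm <- table(predicted = preds_1, actual = test_data$Happiness.Score.l)
sum(diag(cm)) / sum(cm)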

Because I want to calculate performance measures for the different classes separately and would like to have a more detailed look at the prediction probabilities I get from the models, I prefer to obtain predictions with type = "prob". While I am not looking at it here, this would also allow me to test different prediction thresholds.

for (i in 1:5) { model <- get(paste0("model_", i)) pred <- data.frame(model = paste0("model_", i), sample_id = 1:nrow(test_data), predict(model, test_data, type = "prob"), actual = test_data$Happiness.Score.l) pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)] pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong") pred$pred_prob <- NA for (j in 1:nrow(pred)) { pred[j, "pred_prob"] <- max(pred[j, 3:5]) } if (i == 1) { pred_df <- pred } else { pred_df <- rbind(pred_df, pred) } }

Comparing other algorithms Decision trees

First, I am building a decision tree with the rpart package and the rpart() function, which we can plot with rpart.plot().

Economy/GDP per capita is only the second-highest node here; the best predictor in this tree is health and life expectancy.

library(rpart)
library(rpart.plot)

set.seed(42)
fit <- rpart(Happiness.Score.l ~ .,
             data = train_data,
             method = "class",
             control = rpart.control(xval = 10),
             parms = list(split = "information"))

rpart.plot(fit, extra = 100)

In order to compare the models, I am producing the same output table for predictions from this model and combine it with the table from the OneR models.

pred <- data.frame(model = "rpart", sample_id = 1:nrow(test_data), predict(fit, test_data, type = "prob"), actual = test_data$Happiness.Score.l) pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)] pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong") pred$pred_prob <- NA for (j in 1:nrow(pred)) { pred[j, "pred_prob"] <- max(pred[j, 3:5]) } pred_df_final <- rbind(pred_df, pred)

Random Forest

Next, I am training a Random Forest model. For more details on Random Forest, check out my post “Can we predict flu deaths with Machine Learning and R?”.

set.seed(42)
model_rf <- caret::train(Happiness.Score.l ~ .,
                         data = train_data,
                         method = "rf",
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 5,
                                                  verboseIter = FALSE))

The varImp() function from caret shows us which feature was of highest importance for the model and its predictions.

Here, we again find Economy GDP per capita on top.

varImp(model_rf) ## rf variable importance ## ## Overall ## Economy..GDP.per.Capita. 100.00 ## Dystopia.Residual 97.89 ## Health..Life.Expectancy. 77.10 ## Family 47.17 ## Trust..Government.Corruption. 29.89 ## Freedom 19.29 ## Generosity 0.00 pred <- data.frame(model = "rf", sample_id = 1:nrow(test_data), predict(model_rf, test_data, type = "prob"), actual = test_data$Happiness.Score.l) pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)] pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong") pred$pred_prob <- NA for (j in 1:nrow(pred)) { pred[j, "pred_prob"] <- max(pred[j, 3:5]) } pred_df_final <- rbind(pred_df_final, pred)

Extreme gradient boosting trees

Gradient boosting is another decision tree-based algorithm, explained in more detail in my post “Extreme Gradient Boosting and Preprocessing in Machine Learning”.

set.seed(42)
model_xgb <- caret::train(Happiness.Score.l ~ .,
                          data = train_data,
                          method = "xgbTree",
                          trControl = trainControl(method = "repeatedcv",
                                                   number = 10,
                                                   repeats = 5,
                                                   verboseIter = FALSE))

As before, we again find Economy GDP per capita as most important feature.

varImp(model_xgb) ## xgbTree variable importance ## ## Overall ## Economy..GDP.per.Capita. 100.00 ## Health..Life.Expectancy. 67.43 ## Family 46.59 ## Freedom 0.00 pred <- data.frame(model = "xgb", sample_id = 1:nrow(test_data), predict(model_xgb, test_data, type = "prob"), actual = test_data$Happiness.Score.l) pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)] pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong") pred$pred_prob <- NA for (j in 1:nrow(pred)) { pred[j, "pred_prob"] <- max(pred[j, 3:5]) } pred_df_final <- rbind(pred_df_final, pred)

Neural network

Finally, I am comparing a neural network model. Here as well, I have a post where I talk about what they are in more detail: “Building deep neural nets with h2o and rsparkling that predict arrhythmia of the heart”.

set.seed(42)
model_nn <- caret::train(Happiness.Score.l ~ .,
                         data = train_data,
                         method = "mlp",
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 5,
                                                  verboseIter = FALSE))

And Economy GDP per capita is again the most important feature!

varImp(model_nn) ## ROC curve variable importance ## ## variables are sorted by maximum importance across the classes ## low medium high ## Economy..GDP.per.Capita. 100.00 66.548 100.000 ## Health..Life.Expectancy. 95.22 63.632 95.222 ## Family 84.48 45.211 84.483 ## Dystopia.Residual 71.23 43.450 71.232 ## Freedom 64.58 41.964 64.581 ## Trust..Government.Corruption. 26.13 40.573 40.573 ## Generosity 0.00 3.462 3.462 pred <- data.frame(model = "nn", sample_id = 1:nrow(test_data), predict(model_nn, test_data, type = "prob"), actual = test_data$Happiness.Score.l) pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)] pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong") pred$pred_prob <- NA for (j in 1:nrow(pred)) { pred[j, "pred_prob"] <- max(pred[j, 3:5]) } pred_df_final <- rbind(pred_df_final, pred)

Model comparisons

Now to the final verdict: How do the different models compare?

The first plot below shows the prediction probabilities for the three happiness levels low, medium and high for each test data instance. For each instance, only the prediction probability of the predicted class (i.e. with the highest value) is shown. The upper row shows correct predictions, the lower row shows wrong predictions.

Sometimes it is obvious from such a plot whether a more stringent prediction threshold could improve things (when wrong predictions tend to be close to the threshold). With three classes to predict, this is obviously not as trivial as it would be with only two, but the same principle holds true: the smaller the prediction probability, the more uncertain the prediction tends to be.

pred_df_final %>% ggplot(aes(x = actual, y = pred_prob, fill = prediction, color = prediction)) + geom_boxplot(alpha = 0.7) + facet_grid(correct ~ model) + scale_color_brewer(palette = "Set1") + scale_fill_brewer(palette = "Set1")

Probably the most straightforward performance measure is accuracy, i.e. the proportion of correct predictions out of the total number of instances to predict. The closer to 1, the better the accuracy.

Not surprisingly, the more complex models tend to be more accurate – albeit only slightly.

pred_df_final %>% group_by(model) %>% dplyr::summarise(correct = sum(correct == "correct")) %>% mutate(accuracy = correct / nrow(test_data)) %>% ggplot(aes(x = model, y = accuracy, fill = model)) + geom_bar(stat = "identity") + scale_fill_brewer(palette = "Set1")

When we look at the three classes individually, the picture is a bit more complicated, but most models achieved the highest accuracy for the class “high”.

pred_df_final %>% group_by(model, prediction) %>% dplyr::summarise(correct = sum(correct == "correct"), n = n()) %>% mutate(accuracy = correct / n) %>% ggplot(aes(x = model, y = accuracy, fill = prediction)) + geom_bar(stat = "identity", position = "dodge") + scale_fill_brewer(palette = "Set1")

If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.

sessionInfo() ## R version 3.3.3 (2017-03-06) ## Platform: x86_64-apple-darwin13.4.0 (64-bit) ## Running under: macOS Sierra 10.12.3 ## ## locale: ## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 ## ## attached base packages: ## [1] parallel stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] RSNNS_0.4-9 Rcpp_0.12.10 plyr_1.8.4 ## [4] xgboost_0.6-4 randomForest_4.6-12 rpart.plot_2.1.1 ## [7] rpart_4.1-10 caret_6.0-73 lattice_0.20-35 ## [10] doParallel_1.0.10 iterators_1.0.8 foreach_1.4.3 ## [13] dplyr_0.5.0 purrr_0.2.2 readr_1.1.0 ## [16] tidyr_0.6.1 tibble_1.3.0 ggplot2_2.2.1 ## [19] tidyverse_1.1.1 OneR_2.1 ## ## loaded via a namespace (and not attached): ## [1] lubridate_1.6.0 assertthat_0.2.0 rprojroot_1.2 ## [4] digest_0.6.12 psych_1.7.3.21 R6_2.2.0 ## [7] backports_1.0.5 MatrixModels_0.4-1 stats4_3.3.3 ## [10] evaluate_0.10 httr_1.2.1 lazyeval_0.2.0 ## [13] readxl_0.1.1 data.table_1.10.4 minqa_1.2.4 ## [16] SparseM_1.76 car_2.1-4 nloptr_1.0.4 ## [19] Matrix_1.2-8 rmarkdown_1.4 labeling_0.3 ## [22] splines_3.3.3 lme4_1.1-12 stringr_1.2.0 ## [25] foreign_0.8-67 munsell_0.4.3 broom_0.4.2 ## [28] modelr_0.1.0 mnormt_1.5-5 mgcv_1.8-17 ## [31] htmltools_0.3.5 nnet_7.3-12 codetools_0.2-15 ## [34] MASS_7.3-45 ModelMetrics_1.1.0 grid_3.3.3 ## [37] nlme_3.1-131 jsonlite_1.4 gtable_0.2.0 ## [40] DBI_0.6-1 magrittr_1.5 scales_0.4.1 ## [43] stringi_1.1.5 reshape2_1.4.2 xml2_1.1.1 ## [46] RColorBrewer_1.1-2 tools_3.3.3 forcats_0.2.0 ## [49] hms_0.3 pbkrtest_0.4-7 yaml_2.1.14 ## [52] colorspace_1.3-2 rvest_0.3.2 knitr_1.15.1 ## [55] haven_1.0.0 quantreg_5.29


R code to accompany Real-World Machine Learning (Chapter 6): Exploring NYC Taxi Data

Sun, 04/23/2017 - 01:00

(This article was first published on data prone - R, and kindly contributed to R-bloggers)

Abstract

The rwml-R Github repo is updated with R code for exploratory data analysis of New York City taxi data from Chapter 6 of the book “Real-World Machine Learning” by Henrik Brink, Joseph W. Richards, and Mark Fetherolf. Examples given include reading large data files with the fread function from data.table, joining data frames by multiple variables with inner_join, and plotting categorical and numerical data with ggplot2.

Data for NYC taxi example

The data files for the examples in Chapter 6 of the book are available at
http://www.andresmh.com/nyctaxitrips/.
They are compressed as a 7-Zip file archive
(e.g. with p7zip), so you will
need to have the 7z command available in your path to decompress and load
the data.
(On a mac, you can use Homebrew to install p7zip with
the command brew install p7zip.)

Using fread (and dplyr…again)

As in Chapter 5, the fread function from the
data.table library is used to quickly read in a sample of the rather large
data files. It is similar to read.table but faster and more convenient.
The following code reads in the first 50k lines of data from one of the
trip data files and one of the fare data files. The mutate and filter
functions from dplyr are used to clean up the data (e.g. remove data
with unrealistic latitude and longitude values). The trip and fare data are
combined with the inner_join function from the dplyr package.

tripFile1 <- "../data/trip_data_1.csv"
fareFile1 <- "../data/trip_fare_1.csv"
npoints <- 50000

tripData <- fread(tripFile1, nrows=npoints, stringsAsFactors = TRUE) %>%
  mutate(store_and_fwd_flag = replace(store_and_fwd_flag,
                                      which(store_and_fwd_flag == ""), "N")) %>%
  filter(trip_distance > 0 & trip_time_in_secs > 0 & passenger_count > 0) %>%
  filter(pickup_longitude < -70 & pickup_longitude > -80) %>%
  filter(pickup_latitude > 0 & pickup_latitude < 41) %>%
  filter(dropoff_longitude < 0 & dropoff_latitude > 0)
tripData$store_and_fwd_flag <- factor(tripData$store_and_fwd_flag)

fareData <- fread(fareFile1, nrows=npoints, stringsAsFactors = TRUE)

dataJoined <- inner_join(tripData, fareData)
remove(fareData, tripData)

Exploring the data

In the complete code-through, plots of categorical and numerical
features of the data are made using
ggplot2, including a visualization of the pickup locations in latitude and
longitude space which is shown below. With slightly less than 50,000 data
points, we can clearly see the street layout of downtown Manhattan.
Many of the trips originate in the other boroughs of New York, too.

Feedback welcome

If you have any feedback on the rwml-R project, please
leave a comment below or use the Tweet button.
As with any of my projects, feel free to fork the rwml-R repo
and submit a pull request if you wish to contribute.
For convenience, I’ve created a project page for rwml-R with
the generated HTML files from knitr, including a page with
all of the event-modeling examples from chapter 6.


To leave a comment for the author, please follow the link and comment on their blog: data prone - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Cartograms of New Zealand census data

Sat, 04/22/2017 - 14:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

I’ve been trying to catch up with mapping functionality in R and its extended ecoverse, so might do a few posts on this in the next month or so. First up (more or less by accident) is cartograms – maps where some other numeric variable is substituted for land area, while still trying to preserve regions’ basic shapes and relative locations.

Cartograms are particularly useful as a variant on choropleth maps (regions’ fill colour specified based on a numeric variable) for socio-economic comparison. Straight choropleth maps can give a misleading comparison between large sparsely populated areas and small densely populated areas (ie cities). This is a problem for statistical maps of New Zealand, for example, with the Auckland City and Regional Council making up a quarter of the country’s population but much less of its area.

In this post, I use illustrative data from the 2013 New Zealand Census by Statistics New Zealand. Last year I took their “meshblock” census data set and re-shaped it into the nzcensus R package; only available from GitHub (as it’s too large for CRAN). In the process of preparing today’s blog post I made some small additions to the variables in that package – numbers of dwellings, households and individuals, and cartogram outlines of New Zealand regions (and I hope to follow up with other spatial categories.)

Rectangles

My first exploration into this was via the recmap R package, which does rectangular cartograms. These are a simplified version that just needs to know the centres of regions, the numeric variable you want their size to be proportionate to, and aspect ratios for rectangles. Here’s the out-of-the box version, showing the proportion of individuals who are receiving unemployment benefits on census night 2013:

Note that the proportion of all people receiving benefits is lower than the unemployment rate because there is a different denominator.

This is an interesting start, but I don’t find this that satisfactory for New Zealand’s 16 regions; and the result is even less satisfactory for 67 Territorial Authorities. The algorithm is quick, but the rectangles just don’t cut it for today’s high expectations of data visualisation; and the adjacency of various regions has been lost (most significantly, Wellington relegated to the bottom of the map, rather than the middle where it more intuitively belongs). There are ways to over-ride both of these issues, but I’ll put recmap aside for now – potentially but not immediately useful for my purposes.

Here’s the code for those rectangles, plus some helper functions for colour palettes and legends I’ll use throughout the post:

# Install the latest version of the nzcensus package if not already installed:
devtools::install_github("ellisp/nzelect/pkg2")
# Install Ministry of Business, Innovation and Employment's NZ maps package,
# only needed for region_simpl, used to illustrate a "normal" map.
devtools::install_github("nz-mbie/mbiemaps-public/pkg")
library(recmap)
library(nzcensus)
library(tidyverse)
library(viridis)
library(mbiemaps)

# Helper functions:
colour_scale <- function(x, palette = viridis(100)){
  levs <- round(x / max(x, na.rm = TRUE) * length(palette))
  return(as.character(palette[levs]))
}

make_legend <- function(x, palette = viridis(100), title = NULL, location = "right",
                        multiplier = 100, digits = 1, ...){
  y <- seq(from = min(x), to = max(x), length.out = 5)
  levsy <- round(y / max(x, na.rm = TRUE) * length(palette))
  legend(location, legend = round(y * multiplier, digits), pch = 15, col = palette[levsy],
         text.col = palette[levsy], bty = "n", title = title, ...)
  title(xlab = "Source: Statistics New Zealand Census 2013, in nzcensus R package",
        adj = 1, col.lab = "grey50", cex.lab = 0.8)
}

#=========rectangle cartogram==========
tmp <- with(filter(REGC2013, !grepl("Area Outside", REGC2013_N)),
            data.frame(x = WGS84Longitude,
                       y = WGS84Latitude,
                       dx = 12, dy = 8,
                       z = ResidentPop2013,
                       name = gsub(" Region", "", as.character(REGC2013_N)),
                       value = PropUnemploymentBenefit2013,
                       stringsAsFactors = FALSE)) %>%
  mutate(colour = colour_scale(value))

tmp %>%
  recmap() %>%
  plot(col.text = "grey10", col = tmp[, "colour"], border = "white")
title(main = "Unemployment by region; regions sized by usual resident population")
make_legend(tmp$value, title = "Proportion of all individuals\non unemployment benefit",
            location = "left", cex = 0.8)

ScapeToad and cartograms

Turning a shapefile into a cartogram is a serious job for serious GIS software, and there’s no obvious way to do it from R. However, the ScapeToad software (written in Java) is freely available under a GPL license and is simple to operate. I’ve half started a project for R and ScapeToad to interact but I doubt it will get far or be worthwhile.

ScapeToad imports layers from an ESRI format shapefile, and then reshapes the borders for you based on a numeric variable that is included as part of the polygon-level data in that shapefile. R works adequately for reading and writing shapefiles, and of course is unparalleled for managing numeric data in general. The cartogram calculation is easy to specify in ScapeToad but is computationally intensive (at the time of writing, my laptop has been whirring away at the Territorial Authority level for four hours…). I’ve added a step to the build for the nzcensus R package that saves a cartogram version of New Zealand’s regional council boundaries, with size proportional to “usual resident population”. I hope to do the same for Territorial Authority, Area Unit and Mesh Block down the track, but for now only the regional council level is complete. Because I’ve done the work in advance, there’s no need to install ScapeToad to use the map in nzcensus.
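For anyone reproducing the round trip, here is a minimal sketch of the R side, assuming the rgdal package; the object, directory and layer names are mine, not part of the nzcensus build.

library(rgdal)
# Export a SpatialPolygonsDataFrame (assumed here to be called reg_polygons and to
# carry a ResidentPop2013 column) so ScapeToad can read it:
writeOGR(reg_polygons, dsn = "for_scapetoad", layer = "regions",
         driver = "ESRI Shapefile", overwrite_layer = TRUE)
# ... run ScapeToad on for_scapetoad/regions.shp, sizing polygons by ResidentPop2013 ...
# Read the distorted (cartogram) layer that ScapeToad writes back out:
reg_cartogram <- readOGR(dsn = "from_scapetoad", layer = "regions_cartogram")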

Here’s an example end result:

Auckland, and the north island in general, are distorted to appear much larger than their geographic region, whereas other regions (particularly the West Coast of the south island) are shrunk; yet the map remains recognisably New Zealand. In contrast, in the traditional choropleth map on the right, the colour of Auckland is simply not prominent enough given its importance in socio-economic terms.

The R code to produce this from an existing cartogram set of boundaries (reg_cart_simpl, which now ships with nzcensus) is probably even easier than the rectangle version:

#===============shape-preserving cartogram============= comb_data <- reg_cart_simpl@data %>% left_join(REGC2013, by = c("Name" = "REGC2013_N")) par(font.main= 1, fg = "grey75", mfrow = c(1, 2)) plot(reg_cart_simpl, col = colour_scale(comb_data$PropUnemploymentBenefit2013)) title(main = "Unemployment by region; regions sized by usual resident population") make_legend(comb_data$PropUnemploymentBenefit2013, title = "Proportion of all individuals\non unemployment benefit", location = "left", cex = 0.8) # compare with the standard regions map, from the mbiemaps package data(region_simpl) plot(region_simpl, col = colour_scale(comb_data$PropUnemploymentBenefit2013)) title(main = "Regions as they are shaped and sized geographically")

I was so pleased with how this looks that I did a few variants:

Then I realised it would be better to let people choose a variable for colouring in the map themselves, so I made this Shiny app:

This also makes it easier to select a colour scheme (and its direction), and gives micro-interactivity like informative tooltips. And thanks to the work of the Shiny and Leaflet developers, it isn’t much harder to build than the first static graphics.

The web app makes sense because most of the data in the nzcensus package has been transformed into percentages (of dwellings, of households, or of individuals), which minimises the effort needed in choice of scales and the like. More polish (eg clearer what the denominator is for each variable) would be needed to turn this into something of professional standard, but what we’ve got here is ok for a Sunday afternoon project.


To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Using R as a GIS

Sat, 04/22/2017 - 13:52

(This article was first published on R – Real Data, and kindly contributed to R-bloggers)

In real estate, spatial data is the name of the game. Countless programs
in other domains utilize the power of this data, which is becoming more
prevalent by the day.

In this post I will go over a few simple, but powerful tools to get you
started using geographic information in R.

## First, some libraries ##
#install.packages('GISTools', dependencies = T)
library(GISTools)

GISTools provides an easy-to-use method for creating shading schemes
and choropleth maps. Some of you may have heard of the sp package,
which adds numerous spatial classes to the mix. There are also functions
for analysis and making things look nice.

Let’s get rolling: source the vulgaris dataset, which contains
location information for Syringa Vulgaris (the Lilac) observation
stations and US states. This code plots the states and vulgaris
points.

data("vulgaris")                  # load data
par(mar = c(2,0,0,0))             # set margins of plot area
plot(us_states)
plot(vulgaris, add = T, pch = 20)

One thing to note here is the structure of these objects. us_states is
a SpatialPolygonsDataFrame, which stores information for plotting shapes
(like a shapefile) within its attributes. vulgaris by contrast is a
SpatialPointsDataFrame, which contains data for plotting individual
points. Much like a data.frame and $, these objects harbor
information that can be accessed via @.

kable(head(vulgaris@data))

       Station   Year   Type       Leaf   Bloom   Station.Name   State.Prov   Lat    Long     Elev
3695   61689     1965   Vulgaris   114    136     COVENTRY       CT           41.8   -72.35   146
3696   61689     1966   Vulgaris   122    146     COVENTRY       CT           41.8   -72.35   146
3697   61689     1967   Vulgaris   104    156     COVENTRY       CT           41.8   -72.35   146
3698   61689     1968   Vulgaris   97     134     COVENTRY       CT           41.8   -72.35   146
3699   61689     1969   Vulgaris   114    138     COVENTRY       CT           41.8   -72.35   146
3700   61689     1970   Vulgaris   111    135     COVENTRY       CT           41.8   -72.35   146

Let’s take a look at some functions that use this data.

newVulgaris <- gIntersection(us_states, vulgaris, byid = TRUE)
kable(head(data.frame(newVulgaris)))

         x        y
3 4896   -67.65   44.65
3 4897   -67.65   44.65
3 4898   -67.65   44.65
3 4899   -67.65   44.65
3 4900   -67.65   44.65
3 4901   -67.65   44.65

gIntersection, as you may have guessed from the name, returns the
intersection of two spatial objects. In this case, we are given the
points from vulgaris that are within us_states. However, the rest of
the vulgaris data has been stripped from the resulting object. We’ve
got to jump through a couple of hoops to get that information back.

newVulgaris <- data.frame(newVulgaris)
tmp <- rownames(newVulgaris)
tmp <- strsplit(tmp, " ")
tmp <- sapply(tmp, "[[", 2)
tmp <- as.numeric(tmp)
vdf <- data.frame(vulgaris)
newVulgaris <- subset(vdf, row.names(vdf) %in% tmp)

       Station   Year   Type       Leaf   Bloom   Station.Name   State.Prov   Lat    Long     Elev   Long.1   Lat.1   optional
3695   61689     1965   Vulgaris   114    136     COVENTRY       CT           41.8   -72.35   146    -72.35   41.8    TRUE
3696   61689     1966   Vulgaris   122    146     COVENTRY       CT           41.8   -72.35   146    -72.35   41.8    TRUE
3697   61689     1967   Vulgaris   104    156     COVENTRY       CT           41.8   -72.35   146    -72.35   41.8    TRUE
3698   61689     1968   Vulgaris   97     134     COVENTRY       CT           41.8   -72.35   146    -72.35   41.8    TRUE
3699   61689     1969   Vulgaris   114    138     COVENTRY       CT           41.8   -72.35   146    -72.35   41.8    TRUE
3700   61689     1970   Vulgaris   111    135     COVENTRY       CT           41.8   -72.35   146    -72.35   41.8    TRUE

Look familiar? Now we’ve got a data frame with the clipped vulgaris
values and original data preserved.

vulgarisSpatial

After storing our clipped data frame as a SpatialPointsDataFrame, we can again make use of it - in this case we add a shading scheme to the `vulgaris` points.

shades shades$cols plot(us_states) choropleth(vulgarisSpatial, vulgarisSpatial$Elev, shading = shades, add = T, pch = 20)
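For completeness, here is a hedged sketch of how the vulgarisSpatial object and the shades shading scheme might be built with sp and GISTools; the coordinate columns, projection handling and palette are my assumptions, not the author's code.

# Hypothetical reconstruction of the two assignments used above
vulgarisSpatial <- SpatialPointsDataFrame(
  coords = newVulgaris[, c("Long", "Lat")],      # assumes the Long/Lat columns shown earlier
  data = newVulgaris,
  proj4string = CRS(proj4string(us_states)))
shades <- auto.shading(vulgarisSpatial$Elev)     # GISTools shading scheme for elevation
shades$cols <- brewer.pal(5, "Greens")           # swap in a different palette (5 classes by default)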

Colors are pretty, but what do they mean? Let’s add a legend.

us_states@bbox   # Get us_states bounding box coordinates.
##         min        max
## r1 -124.73142  -66.96985
## r2   24.95597   49.37173
plot(us_states)
choropleth(vulgarisSpatial, vulgarisSpatial$Elev, shading = shades, add = T, pch = 20)
par(xpd = TRUE)  # Allow plotting outside of plot area.
choro.legend(-124, 30, shades, cex = .75, title = "Elevation in Meters")  # Plot legend in bottom left. Takes standard legend() params.

It looks like there’s a lot going on in the Northeastern states. For a
closer look, create another clipping (like above) and plot it. Using the
structure below, we can create a selection vector. I have hidden the
full code since it is repetitive (check GitHub for the full code.)
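Before running the plotting code below, something along these lines would build the selection vector; this is a hypothetical sketch, since the STATE_NAME column and the list of states are my assumptions (check names(us_states@data) for the real column name).

# Hypothetical: logical index of Northeastern states by name
ne_states <- c("Maine", "New Hampshire", "Vermont", "Massachusetts", "Rhode Island",
               "Connecticut", "New York", "New Jersey", "Pennsylvania")
index <- us_states$STATE_NAME %in% ne_states
# Spatially clip the shaded points to those states (keeps the data slot intact)
vulgarisNE <- vulgarisSpatial[us_states[index, ], ]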

index <- '...'
plot(us_states[index,])
choropleth(vulgarisNE, vulgarisNE$Elev, shading = shades, add = T, pch = 20)
par(xpd = T)
choro.legend(-73, 39.75, shades, cex = .75, title = "Elevation in Meters")

Hopefully this has been a useful introduction (or refresher) on spatial
data. I always learn a lot in the process of writing these posts. If you
have any ideas or suggestions please leave a comment or feel free to
contact me!

Happy mapping,

Kiefer

To leave a comment for the author, please follow the link and comment on their blog: R – Real Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Shuttering Pies With Retiring Stores

Sat, 04/22/2017 - 05:32

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I caught this “gem” in the Wall Street Journal tonight:

It’s pretty hard to compare store-to-store, even though it is fairly clear which ones are going-going-gone. If we want to see the relative percentage of each store closing and also want to see how they stack up against each other, then let’s make a column of 100% bars and label total stores in each:

library(hrbrthemes)
library(tidyverse)

read.table(text='store,closing,total
"Radio Shack",550,1500
"Payless",400,2600
"Rue21",400,1100
"The Limited",250,250
"bebe",180,180
"Wet Seal",170,170
"Crocs",160,560
"JCPenny",138,1000
"American Apparel",110,110
"Kmart",109,735
"hhgregg",88,220
"Sears",41,695', sep=",", header=TRUE, stringsAsFactors=FALSE) %>%
  as_tibble() %>%
  mutate(remaining = total - closing,
         gone = round((closing/total) * 100)/100,
         stay = 1-gone,
         rem_lab = ifelse(remaining == 0, "", scales::comma(remaining))) %>%
  arrange(desc(stay)) %>%
  mutate(store=factor(store, levels=store)) -> closing_df

update_geom_font_defaults(font_rc)

ggplot(closing_df) +
  geom_segment(aes(0, store, xend=gone, yend=store, color="Closing"), size=8) +
  geom_segment(aes(gone, store, xend=gone+stay, yend=store, color="Remaining"), size=8) +
  geom_text(aes(x=0, y=store, label=closing), color="white", hjust=0, nudge_x=0.01) +
  geom_text(aes(x=1, y=store, label=rem_lab), color="white", hjust=1, nudge_x=-0.01) +
  scale_x_percent() +
  scale_color_ipsum(name=NULL) +
  labs(x=NULL, y=NULL,
       title="Selected 2017 Store closings (estimated)",
       subtitle="Smaller specialty chains such as Bebe and American Apparel are closing their stores,\nwhile larger chains such as J.C. Penny and Sears are scaling back their footprint.") +
  theme_ipsum_rc(grid="X") +
  theme(axis.text.x=element_text(hjust=c(0, 0.5, 0.5, 0.5, 1))) +
  theme(legend.position=c(0.875, 1.025)) +
  theme(legend.direction="horizontal")

One might try circle packing or a treemap to show both relative store count and percentage, but I think the bigger story is the percent reduction for each retail chain. It’d be cool to see what others come up with.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Using MCA and variable clustering in R for insights in customer attrition

Sat, 04/22/2017 - 01:15

Analytical challenges in multivariate data analysis and predictive modeling include identifying redundant and irrelevant variables. A recommended analytics approach is to first address the redundancy, which can be achieved by identifying groups of variables that are as correlated as possible among themselves and as uncorrelated as possible with other variable groups in the same data set. On the other hand, relevancy is about potential predictor variables and involves understanding the relationship between the target variable and input variables.

Multiple correspondence analysis (MCA) is a multivariate data analysis and data mining tool for finding and constructing a low-dimensional visual representation of variable associations among groups of categorical variables. Variable clustering as a tool for identifying redundancy is often applied to get a first impression of variable associations and multivariate data structure.

The motivations of this post are to illustrate the applications of: 1) preparing input variables for analysis and predictive modeling, 2) MCA as a multivariate exploratory data analysis and categorical data mining tool for business insights of customer churn data, and 3) variable clustering of categorical variables for the identification of redundant variables.

Customer churn data in this analysis:
Customer attrition is a metric businesses use to monitor and quantify the loss of customers and/or clients for various reasons. The data set includes customer-level demographic, account and services information, including monthly charge amounts and length of service with the company. Customers who left the company for competitors (Yes) or stayed with the company (No) are identified in the last column, labeled churn.

Load Packages

require(caret) require(plyr) require(car) require(dplyr) require(reshape2) theme_set(theme_bw(12)) Import and Pre-process Data

The data set used in this post was obtained from the watson-analytics-blog site. Click the hyperlink “Watson Analytics Sample Dataset – Telco Customer Churn” to download the file “WA_Fn-UseC_-Telco-Customer-Churn.csv”.

setwd("path to the location of your copy of the saved csv data file") churn <- read.table("WA_Fn-UseC_-Telco-Customer-Churn,csv", sep=",", header=TRUE) ## inspect data dimensions and structure str(churn) ## 'data.frame': 7043 obs. of 21 variables: ## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ... ## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ... ## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ... ## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ... ## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ... ## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ... ## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ... ## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ... ## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ... ## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ... ## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ... ## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ... ## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ... ## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ... ## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ... ## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ... ## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ... ## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ... ## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ... ## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1

The raw data set contains 7043 records and 21 variables. Looking at the data structure, some data columns need recoding. For instance, changing values from “No phone service” and “No internet service” to “No”, for consistency. The following code statements are to recode those observations and more.

## recode selected observations churn$MultipleLines <- as.factor(mapvalues(churn$MultipleLines, from=c("No phone service"), to=c("No"))) churn$InternetService <- as.factor(mapvalues(churn$InternetService, from=c("Fiber optic"), to=c("Fiberoptic"))) churn$PaymentMethod <- as.factor(mapvalues(churn$PaymentMethod, from=c("Credit card (automatic)","Electronic check","Mailed check", "Bank transfer (automatic)"), to=c("Creditcard","Electronicheck","Mailedcheck","Banktransfer"))) churn$Contract <- as.factor(mapvalues(churn$Contract, from=c("Month-to-month", "Two year", "One year"), to=c("MtM","TwoYr", "OneYr"))) cols_recode1 <- c(10:15) for(i in 1:ncol(churn[,cols_recode1])) { churn[,cols_recode1][,i] <- as.factor(mapvalues (churn[,cols_recode1][,i], from =c("No internet service"),to=c("No"))) }

In addition, values in the SeniorCitizen column were entered as 0 and 1. Let’s recode this variable as “No” and “Yes”, respectively, for consistency.

churn$SeniorCitizen <- as.factor(mapvalues(churn$SeniorCitizen, from=c("0","1"), to=c("No", "Yes")))

Exclude the customer ID and total charges columns from the data analysis.

cols_drop <- c(1, 20) churn <- churn[,-cols_drop]

Let’s compute summary statistics of the two numerical variables to see the distribution of the data.

summary(churn$MonthlyCharges) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 18.25 35.50 70.35 64.76 89.85 118.80 summary(churn$tenure) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.00 9.00 29.00 32.37 55.00 72.00

On the basis of the data distributions above, values in the tenure and monthly charges numerical columns could be coerced to a 3-level categorical value as follows.

churn$tenure <- as.factor(car::recode(churn$tenure, "1:9 = 'ShortTenure'; 9:29 = 'MediumTenure'; else = 'LongTenure'")) churn$MonthlyCharges <- as.factor(car::recode(churn$MonthlyCharges, "1:35 = 'LowCharge';35:70 = 'MediumCharge'; else = 'HighCharge'"))

It’s time to check for missing values in the pre-processed data set.

mean(is.na(churn)) ## [1] 0

There are no missing values. How about the category levels of each variable?

## check for factor levels in each column nfactors <- apply(churn, 2, function(x) nlevels(as.factor(x))) nfactors ## gender SeniorCitizen Partner Dependents ## 2 2 2 2 ## tenure PhoneService MultipleLines InternetService ## 3 2 2 3 ## OnlineSecurity OnlineBackup DeviceProtection TechSupport ## 2 2 2 2 ## StreamingTV StreamingMovies Contract PaperlessBilling ## 2 2 3 2 ## PaymentMethod MonthlyCharges Churn ## 4 3 2

Now, the data set is ready for analysis.

Partitioning the raw data into 70% training and 30% testing data sets

## set random seed to make reproducible results
set.seed(324)
inTrain <- createDataPartition(churn$Churn, p=0.7, list=FALSE)
training <- churn[inTrain,]
testing <- churn[-inTrain,]

Check for the dimensions of the training and testing data sets

dim(training) ; dim(testing) ## [1] 4931 19 ## [1] 2112 19

As expected, the training data set contains 4931 observations and 19 columns, whereas the testing data set contains 2112 observations and 19 columns.

Multiple Correspondence Analysis (MCA)

Invoke the FactoMineR & factoextra packages.

require(FactoMineR) require(factoextra) res.mca <- MCA(training, quali.sup=c(17,19), graph=FALSE) fviz_mca_var(res.mca, repel=TRUE)

Here is the plot of the MCA.

A general guide to extracting business insights from the figure above is to observe how close the input variables are to the target variable and to each other. For instance, customers with a month-to-month contract, those with fiber optic internet service, senior citizens, customers with high monthly charges, single customers or customers with no dependents, and those with paperless billing are associated with a short tenure with the company and a higher propensity to churn.

On the other hand, customers with one- or two-year contracts, those with DSL internet service, younger customers, those with low monthly charges, and customers with multiple lines, online security and tech support services are associated with a long tenure with the company and a tendency to stay with the company.

Variable Clustering

Load the ClustOfVar package; its hclustvar function produces a tree of variable groups. A paper detailing the ClustOfVar package is here.

require(ClustOfVar) # run variable clustering excluding the target variable (churn) variable_tree <- hclustvar(X.quali = training[,1:18]) #plot the dendrogram of variable groups plot(variable_tree)

Here is a tree of categorical variable groups.

The dendrogram suggests that the 18 input variables can be combined into approximately 7 – 9 groups of variables. That is one way of going about it. The good news is that the ClustOfVar package offers a function to cut the cluster tree into any desired number of groups (clusters) of variables. So, the syntax below will run 25 bootstrap samples of the trees to produce a plot of the stability of the variable cluster partitions.

# requesting for 25 bootstrap samplings and a plot stability(variable_tree, B=25)


The plot of stability of variable cluster partitions suggests approximately a 7 to 9-cluster solution. The syntax below will list a consensus list of 9-clusters along with the variables names included in each cluster.

## cut the variable tree into 9(?) groups of variables clus <- cutreevar(variable_tree,9) ## print the list of variables in each cluster groups print(clus$var) ## $cluster1 ## squared loading ## gender 1 ## ## $cluster2 ## squared loading ## SeniorCitizen 1 ## ## $cluster3 ## squared loading ## Partner 0.7252539 ## Dependents 0.7252539 ## ## $cluster4 ## squared loading ## tenure 0.7028741 ## Contract 0.7162027 ## PaymentMethod 0.4614923 ## ## $cluster5 ## squared loading ## PhoneService 0.6394947 ## MultipleLines 0.6394947 ## ## $cluster6 ## squared loading ## InternetService 0.9662758 ## MonthlyCharges 0.9662758 ## ## $cluster7 ## squared loading ## OnlineSecurity 0.5706136 ## OnlineBackup 0.4960295 ## TechSupport 0.5801424 ## ## $cluster8 ## squared loading ## DeviceProtection 0.5319719 ## StreamingTV 0.6781781 ## StreamingMovies 0.6796616 ## ## $cluster9 ## squared loading ## PaperlessBilling 1

The 9 clusters and the variable names in each cluster are listed above. A practical guide to minimizing redundancy is to select one representative variable per cluster. However, subject-matter considerations should also inform the selection of candidate representatives from each variable cluster group.

Descriptive statistics of customer churn

What was the overall customer churn rate in the training data set?

# overall customer churn rate round(prop.table(table(training$Churn))*100,1) ## ## No Yes ## 73.5 26.5

The overall customer attrition rate was approximately 26.5%.
Customer churn rate by demography, account and service information

cols_aggr_demog <- c(1:4,6:7,9:14,16) variable <- rep(names(training[,cols_aggr_demog]),each=4) demog_counts=c() for(i in 1:ncol(training[,cols_aggr_demog])) { demog_count <- ddply(training, .(training[,cols_aggr_demog][,i],training$Churn), "nrow") names(demog_count) <- c("class","Churn","count") demog_counts <- as.data.frame(rbind(demog_counts, demog_count)) } demog_churn_rate <- as.data.frame(cbind(variable, demog_counts)) demog_churn_rate1 <- dcast(demog_churn_rate, variable + class ~ Churn, value.var="count") demog_churn_rate2 <- mutate(demog_churn_rate1, churn_rate=round((Yes/(No+Yes)*100)-26.5,1)) demog <- as.data.frame(paste(demog_churn_rate2$variable,demog_churn_rate2$class)) names(demog) <- c("Category") demog2 <- as.data.frame(cbind(demog,demog_churn_rate2)) cols_aggr_nlev3 <- c(5,8,15,18) variable <- rep(names(training[,cols_aggr_nlev3]),each=6) nlev3_counts=c() for(i in 1:ncol(training[,cols_aggr_nlev3])) { nlev3_count <- ddply(training, .(training[,cols_aggr_nlev3][,i],training$Churn), "nrow") names(nlev3_count) <- c("class","Churn","count") nlev3_counts <- as.data.frame(rbind(nlev3_counts, nlev3_count)) } nlev3_churn_rate <- as.data.frame(cbind(variable, nlev3_counts)) nlev3_churn_rate1 <- dcast(nlev3_churn_rate, variable + class ~ Churn, value.var="count") nlev3_churn_rate2 <- mutate(nlev3_churn_rate1, churn_rate=round((Yes/(No+Yes)*100)-26.5,1)) nlev3 <- as.data.frame(paste(nlev3_churn_rate2$variable,nlev3_churn_rate2$class)) names(nlev3) <- c("Category") nlev3 <- as.data.frame(cbind(nlev3,nlev3_churn_rate2)) variable <- rep("PaymentMethod",8) nlev4_count <- ddply(training, .(training[,17],training$Churn), "nrow") names(nlev4_count) <- c("class","Churn","count") nlev4_churn_rate <- as.data.frame(cbind(variable, nlev4_count)) nlev4_churn_rate1 <- dcast(nlev4_churn_rate, variable + class ~ Churn, value.var="count") nlev4_churn_rate2 <- mutate(nlev4_churn_rate1, churn_rate=round((Yes/(No+Yes)*100)-26.5,1)) nlev4 <- as.data.frame(paste(nlev4_churn_rate$variable4,nlev4_churn_rate2$class)) names(nlev4) <- c("Category") nlev4 <- as.data.frame(cbind(nlev4,nlev4_churn_rate2)) final_agg <- as.data.frame(rbind(demog2, nlev3, nlev4)) ggplot(final_agg, aes(Category, churn_rate, color=churn_rate < 0)) + geom_segment(aes(x=reorder(Category, -churn_rate), xend = Category, y = 0, yend = churn_rate), size = 1.1, alpha = 0.7) + geom_point(size = 2.5) + theme(legend.position="none") + xlab("Variable") + ylab("Customer Churn (%)") + ggtitle("Customer Attrition rate \n Difference from the overall average (%)") + theme(axis.text.x=element_text(angle=45, hjust=1)) + coord_flip()

Looking at the figure above, customers with higher than average attrition rates include those paying by electronic check, with month-to-month contracts, with higher monthly charges and with paperless billing. On a positive note, customers with low monthly charges, longer contracts, online security services, dependents or partners, and those paying by credit card or bank transfer showed much lower than average rates of attrition.

In conclusion

Variables such as contract length, bill payment method, internet service type and even customer demography appeared to play a role in customer attrition and retention. The next step for this company would be to deploy predictive and prescriptive models that would score prospective customers for their propensity to churn. I hope this post is helpful. Please leave your comments or suggestions below, and feel free to network with the author on LinkedIn.

    Related Post

    1. Web Scraping and Applied Clustering Global Happiness and Social Progress Index
    2. Key Phrase Extraction from Tweets
    3. Financial time series forecasting – an easy approach
    4. Outlier detection and treatment with R
    5. Implementing Apriori Algorithm in R

    Programming over R

    Fri, 04/21/2017 - 20:49

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    R is a very fluid language amenable to meta-programming, or alterations of the language itself. This has allowed the late user-driven introduction of a number of powerful features such as magrittr pipes, the foreach system, futures, data.table, and dplyr. Please read on for some small meta-programming effects we have been experimenting with.

    Meta-Programming

    Meta-programming is a powerful tool that allows one to re-shape a programming language or write programs that automate parts of working with a programming language.

    Meta-programming itself has the central contradiction that one hopes nobody else is doing meta-programming, but that they are instead dutifully writing referentially transparent code that is safe to perform transformations over, so that one can safely introduce their own clever meta-programming. For example: one would hate to lose the ability to use a powerful package such as future because we already “used up all the referential transparency” for some minor notational effect or convenience.

    That being said, R is an open system and it is fun to play with the notation. I have been experimenting with different notations for programming over R for a while, and thought I would demonstrate a few of them here.

    Let Blocks

    We have been using let to code over non-standard evaluation (NSE) packages in R for a while now. This allows code such as the following:

    library("dplyr") library("wrapr") d <- data.frame(x = c(1, NA)) cname <- 'x' rname <- paste(cname, 'isNA', sep = '_') let(list(COL = cname, RES = rname), d %>% mutate(RES = is.na(COL)) ) # x x_isNA # 1 1 FALSE # 2 NA TRUE

    let is in fact quite handy notation that will work in a non-deprecated manner with both dplyr 0.5 and dplyr 0.6. It is how we are future-proofing our current dplyr workflows.

    Unquoting

    dplyr 0.6 is introducing a new execution system (alternately called rlang or tidyeval, see here) which uses a notation more like the following (but with fewer parentheses, and with the ability to control the left-hand side of an in-argument assignment):

    beval(d %>% mutate(x_isNA = is.na((!!cname))))

    The inability to re-map the right-hand side of the apparent assignment is because the “(!! )” notation doesn’t successfully masquerade as a lexical token valid on the left-hand side of assignments or function argument bindings.

    And there was an R language proposal for a notation like the following (but without the quotes, and with some care to keep it syntactically distinct from other uses of “@”):

    ateval('d %>% mutate(@rname = is.na(@cname))')

    beval and ateval are just curiosities implemented to try and get a taste of the new dplyr notation, and we don’t recommend using them in production — their ad-hoc demonstration implementations are just not powerful enough to supply a uniform interface. dplyr itself seems to be replacing a lot of R‘s execution framework to achieve stronger effects.

    Write Arrow

    We are experimenting with “write arrow” (a deliberate homophone of “right arrow”). It allows the convenient storing of a pipe result into a variable chosen by name.

    library("dplyr") library("replyr") 'x' -> whereToStoreResult 7 %>% sin %>% cos %->_% whereToStoreResult print(x) ## [1] 0.7918362

    Notice that the result of piping “7” through the functions is stored in the variable “x”, not in a variable named “whereToStoreResult”. “whereToStoreResult” was able to name where to store the value parametrically.

    This allows code such as the following:

    for(i in 1:3) { i %->_% paste0('x',i) }

    (Please run the above to see the automatic creation of variables named “x1”, “x2”, and “x3”, storing values 1,2, and 3 respectively.)

    We know left to right assignment is heterodox; but the notation is very slick if you are consistent with it, and add in some formatting rules (such as insisting on a line break after each pipe stage).

    Conclusion

    One wants to use meta-programming with care. In addition to bringing in desired convenience it can have unexpected effects and interactions deeper in a language or when exposed to other meta-programming systems. This is one reason why a “seemingly harmless” proposal such as “user defined unary functions” or “at unquoting” takes so long to consider. This is also why new language features are best tried in small packages first (so users can easily choose to include them or not in their larger workflow) to drive public request for comments (RFC) processes or allow the ideas to evolve (and not be frozen at their first good idea, a great example of community accepted change being Haskell’s switch from request chaining IO to monadic IO; the first IO system “seemed inevitable” until it was completely replaced).

    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    R Weekly Bulletin Vol – V

    Fri, 04/21/2017 - 20:08

    (This article was first published on R programming, and kindly contributed to R-bloggers)

    This week’s R bulletin will cover topics like how to avoid for-loops, add to or shorten an existing vector, and play a beep sound in R. We will also cover functions like new.env, readSeries, and the with and within functions. Hope you like this R weekly bulletin. Enjoy reading!

    Shortcut Keys

    1. To stop debugging – Shift+F8
    2. To quit an R session (desktop only) – Ctrl+Q
    3. To restart an R Session – Ctrl+Shift+0

    Problem Solving Ideas Avoiding For Loop by using “with” function

    For loops can be slow in terms of execution speed when we are dealing with large data sets. For faster execution, one can use the “with” function as an alternative. The syntax of the with function is given below:

    with(data, expr)

    where, “data” is typically a data frame, and “expr” stands for one or more expressions to be evaluated using the contents of the data frame. If there is more than one expression, then the expressions need to be wrapped in curly braces.

    Example: Consider the NIFTY 1-year price series. Let us find the gap opening for each day using both the methods and time them using the system.time function. Note the time taken to execute the For Loop versus the time to execute the with function in combination with the lagpad function.

    library(quantmod) # Using FOR Loop system.time({ df = read.csv("NIFTY.csv") df = df[,c(1,3:6)] df$GapOpen = double(nrow(df)) for ( i in 2:nrow(df)) { df$GapOpen[i] = round(Delt(df$CLOSE[i-1],df$OPEN[i])*100,2) } print(head(df)) })

    # Using with function + lagpad, instead of FOR Loop system.time({ dt = read.csv("NIFTY.csv") dt = dt[,c(1,3:6)] lagpad = function(x, k) { c(rep(NA, k), x)[1 : length(x)] } dt$PrevClose = lagpad(dt$CLOSE, 1) dt$GapOpen_ = with(dt, round(Delt(dt$PrevClose,dt$OPEN)*100,2)) print(head(dt)) })

    Adding to an existing vector or shortening it

    Adding or shortening an existing vector can be done by assigning a new length to the vector. When we shorten a vector, the values at the end will be removed, and when we extend an existing vector, missing values will be added at the end.

    Example:

    # Shorten an existing vector even = c(2,4,6,8,10,12) length(even)

    [1] 6

    # The new length equals the number of elements required in the vector to be shortened. length(even) = 3 print(even)

    [1] 2 4 6

    # Add to an existing vector odd = c(1,3,5,7,9,11) length(odd)

    [1] 6

    # The new length equals the number of elements required in the extended vector. length(odd) = 8 odd[c(7,8)] = c(13,15) print(odd)

    [1] 1 3 5 7 9 11 13 15

    Make R beep/play a sound

    If you want R to play a sound/beep upon executing the code, we can do this using the “beepr” package. The beep function from the package plays a sound when the code gets executed. One also needs to install the “audio” package along with the “beepr” package.

    install.packages("beepr") install.packages("audio") library(beepr) beep()

    One can select from the various sounds using the “sound” argument and by assigning one of the specified values to it.

    beep(sound = 9)

    One can keep repeating the message using beepr as illustrated in the example below (source: http://stackoverflow.com/).

    Example:

    work_complete <- function() { cat("Work complete. Press esc to sound the fanfare!!!\n") on.exit(beepr::beep(3)) while (TRUE) { beepr::beep(4) Sys.sleep(1) } } work_complete()

    One can also use the beep function to play a sound if an error occurs during the code execution.

    options(error = function() {beep(sound = 5)})

    Functions Demystified

    new.env function

    Environments act as a storehouse. When we create variables in R from the command prompt, these get stored in R’s global environment. To access the variables stored in the global environment, one can use the following expression:

    head(ls(envir = globalenv()), 15)

    [1] “df”  “dt”  “even”  “i”  “lagpad”  “odd”

    If we want to store the variables in a specific environment, we can assign the variable to that environment or create a new environment which will store the variable. To create a new environment we use the new.env function.

    Example:

    my_environment = new.env()

    Once we create a new environment, assigning a variable to the environment can be done in multiple ways. Following are some of the methods:

    Examples: # By using double square brackets my_environment[["AutoCompanies"]] = c("MARUTI", "TVSMOTOR", "TATAMOTORS") # By using dollar sign operator my_environment$AutoCompanies = c("MARUTI", "TVSMOTOR", "TATAMOTORS") # By using the assign function assign("AutoCompanies", c("MARUTI", "TVSMOTOR", "TATAMOTORS"), my_environment)

    The variables existing in an environment can be viewed or listed using the get function or by using the ls function.

    Example:

    ls(envir = my_environment)

    [1] “AutoCompanies”

    get("AutoCompanies", my_environment)

    [1] “MARUTI”  “TVSMOTOR”  “TATAMOTORS”

    readSeries function

    The readSeries function is part of the timeSeries package, and it reads a file in table format and creates a timeSeries object from it. The main arguments of the function are:

    readSeries(file, header = TRUE, sep = “,”,format)

    where,
    file: the filename of a spreadsheet dataset from which to import the data records.
    header: a logical value indicating whether the file contains the names of the variables as its first line.
    format: a character string with the format in POSIX notation specifying the timestamp format.
    sep: the field separator used in the spreadsheet file to separate columns. By default, it is set as “;”.

    Example:

    library(timeSeries) # Reading the NIFTY data using read.csv df = read.csv(file = "NIFTY.csv") print(head(df))

    # Reading the NIFTY data and creating a time series object using readSeries # function df = readSeries(file = "NIFTY.csv", header = T, sep = ",", format = "%Y%m%d") print(head(df))

    with and within functions

    The with and within functions apply an expression to a given data set and allow one to manipulate it. The within function even keeps track of changes made, including adding or deleting elements, and returns a new object with these revised contents. The syntax for these two functions is given as:

    with(data, expr)
    within(data, expr)

    where,
    data – typically is a list or data frame, although other options exist for with.
    expr – one or more expressions to evaluate using the contents of data, the commands must be wrapped in braces if there is more than one expression to evaluate.

    # Consider the NIFTY data df = as.data.frame(read.csv("NIFTY.csv")) print(head(df, 3))

    # Example of with function: df$Average = with(df, apply(df[3:6], 1, mean)) print(head(df, 3))

    # Example of within function: df = within(df, { df$Average = apply(df[3:6], 1, mean) }) print(head(df, 3))

    Next Step

    We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

    Download the PDF Now!

    The post R Weekly Bulletin Vol – V appeared first on .

    To leave a comment for the author, please follow the link and comment on their blog: R programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Microeconomic Theory and Linear Regression (Part 2)

    Fri, 04/21/2017 - 18:00

    (This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers)

    Introduction

    In the first part of this article I explained and compared four different functional forms to model apples’ production.

    This is what I need to reproduce the previously explained examples:

    # Libraries #install.packages(c("micEcon","lmtest","bbmle","miscTools")) library(micEcon) library(lmtest) library(stats4) #this is a base package so I don't install this library(bbmle) library(miscTools) # Load data data("appleProdFr86", package = "micEcon") data = appleProdFr86 rm(appleProdFr86) data$qCap = data$vCap/data$pCap data$qLab = data$vLab/data$pLab data$qMat = data$vMat/data$pMat # Fit models ## Linear specification prodLin = lm(qOut ~ qCap + qLab + qMat, data = data) data$qOutLin = fitted(prodLin) ## Quadratic specification prodQuad = lm(qOut ~ qCap + qLab + qMat + I(0.5*qCap^2) + I(0.5*qLab^2) + I(0.5*qMat^2) + I(qCap*qLab) + I(qCap*qMat) + I(qLab*qMat), data = data) data$qOutQuad = fitted(prodQuad) ## Cobb-Douglas specification prodCD = lm(log(qOut) ~ log(qCap) + log(qLab) + log(qMat), data = data) data$qOutCD = fitted(prodCD) ## Translog specification prodTL = lm(log(qOut) ~ log(qCap) + log(qLab) + log(qMat) + I(0.5*log(qCap)^2) + I(0.5*log(qLab)^2) + I(0.5*log(qMat)^2) + I(log(qCap)*log(qLab)) + I(log(qCap)*log(qMat)) + I(log(qLab)*log(qMat)), data = data) data$qOutTL = fitted(prodTL)

    These four specifications presented different problems that suggested model misspecification. Ideally, I should find a model with positive predicted output and positive factor elasticity to make sense of the context of production I’m working with.

    Model misspecification

    RESET test (the name stands for Regression Equation Specification Error Test) is a test for functional form misspecification.

    As a general test, RESET helps detect omitted variables. Package lmtest provides resettest, a function that implements one of the many versions of the test, and it fits this model by default:

    $$y_i = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + \gamma_1 \hat{y}_i^2 + \gamma_2 \hat{y}_i^3 + e_i$$

    The key idea behind fitting another model that includes the output is to detect if the model ignores an important nonlinearity.

    The hypothesis testing is:

    $$H_0:\: \gamma_i = 0 \text{ for any } i \\
    H_1:\: \gamma_i \neq 0 \text{ for some } i$$

    If there is evidence to reject the null hypothesis, the test does not provide an alternative model; it just states that the model is misspecified.

    The statistic will be a t statistic if we just include \(\gamma_1 \hat{y}_i^2\), and it will change to a F statistic if we include \(\gamma_2 \hat{y}_i^3\) or a higher power.
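    To make the mechanics concrete, here is a hedged manual version of what resettest checks for the linear specification; the helper objects are mine, and the actual calls used in this post are the resettest ones below.

    # Sketch: augment the fitted model with powers of its fitted values and
    # F-test whether those extra terms are jointly significant.
    yhat <- fitted(prodLin)
    prodLinAug <- update(prodLin, . ~ . + I(yhat^2) + I(yhat^3))
    anova(prodLin, prodLinAug)   # F test of the gamma coefficients being zero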

    In the context of the worked specifications:

    resettest(prodLin) RESET test data: prodLin RESET = 17.639, df1 = 2, df2 = 134, p-value = 1.584e-07 resettest(prodQuad) RESET test data: prodQuad RESET = 7.3663, df1 = 2, df2 = 128, p-value = 0.0009374 resettest(prodCD) RESET test data: prodCD RESET = 2.9224, df1 = 2, df2 = 134, p-value = 0.05724 resettest(prodTL) RESET test data: prodTL RESET = 1.2811, df1 = 2, df2 = 128, p-value = 0.2813

    \(\Longrightarrow\) The null hypothesis is rejected for both the Linear and Quadratic specifications, while the Cobb-Douglas specification is rejected at the 10% significance level. The null hypothesis can’t be rejected for the Translog specification.

    I can also modify the powers in the model:

    resettest(prodTL, power = 2) RESET test data: prodTL RESET = 2.358, df1 = 1, df2 = 129, p-value = 0.1271 resettest(prodTL, power = 2:4) RESET test data: prodTL RESET = 1.3475, df1 = 3, df2 = 127, p-value = 0.262

    \(\Longrightarrow\) The null hypothesis can’t be rejected for Translog Specification, not even by adding a higher power.

    Quasiconcavity

    A function

    $$
    \begin{align*}
    f: \mathbb{R}^n &\to \mathbb{R} \cr
    x &\mapsto f(x)
    \end{align*}
    $$

    is quasiconcave if for \(0 \leq t \leq 1\)

    $$f(tx_1 + (1-t)x_2) \geq \min\{f(x_1),f(x_2)\}$$

    In particular, any concave and twice continuously differentiable function is quasiconcave.

    In really raw terms, quasiconcavity guarantees there is a production plan where the company is minimizing cost. Adding differentiability, you can find an optimal production plan by using derivatives.

    For the details, please check Advanced Microeconomic Theory by Geoffrey A. Jehle.

    I’ll check quasiconcavity in an indirect way, by checking concavity at each observation using the Hessian of marginal products:
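    Concretely, concavity at an observation means the 3×3 Hessian \(H\) is negative definite, which for three inputs amounts to alternating signs of its leading principal minors:

    $$H_{11} < 0, \qquad \begin{vmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{vmatrix} > 0, \qquad \det(H) < 0$$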

    # Quadratic Specification ## Regression coefficients b1 = coef(prodQuad)["qCap"] b2 = coef(prodQuad)["qLab"] b3 = coef(prodQuad)["qMat"] b11 = coef(prodQuad)["I(0.5 * qCap^2)"] b22 = coef(prodQuad)["I(0.5 * qLab^2)"] b33 = coef(prodQuad)["I(0.5 * qMat^2)"] b12 = b21 = coef(prodQuad)["I(qCap * qLab)"] b13 = b31 = coef(prodQuad)["I(qCap * qMat)"] b23 = b32 = coef(prodQuad)["I(qLab * qMat)"] ## Marginal productivity data$mpCapQuad = with(data, b1 + b11*qCap + b12*qLab + b13*qMat) data$mpLabQuad = with(data, b2 + b21*qCap + b22*qLab + b23*qMat) data$mpMatQuad = with(data, b3 + b31*qCap + b32*qLab + b33*qMat) ## Quasiconcavity data$quasiConcQuad = NA for(obs in 1:nrow(data)) { hmLoop = matrix(0, nrow = 3, ncol = 3) hmLoop[1,1] = b11 hmLoop[1,2] = hmLoop[2,1] = b12 hmLoop[1,3] = hmLoop[3,1] = b13 hmLoop[2,2] = b22 hmLoop[2,3] = hmLoop[3,2] = b23 hmLoop[3,3] = b33 data$quasiConcQuad[obs] = hmLoop[1,1] < 0 & det(hmLoop[1:2,1:2] > 0) & det(hmLoop < 0) } sum(!data$quasiConcQuad) [1] 140 # Translog Specification ## Regression coefficients b1 = coef(prodTL)["log(qCap)"] b2 = coef(prodTL)["log(qLab)"] b3 = coef(prodTL)["log(qMat)"] b11 = coef(prodTL)["I(0.5 * log(qCap)^2)"] b22 = coef(prodTL)["I(0.5 * log(qLab)^2)"] b33 = coef(prodTL)["I(0.5 * log(qMat)^2)"] b12 = b21 = coef(prodTL)["I(log(qCap) * log(qLab))"] b13 = b31 = coef(prodTL)["I(log(qCap) * log(qMat))"] b23 = b32 = coef(prodTL)["I(log(qLab) * log(qMat))"] ## Scale elasticity data$eCapTL = with(data, b1 + b11*log(qCap) + b12*log(qLab) + b13*log(qMat)) data$eLabTL = with(data, b2 + b21*log(qCap) + b22*log(qLab) + b23*log(qMat)) data$eMatTL = with(data, b3 + b31*log(qCap) + b32*log(qLab) + b33*log(qMat)) ## Marginal productivity data$mpCapTL = with(data, eCapTL * qOutTL / qCap) data$mpLabTL = with(data, eLabTL * qOutTL / qLab) data$mpMatTL = with(data, eMatTL * qOutTL / qMat) ## Derivatives of marginal productivity data$dmpCapCapTL = with(data, (qOutTL / qCap^2) * (b11 + eCapTL^2 - eCapTL)) data$dmpLabLabTL = with(data, (qOutTL / qLab^2) * (b22 + eLabTL^2 - eLabTL)) data$dmpMatMatTL = with(data, (qOutTL / qLab^2) * (b33 + eMatTL^2 - eMatTL)) data$dmpCapLabTL = with(data, (qOutTL / (qCap * qLab)) * (b12 + eCapTL * eLabTL)) data$dmpCapMatTL = with(data, (qOutTL / (qCap * qMat)) * (b13 + eCapTL * eMatTL)) data$dmpLabMatTL = with(data, (qOutTL / qLab * qMat) * (b23 + eLabTL * qMat)) ## Quasiconcavity data$quasiConcTL = NA for(obs in 1:nrow(data)) { hmLoop = matrix(0, nrow = 3, ncol = 3) hmLoop[1,1] = data$dmpCapCapTL[obs] hmLoop[1,2] = hmLoop[2,1] = data$dmpCapLabTL[obs] hmLoop[1,3] = hmLoop[3,1] = data$dmpCapMatTL[obs] hmLoop[2,2] = data$dmpLabLabTL[obs] hmLoop[2,3] = hmLoop[3,2] = data$dmpLabMatTL[obs] hmLoop[3,3] = data$dmpMatMatTL[obs] data$quasiConcTL[obs] = hmLoop[1,1] < 0 & det(hmLoop[1:2,1:2] > 0) & det(hmLoop < 0) } sum(!data$quasiConcTL) [1] 126

    With this information I can extend the last table from the first part of the post:

                                                      Quadratic   Translog
    \(R^2\) on \(y\)                                  0.85        0.77
    \(R^2\) on \(ln(y)\)                              0.55        0.63
    Obs. with negative predicted output               0           0
    Obs. that violate the monotonicity condition      39          48
    Obs. with negative scale elasticity               22          0
    Obs. that violate the quasiconcavity condition    140         126

    \(\Longrightarrow\) Now I have more evidence in favour of a different approach as it was mentioned at the end of the first part of the post.

    Non-parametric regression

    A non-parametric model is of the form:

    $$Y_i = m(X_i) + \varepsilon_i$$

    The advantages are:
    * The functional form is not defined beforehand
    * Obtains a best fit function \(m\) from the data
    * Obtains marginal effects

    The disadvantages are:
    * Higher computational cost
    * There are no regression coefficients (the focus is the regression function)
    * Hard to interpret

    Single variable non-parametric regression

    I can predict the output by using capital as the only input:

    library(KernSmooth) qCap = as.vector(data$qCap) qOut = as.vector(data$qOut) bw = dpill(qCap,qOut) fit1 = locpoly(qCap, qOut, bandwidth = bw, degree = 1) # 1st degree polynomial plot(fit1$x, fit1$y, type = "l")

    Here are different specifications:

    # 2nd degree polynomial fit2.1 <- locpoly(qCap, qOut, bandwidth = bw, degree = 2) fit2.2 <- locpoly(qCap, qOut, bandwidth = bw*2, degree = 2) # less variance / more bias fit2.3 <- locpoly(qCap, qOut, bandwidth = bw/2, degree = 2) # more variance / less bias # linear regression fit.lm <- lm(qOut ~ qCap) # plots plot(fit1$x, fit1$y, type = "l");par(mfrow=c(2,2)) plot(fit1$x, fit1$y, type = "l");lines(fit2.1, col = "blue") plot(fit1$x, fit1$y, type = "l");lines(fit2.2, col = "red") plot(fit1$x, fit1$y, type = "l");lines(fit2.3, col = "green") plot(fit1$x, fit1$y, type = "l");abline(fit.lm, col = "purple")

    Checking the models:

    # model 1 sum(is.na(fit1$y)) [1] 50 # models 2, 3 and 4 sum(is.na(fit2.1$y)) [1] 50 sum(is.na(fit2.2$y)) [1] 9 sum(is.na(fit2.3$y)) [1] 84

    \(\Longrightarrow\) I should use model 2.2 and adjust it.

    I’ll try with a neoclassical function with diminishing marginal returns:

    # adjust the bandwidth and the degree fit.sqrt = locpoly(qCap, qOut, bandwidth = bw*3, degree = 0.5) sum(is.na(fit.sqrt$y)) [1] 0 sum(fit.sqrt$y < 0) [1] 0

    \(\Longrightarrow\) In this model there are no missing (NA) or negative predicted outputs.

    plot(fit.sqrt$x, fit.sqrt$y, type = "l")

    I can obtain the median productivity:

    median(fit.sqrt$y)/median(fit.sqrt$x) [1] 18.05637

    \(\Longrightarrow\) In median terms, each unit of capital increases output by about 18 units.

    If I adjust the bandwidth, the function will look more and more like a neoclassical function from textbooks, with an initial part of increasing marginal returns; from the point where marginal and median productivities are equal (the point where median productivity is at its maximum), output keeps increasing but at a decreasing rate.

    This is an example of the last paragraph:

    fit.sqrt.2 = locpoly(qCap, qOut, bandwidth = bw*5, degree = 0.5) plot(fit.sqrt.2$x, fit.sqrt.2$y, type = "l")

    Multivariable non-parametric regression

    Among many alternatives, the Epanechnikov kernel method is a non-parametric approach that weights the contribution of each observation and, by iterating, obtains the best-fit function.
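    For reference (a standard definition, not something spelled out in the original post), the Epanechnikov kernel that supplies these weights is

    $$K(u) = \frac{3}{4}\left(1 - u^2\right) \quad \text{for } |u| \leq 1, \qquad K(u) = 0 \text{ otherwise.}$$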

    I’ll fit a regression applying logs to the variables:

    library(np)
    options(np.messages = FALSE)

    prodNP = npreg(log(qOut) ~ log(qCap) + log(qLab) + log(qMat),
                   regtype = "ll", bwmethod = "cv.aic",
                   ckertype = "epanechnikov", data = data,
                   gradients = TRUE)
    summary(prodNP)

    Regression Data: 140 training points, in 3 variable(s)
                  log(qCap) log(qLab) log(qMat)
    Bandwidth(s):  1.039647    332644 0.8418465

    Kernel Regression Estimator: Local-Linear
    Bandwidth Type: Fixed
    Residual standard error: 0.6227669
    R-squared: 0.6237078

    Continuous Kernel Type: Second-Order Epanechnikov
    No. Continuous Explanatory Vars.: 3

    These are the regression plots:

    The variables that do not appear in a given plot are held fixed at their median values.
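    A minimal sketch of how such plots can be produced with the np package's plot method (assuming the prodNP object fitted above; the error-computation option shown is just one possible choice):

    # Partial regression plots for prodNP; the other regressors are held at their medians
    plot(prodNP, plot.errors.method = "asymptotic")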

    I can test the significance for this model:

    options(np.messages = FALSE)
    npsigtest(prodNP)

    Kernel Regression Significance Test
    Type I Test with IID Bootstrap (399 replications, Pivot = TRUE, joint = FALSE)
    Explanatory variables tested for significance:
    log(qCap) (1), log(qLab) (2), log(qMat) (3)

                  log(qCap) log(qLab) log(qMat)
    Bandwidth(s):  1.039647    332644 0.8418465

    Individual Significance Tests
    P Value:
    log(qCap)   0.11779
    log(qLab)   < 2e-16 ***
    log(qMat)   < 2e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    \(\Longrightarrow\) At the 10% significance level, the only variable without a statistically significant effect is capital.

    Finally, I can obtain information about the product-factor elasticities and the scale elasticity:

    summary(gradients(prodNP)[,1]) # capital elasticity
        Min.  1st Qu.  Median    Mean  3rd Qu.    Max.
    -0.10300  0.08598 0.14360 0.14500  0.18810 0.30060

    summary(gradients(prodNP)[,2]) # labour elasticity
       Min.  1st Qu.  Median    Mean  3rd Qu.    Max.
    0.02208  0.63050 0.69880 0.70970  0.80670 0.97890

    summary(gradients(prodNP)[,3]) # materials elasticity
      Min. 1st Qu. Median   Mean 3rd Qu.   Max.
    0.3873  0.4546 0.5669 0.5850  0.6502 1.3480

    summary(rowSums(gradients(prodNP))) # scale elasticity
     Min. 1st Qu. Median  Mean 3rd Qu.  Max.
    1.267   1.363  1.416 1.440   1.479 1.880

    par(mfrow = c(2,2))
    hist(gradients(prodNP)[,1], main = "Capital")
    hist(gradients(prodNP)[,2], main = "Labour")
    hist(gradients(prodNP)[,3], main = "Materials")
    hist(rowSums(gradients(prodNP)), main = "Scale")


    To leave a comment for the author, please follow the link and comment on their blog: Pachá (Batteries Included). R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Reproducible Data Science with R

    Fri, 04/21/2017 - 17:01

    Yesterday, I had the honour of presenting at The Data Science Conference in Chicago. My topic was Reproducible Data Science with R, and while the specific practices in the talk are aimed at R users, my intent was to make a general argument for doing data science within a reproducible workflow. Whatever your tools, a reproducible process:

    • Saves time,
    • Produces better science,
    • Creates more trusted research,
    • Reduces the risk of errors, and
    • Encourages collaboration.

    Sadly there's no recording of this presentation, but my hope is that the slides are sufficiently self-contained. Some of the images are links to further references, too. You can browse them below, or download (CC-BY) them from the SlideShare page.

    Thanks to all who attended for the interesting questions and discussion during the panel session!

    Emails from R

    Fri, 04/21/2017 - 14:04

    (This article was first published on R – Insights of a PhD student, and kindly contributed to R-bloggers)

    There are a few packages for sending email directly from R, but I work in a place where none of these work due to strict network settings. To at least partially circumvent this, here’s some code to produce a PowerShell script to send email(s) via Outlook. The PowerShell script can then be run either by a shell call (again, not possible in my workplace) or by right clicking the file and selecting run with PowerShell.

     

    # Addresses
    add <- c("xxx@yyy.cc", "aaa@bbb.cc")
    subject <- "Testing"

    # construct message
    # opening
    start <- 'Hi, how are you?'
    # main content
    body <- 'sent almost exclusively from R'
    # signature
    sig <- 'And this is my signature'
    # paste components together
    Body <- paste(start, body, sig)

    # construct PowerShell code (*.ps1)
    email <- function(recip, subject, Body, filename, attachment = NULL, append = FALSE){
      file <- paste0(filename, ".ps1")
      write('$Outlook = New-Object -ComObject Outlook.Application', file, append = append)
      write('$Mail = $Outlook.CreateItem(0)', file, append = TRUE)
      write(paste0('$Mail.To = "', recip, '"'), file, append = TRUE)
      write(paste0('$Mail.Subject = "', subject, '"'), file, append = TRUE)
      write(paste0('$Mail.Body = "', Body, '"'), file, append = TRUE)
      if(!is.null(attachment)){
        write(paste0('$File = "', attachment, '"'), file, append = TRUE)
        write('$Mail.Attachments.Add($File)', file, append = TRUE)
      }
      write('$Mail.Send()', file, append = TRUE)
      if(append) write('', file, append = TRUE)
    }

    for(i in 1:length(add)){
      file <- paste0("email", i)               # the function adds the ".ps1" extension
      att <- file.path(getwd(), "blabla.txt")
      email(add[i], subject, Body, file, attachment = att)    # with attachment
      # email(add[i], subject, Body, file)                    # without attachment
      # email(add[i], subject, Body, "emails", append = TRUE) # multiple emails in a single PS file
    }

     

    Now you can go and run the PowerShell script from within windows explorer.
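    If shell calls are allowed in your environment, a minimal sketch like the one below (assuming the generated file email1.ps1 sits in the working directory) could launch the script directly from R instead:

    # Run the generated script with PowerShell, bypassing the execution policy for this call only
    system2("powershell", args = c("-ExecutionPolicy", "Bypass", "-File", "email1.ps1"))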

    To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD student. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Logistic regressions (in R)

    Fri, 04/21/2017 - 11:02

    (This article was first published on R – Locke Data, and kindly contributed to R-bloggers)

    Logistic regressions are a great tool for predicting categorical outcomes. They use a link function (the logit) to relate a linear combination of the predictors to the probability of an outcome, which makes them easy to interpret and to implement in other systems.

    Logistic regressions can be used to perform a classification for things like determining whether someone needs to go for a biopsy. They can also give a more nuanced view by working with the predicted probabilities directly, for things like prioritising interventions based on the likelihood of defaulting on a loan.
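    As a quick illustration (not from the talk; a minimal sketch on a built-in dataset), this is what fitting and using a logistic regression looks like in R:

    # Model the probability of a manual transmission (am) from weight and horsepower
    fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
    summary(fit)                               # coefficients are on the log-odds scale
    probs <- predict(fit, type = "response")   # predicted probabilities
    preds <- as.integer(probs > 0.5)           # classify at a 0.5 cutoff
    table(predicted = preds, actual = mtcars$am)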

    I recently did a remote talk to Plymouth University on logistic regressions, which covers:

    • how they work (not maths heavy!)
    • how you build them in R
    • things to think about when preparing your data
    • ways to evaluate a logistic regression

    You can watch the video below, get the slides, and view the slides’ source code.

    This talk is a cut-down version of my community workshop on logistic regressions, which is in itself a cut-down version of a full day of training on them. Get in touch if you’re interested in the talk or workshop for your user group, or if you’d like to discuss in-depth training.

     

    The post Logistic regressions (in R) appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.

    To leave a comment for the author, please follow the link and comment on their blog: R – Locke Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Rblpapi 0.3.6

    Fri, 04/21/2017 - 03:36

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    Time for a new release of Rblpapi — version 0.3.6 is now on CRAN. Rblpapi provides a direct interface between R and the Bloomberg Terminal via the C++ API provided by Bloomberg Labs (but note that a valid Bloomberg license and installation is required).
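    For readers new to the package, usage looks roughly like the sketch below; it assumes a valid Bloomberg licence and a running terminal session, and the exact lookupSecurity() arguments are an assumption based on the description that follows:

    library(Rblpapi)
    blpConnect()                                   # connect to the local Bloomberg session
    bdp("IBM US Equity", c("PX_LAST", "VOLUME"))   # current reference data for a security
    lookupSecurity("IBM")                          # search for securities (new in this release)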

    This is the seventh release since the package first appeared on CRAN last year. This release brings a very nice new function lookupSecurity() contributed by Kevin Jin as well as a number of small fixes and enhancements. Details below:

    Changes in Rblpapi version 0.3.6 (2017-04-20)
    • bdh can now store in double preventing overflow (Whit and John in #205 closing #163)

    • bdp documentation has another override example

    • A new function lookupSecurity can search for securities, optionally filtered by yellow key (Kevin Jin and Dirk in #216 and #217 closing #215)

    • Added file init.c with calls to R_registerRoutines() and R_useDynamicSymbols(); also use .registration=TRUE in useDynLib in NAMESPACE (Dirk in #220)

    • getBars and getTicks can now return data.table objects (Dirk in #221)

    • bds has improved internal protect logic via Rcpp::Shield (Dirk in #222)

    Courtesy of CRANberries, there is also a diffstat report for this release. As always, more detailed information is on the Rblpapi page. Questions, comments etc should go to the issue tickets system at the GitHub repo.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    A Shiny App for Importing and Forecasting Commodities Prices from Quandl

    Fri, 04/21/2017 - 02:00

    (This article was first published on R Views, and kindly contributed to R-bloggers)



    In a previous post, we imported oil data from Quandl and applied a simple model to it. Today, we’ll port that work over to a Shiny app (by way of flexdashboard, of course) that allows a user to choose a commodity (oil, copper or gold), choose a frequency for the time series, and choose how many periods ahead to forecast. The app will display the price history and the forecasted price. To see it in action, have a look here.

    As with the previous Notebook, the main purpose of this post is to build a template where a more sophisticated or proprietary model could be used for forecasting. By setting it up so that models could be selected as inputs, several different models could be included. We are not as concerned with the modeling as we are with providing a format that is friendly to both end users and any future collaborator who might want to take this app and expand upon it.

    Let’s get to it! In the code chunk below, we are immediately faced with a few important decisions. The first of those is the format of the user input. At the extreme, we could use textInput and simply allow the user to enter the code for the desired data set. The benefit is that we would not limit the user in any way; he or she could choose any dataset on Quandl. The cost would be that the user would need to know – or go to Quandl and look up – the code for any data set.

    For example, to import WTI oil prices, the user would have to type in "FRED/DCOILWTICO". That's no problem if most of the end users know that code, but it's a big problem otherwise. We want to emphasize convenience and broad usability, so we are going with selectInput instead of textInput, meaning our app will show a drop-down of a few choices. The user just selects "WTI oil" instead of typing "FRED/DCOILWTICO", or selects "copper" instead of typing "ODA/PCOPP_USD". But, if a user wants to work with a data set that we haven't included, said user is out of luck.

    Another big decision is how many choices to give the user. I have included only 3: oil, gold and copper. In industry, you would probably include several more, perhaps all of the industrial metals, but there is a cutoff somewhere. Or maybe we prefer one choice, because this app is just for oil analysis. Either way, the number of options in the drop-down menu is another trade-off between usability and flexibility.

    The final decision is a bit more nuanced, and requires looking ahead to how these inputs will be used further down in the app. Have a peek at the object called dataChoices, and you might notice that we don’t strictly need that object. We could have put the vector of choices as an argument to selectInput, so that our code would have read choices = c("WTI oil" = "FRED/DCOILWTICO", ...) instead of choices = dataChoices. In that choice assignment, “WTI oil” is called the name and “FRED/DCOILWTICO” is called the value (together we can think of them as a name-value pair). The entire reason for building a separate dataChoices object is that we want the ability to extract either the name or the value of the name-value pair. Usually we would care only about the value, because we want to pass the value to Quandl and import the data, but that name is going to be useful as well when we label our graph.

    The ability to extract names and values will become even more useful when we get to the frequency of the time series and forecasting. For now, let’s look at dataChoices and selectInput.

    # Notice a tradeoff here: we're making it easy on our users because they don't need to
    # remember the naming conventions. But, we're also forced to severely limit their choices.
    # The dataChoices object is going to allow us to add a nicer label to the graph.
    # Notice also how easily we can include datasets from different sources, and not worry
    # about their formats. Thanks, Quandl!

    dataChoices <- c("WTI oil" = "FRED/DCOILWTICO",  # oil data from Fred
                     "Copper"  = "ODA/PCOPP_USD",    # copper data from ODA
                     "Gold"    = "CHRIS/CME_GC1")    # gold data from CME

    selectInput("dataSet",
                "Commodity",
                choices = dataChoices,
                selected = "WTI oil")
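    As a quick console check of the name-value extraction described above (not part of the app itself):

    names(dataChoices[dataChoices == "FRED/DCOILWTICO"])   # returns "WTI oil"
    unname(dataChoices["WTI oil"])                         # returns "FRED/DCOILWTICO"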

    Alright, we have given the user the ability to choose a commodity. Next we want to ask about the frequency, and this gets a bit more complicated. We need to tell Quandl the frequency of the time series to import using the “daily”, “weekly”, “monthly” convention, so we set those to the values. But, further down the app when we want to store the forecast results, we’ll need to use the “days”, “weeks”, “months” phrasing and we’ll need to pull out the names from the name-value pair in frequencyChoices. We knew that this would be necessary because when we built our Notebook for importing, wrangling, and testing, we noted the different conventions and started thinking about how to deal with them in the Shiny context.

    # The frequencyChoices object is going to allow us to pass different period conventions
    # to different places further down the app.

    frequencyChoices <- c("days"   = "daily",
                          "weeks"  = "weekly",
                          "months" = "monthly")

    selectInput("frequency",
                "freq",
                choices = frequencyChoices,
                selected = "months")

    The remainder of the inputs should look familiar as they are in standard format.

    dateRangeInput("dateRange", "Date range", start = "1980-01-01", end = "2016-12-31") numericInput("periods", "Periods to Forecast", 6, min = 1, max = 100)

    Now that we’ve built the sidebar inputs, let’s put them to use. First, we will import the commodity time series data from Quandl. This will be familiar from the Notebook, but note in particular that we will use the value from the input$frequency choice because Quandl uses the “daily/weekly/monthly” frequency format.

    # Let's pull in the Quandl data.
    # Nothing fancy here except we are going to use the reactive inputs.
    # We will pass in the value from the input$dataSet key-value pair.
    # We will also pass in the value from the input$frequency key-value pair.

    commodity <- reactive({
      # It might be a good idea to include your Quandl API key if this will be
      # used more than 50 times in a day.
      Quandl.api_key("your_apikey_here")
      commodity <- Quandl(input$dataSet,
                          start_date = format(input$dateRange[1]),
                          end_date   = format(input$dateRange[2]),
                          order      = "asc",
                          type       = "xts",
                          collapse   = as.character(input$frequency))
    })

    We have imported a time series object for the dataset, date range and frequency chosen by the user. Now we want to do some forecasting and create a visualization. We’ll first use the forecast() function, then we’ll combine the forecasted prices and the historical prices into one xts object that can be passed to dygraph. Let’s handle this in one reactive.

    First, we’ll call forecast and pass it the periods input from the user.

    combined_xts <- reactive({
      # Just like the Notebook, except periods is a reactive input.
      forecasted <- forecast(commodity(), h = input$periods)

    Now we need to combine that forecasted object with the commodity object that was created in a previous code chunk. Here we will finally thank ourselves for thinking about the name-value issue when assigning frequencyChoices. When we create a dataframe to hold the forecasted time series, we build a ‘date’ column using the seq() function. The column starts on the end date of the historical data (which the user selected with the dateRange input) and runs for as many periods as the user chose in the periods input. But, we need to supply a value to the by = ... argument of the seq() function so that it knows if we want to move 6 days, 6 weeks, 6 months, etc. To do that, we need to extract the name (“days/weeks/months”) from the frequencyChoices name-value pair and pass it to seq. The way we extract the name is with this selection statement: names(frequencyChoices[frequencyChoices == input$frequency]).

      forecast_dataframe <- data.frame(
        date = seq(input$dateRange[2],
                   # The next line is very important, and it's the reason we
                   # created the frequencyChoices object.
                   by = names(frequencyChoices[frequencyChoices == input$frequency]),
                   length.out = input$periods),
        Forecast = forecasted$mean,
        Hi_95 = forecasted$upper[,2],
        Lo_95 = forecasted$lower[,2])

    Now we convert the forecast_dataframe object to an xts and combine that new object with the commodity xts object. Remember, we imported the commodity data from Quandl in the form of an xts object, which saved us from having to do a conversion. This chunk should look very similar to the Notebook.

      forecast_xts <- xts(forecast_dataframe[,-1], order.by = forecast_dataframe[,1])

      combined_xts <- cbind(commodity(), forecast_xts)

      # Add a nicer name for the first column.
      colnames(combined_xts)[1] <- "Actual"

      # This is the combined object that will be passed to dygraphs below.
      combined_xts
    })

    Next, we will create a chart of the actual price history. Nothing fancy here, except again we are going to extract the ‘name’ portion of a name-value pair, this time from the dataSet object (the first one we created in the first code chunk). We want to label the graph with ‘WTI oil’ and not the Quandl code, so we select the name with names(dataChoices[dataChoices==input$dataSet]).

    dygraphOutput("commodity") output$commodity <- renderDygraph({ dygraph(commodity(), # We pull out the name of the selected name-value input$dataSet like so: # names(dataChoices[dataChoices==input$dataSet]) main = paste("Price history of", names(dataChoices[dataChoices==input$dataSet]), sep = " ")) %>% dyAxis("y", label = "$") %>% dyOptions(axisLineWidth = 1.5, fillGraph = TRUE, drawGrid = TRUE) })

    Last but not least, let’s graph the historical and forecasted time series together on one graph. We could have ported the code directly from the Notebook, but I couldn’t help tinkering just a bit. I wanted to focus the graph on the time period around where the actual data ends and the forecast begins, because that is probably what’s of most interest to the user. To do that, we’ll use the dyRangeSelector() function and pass two values to the dateWindow variable: a start date and an end date. This is pure aesthetics, but I think it’s worth the effort here.

    We are going to use seq() and names(frequencyChoices[frequencyChoices == input$frequency]) once again.

    dygraphOutput("forecasted") output$forecasted <- renderDygraph({ # We want the user to be able to see the forecasted area, so let's focus in on that # by truncating the view of the dygraph. # We need to give the graph a start date and an end date. start_date <- tail(seq(input$dateRange[2], by = "-1 months", length = 6), 1) end_date <- tail(seq(input$dateRange[2], by = names(frequencyChoices[frequencyChoices == input$frequency]), length = input$periods), 1)

    Now we will supply those date objects to a piped dygraphs chain, and that will be a wrap. Have a close look at the chunk below and spot where we again extract the name from the input$dataSet name-value pair.

      dygraph(combined_xts(),
              # Name about to be extracted!
              main = paste(names(dataChoices[dataChoices==input$dataSet]),
                           ": Historical and Forecast", sep = "")) %>%
        # Add the actual series.
        dySeries("Actual", label = "Actual") %>%
        # Add the three forecasted series.
        dySeries(c("Lo_95", "Forecast", "Hi_95")) %>%
        # A range selector to focus on where historical data ends and
        # forecasted data begins.
        # Note the user can still use the range selector to zoom out if so desired.
        dyRangeSelector(dateWindow = c(start_date, end_date))
    })

    That’s all for today! We’ve built an app that can be extended in many directions – more data set choices, an input selector so the user can choose different models, more visualizations – and in that sense it can serve as a template for ourselves and anyone that wants to build upon this work.

    This app also relies on and highlights one of the tremendous benefits of Quandl: the ability to import datasets from different sources and have them arrive in the IDE in a consistent format. This makes using Quandl with Shiny quite enjoyable after the initial skeleton is built. For example, if we or a colleague did want to change this app and substitute in different commodities, there would be no worry beyond finding the right code on Quandl and adding another name-value pair to dataChoices.

    To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Earthquakes & Maps – Report and shinyapp from R-Lab#2

    Thu, 04/20/2017 - 18:18

    (This article was first published on MilanoR, and kindly contributed to R-bloggers)

    Our board of ideas, tasks to do, doing and done <3

    Hey R-users and lovers!

    In the last few weeks we have had many events going on for the R community in Milano!

    After the launching event for our new R-labs and the 8th MilanoR meeting, we had our second R-Lab.

    This was the first R-Lab with its current structure: an organisation or company that wants to study a specific topic presents it, and we all work together on the problem, solving it with R and having fun.

    In #RLab2 we talked about earthquakes and shiny with several participants with different backgrounds and different interests: from students to researchers and professionals, from statistics to geology and physics, from R beginners to advanced users… we all got together to learn and share our abilities!

    The meeting opened with the newly born EarthCloud association, which provided us with interesting data on the Bove-Vettore (Umbria) fault and its movements over the years 1980-2016. We continued with Andrea and Mariachiara, who gave us some pointers on the use of shiny and ggmap: here are both presentations.

    We then split into three heterogeneous groups: one in charge of producing plots to be integrated into the shiny app, another in charge of producing maps to be integrated into the shiny app, and a last one in charge of putting together the actual shiny app, integrating everyone’s functions.

    The ggplot group produced univariate and bivariate plots giving an overview of the magnitude, depth and location of the earthquakes in the years 1980-2016.

    The ggmap group produced a number of maps of the Bove-Vettore fault system highlighting the areas with more intense magnitude.

    While the data visualization groups were busy finding the best representation for the data, the Shiny group was setting up a friendly shiny app where all the useful info could be presented and easily navigated.

    By the end of the evening, and a couple of hours work, we were tired but happy! All groups came up with functioning functions and great ideas while learning about earthquakes with the geologists!

    Check here for the functioning app:

    https://r-lab-milano.shinyapps.io/earthquakemaps/

    Download the source code and the data from our dedicated github repo. It is still in development! (@rlabbers: there are still some graphs to integrate, and some small changes to make.)

    Thank you very much to all the participants, to EarthCloud for their proposal and contribution, and see you next month for the third Rlab!

    (hint: the third R-Lab is Saturday 13th of May, and we will work together with the Comune di Milano for an entire day! We’ll have access to the municipality data of the city budget planning, for designing and realizing an R Shiny app that allows citizens to visualize and explore the city budget data. Subscriptions will open soon here, stay tuned!)

     

    The post Earthquakes & Maps – Report and shinyapp from R-Lab#2 appeared first on MilanoR.

    To leave a comment for the author, please follow the link and comment on their blog: MilanoR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
