R-bloggers
R news and tutorials contributed by hundreds of R bloggers

ggvis Exercises (Part-2)

Fri, 08/18/2017 - 18:26

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

INTRODUCTION

The ggvis package is used to make interactive data visualizations. The fact that it combines shiny’s reactive programming model and dplyr’s grammar of data transformation makes it a useful tool for data scientists.

This package allows us to implement features like interactivity, but on the other hand every interactive ggvis plot must be connected to a running R session.
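For a feel of the pipe-based syntax before you start, here is a minimal, non-interactive sketch of my own using the built-in mtcars data (the exercises themselves use the “Cars93” data set):

library(ggvis)

# scatterplot of weight against miles per gallon
mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points()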
Before proceeding, please follow our short tutorial.

Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R, without looking at the answers. Finally, check the solutions to verify your answers.

Exercise 1

Create a list which will include the variables “Horsepower” and “MPG.city” of the “Cars93” data set and make a scatterplot. HINT: Use ggvis() and layer_points().

Exercise 2

Add a slider to the scatterplot of Exercise 1 that sets the point size from 10 to 100. HINT: Use input_slider().

Learn more about using ggvis in the online course R: Complete Data Visualization Solutions. In this course you will learn how to:

  • Work extensively with the ggvis package and its functionality
  • Learn what visualizations exist for your specific use case
  • And much more

Exercise 3

Add a slider to the scatterplot of Exercise 1 that sets the point opacity from 0 to 1. HINT: Use input_slider().

Exercise 4

Create a histogram of the variable “Horsepower” of the “Cars93” data set. HINT: Use layer_histograms().

Exercise 5

Set the width and the center of the histogram bins you just created to 10.

Exercise 6

Add 2 sliders to the histogram you just created, one for width and the other for center with values from 0 to 10 and set the step to 1. HINT: Use input_slider().

Exercise 7

Add the labels “Width” and “Center” to the two sliders respectively. HINT: Use label.

Exercise 8

Create a scatterplot of the variables “Horsepower” and “MPG.city” of the “Cars93” dataset with size = 10 and opacity = 0.5.

Exercise 9

Add to the scatterplot you just created a function which will set the size with the left and right keyboard controls. HINT: Use left_right().

Exercise 10

Add interactivity to the scatterplot you just created using a function that shows the value of the “Horsepower” when you “mouseover” a certain point. HINT: Use add_tooltip().

Related exercise sets:
  1. How to create interactive data visualizations with ggvis
  2. ggvis Exercises (Part-1)
  3. How to create visualizations with iPlots package in R
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

GoTr – R wrapper for An API of Ice And Fire

Fri, 08/18/2017 - 15:53

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Ava Yang

It’s Game of Thrones time again as the battle for Westeros is heating up. There are tons of ideas, ingredients and interesting analyses out there, and I was craving my own flavour. So step zero, where is the data?

Jenny Bryan’s purrr tutorial introduced the list got_chars, representing character information from the first five books, which did not seem like much fun beyond exercising list-manipulation muscle. However, it led me to An API of Ice and Fire, the world’s greatest source for quantified and structured data from the universe of Ice and Fire, including the HBO series Game of Thrones. I decided to create my own API functions, or better, an R package (inspired by the famous rwar package).

The API resources cover 3 types of endpoint – Books, Characters and Houses. GoTr pulls data in JSON format and parses them to R list objects. httr’s Best practices for writing an API package by Hadley Wickham is another life saver.
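As a rough illustration of the httr pattern such a wrapper builds on (a sketch of my own, not the package’s actual implementation; the endpoint path is only an example):

library(httr)

# fetch one resource from the API and parse the JSON body into an R list
got_get <- function(path) {
  url <- modify_url("https://www.anapioficeandfire.com", path = paste0("api/", path))
  resp <- GET(url)
  stop_for_status(resp)
  content(resp, as = "parsed", type = "application/json")
}

got_get("characters/583")$name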

The package contains:

  • One function, got_api()
  • Two ways to specify parameters generally, i.e. endpoint type + id, or url
  • Three endpoint types

## Install GoTr from github
# devtools::install_github("MangoTheCat/GoTr")
library(GoTr)
library(tidyverse)
library(listviewer)

# Retrieve books id 5
books_5 <- got_api(type = "books", id = 5)
# Retrieve characters id 583
characters_583 <- got_api(type = "characters", id = 583)
# Retrieve houses id 378
house_378 <- got_api(type = "houses", id = 378)
# Retrieve pov characters data in book 5
povData <- books_5$povCharacters %>%
  flatten_chr() %>%
  map(function(x) got_api(url = x))

# Helpful functions to check structure of list object
length(books_5)
## [1] 11

names(books_5)
##  [1] "url"           "name"          "isbn"          "authors"
##  [5] "numberOfPages" "publisher"     "country"       "mediaType"
##  [9] "released"      "characters"    "povCharacters"

names(house_378)
##  [1] "url"              "name"             "region"
##  [4] "coatOfArms"       "words"            "titles"
##  [7] "seats"            "currentLord"      "heir"
## [10] "overlord"         "founded"          "founder"
## [13] "diedOut"          "ancestralWeapons" "cadetBranches"
## [16] "swornMembers"

str(characters_583, max.level = 1)
## List of 16
##  $ url        : chr "https://anapioficeandfire.com/api/characters/583"
##  $ name       : chr "Jon Snow"
##  $ gender     : chr "Male"
##  $ culture    : chr "Northmen"
##  $ born       : chr "In 283 AC"
##  $ died       : chr ""
##  $ titles     :List of 1
##  $ aliases    :List of 8
##  $ father     : chr ""
##  $ mother     : chr ""
##  $ spouse     : chr ""
##  $ allegiances:List of 1
##  $ books      :List of 1
##  $ povBooks   :List of 4
##  $ tvSeries   :List of 6
##  $ playedBy   :List of 1

map_chr(povData, "name")
##  [1] "Aeron Greyjoy"     "Arianne Martell"   "Arya Stark"
##  [4] "Arys Oakheart"     "Asha Greyjoy"      "Brienne of Tarth"
##  [7] "Cersei Lannister"  "Jaime Lannister"   "Samwell Tarly"
## [10] "Sansa Stark"       "Victarion Greyjoy" "Areo Hotah"

#listviewer::jsonedit(povData)

Another powerful parameter is query, which allows filtering by a specific attribute, such as the name of a character, pagination and so on.

It’s worth knowing about pagination. The first simple request will render a list of 10 elements, since the default number of items per page is 10. The maximum valid pageSize is 50, i.e. if 567 is passed on to it, you still get 50 characters.

# Retrieve character by name
Arya_Stark <- got_api(type = "characters", query = list(name = "Arya Stark"))

# Retrieve characters on page 3, change page size to 20
characters_page_3 <- got_api(type = "characters", query = list(page = "3", pageSize = "20"))

So how do we get information on ALL books, characters or houses? The package does not provide a function for this directly, but here’s an implementation.

# Retrieve all books
booksAll <- got_api(type = "books", query = list(pageSize = "20"))

# Extract names of all books
map_chr(booksAll, "name")
##  [1] "A Game of Thrones"              "A Clash of Kings"
##  [3] "A Storm of Swords"              "The Hedge Knight"
##  [5] "A Feast for Crows"              "The Sworn Sword"
##  [7] "The Mystery Knight"             "A Dance with Dragons"
##  [9] "The Princess and the Queen"     "The Rogue Prince"
## [11] "The World of Ice and Fire"      "A Knight of the Seven Kingdoms"

# Retrieve all houses
houses <- 1:9 %>%
  map(function(x) got_api(type = "houses", query = list(page = x, pageSize = "50"))) %>%
  unlist(recursive = FALSE)

map_chr(houses, "name") %>% length()
## [1] 444

map_df(houses, `[`, c("name", "region")) %>% head()
## # A tibble: 6 x 2
##                          name          region
##                         <chr>           <chr>
## 1                House Algood The Westerlands
## 2 House Allyrion of Godsgrace            Dorne
## 3                 House Amber       The North
## 4               House Ambrose       The Reach
## 5   House Appleton of Appleton      The Reach
## 6      House Arryn of Gulltown       The Vale

The houses list is a starting point for a social network analysis: mirror, mirror, tell me, who are the most influential houses in the Seven Kingdoms? Stay tuned: that is the topic of the next blog post.

Thanks to all open resources. Please comment, fork, issue, star the work-in-progress on our GitHub repository.


To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.

Estimating Gini coefficient when we only have mean income by decile by @ellis2013nz

Fri, 08/18/2017 - 14:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Income inequality data

Ideally the Gini coefficient to estimate inequality is based on original household survey data with hundreds or thousands of data points. Often this data isn’t available due to access restrictions from privacy or other concerns, and all that is published is some kind of aggregate measure. Some aggregations include the income at the 80th percentile divided by that at the 20th (or 90 and 10); the number of people at the top of the distribution whose combined income is equal to that of everyone else; or the income of the top 1% as a percentage of all income. I wrote a little more about this in one of my earliest blog posts.

One way aggregated data are sometimes presented is as the mean income in each decile or quintile. This is not the same as the actual quantile values themselves, which are the boundary between categories. The 20th percentile is the value of the 20/100th person when they are lined up in increasing order, whereas the mean income of the first quintile is the mean of all the incomes of a “bin” of everyone from 0/100 to 20/100, when lined up in order.
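A quick toy illustration of the difference, with made-up log-normal “incomes” (my own sketch, not part of the original analysis):

set.seed(123)
income <- rlnorm(1000, meanlog = 9, sdlog = 0.7)   # fake income data

quantile(income, 0.2)        # the 20th percentile: a single boundary value
mean(sort(income)[1:200])    # mean income of the bottom quintile bin (everyone from 0/100 to 20/100)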

To explore estimating Gini coefficients from this type of binned data I used data from the wonderful Lakner-Milanovic World Panel Income Distribution database, which is available for free download. This useful collection contains the mean income by decile bin of many countries from 1985 onwards – the result of some careful and doubtless very tedious work with household surveys from around the world. This is an amazing dataset, and amongst other purposes it can be used (as Milanovic and co-authors have pioneered dating back to his World Bank days) in combination with population numbers to estimate “global inequality”, treating everyone on the planet as part of a single economic community regardless of national boundaries. But that’s for another day.

Here’s R code to download the data (in Stata format) and grab the first ten values, which happen to represent Angola in 1995. These particular data are based on consumption, which in poorer economies is often more sensible to measure than income:

library(tidyverse)
library(scales)
library(foreign)   # for importing Stata files
library(actuar)    # for access to the Burr distribution
library(acid)
library(forcats)

# Data described at https://www.gc.cuny.edu/CUNY_GC/media/LISCenter/brankoData/LaknerMilanovic2013WorldPanelIncomeDistributionLMWPIDDescription.pdf
# The database has been created specifically for the paper
# "Global Income Distribution: From the Fall of the Berlin Wall to the Great Recession",
# World Bank Policy Research Working Paper No. 6719, December 2013, published also in
# World Bank Economic Review (electronically available from 12 August 2015).
download.file("https://wfs.gc.cuny.edu/njohnson/www/BrankoData/LM_WPID_web.dta",
              mode = "wb", destfile = "LM_WPID_web.dta")

wpid <- read.dta("LM_WPID_web.dta")

# inc_con: whether survey is income or consumption
# group:   income decile group 1 to 10
# RRinc:   average per capita income of the decile in 2005 PPP

# the first 10 rows are Angola in 1995, so let's experiment with them
angola <- wpid[1:10, c("RRinc", "group")]

Here are the resulting 10 numbers:

And this is the Lorenz curve:

Those graphics were drawn with this code:

ggplot(angola, aes(x = group, y = RRinc)) +
  geom_line() +
  geom_point() +
  ggtitle("Mean consumption by decile in Angola in 1995") +
  scale_y_continuous("Annual consumption for each decile group", label = dollar) +
  scale_x_continuous("Decile group", breaks = 1:10) +
  labs(caption = "Source: Lakner/Milanovic World Panel Income Distribution data") +
  theme(panel.grid.minor = element_blank())

angola %>%
  arrange(group) %>%
  mutate(cum_inc_prop = cumsum(RRinc) / sum(RRinc),
         pop_prop = group / max(group)) %>%
  ggplot(aes(x = pop_prop, y = cum_inc_prop)) +
  geom_line() +
  geom_ribbon(aes(ymax = pop_prop, ymin = cum_inc_prop), fill = "steelblue", alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, colour = "steelblue") +
  labs(x = "Cumulative proportion of population",
       y = "Cumulative proportion of consumption",
       caption = "Source: Lakner/Milanovic World Panel Income Distribution data") +
  ggtitle("Inequality in Angola in 1995",
          "Lorenz curve based on binned decile mean consumption")

Calculating Gini directly from deciles?

Now, I could just treat these 10 deciles as a sample of 10 representative people (each observation after all represents exactly 10% of the population) and calculate the Gini coefficient directly. But my hunch was that this would underestimate inequality, because of the straight lines in the Lorenz curve above which are a simplification of the real, more curved, reality.
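For reference, that naive calculation is a one-liner with the acid package loaded above, treating the ten decile means as the whole sample:

# Gini computed directly from the 10 decile means for Angola
weighted.gini(angola$RRinc)$Gini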

To investigate this issue, I started by creating a known population of 10,000 income observations from a Burr distribution, which is a flexible, continuous non-negative distribution often used to model income. That looks like this:

population <- rburr(10000, 1, 3, 3)
par(bty = "l", font.main = 1)
plot(density(population), main = "Burr(1,3,3) distribution")

Then I divided the data up into between 2 and 100 bins, took the means of the bins, and calculated the Gini coefficient of the bins. Doing this for 10 bins is the equivalent of calculating a Gini coefficient directly from decile data such as in the Lakner-Milanovic dataset. I got this result, which shows that when you have the means of 10 bins, you underestimate inequality slightly:

Here’s the code for that little simulation. I make myself a little function to bin data and return the mean values of the bins in a tidy data frame, which I’ll need for later use too:

#' Quantile averages
#'
#' Mean value in binned groups
#'
#' @param y numeric vector to provide summary statistics on
#' @param len number of bins to calculate means for
#' @details this is different from producing the actual quantiles; it is the mean value of y within each bin.
bin_avs <- function(y, len = 10){
   # argument checks:
   if(class(y) != "numeric"){stop("y should be numeric")}
   if(length(y) < len){stop("y should be longer than len")}
   # calculation:
   y <- sort(y)
   bins <- cut(y, breaks = quantile(y, probs = seq(0, 1, length.out = len + 1)))
   tmp <- data.frame(bin_number = 1:len,
                     bin_breaks = levels(bins),
                     mean = as.numeric(tapply(y, bins, mean)))
   return(tmp)
}

ginis <- numeric(99)
for(i in 1:99){
   ginis[i] <- weighted.gini(bin_avs(population, len = i + 1)$mean)$Gini
}
ginis_df <- data.frame(
   number_bins = 2:100,
   gini = ginis
)

ginis_df %>%
   mutate(label = ifelse(number_bins < 11 | round(number_bins / 10) == number_bins / 10,
                         number_bins, "")) %>%
   ggplot(aes(x = number_bins, y = gini)) +
   geom_line(colour = "steelblue") +
   geom_text(aes(label = label)) +
   labs(x = "Number of bins", y = "Gini coefficient estimated from means within bins") +
   ggtitle("Estimating Gini coefficient from binned mean values of a Burr distribution population",
           paste0("Correct Gini is ", round(weighted.gini(population)$Gini, 3),
                  ". Around 25 bins needed for a really good estimate."))

A better method for Gini from deciles?

Maybe I should have stopped there; after all, there is hardly any difference between 0.32 and 0.34, probably much less than the sampling error from the survey. But I wanted to explore whether there was a better way. The method I chose was to:

  • choose a log-normal distribution that would generate (close to) the 10 decile averages we have;
  • simulate individual-level data from that distribution; and
  • estimate the Gini coefficient from that simulated data.

I also tried this with a Burr distribution, but the results were very unstable. The log-normal approach was quite good at generating data with means of 10 bins very similar to the original data, and gave plausible values of the Gini coefficient just slightly higher than when calculated directly from the bins’ means.

Here’s how I did that:

# algorithm will be iterative
# 1. assume the 10 binned means represent the following quantiles: 0.05, 0.15, 0.25 ... 0.65, 0.75, 0.85, 0.95
# 2. pick the best lognormal distribution that fits those 10 quantile values.
#    Treat as a non-linear optimisation problem and solve with `optim()`.
# 3. generate data from that distribution and work out what the actual quantiles are
# 4. repeat 2, with these actual quantiles

n <- 10000
x <- angola$RRinc

fn2 <- function(params) {
   sum((x - qlnorm(p, params[1], params[2])) ^ 2 / x)
}

p <- seq(0.05, 0.95, length.out = 10)
fit2 <- optim(c(1, 1), fn2)
simulated2 <- rlnorm(n, fit2$par[1], fit2$par[2])

p <- plnorm(bin_avs(simulated2)$mean, fit2$par[1], fit2$par[2])
fit2 <- optim(c(1, 1), fn2)
simulated2 <- rlnorm(n, fit2$par[1], fit2$par[2])

And here are the results. The first table shows the means of the bins in my simulated log-normal population (mean) compared to the original data for Angola’s actual deciles in 1995 (x). The next two values, 0.415 and 0.402, are the Gini coefficients from the simulated and original data respectively:

> cbind(bin_avs(simulated2), x)
   bin_number         bin_breaks      mean    x
1           1         (40.6,222]  165.0098  174
2           2          (222,308]  266.9120  287
3           3          (308,393]  350.3674  373
4           4          (393,487]  438.9447  450
5           5          (487,589]  536.5547  538
6           6          (589,719]  650.7210  653
7           7          (719,887]  795.9326  785
8           8     (887,1.13e+03] 1000.8614  967
9           9 (1.13e+03,1.6e+03] 1328.3872 1303
10         10  (1.6e+03,1.3e+04] 2438.4041 2528

> weighted.gini(simulated2)$Gini
          [,1]
[1,] 0.4145321

> # compare to the value estimated directly from the data:
> weighted.gini(x)$Gini
         [,1]
[1,] 0.401936

As would be expected from my earlier simulation, the Gini coefficient from the estimated underlying log-normal distribution is very slightly higher than that calculated directly from the means of the decile bins.

Applying this method to the Lakner-Milanovic inequality data

I rolled up this approach into a function to convert means of deciles into Gini coefficients and applied it to all the countries and years in the World Panel Income Distribution data. Here are the results, first over time:

… and then as a snapshot:

Neither of these is great as a polished data visualisation, but it’s difficult data to present in a static snapshot, and will do for these illustrative purposes.

Here’s the code for that function (which depends on the previously defined bin_avs()) and for drawing those charts. Drawing on the convenience of Hadley Wickham’s dplyr and ggplot2, it’s easy to do this on the fly, and in the below I calculate the Gini coefficients twice, once for each chart. Technically this is wasteful, but with modern computers this isn’t a big deal even though there is quite a bit of computationally intensive stuff going on under the hood; the code below only takes a minute or so to run.

#' Convert data that is means of deciles into a Gini coefficient
#'
#' @param x vector of 10 numbers, representing mean income (or whatever) for 10 deciles
#' @param n number of simulated values of the underlying log-normal distribution to generate
#' @details returns an estimate of Gini coefficient that is less biased than calculating it
#' directly from the deciles, which would be slightly biased downwards.
deciles_to_gini <- function(x, n = 1000){
   fn <- function(params) {
      sum((x - qlnorm(p, params[1], params[2])) ^ 2 / x)
   }

   # starting estimate of p based on binned means and parameters
   p <- seq(0.05, 0.95, length.out = 10)
   fit <- optim(c(1, 1), fn)

   # calculate a better value of p
   simulated <- rlnorm(n, fit$par[1], fit$par[2])
   p <- plnorm(bin_avs(simulated)$mean, fit$par[1], fit$par[2])

   # new fit with the better p
   fit <- optim(c(1, 1), fn)
   simulated <- rlnorm(n, fit$par[1], fit$par[2])

   output <- list(par = fit$par)
   output$Gini <- as.numeric(weighted.gini(simulated)$Gini)
   return(output)
}

# example usage:
deciles_to_gini(x = wpid[61:70, ]$RRinc)
deciles_to_gini(x = wpid[171:180, ]$RRinc)

# draw some graphs:
wpid %>%
   filter(country != "Switzerland") %>%
   mutate(inc_con = ifelse(inc_con == "C", "Consumption", "Income")) %>%
   group_by(region, country, contcod, year, inc_con) %>%
   summarise(Gini = deciles_to_gini(RRinc)$Gini) %>%
   ungroup() %>%
   ggplot(aes(x = year, y = Gini, colour = contcod, linetype = inc_con)) +
   geom_point() +
   geom_line() +
   facet_wrap(~region) +
   guides(colour = FALSE) +
   ggtitle("Inequality over time", "Gini coefficients estimated from decile data") +
   labs(x = "", linetype = "",
        caption = "Source: Lakner/Milanovic World Panel Income Distribution data")

wpid %>%
   filter(country != "Switzerland") %>%
   mutate(inc_con = ifelse(inc_con == "C", "Consumption", "Income")) %>%
   group_by(region, country, contcod, year, inc_con) %>%
   summarise(Gini = deciles_to_gini(RRinc)$Gini) %>%
   ungroup() %>%
   group_by(country) %>%
   filter(year == max(year)) %>%
   ungroup() %>%
   mutate(country = fct_reorder(country, Gini),
          region = fct_lump(region, 5)) %>%
   ggplot(aes(x = Gini, y = country, colour = inc_con, label = contcod)) +
   geom_text(size = 2) +
   facet_wrap(~region, scales = "free_y", nrow = 2) +
   labs(colour = "", y = "", x = "Gini coefficient",
        caption = "Source: Lakner-Milanovic World Panel Income Distribution") +
   ggtitle("Inequality by country",
           "Most recent year available up to 2008; Gini coefficients are estimated from decile mean income.")

There we go – deciles to Gini fun with world inequality data!

# cleanup
unlink("LM_WPID_web.dta")

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R.

Oil leakage… those old BMW’s are bad :-)

Fri, 08/18/2017 - 11:42

(This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers)

Introduction

My first car was a 13-year-old Mitsubishi Colt; I paid 3000 Dutch guilders for it. I can still remember a friend who did not want me to park this car in front of his house because of possible oil leakage.

Can you get an idea of which cars are likely to leak oil? Well, with open car data from the Dutch RDW you can. The RDW is the Netherlands Vehicle Authority in the mobility chain.

RDW Data

There are many data sets that you can download. I have used the following:

  • Observed Defects. This set contains 22 million records on observed defects at car level (license plate number). Cars in the Netherlands have to be checked yearly, and the findings of each check are submitted to the RDW.
  • Basic car details. This set contains 9 million records: all the cars in the Netherlands, with license plate number, brand, make, weight and type of car.
  • Defects code. This little table provides a description of all the possible defect codes, so I know that code ‘RA02’ in the observed defects data set represents ‘oil leakage’.

Simple Analysis in R

I imported the data into R and, with some simple dplyr statements, determined per car make and age (in years) the number of cars with an observed oil-leakage defect. I then determined how many cars there are per make and age; dividing those two numbers gives a so-called oil leak percentage.

 

For example, in the Netherlands there are 2043 Opel Astras that are four years old; three of them have an observed oil leak, so we have an oil leak percentage of 0.15%.
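A sketch of the dplyr logic is shown below; the table and column names are hypothetical, since the real RDW files use their own (mostly Dutch) field names:

library(dplyr)

# count cars with an oil-leakage defect ('RA02') per brand and age
oil_leaks <- defects %>%
  filter(defect_code == "RA02") %>%
  group_by(brand, age) %>%
  summarise(n_leaking = n_distinct(license_plate))

# count all cars per brand and age
fleet <- cars %>%
  group_by(brand, age) %>%
  summarise(n_total = n())

# divide the two counts to get the oil leak percentage
leak_pct <- oil_leaks %>%
  left_join(fleet, by = c("brand", "age")) %>%
  mutate(leak_percentage = 100 * n_leaking / n_total)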

The graph below shows the oil leak percentages for different car brands and ages. Obviously, the older the car, the higher the leak percentage. But look at BMW: waaauwww, those old BMWs are leaking oil like crazy… The few lines of R code can be found here.

Conclusion

There is a lot in the open car data from the RDW; you can look at many more aspects and defects of cars. As for my old car: according to this data, Mitsubishis have a low oil leak percentage, even the older ones.

Cheers, Longhow

 


To leave a comment for the author, please follow the link and comment on their blog: R – Longhow Lam's Blog.

RcppArmadillo 0.7.960.1.0

Fri, 08/18/2017 - 02:41

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The bi-monthly RcppArmadillo release is out with a new version 0.7.960.1.0 which is now on CRAN, and will get to Debian in due course.

And it is a big one. Lots of nice upstream changes from Armadillo, and lots of work on our end as the Google Summer of Code project by Binxiang Ni, plus a few smaller enhancements — see below for details.

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 379 other packages on CRAN, an increase of 49 since the last CRAN release in June!
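As a tiny illustration of that integration (a sketch of my own, not from the release notes), Armadillo code can be compiled and called from R on the fly via Rcpp attributes:

library(Rcpp)

# column sums computed by Armadillo and returned to R as a numeric vector
cppFunction(depends = "RcppArmadillo", '
arma::vec arma_colsums(const arma::mat& X) {
  // sum over rows (i.e. per column), then transpose the row vector into a column vector
  return arma::sum(X, 0).t();
}
')

m <- matrix(rnorm(6), nrow = 2)
arma_colsums(m)   # should match colSums(m)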

Changes in this release relative to the previous CRAN release are as follows:

Changes in RcppArmadillo version 0.7.960.1.0 (2017-08-11)
  • Upgraded to Armadillo release 7.960.1 (Northern Banana Republic Deluxe)

    • faster randn() when using OpenMP (NB: usually omitted when used from R)

    • faster gmm_diag class, for Gaussian mixture models with diagonal covariance matrices

    • added .sum_log_p() to the gmm_diag class

    • added gmm_full class, for Gaussian mixture models with full covariance matrices

    • expanded .each_slice() to optionally use OpenMP for multi-threaded execution

  • Upgraded to Armadillo release 7.950.0 (Northern Banana Republic)

    • expanded accu() and sum() to use OpenMP for processing expressions with computationally expensive element-wise functions

    • expanded trimatu() and trimatl() to allow specification of the diagonal which delineates the boundary of the triangular part

  • Enhanced support for sparse matrices (Binxiang Ni as part of Google Summer of Code 2017)

    • Add support for dtCMatrix and dsCMatrix (#135)

    • Add conversion and unit tests for dgT, dtT and dsTMatrix (#136)

    • Add conversion and unit tests for dgR, dtR and dsRMatrix (#139)

    • Add conversion and unit tests for pMatrix and ddiMatrix (#140)

    • Rewrite conversion for dgT, dtT and dsTMatrix, and add file-based tests (#142)

    • Add conversion and unit tests for indMatrix (#144)

    • Rewrite conversion for ddiMatrix (#145)

    • Add a warning message for matrices that cannot be converted (#147)

    • Add new vignette for sparse matrix support (#152; Dirk in #153)

    • Add support for sparse matrix conversion from Python SciPy (#158 addressing #141)

  • Optional return of row or column vectors in collapsed form if appropriate #define is set (Serguei Sokol in #151 and #154)

  • Correct speye() for non-symmetric cases (Qiang Kou in #150 closing #149).

  • Ensure tests using Scientific Python and reticulate are properly conditioned on the packages being present.

  • Added .aspell/ directory with small local dictionary, now supported by R-devel.

Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

2017 App Update

Fri, 08/18/2017 - 02:34

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

As you may have noticed, we have made a few changes to our apps for the 2017 season to bring you a smoother and quicker experience while also adding more advanced and customizable views.

Most visibly, we moved the apps to Shiny so we can continue to build on our use of R and add new features and improvements throughout the season. We expect the apps to better handle high traffic this season, especially during the draft period and other peak times.

In addition to the ability to create and save custom settings, you can also choose the columns you view in our Projections tool.  We have also added more advanced metrics such as weekly VOR and Projected Points Per Dollar (ROI) for those of you in auction leagues.  With a free account, you’ll be able to create and save one custom setting.  If you get an FFA Insider subscription, you’ll be able to create and save unlimited custom settings.

Up next is the ability to upload custom auction values to make it easier to use during auction drafts.

We are also always looking to add new features, so feel free to drop us a suggestion in the Comments section below!

The post 2017 App Update appeared first on Fantasy Football Analytics.


To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

Chapman University DataFest Highlights

Fri, 08/18/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Editor’s Note: The 2017 Chapman University DataFest was held during the weekend of April 21-23. The 2018 DataFest will be held during the weekend of April 27-29.

DataFest was founded by Rob Gould in 2011 at UCLA with 40 students. In just seven years, it has grown to 31 sites in three countries. Have a look at Mine Çetinkaya-Rundel’s post Growth of DataFest over the years for the details. In recent years, it has been difficult for UCLA to keep up with the growing interest and demand from southern California universities. This year, the Chapman DataFest became the second DataFest site in southern California, and the largest inaugural DataFest in the history of the event. We had 65 students who stayed the whole weekend from seven universities organized into 15 teams.

The event began on a Friday evening with Professor Rob Gould, the “founder” of DataFest, giving advice on goals for the weekend. He then introduced the Expedia dataset: nearly 11 million records representing users’ online searches for hotels, plus an associated file with detailed information about the hotel destinations.

Throughout the weekend, the organizers kept students motivated with data challenges (with cell phone chargers awarded as prizes), a mini-talk on tools for joining and merging data files, and a tutorial from bitScoop on using their API integration platform.

At noon on Sunday, the students submitted their two-slide presentations via email. At 1 pm, each team had five minutes to show their findings to the six-judge panel: Johnny Lin (UCLA), Joe Kurian (Mitsubishi UFG Union Bank, Irvine), Tao Song (Spectrum Pharmaceuticals), Pamela Hsu (Spectrum Pharmaceuticals), Lynn Langit (AWS, GCP IoT), and Brett Danaher (Chapman University).

The judges announced winners in three official categories:

Best Insight: CSU Northridge team “Mean Squares” (Jamie Decker, Matthew Jones, Collin Miller, Ian Postel, and Seyed Sajjadi). [See Seyed’s description of his team’s experience!]

Best Visualization: Chapman University team “Winners ‘); Drop Table” (Dylan Bowman, William Cortes, Shevis Johnson, and Tristan Tran).

Best Use of External Data: Chapman University team “BEST” (Brandon Makin, Sarah Lasman, and Timothy Kristedja).

Additionally, “Judges’ Choice” awards for “Best Use of Statistical Models” went to the USC “Big Data” team (Hsuanpei Lee, Omar Lopez, Yi Yang Tan, Grace Xu, and Xuejia Xu) and the USC “Quants” team (Cheng (Serena) Cheng, Chelsea Lee, and Hossein Shafii).

All winners were given certificates and medallions designed by Chapman’s Ideation Lab and printed on Chapman’s MLAT Lab 3D printer (see photo).

Winners also received free student memberships in the American Statistical Association.

Many thanks go to the Silver Sponsors: Children’s Hospital Orange County Medical Intelligence and Innovation Institute, Southern California Chapter of the American Statistical Association, and Chapman University MLAT Lab; and Bronze Sponsors: Experian, RStudio, Chapman University Computational and Data Sciences and Schmid College of Science and Technology, Orange County Long Beach ASA Chapter, the Missing Variables, USC Stats Club, Luke Thelen, and Google.

Thanks also to the 45 VIP consultants from BitScoop Labs, Chapman University, Compatiko, CSU Fullerton, CSU Long Beach, CSU San Bernardino, Education Management Services, Freelance Data Analysis, Hiner Partners, Mater Dei High School, Nova Technologies, Otoy, Southern California Edison, Sonikpass, Startup, SurEmail, UC Irvine, UCLA, USC, and Woodbridge High School, many of whom spent most of the weekend working with the students.

Overall, participants were enthusiastic about meeting students from other schools and the opportunity to work with the local professionals. (See the two student perspectives below.) DataFest will continue to grow as these students return to their schools and share their enthusiasm with their classmates!

The Mean Squares Perspective

by Seyed Sajjadi

For most of our team, this DataFest was only the first or second hackathon they ever attended, but the group gelled instantly.

Culture is important for a hackathon group, but talent and preparation play key roles in the success or failure. Our group spent more than a month in advance preparing for this competition. We practiced, practiced, and practiced some more for this event. We had weekly workshops where we presented the assignments that we had worked on for the past week.

The next essential for the competition may come as a surprise to most: having an artist design and prepare the presentation took an enormous amount of work off our shoulders. During the entire competition, we had a very talented artist design a fabulous slideshow for the presentation. This may sound boastful, but allowing specialized talent to work on the slideshow the entire competition is a lot better than designing it at the last minute.

The questions that were asked were not specific at all, and it was on the participants to form and ask the proper questions. We focused on two questions: optimizing customer acquisition, and retention/conversion. We proved that online targeting and marketing can be optimized by regional historical data feedback, meaning that most states’ residents tend to have similar preferences when it comes to the same destinations. For instance, most Californians go to Las Vegas to gamble, but most people from Texas go to Las Vegas for music events; these analyses can be used to better target potential customers from neighboring regions.

Regarding customer retention and conversion of lookers to bookers, we calculated the optimum point in time where Expedia can offer more special packages; this time frame happened to be around 14 sessions of interaction between the customer and the website. The biggest part of our analysis was achieved via hierarchical clustering.

A big aspect of the event had to do with the atmosphere and the organization. They invited people from industry to come and roam around the halls, which led to a great opportunity to meet professionals in the field of data science. We were situated in a huge room with all of the teams. We ended up crowding around a small table with everyone on their laptops and chairs. The room was big enough to have impromptu meetings, which allowed a lot of room to breathe. This hackathon was a huge growing experience for all of us on “The Mean Squares”.

Team Pineapples’ Perspective

by Annelise Hitchman

On day one, I could tell my enthusiasm to start working on the dataset was matched by the other dozens of students participating. The room was filled with interaction, and not just among the individual teams. I enjoyed talking with all the consultants in the room about the data, our approach, and even just learning about what they did for work. DataFest introduced me to real-world data that I had never seen in my classes. I learned quite a bit about data analysis from both my own team members and nearly everyone else at the event. Watching the final presentations was an inspiring and insightful end to DataFest. I really hope that DataFest is able to continue and be available to universities such as my own, so that all students interested in data analysis can participate.

Michael Fahy is Professor of Mathematics and Computer Science and Associate Dean, Schmid College of Science & Technology at Chapman University



To leave a comment for the author, please follow the link and comment on their blog: R Views.

RStudio Server Pro is ready for BigQuery on the Google Cloud Platform

Fri, 08/18/2017 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

RStudio is excited to announce the availability of RStudio Server Pro on the Google Cloud Platform.

RStudio Server Pro GCP is identical to RStudio Server Pro, but with additional convenience for data scientists, including pre-installation of multiple versions of R, common systems libraries, and the BigQuery package for R.

RStudio Server Pro GCP adapts to your unique circumstances. It allows you to choose different GCP computing instances for RStudio Server Pro no matter how large, whenever a project requires it (hourly pricing).

If the enhanced security, support for multiple R versions and multiple sessions, and commercially licensed and supported features of RStudio Server Pro appeal to you, please give RStudio Server Pro for GCP a try. Below are some useful links to get you started:


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

20 years of the R Core Group

Thu, 08/17/2017 - 23:03

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The first "official" version of R, version 1.0.0, was released on February 29, 200. But the R Project had already been underway for several years before then. Sharing this tweet, from yesterday, from R Core member Peter Dalgaard:

It was twenty years ago today, Ross Ihaka got the band to play…. #rstats pic.twitter.com/msSpPz2kyA

— Peter Dalgaard (@pdalgd) August 16, 2017

Twenty years ago, on August 16, 1997, the R Core Group was formed. Before that date, the committers to R were the project's founders Ross Ihaka and Robert Gentleman, along with Luke Tierney, Heiner Schwarte and Paul Murrell. The email above was the invitation for Kurt Hornik, Peter Dalgaard and Thomas Lumley to join as well. With the sole exception of Schwarte, all of the above remain members of the R Core Group, which has since expanded to 21 members. These are the volunteers who implement the R language and its base packages, document, build, test and release it, and manage all the infrastructure that makes that possible.

Thank you to all the R Core Group members, past and present!


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Probability functions beginner

Thu, 08/17/2017 - 18:10

In this set of exercises, we are going to explore some of the probability functions in R with practical applications. Basic probability knowledge is required.

Note: We are going to use random number functions and random process functions in R, such as runif. A problem with these functions is that every time you run them you will obtain a different value. To make your results reproducible you can specify the value of the seed using set.seed(‘any number’) before calling a random function. (If you are not familiar with seeds, think of them as the tracking number of your random numbers.) For this set of exercises we will use set.seed(1); don’t forget to specify it before every random exercise.
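For example, re-running the same call after resetting the seed reproduces exactly the same draws (a small sketch of my own):

set.seed(1)
runif(2)   # two 'random' numbers

set.seed(1)
runif(2)   # after resetting the seed, the same two numbers appear again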

Answers to the exercises are available here

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Generating random numbers. Set your seed to 1 and generate 10 random numbers using runif and save them in an object called random_numbers.

Exercise 2

Using the function ifelse and the object random_numbers, simulate coin tosses. Hint: If random_numbers is bigger than .5 then the result is head, otherwise it is tail.

Another way of generating random coin tosses is by using the rbinom function. Set the seed again to 1 and simulate with this function 10 coin tosses. Note: The value you will obtain is the total number of heads of those 10 coin tosses.

Exercise 3

Use the function rbinom to generate 10 unfair coin tosses with a probability of success of 0.3. Set the seed to 1.

Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to:

  • work with different binomial and logistic regression techniques,
  • know how to compare regression models and choose the right fit,
  • and much more.

Exercise 4

We can simulate rolling a die in R with runif. Save in an object called die_roll 1 random number with min = 1 and max = 6. This means that we will generate a random number between 1 and 6.

Apply the function ceiling to die_roll. Don’t forget to set the seed to 1 before calling runif.

Exercise 5

Simulate normal distribution values. Imagine a population in which the average height is 1.70 m with a standard deviation of 0.1. Using rnorm, simulate the height of 100 people and save it in an object called heights.

To get an idea of the values of heights, apply the function summary to it.

Exercise 6

a) What’s the probability that a person will be smaller than 1.90? Use pnorm
b) What’s the probability that a person will be taller than 1.60? Use pnorm

Exercise 7

The waiting time (in minutes) at a doctor’s clinic follows an exponential distribution with a rate parameter of 1/50. Use the function rexp to simulate the waiting time of 30 people at the doctor’s office.

Exercise 8

What’s the probability that a person will wait less than 10 minutes? Use pexp

Exercise 9

What’s the average waiting time?

Exercise 10

Let’s assume that patients with a waiting time bigger than 60 minutes leave. Out of the 30 patients that arrive at the clinic, how many are expected to leave? Use qexp

Related exercise sets:
  1. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-4)
  2. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-6)
  3. Lets Begin with something sample
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

Tesseract and Magick: High Quality OCR in R

Thu, 08/17/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

Last week we released an update of the tesseract package to CRAN. This package provides R bindings to Google's OCR library Tesseract.

install.packages("tesseract")

The new version ships with the latest libtesseract 3.05.01 on Windows and MacOS. Furthermore it includes enhancements for managing language data and using tesseract together with the magick package.

Installing Language Data

The new version has several improvements for installing additional language data. On Windows and MacOS you use the tesseract_download() function to install additional languages:

tesseract_download("fra")

Language data are now stored in rappdirs::user_data_dir('tesseract') which makes it persist across updates of the package. To OCR french text:

french <- tesseract("fra")
text <- ocr("https://jeroen.github.io/images/french_text.png", engine = french)
cat(text)

Très Bien! Note that on Linux you should not use tesseract_download but instead install languages using apt-get (e.g. tesseract-ocr-fra) or yum (e.g. tesseract-langpack-fra).

Tesseract and Magick

The tesseract developers recommend cleaning up the image before OCR’ing it to improve the quality of the output. This involves things like cropping out the text area, rescaling, increasing contrast, etc.

The rOpenSci magick package is perfectly suitable for this task. The latest version contains a convenient wrapper image_ocr() that works with pipes.

devtools::install_github("ropensci/magick")

Let's give it a try on some example scans:

# Requires devel version of magick
# devtools::install_github("ropensci/magick")

# Test it
library(magick)
library(magrittr)

text <- image_read("https://courses.cs.vt.edu/csonline/AI/Lessons/VisualProcessing/OCRscans_files/bowers.jpg") %>%
  image_resize("2000") %>%
  image_convert(colorspace = 'gray') %>%
  image_trim() %>%
  image_ocr()

cat(text)

The Llfe and Work of Fredson Bowers by G. THOMAS TANSELLE N EVERY FIELD OF ENDEAVOR THERE ARE A FEW FIGURES WHOSE ACCOM- plishment and influence cause them to be the symbols of their age; their careers and oeuvres become the touchstones by which the field is measured and its history told. In the related pursuits of analytical and descriptive bibliography, textual criticism, and scholarly editing, Fredson Bowers was such a figure, dominating the four decades after 1949, when his Principles of Bibliographical Description was pub- lished. By 1973 the period was already being called “the age of Bowers”: in that year Norman Sanders, writing the chapter on textual scholarship for Stanley Wells's Shakespeare: Select Bibliographies, gave this title to a section of his essay. For most people, it would be achievement enough to rise to such a position in a field as complex as Shakespearean textual studies; but Bowers played an equally important role in other areas. Editors of ninetcemh-cemury American authors, for example, would also have to call the recent past “the age of Bowers," as would the writers of descriptive bibliographies of authors and presses. His ubiquity in the broad field of bibliographical and textual study, his seemingly com- plete possession of it, distinguished him from his illustrious predeces- sors and made him the personification of bibliographical scholarship in his time. \Vhen in 1969 Bowers was awarded the Gold Medal of the Biblio- graphical Society in London, John Carter’s citation referred to the Principles as “majestic," called Bowers's current projects “formidable," said that he had “imposed critical discipline" on the texts of several authors, described Studies in Bibliography as a “great and continuing achievement," and included among his characteristics "uncompromising seriousness of purpose” and “professional intensity." Bowers was not unaccustomed to such encomia, but he had also experienced his share of attacks: his scholarly positions were not universally popular, and he expressed them with an aggressiveness that almost seemed calculated to

Not bad but not perfect. Can you do a better job?


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.

Update on Our ‘revisit’ Package

Thu, 08/17/2017 - 05:01

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

On May 31, I made a post here about our R package revisit, which is designed to help remedy the reproducibility crisis in science. The intended user audience includes

  • reviewers of research manuscripts submitted for publication,
  • scientists who wish to confirm the results in a published paper, and explore alternate analyses, and
  • members of the original research team itself, while collaborating during the course of the research.

The package is documented mainly in the README file, but we now also have a paper on arXiv.org, which explains the reproducibility crisis in detail, and how our package addresses it. Reed Davis and I, the authors of the software, are joined in the paper by Prof. Laurel Beckett of the UC Davis Medical School, and Dr. Paul Thompson of Sanford Research.


To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.

Visualising Water Consumption using a Geographic Bubble Chart

Thu, 08/17/2017 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

A geographic bubble chart is a straightforward method to visualise quantitative information with a geospatial relationship. Last week I was in Vietnam helping the Phú Thọ Water Supply Joint Stock Company with their data science. They asked me to create … Continue reading →

The post Visualising Water Consumption using a Geographic Bubble Chart appeared first on The Devil is in the Data.


To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.

Use the LENGTH statement to pre-set the lengths of character variables in SAS – with a comparison to R

Thu, 08/17/2017 - 00:07

(This article was first published on R programming – The Chemical Statistician, and kindly contributed to R-bloggers)

I often create character variables (i.e. variables with strings of text as their values) in SAS, and they sometimes don’t render as expected.  Here is an example involving the built-in data set SASHELP.CLASS.

Here is the code:

data c1;
    set sashelp.class;
    * define a new character variable to classify someone as tall or short;
    if height > 60 then height_class = 'Tall';
    else height_class = 'Short';
run;

* print the results for the first 5 rows;
proc print data = c1 (obs = 5);
run;

Here is the result:

Obs    Name       Sex    Age    Height    Weight    height_class
  1    Alfred     M       14      69.0     112.5    Tall
  2    Alice      F       13      56.5      84.0    Shor
  3    Barbara    F       13      65.3      98.0    Tall
  4    Carol      F       14      62.8     102.5    Tall
  5    Henry      M       14      63.5     102.5    Tall

What happened?  Why does the word “Short” render as “Shor”?

This occurred because SAS sets the length of a new character variable as the length of the first value given in its definition.  My code defined “height_class” by setting the value “Tall” first, which has a length of 4.  Thus, “height_class” was defined as a character variable with a length of 4.  Any subsequent values must follow this variable type and format.

How can we circumvent this?  You can pre-set the length of any new character variable with the LENGTH statement, placed before the variable is first given a value.  In the revised code below, I correct the problem by setting the length of “height_class” to 5 before defining its possible values.

data c2;
    set sashelp.class;
    * define a new character variable to classify someone as tall or short;
    length height_class $ 5;
    if height > 60 then height_class = 'Tall';
    else height_class = 'Short';
run;

* print the results for the first 5 rows;
proc print data = c2 (obs = 5);
run;

Here is the result:

Obs  Name     Sex  Age  Height  Weight  height_class
  1  Alfred   M     14    69.0   112.5  Tall
  2  Alice    F     13    56.5    84.0  Short
  3  Barbara  F     13    65.3    98.0  Tall
  4  Carol    F     14    62.8   102.5  Tall
  5  Henry    M     14    63.5   102.5  Tall

 

Notice that “height_class” for Alice is “Short”, as it should be.

An alternative solution is to re-write the code so that the first instance of “height_class” is the longest possible value.  This does not require the use of the LENGTH statement.

data c3;
    set sashelp.class;
    * define a new character variable to classify someone as tall or short;
    if height < 60 then height_class = 'Short';
    else height_class = 'Tall';
run;

 

By the way, I don’t notice this problem in R.  Here is some code to illustrate this observation.

> set.seed(235)
>
> # randomly generate 3 values
> x = rnorm(3, 60, 5)
>
> # add a value to the beginning of "x" so that the first value is above 60
> # add a value to the end of "x" so that the last value is below 60
> x = c(63, x, 57)
> x
[1] 63.00000 70.68902 61.36082 56.62601 57.00000
>
> # pre-allocate a vector for classifying "x" as "tall" or "short"
> y = 0 * x
>
> for (i in 1:length(x))
+ {
+      if (x[i] > 60)
+      {
+           y[i] = 'Tall'
+      }
+      else
+      {
+           y[i] = 'Short'
+      }
+ }
>
> # display "y"
> y
[1] "Tall"  "Tall"  "Tall"  "Short" "Short"

Notice that the value “Short” renders fully with a length of 5.  I did not need to pre-set the length of “y” first.
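As a quick aside (my own illustration, not part of the original example), each element of an R character vector simply carries its own length, so there is nothing to declare up front:

> z <- c("Tall", "Short")   # no LENGTH-style declaration needed
> nchar(z)
[1] 4 5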

Filed under: Categorical Data Analysis, Data Analysis, R programming, SAS Programming, Statistics, Tutorials Tagged: categorical data, categorical variable, character data, character variable, length(), R, r programing, SAS, sas programming


To leave a comment for the author, please follow the link and comment on their blog: R programming – The Chemical Statistician. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

How to build an image recognizer in R using just a few images

Wed, 08/16/2017 - 23:43

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Microsoft Cognitive Services provides several APIs for image recognition, but if you want to build your own recognizer (or create one that works offline), you can use the new Image Featurizer capabilities of Microsoft R Server. 

The process of training an image recognition system requires LOTS of images — millions and millions of them. The process involves feeding those images into a deep neural network, and during that process the network generates "features" from the image. These features might be versions of the image including just the outlines, or maybe the image with only the green parts. You could further boil those features down into a single number, say the length of the outline or the percentage of the image that is green. With enough of these "features", you could use them in a traditional machine learning model to classify the images, or perform other recognition tasks.

But if you don't have millions of images, it's still possible to generate these features from a model that has already been trained on millions of images. ResNet is a very deep neural network model trained for the task of image recognition which has been used to win major computer-vision competitions. With the rxFeaturize function in Microsoft R Client and Microsoft R Server, you can generate 4096 features from this model on any image you provide. The features themselves are meaningful only to a computer, but that vector of 4096 numbers between zero and one is (ideally) a distillation of the unique characteristics of that image as a human would recognize it. You can then use that features vector to create your own image-recognition system without the burden of training your own neural network on a large corpus of images.

On the Cortana Intelligence and ML blog, Remko de Lange provides a simple example: given a collection of 60-or-so images of chairs like those below, how can you find the image that most looks like a deck chair?

First, you need a representative image of a deck chair:

Then, you calculate the features vector for that image using rxFeaturize:

imageToMatchDF <- rxFeaturize(
  data = data.frame(Image = "deck-chair.jpg"),
  mlTransforms = list(
    loadImage(vars = list(Features = "Image")),
    resizeImage(vars = "Features", width = 227, height = 227),
    extractPixels(vars = "Features"),
    featurizeImage(var = "Features", dnnModel = "alexnet")
  ))

Note that when featurizing an image, you need to shrink it down to a specified size (the built-in function resizeImage handles that). There are also several pretrained models to choose from: three variants of ResNet and also Alexnet, which we use here. This gives us a features vector of 4096 numbers to represent our deck chair image. Then, we just need to use the same process to calculate the features vector for our 60 other images, and find the one that's closest to our deck chair. We can simply use the dist function in R to do that, for example, and that's exactly what Remko's script on Github does. The image with the closest features vector to our representative image is this one:
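For illustration only, here is a minimal sketch of that matching step (my own code, not Remko's actual script; chairFeatures and deckChair are hypothetical objects holding the rxFeaturize output: a 60 x 4096 numeric matrix with image names as row names, and the 4096-element feature vector of the reference image):

# Euclidean distance from the deck chair's features to every candidate image
d <- as.matrix(dist(rbind(deckChair, chairFeatures)))[1, -1]
best_match <- names(which.min(d))   # name of the closest-looking chair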

So, even with a relatively small collection of images, it's possible to build a powerful image recognition system using image featurization and the pretrained image-recognition neural networks provided with Microsoft R. The complete code and sample images used in this example are available on Github. (Note, you'll need to have a license for Microsoft R Server or install the free Microsoft R Client with the pretrained models option to use the image featurization functions in the script.) And for more details on creating this recognizer, check out the blog post below.

Cortana Intelligence and Machine Learning Blog: Find Images With Images


To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Thank You For The Very Nice Comment

Wed, 08/16/2017 - 22:29

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Somebody nice reached out and gave us this wonderful feedback on our new Supervised Learning in R: Regression (paid) video course.

Thanks for a wonderful course on DataCamp on XGBoost and Random forest. I was struggling with Xgboost earlier and Vtreat has made my life easy now :).

Supervised Learning in R: Regression covers a lot as it treats predicting probabilities as a type of regression. Nina and I are very proud of this course and think it is very much worth your time (for the beginning through advanced R user).

vtreat is a statistically sound data cleaning and preparation tool introduced towards the end of the course. R users who try vtreat find it makes training and applying models much easier.

vtreat is distributed as a free open-source package available on CRAN. If you are doing predictive modeling in R I honestly think you will find vtreat invaluable.



And to the person who took the time to write the nice note: a sincere thank you from both Nina Zumel and myself. That kind of interaction really makes developing courses and packages feel worthwhile.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Data wrangling : Cleansing – Regular expressions (1/3)

Wed, 08/16/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


Data wrangling is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process, estimated to take about 60-80% of an analyst's time. In this series we will go through this process. It will be a brief series, with the goal of sharpening the reader's skills on the data wrangling task. This is the fourth part of the series, and it aims to cover the cleansing of the data used. In previous parts we learned how to import, reshape and transform data. The rest of the series will be dedicated to the data cleansing process. In this post we will go through regular expressions: sequences of characters that define a search pattern, mainly for use in pattern matching with text strings. In particular, we will cover the foundations of regular expression syntax.

Before proceeding, it might be helpful to look over the help pages for sub and gsub.
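As a quick refresher before you start (a toy example of my own, not one of the exercise strings): sub() replaces only the first match of a pattern, while gsub() replaces every match.

> sub("o", "0", "foo boo")    # only the first match is replaced
[1] "f0o boo"
> gsub("o", "0", "foo boo")   # every match is replaced
[1] "f00 b00"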

Moreover, please run the following commands to create the strings that we will work on.
textmeta <- "R|is|cool,|so|are|you|that|you|are|for|__|your|skills|by|solving|this|exercise. Moreover parenthesis symbol is []! Finally once you are done with this set go for a coffee, you deserve it!"

textseq <-"I hope you are using R version 3.4.0 and you have updated on 2017-04-21, with nickname:'You stupid Darkness'."

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

From object textmeta substitute the full stop (‘.’) with exclamation mark (‘!’) and assign the result to the object text.

Exercise 2

From object text substitute the double underscore (‘__’) with ‘enchancing’ and assign the result to the object text.

Exercise 3

From object text substitute the backslash (‘\’) with a letter-spacing (‘ ‘) and assign the result to the object text.

Learn more about text analysis in the online course Text Analytics/Text Mining Using R. In this course you will learn how to create, analyse and finally visualize your text-based data source. Having all the steps easily outlined will be a great reference source for future work.

Exercise 4

From object text substitute the square brackets (‘[]’) with parentheses (‘()’) and assign the result to the object text.

Exercise 5

From object text substitute the ‘coffee’ with ‘your favourite beverage’ and assign the result to the object text.

Exercise 6

From object textseq substitute all the digits with a sharp (‘#’) and assign the result to the object text.

Exercise 7

From object textseq substitute any space with a underscore (‘_’) and assign the result to the object text.

Exercise 8

From object textseq substitute every word with ‘blah’ and assign the result to the object text.

Exercise 9

From object textseq separate each word with a backslash (‘\’) and assign the result to the object text.

Exercise 10

From object textseq separate each character boundary with a backslash (‘\’) and assign the result to the object text.

Related exercise sets:
  1. Regular Expressions Exercises – Part 1
  2. Density-Based Clustering Exercises
  3. Data wrangling : I/O (Part-2)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Understanding overfitting: an inaccurate meme in supervised learning

Wed, 08/16/2017 - 16:40

(This article was first published on Memo's Island, and kindly contributed to R-bloggers)

Preamble There is a lot of confusion among practitioners regarding the concept of overfitting. A kind of urban legend, or meme, seems to be circulating in data science and allied fields with the following statement:
Applying cross-validation prevents overfitting and a good out-of-sample performance, low generalisation error in unseen data, indicates not an overfit.

This statement is of course not true: cross-validation does not prevent your model from overfitting, and good out-of-sample performance does not guarantee that a model is not overfitted. What people actually refer to in one aspect of this statement is called overtraining. Unfortunately, this meme is propagated not only in industry but in some academic papers as well. This might at best be a confusion of jargon. But it would be good practice to set the jargon right and be clear about what we mean when we say overfitting, when communicating our results.

Aim In this post, we will give an intuition on why model validation, i.e. approximating the generalization error of a model fit, and detection of overfitting cannot be resolved simultaneously on a single model. After some conceptual introduction, we will work through a concrete example workflow for understanding overfitting, overtraining and a typical final model-building stage. We will avoid references to Bayesian interpretations and regularisation and restrict the post to regression and cross-validation, since regularisation has different ramifications due to its mathematical properties and prior distributions have different implications in Bayesian statistics. We assume an introductory background in machine learning, so this is not a beginner's tutorial.

A recent question from Andrew Gelman, a Bayesian guru, regarding "What is overfitting?" was one of the reasons this post was developed, along with my frustration at seeing practitioners remain muddy on the meaning of overfitting, and at recently published data science articles, and even some academic papers, continuing to circulate the statement above.

What do we need to satisfy in supervised learning? One of the most basic tasks in mathematics is to find a solution to a function. If we restrict ourselves to real numbers in $n$ dimensions, our domain of interest is $\mathbb{R}^{n}$. Now imagine a set of $p$ points $x_{i}$ living in this domain; they form a dataset, which is actually a partial solution to a function. The main purpose of modelling is to find an explanation of the dataset, meaning that we need to determine $m$ unknown parameters, $a \in \mathbb{R}^{m}$. (Note that a non-parametric model does not mean a model with no parameters.) Mathematically speaking, this manifests as a function, $f(x, a)$, and the modelling exercise is usually called regression, interpolation or supervised learning depending on the literature you are reading. This is a form of inverse problem: we don't know the parameters, but we have partial information about the variables. The main issue here is ill-posedness, meaning that the solutions are not well-posed. Omitting the axiomatic technical details, the practical problem is that we can find many functions $f(x, a)$, or models, explaining the dataset. So we seek the following two properties in our model solution, $f(x, a)=0$. 1. Generalized: a model should not depend on the dataset. This step is called model validation. 2. Minimally complex: a model should obey Occam's razor, or the principle of parsimony. This step is called model selection.
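To make the parameter-fitting step concrete (a small addition of mine, not from the original post), in ordinary least-squares regression the unknown parameters are chosen to minimise the squared residuals over the $p$ data points, where $y_{i}$ denotes the observed response at $x_{i}$ (ysim in the appendix code): $$\hat{a} = \underset{a \in \mathbb{R}^{m}}{\operatorname{arg\,min}} \; \sum_{i=1}^{p} \big( y_{i} - f(x_{i}, a) \big)^{2}.$$ This is exactly what lm() computes for the polynomial formulas used later in the post.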

Figure 1: A workflow for model validation and selection in supervised learning.

Generalization of a model can be measured by goodness-of-fit. It essentially tells us how well our model (the chosen function) explains the dataset. Finding a minimally complex model requires a comparison against another model. Up to now, we have not named a technique for checking whether a model is generalized, or for selecting the best model. Unfortunately, there is no unique way of doing either; that is the task of the data scientist or quantitative practitioner, and it requires human judgement.

Model validation: An example One way to check whether a model is generalized enough is to come up with a metric for how well it explains the dataset. Our task in model validation is to estimate the model error. For example, root mean square deviation (RMSD) is one metric we can use. If the RMSD is low, we could say that our model fit is good; ideally it should be close to zero. But this is not generalized enough if we use the same dataset to measure the goodness-of-fit. We could use a different dataset, specifically an out-of-sample dataset, to validate this as much as we can, i.e. the so-called hold-out method. Out-of-sample is just a fancy way of saying that we did not use the same dataset to find the values of the parameters $a$. An improved way of doing this is cross-validation: we split our dataset into $k$ partitions, and we obtain $k$ RMSD values to average over. This is summarised in Figure 1. Note that different parameterisations of the same model do not constitute different models.
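For reference (my own addition, written in the post's notation), the RMSD of a fitted model $f(x, \hat{a})$ evaluated on $n$ held-out points is $$\mathrm{RMSD} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \big( f(x_{i}, \hat{a}) - y_{i} \big)^{2} },$$ which is exactly what the lm_rmsd helper function in the appendix computes.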

Model Selection: Detection of overfitting Overfitting comes into play when we try to satisfy the 'minimally complex model' requirement. This is a comparison problem, and we need more than one model to judge whether a given model is an overfit. Douglas Hawkins, in his classic paper The Problem of Overfitting, states that

Overfitting of models is widely recognized as a concern. It is less recognized however that overfitting is not an absolute but involves a comparison. A model overfits if it is more complex than another model that fits equally well.

The important point here is what we mean by a complex model, or how we can quantify model complexity. Unfortunately, again there is no unique way of doing this. One of the most common approaches is to say that a model with more parameters is more complex. But this is again a bit of a meme and not generally true. One could actually resort to different measures of complexity. For example, by this definition $f_{1}(a,x)=ax$ and $f_{2}(a,x)=ax^2$ have the same complexity, having the same number of free parameters, but intuitively $f_{2}$ is more complex, since it is nonlinear. There are many information-theory-based measures of complexity, but discussion of those is beyond the scope of this post. For demonstration purposes, we will treat more parameters and a higher degree of nonlinearity as indicating a more complex model.
Figure 2: Simulated data and the non-stochastic part of the data.

Hands-on example We have intuitively covered the reasons why we can't resolve model validation and judge overfitting simultaneously. Now let's demonstrate this with a simple dataset and models that nonetheless capture the above premise.
A usual procedure is to generate a synthetic, or simulated, dataset from a model as a gold standard and use this dataset to build other models. Let's use the following functional form, from the classic text of Bishop, but with added Gaussian noise: $$ f(x) = \sin(2\pi x) + \mathcal{N}(0,0.1).$$ We generate a large enough set, 100 points, to avoid the sample-size issues discussed in Bishop's book; see Figure 2. Let's decide on two models we would like to apply to this dataset in a supervised learning task. Note that we won't be discussing the Bayesian interpretation here, so the equivalency of these models under a strong prior assumption is not an issue, as we are using this example only for ease of demonstrating the concept. Two polynomial models, of degree $3$ and degree $6$, which we call $g(x)$ and $h(x)$ respectively, are used to learn from the simulated data: $$g(x) = a_{0} + a_{1} x + a_{2} x^{2} + a_{3} x^{3}$$ and $$h(x) = b_{0} + b_{1} x + b_{2} x^{2} + b_{3} x^{3} + b_{4} x^{4} + b_{5} x^{5} + b_{6} x^{6}.$$
Figure 3: Overtraining occurs after around 40 percent of the data is used to train g(x).

Overtraining is not overfitting Overtraining means that a model's performance degrades while learning the model parameters, as a function of some objective variable that affects how the model is built; for example, the objective variable can be the training data size or the iteration cycle in a neural network. This is more prevalent in neural networks (see Dayhoff 2011). In our practical example, this manifests in the hold-out method when measuring RMSD while modelling with g(x); in other words, in finding the optimal number of data points to use for training so that the model performs better on unseen data. See Figures 3 and 4.

Overfitting with low validation error We can also estimate the 10-fold cross-validation error, CV-RMSD. For this sampling, g and h have CV-RMSD values of 0.13 and 0.12 respectively. So we have a situation where the more complex model reaches similar predictive power under cross-validation, and we cannot detect this overfitting by just looking at the CV-RMSD value or at the 'overtraining' curve in Figure 4. We need two models to compare, hence both Figure 3 and Figure 4, along with both CV-RMSD values. One might argue that in small datasets we could tell the difference by looking at the gap between test and training error; this is exactly how Bishop explains overfitting, where he points out overtraining in small datasets.
Which trained model to deploy? Now the question is: we have empirically found the best-performing model with minimal complexity. All well, but which trained model should we use in production? Actually, we have already built the model during model selection. In the above case, since we got similar predictive power from g and h, we will obviously use g, trained at the splitting sweet spot from Figure 3.


Figure 4: Overtraining occurs after around 30 percent of the data is used to train h(x).

Conclusion The essential message here is that good validation performance does not guarantee the detection of an overfitted model, as we have seen from examples using synthetic data in one dimension. Overtraining is actually what most practitioners mean when they use the term overfitting.

Outlook As more and more people use techniques from machine learning and inverse problems, both in academia and industry, some key technical concepts drift a bit and take on different definitions and meanings for different people, because people learn some concepts not from reading the literature carefully but verbally from their line managers or senior colleagues. This creates memes which are actually wrong, or at least create a lot of confusion in the jargon. It is very important for all of us as practitioners to question all technical concepts and to try to seek their origins in the published scientific literature, rather than relying entirely on verbal explanations from our experienced colleagues. Also, we should strongly avoid ridiculing questions from colleagues even if they sound too simple; at the end of the day we never stop learning, and naive-looking questions might have very important consequences for the fundamentals of the field.

Figure 5: The deployed models h and g on the testing set, together with the original data.

Appendix: Reproducing the example using R

The code used to produce the synthetic data, the modelling steps and the visualisation of the results can be found in the github [repo]. In this appendix, we present this R code with detailed comments; the visualisation code is omitted but is available in the github repository.

R (GNU S) provides a very powerful formula interface, probably the most advanced and expressive formula interface in statistical computing, along with S of course.







The two polynomials above can be expressed both as formulas and as functions that we can evaluate.

#'
#' Two polynomial models: g and h, 3rd and 6th degree respectively.
#'
g_fun <- function(x, params) as.numeric(params[1] + x*params[2] + x^2*params[3] + x^3*params[4])
h_fun <- function(x, params) as.numeric(params[1] + x*params[2] + x^2*params[3] + params[4]*x^3 +
                                         params[5]*x^4 + params[6]*x^5 +
                                         params[7]*x^6)
g_formula <- ysim ~ I(x) + I(x^2) + I(x^3)
h_formula <- ysim ~ I(x) + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6)


Learning from the data is done with the lm function in R:
#'
#' Given a data.frame with x and ysim, and an R formula with ysim=f(x),
#' fit a linear model
#'
get_coefficients <- function(data_portion, model_formula) {
  model <- lm(model_formula, data = data_portion)
  return(model$coefficients)
}

and the resulting fitted function can be applied to a new dataset with the following helper function, using RMSD as the performance metric.
#'
#' Find the prediction error for a given model_function and model_formula
#'
lm_rmsd <- function(x_train, y_train, x_test, y_test, model_function, model_formula) {
  params <- get_coefficients(data.frame(x = x_train, ysim = y_train), model_formula)
  params[as.numeric(which(is.na(params)))] <- 0  # if there is any co-linearity
  f_hat <- sapply(x_test, model_function, params = params)
  return(sqrt(sum((f_hat - y_test)^2) / length(f_hat)))
}


We can generate the simulated data described above using seq and rnorm.

#'
#' Generate a synthetic dataset
#' A similar model from Bishop:
#'
#' f(x) = sin(2pi*x) + N(0, 0.1)
#'
set.seed(424242)
f    <- function(x) return(sin(2*pi*x))
fsim <- function(x) return(sin(2*pi*x) + rnorm(1, 0, 0.1))
x    <- seq(0, 1, 1e-2)
y    <- sapply(x, f)
ysim <- sapply(x, fsim)
simdata <- data.frame(x = x, y = y, ysim = ysim)

To detect overtraining we can split the data at different points, with increasing training size, and measure the performance both on the training data itself and on the unseen test data.

library(reshape2)  # provides melt()

#'
#' Demonstration of overtraining with g
#'
set.seed(424242)
model_function <- g_fun
model_formula  <- g_formula
split_percent  <- seq(0.05, 0.95, 0.03)
split_len      <- length(split_percent)
data_len       <- length(simdata$ysim)
splits         <- as.integer(data_len * split_percent)
test_rmsd      <- vector("integer", split_len - 1)
train_rmsd     <- vector("integer", split_len - 1)
for (i in 2:split_len) {
  train_ix <- sample(1:data_len, splits[i-1])
  test_ix  <- (1:data_len)[-train_ix]
  train_rmsd[i-1] <- lm_rmsd(simdata$x[train_ix],
                             simdata$ysim[train_ix],
                             simdata$x[train_ix],
                             simdata$ysim[train_ix],
                             model_function,
                             model_formula)
  test_rmsd[i-1] <- lm_rmsd(simdata$x[train_ix],
                            simdata$ysim[train_ix],
                            simdata$x[test_ix],
                            simdata$ysim[test_ix],
                            model_function,
                            model_formula)
}

rmsd_df  <- data.frame(test_rmsd = test_rmsd, train_rmsd = train_rmsd, percent = split_percent[-1])
rmsd_df2 <- melt(rmsd_df, id = c("percent"))
colnames(rmsd_df2) <- c("percent", "Error_on", "rmsd")
rmsd_df2$test_train <- as.factor(rmsd_df2$Error_on)

And the last portion of the code does 10-fold cross-validation.

#' CV for g(x) and h(x)
split_percent <- seq(0, 1, 0.1)
split_len     <- length(split_percent)
data_len      <- length(simdata$ysim)
splits        <- as.integer(data_len * split_percent)
cv_rmsd_g <- 0
cv_rmsd_h <- 0
for (i in 2:split_len) {  # 10-fold cross-validation
  test_ix  <- (splits[i-1] + 1):splits[i]
  train_ix <- (1:data_len)[-test_ix]
  x_train  <- simdata$x[train_ix]
  y_train  <- simdata$ysim[train_ix]
  x_test   <- simdata$x[test_ix]
  y_test   <- simdata$ysim[test_ix]
  cv_rmsd_g <- lm_rmsd(x_train,
                       y_train,
                       x_test,
                       y_test,
                       g_fun,
                       g_formula) + cv_rmsd_g
  cv_rmsd_h <- lm_rmsd(x_train,
                       y_train,
                       x_test,
                       y_test,
                       h_fun,
                       h_formula) + cv_rmsd_h
}

cat("10-fold CV error G = ", cv_rmsd_g/split_len, "\n")  # 0.1304164
cat("10-fold CV error H = ", cv_rmsd_h/split_len, "\n")  # 0.1206458






To leave a comment for the author, please follow the link and comment on their blog: Memo's Island. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

MODIStsp 1.3.3 is out – Speeding things up and squashing some bugs !

Wed, 08/16/2017 - 10:42

(This article was first published on Spatial Processing in R, and kindly contributed to R-bloggers)

  A new version of MODIStsp (1.3.3) is on CRAN as of today ! Below, you can find a short description of the main improvements.

Processing speed improvements

  Processing of MODIS layers after download (i.e., scale and offset calibration, computation of Spectral Indexes and Quality Indicators) is now much faster.


  As you can see in the figure, processing time was almost halved on my (not so fast) laptop. This was achieved by modifying all the computation functions to use `raster::calc()` and `raster::overlay()` (more on this in a later post).
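As a rough sketch of the idea (my own example, not the actual MODIStsp code; the layer and scale factor are made up), applying a scale and offset through `raster::calc()` lets the raster package process the data block-by-block rather than pulling every pixel into memory at once:

library(raster)

# toy raster standing in for a downloaded MODIS layer
r <- raster(ncol = 100, nrow = 100)
values(r) <- sample(0:10000, ncell(r), replace = TRUE)

# apply a (hypothetical) scale factor and offset, as in MODIS calibration
scale_factor <- 0.0001
offset_val   <- 0
r_cal <- calc(r, fun = function(x) x * scale_factor + offset_val)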

Although speed is also limited by download speed and compression options, this should save quite some time when working on large areas and with many MODIS layers.



Test mode

  MODIStsp 1.3.3 also introduces a test mode. Although it was mainly developed to facilitate unit testing, it is also available to the user. Running:

MODIStsp(test = X) # with X equal to a number between 0 and 6

  will run MODIStsp exploiting different sets of processing parameters (available as JSON files in the \Test_files subfolder of the MODIStsp installation). We hope this will help us more easily pinpoint and solve any problems signalled by users.

Bug fixes

  Several bugs discovered after the v1.3.2 release were fixed. We thank MODIStsp users for their feedback and help in improving the package! You can find a list of the main fixes on our NEWS page.


To leave a comment for the author, please follow the link and comment on their blog: Spatial Processing in R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Master the tidyverse

Wed, 08/16/2017 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

If you’ve read the book “R for Data Science” or plan to, now you can dive deeper with co-author and RStudio Master Instructor Garrett Grolemund, winner of the Excellence in CE Award at JSM 2017!

This two day workshop provides a comprehensive overview of what is now called the Tidyverse, a core set of R packages that are essential to Data Science. You will visualize, transform, and model data in R and work with date-times, character strings, and untidy data formats. Along the way, you will learn to use the brightest stars in the tidyverse: ggplot2, dplyr, tidyr, readr, purrr and tibble, along with stringr, lubridate, hms, and forcats.

  • When? 9 a.m. to 5 p.m., Thursday, October 5th and Friday, October 6th

  • Where? 20 F Street NW Conference Center, Washington, D.C.

  • Who? Garrett Grolemund, Data Scientist and Master Instructor at RStudio.

Garrett excels at teaching, statistics, and teaching statistics. He wrote the popular lubridate package and is the author of Hands-On Programming with R in addition to co-authoring R for Data Science. He holds a PhD in Statistics and specializes in Data Visualization.

Build your skills and learn from the best at this rare in-person workshop – the only East coast workshop from RStudio in 2017.

Register here: https://www.rstudio.com/workshops/master-the-tidyverse/

As of today, there are just 30+ seats left. Discounts are available for academics (students or faculty) and for 5 or more attendees from any organization. Email training@rstudio.com if you have any questions about the workshop that you don’t find answered on the registration page.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
