R bloggers
R news and tutorials contributed by hundreds of R bloggers
Updated: 10 hours 1 min ago

RStudio::Conf 2018

Tue, 08/29/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

It’s not even Labor Day, so it seems to be a bit early to start planning for next year’s R conferences. But, early-bird pricing for RStudio::Conf 2018 ends this Thursday.

The conference, which will be held in San Diego between January 31st and February 3rd, promises to match and even surpass this year’s event. In addition to keynotes from Di Cook (Monash University and Iowa State University), J.J. Allaire (RStudio Founder, CEO & Principal Developer), Shiny creator Joe Cheng, and Chief Scientist Hadley Wickham, a number of knowledgeable (and entertaining) speakers have already committed, including quant, long-time R user and Twitter humorist JD Long (@CMastication), Stack Overflow’s David Robinson (@drob), and ProPublica editor Olga Pierce (@olgapierce).

Making the deadline for early-bird pricing will get you significant savings. The “full seat” price for the conference is $495 before midnight EST on August 31, 2017, and $695 thereafter. You can register here.

For a good idea of the kinds of things you can expect to learn at RStudio::Conf 2018 have a look at the videos from this year’s event.




Shiny Dev Center gets a shiny new update

Tue, 08/29/2017 - 02:00

I am excited to announce the redesign and reorganization of shiny.rstudio.com, also known as the Shiny Dev Center. The Shiny Dev Center is the place to go to learn about all things Shiny and to keep up to date with it as it evolves.

The goal of this refresh is to provide a clear learning path for those who are just starting off with developing Shiny apps as well as to make advanced Shiny topics easily accessible to those building large and complex apps. The articles overview that we designed to help navigate the wealth of information on the Shiny Dev Center aims to achieve this goal.

Other highlights of the refresh include:

  • A brand new look!
  • New articles
  • Updated articles with modern Shiny code examples
  • Explicit linking, where relevant, to other RStudio resources like webinars, support docs, etc.
  • A prominent link to our ever growing Shiny User Showcase
  • A guide for contributing to Shiny (inspired by the Tidyverse contribute guide)

Stay tuned for more updates to the Shiny Dev Center in the near future!


6 R Jobs for R users (2017-08-28) – from all over the world

Tue, 08/29/2017 - 00:23
To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs

 

More New Jobs
  1. Freelance
    Optimization Expert IdeaConnection – Posted by Sherri Ann O’Gorman
    Anywhere
    22 Aug 2017
  2. Full-Time
    Data journalist for The Economist @London The Economist – Posted by cooberp
    London, England, United Kingdom
    18 Aug 2017
  3. Full-Time
    Technical Business Analyst Investec Asset Management – Posted by IAM
    London, England, United Kingdom
    14 Aug 2017
  4. Full-Time
    Senior Data Scientist @ Dallas, Texas, U.S. Colaberry Data Analytics – Posted by Colaberry_DataAnalytics
    Dallas, Texas, United States
    8 Aug 2017
  5. Full-Time
    Financial Analyst/Modeler @ Mesa, Arizona, U.S. MD Helicopters – Posted by swhalen
    Mesa, Arizona, United States
    31 Jul 2017
  6. Full-Time
    Research volunteer in Cardiac Surgery @ Philadelphia, Pennsylvania, U.S. Thomas Jefferson University – Posted by CVSurgery
    Philadelphia, Pennsylvania, United States
    31 Jul 2017

 

On R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).


Le Monde puzzle [#1018]

Tue, 08/29/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

An arithmetic Le Monde mathematical puzzle (that at first did not seem to involve R programming because of the large number of digits in the quantity involved):

An integer x with less than 100 digits is such that adding the digit 1 on both sides of x produces the integer 99x.  What are the last nine digits of x? And what are the possible numbers of digits of x?

The integer x satisfies the identity

10^{ω+1} + 10x + 1 = 99x,

where ω is the number of digits of x. This amounts to

10….01 = 89 x,

where there are ω zeros. Working with long integers in R could bring an immediate solution, but I went for a pedestrian version, handling each digit at a time and starting from the final one which is necessarily 9:

#multiply by 9
rap=0;row=NULL
for (i in length(x):1){
  prud=rap+x[i]*9
  row=c(prud%%10,row)
  rap=prud%/%10}
row=c(rap,row)
#multiply by 80
rep=raw=0
for (i in length(x):1){
  prud=rep+x[i]*8
  raw=c(prud%%10,raw)
  rep=prud%/%10}
#find next digit
y=(row[1]+raw[1]+(length(x)>1))%%10

returning

7 9 7 7 5 2 8 0 9

as the (only) last digits of x. The same code can be exploited to check that the complete multiplication produces a number of the form 10….01, hence to deduce that the length of x is either 21 or 65, with solutions

[1] 1 1 2 3 5 9 5 5 0 5 6 1 7 9 7 7 5 2 8 0 9

 [1] 1 1 2 3 5 9 5 5 0 5 6 1 7 9 7 7 5 2 8 0 8 9 8 8 7 6 4 0 4 4 9 4 3 8 2 0 2 2
[39] 4 7 1 9 1 0 1 1 2 3 5 9 5 5 0 5 6 1 7 9 7 7 5 2 8 0 9
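As a cross-check, the "long integers" shortcut mentioned above can be sketched with the gmp package (my own illustration, not the author's code): since 10^{ω+1} + 1 = 89x, it suffices to test which exponents make that quantity divisible by 89.

# cross-check via big integers: find omega such that 10^(omega+1) + 1 is a
# multiple of 89, then print the last nine digits of x = (10^(omega+1) + 1)/89
library(gmp)
for (omega in 1:99) {
  N <- as.bigz(10)^(omega + 1) + 1
  if (N %% 89 == 0) {
    x <- as.character(N %/% 89)
    cat("omega =", omega, " last digits of x:", substr(x, omega - 8, omega), "\n")
  }
}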

The maths question behind is to figure out the powers k of 10 such that 10^k ≡ 88 ≡ −1 mod (89), with k = ω+1.

For instance, 10²≡11 mod (89) and 11¹¹≡88 mod (89) leads to the first solution ω=21. And then, since 10⁴⁴≡1 mod (89), ω=21+44=65 is another solution…

Filed under: Books, Kids, R Tagged: arithmetics, competition, Le Monde, long division, mathematical puzzle, R



Packages to simplify mapping in R

Mon, 08/28/2017 - 23:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Computerworld's Sharon Machlis has published a very useful tutorial on creating geographic data maps with R. (The tutorial was actually published back in March, but I only came across it recently.) While it's been possible to create maps in R for a long time, some recent packages and data APIs have made the process much simpler. The tutorial is based on the following R packages:

  • sf, a package that provides simple data structures for objects representing spatial geometries including map features like borders and rivers
  • tmap and tmaptools, packages for creating static and interactive "thematic maps" like choropleths and spatial point maps using a ggplot2-like syntax
  • tigris, a package that provides shapefiles you can use to map the USA and its states, counties, and census tracts (very useful for visualizing US census data, which you can easily obtain with the tidycensus package)
  • rio, a package to streamline the import of flat files from third-party data sources like the U.S. Bureau of Labor Statistics

With those packages, as you'll learn in the tutorial linked below, you can easily create attractive maps to visualize geographic data, and even make those graphics interactive with scrolling, zooming, and data pop-ups thanks to the capabilities of the leaflet package. All of these packages are now available on CRAN, so there's no need to install from Github as the tutorial then suggested, unless you want access to even newer in-development capabilities.
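For a flavour of how these pieces fit together, here is a minimal sketch (my own, not code from the tutorial); it assumes the packages are installed and uses the ALAND land-area column that ships with the Census cartographic boundary files:

# Sketch: state boundaries from tigris, drawn as a choropleth with tmap
library(tigris)
library(tmap)
options(tigris_class = "sf")    # ask tigris to return sf objects

states_sf <- states(cb = TRUE)  # cartographic-boundary state polygons

# a toy variable to map; a real analysis would join census or BLS data here
states_sf$land_km2 <- as.numeric(states_sf$ALAND) / 1e6

tm_shape(states_sf) +
  tm_polygons("land_km2", title = "Land area (sq. km)")

tmap_mode("view")               # switch to interactive, leaflet-backed maps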

Computerworld: Mapping in R just got a whole lot easier



Neat New seplyr Feature: String Interpolation

Mon, 08/28/2017 - 22:33

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

The R package seplyr has a neat new feature: the function seplyr::expand_expr() which implements what we call “the string algebra” or string expression interpolation. The function takes an expression of mixed terms, including: variables referring to names, quoted strings, and general expression terms. It then “de-quotes” all of the variables referring to quoted strings and “dereferences” variables thought to be referring to names. The entire expression is then returned as a single string.



This provides a powerful way to easily work complicated expressions into the seplyr data manipulation methods.

The method is easiest to see with an example:

library("seplyr")
## Loading required package: wrapr

ratio <- 2
compCol1 <- "Sepal.Width"
expr <- expand_expr("Sepal.Length" >= ratio * compCol1)

print(expr)
## [1] "Sepal.Length >= ratio * Sepal.Width"

expand_expr works by capturing the user supplied expression unevaluated, performing some transformations, and returning the entire expression as a single quoted string (essentially returning new source code).

Notice that in the above, one layer of quoting was removed from "Sepal.Length", and the name referred to by “compCol1” was substituted into the expression. “ratio” was left alone, as it was not referring to a string (and hence cannot be a name; unbound or free variables are also left alone). So we see that the substitution performed does depend on what values are present in the environment.

If you want to be stricter in your specification, you could add quotes around any symbol you do not want de-referenced. For example:

expand_expr("Sepal.Length" >= "ratio" * compCol1)
## [1] "Sepal.Length >= ratio * Sepal.Width"

After the substitution the returned quoted expression is exactly in the form seplyr expects. For example:

resCol1 <- "Sepal_Long"

datasets::iris %.>%
  mutate_se(., resCol1 := expr) %.>%
  head(.)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_Long
## 1          5.1         3.5          1.4         0.2  setosa      FALSE
## 2          4.9         3.0          1.4         0.2  setosa      FALSE
## 3          4.7         3.2          1.3         0.2  setosa      FALSE
## 4          4.6         3.1          1.5         0.2  setosa      FALSE
## 5          5.0         3.6          1.4         0.2  setosa      FALSE
## 6          5.4         3.9          1.7         0.4  setosa      FALSE

Details on %.>% (dot pipe) and := (named map builder) can be found here and here, respectively. The idea is: seplyr::mutate_se(., "Sepal_Long" := "Sepal.Length >= ratio * Sepal.Width") should be equivalent to dplyr::mutate(., Sepal_Long = Sepal.Length >= ratio * Sepal.Width).

seplyr also provides a number of seplyr::*_nse() convenience forms wrapping all of these steps into one operation. For example:

datasets::iris %.>%
  mutate_nse(., resCol1 := "Sepal.Length" >= ratio * compCol1) %.>%
  head(.)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_Long
## 1          5.1         3.5          1.4         0.2  setosa      FALSE
## 2          4.9         3.0          1.4         0.2  setosa      FALSE
## 3          4.7         3.2          1.3         0.2  setosa      FALSE
## 4          4.6         3.1          1.5         0.2  setosa      FALSE
## 5          5.0         3.6          1.4         0.2  setosa      FALSE
## 6          5.4         3.9          1.7         0.4  setosa      FALSE

To use string literals you merely need one extra layer of quoting:

"is_setosa" := expand_expr(Species == "'setosa'")
##               is_setosa
## "Species == \"setosa\""

datasets::iris %.>%
  transmute_nse(., "is_setosa" := Species == "'setosa'") %.>%
  summary(.)
##  is_setosa
##  Mode :logical
##  FALSE:100
##  TRUE :50

The purpose of all of the above is to mix names that are known while we are writing the code (these are quoted) with names that may not be known until later (i.e., column names supplied as parameters). This allows the easy creation of useful generic functions such as:

countMatches <- function(data, columnName, targetValue) {
  # extra quotes to say we are interested in value, not de-reference
  targetSym <- paste0('"', targetValue, '"')
  data %.>%
    transmute_nse(., "match" := columnName == targetSym) %.>%
    group_by_se(., "match") %.>%
    summarize_se(., "count" := "n()")
}

countMatches(datasets::iris, "Species", "setosa")
## # A tibble: 2 x 2
##   match count
##
## 1 FALSE   100
## 2  TRUE    50

The purpose of the seplyr string system is to pull off quotes and de-reference indirect variables. So, you need to remember to add enough extra quotation marks to prevent this where you do not want it.
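To make that concrete, here is the failure mode to watch for (a sketch of my own; the printed string is what the de-quoting rules described above should produce):

# with only one layer of quotes, "setosa" is treated as a quoted name and gets
# de-quoted, so it is no longer a string literal in the returned expression
expand_expr(Species == "setosa")
## expected: "Species == setosa"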



Non-Linear Regression: Application to Monoclonal Peak Integration in Serum Protein Electrophoresis

Mon, 08/28/2017 - 18:26

(This article was first published on The Lab-R-torian, and kindly contributed to R-bloggers)

Background

At the AACC meeting recently, there was an enthusiastic discussion of standardization of reporting for serum protein electrophoresis (SPEP) presented by a working group headed up by Dr. Chris McCudden and Dr. Ron Booth, both of the University of Ottawa. One of the discussions pertained to how monoclonal bands, especially small ones, should be integrated. While many use the default manual vertical gating or “drop” method offered by Sebia's Phoresis software, Dr. David Keren was discussing the value of tangent skimming as a more repeatable and effective means of monoclonal protein quantitation. He was also discussing some biochemical approaches distinguishing monoclonal proteins from the background gamma proteins.

The drop method is essentially an eye-ball approach to where the peak starts and ends and is represented by the vertical lines and the enclosed shaded area.

The tangent skimming approach is easier to make reproducible. In the mass spectrometry world it is a well-developed approach with a long history and multiple algorithms in use. This is apparently the book. However, when tangent skimming is employed in SPEP, unless I am mistaken, it seems to be done by eye. The integration would look like this:

During the discussion it was pointed out that peak deconvolution of the monoclonal protein from the background gamma might be preferable to either of the two described procedures. By this I mean integration as follows:

There was discussion that this procedure is challenging for a number of reasons. Further, it should be noted that there will only likely be any clinical value in a deconvolution approach when the concentration of the monoclonal protein is low enough that manual integration will show poor repeatability, say < 5 g/L = 0.5 g/dL.

Easy Peaks

Fitting samples with larger monoclonal peaks is fairly easy. Fitting tends to converge nicely and produce something meaningful. For example, using the approach I am about to show below, an electropherogram like this:

with a gamma region looking like this:

can be deconvoluted with straightforward non-linear regression (and no baseline subtraction) to yield this:

and the area of the green monoclonal peak is found to be 5.3%.

More Difficult Peaks

What is more challenging is the problem of small monoclonals buried in normal \(\gamma\)-globulins. These could be difficult to integrate using a tangent skimming approach, particularly without image magnification. For the remainder of this post we will use a gel with a small monoclonal in the fast gamma region shown at the arrow.

Getting the Data

EP data can be extracted from the PDF output from any electrophoresis software. This is not complicated and can be accomplished with pdf2svg or Inkscape and some Linux bash scripting. I'm sure we can get it straight from the instrument but it is not obvious to me how to do this. One could also rescan a gel and use ImageJ to produce a densitometry scan which is discussed in the ImageJ documentation and on YouTube. ImageJ also has a macro language for situations where the same kind of processing is done repeatedly.

Smoothing

The data comprise 10284 (x,y) pairs. But if you zoom in and look carefully, you find that it is a series of staircases.

plot(y~x, data = head(ep.data,100), type = "o", cex = 0.5)

It turns out that this jaggedness significantly impairs attempts to numerically identify the peaks and valleys. So, I smoothed it a little using the handy rle() function to identify the midpoint of each step. This keeps the total area as close to its original value as possible–though this probably does not matter too much.

ep.rle <- rle(ep.data$y)
stair.midpoints <- cumsum(ep.rle$lengths) - floor(ep.rle$lengths/2)
ep.data.sm <- ep.data[stair.midpoints,]

plot(y~x, data = head(ep.data,300), type = "o", cex = 0.5)
points(y~x, data = head(ep.data.sm,300), type = "o", cex = 0.5, col = "red")

Now that we are satisfied that the new data is OK, I will overwrite the original dataframe.

ep.data <- ep.data.sm

Transformation

The units on the x and y-axes are arbitrary and come from page coordinates of the PDF. We can normalize the scan by making the x-axis go from 0 to 1 and by making the total area 1.

library(Bolstad) #A package containing a function for Simpson's Rule integration
ep.data$x <- ep.data$x/max(ep.data$x)
A.tot <- sintegral(ep.data$x,ep.data$y)$value
ep.data$y <- ep.data$y/A.tot

#sanity check
sintegral(ep.data$x,ep.data$y)$value

## [1] 1

plot(y~x, data = ep.data, type = "l")

Find Extrema

Using the findPeaks function from the quantmod package we can find the minima and maxima:

library(quantmod)
ep.max <- findPeaks(ep.data$y)

plot(y~x, data = ep.data, type = "l", main = "Maxima")
abline(v = ep.data$x[ep.max], col = "red", lty = 2)

ep.min <- findValleys(ep.data$y)

plot(y~x, data = ep.data, type = "l", main = "Minima")
abline(v = ep.data$x[ep.min], col = "blue", lty = 2)

Not surprisingly, there are some extraneous local extrema that we do not want. I simply manually removed them. Generally, this kind of thing could be tackled with more smoothing of the data prior to analysis.

ep.max <- ep.max[-1]
ep.min <- ep.min[-c(1,length(ep.min))]

Fitting

Now it's possible with the nls() function to fit the entire SPEP with a series of Gaussian curves simultaneously. It works just fine (provided you have decent initial estimates of \(\mu_i\) and \(\sigma_i\)) but there is no particular clinical value to fitting the albumin, \(\alpha_1\), \(\alpha_2\), \(\beta_1\) and \(\beta_2\) domains with Gaussians. What is of interest is separately quantifying the two peaks in \(\gamma\) with two separate Gaussians so let's isolate the \(\gamma\) region based on the location of the minimum between \(\beta_2\) and \(\gamma\).

Isolate the \(\gamma\) Region

gamma.ind <- max(ep.min):nrow(ep.data)
gamma.data <- data.frame(x = ep.data$x[gamma.ind], y = ep.data$y[gamma.ind])

plot(y ~ x, gamma.data, type = "l")

Attempt Something that Ultimately Does Not Work

At first I thought I could just throw two normal distributions at this and it would work. However, it does not work well at all and this kind of not-so-helpful fit turns out to happen a fair bit. I use the nls() function here which is easy to call. It requires a functional form which I set to be:

\[y = C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big) + C_2 \exp \Big({-\frac{(x-\mu_2)^2}{2\sigma_2^2}}\Big)\]

where \(\mu_1\) is the \(x\) location of the first peak in \(\gamma\) and \(\mu_2\) is the \(x\) location of the second peak in \(\gamma\). The estimates of \(\sigma_1\) and \(\sigma_2\) can be obtained by trying to estimate the full-width-half-maximum (FWHM) of the peaks, which is related to \(\sigma\) by

\[FWHM_i = 2 \sqrt{2\ln2} \times \sigma_i = 2.355 \times \sigma_i\]

I had to first make a little function that returns the respective half-widths at half-maximum and then uses them to estimate the \(FWHM\). Because the peaks are poorly resolved, it also tries to get the smallest possible estimate returning this as FWHM2.

FWHM.finder <- function(ep.data, mu.index){
  peak.height <- ep.data$y[mu.index]
  fxn.for.roots <- ep.data$y - peak.height/2
  indices <- 1:nrow(ep.data)
  root.indices <- which(diff(sign(fxn.for.roots))!=0)
  tmp <- c(root.indices,mu.index) %>% sort
  tmp2 <- which(tmp == mu.index)
  first.root <- root.indices[tmp2 -1]
  second.root <- root.indices[tmp2]
  HWHM1 <- ep.data$x[mu.index] - ep.data$x[first.root]
  HWHM2 <- ep.data$x[second.root] - ep.data$x[mu.index]
  FWHM <- HWHM2 + HWHM1
  FWHM2 = 2*min(c(HWHM1,HWHM2))
  return(list(HWHM1 = HWHM1, HWHM2 = HWHM2, FWHM = FWHM, FWHM2 = FWHM2))
}

The peak in the \(\gamma\) region was obtained previously:

plot(y ~ x, gamma.data, type = "l")
gamma.max <- findPeaks(gamma.data$y)
abline(v = gamma.data$x[gamma.max])

and from them \(\mu_1\) is determined to be 0.7. We have to guess where the second peak is, which is at about \(x=0.75\) and has an index of 252 in the gamma.data dataframe.

gamma.data[252,]

##             x         y
## 252 0.7487757 0.6381026

#append the second peak
gamma.max <- c(gamma.max,252)
gamma.mu <- gamma.data$x[gamma.max]
gamma.mu

## [1] 0.6983350 0.7487757

plot(y ~ x, gamma.data, type = "l")
abline(v = gamma.data$x[gamma.max])

Now we can find the estimates of the standard deviations:

#find the FWHM estimates of sigma_1 and sigma_2:
FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.data)
gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355

The estimates of \(\sigma_1\) and \(\sigma_2\) are now obtained. The estimates of \(C_1\) and \(C_2\) are just the peak heights.

peak.heights <- gamma.data$y[gamma.max]

We can now use nls() to determine the fit.

fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) +
                C2*exp(-(x-mean2)**2/(2 * sigma2**2))),
           data = gamma.data,
           start = list(mean1 = gamma.mu[1],
                        mean2 = gamma.mu[2],
                        sigma1 = gamma.sigma[1],
                        sigma2 = gamma.sigma[2],
                        C1 = peak.heights[1],
                        C2 = peak.heights[2]),
           algorithm = "port")

Determining the fitted values of our unknown coefficients:

dffit <- data.frame(x=seq(0, 1 , 0.001))
dffit$y <- predict(fit, newdata=dffit)
fit.sum <- summary(fit)
fit.sum #show the fitted coefficients

## 
## Formula: y ~ (C1 * exp(-(x - mean1)^2/(2 * sigma1^2)) + C2 * exp(-(x -
##     mean2)^2/(2 * sigma2^2)))
## 
## Parameters:
##         Estimate Std. Error t value Pr(>|t|)    
## mean1  0.7094793  0.0003312 2142.23   <2e-16 ***
## mean2  0.7813900  0.0007213 1083.24   <2e-16 ***
## sigma1 0.0731113  0.0002382  306.94   <2e-16 ***
## sigma2 0.0250850  0.0011115   22.57   <2e-16 ***
## C1     0.6983921  0.0018462  378.29   <2e-16 ***
## C2     0.0819704  0.0032625   25.12   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01291 on 611 degrees of freedom
## 
## Algorithm "port", convergence message: both X-convergence and relative convergence (5)

coef.fit <- fit.sum$coefficients[,1]
mu.fit <- coef.fit[1:2]
sigma.fit <- coef.fit[3:4]
C.fit <- coef.fit[5:6]

And now we can plot the fitted results against the original results:

#original
plot(y ~ x, data = gamma.data, type = "l", main = "This is Garbage")

#overall fit
lines(y ~ x, data = dffit, col ="red", cex = 0.2)
legend("topright", lty = c(1,1,1),
       col = c("black", "green", "blue","red"),
       c("Scan", "Monoclonal", "Gamma", "Sum"))

#components of the fit
for(i in 1:2){
  x <- dffit$x
  y <- C.fit[i] *exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2))
  lines(x,y, col = i + 2)
}

And this is garbage. The green curve is supposed to be the monoclonal peak, the blue curve is supposed to be the \(\gamma\) background, and the red curve is their sum, the overall fit. This is a horrible failure.

Subsequently, I tried fixing the locations of \(\mu_1\) and \(\mu_2\) but this also yielded similar nonsensical fitting. So, with a lot of messing around trying different functions like the lognormal distribution, the Bi-Gaussian distribution and the Exponentially Modified Gaussian distribution, and applying various arbitrary weighting functions, and simultaneously fitting the other regions of the SPEP, I concluded that nothing could predictably produce results that represented the clinical reality.

I thought maybe the challenge in obtaining a reasonable fit related to the sloping baseline, so I thought I would try to remove it. I will model the baseline in the most simplistic manner possible: as a sloped line.

Baseline Removal

I will arbitrarily define the tail of the \(\gamma\) region to be those values having \(y \leq 0.02\). Then I will take the first (x,y) point from the \(\gamma\) region and connect it to the tail.

gamma.tail <- filter(gamma.data, y <= 0.02)
baseline.data <- rbind(gamma.data[1,],gamma.tail)
names(baseline.data) <- c("x","y")
baseline.fun <- approxfun(baseline.data)

plot(y~x, data = gamma.data, type = "l")
lines(baseline.data$x,baseline.fun(baseline.data$x), col = "blue")

Now we can define a new dataframe gamma.no.base that has the baseline removed:

gamma.no.base <- data.frame(x = gamma.data$x,
                            y = gamma.data$y - baseline.fun(gamma.data$x))

plot(y~x, data = gamma.data, type = "l")
lines(y ~ x, data = gamma.no.base, lty = 2)
gamma.max <- findPeaks(gamma.no.base$y)[1:2] #rejects a number of extraneous peaks
abline(v = gamma.no.base$x[gamma.max])

The black is the original \(\gamma\) and the dashed has the baseline removed. This becomes an easy fit.

#Estimate the Ci
peak.heights <- gamma.no.base$y[gamma.max]

#Estimate the mu_i
gamma.mu <- gamma.no.base$x[gamma.max] #the same values as before

#Estimate the sigma_i from the FWHM
FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.no.base)
gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355

#Perform the fit
fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) +
                C2*exp(-(x-mean2)**2/(2 * sigma2**2))),
           data = gamma.no.base,
           start = list(mean1 = gamma.mu[1],
                        mean2 = gamma.mu[2],
                        sigma1 = gamma.sigma[1],
                        sigma2 = gamma.sigma[2],
                        C1 = peak.heights[1],
                        C2 = peak.heights[2]),
           algorithm = "port")

#Plot the fit
dffit <- data.frame(x=seq(0, 1 , 0.001))
dffit$y <- predict(fit, newdata=dffit)
fit.sum <- summary(fit)
coef.fit <- fit.sum$coefficients[,1]
mu.fit <- coef.fit[1:2]
sigma.fit <- coef.fit[3:4]
C.fit <- coef.fit[5:6]

plot(y ~ x, data = gamma.no.base, type = "l")
legend("topright", lty = c(1,1,1),
       col = c("black", "green", "blue","red"),
       c("Scan", "Monoclonal", "Gamma", "Sum"))
lines(y ~ x, data = dffit, col ="red", cex = 0.2)
for(i in 1:2){
  x <- dffit$x
  y <- C.fit[i] *exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2))
  lines(x,y, col = i + 2)
}

Lo and behold…something that is not completely insane. The green is the monoclonal, the blue is the \(\gamma\) background, and the red is their sum, that is, the overall fit. A better fit could now be sought with weighting or with a more flexible distribution shape. In any case, the green peak is now easily determined. Since

\[\int_{-\infty}^{\infty} C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big)dx = \sqrt{2\pi}\sigma_1 C_1\]

A.mono <- sqrt(2*pi)*sigma.fit[1]*C.fit[1] %>% unname()
A.mono <- round(A.mono,3)
A.mono

## sigma1
##  0.024

So this peak is 2.4% of the total area. Now, of course, this assumes that nothing under the baseline is attributable to the monoclonal peak and all belongs to normal \(\gamma\)-globulins, which is very unlikely to be true. However, the drop and tangent skimming methods also make assumptions about how the area under the curve contributes to the monoclonal protein. The point is to try to do something that will produce consistent results that can be followed over time. Obviously, if you thought there were three peaks in the \(\gamma\)-region, you'd have to set up your model accordingly.

All about that Base(line)

There are obviously better ways to model the baseline because this approach of a linear baseline is not going to work in situations where, for example, there is a small monoclonal in fast \(\gamma\) dwarfed by normal \(\gamma\)-globulins. That is, like this:

Something curvilinear or piecewise continuous and flexible enough for more circumstances is generally required.
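For example, one simple curvilinear option (a sketch of an assumption on my part, not something from the original analysis) would be to replace the linear interpolation with a smoothing spline fitted through the same anchor points:

#a smoother baseline through the same anchor points used for the linear version
baseline.spline <- smooth.spline(baseline.data$x, baseline.data$y, df = 4)
baseline.fun2 <- function(x) predict(baseline.spline, x)$y

plot(y ~ x, data = gamma.data, type = "l")
lines(gamma.data$x, baseline.fun2(gamma.data$x), col = "blue", lty = 3)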

There is also no guarantee that baseline removal, whatever the approach, is going to be a good solution in other circumstances. Given the diversity of monoclonal peak locations, sizes and shapes, I suspect one would need a few different approaches for different circumstances.

Conclusions
  • The data in the PDFs generated by EP software are processed (probably with splining or similar) followed by the stair-stepping seen above. It would be better to work with raw data from the scanner.

  • Integrating monoclonal peaks under the \(\gamma\) baseline (or \(\beta\)) is unlikely to be a one-size-fits all approach and may require application of a number of strategies to get meaningful results.

    • Baseline removal might be helpful at times.
  • Peak integration will require human adjudication.

  • While most monoclonal peaks show little skewing, better fitting is likely to be obtained with distributions that afford some skewing.

  • MASSFIX may soon make this entire discussion irrelevant.

Parting Thought

On the matter of fitting

In bringing many sons and daughters to glory, it was fitting that God, for whom and through whom everything exists, should make the pioneer of their salvation perfect through what he suffered.

Heb 2:10



Recording of my Talk at the ACS Data Users Conference

Mon, 08/28/2017 - 18:00

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

I recently had the honor of speaking at the 2017 American Community Survey (ACS) Data Users Conference. My talk focused on choroplethr, the suite of R packages I’ve written for mapping open datasets in R. The ACS is one of the main surveys that the US Census Bureau runs, and I was honored to share my work there.

My talk was recorded, and you can view it below:

Attending the conference was part of my effort to bridge the gap between the world of free data analysis tools (i.e. the R community) and the world of government datasets (which I consider the most interesting datasets around).

The conference was great. I have been meaning to write a blog post summarizing what I learned, but I simply haven’t had the time. Instead, I will point you to a recording of all the conference talks.

 

The post Recording of my Talk at the ACS Data Users Conference appeared first on AriLamstein.com.



One-way ANOVA in R

Mon, 08/28/2017 - 14:24

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Suppose, as a business manager, you have the responsibility for testing and comparing the lifetimes of four brands (Apollo, Bridgestone, CEAT and Falken) of automobile tyres. The lifetimes of these sample observations are measured as mileage run in ’000 miles. For each brand of automobile tyre, a sample of 15 observations has been collected. On the basis of this information, you have to take your decision regarding the four brands of automobile tyre. The data is provided in csv format (called tyre.csv).

In order to test and compare the lifetimes of the four brands of tyre, you should apply the one-way ANOVA method, as there is only one factor or criterion (mileage run) to classify the sample observations. The null and alternative hypotheses to be used are: H0, that the mean mileage of the four brands is equal (μ_Apollo = μ_Bridgestone = μ_CEAT = μ_Falken), against H1, that the mean mileage of at least one brand differs.

Data Import and Outlier Checking

Let us first import the data into R and save it as object ‘tyre’. The R codes to do this:

tyre <- read.csv(file.choose(), header = TRUE, sep = ",")
attach(tyre)

Before doing anything, you should check the variable type: in ANOVA, you need a categorical independent variable (here the factor or treatment variable ‘Brands’). In R, you can use the following code:

is.factor(Brands)
[1] TRUE

As the result is ‘TRUE’, it signifies that the variable ‘Brands’ is a categorical variable.
Now everything is set to run the ANOVA model in R. As with other linear models, in ANOVA you should check for the presence of outliers, which can be done with boxplots. As there are four populations to study, you should use a separate boxplot for each population. In R, it is done as:

boxplot(Mileage~Brands, main="Fig.-1: Boxplot of Mileage of Four Brands of Tyre", col= rainbow(4))

Gives this plot:

If you are using advanced graphics, you can use the ‘ggplot2’ package with the following code to get the above boxplot.

library(ggplot2)
ggplot(tyre, aes(Brands, Mileage)) + geom_boxplot(aes(col = Brands)) +
  labs(title = "Boxplot of Mileage of Four Brands of Tyre")

The above picture shows that there is one extreme observation in the CEAT brand. To find out the exact outliers or extreme observations, you can use the following command:

boxplot.stats(Mileage[Brands=="CEAT"])
$stats
[1] 30.42748 33.11079 34.78336 36.12533 36.97277

$n
[1] 15

$conf
[1] 33.55356 36.01316

$out
[1] 41.05

So, the outlier is the observation valued 41.05. The confidence interval is (33.55 – 36.01), and the minimum and maximum values of the sample coming from the ‘CEAT’ population are 30.43 and 41.05, respectively. Considering all these points, we ignore the outlier value 41.05 for the moment and carry out the analysis. If at a later stage we find that this outlier creates problems in the estimation, we will exclude it.
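Should that exclusion ever become necessary, it is a one-liner (a sketch; the cut-off of 40 is simply a convenient value above the rest of the CEAT sample):

# hypothetical exclusion of the single CEAT outlier, only if it later proves
# to distort the estimates; the model can then be refit on the reduced data
tyre.clean <- subset(tyre, !(Brands == "CEAT" & Mileage > 40))
model1.clean <- aov(Mileage ~ Brands, data = tyre.clean)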

Estimation of Model

Let us now fit the model by using the aov() function in R.

model1<- aov(Mileage~Brands)

To get the result of the one-way ANOVA:

summary(model1)
            Df Sum Sq Mean Sq F value   Pr(>F)    
Brands       3  256.3   85.43   17.94 2.78e-08 ***
Residuals   56  266.6    4.76                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From the above results, it is observed that the F-statistic value is 17.94 and it is highly significant, as the corresponding p-value is much less than the level of significance (1% or 0.01). Thus, it is wise to reject the null hypothesis of equal mean mileage across all the tyre brands. In other words, the average mileages of the four tyre brands are not equal.
Now you have to find out which pairs of brands differ. For this you may use Tukey’s HSD test. In R, the code for applying Tukey’s HSD test is:

TukeyHSD(model1, conf.level = 0.99)
  Tukey multiple comparisons of means
    99% family-wise confidence level

Fit: aov(formula = Mileage ~ Brands)

$Brands
                          diff        lwr        upr     p adj
Bridgestone-Apollo -3.01900000 -5.6155816 -0.4224184 0.0020527
CEAT-Apollo        -0.03792661 -2.6345082  2.5586550 0.9999608
Falken-Apollo       2.82553333  0.2289517  5.4221149 0.0043198
CEAT-Bridgestone    2.98107339  0.3844918  5.5776550 0.0023806
Falken-Bridgestone  5.84453333  3.2479517  8.4411149 0.0000000
Falken-CEAT         2.86345994  0.2668783  5.4600415 0.0037424

The TukeyHSD command shows the pair-wise differences in mileage of the four brands of tyre at the 1% level of significance. Here, the “diff” column provides the mean differences. The “lwr” and “upr” columns provide the lower and upper 99% confidence bounds, respectively. Finally, the “p adj” column provides the p-values adjusted for the number of comparisons made. As there are four brands, six possible pair-wise comparisons are obtained. The results show that all pairs of mean mileages are statistically significantly different according to Tukey’s Honestly Significant Differences, except for the pair CEAT-Apollo. More specifically, the pair-wise difference between Bridgestone and Apollo is found to be -3.019, which means that Apollo has higher mileage than Bridgestone, and this is statistically significant. Similarly, you can make the other pair-wise comparisons.
You can also plot the results of Tukey’s HSD comparison by using the plot function as follows:

plot(TukeyHSD(model1, conf.level = 0.99),las=1, col = "red")

The diagram is shown below:

Another way to visualize the results is to plot the means of mileage of the four brands of tyre with the help of the gplots package. By using the plotmeans() function in the gplots package, you can create mean plots for single factors, including confidence intervals.

library(gplots)
plotmeans(Mileage~Brands, main="Fig.-3: Mean Plot with 95% Confidence Interval",
          ylab = "Mileage run ('000 miles)", xlab = "Brands of Tyre")

Gives this plot:

Diagnostic Checking

After estimating the ANOVA model and finding the possible pairs of differences, it is now time to go through the different diagnostic checks with respect to the model assumptions. A single call to the plot function generates the four diagnostic plots (Fig.-4).

par(mfrow=c(2,2))
plot(model1)

Gives this plot:

The Residuals vs. Fitted plot, shown in the upper left of Fig.-4, shows the fitted values plotted against the model residuals. If the residuals follow any particular pattern, such as a diagonal line, there may be other predictors not yet in the model that could improve it. The flat lowess line looks very good, as the single predictor variable (regressor) sufficiently explains the dependent variable.
The Normal Q-Q plot, in the upper right of Fig.-4, shows the quantiles of the standardized residuals plotted against the quantiles you would expect if the data were normally distributed. Since these fall mostly on the straight line, the assumption of normally distributed residuals is met. Since there are only 15 observations for each individual brand of tyre, it is not wise to check the normality assumption group-wise. Instead, the normality of the overall residuals can be checked by a statistical test such as the Shapiro-Wilk test; I’ll show this procedure shortly.
The Scale-Location plot, in the lower left of Fig.-4, shows the square root of the absolute standardized residuals plotted against the fitted, or predicted, values. Since the lowess line that fits this is fairly flat, the spread in the predictions is almost the same across the prediction line, indicating little chance that the assumption of homoscedasticity is violated. This will be further verified by statistical tests. In the case of ANOVA, you check the assumption of homogeneity of variances across the four brands of tyre.
Finally, the Residuals vs. Leverage plot, in the lower right corner, shows a measure of the influence of each point on the overall equation against the standardized residuals. Since no points stand out far from the pack, we can assume that there are no outliers having undue influence on the fit of the model.
Thus, the graphical diagnostics of the model fit apparently show that the assumption requirements of the ANOVA model are fairly well fulfilled. However, the normality and homogeneity assumptions should be confirmed by the appropriate statistical tests.

Regarding the fulfillment of the normality assumption, it has already been discussed that when the number of observations is small, it is wise to test normality on the overall residuals of the model instead of checking it for each separate group. In R, the model residuals are saved as follows:

uhat<-resid(model1)

where the resid function extracts the model residuals, which are saved as the object ‘uhat’.

Now you may apply the Shapiro-Wilk test for normality, where the null hypothesis is that the residuals come from a normally distributed population and the alternative is that they do not.

The test code and results are shown below:

shapiro.test(uhat)

	Shapiro-Wilk normality test

data:  uhat
W = 0.9872, p-value = 0.7826

As the p-value is higher than the level of significance, you cannot reject the null hypothesis, which implies that the samples are taken from normally distributed populations.
Another assumption requirement is the homogeneity of variances across the groups, which can be statistically tested by the Bartlett test and the Levene test. In both tests, the null hypothesis is that the variance is homogeneous across the groups. The tests are conducted as follows:

bartlett.test(Mileage~Brands)

	Bartlett test of homogeneity of variances

data:  Mileage by Brands
Bartlett's K-squared = 2.1496, df = 3, p-value = 0.5419

library(car)
levene.test(Mileage~Brands)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  3  0.6946 0.5592
      56

Both tests confirm the homogeneity of variance across the four brands of tyre, as we cannot reject the null hypothesis of equal variances across the brands.

The Conclusion

The mileages of the four brands of tyre are different. The results show that all pairs of mean mileages are statistically significantly different according to Tukey’s Honestly Significant Differences, except for the pair CEAT-Apollo. More specifically, the pair-wise difference between Bridgestone and Apollo is found to be -3.019, which means that Apollo has higher mileage than Bridgestone, and this is statistically significant.

    Related Post

    1. Cubic and Smoothing Splines in R
    2. Chi-Squared Test – The Purpose, The Math, When and How to Implement?
    3. Missing Value Treatment
    4. R for Publication by Page Piccinini
    5. Assessing significance of slopes in regression models with interaction


    A hidden process behind binary or other categorical outcomes?

    Mon, 08/28/2017 - 02:00

    (This article was first published on ouR data generation, and kindly contributed to R-bloggers)

    I was thinking a lot about proportional-odds cumulative logit models last fall while designing a study to evaluate an intervention’s effect on meat consumption. After a fairly extensive pilot study, we had determined that participants can have quite a difficult time recalling precise quantities of meat consumption, so we were forced to move to a categorical response. (This was somewhat unfortunate, because we would not have continuous or even count outcomes, and as a result, might not be able to pick up small changes in behavior.) We opted for a question that was based on 30-day meat consumption: none, 1-3 times per month, 1 time per week, etc. – six groups in total. The question was how best to evaluate effectiveness of the intervention?

Since the outcome was categorical and ordinal – that is, category 1 implied less meat consumption than category 2, category 2 implied less consumption than category 3, and so on – a model that estimates the cumulative probability of ordinal outcomes seemed like a possible way to proceed. Cumulative logit models estimate a number of parameters that represent the cumulative log-odds of an outcome; the parameters are the log-odds of categories 2 through 6 versus category 1, categories 3 through 6 versus 1 & 2, etc. Maybe not the most intuitive way to interpret the data, but it seems to plausibly fit the data generating process.
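For concreteness, such a model could be fit with MASS::polr; the following is a small sketch on simulated data (my own illustration, not part of this study's analysis):

# hypothetical sketch: a proportional-odds cumulative logit fit on simulated
# six-category data (none of these objects come from the post)
library(MASS)

set.seed(1)
rx <- rbinom(500, 1, 0.5)                    # a made-up treatment indicator
latent <- rlogis(500, location = -0.5 * rx)  # latent logistic outcome
meat6 <- cut(latent, breaks = c(-Inf, -2, -1, 0, 1, 2, Inf), ordered_result = TRUE)

fit <- polr(meat6 ~ rx, method = "logistic", Hess = TRUE)
summary(fit)  # five cumulative intercepts (thresholds) plus the rx log-odds ratio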

    I was concerned about the proportionality assumption of the cumulative logit model, particularly when we started to consider adjusting for baseline characteristics (more on that in the next post). I looked more closely at the data generating assumptions of the cumulative logit model, which are quite frequently framed in the context of a continuous latent measure that follows a logistic distribution. I thought I’d describe that data generating process here to give an alternative view of discrete data models.

    I know I have been describing a context that includes an outcome with multiple categories, but in this post I will focus on regular logistic regression with a binary outcome. This will hopefully allow me to establish the idea of a latent threshold. I think it will be useful to explain this simpler case first before moving on to the more involved case of an ordinal response variable, which I plan to tackle in the near future.

    A latent continuous process underlies the observed binary process

    For an event with a binary outcome (true or false, A or B, 0 or 1), the observed outcome may, at least in some cases, be conceived as the manifestation of an unseen, latent continuous outcome. In this conception, the observed (binary) outcome merely reflects whether or not the unseen continuous outcome has exceeded a specified threshold. Think of this threshold as a tipping point, above which the observable characteristic takes on one value (say false), below which it takes on a second value (say true).

    The logistic distribution

    Logistic regression models are used to estimate relationships of individual characteristics with categorical outcomes. The name of this regression model arises from the logistic distribution, which is a symmetrical continuous distribution. In a latent (or hidden) variable framework, the underlying, unobserved continuous measure is drawn from this logistic distribution. More specifically, the standard logistic distribution is typically assumed, with a location parameter of 0, and a scale parameter of 1. (The mean of this distribution is 0 and variance is approximately 3.29.)

    Here is a plot of a logistic pdf, shown in relation to a standard normal pdf (with mean 0 and variance 1):

library(ggplot2)
library(data.table)

my_theme <- function() {
  theme(panel.background = element_rect(fill = "grey90"),
        panel.grid = element_blank(),
        axis.ticks = element_line(colour = "black"),
        panel.spacing = unit(0.25, "lines"),
        plot.title = element_text(size = 12, vjust = 0.5, hjust = 0),
        panel.border = element_rect(fill = NA, colour = "gray90"))
}

x <- seq(-6, 6, length = 1000)
yNorm <- dnorm(x, 0, 1)
yLogis <- dlogis(x, location = 0, scale = 1)

dt <- data.table(x, yNorm, yLogis)
dtm <- melt(dt, id.vars = "x", value.name = "Density")

ggplot(data = dtm) +
  geom_line(aes(x = x, y = Density, color = variable)) +
  geom_hline(yintercept = 0, color = "grey50") +
  my_theme() +
  scale_color_manual(values = c("red", "black"), labels = c("Normal", "Logistic")) +
  theme(legend.position = c(0.8, 0.6), legend.title = element_blank())

    The threshold defines the probability

    Below, I have plotted the standardized logistic pdf with a threshold that defines a tipping point for a particular Group A. In this case the threshold is 1.5, so for everyone with a unseen value of \(X < 1.5\), the observed binary outcome \(Y\) will be 1. For those where \(X \geq 1.5\), the observed binary outcome \(Y\) will be 0:

xGrpA <- 1.5

ggplot(data = dtm[variable == "yLogis"], aes(x = x, y = Density)) +
  geom_line() +
  geom_segment(x = xGrpA, y = 0, xend = xGrpA, yend = dlogis(xGrpA), lty = 2) +
  geom_area(mapping = aes(ifelse(x < xGrpA, x, xGrpA)), fill = "white") +
  geom_hline(yintercept = 0, color = "grey50") +
  ylim(0, 0.3) +
  my_theme() +
  scale_x_continuous(breaks = c(-6, -3, 0, xGrpA, 3, 6))

Since we have plotted a probability density (pdf), the area under the entire curve is equal to 1. We are interested in the binary outcome \(Y\) defined by the threshold, so we can say that the area below the curve to the left of the threshold (filled in white) represents \(P(Y = 1|Group=A)\). The remaining area represents \(P(Y = 0|Group=A)\). The area to the left of the threshold can be calculated in R using the plogis function:

(p_A <- plogis(xGrpA))
## [1] 0.8175745

    Here is the plot for a second group that has a threshold of 2.2:
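(The code for this second figure is not included in this excerpt; a sketch mirroring the Group A chunk, which also defines the threshold used below, would be:)

# threshold for Group B; mirrors the Group A plotting code above
xGrpB <- 2.2

ggplot(data = dtm[variable == "yLogis"], aes(x = x, y = Density)) +
  geom_line() +
  geom_segment(x = xGrpB, y = 0, xend = xGrpB, yend = dlogis(xGrpB), lty = 2) +
  geom_area(mapping = aes(ifelse(x < xGrpB, x, xGrpB)), fill = "white") +
  geom_hline(yintercept = 0, color = "grey50") +
  ylim(0, 0.3) +
  my_theme() +
  scale_x_continuous(breaks = c(-6, -3, 0, xGrpB, 3, 6))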

    The area under the curve to the left of the threshold is \(P(X < 2.2)\), which is also \(P(Y = 1 | Group=B)\):

(p_B <- plogis(xGrpB))
## [1] 0.9002495

Log-odds and probability

    In logistic regression, we are actually estimating the log-odds of an outcome, which can be written as

    \[log \left[ \frac{P(Y=1)}{P(Y=0)} \right]\].

    In the case of Group A, log-odds of Y being equal to 1 is

(logodds_A <- log(p_A / (1 - p_A) ))
## [1] 1.5

    And for Group B,

(logodds_B <- log(p_B / (1 - p_B) ))
## [1] 2.2

    As you may have noticed, we’ve recovered the thresholds that we used to define the probabilities for the two groups. The threshold is actually the log-odds for a particular group.

    Logistic regression

    The logistic regression model that estimates the log-odds for each group can be written as

    \[log \left[ \frac{P(Y=1)}{P(Y=0)} \right] = B_0 + B_1 * I(Grp = B) \quad ,\]

where \(B_0\) represents the threshold for Group A and \(B_1\) represents the shift in the threshold for Group B. In our example, the threshold for Group B is 0.7 units (2.2 – 1.5) to the right of the threshold for Group A. If we generate data for both groups, our estimates for \(B_0\) and \(B_1\) should be close to 1.5 and 0.7, respectively.

    The process in action

    To put this all together in a simulated data generating process, we can see the direct link with the logistic distribution, the binary outcomes, and an interpretation of estimates from a logistic model. The only stochastic part of this simulation is the generation of continuous outcomes from a logistic distribution. Everything else follows from the pre-defined group assignments and the group-specific thresholds:

n = 5000
set.seed(999)

# Stochastic step
xlatent <- rlogis(n, location = 0, scale = 1)

# Deterministic part
grp <- rep(c("A","B"), each = n / 2)

dt <- data.table(id = 1:n, grp, xlatent, y = 0)
dt[grp == "A" & xlatent <= xGrpA, y := 1]
dt[grp == "B" & xlatent <= xGrpB, y := 1]

# Look at the data
dt
##         id grp    xlatent y
##    1:    1   A -0.4512173 1
##    2:    2   A  0.3353507 1
##    3:    3   A -2.2579527 1
##    4:    4   A  1.7553890 0
##    5:    5   A  1.3054260 1
##   ---
## 4996: 4996   B -0.2574943 1
## 4997: 4997   B -0.9928283 1
## 4998: 4998   B -0.7297179 1
## 4999: 4999   B -1.6430344 1
## 5000: 5000   B  3.1379593 0

The probability of a “successful” outcome (i.e., \(P(Y = 1)\)) for each group based on this data generating process is pretty much equal to the areas under the respective densities to the left of the thresholds used to define success:

dt[, round(mean(y), 2), keyby = grp]
##    grp   V1
## 1:   A 0.82
## 2:   B 0.90

    Now let’s estimate a logistic regression model:

library(broom)

glmfit <- glm(y ~ grp, data = dt, family = "binomial")
tidy(glmfit, quick = TRUE)
##          term  estimate
## 1 (Intercept) 1.5217770
## 2        grpB 0.6888526

    The estimates from the model recover the logistic distribution thresholds for each group. The Group A threshold is estimated to be 1.52 (the intercept) and the Group B threshold is estimated to be 2.21 (intercept + grpB parameter). These estimates can be interpreted as the log-odds of success for each group, but also as the threshold for the underlying continuous data generating process that determines the binary outcome \(Y\). And we can interpret the parameter for grpB in the traditional way as the log-odds ratio comparing the log-odds of success for Group B with the log-odds of success for Group A, or as the shift in the logistic threshold for Group A to the logistic threshold for Group B.

    In the next week or so, I will extend this to a discussion of an ordinal categorical outcome. I think the idea of shifting the thresholds underscores the proportionality assumption I alluded to earlier …



    Pricing Optimization: How to find the price that maximizes your profit

    Sun, 08/27/2017 - 18:02

    (This article was first published on R – insightR, and kindly contributed to R-bloggers)

    By Yuri Fonseca

    Basic idea

In this post we will briefly discuss pricing optimization. The main idea behind this problem is the following question: as the manager of a company/store, how much should I charge in order to maximize my revenue or profit?

Obviously, the answer isn’t “as high as possible”. If you charge one hundred dollars for a candy, probably only one or two people will buy it. Although the profit per unit is very high, you probably won’t even cover your fixed costs. Charging a very small price is not the best call either.

    Before showing an example for this problem, let us build some simple formulas.

Imagine a monopolist selling a specific product with demand curve \(q(p)\), where \(q\) is the quantity sold at a given price \(p\). To simplify things, let’s suppose that \(q(p)\) is a linear function:

\(q(p) = \alpha p + \beta \qquad (1)\)

The total revenue will be given by:

\(R(p) = p \cdot q(p) = \alpha p^2 + \beta p\)

And the total profit will be given by:

\(\Pi(p) = (p - c)\,q(p) = (p - c)(\alpha p + \beta)\)

where \(c\) is the unit cost of the product. Adding fixed costs to the profit equation does not change the optimal price policy, so we will assume they are zero.

Next, we differentiate the expressions for \(R(p)\) and \(\Pi(p)\) to find the first-order conditions, which give the optimal policy under the hypothesis of a linear demand curve. \(\alpha\) is expected to be negative (demand decreases when prices increase), so \(R(p)\) and \(\Pi(p)\) are concave functions of \(p\) and the critical point will be a maximum. Therefore, the optimal price for revenue is given by:

\(p^{*}_{revenue} = -\frac{\beta}{2\alpha} \qquad (2)\)

And for profit:

\(p^{*}_{profit} = \frac{\alpha c - \beta}{2\alpha} \qquad (3)\)

Note that some people write the linear demand curve as \(q(p) = \beta - \alpha p\); in that case \(\alpha\) should be positive, and the signs in equation 2 and equation 3 must change accordingly. Moreover, it is interesting to note that the price that maximizes profit is always bigger than the one that maximizes total revenue, because their difference, \(c/2\), is always positive.

If taxes are charged only on profit, the optimal price policy remains the same. However, countries like Brazil usually charge substantial taxes on total revenue. In this case the policy that maximizes revenue does not change, but the policy that maximizes profit does: with a tax rate \(t\) on revenue, the first-order condition yields

\(p^{*}_{profit} = \frac{\alpha c - (1 - t)\beta}{2(1 - t)\alpha}\)
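To spell out the first-order conditions behind equations 2 and 3 (a short derivation added here for clarity; it follows directly from the linear demand in equation 1):

\(\frac{dR}{dp} = 2\alpha p + \beta = 0 \;\Rightarrow\; p^{*}_{revenue} = -\frac{\beta}{2\alpha}\)

\(\frac{d\Pi}{dp} = 2\alpha p + \beta - \alpha c = 0 \;\Rightarrow\; p^{*}_{profit} = \frac{\alpha c - \beta}{2\alpha}\)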

    Example and implementations:

As an example of how to proceed with the estimation of the optimal price, let’s generate a linear demand curve with \(\alpha = -40\) and \(\beta = 500\) for the daily sales of a product:

library(ggplot2)

# example of linear demand curve (first equation)
demand = function(p, alpha = -40, beta = 500, sd = 10) {
  error = rnorm(length(p), sd = sd)
  q = p*alpha + beta + error
  return(q)
}

set.seed(100)
prices = seq(from = 5, to = 10, by = 0.1)
q = demand(prices)
data = data.frame('prices' = prices, 'quantity' = q)

ggplot(data, aes(prices, quantity)) + geom_point(shape = 1) +
  geom_smooth(method = 'lm') + ggtitle('Demand Curve')

    Imagine a company that has been selling the product which follows the demand curve above for a while (one year changing prices daily), testing some prices over time. The following time-series is what we should expect for the historical revenue, profit and cost of the company:

set.seed(10)
hist.prices = rnorm(252, mean = 6, sd = .5)           # random prices defined by the company
hist.demand = demand(hist.prices)                     # demand curve defined in the chunk above
hist.revenue = hist.prices*hist.demand                # from the revenue equation
unity.cost = 4                                        # production cost per unit
hist.cost = unity.cost*hist.demand
hist.profit = (hist.prices - unity.cost)*hist.demand  # from the profit equation

data = data.frame('Period' = seq(1, 252), 'Daily.Prices' = hist.prices,
                  'Daily.Demand' = hist.demand, 'Daily.Revenue' = hist.revenue,
                  'Daily.Cost' = hist.cost, 'Daily.Profit' = hist.profit)

ggplot(data, aes(Period, Daily.Prices)) + geom_line(color = 4) +
  ggtitle('Historical Prices used for explotation')

ggplot(data, aes(Period, Daily.Revenue, colour = 'Revenue')) + geom_line() +
  geom_line(aes(Period, Daily.Profit, colour = 'Profit')) +
  geom_line(aes(Period, Daily.Cost, colour = 'Cost')) +
  labs(title = 'Historical Performance', colour = '')

    We can recover the demand curve using the historical data (that is how it is done in the real world).

library(stargazer)

model.fit = lm(hist.demand ~ hist.prices)           # linear model for demand
stargazer(model.fit, type = 'html', header = FALSE) # output

Dependent variable: hist.demand

  hist.prices           -38.822*** (1.429)
  Constant              493.588*** (8.542)
  Observations          252
  R2                    0.747
  Adjusted R2           0.746
  Residual Std. Error   10.731 (df = 250)
  F Statistic           738.143*** (df = 1; 250)

Note: *p<0.1; **p<0.05; ***p<0.01

    And now we need to apply equation 2 and equation 3.

# estimated parameters
beta = model.fit$coefficients[1]
alpha = model.fit$coefficients[2]

p.revenue = -beta/(2*alpha)                     # estimated price for revenue
p.profit = (alpha*unity.cost - beta)/(2*alpha)  # estimated price for profit
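To see the actual numbers (a small addition, reusing the objects estimated above), the fitted optimal prices can be printed and compared with the optima implied by the true parameters, \(-500/(2 \times -40) = 6.25\) for revenue and \((-40 \times 4 - 500)/(2 \times -40) = 8.25\) for profit:

# estimated optimal prices vs. the values implied by the true parameters
c(revenue = as.numeric(p.revenue), profit = as.numeric(p.profit))
# true optima: -500/(2*-40) = 6.25 and (-40*4 - 500)/(2*-40) = 8.25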

    The final plot with the estimated prices:

true.revenue = function(p) p*(-40*p + 500)                # revenue with true parameters (chunk demand)
true.profit = function(p) (p - unity.cost)*(-40*p + 500)  # profit with true parameters

# estimated curves
estimated.revenue = function(p) p*(model.fit$coefficients[2]*p + model.fit$coefficients[1])
estimated.profit = function(p) (p - unity.cost)*(model.fit$coefficients[2]*p + model.fit$coefficients[1])

opt.revenue = true.revenue(p.revenue)  # revenue with estimated optimum price
opt.profit = true.profit(p.profit)     # profit with estimated optimum price

# plot
df = data.frame(x1 = p.revenue, x2 = p.profit, y1 = opt.revenue, y2 = opt.profit, y3 = 0)

ggplot(data = data.frame(Price = 0)) +
  stat_function(fun = true.revenue, mapping = aes(x = Price, color = 'True Revenue')) +
  stat_function(fun = true.profit, mapping = aes(x = Price, color = 'True Profit')) +
  stat_function(fun = estimated.revenue, mapping = aes(x = Price, color = 'Estimated Revenue')) +
  stat_function(fun = estimated.profit, mapping = aes(x = Price, color = 'Estimated Profit')) +
  scale_x_continuous(limits = c(4, 11)) +
  labs(title = 'True curves without noise') + ylab('Results') +
  scale_color_manual(name = "", values = c("True Revenue" = 2, "True Profit" = 3,
                                           "Estimated Revenue" = 4, "Estimated Profit" = 6)) +
  geom_segment(aes(x = x1, y = y1, xend = x1, yend = y3), data = df) +
  geom_segment(aes(x = x2, y = y2, xend = x2, yend = y3), data = df)

    Final observations

As you can see, the estimated revenue and profit curves are quite similar to the true (noise-free) ones, and the expected revenue under our estimated optimal policies looks very promising. Although the linear and monopolist assumptions look quite restrictive, this might not be the case; see Besbes and Zeevi (2015) and Cooper et al. (2015).

If one expects a large variance in the estimated demand parameters, it might be useful to simulate \(\hat{\alpha}\) and \(\hat{\beta}\), and then the optimal price, keeping Jensen’s inequality in mind.

    References

    Phillips, Robert Lewis. Pricing and revenue optimization. Stanford University Press, 2005.

    Besbes, Omar, and Assaf Zeevi. “On the (surprising) sufficiency of linear models for dynamic pricing with demand learning.” Management Science 61.4 (2015): 723-739.

    Cooper, William L., Tito Homem-de-Mello, and Anton J. Kleywegt. “Learning and pricing with models that do not explicitly incorporate competition.” Operations research 63.1 (2015): 86-103.

    Talluri, Kalyan T., and Garrett J. Van Ryzin. The theory and practice of revenue management. Vol. 68. Springer Science & Business Media, 2006.


    Marginal effects for negative binomial mixed effects models (glmer.nb and glmmTMB) #rstats

    Sun, 08/27/2017 - 12:17

    (This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

Here’s a small preview of forthcoming features in the ggeffects package, which are already available in the GitHub version: for marginal effects from models fitted with glmmTMB(), glmer() or glmer.nb(), confidence intervals are now also computed.

    If you want to test these features, simply install the package from GitHub:

library(devtools)
devtools::install_github("strengejacke/ggeffects")

    Here are three examples:

library(glmmTMB)
library(lme4)
library(ggeffects)

data(Owls)

m1 <- glmmTMB(SiblingNegotiation ~ SexParent + ArrivalTime + (1 | Nest),
              data = Owls, family = nbinom1)

m2 <- glmmTMB(SiblingNegotiation ~ SexParent + ArrivalTime + (1 | Nest),
              data = Owls, family = nbinom2)

m3 <- glmer.nb(SiblingNegotiation ~ SexParent + ArrivalTime + (1 | Nest),
               data = Owls)

m4 <- glmmTMB(
  SiblingNegotiation ~ FoodTreatment + ArrivalTime + SexParent + (1 | Nest),
  data = Owls,
  ziformula = ~ 1,
  family = list(family = "truncated_poisson", link = "log")
)

pr1 <- ggpredict(m1, c("ArrivalTime", "SexParent"))
plot(pr1)

pr2 <- ggpredict(m2, c("ArrivalTime", "SexParent"))
plot(pr2)

pr3 <- ggpredict(m3, c("ArrivalTime", "SexParent"))
plot(pr3)

pr4 <- ggpredict(
  m4,
  c("FoodTreatment", "ArrivalTime [21,24,30]", "SexParent")
)
plot(pr4)

The code to calculate confidence intervals is based on the FAQ provided by Ben Bolker. Here is another example that reproduces this plot (note: since age is numeric, ggpredict() produces a straight line rather than points with error bars).

library(nlme)
data(Orthodont)

m5 <- lmer(distance ~ age * Sex + (age | Subject), data = Orthodont)

pr5 <- ggpredict(m5, c("age", "Sex"))
plot(pr5)

    Tagged: data visualization, ggplot, R, rstats, sjPlot


    WoE and IV Variable Screening with {Information} in R

    Sun, 08/27/2017 - 11:42

    (This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)

    A short note on information-theoretic variable screening in R w. {information}.

    Variable screening comes as an important step in the
    contemporary EDA for predictive modeling: what can we tell about the
    nature of the relationships between a set of predictors and the
    dependent before entering the modeling phase? Can we infer something about the predictive power of the independent variables before we
    start rolling them into a predictive model? In this blog post I will
    discuss two information-theoretic measures that are common in variable
    screening for binary classification and regression models in the credit
    risk arena (the fact being completely unrelated to the simple truth that
    they could do you some good in any other application of predictive
    modeling as well). I will first introduce the Weight of Evidence (WoE) and Information Value
    (IV) of a variable in respect to a binary outcome. Then I will
    illustrate their computation (it’s fairly easy, in fact) from the
    {Information} package in R.

    Weight of Evidence

Take the common Bayesian hypothesis test (or a Bayes factor, if you prefer):

\(\frac{P(D \mid M_1)}{P(D \mid M_2)}\)

and assume your models M1, M2 of the world* are simply the two possible states of a binary variable Y, while the data are given by discrete distributions of some predictor X (or X stands for a binned continuous distribution); for every category j in X, j = 1, 2, ..., n, take the log:

\(WoE_j = \log \frac{P(x_j \mid M_1)}{P(x_j \mid M_2)}\)

and you will get a simple measure of evidence in favor of M1 against M2 that Good has described as the Weight of Evidence (WoE). In theory, any monotonic transformation of the odds would do, but the logarithm brings the intuitive advantage of a negative WoE when the odds are less than one and a positive one when they are higher than one. In the setting considered here, with a binary dependent Y and a discrete (or binned continuous) predictor X, we are simply inspecting the conditional distribution of X given Y:

\(WoE_j = \log \frac{P(x_j \mid Y = 1)}{P(x_j \mid Y = 0)} = \log \left( \frac{f(x_j, Y = 1)/f(Y = 1)}{f(x_j, Y = 0)/f(Y = 0)} \right)\)

    where f denotes counts.

    Let’s illustrate the computation of WoE in this setting for a variable from a well-known dataset**. We have one categorical, binary dependent:

dataSet <- read.table('bank-additional-full.csv',
                      header = T,
                      strip.white = F,
                      sep = ";")

str(dataSet)
table(dataSet$y)

dataSet$y <- recode(dataSet$y,
                    'yes' = 1,
                    'no' = 0)

    and we want to compute the WoE for, say, the age variable. Here it goes:

# - compute WOE for: dataSet$age
bins <- 10

q <- quantile(dataSet$age,
              probs = c(1:(bins - 1)/bins),
              na.rm = TRUE,
              type = 3)

cuts <- unique(q)

aggAge <- table(findInterval(dataSet$age,
                             vec = cuts,
                             rightmost.closed = FALSE),
                dataSet$y)

aggAge <- as.data.frame.matrix(aggAge)
aggAge$N <- rowSums(aggAge)
aggAge$WOE <- log((aggAge$`1`*sum(aggAge$`0`))/(aggAge$`0`*sum(aggAge$`1`)))

    In the previous example I have used exactly the approach to bin X (age, in this case) that is used in the R package {Information} whose application I want to illustrate later. The table()
    call provides for the conditional distributions like the ones shown in
    the table above. The computation of WoE is then straightforward – as
    exemplified in the last line. However, you want to spare yourself from
    computing the WoE in this way for many variables in the dataset, and
that’s where {Information} in R comes in handy; for the same dataset:

# - Information value: all variables
infoTables <- create_infotables(data = dataSet,
                                y = "y",
                                bins = 10,
                                parallel = T)

# - WOE table:
infoTables$Tables$age$WOE

    with the respective data frames in infoTables$Tables standing for the variables in the dataset.

    Information Value

A straightforward definition of the Information Value (IV) of a variable is provided in the {Information} package vignette:

\(IV = \sum_{j=1}^{n} \big( P(x_j \mid Y = 1) - P(x_j \mid Y = 0) \big) \cdot WoE_j\)

In effect, this means that we are summing across the individual WoE values (i.e. for each bin j of X) and weighting them by the respective differences between \(P(x_j \mid Y = 1)\) and \(P(x_j \mid Y = 0)\). The IV of a variable measures its predictive power, and variables with IV < .05 are generally considered to have low predictive power.
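As an illustration (a small sketch added here, reusing the aggAge table built for the age variable above; the IV.contrib column name is just for exposition), the IV can be assembled directly from the binned counts:

# per-bin contributions: (P(x_j|Y=1) - P(x_j|Y=0)) * WoE_j
aggAge$IV.contrib <- (aggAge$`1`/sum(aggAge$`1`) - aggAge$`0`/sum(aggAge$`0`)) * aggAge$WOE

# information value of age = sum of the per-bin contributions
sum(aggAge$IV.contrib)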

    Using {Information} in R, for the dataset under consideration:

# - Information value: all variables
infoTables <- create_infotables(data = dataSet,
                                y = "y",
                                bins = 10,
                                parallel = T)

# - Plot IV
plotFrame <- infoTables$Summary[order(-infoTables$Summary$IV), ]
plotFrame$Variable <- factor(plotFrame$Variable,
                             levels = plotFrame$Variable[order(-plotFrame$IV)])

ggplot(plotFrame, aes(x = Variable, y = IV)) +
  geom_bar(width = .35, stat = "identity", color = "darkblue", fill = "white") +
  ggtitle("Information Value") +
  theme_bw() +
  theme(plot.title = element_text(size = 10)) +
  theme(axis.text.x = element_text(angle = 90))

    You may have noted the usage of parallel = T in the create_infotables()
    call; the {Information} package will try to use all available cores to
    speed up the computations by default. Besides the basic package
    functionality that I have illustrated, the package provides a natural
    way of dealing with uplift models, where the computation of the IVs for
    the control vs. treatment designs is nicely automated. Cross-validation
    procedures are also built-in.

However, now that we know that we have a nice, working package for WoE and IV estimation in R, let’s restrain ourselves from using it to perform automatic feature selection for models like binary logistic regression. While the information-theoretic measures discussed here truly assess the predictive power of a predictor in binary classification, building a model that encompasses multiple terms is another story. Do not get disappointed if you find that the AICs for the full models are still lower than those for the nested models obtained by feature selection based on the IV values; although they can provide useful guidelines, WoE and IV are not even meant to be used that way (I’ve tried… once with the dataset used in the previous examples, and then with the two {Information} built-in datasets; not too much of a success, as you may have guessed).

    References

    * For parametric models, you would need to integrate over the full
    parameter space, of course; taking the MLEs would result in obtaining
    the standard LR test.

    ** The dataset is considered in S. Moro, P. Cortez and P. Rita
    (2014). A Data-Driven Approach to Predict the Success of Bank
    Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014. I
    have obtained it from: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
    (N.B. https://archive.ics.uci.edu/ml/machine-learning-databases/00222/,
    file: bank-additional.zip); a nice description of the dataset is found
    at:
    http://www2.1010data.com/documentationcenter/beta/Tutorials/MachineLearningExamples/BankMarketingDataSet.html)

    Goran S. Milovanović, Phd

    Data Science Consultant, SmartCat


    NYC Citi Bike Visualization – A Hint of Future Transportation

    Sun, 08/27/2017 - 07:18

    (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

    Introduction

Like other sharing systems, such as Airbnb for housing and Uber for cars, Citi Bike is a network of bicycle rental stations intended for point-to-point transportation.

Citi Bike is New York City’s largest bike sharing system. It’s a convenient solution for trips that are too far to walk but too short for a taxi or the subway. For commuters, the bike sharing system complements all the other transportation options available in the area.

Currently, there are about a million trips per month by Citi Bike riders. The system has 10,000 bicycles and 610 stations. By the end of 2017, the Citi Bike system will total 12,000 bikes and 750 stations. The grey area is the current service area; the yellow and blue areas represent the sections to be covered by the end of 2017.

     

     

    The Optimization Questions

Any Citi Bike client has come up against two frustrating scenarios: the empty dock at the start of a trip and the full dock at the end. Researchers call this the “rebalancing” problem, part of a broader set of “fleet optimization” questions. This problem has attracted the attention of data scientists, who have developed complex methodologies to optimize the available bikes and open docks.

Below, I attempt to use a Shiny visualization app to provide a hint for three questions:

1. Fleet routing pattern detection: what are the most popular routes during peak and off-peak hours? What is the direction of the flow?
2. Station balance prediction: what is the average volume of imbalance in the distributed system? What is the station-level inflow and outflow? Is it sensitive to the time of day? How does it look as a time series?
3. Reducing rebalancing demand: what are the riders’ activities like? Is it possible to rebalance through pricing schemes?

    The visualization app is intended to provide a way to explore different comparative measures at the route, station and system levels with spatial attributes and time series.

     

    The Data

    Citi published Citi Bike Trip Histories – downloadable files of Citi Bike trip data. I used the Citi Bike data for the month of March 2017 (approximately 1 million observations). The data includes:

    • Trip Duration (in seconds)
    • Start Time and Date
    • Stop Time and Date
    • Start Station Name
    • End Station Name
    • Station ID
    • Station Lat/Long
    • Bike ID
    • User Type (Customer = 24-hour pass or 7-day pass user; Subscriber = Annual Member)
    • Gender (Zero=unknown; 1=male; 2=female)
    • Year of Birth

    Before moving ahead with building the app, I was interested in exploring the data and identifying patterns of rebalancing.

    Bar Chart 1 – Time wise imbalance (Peak/Off peak)

    Bar Chart 2 -Location wise imbalance (Top 10 popular Station)

     

    Insights

On the interactive map, each dot represents a station. The visualization also provides options to identify popular routes by selecting a date and hour range. The top popular routes are marked in orange as lines between the spatial points. The direction of each route is indicated by moving from the redder dot towards the greener dot.

    Citi Bike Daily Migration

    The Patterns

Interesting patterns emerge. The most popular routes on the west side run through Central Park and Chelsea Pier. Routes centered on Grand Central and Penn Station are also among the hottest. Outside Manhattan, there are hubs in Queens and Brooklyn that initiate many popular routes. Riders bike more along the east-west streets than along the north-south avenues. That makes sense in light of the fact that there are more uptown and downtown subways than crosstown ones, and riders do use Citi Bike as an alternative transportation option.

While hot pick-up stations run short of bikes, hot drop-off stations run short of docks. The red dots are where the outflow of bikes exceeds the inflow; the green dots are where the inflow exceeds the outflow. In other words, the green dots are the hot spots to pick up a bike (more inbound bikes) and the red ones (more empty docks) are where to drop them off. The more extreme the color of a dot, the higher the percentage change of flows at that station. The size and transparency of a dot reflect the total volume of inflow and outflow: the more prominent the dot, the hotter the station.

What causes the balancing problem? The map-based interactive app provides insight for predicting demand. The information displayed is the accumulated hourly variables based on the dates selected. Detailed statistics are also available for each station by zooming in.

New York has a classic peak commuter flow pattern. Most commuters ride bikes towards the center of town from its edges in the morning. At the end of the day, they ride the reverse way back to the edges where they live, especially in the edge sections with fewer public transportation options.

     

    The Rider’s Activities

What about the riders’ activities? Is there any pattern involved? The app provides insights into rider behavior that can help reduce rebalancing demand and suggest potential solutions.

Below, each bubble represents an age and gender group. The age is shown as the number on each bubble. A negative correlation is observed between age and speed: the younger the rider, the faster he or she rides. Similarly, the groups in their thirties show similar miles per trip. The performance of the female and male groups also differs: the male groups, in blue, ride at a higher speed than the female groups, in red.

     

    The Balancing Solutions

Are there solutions for rebalancing that cut costs and improve efficiency, instead of manually moving bikes via trucks, bike-drawn trailers and sprinter vans from full stations to empty stations? Moving bikes manually takes a crew, traveling in pairs, about 45 minutes to load a truck.

Citi Bike sought a way to get the riders to move the bikes themselves. In May it started the Bike Angel pilot. The reverse commuter is the perfect target member of the Angel program. What is so appealing about the program is that the bike sharing system could self-distribute its fleet with the proper incentives. A member can easily earn a $10 Amazon gift card with a few reverse trips. As a result, the demand for manually moving bikes around would decrease.

     

    The Conclusions

The visualization app provides the real-time status of the fleet: popular routes, inbound/outbound flows, net change, station time series, hot spot analysis and rider activities. It supports a self-distributed fleet by establishing a baseline for identifying “healthy” rebalancing within the bike share system. It provides a hint of future transportation solutions.

     

    The App

    The interactive app is available on Shiny.io.

    The post NYC Citi Bike Visualization – A Hint of Future Transportation appeared first on NYC Data Science Academy Blog.


    Why I find tidyeval useful

    Sun, 08/27/2017 - 02:00

    (This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

First things first: maybe you shouldn’t care about tidyeval. Maybe you don’t need it. If you exclusively work interactively, I don’t think that learning about tidyeval is important. I can only speak for myself, and explain to you why I personally find tidyeval useful.
    I wanted to write this blog post after reading this twitter thread and specifically this question.
    Mara Averick then wrote this blogpost linking to 6 other blog posts that give some tidyeval examples.


    Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-7)

    Sat, 08/26/2017 - 18:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

Statistics is often taught in school by and for people who like mathematics. As a consequence, in those classes the emphasis is put on learning equations, solving calculus problems and creating mathematical models instead of building an intuition for probabilistic problems. But, if you are reading this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistics problems by writing simple R queries and how to think in probabilistic terms.

Until now, we used random variable simulation and bootstrapping to test hypotheses and compute statistics of a single sample. In today’s set, we’ll learn how to use permutation to test hypotheses about two different samples and how to adapt bootstrapping to this situation.

    Answers to the exercises are available here.

    Exercise 1

    1. Generate 500 points from a beta distribution of parameter a=2 and b=1.5, then store the result in a vector named beta1.
    2. Generate 500 points from the same distribution and store those points in a vector named beta2.
    3. Concatenate both vectors to create a vector called beta.data.
    4. Plot the ecdf of beta1 and beta2.
    5. Sample 500 points from beta.data and plot the ecdf of this sample. Repeat this process 5 times.
6. Do all those samples share the same distribution and, if the answer is yes, what is that distribution?

    Exercise 2
When we test a hypothesis, we suppose that this hypothesis is true, we simulate what would happen if that were the case, and if our initial observation happens less than α percent of the time we reject the hypothesis. Now, from the first exercise, we know that if two samples share the same distribution, any sample drawn from their pooled observations will follow that same distribution. In particular, if we pool the observations from a sample of size n1 and a sample of size n2, shuffle them and draw two new samples of sizes n1 and n2, they should all have a similar CDF. We can use this fact to test the hypothesis that two samples have the same distribution. This process is called a permutation test.
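As a rough sketch of the shuffling step described above (using made-up vectors here rather than the exercise dataset, so as not to give the answers away):

set.seed(42)
x1 <- rnorm(30)            # first sample, size n1 = 30
x2 <- rnorm(40, mean = 1)  # second sample, size n2 = 40

pooled <- sample(c(x1, x2))        # shuffle the pooled observations
perm1 <- pooled[1:length(x1)]      # new "sample 1", same size as x1
perm2 <- pooled[-(1:length(x1))]   # new "sample 2", same size as x2

plot(ecdf(perm1))
lines(ecdf(perm2), col = "red")    # under H0 the two ECDFs should look alike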

Load this dataset, where each column represents a variable, and we want to know if they are identically distributed. Each exercise below follows a step of a permutation test.

    1. What are the null and alternative hypotheses for this test?
    2. Concatenate both samples into a new vector called data.ex.2.
3. Write a function that takes data.ex.2 and the sizes of both samples as arguments, creates a temporary vector by permuting data.ex.2 and returns two new samples. The first sample has the same number of observations as the first column of the dataset; the second is made from the rest of the observations. Name this function permutation.sample (we will use it in the next exercise). Why do we want the function to return samples of those sizes?
    4. Plot the ECDF of both initial variables in black.
5. Use the function permutation.sample 100 times to generate permuted samples, then compute the ECDFs of those samples and add their curves to the previous plot. Use the color red for the first batch of samples and green for the second batch.
    6. By looking at the plot, can you tell if the null hypothesis is true?

    Exercise 3
A business analyst thinks that the daily returns of Apple stock follow a normal distribution with a mean of 0 and a standard deviation of 0.1. Use this dataset of the daily returns of the stock for the last 10 years to test this hypothesis.

    Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to:

• Work with different binomial and logistic regression techniques
    • Know how to compare regression models and choose the right fit
    • And much more

    Exercise 4
Permutation tests can help us verify whether two samples come from the same distribution, and if this is true, we can conclude that both samples share the same statistics. As a consequence, permutation tests can also be used to check whether a statistic of two samples is the same. One really useful application of this is to test whether two means are the same or significantly different (as you have probably realized by now, statisticians are obsessed with the mean and love to spend time studying it!). In this situation, the question is to determine whether the difference of means in the two samples is random or a consequence of a difference in distributions.

You should be quite familiar with tests by now, so how would you proceed to do a permutation test to verify whether two means are equal? Use that process to test the equality of the means of both samples in this dataset.

    Exercise 5
Looking at the average annual wage of the United States and Switzerland, both countries have roughly the same level of wealth, since those statistics are 60,154 and 60,124 US dollars respectively. In this dataset, you will find simulated annual wages of citizens of both countries. Test the hypothesis that Americans and the Swiss have the same average annual wage based on those samples at a level of 5%.

    Exercise 6
To test whether two samples from different distributions have the same statistics, we cannot use the permutation test: we will use bootstrapping instead. To test whether two samples have the same mean, for example, you should follow these steps (a minimal R sketch follows the list):

    1. Formulate a null and an alternative hypothesis.
    2. Set a significance level.
    3. Compute the difference of mean of both samples. This will be the reference value we will use to compute the p-value.
    4. Concatenate both samples and compute the mean of this new dataset.
    5. Shift both samples so that they share the mean of the concatenated dataset.
    6. Use bootstrap to generate an estimate of the mean of both shifted samples.
    7. Compute the difference of both means.
    8. Repeat the last two steps at least 1000 times.
    9. Compute the p-value and draw a conclusion.
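Here is a rough sketch of those steps in R (with simulated wages standing in for the exercise dataset; the object names are invented for illustration):

set.seed(123)
us    <- rnorm(200, mean = 60154, sd = 8000)  # stand-in for the US sample
swiss <- rnorm(200, mean = 60124, sd = 7000)  # stand-in for the Swiss sample

obs.diff <- mean(us) - mean(swiss)            # step 3: observed difference of means
pooled.mean <- mean(c(us, swiss))             # step 4

us.shift    <- us - mean(us) + pooled.mean    # step 5: impose a common mean
swiss.shift <- swiss - mean(swiss) + pooled.mean

boot.diff <- replicate(1000, {                # steps 6-8: bootstrap the difference
  mean(sample(us.shift, replace = TRUE)) - mean(sample(swiss.shift, replace = TRUE))
})

mean(abs(boot.diff) >= abs(obs.diff))         # step 9: two-sided p-value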

    Use the dataset from last exercise to see if the USA and Switzerland have the same average wage at a level of 5%.

    Exercise 7
    Test the hypothesis that both samples in this dataset have the same mean.

    Exercise 8
R has functions that use analytic methods to test whether two samples have an equal mean.

1. Use the t.test() function to test the equality of the means of the samples from the last exercise.
2. Use this function to test the hypothesis that the average wage in the US is bigger than in Switzerland.

    Exercise 9
The globular cluster luminosity dataset lists measurements of the luminosity of star clusters in different regions of the Milky Way galaxy and the Andromeda galaxy. Test the hypothesis that the average luminosities in the two galaxies differ by 24.78.

    Exercise 10
A company that molds aluminum for auto parts has bought a smaller company to increase the number of parts it can produce each year. In its factory, the smaller company used the standard equipment, but had a different factory layout, a different supply line, and managed its employees’ work schedules in a completely different manner than its new parent company. Before changing the company culture, the engineers at the parent company are interested in knowing which approach is more effective. To do so, they measured the time it took to make an auto part in each factory 150 times and created this dataset, where the first column represents the sample from the small factory.

1. Is the average time it takes to make a part the same in both factories?
2. Does the production time follow the same distribution in both factories?
3. If the engineers want to minimize the percentage of parts that take more than one hour to make, which setup should they implement in both factories: the one of the parent company or the one of the smaller company?
    Related exercise sets:
    1. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-6)
    2. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-2)
    3. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-5)
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory

    wrapr: R Code Sweeteners

    Sat, 08/26/2017 - 03:02

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    wrapr is an R package that supplies powerful tools for writing and debugging R code.

    Primary wrapr services include:

    • let()
    • %.>% (dot arrow pipe)
    • := (named map builder)
    • λ() (anonymous function builder)
    • DebugFnW()
    let()

    let() allows execution of arbitrary code with substituted variable names (note this is subtly different than binding values for names as with base::substitute() or base::with()).

    The function is simple and powerful. It treats strings as variable names and re-writes expressions as if you had used the denoted variables. For example the following block of code is equivalent to having written "a + a".

    library("wrapr") a <- 7 let( c(VAR = 'a'), VAR + VAR ) # [1] 14

    This is useful in re-adapting non-standard evaluation interfaces (NSE interfaces) so one can script or program over them.
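For instance, here is a small sketch (not from the original article; the function and column names are made up for illustration) of using let() to parameterize a dplyr verb over a column name passed in as a string:

library("dplyr")

# compute the mean of a column whose name arrives as a character string
mean_of_column <- function(data, colName) {
  let(
    c(COL = colName),
    data %>% summarize(meanValue = mean(COL))
  )
}

mean_of_column(mtcars, "mpg")
#   meanValue
# 1  20.09062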

We are trying to make let() self-teaching and self-documenting (to the extent that makes sense). For example, try the argument eval = FALSE to prevent execution and see what would have been executed, or debugPrint = TRUE to have the replaced code printed in addition to being executed:

let(
  c(VAR = 'a'),
  eval = FALSE,
  {
    VAR + VAR
  }
)
# {
#   a + a
# }

let(
  c(VAR = 'a'),
  debugPrint = TRUE,
  {
    VAR + VAR
  }
)
# {
#   a + a
# }
# [1] 14

    Please see vignette('let', package='wrapr') for more examples. For working with dplyr 0.7.* we suggest also taking a look at an alternate approach called seplyr.

    %.>% (dot arrow pipe)

    %.>% dot arrow pipe is a strict pipe with intended semantics:

    "a %.>% b" is to be treated as if the user had written "{ . <- a; b };" with "%.>%" being treated as left-associative, and .-side effects removed.

    That is: %.>% does not alter any function arguments that are not explicitly named. It is not defined as a %.% b ~ b(a) (roughly dplyr‘s original pipe) or as the large set of differing cases constituting magrittr::%>%. %.>% is designed to be explicit and simple.

The effect is shown below.

    The following two expressions should be equivalent:

cos(exp(sin(4)))
# [1] 0.8919465

4 %.>% sin(.) %.>% exp(.) %.>% cos(.)
# [1] 0.8919465

The notation is quite powerful as it treats pipe stages as expressions parameterized over the variable ".". This means you do not need to introduce functions to express stages. The following is a valid dot-pipe:

    1:4 %.>% .^2 # [1] 1 4 9 16

The notation is also very regular in that expressions have the same interpretation whether they are surrounded by parentheses, braces, or left as-is:

1:4 %.>% { .^2 }
# [1]  1  4  9 16

1:4 %.>% ( .^2 )
# [1]  1  4  9 16

    Regularity can be a big advantage in teaching and comprehension. Please see "In Praise of Syntactic Sugar" for more details.

    :=

    := is the "named map builder". It allows code such as the following:

'a' := 'x'
#   a
# "x"

The important property of the named map builder is that it accepts values on the left-hand side, allowing the following:

name <- 'variableNameFromElseWhere'
name := 'newBinding'
# variableNameFromElseWhere
#              "newBinding"

    A nice property is := commutes (in the sense of algebra or category theory) with R‘s concatenation function c(). That is the following two statements are equivalent:

c('a', 'b') := c('x', 'y')
#   a   b
# "x" "y"

c('a' := 'x', 'b' := 'y')
#   a   b
# "x" "y"

λ()

    λ() is a concise abstract function creator. It is a placeholder that allows the use of the λ-character for very concise function abstraction.

    Example:

# Make sure lambda function builder is in our environment.
if(exists('defineLambda', envir = as.environment('package:wrapr'), mode = 'function')) {
  # Note: prior to version 0.4.2 wrapr
  # loaded a lambda-definition during
  # package load. The following explicit
  # wrapr::defineLambda() is more polite.
  wrapr::defineLambda()
}

# square numbers 1 through 4
sapply(1:4, λ(x, x^2))
# [1]  1  4  9 16

DebugFnW()

    DebugFnW() wraps a function for debugging. If the function throws an exception the execution context (function arguments, function name, and more) is captured and stored for the user. The function call can then be reconstituted, inspected and even re-run with a step-debugger. Please see our free debugging video series and vignette('DebugFnW', package='wrapr') for examples.
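As a rough sketch of that workflow (not from the original article; the fragile function and the lastError destination are invented for illustration, and the DebugFnW() call pattern is assumed from the package documentation):

# a function that fails for some inputs
fragileFn <- function(x) {
  if (x > 2) stop("x too large")
  x^2
}

# wrap it so that, on error, the call context is saved into lastError
fragileFnD <- DebugFnW(as.name('lastError'), fragileFn)

tryCatch(fragileFnD(7), error = function(e) print(e))
# lastError now holds the wrapped function and its arguments;
# str(lastError) lets you inspect and re-run the failing call.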


    Unbottling “.msg” Files in R

    Sat, 08/26/2017 - 01:10

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    There was a discussion on Twitter about the need to read in “.msg” files using R. The “MSG” file format is one of the many binary abominations created by Microsoft to lock folks and users into their platform and tools. Thankfully, they (eventually) provided documentation for the MSG file format which helped me throw together a small R package — msgxtractr — that can read in these ‘.msg’ files and produce a list as a result.

I had previously created a quick version of this by wrapping a Python module, but that’s a path fraught with peril and did not work for one of the requestors (yay, not-so-cross-platform UTF woes). So, I cobbled together some bits and pieces from the C to provide a singular function read_msg() that smashes open bottled up msgs, grabs sane/useful fields and produces a list() with them all wrapped up in a bow (an example is at the end and in the GH README).

    Thanks to rhub, WinBuilder and Travis the code works on macOS, Linux and Windows and even has pretty decent code coverage for a quick project. That’s a resounding testimony to the work of many members of the R community who’ve gone to great lengths to make testing virtually painless for package developers.

    Now, I literally have a singular ‘.msg’ file to test with, so if folks can kick the tyres, file issues (with errors or feature suggestions) and provide some more ‘.msg’ files for testing, it would be most appreciated.

    devtools::install_github("hrbrmstr/msgxtractr") library(msgxtractr) print(str(read_msg(system.file("extdata/unicode.msg", package="msgxtractr")))) ## List of 7 ## $ headers :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 18 variables: ## ..$ Return-path : chr "" ## ..$ Received :List of 1 ## .. ..$ : chr [1:4] "from st11p00mm-smtpin007.mac.com ([17.172.84.240])\nby ms06561.mac.com (Oracle Communications Messaging Server "| __truncated__ "from mail-vc0-f182.google.com ([209.85.220.182])\nby st11p00mm-smtpin007.mac.com\n(Oracle Communications Messag"| __truncated__ "by mail-vc0-f182.google.com with SMTP id ie18so3484487vcb.13 for\n; Mon, 18 Nov 2013 00:26:25 -0800 (PST)" "by 10.58.207.196 with HTTP; Mon, 18 Nov 2013 00:26:24 -0800 (PST)" ## ..$ Original-recipient : chr "rfc822;brianzhou@me.com" ## ..$ Received-SPF : chr "pass (st11p00mm-smtpin006.mac.com: domain of brizhou@gmail.com\ndesignates 209.85.220.182 as permitted sender)\"| __truncated__ ## ..$ DKIM-Signature : chr "v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com;\ns=20120113; h=mime-version:date:message-id:subject:f"| __truncated__ ## ..$ MIME-version : chr "1.0" ## ..$ X-Received : chr "by 10.221.47.193 with SMTP id ut1mr14470624vcb.8.1384763184960;\nMon, 18 Nov 2013 00:26:24 -0800 (PST)" ## ..$ Date : chr "Mon, 18 Nov 2013 10:26:24 +0200" ## ..$ Message-id : chr "" ## ..$ Subject : chr "Test for TIF files" ## ..$ From : chr "Brian Zhou " ## ..$ To : chr "brianzhou@me.com" ## ..$ Cc : chr "Brian Zhou " ## ..$ Content-type : chr "multipart/mixed; boundary=001a113392ecbd7a5404eb6f4d6a" ## ..$ Authentication-results : chr "st11p00mm-smtpin007.mac.com; dkim=pass\nreason=\"2048-bit key\" header.d=gmail.com header.i=@gmail.com\nheader."| __truncated__ ## ..$ x-icloud-spam-score : chr "33322\nf=gmail.com;e=gmail.com;pp=ham;spf=pass;dkim=pass;wl=absent;pwl=absent" ## ..$ X-Proofpoint-Virus-Version: chr "vendor=fsecure\nengine=2.50.10432:5.10.8794,1.0.14,0.0.0000\ndefinitions=2013-11-18_02:2013-11-18,2013-11-17,19"| __truncated__ ## ..$ X-Proofpoint-Spam-Details : chr "rule=notspam policy=default score=0 spamscore=0\nsuspectscore=0 phishscore=0 bulkscore=0 adultscore=0 classifie"| __truncated__ ## $ sender :List of 2 ## ..$ sender_email: chr "brizhou@gmail.com" ## ..$ sender_name : chr "Brian Zhou" ## $ recipients :List of 2 ## ..$ :List of 3 ## .. ..$ display_name : NULL ## .. ..$ address_type : chr "SMTP" ## .. ..$ email_address: chr "brianzhou@me.com" ## ..$ :List of 3 ## .. ..$ display_name : NULL ## .. ..$ address_type : chr "SMTP" ## .. ..$ email_address: chr "brizhou@gmail.com" ## $ subject : chr "Test for TIF files" ## $ body : chr "This is a test email to experiment with the MS Outlook MSG Extractor\r\n\r\n\r\n-- \r\n\r\n\r\nKind regards\r\n"| __truncated__ ## $ attachments :List of 2 ## ..$ :List of 4 ## .. ..$ filename : chr "importOl.tif" ## .. ..$ long_filename: chr "import OleFileIO.tif" ## .. ..$ mime : chr "image/tiff" ## .. ..$ content : raw [1:969674] 49 49 2a 00 ... ## ..$ :List of 4 ## .. ..$ filename : chr "raisedva.tif" ## .. ..$ long_filename: chr "raised value error.tif" ## .. ..$ mime : chr "image/tiff" ## .. ..$ content : raw [1:1033142] 49 49 2a 00 ... ## $ display_envelope:List of 2 ## ..$ display_cc: chr "Brian Zhou" ## ..$ display_to: chr "brianzhou@me.com" ## NULL

NOTE: Don’t try to read those TIFF images with magick or even the tiff package. They seem to have some strange tags. But saving one (use writeBin()) and opening it with Preview (or your favorite image viewer) should work (it did for me and produced the following image, which I’ve converted to png):
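For example, a minimal sketch of that save step (assuming the parsed message structure shown above; the output filename is simply the attachment’s own long_filename):

library(msgxtractr)

msg <- read_msg(system.file("extdata/unicode.msg", package = "msgxtractr"))

att <- msg$attachments[[1]]                     # first attachment (a TIFF in this example)
writeBin(att$content, con = att$long_filename)  # writes "import OleFileIO.tif" to disk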


    Tidy evaluation, most common actions

    Fri, 08/25/2017 - 22:30

    (This article was first published on That’s so Random, and kindly contributed to R-bloggers)

    Tidy evaluation is a bit challenging to get your head around. Even after reading programming with dplyr several times, I still struggle when creating functions from time to time. I made a small summary of the most common actions I perform, so I don’t have to dig in the vignettes and on stackoverflow over and over. Each is accompanied with a minimal example on how to implement it. I thought others might find this useful too, so here it is in a blog post. This list is meant to be a living thing so additions and improvements are most welcome. Please do a PR on this file or send an email.

library(tidyverse)

bare to quosure: quo

bare_to_quo <- function(x, var){
  x %>% select(!!var) %>% head(1)
}
bare_to_quo(mtcars, quo(cyl))
##           cyl
## Mazda RX4   6

bare to quosure in function: enquo

bare_to_quo_in_func <- function(x, var) {
  var_enq <- enquo(var)
  x %>% select(!!var_enq) %>% head(1)
}
bare_to_quo_in_func(mtcars, mpg)
##           mpg
## Mazda RX4  21

quosure to a name: quo_name

bare_to_name <- function(x, nm) {
  nm_name <- quo_name(nm)
  x %>% mutate(!!nm_name := 42) %>% head(1) %>% select(!!nm)
}
bare_to_name(mtcars, quo(this_is_42))
##   this_is_42
## 1         42

quosure to text: quo_text

quo_to_text <- function(x, var) {
  var_enq <- enquo(var)
  ggplot(x, aes_string(rlang::quo_text(var_enq))) + geom_density()
}
plt <- quo_to_text(mtcars, cyl)

    Note that tidy evaluation is not yet implemented in ggplot2, but this will be in future versions. This is a workaround for the meantime, when combining dplyr and ggplot2.

character to quosure: sym

char_to_quo <- function(x, var) {
  var_enq <- rlang::sym(var)
  x %>% select(!!var_enq) %>% head(1)
}
char_to_quo(mtcars, "vs")
##           vs
## Mazda RX4  0

multiple bares to quosure: quos

bare_to_quo_mult <- function(x, ...) {
  grouping <- quos(...)
  x %>% group_by(!!!grouping) %>% summarise(nr = n())
}
bare_to_quo_mult(mtcars, vs, cyl)
## # A tibble: 5 x 3
## # Groups: vs [?]
##      vs   cyl    nr
##
## 1     0     4     1
## 2     0     6     3
## 3     0     8    14
## 4     1     4    10
## 5     1     6     4

multiple characters to quosure: syms

bare_to_quo_mult_chars <- function(x, ...) {
  grouping <- rlang::syms(...)
  x %>% group_by(!!!grouping) %>% summarise(nr = n())
}
bare_to_quo_mult_chars(mtcars, list("vs", "cyl"))
## # A tibble: 5 x 3
## # Groups: vs [?]
##      vs   cyl    nr
##
## 1     0     4     1
## 2     0     6     3
## 3     0     8    14
## 4     1     4    10
## 5     1     6     4

quoting full expressions

Although quoting column names is the most common use, it is by no means the only option. We can use the above to quote full expressions.

filter_func <- function(x, filter_exp) {
  filter_exp_enq <- enquo(filter_exp)
  x %>% filter(!!filter_exp_enq)
}
filter_func(mtcars, hp == 93)
##    mpg cyl disp hp drat   wt  qsec vs am gear carb
## 1 22.8   4  108 93 3.85 2.32 18.61  1  1    4    1


    Stan Weekly Roundup, 25 August 2017

    Fri, 08/25/2017 - 21:00

    (This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

    This week, the entire Columbia portion of the Stan team is out of the office and we didn’t have an in-person/online meeting this Thursday. Mitzi and I are on vacation, and everyone else is either teaching, TA-ing, or attending the Stan course. Luckily for this report, there’s been some great activity out of the meeting even if I don’t have a report of what everyone around Columbia has been up to. If a picture’s really worth a thousand words, this is the longest report yet.

    • Ari Hartikainen has produced some absolutely beautiful parallel coordinate plots of HMC divergences* for multiple parameters. The divergent transitions are shown in green and the lines connect a single draw. The top plot is unnormalized, whereas the bottom scales all parameters to a [0, 1] range.



      You can follow the ongoing discussion on the forum thread. There are some further plots for larger models and some comparisons with the pairs plots that Michael Betancourt has been recommending for the same purpose (the problem with pairs is that it’s very very slow, at least in RStan, because it has to draw quadratically many plots).

    • Sebastian Weber has a complete working prototype of the MPI (multi-core parallelization) in place and has some beautiful results to report. The first graph is the speedup he achieved on a 20-core server (all in one box with shared memory):



      The second graph shows what happens when the problem size grows (those bottom numbers on the x-axis are the number of ODE systems being solved, whereas the top number remains the number of cores used).



  As with Ari’s plots, you can follow the ongoing discussion on the forum thread. And if you know something about MPI, you can even help out. Sebastian’s been asking if anyone who knows MPI would like to check his work—he’s learning it as he goes (and doing a bang-up job of it, I might add!).

     

    These lists are incomplete

    After doing a handful of these reports, I’m sorry to say you’re only seeing a very biased selection of activity around Stan. For the full story, I’d encourage you to jump onto our forums or GitHub (warning: very high traffic, even if you focus).

     * Divergences in Stan arise when the Hamiltonian, which should be conserved across a trajectory, diverges—it’s basically a numerical simulation problem—if we could perfectly follow the Hamiltonian through complex geometries, there wouldn’t be any divergences. This is a great diagnostic mechanism to signal something’s going wrong and resulting estimates might be biased. It may seem to make HMC more fragile, but the problem is that Gibbs and Metropolis will fail silently in a lot of these situations (though BUGS will often help you out of numerical issues by crashing).

    The post Stan Weekly Roundup, 25 August 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.

