R news and tutorials contributed by hundreds of R bloggers
Updated: 9 hours 36 min ago

3-D animations with R

Tue, 08/29/2017 - 23:42

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R is often used to visualize and animate 2-dimensional data. (Here are just a few examples.)  But did you know you can create 3-dimensional animations as well?

As Thomas Lins Pedersen explains in a recent blog post, the trick is in using the persp function to translate points in 3-D space into a 2-D projection. This function is normally used to render a 3-D scatterplot or wireframe plot, but if you instead capture its output value, it returns a transformation matrix. You can then use the trans3d function to with this matrix to transform points in 3-D space. Thomas demonstrates how you can pass the transformed 2-D coordinates to plot a 3-D cube, and even animate it from two slightly different perspectives to create a 3-D stereo pair:

Rendering 3-D images isn't just for fun: there's plenty of 3-D data to analyze and visualize, too. Giora Simchoni used R to visualize data from the Carnegie-Mellon Graphics Lab Motion Capture Database. This data repository provides the output of human figures in motion-capture suits performing actions like walking, jumping, and even dancing. Since the motion-capture suits include multiple sensors measured over a time-period, the data structures are quite complex. To make things simpler, Giora created the mocap package (available on Github) to read these motion data files and generate 3-D animations to visualize them. For example, here's the output from two people performing the Charleston together:

You can find complete details behind both animations, including the associated R code, at the links below.

Data Imaginist: I made a 3D movie with ggplot2 once – here's how I did it
Giora Simchoni: Lambada! (The mocap Package)

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Clean or shorten Column names while importing the data itself

Tue, 08/29/2017 - 19:05

(This article was first published on Coastal Econometrician Views, and kindly contributed to R-bloggers)

When it comes to clumsy column headers namely., wide ones with spaces and special characters, I see many get panic and change the headers in the source file, which is an awkward option given variety of alternatives that exist in R for handling them.

One easy handling of such scenarios is using library(janitor), as name suggested can be employed for cleaning and maintaining. Janitor has function by name clean_names() which can be useful while directly importing the data itself as show in the below example:
“ library(janitor); newdataobject <- read.csv(“yourcsvfilewithpath.csv”, header=T) %>% clean_names()

”

Author undertook several projects, courses and programs in data sciences for more than a decade, views expressed here are from his industry experience. He can be reached at mavuluri.pradeep@gmail or besteconometrician@gmail.com for more details.
Find more about author at http://in.linkedin.com/in/pradeepmavuluri
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Coastal Econometrician Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Working with air quality and meteorological data Exercises (Part-2)

Tue, 08/29/2017 - 18:08

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Atmospheric air pollution is one of the most important environmental concerns in many countries around the world, and it is strongly affected by meteorological conditions. Accordingly, in this set of exercises we use openair package to work and analyze air quality and meteorological data. This packages provides tools to directly import data from air quality measurement network across UK, as well as tools to analyse and producing reports.

In the previous exercise set we used data from MY1 station to see how to import data and extract basic statistical information from data. In this exercise set we will use some basic and useful functions that are available in openair package to analyze and visualize MY1 data.

Answers to the exercises are available here.

For other parts of this exercise set follow the tag openair

Please load the package openair before starting the exercises.

Exercise 1
Use summaryPlot function to plot timeseries and histogram for pm10, and o3

Exercise 2
Use windRose function to plot monthly wind rose.

You can use Air Quality Data and weather patterns in combination with spatial data visualization, Learn more about spatial data in the online course
[Intermediate] Spatial Data Analysis with R, QGIS & More
. this course you will learn how to:

• Work with Spatial data and maps
• Learn about different tools to develop spatial data next to R
• And much more

Exercise 3
Use pollutionRose function to plot monthly pollution roses for
a. pm10
b. pm2.5
b. nox
c. no
d. o3

Exercise 4
Use pollutionRose to plot seasonal pollution roses for
a. pm10
b. pm2.5
b. nox
c. no
d. o3

Exercise 5
Use percentileRose function to plot monthly percentile roses for
a. pm10
b. pm2.5
b. nox
c. no
d. o3

Exercise 6
Use polarCluster function to plot cluster roses plot for
a. pm10
b. pm2.5
b. nox
c. no
d. o3

Related exercise sets: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

The Cycling Accident Map of Madrid City

Tue, 08/29/2017 - 18:00

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Far away, this ship has taken me far away (Starlight, Muse)

Madrid City has an Open Data platform where can be found around 300 data sets about a number of topics. One of these sets is the one I used for this experiment. It contains information about cycling accidents  happened in the city from January to July 2017. I have done a map to locate where the accidents took place. This experiment shows how R makes very easy to create professional maps with Leaflet (in this case I use Carto basemaps).

To locate accidents the data set only contains the address where they happened so the first thing I did is to obtain their geographical coordinates using geocode function from ggmap package. There were 431 accidents during the first 7 months of 2017 (such a big number!) and I got coordinates of 407 so I can locate 94% of the accidents.

Obviously, the amount of accidents in some place depend on how many bikers circulate there as well as on its infrastructure. None of these things can be seen in the map: It only shows number of accidents.

The categorization of accidents is:

• Double collision (Colisión doble): Traffic accident occurred between two moving vehicles.
• Multiple collision (Colisión múltiple): Traffic accident occurred between more than two moving vehicles.
• Fixed object collision (Choque con objeto fijo): Accident occurred between a moving vehicle with a driver and an immovable object that occupies the road or separated area of ​​the same, whether parked vehicle, tree, street lamp, etc.
• Accident (Atropello): Accident occurred between a vehicle and a pedestrian that occupies the road or travels by sidewalks, refuges, walks or zones of the public road not destined to the circulation of vehicles.
• Overturn (Vuelco): Accident suffered by a vehicle with more than two wheels which by some circumstance loses contact with the road and ends supported on one side or on its roof.
• Motorcycle fall (Caída motocicleta): Accident suffered by a motorcycle, which at some moment loses balance, because of the driver or due to the conditions of the road.
• Moped fall (Caída ciclomotor): Accident suffered by a moped, which at some moment loses balance, because of the driver or due to the conditions of the road.
• Bicycle fall (Caída bicicleta): Accident suffered by a bicycle, which at some moment loses balance, because of the driver or due to the conditions of the road.

These categories are redundant (e.g. Double and Multiple collision), difficult to understand (e.g. Overturn) or both things at the same time (e.g. Motorcycle fall and Moped fall). This categorization also forgets human damages incurred by the accident.

Taking all these things in mind, this is the map:

Here is a full-screen version of the map.

My suggestions to the city council of Madrid are:

1. Add geographical coordinates to data (I guess many of the analysis will need them)
2. Rethink the categorization to make it clearer and more informative
3. Add more cycling data sets to the platform (detail of bikeways, traffic …) to understand accidents better
4. Attending just to the number of accidents , put the focus around Parque del Retiro, specially on its west surroundings, from Plaza de Cibeles to Plaza de Carlos V: more warning signals, more  (or better) bikeways …

I add the code below to update the map (If someone ask it to me, I can do it myself regularly):

library(dplyr) library(stringr) library(ggmap) library(leaflet) # First, getting the data download.file(paste0("http://datos.madrid.es/egob/catalogo/", file), destfile="300110-0-accidentes-bicicleta.csv") data=read.csv("300110-0-accidentes-bicicleta.csv", sep=";", skip=1) # Prepare data for geolocation data %>% mutate(direccion=paste(str_trim(Lugar), str_trim(Numero), "MADRID, SPAIN", sep=", ") %>% str_replace("NA, ", "") %>% str_replace(" - ", " CON ")) -> data # Geolocation (takes some time ...) coords=c() for (i in 1:nrow(data)) { coords %>% rbind(geocode(data[i,"direccion"])) -> coords Sys.sleep(0.5) } # Save data, just in case data %>% cbind(coords) %>% saveRDS(file="bicicletas.RDS") data=readRDS(file="bicicletas.RDS") # Remove non-successfull geolocations data %>% filter(!is.na(lon)) %>% droplevels()-> data # Remove non-successfull geolocations data %>% mutate(Fecha=paste0(as.Date(data$Fecha, "%d/%m/%Y"), " ", TRAMO.HORARIO), popup=paste0("Dónde:", direccion, "Cuándo:", Fecha, "Qué pasó:", Tipo.Accidente)) -> data # Do the map data %>% split(data$Tipo.Accidente) -> data.df l <- leaflet() %>% addProviderTiles(providers$CartoDB.Positron) names(data.df) %>% purrr::walk( function(df) { l <<- l %>% addCircleMarkers(data=data.df[[df]], lng=~lon, lat=~lat, popup=~popup, color="red", stroke=FALSE, fillOpacity = 0.8, group = df, clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = F)) }) l %>% addLayersControl( overlayGroups = names(data.df), options = layersControlOptions(collapsed = FALSE) ) var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... Rpad Domain Repurposed To Deliver Creepy (and potentially malicious) Content Tue, 08/29/2017 - 16:14 (This article was first published on R – rud.is, and kindly contributed to R-bloggers) I was about to embark on setting up a background task to sift through R package PDFs for traces of functions that “omit NA values” as a surprise present for Colin Fay and Sir Tierney: [Please RT]#RStats folks, @nj_tierney & I need your help for {naniar}! When does R silently drop/omit NA? https://t.co/V5elyGcG8Z pic.twitter.com/VScLXFCl2n — Colin Fay (@_ColinFay) August 29, 2017 When I got distracted by a PDF in the CRAN doc/contrib directory: Short-refcard.pdf. I’m not a big reference card user but students really like them and after seeing what it was I remembered having seen the document ages ago, but never associated it with CRAN before. I saw: by Tom Short, EPRI PEAC, tshort@epri-peac.com 2004-11-07 Granted to the public domain. See www. Rpad. org for the source and latest version. Includes material from R for Beginners by Emmanuel Paradis (with permission). at the top of the card. The link (which I’ve made unclickable for reasons you’ll see in a sec — don’t visit that URL) was clickable and I tapped it as I wanted to see if it had changed since 2004. You can open that image in a new tab to see the full, rendered site and take a moment to see if you can find the section that links to objectionable — and, potentially malicious — content. It’s easy to spot. I made a likely correct assumption that Tom Short had nothing to do with this and wanted to dig into it a bit further to see when this may have happened. So, don your bestest deerstalker and follow along as we see when this may have happened. Digging In Domain Land We’ll need some helpers to poke around this data in a safe manner: library(wayback) # devtools::install_github("hrbrmstr/wayback") library(ggTimeSeries) # devtools::install_github("AtherEnergy/ggTimeSeries") library(splashr) # devtools::install_github("hrbrmstr/splashr") library(passivetotal) # devtools::install_github("hrbrmstr/passivetotal") library(cymruservices) library(magick) library(tidyverse) (You’ll need to get a RiskIQ PassiveTotal key to use those functions. Also, please donate to Archive.org if you use the wayback package.) Now, let’s see if the main Rpad content URL is in the wayback machine: glimpse(archive_available("http://www.rpad.org/Rpad/")) ## Observations: 1 ## Variables: 5 ##$ url "http://www.rpad.org/Rpad/" ## $available TRUE ##$ closet_url "http://web.archive.org/web/20170813053454/http://ww... ## $timestamp 2017-08-13 ##$ status "200"

It is! Let’s see how many versions of it are in the archive:

x <- cdx_basic_query("http://www.rpad.org/Rpad/") ts_range <- range(x$timestamp) count(x, timestamp) %>% ggplot(aes(timestamp, n)) + geom_segment(aes(xend=timestamp, yend=0)) + labs(x=NULL, y="# changes in year", title="rpad.org Wayback Change Timeline") + theme_ipsum_rc(grid="Y") count(x, timestamp) %>% mutate(Year = lubridate::year(timestamp)) %>% complete(timestamp=seq(ts_range[1], ts_range[2], "1 day")) %>% filter(!is.na(timestamp), !is.na(Year)) %>% ggplot(aes(date = timestamp, fill = n)) + stat_calendar_heatmap() + viridis::scale_fill_viridis(na.value="white", option = "magma") + facet_wrap(~Year, ncol=1) + labs(x=NULL, y=NULL, title="rpad.org Wayback Change Timeline") + theme_ipsum_rc(grid="") + theme(axis.text=element_blank()) + theme(panel.spacing = grid::unit(0.5, "lines")) There’s a big span between 2008/9 and 2016/17. Let’s poke around there a bit. First 2016: tm <- get_timemap("http://www.rpad.org/Rpad/") (rurl <- filter(tm, lubridate::year(anytime::anydate(datetime)) == 2016)) ## # A tibble: 1 x 5 ## rel link type ## ## 1 memento http://web.archive.org/web/20160629104907/http://www.rpad.org:80/Rpad/ ## # ... with 2 more variables: from , datetime (p2016 <- render_png(url = rurl$link))

Hrm. Could be server or network errors.

Let’s go back to 2009.

(rurl <- filter(tm, lubridate::year(anytime::anydate(datetime)) == 2009)) ## # A tibble: 4 x 5 ## rel link type ## ## 1 memento http://web.archive.org/web/20090219192601/http://rpad.org:80/Rpad ## 2 memento http://web.archive.org/web/20090322163146/http://www.rpad.org:80/Rpad ## 3 memento http://web.archive.org/web/20090422082321/http://www.rpad.org:80/Rpad ## 4 memento http://web.archive.org/web/20090524155658/http://www.rpad.org:80/Rpad ## # ... with 2 more variables: from , datetime (p2009 <- render_png(url = rurl$link[4])) If you poke around that, it looks like the original Rpad content, so it was “safe” back then. (rurl <- filter(tm, lubridate::year(anytime::anydate(datetime)) == 2017)) ## # A tibble: 6 x 5 ## rel link type ## ## 1 memento http://web.archive.org/web/20170323222705/http://www.rpad.org/Rpad ## 2 memento http://web.archive.org/web/20170331042213/http://www.rpad.org/Rpad/ ## 3 memento http://web.archive.org/web/20170412070515/http://www.rpad.org/Rpad/ ## 4 memento http://web.archive.org/web/20170518023345/http://www.rpad.org/Rpad/ ## 5 memento http://web.archive.org/web/20170702130918/http://www.rpad.org/Rpad/ ## 6 memento http://web.archive.org/web/20170813053454/http://www.rpad.org/Rpad/ ## # ... with 2 more variables: from , datetime (p2017 <- render_png(url = rurl$link[1]))

I won’t break your browser and add another giant image, but that one has the icky content. So, it’s a relatively recent takeover and it’s likely that whomever added the icky content links did so to try to ensure those domains and URLs have both good SEO and a positive reputation.

Let’s see if they were dumb enough to make their info public:

rwho <- passive_whois("rpad.org") str(rwho, 1) ## List of 18 ## $registryUpdatedAt: chr "2016-10-05" ##$ admin :List of 10 ## $domain : chr "rpad.org" ##$ registrant :List of 10 ## $telephone : chr "5078365503" ##$ organization : chr "WhoisGuard, Inc." ## $billing : Named list() ##$ lastLoadedAt : chr "2017-03-14" ## $nameServers : chr [1:2] "ns-1147.awsdns-15.org" "ns-781.awsdns-33.net" ##$ whoisServer : chr "whois.publicinterestregistry.net" ## $registered : chr "2004-06-15" ##$ contactEmail : chr "411233718f2a4cad96274be88d39e804.protect@whoisguard.com" ## $name : chr "WhoisGuard Protected" ##$ expiresAt : chr "2018-06-15" ## $registrar : chr "eNom, Inc." ##$ compact :List of 10 ## $zone : Named list() ##$ tech :List of 10

Nope. #sigh

Is this site considered “malicious”?

(rclass <- passive_classification("rpad.org")) ## $everCompromised ## NULL Nope. #sigh What’s the hosting history for the site? rdns <- passive_dns("rpad.org") rorig <- bulk_origin(rdns$results$resolve) tbl_df(rdns$results) %>% type_convert() %>% select(firstSeen, resolve) %>% left_join(select(rorig, resolve=ip, as_name=as_name)) %>% arrange(firstSeen) %>% print(n=100) ## # A tibble: 88 x 3 ## firstSeen resolve as_name ## ## 1 2009-12-18 11:15:20 144.58.240.79 EPRI-PA - Electric Power Research Institute, US ## 2 2016-06-19 00:00:00 208.91.197.132 CONFLUENCE-NETWORK-INC - Confluence Networks Inc, VG ## 3 2016-07-29 00:00:00 208.91.197.27 CONFLUENCE-NETWORK-INC - Confluence Networks Inc, VG ## 4 2016-08-12 20:46:15 54.230.14.253 AMAZON-02 - Amazon.com, Inc., US ## 5 2016-08-16 14:21:17 54.230.94.206 AMAZON-02 - Amazon.com, Inc., US ## 6 2016-08-19 20:57:04 54.230.95.249 AMAZON-02 - Amazon.com, Inc., US ## 7 2016-08-26 20:54:02 54.192.197.200 AMAZON-02 - Amazon.com, Inc., US ## 8 2016-09-12 10:35:41 52.84.40.164 AMAZON-02 - Amazon.com, Inc., US ## 9 2016-09-17 07:43:03 54.230.11.212 AMAZON-02 - Amazon.com, Inc., US ## 10 2016-09-23 18:17:50 54.230.202.223 AMAZON-02 - Amazon.com, Inc., US ## 11 2016-09-30 19:47:31 52.222.174.253 AMAZON-02 - Amazon.com, Inc., US ## 12 2016-10-24 17:44:38 52.85.112.250 AMAZON-02 - Amazon.com, Inc., US ## 13 2016-10-28 18:14:16 52.222.174.231 AMAZON-02 - Amazon.com, Inc., US ## 14 2016-11-11 10:44:22 54.240.162.201 AMAZON-02 - Amazon.com, Inc., US ## 15 2016-11-17 04:34:15 54.192.197.242 AMAZON-02 - Amazon.com, Inc., US ## 16 2016-12-16 17:49:29 52.84.32.234 AMAZON-02 - Amazon.com, Inc., US ## 17 2016-12-19 02:34:32 54.230.141.240 AMAZON-02 - Amazon.com, Inc., US ## 18 2016-12-23 14:25:32 54.192.37.182 AMAZON-02 - Amazon.com, Inc., US ## 19 2017-01-20 17:26:28 52.84.126.252 AMAZON-02 - Amazon.com, Inc., US ## 20 2017-02-03 15:28:24 52.85.94.225 AMAZON-02 - Amazon.com, Inc., US ## 21 2017-02-10 19:06:07 52.85.94.252 AMAZON-02 - Amazon.com, Inc., US ## 22 2017-02-17 21:37:21 52.85.63.229 AMAZON-02 - Amazon.com, Inc., US ## 23 2017-02-24 21:43:45 52.85.63.225 AMAZON-02 - Amazon.com, Inc., US ## 24 2017-03-05 12:06:32 54.192.19.242 AMAZON-02 - Amazon.com, Inc., US ## 25 2017-04-01 00:41:07 54.192.203.223 AMAZON-02 - Amazon.com, Inc., US ## 26 2017-05-19 00:00:00 13.32.246.44 AMAZON-02 - Amazon.com, Inc., US ## 27 2017-05-28 00:00:00 52.84.74.38 AMAZON-02 - Amazon.com, Inc., US ## 28 2017-06-07 08:10:32 54.230.15.154 AMAZON-02 - Amazon.com, Inc., US ## 29 2017-06-07 08:10:32 54.230.15.142 AMAZON-02 - Amazon.com, Inc., US ## 30 2017-06-07 08:10:32 54.230.15.168 AMAZON-02 - Amazon.com, Inc., US ## 31 2017-06-07 08:10:32 54.230.15.57 AMAZON-02 - Amazon.com, Inc., US ## 32 2017-06-07 08:10:32 54.230.15.36 AMAZON-02 - Amazon.com, Inc., US ## 33 2017-06-07 08:10:32 54.230.15.129 AMAZON-02 - Amazon.com, Inc., US ## 34 2017-06-07 08:10:32 54.230.15.61 AMAZON-02 - Amazon.com, Inc., US ## 35 2017-06-07 08:10:32 54.230.15.51 AMAZON-02 - Amazon.com, Inc., US ## 36 2017-07-16 09:51:12 54.230.187.155 AMAZON-02 - Amazon.com, Inc., US ## 37 2017-07-16 09:51:12 54.230.187.184 AMAZON-02 - Amazon.com, Inc., US ## 38 2017-07-16 09:51:12 54.230.187.125 AMAZON-02 - Amazon.com, Inc., US ## 39 2017-07-16 09:51:12 54.230.187.91 AMAZON-02 - Amazon.com, Inc., US ## 40 2017-07-16 09:51:12 54.230.187.74 AMAZON-02 - Amazon.com, Inc., US ## 41 2017-07-16 09:51:12 54.230.187.36 AMAZON-02 - Amazon.com, Inc., US ## 42 2017-07-16 09:51:12 54.230.187.197 AMAZON-02 - Amazon.com, Inc., US ## 43 2017-07-16 09:51:12 54.230.187.185 AMAZON-02 - Amazon.com, Inc., US ## 44 2017-07-17 13:10:13 54.239.168.225 AMAZON-02 - Amazon.com, Inc., US ## 45 2017-08-06 01:14:07 52.222.149.75 AMAZON-02 - Amazon.com, Inc., US ## 46 2017-08-06 01:14:07 52.222.149.172 AMAZON-02 - Amazon.com, Inc., US ## 47 2017-08-06 01:14:07 52.222.149.245 AMAZON-02 - Amazon.com, Inc., US ## 48 2017-08-06 01:14:07 52.222.149.41 AMAZON-02 - Amazon.com, Inc., US ## 49 2017-08-06 01:14:07 52.222.149.38 AMAZON-02 - Amazon.com, Inc., US ## 50 2017-08-06 01:14:07 52.222.149.141 AMAZON-02 - Amazon.com, Inc., US ## 51 2017-08-06 01:14:07 52.222.149.163 AMAZON-02 - Amazon.com, Inc., US ## 52 2017-08-06 01:14:07 52.222.149.26 AMAZON-02 - Amazon.com, Inc., US ## 53 2017-08-11 19:11:08 216.137.61.247 AMAZON-02 - Amazon.com, Inc., US ## 54 2017-08-21 20:44:52 13.32.253.116 AMAZON-02 - Amazon.com, Inc., US ## 55 2017-08-21 20:44:52 13.32.253.247 AMAZON-02 - Amazon.com, Inc., US ## 56 2017-08-21 20:44:52 13.32.253.117 AMAZON-02 - Amazon.com, Inc., US ## 57 2017-08-21 20:44:52 13.32.253.112 AMAZON-02 - Amazon.com, Inc., US ## 58 2017-08-21 20:44:52 13.32.253.42 AMAZON-02 - Amazon.com, Inc., US ## 59 2017-08-21 20:44:52 13.32.253.162 AMAZON-02 - Amazon.com, Inc., US ## 60 2017-08-21 20:44:52 13.32.253.233 AMAZON-02 - Amazon.com, Inc., US ## 61 2017-08-21 20:44:52 13.32.253.29 AMAZON-02 - Amazon.com, Inc., US ## 62 2017-08-23 14:24:15 216.137.61.164 AMAZON-02 - Amazon.com, Inc., US ## 63 2017-08-23 14:24:15 216.137.61.146 AMAZON-02 - Amazon.com, Inc., US ## 64 2017-08-23 14:24:15 216.137.61.21 AMAZON-02 - Amazon.com, Inc., US ## 65 2017-08-23 14:24:15 216.137.61.154 AMAZON-02 - Amazon.com, Inc., US ## 66 2017-08-23 14:24:15 216.137.61.250 AMAZON-02 - Amazon.com, Inc., US ## 67 2017-08-23 14:24:15 216.137.61.217 AMAZON-02 - Amazon.com, Inc., US ## 68 2017-08-23 14:24:15 216.137.61.54 AMAZON-02 - Amazon.com, Inc., US ## 69 2017-08-25 19:21:58 13.32.218.245 AMAZON-02 - Amazon.com, Inc., US ## 70 2017-08-26 09:41:34 52.85.173.67 AMAZON-02 - Amazon.com, Inc., US ## 71 2017-08-26 09:41:34 52.85.173.186 AMAZON-02 - Amazon.com, Inc., US ## 72 2017-08-26 09:41:34 52.85.173.131 AMAZON-02 - Amazon.com, Inc., US ## 73 2017-08-26 09:41:34 52.85.173.18 AMAZON-02 - Amazon.com, Inc., US ## 74 2017-08-26 09:41:34 52.85.173.91 AMAZON-02 - Amazon.com, Inc., US ## 75 2017-08-26 09:41:34 52.85.173.174 AMAZON-02 - Amazon.com, Inc., US ## 76 2017-08-26 09:41:34 52.85.173.210 AMAZON-02 - Amazon.com, Inc., US ## 77 2017-08-26 09:41:34 52.85.173.88 AMAZON-02 - Amazon.com, Inc., US ## 78 2017-08-27 22:02:41 13.32.253.169 AMAZON-02 - Amazon.com, Inc., US ## 79 2017-08-27 22:02:41 13.32.253.203 AMAZON-02 - Amazon.com, Inc., US ## 80 2017-08-27 22:02:41 13.32.253.209 AMAZON-02 - Amazon.com, Inc., US ## 81 2017-08-29 13:17:37 54.230.141.201 AMAZON-02 - Amazon.com, Inc., US ## 82 2017-08-29 13:17:37 54.230.141.83 AMAZON-02 - Amazon.com, Inc., US ## 83 2017-08-29 13:17:37 54.230.141.30 AMAZON-02 - Amazon.com, Inc., US ## 84 2017-08-29 13:17:37 54.230.141.193 AMAZON-02 - Amazon.com, Inc., US ## 85 2017-08-29 13:17:37 54.230.141.152 AMAZON-02 - Amazon.com, Inc., US ## 86 2017-08-29 13:17:37 54.230.141.161 AMAZON-02 - Amazon.com, Inc., US ## 87 2017-08-29 13:17:37 54.230.141.38 AMAZON-02 - Amazon.com, Inc., US ## 88 2017-08-29 13:17:37 54.230.141.151 AMAZON-02 - Amazon.com, Inc., US

Unfortunately, I expected this. The owner keeps moving it around on AWS infrastructure.

So What?

This was an innocent link in a document on CRAN that went to a site that looked legit. A clever individual or organization found the dead domain and saw an opportunity to legitimize some fairly nasty stuff.

Now, I realize nobody is likely using “Rpad” anymore, but this type of situation can happen to any registered domain. If this individual or organization were doing more than trying to make objectionable content legit, they likely could have succeeded, especially if they enticed you with a shiny new devtools::install_…() link with promises of statistically sound animated cat emoji gif creation tools. They did an eerily good job of making this particular site still seem legit.

There’s nothing most folks can do to “fix” that site or have it removed. I’m not sure CRAN should remove the helpful PDF, but with a clickable link, it might be a good thing to suggest.

You’ll see that I used the splashr package (which has been submitted to CRAN but not there yet). It’s a good way to work with potentially malicious web content since you can “see” it and mine content from it without putting your own system at risk.

After going through this, I’ll see what I can do to put some bows on some of the devel-only packages and get them into CRAN so there’s a bit more assurance around using them.

I’m an army of one when it comes to fielding R-related security issues, but if you do come across suspicious items (like this or icky/malicious in other ways) don’t hesitate to drop me an @ or DM on Twitter.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Tidyverse practice: mapping large European cities

Tue, 08/29/2017 - 13:59

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

As noted in several recent posts, when you’re learning R and R’s Tidyverse packages, it’s important to break everything down into small units that you can learn.

What that means is that you need to identify the most important tools and functions of the Tidyverse, and then practice them until you are fluent.

But once you have mastered the essential functions as isolated units, you need to put them together. By putting the individual piece together, you solidify your knowledge of how they work individually but also begin to learn how you can combine small tools together to create novel effects.

With that in mind, I want to show you another small project. Here, we’re going to use a fairly small set of functions to create a map of the largest cities in Europe.

As we do this, pay attention:

• How many packages and functions do you really need?
• Evaluate: how long would it really take to memorize each individual function? (Hint: it’s much, much less time than you think.)
• Which functions have you seen before? Are some of the functions and techniques used more often than others (if you look across many different analyses)?

Ok, with those questions in mind, let’s get after it.

First we’ll just load a few packages.

#============== # LOAD PACKAGES #============== library(rvest) library(tidyverse) library(ggmap) library(stringr)

Next, we’re going to use the rvest package to scrape data from Wikipedia. The data that we are gathering is data about the largest cities in Europe. You can read more about the data on Wikipedia.

#=========================== # SCRAPE DATA FROM WIKIPEDIA #=========================== html.population <- read_html('https://en.wikipedia.org/wiki/List_of_European_cities_by_population_within_city_limits') df.euro_cities <- html.population %>% html_nodes("table") %>% .[[2]] %>% html_table() # inspect df.euro_cities %>% head() df.euro_cities %>% names()

Here at Sharp Sight, we haven’t worked with the rvest package in too many examples, so you might not be familiar with it.

Having said that, just take a close look. How many functions did we use from rvest? Could you memorize them? How long would it take?

Ok. Now we’ll do a little data cleaning.

First, we’re going to remove some of the variables using dplyr::select(). We are using the minus sign (‘-‘) in front of the names of the variables that we want to remove.

#============================ # REMOVE EXTRANEOUS VARIABLES #============================ df.euro_cities <- select(df.euro_cities, -Date, -Image, -Location, -Ref., -2011 Eurostat\npopulation[3]) # inspect df.euro_cities %>% names()

After removing the variables that we don’t want, we only have four variables. These remaining raw variable names could be cleaned up a little.

Ideally, we want names that are lower case (because they are easier to type). We also want variable names that are brief and descriptive.

In this case, renaming these variables to be brief, descriptive, and lower-case is fairly straightforward. Here, we will use very simple variable names: like rank, city, country, and population.

To add these new variable names, we can simply assign them by using the colnames() function.

#=============== # RENAME COLUMNS #=============== colnames(df.euro_cities) <- c("rank", "city", "country", "population") # inspect df.euro_cities %>% names() df.euro_cities %>% head()

Now that we have clean variable names, we will do a little modification of the data itself.

When we scraped the data from Wikipedia, some extraneous characters appeared in the population variable. Essentially, there were some leading digits and special characters that appear to be useless artifacts of the scraping process. We want to remove these extraneous characters and parse the population data into a proper numeric.

To do this, we will use a few functions from the stringr package.

First, we use str_extract() to extract the population data. When we do this, we are extracting everything from the ‘♠’ character to the end of the string (note: to do this, we are using a regular expression in str_extract()).

This is a quick way to get the numbers at the end of the string, but we actually don’t want to keep the ‘♠’ character. So, after we extract the population numbers (along with the ‘♠’), we then strip off the ‘♠’ character by using str_replace().

#======================================================================== # CLEAN UP VARIABLE: population # - when the data are scraped, there are some extraneous characters # in the "population" variable. # ... you can see leading numbers and some other items # - We will use stringr functions to extract the actual population data # (and remove the stuff we don't want) # - We are executing this transformation inside dplyr::mutate() to # modify the variable inside the dataframe #======================================================================== df.euro_cities <- df.euro_cities %>% mutate(population = str_extract(population, "♠.*$") %>% str_replace("♠","") %>% parse_number()) df.euro_cities %>% head() We will also do some quick data wrangling on the city names. Two of the city names on the Wikipedia page (Istanbul and Moscow) had footnotes. Because of this, those two city names had extra bracket characters when we read them in (e.g. “Istanbul[a]”). We want to strip off those footnotes. To do this we will once again use str_replace() to strip away the information that we don’t want. #========================================================================== # REMOVE "notes" FROM CITY NAMES # - two cities had extra characters for footnotes # ... we will remove these using stringr::str_replace and dplyr::mutate() #========================================================================== df.euro_cities <- df.euro_cities %>% mutate(city = str_replace(city, "\$.\$","")) df.euro_cities %>% head() For the sake of making the data a little easier to explain, we’re going to filter the data to records where the population is over 1,000,000. Keep in mind: this is a straightforward use of dplyr::filter(); this is the sort of thing that you should be able to do with your eyes closed. #========================= # REMOVE CITIES UNDER 1 MM #========================= df.euro_cities <- filter(df.euro_cities, population >= 1000000) #================= # COERCE TO TIBBLE #================= df.euro_cities <- df.euro_cities %>% as_tibble() Before we map the cities on a map, we need to get geospatial information. That is, we need to geocode these records. To do this, we will use the geocode() function to get the longitude and latitude. After obtaining the geo data, we will join it back to the original data using cbind(). #======================================================== # GEOCODE # - here, we're just getting longitude and latitude data # using ggmap::geocode() #======================================================== data.geo <- geocode(df.euro_cities$city) df.euro_cities <- cbind(df.euro_cities, data.geo) #inspect df.euro_cities

To map the data points, we also need a map that will sit in the background, underneath the points.

We will use the function map_data() to get a world map.

#============== # GET WORLD MAP #============== map.europe <- map_data("world")

Now that the data are clean, and we have a world map, we will plot the data.

#================================= # PLOT BASIC MAP # - this map is "just the basics" #================================= ggplot() + geom_polygon(data = map.europe, aes(x = long, y = lat, group = group)) + geom_point(data = df.euro_cities, aes(x = lon, y = lat, size = population), color = "red", alpha = .3) + coord_cartesian(xlim = c(-9,45), ylim = c(32,70))

This first plot is a “first iteration.” In this version, we haven’t done any serious formatting. It’s just a “first pass” to make sure that the data are in the right format. If we had found anything “out of line,” we would go back to an earlier part of the analysis and modify our code to correct any problems in the data.

Based on this plot, it looks like the data are essentially correct.

Now, we just want to “polish” the visualization by changing colors, fonts, sizes, etc.

#==================================================== # PLOT 'POLISHED' MAP # - this version is formatted and cleaned up a little # just to make it look more aesthetically pleasing #==================================================== #------------- # CREATE THEME #------------- theme.maptheeme <- theme(text = element_text(family = "Gill Sans", color = "#444444")) + theme(plot.title = element_text(size = 32)) + theme(plot.subtitle = element_text(size = 16)) + theme(panel.grid = element_blank()) + theme(axis.text = element_blank()) + theme(axis.ticks = element_blank()) + theme(axis.title = element_blank()) + theme(legend.background = element_blank()) + theme(legend.key = element_blank()) + theme(legend.title = element_text(size = 18)) + theme(legend.text = element_text(size = 10)) + theme(panel.background = element_rect(fill = "#596673")) + theme(panel.grid = element_blank()) #------ # PLOT #------ #fill = "#AAAAAA",colour = "#818181", size = .15) ggplot() + geom_polygon(data = map.europe, aes(x = long, y = lat, group = group), fill = "#DEDEDE",colour = "#818181", size = .15) + geom_point(data = df.euro_cities, aes(x = lon, y = lat, size = population), color = "red", alpha = .3) + geom_point(data = df.euro_cities, aes(x = lon, y = lat, size = population), color = "red", shape = 1) + coord_cartesian(xlim = c(-9,45), ylim = c(32,70)) + labs(title = "European Cities with Large Populations", subtitle = "Cities with over 1MM population, within city limits") + scale_size_continuous(range = c(.7,15), breaks = c(1100000, 4000000, 8000000, 12000000), name = "Population", labels = scales::comma_format()) + theme.maptheeme

Not too bad.

Keep in mind that as a reader, you get to see the finished product: the finalized visualization and the finalized code.

But as always, the process for creating a visualization like this is highly iterative. If you work on a similar project, expect to change your code dozens of times. You’ll change your data-wrangling code as you work with the data and identify new items you need to change or fix. You’ll also change your ggplot() visualization code multiple times as you try different colors, fonts, and settings.

If you master the basics, the hard things never seem hard

Creating this visualization is actually not terribly hard to do, but if you’re somewhat new to R, it might seem rather challenging.

If you look at this, and it seems difficult then you need to understand: once you master the basics, the hard things never seem hard.

What I mean by that, is that this visualization is nothing more than a careful application of a few dozen simple tools, arranged in a way to create something new.

Once you master individual tools from ggplot2, dplyr, and the rest of the Tidyverse, projects like this become very easy to execute.

Sign up now, and discover how to rapidly master data science

To rapidly master data science, you need to master the essential tools.

You need to know what tools are important, which tools are not important, and how to practice.

Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.

Sign up now for our email list, and you’ll receive regular tutorials and lessons.

You’ll learn:

• What data science tools you should learn (and what not to learn)
• How to practice those tools
• How to put those tools together to execute analyses and machine learning projects
• … and more

If you sign up for our email list right now, you’ll also get access to our “Data Science Crash Course” for free.

SIGN UP NOW

The post Tidyverse practice: mapping large European cities appeared first on SHARP SIGHT LABS.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

rtimicropem: Using an *R* package as platform for harmonized cleaning of data from RTI MicroPEM air quality sensors

Tue, 08/29/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

As you might remember from my blog post about ropenaq, I work as a data manager and statistician for an epidemiology project called CHAI for Cardio-vascular health effects of air pollution in Telangana, India. One of our interests in CHAI is determining exposure, and sources of exposure, to PM2.5 which are very small particles in the air that have diverse adverse health effects. You can find more details about CHAI in our recently published protocol paper. In this blog post that partly corresponds to the content of my useR! 2017 lightning talk, I'll present a package we wrote for dealing with the output of a scientific device, which might remind you of similar issues in your experimental work.

Why write the rtimicropem package?

Part of the CHAI project is a panel study involving about 40 people wearing several devices, as you see above. The devices include a GPS, an accelerometer, a wearable camera, and a PM2.5 monitor outputting time-resolved data (the grey box on the left). Basically, with this device, the RTI MicroPEM, we get one PM2.5 exposure value every 10 seconds. This is quite exciting, right? Except that we have two main issues with it…

First of all, the output of the device, a file with a ".csv" extension corresponding to a session of measurements, in our case 24 hours of measurements, is not really a csv. The header contains information about settings of the device for that session, and then comes the actual table with measurements.

Second, since the RTI MicroPEMs are nice devices but also a work-in-progress, we had some problems with the data, such as negative relative humidity. Because of these issues, we decided to write an R package whose three goals were to:

• Transform the output of the device into something more usable.

• Allow the exploration of individual files after a day in the field.

• Document our data cleaning process.

We chose R because everything else in our project, well data processing, documentation and analysis, was to be implemented in R, and because we wanted other teams to be able to use our package.

Features of rtimicropem: transform, explore and learn about data cleaning

First things first, our package lives here and is on CRAN. It has a nice documentation website thanks to pkgdown.

Transform and explore single files

In rtimicropem after the use of the convert_output function, one gets an object of the R6 class micropem class. Its fields include the settings and measurements as two data.frames, and it has methods such as summary and plot for which you see the static output below (no unit on this exploratory plot).

The plot method can also outputs an interactive graph thanks to rbokeh.

While these methods can be quite helpful to explore single files as an R user, they don't help non R users a lot. Because we wanted members of our team working in the field to be able to explore and check files with no R knowledge, we created a Shiny app that allows to upload individual files and then look at different tabs, including one with a plot, one with the summary of measurements, etc. This way, it was easy to spot a device failure for instance, and to plan a new measurement session with the corresponding participant.

Transform a bunch of files

At the end of the CHAI data collection, we had more than 250 MicroPEM files. In order to prepare them for further processing we wrote the batch_convert function that saves the content of any number of MicroPEM files as two (real!) csv, one with the measurements, one with the settings.

Learn about data cleaning

As mentioned previously, we experienced issues with MicroPEM data quality. Although we had heard other teams complain of similar problems, in the literature there were very few details about data cleaning. We decided to gather information from other teams and the manufacturer and to document our own decisions, e.g. remove entire files based on some criteria, in a vignette of the package. This is our transparent answer to the question "What was your experience with MicroPEMs?" which we get often enough from other scientists interested in PM2.5 exposure.

Place of rtimicropem in the R package ecosystem

When preparing rtimicropem submission to rOpenSci, I started wondering whether one would like to have one R package for each scientific device out there. In our case, having the weird output to deal with, and the lack of a central data issues documentation place, were enough of a motivation. But maybe one could hope that manufacturers of scientific devices would focus a bit more on making the output format analysis-friendly, and that the open documentation of data issues would be language-agnostic and managed by the manufacturers themselves. In the meantime, we're quite proud to have taken the time to create and share our experience with rtimicropem, and have already heard back from a few users, including one who found the package via googling "RTI MicroPEM data"! Another argument I in particular have to write R packages for dealing with scientific data is that it might motivate people to learn R, but this is maybe a bit evil.

What about the place of rtimicropem in the rOpenSci package collection? After very useful reviews by Lucy D'Agostino McGowan and Kara Woo our package got onboarded which we were really thankful for and happy about. Another package I can think off the top of my head to deal with the output of a scientific tool is plater. Let me switch roles from CHAI team member to rOpenSci onboarding co-editor here and do some advertisement… Such packages are unlikely to become the new ggplot2 but their specialization doesn't make them less useful and they fit very well in the "data extraction" of the onboarding categories. So if you have written such a package, please consider submitting it! It'll get better thanks to review and might get more publicity as part of a larger software ecosystem. For the rtimicropem submission we took advantage of the joint submission process of rOpenSci and the Journal of Open Source Software, JOSS, so now our piece of software has its JOSS paper with a DOI. And hopefully, having more submissions of packages for scientific hardware might inspire R users to package up the code they wrote to use the output of their scientific tools!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

RcppSMC 0.2.0

Tue, 08/29/2017 - 04:38

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version 0.2.0 of the RcppSMC package arrived on CRAN earlier today (as a very quick pretest-publish within minutes of submission).

RcppSMC provides Rcpp-based bindings to R for the Sequential Monte Carlo Template Classes (SMCTC) by Adam Johansen described in his JSS article.

This release 0.2.0 is chiefly the work of Leah South, a Ph.D. student at Queensland University of Technology, who was during the last few months a Google Summer of Code student mentored by Adam and myself. It was pleasure to work with Leah on this, and see her progress. Our congratulations to Leah for a job well done!

Changes in RcppSMC version 0.2.0 (2017-08-28)
• Also use .registration=TRUE in useDynLib in NAMESPACE

• Multiple Sequential Monte Carlo extensions (Leah South as part of Google Summer of Code 2017)

• Switching to population level objects (#2 and #3).

• Using Rcpp attributes (#2).

• Using automatic RNGscope (#4 and #5).

• Adding multiple normalising constant estimators (#7).

• Static Bayesian model example: linear regression (#10 addressing #9).

• Adding a PMMH example (#13 addressing #11).

• Framework for additional algorithm parameters and adaptation (#19 addressing #16; also #24 addressing #23).

• Common adaptation methods for static Bayesian models (#20 addressing #17).

• Supporting MCMC repeated runs (#21).

• Adding adaptation to linear regression example (#22 addressing #18).

Courtesy of CRANberries, there is a diffstat report for this release.

More information is on the RcppSMC page. Issues and bugreports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

RStudio::Conf 2018

Tue, 08/29/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

It’s not even Labor Day, so it seems to be a bit early to start planning for next year’s R conferences. But, early-bird pricing for RStudio::Conf 2018 ends this Thursday.

The conference which will be held in San Diego between January 31st and February 3rd promises to match and even surpass this year’s event. In addition to keynotes from Di Cook (Monash University and Iowa State University), J.J. Allaire (RStudio Founder, CEO & Principal Developer), Shiny creator Joe Cheng, and Chief Scientist Hadley Wickham, a number of knowledgeable (and entertaining) speakers have already committed including quant, long-time R user and twitter humorist JD Long (@CMastication), Stack Overflow’s David Robinson (@drob) and ProPublica editor Olga Pierce (@olgapierce).

Making the deadline for early-bird pricing will get you a significant savings. The “full seat” price for the conference is $495 before midnight EST August 31, 2017 and$695 thereafter. You can register here.

For a good idea of the kinds of things you can expect to learn at RStudio::Conf 2018 have a look at the videos from this year’s event.

_____='https://rviews.rstudio.com/2017/08/29/rstudio-conf-2018/';

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Shiny Dev Center gets a shiny new update

Tue, 08/29/2017 - 02:00

I am excited to announce the redesign and reorganization of shiny.rstudio.com, also known as the Shiny Dev Center. The Shiny Dev Center is the place to go to learn about all things Shiny and to keep up to date with it as it evolves.

The goal of this refresh is to provide a clear learning path for those who are just starting off with developing Shiny apps as well as to make advanced Shiny topics easily accessible to those building large and complex apps. The articles overview that we designed to help navigate the wealth of information on the Shiny Dev Center aims to achieve this goal.

Other highlights of the refresh include:

• A brand new look!
• New articles
• Updated articles with modern Shiny code examples
• Explicit linking, where relevant, to other RStudio resources like webinars, support docs, etc.
• A prominent link to our ever growing Shiny User Showcase
• A guide for contributing to Shiny (inspired by the Tidyverse contribute guide)

Stay tuned for more updates to the Shiny Dev Center in the near future!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

6 R Jobs for R users (2017-08-28) – from all over the world

Tue, 08/29/2017 - 00:23
To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs

More New Jobs
1. Freelance
Optimization Expert IdeaConnection – Posted by Sherri Ann O’Gorman
Anywhere
22 Aug2017
2. Full-Time
Data journalist for The Economist @London The Economist – Posted by cooberp
London England, United Kingdom
18 Aug2017
3. Full-Time
Technical Business Analyst Investec Asset Management – Posted by IAM
London England, United Kingdom
14 Aug2017
4. Full-Time
Senior Data Scientist @ Dallas, Texas, U.S. Colaberry Data Analytics – Posted by Colaberry_DataAnalytics
Dallas Texas, United States
8 Aug2017
5. Full-Time
Financial Analyst/Modeler @ Mesa, Arizona, U.S. MD Helicopters – Posted by swhalen
Mesa Arizona, United States
31 Jul2017
6. Full-Time
Research volunteer in Cardiac Surgery @ Philadelphia, Pennsylvania, U.S. Thomas Jefferson University – Posted by CVSurgery
Philadelphia Pennsylvania, United States
31 Jul2017

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

Le Monde puzzle [#1018]

Tue, 08/29/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

An arithmetic Le Monde mathematical puzzle (that first did not seem to involve R programming because of the large number of digits in the quantity involved):

An integer x with less than 100 digits is such that adding the digit 1 on both sides of x produces the integer 99x.  What are the last nine digits of x? And what are the possible numbers of digits of x?

The integer x satisfies the identity

where ω is the number of digits of x. This amounts to

10….01 = 89 x,

where there are ω zeros. Working with long integers in R could bring an immediate solution, but I went for a pedestrian version, handling each digit at a time and starting from the final one which is necessarily 9:

#multiply by 9 rap=0;row=NULL for (i in length(x):1){ prud=rap+x[i]*9 row=c(prud%%10,row) rap=prud%/%10} row=c(rap,row) #multiply by 80 rep=raw=0 for (i in length(x):1){ prud=rep+x[i]*8 raw=c(prud%%10,raw) rep=prud%/%10} #find next digit y=(row[1]+raw[1]+(length(x)>1))%%10

returning

7 9 7 7 5 2 8 0 9

as the (only) last digits of x. The same code can be exploited to check that the complete multiplication produces a number of the form 10….01, hence to deduce that the length of x is either 21 or 65, with solutions

[1] 1 1 2 3 5 9 5 5 0 5 6 1 7 9 7 7 5 2 8 0 9 [1] 1 1 2 3 5 9 5 5 0 5 6 1 7 9 7 7 5 2 8 0 8 9 8 8 7 6 4 0 4 4 9 4 3 8 2 0 2 2 [39] 4 7 1 9 1 0 1 1 2 3 5 9 5 5 0 5 6 1 7 9 7 7 5 2 8 0 9

The maths question behind is to figure out the powers k of 10 such that

For instance, 10²≡11 mod (89) and 11¹¹≡88 mod (89) leads to the first solution ω=21. And then, since 10⁴⁴≡1 mod (89), ω=21+44=65 is another solution…

Filed under: Books, Kids, R Tagged: arithmetics, competition, Le Monde, long division, mathematical puzzle, R

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Packages to simplify mapping in R

Mon, 08/28/2017 - 23:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Computerworld's Sharon Machlis has published a very useful tutorial on creating geographic data maps with R. (The tutorial was actually published back in March, but I only came across it recently.) While it's been possible to create maps in R for a long time, some recent packages and data APIs have made the process much simpler. The tutorial is based on the following R packages:

• sf, a package that provides simple data structures for objects representing spatial geometries including map features like borders and rivers
• tmap and tmaptools, packages for creating static and interactive "thematic maps" like choropleths and spatial point maps using a ggplot2-like syntax
• tigris, a package that provides shapefiles you can use to map the USA and its states, counties, and census tracts (very useful for visualizing US census data, which you can easily obtain with the tidycensus package)
• rio, a package to streamline the import of flat files from third-party data sources like the U.S. Bureau of Labor Statistics

With those packages, as you'll learn in the tutorial linked below, you can easily create attractive maps to visualize geographic data, and even make those graphics interactive with scrolling, zooming, and data pop-ups thanks to the capabilities of the leaflet package. All of these packages are now available on CRAN, so there's no need to install from Github as the tutorial then suggested, unless you want access to even newer in-development capabilities.

Computerworld: Mapping in R just got a whole lot easier

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Neat New seplyr Feature: String Interpolation

Mon, 08/28/2017 - 22:33

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

The R package seplyr has a neat new feature: the function seplyr::expand_expr() which implements what we call “the string algebra” or string expression interpolation. The function takes an expression of mixed terms, including: variables referring to names, quoted strings, and general expression terms. It then “de-quotes” all of the variables referring to quoted strings and “dereferences” variables thought to be referring to names. The entire expression is then returned as a single string.

This provides a powerful way to easily work complicated expressions into the seplyr data manipulation methods.

The method is easiest to see with an example:

library("seplyr") ## Loading required package: wrapr ratio <- 2 compCol1 <- "Sepal.Width" expr <- expand_expr("Sepal.Length" >= ratio * compCol1) print(expr) ## [1] "Sepal.Length >= ratio * Sepal.Width"

expand_expr works by capturing the user supplied expression unevaluated, performing some transformations, and returning the entire expression as a single quoted string (essentially returning new source code).

Notice in the above one layer of quoting was removed from "Sepal.Length" and the name referred to by “compCol1” was substituted into the expression. “ratio” was left alone as it was not referring to a string (and hence can not be a name; unbound or free variables are also left alone). So we see that the substitution performed does depend on what values are present in the environment.

If you want to be stricter in your specification, you could add quotes around any symbol you do not want de-referenced. For example:

expand_expr("Sepal.Length" >= "ratio" * compCol1) ## [1] "Sepal.Length >= ratio * Sepal.Width"

After the substitution the returned quoted expression is exactly in the form seplyr expects. For example:

resCol1 <- "Sepal_Long" datasets::iris %.>% mutate_se(., resCol1 := expr) %.>% head(.) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_Long ## 1 5.1 3.5 1.4 0.2 setosa FALSE ## 2 4.9 3.0 1.4 0.2 setosa FALSE ## 3 4.7 3.2 1.3 0.2 setosa FALSE ## 4 4.6 3.1 1.5 0.2 setosa FALSE ## 5 5.0 3.6 1.4 0.2 setosa FALSE ## 6 5.4 3.9 1.7 0.4 setosa FALSE

Details on %.>% (dot pipe) and := (named map builder) can be found here and here respectively. The idea is: seplyr::mutate_se(., "Sepal_Long" := "Sepal.Length >= ratio * Sepal.Width") should be equilant to dplyr::mutate(., Sepal_Long = Sepal.Length >= ratio * Sepal.Width).

seplyr also provides an number of seplyr::*_nse() convenience forms wrapping all of these steps into one operation. For example:

datasets::iris %.>% mutate_nse(., resCol1 := "Sepal.Length" >= ratio * compCol1) %.>% head(.) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_Long ## 1 5.1 3.5 1.4 0.2 setosa FALSE ## 2 4.9 3.0 1.4 0.2 setosa FALSE ## 3 4.7 3.2 1.3 0.2 setosa FALSE ## 4 4.6 3.1 1.5 0.2 setosa FALSE ## 5 5.0 3.6 1.4 0.2 setosa FALSE ## 6 5.4 3.9 1.7 0.4 setosa FALSE

To use string literals you merely need one extra layer of quoting:

"is_setosa" := expand_expr(Species == "'setosa'") ## is_setosa ## "Species == \"setosa\"" datasets::iris %.>% transmute_nse(., "is_setosa" := Species == "'setosa'") %.>% summary(.) ## is_setosa ## Mode :logical ## FALSE:100 ## TRUE :50

The purpose of all of the above is to mix names that are known while we are writing the code (these are quoted) with names that may not be known until later (i.e., column names supplied as parameters). This allows the easy creation of useful generic functions such as:

countMatches <- function(data, columnName, targetValue) { # extra quotes to say we are interested in value, not de-reference targetSym <- paste0('"', targetValue, '"') data %.>% transmute_nse(., "match" := columnName == targetSym) %.>% group_by_se(., "match") %.>% summarize_se(., "count" := "n()") } countMatches(datasets::iris, "Species", "setosa") ## # A tibble: 2 x 2 ## match count ## ## 1 FALSE 100 ## 2 TRUE 50

The purpose of the seplyr string system is to pull off quotes and de-reference indirect variables. So, you need to remember to add enough extra quotation marks to prevent this where you do not want it.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Non-Linear Regression: Application to Monoclonal Peak Integration in Serum Protein Electrophoresis

Mon, 08/28/2017 - 18:26

(This article was first published on The Lab-R-torian, and kindly contributed to R-bloggers)

Background

At the AACC meeting recently, there was an enthusiastic discussion of standardization of reporting for serum protein electrophoresis (SPEP) presented by a working group headed up by Dr. Chris McCudden and Dr. Ron Booth, both of the University of Ottawa. One of the discussions pertained to how monoclonal bands, especially small ones, should be integrated. While many use the default manual vertical gating or “drop” method offered by Sebia's Phoresis software, Dr. David Keren was discussing the value of tangent skimming as a more repeatable and effective means of monoclonal protein quantitation. He was also discussing some biochemical approaches distinguishing monoclonal proteins from the background gamma proteins.

The drop method is essentially an eye-ball approach to where the peak starts and ends and is represented by the vertical lines and the enclosed shaded area.

The tangent skimming approach is easier to make reproducible. In the mass spectrometry world it is a well-developed approach with a long history and multiple algorithms in use. This is apparently the book. However, when tangent skimming is employed in SPEP, unless I am mistaken, it seems to be done by eye. The integration would look like this:

During the discussion it was point out that peak deconvolution of the monoclonal protein from the background gamma might be preferable to either of the two described procedures. By this I mean integration as follows:

There was discussion this procedure is challenging for number of reasons. Further, it should be noted that there will only likely be any clinical value in a deconvolution approach when the concentration of the monoclonal protein is low enough that manual integration will show poor repeatability, say < 5 g/L = 0.5 g/dL.

Easy Peaks

Fitting samples with larger monoclonal peaks is fairly easy. Fitting tends to converge nicely and produce something meaningful. For example, using the approach I am about to show below, an electropherogram like this:

with a gamma region looking like this:

can be deconvoluted with straightforward non-linear regression (and no baseline subtraction) to yield this:

and the area of the green monoclonal peak is found to be 5.3%.

More Difficult Peaks

What is more challenging is the problem of small monoclonals buried in normal $$\gamma$$-globulins. These could be difficult to integrate using a tangent skimming approach, particularly without image magnification. For the remainder of this post we will use a gel with a small monoclonal in the fast gamma region shown at the arrow.

Getting the Data

EP data can be extracted from the PDF output from any electrophoresis software. This is not complicated and can be accomplished with pdf2svg or Inkscape and some Linux bash scripting. I'm sure we can get it straight from the instrument but it is not obvious to me how to do this. One could also rescan a gel and use ImageJ to produce a densitometry scan which is discussed in the ImageJ documentation and on YouTube. ImageJ also has a macro language for situations where the same kind of processing is done repeatedly.

Smoothing

The data has 10284 pairs of (x,y) data. But if you blow up on it and look carefully you find that it is a series of staircases.

plot(y~x, data = head(ep.data,100), type = "o", cex = 0.5)

It turns out that this jaggedness significantly impairs attempts to numerically identify the peaks and valleys. So, I smoothed it a little using the handy rle() function to identify the midpoint of each step. This keeps the total area as close to its original value as possible–though this probably does not matter too much.

ep.rle <- rle(ep.data$y) stair.midpoints <- cumsum(ep.rle$lengths) - floor(ep.rle$lengths/2) ep.data.sm <- ep.data[stair.midpoints,] plot(y~x, data = head(ep.data,300), type = "o", cex = 0.5) points(y~x, data = head(ep.data.sm,300), type = "o", cex = 0.5, col = "red") Now that we are satisfied that the new data is OK, I will overwrite the original dataframe. ep.data <- ep.data.sm Transformation The units on the x and y-axes are arbitrary and come from page coordinates of the PDF. We can normalize the scan by making the x-axis go from 0 to 1 and by making the total area 1. library(Bolstad) #A package containing a function for Simpon's Rule integration ep.data$x <- ep.data$x/max(ep.data$x) A.tot <- sintegral(ep.data$x,ep.data$y)$value ep.data$y <- ep.data$y/A.tot #sanity check sintegral(ep.data$x,ep.data$y)$value

## [1] 1

plot(y~x, data = ep.data, type = "l")

Find Extrema

Using the findPeaks function from the quantmod package we can find the minima and maxima:

library(quantmod) ep.max <- findPeaks(ep.data$y) plot(y~x, data = ep.data, type = "l", main = "Maxima") abline(v = ep.data$x[ep.max], col = "red", lty = 2)

ep.min <- findValleys(ep.data$y) plot(y~x, data = ep.data, type = "l", main = "Minima") abline(v = ep.data$x[ep.min], col = "blue", lty = 2)

Not surprisingly, there are some extraneous local extrema that we do not want. I simply manually removed them. Generally, this kind of thing could be tackled with more smoothing of the data prior to analysis.

ep.max <- ep.max[-1] ep.min <- ep.min[-c(1,length(ep.min))]

Fitting

Now it's possible with the nls() function to fit the entire SPEP with a series of Gaussian curves simultaneously. It works just fine (provided you have decent initial estimates of $$\mu_i$$ and $$\sigma_i$$) but there is no particular clinical value to fitting the albumin, $$\alpha_1$$, $$\alpha_2$$, $$\beta_1$$ and $$\beta_2$$ domains with Gaussians. What is of interest is separately quantifying the two peaks in $$\gamma$$ with two separate Gaussians so let's isolate the $$\gamma$$ region based on the location of the minimum between $$\beta_2$$ and $$\gamma$$.

Isolate the $$\gamma$$ Region

gamma.ind <- max(ep.min):nrow(ep.data) gamma.data <- data.frame(x = ep.data$x[gamma.ind], y = ep.data$y[gamma.ind]) plot(y ~ x, gamma.data, type = "l")

Attempt Something that Ultimately Does Not Work

At first I thought I could just throw two normal distributions at this and it would work. However, it does not work well at all and this kind of not-so-helpful fit turns out to happen a fair bit. I use the nls() function here which is easy to call. It requires a functional form which I set to be:

$y = C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big) + C_2 \exp \Big({-\frac{(x-\mu_2)^2}{2\sigma_2^2}}\Big)$

where $$\mu_1$$ is the $$x$$ location of the first peak in $$\gamma$$ and $$\mu_2$$ is the $$x$$ location of the second peak in $$\gamma$$. The estimates of $$\sigma_1$$ and $$\sigma_2$$ can be obtained by trying to estimate the full-width-half-maximum (FWHM) of the peaks, which is related to $$\sigma$$ by

$FWHM_i = 2 \sqrt{2\ln2} \times \sigma_i = 2.355 \times \sigma_i$

I had to first make a little function that returns the respective half-widths at half-maximum and then uses them to estimate the $$FWHM$$. Because the peaks are poorly resolved, it also tries to get the smallest possible estimate returning this as FWHM2.

FWHM.finder <- function(ep.data, mu.index){ peak.height <- ep.data$y[mu.index] fxn.for.roots <- ep.data$y - peak.height/2 indices <- 1:nrow(ep.data) root.indices <- which(diff(sign(fxn.for.roots))!=0) tmp <- c(root.indices,mu.index) %>% sort tmp2 <- which(tmp == mu.index) first.root <- root.indices[tmp2 -1] second.root <- root.indices[tmp2] HWHM1 <- ep.data$x[mu.index] - ep.data$x[first.root] HWHM2 <- ep.data$x[second.root] - ep.data$x[mu.index] FWHM <- HWHM2 + HWHM1 FWHM2 = 2*min(c(HWHM1,HWHM2)) return(list(HWHM1 = HWHM1,HWHM2 = HWHM2,FWHM = FWHM,FWHM2 = FWHM2)) }

The peak in the $$\gamma$$ region was obtained previously:

plot(y ~ x, gamma.data, type = "l") gamma.max <- findPeaks(gamma.data$y) abline(v = gamma.data$x[gamma.max])

and from them $$\mu_1$$ is determined to be 0.7. We have to guess where the second peak is, which is at about $$x=0.75$$ and has an index of 252 in the gamma.data dataframe.

gamma.data[252,]

## x y ## 252 0.7487757 0.6381026

#append the second peak gamma.max <- c(gamma.max,252) gamma.mu <- gamma.data$x[gamma.max] gamma.mu ## [1] 0.6983350 0.7487757 plot(y ~ x, gamma.data, type = "l") abline(v = gamma.data$x[gamma.max])

Now we can find the estimates of the standard deviations:

#find the FWHM estimates of sigma_1 and sigma_2: FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.data) gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355

The estimates of $$\sigma_1$$ and $$\sigma_2$$ are now obtained. The estimates of $$C_1$$ and $$C_2$$ are just the peak heights.

peak.heights <- gamma.data$y[gamma.max] We can now use nls() to determine the fit. fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) + C2*exp(-(x-mean2)**2/(2 * sigma2**2))), data = gamma.data, start = list(mean1 = gamma.mu[1], mean2 = gamma.mu[2], sigma1 = gamma.sigma[1], sigma2 = gamma.sigma[2], C1 = peak.heights[1], C2 = peak.heights[2]), algorithm = "port") Determining the fitted values of our unknown coefficients: dffit <- data.frame(x=seq(0, 1 , 0.001)) dffit$y <- predict(fit, newdata=dffit) fit.sum <- summary(fit) fit.sum #show the fitted coefficients

## ## Formula: y ~ (C1 * exp(-(x - mean1)^2/(2 * sigma1^2)) + C2 * exp(-(x - ## mean2)^2/(2 * sigma2^2))) ## ## Parameters: ## Estimate Std. Error t value Pr(>|t|) ## mean1 0.7094793 0.0003312 2142.23 <2e-16 *** ## mean2 0.7813900 0.0007213 1083.24 <2e-16 *** ## sigma1 0.0731113 0.0002382 306.94 <2e-16 *** ## sigma2 0.0250850 0.0011115 22.57 <2e-16 *** ## C1 0.6983921 0.0018462 378.29 <2e-16 *** ## C2 0.0819704 0.0032625 25.12 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.01291 on 611 degrees of freedom ## ## Algorithm "port", convergence message: both X-convergence and relative convergence (5)

coef.fit <- fit.sum$coefficients[,1] mu.fit <- coef.fit[1:2] sigma.fit <- coef.fit[3:4] C.fit <- coef.fit[5:6] And now we can plot the fitted results against the original results: #original plot(y ~ x, data = gamma.data, type = "l", main = "This is Garbage") #overall fit lines(y ~ x, data = dffit, col ="red", cex = 0.2) legend("topright", lty = c(1,1,1), col = c("black", "green", "blue","red"), c("Scan", "Monoclonal", "Gamma", "Sum")) #components of the fit for(i in 1:2){ x <- dffit$x y <- C.fit[i] *exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2)) lines(x,y, col = i + 2) }

And this is garbage. The green curve is supposed to be the monoclonal peak, the blue curve is supposed to be the $$\gamma$$ background, and the red curve is their sum, the overall fit. This is a horrible failure.

Subsequently, I tried fixing the locations of $$\mu_1$$ and $$\mu_2$$ but this also yielded similar nonsensical fitting. So, with a lot of messing around trying different functions like the lognormal distribution, the Bi-Gaussian distribution and the Exponentially Modified Gaussian distribution, and applying various arbitrary weighting functions, and simultaneously fitting the other regions of the SPEP, I concluded that nothing could predictably produce results that represented the clinical reality.

I thought maybe the challenge to obtain a reasonable fit related to the sloping baseline, so I though I would try to remove it. I will model the baseline in the most simplistic manner possible: as a sloped line.

Baseline Removal

I will arbitrarily define the tail of the $$\gamma$$ region to be those values having $$y \leq 0.02$$. Then I will connect the first (x,y) point from the $$\gamma$$ region and connect it to the tail.

gamma.tail <- filter(gamma.data, y <= 0.02) baseline.data <- rbind(gamma.data[1,],gamma.tail) names(baseline.data) <- c("x","y") baseline.fun <- approxfun(baseline.data) plot(y~x, data = gamma.data, type = "l") lines(baseline.data$x,baseline.fun(baseline.data$x), col = "blue")

Now we can define a new dataframe gamma.no.base that has the baseline removed:

gamma.no.base <- data.frame(x = gamma.data$x, y = gamma.data$y - baseline.fun(gamma.data$x)) plot(y~x, data = gamma.data, type = "l") lines(y ~ x, data = gamma.no.base, lty = 2) gamma.max <- findPeaks(gamma.no.base$y)[1:2] #rejects a number of extraneous peaks abline(v = gamma.no.base$x[gamma.max]) The black is the original $$\gamma$$ and the dashed has the baseline removed. This becomes and easy fit. #Estimate the Ci peak.heights <- gamma.no.base$y[gamma.max] #Estimate the mu_i gamma.mu <- gamma.no.base$x[gamma.max] #the same values as before #Estimate the sigma_i from the FWHM FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.no.base) gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355 #Perform the fit fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) + C2*exp(-(x-mean2)**2/(2 * sigma2**2))), data = gamma.no.base, start = list(mean1 = gamma.mu[1], mean2 = gamma.mu[2], sigma1 = gamma.sigma[1], sigma2 = gamma.sigma[2], C1 = peak.heights[1], C2 = peak.heights[2]), algorithm = "port") #Plot the fit dffit <- data.frame(x=seq(0, 1 , 0.001)) dffit$y <- predict(fit, newdata=dffit) fit.sum <- summary(fit) coef.fit <- fit.sum$coefficients[,1] mu.fit <- coef.fit[1:2] sigma.fit <- coef.fit[3:4] C.fit <- coef.fit[5:6] plot(y ~ x, data = gamma.no.base, type = "l") legend("topright", lty = c(1,1,1), col = c("black", "green", "blue","red"), c("Scan", "Monoclonal", "Gamma", "Sum")) lines(y ~ x, data = dffit, col ="red", cex = 0.2) for(i in 1:2){ x <- dffit$x y <- C.fit[i] *exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2)) lines(x,y, col = i + 2) }

Lo and behold…something that is not completely insane. The green is the monoclonal, the blue is the $$\gamma$$ background and the red is their sum, that is, the overall fit. A better fit could now we sought with weighting or with a more flexible distribution shape. In any case, the green peak is now easily determined. Since

$\int_{-\infty}^{\infty} C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big)dx = \sqrt{2\pi}\sigma C_1$

A.mono <- sqrt(2*pi)*sigma.fit[1]*C.fit[1] %>% unname() A.mono <- round(A.mono,3) A.mono

## sigma1 ## 0.024

So this peak is 2.4% of the total area. Now, of course, this assumes that nothing under the baseline is attributable to the monoclonal peak and all belongs to normal $$\gamma$$-globulins, which is very unlikely to be true. However, the drop and tangent skimming methods also make assumptions about how the area under the curve contributes to the monoclonal protein. The point is to try to do something that will produce consistent results that can be followed over time. Obviously, if you thought there were three peaks in the $$\gamma$$-region, you'd have to set up your model accordingly.

All about that Base(line)

There are obviously better ways to model the baseline because this approach of a linear baseline is not going to work in situations where, for example, there is a small monoclonal in fast $$\gamma$$ dwarfed by normal $$\gamma$$-globulins. That is, like this:

Something curvilinear or piecewise continuous and flexible enough for more circumstances is generally required.

There is also no guarantee that baseline removal, whatever the approach, is going to be a good solution in other circumstances. Given the diversity of monoclonal peak locations, sizes and shapes, I suspect one would need a few different approaches for different circumstances.

Conclusions
• The data in the PDFs generated by EP software are processed (probably with splining or similar) followed by the stair-stepping seen above. It would be better to work with raw data from the scanner.

• Integrating monoclonal peaks under the $$\gamma$$ baseline (or $$\beta$$) is unlikely to be a one-size-fits all approach and may require application of a number of strategies to get meaningful results.

• Basline removal might be helpful at times.
• Peak integration will require human adjudication.

• While most monoclonal peaks show little skewing, better fitting is likely to be obtained with distributions that afford some skewing.

• MASSFIX may soon make this entire discussion irrelevant.

Parting Thought

On the matter of fitting

In bringing many sons and daughters to glory, it was fitting that God, for whom and through whom everything exists, should make the pioneer of their salvation perfect through what he suffered.

Heb 2:10

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: The Lab-R-torian. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Recording of my Talk at the ACS Data Users Conference

Mon, 08/28/2017 - 18:00

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

I recently had the honor of speaking at the 2017 American Community Survey (ACS) Data Users Conference. My talk focused on choroplethr, the suite of R packages I’ve written for mapping open datasets in R. The ACS is the one of the main surveys that the US Census Bureau runs, and I was honored to share my work there.

My talk was recorded, and you can view it below:

Attending the conference was part of my effort to bridge the gap between the world of free data analysis tools (i.e. the R community) and the world of government datasets (which I consider the most interesting datasets around).

The conference was great. I have been meaning to write a blog post summarizing what I learned, but I simply haven’t had the time. Instead, I will point you to a recording of all the conference talks.

The post Recording of my Talk at the ACS Data Users Conference appeared first on AriLamstein.com.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

One-way ANOVA in R

Mon, 08/28/2017 - 14:24

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Suppose as a business manager you have the responsibility for testing and comparing the lifetimes of four brands (Apollo, Bridgestone, CEAT and Falken) of automobile tyres. The lifetime of these sample observations are measured in mileage run in ’000 miles. For each brand of automobile tyre, sample of 15 observations have been collected. On the basis of these information, you have to take you decision regarding the four brands of automobile tyre. The data is provided in the csv file format (called, tyre.csv).

In order to test and compare the lifetimes of four brands of tyre, you should apply one-way ANOVA method as there is only one factor or criterion (mileage run) to classify the sample observations. The null and alternative hypothesis to be used is given as:

Data Import and Outlier Checking

Let us first import the data into R and save it as object ‘tyre’. The R codes to do this:

tyre<- read.csv(file.choose(),header = TRUE, sep = ",") attach(tyre)

Before doing anything, you should check the variable type as in ANOVA, you need categorical independent variable (here the factor or treatment variable ‘brand’. In R, you can use the following code:

is.factor(Brands) [1] TRUE

As the result is ‘TRUE’, it signifies that the variable ‘Brands’ is a categorical variable.
Now it is all set to run the ANOVA model in R. Like other linear model, in ANOVA also you should check the presence of outliers can be checked by boxplot. As there are four populations to study, you should use separate boxplot for each of the population. In R, it is done as:

boxplot(Mileage~Brands, main="Fig.-1: Boxplot of Mileage of Four Brands of Tyre", col= rainbow(4))

Gives this plot:

If you are using advanced graphics, you can use the ‘ggplot2’ package with the following code to get the above boxplot.

library(ggplot2) ggplot(tyre, aes(Brands,Mileage))+geom_boxplot(aes(col=Brands))+labs(title="Boxplot of Mileage of Four Brands of Tyre")

The above picture shows that there is one extreme observation in the CEAT brand. To find out the exact outliers or extreme observations, you can use the following command:

boxplot.stats(Mileage[Brands=="CEAT"]) $stats [1] 30.42748 33.11079 34.78336 36.12533 36.97277$n [1] 15 $conf [1] 33.55356 36.01316$out [1] 41.05

So, the outlier is the observation valued ‘41.05’. The confidence interval is (33.55 – 36.01) and the minimum and maximum values of the sample coming from the population ‘CEAT’ is 30.43 and 41.05 respectively. Considering all these points, we ignore the outlier value ‘41.05’ momentarily and carry out the analysis. If at later stage, we find that this outlier may create problems in the estimation, we will exclude it.

Estimation of Model

Let us now fit the model by using the aov() function in R.

model1<- aov(Mileage~Brands)

To get the result of the one-way ANOVA:

summary(model1) Df Sum Sq Mean Sq F value Pr(>F) Brands 3 256.3 85.43 17.94 2.78e-08 *** Residuals 56 266.6 4.76 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From the above results, it is observed that the F-statistic value is 17.94 and it is highly significant as the corresponding p-value is much less than the level of significance (1% or 0.01). Thus, it is wise to reject the null hypothesis of equal mean value of mileage run across all the tyre brands. In other words, the average mileage of the four tyre brands are not equal.
Now you have to find out the pair of brands which differ. For this you may use the Tukey’s HSD test. In R, the following are the code for applying the Tukey’s HSD test:

TukeyHSD(model1, conf.level = 0.99) Tukey multiple comparisons of means 99% family-wise confidence level Fit: aov(formula = Mileage ~ Brands) $Brands diff lwr upr p adj Bridgestone-Apollo -3.01900000 -5.6155816 -0.4224184 0.0020527 CEAT-Apollo -0.03792661 -2.6345082 2.5586550 0.9999608 Falken-Apollo 2.82553333 0.2289517 5.4221149 0.0043198 CEAT-Bridgestone 2.98107339 0.3844918 5.5776550 0.0023806 Falken-Bridgestone 5.84453333 3.2479517 8.4411149 0.0000000 Falken-CEAT 2.86345994 0.2668783 5.4600415 0.0037424 The TukeyHSD command shows the pair-wise difference of mileage of four brands of tyres at 1% level of significance. Here, the “diﬀ” column provides mean diﬀerences. The “lwr” and “upr” columns provide lower and upper 99% confidence bounds, respectively. Finally, the “p adj” column provides the p-values adjusted for the number of comparisons made. As there are four brands, six possible pair-wise comparisons are obtained. The results show that all the pairs of mileage are statistically significantly different from the Tukey’s Honestly Significant Differences, except for the pair CEAT-Apollo. More specifically, the pair-wise difference between Bridgestone-Apollo is found to be -3.019 which means that Apollo has higher mileage than Bridgestone and this is statistically significant. Similarly, you can make other pair-wise comparison. You can also plot these the results of Tukey’s HSD comparison by using the plot function as follows: plot(TukeyHSD(model1, conf.level = 0.99),las=1, col = "red") The diagram is show below: Another way of visualization by plotting means of mileage of four brands of tyre with the help of gplots package. By using the plotmeans() function in the gplots package, you can create the mean plots for single factors including confidence intervals. library(gplots) plotmeans(Mileage~Brands, main="Fig.-3: Mean Plot with 95% Confidence Interval", ylab = "Mileage run ('000 miles)", xlab = "Brands of Tyre") Gives this plot: Diagnostic Checking Diagnostic Checking: After estimating the ANOVA model and finding out the possible pairs of differences, it is now time to go for the different diagnostic checking with respect to model assumptions. The single call to plot function generates four diagnostic plots (Fig.-5). par(mfrow=c(2,2)) plot(model1) Give this plot: The Residuals vs. Fitted plot shown in the upper left of Fig.-4, shows the ﬁtted values plotted against the model residuals. If the residuals follow any particular pattern, such as a diagonal line, there may be other predictors not yet in the model that could improve it. The ﬂat lowess line looks very good as the single predictor variable or regressor sufficiently explaining the dependent variable. The Normal Q-Q Plot in the upper right of Fig.-4, shows the quantiles of the standardized residuals plotted against the quantiles you would expect if the data were normally distributed. Since these fall mostly on the straight line, the assumption of normally distributed residuals is met. Since there are only 15 observations in each individual brand of tyre, it is not wise to go for group-wise checking of normality assumption. Moreover, the normality of the overall residual can be checked by means of some statistical test such as Shapiro-Wilk test. Shortly I’ll show you this procedure too. The Scale-Location plot in the lower left of Fig.-4, shows the square root of the absolute standardized residuals plotted against the ﬁtted, or predicted, values. Since the lowess line that ﬁts this is fairly ﬂat, it indicates that the spread in the predictions is almost the same across the prediction line, indicating the very less chances of failure of meeting the assumption of homoscedasticity. This will be further verified by some statistical tests. In case of ANOVA, you can check the assumption of homogeneity of variances across the four brands of tyre. Finally, the Residuals vs. Leverage plot in the lower right corner, shows a measure of the influence of each point on the overall equation against the standardized residuals. Since no points stand out far from the pack, we can assume that there are no outliers having undue influence on the ﬁt of the model. Thus, the graphical diagnostic of the model fit apparently shows that the assumptions requirements of ANOVA model is fairly fulfilled. However, the normality assumption and homogeneity are supposed to be confirmed by the appropriate statistical tests. Regarding the fulfillment of normality assumption, it has been already discussed that when the number of observations is less, it is wise to test normality for the overall residuals of the model, instead of checking it for separate group. In R the residuals of model is saved as follows: uhat<-resid(model1) where resid function extracts the model residual and it is saved as object ‘uhat’. Now you may apply the Shapiro-Wilk test for normality with the following hypotheses set-up: The test code and results are shown below: shapiro.test(uhat) Shapiro-Wilk normality test data: uhat W = 0.9872, p-value = 0.7826 As the p-value is higher than the level of significance, you cannot reject the null hypothesis, which implies that the samples are taken from the normal populations. Another assumption requirement is the homogeneity of variances across the groups, which can be statistically tested by Bartlett test and Levene test. In both the test, the null hypothesis is set as the homogeneity of variance across the cross-sectional group. The tests are conducted as follows: bartlett.test(Mileage~Brands) Bartlett test of homogeneity of variances data: Mileage by Brands Bartlett's K-squared = 2.1496, df = 3, p-value = 0.5419 library(car) levene.test(Mileage~Brands) Levene's Test for Homogeneity of Variance (center = median) Df F value Pr(>F) group 3 0.6946 0.5592 56 Both the tests confirmed the presence of homogeneity of variance across the four brands of tyre as we cannot reject the null hypothesis of homogeneity of variances across the brands of tyre. The Conclusion The mileages of the four brands of tyre are different. The results show that all the pairs of mileage are statistically significantly different from the Tukey’s Honestly Significant Differences, except for the pair CEAT-Apollo. More specifically, the pair-wise difference between Bridgestone-Apollo is found to be -3.019 which means that Apollo has higher mileage than Bridgestone and this is statistically significant. Related Post var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... A hidden process behind binary or other categorical outcomes? Mon, 08/28/2017 - 02:00 (This article was first published on ouR data generation, and kindly contributed to R-bloggers) I was thinking a lot about proportional-odds cumulative logit models last fall while designing a study to evaluate an intervention’s effect on meat consumption. After a fairly extensive pilot study, we had determined that participants can have quite a difficult time recalling precise quantities of meat consumption, so we were forced to move to a categorical response. (This was somewhat unfortunate, because we would not have continuous or even count outcomes, and as a result, might not be able to pick up small changes in behavior.) We opted for a question that was based on 30-day meat consumption: none, 1-3 times per month, 1 time per week, etc. – six groups in total. The question was how best to evaluate effectiveness of the intervention? Since the outcome was categorical and ordinal – that is category 1 implied less meat consumption that category 2, category 2 implied less consumption that category 3, and so on – a model that estimates the cumulative probability of ordinal outcomes seemed like a possible way to proceed. Cumulative logit models estimate a number of parameters that represent the cumulative log-odds of an outcome; the parameters are the log-odds of categories 2 through 6 versus category 1, categories 3 through 6 versus 1 & 2, etc. Maybe not the most intuitive way to interpret the data, but seems to plausibly fit the data generating process. I was concerned about the proportionality assumption of the cumulative logit model, particularly when we started to consider adjusting for baseline characteristics (more on that in the next post). I looked more closely at the data generating assumptions of the cumulative logit model, which are quite frequently framed in the context of a continuous latent measure that follows a logistic distribution. I thought I’d describe that data generating process here to give an alternative view of discrete data models. I know I have been describing a context that includes an outcome with multiple categories, but in this post I will focus on regular logistic regression with a binary outcome. This will hopefully allow me to establish the idea of a latent threshold. I think it will be useful to explain this simpler case first before moving on to the more involved case of an ordinal response variable, which I plan to tackle in the near future. A latent continuous process underlies the observed binary process For an event with a binary outcome (true or false, A or B, 0 or 1), the observed outcome may, at least in some cases, be conceived as the manifestation of an unseen, latent continuous outcome. In this conception, the observed (binary) outcome merely reflects whether or not the unseen continuous outcome has exceeded a specified threshold. Think of this threshold as a tipping point, above which the observable characteristic takes on one value (say false), below which it takes on a second value (say true). The logistic distribution Logistic regression models are used to estimate relationships of individual characteristics with categorical outcomes. The name of this regression model arises from the logistic distribution, which is a symmetrical continuous distribution. In a latent (or hidden) variable framework, the underlying, unobserved continuous measure is drawn from this logistic distribution. More specifically, the standard logistic distribution is typically assumed, with a location parameter of 0, and a scale parameter of 1. (The mean of this distribution is 0 and variance is approximately 3.29.) Here is a plot of a logistic pdf, shown in relation to a standard normal pdf (with mean 0 and variance 1): library(ggplot2) library(data.table) my_theme <- function() { theme(panel.background = element_rect(fill = "grey90"), panel.grid = element_blank(), axis.ticks = element_line(colour = "black"), panel.spacing = unit(0.25, "lines"), plot.title = element_text(size = 12, vjust = 0.5, hjust = 0), panel.border = element_rect(fill = NA, colour = "gray90")) } x <- seq(-6, 6, length = 1000) yNorm <- dnorm(x, 0, 1) yLogis <- dlogis(x, location = 0, scale = 1) dt <- data.table(x, yNorm, yLogis) dtm <- melt(dt, id.vars = "x", value.name = "Density") ggplot(data = dtm) + geom_line(aes(x = x, y = Density, color = variable)) + geom_hline(yintercept = 0, color = "grey50") + my_theme() + scale_color_manual(values = c("red", "black"), labels=c("Normal", "Logistic")) + theme(legend.position = c(0.8, 0.6), legend.title = element_blank()) The threshold defines the probability Below, I have plotted the standardized logistic pdf with a threshold that defines a tipping point for a particular Group A. In this case the threshold is 1.5, so for everyone with a unseen value of $$X < 1.5$$, the observed binary outcome $$Y$$ will be 1. For those where $$X \geq 1.5$$, the observed binary outcome $$Y$$ will be 0: xGrpA <- 1.5 ggplot(data = dtm[variable == "yLogis"], aes(x = x, y = Density)) + geom_line() + geom_segment(x = xGrpA, y = 0, xend = xGrpA, yend = dlogis(xGrpA), lty = 2) + geom_area(mapping = aes(ifelse(x < xGrpA, x, xGrpA)), fill = "white") + geom_hline(yintercept = 0, color = "grey50") + ylim(0, 0.3) + my_theme() + scale_x_continuous(breaks = c(-6, -3, 0, xGrpA, 3, 6)) Since we have plot a probability density (pdf), the area under the entire curve is equal to 1. We are interested in the binary outcome $$Y$$ defined by the threshold, so we can say that the area below the curve to the left of threshold (filled in white) represents $$P(Y = 1|Group=A)$$. The remaining area represents $$P(Y = 0|Group=A)$$. The area to the left of the threshold can be calculated in R using the plogis function: (p_A <- plogis(xGrpA)) ## [1] 0.8175745 Here is the plot for a second group that has a threshold of 2.2: The area under the curve to the left of the threshold is $$P(X < 2.2)$$, which is also $$P(Y = 1 | Group=B)$$: (p_B <- plogis(xGrpB)) ## [1] 0.9002495 Log-odds and probability In logistic regression, we are actually estimating the log-odds of an outcome, which can be written as $log \left[ \frac{P(Y=1)}{P(Y=0)} \right]$. In the case of Group A, log-odds of Y being equal to 1 is (logodds_A <- log(p_A / (1 - p_A) )) ## [1] 1.5 And for Group B, (logodds_B <- log(p_B / (1 - p_B) )) ## [1] 2.2 As you may have noticed, we’ve recovered the thresholds that we used to define the probabilities for the two groups. The threshold is actually the log-odds for a particular group. Logistic regression The logistic regression model that estimates the log-odds for each group can be written as $log \left[ \frac{P(Y=1)}{P(Y=0)} \right] = B_0 + B_1 * I(Grp = B) \quad ,$ where $$B_0$$ represents the threshold for Group A and $$B_1$$ represents the shift in the threshold for Group B. In our example, the threshold for Group B is 0.7 units (2.2 – 1.5) to the right of the threshold for Group A. If we generate data for both groups, our estimates for $$B_0$$ and $$B_1$$ should be close to 1.5 and 0.7, respectively The process in action To put this all together in a simulated data generating process, we can see the direct link with the logistic distribution, the binary outcomes, and an interpretation of estimates from a logistic model. The only stochastic part of this simulation is the generation of continuous outcomes from a logistic distribution. Everything else follows from the pre-defined group assignments and the group-specific thresholds: n = 5000 set.seed(999) # Stochastic step xlatent <- rlogis(n, location = 0, scale = 1) # Deterministic part grp <- rep(c("A","B"), each = n / 2) dt <- data.table(id = 1:n, grp, xlatent, y = 0) dt[grp == "A" & xlatent <= xGrpA, y := 1] dt[grp == "B" & xlatent <= xGrpB, y := 1] # Look at the data dt ## id grp xlatent y ## 1: 1 A -0.4512173 1 ## 2: 2 A 0.3353507 1 ## 3: 3 A -2.2579527 1 ## 4: 4 A 1.7553890 0 ## 5: 5 A 1.3054260 1 ## --- ## 4996: 4996 B -0.2574943 1 ## 4997: 4997 B -0.9928283 1 ## 4998: 4998 B -0.7297179 1 ## 4999: 4999 B -1.6430344 1 ## 5000: 5000 B 3.1379593 0 The probability of a “successful” outcome (i.e $$P(Y = 1$$)) for each group based on this data generating process is pretty much equal to the areas under the respective densities to the left of threshold used to define success: dt[, round(mean(y), 2), keyby = grp] ## grp V1 ## 1: A 0.82 ## 2: B 0.90 Now let’s estimate a logistic regression model: library(broom) glmfit <- glm(y ~ grp, data = dt, family = "binomial") tidy(glmfit, quick = TRUE) ## term estimate ## 1 (Intercept) 1.5217770 ## 2 grpB 0.6888526 The estimates from the model recover the logistic distribution thresholds for each group. The Group A threshold is estimated to be 1.52 (the intercept) and the Group B threshold is estimated to be 2.21 (intercept + grpB parameter). These estimates can be interpreted as the log-odds of success for each group, but also as the threshold for the underlying continuous data generating process that determines the binary outcome $$Y$$. And we can interpret the parameter for grpB in the traditional way as the log-odds ratio comparing the log-odds of success for Group B with the log-odds of success for Group A, or as the shift in the logistic threshold for Group A to the logistic threshold for Group B. In the next week or so, I will extend this to a discussion of an ordinal categorical outcome. I think the idea of shifting the thresholds underscores the proportionality assumption I alluded to earlier … var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... Pricing Optimization: How to find the price that maximizes your profit Sun, 08/27/2017 - 18:02 (This article was first published on R – insightR, and kindly contributed to R-bloggers) By Yuri Fonseca Basic idea In this post we will discuss briefly about pricing optimization. The main idea behind this problem is the following question: As manager of a company/store, how much should I charge in order to maximize my revenue or profit? Obviously, the answer isn’t as high as possible. If you charge one hundred dollars for a candy, probably only one or two people will accept to buy it. Although the profit per product is very high, you probably won’t even your fixed costs. Charge a very small is also not the best call. Before showing an example for this problem, let us build some simple formulas. Imagine a monopolist selling a specific product with demand curve , where is the quantity sold given a specific price . To simplify things, let’s suppose that is a linear function: The total revenue will be given by: And total profit will be given by: Where is the unity cost of the product. Adding fixed costs in the profit equation does not change the price police, so we will suppose it’s zero. Next, we differentiate the equations for and to find the first order conditions, which allow us to find the optimal police under the hypothesis of a linear demand curve. is expected to be negative (demand decrease when prices increase) and are concave functions of . As consequence, the critical point will be a maximum point. Therefore, the optimal police for revenue is given by: And for profit: Note that sometimes people write the linear demand curve as , in this case should be positive, and the signs in equation 2 and equation 3 must change. Moreover, it is interesting to note that the price that maximizes profit is always bigger than the one that maximizes total revenue because is always positive. If taxes are calculated just on profit the price police remains the same. However, countries like Brazil usually charges a lot of taxes on total revenue. In this case, the price police for maximizing revenue doesn’t change, but the police for maximizing profit will change according to the following expression: Example and implementations: As an example of how to proceed with the estimation of the optimum price, let’s generate a linear demand curve with for daily selling of a product: library(ggplot2) # example of linear demand curve (first equation) demand = function(p, alpha = -40, beta = 500, sd = 10) { error = rnorm(length(p), sd = sd) q = p*alpha + beta + error return(q) } set.seed(100) prices = seq(from = 5, to = 10, by = 0.1) q = demand(prices) data = data.frame('prices' = prices,'quantity' = q) ggplot(data, aes(prices, quantity)) + geom_point(shape=1) + geom_smooth(method='lm') + ggtitle('Demand Curve') Imagine a company that has been selling the product which follows the demand curve above for a while (one year changing prices daily), testing some prices over time. The following time-series is what we should expect for the historical revenue, profit and cost of the company: set.seed(10) hist.prices = rnorm(252, mean = 6, sd = .5) # random prices defined by the company hist.demand = demand(hist.prices) # demand curve defined in the chunck above hist.revenue = hist.prices*hist.demand # From the revenue equation unity.cost = 4 # production cost per unity hist.cost = unity.cost*hist.demand hist.profit = (hist.prices - unity.cost)*hist.demand # From the price equation data = data.frame('Period' = seq(1,252),'Daily.Prices' = hist.prices, 'Daily.Demand' = hist.demand, 'Daily.Revenue' = hist.revenue, 'Daily.Cost' = hist.cost, 'Daily.Profit' = hist.profit) ggplot(data, aes(Period, Daily.Prices)) + geom_line(color = 4) + ggtitle('Historical Prices used for explotation') ggplot(data, aes(Period, Daily.Revenue, colour = 'Revenue')) + geom_line() + geom_line(aes(Period, Daily.Profit, colour = 'Profit')) + geom_line(aes(Period, Daily.Cost, colour = 'Cost')) + labs(title = 'Historical Performance', colour = '') We can recover the demand curve using the historical data (that is how it is done in the real world). library(stargazer) model.fit = lm(hist.demand ~ hist.prices) # linear model for demand stargazer(model.fit, type = 'html', header = FALSE) # output Dependent variable: hist.demand hist.prices -38.822*** (1.429) Constant 493.588*** (8.542) Observations 252 R2 0.747 Adjusted R2 0.746 Residual Std. Error 10.731 (df = 250) F Statistic 738.143*** (df = 1; 250) Note: *p<0.1; **p<0.05; ***p<0.01 And now we need to apply equation 2 and equation 3. # estimated parameters beta = model.fit$coefficients[1] alpha = model.fit$coefficients[2] p.revenue = -beta/(2*alpha) # estimated price for revenue p.profit = (alpha*unity.cost - beta)/(2*alpha) # estimated price for profit The final plot with the estimated prices: true.revenue = function(p) p*(-40*p + 500) # Revenue with true parameters (chunck demand) true.profit = function(p) (p - unity.cost)*(-40*p + 500) # price with true parameters # estimated curves estimated.revenue = function(p) p*(model.fit$coefficients[2]*p + model.fit$coefficients[1]) estimated.profit = function(p) (p - unity.cost)*(model.fit$coefficients[2]*p + model.fit\$coefficients[1]) opt.revenue = true.revenue(p.revenue) # Revenue with estimated optimum price opt.profit = true.profit(p.profit) # Profit with estimated optimum price # plot df = data.frame(x1 = p.revenue, x2 = p.profit, y1 = opt.revenue, y2 = opt.profit, y3 = 0) ggplot(data = data.frame(Price = 0)) + stat_function(fun = true.revenue, mapping = aes(x = Price, color = 'True Revenue')) + stat_function(fun = true.profit, mapping = aes(x = Price, color = 'True Profit')) + stat_function(fun = estimated.revenue, mapping = aes(x = Price, color = 'Estimated Revenue')) + stat_function(fun = estimated.profit, mapping = aes(x = Price, color = 'Estimated Profit')) + scale_x_continuous(limits = c(4, 11)) + labs(title = 'True curves without noise') + ylab('Results') + scale_color_manual(name = "", values = c("True Revenue" = 2, "True Profit" = 3, "Estimated Revenue" = 4, "Estimated Profit" = 6)) + geom_segment(aes(x = x1, y = y1, xend = x1, yend = y3), data = df) + geom_segment(aes(x = x2, y = y2, xend = x2, yend = y3), data = df)

Final observations

As you can see, the estimated Revenue and estimated Profit curves are quite similar to the true ones without noise and the expected revenue for our estimated optimal policies looks very promising. Although the linear and monopolist assumption looks quite restrictive, this might not be the case, check Besbes and Zeevi (2015) and Cooper et al (2015).

If one expect a large variance for , it might be useful to simulate , and then the optimal price using Jensen’s inequality.

References

Phillips, Robert Lewis. Pricing and revenue optimization. Stanford University Press, 2005.

Besbes, Omar, and Assaf Zeevi. “On the (surprising) sufficiency of linear models for dynamic pricing with demand learning.” Management Science 61.4 (2015): 723-739.

Cooper, William L., Tito Homem-de-Mello, and Anton J. Kleywegt. “Learning and pricing with models that do not explicitly incorporate competition.” Operations research 63.1 (2015): 86-103.

Talluri, Kalyan T., and Garrett J. Van Ryzin. The theory and practice of revenue management. Vol. 68. Springer Science & Business Media, 2006.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – insightR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Marginal effects for negative binomial mixed effects models (glmer.nb and glmmTMB) #rstats

Sun, 08/27/2017 - 12:17

(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

Here’s a small preview of forthcoming features in the ggeffects-package, which are already available in the GitHub-version: For marginal effects from models fitted with glmmTMB() or glmer() resp. glmer.nb(), confidence intervals are now also computed.

If you want to test these features, simply install the package from GitHub:

library(devtools) devtools::install_github("strengejacke/ggeffects")

Here are three examples:

library(glmmTMB) library(lme4) library(ggeffects) data(Owls) m1 <- glmmTMB(SiblingNegotiation ~ SexParent + ArrivalTime + (1 | Nest), data = Owls, family = nbinom1) m2 <- glmmTMB(SiblingNegotiation ~ SexParent + ArrivalTime + (1 | Nest), data = Owls, family = nbinom2) m3 <- glmer.nb(SiblingNegotiation ~ SexParent + ArrivalTime + (1 | Nest), data = Owls) m4 <- glmmTMB( SiblingNegotiation ~ FoodTreatment + ArrivalTime + SexParent + (1 | Nest), data = Owls, ziformula = ~ 1, family = list(family = "truncated_poisson", link = "log") )

pr1 <- ggpredict(m1, c("ArrivalTime", "SexParent")) plot(pr1)

pr2 <- ggpredict(m2, c("ArrivalTime", "SexParent")) plot(pr2)

pr3 <- ggpredict(m3, c("ArrivalTime", "SexParent")) plot(pr3)

pr4 <- ggpredict( m4, c("FoodTreatment", "ArrivalTime [21,24,30]", "SexParent") ) plot(pr4)

The code to calculate confidence intervals is based on the FAQ provided from Ben Bolker. Here is another example, that reproduces this plot (note, since age is numeric, ggpredict() produces a straight line, and not points with error bars).

library(nlme) data(Orthodont) m5 <- lmer(distance ~ age * Sex + (age|Subject), data = Orthodont) pr5 <- ggpredict(m5, c("age", "Sex")) plot(pr5)

Tagged: data visualization, ggplot, R, rstats, sjPlot

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Strenge Jacke!. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...