Data.Table by Example – Part 2
(This article was first published on R – Mathew Analytics, and kindly contributed to Rbloggers)
In part one, I provided an initial walk through of some nice features that are available within the data.table package. In particular, we saw how to filter data and get a count of rows by the date.
dat = fread("rows.csv") names(dat) < gsub(" ", "_", names(dat)) dat[1:3]Let us now add a few columns to our dataset on reported crimes in the city of Chicago. There are many ways to do do this but they involve the use of the := operator. Since data.table updates values by reference, we do not need to save the results as another variable. This is a very desirable feature.
dat[, c("value1", "value2", "value3") := sample(1:50, nrow(dat), replace=TRUE)] dat[, `:=`(value1 = sample(1:50, nrow(dat), replace=TRUE), value2 = sample(1:50, nrow(dat), replace=TRUE), value3 = sample(1:50, nrow(dat), replace=TRUE))]You can also just use the traditional base R solution for adding new columns as data.tables are also data frames.
dat$value1 < sample(1:50, nrow(dat), replace=TRUE) dat$value2 < sample(1:50, nrow(dat), replace=TRUE) dat$value3 < sample(1:50, nrow(dat), replace=TRUE)In any case, we now have three new columns with randomly selected values between 1 and 50. We can now look to summarize these values and see how they differ across the primary arrest type and other categorical variables.
dat[, .(mean = mean(value1, na.rm = TRUE), median = median(value1, na.rm = TRUE), min = min(value1, na.rm = TRUE), max = max(value1, na.rm = TRUE))] dat[Primary_Type=="PROSTITUTION", .(mean = mean(value1, na.rm = TRUE), median = median(value1, na.rm = TRUE), min = min(value1, na.rm = TRUE), max = max(value1, na.rm = TRUE))]The above code allows us to get the mean, median, min, and max values from the value1 column for all rows and for when the primary type is prostitution. To get these summaries across multiple columns, we can use lapply, .SD, and .SDcols.
dat[, lapply(.SD, mean, na.rm=TRUE), .SDcols=c("value1", "value2", "value3")] dat[, lapply(.SD, mean, na.rm=TRUE), by=Primary_Type, .SDcols=c("value1", "value2", "value3")][1:4]But wait, that only provides me with the mean of a single value column. If you want to apply multiple functions to columns, the user is required to write a function that can then be used within lapply.
my.summary < function(x){ c(mean = mean(x), median = median(x), min = min(x), max = max(x)) } dat[, sapply(.SD, my.summary), .SDcols=c("value1", "value2", "value3")]Perfect! As you can see, the syntax is concise and is very easy to work with.
In the next part, I’ll cover a few advanced features like get() and {}.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Mathew Analytics. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RcppAnnoy 0.0.10
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A few short weeks after the more substantial 0.0.9 release of RcppAnnoy, we have a quick bugfix update.
RcppAnnoy is our Rcppbased R integration of the nifty Annoy library by Erik. Annoy is a small and lightweight C++ template header library for very fast approximate nearest neighbours.
Michaël Benesty noticed that our getItemsVector() function didn’t, ahem, do much besides crashing. Simple bug, they happen–now fixed, and a unit test added.
Changes in this version are summarized here:
Changes in version 0.0.10 (20170925) The getItemsVector() function no longer crashes (#24)
Courtesy of CRANberries, there is also a diffstat report for this release.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Meet the new Microsoft R Server: Microsoft ML Server 9.2
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
Microsoft R Server has received a new name and a major update: Microsoft ML Server 9.2 is now available. ML Server provides a scalable production platform for R — and now Python — programs. The basic idea is that a local client can push R or Python code and have it operationalized on the remote server. ML Server is also included with the Data Science Virtual Machine and HDInsight Spark clusters on Azure.
This video gives a highlevel overview of the process, or you can also see details of deploying an R model or a Python model as a web service.
The related Microsoft Machine Learning Services provides similar capabilities for indatabase computations within SQL Server 2017 (now with Python as well as R) and (in preview) the fullymanaged Azure SQL Database. ML Services also provides realtime scoring of trained models, with predictions generated from the data in the database without invoking the R or Python engine.
Learn more about the many updates to Microsoft ML Server 9.2 in the blog post linked below.
Cortana Intelligence and Machine Learning Blog: Introducing Microsoft Machine Learning Server 9.2 Release
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Working with air quality and meteorological data Exercises (Part4)
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
Atmospheric air pollution is one of the most important environmental concerns in many countries around the world, and it is strongly affected by meteorological conditions. Accordingly, in this set of exercises we use openair package to work and analyze air quality and meteorological data. This packages provides tools to directly import data from air quality measurement network across UK, as well as tools to analyse and producing reports.
In the previous exercises set we used data from MY1 station to see how to import data and extract basic statistical information from data. Then we practiced using functions that are available in openair package to analyze and visualize data. In this exercise set we will go through more advance functions that are very useful in air quality data analysis.
Back trajectories are very useful in air pollution analysis and can provide important information on air mass origins. Backtrajectories show the path of air masses traveled over specific time periods and can be used to identify sources of pollutants. However, calculating and analysis of backtrajectory data is a time consuming process. Therefore, to overcome some of these issues and expand the possibilities for data analysis, openair makes several functions available to access and analyse precalculated back trajectories. In this exercise set will go through these functions.
Answers to the exercises are available here.
For other parts of this exercise set follow the tag openair
Please load the package openair before starting the exercises.
Exercise 1
In the first exercise we will use importTraj function to import precalculated backtrjactories which will be also used in the next exercises. Use importTraj to import precalculated backtrajectories for London and for the year 2011. Save the results in a dataframe named “traj”.
Exercise 2
In this exercise will use trajPlot function to plot trajectory line to visually see how airmasses has travels before arriving to the receptor point.
Use trajPlot function to plot trajectory lines for the traj dataframe obtained in the previous exercise. Here, only use trajectories for the period of 1016 January.
Exercise 3
There are also a few other ways of representing the data shown in For example we can plot the trajectories for each day.
Use trajPlot function to plot the same trajectory lines as in the previous exercise but separate the trajectories by day.
Exercise 4
One of the most useful approaches is to link the back trajectories with the concentrations of a pollutant.
In this exercise first import data for North Kensington using importAURN function. And then use merge function to merge the pollutant data with backtrajectories into single dataframe.
Now, use trajPlot function to plot trajectories along with pm10 level asscociated with each trajectories.
You can use Air Quality Data and weather patterns in combination with spatial data visualization, Learn more about spatial data in the online course[Intermediate] Spatial Data Analysis with R, QGIS & More. this course you will learn how to:
 Work with Spatial data and maps
 Learn about different tools to develop spatial data next to R
 And much more
Exercise 5
Another useful analysis is to identify the contribution of high concentration back trajectories. In other words, to identify the potential of sources contributing to high pollutant concentration in the
area of interest. This can be done by trajrajLevel. In this function we can use a number of statistics option such as frequency and difference. However another statistical method that can be applied is potential source contribution function (PSCF) method which is widely use in air mass trajectory analysis to identify potential sources or hotspots for pollution.
Use trajrajLevel function to plot and identify potential sources for high pm2.5 concentration in London in 2011.
Related exercise sets: Working with air quality and meteorological data Exercises (Part3)
 Working with air quality and meteorological data Exercises (Part2)
 Working with air quality and meteorological data Exercises (Part1)
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Microsoft Cognitive Services Vision API in R
(This article was first published on RProjects – Stoltzmaniac, and kindly contributed to Rbloggers)
Microsoft Cognitive Services Vision API in RA little while ago I did a brief tutorial of the Google Vision API using RoogleVision created by Mark Edmonson. I couldn’t find anything similar to that in R for the Microsoft Cognitive Services API so I thought I would give it a shot. I whipped this example together quickly to give it a proofofconcept but I could certainly see myself building an R package to support this (unless someone can point to one – and please do if one exists)!
A quick example, sending this image retrieved the location of the human face and created a caption! Here’s my dog lined up next to his doppelganger:
The API is extremely easy to access using RCurl and httr. There are a lot of options which can be accessed. In this example, I’ll just cover the basics of image detection and descriptions.
If you don’t want to spend time writing a bunch of code, you can simply use the “helper_functions.R” file I created and swap out your credentials and API endpoint to get it working.
Getting Started With Microsoft Cognitive ServicesIn order to get started, all you need is an Azure Account which is free if you can keep yourself under certain thresholds and limits. There is even a free trial period (at the time this was written, at least).
Once that is taken care of there are a few things you need to do:
 Login to portal.azure.com
 On the lefthand menu click “Add”
 Click on “AI + Cognitive Services” and then the “Computer Vision API”
 Fill out the information required. You may have “Free Trial” under Subscription. Pay special attention to Location because this will be used in your API script
 In the lefthand menu, click “Keys” underneath “Resource Management” and you will find what you need for credentials. Underneath your Endpoint URL, click on “Show access keys…” – copy your key and use it in your script (do not make this publicly accessible)
 You’re ready to go!
Now, you can write a script to utilize the power of the Microsoft Cognitive Vision API.
What data can you get?There are a lot of features you can request. I’m only asking for: description, tags, categories, and faces. You can also return: image type, color, and adult. There are also details which can be returned such as: landmarks and celebrities.
Here is the setup and API call:
image_url = 'https://imgur.com/rapIn0u.jpg' visualFeatures = "Description,Tags,Categories,Faces" # options = "Categories, Tags, Description, Faces, ImageType, Color, Adult" details = "Landmarks" # options = Landmarks, Celebrities reqURL = paste(api_endpoint_url, "?visualFeatures=", visualFeatures, "&details=", details, sep="") APIresponse = POST(url = reqURL, content_type('application/json'), add_headers(.headers = c('OcpApimSubscriptionKey' = api_key)), body=list(url = image_url), encode = "json") df = content(APIresponse)The dataframe returned looks messy, but isn’t too bad once you dive in. Take a look:
str(df) ## List of 5 ## $ tags :List of 6 ## ..$ :List of 2 ## .. ..$ name : chr "dog" ## .. ..$ confidence: num 0.987 ## ..$ :List of 3 ## .. ..$ name : chr "mammal" ## .. ..$ confidence: num 0.837 ## .. ..$ hint : chr "animal" ## ..$ :List of 2 ## .. ..$ name : chr "looking" ## .. ..$ confidence: num 0.814 ## ..$ :List of 2 ## .. ..$ name : chr "animal" ## .. ..$ confidence: num 0.811 ## ..$ :List of 2 ## .. ..$ name : chr "posing" ## .. ..$ confidence: num 0.54 ## ..$ :List of 2 ## .. ..$ name : chr "staring" ## .. ..$ confidence: num 0.165 ## $ description:List of 2 ## ..$ tags :List of 18 ## .. ..$ : chr "dog" ## .. ..$ : chr "mammal" ## .. ..$ : chr "looking" ## .. ..$ : chr "animal" ## .. ..$ : chr "photo" ## .. ..$ : chr "posing" ## .. ..$ : chr "camera" ## .. ..$ : chr "man" ## .. ..$ : chr "standing" ## .. ..$ : chr "smiling" ## .. ..$ : chr "face" ## .. ..$ : chr "white" ## .. ..$ : chr "holding" ## .. ..$ : chr "close" ## .. ..$ : chr "wearing" ## .. ..$ : chr "laying" ## .. ..$ : chr "head" ## .. ..$ : chr "teeth" ## ..$ captions:List of 1 ## .. ..$ :List of 2 ## .. .. ..$ text : chr "a close up of Albert Einstein and a dog posing for the camera" ## .. .. ..$ confidence: num 0.892 ## $ requestId : chr "2143e23a14c847c497509bfc82381512" ## $ metadata :List of 3 ## ..$ width : int 824 ## ..$ height: int 824 ## ..$ format: chr "Jpeg" ## $ faces :List of 1 ## ..$ :List of 3 ## .. ..$ age : int 73 ## .. ..$ gender : chr "Male" ## .. ..$ faceRectangle:List of 4 ## .. .. ..$ left : int 505 ## .. .. ..$ top : int 241 ## .. .. ..$ width : int 309 ## .. .. ..$ height: int 309 The top 5 descriptions returned are: description_tags = df$description$tags description_tags_tib = tibble(tag = character()) for(tag in description_tags){ for(text in tag){ if(class(tag) != "list"){ ## To remove the extra caption from being included tmp = tibble(tag = tag) description_tags_tib = description_tags_tib %>% bind_rows(tmp) } } } knitr::kable(description_tags_tib[1:5,]) tag dog mammal looking animal photoThe caption returned:
captions = df$description$captions captions_tib = tibble(text = character(), confidence = numeric()) for(caption in captions){ tmp = tibble(text = caption$text, confidence = caption$confidence) captions_tib = captions_tib %>% bind_rows(tmp) } knitr::kable(captions_tib) text confidence a close up of Albert Einstein and a dog posing for the camera 0.891846 The metadata returned: metadata = df$metadata metadata_tib = tibble(width = metadata$width, height = metadata$height, format = metadata$format) knitr::kable(metadata_tib) width height format 824 824 Jpeg The locations of faces returned: faces = df$faces faces_tib = tibble(faceID = numeric(), age = numeric(), gender = character(), x1 = numeric(), x2 = numeric(), y1 = numeric(), y2 = numeric()) n = 0 for(face in faces){ n = n + 1 tmp = tibble(faceID = n, age = face$age, gender = face$gender, x1 = face$faceRectangle$left, y1 = face$faceRectangle$top, x2 = face$faceRectangle$left + face$faceRectangle$width, y2 = face$faceRectangle$top + face$faceRectangle$height) faces_tib = faces_tib %>% bind_rows(tmp) } #faces_tib knitr::kable(faces_tib) faceID age gender x1 x2 y1 y2 1 73 Male 505 814 241 550 A few more examples:As always, you can find the code I used on my GitHub.
Side note: I used a ton of for loops to access the data – not ideal… please let me know if you have better ways of dealing with nested data like this when you have unknown numbers of levels.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: RProjects – Stoltzmaniac. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
13 new R Jobs for R users (20170926) – from all over the world
Just visit this link and post a new R job to the R community.
You can post a job for free (and there are also “featured job” options available for extra exposure).
Current R jobsJob seekers: please follow the links below to learn more and apply for your R job of interest:
Featured Jobs
 FullTime
 Data Scientist
Edelweiss Business Services – Posted by Aakash Gupta  Mumbai
Maharashtra, India  20 Sep 2017

 FullTime
 Solutions Engineer for RStudio @ U.S.A. (or anywhere)
RStudio – Posted by nwstephens  Anywhere
 8 Sep 2017

 Freelance
 R Programmer & Statistician Empiricus – Posted by empiricus
 Anywhere
 25 Sep 2017

 FullTime
 Data Scientist Edelweiss Business Services – Posted by Aakash Gupta
 Mumbai Maharashtra, India
 20 Sep 2017

 FullTime
 Postdoctoral Data Scientist: GIS and mHealth Harvard T.H. Chan School of Public Health – Posted by Christine Choirat
 Boston Massachusetts, United States
 14 Sep 2017

 FullTime
 Data Scientist @ London Hiscox – Posted by alan.south
 London England, United Kingdom
 13 Sep 2017

 Freelance
 Crafty developer: R>C++ SherandoResearch – Posted by feersum_monkey
 Anywhere
 8 Sep 2017

 FullTime
 Solutions Engineer for RStudio @ U.S.A. (or anywhere) RStudio – Posted by nwstephens
 Anywhere
 8 Sep 2017

 FullTime
 Senior Data Scientist @ Boston, Massachusetts, U.S. Colaberry Data Analytics – Posted by Colaberry_DataAnalytics
 Boston Massachusetts, United States
 8 Sep 2017

 PartTime
 CranR Programmierer Praktikum @ Zürich, Switzerland Fisch Asset Management AG – Posted by HR_Fisch
 Zürich Zürich, Switzerland
 4 Sep 2017

 Freelance
 Data analyst at talmix @ London, UK Talmix – Posted by manikauluck
 London England, United Kingdom
 4 Sep 2017

 FullTime
 R Developer and Open Source Community Manager at UCLA @ Los Angeles, California UCLA – Posted by graemeblair
 Los Angeles California, United States
 4 Sep 2017

 FullTime
 Work in Bodrum: Fullstack web developer with R experience @ Muğla, Turkey PranaGEO Ltd. – Posted by acizmeli
 Muğla Turkey
 4 Sep 2017

 FullTime
 Digital Performance & Data Science Section Manager @ Paris, France Nissan Automotive Europe – Posted by Estefania Rendon
 Paris ÎledeFrance, France
 31 Aug 2017

 Freelance
 Quantitative Analyst (Credit) @ Zürich, Switzerland ipub – Posted by gluc
 Zürich Zürich, Switzerland
 29 Aug 2017
In Rusers.com you can see all the R jobs that are currently available.
Rusers ResumesRusers also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.
(you may also look at previous R jobs posts ).
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));R live class  R for Data Science  Oct 45 Milan
(This article was first published on R blog  Quantide  R training & consulting, and kindly contributed to Rbloggers)
R for Data Science is our first course of the autumn term. It takes place in October 45 in a location close to Milano Lima.
If you want to deepen your data analysis knowledge using the most modern R tools, or you want to figure out if R is the right solution for you, this is the your class. This course is for people that are willing to learn about R and would like to get an overview of its capabilities for data science or those that have very little knowledge of R.
R for Data Science Outline– A bit of R history and online resources
– R and RStudio installation and configuration
– Your first R session
– Your first R markdown document
– R objects: data and functions
– Data import from external sources: excel and database connection
– Data manipulation with tidyverse
– Tidy data with tidyr
– Data visualization using ggplot
– An overview of statistical models and data mining with R
The course is for max 6 attendees.
R for Data Science is organized by the R training and consulting company Quantide and is taught in Italian, while all the course materials are in English.
LocationThe course location is 550 mt. (7 minutes on walk) from Milano central station and just 77 mt. (1 minute on walk) from Lima subway station.
RegistrationIf you want to reserve a seat go to: FAQ, detailed program and tickets.
Other R courses  Autumn termYou can find an overview of all our courses here. Next dates will be:
 October 1112: Statistics for Data Science. Develop a wide variety of statistical models with R, from the simplest Linear Regression to the most sophisticated GLM models. Reserve now!
 October 2526: Machine Learning with R. Find patterns in large data sets using the R tools for Dimensionality Reduction, Clustering, Classification and Prediction. Reserve now!
 November 78: Data Visualization and Dashboard with R. Show the story behind your data: create beautiful effective visualizations and interactive Shiny dashboards. Reserve now!
 November 2122: R with Database and Big Data. From databases to distributed infrastructure, master the R techniques to handle and query Big Data. Reserve now!
 November 2930: Professional R Programming. Organise, document and test your code: write efficient functions, improve the code reproducibility and build R packages. Reserve now!
In case you are a group of people interested in more than one class, write us at training[at]quantide[dot]com! We can arrange together a tailormade course, picking all the topics that are interesting for your organization and dropping the rest.
The post R live class  R for Data Science  Oct 45 Milan appeared first on Quantide – R training & consulting.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R blog  Quantide  R training & consulting. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
When are Citi Bikes Faster than Taxis in New York City?
(This article was first published on Category: R  Todd W. Schneider, and kindly contributed to Rbloggers)
Every day in New York City, millions of commuters take part in a giant race to determine transportation supremacy. Cars, bikes, subways, buses, ferries, and more all compete against one another, but we never get much explicit feedback as to who “wins.” I’ve previously written about NYC’s public taxi data and Citi Bike share data, and it occurred to me that these datasets can help identify who’s fastest, at least between cars and bikes. In fact, I’ve built an interactive guide that shows when a Citi Bike is faster than a taxi, depending on the route and the time of day.
The methodology and findings will be explained more below, and all code used in this post is available opensource on GitHub.
Interactive guide to when taxis are faster or slower than Citi BikesPick a starting neighborhood and a time. The map shows whether you’d expect get to each neighborhood faster with a taxi (yellow) or a Citi Bike (dark blue).
Starting neighborhood
Alphabet CityBattery ParkBattery Park CityBloomingdaleCentral ParkChinatownClinton EastClinton WestEast ChelseaEast Harlem SouthEast VillageFinancial District NorthFinancial District SouthFlatironGarment DistrictGramercyGreenwich Village NorthGreenwich Village SouthHudson SqKips BayLenox Hill EastLenox Hill WestLincoln Square EastLincoln Square WestLittle Italy/NoLiTaLower East SideManhattan ValleyMeatpacking/West Village WestMidtown CenterMidtown EastMidtown NorthMidtown SouthMurray HillPenn Station/Madison Sq WestSeaportSoHoStuy Town/Peter Cooper VillageSutton Place/Turtle Bay NorthTimes Sq/Theatre DistrictTriBeCa/Civic CenterTwo Bridges/Seward ParkUN/Turtle Bay SouthUnion SqUpper East Side NorthUpper East Side SouthUpper West Side NorthUpper West Side SouthWest Chelsea/Hudson YardsWest VillageWorld Trade CenterYorkville EastYorkville West
BedfordBoerum HillBrooklyn HeightsBrooklyn Navy YardBushwick SouthCarroll GardensClinton HillCobble HillColumbia StreetCrown Heights NorthDowntown Brooklyn/MetroTechDUMBO/Vinegar HillEast WilliamsburgFort GreeneGowanusGreenpointPark SlopeProspect HeightsProspect ParkRed HookSouth WilliamsburgStuyvesant HeightsSunset Park WestWilliamsburg (North Side)Williamsburg (South Side)
Long Island City/Hunters PointLong Island City/Queens PlazaQueensbridge/RavenswoodSunnyside
8:00 AM–11:00 AM
11:00 AM–4:00 PM
4:00 PM–7:00 PM
7:00 PM–10:00 PM
10:00 PM–8:00 AM From Midtown East, weekdays 8:00 AM–11:00 AM
Taxi vs. Citi Bike travel times to other neighborhoods
Turn on javascript (or click through from RSS) to view the interactive taxi vs. Citi Bike map.
Data via NYC TLC and Citi BikeBased on trips 7/1/2016–6/30/2017
toddwschneider.com
Hover over a neighborhood (tap on mobile) to view travel time stats
40% of weekday taxi trips—over 50% during peak hours—would be faster as Citi Bike ridesI estimate that 40% of weekday taxi trips within the Citi Bike service area would expect to be faster if switched to a Citi Bike, based on data from July 2016 to June 2017. During peak midday hours, more than 50% of taxi trips would expect to be faster as Citi Bike rides.
There are some significant caveats to this estimate. In particular, if many taxi riders simultaneously switched to Citi Bikes, the bike share system would probably hit severe capacity constraints, making it difficult to find available bikes and docks. Increased bike usage might eventually lead to fewer vehicles on the road, which could ease vehicle congestion, and potentially increase bike lane congestion. It’s important to acknowledge that when I say “40% of taxi trips would be faster if they switched to Citi Bikes”, we’re roughly considering the decision of a single ablebodied person, under the assumption that everyone else’s behavior will remain unchanged.
Heading crosstown in Manhattan? Seriously consider taking a bike instead of a car!Crosstown Manhattan trips are generally regarded as more difficult than their northsouth counterparts. There are fewer subways that run crosstown, and if you take a car, the narrower eastwest streets often feel more congested than the broad northsouth avenues with their synchronized traffic lights. Crosstown buses are so notoriously slow that they’ve been known to lose races against tricycles.
I divided Manhattan into the crosstown zones pictured above, then calculated the taxi vs. Citi Bike win rate for trips that started and ended within each zone. Taxis fare especially badly in the Manhattan central business district. If you take a midday taxi that both starts and ends between 42nd and 59th streets, there’s over a 70% chance that the trip would have been faster as a Citi Bike ride.
Keep in mind that’s for all trips between 42nd and 59th streets. For some of the longest crosstown routes, for example, from the United Nations on the far east side to Hell’s Kitchen on the west, Citi Bikes beat taxis 90% of the time during the day. It’s worth noting that taxis made 8 times as many trips as Citi Bikes between 42nd and 59th streets from July 2016 to June 2017—almost certainly there would be less total time spent in transit if some of those taxi riders took bikes instead.
Hourly graphs for all of the crosstown zones are available here, and here’s a summary table for weekday trips between 8:00 AM and 7:00 PM:
Manhattan crosstown zone % taxis lose to Citi Bikes 96th–110th 41% 77th–96th 36% 59th–77th 54% 42nd–59th 69% 14th–42nd 64% Houston–14th 54% Canal–Houston 60% Below Canal 55%A reminder that this analysis restricts to trips that start and end within the same zone, so for example a trip from 23rd St to 57th St would be excluded because it starts and ends in different zones.
Taxis fare better for trips that stay on the east or west sides of Manhattan: 35% of daytime taxi trips that start and end west of 8th Avenue would expect to be faster as Citi Bike trips, along with 38% of taxi trips that start and end east of 3rd Avenue. Taxis also generally beat Citi Bikes on longer trips:
Taxis are losing more to Citi Bikes over timeWhen the Citi Bike program began in July 2013, less than half of weekday daytime taxi trips would have been faster if switched to Citi Bikes. I ran a monthbymonth analysis to see how the taxi vs. Citi Bike calculus has changed over time, and discovered that taxis are getting increasingly slower compared to Citi Bikes:
Note that this monthbymonth analysis restricts to the original Citi Bike service area, before the program expanded in August 2015. The initial expansion was largely into Upper Manhattan and the outer boroughs, where taxis generally fare better than bikes, and so to keep things consistent, I restricted the above graph to areas that have had Citi Bikes since 2013.
Taxis are losing more to Citi Bikes over time because taxi travel times have gotten slower, while Citi Bike travel times have remained roughly unchanged. I ran a pair of linear regressions to model travel times as a function of:
 trip distance
 time of day
 precipitation
 whether the route crosses between Manhattan and the outer boroughs
 month of year
 year
The regression code and output are available on GitHub: taxi, Citi Bike
As usual, I make no claim that this is a perfect model, but it does account for the basics, and if we look at the coefficients by year, it shows that, holding the other variables constant, a taxi trip in 2017 took 17% longer than the same trip in 2009. For example, a weekday morning trip from Midtown East to Union Square that took 10 minutes in 2009 would average 11:45 in 2017.
The same regression applied to Citi Bikes shows no such slowdown over time, in fact Citi Bikes got slightly faster. The regressions also show that:
 Citi Bike travel times are less sensitive to time of day than taxi travel times. A peak midday taxi trip averages 40% longer than the same trip at offpeak hours, while a peak Citi Bike trip averages 15% longer than during offpeak hours.
 Rainy days are associated with 2% faster Citi Bike travel times and 1% slower taxi travel times.
 For taxis, fall months have the slowest travel times, but for Citi Bikes, summer has the slowest travel times. For both, January has the fastest travel times.
It’s one thing to say that 50% of midday taxi trips would be faster as Citi Bike rides, but how much does that vary from day to day? You could imagine there are some days with severe road closures, where more nimble bikes have an advantage getting around traffic, or other days in the dead of summer, when taxis might take advantage of the less crowded roads.
I ran a more granular analysis to measure win/loss rates for individual dates. Here’s a histogram of the taxi loss rate—the % of taxi trips we’d expect to be faster if switched to Citi Bikes—for weekday afternoon trips from July 2016 to June 2017:
Many days see a taxi loss rate of just over 50%, but there are tails on both ends, indicating that some days tilt in favor of either taxis or Citi Bikes. I was curious if we could learn anything from the outliers on each end, so I looked at individual dates to see if there were any obvious patterns.
The dates when taxis were the fastest compared to Citi Bikes look like dates that probably had less traffic than usual. The afternoon with the highest taxi win rate was Monday, October 3, 2016, which was the Jewish holiday of Rosh Hashanah, when many New Yorkers would have been home from work or school. The next 3 best days for taxis were all Mondays in August, when I’d imagine a lot of people were gone from the city on vacation.
The top 4 dates where Citi Bikes did best against taxis were all rainy days in the fall of 2016. I don’t know why rainy days make bikes faster relative to taxis, maybe rain causes traffic on the roads that disproportionately affects cars, but it’s also possible that there’s a selection bias. I’ve written previously about how the weather predicts Citi Bike ridership, and not surprisingly there are fewer riders when it rains. Maybe the folks inclined to ride bikes when it’s raining are more confident cyclists, who also pedal faster when the weather is nice. It’s also possible that rainyday cyclists are particularly motivated to pedal faster so they can get out of the rain. I don’t know if these are really the causes, but they at least sound believable, and would explain the observed phenomenon.
When the President is in town, take a bikeJune 8, 2016 was a particularly good day for Citi Bikes compared to taxis. President Obama came to town that afternoon, and brought the requisite street closures with him. I poked around a bit looking for the routes that appeared to be the most impacted by the President’s visit, and came to afternoon trips from Union Square to Murray Hill. On a typical weekday afternoon, taxis beat Citi Bikes 57% of the time from Union Square to Murray Hill, but on June 8, Citi Bikes won 90% of the time. An even more dramatic way to see the Obama effect is to look at daily median travel times:
A typical afternoon taxi takes 8 minutes, but on June 8, the median was over 21 minutes. The Citi Bike median travel time is almost always 9 minutes, including during President Obama’s visit.
The same graph shows a similar phenomenon on September 19, 2016, when the annual United Nations General Assembly shut down large swathes of Manhattan’s east side, including Murray Hill. Although the impact was not as severe as during President Obama’s visit, the taxi median time doubled on September 19, while the Citi Bike median time again remained unchanged.
The morning of June 15, 2016 offers another example, this time on the west side, when an overturned tractor trailer shut down the Lincoln Tunnel for nearly seven hours. Taxi trips from the Upper West Side to West Chelsea, which normally take 15 minutes, took over 35 minutes. Citi Bikes typically take 18 minutes along the same route, and June 15 was no exception. Taxis would normally expect to beat Citi Bikes 67% of the time on a weekday morning, but on June 15, Citi Bikes won over 92% of the time.
These are of course three handpicked outliers, and it wouldn’t be entirely fair to extrapolate from them to say that Citi Bikes are always more resilient than taxis during extreme circumstances. The broader data shows, though, that taxis are more than twice as likely as Citi Bikes to have days when a route’s median time is at least 5 minutes slower than average, and more than 3.5 times as likely to be at least 10 minutes slower, so it really does seem that Citi Bikes are better at minimizing worstcase outcomes.
Why have taxis gotten slower since 2009?The biggest slowdowns in taxi travel times happened in 2014 and 2015. The data and regression model have nothing to say about why taxis slowed down so much over that period, though it might be interesting to dig deeper into the data to see if there are specific regions where taxis have fared better or worse since 2009.
Uber usage took off in New York starting in 2014, reaching over 10,000 vehicles dispatched per week by the beginning of 2015. There are certainly people who blame Uber—and other ridehailing apps like Lyft and Juno—for increasing traffic, but the city’s own 2016 traffic report did not blame Uber for increased congestion.
It’s undoubtedly very hard to do an accurate study measuring ridehailing’s impact on traffic, and I’m especially wary of people on both sides who have strong interests in blaming or exonerating the ridehailing companies. Nevertheless, if I had to guess the biggest reasons taxis got particularly slower in 2014 and 2015, I would start with the explosive growth of ridehailing apps, since the timing looks to align, and the publicly available data shows that they account for tens of thousands of vehicles on the roads.
On the other hand, if ridehailing were the biggest cause of increased congestion in 2014 and 2015, it doesn’t exactly make sense that taxi travel times have stabilized a bit in 2016 and 2017, because ridehailing has continued to grow, and while taxi usage continues to shrink, the respective rates of growth and shrinkage are not very different in 2016–17 than they were in 2014–15. One explanation could be that starting in 2016 there was a reduction in other types of vehicles—traditional black cars, private vehicles, etc.—to offset ridehailing growth, but I have not seen any data to support (or refute) that idea.
There are also those who blame bike lanes for worsening vehicle traffic. Again, different people have strong interests arguing both sides, but it seems like there are more data points arguing that bike lanes do not cause traffic (e.g. here, here, and here) than vice versa. I wasn’t able to find anything about the timing of NYC bike lane construction to see how closely it aligns with the 2014–15 taxi slowdown.
Lots of other factors could have contributed to worsening traffic: commuteradjusted population growth, subway usage, decaying infrastructure, construction, and presidential residences are just a few that feel like they could be relevant. I don’t know the best way to account for all of them, but it does seem like if you want to get somewhere in New York quickly, it’s increasingly less likely that a car is your best option.
How representative are taxis and Citi Bikes of all cars and bikes?I think it’s not a terrible assumption that taxis are representative of typical car traffic in New York. If anything, maybe taxis are faster than average cars since taxi drivers are likely more experienced—and often aggressive—than average drivers. On the other hand, taxi drivers seem anecdotally less likely to use a trafficenabled GPS, which maybe hurts their travel times.
Citi Bikes are probably slower than privatelyowned bikes. Citi Bikes are designed to be heavy and stable, which maybe makes them safer, but lowers their speeds. Plus, I’d guess that biking enthusiasts, who might be faster riders, are more likely to ride their own higherperformance bikes. Lastly, Citi Bike riders might have to spend extra time at the end of a trip looking for an available dock, whereas privatelyowned bikes have more parking options.
Weighing up these factors, I would guess that if we somehow got the relevant data to analyze the broader question of all cars vs. all bikes, the results would tip a bit in favor of bikes compared to the results of the narrower taxi vs. Citi Bike analysis. It’s also worth noting that both taxis and Citi Bikes have additional time costs that aren’t accounted for in trip durations: you have to hail a taxi, and there might not be a Citi Bike station in the near vicinity of your origin or destination.
What are the implications of all this?One thing to keep in mind is that even though the taxi and Citi Bike datasets are the most conveniently available for analysis, New Yorkers don’t limit their choices to cars and bikes. The subway, despite its poor reputation of late, carries millions of people every day, more than taxis, ridehailing apps, and Citi Bikes combined, so it’s not like “car vs. bike” is always the most relevant question. There are also legitimate reasons to choose a car over a bike—or vice versa—that don’t depend strictly on expected travel time.
Bike usage in New York has increased dramatically over the past decade, probably in large part because people figured out on their own that biking is often the fastest option. Even with this growth, though, the data shows that a lot of people could still save precious time—and minimize their worsecase outcomes—if they switched from cars to bikes. To the extent the city can incentivize that, it strikes me as a good thing.
When Lmageddon comes, take a bikeFor any readers who might be affected by the L train’s planned 2019 closure, if you only remember one thing from this post: Citi Bikes crush taxis when traveling from Williamsburg to just about anywhere in Manhattan during morning rush hour!
GitHubThe code for the taxi vs. Citi Bike analysis is available here as part of the nyctaxidata repo. Note that parts of the analysis also depend on loading the data from the nyccitibikedata repo.
The dataTaxi trip data is available since January 2009, Citi Bike data since July 2013. I filtered each dataset to make the analysis closer to an applestoapples comparison—see the GitHub repo for a more complete description of the filtering—but in short:
 Restrict both datasets to weekday trips only
 Restrict Citi Bike dataset to subscribers only, i.e. no daily pass customers
 Restrict taxi dataset to trips that started and ended in areas with Citi Bike stations, i.e. where taking a Citi Bike would have been a viable option
Starting in July 2016, perhaps owing to privacy concerns, the TLC stopped providing latitude and longitude coordinates for every taxi trip. Instead, the TLC now divides the city into 263 taxi zones (map), and provides the pickup and drop off zones for every trip. The analysis then makes the assumption that taxis and Citi Bikes have the same distribution of trips within a single zone, see GitHub for more.
80% of taxi trips start and end within zones that have Citi Bike stations, and the filtered dataset since July 2013 contains a total of 330 million taxi trips and 27 million Citi Bike trips. From July 1, 2016 to June 30, 2017—the most recent 12 month period of available data—the filtered dataset includes 68 million taxi trips and 9 million Citi Bike trips.
MethodologyI wrote a Monte Carlo simulation in R to calculate the probability that a Citi Bike would be faster than a taxi for a given route. Every trip is assigned to a bucket, where the buckets are picked so that trips within a single bucket are fairly comparable. The bucket definitions are flexible, and I ran many simulations with different bucket definitions, but one sensible choice might be to group trips by:
 Starting zone
 Ending zone
 Hour of day
For example, weekday trips from the West Village to Times Square between 9:00 AM and 10:00 AM would constitute one bucket. The simulation iterates over every bucket that contains at least 5 taxi and 5 Citi Bike trips, and for each bucket, it draws 10,000 random samples, with replacement, for each of taxi and Citi Bike trips. The bucket’s estimated probability that a taxi is faster than a Citi Bike, call it the “taxi win rate”, is the fraction of samples where the taxi duration is shorter than the Citi Bike duration. You can think of this as 10,000 individual headtohead races, with each race pitting a single taxi trip against a single Citi Bike trip.
Different bucketing and filtering schemes allow for different types of analysis. I ran simulations that bucketed by month to see how win rates have evolved over time, simulations that used only days where it rained, and others. There are undoubtedly more schemes to be considered, and the Monte Carlo methodology should be well equipped to handle them.
$(function() { var desktop = !mobileDevice();
var selected_location_click_handler = [];
var tooltip_x_handlers = [ { events: "@taxi_zone_marks:mouseover", update: "x()  100 < 10 ? 10 : (x()  100 > width  240 ? width  240 : x()  100)" }, { events: "@taxi_zone_marks:mouseout", update: "999" } ]
if (desktop) { $(".mapclickinstructions").show();
selected_location_click_handler = [ { events: "@taxi_zone_marks:click", update: "datum.properties.zone" }, { events: "@taxi_zone_base_marks:click", update: "datum.properties.zone" } ];
tooltip_x_handlers.push({ events: "@taxi_zone_marks:click", update: "999" }); }
if (window.screen && window.screen.width < 450) { var map_width = window.screen.width; var map_height = 520; var map_center = [73.854, 40.737]; var map_scale = 155000; $("#nyctaxizonesmap").css({"marginleft": "12px", "width": "100vw"}); } else { var map_width = 450; var map_height = 640; var map_center = [73.889, 40.75]; var map_scale = 192000; } var vega_spec = { $schema: "https://vega.github.io/schema/vega/v3.0.json", width: map_width, height: map_height, autosize: "none", signals: [ { name: "selected_location", bind: { input: "select", options: ['Alphabet City', 'Battery Park', 'Battery Park City', 'Bedford', 'Bloomingdale', 'Boerum Hill', 'Brooklyn Heights', 'Brooklyn Navy Yard', 'Bushwick South', 'Carroll Gardens', 'Central Park', 'Chinatown', 'Clinton East', 'Clinton Hill', 'Clinton West', 'Cobble Hill', 'Columbia Street', 'Crown Heights North', 'Downtown Brooklyn/MetroTech', 'DUMBO/Vinegar Hill', 'East Chelsea', 'East Harlem South', 'East Village', 'East Williamsburg', 'Financial District North', 'Financial District South', 'Flatiron', 'Fort Greene', 'Garment District', 'Gowanus', 'Gramercy', 'Greenpoint', 'Greenwich Village North', 'Greenwich Village South', 'Hudson Sq', 'Kips Bay', 'Lenox Hill East', 'Lenox Hill West', 'Lincoln Square East', 'Lincoln Square West', 'Little Italy/NoLiTa', 'Long Island City/Hunters Point', 'Long Island City/Queens Plaza', 'Lower East Side', 'Manhattan Valley', 'Meatpacking/West Village West', 'Midtown Center', 'Midtown East', 'Midtown North', 'Midtown South', 'Murray Hill', 'Park Slope', 'Penn Station/Madison Sq West', 'Prospect Heights', 'Prospect Park', 'Queensbridge/Ravenswood', 'Red Hook', 'Seaport', 'SoHo', 'South Williamsburg', 'Stuy Town/Peter Cooper Village', 'Stuyvesant Heights', 'Sunnyside', 'Sunset Park West', 'Sutton Place/Turtle Bay North', 'Times Sq/Theatre District', 'TriBeCa/Civic Center', 'Two Bridges/Seward Park', 'UN/Turtle Bay South', 'Union Sq', 'Upper East Side North', 'Upper East Side South', 'Upper West Side North', 'Upper West Side South', 'West Chelsea/Hudson Yards', 'West Village', 'Williamsburg (North Side)', 'Williamsburg (South Side)', 'World Trade Center', 'Yorkville East', 'Yorkville West'] }, value: "Midtown East", on: selected_location_click_handler }, { name: "time_of_day", bind: { input: "radio", options: ["8:00 AM–11:00 AM", "11:00 AM–4:00 PM", "4:00 PM–7:00 PM", "7:00 PM–10:00 PM", "10:00 PM–8:00 AM"] }, value: "8:00 AM–11:00 AM" }, { name: "hover_area", value: null, on: [ { events: "@taxi_zone_marks:mouseover", update: "datum" }, { events: "@taxi_zone_marks:mouseout", update: "null" } ] }, { name: "tooltip_title", value: null, update: "hover_area ? selected_location + ' to ' + hover_area.properties.zone : ''" }, { name: "tooltip_from", value: null, update: "hover_area ? 'From ' + selected_location : ''" }, { name: "tooltip_to", value: null, update: "hover_area ? 'to ' + hover_area.properties.zone : ''" }, { name: "tooltip_time_of_day", value: null, update: "hover_area ? 'Weekdays ' + time_of_day : ''" }, { name: "tooltip_message", value: null, update: "hover_area ? (hover_area.taxi_win_rate > 0.5 ? 'Taxis' : 'Citi Bikes') + ' beat ' + (hover_area.taxi_win_rate > 0.5 ? 'Citi Bikes' : 'taxis') + ' ' + format((hover_area.taxi_win_rate > 0.5 ? hover_area.taxi_win_rate : 1  hover_area.taxi_win_rate), '0.0%') + ' of the time' : ''" }, { name: "tooltip_taxi_median", value: null, update: "hover_area ? 'Taxi median: ' + timeFormat(1000 * hover_area.taxi_median, '%M:%S') : ''" }, { name: "tooltip_citi_median", value: null, update: "hover_area ? 'Citi Bike median: ' + timeFormat(1000 * hover_area.citi_median, '%M:%S') : ''" }, { name: "tooltip_x", value: 999, on: tooltip_x_handlers }, { name: "tooltip_y", on: [ { events: "@taxi_zone_marks:mouseover", update: "y()  145 < 15 ? y() + 50 : y()  145" } ] } ], data: [ { name: "simulation_results", url: "http://cdn.toddwschneider.com/taxi_vs_citibike/simulation_results.csv", format: {type: "csv", parse: "auto"}, transform: [ { type: "filter", expr: "datum.time_bucket == time_of_day && datum.start_zone == selected_location" } ] }, { name: "east_river_bridges", url: "http://cdn.toddwschneider.com/taxi_vs_citibike/east_river_bridges.json", format: {type: "topojson", feature: "bridges"} }, { name: "taxi_zones", url: "http://cdn.toddwschneider.com/taxi_vs_citibike/taxi_zones_bmq.json", format: {type: "topojson", feature: "trimmed_taxi_zones_geojson"}, transform: [ { type: "lookup", from: "simulation_results", key: "end_taxi_zone_id", fields: ["properties.locationid"], values: ["taxi_win_rate", "taxi_median", "citi_median"], as: ["taxi_win_rate", "taxi_median", "citi_median"] } ] }, { name: "taxi_zones_with_data", source: "taxi_zones", transform: [ { type: "filter", expr: "datum.taxi_win_rate" } ] }, { name: "selected_taxi_zone_origin", source: "taxi_zones", transform: [ { type: "filter", expr: "datum.properties.zone == selected_location" } ] } ], projections: [ { name: "projection", type: "mercator", center: map_center, scale: map_scale } ], scales: [ { name: "color", type: "sequential", domain: [0.05, 0.95], range: {scheme: "viridis"} } ], legends: [ { fill: "color", orient: "bottomright", title: "Taxi win %", format: "0%", type: "gradient", encode: { gradient: { update: { width: {value: 150} } }, title: { enter: { fontSize: {value: 16} } }, labels: { enter: { text: {value: ""} } } } } ], marks: [ { type: "shape", name: "taxi_zone_base_marks", from: {data: "taxi_zones"}, encode: { update: { fill: {value: "#f4f4f4"}, fillOpacity: {value: 0.5}, stroke: {value: "#aaa"}, strokeWidth: {value: 0.2}, zindex: {value: 0} }, }, transform: [ {type: "geoshape", projection: "projection"} ] }, { type: "shape", name: "east_river_bridges_marks", from: {data: "east_river_bridges"}, encode: { update: { stroke: {value: "#777"}, strokeWidth: {value: 2} } }, transform: [ {type: "geoshape", projection: "projection"} ] }, { type: "shape", name: "taxi_zone_marks", from: {data: "taxi_zones_with_data"}, encode: { update: { fill: { scale: "color", field: "taxi_win_rate" }, fillOpacity: {value: 1}, stroke: {value: "#777"}, strokeWidth: {value: 0.2}, zindex: {value: 10} }, hover: { fillOpacity: {value: 0.8}, stroke: {value: "#222"}, strokeWidth: {value: 2}, zindex: {value: 100} } }, transform: [ {type: "geoshape", projection: "projection"} ] }, { type: "shape", name: "selected_taxi_zone_marks", from: {data: "selected_taxi_zone_origin"}, encode: { update: { fill: {value: "#f00"}, fillOpacity: {value: 1}, stroke: {value: "#222"}, strokeWidth: {value: 2}, zindex: {value: 1000} } }, transform: [ {type: "geoshape", projection: "projection"} ] }, { type: "rect", interactive: false, encode: { update: { width: {value: 243}, height: {value: 127}, fill: {value: "#fff"}, fillOpacity: {value: 0.9}, stroke: {value: "#777"}, strokeWidth: {value: 2}, cornerRadius: {value: 2}, x: {signal: "tooltip_x  5"}, y: {signal: "tooltip_y  15"} } } }, { type: "text", interactive: false, encode: { update: { x: {signal: "tooltip_x"}, y: {signal: "tooltip_y"}, fontSize: {value: 14}, text: { signal: "tooltip_from" } } } }, { type: "text", interactive: false, encode: { update: { x: {signal: "tooltip_x"}, y: {signal: "tooltip_y + 16"}, fontSize: {value: 14}, text: { signal: "tooltip_to" } } } }, { type: "text", interactive: false, encode: { update: { x: {signal: "tooltip_x"}, y: {signal: "tooltip_y + 40"}, fontSize: {value: 14}, text: { signal: "tooltip_time_of_day" } } } }, { type: "text", interactive: false, encode: { update: { x: {signal: "tooltip_x"}, y: {signal: "tooltip_y + 66"}, fontSize: {value: 14}, text: { signal: "tooltip_taxi_median" } } } }, { type: "text", interactive: false, encode: { update: { x: {signal: "tooltip_x"}, y: {signal: "tooltip_y + 82"}, fontSize: {value: 14}, text: { signal: "tooltip_citi_median" } } } }, { type: "text", interactive: false, encode: { update: { x: {signal: "tooltip_x"}, y: {signal: "tooltip_y + 106"}, fontSize: {value: 14}, text: { signal: "tooltip_message" } } } }, { type: "text", description: "hack to get the legend labels to work", interactive: false, encode: { update: { x: {value: map_width  168}, y: {value: map_height  15}, text: {value: "0%" + vega.repeat(" ", 39) + "100%"} } } } ] }; var vega_opts = { loader: vega.loader(), logLevel: vega.Warn, renderer: 'canvas' }; var view = new vega.View(vega.parse(vega_spec), vega_opts). initialize('#nyctaxizonesmap'). hover(). run(); $("#nyctaxivscitiselectcontainer").show(); $(".maphoverinstructions").show(); $("#nyctaxivscitiselect").on("change", function() { view.signal("selected_location", $(this).val()).run(); set_title(); }); $("input[name='nyctaxivscititime']").on("change", function() { view.signal("time_of_day", $(this).val()).run(); set_title(); }); $(".williamsburglink").on("click", function() { $("#nyctaxivscitiselect").val("Williamsburg (North Side)"); view.signal("selected_location", "Williamsburg (North Side)").run(); set_title(); }); var $title = $(".taxivscitimaptitle"); var set_title = function() { var new_title = "From " + $("#nyctaxivscitiselect").val() + ", weekdays " + $("input[name='nyctaxivscititime']:checked").val(); $title.html(new_title); }; if (desktop) { $("#nyctaxizonesmap").on("click", function() { vega_val = $("select[name='selected_location']").val(); outer_select = $("#nyctaxivscitiselect"); outer_val = outer_select.val(); if (vega_val !== outer_val) { outer_select.val(vega_val); set_title(); } }); } });
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Category: R  Todd W. Schneider. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Data.Table by Example – Part 1
(This article was first published on R – Mathew Analytics, and kindly contributed to Rbloggers)
For many years, I actively avoided the data.table package and preferred to utilize the tools available in either base R or dplyr for data aggregation and exploration. However, over the past year, I have come to realize that this was a mistake. Data tables are incredible and provide R users with a syntatically
concise and efficient data structure for working with small, medium, or large datasets. While the package is well documented, I wanted to put together a series of posts that could be useful for those who want to get introduced to the data.table package in a more task oriented format.
For this series of posts, I will be working with data that comes from the Chicago Police Department’s Citizen Law Enforcement Analysis and Reporting system. This dataset contains information on reported incidents of crime that occured in the city of Chicago from 2001 to present. You can use the wget command in the terminal to export it as a csv file.
$ wget –nocheckcertificate –progress=dot https://data.cityofchicago.org/api/views/ijzpq8t2/rows.csv?accessType=DOWNLOAD > rows.csvThis file can be found in your working directory and was saved as “rows.csv”. We will import the data into R with the fread function and look at the first few rows and structure of the data.
dat = fread("rows.csv") dat[1:3] str(dat)Notice that the each of the string variables in the data set was imported as a character and not a factor. With base R functions like read.csv, we have to set the stringsAsFactors argument to TRUE if we want this result.
names(dat) < gsub(" ", "_", names(dat))Let’s say that we want to see the frequency distribution of several of these variables. This can be done by using .N in conjunction with the by operator.
dat[, .N, by=.(Arrest)]In the code below, you can also see how to chain operations togther. We started by finding the count of each response in the variable, ordered the count in descending order, and then selected only those which occured more than 200,000 times.
dat[, .N, by=.(Primary_Type)][order(N)][N>=200000] dat[, .N, by=.(Location_Description)][order(N)][N>=200000]Let’s say that we want to get a count of prostitution incidents by month. To get the desired results, we will need to modify the date values, filter instances in which the primary type was “prostitution”, and then get a count by each date.
dat[, date2 := paste(substr(as.Date(dat$Date, format="%m/%d/%Y"),1,7), "01", sep="")][ Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)][]If you want to plot the results as a line graph, just add another chain which executes the visualization.
dat[, date2 := paste(substr(as.Date(dat$Date, format="%m/%d/%Y"),1,7), "01", sep="")][ Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)] %>% ggplot(., aes(date2, N)) + geom_line(group=1)I’ve obviously skipped over a lot and some of the code presented here is more verbose than needed. Even so, beginners to R will hopefully find this useful and it will pique your interest in the beauty of the data table package. Future posts will cover more of the goodies available in the data.table package such as get(), set(), {}, and so forth.
PS: As of this past week, I’ve decided to reenter the job market. So if you are looking for a statistical analyst or data scientist with strong experience in programming in R and time series econometrics, please contact me at mathewanalytics@gmail.com or reach me through LinkedIn
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Mathew Analytics. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
How to Create an Interactive Infographic
(This article was first published on R – Displayr, and kindly contributed to Rbloggers)
An interactive infographic can be used to communicate a lot of information in an engaging way. With the right tools, they are also relatively straightforward to create. In this post, I show stepbystep how to create this interactive infographic, using Canva, Displayr and R code. The interactive example is designed so that the user can change the country and have the infographic update automatically.
Tools used to create an interactive infographic: Canva is used to create the base infographic. The calculations, charting, and automatic textwriting are performed using the R language. It is all hooked up with Displayr.
Step 1: Create or download the infographicI start by going to Canva, and choosing the Neat Interactive Gaming Infographic (tip: use Find Templates on the lefthand panel). You could, of course, design your own infographic, either in Canva or elsewhere. I like Canva, but the key thing is to create an infographic image some way. In Canva, I edited the template by deleting the bits that I wanted to replace with interactive charts and visualizations and then I download the infographic as a PNG file (2,000 pixels high by 800 wide).
Step 2: Import the infographic into Displayr
Create an account in Displayr, and then click the button that says + New Document. Set the page size to the same aspect ratio as the PNG file (Home > Page Layout > Layout > Page Size > Custom). For this example, the page should be 20 inches high and 8 inches wide.
Next, insert the infographic into Displayr (Insert > Image), move and resize it to fit the page (tip: you can use the Properties panel on the right to type in pixels 800 x 2000 to reset the correct aspect ratio of the image).
Step 3: Get the data into Displayr
The data that will make the infographic interactive needs to be hooked up in Displayr. The data used to create the infographic in this example is shown to the right. There are lots of ways to import data into Displayr (e.g., importing a raw data file and creating the tables in Displayr). For this example, the data has been pasted into Displayr from Excel using the steps below.
To paste the data into Displayr:
 Insert > Paste Table (Data), click Add data (on the right of the screen).
 Paste in the data. Alternatively, you could type it into this screen. I first just pasted in the Age Distribution data and press OK.
 Properties > GENERAL and type AgeDistribution into the Label field and check the Automatic option (above).
 Drag the table so that it is to the left of the infographic.
 Hide the table (select it, Appearance > Hide). It will stay visible but will be invisible when you share the infographic.
Repeat this process for AverageAge, Ratio, and Multiplayer. It is important that you give each of these tables these names, as we refer to them later in our R code.
Step 4: Add the country selectorNext, I add a control so that the user can change country:
 Insert > Control
 Item list: China, US, Europe
 Move it to the top of the screen and style as desired (font size, color, border)
 Name: Click on the control and select China
 Properties > GENERAL > Name: Country
I then insert a text box (“GAMERS”), and placed it to the left of the control (i.e.: font: Impact, size: 48, color: #ffb600).
Step 5: Create the charts and visualizations in RFinally, create the charts and visualizations in Displayr using the following R code.
The column chartI created the column chart using my colleague Carmen’s nifty wrapperfunction for plotly. Insert an R Output in Displayr (Insert > R Output), and copy and paste the following code, pressing Calculate and resizing moving and resizing the chart.
flipStandardCharts::Chart(AgeDistribution[, Country], type = "Column", background.fill.color = "#212121", charting.area.fill.color = "#212121", colors = "tiel", x.tick.font.color = "white", x.tick.font.size = 20, x.grid.width = 0, y.tick.font.color = "white", y.tick.font.size = 20, y.title = "%", y.title.font.color = "white", y.grid.width = 0) Average ageThe average age was also created by inserting an R Output, using the code below. While I could have written the formatting in R, I instead used the various formatting tools built into Displayr (Properties >LAYOUT and Properties > APPEARANCE).
AverageAge[,Country] The hearts pictographThis was also done using an R Output in Displayr, with the following code (using R GitHub packages built by my colleagues Kyle and Carmen).
women = Ratio["Women", Country] total = sum(Ratio[, Country]) flipPictographs::SinglePicto(women, total, layout = "Number of columns", number.cols = 5, image = "Heart", hide.base.image = FALSE, auto.size = TRUE, fill.icon.color = "red", base.icon.color = "cyan", background.color ="#212121" )The R code used to create the textbox is below (tip: toggle on Wrap text output at the bottom of the Properties panel on the right)
women = Ratio["Women", Country] total = sum(Ratio[, Country]) paste0(women, " IN ", total, " GAMERS ARE WOMEN") The pie chartThis is the R code for the pie chart:
flipStandardCharts::Chart(Multiplayer[, Country], type = "Pie", colors = c(rgb(175/255, 224/255, 170/255), rgb(0/255, 181/255, 180/255)), data.label.font.color = "White", data.label.font.size = 18, data.label.decimals = 0, data.label.suffix = "%")The R code used to create the textbox was:
one.player = Multiplayer["1 player", Country] paste0(one.player, "% OF PEOPLE PLAY VIDEO GAMES ON THEIR OWN") Create the interactive infographic yourselfSign in to Displayr and edit the document used to create the interactive infographic here. In Edit mode you can click on each of the charts, pictographs, and text boxes to see the underlying code. The final document was published (i.e., turned into a Dashboard) using Export > Web Page. Or you can view the interactive infographic created using the instructions above.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Regular Expression Searching within Shiny Selectize Objects
(This article was first published on Jonathan Sidi's R Blog, and kindly contributed to Rbloggers)
regexSelect is a small package that uses Shiny modules to solve a problem in Shiny selectize objects – regular expression (regex) searching. You can quickly filter the values in the selectize object, while being able to add that new regex query to the selectize list.
This is great for long lists, since you can return multiple item simultaneously without needing to endlessly click items in a list!
Install install.packages('regexSelect') #devtools::install_github('yonicd/regexSelect')Below are two examples of using regular expressions to quickly populate multiple items in a ggplot and a datatable.
regexSelect with PlotsThe shiny module works with two main functions:
# server side: callModule(module=regexSelect, id='myId', reactive(<selectizeInput Choices>)) # ui side: regexSelectUI(id = "myId", label = 'myLabel', choices = <selectizeInput Choices>)regexSelect comes with controls placed in a group checkbox below the selectize object. When calling regexSelect you can show or hide the checkbox controls (hidden by default), as to make it look like a normal selectize object, and save valuable UI real estate.
While the shiny app is running regexSelect properties can be manipulated through the checkbox controls giving greater flexibilty to:
 Toggle regexSelect to work as a standard selectize object.
 Retain the regex search as a new value the selectize object.
 Change arguments that are passed to base::grep : ignore.case, perl, fixed, invert.
To leave a comment for the author, please follow the link and comment on their blog: Jonathan Sidi's R Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
News Roundup from Microsoft Ignite
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
It's been a big day for the team here at Microsoft, with a flurry of announcements from the Ignite conference in Orlando. We'll provide more indepth details in the coming days and weeks, but for now here's a brief roundup of the news related to data science:
Microsoft ML Server 9.2 is now available. This is the new name for what used to be called Microsoft R Server, and now also includes support for operationalizing the Python language as well as R. This 2minute video explains how to deploy models with ML Server (and with this release, realtime scoring is now also supported Linux as well). Microsoft R Client 3.4.1, featuring R 3.4.1, was also released today and provides desktop capabilities for R developers with the ability to deploy computations to a production ML Server.
The next generation of Azure Machine Learning is now available. This new service includes the AML Workbench, a crossplatform client for AIpowered data wrangling and experiment management; the AML Experimentation service to help data scientists increase their rate of experimentation with big data and GPUs, and the AML Model Management service to host, version, manage and monitor machine learning models. This TechCrunch article provides an overview.
SQL Server 2017 is now generally available, bringing support for indatabase R and Python to both Windows and Linux.
Azure SQL Database now supports realtime scoring for R and Python, and a preview of general indatabase R services is now available as well.
Microsoft Cognitive Services offers new intelligent APIs for developing AI applications, including general availability of the text analytics API for topic extraction, language detection and sentiment analysis.
Visual Studio Code for AI, an extension for the popular opensource code editor, provides new interfaces for developing with Tensorflow, Cognitive Toolkit (CNTK), Azure ML and more.
Finally, Microsoft announced a new highlevel language for quantum computing to be integrated with Visual Studio, and a simulator for quantum computers up to 32 qubits. This 2minute video provides an overview of Microsoft's vision for Quantum Computing.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Custom Level Coding in vtreat
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "onehot encoding"). Level coding can be computationally and statistically preferable to onehot encoding for variables that have an extremely large number of possible levels.
Level coding is like measurement: it summarizes categories of individuals into useful numbers. Source: USGS
By default, vtreat level codes to the difference between the conditional means and the grand mean (catN variables) when the outcome is numeric, and to the difference between the conditional loglikelihood and global loglikelihood of the target class (catB variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the ranger package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for treebased methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by vtreat‘s coding. This often isn’t a problem — but sometimes, it may be.
So the data scientist may want to use a level coding different from what vtreat defaults to. In this article, we will demonstrate how to implement custom level encoders in vtreat. We assume you are familiar with the basics of vtreat: the types of derived variables, how to create and apply a treatment plan, etc.
For our example, we will implement level coders based on partial pooling, or hierarchical/multilevel models (Gelman and Hill, 2007). We’ll leave the details of how partial pooling works to a subsequent article; for now, just think of it as a score that shrinks the estimate of the conditional mean to be closer to the unconditioned mean, and hence possibly closer to the unknown true values, when there are too few measurements to make an accurate estimate.
We’ll implement our partial pooling encoders using the lmer() (multilevel linear regression) and glmer() (multilevel generalized linear regression) functions from the lme4 package. For our example data, we’ll use radon levels by county for the state of Minnesota (Gelman and Hill, 2007. You can find the original data here).
The Data: Radon levels in Minnesota library("vtreat") library("lme4") library("dplyr") library("tidyr") library("ggplot2") # example data srrs = read.table("srrs2.dat", header=TRUE, sep=",", stringsAsFactor=FALSE) # target: log of radon activity (activity) # grouping variable: county radonMN = filter(srrs, state=="MN") %>% select("county", "activity") %>% filter(activity > 0) %>% mutate(activity = log(activity), county = base::trimws(county)) %>% mutate(critical = activity>1.5) str(radonMN) ## 'data.frame': 916 obs. of 3 variables: ## $ county : chr "AITKIN" "AITKIN" "AITKIN" "AITKIN" ... ## $ activity: num 0.788 0.788 1.065 0 1.131 ... ## $ critical: logi FALSE FALSE FALSE FALSE FALSE FALSE ...For this example we have three columns of interest:
 county: 85 possible values
 activity: the log of the radon reading (numerical outcome)
 critical: TRUE when activity > 1.5 (categorical outcome)
The goal is to level code county for either the regression problem (predict the log radon reading) or the categorization problem (predict whether the radon level is "critical").
As the graph shows, the conditional mean of log radon activity by county ranges from nearly zero to about 3, and the conditional expectation of a critical reading ranges from zero to one. On the other hand, the number of readings per county is quite low for many counties — only one or two — though some counties have a large number of readings. That means some of the conditional expectations are quite uncertain.
Implementing Level Coders for Partial PoolingLet’s implement level coders that use partial pooling to compute the level score.
Regression
For regression problems, the custom coder should be a function that takes as input:
 v: a string with the name of the categorical variable
 vcol: the actual categorical column (assumed character)
 y: the numerical outcome column
 weights: a column of row weights
The function should return a column of scores (the level codings). For our example, the function builds a lmer model to predict y as a function of vcol, then returns the predictions on the training data.
# @param v character variable name # @param vcol character, independent or input variable # @param y numeric, dependent or outcome variable to predict # @param weights row/example weights # @return scored training data column ppCoderN < function(v, vcol, y, weights) { # regression case y ~ vcol d < data.frame(x = vcol, y = y, stringsAsFactors = FALSE) m < lmer(y ~ (1  x), data=d, weights=weights) predict(m, newdata=d) }Categorization
For categorization problems, the function should assume that y is a logical column, where TRUE is assumed to be the target outcome. This is because vtreat converts the outcome column to a logical while creating the treatment plan.
# @param v character variable name # @param vcol character, independent or input variable # @param y logical, dependent or outcome variable to predict # @param weights row/example weights # @return scored training data column ppCoderC < function(v, vcol, y, weights) { # classification case y ~ vcol d < data.frame(x = vcol, y = y, stringsAsFactors = FALSE) m = glmer(y ~ (1  x), data=d, weights=weights, family=binomial) predict(m, newdata=d, type='link') }You can then pass the functions in as a named list into either designTreatmentsX or mkCrossFrameXExperiment to build the treatment plan. The format of the key is [nc].levelName[.option]*.
The prefacing picks the model type: numeric or regression starts with ‘n.’ and the categorical encoder starts with ‘c.’. Currently, the only supported option is ‘center,’ which directs vtreat to center the codes with respect to the estimated grand mean. ThecatN and catB level codings are centered in this way.
Our example coders can be passed in as shown below.
customCoders = list('n.poolN.center' = ppCoderN, 'c.poolC.center' = ppCoderC) Using the Custom CodersLet’s build a treatment plan for the regression problem.
# I only want to create the cleaned numeric variables, the isBAD variables, # and the level codings (not the indicator variables or catP, etc.) vartypes_I_want = c('clean', 'isBAD', 'catN', 'poolN') treatplanN = designTreatmentsN(radonMN, varlist = c('county'), outcomename = 'activity', codeRestriction = vartypes_I_want, customCoders = customCoders, verbose=FALSE) scoreFrame = treatplanN$scoreFrame scoreFrame %>% select(varName, sig, origName, code) ## varName sig origName code ## 1 county_poolN 1.343072e16 county poolN ## 2 county_catN 2.050811e16 county catNNote that the treatment plan returned both the catN variable (default level encoding) and the pooled level encoding (poolN). You can restrict to just using one coding or the other using the codeRestriction argument either during treatment plan creation, or in prepare().
Let’s compare the two level encodings.
# create a frame with one row for every county, measframe = data.frame(county = unique(radonMN$county), stringsAsFactors=FALSE) outframe = prepare(treatplanN, measframe) # If we wanted only the new pooled level coding, # (plus any numeric/isBAD variables), we would # use a codeRestriction: # # outframe = prepare(treatplanN, # measframe, # codeRestriction = c('clean', 'isBAD', 'poolN')) gather(outframe, key=scoreType, value=score, county_poolN, county_catN) %>% ggplot(aes(x=score)) + geom_density(adjust=0.5) + geom_rug(sides="b") + facet_wrap(~scoreType, ncol=1, scale="free_y") + ggtitle("Distribution of scores")Notice that the poolN scores are "tucked in" compared to the catN encoding. In a later article, we’ll show that the counties with the most tucking in (or shrinkage) tend to be those with fewer measurements.
We can also code for the categorical problem.
# For categorical problems, coding is catB vartypes_I_want = c('clean', 'isBAD', 'catB', 'poolC') treatplanC = designTreatmentsC(radonMN, varlist = c('county'), outcomename = 'critical', outcometarget= TRUE, codeRestriction = vartypes_I_want, customCoders = customCoders, verbose=FALSE) outframe = prepare(treatplanC, measframe) gather(outframe, key=scoreType, value=linkscore, county_poolC, county_catB) %>% ggplot(aes(x=linkscore)) + geom_density(adjust=0.5) + geom_rug(sides="b") + facet_wrap(~scoreType, ncol=1, scale="free_y") + ggtitle("Distribution of link scores")Notice that the poolC link scores are even more tucked in compared to the catB link scores, and that the catB scores are multimodal. The smaller link scores mean that the pooled model avoids estimates of conditional expectation close to either zero or one, because, again, these estimates come from counties with few readings. Multimodal summaries can be evidence of modeling flaws, including omitted variables and unmodeled mixing of different example classes. Hence, we do not want our inference procedure to suggest such structure until there is a lot of evidence for it. And, as is common in machine learning, there are advantages to lowervariance estimators when they do not cost much in terms of bias.
Other ConsiderationsFor this example, we used the lme4 package to create custom level codings. Once calculated, vtreat stores the coding as a lookup table in the treatment plan. This means lme4 is not needed to prepare new data. In general, using a treatment plan is not dependent on any special packages that might have been used to create it, so it can be shared with other users with no extra dependencies.
When using mkCrossFrameXExperiment, note that the resulting cross frame will have a slightly different distribution of scores than what the treatment plan produces. This is true even for catB and catN variables. This is because the treatment plan is built using all the data, while the cross frame is built using nfold cross validation on the data. See the cross frame vignette for more details.
Thanks to Geoffrey Simmons, Principal Data Scientist at Echo Global Logistics, for suggesting partial pooling based level coding (and testing it for us!), introducing us to the references, and reviewing our articles.
In a followup article, we will go into partial pooling in more detail, and motivate why you might sometimes prefer it to vtreat‘s default coding.
ReferencesGelman, Andrew and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Survival Analysis with R
(This article was first published on R Views, and kindly contributed to Rbloggers)
With roots dating back to at least 1662 when John Graunt, a London merchant, published an extensive set of inferences based on mortality records, survival analysis is one of the oldest subfields of Statistics [1]. Basic lifetable methods, including techniques for dealing with censored data, were discovered before 1700 [2], and in the early eighteenth century, the old masters – de Moivre working on annuities, and Daniel Bernoulli studying competing risks for the analysis of smallpox inoculation – developed the modern foundations of the field [2]. Today, survival analysis models are important in Engineering, Insurance, Marketing, Medicine, and many more application areas. So, it is not surprising that R should be rich in survival analysis functions. CRAN’s Survival Analysis Task View, a curated list of the best relevant R survival analysis packages and functions, is indeed formidable. We all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.
Looking at the Task View on a small screen, however, is a bit like standing too close to a brick wall – leftright, updown, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As wellorganized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. Newcomers – people either new to R or new to survival analysis or both – must find it overwhelming. So, it is with newcomers in mind that I offer the following narrow trajectory through the task view that relies on just a few packages: survival, ggplot2, ggfortify, and ranger
The survival package is the cornerstone of the entire R survival analysis edifice. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R. Dr. Terry Therneau, the package author, began working on the survival package in 1986. The first public release, in late 1989, used the Statlib service hosted by Carnegie Mellon University. Thereafter, the package was incorporated directly into Splus, and subsequently into R.
ggfortify enables producing handsome, oneline survival plots with ggplot2::autoplot.
ranger might be the surprise in my very short list of survival packages. The ranger() function is wellknown for being a fast implementation of the Random Forests algorithm for building ensembles of classification and regression trees. But ranger() also works with survival data. Benchmarks indicate that ranger() is suitable for building timetoevent models with the large, highdimensional data sets important to internet marketing applications. Since ranger() uses standard Surv() survival objects, it’s an ideal tool for getting acquainted with survival analysis in this machinelearning age.
Load the dataThis first block of code loads the required packages, along with the veteran dataset from the survival package that contains data from a twotreatment, randomized trial for lung cancer.
library(survival) library(ranger) library(ggplot2) library(dplyr) library(ggfortify) # data(veteran) head(veteran) ## trt celltype time status karno diagtime age prior ## 1 1 squamous 72 1 60 7 69 0 ## 2 1 squamous 411 1 70 5 64 10 ## 3 1 squamous 228 1 60 3 38 0 ## 4 1 squamous 126 1 60 9 63 10 ## 5 1 squamous 118 1 70 11 65 10 ## 6 1 squamous 10 1 20 5 49 0The variables in veteran are: * trt: 1=standard 2=test * celltype: 1=squamous, 2=small cell, 3=adeno, 4=large * time: survival time in days * status: censoring status * karno: Karnofsky performance score (100=good) * diagtime: months from diagnosis to randomization * age: in years * prior: prior therapy 0=no, 10=yes
Kaplan Meier AnalysisThe first thing to do is to use Surv() to build the standard survival object. The variable time records survival time; status indicates whether the patient’s death was observed (status = 1) or that survival time was censored (status = 0). Note that a “+” after the time in the print out of km indicates censoring.
# Kaplan Meier Survival Curve km < with(veteran, Surv(time, status)) head(km,80) ## [1] 72 411 228 126 118 10 82 110 314 100+ 42 8 144 25+ ## [15] 11 30 384 4 54 13 123+ 97+ 153 59 117 16 151 22 ## [29] 56 21 18 139 20 31 52 287 18 51 122 27 54 7 ## [43] 63 392 10 8 92 35 117 132 12 162 3 95 177 162 ## [57] 216 553 278 12 260 200 156 182+ 143 105 103 250 100 999 ## [71] 112 87+ 231+ 242 991 111 1 587 389 33To begin our analysis, we use the formula Surv(futime, status) ~ 1 and the survfit() function to produce the KaplanMeier estimates of the probability of survival over time. The times parameter of the summary() function gives some control over which times to print. Here, it is set to print the estimates for 1, 30, 60 and 90 days, and then every 90 days thereafter. This is the simplest possible model. It only takes three lines of R code to fit it, and produce numerical and graphical summaries.
km_fit < survfit(Surv(time, status) ~ 1, data=veteran) summary(km_fit, times = c(1,30,60,90*(1:10))) ## Call: survfit(formula = Surv(time, status) ~ 1, data = veteran) ## ## time n.risk n.event survival std.err lower 95% CI upper 95% CI ## 1 137 2 0.985 0.0102 0.96552 1.0000 ## 30 97 39 0.700 0.0392 0.62774 0.7816 ## 60 73 22 0.538 0.0427 0.46070 0.6288 ## 90 62 10 0.464 0.0428 0.38731 0.5560 ## 180 27 30 0.222 0.0369 0.16066 0.3079 ## 270 16 9 0.144 0.0319 0.09338 0.2223 ## 360 10 6 0.090 0.0265 0.05061 0.1602 ## 450 5 5 0.045 0.0194 0.01931 0.1049 ## 540 4 1 0.036 0.0175 0.01389 0.0934 ## 630 2 2 0.018 0.0126 0.00459 0.0707 ## 720 2 0 0.018 0.0126 0.00459 0.0707 ## 810 2 0 0.018 0.0126 0.00459 0.0707 ## 900 2 0 0.018 0.0126 0.00459 0.0707 #plot(km_fit, xlab="Days", main = 'Kaplan Meyer Plot') #base graphics is always ready autoplot(km_fit)Next, we look at survival curves by treatment.
km_trt_fit < survfit(Surv(time, status) ~ trt, data=veteran) autoplot(km_trt_fit)And, to show one more small exploratory plot, I’ll do just a little data munging to look at survival by age. First, I create a new data frame with a categorical variable AG that has values LT60 and GT60, which respectively describe veterans younger and older than sixty. While I am at it, I make trt and prior into factor variables. But note, survfit() and npsurv() worked just fine without this refinement.
vet < mutate(veteran, AG = ifelse((age < 60), "LT60", "OV60"), AG = factor(AG), trt = factor(trt,labels=c("standard","test")), prior = factor(prior,labels=c("N0","Yes"))) km_AG_fit < survfit(Surv(time, status) ~ AG, data=vet) autoplot(km_AG_fit)Although the two curves appear to overlap in the first fifty days, younger patients clearly have a better chance of surviving more than a year.
Cox Proportional Hazards ModelNext, I’ll fit a Cox Proportional Hazards Model that makes use of all of the covariates in the data set.
# Fit Cox Model cox < coxph(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior , data = vet) summary(cox) ## Call: ## coxph(formula = Surv(time, status) ~ trt + celltype + karno + ## diagtime + age + prior, data = vet) ## ## n= 137, number of events= 128 ## ## coef exp(coef) se(coef) z Pr(>z) ## trttest 2.946e01 1.343e+00 2.075e01 1.419 0.15577 ## celltypesmallcell 8.616e01 2.367e+00 2.753e01 3.130 0.00175 ** ## celltypeadeno 1.196e+00 3.307e+00 3.009e01 3.975 7.05e05 *** ## celltypelarge 4.013e01 1.494e+00 2.827e01 1.420 0.15574 ## karno 3.282e02 9.677e01 5.508e03 5.958 2.55e09 *** ## diagtime 8.132e05 1.000e+00 9.136e03 0.009 0.99290 ## age 8.706e03 9.913e01 9.300e03 0.936 0.34920 ## priorYes 7.159e02 1.074e+00 2.323e01 0.308 0.75794 ##  ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## exp(coef) exp(coef) lower .95 upper .95 ## trttest 1.3426 0.7448 0.8939 2.0166 ## celltypesmallcell 2.3669 0.4225 1.3799 4.0597 ## celltypeadeno 3.3071 0.3024 1.8336 5.9647 ## celltypelarge 1.4938 0.6695 0.8583 2.5996 ## karno 0.9677 1.0334 0.9573 0.9782 ## diagtime 1.0001 0.9999 0.9823 1.0182 ## age 0.9913 1.0087 0.9734 1.0096 ## priorYes 1.0742 0.9309 0.6813 1.6937 ## ## Concordance= 0.736 (se = 0.03 ) ## Rsquare= 0.364 (max possible= 0.999 ) ## Likelihood ratio test= 62.1 on 8 df, p=1.799e10 ## Wald test = 62.37 on 8 df, p=1.596e10 ## Score (logrank) test = 66.74 on 8 df, p=2.186e11 cox_fit < survfit(cox) #plot(cox_fit, main = "cph model", xlab="Days") autoplot(cox_fit)Note that the model flags small cell type, adeno cell type and karno as significant. However, some caution needs to be exercised in interpreting these results. While the Cox Proportional Hazard’s model is thought to be “robust”, a careful analysis would check the assumptions underlying the model. For example, the Cox model assumes that the covariates do not vary with time. In a vignette [12] that accompanies the survival package Therneau, Crowson and Atkinson demonstrate that the Karnofsky score (karno) is, in fact, timedependent so the assumptions for the Cox model are not met. The vignette authors go on to present a strategy for dealing with time dependent covariates.
Data scientists who are accustomed to computing ROC curves to assess model performance should be interested in the Concordance statistic. The documentation for the survConcordance() function in the survival package defines concordance as “the probability of agreement for any two randomly chosen observations, where in this case agreement means that the observation with the shorter survival time of the two also has the larger risk score. The predictor (or risk score) will often be the result of a Cox model or other regression” and notes that: “For continuous covariates concordance is equivalent to Kendall’s tau, and for logistic regression is is equivalent to the area under the ROC curve.”
To demonstrate using the survival package, along with ggplot2 and ggfortify, I’ll fit Aalen’s additive regression model for censored data to the veteran data. The documentation states: “The Aalen model assumes that the cumulative hazard H(t) for a subject can be expressed as a(t) + X B(t), where a(t) is a timedependent intercept term, X is the vector of covariates for the subject (possibly timedependent), and B(t) is a timedependent matrix of coefficients.”
The plots show how the effects of the covariates change over time. Notice the steep slope and then abrupt change in slope of karno.
aa_fit <aareg(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior , data = vet) aa_fit ## Call: ## aareg(formula = Surv(time, status) ~ trt + celltype + karno + ## diagtime + age + prior, data = vet) ## ## n= 137 ## 75 out of 97 unique event times used ## ## slope coef se(coef) z p ## Intercept 0.083400 3.81e02 1.09e02 3.490 4.79e04 ## trttest 0.006730 2.49e03 2.58e03 0.967 3.34e01 ## celltypesmallcell 0.015000 7.30e03 3.38e03 2.160 3.09e02 ## celltypeadeno 0.018400 1.03e02 4.20e03 2.450 1.42e02 ## celltypelarge 0.001090 6.21e04 2.71e03 0.229 8.19e01 ## karno 0.001180 4.37e04 8.77e05 4.980 6.28e07 ## diagtime 0.000243 4.92e05 1.64e04 0.300 7.65e01 ## age 0.000246 6.27e05 1.28e04 0.491 6.23e01 ## priorYes 0.003300 1.54e03 2.86e03 0.539 5.90e01 ## ## Chisq=41.62 on 8 df, p=1.6e06; test weights=aalen #summary(aa_fit) # provides a more complete summary of results autoplot(aa_fit) Random Forests ModelAs a final example of what some might perceive as a datasciencelike way to do timetoevent modeling, I’ll use the ranger() function to fit a Random Forests Ensemble model to the data. Note however, that there is nothing new about building tree models of survival data. Terry Therneau also wrote the rpart package, R’s basic treemodeling package, along with Brian Ripley. See section 8.4 for the rpart vignette [14] that contains a survival analysis example.
ranger() builds a model for each observation in the data set. The next block of code builds the model using the same variables used in the Cox model above, and plots twenty random curves, along with a curve that represents the global average for all of the patients. Note that I am using plain old base R graphics here.
# ranger model r_fit < ranger(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior, data = vet, mtry = 4, importance = "permutation", splitrule = "extratrees", verbose = TRUE) # Average the survival models death_times < r_fit$unique.death.times surv_prob < data.frame(r_fit$survival) avg_prob < sapply(surv_prob,mean) # Plot the survival models for each patient plot(r_fit$unique.death.times,r_fit$survival[1,], type = "l", ylim = c(0,1), col = "red", xlab = "Days", ylab = "survival", main = "Patient Survival Curves") # cols < colors() for (n in sample(c(2:dim(vet)[1]), 20)){ lines(r_fit$unique.death.times, r_fit$survival[n,], type = "l", col = cols[n]) } lines(death_times, avg_prob, lwd = 2) legend(500, 0.7, legend = c('Average = black'))The next block of code illustrates how ranger() ranks variable importance.
vi < data.frame(sort(round(r_fit$variable.importance, 4), decreasing = TRUE)) names(vi) < "importance" head(vi) ## importance ## karno 0.0734 ## celltype 0.0338 ## diagtime 0.0003 ## trt 0.0007 ## prior 0.0011 ## age 0.0027Notice that ranger() flags karno and celltype as the two most important; the same variables with the smallest pvalues in the Cox model. Also note that the importance results just give variable names and not level names. This is because ranger and other tree models do not usually create dummy variables.
But ranger() does compute Harrell’s cindex (See [8] p. 370 for the definition), which is similar to the Concordance statistic described above. This is a generalization of the ROC curve, which reduces to the WilcoxonMannWhitney statistic for binary variables, which in turn, is equivalent to computing the area under the ROC curve.
cat("Prediction Error = 1  Harrell's cindex = ", r_fit$prediction.error) ## Prediction Error = 1  Harrell's cindex = 0.308269An ROC value of .68 would normally be pretty good for a first try. But note that the ranger model doesn’t do anything to address the time varying coefficients. This apparently is a challenge. In a 2011 paper [16], Hamad observes:
However, in the context of survival trees, a further difficulty arises when time–varying effects are included. Hence, we feel that the interpretation of covariate effects with tree ensembles in general is still mainly unsolved and should attract future research.
I believe that the major use for treebased models for survival data will be to deal with very large data sets.
Finally, to provide an “eyeball comparison” of the three survival curves, I’ll plot them on the same graph.The following code pulls out the survival data from the three model objects and puts them into a data frame for ggplot().
# Set up for ggplot kmi < rep("KM",length(km_fit$time)) km_df < data.frame(km_fit$time,km_fit$surv,kmi) names(km_df) < c("Time","Surv","Model") coxi < rep("Cox",length(cox_fit$time)) cox_df < data.frame(cox_fit$time,cox_fit$surv,coxi) names(cox_df) < c("Time","Surv","Model") rfi < rep("RF",length(r_fit$unique.death.times)) rf_df < data.frame(r_fit$unique.death.times,avg_prob,rfi) names(rf_df) < c("Time","Surv","Model") plot_df < rbind(km_df,cox_df,rf_df) p < ggplot(plot_df, aes(x = Time, y = Surv, color = Model)) p + geom_line()For this data set, I would put my money on a carefully constructed Cox model that takes into account the time varying coefficients. I suspect that there are neither enough observations nor enough explanatory variables for the ranger() model to do better.
This fourpackage excursion only hints at the Survival Analysis tools that are available in R, but it does illustrate some of the richness of the R platform, which has been under continuous development and improvement for nearly twenty years. The ranger package, which suggests the survival package, and ggfortify, which depends on ggplot2 and also suggests the survival package, illustrate how opensource code allows developers to build on the work of their predecessors. The documentation that accompanies the survival package, the numerous online resources, and the statistics such as concordance and Harrell’s cindex packed into the objects produced by fitting the models gives some idea of the statistical depth that underlies almost everything R.
Some Tutorials and PapersFor a very nice, basic tutorial on survival analysis, have a look at the Survival Analysis in R [5] and the OIsurv package produced by the folks at OpenIntro.
Look here for an exposition of the Cox Proportional Hazard’s Model, and here [11] for an introduction to Aalen’s Additive Regression Model.
For an elementary treatment of evaluating the proportional hazards assumption that uses the veterans data set, see the text by Kleinbaum and Klein [13].
For an exposition of the sort of predictive survival analysis modeling that can be done with ranger, be sure to have a look at Manuel Amunategui’s post and video.
See the 1995 paper [15] by Intrator and Kooperberg for an early review of using classification and regression trees to study survival data.
ReferencesFor convenience, I have collected the references used throughout the post here.
[1] Hacking, Ian. (2006) The Emergence of Probability: A Philosophical Study of Early Ideas about Probability Induction and Statistical Inference. Cambridge University Press, 2nd ed., p. 11
[2] Andersen, P.K., Keiding, N. (1998) Survival analysis Encyclopedia of Biostatistics 6. Wiley, pp. 44524461 [3] Kaplan, E.L. & Meier, P. (1958). Nonparametric estimation from incomplete observations, J American Stats Assn. 53, pp. 457–481, 562–563. [4] Cox, D.R. (1972). Regression models and lifetables (with discussion), Journal of the Royal Statistical Society (B) 34, pp. 187–220.
[5] Diez, David. Survival Analysis in R, OpenIntro
[6] Klein, John P and Moeschberger, Melvin L. Survival Analysis Techniques for Censored and Truncated Data, Springer. (1997)
[7] Wright, Marvin & Ziegler, Andreas. (2017) ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R, JSS Vol 77, Issue 1.
[8] Harrell, Frank, Lee, Kerry & Mark, Daniel. Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Statistics in Medicine, Vol 15 (1996), pp. 361387 [9] Amunategui, Manuel. Survival Ensembles: Survival Plus Classification for Improved TimeBased Predictions in R
[10] NUS Course Notes. Chapter 3 The Cox Proportional Hazards Model
[11] Encyclopedia of Biostatistics, 2nd Edition (2005). Aalen’s Additive Regression Model [12] Therneau et al. Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model
[13] Kleinbaum, D.G. and Klein, M. Survival Analysis, A Self Learning Text Springer (2005) [14] Therneau, T and Atkinson, E. An Introduction to Recursive Partitioning Using RPART Routines
[15] Intrator, O. and Kooperberg, C. Trees and splines in survival analysis Statistical Methods in Medical Research (1995)
[16] BouHamad, I. A review of survival trees Statistics Surveys Vol.5 (2011)
Authors’s note: this post was originally published on April 26, 2017 but was subsequently withdrawn because of an error spotted by Dr. Terry Therneau. He observed that the Cox Portional Hazards Model fitted in that post did not properly account for the time varying covariates. This revised post makes use of a different data set, and points to resources for addressing time varying covariates. Many thanks to Dr. Therneau. Any errors that remain are mine.
_____='https://rviews.rstudio.com/2017/09/25/survivalanalysiswithr/';
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R Views. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Super excited for R promises
(This article was first published on Appsilon Data Science Blog, and kindly contributed to Rbloggers)
We at Appsilon are excited about RStudio introducing promises in R quite soon which is going to be a huge step forward in programming in R (we have already used futures and similar libraries to run code asynchronously, however this is going to be a standard and it looks like it’s going to be very easy to use). They support chaining which is a great way of building clean code by piping computations that take a long time.
We’ve used futures/promises/tasks in programming languages like C#, Scala, Javascript and promises have always had a big impact on the way code is structured and on the overall execution speed.
We recently spoke with Joe Chang, and he mentioned promises at the EARL conference. Here are some links if you’re interested in reading more about promises on github and medium.
What do you think? Let us know in the comments.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
eXtremely Boost your machine learning Exercises (Part1)
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
eXtreme Gradient Boosting is a machine learning model which became really popular few years ago after winning several Kaggle competitions. It is very powerful algorithm that use an ensemble of weak learners to obtain a strong learner. Its R implementation is available in xgboost package and it is really worth including into anyone’s machine learning portfolio.
This is the first part of eXtremely Boost your machine learning series. For other parts follow the tag xgboost.
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Load xgboost library and download German Credit dataset. Your goal in this tutorial will be to predict Creditability (the first column in the dataset).
Exercise 2
Convert columns c(2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20) to factors and then encode them as dummy variables. HINT: use model.matrix()
Exercise 3
Split data into training and test set 700:300. Create xgb.DMatrix for both sets with Creditability as label.
Exercise 4
Train xgboost with logistic objective and 30 rounds of training and maximal depth 2.
Exercise 5
To check model performance calculate test set classification error.
Exercise 6
Plot predictors importance.
 Create a machine learning algorithm from a beginner point of view
 Quickly dive into more advanced methods in an accessible pace and with more explanations
 And much more
This course shows a complete workflow start to finish. It is a great introduction and fallback when you have some experience.
Exercise 7
Use xgb.train() instead of xgboost() to add both train and test sets as a watchlist. Train model with same parameters, but 100 rounds to see how it performs during training.
Exercise 8
Train model again adding AUC and Log Loss as evaluation metrices.
Exercise 9
Plot how AUC and Log Loss for train and test sets was changing during training process. Use plotting function/library of your choice.
Exercise 10
Check how setting parameter eta to 0.01 influences the AUC and Log Loss curves.
 Model Evaluation 2
 How to prepare and apply machine learning to your dataset
 Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part5)
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: Rexercises. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RcppGSL 0.3.3
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A maintenance update RcppGSL 0.3.3 is now on CRAN. It switched the vignette to the our new pinp package and its twocolumn pdf default.
The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package.
No userfacing new code or features were added. The NEWS file entries follow below:
Changes in version 0.3.3 (20170924)
We also check for gslconfig at package load.

The vignette now uses the pinp package in twocolumn mode.

Minor other fixes to package and testing infrastructure.
Courtesy of CRANberries, a summary of changes to the most recent release is available.
More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RcppCNPy 0.2.7
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A new version of the RcppCNPy package arrived on CRAN yesterday.
RcppCNPy provides R with read and write access to NumPy files thanks to the cnpy library by Carl Rogers.
This version updates internals for function registration, but otherwise mostly switches the vignette over to the shiny new pinp twopage template and package.
Changes in version 0.2.7 (20170922)
Vignette updated to Rmd and use of pinp package

File src/init.c added for dynamic registration
CRANberries also provides a diffstat report for the latest release. As always, feedback is welcome and the best place to start a discussion may be the GitHub issue tickets page.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RcppClassic 0.9.8
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A bugfix release RcppClassic 0.9.8 for the very recent 0.9.7 release which fixes a build issue on macOS introduced in 0.9.7. No other changes.
Courtesy of CRANberries, there are changes relative to the previous release.
Questions, comments etc should go to the rcppdevel mailing list off the RForge page.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Upcoming data preparation and modeling article series
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN.
vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects.
vtreat handles, in a statistically sound fashion:
 Missing values.
 Encoding of categorical values for regularized inference and machine learning techniques.
 Categorical variables with very many values.
 Novel categorical values (that is values not seen during training).
 Variable pruning.
 yaware scaling.
 Structured crossvalidation.
 Mitigating nested model bias.
In our (biased) opinion opinion vtreat has the best methodology and documentation for these important data cleaning and preparation steps. vtreat‘s current public opensource implementation is for inmemory R analysis (we are considering ports and certifying ports of the package some time in the future, possibly for: data.table, Spark, Python/Pandas, and SQL).
vtreat brings a lot of power, sophistication, and convenience to your analyses, without a lot of trouble.
A new feature of vtreat version 0.6.0 is called “custom coders.” WinVector LLC‘s Dr. Nina Zumel is going to start a short article series to show how this new interface can be used to extend vtreat methodology to include the very powerful method of partial pooled inference (a term she will spend some time clearly defining and explaining). Time permitting, we may continue with articles on other applications of custom coding including: ordinal/faithful coders, monotone coders, unimodal coders, and setvalued coders.
Please help us share and promote this article series, which should start in a couple of days. This should be a fun chance to share very powerful methods with your colleagues.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...